In [38]:
import pandas as pd

In [39]:
df = pd.read_csv('kc_house_data.csv')

### Understanding the 'grade' property in the dataset

Taking a subset of our data:

In [40]:
df_1 = df[['grade', 'price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors']]
df_1

,grade,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors
0,7,221900.0,3,1.00,1180,5650,1.0
1,7,538000.0,3,2.25,2570,7242,2.0
2,6,180000.0,2,1.00,770,10000,1.0
3,7,604000.0,4,3.00,1960,5000,1.0
4,8,510000.0,3,2.00,1680,8080,1.0
...,...,...,...,...,...,...,...
21608,8,360000.0,3,2.50,1530,1131,3.0
21609,8,400000.0,4,2.50,2310,5813,2.0
21610,7,402101.0,2,0.75,1020,1350,2.0
21611,8,400000.0,3,2.50,1600,2388,2.0


In [41]:
agg_properties = {'price': 'mean', 
'bedrooms': ['median'], 
'bathrooms': ['median'], 
'sqft_living': ['mean'],
'sqft_lot' : ['mean'], 
'floors': ['median']}

df_1.groupby('grade').agg(agg_properties).round(2)

grade,price (mean),bedrooms (median),bathrooms (median),sqft_living (mean),sqft_lot (mean),floors (median)
1,142000.0,0.0,0.0,290.0,20875.0,1.0
3,205666.67,1.0,0.0,596.67,26953.0,1.0
4,214381.03,2.0,1.0,660.48,22101.48,1.0
5,248523.97,2.0,1.0,983.33,24019.91,1.0
6,301916.57,3.0,1.0,1191.56,12646.95,1.0
7,402593.32,3.0,1.75,1689.4,11766.44,1.0
8,542895.5,3.0,2.5,2184.75,13510.19,2.0
9,773738.22,4.0,2.5,2868.14,20638.52,2.0
10,1072347.47,4.0,2.75,3520.3,28191.06,2.0
11,1497792.38,4.0,3.5,4395.45,38372.79,2.0
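
The two-level column header above comes from passing lists of functions to agg(). As a minimal sketch, assuming we only want one statistic per column, pandas named aggregation produces flat, self-describing column names instead. The toy frame and the output names (mean_price and so on) are invented for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy stand-in for df_1; the values are invented for illustration only.
toy = pd.DataFrame({
    'grade': [6, 6, 7, 7, 8],
    'price': [200000.0, 300000.0, 400000.0, 420000.0, 550000.0],
    'bedrooms': [2, 3, 3, 3, 4],
    'sqft_living': [900, 1100, 1600, 1700, 2200],
})

# Named aggregation: output_column=(input_column, function).
summary = toy.groupby('grade').agg(
    mean_price=('price', 'mean'),
    median_bedrooms=('bedrooms', 'median'),
    mean_sqft_living=('sqft_living', 'mean'),
).round(2)

print(summary)
```

The same pattern should carry over to df_1 by listing the remaining columns in the same way.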


The aggregation above shows how grade relates to the other properties. Mean price and mean living area rise with every step in grade, and the median number of bedrooms, bathrooms and floors never decreases as grade rises. The relationship with lot area is less clear: in the table, mean lot area is lowest around grades 6 to 8 and larger at both ends. However, while this tells us how grade relates to the other columns, it doesn't answer the question: what is grade? Grade could be part of a ranking system given to each house based on a combination of its other qualities, e.g. price and living area. It could also be a distinct property, such as the quality of construction materials or finishings, which is simply positively correlated with the remaining qualities.

Knowing that grade is positively correlated with price, living area and the number of rooms and floors is useful when we have limited time for analysis and want to know, for example, whether houses have become bigger and more expensive over time or in a certain area. Before delving into each of those variables separately, grade can give an overview and help us decide whether further analysis is warranted.

Although all the mentioned values are positively correlated with grade, the strength of the correlation may differ. We can quantify it with the DataFrame corr() method, which computes pairwise Pearson correlation coefficients by default.

In [42]:
df_1.corr().loc[['grade']]

,grade,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors
grade,1.0,0.667463,0.356967,0.664983,0.762704,0.113621,0.458183


This confirms our previous conclusion, and additionally shows that grade is most strongly correlated with living area (0.76), price (0.67) and bathrooms (0.66). It is less strongly correlated with floors (0.46) and bedrooms (0.36). As observed above, the relationship with lot area is the weakest (0.11).
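
To read the ranking directly rather than scanning a single wide row, the grade column of the correlation matrix can be sorted. The following is a small sketch on an invented toy frame (the numbers are not from the dataset); the same chain of calls should apply to df_1.

```python
import pandas as pd

# Toy data, invented for illustration: living area tracks grade closely,
# lot area does not.
toy = pd.DataFrame({
    'grade': [5, 6, 7, 8, 9, 10],
    'sqft_living': [900, 1200, 1700, 2200, 2900, 3500],
    'sqft_lot': [9000, 4000, 12000, 5000, 7000, 10000],
})

# Take grade's column of the correlation matrix, drop its correlation
# with itself, and sort from strongest to weakest.
grade_corr = toy.corr()['grade'].drop('grade').sort_values(ascending=False)

print(grade_corr)
```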

### Analyzing by geographical data

I will try to determine if house grade within similar geographic clusters is relatively consistent. This can help us understand whether this is a stratified community or one where different social classes exist side by side. This type of analysis can also be useful when comparing cities to understand the degree to which they are segregated by socioeconomic class.

I will begin the analysis by retrieving the IDs of the most and least expensive houses.

In [43]:
max_price_id = df.loc[df['price'] == df['price'].max(), ['id', 'price']].iloc[0, 0]
max_price_id

6762700020

In [44]:
min_price_id = df.loc[df['price'] == df['price'].min(), ['id', 'price']].iloc[0, 0]
min_price_id

3421079032
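
An alternative way to fetch these IDs, sketched here on an invented toy frame, is to use idxmax() and idxmin(), which return the index label of the first row holding the maximum or minimum. This assumes, as in the cells above, that taking the first match is acceptable when several houses share the extreme price.

```python
import pandas as pd

# Toy stand-in for df; ids and prices are invented for illustration.
toy = pd.DataFrame({
    'id': [101, 102, 103, 104],
    'price': [350000.0, 7700000.0, 75000.0, 540000.0],
})

# idxmax/idxmin give the row label of the first extreme value,
# which .loc then uses to look up the id.
max_price_id = toy.loc[toy['price'].idxmax(), 'id']
min_price_id = toy.loc[toy['price'].idxmin(), 'id']

print(max_price_id, min_price_id)
```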

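Then I will define the distance functions. calc_distance takes two [latitude, longitude] pairs in degrees and returns the great-circle distance in kilometres, using the spherical law of cosines with an Earth radius of 6371 km. distance_between looks up two houses by ID and returns the distance between them.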

In [45]:
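from math import radians, sin, cos, acos

def calc_distance(loc1, loc2):
    """Great-circle distance in km between two [lat, long] pairs given in degrees."""
    # Convert into new variables so the caller's lists are not modified in place
    lat1, long1 = radians(loc1[0]), radians(loc1[1])
    lat2, long2 = radians(loc2[0]), radians(loc2[1])
    cos_angle = sin(lat1) * sin(lat2) + cos(lat1) * cos(lat2) * cos(long2 - long1)
    # Clamp to [-1, 1]: rounding can push the value just outside acos's domain
    return acos(max(-1.0, min(1.0, cos_angle))) * 6371  # 6371 km = mean Earth radius

def distance_between(id1, id2):
    """Distance in km between two houses looked up by id; None if either id is missing."""
    rows1 = df.loc[df['id'] == id1, ['lat', 'long']]
    rows2 = df.loc[df['id'] == id2, ['lat', 'long']]
    if rows1.empty or rows2.empty:
        return None
    # If an id matches more than one row, use the first match, as before
    return calc_distance(rows1.iloc[0].tolist(), rows2.iloc[0].tolist())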
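
calc_distance relies on the spherical law of cosines, which loses precision when two points are very close together because the acos argument is then almost exactly 1. A commonly used alternative is the haversine formula. The sketch below is not part of the original analysis; haversine_km is a hypothetical helper name, and it assumes the same spherical Earth of radius 6371 km.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371  # mean Earth radius, the same constant used above

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = radians(lon2 - lon1)
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    # min() guards against rounding pushing a slightly above 1
    return 2 * EARTH_RADIUS_KM * asin(sqrt(min(1.0, a)))

# One degree of longitude along the equator
print(haversine_km(0.0, 0.0, 0.0, 1.0))
```

For the distances in this notebook (kilometres to tens of kilometres) both formulas should agree closely; the difference matters mainly for points that are metres apart.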

First, I will get an overall picture of the size of the community by calculating maximum North-South and East-West distance.

In [46]:
max_lat = df['lat'].max()
min_lat = df['lat'].min()

I will assume constant longitude to calculate the North-South distance:

In [47]:
calc_distance([max_lat, -121.0], [min_lat, -121.0])

69.12988589485569

In [48]:
max_long = df['long'].max()
min_long = df['long'].min()

For calculating the East-West distance, it is necessary to use a latitude representative of the area (here 47.0 degrees), as the distance spanned by one degree of longitude shrinks going from the equator to the poles:

In [49]:
calc_distance([47.0, max_long], [47.0, min_long])

91.30414959917016
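
To make the dependence on latitude concrete: on a sphere, the length of one degree of longitude along a parallel is the length of one degree at the equator multiplied by the cosine of the latitude. The sketch below uses a hypothetical helper, km_per_degree_longitude, with the same 6371 km radius. It measures along the parallel rather than along a great circle, so it will differ slightly from calc_distance over long spans.

```python
from math import cos, pi, radians

EARTH_RADIUS_KM = 6371  # same Earth radius as in calc_distance

def km_per_degree_longitude(lat_deg):
    """Length in km of one degree of longitude along the parallel at lat_deg."""
    return EARTH_RADIUS_KM * cos(radians(lat_deg)) * pi / 180

print(round(km_per_degree_longitude(0.0), 1))   # 111.2 at the equator
print(round(km_per_degree_longitude(47.0), 1))  # 75.8 at latitude 47
```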

The area in question is approximately 70 km by 90 km (about 6,300 km²), which is quite large. This gives a better idea of how to define a 'cluster' in this community. For example, in a city or county this size, we can look at clusters with a radius as large as 10 km.

I will begin by creating a new column, distance_from_max, which shows the distance of each house from the most expensive house:

In [50]:
df['distance_from_max'] = df.apply(lambda row: distance_between(max_price_id, row['id']), axis=1)
df[['id', 'price', 'distance_from_max']]

,id,price,distance_from_max
0,7129300520,221900.0,14.086599
1,6414100192,538000.0,10.145399
2,5631500400,180000.0,13.779532
3,2487200875,604000.0,13.208839
4,1954400510,510000.0,20.884953
...,...,...,...
21608,263000018,360000.0,7.917660
21609,6600060120,400000.0,13.562668
21610,1523300141,402101.0,4.327950
21611,291310100,400000.0,21.799998
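
Calling distance_between once per row filters the whole dataframe several times for every house, which becomes slow on tens of thousands of rows. As a sketch of an alternative, the same law-of-cosines formula can be applied to whole columns at once with NumPy. The helper name distances_from_point_km and the toy coordinates are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def distances_from_point_km(lat, lon, ref_lat, ref_lon, radius_km=6371):
    """Great-circle distance in km from one reference point to each (lat, lon)."""
    lat, lon = np.radians(lat), np.radians(lon)
    ref_lat, ref_lon = np.radians(ref_lat), np.radians(ref_lon)
    cos_angle = (np.sin(ref_lat) * np.sin(lat)
                 + np.cos(ref_lat) * np.cos(lat) * np.cos(lon - ref_lon))
    # Clip so that rounding never pushes the value outside arccos's domain
    return radius_km * np.arccos(np.clip(cos_angle, -1.0, 1.0))

# Toy stand-in for df; ids and coordinates are invented for illustration.
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'lat': [47.60, 47.60, 47.70],
    'long': [-122.30, -122.20, -122.30],
})

# Look up the reference house once, then compute every distance in one pass.
ref = toy.loc[toy['id'] == 1].iloc[0]
toy['distance_from_ref'] = distances_from_point_km(
    toy['lat'], toy['long'], ref['lat'], ref['long'])

print(toy)
```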


### Looking at the area as a whole

First let us define a subset of the relevant data:

In [51]:
df_2 = df[['id', 'grade', 'price', 'distance_from_max']]

Grouping houses by price, we can calculate the mean distance from the most expensive house.

In [52]:
df_2.groupby('price')['distance_from_max'].mean().rename('mean_distance_from_max').reset_index()

,price,mean_distance_from_max
0,75000.0,51.376655
1,78000.0,17.633098
2,80000.0,31.098039
3,81000.0,16.578917
4,82000.0,14.640082
...,...,...
3620,5350000.0,7.753733
3621,5570000.0,6.745066
3622,6890000.0,6.220157
3623,7060000.0,8.469738


In the rows shown, the average distance from the most expensive house tends to decrease as price increases. For example, houses at the lowest price (75,000) are on average 51 km away from the most expensive one, while the highest price groups shown lie much closer (6-8 km). The trend is not strictly monotonic (the 78,000 group averages 17.6 km and the 80,000 group 31.1 km), and a price group may contain only a few houses, so individual rows are noisy. Even so, this suggests that the distribution of prices in the community is not random, but that similarly priced houses tend to be closer to one another.
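
Because grouping by exact price yields thousands of small groups, one way to make the trend easier to judge is to group by price band instead. The sketch below uses pd.qcut to split prices into equally sized bands and averages the distance per band. The toy values are invented for illustration, and the choice of four bands is arbitrary.

```python
import pandas as pd

# Toy stand-in for df_2; prices and distances are invented for illustration.
toy = pd.DataFrame({
    'price': [100, 150, 200, 250, 300, 350, 400, 450],
    'distance_from_max': [40.0, 35.0, 30.0, 28.0, 20.0, 15.0, 9.0, 5.0],
})

# labels=False returns integer band codes: 0 = cheapest quarter, 3 = most expensive.
toy['price_band'] = pd.qcut(toy['price'], q=4, labels=False)

band_means = toy.groupby('price_band')['distance_from_max'].mean()
print(band_means)
```

On the real data, a steadily falling mean across bands would support the claim more firmly than the first and last few rows of the per-price table.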

### Looking at geographical clusters

Another way to examine this is to define a cluster within 10 km of our most expensive house.

Let's create a new subset:

In [53]:
df_3 = df_2[df_2['distance_from_max'] < 10].copy()
df_3

,id,grade,price,distance_from_max
11,9212900260,7,468000.0,7.085124
14,1175000570,7,530000.0,6.947439
15,9297300055,9,650000.0,7.574322
17,6865200140,7,485000.0,4.170247
20,6300500875,7,385000.0,8.195477
...,...,...,...,...
21604,9834201367,8,429000.0,7.158916
21607,2997800021,8,475000.0,8.697855
21608,263000018,8,360000.0,7.917660
21610,1523300141,7,402101.0,4.327950


Mean house price globally:

In [54]:
df['price'].mean()

540182.1587933188

Mean house price in our cluster:

In [55]:
df_3['price'].mean()

696098.1879581151

In the 10 km radius around the most expensive house, the mean house price is about 29% higher than the overall mean (696,098 versus 540,182). Again, this indicates that there is a cluster of more expensive houses in this area.

### Looking at distance from the cheapest house using binning

We apply our function as before to add a 'distance_from_min' column:

In [56]:
df['distance_from_min'] = df.apply(lambda row: distance_between(min_price_id, row['id']), axis=1)

In [57]:
df_4 = df[['id', 'grade', 'price', 'distance_from_min']].copy()
df_4

,id,grade,price,distance_from_min
0,7129300520,7,221900.0,38.144715
1,6414100192,7,538000.0,59.559782
2,5631500400,6,180000.0,58.159586
3,2487200875,7,604000.0,46.481765
4,1954400510,8,510000.0,40.620019
...,...,...,...,...
21608,263000018,8,360000.0,58.637113
21609,6600060120,8,400000.0,43.956883
21610,1523300141,7,402101.0,47.173348
21611,291310100,8,400000.0,32.504629


This time, instead of grouping by price, I will bin by distance from the cheapest house, and find the mean price for each bin.

In [58]:
df_4['distance_from_min'].max()

67.57497095359423

In [59]:
def categorize_by_distance(value):
    if value < 20:
        return 1
    elif value < 40:
        return 2
    elif value < 60:
        return 3
    else:
        return 4

df_4['distance_from_min_category'] = df_4.apply(lambda row: categorize_by_distance(row['distance_from_min']), axis=1)
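
The same binning can be expressed with pd.cut, which avoids the row-by-row apply. In this sketch the bin edges and labels mirror categorize_by_distance, and right=False makes each bin include its lower edge and exclude its upper edge, matching the strict less-than comparisons in the function. The sample distances are invented for illustration.

```python
import numpy as np
import pandas as pd

# Sample distances in km, invented to include values on the bin edges.
distances = pd.Series([5.0, 19.9, 20.0, 38.1, 59.6, 60.0, 67.6])

# Bins: [-inf, 20), [20, 40), [40, 60), [60, inf)
categories = pd.cut(
    distances,
    bins=[-np.inf, 20, 40, 60, np.inf],
    right=False,
    labels=[1, 2, 3, 4],
).astype(int)

print(categories.tolist())  # [1, 1, 2, 2, 3, 4, 4]
```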

In [60]:
df_4

,id,grade,price,distance_from_min,distance_from_min_category
0,7129300520,7,221900.0,38.144715,2
1,6414100192,7,538000.0,59.559782,3
2,5631500400,6,180000.0,58.159586,3
3,2487200875,7,604000.0,46.481765,3
4,1954400510,8,510000.0,40.620019,3
...,...,...,...,...,...
21608,263000018,8,360000.0,58.637113,3
21609,6600060120,8,400000.0,43.956883,3
21610,1523300141,7,402101.0,47.173348,3
21611,291310100,8,400000.0,32.504629,2


In [61]:
df_4.groupby('distance_from_min_category').agg({'price': ['mean', 'count']}).round(0)

distance_from_min_category,price (mean),price (count)
1,344398.0,1316
2,449316.0,7635
3,632713.0,11290
4,472209.0,1372


Here, we can see that for the lowest category (houses within 20 km of the cheapest), the average price is 344,398. For houses between 20 and 40 km from the cheapest, the average price is higher at 449,316, and houses between 40 and 60 km away are more expensive still, with an average price of 632,713. The final category, houses 60 km or more from the cheapest house, breaks with the trend: its average price falls to 472,209. The count column also shows how the houses are distributed among the categories: just over half (11,290) fall in the 40-60 km band.

In general, these findings mirror the relationship we saw above: prices are lowest near the cheapest house and rise with distance from it, just as they fall with distance from the most expensive house.

### Examining the relationship between bedrooms and bathrooms

We can examine the relationship between two properties by grouping by multiple columns and using value counts. For example, I am interested in finding the relationship between the number of bathrooms and bedrooms in a house. This can help clients decide what type of house to look at to suit their needs.

In [62]:
df_5 = df.groupby(['bedrooms', 'bathrooms']).size().rename('frequency').to_frame()
df_5

bedrooms,bathrooms,frequency
0,0.00,7
0,0.75,1
0,1.00,1
0,1.50,1
0,2.50,3
...,...,...
10,2.00,1
10,3.00,1
10,5.25,1
11,3.00,1
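
For a side-by-side view of the same counts, pd.crosstab lays out bedrooms as rows and bathrooms as columns, filling in 0 for combinations that never occur. The sketch below runs on an invented toy frame; on the real data the call would take df['bedrooms'] and df['bathrooms'].

```python
import pandas as pd

# Toy stand-in for df; values are invented for illustration.
toy = pd.DataFrame({
    'bedrooms': [3, 3, 3, 4, 4, 2],
    'bathrooms': [1.0, 2.5, 2.5, 2.5, 3.0, 1.0],
})

# Rows: bedroom counts. Columns: bathroom counts. Cells: number of houses.
table = pd.crosstab(toy['bedrooms'], toy['bathrooms'])
print(table)
```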


The above shows us how often each bathroom count occurs for each bedroom count. For comparative purposes, I want to focus on the two most common bedroom groups. I can retrieve these using count:

In [63]:
df.groupby('bedrooms')['id'].count().rename('count').to_frame()

bedrooms,count
0,13
1,199
2,2760
3,9824
4,6882
5,1601
6,272
7,38
8,13
9,6


The most common groups are 3-bedroom houses and 4-bedroom houses, so let us compare them:

In [64]:
df_6 = df.loc[df['bedrooms'] == 3, 'bathrooms'].value_counts().to_frame().sort_values('bathrooms').reset_index()
df_6

,bathrooms,count
0,0.75,16
1,1.0,1780
2,1.25,4
3,1.5,829
4,1.75,1870
5,2.0,1048
6,2.25,1082
7,2.5,2357
8,2.75,275
9,3.0,197


The above dataframe shows us the count of each bathroom category amongst 3-bedroom houses. Alternatively, we can look at the percentage per category:

In [65]:
df_6 = df.loc[df['bedrooms'] == 3, 'bathrooms'].value_counts(normalize=True).mul(100).rename('percentage').to_frame().sort_values('bathrooms').reset_index()
df_6

,bathrooms,percentage
0,0.75,0.162866
1,1.0,18.118893
2,1.25,0.040717
3,1.5,8.438518
4,1.75,19.035016
5,2.0,10.667752
6,2.25,11.013844
7,2.5,23.992264
8,2.75,2.799267
9,3.0,2.005293


In [66]:
df_6.loc[df_6['bathrooms'] < 2.50, 'percentage'].sum().round(0)

67.0

In [67]:
df_6.loc[df_6['bathrooms'] > 2.50, 'percentage'].sum().round(0)

9.0

As we can see, about 24% of 3-bedroom houses have 2.50 bathrooms, which is the most common category. A buyer of a 3-bedroom house is more likely to get fewer bathrooms than this (67% of houses) and not very likely to get more (only 9%).

Let us compare with 4-bedroom homes:

In [68]:
df_7 = df.loc[df['bedrooms'] == 4, 'bathrooms'].value_counts(normalize=True).mul(100).rename('percentage').to_frame().sort_values('bathrooms').reset_index()
df_7

,bathrooms,percentage
0,0.5,0.014531
1,0.75,0.029061
2,1.0,4.722464
3,1.5,3.690788
4,1.75,10.447544
5,2.0,7.628596
6,2.25,10.302238
7,2.5,36.355711
8,2.75,9.285092
9,3.0,4.736995


In [69]:
df_7.loc[df_7['bathrooms'] < 2.50, 'percentage'].sum().round(0)

37.0

In [70]:
df_7.loc[df_7['bathrooms'] > 2.50, 'percentage'].sum().round(0)

27.0

Among 4-bedroom houses, 36% have 2.50 bathrooms, a larger share than among 3-bedroom houses (24%). Only 37% have fewer bathrooms than this, while 27% have more. This information can be useful in determining which house category to show to clients based on their needs.
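
The comparison above repeats the same steps for each bedroom count, so it can be wrapped in a small helper. The sketch below is one possible way to do this. bathroom_profile is a hypothetical function name, the toy data is invented, and the reference point is taken to be the most common bathroom count (if several counts tie, the smallest is used).

```python
import pandas as pd

def bathroom_profile(frame, bedrooms):
    """Share of houses with fewer, exactly, or more bathrooms than the modal count."""
    baths = frame.loc[frame['bedrooms'] == bedrooms, 'bathrooms']
    modal = baths.mode().iloc[0]  # most common bathroom count
    return {
        'modal_bathrooms': modal,
        'pct_fewer': (baths < modal).mean() * 100,
        'pct_modal': (baths == modal).mean() * 100,
        'pct_more': (baths > modal).mean() * 100,
    }

# Toy stand-in for df; values are invented for illustration.
toy = pd.DataFrame({
    'bedrooms': [3, 3, 3, 3, 3, 4],
    'bathrooms': [1.0, 1.75, 2.5, 2.5, 3.0, 2.5],
})

profile = bathroom_profile(toy, 3)
print(profile)
```

Calling the helper for 3 and 4 bedrooms on the full dataframe should give figures comparable to the percentages computed in the cells above.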