In [496]:
import pandas as pd

In [497]:
df = pd.read_csv('kc_house_data.csv')

In [498]:
df.columns

Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',
       'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
       'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
       'lat', 'long', 'sqft_living15', 'sqft_lot15'],
      dtype='object')

### Trying to understand the 'grade' property

In [499]:
df['grade'].nunique()

12

In [500]:
df['grade'].unique()

array([ 7,  6,  8, 11,  9,  5, 10, 12,  4,  3, 13,  1], dtype=int64)

In [501]:
df_1 = df[['grade', 'price', 'bedrooms', 'bathrooms', 'sqft_living', 'floors']]
df_1

Unnamed: 0,grade,price,bedrooms,bathrooms,sqft_living,floors
0,7,221900.0,3,1.00,1180,1.0
1,7,538000.0,3,2.25,2570,2.0
2,6,180000.0,2,1.00,770,1.0
3,7,604000.0,4,3.00,1960,1.0
4,8,510000.0,3,2.00,1680,1.0
...,...,...,...,...,...,...
21608,8,360000.0,3,2.50,1530,3.0
21609,8,400000.0,4,2.50,2310,2.0
21610,7,402101.0,2,0.75,1020,2.0
21611,8,400000.0,3,2.50,1600,2.0


In [502]:
agg_properties = {
    'price': ['mean'],
    'bedrooms': ['median'],
    'bathrooms': ['median'],
    'sqft_living': ['mean'],
    'floors': ['median'],
}

df_1.groupby('grade').agg(agg_properties).round(2)

Unnamed: 0_level_0,price,bedrooms,bathrooms,sqft_living,floors
Unnamed: 0_level_1,mean,median,median,mean,median
grade,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
1,142000.0,0.0,0.0,290.0,1.0
3,205666.67,1.0,0.0,596.67,1.0
4,214381.03,2.0,1.0,660.48,1.0
5,248523.97,2.0,1.0,983.33,1.0
6,301916.57,3.0,1.0,1191.56,1.0
7,402593.32,3.0,1.75,1689.4,1.0
8,542895.5,3.0,2.5,2184.75,2.0
9,773738.22,4.0,2.5,2868.14,2.0
10,1072347.47,4.0,2.75,3520.3,2.0
11,1497792.38,4.0,3.5,4395.45,2.0


The above analysis demonstrates the relationship between grade and other properties. As we can see, all examined properties (price, bedrooms, bathrooms, sqft_living and floors) increase on average with grade. Therefore, grade is likely to be a measure of overall quality. However, grade may also be linked to only one property, e.g. price, through which it is related to the remaining properties. 

In the above example, I used the median function for the properties bedrooms, bathrooms and floors, as they take a small set of discrete values, so a mean would produce figures that do not correspond to any real house. It is also possible to examine the mode of those values:

In [503]:
df_1.groupby('grade')[['bedrooms', 'bathrooms', 'floors']].apply(lambda x: x.mode())

Unnamed: 0_level_0,Unnamed: 1_level_0,bedrooms,bathrooms,floors
grade,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,0,0.0,0.0,1.0
3,0,1.0,0.0,1.0
4,0,2.0,0.75,1.0
4,1,,1.0,
5,0,2.0,1.0,1.0
6,0,3.0,1.0,1.0
7,0,3.0,1.0,1.0
8,0,3.0,2.5,2.0
9,0,4.0,2.5,2.0
10,0,4.0,2.5,2.0


The results are very similar. Notably, rows with grade value 4 have two modes for bathrooms, 0.75 and 1.00. For a cleaner table, we can omit the second mode by keeping only the first mode of each group:

In [504]:
df_1.groupby('grade')[['bedrooms', 'bathrooms', 'floors']].apply(lambda x: x.mode().iloc[0])

Unnamed: 0_level_0,bedrooms,bathrooms,floors
grade,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,0.0,0.0,1.0
3,1.0,0.0,1.0
4,2.0,0.75,1.0
5,2.0,1.0,1.0
6,3.0,1.0,1.0
7,3.0,1.0,1.0
8,3.0,2.5,2.0
9,4.0,2.5,2.0
10,4.0,2.5,2.0
11,4.0,3.5,2.0


Understanding that grade is positively correlated with price, living area and the number of rooms and floors is useful when we have limited time for analysis and wish to understand, for example, whether houses have become bigger and more expensive over time or in a certain area. Before delving into each of those variables separately, grade can give an overview and allow us to decide if further analysis is warranted.

### Deeper look at correlation

We can better understand the degree of this correlation by using the corr() function. Although all the mentioned values are positively correlated with grade, the degree of correlation may differ between them. Calling the corr() function returns a correlation matrix of the values, showing the correlation coefficient between each pair:

In [505]:
df_1.corr()

Unnamed: 0,grade,price,bedrooms,bathrooms,sqft_living,floors
grade,1.0,0.667463,0.356967,0.664983,0.762704,0.458183
price,0.667463,1.0,0.308338,0.525134,0.702044,0.256786
bedrooms,0.356967,0.308338,1.0,0.515884,0.576671,0.175429
bathrooms,0.664983,0.525134,0.515884,1.0,0.754665,0.500653
sqft_living,0.762704,0.702044,0.576671,0.754665,1.0,0.353949
floors,0.458183,0.256786,0.175429,0.500653,0.353949,1.0


This confirms our previous conclusion, but additionally shows that grade's strongest correlations are with living area (0.76), price (0.67) and bathrooms (0.66). It is less strongly correlated with floors (0.46) and bedrooms (0.36). The matrix also suggests that house price is correlated with living area and bathrooms. We can examine those relationships on their own to see that price is strongly correlated with living area and moderately correlated with bathrooms:

In [506]:
df[['price', 'sqft_living', 'bathrooms']].corr()

Unnamed: 0,price,sqft_living,bathrooms
price,1.0,0.702044,0.525134
sqft_living,0.702044,1.0,0.754665
bathrooms,0.525134,0.754665,1.0


This kind of information is useful when speaking to a client, for instance, who wants to know if paying more will give them additional floors or bedrooms. A good answer may be: yes, but it is more likely to give you additional living space and bathrooms.
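corr() reports Pearson coefficients by default, which measure linear association. Since grade is an ordinal score, a rank-based (Spearman) coefficient is a reasonable cross-check. The sketch below computes it as the Pearson correlation of the ranks and then sorts the grade column to read off a ranking. The toy frame holds only the first five rows shown earlier, so its numbers are not meaningful; on the real data, df_1 would take the place of toy.

```python
import pandas as pd

# Toy data: the first five rows of df_1 as displayed earlier (illustration only).
toy = pd.DataFrame({
    'grade': [7, 7, 6, 7, 8],
    'price': [221900.0, 538000.0, 180000.0, 604000.0, 510000.0],
    'sqft_living': [1180, 2570, 770, 1960, 1680],
})

# Spearman correlation is the Pearson correlation of the ranked values.
spearman = toy.rank().corr()

# Rank the other columns by the strength of their association with grade.
ranking = spearman['grade'].drop('grade').sort_values(ascending=False)
print(ranking)
```

If the Pearson and Spearman rankings agree on the full data, the conclusions above do not depend on treating grade as a continuous number.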

### Analyzing grade by geographical cluster

I will try to determine if house grade within similar geographic clusters is relatively consistent. This can help us understand whether this is a stratified community or one where different social classes exist side by side. This type of analysis can also be useful when comparing cities to understand the degree to which they are segregated by socioeconomic class.

I will begin the analysis by retrieving the ids of the houses with the maximum, minimum and median price.

In [507]:
max_price_id = df.loc[df['price'] == df['price'].max(), ['id', 'price']].iloc[0, 0]
max_price_id

6762700020

In [508]:
min_price_id = df.loc[df['price'] == df['price'].min(), ['id', 'price']].iloc[0, 0]
min_price_id

3421079032

In [509]:
df.loc[df['price'] == df['price'].median(), ['id', 'price']]

Unnamed: 0,id,price
48,9215400105,450000.0
276,9189700045,450000.0
376,9423400140,450000.0
406,7821200390,450000.0
773,1623300160,450000.0
...,...,...
21020,9826701201,450000.0
21122,2708450020,450000.0
21152,9268850290,450000.0
21198,4140940130,450000.0


Multiple houses share the median price, so I will take the first:

In [510]:
median_price_id = df.loc[df['price'] == df['price'].median(), ['id', 'price']].iloc[0, 0]
median_price_id

9215400105
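The same lookups can be written a little more compactly with idxmax() and idxmin(), which return the index label of the first row holding the extreme value. This is only a sketch on an invented four-row frame (the ids and prices are made up); on the real data, df would take the place of toy.

```python
import pandas as pd

# Invented ids and prices, for illustration only.
toy = pd.DataFrame({
    'id': [101, 102, 103, 104],
    'price': [450000.0, 7700000.0, 75000.0, 450000.0],
})

# idxmax()/idxmin() give the index label of the first matching row.
max_id = toy.loc[toy['price'].idxmax(), 'id']
min_id = toy.loc[toy['price'].idxmin(), 'id']

# Several rows may share the median price, so collect all of them.
median_ids = toy.loc[toy['price'] == toy['price'].median(), 'id'].tolist()

print(max_id, min_id, median_ids)
```

Note that the median of an even number of prices is the average of the two middle values, so it is possible for no row to match it exactly; in that case median_ids would be empty.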

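I will define the distance functions. calc_distance() uses the spherical law of cosines with an Earth radius of 6371 km:

In [511]:
from math import pi, sin, cos, acos

def calc_distance(loc1, loc2):
    # loc1 and loc2 are [lat, long] in degrees. Convert to radians
    # without modifying the caller's lists.
    lat1, long1 = loc1[0] * pi/180, loc1[1] * pi/180
    lat2, long2 = loc2[0] * pi/180, loc2[1] * pi/180
    cos_angle = sin(lat1) * sin(lat2) + cos(lat1) * cos(lat2) * cos(long2 - long1)
    # Clamp to [-1, 1] so rounding error cannot take acos() out of its domain.
    return acos(max(-1.0, min(1.0, cos_angle))) * 6371

def distance_between(id1, id2):
    # Return None if either id is missing from df.
    if df[df.id == id1]['id'].count() == 0 or df[df.id == id2]['id'].count() == 0:
        return None
    # Use the first matching row for each id.
    house1 = [df.loc[df['id'] == id1, ['lat']].iat[0, 0], df.loc[df['id'] == id1, ['long']].iat[0, 0]]
    house2 = [df.loc[df['id'] == id2, ['lat']].iat[0, 0], df.loc[df['id'] == id2, ['long']].iat[0, 0]]
    return calc_distance(house1, house2)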
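A commonly used alternative to the law-of-cosines formula is the haversine formula, which is generally considered better behaved numerically for very short distances. The sketch below is not part of the original analysis; the function name haversine_km is my own, and it assumes the same Earth radius of 6371 km.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371  # same mean radius as in calc_distance

def haversine_km(lat1, long1, lat2, long2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(long2 - long1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    # min() guards against rounding pushing sqrt(a) just above 1.
    return 2 * EARTH_RADIUS_KM * asin(min(1.0, sqrt(a)))

# Two quick sanity checks: identical points, and one degree of latitude.
same_point = haversine_km(47.5112, -122.257, 47.5112, -122.257)
one_degree_north = haversine_km(47.0, -122.0, 48.0, -122.0)
print(same_point, round(one_degree_north, 2))
```

One degree of latitude should come out at roughly 111 km, which is a handy check for any great-circle implementation.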

First, I will get an overall picture of the size of the community by calculating the maximum North-South and East-West distances.

In [512]:
df[['lat', 'long']].head()

Unnamed: 0,lat,long
0,47.5112,-122.257
1,47.721,-122.319
2,47.7379,-122.233
3,47.5208,-122.393
4,47.6168,-122.045


In [513]:
max_lat = df['lat'].max()

In [514]:
min_lat = df['lat'].min()

I will assume constant longitude to calculate the North-South distance:

In [515]:
calc_distance([max_lat, -121.0], [min_lat, -121.0])

69.12988589485569

In [516]:
max_long = df['long'].max()

In [517]:
min_long = df['long'].min()

To calculate the East-West distance, it is necessary to use a realistic latitude, because the distance covered by one degree of longitude shrinks going from the equator to the poles. I use 47°, which is close to the latitudes in the data:

In [518]:
calc_distance([47.0, max_long], [47.0, min_long])

91.30414959917016
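The effect of latitude on East-West distances can be checked with a few lines of arithmetic. On a sphere, one degree of longitude covers the length of one degree of latitude multiplied by the cosine of the latitude. The sketch below assumes the same spherical Earth of radius 6371 km; the function name km_per_degree_long is my own.

```python
from math import cos, radians

EARTH_RADIUS_KM = 6371
KM_PER_DEGREE_LAT = EARTH_RADIUS_KM * radians(1)  # about 111 km everywhere

def km_per_degree_long(lat_deg):
    """Length in km of one degree of longitude at the given latitude."""
    return KM_PER_DEGREE_LAT * cos(radians(lat_deg))

for lat in (0, 47, 60):
    print(lat, round(km_per_degree_long(lat), 1))
```

At 47° a degree of longitude is roughly 76 km, about two thirds of its length at the equator, which is why the choice of latitude matters for the East-West figure.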

The area in question is approximately 70km by 90km (about 6,300 sq km), which is quite large. It is more likely to be a county than a single city. This gives a better idea of how to define a 'cluster' in this community. For example, in an area this size, a cluster may reasonably be defined as a radius of 10km around a single point.

I will begin by creating a new column, distance_from_max, which shows the distance of each house from the most expensive house:

In [519]:
for i in range(len(df)):
    df.loc[i, 'distance_from_max'] = distance_between(max_price_id, df['id'].iat[i])
df[['id', 'price', 'distance_from_max']]

Unnamed: 0,id,price,distance_from_max
0,7129300520,221900.0,14.086599
1,6414100192,538000.0,10.145399
2,5631500400,180000.0,13.779532
3,2487200875,604000.0,13.208839
4,1954400510,510000.0,20.884953
...,...,...,...
21608,263000018,360000.0,7.917660
21609,6600060120,400000.0,13.562668
21610,1523300141,402101.0,4.327950
21611,291310100,400000.0,21.799998


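Wow, that took a long time! Using ChatGPT, I researched an alternative syntax:

df['distance_from_max'] = df.apply(lambda i: distance_between(max_price_id, i['id']), axis=1)

I expected this to perform better, but it took about the same time. In hindsight this makes sense: both versions call distance_between() once per row, and each call scans the whole DataFrame several times to look up the two ids, so the lookups dominate the running time rather than the loop syntax.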
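One way to avoid the repeated lookups is to fetch the reference house once and compute all distances with NumPy array operations. The following is a sketch under stated assumptions, not the method used in this notebook: the helper distances_from is hypothetical, the ids in the toy frame are invented, and the coordinates are the first three rows of the head() output above.

```python
import numpy as np
import pandas as pd

def distances_from(lat0, long0, lats, longs, radius_km=6371.0):
    """Great-circle distances in km from one point to arrays of points (degrees)."""
    lat0, long0 = np.radians(lat0), np.radians(long0)
    lats = np.radians(np.asarray(lats, dtype=float))
    longs = np.radians(np.asarray(longs, dtype=float))
    cos_angle = (np.sin(lat0) * np.sin(lats)
                 + np.cos(lat0) * np.cos(lats) * np.cos(longs - long0))
    # Clip so rounding error cannot take arccos() out of its domain.
    return np.arccos(np.clip(cos_angle, -1.0, 1.0)) * radius_km

# Toy frame: invented ids, coordinates from the first three rows shown above.
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'lat': [47.5112, 47.721, 47.7379],
    'long': [-122.257, -122.319, -122.233],
})

# Look up the reference house once, then compute every distance in one call.
ref = toy.loc[toy['id'] == 1].iloc[0]
toy['distance_from_ref'] = distances_from(ref['lat'], ref['long'], toy['lat'], toy['long'])
print(toy)
```

On the real data, the reference row would be the one matching max_price_id, and the single call would replace the per-row loop.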

Now that we understand how far each house is from the most expensive house, we can analyze this data in different ways.

### Looking at simple relationships

In [531]:
df_2 = df[['id', 'grade', 'price', 'distance_from_max']]
df_2

Unnamed: 0,id,grade,price,distance_from_max
0,7129300520,7,221900.0,14.086599
1,6414100192,7,538000.0,10.145399
2,5631500400,6,180000.0,13.779532
3,2487200875,7,604000.0,13.208839
4,1954400510,8,510000.0,20.884953
...,...,...,...,...
21608,263000018,8,360000.0,7.917660
21609,6600060120,8,400000.0,13.562668
21610,1523300141,7,402101.0,4.327950
21611,291310100,8,400000.0,21.799998


In [541]:
df_2.groupby('price')['distance_from_max'].mean().rename('mean_distance_from_max').reset_index()

Unnamed: 0,price,mean_distance_from_max
0,75000.0,51.376655
1,78000.0,17.633098
2,80000.0,31.098039
3,81000.0,16.578917
4,82000.0,14.640082
...,...,...
3620,5350000.0,7.753733
3621,5570000.0,6.745066
3622,6890000.0,6.220157
3623,7060000.0,8.469738


We can see here that the average distance from the most expensive house tends to decrease as price increases. For example, the cheapest house is 51km away from the most expensive one, while the houses at the top of the price range shown above are much closer (roughly 6 to 8.5km). The trend is not smooth, since the next-cheapest prices in the table are between roughly 15 and 31km away, but it suggests that cheaper and more expensive houses tend to be located separately. However, the most expensive houses are not necessarily in the same neighborhood as one another, as a distance of 6 to 8km is still significant and indicates a different part of town.

We can also expand this table to see the median grade alongside the mean distance:

In [552]:
df_2.groupby('price')[['grade', 'distance_from_max']].agg({'grade': ['median'], 'distance_from_max': ['mean']})

Unnamed: 0_level_0,grade,distance_from_max
Unnamed: 0_level_1,median,mean
price,Unnamed: 1_level_2,Unnamed: 2_level_2
75000.0,3.0,51.376655
78000.0,5.0,17.633098
80000.0,4.0,31.098039
81000.0,5.0,16.578917
82000.0,6.0,14.640082
...,...,...
5350000.0,12.0,7.753733
5570000.0,13.0,6.745066
6890000.0,13.0,6.220157
7060000.0,11.0,8.469738


As we can see, not only does distance to the most expensive house, on average, close in as houses get more expensive, but the grade of the house increases.
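Grouping by exact price produces thousands of tiny groups, which makes the trend hard to read. One option is to bin prices into quantile bands with pd.qcut and summarise each band. The sketch below uses a handful of rows adapted from the tables above (distances rounded), so the figures are illustrative only; the band labels and column names are my own.

```python
import pandas as pd

# Illustrative rows adapted from the tables above; not the full dataset.
toy = pd.DataFrame({
    'price': [75000, 180000, 221900, 400000, 510000, 604000, 5350000, 7060000],
    'grade': [3, 6, 7, 8, 8, 7, 12, 11],
    'distance_from_max': [51.4, 13.8, 14.1, 13.6, 20.9, 13.2, 7.8, 8.5],
})

# Split prices into four equally sized bands (quartiles).
toy['price_band'] = pd.qcut(toy['price'], q=4,
                            labels=['lowest', 'low', 'high', 'highest'])

# One row per band: median grade, mean distance and number of houses.
summary = toy.groupby('price_band', observed=True).agg(
    median_grade=('grade', 'median'),
    mean_distance=('distance_from_max', 'mean'),
    houses=('price', 'size'),
)
print(summary)
```

With the full DataFrame, a larger q (for example deciles) would give a finer view while still keeping each group large enough for its average to be stable.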

### Looking at clusters

Another way to examine this is to consider houses that are within 10km of our most expensive house. This is a large distance, and in many countries would perhaps indicate a different town, but we have already determined that the area we are examining is quite large (the size of a big county or even a small country).

In [555]:
df_3 = df_2[df_2['distance_from_max'] < 10]
df_3

Unnamed: 0,id,grade,price,distance_from_max
11,9212900260,7,468000.0,7.085124
14,1175000570,7,530000.0,6.947439
15,9297300055,9,650000.0,7.574322
17,6865200140,7,485000.0,4.170247
20,6300500875,7,385000.0,8.195477
...,...,...,...,...
21604,9834201367,8,429000.0,7.158916
21607,2997800021,8,475000.0,8.697855
21608,263000018,8,360000.0,7.917660
21610,1523300141,7,402101.0,4.327950


In [558]:
df_2['price'].mean()

540182.1587933188

In [556]:
df_3['price'].mean()

696098.1879581151

In the 10km radius around the most expensive house, the mean house price is considerably higher than the overall mean (about 696,000 versus 540,000). Again, this indicates that there is a cluster of more expensive houses in this area.

In [565]:
df_2['grade'].mode()

0    7
Name: grade, dtype: int64

In [566]:
df_3['grade'].mode()

0    7
Name: grade, dtype: int64

The modal grade, however, is 7 both inside the 10km radius and overall, so no difference is observed there.
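The cluster and the rest of the data can also be compared side by side in a single table by grouping on a flag column. This is a sketch on invented rows (prices and distances are loosely based on the tables above); the column name area and the output column names are my own.

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    'price': [468000.0, 530000.0, 650000.0, 221900.0, 510000.0, 400000.0],
    'grade': [7, 7, 9, 7, 8, 8],
    'distance_from_max': [7.1, 6.9, 7.6, 14.1, 20.9, 21.8],
})

# Label each house by whether it falls inside the 10km radius.
toy['area'] = (toy['distance_from_max'] < 10).map(
    {True: 'within 10km', False: 'further away'})

comparison = toy.groupby('area').agg(
    houses=('price', 'size'),
    mean_price=('price', 'mean'),
    median_price=('price', 'median'),
    modal_grade=('grade', lambda s: s.mode().iloc[0]),  # first mode if tied
)
print(comparison)
```

Including the median price next to the mean helps show whether a few very expensive houses are driving the difference.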

### Returning to the larger dataset to compare with year built and renovated

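Since I have been working with a subset of the data (df_2), I want to insert my 'distance_from_max' column back into the original DataFrame to look at some additional columns like year built and year renovated. (If df still holds the column from the loop above, this step simply overwrites it with the same values.)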

In [567]:
distance_list = list(df_2['distance_from_max'])

In [568]:
df['distance_from_max'] = distance_list

In [571]:
df[['distance_from_max']]

Unnamed: 0,distance_from_max
0,14.086599
1,10.145399
2,13.779532
3,13.208839
4,20.884953
...,...
21608,7.917660
21609,13.562668
21610,4.327950
21611,21.799998


In [574]:
df.groupby('yr_built')['distance_from_max'].mean().reset_index()

Unnamed: 0,yr_built,distance_from_max
0,1900,9.640329
1,1901,3.774700
2,1902,4.172150
3,1903,9.304340
4,1904,6.317436
...,...,...
111,2011,22.477898
112,2012,25.231119
113,2013,22.536076
114,2014,18.907627


The oldest houses are closer to the most expensive house, while houses built in the 2000s are further away. The most expensive house appears to be in a cluster of historical units. Let's examine this hypothesis more closely by returning to the cluster idea using a radius of 10km.

In [608]:
df.loc[df['distance_from_max'] < 10, ['yr_built']].median()

yr_built    1947.0
dtype: float64

In [609]:
df['yr_built'].median()

1975.0

Indeed, the median building year for houses within a 10km radius of the most expensive house is 1947. This suggests a cluster of older housing stock. By contrast, the median for the entire community is 1975.

We can also find relationships with year renovated:

In [612]:
df[['yr_renovated', 'price', 'grade', 'distance_from_max']].groupby('yr_renovated').agg({'price': 'mean', 'grade': 'median', 'distance_from_max': 'mean'})

Unnamed: 0_level_0,price,grade,distance_from_max
yr_renovated,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,530447.958597,7.0,18.795900
1934,459950.000000,6.0,18.591166
1940,378400.000000,6.5,7.432593
1944,521000.000000,6.0,10.195156
1945,398666.666667,6.0,12.050241
...,...,...,...
2011,607496.153846,7.0,9.433041
2012,625181.818182,8.0,13.319654
2013,664960.810811,7.0,14.514327
2014,655030.098901,7.0,12.205265


Here the relationships are less clear. Houses renovated more recently do tend to be more expensive, but there is no obvious pattern in their grade or distance from the most expensive house. We can check this using corr().

In [619]:
df[['yr_renovated', 'price', 'grade', 'distance_from_max']].corr()

Unnamed: 0,yr_renovated,price,grade,distance_from_max
yr_renovated,1.0,0.126442,0.014414,-0.078202
price,0.126442,1.0,0.667463,-0.319728
grade,0.014414,0.667463,1.0,-0.022351
distance_from_max,-0.078202,-0.319728,-0.022351,1.0


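As we can see, the correlations with year of renovation, whether positive or negative, are very small. One caveat: yr_renovated contains the value 0, which presumably marks houses that have never been renovated. These coefficients therefore mix "renovated or not" with "how recently", and should be read with caution.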
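To separate those two effects, one option is to compute the correlation only over houses that have a renovation year. The sketch below assumes that 0 means "never renovated" (the dataset is not documented here, so this is an assumption) and uses invented values purely to show the mechanics; it says nothing about what the real data would show.

```python
import pandas as pd

# Invented values for illustration; 0 is assumed to mean "never renovated".
toy = pd.DataFrame({
    'yr_renovated': [0, 0, 0, 1985, 1998, 2005, 2014],
    'price': [300000.0, 450000.0, 520000.0, 410000.0, 480000.0, 560000.0, 655000.0],
})

# Keep only houses with an actual renovation year.
renovated = toy[toy['yr_renovated'] > 0]

corr_all = toy['yr_renovated'].corr(toy['price'])
corr_renovated = renovated['yr_renovated'].corr(renovated['price'])
print(round(corr_all, 3), round(corr_renovated, 3))
```

On the real data the same filter would be applied to df before calling corr(), and the number of remaining rows should be checked, since renovated houses may be a small share of the total.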

We can do a similar analysis for the cheapest house and the house with the median price to better understand geographical and other relationships. However, for the final example I will look at different properties which allow grouping by multiple variables.

### Grouping by multiple variables