In [162]:
import pandas as pd

In [163]:
df = pd.read_csv('kc_house_data.csv')

In [164]:
df.columns

Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',
       'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
       'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
       'lat', 'long', 'sqft_living15', 'sqft_lot15'],
      dtype='object')

### Trying to understand the 'grade' property

In [165]:
df['grade'].nunique()

12

In [166]:
df['grade'].unique()

array([ 7,  6,  8, 11,  9,  5, 10, 12,  4,  3, 13,  1], dtype=int64)

In [167]:
df_1 = df[['grade', 'price', 'bedrooms', 'bathrooms', 'sqft_living', 'floors']]
df_1

,grade,price,bedrooms,bathrooms,sqft_living,floors
0,7,221900.0,3,1.00,1180,1.0
1,7,538000.0,3,2.25,2570,2.0
2,6,180000.0,2,1.00,770,1.0
3,7,604000.0,4,3.00,1960,1.0
4,8,510000.0,3,2.00,1680,1.0
...,...,...,...,...,...,...
21608,8,360000.0,3,2.50,1530,3.0
21609,8,400000.0,4,2.50,2310,2.0
21610,7,402101.0,2,0.75,1020,2.0
21611,8,400000.0,3,2.50,1600,2.0


In [168]:
agg_properties = {
    'price': ['mean'],
    'bedrooms': ['median'],
    'bathrooms': ['median'],
    'sqft_living': ['mean'],
    'floors': ['median'],
}

df_1.groupby('grade').agg(agg_properties).round(2)

grade,price (mean),bedrooms (median),bathrooms (median),sqft_living (mean),floors (median)
1,142000.0,0.0,0.0,290.0,1.0
3,205666.67,1.0,0.0,596.67,1.0
4,214381.03,2.0,1.0,660.48,1.0
5,248523.97,2.0,1.0,983.33,1.0
6,301916.57,3.0,1.0,1191.56,1.0
7,402593.32,3.0,1.75,1689.4,1.0
8,542895.5,3.0,2.5,2184.75,2.0
9,773738.22,4.0,2.5,2868.14,2.0
10,1072347.47,4.0,2.75,3520.3,2.0
11,1497792.38,4.0,3.5,4395.45,2.0


The table above shows how grade relates to the other properties: mean price and living area, and median bedrooms, bathrooms and floors, all rise (or at least never fall) as grade increases. Grade is therefore likely to be a measure of overall quality. It is also possible, however, that grade is tied directly to a single property, e.g. price, and is related to the remaining properties only through it.
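
To put a number on "increase on average with grade", one option is a rank (Spearman) correlation between grade and each property. The sketch below is not part of the original analysis: `grade_rank_correlation` is a hypothetical helper, and the small `toy` frame holds made-up rows that only illustrate the call. On the real data, `df_1` would be passed instead.

```python
import pandas as pd

def grade_rank_correlation(frame, grade_col='grade'):
    """Spearman correlation of every column with the grade column.

    Spearman correlation equals Pearson correlation computed on ranks,
    so we rank first and then call the default (Pearson) corr().
    """
    ranked = frame.rank()
    return ranked.corr()[grade_col].drop(grade_col)

# Made-up rows for illustration only; not taken from kc_house_data.csv.
toy = pd.DataFrame({
    'grade': [5, 6, 7, 8, 9],
    'price': [250000, 300000, 400000, 540000, 770000],
    'sqft_living': [980, 1190, 1690, 2180, 2870],
})

print(grade_rank_correlation(toy))
```

Values close to 1 would indicate that a property rises consistently with grade. This does not separate a direct link from one that runs through price.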

In the above example, I used the median for bedrooms, bathrooms and floors, as they are discrete variables that take only a few distinct values. It is also possible to examine the mode of those values:

In [169]:
df_1.groupby('grade')[['bedrooms', 'bathrooms', 'floors']].apply(lambda x: x.mode())

grade,,bedrooms,bathrooms,floors
1,0,0.0,0.0,1.0
3,0,1.0,0.0,1.0
4,0,2.0,0.75,1.0
4,1,,1.0,
5,0,2.0,1.0,1.0
6,0,3.0,1.0,1.0
7,0,3.0,1.0,1.0
8,0,3.0,2.5,2.0
9,0,4.0,2.5,2.0
10,0,4.0,2.5,2.0


The results are very similar. Notably, houses with grade 4 have two modes for bathrooms, 0.75 and 1.00. For a cleaner table, we can omit the second mode by calling the following:

In [170]:
df_1.groupby('grade')[['bedrooms', 'bathrooms', 'floors']].apply(lambda x: x.mode().iloc[0])

grade,bedrooms,bathrooms,floors
1,0.0,0.0,1.0
3,1.0,0.0,1.0
4,2.0,0.75,1.0
5,2.0,1.0,1.0
6,3.0,1.0,1.0
7,3.0,1.0,1.0
8,3.0,2.5,2.0
9,4.0,2.5,2.0
10,4.0,2.5,2.0
11,4.0,3.5,2.0
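
Note that `.iloc[0]` does not pick one of the tied modes at random. `Series.mode()` returns tied values in ascending order, so the smallest one is kept, which is why grade 4 shows 0.75 rather than 1.00. A minimal illustration with made-up values:

```python
import pandas as pd

# Made-up values for illustration: 0.75 and 1.0 each occur twice.
bathrooms = pd.Series([1.0, 0.75, 1.0, 0.75, 2.0])

modes = bathrooms.mode()      # all tied modes, sorted ascending
first_mode = modes.iloc[0]    # the smallest of the tied values

print(modes.tolist())
print(first_mode)
```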


Treating grade as a catch-all for price, living area and room and floor counts is useful when time is limited and we want to know, for example, whether houses have become bigger and more expensive over time or in a certain area. Before delving into each of those variables separately, grade can give an overview and let us decide whether further analysis is warranted.

### Analyzing grade by geographical cluster

I will try to determine if house grade within similar geographic clusters is relatively consistent. This can help us understand whether this is a stratified community or one where different social classes exist side by side. This type of analysis can also be useful when comparing cities to understand the degree to which they are segregated by socioeconomic class.

I will begin the analysis by retrieving the most expensive, the least expensive and a median-priced house.

In [171]:
max_price_id = df.loc[df['price'] == df['price'].max(), ['id', 'price']].iloc[0, 0]
max_price_id

6762700020

In [172]:
min_price_id = df.loc[df['price'] == df['price'].min(), ['id', 'price']].iloc[0, 0]
min_price_id

3421079032

In [173]:
df.loc[df['price'] == df['price'].median(), ['id', 'price']]

,id,price
48,9215400105,450000.0
276,9189700045,450000.0
376,9423400140,450000.0
406,7821200390,450000.0
773,1623300160,450000.0
...,...,...
21020,9826701201,450000.0
21122,2708450020,450000.0
21152,9268850290,450000.0
21198,4140940130,450000.0


Multiple houses share the median price, so I will take the first:

In [174]:
median_price_id = df.loc[df['price'] == df['price'].median(), ['id', 'price']].iloc[0, 0]
median_price_id

9215400105
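
The equality filter works here because the median coincides with an actual sale price. With an even number of rows the median can fall between two prices, in which case the filter returns no rows. A more defensive sketch, using a hypothetical helper `closest_to_median` and made-up rows, picks the row whose price is nearest to the median:

```python
import pandas as pd

def closest_to_median(frame, value_col='price', id_col='id'):
    """Return the id of the first row whose value is nearest to the median."""
    distance = (frame[value_col] - frame[value_col].median()).abs()
    return frame.loc[distance.idxmin(), id_col]

# Made-up rows; the median (350000) matches no row exactly.
toy = pd.DataFrame({
    'id': [101, 102, 103, 104],
    'price': [200000.0, 300000.0, 400000.0, 900000.0],
})

print(closest_to_median(toy))
```

When two rows are equally close, `idxmin()` returns the first, which mirrors the "take the first" choice made above.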

I will define the distance functions:

In [175]:
from math import pi, sin, cos, acos

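def calc_distance(loc1, loc2):
    # Convert degrees to radians in local variables so the caller's lists are not modified
    lat1, long1 = loc1[0] * pi/180, loc1[1] * pi/180
    lat2, long2 = loc2[0] * pi/180, loc2[1] * pi/180
    # Spherical law of cosines; clamp to [-1, 1] to guard acos against rounding error
    cos_angle = sin(lat1) * sin(lat2) + cos(lat1) * cos(lat2) * cos(long2 - long1)
    return acos(max(-1.0, min(1.0, cos_angle))) * 6371  # 6371 km: mean Earth radius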

def distance_between(id1, id2):
    # Look up each house once; return None if either id is missing
    rows1 = df.loc[df['id'] == id1, ['lat', 'long']]
    rows2 = df.loc[df['id'] == id2, ['lat', 'long']]
    if rows1.empty or rows2.empty:
        return None
    # Use the first matching row for each id
    return calc_distance(rows1.iloc[0].tolist(), rows2.iloc[0].tolist())
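
`calc_distance` uses the spherical law of cosines, which loses precision for points that are very close together because the argument of `acos` approaches 1. A common alternative is the haversine formula. The sketch below is an optional substitute, not the method used in this analysis; `haversine_km` and `EARTH_RADIUS_KM` are names introduced here for illustration.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371  # mean Earth radius, same value as in calc_distance

def haversine_km(loc1, loc2):
    """Great-circle distance in km between two [lat, long] pairs in degrees."""
    lat1, lon1 = radians(loc1[0]), radians(loc1[1])
    lat2, lon2 = radians(loc2[0]), radians(loc2[1])
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    # min() guards asin against rounding slightly above 1
    return 2 * EARTH_RADIUS_KM * asin(min(1.0, sqrt(a)))

# Identical points, then two points one degree of latitude apart
print(haversine_km([47.5112, -122.257], [47.5112, -122.257]))
print(haversine_km([47.0, -122.0], [48.0, -122.0]))
```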

First, I will get an overall picture of the size of the community by calculating maximum North-South and East-West distance.

In [176]:
df[['lat', 'long']].head()

,lat,long
0,47.5112,-122.257
1,47.721,-122.319
2,47.7379,-122.233
3,47.5208,-122.393
4,47.6168,-122.045


In [177]:
max_lat = df['lat'].max()

In [178]:
min_lat = df['lat'].min()

I will assume constant longitude to calculate the North-South distance:

In [211]:
calc_distance([max_lat, -121.0], [min_lat, -121.0])

69.12988589485569

In [180]:
max_long = df['long'].max()

In [199]:
min_long = df['long'].min()

For the East-West distance, the latitude matters, because the distance spanned by one degree of longitude shrinks going from the equator to the poles. I use 47.0, which is close to the latitudes in the data:

In [210]:
calc_distance([47.0, max_long], [47.0, min_long])

91.30414959917016
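
To see how strongly latitude affects East-West distances, the length of one degree of longitude can be approximated as (π/180) · R · cos(latitude). The sketch below assumes a spherical Earth with the same 6371 km radius; `km_per_degree_longitude` is a hypothetical helper introduced for illustration.

```python
from math import cos, pi, radians

EARTH_RADIUS_KM = 6371  # mean Earth radius on a spherical model

def km_per_degree_longitude(lat_deg):
    """Approximate east-west length in km of one degree of longitude."""
    return (pi / 180) * EARTH_RADIUS_KM * cos(radians(lat_deg))

# Equator, the latitude used above, and a higher latitude for comparison
for lat in (0, 47, 60):
    print(lat, round(km_per_degree_longitude(lat), 1))
```

At 60 degrees latitude, one degree of longitude covers half the distance it does at the equator.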

The area in question is approximately 70 km by 90 km (about 6,300 sq km), which is quite large. It is more likely to be a county than a single city. This gives a better idea of how to define a 'cluster' in this community. For example, in an area this size, a cluster could reasonably be defined as a radius of 10 km around a single point.
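
One way such a radius-based cluster could be selected is sketched below. This is an assumption about how the definition might be applied, not a step taken in the analysis so far: `houses_within` is a hypothetical helper, the distance is computed with a vectorized haversine formula, and the `toy` frame holds made-up coordinates.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371  # mean Earth radius on a spherical model

def houses_within(frame, center_lat, center_long, radius_km=10):
    """Rows of frame whose (lat, long) lie within radius_km of the center."""
    lat1, lon1 = np.radians(center_lat), np.radians(center_long)
    lat2, lon2 = np.radians(frame['lat']), np.radians(frame['long'])
    # Vectorized haversine formula over all rows at once
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    distance_km = 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))
    return frame[distance_km <= radius_km]

# Made-up coordinates for illustration only.
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'lat': [47.50, 47.55, 47.90],
    'long': [-122.25, -122.25, -122.25],
})

print(houses_within(toy, 47.50, -122.25)['id'].tolist())
```

On the real data, the center could be the coordinates of the maximum, minimum or median priced house found earlier.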