# Project: Predicting Housing Prices in King County

We will use the **King County House Sales dataset**:  
[House Sales Prediction (Kaggle)](https://www.kaggle.com/datasets/harlfoxem/housesalesprediction)

# Data Insights

I've prepared discussions on three data insights below:
- Effect of latitude on model
- Effect of sigma filtering
- Effect of grade

Project conclusions follow at the bottom of this page.

Note: the code on this page is largely non-operational. The cached dataframes provide tabular data for discussion.

### Effect of latitude on model
- num_columns
    - 19: Original data columns
    - 16: All spatial columns removed
    - 17: lat/long removed
    - 18: zipcode removed
    - 14: User columns (winsorized bed/bath, no date, no spatial)
- Sigma Filter = 10 or 6: tightening the filter from 10-sigma to 6-sigma removes the top ~1.2% of rows (21,212 to 20,953).

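These tests show that it is specifically latitude driving the loss of model performance. Removing zipcode alone leaves test R-squared at 0.69 (0.68 with the 6-sigma filter), while removing lat/long drops it to 0.63 (0.62), the same as removing all spatial columns. The effect of latitude is a curiosity of the King County dataset and wouldn't apply elsewhere.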
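The column-removal comparison can be sketched as follows. This is not the notebook's code: it is a minimal illustration that uses NumPy least squares on synthetic data, where the feature values, the price formula, and the helper names `r2_ols` and `ablate` are all assumptions made for the example. It reports in-sample R-squared for brevity rather than the train/test split used in the tables.

```python
import numpy as np

def r2_ols(X, y):
    """In-sample R-squared of an ordinary least squares fit with intercept."""
    X = (X - X.mean(axis=0)) / X.std(axis=0)      # standardize for numerical stability
    A = np.column_stack([np.ones(len(X)), X])     # add intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid.var() / y.var()

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-ins for the real columns (NOT the King County data).
features = {
    "sqft_living": rng.normal(2000, 800, n),
    "lat": rng.uniform(47.2, 47.8, n),
    "long": rng.uniform(-122.5, -121.5, n),
    "zipcode": rng.integers(98001, 98200, n).astype(float),
}

# Assumed price formula: driven by living area and latitude only.
price = (200 * features["sqft_living"]
         + 600_000 * (features["lat"] - 47.2)
         + rng.normal(0, 100_000, n))

def ablate(dropped):
    """Fit on every feature except those named in `dropped`."""
    cols = [c for c in features if c not in dropped]
    X = np.column_stack([features[c] for c in cols])
    return r2_ols(X, price)

scores = {
    "all columns": ablate([]),
    "zipcode removed": ablate(["zipcode"]),
    "lat/long removed": ablate(["lat", "long"]),
    "all spatial removed": ablate(["lat", "long", "zipcode"]),
}
for name, r2 in scores.items():
    print(f"{name:20s} R2 = {r2:.3f}")
```

Because the synthetic price depends on latitude by construction, only the runs that drop lat/long lose fit. The same pattern in the real results is what points to latitude.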
             

```python
import pandas as pd

# Viewing linear regression entries only.
# results_list is defined elsewhere (not on this page); record[0] is the model name.
subset = [record for record in results_list if record[0] == 'linear']

pd.DataFrame(subset, columns=['model', 'alpha', 'columns', 'num_columns', 'Sigma Filter', 'num_rows',
                              'r2_train', 'r2_test', 'mae_train', 'mae_test', 'rmse_train', 'rmse_test',
                              'median price', 'n_features', 'coefficients', 'intercept'])
```


| | model | alpha | columns | num_columns | Sigma Filter | num_rows | r2_train | r2_test | mae_train | mae_test | rmse_train | rmse_test | median price | n_features | coefficients | intercept |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | linear | | Index(['bedrooms', 'bathrooms', 'sqft_living',... | 19 | 10 | 21212 | 0.68 | 0.69 | 121436.65 | 116645.7 | 186908.67 | 177626.61 | 450000.0 | 19 | [-28419.90665916544, 28672.55007935164, 71500.... | 532108.016867 |
| 1 | linear | | Index(['bedrooms', 'bathrooms', 'sqft_living',... | 16 | 10 | 21212 | 0.62 | 0.63 | 136504.23 | 130702.67 | 203561.89 | 194590.33 | 450000.0 | 16 | [-32640.477817263152, 31655.405661261873, 6641... | 532108.016867 |
| 2 | linear | | Index(['bedrooms', 'bathrooms', 'sqft_living',... | 17 | 10 | 21212 | 0.62 | 0.63 | 136337.39 | 130427.65 | 203436.66 | 194432.58 | 450000.0 | 17 | [-32053.036512553655, 31798.554796488505, 6631... | 532108.016867 |
| 3 | linear | | Index(['bedrooms', 'bathrooms', 'sqft_living',... | 18 | 10 | 21212 | 0.68 | 0.69 | 121639.87 | 116852.83 | 187993.73 | 178312.86 | 450000.0 | 18 | [-27051.711712076067, 29419.31174237642, 70399... | 532108.016867 |
| 4 | linear | | Index(['sqft_living', 'sqft_lot', 'waterfront'... | 14 | 10 | 21212 | 0.62 | 0.62 | 136952.98 | 131328.78 | 203395.29 | 194863.95 | 450000.0 | 14 | [65858.8276253656, -4048.4966194476, 3.6379788... | 532108.016867 |
| 5 | linear | | Index(['bedrooms', 'bathrooms', 'sqft_living',... | 19 | 6 | 20953 | 0.69 | 0.69 | 115477.12 | 114017.67 | 172099.82 | 167531.37 | 450000.0 | 19 | [-23327.254316288545, 25771.059832827406, 6112... | 525082.656732 |
| 6 | linear | | Index(['bedrooms', 'bathrooms', 'sqft_living',... | 16 | 6 | 20953 | 0.62 | 0.62 | 130806.5 | 129052.86 | 189614.75 | 184161.64 | 450000.0 | 16 | [-27393.72181954755, 28075.092683226612, 56224... | 525082.656732 |
| 7 | linear | | Index(['bedrooms', 'bathrooms', 'sqft_living',... | 17 | 6 | 20953 | 0.62 | 0.62 | 130640.52 | 128986.15 | 189473.87 | 184210.92 | 450000.0 | 17 | [-26810.546210406595, 28297.36400442422, 56122... | 525082.656732 |
| 8 | linear | | Index(['bedrooms', 'bathrooms', 'sqft_living',... | 18 | 6 | 20953 | 0.69 | 0.68 | 115727.87 | 114162.04 | 173074.37 | 168681.96 | 450000.0 | 18 | [-22119.715886961098, 26527.412912878353, 6016... | 525082.656732 |
| 9 | linear | | Index(['sqft_living', 'sqft_lot', 'waterfront'... | 14 | 6 | 20953 | 0.62 | 0.62 | 131357.4 | 129227.7 | 189807.27 | 183966.57 | 450000.0 | 14 | [56311.52975307567, -2933.3914795182645, 1.455... | 525082.656732 |


### Effect of sigma filtering
- 6-sigma filter
    - 1.2% of data removed (relative to the 10-sigma baseline)
    - R-squared generally unaffected
    - No change to median sale price
    - Modest 1.6% improvement to test MAE and 5.5% improvement to test RMSE
- R-squared and median sale price become less stable with additional filtering, but MAE and RMSE continue to improve overall: the training errors fall at every step, while the test errors are noisy between 9- and 7-sigma. At 3-sigma filtering, 12.6% of data is removed.
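For readers unfamiliar with sigma filtering, here is a minimal sketch of the idea. The exact filter used for these results is not shown on this page, so this version makes assumptions: it filters on a single column, uses a symmetric z-score, and runs on synthetic log-normal "prices". The function name `sigma_filter` is hypothetical.

```python
import numpy as np

def sigma_filter(values, n_sigma):
    """Boolean mask keeping entries within n_sigma standard deviations of the mean."""
    values = np.asarray(values, dtype=float)
    z = np.abs(values - values.mean()) / values.std()
    return z <= n_sigma

rng = np.random.default_rng(0)

# Synthetic right-skewed prices (NOT the King County data).
prices = rng.lognormal(mean=13.0, sigma=0.5, size=20_000)

for n_sigma in (10, 6, 3):
    keep = sigma_filter(prices, n_sigma)
    removed_pct = 100 * (1 - keep.mean())
    print(f"{n_sigma}-sigma: removed {removed_pct:.2f}% of rows, "
          f"median of kept rows = {np.median(prices[keep]):,.0f}")
```

A tighter threshold can only remove more rows, which is why the row counts in the results fall steadily from 10-sigma to 3-sigma.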

```python
import pandas as pd

# Viewing ridge regression entries with alpha = 1000 only.
# results_list is defined elsewhere (not on this page); record[0] is the model, record[1] is alpha.
subset = [record for record in results_list if record[0] == 'ridge' and record[1] == 1000]

pd.DataFrame(subset, columns=['model', 'alpha', 'columns', 'num_columns', 'Sigma Filter', 'num_rows',
                              'r2_train', 'r2_test', 'mae_train', 'mae_test', 'rmse_train', 'rmse_test',
                              'median price', 'n_features', 'coefficients', 'intercept'])
```


| | model | alpha | columns | num_columns | Sigma Filter | num_rows | r2_train | r2_test | mae_train | mae_test | rmse_train | rmse_test | median price | n_features | coefficients | intercept |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ridge | 1000 | Index(['sqft_living', 'sqft_lot', 'waterfront'... | 14 | 10 | 21212 | 0.62 | 0.62 | 136411.76 | 131003.71 | 203754.85 | 194944.16 | 450000.0 | 13 | [64374.89058437479, -3690.1929249457694, 0.0, ... | 532108.016867 |
| 1 | ridge | 1000 | Index(['sqft_living', 'sqft_lot', 'waterfront'... | 14 | 9 | 21196 | 0.62 | 0.61 | 135624.65 | 134068.73 | 202332.94 | 196044.81 | 455000.0 | 13 | [63022.41486071822, -2800.787532571816, 0.0, 3... | 531009.868369 |
| 2 | ridge | 1000 | Index(['sqft_living', 'sqft_lot', 'waterfront'... | 14 | 8 | 21175 | 0.62 | 0.61 | 134797.25 | 134201.34 | 199852.08 | 195941.72 | 450000.0 | 13 | [61356.17212374283, -4153.720273098709, 0.0, 3... | 530386.7052 |
| 3 | ridge | 1000 | Index(['sqft_living', 'sqft_lot', 'waterfront'... | 14 | 7 | 21042 | 0.62 | 0.61 | 132454.09 | 138634.2 | 194655.67 | 206186.9 | 450000.0 | 13 | [58755.97608812813, -3145.591059732761, 0.0, 3... | 527315.270476 |
| 4 | ridge | 1000 | Index(['sqft_living', 'sqft_lot', 'waterfront'... | 14 | 6 | 20953 | 0.62 | 0.62 | 131159.4 | 128906.1 | 190153.17 | 184254.13 | 450000.0 | 13 | [55963.425340665875, -3145.1423440861445, 0.0,... | 525082.656732 |
| 5 | ridge | 1000 | Index(['sqft_living', 'sqft_lot', 'waterfront'... | 14 | 5 | 20819 | 0.61 | 0.63 | 127871.34 | 125931.5 | 182585.14 | 174233.97 | 462500.0 | 13 | [52549.65416495284, -2182.52621931349, 0.0, 30... | 518646.621818 |
| 6 | ridge | 1000 | Index(['sqft_living', 'sqft_lot', 'waterfront'... | 14 | 4 | 19613 | 0.61 | 0.59 | 117919.87 | 120750.09 | 164028.7 | 166770.1 | 440000.0 | 13 | [45417.702696091794, -1979.4719531453288, 0.0,... | 500816.363492 |
| 7 | ridge | 1000 | Index(['sqft_living', 'sqft_lot', 'waterfront'... | 14 | 3 | 18539 | 0.57 | 0.56 | 111590.63 | 113496.46 | 150948.82 | 151316.75 | 441875.0 | 13 | [37628.28104432048, -1329.0658119713853, 0.0, ... | 479739.952173 |


### Effect of grade
- Grade has an obvious polynomial relationship with price, so I looked at its effect on the model.
- I ran three tests filtering for lower grades (7 and below), higher grades (8 and above), and median grades (7-8 inclusive); the filter label is stored in the `alpha` column of the table below. Applying a filter does not provide a practical way forward, but homogenizing the data does give us some clues about model performance.
- Grade 8+ shows only a small impact on R-squared (0.58 test vs. 0.62 for all grades) despite being the smallest group, and it performs better for MAE as a percentage of median price. This suggests that the main drivers for these properties are already in the dataset: sqft, waterfront & view, num beds/baths, etc.
- The lower and median groups both take large hits to R-squared (0.31 and 0.40 test), but MAE and RMSE improve as a percentage of median price. I think the hit to R-squared here suggests that the missing factor is not currently in the dataset.
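The "error as a percentage of median price" comparison is simple arithmetic on the test-set columns of the grade table on this page. The sketch below copies `mae_test`, `rmse_test`, and `median price` from that table; the helper name `pct_of_median` is only for illustration.

```python
# (label, mae_test, rmse_test, median price) copied from the grade table on this page
rows = [
    ("All Grades",      129227.70, 183966.57, 450000.0),
    ("Grades 0-7",       97898.78, 127672.14, 350000.0),
    ("Grades 8+",       164723.91, 230338.67, 607500.0),
    ("Grades 7-8 only", 110968.17, 148732.38, 425000.0),
]

def pct_of_median(error, median_price):
    """Express an error metric as a percentage of the group's median sale price."""
    return 100.0 * error / median_price

relative = {
    label: (pct_of_median(mae, median), pct_of_median(rmse, median))
    for label, mae, rmse, median in rows
}

for label, (mae_pct, rmse_pct) in relative.items():
    print(f"{label:16s} MAE = {mae_pct:.1f}% of median, RMSE = {rmse_pct:.1f}% of median")
```

Normalizing by the median matters here because the groups have very different price levels, so raw MAE and RMSE are not directly comparable between them.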

```python
import pandas as pd

# Viewing linear regression entries only.
# results_list is defined elsewhere (not on this page); record[0] is the model name.
subset = [record for record in results_list if record[0] == 'linear']

pd.DataFrame(subset, columns=['model', 'alpha', 'columns', 'num_columns', 'Sigma Filter', 'num_rows',
                              'r2_train', 'r2_test', 'mae_train', 'mae_test', 'rmse_train', 'rmse_test',
                              'median price', 'n_features', 'coefficients', 'intercept'])
```


| | model | alpha | columns | num_columns | Sigma Filter | num_rows | r2_train | r2_test | mae_train | mae_test | rmse_train | rmse_test | median price | n_features | coefficients | intercept |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | linear | All Grades | Index(['sqft_living', 'sqft_lot', 'waterfront'... | 14 | 6 | 20953 | 0.62 | 0.62 | 131357.4 | 129227.7 | 189807.27 | 183966.57 | 450000.0 | 14 | [56311.52975307567, -2933.3914795182645, 1.455... | 525082.656732 |
| 1 | linear | Grades 0-7 | Index(['sqft_living', 'sqft_lot', 'waterfront'... | 14 | 6 | 11006 | 0.36 | 0.31 | 94317.4 | 97898.78 | 122435.55 | 127672.14 | 350000.0 | 14 | [21876.131281606067, 5811.590943381721, -7.275... | 381035.68733 |
| 2 | linear | Grades 8+ | Index(['sqft_living', 'sqft_lot', 'waterfront'... | 14 | 6 | 9947 | 0.56 | 0.58 | 165417.09 | 164723.91 | 234803.38 | 230338.67 | 607500.0 | 14 | [62209.549390388696, -11667.47210163153, -4.36... | 682933.45353 |
| 3 | linear | Grades 7-8 only | Index(['sqft_living', 'sqft_lot', 'waterfront'... | 14 | 6 | 14744 | 0.39 | 0.4 | 110228.17 | 110968.17 | 146431.92 | 148732.38 | 425000.0 | 14 | [28037.585907679026, 5527.718945561952, 1.0913... | 457164.352551 |


### Project Conclusions

My assumption during EDA was that the model performance would be limited by linear regression, but my insights point me to three major conclusions.
1. More data is needed than just the physical attributes of the house. Beyond the easily measured data provided, we need metrics that characterize the desirability of the neighborhood: access to jobs, transportation, schools, recreation & nightlife, shopping, etc. After all, the sale price is only what someone was willing to pay for it.
2. Time- and location-paired data on comparable houses is needed, i.e. the way real estate valuation has traditionally been performed. This would allow us to measure neighborhood desirability and overcome the challenges of evaluating the many soft reasons that go into a house purchase.
3. Moving to a k-means analysis model (along with longer time horizons for the dataset) would allow clustering spatial, time, and physical attributes simultaneously, allowing for better model performance that is more similar to the manual methods historically used.
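As a rough illustration of the third conclusion, the sketch below clusters spatial, time, and physical attributes together with a plain k-means (Lloyd's algorithm) written in NumPy. Everything here is an assumption for the example: the data are synthetic, the two "neighborhoods" are made up, and the helper names `nearest_center` and `kmeans` are hypothetical. Note that k-means only groups rows; turning clusters into price estimates (for example, a per-cluster median) would be a further step not shown here.

```python
import numpy as np

def nearest_center(X, centers):
    """Index of the closest center for every row of X."""
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(X, k, n_iter=50):
    """Plain Lloyd's algorithm with deterministic farthest-point initialization."""
    X = np.asarray(X, dtype=float)
    # Start at the first row, then repeatedly add the row farthest from all chosen centers.
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = nearest_center(X, centers)
        # Recompute each center as the mean of its rows; keep the old center if a cluster is empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return nearest_center(X, centers), centers

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins: two made-up neighborhoods that differ in location and house size.
lat = np.concatenate([rng.normal(47.3, 0.01, n), rng.normal(47.7, 0.01, n)])
long = np.concatenate([rng.normal(-122.3, 0.01, n), rng.normal(-122.0, 0.01, n)])
sqft = np.concatenate([rng.normal(1500, 100, n), rng.normal(3000, 100, n)])
sale_year = rng.uniform(2014.0, 2015.5, 2 * n)   # sale timing shared by both groups

X = np.column_stack([lat, long, sqft, sale_year])
# Standardize so that sqft (thousands) does not swamp lat/long (fractions of a degree).
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

labels, centers = kmeans(X_std, k=2)
print("cluster sizes:", np.bincount(labels))
```

Standardizing the columns first is the key design choice: without it, distance would be dominated by whichever attribute has the largest numeric range.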
