> Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

> With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

In [1]:
%pwd

'/home/lao/notebook/HousePrices'

In [2]:
path = '/home/lao/notebook/HousePrices'

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [4]:
housing_data = pd.read_csv(path + '/train.csv')

In [5]:
housing_data.head()

,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [6]:
housing_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n

In [7]:
housing_data.shape

(1460, 81)

In [8]:
housing_data.describe()

,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
count,1460.0,1460.0,1201.0,1460.0,1460.0,1460.0,1460.0,1460.0,1452.0,1460.0,...,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,730.5,56.89726,70.049958,10516.828082,6.099315,5.575342,1971.267808,1984.865753,103.685262,443.639726,...,94.244521,46.660274,21.95411,3.409589,15.060959,2.758904,43.489041,6.321918,2007.815753,180921.19589
std,421.610009,42.300571,24.284752,9981.264932,1.382997,1.112799,30.202904,20.645407,181.066207,456.098091,...,125.338794,66.256028,61.119149,29.317331,55.757415,40.177307,496.123024,2.703626,1.328095,79442.502883
min,1.0,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,34900.0
25%,365.75,20.0,59.0,7553.5,5.0,5.0,1954.0,1967.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0,129975.0
50%,730.5,50.0,69.0,9478.5,6.0,5.0,1973.0,1994.0,0.0,383.5,...,0.0,25.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,163000.0
75%,1095.25,70.0,80.0,11601.5,7.0,6.0,2000.0,2004.0,166.0,712.25,...,168.0,68.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,214000.0
max,1460.0,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,...,857.0,547.0,552.0,508.0,480.0,738.0,15500.0,12.0,2010.0,755000.0


In [9]:
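# describe() keeps only the numeric columns; [2:] skips Id and MSSubClass
# (an identifier and a numerically coded category).
numeric_col = housing_data.describe().columns[2:]
numeric_col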

Index(['LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt',
       'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF',
       'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea',
       'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr',
       'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt',
       'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal',
       'MoSold', 'YrSold', 'SalePrice'],
      dtype='object')
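Reading the column names off describe() works, but it computes every summary statistic just to get a list of names, and the [2:] slice depends on column order. A minimal sketch of a more direct route with select_dtypes, on an invented toy frame (the `toy` values and the `numeric_cols` name are illustrative assumptions, not taken from train.csv):

```python
import pandas as pd

# Toy frame standing in for housing_data (values invented for illustration).
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "MSSubClass": [60, 20, 60],
    "MSZoning": ["RL", "RL", "RM"],
    "LotArea": [8450, 9600, 11250],
    "SalePrice": [208500, 181500, 223500],
})

# Keep numeric dtypes only, then drop the identifier-like columns by name
# instead of by position.
numeric_cols = toy.select_dtypes(include="number").columns.drop(["Id", "MSSubClass"])
print(list(numeric_cols))
```

Dropping by name keeps the selection stable if the column order of the CSV ever changes.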

In [10]:
housing_data_num = housing_data[numeric_col]

In [11]:
numeric_corr = housing_data_num.corr()['SalePrice'].sort_values(ascending = False)
numeric_corr.head(20)

SalePrice       1.000000
OverallQual     0.790982
GrLivArea       0.708624
GarageCars      0.640409
GarageArea      0.623431
TotalBsmtSF     0.613581
1stFlrSF        0.605852
FullBath        0.560664
TotRmsAbvGrd    0.533723
YearBuilt       0.522897
YearRemodAdd    0.507101
GarageYrBlt     0.486362
MasVnrArea      0.477493
Fireplaces      0.466929
BsmtFinSF1      0.386420
LotFrontage     0.351799
WoodDeckSF      0.324413
2ndFlrSF        0.319334
OpenPorchSF     0.315856
HalfBath        0.284108
Name: SalePrice, dtype: float64

In [12]:
numeric_corr

SalePrice        1.000000
OverallQual      0.790982
GrLivArea        0.708624
GarageCars       0.640409
GarageArea       0.623431
TotalBsmtSF      0.613581
1stFlrSF         0.605852
FullBath         0.560664
TotRmsAbvGrd     0.533723
YearBuilt        0.522897
YearRemodAdd     0.507101
GarageYrBlt      0.486362
MasVnrArea       0.477493
Fireplaces       0.466929
BsmtFinSF1       0.386420
LotFrontage      0.351799
WoodDeckSF       0.324413
2ndFlrSF         0.319334
OpenPorchSF      0.315856
HalfBath         0.284108
LotArea          0.263843
BsmtFullBath     0.227122
BsmtUnfSF        0.214479
BedroomAbvGr     0.168213
ScreenPorch      0.111447
PoolArea         0.092404
MoSold           0.046432
3SsnPorch        0.044584
BsmtFinSF2      -0.011378
BsmtHalfBath    -0.016844
MiscVal         -0.021190
LowQualFinSF    -0.025606
YrSold          -0.028923
OverallCond     -0.077856
EnclosedPorch   -0.128578
KitchenAbvGr    -0.135907
Name: SalePrice, dtype: float64
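DataFrame.corr() defaults to Pearson correlation, which only measures linear association, so a feature with a strong but curved relationship to the target can rank lower than it deserves. The sketch below illustrates this on synthetic data (the `toy` frame and its columns are invented for the illustration) by comparing Pearson with rank-based Spearman correlation:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=500)

# Synthetic columns: "curved" is a monotone but strongly non-linear
# function of the target, "noise" is unrelated to it.
toy = pd.DataFrame({
    "target": x,
    "curved": np.exp(x),
    "noise": rng.normal(size=500),
})

pearson = toy.corr(method="pearson")["target"]
spearman = toy.corr(method="spearman")["target"]

# Side-by-side view, sorted the same way as numeric_corr above.
comparison = pd.DataFrame({"pearson": pearson, "spearman": spearman})
print(comparison.sort_values("pearson", ascending=False))
```

On the housing data the same comparison would be `housing_data_num.corr(method="spearman")['SalePrice']`; whether it changes the ranking there is not something this sketch establishes.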

In [13]:
%matplotlib inline

In [14]:
attributes = numeric_corr.index[1:]

In [15]:
housing_data_num['OverallQual'].value_counts()

5     397
6     374
7     319
8     168
4     116
9      43
3      20
10     18
2       3
1       2
Name: OverallQual, dtype: int64

In [16]:
housing = housing_data_num.drop('SalePrice', axis = 1)
housing_labels = housing_data_num['SalePrice'].copy()

In [29]:
housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 35 columns):
LotFrontage      1460 non-null float64
LotArea          1460 non-null int64
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
MasVnrArea       1460 non-null float64
BsmtFinSF1       1460 non-null int64
BsmtFinSF2       1460 non-null int64
BsmtUnfSF        1460 non-null int64
TotalBsmtSF      1460 non-null int64
1stFlrSF         1460 non-null int64
2ndFlrSF         1460 non-null int64
LowQualFinSF     1460 non-null int64
GrLivArea        1460 non-null int64
BsmtFullBath     1460 non-null int64
BsmtHalfBath     1460 non-null int64
FullBath         1460 non-null int64
HalfBath         1460 non-null int64
BedroomAbvGr     1460 non-null int64
KitchenAbvGr     1460 non-null int64
TotRmsAbvGrd     1460 non-null int64
Fireplaces       1460 non-null int64
GarageYrBlt      1460 non-null float64

In [17]:
housing_labels.isna().sum()

0

In [18]:
housing.isna().sum()

LotFrontage      259
LotArea            0
OverallQual        0
OverallCond        0
YearBuilt          0
YearRemodAdd       0
MasVnrArea         8
BsmtFinSF1         0
BsmtFinSF2         0
BsmtUnfSF          0
TotalBsmtSF        0
1stFlrSF           0
2ndFlrSF           0
LowQualFinSF       0
GrLivArea          0
BsmtFullBath       0
BsmtHalfBath       0
FullBath           0
HalfBath           0
BedroomAbvGr       0
KitchenAbvGr       0
TotRmsAbvGrd       0
Fireplaces         0
GarageYrBlt       81
GarageCars         0
GarageArea         0
WoodDeckSF         0
OpenPorchSF        0
EnclosedPorch      0
3SsnPorch          0
ScreenPorch        0
PoolArea           0
MiscVal            0
MoSold             0
YrSold             0
dtype: int64

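# Alternatives that were considered but not run:
# housing = housing.drop(['LotFrontage', 'GarageYrBlt'], axis=1)
#
# sklearn.preprocessing.Imputer was removed from scikit-learn; SimpleImputer replaces it.
# fit_transform returns a NumPy array, so it is wrapped back into a DataFrame here.
# from sklearn.impute import SimpleImputer
# imputer = SimpleImputer(strategy="median")
# housing = pd.DataFrame(imputer.fit_transform(housing), columns=housing.columns)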

In [19]:
lf_median = housing_data['LotFrontage'].median()
gyb_median = housing_data['GarageYrBlt'].median()

# Median for the two sparse columns; MasVnrArea gaps are filled with 0.
values = {'LotFrontage': lf_median, 'GarageYrBlt': gyb_median, 'MasVnrArea': 0}
housing = housing.fillna(value=values)
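The median fill above can also be written with scikit-learn's SimpleImputer, which remembers the fitted medians so the same values can later be applied to test.csv. A minimal sketch on an invented four-row frame (the `toy` values are made up; only the column names echo the dataset):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with gaps in the same columns that have gaps in the housing data.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "GarageYrBlt": [2003.0, 1976.0, np.nan, 1998.0],
    "LotArea": [8450, 9600, 11250, 9550],
})

imputer = SimpleImputer(strategy="median")

# fit_transform returns a NumPy array, so wrap it to keep the column names.
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns, index=toy.index)

# The learned per-column medians, reusable via imputer.transform(new_data).
print(dict(zip(toy.columns, imputer.statistics_)))
print(filled)
```

Unlike the dictionary passed to fillna, this fills every column with its median, so a column such as MasVnrArea would get its median rather than 0.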

In [20]:
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor(random_state = 42)

In [21]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(tree_reg, housing, housing_labels, scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)

In [22]:
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

In [23]:
display_scores(tree_rmse_scores)

Scores: [35403.47927237 37783.10895041 32226.82654791 45996.39603981
 38848.70758918 31078.94618339 36423.14355909 33700.31229826
 58169.74353804 37894.90991391]
Mean: 38752.557389236
Standard deviation: 7581.7022897053785


In [24]:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()

scores = cross_val_score(lin_reg, housing, housing_labels, scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-scores)

In [25]:
display_scores(lin_rmse_scores)

Scores: [28383.4585999  30021.2348714  26892.61567888 42024.9087222
 40612.6478668  31406.91012591 30071.47596465 30818.10672053
 65853.24520307 29472.21573748]
Mean: 35555.6819490814
Standard deviation: 11176.885751437801


In [22]:
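from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

from sklearn.preprocessing import MinMaxScaler

# Note: this rebinds `scaler`, so only the MinMaxScaler is used from here on.
scaler = MinMaxScaler()

# The cell that builds `housing_scaled` is not shown; this line is an assumed reconstruction.
housing_scaled = scaler.fit_transform(housing)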
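Fitting an imputer or scaler on the full training set before cross-validation lets each validation fold influence the preprocessing statistics. One common way to avoid that is to put the preprocessing steps and the model in a single Pipeline, so they are refit on the training part of every fold. A minimal sketch on synthetic data with artificially injected gaps (the data, the 5% missing rate, and the choice of LinearRegression are assumptions made for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features with roughly 5% of the entries knocked out.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

# Imputation and scaling are fit inside each training fold only.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LinearRegression(),
)

scores = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=10)
rmse = np.sqrt(-scores)
print("RMSE mean:", rmse.mean())
```

The same pipeline object can be passed to GridSearchCV or RandomizedSearchCV in place of a bare estimator. Tree-based models such as random forests are largely insensitive to feature scaling, so the scaler matters mainly for the linear model.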

In [24]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor(random_state = 42)

In [25]:
param_grid = [
    {'n_estimators': [3, 6, 9, 10, 15, 20, 30, 50], 'max_features': [2, 4, 6, 8, 10, 12, 14, 15]},
    {'bootstrap': [False], 'n_estimators': [3, 6, 9, 10, 15, 20, 30, 50], 'max_features': [2, 4, 6, 8, 10, 12, 14, 15]},
]

grid_search = GridSearchCV(forest_reg, param_grid, cv = 10, scoring = 'neg_mean_squared_error')

grid_search.fit(housing_scaled, housing_labels)

GridSearchCV(cv=10, error_score='raise-deprecating',
       estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
           oob_score=False, random_state=42, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid=[{'n_estimators': [3, 6, 9, 10, 15, 20, 30, 50], 'max_features': [2, 4, 6, 8, 10, 12, 14, 15]}, {'bootstrap': [False], 'n_estimators': [3, 6, 9, 10, 15, 20, 30, 50], 'max_features': [2, 4, 6, 8, 10, 12, 14, 15]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='neg_mean_squared_error', verbose=0)

In [26]:
grid_search.best_params_

{'bootstrap': False, 'max_features': 10, 'n_estimators': 30}

In [27]:
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

40313.557183206976 {'max_features': 2, 'n_estimators': 3}
36018.4824532102 {'max_features': 2, 'n_estimators': 6}
34061.74865299846 {'max_features': 2, 'n_estimators': 9}
33760.90121587345 {'max_features': 2, 'n_estimators': 10}
32558.277302911647 {'max_features': 2, 'n_estimators': 15}
32664.695057385496 {'max_features': 2, 'n_estimators': 20}
32373.138683000994 {'max_features': 2, 'n_estimators': 30}
31624.932822914943 {'max_features': 2, 'n_estimators': 50}
38716.54995085181 {'max_features': 4, 'n_estimators': 3}
36154.83997449991 {'max_features': 4, 'n_estimators': 6}
34160.10290212184 {'max_features': 4, 'n_estimators': 9}
33607.57182940592 {'max_features': 4, 'n_estimators': 10}
32405.532934456474 {'max_features': 4, 'n_estimators': 15}
31887.953867361244 {'max_features': 4, 'n_estimators': 20}
31318.687979901944 {'max_features': 4, 'n_estimators': 30}
30313.938391538977 {'max_features': 4, 'n_estimators': 50}
37789.03252648005 {'max_features': 6, 'n_estimators': 3}
34888.5558695
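With 128 parameter combinations, the printed list is hard to scan. A sketch of an alternative view: load `cv_results_` into a DataFrame and sort it, shown here on a deliberately tiny grid and synthetic data so it runs quickly (the grid values, the data, and the `table` name are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=120, n_features=6, noise=5.0, random_state=0)

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    {"n_estimators": [3, 10], "max_features": [2, 4]},
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)

# cv_results_ is a dict of equal-length arrays, so it maps directly to a DataFrame.
results = pd.DataFrame(search.cv_results_)
results["rmse"] = np.sqrt(-results["mean_test_score"])

# Best (lowest RMSE) combinations first.
table = results[
    ["param_max_features", "param_n_estimators", "rmse", "rank_test_score"]
].sort_values("rmse")
print(table.to_string(index=False))
```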

In [30]:
param_grid = {'n_estimators' : list(range(3, 60)), 'max_features': list(range(3, 30)), 'bootstrap': [False]}

grid_search = RandomizedSearchCV(forest_reg, param_grid, cv = 10, n_iter = 100, scoring = 'neg_mean_squared_error')

grid_search.fit(housing_scaled, housing_labels)

RandomizedSearchCV(cv=10, error_score='raise-deprecating',
          estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
           oob_score=False, random_state=42, verbose=0, warm_start=False),
          fit_params=None, iid='warn', n_iter=100, n_jobs=None,
          param_distributions={'n_estimators': [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59], 'max_features': [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29], 'bootstrap': [False]},
          pre_dispatch='2*n_jobs', random_state=None, re
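The repr above shows `random_state=None` for the search itself, so a rerun samples different candidates and may report different best parameters. A minimal sketch of a reproducible variant, using scipy.stats.randint distributions instead of explicit lists (the synthetic data, the small ranges, and the `run_search` helper are hypothetical and chosen only to keep the example fast):

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

X, y = make_regression(n_samples=120, n_features=8, noise=5.0, random_state=0)

param_distributions = {
    "n_estimators": randint(3, 20),  # integers from 3 to 19
    "max_features": randint(2, 9),   # integers from 2 to 8
    "bootstrap": [False],
}


def run_search(seed):
    """Run one randomized search whose candidate sampling is fixed by `seed`."""
    search = RandomizedSearchCV(
        RandomForestRegressor(random_state=42),
        param_distributions,
        n_iter=5,
        cv=3,
        scoring="neg_mean_squared_error",
        random_state=seed,
    )
    return search.fit(X, y)


first = run_search(0)
second = run_search(0)

# Same seed, same sampled candidates, same winner.
print(first.best_params_)
print(first.best_params_ == second.best_params_)
```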

In [33]:
attributes = numeric_col[:-1]
feature_importances = grid_search.best_estimator_.feature_importances_
sorted(zip(feature_importances, attributes), reverse=True)

[(0.2389452976001587, 'OverallQual'),
 (0.14506999585425331, 'GrLivArea'),
 (0.08915598864651797, 'GarageCars'),
 (0.07460308427195834, 'TotalBsmtSF'),
 (0.06664325163625943, 'YearBuilt'),
 (0.06146688024881677, 'GarageArea'),
 (0.05866814933839687, '1stFlrSF'),
 (0.03674182623730724, '2ndFlrSF'),
 (0.03269795281726134, 'BsmtFinSF1'),
 (0.023816667985711738, 'FullBath'),
 (0.02134029411912672, 'YearRemodAdd'),
 (0.018289435317097192, 'TotRmsAbvGrd'),
 (0.017453118427017494, 'Fireplaces'),
 (0.016560202269476004, 'LotArea'),
 (0.01474179208715508, 'GarageYrBlt'),
 (0.010721970041764013, 'MasVnrArea'),
 (0.009710178478891204, 'OpenPorchSF'),
 (0.009526241744897189, 'LotFrontage'),
 (0.008402801744470115, 'BsmtUnfSF'),
 (0.007114856124653201, 'OverallCond'),
 (0.007030559114000595, 'BedroomAbvGr'),
 (0.006130742361140982, 'WoodDeckSF'),
 (0.004874896900307765, 'HalfBath'),
 (0.00427374404301598, 'MoSold'),
 (0.0030307361664192556, 'BsmtFullBath'),
 (0.0026780673598745144, 'KitchenAbvGr'),

In [31]:
grid_search.best_params_

{'bootstrap': False, 'max_features': 9, 'n_estimators': 58}

In [32]:
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

29681.599647596347 {'n_estimators': 22, 'max_features': 11, 'bootstrap': False}
29564.652990853432 {'n_estimators': 55, 'max_features': 16, 'bootstrap': False}
29815.19919482037 {'n_estimators': 45, 'max_features': 3, 'bootstrap': False}
31384.16439987276 {'n_estimators': 58, 'max_features': 28, 'bootstrap': False}
30274.35775698467 {'n_estimators': 19, 'max_features': 23, 'bootstrap': False}
32536.56709975148 {'n_estimators': 16, 'max_features': 29, 'bootstrap': False}
29950.51411106374 {'n_estimators': 27, 'max_features': 21, 'bootstrap': False}
29528.572589134186 {'n_estimators': 50, 'max_features': 21, 'bootstrap': False}
29936.477505374514 {'n_estimators': 22, 'max_features': 12, 'bootstrap': False}
29408.70412636958 {'n_estimators': 42, 'max_features': 5, 'bootstrap': False}
32106.53358428233 {'n_estimators': 6, 'max_features': 5, 'bootstrap': False}
30728.06797093393 {'n_estimators': 10, 'max_features': 27, 'bootstrap': False}
29110.296486671887 {'n_estimators': 28, 'max_feature