> Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

> With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
path = %pwd
path

'/home/lao/notebook/HousePrices'

In [3]:
housing_data = pd.read_csv(path + '/train.csv')

In [4]:
housing_data.head()

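| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 12 | 2008 | WD | Normal | 250000 |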


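Evaluation: root mean squared error (RMSE) between the logarithm of the predicted price and the logarithm of the observed sale price. The target is therefore log-transformed before training.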
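
As a rough illustration of the evaluation metric, the sketch below computes the RMSE on log-transformed prices with NumPy. The function name `rmsle` is made up for this example, and it uses `log1p` to match the transform applied in the next cell; for prices of this size the difference from a plain logarithm is negligible.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared error between log1p-transformed prices."""
    log_true = np.log1p(np.asarray(y_true, dtype=float))
    log_pred = np.log1p(np.asarray(y_pred, dtype=float))
    return float(np.sqrt(np.mean((log_true - log_pred) ** 2)))

# Three sale prices taken from the head() output above
actual = [208500, 181500, 223500]

perfect = rmsle(actual, actual)                        # identical predictions
ten_pct_high = rmsle(actual, [1.1 * p for p in actual])  # every prediction 10% too high

print(perfect)
print(ten_pct_high)
```

Because the error is measured on logs, overpredicting every house by 10% gives roughly the same score (about log 1.1) whether the house is cheap or expensive.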

In [5]:
housing_data['SalePrice'] = np.log1p(housing_data['SalePrice'])

In [6]:
housing_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n

In [7]:
housing_data.shape

(1460, 81)

In [8]:
housing_data.describe()

| | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | ... | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1460.0 | 1460.0 | 1201.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1452.0 | 1460.0 | ... | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 |
| mean | 730.5 | 56.89726 | 70.049958 | 10516.828082 | 6.099315 | 5.575342 | 1971.267808 | 1984.865753 | 103.685262 | 443.639726 | ... | 94.244521 | 46.660274 | 21.95411 | 3.409589 | 15.060959 | 2.758904 | 43.489041 | 6.321918 | 2007.815753 | 12.024057 |
| std | 421.610009 | 42.300571 | 24.284752 | 9981.264932 | 1.382997 | 1.112799 | 30.202904 | 20.645407 | 181.066207 | 456.098091 | ... | 125.338794 | 66.256028 | 61.119149 | 29.317331 | 55.757415 | 40.177307 | 496.123024 | 2.703626 | 1.328095 | 0.399449 |
| min | 1.0 | 20.0 | 21.0 | 1300.0 | 1.0 | 1.0 | 1872.0 | 1950.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2006.0 | 10.460271 |
| 25% | 365.75 | 20.0 | 59.0 | 7553.5 | 5.0 | 5.0 | 1954.0 | 1967.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | 2007.0 | 11.775105 |
| 50% | 730.5 | 50.0 | 69.0 | 9478.5 | 6.0 | 5.0 | 1973.0 | 1994.0 | 0.0 | 383.5 | ... | 0.0 | 25.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 2008.0 | 12.001512 |
| 75% | 1095.25 | 70.0 | 80.0 | 11601.5 | 7.0 | 6.0 | 2000.0 | 2004.0 | 166.0 | 712.25 | ... | 168.0 | 68.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.0 | 2009.0 | 12.273736 |
| max | 1460.0 | 190.0 | 313.0 | 215245.0 | 10.0 | 9.0 | 2010.0 | 2010.0 | 1600.0 | 5644.0 | ... | 857.0 | 547.0 | 552.0 | 508.0 | 480.0 | 738.0 | 15500.0 | 12.0 | 2010.0 | 13.534474 |


In [9]:
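# describe() keeps only numeric columns; drop Id and MSSubClass (first two)
# and YrSold and SalePrice (last two) to get the feature list
numeric_col = housing_data.describe().columns[2:-2]
numeric_col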

Index(['LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt',
       'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF',
       'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea',
       'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr',
       'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt',
       'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal',
       'MoSold'],
      dtype='object')

In [10]:
from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Select the given DataFrame columns and return them as a NumPy array."""
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        # Nothing to learn; the column list is fixed at construction time
        return self
    def transform(self, X):
        return X[self.attribute_names].values
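
To make the selector's output concrete, here is a minimal sketch of the same indexing expression applied to a toy DataFrame. The frame `toy` and its values are invented for illustration; only the `X[names].values` expression comes from the class above.

```python
import numpy as np
import pandas as pd

# Toy frame with one non-numeric column; values are illustrative only
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "MSZoning": ["RL", "RL", "RM"],
    "OverallQual": [7, 6, 7],
})

# Same expression that DataFrameSelector.transform() evaluates
selected = toy[["LotArea", "OverallQual"]].values

print(type(selected).__name__)
print(selected.shape)
```

The result is a plain NumPy array, so column names are lost from this point on and later pipeline steps work by position only.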

In [11]:
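from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# Imputer was removed in scikit-learn 0.22; SimpleImputer is its replacement.
from sklearn.impute import SimpleImputer

numeric_attributes = list(numeric_col)

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(numeric_attributes)),  # pick the numeric columns
    ('imputer', SimpleImputer(strategy="median")),        # fill missing values with the column median
    ('std_scaler', StandardScaler())                      # zero mean, unit variance
])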
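
The pipeline's two numeric steps can be sketched by hand with NumPy to show what fitting learns and what transforming applies. This is an approximation for illustration, not the scikit-learn implementation; the arrays and the helper `transform` are made up.

```python
import numpy as np

# Toy training rows (LotFrontage, LotArea) with one missing value
train = np.array([[65.0, 8450.0],
                  [np.nan, 9600.0],
                  [68.0, 11250.0],
                  [60.0, 9550.0]])
# Toy unseen row, also with a missing value
unseen = np.array([[np.nan, 10000.0]])

# "fit": learn per-column statistics from the training rows only
medians = np.nanmedian(train, axis=0)
train_filled = np.where(np.isnan(train), medians, train)
means = train_filled.mean(axis=0)
stds = train_filled.std(axis=0)  # population std (ddof=0), as StandardScaler uses

def transform(X):
    """Impute with the training medians, then scale with the training mean/std."""
    filled = np.where(np.isnan(X), medians, X)
    return (filled - means) / stds

train_scaled = transform(train)
unseen_scaled = transform(unseen)

print(medians)
print(unseen_scaled)
```

The unseen row is imputed and scaled with statistics from the training rows, which is why the notebook calls `fit` once on the training data and only `transform` on the test data.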



In [12]:
num_pipeline.fit(housing_data)

Pipeline(memory=None,
     steps=[('selector', DataFrameSelector(attribute_names=['LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'Full...egy='median', verbose=0)), ('std_scaler', StandardScaler(copy=True, with_mean=True, with_std=True))])

In [13]:
housing_prepared = num_pipeline.transform(housing_data)

In [14]:
housing_labels = housing_data['SalePrice'].copy()

In [21]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor(random_state=42, n_jobs=-1)

In [16]:
param_grid = [
    {'n_estimators': [3, 5, 7, 8, 10, 20, 30, 50], 'max_features': [2, 4, 6, 8, 10, 13, 15]},
    {'bootstrap': [False], 'n_estimators': [3, 5, 7, 8, 10, 20, 30, 50], 'max_features': [2, 4, 6, 8, 10, 13, 15]},
]

grid_search = GridSearchCV(forest_reg, param_grid, cv=10, scoring='neg_mean_squared_error')

grid_search.fit(housing_prepared, housing_labels)

GridSearchCV(cv=10, error_score='raise-deprecating',
       estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
           oob_score=False, random_state=42, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid=[{'n_estimators': [3, 5, 7, 8, 10, 20, 30, 50], 'max_features': [2, 4, 6, 8, 10, 13, 15]}, {'bootstrap': [False], 'n_estimators': [3, 5, 7, 8, 10, 20, 30, 50], 'max_features': [2, 4, 6, 8, 10, 13, 15]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='neg_mean_squared_error', verbose=0)

In [17]:
grid_search.best_params_

{'bootstrap': False, 'max_features': 13, 'n_estimators': 50}

In [18]:
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

0.19000979422989553 {'max_features': 2, 'n_estimators': 3}
0.1765025204296145 {'max_features': 2, 'n_estimators': 5}
0.17246758317600003 {'max_features': 2, 'n_estimators': 7}
0.1698864050945849 {'max_features': 2, 'n_estimators': 8}
0.16598957460063934 {'max_features': 2, 'n_estimators': 10}
0.15637604815286194 {'max_features': 2, 'n_estimators': 20}
0.15553872826357082 {'max_features': 2, 'n_estimators': 30}
0.15299135084706297 {'max_features': 2, 'n_estimators': 50}
0.17588273570479354 {'max_features': 4, 'n_estimators': 3}
0.1655794124684055 {'max_features': 4, 'n_estimators': 5}
0.15824108062419842 {'max_features': 4, 'n_estimators': 7}
0.1564536822992599 {'max_features': 4, 'n_estimators': 8}
0.15541419131128284 {'max_features': 4, 'n_estimators': 10}
0.1494497164909639 {'max_features': 4, 'n_estimators': 20}
0.14724182069813346 {'max_features': 4, 'n_estimators': 30}
0.14542213411189425 {'max_features': 4, 'n_estimators': 50}
0.17766541046018466 {'max_features': 6, 'n_estimators
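
The loop above takes `np.sqrt(-mean_score)` because the `neg_mean_squared_error` scorer reports the negated MSE, so that larger is better. The sketch below shows the same conversion on a hypothetical stand-in for `cv_results_` and sorts the candidates by RMSE; the scores in `toy_cvres` are made up.

```python
import numpy as np

# Hypothetical stand-in for grid_search.cv_results_; scores are illustrative
toy_cvres = {
    "mean_test_score": np.array([-0.0361, -0.0234, -0.0211]),
    "params": [
        {"max_features": 2, "n_estimators": 3},
        {"max_features": 2, "n_estimators": 50},
        {"max_features": 4, "n_estimators": 50},
    ],
}

# Negated MSE -> RMSE (lower is better)
rmse = np.sqrt(-toy_cvres["mean_test_score"])

# Rank candidates from best (lowest RMSE) to worst
order = np.argsort(rmse)
ranked = [(float(rmse[i]), toy_cvres["params"][i]) for i in order]

for score, params in ranked:
    print(round(score, 4), params)
```

Sorting makes the best candidates easy to spot compared with reading the scores in grid order.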

In [20]:
# The names are reused from the grid search, but this cell samples 50 random combinations.
param_grid = {'n_estimators': list(range(3, 50)), 'max_features': list(range(2, 30)), 'bootstrap': [True, False]}

grid_search = RandomizedSearchCV(forest_reg, param_grid, cv=10, n_iter=50, scoring='neg_mean_squared_error')

grid_search.fit(housing_prepared, housing_labels)

RandomizedSearchCV(cv=10, error_score='raise-deprecating',
          estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
           oob_score=False, random_state=42, verbose=0, warm_start=False),
          fit_params=None, iid='warn', n_iter=50, n_jobs=None,
          param_distributions={'n_estimators': [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], 'max_features': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29], 'bootstrap': [True, False]},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train
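
To put the two searches side by side, the sketch below counts the candidate combinations and cross-validation fits for each, using only the standard library. The sampling with `random.sample` is a stand-in for what the randomized search does and is not scikit-learn's own sampler; the seed is arbitrary.

```python
from itertools import product
import random

# Value lists from the randomized-search cell above
n_estimators = list(range(3, 50))
max_features = list(range(2, 30))
bootstrap = [True, False]

# Every combination the randomized search could draw from
all_combos = list(product(n_estimators, max_features, bootstrap))

# Draw 50 distinct combinations, mirroring n_iter=50 (arbitrary seed)
rng = random.Random(0)
sampled = rng.sample(all_combos, 50)

cv_folds = 10
# Earlier grid search: two sub-grids of 8 x 7 combinations each
grid_candidates = 8 * 7 + 8 * 7
grid_fits = grid_candidates * cv_folds
random_fits = len(sampled) * cv_folds

print(len(all_combos), grid_candidates, grid_fits, random_fits)
```

The randomized search covers a much larger space (2632 combinations against 112) with fewer than half the model fits, at the cost of not guaranteeing that any particular combination is tried.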

In [22]:
grid_search.best_params_

{'bootstrap': False, 'max_features': 18, 'n_estimators': 38}

In [23]:
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

0.142254044272629 {'n_estimators': 47, 'max_features': 5, 'bootstrap': False}
0.14806821536434012 {'n_estimators': 23, 'max_features': 4, 'bootstrap': True}
0.14752953438574046 {'n_estimators': 28, 'max_features': 4, 'bootstrap': True}
0.1413589055728788 {'n_estimators': 47, 'max_features': 13, 'bootstrap': False}
0.16110287382871408 {'n_estimators': 4, 'max_features': 16, 'bootstrap': False}
0.15052248399682475 {'n_estimators': 9, 'max_features': 8, 'bootstrap': False}
0.14598578064446863 {'n_estimators': 40, 'max_features': 4, 'bootstrap': True}
0.1424865130033904 {'n_estimators': 34, 'max_features': 14, 'bootstrap': False}
0.14335112816045176 {'n_estimators': 49, 'max_features': 4, 'bootstrap': False}
0.14326594471594403 {'n_estimators': 48, 'max_features': 15, 'bootstrap': False}
0.14814212240480126 {'n_estimators': 13, 'max_features': 24, 'bootstrap': False}
0.14777837501696028 {'n_estimators': 18, 'max_features': 24, 'bootstrap': False}
0.15203955539836836 {'n_estimators': 9, 'ma

In [24]:
test_data = pd.read_csv(path + '/test.csv')

In [25]:
test_id = test_data["Id"]
test_prepared = num_pipeline.transform(test_data)

In [26]:
# Best parameters reported by the randomized search above
forest_reg = RandomForestRegressor(random_state=42, bootstrap=False, max_features=18, n_estimators=38)

In [27]:
forest_reg.fit(housing_prepared, housing_labels)

RandomForestRegressor(bootstrap=False, criterion='mse', max_depth=None,
           max_features=18, max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=38, n_jobs=None, oob_score=False, random_state=42,
           verbose=0, warm_start=False)

In [28]:
test_label = forest_reg.predict(test_prepared)

In [30]:
test_label

array([11.73544981, 11.95878462, 12.10379639, ..., 11.99077913,
       11.58336073, 12.3980137 ])

In [32]:
test_sales_price = [np.expm1(x) for x in test_label]
test_sales_price

[124921.62289748511,
 156182.1472320573,
 180555.0230774821,
 184367.948205152,
 192109.5882759652,
 181995.23913029654,
 171665.58154319655,
 178067.81088496564,
 186030.72870148823,
 113603.73300635247,
 185167.9932050444,
 94548.62241973929,
 91243.45604789251,
 150542.88417189245,
 128076.9270832469,
 366726.4044946998,
 253226.89616700984,
 298793.1851237625,
 198298.3182298668,
 473299.4945252313,
 309509.22076569893,
 202447.95801070082,
 177129.98606032605,
 167658.98524536027,
 168819.7949631576,
 209812.60000001275,
 327141.57476377365,
 224192.8394071952,
 207226.78182721196,
 201850.63154642665,
 187923.5019195654,
 106424.52850262061,
 170423.67488669342,
 289402.9366650742,
 307722.2899825559,
 222015.0312579642,
 183018.91407094325,
 152641.53054518503,
 150934.7460990484,
 156869.05363622794,
 175209.8845014515,
 148581.52243107645,
 287882.8245674818,
 226333.737534349,
 212904.1266155598,
 181642.0858978538,
 230848.34502248093,
 195744.12536306318,
 162872.7049595748
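
The back-transform relies on `np.expm1` being the inverse of the `np.log1p` applied to `SalePrice` before training. A small round-trip check, using three prices from the training data, also shows that `np.expm1` can be applied to a whole array at once instead of element by element:

```python
import numpy as np

# Three sale prices from the training data shown earlier
prices = np.array([208500.0, 181500.0, 223500.0])

log_prices = np.log1p(prices)     # transform used on the training target
recovered = np.expm1(log_prices)  # vectorized back-transform to dollars

# Element-by-element version, as in the cell above
recovered_list = [np.expm1(x) for x in log_prices]

print(recovered)
```

Both forms give the same numbers; the vectorized call returns a NumPy array rather than a Python list.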

In [33]:
data = {'Id': test_id, 'SalePrice': test_sales_price}

frame = pd.DataFrame(data)

In [34]:
frame.head()

| | Id | SalePrice |
|---|---|---|
| 0 | 1461 | 124921.622897 |
| 1 | 1462 | 156182.147232 |
| 2 | 1463 | 180555.023077 |
| 3 | 1464 | 184367.948205 |
| 4 | 1465 | 192109.588276 |


In [35]:
frame.to_csv('submission3.csv', index = False)

In [36]:
frame.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 2 columns):
Id           1459 non-null int64
SalePrice    1459 non-null float64
dtypes: float64(1), int64(1)
memory usage: 22.9 KB
