# House Prices: Advanced Regression Techniques

A Kaggle competition on predicting house sale prices, used here to practice feature engineering and regression techniques such as random forests and gradient boosting.

https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview

In [1]:
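# Scoring function: root mean squared logarithmic error (RMSLE)
# It uses np, which is imported in the next cell, so call it only after that cell has run
def rmsle(y, y0):
    assert len(y) == len(y0)
    return np.sqrt(np.mean(np.power(np.log1p(y)-np.log1p(y0), 2)))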
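
To get a feel for what this metric rewards, here is a small self-contained sketch with made-up prices. The argument names `y_pred` and `y_true` are my own; the formula is the same as in the cell above. Because the error is computed on logarithms, a prediction that is 10% too high costs roughly the same on a cheap house as on an expensive one.

```python
import numpy as np

def rmsle(y_pred, y_true):
    """Root mean squared logarithmic error between two equal-length arrays."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    assert len(y_pred) == len(y_true)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Hypothetical prices: both predictions are 10% too high.
cheap = rmsle([110_000], [100_000])
expensive = rmsle([1_100_000], [1_000_000])

# Identical arrays give an error of exactly zero.
perfect = rmsle([100_000, 200_000], [100_000, 200_000])

print(cheap, expensive, perfect)
```

Both non-zero values come out close to log(1.1), which is why the metric behaves like a relative error rather than an absolute one.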

## 1- Load libraries

In [2]:
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt


## 2- Import and preprocess data

In [3]:
data_df = pd.read_csv('data/train.csv')
data_df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [29]:
test_df = pd.read_csv('data/test.csv')
test_df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1461,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,...,120,0,,MnPrv,,0,6,2010,WD,Normal
1,1462,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,...,0,0,,,Gar2,12500,6,2010,WD,Normal
2,1463,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,...,0,0,,MnPrv,,0,3,2010,WD,Normal
3,1464,60,RL,78.0,9978,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,6,2010,WD,Normal
4,1465,120,RL,43.0,5005,Pave,,IR1,HLS,AllPub,...,144,0,,,,0,1,2010,WD,Normal


In [4]:
data_df.shape

(1460, 81)

In [5]:
data_df.columns

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive

## 3- Build something quick and dirty

In [17]:
columns = list(data_df.select_dtypes(include=['int64', 'float64']).columns)
columns.remove('SalePrice')

In [18]:
X = data_df[columns].copy()
X.isnull().sum()

Id                 0
MSSubClass         0
LotFrontage      259
LotArea            0
OverallQual        0
OverallCond        0
YearBuilt          0
YearRemodAdd       0
MasVnrArea         8
BsmtFinSF1         0
BsmtFinSF2         0
BsmtUnfSF          0
TotalBsmtSF        0
1stFlrSF           0
2ndFlrSF           0
LowQualFinSF       0
GrLivArea          0
BsmtFullBath       0
BsmtHalfBath       0
FullBath           0
HalfBath           0
BedroomAbvGr       0
KitchenAbvGr       0
TotRmsAbvGrd       0
Fireplaces         0
GarageYrBlt       81
GarageCars         0
GarageArea         0
WoodDeckSF         0
OpenPorchSF        0
EnclosedPorch      0
3SsnPorch          0
ScreenPorch        0
PoolArea           0
MiscVal            0
MoSold             0
YrSold             0
dtype: int64

In [8]:
# Assign the result back: fillna(inplace=True) on a selected column may not update X in recent pandas versions
X['LotFrontage'] = X['LotFrontage'].fillna(X['LotFrontage'].median())
X['GarageYrBlt'] = X['GarageYrBlt'].fillna(X['GarageYrBlt'].median())
X['MasVnrArea'] = X['MasVnrArea'].fillna(X['MasVnrArea'].mean())

# Check whether any missing values remain
X.isnull().sum().sum()

0
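
The fill values above are computed inline, column by column. A possible alternative, sketched here on a toy frame with made-up numbers, is to compute them once and keep them in a dict so the same values can be reused on any other frame with the same columns. The name `fill_values` and the toy frames are assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the numeric training features (values are made up).
train = pd.DataFrame({
    'LotFrontage': [65.0, 80.0, np.nan, 60.0],
    'MasVnrArea': [196.0, 0.0, 162.0, np.nan],
})

# Compute each fill value once, from the training data only.
fill_values = {
    'LotFrontage': train['LotFrontage'].median(),
    'MasVnrArea': train['MasVnrArea'].mean(),
}

# fillna with a dict fills each column with its own value and returns a new frame.
train_filled = train.fillna(fill_values)

# The same dict can later be applied to another frame with the same columns.
other = pd.DataFrame({'LotFrontage': [np.nan], 'MasVnrArea': [np.nan]})
other_filled = other.fillna(fill_values)

print(train_filled.isnull().sum().sum())
```

Reusing training statistics keeps the preprocessing of new data consistent with what the model saw during fitting.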

In [9]:
y = data_df['SalePrice']
y.head()

0    208500
1    181500
2    223500
3    140000
4    250000
Name: SalePrice, dtype: int64

In [10]:
# Scale each feature to the [0, 1] range
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)



In [11]:
# Now train a quick model
from sklearn.model_selection import train_test_split
X_train, X_dev, y_train, y_dev = train_test_split(X_scaled, y, random_state=0)

print('Number of examples in the training set:', X_train.shape[0])
print('Number of examples in the development set:', X_dev.shape[0])

Number of examples in the training set: 1095
Number of examples in the development set: 365
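
Note that the scaler in the earlier cell was fit on all rows before the split, so the dev rows influenced the min/max used for scaling. For a quick baseline this probably matters little, but a stricter variant would split first and fit the scaler on the training rows only. Here is a minimal sketch on synthetic data; `X_demo`, `y_demo` and `demo_scaler` are hypothetical names, not the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the feature matrix and target (not the Kaggle data).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(200, 3))
y_demo = X_demo @ np.array([1.0, 2.0, 3.0])

# Split first, so the dev rows play no part in fitting the scaler.
X_tr, X_dv, y_tr, y_dv = train_test_split(X_demo, y_demo, random_state=0)

demo_scaler = MinMaxScaler().fit(X_tr)      # learns min/max from the train rows only
X_tr_scaled = demo_scaler.transform(X_tr)
X_dv_scaled = demo_scaler.transform(X_dv)   # may fall slightly outside [0, 1]

print(X_tr_scaled.min(), X_tr_scaled.max())
```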


In [14]:
from sklearn.linear_model import LinearRegression

LR = LinearRegression().fit(X_train, y_train)
print('R2 score on the train set: ', LR.score(X_train, y_train))
print('R2 score on the dev set: ', LR.score(X_dev, y_dev))
# rmsle is the root mean squared logarithmic error, not the plain mean squared error
print('RMSLE on the train set', rmsle(LR.predict(X_train), y_train))
print('RMSLE on the dev set', rmsle(LR.predict(X_dev), y_dev))

R2 score on the train set:  0.8449554270569105
R2 score on the dev set:  0.6801233629810874
RMSLE on the train set 0.18592729324938925
RMSLE on the dev set 0.20022687231720146
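
One caveat with scoring a linear model this way: `log1p` is only defined for values above -1, and a linear regression on raw prices has no such guarantee for its predictions. A common workaround, sketched below on synthetic data, is to fit on `log1p(y)` and map predictions back with `expm1`. This is an assumption about how one might proceed, not what the notebook does; `X_demo`, `y_demo` and `log_lr` are hypothetical names.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic, strictly positive "prices" (not the Kaggle data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = np.exp(11.0 + X_demo @ np.array([0.3, 0.2, 0.1, 0.05]))

# Fit on log1p(y); predictions are mapped back with expm1, so they stay above -1.
log_lr = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
log_lr.fit(X_demo, y_demo)
pred = log_lr.predict(X_demo)

print(pred.min() > 0)
```

Fitting in log space also means the least-squares objective lines up more closely with the RMSLE used for scoring.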




In [15]:
from sklearn.ensemble import GradientBoostingRegressor

# Least-squares loss is the default; the 'ls' alias was removed in newer scikit-learn releases
GBR = GradientBoostingRegressor(learning_rate=0.5, n_estimators=10, max_depth=3).fit(X_train, y_train)
print('R2 score on the train set: ', GBR.score(X_train, y_train))
print('R2 score on the dev set: ', GBR.score(X_dev, y_dev))
print('RMSLE on the train set', rmsle(GBR.predict(X_train), y_train))
print('RMSLE on the dev set', rmsle(GBR.predict(X_dev), y_dev))

R2 score on the train set:  0.9333336681705634
R2 score on the dev set:  0.8201268129087149
RMSLE on the train set 0.127042641007794
RMSLE on the dev set 0.15507222415195202


In [16]:
from sklearn.ensemble import RandomForestRegressor

RFR = RandomForestRegressor(n_estimators=10, max_features=10, random_state=0).fit(X_train, y_train)
print('R2 score on the train set: ', RFR.score(X_train, y_train))
print('R2 score on the dev set: ', RFR.score(X_dev, y_dev))
print('RMSLE on the train set', rmsle(RFR.predict(X_train), y_train))
print('RMSLE on the dev set', rmsle(RFR.predict(X_dev), y_dev))

R2 score on the train set:  0.9609503497710953
R2 score on the dev set:  0.8039680588423219
RMSLE on the train set 0.07784395754890554
RMSLE on the dev set 0.15027740860870198


All three models score noticeably better on the train set than on the dev set, so they appear to be overfitting badly. Now what will happen if I just submit this no-brainer model?
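
A single train/dev split gives only one, fairly noisy, estimate of that gap. One way to get a steadier picture is k-fold cross-validation. The sketch below uses synthetic data and a random forest; the data, the names and the settings (`n_estimators=50`, `cv=5`) are illustrative assumptions, not the notebook's.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data (not the Kaggle data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
y_demo = X_demo[:, 0] * 3.0 + X_demo[:, 1] ** 2 + rng.normal(scale=0.5, size=300)

model = RandomForestRegressor(n_estimators=50, random_state=0)

# Score on 5 held-out folds, then compare with the score on the data the model was fit on.
cv_r2 = cross_val_score(model, X_demo, y_demo, cv=5, scoring='r2')
train_r2 = model.fit(X_demo, y_demo).score(X_demo, y_demo)

print('train R2:', train_r2)
print('CV R2: %.3f +/- %.3f' % (cv_r2.mean(), cv_r2.std()))
```

A train score well above the cross-validated mean is the same overfitting signal as above, just averaged over several splits.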

In [33]:
# Try submitting the solution with GradientBoostingRegressor
X_test = test_df[columns].copy()
# Assign the result back instead of using fillna(inplace=True) on a selected column
for col in ['LotFrontage', 'GarageYrBlt']:
    X_test[col] = X_test[col].fillna(X_test[col].median())
for col in ['MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF',
            'BsmtFullBath', 'BsmtHalfBath', 'GarageCars', 'GarageArea']:
    X_test[col] = X_test[col].fillna(X_test[col].mean())

X_test.isnull().sum().sum()

0

In [36]:
X_test_scaled = scaler.transform(X_test)

In [39]:
y_pred = GBR.predict(X_test_scaled)
answer = pd.DataFrame(data=y_pred, columns=['SalePrice'])
answer.insert(loc=0, column='Id', value=test_df['Id'])

answer.to_csv('data/submission.csv', index=False)
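
Before uploading, it can be worth running a quick sanity check on the submission frame. The helper below is a hypothetical sketch: `check_submission` is my own name, and the rules it enforces (exactly the `Id` and `SalePrice` columns, ids matching the test set, no missing or non-positive prices) are assumptions based on the file written above, not an official specification.

```python
import pandas as pd

def check_submission(answer, expected_ids):
    """Return a list of problems found in a submission frame (empty if none)."""
    problems = []
    if list(answer.columns) != ['Id', 'SalePrice']:
        problems.append('columns must be exactly Id, SalePrice')
        return problems
    if answer['Id'].tolist() != list(expected_ids):
        problems.append('Id column does not match the test set')
    if answer['SalePrice'].isnull().any():
        problems.append('SalePrice contains missing values')
    if (answer['SalePrice'] <= 0).any():
        problems.append('SalePrice contains non-positive values')
    return problems

# Made-up example: one prediction is negative.
demo = pd.DataFrame({'Id': [1461, 1462, 1463],
                     'SalePrice': [120000.0, -5.0, 98000.0]})
issues = check_submission(demo, [1461, 1462, 1463])
print(issues)
```

In the notebook this would be called as `check_submission(answer, test_df['Id'])` just before `to_csv`.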

As expected, this results in an abysmal score of 0.24740, which put me at position 89% on the leaderboard. Yucky! Now, let's do a more serious job.

## 4- Exploratory Data Analysis