# Predicting house prices

By Greg Headley and Leon Chan

This project examines the efficacy of various regression models in predicting sale prices of homes in Ames, Iowa.

This README describes our processing at a high level and refers to additional notebooks for specific details.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import math
import matplotlib.pyplot as plt


pd.options.display.max_rows = 1000
pd.options.display.max_columns = 200
plt.rcParams['figure.figsize'] = [10, 4]
plt.rcParams['figure.dpi'] = 100

## Import data, extract target, merge test & train

As our first step, we load the training and testing datasets from CSV into pandas DataFrames. We temporarily combine them so that null values can be handled on a single DataFrame, using a `test_data` flag so that they can be split back apart later. The target variable is split off from the training data just before modelling.

In [2]:
missing_values = ["n/a", "na", "--"]

train = pd.read_csv("data/train.csv", na_values = missing_values)
test = pd.read_csv("data/test.csv", na_values = missing_values)

# Set flag to discriminate between test and train
train['test_data'] = False
test['test_data'] = True

# Concatenate datasets and renumber the index
full_data = pd.concat([train, test]).reset_index(drop=True)

In [3]:
full_data.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice,test_data
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706.0,Unf,0.0,150.0,856.0,GasA,Ex,Y,SBrkr,856,854,0,1710,1.0,0.0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2.0,548.0,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500.0,False
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978.0,Unf,0.0,284.0,1262.0,GasA,Ex,Y,SBrkr,1262,0,0,1262,0.0,1.0,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2.0,460.0,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500.0,False
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486.0,Unf,0.0,434.0,920.0,GasA,Ex,Y,SBrkr,920,866,0,1786,1.0,0.0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2.0,608.0,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500.0,False
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216.0,Unf,0.0,540.0,756.0,GasA,Gd,Y,SBrkr,961,756,0,1717,1.0,0.0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3.0,642.0,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000.0,False
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655.0,Unf,0.0,490.0,1145.0,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1.0,0.0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3.0,836.0,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000.0,False


## Drop useless columns, bad row

Next, we will address some of the most egregious missing/null values up front.

We supply a list of variables to drop in their entirety, as they add little value to the dataset: they are either mostly null or carry little information for prediction (`Id`, for instance, is only a row identifier). We also remove an observation with no value for the "Electrical" variable.

In [4]:
from src.preprocess import clean

# Columns to remove entirely
drops = ['PoolQC', 'MiscFeature', 'FireplaceQu', 'Id', 'Utilities']

# Remove any observation where "Electrical" is null
elec_na = full_data["Electrical"].isna()
full_data.drop(elec_na.loc[elec_na].index, inplace=True)

full_data = clean(full_data, drop_list=drops)

## Match null count of sibling columns

Before filling null values, we must address some discrepancies in null counts between 'sibling' columns. For example, the basement columns have slight mismatches in the number of null values, with some variables possessing a few extra nulls. To rectify this, we set all basement variables to null if any sibling has a null value. We do likewise for other sibling groups, such as the columns describing the garage.

In [5]:
# Import custom function to handle sibling columns. 
from src.preprocess import null_match

siblings = [
    ["BsmtExposure", "BsmtCond", "BsmtQual", "BsmtFinType1", "BsmtFinType2"],
    ["GarageFinish", "GarageYrBlt", "GarageQual", "GarageCond", "GarageType"],
    ["MasVnrType", "MasVnrArea"]   
]   

full_data = null_match(full_data, siblings) 
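
The source of `null_match` is not shown in this README, so the following is only a minimal sketch of the behaviour described above, run on a made-up three-row frame. The name `null_match_sketch` and the toy data are hypothetical. The real function may differ in details.

```python
import numpy as np
import pandas as pd


def null_match_sketch(df, sibling_groups):
    """Hypothetical stand-in for src.preprocess.null_match.

    For each group of sibling columns, any row with a null in at least one
    sibling gets a null in every sibling of that group.
    """
    out = df.copy()
    for group in sibling_groups:
        # Rows where at least one sibling column is null
        any_null = out[group].isna().any(axis=1)
        out.loc[any_null, group] = np.nan
    return out


toy = pd.DataFrame({
    "BsmtQual": ["Gd", None, "TA"],
    "BsmtCond": ["TA", None, None],   # one extra null compared with BsmtQual
    "LotArea": [8450, 9600, 11250],   # not a sibling, left untouched
})
matched = null_match_sketch(toy, [["BsmtQual", "BsmtCond"]])
print(matched.isna().sum())
```

After the call, both basement columns in the toy frame have the same null count, while `LotArea` is unchanged.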

## Fill null values

Now we will fill all null values with more appropriate values. This is done with a dictionary that maps each fill value to the list of variables it applies to. First we compute the list of variables for each data type, then we supply the dictionary to our `clean` function.

In [6]:
# Create lists of variable names for each data type: integer, float and categorical (object)
ints = [col for col in full_data.columns if full_data.dtypes[col] == "int64"]
floats = [col for col in full_data.columns if full_data.dtypes[col] == "float64"]
cats = [col for col in full_data.columns if full_data.dtypes[col] == "object"]

# 0 and 0.0 are the same dictionary key in Python (0 == 0.0), so integer and
# float columns share a single numeric fill value.
fill_dict = {0: ints + floats, "None": cats}

full_data = clean(full_data, fill_na=fill_dict)

# Let's confirm we've removed all nulls:
full_data.isna().sum().sum()

0
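
The sketch below illustrates filling nulls from a dictionary keyed by fill value. One Python detail matters here: `0` and `0.0` compare equal and hash alike, so they cannot be separate keys. The helper `fill_nulls` and the toy frame are hypothetical. The actual `clean` function is not shown here and may work differently.

```python
import numpy as np
import pandas as pd


def fill_nulls(df, fill_map):
    """Hypothetical helper: fill_map maps a fill value to a list of columns."""
    out = df.copy()
    for value, cols in fill_map.items():
        out[cols] = out[cols].fillna(value)
    return out


toy = pd.DataFrame({
    "GarageCars": [2.0, np.nan, 1.0],          # float column with a null
    "GarageType": ["Attchd", None, "Detchd"],  # categorical column with a null
    "LotArea": [8450, 9600, 11250],            # integer column, no nulls
})

# 0 == 0.0, so this literal keeps one entry for both (holding the last value)
collided = {0: ["LotArea"], 0.0: ["GarageCars"], "None": ["GarageType"]}
print(len(collided))  # 2

# Grouping all numeric columns under one key avoids the collision
numeric = toy.select_dtypes(include="number").columns.tolist()
categorical = [col for col in toy.columns if col not in numeric]
filled = fill_nulls(toy, {0: numeric, "None": categorical})
print(filled.isna().sum().sum())
```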

## Feature Engineering

Now we create new features by combining existing ones, in the hope that they help us make better predictions. In addition, we distinguish between nominal (unordered) and ordinal categorical variables and convert each accordingly. This is achieved using our custom functions `feat_create` and `ordinal_create`.

In [7]:
from src.preprocess import feat_create
from src.preprocess import ordinal_create

ordinal_vars = ['ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond',
        'HeatingQC', 'KitchenQual', 'GarageQual', 'GarageCond'] # May want to add BsmtExposure

new_feats = {
        "Total_Bath": 
            {
                1:['BsmtFullBath','FullBath'], 
                0.5: ['BsmtHalfBath', 'HalfBath']
            },
        "Porch_SF":
            {
                1: ['WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch',
                    '3SsnPorch', 'ScreenPorch']
            },
        "Total_SF":
            {
                1: ['TotalBsmtSF', '1stFlrSF', '2ndFlrSF'] 
            }
}

swap_subclass = {20:'1story 1946+', 
                 30:'1story 1946-', 
                 40:'1story w attic', 
                 45:'1halfstory unfinish', 
                 50:'1halfstory finish', 
                 60:'2story 1946+', 
                 70:'2story 1946-', 
                 75:'2halfstory', 
                 80:'split multi-level', 
                 85:'split foyer', 
                 90:'duplex', 
                 120:'1story PUD 1946+', 
                 150:'1halfstory PUD', 
                 160:'2story PUD 1946+', 
                 180:'PUD multilevel', 
                 190:'2 family conv'}

full_data = feat_create(full_data, new_feats)
full_data = ordinal_create(full_data, ordinal_vars)
full_data['MSSubClass'] = full_data['MSSubClass'].map(swap_subclass)
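
To make the structure of `new_feats` concrete, here is a rough sketch that reads each entry as a weighted sum (weight mapped to a list of columns) and maps quality labels to integer ranks. The function names, the `QUALITY_ORDER` mapping and the toy rows are assumptions for illustration. The real `feat_create` and `ordinal_create` live in `src.preprocess` and may behave differently.

```python
import pandas as pd

# Assumed ranking of the quality codes, with "None" (the fill value used for
# missing categoricals) ranked lowest.
QUALITY_ORDER = {"None": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}


def feat_create_sketch(df, recipes):
    """Hypothetical: each new feature is a weighted sum of existing columns."""
    out = df.copy()
    for name, weights in recipes.items():
        out[name] = sum(w * out[cols].sum(axis=1) for w, cols in weights.items())
    return out


def ordinal_create_sketch(df, cols, order=QUALITY_ORDER):
    """Hypothetical: replace quality labels with integer ranks."""
    out = df.copy()
    for col in cols:
        out[col] = out[col].map(order)
    return out


toy = pd.DataFrame({
    "BsmtFullBath": [1, 0], "FullBath": [2, 1],
    "BsmtHalfBath": [0, 1], "HalfBath": [1, 0],
    "KitchenQual": ["Gd", "TA"],
})
recipe = {"Total_Bath": {1: ["BsmtFullBath", "FullBath"],
                         0.5: ["BsmtHalfBath", "HalfBath"]}}
toy = feat_create_sketch(toy, recipe)
toy = ordinal_create_sketch(toy, ["KitchenQual"])
print(toy[["Total_Bath", "KitchenQual"]])
```

Under this reading, the first toy row has 3 full baths and 1 half bath, giving a `Total_Bath` of 3.5.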

In [8]:
# Returns new dataframe with categorical variables converted to dummy variables. 
full_data = pd.get_dummies(full_data, drop_first=True)

# Split the combined dataset back into training and testing datasets.
final_train = full_data.loc[~full_data.test_data, :].copy()
final_train.drop(columns=['test_data'], inplace=True)
final_train.reset_index(drop=True, inplace=True)

# The test rows have no real target, so drop the placeholder SalePrice as well.
final_test = full_data.loc[full_data.test_data, :].copy()
final_test.drop(columns=['test_data', 'SalePrice'], inplace=True)
final_test.reset_index(drop=True, inplace=True)

## Scale (standardise) and Transform (normalise) numeric variables

Our [analysis](greg-eda.ipynb) showed that some numeric variables require scaling, transforming, or both. We handle this with the custom function `preprocess`, which accepts a list of variables to scale and a list of variables to transform.

In addition to the dataframe, the `preprocess` function returns a dictionary of pipelines, which stores the fitted transformation objects for each variable so that the same transformations can later be applied to the test data and inverted for the predictions.

In [9]:
from src.preprocess import preprocess

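# OverallQual and OverallCond are already ordinal ratings, so exclude them from scaling too.
ordinal_vars.append('OverallQual')
ordinal_vars.append('OverallCond')

# Scale every non-object column that is not ordinal; transform the listed variables.
scale_feats = [col for col in final_train.columns if (final_train.dtypes[col] != "object") and (col not in ordinal_vars)]
trans_feats = ['SalePrice', 'LotArea', 'Total_SF', 'GrLivArea', 'LotFrontage', 'GarageArea']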

# Drop two massive outliers identified in analysis. 
final_train = final_train.drop([523, 1298]) 

final_train, pipelines = preprocess(final_train, scale_list=scale_feats, transform_list=trans_feats)
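
The implementation of `preprocess` is not reproduced here. As a rough sketch of the idea (one fitted pipeline per variable, kept in a dictionary), the example below uses scikit-learn's `StandardScaler` for scaling and a `log1p` transform for normalising. Both choices, the name `preprocess_sketch` and the five-row toy frame are assumptions, not necessarily what the project uses.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def preprocess_sketch(df, scale_list, transform_list):
    """Hypothetical stand-in for src.preprocess.preprocess.

    Fits one small pipeline per column and stores it in a dict so the same
    fitted transformation can be reused or inverted later.
    """
    out = df.copy()
    pipelines = {}
    for col in dict.fromkeys(transform_list + scale_list):  # unique, ordered
        steps = []
        if col in transform_list:
            # Assumption: log1p as the normalising transform
            steps.append(("log", FunctionTransformer(np.log1p, inverse_func=np.expm1)))
        if col in scale_list:
            steps.append(("scale", StandardScaler()))
        pipe = Pipeline(steps)
        values = out[[col]].to_numpy(dtype=float)
        out[col] = np.asarray(pipe.fit_transform(values)).ravel()
        pipelines[col] = pipe
    return out, pipelines


toy = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, 11250.0, 9550.0, 14260.0],
    "SalePrice": [208500.0, 181500.0, 223500.0, 140000.0, 250000.0],
})
scaled, pipes = preprocess_sketch(toy, scale_list=["LotArea", "SalePrice"],
                                  transform_list=["SalePrice"])

# Inverting the stored pipeline recovers the original prices
restored = pipes["SalePrice"].inverse_transform(scaled[["SalePrice"]].to_numpy())
print(np.allclose(restored.ravel(), toy["SalePrice"]))
```

Keeping the fitted objects is what lets the transformation learned on the training data be reapplied to unseen data without refitting.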

Before modelling, we split the target off from the training dataset.

In [10]:
target = final_train.loc[:, 'SalePrice']
final_train.drop(columns=['SalePrice'], inplace=True)
final_train.reset_index(drop=True, inplace=True)

## Modelling

We fit four regression models (random forest, gradient boosting, lasso and ridge) and compare them using 10-fold cross-validated RMSE.

In [11]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Lasso, Ridge

RANDOM_SEED = 42

forest = RandomForestRegressor(n_jobs=-1, 
                               random_state=RANDOM_SEED, 
                               n_estimators=200, 
                               max_features=50, 
                               min_samples_leaf=2, 
                               min_samples_split=2, 
                               max_depth=20)
forest.fit(final_train, target)

gboost = GradientBoostingRegressor(n_estimators=1500, 
                                   learning_rate=0.03, 
                                   max_features=40, 
                                   min_samples_leaf=2, 
                                   min_samples_split=12, 
                                   random_state=RANDOM_SEED)
gboost.fit(final_train, target)

lasso = Lasso(alpha=0.0005,
              max_iter=5000,
              random_state=RANDOM_SEED)
lasso.fit(final_train, target)

ridge = Ridge(alpha=7.5,
              random_state=RANDOM_SEED)
ridge.fit(final_train, target)


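def cv_rmse(model):
    """Return the per-fold RMSE from 10-fold cross-validation on the training data."""
    # scikit-learn reports the negated RMSE so that higher is better; flip the sign back.
    rmse = -cross_val_score(model, final_train, target,
                            scoring="neg_root_mean_squared_error",
                            cv=10, n_jobs=-1)
    return rmse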

In [12]:
fscore = cv_rmse(forest)
gscore = cv_rmse(gboost)
lscore = cv_rmse(lasso)
rscore = cv_rmse(ridge)
print("RandomForest CV score is:   {:.4f} ({:.4f})".format(fscore.mean(), fscore.std()))
print("Gradient Boost CV score is: {:.4f} ({:.4f})".format(gscore.mean(), gscore.std()))
print("Lasso CV score is:          {:.4f} ({:.4f})".format(lscore.mean(), lscore.std()))
print("Ridge CV score is:          {:.4f} ({:.4f})".format(rscore.mean(), rscore.std()))

RandomForest CV score is:   0.3238 (0.0352)
Gradient Boost CV score is: 0.2768 (0.0400)
Lasso CV score is:          0.2688 (0.0379)
Ridge CV score is:          0.2697 (0.0358)
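
For readers unfamiliar with the `neg_root_mean_squared_error` scorer used in `cv_rmse`, the small sketch below shows the sign convention on synthetic data. The data are randomly generated for illustration and are not the Ames dataset, so the numbers it prints are not comparable to the scores above.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
# Synthetic target: a linear signal plus noise
y = X @ np.array([1.5, -2.0, 0.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=200)

scores = {}
for name, model in [("lasso", Lasso(alpha=0.0005)), ("ridge", Ridge(alpha=7.5))]:
    raw = cross_val_score(model, X, y,
                          scoring="neg_root_mean_squared_error", cv=10)
    scores[name] = -raw  # the scorer returns negative values; negate for RMSE
    print(f"{name}: {scores[name].mean():.4f} ({scores[name].std():.4f})")
```

Each entry in `scores` holds one RMSE per fold, which is why the report above shows a mean with a standard deviation in parentheses.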


In [13]:
# from sklearn.ensemble import StackingRegressor

# stack = StackingRegressor(
#     estimators=[
#         ('forest', forest),
#         ('gboost', gboost),
#         ('lasso', lasso),
#         ('ridge', ridge)
#     ], 
#     cv=10,
#     n_jobs=-1
# )
# stack.fit(final_train, target)

# sscore = cv_rmse(stack)
# print("Stacking CV score is: {:.4f} ({:.4f})".format(sscore.mean(), sscore.std()))

In [14]:
from src.preprocess import pipe_apply

# Apply the transformations fitted on the training data to the test data.
pipe_test = pipe_apply(final_test, pipelines, direction='forward')

In [15]:
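# Predict with the lasso model (lowest CV error above), then invert the
# transformations to return the predictions to the original price scale.
pipe_test['SalePrice'] = lasso.predict(pipe_test)
submission = pipe_apply(pipe_test, pipelines, direction='inverse')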
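
The key point in the two cells above is that the test data reuses transformers fitted on the training data, and that the same objects are run in reverse to get prices back. A minimal sketch of such a helper follows. `pipe_apply_sketch` is a hypothetical stand-in, and the real `pipe_apply` may handle columns differently.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def pipe_apply_sketch(df, pipelines, direction="forward"):
    """Hypothetical stand-in for src.preprocess.pipe_apply.

    Applies already-fitted transformers column by column, skipping any
    column that is absent from df (e.g. SalePrice in the test set).
    """
    out = df.copy()
    for col, pipe in pipelines.items():
        if col not in out.columns:
            continue
        values = out[[col]].to_numpy(dtype=float)
        if direction == "forward":
            out[col] = pipe.transform(values).ravel()
        elif direction == "inverse":
            out[col] = pipe.inverse_transform(values).ravel()
        else:
            raise ValueError("direction must be 'forward' or 'inverse'")
    return out


# Fit on "training" values only, then reuse the fitted scaler on new rows
train_toy = pd.DataFrame({"GrLivArea": [1710.0, 1262.0, 1786.0, 1717.0, 2198.0]})
fitted = {"GrLivArea": StandardScaler().fit(train_toy.to_numpy())}

test_toy = pd.DataFrame({"GrLivArea": [1500.0, 2000.0]})
forward = pipe_apply_sketch(test_toy, fitted, direction="forward")
back = pipe_apply_sketch(forward, fitted, direction="inverse")
print(np.allclose(back["GrLivArea"], test_toy["GrLivArea"]))
```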

In [16]:
np.floor(submission.SalePrice)

0       118084.0
1       157036.0
2       182565.0
3       199980.0
4       196272.0
          ...   
1454     87560.0
1455     78566.0
1456    166047.0
1457    121373.0
1458    218004.0
Name: SalePrice, Length: 1459, dtype: float64

In [17]:
# Assemble the submission: one predicted price per test Id.
sub_df = pd.DataFrame()
sub_df['Id'] = test.Id
sub_df['SalePrice'] = np.floor(submission.SalePrice)

In [18]:
sub_df.to_csv('submission.csv', index=False)