# Introduction

In this project, we'll work with housing data for the city of Ames, Iowa, US, covering 2006 to 2010.  We'll use the features in this dataset to train a multiple linear regression model that predicts the sale price of a house.  We'll begin by reading in the data and creating some simple functions to illustrate the workflow, then build up these functions as the project goes on.

In [132]:
import pandas as pd
pd.options.display.max_columns = 999
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import KFold

from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression

In [133]:
df = pd.read_csv('AmesHousing.tsv', sep='\t')

In [134]:
def transform_features(df):
    return df

In [135]:
def select_features(df):
    return df[['Gr Liv Area', 'SalePrice']]

In [136]:
def train_and_test(df):
    train = df[:1460]
    test = df[1460:]   
    
    numeric_train = train.select_dtypes(include=['int', 'float'])
    numeric_test = test.select_dtypes(include=['int', 'float'])
    
    features = numeric_train.columns.drop('SalePrice')
    model = LinearRegression()
    model.fit(train[features], train['SalePrice'])
    predictions = model.predict(test[features])
    mse = mean_squared_error(test['SalePrice'], predictions)
    rmse = np.sqrt(mse)
    return rmse

transform_df = transform_features(df)
filtered_df = select_features(transform_df)
rmse = train_and_test(filtered_df)

print(rmse)

57088.25161263909
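
For reference, the root mean squared error (RMSE) printed above is the square root of the average squared difference between actual and predicted prices. The sketch below computes it by hand on three made-up prices (illustrative values only, not taken from the Ames data):

```python
import numpy as np

# Illustrative values only; not taken from the Ames dataset.
actual = np.array([200000.0, 150000.0, 300000.0])
predicted = np.array([210000.0, 140000.0, 330000.0])

# RMSE = sqrt(mean((actual - predicted)^2))
errors = actual - predicted
rmse_example = np.sqrt(np.mean(errors ** 2))
print(rmse_example)
```

Because RMSE is expressed in the same units as SalePrice, the baseline value above can be read as a typical prediction error in dollars.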


# Feature Engineering

Now we have to clean the data so that our model can be as accurate as possible.  To do this we'll use the transform_features function defined above.  The goals of this function are to:

* Remove features we don't want to use in the model, based on data leakage or the number of missing values

* Transform features into the proper format (numerical to categorical, scaling numerical values, filling in missing values etc.)

* Create new features by combining other features


We'll investigate the columns in the dataset and see if they correspond to any of these criteria:

* Which columns contain less than 5% missing values? Let's fill the numerical columns in with the most popular value for the column

* Which new features can we create that better capture the information found in other features?

* Which columns could be dropped for other reasons? E.g. not useful for machine learning or leak data

For all columns we'll:

* Drop those columns which have more than 5% missing values for now

For text columns we'll:

* Drop those columns which have 1 or more missing values for now

For numerical columns we'll:

* Fill in the missing values using the most popular value for that column
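
As a quick sanity check of these three rules, here is a minimal sketch on a made-up 40-row frame. The column names and values are hypothetical and are not part of the Ames data; with 40 rows, the 5% cutoff corresponds to 2 missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with 40 rows, so the 5% cutoff is 2 missing values.
toy = pd.DataFrame({
    "mostly_missing": [np.nan] * 10 + [1.0] * 30,            # 25% missing
    "text_with_gap": pd.Series([None] + ["a"] * 39, dtype=object),  # 1 missing
    "numeric_with_gap": [np.nan] + [7.0] * 30 + [3.0] * 9,   # 1 missing, mode 7.0
    "complete": list(range(40)),
})

# Rule 1: drop any column with more than 5% missing values.
missing = toy.isnull().sum()
toy = toy.drop(missing[missing > len(toy) / 20].index, axis=1)

# Rule 2: drop text columns that still contain missing values.
text_missing = toy.select_dtypes(include=["object"]).isnull().sum()
toy = toy.drop(text_missing[text_missing > 0].index, axis=1)

# Rule 3: fill remaining numeric gaps with the column's most frequent value.
for col in toy.select_dtypes(include=["number"]).columns:
    toy[col] = toy[col].fillna(toy[col].mode()[0])

print(list(toy.columns))
print(toy["numeric_with_gap"].iloc[0])
```

Only `numeric_with_gap` and `complete` should survive, and the gap in `numeric_with_gap` is filled with its mode.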

In [137]:
#missing values
missing_vals = df.isnull().sum()
#selecting the columns with more than 5% missing values so they can be dropped
cols_to_drop = missing_vals[missing_vals > len(df)/20]
df = df.drop(cols_to_drop.index, axis=1)

In [138]:
#text columns
#dropping columns which have 1 or more missing values
text_cols = df.select_dtypes(include=['object'])
text_cols_mv = text_cols.isnull().sum()
text_cols_mv = text_cols_mv[text_cols_mv > 0]

df = df.drop(text_cols_mv.index, axis=1)

In [139]:
#Numerical columns
#filling in the missing value using the most popular values for that column
numeric_cols = df.select_dtypes(include=['int','float'])
numeric_cols_mv = numeric_cols.isnull().sum()

for col in numeric_cols:
    if df[col].isnull().sum() > 0:
        df[col] = df[col].fillna(df[col].mode()[0])  #assigning back avoids chained in-place modification
    


In [140]:
df.isnull().sum().value_counts() #verifying all the missing values have been cleaned

0    64
dtype: int64

Next, we'll combine a couple of features to create new features which better reflect the information in the originals.  We're going to create a feature that describes the years between the last remodel and the sale, using the 'Yr Sold' and 'Year Remod/Add' features, alongside a feature that describes the age of the house when it was sold, using 'Year Built' and 'Yr Sold'.

In [141]:
df['years_sold'] = df['Yr Sold'] - df['Year Built']
df['years_since_remod'] = df['Yr Sold'] - df['Year Remod/Add']

In [142]:
#checking for negative values
negative_sold = df[df['years_sold'] < 0]
negative_sold

Unnamed: 0,Order,PID,MS SubClass,MS Zoning,Lot Area,Street,Lot Shape,Land Contour,Utilities,Lot Config,Land Slope,Neighborhood,Condition 1,Condition 2,Bldg Type,House Style,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Roof Style,Roof Matl,Exterior 1st,Exterior 2nd,Mas Vnr Area,Exter Qual,Exter Cond,Foundation,BsmtFin SF 1,BsmtFin SF 2,Bsmt Unf SF,Total Bsmt SF,Heating,Heating QC,Central Air,1st Flr SF,2nd Flr SF,Low Qual Fin SF,Gr Liv Area,Bsmt Full Bath,Bsmt Half Bath,Full Bath,Half Bath,Bedroom AbvGr,Kitchen AbvGr,Kitchen Qual,TotRms AbvGrd,Functional,Fireplaces,Garage Cars,Garage Area,Paved Drive,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Misc Val,Mo Sold,Yr Sold,Sale Type,Sale Condition,SalePrice,years_sold,years_since_remod
2180,2181,908154195,20,RL,39290,Pave,IR1,Bnk,AllPub,Inside,Gtl,Edwards,Norm,Norm,1Fam,1Story,10,5,2008,2009,Hip,CompShg,CemntBd,CmentBd,1224.0,Ex,TA,PConc,4010.0,0.0,1085.0,5095.0,GasA,Ex,Y,5095,0,0,5095,1.0,1.0,2,1,2,1,Ex,15,Typ,2,3.0,1154.0,Y,546,484,0,0,0,0,17000,10,2007,New,Partial,183850,-1,-2


In [143]:
negative_remod = df[df['years_since_remod'] < 0]
negative_remod

Unnamed: 0,Order,PID,MS SubClass,MS Zoning,Lot Area,Street,Lot Shape,Land Contour,Utilities,Lot Config,Land Slope,Neighborhood,Condition 1,Condition 2,Bldg Type,House Style,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Roof Style,Roof Matl,Exterior 1st,Exterior 2nd,Mas Vnr Area,Exter Qual,Exter Cond,Foundation,BsmtFin SF 1,BsmtFin SF 2,Bsmt Unf SF,Total Bsmt SF,Heating,Heating QC,Central Air,1st Flr SF,2nd Flr SF,Low Qual Fin SF,Gr Liv Area,Bsmt Full Bath,Bsmt Half Bath,Full Bath,Half Bath,Bedroom AbvGr,Kitchen AbvGr,Kitchen Qual,TotRms AbvGrd,Functional,Fireplaces,Garage Cars,Garage Area,Paved Drive,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Misc Val,Mo Sold,Yr Sold,Sale Type,Sale Condition,SalePrice,years_sold,years_since_remod
1702,1703,528120010,60,RL,16659,Pave,IR1,Lvl,AllPub,Corner,Gtl,NridgHt,Norm,Norm,1Fam,2Story,8,5,2007,2008,Gable,CompShg,VinylSd,VinylSd,0.0,Gd,TA,PConc,0.0,0.0,1582.0,1582.0,GasA,Ex,Y,1582,570,0,2152,0.0,0.0,2,1,3,1,Gd,7,Typ,1,2.0,728.0,Y,0,368,0,0,0,0,0,6,2007,New,Partial,260116,0,-1
2180,2181,908154195,20,RL,39290,Pave,IR1,Bnk,AllPub,Inside,Gtl,Edwards,Norm,Norm,1Fam,1Story,10,5,2008,2009,Hip,CompShg,CemntBd,CmentBd,1224.0,Ex,TA,PConc,4010.0,0.0,1085.0,5095.0,GasA,Ex,Y,5095,0,0,5095,1.0,1.0,2,1,2,1,Ex,15,Typ,2,3.0,1154.0,Y,546,484,0,0,0,0,17000,10,2007,New,Partial,183850,-1,-2
2181,2182,908154205,60,RL,40094,Pave,IR1,Bnk,AllPub,Inside,Gtl,Edwards,PosN,PosN,1Fam,2Story,10,5,2007,2008,Hip,CompShg,CemntBd,CmentBd,762.0,Ex,TA,PConc,2260.0,0.0,878.0,3138.0,GasA,Ex,Y,3138,1538,0,4676,1.0,0.0,3,1,3,1,Ex,11,Typ,1,3.0,884.0,Y,208,406,0,0,0,0,0,10,2007,New,Partial,184750,0,-1


These rows have a sale year earlier than the build or remodel year, so let's drop them from the dataframe, along with the original columns that we used to engineer the new features.

In [144]:
df = df.drop([1702, 2180, 2181], axis=0)
df = df.drop(['Yr Sold', 'Year Built', 'Year Remod/Add'], axis=1)
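
Hard-coding the row labels works for this file, but a boolean mask expresses the same intent without them. A minimal sketch on a hypothetical three-row frame (the values are made up to mimic the cases found above):

```python
import pandas as pd

# Hypothetical miniature of the relevant columns.
toy = pd.DataFrame({
    "Yr Sold": [2007, 2010, 2007],
    "Year Built": [2008, 1960, 2007],
    "Year Remod/Add": [2009, 1960, 2008],
})
toy["years_sold"] = toy["Yr Sold"] - toy["Year Built"]
toy["years_since_remod"] = toy["Yr Sold"] - toy["Year Remod/Add"]

# Keep only rows where both engineered features are non-negative.
valid = (toy["years_sold"] >= 0) & (toy["years_since_remod"] >= 0)
toy = toy[valid]
print(toy.index.tolist())
```

Only the middle row remains, since the other two have a negative value in at least one engineered feature.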

In [145]:
df

Unnamed: 0,Order,PID,MS SubClass,MS Zoning,Lot Area,Street,Lot Shape,Land Contour,Utilities,Lot Config,Land Slope,Neighborhood,Condition 1,Condition 2,Bldg Type,House Style,Overall Qual,Overall Cond,Roof Style,Roof Matl,Exterior 1st,Exterior 2nd,Mas Vnr Area,Exter Qual,Exter Cond,Foundation,BsmtFin SF 1,BsmtFin SF 2,Bsmt Unf SF,Total Bsmt SF,Heating,Heating QC,Central Air,1st Flr SF,2nd Flr SF,Low Qual Fin SF,Gr Liv Area,Bsmt Full Bath,Bsmt Half Bath,Full Bath,Half Bath,Bedroom AbvGr,Kitchen AbvGr,Kitchen Qual,TotRms AbvGrd,Functional,Fireplaces,Garage Cars,Garage Area,Paved Drive,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Misc Val,Mo Sold,Sale Type,Sale Condition,SalePrice,years_sold,years_since_remod
0,1,526301100,20,RL,31770,Pave,IR1,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,1Fam,1Story,6,5,Hip,CompShg,BrkFace,Plywood,112.0,TA,TA,CBlock,639.0,0.0,441.0,1080.0,GasA,Fa,Y,1656,0,0,1656,1.0,0.0,1,0,3,1,TA,7,Typ,2,2.0,528.0,P,210,62,0,0,0,0,0,5,WD,Normal,215000,50,50
1,2,526350040,20,RH,11622,Pave,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Feedr,Norm,1Fam,1Story,5,6,Gable,CompShg,VinylSd,VinylSd,0.0,TA,TA,CBlock,468.0,144.0,270.0,882.0,GasA,TA,Y,896,0,0,896,0.0,0.0,1,0,2,1,TA,5,Typ,0,1.0,730.0,Y,140,0,0,0,120,0,0,6,WD,Normal,105000,49,49
2,3,526351010,20,RL,14267,Pave,IR1,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,1Fam,1Story,6,6,Hip,CompShg,Wd Sdng,Wd Sdng,108.0,TA,TA,CBlock,923.0,0.0,406.0,1329.0,GasA,TA,Y,1329,0,0,1329,0.0,0.0,1,1,3,1,Gd,6,Typ,0,1.0,312.0,Y,393,36,0,0,0,0,12500,6,WD,Normal,172000,52,52
3,4,526353030,20,RL,11160,Pave,Reg,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,1Fam,1Story,7,5,Hip,CompShg,BrkFace,BrkFace,0.0,Gd,TA,CBlock,1065.0,0.0,1045.0,2110.0,GasA,Ex,Y,2110,0,0,2110,1.0,0.0,2,1,3,1,Ex,8,Typ,2,2.0,522.0,Y,0,0,0,0,0,0,0,4,WD,Normal,244000,42,42
4,5,527105010,60,RL,13830,Pave,IR1,Lvl,AllPub,Inside,Gtl,Gilbert,Norm,Norm,1Fam,2Story,5,5,Gable,CompShg,VinylSd,VinylSd,0.0,TA,TA,PConc,791.0,0.0,137.0,928.0,GasA,Gd,Y,928,701,0,1629,0.0,0.0,2,1,3,1,TA,6,Typ,1,2.0,482.0,Y,212,34,0,0,0,0,0,3,WD,Normal,189900,13,12
5,6,527105030,60,RL,9978,Pave,IR1,Lvl,AllPub,Inside,Gtl,Gilbert,Norm,Norm,1Fam,2Story,6,6,Gable,CompShg,VinylSd,VinylSd,20.0,TA,TA,PConc,602.0,0.0,324.0,926.0,GasA,Ex,Y,926,678,0,1604,0.0,0.0,2,1,3,1,Gd,7,Typ,1,2.0,470.0,Y,360,36,0,0,0,0,0,6,WD,Normal,195500,12,12
6,7,527127150,120,RL,4920,Pave,Reg,Lvl,AllPub,Inside,Gtl,StoneBr,Norm,Norm,TwnhsE,1Story,8,5,Gable,CompShg,CemntBd,CmentBd,0.0,Gd,TA,PConc,616.0,0.0,722.0,1338.0,GasA,Ex,Y,1338,0,0,1338,1.0,0.0,2,0,2,1,Gd,6,Typ,0,2.0,582.0,Y,0,0,170,0,0,0,0,4,WD,Normal,213500,9,9
7,8,527145080,120,RL,5005,Pave,IR1,HLS,AllPub,Inside,Gtl,StoneBr,Norm,Norm,TwnhsE,1Story,8,5,Gable,CompShg,HdBoard,HdBoard,0.0,Gd,TA,PConc,263.0,0.0,1017.0,1280.0,GasA,Ex,Y,1280,0,0,1280,0.0,0.0,2,0,2,1,Gd,5,Typ,0,2.0,506.0,Y,0,82,0,0,144,0,0,1,WD,Normal,191500,18,18
8,9,527146030,120,RL,5389,Pave,IR1,Lvl,AllPub,Inside,Gtl,StoneBr,Norm,Norm,TwnhsE,1Story,8,5,Gable,CompShg,CemntBd,CmentBd,0.0,Gd,TA,PConc,1180.0,0.0,415.0,1595.0,GasA,Ex,Y,1616,0,0,1616,1.0,0.0,2,0,2,1,Gd,5,Typ,1,2.0,608.0,Y,237,152,0,0,0,0,0,3,WD,Normal,236500,15,14
9,10,527162130,60,RL,7500,Pave,Reg,Lvl,AllPub,Inside,Gtl,Gilbert,Norm,Norm,1Fam,2Story,7,5,Gable,CompShg,VinylSd,VinylSd,0.0,TA,TA,PConc,0.0,0.0,994.0,994.0,GasA,Gd,Y,1028,776,0,1804,0.0,0.0,2,1,3,1,Gd,7,Typ,1,2.0,442.0,Y,140,60,0,0,0,0,0,6,WD,Normal,189000,11,11


Next, we'll drop the columns which aren't useful for machine learning, followed by those which leak data about the final sale.

In [146]:
#dropping columns which aren't useful for ML
df = df.drop(['Order', 'PID'], axis=1)

#dropping columns which leak data about the final sale

df = df.drop(['Mo Sold', 'Sale Type', 'Sale Condition'], axis=1)

Now we'll go ahead and use this to update the transform_features function which was defined earlier.

In [147]:
def transform_features(df):
    #missing values
    missing_vals = df.isnull().sum()
    cols_to_drop = missing_vals[missing_vals > len(df)/20]
    df = df.drop(cols_to_drop.index, axis=1)
    
    #text columns
    #dropping columns which have 1 or more missing values
    text_cols = df.select_dtypes(include=['object'])
    text_cols_mv = text_cols.isnull().sum()
    text_cols_mv = text_cols_mv[text_cols_mv > 0]

    df = df.drop(text_cols_mv.index, axis=1)
    
    #Numerical columns
    #filling in the missing value using the most popular values for that column
    numeric_cols = df.select_dtypes(include=['int','float'])
    numeric_cols_mv = numeric_cols.isnull().sum()

    for col in numeric_cols:
        if df[col].isnull().sum() > 0:
            df[col] = df[col].fillna(df[col].mode()[0])
    
    df['years_sold'] = df['Yr Sold'] - df['Year Built']
    df['years_since_remod'] = df['Yr Sold'] - df['Year Remod/Add']
    
    df = df.drop([1702, 2180, 2181], axis=0)
    df = df.drop(['Yr Sold', 'Year Built', 'Year Remod/Add', 'Order', 'PID', 'Mo Sold', 'Sale Type', 'Sale Condition'], axis=1)

    return df

df = pd.read_csv("AmesHousing.tsv", delimiter="\t")
transform_df = transform_features(df)

# Feature Selection

Now that we have a function which cleans and transforms many of the features in the dataset, it's time to identify which features we should use in the model.  We'll begin by computing the correlations between the numerical features and SalePrice.

In [148]:
numerical_transform_df = transform_df.select_dtypes(include=['int','float'])

In [149]:
numerical_transform_corrs = numerical_transform_df.corr()
sale_price_corrs = numerical_transform_corrs['SalePrice'].abs().sort_values()
sale_price_corrs

BsmtFin SF 2         0.006127
Misc Val             0.019273
3Ssn Porch           0.032268
Bsmt Half Bath       0.035875
Low Qual Fin SF      0.037629
Pool Area            0.068438
MS SubClass          0.085128
Overall Cond         0.101540
Screen Porch         0.112280
Kitchen AbvGr        0.119760
Enclosed Porch       0.128685
Bedroom AbvGr        0.143916
Bsmt Unf SF          0.182751
Lot Area             0.267520
2nd Flr SF           0.269601
Bsmt Full Bath       0.276258
Half Bath            0.284871
Open Porch SF        0.316262
Wood Deck SF         0.328183
BsmtFin SF 1         0.439284
Fireplaces           0.474831
TotRms AbvGrd        0.498574
Mas Vnr Area         0.506983
years_since_remod    0.534985
Full Bath            0.546118
years_sold           0.558979
1st Flr SF           0.635185
Garage Area          0.641425
Total Bsmt SF        0.644012
Garage Cars          0.648361
Gr Liv Area          0.717596
Overall Qual         0.801206
SalePrice            1.000000
Name: Sale

Let's get rid of the features whose absolute correlation with SalePrice is less than 0.4.  The threshold can always be changed later, but it's a reasonable starting point.

In [150]:
sale_price_corrs[sale_price_corrs > 0.4]

BsmtFin SF 1         0.439284
Fireplaces           0.474831
TotRms AbvGrd        0.498574
Mas Vnr Area         0.506983
years_since_remod    0.534985
Full Bath            0.546118
years_sold           0.558979
1st Flr SF           0.635185
Garage Area          0.641425
Total Bsmt SF        0.644012
Garage Cars          0.648361
Gr Liv Area          0.717596
Overall Qual         0.801206
SalePrice            1.000000
Name: SalePrice, dtype: float64

In [151]:
transform_df = transform_df.drop(sale_price_corrs[sale_price_corrs < 0.4].index, axis=1)

Let's now take a look at the nominal variables and see which ones can be converted into categorical variables.  Since we are going to create dummy variables, we don't want to use categorical variables with hundreds of unique values, and we don't want to use variables where most of the data falls into one category, since there is little variability in the data for the model to capture.
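
The cells that follow filter only on the number of unique values. The second criterion, a single dominant category, could be checked with something like the sketch below. The helper `dominant_share`, the 0.9 cutoff, and the toy data are all assumptions for illustration; the notebook itself does not define them.

```python
import pandas as pd

def dominant_share(series):
    """Fraction of rows taken by the most frequent value (hypothetical helper)."""
    return series.value_counts(normalize=True).iloc[0]

# Made-up data: one nearly constant column, one evenly split column.
toy = pd.DataFrame({
    "Street": ["Pave"] * 19 + ["Grvl"],
    "Lot Config": ["Inside"] * 10 + ["Corner"] * 10,
})

# Assumed cutoff for illustration only.
max_share = 0.9
too_uniform = [c for c in toy.columns if dominant_share(toy[c]) > max_share]
print(too_uniform)
```

Here only `Street` is flagged, because 95% of its rows share one value.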

In [152]:
nominal_features = ['PID', 'MS SubClass','MS Zoning', 'Street', 'Alley', 'Land Contour', 
                    'Lot Config', 'Neighborhood', 'Condition 1', 'Condition 2', 'Bldg Type', 
                    'House Style', 'Roof Style', 'Roof Matl', 'Exterior 1st', 'Exterior 2nd', 
                    'Mas Vnr Type', 'Foundation', 'Heating', 'Central Air', 'Garage Type',
                    'Misc Feature', 'Sale Type', 'Sale Condition']

In [153]:
#How many of these columns do we still have
transform_df_features = []

for cat in nominal_features:
    if cat in transform_df:
        transform_df_features.append(cat)
        
transform_df_features

['MS Zoning',
 'Street',
 'Land Contour',
 'Lot Config',
 'Neighborhood',
 'Condition 1',
 'Condition 2',
 'Bldg Type',
 'House Style',
 'Roof Style',
 'Roof Matl',
 'Exterior 1st',
 'Exterior 2nd',
 'Foundation',
 'Heating',
 'Central Air']

Let's now keep only those nominal columns with at most 10 unique values to create dummy variables from.

In [154]:
for col in transform_df:
    if col in transform_df_features:
        unique_vals = transform_df[col].unique()
        if len(unique_vals) > 10:
            transform_df = transform_df.drop([col], axis=1)
            
transform_df


Unnamed: 0,MS Zoning,Street,Lot Shape,Land Contour,Utilities,Lot Config,Land Slope,Condition 1,Condition 2,Bldg Type,House Style,Overall Qual,Roof Style,Roof Matl,Mas Vnr Area,Exter Qual,Exter Cond,Foundation,BsmtFin SF 1,Total Bsmt SF,Heating,Heating QC,Central Air,1st Flr SF,Gr Liv Area,Full Bath,Kitchen Qual,TotRms AbvGrd,Functional,Fireplaces,Garage Cars,Garage Area,Paved Drive,SalePrice,years_sold,years_since_remod
0,RL,Pave,IR1,Lvl,AllPub,Corner,Gtl,Norm,Norm,1Fam,1Story,6,Hip,CompShg,112.0,TA,TA,CBlock,639.0,1080.0,GasA,Fa,Y,1656,1656,1,TA,7,Typ,2,2.0,528.0,P,215000,50,50
1,RH,Pave,Reg,Lvl,AllPub,Inside,Gtl,Feedr,Norm,1Fam,1Story,5,Gable,CompShg,0.0,TA,TA,CBlock,468.0,882.0,GasA,TA,Y,896,896,1,TA,5,Typ,0,1.0,730.0,Y,105000,49,49
2,RL,Pave,IR1,Lvl,AllPub,Corner,Gtl,Norm,Norm,1Fam,1Story,6,Hip,CompShg,108.0,TA,TA,CBlock,923.0,1329.0,GasA,TA,Y,1329,1329,1,Gd,6,Typ,0,1.0,312.0,Y,172000,52,52
3,RL,Pave,Reg,Lvl,AllPub,Corner,Gtl,Norm,Norm,1Fam,1Story,7,Hip,CompShg,0.0,Gd,TA,CBlock,1065.0,2110.0,GasA,Ex,Y,2110,2110,2,Ex,8,Typ,2,2.0,522.0,Y,244000,42,42
4,RL,Pave,IR1,Lvl,AllPub,Inside,Gtl,Norm,Norm,1Fam,2Story,5,Gable,CompShg,0.0,TA,TA,PConc,791.0,928.0,GasA,Gd,Y,928,1629,2,TA,6,Typ,1,2.0,482.0,Y,189900,13,12
5,RL,Pave,IR1,Lvl,AllPub,Inside,Gtl,Norm,Norm,1Fam,2Story,6,Gable,CompShg,20.0,TA,TA,PConc,602.0,926.0,GasA,Ex,Y,926,1604,2,Gd,7,Typ,1,2.0,470.0,Y,195500,12,12
6,RL,Pave,Reg,Lvl,AllPub,Inside,Gtl,Norm,Norm,TwnhsE,1Story,8,Gable,CompShg,0.0,Gd,TA,PConc,616.0,1338.0,GasA,Ex,Y,1338,1338,2,Gd,6,Typ,0,2.0,582.0,Y,213500,9,9
7,RL,Pave,IR1,HLS,AllPub,Inside,Gtl,Norm,Norm,TwnhsE,1Story,8,Gable,CompShg,0.0,Gd,TA,PConc,263.0,1280.0,GasA,Ex,Y,1280,1280,2,Gd,5,Typ,0,2.0,506.0,Y,191500,18,18
8,RL,Pave,IR1,Lvl,AllPub,Inside,Gtl,Norm,Norm,TwnhsE,1Story,8,Gable,CompShg,0.0,Gd,TA,PConc,1180.0,1595.0,GasA,Ex,Y,1616,1616,2,Gd,5,Typ,1,2.0,608.0,Y,236500,15,14
9,RL,Pave,Reg,Lvl,AllPub,Inside,Gtl,Norm,Norm,1Fam,2Story,7,Gable,CompShg,0.0,TA,TA,PConc,0.0,994.0,GasA,Gd,Y,1028,1804,2,Gd,7,Typ,1,2.0,442.0,Y,189000,11,11


In [155]:
transform_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2927 entries, 0 to 2929
Data columns (total 36 columns):
MS Zoning            2927 non-null object
Street               2927 non-null object
Lot Shape            2927 non-null object
Land Contour         2927 non-null object
Utilities            2927 non-null object
Lot Config           2927 non-null object
Land Slope           2927 non-null object
Condition 1          2927 non-null object
Condition 2          2927 non-null object
Bldg Type            2927 non-null object
House Style          2927 non-null object
Overall Qual         2927 non-null int64
Roof Style           2927 non-null object
Roof Matl            2927 non-null object
Mas Vnr Area         2927 non-null float64
Exter Qual           2927 non-null object
Exter Cond           2927 non-null object
Foundation           2927 non-null object
BsmtFin SF 1         2927 non-null float64
Total Bsmt SF        2927 non-null float64
Heating              2927 non-null object
Heating Q

Let's convert all of the object columns into categorical columns first.

In [156]:
text_cols = transform_df.select_dtypes(include=['object'])
for col in text_cols:
    transform_df[col] = transform_df[col].astype('category')
    
transform_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2927 entries, 0 to 2929
Data columns (total 36 columns):
MS Zoning            2927 non-null category
Street               2927 non-null category
Lot Shape            2927 non-null category
Land Contour         2927 non-null category
Utilities            2927 non-null category
Lot Config           2927 non-null category
Land Slope           2927 non-null category
Condition 1          2927 non-null category
Condition 2          2927 non-null category
Bldg Type            2927 non-null category
House Style          2927 non-null category
Overall Qual         2927 non-null int64
Roof Style           2927 non-null category
Roof Matl            2927 non-null category
Mas Vnr Area         2927 non-null float64
Exter Qual           2927 non-null category
Exter Cond           2927 non-null category
Foundation           2927 non-null category
BsmtFin SF 1         2927 non-null float64
Total Bsmt SF        2927 non-null float64
Heating            

In [157]:
dummy_cols = pd.get_dummies(transform_df.select_dtypes(include=['category']))


In [158]:
transform_df = pd.concat([transform_df, dummy_cols], axis=1)
transform_df.shape

(2927, 152)

In [159]:
#dropping the old categorical columns
transform_df = transform_df.drop(text_cols.columns, axis=1)
transform_df.shape

(2927, 130)
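
To see what the dummy encoding does on a small scale, here is a minimal sketch on a hypothetical three-row frame. Each category level becomes its own indicator column, and the original categorical column is then dropped:

```python
import pandas as pd

# Made-up rows for illustration.
toy = pd.DataFrame({
    "Central Air": ["Y", "N", "Y"],
    "SalePrice": [215000, 105000, 172000],
})
toy["Central Air"] = toy["Central Air"].astype("category")

# One indicator column per category level, named <column>_<level>.
dummies = pd.get_dummies(toy.select_dtypes(include=["category"]))
toy = pd.concat([toy, dummies], axis=1).drop("Central Air", axis=1)
print(list(toy.columns))
```

The same pattern explains the shape change above: the column count rises when the indicator columns are appended and falls again once the original categorical columns are removed.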

Let's now update the logic for the select_features() function.

In [163]:
def select_features(df, correlation_threshold=0.4, unique_val_threshold=10):
    numerical_df = df.select_dtypes(include=['int','float'])
    numerical_corrs = numerical_df.corr()
    sale_price_corrs = numerical_corrs['SalePrice'].abs().sort_values()
    df = df.drop(sale_price_corrs[sale_price_corrs < correlation_threshold].index, axis=1)
    
    nominal_features = ['PID', 'MS SubClass','MS Zoning', 'Street', 'Alley', 'Land Contour', 
                    'Lot Config', 'Neighborhood', 'Condition 1', 'Condition 2', 'Bldg Type', 
                    'House Style', 'Roof Style', 'Roof Matl', 'Exterior 1st', 'Exterior 2nd', 
                    'Mas Vnr Type', 'Foundation', 'Heating', 'Central Air', 'Garage Type',
                    'Misc Feature', 'Sale Type', 'Sale Condition']
    
    #How many of these columns do we still have
    df_features = []

    for cat in nominal_features:
        if cat in df:
            df_features.append(cat)
    
    for col in df:
        if col in df_features:
            unique_vals = df[col].unique()
            if len(unique_vals) > unique_val_threshold:
                df = df.drop([col], axis=1)
                
    text_cols = df.select_dtypes(include=['object'])
    for col in text_cols:
        df[col] = df[col].astype('category')
    
    dummy_cols = pd.get_dummies(df.select_dtypes(include=['category']))
    df = pd.concat([df, dummy_cols], axis=1)
    
    #dropping the old categorical columns
    df = df.drop(text_cols.columns, axis=1)
    
    return df

transform_df = transform_features(df)
filtered_df = select_features(transform_df)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2927 entries, 0 to 2929
Columns: 130 entries, Overall Qual to Paved Drive_Y
dtypes: float64(5), int64(9), uint8(116)
memory usage: 674.6 KB


# Training and Testing

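Now the features have been transformed and are ready to be used to train the multiple linear regression model.  First, however, let's change the train_and_test function to take a parameter k that controls the type of validation: k = 0 performs holdout validation, k = 1 performs simple cross validation on two shuffled folds, and any other value performs k-fold cross validation with k folds.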
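
To make the k-fold case concrete, the sketch below shows how the rows are partitioned: every row lands in exactly one test fold, and the remaining folds form the training set. It uses plain NumPy with a hypothetical helper `kfold_indices` rather than scikit-learn's KFold, so the exact fold membership will differ from what KFold produces.

```python
import numpy as np

def kfold_indices(n_rows, k, seed=1):
    """Yield (train_idx, test_idx) pairs; a hypothetical stand-in for KFold.

    Assumes k >= 2.
    """
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    folds = np.array_split(shuffled, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# With 10 rows and 5 folds, each split trains on 8 rows and tests on 2.
for train_idx, test_idx in kfold_indices(n_rows=10, k=5):
    print(len(train_idx), len(test_idx))
```

Averaging the RMSE over all k splits, as the function below does, gives an error estimate that depends less on which rows happen to fall in a single test set.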

In [175]:
def train_and_test(df, k):
    numeric_df = df.select_dtypes(include=['int', 'float'])
    features = numeric_df.columns.drop('SalePrice')
    model = LinearRegression()
    
    if k == 0: #perform holdout validation
        train = df[:1460]
        test = df[1460:]   

    
        model.fit(train[features], train['SalePrice'])
        predictions = model.predict(test[features])
        mse = mean_squared_error(test['SalePrice'], predictions)
        rmse = np.sqrt(mse)
        
        return rmse

    elif k == 1: #simple cross validation
        shuffled_df = df.sample(frac=1)
        fold_one = shuffled_df[:1460]
        fold_two = shuffled_df[1460:]   
        
        model.fit(fold_one[features], fold_one['SalePrice'])
        predictions_one = model.predict(fold_two[features])
        mse_one = mean_squared_error(fold_two['SalePrice'], predictions_one)
        rmse_one = np.sqrt(mse_one)
        
        model.fit(fold_two[features], fold_two['SalePrice'])
        predictions_two = model.predict(fold_one[features])
        mse_two = mean_squared_error(fold_one['SalePrice'], predictions_two)
        rmse_two = np.sqrt(mse_two)
        avg_rmse = np.mean([rmse_one, rmse_two])
        
        return avg_rmse
    
    else: #k-fold cross validation
        
        kf = KFold(n_splits=k, shuffle=True, random_state = 1)
        rmse_values = []
        for train_index, test_index in kf.split(df):
            train = df.iloc[train_index]
            test = df.iloc[test_index]
            model.fit(train[features], train['SalePrice'])
            predictions = model.predict(test[features])
            mse = mean_squared_error(test['SalePrice'], predictions)
            rmse = np.sqrt(mse)
            rmse_values.append(rmse)
            
        avg_rmse = sum(rmse_values)/len(rmse_values)
        
        return avg_rmse
                                      

transform_df = transform_features(df)
filtered_df = select_features(transform_df)
rmse = train_and_test(filtered_df, k=0)
print(rmse)


36623.53562910476


In [176]:
rmse = train_and_test(filtered_df, k=1)
print(rmse)

34117.17016154507


In [177]:
rmse = train_and_test(filtered_df, k=5)
print(rmse)

32689.6616480942


# Conclusion

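A multiple linear regression model was trained on the cleaned and transformed dataset, and its error was estimated with holdout validation and cross validation.  Moving from the single-feature baseline to the engineered feature set reduced the holdout RMSE from about 57,088 to about 36,624, and 5-fold cross validation gave an average RMSE of about 32,690.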