# Ames Housing Prices 

This notebook demonstrates supervised machine learning applied to a regression problem with a continuous outcome (sale price). The example uses the Ames Housing dataset compiled by Dean De Cock.

## 1. Load necessary packages

I will use pandas to store the data and several scikit-learn modules to train and evaluate the models.

In [25]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn import linear_model
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
import warnings
warnings.filterwarnings('ignore')




## 2. Preprocessing

Cleaning the data requires handling missing values in several columns and encoding categorical features. In several features, a missing value most likely means that the feature does not exist (a house with no alley access, for example), so missing values will be filled with zeros rather than the mean or median.

In [2]:
filepath = './data/AmesHousing.csv'
data = pd.read_csv(filepath)
data.head()

Unnamed: 0.1,Unnamed: 0,Order,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,...,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,Sale Condition,SalePrice
0,0,1,526301100,20,RL,141.0,31770,Pave,,IR1,...,0,,,,0,5,2010,WD,Normal,215000
1,1,2,526350040,20,RH,80.0,11622,Pave,,Reg,...,0,,MnPrv,,0,6,2010,WD,Normal,105000
2,2,3,526351010,20,RL,81.0,14267,Pave,,IR1,...,0,,,Gar2,12500,6,2010,WD,Normal,172000
3,3,4,526353030,20,RL,93.0,11160,Pave,,Reg,...,0,,,,0,4,2010,WD,Normal,244000
4,4,5,527105010,60,RL,74.0,13830,Pave,,IR1,...,0,,MnPrv,,0,3,2010,WD,Normal,189900


### Convert MS SubClass type to object

In [3]:
data['MS SubClass'] = data['MS SubClass'].astype('object')

### Transform categorical variables

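I will use the pandas **get_dummies** function to apply one-hot encoding to the categorical variables. Setting **drop_first=True** drops the first level of each variable, which then acts as the reference category.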

In [4]:
# MS SubClass
dummies = pd.get_dummies(data['MS SubClass'], prefix = 'SubClass', drop_first=True)
data = pd.concat([data, dummies], axis=1)
data = data.drop(['MS SubClass'], axis = 1)

In [5]:
# Neighborhood
dummies = pd.get_dummies(data['Neighborhood'], drop_first=True)
data = pd.concat([data, dummies], axis=1)
data = data.drop(['Neighborhood'], axis = 1)

In [6]:
# Lot Config
dummies = pd.get_dummies(data['Lot Config'], drop_first=True)
data = pd.concat([data, dummies], axis=1)
data = data.drop(['Lot Config'], axis = 1)

In [7]:
# Building Type
dummies = pd.get_dummies(data['Bldg Type'], drop_first=True)
data = pd.concat([data, dummies], axis=1)
data = data.drop(['Bldg Type'], axis = 1)

In [8]:
# Basement Finish
dummies = pd.get_dummies(data['BsmtFin Type 1'], drop_first=True)
data = pd.concat([data, dummies], axis=1)
data = data.drop(['BsmtFin Type 1'], axis = 1)

In [9]:
# Fence
dummies = pd.get_dummies(data['Fence'], drop_first=True)
data = pd.concat([data, dummies], axis=1)
data = data.drop(['Fence'], axis = 1)

In [10]:
# Sale Type
dummies = pd.get_dummies(data['Sale Type'], drop_first=True)
data = pd.concat([data, dummies], axis=1)
data = data.drop(['Sale Type'], axis = 1)

In [11]:
# Sale Condition
dummies = pd.get_dummies(data['Sale Condition'], drop_first=True)
data = pd.concat([data, dummies], axis=1)
data = data.drop(['Sale Condition'], axis = 1)

In [12]:
# Kitchen Quality
dummies = pd.get_dummies(data['Kitchen Qual'], prefix = 'Kitchen', drop_first=True)
data = pd.concat([data, dummies], axis=1)
data = data.drop(['Kitchen Qual'], axis = 1)
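
The cells above repeat the same three steps for each variable, so they could be collapsed into a loop. The sketch below shows one possible way to do that on a small made-up frame. The helper name `one_hot_encode` and the toy values are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def one_hot_encode(df, column, prefix=None):
    """Hypothetical helper: replace `column` with one-hot dummies, dropping the first level."""
    dummies = pd.get_dummies(df[column], prefix=prefix, drop_first=True)
    return pd.concat([df.drop([column], axis=1), dummies], axis=1)

# Toy stand-in for the Ames data (values are made up for illustration)
toy = pd.DataFrame({
    'Kitchen Qual': ['TA', 'Gd', 'Ex', 'TA'],
    'Bldg Type': ['1Fam', 'Duplex', '1Fam', 'Twnhs'],
    'SalePrice': [100000, 150000, 200000, 120000],
})

# (column, prefix) pairs; prefix=None keeps the bare category names as column names
for column, prefix in [('Kitchen Qual', 'Kitchen'), ('Bldg Type', None)]:
    toy = one_hot_encode(toy, column, prefix)

print(list(toy.columns))
```

Each encoded variable loses its first level (here `Ex` and `1Fam`), so a row with zeros in all of a variable's dummy columns belongs to that reference level.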

### Fill missing values with zeros

In [13]:
data.fillna(0, inplace = True)
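
Before filling, it can be useful to check which columns actually contain missing values. The following is a minimal sketch on a made-up frame; the column names mirror the dataset, but the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps (values are made up for illustration)
toy = pd.DataFrame({
    'Garage Area': [528.0, np.nan, 312.0],
    'Mas Vnr Area': [112.0, 0.0, np.nan],
    'Lot Area': [31770, 11622, 14267],
})

# Count missing values per column, keeping only the columns that have any
missing = toy.isnull().sum()
missing = missing[missing > 0]
print(missing)

# Fill with zeros, on the assumption that "missing" means "feature absent"
filled = toy.fillna(0)
```

The zero fill is only appropriate where absence is a plausible reading of the gap; a column where a missing value means "not recorded" may call for a different strategy.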

### Combine various bathrooms into a single feature

In [14]:
data['Bath'] = ((data['Bsmt Full Bath']) + (0.5 * data['Bsmt Half Bath']) + (data['Full Bath']) + (0.5 * data['Half Bath']))
data = data.drop(['Bsmt Half Bath', 'Bsmt Full Bath','Half Bath', 'Full Bath'], axis = 1)

### Examine cleaned data

In [15]:
data.head()

Unnamed: 0.1,Unnamed: 0,Order,PID,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,AdjLand,Alloca,Family,Normal,Partial,Kitchen_Fa,Kitchen_Gd,Kitchen_Po,Kitchen_TA,Bath
0,0,1,526301100,RL,141.0,31770,Pave,0,IR1,Lvl,...,0,0,0,1,0,0,0,0,1,2.0
1,1,2,526350040,RH,80.0,11622,Pave,0,Reg,Lvl,...,0,0,0,1,0,0,0,0,1,1.0
2,2,3,526351010,RL,81.0,14267,Pave,0,IR1,Lvl,...,0,0,0,1,0,0,1,0,0,1.5
3,3,4,526353030,RL,93.0,11160,Pave,0,Reg,Lvl,...,0,0,0,1,0,0,0,0,0,3.5
4,4,5,527105010,RL,74.0,13830,Pave,0,IR1,Lvl,...,0,0,0,1,0,0,0,0,1,2.5


## 3. Split into test and train sets

I use scikit-learn's train_test_split to randomly assign 75% of the rows to the train set and 25% to the test set. Fixing random_state makes the split reproducible.


In [16]:
train, test = train_test_split(data, test_size=0.25, random_state=1)
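
As a quick check of what this call produces, the sketch below splits a placeholder frame of the same assumed size. The row count of 2,930 is an assumption about the full dataset; it is consistent with the 2,197 training rows reported in the summary table in section 4.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder frame with an assumed 2,930 rows (not the real data)
toy = pd.DataFrame({'x': np.arange(2930)})

train_toy, test_toy = train_test_split(toy, test_size=0.25, random_state=1)
print(len(train_toy), len(test_toy))

# Repeating the call with the same seed should reproduce the same rows
train_again, _ = train_test_split(toy, test_size=0.25, random_state=1)
print(train_toy.index.equals(train_again.index))
```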

## 4. Feature Selection

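The Pearson correlation between each numeric feature in the train set and **SalePrice** can be computed with the **corr** method from pandas. Features with an absolute correlation of 0.05 or less are discarded, and the 25 features with the largest absolute correlations are selected for the model.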

In [35]:
c = train[train.columns[1:]].corr(method = 'pearson')['SalePrice'][:].sort_values(ascending=False)

correlations = []
features = []
ratings = zip(c.keys(), c.tolist())
for r in ratings:
    if abs(r[1]) > 0.05 and r[0] != 'SalePrice':
        correlations.append((r[0], abs(r[1])))

correlations.sort(key = lambda x: x[1], reverse=True)

for cor in correlations[:25]:
    features.append(cor[0])
    print(cor)

('Overall Qual', 0.7999283301254475)
('Gr Liv Area', 0.6996746308182443)
('Garage Cars', 0.6436816371817263)
('Garage Area', 0.6370891457076133)
('Bath', 0.6368318033295597)
('Total Bsmt SF', 0.6296049620039135)
('1st Flr SF', 0.6191638606412277)
('Year Built', 0.5599749453883134)
('Year Remod/Add', 0.5313409904270173)
('Kitchen_TA', 0.527946447954556)
('Mas Vnr Area', 0.5069010919982276)
('TotRms AbvGrd', 0.4863745955039656)
('Fireplaces', 0.48502939143586576)
('GLQ', 0.45937977697480753)
('NridgHt', 0.44333351903803686)
('BsmtFin SF 1', 0.4403013896955913)
('SubClass_60', 0.36090260664133)
('Wood Deck SF', 0.3436444274167528)
('New', 0.33895103888572814)
('Partial', 0.33192923192680546)
('Open Porch SF', 0.3206823656039884)
('NoRidge', 0.30018251959027253)
('Kitchen_Gd', 0.2969177912235974)
('Lot Area', 0.2693210355060212)
('2nd Flr SF', 0.2648897045205262)
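
The same selection can also be written with pandas Series methods instead of an explicit loop. This is a sketch on synthetic data, not the notebook's own code: the column names `strong`, `moderate`, and `noise` and the coefficients are made up, and only the top 2 features are kept rather than 25.

```python
import numpy as np
import pandas as pd

# Synthetic data: two informative columns and one pure-noise column
rng = np.random.RandomState(0)
n = 200
toy = pd.DataFrame({
    'strong': rng.normal(size=n),
    'moderate': rng.normal(size=n),
    'noise': rng.normal(size=n),
})
toy['SalePrice'] = (5.0 * toy['strong'] - 3.0 * toy['moderate']
                    + rng.normal(scale=0.5, size=n))

# Absolute Pearson correlation of every feature with the target
corr = toy.corr(method='pearson')['SalePrice'].drop('SalePrice').abs()

# Apply the 0.05 floor, then keep the k largest (k=2 here, 25 in the notebook)
top = corr[corr > 0.05].nlargest(2)
features_toy = top.index.tolist()
print(top)
```

Taking the absolute value matters: `moderate` is negatively correlated with the target here, yet it is still selected because the ranking uses the magnitude of the correlation.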


In [21]:
train[features].describe()

Unnamed: 0,Overall Qual,Gr Liv Area,Garage Cars,Garage Area,Bath,Total Bsmt SF,1st Flr SF,Year Built,Year Remod/Add,Kitchen_TA,...,BsmtFin SF 1,SubClass_60,Wood Deck SF,New,Partial,Open Porch SF,NoRidge,Kitchen_Gd,Lot Area,2nd Flr SF
count,2197.0,2197.0,2197.0,2197.0,2197.0,2197.0,2197.0,2197.0,2197.0,2197.0,...,2197.0,2197.0,2197.0,2197.0,2197.0,2197.0,2197.0,2197.0,2197.0,2197.0
mean,6.116523,1507.873464,1.781065,476.498862,2.234638,1061.537551,1171.744652,1972.030951,1984.781065,0.502959,...,445.299499,0.199818,92.436959,0.084206,0.086482,48.76832,0.023669,0.395995,10268.070096,331.93218
std,1.429613,511.844406,0.763373,216.189472,0.810983,456.993558,398.204596,30.450069,20.898245,0.500105,...,465.836158,0.399954,121.422842,0.277759,0.281138,69.127588,0.152049,0.489175,7975.740885,430.76106
min,1.0,334.0,0.0,0.0,1.0,0.0,334.0,1872.0,1950.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1300.0,0.0
25%,5.0,1142.0,1.0,336.0,2.0,793.0,882.0,1954.0,1966.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7500.0,0.0
50%,6.0,1452.0,2.0,480.0,2.0,998.0,1097.0,1975.0,1994.0,1.0,...,368.0,0.0,0.0,0.0,0.0,28.0,0.0,0.0,9488.0,0.0
75%,7.0,1762.0,2.0,577.0,3.0,1338.0,1422.0,2002.0,2004.0,1.0,...,739.0,0.0,168.0,0.0,0.0,72.0,0.0,1.0,11660.0,702.0
max,10.0,5642.0,5.0,1488.0,7.0,6110.0,5095.0,2010.0,2010.0,1.0,...,5644.0,1.0,870.0,1.0,1.0,742.0,1.0,1.0,215245.0,2065.0


## 5. Fit and Evaluate Models

I will experiment with three models: ridge regression, a decision tree regressor, and a random forest regressor. Each is scored by root mean squared error (RMSE) on the test set.

In [32]:
X_train = train[features]
Y_train = train['SalePrice']
X_test = test[features]
Y_test = test['SalePrice']

### Ridge Regressor

In [33]:
lm = linear_model.Ridge(alpha=10).fit(X_train, Y_train)
Y_hat_1 = lm.predict(X_test)

print('RMSE = ', np.sqrt(metrics.mean_squared_error(Y_test, Y_hat_1)))

RMSE =  27585.49326995055
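
RMSE is the square root of the mean squared difference between actual and predicted values, so it is expressed in the same units as the target (dollars here). The sketch below computes it by hand on three made-up prices and compares the result with the scikit-learn expression used in this notebook.

```python
import numpy as np
from sklearn import metrics

# Made-up actual and predicted sale prices; every prediction is off by $10,000
y_true = np.array([200000.0, 150000.0, 310000.0])
y_pred = np.array([210000.0, 140000.0, 300000.0])

# RMSE computed directly from its definition
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# RMSE computed the way the notebook does it
rmse_sklearn = np.sqrt(metrics.mean_squared_error(y_true, y_pred))

print(rmse_manual, rmse_sklearn)
```

Because errors are squared before averaging, a few large misses raise RMSE more than many small ones.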


### Decision Tree Regressor

In [31]:
dtr = DecisionTreeRegressor(max_leaf_nodes=100, random_state=1).fit(X_train, Y_train)
Y_hat_2 = dtr.predict(X_test)

print('RMSE = ', np.sqrt(metrics.mean_squared_error(Y_test, Y_hat_2)))

RMSE =  31980.513495970943


### Random Forest Regressor

In [29]:
rfm = RandomForestRegressor(n_estimators=1000, max_depth=30, random_state=2).fit(X_train, Y_train)
Y_hat_3 = rfm.predict(X_test)

print('RMSE = ', np.sqrt(metrics.mean_squared_error(Y_test, Y_hat_3)))

RMSE =  23716.760442302577


The Random Forest Regressor performed best, with the lowest test-set RMSE (about 23,717, compared with 27,585 for ridge regression and 31,981 for the decision tree).
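
The three fit-and-score cells follow the same pattern, so the comparison could be run in a single loop. The sketch below does this on synthetic data from scikit-learn's make_regression, so its numbers and ranking say nothing about the Ames results. The forest uses 50 trees rather than 1000 purely to keep the example fast.

```python
import numpy as np
from sklearn import linear_model, metrics
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the selected Ames features
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

models = {
    'ridge': linear_model.Ridge(alpha=10),
    'tree': DecisionTreeRegressor(max_leaf_nodes=100, random_state=1),
    'forest': RandomForestRegressor(n_estimators=50, random_state=2),
}

# Fit each model on the train split and record its test-set RMSE
results = {}
for name, model in models.items():
    preds = model.fit(X_tr, y_tr).predict(X_te)
    results[name] = np.sqrt(metrics.mean_squared_error(y_te, preds))

# Report from lowest to highest RMSE
for name, rmse in sorted(results.items(), key=lambda kv: kv[1]):
    print(name, round(rmse, 2))
```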