### Approach
* Load and clean the data. Fill in empty values where appropriate. Convert categorical features to numerical ones (via dictionaries and one-hot encoding). Create additional features as well.
* Identify the features with the highest correlation to SalePrice, then split the data using train_test_split. Run a simple linear regression on each of the two most correlated features individually, and then on both together.
* Use a Pipeline and Grid Search for all other tests. Apply the StandardScaler to standardize the features (so that no feature dominates the model merely because of the magnitude of its units).
* Run a linear regression using PolynomialFeatures, a Ridge regression and a Lasso regression.
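
The scaling step in the list above can be illustrated with a minimal sketch. The four rows are made-up values (living area in square feet, garage spaces), not rows from the Ames data; the point is only that both columns end up on the same scale.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales:
# living area in square feet and number of garage spaces.
X = np.array([
    [1500.0, 1.0],
    [2100.0, 2.0],
    [900.0, 0.0],
    [3000.0, 3.0],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# After scaling, every column has mean 0 and unit variance.
col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)
print(col_means.round(6), col_stds.round(6))
```

Inside a Pipeline the scaler is fitted on the training folds only, which is one reason to prefer the pipeline over scaling the whole data set up front.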

### Findings
* The worst results came from the single-feature linear regression models on unscaled data (test RMSE of roughly 60,100 for GrLivArea and 64,800 for GarageCars). Combining the top two features improved on those noticeably (RMSE of roughly 52,100).
* A linear regression grid search with the StandardScaler and PolynomialFeatures improved only modestly on the two-feature linear regression model (RMSE of roughly 41,700, with the search selecting degree 3).
* The Ridge and Lasso regression models (with the StandardScaler and PolynomialFeatures) produced strikingly similar results (RMSE of roughly 34,000 and a test score of about 0.84 for both, with the search selecting degree 1 in each case), which were meaningfully better than the linear regression grid search.

#### Step 1: Load the Data, Fillna & Convert Categorical Data to Numeric (where possible).
We can encode the ordinal categorical features so that they follow the data dictionary: https://ww2.amstat.org/publications/jse/v19n3/decock/datadocumentation.txt
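
As a rough sketch of the dictionary-based encoding, the repeated fill-and-replace pattern could be wrapped in a small helper. The names `QUALITY_SCALE` and `encode_quality` are hypothetical and do not appear in the notebook, and the toy column only stands in for a feature such as FireplaceQu.

```python
import pandas as pd

# Shared quality scale from the data dictionary (assumed here);
# 0 is used for "feature absent".
QUALITY_SCALE = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

def encode_quality(series, scale=QUALITY_SCALE, missing=0):
    """Map ordinal labels to integers; missing values become `missing`."""
    return series.map(scale).fillna(missing).astype(int)

# Toy column standing in for something like FireplaceQu
toy = pd.Series(["Gd", None, "TA", "Ex", None])
encoded = encode_quality(toy)
print(encoded.tolist())
```

One difference from `DataFrame.replace` is worth noting: `Series.map` turns any label that is not in the dictionary into a missing value, so a misspelled category would silently become 0 here rather than staying as a string.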

In [32]:
#%matplotlib notebook
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.formula.api as smf

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline

In [33]:
ames = pd.read_csv('data/ames_housing.csv')

In [34]:
ames['Alley'].value_counts()
ames['Alley'] = ames['Alley'].fillna("None")
ames = ames.replace({"Alley": {"None": 0, "Grvl": 1, "Pave": 2}})

In [35]:
ames['FireplaceQu'].value_counts()
ames['FireplaceQu'] = ames['FireplaceQu'].fillna("None")
ames = ames.replace({"FireplaceQu": {"None": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}})

In [36]:
LF_mean=ames.LotFrontage.mean()
ames['LotFrontage'] = ames['LotFrontage'].fillna(LF_mean)

In [37]:
ames['PoolQC'].value_counts()
ames['PoolQC'] = ames['PoolQC'].fillna("None")
ames = ames.replace({"PoolQC": {"None": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}})

In [38]:
ames['BsmtCond'].value_counts()
ames = ames.replace({"BsmtCond": {"No": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}})
ames['BsmtCond'] = ames['BsmtCond'].fillna(0)

In [39]:
ames['BsmtQual'].value_counts()
ames = ames.replace({"BsmtQual" : {"No" : 0, "Po" : 1, "Fa" : 2, "TA": 3, "Gd" : 4, "Ex" : 5}})
ames['BsmtQual'] = ames['BsmtQual'].fillna(0)

In [40]:
ames['Fence'].value_counts()
ames['Fence'] = ames['Fence'].fillna("NA")
ames = ames.replace({"Fence" : {"NA" : 1, "MnWw" : 2, "GdWo": 3, "MnPrv" : 4, "GdPrv" : 5}})

In [41]:
ames['BsmtExposure'].value_counts()
ames['BsmtExposure'] = ames['BsmtExposure'].fillna("None")
ames = ames.replace({"BsmtExposure" : {"None" : 1, "No" : 2, "Mn": 3, "Av" : 4, "Gd" : 5}})

In [42]:
ames['BsmtFinType1'].value_counts()
ames['BsmtFinType1'] = ames['BsmtFinType1'].fillna("None")
ames = ames.replace({"BsmtFinType1" : {"None" : 1,"NA" : 2,"Unf" : 3,"LwQ" : 4, "Rec" : 5, "BLQ": 6, "ALQ" : 7, "GLQ" : 8}})

In [43]:
ames['BsmtFinType2'].value_counts()
ames['BsmtFinType2'] = ames['BsmtFinType2'].fillna("None")
ames = ames.replace({"BsmtFinType2" : {"None" : 1,"NA" : 2,"Unf" : 3,"LwQ" : 4, "Rec" : 5, "BLQ": 6, "ALQ" : 7, "GLQ" : 8}})

In [44]:
ames['Electrical'].value_counts()
ames['Electrical'] = ames['Electrical'].fillna("None")
ames = ames.replace({"Electrical" : {"None" : 0,"Mix" : 1, "FuseP" : 2, "FuseF": 3, "FuseA" : 4, "SBrkr" : 5}})

In [45]:
# since we're going to create dummy variables... and since these variables have missing values
# I won't fillna... instead, when I create dummy variables, I won't drop one.
    #ames['MasVnrType'].value_counts()
    #ames['MasVnrType'] = ames['MasVnrType'].fillna("None")

    #ames['GarageType'].value_counts()
    #ames['GarageType'] = ames['GarageType'].fillna("None")

    #ames['MiscFeature'].value_counts()
    #ames['MiscFeature'] = ames['MiscFeature'].fillna("None")
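
The reasoning in the comments above can be checked on a toy column. This is only a sketch: the values are invented, and the column merely stands in for something like GarageType.

```python
import numpy as np
import pandas as pd

# Toy column standing in for something like GarageType
toy = pd.Series(["Attchd", np.nan, "Detchd", "Attchd"], name="GarageType")

# Default behaviour: no indicator column is created for missing values
dummies = pd.get_dummies(toy, prefix="Garage")

# The row with the missing value has no indicator set, so "all zeros"
# effectively plays the role of the baseline category.
print(dummies.astype(int))
row_sums = dummies.astype(int).sum(axis=1).tolist()
```

Because the all-zero row already acts as a reference level, dropping one of the real categories as well is not needed for these columns.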

In [46]:
ames['GarageQual'].value_counts()
ames = ames.replace({"GarageQual": {"No": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}})
ames['GarageQual'] = ames['GarageQual'].fillna(0)

In [47]:
ames['GarageFinish'].value_counts()
ames = ames.replace({"GarageFinish": {"Unf": 1, "RFn": 2, "Fin": 3}})
ames['GarageFinish'] = ames['GarageFinish'].fillna(0)

In [48]:
ames['GarageCond'].value_counts()
ames = ames.replace({"GarageCond": {"No": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}})
ames['GarageCond'] = ames['GarageCond'].fillna(0)

In [49]:
ames['MasVnrArea'] = ames['MasVnrArea'].fillna(0)

In [50]:
GYB_median=ames.GarageYrBlt.median()
ames['GarageYrBlt'] = ames['GarageYrBlt'].fillna(GYB_median)

In [51]:
# create dummy variables
masonry_dummies = pd.get_dummies(ames.MasVnrType, prefix='Masonry')
misc_dummies = pd.get_dummies(ames.MiscFeature, prefix='MiscFeature')
garage_dummies = pd.get_dummies(ames.GarageType, prefix='Garage')
zoning_dummies = pd.get_dummies(ames.MSZoning, drop_first=True, prefix='Zoning')

ames=ames.join(masonry_dummies)
ames=ames.join(misc_dummies)
ames=ames.join(garage_dummies)
ames=ames.join(zoning_dummies)
ames=ames.drop('MasVnrType',axis=1)
ames=ames.drop('MiscFeature',axis=1)
ames=ames.drop('GarageType',axis=1)
ames=ames.drop('MSZoning',axis=1)

In [52]:
# add some new features (not using ordinal features since it's difficult to create a formula)... and drop the originals

ames['OverallGrade'] = ames['OverallQual'] * ames['OverallCond']
ames['BasementOverall'] = ames['BsmtCond'] * ames['BsmtQual']
ames['GarageOverall'] = ames['GarageQual'] * ames['GarageCond']
ames['PoolOverall'] = ames['PoolArea'] * ames['PoolQC']

ames=ames.drop('OverallQual',axis=1)
ames=ames.drop('OverallCond',axis=1)
ames=ames.drop('BsmtCond',axis=1)
ames=ames.drop('BsmtQual',axis=1)
ames=ames.drop('GarageQual',axis=1)
ames=ames.drop('GarageCond',axis=1)
ames=ames.drop('PoolArea',axis=1)
ames=ames.drop('PoolQC',axis=1)

In [53]:
ames.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 91 columns):
Id                  1460 non-null int64
MSSubClass          1460 non-null int64
LotFrontage         1460 non-null float64
LotArea             1460 non-null int64
Street              1460 non-null object
Alley               1460 non-null int64
LotShape            1460 non-null object
LandContour         1460 non-null object
Utilities           1460 non-null object
LotConfig           1460 non-null object
LandSlope           1460 non-null object
Neighborhood        1460 non-null object
Condition1          1460 non-null object
Condition2          1460 non-null object
BldgType            1460 non-null object
HouseStyle          1460 non-null object
YearBuilt           1460 non-null int64
YearRemodAdd        1460 non-null int64
RoofStyle           1460 non-null object
RoofMatl            1460 non-null object
Exterior1st         1460 non-null object
Exterior2nd         1460 non-null object
...

In [54]:
# eliminate all non-numeric features (bool is kept because newer pandas
# versions emit boolean dummy columns from get_dummies)
ames = ames.select_dtypes(include=['number', 'bool'])

#### Step 2

Identify the features that have the highest correlation with SalePrice.

In [55]:
corr_mat = ames.corr()

In [56]:
corr_mat['SalePrice'].sort_values(ascending=False)[:21]

SalePrice          1.000000
GrLivArea          0.708624
GarageCars         0.640409
GarageArea         0.623431
TotalBsmtSF        0.613581
1stFlrSF           0.605852
BasementOverall    0.571438
OverallGrade       0.565294
FullBath           0.560664
GarageFinish       0.549247
TotRmsAbvGrd       0.533723
YearBuilt          0.522897
FireplaceQu        0.520438
YearRemodAdd       0.507101
MasVnrArea         0.472614
Fireplaces         0.466929
GarageYrBlt        0.466754
BsmtFinSF1         0.386420
BsmtExposure       0.374696
Garage_Attchd      0.335961
LotFrontage        0.334901
Name: SalePrice, dtype: float64
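
The ranking above could be wrapped in a small helper. The sketch below uses synthetic data, and the function name `top_correlated` is hypothetical; it is not part of the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for the housing data: the target depends strongly
# on "a", weakly on "b", and not at all on "c".
df = pd.DataFrame({
    "a": rng.normal(size=n),
    "b": rng.normal(size=n),
    "c": rng.normal(size=n),
})
df["target"] = 3.0 * df["a"] + 0.5 * df["b"] + rng.normal(scale=0.5, size=n)

def top_correlated(frame, target, k):
    """Return the k features most positively correlated with `target`."""
    corr = frame.corr()[target].drop(target)
    return corr.sort_values(ascending=False).head(k)

top = top_correlated(df, "target", k=2)
print(top)
```

Sorting the signed correlation, as the notebook does, places strongly negative features at the bottom of the list; sorting on the absolute value would be one way to surface those as well.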

#### Step 3

Split the data and start running some basic models.

In [57]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.linear_model import Ridge, Lasso, ElasticNet

In [58]:
## train test split
# no random_state is set, so the split (and every score below) changes from run to run
y=ames['SalePrice']
X=ames.drop('SalePrice', axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [59]:
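def basic_linear_reg(X_in,y_in,X_out,y_out):
    # fit on the training split
    lr=LinearRegression()
    lr.fit(X_in,y_in)
    # note: this score is R^2 on the training data, not on the test data
    score=lr.score(X_in,y_in)
    # RMSE is computed on the held-out split
    y_predict=lr.predict(X_out)
    rmse=np.sqrt(mean_squared_error(y_out,y_predict))
    print("rmse=",rmse," (score={:.3f}".format(score),")")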

In [60]:
# run three basic linear regression models without scaling the data
basic_linear_reg(X_train['GrLivArea'].values.reshape(-1,1),y_train,X_test['GrLivArea'].values.reshape(-1,1),y_test)
basic_linear_reg(X_train['GarageCars'].values.reshape(-1,1),y_train,X_test['GarageCars'].values.reshape(-1,1),y_test)
basic_linear_reg(X_train[['GrLivArea','GarageCars']],y_train,X_test[['GrLivArea','GarageCars']],y_test)

rmse= 60111.46691981992  (score=0.502 )
rmse= 64811.72311125428  (score=0.406 )
rmse= 52117.42365964715  (score=0.624 )
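
The helper above prints R² for the training split next to RMSE for the test split, so the two numbers are not directly comparable. A possible variant that reports both splits is sketched below on synthetic data; the function name `evaluate_linear_reg` and the fixed `random_state` are assumptions for illustration, not part of the notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data in place of the Ames features
X, y = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def evaluate_linear_reg(X_in, y_in, X_out, y_out):
    """Fit on the training split; return train R^2, test R^2 and test RMSE."""
    lr = LinearRegression().fit(X_in, y_in)
    rmse = np.sqrt(mean_squared_error(y_out, lr.predict(X_out)))
    return {
        "train_r2": lr.score(X_in, y_in),
        "test_r2": lr.score(X_out, y_out),
        "test_rmse": rmse,
    }

metrics = evaluate_linear_reg(X_tr, y_tr, X_te, y_te)
print(metrics)
```

Returning a dictionary rather than printing makes it easier to collect the results of several models into one table.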


In [61]:
def process_data(model,params):
    # uses the global X_train, X_test, y_train, y_test from the split above
    pipe=make_pipeline(StandardScaler(),PolynomialFeatures(), model)
    grid = GridSearchCV(pipe,param_grid=params, cv=3)   # 3-fold cross-validated search over params
    grid.fit(X_train,y_train)
    best=grid.best_estimator_
    # score and RMSE are both computed on the held-out test split
    score=best.score(X_test,y_test)
    prediction=best.predict(X_test)
    rmse=np.sqrt(mean_squared_error(y_test,prediction))
    best_params=str(grid.best_params_)
    print("rmse=",rmse," (score={:.3f}".format(score),"); ",best_params)

In [62]:
# pass a linear regression model and parameters.  data will be scaled
params = {'polynomialfeatures__degree':[1,2,3]}
process_data(LinearRegression(),params)

rmse= 41664.55184299176  (score=0.759 );  {'polynomialfeatures__degree': 3}
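
The key `polynomialfeatures__degree` follows from how `make_pipeline` names its steps. A short sketch, independent of the housing data, shows where such keys come from; Ridge is used here only so that an `alpha` parameter is visible too.

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

pipe = make_pipeline(StandardScaler(), PolynomialFeatures(), Ridge())

# make_pipeline names each step after its lowercased class name
step_names = [name for name, _ in pipe.steps]
print(step_names)

# Grid-search keys take the form "<step name>__<parameter>"
tunable = [key for key in pipe.get_params()
           if key.endswith(("__degree", "__alpha"))]
print(sorted(tunable))
```

Calling `pipe.get_params()` is a quick way to look up the valid keys before writing a parameter grid.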


In [63]:
# pass a ridge regression model and parameters.  data will be scaled
params = {'ridge__alpha':[10,20,30],'polynomialfeatures__degree':[1,2,3]}
process_data(Ridge(),params);

rmse= 33971.30275790455  (score=0.840 );  {'polynomialfeatures__degree': 1, 'ridge__alpha': 30}


In [64]:
# pass a lasso regression model and parameters.  data will be scaled
params = {'lasso__alpha':[15,20,25],'polynomialfeatures__degree':[1,2]}
process_data(Lasso(),params);



rmse= 34010.85465191278  (score=0.839 );  {'lasso__alpha': 25, 'polynomialfeatures__degree': 1}
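
Both regularized searches settled on `degree` 1. The small sketch below, on a single made-up row, shows what that means in practice: at degree 1 PolynomialFeatures adds only a constant column, so those best models are effectively fitted on the scaled original features.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])

# degree=1: a bias column plus the original features
deg1 = PolynomialFeatures(degree=1).fit_transform(X)

# degree=2: additionally the squares and the pairwise interaction term
deg2 = PolynomialFeatures(degree=2).fit_transform(X)

print(deg1)
print(deg2)
```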
