# Overfitting and Regularization: House Prices Model

In this assignment, you'll continue working with the house prices data. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

* Load the **houseprices** data from Thinkful's database.
* Reimplement your model from the previous checkpoint.
* Try OLS, Lasso, Ridge, and ElasticNet regression using the same model specification. This time, use **k-fold cross-validation** to choose the best hyperparameter values for your models. Scikit-learn provides RidgeCV, LassoCV, and ElasticNetCV for this purpose. Which model is the best? Why?
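
The cross-validation task in the last bullet can be sketched on synthetic data as follows. This is a minimal illustration, not the notebook's actual model: the data set, the alpha grid, the `l1_ratio` values, and the 5-fold setting are all assumptions chosen for the example. It scales features with a `StandardScaler` pipeline instead of the `normalize=True` argument used later in this notebook, because recent scikit-learn releases no longer accept that argument.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the house prices features (assumption: not the real data)
X, y = make_regression(n_samples=300, n_features=10, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=12)

# Hypothetical penalty grid; each CV estimator picks one value by 5-fold CV
alphas = np.logspace(-3, 3, 25)

models = {
    "ridge": RidgeCV(alphas=alphas, cv=5),
    "lasso": LassoCV(alphas=alphas, cv=5, max_iter=10000),
    "elasticnet": ElasticNetCV(alphas=alphas, l1_ratio=[0.1, 0.5, 0.9],
                               cv=5, max_iter=10000),
}

test_scores = {}
for name, estimator in models.items():
    # Scale inside the pipeline so the scaler is fit on training data only
    pipe = make_pipeline(StandardScaler(), estimator)
    pipe.fit(X_train, y_train)
    test_scores[name] = pipe.score(X_test, y_test)
    print(name, "test R^2:", round(test_scores[name], 4))
```

The held-out test set is scored only once per model, after cross-validation on the training set has fixed the hyperparameters.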

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sqlalchemy import create_engine
%matplotlib inline

from sklearn.model_selection import train_test_split
import statsmodels.api as sm
from sklearn.metrics import mean_absolute_error
from statsmodels.tools.eval_measures import mse, rmse
from sklearn.linear_model import LinearRegression

import warnings
warnings.filterwarnings("ignore")

In [2]:
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'houseprices'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))

df = pd.read_sql_query('select * from houseprices',con=engine)

engine.dispose()

Cleaning...

In [3]:
df.drop(columns='id', inplace=True)

In [4]:
cat_list = ['mssubclass','mszoning','street','alley','lotshape','landcontour','utilities','lotconfig','landslope','neighborhood','condition1','condition2',
            'bldgtype','housestyle','roofstyle','roofmatl','exterior1st','exterior2nd','masvnrtype','exterqual','extercond','foundation','bsmtqual','bsmtcond',
            'bsmtexposure','bsmtfintype1','bsmtfintype2','heating','heatingqc','centralair','electrical','kitchenqual','functional','fireplacequ','garagetype',
            'garagefinish','garagequal','garagecond','paveddrive','poolqc','fence','miscfeature','saletype','salecondition', 'overallqual', 'overallcond', 
            'yearbuilt', 'yearremodadd', 'mosold', 'yrsold']
for var in cat_list:
    df[var] = df[var].astype('category')

In [5]:
def add_cat_fillna(variable, new_cat='None'):
    df[variable] = df[variable].cat.add_categories(new_cat).fillna(new_cat).copy()

In [6]:
nulls_list = ['alley','bsmtqual','bsmtcond','bsmtexposure','bsmtfintype1','bsmtfintype2','fireplacequ','garagetype','garagefinish','garagequal','garagecond',
             'poolqc','fence','miscfeature']
for var in nulls_list:
    add_cat_fillna(var)
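
The helper above registers a new category before filling, because pandas refuses to fill a categorical column with a value that is not among its categories. A minimal sketch of the same pattern on a toy column (the values are illustrative, not taken from the database):

```python
import pandas as pd

# Toy column standing in for a feature such as alley (assumed values)
alley = pd.Series(["Grvl", None, "Pave", None], dtype="category")

# 'None' is not yet a known category, so add it before calling fillna
filled = alley.cat.add_categories("None").fillna("None")

print(filled.tolist())
print(list(filled.cat.categories))
```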

In [7]:
df.masvnrtype = df.masvnrtype.fillna('None').copy()
df.masvnrarea = df.masvnrarea.fillna(0).copy()
df.lotfrontage = df.lotfrontage.fillna(df.lotfrontage.median()).copy()
df.electrical = df.electrical.fillna(df.electrical.mode()[0]).copy()
df.drop(columns='garageyrblt', inplace=True)

In [8]:
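def outliers_std(data, columns, thresh=2):
    """Return index labels of rows that have, in any of `columns`,
    a value more than `thresh` standard deviations from the column mean."""
    outlier_indexes = set()
    for col in columns:
        ser_col = data[col]
        mean = ser_col.mean()
        sd = ser_col.std()
        # Flag rows outside mean +/- thresh * sd for this column
        outliers_mask = (ser_col > mean + thresh*sd) | (ser_col < mean - thresh*sd)
        outlier_indexes.update(data.index[outliers_mask])
    return list(outlier_indexes)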

In [9]:
df.drop(outliers_std(df, df.describe().columns), inplace=True)
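
Because a row is dropped as soon as any one numeric column falls outside the band, the share of rows removed grows with the number of columns checked. The sketch below illustrates this on synthetic, independent, normally distributed columns. The data and the helper name `rows_flagged` are assumptions for the example, so the counts say nothing about the real data set; comparing `len(df)` before and after the drop would show the actual effect.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in: 1,000 rows and 10 independent numeric columns
toy = pd.DataFrame(rng.normal(size=(1000, 10)))

def rows_flagged(frame, thresh):
    """Count rows with any column value beyond mean +/- thresh * sd."""
    z = (frame - frame.mean()) / frame.std()
    return int((z.abs() > thresh).any(axis=1).sum())

for thresh in (2, 3):
    flagged = rows_flagged(toy, thresh)
    print(f"thresh={thresh}: {flagged} of {len(toy)} rows flagged")
```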

In [10]:
data = pd.DataFrame()
data['target'] = df.saleprice
data['totalsf'] = df.totalbsmtsf + df.firstflrsf + df.secondflrsf
data['totalsf_grl_rel'] = (df.totalbsmtsf + df.firstflrsf + df.secondflrsf) * df.grlivarea
data['bedroomabvgr'] = df.bedroomabvgr
data['bedroomabvgr_sf_rel'] = df.bedroomabvgr * (df.totalbsmtsf + df.firstflrsf + df.secondflrsf)
data['garagearea'] = df.garagearea
data['selling_age'] = df.yrsold.astype('int') - df.yearbuilt.astype('int')
data = data.join(pd.get_dummies(df.mszoning, prefix='mszoning', drop_first=True))
data = data.join(pd.get_dummies(df.mssubclass, prefix='mssubclass', drop_first=True))
data.drop(columns=['mszoning_RH', 'mszoning_RM'], inplace=True)
data.drop(columns=['mssubclass_40', 'mssubclass_50', 'mssubclass_60', 'mssubclass_75', 'mssubclass_80', 'mssubclass_85', 'mssubclass_90',
                   'mssubclass_120', 'mssubclass_180', 'mssubclass_190', 'mssubclass_30', 'mssubclass_45'], inplace=True)
data['overallqual'] = df.overallqual.astype(int)
data['overallcond'] = df.overallcond.astype(int)
data['sf_qual_rel'] = (df.totalbsmtsf + df.firstflrsf + df.secondflrsf) * df.overallqual.astype(int)

In [11]:
target = data['target']
X_data = data.iloc[:, 1:]

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X_data, target, test_size=.3, random_state=12)

In [13]:
lrm = LinearRegression(normalize=True)
lrm.fit(X_train, y_train)

lrm_y_train = lrm.predict(X_train)
lrm_y_test = lrm.predict(X_test)

In [14]:
from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV

In [15]:
ridgecv = RidgeCV(alphas=np.logspace(-20, 20, 13), normalize=True)
lassocv = LassoCV(normalize=True)
elastcv = ElasticNetCV(alphas=np.logspace(-20, 20, 13), l1_ratio=[0.000000000001, .1, .5, .7, .9, .95, .99, 1], normalize=True)

In [16]:
ridgecv.fit(X_train, y_train)
lassocv.fit(X_train, y_train)
elastcv.fit(X_train, y_train)

ElasticNetCV(alphas=array([1.00000000e-20, 2.15443469e-17, 4.64158883e-14, 1.00000000e-10,
       2.15443469e-07, 4.64158883e-04, 1.00000000e+00, 2.15443469e+03,
       4.64158883e+06, 1.00000000e+10, 2.15443469e+13, 4.64158883e+16,
       1.00000000e+20]),
             copy_X=True, cv='warn', eps=0.001, fit_intercept=True,
             l1_ratio=[1e-12, 0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1], max_iter=1000,
             n_alphas=100, n_jobs=None, normalize=True, positive=False,
             precompute='auto', random_state=None, selection='cyclic',
             tol=0.0001, verbose=0)
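
After fitting, the cross-validated estimators store the hyperparameters they selected in `alpha_` (and, for ElasticNetCV, `l1_ratio_`). Printing them shows which penalty was actually chosen. A minimal sketch on synthetic data, where the data set and the grids are assumptions for illustration only:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV, RidgeCV

# Synthetic stand-in data (assumption: not the house prices features)
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=1)

# Hypothetical grids for the example
alphas = np.logspace(-4, 2, 20)
l1_ratios = [0.1, 0.5, 0.9, 1.0]

ridge = RidgeCV(alphas=alphas, cv=5).fit(X, y)
enet = ElasticNetCV(alphas=alphas, l1_ratio=l1_ratios, cv=5,
                    max_iter=10000).fit(X, y)

print("Ridge alpha_:", ridge.alpha_)
print("ElasticNet alpha_:", enet.alpha_, "l1_ratio_:", enet.l1_ratio_)

# A selected alpha at either end of the grid hints that the grid
# may need to be widened or refined around that end
on_edge = bool(np.isclose(enet.alpha_, [alphas.min(), alphas.max()]).any())
print("ElasticNet alpha on grid edge:", on_edge)
```

In this notebook the same attributes would be read from `ridgecv`, `lassocv`, and `elastcv`.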

In [17]:
ridge_y_train = ridgecv.predict(X_train)
ridge_y_test = ridgecv.predict(X_test)

lasso_y_train = lassocv.predict(X_train)
lasso_y_test = lassocv.predict(X_test)

elast_y_train = elastcv.predict(X_train)
elast_y_test = elastcv.predict(X_test)

In [18]:
print('-----Standard OLS-----')
print("R-squared of the model in the training set is: {}".format(lrm.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model in the test set is: {}".format(lrm.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, lrm_y_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, lrm_y_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, lrm_y_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - lrm_y_test) / y_test)) * 100))

-----Standard OLS-----
R-squared of the model in the training set is: 0.8740889214748138
-----Test set statistics-----
R-squared of the model in the test set is: 0.8746785930946342
Mean absolute error of the prediction is: 12176.827366735932
Mean squared error of the prediction is: 273365577.789391
Root mean squared error of the prediction is: 16533.770827896187
Mean absolute percentage error of the prediction is: 7.917380402264537


In [19]:
print('-----Ridge Regression-----')
print("R-squared of the model in the training set is: {}".format(ridgecv.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model in the test set is: {}".format(ridgecv.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, ridge_y_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, ridge_y_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, ridge_y_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - ridge_y_test) / y_test)) * 100))

-----Ridge Regression-----
R-squared of the model in the training set is: 0.8740352118065822
-----Test set statistics-----
R-squared of the model in the test set is: 0.8747131174586468
Mean absolute error of the prediction is: 12159.457360787943
Mean squared error of the prediction is: 273290269.24515164
Root mean squared error of the prediction is: 16531.493255152473
Mean absolute percentage error of the prediction is: 7.885233300878501


In [20]:
print('-----Lasso Regression-----')
print("R-squared of the model in the training set is: {}".format(lassocv.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model in the test set is: {}".format(lassocv.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, lasso_y_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, lasso_y_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, lasso_y_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - lasso_y_test) / y_test)) * 100))

-----Lasso Regression-----
R-squared of the model in the training set is: 0.8735448553287073
-----Test set statistics-----
R-squared of the model in the test set is: 0.8748534694098992
Mean absolute error of the prediction is: 12141.682790958777
Mean squared error of the prediction is: 272984117.30195665
Root mean squared error of the prediction is: 16522.231002560056
Mean absolute percentage error of the prediction is: 7.834584943929994


In [21]:
print('-----ElasticNet Regression-----')
print("R-squared of the model in the training set is: {}".format(elastcv.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model in the test set is: {}".format(elastcv.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, elast_y_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, elast_y_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, elast_y_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - elast_y_test) / y_test)) * 100))

-----ElasticNet Regression-----
R-squared of the model in the training set is: 0.8740865328508938
-----Test set statistics-----
R-squared of the model in the test set is: 0.8746974787119978
Mean absolute error of the prediction is: 12172.728133576571
Mean squared error of the prediction is: 273324382.2919104
Root mean squared error of the prediction is: 16532.524982346476
Mean absolute percentage error of the prediction is: 7.910376022564387
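
The four reporting cells above repeat the same calculations for each model. A small helper could compute them in one place; the sketch below is one possible version, with the function name `regression_report` and the sample prices made up for illustration.

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Return R-squared, MAE, MSE, RMSE, and MAPE (in percent) as a dict."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mse = np.mean(errors ** 2)
    ss_res = np.sum(errors ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "r2": 1 - ss_res / ss_tot,
        "mae": np.mean(np.abs(errors)),
        "mse": mse,
        "rmse": np.sqrt(mse),
        "mape": np.mean(np.abs(errors / y_true)) * 100,
    }

# Made-up prices and predictions, for illustration only
report = regression_report([100000, 150000, 200000],
                           [110000, 140000, 205000])
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

With the notebook's variables, a call such as `regression_report(y_test, lasso_y_test)` would return the test-set metrics for the Lasso model.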


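After normalization, all four models produce practically the same results. ElasticNet does not collapse to Lasso here: its scores differ from Lasso's and sit closest to the OLS scores, which is consistent with cross-validation selecting a very weak penalty (note that the `l1_ratio` grid starts at 1e-12 rather than zero, so a pure Ridge penalty is never tried). Lasso slightly edges out Ridge and ElasticNet on test-set R-squared (0.8749 versus 0.8747), but the difference is minuscule. The other test-set metrics are also nearly identical, with Lasso lowest on MAE, MSE, RMSE, and MAPE, so Lasso is the best model here, by a very small margin.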
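
To put the size of these differences in perspective, the test-set RMSE values reported in the outputs above can be compared directly. The sketch below only rearranges numbers already printed in this notebook; it does not refit anything.

```python
# Test-set RMSE values copied from the outputs above
rmse_by_model = {
    "OLS": 16533.770827896187,
    "Ridge": 16531.493255152473,
    "Lasso": 16522.231002560056,
    "ElasticNet": 16532.524982346476,
}

best = min(rmse_by_model, key=rmse_by_model.get)
baseline = rmse_by_model["OLS"]

gains = {}
for name, value in rmse_by_model.items():
    # Percentage improvement in RMSE relative to plain OLS
    gains[name] = (baseline - value) / baseline * 100
    print(f"{name:<10} RMSE={value:,.2f}  improvement over OLS={gains[name]:.3f}%")

print("Lowest test RMSE:", best)
```

Every regularized model improves on OLS by less than a tenth of a percent, so a different train/test split could plausibly change the ranking.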