    JACOB KNOPPING
    1/21/2020
    
## 18.7 Assignment

In this assignment, you'll continue working with the house prices data. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

* Load the **houseprices** data from Thinkful's database.
* Reimplement your model from the previous checkpoint.
* Try OLS, Lasso, Ridge, and ElasticNet regression using the same model specification. This time, you need to do **k-fold cross-validation** to choose the best hyperparameter values for your models. Scikit-learn has RidgeCV, LassoCV, and ElasticNetCV that you can utilize to do this. Which model is the best? Why?
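
To make the k-fold step concrete, here is a minimal sketch of what a cross-validated search over `alpha` amounts to. It uses synthetic data from `make_regression` as a stand-in for the house prices features, and the names `X_demo`, `y_demo`, `candidate_alphas`, and `folds` are illustrative assumptions, not part of the assignment. The manual loop and `RidgeCV` score the same candidates on the same folds, so they should normally select the same value.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption; not the houseprices table)
X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 noise=10.0, random_state=0)

candidate_alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
folds = KFold(n_splits=5, shuffle=True, random_state=0)

# Manual search: mean out-of-fold R-squared for each candidate alpha
mean_scores = {}
for alpha in candidate_alphas:
    scores = cross_val_score(Ridge(alpha=alpha), X_demo, y_demo, cv=folds)
    mean_scores[alpha] = scores.mean()
best_manual = max(mean_scores, key=mean_scores.get)

# RidgeCV runs a comparable search internally when cv is given
ridge_demo = RidgeCV(alphas=candidate_alphas, cv=folds).fit(X_demo, y_demo)

print("Alpha chosen by the manual loop:", best_manual)
print("Alpha chosen by RidgeCV:", ridge_demo.alpha_)
```

Each candidate is scored only on data its model was not fitted on, which is why the chosen `alpha` is less likely to be tuned to one lucky split.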

### SOLUTION

In [1]:
#Re-implement model

# import the relevant libraries:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn import linear_model
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
import statsmodels.api as sm
from statsmodels.tools.eval_measures import mse, rmse
from sqlalchemy import create_engine

import warnings
warnings.filterwarnings('ignore')

postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'houseprices'

#load the data from the PostgreSQL database
engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))

houseprices_df = pd.read_sql_query('SELECT * FROM houseprices', con=engine)

#no need for an open connection (just the one query)
engine.dispose()

non_numeric_columns = houseprices_df.select_dtypes(['object']).columns
numeric_columns = houseprices_df.select_dtypes(['int64', 'float64']).columns

houseprices_copy = houseprices_df.copy()
houseprices_df.drop(['alley', 'fireplacequ', 'poolqc', 'fence', 'miscfeature'], axis=1, inplace=True)

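#Interpolate missing values in the numeric columns only (interpolation does not apply to text columns).
#Assigning the result back avoids the chained in-place call, which newer pandas versions warn about.
for column in houseprices_df.select_dtypes(['int64', 'float64']).columns:
    houseprices_df[column] = houseprices_df[column].interpolate()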
    
houseprices_df = houseprices_df.fillna(houseprices_df.mode().iloc[0])

from scipy.stats.mstats import winsorize

houseprices_df['winsorized_overallqual'] = winsorize(houseprices_df.overallqual, (0.05, 0.00))
houseprices_df['winsorized_grlivarea'] = winsorize(houseprices_df.grlivarea, (0.00, 0.05))
houseprices_df['winsorized_garagecars'] = winsorize(houseprices_df.garagecars, (0.00, 0.05))
houseprices_df['winsorized_garagearea'] = winsorize(houseprices_df.garagearea, (0.00, 0.05))
houseprices_df['winsorized_totalbsmtsf'] = winsorize(houseprices_df.totalbsmtsf, (0.05, 0.05))

#Create dummy variables for the selected categorical features and record their column names
categorical_features = ['mszoning', 'street', 'utilities', 'neighborhood', 'exterqual', 'kitchenqual']
dummy_column_names = []
for feature in categorical_features:
    dummies = pd.get_dummies(houseprices_df[feature], prefix=feature, drop_first=True)
    houseprices_df = pd.concat([houseprices_df, dummies], axis=1)
    dummy_column_names = dummy_column_names + list(dummies.columns)

In [2]:
#MODEL using OLS

#Feature set(X)
X = houseprices_df[['winsorized_overallqual', 'winsorized_grlivarea', 'winsorized_garagecars', 'winsorized_totalbsmtsf'] + dummy_column_names]
#Log transform the target variable(Y)
Y = np.log1p(houseprices_df.saleprice)

#Train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=465)

# We fit an OLS model using sklearn
lrm = LinearRegression()
lrm.fit(X_train, y_train)

# We are making predictions here
y_preds_train = lrm.predict(X_train)
y_preds_test = lrm.predict(X_test)

print("R-squared of the model in the training set is: {}".format(lrm.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model in the test set is: {}".format(lrm.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))

R-squared of the model in the training set is: 0.8646568780369258
-----Test set statistics-----
R-squared of the model in the test set is: 0.8630854071431089
Mean absolute error of the prediction is: 0.1133008349216371
Mean squared error of the prediction is: 0.02282991261842369
Root mean squared error of the prediction is: 0.1510957068166521
Mean absolute percentage error of the prediction is: 0.9473702255293278
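
Because the target is `np.log1p(saleprice)`, the error metrics above are in log units rather than dollars. The sketch below uses three made-up prices and made-up prediction errors (assumptions for illustration only, not values from the dataset) to show how the same predictions give a much larger percentage error once they are converted back with `np.expm1`, the inverse of `np.log1p`.

```python
import numpy as np

# Hypothetical sale prices and log-scale predictions, for illustration only
prices = np.array([100000.0, 200000.0, 300000.0])
log_actual = np.log1p(prices)
log_pred = log_actual + np.array([0.10, -0.10, 0.05])

# MAPE computed on the log scale, as in the cell above
mape_log = np.mean(np.abs((log_actual - log_pred) / log_actual)) * 100

# MAPE after undoing the transform with expm1 (the inverse of log1p)
pred_prices = np.expm1(log_pred)
mape_price = np.mean(np.abs((prices - pred_prices) / prices)) * 100

print("MAPE on the log scale: {:.2f}%".format(mape_log))
print("MAPE on the price scale: {:.2f}%".format(mape_price))
```

A log-scale MAPE below 1% therefore does not mean the predicted prices are within 1% of the actual prices.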


In [3]:
#MODEL using Ridge regression

from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV, ElasticNetCV

alphas = [np.power(10.0,p) for p in np.arange(-10,40,1)]

#fitting a ridge regression model
#alpha is the regularization parameter here (usually called lambda, a name that
#cannot be used in Python because it is a reserved keyword)
ridge_cv = RidgeCV(alphas=alphas, cv=5)
ridge_cv.fit(X_train, y_train)

y_preds_train = ridge_cv.predict(X_train)
y_preds_test = ridge_cv.predict(X_test)

print("Best alpha value is: {}".format(ridge_cv.alpha_))
print("R-squared of the model in training set is: {}".format(ridge_cv.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model in test set is: {}".format(ridge_cv.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))

Best alpha value is: 1.0
R-squared of the model in training set is: 0.8637673365967764
-----Test set statistics-----
R-squared of the model in test set is: 0.8587829700406089
Mean absolute error of the prediction is: 0.11438904815466043
Mean squared error of the prediction is: 0.02354732528457393
Root mean squared error of the prediction is: 0.15345137759099436
Mean absolute percentage error of the prediction is: 0.9572076490906779
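
To see what the regularization parameter does, the following sketch fits `Ridge` at several fixed `alpha` values and prints the size of the coefficient vector. The data come from `make_regression` and the names `X_demo`, `y_demo`, and `coef_norms` are illustrative assumptions; the point is only that a larger penalty shrinks the coefficients toward zero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic stand-in data (an assumption; not the houseprices features)
X_demo, y_demo = make_regression(n_samples=100, n_features=8,
                                 noise=5.0, random_state=1)

# Fit Ridge at increasing penalty strengths and record the coefficient norm
coef_norms = []
for alpha in [0.01, 1.0, 100.0, 10000.0]:
    ridge_demo = Ridge(alpha=alpha).fit(X_demo, y_demo)
    coef_norms.append(np.linalg.norm(ridge_demo.coef_))
    print("alpha={}: coefficient norm = {:.3f}".format(alpha, coef_norms[-1]))
```

With a very small `alpha` the fit is close to OLS, which is consistent with the Ridge and OLS scores in this notebook being nearly identical.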


In [4]:
#MODEL using Lasso regression
lasso_cv = LassoCV(alphas=alphas, cv=5)

lasso_cv.fit(X_train, y_train)

# We are making predictions here
y_preds_train = lasso_cv.predict(X_train)
y_preds_test = lasso_cv.predict(X_test)

print("Best alpha value is: {}".format(lasso_cv.alpha_))
print("R-squared of the model in training set is: {}".format(lasso_cv.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model in test set is: {}".format(lasso_cv.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))

Best alpha value is: 0.0001
R-squared of the model in training set is: 0.8641632254524528
-----Test set statistics-----
R-squared of the model in test set is: 0.8598082216321428
Mean absolute error of the prediction is: 0.11413124342757196
Mean squared error of the prediction is: 0.023376369042743043
Root mean squared error of the prediction is: 0.152893325697177
Mean absolute percentage error of the prediction is: 0.9547201754973095
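
Unlike Ridge, Lasso can set coefficients exactly to zero, so it is worth checking how many features a fitted model actually keeps. This is a minimal sketch on synthetic data (`X_demo`, `y_demo`, and the short `alphas` grid are assumptions chosen for illustration).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic stand-in data: 20 features, only 5 of which carry signal
X_demo, y_demo = make_regression(n_samples=200, n_features=20, n_informative=5,
                                 noise=10.0, random_state=2)

# Choose alpha by 5-fold cross-validation over a small illustrative grid
lasso_demo = LassoCV(alphas=[0.01, 0.1, 1.0, 10.0], cv=5, max_iter=10000)
lasso_demo.fit(X_demo, y_demo)

# Count the coefficients that the L1 penalty drove exactly to zero
n_zero = int(np.sum(lasso_demo.coef_ == 0))
print("Chosen alpha:", lasso_demo.alpha_)
print("Coefficients set to zero: {} of {}".format(n_zero, lasso_demo.coef_.size))
```

The same count applied to `lasso_cv.coef_` in this notebook would show whether the selected `alpha` removed any features.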


In [5]:
#MODEL using ElasticNet regression
elasticnet_cv = ElasticNetCV(alphas=alphas, cv=5)

elasticnet_cv.fit(X_train, y_train)

# We are making predictions here
y_preds_train = elasticnet_cv.predict(X_train)
y_preds_test = elasticnet_cv.predict(X_test)

print("Best alpha value is: {}".format(elasticnet_cv.alpha_))
print("R-squared of the model in training set is: {}".format(elasticnet_cv.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model in test set is: {}".format(elasticnet_cv.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))

Best alpha value is: 0.0001
R-squared of the model in training set is: 0.864449493441301
-----Test set statistics-----
R-squared of the model in test set is: 0.8611366642899303
Mean absolute error of the prediction is: 0.11372265608339417
Mean squared error of the prediction is: 0.02315485700985423
Root mean squared error of the prediction is: 0.15216720083465501
Mean absolute percentage error of the prediction is: 0.9511584841466773
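
ElasticNet has a second hyperparameter, `l1_ratio`, which sets the mix between the L1 and L2 penalties. The cell above tunes only `alpha` and leaves `l1_ratio` at its default of 0.5. `ElasticNetCV` also accepts a list of `l1_ratio` values and cross-validates over both. A minimal sketch on synthetic data, where `X_demo`, `y_demo`, and both grids are illustrative assumptions:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

# Synthetic stand-in data (an assumption; not the houseprices features)
X_demo, y_demo = make_regression(n_samples=200, n_features=20, n_informative=5,
                                 noise=10.0, random_state=3)

l1_ratios = [0.1, 0.5, 0.9, 1.0]
alpha_grid = [0.001, 0.01, 0.1, 1.0]

# Cross-validate over every (l1_ratio, alpha) combination
enet_demo = ElasticNetCV(l1_ratio=l1_ratios, alphas=alpha_grid,
                         cv=5, max_iter=10000)
enet_demo.fit(X_demo, y_demo)

print("Best alpha:", enet_demo.alpha_)
print("Best l1_ratio:", enet_demo.l1_ratio_)
```

An `l1_ratio` of 1.0 corresponds to Lasso, so a search that includes it also covers the pure-Lasso case.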


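Therefore, the best model is the OLS regression model: it has the highest test-set R-squared (0.8631, versus 0.8611 for ElasticNet, 0.8598 for Lasso, and 0.8588 for Ridge) and the lowest test-set MAE, MSE, RMSE, and MAPE. The margins are small, which suggests that regularization adds little for this model specification.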
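
The test-set scores of the four models differ only in the third decimal place, and they come from a single train/test split. One way to judge whether gaps of that size are meaningful is to compare the mean cross-validated R-squared of each model with its fold-to-fold spread. The sketch below does this on synthetic data; `X_demo`, `y_demo`, and the fixed hyperparameter values are assumptions for illustration, not the values selected in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption; not the houseprices features)
X_demo, y_demo = make_regression(n_samples=300, n_features=15,
                                 noise=20.0, random_state=4)

# Hyperparameters are fixed, illustrative values
models = {
    "OLS": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1, max_iter=10000),
    "ElasticNet": ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10000),
}

# Score every model on the same five folds
folds = KFold(n_splits=5, shuffle=True, random_state=0)
summary = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=folds, scoring="r2")
    summary[name] = (scores.mean(), scores.std())
    print("{:<10} mean R-squared = {:.4f}, std = {:.4f}".format(
        name, scores.mean(), scores.std()))
```

If the difference between two models' means is smaller than the standard deviation across folds, the ranking between them is not well supported.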