## Assignment

In this assignment, you'll continue working with the house prices data. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

* Load the **houseprices** data from Thinkful's database.
* Reimplement your model from the previous checkpoint.
* Try OLS, Lasso, Ridge, and ElasticNet regression using the same model specification. This time, use **k-fold cross-validation** to choose the best hyperparameter values for your models. Scikit-learn provides `RidgeCV`, `LassoCV`, and `ElasticNetCV` for this purpose. Which model is the best? Why?

This is not a graded checkpoint, but you should discuss your solution with your mentor. After you've submitted your work, take a moment to compare your solution to [this example solution](https://github.com/Thinkful-Ed/machine-learning-regression-problems/blob/master/notebooks/7.solution_overfitting_and_regularization.ipynb).

In [80]:
import warnings

import matplotlib.pyplot as plt
import numpy as np 
import pandas as pd 
import statsmodels.api as sm # Not used in this assignment

from scipy import stats
from scipy.stats.mstats import winsorize
from sklearn.linear_model import LinearRegression 
from sklearn.model_selection import train_test_split 
from sklearn.metrics import mean_absolute_error 
from statsmodels.tools.eval_measures import mse, rmse 
from sqlalchemy import create_engine 
from sqlalchemy.engine.url import URL 

from sklearn.linear_model import Ridge # New for this assignment
from sklearn.linear_model import Lasso # New for this assignment
from sklearn.linear_model import ElasticNet # New for this assignment

from sklearn.model_selection import cross_val_score # For cross-validation
from sklearn.model_selection import KFold # For cross-validation

pd.options.display.float_format = "{:3f}".format
warnings.filterwarnings(action="ignore")

kagle = dict(
    drivername = "postgresql",
    username = "dsbc_student",
    password = "7*.8G9QH21",
    host = "142.93.121.174",
    port = "5432",
    database = "houseprices"
)

In [23]:
def get_test_scores(model,X_train, X_test, y_train, y_test, y_preds_train, y_preds_test):
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    mae_score = mean_absolute_error(y_test, y_preds_test)
    mse_score = mse(y_test, y_preds_test)
    rmse_score = rmse(y_test, y_preds_test)
    mape_score = np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100

    return [train_score, test_score, mae_score, mse_score, rmse_score, mape_score]
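
The error metrics collected by `get_test_scores` can be checked by hand on a toy example. The sketch below is only an illustration: the helper name `regression_errors` and the toy values are made up for this example and are not part of the assignment.

```python
import numpy as np

def regression_errors(y_true, y_pred):
    """Return MAE, MSE, RMSE, and MAPE (in percent) for two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mae = np.mean(np.abs(residuals))
    mse = np.mean(residuals ** 2)
    rmse = np.sqrt(mse)
    # MAPE divides by the true values, so it is undefined when y_true contains zeros
    mape = np.mean(np.abs(residuals / y_true)) * 100
    return mae, mse, rmse, mape

# Hypothetical toy values: the residuals are -10, 20, and 0
toy_true = [100.0, 200.0, 400.0]
toy_pred = [110.0, 180.0, 400.0]
mae, mse, rmse, mape = regression_errors(toy_true, toy_pred)
print(f"MAE={mae:.4f} MSE={mse:.4f} RMSE={rmse:.4f} MAPE={mape:.4f}")
```

RMSE is the square root of MSE, so it is reported in the same units as the target (here, sale price), which makes it easier to compare with MAE.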

In [27]:
def print_stats(train_score, test_score, mae_score, mse_score, rmse_score, mape_score):
    print(f"R-squared of the model in the training set is: {train_score:,.4f}")
    print("\n", 30*"-", "Test set statistics", 30*"-", "\n")
    print(f"R-squared of the model in the test set is: {test_score:,.4f}")
    print(f"Mean absolute error of the prediction is: {mae_score:,.4f}")
    print(f"Mean squared error of the prediction is: {mse_score:,.4f}")
    print(f"Root mean squared error of the prediction is: {rmse_score:,.4f}")
    print(f"Mean absolute percentage error of the prediction is: {mape_score:,.4f}")

In [4]:
# Load the data from the houseprices database
engine=create_engine(URL(**kagle), echo=True)

houses_raw = pd.read_sql("SELECT * FROM houseprices", con=engine)

# No need for an open connection, please close
engine.dispose()

2020-01-09 12:43:59,189 INFO sqlalchemy.engine.base.Engine select version()
2020-01-09 12:43:59,190 INFO sqlalchemy.engine.base.Engine {}
2020-01-09 12:43:59,284 INFO sqlalchemy.engine.base.Engine select current_schema()
2020-01-09 12:43:59,285 INFO sqlalchemy.engine.base.Engine {}
2020-01-09 12:43:59,381 INFO sqlalchemy.engine.base.Engine SELECT CAST('test plain returns' AS VARCHAR(60)) AS anon_1
2020-01-09 12:43:59,382 INFO sqlalchemy.engine.base.Engine {}
2020-01-09 12:43:59,431 INFO sqlalchemy.engine.base.Engine SELECT CAST('test unicode returns' AS VARCHAR(60)) AS anon_1
2020-01-09 12:43:59,432 INFO sqlalchemy.engine.base.Engine {}
2020-01-09 12:43:59,479 INFO sqlalchemy.engine.base.Engine show standard_conforming_strings
2020-01-09 12:43:59,480 INFO sqlalchemy.engine.base.Engine {}
2020-01-09 12:43:59,578 INFO sqlalchemy.engine.base.Engine select relname from pg_class c join pg_namespace n on n.oid=c.relnamespace where pg_catalog.pg_table_is_visible(c.oid) and relname=%(name)s
20

In [5]:
# Create a copy of the raw data to work on
houses_working = houses_raw.copy()

In [8]:
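# Copy the selected columns so that new columns can be added without a SettingWithCopyWarning
houses_winsorized = houses_working[["neighborhood","overallqual","lotarea",
                                    "totalbsmtsf","firstflrsf","grlivarea",
                                    "totrmsabvgrd","garagecars","garagearea","saleprice"]].copy()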

# Winsorization limits (lower, upper) were derived during EDA
winsorize_vals = dict(
    lotarea=(0.10,0.05),
    totalbsmtsf=(0.10,0.05),
    firstflrsf=(0.0,0.1),
    grlivarea=(0.0,0.1),
    totrmsabvgrd=(0.0,0.1),
    garagecars=(0.0,0.1),
    garagearea=(0.0,0.1),
    saleprice=(0.0,0.1)
)

# Add a column for each of the winsorized values
for k, v in winsorize_vals.items():
    houses_winsorized[f"{k}_winsorized"] = winsorize(houses_winsorized[k], v)
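
To make the `(lower, upper)` limits concrete, here is a minimal sketch on a made-up array of ten values. The array and the limits are illustrative assumptions, not values taken from the house prices data.

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Hypothetical toy data: the integers 1 through 10
toy = np.arange(1, 11)

# Replace the lowest 10% (one value) and the highest 20% (two values)
# with the nearest values that remain inside the limits
clipped = winsorize(toy, limits=(0.1, 0.2))

# winsorize returns a masked array; convert it to a plain list for display
clipped_list = np.asarray(clipped).tolist()
print(clipped_list)
```

Each limit is a fraction of the observations on that side, so `(0.0, 0.1)` as used for most columns above leaves the low end untouched and caps only the top 10%.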

In [9]:
# Create a set of dummies for the neighborhood variable, prefix the dummies with "neighborhood"
houses_winsorized = pd.concat([houses_winsorized, pd.get_dummies(houses_winsorized["neighborhood"], prefix="neighborhood",drop_first=True)], axis=1)

# Create a set of dummies for the overallqual variable, prefix the dummies with "overallqual"
houses_winsorized = pd.concat([houses_winsorized, pd.get_dummies(houses_winsorized["overallqual"], prefix="overallqual",drop_first=True)], axis=1)
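
The effect of `drop_first=True` is easiest to see on a tiny frame. This is a sketch with hypothetical category labels (`A`, `B`, `C`) rather than real neighborhood names.

```python
import pandas as pd

# Hypothetical toy column with three categories
toy = pd.DataFrame({"neighborhood": ["A", "B", "C", "A"]})

# drop_first=True omits the first category ("A"), which becomes the baseline
dummies = pd.get_dummies(toy["neighborhood"], prefix="neighborhood", drop_first=True)
dummy_columns = dummies.columns.tolist()
print(dummy_columns)
```

Dropping one level per categorical variable avoids perfect collinearity between the dummies and the intercept in the OLS model.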

In [10]:
# Add an interaction between garagecars and garagearea
houses_winsorized["garagecars_garagearea"] = houses_winsorized["garagecars"] * houses_winsorized["garagearea"]

# Get a list of column names to be used for feature consideration
feature_names = houses_winsorized.iloc[:,2:].columns.to_list()

# Pop saleprice_winsorized (the target, at index 15) from the list of feature_names
feature_names.pop(15)

# Get the final list of feature columns for the model
feature_names = feature_names[8:]

In [90]:
# Y is the target variable
Y = houses_winsorized["saleprice_winsorized"]

# X is the feature set
X = houses_winsorized[feature_names]

test_size = 0.20
random_state = 465

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = test_size, random_state = random_state)

# Fit an OLS model using sklearn
lrm = LinearRegression()
lrm.fit(X_train, y_train)

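print(f"With {test_size*100}% Holdout: {lrm.fit(X_train, y_train).score(X_test, y_test)*100:.3f}%")
# Caution: the next line refits lrm on the full data set, so the predictions and
# scores further down come from a model that has already seen the test rows
print(f"Testing on Sample: {lrm.fit(X,Y).score(X,Y)*100:.3f}%")

# cross_val_score returns one R-squared value per fold (a fraction, not a percentage)
cv_scores = cross_val_score(lrm,X,Y,cv=28)
print(f"Accuracy: {cv_scores.mean():.2f}% (+/- {cv_scores.std()*2:.2f}) Mode: {stats.mode(cv_scores)[0][0]:2f}%")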

# Make some predictions
m1_y_preds_train = lrm.predict(X_train)
m1_y_preds_test = lrm.predict(X_test)

# Get the model_01 scores (m1)
m1_scores = get_test_scores(lrm, X_train, X_test, y_train, y_test, m1_y_preds_train, m1_y_preds_test)
m1_scores.insert(0,"model_01")

# Print the results
print_stats(m1_scores[1], m1_scores[2], m1_scores[3], m1_scores[4], m1_scores[5], m1_scores[6])

With 20.0% Holdout: 85.235%
Testing on Sample: 86.925%
Accuracy: 0.86% (+/- 0.07) Mode: 0.759648%
R-squared of the model in the training set is: 0.8711

 ------------------------------ Test set statistics ------------------------------ 

R-squared of the model in the test set is: 0.8619
Mean absolute error of the prediction is: 16,640.4101
Mean squared error of the prediction is: 487,224,950.4771
Root mean squared error of the prediction is: 22,073.1726
Mean absolute percentage error of the prediction is: 10.7951
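
`KFold` is imported at the top of the notebook but not used in the cell above. The following is a minimal sketch of how it could be combined with `cross_val_score`. It runs on synthetic data from `make_regression`, so the names `X_demo`, `y_demo`, and `folds` and the choice of 5 folds are assumptions for illustration, not the notebook's actual setup.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the feature matrix and target
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Shuffle before splitting so the folds do not depend on row order
folds = KFold(n_splits=5, shuffle=True, random_state=0)

# scoring="r2" makes it explicit that each fold reports R-squared
fold_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=folds, scoring="r2")
print(f"Mean R-squared: {fold_scores.mean():.3f} (std {fold_scores.std():.3f})")
```

Because the fold scores are continuous, the mean and standard deviation are usually more informative summaries than the mode.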


In [91]:
# Now, a Ridge regression
ridgeregr = Ridge(alpha=10**37)
ridgeregr.fit(X_train, y_train)

# Make some predictions
m2_y_preds_train = ridgeregr.predict(X_train)
m2_y_preds_test = ridgeregr.predict(X_test)

# Get the model_02 scores (m2)
m2_scores = get_test_scores(ridgeregr, X_train, X_test, y_train, y_test, m2_y_preds_train, m2_y_preds_test)
m2_scores.insert(0,"model_02")

# Print the results
print_stats(m2_scores[1], m2_scores[2], m2_scores[3], m2_scores[4], m2_scores[5], m2_scores[6])

R-squared of the model in the training set is: 0.0000

 ------------------------------ Test set statistics ------------------------------ 

R-squared of the model in the test set is: -0.0033
Mean absolute error of the prediction is: 49,645.3173
Mean squared error of the prediction is: 3,540,445,067.4218
Root mean squared error of the prediction is: 59,501.6392
Mean absolute percentage error of the prediction is: 33.6389
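
A training R-squared of 0.0000 indicates that the penalty is large enough to shrink the coefficients to practically zero, leaving only the intercept. The sketch below illustrates that effect on synthetic data. The data, the alpha values, and the variable names are assumptions chosen for the demonstration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic stand-in data, not the house prices features
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

train_r2 = {}
largest_coef = {}
for alpha in (1.0, 1e3, 1e12):
    model = Ridge(alpha=alpha).fit(X_demo, y_demo)
    train_r2[alpha] = model.score(X_demo, y_demo)
    largest_coef[alpha] = float(np.abs(model.coef_).max())
    print(f"alpha={alpha:g}: training R-squared={train_r2[alpha]:.4f}, "
          f"largest |coef|={largest_coef[alpha]:.3g}")
```

As alpha grows, the fit on the training data can only get worse, which is why the penalty strength should be chosen by validation rather than set arbitrarily high.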


In [92]:
# This time a Lasso regression
lassoregr = Lasso(alpha=10**20.5)
lassoregr.fit(X_train, y_train)

# Making predictions here
m3_y_preds_train = lassoregr.predict(X_train)
m3_y_preds_test = lassoregr.predict(X_test)

# Get the model_03 scores (m3)
m3_scores = get_test_scores(lassoregr, X_train, X_test, y_train, y_test, m3_y_preds_train, m3_y_preds_test)
m3_scores.insert(0,"model_03")

# Print the results
print_stats(m3_scores[1], m3_scores[2], m3_scores[3], m3_scores[4], m3_scores[5], m3_scores[6])

R-squared of the model in the training set is: 0.0000

 ------------------------------ Test set statistics ------------------------------ 

R-squared of the model in the test set is: -0.0033
Mean absolute error of the prediction is: 49,645.3173
Mean squared error of the prediction is: 3,540,445,067.4218
Root mean squared error of the prediction is: 59,501.6392
Mean absolute percentage error of the prediction is: 33.6389


In [93]:
# Finally, an ElasticNet regression
elasticregr = ElasticNet(alpha=10**21, l1_ratio=0.5)
elasticregr.fit(X_train, y_train)

# Making predictions here
m4_y_preds_train = elasticregr.predict(X_train)
m4_y_preds_test = elasticregr.predict(X_test)

# Get the model_04 scores (m4)
m4_scores = get_test_scores(elasticregr, X_train, X_test, y_train, y_test, m4_y_preds_train, m4_y_preds_test)
m4_scores.insert(0,"model_04")

# Print the results
print_stats(m4_scores[1], m4_scores[2], m4_scores[3], m4_scores[4], m4_scores[5], m4_scores[6])

R-squared of the model in the training set is: 0.0000

 ------------------------------ Test set statistics ------------------------------ 

R-squared of the model in the test set is: -0.0033
Mean absolute error of the prediction is: 49,645.3173
Mean squared error of the prediction is: 3,540,445,067.4218
Root mean squared error of the prediction is: 59,501.6392
Mean absolute percentage error of the prediction is: 33.6389


In [94]:
# Build a dataframe to compare the results
comparison_df = pd.DataFrame([m1_scores,m2_scores,m3_scores,m4_scores], 
    columns=["Model","Train_Score","Test_Score","MAE","MSE","RMSE","MAPE"])

In [95]:
comparison_df

| | Model | Train_Score | Test_Score | MAE | MSE | RMSE | MAPE |
|---|---|---|---|---|---|---|---|
| 0 | model_01 | 0.871079 | 0.861922 | 16640.410072 | 487224950.477143 | 22073.172642 | 10.795143 |
| 1 | model_02 | 0.0 | -0.003349 | 49645.317297 | 3540445067.421752 | 59501.639199 | 33.638882 |
| 2 | model_03 | 0.0 | -0.003349 | 49645.317297 | 3540445067.421752 | 59501.639199 | 33.638882 |
| 3 | model_04 | 0.0 | -0.003349 | 49645.317297 | 3540445067.421752 | 59501.639199 | 33.638882 |
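
The three regularized models report identical scores because their alpha values were fixed by hand at very large magnitudes. The assignment asks for k-fold cross-validation to choose the hyperparameters instead. Below is a minimal sketch of that approach with `RidgeCV`, `LassoCV`, and `ElasticNetCV`. It uses synthetic data, and the alpha grid, `l1_ratio=0.5`, the 5 folds, and all variable names are assumptions for illustration; applying it to the notebook would mean substituting `X_train`, `X_test`, `y_train`, and `y_test`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV, LassoCV, RidgeCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the house prices features and target
X_demo, y_demo = make_regression(n_samples=300, n_features=10, n_informative=5,
                                 noise=15.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Candidate penalty strengths on a log scale (an assumed grid)
alphas = np.logspace(-3, 3, 13)

models = {
    "ridge": RidgeCV(alphas=alphas, cv=5),
    "lasso": LassoCV(alphas=alphas, cv=5, max_iter=10_000),
    "elasticnet": ElasticNetCV(alphas=alphas, l1_ratio=0.5, cv=5, max_iter=10_000),
}

results = {}
for name, model in models.items():
    # Each estimator runs 5-fold CV on the training data and refits with the best alpha
    model.fit(X_tr, y_tr)
    results[name] = {"alpha": float(model.alpha_), "test_r2": model.score(X_te, y_te)}
    print(f"{name}: best alpha={results[name]['alpha']:g}, "
          f"test R-squared={results[name]['test_r2']:.4f}")
```

The selected penalty is exposed as `alpha_` on each fitted estimator, so the chosen values can be added to the comparison table alongside the test scores.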


In [None]:
# Test the model with different holdout groups
X_train, X_test, y_train, y_test = train_test_split(X,Y, test_size=0.2, random_state=100)

