### Assignment

In this assignment, you'll continue working with the house prices data. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

    Load the houseprices data from Thinkful's database.
    Reimplement your model from the previous checkpoint.
    Try OLS, Lasso, Ridge, and ElasticNet regression using the same model specification. This time, you need to do k-fold cross-validation to choose the best hyperparameter values for your models. Scikit-learn has RidgeCV, LassoCV, and ElasticNetCV that you can utilize to do this. Which model is the best? Why?

This is not a graded checkpoint, but you should discuss your solution with your mentor. After you've submitted your work, take a moment to compare your solution to this example solution.


#### Load the houseprices data from Thinkful's database

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_absolute_error
import statsmodels.api as sm
from statsmodels.tools.eval_measures import mse, rmse
from sqlalchemy import create_engine
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV, ElasticNetCV

# Display preferences.
%matplotlib inline
pd.options.display.float_format = '{:.3f}'.format

import warnings
warnings.filterwarnings(action="ignore")

postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'houseprices'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))
houseprices_df = pd.read_sql_query('select * from houseprices',con=engine)

# no need for an open connection, as we're only doing a single query
engine.dispose()

houseprices_df.head()

Unnamed: 0,id,mssubclass,mszoning,lotfrontage,lotarea,street,alley,lotshape,landcontour,utilities,...,poolarea,poolqc,fence,miscfeature,miscval,mosold,yrsold,saletype,salecondition,saleprice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


#### Reimplement your model from the previous checkpoint.

In [2]:
kitchenqual_ohc_df = pd.get_dummies(houseprices_df.kitchenqual, prefix = 'kitchenqual',drop_first=True)
houseprices_df = pd.concat([houseprices_df, kitchenqual_ohc_df], axis=1)

fullbath_ohc_df = pd.get_dummies(houseprices_df.fullbath, prefix = 'fullbath', drop_first=True)
houseprices_df = pd.concat([houseprices_df, fullbath_ohc_df], axis=1)

In [4]:
houseprices_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 87 columns):
id                1460 non-null int64
mssubclass        1460 non-null int64
mszoning          1460 non-null object
lotfrontage       1201 non-null float64
lotarea           1460 non-null int64
street            1460 non-null object
alley             91 non-null object
lotshape          1460 non-null object
landcontour       1460 non-null object
utilities         1460 non-null object
lotconfig         1460 non-null object
landslope         1460 non-null object
neighborhood      1460 non-null object
condition1        1460 non-null object
condition2        1460 non-null object
bldgtype          1460 non-null object
housestyle        1460 non-null object
overallqual       1460 non-null int64
overallcond       1460 non-null int64
yearbuilt         1460 non-null int64
yearremodadd      1460 non-null int64
roofstyle         1460 non-null object
roofmatl          1460 non-null object
exte

#### Split your data into train and test sets

In [7]:
#M4
Y = houseprices_df['saleprice']

X = houseprices_df[['overallqual', 'grlivarea', 'garagecars',
                    'totalbsmtsf', 'yearbuilt', 'fullbath_1',
                    'kitchenqual_TA', 'yearremodadd', 'fullbath_2',
                    'kitchenqual_Gd', 'kitchenqual_Fa']]


In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 465)

print('The number of observations in training set is {}'.format(X_train.shape[0]))
print('The number of observations in test set is {}'.format(X_test.shape[0]))

The number of observations in training set is 1168
The number of observations in test set is 292


#### Try OLS, Lasso, Ridge, and ElasticNet regression using the same model specification. 
This time, you need to do k-fold cross-validation to choose the best hyperparameter values for your models. Scikit-learn has **RidgeCV, LassoCV, and ElasticNetCV** that you can utilize to do this. 

Which model is the best? Why?

####  are we not using the OLS model because the L1, L2 & ElasticNet regressions incorporate it?

how is the alpha figured out????

In [9]:
# We fit an OLS model using sklearn (per checkpoint ex & sample)
lrm = LinearRegression()
lrm.fit(X_train, y_train)


# We are making predictions here
y_preds_train = lrm.predict(X_train)
y_preds_test = lrm.predict(X_test)

print("R-squared of the model in the training set is: {}".format(lrm.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model in the test set is: {}".format(lrm.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))

R-squared of the model in the training set is: 0.7952791638781295
-----Test set statistics-----
R-squared of the model in the test set is: 0.8207507273524226
Mean absolute error of the prediction is: 22952.208191210604
Mean squared error of the prediction is: 1203426572.9449835
Root mean squared error of the prediction is: 34690.43921522158
Mean absolute percentage error of the prediction is: 13.641610774304322


# sample answer

house_prices_df['totalsf'] = house_prices_df['totalbsmtsf'] + house_prices_df['firstflrsf'] + house_prices_df['secondflrsf']

house_prices_df['int_over_sf'] = house_prices_df['totalsf'] * house_prices_df['overallqual']

# Y is the target variable
Y = np.log1p(house_prices_df['saleprice'])
# X is the feature set
X = house_prices_df[['overallqual', 'grlivarea', 'garagecars', 'garagearea', 'totalsf', 'int_over_sf'] + dummy_column_names]

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 465)

alphas = [np.power(10.0,p) for p in np.arange(-10,40,1)]


Comparison to sample answer: 
1.  #Y is the target variable
Y = np.log1p(house_prices_df['saleprice'])
WHY?? where did np.log1p come from???

2. alphas = [np.power(10.0,p) for p in np.arange(-10,40,1)]
I have no idea how to choose these numbers...


# sample answer

lrm = LinearRegression()

lrm.fit(X_train, y_train)

# We are making predictions here
y_preds_train = lrm.predict(X_train)
y_preds_test = lrm.predict(X_test)

print("R-squared of the model in training set is: {}".format(lrm.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model in test set is: {}".format(lrm.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))

#### LassoCV model

In [23]:
from sklearn.linear_model import LassoCV

lasso_cv = LassoCV(cv=5) #alpha set automatically if none?
lasso_cv.fit(X_train, y_train)

# We are making predictions here
y_preds_train = lasso_cv.predict(X_train)
y_preds_test = lasso_cv.predict(X_test)

print("R-squared of the model on the training set is: {}".format(lassoregr.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model on the test set is: {}".format(lassoregr.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))

R-squared of the model on the training set is: 0.0
-----Test set statistics-----
R-squared of the model on the test set is: -0.0016183407463286061
Mean absolute error of the prediction is: 25877.17617322646
Mean squared error of the prediction is: 1657846077.813971
Root mean squared error of the prediction is: 40716.656024457254
Mean absolute percentage error of the prediction is: 14.504589895782852


# sample answer

from sklearn.linear_model import LassoCV

lasso_cv = LassoCV(alphas=alphas, cv=5)

lasso_cv.fit(X_train, y_train)

# We are making predictions here
y_preds_train = lasso_cv.predict(X_train)
y_preds_test = lasso_cv.predict(X_test)

print("Best alpha value is: {}".format(lasso_cv.alpha_))
print("R-squared of the model in training set is: {}".format(lasso_cv.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model in test set is: {}".format(lasso_cv.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))

In [29]:
alphas = [np.power(10.0,p) for p in np.arange(-10,40,1)]

from sklearn.linear_model import LassoCV

lasso_cv = LassoCV(alphas=alphas, cv=5) #alpha set automatically if none?
lasso_cv.fit(X_train, y_train)

# We are making predictions here
y_preds_train = lasso_cv.predict(X_train)
y_preds_test = lasso_cv.predict(X_test)

print("R-squared of the model on the training set is: {}".format(lassoregr.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model on the test set is: {}".format(lassoregr.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))

R-squared of the model on the training set is: 0.0
-----Test set statistics-----
R-squared of the model on the test set is: -0.0016183407463286061
Mean absolute error of the prediction is: 22953.359552827376
Mean squared error of the prediction is: 1203595998.0103812
Root mean squared error of the prediction is: 34692.881085467394
Mean absolute percentage error of the prediction is: 13.642754318267764


#### RidgeCV model

In [25]:
from sklearn.linear_model import RidgeCV

# Fitting a ridge regression model. Alpha is the regularization
# parameter (usually called lambda). As alpha gets larger, parameter
# shrinkage grows more pronounced.
ridge_cv = RidgeCV(alphas=(0.1, 1.0, 10.0), cv=5) #using alphas from documentation 
ridge_cv.fit(X_train, y_train)

# We are making predictions here
y_preds_train = ridge_cv.predict(X_train)
y_preds_test = ridge_cv.predict(X_test)

print("R-squared of the model on the training set is: {}".format(ridgeregr.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model on the test set is: {}".format(ridgeregr.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))

R-squared of the model on the training set is: 0.0
-----Test set statistics-----
R-squared of the model on the test set is: -0.0016183407463286061
Mean absolute error of the prediction is: 22998.93780891314
Mean squared error of the prediction is: 1209719304.4443116
Root mean squared error of the prediction is: 34781.019312899836
Mean absolute percentage error of the prediction is: 13.683211777367857


# sample answer

ridge_cv = RidgeCV(alphas=alphas, cv=5)

ridge_cv.fit(X_train, y_train)

# We are making predictions here
y_preds_train = ridge_cv.predict(X_train)
y_preds_test = ridge_cv.predict(X_test)

print("Best alpha value is: {}".format(ridge_cv.alpha_))
print("R-squared of the model in training set is: {}".format(ridge_cv.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model in test set is: {}".format(ridge_cv.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))

In [30]:
alphas = [np.power(10.0,p) for p in np.arange(-10,40,1)]

from sklearn.linear_model import RidgeCV

# Fitting a ridge regression model. Alpha is the regularization
# parameter (usually called lambda). As alpha gets larger, parameter
# shrinkage grows more pronounced.
ridge_cv = RidgeCV(alphas=(0.1, 1.0, 10.0), cv=5) #using alphas from documentation 
ridge_cv.fit(X_train, y_train)

# We are making predictions here
y_preds_train = ridge_cv.predict(X_train)
y_preds_test = ridge_cv.predict(X_test)

print("R-squared of the model on the training set is: {}".format(ridgeregr.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model on the test set is: {}".format(ridgeregr.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))

R-squared of the model on the training set is: 0.0
-----Test set statistics-----
R-squared of the model on the test set is: -0.0016183407463286061
Mean absolute error of the prediction is: 22998.93780891314
Mean squared error of the prediction is: 1209719304.4443116
Root mean squared error of the prediction is: 34781.019312899836
Mean absolute percentage error of the prediction is: 13.683211777367857


#### ElasticNetCV model

In [27]:
from sklearn.linear_model import ElasticNetCV

elastic_cv = ElasticNetCV(cv=5) 
elastic_cv.fit(X_train, y_train)

# We are making predictions here
y_preds_train = elastic_cv.predict(X_train)
y_preds_test = elastic_cv.predict(X_test)

print("R-squared of the model on the training set is: {}".format(elasticregr.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model on the test set is: {}".format(elasticregr.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))

R-squared of the model on the training set is: 0.0
-----Test set statistics-----
R-squared of the model on the test set is: -0.0016183407463286061
Mean absolute error of the prediction is: 30928.78954232496
Mean squared error of the prediction is: 2231712427.6414804
Root mean squared error of the prediction is: 47241.003668862504
Mean absolute percentage error of the prediction is: 17.91963926135301


# sample answer

elasticnet_cv = ElasticNetCV(alphas=alphas, cv=5)

elasticnet_cv.fit(X_train, y_train)

# We are making predictions here
y_preds_train = elasticnet_cv.predict(X_train)
y_preds_test = elasticnet_cv.predict(X_test)

print("Best alpha value is: {}".format(elasticnet_cv.alpha_))
print("R-squared of the model in training set is: {}".format(elasticnet_cv.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model in test set is: {}".format(elasticnet_cv.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))

In [32]:
alphas = [np.power(10.0,p) for p in np.arange(-10,40,1)]

from sklearn.linear_model import ElasticNetCV

elastic_cv = ElasticNetCV(alphas=alphas,cv=5) 
elastic_cv.fit(X_train, y_train)

# We are making predictions here
y_preds_train = elastic_cv.predict(X_train)
y_preds_test = elastic_cv.predict(X_test)

print("R-squared of the model on the training set is: {}".format(elasticregr.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model on the test set is: {}".format(elasticregr.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))

R-squared of the model on the training set is: 0.0
-----Test set statistics-----
R-squared of the model on the test set is: -0.0016183407463286061
Mean absolute error of the prediction is: 22980.106983842867
Mean squared error of the prediction is: 1207124251.8108363
Root mean squared error of the prediction is: 34743.69369843737
Mean absolute percentage error of the prediction is: 13.666495825608916


**M4_train R-sq:** 
linear: 0.7952791638781295
LassoCV: 0.0 #is this b/c of alpha?
RidgeCV: 0.0
ElasticNetCV: 0.0


**M4_test R-sq:**
linear R-sq: 0.8207507273524226**
R-sq of the test set is higher than the training set. The gap doesn't seem overly large
LassoCV: -0.0016183407463286061
RidgeCV: -0.0016183407463286061
ElasticNetCV: -0.0016183407463286061

**MAE:** 
linear: 22952.208191210604
LassoCV: 25877.17617322646
RidgeCV: 22998.93780891314
ElasticNetCV: 30928.78954232496

**MSE:** 
linear: 1203426572.9449835
LassoCV: 1657846077.813971
RidgeCV: 1209719304.4443116
ElasticNetCV:  2231712427.6414804

**RMSE:** 
linear: 34690.43921522158
LassoCV: 40716.656024457254
RidgeCV: 34781.019312899836
ElasticNetCV: 47241.003668862504

**MAPE:** 
linear: 13.641610774304322
LassoCV: 14.504589895782852
RidgeCV: 13.683211777367857
ElasticNetCV: 17.91963926135301
