## Assignment

In this assignment, you'll continue working with the house prices data. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

* Load the **houseprices** data from Thinkful's database.
* Reimplement your model from the previous checkpoint.
* Try OLS, Lasso, Ridge, and ElasticNet regression using the same model specification. This time, you need to do **k-fold cross-validation** to choose the best hyperparameter values for your models. Which model is the best? Why?


### Load the houseprices data from Thinkful's database.


In [76]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm


from sklearn.model_selection  import train_test_split
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
from sqlalchemy import create_engine
from sklearn.metrics import mean_absolute_error
from statsmodels.tools.eval_measures import mse, rmse
from sklearn.linear_model import Lasso, Ridge,ElasticNet

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV, ElasticNetCV

import warnings
warnings.filterwarnings('ignore')

postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'houseprices'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))

houses_df = pd.read_sql_query('select * from houseprices',con=engine)

# no need for an open connection, as we're only doing a single query
engine.dispose()

### Reimplement your model from the previous checkpoint.

In [77]:
#Preparing dummy variables

houses_df = pd.concat([houses_df,pd.get_dummies(houses_df.mszoning, prefix="mszoning", drop_first=True)], axis=1)
houses_df = pd.concat([houses_df,pd.get_dummies(houses_df.street, prefix="street", drop_first=True)], axis=1)

dummy_column_names = list(pd.get_dummies(houses_df.mszoning, prefix="mszoning", drop_first=True).columns)
dummy_column_names = dummy_column_names + list(pd.get_dummies(houses_df.street, prefix="street", drop_first=True).columns)

#Creating a interaction feature with two or more related variables

houses_df['totalsf'] = houses_df['totalbsmtsf'] + houses_df['firstflrsf'] + houses_df['secondflrsf']
houses_df['int_over_sf'] = houses_df['totalsf'] * houses_df['overallqual']

#Creating feature set for X
X = houses_df[['overallqual', 'grlivarea', 'garagecars', 'garagearea', 'totalsf', 'int_over_sf'] + dummy_column_names]

# Y is the target variable
Y = houses_df['saleprice']


# We need to manually add a constant
# in statsmodels' sm
X = sm.add_constant(X)

#Creating train and set data
X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=0.3, random_state=101)


### Try OLS, Lasso, Ridge, and ElasticNet regression using the same model specification. This time, you need to do k-fold cross-validation to choose the best hyperparameter values for your models. Which model is the best? Why?

In [78]:
#OLS

##Question: if I use those line of code is also OLS?
#results = sm.OLS(y_train, X_train).fit()
#results.summary()

lrm = LinearRegression()
lrm.fit(X_train, y_train)

y_preds_train = lrm.predict(X_train)
y_preds_test = lrm.predict(X_test)

#do k-fold cross-validation
accuracy=cross_val_score(lrm, X_train, y_train, cv=5)

print("R-squared of the model in the training set is: {}".format(lrm.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model in the test set is: {}".format(lrm.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))
print("Accuracy: %0.3f (+/- %0.3f)" % (accuracy.mean(), v.std() * 2))


R-squared of the model in the training set is: 0.8260989854797562
-----Test set statistics-----
R-squared of the model in the test set is: 0.633088045709367
Mean absolute error of the prediction is: 22212.677425688515
Mean squared error of the prediction is: 2440747178.984056
Root mean squared error of the prediction is: 49403.91866020403
Mean absolute percentage error of the prediction is: 12.995030707204586
Accuracy: 0.803 (+/- 0.199)


In [79]:
#Lasso 

#My solution
#lassoregr = Lasso(alpha=10**20.5) 

#Solution from thinkful
alphas = [np.power(10.0,p) for p in np.arange(-10,40,1)]
lassoregr = LassoCV(alphas=alphas, cv=5)

lassoregr.fit(X_train, y_train)

# We are making predictions here
y_preds_train = lassoregr.predict(X_train)
y_preds_test = lassoregr.predict(X_test)

#do k-fold cross-validation
accuracy=cross_val_score(lassoregr, X_train, y_train, cv=5)

print("R-squared of the model on the training set is: {}".format(lassoregr.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model on the test set is: {}".format(lassoregr.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))
print("Accuracy: %0.3f (+/- %0.3f)" % (accuracy.mean(), v.std() * 2))



R-squared of the model on the training set is: 0.8142978406288434
-----Test set statistics-----
R-squared of the model on the test set is: 0.6151222223278816
Mean absolute error of the prediction is: 23081.396021862798
Mean squared error of the prediction is: 2560258228.498001
Root mean squared error of the prediction is: 50598.9943427535
Mean absolute percentage error of the prediction is: 13.678965549611696
Accuracy: 0.796 (+/- 0.199)


In [80]:
#Ridge

# Fitting a ridge regression model. Alpha is the regularization
# parameter (usually called lambda). As alpha gets larger, parameter
# shrinkage grows more pronounced.


#My solution:
#ridgeregr = Ridge(alpha=10**20.5) 


ridgeregr=RidgeCV(alphas=alphas, cv=5)
ridgeregr.fit(X_train, y_train)

# We are making predictions here
y_preds_train = ridgeregr.predict(X_train)
y_preds_test = ridgeregr.predict(X_test)


#do k-fold cross-validation
accuracy=cross_val_score(ridgeregr, X_train, y_train, cv=5)

print("R-squared of the model on the training set is: {}".format(ridgeregr.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model on the test set is: {}".format(ridgeregr.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))
print("Accuracy: %0.3f (+/- %0.3f)" % (accuracy.mean(), v.std() * 2))


R-squared of the model on the training set is: 0.8172218955900827
-----Test set statistics-----
R-squared of the model on the test set is: 0.6228895607275104
Mean absolute error of the prediction is: 22823.418049921336
Mean squared error of the prediction is: 2508588859.1427264
Root mean squared error of the prediction is: 50085.81494937191
Mean absolute percentage error of the prediction is: 13.509130264741643
Accuracy: 0.796 (+/- 0.199)


In [81]:
#ElasticNet

#My solution
#elasticregr = ElasticNet(alpha=10**20.5, l1_ratio=0.5) 

elasticregr = ElasticNetCV(alphas=alphas, cv=5)
elasticregr.fit(X_train, y_train)

# We are making predictions here
y_preds_train = elasticregr.predict(X_train)
y_preds_test = elasticregr.predict(X_test)

#do k-fold cross-validation
accuracy=cross_val_score(elasticregr, X_train, y_train, cv=5)

print("R-squared of the model on the training set is: {}".format(elasticregr.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model on the test set is: {}".format(elasticregr.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))
print("Accuracy: %0.3f (+/- %0.3f)" % (accuracy.mean(), v.std() * 2))



R-squared of the model on the training set is: 0.8188416451780465
-----Test set statistics-----
R-squared of the model on the test set is: 0.6257188892730867
Mean absolute error of the prediction is: 22672.868765219457
Mean squared error of the prediction is: 2489767788.8960905
Root mean squared error of the prediction is: 49897.572976008465
Mean absolute percentage error of the prediction is: 13.408486877384773
Accuracy: 0.795 (+/- 0.199)


#### Conclusion:

Based on numbers above, OLS model is the best one with thde highest score on R-squared of the model and accuracy. 

__________________________________________________________

By: Wendy Navarrete

8.15.2019