Assignment:

In this assignment, you'll continue working with the house prices data. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

- Load the houseprices data from Thinkful's database.

In [1]:
# Import libraries and packages:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
from sklearn.linear_model import ElasticNetCV, LassoCV, LinearRegression, RidgeCV
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_score, train_test_split
from sqlalchemy import create_engine
from statsmodels.tools.eval_measures import mse, rmse

# Display preferences.
%matplotlib inline
pd.options.display.float_format = '{:.3f}'.format

import warnings
warnings.filterwarnings(action="ignore")

In [2]:
# Connect to the database:
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'houseprices'
engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))
df = pd.read_sql_query('select * from houseprices', con=engine)
engine.dispose()

In [3]:
# Look at the dataset:
df.head()

Unnamed: 0,id,mssubclass,mszoning,lotfrontage,lotarea,street,alley,lotshape,landcontour,utilities,...,poolarea,poolqc,fence,miscfeature,miscval,mosold,yrsold,saletype,salecondition,saleprice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


- Reimplement your model from the previous checkpoint.

In [4]:
# Convert categorical columns to numeric by using dummies.
# drop_first=True drops one level per column to avoid perfect collinearity:
categorical = df.select_dtypes(include=['object'])
dummies = pd.get_dummies(categorical, drop_first=True)
df = df.drop(columns=categorical.columns)
df = pd.concat([df, dummies], axis=1)
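
The feature list used later includes names such as exterqual_TA, which come from the dummy encoding above. The following is a minimal sketch of how those names arise, using a small made-up frame (the values are hypothetical; only the column names mirror the dataset):

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "overallqual": [7, 6, 8, 5],
    "exterqual": ["Gd", "TA", "Ex", "TA"],
})

# drop_first=True drops the alphabetically first level ("Ex"), which becomes
# the baseline category; the remaining levels get one indicator column each.
encoded = pd.get_dummies(toy, columns=["exterqual"], drop_first=True)
print(encoded.columns.tolist())
# ['overallqual', 'exterqual_Gd', 'exterqual_TA']
```

A row with both indicators equal to zero therefore belongs to the dropped baseline level.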

In [5]:
# Define the target and the feature set for the linear regression model:

# Y is the target variable:
Y = df['saleprice']
# X is the feature set:
X = df[['overallqual', 'grlivarea', 'garagecars', 'garagearea', 'totalbsmtsf', 'exterqual_TA', 'kitchenqual_TA']]

In [6]:
# Split the dataset into training (80%) and test (20%) sets:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=465)

print("The number of observations in training set is {}".format(X_train.shape[0]))
print("The number of observations in test set is {}".format(X_test.shape[0]))

The number of observations in training set is 1168
The number of observations in test set is 292


- Try OLS, Lasso, Ridge, and ElasticNet regression using the same model specification. This time, you need to do k-fold cross-validation to choose the best hyperparameter values for your models. Scikit-learn has RidgeCV, LassoCV, and ElasticNetCV that you can utilize to do this. Which model is the best? Why?

In [11]:
# We fit an OLS model using sklearn:
lrm = LinearRegression()
lrm.fit(X_train, y_train)


# We are making predictions here
y_preds_train = lrm.predict(X_train)
y_preds_test = lrm.predict(X_test)

print("R-squared of the model in the training set is: {}".format(lrm.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model in the test set is: {}".format(lrm.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))

score = cross_val_score(lrm, X_test, y_test, cv=5)
print("\n mean is {} and two standard deviations is {}".format(np.mean(score), 2*np.std(score)))

R-squared of the model in the training set is: 0.7653043702630363
-----Test set statistics-----
R-squared of the model in the test set is: 0.7744461385990857
Mean absolute error of the prediction is: 25427.91243226023
Mean squared error of the prediction is: 1514301879.3380787
Root mean squared error of the prediction is: 38914.03190801589
Mean absolute percentage error of the prediction is: 15.3407185525745

 mean is 0.7732087658703305 and two standard deviations is 0.13134999644860354


For OLS, the R-squared is 0.77 in both the training set (0.765) and the test set (0.774). The two values are close, so the model does not appear to overfit.
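
The cell above runs cross_val_score on the held-out test split, which has only 292 rows. One common alternative, sketched here under the assumption that the test set should stay untouched until the final evaluation, is to cross-validate on the training split and score the test split once. The data below is synthetic (generated with make_regression) and only stands in for the house price features:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the seven house price features (illustration only).
X_demo, y_demo = make_regression(n_samples=500, n_features=7, noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=465)

model = LinearRegression()

# 5-fold cross-validation on the training split only; each score is an R-squared.
cv_scores = cross_val_score(model, X_tr, y_tr, cv=5)
print("CV mean R-squared: {:.3f} (+/- {:.3f})".format(cv_scores.mean(), 2 * cv_scores.std()))

# Fit on the full training split and evaluate once on the held-out test split.
model.fit(X_tr, y_tr)
test_r2 = model.score(X_te, y_te)
print("Test R-squared: {:.3f}".format(test_r2))
```

With this layout the cross-validation scores estimate how stable the fit is, while the test score remains an independent check.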


In [13]:
# LassoCV regression:

# Candidate regularization strengths: powers of ten from 1e-10 to 1e39.
alphas = [np.power(10.0, p) for p in np.arange(-10, 40, 1)]

lassoregr = LassoCV(alphas=alphas, cv=5) 
lassoregr.fit(X_train, y_train)

# We are making predictions here:
y_preds_train = lassoregr.predict(X_train)
y_preds_test = lassoregr.predict(X_test)

print("Best alpha value is: {}".format(lassoregr.alpha_))
print("R-squared of the model on the training set is: {}".format(lassoregr.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model on the test set is: {}".format(lassoregr.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))

score = cross_val_score(lassoregr, X_test, y_test, cv=5)
print("\n mean is {} and two standard deviations is {}".format(np.mean(score), 2*np.std(score)))

Best alpha value is: 1e-10
R-squared of the model on the training set is: 0.7653043702630362
-----Test set statistics-----
R-squared of the model on the test set is: 0.7744461385990783
Mean absolute error of the prediction is: 25427.91243226072
Mean squared error of the prediction is: 1514301879.3381288
Root mean squared error of the prediction is: 38914.031908016535
Mean absolute percentage error of the prediction is: 15.340718552574847

 mean is 0.7697793919116175 and two standard deviations is 0.14089694993844332


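LassoCV selects alpha = 1e-10, the smallest value in the grid, so the penalty is negligible and the results match OLS to many decimal places: the R-squared is 0.77 in both the training set (0.765) and the test set (0.774).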
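
A best alpha of 1e-10 sits at the bottom of the grid, where the L1 penalty barely affects the fit; this is why the Lasso and OLS outputs above agree so closely. The sketch below illustrates the same effect on synthetic data (the data and coefficient values are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic, well-conditioned data with known (hypothetical) coefficients.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

ols = LinearRegression().fit(X_demo, y_demo)

# With a near-zero alpha the Lasso objective is almost the OLS objective.
lasso_tiny = Lasso(alpha=1e-10, max_iter=100_000).fit(X_demo, y_demo)

# Largest absolute difference between the two coefficient vectors.
max_gap = np.max(np.abs(ols.coef_ - lasso_tiny.coef_))
print("Largest coefficient difference:", max_gap)
```

When the selected alpha lands on the edge of the grid like this, it usually means the data does not favor L1 shrinkage for this feature set.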


In [15]:
# RidgeCV regression (reuses the alphas grid defined above):
ridgeregr = RidgeCV(alphas=alphas, cv=5)
ridgeregr.fit(X_train, y_train)

# We are making predictions here
y_preds_train = ridgeregr.predict(X_train)
y_preds_test = ridgeregr.predict(X_test)

print("Best alpha value is: {}".format(ridgeregr.alpha_))
print("R-squared of the model on the training set is: {}".format(ridgeregr.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model on the test set is: {}".format(ridgeregr.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))

score = cross_val_score(ridgeregr, X_test, y_test, cv=5)
print("\n mean is {} and two standard deviations is {}".format(np.mean(score), 2*np.std(score)))

Best alpha value is: 10.0
R-squared of the model on the training set is: 0.7652841680051667
-----Test set statistics-----
R-squared of the model on the test set is: 0.7747510438155123
Mean absolute error of the prediction is: 25350.760203657755
Mean squared error of the prediction is: 1512254835.9428246
Root mean squared error of the prediction is: 38887.72088902646
Mean absolute percentage error of the prediction is: 15.290493798195367

 mean is 0.7736814207885284 and two standard deviations is 0.13454876617542114


RidgeCV selects alpha = 10. The R-squared is 0.77 in both the training set (0.765) and the test set (0.775), which is marginally higher than OLS on the test set.
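
To make the hyperparameter search less of a black box, here is a minimal sketch of choosing a Ridge alpha by k-fold cross-validation by hand. It is conceptually similar to what RidgeCV with cv=5 does, although the library's internals (fold assignment, scoring) may differ. The data is synthetic and the grid is a shortened, hypothetical version of the one used above:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (illustration only).
X_demo, y_demo = make_regression(n_samples=300, n_features=7, noise=25.0, random_state=1)

# Shortened, hypothetical alpha grid: 1e-3 through 1e3.
alpha_grid = [10.0 ** p for p in range(-3, 4)]
folds = KFold(n_splits=5, shuffle=True, random_state=0)

# For each alpha, average the R-squared over the five validation folds.
mean_r2 = {}
for alpha in alpha_grid:
    scores = cross_val_score(Ridge(alpha=alpha), X_demo, y_demo, cv=folds)
    mean_r2[alpha] = scores.mean()

# The chosen alpha is the one with the highest mean validation score.
best_alpha = max(mean_r2, key=mean_r2.get)
print("Best alpha:", best_alpha)
```

Reusing the same KFold object for every alpha keeps the comparison fair, because each candidate is scored on identical folds.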


In [16]:
# ElasticNetCV regression (l1_ratio is left at its default of 0.5):
elasticregr = ElasticNetCV(alphas=alphas, cv=5)
elasticregr.fit(X_train, y_train)

# We are making predictions here
y_preds_train = elasticregr.predict(X_train)
y_preds_test = elasticregr.predict(X_test)

print("Best alpha value is: {}".format(elasticregr.alpha_))
print("R-squared of the model on the training set is: {}".format(elasticregr.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model on the test set is: {}".format(elasticregr.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))

score = cross_val_score(elasticregr, X_test, y_test, cv=5)
print("\n mean is {} and two standard deviations is {}".format(np.mean(score), 2*np.std(score)))

Best alpha value is: 0.1
R-squared of the model on the training set is: 0.764824683275236
-----Test set statistics-----
R-squared of the model on the test set is: 0.7757730201705589
Mean absolute error of the prediction is: 25079.27253351844
Mean squared error of the prediction is: 1505393589.1192317
Root mean squared error of the prediction is: 38799.40191703001
Mean absolute percentage error of the prediction is: 15.115559584164304

 mean is 0.7736827099523016 and two standard deviations is 0.13501067752557033


ElasticNetCV selects alpha = 0.1. The R-squared is 0.76 in the training set (0.765) and 0.78 in the test set (0.776).


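I think the ElasticNet regression model is the best one, because its test-set R-squared (0.776) is higher than that of the other models (0.775 for Ridge, 0.774 for OLS and Lasso), although the differences are small.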

The four error metrics (MAE, MSE, RMSE, and MAPE) all measure the gap between the actual sale price of a house and the price predicted by the model, so lower values indicate better performance. ElasticNet has the lowest value on all four metrics; for example, its RMSE is about 38,799, compared with 38,888 for Ridge and 38,914 for OLS and Lasso.
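
The four metrics can also be computed directly with NumPy, which makes it easy to line the models up in a single table. The sketch below is an illustration on synthetic data, not a rerun of the notebook: the helper name error_metrics, the shortened alpha grid, and the make_regression data are all assumptions introduced here.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV, LassoCV, LinearRegression, RidgeCV
from sklearn.model_selection import train_test_split


def error_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and MAPE (in percent) for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mse_val = np.mean(err ** 2)
    return {
        "MAE": np.mean(np.abs(err)),
        "MSE": mse_val,
        "RMSE": np.sqrt(mse_val),
        "MAPE": np.mean(np.abs(err / y_true)) * 100,
    }


# Small hand-checkable case: errors of -10 and 20 on targets of 100 and 200.
check = error_metrics([100, 200], [110, 180])

# Synthetic stand-in data; the bias keeps targets away from zero for MAPE.
X_demo, y_demo = make_regression(n_samples=400, n_features=7, noise=10.0,
                                 bias=1000.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=465)

alpha_grid = [10.0 ** p for p in range(-3, 3)]  # shortened, hypothetical grid
models = {
    "OLS": LinearRegression(),
    "Lasso": LassoCV(alphas=alpha_grid, cv=5),
    "Ridge": RidgeCV(alphas=alpha_grid, cv=5),
    "ElasticNet": ElasticNetCV(alphas=alpha_grid, cv=5),
}

# Fit each model on the training split and score it on the test split.
rows = {name: error_metrics(y_te, m.fit(X_tr, y_tr).predict(X_te))
        for name, m in models.items()}
comparison = pd.DataFrame(rows).T  # one row per model, one column per metric
print(comparison)
```

Sorting the resulting table by any one column (for example comparison.sort_values("RMSE")) gives a quick ranking, and because MSE and RMSE are monotonically related they always rank the models identically.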