# <font color=blue>Assignments for "Overfitting and Regularization"</font>

In this assignment, you'll continue working with the house prices data. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

- Load the **houseprices** data from Kaggle.
- Reimplement your model from the previous lesson.
- Try OLS, Lasso, Ridge and ElasticNet regressions using the same model specification. This time, you need to do **k-fold cross-validation** to choose the best hyperparameter values for your models. Which model is the best? Why?

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import linear_model
import matplotlib.pyplot as plt
import pandas.api.types as pt
import scipy.stats as stats
from scipy.stats import chi2_contingency
import researchpy as rp
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from statsmodels.tools.eval_measures import mse, rmse
from statsmodels.formula.api import ols
from sklearn.linear_model import Ridge,Lasso,ElasticNet, LinearRegression
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 500)

sns.set(style="whitegrid")
pd.options.display.float_format = '{:.2f}'.format
plt.rcParams['figure.dpi'] = 100
plt.rcParams['figure.figsize'] = (8,5.5)

title_font = {'family': 'arial', 'color': 'darkred','weight': 'bold','size': 13 }
axis_font  = {'family': 'arial', 'color': 'darkblue','weight': 'bold','size': 10}

In [2]:
house_prices_train = pd.read_csv("../../data/regression_assignments/train.csv")
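# Take an explicit copy so the fill below modifies this frame and not a view of the original
variables=house_prices_train[['SalePrice','OverallQual','YearBuilt','YearRemodAdd','MasVnrArea','TotalBsmtSF','FullBath','Fireplaces',
                                'GarageCars','MSZoning','Street','LotShape','LandContour','BldgType','CentralAir', 'SaleCondition']].copy()
# Fill missing masonry veneer areas with the column median (assign back instead of a chained inplace fillna)
variables['MasVnrArea']=variables['MasVnrArea'].fillna(variables['MasVnrArea'].median())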
var_numeric=variables.select_dtypes(include=['float64','int64'])
var_cat=variables.select_dtypes(include=['object'])
var_dummies=pd.get_dummies(var_cat,drop_first=True)

var_regress=pd.concat([var_numeric,var_dummies],axis=1)
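
The dummy encoding in the cell above can be illustrated on a toy frame. This is a minimal sketch; `toy` and `toy_dummies` are hypothetical names and the three rows are made up, not taken from the house prices data. With `drop_first=True`, the first category of each column (in sorted order) is dropped and serves as the baseline, which avoids perfectly collinear dummy columns.

```python
import pandas as pd

# Hypothetical toy data with two categorical columns
toy = pd.DataFrame({
    "CentralAir": ["Y", "N", "Y"],
    "Street": ["Pave", "Grvl", "Pave"],
})

# drop_first=True drops the first category of each column ("N" and "Grvl")
toy_dummies = pd.get_dummies(toy, drop_first=True)
print(list(toy_dummies.columns))
```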

In [3]:
Y=var_regress['SalePrice']
X=var_regress.loc[:,var_regress.columns!='SalePrice']

In [4]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3, random_state = 6)

**Linear Regression**

In [5]:
lrm = LinearRegression()
lrm.fit(X_train, Y_train)

Y_preds_train = lrm.predict(X_train)
Y_preds_test = lrm.predict(X_test)

print("R-squared of the model in training set is: {:.4f}".format(lrm.score(X_train, Y_train)))
print("-----Test set statistics-----")
print("R-squared of the model in test set is: {:.4f}".format(lrm.score(X_test, Y_test)))
print("Mean absolute error of the prediction is: {:.4f}".format(mean_absolute_error(Y_test, Y_preds_test)))
print("Mean squared error of the prediction is: {:.4f}".format(mse(Y_test, Y_preds_test)))
print("Root mean squared error of the prediction is: {:.4f}".format(rmse(Y_test, Y_preds_test)))
print("Mean absolute percentage error of the prediction is: {:.4f}".format(np.mean(np.abs((Y_test - Y_preds_test) / Y_test)) * 100))

R-squared of the model in training set is: 0.7757
-----Test set statistics-----
R-squared of the model in test set is: 0.7469
Mean absolute error of the prediction is: 26203.0203
Mean squared error of the prediction is: 1739675034.6602
Root mean squared error of the prediction is: 41709.4118
Mean absolute percentage error of the prediction is: 15.2806
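
The same five test-set statistics are printed for every model in this notebook, so one option is to collect them in a small helper. The sketch below is only an illustration: `regression_report` is a hypothetical function name, and the toy data (an exact line `y = 2x + 1`) stands in for the house prices set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error


def regression_report(model, X_test, y_test):
    """Return the test-set metrics that the cells of this notebook print."""
    y_test = np.asarray(y_test, dtype=float)
    y_pred = model.predict(X_test)
    mse_value = mean_squared_error(y_test, y_pred)
    return {
        "r2": model.score(X_test, y_test),
        "mae": mean_absolute_error(y_test, y_pred),
        "mse": mse_value,
        "rmse": np.sqrt(mse_value),
        "mape": np.mean(np.abs((y_test - y_pred) / y_test)) * 100,
    }


# Toy data (hypothetical, not the house prices set): y = 2x + 1 with no noise
X_demo = np.arange(1, 11, dtype=float).reshape(-1, 1)
y_demo = 2 * X_demo.ravel() + 1

demo_model = LinearRegression().fit(X_demo, y_demo)
demo_report = regression_report(demo_model, X_demo, y_demo)
for name, value in demo_report.items():
    print("{}: {:.4f}".format(name, value))
```

Calling `regression_report(lrm, X_test, Y_test)` on the fitted model above would return the same quantities as the printed statistics, assuming the variable names of this notebook.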


**Ridge Regression**

In [6]:
ridgeregr = Ridge(alpha=10**1.1) 
ridgeregr.fit(X_train, Y_train)

# Predict on the training and test sets
Y_preds_train = ridgeregr.predict(X_train)
Y_preds_test = ridgeregr.predict(X_test)

print("R-squared of the model in training set is: {:.4f}".format(ridgeregr.score(X_train, Y_train)))
print("-----Test set statistics-----")
print("R-squared of the model in test set is: {:.4f}".format(ridgeregr.score(X_test, Y_test)))
print("Mean absolute error of the prediction is: {:.4f}".format(mean_absolute_error(Y_test, Y_preds_test)))
print("Mean squared error of the prediction is: {:.4f}".format(mse(Y_test, Y_preds_test)))
print("Root mean squared error of the prediction is: {:.4f}".format(rmse(Y_test, Y_preds_test)))
print("Mean absolute percentage error of the prediction is: {:.4f}".format(np.mean(np.abs((Y_test - Y_preds_test) / Y_test)) * 100))

R-squared of the model in training set is: 0.7714
-----Test set statistics-----
R-squared of the model in test set is: 0.7546
Mean absolute error of the prediction is: 25902.2899
Mean squared error of the prediction is: 1686664580.4456
Root mean squared error of the prediction is: 41069.0222
Mean absolute percentage error of the prediction is: 14.9915
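
The assignment asks for k-fold cross-validation to choose the hyperparameters, while the cell above uses a fixed `alpha`. One way to do the selection is scikit-learn's `RidgeCV`. This is a minimal sketch under stated assumptions: the data comes from `make_regression` as a synthetic stand-in, and the alpha grid and the names `X_syn`, `alphas` and `ridge_cv` are illustrative choices, not values from this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: replace with X, Y from the notebook)
X_syn, y_syn = make_regression(n_samples=300, n_features=20, n_informative=8,
                               noise=15.0, random_state=6)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=6)

# Hypothetical grid of candidate penalties, evenly spaced on a log scale
alphas = 10.0 ** np.arange(-3, 4, 0.5)

# cv=5 runs 5-fold cross-validation on the training set for every alpha
ridge_cv = RidgeCV(alphas=alphas, cv=5)
ridge_cv.fit(X_tr, y_tr)

print("Best alpha value is: {:.4f}".format(ridge_cv.alpha_))
print("R-squared of the model in test set is: {:.4f}".format(ridge_cv.score(X_te, y_te)))
```

Only the training split is used to pick `alpha`, so the test-set score remains an out-of-sample check.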


**Lasso Regression**

In [7]:
lassoregr = Lasso(alpha=10**2.0)
lassoregr.fit(X_train, Y_train)

# Predict on the training and test sets
Y_preds_train = lassoregr.predict(X_train)
Y_preds_test = lassoregr.predict(X_test)

print("R-squared of the model in training set is: {:.4f}".format(lassoregr.score(X_train, Y_train)))
print("-----Test set statistics-----")
print("R-squared of the model in test set is: {:.4f}".format(lassoregr.score(X_test, Y_test)))
print("Mean absolute error of the prediction is: {:.4f}".format(mean_absolute_error(Y_test, Y_preds_test)))
print("Mean squared error of the prediction is: {:.4f}".format(mse(Y_test, Y_preds_test)))
print("Root mean squared error of the prediction is: {:.4f}".format(rmse(Y_test, Y_preds_test)))
print("Mean absolute percentage error of the prediction is: {:.4f}".format(np.mean(np.abs((Y_test - Y_preds_test) / Y_test)) * 100))

R-squared of the model in training set is: 0.7733
-----Test set statistics-----
R-squared of the model in test set is: 0.7501
Mean absolute error of the prediction is: 25971.7619
Mean squared error of the prediction is: 1717728722.0432
Root mean squared error of the prediction is: 41445.4910
Mean absolute percentage error of the prediction is: 14.9796
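
For Lasso, the penalty can be chosen with `LassoCV`, which fits the regularization path on each fold. The sketch below again assumes synthetic data from `make_regression` and a hypothetical alpha grid. It also counts the non-zero coefficients, since the L1 penalty can set some coefficients exactly to zero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: replace with X, Y from the notebook)
X_syn, y_syn = make_regression(n_samples=300, n_features=20, n_informative=8,
                               noise=15.0, random_state=6)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=6)

# Hypothetical grid of candidate penalties
alphas = 10.0 ** np.arange(-2, 3, 0.25)

# 5-fold cross-validation over the grid; max_iter raised to help convergence
lasso_cv = LassoCV(alphas=alphas, cv=5, max_iter=10000)
lasso_cv.fit(X_tr, y_tr)

n_selected = int(np.sum(lasso_cv.coef_ != 0))
print("Best alpha value is: {:.4f}".format(lasso_cv.alpha_))
print("Number of non-zero coefficients: {} of {}".format(n_selected, X_tr.shape[1]))
print("R-squared of the model in test set is: {:.4f}".format(lasso_cv.score(X_te, y_te)))
```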


**ElasticNet Regression**

In [8]:
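# Note: l1_ratio=1.0 makes the penalty pure L1, so this fit is equivalent to
# the Lasso model above with the same alpha (hence the identical metrics)
elasticregr = ElasticNet(alpha=10**2.0,l1_ratio=1.0)
elasticregr.fit(X_train, Y_train)

# Predict on the training and test sets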
Y_preds_train = elasticregr.predict(X_train)
Y_preds_test = elasticregr.predict(X_test)

print("R-squared of the model in training set is: {:.4f}".format(elasticregr.score(X_train, Y_train)))
print("-----Test set statistics-----")
print("R-squared of the model in test set is: {:.4f}".format(elasticregr.score(X_test, Y_test)))
print("Mean absolute error of the prediction is: {:.4f}".format(mean_absolute_error(Y_test, Y_preds_test)))
print("Mean squared error of the prediction is: {:.4f}".format(mse(Y_test, Y_preds_test)))
print("Root mean squared error of the prediction is: {:.4f}".format(rmse(Y_test, Y_preds_test)))
print("Mean absolute percentage error of the prediction is: {:.4f}".format(np.mean(np.abs((Y_test - Y_preds_test) / Y_test)) * 100))

R-squared of the model in training set is: 0.7733
-----Test set statistics-----
R-squared of the model in test set is: 0.7501
Mean absolute error of the prediction is: 25971.7619
Mean squared error of the prediction is: 1717728722.0432
Root mean squared error of the prediction is: 41445.4910
Mean absolute percentage error of the prediction is: 14.9796
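
ElasticNet has two hyperparameters, `alpha` and `l1_ratio`, and `ElasticNetCV` can search both with k-fold cross-validation. This is a sketch under assumptions: the data is synthetic, and both grids (`alphas`, `l1_ratios`) are hypothetical. Including `1.0` in the `l1_ratio` grid lets the search fall back to a pure Lasso penalty if that validates best.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: replace with X, Y from the notebook)
X_syn, y_syn = make_regression(n_samples=300, n_features=20, n_informative=8,
                               noise=15.0, random_state=6)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=6)

# Hypothetical grids: penalty strength and L1/L2 mix
alphas = 10.0 ** np.arange(-2, 3, 0.25)
l1_ratios = [0.1, 0.5, 0.9, 1.0]

# 5-fold cross-validation over every (l1_ratio, alpha) combination
enet_cv = ElasticNetCV(l1_ratio=l1_ratios, alphas=alphas, cv=5, max_iter=10000)
enet_cv.fit(X_tr, y_tr)

print("Best alpha value is: {:.4f}".format(enet_cv.alpha_))
print("Best l1_ratio value is: {:.2f}".format(enet_cv.l1_ratio_))
print("R-squared of the model in test set is: {:.4f}".format(enet_cv.score(X_te, y_te)))
```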


In [9]:
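# Why Ridge: on the test set it has the highest R-squared (0.7546) and the
# lowest MAE (25902.29) and RMSE (41069.02) of the four models. Lasso and
# ElasticNet are marginally better only on MAPE (14.98 vs 14.99).
print("**"*25)
print("The best model is Ridge Regression.")
print("**"*25)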

**************************************************
The best model is Ridge Regression.
**************************************************
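
The comparison above rests on a single train/test split. A k-fold comparison of the four model types can be sketched with `KFold` and `cross_val_score`. The data here is synthetic and the hyperparameter values in `candidates` are placeholders, not tuned values from this notebook, so the ranking it prints says nothing about the house prices data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption: replace with X, Y from the notebook)
X_syn, y_syn = make_regression(n_samples=300, n_features=20, n_informative=8,
                               noise=15.0, random_state=6)

# Placeholder hyperparameters, chosen only for illustration
candidates = {
    "OLS": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=1.0, max_iter=10000),
    "ElasticNet": ElasticNet(alpha=1.0, l1_ratio=0.5, max_iter=10000),
}

# The same shuffled 5-fold split is reused for every model
folds = KFold(n_splits=5, shuffle=True, random_state=6)

cv_rmse = {}
for name, model in candidates.items():
    neg_mse = cross_val_score(model, X_syn, y_syn, cv=folds,
                              scoring="neg_mean_squared_error")
    # Scores are negated MSE values; convert each fold to RMSE, then average
    cv_rmse[name] = float(np.sqrt(-neg_mse).mean())
    print("{}: mean 5-fold RMSE = {:.4f}".format(name, cv_rmse[name]))

best_name = min(cv_rmse, key=cv_rmse.get)
print("Lowest cross-validated RMSE:", best_name)
```

Averaging over folds makes the ranking less sensitive to one particular split than a single test-set comparison.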
