# 05 Regularization Models

This notebook is exclusively for experimenting with regularization models, which can be helpful when additional constraints on the coefficients are required. I found the operations below helpful to practice, but did not end up incorporating this approach in my final model.  

## Contents:  
[Ridge Model](#Ridge-Model)  
[RidgeCV Model](#RidgeCV-Model)  
[LASSO Model](#LASSO-Model)

In [1]:
#imports 
from scipy.stats import ttest_ind
import pandas as pd
import numpy as np
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt
import re
#week 3
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import statsmodels.api as sm
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
#week 4
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

In [2]:
test = pd.read_csv('../datasets/test.csv')

In [3]:
test.columns = [col.lower().replace(' ', '_') for col in test.columns] #reformat df columns

In [4]:
test.fillna(0, inplace=True)

In [5]:
home = pd.read_csv('../datasets/train.csv') #read in data set

In [6]:
home.columns = [col.lower().replace(' ', '_') for col in home.columns] #reformat df columns

In [7]:
home.head() #preview to ensure read in successful

Unnamed: 0,id,pid,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,...,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type,saleprice
0,109,533352170,60,RL,,13517,Pave,,IR1,Lvl,...,0,0,,,,0,3,2010,WD,130500
1,544,531379050,60,RL,43.0,11492,Pave,,IR1,Lvl,...,0,0,,,,0,4,2009,WD,220000
2,153,535304180,20,RL,68.0,7922,Pave,,Reg,Lvl,...,0,0,,,,0,1,2010,WD,109000
3,318,916386060,60,RL,73.0,9802,Pave,,Reg,Lvl,...,0,0,,,,0,4,2010,WD,174000
4,255,906425045,50,RL,82.0,14235,Pave,,IR1,Lvl,...,0,0,,,,0,3,2010,WD,138500


To create a model with the regularization techniques we learned recently, I'm going to reuse the numerical features list from my main model.  
NOTE: Most of the work in this notebook was learned from Lesson 4.02.

In [8]:
features = ['overall_qual', 'year_built', 'total_bsmt_sf', 'gr_liv_area', 'full_bath', 'garage_area',
           'mas_vnr_area', 'totrms_abvgrd', 'fireplaces', 'bsmtfin_sf_1', 
            'lot_frontage', 'lot_area', 'wood_deck_sf', 'open_porch_sf', '2nd_flr_sf', 'bsmt_full_bath', 'half_bath']

In [9]:
home.fillna(0, inplace=True) #fill in all nulls

In [10]:
home.isnull().sum().sum() #verify all nulls filled in

0

In [11]:
X = home[features] #create X
y = home['saleprice']

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

In [13]:
sc = StandardScaler()
Z_train = sc.fit_transform(X_train)
Z_test = sc.transform(X_test)

In [14]:
print(f'Z_train shape is: {Z_train.shape}')
print(f'y_train shape is: {y_train.shape}')
print(f'Z_test shape is: {Z_test.shape}')
print(f'y_test shape is: {y_test.shape}')

Z_train shape is: (1538, 17)
y_train shape is: (1538,)
Z_test shape is: (513, 17)
y_test shape is: (513,)


Dimensions look good!
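
Beyond the shapes, the scaling step itself can be sanity-checked. The sketch below uses synthetic stand-in data (the array `X_demo` and its column scales are made up for illustration, not taken from the housing set) and shows that fitting the scaler on the training rows only gives training columns with mean 0 and standard deviation 1, while the test columns land close to, but not exactly on, those values.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for three numeric features (hypothetical data)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[1500.0, 6.0, 2.0], scale=[500.0, 1.5, 0.7], size=(400, 3))

X_tr, X_te = train_test_split(X_demo, random_state=2)

sc_demo = StandardScaler()
Z_tr = sc_demo.fit_transform(X_tr)  # learn mean/std from the training rows only
Z_te = sc_demo.transform(X_te)      # reuse the training statistics on the test rows

print(Z_tr.mean(axis=0).round(6))   # essentially 0 for every column
print(Z_tr.std(axis=0).round(6))    # essentially 1 for every column
print(Z_te.mean(axis=0).round(3))   # near 0, but not exactly
```

Scaling matters here because the ridge and LASSO penalties act on coefficient size, so features on very different scales would otherwise be penalized unevenly.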

## Ridge Model

In [15]:
ridge_model = Ridge(alpha=100)

ridge_model.fit(Z_train, y_train)

Ridge(alpha=100)

Okay, time to get our scores!

In [16]:
ridge_model.score(Z_train, y_train)

0.8518862377012395

In [17]:
ridge_model.score(Z_test, y_test)

0.6247423802240789

Yikes. This looks like a pretty severe overfit: the training R² (about 0.85) is roughly 0.23 higher than the test R² (about 0.62). Let's see how it does on Kaggle's test data, which is scored by RMSE.
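
Since `.score()` on a scikit-learn regressor returns R² while Kaggle reports RMSE, it can help to compute both locally before submitting. Here is a minimal sketch on synthetic data; the helper name `fit_report` and the demo arrays are hypothetical and not part of the original workflow.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def fit_report(model, Z_tr, y_tr, Z_te, y_te):
    """Hypothetical helper: R^2 on both splits, their gap, and test RMSE."""
    r2_train = model.score(Z_tr, y_tr)  # .score() is R^2 for sklearn regressors
    r2_test = model.score(Z_te, y_te)
    rmse_test = np.sqrt(mean_squared_error(y_te, model.predict(Z_te)))
    return {
        "r2_train": r2_train,
        "r2_test": r2_test,
        "r2_gap": r2_train - r2_test,  # a large positive gap suggests overfitting
        "rmse_test": rmse_test,
    }


# Synthetic stand-in data (not the housing set)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=2)
sc_demo = StandardScaler()
Z_tr, Z_te = sc_demo.fit_transform(X_tr), sc_demo.transform(X_te)

report = fit_report(Ridge(alpha=100).fit(Z_tr, y_tr), Z_tr, y_tr, Z_te, y_te)
print(report)
```

In this notebook the same call would take `ridge_model`, `Z_train`, `y_train`, `Z_test`, and `y_test`.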

In [18]:
test.head() #reminder - we already read in the test set at the top

Unnamed: 0,id,pid,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,...,3ssn_porch,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type
0,2658,902301120,190,RM,69.0,9142,Pave,Grvl,Reg,Lvl,...,0,0,0,0,0,0,0,4,2006,WD
1,2718,905108090,90,RL,0.0,9662,Pave,0,IR1,Lvl,...,0,0,0,0,0,0,0,8,2006,WD
2,2414,528218130,60,RL,58.0,17104,Pave,0,IR1,Lvl,...,0,0,0,0,0,0,0,9,2006,New
3,1989,902207150,30,RM,60.0,8520,Pave,0,Reg,Lvl,...,0,0,0,0,0,0,0,7,2007,WD
4,625,535105100,20,RL,0.0,9500,Pave,0,IR1,Lvl,...,0,185,0,0,0,0,0,7,2009,WD


In [19]:
X_tester = test[features]
X_tester_sc = sc.transform(X_tester)

In [20]:
#https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html

In [21]:
submission = test[['id']].copy()
submission.rename({'id' : 'Id'}, axis=1, inplace=True)

In [22]:
submission.head() #making submission dataframe

Unnamed: 0,Id
0,2658
1,2718
2,2414
3,1989
4,625


In [23]:
submission['SalePrice'] = ridge_model.predict(X_tester_sc)

In [24]:
submission.head()

Unnamed: 0,Id,SalePrice
0,2658,163976.295979
1,2718,189074.885672
2,2414,206250.419753
3,1989,97354.915104
4,625,190145.777406


In [25]:
#submission.to_csv('./models/ridge_model.csv', index=False) #saving model

This model scored an RMSE of 32458.71844 on Kaggle. About 10k worse than my current best, but significantly better than the null (mean) prediction.
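
For reference, the null (mean) prediction mentioned above can be scored the same way. This is a small sketch using scikit-learn's `DummyRegressor`; the training targets are the five `saleprice` values shown in `home.head()`, and the holdout targets are made-up numbers used purely for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# First five saleprice values from home.head(); a toy sample, not the full training set
y_train_demo = np.array([130500, 220000, 109000, 174000, 138500], dtype=float)
# Hypothetical holdout prices (assumed values for illustration only)
y_holdout_demo = np.array([150000, 200000, 120000], dtype=float)

# DummyRegressor ignores the features, so placeholder columns are enough
null_model = DummyRegressor(strategy="mean")
null_model.fit(np.zeros((len(y_train_demo), 1)), y_train_demo)

baseline_pred = null_model.predict(np.zeros((len(y_holdout_demo), 1)))
baseline_rmse = np.sqrt(mean_squared_error(y_holdout_demo, baseline_pred))

print(baseline_pred[0])  # the training mean, repeated for every row
print(baseline_rmse)     # RMSE any real model should beat
```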

## RidgeCV Model

In [26]:
from sklearn.linear_model import RidgeCV

In [27]:
r_alphas = np.logspace(0, 4, 120) #generate 120 log-spaced samples from 10^0 to 10^4

In [28]:
ridge_cv = RidgeCV(alphas=r_alphas, scoring='r2', cv=5) #ridge with 5-fold cross-validation over the alpha grid, scored by R^2

In [29]:
ridge_cv.fit(Z_train, y_train);

In [30]:
ridge_cv.alpha_ #optimal alpha value

9.436043101478887
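
The chosen alpha is one of the grid points from `np.logspace(0, 4, 120)`, which are spaced evenly in the exponent rather than in raw value. The short sketch below illustrates that spacing and looks up where the reported alpha sits in the grid; the variable names are only for this example.

```python
import numpy as np

r_alphas_demo = np.logspace(0, 4, 120)  # same call as In [27]

# Endpoints are 10**0 and 10**4
print(r_alphas_demo[0], r_alphas_demo[-1])

# The step is constant in log10 space (4 / 119 per point)
exponents = np.log10(r_alphas_demo)
print(np.allclose(np.diff(exponents), 4 / 119))

# In raw units the gaps grow as alpha grows
raw_gaps = np.diff(r_alphas_demo)
print(raw_gaps[:3], raw_gaps[-3:])

# Locate the grid point nearest to the alpha reported by ridge_cv.alpha_
idx = int(np.argmin(np.abs(r_alphas_demo - 9.436043101478887)))
print(idx, r_alphas_demo[idx])
```

A log-spaced grid makes sense for a penalty strength because the useful range often spans several orders of magnitude.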

In [31]:
ridge_cv.score(Z_train, y_train)

0.8536016915876272

In [32]:
ridge_cv.score(Z_test, y_test)

0.6120195427712252

Well, this actually brought down our test score by a tiny bit and upped the training score. Time to see how it does on Kaggle!

In [33]:
submission = test[['id']].copy() #reset test submission
submission.rename({'id' : 'Id'}, axis=1, inplace=True) #format ID column
#test X scaled already saved above as X_tester_sc

In [34]:
submission['SalePrice'] = ridge_cv.predict(X_tester_sc)

In [35]:
submission.head()

Unnamed: 0,Id,SalePrice
0,2658,161841.353821
1,2718,186051.919808
2,2414,203678.776303
3,1989,95780.64654
4,625,186650.437586


In [36]:
#submission.to_csv('./models/ridgeCV_model.csv', index=False) #saving model

This model scored an RMSE of 31748.98997 on Kaggle, so a small improvement over the traditional Ridge model!

## LASSO Model

In [37]:
from sklearn.linear_model import Lasso, LassoCV

In [38]:
l_alphas = np.logspace(-2, 3, 120) #120 samples between 10^-2 and 10^3
lasso_cv = LassoCV(alphas=l_alphas, cv=5, max_iter=40_000) #cross validate the lasso alpha list
lasso_cv.fit(Z_train, y_train);

In [39]:
lasso_cv.alpha_ #optimal alpha value

679.0985029955721
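
This alpha (about 679) looks very different from the RidgeCV alpha (about 9.4), but the two are not on a common scale. As documented for scikit-learn's `Ridge` and `Lasso`, the objectives differ in both the penalty and the scaling of the squared-error term. Writing $n$ for the number of training rows, $Z$ for the scaled feature matrix, and $w$ for the coefficients (intercept omitted for brevity):

$$
\text{Ridge:}\quad \min_{w}\; \lVert y - Zw \rVert_2^2 + \alpha \lVert w \rVert_2^2
$$

$$
\text{LASSO:}\quad \min_{w}\; \frac{1}{2n}\lVert y - Zw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

The ridge penalty squares each coefficient, which shrinks all of them but rarely makes any exactly zero. The LASSO penalty uses absolute values, which can push individual coefficients all the way to zero.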

In [40]:
lasso_cv.score(Z_train, y_train)

0.8528055012897177

In [41]:
lasso_cv.score(Z_test, y_test)

0.6223938670706679

These scores are virtually identical to the ones above: I am still scoring much higher on the training set than on the test set.

In [42]:
lasso_cv.coef_  #checking for zero coefficients

array([27354.76711191,  8336.48974562, 10026.00535587, 23954.74350228,
          -0.        ,  6636.09744696,  4802.80387827,     0.        ,
        2506.8452313 , 10007.32361798,  3741.70670767,  4458.58111999,
        3140.7359952 ,  1914.56427745,    -0.        ,  1383.21998755,
          -0.        ])
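
The raw array is easier to read once each coefficient is paired with its feature name. The sketch below copies the `features` list from In [8] and the coefficient values from the output above (rounded to two decimals), then lists the features that LASSO zeroed out.

```python
import numpy as np
import pandas as pd

# Feature order copied from In [8]
features = ['overall_qual', 'year_built', 'total_bsmt_sf', 'gr_liv_area', 'full_bath',
            'garage_area', 'mas_vnr_area', 'totrms_abvgrd', 'fireplaces', 'bsmtfin_sf_1',
            'lot_frontage', 'lot_area', 'wood_deck_sf', 'open_porch_sf', '2nd_flr_sf',
            'bsmt_full_bath', 'half_bath']

# Coefficients copied from the lasso_cv.coef_ output, rounded to two decimals
coefs = np.array([27354.77, 8336.49, 10026.01, 23954.74, -0.0, 6636.10, 4802.80, 0.0,
                  2506.85, 10007.32, 3741.71, 4458.58, 3140.74, 1914.56, -0.0, 1383.22,
                  -0.0])

coef_by_feature = pd.Series(coefs, index=features)

# Features whose coefficient was shrunk exactly to zero
dropped = coef_by_feature[coef_by_feature == 0].index.tolist()
print(dropped)

# Remaining features, largest absolute coefficient first
kept = coef_by_feature[coef_by_feature != 0]
print(kept.abs().sort_values(ascending=False).head())
```

In the notebook itself, `pd.Series(lasso_cv.coef_, index=features)` would give the same pairing without copying values by hand.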

In [43]:
submission = test[['id']].copy() #reset test submission
submission.rename({'id' : 'Id'}, axis=1, inplace=True) #format ID column
#test X scaled already saved above as X_tester_sc

In [44]:
submission['SalePrice'] = lasso_cv.predict(X_tester_sc)

In [45]:
submission.head()

Unnamed: 0,Id,SalePrice
0,2658,164790.95187
1,2718,187189.560872
2,2414,204689.869972
3,1989,98131.787702
4,625,187389.456125


In [46]:
#submission.to_csv('./models/lassoCV_model.csv', index=False) #saving model

This model scored an RMSE of 31654.49310 on Kaggle, so it appears to be the best of the three regularization models. Unfortunately, it is still well short of my ordinary least squares model, so I will not be incorporating these methods into my final production model.