# LinDHA Mk. 5 (Ridge)

Since LASSO may have been throwing too much away, we want to change the penalty term in the cost function to one that encourages spreading small weights across many features, rather than relying on only a few features. The answer is to swap the $L_1$ penalty for an $L_2$ penalty, which shrinks coefficients toward zero without forcing them to be exactly zero. Doing this gets us the Ridge model:

$$ \min_{\beta} \frac{1}{m}\sum_{i=1}^{m}\left( y^{(i)}-h_{\beta}(x^{(i)})\right)^2 + \frac{\alpha}{2}\sum_{j=1}^{n}\beta_j^2 $$
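
To make the effect of the $L_2$ penalty concrete, here is a minimal NumPy sketch on synthetic data (the data, the helper name `ridge_coefficients`, and the penalty values are all made up for illustration; none of this is the Ames data). It solves the penalized least-squares problem in closed form, $(X^TX + \lambda I)\beta = X^Ty$, for centered data with no intercept. Note that scikit-learn's `Ridge` minimizes the unnormalized sum of squared errors plus $\alpha\lVert\beta\rVert^2$, so its `alpha` is not on the same scale as the $\alpha$ in the formula above; the generic $\lambda$ here plays that role.

```python
import numpy as np


def ridge_coefficients(X, y, lam):
    """Closed-form ridge solution for centered data: solve (X^T X + lam*I) beta = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


# Synthetic data (assumption): 5 features, two of which have no true effect.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
true_beta = np.array([3.0, -2.0, 0.0, 0.5, 0.0])
y = X @ true_beta + rng.normal(scale=0.1, size=200)

# Center both so that no intercept term is needed.
X = X - X.mean(axis=0)
y = y - y.mean()

lams = [0.0, 10.0, 100.0, 1000.0]
betas = {lam: ridge_coefficients(X, y, lam) for lam in lams}
norms = [float(np.linalg.norm(betas[lam])) for lam in lams]

for lam, norm in zip(lams, norms):
    n_zero = int(np.sum(betas[lam] == 0))
    print(f"lambda={lam:>7}: ||beta||={norm:.4f}, exact zeros={n_zero}")
```

As the penalty grows, the coefficient vector gets shorter, but no coefficient is driven exactly to zero. That is the behaviour we are hoping will keep the information LASSO discarded.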

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge


from sklearn import metrics
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline

import joblib

import PaulBettany as jarvis


pd.options.display.max_columns = 100
pd.options.display.max_rows = 2000


seed=1
score = 'neg_mean_absolute_error'

## Ridge and Tuning for $\alpha$

In [3]:
# The data sets have gotten huge and take a moment to load
train = pd.read_csv('../data/train.csv', index_col = 'Id').drop(index=2181)
data = pd.read_csv('../data/ames-poly-engineered.csv', index_col = 'Id')

In [4]:
print(train.shape, data.shape)

(2050, 80) (2928, 42194)


In [5]:
# prepare data
X = data.iloc[ : len(train.index) ]
Xkaggle = data.iloc[ len(train.index) : ]

y = train['SalePrice']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)

In [6]:
# scale data
zscale = StandardScaler()

Xs_train = zscale.fit_transform(X_train)
Xs_test = zscale.transform(X_test)
Xs_kaggle = zscale.transform(Xkaggle)


In [7]:
# list of parameter values to search
params = { 'alpha' : [500, 600, 800, 1000, 2000] }

In [8]:
print(params)

{'alpha': [500, 600, 800, 1000, 2000]}


In [9]:
lindhamk5_1 = GridSearchCV(Ridge(max_iter=10_000),
                         param_grid=params,
                         n_jobs=-1,
                         scoring = score,
                         cv = 3
                        )

In [10]:
# conduct search
lindhamk5_1.fit(Xs_train, y_train)

GridSearchCV(cv=3, estimator=Ridge(max_iter=10000), n_jobs=-1,
             param_grid={'alpha': [500, 600, 800, 1000, 2000]},
             scoring='neg_mean_absolute_error')

In [11]:
lindhamk5_1.best_estimator_

Ridge(alpha=2000, max_iter=10000)
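
The winning `alpha=2000` is the largest value in the grid, which suggests the optimum may lie further out. A small check for this can be sketched as follows. The helper `best_on_grid_edge`, the synthetic data, and the log-spaced grid are assumptions made for illustration, not part of this notebook's pipeline.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV


def best_on_grid_edge(search, name):
    """Return True if the winning value of `name` is the smallest or largest one searched."""
    values = search.param_grid[name]
    best = search.best_params_[name]
    return best in (min(values), max(values))


# Synthetic regression problem (assumption): 120 rows, 30 features.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(120, 30))
y_demo = X_demo @ rng.normal(size=30) + rng.normal(scale=5.0, size=120)

# A coarse log-spaced grid covers several orders of magnitude in one pass.
alpha_grid = np.logspace(-2, 4, 7).tolist()

demo_search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": alpha_grid},
    scoring="neg_mean_absolute_error",
    cv=3,
)
demo_search.fit(X_demo, y_demo)

# Mean cross-validated score for every candidate, not just the winner.
for alpha, mean_score in zip(
    demo_search.cv_results_["param_alpha"],
    demo_search.cv_results_["mean_test_score"],
):
    print(f"alpha={alpha:>10}: mean CV score={mean_score:.3f}")

on_edge = best_on_grid_edge(demo_search, "alpha")
print("best:", demo_search.best_params_, "| on grid edge:", on_edge)
```

If `on_edge` comes back `True`, the grid should be extended in that direction before the result is trusted.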

In [12]:
jarvis.grade_model(lindhamk5_1, Xs_train, y_train, display=True), jarvis.grade_model(lindhamk5_1, Xs_test, y_test, display=True);

 
R2 :0.989294119958519
MSE: 66622182.97861327
RMSE: 8162.241296274772
MAE: 4834.754119245182
 
 
R2 :0.893359686584041
MSE: 694877507.3245612
RMSE: 26360.529344543924
MAE: 15686.908447986978
 


In [13]:
# new list of search values
params2 = { 'alpha' : [2000, 2500, 3000, 3500, 4000, 6000] }

In [14]:
lindhamk5_2 = GridSearchCV(Ridge(max_iter=10_000),
                         param_grid=params2,
                         n_jobs=-1,
                         scoring = score,
                         cv = 3
                        )

In [15]:
lindhamk5_2.fit(Xs_train, y_train)

GridSearchCV(cv=3, estimator=Ridge(max_iter=10000), n_jobs=-1,
             param_grid={'alpha': [2000, 2500, 3000, 3500, 4000, 6000]},
             scoring='neg_mean_absolute_error')

In [16]:
lindhamk5_2.best_estimator_

Ridge(alpha=4000, max_iter=10000)

In [17]:
jarvis.grade_model(lindhamk5_2, Xs_train, y_train, display=True), jarvis.grade_model(lindhamk5_2, Xs_test, y_test, display=True);

 
R2 :0.9842776050735805
MSE: 97839716.82771002
RMSE: 9891.396101042057
MAE: 6160.830685246997
 
 
R2 :0.8966154396991309
MSE: 673662738.3820372
RMSE: 25955.013742667084
MAE: 15359.451813550993
 


In [18]:
# one more search?
params3 = { 'alpha' : list(range(3800, 5000,50)) }

lindhamk5_3 = GridSearchCV(Ridge(max_iter=10_000),
                         param_grid=params3,
                         n_jobs=-1,
                         scoring = score,
                         cv = 3
                        )
                           

In [19]:
lindhamk5_3.fit(Xs_train, y_train)

GridSearchCV(cv=3, estimator=Ridge(max_iter=10000), n_jobs=-1,
             param_grid={'alpha': [3800, 3850, 3900, 3950, 4000, 4050, 4100,
                                   4150, 4200, 4250, 4300, 4350, 4400, 4450,
                                   4500, 4550, 4600, 4650, 4700, 4750, 4800,
                                   4850, 4900, 4950]},
             scoring='neg_mean_absolute_error')

In [20]:
lindhamk5_3.best_estimator_

Ridge(alpha=4100, max_iter=10000)

In [21]:
jarvis.grade_model(lindhamk5_3, Xs_train, y_train, display=True), jarvis.grade_model(lindhamk5_3, Xs_test, y_test, display=True);

 
R2 :0.984069187046379
MSE: 99136692.31131941
RMSE: 9956.74104872269
MAE: 6211.220273701309
 
 
R2 :0.8967158319847416
MSE: 673008573.5643889
RMSE: 25942.408784929532
MAE: 15350.248562467748
 


In [22]:
jarvis.support(lindhamk5_3.best_estimator_)

25761

## Dialing Back The Feature Space

We remarked in the LASSO notebook that the 42,000+ features were not actually useful: polynomial-transforming the dummy variables produced many redundant and outright duplicate columns. We can attempt to dial it back by running Ridge against a much tamer 298-feature dataframe.
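
To see why polynomial terms of dummy variables are wasteful, consider this small NumPy sketch with a hypothetical 3-level categorical (the `levels` array is made up; it is not a column from the Ames data). Squaring a 0/1 column reproduces the same column, and the product of two dummies from the same categorical is always zero because a row can only belong to one level.

```python
import numpy as np
from itertools import combinations_with_replacement

# Hypothetical categorical with 3 levels, one entry per row.
levels = np.array([0, 1, 2, 1, 0, 2, 2, 1])
dummies = np.eye(3, dtype=int)[levels]  # one-hot encoding, shape (8, 3)

# All degree-2 products d_i * d_j with i <= j.
pairs = list(combinations_with_replacement(range(3), 2))
degree2 = np.column_stack([dummies[:, i] * dummies[:, j] for i, j in pairs])

# Squares (i == j) that are exact copies of the original dummy column.
duplicate_squares = sum(
    np.array_equal(degree2[:, k], dummies[:, i])
    for k, (i, j) in enumerate(pairs)
    if i == j
)

# Cross terms (i != j) that are identically zero.
all_zero_cross = sum(
    not degree2[:, k].any()
    for k, (i, j) in enumerate(pairs)
    if i != j
)

print(degree2.shape[1], duplicate_squares, all_zero_cross)  # prints: 6 3 3
```

All six degree-2 columns here are either copies of an existing column or constant zeros, so they add width without adding information.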

In [23]:
data = pd.read_csv('../data/ames-engineered.csv', index_col = 'Id').drop(columns='SalePrice')

In [24]:
data['MS SubClass'] = data['MS SubClass'].astype('object')

In [25]:
data_dummies = pd.get_dummies(data, drop_first=True)


In [26]:
data_dummies.shape

(2928, 298)

In [27]:
X = data_dummies.iloc[ : len(train.index) ]
Xkaggle = data_dummies.iloc[ len(train.index) : ]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)

Xs_train = zscale.fit_transform(X_train)
Xs_test = zscale.transform(X_test)
Xs_kaggle = zscale.transform(Xkaggle)

In [28]:
params4 = { 'alpha' : list(range(100,500,10)) }


lindhamk5_4 = GridSearchCV(Ridge(max_iter=10_000),
                         param_grid=params4,
                         n_jobs=-1,
                         scoring = score,
                         cv = 3
                        )

lindhamk5_4.fit(Xs_train, y_train)

jarvis.grade_model(lindhamk5_4, Xs_train, y_train, display=True), jarvis.grade_model(lindhamk5_4, Xs_test, y_test, display=True);

 
R2 :0.9287614473406154
MSE: 443314129.4331693
RMSE: 21055.02622732086
MAE: 13903.527516677803
 
 
R2 :0.8877372302317402
MSE: 731513919.2965387
RMSE: 27046.513995273748
MAE: 16834.64797220781
 


In [29]:
lindhamk5_4.best_estimator_

Ridge(alpha=290, max_iter=10000)

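We see that Mk. 5.4 with Ridge is still not performing as well as Mk. 4 with LASSO. On the test split it is also somewhat worse than Mk. 5.3 (MAE of roughly 16,835 versus 15,350), although its gap between train and test error is much smaller. At the moment we're not sure what is causing the discrepancy with LASSO. Maybe some feature engineering can make Ridge viable and competitive with the LASSO model, but this remains to be seen.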