# Regularization with SciKit-Learn

Regularization minimizes the RSS (residual sum of squares) *plus* a penalty term. The penalty grows with the size of the model's coefficients, so it discourages models whose coefficients are too large. Some regularization methods (such as LASSO) can shrink the coefficients of unhelpful features all the way to zero, in which case the model effectively ignores those features.
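To make the idea concrete, here is a minimal sketch of a penalized objective on made-up data. The helper name `penalized_rss`, the toy arrays, and the value of `alpha` are assumptions for illustration only; scikit-learn's estimators may scale these terms differently (for example, `Lasso` divides the RSS by twice the number of samples).

```python
import numpy as np

def penalized_rss(X, y, coef, alpha, penalty="l2"):
    """Return RSS plus an L2 (ridge-style) or L1 (lasso-style) penalty.

    Illustrative helper, not part of scikit-learn.
    """
    residuals = y - X @ coef
    rss = np.sum(residuals ** 2)
    if penalty == "l2":
        return rss + alpha * np.sum(coef ** 2)
    if penalty == "l1":
        return rss + alpha * np.sum(np.abs(coef))
    raise ValueError("penalty must be 'l1' or 'l2'")

# Toy data where coefficients [1, 2] fit y exactly
X_toy = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]])
y_toy = np.array([5.0, 4.0, 9.0])

small_coef = np.array([1.0, 2.0])    # perfect fit, small coefficients
large_coef = np.array([10.0, -7.0])  # poor fit, large coefficients

ridge_small = penalized_rss(X_toy, y_toy, small_coef, alpha=1.0, penalty="l2")
lasso_small = penalized_rss(X_toy, y_toy, small_coef, alpha=1.0, penalty="l1")
ridge_large = penalized_rss(X_toy, y_toy, large_coef, alpha=1.0, penalty="l2")

print(ridge_small, lasso_small, ridge_large)
```

With a perfect fit the RSS is zero, so the whole objective for `small_coef` comes from the penalty; the large coefficients are charged both for their worse fit and for their size.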

## Imports

In [63]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Data and Setup

In [64]:
df = pd.read_csv('Advertising.csv')

In [65]:
X = df.drop('sales', axis=1)

In [66]:
y = df['sales']

### Polynomial Conversion

In [67]:
from sklearn.preprocessing import PolynomialFeatures

In [68]:
polynomial_converter = PolynomialFeatures(degree=3, include_bias=False)

In [69]:
poly_features = polynomial_converter.fit_transform(X)

In [70]:
poly_features.shape

(200, 19)
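The 19 columns are consistent with three input features expanded to degree 3. The sketch below counts the polynomial terms directly; it assumes the frame has three feature columns once `sales` is dropped, which the output above suggests but does not show.

```python
from math import comb

n_features = 3  # assumed: three advertising-spend columns remain after dropping 'sales'
degree = 3

# Number of monomials of total degree <= `degree` in `n_features` variables,
# including the constant term
n_terms_with_bias = comb(n_features + degree, degree)

# include_bias=False drops the constant column
n_terms = n_terms_with_bias - 1

print(n_terms)  # 19
```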

### Train | Test Split

In [71]:
from sklearn.model_selection import train_test_split

In [72]:
X_train, X_test, y_train, y_test = train_test_split(poly_features, y, test_size=0.3, random_state=101)

In [73]:
X_train.shape

(140, 19)

----

### Scaling the data

While the raw features in our data set are all on the same order of magnitude (thousands of dollars spent), that typically won't be the case, and the polynomial terms built from them already differ widely in scale. Because the penalty in regularized models adds up the coefficients' magnitudes, it's important to standardize the features so that every coefficient is penalized on a comparable scale.

In [74]:
from sklearn.preprocessing import StandardScaler

In [75]:
scaler = StandardScaler()

In [76]:
# Only fit on the training data so that no information leaks in from the test set
scaler.fit(X_train)

In [77]:
X_train = scaler.transform(X_train)

In [78]:
X_test = scaler.transform(X_test)
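As a sanity check on what the scaling step does, the following sketch repeats it on synthetic data. The `demo` names and the generated values are assumptions chosen so the notebook's own variables are not overwritten.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two hypothetical features on very different scales
X_demo = np.column_stack([
    rng.normal(100, 20, size=50),
    rng.normal(0.5, 0.1, size=50),
])
X_demo_train, X_demo_test = X_demo[:35], X_demo[35:]

demo_scaler = StandardScaler()
# Learn the mean and standard deviation from the training rows only
X_demo_train_scaled = demo_scaler.fit_transform(X_demo_train)
# Reuse the training statistics for the test rows
X_demo_test_scaled = demo_scaler.transform(X_demo_test)

print(X_demo_train_scaled.mean(axis=0))  # approximately 0 for each column
print(X_demo_train_scaled.std(axis=0))   # approximately 1 for each column
print(X_demo_test_scaled.mean(axis=0))   # close to 0, but not exactly
```

The test columns are generally not exactly zero-mean after transformation, which is expected: they are shifted and scaled with the training statistics, not their own.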

## Ridge Regression

In [79]:
from sklearn.linear_model import Ridge

In [80]:
ridge_model = Ridge(alpha=10)

In [81]:
ridge_model.fit(X_train, y_train)

In [82]:
test_predictions = ridge_model.predict(X_test)

In [83]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

In [84]:
MAE = mean_absolute_error(y_test, test_predictions)
MSE = mean_squared_error(y_test, test_predictions)
RMSE = np.sqrt(MSE)

In [85]:
MAE

0.5774404204714183

In [86]:
RMSE

0.8946386461319685

How did it perform on the training set? (This will be used later on for comparison)

In [88]:
# Training Set Performance
train_predictions = ridge_model.predict(X_train)
MAE = mean_absolute_error(y_train, train_predictions)
MAE

0.5288348183025332

**How do we know alpha=10 was the best choice?**

### Choosing an alpha value with Cross-Validation

In [89]:
from sklearn.linear_model import RidgeCV

In [90]:
# Choosing a scoring: https://scikit-learn.org/stable/modules/model_evaluation.html
# Negative MAE is used so that the metric follows the convention "higher is better"

# See all options: sklearn.metrics.get_scorer_names() (sklearn.metrics.SCORERS in releases before 1.3)
ridge_cv_model = RidgeCV(alphas=(0.1, 1.0, 10.0), scoring='neg_mean_absolute_error')

In [91]:
# The more alpha options you pass, the longer this will take.
# Fortunately our data set is still pretty small
ridge_cv_model.fit(X_train,y_train)

In [92]:
ridge_cv_model.alpha_

0.1
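The search behind this result can be sketched by hand: score each candidate alpha with cross-validation and keep the one with the highest (least negative) score. This is only an illustration on synthetic data from `make_regression`, with 5-fold splitting as an assumption; by default `RidgeCV` uses an efficient form of leave-one-out cross-validation instead.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the scaled training data
X_demo, y_demo = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

candidate_alphas = [0.1, 1.0, 10.0]
mean_scores = {}
for alpha in candidate_alphas:
    # Each fold returns a negative MAE, so higher is better
    fold_scores = cross_val_score(
        Ridge(alpha=alpha), X_demo, y_demo,
        cv=5, scoring="neg_mean_absolute_error",
    )
    mean_scores[alpha] = fold_scores.mean()

# Keep the alpha with the highest mean score
best_alpha = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print(best_alpha)
```

If the selected alpha sits at the edge of the candidate list, as 0.1 does above, it can be worth extending the list in that direction and searching again.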

In [93]:
test_predictions = ridge_cv_model.predict(X_test)

In [94]:
MAE = mean_absolute_error(y_test, test_predictions)
MSE = mean_squared_error(y_test, test_predictions)
RMSE = np.sqrt(MSE)

In [95]:
MAE

0.42737748843373746

In [96]:
RMSE

0.6180719926921404

In [99]:
# Training Set Performance
train_predictions = ridge_cv_model.predict(X_train)
MAE = mean_absolute_error(y_train, train_predictions)
MAE

0.3094132105662787

In [100]:
ridge_cv_model.coef_

array([ 5.40769392,  0.5885865 ,  0.40390395, -6.18263924,  4.59607939,
       -1.18789654, -1.15200458,  0.57837796, -0.1261586 ,  2.5569777 ,
       -1.38900471,  0.86059434,  0.72219553, -0.26129256,  0.17870787,
        0.44353612, -0.21362436, -0.04622473, -0.06441449])

In [101]:
ridge_cv_model.best_score_

-0.3749223340292956

-----

## LASSO Regression - least absolute shrinkage and selection operator

In [102]:
from sklearn.linear_model import LassoCV

In [103]:
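# eps is the ratio alpha_min / alpha_max of the alpha grid; n_alphas is the number of alphas tried along it
lasso_cv_model = LassoCV(eps=0.1, n_alphas=100, cv=5)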

In [104]:
lasso_cv_model.fit(X_train, y_train)

In [105]:
lasso_cv_model.alpha_

0.4943070909225828

In [106]:
test_predictions = lasso_cv_model.predict(X_test)

In [107]:
MAE = mean_absolute_error(y_test, test_predictions)
MSE = mean_squared_error(y_test, test_predictions)
RMSE = np.sqrt(MSE)

In [108]:
MAE

0.6541723161252854

In [109]:
RMSE

1.130800102276253

In [110]:
# Training Set Performance
train_predictions = lasso_cv_model.predict(X_train)
MAE = mean_absolute_error(y_train, train_predictions)
MAE

0.6912807140820695

In [111]:
lasso_cv_model.coef_

array([1.002651  , 0.        , 0.        , 0.        , 3.79745279,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        ])
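Only two of the 19 coefficients above are non-zero, which is the feature-selection behavior of the L1 penalty. The sketch below reproduces the effect on synthetic data where only the first feature matters and contrasts it with ridge; the data, the `demo` names, and the alpha values are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(42)

# Five standardized features; only the first one drives the target
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=0.5).fit(X_demo, y_demo)

# Lasso sets the irrelevant coefficients to exactly zero;
# ridge only shrinks them towards zero
print(lasso_demo.coef_)
print(ridge_demo.coef_)
print(np.count_nonzero(lasso_demo.coef_), np.count_nonzero(ridge_demo.coef_))
```

The same `np.count_nonzero` call can be applied to `lasso_cv_model.coef_` to count how many polynomial features the fitted model actually uses.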

----

## Elastic Net

Elastic Net combines the penalties of ridge regression and lasso in an attempt to get the best of both worlds!
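How the two penalties are blended can be sketched as follows. The formula follows the objective documented for scikit-learn's `ElasticNet`, where `l1_ratio` sets the mix; the helper name `elastic_net_penalty` and the example coefficients are assumptions for illustration.

```python
import numpy as np

def elastic_net_penalty(coef, alpha, l1_ratio):
    """Penalty term of the elastic net objective (illustrative helper).

    l1_ratio=1 gives a pure lasso (L1) penalty;
    l1_ratio=0 gives a pure ridge-style (squared L2) penalty.
    """
    l1_term = alpha * l1_ratio * np.sum(np.abs(coef))
    l2_term = 0.5 * alpha * (1 - l1_ratio) * np.sum(coef ** 2)
    return l1_term + l2_term

coef_demo = np.array([3.0, -4.0])

pure_l1 = elastic_net_penalty(coef_demo, alpha=2.0, l1_ratio=1.0)
pure_l2 = elastic_net_penalty(coef_demo, alpha=2.0, l1_ratio=0.0)
blended = elastic_net_penalty(coef_demo, alpha=2.0, l1_ratio=0.5)

print(pure_l1, pure_l2, blended)
```

Passing a list of `l1_ratio` values to `ElasticNetCV` lets cross-validation choose the mix alongside alpha.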

In [112]:
from sklearn.linear_model import ElasticNetCV

In [113]:
elastic_model = ElasticNetCV(l1_ratio=[.1, .5, .7, .9, .95, .99, 1], eps=0.001, n_alphas=100, max_iter=1000000)

In [114]:
elastic_model.fit(X_train, y_train)

In [116]:
test_predictions = elastic_model.predict(X_test)

In [117]:
MAE = mean_absolute_error(y_test, test_predictions)
MSE = mean_squared_error(y_test, test_predictions)
RMSE = np.sqrt(MSE)

In [118]:
MAE

0.43350346185900757

In [119]:
RMSE

0.6063140748984036

In [121]:
elastic_model.coef_

array([ 4.86023329,  0.12544598,  0.20746872, -4.99250395,  4.38026519,
       -0.22977201, -0.        ,  0.07267717, -0.        ,  1.77780246,
       -0.69614918, -0.        ,  0.12044132, -0.        , -0.        ,
       -0.        ,  0.        ,  0.        , -0.        ])