---
# Lecture notes - Regularized linear models
---

These are the lecture notes for **regularized linear models**.

<p class = "alert alert-info" role="alert"><b>Note</b> that this lecture note gives a brief introduction to regularized linear models. I encourage you to read further about regularized linear models. </p>

Read more:

- [Regularized linear models medium](https://medium.com/analytics-vidhya/regularized-linear-models-in-machine-learning-d2a01a26a46)
- [Ridge regression wikipedia](https://en.wikipedia.org/wiki/Ridge_regression)
- [Tikhonov regularization wikipedia](https://en.wikipedia.org/wiki/Tikhonov_regularization)
- [Lasso regression wikipedia](https://en.wikipedia.org/wiki/Lasso_(statistics))
- [Cross-validation, Swedish Wikipedia (Korsvalidering)](https://sv.wikipedia.org/wiki/Korsvalidering)
- [Cross validation](https://machinelearningmastery.com/k-fold-cross-validation/)
- [Scoring parameter sklearn](https://scikit-learn.org/stable/modules/model_evaluation.html)
- [ISLP pp 240-253](https://www.statlearning.com/)
---


In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
plt.style.use('ggplot')


## Data preparation 


In [2]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures



In [15]:
df = pd.read_csv('../data/Advertising.csv', index_col=0)

X, y = df.drop('Sales', axis=1), df['Sales'] # X is a DataFrame, y is a Series
# This is a multiple polynomial regression
# The terms are all monomials of total degree 1 to 3 in the three features:
# x1, x2, x3, x1^2, x1*x2, x1*x3, ..., x1^3, x1^2*x2, ..., x1*x2*x3, ..., x3^3. In total 19 terms/features
# Exercise 3 (e02) has a task on finding the degree. It is chosen conservatively here, since the number of features grows quickly
# with the degree of the polynomial and we want to avoid overfitting

model_poly = PolynomialFeatures(degree=3, include_bias=False) # include_bias=False leaves out the constant (bias) column; degree=3 gives polynomial terms up to degree 3
poly_features = model_poly.fit_transform(X) # creates new features that are polynomial combinations of the original features

X_train, X_test, y_train, y_test = train_test_split(poly_features, y, test_size=0.33, random_state=42)
# We split the data into a training set and a test set
# train_test_split returns four values: X_train and X_test are the features of the training and test sets, y_train and y_test are the corresponding targets

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((134, 19), (66, 19), (134,), (66,))
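
To see where the 19 columns come from, the monomials can be enumerated by hand. This is a minimal sketch using only the standard library. The helper `polynomial_terms` and the placeholder feature names `x1`, `x2`, `x3` are illustrative assumptions, not part of scikit-learn or the dataset.

```python
from itertools import combinations_with_replacement
from math import comb

def polynomial_terms(names, degree):
    """List every monomial of total degree 1..degree (no constant term)."""
    terms = []
    for d in range(1, degree + 1):
        # each multiset of d feature names is one monomial, e.g. ('x1', 'x1', 'x2') -> x1^2 * x2
        for combo in combinations_with_replacement(names, d):
            terms.append("*".join(combo))
    return terms

terms = polynomial_terms(["x1", "x2", "x3"], degree=3)
n_terms = len(terms)

# closed-form count: all monomials of degree <= 3 in 3 variables, minus the constant
expected = comb(3 + 3, 3) - 1

print(n_terms, expected)
print(terms)
```

There are 3 linear, 6 quadratic and 10 cubic terms, which matches the 19 columns in the shapes above. With degree 4 the same count would already be `comb(3 + 4, 4) - 1 = 34`.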

___
## Feature standardization 

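Remove the sample mean and divide by the sample standard deviation, feature by feature:

$X' = \frac{X-\mu}{\sigma}$

The LASSO, ridge and elastic net regression that we'll use later require scaled data, because their penalty terms depend on the size of the weights and hence on the scale of each feature. The scaler is fitted on the training set only and then applied to the test set, which is why the scaled test set below does not have exactly mean 0 and standard deviation 1.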

In [18]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_X_train = scaler.fit_transform(X_train) # fits the scaler on the training set and scales it
scaled_X_test = scaler.transform(X_test) # scales the test set with the training set's mean and standard deviation

print(f'Scaled X_train mean: {scaled_X_train.mean():.2f} and std: {scaled_X_train.std():.2f}')
print(f'Scaled X_test mean: {scaled_X_test.mean():.2f} and std: {scaled_X_test.std():.2f}')

Scaled X_train mean: -0.00 and std: 1.00
Scaled X_test mean: -0.12 and std: 1.12
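
The fit/transform split can be reproduced by hand. This is a minimal NumPy sketch on synthetic data (the arrays `train` and `test` are made up for illustration and are not the advertising data). It assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np

rng = np.random.default_rng(0)  # synthetic data, only for illustration
train = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
test = rng.normal(loc=50.0, scale=10.0, size=(40, 3))

# "fit": estimate mean and standard deviation per column on the training set only
mu = train.mean(axis=0)
sigma = train.std(axis=0)  # population standard deviation (ddof=0)

# "transform": apply the training statistics to both sets
train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma

print(train_scaled.mean(axis=0).round(2), train_scaled.std(axis=0).round(2))
print(test_scaled.mean(axis=0).round(2), test_scaled.std(axis=0).round(2))
```

The training columns end up with mean 0 and standard deviation 1 by construction. The test columns only come close, since they are scaled with statistics they did not contribute to. This avoids leaking information from the test set into the model.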


## Regularization techniques 

The problem of overfitting was discussed in the previous lecture. When a model is too complex, and the data is noisy and the dataset too small, the model picks up patterns in the noise. The output of a linear regression is the weighted sum $y = \theta_0 + \theta_1x_1 + \theta_2x_2 + \ldots + \theta_nx_n$, where the weight $\theta_i$ represents the importance of the $i$th feature. We want to constrain the weights associated with noise through **regularization**. We do this by **adding a regularization term** to the cost function used when training the model. **Note** that the cost function used for evaluation will now differ from the one used for training, since the regularization term is only included during training.

<p class = "alert alert-info" role="alert"><b>Note</b> that most regularized models require the data to be scaled. </p>

---
### Ridge regression 
Also called Tikhonov regularization or $\ell_2$ regularization.

$C(\vec{\theta}) = MSE(\vec{\theta}) + \lambda \frac{1}{2}\sum_{i=1}^n \theta_i^2$

where $\lambda \ge 0$ is the ridge parameter, which controls the strength of the penalty term and reduces variance at the cost of increased bias. Observe that the sum starts from 1, so the bias term $\theta_0$ is not affected by $\lambda$. The larger $\lambda$ is, the more large weights $\theta_i$, $i = 1, 2, \ldots, n$, add to the cost, so training shrinks them toward zero. As variance decreases and bias increases, the model fits the noise in the training data less closely and tends to generalize better, as long as $\lambda$ is not so large that the model underfits.

From the closed-form solution to ridge regression, we see that $\lambda = 0$ gives us the normal equation for ordinary least squares (OLS) linear regression:

$\hat{\vec{\theta}} = (X^TX + \lambda I)^{-1}X^T\vec{y}$
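
The closed-form expression can be checked numerically. This is a minimal NumPy sketch on synthetic data. The helper `ridge_closed_form` is hypothetical, and the features and target are centered so that no bias column is needed (with a column of ones, a plain identity matrix would also penalize $\theta_0$).

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) theta = X^T y for theta."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

rng = np.random.default_rng(42)  # synthetic data, only for illustration
X = rng.normal(size=(50, 4))
X = X - X.mean(axis=0)           # centered features, so no bias column is needed
y = X @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=50)
y = y - y.mean()

theta_ols = np.linalg.lstsq(X, y, rcond=None)[0]  # ordinary least squares
theta_0 = ridge_closed_form(X, y, lam=0.0)        # lambda = 0: the normal equation
theta_10 = ridge_closed_form(X, y, lam=10.0)      # lambda = 10: shrunk weights

print(np.allclose(theta_0, theta_ols))
print(np.linalg.norm(theta_0), np.linalg.norm(theta_10))
```

With $\lambda = 0$ the result coincides with the least squares solution, and a larger $\lambda$ gives a weight vector with a smaller norm.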



In [22]:
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error as mse, mean_absolute_error as mae

def ridge_regression(X, penalty=0):
    # alpha = 0 means no regularization, which gives ordinary linear regression
    # note that the default alpha in sklearn is 1.0
    # note that alpha plays the role of lambda in the theory, i.e. the strength of the penalty term in the cost function; sklearn uses the name alpha to keep its API general
    # the model is always fitted on the (global) scaled training set and then predicts on X
    model_ridge = Ridge(alpha=penalty)
    model_ridge.fit(scaled_X_train, y_train)
    y_pred = model_ridge.predict(X)
    return y_pred

y_pred_test = ridge_regression(scaled_X_test, 0.2) # predictions on the test set
MSE = mse(y_test, y_pred_test)
RMSE = np.sqrt(MSE) # RMSE is more interpretable since it's in the same units as the target

RMSE, mae(y_test, y_pred_test) # (RMSE, MAE) for the test set predictions

(0.6109310380379472, 0.4845959994544078)

In [21]:
# compare with plain linear regression -> on this split it gets a somewhat lower RMSE and MAE than ridge with alpha=0.2
from sklearn.linear_model import LinearRegression
model_lin = LinearRegression() # creates an instance of a linear regression model
model_lin.fit(scaled_X_train, y_train) # trains the model on the training set
y_pred_lin = model_lin.predict(scaled_X_test) # makes predictions on the test set
np.sqrt(mse(y_test, y_pred_lin)), mae(y_test, y_pred_lin) # (RMSE, MAE) for the test set predictions

(0.5148267621786599, 0.37485164412178346)

In [23]:
from sklearn.linear_model import Lasso
model_lasso = Lasso(alpha=0.1) # alpha=0.1 means a small amount of regularization
model_lasso.fit(scaled_X_train, y_train) # trains the model on the training set
y_pred_lasso = model_lasso.predict(scaled_X_test) # makes predictions on the test set
np.sqrt(mse(y_test, y_pred_lasso)), mae(y_test, y_pred_lasso) # (RMSE, MAE) for the test set; not displayed, since it is not the last expression in the cell
print(model_lasso.coef_) # prints the coefficients of the model

# Lasso gives us a sparse model, i.e. a model with fewer features than we started with
# This is useful if we have many features and want a simpler model
# In this way we can use Lasso for feature selection
# We see that Lasso has set the coefficients of some features to 0, i.e. it has removed them from the model

[ 1.89480144  0.42062367  0.         -0.          3.55216501  0.
  0.          0.01110965  0.         -0.42677394 -0.         -0.
  0.          0.         -0.          0.          0.06706906  0.
  0.        ]
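
How sparse the model is can be read off by counting the nonzero coefficients. This is a small sketch in plain Python, with the values copied from the printed output above. The variable names are illustrative.

```python
# coefficients copied from the printed output of Lasso(alpha=0.1) above
coefs = [1.89480144, 0.42062367, 0.0, -0.0, 3.55216501, 0.0,
         0.0, 0.01110965, 0.0, -0.42677394, -0.0, -0.0,
         0.0, 0.0, -0.0, 0.0, 0.06706906, 0.0,
         0.0]

# column indices whose weight survived the L1 penalty (-0.0 compares equal to 0)
kept = [i for i, c in enumerate(coefs) if c != 0]

print(f"{len(kept)} of {len(coefs)} features kept, at column indices {kept}")
```

Here 6 of the 19 polynomial features keep a nonzero weight. In the notebook, `model_poly.get_feature_names_out()` could map these indices back to term names, assuming a scikit-learn version that provides that method.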


### k-fold cross-validation 

One strategy for choosing the **best hyperparameter alpha** is to take the training part of the data and:

 1. shuffle the dataset randomly 
 2. split it into k groups 
 3. for each group -> use it as the validation set and the rest as training set -> fit the model -> predict on the validation set -> get the evaluation metric
 4. take the mean of the evaluation metrics
 5. choose the parameters with the best mean metric and train on the entire training dataset

Repeat the process for each alpha to see which one yields the lowest RMSE. k-fold cross-validation is:

- good for smaller datasets 
- a fair evaluation, as the mean of the evaluation metric over all k groups is calculated 
- expensive to compute, as it requires k fits per candidate alpha plus one final fit on the entire training set 
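
The five steps can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not how scikit-learn implements it. The data is synthetic, the helpers `ridge_fit` and `k_fold_rmse` are hypothetical, and the model is a closed-form ridge regression without a bias term.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge weights; assumes no bias term is needed."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def k_fold_rmse(X, y, lam, k=5, seed=0):
    """Mean validation RMSE over k folds for one penalty value."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))        # 1. shuffle
    folds = np.array_split(indices, k)       # 2. split into k groups
    rmses = []
    for i, val_idx in enumerate(folds):      # 3. each group is the validation set once
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        theta = ridge_fit(X[train_idx], y[train_idx], lam)
        residuals = y[val_idx] - X[val_idx] @ theta
        rmses.append(np.sqrt(np.mean(residuals ** 2)))
    return float(np.mean(rmses))             # 4. mean of the metric

rng = np.random.default_rng(1)  # synthetic data, only for illustration
X = rng.normal(size=(60, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=60)

alphas = [0.001, 0.1, 1.0, 10.0, 100.0]
cv_scores = {a: k_fold_rmse(X, y, a) for a in alphas}  # same folds for every alpha
best_alpha = min(cv_scores, key=cv_scores.get)
theta_final = ridge_fit(X, y, best_alpha)    # 5. refit on all training data

print(cv_scores)
print(best_alpha)
```

Using the same seed for every alpha means all candidates are compared on identical folds, so differences in the mean RMSE come from the penalty and not from the shuffling.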

___

### Ridge regression 

In [30]:
from sklearn.linear_model import RidgeCV # RidgeCV is a variant of Ridge with built-in cross-validation for finding the best value of alpha
from sklearn.metrics import make_scorer # make_scorer creates a scorer object that can be used in cross-validation
#from sklearn.metrics import SCORERS # SCORERS is a dictionary containing all the scorers available in sklearn

#SCORERS.keys() # prints all the scorers available in sklearn
# We use make_scorer to create a scorer for finding the best value of alpha in RidgeCV
# greater_is_better=False negates the MSE, because sklearn uses the convention that the higher the score, the better the model

model_ridge_cv = RidgeCV(alphas=[0.0001, .001, .01, .1, 5, 10], scoring=make_scorer(mse, greater_is_better=False))
model_ridge_cv.fit(scaled_X_train, y_train)
model_ridge_cv.alpha_ # best alpha
print(model_ridge_cv.alpha_)

0.1
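
The sign convention behind `greater_is_better=False` can be illustrated without scikit-learn. In this sketch the error values are made up for illustration and are not results from the notebook.

```python
# hypothetical mean validation MSE per alpha (made-up numbers for illustration)
mean_mse = {0.001: 0.42, 0.1: 0.35, 10: 0.58}

# a "score" follows the convention higher = better, so the error is negated
scores = {alpha: -error for alpha, error in mean_mse.items()}

best_by_score = max(scores, key=scores.get)      # maximize the score
best_by_error = min(mean_mse, key=mean_mse.get)  # minimize the error

print(best_by_score, best_by_error)
```

Maximizing the negated error selects the same alpha as minimizing the error itself. This is why a model selection routine that always maximizes a score can still be used with an error metric. scikit-learn also offers the built-in scoring string `'neg_mean_squared_error'` for the same purpose.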


In [31]:
# best alpha is 0.1
# it seems that linear regression has outperformed ridge regression in this case
# however, this is not always the case; ridge regression is more robust to multicollinearity and overfitting
# it also shrinks the coefficients towards zero, which can make them easier to interpret
# the result could also depend on how the data happened to be split into train and test, so using 0.1 is the more robust choice here

y_pred_ridge_cv = model_ridge_cv.predict(scaled_X_test) # makes predictions on the test set
RMSE = np.sqrt(mse(y_test, y_pred_ridge_cv))
RMSE, mae(y_test, y_pred_ridge_cv) # (RMSE, MAE) for the test set predictions

(0.5635899169610441, 0.4343075766545298)

In [32]:
model_ridge_cv.coef_ # the coefficients of the model

array([ 5.84681185,  0.52142086,  0.71689997, -6.17948738,  3.75034058,
       -1.36283352, -0.08571128,  0.08322815, -0.34893776,  2.16952446,
       -0.47840838,  0.68527348,  0.63080799, -0.5950065 ,  0.61661989,
       -0.31335495,  0.36499629,  0.03328145, -0.13652471])

## Lasso Regression 

In [35]:
from sklearn.linear_model import LassoCV

# it tries 100 alphas along the regularization path; eps=0.001 sets the ratio alpha_min / alpha_max of that path
model_lasso_cv = LassoCV(eps=0.001, n_alphas=100, max_iter=10000, cv=5) # cv=5 means that we use 5-fold cross-validation
model_lasso_cv.fit(scaled_X_train, y_train) # trains the model on the training set
print(f'alpha = {model_lasso_cv.alpha_}') # best alpha

y_pred_lasso_cv = model_lasso_cv.predict(scaled_X_test) # makes predictions on the test set
np.sqrt(mse(y_test, y_pred_lasso_cv)), mae(y_test, y_pred_lasso_cv) # (RMSE, MAE) for the test set predictions

alpha = 0.004968802520343366


(0.5785146895301982, 0.46291883026933045)

In [36]:
# we can see that LassoCV has set some coefficients to zero
# this means that LassoCV has performed feature selection
# it has removed some features from the model
# this is useful if we have many features and want a simpler model

model_lasso_cv.coef_ # the coefficients of the model

array([ 5.19612354,  0.43037087,  0.29876351, -4.80417579,  3.46665205,
       -0.40507212,  0.        ,  0.        ,  0.        ,  1.35260206,
       -0.        ,  0.        ,  0.14879719, -0.        ,  0.        ,
        0.        ,  0.09649665,  0.        ,  0.04353956])

## Elastic net 

Elastic net combines **ridge ($\ell_2$) regularization** and **lasso ($\ell_1$) regularization**. The cost function that elastic net minimizes is:

$$C(\vec{\theta}) = MSE(\vec{\theta}) + \lambda\left(\alpha\sum_{i=1}^n |\theta_i| + \frac{1-\alpha}{2}\sum_{i=1}^n \theta_i^2\right)$$

where $\alpha \in [0, 1]$ determines the mix between $\ell_1$ and $\ell_2$ regularization: $\alpha = 1$ gives pure lasso and $\alpha = 0$ gives pure ridge.
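
The penalty term is easy to evaluate by hand for a small weight vector. This is a minimal sketch in plain Python. The function `elastic_net_penalty` and the weights are made up for illustration, and `theta` holds only $\theta_1, \ldots, \theta_n$, since the bias $\theta_0$ is not penalized.

```python
def elastic_net_penalty(theta, lam, alpha):
    """lam * (alpha * sum|theta_i| + (1 - alpha) / 2 * sum theta_i^2), bias excluded."""
    l1 = sum(abs(t) for t in theta)
    l2 = sum(t ** 2 for t in theta)
    return lam * (alpha * l1 + (1 - alpha) / 2 * l2)

theta = [2.0, -1.0, 0.0]  # made-up weights theta_1..theta_n, without theta_0
lam = 0.5

pure_lasso = elastic_net_penalty(theta, lam, alpha=1.0)  # lam * (2 + 1 + 0)
pure_ridge = elastic_net_penalty(theta, lam, alpha=0.0)  # lam / 2 * (4 + 1 + 0)
mixed = elastic_net_penalty(theta, lam, alpha=0.5)       # halfway between the two

print(pure_lasso, pure_ridge, mixed)
```

With $\alpha = 1$ only the $\ell_1$ sum remains, and with $\alpha = 0$ the expression reduces to the ridge penalty $\lambda \frac{1}{2}\sum_{i=1}^n \theta_i^2$ from the ridge section.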


In [37]:
from sklearn.linear_model import ElasticNetCV

# note that sklearn's alpha corresponds to lambda in the theory, i.e. the overall strength of the penalty term in the cost function
# l1_ratio corresponds to alpha in the cost function above, i.e. the ratio between the l1 and l2 penalties

model_elastic_net_cv = ElasticNetCV(l1_ratio=[.1, .5, .7, .9, .95, .99, 1], eps=0.001, n_alphas=100, max_iter=10000)
model_elastic_net_cv.fit(scaled_X_train, y_train)
print(f'L1 ratio = {model_elastic_net_cv.l1_ratio_}') # best l1 ratio (alpha in the formula); 1.0 means the ridge part is dropped and only lasso remains
print(f'alpha = {model_elastic_net_cv.alpha_}') # best alpha

L1 ratio = 1.0
alpha = 0.004968802520343366


In [38]:
y_pred_elastic_net_cv = model_elastic_net_cv.predict(scaled_X_test) # makes predictions on the test set
np.sqrt(mse(y_test, y_pred_elastic_net_cv)), mae(y_test, y_pred_elastic_net_cv) # (RMSE, MAE) for the test set predictions
# note that the result is the same as for lasso regression, which is expected since the chosen l1_ratio is 1

(0.5785146895301982, 0.46291883026933045)
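
To compare the models side by side, the test metrics reported in the cells above can be collected and sorted. This is a small sketch in plain Python. The numbers are copied from the outputs in these notes and rounded to four decimals, and the dictionary keys are just labels.

```python
# (RMSE, MAE) on the test set, copied from the cell outputs above and rounded
results = {
    "LinearRegression": (0.5148, 0.3749),
    "Ridge(alpha=0.2)": (0.6109, 0.4846),
    "RidgeCV": (0.5636, 0.4343),
    "LassoCV": (0.5785, 0.4629),
    "ElasticNetCV": (0.5785, 0.4629),
}

ranking = sorted(results.items(), key=lambda item: item[1][0])  # sort by RMSE
for name, (rmse, mae_value) in ranking:
    print(f"{name:<18} RMSE={rmse:.4f}  MAE={mae_value:.4f}")
```

On this particular train/test split, plain linear regression has the lowest test RMSE and ridge with the hand-picked alpha of 0.2 the highest. Since this is based on a single split, the ranking could change with a different split.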