<div style="text-align: right"> <b>Last Updated:</b> 17JUNE2020 </div>

# Ridge and Lasso Regression: L2 and L1 Penalization

__Authors:__ Natasha A Sahr, PhD

Ridge and lasso regression are two simple techniques for reducing model complexity and preventing the over-fitting that can result from ordinary linear regression.

In [84]:
import numpy as np 
import pandas as pd
import sklearn.metrics
import sklearn.model_selection as model_selection
import sklearn.linear_model as linear_model
import plotly.graph_objects as go
import math
from sklearn.datasets import load_breast_cancer

In [89]:
cancer = load_breast_cancer()
cancer_df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
cancer_df['Target'] = cancer.target

In [90]:
X_cancer = cancer_df.drop('Target',axis=1)
Y_cancer = cancer_df['Target']

In [92]:
# train_test_split returns [X_train, X_test, Y_train, Y_test] (default split: 75% train, 25% test)
splits = model_selection.train_test_split(X_cancer, Y_cancer, random_state=1)

In [93]:
lr = linear_model.LinearRegression()
lr.fit(splits[0], splits[2])

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [94]:
# .score() returns the coefficient of determination (R^2), here on the training data
lr_train_score = lr.score(splits[0], splits[2])
print(lr_train_score)

0.7826211754776591


In [95]:
lr_test_score = lr.score(splits[1], splits[3])
print(lr_test_score)

0.722268601197212


In [96]:
Y_lr_predict = lr.predict(splits[1])
# Root mean squared error (RMSE) on the test data
lr_err = math.sqrt(sklearn.metrics.mean_squared_error(splits[3],Y_lr_predict))
print(lr_err)

0.2563888498359211


## Regression with a Ridge (L2) Penalty

In ridge regression, also known as L2 penalization, the cost function is altered by adding a penalty proportional to the sum of the squared coefficients. This is equivalent to minimizing the residual sum of squares subject to a constraint: for some $c > 0$, $\sum_{j=1}^p \beta_j^2 \le c$ for coefficients $\beta_j, j=1,\dots,p$.

The cost function for ridge regression is

$$\sum_{i=1}^N (y_i-\hat{y}_i)^2 + \lambda \sum_{j=1}^p \beta_j^2 = \sum_{i=1}^N \Big(y_i - \sum_{j=0}^p \beta_j x_{ij}\Big)^2 + \lambda \sum_{j=1}^p \beta_j^2$$

where $x_{i0} = 1$, so that $\beta_0$ is the (unpenalized) intercept, and $\lambda \ge 0$ controls the strength of the penalty.

When $\lambda = 0$, the penalty vanishes and we have an ordinary linear regression model.

The tuning parameter $\lambda$ regularizes the coefficients: the cost function is penalized when the coefficients are large. This type of penalization shrinks the coefficients toward zero, but not exactly to zero. The shrinkage reduces model complexity and lessens the effect of multicollinearity. In scikit-learn, $\lambda$ is set through the `alpha` argument of `Ridge`.
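
To make the shrinkage concrete, here is a minimal sketch for the simplest possible case: a single predictor and no intercept (an assumption made only to keep the algebra short). In that case, minimizing $\sum_i (y_i - \beta x_i)^2 + \lambda \beta^2$ gives $\hat{\beta} = \sum_i x_i y_i / (\sum_i x_i^2 + \lambda)$. The function name `ridge_coef_1d` and the toy data are hypothetical and are not part of the notebook's analysis.

```python
# Illustrative sketch only: one predictor, no intercept (simplifying assumption).
# Minimizing sum((y - b*x)^2) + lam * b^2 gives b = sum(x*y) / (sum(x^2) + lam).

def ridge_coef_1d(x, y, lam):
    """Closed-form ridge coefficient for a single predictor without intercept."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)

# Hypothetical toy data, roughly y = 2x
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

# Larger lam pulls the coefficient toward zero without ever reaching it
for lam in [0, 0.01, 1, 100]:
    print(f"lambda={lam:>6}: beta={ridge_coef_1d(x, y, lam):.4f}")
```

With $\lambda = 0$ the formula is the ordinary least-squares slope; as $\lambda$ grows the denominator grows, so the coefficient shrinks but stays nonzero whenever $\sum_i x_i y_i \ne 0$.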

In [97]:
rr001 = linear_model.Ridge(alpha=0.01)
rr001.fit(splits[0], splits[2])

Ridge(alpha=0.01, copy_X=True, fit_intercept=True, max_iter=None,
      normalize=False, random_state=None, solver='auto', tol=0.001)

In [98]:
rr001_train_score = rr001.score(splits[0], splits[2])
print(rr001_train_score)

0.7754460618647805


In [99]:
rr001_test_score = rr001.score(splits[1], splits[3])
print(rr001_test_score)

0.7333307158514669


In [100]:
Y_rr001_predict = rr001.predict(splits[1])
rr001_err = math.sqrt(sklearn.metrics.mean_squared_error(splits[3],Y_rr001_predict))
print(rr001_err)

0.25123095018531005


In [101]:
rr100 = linear_model.Ridge(alpha=100)
rr100.fit(splits[0], splits[2])

Ridge(alpha=100, copy_X=True, fit_intercept=True, max_iter=None,
      normalize=False, random_state=None, solver='auto', tol=0.001)

In [102]:
rr100_train_score = rr100.score(splits[0], splits[2])
print(rr100_train_score)

0.7029228697091638


In [103]:
rr100_test_score = rr100.score(splits[1], splits[3])
print(rr100_test_score)

0.6417168855515434


In [104]:
Y_rr100_predict = rr100.predict(splits[1])
rr100_err = math.sqrt(sklearn.metrics.mean_squared_error(splits[3],Y_rr100_predict))
print(rr100_err)

0.2912056612560334


In [171]:
numcoef = len(lr.coef_)
Ridgefig = go.Figure()

Ridgefig = Ridgefig.add_trace(go.Scatter(x=np.linspace(1, numcoef, numcoef), 
                         y=lr.coef_, 
                         name='Linear Regression',
                         mode='markers',
                         marker=dict(color='blue', opacity=0.25, size=30)))

Ridgefig = Ridgefig.add_trace(go.Scatter(x=np.linspace(1, numcoef, numcoef), 
                         y=rr001.coef_, 
                         name='Ridge Regression 0.01',
                         mode='markers',
                         marker=dict(color='green', opacity=0.5, size=15))) 

Ridgefig = Ridgefig.add_trace(go.Scatter(x=np.linspace(1, numcoef, numcoef), 
                         y=rr100.coef_, 
                         name='Ridge Regression 100',
                         mode='markers',
                         marker=dict(color='red', opacity=0.75, size=8)))

Ridgefig.show()

## Regression with a Lasso (L1) Penalty

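In lasso regression, also known as L1 penalization, the cost function is altered by adding a penalty proportional to the sum of the absolute values of the coefficients. This is equivalent to minimizing the residual sum of squares subject to a constraint: for some $c > 0$, $\sum_{j=1}^p |\beta_j| \le c$ for coefficients $\beta_j, j=1,\dots,p$.

The cost function for lasso regression is

$$\sum_{i=1}^N (y_i-\hat{y}_i)^2 + \lambda \sum_{j=1}^p |\beta_j| = \sum_{i=1}^N \Big(y_i - \sum_{j=0}^p \beta_j x_{ij}\Big)^2 + \lambda \sum_{j=1}^p |\beta_j|$$

where $x_{i0} = 1$, so that $\beta_0$ is the (unpenalized) intercept, and $\lambda \ge 0$ controls the strength of the penalty.

When $\lambda = 0$, the penalty vanishes and we have an ordinary linear regression model.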

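The tuning parameter $\lambda$ regularizes the coefficients: the cost function is penalized when the coefficients are large. Unlike the ridge penalty, this type of penalization can set some coefficients exactly to zero. Lasso regression therefore shrinks the coefficients, which reduces model complexity and lessens the effect of multicollinearity, and it also allows us to perform feature selection. In scikit-learn, the penalty strength is the `alpha` argument of `Lasso`; because scikit-learn scales the squared-error term by $1/(2N)$, `alpha` corresponds to $\lambda$ only up to that scaling.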
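
The reason lasso produces exact zeros can be seen in the same simplified setting of one predictor and no intercept (again an assumption for brevity). Minimizing $\sum_i (y_i - \beta x_i)^2 + \lambda |\beta|$ gives a soft-thresholded solution: $\hat{\beta} = S(\sum_i x_i y_i, \lambda/2) / \sum_i x_i^2$, where $S(z, t) = \mathrm{sign}(z)\max(|z| - t, 0)$. The sketch below compares it with the ridge formula; the function names and toy data are hypothetical.

```python
# Illustrative sketch only: one predictor, no intercept (simplifying assumption).

def soft_threshold(z, t):
    """Shrink z toward zero by t; return exactly 0.0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_coef_1d(x, y, lam):
    """Minimizer of sum((y - b*x)^2) + lam * |b|."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return soft_threshold(sxy, lam / 2) / sxx

def ridge_coef_1d(x, y, lam):
    """Minimizer of sum((y - b*x)^2) + lam * b^2, shown for comparison."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)

# Hypothetical toy data: a predictor only weakly related to the response
x = [1.0, 2.0, 3.0, 4.0]
y = [0.3, -0.1, 0.4, 0.2]

for lam in [0, 1, 5, 10]:
    print(f"lambda={lam:>2}: lasso={lasso_coef_1d(x, y, lam):.4f}  "
          f"ridge={ridge_coef_1d(x, y, lam):.4f}")
```

Once $\lambda/2$ exceeds $|\sum_i x_i y_i|$, the lasso coefficient is exactly zero, while the ridge coefficient only becomes small. With many predictors the same thresholding is applied coordinate by coordinate, which is how weak features get dropped.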

In [124]:
# Count the coefficients that are not exactly zero
lr_coeff_used = np.sum(lr.coef_!=0)
print(lr_coeff_used)

30


In [130]:
# max_iter should be an integer; recent scikit-learn versions reject a float such as 10e5
lasso001 = linear_model.Lasso(alpha=0.01, max_iter=1_000_000)
lasso001.fit(splits[0], splits[2])

Lasso(alpha=0.01, copy_X=True, fit_intercept=True, max_iter=1000000,
      normalize=False, positive=False, precompute=False, random_state=None,
      selection='cyclic', tol=0.0001, warm_start=False)

In [131]:
lasso001_train_score = lasso001.score(splits[0], splits[2])
print(lasso001_train_score)

0.6970256556001375


In [132]:
lasso001_test_score = lasso001.score(splits[1], splits[3])
print(lasso001_test_score)

0.6366806670191035


In [133]:
Y_lasso001_predict = lasso001.predict(splits[1])
lasso001_err = math.sqrt(sklearn.metrics.mean_squared_error(splits[3],Y_lasso001_predict))
print(lasso001_err)

0.29324519012925


In [134]:
lasso001_coeff_used = np.sum(lasso001.coef_!=0)
print(lasso001_coeff_used)

9


In [129]:
# max_iter should be an integer; recent scikit-learn versions reject a float such as 10e5
lasso000001 = linear_model.Lasso(alpha=0.00001, max_iter=1_000_000)
lasso000001.fit(splits[0], splits[2])

Lasso(alpha=1e-05, copy_X=True, fit_intercept=True, max_iter=1000000,
      normalize=False, positive=False, precompute=False, random_state=None,
      selection='cyclic', tol=0.0001, warm_start=False)

In [135]:
lasso000001_train_score = lasso000001.score(splits[0], splits[2])
print(lasso000001_train_score)

0.7818445650812895


In [136]:
lasso000001_test_score = lasso000001.score(splits[1], splits[3])
print(lasso000001_test_score)

0.7264228052674975


In [137]:
Y_lasso000001_predict = lasso000001.predict(splits[1])
lasso000001_err = math.sqrt(sklearn.metrics.mean_squared_error(splits[3],Y_lasso000001_predict))
print(lasso000001_err)

0.2544641404073568


In [138]:
lasso000001_coeff_used = np.sum(lasso000001.coef_!=0)
print(lasso000001_coeff_used)

27


In [173]:
numcoef = len(lr.coef_)
LassoFig = go.Figure()

LassoFig = LassoFig.add_trace(go.Scatter(x=np.linspace(1, numcoef, numcoef), 
                         y=lr.coef_, 
                         name='Linear Regression',
                         mode='markers',
                         marker=dict(color='blue', opacity=0.25, size=30)))

LassoFig = LassoFig.add_trace(go.Scatter(x=np.linspace(1, numcoef, numcoef), 
                         y=lasso001.coef_, 
                         name='Lasso 0.01',
                         mode='markers',
                         marker=dict(color='green', opacity=0.5, size=15))) 

LassoFig = LassoFig.add_trace(go.Scatter(x=np.linspace(1, numcoef, numcoef), 
                         y=lasso000001.coef_, 
                         name='Lasso 0.00001',
                         mode='markers',
                         marker=dict(color='red', opacity=0.75, size=8)))

LassoFig.show()

## Choosing $\lambda$

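The tuning parameter $\lambda$ (called `alpha` in scikit-learn) can be found with a grid search: fit the model for each candidate value and keep the value with the best cross-validated performance. This concept is similar to the approaches used for choosing the optimal number of clusters (in the clustering notebook) and the optimal number of neighbors (in the KNN regression notebook).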
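
The idea behind such a search can be sketched from scratch. The example below assumes k-fold cross-validation as the scoring rule and reuses the one-predictor, no-intercept ridge formula $\hat{\beta} = \sum_i x_i y_i / (\sum_i x_i^2 + \lambda)$ so that it needs only the standard library. The helper names (`ridge_coef_1d`, `cv_mse`), the simulated data, and the candidate grid are all hypothetical; this is not how `RidgeCV` or `LassoCV` are implemented internally.

```python
import random

def ridge_coef_1d(x, y, lam):
    """Closed-form ridge coefficient for one predictor without intercept."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + lam)

def cv_mse(x, y, lam, k=5):
    """Mean squared validation error for a given lam, averaged over k folds."""
    n = len(x)
    fold_errors = []
    for fold in range(k):
        val_idx = set(range(fold, n, k))  # every k-th point is held out
        x_tr = [x[i] for i in range(n) if i not in val_idx]
        y_tr = [y[i] for i in range(n) if i not in val_idx]
        beta = ridge_coef_1d(x_tr, y_tr, lam)  # fit on the training folds only
        errs = [(y[i] - beta * x[i]) ** 2 for i in val_idx]
        fold_errors.append(sum(errs) / len(errs))
    return sum(fold_errors) / k

# Simulated data (hypothetical): y = 2x + noise
random.seed(0)
x = [random.gauss(0, 1) for _ in range(50)]
y = [2.0 * xi + random.gauss(0, 1) for xi in x]

# Score every candidate on the grid and keep the one with the lowest CV error
grid = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]
scores = {lam: cv_mse(x, y, lam) for lam in grid}
best_lam = min(scores, key=scores.get)
print("best lambda:", best_lam, "CV MSE:", round(scores[best_lam], 4))
```

The key design point is that each candidate is scored on data that was not used to fit it; picking $\lambda$ by training error alone would always favor $\lambda = 0$.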

### For Ridge Regression

In [166]:
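# Search 10,000 evenly spaced alpha values in [0.0001, 100]; with the default cv=None,
# RidgeCV scores each candidate with an efficient leave-one-out cross-validation
RidgeSearch = linear_model.RidgeCV(alphas=np.linspace(0.0001, 100, 10000)).fit(splits[0],splits[2])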

In [167]:
print(RidgeSearch.alpha_)

0.0101009900990099


In [168]:
Y_optimalRidge_predict = RidgeSearch.predict(splits[1])
optimalRidge_err = math.sqrt(sklearn.metrics.mean_squared_error(splits[3],Y_optimalRidge_predict))
print(optimalRidge_err)

0.2512323122751003


In [172]:
Ridgefig = Ridgefig.add_trace(go.Scatter(x=np.linspace(1, numcoef, numcoef), 
                         y=RidgeSearch.coef_, 
                         name='Optimal Ridge',
                         mode='markers',
                         marker=dict(color='yellow', opacity=0.75, size=4)))

Ridgefig.show()

### For Lasso Regression

In [147]:
# LassoCV builds its own grid of alpha values and scores each with 5-fold cross-validation
LassoSearch = linear_model.LassoCV(cv=5, random_state=0).fit(splits[0],splits[2])

In [148]:
print(LassoSearch.alpha_)

0.21270760265820274


In [153]:
LassoSearch_coeff_used = np.sum(LassoSearch.coef_!=0)
print(LassoSearch_coeff_used)

4


In [150]:
Y_optimalLasso_predict = LassoSearch.predict(splits[1])
optimalLasso_err = math.sqrt(sklearn.metrics.mean_squared_error(splits[3],Y_optimalLasso_predict))
print(optimalLasso_err)

0.3048806501850073


In [174]:
LassoFig = LassoFig.add_trace(go.Scatter(x=np.linspace(1, numcoef, numcoef), 
                         y=LassoSearch.coef_, 
                         name='Optimal Lasso',
                         mode='markers',
                         marker=dict(color='yellow', opacity=0.75, size=4)))

LassoFig.show()

## Extension of Penalized Regression Methods

- Other penalties can be applied as well. Common choices include ridge, lasso, bridge, and SCAD (smoothly clipped absolute deviation).
- Penalties can be applied to many types of regression, not only linear regression. The most common are linear regression, logistic regression, and Cox regression.
- Penalties can account for natural grouping or correlation among features through their corresponding "grouped" versions, such as the group lasso.