# Ridge Regression

### Introduction

So far we have performed feature selection through a fairly direct strategy.  We remove features, or simulate removing them, one by one, and then drop those features whose absence does not decrease the model's score.

We have done this after fitting the model.  However, if we think about it, we may be able to achieve the same goal by changing the cost function.  We'll do so by choosing a regression model that not only fits the data closely, but also has fewer significant features.

### Model Simplicity through Coefficients

The general idea behind both ridge and lasso regression is to change the linear regression model's cost function so that we are no longer minimizing only the sum of the squared errors, but also the total size of the model's coefficients.  Let's learn about this by using the diabetes dataset as an example.

Now remember that the diabetes dataset is used to predict the progression of a disease through various features like `age`, `sex`, `bmi` and `blood pressure`.  After training a model, we can get something like the following:

$disease\_progression = 2*age + 20*is\_male + 8*blood\_pressure + -3*bmi$

The idea behind regularization is to limit the total size of these coefficients.  To calculate the total size, we'll start by using the l2 norm, defined as the following:

$\text{l2 norm} =\sqrt{\theta_1^2 + \theta_2^2 + ... + \theta_n^2}$

> The l2 norm of the coefficients is also written as $||\theta||_2$ or simply $||\theta||$.   

So for the model above, we have:

$||\theta||_2 = \sqrt{2^2 + 20^2 + 8^2 + (-3)^2} = \sqrt{477} \approx 21.84$

In [64]:
import numpy as np
coef = np.array([2, 20, 8, -3])
np.sqrt(np.sum(coef**2))

21.840329667841555
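
As a cross-check, NumPy's `np.linalg.norm` computes the same quantity directly, since it defaults to the L2 norm for a 1-D array.  A minimal sketch, reusing the illustrative coefficients from above (the names `manual_l2` and `builtin_l2` are ours):

```python
import numpy as np

# Illustrative coefficients from the example model above
coef = np.array([2, 20, 8, -3])

# Manual calculation: square each coefficient, sum, then take the square root
manual_l2 = np.sqrt(np.sum(coef**2))

# np.linalg.norm defaults to the L2 (Euclidean) norm for a 1-D array
builtin_l2 = np.linalg.norm(coef)

print(manual_l2, builtin_l2)
```

Both values should agree, so either form can be used when measuring the size of a coefficient vector.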

So the point is that if we also minimize the L2 norm, then the size of individual coefficients will decrease, and this will leave some features with less of an impact on our model.

### Onto ridge regression

This is the task of ridge regression: to minimize SSE as well as the L2 norm of the model's coefficients.

$ \underset{\theta}{\text{arg min }}  J(\theta) = \sum_{i=1}^n (y_i - f(x_i))^2 $, subject to $||\theta||_2 \le c$.

Let's start with some visualizations showing how we can achieve both goals.  

1. Minimize SSE

The first is our task of minimizing the sum of the squared errors. One way to display this task is with a contour plot.

<img src="./contour-plot-lin-regression.png" width="50%">

If we look at the axes, $w_1$ and $w_2$ represent the coefficients of two features.  As we know, as we change the weights of our coefficients, the SSE changes.  That is what the circles represent: the differing costs as the weights are changed.  So the center of the circles is where we can see the $SSE = 300$.  And the next circle shows the weights where the SSE is 400.

So this is an illustration of our SSE for different weights.  And in regression, we find the weights where the cost is minimized.
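
To make the contour plot concrete, here is a small sketch that evaluates the SSE over a grid of two weights.  The data is synthetic, and the names (`sse`, `w1_grid`, `w2_grid`) and the true weights of 2 and -1 are assumptions made for illustration; this is not the data behind the image above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: two features, with hypothetical true weights of 2 and -1
X = rng.normal(size=(50, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=50)

def sse(w1, w2):
    """Sum of squared errors for the model y_hat = w1 * x1 + w2 * x2."""
    residuals = y - (w1 * X[:, 0] + w2 * X[:, 1])
    return np.sum(residuals**2)

# Evaluate the SSE at every (w1, w2) pair on a grid (rows follow w2, columns follow w1)
w1_grid = np.linspace(-4, 4, 81)
w2_grid = np.linspace(-4, 4, 81)
sse_grid = np.array([[sse(w1, w2) for w1 in w1_grid] for w2 in w2_grid])

# The lowest point on the grid sits at the center of the contours
row, col = np.unravel_index(np.argmin(sse_grid), sse_grid.shape)
best_w1, best_w2 = w1_grid[col], w2_grid[row]

# Ordinary least squares finds that same minimum directly
ols_w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(best_w1, best_w2, ols_w)
```

If matplotlib is available, passing these arrays to `plt.contour(w1_grid, w2_grid, sse_grid)` should draw rings of constant SSE similar to the ones in the image.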

### Adding a restriction 

Now let's talk about the other restriction.  This is that our coefficients cannot exceed a certain size.  Remember that we are measuring this size as $||w||_2 = \sqrt{w_1^2 + w_2^2}$.

So this is saying that we want the distance from the origin to our weight vector to be no more than a certain number, $c$.  That's what the graph below illustrates.  The further these weights are from the origin, the greater the L2 norm.

<img src="./lagrange-axis.png" width="30%">

Consider where the L2 norm is a specific number, say 3.  If we draw the set of points with distance 3 from the origin, we just have a circle.  And the same thing holds for every other constant.

So each semicircle in the graph above depicts the set of weights where the L2 norm is a constant value.
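
We can check this numerically.  The sketch below builds a handful of weight vectors at distance 3 from the origin (the number of points, 8, is an arbitrary choice) and confirms that each has the same L2 norm:

```python
import numpy as np

c = 3  # the chosen L2 norm

# Angles sweeping once around the origin
angles = np.linspace(0, 2 * np.pi, 8, endpoint=False)

# Each column is a weight vector (w1, w2) at distance c from the origin
weights = np.stack([c * np.cos(angles), c * np.sin(angles)])

# Every one of these weight vectors has an L2 norm of 3, up to floating point rounding
norms = np.linalg.norm(weights, axis=0)
print(norms)
```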

### Satisfying both Objectives

Now with ridge regression, we put the two of these together.  Our goal is to find the minimum sum of squared errors, given that the L2 norm is no greater than a specific number.

$ \underset{\theta}{\text{arg min }}  J(\theta) = \sum_{i=1}^n (y_i - f(x_i))^2 $, subject to $||\theta||_2 \le c$.

Visually, placing the objective and the constraint together looks like the image below.  Looking at the image, let's say:

* we want to minimize the SSE with the L2 norm no greater than 3.  

Where on the graph can we do that?

<img src="./ridge-regression.png" width="60%">

So our task is to find the weights that minimize the SSE subject to $||\theta || \le 3$.  All of the weights where $||\theta || = 3$ are indicated by the corresponding semicircle.  And to minimize the $SSE$, we wind up on the circle with $SSE = 700$.  Any other allowed set of weights would lie at a point with a larger $SSE$.
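
The same search can be sketched in code with a brute-force grid: keep only the candidate weights whose L2 norm is at most 3, then pick the one with the lowest SSE.  The data here is synthetic, with hypothetical true weights of 4 and 3 chosen so that the unconstrained solution falls outside the allowed region.  This is an illustration of the idea, not how ridge regression is solved in practice.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data whose best-fit weights lie well outside the allowed region
X = rng.normal(size=(100, 2))
y = X @ np.array([4.0, 3.0]) + rng.normal(scale=0.5, size=100)

c = 3  # largest L2 norm we allow

# Unconstrained least squares solution, for comparison
ols_w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Candidate weights on a grid, keeping only those that satisfy the constraint
grid = np.linspace(-6, 6, 241)
w1, w2 = np.meshgrid(grid, grid)
candidates = np.column_stack([w1.ravel(), w2.ravel()])
allowed = candidates[np.linalg.norm(candidates, axis=1) <= c]

# SSE of every allowed candidate; pick the smallest
sse = np.sum((y[:, None] - X @ allowed.T) ** 2, axis=0)
best_w = allowed[np.argmin(sse)]

print("OLS weights:", ols_w, "norm:", np.linalg.norm(ols_w))
print("Constrained weights:", best_w, "norm:", np.linalg.norm(best_w))
```

Because the unconstrained minimum lies outside the circle, the best allowed weights should land close to the boundary, where the L2 norm is about 3.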

So we can see that with ridge regression, we no longer simply minimize the SSE, but do so subject to a constraint on the L2 norm of the coefficients.
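
The constrained problem is commonly written in an equivalent penalized form, where a penalty weight takes the place of the bound $c$.  We write that weight as $\alpha$ here, matching the `alpha` parameter of scikit-learn's `Ridge`:

$ \underset{\theta}{\text{arg min }}  J(\theta) = \sum_{i=1}^n (y_i - f(x_i))^2 + \alpha ||\theta||_2^2 $

A larger $\alpha$ corresponds to a smaller $c$, that is, a tighter limit on the coefficients.  With $\alpha = 0$ the penalty disappears and we are back to ordinary least squares.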

### Summary

### Ridge Regression Application

Let's load up our data and fit a model.

In [46]:
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the diabetes dataset into a feature DataFrame and a target Series
diabetes = load_diabetes()
X = pd.DataFrame(diabetes['data'], columns = diabetes['feature_names'])
y = pd.Series(diabetes['target'])

# Standardize the features so the coefficients are on a comparable scale
# (the scaler is fit on all rows here; fitting on the training rows only would avoid leaking test statistics)
scaler = StandardScaler()
X_transformed = scaler.fit_transform(X)
X_transformed_df = pd.DataFrame(X_transformed, columns = X.columns)
X_train, X_test, y_train, y_test = train_test_split(X_transformed_df, y, random_state = 2)

# Fit ordinary linear regression as a baseline and score it on the test set
model = LinearRegression()
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.4429608706133161

In [59]:
X_train.columns

Index(['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6'], dtype='object')

Now remember, the idea is to reduce not just SSE but the total size of the model's coefficients.

In [47]:
coef_series = pd.Series(index = X.columns, data = model.coef_).sort_values()
coef_series

s1    -42.379489
sex    -9.232746
age    -1.735754
s6      2.582949
s4      6.965704
s3      7.395255
bp     16.887496
bmi    24.442546
s2     28.142755
s5     40.279982
dtype: float64

Now to start, the way that we'll calculate the total magnitude of the coefficients is to square each coefficient and sum the results.  This gives the squared L2 norm; taking the square root of the sum would give the L2 norm itself.

In [48]:
(coef_series**2).sum()

5291.277142083067

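Here we can see that the sum of the squared coefficients is about 5,291.  Now let's fit a ridge regression model and see how this total changes.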

In [52]:
from sklearn.linear_model import Ridge
ridge = Ridge(alpha = 3)
ridge.fit(X_train, y_train)
ridge.score(X_test, y_test)

0.4454456415548589

In [56]:
ridge_coef_series = pd.Series(index = X.columns, data = ridge.coef_).sort_values()
(ridge_coef_series**2).sum()

2625.570055949969

In [57]:
ridge_coef_series

s1    -21.409152
sex    -8.990593
s3     -1.774973
age    -1.610662
s6      2.718423
s4      4.590996
s2     11.502600
bp     16.758519
bmi    24.675290
s5     32.095918
dtype: float64

In [58]:
coef_series

s1    -42.379489
sex    -9.232746
age    -1.735754
s6      2.582949
s4      6.965704
s3      7.395255
bp     16.887496
bmi    24.442546
s2     28.142755
s5     40.279982
dtype: float64
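
To see how the strength of the penalty affects the coefficients, we can refit the ridge model across several values of `alpha`.  The sketch below repeats the data preparation from above so that it runs on its own; the list of `alpha` values is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same preparation as above: scale the features, then split
X, y = load_diabetes(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, random_state=2)

# Hypothetical alpha values, chosen only for illustration
alphas = [0.1, 1, 3, 10, 100]
l2_norms = []
for alpha in alphas:
    ridge = Ridge(alpha=alpha).fit(X_train, y_train)
    l2_norm = np.linalg.norm(ridge.coef_)
    l2_norms.append(l2_norm)
    print(f"alpha={alpha:>5}: L2 norm={l2_norm:.2f}, test R^2={ridge.score(X_test, y_test):.3f}")
```

As `alpha` grows, the L2 norm of the coefficients should shrink.  The test score does not have to move in the same direction, which is why `alpha` is usually chosen by comparing scores on held-out data.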