# Ridge Regression and L2 Norm

### Introduction

In the last lesson, we saw how error due to variance occurs in our models.  We saw two symptoms of this overfitting:

1. Scores that did not generalize to our holdout sets
2. Large amounts of variance in our coefficients

One technique for reducing variance is to remove features after fitting the model.  However, if we think about it, we may be able to achieve the same goal by changing the cost function.  We'll do so with a cost function that rewards a hypothesis function not only for closely fitting the data, but also for having fewer significant features, and thus being simpler.

### Model Simplicity through Coefficients

The general idea behind ridge regression is to change the linear regression model's cost function so that we are no longer minimizing only the sum of the squared errors, but also the total size of the model's coefficients.  Let's look back at our Airbnb dataset and the various coefficients in the model.

<img src="./variance-in-coef.png" width="90%"> 

Looking at the coefficients above, the features that contribute most to variance are those with the largest coefficients (either positive or negative).  This is because the larger the coefficients, the larger the variation from model to model, which then leads to different predictions depending on which model is used.

The idea behind regularization is to: 
* Limit the total size of a model's coefficients, and thus limit the variance,
* Yet still train a model that fits to the data.

### Measuring our Coefficients

With ridge regression, we prefer models where the total magnitude of the coefficients is smaller, and we capture this size with the L2 norm.  This is how we define the L2 norm.

$\text{L2 norm} =\sqrt{\theta_1^2 + \theta_2^2 + \dots + \theta_n^2}$

> So to calculate the L2 norm of a model's coefficients, we square each coefficient, sum the squares, and then take the square root.
> 
> This is also the Euclidean distance formula in mathematics: the distance from the origin to the point $(\theta_1, \dots, \theta_n)$.

For example, let's calculate the $\text{L2}$ norm of the following model:

> $\text{price} = 3*\text{accommodates} + 1.5*\text{guests_included} -20*\text{review_is_na} $ 

So for the model above, we calculate the $\text{L2}$ norm of the coefficients as follows:

$\text{L2_norm} = \sqrt{3^2 + 1.5^2 + (-20)^2} = 20.279$

```python
import numpy as np

coef = np.array([3, 1.5, -20])
np.sqrt(np.sum(coef**2))
```

```
20.279299790673246
```

> The L2 norm is often denoted with double pipes on either side, so we can write:

$||\theta||_2 =\sqrt{\theta_1^2 + \theta_2^2 + \dots + \theta_n^2}$

Or for the model above:

$||\theta||_2 = 20.279$
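
As a quick check, the manual calculation can be compared against NumPy's built-in norm. This is just a small sketch using the same three coefficients; the variable names `manual` and `builtin` are only for illustration.

```python
import numpy as np

coef = np.array([3, 1.5, -20])

# Manual computation: square each coefficient, sum, then take the square root.
manual = np.sqrt(np.sum(coef**2))

# For a 1-D array, np.linalg.norm computes the L2 (Euclidean) norm by default.
builtin = np.linalg.norm(coef)

print(manual, builtin)
```

Both values should agree, so in practice we can reach for `np.linalg.norm` rather than writing out the formula each time.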

### On to ridge regression

So with ridge regression, we'll try to keep the $\text{L2}$ norm small, which limits the total size of the coefficients, and this will lead to a decrease in variance.

We do this by incorporating the L2 norm directly into our cost function, along with our normal task of finding a model that minimizes the sum of the squared errors.  It looks like the following:

$\underset{\theta}{\text{arg min }} J(\theta) = \sum_{i=1}^n (y_i - f(x_i))^2$, subject to $||\theta||_2 \le c$.

Where $c$ is a constant.

We'll learn more about how to train a model subject to this constraint in the lessons that follow.

### Summary

In this lesson, we learned about the L2 norm and how we can use it to get a sense of the variance in our linear regression model.  The idea is that the larger the coefficients of our model, the larger the variance, as variations in the coefficients will produce variations in the predictions of the model.  We can measure the size of the coefficients through the L2 norm, which is:

$||\theta||_2 =\sqrt{\theta_1^2 + \theta_2^2 + \dots + \theta_n^2}$

Then we saw that we can embed the L2 norm directly into our cost function by minimizing the SSE while limiting the L2 norm of our parameters.  Doing so, our cost function now looks like the following:

$\underset{\theta}{\text{arg min }} J(\theta) = \sum_{i=1}^n (y_i - f(x_i))^2$, subject to $||\theta||_2 \le c$.

We'll explore the effect of this new cost function in the lessons that follow.

### Introduction

Let's start with some visualizations showing how we can achieve both goals.  

1. Minimize SSE

The first goal is our task of minimizing the sum of the squared errors.  One way to display this task is with a contour plot.

<img src="./contour-plot-lin-regression.png" width="50%">

If we look at the axes, $w_1$ and $w_2$ represent the coefficients of two features.  As we know, as we change the weights, the SSE changes.  That is what the circles represent: the differing costs as the weights are changed.  So the center of the circles is where we can see the $SSE = 300$, and the next circle shows the weights where the SSE is 400.

So this is an illustration of our SSE for different weights.  And in regression, we find the weights where the cost is minimized.
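
To make the contour idea concrete, here is a rough sketch that evaluates the SSE on a grid of $(w_1, w_2)$ values. The data is a hypothetical toy dataset (not the Airbnb data), and the helper `sse` and the grid ranges are assumptions chosen for illustration. A contour plot simply connects grid points that share the same SSE.

```python
import numpy as np

# Hypothetical toy data: two features and a target built from known weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([4.0, 2.0]) + rng.normal(scale=0.5, size=50)

def sse(w):
    """Sum of squared errors for the weight vector w = (w1, w2)."""
    residuals = y - X @ w
    return float(residuals @ residuals)

# Evaluate the SSE at every (w1, w2) pair on a grid.
w1_values = np.linspace(0, 8, 81)
w2_values = np.linspace(-2, 6, 81)
grid = np.array([[sse(np.array([w1, w2])) for w1 in w1_values]
                 for w2 in w2_values])

# The smallest SSE on the grid sits near the least-squares solution.
row, col = np.unravel_index(np.argmin(grid), grid.shape)
w_grid = np.array([w1_values[col], w2_values[row]])
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

print("grid minimum: ", w_grid)
print("least squares:", w_ols)
```

The grid minimum and the least-squares weights should land close together, which is the center of the contours in the plot.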

### Adding a restriction 

Now let's talk about the other goal, the restriction.  This is that our coefficients cannot exceed a certain size.  Remember that we are measuring this size as $||w||_2 = \sqrt{w_1^2 + w_2^2}$.

So this is saying that we want the distance from the origin to our weight vector to be no more than a certain number, $c$.  That's what the graph below illustrates.  The further these weights are from the origin, the greater the L2 norm.

<img src="./lagrange-axis.png" width="30%">

Consider the weights where the L2 norm is a specific number, say 3.  If we draw the set of points at distance 3 from the origin, we just have a circle.  And the same holds for every other constant.

So each semicircle in the graph above depicts the set of weights where the L2 norm is a constant value.
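
We can verify this numerically with a small sketch: points placed on a circle of radius 3 around the origin all have an L2 norm of 3. The number of points and the variable names are arbitrary choices for illustration.

```python
import numpy as np

c = 3.0
angles = np.linspace(0, 2 * np.pi, 8, endpoint=False)

# Points (w1, w2) at distance c from the origin.
points = np.column_stack([c * np.cos(angles), c * np.sin(angles)])

# Each row is one weight vector; all of them share the same L2 norm.
norms = np.linalg.norm(points, axis=1)
print(norms)
```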

### Satisfying both Objectives

Now with ridge regression, we put the two of these together.  Our goal is to find the minimum sum of squares, given that the L2 norm is less than a specific number.  

$\underset{\theta}{\text{arg min }} J(\theta) = \sum_{i=1}^n (y_i - f(x_i))^2$, subject to $||\theta||_2 \le c$.

Visually, placing these two goals together looks like the image below.  As we look at it, let's say:

* we want to minimize the SSE with the L2 norm no greater than 3.  

Where on the graph can we do that?

<img src="./ridge-regression.png" width="60%">

So our task is to find the weights that minimize the SSE subject to $||\theta||_2 \le 3$.  All of the weights where $||\theta||_2 = 3$ are indicated by the corresponding semicircle.  And to minimize the $SSE$, we wind up on the circle with $SSE = 700$.  Any other value would lie at a point with a larger $SSE$.

So we can see that with ridge regression, we are no longer simply minimizing the $SSE$, but minimizing it subject to a constraint on the $L2$ norm of the coefficients.
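
The visual search above can be mimicked with a brute-force sketch. It assumes a hypothetical toy dataset (so the SSE values will not match the 700 in the image) and a constraint of $c = 3$. Because the unconstrained minimum lies outside the circle in this example, the best allowed weights sit on the circle itself, so we only scan points with $||\theta||_2 = 3$.

```python
import numpy as np

# Hypothetical toy data whose least-squares weights have an L2 norm above 3.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([4.0, 2.0]) + rng.normal(scale=0.5, size=50)

def sse(w):
    """Sum of squared errors for the weight vector w."""
    residuals = y - X @ w
    return float(residuals @ residuals)

c = 3.0

# Unconstrained least-squares weights, for comparison.
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Candidate weights spaced evenly around the circle ||w||_2 = c.
angles = np.linspace(0, 2 * np.pi, 3600, endpoint=False)
candidates = c * np.column_stack([np.cos(angles), np.sin(angles)])

# Keep the candidate with the smallest SSE.
costs = np.array([sse(w) for w in candidates])
w_constrained = candidates[np.argmin(costs)]

print("unconstrained:", w_ols, "norm", np.linalg.norm(w_ols), "SSE", sse(w_ols))
print("constrained:  ", w_constrained, "norm", np.linalg.norm(w_constrained),
      "SSE", sse(w_constrained))
```

The constrained weights have a norm of 3 and a larger SSE than the unconstrained ones, which is the same picture as the contour touching the semicircle.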

### Bias Variance Tradeoff

Notice something else about the solution we found above.  We are no longer finding the parameters that minimize our sum of squared errors.  So the introduction of the constraint restricts our model from fitting to the data.  

<img src="./ridge-regression.png" width="60%">

So this constraint introduces bias into our model.  And remember, we do this to reduce variance.  So once again, we see the bias-variance tradeoff.  As we reduce the variance in our model by constraining the magnitude of the model's parameters, we introduce bias by preventing the model from minimizing the SSE.  In the lessons that follow, we'll see how to tune the constraint $c$, which is a hyperparameter.
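
To see the cost side of this tradeoff in numbers, here is a minimal sketch that fits the constrained model for several values of $c$. It uses projected gradient descent, which is one possible way to handle the constraint and not necessarily the method used in later lessons. The toy data, the function `fit_constrained`, and the list of $c$ values are all assumptions for illustration.

```python
import numpy as np

# Hypothetical toy data (not the Airbnb dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([4.0, 2.0]) + rng.normal(scale=0.5, size=50)

def sse(w):
    """Sum of squared errors for the weight vector w."""
    residuals = y - X @ w
    return float(residuals @ residuals)

def fit_constrained(c, steps=2000):
    """Projected gradient descent: minimize SSE subject to ||w||_2 <= c."""
    # Step size 1/L, where L = 2 * (largest eigenvalue of X^T X) bounds
    # how quickly the SSE gradient can change.
    step = 1.0 / (2.0 * np.linalg.eigvalsh(X.T @ X).max())
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        gradient = -2.0 * X.T @ (y - X @ w)
        w = w - step * gradient
        norm = np.linalg.norm(w)
        if norm > c:
            # Rescale w back onto the circle of radius c.
            w = w * (c / norm)
    return w

results = []
for c in [1.0, 2.0, 3.0, 4.0, 10.0]:
    w = fit_constrained(c)
    results.append((c, float(np.linalg.norm(w)), sse(w)))
    print(f"c = {c:5.1f}  norm = {results[-1][1]:.3f}  SSE = {results[-1][2]:.2f}")
```

As $c$ shrinks, the fitted weights are pulled toward the origin and the SSE on the training data rises. Once $c$ is large enough, the constraint no longer binds and we recover the ordinary least-squares fit.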

### Summary

In this lesson, we saw how ridge regression works to reduce variance in a linear regression model.  It does so by minimizing the SSE subject to the parameters being constrained from exceeding a certain magnitude.  In ridge regression, we measure this magnitude through the $L2$ norm.  And the constrained minimum lies where the lowest reachable $SSE$ contour meets the boundary of the constraint, $||\theta||_2 = c$.