# Ridge Regression

### Introduction

In the last lesson, we saw that larger coefficients are generally associated with higher variance.  For example, we saw that our features with larger coefficients contributed more variance to our models.

<img src="./variance-in-coef.png" width="90%"> 

And we saw that one measure of the total magnitude of our coefficients is the L2 norm.

$||\theta||_2 = \sqrt{\theta_1^2 + \theta_2^2 + \dots + \theta_n^2}$
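As a quick check of the formula above, here is a minimal sketch that computes the L2 norm in plain Python. The function name `l2_norm` and the coefficient values are made up for illustration; they do not come from the previous lesson.

```python
import math

def l2_norm(theta):
    """Return the L2 norm: the square root of the sum of squared coefficients."""
    return math.sqrt(sum(t ** 2 for t in theta))

# Hypothetical coefficient vectors, chosen only for illustration.
small = [1.0, 2.0, 2.0]
large = [3.0, 6.0, 6.0]  # same direction as `small`, three times the size

print(l2_norm(small))  # 3.0
print(l2_norm(large))  # 9.0
```

Scaling every coefficient by three scales the norm by three, which is why the L2 norm works as a single-number summary of coefficient size.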

As we'll see in this lesson, the idea behind ridge regression is to use a cost function that fits to the data while minimizing the L2 norm of the coefficients.

### Onto ridge regression

So the task of ridge regression is to minimize SSE as well as the L2 norm of the model's coefficients.  Hopefully this will allow us to fit to the data, while reducing the variance that comes with larger coefficients.  

To do so, we use the following cost function.

$\underset{\theta}{\text{arg min }} J(\theta) = \sum_{i=1}^n (y_i - f(x_i))^2$, subject to $||\theta||_2^2 \le c$.

Notice that this simply expresses our two goals.  The expression on the left is the SSE, which allows us to fit to the data.  The expression on the right restricts the squared L2 norm of our coefficients to be no more than a specific number, $c$, which limits the size, and thus the variance, of our features' parameters.
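The two parts of this cost function can be sketched as two small functions: one that scores the fit and one that checks the restriction. This is only an illustration. It assumes a linear hypothesis with no intercept, and the data, the candidate coefficients, and the value of `c` are all hypothetical.

```python
def predict(theta, x):
    """Linear hypothesis f(x) = theta_1 * x_1 + ... + theta_n * x_n (no intercept)."""
    return sum(t * x_j for t, x_j in zip(theta, x))

def sse(theta, X, y):
    """Sum of squared errors between the targets and the predictions."""
    return sum((y_i - predict(theta, x_i)) ** 2 for x_i, y_i in zip(X, y))

def satisfies_constraint(theta, c):
    """True when the squared L2 norm of theta is no more than c."""
    return sum(t ** 2 for t in theta) <= c

# Hypothetical toy data: three observations, two features.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [2.0, 3.0, 5.0]

theta_a = [2.0, 3.0]  # fits the data exactly, squared norm 13
theta_b = [1.0, 2.0]  # fits worse, squared norm 5
c = 9.0               # assumed limit on the squared norm

print(sse(theta_a, X, y), satisfies_constraint(theta_a, c))  # 0.0 False
print(sse(theta_b, X, y), satisfies_constraint(theta_b, c))  # 6.0 True
```

Here the best-fitting coefficients break the restriction, while the smaller coefficients satisfy it at the cost of a higher SSE. Ridge regression looks for the lowest SSE among the coefficients that pass the second check.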

Let's see some visualizations showing these two components, and then we'll see how we can pursue these goals simultaneously.

1. Minimize SSE

The first component is our task of minimizing the sum of the squared errors.  This, as we know, is how we measure how closely our hypothesis function predicts the target values.  And in training we adjust the parameters to reduce the SSE.

One way to display this task is with a contour plot.

<img src="./contour-plot-lin-regression.png" width="50%">

If we look at the axes in the above plot, $w_1$ and $w_2$ represent the coefficients of two features.  As we know, as we change the values of our coefficients, the SSE changes.  That is what the circles represent: the differing costs as the weights are changed.  So the center of the circles is where we can see $SSE = 300$.  And the next circle shows the weights where $SSE = 400$.

So this is an illustration of our SSE for different weights.  And in regression, we find the weights where the cost is minimized.
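The numbers behind a contour plot like this can be sketched by evaluating the SSE over a grid of weights. The data below is hypothetical and is not the data behind the plot, so the SSE values differ from the figure. With this data the level curves are ellipses rather than perfect circles.

```python
def sse(w1, w2, X, y):
    """Sum of squared errors for the two-feature model w1 * x1 + w2 * x2."""
    return sum((y_i - (w1 * x1 + w2 * x2)) ** 2 for (x1, x2), y_i in zip(X, y))

# Hypothetical toy data: three observations, two features.
X = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
y = [2.0, 3.0, 5.0]

w1_values = [0.0, 1.0, 2.0, 3.0, 4.0]
w2_values = [1.0, 2.0, 3.0, 4.0, 5.0]

# One SSE value per (w1, w2) pair: this is the surface a contour plot draws.
grid = {(w1, w2): sse(w1, w2, X, y) for w1 in w1_values for w2 in w2_values}

# Print the grid with w2 increasing upward, like the vertical axis of the plot.
for w2 in reversed(w2_values):
    print(" ".join(f"{grid[(w1, w2)]:6.1f}" for w1 in w1_values))

best = min(grid, key=grid.get)
print(best, grid[best])  # (2.0, 3.0) 0.0
```

The SSE is lowest at one pair of weights and grows in every direction away from it, which is the pattern the nested contours depict.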

### Adding a restriction 

Now let's talk about the other component of our cost function: our coefficients cannot exceed a certain size.  Remember that we are measuring this size as $||w||_2^2 = \left(\sqrt{w_1^2 + w_2^2}\right)^2 = w_1^2 + w_2^2$.

So this is saying that we want the distance from the origin to our weight vector to be no more than a certain number.  Because the constraint is on the squared norm, that number is $\sqrt{c}$.  That's what the graph below illustrates.  The further the weights are from the origin, the greater the L2 norm.

<img src="./lagrange-axis.png" width="30%">

Consider the weights where the L2 norm is a specific number, say 3.  If we draw the set of points with distance 3 from the origin, we just have a circle.  And the same holds for every other constant.

So each semicircle in the graph above depicts the set of weights where the L2 norm is a constant value.
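To see numerically that a constant L2 norm traces out a circle, here is a small sketch that generates a few points at distance 3 from the origin and measures the norm of each. The angles are arbitrary choices for illustration.

```python
import math

radius = 3.0
angles = [0.0, math.pi / 6, math.pi / 4, math.pi / 2]  # arbitrary sample angles

# Points on the circle of the given radius centered at the origin.
points = [(radius * math.cos(a), radius * math.sin(a)) for a in angles]

# math.hypot(w1, w2) is sqrt(w1^2 + w2^2), the L2 norm of a two-weight vector.
norms = [math.hypot(w1, w2) for w1, w2 in points]

for (w1, w2), norm in zip(points, norms):
    print(f"w1={w1:.3f}, w2={w2:.3f}, L2 norm={norm:.3f}")
```

Every point has the same norm even though the individual weights differ. Since the restriction is written on the squared norm, an L2 norm of 3 corresponds to $c = 9$.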

### Satisfying both Objectives

Now with ridge regression, we put the two of these together.  Our goal is to find the minimum sum of squares, given that the L2 norm squared is less than a specific number.  

$\underset{\theta}{\text{arg min }} J(\theta) = \sum_{i=1}^n (y_i - f(x_i))^2$, subject to $||\theta||_2^2 \le c$.

Visually, placing these two components together looks like the image below.

<img src="./ridge-regression.png" width="60%">

> All of the L2 norm restrictions are indicated by the semicircles near the origin.  And the respective SSE scores are the circles in the center of the graph.

Now looking at the image above, our task is the following: 

* we want to minimize the SSE with the L2 norm no greater than 3.  

> Where on the graph can we do that?

To minimize the $SSE$ under this restriction, we wind up on the circle with $SSE = 700$.  Any other point that satisfies the restriction would have a larger $SSE$.  So we can see that with ridge regression, we wind up on the combination of coefficients where the boundary of the L2 norm restriction just touches the lowest SSE contour it can reach.  By doing so, we are able to balance both fitting to the data and limiting the variance by limiting the L2 norm of the coefficients.
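The search described above can be sketched with a brute-force scan: try many weight pairs inside the allowed circle and keep the one with the lowest SSE. This is not how ridge regression is solved in practice; it is only meant to mirror the picture. The data is hypothetical, and the limit of 3 on the L2 norm (so $c = 9$) follows the example in the text.

```python
import math

# Hypothetical toy data: three observations, two features.
X = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
y = [2.0, 3.0, 5.0]

def sse(w1, w2):
    """Sum of squared errors for the two-feature model w1 * x1 + w2 * x2."""
    return sum((y_i - (w1 * x1 + w2 * x2)) ** 2 for (x1, x2), y_i in zip(X, y))

c = 9.0                     # limit on the squared L2 norm
max_radius = math.sqrt(c)   # so the L2 norm itself can be at most 3

best_point, best_sse = None, float("inf")
for r_step in range(0, 31):              # radii from 0 up to max_radius
    r = max_radius * r_step / 30
    for a_step in range(360):            # one candidate per degree
        a = 2 * math.pi * a_step / 360
        w1, w2 = r * math.cos(a), r * math.sin(a)
        cost = sse(w1, w2)
        if cost < best_sse:
            best_point, best_sse = (w1, w2), cost

print("unconstrained minimum: w = (2.0, 3.0), SSE =", sse(2.0, 3.0))
print("constrained best point:", best_point)
print("its L2 norm:", math.hypot(*best_point))
print("its SSE:", best_sse)
```

For this data the unconstrained minimum sits outside the allowed circle, so the best allowed point lands on the boundary, with an L2 norm of 3 and an SSE above the unconstrained minimum.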

### Bias Variance Tradeoff

Notice something else about the solution we found above.  We are no longer finding the parameters that minimize our sum of squared errors.  So the introduction of the constraint restricts our model from fitting to the data.  

<img src="./ridge-regression.png" width="60%">

So the introduction of this constraint introduces bias into our model.  And remember we do this to reduce variance.  So once again, we see the bias-variance tradeoff.  As we reduce the variance in our model by constraining the magnitude of the model's parameters, we introduce bias by preventing the model from minimizing the SSE.  In the lessons that follow, we'll see how to tune this hyperparameter.
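One side of this tradeoff can be sketched by tightening the restriction and watching the training SSE rise. The sketch below uses a brute-force search over hypothetical data and arbitrary values of `c`. It shows only the cost in fit, not the reduction in variance, which would require refitting on many resampled datasets.

```python
import math

# Hypothetical toy data: three observations, two features.
X = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
y = [2.0, 3.0, 5.0]

def sse(w1, w2):
    """Sum of squared errors for the two-feature model w1 * x1 + w2 * x2."""
    return sum((y_i - (w1 * x1 + w2 * x2)) ** 2 for (x1, x2), y_i in zip(X, y))

def constrained_fit(c, radius_steps=30, angle_steps=360):
    """Brute-force search for the lowest SSE among weights with w1^2 + w2^2 <= c."""
    max_radius = math.sqrt(c)
    best = (0.0, 0.0)
    best_cost = sse(*best)
    for r_step in range(1, radius_steps + 1):
        r = max_radius * r_step / radius_steps
        for a_step in range(angle_steps):
            a = 2 * math.pi * a_step / angle_steps
            w1, w2 = r * math.cos(a), r * math.sin(a)
            cost = sse(w1, w2)
            if cost < best_cost:
                best, best_cost = (w1, w2), cost
    return best, best_cost

results = {}
for c in [16.0, 9.0, 4.0, 1.0]:  # arbitrary limits, from loose to tight
    (w1, w2), cost = constrained_fit(c)
    results[c] = cost
    print(f"c={c:5.1f}  w=({w1:.2f}, {w2:.2f})  norm={math.hypot(w1, w2):.2f}  SSE={cost:.2f}")
```

With the loosest limit the search can get very close to the unconstrained minimum, and each tighter limit forces smaller weights and a higher SSE on the training data.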

### Summary

In this lesson, we saw how ridge regression minimizes the sum of squared errors subject to constraining the magnitude of the coefficients to reduce variance in the model.  We saw that the parameters that minimize the SSE subject to this constraint occur at the intersection shown in the graph below.

<img src="./ridge-regression.png" width="40%">

We also saw that in reducing variance in our model, the constraint introduces bias.  As we'll see later, we'll tune a hyperparameter to balance bias and variance in our model.