# Ridge Regression Lagrange

### Introduction

<img src="./ridge-regression.png" width="40%">

### Combining the Functions

$ \underset{\theta}{\text{arg min }}  J(\theta) = \sum_{i=1}^n (y_i - f(x_i))^2 $, subject to $|| \theta||_2^2 \le c$.

Now one way to rewrite this problem is to use a Lagrange multiplier.  With this technique, we multiply our constraint function, $|| \theta||_2^2 - c$, by a coefficient $\lambda$ and add the result to the function we are minimizing.  Doing so with the equation above, we get the following:

$ \underset{\theta}{\text{arg min }}  J(\theta) = \sum_{i=1}^n (y_i - f(x_i))^2  + \lambda(|| \theta||_2^2 - c)$ .

Now using our example of two weights, $\theta_1$ and $\theta_2$, we have the following:

$ \underset{\theta}{\text{arg min }}  J(\theta) = \sum_{i=1}^n (y_i - f(x_i))^2  + \lambda(\theta_1^2 + \theta_2^2 - c)$ .

### Our approach

Now to solve the equation above, we'll treat $\lambda$ as a hyperparameter and try different values for it.  Notice that once we fix a value, say $\lambda = 10$, then because $c$ is just a number, like 3, the product $\lambda \cdot (-c)$ is also just a number, here $-30$.

When performing minimization, adding or subtracting a constant, here $\lambda c$, has no impact on the values of $\theta$ that minimize the function.  Because of this, we can simplify our function by removing the $-\lambda c$ term.

Doing so, our function now looks like the following:

$ \underset{\theta}{\text{arg min }}  J(\theta) = \sum_{i=1}^n (y_i - f(x_i))^2  + \lambda(\theta_1^2 + \theta_2^2)$ .
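To check that dropping the constant really leaves the minimizing $\theta$ unchanged, here is a minimal sketch. The toy data, the linear form $f(x) = \theta_1 x_1 + \theta_2 x_2$, and the grid of candidate values are all assumptions made for illustration; only $\lambda = 10$ and $c = 3$ come from the example above.

```python
import numpy as np

# Hypothetical toy data: two features and a linear model with no intercept.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=50)

lam, c = 10.0, 3.0  # the example values used in the text


def sse(theta):
    """Sum of squared errors for a candidate theta."""
    return np.sum((y - X @ theta) ** 2)


def cost_with_c(theta):
    """SSE + lambda * (theta_1^2 + theta_2^2 - c)."""
    return sse(theta) + lam * (theta @ theta - c)


def cost_without_c(theta):
    """SSE + lambda * (theta_1^2 + theta_2^2)."""
    return sse(theta) + lam * (theta @ theta)


# Evaluate both cost functions on the same grid of candidate thetas.
grid = np.linspace(-3, 3, 121)
candidates = np.array([[t1, t2] for t1 in grid for t2 in grid])
with_c = np.array([cost_with_c(t) for t in candidates])
without_c = np.array([cost_without_c(t) for t in candidates])

best_with_c = candidates[np.argmin(with_c)]
best_without_c = candidates[np.argmin(without_c)]
offset = without_c - with_c  # the two costs differ by lambda * c everywhere

print("argmin with -c:   ", best_with_c)
print("argmin without -c:", best_without_c)
print("constant offset:  ", offset[0])
```

The two cost surfaces differ by the same amount, $\lambda c$, at every candidate, so the lowest point sits at the same $\theta$ in both.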

### Bias Variance Tradeoff - Ridge Regression

Let's take a look at our cost function for ridge regression.

$ \underset{\theta}{\text{arg min }}  J(\theta) = \sum_{i=1}^n (y_i - f(x_i))^2  + \lambda(|| \theta||_2^2)$ .

We should be viewing it in two components.  

* $\sum_{i=1}^n (y_i - f(x_i))^2 $
>  This is SSE. By minimizing SSE, we train our model to predict target values close to the training data.
    
    
*  $\lambda(|| \theta||_2^2)$
>  This is our penalty term, which comes from our constraint. It reduces the variance of our model.  It does so by simplifying our model's hypothesis function, reducing the size of the coefficients in general, with an emphasis on those that are largest, since the penalty grows with the square of each coefficient.

Notice that if we treat $\lambda$ as a hyperparameter, we can work with the bias-variance tradeoff all over again.  The larger the value of $\lambda$, the more weight the cost function gives to keeping the coefficients small, and thus the lower the variance.  But in doing so, the cost function gives relatively less weight to the sum of squared errors, and thus the bias is higher.
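The effect of $\lambda$ can be seen numerically. The sketch below assumes a linear model without an intercept and uses the standard closed-form minimizer of the ridge cost, $\theta = (X^TX + \lambda I)^{-1}X^Ty$, which is not derived in this lesson. The data and the helper name `ridge_fit` are hypothetical.

```python
import numpy as np

# Hypothetical toy data: two features, no intercept.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
y = X @ np.array([3.0, -2.0]) + rng.normal(scale=1.0, size=40)


def ridge_fit(X, y, lam):
    """Closed-form minimizer of SSE + lam * ||theta||_2^2 (assumes no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


lambdas = [0.0, 1.0, 10.0, 100.0]
norms, train_sse = [], []
for lam in lambdas:
    theta = ridge_fit(X, y, lam)
    norms.append(float(theta @ theta))                      # squared L2 norm of the weights
    train_sse.append(float(np.sum((y - X @ theta) ** 2)))   # fit to the training data
    print(f"lambda={lam:>6}: ||theta||^2={norms[-1]:.3f}, training SSE={train_sse[-1]:.3f}")
```

As $\lambda$ grows, the squared norm of the coefficients shrinks while the training SSE rises, which is the tradeoff described above.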

<img src="./ridge-regression.png" width="40%">

So we treat $\lambda$ as a hyperparameter, which we can tune by assessing how well our model performs on a holdout set.  In doing so, we can control for overfitting to the training data by limiting the amount of variance.  And of course, we will also check that our $\lambda$ is not so large that the model underfits the data.
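One way this tuning could look in code is sketched below. Everything here is an assumption for illustration: the synthetic data, the 40/20 train/holdout split, the candidate values of $\lambda$, and the hypothetical helper `ridge_fit`, which uses the standard closed-form ridge solution for a linear model without an intercept.

```python
import numpy as np

# Hypothetical data: 10 features, only the first two actually matter.
rng = np.random.default_rng(2)
n_samples, n_features = 60, 10
X = rng.normal(size=(n_samples, n_features))
true_theta = np.zeros(n_features)
true_theta[:2] = [2.0, -1.0]
y = X @ true_theta + rng.normal(scale=1.0, size=n_samples)

# Assumed split: first 40 rows for training, last 20 held out.
X_train, y_train = X[:40], y[:40]
X_hold, y_hold = X[40:], y[40:]


def ridge_fit(X, y, lam):
    """Closed-form minimizer of SSE + lam * ||theta||_2^2 (assumes no intercept)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)


candidate_lambdas = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
holdout_sse = {}
for lam in candidate_lambdas:
    theta = ridge_fit(X_train, y_train, lam)                         # fit on training data only
    holdout_sse[lam] = float(np.sum((y_hold - X_hold @ theta) ** 2))  # score on held-out data

# Keep the lambda with the lowest error on the holdout set.
best_lambda = min(holdout_sse, key=holdout_sse.get)
print("holdout SSE by lambda:", holdout_sse)
print("selected lambda:", best_lambda)
```

Scoring each candidate on data the model did not train on is what lets us see both failure modes: a $\lambda$ that is too small can overfit, and one that is too large can underfit.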

### Summary

### Resources

[Khan Academy Contour Constrained Maximization](https://www.khanacademy.org/math/multivariable-calculus/applications-of-multivariable-derivatives/lagrange-multipliers-and-constrained-optimization/v/constrained-optimization-introduction)