# Ridge regression


![](images/10_1.PNG)

***

We try to fit a linear regression model to a given dataset of weight vs. size measurements.


![](images/10_b.PNG)

![](images/10_c.png)

When we have a lot of measurements, we can be fairly confident that the least squares line accurately reflects the relationship between size and weight.

But what if the dataset has very few points?

![](images/10_d.PNG)


The sum of the squared residuals for just the __Two Red Points__, the __Training Data__, is small (in this case it is 0), but the sum of the squared residuals for the __Green Points__, the __Testing Data__, is large. This means that the fitted line has high variance.

> In ML lingo, we can say that the fitted line is overfitted to the training data.
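To make the overfitting concrete, here is a minimal sketch with made-up numbers (the data points and the helper names `fit_line_through_two_points` and `ssr` are our own, not taken from the figures). A line through two training points always has a training SSR of 0, but it can be far off on test points.

```python
def fit_line_through_two_points(p, q):
    """Return (intercept, slope) of the straight line through points p and q."""
    (x1, y1), (x2, y2) = p, q
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return intercept, slope


def ssr(points, intercept, slope):
    """Sum of squared residuals of the line on the given (x, y) points."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in points)


# Made-up (weight, size) pairs, for illustration only.
train = [(1.0, 1.0), (2.0, 3.0)]                          # the two "red" points
test = [(1.5, 1.6), (2.5, 2.4), (3.0, 3.1), (3.5, 3.2)]   # the "green" points

intercept, slope = fit_line_through_two_points(*train)
train_ssr = ssr(train, intercept, slope)   # zero: the line passes through both points
test_ssr = ssr(test, intercept, slope)     # much larger: the line is too steep

print("training SSR:", train_ssr)
print("testing SSR:", test_ssr)
```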

![](images/10_e.PNG)

![](images/10_f.PNG)

![](images/10_g.PNG)

In other words, by starting with a slightly worse fit to the training data, __Ridge Regression__ can provide better long-term predictions.

***
## Ridge regression in detail

![](images/10_h.PNG)


![](images/10_i.PNG)

$\lambda \times \text{slope}^{2}$ adds a penalty to the traditional __least squares__ method, and $\lambda$ determines how severe that penalty is.
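As a small sketch of this quantity, the function below computes the sum of squared residuals plus $\lambda \times \text{slope}^{2}$ for a candidate line. The name `ridge_cost` and the two data points are made up for illustration.

```python
def ridge_cost(points, intercept, slope, lam):
    """Sum of squared residuals plus the ridge penalty lam * slope**2."""
    ssr = sum((y - (intercept + slope * x)) ** 2 for x, y in points)
    return ssr + lam * slope ** 2


# Two made-up training points (illustration only).
points = [(1.0, 1.0), (2.0, 3.0)]

# The least squares line goes through both points: SSR = 0, but a steep slope.
steep = ridge_cost(points, intercept=-1.0, slope=2.0, lam=1.0)

# A flatter line misses both points a little, but pays a smaller penalty.
flat = ridge_cost(points, intercept=0.5, slope=1.0, lam=1.0)

print("steep line cost:", steep)
print("flat line cost:", flat)
```

With `lam=1.0` the flatter line has the lower total, even though its SSR on the training points is not zero. With `lam=0.0` the penalty vanishes and the steep least squares line wins again.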

![](images/10_j.PNG)

![](images/10_k.PNG)

Without the small amount of __Bias__ that the penalty creates, the __Least squares fit__ has a large amount of __Variance__. In contrast, the Ridge regression line, which has a small amount of __Bias__ due to the penalty, has less __Variance__.
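The trade-off can be checked numerically. The sketch below assumes one predictor and an unpenalised intercept, in which case the ridge slope has the closed form $S_{xy} / (S_{xx} + \lambda)$. The data and the names `ridge_fit` and `ssr` are made up for illustration.

```python
def ridge_fit(points, lam):
    """Fit size = intercept + slope * weight, penalising slope**2 only."""
    n = len(points)
    x_mean = sum(x for x, _ in points) / n
    y_mean = sum(y for _, y in points) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in points)
    sxx = sum((x - x_mean) ** 2 for x, _ in points)
    slope = sxy / (sxx + lam)            # lam = 0 gives the least squares slope
    intercept = y_mean - slope * x_mean  # the intercept is not penalised
    return intercept, slope


def ssr(points, intercept, slope):
    """Sum of squared residuals of the line on the given (x, y) points."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in points)


# Made-up (weight, size) pairs, for illustration only.
train = [(1.0, 1.0), (2.0, 3.0)]
test = [(1.5, 1.6), (2.5, 2.4), (3.0, 3.1), (3.5, 3.2)]

ls_line = ridge_fit(train, lam=0.0)      # least squares: no bias on the training data
ridge_line = ridge_fit(train, lam=0.5)   # ridge: a little bias on the training data

ls_train, ls_test = ssr(train, *ls_line), ssr(test, *ls_line)
ridge_train, ridge_test = ssr(train, *ridge_line), ssr(test, *ridge_line)

print("least squares  train/test SSR:", ls_train, ls_test)
print("ridge          train/test SSR:", ridge_train, ridge_test)
```

On this toy data the ridge line fits the training points slightly worse and the testing points much better, which is the bias for variance trade described above.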

## Effect of the ridge regression penalty on the fitting line


#### Case 1: If the fitting line's slope is 1

![](images/10_l.PNG)

#### Case 2: If the slope is steep

![](images/10_m.PNG)

#### Case 3: If the slope is small

![](images/10_n.PNG)

Now let's go back to the __Least Squares__ and __Ridge regression__ lines fit to the two data points.

![](images/10_o.PNG)


***

## Lambda $\lambda$

- $\lambda$ can take any value from 0 to positive infinity, i.e. $\lambda \in [0, \infty)$.
- When $\lambda = 0$, the penalty term is zero, so Ridge regression minimises only the SSR (sum of squared residuals) and the ridge regression line is the same as the least squares line.
- As we increase $\lambda$, the slope keeps decreasing; the larger we make $\lambda$, the closer the slope gets, asymptotically, to __0__. So the larger $\lambda$ gets, the less sensitive our prediction for size becomes to weight (i.e., the x-value).
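To see why the slope shrinks toward 0, here is a short derivation for the one-variable case. The symbols $a$ (intercept), $b$ (slope), $S_{xy}$ and $S_{xx}$ are our own notation, not taken from the figures, and we assume the intercept is not penalised.

$$ \text{cost}(a, b) = \sum_i (y_i - a - b x_i)^2 + \lambda b^2 $$

Setting the derivative with respect to $a$ to zero gives $a = \bar{y} - b\bar{x}$. Substituting this back and setting the derivative with respect to $b$ to zero gives

$$ b(\lambda) = \frac{S_{xy}}{S_{xx} + \lambda}, \qquad S_{xy} = \sum_i (x_i - \bar{x})(y_i - \bar{y}), \qquad S_{xx} = \sum_i (x_i - \bar{x})^2 $$

At $\lambda = 0$ this is the least squares slope, and as $\lambda \to \infty$ the slope goes to 0 without ever reaching it (as long as $S_{xy} \neq 0$).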

#### So how do we decide to take which value of $\lambda$?

We just try a bunch of values for $\lambda$ and use __cross validation__, typically __10-fold cross validation__, to determine which one results in the lowest __variance__, that is, the smallest error on data held out from fitting.

> $\lambda$ is determined using cross-validation.
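A minimal sketch of that search is given here, assuming one predictor, an unpenalised intercept, and mean squared error on the held-out folds as the score. The data is synthetic and the names `ridge_fit`, `cv_error` and `candidates` are our own.

```python
import random


def ridge_fit(points, lam):
    """Closed-form ridge fit for one predictor with an unpenalised intercept."""
    n = len(points)
    x_mean = sum(x for x, _ in points) / n
    y_mean = sum(y for _, y in points) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in points)
    sxx = sum((x - x_mean) ** 2 for x, _ in points)
    slope = sxy / (sxx + lam)
    return y_mean - slope * x_mean, slope


def cv_error(points, lam, k=10):
    """Mean squared error on held-out points, averaged over k folds."""
    total, count = 0.0, 0
    for fold in range(k):
        # Point i belongs to fold i % k; fit on all the other folds.
        train = [p for i, p in enumerate(points) if i % k != fold]
        held_out = [p for i, p in enumerate(points) if i % k == fold]
        intercept, slope = ridge_fit(train, lam)
        total += sum((y - (intercept + slope * x)) ** 2 for x, y in held_out)
        count += len(held_out)
    return total / count


# Synthetic data: size = 0.5 + 0.8 * weight + noise (illustration only).
rng = random.Random(0)
weights = [0.5 * i for i in range(1, 21)]
data = [(w, 0.5 + 0.8 * w + rng.gauss(0.0, 0.5)) for w in weights]

candidates = [0.0, 0.1, 1.0, 10.0, 100.0]
errors = {lam: cv_error(data, lam) for lam in candidates}
best_lam = min(errors, key=errors.get)

for lam in candidates:
    print(f"lambda={lam:>6}: held-out MSE={errors[lam]:.4f}")
print("chosen lambda:", best_lam)
```

Which candidate wins depends entirely on the data; the point of the sketch is only the procedure of fitting on nine folds and scoring on the tenth.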

***

## Ridge regression with discrete variables

Ridge regression also works when we use a discrete variable like __Normal diet vs high fat diet__ to predict __size__.

![](images/10_p.png)

Let's call the distance between the two diet means `Diet Distance`.

From the equation `Size = 1.5 + (0.7 x High_Fat_Diet)`:

`High_Fat_Diet` = 0 for mice on the Normal Diet  
`High_Fat_Diet` = 1 for mice on the High Fat Diet  
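Written as code, the 0/1 coding looks like this. It is a tiny sketch, and the function name `predicted_size` is our own.

```python
def predicted_size(high_fat_diet):
    """Size = 1.5 + (0.7 x High_Fat_Diet), with High_Fat_Diet equal to 0 or 1."""
    return 1.5 + 0.7 * high_fat_diet


normal = predicted_size(0)     # only the intercept remains
high_fat = predicted_size(1)   # intercept plus the diet coefficient
diet_distance = high_fat - normal

print("normal diet:", normal)
print("high fat diet:", high_fat)
print("distance between diets:", diet_distance)
```

So the intercept is the predicted size on the Normal Diet, and the coefficient on `High_Fat_Diet` is the distance between the two diets.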

![](images/10_q.PNG)

![](images/10_r.PNG)

![](images/10_s.PNG)

![](images/10_t.PNG)

![](images/10_u.PNG)

![](images/10_v.PNG)

When $\lambda = 0$, the whole term $\lambda \times \text{Diet Distance}^2$ becomes zero and we get the same equation as least squares.

But when $\lambda$ gets large, the only way to minimize the whole equation is to shrink the Diet Distance down.
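The sketch below shows this shrinkage on made-up mouse sizes, chosen so that the least squares fit reproduces `Size = 1.5 + (0.7 x High_Fat_Diet)`. It assumes the intercept is not penalised; `ridge_fit` is our own helper name.

```python
def ridge_fit(points, lam):
    """Closed-form ridge fit for one predictor with an unpenalised intercept."""
    n = len(points)
    x_mean = sum(x for x, _ in points) / n
    y_mean = sum(y for _, y in points) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in points)
    sxx = sum((x - x_mean) ** 2 for x, _ in points)
    slope = sxy / (sxx + lam)
    return y_mean - slope * x_mean, slope


# Made-up sizes (illustration only): x = 0 for Normal Diet, x = 1 for High Fat Diet.
normal_sizes = [1.2, 1.5, 1.8]
high_fat_sizes = [2.0, 2.2, 2.4]
mice = [(0.0, s) for s in normal_sizes] + [(1.0, s) for s in high_fat_sizes]

lambdas = [0.0, 1.0, 10.0, 100.0]
# With 0/1 coding, the fitted slope is the estimated distance between diets.
distances = [ridge_fit(mice, lam)[1] for lam in lambdas]

for lam, d in zip(lambdas, distances):
    print(f"lambda={lam:>6}: diet distance={d:.4f}")
```

At `lambda=0` the distance equals the difference between the two group means, and it gets steadily smaller as $\lambda$ grows.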

![](images/10_w.PNG)

![](images/10_x.PNG)

***

# Ridge regression on Logistic regression

![](images/10_y.PNG)

Equation for Logistic Regression:

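$$ \log(\text{odds of Obese}) = c + \text{slope} \times \text{weight} $$

where $c$ is the y-intercept. The right-hand side is linear on the log-odds scale.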

Ridge regression would shrink the estimate for the slope, making our prediction about whether or not a mouse is obese less sensitive to weight.
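Here is a rough sketch of that effect. It assumes the penalised objective is the negative log-likelihood plus $\lambda \times \text{slope}^2$ (the intercept is left unpenalised) and minimises it with plain gradient descent. The data, the function name `fit_logistic_ridge`, and the step size are all made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_ridge(data, lam, steps=10000, lr=0.02):
    """Minimise negative log-likelihood + lam * slope**2 by gradient descent."""
    c, slope = 0.0, 0.0
    for _ in range(steps):
        grad_c, grad_slope = 0.0, 0.0
        for weight, obese in data:
            err = sigmoid(c + slope * weight) - obese  # predicted probability minus label
            grad_c += err
            grad_slope += err * weight
        grad_slope += 2.0 * lam * slope  # gradient of the penalty (slope only)
        c -= lr * grad_c
        slope -= lr * grad_slope
    return c, slope


# Made-up (weight, is_obese) pairs; the classes overlap, so the fit is finite.
data = [(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 0), (5.0, 1), (6.0, 1)]

c_plain, slope_plain = fit_logistic_ridge(data, lam=0.0)
c_ridge, slope_ridge = fit_logistic_ridge(data, lam=1.0)

print("slope without penalty:", round(slope_plain, 4))
print("slope with penalty:   ", round(slope_ridge, 4))
```

The penalised slope is smaller, so the predicted probability of obesity changes more gradually with weight.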

![](images/10_z.PNG)

***

So far we've seen simple examples of how __Ridge regression__ helps __reduce Variance__ by __shrinking parameters__ and __making our predictions less sensitive to them__.

But we can apply ridge regression to complicated models as well.

![](images/10_aa.PNG)

![](images/10_ab.PNG)

Now the Ridge regression Penalty contains the parameters for the slope and the difference between diets.

In general, the ridge regression penalty contains all of the parameters except for the __y-intercept__.
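As a sketch, the penalty for a model with several parameters can be computed like this. The parameter names and values are hypothetical, as is the function name `ridge_penalty`.

```python
def ridge_penalty(params, lam, intercept_name="intercept"):
    """lam times the sum of squared parameters, skipping the y-intercept."""
    return lam * sum(
        value ** 2 for name, value in params.items() if name != intercept_name
    )


# Hypothetical fitted parameters for size ~ weight + diet (illustration only).
params = {"intercept": 1.5, "slope": 0.9, "diet_distance": 0.7}

penalty = ridge_penalty(params, lam=2.0)
print("ridge penalty:", penalty)

# Changing the intercept leaves the penalty unchanged.
shifted = dict(params, intercept=100.0)
print("penalty after moving the intercept:", ridge_penalty(shifted, lam=2.0))
```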

![](images/10_ac.PNG)

To estimate 2 parameters (i.e., to fit a simple linear regression, a line in 2D) we need at least two data points. Similarly, to estimate 3 parameters we need at least three data points, because we are fitting a plane in 3D.
So, to estimate n parameters with least squares we need at least n data points.

But in practice it is often hard to collect that many data points. For example, an equation with 10,001 parameters would need gene expression measurements from at least 10,001 mice, which is crazy expensive and time consuming; in practice, a huge dataset might have measurements from 500 mice. __So what do we do if we have an equation with 10,001 parameters and only 500 data points?__

> We use Ridge regression.
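A tiny version of the same situation is sketched here: 3 parameters but only 2 data points. It uses the normal equations, where least squares needs to solve $X^T X b = X^T y$ and ridge solves $(X^T X + \lambda I) b = X^T y$. To keep it short, this sketch has no intercept term, which is a simplification of the setup above; the numbers and helper names are made up.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def solve(a, rhs):
    """Solve a @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [r] for row, r in zip(a, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular: no unique solution")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def ridge_solve(X, y, lam):
    """Solve (X^T X + lam * I) b = X^T y; lam = 0 is plain least squares."""
    Xt = transpose(X)
    gram = matmul(Xt, X)
    for i in range(len(gram)):
        gram[i][i] += lam
    xty = [sum(xi * yi for xi, yi in zip(col, y)) for col in Xt]
    return solve(gram, xty)


# Made-up data: 2 mice (rows), 3 measurements each (columns).
X = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
y = [1.0, 2.0]

try:
    ridge_solve(X, y, lam=0.0)
except ValueError as err:
    print("least squares:", err)

coefficients = ridge_solve(X, y, lam=1.0)
print("ridge coefficients:", coefficients)
```

With more parameters than data points, $X^T X$ is singular, so least squares has no unique answer. Adding $\lambda$ to the diagonal makes the system solvable.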

![](images/10_ad.PNG)

![](images/10_ae.PNG)

***

# Lasso Regression

Equation that Ridge regression tends to minimise:

> Sum of squared residuals + $\lambda \times \text{slope}^2$

### Equation that Lasso regression tends to minimise:

> Sum of squared residuals + $\lambda \times |\text{slope}|$

![](images/10_af.PNG)

![](images/10_ag.PNG)

![](images/10_ah.PNG)

![](images/10_ai.PNG)

![](images/10_agk.PNG)

![](images/10_fdg.PNG)

> Since __Lasso Regression__ can exclude useless variables from equations, it is a little better than __Ridge regression__ at reducing the variance in models that contain a lot of useless variables. So the final equation of Lasso regression is simpler and easier to interpret.

> But ridge regression tends to work better when most variables are useful.
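The difference can be seen in a one-variable sketch. With an unpenalised intercept, minimising SSR + $\lambda \times |\text{slope}|$ gives a slope that is soft-thresholded to exactly 0 once $\lambda$ is large enough, while the ridge slope only gets smaller. The data and function names below are made up for illustration.

```python
import math


def centered_sums(points):
    """Return (S_xy, S_xx) computed around the means of x and y."""
    n = len(points)
    x_mean = sum(x for x, _ in points) / n
    y_mean = sum(y for _, y in points) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in points)
    sxx = sum((x - x_mean) ** 2 for x, _ in points)
    return sxy, sxx


def ridge_slope(points, lam):
    """Slope minimising SSR + lam * slope**2."""
    sxy, sxx = centered_sums(points)
    return sxy / (sxx + lam)


def lasso_slope(points, lam):
    """Slope minimising SSR + lam * |slope| (soft-thresholding)."""
    sxy, sxx = centered_sums(points)
    shrunk = max(abs(sxy) - lam / 2.0, 0.0)  # clipped at zero
    return math.copysign(shrunk, sxy) / sxx


# Made-up data where x is almost useless for predicting y (illustration only).
weak = [(1.0, 2.0), (2.0, 1.9), (3.0, 2.2), (4.0, 2.0)]

for lam in [0.0, 0.1, 1.0]:
    print(f"lambda={lam}: ridge={ridge_slope(weak, lam):.4f}  lasso={lasso_slope(weak, lam):.4f}")
```

For this nearly useless variable, the lasso slope is exactly 0 at `lambda=1.0`, which removes the variable from the equation; the ridge slope is small but still not 0.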


### But what do we do when we have many more variables and cannot tell in advance which ones are useful? We need not choose between Lasso and Ridge regression; we can use Elastic-net regression instead.

![](images/10_bam.PNG)

![](images/10_bmb2.PNG)

![](images/11a.PNG)

> When $\lambda_1$ and $\lambda_2$ are both equal to zero, we get the least squares fit.

> When $\lambda_1 = 0$ and $\lambda_2 > 0$, we get Lasso regression.

> When $\lambda_1 > 0$ and $\lambda_2 = 0$, we get Ridge regression.

> When $\lambda_1 > 0$ and $\lambda_2 > 0$, we get Elastic-net regression.
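These four cases can be checked with a small sketch of the combined cost for a single slope. To avoid guessing at the notation in the figures, the two weights are named after the penalty they multiply: `lam_abs` for the Lasso penalty and `lam_sq` for the Ridge penalty. These names, the function `elastic_net_cost`, and the data are our own.

```python
def elastic_net_cost(points, intercept, slope, lam_abs, lam_sq):
    """SSR + lam_abs * |slope| (Lasso part) + lam_sq * slope**2 (Ridge part)."""
    ssr = sum((y - (intercept + slope * x)) ** 2 for x, y in points)
    return ssr + lam_abs * abs(slope) + lam_sq * slope ** 2


# Two made-up training points and one candidate line (illustration only).
points = [(1.0, 1.0), (2.0, 3.0)]
line = dict(intercept=0.5, slope=1.0)

least_squares = elastic_net_cost(points, **line, lam_abs=0.0, lam_sq=0.0)
lasso_only = elastic_net_cost(points, **line, lam_abs=2.0, lam_sq=0.0)
ridge_only = elastic_net_cost(points, **line, lam_abs=0.0, lam_sq=3.0)
elastic_net = elastic_net_cost(points, **line, lam_abs=2.0, lam_sq=3.0)

print("both weights zero (plain SSR):", least_squares)
print("only the Lasso weight positive:", lasso_only)
print("only the Ridge weight positive:", ridge_only)
print("both weights positive:", elastic_net)
```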

![](images/11b.PNG)

![](images/11c.PNG)

![](images/11d.PNG)