### What are the main challenges with Least Squares Regression?

- <b>Poor Prediction Accuracy</b> - The model often overfits, i.e. the least squares estimates tend to have low bias but large variance, which results in poor prediction accuracy on unseen data.
- <b>Interpretability</b> - When modelling a problem with regression, we often want to identify a small set of features that account for most of the variance in the data, so that actionable insights can be drawn from them. Least squares keeps every feature in the model, which makes this hard.
- <b>Correlated features</b> - When data with many correlated features is modelled with Linear Regression, the coefficients can become poorly determined: a large positive coefficient on one feature can be cancelled by a large negative coefficient on a correlated feature. This can result in huge errors when the model is used for extrapolation.
- <b>Invertibility of the covariance matrix $X^TX$</b> - In case of perfect correlation between some variables, or when $n<p$, the matrix $X^TX$ (proportional to the sample covariance matrix when the columns of $X$ are centered) used to estimate the $\beta$ parameters is not of full rank, so it is singular and cannot be inverted.
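The last point can be checked numerically. The snippet below is a minimal sketch on synthetic data (the sizes, seed, and variable names are illustrative assumptions). It computes the rank of $X$, which equals the rank of $X^TX$, for a design with a perfectly correlated column and for a design with $n<p$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Case 1: perfect correlation. The third column is an exact multiple of the first.
n, p = 50, 3
X_corr = rng.normal(size=(n, p))
X_corr[:, 2] = 2.0 * X_corr[:, 0]
# rank(X^T X) equals rank(X), and computing it on X is numerically more stable.
rank_corr = np.linalg.matrix_rank(X_corr)

# Case 2: fewer observations than features (n < p).
n_wide, p_wide = 5, 10
X_wide = rng.normal(size=(n_wide, p_wide))
rank_wide = np.linalg.matrix_rank(X_wide)

print(f"perfect correlation: rank {rank_corr}, but X^T X is {p} x {p}")
print(f"n < p:               rank {rank_wide}, but X^T X is {p_wide} x {p_wide}")
```

In both cases the rank is smaller than the dimension of $X^TX$, so the matrix has no inverse.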

### Intuition behind Ridge Regression

- Ridge Regression is a modification of the Residual Sum of Squares formulation that imposes a penalty on the size of the regression coefficients.
- The reason for imposing the penalty is that restricting the size of the $\beta$ coefficients reduces the variance of the model at the cost of some bias, which can improve prediction accuracy. This addresses the problem of overfitting.
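The bias-variance trade-off described above can be illustrated with a small simulation. This is a sketch under assumed settings: the design matrix, true coefficients, noise level, and $\lambda = 10$ are all made up for illustration, and `ridge_fit` is a hypothetical helper that solves the penalized normal equations. The design is held fixed and only the noise is resampled.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam, sigma = 40, 5, 10.0, 1.0

# Fixed design with strongly correlated columns (all share one latent factor z).
z = rng.normal(size=(n, 1))
X = z + 0.1 * rng.normal(size=(n, p))
beta_true = np.array([1.0, -1.0, 0.5, 0.0, 2.0])  # illustrative values

def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y; lam = 0 gives least squares."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

ols, ridge = [], []
for _ in range(2000):
    # New noise draw each time; the design X stays fixed.
    y = X @ beta_true + sigma * rng.normal(size=n)
    ols.append(ridge_fit(X, y, 0.0))
    ridge.append(ridge_fit(X, y, lam))
ols, ridge = np.array(ols), np.array(ridge)

# Total variance of the coefficient estimates across repetitions.
var_ols = ols.var(axis=0).sum()
var_ridge = ridge.var(axis=0).sum()
# Distance between the average estimate and the true coefficients.
bias_ols = np.linalg.norm(ols.mean(axis=0) - beta_true)
bias_ridge = np.linalg.norm(ridge.mean(axis=0) - beta_true)

print(f"variance: OLS {var_ols:.3f}  ridge {var_ridge:.3f}")
print(f"bias:     OLS {bias_ols:.3f}  ridge {bias_ridge:.3f}")
```

In this setup the ridge estimates vary far less between repetitions, while their average sits further from the true coefficients. Whether the trade is worthwhile depends on the data and on the choice of $\lambda$.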

 

### Mathematical formulation

Residual Sum of Squares without any constraints<br>

$RSS =(y-X\beta)^T(y-X\beta)$ <br>

We wish to restrict the values of the coefficients using the constraint below: <br>

$\beta^T\beta \le C$


<div>
<img src="attachment:1.png" width="300"/>
    <img src="attachment:2.png" width="300"/>
    </div>

From the above plot we can see that, at the constrained optimum, $\nabla RSS \propto -\beta$

We get the optimum value of $\beta$ where the gradients of RSS and of the constraint are collinear and point in opposite directions. The gradient of the constraint $\beta^T\beta$ is $2\beta$, so it points along $\beta$.

Writing the constant of proportionality as a multiplier $\lambda$ on the constraint gradient $2\beta$, we get the equation:

$\nabla RSS + 2\lambda\beta = 0$

where $\lambda \ge 0$ is the Lagrange multiplier, also called the complexity parameter, which controls the amount of shrinkage: the larger the value of $\lambda$, the greater the shrinkage. The coefficients are shrunk towards zero and towards each other.

This is the stationarity condition of the penalized objective below, which is what we want to minimize:

$RSS(\lambda) = (y-X\beta)^T(y-X\beta) + \lambda\beta^T\beta$

This gives $\hat\beta^{ridge}$ as

$\hat\beta^{ridge} = (X^TX +{\lambda}I)^{-1}X^Ty$
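The closed form above follows from setting the gradient of $RSS(\lambda)$ to zero. A short derivation, using only the symbols already defined and assuming $\lambda > 0$ so that the inverse exists:

$\nabla_\beta RSS(\lambda) = -2X^T(y - X\beta) + 2\lambda\beta = 0$

$\Rightarrow X^TX\beta + \lambda\beta = X^Ty$

$\Rightarrow (X^TX + \lambda I)\beta = X^Ty$

$\Rightarrow \hat\beta^{ridge} = (X^TX + \lambda I)^{-1}X^Ty$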

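We can observe that the only difference between the unregularized and regularized $\hat\beta$ is the $\lambda I$ term. For $\lambda > 0$ this term makes $X^TX + \lambda I$ nonsingular even if $X^TX$ is not of full rank, because it adds $\lambda$ to every eigenvalue of the positive semidefinite matrix $X^TX$.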
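A minimal NumPy sketch of the closed-form solution follows. It assumes centered data with no intercept, the function name `ridge_closed_form` is hypothetical, and the data is synthetic. It deliberately uses $n<p$, where $X^TX$ alone is singular, and then checks that the gradient of the penalized RSS vanishes at the solution.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y for beta (no intercept term)."""
    p = X.shape[1]
    # Solving the linear system is preferred over forming the inverse explicitly.
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 10))   # n = 5 < p = 10, so X^T X is singular
y = rng.normal(size=5)
lam = 1.0

beta_ridge = ridge_closed_form(X, y, lam)

# Stationarity check: gradient of (y - X b)^T (y - X b) + lam * b^T b at the solution.
grad = -2 * X.T @ (y - X @ beta_ridge) + 2 * lam * beta_ridge
print("gradient is zero:", np.allclose(grad, 0))
```

`np.linalg.solve` is used instead of `np.linalg.inv` because it avoids forming the inverse, which is generally the more stable choice.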

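Written out element-wise, the same problem is:

$\hat{\beta}^{ridge} = \underset{\beta}{\operatorname{argmin}}\left\{\sum_{i=1}^N\left(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\right)^2 + \lambda\sum_{j=1}^p\beta_j^2\right\}$
<br>Here $\lambda$ is the complexity parameter. Note that the intercept $\beta_0$ is not penalized; the matrix form above assumes centered data, so the intercept can be dropped.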

![lambda%20plots.png](attachment:lambda%20plots.png)

For $\lambda = 0$ the solution is the same as that of the Least Squares method,
whereas as $\lambda \to \infty$ the $\beta$ coefficients are shrunk towards 0, because the penalty term dominates the objective.
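The two limiting cases can be sketched numerically. The data, the true coefficients, and the grid of $\lambda$ values below are illustrative assumptions, and `ridge_fit` is a hypothetical helper that implements the closed form without an intercept.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 4
X = rng.normal(size=(n, p))
y = X @ np.array([3.0, -2.0, 1.0, 0.5]) + rng.normal(size=n)

def ridge_fit(X, y, lam):
    """Closed-form ridge solution without an intercept."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Reference least squares solution.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

lambdas = [0.0, 1.0, 10.0, 100.0, 1e6]
norms = [np.linalg.norm(ridge_fit(X, y, lam)) for lam in lambdas]

print("lambda = 0 matches least squares:", np.allclose(ridge_fit(X, y, 0.0), beta_ols))
for lam, norm in zip(lambdas, norms):
    print(f"lambda = {lam:>9.1f}   ||beta|| = {norm:.6f}")
```

The coefficient norm decreases as $\lambda$ grows and approaches zero for very large $\lambda$.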

### Why is it not useful for reducing the number of overall features?

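- It does proportional shrinkage: with orthonormal predictors, every least squares coefficient is scaled by the same factor $1/(1+\lambda)$, so the coefficients get smaller but do not become exactly zero.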

<img src="attachment:image.png" width="500"/>

Because the constraint region $\beta^T\beta \le C$ is round and has no corners, it is very rare that the contours of the error function touch the constraint boundary at a point where any of the $\beta$ coefficients is exactly 0.
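Proportional shrinkage is easiest to see when the columns of $X$ are orthonormal, so that $X^TX = I$ and the closed form reduces to $\hat\beta^{ridge} = \hat\beta^{ols}/(1+\lambda)$. The sketch below builds such a design with a QR decomposition; the data and the value of $\lambda$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Orthonormal columns via reduced QR, so that Q.T @ Q is the identity.
Q, _ = np.linalg.qr(rng.normal(size=(50, 3)))
y = Q @ np.array([4.0, 0.3, -1.0]) + 0.1 * rng.normal(size=50)
lam = 2.0

beta_ols = Q.T @ y  # least squares solution when Q.T @ Q = I
beta_ridge = np.linalg.solve(Q.T @ Q + lam * np.eye(3), Q.T @ y)

print("OLS:   ", beta_ols)
print("ridge: ", beta_ridge)
print("ratio: ", beta_ridge / beta_ols)  # every entry equals 1 / (1 + lam)
```

Every coefficient is divided by the same factor, so a small coefficient becomes smaller but never exactly zero.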

### Limitations of Ridge Regression

- It includes all the predictors in the final model, which makes the model difficult to interpret when the number of predictors is large.
- Unlike Lasso regression, ridge regression cannot perform feature selection, because it does not set any coefficient exactly to zero.
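The contrast with Lasso can be sketched for the special case of orthonormal predictors, where both estimators have simple closed forms in terms of the least squares coefficients. This assumes the Lasso objective $RSS + \lambda\sum_j|\beta_j|$, for which the solution is soft-thresholding at $\lambda/2$. The coefficient values and $\lambda$ below are made up for illustration.

```python
import numpy as np

# Hypothetical least squares coefficients under an orthonormal design.
beta_ols = np.array([4.0, 0.3, -1.0, -0.1])
lam = 1.0

# Ridge: every coefficient is scaled by 1 / (1 + lam).
beta_ridge = beta_ols / (1.0 + lam)

# Lasso (objective RSS + lam * sum|beta_j|): soft-thresholding at lam / 2.
beta_lasso = np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam / 2.0, 0.0)

print("ridge:", beta_ridge, "nonzero:", np.count_nonzero(beta_ridge))
print("lasso:", beta_lasso, "nonzero:", np.count_nonzero(beta_lasso))
```

Ridge keeps all four coefficients nonzero, while soft-thresholding sets the two smallest ones exactly to zero, which is what makes Lasso usable for feature selection.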