# **Lasso Regression in Python Notes**

By Noah Rubin

May 2021

---

#### <span style="color:black"><u>**Intro**</u><a name="Lasso"></a></span>
* Like ridge, lasso regression addresses the bias-variance tradeoff in machine learning: reducing either bias or variance tends to increase the other
* We purposely introduce bias into the regression model in an effort to reduce the variance, which can then potentially lower the mean squared error of our estimator, since $$\text{MSE} = \text{Bias}^2 + \text{Variance}$$
* Even though, by the Gauss-Markov theorem, OLS has the lowest sampling variance of any linear unbiased estimator, there may be a biased estimator that achieves a lower mean squared error, such as the lasso estimator
* In essence, lasso regression can be used because OLS might fit the training data well, but may not generalise as nicely to out-of-sample data
* Lasso regression is also a tool to help reduce the impact of multicollinearity within our feature matrix, just like ridge can
* One major advantage that lasso has over ridge is that while ridge can only shrink coefficients towards zero, lasso can shrink coefficients all the way to zero by adding an L1 regularisation penalty to the OLS loss function. Hence the loss function for lasso is defined as:

$$J(\beta_0, \beta_1, ... , \beta_p) = \sum_{i=1}^{n} (y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{i,j})^2 + \lambda\sum_{j=1}^p |\beta_j|.$$

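In matrix form (assuming the response and predictors are centred so that the intercept $\beta_0$ drops out), this is defined as

$$J(\vec{\beta}) = (\vec{y} - X\vec{\beta})^T(\vec{y} - X\vec{\beta}) + \lambda||\vec{\beta}||_1$$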
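As a rough check that the summation and matrix forms describe the same quantity, the sketch below evaluates both on a tiny made-up dataset. The function names `lasso_loss_sum` and `lasso_loss_matrix`, the data, and the value of `lam` are all illustrative assumptions, and the intercept is taken to be zero so that the two forms line up.

```python
import numpy as np

def lasso_loss_sum(beta0, beta, X, y, lam):
    """Summation form: squared residuals plus lam times the sum of |beta_j|."""
    total = 0.0
    for i in range(X.shape[0]):
        prediction = beta0 + sum(beta[j] * X[i, j] for j in range(X.shape[1]))
        total += (y[i] - prediction) ** 2
    return total + lam * sum(abs(b) for b in beta)

def lasso_loss_matrix(beta, X, y, lam):
    """Matrix form: (y - X beta)^T (y - X beta) + lam * ||beta||_1 (no intercept)."""
    residual = y - X @ beta
    return residual @ residual + lam * np.linalg.norm(beta, ord=1)

# Hypothetical toy data, chosen only so the arithmetic is easy to follow
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
beta = np.array([0.5, -0.25])
lam = 2.0

loss_sum = lasso_loss_sum(0.0, beta, X, y, lam)
loss_matrix = lasso_loss_matrix(beta, X, y, lam)
print(loss_sum, loss_matrix)
```

Note that the penalty only involves the slope coefficients; the intercept is not penalised in the summation form.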

Because of the mathematical properties that follow from penalising the sum of the absolute values of the $\beta_j$ coefficients, certain coefficients can be shrunk all the way to zero, "...thus the lasso yields models that simultaneously use regularization to improve the model and to conduct feature selection." (*Applied Predictive Modeling*, Max Kuhn and Kjell Johnson). In this sense, lasso further encourages parsimonious models through embedded feature selection.
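The feature-selection behaviour can be sketched with scikit-learn on synthetic data. Everything about the data here is an assumption made for illustration: 10 predictors of which only the first 3 affect the response, and arbitrary penalty strengths. Also note that scikit-learn's `Lasso` scales the squared-error term by $1/(2n)$, so its `alpha` is not numerically the same as the $\lambda$ in the loss above.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data (assumption): only the first 3 of 10 predictors matter
n, p = 200, 10
X = rng.normal(size=(n, p))
true_beta = np.array([3.0, -2.0, 1.5] + [0.0] * 7)
y = X @ true_beta + rng.normal(scale=1.0, size=n)

# Standardise so the penalty treats every coefficient on the same scale
X_scaled = StandardScaler().fit_transform(X)

# Penalty strengths are arbitrary choices for this sketch
lasso = Lasso(alpha=0.5).fit(X_scaled, y)
ridge = Ridge(alpha=10.0).fit(X_scaled, y)

# Count coefficients that are exactly zero under each model
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))

print("Lasso coefficients:", np.round(lasso.coef_, 3))
print("Ridge coefficients:", np.round(ridge.coef_, 3))
print("Exact zeros (lasso, ridge):", n_zero_lasso, n_zero_ridge)
```

In a setup like this, lasso should set most of the irrelevant coefficients exactly to zero, while ridge only shrinks them to small non-zero values.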

---

Both ridge and lasso are able to lessen the impact of multicollinearity, but they do so in different ways. In ridge regression, the coefficients of correlated predictors tend to be close to each other in value, while in lasso, one predictor out of a correlated group tends to stand out while the coefficients of the remaining correlated predictors shrink towards zero (or become exactly zero).
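A minimal sketch of this contrast, again on synthetic data: `x2` is built as a noisy copy of `x1`, and the response is driven by `x1` only. The data-generating process and the penalty strengths are assumptions chosen to make the effect visible, so the exact numbers will change with other settings.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 200

# Two strongly correlated predictors plus one independent predictor (assumption)
x1 = rng.normal(size=n)
x2 = x1 + 0.3 * rng.normal(size=n)   # noisy copy of x1
x3 = rng.normal(size=n)
X = StandardScaler().fit_transform(np.column_stack([x1, x2, x3]))

# The response depends on x1 only
y = 3.0 * x1 + rng.normal(scale=1.0, size=n)

ridge = Ridge(alpha=100.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

# Gap between the coefficients of the two correlated predictors
ridge_gap = abs(ridge.coef_[0] - ridge.coef_[1])
lasso_gap = abs(lasso.coef_[0] - lasso.coef_[1])

print("Ridge coefficients:", np.round(ridge.coef_, 3))
print("Lasso coefficients:", np.round(lasso.coef_, 3))
print("Gap between correlated pair (ridge, lasso):", ridge_gap, lasso_gap)
```

Here ridge should spread the weight fairly evenly across `x1` and `x2`, while lasso should put most of it on one of them and leave the other small or exactly zero.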