# Feature Selection - LASSO regularization

## Introduction

In this section, I will cover regularization as a method for feature selection for linear models.


Regularization consists of adding a penalty on the parameters of the model to reduce its freedom. Hence, the model will be **less likely to fit the noise** of the training data, which improves the generalization ability of the machine learning algorithm.
For linear models, there are in general 3 types of regularization:

- The L1 regularization (also called Lasso)
- The L2 regularization (also called Ridge)
- The L1/L2 regularization (also called Elastic Net)

I will focus on the L1 and L2 for comparison, to conclude that LASSO is the regularization that allows for feature selection.

## Regularization: LASSO

As I mentioned, regularization consists of applying a penalty to the coefficients that multiply each of the predictors in the linear model,
in order to avoid overfitting.


$$\frac{1}{2m}\sum(y - y_{pred})^2 + \lambda\sum |\beta|$$

- $m$ = number of observations
- $y$ = observed output
- $y_{pred}$ = predicted output
- $\beta$ = coefficients of the linear model
- $\lambda$ = regularization parameter

The higher the penalty, the more the model is constrained, which typically improves generalization. If the penalty is too high, however, the model may lose predictive power.

During fitting, the machine learning model tries to minimize the difference between the predicted outcome and the real value of the observation, plus the regularization component.

The regularization component is a penalty on the coefficients that the linear model assigns to the variables.

You can see that, to keep this expression at a minimum, if we increase lambda, that is, if we increase the penalty, we need to decrease the coefficients $\beta$.
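As a minimal sketch of the quantity being minimized, the function below computes the squared-error term plus the L1 penalty (the sum of the absolute values of the coefficients) with NumPy. The function name `lasso_objective` and the toy data are made up for illustration, and the intercept is left out for brevity.

```python
import numpy as np


def lasso_objective(X, y, beta, lam):
    """Squared-error term divided by 2m, plus lambda times the L1 penalty."""
    m = len(y)
    y_pred = X @ beta
    squared_error = np.sum((y - y_pred) ** 2) / (2 * m)
    l1_penalty = lam * np.sum(np.abs(beta))
    return squared_error + l1_penalty


# Toy data (hypothetical): the coefficients below reproduce y exactly
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
beta = np.array([1.0, 0.0])

# Without a penalty, a perfect fit costs nothing
loss_no_penalty = lasso_objective(X, y, beta, lam=0.0)

# With a penalty, the same coefficients are charged lambda * sum(|beta|)
loss_with_penalty = lasso_objective(X, y, beta, lam=0.5)

print(loss_no_penalty, loss_with_penalty)
```

Even with a perfect fit, the penalized loss is not zero: the only way to lower it further is to make the coefficients smaller, which is the trade-off described above.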

> L1 / Lasso will **shrink some parameters to zero**, therefore allowing for feature elimination

The LASSO regularization has the property that it can shrink some of the coefficients $\beta$ to exactly zero.

A predictor whose coefficient is zero is multiplied by zero when estimating the target, so it does not add to the final prediction of the output. In other words, that feature can be removed, as it does not contribute to the prediction.
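To see this in practice, here is a small sketch with scikit-learn's `Lasso`. The data is synthetic (generated with `make_regression`, not the dataset used later in this section), and the value `alpha=5.0` is an arbitrary choice for illustration. Note that scikit-learn calls the regularization parameter `alpha` rather than lambda.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration):
# 10 features, of which only 3 actually drive the target
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=10.0, random_state=0
)

# Put all features on the same scale so the penalty treats them equally
X = StandardScaler().fit_transform(X)

# Fit a linear regression with the L1 penalty
lasso = Lasso(alpha=5.0)
lasso.fit(X, y)

kept = np.flatnonzero(lasso.coef_ != 0)     # features with a non-zero coefficient
removed = np.flatnonzero(lasso.coef_ == 0)  # features shrunk to exactly zero

print("kept features:", kept)
print("removed features:", removed)
```

The features listed in `removed` are multiplied by zero in the fitted model, so dropping them does not change its predictions.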

## Regularization: RIDGE

Ridge regression, or L2, differs from LASSO in that the penalty is applied to the square of the coefficients, $\beta^2$, instead of their absolute value.


$$\frac{1}{2m}\sum(y - y_{pred})^2 + \lambda\sum \beta^2$$

- $m$ = number of observations
- $y$ = observed output
- $y_{pred}$ = predicted output
- $\beta$ = coefficients of the linear model
- $\lambda$ = regularization parameter

> L2 / Ridge, as the penalization increases, the coefficients approach zero but do not equal zero, hence no variable is ever excluded.


As the penalization lambda increases, the coefficients of the regression approach zero
but never equal zero. Therefore, this regularization is not suitable for feature selection.

It is useful for model optimization, but it will not allow you to select variables or, better said, remove variables from a dataset.
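A quick sketch with scikit-learn's `Ridge` illustrates this. The data is again synthetic and the value of `lam` is arbitrary. One caveat: scikit-learn's `Ridge` does not divide the error term by $2m$, so its `alpha` corresponds to $2m\lambda$ in the formula above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration)
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=10.0, random_state=0
)
X = StandardScaler().fit_transform(X)
m = X.shape[0]

lam = 5.0  # arbitrary penalty, chosen for illustration

# Ridge minimizes ||y - Xb||^2 + alpha * ||b||^2 (no 1/(2m) factor),
# so alpha = 2 * m * lambda matches the formula in the text
ridge = Ridge(alpha=2 * m * lam).fit(X, y)

# Unpenalized linear regression, for comparison
ols = LinearRegression().fit(X, y)

# How many coefficients are exactly zero, and how much they shrank overall
n_zero = int(np.sum(ridge.coef_ == 0))
shrinkage = np.linalg.norm(ridge.coef_) / np.linalg.norm(ols.coef_)

print("coefficients equal to zero:", n_zero)
print("size of ridge coefficients relative to unpenalized fit:", shrinkage)
```

The coefficients come out much smaller than in the unpenalized fit, but none of them is exactly zero, so there is nothing to remove.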

Let's see an example for more clarity. In this example, we want to predict the prices of houses from the house sale dataset on Kaggle. 

### LASSO

![](../imgs/L1.png)

These plots show the value of the coefficient that multiplies each of the predictors (listed on the right) as we increase the regularization parameter lambda. As expected, as we increase lambda we penalize
the parameters harder, so the values of the coefficients decrease.

For the LASSO regularization, we can see that as we increase lambda, the different feature coefficients are drawn to zero one after the other.

So as the regularization increases, the features are eliminated one after the other until we eliminate all of the features if the regularization is too high.
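The same behaviour can be checked numerically. The sketch below fits `Lasso` for a handful of increasing penalties and counts how many coefficients survive. It uses synthetic data rather than the house price dataset behind the plot, and the list of `alphas` is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration)
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=10.0, random_state=0
)
X = StandardScaler().fit_transform(X)

# Increasing penalties (scikit-learn's name for lambda is `alpha`)
alphas = [0.1, 1.0, 10.0, 50.0, 1000.0]

n_nonzero = []
for alpha in alphas:
    coef = Lasso(alpha=alpha).fit(X, y).coef_
    # Count the features that are still in the model at this penalty
    n_nonzero.append(int(np.sum(coef != 0)))

for alpha, n in zip(alphas, n_nonzero):
    print(f"alpha={alpha}: {n} non-zero coefficients")
```

With the largest penalty every coefficient is zero, which is the "all features eliminated" end of the plot.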

### RIDGE

![](../imgs/L2.png)

For the ridge regression, likewise, when the penalty increases, the coefficients that are fit to the variables also decrease, but they decrease all together and they do not reach zero. Therefore, this regularization is not suitable for feature selection.

### Another example

Here we have another example on a different dataset.

![](../imgs/L1L2.png)

Again, the plot shows the coefficients of the different features. This time the plot is arranged so that lambda increases towards the left: bigger lambdas are on the left and smaller lambdas on the right.

We can see again how, with the LASSO regularization, the coefficients of the different variables are shrunk to zero one after the other, therefore allowing for feature selection. On the contrary, the ridge regularization shrinks all the coefficients at the same time, and they only come close to zero at very high penalties.

## EMBEDDED METHODS: LASSO

Therefore, as you can see, by fitting a linear or logistic regression with the LASSO regularization,
we can then evaluate the coefficients of the different variables and remove those whose coefficients
are zero.

In this way, we select features while fitting the machine learning algorithm.
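One possible way to put this into code is scikit-learn's `SelectFromModel`, which wraps an estimator and keeps the features whose coefficients are not (numerically) zero. This is a sketch on synthetic data with an arbitrary `alpha`, not a recipe tuned for any particular dataset.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration)
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=10.0, random_state=0
)
X = StandardScaler().fit_transform(X)

# Fit the Lasso and select features in a single step
selector = SelectFromModel(Lasso(alpha=5.0))
selector.fit(X, y)

# Boolean mask: True for features that were kept
support = selector.get_support()

# Reduced dataset containing only the selected features
X_selected = selector.transform(X)

print("features kept:", int(support.sum()), "of", X.shape[1])
print("shape after selection:", X_selected.shape)
```

In a real project the penalty would normally be chosen by cross-validation, and the selector would be fit on the training set only and then applied to the test set.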