# IMPROVING MACHINE LEARNING MODELS - 2 : Regularization
<hr style="height:5px;border-width:2;color:gray">

## Introduction

One of the simplest ways of preventing overfitting is the use of regularization.

Overfitting happens when a model learns the specific patterns and noise in the training data to such an extent that it hurts the model’s ability to generalize from the training data to new (“unseen”) data. By noise, we mean the irrelevant information or randomness in a dataset.

Preventing overfitting is essential for improving the performance of a machine learning model on unseen data.

## What is regularization?

In general, regularization means making things regular or acceptable, and that is exactly why we use it in applied machine learning. In the context of machine learning, regularization is a process that shrinks the model's coefficients towards zero. In simple words, regularization discourages learning an overly complex or flexible model in order to prevent overfitting.

## Simple Intuition behind Regularization

Let's go back to our linear regression module in which we tried to predict the cost of a house. The predicted cost of our house might look like a function of X1 and X2 and X3 as:

$\hat{y} = 20843.6764X_{1} + 1893.12765X_{2} + 97.12131X_{3}$

and so on.

Looks very complex and dangerous, doesn't it?
The function has trained itself to hit the correct target values for all the noisy data points and has thus failed to learn the underlying pattern. Such a function may give a very small error on the training set but huge errors when predicting target values for the test dataset.

## How Regularization is Implemented in Machine Learning

How do we implement regularization in ML? We usually add a penalty term to the cost function, be it for linear regression, logistic regression or neural networks. This penalty term keeps the weights from becoming too large, which in turn keeps the model from becoming too complex.

Consider the cross-entropy loss function used for binary classification in neural networks, which is

$$J(Y, \hat Y) = -\frac{1}{m} \sum\limits_{i = 1}^m \left[ y^{(i)} \log (\hat y^{(i)}) + (1 - y^{(i)}) \log (1 - \hat y^{(i)}) \right]$$

The new cost function with regularization is given by 

$$J = -\frac{1}{m} \sum\limits_{i = 1}^m \left[ y^{(i)} \log (\hat y^{(i)}) + (1 - y^{(i)}) \log (1 - \hat y^{(i)}) \right] + \frac{\lambda}{2}\sum\limits_{l=1}^{L}\sum\limits_{j=1}^{n_{l}} \left| w_{j}^{[l]} \right|^{k}$$

where,

$m$ = number of training examples,

$n_{l}$ = number of neurons in layer $l$,

$L$ = number of layers,

$w_{j}^{[l]}$ = the $j$-th weight in layer $l$


You can basically think of it as taking the sum of the $k$-th powers of the absolute values of the weights, scaling it by a value $\lambda$ (here $\lambda/2$), and adding the result to the cost.

*The degree or value of k is usually set to 1 or 2*

1) When k=1, we call it L1 regularization

2) When k=2, we call it L2 regularization

In linear regression, logistic regression, and neural networks, we most often use L2 regularization, as it penalizes large weights more heavily and shrinks all the weights together rather than singling out a few.
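To make the regularized cost concrete, here is a minimal NumPy sketch of the formula above. The function name `regularized_cost`, the list-of-arrays layout for `weights`, and the small clipping constant `eps` are our own assumptions for illustration, not part of any library.

```python
import numpy as np

def regularized_cost(y, y_hat, weights, lam, k=2):
    """Binary cross-entropy plus (lam / 2) * sum(|w|^k) over all layers.

    `weights` is a list holding one weight array per layer (hypothetical layout).
    """
    eps = 1e-12  # keeps log() away from log(0)
    y_hat = np.clip(y_hat, eps, 1 - eps)
    m = y.shape[0]
    cross_entropy = -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)) / m
    penalty = (lam / 2) * sum(np.sum(np.abs(W) ** k) for W in weights)
    return cross_entropy + penalty

# Toy labels, predictions and weights for a two-layer model
y = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.8])
weights = [np.array([[1.0, -2.0]]), np.array([[0.5]])]

print(regularized_cost(y, y_hat, weights, lam=0.0))       # plain cross-entropy
print(regularized_cost(y, y_hat, weights, lam=0.1, k=1))  # with L1 penalty
print(regularized_cost(y, y_hat, weights, lam=0.1, k=2))  # with L2 penalty
```

With $\lambda = 0$ the penalty vanishes and we recover the plain cross-entropy; any positive $\lambda$ adds a cost that grows with the size of the weights.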

## L1 vs L2 Regularization

**L1 regularization:** Take a look at the L1 equation ($k = 1$). It has the effect of pushing the weights $w$ towards 0, leading to sparsity.

This is of course pointless in a 1-variable linear regression model, but it proves its worth by ‘removing’ useless variables in multivariate regression models. You can also think of L1 as reducing the number of features in the model altogether: in a multivariate linear regression model, L1 can ‘push’ the weights of uninformative variables all the way to zero.

So how does pushing $w$ towards 0 help with overfitting in L1 regularization? As mentioned above, as $w$ goes to 0, we reduce the importance of the corresponding variable, and a weight of exactly 0 removes that feature from the model. Therefore, L1 leads to a form of feature selection. This in turn reduces the model complexity, and a simpler model is less likely to overfit.


**L2 regularization:** Take a look at the L2 equation ($k = 2$). In L2, weights are less likely to reach exactly zero, because the squared penalty and its gradient both shrink as a weight gets smaller. Large weights (magnitude greater than 1), however, are penalized more heavily in L2 than in L1. In addition, L2 keeps all the features while penalizing their weights.
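A quick way to see the difference is to compare the two penalties, and their slopes, for a few weight values. This is only a small numerical sketch; the sample weights are arbitrary, and the common factor $\lambda/2$ is left out because it scales both penalties equally.

```python
import numpy as np

w = np.array([0.01, 0.5, 1.0, 3.0])  # arbitrary sample weights

l1_penalty = np.abs(w)   # |w|
l2_penalty = w ** 2      # w^2
l1_grad = np.sign(w)     # slope of |w|: constant size, whatever the weight
l2_grad = 2 * w          # slope of w^2: shrinks as the weight shrinks

for row in zip(w, l1_penalty, l2_penalty, l1_grad, l2_grad):
    print("w=%.2f  |w|=%.4f  w^2=%.4f  dL1=%.1f  dL2=%.2f" % row)
```

For the large weight the squared penalty dominates, while for the tiny weight the L2 slope is almost zero, so there is little pressure left to push it all the way to 0. The L1 slope stays the same size, which is why L1 keeps pushing small weights until they hit zero.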

## How it Works

We now have a new hyperparameter $\lambda$. Lambda controls how much the weights must be penalized. 

Let us say $\lambda$ is a relatively small value, like 0.001. We have to minimize the cost function $J$, but since the penalty term is multiplied by such a small number, it adds little to the cost and the weights are still free to take larger values.

Now assume $\lambda$ to be large, like 10 (in fact, this is usually too big a value for $\lambda$!). The penalty on each weight is now scaled up by a factor of 10, but the cost still has to be minimized, so the weights are forced to take smaller values, giving a simpler model. This way, overfitting is prevented.
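The effect of $\lambda$ is easy to check numerically. The sketch below uses L2-regularized linear regression (ridge regression) rather than the cross-entropy cost above, because it has a closed-form solution; the synthetic data, the helper name `ridge_weights` and the exact scaling of the penalty are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
true_w = np.array([4.0, -3.0, 2.0, 0.0, 0.0])
y = X @ true_w + rng.normal(scale=0.5, size=50)

def ridge_weights(X, y, lam):
    """Weights minimizing ||y - Xw||^2 + lam * ||w||^2 (closed form)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

lambdas = [0.001, 0.1, 10.0, 1000.0]
norms = [np.linalg.norm(ridge_weights(X, y, lam)) for lam in lambdas]
for lam, norm in zip(lambdas, norms):
    print(f"lambda={lam:<8} ||w|| = {norm:.4f}")
```

The overall size of the weight vector drops as $\lambda$ grows. With a very large $\lambda$ the weights are squeezed so hard that the model can no longer fit even the real pattern, which is why $\lambda$ has to be tuned.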

## Built-in Code

Scikit-learn provides L1, L2 and other regularization techniques. Check out [this link](https://scikit-learn.org/stable/modules/linear_model.html) for linear models and their various regularization techniques in scikit-learn. Deep learning frameworks such as Keras also offer many regularization techniques to prevent overfitting in neural networks.
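As a small sketch of the built-in route, the example below fits scikit-learn's `Lasso` (L1) and `Ridge` (L2) estimators on synthetic data in which only two of ten features matter. The data and the choice `alpha=0.1` are assumptions made for illustration. Note that scikit-learn calls the regularization strength `alpha` instead of $\lambda$, and that the exact scaling of the penalty differs between estimators, so check the linked documentation before comparing values.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))
# Only the first two features carry signal; the other eight are pure noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)  # L1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)  # L2 penalty

print("Lasso coefficients:", np.round(lasso.coef_, 3))
print("Ridge coefficients:", np.round(ridge.coef_, 3))
print("Zeroed by Lasso:", int(np.sum(lasso.coef_ == 0)))
print("Zeroed by Ridge:", int(np.sum(ridge.coef_ == 0)))
```

Lasso should set the coefficients of the noise features to exactly zero, while Ridge keeps every feature with a small but nonzero weight.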

## Side Note: Gradient Computation and Effect on Training Accuracy

During backprop/gradient descent, the gradients change when regularization is used, because the penalty term contributes its own derivative. We suggest you calculate the gradients with regularization yourself (they're not that hard :) )
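As a check on your own derivation, here is the L2 case ($k = 2$) for the regularized cost $J$ defined earlier. We write $J_0$ for the cost without the penalty term and $\alpha$ for the learning rate; both symbols are our own shorthand and are not used elsewhere in this module.

$$\frac{\partial J}{\partial w_{j}^{[l]}} = \frac{\partial J_0}{\partial w_{j}^{[l]}} + \lambda w_{j}^{[l]}$$

$$w_{j}^{[l]} := w_{j}^{[l]} - \alpha \left( \frac{\partial J_0}{\partial w_{j}^{[l]}} + \lambda w_{j}^{[l]} \right) = (1 - \alpha\lambda)\, w_{j}^{[l]} - \alpha \frac{\partial J_0}{\partial w_{j}^{[l]}}$$

Each update first shrinks the weight by the factor $(1 - \alpha\lambda)$ and then applies the usual gradient step, which is why L2 regularization is also known as weight decay.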

Effect on Training Accuracy: When regularization is used, the weights are penalized, so the training accuracy usually decreases. However, this isn't a primary concern, as a balance can be achieved with a proper choice of $\lambda$. Moreover, the test accuracy is the better measure of a model's robustness, as it represents the model's performance on unseen data.

## Conclusion

Regularization is a simple yet powerful technique for tackling the overfitting problem. It is used in almost every regression technique and neural network. From now on, you are advised to use regularization in every model you create.