## In this notebook, we will analyse the difference between $L_1$ and $L_2$ regularization in a mathematical way:

Let's say our original error function is: $$E_0 = \sum_{i=1}^n (h(x_i) - y_i)^2$$ where $h(\cdot)$ is the hypothesis.

And we decide to regularize it: once with $L_1$ regularization and once with $L_2$ regularization.

Let the respective error functions be $E_1$ and $E_2$. Then for $0 <\lambda < 1$,

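$E_1$ looks like: $$ E_1 = E_0 + \lambda|w| $$ where $|w| = \sum_j |w_j|$ is the sum of the absolute values of the weights.

$E_2$ looks like: $$E_2 = E_0 + \frac{\lambda w^2}{2}$$ where $w^2 = \sum_j w_j^2$ is the sum of the squared weights.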
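To make the three error functions concrete, here is a minimal Python sketch. It assumes a linear hypothesis $h(x) = w \cdot x$ and reads $|w|$ and $w^2$ as sums over the individual weights; the function names and the toy data are made up for illustration.

```python
def squared_error(w, xs, ys):
    """E_0: sum of squared residuals, assuming a linear hypothesis h(x) = w . x."""
    total = 0.0
    for x, y in zip(xs, ys):
        h = sum(wj * xj for wj, xj in zip(w, x))
        total += (h - y) ** 2
    return total


def l1_error(w, xs, ys, lam):
    """E_1 = E_0 + lambda * sum_j |w_j|."""
    return squared_error(w, xs, ys) + lam * sum(abs(wj) for wj in w)


def l2_error(w, xs, ys, lam):
    """E_2 = E_0 + (lambda / 2) * sum_j w_j^2."""
    return squared_error(w, xs, ys) + lam * sum(wj ** 2 for wj in w) / 2


# Toy data (illustrative only).
xs = [[1.0, 2.0], [3.0, 4.0]]
ys = [1.0, 2.0]
w = [0.5, -0.25]
lam = 0.1

e0 = squared_error(w, xs, ys)
e1 = l1_error(w, xs, ys, lam)
e2 = l2_error(w, xs, ys, lam)
print(e0, e1, e2)
```

Both regularized errors are larger than $E_0$ for any non-zero weight vector, since each penalty term is non-negative.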

The gradient-descent update rules for these error functions (with the learning rate taken to be $1$ for simplicity) look like:

For $L_1$: $$ w_i^1 \leftarrow w_i -\frac{\partial E_1}{\partial w_i} $$

For $L_2$: $$ w_i^2 \leftarrow w_i -\frac{\partial E_2}{\partial w_i} $$

The superscript denotes the kind of regularization used to get the new value of $w_i$.

Substituting $E_1$ and $E_2$, these in turn look like:

(1) $$ w_i^1 \leftarrow w_i - \frac{\partial E_0}{\partial w_i} - \lambda \frac{\partial |w|}{\partial w_i} $$

(2) $$ w_i^2 \leftarrow w_i - \frac{\partial E_0}{\partial w_i} - \frac{\lambda}{2} \frac{\partial w^2}{\partial w_i} $$

Evaluating the derivatives of the penalty terms, the update rules become

(3a) $$ w_i^1 \leftarrow  w_i - \frac{\partial E_0}{\partial w_i} - \lambda   \quad \text{if} \quad w_i >0$$

(3b)  $$ w_i^1 \leftarrow  w_i - \frac{\partial E_0}{\partial w_i} + \lambda  \quad \text{if} \quad w_i <0 $$

(4) $$ w_i^2 \leftarrow w_i - \frac{\partial E_0}{\partial w_i} - \lambda w_i $$

At $w_i = 0$ the absolute value is not differentiable, so that case is left out here.
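The update rules (3a), (3b) and (4) can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the gradient of $E_0$ is passed in as a plain number, and using a penalty gradient of $0$ at $w_i = 0$ is a choice made here, not something the derivation above fixes.

```python
def sign(v):
    """Return +1 for positive v, -1 for negative v, and 0 at v == 0."""
    return (v > 0) - (v < 0)


def l1_step(w_i, grad_e0, lam):
    """Eqs. (3a)/(3b): subtract lam for positive weights, add lam for negative ones."""
    return w_i - grad_e0 - lam * sign(w_i)


def l2_step(w_i, grad_e0, lam):
    """Eq. (4): the penalty term is proportional to the current weight."""
    return w_i - grad_e0 - lam * w_i


# With the data gradient set to 0, only the penalty acts on the weight.
lam = 0.1
for w_i in (0.5, -0.5):
    print(w_i, l1_step(w_i, 0.0, lam), l2_step(w_i, 0.0, lam))
```

Setting `grad_e0` to `0.0` isolates the effect of the regularizer, which is the only part that differs between the two rules.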

## Now

Consider Eqs. (3a) and (4), i.e. the case $w_i >0$.

The effective update rules for $L_1$ and $L_2$, respectively, are as follows:

$$ w_i^1 \leftarrow w_i - \lambda $$
$$ w_i^2 \leftarrow w_i - \lambda w_i $$

Because the contribution from the gradient of $E_0$ is the same in both cases, there is no point in including it in the comparison.

Now, $w_i$s are reduced: 

1) by an amount $\lambda$ to get to $w_i^1$ and

2) by an amount $\lambda w_i$ to get to $w_i^2$

Assume additionally that $0 < w_i < 1$. Together with $0 < \lambda < 1$, this gives $\lambda w_i < \lambda$. (For $w_i > 1$ the inequality reverses and the $L_2$ step is the larger one.)

This means the (positive) weights are updated (i.e. decreased towards $0$) by a larger amount via the $L_1$ regularizer ($=\lambda$) than via the $L_2$ regularizer ($=\lambda w_i$).

So for a fixed number of updates, the (positive) weights obtained by $L_1$ will be closer to $0$ than those obtained by $L_2$.
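A small simulation illustrates this. The sketch below applies only the regularization part of each update, starting both weights from the same value; the starting weight, $\lambda$ and the number of steps are arbitrary choices, picked so that the $L_1$ weight stays positive throughout and Eq. (3a) remains the applicable rule.

```python
def shrink(w0, lam, steps):
    """Apply the penalty-only updates w - lam (L1) and w - lam * w (L2) `steps` times."""
    w_l1 = w_l2 = w0
    for _ in range(steps):
        w_l1 = w_l1 - lam          # L1: constant step, valid while w_l1 > 0
        w_l2 = w_l2 - lam * w_l2   # L2: step proportional to the current weight
    return w_l1, w_l2


w_l1, w_l2 = shrink(w0=0.5, lam=0.1, steps=3)
print(w_l1, w_l2)  # roughly 0.2 for L1 and 0.3645 for L2
```

After the same three updates, the $L_1$ weight is closer to $0$ than the $L_2$ weight.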

Hence the saying that $L_1$'s weights are "sparse" compared to $L_2$'s: $L_2$ has not made the same **amount** of updates to its weights to get them to $0$. It cannot reach exactly $0$ in the first place, because its update is tied to the weight's current value and therefore shrinks as the weight shrinks.

The $L_2$ regularizer decreases the weights as well, but at a slower rate and in proportion to their current values.
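The same point can be made with a short worked calculation. Writing $w_i(k)$ for the weight after $k$ regularization-only updates (a notation introduced here for illustration) and starting from $0 < w_i(0) < 1$, the two rules unroll to:

$$ w_i^1(k) = w_i(0) - k\lambda \quad \text{(while the weight stays positive)} $$

$$ w_i^2(k) = (1 - \lambda)^k \, w_i(0) $$

The $L_1$ weight therefore reaches $0$ after about $w_i(0)/\lambda$ updates, while the $L_2$ weight stays strictly positive for every finite $k$, because $0 < 1 - \lambda < 1$. For example, with $w_i(0) = 0.5$ and $\lambda = 0.1$, the $L_1$ weight reaches $0$ at $k = 5$, whereas the $L_2$ weight is $0.5 \times 0.9^5 \approx 0.295$.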

A similar analysis applies to Eqs. (3b) and (4), i.e. for $w_i <0$.
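The negative case can be checked numerically as well. The sketch below again uses only the penalty part of each update; the function names and the values are illustrative assumptions, and leaving a weight of exactly $0$ unchanged under $L_1$ is a choice made here.

```python
def l1_penalty_step(w_i, lam):
    """Penalty-only L1 update: Eq. (3a) for w_i > 0, Eq. (3b) for w_i < 0."""
    if w_i > 0:
        return w_i - lam
    if w_i < 0:
        return w_i + lam
    return w_i  # assumption: leave an exactly-zero weight unchanged


def l2_penalty_step(w_i, lam):
    """Penalty-only L2 update from Eq. (4)."""
    return w_i - lam * w_i


lam = 0.1
neg_l1 = l1_penalty_step(-0.5, lam)  # moves up towards 0 by lam
neg_l2 = l2_penalty_step(-0.5, lam)  # moves up towards 0 by lam * |w_i|
print(neg_l1, neg_l2)
```

Both updates mirror the positive case: the negative weight moves towards $0$, and for $-1 < w_i < 0$ the $L_1$ step is again the larger one.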