# Regularization

Great power implies great responsibility

In [1]:
%matplotlib inline
import torch
import torch.nn as nn
from d2l import torch as d2l

Powerful models are at risk of overfitting.

**Think about it**: We are trying to reduce the error on the training set to get a good result on the validation/test sets

The best way to reduce the training error: **memorize everything**. This is not what we want.

<center>
    <img src='images/capacity-vs-error.svg' width='60%'/>
    <p>Source: <a href='http://d2l.ai'>d2l.ai</a></p>
</center>

The more complex the model, the easier it is for it to memorize (i.e., **overfit**) the training set.

However, if the model is too simple, it will fail to learn and will not perform well (i.e., **underfit**).
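The trade-off described above can be sketched with polynomial regression, where the polynomial degree plays the role of model complexity. This is only an illustrative toy setup: the target function, the noise level, the degrees, and the helper names `poly_features` and `noisy_target` are assumptions made for this example, not part of the course material.

```python
import torch

torch.manual_seed(0)

def poly_features(x, degree):
    # Columns are x^0, x^1, ..., x^degree
    return torch.stack([x ** k for k in range(degree + 1)], dim=1)

def noisy_target(x):
    # Hypothetical ground truth plus Gaussian noise
    return torch.sin(3 * x) + 0.2 * torch.randn_like(x)

x_train = torch.linspace(-1, 1, 10, dtype=torch.float64)
x_test = torch.linspace(-0.95, 0.95, 50, dtype=torch.float64)
y_train, y_test = noisy_target(x_train), noisy_target(x_test)

errors = {}
for degree in (1, 3, 9):
    A_train = poly_features(x_train, degree)
    # Least-squares fit of the polynomial coefficients
    coef = torch.linalg.lstsq(A_train, y_train.unsqueeze(1)).solution
    train_mse = ((A_train @ coef).squeeze(1) - y_train).pow(2).mean().item()
    A_test = poly_features(x_test, degree)
    test_mse = ((A_test @ coef).squeeze(1) - y_test).pow(2).mean().item()
    errors[degree] = (train_mse, test_mse)
    print(f"degree={degree}: train MSE={train_mse:.2e}, test MSE={test_mse:.2e}")
```

With 10 training points, the degree-9 polynomial can pass through every point, so its training error is essentially zero: it has memorized the noise. Whether its test error is worse than that of the lower degrees depends on the random draw, which is why the comparison should be read on the printed test column rather than assumed.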

### How to overcome this behavior?

<ol>
    <li><b>Increase the number of elements in the training set:</b> The more examples there are to memorize, the harder memorization becomes. <br />In practice, good-quality data is <b>very hard and expensive</b> to get.
    </li>
    <li><b>Reduce the complexity of your model:</b> You don't need a big, deep neural network for simple problems.</li>
    <li><b>Regularization techniques</b></li>
</ol>

## Regularization techniques

One of the most *classic* approaches is **weight decay**.
The idea is pretty straightforward: we add a second objective to the loss function to penalize overly complex models.

The objective becomes: reduce the error of the neural network + reduce the complexity of the neural network.

The network will try to find an *equilibrium* between these two objectives.

The most popular weight decay technique is the **L2 regularization term**.
We simply sum the squares of the neural network weights: $$\|\mathbf{w}\|^2$$
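Written out, the two objectives combine into a single loss. The coefficient $\lambda$ (the strength of the penalty) and the factor $\frac{1}{2}$ are notational assumptions here; the factor is a common convention that makes the gradient of the penalty simply $\lambda \mathbf{w}$:

$$L_{\text{total}}(\mathbf{w}) = L_{\text{error}}(\mathbf{w}) + \frac{\lambda}{2}\|\mathbf{w}\|^2, \qquad \nabla_{\mathbf{w}} L_{\text{total}} = \nabla_{\mathbf{w}} L_{\text{error}} + \lambda \mathbf{w}$$

A larger $\lambda$ pushes the equilibrium toward smaller weights; $\lambda = 0$ recovers the unregularized problem.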

In PyTorch, this is set in the optimizer through the `weight_decay` argument:

In [2]:
model = nn.Sequential(nn.Linear(5, 1))

In [3]:
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, weight_decay=1e-4)

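In PyTorch, the **weight decay** is an additional gradient on top of the loss gradient (for SGD, `weight_decay * w` is added to the gradient of each parameter `w`), so `weight_decay` is a coefficient that should stay small -> **Don't put 1**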
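To make the "additional gradient" concrete, here is a minimal sketch checking that one SGD step with `weight_decay=wd` matches one plain SGD step on a loss with an explicit `0.5 * wd * ||w||^2` penalty. The toy loss `data_loss`, the input `x`, and the values of `lr` and `wd` are made up for illustration.

```python
import torch

lr, wd = 0.1, 0.01  # illustrative values

# Two copies of the same weight vector
w_builtin = torch.nn.Parameter(torch.tensor([1.0, -2.0, 3.0]))
w_manual = torch.nn.Parameter(w_builtin.detach().clone())
x = torch.tensor([0.5, 1.0, -1.0])

def data_loss(w):
    # Hypothetical squared error on a single example
    return (w @ x - 1.0) ** 2

opt_builtin = torch.optim.SGD([w_builtin], lr=lr, weight_decay=wd)
opt_manual = torch.optim.SGD([w_manual], lr=lr)

# (a) weight decay handled by the optimizer
opt_builtin.zero_grad()
data_loss(w_builtin).backward()
opt_builtin.step()

# (b) explicit L2 penalty added to the loss, no weight decay in the optimizer
opt_manual.zero_grad()
(data_loss(w_manual) + 0.5 * wd * w_manual.pow(2).sum()).backward()
opt_manual.step()

same_update = torch.allclose(w_builtin, w_manual)
print(same_update)
```

This equivalence holds for plain SGD as used here; it is not expected to hold for adaptive optimizers, where the penalty gradient gets rescaled.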

If you use an adaptive gradient algorithm such as the **Adam** optimizer with weight decay, use the **AdamW** optimizer instead to avoid the coupling problem between the L2 regularization and the adaptive learning rate.  
More info here: https://arxiv.org/abs/1711.05101

In [4]:
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
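The coupling can be seen on a single scalar weight. In this sketch (the values of `lr` and `wd`, the toy loss, and the helper `one_step` are illustrative assumptions), `Adam` adds `wd * w` to the gradient before its adaptive normalization, while `AdamW` shrinks the weight directly and normalizes only the loss gradient.

```python
import torch

lr, wd = 0.1, 0.1  # illustrative values, larger than usual to make the gap visible

def one_step(optimizer_cls):
    # Single weight starting at 1.0, with a toy loss whose gradient is 2
    w = torch.nn.Parameter(torch.tensor([1.0]))
    opt = optimizer_cls([w], lr=lr, weight_decay=wd)
    loss = (2.0 * w).sum()
    loss.backward()
    opt.step()
    return w.item()

w_adam = one_step(torch.optim.Adam)    # about 0.90: the decay term is normalized away
w_adamw = one_step(torch.optim.AdamW)  # about 0.89: decay of lr * wd * w applied separately
print(w_adam, w_adamw)
```

On the first step, Adam's normalized update has magnitude close to `lr` whatever the value of `wd`, so the penalty has almost no effect here, whereas AdamW subtracts an extra `lr * wd * w`.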

The **L1 norm** (i.e., the sum of the absolute values of the weights) also exists but is less popular and needs to be computed by hand:

In [6]:
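loss = 0  # placeholder: in practice, start from the data loss
l1_alpha = 1e-3  # strength of the L1 penalty
# sum of the absolute values of all parameters (weights and biases)
l1_term = sum(param.abs().sum() for param in model.parameters())
loss += l1_alpha * l1_term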

What is the impact of **L2 norm vs L1 norm**?
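One way to start exploring this question is a per-weight toy problem. Assume, purely for illustration, that the unregularized optimum of each weight is given by `v` and that the data loss is `0.5 * (w - v)^2`; the values of `v` and `lam` below are made up. Both penalized problems then have closed-form minimizers.

```python
import torch

v = torch.tensor([3.0, 0.5, -0.2, -2.0])  # hypothetical unregularized weights
lam = 1.0                                  # hypothetical penalty strength

# argmin_w 0.5 * (w - v)^2 + 0.5 * lam * w^2  ->  proportional shrinkage
w_l2 = v / (1 + lam)

# argmin_w 0.5 * (w - v)^2 + lam * |w|  ->  soft thresholding
w_l1 = torch.sign(v) * torch.clamp(v.abs() - lam, min=0.0)

print("L2:", w_l2)
print("L1:", w_l1)
print("zeros with L2:", int((w_l2 == 0).sum()), "| zeros with L1:", int((w_l1 == 0).sum()))
```

In this sketch, L2 shrinks every weight by the same factor without zeroing any of them, while L1 subtracts a fixed amount and sets the small weights exactly to zero, which is why L1 is associated with sparse solutions.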

## Dropout

When a neural network overfits, it memorizes the training set inside its weights (it acts as a database).

**Idea**: Inject noise into the layers by randomly deactivating a certain proportion of the neurons in each layer. The neural network can't rely on memorization because any given neuron might be turned off at a given iteration.

<center>
    <img src='images/dropout2.svg' width='60%'/>
    <p>Source: <a href='http://d2l.ai'>d2l.ai</a></p>
</center>
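The idea above can be sketched by hand in a few lines. This is a minimal version of the "inverted dropout" scheme, in which surviving activations are rescaled by `1 / (1 - p)` so that their expected value is unchanged; the function name `dropout_layer` is chosen for this example only.

```python
import torch

def dropout_layer(x, p):
    """Zero each element of x with probability p and rescale the survivors."""
    assert 0 <= p <= 1
    if p == 0:
        return x                      # nothing is dropped
    if p == 1:
        return torch.zeros_like(x)    # everything is dropped
    # Random binary mask: 1 keeps the neuron, 0 turns it off
    mask = (torch.rand_like(x) > p).float()
    return mask * x / (1.0 - p)

x = torch.arange(8, dtype=torch.float32)
print(dropout_layer(x, 0.0))  # unchanged
print(dropout_layer(x, 0.5))  # random: some entries zeroed, the others doubled
print(dropout_layer(x, 1.0))  # all zeros
```

A new mask is drawn on every call, so a different subset of neurons is switched off at each training iteration.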

Dropout should be placed after the activation function:

In [7]:
with_dropout = nn.Sequential(
    nn.Linear(18, 36),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(36, 36),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(36, 1)
)

**Dropout is only used during training**

In [8]:
# put the network in training mode
with_dropout.train()

# after training, put it in evaluation mode
with_dropout.eval()

Sequential(
  (0): Linear(in_features=18, out_features=36, bias=True)
  (1): ReLU()
  (2): Dropout(p=0.5, inplace=False)
  (3): Linear(in_features=36, out_features=36, bias=True)
  (4): ReLU()
  (5): Dropout(p=0.5, inplace=False)
  (6): Linear(in_features=36, out_features=1, bias=True)
)
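The effect of the two modes can be checked on a standalone `nn.Dropout` layer. This is a small sketch; the input of ones and the variable names are chosen only to make the behavior easy to read.

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()
y_train = drop(x)  # entries are either 0 (dropped) or 2 (kept and rescaled by 1 / (1 - p))

drop.eval()
y_eval = drop(x)   # dropout is disabled: the input passes through unchanged

print(y_train.unique())
print(torch.equal(y_eval, x))
```

Forgetting to call `eval()` before validation or inference leaves the random masking active, which makes predictions noisy and not reproducible.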