<a href="https://colab.research.google.com/github/kscaman/DL_ENS/blob/main/TP/TP08_stability_robustness.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TP08 - Stability and robustness
In this practical, we are going to investigate the stability and robustness of neural networks, during training and at inference. To do so, we will test different initialization schemes, see their effect on the regularity of the function, test different methods to improve stability in the presence of outliers in the training distribution, and learn to generate adversarial attacks on pre-trained models.

In [None]:
import io
import requests
from PIL import Image
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import numpy as np
import matplotlib.pyplot as plt
import math
from tqdm import tqdm

device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

# Part A - Weight initialization
Let's now investigate the effect of initialization on simple neural networks (MLPs).

## A.0 - Model creation
First, we need to automatically create large and deep MLPs. Create a function `MLP(dim_input, dim_output, dim_hidden, num_layers)` that returns an MLP with ReLU activations, `num_layers` layers and width `dim_hidden` using `nn.Sequential`.

Check that the MLP has the correct architecture for 1, 2 and 4 layers.
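
As a starting point, here is one possible sketch of such a function. It assumes that `num_layers` counts the `nn.Linear` layers, so that `num_layers=1` gives a purely linear model; the practical does not fix this convention, so adapt it if you count hidden layers instead.

```python
import torch
import torch.nn as nn

def MLP(dim_input, dim_output, dim_hidden, num_layers):
    # Assumption: num_layers is the number of nn.Linear layers.
    layers = []
    dim_in = dim_input
    for _ in range(num_layers - 1):
        # Hidden block: linear map followed by a ReLU activation.
        layers.append(nn.Linear(dim_in, dim_hidden))
        layers.append(nn.ReLU())
        dim_in = dim_hidden
    # Output layer, without activation.
    layers.append(nn.Linear(dim_in, dim_output))
    return nn.Sequential(*layers)

# Print the architecture for 1, 2 and 4 layers.
for n in (1, 2, 4):
    print(MLP(1, 1, 100, n))
```

Printing the `nn.Sequential` object lists every sub-module, which is usually enough to check the architecture by eye.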

In [None]:
### YOUR CODE HERE ###

## A.1 - Neural networks at initialization
We are now going to experiment with initialization. First, let's plot the function created by an MLP at initialization.

In [None]:
### YOUR CODE HERE ###

Plot multiple functions on the same figure.

In [None]:
### YOUR CODE HERE ###

Increase the number of layers to 10. What happens? Is that a problem for learning?

In [None]:
### YOUR CODE HERE ###

We are now going to fix this issue by applying a different initialization.
Create a function that initializes all weights of the MLP by using functions in [`nn.init`](https://pytorch.org/docs/stable/nn.init.html).
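
One common choice for ReLU networks is He (Kaiming) initialization, which draws weights with variance $2 / \text{fan\_in}$. The sketch below is only one option among those offered by `nn.init`; the helper name `init_weights_he` is hypothetical, and setting the biases to zero is an assumption rather than a requirement of the exercise.

```python
import math
import torch
import torch.nn as nn

def init_weights_he(m):
    # Hypothetical helper: He-normal weights and zero biases for linear layers.
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
        nn.init.zeros_(m.bias)

torch.manual_seed(0)
layer = nn.Linear(1000, 1000)
layer.apply(init_weights_he)

# The empirical standard deviation should be close to sqrt(2 / fan_in).
empirical_std = layer.weight.std().item()
expected_std = math.sqrt(2.0 / 1000)
print(empirical_std, expected_std)
```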

In [None]:
def init_weights(m):
    if isinstance(m, nn.Linear):
        ### YOUR CODE HERE ###

model = MLP(1, 1, 100, 10)
for _ in range(10):
    model.apply(init_weights)
    x = torch.linspace(-1, 1, 100).view(-1, 1)
    y = model(x)

    plt.plot(x.detach().numpy(), y.detach().numpy())
plt.show()

## A.2 - Fixing the issue with batch normalization
Add a batch norm `nn.BatchNorm1d` layer after each hidden layer.

In [None]:
### YOUR CODE HERE ###

How is the result different at initialization? Plot several functions generated by a 10-layer MLP at initialization (with default initialization).

In [None]:
### YOUR CODE HERE ###

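⚠ **Careful though:** In training mode, batch norm normalizes each activation with the statistics of the **current batch**, so the output for one sample depends on the **whole batch**. In **evaluation** mode, it uses running estimates of the **mean and variance accumulated during training** instead.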
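
The difference between the two modes can be seen on a single `nn.BatchNorm1d` layer. The following is a minimal sketch on synthetic data (mean 5, standard deviation 2, both chosen arbitrarily for illustration), relying on the default `momentum=0.1` and on running statistics that start at mean 0 and variance 1.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(1)                 # running_mean = 0, running_var = 1 at creation
x = 5.0 + 2.0 * torch.randn(256, 1)    # synthetic batch, far from zero mean

bn.train()
y_train = bn(x)    # normalized with the mean and variance of this batch

bn.eval()
y_eval = bn(x)     # normalized with the running estimates, not the batch statistics

# After one training-mode forward pass, the running mean has only moved
# a fraction `momentum` of the way toward the batch mean.
print(bn.running_mean, 0.1 * x.mean(dim=0))
print(y_train.mean().item(), y_eval.mean().item())
```

After a single training-mode pass the running estimates are still far from the data statistics, so the evaluation-mode output is not centered.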

In [None]:
# WITH TRAINING DATASET ON [-1,1]
model = MLP_bn(1, 1, 100, 10)
model.train()
x = torch.linspace(-1, 1, 100).view(-1, 1)
y = model(x)

plt.plot(x.detach().numpy(), y.detach().numpy())
plt.show()

model.eval()
x = torch.linspace(-1e-3, 1e-3, 100).view(-1, 1)
y = model(x)

plt.plot(x.detach().numpy(), y.detach().numpy())
plt.show()

In [None]:
# WITH TRAINING DATASET ON [-1e-3,1e-3]
model = MLP_bn(1, 1, 100, 10)
model.train()
x = torch.linspace(-1e-3, 1e-3, 100).view(-1, 1)
for _ in range(1000):
    y = model(x)

plt.plot(x.detach().numpy(), y.detach().numpy())
plt.show()

model.eval()
x = torch.linspace(-1e-3, 1e-3, 100).view(-1, 1)
y = model(x)

plt.plot(x.detach().numpy(), y.detach().numpy())
plt.show()

# Part B - Stability, outliers and overfitting
We now investigate the generalization capabilities of MLPs on a simple regression task: our aim is to learn the function $y=\sin(3 x)$. However, a small number of training samples (denoted as **outliers**) were perturbed by large additive noise (with a standard deviation of 100).

In [None]:
batch_size = 20
num_points = 200
outlier_std = 100
num_outliers = 5
torch.manual_seed(1234)

x_train = 4 * (2 * torch.rand(num_points, 1) - 1)
outlier_noise = torch.zeros_like(x_train)
outlier_noise[torch.randperm(num_points)[:num_outliers]] = outlier_std * torch.randn(num_outliers, 1)
y_train_clean = torch.sin(3 * x_train)
train_dataset = torch.utils.data.TensorDataset(x_train, y_train_clean, outlier_noise)
train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

x_test = 4 * (2 * torch.rand(100, 1) - 1)
y_test = torch.sin(3 * x_test)
test_dataset = torch.utils.data.TensorDataset(x_test, y_test)
test_dataloader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size)

def plot_function(model=None, zoom=True):
    x = torch.linspace(-4, 4, 1000)
    plt.plot(x_train, y_train_clean + outlier_noise, '.', label="train")
    plt.plot(x_test, y_test, '.', label="test")
    plt.plot(x, torch.sin(3 * x), label="target")
    if model is not None:
        output = model(x.unsqueeze(1).to(device)).cpu().detach().numpy()
        plt.plot(x, output, label="model")
    plt.legend()
    if zoom:
        plt.ylim([-2, 2])
    plt.xlabel('input value ($x$)')
    plt.ylabel('function value ($f(x)$)')
    plt.show()

plot_function(zoom=False)

Create three functions:

1.   A function `train_epoch(model, optimizer, clipping_threshold=None)` that trains the model for one epoch and applies gradient clipping if `clipping_threshold` is not `None`.
2.   A function `test(model)` that returns the loss over the test set.
3.   A function `train(model, optimizer, num_epochs, clipping_threshold=None)` that performs `num_epochs` epochs of training (with the MSE loss), and plots (in a figure) the train and test losses at each epoch.
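
For the clipping part, one option is `torch.nn.utils.clip_grad_norm_`, which rescales all gradients so that their global norm does not exceed a threshold. The sketch below shows a single optimization step; the helper `clipped_step` and the toy data are hypothetical and are not part of the functions requested above.

```python
import torch
import torch.nn as nn

def clipped_step(model, optimizer, loss, clipping_threshold=None):
    # Hypothetical helper: one optimizer step with optional gradient-norm clipping.
    optimizer.zero_grad()
    loss.backward()
    if clipping_threshold is not None:
        torch.nn.utils.clip_grad_norm_(model.parameters(), clipping_threshold)
    # Global gradient norm actually used for the update (after clipping, if any).
    grad_norm = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
    optimizer.step()
    return grad_norm.item()

torch.manual_seed(0)
model = nn.Linear(1, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(20, 1)
y = torch.sin(3 * x) + 100.0           # shifted targets, to produce large gradients
loss = nn.MSELoss(reduction='sum')(model(x), y)
norm = clipped_step(model, optimizer, loss, clipping_threshold=1.0)
print(norm)
```

Clipping must happen after `loss.backward()` and before `optimizer.step()`, otherwise it has no effect on the update.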


In [None]:
def train_epoch(model, optimizer, clipping_threshold=None):
    model.train()
    loss_function = nn.MSELoss(reduction='sum')
    losses = []
    for input, target, outlier_noise in train_dataloader:
        input, target, outlier_noise = input.to(device), target.to(device), outlier_noise.to(device)
        output = model(input)
        target_with_outliers = target + outlier_noise
        ### YOUR CODE HERE ###
    return np.sum(losses) / len(train_dataloader)

def test(model):
    model.eval()
    loss_function = nn.MSELoss(reduction='sum')
    losses = []
    with torch.no_grad():
        for input, target in test_dataloader:
            ### YOUR CODE HERE ###
    return np.sum(losses) / len(test_dataloader)

def train(model, optimizer, num_epochs, clipping_threshold=None):
    ### YOUR CODE HERE ###
    pass

Train an MLP of width $10^3$ and depth $4$ with Adam (lr=1e-3), and plot the target function and the output of the model. The model tries to fit the outliers, leading to poor performance on the test set.

In [None]:
### YOUR CODE HERE ###

We are going to test 3 methods: 1) adding batch norm layers, 2) adding gradient clipping, and 3) adding regularization (aka weight decay). Test all three methods. Try to find reasonable parameters for clipping and weight decay by hand. Is the output of the model smoother? Can we reduce the impact of outliers?
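
In PyTorch, weight decay is usually set through the optimizer. The sketch below passes `weight_decay` to `torch.optim.Adam`, which adds `weight_decay * parameter` to each gradient (an L2 penalty); the value `1e-4` and the small two-layer model are placeholders, not tuned recommendations.

```python
import torch
import torch.nn as nn

# Placeholder model; in the practical this would be the MLP built earlier.
model = nn.Sequential(nn.Linear(1, 1000), nn.ReLU(), nn.Linear(1000, 1))

# weight_decay=1e-4 is an arbitrary starting value, to be tuned by hand.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
print(optimizer.defaults['weight_decay'])
```

`torch.optim.AdamW` applies the decay directly to the weights instead of adding it to the gradient, so the same numerical value does not behave identically in the two optimizers.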

In [None]:
### YOUR CODE HERE ###

Train the model with all three methods together. Is the test error better?

In [None]:
### YOUR CODE HERE ###

### Bonus question
Use a validation set to find optimal parameters for the learning rate, weight decay and gradient clipping.
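
A possible way to set this up is to hold out part of the training data with `torch.utils.data.random_split` and to loop over a small grid of hyperparameters. The sketch below only builds the split and the grid; the 160/40 split, the synthetic dataset and the candidate values are all assumptions made for illustration.

```python
import itertools
import torch
from torch.utils.data import TensorDataset, random_split

torch.manual_seed(0)
x = 4 * (2 * torch.rand(200, 1) - 1)
dataset = TensorDataset(x, torch.sin(3 * x))

# Hold out 20% of the samples for validation, with a fixed seed for reproducibility.
generator = torch.Generator().manual_seed(0)
train_subset, val_subset = random_split(dataset, [160, 40], generator=generator)

# Candidate values (placeholders): learning rate, weight decay, clipping threshold.
grid = list(itertools.product([1e-4, 1e-3, 1e-2],
                              [0.0, 1e-4, 1e-2],
                              [None, 1.0, 10.0]))
print(len(train_subset), len(val_subset), len(grid))
```

Each configuration in `grid` would then be trained on `train_subset` and scored on `val_subset`, keeping the test set untouched until the final evaluation.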

In [None]:
### YOUR CODE HERE ###