<a href="https://colab.research.google.com/github/kscaman/DL_ENS/blob/main/DL_ENS_robustness_regularity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Robustness and regularity
In this practical, we will investigate the effect of initialization on simple neural networks (MLPs).

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
import math
from tqdm import tqdm

device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

First, we need to automatically create large and deep MLPs. Create a function `MLP(dim_input, dim_output, dim_hidden, num_layers)` that returns an MLP with ReLU activations, `num_layers` layers and width `dim_hidden` using `nn.Sequential`.

In [None]:
# YOUR CODE GOES HERE

Check that the MLP has the correct architecture for 1, 2 and 4 layers.

In [None]:
print(MLP(3,5,10,1))
print(MLP(3,5,10,2))
print(MLP(3,5,10,4))
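
One possible reading of the specification above is sketched below. It assumes that `num_layers` counts the linear layers, so that `num_layers=1` is a single linear map with no hidden layer; the exercise does not state this explicitly, and other conventions are equally valid.

```python
import torch
import torch.nn as nn

def MLP(dim_input, dim_output, dim_hidden, num_layers):
    # Assumption: num_layers counts the nn.Linear modules.
    dims = [dim_input] + [dim_hidden] * (num_layers - 1) + [dim_output]
    layers = []
    for i in range(num_layers):
        layers.append(nn.Linear(dims[i], dims[i + 1]))
        if i < num_layers - 1:
            # ReLU after every hidden layer, none after the output layer.
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

print(MLP(3, 5, 10, 2))
```

With this convention, `MLP(3, 5, 10, 4)` contains four linear layers and three ReLU activations.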

## Stability during training
We are now going to experiment with initialization. First, let's plot the function created by an MLP at initialization.

In [None]:
x = torch.linspace(-1, 1, 100).view(-1, 1)
# YOUR CODE GOES HERE

plt.plot(x.detach().numpy(), y.detach().numpy())
plt.show()

Plot multiple functions on the same figure.

In [None]:
# YOUR CODE GOES HERE

Increase the number of layers to 10. What happens? Is that a problem for learning?

In [None]:
# YOUR CODE GOES HERE
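
To complement the plots with numbers, the sketch below measures how much the output varies over $[-1, 1]$ for several depths, averaged over random default initializations. The helper `make_mlp` is hypothetical and assumes that the depth counts linear layers. In current PyTorch versions, `nn.Linear` draws its weights uniformly in $\pm 1/\sqrt{\text{fan\_in}}$ by default, which is expected to shrink the input-dependent part of the signal at every ReLU layer.

```python
import torch
import torch.nn as nn

def make_mlp(depth, width=100):
    # Hypothetical helper: `depth` linear layers with ReLU in between, default init.
    dims = [1] + [width] * (depth - 1) + [1]
    layers = []
    for i in range(depth):
        layers.append(nn.Linear(dims[i], dims[i + 1]))
        if i < depth - 1:
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

torch.manual_seed(0)
x = torch.linspace(-1, 1, 100).view(-1, 1)
ranges = {}
for depth in (2, 5, 10):
    spans = []
    for _ in range(20):
        with torch.no_grad():
            y = make_mlp(depth)(x)
        # Difference between the largest and smallest output on the grid.
        spans.append((y.max() - y.min()).item())
    ranges[depth] = sum(spans) / len(spans)
    print(f"depth {depth:2d}: mean output range {ranges[depth]:.2e}")
```

A range that collapses with depth means the network is almost constant at initialization, so the output carries very little information about the input.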

We are now going to fix this issue by applying a different initialization.
Create a function that initializes all weights of the MLP by using functions in [`nn.init`](https://pytorch.org/docs/stable/nn.init.html).

In [None]:
def init_weights(m):
    if isinstance(m, nn.Linear):
        # YOUR CODE GOES HERE

model = MLP(1, 1, 100, 10)
for _ in range(10):
    model.apply(init_weights)
    x = torch.linspace(-1, 1, 100).view(-1, 1)
    y = model(x)

    plt.plot(x.detach().numpy(), y.detach().numpy())
plt.show()
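
A minimal sketch of one possible initializer is given below. It uses He (Kaiming) normal initialization, which sets the weight standard deviation to $\sqrt{2/\text{fan\_in}}$ and is a common choice for ReLU networks. Setting the biases to zero is an assumption; the exercise leaves this choice open.

```python
import math
import torch
import torch.nn as nn

def init_weights(m):
    if isinstance(m, nn.Linear):
        # He normal init: weight std = sqrt(2 / fan_in).
        nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
        # Assumption: biases start at zero.
        nn.init.zeros_(m.bias)

torch.manual_seed(0)
layer = nn.Linear(100, 100)
layer.apply(init_weights)
# Empirical weight std next to the theoretical value sqrt(2 / fan_in).
print(layer.weight.std().item(), math.sqrt(2 / 100))
```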

Let's now look at the distribution of values for a single input (e.g. x=1).
Plot a histogram of outputs for random initializations of the weights.

In [None]:
# YOUR CODE GOES HERE
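
The sampling step behind such a histogram could look like the sketch below. It assumes a 10-layer MLP of width 100 and a hypothetical He initializer `he_init`; both are illustrative choices rather than the required solution.

```python
import torch
import torch.nn as nn

def he_init(m):
    # Hypothetical initializer: He normal weights, zero biases.
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
        nn.init.zeros_(m.bias)

# Assumed architecture: 10 linear layers of width 100 with ReLU in between.
dims = [1] + [100] * 9 + [1]
layers = []
for i in range(10):
    layers.append(nn.Linear(dims[i], dims[i + 1]))
    if i < 9:
        layers.append(nn.ReLU())
model = nn.Sequential(*layers)

torch.manual_seed(0)
x_one = torch.ones(1, 1)
outputs = []
for _ in range(200):
    model.apply(he_init)  # draw a fresh set of weights
    with torch.no_grad():
        outputs.append(model(x_one).item())
outputs = torch.tensor(outputs)
print(f"mean {outputs.mean().item():.3f}, std {outputs.std().item():.3f}")
```

Calling `plt.hist(outputs.numpy(), bins=30)` on the result then draws the histogram.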

## Fixing the initialization with batch normalization.
Add a batch norm `nn.BatchNorm1d` layer after each hidden layer.

In [None]:
# YOUR CODE GOES HERE
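
A sketch of the batch-norm variant is shown below. Two points are assumptions: `num_layers` counts the linear layers, and the normalization is placed between the linear layer and the ReLU. Placing it after the activation is also a reasonable reading of the instruction.

```python
import torch
import torch.nn as nn

def MLP_bn(dim_input, dim_output, dim_hidden, num_layers):
    dims = [dim_input] + [dim_hidden] * (num_layers - 1) + [dim_output]
    layers = []
    for i in range(num_layers):
        layers.append(nn.Linear(dims[i], dims[i + 1]))
        if i < num_layers - 1:
            # Assumed ordering: Linear -> BatchNorm1d -> ReLU for hidden layers.
            layers.append(nn.BatchNorm1d(dims[i + 1]))
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

model_bn = MLP_bn(1, 1, 100, 10)
y_bn = model_bn(torch.linspace(-1, 1, 100).view(-1, 1))
print(y_bn.shape)
```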

How is the result different at initialization? Plot several functions generated by a 10-layer MLP at initialization (with default initialization).

In [None]:
for _ in range(10):
    model = MLP_bn(1, 1, 100, 10)
    x = torch.linspace(-1, 1, 100).view(-1, 1)
    y = model(x)

    plt.plot(x.detach().numpy(), y.detach().numpy())
plt.show()

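⚠ **Careful though:** In training mode, batch norm normalizes each batch with its own statistics, so the output for a given input depends on the **whole batch**. In **evaluation** mode, it instead uses running estimates of the **training mean and standard deviation**, which can be far off if the evaluation inputs are distributed differently or if too few training batches have been seen.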

In [None]:
# WITH TRAINING DATASET ON [-1,1]
model = MLP_bn(1, 1, 100, 10)
model.train()
x = torch.linspace(-1, 1, 100).view(-1, 1)
y = model(x)

plt.plot(x.detach().numpy(), y.detach().numpy())
plt.show()

model.eval()
x = torch.linspace(-1e-3, 1e-3, 100).view(-1, 1)
y = model(x)

plt.plot(x.detach().numpy(), y.detach().numpy())
plt.show()

In [None]:
# WITH TRAINING DATASET ON [-1e-3,1e-3]
model = MLP_bn(1, 1, 100, 10)
model.train()
x = torch.linspace(-1e-3, 1e-3, 100).view(-1, 1)
for _ in range(1000):
    y = model(x)

plt.plot(x.detach().numpy(), y.detach().numpy())
plt.show()

model.eval()
x = torch.linspace(-1e-3, 1e-3, 100).view(-1, 1)
y = model(x)

plt.plot(x.detach().numpy(), y.detach().numpy())
plt.show()
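
The difference between the two modes can be isolated on a single `nn.BatchNorm1d` layer, as in the sketch below. The data (mean about 5, standard deviation about 2) is made up for illustration. With the default momentum of 0.1, a single forward pass only moves the running estimates one tenth of the way toward the batch statistics, which is why many passes are needed before evaluation mode matches training mode.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(1)                 # default momentum is 0.1
x = 5.0 + 2.0 * torch.randn(256, 1)    # illustrative batch: mean ~5, std ~2

bn.train()
y_train = bn(x)                        # normalized with the statistics of x itself
stats_after_one = (bn.running_mean.item(), bn.running_var.item())

for _ in range(200):                   # repeated passes let the running estimates converge
    bn(x)

bn.eval()
y_eval = bn(x)                         # normalized with the running estimates

print("train-mode output mean/std:", y_train.mean().item(), y_train.std().item())
print("running mean/var after 1 pass:", stats_after_one)
print("running mean/var after 201 passes:", bn.running_mean.item(), bn.running_var.item())
print("eval-mode output mean/std:", y_eval.mean().item(), y_eval.std().item())
```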

## Generalization and overfitting
We now investigate the generalization capabilities of MLPs on a simple regression task $f(x)=\sin(x) + \varepsilon$, where $\varepsilon$ is Gaussian noise with standard deviation $0.3$.

In [None]:
batch_size = 50
num_points = 50
x_train = 4 * (2 * torch.rand(num_points, 1) - 1)
y_train = torch.sin(x_train) + 0.3 * torch.randn_like(x_train)
train_dataset = torch.utils.data.TensorDataset(x_train, y_train)
train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

x_test = 4 * (2 * torch.rand(100, 1) - 1)
y_test = torch.sin(x_test) + 0.3 * torch.randn_like(x_test)
test_dataset = torch.utils.data.TensorDataset(x_test, y_test)
test_dataloader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size)

x = torch.linspace(-4, 4, 1000)
plt.plot(x_train, y_train, '.', label="train")
plt.plot(x, torch.sin(x), label="target")
plt.legend()
plt.xlabel('x')
plt.ylabel('sin(x)')
plt.show()
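
Because the noise is independent of $x$ and has standard deviation $0.3$, no predictor can be expected to reach a mean squared error below $0.3^2 = 0.09$ on noisy labels, not even the true function $\sin$. The sketch below checks this floor empirically on a large fresh sample; the sample size `n` is an arbitrary choice.

```python
import torch

torch.manual_seed(0)
n = 100_000
x = 4 * (2 * torch.rand(n, 1) - 1)             # uniform on [-4, 4], as in the dataset
y = torch.sin(x) + 0.3 * torch.randn_like(x)   # noisy labels
# MSE of the noiseless target against the noisy labels.
noise_floor = torch.mean((y - torch.sin(x)) ** 2).item()
print(f"MSE of the true target on noisy labels: {noise_floor:.4f}")
```

A training loss well below this value is a sign that the model is fitting the noise.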

We create a training pipeline for MLPs of width `d`.

In [None]:
def create_model(d):
    model = MLP(1, 1, d, 4).to(device)
    loss_function = nn.MSELoss(reduction='mean')
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)
    return model, loss_function, optimizer, scheduler

def train(model, loss_function, optimizer):
    model.train()
    losses = []
    for input, target in train_dataloader:
        input, target = input.to(device), target.to(device)
        output = model(input)
        loss = loss_function(output, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        losses.append(loss.item())
    return np.mean(losses)

def test(model, loss_function):
    model.eval()
    losses = []
    with torch.no_grad():
        for input, target in test_dataloader:
            input, target = input.to(device), target.to(device)
            output = model(input)
            loss = loss_function(output, target)

            losses.append(loss.item())
    return np.mean(losses)

def training_loop(d, num_epochs):
    train_losses = []
    test_losses = []
    model, loss_function, optimizer, scheduler = create_model(d)
    for i in range(num_epochs):
        train_losses.append(train(model, loss_function, optimizer))
        test_losses.append(test(model, loss_function))
        # print(f"Epoch {i}: {train_losses[-1]:.3f} / {test_losses[-1]:.3f}")
        scheduler.step(train_losses[-1])
    return train_losses[-1], test_losses[-1], model
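
The scheduler used in this pipeline, `ReduceLROnPlateau`, only lowers the learning rate once the monitored metric has stopped improving for more than `patience` consecutive calls to `step`, so it has to be called once per epoch to have any effect. The sketch below illustrates this on a made-up loss that never improves; the linear model is a placeholder.

```python
import torch
import torch.nn as nn

model = nn.Linear(1, 1)  # placeholder model, only needed to build the optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Defaults: mode='min', factor=0.1, patience=10.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)

lrs = []
for epoch in range(30):
    stalled_loss = 1.0            # made-up metric that never improves
    scheduler.step(stalled_loss)  # one call per epoch
    lrs.append(optimizer.param_groups[0]['lr'])

print("learning rate after first and last epoch:", lrs[0], lrs[-1])
```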

As the model size increases, the training error decreases. However, the test error first decreases and then increases, as the model overfits the data.

In [None]:
dim_hidden = [int(d) for d in np.unique(np.round(10 ** np.linspace(0, 2, 20)))]
train_losses, test_losses = [], []
for d in tqdm(dim_hidden):
    train_loss, test_loss, model = training_loop(d, 1000)
    train_losses.append(train_loss)
    test_losses.append(test_loss)

plt.loglog(dim_hidden, train_losses, label="train")
plt.loglog(dim_hidden, test_losses, label="test")
plt.legend()
plt.xlabel("width of the MLP")
plt.ylabel("train and test losses")
plt.show()

We can see that the model has learned the noise in the data.

In [None]:
x = torch.linspace(-4, 4, 1000)
plt.plot(x_train, y_train, '.', label="train")
plt.plot(x_test, y_test, '.', label="test")
plt.plot(x, model(x.view(-1,1).to(device)).cpu().detach().numpy(), label="model")
plt.plot(x, torch.sin(x), label="target")
plt.legend()
plt.xlabel('model input')
plt.ylabel('model output')
plt.show()

If the number of parameters increases drastically, this tends to regularize (smooth) the model, and thus to improve generalization. This is called **implicit regularization**.

In [None]:
dim_hidden = [int(d) for d in np.unique(np.round(10 ** np.linspace(0, 4, 20)))]
train_losses, test_losses = [], []
for d in tqdm(dim_hidden):
    train_loss, test_loss, model = training_loop(d, 1000)
    train_losses.append(train_loss)
    test_losses.append(test_loss)

plt.loglog(dim_hidden, train_losses, label="train")
plt.loglog(dim_hidden, test_losses, label="test")
plt.legend()
plt.xlabel("width of the MLP")
plt.ylabel("train and test losses")
plt.show()
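
To relate the width on the horizontal axis to the number of parameters, the sketch below counts the parameters of a 4-layer MLP of width `d`. It assumes four linear layers with scalar input and output, which gives $2d^2 + 5d + 1$ parameters; under that assumption the count already exceeds the 50 training points from width 4 onward. The helper `count_parameters` is hypothetical.

```python
import torch.nn as nn

def count_parameters(d, num_layers=4):
    # Assumed architecture: `num_layers` linear layers, hidden width d,
    # scalar input and scalar output.
    dims = [1] + [d] * (num_layers - 1) + [1]
    layers = [nn.Linear(dims[i], dims[i + 1]) for i in range(num_layers)]
    return sum(p.numel() for layer in layers for p in layer.parameters())

for d in (1, 3, 4, 10, 100):
    print(f"width {d:4d}: {count_parameters(d):6d} parameters")
```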

In [None]:
x = torch.linspace(-4, 4, 1000)
plt.plot(x_train, y_train, '.', label="train")
plt.plot(x_test, y_test, '.', label="test")
plt.plot(x, model(x.view(-1,1).to(device)).cpu().detach().numpy(), label="model")
plt.plot(x, torch.sin(x), label="target")
plt.legend()
plt.xlabel('model input')
plt.ylabel('model output')
plt.show()
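
One rough way to put a number on how smooth a fitted function is consists in measuring its largest finite-difference slope on a grid. The helper `max_slope` below is hypothetical and only gives a lower bound on the true Lipschitz constant over the interval; it is illustrated on $\sin$, whose slope never exceeds 1.

```python
import torch

def max_slope(f, lo=-4.0, hi=4.0, n=1000):
    # Hypothetical helper: largest finite-difference slope of f on a regular grid.
    x = torch.linspace(lo, hi, n).view(-1, 1)
    with torch.no_grad():
        y = f(x)
    return ((y[1:] - y[:-1]).abs() / (x[1:] - x[:-1])).max().item()

print("max slope of sin on [-4, 4]:", max_slope(torch.sin))
```

For a trained network, `max_slope(lambda x: model(x.to(device)).cpu())` can be compared across widths: a smoother fit should give a smaller value.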