# Regularization

This week you started learning the basics of convolution and several regularization techniques: data augmentation, L1/L2 regularization, dropout, and batchnorm. In this notebook, we will go over these concepts and how to implement them in PyTorch.

In [1]:
# For interactive plotting:
import os
import matplotlib.pyplot as plt
import ipywidgets as widgets
try:
    from google.colab import output
    output.enable_custom_widget_manager()
except ImportError:
    pass
try:
    %matplotlib widget
except Exception:
    os.system('pip install ipympl -qq')
    %matplotlib widget


import torch
from torch import nn, optim

from torchvision import datasets
from torch.utils.data import Dataset
from torchvision.transforms import v2

# helper function to inspect a tensor:
def print_tensor_info(
        name: str, 
        tensor, # torch.Tensor
        ):
    print(f'{name}')
    print(20*'-')
    if not isinstance(tensor, torch.Tensor):
        print(f'It is {type(tensor).__name__}!')
        print(20*'='+'\n')
        return
    # print shape, dtype, device, requires_grad
    print(f'shape: {tensor.shape}')
    print(f'dtype: {tensor.dtype}')
    print(f'device: {tensor.device}')
    print(f'requires_grad: {tensor.requires_grad}')
    print(20*'='+'\n')


# helper class to visualize image data interactively in Jupyter Notebook
class ImageDataViz:
    """
    An interactive image data visualization tool for Jupyter Notebook.
    Make sure to use the magic command: %matplotlib widget
    """
    def __init__(self, data: Dataset):
        self.data = data
        self.n_samples = len(data)
        self.index = widgets.IntSlider(
            value=0, 
            min=0, 
            max=self.n_samples-1, 
            step=1, 
            description='Index', 
            continuous_update=True,
            layout=widgets.Layout(width='40%'),
        )

    def update(self, index: int):
        x, y = self.data[index]
        image = x.moveaxis(0, -1).squeeze().numpy()
        self.img.set_data(image)
        self.ax.set_title(f'Label: {y}')

    def show(self):
        self.fig, self.ax = plt.subplots()
        x, y = self.data[0]
        image = x.moveaxis(0, -1).squeeze().numpy()
        self.img = self.ax.imshow(image)
        self.ax.axis('off')
        self.ax.set_title(f'Label: {y}')
        widgets.interact(self.update, index=self.index)

## Initialization

Initialization is a simple but very important part of training a neural network, and it is easy to use in PyTorch. Choose an initializer [here](https://pytorch.org/docs/stable/nn.init.html) and pass in the parameter you want to initialize, along with the initializer's hyperparameters. Functions whose names end with an underscore modify the tensor in place. Try different initializations to find out which one works better. Here's an example:

In [None]:
w = torch.empty(3, 5)
nn.init.xavier_normal_(w)
print_tensor_info('w', w)
print(w)
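
In practice you usually initialize the parameters of a whole model rather than a standalone tensor. The cell below is a minimal sketch of one way to do that, assuming a small fully connected network with ReLU activations; `init_weights` and `mlp` are hypothetical names used only for illustration, and Kaiming initialization is just one reasonable choice here.

```python
import torch
from torch import nn


def init_weights(module: nn.Module) -> None:
    """Hypothetical helper: Kaiming-normal weights and zero biases for Linear layers."""
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight, nonlinearity='relu')
        nn.init.zeros_(module.bias)


# a small illustrative network (not part of the original notebook)
mlp = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 10))

# apply() calls init_weights on every submodule, and on mlp itself
mlp.apply(init_weights)

print('std of first-layer weights:', mlp[0].weight.std().item())
print('largest first-layer bias:', mlp[0].bias.abs().max().item())
```

With Kaiming-normal initialization the weight standard deviation should be close to $\sqrt{2/\text{fan\_in}}$, which you can compare against the printed value.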

## Image data augmentation

A very simple but important technique to prevent neural networks from overfitting to a small training dataset is data augmentation. Data augmentation is any transformation applied to input data that preserves the correct output. In PyTorch, many transforms are available in `torchvision.transforms.v2` that we imported as `v2`. `v2` also has basic preprocessing transforms like changing the dtype, scaling, and normalization. We encourage you to explore the [online documentation](https://pytorch.org/vision/stable/transforms.html) to learn more about what it has to offer and how to use it. We will show a simple one as an example. You can explore with other datasets and other transforms yourself!

Note: Since this technique does not explicitly affect the neural network design or parameters, some may say that it is not technically regularization, and some call it implicit regularization.

In [3]:
# Loading the MNIST dataset from torchvision

dataset = datasets.MNIST(
    root = 'MNIST',
    train = True,
    download = True,
    # transform the data to torch.Tensor and scale it to [0, 1]
    transform = v2.Compose([
        # augmentation:
        v2.RandomAffine(degrees=10, translate=(0.1, 0.1), scale=(0.9, 1.1)),
        # convert to tensor and scale to [0, 1]
        v2.ToImage(),
        v2.ToDtype(torch.float32, scale=True),
    ])
)

In [None]:
ImageDataViz(dataset).show()

## L1/L2 regularization

L1 and L2 regularization are other common regularization methods that penalize the magnitude of the model's weights. Let's look at the loss function of a model with each of these regularizers and how the weight update changes.


### L1 Loss:
The total loss after adding the L1 regularization is:
$$
L = L_{pred} + \lambda L_{reg} = \sum_{\text{samples}} \text{div}(y_{pred}, y_{true}) + \lambda \sum_{i} |w_i|
$$
where $\lambda$ is a hyperparameter that controls the importance of the regularization term. The higher $\lambda$ is, the more strongly the model is pushed toward smaller weights. If you set it too high, the model risks underfitting because the prediction loss carries too little weight compared to the regularization loss.

Let's look at the derivative of the loss now. 
$$
\frac{\partial L}{\partial w_i} = \frac{\partial L_{pred}}{\partial w_i} + \lambda\frac{\partial |w_i|}{\partial w_i} 
$$
The last term depends only on the sign of $w_i$:
$$
w_i > 0 \rightarrow \frac{\partial |w_i|}{\partial w_i} = 1
$$
$$
w_i < 0 \rightarrow \frac{\partial |w_i|}{\partial w_i} = -1
$$
Let's see what happens in the gradient descent update.

$$
w_i > 0 \rightarrow w_i(t+1) = w_i(t) - \alpha (\frac{\partial L_{pred}}{\partial w_i(t)} + \lambda) = w_i(t) - \alpha\lambda - \alpha\frac{\partial L_{pred}}{\partial w_i(t)}
$$
$$
w_i < 0 \rightarrow w_i(t+1) = w_i(t) - \alpha (\frac{\partial L_{pred}}{\partial w_i(t)} - \lambda) = w_i(t) + \alpha\lambda - \alpha\frac{\partial L_{pred}}{\partial w_i(t)} 
$$
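As you can see, this regularization adds a term of constant magnitude, $\alpha\lambda$, to the gradient update. Whether the weight is positive or negative, this term moves the weight toward zero in each update by the same amount regardless of the weight's size, so small weights can be driven all the way to zero. That is why this regularization produces sparse weights.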

L1 regularization is more common in shallow machine learning, where sparse weights correspond to omitting some of the hand-designed features and make the model easier to interpret. In deep learning models, the large number of parameters is actually a strength of the model, and sparsifying the weights has more downsides than benefits.
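
PyTorch optimizers do not take an L1 coefficient, so one common approach is to add the penalty to the loss yourself. The following is a minimal sketch, assuming a tiny linear classifier on random data; `l1_lambda` and the other names are illustrative. It also checks that the gradient of the penalty is $\lambda\,\text{sign}(w_i)$, as derived above.

```python
import torch
from torch import nn

torch.manual_seed(0)

model = nn.Linear(4, 3)
x = torch.randn(8, 4)
y = torch.randint(0, 3, (8,))
l1_lambda = 1e-2  # illustrative value, not a recommendation

# prediction loss plus the L1 penalty on the weights (biases are left out here)
pred_loss = nn.functional.cross_entropy(model(x), y)
l1_penalty = model.weight.abs().sum()
total_loss = pred_loss + l1_lambda * l1_penalty

# gradient of the prediction loss alone, kept for comparison
(grad_pred,) = torch.autograd.grad(pred_loss, model.weight, retain_graph=True)

# gradient of the total loss, stored in model.weight.grad
total_loss.backward()

# the penalty should contribute lambda * sign(w) to the gradient
expected = grad_pred + l1_lambda * torch.sign(model.weight)
print('matches derivation:', torch.allclose(model.weight.grad, expected, atol=1e-6))
```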

### L2 Loss:
The L2 regularization loss is more common in deep neural networks. Let's look at the total loss, and the gradient descent update to see the impact of this regularization. We will also see why it is called __weight decay__.

$$
L = L_{pred} + \lambda L_{reg} = \sum_{\text{samples}} \text{div}(y_{pred}, y_{true}) + \frac{\lambda}{2} \sum_{i} w_i^2
$$
$$
\frac{\partial L}{\partial w_i} = \frac{\partial L_{pred}}{\partial w_i} + \frac{\lambda}{2}\frac{\partial w_i^2}{\partial w_i} = \frac{\partial L_{pred}}{\partial w_i} + \lambda w_i
$$
$$
w_i(t+1) = w_i(t) - \alpha (\frac{\partial L_{pred}}{\partial w_i(t)} + \lambda w_i(t)) = w_i(t) - \alpha\lambda w_i(t) - \alpha\frac{\partial L_{pred}}{\partial w_i(t)} 
$$
The term corresponding to the regularization loss is $\alpha\lambda w_i(t)$, which can be read as a shrinking (or decaying) term: each update multiplies the weight by $(1-\alpha\lambda)$ before applying the prediction-loss gradient. Since this term is proportional to the weight rather than constant, it is free from some of the downsides of L1 regularization. However, you should still search for a good $\lambda$ to avoid both underfitting and overfitting.

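As you saw above, this regularization can be implemented directly in the update step rather than in the loss function. In PyTorch, you can add L2 regularization by passing `weight_decay` to your optimizer. Note that `optim.Adam` adds $\lambda w_i$ to the gradient before its adaptive scaling, so the effective shrinkage is not exactly the plain decay shown above; `optim.AdamW` applies the decay directly to the weights instead. You can see the next cell as an example: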

In [5]:
model = nn.Linear(28*28, 10)
optimizer = optim.Adam(
    params = model.parameters(),
    lr = 0.001,
    weight_decay = 1e-5, # L2 regularization coefficient
)
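
To connect the derivation with the `weight_decay` argument, the sketch below compares two copies of the same model after one step: one trained with an explicit $\frac{\lambda}{2}\sum_i w_i^2$ penalty in the loss, the other with `weight_decay`. It assumes plain `optim.SGD` without momentum, where the two are expected to coincide; the data, `lam`, and the model names are illustrative. Since `weight_decay` acts on every parameter passed to the optimizer, the explicit penalty here covers the bias as well.

```python
import copy

import torch
from torch import nn, optim

torch.manual_seed(0)

lr, lam = 0.1, 0.01  # illustrative learning rate and L2 coefficient
x = torch.randn(8, 4)
y = torch.randn(8, 2)

model_a = nn.Linear(4, 2)
model_b = copy.deepcopy(model_a)  # identical starting weights

# (a) explicit L2 penalty in the loss, no weight decay in the optimizer
opt_a = optim.SGD(model_a.parameters(), lr=lr)
penalty = sum((p ** 2).sum() for p in model_a.parameters())
loss_a = nn.functional.mse_loss(model_a(x), y) + 0.5 * lam * penalty
loss_a.backward()
opt_a.step()

# (b) plain prediction loss, L2 handled by weight_decay
opt_b = optim.SGD(model_b.parameters(), lr=lr, weight_decay=lam)
loss_b = nn.functional.mse_loss(model_b(x), y)
loss_b.backward()
opt_b.step()

# compare the updated parameters of the two models
same_update = all(
    torch.allclose(p_a, p_b, atol=1e-6)
    for p_a, p_b in zip(model_a.parameters(), model_b.parameters())
)
print('same update:', same_update)
```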

## Dropout

Dropout is a regularization technique that randomly turns off some neurons (or channels) of the feature map and scales the remaining ones up to roughly maintain the same range for the output. This keeps the network from relying heavily on certain features that might only be dominant in the training data, and it also helps reduce the correlation between learned features. Dropout is only applied during training and is turned off during inference.

$$
\text{Dropout}_p(x) =
\begin{cases}
    0 & \text{with probability  } p \\
    \dfrac{x}{1-p}  & \text{with probability  } 1-p
\end{cases}
$$

In PyTorch, you can use `nn.Dropout()` to turn off neurons randomly, or `nn.Dropout1d()`, `nn.Dropout2d()`, `nn.Dropout3d()` to randomly turn off entire channels in 1D, 2D, or 3D data respectively. You can make the dropout probability one of the initialization arguments of your model, so that you can set it to 0 when you do not want to use dropout. Let's take a closer look. Try changing the dropout probability and see the results. Run the cells multiple times to see if things are different each time. What do you notice? How does dropout behave differently during training and evaluation?

In [None]:
dropout = nn.Dropout(p=0.5)

batch_size = 10
dim = 3

T = torch.randn(batch_size, dim)

print('T before dropout:')
print(T)

In [None]:
# run this cell multiple times to see the effect of dropout
dropout.train()
print('T after dropout in training mode:')
print(dropout(T))
# Compare the output of the dropout layer in training and evaluation mode
# what happens to the values that are not dropped out?

In [None]:
# run this cell multiple times to see the effect of dropout
dropout.eval()
print('T after dropout during evaluation:')
print(dropout(T))
# Compare the output of the dropout layer in training and evaluation mode
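
The dropout formula above is short enough to write by hand, which can help when interpreting the outputs of the previous cells. The function below is a hypothetical re-implementation for illustration only (it assumes $p < 1$); in a real model you would use `nn.Dropout`.

```python
import torch


def manual_dropout(x: torch.Tensor, p: float, training: bool = True) -> torch.Tensor:
    """Hypothetical re-implementation of the dropout formula, assuming p < 1."""
    if not training or p == 0.0:
        # evaluation mode (or p = 0): the input passes through unchanged
        return x
    # each entry is kept (mask = 1) with probability 1 - p
    keep_mask = (torch.rand_like(x) >= p).to(x.dtype)
    # kept entries are scaled by 1 / (1 - p) so the expected value is unchanged
    return x * keep_mask / (1.0 - p)


x = torch.ones(1000, 10)
out = manual_dropout(x, p=0.5)

print('distinct output values:', out.unique())
print('mean after dropout:', out.mean().item())
print('unchanged in eval mode:', torch.equal(manual_dropout(x, p=0.5, training=False), x))
```

With an all-ones input, every output is either 0 (dropped) or $1/(1-p)$ (kept and rescaled), and the mean stays close to 1.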

### Channel-wise dropout

In spatial data with 1D, 2D or 3D structure, the feature dimension is also commonly referred to as the channel dimension. To randomly turn off some feature maps (channels) at all locations, you can use `nn.Dropout1d()`, `nn.Dropout2d()`, `nn.Dropout3d()` for 1D, 2D, 3D data respectively. Here is an example for 2D (image) data:

In [None]:
dropout2d = nn.Dropout2d(p=0.5)

# create a random tensor with 8 channels and 5x5 spatial dimensions

batch_size, channels, height, width = 1, 8, 5, 5

T = torch.randn(batch_size, channels, height, width)

print('T before dropout2d:')
print(T)

In [None]:
dropout2d.train()
print('T after dropout2d in training mode:')
print(dropout2d(T))

In [None]:
dropout2d.eval()
print('T after dropout2d during evaluation:')
print(dropout2d(T))
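
With random inputs it can be hard to see at a glance that whole channels are dropped together. The sketch below makes that easier to check by feeding an all-ones tensor through `nn.Dropout2d`; the variable names are illustrative.

```python
import torch
from torch import nn

channel_dropout = nn.Dropout2d(p=0.5).train()

# all-ones input: 1 sample, 8 channels, 5x5 spatial grid
ones = torch.ones(1, 8, 5, 5)
out = channel_dropout(ones)

# if a channel is dropped or kept as a whole, its min and max coincide
per_channel_min = out.amin(dim=(2, 3)).squeeze(0)
per_channel_max = out.amax(dim=(2, 3)).squeeze(0)

print('value in each channel:', per_channel_max)
print('channels are uniform:', torch.equal(per_channel_min, per_channel_max))
```

Each channel should be either entirely 0 or entirely $1/(1-p)$, in contrast to `nn.Dropout`, which decides independently for every element.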

## Batch Normalization

There are several types of normalization in deep learning that try to keep the outputs of the inner layers within a certain range or distribution. The one you learn about this week is Batch Normalization (BatchNorm). As the name suggests, the normalization is done across the mini-batch. It first shifts and scales the input so that each feature has zero mean and unit variance, and then applies an affine transform (a linear scaling plus a shift) so the final output has mean $\mu$ and variance $\sigma^2$, where $\mu$ and $\sigma$ are learnable parameters updated by gradient steps (in PyTorch they are stored as `bias` and `weight`, respectively). For each feature/channel dimension, the output of batchnorm is calculated as:
$$
BatchNorm(x) = \mu + \sigma\cdot\dfrac{x-E}{\sqrt{V+\epsilon}}
$$
where $E$ and $V$ are the mean and variance of the feature, and $\epsilon$ is a small positive number to avoid division by zero. The same calculation is broadcast over all samples and locations. However, we don't know the true mean and variance of the features. Depending on whether we are training or evaluating, different values are used for $E$ and $V$:
- During __training__, the layer calculates the mean and variance of $x$ in the current mini-batch and uses them to shift and scale the data. It also uses them to update an exponential running average of the mean and variance of the feature over the whole dataset (let's call them $\bar{E}$ and $\bar{V}$). In the beginning, $\bar{E}=0$ and $\bar{V}=1$. They are stored in the module's __buffers__ and are updated with every batch. They are not learnable parameters and are not updated by gradient steps. The parameters of the output distribution ($\mu$ and $\sigma$, the trainable __parameters__ of batchnorm) are updated by the gradient-based optimizer. Here's everything that happens in the forward pass during training:
$$
E = \text{mean of $x$ over the current mini-batch}
$$
$$
V = \text{variance of $x$ over the current mini-batch}
$$
$$
\bar{E} = (1-m)\bar{E} + mE
$$
$$
\bar{V} = (1-m)\bar{V} + mV
$$
$$
BatchNorm(x) = \mu + \sigma\cdot\dfrac{x-E}{\sqrt{V+\epsilon}}
$$
where $m$ is the momentum. In PyTorch, the default momentum is $0.1$.
- During __evaluation__, the layer uses $\bar{E}$ and $\bar{V}$ to shift and scale the input data to zero mean and unit variance, and then applies the learned mean $\mu$ and standard deviation $\sigma$. Since the layer has seen a lot of data during training, these running estimates are more reliable than the statistics of a single mini-batch. Using them also keeps the model's output for a sample from depending on the other samples in the mini-batch. Here's how it works during evaluation:
$$
BatchNorm(x) = \mu + \sigma\cdot\dfrac{x-\bar{E}}{\sqrt{\bar{V}+\epsilon}}
$$

PyTorch has different BatchNorm layers you can use for different structures of data (1D, 2D, 3D). Let's take a look at `nn.BatchNorm2d()`:


In [None]:
# define a simple random tensor:

batch_size, channels, height, width = 13, 3, 5, 4

T = torch.randn(batch_size, channels, height, width)*20 + 100 # artificially skew the data

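# dimensions to reduce over: batch, height, and width (everything except channels)
stat_dims = [0, 2, 3]

# per-channel statistics of the raw data, before any normalization

print('T statistics over the batch dimension:')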
print(f'mean:\n{T.mean(stat_dims)}')
print(f'variance:\n{T.var(stat_dims)}')


In [None]:
# Let's look at a batchnorm layer and what it has

bn2d = nn.BatchNorm2d(num_features=channels)
bn2d

In [None]:
# Let's take a closer look at all the members (attributes) of the BatchNorm layer
# Among the ones that start with an underscore, look closely at _parameters and _buffers
vars(bn2d) # or bn2d.__dict__
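
The `_parameters` and `_buffers` dictionaries are internal attributes. A minimal sketch of the public way to list the same information is shown below, using `named_parameters()` and `named_buffers()`; the layer and variable names are illustrative.

```python
import torch
from torch import nn

bn = nn.BatchNorm2d(num_features=3)

# learnable parameters: updated by the optimizer through gradients
param_names = []
for name, param in bn.named_parameters():
    param_names.append(name)
    print(f'parameter {name}: shape={tuple(param.shape)}, requires_grad={param.requires_grad}')

# buffers: saved with the module but not updated by gradients
buffer_names = []
for name, buffer in bn.named_buffers():
    buffer_names.append(name)
    print(f'buffer {name}: shape={tuple(buffer.shape)}, requires_grad={buffer.requires_grad}')
```

In the notation used above, `weight` plays the role of $\sigma$, `bias` the role of $\mu$, and `running_mean` and `running_var` hold $\bar{E}$ and $\bar{V}$.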

In [None]:
# train mode
# Try running this more than once to see the effect of batch normalization. See how the running mean and variance change.

bn2d_train = nn.BatchNorm2d(num_features=channels).train()
for i in range(100):
    T_bn = bn2d_train(T)
    if i % 10 == 0:
        print(f'Iteration {i+1}')
        print(bn2d_train._buffers)
        print(50*'-')


print(50*'=')
print('T_bn statistics over the batch dimension in training mode:')
print(f'mean:\n{T_bn.mean(stat_dims)}')
print(f'standard deviation:\n{T_bn.std(stat_dims)}')
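
The training-mode formulas can be checked against `nn.BatchNorm2d` by hand. The sketch below does this for a single forward pass on a fresh layer, where $\mu=0$, $\sigma=1$, $\bar{E}=0$ and $\bar{V}=1$. One PyTorch detail worth knowing: the output is normalized with the biased batch variance, while the running variance is updated with the unbiased one. All variable names here are illustrative.

```python
import torch
from torch import nn

torch.manual_seed(0)

x = torch.randn(13, 3, 5, 4) * 20 + 100  # skewed data, as in the cells above
dims = [0, 2, 3]                         # reduce over batch, height, width

bn = nn.BatchNorm2d(num_features=3).train()
with torch.no_grad():
    out = bn(x)  # one forward pass in training mode

# batch statistics per channel
E = x.mean(dims)
V_biased = x.var(dims, unbiased=False)   # used to normalize the output
V_unbiased = x.var(dims, unbiased=True)  # used to update the running variance

# manual output; weight (sigma) is 1 and bias (mu) is 0 right after construction
manual_out = (x - E[None, :, None, None]) / torch.sqrt(V_biased[None, :, None, None] + bn.eps)

# manual running statistics after one update from the initial values 0 and 1
m = bn.momentum
manual_running_mean = (1 - m) * torch.zeros(3) + m * E
manual_running_var = (1 - m) * torch.ones(3) + m * V_unbiased

print('output matches:', torch.allclose(out, manual_out, atol=1e-4))
print('running mean matches:', torch.allclose(bn.running_mean, manual_running_mean, rtol=1e-4))
print('running var matches:', torch.allclose(bn.running_var, manual_running_var, rtol=1e-4))
```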

In [None]:
# evaluation mode
# A freshly created layer in evaluation mode uses the initial running statistics
# (mean 0, variance 1) and does not update them, so the buffers stay the same.

bn2d_eval = nn.BatchNorm2d(num_features=channels).eval()
for i in range(100):
    T_bn = bn2d_eval(T)
    if i % 10 == 0:
        print(f'Iteration {i+1}')
        print(bn2d_eval._buffers)
        print(50*'-')


print(50*'=')
print('T_bn statistics over the batch dimension in evaluation mode:')
print(f'mean:\n{T_bn.mean(stat_dims)}')
print(f'standard deviation:\n{T_bn.std(stat_dims)}')
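
A layer that has never been trained is not very informative in evaluation mode, because its running statistics are still at their initial values. The sketch below is one way to see the evaluation formula at work: it first runs the layer in training mode on the same batch many times so that $\bar{E}$ and $\bar{V}$ converge, then switches to evaluation mode and compares the output with the formula. Feeding one fixed batch repeatedly is an assumption made to keep the example small.

```python
import torch
from torch import nn

torch.manual_seed(0)

x = torch.randn(13, 3, 5, 4) * 20 + 100  # skewed data, as in the cells above
bn = nn.BatchNorm2d(num_features=3)

with torch.no_grad():
    # training mode: every forward pass updates the running statistics
    bn.train()
    for _ in range(200):
        bn(x)

    # evaluation mode: the running statistics are used and left untouched
    bn.eval()
    mean_before = bn.running_mean.clone()
    var_before = bn.running_var.clone()
    out = bn(x)

    # manual evaluation formula: mu + sigma * (x - E_bar) / sqrt(V_bar + eps)
    E_bar = bn.running_mean[None, :, None, None]
    V_bar = bn.running_var[None, :, None, None]
    mu = bn.bias[None, :, None, None]
    sigma = bn.weight[None, :, None, None]
    manual_out = mu + sigma * (x - E_bar) / torch.sqrt(V_bar + bn.eps)

print('output matches:', torch.allclose(out, manual_out, atol=1e-5))
print('buffers unchanged in eval mode:',
      torch.equal(mean_before, bn.running_mean) and torch.equal(var_before, bn.running_var))
print('running mean:', bn.running_mean)
```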