# 04 - Regularization

Regularization encompasses a range of techniques that correct for overfitting in machine learning models. It trades a marginal decrease in training accuracy for better generalization to unseen data.

In [None]:
import torch
from torchvision import transforms
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt
from collections import OrderedDict

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using {device} device")

In [None]:
# Load model tools
from scripts.model_tools import train_validate, test_validate, set_fashion_dataset

## Recreate last model

Let's recreate the deep neural network from the last notebook for comparison, but decrease the learning rate to its original value for smoother results.

In [None]:
# Hyperparameters
HIDDEN_LAYER_PARAMETERS = [64, 48, 24]
LEARNING_RATE = 0.003
EPOCHS = 15
OUTPUTS = 10
RATIO_VALIDATION = 0.2
BATCH_SIZE = 64

In [None]:
# Get the dataset
transform = transforms.Compose([transforms.ToTensor()])
train_ds, test_ds, train_dl, val_dl, test_dl, classes = set_fashion_dataset(transform, RATIO_VALIDATION, BATCH_SIZE)
image, label = next(iter(train_dl))
input_features = image[0].shape[0] * image[0].shape[1] * image[0].shape[2] # Total input features

In [None]:
model = nn.Sequential(OrderedDict([('fc1', nn.Linear(input_features, HIDDEN_LAYER_PARAMETERS[0])),
                                   ('relu1', nn.ReLU()),
                                   ('fc2', nn.Linear(HIDDEN_LAYER_PARAMETERS[0], HIDDEN_LAYER_PARAMETERS[1])),
                                   ('relu2', nn.ReLU()),
                                   ('fc3', nn.Linear(HIDDEN_LAYER_PARAMETERS[1], HIDDEN_LAYER_PARAMETERS[2])),
                                   ('relu3', nn.ReLU()),
                                   ('output', nn.Linear(HIDDEN_LAYER_PARAMETERS[2], OUTPUTS)),
                                   ('logsoftmax', nn.LogSoftmax(dim=1))]))
model = model.to(device)
loss_fn = nn.NLLLoss()
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

In [None]:
train_validate(model, loss_fn, optimizer, train_dl, val_dl, device, n_epochs = EPOCHS, flatten=True)

In [None]:
test_validate(model, test_dl, device);

## Normalize Input Dataset

Input features can span very different scales. Rescaling the inputs to zero mean and unit variance puts all features on a similar scale, which usually helps the learning algorithm converge faster.

Remember gradient descent? Imagine it as a ball rolling down to the lowest point in the valleys shown below. With unnormalized inputs, the ball spends a lot of time bouncing back and forth across the uneven terrain. With normalized inputs, the way down is a lot smoother.

![](../media/regularization/Normalization.png)
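To make the rolling-ball picture concrete, here is a minimal sketch in plain Python that is not part of the original notebook. It assumes a toy quadratic loss `0.5 * sum(c_i * w_i**2)`, where the hypothetical `curvatures` stand in for the feature scales, and counts the gradient descent steps needed to reach the minimum. The learning rate rule (`0.5 / max(curvatures)`) is an illustrative choice: the steepest direction caps the usable step size, so the shallow direction crawls when the scales differ.

```python
def steps_to_converge(curvatures, tol=1e-6, max_steps=100_000):
    """Count gradient descent steps on the toy loss 0.5 * sum(c_i * w_i**2)."""
    # The steepest direction limits the usable learning rate (illustrative rule).
    lr = 0.5 / max(curvatures)
    w = [1.0] * len(curvatures)  # start away from the minimum at w = 0
    for step in range(1, max_steps + 1):
        # The gradient of the toy loss with respect to w_i is c_i * w_i.
        w = [w_i - lr * c_i * w_i for w_i, c_i in zip(w, curvatures)]
        if max(abs(w_i) for w_i in w) < tol:
            return step
    return max_steps


steps_unnormalized = steps_to_converge([1.0, 100.0])  # elongated valley
steps_normalized = steps_to_converge([1.0, 1.0])      # round bowl
print(f"unnormalized: {steps_unnormalized} steps, normalized: {steps_normalized} steps")
```

The elongated valley needs far more steps than the round bowl, even though both start at the same distance from the minimum.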

In [None]:
# Add input normalization
transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5,), (0.5,)) # Normalization: (x - 0.5) / 0.5 maps pixel values from [0, 1] to [-1, 1]
    ])
train_ds, test_ds, train_dl, val_dl, test_dl, classes = set_fashion_dataset(transform, RATIO_VALIDATION, BATCH_SIZE)
print(train_ds)

In [None]:
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

In [None]:
train_validate(model, loss_fn, optimizer, train_dl, val_dl, device, n_epochs = EPOCHS, flatten=True)

In [None]:
test_validate(model, test_dl, device);

Since the images are grayscale, every pixel is on the same 0-255 scale (0-1 after `ToTensor`), and most images have a similar share of bright pixels on a black background, so normalization doesn't achieve much here. It's a more important tool for more varied datasets, such as color photographs, or for data whose features have different scales (e.g., income vs. age).
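For features on very different scales, standardization can be sketched as follows. The income and age values are made up for illustration, and the variable names are hypothetical. Each column is shifted to zero mean and scaled to unit variance.

```python
import torch

# Hypothetical tabular data: one row per person, columns = [income, age].
features = torch.tensor([
    [32_000.0, 25.0],
    [58_000.0, 41.0],
    [120_000.0, 63.0],
    [45_000.0, 33.0],
])

# Per-column statistics (in practice, computed on the training split only
# and then reused for the validation and test splits).
mean = features.mean(dim=0)
std = features.std(dim=0)

standardized = (features - mean) / std

print(standardized.mean(dim=0))  # approximately 0 for both columns
print(standardized.std(dim=0))   # approximately 1 for both columns
```

After this step, income no longer dominates age simply because its raw numbers are larger.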

## Dropout

By randomly dropping out neurons during training, we force the model to not rely on any single feature. Dropout helps break co-adaptations among units, and each unit can act more independently when dropout regularization is used.

This makes the model more robust against data it hasn't seen before. Because dropout is stochastic, training usually needs more epochs to converge.

![](../media/regularization/Dropout.png)
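The behavior of `nn.Dropout` can be inspected in isolation. This is a small sketch, assuming an input of all ones so that the effect is easy to read. In training mode each element is zeroed with probability `p` and the survivors are scaled by `1 / (1 - p)`. In evaluation mode the layer passes its input through unchanged.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # make the random mask reproducible
drop = nn.Dropout(p=0.2)
x = torch.ones(10_000)

# Training mode: elements are randomly zeroed, the rest are rescaled.
drop.train()
y_train = drop(x)
kept_fraction = (y_train != 0).float().mean().item()
surviving_values = y_train[y_train != 0].unique()

# Evaluation mode: dropout is disabled and the input is returned as-is.
drop.eval()
y_eval = drop(x)

print(f"fraction kept in training: {kept_fraction:.3f}")  # close to 1 - p = 0.8
print(f"value of surviving units: {surviving_values}")     # 1 / (1 - p) = 1.25
print(f"eval output equals input: {torch.equal(y_eval, x)}")
```

The rescaling keeps the expected activation the same in both modes, which is why switching the model to evaluation mode before validation and testing matters.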

In [None]:
# Add dropout layers
model = nn.Sequential(OrderedDict([('fc1', nn.Linear(input_features, HIDDEN_LAYER_PARAMETERS[0])),
                                   ('relu1', nn.ReLU()),
                                   ('drop1', nn.Dropout(0.20)), # Dropout layer: zeroes each activation with probability 0.20 during training
                                   ('fc2', nn.Linear(HIDDEN_LAYER_PARAMETERS[0], HIDDEN_LAYER_PARAMETERS[1])),
                                   ('relu2', nn.ReLU()),
                                   ('output', nn.Linear(HIDDEN_LAYER_PARAMETERS[1], OUTPUTS)),
                                   ('logsoftmax', nn.LogSoftmax(dim=1))]))

In [None]:
# Increase the number of epochs by 25% to give dropout more time to converge
dropout_epochs = int(EPOCHS * 1.25)
model = model.to(device)
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

In [None]:
train_validate(model, loss_fn, optimizer, train_dl, val_dl, device, n_epochs = dropout_epochs, flatten=True)

In [None]:
test_validate(model, test_dl, device);

Notice that the validation loss, which is computed on data the model does not see during training, has been reduced.

## Data augmentation

Data augmentation applies random rotations, zooms, crops, flips, and distortions to generate new variations of the images already in the dataset.

In [None]:
import torch.utils.data as data_utils
from torch.utils.data import DataLoader

In [None]:
# Get a single sample image
image, label = next(iter(train_dl))
index = 0 # Only first image
print(classes[label[index].item()])
plt.imshow(image[index].numpy().squeeze(), cmap='gray');

In [None]:
# Get a subset of our dataset
indices = torch.arange(1)
train_ds_one = torch.utils.data.Subset(train_ds, indices)
train_dl_one = DataLoader(train_ds_one, batch_size=BATCH_SIZE)

In [None]:
# Add random rotation to our transformations

from torchvision import datasets

transform_rotate = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5,), (0.5,)),
        transforms.RandomRotation(25), # Rotate by a random angle between -25 and +25 degrees
    ])
train_ds_one.dataset.transform = transform_rotate

Notice that data augmentation does not actually expand your dataset. The transformations are applied on the fly to each item as it is loaded, so the number of items stays the same. What changes is that every epoch sees a differently transformed version of the dataset.
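The point about dataset size can be checked on a toy example. The sketch below is an illustration rather than the notebook's actual pipeline: `RandomShiftDataset` is a hypothetical class, and a random sideways shift stands in for the rotation so the example needs only `torch`.

```python
import torch
from torch.utils.data import Dataset


class RandomShiftDataset(Dataset):
    """Hypothetical toy dataset that applies a random augmentation on every read."""

    def __init__(self, images):
        self.images = images

    def __len__(self):
        return len(self.images)

    def __getitem__(self, index):
        # Stand-in augmentation: shift the image sideways by a random offset.
        shift = torch.randint(0, 4, (1,)).item()
        return torch.roll(self.images[index], shifts=shift, dims=-1)


torch.manual_seed(0)
images = torch.arange(32.0).reshape(2, 1, 4, 4)  # two tiny 4x4 "images"
dataset = RandomShiftDataset(images)

# Read the same item many times, as successive epochs would.
versions = {tuple(dataset[0].flatten().tolist()) for _ in range(16)}

print(len(dataset))   # still 2: no items were added
print(len(versions))  # number of distinct versions of item 0 that were seen
```

The length never changes, yet repeated reads of the same index return different tensors, which is how the model sees more variety without the dataset growing.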

In [None]:
number_of_passes = 4

f, ax_arr = plt.subplots(1, number_of_passes, squeeze=False)
index = 0 # Only one image
print(classes[label[index].item()])
for j, row in enumerate(ax_arr):
    for i, ax in enumerate(row):
        image, label = next(iter(train_dl_one))
        ax.imshow(image[index].numpy().squeeze(), cmap='gray')

Why the grey area? `RandomRotation` fills the corners exposed by the rotation with its default fill value, 0, which would normally be black. *However*, we had already normalized our data to the [-1, 1] range, where 0 is the midpoint and therefore renders as grey. To avoid this, apply `RandomRotation` first and `Normalize` afterwards. It is important to check your assumptions when dealing with datasets.

### Apply to whole dataset

In [None]:
train_ds.transform = transform_rotate

In [None]:
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)
train_validate(model, loss_fn, optimizer, train_dl, val_dl, device, n_epochs = EPOCHS, flatten=True)

In [None]:
test_validate(model, test_dl, device);

In this case, the orientation of our training images was already well matched to that of the test images, so rotation-based augmentation worked against us. For real-world photographs, it would probably make our dataset better suited to the task and help compensate for a small dataset.

**Next Notebook: [05-Convolutional Neural Networks](05-CNN.ipynb)**