# Introduction to Machine Learning
## Lesson 10: Advanced Training of Neural Networks in PyTorch
## Introduction

In this lab work, we will explore advanced techniques for training neural networks using PyTorch. These methods aim to enhance the efficiency, reliability, and speed of training neural networks.

## Objectives

Learn how to use:
1. Data augmentation
2. Batch normalization
3. Dropout
4. Early stopping
5. Learning rate scheduling
6. TensorBoard for training control

### Data Augmentation

Data augmentation involves creating new training examples by transforming the existing data, which helps improve the model's generalization. You will learn how to apply various data augmentation techniques to your dataset.

### Batch Normalization

Batch normalization standardizes the inputs to a layer for each mini-batch. This technique helps in speeding up the training process and improving the model's performance. You will implement batch normalization in your neural network layers.

### Dropout

Dropout is a regularization technique that helps prevent overfitting by randomly dropping units during training. You will learn how to apply dropout in your neural network.

### Early Stopping

Early stopping monitors the model's performance on a validation set and stops training when performance starts to degrade. This technique helps in preventing overfitting. You will implement early stopping in your training process.

### Learning Rate Scheduling

Learning rate scheduling adjusts the learning rate during training to improve convergence. You will explore different learning rate scheduling strategies and apply them to your training process.

### TensorBoard for Training Control

TensorBoard provides a suite of visualization tools to monitor and debug your training process. You will learn how to use TensorBoard to track metrics such as loss and accuracy, visualize the computational graph, and more.


### Dropout

#### Purpose and Idea

Dropout is a regularization technique to prevent overfitting in neural networks. During training, dropout randomly sets a fraction of the input units to 0 at each update step. This helps to break the reliance on specific neurons, ensuring that the network learns more robust features.

#### Formula

Given a layer's output $\mathbf{h}$, dropout can be applied as follows:
1. Generate a random binary mask $\mathbf{m}$ where each element is 0 with probability $p$ and 1 with probability $1 - p$.
2. Apply the mask to the output: $\mathbf{\tilde{h}} = \mathbf{h} \odot \mathbf{m}$.

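In this classic formulation, no units are dropped during inference; instead, the activations are scaled by $(1 - p)$ so that their expected value matches the one seen during training. PyTorch's `nn.Dropout` implements the equivalent *inverted* variant, which scales the kept activations by $1/(1 - p)$ during training and applies no scaling at inference. The classic inference rule is: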
$$
\mathbf{\hat{h}} = (1 - p) \mathbf{h}
$$
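
The claim about expected values can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper names `dropout_train` and `dropout_infer` are hypothetical and are not part of PyTorch. It averages many training-time samples and compares the result with the scaled inference output.

```python
import random


def dropout_train(h, p, rng):
    """Training: zero each activation with probability p (mask m, then h * m)."""
    mask = [0.0 if rng.random() < p else 1.0 for _ in h]
    return [h_i * m_i for h_i, m_i in zip(h, mask)]


def dropout_infer(h, p):
    """Inference: keep every unit and scale by (1 - p)."""
    return [(1.0 - p) * h_i for h_i in h]


rng = random.Random(0)  # fixed seed so the run is repeatable
h = [1.0, 2.0, 3.0, 4.0, 5.0]
p = 0.4

# Average many training-time samples to estimate the expected activation.
n_samples = 20000
sums = [0.0] * len(h)
for _ in range(n_samples):
    for i, value in enumerate(dropout_train(h, p, rng)):
        sums[i] += value
train_mean = [s / n_samples for s in sums]

print("mean under training dropout:", [round(v, 2) for v in train_mean])
print("inference output:           ", dropout_infer(h, p))
```

The two printed lists should agree up to sampling noise, which is the reason for the $(1 - p)$ factor.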


#### Visualization

Imagine a neural network layer with 5 neurons. During training with dropout, some neurons are randomly "dropped out."

```plaintext
Before Dropout:        After Dropout (p = 0.4):

 o   o   o   o   o      o   o   o   x   o
 |   |   |   |   |      |   |   |       |
 o   o   o   o   o      o   x   o   x   o
```

#### Gain

Dropout helps in:
- Reducing overfitting.
- Making the network more robust by ensuring it doesn't rely on specific neurons.
- Promoting independence among feature detectors.

### Batch Normalization

#### Purpose and Idea

Batch Normalization (BatchNorm) is a technique to improve the training of deep neural networks by normalizing the inputs of each layer. It helps in speeding up the training and provides some regularization, reducing the need for dropout.

#### Formula

For a given mini-batch, batch normalization is applied as follows:
1. Compute the mean $\mu_B$ and variance $\sigma_B^2$ of the mini-batch.
2. Normalize the batch: $\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$
3. Scale and shift: $y_i = \gamma \hat{x}_i + \beta$

Here, $\gamma$ and $\beta$ are learnable parameters, and $\epsilon$ is a small constant to avoid division by zero.
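
The three steps above can be sketched for a single feature in plain Python. The function name `batch_norm` and the sample batch are hypothetical; the sketch assumes the biased (divide by $n$) batch variance and $\epsilon = 10^{-5}$, and it leaves out the running statistics that a real layer tracks for inference.

```python
import math


def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a mini-batch of scalars for one feature, then scale and shift."""
    n = len(x)
    mu = sum(x) / n                                  # step 1: batch mean
    var = sum((x_i - mu) ** 2 for x_i in x) / n      # step 1: biased batch variance
    x_hat = [(x_i - mu) / math.sqrt(var + eps) for x_i in x]  # step 2: normalize
    return [gamma * v + beta for v in x_hat]         # step 3: scale and shift


batch = [2.0, 4.0, 6.0, 8.0]
y = batch_norm(batch)
y_mean = sum(y) / len(y)
y_var = sum((v - y_mean) ** 2 for v in y) / len(y)

print("normalized:", [round(v, 3) for v in y])
print("mean:", round(y_mean, 6), "variance:", round(y_var, 6))
```

With $\gamma = 1$ and $\beta = 0$ the output has mean close to 0 and variance close to 1; other values of $\gamma$ and $\beta$ let the network rescale and shift the result.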


#### Visualization

Consider the activations of a neural network layer before and after applying batch normalization.

```plaintext
Before BatchNorm:     After BatchNorm:

 |    |    |    |      |    |    |    |
 o    o    o    o      o    o    o    o
 |    |    |    |      |    |    |    |
 o    o    o    o      o    o    o    o
```

BatchNorm ensures that the activations are centered and scaled, which helps in stabilizing and accelerating the training process.

#### Gain

Batch Normalization helps in:
- Reducing internal covariate shift, leading to faster training.
- Allowing the use of higher learning rates.
- Providing regularization, reducing the need for other regularization techniques like dropout.


# 1) Data augmentation
We're going to use a new dataset, CIFAR-10, as our example for the classification task. The dataset consists of 60,000 32x32 color images in 10 classes (50,000 for training and 10,000 for testing), looking like this:

![cifar10.png](attachment:77959831-c494-4dd5-9742-4a305a294031.png)

**Question: why might we use data augmentation? What problem does it solve?**

### Exercise 1

**1) Write the following augmentation transforms for the CIFAR-10 dataset:**

    - Crop (with size = 32, crop = 2)
    - Horizontal flip (with probability 0.5)
    - Rotation (with 10 degrees max)
    - Random affine (degrees = 0, shear = 10, scale=(0.8,1.2))
    
**2) Define DataLoaders for CIFAR-10 with these transforms**

**Hint:** refer to [ILLUSTRATION OF TRANSFORMS](https://pytorch.org/vision/stable/auto_examples/plot_transforms.html#sphx-glr-auto-examples-plot-transforms-py) in PyTorch.

In [None]:
from torch.utils import data
from torchvision import datasets, transforms

# You can increase these values if you have enough computational power
# (note that the test cases below expect a batch size of 128)
train_batch_size = 128
test_batch_size = 128


# Put augmentations
train_transforms = transforms.Compose([
    ...,
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
# Do not modify test transforms, because it will corrupt test data
test_transforms = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])


# Work out how to select the training split of datasets.CIFAR10 and
# how to apply train_transforms to it
# If you have already downloaded the dataset, set root to its location
train_dataset = datasets.CIFAR10(root='cifar10', ...)

# Define the loader of train data set: use train_batch_size as batch_size
# and don't forget to shuffle
train_data_loader = data.DataLoader(...)


# Do the same for test part of CIFAR10, apply test_transforms
test_dataset = datasets.CIFAR10(root='cifar10', ...)

# Define the loader of test data set: use test_batch_size
test_data_loader = data.DataLoader(...)

In [None]:
# Check the results of transformations
import matplotlib.pyplot as plt
images, _ = next(iter(train_data_loader))

fig, axs = plt.subplots(nrows=1, ncols=4)

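for i in range(4):
    ax = axs[i]
    # Undo Normalize (x * 0.5 + 0.5) so pixel values are back in [0, 1] for imshow
    img = images[i] * 0.5 + 0.5
    ax.imshow(img.numpy().transpose(1, 2, 0))
    ax.set(xticklabels=[], yticklabels=[], xticks=[], yticks=[])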

plt.tight_layout()
plt.show()

In [None]:
## Test cases:
import unittest
import torch
from torch.utils import data
from torchvision import datasets, transforms

# Define functions that return the transforms, datasets and DataLoaders
def get_train_transforms():
    return train_transforms

def get_test_transforms():
    return test_transforms

def get_train_dataset():
    return train_dataset

def get_test_dataset():
    return test_dataset

def get_train_data_loader():
    return train_data_loader

def get_test_data_loader():
    return test_data_loader

class TestCIFAR10DataLoading(unittest.TestCase):

    def test_train_dataset_loads_correctly(self):
        train_dataset = get_train_dataset()
        self.assertEqual(len(train_dataset), 50000, "Train dataset should have 50000 samples.")

    def test_test_dataset_loads_correctly(self):
        test_dataset = get_test_dataset()
        self.assertEqual(len(test_dataset), 10000, "Test dataset should have 10000 samples.")

    def test_train_batch_size(self):
        train_data_loader = get_train_data_loader()
        images, labels = next(iter(train_data_loader))
        self.assertEqual(images.size(0), 128, "Train batch size should be 128")
        self.assertEqual(labels.size(0), 128, "Train batch size should be 128")

    def test_test_batch_size(self):
        test_data_loader = get_test_data_loader()
        images, labels = next(iter(test_data_loader))
        self.assertEqual(images.size(0), 128, "Test batch size should be 128")
        self.assertEqual(labels.size(0), 128, "Test batch size should be 128")

    def test_data_types_and_normalization(self):
        train_data_loader = get_train_data_loader()
        mean = torch.tensor([0.5, 0.5, 0.5])
        std = torch.tensor([0.5, 0.5, 0.5])
        images, _ = next(iter(train_data_loader))
        self.assertEqual(images.dtype, torch.float32, "Images should be of type torch.float32")


    def test_data_randomization(self):
        train_data_loader = get_train_data_loader()
        first_iter_images, _ = next(iter(train_data_loader))
        second_iter_images, _ = next(iter(train_data_loader))
        self.assertFalse(torch.equal(first_iter_images, second_iter_images), "Data should be randomized across batches")

# Run
unittest.main(argv=[''], verbosity=2, exit=False)


# 2) Build a model with dropout and batch normalization

Here we're going to build roughly the same model as before, but with two new layer types: batch normalization and dropout.

#### Question: what is the purpose of these operations? In what order should they be placed relative to the other layers?

### Exercise 2

**Declare 4 blocks (nn.Sequential) of Custom model with the default parameters (unless otherwise stated):**

**1st block (convolutional)**

    - Convolution layer with 16 filters, kernel size equal to 3x3 and stride 1x1. Use ReLU as activation;
    - Max pool layer with kernel size 2;
    - Batch norm layer.
    
**2nd block (convolutional)**

    - Convolution layer with 32 filters, kernel size equal to 3x3 and stride 1x1. Use ReLU as activation;
    - Batch norm layer;
    - Dropout layer with probability of unit drop equal to 0.25.
    
**3rd block (convolutional)**

    - Convolution layer with 64 filters, kernel size equal to 3x3 and stride 1x1. Use ReLU as activation;
    - Batch norm layer;

**4th block (linear)**

    - Linear layer. If the previous blocks are configured correctly, in_features should be 64*11*11. Set out_features to 256 and use ReLU as activation;
    - Dropout layer with probability of unit drop equal to 0.1;
    - Final linear layer with an output size of 10
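
The value 64\*11\*11 follows from the layer sizes listed above. The short sketch below traces the spatial size through the blocks; the helper `conv_out` is hypothetical, and the calculation assumes the PyTorch defaults of zero padding for convolutions and a max-pool stride equal to its kernel size.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1


size = 32                                   # CIFAR-10 images are 32x32
size = conv_out(size, kernel=3)             # block 1 convolution: 32 -> 30
size = conv_out(size, kernel=2, stride=2)   # block 1 max pool:    30 -> 15
size = conv_out(size, kernel=3)             # block 2 convolution: 15 -> 13
size = conv_out(size, kernel=3)             # block 3 convolution: 13 -> 11

in_features = 64 * size * size              # the 3rd block outputs 64 channels
print(size, in_features)                    # prints "11 7744"
```

Batch normalization, ReLU and dropout do not change the tensor shape, so they do not appear in the calculation.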

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class CustomModel(nn.Module):
    def __init__(self):
        super(CustomModel, self).__init__()
        # Build your model
        self.conv1 = nn.Sequential(...)
        self.conv2 = nn.Sequential(...)
        self.conv3 = nn.Sequential(...)
        self.linear1 = nn.Sequential(...)


    def forward(self, x):
        # Propagate x through the network
        # Do not forget to flatten after the 3rd block
        return F.log_softmax(x, dim=1)

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")
model = CustomModel().to(device)

print(f'Device: {device}')

print(model)

# 3) Training pipeline upgrades: early stopping and LR scheduler

Early Stopping is a form of regularization, used to stop training when a monitored metric has stopped improving.
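
The idea can be illustrated without PyTorch. The sketch below is not the `EarlyStopping` class from the exercise; `first_stop_epoch`, its parameters and the loss values are hypothetical. It assumes one common convention: stop once the monitored value has failed to improve by more than `min_delta` for `patience` epochs in a row. Whether to stop at `>=` or `>` the limit is a design choice.

```python
def first_stop_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss          # improvement: remember it and reset the counter
            bad_epochs = 0
        else:
            bad_epochs += 1      # no improvement this epoch
            if bad_epochs >= patience:
                return epoch
    return None


# Validation loss falls, then rises as the model starts to overfit.
losses = [1.00, 0.80, 0.70, 0.72, 0.71, 0.75, 0.90]
print(first_stop_epoch(losses, patience=3))  # prints 6
```

Here the best loss is reached at epoch 3, epochs 4 to 6 bring no improvement, and training stops at epoch 6 instead of continuing to epoch 7.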


#### Question: what kind of metric can we monitor? What's the benefit of using early stopping?

In [None]:
## for tests
!pip install torchsummary

In [None]:
import unittest
import torch
from torchsummary import summary

class TestCustomModel(unittest.TestCase):

    def setUp(self):
        self.model = CustomModel().to(device)
        self.device = device
        self.input_shape = (3, 32, 32)  # CIFAR-10 image dimensions

    def test_model_initialization(self):
        self.assertIsInstance(self.model, CustomModel, "Model should be an instance of CustomModel")

    def test_model_forward_pass(self):
        # Create a random input tensor with batch size 1
        input_tensor = torch.randn(1, *self.input_shape).to(self.device)
        output = self.model(input_tensor)

        # Check if the output shape is correct (batch size, number of classes)
        self.assertEqual(output.shape, (1, 10), f"Output shape should be (1, 10) but got {output.shape}")

    def test_model_summary(self):
        try:
            summary(self.model, input_size=self.input_shape)
        except Exception as e:
            self.fail(f"Model summary failed with exception: {e}")

# Run
unittest.main(argv=[''], verbosity=2, exit=False)


Vanilla PyTorch doesn't include early stopping (see [PyTorch Ignite](https://pytorch.org/ignite/generated/ignite.handlers.early_stopping.EarlyStopping.html) for an 'official' solution), so we have to write it from scratch. That said, it is sometimes useful to have a custom tool that you can tune to your specific needs.

### Exercise 3

**Implement EarlyStopping class**

In [None]:
# Complete this class so that training stops when the monitored value stops improving
import operator
class EarlyStopping():
    def __init__(self, tolerance=5, min_delta=0, mode='min'):
        '''
        :param tolerance: number of epochs without improvement to tolerate
        :param min_delta: minimum change that counts as an improvement
        :param mode: 'min' or 'max' to minimize or maximize the metric
        '''

        '''
        Store these parameters as attributes (tolerance, min_delta, mode), and
        define `counter` (number of calls without improvement) and
        `prev_metric` (the previous best value of the metric)
        '''
        self.early_stop = False
        pass

    def __call__(self, metric) -> bool:
        ''' This function should return True if `metric` has not improved for
            `tolerance` calls
        '''
        if ...:
            self.early_stop = True
        return self.early_stop


### Let's look at how different LR schedulers work

In [None]:
from torch.optim import lr_scheduler
import matplotlib.pyplot as plt

# Just a toy model
class NullModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(1,1)


toy_model = NullModule()
# The optimizer must wrap the toy model's parameters, not those of CustomModel
optimizer = torch.optim.Adam(toy_model.parameters(), lr=0.01)

def plot_lr(scheduler, name):
    # Re-init for each scheduler
    optimizer.param_groups[0]['lr'] = 0.01
    optimizer.zero_grad()
    toy_model.zero_grad()
    lrs = []
    step = 100

    fig, ax = plt.subplots()
    ax.set(xlabel='Step', ylabel='LR value', title=name)

    for i in range(step):
        lr = optimizer.param_groups[0]['lr']
        if name == "ReduceLROnPlateau":
            scheduler.step(lr)
        else:
            scheduler.step()
        lrs.append(lr)

    ax.plot(lrs)
    plt.show()


# You can check https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
LRs = {"ReduceLROnPlateau": lr_scheduler.ReduceLROnPlateau(optimizer, 'min', factor=0.3,
                                                           patience=10, min_lr=0.001),
       "Step LR": lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5),
       "Exponent LR": lr_scheduler.ExponentialLR(optimizer, gamma=0.9),
       "Cyclic LR":lr_scheduler.CyclicLR(optimizer, base_lr=0.01, max_lr=0.2,
                                         cycle_momentum=False, step_size_up=10)}

for name, lr in LRs.items():
    plot_lr(lr, name)


# 4) Gather all together in training loops

This section implements the training and testing loops for a PyTorch model, designed to streamline the training process and evaluate model performance.

The `train` function handles the model's training phase by iterating over the training dataset, computing the loss using a given criterion, performing backpropagation, and updating model parameters with an optimizer. It also tracks and reports the training loss and accuracy for each epoch.

The `test` function evaluates the trained model on a separate test dataset, calculating the test loss and accuracy without performing backpropagation. These functions leverage GPU acceleration if available and provide real-time progress updates using the `tqdm` library for better monitoring of the training process.

In [None]:
from time import time
from tqdm import tqdm


def train(model, device, train_loader, criterion, optimizer, epoch):
    model.train()
    epoch_loss = 0
    start_time = time()
    correct = 0
    iteration = 0

    bar = tqdm(train_loader)
    for data, target in bar:
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()

        output = model(data)
        # Get the index of the max log-probability
        pred = output.argmax(dim=1, keepdim=True)
        correct += pred.eq(target.view_as(pred)).sum().item()

        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()
        iteration += 1
        bar.set_postfix({"Loss": format(epoch_loss/iteration, '.6f')})

    acc = 100. * correct / len(train_loader.dataset)
    print(f'\rTrain Epoch: {epoch}, elapsed time:{time()-start_time:.2f}s')
    return epoch_loss, acc


def test(model, device, test_loader, criterion):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += criterion(output, target).item()  # sum up batch loss
            pred = output.argmax(dim=1, keepdim=True)  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

    acc = 100. * correct / len(test_loader.dataset)
    return test_loss, acc

This code snippet sets up the training configuration for a PyTorch model, including the definition of hyperparameters, loss function, optimizer, learning rate scheduler, and early stopping mechanism.

The `epochs` variable specifies the maximum number of training epochs (full passes over the training set); early stopping may end training sooner.

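The `criterion` is defined as `CrossEntropyLoss`, which is suitable for classification tasks. Since `CustomModel.forward` already returns log-probabilities via `log_softmax`, `nn.NLLLoss` would be the conventional pairing; `CrossEntropyLoss` yields the same loss values here because applying `log_softmax` a second time leaves log-probabilities unchanged.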

The optimizer is set to Stochastic Gradient Descent (SGD) with a learning rate of 0.1 and a momentum of 0.9, enhancing convergence speed and stability.

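The learning rate scheduler is left for you to choose. One option, `ReduceLROnPlateau`, can be configured to reduce the learning rate by a factor of 0.3 if the monitored metric does not improve for 3 consecutive epochs, with a minimum learning rate of 0.001. Note that the training loop in the next section calls `scheduler.step(train_loss)`, a call signature that matches `ReduceLROnPlateau`.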
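
The plateau rule can be sketched in plain Python to show how the learning rate evolves. This is a simplified illustration, not PyTorch's implementation: the function `reduce_on_plateau` and the loss sequence are hypothetical, the improvement threshold and cooldown options are ignored, and the sketch assumes the rate is reduced once the number of non-improving epochs exceeds `patience`.

```python
def reduce_on_plateau(losses, lr=0.1, factor=0.3, patience=3, min_lr=0.001):
    """Return the learning rate in effect after each epoch's loss is reported."""
    best = float("inf")
    bad_epochs = 0
    history = []
    for loss in losses:
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:            # patience exhausted
            lr = max(lr * factor, min_lr)    # shrink, but never below min_lr
            bad_epochs = 0
        history.append(lr)
    return history


# The loss improves once and then stays flat.
losses = [1.0, 0.9] + [0.9] * 8
print([round(v, 4) for v in reduce_on_plateau(losses)])
```

On a flat loss the rate is multiplied by `factor` every `patience + 1` epochs until it reaches `min_lr`, where it stays.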

The early stopping mechanism halts training if the validation loss does not improve for 7 epochs, preventing overfitting. Finally, `best_model_wts` stores a deep copy of the model's initial state, ensuring the best model weights can be restored after training.

In [None]:
import copy
import torch.optim as optim

# Define hyperparams
epochs = 100


criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# Choose the LR scheduler you like
scheduler = ...
early_stopping = EarlyStopping(tolerance=7, mode='min')

best_model_wts = copy.deepcopy(model.state_dict())
best_acc = 0.0

In [None]:
import unittest
import torch
from torch.utils.data import DataLoader, TensorDataset
import numpy as np
import torch.nn as nn
import torch.optim as optim

class TestEarlyStopping(unittest.TestCase):

    def test_initialization(self):
        early_stopping = EarlyStopping(tolerance=5, min_delta=0, mode='min')
        self.assertEqual(early_stopping.tolerance, 5)
        self.assertEqual(early_stopping.min_delta, 0)
        self.assertEqual(early_stopping.mode, 'min')
        self.assertEqual(early_stopping.counter, 0)
        self.assertEqual(early_stopping.early_stop, False)
        self.assertEqual(early_stopping.prev_metric, np.inf)

    def test_call_method_min_mode(self):
        early_stopping = EarlyStopping(tolerance=3, min_delta=0, mode='min')
        metrics = [1.0, 0.9, 0.8, 0.8, 0.8, 0.8]
        results = [early_stopping(metric) for metric in metrics]
        self.assertEqual(results, [False, False, False, False, False, False])

    def test_call_method_max_mode(self):
        early_stopping = EarlyStopping(tolerance=3, min_delta=0, mode='max')
        metrics = [1.0, 1.1, 1.2, 1.2, 1.2, 1.2]
        results = [early_stopping(metric) for metric in metrics]
        self.assertEqual(results, [False, False, False, False, False, False])

class TestTrainingFunctions(unittest.TestCase):

    def setUp(self):
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model = CustomModel().to(self.device)
        self.criterion = nn.CrossEntropyLoss()
        self.optimizer = optim.SGD(self.model.parameters(), lr=0.1, momentum=0.9)

        # Create a small dataset for testing
        data = torch.randn(100, 3, 32, 32)
        targets = torch.randint(0, 10, (100,))
        dataset = TensorDataset(data, targets)
        self.loader = DataLoader(dataset, batch_size=10, shuffle=True)

    def test_train_function(self):
        epoch_loss, acc = train(self.model, self.device, self.loader, self.criterion, self.optimizer, epoch=1)
        self.assertIsInstance(epoch_loss, float)
        self.assertIsInstance(acc, float)

    def test_test_function(self):
        test_loss, acc = test(self.model, self.device, self.loader, self.criterion)
        self.assertIsInstance(test_loss, float)
        self.assertIsInstance(acc, float)

# Run
unittest.main(argv=[''], verbosity=2, exit=False)


# 5) Use TensorBoard to check the progress of learning


This code defines a `training` function to train and evaluate a PyTorch model. If `writing` is `True`, it logs metrics to TensorBoard. The function iterates over epochs, updating model weights using a specified optimizer and learning rate scheduler. It calculates training and test losses and accuracies. Early stopping halts training if the test loss stops improving. The best model weights are saved if test accuracy improves. Finally, the model's state is saved to disk. This setup ensures efficient model training, evaluation, and optional logging.

In [None]:
import torch.optim as optim
from torch.utils.tensorboard import SummaryWriter
import copy



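def training(writing=False):
    # best_acc and best_model_wts are module-level variables reassigned below
    global best_acc, best_model_wts
    if writing:
        writer = SummaryWriter(log_dir='runs/model')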
    for epoch in range(1, epochs + 1):
        train_loss, train_acc = train(model, device, train_data_loader, criterion, optimizer, epoch)
        # Update learning rate if needed
        scheduler.step(train_loss)

        test_loss, test_acc = test(model, device, test_data_loader, criterion)
        # Terminate training if the test loss has stopped decreasing
        if early_stopping(test_loss):
            print('\nEarly stopping\n')
            break
        # Deep copy the model weights if the accuracy is the best so far
        if test_acc > best_acc:
            best_acc = test_acc
            best_model_wts = copy.deepcopy(model.state_dict())
        if writing:
            writer.add_scalars('Loss',
                            {
                                'train': train_loss,
                                'test': test_loss
                            },
                            epoch)

            writer.add_scalars('Accuracy',
                            {
                                'train': train_acc,
                                'test': test_acc
                            },
                            epoch)
        else:
            print(f"Training accuracy {train_acc}, test accuracy {test_acc}")
            print(f"Training loss {train_loss}, test loss {test_loss}")

    torch.save(model.state_dict(), "model.pt")
    model.load_state_dict(best_model_wts)
    torch.save(model.state_dict(), "best_model.pt")
    if writing:
        writer.close()

training()

### Conclusion

In this lab, you implemented advanced techniques to improve the training of neural networks. By using data augmentation, batch normalization, dropout, early stopping, learning rate scheduling, and TensorBoard, you can enhance the efficiency, reliability, and performance of your neural network models.