## How to use

Use this URL to access:

**shorturl.at/gsIZ9**

To use this notebook, you first need to create a copy in your own personal Google Drive (as per the first picture of the first practical session). **Then you'll need to switch the runtime to GPU so that your models train on GPU, which reduces training time** (as per the second picture of the first practical session).

# Deep learning : **_Training_**

After studying the preparation of data and the design of deep learning models, we focus in this session on the training process. We are going to see how to train a model and choose its training parameters, then visualize what models really learn, and conclude with transfer learning, a powerful technique to reuse previously trained models and apply them to other use cases.

![Training overview](https://www.rocq.inria.fr/cluster-willow/tchabal/courses/springschool2022/overview_training.png)

_Illustration created by Thomas Chabal and Clément Riu, 2022._

## Setting up the notebook

In [None]:
%cd /content
!wget https://www.rocq.inria.fr/cluster-willow/tchabal/courses/springschool2022/casablanca_course.zip && \
  unzip casablanca_course.zip && \
  rm casablanca_course.zip
%cd casablanca_course
!python setup.py install

## Choosing training settings

### Loss

Deep learning is based on a principle of trial-and-error. A model makes a prediction from an input value and we compare this prediction with the expected value. That comparison leads to the computation of an error, which then helps to adjust the network's parameters.

That loss value should be a differentiable function, because the weights of the network are adjusted according to the derivative of the loss. The formula is simple: given the weights $w_t$ of the model at step $t$ and a loss $\mathcal{L}$, we update the weights as $w_{t+1} = w_t - \frac{\partial \mathcal{L}}{\partial w}$. This means that the weights stop being updated once $\frac{\partial \mathcal{L}}{\partial w} = 0$, which is the condition satisfied at a minimum of the loss.

This loss function must be chosen according to the problem, to reach the goal we want. Some losses are more adapted to some problems. For instance,
- When classifying images in several categories, it is common to use the [_Cross Entropy_ loss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss).
- When doing binary classification (i.e. classification with only 2 classes), we may use the [_Binary Cross Entropy_ loss](https://pytorch.org/docs/stable/generated/torch.nn.BCELoss.html#torch.nn.BCELoss):
$$\mathcal{L}(x, y) = -\left(y \log(x) + (1-y) \log(1-x)\right)$$
- When regressing values, the [_L2_](https://pytorch.org/docs/stable/generated/torch.nn.MSELoss.html#torch.nn.MSELoss) and [_L1_](https://pytorch.org/docs/stable/generated/torch.nn.L1Loss.html#torch.nn.L1Loss) (or its [_Smooth L1_](https://pytorch.org/docs/stable/generated/torch.nn.SmoothL1Loss.html#torch.nn.SmoothL1Loss) variant) losses are well suited.

There exist many other losses, detailed [here](https://pytorch.org/docs/stable/nn.html#loss-functions).
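As a small illustration of the binary cross entropy, the sketch below compares a hand-written version, $-(y \log(x) + (1-y)\log(1-x))$ averaged over the batch, with `nn.BCELoss`. The probabilities and targets are made-up values chosen only for the example.

```python
import torch
import torch.nn as nn

# Made-up predicted probabilities (e.g. sigmoid outputs) and binary targets
probs = torch.tensor([0.9, 0.2, 0.7])
targets = torch.tensor([1.0, 0.0, 1.0])

# Hand-written binary cross entropy, averaged over the batch
manual_bce = -(targets * torch.log(probs)
               + (1 - targets) * torch.log(1 - probs)).mean()

# PyTorch implementation (its default reduction is also the mean)
torch_bce = nn.BCELoss()(probs, targets)

print(manual_bce.item(), torch_bce.item())
```

Both values should agree up to floating-point precision, and the loss is positive: it only reaches 0 when every prediction exactly matches its label.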

In Pytorch, we simply define the loss function as follows:

In [None]:
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

We then call this loss function during training with a code similar to: `loss = criterion(outputs, labels)`. All the computation is then performed by torch.
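To make the expected shapes concrete, here is a minimal sketch with random values standing in for model outputs. `nn.CrossEntropyLoss` takes raw scores (logits) of shape `(batch, classes)` and integer class indices of shape `(batch,)`; it applies the log-softmax itself, which the second computation reproduces explicitly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
outputs = torch.randn(4, 10)         # random logits standing in for model outputs
labels = torch.tensor([3, 0, 9, 1])  # one class index per sample

criterion = nn.CrossEntropyLoss()
loss = criterion(outputs, labels)    # scalar tensor

# Equivalent computation: log-softmax followed by the negative log-likelihood
manual_loss = F.nll_loss(F.log_softmax(outputs, dim=1), labels)

print(loss.item(), manual_loss.item())
```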

### Learning rate and optimizer

We explained that we update weights with a formula like $w := w - \frac{\partial \mathcal{L}}{\partial w}$. This is not quite accurate.

In that formula, the derivative of the loss indicates a direction in the space of weights. The gradient of the loss points towards the direction in which the loss increases the fastest, so we move in the opposite direction, i.e. the one in which the loss decreases.

Yet, we should also consider the size of the step to take: taking too large a step in that direction could make us overshoot the minimum and end up at a worse point.

The size of that gradient step is set by what we call the _learning rate_. We usually denote it $\lambda$, and update the weights of the model as:
$$w := w - \lambda \frac{\partial \mathcal{L}}{\partial w}$$

The learning rate is chosen by training with several values and keeping the one that leads to the best performance. There is no universal rule for picking it. In practice, research papers introducing new models also report the learning rate they used, and we often simply reuse that value.
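The effect of the step size can be seen on a toy problem. The sketch below, which is only an illustration and not part of the course code, applies the update $w := w - \lambda \frac{\partial \mathcal{L}}{\partial w}$ to the loss $\mathcal{L}(w) = w^2$, whose minimum is at $w = 0$. The function name `gradient_descent` and the chosen values are assumptions made for the example.

```python
import torch

def gradient_descent(lr, n_steps=20, w0=5.0):
    """Minimize L(w) = w^2 with plain gradient descent and return the final w."""
    w = torch.tensor(w0, requires_grad=True)
    for _ in range(n_steps):
        loss = w ** 2
        loss.backward()            # computes dL/dw = 2w
        with torch.no_grad():
            w -= lr * w.grad       # w := w - lr * dL/dw
        w.grad.zero_()             # reset the gradient for the next step
    return w.item()

w_small = gradient_descent(lr=0.1)  # each step multiplies w by 0.8: converges to 0
w_large = gradient_descent(lr=1.1)  # each step multiplies w by -1.2: diverges

print(w_small, w_large)
```

With the small learning rate the weight shrinks towards the minimum, whereas with the large one each step overshoots and the weight moves further away.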

There exist several formulas to update the weights of a model. We presented here the principle of _Gradient Descent_, but there are others, like [_Stochastic Gradient Descent_](https://pytorch.org/docs/stable/generated/torch.optim.SGD.html#torch.optim.SGD) or [_Adam_](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html#torch.optim.Adam). In torch, they are called [_optimizers_](https://pytorch.org/docs/stable/optim.html#algorithms).

Pytorch handles the optimization step that updates the weights of the model. All we need to do is define an _optimizer_ with its parameters, as in the following cell. The list of available optimizers can be found [here](https://pytorch.org/docs/stable/optim.html#algorithms).

In [None]:
import torch
import torch.optim as optim

# Define a neural network
model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet18', pretrained=True)

# Define the optimizer with the network's weights (or parameters), the learning rate and possibly other parameters
optimizer = optim.SGD(model.parameters(), lr=0.0001, momentum=0.9)
# Alternative: Adam. Only one optimizer is used for training, so this line overrides the previous one
optimizer = optim.Adam(model.parameters(), lr=0.001)

### Learning rate scheduler

During the training, we may decide to reduce the learning rate. This is appropriate when the loss doesn't evolve much after a while as the network's weights get close to a minimum of the loss. Reducing the learning rate then allows the network to do minor and more refined updates of weights and therefore get closer to that minimum.

This update of the learning rate is done with what is called a [_Learning rate scheduler_](https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate).

This is again easy to define and use in a training pipeline, but we won't get more into details to focus on the other notions presented above.
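For reference, here is a minimal sketch of one such scheduler, `StepLR`, which multiplies the learning rate by `gamma` every `step_size` epochs. The tiny linear model and the chosen values are placeholders for the example, and the actual training step is omitted.

```python
import torch
from torch import nn, optim

model = nn.Linear(2, 1)  # placeholder model
optimizer = optim.SGD(model.parameters(), lr=0.1)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.1)

learning_rates = []
for epoch in range(6):
    # Learning rate used during this epoch
    learning_rates.append(optimizer.param_groups[0]["lr"])
    # ... one epoch of training would go here ...
    optimizer.step()
    scheduler.step()  # called once per epoch, after the optimizer

print(learning_rates)
```

The learning rate stays at 0.1 for two epochs, then drops to 0.01 for the next two, and so on.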

## Writing a training script

### The training loop

Let us now get to the training pipeline!

**_Reminder_**: To train a model, we need:
- A _model_;
- Data to train on, i.e. a _train dataloader_;
- A _loss function_, or _criterion_;
- An _optimizer_ to update the model's weights.

The **training process** is then very simple:
- We **take a batch of data from the dataloader**, this batch being made of inputs and labels;
- The **model predicts outputs** from the inputs;
- We **compute a loss** between the predictions and the labels;
- We **update the model** given that loss.

The next cell defines a training function.

In [None]:
import torch
from tqdm import tqdm

def train_loop(net, trainloader, criterion, optimizer):
    running_loss = 0.0
    with tqdm(trainloader, desc="Training") as pbar:
        for (inputs, labels) in pbar:
            # get the inputs; data is a list of [inputs, labels]
            # .cuda() sends the data to the GPU
            inputs = inputs.cuda()
            labels = labels.cuda()

            # zero the parameter gradients
            optimizer.zero_grad()

            # forward + backward + optimize
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

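            # display the loss of the current batch in the progress bar
            # (running_loss is reset after every batch, so nothing is accumulated)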
            running_loss += loss.item()
            pbar.set_postfix({"train_loss": f"{running_loss:.3f}"})
            running_loss = 0.0

The previous code makes one pass over the whole train dataloader. We can already test it on some data.

In [None]:
from torch import nn, optim
from casablanca_course.training import load_dataset, build_net

dataset = "cifar10"

trainloader, testloader = load_dataset(dataset, batch_size=32, train_share=50)

net = build_net(dataset).cuda()

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=1e-3, momentum=0.9)

In [None]:
train_loop(net, trainloader, criterion, optimizer)

### The validation loop

We may evaluate the quality of the model during training. To do so, we use the _validation dataset_ defined previously. The evaluation simply consists of making predictions for each input and comparing them with the labels. The model is not updated during this step, so there is no need to compute the gradient of the loss.

The following cell defines an evaluation of the model, given a neural network, a validation dataloader and a loss function (or criterion):

In [None]:
def test_loop(net, testloader, criterion):
    # Tell torch to not compute gradients during predictions. This saves significant computational time.
    with torch.no_grad():
        val_loss = 0.0
        with tqdm(testloader, desc="Validation") as pbar:
            for (inputs, labels) in pbar:
                # Extract inputs and labels from the dataloader, and send them to the GPU
                inputs = inputs.cuda()
                labels = labels.cuda()

                # forward pass and loss computation, with no backward or optim
                outputs = net(inputs)
                loss = criterion(outputs, labels)

                # print statistics
                val_loss += loss.item()
                pbar.set_postfix(
                    {"val_loss": f"{loss.item() :.3f}"})
        print(
            f'Validation loss: {val_loss / len(testloader):.3f}')

In [None]:
test_loop(net, testloader, criterion)

During training, both the training and the validation losses should keep decreasing.

The **training should be stopped when the validation loss starts increasing again while the training loss keeps decreasing**. The regime in which the training loss decreases while the validation loss increases is called _overfitting_.
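This stopping rule is often automated under the name _early stopping_. The sketch below is one possible version, not the course's implementation: the function `early_stopping_epoch` and its `patience` parameter are hypothetical names, and the validation losses are made-up values.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch at which training would stop,
    or None if the validation loss never stalls for `patience` epochs."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best_loss:
            # New best validation loss: reset the counter
            best_loss = val_loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Made-up validation losses: they decrease, then start increasing (overfitting)
val_losses = [1.00, 0.80, 0.70, 0.72, 0.75, 0.90]
stop_epoch = early_stopping_epoch(val_losses, patience=2)
print(stop_epoch)  # → 4
```

Using a patience larger than 1 avoids stopping on a single noisy validation measurement.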

### Epochs

During training, we make several passes over the training dataset. Each of these passes is called an **epoch**: during one epoch, each image is processed exactly once by the model.
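As a quick check of what one epoch represents, the following sketch builds a toy dataset of 100 random samples (an assumption made for the example) and counts the batches and samples seen in one pass over the dataloader.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset: 100 random samples with binary labels
dataset = TensorDataset(torch.randn(100, 3), torch.randint(0, 2, (100,)))
loader = DataLoader(dataset, batch_size=32, shuffle=True)

n_batches = len(loader)  # ceil(100 / 32) = 4, the last batch being smaller
n_seen = sum(inputs.shape[0] for inputs, _ in loader)  # samples seen in one epoch

print(n_batches, n_seen)  # → 4 100
```

By the same computation, the 50,000 training images of CIFAR-10 with a batch size of 32 give 1,563 batches per epoch.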

Each epoch usually alternates between a training step, going over the whole training set, and an evaluation step, which measures the performance of the model on the validation set as a control of the training.

The following cell presents the whole training process:

In [None]:
def training_process(net, trainloader, testloader, criterion, optimizer, n_epochs):
    # For each epoch, i.e. going over the training set once
    for epoch in range(n_epochs):
        print("=" * 20, f"\nEpoch {epoch + 1}")

        # Training phase
        train_loop(net, trainloader, criterion, optimizer)

        # Evaluation phase
        test_loop(net, testloader, criterion)

    print('\n\nFinished Training')

In [None]:
training_process(net, trainloader, testloader, criterion, optimizer, n_epochs=4)

We used only 4 epochs here so that the cell runs quickly, but fully training the network would require hundreds or thousands of epochs, which would represent hours or days of computation on a GPU.

We printed the losses, but we can also store them during training and plot them as curves, showing the loss as a function of the number of epochs or of batches seen. A useful tool for that is [Tensorboard](https://pytorch.org/docs/stable/tensorboard.html), which you may have a look at in the future.
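A simple way to store the losses is to append the average loss of each epoch to a list. The sketch below does so on a toy regression problem; the data, the linear model and the `history` dictionary are assumptions made for the example, and the code falls back to the CPU when no GPU is available.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"

# Toy regression data standing in for a real dataset
inputs = torch.randn(256, 4)
targets = inputs.sum(dim=1, keepdim=True)
loader = DataLoader(TensorDataset(inputs, targets), batch_size=32, shuffle=True)

model = nn.Linear(4, 1).to(device)
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.05)

history = {"train_loss": []}
for epoch in range(5):
    epoch_loss = 0.0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    # Store the average loss over the batches of this epoch
    history["train_loss"].append(epoch_loss / len(loader))

print(history["train_loss"])
```

The stored values can then be plotted, for instance with `plt.plot(history["train_loss"])` from matplotlib.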

### The whole code

To get a general overview of a deep learning code, we summarize in the next cell all the code we wrote in the previous sessions in order to train a model.

In [None]:
import torch
from torch import nn, optim
from torch.utils.data import DataLoader
import torchvision
from torchvision import transforms
from torchvision.datasets import CIFAR10
from tqdm import tqdm


# Define the train and validation dataloaders
train_transforms = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
test_transforms = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

trainset = CIFAR10(root='./data', train=True, download=True, transform=train_transforms)
trainloader = DataLoader(trainset, batch_size=32, shuffle=True, num_workers=2)

testset = CIFAR10(root='./data', train=False, download=True, transform=test_transforms)
testloader = DataLoader(testset, batch_size=1024, shuffle=False, num_workers=2)

print(f"The training set is composed of {len(trainset)} images, while the test set includes {len(testset)} images.")


# Define the model
net = nn.Sequential(
    nn.Conv2d(3, 6, 5),
    nn.ReLU(),
    nn.MaxPool2d(2, 2),

    nn.Conv2d(6, 16, 5),
    nn.ReLU(),
    nn.MaxPool2d(2, 2),

    nn.Flatten(),

    nn.Linear(16 * 5 * 5, 120),
    nn.ReLU(),
    nn.Linear(120, 84),
    nn.ReLU(),
    nn.Linear(84, 10)
)
net = net.cuda()


# Define loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=1e-3, momentum=0.9)


def training_process(net, trainloader, testloader, criterion, optimizer, n_epochs):
    # For each epoch, i.e. going over the training set once
    for epoch in range(n_epochs):
        print("=" * 20, f"\nEpoch {epoch + 1}")

        # Training phase
        net.train()
        running_loss = 0.0
        with tqdm(trainloader, desc="Training") as pbar:
            for (inputs, labels) in pbar:
                # get the inputs; data is a list of [inputs, labels]
                inputs = inputs.cuda()
                labels = labels.cuda()

                # zero the parameter gradients
                optimizer.zero_grad()

                # forward + backward + optimize
                outputs = net(inputs)
                loss = criterion(outputs, labels)
                loss.backward()
                optimizer.step()

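                # display the loss of the current batch in the progress bar
                # (running_loss is reset after every batch, so nothing is accumulated)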
                running_loss += loss.item()
                pbar.set_postfix({"train_loss": f"{running_loss:.3f}"})
                running_loss = 0.0

        # Evaluation phase
        net.eval()
        with torch.no_grad():
          val_loss = 0.0
          with tqdm(testloader, desc="Validation") as pbar:
              for (inputs, labels) in pbar:
                  # Extract inputs and labels from the dataloader, and send them to the GPU
                  inputs = inputs.cuda()
                  labels = labels.cuda()

                  # forward pass and loss computation, with no backward or optim
                  outputs = net(inputs)
                  loss = criterion(outputs, labels)

                  # print statistics
                  val_loss += loss.item()
                  pbar.set_postfix(
                      {"val_loss": f"{loss.item() :.3f}"})
          print(
              f'Validation loss: {val_loss / len(testloader):.3f}')

    print('\n\nFinished Training')

training_process(net, trainloader, testloader, criterion, optimizer, n_epochs=4)

## Visualize CNN filters

### Visualizing features

To better understand how neural networks, and in particular convolutional ones, work, it is worth visualizing their filters.

When feeding an image to a convolutional neural network, this image is processed by a succession of convolutional layers and _activation_ layers, i.e. non-linear functions. To understand the role of a given convolutional filter on the output, we can have a look at what the output of that layer looks like for some input image.

Indeed, as we saw in the first session, a CNN filter of kernel $K$ acts on an image $I$ as:

$$(I * K)[x, y] = b + \sum_{i,j=-k}^k I[x+i, y+j] K[i, j]$$

Once the network has been trained, we don't change its weights anymore. Therefore $K$ is constant, and the output value $I * K$ of that filter only depends on the input image.
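The formula can be checked numerically on a small example. The sketch below evaluates the sum by hand at one pixel and compares it with `torch.nn.functional.conv2d`, which computes the same (non-flipped) sum. The image, kernel and bias are random or arbitrary values chosen for the example, and the kernel indices are shifted by $k$ because tensors are indexed from 0.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
k = 1                                        # half-size: the kernel is (2k+1) x (2k+1)
image = torch.randn(5, 5)                    # single-channel image I
kernel = torch.randn(2 * k + 1, 2 * k + 1)   # kernel K
bias = 0.5                                   # bias b

# Manual evaluation of the formula at pixel (x, y) = (2, 2)
x, y = 2, 2
manual = bias + sum(
    image[x + i, y + j] * kernel[i + k, j + k]
    for i in range(-k, k + 1)
    for j in range(-k, k + 1)
)

# Same computation with PyTorch, which expects (batch, channels, H, W) tensors
output = F.conv2d(image[None, None], kernel[None, None], bias=torch.tensor([bias]))
# Without padding the output is 3x3, so image pixel (2, 2) maps to output[1, 1]
torch_value = output[0, 0, x - k, y - k]

print(manual.item(), torch_value.item())
```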

What we can do now is to look for the image $I$ that leads to the largest value of $I*K$ on average, i.e. the image which leads to the largest _activation_ of the filter. We solve the problem:

$$\arg\max_{I} \frac{1}{WH}\sum_{x, y \in I} (I * K)[x,y] = \arg\min_{I} \left(- \frac{1}{WH}\sum_{x, y \in I} (I * K)[x,y]\right)$$

The optimum is an image which we visualize. We do this in the following code:

In [None]:
"""
Adapted from https://github.com/utkuozbulak/pytorch-cnn-visualizations/blob/master/src/cnn_layer_visualization.py
"""
import numpy as np
import matplotlib.pyplot as plt

import torch
from torch.optim import Adam
from torchvision import models
from casablanca_course.training.layer_viz_utils import preprocess_image, recreate_image


class CNNLayerVisualization():
    """
        Produces an image that minimizes the loss of a convolution
        operation for a specific layer and filter
    """

    def __init__(self, model, selected_layer, selected_filter, n_optim_iters=30):
        self.model = model
        self.model.eval().cuda()
        self.selected_layer = selected_layer
        self.selected_filter = selected_filter
        self.n_optim_iters = n_optim_iters

    def create_random_image(self):
        # Generate a random image
        random_image = np.uint8(np.random.uniform(150, 180, (224, 224, 3)))
        return preprocess_image(random_image)

    def visualise_layer(self, visualize=True):
        # Generate a random image
        processed_image = self.create_random_image()

        # Define optimizer for the image
        optimizer = Adam([processed_image], lr=0.1, weight_decay=1e-6)

        for _ in range(self.n_optim_iters):
            optimizer.zero_grad()
            # Assign the created image to a variable to move forward in the model
            x = processed_image.cuda()

            for index, layer in enumerate(self.model):
                # Forward pass layer by layer
                x = layer(x)
                if index == self.selected_layer:
                    # Stop at the selected layer
                    break

            # Select the filter we are interested in, compute the loss and optimize the input image
            conv_output = x[0, self.selected_filter]
            # The loss is the negative mean of the output of the selected layer/filter:
            # minimizing it maximizes the mean activation of that specific filter
            loss = -torch.mean(conv_output)
            # Backward
            loss.backward()
            # Update image
            optimizer.step()

        # Recreate image
        self.recreated_image = recreate_image(processed_image)
        if visualize:
            plt.imshow(self.recreated_image)
            plt.show()

In [None]:
# Take a pretrained model
pretrained_model = models.vgg16(pretrained=True).features

# We pick a filter of a given layer and visualize it
cnn_layer = 2
filter_pos = 4

# Layer visualization
layer_vis = CNNLayerVisualization(pretrained_model, cnn_layer, filter_pos)
layer_vis.visualise_layer()

In [None]:
# We pick a filter of a given layer and visualize it
cnn_layer = 24
filter_pos = 1

# Layer visualization
layer_vis = CNNLayerVisualization(pretrained_model, cnn_layer, filter_pos)
layer_vis.visualise_layer()

Change the layers and filters in the previous cells to visualize filters at various depths in the network.

### Comparing trained and not trained models

We compare in what follows these features between a trained version and an untrained one of the same architecture.

Once again, you may change the layers to visualize different filters.

In [None]:
def compare_before_after_training(cnn_layer, filter_pos):
    # Select trained and not trained models
    untrained_model = models.vgg16(pretrained=False).features
    trained_model = models.vgg16(pretrained=True).features

    untrained_layer_vis = CNNLayerVisualization(untrained_model, cnn_layer, filter_pos)
    trained_layer_vis = CNNLayerVisualization(trained_model, cnn_layer, filter_pos)

    untrained_layer_vis.visualise_layer(visualize=False)
    trained_layer_vis.visualise_layer(visualize=False)

    # Display recreated images
    plt.figure(figsize=(15, 15))
    plt.subplot(121)
    plt.imshow(untrained_layer_vis.recreated_image)
    plt.title("Untrained model feature")
    plt.subplot(122)
    plt.imshow(trained_layer_vis.recreated_image)
    plt.title("Trained model feature")

# We pick a filter of a given layer and visualize it
cnn_layer = 24
filter_pos = 3

compare_before_after_training(cnn_layer, filter_pos)

With these visualizations, we notice significant differences between the images that activate each model:
- On the one hand, the untrained network is activated by white noise. It means its filters do not detect any structure in the inputs.
- On the other hand, the trained network captures more and more complex structures as we get deeper in the model.

These complex structures are of particular interest: they have been obtained by training on massive data and they represent very important information in images that helps to classify them.

While these models are trained on data that is irrelevant to most industrial applications, we would still like to reuse these features and models for our own problems. This is possible with what is called _transfer learning_.

## Transfer learning

### Saving and loading a model

Training a model takes a very long time and is computationally intensive. Therefore, we do not train a model every time we run a program: we rather save trained networks and load them when necessary.

Both operations can be done in a single line of code.

During training, we may save the model whenever we find it relevant: we can choose to save the model at the end of each epoch or when we improve the validation accuracy for instance.

First, let us create a small neural network and train it:

In [None]:
import torch.nn as nn
from casablanca_course.models import evaluate_net, train_for_net

def create_cnn():
  return nn.Sequential(
      nn.Conv2d(1, 6, 3),
      nn.ReLU(),
      nn.Conv2d(6, 3, 3),
      nn.ReLU(),
      nn.Flatten(),
      nn.Linear(24 * 24 * 3, 256),
      nn.ReLU(),
      nn.Linear(256, 10),
      nn.Softmax(dim=1),
  )

my_cnn = create_cnn()

def compare_before_after(net, batch_size=128, n_epochs=4, lr=0.001):
  print("Evaluation before training the network")
  net.eval()
  evaluate_net(net, dataset="mnist")

  print("\n\n")
  net.train()
  # Forward the function's arguments instead of hard-coded values
  train_for_net(net, dataset="mnist", batch_size=batch_size, n_epochs=n_epochs, lr=lr)
  print("\n\n")

  print("Evaluation after training the network")
  net.eval()
  evaluate_net(net, dataset="mnist")


compare_before_after(my_cnn, lr=3e-3)

We have trained our model, and we can now have a look at its weights:

In [None]:
my_cnn.state_dict()

As we saw previously, our model is made of weights representing the convolutional kernels for the Conv2d layers, and weight matrices for the linear layers, as well as bias vectors.

We now want to save the model in order to reuse it later. Here is how we do it:

In [None]:
torch.save(my_cnn.state_dict(), "./my_model_weights.pth")

In [None]:
!ls -l my_model_weights.pth

In our case, our model weighs around 1.8 MB.
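This size can be estimated from the number of parameters, assuming each one is stored as a 32-bit float (4 bytes), which is PyTorch's default. The sketch below rebuilds the same architecture as `create_cnn` so that it runs on its own.

```python
import torch.nn as nn

# Same architecture as create_cnn, rebuilt here to keep the example self-contained
model = nn.Sequential(
    nn.Conv2d(1, 6, 3),
    nn.ReLU(),
    nn.Conv2d(6, 3, 3),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(24 * 24 * 3, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
    nn.Softmax(dim=1),
)

n_params = sum(p.numel() for p in model.parameters())
size_mb = n_params * 4 / 1e6  # 4 bytes per float32 parameter

print(n_params, f"{size_mb:.2f} MB")  # → 445419 1.78 MB
```

Almost all of these parameters belong to the first linear layer (1728 × 256 weights). The file on disk is slightly larger than this estimate because it also stores the parameter names and some metadata.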

Now, assume our model is not in memory anymore. We would like to load it again to use it. To do so, all we have to do is to create an instance of the model and load its weights:

In [None]:
# Create an instance of the model
loaded_model = create_cnn()

loaded_model.eval()
evaluate_net(loaded_model, dataset="mnist")

Our model performs very poorly, with a low accuracy. Indeed, its weights have only been randomly initialized.

We load the saved weights with the following line:

In [None]:
# Load the weights stored in the previously saved file
loaded_model.load_state_dict(torch.load("./my_model_weights.pth"))

loaded_model.eval()
evaluate_net(loaded_model, dataset="mnist")

Indeed, our model now performs as well as after the previous training.
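The save and load round trip can also be verified without any dataset, by checking that two networks give identical outputs once they share the same weights. The sketch below uses a small placeholder network (`make_net` is a hypothetical helper) and a temporary directory so that no file is left behind.

```python
import os
import tempfile

import torch
import torch.nn as nn

def make_net():
    # Small placeholder network, unrelated to the MNIST model above
    return nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))

torch.manual_seed(0)
trained = make_net()   # plays the role of the trained model
fresh = make_net()     # new instance with different random weights
x = torch.randn(4, 8)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "weights.pth")
    torch.save(trained.state_dict(), path)
    # Before loading, the two networks disagree on the same input
    differs_before = not torch.allclose(trained(x), fresh(x))
    fresh.load_state_dict(torch.load(path))

# After loading, both networks produce exactly the same outputs
same_after = torch.equal(trained(x), fresh(x))
print(differs_before, same_after)
```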

### Training from scratch

Let us get back to the CIFAR 10 dataset. We would like to train a VGG16 model on this dataset.

We can simply take a VGG model available online and run it on the testset.

In [None]:
import torch
from torch import nn, optim
from torch.utils.data import DataLoader
import torchvision
from torchvision import transforms
from torchvision.datasets import CIFAR10
from tqdm import tqdm
from casablanca_course.training.eval import evaluate_net

test_transforms = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

testset = CIFAR10(root='./data', train=False, download=True, transform=test_transforms)
testloader = DataLoader(testset, batch_size=1024, shuffle=False, num_workers=2)

num_classes = len(testloader.dataset.classes)
print(f"Data contains {num_classes} classes.")

In [None]:
net = torchvision.models.vgg16(pretrained=False).cuda()
evaluate_net(net)

The accuracy with a randomly initialized and untrained network is very poor. The same occurs with a pretrained network:

In [None]:
net = torchvision.models.vgg16(pretrained=True).cuda()
evaluate_net(net)

Indeed, the current model is built to classify images into 1000 classes whereas our data is split in only 10 classes. The next cell shows the architecture of our network, whose final layer (classifier.6) has an output size of 1000.

In [None]:
print(net)

Instead, we can build a model which classifies images in only 10 categories. We do it by only changing the output size of the last layer of the VGG model:

In [None]:
net = torchvision.models.vgg16(pretrained=False, num_classes=10).cuda()
print(net)
evaluate_net(net)

This time, we get an accuracy around 10%, which corresponds to random classification among 10 categories.

We can train the model from scratch as we did before:

In [None]:
train_transforms = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

trainset = CIFAR10(root='./data', train=True, download=True, transform=train_transforms)
trainloader = DataLoader(trainset, batch_size=32, shuffle=True, num_workers=2)

# Define loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=1e-3, momentum=0.9)


training_process(net, trainloader, testloader, criterion, optimizer, n_epochs=4)

In [None]:
evaluate_net(net)

We notice that this training takes a long time. Our dataset is small and made of very small images, but the training may already take at least a few dozen minutes before converging.

In industry, most datasets are made of large images, which makes the training more tedious and much longer.

This costs time but also computational resources and hence money.

### Reusing features

We can save much time by reusing a previously trained model. Here, we adapt a VGG model trained on massive data in order to make it efficient on our own problem.

We start by reusing a trained VGG.

In [None]:
net = torchvision.models.vgg16(pretrained=True)
print(net)

This model is fully trained but classifies data into 1000 categories. We need to classify images in only 10 categories while keeping the same features.

First, we _freeze_ the weights, i.e. we prevent them from being updated and keep their value constant.

In [None]:
def count_parameters(model):
  total_params = sum(p.numel() for p in model.parameters())
  trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
  return total_params, trainable_params

total, trainable = count_parameters(net)
print(f"Before freezing, the model has {total} parameters, among which {trainable} are trainable.")

for param in net.parameters():
    # Remove possibility to update each parameter/weight
    param.requires_grad = False

total, trainable = count_parameters(net)
print(f"After freezing, the model has {total} parameters, among which {trainable} are trainable.")

Now that the weights are frozen, we can replace the last trainable layer to fit our problem:

In [None]:
in_features = net.classifier[6].in_features
net.classifier[6] = nn.Linear(in_features, num_classes)

print(net)
total, trainable = count_parameters(net)
print(f"After changing the last layer, the model has {total} parameters, among which {trainable} are trainable.")

We now have a VGG model that outputs 10 values instead of 1000. The point of interest lies in the number of parameters to train: while training from scratch requires tuning around 134 million parameters, we have reduced that number to about 41,000. This is much less demanding in terms of data and training time.
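The number of trainable parameters can be recovered by hand: a linear layer with $n$ inputs and $m$ outputs has $n \times m$ weights and $m$ biases. The sketch below applies the same freeze-and-replace recipe to a small stand-in network (not VGG, to avoid any download), then computes the count for a VGG16 head, whose last layer takes 4096 input features.

```python
import torch.nn as nn

# Small stand-in for a pretrained network with a 1000-class head
backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1000))

# Freeze every existing parameter
for param in backbone.parameters():
    param.requires_grad = False

# Replace the head: a new layer is trainable by default
in_features = backbone[2].in_features
backbone[2] = nn.Linear(in_features, 10)

trainable = sum(p.numel() for p in backbone.parameters() if p.requires_grad)
expected = in_features * 10 + 10  # weights + biases of the new head

# Same arithmetic for the last layer of VGG16 (4096 input features, 10 classes)
vgg_head_params = 4096 * 10 + 10

print(trainable, expected, vgg_head_params)  # → 650 650 40970
```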

We can evaluate the accuracy of that changed model:

In [None]:
net = net.cuda()
evaluate_net(net)

The accuracy is still around 10%, as we replaced the last layer with a randomly initialized one.

Yet, we can train the model a bit as we did before:

In [None]:
optimizer = optim.SGD(net.classifier.parameters(), lr=1e-4, momentum=0.9)

training_process(net, trainloader, testloader, criterion, optimizer, n_epochs=2)

evaluate_net(net)

With only 2 training epochs, the pre-trained model performs better than the model trained from scratch for 4 epochs.

Although the trainings should be run for longer to truly show this phenomenon, _finetuning_ deep learning models as we did saves time and computation, which is very useful in numerous applications.

## References

- [Most helpful forum for finding a solution to fix errors you may get while developing](https://stackoverflow.com)
- [Loss functions in Pytorch](https://pytorch.org/docs/stable/nn.html#loss-functions)
- [Github repository for visualizing CNN features](https://github.com/utkuozbulak/pytorch-cnn-visualizations)
- [Some common loss functions](https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html)
- [Wikipedia explanation of transfer learning](https://en.wikipedia.org/wiki/Transfer_learning)

_Practical session written by Thomas Chabal and Clément Riu • Spring School on Data Science - Ecole Centrale Casablanca • 2022._