# 8. Transfer Learning, Freezing and Tuning

### About this notebook

This notebook was used in the 50.039 Deep Learning course at the Singapore University of Technology and Design.

**Author:** Matthieu DE MARI (matthieu_demari@sutd.edu.sg)

**Version:** 1.0 (10/02/2023)

**Requirements:**
- Python 3 (tested on v3.9.6)
- Torch (tested on v1.12.1)
- Torchvision (tested on v0.13.1)

### Imports and CUDA

In [1]:
# Torch
import torch
import torchvision
from torch.utils.data import Dataset
from torchvision import datasets
import torch.optim as optim
from torchvision.transforms import ToTensor, Compose, Normalize
from torchvision.datasets import MNIST
import torch.nn.functional as F
import torch.nn as nn

In [2]:
# Use GPU if available, else use CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


### MNIST Dataset

As before.

In [3]:
# Define transform to convert images to tensors and normalize them
transform_data = Compose([ToTensor(),
                          Normalize((0.1307,), (0.3081,))])

# Load the data
batch_size = 256
train_dataset = MNIST(root = './mnist/', train = True, download = True, transform = transform_data)
test_dataset = MNIST(root = './mnist/', train = False, download = True, transform = transform_data)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size = batch_size, shuffle = True)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size = batch_size, shuffle = False)

### The concept of transfer learning

In the previous notebook, we have seen a few different models for Computer Vision. These models sometimes had very large architectures, and therefore required very large computational resources to train. For that reason, the models have been released publicly so that anyone can reuse them, without having to train them from scratch.

This is especially useful if we are looking for a high-performance model to use on an image dataset whose images are roughly similar, but not necessarily identical, to those of the dataset used to train these state-of-the-art models in the first place (e.g. ImageNet).

In these scenarios, it is often preferable to use these pre-trained networks as starting points and then modify parts of these models (by adding or replacing some layers). We then take this mostly trained model and resume training on our own dataset.

This is called **transfer learning**: a model trained on one task is used as the starting point for a model on a second related task. This allows the model to take advantage of the knowledge learned from the first task and apply it to the second task, often resulting in faster training times and improved performance.

In this notebook, as in the previous one, we will reuse a pre-trained ResNet model, as shown below.

In [4]:
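# Load pre-trained ResNet model (ImageNet weights)
# The weights argument replaces pretrained = True, which is deprecated since torchvision v0.13
resnet = torchvision.models.resnet18(weights = torchvision.models.ResNet18_Weights.DEFAULT)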
print(resnet)



ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
  

### Adjusting the pre-trained model to the new task - changing layers

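Our ResNet model was trained on the ImageNet dataset, which contains RGB images, therefore having three channels (Red, Green and Blue), typically resized and cropped to $ 224 \times 224 $ before being fed to the network. In comparison, MNIST consists of greyscale images (therefore using only 1 channel), of size $ 28 \times 28 $.

The smaller image size needs no adjustment, because the ResNet applies an adaptive average pooling before its final Linear layer. The number of input channels, however, does: our first adjustment then consists of replacing the first Conv2d layer to accommodate that change. This is done as shown below.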

In [5]:
# Replace the first convolutional layer with a single-channel convolutional layer
# Expecting only one input channel instead of 3.
resnet.conv1 = torch.nn.Conv2d(1, 64, kernel_size = 7, stride = 2, padding = 3, bias = False)

In [6]:
# Show new Resnet architecture after replacement
print(resnet)

ResNet(
  (conv1): Conv2d(1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
  

### Freezing layers to avoid retraining them

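In PyTorch, gradient tracking is controlled per parameter tensor, through the requires_grad attribute. Setting requires_grad to False for every parameter of a given layer freezes the layer. This means that no gradients are computed for the parameters of said layer (weight, bias, kernel, etc.), so they are left unchanged during training.

The first Convolutional layer, however, was freshly initialized when we replaced it, and therefore has to be trained.

In [7]:
# Freeze all layers except the new first layer
for param in resnet.parameters():
    param.requires_grad = False
# Re-enable gradients for the parameters of the new first layer
for param in resnet.conv1.parameters():
    param.requires_grad = True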
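
To see what freezing does in practice, here is a minimal sketch on a toy two-layer network (the names toy, frozen_layer and trainable_layer are made up for illustration). After a backward pass, a frozen parameter has no gradient at all, so an optimizer has nothing to apply to it.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy network standing in for a pre-trained model (illustrative only)
frozen_layer = nn.Linear(4, 3)
trainable_layer = nn.Linear(3, 2)
toy = nn.Sequential(frozen_layer, nn.ReLU(), trainable_layer)

# Freeze the first layer: requires_grad is set on each parameter tensor
for param in frozen_layer.parameters():
    param.requires_grad = False

# One forward and backward pass on random data
loss = toy(torch.randn(8, 4)).sum()
loss.backward()

# Frozen parameters receive no gradient, trainable ones do
print(frozen_layer.weight.grad is None)         # True
print(trainable_layer.weight.grad is not None)  # True

# Number of parameters that will still be updated: 3*2 weights + 2 biases
n_trainable = sum(p.numel() for p in toy.parameters() if p.requires_grad)
print(n_trainable)                              # 8
```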

### Adjusting the pre-trained model to the new task - adding layers

The dataset used to train the ResNet consisted of images from 1000 different classes. In comparison, MNIST only has 10 classes. To cope with this change, we can reuse the ResNet model we have so far, but add an extra Linear layer, which will reduce the size of the output vector from a 1D vector with 1000 elements to a 1D vector with only 10 elements, corresponding to the 10 classes of MNIST.

This can be done using the Sequential container, as shown below.

In [8]:
# Replace the final layer in ResNet, with the same original Linear layer (512, 1000)
# and add a Linear (1000, 10) on top of that.
resnet.fc = torch.nn.Sequential(resnet.fc, torch.nn.Linear(1000, 10))
print(resnet)

ResNet(
  (conv1): Conv2d(1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
  

In [9]:
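# Freeze all layers except the new first and last layers
for param in resnet.parameters():
    param.requires_grad = False
# Re-enable gradients for the new Conv2d layer and the new Linear (1000, 10) layer
for layer in (resnet.conv1, resnet.fc[1]):
    for param in layer.parameters():
        param.requires_grad = True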

### Retraining the custom model

Having performed the adjustments required to make our pre-trained ResNet model compatible with the MNIST dataset, we can now retrain it for this specific dataset!

As before, our training loop will look along the lines shown below.

**Warning:** ResNets are rather large models (even though we froze most layers). Retraining will take quite some time.

In [10]:
# Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(resnet.parameters(), lr = 0.001)

# Prepare model for training
resnet.train()
resnet.to(device)

# Retrain
num_epochs = 3
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        # Forward pass
        outputs = resnet(images.to(device))
        loss = criterion(outputs, labels.to(device))
        
        # Backward pass and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        # Display
        if (i+1) % 10 == 0:
            print(f'Epoch [{epoch+1}/{num_epochs}], Step [{i+1}/{len(train_loader)}], Loss: {loss.item():.4f}')

Epoch [1/3], Step [10/235], Loss: 0.5666
Epoch [1/3], Step [20/235], Loss: 0.3886
Epoch [1/3], Step [30/235], Loss: 0.2676
Epoch [1/3], Step [40/235], Loss: 0.1870
Epoch [1/3], Step [50/235], Loss: 0.1345
Epoch [1/3], Step [60/235], Loss: 0.0834
Epoch [1/3], Step [70/235], Loss: 0.0867
Epoch [1/3], Step [80/235], Loss: 0.1642
Epoch [1/3], Step [90/235], Loss: 0.1918
Epoch [1/3], Step [100/235], Loss: 0.0850
Epoch [1/3], Step [110/235], Loss: 0.1126
Epoch [1/3], Step [120/235], Loss: 0.1066
Epoch [1/3], Step [130/235], Loss: 0.0555
Epoch [1/3], Step [140/235], Loss: 0.0701
Epoch [1/3], Step [150/235], Loss: 0.0746
Epoch [1/3], Step [160/235], Loss: 0.0570
Epoch [1/3], Step [170/235], Loss: 0.0214
Epoch [1/3], Step [180/235], Loss: 0.0603
Epoch [1/3], Step [190/235], Loss: 0.1190
Epoch [1/3], Step [200/235], Loss: 0.0417
Epoch [1/3], Step [210/235], Loss: 0.0581
Epoch [1/3], Step [220/235], Loss: 0.1437
Epoch [1/3], Step [230/235], Loss: 0.0978
Epoch [2/3], Step [10/235], Loss: 0.0585
...

### Remember to test your model after retraining!

We leave it to the students to play with this notebook and check whether the model performance could be improved with more iterations.

Another idea worth exploring is to freeze fewer layers, or to progressively unfreeze them during training.

In [13]:
# Test the model
# Remember to use eval mode (we have some batchnorm layers!)
resnet.eval() 
with torch.no_grad():
    # Accuracy counters
    correct = 0
    total = 0
    for images, labels in test_loader:
        # Forward pass and predict (the index of the largest output is the predicted class)
        outputs = resnet(images.to(device))
        predicted = outputs.argmax(dim = 1)
        
        # Update accuracies
        total += labels.size(0)
        correct += (predicted == labels.to(device)).sum().item()

# Display
print('Test accuracy of the model after retraining for 3 epochs: {} %'.format(100*correct/total))

Test accuracy of the model after retraining for 3 epochs: 98.1 %


### What's next?

This concludes week 4 on Convolutional Neural Networks.

Next week, we will investigate a new type of data: series.