<a href="https://colab.research.google.com/github/AndyCatruna/DSM/blob/main/Lab_03_CNN_Architectures.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CNN Architectures

#### Motivation
In the previous lab we saw that it is difficult to construct a custom CNN for a specific task, as we have to experiment with many hyperparameters (number of layers, number of feature maps per layer, kernel sizes, fully connected network configuration, etc.).

Luckily we do not have to do that. There are plenty of architectures that have been constructed and tested for many tasks.

With very minor changes, we can adapt existing architectures to our task and be relatively confident that they will obtain good results (if trained correctly).

In Lab 3 you will learn how to utilize available architectures and how to fine-tune them.

#### Notes
As we utilize deep networks, training will take quite a long time in this lab. Running on GPU is necessary (Runtime -> Change Runtime Type).

If you see that a run is not going anywhere (Validation Accuracy is not increasing or Loss is not decreasing), cancel it to save some time.

If possible, run the training with higher batch sizes to speed up the training.

While the models are training, we recommend that you look over the papers of the architectures.😄

#### Libraries
There are plenty of architectures available in the [torchvision library](https://pytorch.org/vision/0.8/models.html), the [timm library](https://timm.fast.ai/), and on [Hugging Face](https://huggingface.co/models?pipeline_tag=image-classification&sort=downloads).

### Training an off-the-shelf model

In [None]:
import sys

import numpy as np
from tqdm import tqdm

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
from torchvision import transforms, datasets
from torch.utils.data import Dataset, DataLoader

We will use the CIFAR10 dataset as in the previous lab. Let's see if we can obtain better results this time.

In [None]:
train_transform = transforms.Compose(
    [
      transforms.ToTensor(),
      transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])

test_transform = transforms.Compose(
    [
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])

trainset = datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=train_transform)

# The batch size is the number of images the model processes in parallel
# We use shuffle for training as we don't want the model to see the images in the same order
trainloader = DataLoader(trainset, batch_size=256,
                                          shuffle=True)

testset = datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=test_transform)

# For testing we don't have to shuffle the data
testloader = DataLoader(testset, batch_size=256,
                                         shuffle=False)

classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

Code for training and validation - same as in previous lab

You should already be familiar with the following blocks of code.

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [None]:
def train_epoch(model, dataloader, device, optimizer, criterion, epoch):
    # We set the model to be in training mode
    model.train()

    total_train_loss = 0.0
    dataset_size = 0

    # This is only for showing the progress bar
    bar = tqdm(enumerate(dataloader), total=len(dataloader), colour='cyan', file=sys.stdout)

    # We iterate through all batches - 1 step is 1 batch of batch_size images
    for _, (images, labels) in bar:
        # We take the images and their labels and push them on GPU
        images = images.to(device)
        labels = labels.to(device)

        batch_size = images.shape[0]

        # Reset gradients
        optimizer.zero_grad()

        # Obtain predictions
        pred = model(images)

        # Compute loss for this batch
        loss = criterion(pred, labels)

        # Compute gradients for each weight (backpropagation)
        loss.backward()

        # Update weights based on gradients (gradient descent)
        optimizer.step()

        # We keep track of the average training loss
        total_train_loss += (loss.item() * batch_size)
        dataset_size += batch_size

        epoch_loss = np.round(total_train_loss / dataset_size, 2)
        bar.set_postfix(Epoch=epoch, Train_Loss=epoch_loss)

    return epoch_loss
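
The reason `train_epoch` multiplies each batch loss by `batch_size` can be shown with a few made-up numbers. `nn.CrossEntropyLoss` returns the mean loss over a batch by default, so when the last batch is smaller than the others, a plain average of the batch losses would give it too much weight. The values below are hypothetical and only illustrate the arithmetic.

```python
# Hypothetical per-batch mean losses and batch sizes (the last batch is smaller)
batch_losses = [2.0, 1.0, 4.0]
batch_sizes = [256, 256, 8]

total_loss = 0.0
dataset_size = 0
for loss, size in zip(batch_losses, batch_sizes):
    total_loss += loss * size   # turn the batch mean back into a batch sum
    dataset_size += size

# Mean over all samples, which is what train_epoch reports
weighted_mean = total_loss / dataset_size

# Plain average of the batch means, which over-weights the small last batch
naive_mean = sum(batch_losses) / len(batch_losses)

print(round(weighted_mean, 4), round(naive_mean, 4))
```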

In [None]:
def valid_epoch(model, dataloader, device, criterion, epoch):
    # We set the model in evaluation mode
    model.eval()

    total_val_loss = 0.0
    dataset_size = 0

    # We keep track of correct predictions
    correct = 0

    # This is only for showing the progress bar
    bar = tqdm(enumerate(dataloader), total=len(dataloader), colour='cyan', file=sys.stdout)

    for step, (images, labels) in bar:
        images = images.to(device)
        labels = labels.to(device)

        batch_size = images.shape[0]

        pred = model(images)
        loss = criterion(pred, labels)

        # The raw output of the model is a score for each class
        # We keep the index of the class with the highest score as the prediction
        _, predicted = torch.max(pred, 1)

        # We see how many predictions match the ground truth labels
        correct += (predicted == labels).sum().item()

        # We compute evaluation metrics - loss and accuracy
        total_val_loss += (loss.item() * batch_size)
        dataset_size += batch_size

        epoch_loss = np.round(total_val_loss / dataset_size, 2)

        accuracy = np.round(100 * correct / dataset_size, 2)

        bar.set_postfix(Epoch=epoch, Valid_Acc=accuracy, Valid_Loss=epoch_loss)

    return accuracy, epoch_loss
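
As a small illustration of how `valid_epoch` turns raw scores into an accuracy, the sketch below runs the same `torch.max` step on hand-written scores for 4 images and 3 classes. The numbers are invented for the example.

```python
import torch

# Hypothetical scores for 4 images and 3 classes
pred = torch.tensor([[2.0, 0.1, -1.0],
                     [0.3, 0.2, 1.5],
                     [-0.5, 0.9, 0.1],
                     [1.2, 1.1, 0.0]])
labels = torch.tensor([0, 2, 1, 1])

# torch.max along dim 1 returns (max values, indices);
# the indices are the predicted classes
_, predicted = torch.max(pred, 1)

correct = (predicted == labels).sum().item()
accuracy = 100 * correct / labels.shape[0]
print(predicted.tolist(), accuracy)
```

Here the last image is misclassified (class 0 instead of class 1), so 3 of the 4 predictions are correct.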

In [None]:
def run_training(model, num_epochs, learning_rate):
    # Define criterion
    criterion = nn.CrossEntropyLoss()

    # Define optimizer
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)

    # Check if we are using GPU
    if torch.cuda.is_available():
        print("[INFO] Using GPU: {}\n".format(torch.cuda.get_device_name()))

    # For keeping track of the best validation accuracy
    top_accuracy = 0.0

    # We train the model for a number of epochs
    for epoch in range(num_epochs):

        train_loss = train_epoch(model, trainloader, device, optimizer, criterion, epoch)
        print(f"Epoch {epoch} - Training Loss: {train_loss}")

        # For validation we do not keep track of gradients
        with torch.no_grad():
            val_accuracy, _ = valid_epoch(model, testloader, device, criterion, epoch)
            if val_accuracy > top_accuracy:
                print(f"Validation Accuracy Improved ({top_accuracy} ---> {val_accuracy})")
                top_accuracy = val_accuracy
        print()

Code for counting number of parameters of a model.

As we have limited compute resources, we will opt for models with fewer parameters.

In [None]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
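
A quick way to check `count_parameters` is to run it on a model small enough to count by hand. The toy network below is only an illustration and is not used elsewhere in the lab; it also shows that parameters with `requires_grad = False` are left out of the count.

```python
import torch.nn as nn

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Toy model, used only for illustration
toy = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3),   # 3*8*3*3 weights + 8 biases = 224
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(8, 10),                 # 8*10 weights + 10 biases = 90
)
total_params = count_parameters(toy)
print(total_params)                   # 224 + 90 = 314

# Frozen parameters are excluded from the count
for p in toy[0].parameters():
    p.requires_grad = False
trainable_params = count_parameters(toy)
print(trainable_params)               # only the linear layer remains: 90
```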

### SqueezeNet from torchvision

First, let's experiment with a model from the torchvision library

In [None]:
import torchvision.models as models

Let's see what models are available

In [None]:
models.list_models()

We will choose SqueezeNet which is a very lightweight model. More details about the architecture can be found in this [paper](https://arxiv.org/abs/1602.07360).

<img src=https://production-media.paperswithcode.com/methods/Screen_Shot_2020-06-26_at_6.04.32_PM.png width=750px>

In [None]:
squeeze_net = models.squeezenet1_0()
print(squeeze_net)
print("Number of parameters:", count_parameters(squeeze_net))

Visualization of modifying prediction head. In our case, we only change the final output layer.

<img src="https://miro.medium.com/v2/resize:fit:720/format:webp/0*8Z3To8OAwBBIj66p.jpg" width=700px>

The classifier is built for 1000 classes as it was intended for training on ImageNet. We have to modify it to work on 10 classes. Notice that the number of parameters changes, because the final 1x1 convolution now has 10 output channels instead of 1000.

In [None]:
squeeze_net.classifier[1] = nn.Conv2d(512, 10, kernel_size=(1,1), stride=(1,1))
squeeze_net.to(device)
print("Number of parameters:", count_parameters(squeeze_net))

We run the training

In [None]:
# You may want to change the hyperparameters to obtain better results
run_training(model=squeeze_net, num_epochs=10, learning_rate=0.001)

### Fine-tuning pre-trained model

The performance is not that impressive. You could improve the results by playing with the learning rate and the number of epochs, and by adding image augmentations.

However, we do not have to start from randomly initialized weights.

We can start from weights pre-trained on large datasets (ImageNet) and fine-tune the weights on the new task.

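This is done by passing `pretrained=True` when initializing the model. Note that recent torchvision versions deprecate this flag in favor of the `weights` argument (for example `weights="DEFAULT"`), so you may see a warning.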

This should improve the performance as the convolutional layers have already learnt to extract meaningful features.

In [None]:
squeeze_net = models.squeezenet1_0(pretrained=True)
squeeze_net.classifier[1] = nn.Conv2d(512, 10, kernel_size=(1,1), stride=(1,1))
squeeze_net.to(device)

In [None]:
# Usually when we fine-tune pretrained models we utilize lower learning rates
run_training(model=squeeze_net, num_epochs=10, learning_rate=0.0003)

The model should obtain better results.

### ResNet-18 from timm library

Let's try another model from a different library. We will use the timm library.

In [None]:
!pip install timm

In [None]:
import timm

Let's see available models

In [None]:
timm.list_models()

We will train a ResNet-18 that is already pre-trained on ImageNet.
You can read more about ResNet in this [paper](https://arxiv.org/abs/1512.03385).

<img src=https://www.researchgate.net/publication/366608244/figure/fig1/AS:11431281109643320@1672145338540/Structure-of-the-Resnet-18-Model.jpg>

In [None]:
resnet = timm.create_model('resnet18', pretrained=True)
print(resnet)
print("Number of parameters:", count_parameters(resnet))

Similarly, we will need to modify the classifier to have only 10 classes.

In [None]:
resnet.fc = nn.Linear(in_features=512, out_features=10, bias=True)
resnet.to(device)
print(resnet)
print("Number of parameters:", count_parameters(resnet))

We run the training

In [None]:
# You may have to play with the hyperparameters to obtain better results
run_training(model=resnet, num_epochs=10, learning_rate=0.0005)

### MobileNet from Hugging Face

Finally, we will use a model from Hugging Face, loaded through the `transformers` library.

We will use the MobileNet architecture. You can read more about it in this [paper](https://arxiv.org/abs/1704.04861).

In [None]:
!pip install transformers datasets evaluate accelerate pillow torchvision scikit-learn
from transformers import AutoModelForImageClassification

We utilize `AutoModelForImageClassification.from_pretrained` to create the model.
As in the other cases, we change the classifier to have only 10 classes.

We build a wrapper around the model so that it returns only the score for each class (by default the model returns an output object that holds the logits and, when labels are passed, the loss). We only do this so it works with our train and valid functions.

In [None]:
class MobileNetWrapper(nn.Module):
  def __init__(self):
    super(MobileNetWrapper, self).__init__()
    self.model = AutoModelForImageClassification.from_pretrained("google/mobilenet_v2_1.0_224")

    # Changing the classifier to output only 10 classes
    self.model.classifier = nn.Linear(in_features=1280, out_features=10, bias=True)

  def forward(self, x):
    return self.model(x).logits

mobilenet = MobileNetWrapper()
mobilenet.to(device)
print("Number of parameters:", count_parameters(mobilenet))
print(mobilenet)
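
The wrapper pattern itself does not depend on Hugging Face. The sketch below reproduces it with a hypothetical `ToyOutputModel` that returns an object with a `.logits` attribute, standing in for the real model's output object; both class names are made up for this illustration.

```python
from types import SimpleNamespace

import torch
import torch.nn as nn

class ToyOutputModel(nn.Module):
    """Hypothetical stand-in for a model that returns an output object."""
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 10)

    def forward(self, x):
        # Mimics an output object that exposes a .logits attribute
        return SimpleNamespace(logits=self.linear(x))

class LogitsWrapper(nn.Module):
    """Returns only the logits tensor so a generic training loop can use it."""
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, x):
        return self.model(x).logits

wrapped = LogitsWrapper(ToyOutputModel())
scores = wrapped(torch.randn(3, 4))

# The plain tensor can be passed straight to the loss, as in train_epoch
loss = nn.CrossEntropyLoss()(scores, torch.tensor([0, 1, 2]))
print(scores.shape, loss.item())
```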

In [None]:
# You may have to play with the hyperparameters to obtain better results
run_training(model=mobilenet, num_epochs=10, learning_rate=0.0005)

### Training only a subset of layers

When using pre-trained models that were trained on similar images, we do not have to re-train every layer of the network.

Ideally the first layers of the network already extract meaningful features which can be utilized in our task. So, in theory we don't have to re-train these first layers.

We will demonstrate this by fine-tuning only a part of the ResNet-18 model. Let's have a closer look at the structure of ResNet-18.

In [None]:
resnet = timm.create_model('resnet18', pretrained=True)

# Change the output layer to match the number of classes in CIFAR-10
resnet.fc = nn.Linear(in_features=512, out_features=10, bias=True)

resnet.to(device)

<img src=https://www.researchgate.net/publication/366608244/figure/fig1/AS:11431281109643320@1672145338540/Structure-of-the-Resnet-18-Model.jpg>

The model has an initial 7x7 Conv, followed by a Pooling Layer.

Afterwards, it has 4 layers, each with the same structure. Finally, it has a fully connected classifier (which we changed to output only 10 scores instead of 1000).

A layer is composed of 2 Building Blocks. A Building Block contains the following:
1. A 3x3 Conv
2. Batch Norm - read more about this operation [here](https://arxiv.org/abs/1502.03167)
3. Activation - ReLU
4. A second 3x3 Conv
5. Batch Norm
6. Residual Connection with the Input - input + output - read more about this [here](https://arxiv.org/abs/1512.03385)
7. Activation - ReLU (applied after the addition)

A Building Block with residual connection looks like this:

<img src=https://miro.medium.com/v2/resize:fit:570/1*D0F3UitQ2l5Q0Ak-tjEdJg.png width=300px>
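
A Building Block can be sketched in a few lines of PyTorch. This is a simplified version written for illustration, not the timm implementation: it assumes the same number of channels in and out and a stride of 1, so the input can be added to the output without any downsampling. As in the ResNet paper, the final ReLU is applied after the addition.

```python
import torch
import torch.nn as nn

class BasicBlockSketch(nn.Module):
    """Simplified residual block: same channels in and out, stride 1."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))   # Conv -> Batch Norm -> ReLU
        out = self.bn2(self.conv2(out))            # Conv -> Batch Norm
        out = out + x                              # residual connection with the input
        return self.relu(out)                      # final activation

block = BasicBlockSketch(channels=64)
y = block(torch.randn(2, 64, 8, 8))
print(y.shape)   # padding=1 keeps the spatial size, so the shape matches the input
```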

We will start by freezing all the parameters of the model. This means that we tell PyTorch not to compute gradients for these parameters.

Any parameter that is frozen will not be changed during training.
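
To see the effect of freezing in isolation, the sketch below freezes the first layer of a tiny made-up network, performs one optimization step on random data, and compares the weights before and after. The network and data are placeholders chosen only to keep the example fast.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
toy = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

# Freeze the first layer only
for param in toy[0].parameters():
    param.requires_grad = False

frozen_before = toy[0].weight.clone()
trainable_before = toy[2].weight.clone()

# One training step on random inputs and labels
optimizer = optim.Adam(toy.parameters(), lr=0.01)
loss = nn.CrossEntropyLoss()(toy(torch.randn(16, 4)), torch.randint(0, 2, (16,)))
optimizer.zero_grad()
loss.backward()
optimizer.step()

print(torch.equal(frozen_before, toy[0].weight))      # the frozen layer is unchanged
print(torch.equal(trainable_before, toy[2].weight))   # the trainable layer has moved
```

No gradient is stored for the frozen layer (`toy[0].weight.grad` stays `None`), so the optimizer skips it even though it was given all the parameters.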

In [None]:
for param in resnet.parameters():
    param.requires_grad = False

Function that prints the name of the parameter and if it's trainable or frozen.

In [None]:
def print_params(model):
    for name, param in model.named_parameters():
        print(name, param.requires_grad)

In [None]:
print_params(resnet)

All the parameters of the model are frozen.

We will unfreeze only the final fc layer.

We usually freeze the first layers and fine-tune the later ones.

In [None]:
for param in resnet.fc.parameters():
    param.requires_grad = True

print_params(resnet)
print("Number of trainable parameters:", count_parameters(resnet))

Notice that we only have 5130 trainable parameters out of 11M.
Let's run the training and see if the model learns anything.

In [None]:
# You may have to play with the hyperparameters
run_training(model=resnet, num_epochs=20, learning_rate=0.0003)

**Exercise 1** - Unfreeze all the BatchNorm (bn) weights and biases and re-train the network. See what you observe.

- Play with freezing and unfreezing certain parts of the network. See if you can obtain good results without having to re-train too many parameters.

In [None]:
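# Unfreeze BatchNorm layers
# We match on the module type rather than on "bn" in the parameter name,
# because the BatchNorm layers inside the downsample branches are not named "bn"
for module in resnet.modules():
    if isinstance(module, nn.BatchNorm2d):
        for param in module.parameters():
            param.requires_grad = True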

print_params(resnet)
print("Number of trainable parameters:", count_parameters(resnet))
run_training(model=resnet, num_epochs=20, learning_rate=0.0003)
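
One detail that may help when interpreting this exercise: BatchNorm also keeps running statistics (`running_mean`, `running_var`), which are buffers rather than parameters. The sketch below, on a standalone layer with random data, suggests that these statistics still change in training mode even when the weight and bias are frozen, and stop changing in evaluation mode.

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(3)
for param in bn.parameters():
    param.requires_grad = False          # freeze the learnable weight and bias

mean_before = bn.running_mean.clone()    # starts at zero

bn.train()                               # training mode, as set by model.train()
with torch.no_grad():
    bn(torch.randn(8, 3, 4, 4) + 5.0)    # batch whose mean is far from 0
mean_after_train = bn.running_mean.clone()
print(mean_after_train)                  # has moved away from zero

bn.eval()                                # evaluation mode: statistics are fixed
with torch.no_grad():
    bn(torch.randn(8, 3, 4, 4) + 5.0)
print(torch.equal(mean_after_train, bn.running_mean))
```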

**Exercise 2** - Try to obtain the highest accuracy possible on CIFAR-10.
- Use any model you want from any of the presented libraries
- Use image augmentations in training and try to find optimal hyperparameters
- Models trained on ImageNet were trained with 224x224 images. What happens if you resize the CIFAR-10 32x32 images to the 224x224 size? (Training will take a very long time if you do this, but try it for 1-2 epochs)
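
To get a feeling for why training on resized images is so much slower, the sketch below upsamples one random CIFAR-sized tensor to 224x224. `F.interpolate` is used here as a stand-in for `transforms.Resize`; the point is only the change in size, not the exact interpolation method.

```python
import torch
import torch.nn.functional as F

small = torch.rand(1, 3, 32, 32)     # one CIFAR-10-sized image (random values)
large = F.interpolate(small, size=(224, 224), mode="bilinear", align_corners=False)

# Number of pixels per channel after resizing, relative to the original
pixel_ratio = (224 * 224) / (32 * 32)
print(large.shape, pixel_ratio)
```

Each image has 49 times more pixels after resizing, so the convolutional layers do roughly that much more work per image and need far more GPU memory, even though the upsampled image contains no new information.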

In [None]:
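train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

# The test images are resized as well, but without random augmentations
test_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

# Rebuild the datasets and loaders: run_training reads the global loaders, which would
# otherwise keep the old transforms. Smaller batches, as 224x224 images need more GPU memory.
trainset = datasets.CIFAR10(root='./data', train=True, download=True, transform=train_transform)
trainloader = DataLoader(trainset, batch_size=64, shuffle=True)
testset = datasets.CIFAR10(root='./data', train=False, download=True, transform=test_transform)
testloader = DataLoader(testset, batch_size=64, shuffle=False)

resnet50 = timm.create_model('resnet50', pretrained=True)
resnet50.fc = nn.Linear(in_features=2048, out_features=10)
resnet50.to(device)

run_training(model=resnet50, num_epochs=15, learning_rate=0.0003)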