**NOTE: This notebook is written for the Google Colab platform, which provides free hardware acceleration. However, it can also be run (possibly with minor modifications) as a standard Jupyter notebook using a local GPU.**



In [None]:
#@title -- Installation of Packages -- { display-mode: "form" }
import sys
!{sys.executable} -m pip install torchinfo
!{sys.executable} -m pip install git+https://github.com/michalgregor/class_utils.git

In [None]:
#@title -- Import of Necessary Packages -- { display-mode: "form" }
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from class_utils.pytorch_utils import BestModelCheckpointer, freeze_except_last
from torch.optim.lr_scheduler import ExponentialLR
from torchvision import models, transforms
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder
import torchinfo
import torch.nn as nn
import torch

In [None]:
#@title -- Downloading Data -- { display-mode: "form" }
from class_utils.download import download_file_maybe_extract
download_file_maybe_extract("https://www.dropbox.com/s/w4pg809npvatye0/food5v2.zip?dl=1", directory="data/food5v2")

# also create a directory for storing any outputs
import os
os.makedirs("output", exist_ok=True)

## Transfer Learning

In this notebook we will use the **Food 5** dataset to illustrate transfer learning. The dataset is a downsized version of the [Food 11](https://www.kaggle.com/vermaavi/food11) dataset. For a different transfer learning example, you can also see TensorFlow.js's [interactive demo](https://storage.googleapis.com/tfjs-examples/webcam-transfer-learning/dist/index.html).

Transfer learning is a very useful technique. Under ordinary circumstances, deep learning requires a huge amount of data and computation. If we apply it to a small dataset, we will typically not be able to achieve good generalization. The problem is that a small dataset typically cannot cover all the possible variations of samples that a model can encounter. In the case of image recognition, for instance, there is a virtually infinite number of variations that a photo of a dog can take: the environment, the lighting, the breed of the dog, the angle and many other aspects can all change. A small dataset is very unlikely to cover such a complex space sufficiently.

One of the solutions that allow us to apply deep learning to small datasets in spite of these problems is **transfer learning**. Under this technique the neural network is first pre-trained on a large, more general dataset (for image recognition this tends to be the ImageNet dataset). The network uses this dataset to learn what natural images look like and how they need to be preprocessed. Once this pre-training is complete, the network is then further trained for the specific target task.

### The Overall Procedure

The overall procedure for transfer learning in image recognition:

* Pre-train a network on ImageNet.


* Remove one or several of the final layers (the top of the network) and replace them with new layers. The new output layer will now have as many outputs as there are classes in the dataset.


* The weights of the pre-trained layers are frozen. Only the new layers are trained using the target dataset.


* Once the new layers have been trained, we can optionally unfreeze the weights of the pre-trained layers as well and fine-tune the network as a whole. This requires a significantly lower learning rate, both so that we do not destroy the pre-trained features with excessively aggressive updates and because the risk of overfitting tends to increase once the pre-trained layers can be modified.
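
The four steps above can be sketched on a toy network. This is only an illustration under stated assumptions: the tiny `backbone` stands in for a network that would really be pre-trained on ImageNet, and the layer sizes, the number of classes and the learning rate are arbitrary placeholders.

```python
import torch
import torch.nn as nn

# Step 1 (stand-in): a tiny "pre-trained" backbone. A real run would load
# ImageNet weights instead of using randomly initialized layers.
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
model = nn.Sequential(backbone, nn.Linear(8, 1000))  # original 1000-class top

# Step 2: replace the top with a new layer sized for the target task.
num_classes = 10  # illustrative number of target classes
model[1] = nn.Linear(8, num_classes)

# Step 3: freeze the pre-trained layers so that only the new top is trained.
for param in model[0].parameters():
    param.requires_grad = False
trainable_after_freeze = [
    name for name, p in model.named_parameters() if p.requires_grad
]

# Step 4 (optional): unfreeze everything and fine-tune with a much lower
# learning rate than the one used for the new top.
for param in model.parameters():
    param.requires_grad = True
finetune_optimizer = torch.optim.SGD(model.parameters(), lr=1e-5)

output_shape = tuple(model(torch.zeros(2, 3, 16, 16)).shape)
```

After step 3 only the parameters of the new top remain trainable, which is exactly the state in which the first round of training happens.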


### Preparation of the Dataset

As usual, let us start by preparing our dataset. For most image recognition tasks the dataset will be too large to fit into memory at once. We will therefore typically not attempt to load all the data at once; instead we'll use the `Dataset` and `DataLoader` abstractions from `PyTorch`. In the present case, our data comes pre-split into the train, validation and test folds, with each stored in a separate folder. The folders are structured so that each class has its own subfolder.



In [None]:
!ls data/food5v2

In [None]:
!ls data/food5v2/training

Given that our data has this structure, we can use the `ImageFolder` dataset class from `torchvision.datasets`.
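
To make the folder convention concrete, here is a simplified sketch of how a class-per-subfolder layout can be turned into labelled samples. It is not torchvision's actual implementation: `scan_image_folder` and the class names are hypothetical. It does mirror one documented behaviour of `ImageFolder`, namely that class names are sorted and mapped to integer indices (exposed there as `class_to_idx`).

```python
import os
import tempfile

def scan_image_folder(root):
    """Map each class subfolder to an index and list (path, label) pairs."""
    classes = sorted(
        entry.name for entry in os.scandir(root) if entry.is_dir()
    )
    class_to_idx = {name: idx for idx, name in enumerate(classes)}
    samples = []
    for name in classes:
        class_dir = os.path.join(root, name)
        for fname in sorted(os.listdir(class_dir)):
            samples.append((os.path.join(class_dir, fname), class_to_idx[name]))
    return class_to_idx, samples

# Build a throwaway directory tree with made-up class names to try it out.
with tempfile.TemporaryDirectory() as root:
    for name, files in {"soup": ["a.jpg"], "bread": ["b.jpg", "c.jpg"]}.items():
        os.makedirs(os.path.join(root, name))
        for fname in files:
            open(os.path.join(root, name, fname), "w").close()
    class_to_idx, samples = scan_image_folder(root)
    labels = [label for _, label in samples]

print(class_to_idx)  # {'bread': 0, 'soup': 1}
print(labels)        # [0, 0, 1]
```

The integer labels produced this way are what the loss function later receives as targets.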

Each image will need to be pre-processed before it is fed into the neural network: it will need to be resized, cropped and normalized in the same way it was done when the network was pre-trained. We will use a pre-trained ResNet50 with `IMAGENET1K_V2` weights. So let's first see what the preprocessing procedure that these weights were trained with looks like.



In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
weights = models.ResNet50_Weights.IMAGENET1K_V2
image_transforms = weights.transforms()
image_transforms

This is actually pretty simple. We are going to base two different preprocessing procedures for our data on it. The first one simply reproduces the `image_transforms` shown above. The second one performs **data augmentation**: it contains a couple of randomized steps that modify the image every time it is loaded. This adds more variety to our training set, since the network essentially never sees the exact same image twice. In practice, data augmentation pipelines can be a lot more elaborate, applying rotation, zoom, channel shift and a number of other transformations to the image.



In [None]:
normal_preproc = transforms.Compose([
    transforms.Resize(image_transforms.resize_size),
    transforms.CenterCrop(image_transforms.crop_size),
    transforms.ToTensor(),
    transforms.Normalize(image_transforms.mean, image_transforms.std)
])

augment_preproc = transforms.Compose([
    transforms.RandomResizedCrop(image_transforms.crop_size),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(image_transforms.mean, image_transforms.std)
])

Next we can construct the `ImageFolder` datasets themselves. We specify the paths to the individual folds of our dataset as well as the way in which the images should be preprocessed for each fold. We will use the normal pipeline for validation and testing data and the pipeline with augmentation for training data.



In [None]:
train_dataset = ImageFolder(
    "data/food5v2/training",
    augment_preproc
)

train_dataloader = DataLoader(
    train_dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4
)

valid_dataset = ImageFolder(
    "data/food5v2/validation",
    normal_preproc
)

valid_dataloader = DataLoader(
    valid_dataset,
    batch_size=32,
    shuffle=False,  # fixed order keeps the validation loss comparable across epochs
    num_workers=4
)

test_dataset = ImageFolder(
    "data/food5v2/testing",
    normal_preproc
)

test_dataloader = DataLoader(
    test_dataset,
    batch_size=32,
    shuffle=False,  # sample order is irrelevant for evaluation
    num_workers=4
)

#### Displaying a Few Samples



In [None]:
#@title -- Display Data Samples --
disp_dataset = ImageFolder(
    "data/food5v2/training",
    transforms.ToTensor()
)
loader = DataLoader(disp_dataset, batch_size=1, shuffle=True)
loader_iter = iter(loader)

num_rows = 4; num_cols = 4
fig, axes = plt.subplots(num_rows, num_cols, figsize=(10, 8))

for row in axes:
    for ax in row:
        sample = next(loader_iter)[0][0].numpy().transpose((1, 2, 0))
        ax.imshow(sample)
        ax.set_xticks([])
        ax.set_yticks([])

### Loading the Pre-Trained Network

We load a pre-trained ResNet50 network. The weights pre-trained on ImageNet will download automatically.



In [None]:
model = models.resnet50(weights=weights)

To get a feeling for what our architecture looks like, we are going to use the `torchinfo.summary` function. It gives us information about the hierarchical structure of our network, including all its submodules and individual layers. The summary at the very bottom also shows how many trainable parameters there are.



In [None]:
torchinfo.summary(model)

### Modifying the Network

#### Replacing the Final Layer

To adapt our neural network to the new classification task, we are going to replace the last layer (the fully-connected linear layer `model.fc`) with a new module.



In [None]:
class ModelTop(nn.Module):
    def __init__(self, num_features, num_outputs):
        super().__init__()
        self.dropout = nn.Dropout(0.5)
        self.fc = nn.Linear(num_features, num_outputs)

    def set_dropout(self, p):
        self.dropout.p = p
    
    def forward(self, x):
        y = torch.flatten(x, 1)
        y = self.dropout(y)
        y = self.fc(y)
        return y

In [None]:
num_features = model.fc.in_features
top = ModelTop(num_features=num_features, num_outputs=10)
model.fc = top
model.to(device);

#### Freezing the Pretrained Layers

Recall that at first, we only want to train our new top layers and leave the pre-trained layers as they are. In our case we will therefore need to freeze all layers except the last. We are going to use a predefined auxiliary function to do that, but internally, it just goes over the layers and sets the `requires_grad` flag of all their parameters to the corresponding value.
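
What such a helper does internally can be sketched as follows. This is a hypothetical stand-in, not the actual `class_utils` implementation: the name `freeze_all_but_last` is made up, and it treats the direct child modules as "layers", which may differ from how the real function counts them.

```python
import torch.nn as nn

def freeze_all_but_last(model, num_last=1):
    """Hypothetical helper: freeze every child module except the last `num_last`."""
    children = list(model.children())
    cutoff = len(children) - num_last
    for idx, child in enumerate(children):
        trainable = idx >= cutoff
        for param in child.parameters():
            param.requires_grad = trainable

# A toy network: Linear(4, 8) has 40 parameters, Linear(8, 3) has 27.
toy = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))
freeze_all_but_last(toy)

num_trainable = sum(p.numel() for p in toy.parameters() if p.requires_grad)
num_frozen = sum(p.numel() for p in toy.parameters() if not p.requires_grad)
print(num_trainable, num_frozen)  # 27 40
```

Counting parameters by their `requires_grad` flag, as in the last lines, is also a quick way to double-check a freeze without reading a full model summary.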



In [None]:
freeze_except_last(model);

Let's display the summary of our model again to make sure that everything worked as intended. We should see that the number of trainable parameters is now substantially lower (just the parameters of the final layer in our new module) and that there are a lot of untrainable (frozen) parameters. Note also that for the frozen layers, the number of parameters is now shown in parentheses; this way you can check whether you have frozen the correct layers.



In [None]:
torchinfo.summary(model)

Even though we are just using a single linear layer for 10 classes on top of the network, there is still a bunch of trainable parameters: 20 490 if you are using ResNet50. This is a huge amount given that we only have 200 samples in our train set. Dropout should help with generalization, but even so we cannot expect miracles. It wouldn't be difficult to get more data for this kind of task – it is just that we do not want the training in the notebook to take too long so we are working with a tiny dataset.
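
The figure of 20 490 can be checked by hand: a linear layer has one weight per input-output pair plus one bias per output. The short calculation below assumes the 2048-dimensional feature vector that ResNet50 feeds into its final layer (the value of `model.fc.in_features`).

```python
in_features = 2048  # width of the feature vector entering ResNet50's final layer
num_outputs = 10    # outputs of the new linear layer

num_weights = in_features * num_outputs  # one weight per input-output pair
num_biases = num_outputs                 # one bias per output
total_params = num_weights + num_biases
print(total_params)  # 20490
```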



### Training the New Layers

In our training loop, we are going to use best model checkpointing: we are going to monitor the validation loss and every time it improves, we are going to save our model. Then at the end of training, we are going to restore the best saved model.
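
The idea behind best model checkpointing can be sketched in a few lines. The class below is a hypothetical, in-memory stand-in rather than the `BestModelCheckpointer` used in this notebook: it keeps a copy of the state in memory instead of writing a file, and the validation losses are made up.

```python
import copy

class BestStateKeeper:
    """Hypothetical in-memory stand-in for a best-model checkpointer."""

    def __init__(self):
        self.best_loss = float("inf")
        self.best_state = None

    def __call__(self, valid_loss, state):
        # Keep a copy of the state only when the monitored loss improves.
        if valid_loss < self.best_loss:
            self.best_loss = valid_loss
            self.best_state = copy.deepcopy(state)
            return True
        return False

keeper = BestStateKeeper()
valid_losses = [0.9, 0.7, 0.8, 0.6, 0.65]  # made-up validation losses
saved_epochs = []
for epoch, loss in enumerate(valid_losses):
    state = {"epoch": epoch}  # stands in for model.state_dict()
    if keeper(loss, state):
        saved_epochs.append(epoch)

print(saved_epochs)       # [0, 1, 3]
print(keeper.best_state)  # {'epoch': 3}
```

Note that the state restored at the end comes from epoch 3, not from the final epoch, because the loss got worse again afterwards.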



In [None]:
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
schedule = ExponentialLR(optimizer, gamma=0.9)
checkpointer = BestModelCheckpointer(checkpoint_path="output/best_model.pt")
loss_train = []
loss_valid = []

for epoch in range(30):
    epoch_train_loss = []
    epoch_valid_loss = []

    model.train()
    for X_batch, Y_batch in train_dataloader:
        X_batch = X_batch.to(device)
        Y_batch = Y_batch.to(device)
        
        y_batch = model(X_batch)
        loss = criterion(y_batch, Y_batch)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        epoch_train_loss.append(loss.item())

    loss_train.append(np.mean(epoch_train_loss))

    model.eval()
    for X_batch, Y_batch in valid_dataloader:
        X_batch = X_batch.to(device)
        Y_batch = Y_batch.to(device)
        
        with torch.no_grad():
            y_batch = model(X_batch)
            loss = criterion(y_batch, Y_batch)

        epoch_valid_loss.append(loss.item())

    loss_valid.append(np.mean(epoch_valid_loss))
    checkpointer(loss_valid[-1], model)
    schedule.step()

    if epoch % 5 == 0:
        print(f"epoch {epoch}, train loss: {np.mean(loss_train[-5:])}, valid loss: {np.mean(loss_valid[-5:])}")

print(f"epoch {epoch}, loss: {loss_train[-1]}")

In [None]:
plt.plot(loss_train, label="train")
plt.plot(loss_valid, label="valid")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.grid(ls='--')
plt.legend()

#### Evaluating the Model on the Validation Set

Now we are going to load the best saved model back from the checkpoint file and run evaluation. Since we are not done with our model yet, we are only going to be testing it on the **validation set, not on the testing set**.



In [None]:
model.load_state_dict(torch.load("output/best_model.pt"));

In [None]:
eval_Y = []
eval_y = []

model.eval()
for X_batch, Y_batch in valid_dataloader:
    eval_Y.extend(Y_batch.numpy())
    X_batch = X_batch.to(device)
    Y_batch = Y_batch.to(device)
    
    with torch.no_grad():
        y_batch = model(X_batch)

    eval_y.extend(y_batch.argmax(dim=1).cpu().numpy())

eval_Y = np.array(eval_Y)
eval_y = np.array(eval_y)

cm = pd.crosstab(
    eval_Y, eval_y,
    rownames=['actual'],
    colnames=['predicted']
)
print(cm, '\n')

acc = accuracy_score(eval_Y, eval_y)
print("Accuracy = {}".format(acc))

### Fine-tuning Pre-Trained Weights

Once we have trained the new top of the model, it will often make sense to unfreeze a few more layers of the network and continue training. However, one usually lowers the learning rate significantly and often uses a more conservative optimizer such as `SGD` in place of more aggressive optimizers such as `Adam`. This is to ensure that the steps taken by the optimizer do not disrupt the pretrained features and undo the benefits of using transfer learning. Note also that even at this stage, one typically does not unfreeze the entire network.
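
As a rough illustration of this setup, the sketch below freezes part of a toy network, passes only the trainable parameters to a plain `SGD` optimizer with a small learning rate, and takes a single step. The toy model, the random input, the loss and the learning rate of `1e-4` are all arbitrary choices made for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

toy = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))
for param in toy[0].parameters():
    param.requires_grad = False  # pretend this is a frozen pre-trained layer

# Hand only the trainable parameters to SGD and use a small step size.
trainable_params = [p for p in toy.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable_params, lr=1e-4)

frozen_before = toy[0].weight.detach().clone()
head_before = toy[2].weight.detach().clone()

loss = toy(torch.randn(16, 4)).pow(2).mean()  # arbitrary loss for illustration
optimizer.zero_grad()
loss.backward()
optimizer.step()

# The frozen layer is untouched; the trainable layer moves only slightly.
frozen_unchanged = torch.equal(frozen_before, toy[0].weight.detach())
max_head_change = (toy[2].weight.detach() - head_before).abs().max().item()
print(frozen_unchanged, max_head_change)
```

Filtering the parameters is not strictly required, since frozen parameters receive no gradients, but it makes explicit which weights the optimizer is allowed to change.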

In our case we have very little data and it is unlikely that this fine-tuning stage will actually help to improve results. We can make the attempt though. We'll start by unfreezing the last 5 layers.



In [None]:
freeze_except_last(model, num_last=5);
torchinfo.summary(model)

---
### Task 1: Run the Fine-Tuning

**Now modify the training loop used earlier to perform the final fine-tuning. Instead of `Adam`, use `SGD` as the optimizer and set the learning rate to a lower value such as `1e-7`. Also modify `checkpoint_path` so that the checkpoints are saved in a different file than before (the evaluation cell below expects `output/best_full_model.pt`). If the fine-tuned model's performance is not an improvement upon the previous version, restore the previous version's weights from the corresponding checkpoint.**

---


In [None]:


# ----



#### Testing the Fine-Tuned Model

Now we load back the best version of our fine-tuned model and evaluate it. Chances are that it is not going to do better than the version where we just trained the new layers because we have so little data.



In [None]:
model.load_state_dict(torch.load("output/best_full_model.pt"));

In [None]:
eval_Y = []
eval_y = []

model.eval()
for X_batch, Y_batch in valid_dataloader:
    eval_Y.extend(Y_batch.numpy())
    X_batch = X_batch.to(device)
    Y_batch = Y_batch.to(device)
    
    with torch.no_grad():
        y_batch = model(X_batch)

    eval_y.extend(y_batch.argmax(dim=1).cpu().numpy())

eval_Y = np.array(eval_Y)
eval_y = np.array(eval_y)

cm = pd.crosstab(
    eval_Y, eval_y,
    rownames=['actual'],
    colnames=['predicted']
)
print(cm, '\n')

acc = accuracy_score(eval_Y, eval_y)
print("Accuracy = {}".format(acc))

We can now also evaluate our model on the test set. On this particular dataset, we can actually expect the results on the test set to be a bit better – by chance, the test fold appears to be a bit less challenging in this case, which can happen when you're working with very small datasets.



In [None]:
eval_Y = []
eval_y = []

model.eval()
for X_batch, Y_batch in test_dataloader:
    eval_Y.extend(Y_batch.numpy())
    X_batch = X_batch.to(device)
    Y_batch = Y_batch.to(device)
    
    with torch.no_grad():
        y_batch = model(X_batch)

    eval_y.extend(y_batch.argmax(dim=1).cpu().numpy())

eval_Y = np.array(eval_Y)
eval_y = np.array(eval_y)

cm = pd.crosstab(
    eval_Y, eval_y,
    rownames=['actual'],
    colnames=['predicted']
)
print(cm, '\n')

acc = accuracy_score(eval_Y, eval_y)
print("Accuracy = {}".format(acc))