# Transfer learning / fine-tuning

This tutorial will guide you through the process of using _transfer learning_ to learn an accurate image classifier from a relatively small number of training samples. Generally speaking, transfer learning refers to the process of leveraging the knowledge learned in one model for the training of another model. 

More specifically, the process involves taking an existing neural network which was previously trained to good performance on a larger dataset, and using it as the basis for a new model which leverages that previous network's accuracy for a new task. This method has become popular in recent years to improve the performance of a neural net trained on a small dataset; the intuition is that the new dataset may be too small to train to good performance by itself, but we know that most neural nets trained to learn image features often learn similar features anyway, especially at early layers where they are more generic (edge detectors, blobs, and so on). 

Transfer learning has been largely enabled by the open-sourcing of state-of-the-art models; for the top performing models in image classification tasks (like from [ILSVRC](http://www.image-net.org/challenges/LSVRC/)), it is common practice now to not only publish the architecture, but to release the trained weights of the model as well. This lets amateurs use these top image classifiers to boost the performance of their own task-specific models.

#### Feature extraction vs. fine-tuning

At one extreme, transfer learning can involve taking the pre-trained network and freezing the weights, and using one of its hidden layers (usually the last one) as a feature extractor, using those features as the input to a smaller neural net. 

At the other extreme, we start with the pre-trained network, but we allow some of the weights (usually the last layer or last few layers) to be modified. Another name for this procedure is called "fine-tuning" because we are slightly adjusting the pre-trained net's weights to the new task. We usually train such a network with a lower learning rate, since we expect the features are already relatively good and do not need to be changed too much. 

Sometimes, we do something in-between: Freeze just the early/generic layers, but fine-tune the later layers. Which strategy is best depends on the size of your dataset, the number of classes, and how much it resembles the dataset the previous model was trained on (and thus, whether it can benefit from the same learned feature extractors). A more detailed discussion of how to strategize can be found in [[1]](http://cs231n.github.io/transfer-learning/) [[2]](http://sebastianruder.com/transfer-learning/).

## Procedure

In this guide will go through the process of loading a state-of-the-art, 1000-class image classifier, [VGG16](https://arxiv.org/pdf/1409.1556.pdf) which [won the ImageNet challenge in 2014](http://www.robots.ox.ac.uk/~vgg/research/very_deep/), and using it as a fixed feature extractor to train a smaller custom classifier on our own images, although with very few code changes, you can try fine-tuning as well.

We will first load VGG16 and remove its final layer, the 1000-class softmax classification layer specific to ImageNet, and replace it with a new classification layer for the classes we are training over. We will then freeze all the weights in the network except the new ones connecting to the new classification layer, and then train the new classification layer over our new dataset. 

We will also compare this method to training a small neural network from scratch on the new dataset, and as we shall see, it will dramatically improve our accuracy. We will do that part first.

As our test subject, we'll use a dataset consisting of around 6000 images belonging to 97 classes, and train an image classifier with around 80% accuracy on it. It's worth noting that this strategy scales well to image sets where you may have even just a couple hundred or less images. Its performance will be lesser from a small number of samples (depending on classes) as usual, but still impressive considering the usual constraints.


In [2]:
%matplotlib inline

import os

import random
import numpy as np
import keras

import matplotlib.pyplot as plt
from matplotlib.pyplot import imshow

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
from torchvision import datasets, models, transforms

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

### Getting a dataset

The first step is going to be to load our data. As our example, we will be using the dataset [STL-10](https://cs.stanford.edu/~acoates/stl10/), which contains thousands of labeled images belonging to 10 object categories. In order to handle this large sum of data, we prescribe it the dataloader from torchvision. This will grab individual pieces of the dataset used for training in batches.

If wanting to use a dataset outside the scope of pytorch, feel free to still use the dataloader, but the retrieval method will vary depending on the library. A handy tool as an alternative to wget is `gdown` which seamlessly integrates with the colab environment.

If you wish to use your own dataset, it should be aranged in the same fashion to `LSUN` with all of the images organized into subfolders, one for each class. In this case, the following cell should load your custom dataset correctly by just replacing `root` with your folder. If you have an alternate structure, you just need to make sure that you load the list `data` where every element is a dict where `x` is the data (a 1-d numpy array) and `y` is the label (an integer). Use the helper function `get_image(path)` to load the image correctly into the array, and note also that the images are being resized to 224x224. This is necessary because the input to VGG16 is a 224x224 RGB image. You do not need to resize them on your hard drive, as that is being done in the code below.



In [3]:
stl_data = torchvision.datasets.STL10("/", folds=None, transform=None, download=True)

Downloading http://ai.stanford.edu/~acoates/stl10/stl10_binary.tar.gz to /stl10_binary.tar.gz


HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))

Extracting /stl10_binary.tar.gz to /


In [4]:
root = 'stl10_binary'
classes = open("/stl10_binary/class_names.txt")
print(classes.read())
classes.close()

airplane
bird
car
cat
deer
dog
horse
monkey
ship
truck



This function is useful for pre-processing the data into an image and input vector. Resize the image to the appropriate dimensions. Load all the images from root folder. Randomize the data order. Pre-process the data as before by making sure it's float32 and normalized between 0 and 1.

In [5]:
# helper function to load image and return it and input vector
reshape = transforms.Compose([
                    transforms.Resize(256), 
                    transforms.CenterCrop(224),
                    transforms.ToTensor(),
                    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])

Load the data and apply the transformarions.

In [6]:
stl_train = torchvision.datasets.STL10("/", split='train', folds=None, transform=reshape)
                                      
train_loader = torch.utils.data.DataLoader(stl_train,
                                          batch_size=12,
                                          shuffle=True,
                                          num_workers=0)
stl_test = torchvision.datasets.STL10("/", split='test', folds=None, transform=reshape)
                                      
test_loader = torch.utils.data.DataLoader(stl_test,
                                          batch_size=12,
                                          shuffle=True,
                                          num_workers=0)

If everything worked properly, you should have loaded a bunch of images, and split them into three sets: `train`, `val`, and `test`.

Notice that we divided all the data into three subsets -- a training set `train`, a validation set `val`, and a test set `test`. The reason for this is to properly evaluate the accuracy of our classifier. During training, the optimizer uses the validation set to evaluate its internal performance, in order to determine the gradient without overfitting to the training set. The `test` set is always held out from the training algorithm, and is only used at the end to evaluate the final accuracy of our model.

## Transfer learning by starting with existing network

Now we can move on to the main strategy for training an image classifier on our small dataset: by starting with a larger and already trained network.

To start, we will load the VGG16 from TorchVision, which was trained on ImageNet and the weights saved online. If this is your first time loading VGG16, you'll need to wait a bit for the weights to download from the web. Once the network is loaded, we can again inspect the layers with the `summary()` method.

In [7]:
vgg = torchvision.models.vgg16(pretrained=False)
vgg.to(torch.device(device))

VGG(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU(inplace=True)
    (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (3): ReLU(inplace=True)
    (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (6): ReLU(inplace=True)
    (7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (8): ReLU(inplace=True)
    (9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace=True)
    (12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (13): ReLU(inplace=True)
    (14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (15): ReLU(inplace=True)
    (16): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1

Notice that VGG16 is _much_ bigger than the network we constructed earlier. It contains 13 convolutional layers and two fully connected layers at the end, and has over 138 million parameters, around 100 times as many parameters than the network we made above. Like our first network, the majority of the parameters are stored in the connections leading into the first fully-connected layer.

VGG16 was made to solve ImageNet, and achieves a [8.8% top-5 error rate](https://github.com/jcjohnson/cnn-benchmarks), which means that 91.2% of test samples were classified correctly within the top 5 predictions for each image. It's top-1 accuracy--equivalent to the accuracy metric we've been using (that the top prediction is correct)--is 73%. This is especially impressive since there are not just 97, but 1000 classes, meaning that random guesses would get us only 0.1% accuracy.

In order to use this network for our task, we "remove" the final classification layer, the 1000-neuron softmax layer at the end, which corresponds to ImageNet, and instead replace it with a new Linear layer for our dataset, which contains 10 layers in the case of STL_10.

In terms of implementation, it's easier to simply create a copy of VGG from its input layer until the second to last layer, and then work with that, rather than modifying the VGG object directly. So technically we never "remove" anything, we just circumvent/ignore it. This can be done by adding a layer to our typical training loop. In this case we would add a softmax to the variable output.

In [13]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(vgg.parameters(), lr=0.001, momentum=0.9)

lastLayer = nn.Linear(1000,10).to(torch.device(device))

for epoch in range(10):  # loop over the dataset multiple times

    running_loss = 0.0
    for i, data in enumerate(train_loader, 0):
        inputs, labels = data
        inputs, labels = inputs.to(torch.device(device)), labels.to(torch.device(device))
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = vgg(inputs)
        outputs = lastLayer(outputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if i % 25 == 24:    # print every 25 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 25))
            running_loss = 0.0

print('Finished Training')

[1,    25] loss: 2.254
[1,    50] loss: 2.128
[1,    75] loss: 2.022
[1,   100] loss: 2.031
[1,   125] loss: 1.999
[1,   150] loss: 1.956
[1,   175] loss: 2.016
[1,   200] loss: 1.958
[1,   225] loss: 1.888
[1,   250] loss: 1.872
[1,   275] loss: 1.839
[1,   300] loss: 1.876
[1,   325] loss: 1.872
[1,   350] loss: 1.831
[1,   375] loss: 1.948
[1,   400] loss: 1.842
[2,    25] loss: 1.829
[2,    50] loss: 1.765
[2,    75] loss: 1.823
[2,   100] loss: 1.789
[2,   125] loss: 1.766
[2,   150] loss: 1.804
[2,   175] loss: 1.808
[2,   200] loss: 1.789
[2,   225] loss: 1.754
[2,   250] loss: 1.798
[2,   275] loss: 1.795
[2,   300] loss: 1.755
[2,   325] loss: 1.563
[2,   350] loss: 1.692
[2,   375] loss: 1.699
[2,   400] loss: 1.622
[3,    25] loss: 1.638
[3,    50] loss: 1.701
[3,    75] loss: 1.668
[3,   100] loss: 1.626
[3,   125] loss: 1.527
[3,   150] loss: 1.609
[3,   175] loss: 1.610
[3,   200] loss: 1.707
[3,   225] loss: 1.622
[3,   250] loss: 1.590
[3,   275] loss: 1.662
[3,   300] 

The model appears very functional after only 10 epochs. However, now we must freeze all weights to test it on the validation set. This test will demonstrate the effectiveness of our model while also making clear if the model is overfit.

In [18]:
correct = 0
total = 0
with torch.no_grad():
    for data in test_loader:
        inputs, labels = data
        inputs, labels = inputs.to(torch.device(device)), labels.to(torch.device(device))
        outputs = vgg(inputs)
        outputs = lastLayer(outputs)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print('Accuracy of the network on the test images: %d %%' % (
    100 * correct / total))

Accuracy of the network on the test images: 43 %


### Improving the results

43% top-1 accuracy on 10 classes after 10 epochs, roughly evenly distributed, is a pretty good achievement. It is not quite as impressive as the original VGG16 which achieved 73% top-1 accuracy on 1000 classes. Nevertheless, it is much better than what we were able to achieve with our original network, and there is room for improvement. Some techniques which possibly could have improved our performance.

- Using data augementation: augmentation refers to using various modifications of the original training data, in the form of distortions, rotations, rescalings, lighting changes, etc to increase the size of the training set and create more tolerance for such distortions.
- Using a different optimizer, adding more regularization/dropout, and other hyperparameters.
- Training for longer (of course)

A more advanced example of transfer learning in Keras, involving augmentation for a small 2-class dataset, can be found in the [Keras blog](https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html).