<a href="https://colab.research.google.com/github/nyp-sit/it3103/blob/main/week4/transfer_learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab Exercise: Transfer Learning


A common and highly effective approach to deep learning on small image datasets is to leverage a pre-trained network. A pre-trained network 
is simply a saved network previously trained on a large dataset, typically on a large-scale image classification task. If this original 
dataset is large enough and general enough, then the spatial feature hierarchy learned by the pre-trained network can effectively act as a 
generic model of our visual world, and hence its features can prove useful for many different computer vision problems, even though these 
new problems might involve completely different classes from those of the original task. For instance, one might train a network on 
ImageNet (where classes are mostly animals and everyday objects) and then re-purpose this trained network for something as remote as 
identifying furniture items in images. Such portability of learned features across different problems is a key advantage of deep learning 
compared to many older shallow learning approaches, and it makes deep learning very effective for small-data problems.

In our case, we will consider a large convnet trained on the ImageNet dataset (1.4 million labeled images and 1000 different classes). 
ImageNet contains many animal classes, including different species of cats and dogs, and we can thus expect to perform very well on our cat 
vs. dog classification problem.

We will use the VGG16 architecture. Although it is a bit of an older model, far from the current state of the art and somewhat heavier than many other recent 
models, we chose it because its architecture is similar to what you are already familiar with, and easy to understand without introducing 
any new concepts. 

There are two ways to leverage a pre-trained network: *feature extraction* and *fine-tuning*. We will cover both of them. Let's start with 
feature extraction.

## Feature extraction

Feature extraction consists of using the representations learned by a previous network to extract interesting features from new samples. 
These features are then run through a new classifier, which is trained from scratch.

As we saw previously, convnets used for image classification comprise two parts: they start with a series of pooling and convolution 
layers, and they end with a densely-connected classifier. The first part is called the "convolutional base" of the model. In the case of 
convnets, "feature extraction" will simply consist of taking the convolutional base of a previously-trained network, running the new data 
through it, and training a new classifier on top of the output.

![swapping FC classifiers](https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/resources/it3103/swapping_fc_classifier.png)

Why only reuse the convolutional base? Could we reuse the densely-connected classifier as well? In general, it should be avoided. The 
reason is simply that the representations learned by the convolutional base are likely to be more generic and therefore more reusable: the 
feature maps of a convnet are presence maps of generic concepts over a picture, which is likely to be useful regardless of the computer 
vision problem at hand. On the other hand, the representations learned by the classifier will necessarily be very specific to the set of 
classes that the model was trained on -- they will only contain information about the presence probability of this or that class in the 
entire picture. Additionally, representations found in densely-connected layers no longer contain any information about _where_ objects are 
located in the input image: these layers get rid of the notion of space, whereas the object location is still described by convolutional 
feature maps. For problems where object location matters, densely-connected features would be largely useless.

Note that the level of generality (and therefore reusability) of the representations extracted by specific convolution layers depends on 
the depth of the layer in the model. Layers that come earlier in the model extract local, highly generic feature maps (such as visual 
edges, colors, and textures), while layers higher-up extract more abstract concepts (such as "cat ear" or "dog eye"). So if your new 
dataset differs a lot from the dataset that the original model was trained on, you may be better off using only the first few layers of the 
model to do feature extraction, rather than using the entire convolutional base.

In our case, since the ImageNet class set did contain multiple dog and cat classes, it is likely that it would be beneficial to reuse the 
information contained in the densely-connected layers of the original model. However, we will choose not to, in order to cover the more 
general case where the class set of the new problem does not overlap with the class set of the original model.

Let's put this in practice by using the convolutional base of the VGG16 network, trained on ImageNet, to extract interesting features from 
our cat and dog images, and then training a cat vs. dog classifier on top of these features.

The VGG16 model, among others, comes pre-packaged with Keras. You can import it from the `keras.applications` module. Here are some of the 
image classification models (all pre-trained on the ImageNet dataset) that are available as part of `keras.applications`:

* Xception
* InceptionV3
* ResNet50
* VGG16
* VGG19
* MobileNet

Let's instantiate a pretrained VGG16 model to use as our convolutional base. 

In [None]:
import tensorflow as tf
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications import vgg16

conv_base = VGG16(weights='imagenet',
                  include_top=False,
                  input_shape=(150, 150, 3))

preprocess_input_fn = vgg16.preprocess_input

We passed three arguments to the constructor:

* `weights`, to specify which weight checkpoint to initialize the model from
* `include_top`, which refers to including or not the densely-connected classifier on top of the network. By default, this 
densely-connected classifier would correspond to the 1000 classes from ImageNet. Since we intend to use our own densely-connected 
classifier (with only two classes, cat and dog), we don't need to include it.
* `input_shape`, the shape of the image tensors that we will feed to the network. This argument is purely optional: if we don't pass it, 
then the network will be able to process inputs of any size.

Here's the detail of the architecture of the VGG16 convolutional base: it's very similar to the simple convnets that you are already 
familiar with.

In [None]:
conv_base.summary()

Each pretrained model has a model-specific preprocessing function (e.g. to change the color channel order or to scale/normalize the pixel values) that must be applied to images before they go through the convnet. It is therefore important that we use this function to preprocess our images before using the convnet for feature extraction.

Here we retrieve the preprocess_input function for VGG16 to be used later.

In [None]:
from tensorflow.keras.applications import vgg16
preprocess_input_fn = vgg16.preprocess_input

The final feature map has shape `(4, 4, 512)`. This is the output we will use to feed to a Dense network for classification. 
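
The `(4, 4, 512)` shape can be checked with a little arithmetic. The sketch below assumes that VGG16's convolutions use "same" padding (so only the five 2x2, stride-2 max-pooling layers shrink the feature map) and that the last block has 512 filters. The helper name `vgg16_feature_map_size` is made up for this illustration.

```python
def vgg16_feature_map_size(input_size, num_pools=5):
    """Spatial size left after VGG16's 2x2, stride-2 max-pooling layers."""
    size = input_size
    for _ in range(num_pools):
        size //= 2  # pooling without padding drops any leftover row/column
    return size

side = vgg16_feature_map_size(150)   # 150 -> 75 -> 37 -> 18 -> 9 -> 4
print((side, side, 512))             # 512 = filters in the last conv block
```

With the standard 224x224 ImageNet input, the same calculation gives a 7x7 feature map.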

Here is what we need to do:
* Run the convolutional base over our dataset, recording its output to a Numpy array.
* Save the Numpy array (our features) to disk.
* Feed the Numpy array to a standalone densely-connected classifier.

This solution is very fast and cheap to run, because it only requires running the convolutional base once for every input image, and the 
convolutional base is by far the most expensive part of the pipeline. 

We will start by simply running instances of the previously-introduced `ImageDataGenerator` to extract images as Numpy arrays as well as 
their labels. We will extract features from these images simply by calling the `predict` method of the `conv_base` model.

**Note** 

It is important that we don't scale the images ourselves (e.g. by using the ``ImageDataGenerator(rescale=1./255)``), but instead use the model-specific preprocess-input function.
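
To see why an extra `rescale=1./255` would clash with the model-specific function, here is a minimal NumPy sketch of what VGG16-style preprocessing is generally understood to do: reorder the channels from RGB to BGR and subtract the ImageNet per-channel means, with no division by 255. The function name `vgg16_style_preprocess` is hypothetical and the code is only an approximation for illustration. In the lab, keep using `vgg16.preprocess_input`.

```python
import numpy as np

# ImageNet per-channel means in BGR order (assumed values for this sketch).
IMAGENET_MEAN_BGR = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def vgg16_style_preprocess(rgb_batch):
    """Approximate VGG16 preprocessing for an (N, H, W, 3) RGB batch in [0, 255]."""
    x = np.asarray(rgb_batch, dtype=np.float32)
    x = x[..., ::-1]              # RGB -> BGR
    return x - IMAGENET_MEAN_BGR  # zero-centre each channel, no 1/255 scaling

white = np.full((1, 2, 2, 3), 255.0)        # pixel values in the expected range
print(vgg16_style_preprocess(white)[0, 0, 0])

already_scaled = white / 255.0              # what rescale=1./255 would produce
print(vgg16_style_preprocess(already_scaled)[0, 0, 0])
```

In the second case every pixel ends up close to the negative channel mean, so almost all of the image content is lost before it reaches the convnet.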

In [None]:
import os

dataset_URL = 'https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/datasets/cats_and_dogs_filtered.zip'
path_to_zip = tf.keras.utils.get_file('cats_and_dogs.zip', origin=dataset_URL, extract=True, cache_dir='.')
print(path_to_zip)
base_dir = os.path.join(os.path.dirname(path_to_zip), 'cats_and_dogs_filtered')



In [None]:
import os
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_dir = os.path.join(base_dir, 'train')
validation_dir = os.path.join(base_dir, 'validation')

datagen = ImageDataGenerator()
batch_size = 20

def extract_features(preprocess_input_fn, directory, sample_count):
    features = np.zeros(shape=(sample_count, 4, 4, 512))
    labels = np.zeros(shape=(sample_count))
    generator = datagen.flow_from_directory(
        directory,
        target_size=(150, 150),
        batch_size=batch_size,
        class_mode='binary')
    i = 0
    for inputs_batch, labels_batch in generator:
        preprocessed_inputs_batch = preprocess_input_fn(inputs_batch)
        features_batch = conv_base.predict(preprocessed_inputs_batch)
        features[i * batch_size : (i + 1) * batch_size] = features_batch
        labels[i * batch_size : (i + 1) * batch_size] = labels_batch
        i += 1
        if i * batch_size >= sample_count:
            # Note that since generators yield data indefinitely in a loop,
            # we must `break` after every image has been seen once.
            break
    return features, labels

# We will use the preprocess_input_fn() to pre-process images in the extract_features()
train_features, train_labels = extract_features(preprocess_input_fn, train_dir, 2000)
validation_features, validation_labels = extract_features(preprocess_input_fn, validation_dir, 1000)
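
The loop above relies on `sample_count` being a multiple of `batch_size` (2000 and 1000 with batches of 20). As a rough sketch of the same fill-and-break pattern when that does not hold, the snippet below uses a stand-in generator instead of Keras. Both `endless_batches` and `collect` are hypothetical helpers written for this illustration.

```python
import numpy as np

def endless_batches(data, batch_size):
    """Yield batches forever, starting a new pass when the data runs out."""
    start = 0
    while True:
        yield data[start:start + batch_size]  # last batch of a pass may be short
        start += batch_size
        if start >= len(data):
            start = 0

def collect(data, sample_count, batch_size):
    """Copy exactly `sample_count` rows out of an endless batch generator."""
    out = np.zeros((sample_count,) + data.shape[1:])
    seen = 0
    for batch in endless_batches(data, batch_size):
        take = min(len(batch), sample_count - seen)  # trim the final batch
        out[seen:seen + take] = batch[:take]
        seen += take
        if seen >= sample_count:
            break  # the generator never stops on its own
    return out

fake_features = np.arange(10, dtype=float).reshape(10, 1)
print(collect(fake_features, sample_count=10, batch_size=4).ravel())
```

Tracking the number of rows seen, rather than the number of batches, keeps the slices aligned even when a batch is shorter than `batch_size`.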

The extracted features are currently of shape `(samples, 4, 4, 512)`. We will save these Numpy arrays to disk.

In [None]:
np.save("train_features.npy", train_features)
np.save("train_labels.npy", train_labels)
np.save("validation_features.npy", validation_features)
np.save("validation_labels.npy", validation_labels)

In [None]:
train_features.shape

At this point, we can define our densely-connected classifier (note the use of dropout for regularization), and train it on the data and 
labels that we just recorded. 

Since our Dense classifier expects a 1D feature vector for each sample, we cannot directly feed it the feature maps extracted by the convolutional base (each sample has shape `(4, 4, 512)`). We can use a Flatten layer to flatten them into a 1D vector. Alternatively, we can use GlobalAveragePooling2D. Recall that global average pooling summarizes each feature map as a single average value, so it effectively converts the 512 feature maps of size 4x4 into a 1D array of 512 values.
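
The difference between the two options can be sketched in plain NumPy. The random array below is only a stand-in for the extracted features, and the two expressions mimic what `Flatten` and `GlobalAveragePooling2D` do to the shape. This is an illustration, not the Keras implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.random((8, 4, 4, 512))   # stand-in for 8 extracted feature maps

flattened = features.reshape(len(features), -1)  # like Flatten
pooled = features.mean(axis=(1, 2))              # like GlobalAveragePooling2D

print(flattened.shape)  # (8, 8192)
print(pooled.shape)     # (8, 512)
```

Global average pooling gives the following Dense layer 16 times fewer inputs, and therefore far fewer weights to train, at the cost of discarding the spatial layout within each feature map.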

In [None]:
from tensorflow.keras import models
from tensorflow.keras import layers
from tensorflow.keras import optimizers


model_top = models.Sequential()
model_top.add(layers.GlobalAveragePooling2D())
# After global average pooling each sample is a 512-vector,
# so the Dense layer infers its input size and needs no input_dim
model_top.add(layers.Dense(256, activation='relu'))
model_top.add(layers.Dropout(0.5))
model_top.add(layers.Dense(1, activation='sigmoid'))

# `learning_rate` replaces the deprecated `lr` argument
model_top.compile(optimizer=optimizers.RMSprop(learning_rate=2e-5),
              loss='binary_crossentropy',
              metrics=['acc'])



In [None]:
# We now load the extracted features from the files we saved earlier
X_train = np.load('train_features.npy')
y_train = np.load('train_labels.npy')
X_validation = np.load('validation_features.npy')
y_validation = np.load('validation_labels.npy')


# We create a directory to store the event logs required by Tensorboard
root_logdir = os.path.join(os.curdir, "tb_logs")

def get_run_logdir():    # use a new directory for each run
	import time
	run_id = time.strftime("run_%Y_%m_%d-%H_%M_%S")
	return os.path.join(root_logdir, run_id)

run_logdir = get_run_logdir()

tb_callback = tf.keras.callbacks.TensorBoard(run_logdir)

history = model_top.fit(X_train, y_train,
                    epochs=100,
                    batch_size=20,
                    validation_data=(X_validation, y_validation),
                    callbacks=[tb_callback])

Training is very fast, since we only have to deal with two `Dense` layers -- an epoch takes less than one second even on CPU.

Let's take a look at the loss and accuracy curves during training:

In [None]:
%load_ext tensorboard
%tensorboard --logdir tb_logs


We reach a validation accuracy of about 97%, much better than what we could achieve in the previous exercise with our small model trained from scratch. 

## Prepare the model for deployment

We cannot use `model_top` on its own for image classification, as it takes pre-extracted features as input, not images. We need to reattach our convolutional base and add an input layer of the appropriate shape. This is what we do below.

In [None]:
from tensorflow.keras import Model

inputs = layers.Input(shape=(150, 150, 3))
x = preprocess_input_fn(inputs)
x = conv_base(x)
top_outputs = model_top(x)
model_final = Model(inputs=[inputs], outputs=[top_outputs])
model_final.compile(loss="binary_crossentropy", optimizer=optimizers.RMSprop(learning_rate=2e-5), metrics=['acc'])
model_final.summary()
model_final.save("final_model")

Ok, now we are ready to test with our own image. Upload your favourite cat and dog images and see your model in action.

In [None]:
from google.colab import files

uploaded = files.upload()

for filename in uploaded.keys():
    print('User uploaded file "{name}" with length {length} bytes'.format(
        name=filename, length=len(uploaded[filename])))

In [None]:
img = tf.keras.preprocessing.image.load_img(
    filename, target_size=(150, 150)
)

# we convert the image to numpy array
img_array = tf.keras.preprocessing.image.img_to_array(img)

# Although we have only a single image, our model expects data in batches,
# so we need to add a batch axis as well
img_array = tf.expand_dims(img_array, 0) # Create a batch

# we load the model saved earlier and do the inference 
model = tf.keras.models.load_model('final_model')
predictions = model(img_array)
if predictions[0] > 0.5: 
    print('It is a dog')
else:
    print('It is a cat')
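
The `> 0.5` test works because the sigmoid output is the predicted probability of class 1. The sketch below makes that mapping explicit. It assumes the class folders are named `cats` and `dogs` and are indexed alphabetically, and `describe_prediction` is a hypothetical helper. In the notebook, the actual mapping can be read from the `class_indices` attribute of the iterator returned by `flow_from_directory`.

```python
# Assumed mapping: folder names sorted alphabetically, so cats=0 and dogs=1.
CLASS_INDICES = {'cats': 0, 'dogs': 1}

def describe_prediction(prob_of_class_1, threshold=0.5):
    """Turn a sigmoid output into a (class name, confidence) pair."""
    index_to_name = {index: name for name, index in CLASS_INDICES.items()}
    predicted_index = int(prob_of_class_1 > threshold)
    # Confidence is the probability assigned to the predicted class.
    confidence = prob_of_class_1 if predicted_index == 1 else 1.0 - prob_of_class_1
    return index_to_name[predicted_index], confidence

print(describe_prediction(0.92))
print(describe_prediction(0.10))
```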

## Fine-tuning

Another widely used technique is _fine-tuning_. 
Fine-tuning consists in unfreezing a few of the top layers 
of a frozen model base used for feature extraction, and jointly training both the newly added part of the model (in our case, the 
fully-connected classifier) and these top layers. This is called "fine-tuning" because it slightly adjusts the more abstract 
representations of the model being reused, in order to make them more relevant for the problem at hand.

![fine-tuning VGG16](https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/resources/it3103/vgg16_fine_tuning.png)

We have stated before that it was necessary to freeze the convolutional base of VGG16 in order to be able to train a randomly initialized 
classifier on top. For the same reason, it is only possible to fine-tune the top layers of the convolutional base once the classifier on 
top has already been trained. If the classifier wasn't already trained, then the error signal propagating through the network during 
training would be too large, and the representations previously learned by the layers being fine-tuned would be destroyed. Thus the steps 
for fine-tuning a network are as follows:

* 1) Add your custom network on top of an already trained base network.
* 2) Freeze the base network.
* 3) Train the part you added.
* 4) Unfreeze some layers in the base network.
* 5) Jointly train both these layers and the part you added.


As a reminder, this is what our convolutional base looks like. Note that it has 14,714,688 trainable weights (around 14.7 million).

In [None]:
conv_base = VGG16(weights='imagenet',
                  include_top=False,
                  input_shape=(150, 150, 3))
conv_base.summary()


We will fine-tune the last 3 convolutional layers, which means that all layers up until `block4_pool` should be frozen, and the layers 
`block5_conv1`, `block5_conv2` and `block5_conv3` should be trainable.

Why not fine-tune more layers? Why not fine-tune the entire convolutional base? We could. However, we need to consider that:

* Earlier layers in the convolutional base encode more generic, reusable features, while layers higher up encode more specialized features. It is 
more useful to fine-tune the more specialized features, as these are the ones that need to be repurposed on our new problem. There would 
be fast-decreasing returns in fine-tuning lower layers.
* The more parameters we are training, the more we are at risk of overfitting. The convolutional base has 15M parameters, so it would be 
risky to attempt to train it on our small dataset.

Thus, in our situation, it is a good strategy to only fine-tune the top 2 to 3 layers in the convolutional base.

Let's set this up: we will unfreeze our `conv_base`, and then freeze the individual layers inside it, except the last 3 convolutional layers. 

Call ``summary()`` on the model and you will see that the number of trainable weights is now 7,079,424 (around 7 million), much less than before, because all the layers are frozen except the last 3 convolutional layers.

In [None]:
conv_base.trainable = True

set_trainable = False
for layer in conv_base.layers:
    if layer.name == 'block5_conv1':
        set_trainable = True
    if set_trainable:
        layer.trainable = True
    else:
        layer.trainable = False

conv_base.summary()
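
The two parameter counts reported by `summary()` can be reproduced by hand. The sketch below assumes the standard VGG16 layout of thirteen 3x3 convolutions with the channel sizes listed, and it reuses the same `set_trainable` switch as the freezing loop. Pooling layers are left out because they have no weights.

```python
# (layer name, input channels, output channels) for VGG16's 3x3 conv layers
VGG16_CONVS = [
    ('block1_conv1', 3, 64), ('block1_conv2', 64, 64),
    ('block2_conv1', 64, 128), ('block2_conv2', 128, 128),
    ('block3_conv1', 128, 256), ('block3_conv2', 256, 256), ('block3_conv3', 256, 256),
    ('block4_conv1', 256, 512), ('block4_conv2', 512, 512), ('block4_conv3', 512, 512),
    ('block5_conv1', 512, 512), ('block5_conv2', 512, 512), ('block5_conv3', 512, 512),
]

def conv_params(c_in, c_out, kernel=3):
    """Kernel weights plus one bias per output channel."""
    return kernel * kernel * c_in * c_out + c_out

total = 0
trainable = 0
set_trainable = False
for name, c_in, c_out in VGG16_CONVS:
    if name == 'block5_conv1':
        set_trainable = True   # everything from here on stays trainable
    n = conv_params(c_in, c_out)
    total += n
    if set_trainable:
        trainable += n

print(total, trainable)  # 14714688 7079424
```

Each of the three `block5` convolutions alone holds 2,359,808 weights, which is why unfreezing only three layers still leaves about half of the base trainable.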

As explained earlier, we also need our model-specific input preprocessing function.

In [None]:
preprocess_input_fn = vgg16.preprocess_input

We will now set up our data pipeline for images as before, using `ImageDataGenerator`.

In [None]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator()

test_datagen = ImageDataGenerator()

train_generator = train_datagen.flow_from_directory(
        train_dir,
        target_size=(150, 150),
        batch_size=20,
        class_mode='binary')

validation_generator = test_datagen.flow_from_directory(
        validation_dir,
        target_size=(150, 150),
        batch_size=20,
        class_mode='binary')


Now we can start fine-tuning our network. We will do this with the RMSprop optimizer, using a very low learning rate. The reason for using 
a low learning rate is that we want to limit the magnitude of the modifications we make to the representations of the 3 layers that we are 
fine-tuning. Updates that are too large may harm these representations.
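
To get a feel for how the learning rate bounds the size of each update, here is a simplified, single-weight sketch of an RMSprop-style step. The gradient value, `rho` and `eps` are chosen for illustration, and `rmsprop_step` is a hypothetical helper, not the Keras optimizer itself.

```python
import math

def rmsprop_step(grad, lr, v=0.0, rho=0.9, eps=1e-7):
    """Return the change applied to one weight by a simplified RMSprop update."""
    v = rho * v + (1.0 - rho) * grad ** 2      # running average of squared gradients
    return -lr * grad / (math.sqrt(v) + eps)   # gradient scaled by its recent magnitude

grad = 0.5  # hypothetical gradient for a single weight
step_large = rmsprop_step(grad, lr=1e-3)
step_small = rmsprop_step(grad, lr=1e-5)
print(step_large, step_small)
```

Because the gradient is divided by its own running magnitude, the step size is set mostly by the learning rate, so going from `1e-3` to `1e-5` makes each update about 100 times smaller.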

Now let's proceed with fine-tuning.

In [None]:
inputs = layers.Input(shape=(150, 150, 3))
x = preprocess_input_fn(inputs)
x = conv_base(x)
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dense(256, activation='relu')(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation='sigmoid')(x)
model_finetune = Model(inputs=[inputs], outputs=[outputs])

model_finetune.compile(loss='binary_crossentropy',
              optimizer=optimizers.RMSprop(learning_rate=1e-5),
              metrics=['acc'])

root_logdir = os.path.join(os.curdir, "tb_logs")

def get_run_logdir():    # use a new directory for each run
	import time
	run_id = time.strftime("run_%Y_%m_%d-%H_%M_%S")
	return os.path.join(root_logdir, run_id)

run_logdir = get_run_logdir()

tb_callback = tf.keras.callbacks.TensorBoard(run_logdir)

history = model_finetune.fit(
      train_generator,
      steps_per_epoch=100,
      epochs=35,
      validation_data=validation_generator,
      validation_steps=50,
      callbacks=[tb_callback])
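
The values `steps_per_epoch=100` and `validation_steps=50` follow from the dataset sizes used earlier (2000 training and 1000 validation images) and the batch size of 20. A small sketch of that calculation, with a hypothetical helper `steps_for`:

```python
import math

def steps_for(num_samples, batch_size):
    """Number of batches needed to see every sample once."""
    return math.ceil(num_samples / batch_size)

print(steps_for(2000, 20))  # training images   -> 100
print(steps_for(1000, 20))  # validation images -> 50
```

Rounding up means a final, smaller batch is still counted when the sample count is not a multiple of the batch size.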

Let's visualize our loss and accuracy using Tensorboard. 

In [None]:
%tensorboard --logdir tb_logs


With fine-tuning, we are able to achieve a validation accuracy of around 96%.

**Exercise 1:**

Is there any overfitting? If there is, what can you do to reduce overfitting? 

*Type your answer here*


Modify the code to reduce overfitting, if there is any. You can write your code in the code cell below.


In [None]:
## TODO: Write your code here ###



**Exercise 2:**

Modify the code to fine-tune fewer layers (e.g. 2 layers or 1 layer). What happens to the overfitting and the accuracy?

*Type your answer here*

You can write your code in the code cell below.

In [None]:
## TODO: Write your code here ###

**Additional Exercises**

Instead of VGG16, you may want to try using a more recent network architecture such as ResNet50 or MobileNet (which is good for mobile devices due to its small size).
