<a href="https://colab.research.google.com/github/nyp-sit/iti107/blob/main/session-3/3.fine-tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Fine-tuning

Another widely used transfer learning technique is _fine-tuning_. 
Fine-tuning involves unfreezing a few of the top layers 
of a frozen model base used for feature extraction, and jointly training both the newly added part of the model (in our case, the 
fully-connected classifier) and these unfrozen top layers. This is called "fine-tuning" because it slightly adjusts the more abstract 
representations of the model being reused, in order to make them more relevant for the problem at hand.



![fine-tuning VGG16](https://nyp-aicourse.s3.ap-southeast-1.amazonaws.com/iti107/resources/vgg16_fine_tuning.png)

In [1]:
import os
import tensorflow as tf
import tensorflow.keras as keras



## Creating Datasets

We will set up our training and validation datasets as we did in the earlier exercise.

In [30]:
dataset_URL = 'https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/datasets/cats_and_dogs_subset.tar.gz'
tf.keras.utils.get_file(origin=dataset_URL, extract=True, cache_dir='.')
dataset_folder = os.path.join('datasets', 'cats_and_dogs_subset')
# dataset_URL = 'https://nyp-aicourse.s3.ap-southeast-1.amazonaws.com/iti107/datasets/emotions_dataset_jpg.zip'
# path_to_zip= keras.utils.get_file('emotions_dataset_jpg.zip', origin=dataset_URL, extract=True, cache_dir='.')
# print(path_to_zip)
# dataset_folder = os.path.dirname(path_to_zip)
# dataset_url = 'https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz'
# path_to_zip = tf.keras.utils.get_file(origin=dataset_url, extract=True, cache_dir='.')
# dataset_folder = os.path.dirname(path_to_zip)
# dataset_folder = os.path.join(dataset_folder, 'flower_photos')

Downloading data from https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/datasets/cats_and_dogs_subset.tar.gz


In [39]:
batch_size = 16
image_size = (128,128)
label_mode = 'binary'
num_classes = 2

train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    dataset_folder,
    validation_split=0.2,
    subset="training",
    seed=1337,
    image_size=image_size,
    batch_size=batch_size,
    label_mode=label_mode
)
val_ds = tf.keras.preprocessing.image_dataset_from_directory(
    dataset_folder,
    validation_split=0.2,
    subset="validation",
    seed=1337,
    image_size=image_size,
    batch_size=batch_size,
    label_mode=label_mode
)

Found 3000 files belonging to 2 classes.
Using 2400 files for training.
Found 3000 files belonging to 2 classes.
Using 600 files for validation.


## Transfer Learning Workflow 

It is necessary to freeze the convolutional base before training a randomly initialized classifier on top of it. If the classifier is not already trained, the error signal propagating through the network during training will be too large, and the representations previously learned by the layers being fine-tuned will be destroyed. Thus the steps for fine-tuning a network are as follows:

1. Add your custom network on top of an already trained base network.
2. Freeze the convolutional base network.
3. Train the classification top you added.
4. Unfreeze some layers in the base network.
5. Jointly train both these layers and the part you added.
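
The five steps above can be sketched end to end on a toy model. This is only an illustration under stated assumptions: `toy_base` is a hypothetical two-layer stand-in with random weights (not a real pre-trained network), the data is random noise, and the names `toy_base` and `toy_model` are invented here so they do not clash with the notebook's own variables.

```python
import numpy as np
from tensorflow import keras

# Hypothetical stand-in for a pre-trained base (random weights, two conv layers).
toy_base = keras.Sequential(
    [
        keras.Input(shape=(32, 32, 3)),
        keras.layers.Conv2D(4, 3, activation="relu", name="conv_low"),
        keras.layers.Conv2D(8, 3, activation="relu", name="conv_top"),
    ],
    name="toy_base",
)

# Steps 1-2: put a new head on top of the base and freeze the base.
toy_base.trainable = False
inputs = keras.Input(shape=(32, 32, 3))
x = toy_base(inputs, training=False)
x = keras.layers.GlobalAveragePooling2D()(x)
outputs = keras.layers.Dense(1, activation="sigmoid")(x)
toy_model = keras.Model(inputs, outputs)

# Random data, used only so that fit() has something to run on.
rng = np.random.default_rng(0)
images = rng.random((16, 32, 32, 3)).astype("float32")
labels = rng.integers(0, 2, size=(16, 1)).astype("float32")

# Step 3: train the head only.
toy_model.compile(optimizer=keras.optimizers.Adam(1e-3), loss="binary_crossentropy")
toy_model.fit(images, labels, epochs=1, batch_size=8, verbose=0)
head_only = len(toy_model.trainable_weights)  # Dense kernel + bias

# Step 4: unfreeze the base, then re-freeze everything except its top layer.
toy_base.trainable = True
for layer in toy_base.layers[:-1]:
    layer.trainable = False

# Step 5: recompile with a lower learning rate and train jointly.
toy_model.compile(optimizer=keras.optimizers.Adam(1e-4), loss="binary_crossentropy")
toy_model.fit(images, labels, epochs=1, batch_size=8, verbose=0)
after_unfreeze = len(toy_model.trainable_weights)  # conv_top + Dense

print(head_only, after_unfreeze)
```

Note that the model is recompiled after `trainable` changes, so that the new setting is picked up by training.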


#### BatchNormalization layer 

Many CNN models contain BatchNormalization layers. 
BatchNormalization contains 2 non-trainable variables that keep track of the mean and variance of the inputs. These variables are updated during training. Here are a few things to note when fine-tuning a model with BatchNormalization layers: 
- When you set `bn_layer.trainable = False`, the BatchNormalization layer will run in inference mode, and will not update its mean & variance statistics. 
- When you unfreeze a model that contains BatchNormalization layers in order to do fine-tuning, you should keep the BatchNormalization layers in inference mode by passing `training=False` when calling the base model. Otherwise the updates applied to the non-trainable weights will suddenly destroy what the model has learned.
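
The behaviour described in the two points above can be observed on a single, standalone layer. This is a minimal sketch, assuming eager execution in TensorFlow 2; the input `x` is random data made up for the illustration.

```python
import numpy as np
import tensorflow as tf

bn = tf.keras.layers.BatchNormalization()

# Made-up batch with a mean far from zero so that any update is visible.
x = tf.constant(
    np.random.default_rng(0).normal(5.0, 2.0, size=(32, 4)), dtype=tf.float32
)

# Inference mode: the moving mean stays at its initial value (zeros).
bn(x, training=False)
mean_after_inference = bn.moving_mean.numpy().copy()

# Training mode: the moving mean is nudged towards the batch mean.
bn(x, training=True)
mean_after_training = bn.moving_mean.numpy().copy()

# A frozen layer runs in inference mode even if training=True is passed.
bn.trainable = False
bn(x, training=True)
mean_after_frozen_call = bn.moving_mean.numpy().copy()

print(mean_after_inference, mean_after_training, mean_after_frozen_call)
```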

## Build our Model 

We will now construct our model: a convolutional base (initialized with pre-trained weights) and our own classification head (initialized with random weights).

In [40]:
data_augmentation = keras.Sequential(
        [
            tf.keras.layers.RandomRotation(0.1),
            tf.keras.layers.RandomFlip("horizontal")
        ]
    )

In [62]:
# Load the pre-trained model 
base_model = keras.applications.EfficientNetB0(input_shape=image_size + (3,),
                                         include_top=False,
                                         weights='imagenet')

## This is not necessary as it is just a passthrough. EfficientNet model includes the rescaling layer that preprocess the input
## refer to https://www.tensorflow.org/api_docs/python/tf/keras/applications/efficientnet/preprocess_input
preprocess_input_fn = keras.applications.efficientnet.preprocess_input

# freeze the base layer 
base_model.trainable = False

# Add input layer 
inputs = keras.layers.Input(shape=image_size+(3,))

x = data_augmentation(inputs)
# Add preprocessing layer

## This is not necessary as it is just a passthrough. EfficientNet model includes the rescaling layer that preprocess the input
## refer to https://www.tensorflow.org/api_docs/python/tf/keras/applications/efficientnet/preprocess_input
x = preprocess_input_fn(x)

# The base model contains batchnorm layers. We want to keep them in inference mode
# when we unfreeze the base model for fine-tuning, so we make sure that the
# base_model is running in inference mode here.
x = base_model(x, training=False)

# Add our classification head
x = keras.layers.GlobalAveragePooling2D()(x)
x = keras.layers.Dropout(rate=0.5)(x)
#x = keras.layers.Dense(units=256, activation="relu")(x)
#x = keras.layers.Dropout(rate=0.5)(x)

# outputs = keras.layers.Dense(units=1, activation="softmax")(x)
outputs = keras.layers.Dense(units=1, activation="sigmoid")(x)

model = keras.models.Model(inputs=[inputs], outputs=[outputs])

base_learning_rate = 0.001

model.compile(loss="binary_crossentropy", 
                  optimizer=keras.optimizers.Adam(learning_rate=base_learning_rate), 
                  metrics=["accuracy"])


In [63]:
val_ds.class_names

['cats', 'dogs']

Let's confirm that all the layers of the convolutional base are frozen. 

In [64]:
for layer in base_model.layers:
    print(f'layer name = {layer.name}, trainable={layer.trainable}')

layer name = input_13, trainable=False
layer name = rescaling_6, trainable=False
layer name = normalization_6, trainable=False
layer name = stem_conv_pad, trainable=False
layer name = stem_conv, trainable=False
layer name = stem_bn, trainable=False
layer name = stem_activation, trainable=False
layer name = block1a_dwconv, trainable=False
layer name = block1a_bn, trainable=False
layer name = block1a_activation, trainable=False
layer name = block1a_se_squeeze, trainable=False
layer name = block1a_se_reshape, trainable=False
layer name = block1a_se_reduce, trainable=False
layer name = block1a_se_expand, trainable=False
layer name = block1a_se_excite, trainable=False
layer name = block1a_project_conv, trainable=False
layer name = block1a_project_bn, trainable=False
layer name = block2a_expand_conv, trainable=False
layer name = block2a_expand_bn, trainable=False
layer name = block2a_expand_activation, trainable=False
layer name = block2a_dwconv_pad, trainable=False
layer name = block2a_dwco

In [65]:
index = 0

for layer in base_model.layers: 
    if layer.name == 'block7a_expand_conv': 
        print(index)
        break
    index += 1 

221
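
The same lookup can be written as a small reusable helper with `enumerate`. The sketch below is self-contained: `toy_layers` is a hypothetical, heavily shortened stand-in for `base_model.layers`, and `index_of_layer` is a name invented for this illustration.

```python
from types import SimpleNamespace

# Hypothetical, shortened stand-in for base_model.layers (only .name is needed).
toy_layers = [
    SimpleNamespace(name=n)
    for n in ["stem_conv", "block1a_dwconv", "block7a_expand_conv", "top_conv"]
]

def index_of_layer(layers, name):
    """Return the position of the first layer called `name`."""
    for index, layer in enumerate(layers):
        if layer.name == name:
            return index
    raise ValueError(f"no layer named {name!r}")

cut = index_of_layer(toy_layers, "block7a_expand_conv")
print(cut)
```

Called as `index_of_layer(base_model.layers, 'block7a_expand_conv')` on the model in this notebook, it should give the same index as the loop above. Raising an error for an unknown name avoids silently freezing the wrong range of layers.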


Let's print out the model summary and see how many trainable weights there are. We can see that we have only 1,281 trainable weights (parameters), all coming from the classification head that we put on top of the convolutional base. (For comparison, EfficientNetB0 has a total of 4,049,571 weights.)

In [66]:
model.layers[1].layers

[<keras.layers.preprocessing.image_preprocessing.RandomRotation at 0x23c175ffca0>,
 <keras.layers.preprocessing.image_preprocessing.RandomFlip at 0x23c13436a90>]

In [67]:
model.summary()

Model: "model_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_14 (InputLayer)        [(None, 128, 128, 3)]     0         
_________________________________________________________________
sequential_2 (Sequential)    (None, 128, 128, 3)       0         
_________________________________________________________________
efficientnetb0 (Functional)  (None, 4, 4, 1280)        4049571   
_________________________________________________________________
global_average_pooling2d_6 ( (None, 1280)              0         
_________________________________________________________________
dropout_6 (Dropout)          (None, 1280)              0         
_________________________________________________________________
dense_6 (Dense)              (None, 1)                 1281      
Total params: 4,050,852
Trainable params: 1,281
Non-trainable params: 4,049,571
_____________________________________________
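
The parameter counts in the summary can be checked by hand. The short calculation below uses only numbers printed above: the Dense layer receives 1,280 features from the pooling layer and has a single output unit.

```python
# Numbers taken from the model summary above.
features = 1280          # output size of GlobalAveragePooling2D
units = 1                # single sigmoid output
base_params = 4_049_571  # efficientnetb0 row in the summary

# A Dense layer has one weight per (input, unit) pair plus one bias per unit.
dense_params = features * units + units
total_params = base_params + dense_params

print(dense_params, total_params)
```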

## Train the classification head 

We will go ahead and train our classification head.

In [None]:
# create model checkpoint callback to save the best model checkpoint
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath="best_checkpoint",
    save_weights_only=True,
    monitor='val_accuracy',
    mode='max',
    save_best_only=True)

model.fit(train_ds, validation_data=val_ds, 
          epochs=50, callbacks=[model_checkpoint_callback])

In [69]:
model.load_weights('best_checkpoint')
model.evaluate(val_ds)



[0.09174549579620361, 0.9750000238418579]

In [None]:
model.save("frozenbase")

In [None]:
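model = keras.models.load_model("frozenbase")
# Re-bind base_model to the base inside the reloaded model; otherwise the
# unfreezing below would act on the old object and leave `model` unchanged.
base_model = model.get_layer("efficientnetb0")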

Now that our classification layers are trained, let's unfreeze some of the top layers of the convolutional base and fine-tune their weights. 
We will fine-tune the layers from `block7a_expand_conv` onward, which means that all layers before `block7a_expand_conv` 
(index 221 in `base_model.layers`) should be frozen, and the layers from that index onward should be trainable.

Why not fine-tune more layers? Why not fine-tune the entire convolutional base? We could. However, we need to consider that:

* Earlier layers in the convolutional base encode more generic, reusable features, while layers higher up encode more specialized features. It is 
more useful to fine-tune the more specialized features, as these are the ones that need to be repurposed on our new problem. There would 
be fast-decreasing returns in fine-tuning lower layers.
* The more parameters we are training, the more we are at risk of overfitting. The convolutional base has about 4 million parameters, so it would be 
risky to attempt to train all of it on our small dataset.

Thus, in our situation, it is a good strategy to fine-tune only the topmost layers of the convolutional base.

Let's set this up: we will unfreeze our `base_model`, 
and then freeze the individual layers inside it, except those from index 221 (`block7a_expand_conv`) onward. 

Do a model ``summary()`` and you will see that the number of trainable weights is now 1,130,673 (around 1.1 million), far fewer than the model's 4,050,852 total, because all the layers are frozen except those from `block7a_expand_conv` onward.

In [70]:
base_model.trainable = True
# Freeze every layer before block7a_expand_conv (index 221, found earlier)
for layer in base_model.layers[:221]:
    layer.trainable = False

In [None]:
for layer in base_model.layers:
    print(layer.name, layer.trainable)

Let us examine the model summary again. We can see that we now have many more trainable weights: 1,130,673, compared to 1,281 previously.

In [72]:
model.summary()

Model: "model_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_14 (InputLayer)        [(None, 128, 128, 3)]     0         
_________________________________________________________________
sequential_2 (Sequential)    (None, 128, 128, 3)       0         
_________________________________________________________________
efficientnetb0 (Functional)  (None, 4, 4, 1280)        4049571   
_________________________________________________________________
global_average_pooling2d_6 ( (None, 1280)              0         
_________________________________________________________________
dropout_6 (Dropout)          (None, 1280)              0         
_________________________________________________________________
dense_6 (Dense)              (None, 1)                 1281      
Total params: 4,050,852
Trainable params: 1,130,673
Non-trainable params: 2,920,179
_________________________________________
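
As a quick sanity check, the numbers in this summary can be related to the earlier one. The calculation below uses only values printed in the two summaries and shows how much of the model is being fine-tuned.

```python
# Numbers taken from the two model summaries in this notebook.
total_params = 4_050_852
non_trainable_params = 2_920_179
head_params = 1_281  # Dense head, trainable in both stages

trainable_params = total_params - non_trainable_params
unfrozen_base_params = trainable_params - head_params
trainable_fraction = trainable_params / total_params

print(trainable_params, unfrozen_base_params, round(trainable_fraction, 3))
```

Roughly 28% of the parameters are trainable at this stage; the rest of the convolutional base keeps its pre-trained weights.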

As you are now training a much larger part of the model and want to readapt the pretrained weights, it is important to use a lower learning rate at this stage, because we do not want to make overly drastic changes to the weights in the convolutional layers being fine-tuned.

In [73]:
finetune_learning_rate = base_learning_rate / 10.

model.compile(loss="binary_crossentropy",
              optimizer=keras.optimizers.Adam(learning_rate=finetune_learning_rate),
              metrics=["accuracy"])
# model.compile(loss="sparse_categorical_crossentropy",
#               optimizer=keras.optimizers.Adam(learning_rate=finetune_learning_rate),
#               metrics=["accuracy"])

model.fit(
    train_ds,
    epochs=20,
    validation_data=val_ds,
    callbacks=[model_checkpoint_callback])

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20

KeyboardInterrupt: 

In [29]:
model.load_weights('best_checkpoint')
model.evaluate(val_ds)



[0.2562572956085205, 0.9318801164627075]

**Question:**

Is our fine-tuned model performing better or worse than the previous model?

Provide a possible explanation to your observation. 


In [None]:
**Exercise:**

