# Transfer learning & fine-tuning

## Setup

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow import keras

## Introduction

**Transfer learning** consists of taking features learned on one problem and
leveraging them on a new, similar problem. For instance, features from a model that has
learned to identify raccoons may be useful to kick-start a model meant to identify
tanukis.

Transfer learning is usually done for tasks where your dataset has too little data to
 train a full-scale model from scratch.

The most common incarnation of transfer learning in the context of deep learning is the
 following workflow:

1. Take layers from a previously trained model.
2. Freeze them, so as to avoid destroying any of the information they contain during
 future training rounds.
3. Add some new, trainable layers on top of the frozen layers. They will learn to turn
 the old features into predictions on a new dataset.
4. Train the new layers on your dataset.

A last, optional step is **fine-tuning**, which consists of unfreezing the entire
model you obtained above (or part of it) and re-training it on the new data with a
very low learning rate. This can potentially achieve meaningful improvements by
incrementally adapting the pretrained features to the new data.

First, we will go over the Keras `trainable` API in detail, which underlies most
 transfer learning & fine-tuning workflows.

Then, we'll demonstrate the typical workflow by taking a model pretrained on the
ImageNet dataset, and retraining it on the Kaggle "cats vs dogs" classification
 dataset.

This is adapted from
[Deep Learning with Python](https://www.manning.com/books/deep-learning-with-python)
 and the 2016 blog post
["building powerful image classification models using very little
 data"](https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html).

## Freezing layers: understanding the `trainable` attribute

Layers & models have three weight attributes:

- `weights` is the list of all weight variables of the layer.
- `trainable_weights` is the list of those that are meant to be updated (via gradient
 descent) to minimize the loss during training.
- `non_trainable_weights` is the list of those that aren't meant to be trained.
 Typically they are updated by the model during the forward pass.

**Example: the `Dense` layer has 2 trainable weights (kernel & bias)**

In [None]:
layer = keras.layers.Dense(3)
layer.build((None, 4))  # Build the layer for inputs with 4 features

print("weights:", len(layer.weights))  # Two weights: the kernel and the bias
print("trainable_weights:", len(layer.trainable_weights))
print("non_trainable_weights:", len(layer.non_trainable_weights))

In [None]:
# Inspect the weight variables of the Dense layer
layer.weights

In general, all weights are trainable weights. The only built-in layer that has
non-trainable weights is the `BatchNormalization` layer. It uses non-trainable weights
 to keep track of the mean and variance of its inputs during training.
To learn how to use non-trainable weights in your own custom layers, see the
[guide to writing new layers from scratch](https://keras.io/guides/making_new_layers_and_models_via_subclassing/).

**Example: the `BatchNormalization` layer has 2 trainable weights and 2 non-trainable
 weights**

In [None]:
# Repeat the same check with a BatchNormalization layer
layer = keras.layers.BatchNormalization()
layer.build((None, 4))  # Build the layer for inputs with 4 features

print("weights:", len(layer.weights))
print("trainable_weights:", len(layer.trainable_weights))
print("non_trainable_weights:", len(layer.non_trainable_weights))

In [None]:
# Inspect the weight variables of the BatchNormalization layer
layer.weights

In [None]:
# In this case, gamma and beta are the learned (trainable) variables
layer.trainable_weights

In [None]:
# Moving mean and moving variance are non-trainable variables
layer.non_trainable_weights

Layers & models also feature a boolean attribute `trainable`. Its value can be changed.
Setting `layer.trainable` to `False` moves all the layer's weights from trainable to
non-trainable. This is called "freezing" the layer: the state of a frozen layer won't
be updated during training (either when training with `fit()` or when training with
any custom loop that relies on `trainable_weights` to apply gradient updates).

**Example: setting `trainable` to `False`**

In [None]:
layer = keras.layers.Dense(3)
layer.build((None, 4))  # Build the layer for inputs with 4 features
layer.trainable = False  # Freeze the layer

print("weights:", len(layer.weights))
print("trainable_weights:", len(layer.trainable_weights))
print("non_trainable_weights:", len(layer.non_trainable_weights))

When a trainable weight becomes non-trainable, its value is no longer updated during
 training.

In [None]:
# Define two layers: the first with three units, the second with two
layer1 = keras.layers.Dense(3, activation="relu")
layer2 = keras.layers.Dense(2, activation="sigmoid")
# Create the model
model = keras.Sequential([keras.Input(shape=(3,)), layer1, layer2])

# Freeze the first layer
layer1.trainable = False

# Keep a copy of the weights of layer1 for later reference
initial_layer1_weights_values = layer1.get_weights()

# Compile the model
model.compile(optimizer="adam", loss="mse")

# Create random X and y values
X = np.random.random((2, 3))  # Two samples, three features
y = np.random.random((2, 2))  # Two samples, two target values

# Check the data
print("X: ", X)
print("y: ", y)

# Train for one epoch on the random data
model.fit(X, y)

# Check that the weights of layer1 have not changed during training.
# For a Dense layer, get_weights() returns [kernel matrix, bias vector].
final_layer1_weights_values = layer1.get_weights()
np.testing.assert_allclose(initial_layer1_weights_values[0], final_layer1_weights_values[0])
np.testing.assert_allclose(initial_layer1_weights_values[1], final_layer1_weights_values[1])

# Print the weights of the first layer before and after training
print("\ninitial layer1 weights values: \n", initial_layer1_weights_values)
print("\nfinal layer1 weights values: \n", final_layer1_weights_values)
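The same holds for a hand-written training loop. The following is a minimal sketch, assuming a toy two-layer model and random data (the names `frozen`, `head`, `x` and `y` are illustrative and not part of the example above). Gradients are computed and applied only for `model.trainable_weights`, so the frozen layer is never touched.

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras

# Toy model (hypothetical): one frozen layer followed by a trainable head
frozen = keras.layers.Dense(3, activation="relu")
head = keras.layers.Dense(2)
model = keras.Sequential([keras.Input(shape=(3,)), frozen, head])
frozen.trainable = False

optimizer = keras.optimizers.SGD(learning_rate=0.1)
loss_fn = keras.losses.MeanSquaredError()

# Random data, for illustration only
x = tf.constant(np.random.random((8, 3)), dtype="float32")
y = tf.constant(np.random.random((8, 2)), dtype="float32")

frozen_before = frozen.get_weights()
head_before = head.get_weights()

# One manual training step: only the trainable weights receive gradients
with tf.GradientTape() as tape:
    loss = loss_fn(y, model(x, training=True))
grads = tape.gradient(loss, model.trainable_weights)
optimizer.apply_gradients(zip(grads, model.trainable_weights))

frozen_unchanged = all(
    np.array_equal(a, b) for a, b in zip(frozen_before, frozen.get_weights())
)
head_changed = any(
    not np.array_equal(a, b) for a, b in zip(head_before, head.get_weights())
)
print("frozen layer unchanged:", frozen_unchanged)
print("head updated:", head_changed)
```

Because the loop only ever asks for `model.trainable_weights`, freezing a layer needs no other change to the training code.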

## Recursive setting of the `trainable` attribute

If you set `trainable = False` on a model or on any layer that has sublayers,
all children layers become non-trainable as well.

**Example:**

In [None]:
# Define a simple inner model
inner_model = keras.Sequential(
    [
        keras.Input(shape=(3,)),
        keras.layers.Dense(3, activation="relu"),
        keras.layers.Dense(3, activation="relu"),
    ])

# Define another simple model, which includes the previous one
model = keras.Sequential(
    [keras.Input(shape=(3,)), inner_model, keras.layers.Dense(3, activation="sigmoid")])

# Freeze the outer model; this makes every nested layer non-trainable too
model.trainable = False

# Check that both the inner model and its individual layers are frozen
print("Is the inner model trainable?", inner_model.trainable)
print("Is the first layer of the inner model trainable?", inner_model.layers[0].trainable)


## The typical transfer-learning workflow

This leads us to how a typical transfer learning workflow can be implemented in Keras:

1. Instantiate a base model and load pre-trained weights into it.
2. Freeze all layers in the base model by setting `trainable = False`.
3. Create a new model on top of the output of one (or several) layers from the base
 model.
4. Train your new model on your new dataset.

Note that an alternative, more lightweight workflow could also be:

1. Instantiate a base model and load pre-trained weights into it.
2. Run your new dataset through it and record the output of one (or several) layers
 from the base model. This is called **feature extraction**.
3. Use that output as input data for a new, smaller model.

A key advantage of that second workflow is that you only run the base model once on
 your data, rather than once per epoch of training. So it's a lot faster & cheaper.
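The feature-extraction workflow can be sketched as follows. This is only an illustration under stated assumptions: `toy_base` is a small, randomly initialized stand-in for a real pretrained model, and the images and labels are random placeholders.

```python
import numpy as np
from tensorflow import keras

# Hypothetical stand-in for a pretrained base model (randomly initialized here)
toy_base = keras.Sequential(
    [
        keras.Input(shape=(32, 32, 3)),
        keras.layers.Conv2D(8, 3, activation="relu"),
        keras.layers.GlobalAveragePooling2D(),
    ]
)
toy_base.trainable = False

# Random placeholder data: 64 images with binary labels
images = np.random.random((64, 32, 32, 3)).astype("float32")
labels = np.random.randint(0, 2, size=(64, 1)).astype("float32")

# Run the dataset through the base model once and keep the outputs
features = toy_base.predict(images, verbose=0)

# Train a small model on the recorded features; the base model is not involved
head = keras.Sequential([keras.Input(shape=(8,)), keras.layers.Dense(1)])
head.compile(
    optimizer="adam",
    loss=keras.losses.BinaryCrossentropy(from_logits=True),
)
history = head.fit(features, labels, epochs=3, verbose=0)

print("feature shape:", features.shape)
print("loss per epoch:", history.history["loss"])
```

The base model runs exactly once, in the `predict()` call, however many epochs the small model is trained for.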

An issue with that second workflow, though, is that it doesn't allow you to dynamically
modify the input data of your new model during training, which is required when doing
data augmentation, for instance. Transfer learning is typically used for tasks where
your new dataset has too little data to train a full-scale model from scratch, and in
such scenarios data augmentation is very important. So in what follows, we will focus
on the first workflow.

In [None]:
# Load a pretrained model
base_model = keras.applications.Xception(
    weights='imagenet',  # Load weights pre-trained on ImageNet
    input_shape=(120, 120, 3),
    include_top=False)  # Do not include the ImageNet classifier (the final layers) at the top

# Then, freeze the base model
base_model.trainable = False

In [None]:
# You can see the architecture of this model
base_model.summary()

In [None]:
# Create the model and add custom top layers
model = keras.models.Sequential()
model.add(base_model)
model.add(keras.layers.GlobalAveragePooling2D())  # Average each feature map down to a single value
model.add(keras.layers.Dense(10, activation='softmax'))

In [None]:
# Import from tensorflow.keras to stay consistent with the `keras` module imported in Setup
from tensorflow.keras.utils import to_categorical, img_to_array, array_to_img

# Load the MNIST dataset
(X_train, y_train), (X_test, y_test) = keras.datasets.mnist.load_data()

# Preprocess the data
X_train = X_train[0:1000]
X_test = X_test[0:100]

# MNIST images have 1 channel, but Xception expects 3. Stack three copies of each
# image along a new last axis to get arrays of shape (N, 28, 28, 3).
# (np.dstack followed by a reshape would scramble the pixels across channels.)
X_train_3d = np.stack([X_train] * 3, axis=-1)
X_test_3d = np.stack([X_test] * 3, axis=-1)

# Resize to 120x120 and scale pixel values from [0, 255] to [-1, 1], the range Xception expects
X_train_resized = np.array([img_to_array(array_to_img(im, scale=False).resize((120, 120))) for im in X_train_3d]) / 127.5 - 1.0
X_test_resized = np.array([img_to_array(array_to_img(im, scale=False).resize((120, 120))) for im in X_test_3d]) / 127.5 - 1.0

# Create classes
y_train = to_categorical(y_train[0:1000], num_classes=10)
y_test = to_categorical(y_test[0:100], num_classes=10)

In [None]:
# Compile the model
model.compile(optimizer=keras.optimizers.Adam(), loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X_train_resized, y_train, epochs=4)

## Fine-tuning

Once your model has converged on the new data, you can try to unfreeze **all or part of
 the base model** and retrain the whole model end-to-end with a very low learning rate.

This is an optional last step that can potentially give you incremental improvements.
 It could also potentially lead to quick overfitting -- keep that in mind.

It is critical to only do this step *after* the model with frozen layers has been
trained to convergence. If you mix randomly-initialized trainable layers with
trainable layers that hold pre-trained features, the randomly-initialized layers will
cause very large gradient updates during training, which will destroy your pre-trained
 features.

It's also critical to use a very low learning rate at this stage, because
you are training a much larger model than in the first round of training, on a dataset
 that is typically very small.
As a result, you are at risk of overfitting very quickly if you apply large weight
 updates. Here, you only want to readapt the pretrained weights in an incremental way.

In [None]:
# Choose the number of layers to unfreeze
num_layers_to_unfreeze = 3

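# Unfreeze the base model, then re-freeze every layer except the top ones.
# Unfreezing only the child layers is not enough: while `base_model.trainable`
# is False, the base model may still report no trainable weights
# (depending on the Keras version).
base_model.trainable = True
for layer in base_model.layers[:-num_layers_to_unfreeze]:
    layer.trainable = False

# It's important to recompile your model after you make any changes
# to the `trainable` attribute of any inner layer, so that your changes
# are taken into account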
model.compile(optimizer=keras.optimizers.Adam(0.000001),  # Very low learning rate
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Train end-to-end. Be careful to stop before you overfit!
model.fit(X_train_resized, y_train, epochs=1)  # Only 1 epoch here, because training takes too much time
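One way to check what a partial unfreeze actually changes is to count the trainable weights before and after. The sketch below assumes a hypothetical `toy_base` made of three small `Dense` layers in place of a real pretrained network; it unfreezes the base and then re-freezes all but its last layer.

```python
from tensorflow import keras

# Hypothetical stand-in for a pretrained base model: three small Dense layers
toy_base = keras.Sequential(
    [
        keras.Input(shape=(4,)),
        keras.layers.Dense(8, activation="relu"),
        keras.layers.Dense(8, activation="relu"),
        keras.layers.Dense(8, activation="relu"),
    ]
)
toy_model = keras.Sequential(
    [keras.Input(shape=(4,)), toy_base, keras.layers.Dense(2)]
)

# Stage 1: base fully frozen, only the new head is trainable
toy_base.trainable = False
stage1_count = len(toy_model.trainable_weights)

# Stage 2: unfreeze the base, then re-freeze all but its last layer
toy_base.trainable = True
for layer in toy_base.layers[:-1]:
    layer.trainable = False
stage2_count = len(toy_model.trainable_weights)

print("trainable weights with frozen base:", stage1_count)
print("trainable weights after unfreezing the top layer:", stage2_count)
for layer in toy_base.layers:
    print(layer.name, "trainable =", layer.trainable)
```

Each `Dense` layer contributes a kernel and a bias, so the count should grow by two for every layer that is unfrozen.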

## An end-to-end example: fine-tuning an image classification model on a cats vs. dogs dataset

**Note: this example takes a long time to run, so try it when you have time to spare.**

To solidify these concepts, let's walk you through a concrete end-to-end transfer
learning & fine-tuning example. We will load the Xception model, pre-trained on
 ImageNet, and use it on the Kaggle "cats vs. dogs" classification dataset.

### Getting the data

First, let's fetch the cats vs. dogs dataset using TFDS. If you have your own dataset,
you'll probably want to use the utility
`tf.keras.utils.image_dataset_from_directory` to generate similar labeled
 dataset objects from a set of images on disk filed into class-specific folders.
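As a rough sketch of that utility, the snippet below writes a tiny throwaway dataset to a temporary directory and loads it back. The folder names (`cat`, `dog`), the random images, and the batch size are placeholders chosen for illustration, not part of this guide's dataset.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

# Build a tiny throwaway dataset on disk: one sub-folder per class.
# The folder names and the random images are placeholders.
root = tempfile.mkdtemp()
for class_name in ["cat", "dog"]:
    os.makedirs(os.path.join(root, class_name))
    for i in range(4):
        pixels = np.random.randint(0, 256, size=(20, 30, 3), dtype=np.uint8)
        tf.io.write_file(
            os.path.join(root, class_name, f"{i}.png"),
            tf.io.encode_png(pixels),
        )

# Labels are inferred from the sub-folder names, in alphanumeric order
dataset = tf.keras.utils.image_dataset_from_directory(
    root,
    label_mode="int",
    image_size=(120, 120),  # Images are resized to this size on load
    batch_size=4,
)

print("classes:", dataset.class_names)
for images, labels in dataset.take(1):
    print("batch shape:", images.shape)
    print("labels:", labels.numpy())
```

The result is a batched `tf.data.Dataset` of `(images, labels)` pairs, so it can be passed to `fit()` in the same way as the TFDS datasets used below.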

Transfer learning is most useful when working with very small datasets. To keep our
dataset small, we will use 40% of the original training data (25,000 images) for
 training, 10% for validation, and 10% for testing.

In [None]:
# Install tensorflow_datasets
#!pip install tensorflow_datasets

In [None]:
import tensorflow_datasets as tfds

tfds.disable_progress_bar()  # Keep the download output short
train_ds, validation_ds, test_ds = tfds.load(           
    "cats_vs_dogs",
    # Reserve 10% for validation and 10% for test
    split=["train[:40%]", "train[40%:50%]", "train[50%:60%]"],
    as_supervised=True)  # Include labels

print("Number of training samples: %d" % tf.data.experimental.cardinality(train_ds))
print("Number of validation samples: %d" % tf.data.experimental.cardinality(validation_ds))
print("Number of test samples: %d" % tf.data.experimental.cardinality(test_ds))

These are the first 9 images in the training dataset -- as you can see, they're all
 different sizes.

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 10))
for i, (image, label) in enumerate(train_ds.take(9)):
    ax = plt.subplot(3, 3, i + 1)
    plt.imshow(image)
    plt.title(int(label))
    plt.axis("off")

We can also see that label 1 is "dog" and label 0 is "cat".

### Standardizing the data

Our raw images have a variety of sizes. In addition, each pixel consists of 3 integer
values between 0 and 255 (RGB level values).

In general, it's a good practice to develop models that take raw data as input, as
opposed to models that take already-preprocessed data. The reason is that, if your
model expects preprocessed data, any time you export your model to use it elsewhere
(in a web browser, in a mobile app), you'll need to reimplement the exact same
preprocessing pipeline. This gets very tricky very quickly. So we should do the least
possible amount of preprocessing before hitting the model.

Here, we'll do image resizing in the data pipeline (because a deep neural network can
only process contiguous batches of data), and we'll do the input value scaling as part
 of the model, when we create it.

Let's resize images to 120x120:

In [None]:
size = (120, 120)
# Resize every image to the same fixed size
train_ds = train_ds.map(lambda x, y: (tf.image.resize(x, size), y))
validation_ds = validation_ds.map(lambda x, y: (tf.image.resize(x, size), y))
test_ds = test_ds.map(lambda x, y: (tf.image.resize(x, size), y))

In addition, let's batch the data and use caching & prefetching to optimize loading speed.

In [None]:
batch_size = 32
# Cache the resized images, batch them, and prefetch batches to improve loading speed
train_ds = train_ds.cache().batch(batch_size).prefetch(buffer_size=10)
validation_ds = validation_ds.cache().batch(batch_size).prefetch(buffer_size=10)
test_ds = test_ds.cache().batch(batch_size).prefetch(buffer_size=10)

### Using random data augmentation

When you don't have a large image dataset, it's a good practice to artificially
 introduce sample diversity by applying random yet realistic transformations to
the training images, such as random horizontal flipping or small random rotations. This
helps expose the model to different aspects of the training data while slowing down
 overfitting.

In [None]:
from tensorflow import keras

# Augment the data with random horizontal flips and small random rotations
data_augmentation = keras.Sequential(
    [keras.layers.RandomFlip("horizontal"), keras.layers.RandomRotation(0.1)])

Let's visualize what the first image of the first batch looks like after various random
 transformations:

In [None]:
import numpy as np

for images, labels in train_ds.take(1):
    plt.figure(figsize=(10, 10))
    first_image = images[0]
    for i in range(9):
        ax = plt.subplot(3, 3, i + 1)
        augmented_image = data_augmentation(
            tf.expand_dims(first_image, 0), training=True
        )
        plt.imshow(augmented_image[0].numpy().astype("int32"))
        plt.title(int(labels[0]))
        plt.axis("off")

## Build a model

Now let's build a model that follows the blueprint we've explained earlier.

Note that:

- We add a `Rescaling` layer to scale input values (initially in the `[0, 255]`
range) to the `[-1, 1]` range, because the pre-trained Xception weights expect inputs
in that range.
- We add a `Dropout` layer before the classification layer, for regularization.
- We make sure to pass `training=False` when calling the base model, so that
it runs in inference mode and its batchnorm statistics don't get updated,
even after we unfreeze the base model for fine-tuning.
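The effect of the `training` argument on batchnorm statistics can be seen on a single layer. The following is a small sketch using a standalone `BatchNormalization` layer and random inputs (both chosen only for illustration): calling the layer in inference mode leaves its moving mean untouched, while calling it in training mode updates it.

```python
import numpy as np
from tensorflow import keras

bn = keras.layers.BatchNormalization()
bn.build((None, 4))

# Random inputs whose mean is far from the initial moving mean of 0
x = np.random.normal(loc=5.0, scale=2.0, size=(32, 4)).astype("float32")

initial_mean = bn.moving_mean.numpy().copy()

# Inference mode: the layer normalizes with its stored statistics
bn(x, training=False)
mean_after_inference = bn.moving_mean.numpy().copy()

# Training mode: the layer also updates its moving statistics
bn(x, training=True)
mean_after_training = bn.moving_mean.numpy().copy()

print(
    "changed by inference call:",
    not np.array_equal(initial_mean, mean_after_inference),
)
print(
    "changed by training call:",
    not np.array_equal(mean_after_inference, mean_after_training),
)
```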

In [None]:
base_model = keras.applications.Xception(
    weights="imagenet",  # Load weights pre-trained on ImageNet.
    input_shape=(120, 120, 3),
    include_top=False,
)  # Do not include the ImageNet classifier at the top.

# Freeze the base_model
base_model.trainable = False

# Create a new model on top, this time using the functional API
# instead of `Sequential`.
inputs = keras.Input(shape=(120, 120, 3))
x = data_augmentation(inputs)  # Apply random data augmentation

# The pre-trained Xception weights require inputs scaled from the (0, 255)
# range to the (-1., +1.) range. The Rescaling layer outputs
# `(inputs * scale) + offset`; see the Keras documentation:
# https://keras.io/api/layers/preprocessing_layers/image_preprocessing/rescaling/
scale_layer = keras.layers.Rescaling(scale=1 / 127.5, offset=-1)
x = scale_layer(x)

# The base model contains batchnorm layers. We want to keep them in inference mode
# when we unfreeze the base model for fine-tuning, so we make sure that the
# base_model is running in inference mode here.
x = base_model(x, training=False)
x = keras.layers.GlobalAveragePooling2D()(x)
x = keras.layers.Dropout(0.2)(x)  # Regularize with dropout
outputs = keras.layers.Dense(1)(x)
model = keras.Model(inputs, outputs)

model.summary()
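As a quick sanity check of the rescaling arithmetic, the formula `(inputs * scale) + offset` can be reproduced in plain NumPy. The sample pixel values below are arbitrary and chosen only for illustration.

```python
import numpy as np

# Sample pixel values in the [0, 255] range (arbitrary, for illustration)
pixels = np.array([0.0, 63.75, 127.5, 255.0])

# Same constants as passed to the Rescaling layer
scale, offset = 1 / 127.5, -1
rescaled = pixels * scale + offset

print("input pixels:", pixels)
print("rescaled:", rescaled)
```

The smallest input lands at the bottom of the `[-1, 1]` range, the largest at the top, and the midpoint 127.5 lands at approximately 0.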

## Train the top layer

In [None]:
model.compile(
    optimizer=keras.optimizers.Adam(),
    loss=keras.losses.BinaryCrossentropy(from_logits=True),
    metrics=[keras.metrics.BinaryAccuracy()],
)

epochs = 10
model.fit(train_ds, epochs=epochs, validation_data=validation_ds)

## Do a round of fine-tuning of the entire model

Finally, let's unfreeze the base model and train the entire model end-to-end with a low
 learning rate.

In [None]:
# Unfreeze the base_model. Note that it keeps running in inference mode
# since we passed `training=False` when calling it. 
base_model.trainable = True
model.summary()

model.compile(
    optimizer=keras.optimizers.Adam(1e-5),  # Low learning rate
    loss=keras.losses.BinaryCrossentropy(from_logits=True),
    metrics=[keras.metrics.BinaryAccuracy()])

epochs = 10
model.fit(train_ds, epochs=epochs, validation_data=validation_ds)

After 10 epochs, fine-tuning gains us a nice improvement here.