<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo">
    </a>
</p>


# Machine Learning Foundation

## Course 5, Part g: Transfer Learning DEMO


For this exercise, we will use the well-known MNIST digit data. To illustrate the power and concept of transfer learning, we will train a CNN on just the digits 5,6,7,8,9.  Then we will train just the last layer(s) of the network on the digits 0,1,2,3,4 and see how well the features learned on 5-9 help with classifying 0-4.




In [1]:
import datetime
import keras
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.layers import Dense, Input, Dropout, Activation, Flatten, MaxPooling2D, Conv2D
from tensorflow.keras import backend as B

In [2]:
# Create function
now = datetime.datetime.now
print(now)
print(now()) # call function

<built-in method now of type object at 0xb07000>
2025-11-15 11:00:45.015263


In [3]:
batch_size = 128
n_classes = 5
epochs = 5

In [4]:
img_rows, img_cols = 28, 28
filters = 32 # learn 32 different features/ patterns
pool_size = 2
kernel_size = 3

In [5]:
# Handle variability for loaded input data
if B.image_data_format() == "channels_first":
    input_shape = (1, img_rows, img_cols)
else:
    input_shape = (img_rows, img_cols, 1)

In [6]:
## To simplify things, write a function to include all the training steps
## As input, function takes a model, training set, test set, and the number of classes
## Inside the model object will be the state about which layers we are freezing and which we are training

def train_model(model, train, test, n_classes):
    # Add the channel axis: (n_samples, 28, 28) -> (n_samples, 28, 28, 1) for channels_last
    x_train = train[0].reshape((train[0].shape[0],) + input_shape)
    x_test  = test[0].reshape((test[0].shape[0],) + input_shape)
    x_train = (x_train / 255).astype('float32')
    x_test  = (x_test  / 255).astype('float32')

    print("X_train shape:", x_train.shape)
    print(x_train.shape[0], 'train samples')
    print(x_test.shape[0], 'test samples')

    # Convert class vectors to binary class matrices
    y_train = to_categorical(train[1], n_classes)
    y_test  = to_categorical(test[1], n_classes)

    model.compile(loss='categorical_crossentropy',
                optimizer='adadelta',
                metrics=['accuracy'])

    t = now() # current time
    model.fit(x_train, y_train,
            batch_size = batch_size,
            epochs = epochs,
            verbose = 1,
            validation_data = (x_test, y_test))

    print("Training time: %s" % (now() - t))
    score = model.evaluate(x_test, y_test, verbose=0)
    print("Test loss:", score[0])
    print("Test accuracy:", score[1])
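
The "binary class matrices" step inside `train_model` can be pictured with a small dependency-free sketch. The helper `one_hot` below is a hypothetical stand-in written for illustration, not the Keras `to_categorical` implementation. It shows why, with `n_classes = 5`, every label has to lie in the range 0 to 4, so a label set such as 5-9 would need re-indexing first.

```python
def one_hot(labels, n_classes):
    """Return one-hot rows; each label must be in range(n_classes).

    Hypothetical helper for illustration only, not the Keras implementation.
    """
    rows = []
    for label in labels:
        if not 0 <= label < n_classes:
            raise ValueError(f"label {label} is outside 0..{n_classes - 1}")
        row = [0.0] * n_classes
        row[label] = 1.0
        rows.append(row)
    return rows


raw_labels = [5, 7, 9]                          # toy labels from a 5-9 subset
shifted = [label - 5 for label in raw_labels]   # re-index to 0..4
encoded = one_hot(shifted, n_classes=5)
print(encoded)
```

Each row has a single 1.0 in the column of its (shifted) class, which is the target format that `categorical_crossentropy` compares against the softmax output.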

In [7]:
(X_train, y_train), (X_test, y_test) = mnist.load_data()

# Create two datasets:
# One with digits below 5
X_train_lt5 = X_train[y_train < 5]
y_train_lt5 = y_train[y_train < 5]
X_test_lt5  = X_test[y_test < 5]
y_test_lt5  = y_test[y_test < 5]

# One with digits greater than or equal to 5
X_train_gte5 = X_train[y_train >= 5]
y_train_gte5 = y_train[y_train >= 5] - 5 # for labels: [0, 1, 2, 3, 4] --> not [5, 6, 7, 8, 9]
X_test_gte5  = X_test[y_test >= 5]
y_test_gte5  = y_test[y_test >= 5] - 5   # for labels: [0, 1, 2, 3, 4] --> not [5, 6, 7, 8, 9]

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11490434/11490434 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step


Define the **feature** layers.  These are the early layers that we expect will **transfer** to a new problem.  We will freeze these layers during the fine-tuning process.

In [None]:
# Define the "feature" layers.  These are the early layers that we expect will "transfer"
# to a new problem.  We will freeze these layers during the fine-tuning process
feature_layers = [Input(shape=input_shape),
                  Conv2D(filters, kernel_size, padding='valid', activation='relu'), # padding='valid' (default)
                  Conv2D(filters, kernel_size, activation='relu'), # padding='valid' (default)
                  MaxPooling2D(pool_size=pool_size),
                  Dropout(0.25),
                  Flatten()]

feature_layers

[<KerasTensor shape=(None, 28, 28, 1), dtype=float32, sparse=False, ragged=False, name=keras_tensor>,
 <Conv2D name=conv2d, built=False>,
 <Conv2D name=conv2d_1, built=False>,
 <MaxPooling2D name=max_pooling2d, built=True>,
 <Dropout name=dropout, built=True>,
 <Flatten name=flatten, built=False>]
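
To see what the feature layers hand over to the classifier, we can trace the spatial size of an image through them by hand. This is a sketch under two assumptions: the convolutions use stride 1 with `'valid'` padding, and the pooling stride equals the pool size (the Keras default). The helper names `conv_valid` and `pool` are made up for this illustration.

```python
def conv_valid(size, kernel):
    """Output length of a stride-1 convolution with 'valid' (no) padding."""
    return size - kernel + 1


def pool(size, pool_size):
    """Output length of non-overlapping pooling (stride equals pool size)."""
    return size // pool_size


img_rows, filters, kernel_size, pool_size = 28, 32, 3, 2

side = conv_valid(img_rows, kernel_size)   # first Conv2D:  28 -> 26
side = conv_valid(side, kernel_size)       # second Conv2D: 26 -> 24
side = pool(side, pool_size)               # MaxPooling2D:  24 -> 12
flat_features = side * side * filters      # Flatten: 12 * 12 * 32
print(side, flat_features)                 # prints: 12 4608
```

So each image reaches the classification layers as a vector of 4608 features.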

Define the **classification** layers.  These are the later layers that predict the specific classes from the features learned by the feature layers.  This is the part of the model that needs to be re-trained for a new problem.

In [None]:
classification_layers = [Dense(128, activation='relu'),
                         Dropout(0.5),
                         Dense(n_classes, activation='softmax')]

In [None]:
# We create our model by combining the two sets of layers as follows
model = Sequential(feature_layers + classification_layers)

In [None]:
# Let's take a look
model.summary()

In [None]:
# Now, let's train our model on the digits 5,6,7,8,9

train_model(model,
            (X_train_gte5, y_train_gte5),
            (X_test_gte5, y_test_gte5), n_classes)

X_train shape: (29404, 28, 28, 1)
29404 train samples
4861 test samples
Epoch 1/5
230/230 ━━━━━━━━━━━━━━━━━━━━ 48s 202ms/step - accuracy: 0.1814 - loss: 1.6413 - val_accuracy: 0.2705 - val_loss: 1.6023
Epoch 2/5
230/230 ━━━━━━━━━━━━━━━━━━━━ 81s 200ms/step - accuracy: 0.2448 - loss: 1.6056 - val_accuracy: 0.3812 - val_loss: 1.5649
Epoch 3/5
230/230 ━━━━━━━━━━━━━━━━━━━━ 47s 203ms/step - accuracy: 0.3024 - loss: 1.5719 - val_accuracy: 0.5447 - val_loss: 1.5273
Epoch 4/5
230/230 ━━━━━━━━━━━━━━━━━━━━ 81s 200ms/step - accuracy: 0.3703 - loss: 1.5381 - val_accuracy: 0.6355 - val_loss: 1.4877
Epoch 5/5
230/230 ━━━━━━━━━━━━━━━━━━━━ 82s 202ms/step - accuracy: 0.4441 - loss: 1.4974 - val_accuracy: 0.7052 - val_loss: 1.4441
Training time: 0:06:15.619527
Test loss: 1.4441215991973877
Test accuracy: 0.7052046656608582


### Freezing Layers
Keras allows layers to be "frozen" during training: frozen layers keep their weights fixed, while the remaining layers are updated as usual.  This ability to train just the last one or several layers is a core part of transfer learning.

Note also that much of the training time is spent back-propagating the gradients to the first layer.  If we only need to compute gradients for the last few layers, each training iteration is much quicker.  This is in addition to the savings gained by being able to train on a smaller data set.
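
The idea can be illustrated without Keras on a toy two-weight model, `w_head * relu(w_feat * x)`. This is only a sketch: the `train_step` function, the `trainable` dictionary, and all numbers are assumptions made up for the illustration, and this is not how Keras implements freezing internally. It shows the two effects described above: a frozen weight is never updated, and its gradient does not even need to be computed.

```python
def train_step(params, trainable, x, y, lr=0.1):
    """One SGD step on the loss 0.5 * (w_head * relu(w_feat * x) - y) ** 2."""
    h = max(0.0, params["w_feat"] * x)      # "feature" layer with ReLU
    error = params["w_head"] * h - y        # prediction error of the "head"

    # The gradient for the head is always cheap: it stops at the last layer.
    grads = {"w_head": error * h}

    if trainable["w_feat"]:
        # Only a trainable feature layer requires propagating the
        # gradient one layer further back.
        relu_slope = 1.0 if h > 0 else 0.0
        grads["w_feat"] = error * params["w_head"] * relu_slope * x

    for name, grad in grads.items():
        if trainable[name]:
            params[name] -= lr * grad
    return 0.5 * error ** 2


params = {"w_feat": 0.8, "w_head": 0.1}
trainable = {"w_feat": False, "w_head": True}   # freeze the feature weight

losses = [train_step(params, trainable, x=1.5, y=2.0) for _ in range(20)]
print(params["w_feat"])          # prints: 0.8 (the frozen weight is untouched)
print(losses[-1] < losses[0])    # prints: True (the head alone still learns)
```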


In [None]:
feature_layers

[<Conv2D name=conv2d_2, built=True>,
 <Activation name=activation, built=True>,
 <Conv2D name=conv2d_3, built=True>,
 <Activation name=activation_1, built=True>,
 <MaxPooling2D name=max_pooling2d_1, built=True>,
 <Dropout name=dropout_1, built=True>,
 <Flatten name=flatten_1, built=True>]

In [None]:
# Freeze only the feature layers
for l in feature_layers:
    print("Before:", l)
    l.trainable = False
    print("After:", l)

Before: <Conv2D name=conv2d_2, built=True>
After: <Conv2D name=conv2d_2, built=True>
Before: <Activation name=activation, built=True>
After: <Activation name=activation, built=True>
Before: <Conv2D name=conv2d_3, built=True>
After: <Conv2D name=conv2d_3, built=True>
Before: <Activation name=activation_1, built=True>
After: <Activation name=activation_1, built=True>
Before: <MaxPooling2D name=max_pooling2d_1, built=True>
After: <MaxPooling2D name=max_pooling2d_1, built=True>
Before: <Dropout name=dropout_1, built=True>
After: <Dropout name=dropout_1, built=True>
Before: <Flatten name=flatten_1, built=True>
After: <Flatten name=flatten_1, built=True>


Observe below the differences between the number of *total params*, *trainable params*, and *non-trainable params*.


In [None]:
model.summary()

In [None]:
train_model(model, (X_train_lt5, y_train_lt5), (X_test_lt5, y_test_lt5), n_classes)

X_train shape: (30596, 28, 28, 1)
30596 train samples
5139 test samples
Epoch 1/5
240/240 ━━━━━━━━━━━━━━━━━━━━ 19s 73ms/step - accuracy: 0.2890 - loss: 1.5689 - val_accuracy: 0.4374 - val_loss: 1.5067
Epoch 2/5
240/240 ━━━━━━━━━━━━━━━━━━━━ 20s 70ms/step - accuracy: 0.3769 - loss: 1.5119 - val_accuracy: 0.6075 - val_loss: 1.4494
Epoch 3/5
240/240 ━━━━━━━━━━━━━━━━━━━━ 19s 78ms/step - accuracy: 0.4635 - loss: 1.4590 - val_accuracy: 0.7038 - val_loss: 1.3941
Epoch 4/5
240/240 ━━━━━━━━━━━━━━━━━━━━ 17s 69ms/step - accuracy: 0.5337 - loss: 1.4070 - val_accuracy: 0.7630 - val_loss: 1.3403
Epoch 5/5
240/240 ━━━━━━━━━━━━━━━━━━━━ 17s 72ms/step - accuracy: 0.5858 - loss: 1.3611 - val_accuracy: 0.8079 - val_loss: 1.2865
Training time: 0:01:31.252219
Test loss: 1.286516785621643
Test accuracy: 0.8079392910003662


Note that after three epochs, the validation accuracy on 0-4 (0.7038) already matches what the network reached on 5-9 after 5 full epochs (0.7052), and after five epochs it climbs to 0.8079.  This is despite the fact that we are only "fine-tuning" the last layers of the network, and all the early layers have never seen what the digits 0-4 look like.

Also, note that even though nearly all (590K/600K) of the *parameters* were still trainable, the training time dropped sharply: roughly 70 ms per step instead of roughly 200 ms.  This is because the unfrozen part of the network is very shallow, so backpropagation has much less work to do.
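
The 590K/600K figure can be checked by hand. The sketch below assumes the architecture defined earlier (two 3x3 convolutions with 32 filters, 2x2 max pooling, then `Dense(128)` and `Dense(5)`), and the helper names `conv_params` and `dense_params` are made up for this illustration. The totals are hand-computed and worth comparing against what `model.summary()` reports.

```python
filters, kernel_size, n_classes = 32, 3, 5

# 28 -> 26 -> 24 after two 3x3 'valid' convolutions, then 24 -> 12 after 2x2 pooling
flat_features = 12 * 12 * filters


def conv_params(in_channels, out_channels, kernel):
    """Weights plus one bias per output filter of a 2D convolution."""
    return kernel * kernel * in_channels * out_channels + out_channels


def dense_params(n_in, n_out):
    """Weights plus one bias per unit of a fully connected layer."""
    return n_in * n_out + n_out


# Feature layers (frozen during fine-tuning); pooling, dropout and flatten have no weights
frozen = conv_params(1, filters, kernel_size) + conv_params(filters, filters, kernel_size)

# Classification layers (still trainable)
trainable = dense_params(flat_features, 128) + dense_params(128, n_classes)

total = frozen + trainable
print(frozen, trainable, total)  # prints: 9568 590597 600165
```

The frozen convolutional layers hold under 2% of the parameters, yet skipping their gradients accounts for most of the time saved.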


## Exercise
- Now we will write code to reverse this training process.  That is, train on the digits 0-4, then fine-tune only the last layers on the digits 5-9.


In [8]:
# Create layers and define the model as above
feature_layers2 = [Input(shape=(input_shape)),
                   Conv2D(filters, kernel_size, activation='relu'),
                   Conv2D(filters, kernel_size, activation='relu'),
                   MaxPooling2D(pool_size=pool_size),
                   Dropout(0.25),
                   Flatten()]

classification_layers2 = [Dense(128, activation='relu'),
                          Dropout(0.5),
                          Dense(n_classes, activation='softmax')]

model2 = Sequential(feature_layers2 + classification_layers2)
model2.summary()

In [9]:
# Now, let's train our model on the digits 0,1,2,3,4
train_model(model2, (X_train_lt5, y_train_lt5), (X_test_lt5, y_test_lt5), n_classes)

X_train shape: (30596, 28, 28, 1)
30596 train samples
5139 test samples
Epoch 1/5
240/240 ━━━━━━━━━━━━━━━━━━━━ 58s 232ms/step - accuracy: 0.2424 - loss: 1.5910 - val_accuracy: 0.4244 - val_loss: 1.5368
Epoch 2/5
240/240 ━━━━━━━━━━━━━━━━━━━━ 49s 205ms/step - accuracy: 0.3309 - loss: 1.5391 - val_accuracy: 0.5511 - val_loss: 1.4784
Epoch 3/5
240/240 ━━━━━━━━━━━━━━━━━━━━ 49s 204ms/step - accuracy: 0.4193 - loss: 1.4842 - val_accuracy: 0.6811 - val_loss: 1.4143
Epoch 4/5
240/240 ━━━━━━━━━━━━━━━━━━━━ 82s 205ms/step - accuracy: 0.4886 - loss: 1.4295 - val_accuracy: 0.7725 - val_loss: 1.3437
Epoch 5/5
240/240 ━━━━━━━━━━━━━━━━━━━━ 50s 207ms/step - accuracy: 0.5708 - loss: 1.3595 - val_accuracy: 0.8299 - val_loss: 1.2660
Training time: 0:04:48.561783
Test loss: 1.266013264656067
Test accuracy: 0.8299279808998108


In [10]:
# Freeze layers
for l in feature_layers2:
    l.trainable = False

In [12]:
model2.summary()

In [13]:
train_model(model2, (X_train_gte5, y_train_gte5), (X_test_gte5, y_test_gte5), n_classes)

X_train shape: (29404, 28, 28, 1)
29404 train samples
4861 test samples
Epoch 1/5
230/230 ━━━━━━━━━━━━━━━━━━━━ 17s 71ms/step - accuracy: 0.2829 - loss: 1.6148 - val_accuracy: 0.3999 - val_loss: 1.5537
Epoch 2/5
230/230 ━━━━━━━━━━━━━━━━━━━━ 21s 75ms/step - accuracy: 0.3107 - loss: 1.5663 - val_accuracy: 0.4367 - val_loss: 1.5036
Epoch 3/5
230/230 ━━━━━━━━━━━━━━━━━━━━ 20s 74ms/step - accuracy: 0.3607 - loss: 1.5206 - val_accuracy: 0.5159 - val_loss: 1.4560
Epoch 4/5
230/230 ━━━━━━━━━━━━━━━━━━━━ 17s 75ms/step - accuracy: 0.4182 - loss: 1.4792 - val_accuracy: 0.6252 - val_loss: 1.4110
Epoch 5/5
230/230 ━━━━━━━━━━━━━━━━━━━━ 17s 75ms/step - accuracy: 0.4761 - loss: 1.4366 - val_accuracy: 0.6916 - val_loss: 1.3677
Training time: 0:01:37.072854
Test loss: 1.367672324180603
Test accuracy: 0.6916272640228271


---
### Machine Learning Foundation (C) 2020 IBM Corporation
