## Transfer Learning using MNIST data
To illustrate the power and concept of transfer learning, we will train a CNN on just the digits 5,6,7,8,9.  Then we will train just the last layer(s) of the network on the digits 0,1,2,3,4 and see how well the features learned on 5-9 help with classifying 0-4.



In [1]:


import datetime
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras import backend as K

In [2]:
#used to help some of the timing functions
now = datetime.datetime.now

In [3]:
start_time = datetime.datetime.now()

# Your operation here, e.g., training a model
# model.fit(...)

end_time = datetime.datetime.now()
duration = end_time - start_time
print(f"Operation took: {duration}")


Operation took: 0:00:00.000077


In [4]:
# set some parameters
batch_size = 128
num_classes = 5
epochs = 5

In [5]:
# set some more parameters
img_rows, img_cols = 28, 28
filters = 32
pool_size = 2
kernel_size = 3

In [None]:
## To simplify things, write a function to include all the training steps
## As input, function takes a model, training set, test set, and the number of classes
## Inside the model object will be the state about which layers we are freezing and which we are training
#Reshape the data
#Normalize the data
# One hot encode the targert label
# Compile the model
# Train the model on the training data
# Evaluate the model on the testing data

def train_model(model, train, test, num_classes):


In [10]:
import numpy as np
from tensorflow.keras.utils import to_categorical
from tensorflow.keras import backend as K
from tensorflow.keras.optimizers import Adam

def train_model(model, train, test, num_classes):
    (x_train, y_train), (x_test, y_test) = train, test
    
    # Reshape the data based on channels first or last
    if K.image_data_format() == 'channels_first':
        x_train = x_train.reshape(x_train.shape[0], 1, 28, 28)
        x_test = x_test.reshape(x_test.shape[0], 1, 28, 28)
        input_shape = (1, 28, 28)
    else:
        x_train = x_train.reshape(x_train.shape[0], 28, 28, 1)
        x_test = x_test.reshape(x_test.shape[0], 28, 28, 1)
        input_shape = (28, 28, 1)
    
    # Normalize the data
    x_train = x_train.astype('float32') / 255
    x_test = x_test.astype('float32') / 255
    
    # One hot encode the target labels
    y_train = to_categorical(y_train, num_classes)
    y_test = to_categorical(y_test, num_classes)
    
    # Compile the model
    model.compile(loss='categorical_crossentropy',
                  optimizer=Adam(),
                  metrics=['accuracy'])
    
    # Train the model on the training data
    model.fit(x_train, y_train,
              batch_size=128,
              epochs=10,
              verbose=1,
              validation_data=(x_test, y_test))
    
    # Evaluate the model on the testing data
    score = model.evaluate(x_test, y_test, verbose=0)
    print('Test loss:', score[0])
    print('Test accuracy:', score[1])



In [9]:
# Load the Mnist data and split between train and test sets

# create two datasets: one with digits below 5 and one with 5 and above


In [12]:
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical

# Load the MNIST data
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Split the dataset into two parts: one for digits < 5 and one for digits >= 5
x_train_lt_5 = x_train[y_train < 5]
y_train_lt_5 = y_train[y_train < 5]
x_test_lt_5 = x_test[y_test < 5]
y_test_lt_5 = y_test[y_test < 5]

x_train_gte_5 = x_train[y_train >= 5]
y_train_gte_5 = y_train[y_train >= 5]
x_test_gte_5 = x_test[y_test >= 5]
y_test_gte_5 = y_test[y_test >= 5]

# Normalize the pixel values to be between 0 and 1
x_train_lt_5 = x_train_lt_5.astype('float32') / 255
x_test_lt_5 = x_test_lt_5.astype('float32') / 255
x_train_gte_5 = x_train_gte_5.astype('float32') / 255
x_test_gte_5 = x_test_gte_5.astype('float32') / 255

# Reshape the data to fit the model input requirements
# Assuming TensorFlow as the backend and channels_last data format
x_train_lt_5 = x_train_lt_5.reshape((-1, 28, 28, 1))
x_test_lt_5 = x_test_lt_5.reshape((-1, 28, 28, 1))
x_train_gte_5 = x_train_gte_5.reshape((-1, 28, 28, 1))
x_test_gte_5 = x_test_gte_5.reshape((-1, 28, 28, 1))

# One-hot encode the labels
y_train_lt_5 = to_categorical(y_train_lt_5, 5)
y_test_lt_5 = to_categorical(y_test_lt_5, 5)
y_train_gte_5 = to_categorical(y_train_gte_5 - 5, 5) # Subtract 5 to make the labels start at 0 for the >=5 subset
y_test_gte_5 = to_categorical(y_test_gte_5 - 5, 5)

# Now you have two datasets:
# - Digits < 5: x_train_lt_5, y_train_lt_5, x_test_lt_5, y_test_lt_5
# - Digits >= 5: x_train_gte_5, y_train_gte_5, x_test_gte_5, y_test_gte_5


In [None]:
# Define the "feature" layers. Add 2 convolution layer with max pool layer. At the end, add dropout layer with 0.25% probability and end with the flatten layer. These are the early layers that we expect will "transfer"
# to a new problem.  We will freeze these layers during the fine-tuning process

feature_layers = [
   
]

In [14]:
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten

feature_layers = [
    Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=(28, 28, 1)),
    MaxPooling2D(pool_size=(2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),
    Dropout(0.25),
    Flatten()
]


In [None]:
# Define the "classification" layers. Add Dense layer with 128 nodes and the output dense layer. These are the later layers that predict the specific classes from the features
# learned by the feature layers.  This is the part of the model that needs to be re-trained for a new problem

classification_layers = [

]

In [15]:
from tensorflow.keras.layers import Dense, Dropout

# Assuming num_classes is defined based on your specific problem
# For MNIST digits below 5 and digits 5 and above, num_classes would be 5
num_classes = 5

classification_layers = [
    Dense(128, activation='relu'),
    Dropout(0.5),
    Dense(num_classes, activation='softmax')
]


In [None]:
# Create the model by combining the two sets of layers:


In [16]:
from tensorflow.keras.models import Sequential

# Assuming feature_layers and classification_layers are defined as per previous instructions

# Combine the feature and classification layers to create the model
model = Sequential(feature_layers + classification_layers)

# Now, the model is defined with the combined layers


In [17]:
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])


In [18]:
model.summary()


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d_2 (Conv2D)           (None, 26, 26, 32)        320       
                                                                 
 max_pooling2d_2 (MaxPoolin  (None, 13, 13, 32)        0         
 g2D)                                                            
                                                                 
 conv2d_3 (Conv2D)           (None, 11, 11, 64)        18496     
                                                                 
 max_pooling2d_3 (MaxPoolin  (None, 5, 5, 64)          0         
 g2D)                                                            
                                                                 
 dropout_1 (Dropout)         (None, 5, 5, 64)          0         
                                                                 
 flatten_1 (Flatten)         (None, 1600)              0

In [19]:
for layer in feature_layers:
    layer.trainable = False


In [4]:
# Now, let's train our model on the digits 5,6,7,8,9



In [20]:
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])


In [21]:
history = model.fit(x_train_gte_5, y_train_gte_5,
                    batch_size=128,
                    epochs=10,
                    verbose=1,
                    validation_data=(x_test_gte_5, y_test_gte_5))


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [22]:
test_loss, test_accuracy = model.evaluate(x_test_gte_5, y_test_gte_5, verbose=0)
print(f'Test loss: {test_loss}')
print(f'Test accuracy: {test_accuracy}')


Test loss: 0.0475693978369236
Test accuracy: 0.9853939414024353


### Freezing Layers
Keras allows layers to be "frozen" during the training process.  That is, some layers would have their weights updated during the training process, while others would not.  This is a core part of transfer learning, the ability to train just the last one or several layers.

Note also, that a lot of the training time is spent "back-propagating" the gradients back to the first layer.  Therefore, if we only need to compute the gradients back a small number of layers, the training time is much quicker per iteration.  This is in addition to the savings gained by being able to train on a smaller data set.

In [3]:
# Freeze only the feature layers


In [23]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

# Define feature layers
feature_layers = [
    Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=(28, 28, 1)),
    MaxPooling2D(pool_size=(2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),
    Dropout(0.25),
    Flatten(),
]

# Define classification layers
classification_layers = [
    Dense(128, activation='relu'),
    Dropout(0.5),
    Dense(5, activation='softmax'),  # Assuming 5 classes for the example
]

# Combine them into a Sequential model
model = Sequential(feature_layers + classification_layers)


In [24]:
for layer in feature_layers:
    layer.trainable = False


In [25]:
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Assuming you have your dataset ready
# model.fit(x_train, y_train, batch_size=128, epochs=10, validation_split=0.2)


Observe below the differences between the number of *total params*, *trainable params*, and *non-trainable params*.

In [2]:
# print model summary


In [26]:
model.summary()


Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d_4 (Conv2D)           (None, 26, 26, 32)        320       
                                                                 
 max_pooling2d_4 (MaxPoolin  (None, 13, 13, 32)        0         
 g2D)                                                            
                                                                 
 conv2d_5 (Conv2D)           (None, 11, 11, 64)        18496     
                                                                 
 max_pooling2d_5 (MaxPoolin  (None, 5, 5, 64)          0         
 g2D)                                                            
                                                                 
 dropout_3 (Dropout)         (None, 5, 5, 64)          0         
                                                                 
 flatten_2 (Flatten)         (None, 1600)             

In [1]:
# Now, let's train our model on the digits 0,1,2,3,4


In [27]:
for layer in model.layers[:-3]:  # Assuming the last 3 layers are your classification layers
    layer.trainable = False


In [28]:
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])


In [29]:
history = model.fit(x_train_lt_5, y_train_lt_5,
                    batch_size=128,
                    epochs=10,
                    verbose=1,
                    validation_data=(x_test_lt_5, y_test_lt_5))


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [30]:
test_loss, test_accuracy = model.evaluate(x_test_lt_5, y_test_lt_5, verbose=0)
print(f'Test loss: {test_loss}')
print(f'Test accuracy: {test_accuracy}')


Test loss: 0.014963459223508835
Test accuracy: 0.9943568706512451


Note that after a single epoch, we are already achieving results on classifying 0-4 that are comparable to those achieved on 5-9 after 5 full epochs.  This despite the fact the we are only "fine-tuning" the last layer of the network, and all the early layers have never seen what the digits 0-4 look like.

Also, note that even though nearly all (590K/600K) of the *parameters* were trainable, the training time per epoch was still much reduced.  This is because the unfrozen part of the network was very shallow, making backpropagation faster. 