# Transfer Learning

This notebook implements all the steps needed to achieve above 70% accuracy on a given dataset by first training a network on the MNIST dataset and then transferring the learned weights to a new network. It also compares transfer learning against training a network from scratch when only a few training examples are available, as is the case with our data.

## Importing the needed libraries

In [10]:
import os
import numpy as np
from tensorflow import keras
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, BatchNormalization, Conv2D, MaxPooling2D
from tensorflow.keras.layers import Activation, Flatten, Dropout, Dense

## Preprocessing the data from the mnist dataset

In order to train a convolutional network on the MNIST dataset we must first put the data into the format the network expects: images of shape 28x28x1 with pixel values scaled to [0, 1], and one-hot encoded labels.

In [11]:
# Initial learning rate, number of epochs and batch size to use in the old model
INIT_LR = 0.01
NUM_EPOCHS = 5
BS = 512

# Loading the mnist data
((train_x, train_y), (test_x, test_y)) = keras.datasets.mnist.load_data()

# In this case there's only one input channel that is the black and white channel
train_x = train_x.reshape((train_x.shape[0], 28, 28, 1))
test_x = test_x.reshape((test_x.shape[0], 28, 28, 1))

train_x = train_x.astype("float32") / 255.0
test_x = test_x.astype("float32") / 255.0

# One-hot encoding the training and testing labels (10 digit classes)
train_y = keras.utils.to_categorical(train_y, 10)
test_y = keras.utils.to_categorical(test_y, 10)
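
As a small illustration of the preprocessing above, the following sketch applies the same three steps (add a channel axis, scale to [0, 1], one-hot encode) to a tiny made-up batch using only NumPy. The array `fake_images` and the helper `one_hot` are hypothetical stand-ins for the MNIST arrays and for `keras.utils.to_categorical`.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) float32 matrix with a single 1 per row."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.shape[0], num_classes), dtype="float32")
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded

# Hypothetical batch of two 28x28 grayscale images with integer pixel values 0..255
fake_images = np.random.randint(0, 256, size=(2, 28, 28), dtype=np.uint8)
fake_labels = [3, 7]

# Add the channel axis and scale the pixels to the [0, 1] range
x = fake_images.reshape((fake_images.shape[0], 28, 28, 1)).astype("float32") / 255.0
y = one_hot(fake_labels, 10)

print(x.shape, x.min() >= 0.0, x.max() <= 1.0)
print(y.shape, y.argmax(axis=1))
```

Taking `argmax` along the class axis recovers the original integer labels, which is a quick way to check that the encoding is correct.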

## Defining the structure of the convolutional network for the mnist dataset

In order to reuse the weights learned while training on the MNIST dataset we must first create the structure of the network. The complete network is only used for the MNIST dataset; for the provided dataset we only need the learned weights of its feature-extraction layers, not the complete network.

In [12]:
# Defining the model using Keras functional API

# The functional API suits transfer learning as it supports reusing the output of
# any layer as well as merging networks (for example in a Concatenate layer object)
# The sequential API is a more rigid API for defining neural networks

# We create an input object that works like tf.placeholder
# We add a name to the object so we can find it more easily
inputs = Input(shape=(28, 28, 1), name="inputs")

# The next part of the model definition is the same as in Lab3 exercise
# Note that these objects implement the __call__ method and can be interpreted as functions
# The input they receive is the input the layer receives

# The first layer is a convolutional layer with 32 filters
# The shape of the kernel in this layer is 3x3
# We use padding "same", which pads the borders of the image with zeros so that the
# kernel can start right at the edge and the output keeps the input's height and width
layer = Conv2D(32, (3, 3), padding="same", input_shape=(28, 28, 1))(inputs)
layer = Activation("relu")(layer)
layer = BatchNormalization(axis=-1)(layer)

# Second convolutional layer, again with 32 filters of size 3x3
layer = Conv2D(32, (3, 3), padding="same")(layer)

# For this layer we add a ReLU activation
# We need to add ReLU because a convolution is still a linear transformation
# so we add ReLU for it to be a non linear transformation
layer = Activation("relu")(layer)

# We add batch normalization here
# This normalizes the output from the previous layer in order
# for the input of the next layer to be normalized
# Our data has the channels at the end, so we normalize over the last axis (axis=-1),
# which is also the Keras default
layer = BatchNormalization(axis=-1)(layer)

# In this layer we pool the output of the previous layer in order to reduce the number of features
# Since we are using a 2x2 pooling size we keep only half of the features in each spatial dimension
# So instead of a 28x28 feature map per filter we now have a 14x14 one
# Since we are omitting the stride Keras assumes the same stride as pool size which is what we want
layer = MaxPooling2D(pool_size=(2, 2))(layer)

# We add a dropout layer of 25% dropout for regularization
layer = Dropout(0.25)(layer)

# We add another convolution layer, in this case we don't need to specify the input shape
# because keras finds out the right input shape
layer = Conv2D(64, (3, 3), padding="same")(layer)
layer = Activation("relu")(layer)
layer = BatchNormalization(axis=-1)(layer)

layer = Conv2D(64, (3, 3), padding="same")(layer)
layer = Activation("relu")(layer)
layer = BatchNormalization(axis=-1)(layer)

# After this pooling we have a 7x7 feature map per filter (a 7x7x64 tensor overall)
layer = MaxPooling2D(pool_size=(2, 2))(layer)
layer = Dropout(0.25)(layer)

# We add a Flatten layer in order to transform the input tensor into a vector
# Here the 7x7x64 tensor (7x7 times the number of filters) becomes a vector of 3136 values
features = Flatten(name="features")(layer)

# Fully connected part of the network
layer = Dense(512)(features)
layer = Activation("relu")(layer)
layer = BatchNormalization()(layer)
layer = Dropout(0.5)(layer)
layer = Dense(10)(layer)
layer = Activation("softmax")(layer)
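
The shape bookkeeping in the comments above can be checked with a few lines of plain Python. This is only a sketch of the arithmetic, not of Keras itself; the helper names `same_conv` and `max_pool` are made up for illustration.

```python
def same_conv(shape, filters):
    """A stride-1 convolution with "same" padding keeps height and width."""
    height, width, _ = shape
    return (height, width, filters)

def max_pool(shape, pool=2):
    """Pooling with stride equal to the pool size divides height and width by it."""
    height, width, channels = shape
    return (height // pool, width // pool, channels)

shape = (28, 28, 1)
shape = same_conv(shape, 32)
shape = same_conv(shape, 32)
shape = max_pool(shape)          # first pooling: 28x28 -> 14x14
shape = same_conv(shape, 64)
shape = same_conv(shape, 64)
shape = max_pool(shape)          # second pooling: 14x14 -> 7x7

# Size of the vector produced by the Flatten layer
flattened = shape[0] * shape[1] * shape[2]
print(shape, flattened)
```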

## Training the network with the mnist dataset

Having specified the structure of the network we can now train it on the mnist dataset.

Note: After training the network we save the weights in HDF5 format in the "files" directory. To avoid retraining the network on every run, we first check whether this directory is empty. If it is not, the network has already been trained and we do not need to train it again.
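
The caching idea in this note can be sketched independently of Keras. The version below is a slightly stricter variant that looks for one specific weights file instead of checking whether the directory is empty; `train_and_save` is a hypothetical placeholder for the real `fit` and `save_weights` calls, and a temporary directory stands in for "files".

```python
import os
import tempfile

def load_or_train(weights_path, train_and_save):
    """Run the expensive training step only when no saved weights exist yet."""
    os.makedirs(os.path.dirname(weights_path), exist_ok=True)
    if os.path.exists(weights_path):
        return "loaded"
    train_and_save(weights_path)
    return "trained"

def train_and_save(path):
    # Placeholder for model.fit(...) followed by model.save_weights(path)
    with open(path, "wb") as handle:
        handle.write(b"fake weights")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "files", "mnist_model.h5")
    first = load_or_train(path, train_and_save)    # nothing saved yet
    second = load_or_train(path, train_and_save)   # weights file now exists

print(first, second)
```

Checking for the file itself avoids skipping training when the directory happens to contain an unrelated file.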

In [13]:
# Here we say where the model starts and ends
old_model = Model(inputs=inputs, outputs=layer)

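# `decay` applies the legacy time-based learning-rate decay; recent Keras versions
# may expect a learning-rate schedule object instead
old_model.compile(optimizer=SGD(learning_rate=INIT_LR, momentum=0.9, decay=INIT_LR / NUM_EPOCHS),
                  loss="categorical_crossentropy", metrics=["accuracy"])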

tensorboard_callback = keras.callbacks.TensorBoard(log_dir='./logs', histogram_freq=0,
                                                   write_graph=True, write_images=True)

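# Make sure the directory exists, otherwise listing it would raise an error
os.makedirs("./files", exist_ok=True)

if len(os.listdir("./files")) == 0:
    # If there are no files in the "files" directory
    # it means that the network hasn't been trained yet, so we
    # need to train it and save its weights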

    history = old_model.fit(train_x, train_y, validation_data=(test_x, test_y), batch_size=BS, epochs=NUM_EPOCHS,
                            callbacks=[tensorboard_callback])

    # This saves the weights to the specified file in HDF5 format
    old_model.save_weights('./files/mnist_model.h5')
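
For reference, the `decay` argument used above follows, in the legacy Keras SGD optimizer, a time-based schedule in which the learning rate after `t` weight updates is `lr0 / (1 + decay * t)`. The sketch below evaluates that formula for the settings of this notebook; the 60,000 training images are the standard MNIST training split, and the function name `decayed_lr` is made up for illustration.

```python
import math

INIT_LR = 0.01
NUM_EPOCHS = 5
BS = 512
NUM_TRAIN = 60000  # size of the standard MNIST training split

def decayed_lr(initial_lr, decay, iteration):
    """Time-based decay: the rate shrinks with the number of updates so far."""
    return initial_lr / (1.0 + decay * iteration)

# One weight update per batch, with a final partial batch
steps_per_epoch = math.ceil(NUM_TRAIN / BS)
decay = INIT_LR / NUM_EPOCHS

# Learning rate at the start of training and after each epoch
for epoch in range(NUM_EPOCHS + 1):
    iteration = epoch * steps_per_epoch
    print(epoch, round(decayed_lr(INIT_LR, decay, iteration), 5))
```

With these settings the rate falls to a little under half of its initial value by the end of the fifth epoch.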

## Using the learned weights to classify a new dataset

Having trained the network on the mnist dataset we can now use its learned representation and apply it to a new set of data.

First we need to specify the structure for the new network that will take advantage of the representations learned from the previous network

In [14]:
# Loading the weights previously obtained by training the network
old_model.load_weights("./files/mnist_model.h5")

# We now iterate over all the layers in the old model in order to freeze them:
# their weights must stay fixed while the new layers are trained
for layer in old_model.layers:
    layer.trainable = False

# Now we create the new model, which reuses the old model's layers up to "features"
# and adds a new fully connected part that classifies into 26 classes

layer = Dense(512)(old_model.get_layer("features").output)
layer = Activation("relu")(layer)
layer = BatchNormalization()(layer)
layer = Dropout(0.5)(layer)
layer = Dense(256)(layer)
layer = Activation("relu")(layer)
layer = BatchNormalization()(layer)
layer = Dense(128)(layer)
layer = Activation("relu")(layer)
layer = BatchNormalization()(layer)
layer = Dropout(0.2)(layer)
layer = Dense(26)(layer)
layer = Activation("softmax")(layer)

# Here we say that the model starts at the old model's input
# and ends in the new layer object
model = Model(inputs=old_model.input, outputs=layer)

model.compile(optimizer=SGD(learning_rate=1e-2, momentum=0.9), loss="categorical_crossentropy",
              metrics=["accuracy"])
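
To make the effect of freezing concrete, here is a toy NumPy sketch (not the Keras implementation): a two-layer linear model in which the first weight matrix plays the role of the frozen feature extractor and only the second one is updated by gradient descent. All names, sizes, and the step-size rule are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=(8, 4))             # 8 examples with 4 inputs each
y = rng.normal(size=(8, 2))             # 2 regression targets per example

w_frozen = rng.normal(size=(4, 3))      # "pretrained" feature extractor, never updated
w_head = np.zeros((3, 2))               # new head, trained from scratch
w_frozen_before = w_frozen.copy()

# The frozen part only needs a forward pass; its output never changes
features = x @ w_frozen

def loss(head):
    """Mean squared error of the head on top of the frozen features."""
    return float(np.mean((features @ head - y) ** 2))

loss_before = loss(w_head)

# Step size derived from the largest singular value so gradient descent stays stable
step = y.size / (2.0 * np.linalg.norm(features, 2) ** 2)

for _ in range(200):
    error = features @ w_head - y
    # Gradient of the mean squared error with respect to the head only
    grad_head = 2.0 * features.T @ error / y.size
    w_head -= step * grad_head          # the frozen weights receive no update

loss_after = loss(w_head)
print(loss_before, loss_after)
print(np.array_equal(w_frozen, w_frozen_before))
```

Only `w_head` changes during the loop, which mirrors what setting `trainable = False` does for the layers of the old model.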

After that we only need to load the new data, apply the same preprocessing we used for the mnist dataset, and fit our model.

In [15]:
# Loading the new data
new_train_x = np.load('./data/imagesLettersTrain.npy')
new_train_y = np.load('./data/labelsTrain.npy')

new_test_x = np.load('./data/imagesLettersTest.npy')
new_test_y = np.load('./data/labelsTest.npy')

# In this case we only have one input channel that is the black and white channel
new_train_x = new_train_x.reshape((new_train_x.shape[0], 28, 28, 1))
new_test_x = new_test_x.reshape((new_test_x.shape[0], 28, 28, 1))

new_train_x = new_train_x.astype("float32") / 255.0
new_test_x = new_test_x.astype("float32") / 255.0

# We one-hot encode the training and testing labels
# Now we have 26 different labels so each label becomes a one-hot vector of size 26
new_train_y = keras.utils.to_categorical(new_train_y, 26)
new_test_y = keras.utils.to_categorical(new_test_y, 26)

NEW_NUM_EPOCHS = 100
NEW_BS = 20

fitting = model.fit(new_train_x, new_train_y, batch_size=NEW_BS, epochs=NEW_NUM_EPOCHS, callbacks=[tensorboard_callback])

Epoch 1/100
...
Epoch 100/100


Having fine-tuned our model on the new training set we can see that it achieves a training accuracy of around 97%. This is a very good value considering the small number of examples in the training set. We must now evaluate how well our model does on the test set.

In [16]:
model.evaluate(x=new_test_x, y=new_test_y)



[0.9862400656826953, 0.7799864183158711]

The results above are:

| Test Loss | Test Accuracy |
|-----------|---------------|
| 0.986     | 0.780         |
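
The two numbers returned by `evaluate` are, in order, the loss the model was compiled with (categorical cross-entropy) and the accuracy metric. As a rough sketch of how they are derived from predictions, the NumPy snippet below computes both for three made-up examples with four classes; the arrays are invented for illustration and are not taken from the dataset.

```python
import numpy as np

# One-hot ground truth and predicted class probabilities (each row sums to 1)
y_true = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [0, 0, 0, 1]], dtype="float32")
y_pred = np.array([[0.7, 0.1, 0.1, 0.1],
                   [0.2, 0.5, 0.2, 0.1],
                   [0.1, 0.6, 0.2, 0.1]], dtype="float32")

# Categorical cross-entropy: minus the log-probability given to the true class,
# averaged over the examples (clipping avoids log(0))
eps = 1e-7
cross_entropy = -np.sum(y_true * np.log(np.clip(y_pred, eps, 1.0)), axis=1)
loss = float(cross_entropy.mean())

# Accuracy: fraction of examples whose most probable class is the true one
accuracy = float(np.mean(y_pred.argmax(axis=1) == y_true.argmax(axis=1)))

print(round(loss, 3), round(accuracy, 3))
```

The confidently wrong third example contributes most of the loss, which is why the loss can stay fairly high even when most predictions are correct.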

## Training a network from scratch with the new dataset

To better illustrate the benefits of transfer learning we will compare the results obtained above with the results of training a network from scratch.

Training a network from scratch is usually the preferred method, as the model can then learn all the details specific to a given dataset. However, when there are very few training examples, as is the case for this dataset, they may not be enough to accurately train a neural network, and so transfer learning may be a better option.

We will build a network with a structure similar to the one built for training on the mnist dataset.

In [17]:
# Defining the model using Keras functional API

# The functional API suits transfer learning as it supports reusing the output of
# any layer as well as merging networks (for example in a Concatenate layer object)
# The sequential API is a more rigid API for defining neural networks

# We create an input object that works like tf.placeholder
# We add a name to the object so we can find it more easily
new_inputs = Input(shape=(28, 28, 1), name="inputs")

# The next part of the model definition is the same as in Lab3 exercise
# Note that these objects implement the __call__ method and can be interpreted as functions
# The input they receive is the input the layer receives

# The first layer is a convolutional layer with 32 filters
# The shape of the kernel in this layer is 3x3
# We use padding "same", which pads the borders of the image with zeros so that the
# kernel can start right at the edge and the output keeps the input's height and width
new_layer = Conv2D(32, (3, 3), padding="same", input_shape=(28, 28, 1))(new_inputs)
new_layer = Activation("relu")(new_layer)
new_layer = BatchNormalization(axis=-1)(new_layer)

# Second convolutional layer, again with 32 filters of size 3x3
new_layer = Conv2D(32, (3, 3), padding="same")(new_layer)

# For this layer we add a ReLU activation
# We need to add ReLU because a convolution is still a linear transformation
# so we add ReLU for it to be a non linear transformation
new_layer = Activation("relu")(new_layer)

# We add batch normalization here
# This normalizes the output from the previous layer in order
# for the input of the next layer to be normalized
# Our data has the channels at the end, so we normalize over the last axis (axis=-1),
# which is also the Keras default
new_layer = BatchNormalization(axis=-1)(new_layer)

# In this layer we pool the output of the previous layer in order to reduce the number of features
# Since we are using a 2x2 pooling size we keep only half of the features in each spatial dimension
# So instead of a 28x28 feature map per filter we now have a 14x14 one
# Since we are omitting the stride Keras assumes the same stride as pool size which is what we want
new_layer = MaxPooling2D(pool_size=(2, 2))(new_layer)

# We add a dropout layer of 25% dropout for regularization
new_layer = Dropout(0.25)(new_layer)

# We add another convolution layer, in this case we don't need to specify the input shape
# because keras finds out the right input shape
new_layer = Conv2D(64, (3, 3), padding="same")(new_layer)
new_layer = Activation("relu")(new_layer)
new_layer = BatchNormalization(axis=-1)(new_layer)

new_layer = Conv2D(64, (3, 3), padding="same")(new_layer)
new_layer = Activation("relu")(new_layer)
new_layer = BatchNormalization(axis=-1)(new_layer)

# After this pooling we have a 7x7 feature map per filter (a 7x7x64 tensor overall)
new_layer = MaxPooling2D(pool_size=(2, 2))(new_layer)
new_layer = Dropout(0.25)(new_layer)

# We add a Flatten layer in order to transform the input tensor into a vector
# Here the 7x7x64 tensor (7x7 times the number of filters) becomes a vector of 3136 values
features = Flatten(name="features")(new_layer)

# Fully connected part of the network
new_layer = Dense(512)(features)
new_layer = Activation("relu")(new_layer)
new_layer = BatchNormalization()(new_layer)
new_layer = Dropout(0.5)(new_layer)
new_layer = Dense(256)(new_layer)
new_layer = Activation("relu")(new_layer)
new_layer = BatchNormalization()(new_layer)
new_layer = Dropout(0.2)(new_layer)
new_layer = Dense(128)(new_layer)
new_layer = Activation("relu")(new_layer)
new_layer = BatchNormalization()(new_layer)
new_layer = Dropout(0.2)(new_layer)
new_layer = Dense(26)(new_layer)
new_layer = Activation("softmax")(new_layer)

In [18]:
# Here we say that the model starts at the new input object
# and ends in the new_layer object
new_model = Model(inputs=new_inputs, outputs=new_layer)

new_model.compile(optimizer=SGD(learning_rate=1e-2, momentum=0.9), loss="categorical_crossentropy",
                  metrics=["accuracy"])

new_model_fitting = new_model.fit(new_train_x, new_train_y, batch_size=NEW_BS, epochs=NEW_NUM_EPOCHS, callbacks=[tensorboard_callback])

Epoch 1/100
...
Epoch 100/100


In [19]:
new_model.evaluate(x=new_test_x, y=new_test_y)



[1.1156733264438996, 0.7550543267365153]

## Conclusions

The results above are:

| Test Loss | Test Accuracy |
|-----------|---------------|
| 1.116     | 0.755         |

These results support the benefits of using transfer learning. With transfer learning we obtained a slight improvement in both metrics (test accuracy of 0.780 versus 0.755, test loss of 0.986 versus 1.116), but most importantly we were able to reduce the computational time considerably.

Training the network from scratch took around 4 ms per training step, i.e. each example took around 4 ms to go through the network and update the network's weights. By contrast, with transfer learning each example takes about 1 ms. This is because most low-level features, such as the ability to detect edges and shapes, were already learned by the previous network: its layers are frozen, so no weights are updated for them, and the new network only has to learn the aspects specific to the new data. Training from scratch, on the other hand, requires the network to also learn these low-level features, so a large part of the computation is dedicated to that purpose. This difference would be crucial on a larger dataset, where training time can become a practical limitation.
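
The comparison can be summarised numerically from the figures reported above. In the sketch below the metric values and per-example timings come from this notebook, while `num_examples` is a hypothetical training-set size used only to show how the per-example timings scale to a full run.

```python
scratch = {"loss": 1.116, "accuracy": 0.755, "ms_per_example": 4.0}
transfer = {"loss": 0.986, "accuracy": 0.780, "ms_per_example": 1.0}

accuracy_gain = transfer["accuracy"] - scratch["accuracy"]
loss_reduction = scratch["loss"] - transfer["loss"]
speedup = scratch["ms_per_example"] / transfer["ms_per_example"]

def training_seconds(ms_per_example, num_examples, epochs):
    """Estimated wall-clock time if every example costs the same amount of time."""
    return ms_per_example * num_examples * epochs / 1000.0

num_examples = 1000   # hypothetical size, not the real size of the dataset
epochs = 100          # NEW_NUM_EPOCHS in this notebook

print(round(accuracy_gain, 3), round(loss_reduction, 3), speedup)
print(training_seconds(scratch["ms_per_example"], num_examples, epochs),
      training_seconds(transfer["ms_per_example"], num_examples, epochs))
```

Under these assumptions the absolute time saved grows linearly with both the number of examples and the number of epochs.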