# Training from Scratch vs. Transfer Learning
This project was part of the **Machine Learning Specialization with TF2** by [**CLOUDXLAB**](http://cloudxlab.com/).

### What are we going to do?

We will train a neural network (model A) on the data for 8 of the 10 Fashion MNIST classes, and another neural network (model B) from scratch on the remaining 2 classes. We will then reuse the pre-trained layers of model A, replace its output layer, and train only that new layer to classify the same 2 classes (this technique is called transfer learning). Finally, we compare the results obtained with training from scratch and with transfer learning, to see in practice what transfer learning offers.

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow import keras

In [6]:
(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()
X_train_full = X_train_full[:30000]
y_train_full = y_train_full[:30000]

In [7]:
X_test = X_test[:5000]
y_test = y_test[:5000]

In [8]:
X_train_full = X_train_full / 255.0
X_test = X_test / 255.0

In [9]:
X_valid, X_train =X_train_full[:5000], X_train_full[5000:]

In [10]:
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]

### Dividing the data sets
Let's split the Fashion MNIST training set in two:

**X_train_A:** all images of all items except for sandals and shirts (classes 5 and 6). **X_train_B:** a much smaller training set containing only the images of sandals and shirts. The validation set and the test set are split the same way.

**Why are we doing this?**

We will train a model on set A (classification task with 8 classes), and try to reuse it to tackle set B (binary classification). We hope to transfer a little bit of knowledge from task A to task B, since classes in set A (sneakers, ankle boots, coats, t-shirts, etc.) are somewhat similar to classes in set B (sandals and shirts). However, since we are using Dense layers, only patterns that occur at the same location can be reused (in contrast, convolutional layers will transfer much better, since learned patterns can be detected anywhere on the image).

Define the **split_dataset** function, which splits a dataset in two: one part containing the sandal and shirt images, the other containing the images of the remaining classes.

In [20]:
def split_dataset(X, y):
    y_5_or_6 = (y == 5) | (y == 6) # sandals or shirts
    y_A = y[~y_5_or_6]
    y_A[y_A > 6] -= 2 # class indices 7, 8, 9 should be moved to 5, 6, 7
    y_B = (y[y_5_or_6] == 6).astype(np.float32) # binary classification task: is it a shirt (class 6)?
    return ((X[~y_5_or_6], y_A), (X[y_5_or_6], y_B))
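
To make the relabelling concrete, here is a minimal sketch that runs the same function on toy data. The arrays `X_toy` and `y_toy` are hypothetical stand-ins (one fake "image" per class), not part of the original notebook.

```python
import numpy as np

def split_dataset(X, y):
    """Split (X, y) into task A (all classes but 5 and 6) and task B (5 vs. 6)."""
    y_5_or_6 = (y == 5) | (y == 6)                # sandals or shirts
    y_A = y[~y_5_or_6]                            # boolean indexing returns a copy
    y_A[y_A > 6] -= 2                             # class indices 7, 8, 9 become 5, 6, 7
    y_B = (y[y_5_or_6] == 6).astype(np.float32)   # 1.0 for shirts, 0.0 for sandals
    return (X[~y_5_or_6], y_A), (X[y_5_or_6], y_B)

# Hypothetical toy data: one single-pixel "image" per class, labelled 0..9.
y_toy = np.arange(10)
X_toy = np.arange(10).reshape(10, 1)

(X_A, y_A), (X_B, y_B) = split_dataset(X_toy, y_toy)
print(y_A)    # [0 1 2 3 4 5 6 7]
print(y_B)    # [0. 1.]
print(y_toy)  # [0 1 2 3 4 5 6 7 8 9]
```

Task A ends up with 8 contiguous labels (0 to 7), which is what `sparse_categorical_crossentropy` with an 8-unit output expects, and the original label array is left unchanged because boolean indexing copies the selected elements.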

In [21]:
(X_train_A, y_train_A), (X_train_B, y_train_B) = split_dataset(X_train, y_train)

In [22]:
(X_valid_A, y_valid_A), (X_valid_B, y_valid_B) = split_dataset(X_valid, y_valid)

In [23]:
(X_test_A, y_test_A), (X_test_B, y_test_B) = split_dataset(X_test, y_test)

In [24]:
tf.random.set_seed(42)
np.random.seed(42)

### Build and Fit the Model A
Let us define the model for the classification of data set A that we have created previously.

Later the trained weights of this model will be used for the classification task of data B.

In [39]:
model_A = keras.models.Sequential()
model_A.add(keras.layers.Flatten(input_shape=[28, 28]))
for n_hidden in (300, 100, 50, 50, 50):
    model_A.add(keras.layers.Dense(n_hidden, activation="selu"))
model_A.add(keras.layers.Dense(8, activation="softmax"))

In [41]:
model_A.compile(loss="sparse_categorical_crossentropy",
    optimizer=keras.optimizers.SGD(learning_rate=1e-3),  # `lr` is a deprecated alias
    metrics=["accuracy"])

In [42]:
history = model_A.fit(X_train_A, y_train_A, epochs=5,
            validation_data=(X_valid_A, y_valid_A))

Train on 19875 samples, validate on 4014 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [43]:
model_A.save("my_model_A.h5")

### Build and Fit the Model B
Let us define the model for the classification of data set B that we have created previously.

Later, let us also examine the classification of B set by using the trained weights of model A.

In [46]:
model_B = keras.models.Sequential()
model_B.add(keras.layers.Flatten(input_shape=[28, 28]))
for n_hidden in (300, 100, 50, 50, 50):
    model_B.add(keras.layers.Dense(n_hidden, activation="selu"))
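# NOTE: softmax over a single unit always outputs 1.0, so this output layer
# cannot separate the two classes; sigmoid is the usual choice for a 1-unit binary output.
model_B.add(keras.layers.Dense(1, activation="softmax"))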
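
The choice of activation on a 1-unit output layer matters more than it may seem. The following is a small NumPy sketch, not Keras code, that compares softmax and sigmoid on a single logit per sample. The helper functions and the `logits` values are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis, shifted by the max for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

# Three samples, one output unit each (shape: batch x 1), arbitrary values.
logits = np.array([[-3.0], [0.0], [2.5]])

softmax_out = softmax(logits)   # normalises over a single value per row
sigmoid_out = sigmoid(logits)   # maps each logit to (0, 1) independently

print(softmax_out.ravel())  # [1. 1. 1.]
print(sigmoid_out.ravel())
```

Because softmax normalises across the units of the layer, a single unit always receives the full probability mass of 1, whatever the logit is. Sigmoid keeps the dependence on the logit, which is why it is the usual activation for a 1-unit binary classifier.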

In [48]:
model_B.compile(loss="binary_crossentropy",
    optimizer=keras.optimizers.SGD(learning_rate=1e-3),  # `lr` is a deprecated alias
    metrics=["accuracy"])

In [49]:
history = model_B.fit(X_train_B, y_train_B, epochs=5,
            validation_data=(X_valid_B, y_valid_B))

Train on 5125 samples, validate on 986 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


### Creating new model based on existing model A
Let us first look at the number of trainable parameters of model_B, which we trained previously.

Then we shall create a new model, model_B_on_A, which reuses the pre-trained layers of model_A but has a new final dense layer with only 1 neuron.

Finally, we shall compare the performance of the two models, model_B and model_B_on_A.

In [53]:
model_B.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_1 (Flatten)          (None, 784)               0         
_________________________________________________________________
dense_6 (Dense)              (None, 300)               235500    
_________________________________________________________________
dense_7 (Dense)              (None, 100)               30100     
_________________________________________________________________
dense_8 (Dense)              (None, 50)                5050      
_________________________________________________________________
dense_9 (Dense)              (None, 50)                2550      
_________________________________________________________________
dense_10 (Dense)             (None, 50)                2550      
_________________________________________________________________
dense_11 (Dense)             (None, 1)                 51        
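
The parameter counts in this summary can be checked by hand: a Dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. The sketch below is plain Python arithmetic, and the helper name `dense_params` is just an illustrative choice.

```python
def dense_params(n_in, n_out):
    """Number of parameters of a Dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out

# Flattened 28x28 input, five hidden layers, one output unit (as in model_B).
layer_sizes = [784, 300, 100, 50, 50, 50, 1]

per_layer = [dense_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
total = sum(per_layer)

print(per_layer)  # [235500, 30100, 5050, 2550, 2550, 51]
print(total)      # 275801
```

The last entry, 51, is the size of the 1-unit output layer (50 weights and 1 bias), which is the only part that stays trainable once the earlier layers are frozen.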

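model_B_on_A will be built from the layers of model_A, so the two models share those layers: anything that changes them through model_B_on_A (further training, or freezing) also applies to model_A. Before creating model_B_on_A, we therefore clone model_A and copy its trained weights into the clone, so that model_A_clone keeps an independent copy of the original model.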

In [54]:
model_A_clone = keras.models.clone_model(model_A)

In [55]:
model_A_clone.set_weights(model_A.get_weights())

In [56]:
# create a new model model_B_on_A, based on existing layers of model_A
model_B_on_A = keras.models.Sequential(model_A.layers[:-1])

In [57]:
model_B_on_A.add(keras.layers.Dense(1, activation="sigmoid"))

In [58]:
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = False

In [59]:
model_B_on_A.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten (Flatten)            (None, 784)               0         
_________________________________________________________________
dense (Dense)                (None, 300)               235500    
_________________________________________________________________
dense_1 (Dense)              (None, 100)               30100     
_________________________________________________________________
dense_2 (Dense)              (None, 50)                5050      
_________________________________________________________________
dense_3 (Dense)              (None, 50)                2550      
_________________________________________________________________
dense_4 (Dense)              (None, 50)                2550      
_________________________________________________________________
dense_12 (Dense)             (None, 1)                 51        

In [60]:
model_B_on_A.compile(loss="binary_crossentropy",
         optimizer=keras.optimizers.SGD(learning_rate=1e-3),  # `lr` is a deprecated alias
         metrics=["accuracy"])

In [61]:
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=5,
                   validation_data=(X_valid_B, y_valid_B))

Train on 5125 samples, validate on 986 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


### Evaluating the models
Now that we have two models, model_B and model_B_on_A, for classifying the B data set, let us compare their performance based on their accuracy on the test data of B.

In [64]:
model_B.evaluate(X_test_B, y_test_B)



[7.64827382502906, 0.49844882]

In [65]:
model_B_on_A.evaluate(X_test_B, y_test_B)



[0.09924376358383075, 0.98862463]

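We observe that the two models do not perform alike: model_B reaches about 49.8% accuracy on the test data of B, which is chance level for a binary task, while model_B_on_A reaches about 98.9%.

Part of this gap comes from how model_B is defined: its output layer applies softmax to a single unit, which always outputs 1, so the model predicts the same class for every image. A sigmoid output, as used in model_B_on_A, would give a fairer from-scratch baseline.

We also see that model_B_on_A achieves this with as few as 51 trainable parameters (the new output layer), whereas model_B trains all 275,801 of its parameters.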

So, with very little training, model_B_on_A performs really well. Reusing pre-trained layers saves time and compute in real-world scenarios. This method is known as transfer learning: transferring the knowledge obtained from solving one problem to solving another, similar problem.