# Mini Project: Transfer Learning with Keras

Transfer learning is a machine learning technique where a model trained on one task is used as a starting point to solve a different but related task. Instead of training a model from scratch, transfer learning leverages the knowledge learned from the source task and applies it to the target task. This approach is especially useful when the target task has limited data or computational resources.

In transfer learning, the pre-trained model, also known as the "base model" or "source model," is typically trained on a large dataset and a more general problem (e.g., image classification on ImageNet, a vast dataset with millions of labeled images). The knowledge learned by the base model in the form of feature representations and weights captures common patterns and features in the data.

To perform transfer learning, the following steps are commonly followed:

1. Pre-training: The base model is trained on a source task using a large dataset, which can take a considerable amount of time and computational resources.

2. Feature Extraction: After pre-training, the last few layers (the classifier layers) of the base model are discarded, and the remaining layers are retained as feature extractors that produce meaningful representations of the data.

3. Fine-tuning: The retained feature extraction layers are connected to a new set of layers, often called the "classifier layers" or "task-specific layers." These new layers are randomly initialized, and the model is trained on the target task with a smaller dataset. The weights of the base model can be frozen during this training, or some or all of them can be updated with a lower learning rate to fine-tune the model for the target task.
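
The freeze-then-fine-tune pattern in step 3 can be sketched roughly as follows. This is a minimal illustration, not a recipe: the helper names `build_frozen_model` and `unfreeze_top_layers`, the learning rates, and the choice of four unfrozen layers are assumptions. `weights=None` is used only so the sketch runs without downloading weights; a real run would pass `weights='imagenet'`.

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam


def build_frozen_model(weights=None, num_classes=10):
    """Phase 1: frozen base model plus a new, randomly initialized head."""
    base = VGG16(weights=weights, include_top=False, input_shape=(32, 32, 3))
    for layer in base.layers:
        layer.trainable = False  # keep the pre-trained features fixed
    x = GlobalAveragePooling2D()(base.output)
    outputs = Dense(num_classes, activation="softmax")(x)
    model = Model(inputs=base.input, outputs=outputs)
    model.compile(optimizer=Adam(learning_rate=1e-3),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return base, model


def unfreeze_top_layers(base, model, n_layers=4, learning_rate=1e-5):
    """Phase 2: unfreeze the last n_layers of the base and use a lower learning rate."""
    for layer in base.layers[-n_layers:]:
        layer.trainable = True
    # Recompile so the changed trainable flags are picked up by training.
    model.compile(optimizer=Adam(learning_rate=learning_rate),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model


base, model = build_frozen_model()
print(len(model.trainable_weights))  # only the new head is trainable here
unfreeze_top_layers(base, model, n_layers=4)
print(len(model.trainable_weights))  # head plus the unfrozen base layers
```

In practice, `model.fit` would be called once after each phase, so the new head settles before any pre-trained weights are adjusted.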

Transfer learning has several benefits:

1. Reduced training time and resource requirements: Since the base model has already learned generic features, transfer learning can save time and resources compared to training a model from scratch.

2. Improved generalization: Transfer learning helps the model generalize better to the target task, especially when the target dataset is small but related to the source dataset.

3. Better performance: By starting from a model that is already trained on a large dataset, transfer learning can lead to better performance on the target task, especially in scenarios with limited data.

4. Effective feature extraction: The feature extraction layers of the pre-trained model can serve as powerful feature extractors for different tasks, even when the task domains differ.

Transfer learning is commonly used in various domains, including computer vision, natural language processing (NLP), and speech recognition, where pre-trained models are fine-tuned for specific applications like object detection, sentiment analysis, or speech-to-text.

In this mini-project you will perform fine-tuning using Keras with a pre-trained VGG16 model on the CIFAR-10 dataset.

First, import all the libraries you'll need.

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.applications import VGG16
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split


The CIFAR-10 dataset is a widely used benchmark dataset in computer vision and machine learning. The name refers to the Canadian Institute for Advanced Research (CIFAR), which funded the work. The dataset was collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton as a labeled subset of the 80 Million Tiny Images dataset, and it was described in Krizhevsky's 2009 technical report "Learning Multiple Layers of Features from Tiny Images."

The dataset consists of 60,000 color images, each of size 32x32 pixels, belonging to ten different classes. Each class contains 6,000 images. The ten classes in CIFAR-10 are:

1. Airplane
2. Automobile
3. Bird
4. Cat
5. Deer
6. Dog
7. Frog
8. Horse
9. Ship
10. Truck

The images are evenly distributed across the classes, making CIFAR-10 a balanced dataset. The dataset is divided into two sets: a training set and a test set. The training set contains 50,000 images, while the test set contains the remaining 10,000 images.

CIFAR-10 is often used for tasks such as image classification, object recognition, and transfer learning experiments. The relatively small size of the images and the variety of classes make it a challenging dataset for training machine learning models, especially deep neural networks. It also serves as a good dataset for teaching and learning purposes due to its manageable size and straightforward class labels.

Here are your tasks:

1. Load the CIFAR-10 dataset after referencing the documentation [here](https://keras.io/api/datasets/cifar10/).
2. Normalize the pixel values so they're all in the range [0, 1].
3. Apply One Hot Encoding to the train and test labels using the [to_categorical](https://www.tensorflow.org/api_docs/python/tf/keras/utils/to_categorical) function.
4. Further split the training data into training and validation sets using [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). Use only 10% of the data for validation.
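
To make tasks 2 and 3 concrete, here is a small plain-Python sketch of what dividing by 255 and one-hot encoding do to the data. The helper names `normalize_pixels` and `one_hot` are hypothetical and exist only for illustration; in the notebook itself, NumPy division and `to_categorical` do this work on whole arrays.

```python
def normalize_pixels(pixels):
    """Scale 8-bit pixel values (0-255) into the range [0, 1]."""
    return [p / 255.0 for p in pixels]


def one_hot(labels, num_classes):
    """Turn integer class labels into one-hot rows."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0  # mark the position of the true class
        rows.append(row)
    return rows


print(normalize_pixels([0, 51, 255]))  # [0.0, 0.2, 1.0]
print(one_hot([3, 0], num_classes=4))  # [[0.0, 0.0, 0.0, 1.0], [1.0, 0.0, 0.0, 0.0]]
```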

In [None]:
# Load the CIFAR-10 dataset
(X_train, y_train), (X_test, y_test) = cifar10.load_data()

In [None]:
# Normalize the pixel values to [0, 1]
X_train = X_train.astype('float32') / 255.0
X_test = X_test.astype('float32') / 255.0

In [None]:
# One-hot encode the labels
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)

In [None]:
# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.1, random_state=42)

VGG16 (Visual Geometry Group 16) is a deep convolutional neural network architecture that was developed by the Visual Geometry Group at the University of Oxford. It was proposed by researchers Karen Simonyan and Andrew Zisserman in their paper titled "Very Deep Convolutional Networks for Large-Scale Image Recognition," which was presented at the International Conference on Learning Representations (ICLR) in 2015.

The VGG16 architecture gained significant popularity for its simplicity and effectiveness in image classification tasks. It was one of the pioneering models that demonstrated the power of deeper neural networks for visual recognition tasks.

Key characteristics of the VGG16 architecture:

1. Architecture: VGG16 has 16 layers with learnable weights (13 convolutional layers and 3 fully connected layers), hence the "16" in its name. These layers are stacked one after another, forming a deep neural network.

2. Convolutional Layers: The main building blocks of VGG16 are the convolutional layers. It primarily uses 3x3 convolutional filters throughout the network, which allows it to capture local features effectively.

3. Max Pooling: After each block of convolutional layers, VGG16 applies a max-pooling layer with a 2x2 window and stride 2, which halves the spatial dimensions (width and height) of the feature maps and reduces the computation required by later layers.

4. Fully Connected Layers: Towards the end of the network, VGG16 has fully connected layers that act as a classifier to make predictions based on the learned features.

5. Activation Function: The network uses the Rectified Linear Unit (ReLU) activation function for all hidden layers, which helps with faster convergence during training.

6. Number of Filters: The number of filters grows with depth, starting at 64 in the first block and reaching 512 in the last two blocks. Stacking multiple layers allows VGG16 to learn complex hierarchical features.

7. Output Layer: The output layer consists of 1000 units, corresponding to 1000 ImageNet classes. VGG16 was originally trained on the large-scale ImageNet dataset, which contains millions of images from 1000 different classes.

VGG16 was instrumental in showing that increasing the depth of a neural network can significantly improve its performance on image recognition tasks. However, the main drawback of VGG16 is its high number of parameters, making it computationally expensive and memory-intensive to train. Despite this limitation, VGG16 remains an essential benchmark architecture and has paved the way for even deeper and more efficient models in the field of computer vision, such as ResNet, DenseNet, and EfficientNet.
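
The effect of the 3x3 convolutions and 2x2 pooling described above can be checked with a little arithmetic. The sketch below assumes the standard VGG16 filter counts (64, 128, 256, 512, 512 across five blocks) and "same" padding, so only the pooling layers change the spatial size. The function name `conv_base_summary` is hypothetical.

```python
# Filters per convolutional layer in each of VGG16's five blocks.
VGG16_BLOCKS = [[64, 64], [128, 128], [256, 256, 256],
                [512, 512, 512], [512, 512, 512]]


def conv_base_summary(input_size=32, input_channels=3):
    """Return (final spatial size, final channels, parameter count) of the conv base."""
    size, channels, params = input_size, input_channels, 0
    for block in VGG16_BLOCKS:
        for filters in block:
            # 3x3 kernel weights plus one bias per filter.
            params += 3 * 3 * channels * filters + filters
            channels = filters
        size //= 2  # 2x2 max pooling with stride 2 halves width and height
    return size, channels, params


print(conv_base_summary(32))   # (1, 512, 14714688)
print(conv_base_summary(224))  # (7, 512, 14714688)
```

A 32x32 CIFAR-10 image is therefore reduced to a single 1x1 position with 512 channels by the end of the convolutional base, while the parameter count does not depend on the input size.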

Here are your tasks:

1. Load [VGG16](https://keras.io/api/applications/vgg/#vgg16-function) as a base model. Make sure to exclude the top layer.
2. Freeze all the layers in the base model. The frozen layers will act as a fixed feature extractor whose output feeds the new, trainable layers.

In [None]:
# Load the pre-trained VGG16 model (excluding the top classifier)
base_model = VGG16(weights='imagenet', include_top=False, input_shape=(32, 32, 3))

2024-08-01 17:16:49.421086: I tensorflow/core/common_runtime/gpu/gpu_device.cc:2021] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 9435 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 3080 Ti, pci bus id: 0000:01:00.0, compute capability: 8.6


In [None]:
# Freeze the layers in the base model
for layer in base_model.layers:
    layer.trainable = False

# Display the model summary
# base_model.summary()

Now, we'll add some trainable layers to the base model.

1. Using the base model, add a [GlobalAveragePooling2D](https://keras.io/api/layers/pooling_layers/global_average_pooling2d/) layer, followed by a [Dense](https://keras.io/api/layers/core_layers/dense/) layer with 256 units and ReLU activation. Finally, add a classification layer with 10 units, corresponding to the 10 CIFAR-10 classes, with softmax activation.
2. Create a Keras [Model](https://keras.io/api/models/model/) that takes in appropriate inputs and outputs.

In [None]:
# Add a global average pooling layer
x = base_model.output
x = GlobalAveragePooling2D()(x)

In [None]:
# Add a fully connected layer with 256 units and ReLU activation
x = Dense(256, activation='relu')(x)

In [None]:
# Add the final classification layer with 10 units (for CIFAR-10 classes) and softmax activation
predictions = Dense(10, activation='softmax')(x)

In [None]:
# Create the fine-tuned model
model = Model(inputs=base_model.input, outputs=predictions)

# Display the model summary
# model.summary()
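
As a quick sanity check on the size of the new head, the number of trainable parameters can be worked out by hand. The sketch below assumes the pooled output of VGG16's convolutional base has 512 features; `dense_params` is a hypothetical helper name.

```python
def dense_params(inputs, units):
    """Parameters in a fully connected layer: one weight per input-unit pair plus biases."""
    return inputs * units + units


pooled_features = 512  # channels produced by VGG16's last convolutional block
hidden = dense_params(pooled_features, 256)  # Dense(256) on the pooled features
output = dense_params(256, 10)               # Dense(10) classification layer

print(hidden)           # 131328
print(output)           # 2570
print(hidden + output)  # 133898
```

The global average pooling layer has no parameters of its own, so with the base frozen these are the only weights updated during training.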

With your model complete it's time to train it and assess its performance.

1. Compile your model using an appropriate loss function. Feel free to play around with the optimizer, but a good starting optimizer might be Adam with a learning rate of 0.001.
2. Fit your model on the training data. Use the validation data to print the accuracy for each epoch. Try training for 10 epochs. Note that training can take a while, especially without a GPU, so go ahead and grab a cup of coffee.
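
The step counter Keras prints during training follows directly from the split and the batch size. The sketch below assumes the 10% validation split from earlier and the batch sizes of 512 and 64 used in this notebook's training cells; `steps_per_epoch` is a hypothetical helper name.

```python
import math


def steps_per_epoch(num_samples, batch_size):
    """Number of batches needed to see every sample once (last batch may be smaller)."""
    return math.ceil(num_samples / batch_size)


total_train = 50_000
val_samples = total_train * 10 // 100      # 10% held out for validation
train_samples = total_train - val_samples  # samples left for training

print(train_samples)                        # 45000
print(steps_per_epoch(train_samples, 512))  # 88
print(steps_per_epoch(train_samples, 64))   # 704
```

These values match the `88/88` and `704/704` counters shown in the training logs.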

**Optional**: See if you can implement an [Early Stopping](https://keras.io/api/callbacks/early_stopping/) criterion as a callback function.
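
The idea behind early stopping with a patience setting can be illustrated in plain Python. This is a simplified sketch of the logic, not Keras's implementation (the real callback has further options such as `min_delta`); the function name `early_stopping_epoch` and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-indexed, for per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch  # stop; best weights came from best_epoch
    return len(val_losses), best_epoch


losses = [1.0, 0.8, 0.9, 0.85, 0.95, 0.7]
print(early_stopping_epoch(losses, patience=2))  # (4, 2)
print(early_stopping_epoch(losses, patience=5))  # (6, 6)
```

With a short patience, training stops at epoch 4 and would keep the epoch 2 weights. A longer patience lets training reach the later improvement at epoch 6.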

In [None]:
# Compile the model
model.compile(optimizer=Adam(learning_rate=0.001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
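
Categorical cross-entropy suits this setup because the labels are one-hot vectors and the model outputs softmax probabilities. The plain-Python sketch below shows the per-sample calculation; the function is a hypothetical stand-in written for illustration, not the Keras implementation.

```python
import math


def categorical_crossentropy(y_true, y_pred):
    """Cross-entropy for one sample: -sum(true_i * log(pred_i))."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)


y_true = [0.0] * 10
y_true[3] = 1.0  # one-hot label for class 3

uniform = [0.1] * 10                  # a model that has learned nothing
confident = [0.9 if i == 3 else 0.1 / 9 for i in range(10)]

print(round(categorical_crossentropy(y_true, uniform), 4))    # 2.3026
print(round(categorical_crossentropy(y_true, confident), 4))  # 0.1054
```

A loss near ln(10) ≈ 2.30 is therefore what an uninformed 10-class model would score, which gives a reference point for reading the training log.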

In [None]:
from tensorflow.keras.callbacks import EarlyStopping

# Early stopping callback
early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

# Train the model
history = model.fit(X_train, y_train,
                    epochs=10,
                    batch_size=512,
                    validation_data=(X_val, y_val),
                    callbacks=[early_stopping])

Epoch 1/10
88/88 ━━━━━━━━━━━━━━━━━━━━ 12s 64ms/step - accuracy: 0.3649 - loss: 1.8287 - val_accuracy: 0.5182 - val_loss: 1.3672
Epoch 2/10
88/88 ━━━━━━━━━━━━━━━━━━━━ 1s 13ms/step - accuracy: 0.5459 - loss: 1.3163 - val_accuracy: 0.5584 - val_loss: 1.2537
Epoch 3/10
88/88 ━━━━━━━━━━━━━━━━━━━━ 1s 13ms/step - accuracy: 0.5789 - loss: 1.2232 - val_accuracy: 0.5796 - val_loss: 1.2111
Epoch 4/10
88/88 ━━━━━━━━━━━━━━━━━━━━ 1s 13ms/step - accuracy: 0.5959 - loss: 1.1717 - val_accuracy: 0.5900 - val_loss: 1.1772
Epoch 5/10
88/88 ━━━━━━━━━━━━━━━━━━━━ 1s 13ms/step - accuracy: 0.6022 - loss: 1.1481 - val_accuracy: 0.5906 - val_loss: 1.1640
Epoch 6/10
88/88 ━━━━━━━━━━━━━━━━━━━━ 1s 13ms/step - accuracy: 0.6169 - loss: 1.1130 - val_accuracy: 0.5938 - val_loss: 1.1557
Epoch 7/10
88/88 ━━━━━━━━━━━━━━

With your model trained, it's time to assess how well it performs on the test data.

1. Use your trained model to calculate the accuracy on the test set. Is the model performance better than random?
2. Experiment! See if you can tweak your model to improve performance.  

In [None]:
# Evaluate the model on the test set
test_loss, test_accuracy = model.evaluate(X_test, y_test, verbose=0)

print(f"Test Accuracy: {test_accuracy:.4f}")

Test Accuracy: 0.5999


### **Comparison with Random Guessing**

59.99% accuracy is significantly better than random guessing.

For a dataset with 10 classes like CIFAR-10, a random classifier would have an expected accuracy of 10%.

There is still room for improvement though.
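
The 10% baseline can be confirmed with a short simulation. The sketch below assumes balanced classes and a classifier that guesses uniformly at random; the function name, sample count, and seed are arbitrary choices for illustration.

```python
import random


def random_guess_accuracy(num_samples=10_000, num_classes=10, seed=0):
    """Estimate the accuracy of uniform random guessing on balanced labels."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(num_samples):
        true_label = rng.randrange(num_classes)
        guess = rng.randrange(num_classes)
        correct += (guess == true_label)
    return correct / num_samples


expected = 1 / 10  # theoretical accuracy for 10 balanced classes
simulated = random_guess_accuracy()
print(f"expected: {expected:.2f}, simulated: {simulated:.3f}")
```

The simulated value should land close to 0.10, well below the test accuracy reported above.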

### **Experimentation**

In [None]:
def create_model(base_model=None,
                 input_shape=(32, 32, 3),
                 num_classes=10,
                 dense_units=256,
                 unfreeze_layers=0,
                 learning_rate=0.001):

    # Load the pre-trained base model
    if base_model is None:
        base_model = VGG16(weights='imagenet', include_top=False, input_shape=input_shape)

    # Freeze every layer first; a freshly loaded VGG16 has all layers trainable
    for layer in base_model.layers:
        layer.trainable = False

    # Unfreeze only the last `unfreeze_layers` layers if specified
    if unfreeze_layers > 0:
        for layer in base_model.layers[-unfreeze_layers:]:
            layer.trainable = True

    # Build the top part of the model
    x = base_model.output
    x = GlobalAveragePooling2D()(x)
    x = Dense(dense_units, activation='relu')(x)
    predictions = Dense(num_classes, activation='softmax')(x)

    # build
    model = Model(inputs=base_model.input, outputs=predictions)

    # compile
    model.compile(optimizer=Adam(learning_rate=learning_rate),
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])

    return model
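
When experimenting with how many layers to unfreeze, it helps to confirm how many parameters will actually be updated. The helper below is a sketch, and `count_parameters` is a hypothetical name. It is demonstrated on a tiny stand-in model rather than VGG16 so it runs quickly, but it should work the same way on a model returned by `create_model`.

```python
import numpy as np
from tensorflow import keras


def count_parameters(model):
    """Return (trainable, frozen) parameter counts for a Keras model."""
    trainable = sum(int(np.prod(tuple(w.shape))) for w in model.trainable_weights)
    frozen = sum(int(np.prod(tuple(w.shape))) for w in model.non_trainable_weights)
    return trainable, frozen


# Tiny stand-in model: one frozen Dense layer followed by one trainable Dense layer.
inputs = keras.Input(shape=(4,))
hidden = keras.layers.Dense(3, trainable=False)(inputs)  # 4*3 + 3 = 15 frozen
outputs = keras.layers.Dense(2)(hidden)                  # 3*2 + 2 = 8 trainable
toy_model = keras.Model(inputs, outputs)

print(count_parameters(toy_model))  # (8, 15)
```

If the trainable count is much larger than expected after calling `create_model`, that is a sign that more layers are being trained than intended.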

### **Experiment Baseline**

In [None]:
# Create and train the baseline model
baseline_model = create_model()

# Early stopping
early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

# Train the model
history_baseline = baseline_model.fit(X_train, y_train,
                                      epochs=10,
                                      batch_size=512,
                                      validation_data=(X_val, y_val),
                                      callbacks=[early_stopping])

# Evaluate the model
test_loss, test_accuracy = baseline_model.evaluate(X_test, y_test, verbose=0)
print(f"Baseline Model Test Accuracy: {test_accuracy:.4f}")

Epoch 1/10
88/88 ━━━━━━━━━━━━━━━━━━━━ 4s 32ms/step - accuracy: 0.3503 - loss: 1.8341 - val_accuracy: 0.5254 - val_loss: 1.3504
Epoch 2/10
88/88 ━━━━━━━━━━━━━━━━━━━━ 1s 13ms/step - accuracy: 0.5486 - loss: 1.3177 - val_accuracy: 0.5478 - val_loss: 1.2739
Epoch 3/10
88/88 ━━━━━━━━━━━━━━━━━━━━ 1s 13ms/step - accuracy: 0.5743 - loss: 1.2305 - val_accuracy: 0.5712 - val_loss: 1.2121
Epoch 4/10
88/88 ━━━━━━━━━━━━━━━━━━━━ 1s 13ms/step - accuracy: 0.5933 - loss: 1.1816 - val_accuracy: 0.5830 - val_loss: 1.1927
Epoch 5/10
88/88 ━━━━━━━━━━━━━━━━━━━━ 1s 13ms/step - accuracy: 0.5980 - loss: 1.1508 - val_accuracy: 0.5842 - val_loss: 1.1736
Epoch 6/10
88/88 ━━━━━━━━━━━━━━━━━━━━ 1s 13ms/step - accuracy: 0.6096 - loss: 1.1224 - val_accuracy: 0.6008 - val_loss: 1.1501
Epoch 7/10
88/88 ━━━━━━━━━━━━━━━

### **Optimized Experiment**

In [None]:
# # Create and train the baseline model
# baseline_model = create_model(unfreeze_layers=4, learning_rate=0.002)

# # Early stopping
# early_stopping = EarlyStopping(monitor='val_loss', patience=4, restore_best_weights=True)

# # Train the model
# history = baseline_model.fit(X_train, y_train,
#                             epochs=30,
#                             batch_size=64,
#                             validation_data=(X_val, y_val),
#                             callbacks=[early_stopping])

# # Evaluate the model
# test_loss, test_accuracy = baseline_model.evaluate(X_test, y_test, verbose=0)
# print(f"Baseline Model Test Accuracy: {test_accuracy:.4f}")

In [None]:
# # Create and train the baseline model
# baseline_model = create_model(unfreeze_layers=4, learning_rate=0.002)

# # Early stopping
# early_stopping = EarlyStopping(monitor='val_loss', patience=8, restore_best_weights=True)

# # Train the model
# history = baseline_model.fit(X_train, y_train,
#                             epochs=30,
#                             batch_size=128,
#                             validation_data=(X_val, y_val),
#                             callbacks=[early_stopping])

# # Evaluate the model
# test_loss, test_accuracy = baseline_model.evaluate(X_test, y_test, verbose=0)
# print(f"Baseline Model Test Accuracy: {test_accuracy:.4f}")

In [None]:
# Create and train the fine-tuned model (last 5 base layers unfrozen)
optimized_model = create_model(unfreeze_layers=5, learning_rate=0.001)

# Early stopping
early_stopping = EarlyStopping(monitor='val_loss', patience=15, restore_best_weights=True)

# Train the model
history_optimized = optimized_model.fit(X_train, y_train,
                                        epochs=100,
                                        batch_size=64,
                                        validation_data=(X_val, y_val),
                                        callbacks=[early_stopping])

# Evaluate the model
test_loss, test_accuracy = optimized_model.evaluate(X_test, y_test, verbose=0)
print(f"Optimized Model Test Accuracy: {test_accuracy:.4f}")

Epoch 1/100
704/704 ━━━━━━━━━━━━━━━━━━━━ 18s 17ms/step - accuracy: 0.2308 - loss: 1.9538 - val_accuracy: 0.4782 - val_loss: 1.4138
Epoch 2/100
704/704 ━━━━━━━━━━━━━━━━━━━━ 8s 11ms/step - accuracy: 0.5396 - loss: 1.2546 - val_accuracy: 0.6478 - val_loss: 1.0420
Epoch 3/100
704/704 ━━━━━━━━━━━━━━━━━━━━ 8s 11ms/step - accuracy: 0.6833 - loss: 0.9201 - val_accuracy: 0.6852 - val_loss: 0.9442
Epoch 4/100
704/704 ━━━━━━━━━━━━━━━━━━━━ 8s 11ms/step - accuracy: 0.7339 - loss: 0.7946 - val_accuracy: 0.7616 - val_loss: 0.7283
Epoch 5/100
704/704 ━━━━━━━━━━━━━━━━━━━━ 8s 11ms/step - accuracy: 0.7958 - loss: 0.6246 - val_accuracy: 0.7352 - val_loss: 0.8102
Epoch 6/100
704/704 ━━━━━━━━━━━━━━━━━━━━ 8s 11ms/step - accuracy: 0.8195 - loss: 0.5440 - val_accuracy: 0.7778 - val_loss: 0.6730
Epoch 7/100
704/704