
# Mini Project: Transfer Learning with Keras

Transfer learning is a machine learning technique where a model trained on one task is used as a starting point to solve a different but related task. Instead of training a model from scratch, transfer learning leverages the knowledge learned from the source task and applies it to the target task. This approach is especially useful when the target task has limited data or computational resources.

In transfer learning, the pre-trained model, also known as the "base model" or "source model," is typically trained on a large dataset and a more general problem (e.g., image classification on ImageNet, a vast dataset with millions of labeled images). The knowledge learned by the base model in the form of feature representations and weights captures common patterns and features in the data.

To perform transfer learning, the following steps are commonly followed:

1. Pre-training: The base model is trained on a source task using a large dataset, which can take a considerable amount of time and computational resources.

2. Feature Extraction: After pre-training, the base model is used as a feature extractor. The last few layers (classifier layers) of the model are discarded, and the remaining layers (feature extraction layers) are retained. These layers serve as feature extractors, producing meaningful representations of the data.

3. Fine-tuning: The retained feature extraction layers are connected to a new set of layers, often called the "classifier layers" or "task-specific layers." These new layers are randomly initialized, and the model is trained on the target task with a smaller dataset. The weights of the base model can be frozen during this training, or some of them (typically the later layers) can be updated with a lower learning rate to adapt the model to the target task.
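
The steps above can be illustrated with a very small NumPy sketch. It is not the Keras workflow used later in this notebook: the "pre-trained" base here is just a fixed random projection (a hypothetical stand-in named `W_base`), and the target task is synthetic. The point is only to show that the base weights stay untouched while a newly initialized head is trained on top of the extracted features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "pre-trained" base: a fixed projection followed by ReLU.
W_base = rng.normal(size=(4, 8))
W_base_before = W_base.copy()

def extract_features(x):
    """Frozen feature extractor: W_base is read but never updated."""
    return np.maximum(x @ W_base, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Small synthetic target task (binary labels), standing in for the target dataset.
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# New task-specific head, initialized from scratch and trained with gradient descent.
w_head = np.zeros(8)
b_head = 0.0
learning_rate = 0.02

features = extract_features(X)  # the base is frozen, so features are computed once
losses = []
for _ in range(300):
    p = np.clip(sigmoid(features @ w_head + b_head), 1e-7, 1 - 1e-7)
    losses.append(float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))))
    error = p - y
    w_head -= learning_rate * (features.T @ error) / len(y)
    b_head -= learning_rate * error.mean()
```

After the loop, `W_base` is identical to `W_base_before`, while the loss of the head has decreased from its starting value.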

Transfer learning has several benefits:

1. Reduced training time and resource requirements: Since the base model has already learned generic features, transfer learning can save time and resources compared to training a model from scratch.

2. Improved generalization: Transfer learning helps the model generalize better to the target task, especially when the target dataset is small and reasonably similar to the source dataset.

3. Better performance: By starting from a model that is already trained on a large dataset, transfer learning can lead to better performance on the target task, especially in scenarios with limited data.

4. Effective feature extraction: The feature extraction layers of the pre-trained model can serve as powerful feature extractors for different tasks, even when the task domains differ.

Transfer learning is commonly used in various domains, including computer vision, natural language processing (NLP), and speech recognition, where pre-trained models are fine-tuned for specific applications like object detection, sentiment analysis, or speech-to-text.

In this mini-project you will perform fine-tuning using Keras with a pre-trained VGG16 model on the CIFAR-10 dataset.

First, import all the libraries you'll need.

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.applications import VGG16
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split

The CIFAR-10 dataset is a widely used benchmark dataset in the field of computer vision and machine learning. The name refers to the Canadian Institute for Advanced Research (CIFAR). The dataset was collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton and described in the 2009 technical report "Learning Multiple Layers of Features from Tiny Images."

The dataset consists of 60,000 color images, each of size 32x32 pixels, belonging to ten different classes. Each class contains 6,000 images. The ten classes in CIFAR-10 are:

1. Airplane
2. Automobile
3. Bird
4. Cat
5. Deer
6. Dog
7. Frog
8. Horse
9. Ship
10. Truck

The images are evenly distributed across the classes, making CIFAR-10 a balanced dataset. The dataset is divided into two sets: a training set and a test set. The training set contains 50,000 images, while the test set contains the remaining 10,000 images.

CIFAR-10 is often used for tasks such as image classification, object recognition, and transfer learning experiments. The relatively small size of the images and the variety of classes make it a challenging dataset for training machine learning models, especially deep neural networks. It also serves as a good dataset for teaching and learning purposes due to its manageable size and straightforward class labels.
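
As a small convenience, the integer labels can be mapped back to the class names listed above. The sketch below assumes the usual CIFAR-10 label order (0 = airplane through 9 = truck); the names `CIFAR10_CLASSES` and `label_to_name` are illustrative and not part of Keras.

```python
# Class names in label-index order (0 = airplane, ..., 9 = truck).
CIFAR10_CLASSES = [
    "airplane", "automobile", "bird", "cat", "deer",
    "dog", "frog", "horse", "ship", "truck",
]

def label_to_name(label):
    """Map an integer label (or a 0-d/1-element value) to its class name."""
    return CIFAR10_CLASSES[int(label)]

# 60,000 images spread evenly over 10 classes.
images_per_class = 60_000 // len(CIFAR10_CLASSES)
```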

Here are your tasks:

1. Load the CIFAR-10 dataset after referencing the documentation [here](https://keras.io/api/datasets/cifar10/).
2. Normalize the pixel values so they're all in the range [0, 1].
3. Apply One Hot Encoding to the train and test labels using the [to_categorical](https://www.tensorflow.org/api_docs/python/tf/keras/utils/to_categorical) function.
4. Further split the training data into training and validation sets using [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). Use only 10% of the data for validation.

In [2]:
# Load the CIFAR-10 dataset
# cifar10.load_data() returns the training and test sets as NumPy arrays.
# According to the documentation, CIFAR-10 has 60,000 images: 50,000 for training and 10,000 for testing.
# x_train and x_test hold the RGB images, each of shape (32, 32, 3): 32x32 pixels with 3 color channels (red, green, and blue).
# y_train and y_test hold the class labels, one integer per image. For example, class 0 is airplane and class 1 is automobile.
(x_train, y_train), (x_test, y_test) = cifar10.load_data()

In [3]:
# Normalize the pixel values to [0, 1]
# According to the documentation, the pixel values in each image range from 0 to 255.
# Dividing x_train and x_test by 255 scales every pixel value into the range [0, 1].
x_train = x_train / 255
x_test = x_test / 255
x_train

array([[[[0.23137255, 0.24313726, 0.24705882],
         [0.16862746, 0.18039216, 0.1764706 ],
         [0.19607843, 0.1882353 , 0.16862746],
         ...,
         [0.61960787, 0.5176471 , 0.42352942],
         [0.59607846, 0.49019608, 0.4       ],
         [0.5803922 , 0.4862745 , 0.40392157]],

        [[0.0627451 , 0.07843138, 0.07843138],
         [0.        , 0.        , 0.        ],
         [0.07058824, 0.03137255, 0.        ],
         ...,
         [0.48235294, 0.34509805, 0.21568628],
         [0.46666667, 0.3254902 , 0.19607843],
         [0.47843137, 0.34117648, 0.22352941]],

        [[0.09803922, 0.09411765, 0.08235294],
         [0.0627451 , 0.02745098, 0.        ],
         [0.19215687, 0.10588235, 0.03137255],
         ...,
         [0.4627451 , 0.32941177, 0.19607843],
         [0.47058824, 0.32941177, 0.19607843],
         [0.42745098, 0.28627452, 0.16470589]],

        ...,

        [[0.8156863 , 0.6666667 , 0.3764706 ],
         [0.7882353 , 0.6       , 0.13333334]
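
One detail worth knowing about this scaling step: dividing a `uint8` NumPy array by an integer produces `float64` values. A possible variation, shown here on a hypothetical stand-in array rather than the real dataset, is to cast to `float32` first, which halves the memory footprint and is precise enough for pixel data.

```python
import numpy as np

# Hypothetical stand-in for a small batch of CIFAR-10 images (uint8, values 0..255).
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(4, 32, 32, 3), dtype=np.uint8)

scaled_default = fake_images / 255                      # result is float64
scaled_float32 = fake_images.astype("float32") / 255.0  # result is float32
```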

In [4]:
# One-hot encode the labels
# to_categorical converts the integer class labels into one-hot (binary) vectors.
# It is applied to the label arrays y_train and y_test. Since there are 10 classes, each label becomes a row of 10 columns,
# with a 1 in the column of the image's class and 0 everywhere else.
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
y_train

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.],
       ...,
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.]])
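
To make the encoding concrete, here is a minimal NumPy sketch of what one-hot encoding does. It is an illustration of the idea, not the implementation of `to_categorical`; the helper `one_hot` and the example labels are hypothetical.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return one row per label with a 1 in the label's column."""
    labels = np.asarray(labels).reshape(-1)  # accept shape (N,) or (N, 1)
    return np.eye(num_classes, dtype="float32")[labels]

# Hypothetical labels in the (N, 1) layout that CIFAR-10 labels use.
labels = np.array([[6], [9], [9], [1]])
encoded = one_hot(labels, num_classes=10)

# argmax recovers the original integer labels from the one-hot rows.
decoded = encoded.argmax(axis=1)
```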

In [5]:
# Split the data into training and validation sets
# The training set is used to fit the model; the validation set is held out to evaluate the model during training and to detect overfitting.
# 20 percent of the data is used for validation (test_size=0.2), with a fixed random seed (random_state=42) for reproducibility.
# Note: the task above suggests 10%; passing test_size=0.1 would follow it exactly.
x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.2, random_state=42)
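
The sizes produced by this split can be checked with a rough sketch of a shuffled holdout split. This is a simplified illustration, not scikit-learn's exact algorithm, and `holdout_split_indices` is a hypothetical helper. With 50,000 training images and a 20% holdout, 40,000 images remain for training, which at a batch size of 32 gives 1,250 steps per epoch.

```python
import math
import numpy as np

def holdout_split_indices(n_samples, val_fraction, seed):
    """Shuffle sample indices and hold out a fraction for validation."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    n_val = int(round(n_samples * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]

train_idx, val_idx = holdout_split_indices(50_000, val_fraction=0.2, seed=42)

# Number of batches per epoch for a batch size of 32.
steps_per_epoch = math.ceil(len(train_idx) / 32)
```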

VGG16 (Visual Geometry Group 16) is a deep convolutional neural network architecture that was developed by the Visual Geometry Group at the University of Oxford. It was proposed by researchers Karen Simonyan and Andrew Zisserman in their paper titled "Very Deep Convolutional Networks for Large-Scale Image Recognition," which was presented at the International Conference on Learning Representations (ICLR) in 2015.

The VGG16 architecture gained significant popularity for its simplicity and effectiveness in image classification tasks. It was one of the pioneering models that demonstrated the power of deeper neural networks for visual recognition tasks.

Key characteristics of the VGG16 architecture:

1. Architecture: VGG16 has 16 weight layers (13 convolutional and 3 fully connected), hence the "16" in its name. These layers are stacked one after another, forming a deep neural network.

2. Convolutional Layers: The main building blocks of VGG16 are the convolutional layers. It primarily uses 3x3 convolutional filters throughout the network, which allows it to capture local features effectively.

3. Max Pooling: After each block of convolutional layers, VGG16 applies a max-pooling layer with a 2x2 window and stride 2, which halves the spatial dimensions (width and height) of the feature maps and reduces the computation in later layers.

4. Fully Connected Layers: Towards the end of the network, VGG16 has fully connected layers that act as a classifier to make predictions based on the learned features.

5. Activation Function: The network uses the Rectified Linear Unit (ReLU) activation function for all hidden layers, which helps with faster convergence during training.

6. Number of Filters: The convolutional layers start with 64 filters in the first block, and the count doubles after each pooling step up to 512. Stacking multiple layers allows VGG16 to learn complex hierarchical features.

7. Output Layer: The output layer consists of 1000 units, corresponding to 1000 ImageNet classes. VGG16 was originally trained on the large-scale ImageNet dataset, which contains millions of images from 1000 different classes.

VGG16 was instrumental in showing that increasing the depth of a neural network can significantly improve its performance on image recognition tasks. However, the main drawback of VGG16 is its high number of parameters, making it computationally expensive and memory-intensive to train. Despite this limitation, VGG16 remains an essential benchmark architecture and has paved the way for even deeper and more efficient models in the field of computer vision, such as ResNet, DenseNet, and EfficientNet.
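
The structure described above can be walked through with plain arithmetic. The sketch below lists the convolutional part of VGG16 (filter counts per 3x3 convolution, with "M" marking a 2x2 max-pooling step) and computes the output size and parameter count for a given input size. It assumes "same" padding, so only pooling changes the spatial size; the helper name `conv_base_summary` is illustrative.

```python
# VGG16 convolutional configuration: integers are filter counts of 3x3
# convolutions, "M" is a 2x2 max-pooling layer with stride 2.
VGG16_CONV_CONFIG = [
    64, 64, "M",
    128, 128, "M",
    256, 256, 256, "M",
    512, 512, 512, "M",
    512, 512, 512, "M",
]

def conv_base_summary(input_size, in_channels=3):
    """Return (output spatial size, output channels, parameter count)."""
    size, channels, params = input_size, in_channels, 0
    for layer in VGG16_CONV_CONFIG:
        if layer == "M":
            size //= 2  # pooling halves width and height
        else:
            params += 3 * 3 * channels * layer + layer  # kernel weights + biases
            channels = layer
    return size, channels, params

cifar_summary = conv_base_summary(32)      # CIFAR-10 sized input
imagenet_summary = conv_base_summary(224)  # the input size VGG16 was designed for
```

For 32x32 CIFAR-10 images, the five pooling steps shrink the feature maps to 1x1 with 512 channels, whereas 224x224 inputs end at 7x7.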

Here are your tasks:

1. Load [VGG16](https://keras.io/api/applications/vgg/#vgg16-function) as a base model. Make sure to exclude the top layer.
2. Freeze all the layers in the base model. We'll use the frozen base model as a feature extractor whose outputs feed the trainable layers.

In [6]:
# Load the pre-trained VGG16 model (excluding the top classifier)
# weights='imagenet' loads weights that were already trained on the ImageNet dataset; the model is not trained on ImageNet here.
# Starting from pre-trained weights means the model does not have to learn from scratch, since ImageNet contains many images across a myriad of classes.
# include_top=False drops the fully connected layers at the top of the network and keeps only the convolutional layers,
# so that we can add our own classification layers for the target task.
VGG16_model = VGG16(weights='imagenet', include_top=False)

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/vgg16/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5
58889256/58889256 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step


In [7]:
# Freeze the layers in the base model
# Setting VGG16_model.trainable = False prevents the pre-trained convolutional layers from being updated during training.
# The VGG16 weights already encode strong image features because they were trained on a massive dataset.
# Freezing them saves time and reduces the risk of overfitting. Only the weights of the dense layers we add will be updated.
VGG16_model.trainable = False

Now, we'll add some trainable layers to the base model.

1. Using the base model, add a [GlobalAveragePooling2D](https://keras.io/api/layers/pooling_layers/global_average_pooling2d/) layer, followed by a [Dense](https://keras.io/api/layers/core_layers/dense/) layer with 256 units and ReLU activation. Finally, add a classification layer with 10 units, corresponding to the 10 CIFAR-10 classes, with softmax activation.
2. Create a Keras [Model](https://keras.io/api/models/model/) that takes in appropriate inputs and outputs.

In [8]:
# Add a global average pooling layer
# VGG16_model.output gives the feature maps of the last convolutional block, which encode detailed information about patterns in the input image.
# These maps are the result of passing the input through the stacked convolutional and pooling layers.
# GlobalAveragePooling2D() averages each channel's feature map into a single scalar, reducing the feature maps to a 1D vector per image.
# Each entry of that vector indicates how strongly a specific feature is present across the whole image: the higher the value, the more strongly that feature was detected.
# For example, a feature map that responds to dog ears would have a high average activation for an image in which dog ears are clearly visible.
model = GlobalAveragePooling2D()(VGG16_model.output)
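
Numerically, global average pooling is just a mean over the two spatial axes. The sketch below shows this on a hypothetical batch of random feature maps in NumPy; it illustrates the operation, not the Keras layer itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature maps: batch of 2 images, 4x4 spatial grid, 512 channels.
feature_maps = rng.random((2, 4, 4, 512))

# Average over height and width, leaving one value per channel per image.
pooled = feature_maps.mean(axis=(1, 2))
```

Since VGG16's five pooling steps reduce a 32x32 input to 1x1 feature maps, for CIFAR-10 sized inputs this averaging effectively just flattens the 512 channel values.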

In [9]:
# Add a fully connected layer with 256 units and ReLU activation
# The pooled 1D vector is passed into a Dense layer with 256 neurons.
# The ReLU activation adds nonlinearity by setting negative values to 0 and passing positive values through unchanged.
# This keeps the stronger positive activations and zeroes out the negative ones.
# This will help to classify the image.
fully_connected_layer=Dense(256, activation='relu')(model)

In [10]:
# Add the final classification layer with 10 units (for CIFAR-10 classes) and softmax activation
# The 256 activations are then passed into another Dense layer of 10 units, one for each class the image can belong to.
# The softmax activation turns the outputs into probabilities between 0 and 1 that sum to 1, one per class.
output=Dense(10, activation='softmax')(fully_connected_layer)
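
The two activation functions used in these layers are simple to write out. The following NumPy sketch, with hypothetical input values, shows ReLU zeroing negative values and softmax turning raw scores into probabilities that sum to 1. Subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result.

```python
import numpy as np

def relu(z):
    """Set negative values to 0 and keep positive values unchanged."""
    return np.maximum(z, 0.0)

def softmax(logits):
    """Convert raw scores into probabilities along the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

hidden = relu(np.array([-2.0, 0.0, 3.0]))
probs = softmax(np.array([[2.0, 1.0, 0.1]]))
```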

In [11]:
# Create the fine-tuned model
# The Model function ties everything together. VGG16_model.input is the input tensor of the base model, and
# output is the result of passing that input through the convolutional layers, the global average pooling layer, and the two Dense layers (256 and 10 units).
# Model(inputs=..., outputs=...) builds a single model from these inputs and outputs.
fine_tuned_model=Model(inputs=VGG16_model.input,outputs=output)
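
With the base frozen, only the two Dense layers are trained. A quick back-of-the-envelope count, assuming the pooled VGG16 output has 512 channels, shows how small the trainable part is compared with the roughly 14.7 million frozen convolutional parameters. The helper `dense_params` is illustrative.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

hidden_params = dense_params(512, 256)  # pooled features -> 256 units
output_params = dense_params(256, 10)   # 256 units -> 10 classes
trainable_params = hidden_params + output_params
```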

With your model complete it's time to train it and assess its performance.

1. Compile your model using an appropriate loss function. Feel free to play around with the optimizer, but a good starting optimizer might be Adam with a learning rate of 0.001.
2. Fit your model on the training data. Use the validation data to print the accuracy for each epoch. Try training for 10 epochs. Note, training can take a few hours so go ahead and grab a cup of coffee.

**Optional**: See if you can implement an [Early Stopping](https://keras.io/api/callbacks/early_stopping/) criteria as a callback function.

In [12]:
# Compile the model
# model.compile configures the model for training with a loss, an optimizer, and metrics.
# The loss is categorical_crossentropy because this is a multi-class problem with one-hot labels, not a binary one. The loss function
# tells us how well the model's predictions match the actual labels.
# The optimizer is Adam, which adjusts the model's weights in order to minimize the loss. The learning rate of 0.001 controls how large each weight update is.
# Accuracy is added as a metric in order to see how the model is doing. Accuracy is the number of correct predictions
# divided by the total number of predictions. The basic idea is that when the model makes predictions,
# the loss function measures how far the predictions are from the actual values. The optimizer then uses the gradient of the loss
# to adjust the trainable weights and reduce the error,
# and it keeps going until training ends, either after the last epoch or when early stopping triggers.
fine_tuned_model.compile(optimizer=Adam(learning_rate=0.001), loss='categorical_crossentropy', metrics=['accuracy'] )
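
To give the loss values in the training log some scale, here is a minimal NumPy sketch of categorical cross-entropy for one-hot labels. It is an illustration of the formula rather than Keras's implementation, and the example predictions are hypothetical. A model that assigns probability 0.1 to every one of the 10 classes has a loss of ln(10), about 2.303, so losses below that indicate the model has learned something.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean over samples of -sum(y_true * log(y_pred))."""
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    return float(-np.mean(np.sum(y_true * np.log(y_pred), axis=1)))

# Two hypothetical samples with true classes 3 and 7.
y_true = np.eye(10)[[3, 7]]

# Uniform guess: every class gets probability 0.1.
uniform = np.full((2, 10), 0.1)
chance_loss = categorical_crossentropy(y_true, uniform)

# Confident, correct predictions: 0.91 on the true class, 0.01 elsewhere.
confident = y_true * 0.9 + 0.01
confident_loss = categorical_crossentropy(y_true, confident)
```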

In [13]:
# Train the model
from tensorflow.keras.callbacks import EarlyStopping
# fit trains fine_tuned_model on the training data to predict the labels in y_train. The validation set (x_val, y_val) is used
# to evaluate the model after each epoch. The model passes over the training set up to 10 times since epochs=10. EarlyStopping monitors val_loss
# with patience=5, so training stops if the validation loss does not improve for 5 consecutive epochs. restore_best_weights=True restores
# the weights from the epoch with the best validation loss. batch_size=32 means 32 samples are processed per gradient update, and verbose=1
# displays the progress output below.
fine_tuned_model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=10, callbacks=[EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)],batch_size=32,verbose=1)

Epoch 1/10
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 657s 525ms/step - accuracy: 0.4697 - loss: 1.5202 - val_accuracy: 0.5698 - val_loss: 1.2365
Epoch 2/10
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 685s 527ms/step - accuracy: 0.5790 - loss: 1.1942 - val_accuracy: 0.5853 - val_loss: 1.1898
Epoch 3/10
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 641s 513ms/step - accuracy: 0.6038 - loss: 1.1225 - val_accuracy: 0.5963 - val_loss: 1.1532
Epoch 4/10
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 678s 510ms/step - accuracy: 0.6233 - loss: 1.0740 - val_accuracy: 0.6057 - val_loss: 1.1226
Epoch 5/10
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 705s 528ms/step - accuracy: 0.6360 - loss: 1.0389 - val_accuracy: 0.6097 - val_loss: 1.1193
Epoch 6/10
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 640s 512ms/step - accuracy: 0.6555 - loss: 0.9857 - val_accuracy: 0.6099 - val_loss:

<keras.src.callbacks.history.History at 0x7eb76aa6f2b0>
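
The patience rule behind early stopping can be sketched in a few lines of plain Python. This is a simplified illustration of the idea (no `min_delta`, no weight restoring), not the Keras callback itself, and `early_stopping_trace` is a hypothetical helper. It returns the zero-based epoch at which training would stop and the epoch with the best validation loss.

```python
def early_stopping_trace(val_losses, patience):
    """Return (epoch where training stops, epoch with the best loss)."""
    best_loss = float("inf")
    best_epoch = None
    wait = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Patience was never exhausted: training runs through the last epoch.
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses that stop improving after the second epoch.
stopped_at, best_epoch = early_stopping_trace([1.00, 0.90, 0.95, 0.97, 0.96], patience=2)

# The first five validation losses from the log above keep improving,
# so with patience=5 nothing triggers in those epochs.
log_result = early_stopping_trace([1.2365, 1.1898, 1.1532, 1.1226, 1.1193], patience=5)
```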

With your model trained, it's time to assess how well it performs on the test data.

1. Use your trained model to calculate the accuracy on the test set. Is the model performance better than random?
2. Experiment! See if you can tweak your model to improve performance.  

In [14]:
# Evaluate the model on the test set
# evaluate runs the model on x_test and y_test. test_loss measures how far the predicted probabilities are from the correct labels.
# test_accuracy is the fraction of test images the model classified correctly.
# The model reaches about 61 percent accuracy, well above the 10 percent expected from random guessing over 10 classes, but it needs more tuning to perform well.
# After multiple runs I keep getting around 60 percent. Since the task asks for 10 epochs, I believe that increasing the number of epochs
# to about 20 (rather than something very large like 100) could improve accuracy. Even with 100 epochs, EarlyStopping would stop training once the validation
# loss stops improving.
test_loss, test_accuracy = fine_tuned_model.evaluate(x_test, y_test, verbose=1)

313/313 ━━━━━━━━━━━━━━━━━━━━ 129s 411ms/step - accuracy: 0.6077 - loss: 1.1292

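
To answer the "better than random" question concretely, accuracy can be computed by comparing the predicted class (the argmax of the probabilities) with the true class and then set against the chance level of 1 in 10. The sketch below uses hypothetical predictions in NumPy; only the 0.6077 figure comes from the evaluation output above.

```python
import numpy as np

def accuracy(y_true_one_hot, y_prob):
    """Fraction of samples whose predicted class matches the true class."""
    true_classes = y_true_one_hot.argmax(axis=1)
    predicted_classes = y_prob.argmax(axis=1)
    return float(np.mean(true_classes == predicted_classes))

# Four hypothetical samples: three predicted correctly, one predicted as class 9.
y_true = np.eye(10)[[0, 1, 2, 3]]
y_prob = np.eye(10)[[0, 1, 2, 9]] * 0.9 + 0.01

example_accuracy = accuracy(y_true, y_prob)

# Random guessing over 10 balanced classes is right about 1 time in 10.
chance_level = 1 / 10
reported_test_accuracy = 0.6077  # from the evaluation output
beats_chance = reported_test_accuracy > chance_level
```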