## **Importing Libraries**
This code block is responsible for setting up the necessary environment and importing the required libraries for the project.

1. It imports the necessary libraries for the project, including:
   - `os`: For operating system-related functions.
   - `math`, `numpy`, `tensorflow`, `matplotlib`, and `scipy`: For various mathematical and scientific computing operations.
   - `sklearn.metrics`: For performance evaluation metrics like the classification report.
   - `tensorflow.keras`: For the deep learning framework, including models, layers, data generators, optimizers, and loss functions.
   - `tensorflow.keras.applications`: For pre-trained models like ResNet50 and ResNet152.

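2. It sets the `KERAS_BACKEND` environment variable to `"tensorflow"` so that Keras uses TensorFlow as its backend. The variable is set right after `import os` and before any Keras import, because the backend is selected when Keras is first imported.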

This code block sets up the foundational components required for the rest of the project, including the necessary libraries and the Keras backend. It provides the essential building blocks for the subsequent data preparation, model training, and evaluation steps of the project.

In [None]:
# Import necessary libraries
import os
os.environ["KERAS_BACKEND"] = "tensorflow"  # Set the Keras backend to TensorFlow
import math, numpy as np, tensorflow as tf, matplotlib as mpl, matplotlib.pyplot as plt
from scipy.special import softmax
from sklearn.metrics import classification_report
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.utils import load_img, img_to_array, array_to_img
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import categorical_crossentropy
from tensorflow.keras.applications import ResNet50, ResNet152

## **Image and Layer Configuration**

This code block sets the image size and the name of the last convolutional layer, which will be used in the project.

1. `img_size = (224, 224)`: This line sets the image size to 224x224 pixels, which is a common input size for many deep learning models, especially those based on pre-trained architectures like ResNet.

2. `last_conv_layer_name = "conv5_block3_out"`: This line sets the name of the last convolutional layer, which will be used in the Grad-CAM heatmap generation process. Grad-CAM (Gradient-weighted Class Activation Mapping) is a technique that visualizes the important regions of an image for a specific prediction, and it requires access to the activations of the last convolutional layer.

This code block establishes the necessary parameters for the image data and the Grad-CAM heatmap generation, which are crucial for the project's model training and interpretation tasks.
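
As a rough check on what Grad-CAM will receive from this layer, the arithmetic below estimates the spatial size of the `conv5_block3_out` feature map. It assumes the standard ResNet layout, in which the input is downsampled by a total stride of 32; the helper name `feature_map_size` is hypothetical and not part of the notebook.

```python
def feature_map_size(img_size, total_stride=32):
    """Spatial size of the final feature map, assuming a total stride of 32."""
    height, width = img_size
    return (height // total_stride, width // total_stride)

img_size = (224, 224)
print(feature_map_size(img_size))  # (7, 7)
```

Under this assumption the heatmap is only 7x7, which is why it has to be resized to the image resolution before it is overlaid on the picture.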

In [None]:
# Set image size and last convolutional layer name
img_size = (224, 224)  # Set the image size to 224x224 pixels
last_conv_layer_name = "conv5_block3_out"  # Set the name of the last convolutional layer

## **Data Generator Setup**

This code block sets up the data generators for the training, validation, and test sets, which will be used to load and preprocess the image data for the deep learning model.

1. `train_val_data_generator = ImageDataGenerator(...)`: This line creates a data generator for the training and validation sets. It rescales pixel values to the range [0, 1] and applies random augmentation: shearing, zooming, and horizontal flipping. The `validation_split=0.2` parameter reserves 20% of the data for the validation set. Because both subsets are drawn from this one generator, the validation images are augmented as well.

2. `test_data_generator = ImageDataGenerator(rescale=1./255)`: This line creates a data generator for the test set, only applying rescaling to the pixel values.

3. `train_generator = train_val_data_generator.flow_from_directory(...)`: This line creates the training data generator by loading the images from the specified `'./image-net-data/training-set'` directory, setting the target size to 224x224 pixels, the batch size to 32, and the class mode to 'categorical'. The `subset='training'` parameter specifies that this generator should use the training subset of the data.

4. `validation_generator = train_val_data_generator.flow_from_directory(...)`: This line creates the validation data generator using the same training and validation directory, but with the `subset='validation'` parameter to use the validation subset of the data.

5. `test_generator = test_data_generator.flow_from_directory(...)`: This line creates the test data generator by loading the images from the specified `'./image-net-data/testing-set'` directory, with the same target size and batch size as the training and validation generators. `class_mode='categorical'` yields one-hot encoded labels, and `shuffle=False` keeps the images in a fixed order so that predictions can be matched to their true labels during evaluation.

This code block sets up the necessary data generators for the project, ensuring that the image data is properly loaded, preprocessed, and organized for the subsequent model training and evaluation steps.
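
To make two of these preprocessing steps concrete, the sketch below applies a `1/255` rescale and a horizontal flip to a tiny single-channel image stored as nested lists. It only illustrates the idea and is not how `ImageDataGenerator` is implemented; the function names and pixel values are hypothetical.

```python
def rescale(image, factor=1.0 / 255):
    """Multiply every pixel by `factor`, mapping 0-255 values into 0-1."""
    return [[pixel * factor for pixel in row] for row in image]

def horizontal_flip(image):
    """Mirror the image left to right by reversing each row."""
    return [row[::-1] for row in image]

# Hypothetical 2x3 grayscale image with 8-bit pixel values
image = [[0, 51, 255],
         [102, 204, 153]]

scaled = rescale(image)
flipped = horizontal_flip(scaled)
print(flipped)
```

In the real generator the flip is applied at random per image, whereas the rescale is applied to every image.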

In [None]:
# Define data generators for training, validation, and testing
train_val_data_generator = ImageDataGenerator(rescale=1./255, shear_range=0.2, zoom_range=0.2, horizontal_flip=True, validation_split=0.2)  # Define the training and validation data generator
test_data_generator = ImageDataGenerator(rescale=1./255)  # Define the test data generator

# Create generators for training, validation, and test sets
train_generator = train_val_data_generator.flow_from_directory('./image-net-data/training-set', target_size=(224, 224), batch_size=32, class_mode='categorical', subset='training')  # Create the training data generator
validation_generator = train_val_data_generator.flow_from_directory('./image-net-data/training-set', target_size=(224, 224), batch_size=32, class_mode='categorical', subset='validation')  # Create the validation data generator
test_generator = test_data_generator.flow_from_directory('./image-net-data/testing-set', target_size=(224, 224), batch_size=32, class_mode='categorical', shuffle=False)  # Create the test data generator

## **Class Index Verification and Tag-Index Mapping**

This code block ensures that the class indices are consistent across the training, validation, and test data generators, and it creates mappings between the class tags and their corresponding indices.

1. `assert train_generator.class_indices == test_generator.class_indices == validation_generator.class_indices`: This line verifies that the class indices (the mapping between class labels and their corresponding integer indices) are the same across the training, validation, and test data generators. This is an important check to ensure that the data is properly organized and that the class labels match across the different sets.

2. `tag_to_idx = train_generator.class_indices`: This line gets the mapping of class tags (the actual class labels) to their corresponding indices from the training data generator and stores it in the `tag_to_idx` dictionary.

3. `idx_to_tag = {v: k for k, v in tag_to_idx.items()}`: This line creates the inverse mapping, from indices to class tags, by iterating over the `tag_to_idx` dictionary and swapping the keys and values.

These mappings between class tags and indices will be used later in the project, for example, when converting the predicted labels to their corresponding class names for reporting and visualization purposes.
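
The mapping and its inverse can be sketched without any image data. The class names below are hypothetical; the sketch assumes, as `flow_from_directory` does, that indices follow the sorted order of the class subdirectory names.

```python
# Hypothetical class subdirectory names
class_names = ["dog", "cat", "bird"]

# Indices are assigned in sorted order of the names
tag_to_idx = {name: i for i, name in enumerate(sorted(class_names))}

# Swap keys and values to get the inverse mapping
idx_to_tag = {v: k for k, v in tag_to_idx.items()}

# Convert a one-hot label back to its class tag
one_hot = [0.0, 0.0, 1.0]
tag = idx_to_tag[one_hot.index(max(one_hot))]
print(tag_to_idx)  # {'bird': 0, 'cat': 1, 'dog': 2}
print(tag)         # dog
```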

In [None]:
# Ensure that the class indices are the same across all generators
assert train_generator.class_indices == test_generator.class_indices == validation_generator.class_indices  # Verify that the class indices are the same
tag_to_idx = train_generator.class_indices  # Get the mapping of class tags to indices
idx_to_tag = {v: k for k, v in tag_to_idx.items()}  # Create the inverse mapping of indices to class tags

## **Loading Data into Memory**

This code block loads the training, validation, and test data into memory, preparing it for the subsequent model training and evaluation steps.

1. `train_set_images, train_set_labels = [], []`: This line initializes two empty lists to store the training set images and labels.

2. `steps_per_epoch_train = train_generator.samples // train_generator.batch_size + (1 if train_generator.samples % train_generator.batch_size else 0)`: This line calculates the number of steps (batches) per training epoch, based on the total number of training samples and the batch size.

3. `for _ in range(steps_per_epoch_train):`: This loop iterates through the training set, retrieving the next batch of images and labels using the `next(train_generator)` call, and appending them to the `train_set_images` and `train_set_labels` lists, respectively.

4. The code then repeats a similar process for the validation and test sets, creating the `validation_set_images`, `validation_set_labels`, `test_set_images`, and `test_set_labels` lists.

This code block loads one full pass of the training, validation, and test datasets into memory as lists of batches. Caching the batches lets the custom training and evaluation loops iterate over them directly. One consequence is that the random augmentations are sampled only once, so every epoch sees the same augmented training images.
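
The step count used here is a ceiling division, and it implies that the last batch can be smaller than the others. The sketch below reproduces the calculation for a hypothetical set of 100 samples; `batch_sizes` is an illustrative helper, not part of the notebook.

```python
import math

def steps_per_epoch(samples, batch_size):
    """Number of batches needed to cover every sample once."""
    return samples // batch_size + (1 if samples % batch_size else 0)

def batch_sizes(samples, batch_size):
    """Size of each batch in one pass; the last one may be partial."""
    sizes = [batch_size] * (samples // batch_size)
    if samples % batch_size:
        sizes.append(samples % batch_size)
    return sizes

samples, batch_size = 100, 32  # hypothetical sample count
print(steps_per_epoch(samples, batch_size))  # 4
print(batch_sizes(samples, batch_size))      # [32, 32, 32, 4]
```

The expression is equivalent to `math.ceil(samples / batch_size)`.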

In [None]:
# Load training, validation, and test data into memory
train_set_images, train_set_labels = [], []  # Initialize the training set images and labels lists
steps_per_epoch_train = train_generator.samples // train_generator.batch_size + (1 if train_generator.samples % train_generator.batch_size else 0)  # Calculate the number of steps per training epoch
for _ in range(steps_per_epoch_train):  # Iterate through the training set
    images, labels = next(train_generator)  # Get the next batch of images and labels
    train_set_images.append(images)  # Append the images to the training set images list
    train_set_labels.append(labels)  # Append the labels to the training set labels list

validation_set_images, validation_set_labels = [], []  # Initialize the validation set images and labels lists
steps_per_epoch_validation = validation_generator.samples // validation_generator.batch_size + (1 if validation_generator.samples % validation_generator.batch_size else 0)  # Calculate the number of steps per validation epoch
for _ in range(steps_per_epoch_validation):  # Iterate through the validation set
    images, labels = next(validation_generator)  # Get the next batch of images and labels
    validation_set_images.append(images)  # Append the images to the validation set images list
    validation_set_labels.append(labels)  # Append the labels to the validation set labels list

test_set_images, test_set_labels = [], []  # Initialize the test set images and labels lists
steps_per_epoch_test = test_generator.samples // test_generator.batch_size + (1 if test_generator.samples % test_generator.batch_size else 0)  # Calculate the number of steps per test epoch
for _ in range(steps_per_epoch_test):  # Iterate through the test set
    images, labels = next(test_generator)  # Get the next batch of images and labels
    test_set_images.append(images)  # Append the images to the test set images list
    test_set_labels.append(labels)  # Append the labels to the test set labels list

## **Data Dimension Verification**

This code block verifies that the loaded training, validation, and test data have the expected dimensions and that the batches have the correct number of samples.

1. `assert len(train_set_images) == len(train_set_labels)`: This line ensures that the number of training images and the number of training labels are the same, as they should be.

2. `assert len(validation_set_images) == len(validation_set_labels)`: This line ensures that the number of validation images and the number of validation labels are the same.

3. `assert len(test_set_images) == len(test_set_labels)`: This line ensures that the number of test images and the number of test labels are the same.

4. `assert len(train_set_images[0]) == len(validation_set_images[0]) == len(test_set_images[0]) == len(train_set_labels[0]) == len(validation_set_labels[0]) == len(test_set_labels[0]) == 32`: This line verifies that the first batch of each set (training, validation, and test) contains 32 images and 32 labels, as specified in the data generator configuration. Only the first batch is checked; the last batch of a set can be smaller when the sample count is not a multiple of 32.

5. `print(len(test_set_images), len(train_set_images), len(validation_set_images))`: This line prints the lengths of the test, training, and validation sets, which can be useful for monitoring the data loading process.

These assertions and the print statement ensure that the loaded data has the expected dimensions and that the batches have the correct number of samples. This is an important step to verify the integrity of the data before proceeding with the model training and evaluation.

In [None]:
# Ensure that the data has the expected dimensions
assert len(train_set_images) == len(train_set_labels)  # Verify that the training set images and labels have the same length
assert len(validation_set_images) == len(validation_set_labels)  # Verify that the validation set images and labels have the same length
assert len(test_set_images) == len(test_set_labels)  # Verify that the test set images and labels have the same length
assert len(train_set_images[0]) == len(validation_set_images[0]) == len(test_set_images[0]) == len(train_set_labels[0]) == len(validation_set_labels[0]) == len(test_set_labels[0]) == 32  # Verify that the first batch of each set has 32 samples

# Print the lengths of the test, train, and validation sets
print(len(test_set_images), len(train_set_images), len(validation_set_images))  # Print the lengths of the test, train, and validation sets

## **Teacher Model Training and Evaluation**

This code block is responsible for training and evaluating the teacher model, which is a larger and more complex model compared to the student model.

1. `taecher_training = True`: This line sets a flag to determine whether the teacher model should be trained or if a pre-trained model should be loaded.

2. `if taecher_training:`: This conditional block runs if the `taecher_training` flag is set to `True`.

3. `base_teacher_model = ResNet152(...)`: This line loads the base ResNet152 model with pre-trained ImageNet weights, but without the top layer.

4. `for layer in base_teacher_model.layers: layer.trainable = False`: This loop freezes the layers of the base model to prevent them from being trained.

5. The code then adds a global average pooling layer and a dense layer with softmax activation to create the final teacher model.

6. `teacher_model = Model(...)`: This line creates the teacher model and compiles it with the Adam optimizer, categorical cross-entropy loss, and accuracy metric.

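7. The code then trains the teacher model for 50 epochs in a custom `tf.GradientTape` loop, printing the training loss and accuracy every 10 batches and computing the validation loss and accuracy at the end of each epoch. Because the base layers are frozen, only the new dense layer is updated.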

8. `teacher_model.save('./models/teacher_model.h5')`: This line saves the trained teacher model to a file.

9. `else: teacher_model = load_model('./models/teacher_model.h5')`: If the `taecher_training` flag is set to `False`, this block loads the pre-trained teacher model from the saved file.

This code block establishes the teacher model, which is a crucial component of the knowledge distillation process. It trains a new classification head on top of the frozen, ImageNet-pretrained ResNet152 base and saves the result, so that the student model can later be trained to mimic the teacher's predictions and gradient-based explanations.
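
The per-batch numbers printed by the loop are a mean cross-entropy and an argmax-match accuracy. The sketch below computes both on plain Python lists so the definitions are easy to inspect; the predictions and labels are made up, and the function names are hypothetical.

```python
import math

def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])

def batch_accuracy(preds, labels):
    """Fraction of samples whose predicted class matches the one-hot label."""
    hits = [argmax(p) == argmax(y) for p, y in zip(preds, labels)]
    return sum(hits) / len(hits)

def batch_cross_entropy(preds, labels, eps=1e-7):
    """Mean categorical cross-entropy over the batch."""
    losses = [-sum(y * math.log(max(p, eps)) for p, y in zip(pred, label))
              for pred, label in zip(preds, labels)]
    return sum(losses) / len(losses)

# Made-up probabilities and one-hot labels for a batch of four samples
preds = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.4, 0.5, 0.1], [0.2, 0.7, 0.1]]
labels = [[1, 0, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]

print(batch_accuracy(preds, labels))  # 0.75
print(round(batch_cross_entropy(preds, labels), 2))
```

Averaging these per-batch values over the validation batches, as the loop does, weights a smaller final batch the same as a full one, so the result is an approximation of the per-sample average.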

In [None]:
# Check if teacher training is enabled
taecher_training = True  # Set this flag to True to enable teacher training, False to load pre-trained model

# If teacher training is enabled
if taecher_training:
    # Load the base ResNet152 model with ImageNet weights, without the top layer
    base_teacher_model = ResNet152(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

    # Freeze the layers of the base model to prevent them from being trained
    for layer in base_teacher_model.layers:
        layer.trainable = False

    # Add a global average pooling layer and a dense layer with softmax activation
    x = base_teacher_model.output
    x = GlobalAveragePooling2D()(x)
    predictions = Dense(len(tag_to_idx), activation='softmax')(x)

    # Create the teacher model and compile it
    teacher_model = Model(inputs=base_teacher_model.input, outputs=predictions)
    teacher_model.compile(optimizer=Adam(learning_rate=0.01), loss='categorical_crossentropy', metrics=['accuracy'])

    # Train the teacher model for 50 epochs
    epochs = 50
    for epoch in range(epochs):
        print(f"Epoch {epoch+1}/{epochs}")
        for batch, (images, labels) in enumerate(zip(train_set_images, train_set_labels)):
            with tf.GradientTape() as tape:
                preds = teacher_model(images, training=False)
                loss = categorical_crossentropy(labels, preds)

            grads = tape.gradient(loss, teacher_model.trainable_variables)
            teacher_model.optimizer.apply_gradients(zip(grads, teacher_model.trainable_variables))
            accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(preds, axis=1), tf.argmax(labels, axis=1)), tf.float32))

            if (batch + 1) % 10 == 0:
                print(f'Batch {batch + 1} - Training Loss: {round(float(loss.numpy().mean()), 2)} | Training Accuracy: {round(float(accuracy.numpy()), 2)}')

        # Compute the validation loss and accuracy
        validation_loss, validation_accuracy = 0, 0
        for (validation_images, validation_labels) in zip(validation_set_images, validation_set_labels):
            validation_preds = teacher_model(validation_images, training=False)
            validation_loss += categorical_crossentropy(validation_labels, validation_preds).numpy().mean()
            validation_accuracy += float(tf.reduce_mean(tf.cast(tf.equal(tf.argmax(validation_preds, axis=1), tf.argmax(validation_labels, axis=1)), tf.float32)).numpy())

        validation_loss = round(validation_loss / len(validation_set_images), 2)
        validation_accuracy = round(validation_accuracy / len(validation_set_images), 2)

        print(f"Epoch {epoch + 1} Finished! Validation Loss: {validation_loss} | Validation Accuracy: {validation_accuracy}")

    # Save the trained teacher model
    teacher_model.save('./models/teacher_model.h5')

# If teacher training is not enabled, load the pre-trained teacher model
else:
    teacher_model = load_model('./models/teacher_model.h5')

## **Evaluating the Teacher Model on the Test Set**

This code block is responsible for evaluating the performance of the teacher model on the test set.

1. `y_test_true, y_test_pred = [], []`: This line initializes two empty lists to store the true and predicted labels for the test set.

2. `for (test_images, test_labels) in zip(test_set_images, test_set_labels):`: This loop iterates through the test set images and labels.

3. `test_preds = teacher_model(test_images, training=False)`: This line makes predictions using the teacher model for the current batch of test images.

4. `y_test_pred.extend(list(tf.argmax(test_preds, axis=1)))`: This line appends the predicted labels (the index of the maximum value in the prediction vector) to the `y_test_pred` list.

5. `y_test_true.extend(list(test_labels))`: This line appends the true labels to the `y_test_true` list.

6. `y_test_pred = [idx_to_tag[x.numpy()] for x in y_test_pred]`: This line converts the predicted labels (indices) to their corresponding class tags using the `idx_to_tag` dictionary.

7. `y_test_true = [idx_to_tag[np.argmax(x)] for x in y_test_true]`: This line converts the true labels (one-hot encoded) to their corresponding class tags using the `idx_to_tag` dictionary.

8. `print(classification_report(y_true=y_test_true, y_pred=y_test_pred))`: This line prints the classification report for the test set, comparing the true and predicted labels.

This code block evaluates the performance of the teacher model on the test set, providing valuable insights into the model's accuracy and ability to classify the test images correctly. The classification report generated at the end can be used to assess the teacher model's effectiveness and guide further model improvements or the training of the student model.
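
For readers unfamiliar with the report's columns, the sketch below computes per-class precision and recall from two lists of class tags. It is a simplified illustration, not scikit-learn's implementation (the real report also includes F1 scores, support, and averages); the tags and the helper name `precision_recall` are hypothetical.

```python
def precision_recall(y_true, y_pred, tag):
    """Precision and recall for one class, given parallel lists of tags."""
    true_pos = sum(t == tag and p == tag for t, p in zip(y_true, y_pred))
    predicted = sum(p == tag for p in y_pred)  # how often the class was predicted
    actual = sum(t == tag for t in y_true)     # how often the class truly occurs
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall

# Hypothetical true and predicted tags for five test images
y_true = ["cat", "cat", "dog", "dog", "bird"]
y_pred = ["cat", "dog", "dog", "dog", "bird"]

for tag in sorted(set(y_true)):
    print(tag, precision_recall(y_true, y_pred, tag))
```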

In [None]:
# Initialize empty lists to store the true and predicted labels for the test set
y_test_true, y_test_pred = [], []

# Iterate through the test set images and labels
for (test_images, test_labels) in zip(test_set_images, test_set_labels):
    # Get the predictions from the teacher model for the test images
    test_preds = teacher_model(test_images, training=False)
    # Append the predicted labels (index of the maximum value) to the y_test_pred list
    y_test_pred.extend(list(tf.argmax(test_preds, axis=1)))
    # Append the true labels to the y_test_true list
    y_test_true.extend(list(test_labels))

# Convert the predicted and true labels from indices to class tags
y_test_pred = [idx_to_tag[x.numpy()] for x in y_test_pred]
y_test_true = [idx_to_tag[np.argmax(x)] for x in y_test_true]

# Print the classification report for the test set
print(classification_report(y_true=y_test_true, y_pred=y_test_pred))

## **Visualizing a Random Test Image and Its True Label**

This code block selects a random image from the test set, displays the image, and shows its true label as the plot title.

1. `random_batch_number = np.random.choice(len(test_set_images))`: This line selects a random batch index from the test set images.

2. `random_image_number = np.random.choice(len(test_set_images[random_batch_number]))`: This line selects a random image index within the chosen batch.

3. `selected_image = np.expand_dims(test_set_images[random_batch_number][random_image_number], 0)`: This line retrieves the selected image from the test set and expands its dimensions to match the expected input shape of the model (batch size of 1).

4. `selected_label = test_set_labels[random_batch_number][random_image_number]`: This line retrieves the true label for the selected image.

5. `plt.imshow(selected_image.squeeze(0))`: This line displays the selected image.

6. `plt.title(idx_to_tag[np.argmax(selected_label)])`: This line sets the plot title to the true label of the selected image, using the `idx_to_tag` dictionary to convert the index of the one-hot label to the corresponding class tag.

7. `plt.show()`: This line shows the image with its true label.

This code block is useful for quickly checking a randomly selected test image against its ground-truth label, which confirms that images and labels were loaded and mapped correctly. No model prediction is involved at this step.

In [None]:
# Select a random batch and image index from the test set
random_batch_number = np.random.choice(len(test_set_images))
random_image_number = np.random.choice(len(test_set_images[random_batch_number]))

# Retrieve the selected image and label
selected_image = np.expand_dims(test_set_images[random_batch_number][random_image_number], 0)
selected_label = test_set_labels[random_batch_number][random_image_number]

# Display the selected image with its true label as the title
plt.imshow(selected_image.squeeze(0))
plt.title(idx_to_tag[np.argmax(selected_label)])
plt.show()

## **Grad-CAM Heatmap Generation and Visualization**

This code block contains two functions: `make_gradcam_heatmap` and `display_gradcam`, which are responsible for generating and visualizing Grad-CAM (Gradient-weighted Class Activation Mapping) heatmaps.

1. `make_gradcam_heatmap(img_array, model, last_conv_layer_name)`:
   - This function takes an input image, a model, and the name of the last convolutional layer as input.
   - It first creates a new model that maps the input image to the activations of the last convolutional layer and the output predictions.
   - It then computes the gradient of the top predicted class with respect to the activations of the last convolutional layer.
   - It averages these gradients over the batch and spatial dimensions, which gives one importance weight per feature map channel.
   - It multiplies each channel of the feature map by its weight and sums over the channels to obtain the heatmap.
   - It clips negative values to zero and divides by the maximum, which normalizes the heatmap to the range 0-1 for visualization purposes.
   - Finally, it returns the normalized heatmap.

2. `display_gradcam(img, heatmap, alpha=0.4)`:
   - This function takes an input image, the generated heatmap, and an optional alpha value (for superimposing the heatmap) as input.
   - It rescales the heatmap to the range 0-255.
   - It uses the 'jet' colormap to colorize the heatmap.
   - It superimposes the colorized heatmap on the original image.
   - It displays the resulting image.

These functions are used to generate and visualize the Grad-CAM heatmaps, which highlight the important regions of an image that contribute the most to the model's predictions. This technique provides valuable interpretability and explainability for the deep learning model.
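
The core of the heatmap computation can be followed on a tiny example. The sketch below mirrors the steps described for `make_gradcam_heatmap` on nested lists shaped `[height][width][channels]`, with made-up activations and gradients instead of a real model. The name `gradcam_heatmap_sketch` is hypothetical, and the guard against an all-zero heatmap is an addition that the notebook function does not have.

```python
def gradcam_heatmap_sketch(activations, gradients):
    """Grad-CAM weighting on nested lists shaped [height][width][channels]."""
    height, width = len(activations), len(activations[0])
    channels = len(activations[0][0])

    # One weight per channel: the mean gradient over all spatial positions
    weights = [
        sum(gradients[i][j][c] for i in range(height) for j in range(width))
        / (height * width)
        for c in range(channels)
    ]

    # Weighted sum over channels at each position, with negatives clipped to 0
    heatmap = [
        [max(sum(w * a for w, a in zip(weights, activations[i][j])), 0.0)
         for j in range(width)]
        for i in range(height)
    ]

    # Normalize to 0-1 (skipped when the map is all zeros; this guard is an addition)
    peak = max(max(row) for row in heatmap)
    if peak > 0:
        heatmap = [[value / peak for value in row] for row in heatmap]
    return heatmap

# Made-up 2x2 feature map with two channels
activations = [[[1.0, 0.0], [2.0, 1.0]],
               [[0.0, 3.0], [4.0, 0.0]]]
gradients = [[[0.5, -1.0], [0.5, -1.0]],
             [[0.5, -1.0], [0.5, -1.0]]]

print(gradcam_heatmap_sketch(activations, gradients))  # [[0.25, 0.0], [0.0, 1.0]]
```

Positions dominated by the negatively weighted second channel drop to zero, while the position with the strongest positively weighted activation becomes the peak.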

In [None]:
def make_gradcam_heatmap(img_array, model, last_conv_layer_name):
    # First, we create a model that maps the input image to the activations of the last conv layer as well as the output predictions
    grad_model = Model(model.inputs, [model.get_layer(last_conv_layer_name).output, model.output])

    # Then, we compute the gradient of the top predicted class for our input image with respect to the activations of the last conv layer
    with tf.GradientTape() as tape:
        last_conv_layer_output, preds = grad_model(img_array)
        pred_index = tf.argmax(preds[0])
        class_channel = preds[:, pred_index]

    # This is the gradient of the output neuron (top predicted or chosen) with regard to the output feature map of the last conv layer
    grads = tape.gradient(class_channel, last_conv_layer_output)

    # This is a vector where each entry is the mean intensity of the gradient over a specific feature map channel
    pooled_grads = tf.reduce_mean(grads, axis=(0, 1, 2))

    # We multiply each channel in the feature map array by "how important this channel is" with regard to the top predicted class then sum all the channels to obtain the heatmap
    last_conv_layer_output = last_conv_layer_output[0]
    heatmap = last_conv_layer_output @ pooled_grads[..., tf.newaxis]
    heatmap = tf.squeeze(heatmap)

    # For visualization purpose, we will also normalize the heatmap between 0 & 1
    heatmap = tf.maximum(heatmap, 0) / tf.math.reduce_max(heatmap)
    return heatmap.numpy()

def display_gradcam(img, heatmap, alpha=0.4):
    # `img` is a batch of one image with pixel values in [0, 1]:
    # drop the batch dimension and rescale to the range 0-255
    img = img.squeeze(0) * 255
    # Rescale heatmap to a range 0-255
    heatmap = np.uint8(255 * heatmap)

    # Use jet colormap to colorize heatmap
    jet = mpl.colormaps["jet"]

    # Use RGB values of the colormap
    jet_colors = jet(np.arange(256))[:, :3]
    jet_heatmap = jet_colors[heatmap]

    # Create an image with RGB colorized heatmap
    jet_heatmap = array_to_img(jet_heatmap)
    jet_heatmap = jet_heatmap.resize((img.shape[1], img.shape[0]))
    jet_heatmap = img_to_array(jet_heatmap)

    # Superimpose the heatmap on original image
    superimposed_img = jet_heatmap * alpha + img
    superimposed_img = array_to_img(superimposed_img)

    # Display Grad CAM
    plt.imshow(superimposed_img)

    return None

## **Visualizing Predictions and Grad-CAM for a Selected Test Image**

This code block visualizes the predictions and Grad-CAM heatmap for a selected test image using the teacher model.

1. `plt.imshow(selected_image.squeeze(0))`: This line displays the selected test image.

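2. `teacher_model.layers[-1].activation = None`: This line removes the softmax activation from the last layer of the teacher model, so the model outputs raw logits. Grad-CAM then differentiates the class logit rather than a softmax probability. The change persists, so the teacher returns logits from this point on, which is why softmax is applied explicitly in the next step.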

3. `preds = softmax(teacher_model.predict(selected_image))`: This line makes a prediction using the teacher model for the selected image and applies the softmax function to the output to obtain the probability distribution.

4. `print(preds, idx_to_tag[np.argmax(preds)])`: This line prints the predicted probabilities and the class tag of the top predicted class, using the `idx_to_tag` dictionary to map the index to the corresponding class label.

5. `heatmap = make_gradcam_heatmap(selected_image, teacher_model, last_conv_layer_name)`: This line generates the Grad-CAM heatmap for the selected image using the `make_gradcam_heatmap` function and the teacher model.

6. `plt.matshow(heatmap)`: This line displays the Grad-CAM heatmap.

7. `plt.show()`: This line shows the displayed plots.

8. `display_gradcam(selected_image, heatmap)`: This line displays the Grad-CAM visualization, which superimposes the heatmap on the original image.

This code block provides a way to visualize the predictions and the Grad-CAM heatmap for a selected test image using the teacher model. This can be useful for interpreting the model's decision-making process and identifying the important regions of the image that contribute to the prediction.

In [None]:
# Display the selected image
plt.imshow(selected_image.squeeze(0))

# Remove last layer's softmax
teacher_model.layers[-1].activation = None

# Print what the top predicted class is
preds = softmax(teacher_model.predict(selected_image))
print(preds, idx_to_tag[np.argmax(preds)])

# Generate class activation heatmap
heatmap = make_gradcam_heatmap(selected_image, teacher_model, last_conv_layer_name)

# Display heatmap
plt.matshow(heatmap)
plt.show()

# Display the Grad-CAM visualization
display_gradcam(selected_image, heatmap)

## **Knowledge Distillation Loss Calculation**

This code block contains two functions: `soft_labels` and `distillation_loss`, which are integral to the knowledge distillation process where a smaller student model is trained to emulate the behavior of a larger teacher model.

1. `soft_labels(logits, temperature)`:
   - This function is designed to soften the logits (the vector of raw predictions that a classification model generates) produced by the teacher model.
   - It takes the logits and a temperature parameter as inputs. The temperature controls the smoothness of the probabilities.
   - By applying the softmax function with temperature scaling to the logits, it generates a softer probability distribution over the classes.
   - The resulting soft labels are used to guide the student model during training, providing information about the relative probabilities of incorrect classes.

2. `distillation_loss(y_true, y_pred, y_soft, temperature)`:
   - This function calculates the distillation loss, which is a composite loss function used during the training of the student model.
   - It takes the true labels (`y_true`), the student model's predictions (`y_pred`), the soft labels from the teacher model (`y_soft`), and the temperature as inputs.
   - The function computes two types of losses: the hard loss (`loss_hard`) and the soft loss (`loss_soft`).
   - The hard loss is the categorical cross-entropy between the true labels and the student model's predictions, reflecting the accuracy of the student model on the actual task.
   - The soft loss is the categorical cross-entropy between the soft labels from the teacher model and the softened predictions of the student model, encouraging the student to mimic the teacher's output distribution.
   - The total distillation loss is the sum of the hard and soft losses, and it is used to optimize the student model's parameters.

These functions are crucial for implementing knowledge distillation, allowing the student model to learn both from the ground truth and the additional insights provided by the teacher model's soft labels. This approach can lead to improved performance of the student model, as it leverages the rich information contained in the teacher model's predictions.
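
The effect of the temperature and the structure of the combined loss can be checked with plain Python. The sketch below is a minimal illustration under one stated assumption: both the teacher and the student outputs are raw logits. The function names and logit values are hypothetical, and the two terms are summed without weighting, as in the description above.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature, shifted by the max for stability."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)
    exps = [math.exp(z - shift) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, probs, eps=1e-7):
    """Categorical cross-entropy between a target distribution and probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, probs))

def distillation_loss_sketch(y_true, student_logits, teacher_logits, temperature):
    """Hard loss on true labels plus soft loss on temperature-softened outputs."""
    loss_hard = cross_entropy(y_true, softmax_with_temperature(student_logits))
    y_soft = softmax_with_temperature(teacher_logits, temperature)
    student_soft = softmax_with_temperature(student_logits, temperature)
    loss_soft = cross_entropy(y_soft, student_soft)
    return loss_hard + loss_soft

# Made-up logits for a three-class problem (assumption: raw logits, not probabilities)
teacher_logits = [4.0, 1.0, 0.5]
student_logits = [2.0, 1.5, 0.2]
y_true = [1.0, 0.0, 0.0]

sharp = softmax_with_temperature(teacher_logits, temperature=1.0)
soft = softmax_with_temperature(teacher_logits, temperature=5.0)
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in soft])
print(distillation_loss_sketch(y_true, student_logits, teacher_logits, temperature=5.0))
```

A higher temperature spreads probability mass across the non-target classes, which is the extra signal the student receives from the teacher.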

In [None]:
def soft_labels(logits, temperature):
    """
    Description: Soften the output logits of the teacher model.
    """
    # Apply the softmax function to the scaled logits. The temperature parameter
    # controls the softness of the probability distribution. A higher temperature
    # results in a softer distribution (i.e., less confident predictions).
    return tf.nn.softmax(logits / temperature)

def distillation_loss(y_true, y_pred, y_soft, temperature):
    """
    Description: Calculate the distillation loss.
    """
    # Calculate the 'hard' loss using the true labels and the predictions of the student model.
    # This part of the loss ensures the student model learns the correct class predictions.
    loss_hard = categorical_crossentropy(y_true, y_pred)

    # Calculate the 'soft' loss using the soft labels from the teacher model and the softened
    # predictions of the student model. The temperature is used again to soften the student's
    # predictions. This part of the loss helps the student model to approximate the teacher's
    # probability distribution.
    loss_soft = categorical_crossentropy(y_soft, tf.nn.softmax(y_pred / temperature))

    # The total distillation loss is the sum of the hard and soft losses. This combined loss
    # is used to train the student model not just to predict the correct classes, but also to
    # produce similar probability distributions to the teacher model.
    return loss_hard + loss_soft
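
As a worked illustration of the two loss terms, here is a small pure-Python sketch. It assumes that pre-softmax logits are available for both the student and the teacher, which differs from the cell above, where `y_pred` holds softmax outputs. All names and numbers are hypothetical.

```python
import math

def softmax(values, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [v / temperature for v in values]
    largest = max(scaled)
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, probs, eps=1e-7):
    """Categorical cross-entropy; eps guards against log(0)."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, probs))

def distillation_loss_from_logits(y_true, student_logits, teacher_logits, temperature):
    # Hard term: true labels vs. the student's ordinary (T = 1) probabilities.
    loss_hard = cross_entropy(y_true, softmax(student_logits))
    # Soft term: softened teacher distribution vs. softened student distribution.
    y_soft = softmax(teacher_logits, temperature)
    loss_soft = cross_entropy(y_soft, softmax(student_logits, temperature))
    return loss_hard + loss_soft

y_true = [1, 0, 0]
teacher_logits = [4.0, 1.0, 0.5]

# A student that agrees with the teacher vs. one that ranks the classes in reverse.
loss_matching = distillation_loss_from_logits(y_true, [4.0, 1.0, 0.5], teacher_logits, 5.0)
loss_reversed = distillation_loss_from_logits(y_true, [0.5, 1.0, 4.0], teacher_logits, 5.0)
print(round(loss_matching, 3), round(loss_reversed, 3))
```

The matching student receives the lower loss, since both the hard term and the soft term penalize the reversed ranking.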

## **Student Model Architecture Definition**

This code block defines the architecture of the student model, a smaller neural network that will be trained using knowledge distillation from a larger teacher model.

The student is based on ResNet50, a popular convolutional neural network for image classification, initialized here with random weights. The model is adapted to the dataset by setting the input shape, dropping the original top classifier (`include_top=False`), and adding global average pooling followed by a new dense layer with as many units as there are classes in the dataset (`len(tag_to_idx)`). The softmax activation ensures that the output can be interpreted as a probability distribution over the classes. This student model will later be trained using the soft labels and potentially the Grad-CAM heatmaps generated by the teacher model to improve its performance.

In [None]:
# Initialize a ResNet50 model with random weights and without the top classification layer.
# The input shape is set for images of size 224x224 with 3 color channels (RGB).
base_student_model = ResNet50(weights=None, include_top=False, input_shape=(224, 224, 3))

# Retrieve the output of the last layer of the base student model.
x = base_student_model.output

# Apply global average pooling to the output features of the base model.
# This operation will average out the spatial dimensions, resulting in a single
# 1D feature vector per image.
x = GlobalAveragePooling2D()(x)

# Add a dense (fully connected) layer with a number of units equal to the number of classes
# in the dataset (denoted by `len(tag_to_idx)`). The softmax activation function is used
# to produce a probability distribution over the class labels.
predictions = Dense(len(tag_to_idx), activation='softmax')(x)
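
The pooling step can be illustrated without any framework. The sketch below uses a hypothetical 2x2 feature map with two channels; for a 224x224 input, ResNet50 produces a 7x7x2048 feature map, which the same operation reduces to a 2048-element vector.

```python
def global_average_pool(feature_map):
    """Average a feature map indexed as [row][column][channel] over its spatial positions."""
    height = len(feature_map)
    width = len(feature_map[0])
    channels = len(feature_map[0][0])
    sums = [0.0] * channels
    for row in feature_map:
        for pixel in row:
            for channel, value in enumerate(pixel):
                sums[channel] += value
    return [s / (height * width) for s in sums]

# Hypothetical 2x2 feature map with 2 channels.
feature_map = [
    [[1.0, 10.0], [2.0, 20.0]],
    [[3.0, 30.0], [4.0, 40.0]],
]

pooled = global_average_pool(feature_map)
print(pooled)  # [2.5, 25.0]
```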

## **Student Model Compilation**

This code block is focused on finalizing the student model's architecture by specifying its inputs and outputs, and preparing it for training by compiling it with an optimizer, loss function, and evaluation metrics.

The `Model` class from Keras is used to create a new model instance (`student_model`) that takes the input of the base ResNet50 model (`base_student_model.input`) and produces the output of the dense softmax layer (`predictions`). The model is then compiled with the Adam optimizer, a popular choice for training deep learning models because it adapts the step size per parameter. The learning rate is set to 0.01, ten times the Keras default of 0.001 for Adam. The loss function is categorical crossentropy, which is standard for classification problems where each instance belongs to exactly one class, and accuracy is tracked as the evaluation metric. Note that the custom training loop in the next section reuses only the compiled optimizer (`student_model.optimizer`); it computes its own loss with `distillation_loss` instead of the loss passed to `compile`.

In [None]:
# Define the student model by specifying the input layer of the base model and the newly added output layer.
student_model = Model(inputs=base_student_model.input, outputs=predictions)

# Compile the student model by setting the optimizer, loss function, and metrics for training.
# Adam optimizer is used with a learning rate of 0.01.
# The loss function is categorical crossentropy, which is suitable for multi-class classification tasks.
# The model's performance is evaluated based on accuracy.
student_model.compile(optimizer=Adam(learning_rate=0.01), loss='categorical_crossentropy', metrics=['accuracy'])

## **Student Model Training with Distillation**

This code block outlines the process of training the student model using knowledge distillation. It involves iterating over the dataset for a specified number of epochs, using both the hard labels and the soft labels generated by the teacher model to calculate the distillation loss. The student model's weights are updated based on this loss, and its performance is evaluated on both the training and validation sets.

- The training process is set to run for a total of 10 epochs.
- A temperature of 5.0 is used to soften the probabilities in the soft labels generated by the teacher model.
- An empty list is initialized to store the soft labels for the training set.
- The training loop starts, iterating over each epoch and each batch of images and labels.
- For the first epoch, the teacher model's predictions are obtained, softened, and stored in `train_set_soft_labels`.
- For subsequent epochs, the stored soft labels are reused.
- The `distillation_loss` function is called to compute the loss for the student model, which includes both the hard and soft label components.
- Gradients are calculated with respect to the student model's trainable variables, and the optimizer is used to update the weights.
- Training accuracy is computed for every batch and printed, together with the loss, every 10 batches.
- After each epoch, the model is evaluated on the validation set, and the validation loss and accuracy are calculated and printed.
- Finally, the trained student model is saved to the file system.
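
The caching pattern in the steps above can be sketched in isolation. In this simplified stand-in, `fake_teacher` and the batch contents are hypothetical placeholders for `teacher_model` and the real image batches.

```python
def fake_teacher(batch):
    """Hypothetical stand-in for the teacher: returns one 'soft label' per image."""
    return [f"soft({image})" for image in batch]

train_batches = [["img0", "img1"], ["img2", "img3"]]
cache = []          # plays the role of train_set_soft_labels
teacher_calls = 0   # counts how often the (expensive) teacher is evaluated

for epoch in range(3):
    for batch_index, batch in enumerate(train_batches):
        if epoch == 0:
            # First epoch: run the teacher once per batch and remember the result.
            soft = fake_teacher(batch)
            teacher_calls += 1
            cache.append(soft)
        else:
            # Later epochs: look the result up by batch position.
            soft = cache[batch_index]

print(teacher_calls, len(cache))  # 2 2
```

Because the cache is indexed by batch position, this approach is only valid when the batches are iterated in the same order, with the same contents, in every epoch. If the data were reshuffled between epochs, the cached soft labels would no longer line up with the images.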

In [None]:
# Set the number of epochs for training and the temperature for softening probabilities.
epochs = 10
temperature = 5.0

# Initialize a list to store the soft labels for the training set.
train_set_soft_labels = []

# Start the training loop over the specified number of epochs.
for epoch in range(epochs):
    print(f"Epoch {epoch+1}/{epochs}")

    # Iterate over each batch of images and labels in the training set.
    for batch, (images, labels) in enumerate(zip(train_set_images, train_set_labels)):

        # Record operations for automatic differentiation.
        with tf.GradientTape() as tape:
            # Obtain predictions from the student model.
            preds_student = student_model(images, training=True)

            # During the first epoch, generate and store soft labels from the teacher model.
            if epoch == 0:
                preds_teacher = teacher_model(images, training=False)
                soft_labels_teacher = soft_labels(preds_teacher, temperature)
                train_set_soft_labels.append(soft_labels_teacher)
            else:
                # For subsequent epochs, reuse the stored soft labels.
                soft_labels_teacher = train_set_soft_labels[batch]

            # Compute the distillation loss using both hard and soft labels.
            loss = distillation_loss(labels, preds_student, soft_labels_teacher, temperature)

        # Calculate gradients of the loss with respect to the model's trainable variables.
        grads = tape.gradient(loss, student_model.trainable_variables)
        # Apply the gradients to update the model's weights.
        student_model.optimizer.apply_gradients(zip(grads, student_model.trainable_variables))
        # Calculate the training accuracy for this batch (it is printed every 10 batches below).
        accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(preds_student, axis=1), tf.argmax(labels, axis=1)), tf.float32))

        if (batch + 1) % 10 == 0:
            print(f'Batch {batch + 1} - Training Loss: {round(float(loss.numpy().mean()), 2)} | Training Accuracy: {round(float(accuracy.numpy()), 2)}')

    # Initialize variables to accumulate validation loss and accuracy.
    validation_loss, validation_accuracy = 0, 0
    # Evaluate the model on the validation set.
    for (validation_images, validation_labels) in zip(validation_set_images, validation_set_labels):
        validation_preds = student_model(validation_images, training=False)
        validation_loss += categorical_crossentropy(validation_labels, validation_preds).numpy().mean()
        validation_accuracy += float(tf.reduce_mean(tf.cast(tf.equal(tf.argmax(validation_preds, axis=1), tf.argmax(validation_labels, axis=1)), tf.float32)).numpy())

    # Calculate the average validation loss and accuracy.
    validation_loss = round(validation_loss / len(validation_set_images), 2)
    validation_accuracy = round(validation_accuracy / len(validation_set_images), 2)

    # Print the validation results for the current epoch.
    print(f"Epoch {epoch + 1} Finished! Validation Loss: {validation_loss} | Validation Accuracy: {validation_accuracy}")

# Save the trained student model to the file system.
student_model.save('./models/student_model.h5')

## **Evaluating the Student Model on the Test Set**

This code block is responsible for evaluating the performance of the student model on the test set.

1. `y_test_true, y_test_pred = [], []`: This line initializes two empty lists to store the true and predicted labels for the test set.

2. `for (test_images, test_labels) in zip(test_set_images, test_set_labels):`: This loop iterates through the test set images and labels.

3. `test_preds = student_model(test_images, training=False)`: This line makes predictions using the student model for the current batch of test images.

4. `y_test_pred.extend(list(tf.argmax(test_preds, axis=1)))`: This line appends the predicted labels (the index of the maximum value in the prediction vector) to the `y_test_pred` list.

5. `y_test_true.extend(list(test_labels))`: This line appends the true labels to the `y_test_true` list.

6. `y_test_pred = [idx_to_tag[x.numpy()] for x in y_test_pred]`: This line converts the predicted labels (indices) to their corresponding class tags using the `idx_to_tag` dictionary.

7. `y_test_true = [idx_to_tag[np.argmax(x)] for x in y_test_true]`: This line converts the true labels (one-hot encoded) to their corresponding class tags using the `idx_to_tag` dictionary.

8. `print(classification_report(y_true=y_test_true, y_pred=y_test_pred))`: This line prints the classification report for the test set, comparing the true and predicted labels.

This code block evaluates the student model on the test set. The classification report printed at the end lists precision, recall, F1-score, and support for each class, along with overall accuracy and averaged scores, which can be used to assess the student model's effectiveness and guide further improvements.
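
The label conversion in steps 4 to 7 can be illustrated with plain Python lists. The `idx_to_tag` mapping, the probabilities, and the one-hot labels below are hypothetical examples, not values from the dataset.

```python
# Hypothetical index-to-tag mapping for a three-class problem.
idx_to_tag = {0: "cat", 1: "dog", 2: "bird"}

def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])

# One batch of predicted probabilities and the matching one-hot true labels.
batch_probs = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1], [0.2, 0.2, 0.6]]
batch_onehot = [[0, 1, 0], [0, 0, 1], [0, 0, 1]]

y_pred = [idx_to_tag[argmax(p)] for p in batch_probs]
y_true = [idx_to_tag[argmax(t)] for t in batch_onehot]
accuracy = sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)

print(y_pred)  # ['dog', 'cat', 'bird']
print(y_true)  # ['dog', 'bird', 'bird']
print(round(accuracy, 3))
```

Both lists end up as class tags, which is the form `classification_report` receives in the cell below.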

In [None]:
# Initialize empty lists to store the true and predicted labels for the test set
y_test_true, y_test_pred = [], []

# Iterate through the test set images and labels
for (test_images, test_labels) in zip(test_set_images, test_set_labels):
    # Get the predictions from the student model for the test images
    test_preds = student_model(test_images, training=False)
    # Append the predicted labels (index of the maximum value) to the y_test_pred list
    y_test_pred.extend(list(tf.argmax(test_preds, axis=1)))
    # Append the true labels to the y_test_true list
    y_test_true.extend(list(test_labels))

# Convert the predicted and true labels from indices to class tags
y_test_pred = [idx_to_tag[x.numpy()] for x in y_test_pred]
y_test_true = [idx_to_tag[np.argmax(x)] for x in y_test_true]

# Print the classification report for the test set
print(classification_report(y_true=y_test_true, y_pred=y_test_pred))

## **Visualizing Predictions and Grad-CAM for a Selected Test Image using the Student Model**

This code block visualizes the predictions and Grad-CAM heatmap for a selected test image using the student model.

1. `plt.imshow(selected_image.squeeze(0))`: This line displays the selected test image.

2. `student_model.layers[-1].activation = None`: This line removes the softmax activation from the last layer of the student model, so that the model outputs raw logits. Grad-CAM is typically computed from these pre-softmax class scores, and the probabilities are recovered in the next step by applying softmax explicitly.

3. `preds = tf.nn.softmax(student_model.predict(selected_image))`: This line makes a prediction using the student model for the selected image and applies the softmax function to the output to obtain the probability distribution.

4. `print(preds, idx_to_tag[np.argmax(preds)])`: This line prints the predicted probabilities and the class tag of the top predicted class, using the `idx_to_tag` dictionary to map the index to the corresponding class label.

5. `heatmap = make_gradcam_heatmap(selected_image, student_model, last_conv_layer_name)`: This line generates the Grad-CAM heatmap for the selected image using the `make_gradcam_heatmap` function and the student model.

6. `plt.matshow(heatmap)`: This line displays the Grad-CAM heatmap.

7. `plt.show()`: This line shows the displayed plots.

8. `display_gradcam(selected_image, heatmap)`: This line displays the Grad-CAM visualization, which superimposes the heatmap on the original image.

This code block provides a way to visualize the predictions and the Grad-CAM heatmap for a selected test image using the student model. This can be useful for interpreting the student model's decision-making process and identifying the important regions of the image that contribute to the prediction. It allows for a direct comparison between the teacher and student models' interpretability and performance.

In [None]:
# Display the selected image
plt.imshow(selected_image.squeeze(0))

# Remove the softmax activation from the last layer of the student model
student_model.layers[-1].activation = None

# Get the predictions from the student model for the selected image
preds = tf.nn.softmax(student_model.predict(selected_image))

# Print the full probability vector and the tag of the top predicted class
print(preds, idx_to_tag[np.argmax(preds)])

# Generate the Grad-CAM heatmap for the selected image using the student model
heatmap = make_gradcam_heatmap(selected_image, student_model, last_conv_layer_name)

# Display the Grad-CAM heatmap
plt.matshow(heatmap)
plt.show()

# Display the Grad-CAM visualization for the selected image
display_gradcam(selected_image, heatmap)

## **Wasserstein Distance Computation for 2D Probability Distributions**

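This code block defines a function called `wasserstein_distance_2d` that computes an average Wasserstein distance between two lists of paired 2D maps. Each map is flattened to 1D and normalized before the comparison, so the result is the 1D Wasserstein-1 distance over the flattened ordering rather than a true 2D optimal-transport distance.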

1. `def wasserstein_distance_2d(ps, qs):`: This line defines the function, which takes two arguments: `ps` and `qs`, both of which are lists of 2D probability distributions.

2. `assert len(ps) == len(qs)`: This line ensures that `ps` and `qs` contain the same number of distributions, so that each element of `ps` can be paired with one element of `qs`.

3. `wasserstein_dist = 0`: This line initializes the running total of the distance to 0.

4. `for p, q in zip(ps, qs):`: This loop iterates through the corresponding elements in `ps` and `qs`.

5. `p_flat = tf.reshape(p, [-1])` and `q_flat = tf.reshape(q, [-1])`: These lines flatten the 2D probability distributions into 1D tensors.

6. `p_flat /= tf.reduce_sum(p_flat)` and `q_flat /= tf.reduce_sum(q_flat)`: These lines normalize the flattened tensors so that each sums to 1 and represents a valid probability distribution. A map that sums to zero would produce NaN values at this step.

7. `cdf_p = tf.cumsum(p_flat)` and `cdf_q = tf.cumsum(q_flat)`: These lines compute the cumulative distribution functions (CDFs) of the normalized probability distributions.

8. `wasserstein_dist += tf.reduce_sum(tf.abs(cdf_p - cdf_q))`: This line sums the absolute differences between the two CDFs, which is the 1D Wasserstein-1 distance between the flattened distributions, and adds it to the running total.

9. `return wasserstein_dist / len(ps)`: Finally, this line returns the average distance across all the pairs of probability distributions.

The Wasserstein distance is a useful metric for comparing two probability distributions, as it takes into account the underlying structure of the distributions, rather than just the difference between individual elements. In the context of this project, the Wasserstein distance is used to compare the Grad-CAM heatmaps generated by the teacher and student models, providing a way to measure the similarity of the explanations provided by the two models.
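
The computation for a single pair of maps can be reproduced in plain Python, which makes its behavior easy to check by hand. The function name `flattened_wasserstein` and the tiny 2x2 heatmaps are illustrative assumptions.

```python
from itertools import accumulate

def flattened_wasserstein(p, q):
    """Sum of absolute CDF differences after flattening and normalizing two 2D maps."""
    p_flat = [value for row in p for value in row]
    q_flat = [value for row in q for value in row]
    p_sum, q_sum = sum(p_flat), sum(q_flat)
    p_norm = [value / p_sum for value in p_flat]
    q_norm = [value / q_sum for value in q_flat]
    cdf_p = list(accumulate(p_norm))
    cdf_q = list(accumulate(q_norm))
    return sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))

heatmap_a = [[1.0, 0.0], [0.0, 0.0]]
heatmap_b = [[0.0, 1.0], [0.0, 0.0]]  # all mass moved one cell to the right
heatmap_c = [[0.0, 0.0], [1.0, 0.0]]  # all mass moved one cell down

print(flattened_wasserstein(heatmap_a, heatmap_b))  # 1.0
print(flattened_wasserstein(heatmap_a, heatmap_c))  # 2.0
```

Both shifts move the mass by one cell in the 2D grid, yet the vertical shift scores twice as high because the distance is measured along the row-major flattened index. This is worth keeping in mind when interpreting the values for Grad-CAM heatmaps.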

In [None]:
def wasserstein_distance_2d(ps, qs):
    # Ensure ps and qs contain the same number of distributions so they can be paired
    assert len(ps) == len(qs)

    # Initialize the Wasserstein distance
    wasserstein_dist = 0

    # Iterate through the elements in ps and qs
    for p, q in zip(ps, qs):
        # Flatten the tensors
        p_flat = tf.reshape(p, [-1])
        q_flat = tf.reshape(q, [-1])

        # Normalize the flattened tensors
        p_flat /= tf.reduce_sum(p_flat)
        q_flat /= tf.reduce_sum(q_flat)

        # Compute the cumulative distribution functions (CDFs) of p and q
        cdf_p = tf.cumsum(p_flat)
        cdf_q = tf.cumsum(q_flat)

        # Sum the absolute CDF differences (1D Wasserstein-1 distance on the flattened maps)
        wasserstein_dist += tf.reduce_sum(tf.abs(cdf_p - cdf_q))

    # Return the average Wasserstein distance
    return wasserstein_dist / len(ps)

## **Student Model Creation and Compilation**

This code block is responsible for creating the base student model and the final student model, as well as compiling the student model for training.

1. `base_student_model_2 = ResNet50(weights=None, include_top=False, input_shape=(224, 224, 3))`: This line creates the base student model using the ResNet50 architecture, but without the top layer. The `weights=None` parameter indicates that the model will be trained from scratch, and the `input_shape` parameter specifies the input size of the images.

2. `x = base_student_model_2.output`: This line extracts the output of the base student model.

3. `x = GlobalAveragePooling2D()(x)`: This line adds a global average pooling layer to the model, which will reduce the spatial dimensions of the feature maps and provide a fixed-size output.

4. `predictions = Dense(len(tag_to_idx), activation='softmax')(x)`: This line adds a dense layer with a softmax activation function to the model, which will output the class probabilities. The number of units in the dense layer is set to the length of the `tag_to_idx` dictionary, which represents the number of classes.

5. `student_model_2 = Model(inputs=base_student_model_2.input, outputs=predictions)`: This line creates the final student model by defining the inputs (the base student model's input) and the outputs (the predictions from the previous layers).

6. `student_model_2.compile(optimizer=Adam(learning_rate=0.01), loss='categorical_crossentropy', metrics=['accuracy'])`: This line compiles the student model with the Adam optimizer, categorical cross-entropy loss, and accuracy metric. The learning rate for the optimizer is set to 0.01.

This code block sets up the student model architecture and prepares it for training. The base student model is created using the ResNet50 architecture, and additional layers are added to produce the final class probabilities. The compiled student model can now be trained using the specified optimizer, loss function, and evaluation metric.

In [None]:
# Create the base student model with ResNet50 architecture, without the top layer
base_student_model_2 = ResNet50(weights=None, include_top=False, input_shape=(224, 224, 3))
x = base_student_model_2.output
x = GlobalAveragePooling2D()(x)
predictions = Dense(len(tag_to_idx), activation='softmax')(x)

# Create the student model and compile it
student_model_2 = Model(inputs=base_student_model_2.input, outputs=predictions)
student_model_2.compile(optimizer=Adam(learning_rate=0.01), loss='categorical_crossentropy', metrics=['accuracy'])

## **Training the Student Model using Knowledge Distillation**

This code block is responsible for training the student model using a knowledge distillation approach, where the student model is trained to mimic the performance and gradient-based explanations of the pre-trained teacher model.

1. `epochs = 10`: This line sets the number of training epochs to 10.
2. `temperature = 5.0`: This line sets the temperature parameter used for softening the probabilities in the distillation loss.
3. `train_set_heatmaps = []`: This line initializes an empty list to store the Grad-CAM heatmaps of the teacher model for the training set.
4. `for epoch in range(epochs):`: This loop iterates through the training epochs.
5. `preds_student = student_model_2(images, training=True)`: This line gets the predictions from the student model for the current batch of training images.
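6. `soft_labels_teacher = train_set_soft_labels[batch]`: This line retrieves the teacher's soft labels (probabilities) for the current training batch. They are read from `train_set_soft_labels`, which was filled during the earlier distillation run, so that cell must have been executed first.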
7. `heatmaps_student = [make_gradcam_heatmap(...) for image in images]`: This line generates the Grad-CAM heatmaps for the student model's predictions on the current training batch.
8. `if epoch == 0: heatmaps_teacher = [...]; train_set_heatmaps.append(heatmaps_teacher)`: If it's the first epoch, this block generates the Grad-CAM heatmaps for the teacher model and stores them in the `train_set_heatmaps` list.
9. `loss = distillation_loss(...) + wasserstein_distance_2d(heatmaps_teacher, heatmaps_student)`: This line computes the combined loss, which includes the distillation loss (using the soft labels from the teacher) and the Wasserstein distance between the teacher and student Grad-CAM heatmaps.
10. `grads = tape.gradient(loss, student_model_2.trainable_variables)`: This line computes the gradients of the loss with respect to the student model's trainable variables.
11. `student_model_2.optimizer.apply_gradients(...)`: This line updates the student model's weights using the computed gradients and the Adam optimizer.
12. `accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(preds_student, axis=1), tf.argmax(labels, axis=1)), tf.float32))`: This line computes the training accuracy for the current batch.
13. `validation_loss, validation_accuracy = 0, 0`: This line initializes the running totals for the validation loss and accuracy.
14. `for (validation_images, validation_labels) in ...:`: This loop iterates through the validation set, computes the validation loss and accuracy, and updates the running totals.
15. `validation_loss = round(validation_loss / len(validation_set_images), 2)`: This line computes the average validation loss.
16. `validation_accuracy = round(validation_accuracy / len(validation_set_images), 2)`: This line computes the average validation accuracy.
17. `student_model_2.save('./models/student_model_2.h5')`: This line saves the trained student model to a file.

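This code block trains the student model with an extended distillation objective: besides fitting the true labels and the teacher's soft labels, the loss adds the Wasserstein distance between the teacher's and the student's Grad-CAM heatmaps, so the student is also encouraged to mimic the teacher's explanations. The heatmap term can only influence the weight update if `make_gradcam_heatmap` returns tensors that remain connected to the gradient tape; if it returns NumPy arrays, the term changes the reported loss but contributes no gradient. The validation loss and accuracy are monitored after every epoch, and the final student model is saved to a file.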
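
One detail of the validation averaging in steps 14 to 16 is that the per-batch accuracies are summed and divided by the number of batches. The sketch below, using hypothetical batch sizes and counts, shows how this can differ from the accuracy over all samples when the final batch is smaller than the others.

```python
# Hypothetical (correct predictions, batch size) pairs; the last batch is smaller.
batch_results = [(8, 10), (9, 10), (1, 2)]

# Averaging the per-batch accuracies, as the validation loop does.
mean_of_batch_means = sum(c / n for c, n in batch_results) / len(batch_results)

# Accuracy over all samples, weighting each batch by its size.
sample_weighted = sum(c for c, _ in batch_results) / sum(n for _, n in batch_results)

print(round(mean_of_batch_means, 3), round(sample_weighted, 3))
```

When every batch has the same size, the two values coincide, so the difference only matters if the dataset size is not a multiple of the batch size.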

In [None]:
# Train the student model for 10 epochs
epochs = 10
temperature = 5.0  # Temperature used for softening the probabilities
train_set_heatmaps = []
for epoch in range(epochs):
    print(f"Epoch {epoch+1}/{epochs}")
    for batch, (images, labels) in enumerate(zip(train_set_images, train_set_labels)):
        with tf.GradientTape() as tape:
            # Get the predictions from the student model
            preds_student = student_model_2(images, training=True)
            # Get the soft labels from the teacher model
            soft_labels_teacher = train_set_soft_labels[batch]

            # Generate the Grad-CAM heatmaps for the student and teacher models
            heatmaps_student = [make_gradcam_heatmap(np.expand_dims(image, 0), student_model_2, last_conv_layer_name) for image in images]
            if epoch == 0:
                heatmaps_teacher = [make_gradcam_heatmap(np.expand_dims(image, 0), teacher_model, last_conv_layer_name) for image in images]
                train_set_heatmaps.append(heatmaps_teacher)
            else:
                heatmaps_teacher = train_set_heatmaps[batch]

            # Compute the distillation loss and Wasserstein distance
            loss = distillation_loss(labels, preds_student, soft_labels_teacher, temperature) + wasserstein_distance_2d(heatmaps_teacher, heatmaps_student)

        # Update the student model's weights
        grads = tape.gradient(loss, student_model_2.trainable_variables)
        student_model_2.optimizer.apply_gradients(zip(grads, student_model_2.trainable_variables))
        accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(preds_student, axis=1), tf.argmax(labels, axis=1)), tf.float32))

        if (batch + 1) % 10 == 0:
            print(f'Batch {batch + 1} - Training Loss: {round(float(tf.reduce_mean(loss).numpy()), 2)} | Training Accuracy: {round(float(accuracy.numpy()), 2)}')

    # Compute the validation loss and accuracy
    validation_loss, validation_accuracy = 0, 0
    for (validation_images, validation_labels) in zip(validation_set_images, validation_set_labels):
        validation_preds = student_model_2(validation_images, training=False)
        validation_loss += categorical_crossentropy(validation_labels, validation_preds).numpy().mean()
        validation_accuracy += float(tf.reduce_mean(tf.cast(tf.equal(tf.argmax(validation_preds, axis=1), tf.argmax(validation_labels, axis=1)), tf.float32)).numpy())

    validation_loss = round(validation_loss / len(validation_set_images), 2)
    validation_accuracy = round(validation_accuracy / len(validation_set_images), 2)

    print(f"Epoch {epoch + 1} Finished! Validation Loss: {validation_loss} | Validation Accuracy: {validation_accuracy}")

# Save the trained student model
student_model_2.save('./models/student_model_2.h5')

## **Evaluating the Student Model on the Test Set**

This code block is responsible for evaluating the performance of the student model (student_model_2) on the test set.

1. `y_test_true, y_test_pred = [], []`: This line initializes two empty lists to store the true and predicted labels for the test set.

2. `for (test_images, test_labels) in zip(test_set_images, test_set_labels):`: This loop iterates through the test set images and labels.

3. `test_preds = student_model_2(test_images, training=False)`: This line makes predictions using the student model for the current batch of test images.

4. `y_test_pred.extend(list(tf.argmax(test_preds, axis=1)))`: This line appends the predicted labels (the index of the maximum value in the prediction vector) to the `y_test_pred` list.

5. `y_test_true.extend(list(test_labels))`: This line appends the true labels to the `y_test_true` list.

6. `y_test_pred = [idx_to_tag[x.numpy()] for x in y_test_pred]`: This line converts the predicted labels (indices) to their corresponding class tags using the `idx_to_tag` dictionary.

7. `y_test_true = [idx_to_tag[np.argmax(x)] for x in y_test_true]`: This line converts the true labels (one-hot encoded) to their corresponding class tags using the `idx_to_tag` dictionary.

8. `print(classification_report(y_true=y_test_true, y_pred=y_test_pred))`: This line prints the classification report for the test set, comparing the true and predicted labels.

This code block evaluates `student_model_2` on the test set. The classification report printed at the end lists precision, recall, F1-score, and support for each class, along with overall accuracy and averaged scores, which can be used to assess the student model's effectiveness and guide further improvements or fine-tuning.

In [None]:
# Initialize empty lists to store the true and predicted labels for the test set
y_test_true, y_test_pred = [], []

# Iterate through the test set images and labels
for (test_images, test_labels) in zip(test_set_images, test_set_labels):
    # Get the predictions from the student model for the test images
    test_preds = student_model_2(test_images, training=False)
    # Append the predicted labels (index of the maximum value) to the y_test_pred list
    y_test_pred.extend(list(tf.argmax(test_preds, axis=1)))
    # Append the true labels to the y_test_true list
    y_test_true.extend(list(test_labels))

# Convert the predicted and true labels from indices to class tags
y_test_pred = [idx_to_tag[x.numpy()] for x in y_test_pred]
y_test_true = [idx_to_tag[np.argmax(x)] for x in y_test_true]

# Print the classification report for the test set
print(classification_report(y_true=y_test_true, y_pred=y_test_pred))

## **Visualizing Predictions and Grad-CAM for a Selected Test Image using the Student Model**

This code block visualizes the predictions and Grad-CAM heatmap for a selected test image using the student model (student_model_2).

1. `plt.imshow(selected_image.squeeze(0))`: This line displays the selected test image.

2. `student_model_2.layers[-1].activation = None`: This line removes the softmax activation from the last layer of the student model, so that the model outputs raw logits. Grad-CAM is typically computed from these pre-softmax class scores, and the probabilities are recovered in the next step by applying softmax explicitly.

3. `preds = tf.nn.softmax(student_model_2.predict(selected_image))`: This line makes a prediction using the student model for the selected image and applies the softmax function to the output to obtain the probability distribution.

4. `print(preds, idx_to_tag[np.argmax(preds)])`: This line prints the predicted probabilities and the class tag of the top predicted class, using the `idx_to_tag` dictionary to map the index to the corresponding class label.

5. `heatmap = make_gradcam_heatmap(selected_image, student_model_2, last_conv_layer_name)`: This line generates the Grad-CAM heatmap for the selected image using the `make_gradcam_heatmap` function and the student model.

6. `plt.matshow(heatmap)`: This line displays the Grad-CAM heatmap.

7. `plt.show()`: This line shows the displayed plots.

8. `display_gradcam(selected_image, heatmap)`: This line displays the Grad-CAM visualization, which superimposes the heatmap on the original image.

This code block provides a way to visualize the predictions and the Grad-CAM heatmap for a selected test image using the student model. This can be useful for interpreting the student model's decision-making process and identifying the important regions of the image that contribute to the prediction. It allows for a direct comparison between the teacher and student models' interpretability and performance.

In [None]:
# Display the selected image
plt.imshow(selected_image.squeeze(0))

# Remove the softmax activation from the last layer of the student model
student_model_2.layers[-1].activation = None

# Get the predictions from the student model for the selected image and apply softmax
preds = tf.nn.softmax(student_model_2.predict(selected_image))

# Print the full probability vector and the tag of the top predicted class
print(preds, idx_to_tag[np.argmax(preds)])

# Generate the Grad-CAM heatmap for the selected image using the student model
heatmap = make_gradcam_heatmap(selected_image, student_model_2, last_conv_layer_name)

# Display the Grad-CAM heatmap
plt.matshow(heatmap)
plt.show()

# Display the Grad-CAM visualization for the selected image
display_gradcam(selected_image, heatmap)

## **Configuration and Compilation of Student Model Trained on Hard Labels**

This code block outlines the process of configuring a student model based on the ResNet50 architecture for training on hard labels, and preparing it for the training process.

1. **Base Model Initialization**:
   ```python
   base_student_model_3 = ResNet50(weights=None, include_top=False, input_shape=(224, 224, 3))
   ```
   A new instance of ResNet50 is created without pre-trained weights and without the fully connected top layers. The model is set to accept input images of size 224x224 pixels with 3 color channels.

2. **Layer Freezing**:
   ```python
   for layer in base_student_model_3.layers:
       layer.trainable = False
   ```
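   The `trainable` attribute of each layer in the base model is set to `False`, which freezes those layers so their weights are not updated during training. Freezing is most common when fine-tuning a pre-trained backbone; here the base model is created with `weights=None`, so the frozen layers keep their random initial weights and only the new top layer is trained.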

3. **Adding Custom Top Layers**:
   ```python
   x = base_student_model_3.output
   x = GlobalAveragePooling2D()(x)
   predictions = Dense(len(tag_to_idx), activation='softmax')(x)
   ```
   The output of the base model is passed through a GlobalAveragePooling2D layer to reduce dimensionality and parameters. A Dense layer with a softmax activation function is added to make final predictions. The number of units in the Dense layer corresponds to the number of classes, determined by the size of the `tag_to_idx` mapping.

4. **Student Model Instantiation**:
   ```python
   student_model_3 = Model(inputs=base_student_model_3.input, outputs=predictions)
   ```
   A new model, `student_model_3`, is created using the Keras Model class, specifying the input and output layers.

5. **Model Compilation**:
   ```python
   student_model_3.compile(optimizer=Adam(learning_rate=0.01), loss='categorical_crossentropy', metrics=['accuracy'])
   ```
   The student model is compiled with the Adam optimizer (with a learning rate of 0.01), a categorical crossentropy loss function suitable for multi-class classification with hard labels, and accuracy as the metric for evaluation.
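
To make the effect of `GlobalAveragePooling2D` concrete, the following plain-Python sketch averages a tiny, made-up feature map over its spatial positions, leaving one value per channel. The helper `global_average_pool` and the toy values are illustrative assumptions, not part of the notebook.

```python
def global_average_pool(feature_map):
    """Average a [height][width][channels] nested list over height and width."""
    height, width = len(feature_map), len(feature_map[0])
    channels = len(feature_map[0][0])
    return [
        sum(feature_map[h][w][c] for h in range(height) for w in range(width)) / (height * width)
        for c in range(channels)
    ]


# Hypothetical 2x2 feature map with 3 channels.
toy_map = [[[1.0, 10.0, 0.0], [3.0, 20.0, 0.0]],
           [[5.0, 30.0, 0.0], [7.0, 40.0, 4.0]]]
pooled = global_average_pool(toy_map)
print(pooled)  # one value per channel
```

For a 224x224 input, ResNet50 without its top layers produces a 7x7x2048 feature map, so the pooled vector fed to the `Dense` layer has 2048 entries.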

In [None]:
# Initialize the base model with the ResNet50 architecture, without pre-trained weights, and excluding the top layer.
base_student_model_3 = ResNet50(weights=None, include_top=False, input_shape=(224, 224, 3))

# Iterate through all the layers of the base model and set them to be non-trainable (weights will not be updated during training).
for layer in base_student_model_3.layers:
    layer.trainable = False

# Get the output of the last layer of the base model.
x = base_student_model_3.output

# Apply global average pooling to the output of the base model to reduce its dimensionality.
x = GlobalAveragePooling2D()(x)

# Add a dense layer with a softmax activation function. The number of units equals the number of classes.
# This layer will output probability distribution over the classes.
predictions = Dense(len(tag_to_idx), activation='softmax')(x)

# Create the final student model by specifying the input and output layers.
student_model_3 = Model(inputs=base_student_model_3.input, outputs=predictions)

# Compile the student model with the Adam optimizer, a learning rate of 0.01, categorical crossentropy loss function,
# and accuracy metric for performance evaluation.
student_model_3.compile(optimizer=Adam(learning_rate=0.01), loss='categorical_crossentropy', metrics=['accuracy'])

## **Training and Validation of the Student Model**

This code block outlines the training and validation process for the student model over a specified number of epochs.

1. **Setting Up Training Parameters**:
   - The variable `epochs` is set to 10, indicating the student model will undergo 10 full iterations (epochs) over the training dataset.

2. **Training Loop**:
   - The `for` loop begins, iterating over the range of epochs, starting from 0 and going up to but not including `epochs`.

3. **Epoch Progress Output**:
   - The `print` function outputs the progress of the training, showing the current epoch number out of the total epochs.

4. **Batch Processing**:
   - Another `for` loop starts, iterating over each batch of images and labels from the training dataset. The `enumerate` function adds a counter (`batch`) to each iteration, and `zip` combines `train_set_images` with `train_set_labels` into pairs for processing.

5. **Gradient Calculation**:
   - `tf.GradientTape` is used to record operations for automatic differentiation. This is essential for the backpropagation algorithm to work during the training process.

6. **Model Predictions**:
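   - Inside the `GradientTape` block, the student model (`student_model_3`) makes predictions on the training images with `training=False`. ResNet50 contains no dropout layers, so the flag mainly affects its batch-normalization layers, which then use their moving statistics instead of batch statistics. Because every layer of the base model is frozen in this notebook, those layers run in inference mode anyway in TensorFlow 2; if the base model were trainable, `training=True` would be the usual choice during training.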

7. **Loss Computation**:
   - The loss is calculated using `categorical_crossentropy`, which compares the predicted labels (`preds`) to the true labels (`labels`).

8. **Backpropagation**:
   - Gradients of the loss with respect to the model's trainable variables are computed, and then these gradients are applied to the model's variables to minimize the loss using the `apply_gradients` method of the model's optimizer.

9. **Accuracy Calculation**:
   - The accuracy for the current batch is calculated by comparing the predicted labels to the true labels, using the `tf.argmax` function to extract the predicted class indices and `tf.equal` to check for matches.

10. **Training Status Output**:
   - Every 10 batches, the training loss and accuracy are printed to provide updates on the model's performance.

11. **Validation Metrics Initialization**:
   - Before starting validation, `validation_loss` and `validation_accuracy` are set to 0 to prepare for accumulation.

12. **Validation Loop**:
   - A loop starts over the validation dataset, where the student model makes predictions on the validation images, and the validation loss and accuracy are accumulated.

13. **Average Validation Metrics**:
   - The accumulated validation loss and accuracy are divided by `len(validation_set_images)`, the number of validation batches, to obtain the average validation loss and accuracy.

14. **Epoch Summary Output**:
   - At the end of each epoch, the average validation loss and accuracy are printed to summarize the model's performance on the validation dataset.

15. **Model Saving**:
   - After training is complete, the student model is saved to a file (`student_model_3.h5`) for later use or analysis.
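
One detail of the validation average is worth illustrating. The loop sums per-batch mean accuracies and divides by `len(validation_set_images)`. Assuming that length is the number of batches, the result is a mean of batch means, which equals the per-sample accuracy only when all batches have the same size. The batch contents below are made up, and `batch_accuracy` is a hypothetical helper.

```python
def batch_accuracy(predicted, actual):
    """Fraction of positions where predicted and actual class indices agree."""
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / len(actual)


# Hypothetical validation set: two full batches of 4 and a final batch of 1.
# Each entry is (predicted class indices, true class indices).
batches = [
    ([0, 1, 2, 1], [0, 1, 2, 2]),  # 3 of 4 correct
    ([2, 2, 0, 1], [2, 0, 0, 1]),  # 3 of 4 correct
    ([1], [0]),                    # 0 of 1 correct
]

# What a loop that accumulates batch means computes: the mean of per-batch accuracies.
mean_of_batch_means = sum(batch_accuracy(p, a) for p, a in batches) / len(batches)

# Per-sample accuracy: total correct predictions divided by total samples.
total_correct = sum(sum(1 for x, y in zip(p, a) if x == y) for p, a in batches)
total_samples = sum(len(a) for _, a in batches)
per_sample_accuracy = total_correct / total_samples

print(mean_of_batch_means, per_sample_accuracy)
```

The two numbers differ here because the small final batch carries the same weight as a full batch in the first average.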

In [None]:
# Set the number of epochs for which the student model will be trained.
epochs = 10

# Begin the training loop over the specified number of epochs.
for epoch in range(epochs):
    # Print the current epoch number.
    print(f"Epoch {epoch+1}/{epochs}")

    # Iterate over each batch of images and labels from the training dataset.
    for batch, (images, labels) in enumerate(zip(train_set_images, train_set_labels)):
        # Record the operations for automatic differentiation using GradientTape.
        with tf.GradientTape() as tape:
            # Make predictions on the batch of images in inference mode (training=False),
            # so batch-normalization layers use their moving statistics.
            preds = student_model_3(images, training=False)
            # Compute the loss by comparing the predicted labels to the true labels.
            loss = categorical_crossentropy(labels, preds)

        # Calculate the gradients of the loss with respect to the model's trainable variables.
        grads = tape.gradient(loss, student_model_3.trainable_variables)
        # Apply the gradients to the model's variables to minimize the loss function.
        student_model_3.optimizer.apply_gradients(zip(grads, student_model_3.trainable_variables))
        # Calculate the batch's accuracy by comparing the predicted labels to the true labels.
        accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(preds, axis=1), tf.argmax(labels, axis=1)), tf.float32))

        # Print the training loss and accuracy every 10 batches.
        if (batch + 1) % 10 == 0:
            print(f'Batch {batch + 1} - Training Loss: {round(float(loss.numpy().mean()), 2)} | Training Accuracy: {round(float(accuracy.numpy()), 2)}')

    # Initialize variables to accumulate validation loss and accuracy.
    validation_loss, validation_accuracy = 0, 0
    # Iterate over each batch of images and labels from the validation dataset.
    for (validation_images, validation_labels) in zip(validation_set_images, validation_set_labels):
        # Make predictions on the batch of validation images without enabling training-specific layers like dropout.
        validation_preds = student_model_3(validation_images, training=False)
        # Accumulate the validation loss.
        validation_loss += categorical_crossentropy(validation_labels, validation_preds).numpy().mean()
        # Accumulate the validation accuracy.
        validation_accuracy += float(tf.reduce_mean(tf.cast(tf.equal(tf.argmax(validation_preds, axis=1), tf.argmax(validation_labels, axis=1)), tf.float32)).numpy())

    # Calculate the average validation loss and accuracy over all validation batches.
    validation_loss = round(validation_loss / len(validation_set_images), 2)
    validation_accuracy = round(validation_accuracy / len(validation_set_images), 2)

    # Print the final validation loss and accuracy for the current epoch.
    print(f"Epoch {epoch + 1} Finished! Validation Loss: {validation_loss} | Validation Accuracy: {validation_accuracy}")

# Save the trained student model to a file for later use or analysis.
student_model_3.save('./models/student_model_3.h5')  # student hard

## **Evaluating the Student Model on Test Set**

This code block is for evaluating the trained student model on a test dataset and generating a report that provides various classification metrics.

1. **Initialize Lists for True and Predicted Labels**:
   ```python
   y_test_true, y_test_pred = [], []
   ```
   Two empty lists are initialized to store the true labels (`y_test_true`) and the predicted labels (`y_test_pred`) for the test dataset.

2. **Iterate Over Test Dataset**:
   ```python
   for (test_images, test_labels) in zip(test_set_images, test_set_labels):
   ```
   The code iterates over the test images and labels, which are paired together using Python's `zip` function.

3. **Make Predictions**:
   ```python
   test_preds = student_model_3(test_images, training=False)
   ```
   The student model (`student_model_3`) makes predictions on the test images with `training=False`. This ensures that the model uses its inference behavior, without any training-specific operations like dropout.

4. **Store Predicted Labels**:
   ```python
   y_test_pred.extend(list(tf.argmax(test_preds, axis=1)))
   ```
   The predicted probabilities (`test_preds`) are converted to class indices using the `tf.argmax` function, which selects the index with the highest probability (the predicted class). These indices are added to the `y_test_pred` list.

5. **Store True Labels**:
   ```python
   y_test_true.extend(list(test_labels))
   ```
   The true labels from `test_labels` are added to the `y_test_true` list.

6. **Convert Indices to Class Names**:
   ```python
   y_test_pred = [idx_to_tag[x.numpy()] for x in y_test_pred]
   y_test_true = [idx_to_tag[np.argmax(x)] for x in y_test_true]
   ```
   The predicted and true class indices are converted to their corresponding class names using the `idx_to_tag` dictionary. This dictionary maps numerical indices to class names (tags).

7. **Print Classification Report**:
   ```python
   print(classification_report(y_true=y_test_true, y_pred=y_test_pred))
   ```
   A classification report is printed using scikit-learn's `classification_report` function. This report includes metrics such as precision, recall, f1-score, and support for each class, providing a detailed overview of the model's performance on the test dataset.

Overall, this code block is used to assess the accuracy of the student model by comparing its predictions to the true labels on the test data and then providing a detailed classification analysis.
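
As a rough illustration of what the report summarizes, per-class precision, recall, and F1 can be computed by hand from two lists of class names. This sketch uses only the standard library; the labels are invented, and `per_class_metrics` is a hypothetical helper rather than scikit-learn's implementation.

```python
def per_class_metrics(y_true, y_pred):
    """Return {class: (precision, recall, f1, support)} from two label lists."""
    metrics = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        predicted = sum(1 for p in y_pred if p == cls)  # how often cls was predicted
        support = sum(1 for t in y_true if t == cls)    # how often cls truly occurs
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / support if support else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        metrics[cls] = (precision, recall, f1, support)
    return metrics


# Made-up true and predicted class names for six test samples.
toy_true = ["cat", "cat", "dog", "dog", "dog", "bird"]
toy_pred = ["cat", "dog", "dog", "dog", "cat", "bird"]
toy_metrics = per_class_metrics(toy_true, toy_pred)
for name, values in toy_metrics.items():
    print(name, values)
```

Precision answers "of the samples predicted as this class, how many were right", while recall answers "of the samples that truly belong to this class, how many were found".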

In [None]:
# Initialize lists to hold the true and predicted labels for the test set.
y_test_true, y_test_pred = [], []

# Iterate over pairs of test images and labels.
for (test_images, test_labels) in zip(test_set_images, test_set_labels):
    # Make predictions using the student model on the test images with training set to False.
    test_preds = student_model_3(test_images, training=False)
    # Extend the prediction list with the predicted class indices obtained by taking the argmax of the predictions.
    y_test_pred.extend(list(tf.argmax(test_preds, axis=1)))
    # Extend the true labels list with the actual labels from the test set.
    y_test_true.extend(list(test_labels))

# Convert the list of predicted indices to their corresponding class names using the idx_to_tag mapping.
y_test_pred = [idx_to_tag[x.numpy()] for x in y_test_pred]
# Convert the list of true label indices to their corresponding class names using the idx_to_tag mapping.
y_test_true = [idx_to_tag[np.argmax(x)] for x in y_test_true]

# Print the classification report comparing the true and predicted labels to evaluate the model's performance.
print(classification_report(y_true=y_test_true, y_pred=y_test_pred))

## **Student Model Training with Soft Labels**

This code block configures a student model that will be trained on soft labels. Soft labels are the class probability distributions obtained by applying a temperature-scaled softmax to a teacher model's outputs; unlike one-hot hard labels, they also convey how similar the teacher considers the other classes to be.

1. **Define Soft Labels Function**:
   ```python
   def soft_labels(logits, temperature):
   ```
   This function, `soft_labels`, takes logits from a teacher model and a temperature value to soften the logits.

   ```python
   return tf.nn.softmax(logits / temperature)
   ```
   The logits are divided by the temperature to smooth the probability distribution, and then the softmax function is applied to obtain the soft labels.

2. **Initialize Base Student Model**:
   ```python
   base_student_model_4 = ResNet50(weights=None, include_top=False, input_shape=(224, 224, 3))
   ```
   A new instance of the ResNet50 model is created without pre-trained weights, excluding the top layer, and is set to accept input images of size 224x224 pixels with 3 color channels.

3. **Add Custom Top Layers**:
   ```python
   x = base_student_model_4.output
   x = GlobalAveragePooling2D()(x)
   predictions = Dense(len(tag_to_idx), activation='softmax')(x)
   ```
   The output of the base model is passed through a GlobalAveragePooling2D layer, followed by a Dense layer with a softmax activation. The Dense layer is configured with a number of units equal to the number of classes, which is determined by the size of the `tag_to_idx` mapping.

4. **Create Final Student Model**:
   ```python
   student_model_4 = Model(inputs=base_student_model_4.input, outputs=predictions)
   ```
   A new Keras Model instance, `student_model_4`, is created with the input layer from the base model and the custom top layers.

5. **Compile the Student Model**:
   ```python
   student_model_4.compile(optimizer=Adam(learning_rate=0.01), loss='categorical_crossentropy', metrics=['accuracy'])
   ```
   The student model is compiled with the Adam optimizer and a learning rate of 0.01. The loss function is categorical crossentropy, which is suitable for classification tasks with soft labels, and accuracy is selected as the evaluation metric.
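
The effect of the temperature is easiest to see on a small example. The sketch below applies the same formula as `soft_labels` in plain Python and compares a temperature of 1 with a temperature of 5. The name `soften` and the logits are made up for illustration.

```python
import math


def soften(logits, temperature):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)  # subtracting the maximum does not change the result
    exps = [math.exp(z - shift) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [4.0, 1.0, 0.0]
sharp = soften(toy_logits, temperature=1.0)
smooth = soften(toy_logits, temperature=5.0)
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in smooth])
```

A higher temperature flattens the distribution while preserving the ranking of the classes, so the smaller probabilities contribute more to the training signal.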

In [None]:
# Define a function to apply softmax with temperature scaling to logits, effectively creating soft labels.
def soft_labels(logits, temperature):
    """
    Description: Soften the output logits of the teacher model.
    """
    # Apply the softmax function to the scaled logits to obtain soft labels.
    return tf.nn.softmax(logits / temperature)

# Initialize the base student model using the ResNet50 architecture without pre-trained weights and excluding the top layer.
base_student_model_4 = ResNet50(weights=None, include_top=False, input_shape=(224, 224, 3))

# Retrieve the output from the last layer of the base student model.
x = base_student_model_4.output
# Apply global average pooling to the output to reduce its dimensionality.
x = GlobalAveragePooling2D()(x)
# Add a dense softmax layer to make predictions, with the number of units equal to the number of classes.
predictions = Dense(len(tag_to_idx), activation='softmax')(x)

# Instantiate the final student model by specifying its input and output layers.
student_model_4 = Model(inputs=base_student_model_4.input, outputs=predictions)
# Compile the student model with the Adam optimizer, a learning rate of 0.01, and the categorical crossentropy loss function.
# The model will use accuracy as the metric for evaluation.
student_model_4.compile(optimizer=Adam(learning_rate=0.01), loss='categorical_crossentropy', metrics=['accuracy'])

## **Training Student Model with Softened Teacher Probabilities**

This code block describes the training process of the student model using soft labels generated from the output of a teacher model, with temperature scaling applied to the probabilities.

1. **Set Training Parameters**:
   - `epochs = 10`: This line sets the number of training cycles (epochs) that the student model will undergo to 10.
   - `temperature = 5.0`: This line sets the temperature to 5.0. The temperature is a hyperparameter that softens the teacher model's output distribution: dividing the outputs by a value greater than 1 before the softmax flattens the resulting probabilities, so that information about the non-target classes is preserved for distillation.

2. **Prepare for Soft Label Storage**:
   - `train_set_soft_labels = []`: This line initializes an empty list that will later store the softened labels for the training set. These labels are generated from the teacher model's outputs.

3. **Training Loop**:
   - `for epoch in range(epochs):`: This line starts a loop that will iterate over the set number of epochs.
   - `print(f"Epoch {epoch+1}/{epochs}")`: This line prints the current epoch number to the console, providing a progress update during training.

4. **Batch Processing**:
   - `for batch, (images, labels) in enumerate(zip(train_set_images, train_set_labels)):`: This line starts a nested loop that iterates over each batch of images and labels in the training dataset.

5. **Gradient Recording**:
   - `with tf.GradientTape() as tape:`: This line sets up a context manager that records the operations for automatic differentiation.

6. **Make Predictions and Generate Soft Labels**:
   - `preds_student = student_model_4(images, training=True)`: Inside the gradient tape context, this line obtains predictions from the student model for the current batch of images.
   - `if epoch == 0:`: This conditional checks if it's the first epoch.
   - `preds_teacher = teacher_model(images, training=False)`: If it's the first epoch, this line gets predictions from the teacher model.
   - `soft_labels_teacher = soft_labels(preds_teacher, temperature)`: The teacher model's logits are softened using the `soft_labels` function and the set temperature.
   - `train_set_soft_labels.append(soft_labels_teacher)`: The softened labels are stored in the list for use in later epochs.
   - `else: soft_labels_teacher = train_set_soft_labels[batch]`: For subsequent epochs, the previously stored soft labels are retrieved.

7. **Calculate Loss**:
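   - `loss = categorical_crossentropy(soft_labels_teacher, tf.nn.softmax(preds_student / temperature))`: The loss is the cross-entropy between the teacher's soft labels and the student's temperature-scaled predictions. Note that `student_model_4` ends in a softmax layer, so `preds_student` already holds probabilities rather than logits; dividing probabilities by the temperature and applying softmax again is not the same as temperature scaling of logits, which is the usual formulation in knowledge distillation.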

8. **Backpropagation**:
   - `grads = tape.gradient(loss, student_model_4.trainable_variables)`: Gradients of the loss with respect to the student model's trainable variables are calculated.
   - `student_model_4.optimizer.apply_gradients(zip(grads, student_model_4.trainable_variables))`: The gradients are applied to the model's variables, updating them to minimize the loss.

9. **Calculate Training Accuracy**:
   - `accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(preds_student, axis=1), tf.argmax(labels, axis=1)), tf.float32))`: The accuracy for the current batch is calculated by comparing the predicted and true labels.

10. **Logging**:
   - `if (batch + 1) % 10 == 0:`: Every 10 batches, the conditional statement triggers logging.
   - `print(...)`: This line outputs the training loss and accuracy for the current batch to the console.

11. **Validation Metrics Initialization**:
   - `validation_loss, validation_accuracy = 0, 0`: Before validation begins, the loss and accuracy are initialized to zero.

12. **Validation Loop**:
   - `for (validation_images, validation_labels) in zip(validation_set_images, validation_set_labels):`: This line starts a loop over the validation dataset.
   - `validation_preds = student_model_4(validation_images, training=False)`: The student model makes predictions on the validation images.
   - `validation_loss += ...`: The validation loss is accumulated over all batches.
   - `validation_accuracy += ...`: The validation accuracy is accumulated over all batches.

13. **Compute Average Validation Metrics**:
   - `validation_loss = ...`: The accumulated validation loss is divided by the number of validation batches, `len(validation_set_images)`, and rounded to two decimals.
   - `validation_accuracy = ...`: The accumulated validation accuracy is averaged over the validation batches in the same way.

14. **Epoch Summary**:
   - `print(f"Epoch {epoch + 1} Finished! Validation Loss: {validation_loss} | Validation Accuracy: {validation_accuracy}")`: At the end of each epoch, the average validation loss and accuracy are printed to summarize the model's performance on the validation set.

15. **Model Saving**:
   - `student_model_4.save('./models/student_model_4.h5')`: The trained student model, the one trained using soft labels, is saved to the specified file path.
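
Temperature scaling in knowledge distillation is usually applied to logits, that is, the pre-softmax outputs, for both teacher and student. The following plain-Python sketch shows that variant of the loss on made-up logits for a single sample. It is not the notebook's exact computation, and `softmax_with_temperature` and `distillation_loss` are hypothetical helpers.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Numerically stable softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)
    exps = [math.exp(z - shift) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature):
    """Cross-entropy between softened teacher and softened student distributions."""
    teacher_probs = softmax_with_temperature(teacher_logits, temperature)
    student_probs = softmax_with_temperature(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))


# Made-up logits for one sample with three classes.
teacher = [5.0, 2.0, 0.5]
close_student = [4.5, 2.2, 0.4]  # ranks the classes like the teacher
far_student = [0.5, 2.0, 5.0]    # ranks the classes in the opposite order
loss_close = distillation_loss(teacher, close_student, temperature=5.0)
loss_far = distillation_loss(teacher, far_student, temperature=5.0)
print(round(loss_close, 3), round(loss_far, 3))
```

The loss is smaller for the student whose softened distribution is closer to the teacher's, which is the signal that pulls the student towards the teacher during training.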

In [None]:
# Set the number of epochs for training and the temperature for softening probabilities.
epochs = 10
temperature = 5.0 # temperature used for softening the probabilities

# Initialize an empty list to store the soft labels for the training set.
train_set_soft_labels = []

# Begin the training process for the specified number of epochs.
for epoch in range(epochs):
    # Print the current epoch number.
    print(f"Epoch {epoch+1}/{epochs}")

    # Iterate over each batch of images and labels from the training dataset.
    for batch, (images, labels) in enumerate(zip(train_set_images, train_set_labels)):
        # Record the operations for automatic differentiation.
        with tf.GradientTape() as tape:
            # Make predictions on the training images using the student model.
            preds_student = student_model_4(images, training=True)
            # During the first epoch, generate and store soft labels using the teacher model's predictions.
            if epoch == 0:
                # Obtain predictions from the teacher model.
                preds_teacher = teacher_model(images, training=False)
                # Convert the teacher model's predictions to soft labels using the defined temperature.
                soft_labels_teacher = soft_labels(preds_teacher, temperature)
                # Store the soft labels for future epochs.
                train_set_soft_labels.append(soft_labels_teacher)
            else:
                # Retrieve the stored soft labels.
                soft_labels_teacher = train_set_soft_labels[batch]

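            # Calculate the loss using the soft labels and the temperature-scaled student predictions.
            # Note: preds_student is already a softmax output, so this rescales probabilities, not logits.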
            loss = categorical_crossentropy(soft_labels_teacher, tf.nn.softmax(preds_student / temperature))

        # Compute the gradients of the loss with respect to the model's trainable variables.
        grads = tape.gradient(loss, student_model_4.trainable_variables)
        # Apply the gradients to the model's variables.
        student_model_4.optimizer.apply_gradients(zip(grads, student_model_4.trainable_variables))
        # Calculate the training accuracy for the current batch.
        accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(preds_student, axis=1), tf.argmax(labels, axis=1)), tf.float32))

        # Print the training loss and accuracy for every 10 batches.
        if (batch + 1) % 10 == 0:
            print(f'Batch {batch + 1} - Training Loss: {round(float(loss.numpy().mean()), 2)} | Training Accuracy: {round(float(accuracy.numpy()), 2)}')

    # Initialize variables for accumulating validation loss and accuracy.
    validation_loss, validation_accuracy = 0, 0
    # Iterate over each batch of images and labels from the validation set.
    for (validation_images, validation_labels) in zip(validation_set_images, validation_set_labels):
        # Make predictions on the validation images using the student model.
        validation_preds = student_model_4(validation_images, training=False)
        # Accumulate the validation loss.
        validation_loss += categorical_crossentropy(validation_labels, validation_preds).numpy().mean()
        # Accumulate the validation accuracy.
        validation_accuracy += float(tf.reduce_mean(tf.cast(tf.equal(tf.argmax(validation_preds, axis=1), tf.argmax(validation_labels, axis=1)), tf.float32)).numpy())

    # Calculate the average validation loss and accuracy over all validation batches.
    validation_loss = round(validation_loss / len(validation_set_images), 2)
    validation_accuracy = round(validation_accuracy / len(validation_set_images), 2)

    # Print the validation loss and accuracy at the end of the epoch.
    print(f"Epoch {epoch + 1} Finished! Validation Loss: {validation_loss} | Validation Accuracy: {validation_accuracy}")

# Save the trained student model to the specified file path.
student_model_4.save('./models/student_model_4.h5') # student soft

## **Evaluating the Student Model on Test Set**

1. **Initialize Lists for True and Predicted Labels**:
   ```python
   y_test_true, y_test_pred = [], []
   ```
   Two empty lists are created to store the true labels (`y_test_true`) and the predicted labels (`y_test_pred`) for the test dataset.

2. **Iterate Over Test Dataset**:
   ```python
   for (test_images, test_labels) in zip(test_set_images, test_set_labels):
   ```
   The code iterates over the test images and labels, pairing them together using Python's `zip` function.

3. **Make Predictions**:
   ```python
   test_preds = student_model_4(test_images, training=False)
   ```
   The student model (`student_model_4`) makes predictions on the test images with `training=False` to ensure it's in evaluation mode, not training mode.

4. **Store Predicted Labels**:
   ```python
   y_test_pred.extend(list(tf.argmax(test_preds, axis=1)))
   ```
   The predicted probabilities (`test_preds`) are converted to class indices by selecting the index of the highest probability for each prediction using `tf.argmax`. These indices are added to the `y_test_pred` list.

5. **Store True Labels**:
   ```python
   y_test_true.extend(list(test_labels))
   ```
   The true labels from `test_labels` are added to the `y_test_true` list.

6. **Convert Indices to Class Names (Predictions)**:
   ```python
   y_test_pred = [idx_to_tag[x.numpy()] for x in y_test_pred]
   ```
   The predicted class indices in `y_test_pred` are converted to their corresponding class names using the `idx_to_tag` dictionary. The `numpy()` method converts each scalar TensorFlow tensor to a NumPy integer so that it can be used as a dictionary key.

7. **Convert Indices to Class Names (True Labels)**:
   ```python
   y_test_true = [idx_to_tag[np.argmax(x)] for x in y_test_true]
   ```
   The true class labels in `y_test_true` are converted to their corresponding class names using the `idx_to_tag` dictionary. Here, `np.argmax` is used to find the index of the true label since the labels are likely one-hot encoded.

8. **Print Classification Report**:
   ```python
   print(classification_report(y_true=y_test_true, y_pred=y_test_pred))
   ```
   A classification report is printed using scikit-learn's `classification_report` function. This report includes metrics such as precision, recall, f1-score, and support for each class, providing a detailed overview of the model's performance on the test dataset.
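
The conversion from one-hot vectors and predicted probabilities to class names can be illustrated without TensorFlow. The mapping and vectors below are made up; in the notebook, `idx_to_tag` is built earlier from the dataset, and the `argmax` helper here stands in for `tf.argmax` and `np.argmax`.

```python
# Hypothetical index-to-name mapping standing in for the notebook's idx_to_tag.
toy_idx_to_tag = {0: "bird", 1: "cat", 2: "dog"}


def argmax(values):
    """Index of the largest value (the first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


# One-hot true labels and per-class predicted probabilities for three samples.
one_hot_labels = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
predicted_probs = [[0.1, 0.7, 0.2], [0.3, 0.4, 0.3], [0.8, 0.1, 0.1]]

true_names = [toy_idx_to_tag[argmax(row)] for row in one_hot_labels]
pred_names = [toy_idx_to_tag[argmax(row)] for row in predicted_probs]
print(true_names)
print(pred_names)
```

Both lists end up holding class names, which is the form `classification_report` receives in the code cell.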

In [None]:
# Create empty lists to store the true and predicted labels for the test set.
y_test_true, y_test_pred = [], []

# Loop through each pair of test images and labels.
for (test_images, test_labels) in zip(test_set_images, test_set_labels):
    # Obtain predictions from the student model for the current batch of test images.
    test_preds = student_model_4(test_images, training=False)
    # Extend the predicted labels list with the class indices predicted by the model.
    y_test_pred.extend(list(tf.argmax(test_preds, axis=1)))
    # Extend the true labels list with the actual labels from the test set.
    y_test_true.extend(list(test_labels))

# Convert the list of predicted class indices to their corresponding class names using the idx_to_tag mapping.
y_test_pred = [idx_to_tag[x.numpy()] for x in y_test_pred]
# Convert the list of true label indices to their corresponding class names using the idx_to_tag mapping.
y_test_true = [idx_to_tag[np.argmax(x)] for x in y_test_true]

# Print a classification report that compares the true and predicted labels, providing metrics like precision, recall, and F1-score.
print(classification_report(y_true=y_test_true, y_pred=y_test_pred))