In [None]:
#@title Copyright 2022 HCAIM.

# Statements related to copyright can be added here

# Materials for this exercise are derived from the listed sources
  #  Lab work created by Dr Rosario Catelli
  #  Third Party Resources:
    # https://www.tensorflow.org/api_docs
    # https://colab.research.google.com/github/keras-team/keras-io/blob/master/examples/vision/ipynb/knowledge_distillation.ipynb


In [None]:
#@title HCAIM Practical Details

Practical Title =  Model Compression#@param
Module = C #@param ["A", "B", "C", "D"] {type:"raw"}
Focus = Future AI/Learning #@param {type:"raw"}
Topic =  Knowledge Distillation#@param
Solution_Available =  No#@param ["Yes", "No", "NA"] {type:"raw", allow-input: true}
Duration_in_minutes = 150 #@param {type:"slider", min:120, max:180, step:10}

## Learning Outcomes:

  * Understand how to implement techniques of model compression
  * Grasp pros and cons of knowledge distillation
  * Become familiar with high-level frameworks
  
  

## Lecturer Notes

  * Instruction on **Quiz**: answer in the empty text cell below each quiz.

## Instructions/Advice to students:

  * This is individual work
  * Complete the listed tasks within the allocated time
  * Submit all documentation related to practical on Moodle
  * First complete the easy tasks
  
  

## **Knowledge Distillation (KD)**


In [None]:
Allocated_time_in_minutes = 60 #@param {type:"slider", min:0, max:60, step:5}

In this example we will see how to build a custom `Distiller()` class and apply KD with Keras/TensorFlow. In detail:

1. Dataflow and models definition
2. Training models
3. Distillation
4. Comparison of the models

## **Task 1 - Dataflow and models definition**

The dataset used for training the teacher and distilling it into the student is
[MNIST](https://keras.io/api/datasets/mnist/), a dataset of handwritten digits.

Please note the procedure would be equivalent for any other dataset, e.g. [CIFAR-10](https://keras.io/api/datasets/cifar10/) (with a suitable choice of models).

In [None]:
import tensorflow as tf

In [None]:
from tensorflow import keras
import numpy as np

In [None]:
# Prepare the train and test dataset.
# batch_size = 64
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

# Normalize the input image so that each pixel value is between 0 and 1: this helps us to make the network more "stable" during learning.
x_train = x_train.astype("float32") / 255.0
x_train = np.reshape(x_train, (-1, 28, 28, 1))

x_test = x_test.astype("float32") / 255.0
x_test = np.reshape(x_test, (-1, 28, 28, 1))

First, we create a teacher model and a smaller student model. Both are
convolutional neural networks built with `Sequential()`,
but they could be any Keras model.
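
Before reading the next cell, it can help to anticipate the feature-map sizes the two networks will produce. The sketch below is plain Python (no TensorFlow needed) and applies the `"same"` padding formula quoted in the layer comments, `floor((input_shape - 1) / strides) + 1`, to the three spatial layers shared by teacher and student. The helper name `same_padding_output` and the `spatial_layers` list are illustrative assumptions, not part of Keras.

```python
import math


def same_padding_output(size: int, stride: int) -> int:
    """Spatial output size for padding="same": floor((size - 1) / stride) + 1."""
    return math.floor((size - 1) / stride) + 1


# (layer name, stride) pairs mirroring the spatial layers used by both models.
# Hypothetical list for illustration; the real layers are defined in the next cell.
spatial_layers = [("Conv2D", 2), ("MaxPool2D", 1), ("Conv2D", 2)]

size = 28  # MNIST images are 28x28
trace = [size]
for name, stride in spatial_layers:
    size = same_padding_output(size, stride)
    trace.append(size)
    print(f"{name:<10} stride={stride} -> {size}x{size}")

# Flatten() output length = height * width * filters of the last convolution
teacher_flat = size * size * 512
student_flat = size * size * 32
print("Flattened features:", teacher_flat, "(teacher),", student_flat, "(student)")
```

The values printed here can be checked against the output shapes reported by `teacher.summary()` and `student.summary()`.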

In [None]:
# Create the teacher
teacher = keras.Sequential(
    [
        keras.Input(shape=(28, 28, 1)), # Input() is used to instantiate a Keras tensor.
                                        # A Keras tensor is a symbolic tensor-like object, which we augment with certain attributes that allow us to build a Keras model just by knowing the inputs and outputs of the model.
                                        # For instance, if a, b and c are Keras tensors, it becomes possible to do: model = Model(inputs=[a, b], outputs=c)
                                        # shape: A shape tuple (integers), not including the batch size. For instance, shape=(32,) indicates that the expected input will be batches of 32-dimensional vectors. Elements of this tuple can be None; 'None' elements represent dimensions where the shape is not known.

        keras.layers.Conv2D(filters=256, kernel_size=(3, 3), strides=(2, 2), padding="same"), # Conv2D is a 2D convolution layer (e.g. spatial convolution over images).
                                                                                              # This layer creates a convolution kernel that is convolved with the layer input to produce a tensor of outputs.
                                                                                              # filters: Integer, the dimensionality of the output space (i.e. the number of output filters in the convolution).
                                                                                              # kernel_size: An integer or tuple/list of 2 integers, specifying the height and width of the 2D convolution window. Can be a single integer to specify the same value for all spatial dimensions.
                                                                                              # strides: An integer or tuple/list of 2 integers, specifying the strides of the convolution along the height and width. Can be a single integer to specify the same value for all spatial dimensions.
                                                                                              # padding: one of "valid" or "same" (case-insensitive). "valid" means no padding. "same" results in padding with zeros evenly to the left/right or up/down of the input. When padding="same" and strides=1, the output has the same size as the input.

        keras.layers.LeakyReLU(alpha=0.2),  # Leaky version of a Rectified Linear Unit. It allows a small gradient when the unit is not active:
                                            # f(x) = alpha * x    if x < 0
                                            # f(x) = x            if x >= 0

        keras.layers.MaxPool2D(pool_size=(2, 2), strides=(1, 1), padding="same"), # Max pooling operation for 2D spatial data. Downsamples the input along its spatial dimensions (height and width) by taking the maximum value over an input window (of size defined by pool_size) for each channel of the input. The window is shifted by strides along each dimension.
                                                                                  # The resulting output, when using the "valid" padding option, has a spatial shape (number of rows or columns) of: output_shape = math.floor((input_shape - pool_size) / strides) + 1 (when input_shape >= pool_size)
                                                                                  # The resulting output shape when using the "same" padding option is: output_shape = math.floor((input_shape - 1) / strides) + 1
                                                                                  # pool_size: Integer or tuple of 2 integers, window size over which to take the maximum. (2, 2) will take the max value over a 2x2 pooling window. If only one integer is specified, the same window length will be used for both dimensions.
                                                                                  # strides: Integer, tuple of 2 integers, or None. Strides values. Specifies how far the pooling window moves for each pooling step. If None, it will default to pool_size.
                                                                                  # padding: One of "valid" or "same" (case-insensitive). "valid" means no padding. "same" results in padding evenly to the left/right or up/down of the input such that output has the same height/width dimension as the input.

        keras.layers.Conv2D(filters=512, kernel_size=(3, 3), strides=(2, 2), padding="same"),

        keras.layers.Flatten(), # Flattens the input. Does not affect the batch size.

        keras.layers.Dense(10), # Just your regular densely-connected NN layer.
                                # Dense implements the operation: output = activation(dot(input, kernel) + bias) where activation is the element-wise activation function passed as the activation argument, kernel is a weights matrix created by the layer, and bias is a bias vector created by the layer (only applicable if use_bias is True). These are all attributes of Dense.
                                # Activation function to use. If you don't specify anything, no activation is applied (ie. "linear" activation: a(x) = x).
    ],
    name="teacher",
)

# Create the student
student = keras.Sequential(
    [
        keras.Input(shape=(28, 28, 1)),
        keras.layers.Conv2D(filters=16, kernel_size=(3, 3), strides=(2, 2), padding="same"),
        keras.layers.LeakyReLU(alpha=0.2),
        keras.layers.MaxPooling2D(pool_size=(2, 2), strides=(1, 1), padding="same"),
        keras.layers.Conv2D(filters=32, kernel_size=(3, 3), strides=(2, 2), padding="same"),
        keras.layers.Flatten(),
        keras.layers.Dense(10),
    ],
    name="student",
)

# Clone student for later comparison
student_scratch = keras.models.clone_model(student)

In [None]:
teacher.summary()

In [None]:
student.summary()

**Quiz**

* What is the main difference between what we have called teacher and what we have called student?

* Is it fair to take this difference as a watershed between the model concepts of *teacher* and *student*?

* Why yes? Why not?

*Your answers here*

## **Task 2 - Training models**

In KD we assume that the teacher is trained and fixed. Thus, we start
by training the teacher model on the training set in the usual way.

In [None]:
teacher.compile(
    optimizer=keras.optimizers.Adam(),  # Optimizer that implements the Adam algorithm.
                                        # Adam optimization is a stochastic gradient descent method that is based on adaptive estimation of first-order and second-order moments.

    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),  # Computes the crossentropy loss between the labels and predictions.
                                                                        # Use this crossentropy loss function when there are two or more label classes. We expect labels to be provided as integers.
                                                                        # If you want to provide labels using one-hot representation, please use CategoricalCrossentropy loss.
                                                                        # There should be # classes floating point values per feature for y_pred and a single floating point value per feature for y_true.

    metrics=[keras.metrics.SparseCategoricalAccuracy()],  # Calculates how often predictions match integer labels.
                                                          # You can provide logits of classes as y_pred, since the argmax of the logits and of the probabilities is the same.
)

# Teacher trained on data
teacher_history = teacher.fit(x=x_train, y=y_train, validation_split=0.1, epochs=5)

How can we measure the gain given by KD?

By comparing the performance of the student model trained from scratch with that of the same architecture trained through KD.

In [None]:
student_scratch.compile(
    optimizer=keras.optimizers.Adam(),
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[keras.metrics.SparseCategoricalAccuracy()],
)

# Student trained from scratch.
student_scratch_history = student_scratch.fit(x=x_train, y=y_train, validation_split=0.1, epochs=5)

## **Task 3 - Distillation**

During the KD process, knowledge is transferred by minimizing a pair of loss functions. The student is trained to match both the *softened* outputs of the teacher and the ground-truth labels: the first term is the soft loss, the second is the hard loss (the standard loss).

The soft loss takes its name from the fact that the logits are softened by dividing them by a *temperature* before the softmax, which smooths out the probability distribution and reveals the inter-class relationships learned by the teacher.
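
To make the effect of the temperature concrete, here is a minimal sketch in plain Python. The function `softmax_with_temperature` and the logits in `example_logits` are made up for illustration; they are not taken from the trained teacher.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical teacher logits for three classes (illustrative values only)
example_logits = [8.0, 3.0, 1.0]

for t in (1.0, 3.0, 10.0):
    probs = softmax_with_temperature(example_logits, t)
    print(f"T={t:>4}: " + ", ".join(f"{p:.3f}" for p in probs))
```

As the temperature grows, probability mass moves from the top class to the others, while the ranking of the classes (and therefore the argmax) stays the same.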

The custom `Distiller()` class overrides the `Model` methods `train_step`, `test_step` and `compile()`. What do these methods do?

*   `train_step` -> This is the logic for one training step. This method can be overridden to support custom training logic. This method should contain the mathematical logic for one step of training. This typically includes the forward pass, loss calculation, backpropagation, and metric updates.

*   `test_step` -> This is the logic for one evaluation step. This method can be overridden to support custom evaluation logic. This function should contain the mathematical logic for one step of evaluation. This typically includes the forward pass, loss calculation, and metrics updates.

*   `compile()` -> Configures the model for training through important hyper-parameters such as optimizer, loss, metrics and so on.

In order to use the distiller, we need:

- a trained teacher model and a student model to train;
- a hard loss function: it is a student loss function on the difference between student predictions and ground-truth;
- a soft loss function: it is a distillation loss function, along with a `temperature`, on the difference between the soft student predictions and the soft teacher labels;
- an `alpha` factor to weight the student and distillation loss;
- an optimizer for the student and metrics to evaluate performance.

In the modified `train_step` method we need to:

*   perform a forward pass of both teacher and student;
*   calculate the loss with weighting of the `student_loss` and `distillation_loss` by `alpha` and `1 - alpha`, respectively;
*   perform the backward pass: only the student weights are updated, and therefore we only calculate the gradients for the student weights.
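
The weighting described in the steps above can be sketched for a single sample in plain Python. This is a simplified illustration, not the Keras implementation: the helper names (`hard_loss`, `soft_loss`, `distillation_objective`) and the logits are assumptions made for the example, and no clipping of small probabilities is applied.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def hard_loss(label, student_logits):
    """Sparse categorical cross-entropy from logits for one sample."""
    return -math.log(softmax(student_logits)[label])


def soft_loss(teacher_logits, student_logits, temperature):
    """KL(teacher || student) between the temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def distillation_objective(label, teacher_logits, student_logits,
                           alpha=0.1, temperature=3.0):
    """alpha * hard loss + (1 - alpha) * soft loss."""
    return (alpha * hard_loss(label, student_logits)
            + (1 - alpha) * soft_loss(teacher_logits, student_logits, temperature))


# Hypothetical logits for a sample whose true class is 0 (illustrative values only)
teacher_logits = [8.0, 3.0, 1.0]
student_logits = [4.0, 2.5, 0.5]

loss = distillation_objective(0, teacher_logits, student_logits)
print("hard loss:", round(hard_loss(0, student_logits), 5))
print("soft loss:", round(soft_loss(teacher_logits, student_logits, 3.0), 5))
print("combined :", round(loss, 5))
```

With `alpha=1.0` the objective reduces to the hard loss alone, and the soft loss is zero when student and teacher logits coincide.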

In the modified `test_step` method, we evaluate the student model on the provided dataset.

In [None]:
class Distiller(keras.Model):
    def __init__(self, student, teacher):   # init method or constructor: Distiller instances will need both student and teacher parameters
        super(Distiller, self).__init__()
        self.teacher = teacher
        self.student = student

    def compile(
        self,
        optimizer,
        metrics,
        student_loss_fn,
        distillation_loss_fn,
        alpha=0.1,
        temperature=3,
    ):
        """
        Configure the distiller.
        Args:
            optimizer:            Keras optimizer for the student weights
            metrics:              Keras metrics for evaluation
            student_loss_fn:      Loss function of difference between student predictions and ground-truth (hard loss)
            distillation_loss_fn: Loss function of difference between soft student predictions and soft teacher predictions (soft loss)
            alpha:                Weight to student_loss_fn and 1-alpha to distillation_loss_fn
            temperature:          Temperature for softening probability distributions. Larger temperature gives softer distributions.
        """
        super(Distiller, self).compile(optimizer=optimizer, metrics=metrics)
        self.student_loss_fn = student_loss_fn
        self.distillation_loss_fn = distillation_loss_fn
        self.alpha = alpha
        self.temperature = temperature

    def train_step(self, data):
        # Unpack the data. Its structure depends on your model and on what you pass to `fit()`.
        x, y = data

        # Forward pass of teacher
        teacher_predictions = self.teacher(x, training=False)   # training=False == inference mode

        with tf.GradientTape() as tape:                         # tf.GradientTape(): Record operations for automatic differentiation.

            # Forward pass of student
            student_predictions = self.student(x, training=True)  # training=True == training mode

            # Compute losses
            student_loss = self.student_loss_fn(y, student_predictions)
            distillation_loss = self.distillation_loss_fn(
                tf.nn.softmax(teacher_predictions / self.temperature, axis=1),
                tf.nn.softmax(student_predictions / self.temperature, axis=1),
            )
            loss = self.alpha * student_loss + (1 - self.alpha) * distillation_loss

        # Calculate gradients with respect to the student's trainable variables only (the teacher stays fixed)
        trainable_vars = self.student.trainable_variables
        gradients = tape.gradient(loss, trainable_vars)

        # Update weights
        self.optimizer.apply_gradients(zip(gradients, trainable_vars))

        # Update the metrics configured in `compile()`.
        self.compiled_metrics.update_state(y, student_predictions)

        # Return a dict of performance mapping metric names to current value
        results = {m.name: m.result() for m in self.metrics}
        results.update({"student_loss": student_loss, "distillation_loss": distillation_loss})
        return results

    def test_step(self, data):
        # Unpack the data
        x, y = data

        # Compute predictions
        y_prediction = self.student(x, training=False)

        # Calculate the loss
        student_loss = self.student_loss_fn(y, y_prediction)

        # Update the metrics configured in `compile()`.
        self.compiled_metrics.update_state(y, y_prediction)

        # Return a dict of performance mapping metric names to current value.
        results = {m.name: m.result() for m in self.metrics}
        results.update({"student_loss": student_loss})
        return results

We have already trained the teacher model, and we only need to initialize a
`Distiller(student, teacher)` instance, `compile()` it with the desired losses,
hyperparameters and optimizer, and distill knowledge from the teacher to the student.

In [None]:
# Initialize and compile distiller
student_distiller = Distiller(student=student, teacher=teacher)
student_distiller.compile(
    optimizer=keras.optimizers.Adam(),
    metrics=[keras.metrics.SparseCategoricalAccuracy()],
    student_loss_fn=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    distillation_loss_fn=keras.losses.KLDivergence(), # Computes Kullback-Leibler divergence loss between y_true and y_pred.
                                                      # loss = y_true * log(y_true / y_pred)
    alpha=0.1,
    temperature=10,
)

# Distill teacher to student
student_distilled_history = student_distiller.fit(x=x_train, y=y_train, validation_split=0.1, epochs=3)

**Quiz**

What happens when you change the temperature?

*Your answer here*

## **Task 4 - Comparison of the models**

We can evaluate the performance of the models on the test set. Remember that:

*   Teacher was trained for 5 epochs
*   Student was trained from scratch for 5 epochs
*   Student was distilled by the teacher for 3 epochs

**Quiz**

*   What ranking do you expect?
*   How would you change things?

*Your answers here*

In [None]:
teacher_loss, teacher_accuracy = teacher.evaluate(x_test, y_test)

print("\nTeacher\n- Loss: {}\n- Accuracy: {}".format(round(teacher_loss, 5), round(teacher_accuracy, 5)))

In [None]:
student_scratch_loss, student_scratch_accuracy = student_scratch.evaluate(x_test, y_test)
print("\nStudent trained from scratch\n- Loss: {}\n- Accuracy: {}".format(round(student_scratch_loss, 5), round(student_scratch_accuracy, 5)))

In [None]:
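# Note the order: the Distiller is compiled without a loss, so evaluate() is expected to return the compiled
# accuracy metric first and the extra `student_loss` from `test_step` second (pass return_dict=True to check).
student_distilled_accuracy, student_distilled_loss = student_distiller.evaluate(x_test, y_test)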

print("\nStudent trained through KD\n- Loss: {}\n- Accuracy: {}".format(round(student_distilled_loss, 5), round(student_distilled_accuracy, 5)))

# Independent Study Materials
* Paper *Distilling the Knowledge in a Neural Network* at https://arxiv.org/pdf/1503.02531.pdf by Hinton et al.
