<a href="https://colab.research.google.com/github/prisar/ai_notebooks/blob/main/nb_074.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task
Experiment with knowledge distillation to create a smaller model that matches the accuracy of a large model but runs faster on a T4 GPU. Use the CIFAR-10 dataset.

## Set up the environment

### Subtask:
Install the necessary libraries and configure the runtime to use a T4 GPU.


**Reasoning**:
I will start by installing the required libraries, `tensorflow` and `tensorflow-datasets`, using pip.



In [2]:
!pip install tensorflow tensorflow-datasets



**Reasoning**:
Now that the libraries are installed, I will import them and check that a GPU is visible to TensorFlow. The device listing does not name the GPU model, so the T4 itself is selected in the Colab runtime settings.



In [3]:
import tensorflow as tf
import tensorflow_datasets as tfds
import time

print(tf.config.list_physical_devices('GPU'))

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


## Load and prepare the data

### Subtask:
Load the CIFAR-10 dataset and preprocess it for training.


**Reasoning**:
Load and preprocess the CIFAR-10 dataset, applying normalization and data augmentation, and splitting it into batched training and testing sets.



In [4]:
def preprocess(image, label):
  """Normalizes images and converts labels to float32."""
  image = tf.image.convert_image_dtype(image, dtype=tf.float32)
  return image, tf.cast(label, dtype=tf.float32)

def augment(image, label):
  """Applies data augmentation to training images."""
  image = tf.image.random_flip_left_right(image)
  image = tf.image.random_brightness(image, max_delta=0.2)
  image = tf.image.random_contrast(image, lower=0.8, upper=1.2)
  return image, label

# Load the dataset
(ds_train, ds_test), ds_info = tfds.load(
    'cifar10',
    split=['train', 'test'],
    shuffle_files=True,
    as_supervised=True,
    with_info=True,
)

# Apply preprocessing and augmentation
BUFFER_SIZE = ds_info.splits['train'].num_examples
BATCH_SIZE = 128

ds_train = ds_train.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
ds_train = ds_train.map(augment, num_parallel_calls=tf.data.AUTOTUNE)
ds_train = ds_train.shuffle(BUFFER_SIZE).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

ds_test = ds_test.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
ds_test = ds_test.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

print("CIFAR-10 dataset loaded and preprocessed.")



Downloading and preparing dataset Unknown size (download: Unknown size, generated: Unknown size, total: Unknown size) to /root/tensorflow_datasets/cifar10/3.0.2...


Dataset cifar10 downloaded and prepared to /root/tensorflow_datasets/cifar10/3.0.2. Subsequent calls will reuse this data.
CIFAR-10 dataset loaded and preprocessed.
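
The preprocessing above scales pixels to the [0, 1] range and then shifts brightness by up to 0.2, so bright pixels can end up above 1.0 after augmentation. The sketch below reproduces that arithmetic in plain Python on one row of hypothetical pixel values. The helper names (`to_unit_range`, `adjust_brightness`, `clip`) are illustrative and not part of the notebook; in the TensorFlow pipeline the equivalent clipping step would be `tf.clip_by_value`, which the notebook does not apply.

```python
def to_unit_range(pixels):
    """Scale uint8 pixel values (0-255) to floats in [0, 1]."""
    return [p / 255.0 for p in pixels]


def adjust_brightness(pixels, delta):
    """Add a constant brightness offset to every pixel."""
    return [p + delta for p in pixels]


def clip(pixels, low=0.0, high=1.0):
    """Clamp every pixel back into [low, high]."""
    return [min(max(p, low), high) for p in pixels]


row = [0, 64, 128, 255]                      # hypothetical uint8 pixel values
scaled = to_unit_range(row)                  # values in [0, 1]
brightened = adjust_brightness(scaled, 0.2)  # worst case for max_delta=0.2
clipped = clip(brightened)                   # back inside [0, 1]

print(max(brightened) > 1.0)  # the brightest pixel now exceeds 1.0
print(max(clipped))
```

Whether clipping helps accuracy here is untested; it only keeps augmented inputs in the same range as the test images.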


## Define the teacher and student models

### Subtask:
Create a large, complex model (the teacher) and a smaller, simpler model (the student).


**Reasoning**:
Define the teacher and student model architectures using TensorFlow Keras Sequential API, ensuring they are compatible with CIFAR-10 input and output shapes.



In [5]:
from tensorflow.keras import layers, models

def build_teacher_model():
  """Builds a larger, complex teacher model."""
  model = models.Sequential([
      layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
      layers.BatchNormalization(),
      layers.Conv2D(32, (3, 3), activation='relu'),
      layers.BatchNormalization(),
      layers.MaxPooling2D((2, 2)),
      layers.Dropout(0.25),

      layers.Conv2D(64, (3, 3), activation='relu'),
      layers.BatchNormalization(),
      layers.Conv2D(64, (3, 3), activation='relu'),
      layers.BatchNormalization(),
      layers.MaxPooling2D((2, 2)),
      layers.Dropout(0.25),

      layers.Conv2D(128, (3, 3), activation='relu'),
      layers.BatchNormalization(),
      layers.Conv2D(128, (3, 3), activation='relu'),
      layers.BatchNormalization(),
      layers.MaxPooling2D((2, 2)),
      layers.Dropout(0.25),

      layers.Flatten(),
      layers.Dense(512, activation='relu'),
      layers.BatchNormalization(),
      layers.Dropout(0.5),
      layers.Dense(10) # Output layer for 10 classes
  ])
  return model

def build_student_model():
  """Builds a smaller, simpler student model."""
  model = models.Sequential([
      layers.Conv2D(16, (3, 3), activation='relu', input_shape=(32, 32, 3)),
      layers.MaxPooling2D((2, 2)),
      layers.Conv2D(32, (3, 3), activation='relu'),
      layers.MaxPooling2D((2, 2)),
      layers.Flatten(),
      layers.Dense(64, activation='relu'),
      layers.Dense(10) # Output layer for 10 classes
  ])
  return model

teacher_model = build_teacher_model()
student_model = build_student_model()

print("Teacher model summary:")
teacher_model.summary()

print("\nStudent model summary:")
student_model.summary()


Teacher model summary:



Student model summary:


## Train the teacher model

### Subtask:
Train the teacher model on the CIFAR-10 dataset until it achieves high accuracy.


**Reasoning**:
I will now compile the teacher model with the Adam optimizer, sparse categorical cross-entropy computed on logits, and an accuracy metric. Then I will train it for 20 epochs on the training set, using the test set for validation, and store the history. Finally, I will evaluate the trained model on the test set and print its accuracy.



In [6]:
teacher_model.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy']
)

history = teacher_model.fit(
    ds_train,
    epochs=20,
    validation_data=ds_test
)

test_loss, test_acc = teacher_model.evaluate(ds_test)
print(f"Teacher model accuracy: {test_acc:.4f}")

Epoch 1/20


ValueError: Exception encountered when calling MaxPooling2D.call().

Negative dimension size caused by subtracting 2 from 1 for '{{node sequential_1/max_pooling2d_2_1/MaxPool2d}} = MaxPool[T=DT_FLOAT, data_format="NHWC", explicit_paddings=[], ksize=[1, 2, 2, 1], padding="VALID", strides=[1, 2, 2, 1]](sequential_1/batch_normalization_5_1/batchnorm/add_1)' with input shapes: [?,1,1,128].

Arguments received by MaxPooling2D.call():
  • inputs=tf.Tensor(shape=(None, 1, 1, 128), dtype=float32)

**Reasoning**:
The previous attempt to train the teacher model failed because the feature map reaching the third `MaxPooling2D` layer had already shrunk to 1x1, which is smaller than the 2x2 pooling window. Each unpadded 3x3 convolution trims two pixels from each spatial dimension, and each pooling layer halves what remains. To fix this, I will rebuild the teacher with two changes: `padding='same'` on every convolution, so convolutions preserve the spatial size, and no third `MaxPooling2D` layer. The feature map entering `Flatten` is then 8x8x128, and the model should train successfully.
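
To make the shape arithmetic concrete, here is a minimal framework-free sketch that traces the spatial size through both teacher variants. The helper names (`conv_out`, `pool_out`, `trace`) are hypothetical and only model stride-1 3x3 convolutions and 2x2 pooling, which is all the teacher uses.

```python
def conv_out(size, kernel=3, padding="valid"):
    """Spatial size after a stride-1 convolution."""
    return size if padding == "same" else size - kernel + 1


def pool_out(size, pool=2):
    """Spatial size after a non-overlapping max pool without padding."""
    if size < pool:
        raise ValueError(f"cannot pool a {size}x{size} map with a {pool}x{pool} window")
    return size // pool


def trace(layers, size=32, padding="valid"):
    """Return the spatial size after each 'conv' or 'pool' layer in order."""
    sizes = []
    for layer in layers:
        size = conv_out(size, padding=padding) if layer == "conv" else pool_out(size)
        sizes.append(size)
    return sizes


block = ["conv", "conv", "pool"]
up_to_last_conv = block * 2 + ["conv", "conv"]

# Original teacher: unpadded convolutions leave a 1x1 map for the third pool.
original_convs = trace(up_to_last_conv, padding="valid")
print(original_convs)  # [30, 28, 14, 12, 10, 5, 3, 1]

# Revised teacher: 'same' padding and only two pooling layers.
revised = trace(up_to_last_conv, padding="same")
print(revised)  # [32, 32, 16, 16, 16, 8, 8, 8]
```

Calling `pool_out(1)` raises a `ValueError`, which mirrors the "Negative dimension size" failure in the traceback above.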



In [7]:
from tensorflow.keras import layers, models

def build_teacher_model():
  """Builds a larger, complex teacher model."""
  model = models.Sequential([
      layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3), padding='same'),
      layers.BatchNormalization(),
      layers.Conv2D(32, (3, 3), activation='relu', padding='same'),
      layers.BatchNormalization(),
      layers.MaxPooling2D((2, 2)),
      layers.Dropout(0.25),

      layers.Conv2D(64, (3, 3), activation='relu', padding='same'),
      layers.BatchNormalization(),
      layers.Conv2D(64, (3, 3), activation='relu', padding='same'),
      layers.BatchNormalization(),
      layers.MaxPooling2D((2, 2)),
      layers.Dropout(0.25),

      layers.Conv2D(128, (3, 3), activation='relu', padding='same'),
      layers.BatchNormalization(),
      layers.Conv2D(128, (3, 3), activation='relu', padding='same'),
      layers.BatchNormalization(),
      layers.Dropout(0.25),

      layers.Flatten(),
      layers.Dense(512, activation='relu'),
      layers.BatchNormalization(),
      layers.Dropout(0.5),
      layers.Dense(10) # Output layer for 10 classes
  ])
  return model

teacher_model = build_teacher_model()

teacher_model.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy']
)

history = teacher_model.fit(
    ds_train,
    epochs=20,
    validation_data=ds_test
)

test_loss, test_acc = teacher_model.evaluate(ds_test)
print(f"Teacher model accuracy: {test_acc:.4f}")

Epoch 1/20



391/391 ━━━━━━━━━━━━━━━━━━━━ 41s 46ms/step - accuracy: 0.3582 - loss: 2.1385 - val_accuracy: 0.2362 - val_loss: 2.8139
Epoch 2/20
391/391 ━━━━━━━━━━━━━━━━━━━━ 11s 16ms/step - accuracy: 0.6104 - loss: 1.1155 - val_accuracy: 0.6623 - val_loss: 1.0165
Epoch 3/20
391/391 ━━━━━━━━━━━━━━━━━━━━ 11s 18ms/step - accuracy: 0.7030 - loss: 0.8463 - val_accuracy: 0.7415 - val_loss: 0.7435
Epoch 4/20
391/391 ━━━━━━━━━━━━━━━━━━━━ 12s 18ms/step - accuracy: 0.7405 - loss: 0.7323 - val_accuracy: 0.7433 - val_loss: 0.7545
Epoch 5/20
391/391 ━━━━━━━━━━━━━━━━━━━━ 12s 18ms/step - accuracy: 0.7725 - loss: 0.6497 - val_accuracy: 0.7779 - val_loss: 0.6495
Epoch 6/20
391/391 ━━━━━━━━━━━━━━━━━━━━ 21s 16ms/step - accuracy: 0.7909 - loss: 0.5954 - val_accuracy: 0.7978 - val_loss: 0.5881
Epoch 7/20
391/391

## Implement the distillation loss

### Subtask:
Implement the distillation loss.


**Reasoning**:
Implement the Distiller class including the `__init__`, `compile`, `train_step`, and `test_step` methods as described in the instructions to encapsulate the knowledge distillation logic.



In [8]:
import tensorflow as tf
from tensorflow.keras import layers, models

class Distiller(tf.keras.Model):
    def __init__(self, student, teacher):
        super().__init__()
        self.teacher = teacher
        self.student = student

    def compile(self, optimizer, metrics, student_loss_fn, distillation_loss_fn, alpha=0.1, temperature=3):
        super().compile(optimizer=optimizer, metrics=metrics)
        self.student_loss_fn = student_loss_fn
        self.distillation_loss_fn = distillation_loss_fn
        self.alpha = alpha
        self.temperature = temperature

    def train_step(self, data):
        # Unpack data
        x, y = data

        # Forward pass of teacher
        teacher_predictions = self.teacher(x, training=False)

        with tf.GradientTape() as tape:
            # Forward pass of student
            student_predictions = self.student(x, training=True)

            # Calculate student loss
            student_loss = self.student_loss_fn(y, student_predictions)

            # Calculate distillation loss
            # Soften teacher and student logits
            soft_teacher_predictions = tf.nn.softmax(teacher_predictions / self.temperature, axis=1)
            soft_student_predictions = tf.nn.softmax(student_predictions / self.temperature, axis=1)

            distillation_loss = self.distillation_loss_fn(soft_teacher_predictions, soft_student_predictions)

            # Combine losses
            loss = self.alpha * distillation_loss + (1 - self.alpha) * student_loss

        # Compute gradients
        trainable_vars = self.student.trainable_variables
        gradients = tape.gradient(loss, trainable_vars)

        # Update weights
        self.optimizer.apply_gradients(zip(gradients, trainable_vars))

        # Update metrics (only student metrics)
        self.compiled_metrics.update_state(y, student_predictions)

        # Return a dict of metrics
        results = {m.name: m.result() for m in self.metrics}
        results.update({"student_loss": student_loss, "distillation_loss": distillation_loss})
        return results

    def test_step(self, data):
        # Unpack the data
        x, y = data

        # Compute predictions
        y_pred = self.student(x, training=False)

        # Calculate the loss
        student_loss = self.student_loss_fn(y, y_pred)

        # Update the metrics
        self.compiled_metrics.update_state(y, y_pred)

        # Return a dict of metrics
        results = {m.name: m.result() for m in self.metrics}
        results.update({"student_loss": student_loss})
        return results

print("Distiller class defined.")

Distiller class defined.
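
The loss computed in `train_step` can be checked by hand on a single example. The following is a minimal plain-Python sketch of the same steps (temperature-softened softmax, KL divergence from teacher to student, weighted sum with the hard-label cross-entropy). The logit values are hypothetical, and the small `eps` in the KL term is an assumption added for numerical safety rather than a copy of what `tf.keras.losses.KLDivergence` does internally.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, shifted by the max for stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-7):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def cross_entropy(label, logits):
    """Hard-label cross-entropy computed from raw logits."""
    return -math.log(softmax(logits)[label])


def combined_loss(label, student_logits, teacher_logits, alpha=0.1, temperature=3.0):
    """alpha weights the distillation term, matching the Distiller above."""
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    distill = kl_divergence(soft_teacher, soft_student)
    hard = cross_entropy(label, student_logits)
    return alpha * distill + (1 - alpha) * hard, distill, hard


teacher_logits = [6.0, 2.0, 1.0]  # hypothetical values for a 3-class example
student_logits = [2.0, 1.5, 0.5]

print(softmax(teacher_logits, 1.0))  # sharply peaked on class 0
print(softmax(teacher_logits, 3.0))  # flatter: more weight on the other classes
loss, distill, hard = combined_loss(0, student_logits, teacher_logits)
print(loss, distill, hard)
```

A common variant in the distillation literature multiplies the distillation term by `temperature ** 2` so that its gradient magnitude stays comparable as the temperature changes. The notebook does not do this, and the sketch follows the notebook.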


## Train the student model with distillation

### Subtask:
Train the student model using the combined loss function, which will encourage it to mimic the teacher's predictions.


**Reasoning**:
Instantiate and compile the Distiller model, then train it using the combined loss function.



In [9]:
# Instantiate the Distiller
distiller = Distiller(student=student_model, teacher=teacher_model)

# Compile the Distiller
distiller.compile(
    optimizer='adam',
    metrics=['accuracy'],
    student_loss_fn=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    distillation_loss_fn=tf.keras.losses.KLDivergence(),
    alpha=0.1,      # Weight of the distillation term; the student loss gets 1 - alpha
    temperature=3   # Softens both logit distributions before the KL term
)

# Train the Distiller
print("Training student model with distillation...")
history_distillation = distiller.fit(
    ds_train,
    epochs=30, # Train for a suitable number of epochs
    validation_data=ds_test
)

print("\nStudent model training with distillation completed.")

Training student model with distillation...
Epoch 1/30


```
for metric in self.metrics:
    metric.update_state(y, y_pred)
```

  return self._compiled_metrics_update_state(


391/391 ━━━━━━━━━━━━━━━━━━━━ 14s 17ms/step - accuracy: 0.3026 - distillation_loss: 1.0529 - loss: -0.3623 - student_loss: 1.6907 - val_loss: -0.6536 - val_student_loss: 1.6310
Epoch 2/30
391/391 ━━━━━━━━━━━━━━━━━━━━ 8s 9ms/step - accuracy: 0.4897 - distillation_loss: 0.8725 - loss: -0.6827 - student_loss: 1.3971 - val_loss: -0.8257 - val_student_loss: 1.3373
Epoch 3/30
391/391 ━━━━━━━━━━━━━━━━━━━━ 7s 8ms/step - accuracy: 0.5441 - distillation_loss: 0.7957 - loss: -0.7922 - student_loss: 1.2740 - val_loss: -0.8253 - val_student_loss: 1.3388
Epoch 4/30
391/391 ━━━━━━━━━━━━━━━━━━━━ 8s 8ms/step - accuracy: 0.5762 - distillation_loss: 0.7459 - loss: -0.8106 - student_loss: 1.1983 - val_loss: -0.8585 - val_student_loss: 1.2609
Epoch 5/30
391/391 ━━━━━━━━━━━━━━━━━━━━ 8s 11ms/step - accuracy: 0.6034 - distillation_loss: 0.7051

## Evaluate the models

### Subtask:
Compare the accuracy and inference speed of the teacher and student models.


**Reasoning**:
I will first evaluate the student model's accuracy on the test set. Next, I will define a helper that times 100 `model.predict` calls on one test batch and apply it to both the teacher and the student. Finally, I will print a summary comparing the accuracy and inference time of the two models.



In [10]:
# Evaluate student model
student_loss, student_acc = distiller.student.evaluate(ds_test)

def measure_inference_time(model, dataset):
    """Measures the mean wall-clock time of 100 `model.predict` calls on one batch."""
    # Get a single batch of data
    for images, _ in dataset.take(1):
        pass

    # Warm-up run to avoid initialization overhead
    _ = model.predict(images)

    start_time = time.time()
    for _ in range(100):
        _ = model.predict(images, verbose=0)
    end_time = time.time()

    return (end_time - start_time) / 100

# Measure inference times
teacher_inference_time = measure_inference_time(teacher_model, ds_test)
student_inference_time = measure_inference_time(distiller.student, ds_test)

# Print the comparison summary
print("\n--- Model Comparison ---")
print(f"Teacher Model Accuracy: {test_acc:.4f}")
print(f"Student Model Accuracy: {student_acc:.4f}")
print("-" * 20)
print(f"Teacher Model Inference Time (per batch): {teacher_inference_time:.6f} seconds")
print(f"Student Model Inference Time (per batch): {student_inference_time:.6f} seconds")
print(f"Inference Speed-up: {teacher_inference_time / student_inference_time:.2f}x")

ValueError: You must call `compile()` before using the model.

**Reasoning**:
The previous attempt failed because `evaluate()` requires a compiled model, and the student accessed through `distiller.student` had never been compiled: compiling the `Distiller` wrapper does not compile the student it contains. To fix this, I will compile the student with the same optimizer and hard-label loss used during distillation, then rerun the evaluation and the inference time measurements.



In [11]:
# Compiling the Distiller does not compile the student model it wraps.
# `evaluate` needs a compiled model for its loss and metrics; `predict` does not.
distiller.student.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy']
)


# Evaluate student model
student_loss, student_acc = distiller.student.evaluate(ds_test)

def measure_inference_time(model, dataset):
    """Measures the mean wall-clock time of 100 `model.predict` calls on one batch."""
    # Get a single batch of data
    for images, _ in dataset.take(1):
        pass

    # Warm-up run to avoid initialization overhead
    _ = model.predict(images)

    start_time = time.time()
    for _ in range(100):
        _ = model.predict(images, verbose=0)
    end_time = time.time()

    return (end_time - start_time) / 100

# Measure inference times
teacher_inference_time = measure_inference_time(teacher_model, ds_test)
student_inference_time = measure_inference_time(distiller.student, ds_test)

# Print the comparison summary
print("\n--- Model Comparison ---")
print(f"Teacher Model Accuracy: {test_acc:.4f}")
print(f"Student Model Accuracy: {student_acc:.4f}")
print("-" * 20)
print(f"Teacher Model Inference Time (per batch): {teacher_inference_time:.6f} seconds")
print(f"Student Model Inference Time (per batch): {student_inference_time:.6f} seconds")
print(f"Inference Speed-up: {teacher_inference_time / student_inference_time:.2f}x")

79/79 ━━━━━━━━━━━━━━━━━━━━ 2s 13ms/step - accuracy: 0.7124 - loss: 0.8564
4/4 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step
4/4 ━━━━━━━━━━━━━━━━━━━━ 1s 13ms/step

--- Model Comparison ---
Teacher Model Accuracy: 0.8530
Student Model Accuracy: 0.7064
--------------------
Teacher Model Inference Time (per batch): 0.091269 seconds
Student Model Inference Time (per batch): 0.081764 seconds
Inference Speed-up: 1.12x
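
The two timings above are close to each other, which may mean that fixed per-call overhead in `model.predict` dominates the measurement rather than the forward pass. Below is a minimal sketch of a generic timing helper that takes any zero-argument callable, so the same loop can time a direct forward pass instead. `time_callable` and `dummy_forward` are hypothetical names; the dummy workload only exists so the sketch runs without a GPU or a trained model.

```python
import time


def time_callable(fn, warmup=5, iters=100):
    """Mean wall-clock seconds per call of fn(), measured after warm-up calls."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters


def dummy_forward():
    """Stand-in workload; replace with a real model call."""
    return sum(i * i for i in range(1000))


mean_seconds = time_callable(dummy_forward, warmup=2, iters=20)
print(f"{mean_seconds * 1e3:.4f} ms per call")
```

With the notebook's objects, one untested option would be to pass `lambda: teacher_model(images, training=False).numpy()`, where `images` is a batch taken from `ds_test`. The `.numpy()` call is there on the assumption that copying the result to the host forces the GPU work to finish before the clock stops.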


## Summary:

### Data Analysis Key Findings
*   The teacher model, a larger CNN, achieved a test accuracy of 85.30%.
*   The student model, a smaller CNN trained with knowledge distillation, achieved a test accuracy of 70.64%.
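*   The teacher model averaged 0.0913 seconds per `model.predict` call on one 128-image test batch (mean over 100 calls).
*   The student model averaged 0.0818 seconds per call under the same measurement.
*   The student model was therefore 1.12x faster than the teacher. Each `predict` call processed the batch as 4 steps of 32 images (the `4/4` progress bars) and carries fixed per-call overhead, so this figure likely understates the difference in raw forward-pass time.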

### Insights or Next Steps
*   The student model is only 1.12x faster than the teacher and 14.66 percentage points less accurate (70.64% vs. 85.30%), so the goal of matching the teacher's accuracy was not met. With the chosen hyperparameters ($\alpha=0.1$, temperature=3) and 30 epochs, distillation did not close the gap. In this `Distiller`, $\alpha$ weights the distillation loss, so the teacher's soft targets contributed only 10% of the combined loss.
*   Experiment with different knowledge distillation hyperparameters (e.g., a larger $\alpha$, other temperatures, more epochs) and consider refining the student architecture to bring its accuracy closer to the teacher's while retaining the speed advantage. Timing the forward pass directly, rather than through `model.predict`, would also give a clearer picture of the real speed-up.
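
For context on how much smaller the student is, the parameter counts can be worked out by hand from the layer definitions, since the `summary()` output is empty in this export. This is a sketch under stated assumptions: the teacher is the revised version (`padding='same'`, two pooling layers), batch normalization is counted with four values per channel (including the two non-trainable moving statistics), and the helper names are hypothetical.

```python
def conv2d_params(in_ch, out_ch, kernel=3):
    """Weights plus biases of a Conv2D layer."""
    return kernel * kernel * in_ch * out_ch + out_ch


def dense_params(in_units, out_units):
    """Weights plus biases of a Dense layer."""
    return in_units * out_units + out_units


def batchnorm_params(channels):
    """gamma, beta, moving mean, and moving variance per channel."""
    return 4 * channels


# Student: unpadded convs, so 32 -> 30 -> 15 -> 13 -> 6 and Flatten sees 6*6*32.
student_params = (
    conv2d_params(3, 16)
    + conv2d_params(16, 32)
    + dense_params(6 * 6 * 32, 64)
    + dense_params(64, 10)
)

# Revised teacher: six conv + batch-norm pairs, then Flatten sees 8*8*128.
teacher_channels = [(3, 32), (32, 32), (32, 64), (64, 64), (64, 128), (128, 128)]
teacher_params = sum(
    conv2d_params(i, o) + batchnorm_params(o) for i, o in teacher_channels
)
teacher_params += (
    dense_params(8 * 8 * 128, 512) + batchnorm_params(512) + dense_params(512, 10)
)

print(student_params)  # 79530
print(teacher_params)  # 4490794
print(f"{teacher_params / student_params:.1f}x")
```

By this count the student has roughly 56 times fewer parameters than the teacher, which is far more than the measured 1.12x speed-up would suggest.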
