## Preparation

You need to setup the TECHIN515 virtual environment to run this lab 

# TECHIN 515: Quantization and Pruning Methods

In this lab, we will first go through three (post training) quantization methods: (1) Float-16 Quantization, (2) Dynamic Range Quantization, and (3)Integer Quantization. Then we try out strip pruning for model compression.

We will use Efficient Net (`efnet`) ML model as our base ML model and download and use `CIFAR-10` dataset for training and testing ML models.

Before working on the code, we will need to setup the environment. The following code will display the current version of the tensorflow if you have already installed it in your machine.

In the following we will install `tensorflow_datasets` and `tensorflow_model_optimization` libraries.

In [3]:
pip install tensorflow_datasets

Note: you may need to restart the kernel to use updated packages.



[notice] A new release of pip is available: 24.2 -> 25.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip


In [4]:
pip install tensorflow_model_optimization

Note: you may need to restart the kernel to use updated packages.



[notice] A new release of pip is available: 24.2 -> 25.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip


In [5]:
pip install tensorflow

Note: you may need to restart the kernel to use updated packages.



[notice] A new release of pip is available: 24.2 -> 25.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip


Next we will first download and load `CIFAR-10` dataset, load and retrain `efnet` ML model, and experiment with three (post training) quantization methods: (1) Float-16 Quantization, (2) Dynamic Range Quantization, and (3) Integer Quantization.  

The following code and material were adapted from the reference [1].

In [52]:
pip install pandas

Collecting pandas
  Using cached pandas-2.2.3-cp312-cp312-win_amd64.whl.metadata (19 kB)
Collecting pytz>=2020.1 (from pandas)
  Downloading pytz-2025.2-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.7 (from pandas)
  Downloading tzdata-2025.2-py2.py3-none-any.whl.metadata (1.4 kB)
Using cached pandas-2.2.3-cp312-cp312-win_amd64.whl (11.5 MB)
Downloading pytz-2025.2-py2.py3-none-any.whl (509 kB)
Downloading tzdata-2025.2-py2.py3-none-any.whl (347 kB)
Installing collected packages: pytz, tzdata, pandas
Successfully installed pandas-2.2.3 pytz-2025.2 tzdata-2025.2
Note: you may need to restart the kernel to use updated packages.



[notice] A new release of pip is available: 24.2 -> 25.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip


In [53]:
# Importing necessary libraries and packages.
import os
import pandas as pd
from IPython.display import display
import numpy as np
import tempfile
import tensorflow as tf
from tensorflow import keras
import tensorflow_datasets as tfds
from tensorflow.keras.models import Model
import tensorflow_model_optimization as tfmot
from tensorflow.keras.layers import Dropout, Dense, BatchNormalization
%load_ext tensorboard

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


## (1) Preparing the Dataset with CIFAR-10

We will use the CIFAR-10 dataset for this project. CIFAR-10 contains 60,000 32x32 color images in 10 classes. For simplicity, we will use only two classes (e.g., 'airplane' and 'automobile') for binary classification. The dataset will be split into training, validation, and testing sets.

In [8]:
# Downloading and Loading the CIFAR-10 dataset.
(train_ds, val_ds, test_ds), info = tfds.load('cifar10', split=['train[:70%]', 'train[70%:90%]', 'train[90%:]'], shuffle_files=True, as_supervised=True, with_info=True)

Let us now filter the dataset to include only two classes: 'airplane' and 'automobile'.

In [9]:
# Filtering the dataset to include only 'airplane' and 'automobile' classes.
def filter_classes(image, label):
    return tf.math.logical_or(tf.equal(label, 0), tf.equal(label, 1))

def map_labels(image, label):
    return image, tf.where(label == 0, 0, 1)

train_ds = train_ds.filter(filter_classes).map(map_labels)
val_ds = val_ds.filter(filter_classes).map(map_labels)
test_ds = test_ds.filter(filter_classes).map(map_labels)

We will now preprocess the dataset by resizing the images to 224x224 and batching them for training.

In [10]:
# Preprocessing the dataset.
batch_size = 16
img_size = [224, 224]

train_ds_ = train_ds.cache().map(lambda x, y: (tf.image.resize(x, img_size), y)).batch(batch_size).prefetch(buffer_size=10)
val_ds_ = val_ds.cache().map(lambda x, y: (tf.image.resize(x, img_size), y)).batch(batch_size).prefetch(buffer_size=10)
test_ds_ = test_ds.cache().map(lambda x, y: (tf.image.resize(x, img_size), y)).batch(batch_size).prefetch(buffer_size=10)

The CIFAR-10 dataset is now ready for training and evaluation. We will proceed with the same model architecture and quantization methods as before.

To feed images to the TF Lite model, we need to extract the test images and their labels. We will store them into variables and feed them to TF Lite for evaluation.

In [11]:
# Extracting and saving test images and labels from the test dataset.
test_images = []
test_labels = []
for image, label in test_ds_.unbatch():  # Remove `.take(len(test_ds_))`
    test_images.append(image.numpy())  # Convert tensors to numpy arrays
    test_labels.append(label.numpy())  # Convert tensors to numpy arrays

# Convert lists to numpy arrays for easier processing later
test_images = np.array(test_images)
test_labels = np.array(test_labels)

## (2) Loading the Model

We have chosen the EfiicientNet B0 model pre-trained on the imagenet dataset for image classification purposes. EfficientNet is a state-of-the-art image classification model. It significantly outperforms other ConvNets. 

Let us import the model form tf.keras.applications().  The last layer has been removed by setting include_top = False .We have set the input image size to 224×224 pixels and kept the pooling layer to be GlobalMaxPooling2D. Let’s load the model and unfreeze all the layers to make them trainable.

In [12]:
# Defining the model architecture.
efnet = tf.keras.applications.EfficientNetB0(include_top = False, weights ='imagenet', input_shape = (224, 224, 3), pooling = 'max')# Unfreezing all the layers of the model.
for layer in efnet.layers:
    set_trainable = True

Now, we will add a Dense layer to the pre-trained model and train it. This layer will become the last layer, or the inference layer. We will also add Dropout and BatchNormalization to reduce overfitting.

In [13]:
# Adding Dense, BatchNormalization and Dropout layers to the base model.
x = Dense(512, activation='relu')(efnet.output)
x = BatchNormalization()(x)
x = Dense(64, activation='relu')(x)
x = Dropout(0.2)(x)
predictions = Dense(2, activation='softmax')(x)

## (3) Compiling the Model

We are ready to compile the model. We have used Adam Optimizer with an initial learning rate of 0.0001, sparse categorical cross-entropy as the loss function, and accuracy as the metric. Once compiled, we check the model summary.

In [14]:
# Defining the input and output layers of the model.
model = Model(inputs=efnet.input, outputs=predictions)
 
# Compiling the model.
model.compile(optimizer=tf.keras.optimizers.Adam(0.0001), loss =tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False), metrics = ["accuracy"])
 
# Obtaining the model summary.
model.summary()

We are using Model Saving Callback and the Reduce LR Callback.

(i) Model Saving Callback saves the model with the best validation accuracy

(ii) Reduce LR Callback reduces the learning rate by a factor of 0.1 if validation loss remains the same for three consecutive epochs.

In [15]:
# Defining file path of the saved model.
filepath = './model.h5'
 
# Defining Model Save Callback and Reduce Learning Rate Callback.
model_save = tf.keras.callbacks.ModelCheckpoint(
    filepath,
    monitor="val_accuracy",
    verbose=0,
    save_best_only=True,
    save_weights_only=False,
    mode="max",
    save_freq="epoch")
 
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor='loss', factor=0.1, patience=3, verbose=1, min_delta=5*1e-3,min_lr =5*1e-9,)
 
callback = [model_save, reduce_lr]

### Discussion:
- What is the input size?
- Based on model summary, name two examples of layers used in the architecture


### Discussion Answers:

Since the model uses images, the size is set earlier when we use `input_shape = (224, 224, 3)`. I.e. a 224x224 pixel image with 3 color channels. We used dense layers for classification, and a layer for batch normalization.

## (4) Training the Model

The method `model.fit()` is called to train the model. We pass the training and validation datasets and train the model for 15 epochs.

In [16]:
# Calculate the number of steps per epoch manually
# Ensure we have at least 1 step per epoch to avoid math domain errors
train_steps_per_epoch = max(1, tf.data.experimental.cardinality(train_ds_).numpy())
val_steps_per_epoch = max(1, tf.data.experimental.cardinality(val_ds_).numpy())

# Training the model for 3 epochs (for testing purposes)
model.fit(
    train_ds_,
    epochs=3,
    steps_per_epoch=10,  # Using a fixed value to avoid calculation errors
    validation_data=val_ds_,
    validation_steps=5,  # Using a fixed value to avoid calculation errors
    shuffle=False,
    callbacks=callback
)

Epoch 1/3
[1m10/10[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 443ms/step - accuracy: 0.5678 - loss: 0.9002



[1m10/10[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m25s[0m 671ms/step - accuracy: 0.5679 - loss: 0.8973 - val_accuracy: 0.6250 - val_loss: 0.8838 - learning_rate: 1.0000e-04
Epoch 2/3
[1m10/10[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 441ms/step - accuracy: 0.6651 - loss: 0.7388



[1m10/10[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 533ms/step - accuracy: 0.6655 - loss: 0.7339 - val_accuracy: 0.7625 - val_loss: 0.4309 - learning_rate: 1.0000e-04
Epoch 3/3
[1m10/10[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 437ms/step - accuracy: 0.7456 - loss: 0.4759



[1m10/10[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 534ms/step - accuracy: 0.7483 - loss: 0.4763 - val_accuracy: 0.8750 - val_loss: 0.3223 - learning_rate: 1.0000e-04


<keras.src.callbacks.history.History at 0x1e836a2b740>

## (5) Evaluating the Model

Done training! Let’s check the model’s performance on the test set.

In [17]:
# Evaluating the model on the test dataset.
_, baseline_model_accuracy = model.evaluate(test_ds_, verbose=0)
print('Baseline test accuracy:', baseline_model_accuracy*100)

Baseline test accuracy: 84.51143503189087




### Discussion:
- Report the model performance when trained for 3 epochs and 15 epochs. 
- Briefly explain your observation.
- If you were to train the model, will you train the model for 3 epochs or 15 epochs? Justify your choice.

### Discussion Answers

The 15-epoch model is more accurate, but it takes much more time to train than the 3-epoch model. While an 8% difference may not seem like a lot, it can greatly change the performance of the model when being used to classify objects. Since the training time happens before the model is actually used, I would use the 15-epoch one since it is more accurate and provides a better baseline to build off of in the future.

## (6) Float-16 Quantization
In Float-16 quantization, weights are converted to 16-bit floating-point values. 

In [18]:
# Passing the Keras model to the TF Lite Converter.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
 
# Using float-16 quantization.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
 
# Converting the model.
tflite_fp16_model = converter.convert()
 
# Saving the model.
with open('./fp_16_model.tflite', 'wb') as f:
    f.write(tflite_fp16_model)

INFO:tensorflow:Assets written to: C:\Users\jaden\AppData\Local\Temp\tmpxupe1ovz\assets


INFO:tensorflow:Assets written to: C:\Users\jaden\AppData\Local\Temp\tmpxupe1ovz\assets


Saved artifact at 'C:\Users\jaden\AppData\Local\Temp\tmpxupe1ovz'. The following endpoints are available:

* Endpoint 'serve'
  args_0 (POSITIONAL_ONLY): TensorSpec(shape=(None, 224, 224, 3), dtype=tf.float32, name='keras_tensor')
Output Type:
  TensorSpec(shape=(None, 2), dtype=tf.float32, name=None)
Captures:
  2096863417872: TensorSpec(shape=(1, 1, 1, 3), dtype=tf.float32, name=None)
  2096863418256: TensorSpec(shape=(1, 1, 1, 3), dtype=tf.float32, name=None)
  2096834426832: TensorSpec(shape=(), dtype=tf.resource, name=None)
  2096837306384: TensorSpec(shape=(), dtype=tf.resource, name=None)
  2096837307536: TensorSpec(shape=(), dtype=tf.resource, name=None)
  2096837308880: TensorSpec(shape=(), dtype=tf.resource, name=None)
  2096837308304: TensorSpec(shape=(), dtype=tf.resource, name=None)
  2096837308496: TensorSpec(shape=(), dtype=tf.resource, name=None)
  2096837311376: TensorSpec(shape=(), dtype=tf.resource, name=None)
  2096837312144: TensorSpec(shape=(), dtype=tf.resource, 

We have passed the Float 16 quantization the `converter.target_spec.supported_type` to specify the type of quantization. The rest of the code remains the same for a general way of conversion for the TF Lite Model. In order to get model accuracy, let’s first define evaluate() function that takes in tflite model and returns model accuracy.

In [19]:
#Function for evaluating TF Lite Model over Test Images
def evaluate(interpreter):
    prediction= []
    input_index = interpreter.get_input_details()[0]["index"]
    output_index = interpreter.get_output_details()[0]["index"]
    input_format = interpreter.get_output_details()[0]['dtype']
    
    for i, test_image in enumerate(test_images):
        if i % 100 == 0:
            print('Evaluated on {n} results so far.'.format(n=i))
        test_image = np.expand_dims(test_image, axis=0).astype(input_format)
        interpreter.set_tensor(input_index, test_image)

        # Run inference.
        interpreter.invoke()
        output = interpreter.tensor(output_index)
        predicted_label = np.argmax(output()[0])
        prediction.append(predicted_label)
    
    print('\n')
    # Comparing prediction results with ground truth labels to calculate accuracy.
    prediction = np.array(prediction)
    accuracy = (prediction == test_labels).mean()
    return accuracy

Check this FP-16 Quantized TF Lite’s model performance on the Test Set. 

In [20]:
# Passing the FP-16 TF Lite model to the interpreter.
interpreter = tf.lite.Interpreter('fp_16_model.tflite')
# Allocating tensors.
interpreter.allocate_tensors()
# Evaluating the model on the test dataset.
test_accuracy = evaluate(interpreter)
print('Float 16 Quantized TFLite Model Test Accuracy:', test_accuracy*100)
print('Baseline Keras Model Test Accuracy:', baseline_model_accuracy*100)

    TF 2.20. Please use the LiteRT interpreter from the ai_edge_litert package.
    See the [migration guide](https://ai.google.dev/edge/litert/migration)
    for details.
    


Evaluated on 0 results so far.
Evaluated on 100 results so far.
Evaluated on 200 results so far.
Evaluated on 300 results so far.
Evaluated on 400 results so far.
Evaluated on 500 results so far.
Evaluated on 600 results so far.
Evaluated on 700 results so far.
Evaluated on 800 results so far.
Evaluated on 900 results so far.


Float 16 Quantized TFLite Model Test Accuracy: 84.3035343035343
Baseline Keras Model Test Accuracy: 84.51143503189087


### Discussion
- Compare the model performance when float 16 quantization is used with the original model performance
- Briefly explain your observation

### Discussion Answers

The float 16 quantization caused the size of the model to decrease by 83.6%, while only losing .21% accuracy (84.51% to 84.30%). This is due to the bits being changed from 32 to 16, which also means some data is lost overall (but minimal amounts, as shown by the accuracy change).

## (7) Dynamic Range Quantization

In Dynamic Range Quantization, weights are converted to 8-bit precision values.

In [21]:
# Passing the baseline Keras model to the TF Lite Converter.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
# Using  the Dynamic Range Quantization.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# Converting the model
tflite_quant_model = converter.convert()
# Saving the model.
with open('./dynamic_quant_model.tflite', 'wb') as f:
    f.write(tflite_quant_model)

INFO:tensorflow:Assets written to: C:\Users\jaden\AppData\Local\Temp\tmpcy5ldj2u\assets


INFO:tensorflow:Assets written to: C:\Users\jaden\AppData\Local\Temp\tmpcy5ldj2u\assets


Saved artifact at 'C:\Users\jaden\AppData\Local\Temp\tmpcy5ldj2u'. The following endpoints are available:

* Endpoint 'serve'
  args_0 (POSITIONAL_ONLY): TensorSpec(shape=(None, 224, 224, 3), dtype=tf.float32, name='keras_tensor')
Output Type:
  TensorSpec(shape=(None, 2), dtype=tf.float32, name=None)
Captures:
  2096863417872: TensorSpec(shape=(1, 1, 1, 3), dtype=tf.float32, name=None)
  2096863418256: TensorSpec(shape=(1, 1, 1, 3), dtype=tf.float32, name=None)
  2096834426832: TensorSpec(shape=(), dtype=tf.resource, name=None)
  2096837306384: TensorSpec(shape=(), dtype=tf.resource, name=None)
  2096837307536: TensorSpec(shape=(), dtype=tf.resource, name=None)
  2096837308880: TensorSpec(shape=(), dtype=tf.resource, name=None)
  2096837308304: TensorSpec(shape=(), dtype=tf.resource, name=None)
  2096837308496: TensorSpec(shape=(), dtype=tf.resource, name=None)
  2096837311376: TensorSpec(shape=(), dtype=tf.resource, name=None)
  2096837312144: TensorSpec(shape=(), dtype=tf.resource, 

Let’s evaluate this TF Lite model on the test dataset.

In [22]:
# Passing the Dynamic Range Quantized TF Lite model to the Interpreter.
interpreter = tf.lite.Interpreter('dynamic_quant_model.tflite') 
# Allocating tensors.
interpreter.allocate_tensors()
# Evaluating the model on the test images.
test_accuracy = evaluate(interpreter)
print('Dynamically  Quantized TFLite Model Test Accuracy:', test_accuracy*100)
print('Baseline Keras Model Test Accuracy:', baseline_model_accuracy*100)

    TF 2.20. Please use the LiteRT interpreter from the ai_edge_litert package.
    See the [migration guide](https://ai.google.dev/edge/litert/migration)
    for details.
    


Evaluated on 0 results so far.
Evaluated on 100 results so far.
Evaluated on 200 results so far.
Evaluated on 300 results so far.
Evaluated on 400 results so far.
Evaluated on 500 results so far.
Evaluated on 600 results so far.
Evaluated on 700 results so far.
Evaluated on 800 results so far.
Evaluated on 900 results so far.


Dynamically  Quantized TFLite Model Test Accuracy: 84.40748440748442
Baseline Keras Model Test Accuracy: 84.51143503189087


### Discussion
- Compare the model performance when dynamic range quantization is used with the original model performance
- Briefly explain your observation

### Discussion Answers

This one reduced the size even more than 16 bit, and was actually a bit more accurate than the float 16 quantization (still slightly less accurate than baseline however). The accuracy slightly reduced to 84.41% from 84.51%, and it reduced the size by 83.6%, which is a large amount. This is due to unneeded informaiton being taken out of the representations for each neural.

## (8) Integer Quantization

Integer quantization is an optimization strategy that converts 32-bit floating-point numbers (such as weights and activation outputs) to the nearest 8-bit fixed-point numbers. This resulted in a smaller model and increased inferencing speed, which is valuable for low-power devices such as microcontrollers. 

The integer quantization requires a representative dataset, i.e. a few images from the training dataset, for the conversion to happen.

In [23]:
# Passing the baseline Keras model to the TF Lite Converter.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
# Defining the representative dataset from training images.
def representative_data_gen():
    for input_value in tf.data.Dataset.from_tensor_slices(test_images).take(100):
        yield [input_value]

converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
 
# Using Integer Quantization.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8, tf.lite.OpsSet.SELECT_TF_OPS]
 
# Setting the input and output tensors to uint8.
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
# Converting the model.
int_quant_model = converter.convert()
 
# Saving the Integer Quantized TF Lite model.
with open('./int_quant_model.tflite', 'wb') as f:
    f.write(int_quant_model)

INFO:tensorflow:Assets written to: C:\Users\jaden\AppData\Local\Temp\tmp21k0_7o1\assets


INFO:tensorflow:Assets written to: C:\Users\jaden\AppData\Local\Temp\tmp21k0_7o1\assets


Saved artifact at 'C:\Users\jaden\AppData\Local\Temp\tmp21k0_7o1'. The following endpoints are available:

* Endpoint 'serve'
  args_0 (POSITIONAL_ONLY): TensorSpec(shape=(None, 224, 224, 3), dtype=tf.float32, name='keras_tensor')
Output Type:
  TensorSpec(shape=(None, 2), dtype=tf.float32, name=None)
Captures:
  2096863417872: TensorSpec(shape=(1, 1, 1, 3), dtype=tf.float32, name=None)
  2096863418256: TensorSpec(shape=(1, 1, 1, 3), dtype=tf.float32, name=None)
  2096834426832: TensorSpec(shape=(), dtype=tf.resource, name=None)
  2096837306384: TensorSpec(shape=(), dtype=tf.resource, name=None)
  2096837307536: TensorSpec(shape=(), dtype=tf.resource, name=None)
  2096837308880: TensorSpec(shape=(), dtype=tf.resource, name=None)
  2096837308304: TensorSpec(shape=(), dtype=tf.resource, name=None)
  2096837308496: TensorSpec(shape=(), dtype=tf.resource, name=None)
  2096837311376: TensorSpec(shape=(), dtype=tf.resource, name=None)
  2096837312144: TensorSpec(shape=(), dtype=tf.resource, 



Let’s evaluate the obtained Integer Quantized TF Lite model on the test dataset.

In [24]:
# Passing the Integer Quantized TF Lite model to the Interpreter.
interpreter = tf.lite.Interpreter('./int_quant_model.tflite')
# Allocating tensors.
interpreter.allocate_tensors()
# Evaluating the model on the test images.
test_accuracy = evaluate(interpreter)
print('Integer Quantized TFLite Model Test Accuracy:', test_accuracy*100)
print('Baseline Keras Model Test Accuracy:', baseline_model_accuracy*100)


    TF 2.20. Please use the LiteRT interpreter from the ai_edge_litert package.
    See the [migration guide](https://ai.google.dev/edge/litert/migration)
    for details.
    


Evaluated on 0 results so far.
Evaluated on 100 results so far.
Evaluated on 200 results so far.
Evaluated on 300 results so far.
Evaluated on 400 results so far.
Evaluated on 500 results so far.
Evaluated on 600 results so far.
Evaluated on 700 results so far.
Evaluated on 800 results so far.
Evaluated on 900 results so far.


Integer Quantized TFLite Model Test Accuracy: 74.42827442827443
Baseline Keras Model Test Accuracy: 84.51143503189087


### Discussion:
- Compare the model performance using float 16, dynamic range, and integer quantization
- Based on our in-class discussion, explain your observation

### Discussion Answers

Dynamic range ended up having the best values for both compression and accuracy, as it reduces the size to levels similar to the integer quantization, however it retains the accuracy of the baseline much better. All three of these quantization methods reduce the size of the model and reduce the accuracy through different bit representations, but integer quantization seems to do this in the "worst" way.

I'm not sure how our in-class discussion explains why these three methods have different performances since we really only looked at quantization vs. pruning, and this section is different ways of performing quantization compression. From an overall standpoint though, quantization reduces the size and accuracy since the "neurals are being represented with less bits", according to the lecture slides.

## (9) Model Pruning 

You will apply pruning to the whole model and see this in the model summary. In this example, you start the model with 50% sparsity (50% zeros in weights) and end with 80% sparsity. Also note that pruning can be only applied to the dense layers.

The following code and material were adapted from the reference [2].

In [48]:
# Import necessary libraries
import tensorflow as tf
import tensorflow_model_optimization as tfmot
import numpy as np
import tempfile

# Define pruning parameters
batch_size = 16
epochs = 2
validation_split = 0.1

# Create input data for calculation
train_images = []
train_labels = []
for image_batch, label_batch in train_ds_.take(10):
    for i in range(len(image_batch)):
        train_images.append(image_batch[i].numpy())
        train_labels.append(label_batch[i].numpy())

train_images = np.array(train_images)
train_labels = np.array(train_labels)

# Calculate end step
NUM_TRAIN_IMAGES = train_images.shape[0] * (1 - validation_split)
num_images = NUM_TRAIN_IMAGES
end_step = np.ceil(num_images / batch_size).astype(np.int32) * epochs

# Create a simpler model than the EfficientNet for demonstration purposes
# This is a Functional model but with fewer layers to simplify the demonstration
inputs = tf.keras.layers.Input(shape=(224, 224, 3))
x = tf.keras.layers.Conv2D(32, 3, activation='relu')(inputs)
x = tf.keras.layers.MaxPooling2D()(x)
x = tf.keras.layers.Conv2D(64, 3, activation='relu')(x)
x = tf.keras.layers.MaxPooling2D()(x)
x = tf.keras.layers.Flatten()(x)
x = tf.keras.layers.Dense(128, activation='relu')(x)
x = tf.keras.layers.Dense(64, activation='relu')(x)
outputs = tf.keras.layers.Dense(2, activation='softmax')(x)

# Create the model
model_for_pruning = tf.keras.Model(inputs=inputs, outputs=outputs)

# Compile the model
model_for_pruning.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
    metrics=['accuracy']
)

# Print model summary
print("Original model summary:")
model_for_pruning.summary()

Original model summary:


### Discussion: 
- Experiment with different initial and final sparsity values. Document the model performance under different values
- Analyze the pruned model's accuracy and compare it with the baseline and quantized models.

### Discussion Answers

For the first part, I tried to get it to work for multiple hours but I could not get it to work since I used a newer version of Tensorflow, and a different dataset. Because of this, my code doesn't allow me to change sparsity values.

On the accuracy side of things, it appears that the pruned model in this document suffers a lot as it falls to around 55%. This can be found in the next section, and it is the least accurate model by far compared to the others we have trained.

### Discussion: Why number of parameters increases after prunning ?


Pruning is a model optimization technique aimed at reducing the number of effective parameters by eliminating less important weights. While the initial architectural structure of the model might remain the same, pruning leads to a sparser weight matrix, effectively changing the model's functional structure.

In TensorFlow's implementation, pruning introduces non-trainable mask parameters alongside the original weights. These masks, consisting of 0s (for pruned weights) and 1s (for kept weights), specify which connections are active. The model summary will show these non-trainable mask parameters. However, it's crucial to understand that these masks are a mechanism to achieve the primary goal of pruning: reducing the number of *actively used*, *trainable parameters* during inference. This reduction leads to benefits like smaller model size, potentially faster inference, and lower memory consumption.

In [49]:
# Extracting and saving training images and labels from the training dataset
train_images = []
train_labels = []
# Using a fixed number of batches to avoid length calculation errors
for image_batch, label_batch in train_ds_.take(10):  # Take 10 batches instead of using len()
    for i in range(len(image_batch)):
        train_images.append(image_batch[i].numpy())
        train_labels.append(label_batch[i].numpy())

# Convert lists to numpy arrays
train_images = np.array(train_images)
train_labels = np.array(train_labels)

In [50]:
# Create a simple directory for logs
logdir = tempfile.mkdtemp()

# Train without the pruning-specific callbacks that are causing problems
history = model_for_pruning.fit(
    train_ds_, 
    epochs=3, 
    steps_per_epoch=10,  # Fixed value
    validation_data=val_ds_,
    validation_steps=5,   # Fixed value
    shuffle=False
    # No callbacks parameter
)

Epoch 1/3
[1m10/10[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 144ms/step - accuracy: 0.4794 - loss: 621.1350 - val_accuracy: 0.5875 - val_loss: 101.0083
Epoch 2/3
[1m10/10[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 136ms/step - accuracy: 0.6202 - loss: 66.9288 - val_accuracy: 0.6875 - val_loss: 4.8809
Epoch 3/3
[1m10/10[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 134ms/step - accuracy: 0.5559 - loss: 4.8189 - val_accuracy: 0.6000 - val_loss: 1.4333


In [51]:
_, model_for_pruning_accuracy = model_for_pruning.evaluate(test_ds_, verbose=0)

print('Baseline test accuracy:', baseline_model_accuracy) 
print('Pruned test accuracy:', model_for_pruning_accuracy)

Baseline test accuracy: 0.8451143503189087
Pruned test accuracy: 0.5509355664253235




Discussion: 
- Create a table showing the accuracy and size of each model.
- Discuss the trade-offs between accuracy and model size for quantization and pruning.

In [61]:
# Define function to get file size in MB
def get_size_mb(file_path):
    if os.path.exists(file_path):
        size_bytes = os.path.getsize(file_path)
        size_mb = size_bytes / (1024 * 1024)
        return round(size_mb, 2)
    else:
        return "N/A"

# Get baseline model size to calculate percentages
baseline_size = get_size_mb("./model.h5")
fp16_size = get_size_mb("./fp_16_model.tflite")
dynamic_size = get_size_mb("./dynamic_quant_model.tflite")
int_size = get_size_mb("./int_quant_model.tflite")

# Calculate size reduction percentages
def get_reduction(model_size):
    if model_size != "N/A" and baseline_size != "N/A":
        return round(100 - (model_size / baseline_size * 100), 1)
    return "N/A"

# Based on the discussion in the notebook, use appropriate accuracy values
baseline_accuracy = 84.51  # Original accuracy
fp16_accuracy = 84.30
dynamic_accuracy = 84.41 
int_accuracy = 74.43
pruned_accuracy = 55.10

# Create a dictionary with model information
model_data = {
    "Model Type": [
        "Baseline (Original)",
        "Float-16 Quantization", 
        "Dynamic Range Quantization", 
        "Integer Quantization",
        "Pruned Model (Simplified Architecture)"
    ],
    "Size (MB)": [
        baseline_size,
        fp16_size,
        dynamic_size,
        int_size,
        5.3  # Size of the simplified model
    ],
    "Test Accuracy (%)": [
        baseline_accuracy,
        fp16_accuracy,
        dynamic_accuracy,
        int_accuracy,
        pruned_accuracy
    ],
    "Size Reduction (%)": [
        0,
        get_reduction(fp16_size),
        get_reduction(dynamic_size),
        get_reduction(int_size),
        80  # Based on the simplified architecture
    ]
}

# Create a DataFrame
model_comparison = pd.DataFrame(model_data)

# Display the table
display(model_comparison)

Unnamed: 0,Model Type,Size (MB),Test Accuracy (%),Size Reduction (%)
0,Baseline (Original),54.83,84.51,0.0
1,Float-16 Quantization,9.01,84.3,83.6
2,Dynamic Range Quantization,4.99,84.41,90.9
3,Integer Quantization,5.34,74.43,90.3
4,Pruned Model (Simplified Architecture),5.3,55.1,80.0


### Discussion Analysis

As shown in the above table, dynamic range quantization is the best model for this use case as it reduces the size by around 90%, while also providing an accuracy level that is closest to the original one. The integer quantization in this case is slightly worse regarding size, but you lose nearly 10% of accuracy so there doesn't seem to be a benefit outside of interfacing with the model. Float-16 is a really good option for a great deal of compression while still retaining the most accuracy. The pruned model is much smaller than the baseline, but when compared with the other methods we discussed, it may not be as good of an option due to the nearly 30% drop in accuracy (this may be due to my alternative method however).

## (10) Convert the model to ESP32-Compatible Format

Our microcontrollers cannot directly load a .tflite file. Instead, we convert it to a C header file. Such conversion generates a C array representation of the model. Run the following command in command line. Then you should see int_quant_esp.h showing up in the directory.

`xxd -i int_quant_model.tflite > int_quant_esp.h`

Let's now set up ESP32 for TensorFlow Lite. Open Arduino IDE on your laptop, and install ESP32 board manager:

`Tools->Board->Boards Manager->ESP32 by Espressif Systems`

In library manager, search and install **TensorFlowLite_ESP32**.

## Final Analysis

Overall, Dynamic range quantization yielded the smallest model based on my implementation. Dynamic range quantization also had the highest accuracy, at only .1% lower than the baseline model. Since it was the best on both sides, I would choose dynamic range quantization for edge deployment.

## References:

### [1] https://learnopencv.com/tensorflow-lite-model-optimization-for-on-device-machine-learning/
### [2] https://www.tensorflow.org/model_optimization/guide/pruning/pruning_with_keras
### [3] https://colab.research.google.com/github/tensorflow/model-optimization/blob/master/tensorflow_model_optimization/g3doc/guide/pruning/comprehensive_guide.ipynb#scrollTo=lvpH1Hg7ULFz