# Lab 2: Quantization of AI models

## Intro

In this lab you will learn to **optimize** and **quantize AI models** using the **LiteRT** library (previously called Tensorflow Lite \[For Microcontrollers]). <br />

To be able to run the necessary scripts throughout this lab, you will need access to a GPU. You can either **make use of your own GPU** (through a Linux or Windows WSL system, with a GPU-enabled tensorflow installed (version 2.18.0)) **or use Google Colab**. <br />To run notebooks in colab, you will need to download the lab folder on Ufora, **unzip it and put it on your Google Drive** (this folder will only be a few MBs in size). You can **drag and drop** the unzipped folder in your Google Drive.<br /><br />


Next, **double click on the provided .ipynb file** for each lab which will open Google Colab. <br />From there, fill in the necessary variables (such as the path to your Google Drive) and you will be able to **run and program the necessary code. Be sure te select a GPU under Runtime > Change runtime type.**

In [2]:
%pip install --user --upgrade tensorflow-model-optimization
%pip install tf_keras

# Click Runtime > Restart session
# This ensures the above installed libraries are correctly imported

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [None]:
# Run this code to connect your Google Drive to Colab

# from google.colab import drive
# drive.mount('/content/drive')

In [3]:
# Change to your project directory
# path_to_lab = "drive/MyDrive/Colab Notebooks/Embedded-ML-main/" # working on google colab
path_to_lab = "" # working locally

## Functions
Below you can find **functions** which can be used to complete the lab. <br />
_Note: when running the below code for the first time on Google Colab, you will get a warning that you need to restart your runtime session. This is expected because the kernel needs to use the expected tensorflow version._ 

In [4]:
import tensorflow as tf
from tensorflow import keras as keras
import tensorflow_model_optimization as tfmot
import numpy as np
from sklearn.metrics import accuracy_score, classification_report

def mnist_model(train=False):
    model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28)),
    tf.keras.layers.Reshape(target_shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(filters=64, kernel_size=(6, 6), activation=tf.nn.relu, name="conv1"),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
    tf.keras.layers.Dropout(0.25),
    tf.keras.layers.Conv2D(filters=32, kernel_size=(3, 3), activation=tf.nn.relu, name="conv2"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(16, activation=tf.nn.relu, name="dense1"),
    # tf.keras.layers.Dropout(0.25),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax, name="dense2")
    ])

    model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

    if train:
        model.fit(x=train_images, y= train_labels, batch_size=64, epochs=50, validation_data=(test_images, test_labels))
    else:
        # model = tf.keras.models.load_model("Models/mnist.keras")
        model = tf.keras.models.load_model(path_to_lab + "Models/mnist")
    return model


2025-05-09 14:16:33.102638: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2025-05-09 14:16:33.105639: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2025-05-09 14:16:33.114551: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:479] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-05-09 14:16:33.131310: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:10575] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-05-09 14:16:33.131347: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1442] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-05-09 14:16:33.142530: I tensorflow/core/platform/cpu_feature_guard.cc:

## Quantize models using LiteRT


### Part 1: Steps previous lab
1) Similar to the previous lab, load the mnist dataset and pre-trained model. For this exercise we will use a pre-trained model working on the mnist dataset for digit recognition.
2) Evaluate the model. To obtain a baseline performance, evaluate the model without any LiteRT optimizations applied.
3) Convert the model to the LiteRT format and evaluate whether this has an impact on performance or not.

In [19]:
# from code lab1: helper function to verify performance of tflite model
def verify_performance(model_path):
    # Load TFLite model and allocate tensors.
    interpreter = tf.lite.Interpreter(model_path=path_to_lab + model_path)
    interpreter.allocate_tensors()

    # Get input and output tensors.
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    # Test model on random input data.
    # input_shape = input_details[0]['shape']
    # test_image = test_images[0].astype(np.float32)
    # test_image = np.expand_dims(test_image, axis=0)

    # interpreter.set_tensor(input_details[0]['index'], test_image)
    # interpreter.invoke()

    # output_data = interpreter.get_tensor(output_details[0]['index'])
    # predicted_label = np.argmax(output_data)

    correct = 0
    for i in range(len(test_images)):

        # change type of array elements form UINT to float32
        test_image = test_images[i].astype(np.int8)
        # change shape of test img to be batch of lenght 1
        test_image = np.expand_dims(test_image, axis=0)

        # input test_image
        interpreter.set_tensor(input_details[0]['index'], test_image)

        # run model
        interpreter.invoke()

        # get result
        output_data = interpreter.get_tensor(output_details[0]['index'])

        if np.argmax(output_data) == test_labels[i]:
            correct += 1

    accuracy = correct / len(test_images)
    model_name = model_path.split("/")[-1]
    print(f"TFLite Model ({model_name}) Accuracy: {accuracy:.4f}")

In [6]:
# Load dataset
mnist = tf.keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Load pre-trained model
model = mnist_model(train=False) # same as lab0
# model.save("Models/mnist")

## Verify performance by inserting your code below

# ---- see lab 0

# Perform lite model conversion

# -- see lab 1: Part 1, for performance difference (there was none)
# Perform lite model conversion
converter = tf.lite.TFLiteConverter.from_saved_model(path_to_lab + 'Models/mnist') # path to the SavedModel directory
tflite_base_model = converter.convert()

with open(path_to_lab + 'Models/mnist_base.tflite', 'wb') as f:
  f.write(tflite_base_model)

verify_performance(path_to_lab + 'Models/mnist_base.tflite')

2025-05-09 14:16:35.496828: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2025-05-09 14:16:35.500850: W tensorflow/core/common_runtime/gpu/gpu_device.cc:2251] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
W0000 00:00:1746792995.875190  105850 tf_tfl_flatbuffer_helpers.cc:390] Ignored output_format.
W0000 00:00:1746792995.875227  105850 tf_tfl_flatbuffer_helpers.cc:393] Ignored drop_control_dependency.
2025-05-09 14:16:35.875888: I tensorflow/cc/saved_model/reader.cc:83] R

TFLite Model (mnist_base.tflite) Accuracy: 0.9912


### Part 2: New steps in this lab

In [7]:
# helper functions to check diff in models
def print_model_details(model_path):
    interpreter = tf.lite.Interpreter(model_path=model_path)
    interpreter.allocate_tensors()

    model_name = model_path.split("/")[-1]
    print("\n")
    print(f"Model: {model_name}")
    print(f"Number of tensors: {len(interpreter.get_tensor_details())}")
    print(f"Number of ops: {len(interpreter.get_signature_list())}")

    for tensor in interpreter.get_tensor_details()[0:3]:
        print(f"Tensor Name: {tensor['name']}, Shape: {tensor['shape']}, Type: {tensor['dtype']}")
    print("\n")

def check_weight_types(model_path):
    interpreter = tf.lite.Interpreter(model_path=model_path)
    interpreter.allocate_tensors()

    tensor_details = interpreter.get_tensor_details()
    weight_types = {tensor['dtype'] for tensor in tensor_details}

    model_name = model_path.split("/")[-1]
    print(f"Model: {model_name}")
    print(f"Weight data types: {weight_types}")
    print("\n")


4) Covert the model to the LiteRT format and **quantize the model** by enabling **dynamic range** quantization. (See [here](https://ai.google.dev/edge/litert/models/post_training_quantization))

In [8]:
# code for step 4.1

# Perform Dynamic-range quantization
dynamic_range_converter = tf.lite.TFLiteConverter.from_saved_model(path_to_lab + 'Models/mnist') # path to the SavedModel directory
dynamic_range_converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_dynamic_range_quant_model = dynamic_range_converter.convert()

with open(path_to_lab + 'Models/dynamic_range_model.tflite', 'wb') as f:
  f.write(tflite_dynamic_range_quant_model)

W0000 00:00:1746792997.446307  105850 tf_tfl_flatbuffer_helpers.cc:390] Ignored output_format.
W0000 00:00:1746792997.446345  105850 tf_tfl_flatbuffer_helpers.cc:393] Ignored drop_control_dependency.
2025-05-09 14:16:37.446533: I tensorflow/cc/saved_model/reader.cc:83] Reading SavedModel from: Models/mnist
2025-05-09 14:16:37.447908: I tensorflow/cc/saved_model/reader.cc:51] Reading meta graph with tags { serve }
2025-05-09 14:16:37.447920: I tensorflow/cc/saved_model/reader.cc:146] Reading SavedModel debug info (if present) from: Models/mnist
2025-05-09 14:16:37.455380: I tensorflow/cc/saved_model/loader.cc:234] Restoring SavedModel bundle.
2025-05-09 14:16:37.473464: I tensorflow/cc/saved_model/loader.cc:218] Running initialization op on SavedModel bundle at path: Models/mnist
2025-05-09 14:16:37.480115: I tensorflow/cc/saved_model/loader.cc:317] SavedModel load for tags { serve }; Status: success: OK. Took 33586 microseconds.


In [9]:
# Verify performance
verify_performance('Models/dynamic_range_model.tflite')

TFLite Model (dynamic_range_model.tflite) Accuracy: 0.9913


In [10]:
# Check model parameters compared to previous conversion
! ls -lh Models/mnist_base.tflite Models/dynamic_range_model.tflite

print_model_details("Models/mnist_base.tflite")
print_model_details("Models/dynamic_range_model.tflite")

check_weight_types("Models/mnist_base.tflite")
check_weight_types("Models/dynamic_range_model.tflite")

-rw-rw-r-- 1 jasper jasper  67K May  9 14:16 Models/dynamic_range_model.tflite
-rw-rw-r-- 1 jasper jasper 248K May  9 14:16 Models/mnist_base.tflite


Model: mnist_base.tflite
Number of tensors: 25
Number of ops: 1
Tensor Name: serving_default_input_4:0, Shape: [ 1 28 28], Type: <class 'numpy.float32'>
Tensor Name: arith.constant, Shape: [32], Type: <class 'numpy.float32'>
Tensor Name: arith.constant1, Shape: [64], Type: <class 'numpy.float32'>




Model: dynamic_range_model.tflite
Number of tensors: 25
Number of ops: 1
Tensor Name: serving_default_input_4:0, Shape: [ 1 28 28], Type: <class 'numpy.float32'>
Tensor Name: arith.constant, Shape: [2], Type: <class 'numpy.int32'>
Tensor Name: arith.constant1, Shape: [], Type: <class 'numpy.int32'>


Model: mnist_base.tflite
Weight data types: {<class 'numpy.float32'>, <class 'numpy.int32'>}


Model: dynamic_range_model.tflite
Weight data types: {<class 'numpy.int8'>, <class 'numpy.float32'>, <class 'numpy.int32'>}




In [11]:
# normalize to [0,1] for float32 consistency
train_images = train_images.astype(np.float32)
test_images = test_images.astype(np.float32)

In [12]:
# code for step 4.2
# Perform Full int8 quantization

# we need to stimate the range, i.e., (min, max) of all floating-point tensors in the model
def representative_dataset():
  indices = np.random.choice(len(train_images), 200, replace=False) # take evenely/randomly distrubted from data
  for i in indices:
      yield [np.expand_dims(train_images[i], axis=0)]

# set up converter
int8_converter = tf.lite.TFLiteConverter.from_saved_model(path_to_lab + 'Models/mnist') # path to the SavedModel directory
int8_converter.optimizations = [tf.lite.Optimize.DEFAULT]
int8_converter.representative_dataset = representative_dataset
int8_converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8] # only int8

# acutally convert and save
tflite_int8_quant_model = int8_converter.convert() 
with open(path_to_lab + 'Models/int8_model.tflite', 'wb') as f:
  f.write(tflite_int8_quant_model)

W0000 00:00:1746793000.172586  105850 tf_tfl_flatbuffer_helpers.cc:390] Ignored output_format.
W0000 00:00:1746793000.172608  105850 tf_tfl_flatbuffer_helpers.cc:393] Ignored drop_control_dependency.
2025-05-09 14:16:40.172757: I tensorflow/cc/saved_model/reader.cc:83] Reading SavedModel from: Models/mnist
2025-05-09 14:16:40.174008: I tensorflow/cc/saved_model/reader.cc:51] Reading meta graph with tags { serve }
2025-05-09 14:16:40.174018: I tensorflow/cc/saved_model/reader.cc:146] Reading SavedModel debug info (if present) from: Models/mnist
2025-05-09 14:16:40.180966: I tensorflow/cc/saved_model/loader.cc:234] Restoring SavedModel bundle.
2025-05-09 14:16:40.198878: I tensorflow/cc/saved_model/loader.cc:218] Running initialization op on SavedModel bundle at path: Models/mnist
2025-05-09 14:16:40.205522: I tensorflow/cc/saved_model/loader.cc:317] SavedModel load for tags { serve }; Status: success: OK. Took 32769 microseconds.
fully_quantize: 0, inference_type: 6, input_inference_typ

In [13]:
# Verify performance
verify_performance(path_to_lab + 'Models/int8_model.tflite')

TFLite Model (int8_model.tflite) Accuracy: 0.9905


**Q1: Do you see any difference in accuracy? What is changed in terms of the model parameters compared to the previous conversion?**
    
There is no change in accuracy as it stays 0.9912.

Simply comparing the size of the files we see a +- 3.5x reduction&nbsp;

- 68K Mar 28 13:40 Models/dynamic_range_model.tflite
- 248K Mar 28 13:51 Models/mnist_base.tflite

Taking a closer look at the parameters we see there is no difference in amount of tensors between the models but some tensor have different datatypes in the quantized model compared to base, conversion of some tensors from float/int32 to int8 lead to this reduction in size:

- Model: mnist_base.tflite \
Weight data types: {<class 'numpy.float32'="">, <class 'numpy.int32'="">}

- Model: dynamic_range_model.tflite \
Weight data types: {<class 'numpy.int8'="">, <class 'numpy.float32'="">, <class 'numpy.int32'="">}</class></class></class></class></class>
    
**Q2: Compared to dynamic range quantization, what accuracy difference do you get with full int8 precision quantization?**
There is a slight drop (accuracy = 0.9877) if we use only 100 images to represent the data.

If we use 200 images we can narrow the gap (accuracy = 0.9905).

And there is an even better result if we don't naively select the first 200 images. Instead I used `np.linspace` to try and represent the whole dataset better. This resulted in a slight performance boost (accuracy = 0.9910)

With random selection we can get even better results (accuracy = 0.9914), this random approach probably only works good if we use a big enough subset of the data.

This makes sense since we have to try and represent the range the value's in our dataset can take so the values of S and Z get estimated with more realistic r_min and r_max.

5) Try to train the model from scratch using **quantization-aware training.**

Full training from scratch

In [None]:
# code for step 5
# Perform Quantization aware training
model = mnist_model(train=False)
model = tf.keras.models.clone_model(model)  # we need weights from scratch

q_aware_model = tfmot.quantization.keras.quantize_model(model)

q_aware_model.compile(optimizer='adam',
                      loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
                      metrics=['accuracy'])
# keep best fit
checkpoint = tf.keras.callbacks.ModelCheckpoint(path_to_lab + "Models/best_fits/q_aware_model", 
                    monitor="val_loss", mode="min", 
                    save_best_only=True, verbose=0)

# train from scratch
q_aware_model.fit(x=train_images, y= train_labels, batch_size=64, epochs=50, validation_data=(test_images, test_labels), callbacks=[checkpoint])

In [None]:
# reload best fit from checkpoint
q_aware_model.load_weights(path_to_lab + "Models/best_fits/q_aware_model")
# save keras model in proper file
q_aware_model.save(path_to_lab + "Models/qat_mnist")

2025-03-28 16:25:57.818117: W tensorflow/core/util/tensor_slice_reader.cc:98] Could not open Models/best_fits/q_aware_model: FAILED_PRECONDITION: Models/best_fits/q_aware_model; Is a directory: perhaps your file is in a different file format and you need to use a different restore operator?


INFO:tensorflow:Assets written to: Models/qat_mnist/assets


INFO:tensorflow:Assets written to: Models/qat_mnist/assets


In [None]:
test_loss, test_acc = q_aware_model.evaluate(test_images, test_labels, verbose=2)

313/313 - 1s - loss: 0.0306 - accuracy: 0.9919 - 900ms/epoch - 3ms/step


In [None]:
# set up converter
int8_qat_converter = tf.lite.TFLiteConverter.from_saved_model(path_to_lab + "Models/qat_mnist") # path to the SavedModel directory
int8_qat_converter.optimizations = [tf.lite.Optimize.DEFAULT]
int8_qat_converter.representative_dataset = representative_dataset
int8_qat_converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8] # only int8
int8_qat_converter.inference_input_type = tf.int8  
int8_qat_converter.inference_output_type = tf.int8   

# acutally convert and save
tflite_int8_qat_model = int8_qat_converter.convert() 
with open(path_to_lab + 'Models/int8_qat_model.tflite', 'wb') as f:
  f.write(tflite_int8_qat_model)

NameError: name 'tf' is not defined

In [None]:
# Verify performance
verify_performance('Models/int8_qat_model.tflite')

TFLite Model (int8_qat_model.tflite) Accuracy: 0.9918


Fine tuning model with 10 epochs

In [None]:
# code for step 5
# Perform Quantization aware training       
model = mnist_model(train=False)

q_aware_model_fine_tune = tfmot.quantization.keras.quantize_model(model)

q_aware_model_fine_tune.compile(optimizer='adam',
                      loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
                      metrics=['accuracy'])
# keep best fit
checkpoint = tf.keras.callbacks.ModelCheckpoint(path_to_lab + "Models/best_fits/q_aware_ft_model", 
                    monitor="val_loss", mode="min", 
                    save_best_only=True, verbose=0)

# fine tune model
q_aware_model_fine_tune.fit(x=train_images, y= train_labels, batch_size=64, epochs=10, validation_data=(test_images, test_labels), callbacks=[checkpoint])

Epoch 1/10


INFO:tensorflow:Assets written to: Models/best_fits/q_aware_ft_model/assets


Epoch 2/10


INFO:tensorflow:Assets written to: Models/best_fits/q_aware_ft_model/assets


Epoch 3/10


INFO:tensorflow:Assets written to: Models/best_fits/q_aware_ft_model/assets


Epoch 4/10


INFO:tensorflow:Assets written to: Models/best_fits/q_aware_ft_model/assets


Epoch 5/10


INFO:tensorflow:Assets written to: Models/best_fits/q_aware_ft_model/assets


Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


INFO:tensorflow:Assets written to: Models/best_fits/q_aware_ft_model/assets




<tf_keras.src.callbacks.History at 0x7f840a1c39d0>

In [None]:
type(q_aware_model_fine_tune)

tf_keras.src.engine.sequential.Sequential

In [None]:
# reload best fit from checkpoint
q_aware_model_fine_tune.load_weights(path_to_lab + "Models/best_fits/q_aware_ft_model")
# save keras model in proper file
q_aware_model_fine_tune.save(path_to_lab + "Models/qat_ft_mnist")

2025-03-28 16:41:31.213182: W tensorflow/core/util/tensor_slice_reader.cc:98] Could not open Models/best_fits/q_aware_ft_model: FAILED_PRECONDITION: Models/best_fits/q_aware_ft_model; Is a directory: perhaps your file is in a different file format and you need to use a different restore operator?


INFO:tensorflow:Assets written to: Models/qat_ft_mnist/assets


INFO:tensorflow:Assets written to: Models/qat_ft_mnist/assets


In [None]:
test_loss, test_acc = q_aware_model_fine_tune.evaluate(test_images, test_labels, verbose=2)

313/313 - 1s - loss: 0.0375 - accuracy: 0.9902 - 932ms/epoch - 3ms/step


In [None]:
# set up converter
int8_qat_ft_converter = tf.lite.TFLiteConverter.from_saved_model(path_to_lab + "Models/qat_fine_time_mnist") # path to the SavedModel directory
int8_qat_ft_converter.optimizations = [tf.lite.Optimize.DEFAULT]
int8_qat_ft_converter.representative_dataset = representative_dataset
int8_qat_ft_converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8] # only int8

# acutally convert and save
tflite_int8_qat_ft_model = int8_qat_ft_converter.convert() 
with open(path_to_lab + 'Models/int8_qat_ft_model.tflite', 'wb') as f:
  f.write(tflite_int8_qat_ft_model)

W0000 00:00:1743176502.478601   27551 tf_tfl_flatbuffer_helpers.cc:365] Ignored output_format.
W0000 00:00:1743176502.478623   27551 tf_tfl_flatbuffer_helpers.cc:368] Ignored drop_control_dependency.
2025-03-28 16:41:42.478789: I tensorflow/cc/saved_model/reader.cc:83] Reading SavedModel from: Models/qat_fine_time_mnist
2025-03-28 16:41:42.482437: I tensorflow/cc/saved_model/reader.cc:52] Reading meta graph with tags { serve }
2025-03-28 16:41:42.482464: I tensorflow/cc/saved_model/reader.cc:147] Reading SavedModel debug info (if present) from: Models/qat_fine_time_mnist
2025-03-28 16:41:42.501723: I tensorflow/cc/saved_model/loader.cc:236] Restoring SavedModel bundle.
2025-03-28 16:41:42.571358: I tensorflow/cc/saved_model/loader.cc:220] Running initialization op on SavedModel bundle at path: Models/qat_fine_time_mnist
2025-03-28 16:41:42.593971: I tensorflow/cc/saved_model/loader.cc:466] SavedModel load for tags { serve }; Status: success: OK. Took 115181 microseconds.
fully_quantize

In [None]:
verify_performance('Models/int8_qat_ft_model.tflite')

TFLite Model (int8_qat_ft_model.tflite) Accuracy: 0.9915


**Q3: What is the impact on accuracy with quantization-aware training?**

If we train a quantization-aware mnist model from scratch (50 epochs):

No quantization: accuracy = 0.9919
Full int8 precision quant (int8_qat_model.tflite): accuracy = 0.9918

As we can see the starting accuracy is higher then before (0.9912) but the drop after quantizing is also smaller: 0.0001 compared to 0.0002 with no quantization-aware. This is a very small difference but I think the fact that the accuracy is higher overall also makes the small drop impressive.


If we train a mnnist model with fine-tuning (10 epochs)"

No quantization: accuracy = 0.9902
Full int8 precision quant (int8_qat_ft_model.tflite): accuracy = 0.9915

As we can see in the fine-tuned model the accuracy even goes up after applying the quantization.

**Q4: When saving the tflite model, do you see any difference in the model size (full int8 quantization vs no quantization)?**
- 70K Mar 28 16:26 Models/int8_qat_model.tflite -rw-rw-r--
- 254K Mar 28 16:50 Models/qat_mnist.tflite

&rarr; Yes we can see that the quantized model has a size reduction of +- 3.5 like before.

6) **Prune** the first three layers, at 85% AND perform **full INT8 quantization.**

In [None]:
# code for step 6
# Perform pruning + quantization

# Step 6: Prune the first three layers at 85% and perform full INT8 quantization

# Load the pre-trained model
model = mnist_model(train=False)

# Define the layers to prune (first three trainable layers)
layers_to_prune = ["conv1", "conv2", "dense1"]

# Clone the model to apply pruning
model_for_pruning = tf.keras.models.clone_model(model)
model_for_pruning.set_weights(model.get_weights())

# Apply pruning to the specified layers
pruning_params = {
    'pruning_schedule': tfmot.sparsity.keras.ConstantSparsity(
        target_sparsity=0.85,
        begin_step=0,
        end_step=int(train_images.shape[0] / 64 * 10)  # 10 epochs
    )
}

for layer in model_for_pruning.layers:
    if layer.name in layers_to_prune:
        layer = tfmot.sparsity.keras.prune_low_magnitude(layer, **pruning_params)

# Compile the pruned model
model_for_pruning.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
    metrics=['accuracy']
)

# Train the pruned model (fine-tuning)
callbacks = [
    tfmot.sparsity.keras.UpdatePruningStep(),
    tfmot.sparsity.keras.PruningSummaries(log_dir='logs/pruning_85pct'),
    keras.callbacks.EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
]

model_for_pruning.fit(
    train_images, train_labels,
    batch_size=64,
    epochs=10,
    validation_data=(test_images, test_labels),
    callbacks=callbacks
)

# Strip pruning wrappers to finalize the model
final_pruned_model = tfmot.sparsity.keras.strip_pruning(model_for_pruning)

# Save the pruned model
final_pruned_model.save(path_to_lab + 'Models/mnist_pruned_85pct')

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10




INFO:tensorflow:Assets written to: Models/mnist_pruned_85pct/assets


INFO:tensorflow:Assets written to: Models/mnist_pruned_85pct/assets


In [20]:
converter = tf.lite.TFLiteConverter.from_saved_model(path_to_lab + 'Models/mnist_pruned_85pct')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8  # setting input type
converter.inference_output_type = tf.int8  # settuing output type

tflite_pruned_quant_model = converter.convert()

# Save the quantized model
with open(path_to_lab + 'Models/mnist_pruned85_quantint8.tflite', 'wb') as f:
    f.write(tflite_pruned_quant_model)

W0000 00:00:1746793259.994708  105850 tf_tfl_flatbuffer_helpers.cc:390] Ignored output_format.
W0000 00:00:1746793259.994726  105850 tf_tfl_flatbuffer_helpers.cc:393] Ignored drop_control_dependency.
2025-05-09 14:20:59.994879: I tensorflow/cc/saved_model/reader.cc:83] Reading SavedModel from: Models/mnist_pruned_85pct
2025-05-09 14:20:59.995549: I tensorflow/cc/saved_model/reader.cc:51] Reading meta graph with tags { serve }
2025-05-09 14:20:59.995558: I tensorflow/cc/saved_model/reader.cc:146] Reading SavedModel debug info (if present) from: Models/mnist_pruned_85pct
2025-05-09 14:21:00.000813: I tensorflow/cc/saved_model/loader.cc:234] Restoring SavedModel bundle.
2025-05-09 14:21:00.013979: I tensorflow/cc/saved_model/loader.cc:218] Running initialization op on SavedModel bundle at path: Models/mnist_pruned_85pct
2025-05-09 14:21:00.019469: I tensorflow/cc/saved_model/loader.cc:317] SavedModel load for tags { serve }; Status: success: OK. Took 24593 microseconds.
fully_quantize: 0,

In [21]:
# Verify performance
verify_performance('Models/mnist_pruned85_quantint8.tflite')

ValueError: Cannot set tensor: Got value of type FLOAT64 but expected type INT8 for input 0, name: serving_default_input_4:0 

In [None]:
import zipfile
import os

def zip_model(model_path, output_zip_path=None):
    """
    Zips a model file and saves it to the specified output path.
    
    Args:
        model_path (str): Path to the model file to be zipped
        output_zip_path (str, optional): Path for the output zip file. 
                        If None, uses model_path + '.zip'
    """
    if output_zip_path is None:
        output_zip_path = model_path + '.zip'
    
    with zipfile.ZipFile(output_zip_path, 'w', zipfile.ZIP_DEFLATED) as zipf:
        zipf.write(model_path, os.path.basename(model_path))
    
    print(f"Model zipped to: {output_zip_path}")
    print(f"Original size: {os.path.getsize(model_path)} bytes")
    print(f"Zipped size: {os.path.getsize(output_zip_path)} bytes")
    return output_zip_path

In [None]:
! ls -lh Models/mnist_pruned85_quantint8.tflite.zip

-rw-rw-r-- 1 jasper jasper 58K Mar 28 17:28 Models/mnist_pruned85_quantint8.tflite.zip


**Q5: Describe the observed effect in terms of accuracy and zipped model size when performing both pruning (first three layers, 85%) & full int8 quantization. (Tip: check the zipped tflite file size)**]

After pruning and quantization we can decrease the size by 5x while keeping the accuracy high: 0.9897