<h1>Quantization and benchmarking of deep learning models using ONNX Runtime and STM32Cube.AI Developer Cloud</h1>
    
    


<p>
Quantization converts the original floating-point parameters and intermediate activations of a model into lower-precision integer representations. This reduction in precision can significantly decrease the memory footprint and computational cost of the model, making it more efficient to deploy on an STM32 board using STM32Cube.AI, or on any other resource-constrained device.

ONNX Runtime Quantization is a feature of ONNX Runtime that provides tools to quantize models in the ONNX format, including methods for quantizing weights and activations, and allows quantized models to be executed efficiently.


**This notebook demonstrates static post-training quantization of deep learning models using ONNX Runtime. It covers quantizing the model with a calibration dataset or with fake data and evaluating both the full-precision and the quantized model. The STM32Cube.AI Developer Cloud is then used to benchmark the models and to generate the model C code to be deployed on your STM32 board.**
</p>

## License of the Jupyter Notebook

This software component is licensed by ST under BSD-3-Clause license,
the "License"; 

You may not use this file except in compliance with the
License. 

You may obtain a copy of the License at: https://opensource.org/licenses/BSD-3-Clause

Copyright (c) 2023 STMicroelectronics. All rights reserved

<div style="border-bottom: 3px solid #273B5F">
<h2>Table of contents</h2>
<ul style="list-style-type: none">
  <li><a href="#settings">1. Settings</a>
  <ul style="list-style-type: none">
    <li><a href="#install">1.1 Install and import necessary packages</a></li>
    <li><a href="#select">1.2 Select input model filename and dataset folder</a></li>
  </ul>
</li>
<li><a href="#quantization">2. Quantization</a></li>
      <ul style="list-style-type: none">
    <li><a href="#opset">2.1 Opset conversion</a></li>
    <li><a href="#dataset">2.2 Creating calibration dataset</a></li>
    <li><a href="#quantize">2.3 Quantize the model using QDQ quantization to int8 weights and activations</a></li>
  </ul>
<li><a href="#Model validation">3. Model validation </a></li>
<li><a href="#benchmark_both">4. Benchmarking the Models on the STM32Cube.AI Developer Cloud</a></li>
      <ul style="list-style-type: none">
    <li><a href="#proxy">4.1 Proxy setting and connection to the STM32Cube.AI Developer Cloud</a></li>
    <li><a href="#Benchmark_both">4.2 Benchmark the models on a STM32 target</a></li>
    <li><a href="#generate">4.3 Generate the model optimized C code for STM32</a></li>
         

  </ul>
</ul>
</div>




<div id="settings">
    <h2>1. Settings</h2>
</div>


<div id="install">
    <h3>1.1 Install and import necessary packages </h3>
</div>

In [1]:
import sys

!{sys.executable} -m pip install numpy==1.23.5
!{sys.executable} -m pip install onnx==1.15.0
!{sys.executable} -m pip install onnxruntime==1.18.1
!{sys.executable} -m pip install tensorflow==2.15.0
!{sys.executable} -m pip install scikit-learn

!{sys.executable} -m pip install Pillow==9.4.0
!{sys.executable} -m pip install matplotlib
!{sys.executable} -m pip install tqdm
!{sys.executable} -m pip install marshmallow 

# for the cloud service

!{sys.executable} -m pip install gitdir

ERROR: Could not find a version that satisfies the requirement tensorflow==2.15.0 (from versions: 2.2.0, 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.5.0, 2.5.1, 2.5.2, 2.5.3, 2.6.0rc0, 2.6.0rc1, 2.6.0rc2, 2.6.0, 2.6.1, 2.6.2, 2.6.3, 2.6.4, 2.6.5, 2.7.0rc0, 2.7.0rc1, 2.7.0, 2.7.1, 2.7.2, 2.7.3, 2.7.4, 2.8.0rc0, 2.8.0rc1, 2.8.0, 2.8.1, 2.8.2, 2.8.3, 2.8.4, 2.9.0rc0, 2.9.0rc1, 2.9.0rc2, 2.9.0, 2.9.1, 2.9.2, 2.9.3, 2.10.0rc0, 2.10.0rc1, 2.10.0rc2, 2.10.0rc3, 2.10.0, 2.10.1, 2.11.0rc0, 2.11.0rc1, 2.11.0rc2, 2.11.0, 2.11.1, 2.12.0rc0, 2.12.0rc1, 2.12.0, 2.12.1, 2.13.0rc0, 2.13.0rc1, 2.13.0rc2, 2.13.0, 2.13.1)
ERROR: No matching distribution found for tensorflow==2.15.0


In [1]:
import glob
import os
import random
import shutil

import numpy as np 
import tensorflow as tf
from datetime import datetime
from tqdm import tqdm
from typing import Tuple, Optional, List, Dict

import onnx
import onnxruntime
from onnx import version_converter
from onnxruntime import quantization
from onnxruntime.quantization import (CalibrationDataReader, CalibrationMethod, QuantFormat, QuantType, quantize_static)

2025-04-27 11:27:06.086682: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2025-04-27 11:27:06.488948: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2025-04-27 11:27:06.491380: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.



<div id="select">
    <h3>1.2 Select input folder</h3>
</div>


In [2]:
input_dir = "./output/retina-ann-v6-evs-1000"


<div id="quantization">
    <h2>2. Quantization</h2>
</div>

<div id="opset">
    <h3>2.1 Opset conversion</h3>
</div>

In this section, we upgrade the model's opset to version 15 to take advantage of advanced optimizations such as batch normalization folding and to ensure compatibility with recent versions of ONNX and ONNX Runtime. To do this, we run the code below.

To check which opset numbers are supported by your ONNX Runtime version, please refer to [the official documentation of ONNX Runtime](https://onnxruntime.ai/docs/reference/compatibility.html).

In [4]:
def change_opset(input_model: str, new_opset: int) -> Optional[str]:
    """
    Converts the opset version of an ONNX model to a new opset version.

    Args:
        input_model (str): The path to the input ONNX model.
        new_opset (int): The new opset version to convert the model to.

    Returns:
        Optional[str]: The path to the converted ONNX model, or None if the converted model fails to load.
    """
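    if not input_model.endswith('.onnx'):
        raise ValueError("Error! The model must be in onnx format")
    model = onnx.load(input_model)
    # Check the current opset version of the default ONNX domain
    # (opset_import may also list custom domains, so do not rely on index 0 alone)
    current_opset = next(
        (op.version for op in model.opset_import if op.domain in ("", "ai.onnx")),
        model.opset_import[0].version,
    )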
    if current_opset == new_opset:
        print(f"The model is already using opset {new_opset}")
        return input_model

    # Modify the opset version in the model
    converted_model = version_converter.convert_version(model, new_opset)
    temp_model_path = input_model+ '.temp'
    onnx.save(converted_model, temp_model_path)

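    # Load the converted model with ONNX Runtime to check that it is valid
    try:
        session = onnxruntime.InferenceSession(temp_model_path)
        session.get_inputs()
    except Exception as e:
        print(f"An error occurred while loading the modified model: {e}")
        # Clean up the temporary file and leave the original model untouched
        if os.path.exists(temp_model_path):
            os.remove(temp_model_path)
        return None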

    # Replace the original model file with the modified model
    os.replace(temp_model_path, input_model)
    print(f"The model has been converted to opset {new_opset} and saved at the same location.")
    return input_model

input_model_path = os.path.join(input_dir, "models", "model.onnx")
change_opset(input_model_path, new_opset=15)

The model has been converted to opset 15 and saved at the same location.


'./output/retina-ann-v6-evs-1000/models/model.onnx'

<div id="dataset">
    <h3> 2.2 Creating the calibration dataset </h3>
</div>

During ONNX Runtime quantization, the model is run on the calibration data to collect statistics about the dynamic range of each tensor to be quantized. These statistics are then used to determine the main quantization parameters: the scale factor and the zero-point (or offset) that map the floating-point values to integers.
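As an illustration of how min/max statistics turn into quantization parameters, here is a minimal sketch in plain Python. It assumes asymmetric int8 quantization with the range [-128, 127]; the helper names (`minmax_calibrate`, `compute_qparams`, `quantize`, `dequantize`) and the sample values are made up for this example and are not part of ONNX Runtime, whose implementation may differ in details.

```python
from typing import Iterable, List, Tuple

QMIN, QMAX = -128, 127  # signed int8 range


def minmax_calibrate(batches: Iterable[List[float]]) -> Tuple[float, float]:
    """Track the running minimum and maximum over all calibration batches."""
    lo, hi = float("inf"), float("-inf")
    for batch in batches:
        lo = min(lo, min(batch))
        hi = max(hi, max(batch))
    # Extend the range to include 0.0 so that zero is exactly representable.
    return min(lo, 0.0), max(hi, 0.0)


def compute_qparams(lo: float, hi: float) -> Tuple[float, int]:
    """Derive the scale and zero-point for asymmetric int8 quantization."""
    scale = (hi - lo) / (QMAX - QMIN)
    if scale == 0.0:  # constant tensor: avoid a division by zero later on
        scale = 1.0
    zero_point = int(round(QMIN - lo / scale))
    return scale, max(QMIN, min(QMAX, zero_point))


def quantize(x: float, scale: float, zero_point: int) -> int:
    """Map a float to int8, saturating at the range limits."""
    q = int(round(x / scale)) + zero_point
    return max(QMIN, min(QMAX, q))


def dequantize(q: int, scale: float, zero_point: int) -> float:
    """Map an int8 value back to its approximate float value."""
    return (q - zero_point) * scale


# Hypothetical activation values observed on two calibration batches.
calibration_batches = [[0.0, 0.5, 1.0], [-0.2, 0.3, 2.35]]
lo, hi = minmax_calibrate(calibration_batches)
scale, zero_point = compute_qparams(lo, hi)
print(f"range=[{lo}, {hi}] scale={scale:.4f} zero_point={zero_point}")
```

Values outside the calibrated range saturate at -128 or 127, which is why the calibration set should be representative of the data seen at inference time.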

The next code section below contains:

* The `EyeTrackingDataReader` class, which inherits from the ONNX Runtime `CalibrationDataReader` and implements the `get_next` method to provide input data dictionaries for the calibration process.
* The dataset setup, which loads the dataset and training parameters from `input_dir`, builds an `EyeTrackingDataModule` with a batch size of 1, and uses a `RandomSampler` to draw 64 calibration samples from the training set.
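The calling convention of a calibration data reader can be sketched without ONNX Runtime: `get_next` returns a dictionary that maps input names to data, and returns `None` once the data is exhausted. The `ListDataReader` class, the `count_samples` helper, and the input name `"input"` below are hypothetical stand-ins for illustration only; a real reader would subclass `CalibrationDataReader` and yield NumPy arrays.

```python
from typing import Dict, Iterator, List, Optional


class ListDataReader:
    """Stand-in that mimics the get_next/rewind interface of a calibration reader."""

    def __init__(self, input_name: str, samples: List[List[float]]) -> None:
        self.input_name = input_name
        self.samples = samples
        self._iterator: Optional[Iterator[Dict[str, List[float]]]] = None

    def get_next(self) -> Optional[Dict[str, List[float]]]:
        # Create the iterator lazily so that rewind() can simply drop it.
        if self._iterator is None:
            self._iterator = iter({self.input_name: s} for s in self.samples)
        # None signals the end of the calibration data to the caller.
        return next(self._iterator, None)

    def rewind(self) -> None:
        self._iterator = None


def count_samples(reader: ListDataReader) -> int:
    """Consume the reader the way a calibrator would: until get_next returns None."""
    count = 0
    while reader.get_next() is not None:
        count += 1
    return count


reader = ListDataReader("input", [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
first_pass = count_samples(reader)
exhausted = reader.get_next()   # still None until the reader is rewound
reader.rewind()
second_pass = count_samples(reader)
print(first_pass, exhausted, second_pass)
```

Because the end of the data is signalled by `None`, a reader that logs a message whenever `get_next` returns `None` will also log it once at the normal end of calibration.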

**Note:** Using a different normalization method during quantization than during training can affect the scale of the data and lead to a loss of accuracy in the quantized model. For example, if you used TensorFlow's normalization method during training, where the data is scaled by dividing each pixel value by 255.0, you should also use this method during quantization. Similarly, if you used Torch's normalization method during training, where the data is scaled by subtracting the mean and dividing by the standard deviation, you should also use this method during quantization.

Using the same normalization method for both training and quantization ensures that the quantized model retains the accuracy achieved during training. Therefore, it is important to pay attention to the normalization method used during both training and quantization to ensure the best possible accuracy for your model.
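To make the effect of a normalization mismatch concrete, the sketch below applies both conventions to the same pixel values. The mean and standard deviation are hypothetical values chosen for readability, not statistics of any real dataset, and the function names are made up for this example.

```python
from typing import List, Tuple


def scale_to_unit(pixels: List[int]) -> List[float]:
    """TensorFlow-style scaling: divide each pixel by 255 to map [0, 255] to [0, 1]."""
    return [p / 255.0 for p in pixels]


def standardize(pixels: List[int], mean: float, std: float) -> List[float]:
    """Torch-style normalization: scale to [0, 1], subtract the mean, divide by std."""
    return [(p / 255.0 - mean) / std for p in pixels]


def value_range(values: List[float]) -> Tuple[float, float]:
    return min(values), max(values)


pixels = [0, 128, 255]
MEAN, STD = 0.5, 0.25  # hypothetical dataset statistics

unit_range = value_range(scale_to_unit(pixels))
std_range = value_range(standardize(pixels, MEAN, STD))
print("divide by 255:  ", unit_range)
print("mean/std scaled:", std_range)
```

With these assumed statistics the second convention spans a range four times wider and is centered on zero, so quantization parameters calibrated with one convention would not fit inputs prepared with the other.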

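In the section below, the calibration samples are loaded and preprocessed by the `EyeTrackingDataModule`, configured from the `dataset_params.yaml` and `training_params.yaml` files found in `input_dir`. To align the preprocessing of the quantization dataset with the preprocessing of the trained model, make sure these files describe the settings that were used during training.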

In [5]:
from torch.utils.data import RandomSampler, DataLoader

from data.module import EyeTrackingDataModule 
from data.utils import load_yaml_config

# Representative dataset function for calibration
class EyeTrackingDataReader(CalibrationDataReader):
    """
    A class used to provide a representative dataset for calibration.

    Attributes
    ----------
    train_dataloader : DataLoader
        The training data loader
    enum_data : iter
        Enumerator for iterating through the dataset
    """

    def __init__(self, model_path: str, train_dataloader: DataLoader) -> None:
        """
        Initializes the EyeTrackingDataReader class.

        Parameters
        ----------
        model_path : str
            Path to the ONNX model, used to look up the model input name
        train_dataloader : DataLoader
            The data loader for training data
        """
        self.train_dataloader = train_dataloader
        self.enum_data = None  # Enumerator for calibration data 
        
        try:
            first_batch = next(iter(self.train_dataloader))
            print("First batch of data:", first_batch[0].shape)  # Print the shape of the first batch
        except StopIteration:
            print("train_dataloader is empty!")
            
        # Use inference session to get input shape
        session = onnxruntime.InferenceSession(model_path, None)
        (_, channel, height, width) = session.get_inputs()[0].shape
        self.input_name = session.get_inputs()[0].name

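    def get_next(self) -> Optional[dict]:
        if self.enum_data is None:
            self.enum_data = self._create_enumerator()

        # Returning None tells the calibrator that the dataset is exhausted,
        # so the message below is also printed once at the normal end of calibration
        data = next(self.enum_data, None)
        if data is None:
            print("No data returned!")
        return data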

    def rewind(self) -> None:
        """
        Resets the enumeration of the dataset.
        """
        self.enum_data = None  # Reset the enumerator for the dataset

    def _create_enumerator(self):
        """
        Creates an iterator that generates representative dataset items.

        Yields
        ------
        dict
            A dictionary mapping the model input name to one input sample
        """
        for input_data, _, _ in self.train_dataloader:
            input_data = input_data.detach().cpu().numpy().astype(np.float32)
            for i in range(input_data.shape[0]): 
                yield {self.input_name: input_data[i]} 
                
# Load dataset params
training_params = load_yaml_config(os.path.join(input_dir, "training_params.yaml"))
dataset_params = load_yaml_config(os.path.join(input_dir, "dataset_params.yaml"))  
training_params["batch_size"] = 1
data_module = EyeTrackingDataModule(dataset_params=dataset_params, training_params=training_params, num_workers=16)
data_module.setup(stage='fit')

sampler = RandomSampler(data_module.train_dataset, replacement=True, num_samples=64)
train_dataloader = data_module.train_dataloader(sampler)
data_reader = EyeTrackingDataReader(input_model_path, train_dataloader)



First batch of data: torch.Size([1, 1, 2, 64, 64])


# Header C File

In [6]:
import os
import torch
import numpy as np

# Set the save path
save_path = os.path.join("output", "sample_batches.h")
num_batches_to_save = 5
saved_batches = []

# Save batches from DataLoader
for i, (input_data, target, _) in enumerate(train_dataloader):
    if i >= num_batches_to_save:
        break
    # Detach and convert to numpy
    np_input = input_data.detach().cpu().numpy()
    saved_batches.append(np_input)

# Write to C header
with open(save_path, "w") as f:
    f.write("#ifndef SAMPLE_BATCHES_H\n")
    f.write("#define SAMPLE_BATCHES_H\n\n")

    for batch_idx, batch in enumerate(saved_batches):
        batch = batch.astype(np.float32)
        flat_data = batch.flatten()
        shape = batch.shape  # e.g. (1, 1, 2, 64, 64) for the batches loaded above

        f.write(f"// Batch {batch_idx}, shape: {shape}\n")
        f.write(f"static const float sample_batch_{batch_idx}[] = {{\n")
        for i, value in enumerate(flat_data):
            f.write(f"{value:.6f}f")
            if i < len(flat_data) - 1:
                f.write(", ")
            if (i + 1) % 8 == 0:
                f.write("\n")
        f.write("\n};\n\n")

    f.write("#endif // SAMPLE_BATCHES_H\n")

<div id="quantize">
    <h3> 2.3 Quantize the model using QDQ quantization to int8 weights and activations </h3>
</div>

The following section first preprocesses the float32 ONNX model to prepare it for quantization, then quantizes it to an int8 ONNX model with the ``quantize_static`` function. We recommend using this function with calibration data and with the following supported argument settings.


<table>
<tr>
<th style="text-align: left">Argument</th>
<th style="text-align: left">Description /  CUBE.AI recommendation</th>
</tr>
    
<tr><td style="text-align: left">quant_format</td>
<td style="text-align: left"> <p> QuantFormat.QDQ: <strong>recommended</strong>, it quantizes the model by inserting QuantizeLinear/DequantizeLinear nodes on the tensors. QuantFormat.QOperator: <strong>not recommended</strong>, it quantizes the model with quantized operators directly </p> </td></tr>
<tr><td style="text-align: left">activation_type</td>
<td style="text-align: left"> <p> QuantType.QInt8: <strong>recommended</strong>, it quantizes the activations to int8. QuantType.QUInt8: <strong>not recommended</strong>, it quantizes the activations to uint8 </p> </td></tr>
<tr><td style="text-align: left">weight_type</td>
<td style="text-align: left"> <p> QuantType.QInt8: <strong>recommended</strong>, it quantizes the weights to int8. QuantType.QUInt8: <strong>not recommended</strong>, it quantizes the weights to uint8 </p> </td></tr>
<tr><td style="text-align: left">per_channel</td>
<td style="text-align: left"> <p>True: <strong>recommended</strong>, the quantization parameters are computed separately for each channel, based on the data within that channel. False: supported and <strong>not recommended</strong>, a single set of quantization parameters is computed for the whole tensor </p> </td>
</tr>
<tr><td style="text-align: left">ActivationSymmetric</td>
<td style="text-align: left"> <p>False: <strong>recommended</strong>, it quantizes the activations to the range [-128, +127] with a zero_point derived from the calibration data. True: supported, it quantizes the activations to the range [-128, +127] with zero_point=0 </p> </td>
</tr>
<tr>
<td style="text-align: left">WeightSymmetric</td>
<td style="text-align: left"> <p>True: <strong>highly recommended</strong>, it quantizes the weights to the range [-127, +127] with zero_point=0. False: supported and <strong>not recommended</strong>, it quantizes the weights to the range [-128, +127] </p> </td>
</tr>
<tr>
<td style="text-align: left">reduce_range</td>
<td style="text-align: left"> <p>True: <strong>highly recommended</strong>, it quantizes the weights with 7 bits. It may improve the accuracy for some models, especially in per-channel mode </p> </td>
</tr>
</table>
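The difference between per-tensor and per-channel quantization with symmetric weights can be illustrated with a small sketch in plain Python. The weight values are made up, and the helpers (`symmetric_scale`, `fake_quantize`, `max_error`) are hypothetical names for this example; the sketch assumes the symmetric int8 range [-127, 127] with zero_point=0 and is not how ONNX Runtime implements it internally.

```python
from typing import List


def symmetric_scale(values: List[float], qmax: int = 127) -> float:
    """Scale for symmetric quantization: the largest magnitude maps to qmax."""
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def fake_quantize(values: List[float], scale: float, qmax: int = 127) -> List[float]:
    """Quantize to [-qmax, qmax] with zero_point = 0, then dequantize."""
    restored = []
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))
        restored.append(q * scale)
    return restored


def max_error(original: List[float], restored: List[float]) -> float:
    return max(abs(a - b) for a, b in zip(original, restored))


# Two output channels with very different magnitudes (made-up values).
weights = [[0.01, -0.02, 0.015], [1.0, -2.0, 1.5]]

# Per-tensor: a single scale shared by every channel.
tensor_scale = symmetric_scale([w for channel in weights for w in channel])
per_tensor_err = [max_error(ch, fake_quantize(ch, tensor_scale)) for ch in weights]

# Per-channel: one scale per output channel.
channel_scales = [symmetric_scale(ch) for ch in weights]
per_channel_err = [
    max_error(ch, fake_quantize(ch, s)) for ch, s in zip(weights, channel_scales)
]

print("per-tensor errors: ", per_tensor_err)
print("per-channel errors:", per_channel_err)
```

In this example the small-magnitude channel is poorly resolved by the shared scale, because that scale is dictated by the large channel; giving each channel its own scale reduces the error on the small channel while leaving the large channel unchanged.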

In [6]:
# Preprocess the model to infer shapes of each tensor
infer_model = os.path.splitext(input_model_path)[0] + '_infer' + os.path.splitext(input_model_path)[1]
print('Infer for the model: {}...'.format(os.path.basename(input_model_path)))
quantization.quant_pre_process(input_model_path=input_model_path, output_model_path=infer_model, skip_optimization=False)

# Prepare quantized ONNX model filename
quant_model = os.path.splitext(input_model_path)[0] + '_QDQ_quant' + os.path.splitext(input_model_path)[1] 
print('Quantize the model {}, please wait...'.format(os.path.basename(input_model_path)))

quantize_static(
        infer_model,
        quant_model,
        data_reader,
        calibrate_method=CalibrationMethod.MinMax, 
        quant_format=QuantFormat.QDQ,
        per_channel=True,
        weight_type=QuantType.QInt8, 
        activation_type=QuantType.QInt8, 
        reduce_range=True,
        extra_options={
            'WeightSymmetric': True,
            'ActivationSymmetric': False,
            'AddQDQPairToInput': False,
            'AddQDQPairToOutput': False,
        })

now = datetime.now()
current_time = now.strftime("%H:%M:%S")
print(current_time + ' - ' + '{} model has been created.'.format(os.path.basename(quant_model)))
quantized_session = onnxruntime.InferenceSession(quant_model)

Infer for the model: model.onnx...
Quantize the model model.onnx, please wait...
No data returned!
11:27:49 - model_QDQ_quant.onnx model has been created.
