<h1>Following the tutorial on Static Quantization with Hugging Face 'optimum':
<a href='https://www.philschmid.de/static-quantization-optimum'>source</a>.</h1>

<h2>1. Environment and Libraries</h2>

In [4]:
pip install onnx onnxruntime onnxruntime-tools optimum


Collecting onnx
  Downloading onnx-1.17.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (16 kB)
Collecting onnxruntime
  Downloading onnxruntime-1.20.1-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (4.5 kB)
Collecting onnxruntime-tools
  Downloading onnxruntime_tools-1.7.0-py3-none-any.whl.metadata (14 kB)
Collecting optimum
  Downloading optimum-1.24.0-py3-none-any.whl.metadata (21 kB)
Collecting coloredlogs (from onnxruntime)
  Downloading coloredlogs-15.0.1-py2.py3-none-any.whl.metadata (12 kB)
Collecting py3nvml (from onnxruntime-tools)
  Downloading py3nvml-0.2.7-py3-none-any.whl.metadata (13 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.11->optimum)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.11->optimum)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Co

In [6]:
import torch
import time
import numpy as np
import gc
import os
import platform
import psutil
from transformers import AutoTokenizer, AutoModelForCausalLM

# Function to get system specs
def get_system_specs():
    specs = {
        "OS": platform.system() + " " + platform.release(),
        "Python Version": platform.python_version(),
        "CPU": platform.processor(),
        "CPU Cores": os.cpu_count(),
        "RAM (GB)": round(psutil.virtual_memory().total / (1024**3), 2),
        "PyTorch Version": torch.__version__,
        # NOTE: this expression yields the package name ("transformers"), not a
        # version string; `transformers.__version__` holds the actual version.
        "Transformers Version": AutoTokenizer.__module__.split('.')[0]
    }

    # Check if GPU is available
    if torch.cuda.is_available():
        specs["GPU"] = torch.cuda.get_device_name(0)
        specs["CUDA Version"] = torch.version.cuda
    else:
        specs["GPU"] = "None (Running on CPU)"

    return specs

# Print system specs
specs = get_system_specs()
print("\n--- System Specifications ---")
for key, value in specs.items():
    print(f"{key}: {value}")
print("-" * 40)



--- System Specifications ---
OS: Linux 6.1.85+
Python Version: 3.11.11
CPU: x86_64
CPU Cores: 2
RAM (GB): 12.67
PyTorch Version: 2.5.1+cu124
Transformers Version: transformers
GPU: Tesla T4
CUDA Version: 12.4
----------------------------------------
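
The `Transformers Version` row above shows the package name because `AutoTokenizer.__module__.split('.')[0]` returns the top-level module name rather than a version string. Below is a minimal sketch of one way to read installed versions with the standard-library `importlib.metadata`. The helper name `installed_version` and the package list are made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package):
    """Return the installed version of `package`, or a placeholder if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return "not installed"

# Package names are illustrative; adjust to whatever the notebook depends on.
for name in ("transformers", "torch", "optimum", "onnxruntime"):
    print(f"{name}: {installed_version(name)}")
```

Returning a placeholder instead of raising keeps the specs table printable even when an optional package is missing.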


In [8]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"device: {device}")

device: cuda


<h2>2. Convert the Transformers model to ONNX for inference</h2>

The model used is <a href="https://huggingface.co/optimum/distilbert-base-uncased-finetuned-banking77">optimum/distilbert-base-uncased-finetuned-banking77</a>, a DistilBERT model fine-tuned on the Banking77 dataset that achieves an accuracy score of <strong>92.5</strong>. Its feature (task) is <strong>text-classification</strong>.

In [17]:
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer
from pathlib import Path

model_id="optimum/distilbert-base-uncased-finetuned-banking77"
dataset_id="banking77"
onnx_path = Path("onnx")

# load the vanilla transformers model and export it to ONNX
# (`export=True` replaces the deprecated `from_transformers=True`)
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# save onnx checkpoint and tokenizer
model.save_pretrained(onnx_path)
tokenizer.save_pretrained(onnx_path)


('onnx/tokenizer_config.json',
 'onnx/special_tokens_map.json',
 'onnx/vocab.txt',
 'onnx/added_tokens.json',
 'onnx/tokenizer.json')

In [15]:
import os
# The model id refers to a Hub repository, not a local folder, so listing it raises
# FileNotFoundError; list the local export directory instead.
print(os.listdir(onnx_path))

<h2>3. Configure static quantization & run Calibration of quantization ranges</h2>

Unlike dynamic quantization, post-training static quantization involves not only converting the weights from float to int but also feeding data through the model to compute the distributions of the different activations (the calibration ranges). These distributions are then used to determine how the activations should be quantized at inference time. For more information, refer to <a href="https://leimao.github.io/article/Neural-Networks-Quantization/">Lei Mao's blog post</a>.
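
To make the role of calibration concrete, here is a minimal pure-Python sketch of how a calibration range can be turned into int8 quantization parameters. It uses a textbook affine scheme with a nearest-rank percentile clip. It is not the ONNX Runtime implementation, and the helper names and synthetic activations are assumptions made for illustration.

```python
import random

def percentile_range(values, percentile=99.99):
    """Nearest-rank (low, high) range that clips the most extreme outliers."""
    ordered = sorted(values)
    n = len(ordered)
    hi_idx = min(n - 1, int(round(percentile / 100.0 * (n - 1))))
    lo_idx = (n - 1) - hi_idx
    return ordered[lo_idx], ordered[hi_idx]

def affine_params(rmin, rmax, qmin=-128, qmax=127):
    """Scale and zero point mapping [rmin, rmax] onto [qmin, qmax]."""
    # Widen the range to include 0 so that zero is exactly representable.
    rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)
    scale = (rmax - rmin) / (qmax - qmin) or 1.0
    zero_point = int(round(qmin - rmin / scale))
    return scale, max(qmin, min(qmax, zero_point))

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a float to an int8 value, saturating at the range limits."""
    return max(qmin, min(qmax, round(x / scale) + zero_point))

def dequantize(q, scale, zero_point):
    """Map an int8 value back to an approximate float."""
    return (q - zero_point) * scale

# Synthetic stand-in for activations recorded during calibration.
rng = random.Random(0)
calibration_activations = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

rmin, rmax = percentile_range(calibration_activations)
scale, zero_point = affine_params(rmin, rmax)

samples = [-0.5, 0.0, 0.25, 1.0]
restored = [dequantize(quantize(x, scale, zero_point), scale, zero_point) for x in samples]
print("scale:", scale, "zero point:", zero_point)
print("restored:", restored)
```

Values inside the calibrated range come back with an error of at most half a quantization step, while values outside it saturate. That trade-off is why the choice of calibration range matters.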

The first step is to create our quantization configuration using optimum.

In [23]:
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig
from onnxruntime.quantization import QuantFormat, QuantizationMode

# create ORTQuantizer and define quantization configuration
quantizer = ORTQuantizer.from_pretrained(onnx_path)

qconfig = AutoQuantizationConfig.avx512_vnni(
    is_static=True,
    # format=QuantFormat.QOperator,
    # mode=QuantizationMode.QLinearOps,
    per_channel=True,
    operators_to_quantize=["MatMul", "Add" ]
    )
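
The configuration above sets `per_channel=True`. As a rough illustration of why that can matter, the sketch below compares one shared scale for a whole weight matrix against one scale per output channel. It uses a simplified symmetric int8 scheme and a made-up weight matrix, so it is an assumption-laden toy, not what ONNX Runtime does internally.

```python
def symmetric_scale(values, qmax=127):
    """Scale mapping the largest magnitude in `values` onto qmax."""
    peak = max(abs(v) for v in values)
    return peak / qmax if peak > 0 else 1.0

def roundtrip(values, scale, qmax=127):
    """Quantize to int8 with a symmetric scale, then dequantize."""
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def max_error(original, restored):
    """Largest absolute difference between two equally long sequences."""
    return max(abs(a - b) for a, b in zip(original, restored))

# Hypothetical 2x3 weight matrix whose output channels (rows) differ in magnitude.
weights = [
    [0.010, -0.020, 0.015],
    [1.500, -2.000, 0.750],
]

# Per-tensor: a single scale shared by every row.
tensor_scale = symmetric_scale([w for row in weights for w in row])
per_tensor_err = [max_error(row, roundtrip(row, tensor_scale)) for row in weights]

# Per-channel: each row gets its own scale.
per_channel_err = [
    max_error(row, roundtrip(row, symmetric_scale(row))) for row in weights
]

print("per-tensor error per row: ", per_tensor_err)
print("per-channel error per row:", per_channel_err)
```

In this toy case the small-magnitude row loses most of its precision under the shared scale, while the large-magnitude row is unaffected by the choice.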

In [25]:
import os
from functools import partial
from optimum.onnxruntime.configuration import AutoCalibrationConfig

def preprocess_fn(ex, tokenizer):
    return tokenizer(ex["text"],padding="longest")


# Create the calibration dataset
calibration_samples = 256
calibration_dataset = quantizer.get_calibration_dataset(
    dataset_id,
    preprocess_function=partial(preprocess_fn, tokenizer=tokenizer),
    num_samples=calibration_samples,
    dataset_split="train",
)

# Create the calibration configuration containing the parameters related to calibration.
calibration_config = AutoCalibrationConfig.percentiles(calibration_dataset, percentile=99.99239080907178)

# Perform the calibration step: computes the activations quantization ranges
shards=16
for i in range(shards):
    shard = calibration_dataset.shard(shards, i)
    quantizer.partial_fit(
        dataset=shard,
        calibration_config=calibration_config,
        onnx_model_path=onnx_path / "model.onnx",
        operators_to_quantize=qconfig.operators_to_quantize,
        batch_size=calibration_samples//shards,
        use_external_data_format=False,
    )
ranges = quantizer.compute_ranges()

# remove temp augmented model again
os.remove("augmented_model.onnx")


ImportError: 
ORTQuantizer requires the datasets library but it was not found in your environment. You can install it with pip:
`pip install datasets`. Please note that you may need to restart your runtime after installation.
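
The `ImportError` above comes from a missing `datasets` package, which was not part of the install cell. One way to catch this before starting calibration is a small pre-flight check. The following is a sketch using the standard-library `importlib.util.find_spec`; the helper name `missing_packages` and the package list are assumptions for illustration.

```python
import importlib.util

def missing_packages(names):
    """Return the subset of top-level package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Top-level packages this notebook appears to rely on (illustrative list).
required = ["datasets", "onnx", "onnxruntime", "optimum"]
missing = missing_packages(required)
if missing:
    print("Missing packages; install with: pip install " + " ".join(missing))
else:
    print("All required packages are importable.")
```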


In [None]:
# input = "What is the capital of Mongolia?"

# tokenizer.pad_token = tokenizer.eos_token
# print("EOS Token:", tokenizer.eos_token)


# rep_ids = tokenizer(representative_batch, return_tensors="pt", padding=True, truncation=True)

# # Calibration function
# def evaluate(model, inputs):
#     with torch.no_grad():

#         model(**inputs)

# num_calibration_batches = 10
# for i in range(num_calibration_batches):
#     print(i)
#     evaluate(model, rep_ids)

# print("Calibration completed!")


In [None]:
# import torch
# from torch.ao.quantization import get_default_qconfig, prepare, convert

# def set_embedding_qconfig(model):
#     """
#     Set the quantization config to None (float32) for embedding layers,
#     and apply default quantization config to all other layers.
#     """
#     for name, module in model.named_modules():
#         if isinstance(module, torch.nn.Embedding) or isinstance(module, torch.nn.LayerNorm):
#             # No quantization for embeddings, keep them as float32
#             module.qconfig = None
#         else:
#             # Apply default quantization config for other layers
#             module.qconfig = get_default_qconfig('fbgemm')

# # Set the quantization config
# set_embedding_qconfig(qmod)

# # Prepare the model for quantization
# model.eval()  # Ensure the model is in evaluation mode
# prepared_model = prepare(qmod, inplace=True)

# # Now, calibrate the model with representative data (this step is necessary for static quantization)
# # Let's assume you already have `rep_ids` for the representative batch of text

# # Example calibration loop
# num_calibration_batches = 10
# for _ in range(num_calibration_batches):
#     with torch.no_grad():
#         prepared_model(**rep_ids)  # Feed representative batch through the model

# print("Calibration completed!")

# # Convert the model to the quantized version after calibration
# quantized_model = convert(prepared_model, inplace=True)

# # Now, the model is quantized with float32 embeddings and int8 for other layers
# print("Model conversion to quantized version completed!")

# # You can now test the quantized model or save it


In [None]:
# import torch

# # Set the quantized engine
# torch.backends.quantized.engine = 'qnnpack'

# device = torch.device('cpu')  # Explicitly use CPU
# quantized_model.to(device)

# tokenizer.pad_token_id = tokenizer.eos_token_id
# # Move the model to the device
# quantized_model.to(device)

# # Ensure the model is in evaluation mode
# quantized_model.eval()

# # Example input question
# input_question = "What is the capital of Mongolia?"

# # Tokenize the input question and get attention mask
# inputs = tokenizer(input_question, return_tensors="pt", padding=True, truncation=True, return_attention_mask=True)

# # Move input tensors to the same device as the model
# inputs = {key: value.to(device) for key, value in inputs.items()}

# # Generate the model's response (answer)
# with torch.no_grad():
#     # Forward pass through the quantized model
#     output = quantized_model.generate(
#         inputs["input_ids"],
#         attention_mask=inputs["attention_mask"],
#         max_length=50
#     )

# # Decode the generated tokens to a human-readable text answer
# generated_answer = tokenizer.decode(output[0], skip_special_tokens=True)

# print(f"Question: {input_question}")
# print(f"Answer: {generated_answer}")



In [None]:
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# "distil-gpt-2" is not a valid Hub identifier (it raises an OSError).
# Assumption: "distilbert-base-uncased" is the intended checkpoint, inferred from
# the output file name and the DistilBERT outputs in the cells below.
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Export the model to ONNX
dummy_input = tokenizer("What is the capital of Mongolia?", return_tensors="pt").input_ids
onnx_path = "distilbert-base.onnx"

torch.onnx.export(
    model,
    dummy_input,
    onnx_path,
    export_params=True,
    opset_version=14,  # 'aten::triu' is unsupported in opset 13; support was added in opset 14
    input_names=["input_ids"],
    output_names=["output"],
    dynamic_axes={"input_ids": {0: "batch_size", 1: "sequence_length"}}
)

print(f"Model exported to {onnx_path}")

In [None]:
from onnxruntime.quantization import quantize_dynamic, QuantType

# Path to the exported ONNX model
onnx_path = "distilbert-base.onnx"
quantized_onnx_path = "distilbert-base_quantized.onnx"

# Perform dynamic quantization
quantize_dynamic(
    model_input=onnx_path,
    model_output=quantized_onnx_path,
    weight_type=QuantType.QInt8  # Use 8-bit quantization for weights
)

print(f"Quantized model saved to {quantized_onnx_path}")




Quantized model saved to distilbert-base_quantized.onnx
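
A quick sanity check after quantization is to compare the file sizes of the two models. The sketch below does this with the standard library only. The helper names are made up for illustration, and the comparison is skipped when the files are not present.

```python
import os

def size_mb(path):
    """File size in mebibytes."""
    return os.path.getsize(path) / (1024 ** 2)

def compression_ratio(original_path, quantized_path):
    """How many times smaller the quantized file is than the original."""
    return os.path.getsize(original_path) / os.path.getsize(quantized_path)

# Paths as used in the cell above; nothing is printed if the files are absent.
original, quantized = "distilbert-base.onnx", "distilbert-base_quantized.onnx"
if os.path.exists(original) and os.path.exists(quantized):
    print(f"original:  {size_mb(original):.1f} MB")
    print(f"quantized: {size_mb(quantized):.1f} MB")
    print(f"ratio:     {compression_ratio(original, quantized):.2f}x")
```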


In [None]:
import numpy as np
import onnxruntime as ort

# Load the quantized ONNX model
ort_session = ort.InferenceSession(quantized_onnx_path)

# Tokenize input
input_question = "What is the capital of Mongolia?"
inputs = tokenizer(input_question, return_tensors="np")
input_ids = inputs["input_ids"]

# Run inference
ort_inputs = {"input_ids": input_ids.astype(np.int64)}
ort_outputs = ort_session.run(None, ort_inputs)

# The model is a sequence classifier, so the output holds one float logit per class,
# not token ids. Passing it to tokenizer.decode() raises
# "TypeError: argument 'ids': 'float' object cannot be interpreted as an integer".
logits = ort_outputs[0]
predicted_class = int(np.argmax(logits, axis=-1)[0])

print(f"Question: {input_question}")
print(f"Predicted class id: {predicted_class}")

In [None]:
logits


array([[-0.1639258 , -0.20489815]], dtype=float32)

In [None]:
print("Input IDs:", inputs["input_ids"])
print("Attention Mask:", inputs["attention_mask"])


Input IDs: [[  101  2054  2003  1996  3007  1997 13906  1029   102]]
Attention Mask: [[1 1 1 1 1 1 1 1 1]]


In [None]:
print("ONNX Output:", ort_outputs)


ONNX Output: [array([[-0.1639258 , -0.20489815]], dtype=float32)]
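
The two floats in the ONNX output are class logits from a sequence-classification head, not token ids, so they cannot be decoded into text. A minimal sketch of turning them into class probabilities and a predicted class index, in pure Python with the values copied from the output above, could look like this. If the checkpoint was loaded without a fine-tuned classification head, the resulting scores may carry no real meaning.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits copied from the ONNX output shown above.
logits = [-0.1639258, -0.20489815]
probs = softmax(logits)

# Index of the highest-probability class.
predicted_class = max(range(len(probs)), key=probs.__getitem__)

print("predicted class:", predicted_class)
print("probabilities:", probs)
```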


In [None]:
# Run the original PyTorch model (DistilBertForSequenceClassification) for comparison.
# It expects a mapping of PyTorch tensors, so tokenize again with return_tensors="pt"
# and unpack that. Calling model(**input_ids) on a NumPy array raises
# "TypeError: ... argument after ** must be a mapping, not numpy.ndarray".
pt_inputs = tokenizer(input_question, return_tensors="pt")
with torch.no_grad():
    pt_logits = model(**pt_inputs).logits
predicted_class = int(pt_logits.argmax(dim=-1).item())

print(f"Question: {input_question}")
print(f"Predicted class id: {predicted_class}")