# Quantize Transformers models

In this notebook, we will learn how to apply post-training dynamic quantization to a Hugging Face Transformers model. The session will show you how to quantize a text classification model using Hugging Face Optimum and ONNX Runtime.

The quantization configuration used here targets CPUs (AVX512-VNNI), so we will not be utilizing GPUs / CUDA in this session. By the end of this session, you will see how quantization with Hugging Face Optimum can significantly reduce model latency while keeping almost 100% of the accuracy of the full-precision model.

In [None]:
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig
from transformers import AutoTokenizer, pipeline
from pathlib import Path

## 1. ONNX inference

Before quantizing, we need to convert our Transformers model to the ONNX format.

In [None]:
model_id = "phobert/category"
onnx_path = Path("onnx")

# load the transformers checkpoint and convert it to onnx
# (newer Optimum releases use `export=True` instead of `from_transformers=True`)
model = ORTModelForSequenceClassification.from_pretrained(model_id, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# save onnx checkpoint and tokenizer
model.save_pretrained(onnx_path)
tokenizer.save_pretrained(onnx_path)
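
# build the baseline pipeline from the exported ONNX model
# (passing model_id here would load the PyTorch weights instead)
pipe_onnx = pipeline("text-classification", model=model, tokenizer=tokenizer)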

## 2. Dynamically quantize the model

Dynamic quantization computes the scales and zero points of the activations on the fly during inference, whereas static quantization determines them before inference using a representative dataset. Static quantization is therefore theoretically faster than dynamic quantization, while model size and memory bandwidth consumption stay the same. In this section we apply dynamic quantization (`is_static=False`), which needs no calibration dataset.
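
To make "scales and zero points" concrete, here is a minimal NumPy sketch of asymmetric 8-bit quantization. It is an illustration only, not ONNX Runtime's implementation; the helper names (`compute_qparams`, `quantize`, `dequantize`) and the sample values are assumptions made up for this example. The only difference between the two modes in this sketch is where the range comes from: the tensor seen at inference time (dynamic) or a calibration set collected beforehand (static).

```python
import numpy as np


def compute_qparams(x, qmin=0, qmax=255):
    """Derive a scale and zero point for uint8 quantization from the range of x."""
    # The range is extended to include 0 so that zero maps to an exact integer.
    x_min = min(float(x.min()), 0.0)
    x_max = max(float(x.max()), 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    if scale == 0.0:  # all-zero input: any positive scale works
        scale = 1.0
    zero_point = int(round(qmin - x_min / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Map float values to uint8 using q = round(x / scale) + zero_point."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.uint8)


def dequantize(q, scale, zero_point):
    """Map uint8 values back to floats using x = (q - zero_point) * scale."""
    return (q.astype(np.float32) - zero_point) * scale


activations = np.array([-1.0, -0.25, 0.0, 0.6, 2.0], dtype=np.float32)

# Dynamic: parameters are computed from the tensor observed at inference time.
scale, zero_point = compute_qparams(activations)
restored = dequantize(quantize(activations, scale, zero_point), scale, zero_point)

# Static: parameters are computed once from calibration data and then reused.
calibration = np.array([-1.5, 0.0, 3.0], dtype=np.float32)
static_scale, static_zero_point = compute_qparams(calibration)
restored_static = dequantize(
    quantize(activations, static_scale, static_zero_point),
    static_scale,
    static_zero_point,
)
```

In both cases the round-trip error per element is at most half a quantization step (`scale / 2`) as long as the values fall inside the range used to compute the parameters. Values outside a static calibration range would be clipped.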

In [None]:
quantizer = ORTQuantizer.from_pretrained(model)
dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
model_quantized_path = quantizer.quantize(
    save_dir=onnx_path,
    quantization_config=dqconfig,
)

model = ORTModelForSequenceClassification.from_pretrained(onnx_path, file_name="model_quantized.onnx")
preprocessor = AutoTokenizer.from_pretrained(onnx_path)
pipe_q8 = pipeline("text-classification", model=model, tokenizer=preprocessor)

## 3. Compare performance

In [None]:
# model size
for i in ['model.onnx', 'model_quantized.onnx']:
    size = (onnx_path / i).stat().st_size / (1024*1024)
    print(f'{i} file size: {size:.2f} MB')

In [None]:
from time import perf_counter
import numpy as np


def measure_latency(pipe, payload):
    latencies = []
    # warm up
    for _ in range(10):
        _ = pipe(payload)
    # Timed run
    for _ in range(300):
        start_time = perf_counter()
        _ = pipe(payload)
        latency = perf_counter() - start_time
        latencies.append(latency)
    # Compute run statistics
    time_avg_ms = 1000 * np.mean(latencies)
    time_std_ms = 1000 * np.std(latencies)
    time_p95_ms = 1000 * np.percentile(latencies, 95)
    # return a printable summary plus the raw P95 value for later comparison
    return f"P95 latency (ms) - {time_p95_ms:.2f}; " \
           f"Average latency (ms) - {time_avg_ms:.2f} +/- {time_std_ms:.2f};", time_p95_ms

In [None]:
# Vietnamese sample input, roughly: "the product is good, but I don't like it that much yet"
payload = "hàng tốt nhỉ nhưng mình chưa thích lắm"*2
print(f'Payload sequence length: {len(tokenizer(payload)["input_ids"])}')

vanilla_model = measure_latency(pipe_onnx, payload)
quantized_model = measure_latency(pipe_q8, payload)

print(f"Vanilla model: {vanilla_model[0]}")
print(f"Quantized model: {quantized_model[0]}")
print(f"Improvement through quantization: {round(vanilla_model[1]/quantized_model[1],2)}x")
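
Latency is only half of the comparison: the introduction also claims that the quantized model keeps almost all of the full-precision accuracy, which the cells above do not measure. As a rough sketch, assuming no labeled evaluation set is at hand, one could at least check how often the two pipelines agree on the predicted label. The helper `label_agreement` below is hypothetical and not part of Optimum or Transformers; it relies only on the fact that a text-classification pipeline called on a single string returns a list of the form `[{"label": ..., "score": ...}]`.

```python
def label_agreement(pipe_a, pipe_b, texts):
    """Return the fraction of texts on which both pipelines predict the same top label."""
    if not texts:
        raise ValueError("texts must not be empty")
    matches = 0
    for text in texts:
        # Each call returns a list with one dict holding the top label and its score.
        label_a = pipe_a(text)[0]["label"]
        label_b = pipe_b(text)[0]["label"]
        matches += int(label_a == label_b)
    return matches / len(texts)
```

With the objects defined in this notebook, it could be called as `label_agreement(pipe_onnx, pipe_q8, [payload])`, ideally with a larger list of representative texts. Agreement is only a proxy; a proper accuracy comparison would need a labeled dataset.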