<!--<badge>--><a href="https://colab.research.google.com/github/huggingface/workshops/blob/main/bosch/dynamic-quantization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a><!--</badge>-->

# Dynamic Quantization with Hugging Face Optimum

In this session, you will learn how to apply _dynamic quantization_ to a 🤗 Transformers model. You will quantize a [DistilBERT model](https://huggingface.co/optimum/distilbert-base-uncased-finetuned-banking77) that's been fine-tuned on the [Banking77 dataset](https://huggingface.co/datasets/banking77) for intent classification. 

Along the way, you'll learn how to use two open-source libraries: 

* [🤗 Optimum](https://github.com/huggingface/optimum): an extension of 🤗 Transformers, which provides a set of performance optimization tools enabling maximum efficiency to train and run models on targeted hardware.
* [🤗 Evaluate](https://github.com/huggingface/evaluate): a library that makes evaluating and comparing models and reporting their performance easier and more standardized.


By the end of this session, you will see how graph optimization and quantization with 🤗 Optimum can significantly decrease model latency while retaining almost 100% of the full-precision model's accuracy.

## Learning objectives

By the end of this session, you will know how to:

* Set up a development environment
* Convert a 🤗 Transformers model to ONNX for inference
* Apply dynamic quantization using `ORTQuantizer` from 🤗 Optimum
* Test inference with the quantized model
* Evaluate the model performance with 🤗 Evaluate
* Compare the latency of the quantized model against the original one
* Push the quantized model to the Hub
* Load and run inference with a quantized model from the Hub


Let's get started! 🚀

## 1. Setup development environment

Our first step is to install 🤗 Optimum, along with 🤗 Evaluate and some other libraries. Running the following cell will install all the required packages for us, including 🤗 Transformers, PyTorch, and ONNX Runtime utilities:

In [None]:
%pip install "optimum[onnxruntime]" "evaluate[evaluator]" scikit-learn

> If you want to run inference on a GPU, you can install 🤗 Optimum with `pip install optimum[onnxruntime-gpu]`.

While we're at it, let's turn off some of the warnings from the 🤗 Datasets library and the tokenizer:

In [2]:
import datasets

datasets.logging.set_verbosity_error()

%env TOKENIZERS_PARALLELISM=false

env: TOKENIZERS_PARALLELISM=false
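
Before moving on, it can help to confirm which versions were actually installed, since the APIs used in this session have changed between releases. The following is a minimal sketch using only the Python standard library; the helper name `installed_versions` and the list of packages checked are illustrative choices, not part of 🤗 Optimum:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: version string}, or None for packages that are not installed."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Package names as published on PyPI (assumed to be the ones pulled in by the install cell)
for name, ver in installed_versions(
    ["optimum", "onnxruntime", "evaluate", "transformers", "torch"]
).items():
    print(f"{name}: {ver if ver is not None else 'not installed'}")
```

A `None` entry simply means the package could not be found in the current environment, in which case the install cell should be run again.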


## 2. Convert a 🤗 Transformers model to ONNX for inference

Before we can optimize and quantize our model, we first need to export it to the ONNX format. To do this, we will use the `ORTModelForSequenceClassification` class and call its `from_pretrained()` method. With the `from_transformers=True` argument, this method downloads the PyTorch weights from the Hub and exports them to ONNX. The model we are using is `optimum/distilbert-base-uncased-finetuned-banking77`, a DistilBERT model fine-tuned on the Banking77 dataset for the text-classification task, which achieves an accuracy of 91.7%:

In [None]:
from pathlib import Path

from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

model_id = "lewtun/autotrain-sphere-banking77-1565555714"
dataset_id = "banking77"
onnx_path = Path("onnx")

# load vanilla transformers and convert to onnx
model = ORTModelForSequenceClassification.from_pretrained(
    model_id, from_transformers=True
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

One neat thing about 🤗 Optimum is that it allows you to run ONNX models with the `pipeline()` function from 🤗 Transformers. This means that you get all the pre- and post-processing features for free, without needing to re-implement them for each model! Here's how you can run inference with our vanilla ONNX model:

In [None]:
from transformers import pipeline

vanilla_clf = pipeline("text-classification", model=model, tokenizer=tokenizer)
vanilla_clf("Could you assist me in finding my lost card?")

This looks good, so let's save the model and tokenizer to disk for later usage:

In [4]:
# save onnx checkpoint and tokenizer
model.save_pretrained(onnx_path)
tokenizer.save_pretrained(onnx_path)

('onnx/tokenizer_config.json',
 'onnx/special_tokens_map.json',
 'onnx/vocab.txt',
 'onnx/added_tokens.json',
 'onnx/tokenizer.json')

If we inspect the `onnx` directory where we've saved the model and tokenizer:

In [6]:
!ls {onnx_path}

config.json  special_tokens_map.json  tokenizer_config.json
model.onnx   tokenizer.json	      vocab.txt


we can see that there's a `model.onnx` file that corresponds to our exported model. Let's now go ahead and optimize this!

## 3. Apply graph optimization using `ORTOptimizer` from 🤗 Optimum

Applying graph optimization in 🤗 Optimum involves:

* Creating an optimizer based on our ONNX model
* Defining the type of optimizations via a configuration class
* Exporting the optimized model as a new ONNX file

The following code snippet does these steps for us:

In [5]:
from optimum.onnxruntime import ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig

optimizer = ORTOptimizer.from_pretrained(model)

optimization_config = OptimizationConfig(
    optimization_level=2,
    optimize_with_onnxruntime_only=False,
    optimize_for_gpu=False,
)

optimizer.optimize(save_dir=onnx_path, optimization_config=optimization_config)

PosixPath('onnx')

Here we can see that we've specified in the configuration which level of optimization to apply, along with optimizing for CPU only. If we now take a look at our `onnx` directory:

In [9]:
!ls {onnx_path}

config.json	      ort_config.json	       tokenizer_config.json
model.onnx	      special_tokens_map.json  vocab.txt
model_optimized.onnx  tokenizer.json


we can see we have a new ONNX file called `model_optimized.onnx`. Let's do a quick speed test of the two models.

## 4. Test inference with the optimized model

As we saw earlier, 🤗 Optimum has built-in support for 🤗 Transformers pipelines. This lets us use the same API that we know from PyTorch and TensorFlow models, so we can load our optimized model with the `ORTModelForSequenceClassification` class and pass it to the `pipeline()` function:

In [6]:
model_optimized = ORTModelForSequenceClassification.from_pretrained(
    onnx_path, file_name="model_optimized.onnx"
)
tokenizer = AutoTokenizer.from_pretrained(onnx_path)

optimized_clf = pipeline("text-classification", model=model_optimized, tokenizer=tokenizer)
optimized_clf("Could you assist me in finding my lost card?")

[{'label': 'lost_or_stolen_card', 'score': 0.9500694870948792}]
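
The `score` returned by a text-classification pipeline is the softmax probability of the top-scoring class. As a rough illustration of the post-processing that `pipeline()` handles for us, here is a minimal NumPy sketch. The logits and the three-entry `id2label` mapping are made up for the example (the real model has 77 classes), and `postprocess` is a hypothetical helper, not a 🤗 Transformers function:

```python
import numpy as np


def postprocess(logits, id2label):
    """Turn raw class logits into a pipeline-style [{'label': ..., 'score': ...}] result."""
    logits = np.asarray(logits, dtype=np.float64)
    # Subtract the maximum before exponentiating for numerical stability
    shifted = logits - logits.max()
    probs = np.exp(shifted) / np.exp(shifted).sum()
    top = int(np.argmax(probs))
    return [{"label": id2label[top], "score": float(probs[top])}]


# Hypothetical label ids and logits, for illustration only
id2label = {0: "card_arrival", 1: "lost_or_stolen_card", 2: "top_up_failed"}
prediction = postprocess([0.1, 4.2, -1.3], id2label)
print(prediction)
```

Because the optimized ONNX graph should compute (almost) the same logits as the original one, the label and score are expected to stay essentially unchanged after graph optimization.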

## 5. Compare the latency of the optimized model against the original one

Okay, now let's test the performance (latency) of our optimized model. We are going to use a payload with a sequence length of 128 for the benchmark. To keep it simple, we are going to use a Python loop and calculate the average and P95 latencies for our vanilla model and for the optimized model:

In [7]:
from time import perf_counter
from tqdm.auto import tqdm

import numpy as np

payload = (
    "Hello my name is Philipp. I am getting in touch with you because i didn't get a response from you. What do I need to do to get my new card which I have requested 2 weeks ago? Please help me and answer this email in the next 7 days. Best regards and have a nice weekend "
    * 2
)
print(f'Payload sequence length: {len(tokenizer(payload)["input_ids"])}')


def measure_latency(pipe):
    latencies = []
    # warm up
    for _ in range(10):
        _ = pipe(payload)
    # Timed run
    for _ in tqdm(range(300)):
        start_time = perf_counter()
        _ = pipe(payload)
        latency = perf_counter() - start_time
        latencies.append(latency)
    # Compute run statistics
    time_avg_ms = 1000 * np.mean(latencies)
    time_std_ms = 1000 * np.std(latencies)
    time_p95_ms = 1000 * np.percentile(latencies, 95)
    return (
        f"P95 latency (ms) - {time_p95_ms}; Average latency (ms) - {time_avg_ms:.2f} +\- {time_std_ms:.2f};",
        time_p95_ms,
    )


vanilla_latencies = measure_latency(vanilla_clf)
optimized_latencies = measure_latency(optimized_clf)

print(f"Vanilla model: {vanilla_latencies[0]}")
print(f"Optimized model: {optimized_latencies[0]}")
print(
    f"Improvement through graph optimization: {round(vanilla_latencies[1]/optimized_latencies[1],2)}x"
)

Payload sequence length: 128

Vanilla model: P95 latency (ms) - 40.66422965042875; Average latency (ms) - 37.75 +\- 2.26;
Optimized model: P95 latency (ms) - 35.47789009908229; Average latency (ms) - 32.11 +\- 2.59;
Improvement through graph optimization: 1.15x


Nice, applying graph optimization has given us a decent speed up! Let's see if we can squeeze a bit more performance with quantization.
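
The benchmark above reports three statistics: the mean, the standard deviation, and the 95th percentile (P95) of the per-call latencies, with the improvement factor computed from the P95 values. The following is a minimal, self-contained sketch of the same measurement pattern applied to an arbitrary callable. The helper names `summarize_latencies` and `time_callable` and the stand-in workload are assumptions made for illustration:

```python
from time import perf_counter

import numpy as np


def summarize_latencies(latencies_s):
    """Return mean, standard deviation and 95th percentile in milliseconds."""
    lat_ms = 1000 * np.asarray(latencies_s, dtype=float)
    return {
        "avg_ms": float(np.mean(lat_ms)),
        "std_ms": float(np.std(lat_ms)),
        "p95_ms": float(np.percentile(lat_ms, 95)),
    }


def time_callable(fn, n_warmup=3, n_runs=20):
    """Time repeated calls to `fn` after a few untimed warm-up calls."""
    for _ in range(n_warmup):
        fn()
    latencies = []
    for _ in range(n_runs):
        start = perf_counter()
        fn()
        latencies.append(perf_counter() - start)
    return summarize_latencies(latencies)


# Stand-in workload; in the notebook this would be `lambda: pipe(payload)`
stats = time_callable(lambda: sum(i * i for i in range(10_000)))
print(stats)
```

P95 is less sensitive to a single slow outlier than the maximum, while still reflecting tail latency better than the mean does, which makes it a common choice for comparing inference setups.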

## 6. Apply dynamic quantization using `ORTQuantizer` from 🤗 Optimum

Applying quantization in 🤗 Optimum involves:

* Creating a quantizer based on our ONNX model
* Defining the type of quantization via a configuration class
* Exporting the quantized model as a new ONNX file

The following code snippet does these steps for us:

In [8]:
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# create ORTQuantizer and define quantization configuration
model_optimized = ORTModelForSequenceClassification.from_pretrained(onnx_path, file_name="model_optimized.onnx")
dynamic_quantizer = ORTQuantizer.from_pretrained(model_optimized)
dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

# apply the quantization configuration to the model
dynamic_quantizer.quantize(save_dir=onnx_path, quantization_config=dqconfig)

PosixPath('onnx')

Here we can see that we've specified a dynamic quantization configuration (`is_static=False`) that targets CPUs supporting the Intel AVX512-VNNI instruction set. If we now take a look at our `onnx` directory:

In [13]:
!ls {onnx_path}

config.json	      model_optimized_quantized.onnx  tokenizer.json
model.onnx	      ort_config.json		      tokenizer_config.json
model_optimized.onnx  special_tokens_map.json	      vocab.txt


we can see we have a new ONNX file called `model_optimized_quantized.onnx`. Let's do a quick size comparison of the two models:

In [9]:
import os

# get model file size
size = os.path.getsize(onnx_path / "model.onnx") / (1024 * 1024)
quantized_model = os.path.getsize(onnx_path / "model_optimized_quantized.onnx") / (1024 * 1024)

print(f"Model file size: {size:.2f} MB")
print(f"Quantized Model file size: {quantized_model:.2f} MB")

Model file size: 255.68 MB
Quantized Model file size: 162.68 MB


Nice, dynamic quantization has reduced the model size by a factor of about 1.6 (from 255.68 MB to 162.68 MB)! A smaller, lower-precision model should also speed up inference, so let's now see how we can test the latency of our models.
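
To build some intuition for where the savings come from, the sketch below quantizes a weight matrix to 8-bit integers ahead of time and computes the scale for the activations only when an input arrives, which is what makes the scheme "dynamic". This is a deliberately simplified illustration in NumPy: it assumes symmetric, per-tensor scaling and random data, and it is not how ONNX Runtime implements its quantized kernels. The helper `quantize_symmetric` is hypothetical:

```python
import numpy as np


def quantize_symmetric(x, num_bits=8):
    """Map a float array to signed integers using a single scale for the whole tensor."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for int8
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale


rng = np.random.default_rng(0)
weights = rng.normal(size=(768, 768)).astype(np.float32)
activations = rng.normal(size=(4, 768)).astype(np.float32)

# Weights are quantized once, ahead of time
q_weights, w_scale = quantize_symmetric(weights)

# "Dynamic": the activation scale is computed from the actual input at inference time
q_acts, a_scale = quantize_symmetric(activations)

# Integer matmul accumulated in int32, then rescaled back to float
int_out = q_acts.astype(np.int32) @ q_weights.astype(np.int32)
approx = int_out.astype(np.float32) * (a_scale * w_scale)
exact = activations @ weights

rel_error = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
print(f"float32 weights: {weights.nbytes / 1024:.0f} KiB")
print(f"int8 weights:    {q_weights.nbytes / 1024:.0f} KiB")
print(f"relative error of the int8 matmul: {rel_error:.4f}")
```

Each quantized weight takes one byte instead of four. The ONNX file shrinks by less than a factor of four, most likely because not every tensor in the graph is quantized.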

## 7. Test inference with the quantized model

As before, the first order of business is to create a new pipeline for our quantized model:

In [10]:
model_quantized = ORTModelForSequenceClassification.from_pretrained(
    onnx_path, file_name="model_optimized_quantized.onnx"
)
tokenizer = AutoTokenizer.from_pretrained(onnx_path)

quantized_clf = pipeline("text-classification", model=model_quantized, tokenizer=tokenizer)
quantized_clf("Could you assist me in finding my lost card?")

[{'label': 'lost_or_stolen_card', 'score': 0.9163928627967834}]

## 8. Compare the latency of the quantized model against the original one

Okay, now let's test the performance (latency) of our quantized model. We reuse the payload with a sequence length of 128 and the `measure_latency()` function from before to compare the average and P95 latencies of the vanilla model and the quantized model:

In [11]:
quantized_latencies = measure_latency(quantized_clf)

print(f"Vanilla model: {vanilla_latencies[0]}")
print(f"Quantized model: {quantized_latencies[0]}")
print(
    f"Improvement through quantization: {round(vanilla_latencies[1]/quantized_latencies[1],2)}x"
)

Vanilla model: P95 latency (ms) - 40.66422965042875; Average latency (ms) - 37.75 +\- 2.26;
Quantized model: P95 latency (ms) - 27.898280650788365; Average latency (ms) - 25.40 +\- 1.82;
Improvement through quantization: 1.46x


Nice, our quantized model is roughly 1.5 times faster than the vanilla model at P95! Let's see what the impact on accuracy is.
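
For reference, the improvement factors can be recomputed directly from the P95 latencies printed in the benchmark outputs above. The short sketch below does that arithmetic; the helper names `speedup` and `latency_reduction_pct` are illustrative, and the numbers are copied from this notebook's outputs, so they will differ on other hardware:

```python
# P95 latencies in milliseconds, copied from the benchmark outputs above
p95_ms = {
    "vanilla": 40.66422965042875,
    "optimized": 35.47789009908229,
    "quantized": 27.898280650788365,
}


def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is compared to the baseline."""
    return baseline_ms / candidate_ms


def latency_reduction_pct(baseline_ms, candidate_ms):
    """Percentage of the baseline latency that the candidate saves."""
    return 100 * (1 - candidate_ms / baseline_ms)


for name in ("optimized", "quantized"):
    factor = speedup(p95_ms["vanilla"], p95_ms[name])
    saved = latency_reduction_pct(p95_ms["vanilla"], p95_ms[name])
    print(f"{name}: {factor:.2f}x faster, {saved:.1f}% lower P95 latency")
```

Graph optimization alone gives about 1.15x, and graph optimization combined with dynamic quantization gives about 1.46x relative to the vanilla ONNX model.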

## 9. Evaluate the model performance with 🤗 Evaluate

It is always a good idea to evaluate the performance of your quantized model on a dedicated test set to ensure the optimizations haven't impacted the model too strongly. To evaluate our model, we'll use the handy `evaluator()` function from 🤗 Evaluate. This function is similar to the `pipeline()` function from 🤗 Transformers, in the sense that it handles the evaluation loop for you automatically!
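
Conceptually, the evaluation loop runs the pipeline on every example, converts each predicted label string into an integer class id, and compares it with the reference id from the dataset. Here is a minimal pure-Python sketch of that idea. The function name, the toy predictions, and the three-entry mapping are hypothetical and only stand in for what 🤗 Evaluate does internally:

```python
def accuracy_with_label_mapping(predictions, references, label_mapping):
    """Compute accuracy from pipeline-style predictions and integer reference ids.

    predictions: list of dicts with a string "label" key
    references: list of integer class ids
    label_mapping: dict mapping label strings to integer class ids
    """
    predicted_ids = [label_mapping[pred["label"]] for pred in predictions]
    correct = sum(int(p == r) for p, r in zip(predicted_ids, references))
    return correct / len(references)


# Hypothetical toy data, for illustration only
label2id = {"card_arrival": 0, "lost_or_stolen_card": 1, "top_up_failed": 2}
predictions = [
    {"label": "lost_or_stolen_card", "score": 0.95},
    {"label": "card_arrival", "score": 0.80},
    {"label": "top_up_failed", "score": 0.60},
    {"label": "card_arrival", "score": 0.55},
]
references = [1, 0, 2, 2]

print(accuracy_with_label_mapping(predictions, references, label2id))  # 0.75
```

The mapping step matters because the pipeline returns label names while the dataset stores integer ids, so the two have to be brought into the same representation before they can be compared.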

Here's how you can load an evaluator for text classification and feed in the quantized pipeline:

In [12]:
from datasets import load_dataset
from evaluate import evaluator

eval_pipe = evaluator("text-classification")
eval_dataset = load_dataset(dataset_id, split="test")
label_feature = eval_dataset.features["label"]
label2id = {label_feature.int2str(idx):idx for idx in range(label_feature.num_classes)}

results = eval_pipe.compute(
    model_or_pipeline=quantized_clf,
    data=eval_dataset,
    metric="accuracy",
    input_column="text",
    label_column="label",
    label_mapping=label2id,
    strategy="simple",
)
print(results)

{'accuracy': 0.912987012987013, 'total_time_in_seconds': 16.234970625000642, 'samples_per_second': 189.71392502903763, 'latency_in_seconds': 0.005271094358766442}


Not bad! The resulting accuracy isn't too far from the original model - let's see how much exactly:

In [13]:
vanilla_accuracy = 0.9168  # accuracy of the fp32 model

print(f"Vanilla model: {vanilla_accuracy*100:.2f}%")
print(f"Quantized model: {results['accuracy']*100:.2f}%")
print(
    f"The quantized model achieves {round(results['accuracy']/vanilla_accuracy,4)*100:.2f}% accuracy of the fp32 model"
)

Vanilla model: 91.68%
Quantized model: 91.30%
The quantized model achieves 99.58% accuracy of the fp32 model


## 10. Push the quantized model to the Hub

The 🤗 Optimum model classes like `ORTModelForSequenceClassification` are integrated with the Hugging Face Model Hub, which means you can not only load models from the Hub, but also push your own models to the Hub with the `push_to_hub()` method. That way, we can save our quantized model on the Hub and use it, for example, inside our inference API.

We have to make sure that we are also saving the tokenizer as well as the `config.json` to have a good inference experience.

If you haven't logged into the Hub yet you can use the `notebook_login` function to do so:

In [12]:
from huggingface_hub import notebook_login

notebook_login()

It's then a simple matter of saving our files to a local directory and running the `push_to_hub()` method:

In [13]:
tmp_store_directory = "onnx_hub_repo"
repository_id = "quantized-distilbert-banking77"

# save and push the quantized model (not the vanilla `model` from step 2)
model_quantized.save_pretrained(tmp_store_directory)
tokenizer.save_pretrained(tmp_store_directory)

model_quantized.push_to_hub(tmp_store_directory, repository_id=repository_id, use_auth_token=True)



## 11. Load and run inference from the Hub

Now that our model is on the Hub, we can use it from anywhere! Here's a demo to show how we can load the model and tokenizer, before passing them to the `pipeline()` function:

In [14]:
model = ORTModelForSequenceClassification.from_pretrained(
    "lewtun/quantized-distilbert-banking77-2"
)
tokenizer = AutoTokenizer.from_pretrained("lewtun/quantized-distilbert-banking77-2")

remote_clf = pipeline("text-classification", model=model, tokenizer=tokenizer)
remote_clf("Could you assist me in finding my lost card?")

[{'label': 'lost_or_stolen_card', 'score': 0.9500694870948792}]