# Optimizing Transformers for GPUs with Optimum

In this session, you will learn how to optimize Hugging Face Transformers models for GPUs using Optimum. The session will show you how to convert you weights to fp16 weights and optimize a DistilBERT model using [Hugging Face Optimum](https://huggingface.co/docs/optimum/index) and [ONNX Runtime](https://onnxruntime.ai/). Hugging Face Optimum is an extension of 🤗 Transformers, providing a set of performance optimization tools enabling maximum efficiency to train and run models on targeted hardware. We are going to optimize a RoBERTa model for Question Answering, which was fine-tuned on the SQuAD2.0 dataset to decrease the latency from Xms to Yms for a sequence lenght of 128.

Note: int 8 quantization is currently only supported for CPUs. We plan to add support for in the near future.

By the end of this session, you will know how GPU optimization with Hugging Face Optimum can result in significant increase in model latency and througput while keeping  100% of the full-precision model. 

You will learn how to:
1. Setup Development Environment
2. Convert a Hugging Face `Transformers` model to ONNX for inference
3. Apply graph optimization techniques to the ONNX model
4. Convert model weights from `fp32` to `fp16`
5. Evaluate the performance and speed

Let's get started! 🚀

_This tutorial was created and run on an g4dn.xlarge AWS EC2 Instance including a NVIDIA T4._

---

## 1. Setup Development Environment

Our first step is to install Optimum, along with  Evaluate and some other libraries. Running the following cell will install all the required packages for us including Transformers, PyTorch, and ONNX Runtime utilities:

_Note: You need a machine with a GPU and CUDA installed. You can check this by running `nvidia-smi` in your terminal. If you have a correct environment you should statistics abour your GPU._

In [None]:
!pip install "optimum[onnxruntime-gpu]==1.2.3"

> If you want to run inference on a CPU, you can install 🤗 Optimum with `pip install optimum[onnxruntime]`.


## 2. Convert a Hugging Face `Transformers` model to ONNX for inference

Before we can start optimizing our model we need to convert our vanilla `transformers` model to the `onnx` format. To do this we will use the new [ORTModelForQuestionAnswering](https://huggingface.co/docs/optimum/main/en/onnxruntime/modeling_ort#optimum.onnxruntime.ORTModelForQuestionAnswering) class calling the `from_pretrained()` method with the `from_transformers` attribute. The model we are using is the [deepset/roberta-base-squad2](https://huggingface.co/deepset/roberta-base-squad2) a fine-tuned RoBERTa-based model on the SQuAD2.0 dataset achieving an F1 score of `82.5` and as the feature (task) `question-answering`.



In [1]:
from optimum.onnxruntime import ORTModelForQuestionAnswering
from transformers import AutoTokenizer
from pathlib import Path


model_id="deepset/roberta-base-squad2"
onnx_path = Path("onnx")

# load vanilla transformers and convert to onnx
model = ORTModelForQuestionAnswering.from_pretrained(model_id, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# save onnx checkpoint and tokenizer
model.save_pretrained(onnx_path)
tokenizer.save_pretrained(onnx_path)

  from .autonotebook import tqdm as notebook_tqdm


Before we jump into the optimization of the model lets first evaluate the current performance of the model. Therefore we can use `pipeline()` function from 🤗 Transformers. Meaning we will measure the end-to-end latency including the pre- and post-processing features.

In [None]:
context="Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science. In the past I designed and implemented cloud-native machine learning architectures for fin-tech and insurance companies. I found my passion for cloud concepts and machine learning 5 years ago. Since then I never stopped learning. Currently, I am focusing myself in the area NLP and how to leverage models like BERT, Roberta, T5, ViT, and GPT2 to generate business value." 
question="As what is Philipp working?" 
inputs = tokenizer(question, text, return_tensors="pt")


After we prepared our payload we can create the inference `pipeline`. 

In [None]:
from transformers import pipeline

vanilla_qa = pipeline("question-answering", model=model, tokenizer=tokenizer)
vanilla_qa(question=question,context=context)

If you want to learn more about exporting transformers model check-out [Convert Transformers to ONNX with Hugging Face Optimum](https://www.philschmid.de/convert-transformers-to-onnx) blog post



## 3. Apply graph optimization techniques to the ONNX model

Graph optimizations are essentially graph-level transformations, ranging from small graph simplifications and node eliminations to more complex node fusions and layout optimizations. 
Examples of graph optimizations include:
* **Constant folding**: evaluate constant expressions at compile time instead of runtime
* **Redundant node elimination**: remove redundant nodes without changing graph structure
* **Operator fusion**: merge one node (i.e. operator) into another so they can be executed together


![operator fusion](./assets/operator_fusion.png)

If you want to learn more about graph optimization you take a look at the [ONNX Runtime documentation](https://onnxruntime.ai/docs/performance/graph-optimizations.html). We are going to first optimize the model and then dynamically quantize to be able to use transformers specific operators such as QAttention for quantization of attention layers.
To apply graph optimizations to our ONNX model, we will use the `ORTOptimizer()`. The `ORTOptimizer` makes it with the help of a `OptimizationConfig` easy to optimize. The `OptimizationConfig` is the configuration class handling all the ONNX Runtime optimization parameters. 

In [4]:
from optimum.onnxruntime import ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig

# create ORTOptimizer and define optimization configuration
optimizer = ORTOptimizer.from_pretrained(model_id, feature=task)
optimization_config = OptimizationConfig(optimization_level=99) # enable all optimizations

# apply the optimization configuration to the model
optimizer.export(
    onnx_model_path=onnx_path / "model.onnx",
    onnx_optimized_model_output_path=onnx_path / "model-optimized.onnx",
    optimization_config=optimization_config,
)

To test performance we can use the ORTModelForSequenceClassification class again and provide an additional `file_name` parameter to load our optimized model. _(This also works for models available on the hub)._

In [None]:
from transformers import pipeline

# load optimized model
model = ORTModelForQuestionAnswering.from_pretrained(model_id, file_name="model-optimized.onnx")

# create optimized pipeline
optimized_qa = pipeline("question-answering", model=model, tokenizer=tokenizer)
optimized_qa(question=question,context=context)

## 4. Convert model weights from `fp32` to `fp16`

Feature is missing: https://github.com/huggingface/optimum/blob/42825ce3e73c381ed14c2fe906cd970e9090d325/optimum/onnxruntime/optimization.py#L143

After we have optimized our model we can accelerate it even more by quantizing it using the `ORTQuantizer`. The `ORTQuantizer` can be used to apply dynamic quantization to decrease the size of the model size and accelerate latency and inference.

_We use the `avx512_vnni` config since the instance is powered by an intel ice-lake CPU supporting avx512._

In [6]:
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# create ORTQuantizer and define quantization configuration
dynamic_quantizer = ORTQuantizer.from_pretrained(model_id, feature=model.pipeline_task)
dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

# apply the quantization configuration to the model
model_quantized_path = dynamic_quantizer.export(
    onnx_model_path=onnx_path / "model-optimized.onnx",
    onnx_quantized_model_output_path=onnx_path / "model-quantized.onnx",
    quantization_config=dqconfig,
)

PosixPath('onnx/model-quantized.onnx')

Lets quickly check the new model size.

In [7]:
import os

# get model file size
size = os.path.getsize(onnx_path / "model-optimized.onnx")/(1024*1024)
quantized_model = os.path.getsize(onnx_path / "model-quantized.onnx")/(1024*1024)

print(f"Model file size: {size:.2f} MB")
print(f"Quantized Model file size: {quantized_model:.2f} MB")

Model file size: 255.68 MB
Quantized Model file size: 134.32 MB


## 5. Test inference with the GPU optimized model

[Optimum](https://huggingface.co/docs/optimum/main/en/pipelines#optimizing-with-ortoptimizer) has built-in support for [transformers pipelines](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#pipelines). This allows us to leverage the same API that we know from using PyTorch and TensorFlow models.
Therefore we can load our quantized model with `ORTModelForQuestionAnswering` class and transformers `pipeline`.

[{'label': 'exchange_rate', 'score': 0.9802021384239197}]

## 6. Evaluate the performance and speed

As the last step of the tutorial, we want to take a detailed look at the performance and accuracy of our model. Applying optimization techniques, like graph optimizations or quantization not only impact performance (latency) those also might have an impact on the accuracy of the model. So accelerating your model comes with a trade-off. 

In [None]:
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import pipeline, AutoTokenizer

model = ORTModelForQuestionAnswering.from_pretrained(model_id, file_name="model-optimized.onnx")
tokenizer = AutoTokenizer.from_pretrained(onnx_path)

# create optimized pipeline
optimized_qa = pipeline("question-answering", model=model, tokenizer=tokenizer)
optimized_qa(question=question,context=context)

Let's evaluate our models. Our transformers model [deepset/roberta-base-squad2](https://huggingface.co/deepset/roberta-base-squad2) was fine-tuned on the SQUAD2 dataset. This will be the dataset we use to evaluate our models. 

In [9]:
from datasets import load_metric,load_dataset

metric = load_metric("squad_v2")
eval_dataset = load_dataset("squad_v2")["validation"]

# creating a subset for faster evaluation
# COMMENT OUT to run evaluation on the whole dataset -> can take up to 45 min.
eval_dataset = eval_dataset.select(range(1000))

Using custom data configuration default
Reusing dataset banking77 (/home/ubuntu/.cache/huggingface/datasets/banking77/default/1.1.0/aec0289529599d4572d76ab00c8944cb84f88410ad0c9e7da26189d31f62a55b)
Couldn't find a directory or a metric named 'accuracy' in this version. It was picked from the master branch on github instead.


{'accuracy': 0.9224025974025974}


We can now leverage the [map](https://huggingface.co/docs/datasets/v2.1.0/en/process#map) function of [datasets](https://huggingface.co/docs/datasets/index) to iterate over the validation set of squad 2 and run prediction for each data point. Therefore we write a `evaluate` helper method which uses our pipelines and applies some transformation to work with the [squad v2 metric.](https://huggingface.co/metrics/squad_v2)

In [None]:
def evaluate(example):
  default = optimum_qa(question=example["question"], context=example["context"])
  optimized = opt_optimum_qa(question=example["question"], context=example["context"])
  return {
      'reference': {'id': example['id'], 'answers': example['answers']},
      'default': {'id': example['id'],'prediction_text': default['answer'], 'no_answer_probability': 0.},
      'optimized': {'id': example['id'],'prediction_text': optimized['answer'], 'no_answer_probability': 0.},
      }

result = eval_dataset.map(evaluate)

In [10]:
default_acc = metric.compute(predictions=result["default"], references=result["reference"])
optimized = metric.compute(predictions=result["optimized"], references=result["reference"])

print(f"vanilla model: exact={default_acc['exact']}% f1={default_acc['f1']}%")
print(f"optimized model: exact={optimized['exact']}% f1={optimized['f1']}%")
print(f"quantized model: exact={quantized['exact']}% f1={quantized['f1']}%")

print(f"Vanilla model: 92.5%")
print(f"Quantized model: {results['accuracy']*100:.2f}%")
print(f"The quantized model achieves {round(results['accuracy']/0.925,4)*100:.2f}% accuracy of the fp32 model")

Vanilla model: 92.5%
Quantized model: 92.24%
The quantized model achieves 99.72% accuracy of the fp32 model


Okay, now let's test the performance (latency) of our quantized model. We are going to use a payload with a sequence length of 128 for the benchmark. To keep it simple, we are going to use a python loop and calculate the avg,mean & p95 latency for our vanilla model and for the quantized model.



In [5]:
from time import perf_counter
import numpy as np 

payload="Hello my name is Philipp. I am getting in touch with you because i didn't get a response from you. What do I need to do to get my new card which I have requested 2 weeks ago? Please help me and answer this email in the next 7 days. Best regards and have a nice weekend "*2
print(f'Payload sequence length: {len(tokenizer(payload)["input_ids"])}')

def measure_latency(pipe):
    latencies = []
    # warm up
    for _ in range(10):
        _ = pipe(payload)
    # Timed run
    for _ in range(300):
        start_time = perf_counter()
        _ =  pipe(payload)
        latency = perf_counter() - start_time
        latencies.append(latency)
    # Compute run statistics
    time_avg_ms = 1000 * np.mean(latencies)
    time_std_ms = 1000 * np.std(latencies)
    time_p95_ms = 1000 * np.percentile(latencies,95)
    return f"P95 latency (ms) - {time_p95_ms}; Average latency (ms) - {time_avg_ms:.2f} +\- {time_std_ms:.2f};", time_p95_ms

vanilla_clx = pipeline("text-classification",model=model_id)


vanilla_model=measure_latency(vanilla_clx)
quantized_model=measure_latency(clx)

print(f"Vanilla model: {vanilla_model[0]}")
print(f"Quantized model: {quantized_model[0]}")
print(f"Improvement through quantization: {round(vanilla_model[1]/quantized_model[1],2)}x")

Payload sequence length: 128
Vanilla model: P95 latency (ms) - 75.69221085868777; Average latency (ms) - 57.52 +\- 6.16;
Quantized model: P95 latency (ms) - 26.75848939397838; Average latency (ms) - 24.86 +\- 1.25;
Improvement through quantization: 2.83x


We managed to accelerate our model latency from 75.69ms to 26.75ms or 2.83x while keeping 99.72% of the accuracy. 

![performance](assets/performance.png)

## Conclusion

We successfully quantized our vanilla Transformers model with Hugging Face and managed to accelerate our model latency from 75.69ms to 26.75ms or 2.83x while keeping 99.72% of the accuracy. 

But this i have to say that this isn't a plug and play process you can transfer to any Transformers model, task and dataset. The challenge with static quantization ist the calibration of the dataset to find the right ranges which you can use to quantize the model achieve good performance. I ran a hyperparameter search to find the best ranges for our dataset and quantized model using the [run_static_quantizatio_hpo.py](https://github.com/philschmid/optimum-static-quantization/blob/master/scripts/run_static_quantizatio_hpo.py). 

Also noteably to say it that static quantization can only achieve as good as results as dynamic quantization, but will be faster than dynamic quantization. Meaning that it might always be a good start to first dynamically quantize your model using Optimum and then move to static quantization for further latency and throughput gains. The attached repository also includes an example on how dynamically quantize the model [dynamic_quantization.py](https://github.com/philschmid/optimum-static-quantization/blob/master/scripts/dynamic_quantization.py)