Copyright (c) Microsoft Corporation. All rights reserved.  
Licensed under the MIT License.

# Inference PyTorch Bert Model with ONNX Runtime on GPU

In this tutorial, you'll learn how to load a Bert model from PyTorch, convert it to ONNX, and run high-performance inference with ONNX Runtime on an NVIDIA GPU. The following sections use a Bert model fine-tuned on the Stanford Question Answering Dataset (SQuAD) as an example. A Bert SQuAD model is used in question answering scenarios, where the answer to every question is a segment of text from the corresponding reading passage, or the question might be unanswerable.

This notebook is for GPU inference. For CPU inference, please look at another notebook [Inference PyTorch Bert Model with ONNX Runtime on CPU](PyTorch_Bert-Squad_OnnxRuntime_CPU.ipynb).

## 0. Prerequisites ##
Before running this notebook, you need a machine with a GPU and a Python environment with [PyTorch](https://pytorch.org/) installed.

#### GPU Environment Setup using Anaconda

First, install [Anaconda](https://www.anaconda.com/distribution/) on the target machine and open an Anaconda prompt window when it is done. Then run the following commands to create a conda environment. This notebook is tested with PyTorch 2.0.1 and OnnxRuntime 1.16.0.

```console
conda create -n gpu_env python=3.10
conda activate gpu_env
pip install jupyterlab
conda install ipykernel
conda install -c conda-forge ipywidgets
ipython kernel install --user --name gpu_env
jupyter-lab
```
Finally, in the Jupyter session started by the last command, choose gpu_env as the kernel to run this notebook.

onnxruntime-gpu needs specific versions of CUDA and cuDNN. You can find the requirements [here](https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements). Remember to add their directories to the PATH environment variable (see [CUDA and cuDNN Path](#CUDA-and-cuDNN-Path) below).

In [1]:
import sys

if sys.platform in ['linux', 'win32']: # Linux or Windows
    !{sys.executable} -m pip install torch --index-url https://download.pytorch.org/whl/cu118 -q
    !{sys.executable} -m pip install onnxruntime-gpu onnx transformers psutil pandas py-cpuinfo py3nvml coloredlogs wget netron sympy protobuf==3.20.3 -q
else: # Mac
    print("CUDA is not available on MacOS")

### CUDA and cuDNN Path
onnxruntime-gpu depends on [CUDA](https://developer.nvidia.com/cuda-downloads) and [cuDNN](https://developer.nvidia.com/cudnn). The required CUDA version can be found [here](https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements). If you import torch before onnxruntime, onnxruntime might use the CUDA and cuDNN DLLs that were loaded by PyTorch.
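One way to check the search path before importing onnxruntime is to look for the CUDA and cuDNN libraries in the directories it lists. The following is a minimal, standard-library-only sketch. The helpers `find_libraries` and `default_search_dirs` and the file-name patterns are assumptions made for illustration; actual library names depend on the installed versions, and on Linux the loader can also find libraries through other mechanisms, so "not found" here is only a hint.

```python
import fnmatch
import os
import sys


def find_libraries(patterns, search_dirs):
    """Return {pattern: [matching file paths]} for files found in search_dirs."""
    found = {pattern: [] for pattern in patterns}
    for directory in search_dirs:
        try:
            names = sorted(os.listdir(directory))
        except OSError:
            continue  # missing or unreadable directory on the search path
        for name in names:
            for pattern in patterns:
                if fnmatch.fnmatch(name.lower(), pattern):
                    found[pattern].append(os.path.join(directory, name))
    return found


def default_search_dirs():
    """Directories from PATH on Windows, LD_LIBRARY_PATH elsewhere (assumed convention)."""
    variable = "PATH" if sys.platform == "win32" else "LD_LIBRARY_PATH"
    return [d for d in os.environ.get(variable, "").split(os.pathsep) if d]


# Assumed file-name patterns; adjust them to the versions you installed.
if sys.platform == "win32":
    patterns = ["cudart*.dll", "cudnn*.dll"]
else:
    patterns = ["libcudart.so*", "libcudnn*.so*"]

for pattern, paths in find_libraries(patterns, default_search_dirs()).items():
    print(pattern, "->", paths[0] if paths else "not found on the search path")
```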

In [2]:
import torch
import onnx
import onnxruntime
import transformers
print("pytorch:", torch.__version__)
print("onnxruntime:", onnxruntime.__version__)
print("onnx:", onnx.__version__)
print("transformers:", transformers.__version__)

pytorch: 2.0.1+cu118
onnxruntime: 1.16.0
onnx: 1.14.1
transformers: 4.33.1


## 1. Load Pretrained Bert model ##

We begin by downloading the SQuAD data file and storing it in the specified location.

In [3]:
import os

cache_dir = "./squad"
if not os.path.exists(cache_dir):
    os.makedirs(cache_dir)

predict_file_url = "https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json"
predict_file = os.path.join(cache_dir, "dev-v1.1.json")
if not os.path.exists(predict_file):
    import wget
    print("Start downloading predict file.")
    wget.download(predict_file_url, predict_file)
    print("Predict file downloaded.")

Let's first define some constant variables.

In [4]:
# Whether allow overwriting existing ONNX model and download the latest script from GitHub
enable_overwrite = True

# Total samples to inference, so that we can get average latency
total_samples = 1000

# ONNX opset version
opset_version=11

Specify some model configuration variables.

In [5]:
# fine-tuned model from https://huggingface.co/models?search=squad
model_name_or_path = "bert-large-uncased-whole-word-masking-finetuned-squad"
max_seq_length = 128
doc_stride = 128
max_query_length = 64

Load the pretrained model. This step could take a few minutes.

In [6]:
# The following code is adapted from HuggingFace transformers
# https://github.com/huggingface/transformers/blob/master/examples/run_squad.py

from transformers import (BertConfig, BertForQuestionAnswering, BertTokenizer)

# Load pretrained model and tokenizer
config_class, model_class, tokenizer_class = (BertConfig, BertForQuestionAnswering, BertTokenizer)
config = config_class.from_pretrained(model_name_or_path, cache_dir=cache_dir)
tokenizer = tokenizer_class.from_pretrained(model_name_or_path, do_lower_case=True, cache_dir=cache_dir)
model = model_class.from_pretrained(model_name_or_path,
                                    from_tf=False,
                                    config=config,
                                    cache_dir=cache_dir)
# load some examples
from transformers.data.processors.squad import SquadV1Processor

processor = SquadV1Processor()
examples = processor.get_dev_examples(None, filename=predict_file)

from transformers import squad_convert_examples_to_features
features, dataset = squad_convert_examples_to_features( 
            examples=examples[:total_samples], # convert enough examples for this notebook
            tokenizer=tokenizer,
            max_seq_length=max_seq_length,
            doc_stride=doc_stride,
            max_query_length=max_query_length,
            is_training=False,
            return_dataset='pt'
        )

Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 48/48 [00:02<00:00, 16.27it/s]
convert squad examples to features: 100%|███████████████████████████████████████████████████████████

## 2. Export the loaded model ##
Once the model is loaded, we can export the loaded PyTorch model to ONNX.

In [7]:
output_dir = os.path.join(".", "onnx_models")
if not os.path.exists(output_dir):
    os.makedirs(output_dir)   
export_model_path = os.path.join(output_dir, 'bert-base-cased-squad_opset{}.onnx'.format(opset_version))

import torch
use_gpu = torch.cuda.is_available()
device = torch.device("cuda" if use_gpu else "cpu")

# Get the first example data to run the model and export it to ONNX
data = dataset[0]
inputs = {
    'input_ids':      data[0].to(device).reshape(1, max_seq_length),
    'attention_mask': data[1].to(device).reshape(1, max_seq_length),
    'token_type_ids': data[2].to(device).reshape(1, max_seq_length)
}

# Set model to inference mode, which is required before exporting the model because some operators behave differently in 
# inference and training mode.
model.eval()
model.to(device)

if enable_overwrite or not os.path.exists(export_model_path):
    with torch.no_grad():
        symbolic_names = {0: 'batch_size', 1: 'max_seq_len'}
        torch.onnx.export(model,                                            # model being run
                          args=tuple(inputs.values()),                      # model input (or a tuple for multiple inputs)
                          f=export_model_path,                              # where to save the model (can be a file or file-like object)
                          opset_version=opset_version,                      # the ONNX version to export the model to
                          do_constant_folding=True,                         # whether to execute constant folding for optimization
                          input_names=['input_ids',                         # the model's input names
                                       'input_mask', 
                                       'segment_ids'],
                          output_names=['start', 'end'],                    # the model's output names
                          dynamic_axes={'input_ids': symbolic_names,        # variable length axes
                                        'input_mask' : symbolic_names,
                                        'segment_ids' : symbolic_names,
                                        'start' : symbolic_names,
                                        'end' : symbolic_names})
        print("Model exported at ", export_model_path)

verbose: False, log level: Level.ERROR

Model exported at  .\onnx_models\bert-base-cased-squad_opset11.onnx
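Before running the exported file, it can be worth confirming that it is a valid ONNX model and that the dynamic axes were recorded as intended. The sketch below uses the `onnx` package for this; `describe_onnx_inputs` is a hypothetical helper name, not part of any library.

```python
def describe_onnx_inputs(model_path):
    """Validate an ONNX file and return {input name: [dimension names or sizes]}."""
    import onnx  # imported lazily so the helper can be defined without onnx installed

    onnx.checker.check_model(model_path)  # raises if the model is malformed
    model = onnx.load(model_path)
    described = {}
    for graph_input in model.graph.input:
        dims = graph_input.type.tensor_type.shape.dim
        # dim_param holds a symbolic name (dynamic axis); dim_value holds a fixed size.
        described[graph_input.name] = [d.dim_param if d.dim_param else d.dim_value for d in dims]
    return described


# Example call in this notebook (not executed here):
# describe_onnx_inputs(export_model_path)
```

For the model exported above, each of the three inputs would be expected to list the symbolic names `batch_size` and `max_seq_len` rather than fixed sizes.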


## 3. PyTorch Inference ##
Use PyTorch to evaluate an example input for comparison purpose.

In [8]:
import time

# Measure the latency. It is not accurate using Jupyter Notebook, it is recommended to use standalone python script.
latency = []
with torch.no_grad():
    for i in range(total_samples):
        data = dataset[i]
        inputs = {
            'input_ids':      data[0].to(device).reshape(1, max_seq_length),
            'attention_mask': data[1].to(device).reshape(1, max_seq_length),
            'token_type_ids': data[2].to(device).reshape(1, max_seq_length)
        }
        start = time.time()
        outputs = model(**inputs)
        latency.append(time.time() - start)
print("PyTorch {} Inference time = {} ms".format(device.type, format(sum(latency) * 1000 / len(latency), '.2f')))

PyTorch cuda Inference time = 19.32 ms
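CUDA kernels are launched asynchronously, so wall-clock timing around a GPU call may under-report latency unless the device is synchronized before the clock is read. The following is a minimal sketch of a timing helper with warm-up runs and an optional synchronization hook; `measure_latency` is a hypothetical helper, not part of PyTorch or ONNX Runtime.

```python
import time


def measure_latency(fn, iterations=100, warmup=10, sync=None):
    """Return the average latency of fn() in milliseconds.

    sync, if given, is called before starting and before stopping the clock
    so that queued asynchronous work is included in the measurement.
    """
    for _ in range(warmup):
        fn()  # warm-up runs are not timed
    latencies = []
    for _ in range(iterations):
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        latencies.append(time.perf_counter() - start)
    return sum(latencies) * 1000 / len(latencies)


# CPU-only demonstration with a cheap function.
print("average latency (ms):", measure_latency(lambda: sum(range(1000)), iterations=20, warmup=2))
```

In this notebook, the call could look like `measure_latency(lambda: model(**inputs), sync=torch.cuda.synchronize)` inside a `torch.no_grad()` block.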


## 4. Inference ONNX Model with ONNX Runtime ##

Now we are ready to inference the model with ONNX Runtime.

In [9]:
import psutil
import onnxruntime
import numpy

assert 'CUDAExecutionProvider' in onnxruntime.get_available_providers()
device_name = 'gpu'

sess_options = onnxruntime.SessionOptions()

# Optional: store the optimized graph and view it using Netron to verify that model is fully optimized.
# Note that this will increase session creation time so enable it for debugging only.
sess_options.optimized_model_filepath = os.path.join(output_dir, "optimized_model_{}.onnx".format(device_name))

# Please change the value according to best setting in Performance Test Tool result.
sess_options.intra_op_num_threads=psutil.cpu_count(logical=True)

session = onnxruntime.InferenceSession(export_model_path, sess_options, providers=["CUDAExecutionProvider", "CPUExecutionProvider"])

latency = []
for i in range(total_samples):
    data = dataset[i]
    ort_inputs = {
        'input_ids':  data[0].cpu().reshape(1, max_seq_length).numpy(),
        'input_mask': data[1].cpu().reshape(1, max_seq_length).numpy(),
        'segment_ids': data[2].cpu().reshape(1, max_seq_length).numpy()
    }
    start = time.time()
    ort_outputs = session.run(None, ort_inputs)
    latency.append(time.time() - start)
    
print("OnnxRuntime {} Inference time = {} ms".format(device_name, format(sum(latency) * 1000 / len(latency), '.2f')))

OnnxRuntime gpu Inference time = 6.91 ms


We can compare the outputs of PyTorch and ONNX Runtime. The results are not identical because ONNX Runtime uses some approximation in CUDA optimization, but they are close within the tolerances used below. Based on our evaluation on the SQuAD data set, the F1 score is on par for models before and after optimization.

In [10]:
print("***** Verifying correctness *****")
for i in range(2):    
    print('PyTorch and ONNX Runtime output {} are close:'.format(i), numpy.allclose(ort_outputs[i], outputs[i].cpu(), rtol=1e-02, atol=1e-02))
    diff = ort_outputs[i] - outputs[i].cpu().numpy()
    max_diff = numpy.max(numpy.abs(diff))
    avg_diff = numpy.average(numpy.abs(diff))
    print(f'maximum_diff={max_diff} average_diff={avg_diff}')

***** Verifying correctness *****
PyTorch and ONNX Runtime output 0 are close: True
maximum_diff=0.002086162567138672 average_diff=0.00040457770228385925
PyTorch and ONNX Runtime output 1 are close: True
maximum_diff=0.0033638477325439453 average_diff=0.00045418128138408065
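The check above relies on `numpy.allclose`, which tests `|actual - expected| <= atol + rtol * |expected|` for every element. The sketch below restates that criterion in plain Python to show how the two tolerances combine; `is_close` and `all_close` are illustrative helpers, not NumPy functions.

```python
def is_close(actual, expected, rtol=1e-2, atol=1e-2):
    """Elementwise criterion used by numpy.allclose."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


def all_close(actual_values, expected_values, rtol=1e-2, atol=1e-2):
    """True when every pair of values satisfies is_close."""
    return all(is_close(a, e, rtol, atol) for a, e in zip(actual_values, expected_values))


# The allowed difference grows with the magnitude of the expected value:
print(is_close(5.0515, 5.0))  # difference 0.0515, allowed 0.01 + 0.01 * 5.0 = 0.06
print(is_close(0.0515, 0.0))  # difference 0.0515, allowed 0.01 + 0.01 * 0.0 = 0.01
```

Because of the relative term, an absolute difference larger than `atol` can still pass when the reference value is large enough.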


### Inference with Actual Sequence Length
Note that the ONNX model is exported with dynamic axes. For best performance, it is recommended to feed the actual sequence without padding instead of fixed-length input. Let's see how this applies to this model.

From an example input below, we can see zero padding at the end of each sequence.

In [11]:
# An example input (we can see padding). From attention_mask, we can deduce the actual length.
inputs

{'input_ids': tensor([[  101,  2054,  2329,  2694,  2897,  2097,  4287,  1996,  3565,  4605,
           1029,   102,  1999,  1996,  2142,  2983,  1010,  4035,  2557,  1019,
           2444,  1998,  1019,  2444,  2998,  4469,  2097,  4287,  1996,  5049,
           1012,  1996,  4035,  2097,  4287,  2049,  2219,  2329,  2394,  3743,
           1010,  2007,  6754, 10184,  1010, 12270, 10589,  1998,  6857,  8945,
          18505,  2006,  8570,  1012,   102,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,    

The original sequence length is 128. After removing padding, the sequence length is reduced. Inputs with a shorter sequence length need less computation, so inference latency improves.

In [12]:
import statistics

latency = []
lengths = []
for i in range(total_samples):
    data = dataset[i]
    # Instead of using fixed length (128), we can use actual sequence length (less than 128), which helps to get better performance.
    actual_sequence_length = sum(data[1].numpy())
    lengths.append(actual_sequence_length)
    opt_inputs = {
        'input_ids':  data[0].numpy()[:actual_sequence_length].reshape(1, actual_sequence_length),
        'input_mask': data[1].numpy()[:actual_sequence_length].reshape(1, actual_sequence_length),
        'segment_ids': data[2].numpy()[:actual_sequence_length].reshape(1, actual_sequence_length)
    }
    start = time.time()
    opt_outputs = session.run(None, opt_inputs)
    latency.append(time.time() - start)
print("Average length", statistics.mean(lengths))
print("OnnxRuntime {} Inference time with actual sequence length = {} ms".format(device_name, format(sum(latency) * 1000 / len(latency), '.2f')))

Average length 94
OnnxRuntime gpu Inference time with actual sequence length = 6.47 ms
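The cell above handles one sample at a time. When several samples are batched together, one possible extension is to trim the whole batch to its longest real sequence. The following is a minimal sketch using plain lists, assuming right-padded inputs and an attention mask of 0s and 1s; `trim_batch_to_longest` is a hypothetical helper.

```python
def trim_batch_to_longest(input_ids, attention_mask, token_type_ids):
    """Trim right-padded rows to the longest real sequence in the batch.

    Each argument is a list of equal-length rows. The mask is assumed to hold
    1 for real tokens and 0 for padding, with padding only at the end.
    """
    longest = max(sum(mask) for mask in attention_mask)

    def trim(rows):
        return [row[:longest] for row in rows]

    return trim(input_ids), trim(attention_mask), trim(token_type_ids), longest


# Two toy samples padded to length 8, with 3 and 5 real tokens.
ids = [[101, 7, 102, 0, 0, 0, 0, 0], [101, 7, 8, 9, 102, 0, 0, 0]]
mask = [[1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 0, 0, 0]]
segments = [[0] * 8, [0] * 8]
trimmed_ids, trimmed_mask, trimmed_segments, longest = trim_batch_to_longest(ids, mask, segments)
print(longest, trimmed_ids)
```

Shorter rows keep some padding, so the attention mask still has to be passed to the model.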


Let's compare the output and see whether the results are close.

**Note**: If you use this strategy, evaluate performance and accuracy end to end. In the run below, the outputs with and without padding are not close at a tolerance of 1e-03.

In [13]:
print("***** Comparing results with/without paddings *****")
for i in range(2):
    print('Output {} are close:'.format(i), numpy.allclose(opt_outputs[i], ort_outputs[i][:,:len(opt_outputs[i][0])], rtol=1e-03, atol=1e-03))

***** Comparing results with/without paddings *****
Output 0 are close: False
Output 1 are close: False


## 5. Offline Optimization and Test Tools

It is recommended to try [OnnxRuntime Transformer Model Optimization Tool](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers) on the exported ONNX models. It could help verify whether the model can be fully optimized, and get performance test results.

#### Transformer Optimizer

OnnxRuntime can optimize a Bert model exported by PyTorch, but sometimes the model cannot be fully optimized, for reasons such as:
* A new version of the export tool generates a new subgraph pattern that is not covered by an older version of OnnxRuntime.
* The exported model uses dynamic axes, which makes shape inference of the graph harder and blocks some optimizations from being applied.
* Some optimizations are better done offline, such as changing the input tensor type from int64 to int32 to avoid extra Cast nodes, or converting the model to float16 for better performance on V100 or T4 GPUs.

We have a Python script **optimizer.py**, which is more flexible in graph pattern matching and model conversion (like float32 to float16). You can also use it to verify whether a Bert model is fully optimized.

In this example, it applies fusions that OnnxRuntime does not apply on its own: SkipLayerNormalization and bias fusion, which OnnxRuntime skips because of the shape inference issue mentioned above.

It also reports whether the model is fully optimized. If not, you might need to change the script to fuse the new subgraph pattern.

Example Usage:
```
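from onnxruntime.transformers import optimizer
# bert-large has 16 attention heads and a hidden size of 1024
optimized_model = optimizer.optimize_model(export_model_path, model_type='bert', num_heads=16, hidden_size=1024)
optimized_model_path = './onnx/optimized_model.onnx'  # example output path
optimized_model.save_model_to_file(optimized_model_path)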
```

You can also use command line like the following:

#### Float32 Model
Let us optimize the ONNX model using the script. The first example outputs a model that stores weights in float32. This is the choice for most GPUs without Tensor Cores.

If your GPU (like V100 or T4) has Tensor Cores, jump to the [Float16 Model](#6.-Model-Optimization-with-Float16) section, since that will give you better performance than the float32 model.

In [14]:
optimized_fp32_model_path = './onnx/bert-base-cased-squad_opt_{}_fp32.onnx'.format('gpu' if use_gpu else 'cpu')

!{sys.executable} -m onnxruntime.transformers.optimizer --input $export_model_path --output $optimized_fp32_model_path

               apply: Fused LayerNormalization: 49
               apply: Fused Gelu: 24
               apply: Fused SkipLayerNormalization: 48
               apply: Fused Attention: 24
         prune_graph: Removed 5 nodes
               apply: Fused EmbedLayerNormalization(with mask): 1
         prune_graph: Removed 10 nodes
               apply: Fused BiasGelu: 24
               apply: Fused SkipLayerNormalization(add bias): 48
            optimize: opset version: 11
get_fused_operator_statistics: Optimized operators:{'EmbedLayerNormalization': 1, 'Attention': 24, 'MultiHeadAttention': 0, 'Gelu': 0, 'FastGelu': 0, 'BiasGelu': 24, 'GemmFastGelu': 0, 'LayerNormalization': 0, 'SkipLayerNormalization': 48, 'QOrderedAttention': 0, 'QOrderedGelu': 0, 'QOrderedLayerNormalization': 0, 'QOrderedMatMul': 0}
                main: The model has been fully optimized.
  save_model_to_file: Sort graphs in topological order
  save_model_to_file: Model saved to ./onnx/bert-base-cased-squad_opt_gpu_fp

#### Optimized Graph
We can open the optimized model using [Netron](https://github.com/lutzroeder/netron) to visualize.

The graph is like the following:
<img src='images/optimized_bert_gpu.png'>

Sometimes, the optimized graph is slightly different. For example, FastGelu is replaced by BiasGelu for CPU inference; when the option --input_int32 is used, Cast nodes for inputs are removed.

In [15]:
import netron

# change it to True if want to view the optimized model in browser
enable_netron = False
if enable_netron:
    # If you encounter error "access a socket in a way forbidden by its access permissions", install Netron as standalone application instead.
    netron.start(optimized_fp32_model_path)

### Performance Test Tool

The following will create 1000 random inputs of batch_size 1 and sequence length 128, then measure the average latency and throughput numbers.

Note that the test uses fixed sequence length. If you use [dynamic sequence length](#Inference-with-Actual-Sequence-Length), actual performance depends on the distribution of sequence length.

**Attention**: Latency numbers from Jupyter Notebook are not accurate. See [Additional Info](#7.-Additional-Info) for more info.

In [16]:
GPU_OPTION = '--use_gpu --use_io_binding' if use_gpu else ''

!{sys.executable} -m onnxruntime.transformers.bert_perf_test --model $optimized_fp32_model_path --batch_size 1 --sequence_length 128 --samples 1000 --test_times 1 $GPU_OPTION

Running test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=32,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,use_gpu=True,use_io_binding=True,average_sequence_length=128,random_sequence_length=False
Average latency = 3.21 ms, Throughput = 311.15 QPS
Running test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=24,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,use_gpu=True,use_io_binding=True,average_sequence_length=128,random_sequence_length=False
Average latency = 3.21 ms, Throughput = 311.73 QPS
Running test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=15,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,use_gpu=True,use_io_binding=True,average_sequence_length=128,random_sequence_length=False
Average latency = 3.20 ms, Throughput = 312.51 QPS
Running test: model=bert-base-c

Let's load the summary file and take a look.

In [17]:
def load_last_perf_test_result():
    import os
    import glob     
    import pandas
    latest_result_file = max(glob.glob("./onnx/perf_results_*.txt"), key=os.path.getmtime)
    result_data = pandas.read_table(latest_result_file)
    print("Perf results from", latest_result_file)
    # Do not show columns that have same values for all rows.
    columns_to_remove = ['model', 'graph_optimization_level', 'batch_size', 'sequence_length', 'test_cases', 'test_times', 'use_gpu', 'use_io_binding', 'average_sequence_length', 'random_sequence_length']
    result_data.drop(columns_to_remove, axis=1, inplace=True)
    return result_data
    
thread_results = load_last_perf_test_result()
thread_results

Perf results from ./onnx\perf_results_GPU_B1_S128_20230912-125746.txt


Unnamed: 0,Latency(ms),Latency_P50,Latency_P75,Latency_P90,Latency_P95,Latency_P99,Throughput(QPS),intra_op_num_threads
0,3.19,3.16,3.21,3.27,3.35,3.52,313.24,1
1,3.2,3.17,3.22,3.25,3.34,3.5,312.8,8
2,3.2,3.15,3.25,3.29,3.36,3.58,312.51,15
3,3.2,3.18,3.21,3.26,3.35,3.53,312.49,14
4,3.2,3.16,3.25,3.29,3.4,3.56,312.24,13
5,3.2,3.19,3.22,3.27,3.35,3.48,312.2,12
6,3.21,3.18,3.23,3.28,3.37,3.51,311.73,24
7,3.21,3.19,3.23,3.27,3.34,3.52,311.57,9
8,3.21,3.18,3.26,3.31,3.36,3.54,311.15,32
9,3.21,3.17,3.24,3.28,3.34,3.52,311.1,5


From the above results, we can see that latency is very close across different settings of intra_op_num_threads.

### Model Results Comparison Tool

When a BERT model is optimized, some approximation is used in calculation. If your BERT model has three inputs, the script compare_bert_results.py can be used for a quick verification. The tool generates some fake input data and compares the inference outputs of the original and optimized models. If all outputs are close, it is safe to use the optimized model.

For GPU inference, the absolute or relative differences are larger than for CPU inference. Such slight differences in output are not expected to affect the final result: in our end-to-end evaluation on the SQuAD data set with a fine-tuned SQuAD model, the F1 score was almost the same before and after optimization.

In [18]:
USE_GPU = '--use_gpu' if use_gpu else ''
!{sys.executable} -m onnxruntime.transformers.compare_bert_results --baseline_model $export_model_path --optimized_model $optimized_fp32_model_path --batch_size 1 --sequence_length 128 --samples 100 --rtol 0.01 --atol 0.01 $USE_GPU

100% passed for 100 random inputs given thresholds (rtol=0.01, atol=0.01).
maximum absolute difference=0.05149984359741211


## 6. Model Optimization with Float16

The optimizer.py script has an option **--float16** to convert the model to use float16 to store weights. After the conversion, the model could run faster on GPUs with Tensor Cores, like V100 or T4.

Let's run tools to measure the performance on Nvidia RTX 4090. The results show significant performance improvement: latency is about 3.2 ms for float32 model, and about 1.8 ms for float16 model.
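The same conversion can also be sketched through the Python API instead of the command line. The helper below is a hedged sketch, not the notebook's own method: `optimize_bert_to_fp16` is a hypothetical name, the defaults of 16 heads and hidden size 1024 assume the bert-large model used here, and `keep_io_types=True` is a choice made for this sketch so that model inputs and outputs keep their original types.

```python
def optimize_bert_to_fp16(input_path, output_path, num_heads=16, hidden_size=1024):
    """Fuse BERT subgraphs, convert weights to float16, and save the model.

    Returns the counts of fused operators reported by the optimizer.
    """
    # Imported lazily so that defining this helper does not require onnxruntime.
    from onnxruntime.transformers import optimizer

    model = optimizer.optimize_model(
        input_path,
        model_type="bert",
        num_heads=num_heads,
        hidden_size=hidden_size,
        use_gpu=True,
    )
    model.convert_float_to_float16(keep_io_types=True)
    model.save_model_to_file(output_path)
    return model.get_fused_operator_statistics()


# Example call in this notebook (not executed here; the output path is hypothetical):
# optimize_bert_to_fp16(export_model_path, "./onnx/bert_fp16_sketch.onnx")
```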

In [19]:
optimized_fp16_model_path = './onnx/bert-base-cased-squad_opt_{}_fp16.onnx'.format('gpu' if use_gpu else 'cpu')
!{sys.executable} -m onnxruntime.transformers.optimizer --input $export_model_path --output $optimized_fp16_model_path --float16 $USE_GPU

2023-09-12 12:57:54.5508228 [W:onnxruntime:, session_state.cc:1162 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2023-09-12 12:57:54.5511008 [W:onnxruntime:, session_state.cc:1164 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show

In [20]:
GPU_OPTION = '--use_gpu --use_io_binding' if use_gpu else ''
!{sys.executable} -m onnxruntime.transformers.bert_perf_test --model $optimized_fp16_model_path --batch_size 1 --sequence_length 128 --samples 1000 --test_times 1 $GPU_OPTION

Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=32,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,use_gpu=True,use_io_binding=True,average_sequence_length=128,random_sequence_length=False
Average latency = 1.77 ms, Throughput = 566.45 QPS
Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=24,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,use_gpu=True,use_io_binding=True,average_sequence_length=128,random_sequence_length=False
Average latency = 1.74 ms, Throughput = 574.96 QPS
Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=15,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,use_gpu=True,use_io_binding=True,average_sequence_length=128,random_sequence_length=False
Average latency = 1.74 ms, Throughput = 574.28 QPS
Running test: model=bert-base-c

In [21]:
fp16_result = load_last_perf_test_result()
fp16_result

Perf results from ./onnx\perf_results_GPU_B1_S128_20230912-130021.txt


Unnamed: 0,Latency(ms),Latency_P50,Latency_P75,Latency_P90,Latency_P95,Latency_P99,Throughput(QPS),intra_op_num_threads
0,1.74,1.72,1.72,1.75,1.8,2.17,575.17,14
1,1.74,1.73,1.73,1.75,1.76,2.14,574.96,24
2,1.74,1.72,1.73,1.76,1.79,2.16,574.28,15
3,1.75,1.72,1.72,1.76,2.02,2.15,572.89,6
4,1.76,1.74,1.74,1.76,1.81,2.14,569.77,13
5,1.76,1.72,1.73,1.8,2.08,2.15,568.67,5
6,1.77,1.73,1.74,1.81,2.12,2.19,566.45,32
7,1.77,1.74,1.74,1.77,2.06,2.17,566.38,7
8,1.77,1.73,1.74,1.81,2.1,2.18,566.14,3
9,1.77,1.73,1.74,1.82,2.07,2.17,566.09,11


### Throughput Tuning

Some applications need the best throughput under a latency constraint. This can be found by testing the performance of different batch sizes, and the tool can help with that.

Here is an example that checks the performance of multiple batch sizes (1, 2, 4, 8, 16 and 32) with intra_op_num_threads set to 8.

In [22]:
THREAD_SETTING = '--intra_op_num_threads 8'
!{sys.executable} -m onnxruntime.transformers.bert_perf_test --model $optimized_fp16_model_path --batch_size 1 2 4 8 16 32 --sequence_length 128 --samples 1000 --test_times 1 $THREAD_SETTING $GPU_OPTION

Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=8,batch_size=32,sequence_length=128,test_cases=1000,test_times=1,use_gpu=True,use_io_binding=True,average_sequence_length=128,random_sequence_length=False
Average latency = 20.41 ms, Throughput = 1567.65 QPS
Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=8,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,use_gpu=True,use_io_binding=True,average_sequence_length=128,random_sequence_length=False
Average latency = 1.73 ms, Throughput = 576.74 QPS
Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=8,batch_size=2,sequence_length=128,test_cases=1000,test_times=1,use_gpu=True,use_io_binding=True,average_sequence_length=128,random_sequence_length=False
Average latency = 2.18 ms, Throughput = 917.92 QPS
Running test: model=bert-base-c

In [23]:
fp16_result = load_last_perf_test_result()
fp16_result

Perf results from ./onnx\perf_results_GPU_B1-2-4-8-16-32_S128_20230912-130248.txt


Unnamed: 0,Latency(ms),Latency_P50,Latency_P75,Latency_P90,Latency_P95,Latency_P99,Throughput(QPS),intra_op_num_threads
0,1.73,1.72,1.73,1.73,1.79,2.04,576.74,8
1,2.18,2.16,2.16,2.18,2.29,2.76,917.92,8
2,3.25,3.25,3.26,3.28,3.29,3.43,1229.91,8
3,5.38,5.38,5.39,5.42,5.44,5.6,1486.89,8
4,9.9,9.89,9.94,9.97,10.0,10.06,1616.79,8
5,20.41,20.41,20.47,20.52,20.55,20.68,1567.65,8
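The table above does not show the batch size, because `load_last_perf_test_result` drops the `batch_size` column. The following sketch recovers it from the other two columns and then picks the setting with the highest throughput under a latency budget. It assumes throughput is approximately `batch_size * 1000 / latency_ms`, which is consistent with the log lines above; `infer_batch_size` and `best_under_latency` are hypothetical helpers.

```python
# (latency_ms, throughput_qps) rows copied from the fp16 batch-size sweep above.
fp16_rows = [
    (1.73, 576.74),
    (2.18, 917.92),
    (3.25, 1229.91),
    (5.38, 1486.89),
    (9.9, 1616.79),
    (20.41, 1567.65),
]


def infer_batch_size(latency_ms, throughput_qps):
    """Recover the batch size, assuming QPS = batch_size * 1000 / latency_ms."""
    return round(throughput_qps * latency_ms / 1000)


def best_under_latency(rows, max_latency_ms):
    """Return (batch_size, latency_ms, qps) with the highest QPS within the budget."""
    candidates = [
        (infer_batch_size(latency, qps), latency, qps)
        for latency, qps in rows
        if latency <= max_latency_ms
    ]
    return max(candidates, key=lambda row: row[2]) if candidates else None


print([infer_batch_size(latency, qps) for latency, qps in fp16_rows])
print(best_under_latency(fp16_rows, max_latency_ms=10.0))
```

In these results, throughput peaks at batch size 16 and drops slightly at batch size 32, so a larger batch is not always better even without a latency budget.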


### Packing Mode (Effective Transformer)

When the padding ratio is high, it is helpful to use packing mode, also known as [effective transformer](https://github.com/bytedance/effective_transformer).
This feature requires onnxruntime-gpu version 1.16 or later.

In the example below, the average sequence length after removing padding is 32, and the sequence length with padding is 128. Packing mode gives about 3.5x the peak throughput (QPS increased from 1617 to 5652).

In [24]:
assert use_gpu, "Require GPU for packing mode"
packed_fp16_model_path = './onnx/bert-base-cased-squad_opt_gpu_fp16_packed.onnx'
!{sys.executable} -m onnxruntime.transformers.convert_to_packing_mode --input $optimized_fp16_model_path --output $packed_fp16_model_path --use_external_data_format
!{sys.executable} -m onnxruntime.transformers.bert_perf_test --model $packed_fp16_model_path --batch_size 1 2 4 8 16 32 --sequence_length 128 --average_sequence_length 32 --samples 1000 --test_times 1 $THREAD_SETTING $GPU_OPTION    

_replace_attention_with_packing_attention: Converted 24 Attention nodes to PackedAttention.
  save_model_to_file: Sort graphs in topological order
                save: Delete the existing onnx file: ./onnx/bert-base-cased-squad_opt_gpu_fp16_packed.onnx
                save: Delete the existing external data file: ./onnx/bert-base-cased-squad_opt_gpu_fp16_packed.onnx.data
  save_model_to_file: Model saved to ./onnx/bert-base-cased-squad_opt_gpu_fp16_packed.onnx


Running test: model=bert-base-cased-squad_opt_gpu_fp16_packed.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=8,batch_size=32,sequence_length=128,test_cases=1000,test_times=1,use_gpu=True,use_io_binding=True,average_sequence_length=32,random_sequence_length=False
Average latency = 5.66 ms, Throughput = 5652.40 QPS
Running test: model=bert-base-cased-squad_opt_gpu_fp16_packed.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=8,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,use_gpu=True,use_io_binding=True,average_sequence_length=32,random_sequence_length=False
Average latency = 1.70 ms, Throughput = 586.97 QPS
Running test: model=bert-base-cased-squad_opt_gpu_fp16_packed.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=8,batch_size=2,sequence_length=128,test_cases=1000,test_times=1,use_gpu=True,use_io_binding=True,average_sequence_length=32,random_sequence_length=False
Average latency = 1.79 ms, Throughput = 1114.37 QPS
Running test:

In [25]:
packing_result = load_last_perf_test_result()
packing_result

Perf results from ./onnx\perf_results_GPU_B1-2-4-8-16-32_S128_20230912-130354.txt


Unnamed: 0,Latency(ms),Latency_P50,Latency_P75,Latency_P90,Latency_P95,Latency_P99,Throughput(QPS),intra_op_num_threads
0,1.7,1.63,1.65,2.13,2.2,2.32,586.97,8
1,1.77,1.74,1.76,1.82,1.93,2.17,2262.31,8
2,1.79,1.73,1.74,2.12,2.18,2.32,1114.37,8
3,2.18,2.16,2.17,2.22,2.3,2.64,3666.45,8
4,3.31,3.31,3.32,3.35,3.39,3.51,4829.58,8
5,5.66,5.66,5.68,5.71,5.74,5.91,5652.4,8
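The gain from packing mode can be worked out from the numbers reported above. The sketch below computes the padding ratio and the throughput ratio for the batch sizes whose results appear in the log output of both runs; the values are copied from those outputs, and the variable names are illustrative.

```python
padded_length = 128
average_length = 32
# Fraction of each padded sequence that is padding.
padding_ratio = 1 - average_length / padded_length

# Throughput (QPS) by batch size, copied from the logs of the two fp16 runs above.
unpacked_qps = {1: 576.74, 2: 917.92, 32: 1567.65}
packed_qps = {1: 586.97, 2: 1114.37, 32: 5652.40}

speedup = {batch: packed_qps[batch] / unpacked_qps[batch] for batch in unpacked_qps}

# Peak throughput of each run: batch size 16 without packing, 32 with packing.
peak_speedup = 5652.40 / 1616.79

print(f"padding ratio: {padding_ratio:.2f}")
for batch in sorted(speedup):
    print(f"batch {batch}: {speedup[batch]:.2f}x")
print(f"peak to peak: {peak_speedup:.2f}x")
```

The ratio is close to 1 at batch size 1 and grows with batch size, so in this run packing pays off mainly for larger batches.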


## 7. Additional Info

Note that running Jupyter Notebook has a significant impact on performance results. You can close Jupyter Notebook and other applications, then run the performance test in a console to get more accurate numbers.

We have a [benchmark script](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/run_benchmark.sh). It is recommended to use it to measure the inference speed of OnnxRuntime.

The [OnnxRuntime C API](https://github.com/microsoft/onnxruntime/blob/main/docs/C_API.md) could get slightly better performance than the Python API. If you use the C API for inference, you can use OnnxRuntime_Perf_Test.exe built from source to measure performance instead.

Here is the machine configuration that generated the above results. You might get slower or faster results depending on your hardware.

In [26]:
!{sys.executable} -m onnxruntime.transformers.machine_info --silent

{
  "gpu": {
    "driver_version": "537.13",
    "devices": [
      {
        "memory_total": 25757220864,
        "memory_available": 18009264128,
        "name": "NVIDIA GeForce RTX 4090"
      }
    ]
  },
  "cpu": {
    "brand": "13th Gen Intel(R) Core(TM) i9-13900",
    "cores": 24,
    "logical_cores": 32,
    "hz": "2000000000,0",
    "l2_cache": 33554432,
    "flags": "3dnow,3dnowprefetch,abm,acpi,adx,aes,apic,avx,avx2,bmi1,bmi2,clflush,clflushopt,clwb,cmov,cx16,cx8,de,dts,erms,est,f16c,fma,fpu,fxsr,gfni,ht,hypervisor,ia64,intel_pt,invpcid,lahf_lm,mca,mce,mmx,monitor,movbe,msr,mtrr,osxsave,pae,pat,pbe,pcid,pclmulqdq,pdcm,pge,pni,popcnt,pse,pse36,rdpid,rdrnd,rdseed,sep,serial,sha,smap,smep,ss,sse,sse2,sse4_1,sse4_2,ssse3,tm,tm2,tsc,tscdeadline,umip,vaes,vme,vpclmulqdq,x2apic,xsave,xtpr",
    "processor": "Intel64 Family 6 Model 183 Stepping 1, GenuineIntel"
  },
  "memory": {
    "total": 33992912896,
    "available": 17272422400
  },
  "os": "Windows-10-10.0.22621-SP0",
  "pyth