Copyright (c) Microsoft Corporation. All rights reserved.  
Licensed under the MIT License.

# Inference PyTorch Bert Model for High Performance in ONNX Runtime

In this tutorial, you will learn how to load a Bert model from PyTorch, convert it to ONNX, and run high-performance inference on it with ONNX Runtime. The following sections use a Bert model trained on the Stanford Question Answering Dataset (SQuAD) as an example. A Bert SQuAD model is used in question answering scenarios, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
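
To make the idea of an answer span concrete, here is a minimal sketch that picks a span from per-token scores. It assumes the model produces one start score and one end score for each token, which matches the `start` and `end` outputs exported later in this notebook. The function name `best_span`, the `max_answer_len` limit and the toy scores are illustrative assumptions; this is not the post-processing used by the transformers library, and it does not handle unanswerable questions.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end, score) for the highest-scoring valid span.

    A span is valid when start <= end and it covers at most max_answer_len tokens.
    """
    best = None
    for s, s_score in enumerate(start_scores):
        # Only consider end positions at or after the start, within the length limit.
        last = min(len(end_scores), s + max_answer_len)
        for e in range(s, last):
            score = s_score + end_scores[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best


# Toy scores for a 6-token passage (hypothetical values).
tokens = ["the", "cat", "sat", "on", "the", "mat"]
start_scores = [0.1, 0.2, 0.1, 0.3, 2.5, 0.4]
end_scores = [0.0, 0.1, 0.2, 0.1, 0.3, 3.0]

start, end, score = best_span(start_scores, end_scores)
print(" ".join(tokens[start:end + 1]))  # the mat
```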

## 0. Prerequisites ##
This notebook requires a Python environment with [PyTorch](https://pytorch.org/) and [OnnxRuntime](https://microsoft.github.io/onnxruntime/) installed.

First, install [Anaconda](https://www.anaconda.com/distribution/) on the target machine and open an Anaconda prompt window when it is done. Then choose a setup based on your target device (CPU or GPU), and run the commands to create a conda environment.

#### CPU Environment Setup
If your machine does not have a GPU, or you want to test CPU inference, you can create a conda environment as follows:

```console
conda create -n cpu_env python=3.6
conda activate cpu_env
conda install pytorch torchvision cpuonly -c pytorch
pip install onnxruntime
conda install jupyter
jupyter notebook
```
The last command launches Jupyter Notebook; open this notebook in the browser to continue.

Another option is to use pip to install the packages into your existing Jupyter Notebook environment:
```console
pip install --upgrade torch==1.4.0+cpu torchvision==0.5.0+cpu -f https://download.pytorch.org/whl/torch_stable.html
pip install onnxruntime==1.1.2
```

#### GPU Environment Setup

This requires your machine to have a GPU.

```console
conda create -n gpu_env python=3.6
conda activate gpu_env
conda install pytorch torchvision cudatoolkit=10.1 -c pytorch
pip install onnxruntime-gpu
conda install jupyter
jupyter notebook
```

onnxruntime-gpu v1.1.2 requires [CUDA](https://developer.nvidia.com/cuda-downloads) 10.0 and [cuDNN](https://developer.nvidia.com/cudnn) 7.6 to be installed, with their bin directories added to the PATH environment variable (you need to update the paths in section 4 below).

In [1]:
# install some extra packages used in this notebook
import sys
!{sys.executable} -m pip install transformers==2.5.1
!{sys.executable} -m pip install wget
!{sys.executable} -m pip install psutil

Collecting transformers==2.5.1
  Using cached transformers-2.5.1-py3-none-any.whl (499 kB)
Processing c:\users\tianl\appdata\local\pip\cache\wheels\6d\ec\1a\21b8912e35e02741306f35f66c785f3afe94de754a0eaf1422\sacremoses-0.0.38-cp36-none-any.whl
Collecting sentencepiece
  Using cached sentencepiece-0.1.85-cp36-cp36m-win_amd64.whl (1.2 MB)
Collecting tqdm>=4.27
  Downloading tqdm-4.43.0-py2.py3-none-any.whl (59 kB)
Collecting filelock
  Using cached filelock-3.0.12-py3-none-any.whl (7.6 kB)
Collecting boto3
  Downloading boto3-1.12.11-py2.py3-none-any.whl (128 kB)
Collecting requests
  Downloading requests-2.23.0-py2.py3-none-any.whl (58 kB)
Collecting regex!=2019.12.17
  Downloading regex-2020.2.20-cp36-cp36m-win_amd64.whl (272 kB)
Collecting tokenizers==0.5.2
  Using cached tokenizers-0.5.2-cp36-cp36m-win_amd64.whl (1.0 MB)
Collecting joblib
  Using cached joblib-0.14.1-py2.py3-none-any.whl (294 kB)
Collecting click
  Using cached Click-7.0-py2.py3-none-any.whl (81 kB)
Collecting botoco



Processing c:\users\tianl\appdata\local\pip\cache\wheels\40\15\30\7d8f7cea2902b4db79e3fea550d7d7b85ecb27ef992b618f3f\wget-3.2-cp36-none-any.whl
Installing collected packages: wget
Successfully installed wget-3.2
Collecting psutil
  Using cached psutil-5.7.0-cp36-cp36m-win_amd64.whl (235 kB)
Installing collected packages: psutil
Successfully installed psutil-5.7.0
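
After the installs above finish, a quick check can confirm that every module this notebook imports is visible to the current kernel. The sketch below uses only the standard library; the helper name `check_packages` is an assumption made for illustration, and it only checks that a module can be found, not which version is installed.

```python
import importlib.util


def check_packages(module_names):
    """Map each module name to True if it can be found in this environment."""
    return {name: importlib.util.find_spec(name) is not None for name in module_names}


# Modules imported somewhere in this notebook; adjust the list to match your setup.
status = check_packages(["torch", "onnxruntime", "transformers", "wget", "psutil"])
for name, available in status.items():
    print("{:<14} {}".format(name, "found" if available else "MISSING"))
```

If a module is reported as missing, the install most likely went into a different Python environment than the one the notebook kernel is running.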


## 1. Load Pretrained Bert model ##

We begin by downloading the data files and storing them in the specified location.

In [2]:
import os

cache_dir = "./squad"
if not os.path.exists(cache_dir):
    os.makedirs(cache_dir)

predict_file_url = "https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json"
predict_file = os.path.join(cache_dir, "dev-v1.1.json")
if not os.path.exists(predict_file):
    import wget
    print("Start downloading predict file.")
    wget.download(predict_file_url, predict_file)
    print("Predict file downloaded.")
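
To get a feel for what the downloaded file contains, the sketch below walks a tiny hand-written document in the SQuAD v1.1 JSON layout (articles, then paragraphs, then question/answer entries). The sample content and the helper name `iter_questions` are made up for illustration.

```python
import json

# A tiny hand-written document in the SQuAD v1.1 layout (hypothetical content).
sample = json.loads("""
{
  "version": "1.1",
  "data": [
    {
      "title": "Example",
      "paragraphs": [
        {
          "context": "ONNX Runtime is an inference engine.",
          "qas": [
            {
              "id": "q1",
              "question": "What is ONNX Runtime?",
              "answers": [{"text": "an inference engine", "answer_start": 16}]
            }
          ]
        }
      ]
    }
  ]
}
""")


def iter_questions(squad):
    """Yield (question, context, answers) for every question in a SQuAD-style dict."""
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                yield qa["question"], paragraph["context"], qa["answers"]


for question, context, answers in iter_questions(sample):
    first = answers[0]
    start = first["answer_start"]
    # answer_start is a character offset into the context string.
    assert context[start:start + len(first["text"])] == first["text"]
    print(question, "->", first["text"])
```

The same helper should work on the real file after loading it with `json.load(open(predict_file, encoding="utf-8"))`.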

Specify some model config variables.

In [3]:
# For the fine-tuned large model, the model name is "bert-large-uncased-whole-word-masking-finetuned-squad". Here we use bert-base for the demo.
model_name_or_path = "bert-base-cased"
max_seq_length = 128
doc_stride = 128
max_query_length = 64
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
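
The `max_seq_length` and `doc_stride` settings control how a passage that is too long for one input is cut into overlapping or adjacent windows. The following is a simplified sketch modeled on the sliding-window loop of the original BERT SQuAD script; the transformers version used in this notebook may compute its windows differently in detail. The helper name `split_into_spans` and the question length of 10 tokens are assumptions for illustration.

```python
def split_into_spans(num_doc_tokens, max_tokens_for_doc, doc_stride):
    """Return (start, length) windows that together cover the whole document."""
    spans = []
    start = 0
    while start < num_doc_tokens:
        length = min(num_doc_tokens - start, max_tokens_for_doc)
        spans.append((start, length))
        if start + length == num_doc_tokens:
            break
        # Advance by the stride, but never past the end of the current window.
        start += min(length, doc_stride)
    return spans


max_seq_length = 128
doc_stride = 128
num_query_tokens = 10  # hypothetical question length
# Reserve room for the question plus one [CLS] and two [SEP] tokens.
max_tokens_for_doc = max_seq_length - num_query_tokens - 3

print(split_into_spans(300, max_tokens_for_doc, doc_stride))
# [(0, 115), (115, 115), (230, 70)]
```

With a stride at least as large as the window, consecutive windows do not overlap; a smaller stride (for example 64) would make them overlap so that an answer near a window boundary is seen whole in at least one window.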

Load the pretrained model and tokenizer. This step could take a few minutes.

In [4]:
# The following code is adapted from HuggingFace transformers
# https://github.com/huggingface/transformers/blob/master/examples/run_squad.py

from transformers import (BertConfig, BertForQuestionAnswering, BertTokenizer)

# Load pretrained model and tokenizer
config_class, model_class, tokenizer_class = (BertConfig, BertForQuestionAnswering, BertTokenizer)
config = config_class.from_pretrained(model_name_or_path, cache_dir=cache_dir)
# bert-base-cased is case-sensitive, so the tokenizer must not lower-case the input.
tokenizer = tokenizer_class.from_pretrained(model_name_or_path, do_lower_case=False, cache_dir=cache_dir)
model = model_class.from_pretrained(model_name_or_path,
                                    from_tf=False,
                                    config=config,
                                    cache_dir=cache_dir)
# load some examples
from transformers.data.processors.squad import SquadV1Processor

processor = SquadV1Processor()
examples = processor.get_dev_examples(None, filename=predict_file)

from transformers import squad_convert_examples_to_features
features, dataset = squad_convert_examples_to_features( 
            examples=examples[:3], # convert only 3 examples for demo
            tokenizer=tokenizer,
            max_seq_length=max_seq_length,
            doc_stride=doc_stride,
            max_query_length=max_query_length,
            is_training=False,
            return_dataset='pt'
        )

100%|██████████████████████████████████████████████████████████████████████████████████| 48/48 [00:03<00:00, 13.88it/s]
convert squad examples to features: 100%|███████████████████████████████████████████████| 3/3 [00:00<00:00, 125.49it/s]
add example index and unique id: 100%|█████████████████████████████████████████████████| 3/3 [00:00<00:00, 2926.26it/s]
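
Each converted feature holds three equally long integer sequences, which are the tensors used as `input_ids`, `attention_mask` and `token_type_ids` in the next section. The sketch below shows the layout for a single question and context pair. The helper name `build_inputs`, the toy token ids, and the default special-token ids (101, 102 and 0, the values commonly used by BERT vocabularies) are assumptions for illustration; the real conversion is done by `squad_convert_examples_to_features`.

```python
def build_inputs(question_ids, context_ids, max_seq_length,
                 cls_id=101, sep_id=102, pad_id=0):
    """Lay out one (question, context) pair as [CLS] question [SEP] context [SEP]."""
    input_ids = [cls_id] + question_ids + [sep_id] + context_ids + [sep_id]
    if len(input_ids) > max_seq_length:
        raise ValueError("pair is longer than max_seq_length")
    # Segment 0 covers [CLS] + question + first [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(question_ids) + 2) + [1] * (len(context_ids) + 1)
    attention_mask = [1] * len(input_ids)
    # Pad all three lists to the fixed length; padding positions are masked out.
    padding = max_seq_length - len(input_ids)
    input_ids += [pad_id] * padding
    token_type_ids += [0] * padding
    attention_mask += [0] * padding
    return input_ids, attention_mask, token_type_ids


# Hypothetical token ids for a 2-token question and a 3-token context.
ids, mask, segments = build_inputs([2054, 2003], [1996, 4937, 2938], max_seq_length=10)
print(ids)       # [101, 2054, 2003, 102, 1996, 4937, 2938, 102, 0, 0]
print(mask)      # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
print(segments)  # [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]
```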


## 2. Export the loaded model ##
Once the model is loaded, we can export the loaded PyTorch model to ONNX.

In [5]:
output_dir = "./onnx"
if not os.path.exists(output_dir):
    os.makedirs(output_dir)   
output_model_path = os.path.join(output_dir, 'bert-base-cased-squad.onnx')

# Get the first batch of data to run the model and export it to ONNX
batch = dataset[0]

# Set model to inference mode, which is required before exporting the model because some operators behave differently in 
# inference and training mode.
model.eval()
model.to(device)
inputs = {
    'input_ids':      batch[0].to(device).reshape(1, max_seq_length),   # using batch size = 1 here. Adjust as needed.
    'attention_mask': batch[1].to(device).reshape(1, max_seq_length),
    'token_type_ids': batch[2].to(device).reshape(1, max_seq_length)
}

if not os.path.exists(output_model_path):
    with torch.no_grad():
        symbolic_names = {0: 'batch_size', 1: 'max_seq_len'}
        torch.onnx.export(model,                                            # model being run
                          (inputs['input_ids'],                             # model input (or a tuple for multiple inputs)
                           inputs['attention_mask'], 
                           inputs['token_type_ids']), 
                          output_model_path,                                # where to save the model (can be a file or file-like object)
                          opset_version=11,                                 # the ONNX opset version to export the model to
                          do_constant_folding=True,                         # whether to execute constant folding for optimization
                          input_names=['input_ids',                         # the model's input names
                                       'input_mask', 
                                       'segment_ids'],
                          output_names=['start', 'end'],                    # the model's output names
                          dynamic_axes={'input_ids': symbolic_names,        # variable length axes
                                        'input_mask' : symbolic_names,
                                        'segment_ids' : symbolic_names,
                                        'start' : {0: 'batch_size'},
                                        'end' : {0: 'batch_size'}})
        print("Model exported at ", output_model_path)

Model exported at  ./onnx\bert-base-cased-squad.onnx
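
After the export, it can be useful to confirm that the file exposes the input names and dynamic axes that were requested. The sketch below is one possible check using `InferenceSession.get_inputs()` from onnxruntime; the helper name `describe_onnx_inputs` is an assumption, and newer onnxruntime releases may additionally require an explicit `providers` argument when creating the session.

```python
import os


def describe_onnx_inputs(model_path):
    """Return (name, type, shape) for each input of an ONNX model."""
    import onnxruntime  # imported lazily so the helper can be defined without it

    session = onnxruntime.InferenceSession(model_path)
    return [(arg.name, arg.type, arg.shape) for arg in session.get_inputs()]


model_path = os.path.join("./onnx", "bert-base-cased-squad.onnx")
if os.path.exists(model_path):
    for name, dtype, shape in describe_onnx_inputs(model_path):
        print(name, dtype, shape)
else:
    print("Export the model first:", model_path)
```

For the export above, the three inputs should be listed as `input_ids`, `input_mask` and `segment_ids`, with the dynamic dimensions typically shown by their symbolic names rather than as fixed numbers.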


## 3. PyTorch Inference ##
Use PyTorch to evaluate an example input for comparison.

In [6]:
import time

with torch.no_grad():
    # Warm up with one run; keeping it under no_grad avoids building an autograd graph.
    model(**inputs)

    # Measure the latency.
    start = time.time()
    outputs = model(**inputs)
    end = time.time()
    print("PyTorch {} Inference time = {} ms".format(device.type, format((end - start) * 1000, '.2f')))

PyTorch cuda Inference time = 30.92 ms
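
A single timed run can be noisy, and on a GPU the call may return before the queued work has finished. The helper below is a minimal sketch of a more careful measurement: it warms up, repeats the call, and optionally calls a synchronization function around each timing. The name `measure_latency_ms` and its parameters are assumptions for illustration; for CUDA, `torch.cuda.synchronize` could be passed as `sync`.

```python
import time


def measure_latency_ms(fn, warmup=1, repeats=10, sync=None):
    """Call fn repeatedly and return the per-call latencies in milliseconds.

    sync, if given, is called before each clock read so that any queued
    asynchronous work is finished before the time is taken.
    """
    for _ in range(warmup):
        fn()
    latencies = []
    for _ in range(repeats):
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        latencies.append((time.perf_counter() - start) * 1000)
    return latencies


# Stand-in workload; in this notebook fn could be `lambda: model(**inputs)`.
latencies = measure_latency_ms(lambda: sum(range(100000)), warmup=1, repeats=5)
print("average: {:.2f} ms, best: {:.2f} ms".format(
    sum(latencies) / len(latencies), min(latencies)))
```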


## 4. Inference the Exported Model with ONNX Runtime ##

To use onnxruntime-gpu, you must install CUDA 10.0 and cuDNN 7.6 and add their bin directories to the PATH environment variable.

In [8]:
if device.type == 'cuda':
    # Add path for CUDA 10.0 and CUDNN 7.6, which are required by onnxruntime-gpu
    cuda_dir = 'D:/NVidia/CUDA/v10.0/bin'
    cudnn_dir = 'D:/NVidia/CUDA/v10.0/bin'
    if not (os.path.exists(cuda_dir) and os.path.exists(cudnn_dir)):
        raise ValueError("Please specify correct path for CUDA 10.0 and CUDNN 7.6. Otherwise onnxruntime-gpu cannot be imported.")
    else:
        # os.pathsep is ';' on Windows and ':' on Linux/macOS.
        if cuda_dir == cudnn_dir:
            os.environ["PATH"] = cuda_dir + os.pathsep + os.environ["PATH"]
        else:
            os.environ["PATH"] = cuda_dir + os.pathsep + cudnn_dir + os.pathsep + os.environ["PATH"]

Now we are ready to run inference on the model with ONNX Runtime.

In [9]:
import psutil
import onnxruntime
import numpy

device_name = 'cuda' if 'CUDAExecutionProvider' in onnxruntime.get_available_providers() else 'cpu'

sess_options = onnxruntime.SessionOptions()

# Optional: store the optimized graph and view it using Netron to verify that the model is fully optimized.
#sess_options.optimized_model_filepath = os.path.join(output_dir, "optimized_model_{}.onnx".format(device_name))

# The following settings enable OpenMP, which is required to get the best performance for CPU inference of Bert models.
sess_options.intra_op_num_threads=1
os.environ["OMP_NUM_THREADS"] = str(psutil.cpu_count(logical=True))
os.environ["OMP_WAIT_POLICY"] = 'ACTIVE'

session = onnxruntime.InferenceSession(output_model_path, sess_options)

# Using contiguous arrays as input could improve performance.
ort_inputs = {'input_ids': numpy.ascontiguousarray(inputs['input_ids'].cpu().numpy()),
              'input_mask': numpy.ascontiguousarray(inputs['attention_mask'].cpu().numpy()),
              'segment_ids': numpy.ascontiguousarray(inputs['token_type_ids'].cpu().numpy())
}

# Warm up with one run.
session.run(None, ort_inputs)

# Measure the latency.
start = time.time()
results = session.run(None, ort_inputs)
end = time.time()
print("ONNX Runtime {} inference time: {} ms".format(device_name, format((end - start) * 1000, '.2f')))

ONNX Runtime cuda inference time: 9.97 ms


In [10]:
print("***** Verifying correctness *****")
for i in range(2):
    print('PyTorch and ONNX Runtime output {} are close:'.format(i), numpy.allclose(results[i], outputs[i].cpu(), rtol=1e-05, atol=1e-04))

***** Verifying correctness *****
PyTorch and ONNX Runtime output 0 are close: True
PyTorch and ONNX Runtime output 1 are close: True
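
The check above relies on the tolerance rule used by `numpy.allclose`, which accepts two values when `|actual - expected| <= atol + rtol * |expected|`. The pure-Python sketch below restates that rule for flat lists so the effect of `rtol` and `atol` is easy to see; the helper name `is_close` and the sample numbers are illustrative.

```python
def is_close(actual, expected, rtol=1e-05, atol=1e-04):
    """Element-wise test |actual - expected| <= atol + rtol * |expected|."""
    return all(abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected))


reference = [1.0, -2.5, 100.0]
print(is_close([1.00005, -2.5, 100.0], reference))  # True: 5e-5 is within 1e-4 + 1e-5
print(is_close([1.001, -2.5, 100.0], reference))    # False: 1e-3 exceeds the tolerance
```

Because the relative term scales with the expected value, large outputs are allowed a larger absolute error than outputs close to zero, where `atol` dominates.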


## 5. Test Tools ##
For more accurate latency numbers, it is recommended to download the [Performance Test and Model Verification Tools](https://github.com/microsoft/onnxruntime/tree/tlwu/bert_test_tools/onnxruntime/python/tools/bert) and run them on the exported models.

In [11]:
import wget
url_prefix = "https://raw.githubusercontent.com/microsoft/onnxruntime/tlwu/bert_test_tools/onnxruntime/python/tools/bert/"
script_files = ['bert_perf_test.py', 'compare_bert_results.py', 'BertOnnxModel.py', 'BertOnnxModelKeras.py', 'BertOnnxModelTF.py', 'OnnxModel.py', 'bert_model_optimization.py']

script_dir = './bert_scripts'
if not os.path.exists(script_dir):
    os.makedirs(script_dir)

for filename in script_files:
    target_file = os.path.join(script_dir, filename)
    if not os.path.exists(target_file):
        wget.download(url_prefix + filename, target_file)

100% [................................................................................] 7605 / 7605

In [12]:
!{sys.executable} -m pip install onnx pytz

Collecting onnx
  Downloading onnx-1.6.0-cp36-cp36m-win_amd64.whl (4.5 MB)
Collecting pytz
  Downloading pytz-2019.3-py2.py3-none-any.whl (509 kB)
Collecting typing-extensions>=3.6.2.1
  Downloading typing_extensions-3.7.4.1-py3-none-any.whl (20 kB)
Collecting protobuf
  Downloading protobuf-3.11.3-cp36-cp36m-win_amd64.whl (1.1 MB)
Installing collected packages: typing-extensions, protobuf, onnx, pytz
Successfully installed onnx-1.6.0 protobuf-3.11.3 pytz-2019.3 typing-extensions-3.7.4.1




The following creates 10 samples (with batch_size 1 and sequence_length 128), runs each 10 times, and reports the average latency. You can increase the number of samples to get a more stable result, but the test will take longer to finish.

In [14]:
%run ./bert_scripts/bert_perf_test.py --model ./onnx/bert-base-cased-squad.onnx --batch_size 1 --sequence_length 128 --samples 10 --test_times 10 --inclusive --use_gpu

generating test data...
Running test: model=bert-base-cased-squad.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=,OMP_WAIT_POLICY=,batch_size=1,sequence_length=128,test_cases=10,test_times=10,contiguous=False,use_gpu=True
Average latency is 8.72 ms
Extra latency for converting inputs to contiguous: 0.00 ms
Running test: model=bert-base-cased-squad.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=,OMP_WAIT_POLICY=,batch_size=1,sequence_length=128,test_cases=10,test_times=10,contiguous=True,use_gpu=True
Average latency is 8.87 ms
Test summary is saved to onnx\perf_results_GPU_B1_S128_20200301-225917.txt
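
The test above starts by generating its own input data. As a rough illustration of what such synthetic samples could look like for this model, the sketch below builds random inputs keyed by the exported input names. It is not the script's actual generator: the helper name `make_fake_inputs`, the all-ones mask, and the default `vocab_size` (28996, the vocabulary size of bert-base-cased) are assumptions.

```python
import random


def make_fake_inputs(batch_size, sequence_length, vocab_size=28996, seed=0):
    """Build one random sample keyed by the exported model's input names."""
    rng = random.Random(seed)
    input_ids = [[rng.randrange(vocab_size) for _ in range(sequence_length)]
                 for _ in range(batch_size)]
    # Treat every position as a real token that belongs to the first segment.
    input_mask = [[1] * sequence_length for _ in range(batch_size)]
    segment_ids = [[0] * sequence_length for _ in range(batch_size)]
    return {"input_ids": input_ids, "input_mask": input_mask, "segment_ids": segment_ids}


samples = [make_fake_inputs(batch_size=1, sequence_length=128, seed=i) for i in range(10)]
print(len(samples), len(samples[0]["input_ids"]), len(samples[0]["input_ids"][0]))  # 10 1 128
```

To feed such a sample to ONNX Runtime, each list would first be converted to a 64-bit integer array, for example with `numpy.array(value, dtype=numpy.int64)`.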


In [15]:
%run ./bert_scripts/compare_bert_results.py --baseline_model ./onnx/bert-base-cased-squad.onnx --batch_size 1 --sequence_length 128 --samples 10 --use_gpu

baseline average latency: 31.352509999999256 ms
Successfully created the directory onnx\batch_1_seq_128\test_data_set_0 
Successfully created the directory onnx\batch_1_seq_128\test_data_set_1 
Successfully created the directory onnx\batch_1_seq_128\test_data_set_2 
Successfully created the directory onnx\batch_1_seq_128\test_data_set_3 
Successfully created the directory onnx\batch_1_seq_128\test_data_set_4 
Successfully created the directory onnx\batch_1_seq_128\test_data_set_5 
Successfully created the directory onnx\batch_1_seq_128\test_data_set_6 
Successfully created the directory onnx\batch_1_seq_128\test_data_set_7 
Successfully created the directory onnx\batch_1_seq_128\test_data_set_8 
Successfully created the directory onnx\batch_1_seq_128\test_data_set_9 
treatment average latency: 12.36652000001186 ms
0 out of 10 results are not close (rtol=0.001, atol=0.0001).
maximum absolute difference=1.0132789611816406e-06 in test case 6
maximum relative difference=0.00093184458091855