## Inference PyTorch BERT Model for High Performance in ONNX Runtime


In this tutorial, you'll learn how to load a BERT model from PyTorch, convert it to ONNX, and run high-performance inference on it using ONNX Runtime with transformer optimization. The following sections use a BERT model trained on the Stanford Question Answering Dataset (SQuAD) as an example. A BERT SQuAD model is used in question answering scenarios, where the answer to each question is a segment of text, or span, from the corresponding reading passage, or the question may be unanswerable.

Tutorial Roadmap

- Install a few necessary packages (PyTorch, Transformers, TorchVision, wget, psutil).
- Load a pretrained BERT SQuAD model from a source PyTorch implementation.
- Export our PyTorch BERT model to ONNX.
- Compare inference on our models for PyTorch and ONNX Runtime.
- Optimize our inference with execution providers using the ONNX Go Live (OLive) tool.

### Pre-requisites

First, check whether the following packages are installed, and install them if needed.

In [8]:
# Install a pip package in the current Jupyter kernel
!pip install wget
!pip install torch==1.3.1
!pip install torchvision==0.4.2
!pip install transformers==2.5.1
!pip install psutil

Collecting torch==1.3.1
  Using cached https://files.pythonhosted.org/packages/65/96/c97c8a0ea8f66de41f452925b521bcfdebef6fffb899dc704fc269d87563/torch-1.3.1-cp36-none-macosx_10_7_x86_64.whl
Collecting torchvision==0.4.2
  Using cached https://files.pythonhosted.org/packages/c1/8c/53a88b9a18d8edb33019519f9595bfd5add2ff5aeba19e7402b950906edf/torchvision-0.4.2-cp36-cp36m-macosx_10_7_x86_64.whl
Collecting pillow>=4.1.1 (from torchvision==0.4.2)
  Using cached https://files.pythonhosted.org/packages/c3/3a/1cb999d3f9311f9b7c6387b81ec7b5373d50ef031b957994898e59697c18/Pillow-7.0.0-cp36-cp36m-macosx_10_6_intel.whl
Installing collected packages: torch, pillow, torchvision
Successfully installed pillow-7.0.0 torch-1.3.1 torchvision-0.4.2

### Load Pretrained BERT model
We begin by downloading the SQuAD v1.1 dev set (the predict file) and storing it in the pytorch_squad cache directory. The pytorch_output directory is only named here; it is created later, before the export.

In [3]:
import os

# Create a directory to store predict file
output_dir = "./pytorch_output"
cache_dir = "./pytorch_squad"
predict_file = os.path.join(cache_dir, "dev-v1.1.json")
# create cache dir
if not os.path.exists(cache_dir):
    os.makedirs(cache_dir)
    
# Download the file
predict_file_url = "https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json"
if not os.path.exists(predict_file):
    import wget
    print("Start downloading predict file.")
    wget.download(predict_file_url, predict_file)
    print("Predict file downloaded.")

Start downloading predict file.
Predict file downloaded.
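
The predict file is a nested JSON document. As a rough sketch of how it can be inspected, the snippet below walks a tiny hand-written sample that mimics the SQuAD v1.1 layout (data, paragraphs, qas, answers). The sample text and the helper name iter_questions are made up for illustration; in the notebook you would load predict_file with json.load instead.

```python
import json

# A tiny hand-written sample that mimics the SQuAD v1.1 layout (illustrative only).
sample = json.loads("""
{
  "version": "1.1",
  "data": [
    {
      "title": "Example",
      "paragraphs": [
        {
          "context": "ONNX Runtime is an inference engine for ONNX models.",
          "qas": [
            {
              "id": "q1",
              "question": "What is ONNX Runtime?",
              "answers": [{"text": "an inference engine", "answer_start": 16}]
            }
          ]
        }
      ]
    }
  ]
}
""")


def iter_questions(squad):
    """Yield (question_id, question, context, answers) for every question."""
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                yield qa["id"], qa["question"], context, qa["answers"]


for qid, question, context, answers in iter_questions(sample):
    first = answers[0]
    start = first["answer_start"]
    # answer_start is a character offset into the context string.
    span = context[start:start + len(first["text"])]
    print(qid, "|", question, "->", span)
```

Each answer is stored as its text plus a character offset, which is why the reading passage and the answer span can always be lined up.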


We specify some relevant model config / hyperparameter variables.

In [9]:
import torch

# Define some variables. As an example, we used batch size 1 and max sequence length 128. 
model_type = "bert"
model_name_or_path = "bert-base-cased"
max_seq_length = 128
doc_stride = 128
max_query_length = 64
eval_batch_size = 1
# The hardware you'd like to use to run the model.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
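
To make max_seq_length and doc_stride more concrete: a passage that does not fit into one input is usually split into several windows, and doc_stride controls how far each window advances. The sketch below follows the sliding-window scheme used in the original BERT SQuAD script; it is a simplification, the function name sliding_windows is hypothetical, and the exact windowing inside the transformers library may differ in details.

```python
def sliding_windows(num_doc_tokens, max_tokens_for_doc, doc_stride):
    """Return (start, length) pairs that cover a document with sliding windows.

    max_tokens_for_doc is the room left for passage tokens once the question
    and the special tokens have been accounted for (an assumption of this sketch).
    """
    spans = []
    start = 0
    while start < num_doc_tokens:
        length = min(num_doc_tokens - start, max_tokens_for_doc)
        spans.append((start, length))
        if start + length == num_doc_tokens:
            break  # the last window reached the end of the document
        start += min(length, doc_stride)
    return spans


# A 250-token passage, 100 passage tokens per window, stride of 64 tokens.
for start, length in sliding_windows(250, 100, 64):
    print("window covers tokens", start, "to", start + length - 1)
```

In this scheme a stride smaller than the window produces overlapping windows, while a stride at least as large as the window produces windows with no overlap.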

Next, we load the pretrained BERT model and tokenizer into PyTorch. The first run downloads the pretrained files, so this step could take a few minutes.

In [11]:
# The following code is adapted from HuggingFace transformers
# https://github.com/huggingface/transformers/blob/master/examples/run_squad.py#L290

from transformers import (WEIGHTS_NAME, BertConfig, BertForQuestionAnswering, BertTokenizer)
from torch.utils.data import (DataLoader, SequentialSampler)

# Load pre-trained model and tokenizer
config_class, model_class, tokenizer_class = (BertConfig, BertForQuestionAnswering, BertTokenizer)
config = config_class.from_pretrained(model_name_or_path, cache_dir=cache_dir)
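# Note: bert-base-cased is a cased model, which is normally paired with do_lower_case=False.
# The outputs shown in this tutorial were produced with do_lower_case=True.
tokenizer = tokenizer_class.from_pretrained(model_name_or_path, do_lower_case=True, cache_dir=cache_dir)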
model = model_class.from_pretrained(model_name_or_path,
                                    from_tf=False,
                                    config=config,
                                    cache_dir=cache_dir)

# Load and Convert the examples from the downloaded predict file into a list of features 
# that can be directly given as input to a model.
from transformers.data.processors.squad import SquadV2Processor

processor = SquadV2Processor()
examples = processor.get_dev_examples(None, filename=predict_file)

from transformers import squad_convert_examples_to_features
features, dataset = squad_convert_examples_to_features( 
            examples=examples[:3], # convert only 3 examples for demo
            tokenizer=tokenizer,
            max_seq_length=max_seq_length,
            doc_stride=doc_stride,
            max_query_length=max_query_length,
            is_training=False,
            return_dataset='pt'
        )

100%|██████████| 48/48 [00:04<00:00, 11.32it/s]
convert squad examples to features: 100%|██████████| 3/3 [00:00<00:00, 90.65it/s]
add example index and unique id: 100%|██████████| 3/3 [00:00<00:00, 3416.48it/s]
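
Each converted feature carries three aligned sequences that the model consumes: input_ids, attention_mask and token_type_ids. The following sketch shows how such a feature is typically laid out for a BERT question answering input (question first, then passage, then padding). It is an illustration rather than the library's code: build_feature is a hypothetical helper, and the token ids are placeholders.

```python
def build_feature(question_ids, context_ids, max_seq_length,
                  cls_id=101, sep_id=102, pad_id=0):
    """Lay out [CLS] question [SEP] context [SEP] and pad to max_seq_length."""
    tokens = [cls_id] + question_ids + [sep_id] + context_ids + [sep_id]
    if len(tokens) > max_seq_length:
        raise ValueError("sequence is longer than max_seq_length")

    # Segment 0 covers [CLS] + question + [SEP]; segment 1 covers context + [SEP].
    token_type_ids = [0] * (len(question_ids) + 2) + [1] * (len(context_ids) + 1)
    # 1 marks real tokens, 0 marks padding that attention should ignore.
    attention_mask = [1] * len(tokens)

    padding = max_seq_length - len(tokens)
    return {
        "input_ids": tokens + [pad_id] * padding,
        "attention_mask": attention_mask + [0] * padding,
        "token_type_ids": token_type_ids + [0] * padding,
    }


# Placeholder ids: a 2-token question and a 3-token passage, padded to length 10.
feature = build_feature([7, 8], [9, 10, 11], max_seq_length=10)
for name, values in feature.items():
    print(name, values)
```

All three lists have exactly max_seq_length entries, which is why the export step can reshape each of them to a fixed (batch size, sequence length) shape.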


### Export the Loaded Model
Once the model is loaded, we can export the loaded PyTorch model to ONNX.

In [17]:
# Eval!
print("***** Running evaluation {} *****")
print("  Num examples = ", len(dataset))
print("  Batch size = ", eval_batch_size)

# create output dir
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
    
output_model_path = './pytorch_squad/bert-base-cased-squad.onnx'    
inputs = {}
outputs= {}
# Get the first batch of data to run the model and export it to ONNX
batch = dataset[0]

# Set model to inference mode, which is required before exporting the model because some operators
# (dropout, for example) behave differently in inference and training mode.
model.eval()
# Keep the model and its inputs on the same device; otherwise the forward pass fails when CUDA is available.
model.to(device)
batch = tuple(t.to(device) for t in batch)
inputs = {
    'input_ids':      batch[0].reshape(1, max_seq_length),              # using batch size = 1 here. Adjust as needed.
    'attention_mask': batch[1].reshape(1, max_seq_length),
    'token_type_ids': batch[2].reshape(1, max_seq_length)
}

with torch.no_grad():
    symbolic_names = {0: 'batch_size', 1: 'max_seq_len'}
    torch.onnx.export(model,                                            # model being run
                      (inputs['input_ids'],                             # model input (or a tuple for multiple inputs)
                       inputs['attention_mask'], 
                       inputs['token_type_ids']), 
                      output_model_path,                                # where to save the model (can be a file or file-like object)
                      opset_version=11,                                 # the ONNX opset version to export the model to
                      do_constant_folding=True,                         # whether to execute constant folding for optimization
                      input_names=['input_ids',                         # the model's input names
                                   'input_mask', 
                                   'segment_ids'],
                      output_names=['start', 'end'],                    # the model's output names
                      dynamic_axes={'input_ids': symbolic_names,        # variable length axes
                                    'input_mask' : symbolic_names,
                                    'segment_ids' : symbolic_names,
                                    'start' : symbolic_names,
                                    'end' : symbolic_names})
    print("Model exported at ", output_model_path)

***** Running evaluation {} *****
  Num examples =  6
  Batch size =  1

Model exported at  ./pytorch_squad/bert-base-cased-squad.onnx


### Inference the Exported Model with ONNX Runtime
#### Install ONNX Runtime
Install ONNX Runtime if you haven't done so already. Make sure to install the correct package from PyPI: onnxruntime for CPU inference, or onnxruntime-gpu for GPU inference.

In [47]:
ONNXRUNTIME = 'onnxruntime'
# Install ONNX Runtime
if torch.cuda.is_available():
    ## Install onnxruntime-gpu if cuda is available
    ONNXRUNTIME = 'onnxruntime-gpu'
!pip install $ONNXRUNTIME

Collecting onnxruntime
  Using cached https://files.pythonhosted.org/packages/68/db/ce33eae1c701547b99d778b64a9694e16efa7353ab98dd60d20402bdabda/onnxruntime-1.2.0-cp36-cp36m-macosx_10_14_x86_64.whl
Collecting onnx>=1.2.3 (from onnxruntime)
  Using cached https://files.pythonhosted.org/packages/83/53/2a51e046fb94bb924556341d09f82f08bc5cc515cf4764ceb3feeebc763a/onnx-1.6.0-cp36-cp36m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl
Collecting protobuf (from onnx>=1.2.3->onnxruntime)
  Downloading https://files.pythonhosted.org/packages/4c/65/3ac73d6a9f31de4b45ebff6885e9a4ccd16eccb63764dd406140d337fabd/protobuf-3.11.3-cp36-cp36m-macosx_10_9_x86_64.whl (1.3MB)
Collecting typing-extensions>=3.6.2.1 (from onnx>=1.2.3->onnxruntime)
  Using cached https://files.pythonhosted.org/packages/03/92/705fe8aca27678e01bbdd7738173b8e7df0088a2202c80352f664630d638/typing_extension

Now we are ready to run inference on our model with ONNX Runtime!

In [48]:
import onnxruntime as rt  
import time
import psutil

sess_options = rt.SessionOptions()

# Set graph optimization level to ORT_ENABLE_EXTENDED to enable BERT optimization. This is enabled by default.
sess_options.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_EXTENDED

# The following settings enable OpenMP, which is required to get the best performance for CPU inference of BERT models.
sess_options.intra_op_num_threads=1
os.environ["OMP_NUM_THREADS"] = str(psutil.cpu_count(logical=True))
os.environ["OMP_WAIT_POLICY"] = 'ACTIVE'

session = rt.InferenceSession(output_model_path, sess_options)

# evaluate the model
start = time.time()
res = session.run(None, {
          'input_ids': inputs['input_ids'].cpu().numpy(),
          'input_mask': inputs['attention_mask'].cpu().numpy(),
          'segment_ids': inputs['token_type_ids'].cpu().numpy()
        })
end = time.time()
print("ONNX Runtime inference time: ", end - start)

ONNX Runtime inference time:  0.3330717086791992
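
A single timed call like the one above can include one-time start-up costs, so it is a noisy basis for comparison. A minimal sketch of a steadier measurement is to run a few warm-up calls first and then report statistics over several timed runs. The helper name benchmark and its defaults are assumptions for illustration, and the workload here is a stand-in so the snippet runs on its own.

```python
import statistics
import time


def benchmark(fn, warmup=3, runs=20):
    """Call fn repeatedly and return latency statistics in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are executed but not timed

    latencies_ms = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        latencies_ms.append((time.perf_counter() - t0) * 1000.0)

    latencies_ms.sort()
    p95_index = min(len(latencies_ms) - 1, int(0.95 * len(latencies_ms)))
    return {
        "runs": runs,
        "mean_ms": statistics.mean(latencies_ms),
        "median_ms": statistics.median(latencies_ms),
        "p95_ms": latencies_ms[p95_index],
    }


# Stand-in workload; it also records how many times it was called.
calls = []
stats = benchmark(lambda: calls.append(sum(range(1000))), warmup=2, runs=5)
print(stats)
```

In the notebook, the same helper could wrap the ONNX Runtime call, for example benchmark(lambda: session.run(None, ort_inputs)), where ort_inputs stands for the input dictionary passed to session.run above.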


Get comparative performance numbers from the original PyTorch model.

In [50]:
start = time.time()
# Disable gradient tracking so that PyTorch is timed on inference work only
with torch.no_grad():
    outputs = model(**inputs)
end = time.time()
print("PyTorch Inference time = ", end - start)

print("***** Verifying correctness *****")
import numpy as np
for i in range(2):
    print('PyTorch and ORT matching numbers:', np.allclose(res[i], outputs[i].cpu().detach().numpy(), rtol=1e-04, atol=1e-05))

PyTorch Inference time =  0.23375296592712402
***** Verifying correctness *****
PyTorch and ORT matching numbers: True
PyTorch and ORT matching numbers: True
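
The check above passes when every element satisfies |actual - expected| <= atol + rtol * |expected|, which is the rule np.allclose applies. The pure-Python sketch below spells that rule out and also reports the largest absolute difference, which is often more informative than a bare True or False. The helper names and the sample numbers are made up for illustration.

```python
def is_close(actual, expected, rtol=1e-4, atol=1e-5):
    """Element-wise tolerance rule: |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


def compare(actual_values, expected_values, rtol=1e-4, atol=1e-5):
    """Summarize how closely two equally long sequences of floats agree."""
    if len(actual_values) != len(expected_values):
        raise ValueError("sequences must have the same length")
    pairs = list(zip(actual_values, expected_values))
    return {
        "all_close": all(is_close(a, e, rtol, atol) for a, e in pairs),
        "max_abs_diff": max(abs(a - e) for a, e in pairs),
    }


# Made-up scores: the first pair differs by about 5e-5, inside the tolerance.
good = compare([1.00005, -2.0], [1.0, -2.0])
# Here the difference is about 1e-3, outside the tolerance.
bad = compare([1.001], [1.0])
print(good)
print(bad)
```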
