Copyright (c) Microsoft Corporation. All rights reserved.  
Licensed under the MIT License.

# Inference PyTorch GPT2 Model with ONNX Runtime on CPU

In this tutorial, you'll learn how to load a GPT2 model from PyTorch, convert it to ONNX, and run inference on it using ONNX Runtime.

**Note: this work is still in progress. You need to install the ort_nightly package until onnxruntime 1.3.0 is released. The performance numbers here do not reflect the final results for onnxruntime 1.3.0.**

## Prerequisites ##

If you have Jupyter Notebook, you may directly run this notebook. We will use pip to install or upgrade [PyTorch](https://pytorch.org/), [OnnxRuntime](https://microsoft.github.io/onnxruntime/) and other required packages.

Otherwise, you can set up a new environment. First, install [Anaconda](https://www.anaconda.com/distribution/). Then open an Anaconda prompt window and run the following commands:

```console
conda create -n cpu_env python=3.6
conda activate cpu_env

conda install pytorch torchvision cpuonly -c pytorch
pip install onnxruntime
pip install transformers==2.5.1
pip install onnx psutil pytz pandas py-cpuinfo py3nvml netron

conda install jupyter
jupyter notebook
```
The last command will launch Jupyter Notebook, and we can open this notebook in a browser to continue.
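
Before moving on, it can help to confirm that the packages installed above are importable from the notebook kernel. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical and not part of any of the packages above.

```python
import importlib.util

def missing_modules(module_names):
    """Return the names in module_names that the import system cannot find."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]

# Import names of the packages installed above (py-cpuinfo is imported as `cpuinfo`).
required = ["torch", "onnxruntime", "transformers", "onnx", "psutil",
            "pytz", "pandas", "cpuinfo", "py3nvml"]
print("Missing modules:", missing_modules(required))
```

An empty list means every package was found; any name listed still needs to be installed in the active environment.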

In [1]:
# Enable past state in input.
enable_past_input = False

In [2]:
import os

cache_dir = "./gpt2"
# exist_ok=True makes the call a no-op when the directory already exists.
os.makedirs(cache_dir, exist_ok=True)

output_dir = './gpt2_onnx'
os.makedirs(output_dir, exist_ok=True)

## Benchmark ##

You will need to clone the onnxruntime repository, for example:
```console
git clone https://github.com/microsoft/onnxruntime.git
```
Then update `bert_tools_dir` to match the path on your machine.
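
A wrong `bert_tools_dir` only surfaces as an error once `%run` is executed. As a sketch, a small check like the one below can fail earlier with a clearer message; the helper name `require_script` is an assumption made for illustration.

```python
import os

def require_script(tools_dir, script_name):
    """Return the full path of script_name inside tools_dir, or raise if it is missing."""
    script_path = os.path.join(tools_dir, script_name)
    if not os.path.isfile(script_path):
        raise FileNotFoundError(
            "{} not found; update bert_tools_dir to your local clone".format(script_path))
    return script_path
```

For example, `benchmark_script = require_script(bert_tools_dir, 'benchmark_gpt2.py')` could stand in for the plain `os.path.join` call.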

In [3]:
# Assumes you have cloned the onnxruntime repository from GitHub.
bert_tools_dir = r'D:\Git\onnxruntime\onnxruntime\python\tools\bert'
benchmark_script = os.path.join(bert_tools_dir, 'benchmark_gpt2.py')

if enable_past_input:
    %run $benchmark_script --model_type gpt2 --cache_dir $cache_dir --output_dir $output_dir --enable_optimization --enable_past_input
else:
    %run $benchmark_script --model_type gpt2 --cache_dir $cache_dir --output_dir $output_dir --enable_optimization

To use data.metrics please install scikit-learn. See https://scikit-learn.org/stable/index.html


   benchmark_gpt2.py: no environment variable of OMP_NUM_THREADS
   benchmark_gpt2.py: no environment variable of OMP_WAIT_POLICY
tokenization_utils.py: loading file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json from cache at ./gpt2\f2808208f9bec2320371a9f5f891c184ae0b674ef866b79c58177067d15732dd.1512018be4ba4e8726e41b9145129dc30651ea4fec86aa61f4b9f40bf94eac71
tokenization_utils.py: loading file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt from cache at ./gpt2\d629f792e430b3c76a1291bb2766b0a047e36fae0588f9dbc1ae51decdff691b.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda
configuration_utils.py: loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-config.json from cache at ./gpt2\4be02c5697d91738003fb1685c9872f284166aa32e061576bbe6aaeb95649fcf.699bbd1c449e9861456f359d6daa51bd523ac085b4b531ab0aad5a55d091e942
configuration_utils.py: Model config {
  "architectures": [
    "GPT2LMHeadModel"
  ],
  

  w = w / math.sqrt(v.size(-1))
  b = self.bias[:, :, ns-nd:ns, :ns]


   benchmark_gpt2.py: PyTorch Inference time = 36.92 ms
   benchmark_gpt2.py: OMP_NUM_THREADS=1
   benchmark_gpt2.py: OMP_WAIT_POLICY=ACTIVE
   benchmark_gpt2.py: Prune graph to keep the first output and drop past state outputs:['last_state']
        OnnxModel.py: Graph pruned: 0 inputs, 12 outputs and 48 nodes are removed
        OnnxModel.py: Output model to ./gpt2_onnx\gpt2_past0_out1.onnx
    BertOnnxModel.py: Fused LayerNormalization count: 25
    BertOnnxModel.py: Fused FastGelu count: 12
    BertOnnxModel.py: Fused Reshape count:48
        OnnxModel.py: Graph pruned: 0 inputs, 0 outputs and 1106 nodes are removed
    Gpt2OnnxModel.py: Fused Attention count:12
    BertOnnxModel.py: Failed to find embedding layer
    BertOnnxModel.py: Fused EmbedLayerNormalization count: 0
        OnnxModel.py: Graph pruned: 0 inputs, 0 outputs and 480 nodes are removed
    Gpt2OnnxModel.py: Remove Reshape count:48
    BertOnnxModel.py: Fused FastGelu with Bias count:12
    BertOnnxModel.py: opset

If you only need the benchmark results, you can skip the remaining parts.

The rest of this notebook walks through what the benchmark script does.

### Load pretrained model

In [4]:
from transformers import GPT2Model, GPT2Tokenizer
model_class, tokenizer_class,  model_name_or_path = (GPT2Model,  GPT2Tokenizer,  'gpt2')
tokenizer = tokenizer_class.from_pretrained(model_name_or_path, cache_dir=cache_dir)
model = model_class.from_pretrained(model_name_or_path, cache_dir=cache_dir)
model.eval().cpu()

tokenization_utils.py: loading file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json from cache at ./gpt2\f2808208f9bec2320371a9f5f891c184ae0b674ef866b79c58177067d15732dd.1512018be4ba4e8726e41b9145129dc30651ea4fec86aa61f4b9f40bf94eac71
tokenization_utils.py: loading file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt from cache at ./gpt2\d629f792e430b3c76a1291bb2766b0a047e36fae0588f9dbc1ae51decdff691b.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda
configuration_utils.py: loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-config.json from cache at ./gpt2\4be02c5697d91738003fb1685c9872f284166aa32e061576bbe6aaeb95649fcf.699bbd1c449e9861456f359d6daa51bd523ac085b4b531ab0aad5a55d091e942
configuration_utils.py: Model config {
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "embd_pdrop": 0.1,
  "finetuning_task": null,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1"
  },
  "

GPT2Model(
  (wte): Embedding(50257, 768)
  (wpe): Embedding(1024, 768)
  (drop): Dropout(p=0.1, inplace=False)
  (h): ModuleList(
    (0): Block(
      (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (attn): Attention(
        (c_attn): Conv1D()
        (c_proj): Conv1D()
        (attn_dropout): Dropout(p=0.1, inplace=False)
        (resid_dropout): Dropout(p=0.1, inplace=False)
      )
      (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (mlp): MLP(
        (c_fc): Conv1D()
        (c_proj): Conv1D()
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (1): Block(
      (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (attn): Attention(
        (c_attn): Conv1D()
        (c_proj): Conv1D()
        (attn_dropout): Dropout(p=0.1, inplace=False)
        (resid_dropout): Dropout(p=0.1, inplace=False)
      )
      (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (mlp): MLP(
        (c_fc): Conv1D

In [24]:
import numpy
import time
import torch

def pytorch_inference(model, input_ids, past=None, total_runs = 100):
    latency = []
    with torch.no_grad():
        for _ in range(total_runs):
            start = time.time()
            outputs = model(input_ids=input_ids, past=past)
            latency.append(time.time() - start)
            
    if total_runs > 1:
        print("PyTorch Inference time = {} ms".format(format(sum(latency) * 1000 / len(latency), '.2f')))
    
    return outputs
    
def onnxruntime_inference(ort_session, input_ids, past=None, total_runs=100):    
    # Use contiguous array as input might improve performance.
    # You can check the results from performance test tool to see whether you need it.
    ort_inputs = {
        'input_ids':  numpy.ascontiguousarray(input_ids.cpu().numpy())
    }
    
    if past is not None:
        for i, past_i in enumerate(past):
            ort_inputs[f'past_{i}'] = numpy.ascontiguousarray(past_i.cpu().numpy())
            
    latency = []
    for _ in range(total_runs):
        start = time.time()
        ort_outputs = ort_session.run(None, ort_inputs)
        latency.append(time.time() - start)
        
    if total_runs > 1:
        print("OnnxRuntime Inference time = {} ms".format(format(sum(latency) * 1000 / len(latency), '.2f')))
    
    return ort_outputs

def inference(model, ort_session, input_ids, past=None, total_runs=100, verify_outputs=True):
    outputs = pytorch_inference(model, input_ids, past, total_runs)
    ort_outputs = onnxruntime_inference(ort_session, input_ids, past, total_runs)
    if verify_outputs:
        print('PyTorch and OnnxRuntime output 0 (last_state) are close:', numpy.allclose(ort_outputs[0], outputs[0].cpu(), rtol=1e-05, atol=1e-04))

        if enable_past_input:
            for layer in range(model.config.n_layer):
                print('PyTorch and OnnxRuntime layer {} state (present_{}) are close:'.format(layer, layer), numpy.allclose(ort_outputs[1 + layer], outputs[1][layer].cpu(), rtol=1e-05, atol=1e-04))    
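
The helpers above time every run with `time.time()` and report the mean, so the first (often slower) runs are included. As an alternative sketch, not what the benchmark script does, the hypothetical helper below uses `time.perf_counter()`, discards a few warm-up runs, and reports the median alongside the mean. The names `measure_latency_ms` and `warmup_runs` are assumptions.

```python
import statistics
import time

def measure_latency_ms(run_once, total_runs=100, warmup_runs=5):
    """Call run_once() repeatedly and return (mean_ms, median_ms) over the timed runs."""
    for _ in range(warmup_runs):
        run_once()  # warm-up runs are executed but not timed

    latency_ms = []
    for _ in range(total_runs):
        start = time.perf_counter()
        run_once()
        latency_ms.append((time.perf_counter() - start) * 1000)

    return statistics.mean(latency_ms), statistics.median(latency_ms)

# Example with a stand-in workload; replace the lambda with a model call.
mean_ms, median_ms = measure_latency_ms(lambda: sum(range(10000)), total_runs=20)
print("mean = {:.3f} ms, median = {:.3f} ms".format(mean_ms, median_ms))
```

With a session and inputs at hand, the callable could be something like `lambda: ort_session.run(None, ort_inputs)`.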

In [25]:
import torch
import os

inputs = tokenizer.encode_plus("Here is an example input for GPT2 model", add_special_tokens=True, return_tensors='pt')
input_ids = inputs['input_ids']

# Run without past state so that the shape of the past state can be read from the output.
outputs = model(input_ids=input_ids, past=None)

In [26]:
num_layer = model.config.n_layer    
present_names = [f'present_{i}' for i in range(num_layer)]
output_names = ["last_state"] + present_names

input_names = ['input_ids']
dynamic_axes= {'input_ids': {0: 'batch_size', 1: 'seq_len'},
               #'token_type_ids' : {0: 'batch_size', 1: 'seq_len'},
               #'attention_mask' : {0: 'batch_size', 1: 'seq_len'},
               'last_state' : {0: 'batch_size', 1: 'seq_len'}
              }
for name in present_names:
    dynamic_axes[name] = {1: 'batch_size', 3: 'seq_len'}
        
if enable_past_input:
    past_names = [f'past_{i}' for i in range(num_layer)]
    input_names = ['input_ids'] + past_names  #+ ['token_type_ids', 'attention_mask']
    dummy_past = [torch.zeros(list(outputs[1][0].shape)) for _ in range(num_layer)]
    for name in past_names:
        dynamic_axes[name] = {1: 'batch_size', 3: 'seq_len'}
    export_inputs = (inputs['input_ids'], tuple(dummy_past)) #, inputs['token_type_ids'], inputs['attention_mask'])
else:
    export_inputs = (inputs['input_ids'],)  # trailing comma makes this a one-element tuple

export_model_path = os.path.join(output_dir, 'gpt2_past{}.onnx'.format(int(enable_past_input)))

torch.onnx.export(model,
                  args=export_inputs,
                  f=export_model_path,
                  input_names=input_names,
                  output_names=output_names,
                  dynamic_axes=dynamic_axes,
                  opset_version=11,
                  do_constant_folding = True,
                  verbose=False)
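
The input, output and dynamic-axis names built in this cell follow a simple pattern. The sketch below restates that pattern as a pure function so it can be inspected without a model; `build_export_names` is a hypothetical helper, not part of the benchmark script.

```python
def build_export_names(num_layer, enable_past_input):
    """Return (input_names, output_names, dynamic_axes) using the naming scheme of the export cell."""
    present_names = ['present_{}'.format(i) for i in range(num_layer)]
    output_names = ['last_state'] + present_names
    input_names = ['input_ids']

    dynamic_axes = {
        'input_ids': {0: 'batch_size', 1: 'seq_len'},
        'last_state': {0: 'batch_size', 1: 'seq_len'},
    }
    # Past/present state tensors carry batch on axis 1 and sequence length on axis 3.
    for name in present_names:
        dynamic_axes[name] = {1: 'batch_size', 3: 'seq_len'}

    if enable_past_input:
        past_names = ['past_{}'.format(i) for i in range(num_layer)]
        input_names += past_names
        for name in past_names:
            dynamic_axes[name] = {1: 'batch_size', 3: 'seq_len'}

    return input_names, output_names, dynamic_axes

# Separate variable names are used here to avoid overwriting the notebook's own.
names_in, names_out, axes = build_export_names(num_layer=12, enable_past_input=True)
print(len(names_in), len(names_out), len(axes))
```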

In [27]:
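def remove_past_outputs(export_model_path, output_model_path):
    import logging
    import sys
    from onnx import ModelProto

    # OnnxModel.py lives in bert_tools_dir, so make sure that directory is importable.
    if bert_tools_dir not in sys.path:
        sys.path.append(bert_tools_dir)
    from OnnxModel import OnnxModel

    # Use a local logger so this function does not rely on one left behind by %run.
    logger = logging.getLogger(__name__)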

    model = ModelProto()
    with open(export_model_path, "rb") as f:
        model.ParseFromString(f.read())
    bert_model = OnnxModel(model)

    # remove past state outputs and only keep the first output.
    keep_output_names = [bert_model.model.graph.output[0].name]
    logger.info(f"Prune graph to keep the first output and drop past state outputs:{keep_output_names}")
    bert_model.prune_graph(keep_output_names)

    bert_model.save_model_to_file(output_model_path)
    
if enable_past_input:
    onnx_model_path = export_model_path
else:
    onnx_model_path = os.path.join(output_dir, 'gpt2_past{}_out1.onnx'.format(int(enable_past_input)))
    remove_past_outputs(export_model_path, onnx_model_path)

 remove_past_outputs: Prune graph to keep the first output and drop past state outputs:['last_state']
         prune_graph: Graph pruned: 0 inputs, 12 outputs and 48 nodes are removed
  save_model_to_file: Output model to ./gpt2_onnx\gpt2_past0_out1.onnx


## Inference with ONNX Runtime

### OpenMP Environment Variable

OpenMP environment variables have a large impact on CPU inference performance for the GPT2 model, so you might need to tune them carefully based on the results of the benchmark script.

These environment variables must be set before importing onnxruntime. Otherwise, they might not take effect.
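
The threading choice in the next cells boils down to a pair of settings: either OpenMP uses all logical cores and the session uses one intra-op thread, or the reverse. Below is a minimal sketch of that rule. It uses `os.cpu_count()` in place of `psutil` so that it only needs the standard library, and `configure_threads` is a hypothetical helper.

```python
import os

def configure_threads(use_openmp, logical_cores=None):
    """Set the OpenMP environment variables and return the intra-op thread count to use.

    Call this before importing onnxruntime, otherwise the variables might not take effect.
    """
    if logical_cores is None:
        logical_cores = os.cpu_count() or 1  # os.cpu_count() may return None

    os.environ["OMP_NUM_THREADS"] = str(logical_cores) if use_openmp else "1"
    os.environ["OMP_WAIT_POLICY"] = "ACTIVE"

    # When OpenMP owns the cores, keep ONNX Runtime's own intra-op pool at one thread.
    return 1 if use_openmp else logical_cores

intra_op_num_threads = configure_threads(use_openmp=True)
print(os.environ["OMP_NUM_THREADS"], os.environ["OMP_WAIT_POLICY"], intra_op_num_threads)
```

The returned value corresponds to what the notebook assigns to `sess_options.intra_op_num_threads`.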

In [28]:
import psutil

# You may change the settings in this cell according to Performance Test Tool result.
use_openmp = True

# ATTENTION: these environment variables must be set before importing onnxruntime.
if use_openmp:
    os.environ["OMP_NUM_THREADS"] = str(psutil.cpu_count(logical=True))
else:
    os.environ["OMP_NUM_THREADS"] = '1'

os.environ["OMP_WAIT_POLICY"] = 'ACTIVE'

In [29]:
import onnxruntime
import numpy

# Print warning if user uses onnxruntime-gpu instead of onnxruntime package.
if 'CUDAExecutionProvider' in onnxruntime.get_available_providers():
    print("warning: onnxruntime-gpu is not built with OpenMP. You might try onnxruntime package to test CPU inference.")

sess_options = onnxruntime.SessionOptions()

# Optional: store the optimized graph and view it using Netron to verify that model is fully optimized.
# Note that this will increase session creation time, so it is for debugging only.
#sess_options.optimized_model_filepath = os.path.join(output_dir, "optimized_model_cpu.onnx")
   
if use_openmp:
    sess_options.intra_op_num_threads=1
else:
    sess_options.intra_op_num_threads=psutil.cpu_count(logical=True)

# Specify providers when you use onnxruntime-gpu for CPU inference.
session = onnxruntime.InferenceSession(onnx_model_path, sess_options, providers=['CPUExecutionProvider'])

# Compare PyTorch and OnnxRuntime inference performance and results
%time inference(model, session, input_ids, past=dummy_past if enable_past_input else None)

PyTorch Inference time = 39.00 ms
OnnxRuntime Inference time = 70.14 ms
PyTorch and OnnxRuntime output 0 (last_state) are close: True
Wall time: 10.9 s
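
The "are close" result above comes from `numpy.allclose`, which checks element-wise that |a - b| <= atol + rtol * |b|. The pure-Python sketch below restates that rule for flat lists of numbers to show how `rtol=1e-05` and `atol=1e-04` interact. It ignores NaN and infinity handling and is not a replacement for NumPy; `all_close` is a hypothetical name.

```python
def all_close(actual, expected, rtol=1e-05, atol=1e-04):
    """Return True if every |a - b| <= atol + rtol * |b| for paired elements."""
    if len(actual) != len(expected):
        raise ValueError("inputs must have the same length")
    return all(abs(a - b) <= atol + rtol * abs(b) for a, b in zip(actual, expected))

print(all_close([1.00005, -2.0], [1.0, -2.0]))   # difference of 5e-05 is within tolerance
print(all_close([1.001, -2.0], [1.0, -2.0]))     # difference of 1e-03 exceeds tolerance
```

For values near 1.0 the absolute tolerance dominates, so the check accepts differences up to roughly 1.1e-04.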


In [30]:
import gc
del session
gc.collect()

0

In [31]:
optimized_model = os.path.join(output_dir, 'gpt2_past{}_optimized.onnx'.format(int(enable_past_input)))

In [32]:
bert_opt_script = os.path.join(bert_tools_dir, 'bert_model_optimization.py')

In [33]:
# Local directory corresponding to https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/bert/
%run $bert_opt_script --model_type gpt2 --input $onnx_model_path --output $optimized_model --opt_level 0

     fuse_layer_norm: Fused LayerNormalization count: 25
 fuse_gelu_with_tanh: Fused FastGelu count: 12
        fuse_reshape: Fused Reshape count:48
         prune_graph: Graph pruned: 0 inputs, 0 outputs and 1106 nodes are removed
      fuse_attention: Fused Attention count:12
fuse_embed_layer_without_mask: Failed to find embedding layer
    fuse_embed_layer: Fused EmbedLayerNormalization count: 0
         prune_graph: Graph pruned: 0 inputs, 0 outputs and 480 nodes are removed
         postprocess: Remove Reshape count:48
      fuse_bias_gelu: Fused FastGelu with Bias count:12
            optimize: opset verion: 11
  save_model_to_file: Output model to ./gpt2_onnx\gpt2_past0_optimized.onnx
get_fused_operator_statistics: Optimized operators:{'EmbedLayerNormalization': 0, 'Attention': 12, 'Gelu': 0, 'FastGelu': 12, 'BiasGelu': 0, 'LayerNormalization': 25, 'SkipLayerNormalization': 0}
  is_fully_optimized: EmbedLayer=0, Attention=12, Gelu=12, LayerNormalization=25, Successful=False
    

In [34]:
session = onnxruntime.InferenceSession(optimized_model, sess_options, providers=['CPUExecutionProvider'])

%time inference(model, session, input_ids, past=dummy_past if enable_past_input else None, verify_outputs=False)

PyTorch Inference time = 37.91 ms
OnnxRuntime Inference time = 66.12 ms
Wall time: 10.4 s


## Additional Info

Note that running inside Jupyter Notebook has a slight impact on the performance results, since Jupyter Notebook itself uses system resources such as CPU and memory. To get more accurate performance numbers, close Jupyter Notebook and other applications, then run the benchmark script in a console.
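
One way to follow this advice is to launch the benchmark script as a separate process from a console. The sketch below only assembles the command line, reusing the flags passed to `%run` earlier in this notebook; `build_benchmark_command` is a hypothetical helper, and the directory in the example is a placeholder.

```python
import os
import sys

def build_benchmark_command(bert_tools_dir, cache_dir, output_dir, enable_past_input=False):
    """Return the argument list for running benchmark_gpt2.py outside the notebook."""
    command = [
        sys.executable,
        os.path.join(bert_tools_dir, 'benchmark_gpt2.py'),
        '--model_type', 'gpt2',
        '--cache_dir', cache_dir,
        '--output_dir', output_dir,
        '--enable_optimization',
    ]
    if enable_past_input:
        command.append('--enable_past_input')
    return command

# 'path/to/bert' is a placeholder for your local bert_tools_dir.
print(' '.join(build_benchmark_command('path/to/bert', './gpt2', './gpt2_onnx')))
```

The returned list can be passed to `subprocess.run(command, check=True)` from a plain Python script.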

The [OnnxRuntime C API](https://github.com/microsoft/onnxruntime/blob/master/docs/C_API.md) can deliver slightly better performance than the Python API. If you use the C API for inference, you can measure performance with OnnxRuntime_Perf_Test.exe built from source instead.

Here is the configuration of the machine that generated the above results. The machine has a GPU, but it is not used for CPU inference.
You might get slower or faster results depending on your hardware.

In [35]:
machine_info_script = os.path.join(bert_tools_dir, 'MachineInfo.py')
%run $machine_info_script --silent

{
  "gpu": {
    "driver_version": "441.22",
    "devices": [
      {
        "memory_total": 8589934592,
        "memory_available": 8480882688,
        "name": "GeForce GTX 1070"
      }
    ]
  },
  "cpu": {
    "brand": "Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz",
    "cores": 6,
    "logical_cores": 12,
    "hz": "3.1920 GHz",
    "l2_cache": "1536 KB",
    "l3_cache": "12288 KB",
    "processor": "Intel64 Family 6 Model 158 Stepping 10, GenuineIntel"
  },
  "memory": {
    "total": 16971259904,
    "available": 2603229184
  },
  "python": "3.6.10.final.0 (64 bit)",
  "os": "Windows-10-10.0.18362-SP0",
  "onnxruntime": {
    "version": "1.2.0",
    "support_gpu": false
  },
  "pytorch": {
    "version": "1.4.0+cpu",
    "support_gpu": false
  },
  "tensorflow": {
    "version": "2.1.0",
    "git_version": "v2.1.0-rc2-17-ge5bf8de410",
    "support_gpu": true
  }
}
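
If the onnxruntime repository is not at hand, a rough subset of this report can be gathered with the standard library alone. This is a sketch under that assumption, not the implementation of `MachineInfo.py`; the key names merely mirror the output above, and `basic_machine_info` is a hypothetical helper.

```python
import json
import os
import platform

def basic_machine_info():
    """Collect a small, stdlib-only subset of the machine details shown above."""
    return {
        "cpu": {
            "processor": platform.processor(),
            "logical_cores": os.cpu_count(),
        },
        "python": platform.python_version(),
        "os": platform.platform(),
    }

print(json.dumps(basic_machine_info(), indent=2))
```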
