Copyright (c) Microsoft Corporation. All rights reserved.  
Licensed under the MIT License.

# Inference PyTorch GPT2 Model with ONNX Runtime on CPU

In this tutorial, you will load a GPT2 model from PyTorch, convert it to ONNX, and run inference with ONNX Runtime using IO Binding. Past state is used to get better performance.

## Prerequisites ##

If you have Jupyter Notebook, you may directly run this notebook. We will use pip to install or upgrade [PyTorch](https://pytorch.org/), [OnnxRuntime](https://microsoft.github.io/onnxruntime/) and other required packages.

Otherwise, you can set up a new environment. First, install [Anaconda](https://www.anaconda.com/distribution/). Then open an Anaconda prompt window and run the following commands:

```console
conda create -n cpu_env python=3.8
conda activate cpu_env
conda install jupyter
jupyter notebook
```
The last command launches Jupyter Notebook, and we can open this notebook in a browser to continue.

In [1]:
import os

# Create a cache directory to store pretrained model.
cache_dir = os.path.join(".", "cache_models")
os.makedirs(cache_dir, exist_ok=True)

In [2]:
!pip install coloredlogs



## Convert GPT2 model from PyTorch to ONNX ##

The script [convert_to_onnx.py](https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/tools/transformers/convert_to_onnx.py) helps you convert GPT2 with past state to ONNX.

The script accepts a pretrained model name or the path of a checkpoint directory as input, and converts the model to ONNX. It also verifies that the ONNX model generates the same output as the PyTorch model. Usage:
```
python -m onnxruntime.transformers.convert_to_onnx -m model_name_or_path --output gpt2.onnx -o -p fp32|fp16|int8
```
The -p option selects the precision: fp32 (float32), fp16 (mixed precision) or int8 (quantization). The -o option generates an optimized model, which is required for fp16 or int8.
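As a small illustration of the options described above, the sketch below assembles the command line from Python. The helper name `build_convert_command` is hypothetical; only the module name and the `-m`, `--output`, `-o` and `-p` flags come from the usage line. It always passes `-o`, since that is required for fp16 and int8.

```python
import sys

def build_convert_command(model_name_or_path, output_path, precision="fp32"):
    """Return the argument list for convert_to_onnx (hypothetical helper)."""
    if precision not in ("fp32", "fp16", "int8"):
        raise ValueError(f"unsupported precision: {precision}")
    return [
        sys.executable, "-m", "onnxruntime.transformers.convert_to_onnx",
        "-m", model_name_or_path,
        "--output", output_path,
        "-o",              # optimized model; required for fp16 and int8
        "-p", precision,
    ]

command = build_convert_command("gpt2", "gpt2_int8.onnx", precision="int8")
print(" ".join(command[1:]))
# To actually run the conversion: subprocess.run(command, check=True)
```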

Here we use a pretrained model as an example:

In [9]:
from onnxruntime.transformers.gpt2_helper import Gpt2Helper, MyGPT2LMHeadModel
from transformers import AutoConfig
import torch
import numpy as np
import numpy

model_name_or_path = "gpt2"
config = AutoConfig.from_pretrained(model_name_or_path, cache_dir=cache_dir)
model = MyGPT2LMHeadModel.from_pretrained(model_name_or_path, config=config, cache_dir=cache_dir)
device = torch.device("cpu")
model.eval().to(device)

print(model.config)

num_attention_heads = model.config.n_head
hidden_size = model.config.n_embd
num_layer = model.config.n_layer

GPT2Config {
  "_name_or_path": "gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.15.0.dev0",
  "use_cache": true,
  "vocab_size": 50257
}



In [4]:
onnx_model_path = "gpt2.onnx"
Gpt2Helper.export_onnx(model, device, onnx_model_path) # add parameter use_external_data_format=True when model size > 2 GB



## PyTorch Inference using Huggingface Transformers ##

In the following, we use an example input to get the output from PyTorch for comparison.
For the first inference, there is no past state yet, so we prepare an empty state as input.
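An empty past state is a tensor whose past-sequence dimension is zero, so it holds no elements at all. The sketch below computes that shape in plain Python for the `gpt2` configuration printed earlier (12 heads, hidden size 768); `empty_past_shape` is a hypothetical helper name used only for illustration.

```python
from math import prod

def empty_past_shape(batch_size, num_attention_heads, hidden_size):
    """Shape of one layer's past state before any token has been processed."""
    head_size = hidden_size // num_attention_heads
    # 2 = key and value; the 0 is the past sequence length.
    return [2, batch_size, num_attention_heads, 0, head_size]

shape = empty_past_shape(batch_size=2, num_attention_heads=12, hidden_size=768)
print(shape)        # [2, 2, 12, 0, 64]
print(prod(shape))  # 0, the tensor holds no elements
```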

In [5]:
import onnxruntime as ort
ort_session = ort.InferenceSession("gpt2.onnx")
for input in ort_session.get_inputs():
    print(input)
print("Outputs:")
for output in ort_session.get_outputs():
    print(output)

NodeArg(name='input_ids', type='tensor(int64)', shape=['batch_size', 'seq_len'])
NodeArg(name='position_ids', type='tensor(int64)', shape=['batch_size', 'seq_len'])
NodeArg(name='attention_mask', type='tensor(float)', shape=['batch_size', 'total_seq_len'])
NodeArg(name='past_0', type='tensor(float)', shape=[2, 'batch_size', 12, 'past_seq_len', 64])
NodeArg(name='past_1', type='tensor(float)', shape=[2, 'batch_size', 12, 'past_seq_len', 64])
NodeArg(name='past_2', type='tensor(float)', shape=[2, 'batch_size', 12, 'past_seq_len', 64])
NodeArg(name='past_3', type='tensor(float)', shape=[2, 'batch_size', 12, 'past_seq_len', 64])
NodeArg(name='past_4', type='tensor(float)', shape=[2, 'batch_size', 12, 'past_seq_len', 64])
NodeArg(name='past_5', type='tensor(float)', shape=[2, 'batch_size', 12, 'past_seq_len', 64])
NodeArg(name='past_6', type='tensor(float)', shape=[2, 'batch_size', 12, 'past_seq_len', 64])
NodeArg(name='past_7', type='tensor(float)', shape=[2, 'batch_size', 12, 'past_seq_le

In [None]:
from transformers import AutoTokenizer

EXAMPLE_Text = ['best hotel in bay area', 'here is an example of gpt2 model']

def get_tokenizer(model_name_or_path, cache_dir):
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, cache_dir=cache_dir)
    tokenizer.padding_side = "left"
    tokenizer.pad_token = tokenizer.eos_token
    # tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    return tokenizer

def get_example_inputs(prompt_text=EXAMPLE_Text):    
    tokenizer = get_tokenizer(model_name_or_path, cache_dir)
    encodings_dict = tokenizer.batch_encode_plus(prompt_text, padding=True)

    input_ids = torch.tensor(encodings_dict['input_ids'], dtype=torch.int64)
    attention_mask = torch.tensor(encodings_dict['attention_mask'], dtype=torch.float32)
    position_ids = (attention_mask.long().cumsum(-1) - 1)
    position_ids.masked_fill_(position_ids < 0, 0)

    # Empty past state for generating the first word
    empty_past = []
    batch_size = input_ids.size(0)
    print("batch_size", batch_size)
    sequence_length = input_ids.size(1)
    past_shape = [2, batch_size, num_attention_heads, 0, hidden_size // num_attention_heads]
    for i in range(num_layer):
        empty_past.append(torch.empty(past_shape).type(torch.float32).to(device))
       
    return input_ids, attention_mask, position_ids, empty_past

input_ids, attention_mask, position_ids, empty_past = get_example_inputs()
print("input_ids", input_ids)
print("attention_mask", attention_mask)
print("position_ids", position_ids)

NameError: name 'model_name_or_path' is not defined

numpy.ascontiguousarray(input_ids.cpu().numpy())

## ONNX Runtime Inference ##

We can use ONNX Runtime for inference. The inputs are a dictionary that maps each input name to a numpy array, and the output is a list of numpy arrays. Note that both input and output are in CPU memory. When you run inference on GPU, data is copied between CPU and GPU for input and output.
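The model's input shapes contain symbolic dimension names such as `batch_size` and `past_seq_len`, so building an input dictionary requires choosing a concrete value for each name first. A minimal sketch of that step in plain Python, where `resolve_shape` is a hypothetical helper and the dimension values are just an example:

```python
def resolve_shape(symbolic_shape, dim_values):
    """Replace symbolic dimension names with concrete sizes; keep integers as-is."""
    return [dim_values[d] if isinstance(d, str) else d for d in symbolic_shape]

dims = {"batch_size": 1, "seq_len": 20, "total_seq_len": 20, "past_seq_len": 0}

print(resolve_shape(["batch_size", "seq_len"], dims))                   # [1, 20]
print(resolve_shape([2, "batch_size", 12, "past_seq_len", 64], dims))  # [2, 1, 12, 0, 64]
```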

Let's create an inference session for ONNX Runtime given the exported ONNX model, and see the output.

In [13]:
import onnxruntime
import numpy

input_ids, attention_mask, position_ids, empty_past = get_example_inputs()

onnx_model_path = "gpt2.onnx"
session = onnxruntime.InferenceSession(onnx_model_path)

inputs = {}
shape_name_mapping = {
    'seq_len': 20,
    'total_seq_len': 20,
    'past_seq_len': 0,
    'batch_size': 1
}
type_name_mapping = {
    'tensor(int64)': np.int64,
    'tensor(float)': np.float32
}
def map_shape(x):
    if type(x) is str:
        return shape_name_mapping[x]
    return x
for input in ort_session.get_inputs():
    print(input)
    processed_shape = list(map(map_shape, input.shape))
    print("processed_shape", processed_shape)
    inputs[input.name] = np.zeros(processed_shape, dtype = type_name_mapping[input.type])

ort_inputs = inputs
ort_outputs = session.run(None, ort_inputs)

batch_size 2
NodeArg(name='input_ids', type='tensor(int64)', shape=['batch_size', 'seq_len'])
processed_shape [1, 20]
NodeArg(name='position_ids', type='tensor(int64)', shape=['batch_size', 'seq_len'])
processed_shape [1, 20]
NodeArg(name='attention_mask', type='tensor(float)', shape=['batch_size', 'total_seq_len'])
processed_shape [1, 20]
NodeArg(name='past_0', type='tensor(float)', shape=[2, 'batch_size', 12, 'past_seq_len', 64])
processed_shape [2, 1, 12, 0, 64]
NodeArg(name='past_1', type='tensor(float)', shape=[2, 'batch_size', 12, 'past_seq_len', 64])
processed_shape [2, 1, 12, 0, 64]
NodeArg(name='past_2', type='tensor(float)', shape=[2, 'batch_size', 12, 'past_seq_len', 64])
processed_shape [2, 1, 12, 0, 64]
NodeArg(name='past_3', type='tensor(float)', shape=[2, 'batch_size', 12, 'past_seq_len', 64])
processed_shape [2, 1, 12, 0, 64]
NodeArg(name='past_4', type='tensor(float)', shape=[2, 'batch_size', 12, 'past_seq_len', 64])
processed_shape [2, 1, 12, 0, 64]
NodeArg(name='past

In [16]:
import onnxruntime
import numpy

input_ids, attention_mask, position_ids, empty_past = get_example_inputs()

onnx_model_path = "gpt2.onnx"
session = onnxruntime.InferenceSession(onnx_model_path)
ort_inputs = {'input_ids': input_ids.numpy(),
              'attention_mask' : numpy.ascontiguousarray(attention_mask.cpu().numpy()),
              'position_ids': numpy.ascontiguousarray(position_ids.cpu().numpy())
             }
for i, past_i in enumerate(empty_past):
    ort_inputs[f'past_{i}'] = numpy.ascontiguousarray(past_i.cpu().numpy())
print(ort_inputs)
ort_outputs = session.run(None, ort_inputs)

batch_size 2
{'input_ids': array([[50256, 50256, 50256, 50256, 13466,  7541,   287, 15489,  1989],
       [ 1456,   318,   281,  1672,   286,   308,   457,    17,  2746]]), 'attention_mask': array([[0., 0., 0., 0., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1.]], dtype=float32), 'position_ids': array([[0, 0, 0, 0, 0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4, 5, 6, 7, 8]]), 'past_0': array([], shape=(2, 2, 12, 0, 64), dtype=float32), 'past_1': array([], shape=(2, 2, 12, 0, 64), dtype=float32), 'past_2': array([], shape=(2, 2, 12, 0, 64), dtype=float32), 'past_3': array([], shape=(2, 2, 12, 0, 64), dtype=float32), 'past_4': array([], shape=(2, 2, 12, 0, 64), dtype=float32), 'past_5': array([], shape=(2, 2, 12, 0, 64), dtype=float32), 'past_6': array([], shape=(2, 2, 12, 0, 64), dtype=float32), 'past_7': array([], shape=(2, 2, 12, 0, 64), dtype=float32), 'past_8': array([], shape=(2, 2, 12, 0, 64), dtype=float32), 'past_9': array([], shape=(2, 2, 12, 0, 64), dtype=float32), 'p
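The printed inputs above show the effect of left padding: the first prompt is padded with four tokens, its attention mask is zero there, and its position ids only start counting at the first real token. The sketch below reproduces that rule (cumulative sum of the mask minus one, clamped at zero) in plain Python; `position_ids_from_mask` is a hypothetical name.

```python
from itertools import accumulate

def position_ids_from_mask(attention_mask_row):
    """Position ids for one left-padded sequence: cumsum(mask) - 1, clamped at 0."""
    running = accumulate(int(m) for m in attention_mask_row)
    return [max(total - 1, 0) for total in running]

print(position_ids_from_mask([0, 0, 0, 0, 1, 1, 1, 1, 1]))  # [0, 0, 0, 0, 0, 1, 2, 3, 4]
print(position_ids_from_mask([1, 1, 1, 1, 1, 1, 1, 1, 1]))  # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```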

We can compare the outputs from PyTorch and ONNX Runtime. The logits are very close (the maximum difference is 1E-4).

In [None]:
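# torch_output is assumed to hold the PyTorch model outputs for the same inputs; it is not computed in the cells shown here.
logits_masked_diff = (torch_output[0] - torch.from_numpy(ort_outputs[0])) * attention_mask.unsqueeze(2)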
max_logits_diff = logits_masked_diff.abs().max()
print("max logits diff (ignored padding)", max_logits_diff)
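Multiplying by the attention mask zeroes the difference at padded positions, so only real tokens count toward the maximum. The same idea on nested lists, as a plain-Python sketch with made-up numbers (`max_masked_diff` is a hypothetical helper):

```python
def max_masked_diff(logits_a, logits_b, attention_mask):
    """Largest absolute difference over positions whose mask value is non-zero.

    logits_a, logits_b: nested lists shaped [batch][sequence][vocab]
    attention_mask:     nested list shaped [batch][sequence]
    """
    worst = 0.0
    for rows_a, rows_b, mask_row in zip(logits_a, logits_b, attention_mask):
        for row_a, row_b, keep in zip(rows_a, rows_b, mask_row):
            if not keep:
                continue  # padded position: ignore whatever the models produced
            for a, b in zip(row_a, row_b):
                worst = max(worst, abs(a - b))
    return worst

a = [[[1.0, 2.0], [3.0, 4.0]]]
b = [[[9.0, 9.0], [3.0, 4.5]]]
mask = [[0, 1]]  # first position is padding

print(max_masked_diff(a, b, mask))  # 0.5
```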

## ONNX Runtime Inference with IO Binding ##

To avoid copying input and output data, ONNX Runtime also supports IO Binding. The user can provide buffers for inputs and outputs. For GPU inference, the buffers can be on the GPU to reduce memory copies between CPU and GPU, which helps high-performance inference on GPU. For GPT-2, IO Binding might help performance when the batch size or (past) sequence length is large.

In [None]:
def inference_with_io_binding(session, config, input_ids, position_ids, attention_mask, past):
    output_shapes = Gpt2Helper.get_output_shapes(batch_size=input_ids.size(0),
                                                 past_sequence_length=past[0].size(3),
                                                 sequence_length=input_ids.size(1),
                                                 config=config)
    output_buffers = Gpt2Helper.get_output_buffers(output_shapes, device)

    io_binding = Gpt2Helper.prepare_io_binding(session, input_ids, position_ids, attention_mask, past,
                                               output_buffers, output_shapes)
    session.run_with_iobinding(io_binding)

    outputs = Gpt2Helper.get_outputs_from_io_binding_buffer(session, output_buffers, output_shapes,
                                                            return_numpy=False)
    return outputs
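Preallocating output buffers requires knowing every output shape before the model runs. The sketch below shows how such shapes could be derived for GPT-2 with past state. It is an approximation, not the actual `Gpt2Helper.get_output_shapes` code, and the output names `logits` and `present_{i}` are assumptions that mirror the `past_{i}` input names.

```python
def gpt2_output_shapes(batch_size, past_sequence_length, sequence_length,
                       num_attention_heads, hidden_size, num_layer, vocab_size):
    """Hypothetical helper: expected output shapes for one inference step."""
    head_size = hidden_size // num_attention_heads
    shapes = {"logits": [batch_size, sequence_length, vocab_size]}
    # Each layer returns its past state extended by the newly processed tokens.
    present = [2, batch_size, num_attention_heads,
               past_sequence_length + sequence_length, head_size]
    for i in range(num_layer):
        shapes[f"present_{i}"] = list(present)
    return shapes

shapes = gpt2_output_shapes(batch_size=2, past_sequence_length=0, sequence_length=9,
                            num_attention_heads=12, hidden_size=768,
                            num_layer=12, vocab_size=50257)
print(shapes["logits"])     # [2, 9, 50257]
print(shapes["present_0"])  # [2, 2, 12, 9, 64]
```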

We can see that the result is exactly the same with and without IO Binding:

In [None]:
input_ids, attention_mask, position_ids, empty_past = get_example_inputs()
outputs = inference_with_io_binding(session, config, input_ids, position_ids, attention_mask, empty_past)
for i in range(len(outputs)):
    assert torch.eq(outputs[i], torch.from_numpy(ort_outputs[i])).all()
print("IO Binding result is good")

## Batch Text Generation ##

Here is an example for text generation using ONNX Runtime or PyTorch. For ONNX Runtime, IO Binding is used for better performance.
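A generation loop with past state feeds the whole prompt once, then feeds only the newest token while reusing the cached state. The sketch below shows greedy decoding with a hypothetical `toy_step` function standing in for the model. It is not the notebook's `test_generation` helper and handles a single sequence rather than a batch.

```python
def greedy_generate(step, prompt_ids, num_tokens_to_produce, eos_token_id):
    """Greedy decoding; step(new_ids, past) returns (last-token logits, new past)."""
    past = None
    new_ids = list(prompt_ids)        # the first call sees the whole prompt
    generated = list(prompt_ids)
    for _ in range(num_tokens_to_produce):
        logits, past = step(new_ids, past)
        next_id = max(range(len(logits)), key=logits.__getitem__)  # argmax
        generated.append(next_id)
        if next_id == eos_token_id:
            break
        new_ids = [next_id]           # later calls see only the newest token
    return generated

VOCAB_SIZE = 6

def toy_step(new_ids, past):
    """Hypothetical model: always prefers (last token + 1) modulo VOCAB_SIZE."""
    past = (past or []) + list(new_ids)  # the "cache" is just the tokens seen so far
    logits = [0.0] * VOCAB_SIZE
    logits[(past[-1] + 1) % VOCAB_SIZE] = 1.0
    return logits, past

print(greedy_generate(toy_step, [1, 2], 3, eos_token_id=0))  # [1, 2, 3, 4, 5]
```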

In [28]:
tokenizer = get_tokenizer(model_name_or_path, cache_dir)
input_text = EXAMPLE_Text
test_generation(tokenizer, input_text, ort_session=session)

Text generation using OnnxRuntime ...
batch_size 2
------------
best hotel in bay area.

The hotel is located in the historic Bayview neighborhood of San Francisco.

The hotel is open daily from 9 a.m.
------------
here is an example of gpt2 model.

The gpt2 model is a simple, but powerful, way to generate a GPT2-like data structure. It is a


Next, we run again with PyTorch and can see that the result is exactly the same.

In [None]:
test_generation(tokenizer, input_text)

## Int8 Quantization ##
Next, we will apply dynamic quantization to the model. We optimize the model before quantization to get better performance.

Note that text generation results from fp32 and int8 models can be quite different. You should evaluate the precision metric for your application with both the fp32 and int8 models. If the quality of the int8 model's results is acceptable, you will be glad to find that it is faster than the fp32 model in inference.

Note that you can leverage [quantization aware training (QAT)](https://pytorch.org/blog/introduction-to-quantization-on-pytorch/) for accuracy improvement if needed.
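To see why int8 results can differ from fp32, it helps to look at the rounding involved. The sketch below shows simple symmetric quantization of a few made-up weights in plain Python. It only illustrates the general idea and is not ONNX Runtime's actual quantization scheme.

```python
def quantize_int8(values):
    """Symmetric quantization: map floats to integers in [-127, 127] with one scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Map the integers back to approximate float values."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.003, 1.0]   # made-up example values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
errors = [abs(w - r) for w, r in zip(weights, restored)]

print(q)
print(restored)
print("max rounding error:", max(errors))  # bounded by about scale / 2
```

Small values such as 0.003 are rounded to zero here, which is the kind of precision loss that can change generated text.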

In [None]:
from onnxruntime.transformers.quantize_helper import QuantizeHelper

optimized_fp32_model_path = "gpt2_fp32.onnx"
quantized_int8_model_path = "gpt2_int8.onnx"
Gpt2Helper.optimize_onnx("gpt2.onnx", optimized_fp32_model_path, False, model.config.num_attention_heads, model.config.hidden_size)
QuantizeHelper.quantize_onnx_model(optimized_fp32_model_path, quantized_int8_model_path)

In [None]:
session_int8 = onnxruntime.InferenceSession(quantized_int8_model_path)
input_text = ['bert model optimization']
test_generation(tokenizer, input_text, ort_session=session_int8, num_tokens_to_produce=14)

## Benchmark ##
The tool benchmark_gpt2.py measures the performance of GPT-2 with PyTorch and with ONNX Runtime, without and with IO Binding.

In [None]:
import sys
!{sys.executable} -m onnxruntime.transformers.benchmark_gpt2 -m gpt2 -o

In [None]:
!{sys.executable} -m onnxruntime.transformers.benchmark_gpt2 -m gpt2 -o --precision int8

We can see that the quantized model gives a significant speedup (close to 2x).
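For a quick check outside the benchmark tool, average latency can be measured with a small timing loop and the speedup computed as a ratio. This is a generic sketch with hypothetical helper names and a placeholder workload, not how benchmark_gpt2.py measures; in practice `fn` would wrap a call such as `session.run(None, ort_inputs)`.

```python
import time

def average_latency_ms(fn, warmup=2, runs=10):
    """Average wall-clock time of fn() in milliseconds, after a few warm-up calls."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) * 1000 / runs

def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms

latency = average_latency_ms(lambda: sum(range(10_000)))  # placeholder workload
print(f"average latency: {latency:.3f} ms")
print(speedup(20.0, 10.0))  # 2.0
```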

### Test Environment ###
The following shows the hardware of the test machine and the software versions:

In [None]:
!{sys.executable} -m onnxruntime.transformers.machine_info --silent