Copyright (c) Microsoft Corporation. All rights reserved.  
Licensed under the MIT License.

# Inference PyTorch GPT2 Model with ONNX Runtime on CPU

In this tutorial, you'll be introduced to how to load a GPT2 model from PyTorch, convert it to ONNX, and inference it using ONNX Runtime using IO Binding. Note that past state is used to get better performance.

## Prerequisites ##

If you have Jupyter Notebook, you may directly run this notebook. We will use pip to install or upgrade [PyTorch](https://pytorch.org/), [OnnxRuntime](https://microsoft.github.io/onnxruntime/) and other required packages.

Otherwise, you can setup a new environment. First, we install [AnaConda](https://www.anaconda.com/distribution/). Then open an AnaConda prompt window and run the following commands:

```console
conda create -n cpu_env python=3.10
conda activate cpu_env
pip install jupyterlab
conda install ipykernel
conda install -c conda-forge ipywidgets
ipython kernel install --user --name cpu_env
jupyter-lab
```
The last command will launch JupyterLab, then we can open this notebook and select kernel cpu_env to run it.

In [1]:
# Please refer to https://pytorch.org to install CPU-only PyTorch
import sys
if sys.platform in ["darwin", "win32"]:  # Mac or Windows
    !{sys.executable} -m pip install torch -q
else:
    !{sys.executable} -m pip install install torch --index-url https://download.pytorch.org/whl/cpu -q

!{sys.executable} -m pip install onnxruntime transformers==4.18 onnx psutil pandas py-cpuinfo py3nvml netron coloredlogs --no-warn-script-location -q

In [2]:
import os

if sys.platform in ["win32"]:
    os.environ["HF_HUB_DISABLE_SYMLINKS_WARNING"] = "1"

# Create a cache directory to store pretrained model.
cache_dir = os.path.join(".", "cache_models")
if not os.path.exists(cache_dir):
    os.makedirs(cache_dir)

## Convert GPT2 model from PyTorch to ONNX ##

We have a script [convert_to_onnx.py](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/models/gpt2/convert_to_onnx.py) that could help you to convert GPT2 with past state to ONNX. 

The script accepts a pretrained model name or path of a checkpoint directory as input, and converts the model to ONNX. It also verifies that the ONNX model could generate same input as the pytorch model. The usage is like 
```
python -m onnxruntime.transformers.models.gpt2.convert_to_onnx -m model_name_or_path --output gpt2.onnx -o -p fp32
python -m onnxruntime.transformers.models.gpt2.convert_to_onnx -m model_name_or_path --output gpt2.onnx -o -p fp16 --auto_mixed_precision
```
The -p option can be used to choose the precision: fp32 (float32), fp16 (mixed precision) or int8 (quantization). The -o option will generate optimized model, which is required for fp16 or int8. Mixed precision model by --auto_mixed_precision is recommended for GPU inference. For CPU inference, fp32 model is recommended since int8 model might have large accuracy loss.

Here we use a pretrained model as example:

In [3]:
from onnxruntime.transformers.models.gpt2.gpt2_helper import Gpt2Helper, MyGPT2LMHeadModel
from transformers import AutoConfig
import torch

model_name_or_path = "gpt2"
config = AutoConfig.from_pretrained(model_name_or_path, cache_dir=cache_dir)
model = MyGPT2LMHeadModel.from_pretrained(model_name_or_path, config=config, cache_dir=cache_dir)
device = torch.device("cpu")
model.eval().to(device)

print(model.config)

num_attention_heads = model.config.n_head
hidden_size = model.config.n_embd
num_layer = model.config.n_layer

GPT2Config {
  "_name_or_path": "gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.18.0",
  "use_cache": true,
  "vocab_size": 50257
}



In [4]:
onnx_model_path = "gpt2.onnx"

In [5]:
!{sys.executable} -m onnxruntime.transformers.models.gpt2.convert_to_onnx -m $model_name_or_path --output $onnx_model_path -o -p fp32 -t 10 >export_output.txt 2>&1

Please pay attention to the optimized operators in the output. Counters of EmbedLayerNormalization, Attention, FastGelu and SkipLayerNormalization shall be positive for fully optimized GPT-2 model.

In [6]:
file = open("export_output.txt", "r")
for line in file.readlines():
    if "Optimized operators" in line:
        print(line)

Optimized operators:{'EmbedLayerNormalization': 1, 'Attention': 12, 'MultiHeadAttention': 0, 'Gelu': 0, 'FastGelu': 12, 'BiasGelu': 0, 'GemmFastGelu': 0, 'LayerNormalization': 0, 'SkipLayerNormalization': 24, 'QOrderedAttention': 0, 'QOrderedGelu': 0, 'QOrderedLayerNormalization': 0, 'QOrderedMatMul': 0}



## PyTorch Inference using Huggingface Transformers ##

In the following, we will use an example input to get the output from PyTorch for comparison purpose.
For the first inference, there is no any past state. We can prepare empty state for input.

In [7]:
from transformers import AutoTokenizer

EXAMPLE_Text = ["best hotel in bay area", "here is an example of gpt2 model"]


def get_tokenizer(model_name_or_path, cache_dir):
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, cache_dir=cache_dir)
    tokenizer.padding_side = "left"
    tokenizer.pad_token = tokenizer.eos_token
    return tokenizer


def get_example_inputs(prompt_text=EXAMPLE_Text):
    tokenizer = get_tokenizer(model_name_or_path, cache_dir)
    encodings_dict = tokenizer.batch_encode_plus(prompt_text, padding=True)

    input_ids = torch.tensor(encodings_dict["input_ids"], dtype=torch.int32)
    attention_mask = torch.tensor(encodings_dict["attention_mask"], dtype=torch.int32)
    position_ids = attention_mask.long().cumsum(-1) - 1
    position_ids.masked_fill_(position_ids < 0, 0)
    position_ids = position_ids.to(torch.int32)

    # Empty Past State for generating first word
    empty_past = []
    batch_size = input_ids.size(0)
    sequence_length = input_ids.size(1)
    past_shape = [2, batch_size, num_attention_heads, 0, hidden_size // num_attention_heads]
    for i in range(num_layer):
        empty_past.append(torch.empty(past_shape).type(torch.float32).to(device))

    return input_ids, attention_mask, position_ids, empty_past


from transformers import GPT2LMHeadModel

torch_model = GPT2LMHeadModel.from_pretrained(model_name_or_path, config=config, cache_dir=cache_dir)
device = torch.device("cpu")
torch_model.eval().to(device)

input_ids, attention_mask, position_ids, empty_past = get_example_inputs()
print("input_ids", input_ids)
print("attention_mask", attention_mask)
print("position_ids", position_ids)

input_ids tensor([[50256, 50256, 50256, 50256, 13466,  7541,   287, 15489,  1989],
        [ 1456,   318,   281,  1672,   286,   308,   457,    17,  2746]],
       dtype=torch.int32)
attention_mask tensor([[0, 0, 0, 0, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=torch.int32)
position_ids tensor([[0, 0, 0, 0, 0, 1, 2, 3, 4],
        [0, 1, 2, 3, 4, 5, 6, 7, 8]], dtype=torch.int32)


In [8]:
with torch.no_grad():
    torch_output = torch_model(
        input_ids, past_key_values=empty_past, attention_mask=attention_mask, position_ids=position_ids
    )

## ONNX Runtime Inference ##

We can use ONNX Runtime to inference. The inputs are dictionary with name and numpy array as value, and the output is list of numpy array. Note that both input and output are in CPU. When you run the inference in GPU, it will involve data copy between CPU and GPU for input and output.

Let's create an inference session for ONNX Runtime given the exported ONNX model, and see the output.

In [9]:
import onnxruntime
import numpy

input_ids, attention_mask, position_ids, empty_past = get_example_inputs()

session = onnxruntime.InferenceSession(onnx_model_path, providers=["CPUExecutionProvider"])
ort_inputs = {
    "input_ids": numpy.ascontiguousarray(input_ids.cpu().numpy()),
    "attention_mask": numpy.ascontiguousarray(attention_mask.cpu().numpy()),
    "position_ids": numpy.ascontiguousarray(position_ids.cpu().numpy()),
}
for i, past_i in enumerate(empty_past):
    ort_inputs[f"past_{i}"] = numpy.ascontiguousarray(past_i.cpu().numpy())
ort_outputs = session.run(None, ort_inputs)

We can compare the outputs from PyTorch and ONNX Runtime. Logits are very close.

In [10]:
logits_masked_diff = (torch_output[0] - ort_outputs[0]) * attention_mask.unsqueeze(2)
max_logits_diff = logits_masked_diff.abs().max()
print("max logits diff (ignored padding)", max_logits_diff)

max logits diff (ignored padding) tensor(7.6294e-05)


## ONNX Runtime Inference with IO Binding ##

To avoid data copy for input and output, ONNX Runtime also supports IO Binding. User could provide some buffer for input and outputs. For GPU inference, the buffer can be in GPU to reduce memory copy between CPU and GPU. This is helpful for high performance inference in GPU. For GPT-2, IO Binding might help the performance when batch size or (past) sequence length is large.

In [11]:
from typing import List, Dict
from onnxruntime import InferenceSession

from onnxruntime.transformers.io_binding_helper import TypeHelper
from onnxruntime.transformers.io_binding_helper import IOBindingHelper


def inference_with_io_binding(session, config, input_ids, position_ids, attention_mask, past):
    output_shapes = Gpt2Helper.get_output_shapes(
        batch_size=input_ids.size(0),
        past_sequence_length=past[0].size(3),
        sequence_length=input_ids.size(1),
        config=config,
    )
    output_buffers = Gpt2Helper.get_output_buffers(output_shapes, device)

    io_binding = IOBindingHelper.prepare_io_binding(
        session, input_ids, position_ids, attention_mask, past, output_buffers, output_shapes
    )
    session.run_with_iobinding(io_binding)

    outputs = Gpt2Helper.get_outputs_from_io_binding_buffer(session, output_buffers, output_shapes, return_numpy=False)
    return outputs

We can see that the result is exactly same with/without IO Binding:

In [12]:
input_ids, attention_mask, position_ids, empty_past = get_example_inputs()
outputs = inference_with_io_binding(session, config, input_ids, position_ids, attention_mask, empty_past)
for i in range(len(outputs)):
    assert torch.eq(outputs[i], torch.from_numpy(ort_outputs[i])).all()
print("IO Binding result is good")

IO Binding result is good


## Batch Text Generation ##

Here is an example for text generation using ONNX Runtime or PyTorch. For ONNX Runtime, IO Binding is used for better performance.

In [13]:
def test_generation(tokenizer, input_text, ort_session=None, num_tokens_to_produce=30):
    assert len(input_text) == 1  # This function requires batch_size==1
    use_onnxruntime = ort_session is not None
    print("Text generation using", "OnnxRuntime" if use_onnxruntime else "PyTorch", "...")
    eos_token_id = tokenizer.eos_token_id

    input_ids, attention_mask, position_ids, past = get_example_inputs(input_text)
    batch_size = input_ids.size(0)

    has_eos = torch.zeros(batch_size, dtype=torch.bool)

    all_token_ids = input_ids.clone()

    for step in range(num_tokens_to_produce):
        if ort_session is not None:
            outputs = inference_with_io_binding(ort_session, config, input_ids, position_ids, attention_mask, past)
        else:
            outputs = torch_model(
                input_ids, attention_mask=attention_mask, position_ids=position_ids, past_key_values=past
            )

        next_token_logits = outputs[0][:, -1, :]
        # Greedy approach is used here. You can easily extend it to use beam search and sampling to pick next tokens.
        next_tokens = torch.argmax(next_token_logits, dim=-1)

        has_eos = has_eos | (next_tokens == eos_token_id)
        tokens_to_add = next_tokens.masked_fill(has_eos, eos_token_id)
        all_token_ids = torch.cat([all_token_ids, tokens_to_add.unsqueeze(-1)], dim=-1)

        # Update input_ids, attention_mask, position_ids and past
        input_ids = tokens_to_add.clone().detach().reshape([batch_size, 1]).to(device)
        position_ids = (position_ids[:, -1] + 1).reshape(batch_size, 1)
        attention_mask = torch.cat([attention_mask, torch.ones([batch_size, 1]).type_as(attention_mask)], 1).to(device)

        past = []
        if not use_onnxruntime:
            past = list(outputs[1])  # past in torch output is tuple
        else:
            for i in range(num_layer):
                past_i = (
                    torch.from_numpy(outputs[i + 1])
                    if isinstance(outputs[i + 1], numpy.ndarray)
                    else outputs[i + 1].clone().detach()
                )
                past.append(past_i.to(device))

        if torch.all(has_eos):
            break

    for i, output in enumerate(all_token_ids):
        print("------------")
        print(tokenizer.decode(output, skip_special_tokens=True))

In [14]:
tokenizer = get_tokenizer(model_name_or_path, cache_dir)
input_text = EXAMPLE_Text[:1]
test_generation(tokenizer, input_text, ort_session=session)

Text generation using OnnxRuntime ...
------------
best hotel in bay area.

The hotel is located in the historic Bayview neighborhood of San Francisco.

The hotel is open daily from 9 a.m.


Next, we use PyTorch to run again and we can see that the result is exactly same.

In [15]:
test_generation(tokenizer, input_text)

Text generation using PyTorch ...
------------
best hotel in bay area.

The hotel is located in the historic Bayview neighborhood of San Francisco.

The hotel is open daily from 9 a.m.


## Benchmark ##
There is a tool benchmark_gpt2.py, which can be used to measure the performance of GPT-2 by PyTorch, ONNX Runtime without/with IO Binding.

In [16]:
!{sys.executable} -m onnxruntime.transformers.models.gpt2.benchmark_gpt2 -m gpt2 -o -b 1 ----sequence_lengths 1 --past_sequence_lengths 128 >benchmark_output.txt 2>&1

In [17]:
file = open("benchmark_output.txt", "r")
for line in file.readlines():
    if "onnxruntime_latency" in line:
        print(line)

batch_size=1, sequence_length=1, past_sequence_length=8, onnxruntime_latency=21.33  

batch_size=1, sequence_length=1, past_sequence_length=16, onnxruntime_latency=22.88  

batch_size=1, sequence_length=1, past_sequence_length=32, onnxruntime_latency=22.81  

batch_size=1, sequence_length=1, past_sequence_length=64, onnxruntime_latency=24.01  

batch_size=1, sequence_length=1, past_sequence_length=128, onnxruntime_latency=22.87  

batch_size=1, sequence_length=1, past_sequence_length=256, onnxruntime_latency=25.30  



## ONNX Model for Generation ##
There is a tool [convert_generation.py](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/convert_generation.py), which can convert GPT-2 model to enable Beam Search, Greedy Search or Sampling in one ONNX model.

In [18]:
!{sys.executable} -m onnxruntime.transformers.convert_generation -m gpt2 --output gpt2_beam_search.onnx >convert_generation_output.txt 2>&1

## Quantization ##
For large language model, we can quantize the model to int8 to run efficiently in CPU. See examples of [GPT quantization](https://github.com/microsoft/onnxruntime-inference-examples/tree/main/quantization/language_model/gpt2) and [LLaMA quantization](https://github.com/microsoft/onnxruntime-inference-examples/tree/main/quantization/language_model/llama)

### Test Environment ###
The following is the hardware of the test machine, and software version:

In [19]:
!{sys.executable} -m onnxruntime.transformers.machine_info --silent

{
  "gpu": {
    "driver_version": "472.88",
    "devices": [
      {
        "memory_total": 12884901888,
        "memory_available": 12732858368,
        "name": "NVIDIA GeForce RTX 3060"
      }
    ]
  },
  "cpu": {
    "brand": "Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz",
    "cores": 6,
    "logical_cores": 12,
    "hz": "3192000000,0",
    "l2_cache": 1572864,
    "flags": "3dnow,3dnowprefetch,abm,acpi,adx,aes,apic,avx,avx2,bmi1,bmi2,clflush,clflushopt,cmov,cx16,cx8,de,dtes64,dts,erms,est,f16c,fma,fpu,fxsr,hle,ht,hypervisor,ia64,invpcid,lahf_lm,mca,mce,mmx,monitor,movbe,mpx,msr,mtrr,osxsave,pae,pat,pbe,pcid,pclmulqdq,pdcm,pge,pni,popcnt,pse,pse36,rdrnd,rdseed,rtm,sep,serial,sgx,sgx_lc,smap,smep,ss,sse,sse2,sse4_1,sse4_2,ssse3,tm,tm2,tsc,tscdeadline,vme,x2apic,xsave,xtpr",
    "processor": "Intel64 Family 6 Model 158 Stepping 10, GenuineIntel"
  },
  "memory": {
    "total": 16977195008,
    "available": 9550102528
  },
  "os": "Windows-10-10.0.22621-SP0",
  "python": "3.10.12.fina

  import pkg_resources
