# Exporting a 🤗 Transformers model to ONNX


Under the hood, the process is roughly the following:

- Instantiate the model from transformers (PyTorch or TensorFlow)
- Forward dummy inputs through the model so that the exporter can record the set of operations executed
- Optionally define dynamic axes on input and output tensors
- Save the graph along with the network parameters
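
The steps above rely on tracing: the model is run once on dummy inputs and every operation it performs is recorded. The following is a toy sketch of that idea in plain Python, not the actual exporter machinery. The `Traced` class, `toy_model`, and `trace` are hypothetical names used only for illustration.

```python
class Traced:
    """A symbolic value that records every arithmetic operation applied to it."""

    def __init__(self, name, graph):
        self.name = name
        self.graph = graph  # shared list of (op, input_names, output_name)

    def _record(self, op, other):
        other_name = other.name if isinstance(other, Traced) else repr(other)
        out = Traced(f"t{len(self.graph)}", self.graph)
        self.graph.append((op, (self.name, other_name), out.name))
        return out

    def __add__(self, other):
        return self._record("Add", other)

    def __mul__(self, other):
        return self._record("Mul", other)


def toy_model(x, w, b):
    # Stand-in for a model's forward pass: a single affine transform.
    return x * w + b


def trace(fn, input_names):
    """Run fn once on dummy (symbolic) inputs and return the recorded graph."""
    graph = []
    dummies = [Traced(name, graph) for name in input_names]
    output = fn(*dummies)
    return graph, output.name


graph, output_name = trace(toy_model, ["x", "w", "b"])
for op, inputs, out in graph:
    print(f"{out} = {op}({', '.join(inputs)})")
```

Because only the operations that actually run are recorded, data-dependent Python control flow is frozen to whichever branch the dummy inputs happened to take.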

## Default compiling (not task specific)

[ONNX x Pytorch Opset Version Table](https://github.com/onnx/onnx/blob/master/docs/Versioning.md#released-versions)

In [2]:
!rm -rf models/
from pathlib import Path
from transformers.convert_graph_to_onnx import convert

# Handles all the above steps for you
convert(framework="pt", # The framework the pipeline is backed by ("pt" or "tf")
        model="bert-base-cased-finetuned-mrpc", #  The name of the model to load for the pipeline
        output=Path("models/bert.onnx"), # The path where the ONNX graph will be stored
        opset=11 # version of the ONNX operator set to use 
       )

ONNX opset version set to: 13
Loading pipeline (model: bert-base-cased-finetuned-mrpc, tokenizer: bert-base-cased-finetuned-mrpc)




Creating folder models
Using framework PyTorch: 1.8.1+cu102
Found input input_ids with shape: {0: 'batch', 1: 'sequence'}
Found input token_type_ids with shape: {0: 'batch', 1: 'sequence'}
Found input attention_mask with shape: {0: 'batch', 1: 'sequence'}
Found output output_0 with shape: {0: 'batch', 1: 'sequence'}
Found output output_1 with shape: {0: 'batch'}
Ensuring inputs are in correct order
position_ids is not present in the generated input list.
Generated inputs order: ['input_ids', 'attention_mask', 'token_type_ids']
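
The "Found input ... with shape" lines in the log describe the dynamic axes: a mapping from tensor name to `{axis_index: symbolic_name}`, which is the dictionary shape that `torch.onnx.export` accepts for its `dynamic_axes` argument. As a minimal sketch, the mapping from this log could be rebuilt as follows; `build_dynamic_axes` is a hypothetical helper, and the axis counts are read off the log above.

```python
def build_dynamic_axes(n_dynamic, axis_names=("batch", "sequence")):
    """Map each tensor name to {axis_index: symbolic_name} for its leading axes."""
    return {
        name: {axis: axis_names[axis] for axis in range(count)}
        for name, count in n_dynamic.items()
    }


# Number of leading dynamic axes per tensor, as reported in the log above.
n_dynamic = {
    "input_ids": 2,
    "token_type_ids": 2,
    "attention_mask": 2,
    "output_0": 2,
    "output_1": 1,
}

dynamic_axes = build_dynamic_axes(n_dynamic)
for name, axes in dynamic_axes.items():
    print(name, axes)
```

Axes that are not listed stay fixed to the size seen during tracing, which is why the batch and sequence dimensions are the ones named here.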




## Mixed Precision

In [None]:
# or you are working with TensorFlow (tf.keras) models or PyTorch models other than BERT

# !pip install onnxruntime-tools
from onnxruntime_tools import optimizer

# Mixed precision conversion for bert-base-cased model converted from PyTorch
optimized_model = optimizer.optimize_model("bert-base-cased.onnx", model_type='bert', num_heads=12, hidden_size=768)
optimized_model.convert_model_float32_to_float16()
optimized_model.save_model_to_file("bert-base-cased.onnx")
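
Converting weights from float32 to float16 halves their storage but reduces both precision and range. As a rough illustration that is independent of ONNX Runtime, the sketch below uses Python's `struct` module (format code `e` is IEEE 754 half precision) to round-trip a few values. The helper name `to_float16` is hypothetical.

```python
import struct


def to_float16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


print("bytes per float32:", struct.calcsize("<f"))
print("bytes per float16:", struct.calcsize("<e"))

for value in (0.1, 1.0, 1000.1):
    rounded = to_float16(value)
    print(f"{value!r} -> {rounded!r} (abs error {abs(value - rounded):.2e})")

# Half precision tops out at 65504; larger magnitudes cannot be packed.
try:
    to_float16(70000.0)
    overflowed = False
except OverflowError:
    overflowed = True
print("70000.0 overflows float16:", overflowed)
```

This is why it is worth comparing the outputs of the converted model against the float32 model on a few representative inputs before relying on it.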

# Inference Session: optimize compiled model

Inference is done using a specific backend definition, which turns on hardware-specific optimizations of the graph.

Optimizations are basically of three kinds:

- Constant Folding: Convert static variables to constants in the graph
- Dead Code Elimination: Remove nodes never accessed in the graph
- Operator Fusing: Merge multiple instructions into one (Linear -> ReLU can be fused to be LinearReLU)
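
To make the three kinds of optimization concrete, here is a toy sketch on a hand-written graph in plain Python. It is not how ONNX Runtime implements these passes. The node format `(output_name, op, input_names)` and all function names are assumptions made for this example.

```python
def constant_folding(nodes, constants):
    """Evaluate Add/Mul nodes whose inputs are all known constants."""
    ops = {"Add": lambda a, b: a + b, "Mul": lambda a, b: a * b}
    constants = dict(constants)
    kept = []
    for out, op, inputs in nodes:
        if op in ops and all(name in constants for name in inputs):
            constants[out] = ops[op](*(constants[name] for name in inputs))
        else:
            kept.append((out, op, inputs))
    return kept, constants


def dead_code_elimination(nodes, graph_outputs):
    """Drop nodes whose result never reaches a graph output."""
    needed = set(graph_outputs)
    kept = []
    for out, op, inputs in reversed(nodes):  # walk from outputs back to inputs
        if out in needed:
            needed.update(inputs)
            kept.append((out, op, inputs))
    return kept[::-1]


def fuse_linear_relu(nodes):
    """Merge a Linear node directly followed by its only consumer, a ReLU."""
    fused, skip = [], set()
    for idx, (out, op, inputs) in enumerate(nodes):
        if idx in skip:
            continue
        nxt = nodes[idx + 1] if idx + 1 < len(nodes) else None
        consumers = sum(out in node_inputs for _, _, node_inputs in nodes)
        if op == "Linear" and nxt and nxt[1] == "ReLU" and nxt[2] == (out,) and consumers == 1:
            fused.append((nxt[0], "LinearReLU", inputs))
            skip.add(idx + 1)
        else:
            fused.append((out, op, inputs))
    return fused


constants = {"two": 2.0, "three": 3.0}
nodes = [
    ("w", "Mul", ("two", "three")),   # both inputs constant: foldable
    ("unused", "Add", ("x", "two")),  # never reaches the output: dead code
    ("h", "Linear", ("x", "w")),
    ("y", "ReLU", ("h",)),            # Linear -> ReLU: fusable
]

nodes, constants = constant_folding(nodes, constants)
nodes = dead_code_elimination(nodes, graph_outputs=["y"])
nodes = fuse_linear_relu(nodes)
print(nodes)
print(constants["w"])
```

Four nodes collapse into a single `LinearReLU` node whose weight is a precomputed constant, which is the same kind of reduction the real passes aim for on much larger graphs.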

ONNX Runtime automatically applies most optimizations when specific SessionOptions are set.

_Note: Some of the latest optimizations that are not yet integrated into ONNX Runtime are available in an optimization script that tunes models for the best performance._

In [1]:
from onnxruntime.transformers import optimizer
from onnxruntime.transformers.onnx_model_bert import BertOptimizationOptions

# disable embedding layer norm optimization for better model size reduction
opt_options = BertOptimizationOptions('bert')
opt_options.enable_embed_layer_norm = False

# optimize compiled model
opt_model = optimizer.optimize_model(
    'models/bert.onnx',
    'bert',
    num_heads=12,
    hidden_size=768,
    optimization_options=opt_options)
opt_model.save_model_to_file('bert.opt.onnx')

In [6]:
from os import environ
from psutil import cpu_count

# Constants from the performance optimization available in onnxruntime
# It needs to be done before importing onnxruntime
environ["OMP_NUM_THREADS"] = str(cpu_count(logical=True))
environ["OMP_WAIT_POLICY"] = 'ACTIVE'

from onnxruntime import GraphOptimizationLevel, InferenceSession, SessionOptions, get_all_providers

In [19]:
from contextlib import contextmanager
from time import time

def create_model_for_provider(model_path: str, provider: str) -> InferenceSession: 
    assert provider in get_all_providers(), f"provider {provider} not found, {get_all_providers()}"

    # A few properties that might have an impact on performance (provided by MS)
    options = SessionOptions()
    options.intra_op_num_threads = 1
    options.graph_optimization_level = GraphOptimizationLevel.ORT_ENABLE_ALL

    # Load the model as a graph and prepare the CPU backend 
    session = InferenceSession(model_path, options, providers=[provider])
    session.disable_fallback()

    return session


@contextmanager
def track_infer_time(buffer: list):
    # Append the elapsed wall-clock time (in seconds) of the wrapped block to buffer
    start = time()
    yield
    end = time()

    buffer.append(end - start)
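
As a usage sketch, the timing helper can wrap each inference call and the collected durations can then be summarized. The block below repeats the helper so it runs on its own. `fake_inference` is a hypothetical stand-in for a real `session.run(...)` call.

```python
from contextlib import contextmanager
from statistics import mean
from time import sleep, time


@contextmanager
def track_infer_time(buffer: list):
    # Same helper as above, repeated so this sketch is self-contained.
    start = time()
    yield
    end = time()
    buffer.append(end - start)


def fake_inference():
    # Hypothetical stand-in for session.run(...); sleeps for about 1 ms.
    sleep(0.001)


timings = []
for _ in range(20):
    with track_infer_time(timings):
        fake_inference()

print(f"runs: {len(timings)}")
print(f"mean latency: {mean(timings) * 1e3:.2f} ms")
print(f"max latency:  {max(timings) * 1e3:.2f} ms")
```

Running a few warm-up iterations before collecting timings usually gives more representative numbers, since the first calls often include one-off setup costs.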

## Using our optimized ONNX model running on CPU

When the model is loaded for inference over a specific provider, for instance CPUExecutionProvider as above, an optimized graph can be saved. This graph might include various optimizations, and you might be able to see some higher-level operations in the graph (through Netron for instance) such as:

- EmbedLayerNormalization
- Attention
- FastGeLU

These operations are examples of the kind of optimization onnxruntime performs, in this case gathering multiple operations into a bigger one (Operator Fusing).

In [20]:
from transformers import AutoTokenizer
import numpy as np

buffer=[]

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
cpu_model = create_model_for_provider("models/bert.onnx", "CPUExecutionProvider")

In [26]:
import torch
from torch.nn import functional as F

text = "Paris is the " + tokenizer.mask_token + " of France."
input = tokenizer.encode_plus(text, return_tensors="pt")
# ONNX Runtime expects NumPy arrays, so convert each PyTorch tensor
inputs_onnx = {k: v.cpu().detach().numpy() for k, v in input.items()}

# Position of the mask token in the sequence
mask_index = torch.where(input["input_ids"][0] == tokenizer.mask_token_id)[0]

# Run the model (None = get all the outputs)
output = cpu_model.run(None, inputs_onnx)

# Note: the graph exported above appears to have no masked-LM head, so
# output[0] holds hidden states of size 768 rather than vocabulary logits;
# the tokens printed below are therefore not meaningful predictions.
logits = torch.from_numpy(output[0])
softmax = F.softmax(logits, dim=-1)
mask_word = softmax[0, mask_index, :]
top_10 = torch.topk(mask_word, 10, dim=1)[1][0]
for token in top_10:
    word = tokenizer.decode([token])
    new_sentence = text.replace(tokenizer.mask_token, word)
    print(new_sentence)

Paris is the Ł of France.
Paris is the य of France.
Paris is the Я of France.
Paris is the ད of France.
Paris is the ď of France.
Paris is the Ю of France.
Paris is the [unused54] of France.
Paris is the ᵏ of France.
Paris is the ئ of France.
Paris is the ¨ of France.


In [16]:
# numpy version

input = tokenizer(text, return_tensors="np")
# The tokenizer output behaves like a mapping; turn it into a plain dict of NumPy arrays
inputs_onnx = dict(input)

output = cpu_model.run(None, inputs_onnx)

token_logits = output[0]

mask_token_index = np.where(input["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits_onnx1 = token_logits[0, mask_token_index, :]

score = np.exp(mask_token_logits_onnx1) / np.exp(mask_token_logits_onnx1).sum(-1, keepdims=True)

top_5_idx = (-score[0]).argsort()[:5]
top_5_values = score[0][top_5_idx]

result = []

for token, s in zip(top_5_idx.tolist(), top_5_values.tolist()):
    result.append(f"{text.replace(tokenizer.mask_token, tokenizer.decode([token]))} (score: {s})")
result

NameError: name 'cpu_model' is not defined
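
The cell above computes a softmax over the scores at the mask position and keeps the five highest-scoring entries. A dependency-free sketch of the same two steps follows. It subtracts the maximum before exponentiating, a common stabilization that the NumPy cell does not apply. `softmax` and `top_k` are illustrative helpers, and the logits are made up.

```python
import math


def softmax(logits):
    """Numerically stable softmax: shift by the max before exponentiating."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(scores, k):
    """Return (index, score) pairs for the k highest scores, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:k]]


# Made-up logits; values this large would overflow a naive math.exp(x).
logits = [1000.0, 1001.0, 999.0, 1002.0]
probs = softmax(logits)

for index, score in top_k(probs, 2):
    print(f"index {index}: {score:.4f}")
```

The shift leaves the result unchanged mathematically, because the common factor cancels between numerator and denominator.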

# Wrapper for Pipeline

Overwrite `model.forward()` with an ONNX-specific implementation.

In [156]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import FillMaskPipeline, Pipeline, TextClassificationPipeline
from transformers.tokenization_utils import TruncationStrategy
import numpy as np
import torch
from transformers import AutoConfig, AutoModelForMaskedLM
# Download configuration from huggingface.co and cache.
config = AutoConfig.from_pretrained('bert-base-cased-finetuned-mrpc')
model = AutoModelForMaskedLM.from_config(config)

class OnnxTextClassificationPipeline(FillMaskPipeline):
    # TODO: create a batched version of text-classification
    # Can we override a nested function?
    def __init__(self, **kwargs):
        super().__init__( **kwargs)

#     def _parse_and_tokenize(
#       self, inputs, padding=True, add_special_tokens=True, truncation=TruncationStrategy.DO_NOT_TRUNCATE, **kwargs
#     ):
#         """
#         Parse arguments and tokenize
#         """
#         # Parse arguments
#         processed_inputs = []

#         inputs = inputs if isinstance(inputs,list) else [inputs]
#         for input in inputs:
#         tok = self.tokenizer(
#             input,
#             add_special_tokens=add_special_tokens,
#             padding=padding,
#             truncation=truncation,
#         )
#         processed_inputs.append(tok)

#         return processed_inputs

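    def _forward(self, inputs, return_tensors=False):
        # ONNX Runtime expects NumPy arrays; this assumes the pipeline passes PyTorch tensors
        inputs_onnx = {k: v.cpu().detach().numpy() for k, v in inputs.items()}
        outputs = cpu_model.run(None, inputs_onnx)

        # normalize_onnx_outputs and output_names must be defined before this method is called
        real_output = normalize_onnx_outputs(outputs, output_names)
        print("Successful inference on ONNX")
        return real_output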

In [157]:
x = OnnxTextClassificationPipeline(model=model,tokenizer=tokenizer)

In [158]:
text = "Paris is the " + tokenizer.mask_token + " of France."
x(text)

Successful inference on ONNX


AttributeError: 'dict' object has no attribute 'size'

In [160]:

input_names = [input_.name for input_ in cpu_model.get_inputs()]
output_names = [output_.name for output_ in cpu_model.get_outputs()]

In [161]:
text = "Paris is the " + tokenizer.mask_token + " of France."
input = tokenizer(text, return_tensors="np")
# The tokenizer output behaves like a mapping; turn it into a plain dict of NumPy arrays
inputs_onnx = dict(input)

In [163]:
res = onnx_forward(inputs_onnx)

Successful inference on ONNX


In [93]:
sequence = ["The company HuggingFace is based in New York City",
           # "Apples are especially bad for your health"
           "HuggingFace's headquarters are situated in Manhattan"]

In [94]:
# numpy version


output = cpu_model.run(None, inputs_onnx)

token_logits = output[0]

mask_token_index = np.where(input["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits_onnx1 = token_logits[0, mask_token_index, :]

score = np.exp(mask_token_logits_onnx1) / np.exp(mask_token_logits_onnx1).sum(-1, keepdims=True)

top_5_idx = (-score[0]).argsort()[:5]
top_5_values = score[0][top_5_idx]

result = []

for token, s in zip(top_5_idx.tolist(), top_5_values.tolist()):
    result.append(f"{text.replace(tokenizer.mask_token, tokenizer.decode([token]))} (score: {s})")
result

['Paris is the Ł of France. (score: 0.004327240400016308)',
 'Paris is the य of France. (score: 0.0032415653113275766)',
 'Paris is the Я of France. (score: 0.003234578762203455)',
 'Paris is the ད of France. (score: 0.0031752123031765223)',
 'Paris is the ď of France. (score: 0.003119068220257759)']

In [107]:
cpu_model.run(None, inputs_onnx)[0].shape

(1, 9, 768)

In [15]:
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline
import numpy as np
# from inference_api_wrapper import load_onnx_model_on_pipeline
import inference_api_wrapper

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased-finetuned-mrpc")

pipe = pipeline("fill-mask", model=model, tokenizer=tokenizer)

inference_api_wrapper.load_onnx_model_on_pipeline(pipe, "models/bert.onnx")

Some weights of the model checkpoint at bert-base-cased-finetuned-mrpc were not used when initializing BertForMaskedLM: ['classifier.weight', 'classifier.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForMaskedLM were not initialized from the model checkpoint at bert-base-cased-finetuned-mrpc and are newly initialized: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight']
You

NameError: name 'SessionOptions' is not defined

In [13]:
text = "Paris is the " + tokenizer.mask_token + " of France."

pipe(text)

Error running onnx_forward: __init__() keywords must be strings


[{'sequence': 'Paris is thejid of France.',
  'score': 0.004590186290442944,
  'token': 26151,
  'token_str': '##jid'},
 {'sequence': 'Paris is theregation of France.',
  'score': 0.003801520448178053,
  'token': 22998,
  'token_str': '##regation'},
 {'sequence': 'Paris is thecourse of France.',
  'score': 0.0032074858900159597,
  'token': 16461,
  'token_str': '##course'},
 {'sequence': 'Paris is themark of France.',
  'score': 0.0030862404964864254,
  'token': 8519,
  'token_str': '##mark'},
 {'sequence': 'Paris is theina of France.',
  'score': 0.0027963079046458006,
  'token': 2983,
  'token_str': '##ina'}]