In [None]:
!pip install -qU optimum[exporters] transformers[onnx]

# Export to ONNX

Deploying models in production environments often requires, or at least benefits from, exporting them to a serialized format that can be loaded and executed on specialized runtimes and hardware.

Hugging Face Optimum is an extension of Transformers that enables exporting models from PyTorch or TensorFlow to serialized formats such as ONNX and TFLite through its `exporters` module.

**ONNX (Open Neural Network eXchange)** is an open standard that defines a common set of operators and a common file format to represent deep learning models in a wide variety of frameworks. When a model is exported to the ONNX format, these operators are used to construct a computational graph which represents the flow of data through the neural network.

By exposing a graph with standardized operators and data types, ONNX makes it easy to switch between frameworks.

Once exported to ONNX format, a model can be:
* optimized for inference via techniques such as graph optimization and quantization
* run with ONNX Runtime through the `ORTModelForXXX` classes, which follow the same `AutoModel` API
* run with optimized inference pipelines, which have the same API as the `pipeline()` function in the Transformers library

## Exporting a Transformers model to ONNX with CLI

We can list all available arguments from the command line:
```bash
optimum-cli export onnx --help
```
To export a model checkpoint from the Hub, for example `distilbert/distilbert-base-uncased-distilled-squad`:
```bash
optimum-cli export onnx --model distilbert/distilbert-base-uncased-distilled-squad distilbert_base_uncased_squad_onnx/
```

When exporting a local model, we first need to make sure that the model's weights and tokenizer files are saved in the same directory (`local_path`). With the CLI, we pass `local_path` to the `--model` argument and specify the task with `--task`. If `--task` is not provided, the export defaults to the model architecture without any task-specific head.
```bash
optimum-cli export onnx --model local_path --task question-answering distilbert_base_uncased_squad_onnx/
```
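
Before running the export on a local directory, it can help to check that the expected files are actually there. The helper below is a minimal sketch using only the standard library. The function name `missing_for_export` is hypothetical, and the file names it looks for are assumptions based on what `save_pretrained` typically writes; adjust them for your model and tokenizer.

```python
from pathlib import Path
import tempfile

# Assumed file names (typical `save_pretrained` outputs); not an exhaustive list.
WEIGHT_FILES = ("model.safetensors", "pytorch_model.bin")
TOKENIZER_FILES = ("tokenizer.json", "vocab.txt")


def missing_for_export(local_path):
    """Return a list describing what appears to be missing from `local_path`."""
    root = Path(local_path)
    missing = []
    if not (root / "config.json").is_file():
        missing.append("config.json")
    if not any((root / name).is_file() for name in WEIGHT_FILES):
        missing.append("weights (" + " or ".join(WEIGHT_FILES) + ")")
    if not any((root / name).is_file() for name in TOKENIZER_FILES):
        missing.append("tokenizer (" + " or ".join(TOKENIZER_FILES) + ")")
    return missing


# Demonstration on a temporary directory that lacks tokenizer files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.json").write_text("{}")
    (Path(tmp) / "model.safetensors").write_bytes(b"")
    report = missing_for_export(tmp)

print(report)  # ['tokenizer (tokenizer.json or vocab.txt)']
```

An empty list from `missing_for_export("local_path")` suggests the directory is ready to be passed to `--model`.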

The resulting `model.onnx` file can then be run on one of the accelerators that support the ONNX standard. For example, we can load and run the model with ONNX Runtime:

In [None]:
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("distilbert_base_uncased_squad_onnx")
model = ORTModelForQuestionAnswering.from_pretrained("distilbert_base_uncased_squad_onnx")

In [None]:
inputs = tokenizer("What am I using?", "Using DistilBERT with ONNX Runtime!", return_tensors="pt")
outputs = model(**inputs)

## Exporting a Transformers model to ONNX with `optimum.onnxruntime`

In [None]:
from optimum.onnxruntime import ORTModelForQuestionAnswering
from transformers import AutoTokenizer

model_checkpoint = 'distilbert/distilbert-base-uncased-distilled-squad'
save_directory = 'onnx/'

# load a question-answering model from transformers and export it to ONNX
# (the ORT class should match the task the checkpoint was fine-tuned for)
ort_model = ORTModelForQuestionAnswering.from_pretrained(model_checkpoint, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# save the onnx model and tokenizer
ort_model.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)

## Exporting a model with `transformers.onnx`

Use `transformers.onnx` as a Python module to export a checkpoint using a ready-made configuration:
```bash
python -m transformers.onnx --model=distilbert/distilbert-base-uncased onnx/
```

This exports an ONNX graph of the checkpoint defined by the `--model` argument.

The resulting `model.onnx` can be loaded again:

In [None]:
from transformers import AutoTokenizer
from onnxruntime import InferenceSession

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
session = InferenceSession("onnx/model.onnx")

In [None]:
# ONNX Runtime expects NumPy arrays as input
inputs = tokenizer("Using DistilBERT with ONNX Runtime!", return_tensors="np")
outputs = session.run(output_names=["last_hidden_state"], input_feed=dict(inputs))

The required output names (`["last_hidden_state"]`) can be obtained by taking a look at the ONNX configuration of each model. For example, for DistilBERT:

In [None]:
from transformers.models.distilbert import DistilBertConfig, DistilBertOnnxConfig

config = DistilBertConfig()
onnx_config = DistilBertOnnxConfig(config)
print(list(onnx_config.outputs.keys()))

# Export to TorchScript

According to the TorchScript documentation, TorchScript is a way to create serializable and optimizable models from PyTorch code.

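PyTorch's JIT can convert a model to TorchScript in two ways: tracing (`torch.jit.trace`), which records the operations executed on an example input, and scripting (`torch.jit.script`), which compiles the model's source code. The examples below use tracing.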

## TorchScript flag and tied weights

Most Transformers language models tie the weights of their `Embedding` layer and their `Decoding` layer. TorchScript does not allow us to export models that have tied weights, so the weights must be untied and cloned beforehand. Models instantiated with the `torchscript` flag have their `Embedding` layer and `Decoding` layer separated, which means that they should not be trained down the line: training would update the two layers independently and desynchronize them.
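
The difference between tied and cloned weights can be seen in plain PyTorch. The sketch below uses a toy embedding and output projection as stand-ins for a model's `Embedding` and `Decoding` layers; the sizes and variable names are illustrative assumptions, and this is not how Transformers implements the `torchscript` flag internally.

```python
import torch
from torch import nn

vocab_size, hidden_size = 10, 4
embedding = nn.Embedding(vocab_size, hidden_size)          # weight: (10, 4)
decoder = nn.Linear(hidden_size, vocab_size, bias=False)   # weight: (10, 4)

# Tied: both layers share one Parameter, so they share the same memory.
decoder.weight = embedding.weight
tied = decoder.weight.data_ptr() == embedding.weight.data_ptr()

# Untied: the decoder receives its own copy of the values.
decoder.weight = nn.Parameter(embedding.weight.clone())
untied = decoder.weight.data_ptr() != embedding.weight.data_ptr()
same_values = torch.equal(decoder.weight, embedding.weight)

# Updating one layer no longer affects the other, so they drift apart.
with torch.no_grad():
    embedding.weight.add_(1.0)
diverged = not torch.equal(decoder.weight, embedding.weight)

print(tied, untied, same_values, diverged)  # True True True True
```

The last step mimics what a training update would do to cloned weights, which is why models created with `torchscript=True` are best kept for inference.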

## Using TorchScript in Python

### Saving a model

To export a `BertModel` with TorchScript, instantiate `BertModel` from the `BertConfig` class, trace it with a dummy input, and then save the traced model to disk under the filename `traced_bert.pt`:

In [1]:
from transformers import BertModel, BertTokenizer, BertConfig
import torch

enc = BertTokenizer.from_pretrained('google-bert/bert-base-uncased')

In [2]:
# tokenize input text
text = "[CLS] Who was Jim Henson? [SEP] Jim Henson was a puppeteer [SEP]"
tokenized_text = enc.tokenize(text)
tokenized_text

['[CLS]',
 'who',
 'was',
 'jim',
 'henson',
 '?',
 '[SEP]',
 'jim',
 'henson',
 'was',
 'a',
 'puppet',
 '##eer',
 '[SEP]']

In [3]:
# mask one of the input tokens
masked_index = 8
tokenized_text[masked_index] = "[MASK]"
indexed_tokens = enc.convert_tokens_to_ids(tokenized_text)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
indexed_tokens

[101,
 2040,
 2001,
 3958,
 27227,
 1029,
 102,
 3958,
 103,
 2001,
 1037,
 13997,
 11510,
 102]
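
The `segments_ids` list above is written by hand. As a minimal sketch, the same list can be derived from the position of the first `[SEP]` token. The helper name `segment_ids_from_tokens` is hypothetical, and it assumes a BERT-style input with at most two segments.

```python
def segment_ids_from_tokens(tokens, sep_token="[SEP]"):
    """Assign segment 0 up to and including the first separator, then segment 1."""
    ids, current = [], 0
    for token in tokens:
        ids.append(current)
        if token == sep_token:
            # BERT only has two segment types, so never go beyond 1.
            current = min(current + 1, 1)
    return ids


tokens = ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]',
          'jim', '[MASK]', 'was', 'a', 'puppet', '##eer', '[SEP]']
derived_segments = segment_ids_from_tokens(tokens)
print(derived_segments)  # [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
```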

In [4]:
# create a dummy input
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
dummy_input = (tokens_tensor, segments_tensors)
dummy_input

(tensor([[  101,  2040,  2001,  3958, 27227,  1029,   102,  3958,   103,  2001,
           1037, 13997, 11510,   102]]),
 tensor([[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]]))

In [5]:
# initialize the model with the torchscript flag
# flag set to True even though it is not necessary as this model does not have an LM head
config = BertConfig(
    vocab_size=30522,  # matches the bert-base-uncased tokenizer vocabulary
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    torchscript=True
)
# instantiate the model
model = BertModel(config)
model.eval()

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

In [6]:
# create a trace
traced_model = torch.jit.trace(model, dummy_input)
torch.jit.save(traced_model, "traced_bert.pt")

In [None]:
# we can also instantiate the model with the `from_pretrained` method
model = BertModel.from_pretrained(
    'google-bert/bert-base-uncased',
    torchscript=True
)
model.eval()
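
After tracing, it is worth checking that the traced module reproduces the eager model on the example input. The sketch below does this with a small stand-in module rather than `BertModel`, so that it runs quickly and offline. `TinyEncoder`, its sizes, and the in-memory buffer used instead of a file are illustrative assumptions; the same pattern should apply to `model`, `traced_model`, and `dummy_input` from the cells above.

```python
import io

import torch
from torch import nn


class TinyEncoder(nn.Module):
    """Hypothetical stand-in for a model taking token ids and segment ids."""

    def __init__(self, vocab_size=100, hidden_size=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.segment = nn.Embedding(2, hidden_size)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, input_ids, token_type_ids):
        return self.norm(self.embed(input_ids) + self.segment(token_type_ids))


model = TinyEncoder().eval()
example = (torch.tensor([[1, 2, 3, 4]]), torch.tensor([[0, 0, 1, 1]]))

# no_grad avoids recording autograd history, which inference does not need.
with torch.no_grad():
    traced = torch.jit.trace(model, example)

    # Save to and reload from an in-memory buffer (a file path works the same way).
    buffer = io.BytesIO()
    torch.jit.save(traced, buffer)
    buffer.seek(0)
    reloaded = torch.jit.load(buffer)

    # The reloaded trace should match the eager model on the example input.
    matches = torch.allclose(model(*example), reloaded(*example), atol=1e-6)

print(matches)  # True
```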

### Loading a model

We can load the previously saved `BertModel`, `traced_bert.pt`, from disk and run it on the previously initialized `dummy_input`:

In [7]:
loaded_model = torch.jit.load("traced_bert.pt")
loaded_model.eval()

RecursiveScriptModule(
  original_name=BertModel
  (embeddings): RecursiveScriptModule(
    original_name=BertEmbeddings
    (word_embeddings): RecursiveScriptModule(original_name=Embedding)
    (position_embeddings): RecursiveScriptModule(original_name=Embedding)
    (token_type_embeddings): RecursiveScriptModule(original_name=Embedding)
    (LayerNorm): RecursiveScriptModule(original_name=LayerNorm)
    (dropout): RecursiveScriptModule(original_name=Dropout)
  )
  (encoder): RecursiveScriptModule(
    original_name=BertEncoder
    (layer): RecursiveScriptModule(
      original_name=ModuleList
      (0): RecursiveScriptModule(
        original_name=BertLayer
        (attention): RecursiveScriptModule(
          original_name=BertAttention
          (self): RecursiveScriptModule(
            original_name=BertSdpaSelfAttention
            (query): RecursiveScriptModule(original_name=Linear)
            (key): RecursiveScriptModule(original_name=Linear)
            (value): RecursiveScr

In [8]:
all_encoder_layers, pooled_output = loaded_model(*dummy_input)

In [10]:
all_encoder_layers

tensor([[[-1.4378,  0.6825, -2.4851,  ..., -0.0238, -0.2185, -0.5856],
         [-0.4958, -0.5208, -2.0513,  ..., -0.5470,  1.0878,  0.3045],
         [-1.3614,  1.0366, -1.4385,  ...,  0.1271,  1.4248,  0.3793],
         ...,
         [-0.7424,  0.5994, -1.5412,  ...,  0.0531,  0.9640,  0.4728],
         [-1.8554,  1.1029, -1.0090,  ..., -0.4807,  1.0942,  0.2148],
         [-0.4105,  0.9331, -1.8850,  ...,  0.9676, -0.0521,  0.5957]]],
       grad_fn=<NativeLayerNormBackward0>)

In [11]:
pooled_output

tensor([[-0.3747, -0.2615,  0.4945,  0.1161,  0.5884, -0.6834,  0.2086, -0.5298,
          0.7423, -0.4602,  0.8943, -0.0630,  0.3268, -0.7138, -0.2951,  0.5479,
         -0.5130,  0.7972, -0.8440,  0.0408,  0.0861, -0.3240, -0.1557,  0.3286,
         -0.5127,  0.4582, -0.5460, -0.0812,  0.0277,  0.6589, -0.1288, -0.4241,
         -0.0799,  0.4402,  0.0040,  0.7200, -0.3204,  0.0461, -0.4271,  0.2593,
         -0.4389, -0.3657,  0.7497, -0.0093,  0.4834,  0.5736,  0.0692,  0.5747,
         -0.3417, -0.3182,  0.6341, -0.6795,  0.1140, -0.4094,  0.3372,  0.2712,
          0.5584,  0.4738,  0.4626, -0.8052,  0.6120, -0.3280, -0.4708, -0.4730,
         -0.3467, -0.4055, -0.9262,  0.4617,  0.0718, -0.7652,  0.0020, -0.2757,
         -0.8060,  0.0685,  0.1937,  0.1022, -0.1405, -0.3898,  0.4622,  0.1027,
         -0.0250,  0.0375, -0.2743,  0.3211, -0.1176, -0.4698, -0.3924,  0.2698,
         -0.1476,  0.2752,  0.0325,  0.2037, -0.5341, -0.2437,  0.3984,  0.3869,
          0.4138, -0.6146, -

### Using a traced model for inference

In [13]:
traced_model(tokens_tensor, segments_tensors)

(tensor([[[-1.4378,  0.6825, -2.4851,  ..., -0.0238, -0.2185, -0.5856],
          [-0.4958, -0.5208, -2.0513,  ..., -0.5470,  1.0878,  0.3045],
          [-1.3614,  1.0366, -1.4385,  ...,  0.1271,  1.4248,  0.3793],
          ...,
          [-0.7424,  0.5994, -1.5412,  ...,  0.0531,  0.9640,  0.4728],
          [-1.8554,  1.1029, -1.0090,  ..., -0.4807,  1.0942,  0.2148],
          [-0.4105,  0.9331, -1.8850,  ...,  0.9676, -0.0521,  0.5957]]],
        grad_fn=<NativeLayerNormBackward0>),
 tensor([[-0.3747, -0.2615,  0.4945,  0.1161,  0.5884, -0.6834,  0.2086, -0.5298,
           0.7423, -0.4602,  0.8943, -0.0630,  0.3268, -0.7138, -0.2951,  0.5479,
          -0.5130,  0.7972, -0.8440,  0.0408,  0.0861, -0.3240, -0.1557,  0.3286,
          -0.5127,  0.4582, -0.5460, -0.0812,  0.0277,  0.6589, -0.1288, -0.4241,
          -0.0799,  0.4402,  0.0040,  0.7200, -0.3204,  0.0461, -0.4271,  0.2593,
          -0.4389, -0.3657,  0.7497, -0.0093,  0.4834,  0.5736,  0.0692,  0.5747,
          -0.3