# Export to ONNX and inference using TensorRT

Transformer Engine (TE) is a library that supports low-precision training. Training is only one part of the story, though; inference is the other. There are multiple ways of transferring models trained with TE to inference software. If the model was trained using NeMo, it can be exported directly to the TRT-LLM format; refer to this [tutorial](link). We provide another way of transferring a model out of Transformer Engine: export to the ONNX format with custom FP8 extensions.

When should you use export to ONNX?

- if Transformer Engine layers are only part of a bigger model; the PyTorch ONNX export API makes it easy to export such models,
- when a KV cache is not needed, as in encoder-only use cases like the one in this tutorial.

Diffusion models are a good example of a case where exporting the model to ONNX and using TensorRT for inference works well.
Below we provide an example implementation of one such model.

```python3
import torch


class DiffusionModel(torch.nn.Module):
    ...
```

Let's see how fast inference is in plain TE.

```python3
# time measurement
for _ in range(...):
    model(img_enc, text_enc)

```
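As a rough sketch of how the timing loop above could be organized, the helper below runs a few warm-up calls and then averages the wall-clock time over several iterations. The name `benchmark` and its parameters (`warmup`, `iters`, `sync`) are hypothetical and not part of Transformer Engine or PyTorch. GPU work is asynchronous, so a synchronization callback is assumed to be passed in when timing CUDA code.

```python
import time
from typing import Callable, Optional


def benchmark(
    fn: Callable[[], object],
    warmup: int = 3,
    iters: int = 10,
    sync: Optional[Callable[[], None]] = None,
) -> float:
    """Return the mean wall-clock time in seconds of one call to fn()."""
    # Warm-up calls are excluded from the measurement.
    for _ in range(warmup):
        fn()
    # Wait for pending asynchronous work before starting the clock.
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    # Wait again so that all timed work has actually finished.
    if sync is not None:
        sync()
    return (time.perf_counter() - start) / iters
```

With PyTorch on a GPU, this could be called as `benchmark(lambda: model(img_enc, text_enc), sync=torch.cuda.synchronize)`.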

#### Exporting the TE model to the ONNX format

To export a TE model to the ONNX format, one needs to:

- perform warm-up runs inside the correct autocast contexts,
- wrap the exporting code in `te.onnx_export`,
- use the PyTorch Dynamo ONNX exporter: `torch.onnx.export(..., dynamo=True)`.

```python3
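import torch
import transformer_engine.pytorch as te

# Pass enabled=True explicitly so that ONNX export mode is switched on.
with te.onnx_export(enabled=True):
    onnx_model = torch.onnx.export(..., dynamo=True)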
```

#### Inference with TensorRT

Once the model is exported to the ONNX format, it can be loaded by TensorRT, NVIDIA's inference software. Let's look at an example of export and engine generation:

```

```
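A minimal sketch of engine generation with the TensorRT Python API might look as follows. The function name `build_engine_from_onnx` and the file paths passed to it are hypothetical. The sketch uses only default builder settings; models with FP8 operators may need additional builder flags or a sufficiently recent TensorRT version, which is not shown here.

```python
def build_engine_from_onnx(onnx_path: str, engine_path: str) -> None:
    """Parse an ONNX file and save a serialized TensorRT engine (sketch)."""
    import tensorrt as trt  # imported lazily so this file loads without TensorRT

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    # Explicit-batch flag: needed by the ONNX parser on older TensorRT
    # releases, ignored on newer ones.
    flags = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    network = builder.create_network(flags)
    parser = trt.OnnxParser(network, logger)

    # Populate the network definition from the ONNX file.
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
            raise RuntimeError("ONNX parsing failed:\n" + "\n".join(errors))

    # Build with default settings and write the serialized engine to disk.
    config = builder.create_builder_config()
    serialized = builder.build_serialized_network(network, config)
    if serialized is None:
        raise RuntimeError("TensorRT engine build failed")
    with open(engine_path, "wb") as f:
        f.write(serialized)
```

A call such as `build_engine_from_onnx("model.onnx", "model.engine")` would then produce an engine file that the TensorRT runtime can deserialize; both file names are placeholders.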

Now the engine needs to be run. Let's see how fast it is:
```python3


```

So it is **1.1x** faster than the plain PyTorch/Transformer Engine forward pass.

#### Low precision in ONNX and TensorRT

The ONNX standard does not support all precisions supported by Transformer Engine. All ONNX operators are listed on [this website](https://onnx.ai/onnx/operators/). For this reason, TensorRT and Transformer Engine use some custom low-precision operators, documented below.

**TRT_FP8_QUANTIZE**

aaa

**TRT_FP8_DEQUANTIZE**

aaa


Since standard operators do not support inputs and outputs in some precisions, we use a workaround: we dequantize all tensors before they are passed to such operators, or quantize tensors after such operators. TensorRT detects these patterns and substitutes them with the proper operations.
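To make the dequantize/quantize pattern more concrete, here is a purely illustrative emulation in plain Python. It is not the definition of the operators above: the helper names (`round_to_e4m3`, `quantize`, `dequantize`, `qdq_wrapped`) are hypothetical, and the scale convention (multiply on quantize, divide on dequantize) is an assumption that real operators may invert. The FP8 E4M3 format is emulated by rounding to 3 mantissa bits and saturating at 448.

```python
import math

E4M3_MAX = 448.0  # largest finite value of the FP8 E4M3 format


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at the max."""
    if x == 0.0 or math.isnan(x):
        return x
    x = max(-E4M3_MAX, min(E4M3_MAX, x))
    # frexp gives |x| = m * 2**e with m in [0.5, 1), so the binade exponent is e - 1.
    exponent = max(math.frexp(abs(x))[1] - 1, -6)  # below 2**-6 values are subnormal
    step = 2.0 ** (exponent - 3)  # 3 mantissa bits -> 8 steps per binade
    return round(x / step) * step


def quantize(values, scale):
    """High precision -> emulated FP8 (assumed convention: multiply by scale)."""
    return [round_to_e4m3(v * scale) for v in values]


def dequantize(q_values, scale):
    """Emulated FP8 -> high precision (assumed convention: divide by scale)."""
    return [q / scale for q in q_values]


def qdq_wrapped(op, q_inputs, in_scale, out_scale):
    """Dequantize inputs, run a high-precision op, then quantize the output."""
    return quantize(op(dequantize(q_inputs, in_scale)), out_scale)
```

In an exported graph, a chain like the one inside `qdq_wrapped` is the kind of pattern that an inference engine can recognize and replace with a single low-precision operation.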

