Release v1.7.0: ONNX export extension, TFLite export, single-ONNX decoding, ONNX Runtime extension for audio, vision tasks, stable diffusion · huggingface/optimum

New models supported in the ONNX export

Additional architectures are supported in the ONNX export: PoolFormer, Pegasus, Audio Spectrogram Transformer, Hubert, SEW, Speech2Text, UniSpeech, UniSpeech-SAT, Wav2Vec2, Wav2Vec2-Conformer, WavLM, Data2Vec Audio, MPNet, stable diffusion VAE encoder, vision encoder decoder, Nystromformer, Splinter, GPT NeoX.

Add PoolFormer support in exporters.onnx by @BakingBrains in #646
Support pegasus exporters by @mht-sharma in #620
Audio models support with optimum.exporters.onnx by @michaelbenayoun in #622
Add MPNet ONNX export by @jplu in #691
Add stable diffusion VAE encoder export by @echarlaix in #705
Add vision encoder decoder model in exporters by @mht-sharma in #588
Nystromformer ONNX export by @whr778 in #728
Support Splinter exporters (#555) by @Allanbeddouk in #736
Add gpt-neo-x support by @sidthekidder in #745

New models supported in BetterTransformer

A few additional architectures are supported in BetterTransformer: RoCBERT, RoFormer, Marian

Add RoCBert support for Bettertransformer by @shogohida in #542
Add better transformer support for RoFormer by @manish-p-gupta in #680
added BetterTransformer support for Marian by @IlyasMoutawwakil in #808

Additional tasks supported in the ONNX Runtime integration

The ONNX Runtime integration covers additional tasks through new classes: ORTModelForMaskedLM, ORTModelForVision2Seq, ORTModelForAudioClassification, ORTModelForCTC, ORTModelForAudioXVector, ORTModelForAudioFrameClassification, and ORTStableDiffusionPipeline.

Reference: https://huggingface.co/docs/optimum/main/en/onnxruntime/package_reference/modeling_ort and https://huggingface.co/docs/optimum/main/en/onnxruntime/usage_guides/models#export-and-inference-of-stable-diffusion-models

Add ORTModelForMaskedLM class by @JingyaHuang in #729
Add ORTModelForVision2Seq for VisionEncoderDecoder models inference by @mht-sharma in #742
Add ORTModelXXX for audio by @mht-sharma in #774
Add stable diffusion onnx runtime pipeline by @echarlaix in #786

Support of the ONNX export from PyTorch on float16

When a GPU is available, the ONNX export accepts the options --fp16 --device cuda to export the model in float16, relying directly on the native torch.onnx.export.

Example: optimum-cli export onnx --model gpt2 --fp16 --device cuda gpt2_onnx/

Support ONNX export on torch.float16 type by @fxmarty in #749
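
For scripted exports, the same command can be assembled from Python. The sketch below is a minimal illustration, not part of Optimum: build_onnx_export_command and run_export are hypothetical helper names, and the check that --fp16 is paired with --device cuda is an assumption drawn from the way the two options are presented together above. Only flags that appear in these notes are used.

```python
import shutil
import subprocess
from typing import List, Optional


def build_onnx_export_command(
    model: str,
    output_dir: str,
    fp16: bool = False,
    device: Optional[str] = None,
) -> List[str]:
    """Build the argument list for `optimum-cli export onnx` (hypothetical helper)."""
    # Assumption: --fp16 is meant to be combined with --device cuda,
    # because the release notes present the two options together.
    if fp16 and device != "cuda":
        raise ValueError("fp16 export is expected to be combined with device='cuda'")
    command = ["optimum-cli", "export", "onnx", "--model", model]
    if fp16:
        command.append("--fp16")
    if device is not None:
        command += ["--device", device]
    command.append(output_dir)
    return command


def run_export(command: List[str]) -> bool:
    """Run the export only if optimum-cli is installed; return True if it ran."""
    if shutil.which(command[0]) is None:
        print("optimum-cli not found on PATH; command was:", " ".join(command))
        return False
    subprocess.run(command, check=True)
    return True


if __name__ == "__main__":
    # Only prints the command; call run_export(cmd) to actually launch it.
    cmd = build_onnx_export_command("gpt2", "gpt2_onnx/", fp16=True, device="cuda")
    print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting issues with model names or output paths.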

TFLite export

TFLite export is now supported, with static shapes:

optimum-cli export tflite --help
optimum-cli export tflite --model bert-base-uncased --sequence_length 128 bert_tflite/

exporters.tflite initial support by @michaelbenayoun in #716
TFLite auto-encoder models by @michaelbenayoun in #757
[TFLite Export] Adds support for ResNet by @sayakpaul in #813

ONNX Runtime optimization and quantization directly in the CLI

Add optimize and quantize command CLI by @jplu in #700
Support ONNX Runtime optimizations in exporters.onnx by @fxmarty in #807

The ONNX export can optionally apply the ONNX Runtime optimizations directly during the export, by passing an option from --optimize O1 up to --optimize O4:

optimum-cli export onnx --help
optimum-cli export onnx --model t5-small --optimize O3 t5small_onnx/

ONNX Runtime quantization is available directly from the command line, using optimum-cli onnxruntime quantize:

optimum-cli onnxruntime quantize --help
optimum-cli onnxruntime quantize --onnx_model distilbert_onnx --avx512

ONNX Runtime optimization is available directly from the command line, using optimum-cli onnxruntime optimize:

optimum-cli onnxruntime optimize --help
optimum-cli onnxruntime optimize --onnx_model distilbert_onnx -O3

ORTModelForCausalLM supports decoding with a single ONNX

Up to now, two ONNX models were used for decoders:

One handling the first forward pass, where no past key values have been cached yet and which therefore does not take them as input.
One handling the following forward passes, where past key values have been cached and are taken as input.

This release adds support, in both the ONNX export and ORTModelForCausalLM, for a single ONNX that handles both steps of the decoding. This reduces memory usage, as weights are no longer duplicated between two separate models during inference.
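
As a rough illustration of the two decoding steps, the toy sketch below uses one function for both the first pass (no cache) and the following passes (cache reused). It is pure Python with made-up arithmetic, says nothing about how the merged ONNX is built internally, and every name in it (decoder_step, generate, generate_without_cache) is hypothetical. The point is only that a single entry point can serve both steps and yield the same tokens as recomputing everything at each step.

```python
from typing import List, Optional, Tuple

# Toy stand-in for cached key/values: one integer "key" per token seen so far.
Cache = List[int]


def decoder_step(tokens: List[int], past: Optional[Cache] = None) -> Tuple[int, Cache]:
    """Single entry point covering both decoding steps (toy model, not ONNX)."""
    new_keys = [3 * t + 1 for t in tokens]  # the "work" done per input token
    if past is None:
        # First forward pass: nothing is cached, the whole input is processed.
        cache = new_keys
    else:
        # Following passes: only the new token is processed, the rest is reused.
        cache = past + new_keys
    next_token = sum(cache) % 97  # arbitrary deterministic "prediction"
    return next_token, cache


def generate(prompt: List[int], max_new_tokens: int) -> List[int]:
    """One call without cache, then calls that reuse the cache."""
    if max_new_tokens <= 0:
        return []
    token, cache = decoder_step(prompt)
    generated = [token]
    for _ in range(max_new_tokens - 1):
        token, cache = decoder_step([token], past=cache)
        generated.append(token)
    return generated


def generate_without_cache(prompt: List[int], max_new_tokens: int) -> List[int]:
    """Reference loop that reprocesses the full sequence at every step."""
    sequence = list(prompt)
    generated = []
    for _ in range(max_new_tokens):
        token, _ = decoder_step(sequence)
        sequence.append(token)
        generated.append(token)
    return generated


if __name__ == "__main__":
    # Both loops are expected to agree, since the cache only avoids recomputation.
    print(generate([1, 2, 3], 5) == generate_without_cache([1, 2, 3], 5))
```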

A single ONNX can be used for decoders by passing use_merged=True to ORTModelForCausalLM.from_pretrained, here loading directly from a PyTorch model:

from optimum.onnxruntime import ORTModelForCausalLM

model = ORTModelForCausalLM.from_pretrained("gpt2", export=True, use_merged=True)

Alternatively, producing a single ONNX for decoders is the default behavior of the ONNX export, and the result can later be loaded, for example, with ORTModelForCausalLM. The command optimum-cli export onnx --model gpt2 gpt2_onnx/ produces:

└── gpt2_onnx
    ├── config.json
    ├── decoder_model_merged.onnx
    ├── decoder_model.onnx
    ├── decoder_with_past_model.onnx
    ├── merges.txt
    ├── special_tokens_map.json
    ├── tokenizer_config.json
    ├── tokenizer.json
    └── vocab.json

The decoder_model.onnx and decoder_with_past_model.onnx files are kept separately for backward compatibility, but decoder_model_merged.onnx alone is enough for inference.
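
When writing tooling around an export directory like the one above, it can help to check which decoder files are present. The following is a minimal sketch of such a check, preferring the merged file and falling back to the two legacy files. select_decoder_files is a hypothetical helper and does not describe how ORTModelForCausalLM itself resolves files; only the file names come from the listing above.

```python
import tempfile
from pathlib import Path
from typing import List

MERGED = "decoder_model_merged.onnx"
LEGACY = ["decoder_model.onnx", "decoder_with_past_model.onnx"]


def select_decoder_files(export_dir: str) -> List[Path]:
    """Return the decoder ONNX file(s) to use from an export directory."""
    root = Path(export_dir)
    merged = root / MERGED
    if merged.is_file():
        # The merged file alone covers both decoding steps.
        return [merged]
    # Fall back to the pair of files kept for backward compatibility.
    legacy = [root / name for name in LEGACY]
    missing = [path.name for path in legacy if not path.is_file()]
    if missing:
        raise FileNotFoundError(f"missing decoder files in {root}: {missing}")
    return legacy


if __name__ == "__main__":
    # Demo on a throwaway directory with empty placeholder files.
    with tempfile.TemporaryDirectory() as tmp:
        for name in LEGACY + [MERGED]:
            (Path(tmp) / name).touch()
        print([path.name for path in select_decoder_files(tmp)])
```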

Enable inference with a merged decoder in ORTModelForCausalLM by @JingyaHuang in #647

Single-file ORTModel accepts numpy arrays

ORTModel accepts numpy arrays as inputs, in addition to PyTorch tensors. This applies only to models that use a single ONNX.

Accept numpy.ndarray as input and output to ORTModel by @fxmarty in #790

ORTOptimizer support for ORTModelForCausalLM

ORTOptimizer support ORTModelForCausalLM by @fxmarty in #794
Support IO Binding for merged decoder by @fxmarty in #797

Breaking changes

In the ONNX export, exporting models in several ONNX (encoder, decoder) is now the default behavior: #747. The old behavior is still accessible with --monolith.
In decoders, reusing past key values is now the default in the ONNX export: #748. The old behavior is still accessible by explicitly passing, for example, --task causal-lm instead of --task causal-lm-with-past.
BigBird support in the ONNX export is removed, due to the block_sparse attention type being written in pure numpy in Transformers, and hence not exportable to ONNX: #778
The parameter from_transformers of ORTModel.from_pretrained will be deprecated in favor of export.
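
Code that still passes from_transformers can be updated ahead of the deprecation. The sketch below shows one possible way to translate legacy keyword arguments before forwarding them to from_pretrained. migrate_from_pretrained_kwargs is a hypothetical helper, not an Optimum function, and the warning text and conflict check are illustrative assumptions; the notes only state that from_transformers will be deprecated in favor of export.

```python
import warnings
from typing import Any, Dict


def migrate_from_pretrained_kwargs(kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of kwargs with `from_transformers` renamed to `export`."""
    updated = dict(kwargs)  # never mutate the caller's dictionary
    if "from_transformers" in updated:
        legacy_value = updated.pop("from_transformers")
        if "export" in updated and updated["export"] != legacy_value:
            raise ValueError("conflicting values for from_transformers and export")
        warnings.warn(
            "from_transformers is being deprecated; pass export instead",
            DeprecationWarning,
            stacklevel=2,
        )
        updated["export"] = legacy_value
    return updated


if __name__ == "__main__":
    legacy_call = {"from_transformers": True, "use_merged": True}
    print(migrate_from_pretrained_kwargs(legacy_call))
```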

Bugfixes and improvements

Fix disable shape inference for optimization by @regisss in #652
Fix uninformative message when passing use_cache=True to ORTModel and no ONNX with cache is available by @fxmarty in #650
Fix provider options when several providers are passed by @fxmarty in #653
Add TensorRT engine to ONNX Runtime GPU documentation by @fxmarty in #657
Improve documentation around ONNX export by @fxmarty in #666
minor updates on ONNX config guide by @mszsorondo in #662
Fix FlaubertOnnxConfig by @michaelbenayoun in #669
Use nvcr.io/nvidia/tensorrt image for GPU tests by @fxmarty in #660
Better Transformer doc fix by @HamidShojanazeri in #670
Add support for LongT5 optimization using ORT transformer optimizer script by @kunal-vaishnavi in #683
Add test for missing execution providers error messages by @fxmarty in #659
ONNX transformation to cast int64 constants to int32 when possible by @fxmarty in #655
Add missing normalized configs by @fxmarty in #694
Remove code duplication in ORTModel's load_model by @fxmarty in #695
Test more architectures in ORTModel by @fxmarty in #675
Avoid initializing unwanted attributes for ORTModel's having several inference sessions by @fxmarty in #696
Fix the ORTQuantizer loading from specific file by @echarlaix in #701
Add saving of diffusion model additional components for onnx export by @echarlaix in #699
Fix whisper export by @mht-sharma in #629
Support trust remote code option in ONNX export and ONNX Runtime integration by @fxmarty in #702
Add nightly tests on dependencies dev versions by @fxmarty in #703
Fix exception condition by @mht-sharma in #706
Add ORTModelForMultipleChoice to the documentation by @fxmarty in #712
Fix yaml format for dev tests by @fxmarty in #710
Add ONNX Runtime training benchmark by @JingyaHuang in #592
Allow from optimum.onnxruntime import QuantizationConfig by @fxmarty in #715
Fix documentation for doctest tests to pass by @fxmarty in #713
Use transformers>=4.26.0 in setup.py by @fxmarty in #723
Fix GPU tests by @fxmarty in #724
Fix ONNX Runtime inference in ORTTrainer by @JingyaHuang in #709
onnxruntime/modeling_ort.py refactor, part 1 by @michaelbenayoun in #698
Update docker and doc of ORT Trainer by @JingyaHuang in #725
Add test for code examples in the documentation and docstrings by @fxmarty in #704
add image classification example to optimum by @prathikr in #711
Add TensorrtExecutionProvider modeling tests by @fxmarty in #722
Whisper shape inference fix by @michaelbenayoun in #726
Add some redirections to Optimum Habana's documentation by @regisss in #735
Patch ORTTrainer inference with ONNX Runtime backend by @JingyaHuang in #737
Remove dead code in whisper ONNX output by @fxmarty in #741
Unpin protobuf 3.20.1 by @fxmarty in #738
Fix speech2text export by @mht-sharma in #746
Raise error on double call to BetterTransformer.transform() by @fxmarty in #750
exporters.onnx output names and dynamic axes fix by @michaelbenayoun in #731
Fix NNCF supported quantization strategies README table by @echarlaix in #752
Add GPU tests for BetterTransformer by @fxmarty in #751
Fix doctest by @fxmarty in #759
Fix ONNX Runtime cache usage for decoders, add relevant tests by @fxmarty in #756
Fix GPU tests by @fxmarty in #758
Update quality tooling for formatting by @regisss in #760
Fix wrong shapes used at ONNX export and validation by @fxmarty in #764
Change type annotation by @michaelbenayoun in #768
Fix stable diffusion ONNX export by @echarlaix in #762
Disable ONNX Runtime provider check on Windows by @fxmarty in #771
Fix FusionOptions following ORT 1.14 release by @fxmarty in #772
Unpin numpy <1.24.0 by @fxmarty in #773
Fix flaky ONNX Runtime generation test with past key value reuse by @fxmarty in #765
Fix output shape dimension for OnnxConfigWithPast by @fxmarty in #780
Fix used shapes, device at ONNX export by @fxmarty in #777
Pin numpy only for tensorflow export by @fxmarty in #781
Fixed broken paper space links by @Muhtasham in #766
Temporarily disable python 3.9 + macOS test due to onnxruntime 1.14 regression by @fxmarty in #783
Update ORT Training to 1.14.0 by @JingyaHuang in #787
Temporarily disable segformer TensorRT test by @fxmarty in #799
Use a stateful ordered_input_names in ORTModel by @fxmarty in #796
Test ORTOptimizer with IO Binding by @fxmarty in #801
[BT] Add stable layer-norm Wav2vec2 by @younesbelkada in #803
Update rules for ruff by @regisss in #806
Improve orttrainer test by @JingyaHuang in #779
Fix ORT quantization for TensorRT documentation by @fxmarty in #812
Fix GPU tests by @fxmarty in #814
Update ONNX Runtime training doc - use torchrun by @JingyaHuang in #820
Fix ONNX export tests by @fxmarty in #822
All back workflow dispatch on GPU tests by @fxmarty in #823
BetterTransformer pipeline padding issue fix by @vrdn-23 in #821
Fix optimum pipeline initialization by @fxmarty in #824
Fix failing GPU tests by @fxmarty in #829
Remove feature dimension as dynamic axes for stable diffusion ONNX export by @echarlaix in #816
Fix pipeline task dropping arguments bug by @fxmarty in #828
Fix ORTQuantizer behavior with ORTModelForCausalLM by @fxmarty in #831
Update tests by @mht-sharma in #826
Fix exporters GPU CI by @fxmarty in #835
Keep intermediary models for ONNX causal-lm by @fxmarty in #834
Fix duplicate name merged decoder by @fxmarty in #837
Apply lazy import for exporters by @JingyaHuang in #836

Full Changelog: v1.6.0...v1.7.0
