v1.7.0: ONNX export extension, TFLite export, single-ONNX decoding, ONNX Runtime extension for audio, vision tasks, stable diffusion
New models supported in the ONNX export
Additional architectures are supported in the ONNX export: PoolFormer, Pegasus, Audio Spectrogram Transformer, Hubert, SEW, Speech2Text, UniSpeech, UniSpeech-SAT, Wav2Vec2, Wav2Vec2-Conformer, WavLM, Data2Vec Audio, MPNet, stable diffusion VAE encoder, vision encoder decoder, Nystromformer, Splinter, GPT NeoX.
- Add PoolFormer support in exporters.onnx by @BakingBrains in #646
- Support pegasus exporters by @mht-sharma in #620
- Audio models support with optimum.exporters.onnx by @michaelbenayoun in #622
- Add MPNet ONNX export by @jplu in #691
- Add stable diffusion VAE encoder export by @echarlaix in #705
- Add vision encoder decoder model in exporters by @mht-sharma in #588
- Nystromformer ONNX export by @whr778 in #728
- Support Splinter exporters (#555) by @Allanbeddouk in #736
- Add gpt-neo-x support by @sidthekidder in #745
New models supported in BetterTransformer
A few additional architectures are supported in BetterTransformer: RoCBERT, RoFormer, Marian
- Add RoCBert support for Bettertransformer by @shogohida in #542
- Add better transformer support for RoFormer by @manish-p-gupta in #680
- added BetterTransformer support for Marian by @IlyasMoutawwakil in #808
Additional tasks supported in the ONNX Runtime integration
New classes cover these tasks: ORTModelForMaskedLM, ORTModelForVision2Seq, ORTModelForAudioClassification, ORTModelForCTC, ORTModelForAudioXVector, ORTModelForAudioFrameClassification, and ORTStableDiffusionPipeline.
Reference: https://huggingface.co/docs/optimum/main/en/onnxruntime/package_reference/modeling_ort and https://huggingface.co/docs/optimum/main/en/onnxruntime/usage_guides/models#export-and-inference-of-stable-diffusion-models
- Add ORTModelForMaskedLM class by @JingyaHuang in #729
- Add ORTModelForVision2Seq for VisionEncoderDecoder models inference by @mht-sharma in #742
- Add ORTModelXXX for audio by @mht-sharma in #774
- Add stable diffusion onnx runtime pipeline by @echarlaix in #786
Support of the ONNX export from PyTorch on float16
In the ONNX export, it is possible to pass the options --fp16 --device cuda to export using float16 when a GPU is available, directly with the native torch.onnx.export.
Example: optimum-cli export onnx --model gpt2 --fp16 --device cuda gpt2_onnx/
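As a rough illustration, the float16 export command above could be assembled from Python before being handed to a process runner. The helper name build_onnx_export_command is hypothetical and not part of Optimum, and the check that --fp16 is paired with --device cuda is an assumption drawn from the sentence above rather than a documented rule.

```python
import shlex


def build_onnx_export_command(model, output_dir, fp16=False, device="cpu"):
    """Return the argument list for an `optimum-cli export onnx` call.

    Hypothetical helper for illustration only; it builds the command
    but does not run it.
    """
    if fp16 and device != "cuda":
        # Assumption: float16 export is only meaningful together with
        # --device cuda, as in the release-note example.
        raise ValueError("fp16 export expects device='cuda'")
    command = ["optimum-cli", "export", "onnx", "--model", model]
    if fp16:
        command.append("--fp16")
    if device != "cpu":
        command += ["--device", device]
    command.append(output_dir)
    return command


command = build_onnx_export_command("gpt2", "gpt2_onnx/", fp16=True, device="cuda")
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) on a machine where optimum-cli and a GPU are available.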
TFLite export
TFLite export is now supported, with static shapes:
optimum-cli export tflite --help
optimum-cli export tflite --model bert-base-uncased --sequence_length 128 bert_tflite/
- exporters.tflite initial support by @michaelbenayoun in #716
- TFLite auto-encoder models by @michaelbenayoun in #757
- [TFLite Export] Adds support for ResNet by @sayakpaul in #813
ONNX Runtime optimization and quantization directly in the CLI
- Add optimize and quantize command CLI by @jplu in #700
- Support ONNX Runtime optimizations in exporters.onnx by @fxmarty in #807
The ONNX export can optionally apply ONNX Runtime optimizations during the export itself, by passing an optimization level from --optimize O1 up to --optimize O4:
optimum-cli export onnx --help
optimum-cli export onnx --model t5-small --optimize O3 t5small_onnx/
ONNX Runtime quantization is available directly from the command line, using optimum-cli onnxruntime quantize:
optimum-cli onnxruntime quantize --help
optimum-cli onnxruntime quantize --onnx_model distilbert_onnx --avx512
ONNX Runtime optimization is available directly from the command line, using optimum-cli onnxruntime optimize:
optimum-cli onnxruntime optimize --help
optimum-cli onnxruntime optimize --onnx_model distilbert_onnx -O3
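The three commands in this section can be sketched as small Python builders that validate the optimization level before anything is executed. The function names are hypothetical, the flags are copied from the examples above, and the assumption that optimum-cli onnxruntime optimize accepts -O1 to -O4 in the same way as -O3 is not confirmed here. Check each command's --help output for any further arguments your installed version requires.

```python
# Levels named in the release notes for --optimize.
OPTIMIZATION_LEVELS = ("O1", "O2", "O3", "O4")


def _check_level(level):
    """Reject levels outside O1..O4 before building a command."""
    if level not in OPTIMIZATION_LEVELS:
        raise ValueError(
            f"unknown optimization level {level!r}; expected one of {OPTIMIZATION_LEVELS}"
        )


def export_optimized_command(model, output_dir, level):
    """Export to ONNX and optimize in one step (hypothetical helper)."""
    _check_level(level)
    return ["optimum-cli", "export", "onnx", "--model", model,
            "--optimize", level, output_dir]


def optimize_command(onnx_model_dir, level):
    """Optimize an already exported model (hypothetical helper)."""
    _check_level(level)
    # Assumption: the flag form -O<n> applies to every level, as with -O3.
    return ["optimum-cli", "onnxruntime", "optimize",
            "--onnx_model", onnx_model_dir, f"-{level}"]


def quantize_command(onnx_model_dir):
    """Quantize an already exported model for AVX512 (hypothetical helper)."""
    return ["optimum-cli", "onnxruntime", "quantize",
            "--onnx_model", onnx_model_dir, "--avx512"]


for command in (
    export_optimized_command("t5-small", "t5small_onnx/", "O3"),
    optimize_command("distilbert_onnx", "O3"),
    quantize_command("distilbert_onnx"),
):
    print(" ".join(command))
```

Each list can be run with subprocess.run(command, check=True) once optimum-cli is installed.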
ORTModelForCausalLM supports decoding with a single ONNX
Up to now, for decoders, two ONNX models were used:
- One handling the first forward pass where no past key values have been cached yet - thus not taking them as input.
- One handling the following forward pass where past key values have been cached, thus taking them as input.
This release introduces support, in both the ONNX export and ORTModelForCausalLM, for a single ONNX handling both steps of the decoding. This reduces memory usage, as weights are no longer duplicated between two separate models during inference.
A single ONNX for decoders can be used by passing use_merged=True to ORTModelForCausalLM.from_pretrained, loading directly from a PyTorch model:
from optimum.onnxruntime import ORTModelForCausalLM
model = ORTModelForCausalLM.from_pretrained("gpt2", export=True, use_merged=True)
Alternatively, producing a single ONNX for decoders is the default behavior in the ONNX export, and the result can later be used, for example, with ORTModelForCausalLM. The command optimum-cli export onnx --model gpt2 gpt2_onnx/ will produce:
└── gpt2_onnx
    ├── config.json
    ├── decoder_model_merged.onnx
    ├── decoder_model.onnx
    ├── decoder_with_past_model.onnx
    ├── merges.txt
    ├── special_tokens_map.json
    ├── tokenizer_config.json
    ├── tokenizer.json
    └── vocab.json
The decoder_model.onnx and decoder_with_past_model.onnx are kept separate for backward compatibility, but during inference using solely decoder_model_merged.onnx is enough.
- Enable inference with a merged decoder in ORTModelForCausalLM by @JingyaHuang in #647
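To make the file layout concrete, here is a minimal sketch of how a script could decide which decoder files it needs from an export directory: the merged file alone when present, otherwise the two legacy files. The function select_decoder_files is hypothetical and is not how ORTModelForCausalLM resolves files internally; only the three file names come from the listing above.

```python
import tempfile
from pathlib import Path

MERGED_DECODER = "decoder_model_merged.onnx"
LEGACY_DECODERS = ("decoder_model.onnx", "decoder_with_past_model.onnx")


def select_decoder_files(export_dir):
    """Return the decoder ONNX files needed for inference.

    Hypothetical helper: prefers the single merged decoder and falls
    back to the pair of legacy files when no merged file exists.
    """
    export_dir = Path(export_dir)
    merged = export_dir / MERGED_DECODER
    if merged.is_file():
        return [merged]
    legacy = [export_dir / name for name in LEGACY_DECODERS]
    missing = [path.name for path in legacy if not path.is_file()]
    if missing:
        raise FileNotFoundError(f"missing decoder files: {missing}")
    return legacy


# Demonstration on an empty temporary directory with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in LEGACY_DECODERS:
        (Path(tmp) / name).touch()
    legacy_choice = [path.name for path in select_decoder_files(tmp)]

    (Path(tmp) / MERGED_DECODER).touch()
    merged_choice = [path.name for path in select_decoder_files(tmp)]

print(legacy_choice)
print(merged_choice)
```

A deployment script could use such a check to copy only decoder_model_merged.onnx and skip the two files kept for backward compatibility.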
Single-file ORTModel accepts numpy arrays
ORTModel accepts numpy arrays as inputs, in addition to PyTorch tensors. This is only the case for models that use a single ONNX.
ORTOptimizer support for ORTModelForCausalLM
- ORTOptimizer support ORTModelForCausalLM by @fxmarty in #794
- Support IO Binding for merged decoder by @fxmarty in #797
Breaking changes
- In the ONNX export, exporting models in several ONNX (encoder, decoder) is now the default behavior: #747. The old behavior is still accessible with --monolith.
- In decoders, reusing past key values is now the default in the ONNX export: #748. The old behavior is still accessible by explicitly passing, for example, --task causal-lm instead of --task causal-lm-with-past.
- BigBird support in the ONNX export is removed, due to the block_sparse attention type being written in pure numpy in Transformers, and hence not exportable to ONNX: #778
- The parameter from_transformers of ORTModel.from_pretrained will be deprecated in favor of export.
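For code that still passes from_transformers, the last item can be handled with a small shim that rewrites keyword arguments before they reach ORTModel.from_pretrained. This is a sketch only: migrate_from_pretrained_kwargs is a hypothetical helper, not something Optimum provides, and the warning text is illustrative.

```python
import warnings


def migrate_from_pretrained_kwargs(kwargs):
    """Return a copy of kwargs that uses export= instead of from_transformers=.

    Hypothetical helper for call sites written against older releases.
    """
    migrated = dict(kwargs)
    if "from_transformers" in migrated:
        legacy_value = migrated.pop("from_transformers")
        if "export" in migrated and migrated["export"] != legacy_value:
            raise ValueError("from_transformers and export were given conflicting values")
        warnings.warn(
            "from_transformers is being deprecated; pass export instead",
            FutureWarning,
            stacklevel=2,
        )
        migrated["export"] = legacy_value
    return migrated


with warnings.catch_warnings():
    warnings.simplefilter("ignore", FutureWarning)
    new_kwargs = migrate_from_pretrained_kwargs(
        {"from_transformers": True, "use_merged": True}
    )
print(new_kwargs)
```

The migrated dictionary could then be unpacked into a call such as ORTModelForCausalLM.from_pretrained("gpt2", **new_kwargs).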
Bugfixes and improvements
- Fix disable shape inference for optimization by @regisss in #652
- Fix uninformative message when passing use_cache=True to ORTModel and no ONNX with cache is available by @fxmarty in #650
- Fix provider options when several providers are passed by @fxmarty in #653
- Add TensorRT engine to ONNX Runtime GPU documentation by @fxmarty in #657
- Improve documentation around ONNX export by @fxmarty in #666
- minor updates on ONNX config guide by @mszsorondo in #662
- Fix FlaubertOnnxConfig by @michaelbenayoun in #669
- Use nvcr.io/nvidia/tensorrt image for GPU tests by @fxmarty in #660
- Better Transformer doc fix by @HamidShojanazeri in #670
- Add support for LongT5 optimization using ORT transformer optimizer script by @kunal-vaishnavi in #683
- Add test for missing execution providers error messages by @fxmarty in #659
- ONNX transformation to cast int64 constants to int32 when possible by @fxmarty in #655
- Add missing normalized configs by @fxmarty in #694
- Remove code duplication in ORTModel's load_model by @fxmarty in #695
- Test more architectures in ORTModel by @fxmarty in #675
- Avoid initializing unwanted attributes for ORTModel's having several inference sessions by @fxmarty in #696
- Fix the ORTQuantizer loading from specific file by @echarlaix in #701
- Add saving of diffusion model additional components for onnx export by @echarlaix in #699
- Fix whisper export by @mht-sharma in #629
- Support trust remote code option in ONNX export and ONNX Runtime integration by @fxmarty in #702
- Add nightly tests on dependencies dev versions by @fxmarty in #703
- Fix exception condition by @mht-sharma in #706
- Add ORTModelForMultipleChoice to the documentation by @fxmarty in #712
- Fix yaml format for dev tests by @fxmarty in #710
- Add ONNX Runtime training benchmark by @JingyaHuang in #592
- Allow from optimum.onnxruntime import QuantizationConfig by @fxmarty in #715
- Fix documentation for doctest tests to pass by @fxmarty in #713
- Use transformers>=4.26.0 in setup.py by @fxmarty in #723
- Fix GPU tests by @fxmarty in #724
- Fix ONNX Runtime inference in ORTTrainer by @JingyaHuang in #709
- onnxruntime/modeling_ort.py refactor, part 1 by @michaelbenayoun in #698
- Update docker and doc of ORT Trainer by @JingyaHuang in #725
- Add test for code examples in the documentation and docstrings by @fxmarty in #704
- add image classification example to optimum by @prathikr in #711
- Add TensorrtExecutionProvider modeling tests by @fxmarty in #722
- Whisper shape inference fix by @michaelbenayoun in #726
- Add some redirections to Optimum Habana's documentation by @regisss in #735
- Patch ORTTrainer inference with ONNX Runtime backend by @JingyaHuang in #737
- Remove dead code in whisper ONNX output by @fxmarty in #741
- Unpin protobuf 3.20.1 by @fxmarty in #738
- Fix speech2text export by @mht-sharma in #746
- Raise error on double call to BetterTransformer.transform() by @fxmarty in #750
- exporters.onnx output names and dynamic axes fix by @michaelbenayoun in #731
- Fix NNCF supported quantization strategies README table by @echarlaix in #752
- Add GPU tests for BetterTransformer by @fxmarty in #751
- Fix doctest by @fxmarty in #759
- Fix ONNX Runtime cache usage for decoders, add relevant tests by @fxmarty in #756
- Fix GPU tests by @fxmarty in #758
- Update quality tooling for formatting by @regisss in #760
- Fix wrong shapes used at ONNX export and validation by @fxmarty in #764
- Change type annotation by @michaelbenayoun in #768
- Fix stable diffusion ONNX export by @echarlaix in #762
- Disable ONNX Runtime provider check on Windows by @fxmarty in #771
- Fix FusionOptions following ORT 1.14 release by @fxmarty in #772
- Unpin numpy <1.24.0 by @fxmarty in #773
- Fix flaky ONNX Runtime generation test with past key value reuse by @fxmarty in #765
- Fix output shape dimension for OnnxConfigWithPast by @fxmarty in #780
- Fix used shapes, device at ONNX export by @fxmarty in #777
- Pin numpy only for tensorflow export by @fxmarty in #781
- Fixed broken paper space links by @Muhtasham in #766
- Temporarily disable python 3.9 + macOS test due to onnxruntime 1.14 regression by @fxmarty in #783
- Update ORT Training to 1.14.0 by @JingyaHuang in #787
- Temporarily disable segformer TensorRT test by @fxmarty in #799
- Use a stateful ordered_input_names in ORTModel by @fxmarty in #796
- Test ORTOptimizer with IO Binding by @fxmarty in #801
- [BT] Add stable layer-norm Wav2vec2 by @younesbelkada in #803
- Update rules for ruff by @regisss in #806
- Improve orttrainer test by @JingyaHuang in #779
- Fix ORT quantization for TensorRT documentation by @fxmarty in #812
- Fix GPU tests by @fxmarty in #814
- Update ONNX Runtime training doc - use torchrun by @JingyaHuang in #820
- Fix ONNX export tests by @fxmarty in #822
- All back workflow dispatch on GPU tests by @fxmarty in #823
- BetterTransformer pipeline padding issue fix by @vrdn-23 in #821
- Fix optimum pipeline initialization by @fxmarty in #824
- Fix failing GPU tests by @fxmarty in #829
- Remove feature dimension as dynamic axes for stable diffusion ONNX export by @echarlaix in #816
- Fix pipeline task dropping arguments bug by @fxmarty in #828
- Fix ORTQuantizer behavior with ORTModelForCausalLM by @fxmarty in #831
- Update tests by @mht-sharma in #826
- Fix exporters GPU CI by @fxmarty in #835
- Keep intermediary models for ONNX causal-lm by @fxmarty in #834
- Fix duplicate name merged decoder by @fxmarty in #837
- Apply lazy import for exporters by @JingyaHuang in #836
Full Changelog: v1.6.0...v1.7.0