v1.7.3: Patch release for PyTorch 2.0 and transformers 4.27.0
This patch release fixes a few bugs related to the PyTorch 2.0 release and includes a few new features as well.
Breaking change: constant outputs removed from ONNX encoder-decoder models
We removed some constant past key values outputs from the ONNX export of encoder-decoder models. Beware that this could break your existing code, but we recommend using the newly exported models, as this removes unnecessary Identity nodes from the models.
- Remove constant outputs from decoder with past ONNX model for encoder-decoder architectures by @fxmarty in #872
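One way to make downstream code less sensitive to this kind of change is to look outputs up by name instead of by position. The snippet below is a minimal sketch of that idea in plain Python; the helper `outputs_by_name` and the output names are hypothetical and only illustrate the pattern. With ONNX Runtime, the names would typically come from `session.get_outputs()`.

```python
def outputs_by_name(names, values):
    """Pair ONNX output names with the arrays returned for them."""
    if len(names) != len(values):
        raise ValueError(f"expected {len(names)} outputs, got {len(values)}")
    return dict(zip(names, values))


# Hypothetical output names, for illustration only.
names = ["logits", "present.0.decoder.key", "present.0.decoder.value"]
values = [[0.1, 0.9], [1.0], [2.0]]  # stand-ins for the returned arrays

outputs = outputs_by_name(names, values)
logits = outputs["logits"]

# An output that may no longer exist is looked up defensively:
# .get() returns None instead of raising when the name is absent.
encoder_key = outputs.get("present.0.encoder.key")
```

Code that indexes outputs positionally would silently pick up the wrong tensor once outputs are removed, whereas a name-based lookup either finds the tensor or reports that it is missing.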
torch.nn.functional.scaled_dot_product_attention support for decoders in BetterTransformer
PyTorch 2.0 introduces, in beta, torch.nn.functional.scaled_dot_product_attention, a fast path for attention that extends PyTorch's accelerated transformer features. It is included in optimum.bettertransformer and can be used with the following architectures: Bart, Blenderbot, GPT2, GPT-J, M2M100, Marian, Mbart, OPT, Pegasus, T5.
Beware that this is still experimental and speedups have yet to be validated on all architectures.
PyTorch's scaled_dot_product_attention makes it possible to use flash attention and memory-efficient attention natively in PyTorch.
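For readers unfamiliar with the operation, here is a minimal pure-Python sketch of what scaled dot-product attention computes for a single head: softmax(Q K^T / sqrt(d)) V. The function `reference_attention` is a hypothetical name used for illustration. It is not PyTorch's implementation and has none of its fused kernels; its `is_causal` flag only mimics the masking behaviour of the corresponding PyTorch argument.

```python
import math


def reference_attention(query, key, value, is_causal=False):
    """Single-head attention on nested lists of floats.

    query: L_q rows of size d; key: L_k rows of size d;
    value: L_k rows of size d_v. Returns L_q rows of size d_v.
    """
    scale = 1.0 / math.sqrt(len(query[0]))
    output = []
    for i, q in enumerate(query):
        # Scaled dot product of this query with every key.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in key]
        if is_causal:
            # Position i may only attend to positions j <= i.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        # Numerically stable softmax over the scores.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the value rows.
        output.append(
            [sum(w * v[c] for w, v in zip(weights, value)) for c in range(len(value[0]))]
        )
    return output


# Two identical keys receive equal weight, so the result averages the values.
averaged = reference_attention([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]], [[0.0, 0.0], [2.0, 4.0]])

# With causal masking, the first position can only see the first value row.
causal = reference_attention(
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 2.0], [3.0, 4.0]],
    is_causal=True,
)
```

The fused kernels avoid materializing the full score matrix that this sketch builds row by row, which is where the memory savings in training come from.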
Usage with optimum.bettertransformer is as follows:
from transformers import AutoTokenizer, AutoModelForCausalLM
from optimum.bettertransformer import BetterTransformer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model = BetterTransformer.transform(model) # modify transformers modeling to use native scaled_dot_product_attention
# do your inference or training here
model = BetterTransformer.reverse(model) # go back to using canonical transformers modeling
model.save_pretrained("gpt2_model")
Inference benchmark (on fp16):
| Model | batch size | Input sequence length | Generated tokens | Latency eager (s) | Latency BT (s) | Speedup | Peak memory eager (MB) | Peak memory BT (MB) | Memory savings |
|---|---|---|---|---|---|---|---|---|---|
| gpt2 | 1 | 64 | 256 | 1.800 | 1.607 | 12.0% | 569.90 | 569.89 | 0% |
| gpt2 | 64 | 64 | 256 | 2.159 | 1.617 | 33.5% | 2067.45 | 2093.80 | 0% |
| opt-1.3b | 1 | 64 | 256 | 3.010 | 2.667 | 12.9% | 5408.238 | 5408.238 | 0% |
| gpt-neox-20b | 1 | 64 | 256 | 10.869 | 9.937 | 9.4% | 83670.67 | 83673.53 | 0% |
Training benchmark (on fp16):
| Model | batch size | Sequence length | time/epoch (eager, s) | time/epoch (BT, s) | Speedup | Peak memory eager (MB) | Peak memory BT (MB) | Memory savings |
|---|---|---|---|---|---|---|---|---|
| gpt2 | 8 | 1024 | 17.732 | 14.037 | 26.3% | 13291.16 | 10191.52 | 30.4% |
| gpt2 | 32 | 1024 | 17.336 | 13.309 | 30.3% | 52834.83 | 38858.56 | 36.0% |
| gpt2 | 64 | 1024 | OOM | 14.067 | / | OOM | 75600.08 | / |
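
The release notes do not state how the percentage columns are derived. The reported speedups, and the memory savings in the training table, appear consistent with the ratio eager / BT minus one. The sketch below recomputes the training figures under that assumption; `relative_gain` is a hypothetical helper name.

```python
def relative_gain(eager, bt):
    """Percentage by which the eager figure exceeds the BetterTransformer one."""
    return (eager / bt - 1.0) * 100.0


# (label, time eager s, time BT s, peak memory eager MB, peak memory BT MB),
# copied from the training benchmark table.
training_rows = [
    ("gpt2, batch size 8", 17.732, 14.037, 13291.16, 10191.52),
    ("gpt2, batch size 32", 17.336, 13.309, 52834.83, 38858.56),
]

for label, t_eager, t_bt, mem_eager, mem_bt in training_rows:
    print(
        f"{label}: speedup {relative_gain(t_eager, t_bt):.1f}%, "
        f"memory savings {relative_gain(mem_eager, mem_bt):.1f}%"
    )
```

The batch size 64 row is left out because the eager run went out of memory, so no ratio can be computed for it.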
Benchmarks can be reproduced using the inference script and training script:
python benchmark_bettertransformer.py --model-name gpt2 --use-half --use-cuda --is_decoder --num-batches 5 --max_token 256
python benchmark_bettertransformer.py --model-name gpt2 --use-half --use-cuda --is_decoder --num-batches 5 --max_token 256 --seqlen-stdev 0
- Add scaled_dot_product_attention support for decoder models by @fxmarty in #853
- Support scaled_dot_product_attention for t5 by @fxmarty in #856
- [BT] add decoder benchmark script by @younesbelkada in #857
- [BT] Fix bt benchmark by @younesbelkada in #858
- Fix pytorch version check in bettertransformer by @fxmarty in #862
- [BT] Add fp16 support by @younesbelkada in #859
- [BT] Add decoder training support by @younesbelkada in #860
- Bart support scaled_dot_product_attention by @fxmarty in #863
- [BT] add accelerate_test markers by @younesbelkada in #864
- Mbart, pegasus, blenderbot, marian, m2m_100 support scaled_dot_product_attention by @fxmarty in #865
- Add bettertransformer reverse transform by @fxmarty in #868
- Add bettertransformer training benchmark script by @fxmarty in #873
New architectures in the ONNX export
Three additional architectures are supported in the ONNX export: ImageGPT, RegNet, OPT.
- Adding ONNX support for ImageGPT by @adit299 in #819
- Add ONNX support for RegNet by @asrimanth in #833
- Adding support for Facebook's OPT models by @hivaze in #852
(WIP) TFLite export with quantization support
Work continues on the TFLite export with quantization support. This is still a work in progress and is not documented yet.
- Quantization with TFLite by @michaelbenayoun in #854
Bugfixes and improvements
- Update documentation by @echarlaix in #843
- Fix typo in documentation by @regisss in #848
- Remove redundant code by @mht-sharma in #841
- Update README by @echarlaix in #850
- Update documentation by @echarlaix in #855
- Remove iobinding ORTModelForCTC by @mht-sharma in #840
- Fix typo in documentation by @echarlaix in #861
- Fix causal-lm ONNX axis names by @fxmarty in #871
- add NNCF openvino notebook by @echarlaix in #875
- Remove positional-only parameters not support by python < v3.8 by @echarlaix in #881
- lazy import for task manager by @JingyaHuang in #844
- Remove onnx and ort dependencies on the TasksManager by @michaelbenayoun in #846
- Reactivate export & optimization tests for causal-lm models by @fxmarty in #885
- Fix ONNX export on transformers 4.27 release by @fxmarty in #884
- Do not use scaled_dot_product_attention for stable diffusion onnx export by @fxmarty in #888
- Fix loading of an ONNX stable diffusion model when config doesn't match by @echarlaix in #887
- Automatic framework detection in TasksManager for large models by @fxmarty in #883
- Fix WavLM onnx export upon torch 2.0 release by @fxmarty in #889
- Fix PushToHubMixin._create_repo according to transformers 4.27 release by @fxmarty in #892
- Fix stable diffusion framework detection by @fxmarty in #893
- Add donut CPU inference ORT by @mht-sharma in #761
- Fix check_model for large merged ONNX models by @fxmarty in #896
- Drop python 3.7 support by @fxmarty in #891
- Fix dummy label generator for vision tasks by @JingyaHuang in #900
- Add stable diffusion dummy object by @echarlaix in #899
- Automatic support for large ONNX models in ORTOptimizer by @fxmarty in #886
- Remove subprocess calls in ONNX export by @fxmarty in #897
- Registering mechanism for the TasksManager by @michaelbenayoun in #898
- add option to run inference with ort by @prathikr in #838
- Check min diffusers version by @echarlaix in #902
- Update bug-report.yml by @lewtun in #895
- Fix axis name for seq2seq ONNX models by @fxmarty in #904
- Fix GPU tests by @fxmarty in #909
- Fix misleading error message in ORTOptimizer by @fxmarty in #910
- Delete all Docker images before building the doc of Optimum by @regisss in #911
- Fix onnx export preprocessors save by @fxmarty in #913
- Fix GPU CI by @fxmarty in #914
New Contributors
- @adit299 made their first contribution in #819
- @asrimanth made their first contribution in #833
- @hivaze made their first contribution in #852
Full Changelog: v1.2.0...v1.7.2