Release v1.7.3: Patch release for PyTorch 2.0 and transformers 4.27.0 · huggingface/optimum

This patch release fixes a few bugs related to the PyTorch 2.0 release and includes a few new features as well.

Breaking change: constant outputs removed from ONNX encoder-decoder models

We removed some constant past key values outputs from encoder-decoder models in the ONNX export. Beware that this could potentially break your existing code, but we recommend using the newly exported models, as this removes unnecessary Identity nodes from the models.

Remove constant outputs from decoder with past ONNX model for encoder-decoder architectures by @fxmarty in #872
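
To check whether existing code is affected by this change, one option is to compare the output names it reads against the outputs the newly exported decoder actually declares. The following is only a minimal sketch: the file name `decoder_with_past_model.onnx` and the output names in the example are assumptions used for illustration, and `graph_output_names` and `missing_outputs` are hypothetical helpers, not part of Optimum.

```python
def graph_output_names(onnx_path):
    """Return the output names declared by an ONNX graph."""
    import onnx  # imported lazily so missing_outputs() works without onnx installed

    # Skip external weight files: only the graph signature is needed here.
    model = onnx.load(onnx_path, load_external_data=False)
    return [output.name for output in model.graph.output]


def missing_outputs(expected_names, available_names):
    """List the names your code reads that the exported model no longer produces."""
    available = set(available_names)
    return [name for name in expected_names if name not in available]


# Hypothetical names, for illustration only. In practice `available` would come
# from graph_output_names("decoder_with_past_model.onnx") (assumed file name).
expected = ["logits", "present.0.decoder.key", "present.0.encoder.key"]
available = ["logits", "present.0.decoder.key"]
print(missing_outputs(expected, available))  # -> ['present.0.encoder.key']
```

Any name reported by `missing_outputs` is an output that downstream code should stop reading from the decoder with past.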

`torch.nn.functional.scaled_dot_product_attention` support for decoders in BetterTransformer

PyTorch 2.0 introduces torch.nn.functional.scaled_dot_product_attention in beta, a fastpath for attention that extends its accelerated transformer features. It is now available through optimum.bettertransformer for the following architectures: Bart, Blenderbot, GPT2, GPT-J, M2M100, Marian, Mbart, OPT, Pegasus, T5.

Beware that this is still experimental and speedups have yet to be validated on all architectures.

PyTorch's scaled_dot_product_attention makes it possible to use flash attention and memory-efficient attention natively in PyTorch.
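
For readers unfamiliar with the operation, the sketch below spells out the math that scaled dot-product attention computes for a single head, softmax(Q Kᵀ / sqrt(d)) V, in plain Python. It is a reference for the arithmetic only, not PyTorch's fused implementation; `sdpa_reference` is a hypothetical name, and the causal mask assumes query and key sequences of equal length.

```python
import math


def sdpa_reference(query, key, value, is_causal=False):
    """Single-head attention over lists of row vectors (no batching, no dropout)."""
    scale = 1.0 / math.sqrt(len(query[0]))
    output = []
    for i, q in enumerate(query):
        # Scaled dot product of query i with every key.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in key]
        if is_causal:
            # Hide positions that come after position i.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        # Numerically stable softmax over the scores.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Attention output: weighted sum of the value rows.
        output.append(
            [sum(w * v[c] for w, v in zip(weights, value)) for c in range(len(value[0]))]
        )
    return output


q = k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
# With a causal mask, the first position can only attend to itself.
print(sdpa_reference(q, k, v, is_causal=True)[0])  # -> [1.0, 2.0]
```

The fused kernels avoid materializing the full score matrix that this reference builds row by row, which is consistent with the memory savings reported in the training benchmark.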

Usage is as follows:

from transformers import AutoTokenizer, AutoModelForCausalLM
from optimum.bettertransformer import BetterTransformer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

model = BetterTransformer.transform(model)  # modify transformers modeling to use native scaled_dot_product_attention

# do your inference or training here

model = BetterTransformer.reverse(model)  # go back to using canonical transformers modeling
model.save_pretrained("gpt2_model")

Inference benchmark (on fp16):

Model	batch size	Input sequence length	Generated tokens	Latency eager (s)	Latency BT (s)	Speedup	Peak memory eager (MB)	Peak memory BT (MB)	Memory savings
gpt2	1	64	256	1.800	1.607	12.0%	569.90	569.89	0%
gpt2	64	64	256	2.159	1.617	33.5%	2067.45	2093.80	0%
opt-1.3b	1	64	256	3.010	2.667	12.9%	5408.238	5408.238	0%
gpt-neox-20b	1	64	256	10.869	9.937	9.4%	83670.67	83673.53	0%

Training benchmark (on fp16):

Model	batch size	Sequence length	time/epoch (eager, s)	time/epoch (BT, s)	Speedup	Peak memory eager (MB)	Peak memory BT (MB)	Memory savings
gpt2	8	1024	17.732	14.037	26.3%	13291.16	10191.52	30.4%
gpt2	32	1024	17.336	13.309	30.3%	52834.83	38858.56	36.0%
gpt2	64	1024	OOM	14.067	/	OOM	75600.08	/
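
The release notes do not state how the Speedup and Memory savings columns are derived. The reported figures appear consistent with the ratio of the eager measurement to the BetterTransformer (BT) measurement, minus one; the sketch below makes that assumption explicit and recomputes two training rows. `relative_gain` is a hypothetical helper name.

```python
def relative_gain(eager, bettertransformer):
    """Percentage by which the eager measurement exceeds the BetterTransformer one.

    Assumed formula, inferred from the tables: (eager / BT - 1) * 100.
    """
    return (eager / bettertransformer - 1.0) * 100.0


# gpt2, batch size 8, sequence length 1024 (training benchmark)
print(f"speedup: {relative_gain(17.732, 14.037):.1f}%")               # -> speedup: 26.3%
print(f"memory savings: {relative_gain(13291.16, 10191.52):.1f}%")    # -> memory savings: 30.4%
```

Under this reading, a 30.4% memory saving means eager mode needs about 1.3 times the peak memory of BetterTransformer, not that BetterTransformer uses 30.4% less memory.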

Benchmarks can be reproduced using the inference script and training script:

python benchmark_bettertransformer.py --model-name gpt2 --use-half --use-cuda --is_decoder --num-batches 5 --max_token 256
python benchmark_bettertransformer.py --model-name gpt2 --use-half --use-cuda --is_decoder --num-batches 5 --max_token 256 --seqlen-stdev 0

Add scaled_dot_product_attention support for decoder models by @fxmarty in #853
Support scaled_dot_product_attention for t5 by @fxmarty in #856
[BT] add decoder benchmark script by @younesbelkada in #857
[BT] Fix bt benchmark by @younesbelkada in #858
Fix pytorch version check in bettertransformer by @fxmarty in #862
[BT] Add fp16 support by @younesbelkada in #859
[BT] Add decoder training support by @younesbelkada in #860
Bart support scaled_dot_product_attention by @fxmarty in #863
[BT] add accelerate_test markers by @younesbelkada in #864
Mbart, pegasus, blenderbot, marian, m2m_100 support scaled_dot_product_attention by @fxmarty in #865
Add bettertransformer reverse transform by @fxmarty in #868
Add bettertransformer training benchmark script by @fxmarty in #873

New architectures in the ONNX export

Three additional architectures are supported in the ONNX export: ImageGPT, RegNet, OPT.

Adding ONNX support for ImageGPT by @adit299 in #819
Add ONNX support for RegNet by @asrimanth in #833
Adding support for Facebook's OPT models by @hivaze in #852

(WIP) TFLite export with quantization support

Progress continues on the TFLite export with quantization support. This is a work in progress and is not documented yet.

Quantization with TFLite by @michaelbenayoun in #854

Bugfixes and improvements

Update documentation by @echarlaix in #843
Fix typo in documentation by @regisss in #848
Remove redundant code by @mht-sharma in #841
Update README by @echarlaix in #850
Update documentation by @echarlaix in #855
Remove iobinding ORTModelForCTC by @mht-sharma in #840
Fix typo in documentation by @echarlaix in #861
Fix causal-lm ONNX axis names by @fxmarty in #871
add NNCF openvino notebook by @echarlaix in #875
Remove positional-only parameters not support by python < v3.8 by @echarlaix in #881
lazy import for task manager by @JingyaHuang in #844
Remove onnx and ort dependencies on the TasksManager by @michaelbenayoun in #846
Reactivate export & optimization tests for causal-lm models by @fxmarty in #885
Fix ONNX export on transformers 4.27 release by @fxmarty in #884
Do not use scaled_dot_product_attention for stable diffusion onnx export by @fxmarty in #888
Fix loading of an ONNX stable diffusion model when config doesn't match by @echarlaix in #887
Automatic framework detection in TasksManager for large models by @fxmarty in #883
Fix WavLM onnx export upon torch 2.0 release by @fxmarty in #889
Fix PushToHubMixin._create_repo according to transformers 4.27 release by @fxmarty in #892
Fix stable diffusion framework detection by @fxmarty in #893
Add donut CPU inference ORT by @mht-sharma in #761
Fix check_model for large merged ONNX models by @fxmarty in #896
Drop python 3.7 support by @fxmarty in #891
Fix dummy label generator for vision tasks by @JingyaHuang in #900
Add stable diffusion dummy object by @echarlaix in #899
Automatic support for large ONNX models in ORTOptimizer by @fxmarty in #886
Remove subprocess calls in ONNX export by @fxmarty in #897
Registering mechanism for the TasksManager by @michaelbenayoun in #898
add option to run inference with ort by @prathikr in #838
Check min diffusers version by @echarlaix in #902
Update bug-report.yml by @lewtun in #895
Fix axis name for seq2seq ONNX models by @fxmarty in #904
Fix GPU tests by @fxmarty in #909
Fix misleading error message in ORTOptimizer by @fxmarty in #910
Delete all Docker images before building the doc of Optimum by @regisss in #911
Fix onnx export preprocessors save by @fxmarty in #913
Fix GPU CI by @fxmarty in #914

New Contributors

@adit299 made their first contribution in #819
@asrimanth made their first contribution in #833
@hivaze made their first contribution in #852

Full Changelog: v1.2.0...v1.7.2
