v1.7.3: Patch release for PyTorch 2.0 and transformers 4.27.0

@fxmarty fxmarty released this 23 Mar 16:37

This patch release fixes a few bugs with the PyTorch 2.0 release and includes a few new features as well.

Breaking change: constant outputs removed from ONNX encoder-decoder models

We removed some constant past key values outputs from encoder-decoder models in the ONNX export. Beware that this may break existing code, but we recommend using the newly exported models, as this removes unnecessary Identity nodes from them.

  • Remove constant outputs from decoder with past ONNX model for encoder-decoder architectures by @fxmarty in #872
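
Code that unpacks raw model outputs by position is the most likely to break when outputs are removed. The following is a minimal sketch of reading outputs by name instead, assuming you consume the ONNX Runtime outputs yourself; with ONNX Runtime, the names would come from `session.get_outputs()` and the values from `session.run(None, inputs)`. The helper `outputs_by_name` and the output names below are hypothetical and used for illustration only.

```python
def outputs_by_name(output_names, output_values):
    """Pair each output value with its name instead of relying on position."""
    if len(output_names) != len(output_values):
        raise ValueError("names and values must have the same length")
    return dict(zip(output_names, output_values))


# Hypothetical output names and placeholder values, for illustration only.
old_names = ["logits", "present.0.decoder.key", "present.0.encoder.key"]
new_names = ["logits", "present.0.decoder.key"]

old_outputs = outputs_by_name(old_names, ["L", "DK", "EK"])
new_outputs = outputs_by_name(new_names, ["L", "DK"])

# Name-based access works with both exports.
logits_old = old_outputs["logits"]
logits_new = new_outputs["logits"]

# .get() returns None for an output that the new export no longer produces.
encoder_key = new_outputs.get("present.0.encoder.key")
```

Looking outputs up by name keeps the calling code independent of how many outputs a given export produces.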

torch.nn.functional.scaled_dot_product_attention support for decoders in BetterTransformer

PyTorch 2.0 introduces torch.nn.functional.scaled_dot_product_attention in beta, a fast path for attention that extends PyTorch's accelerated transformer features. It is integrated in optimum.bettertransformer for the following architectures: Bart, Blenderbot, GPT2, GPT-J, M2M100, Marian, Mbart, OPT, Pegasus, T5.

Beware that this is still experimental and speedups have yet to be validated on all architectures.

PyTorch's scaled_dot_product_attention makes it possible to use flash attention and memory-efficient attention natively in PyTorch.
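
For readers unfamiliar with the operation itself, here is a pure-Python reference of what scaled dot-product attention computes for a single head, softmax(QK^T / sqrt(d)) V. This is only a sketch of the math, not PyTorch's implementation: the name `reference_sdpa` is hypothetical, and it omits masking, dropout, batching and the fused kernels that make the PyTorch function fast.

```python
import math


def reference_sdpa(query, key, value):
    """Single-head scaled dot-product attention on plain Python lists.

    query: list of L_q vectors of size d
    key:   list of L_k vectors of size d
    value: list of L_k vectors of size d_v
    Returns a list of L_q vectors of size d_v.
    """
    d = len(query[0])
    scale = 1.0 / math.sqrt(d)
    output = []
    for q in query:
        # Scaled similarity between this query and every key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in key]
        # Numerically stable softmax over the keys.
        max_score = max(scores)
        exps = [math.exp(s - max_score) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted average of the value vectors.
        output.append([
            sum(w * v[j] for w, v in zip(weights, value))
            for j in range(len(value[0]))
        ])
    return output


query = [[1.0, 0.0], [0.0, 1.0]]
key = [[1.0, 0.0], [0.0, 1.0]]
value = [[1.0, 2.0], [3.0, 4.0]]
result = reference_sdpa(query, key, value)
```

Each output row is a weighted average of the value rows, with more weight on the values whose keys best match the query.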

Usage is as follows:

from transformers import AutoTokenizer, AutoModelForCausalLM
from optimum.bettertransformer import BetterTransformer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

model = BetterTransformer.transform(model)  # modify transformers modeling to use native scaled_dot_product_attention

# do your inference or training here

model = BetterTransformer.reverse(model)  # go back to using canonical transformers modeling
model.save_pretrained("gpt2_model")

Inference benchmark (on fp16):

Model        | Batch size | Input sequence length | Generated tokens | Latency eager (s) | Latency BT (s) | Speedup | Peak memory eager (MB) | Peak memory BT (MB) | Memory savings
gpt2         | 1          | 64                    | 256              | 1.800             | 1.607          | 12.0%   | 569.90                 | 569.89              | 0%
gpt2         | 64         | 64                    | 256              | 2.159             | 1.617          | 33.5%   | 2067.45                | 2093.80             | 0%
opt-1.3b     | 1          | 64                    | 256              | 3.010             | 2.667          | 12.9%   | 5408.238               | 5408.238            | 0%
gpt-neox-20b | 1          | 64                    | 256              | 10.869            | 9.937          | 9.4%    | 83670.67               | 83673.53            | 0%

Training benchmark (on fp16):

Model | Batch size | Sequence length | Time/epoch (eager, s) | Time/epoch (BT, s) | Speedup | Peak memory eager (MB) | Peak memory BT (MB) | Memory savings
gpt2  | 8          | 1024            | 17.732                | 14.037             | 26.3%   | 13291.16               | 10191.52            | 30.4%
gpt2  | 32         | 1024            | 17.336                | 13.309             | 30.3%   | 52834.83               | 38858.56            | 36.0%
gpt2  | 64         | 1024            | OOM                   | 14.067             | /       | OOM                    | 75600.08            | /
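
The Speedup and Memory savings figures in the tables above are consistent with the ratio eager / BT minus 1, expressed as a percentage. That formula is inferred from the reported numbers rather than stated in these notes, so treat the sketch below as an assumption; `relative_gain_percent` is a hypothetical helper name.

```python
def relative_gain_percent(eager, bt):
    """Relative gain of BetterTransformer (bt) over eager mode, in percent.

    Assumed formula: (eager / bt - 1) * 100, inferred from the table values.
    """
    return (eager / bt - 1.0) * 100.0


# (eager, BT) pairs copied from the tables above.
inference_speedup = relative_gain_percent(1.800, 1.607)        # latency, gpt2, batch size 1
training_speedup = relative_gain_percent(17.732, 14.037)       # time/epoch, gpt2, batch size 8
training_memory = relative_gain_percent(13291.16, 10191.52)    # peak memory, gpt2, batch size 8

print(f"{inference_speedup:.1f}%")
print(f"{training_speedup:.1f}%")
print(f"{training_memory:.1f}%")
```

Rounded to one decimal, these three values come out to 12.0%, 26.3% and 30.4%, matching the corresponding table entries.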

Benchmarks can be reproduced using the inference script and training script:

python benchmark_bettertransformer.py --model-name gpt2 --use-half --use-cuda --is_decoder --num-batches 5 --max_token 256
python benchmark_bettertransformer.py --model-name gpt2 --use-half --use-cuda --is_decoder --num-batches 5 --max_token 256 --seqlen-stdev 0

New architectures in the ONNX export

Three additional architectures are supported in the ONNX export: ImageGPT, RegNet, OPT.

  • Adding ONNX support for ImageGPT by @adit299 in #819
  • Add ONNX support for RegNet by @asrimanth in #833
  • Adding support for Facebook's OPT models by @hivaze in #852

(WIP) TFLite export with quantization support

Work continues on the TFLite export with quantization support. It is still a work in progress and is not yet documented.

Bugfixes and improvements

New Contributors

  • @adit299 made their first contribution in #819
  • @asrimanth made their first contribution in #833
  • @hivaze made their first contribution in #852

Full Changelog: v1.2.0...v1.7.2