[Docs / BetterTransformer] Added more details about flash attention + SDPA #25265
Conversation
The documentation is not available anymore as the PR was closed or merged.
Thanks for adding these additional details! 😄
docs/source/en/perf_infer_gpu_one.md (outdated)
As of PyTorch 2.0, the attention fastpath is supported for both encoders and decoders. The list of supported architectures can be found [here](https://huggingface.co/docs/optimum/bettertransformer/overview#supported-models).

For decoder-based models (e.g. GPT, T5, Llama, etc.), the `BetterTransformer` API will convert all attention operations to use the [`torch.nn.functional.scaled_dot_product_attention` method](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention) (SDPA), which is available only from PyTorch 2.0 onwards.
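For reference, a minimal sketch of what that paragraph describes in practice, assuming PyTorch >= 2.0, the `optimum` package installed, and an illustrative `gpt2` checkpoint (the checkpoint is not part of this PR):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.float16).to("cuda")

# Convert the attention layers to the BetterTransformer fastpath, which dispatches
# decoder attention to torch.nn.functional.scaled_dot_product_attention
model = model.to_bettertransformer()

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```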
Same comments for the rest of this section as in perf_infer_gpu_many.md
(you can probably copy the changes over) :)
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Thanks a lot for the extensive review @stevhliu! 🎉
Thanks a lot, that is much better.
I'll release on Optimum side to include huggingface/optimum#1225 that allows training with encoder models + SDPA as well.
It could be worth noting that a few models (Falcon, M4) start to have native SDPA support in transformers (but they may not dispatch to flash), see these discussions:
For encoder models, the [`~PreTrainedModel.reverse_bettertransformer`] method reverts to the original model, which should be used before saving the model in order to use the canonical transformers modeling:

```python
model = model.reverse_bettertransformer()
model.save_pretrained("saved_model")
```
I think we should not make the distinction between encoder / decoder models when it comes to using `reverse_bettertransformer`.
For example, for encoder-decoder models (e.g. T5), both SDPA (in the decoder) and nested tensors (in the encoder) are used. So in case one wants to save the model, they'll need to use `reverse_bettertransformer`.
To me the distinction is more that you can get speedups for inference with encoder models (since nested tensors are used), while for decoder models the speedup / dispatch to flash will only come (in PyTorch 2.0) for training, and for inference only with batch size = 1.
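A small sketch of the save path described in this comment, assuming an illustrative `t5-small` checkpoint; the point is only that `reverse_bettertransformer` is called before `save_pretrained` regardless of architecture:

```python
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# After conversion, the encoder uses nested tensors and the decoder uses SDPA
model = model.to_bettertransformer()

# ... run training or inference with the converted model ...

# Revert to the canonical transformers modeling code before saving
model = model.reverse_bettertransformer()
model.save_pretrained("saved_model")
```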
Thanks for the suggestion! I refactored that section a bit and removed the `reverse_bettertransformer` part, as it is relevant only for training (that section is for inference only).
# Use it for training or inference
```

SDPA can also call [Flash-Attention](https://arxiv.org/abs/2205.14135) kernels under the hood. If you want to force the usage of Flash Attention, use [`torch.backends.cuda.sdp_kernel(enable_flash=True)`](https://pytorch.org/docs/master/backends.html#torch.backends.cuda.sdp_kernel):
`torch.backends.cuda.sdp_kernel(enable_flash=True)` is not enough. You need `torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False)`, as below.
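For illustration, a hedged sketch of that corrected call used as a context manager around generation (reusing the `model`/`inputs` setup from the snippets above, which are assumptions rather than part of this PR):

```python
import torch

# Disabling the math and memory-efficient backends leaves only the Flash Attention
# kernel available; if Flash Attention cannot be used for the given inputs, SDPA
# raises an error instead of silently falling back.
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    output = model.generate(**inputs, max_new_tokens=20)
```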
Co-authored-by: fxmarty <9808326+fxmarty@users.noreply.github.com>
Co-authored-by: fxmarty <9808326+fxmarty@users.noreply.github.com>
Looks awesome! I added some minor comments to make it a bit easier to read, and if you could also copy the changes from `perf_infer_gpu_many` to their corresponding sections in `perf_infer_gpu_one`, that'd be great 🤗
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Thanks for working on this! 🚀
…ion + SDPA (huggingface#25265)

* added more details about flash attention
* correct and add more details
* Apply suggestions from code review (Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>)
* few modifs
* more details
* up
* Apply suggestions from code review (Co-authored-by: fxmarty <9808326+fxmarty@users.noreply.github.com>)
* adapt from suggestion
* Apply suggestions from code review (Co-authored-by: fxmarty <9808326+fxmarty@users.noreply.github.com>)
* trigger CI
* Apply suggestions from code review (Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>)
* fix nits and copies
* add new section

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: fxmarty <9808326+fxmarty@users.noreply.github.com>
What does this PR do?
as discussed offline with @LysandreJik
This PR clarifies to users how it is possible to use Flash Attention as a backend for the most-used models in transformers. We have seen some questions from users asking whether it is possible to integrate Flash Attention into HF models, whereas you can already benefit from it by calling `model.to_bettertransformer()`, which leverages the `BetterTransformer` API from 🤗 optimum. The information is based on the official documentation of `torch.nn.functional.scaled_dot_product_attention`.
In the near future, we could also have a small blogpost explaining this as well
To do list / To clarify list:
Let me know if I missed anything else
cc @fxmarty @MKhalusova @stevhliu