From eb3e88cb4ed43c3bcde9bc20b4df3c48bb376e75 Mon Sep 17 00:00:00 2001
From: sayakpaul
Date: Fri, 26 Sep 2025 11:07:33 +0530
Subject: [PATCH 1/2] slight edits to the attention backends docs.

---
 docs/source/en/optimization/attention_backends.md | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/docs/source/en/optimization/attention_backends.md b/docs/source/en/optimization/attention_backends.md
index 04c8b4ba921c..6ecde66a22c0 100644
--- a/docs/source/en/optimization/attention_backends.md
+++ b/docs/source/en/optimization/attention_backends.md
@@ -11,7 +11,7 @@ specific language governing permissions and limitations under the License.
 -->
 
 # Attention backends
-> [!TIP]
+> [!NOTE]
 > The attention dispatcher is an experimental feature. Please open an issue if you have any feedback or encounter any problems.
 
 Diffusers provides several optimized attention algorithms that are more memory and computationally efficient through it's *attention dispatcher*. The dispatcher acts as a router for managing and switching between different attention implementations and provides a unified interface for interacting with them.
@@ -33,7 +33,7 @@ The [`~ModelMixin.set_attention_backend`] method iterates through all the module
 
 The example below demonstrates how to enable the `_flash_3_hub` implementation for FlashAttention-3 from the [kernel](https://github.com/huggingface/kernels) library, which allows you to instantly use optimized compute kernels from the Hub without requiring any setup.
 
-> [!TIP]
+> [!NOTE]
 > FlashAttention-3 is not supported for non-Hopper architectures, in which case, use FlashAttention with `set_attention_backend("flash")`.
 
 ```py
@@ -78,10 +78,16 @@ with attention_backend("_flash_3_hub"):
     image = pipeline(prompt).images[0]
 ```
 
+> [!TIP]
+> Most of these attention backends come with `torch.compile` compatibility without any graph breaks. Consider using it for maximum speedups.
+
 ## Available backends
 
 Refer to the table below for a complete list of available attention backends and their variants.
 
+<details>
+<summary>Expand</summary>
+
 | Backend Name | Family | Description |
 |--------------|--------|-------------|
 | `native` | [PyTorch native](https://docs.pytorch.org/docs/stable/generated/torch.nn.attention.SDPBackend.html#torch.nn.attention.SDPBackend) | Default backend using PyTorch's scaled_dot_product_attention |
@@ -104,3 +110,5 @@ Refer to the table below for a complete list of available attention backends and
 | `_sage_qk_int8_pv_fp16_cuda` | [SageAttention](https://github.com/thu-ml/SageAttention) | INT8 QK + FP16 PV (CUDA) |
 | `_sage_qk_int8_pv_fp16_triton` | [SageAttention](https://github.com/thu-ml/SageAttention) | INT8 QK + FP16 PV (Triton) |
 | `xformers` | [xFormers](https://github.com/facebookresearch/xformers) | Memory-efficient attention |
+
+</details>
\ No newline at end of file

From 2b5f4be13165a9174cfcaaffb75d4589d36c7bef Mon Sep 17 00:00:00 2001
From: Sayak Paul
Date: Fri, 26 Sep 2025 21:24:47 +0530
Subject: [PATCH 2/2] Update docs/source/en/optimization/attention_backends.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
 docs/source/en/optimization/attention_backends.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/en/optimization/attention_backends.md b/docs/source/en/optimization/attention_backends.md
index 6ecde66a22c0..e603878a6383 100644
--- a/docs/source/en/optimization/attention_backends.md
+++ b/docs/source/en/optimization/attention_backends.md
@@ -79,7 +79,7 @@ with attention_backend("_flash_3_hub"):
 ```
 
 > [!TIP]
-> Most of these attention backends come with `torch.compile` compatibility without any graph breaks. Consider using it for maximum speedups.
+> Most attention backends support `torch.compile` without graph breaks and can be used to further speed up inference.
 
 ## Available backends
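
Note for reviewers (not part of the patches above): a minimal sketch of how the reworded `torch.compile` tip could be exercised together with `set_attention_backend`. The checkpoint, dtype, and prompt are placeholders, and the `flash` backend is used here because FlashAttention-3 requires Hopper GPUs, as the page itself notes.

```py
# Sketch: pair an attention backend with torch.compile for extra speedups.
# The checkpoint and prompt are illustrative; adjust them to your setup.
import torch
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16
).to("cuda")

# Route every attention module in the transformer to FlashAttention
# (use "_flash_3_hub" instead on Hopper GPUs, as shown in the doc).
pipeline.transformer.set_attention_backend("flash")

# Most backends run under torch.compile without graph breaks,
# so the transformer can be compiled for an additional speedup.
pipeline.transformer.compile(fullgraph=True)

image = pipeline("A cat holding a sign that says hello").images[0]
```

Compiling only the transformer, rather than the whole pipeline, keeps compile time manageable while still covering the attention-heavy denoising loop that these backends accelerate.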