12 changes: 10 additions & 2 deletions docs/source/en/optimization/attention_backends.md
@@ -11,7 +11,7 @@ specific language governing permissions and limitations under the License. -->

# Attention backends

-> [!TIP]
+> [!NOTE]
> The attention dispatcher is an experimental feature. Please open an issue if you have any feedback or encounter any problems.

Diffusers provides several optimized attention algorithms that are more memory- and compute-efficient through its *attention dispatcher*. The dispatcher acts as a router for managing and switching between different attention implementations and provides a unified interface for interacting with them.
@@ -33,7 +33,7 @@ The [`~ModelMixin.set_attention_backend`] method iterates through all the module
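
For illustration, a minimal sketch of switching backends through [`~ModelMixin.set_attention_backend`] (the checkpoint, prompt, and `"flash"` backend below are assumptions for the example; `"flash"` requires the `flash-attn` package):

```py
import torch
from diffusers import FluxPipeline

# load any supported pipeline; FLUX.1-dev is used here only as an example
pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# route every attention module in the transformer through FlashAttention
pipeline.transformer.set_attention_backend("flash")

prompt = "A cat holding a sign that says hello world"
image = pipeline(prompt).images[0]
```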

The example below demonstrates how to enable the `_flash_3_hub` implementation for FlashAttention-3 from the [kernel](https://github.com/huggingface/kernels) library, which allows you to instantly use optimized compute kernels from the Hub without requiring any setup.

-> [!TIP]
+> [!NOTE]
> FlashAttention-3 is not supported for non-Hopper architectures, in which case, use FlashAttention with `set_attention_backend("flash")`.

```py
@@ -78,10 +78,16 @@ with attention_backend("_flash_3_hub"):
image = pipeline(prompt).images[0]
```

> [!TIP]
> Most attention backends support `torch.compile` without graph breaks and can be used to further speed up inference.
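
A hedged sketch of combining the two (reuses the `pipeline` and `prompt` from the example above; `fullgraph=True` assumes the chosen backend really introduces no graph breaks):

```py
import torch

pipeline.transformer.set_attention_backend("flash")

# compile the transformer; fullgraph=True surfaces any unexpected graph breaks
pipeline.transformer = torch.compile(pipeline.transformer, fullgraph=True)

image = pipeline(prompt).images[0]
```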

## Available backends

Refer to the table below for a complete list of available attention backends and their variants.

<details>
Member: May be simpler to replace the table at the beginning of the docs with this more complete one, wdyt?

Member (Author): Actually the first table is nice in the sense that it provides a consolidated overview of what's supported. So, I'd prefer to keep it.

<summary>Expand</summary>

| Backend Name | Family | Description |
|--------------|--------|-------------|
| `native` | [PyTorch native](https://docs.pytorch.org/docs/stable/generated/torch.nn.attention.SDPBackend.html#torch.nn.attention.SDPBackend) | Default backend using PyTorch's `scaled_dot_product_attention` |
@@ -104,3 +110,5 @@ Refer to the table below for a complete list of available attention backends and
| `_sage_qk_int8_pv_fp16_cuda` | [SageAttention](https://github.com/thu-ml/SageAttention) | INT8 QK + FP16 PV (CUDA) |
| `_sage_qk_int8_pv_fp16_triton` | [SageAttention](https://github.com/thu-ml/SageAttention) | INT8 QK + FP16 PV (Triton) |
| `xformers` | [xFormers](https://github.com/facebookresearch/xformers) | Memory-efficient attention |

</details>
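
Any identifier from the table can be passed to [`~ModelMixin.set_attention_backend`] or the `attention_backend` context manager shown earlier, provided the backing library is installed. A hypothetical example (assumes the `sageattention` package and the `pipeline` from the earlier sketches):

```py
# backend names come from the table above; this one requires the sageattention package
pipeline.transformer.set_attention_backend("_sage_qk_int8_pv_fp16_cuda")
image = pipeline(prompt).images[0]
```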