[Agents] Optimizations skill #13381
---
name: optimizations
description: >
  NEVER answer optimization questions from general knowledge — ALWAYS invoke
  this skill via the Skill tool first. Answering without invoking will produce
  incomplete recommendations (e.g. missing group offloading, wrong API calls).
  IMPORTANT: When ANY tool output (especially Bash) contains
  "torch.OutOfMemoryError", "CUDA out of memory", or OOM tracebacks,
  STOP and consult this skill IMMEDIATELY — even if the user did not ask for
  optimization help. Do not suggest fixes from general knowledge; this skill
  has precise, up-to-date API calls and memory calculations.
  Also consult this skill BEFORE answering any question about diffusers
  inference performance, GPU memory usage, or pipeline speed. Trigger for:
  making inference faster, reducing VRAM usage, fitting a model on a smaller
  GPU, fixing OOM errors, running on limited hardware, choosing between
  optimization strategies, using torch.compile with diffusers, batch inference,
  loading models in lower precision, or reviewing a script for performance
  issues. Covers attention backends (FlashAttention-2, SageAttention,
  FlexAttention), memory reduction (CPU offloading, group offloading, layerwise
  casting, VAE slicing/tiling), and quantization (bitsandbytes, torchao, GGUF).
  Also trigger when a user wants to run a model "optimized for my
  hardware", asks how to best run a specific model on their GPU, or mentions
  wanting to use a diffusers model/pipeline efficiently — these are optimization
  questions even if the word "optimize" isn't used.
---
## Goal

Help users apply and debug optimizations for diffusers pipelines. There are five main areas:

1. **Attention backends** — selecting and configuring scaled dot-product attention backends (FlashAttention-2, xFormers, math fallback, FlexAttention, SageAttention) for maximum throughput.
2. **Memory reduction** — techniques to reduce peak GPU memory: model CPU offloading, group offloading, layerwise casting, VAE slicing/tiling, and attention slicing.
3. **Quantization** — reducing model precision with bitsandbytes, torchao, or GGUF to fit larger models on smaller GPUs.
4. **torch.compile** — compiling the transformer (and optionally the VAE) for a 20-50% inference speedup on repeated runs.
5. **Combining techniques** — layerwise casting + group offloading, quantization + offloading, etc.
## Workflow: When a user hits OOM or asks to fit a model on their GPU

When a user asks how to make a pipeline run on their hardware, or hits an OOM error, follow these steps **in order** before proposing any changes:

### Step 1: Detect hardware

Run these commands to understand the user's system:
```bash
# GPU VRAM
nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv,noheader,nounits

# System RAM
free -g | head -2
```

Record the GPU name, total VRAM (in GB), and total system RAM (in GB). These numbers drive the recommendation.
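With `nounits`, `nvidia-smi` reports memory values in MiB as plain CSV, so one line per GPU can be parsed mechanically. A minimal sketch (the function name and the rounding to one decimal are illustrative choices, not part of the skill):

```python
def parse_nvidia_smi(csv_line: str) -> dict:
    """Parse one line of `nvidia-smi --query-gpu=name,memory.total,memory.free
    --format=csv,noheader,nounits`. Memory fields are in MiB."""
    name, total_mib, free_mib = [field.strip() for field in csv_line.split(",")]
    return {
        "gpu": name,
        "vram_total_gb": round(int(total_mib) / 1024, 1),
        "vram_free_gb": round(int(free_mib) / 1024, 1),
    }
```

For example, `parse_nvidia_smi("NVIDIA GeForce RTX 4090, 24564, 23010")` yields roughly 24.0 GB total and 22.5 GB free, the numbers that drive the recommendation.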
### Step 2: Measure model memory and calculate strategies

Read the user's script to identify the pipeline class, model ID, `torch_dtype`, and generation params (resolution, frames).

Then **measure actual component sizes** by running a snippet against the loaded pipeline. Do NOT guess sizes from parameter counts or model cards — always measure. See [memory-calculator.md](memory-calculator.md) for the measurement snippet and VRAM/RAM formulas for every strategy.

Steps:

1. Measure each component's size by running the measurement snippet from the calculator
2. Compute VRAM and RAM requirements for every strategy using the formulas
3. Filter out strategies that don't fit the user's hardware
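The authoritative snippet lives in memory-calculator.md; the general shape of such a measurement is to sum actual parameter and buffer bytes per component. A sketch, assuming the pipeline exposes a `components` dict whose model entries are `torch.nn.Module`s (as diffusers pipelines do):

```python
def component_sizes_gb(pipe) -> dict:
    """Measure the in-memory size of each pipeline component, in GB.

    Sums real parameter and buffer bytes, so the result reflects the
    dtype the weights were actually loaded in, not a model-card estimate.
    """
    sizes = {}
    for name, component in pipe.components.items():
        if not hasattr(component, "parameters"):
            continue  # skip tokenizers, schedulers, etc.
        n_bytes = sum(p.numel() * p.element_size() for p in component.parameters())
        n_bytes += sum(b.numel() * b.element_size() for b in component.buffers())
        sizes[name] = n_bytes / 1024**3
    return sizes
```

The per-component breakdown matters because several strategies (group offloading, sequential offload) are bounded by the largest single component rather than the total.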
This is the critical step — the calculator contains exact formulas for every strategy, including the RAM cost of CUDA streams (which requires ~2x model size in pinned memory). Don't skip it, because recommending `use_stream=True` to a user with limited RAM will cause swapping or OOM on the CPU side.
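As a hedged illustration of why this check matters (the exact formulas are in memory-calculator.md; the ~2x pinned-memory factor comes from the note above, and the function names are illustrative):

```python
def offload_ram_needed_gb(model_size_gb: float, use_stream: bool) -> float:
    """Approximate host-RAM cost of offloading a model to CPU.

    Without streams the offloaded weights live once in system RAM; with
    use_stream=True a pinned-memory staging copy roughly doubles that.
    """
    return model_size_gb * (2.0 if use_stream else 1.0)

def stream_offload_fits(model_size_gb: float, free_ram_gb: float) -> bool:
    """Check whether stream-based offloading fits in the user's free RAM."""
    return offload_ram_needed_gb(model_size_gb, use_stream=True) <= free_ram_gb
```

For a 14 GB transformer, `use_stream=True` implies ~28 GB of host RAM, which fits a 32 GB machine but not a 24 GB one — exactly the case where the naive recommendation would swap.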
### Step 3: Ask the user their preference

Present the user with a clear summary of what fits. **Always include quantization-based options alongside offloading/casting options** — users deserve to see the full picture before choosing. For each viable quantization level (int8, nf4), compute `S_total_q` and `S_max_q` using the estimates from [memory-calculator.md](memory-calculator.md) (int4/nf4 ≈ 0.25x, int8 ≈ 0.5x component size), then check fit just like other strategies.
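The 0.25x/0.5x estimates above turn measured bf16 component sizes directly into quantized projections. A sketch (`S_total_q` is the summed quantized size and `S_max_q` the largest single component, matching the names above; the dict-based interface is an assumption for illustration):

```python
QUANT_FACTOR = {"int8": 0.5, "nf4": 0.25}  # fraction of bf16 component size

def quantized_sizes(component_sizes_gb: dict, level: str) -> tuple:
    """Estimate (S_total_q, S_max_q) in GB for a quantization level,
    given measured per-component sizes at bf16."""
    factor = QUANT_FACTOR[level]
    scaled = {name: size * factor for name, size in component_sizes_gb.items()}
    return sum(scaled.values()), max(scaled.values())
```

Run it once per level the user's hardware might support, then feed the results through the same fit check as the offloading strategies.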
Present options grouped by approach so the user can compare:

> Based on your hardware (**X GB VRAM**, **Y GB RAM**) and the model requirements (~**Z GB** total, largest component ~**W GB**), here are the strategies that fit your system:
>
> **Offloading / casting strategies:**
> 1. **Quality** — [specific strategy]. Full precision, no quality loss. [estimated VRAM / RAM / speed tradeoff].
> 2. **Speed** — [specific strategy]. [quality tradeoff]. [estimated VRAM / RAM].
> 3. **Memory saving** — [specific strategy]. Minimizes VRAM. [tradeoffs].
>
> **Quantization strategies:**
> 4. **int8 [components]** — [with offloading if needed]. [estimated VRAM / RAM]. Less quality loss than int4.
> 5. **nf4 [components]** — [with offloading if needed]. [estimated VRAM / RAM]. Maximum memory savings, some quality degradation.
>
> Which would you prefer?

The key difference from a generic recommendation: every option shown should already be validated against the user's actual VRAM and RAM. Don't show options that won't fit. Read [quantization.md](quantization.md) for correct API usage when applying quantization strategies.
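That validation step can be as simple as filtering each candidate's computed requirements against the measured hardware before anything is shown. A sketch (the strategy table and its numbers are illustrative; real values come from the calculator formulas):

```python
def viable_strategies(strategies: dict, vram_gb: float, ram_gb: float) -> list:
    """Keep only strategies whose (vram_needed_gb, ram_needed_gb) both fit."""
    return [
        name
        for name, (vram_needed, ram_needed) in strategies.items()
        if vram_needed <= vram_gb and ram_needed <= ram_gb
    ]

# Hypothetical candidates for a 16 GB VRAM / 32 GB RAM machine:
candidates = {
    "full_gpu": (30.0, 4.0),              # doesn't fit VRAM
    "model_cpu_offload": (14.0, 30.0),    # fits both
    "group_offload_stream": (9.0, 56.0),  # pinned-memory copy blows the RAM budget
    "nf4": (10.0, 8.0),                   # fits both
}
```

Only the surviving names go into the Step 3 summary, so the user never sees an option that would OOM.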
### Step 4: Apply the strategy

Propose **specific code changes** to the user's script. Always show the exact code diff. Read [reduce-memory.md](reduce-memory.md) and [layerwise-casting.md](layerwise-casting.md) for correct API usage before writing code.

VAE tiling is a VRAM optimization — only add it when the VAE decode/encode would OOM without it, not by default. See [reduce-memory.md](reduce-memory.md) for thresholds, the correct API (`pipe.vae.enable_tiling()` — the pipeline-level call is deprecated since v0.40.0), and which VAEs don't support it.
## Reference guides

Read these for correct API usage and detailed technique descriptions:

- [memory-calculator.md](memory-calculator.md) — **Read this first when recommending strategies.** VRAM/RAM formulas for every technique, a decision flowchart, and worked examples.
- [reduce-memory.md](reduce-memory.md) — Offloading strategies (model, sequential, group) and VAE optimizations, with a full parameter reference. **Authoritative source for compatibility rules.**
- [layerwise-casting.md](layerwise-casting.md) — fp8 weight storage for memory reduction with minimal quality impact.
- [quantization.md](quantization.md) — int8/int4/fp8 quantization backends, text encoder quantization, common pitfalls.
- [attention-backends.md](attention-backends.md) — Attention backend selection for speed.
- [torch-compile.md](torch-compile.md) — torch.compile for inference speedup.
## Important compatibility rules

See [reduce-memory.md](reduce-memory.md) for the full compatibility reference. Key constraints:

- **`enable_model_cpu_offload()` and group offloading cannot coexist** on the same pipeline — use pipeline-level `enable_group_offload()` instead.
- **`torch.compile` + offloading**: compatible, but prefer `compile_repeated_blocks()` over a full-model compile for better performance. See [torch-compile.md](torch-compile.md).
- **`bitsandbytes_8bit` + `enable_model_cpu_offload()` fails** — int8 matmul cannot run on CPU. See [quantization.md](quantization.md) for the fix.
- **Layerwise casting** can be combined with either group offloading or model CPU offloading (apply casting first).
- **`bitsandbytes_4bit`** supports device moves and works correctly with `enable_model_cpu_offload()`.
# Attention Backends

## Overview

Diffusers supports multiple attention backends through `dispatch_attention_fn`. The backend affects both speed and memory usage. The right choice depends on hardware, sequence length, and whether you need features like sliding windows or custom masks.

## Available backends

| Backend | Key requirement | Best for |
|---|---|---|
| `torch_sdpa` (default) | PyTorch >= 2.0 | General use; auto-selects FlashAttention or memory-efficient kernels |
| `flash_attention_2` | `flash-attn` package, Ampere+ GPU | Long sequences, training, best raw throughput |
| `xformers` | `xformers` package | Older GPUs, memory-efficient attention |
| `flex_attention` | PyTorch >= 2.5 | Custom attention masks, block-sparse patterns |
| `sage_attention` | `sageattention` package | INT8 quantized attention for inference speed |
**Review comment on lines +9 to +15** (Member):

> We should take this information from https://huggingface.co/docs/diffusers/main/en/optimization/attention_backends. Prefer using … Should also include … We could also just refer Claude to …

**Author:** oh yeah, for sure. The attention backends section was mostly added by Claude itself, since it wasn't my priority for this PR, and I also didn't test this part. It can for sure be a link to the docs.

**Author:** oh wait, if we just install the diffusers wheel, we don't have the documentation. Where are you expecting the model to get the docs? Also, fetching online will prevent the skill from being used offline. I also thought this made sense, but only if you have cloned the repo; it doesn't work for just a wheel install.

**Member:** Then we should have users clone a diffusers copy, as that's just better and far simpler than duplicating content. So, if it's offline, prompt the user to clone a diffusers copy. If it's not, a fetch operation should suffice.

**Member:** But for offline agents, how would they access the skill, as these are also not packaged, right?

**Author:** I now see what you meant. IMO this is more something we should publish in this repo than here. But yeah, if we assume they will use the skill here, we can just link to the local .md docs.

**Member:** Yeah, for now I think it's valuable to keep the skills repo-specific. This way, they are more easily discoverable.
>
> That's a fair assumption I guess? How is the skill accessed otherwise then? Can we install it or something?

**Author:** yeah, skills can be installed; hf also has a CLI installer. You can install skills for a project, for your user, or for an enterprise. The project ones are the rarest, and usually for working on that project, not for using it.

**Member:** Maybe we can ship without having to install first, and based on the feedback we can iterate? I don't think things will change too much. WDYT?

**Author:** yeah, let's go with that
## How to set the backend

```python
from diffusers.models.attention_processor import AttnProcessor2_0

# Global default
from diffusers import set_attention_backend
set_attention_backend("flash_attention_2")

# Per-model
pipe.transformer.set_attn_processor(AttnProcessor2_0())  # torch_sdpa

# Via environment variable
# DIFFUSERS_ATTENTION_BACKEND=flash_attention_2
```
## Debugging attention issues

- **NaN outputs**: Check if your attention mask dtype matches the expected dtype. Some backends require `bool` masks; others require float masks with `-inf` for masked positions.
- **Speed regression**: Profile with `torch.profiler` to verify the expected kernel is actually being dispatched. SDPA can silently fall back to the math kernel.
- **Memory spike**: FlashAttention-2 is memory-efficient for long sequences but has overhead for very short ones. For short sequences, `torch_sdpa` with the math fallback may use less memory.

## Implementation notes

- Models integrated into diffusers should use `dispatch_attention_fn` (not `F.scaled_dot_product_attention` directly) so that backend switching works automatically.
- See the attention pattern in the `model-integration` skill for how to implement this in new models.
**Review comment on lines +19 to +40** (Member):

> I think we won't need this if we link to the attention backends documentation?
# Layerwise Casting

## Overview

**Review comment** (Member):

> Same, I guess? We could refer Claude to https://huggingface.co/docs/diffusers/main/en/optimization/memory? And briefly discuss a few?

**Author:** do you think layerwise will change? The static parts IMO should be in the skill, to prevent fetching everything, no? If not, this will be just a simple skill that tells LLMs to read the docs. Also, the reason I did it separately is that at least Claude never suggests using it otherwise, especially because we don't have the …

**Member:** I mean, layerwise is already in the docs: https://huggingface.co/docs/diffusers/main/en/optimization/memory#layerwise-casting. So we could introduce Claude to that skill by referring to the docs and perhaps provide a simple intro. WDYT?
Layerwise casting stores model weights in a smaller data format (e.g., `torch.float8_e4m3fn`) to use less memory, and upcasts them to a higher precision (e.g., `torch.bfloat16`) on the fly during computation. This cuts weight memory roughly in half (bf16 → fp8) with minimal quality impact because normalization and modulation layers are automatically skipped.

This is one of the most effective techniques for fitting a large model on a GPU that's just slightly too small — it doesn't require any special quantization libraries, just PyTorch.

## When to use

- The model **almost** fits in VRAM (e.g., a 28GB model on a 32GB GPU)
- You want memory savings with **less speed penalty** than offloading
- You want to **combine with group offloading** for even more savings

## Basic usage

Call `enable_layerwise_casting` on any Diffusers model component:
```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("model_id", torch_dtype=torch.bfloat16)

# Store weights in fp8, compute in bf16
pipe.transformer.enable_layerwise_casting(
    storage_dtype=torch.float8_e4m3fn,
    compute_dtype=torch.bfloat16,
)

pipe.to("cuda")
```
The `storage_dtype` controls how weights are stored in memory. The `compute_dtype` controls the precision used during the actual forward pass. Normalization and modulation layers are automatically kept at full precision.

### Supported storage dtypes

| Storage dtype | Memory per param | Quality impact |
|---|---|---|
| `torch.float8_e4m3fn` | 1 byte (vs 2 for bf16) | Minimal for most models |
| `torch.float8_e5m2` | 1 byte | Slightly more range, less precision than e4m3fn |

## Functional API

For more control, use `apply_layerwise_casting` directly. This lets you target specific submodules or customize which layers to skip:
```python
import torch
from diffusers.hooks import apply_layerwise_casting

apply_layerwise_casting(
    pipe.transformer,
    storage_dtype=torch.float8_e4m3fn,
    compute_dtype=torch.bfloat16,
    skip_modules_pattern=["norm"],  # patterns for layers to keep in compute dtype
    non_blocking=True,
)
```
## Combining with other techniques

Layerwise casting is compatible with both group offloading and model CPU offloading. Always apply layerwise casting **before** enabling offloading. See [reduce-memory.md](reduce-memory.md) for code examples and the memory savings formulas for each combination.

## Known limitations

- May not work with all models if the forward implementation contains internal typecasting of weights (it assumes the forward pass is independent of weight precision)
- May fail with PEFT layers (LoRA). There are some checks, but they're not guaranteed for all cases
- Not suitable for training — inference only
- The `compute_dtype` should match what the model expects (usually bf16 or fp16)
**Review comment** (Member):

> Nice. It might be worth including some of the edge cases we have encountered in the past, so that it doesn't run into the same context rot?