
[agents docs] add pipelines.md etc#13567

Open
yiyixuxu wants to merge 4 commits into main from docs/ace-step-review-learnings

Conversation

@yiyixuxu
Collaborator

No description provided.

- Add .ai/pipelines.md: pipeline conventions and gotchas (config-derived
  values, no_grad discipline, reinventing scheduler logic, subclassing
  variants, # Copied from annotations).
- models.md: add Attention masks subsection inside Attention pattern;
  fold reference-implementations skim into conventions; consolidate
  __init__.py / _import_structure gotchas; trim gotchas covered by
  AGENTS.md (silent fallbacks, config serialization gap) or pipelines.md
  (no_grad, guider/scheduler reuse).
- review-rules.md: collapse to a short reviewer checklist that points
  into AGENTS / models / pipelines / modular gotchas; only LLM-specific
  pattern (ephemeral context) lives here directly.
- AGENTS.md: collapse defensive-code / unused-params / backwards-compat
  / deprecation rules into one umbrella bullet; replace inline pipeline
  bullet list with a pointer to pipelines.md.
- SKILL.md (model-integration): trim pre-PR self-review to a one-line
  pointer.

Sourced from the ACE-Step PR (#13095) review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread .ai/models.md
## Common model conventions

- Models use `ModelMixin` with `register_to_config` for config serialization
When adding a new transformer (or reviewing one), skim `transformer_flux.py`, `transformer_flux2.py`, `transformer_qwenimage.py`, `transformer_wan.py` first to establish the pattern. Most conventions (mixin set, file structure, naming, gradient-checkpointing implementation, `_no_split_modules` / `_supports_cache_class` settings, etc.) are easiest to internalize by comparison rather than from a fixed list.
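As a hedged illustration of the convention above, a minimal sketch of the `ModelMixin` + `register_to_config` layout, assuming the standard diffusers mixins; the class name and parameters here are hypothetical, not a real diffusers model:

```python
# Minimal sketch of the ModelMixin + register_to_config convention.
# `MyTransformer2DModel` and its parameters are illustrative only.
import torch
import torch.nn as nn

from diffusers.configuration_utils import ConfigMixin, register_to_config
from diffusers.models.modeling_utils import ModelMixin


class MyTransformer2DModel(ModelMixin, ConfigMixin):
    @register_to_config  # captures every __init__ arg below into config.json
    def __init__(self, in_channels: int = 4, num_layers: int = 2, hidden_dim: int = 64):
        super().__init__()
        self.proj_in = nn.Linear(in_channels, hidden_dim)
        self.blocks = nn.ModuleList(nn.Linear(hidden_dim, hidden_dim) for _ in range(num_layers))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        hidden_states = self.proj_in(hidden_states)
        for block in self.blocks:
            hidden_states = block(hidden_states)
        return hidden_states
```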
Collaborator Author

I think it works better this way: we provide the list of canonical references and explicitly ask them to derive the common conventions themselves.

Comment thread .ai/pipelines.md
@@ -0,0 +1,62 @@
# Pipeline conventions and rules
Collaborator Author

added a pipeline doc for agents

@yiyixuxu yiyixuxu requested review from sayakpaul and stevhliu April 27, 2026 06:15
@yiyixuxu yiyixuxu requested a review from dg845 April 27, 2026 06:16
@yiyixuxu yiyixuxu changed the title [agents docs] add pipelines.md and restructure review rules [agents docs] add pipelines.md etc Apr 27, 2026
Comment thread .ai/AGENTS.md
- Support `generator` parameter for reproducibility
- Use `self.progress_bar(timesteps)` for progress tracking
- Don't subclass an existing pipeline for a variant: DO NOT implement a pipeline destined for the core codebase (`src`) (e.g., `FluxImg2ImgPipeline`) by subclassing and overriding an existing pipeline class (e.g., `FluxPipeline`).
- See [pipelines.md](pipelines.md) for pipeline conventions, patterns, and gotchas.
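As a hedged sketch of the `generator` and `progress_bar` bullets above: this fragment assumes it sits inside a pipeline's `__call__`, so `self.unet` / `self.scheduler` and the loop variables are surrounding context, not a complete implementation.

```python
from diffusers.utils.torch_utils import randn_tensor

# Reproducibility: thread the user-supplied `generator` through every RNG call.
latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype)

# Progress tracking via the pipeline's own helper.
for t in self.progress_bar(timesteps):
    noise_pred = self.unet(latents, t).sample
    # Stochastic schedulers also accept `generator`, so sampling stays seedable.
    latents = self.scheduler.step(noise_pred, t, latents, generator=generator).prev_sample
```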
Member

Good call!

Comment thread .ai/models.md
## Common model conventions

- Models use `ModelMixin` with `register_to_config` for config serialization
When adding a new transformer (or reviewing one), skim `transformer_flux.py`, `transformer_flux2.py`, `transformer_qwenimage.py`, `transformer_wan.py` first to establish the pattern. Most conventions (mixin set, file structure, naming, gradient-checkpointing implementation, `_no_split_modules` / `_supports_cache_class` settings, etc.) are easiest to internalize by comparison rather than from a fixed list.
Member

Suggested change
When adding a new transformer (or reviewing one), skim `transformer_flux.py`, `transformer_flux2.py`, `transformer_qwenimage.py`, `transformer_wan.py` first to establish the pattern. Most conventions (mixin set, file structure, naming, gradient-checkpointing implementation, `_no_split_modules` / `_supports_cache_class` settings, etc.) are easiest to internalize by comparison rather than from a fixed list.
* Models use `ModelMixin` with `register_to_config` for config serialization.
* When adding a new transformer (or reviewing one), skim `src/diffusers/models/transformers/transformer_flux.py`, `src/diffusers/models/transformers/transformer_flux2.py`, `src/diffusers/models/transformers/transformer_qwenimage.py`, and `src/diffusers/models/transformers/transformer_wan.py` first to establish the pattern. Most conventions (mixin set, file structure, naming, gradient-checkpointing implementation, `_no_split_modules` settings, etc.) are easiest to internalize by comparison rather than from a fixed list.

Member

I removed _supports_cache_class because the attribute doesn't seem to exist on ModelMixin.

Comment thread .ai/models.md

- **No mask needed → pass `None`, not an all-zero tensor.** A dense 4D additive float mask of all `0.0` does no math but still hard-raises on `flash` / `_flash_3` / `_sage` (see `attention_dispatch.py:2328, 2544, 3266`). Only materialize a mask when it carries information. This is the Flux / Flux2 / Wan pattern: no mask, works on every backend, and relies on the model having been trained to tolerate consistent padding.
- **Padding mask → bool `(B, L)` or `(B, 1, 1, L)`.** Stays compatible with the `*_varlen` kernels via `_normalize_attn_mask` (`attention_dispatch.py:639`), which reduces bool masks to `cu_seqlens`. Dense additive-float masks *cannot* be reduced this way and so lose the varlen path. This is the Qwen pattern (`transformer_qwenimage.py:951`).
- **Structural mask (causal, sliding-window, band-diagonal) → dense `(1, 1, L, L)` is unavoidable.** Row-varying patterns can't be expressed as `(B, L)`. Expect SDPA/Flex-only for these layers; consider Flex's `sliding_window_mask_mod` or FA3's native `window_size=` kwarg if backend flexibility matters.
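A hedged sketch of the three cases above; shapes and lengths are illustrative, and the mask semantics (bool `True` = attend) are the convention the bullets describe:

```python
import torch

batch, seq_len, valid_len = 2, 10, 7

# Case 1: nothing to mask -> pass None, never a dense all-zero additive mask.
attention_mask = None

# Case 2: padding mask as bool (B, L): True = real token, False = padding.
# A bool mask of this shape can be reduced to cu_seqlens for the varlen
# kernels; a dense additive float mask cannot.
padding_mask = torch.zeros(batch, seq_len, dtype=torch.bool)
padding_mask[:, :valid_len] = True

# Case 3: structural (e.g. causal) patterns are row-varying, so a dense
# (1, 1, L, L) mask is unavoidable; expect SDPA/Flex-only for these layers.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))[None, None]
```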
Member

Suggested change
- **Structural mask (causal, sliding-window, band-diagonal) → dense `(1, 1, L, L)` is unavoidable.** Row-varying patterns can't be expressed as `(B, L)`. Expect SDPA/Flex-only for these layers; consider Flex's `sliding_window_mask_mod` or FA3's native `window_size=` kwarg if backend flexibility matters.
- **Structural mask (causal, sliding-window, band-diagonal) → dense `(1, 1, L, L)` is unavoidable.** Row-varying patterns can't be expressed as `(B, L)`. Expect SDPA/Flex-only for these layers; consider Flex's `sliding_window_mask_mod` or FA3's native `window_size=` kwarg if backend flexibility matters. Consult `src/diffusers/models/transformers/transformer_kandinsky.py` as a reference.

Comment thread .ai/models.md
6. **Config serialization gaps.** Every `__init__` parameter in a `ModelMixin` subclass must be captured by `register_to_config`. If you add a new param but forget to register it, `from_pretrained` will silently use the default instead of the saved value.

7. **Forgetting to update `_import_structure` and `_lazy_modules`.** The top-level `src/diffusers/__init__.py` has both -- missing either one causes partial import failures.
4. **Capability flags without matching implementation.** Class attributes like `_supports_cache_class`, `_no_split_modules`, `_supports_gradient_checkpointing`, `_supports_quantization` advertise a contract. Setting one without the corresponding logic (e.g. `_supports_gradient_checkpointing = True` with no `_set_gradient_checkpointing` method, or no `if gradient_checkpointing` branches in `forward`) means training code silently no-ops or errors deep inside `torch.utils.checkpoint`. For `_supports_cache_class` / `_no_split_modules` specifically, wrong values cause silent correctness bugs or OOM. Copy from a similar model, verify the corresponding logic is in place, and for inference-only ports just drop the flag.
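A hedged sketch of the capability-flag contract from the last gotcha above, simplified to a plain `nn.Module`: real models checkpoint full transformer blocks and let `ModelMixin.enable_gradient_checkpointing` flip the flag, but the key point is that the attribute means nothing without the matching branch in `forward`.

```python
import torch
import torch.nn as nn
import torch.utils.checkpoint


class MyBlockStack(nn.Module):
    # The flag advertises a contract...
    _supports_gradient_checkpointing = True

    def __init__(self, dim: int = 64, num_layers: int = 2):
        super().__init__()
        self.blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_layers))
        self.gradient_checkpointing = False  # flipped by enable_gradient_checkpointing()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            # ...and this branch is the matching implementation; without it,
            # the flag silently no-ops during training.
            if torch.is_grad_enabled() and self.gradient_checkpointing:
                hidden_states = torch.utils.checkpoint.checkpoint(
                    block, hidden_states, use_reentrant=False
                )
            else:
                hidden_states = block(hidden_states)
        return hidden_states
```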
Member

_supports_quantization -- we don't have this attribute.


Labels

size/M PR with diff < 200 LOC
