
[agents docs] add pipelines.md etc#13567

Open
yiyixuxu wants to merge 4 commits into main from docs/ace-step-review-learnings

Conversation

@yiyixuxu
Collaborator

No description provided.

- Add .ai/pipelines.md: pipeline conventions and gotchas (config-derived
  values, no_grad discipline, reinventing scheduler logic, subclassing
  variants, # Copied from annotations).
- models.md: add Attention masks subsection inside Attention pattern;
  fold reference-implementations skim into conventions; consolidate
  __init__.py / _import_structure gotchas; trim gotchas covered by
  AGENTS.md (silent fallbacks, config serialization gap) or pipelines.md
  (no_grad, guider/scheduler reuse).
- review-rules.md: collapse to a short reviewer checklist that points
  into AGENTS / models / pipelines / modular gotchas; only LLM-specific
  pattern (ephemeral context) lives here directly.
- AGENTS.md: collapse defensive-code / unused-params / backwards-compat
  / deprecation rules into one umbrella bullet; replace inline pipeline
  bullet list with a pointer to pipelines.md.
- SKILL.md (model-integration): trim pre-PR self-review to a one-line
  pointer.

Sourced from the ACE-Step PR (#13095) review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread .ai/models.md
## Common model conventions

- Models use `ModelMixin` with `register_to_config` for config serialization
When adding a new transformer (or reviewing one), skim `transformer_flux.py`, `transformer_flux2.py`, `transformer_qwenimage.py`, `transformer_wan.py` first to establish the pattern. Most conventions (mixin set, file structure, naming, gradient-checkpointing implementation, `_no_split_modules` / `_supports_cache_class` settings, etc.) are easiest to internalize by comparison rather than from a fixed list.
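As a hedged illustration of the convention above, a minimal sketch of the `ModelMixin` + `register_to_config` layout, assuming the standard diffusers mixins; the class name and parameters here are hypothetical, not a real diffusers model:

```python
# Minimal sketch of the ModelMixin + register_to_config convention.
# `MyTransformer2DModel` and its parameters are illustrative only.
import torch
import torch.nn as nn

from diffusers.configuration_utils import ConfigMixin, register_to_config
from diffusers.models.modeling_utils import ModelMixin


class MyTransformer2DModel(ModelMixin, ConfigMixin):
    @register_to_config  # captures every __init__ arg below into config.json
    def __init__(self, in_channels: int = 4, num_layers: int = 2, hidden_dim: int = 64):
        super().__init__()
        self.proj_in = nn.Linear(in_channels, hidden_dim)
        self.blocks = nn.ModuleList(nn.Linear(hidden_dim, hidden_dim) for _ in range(num_layers))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        hidden_states = self.proj_in(hidden_states)
        for block in self.blocks:
            hidden_states = block(hidden_states)
        return hidden_states
```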
Collaborator Author

I think it works better this way: we provide the list of canonical references and explicitly ask them to derive the common conventions themselves.

Comment thread .ai/pipelines.md
@@ -0,0 +1,62 @@
# Pipeline conventions and rules
Collaborator Author

added a pipeline doc for agents

@yiyixuxu yiyixuxu requested review from sayakpaul and stevhliu April 27, 2026 06:15
@yiyixuxu yiyixuxu requested a review from dg845 April 27, 2026 06:16
@yiyixuxu yiyixuxu changed the title [agents docs] add pipelines.md and restructure review rules [agents docs] add pipelines.md etc Apr 27, 2026
Comment thread .ai/AGENTS.md
- Support `generator` parameter for reproducibility
- Use `self.progress_bar(timesteps)` for progress tracking
- Don't subclass an existing pipeline for a variant: DO NOT implement a pipeline destined for the core codebase (`src`) (e.g., `FluxImg2ImgPipeline`) by subclassing and overriding an existing pipeline class (e.g., `FluxPipeline`).
- See [pipelines.md](pipelines.md) for pipeline conventions, patterns, and gotchas.
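As a hedged sketch of the `generator` and `progress_bar` bullets above: this fragment assumes it sits inside a pipeline's `__call__`, so `self.unet` / `self.scheduler` and the loop variables are surrounding context, not a complete implementation.

```python
from diffusers.utils.torch_utils import randn_tensor

# Reproducibility: thread the user-supplied `generator` through every RNG call.
latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype)

# Progress tracking via the pipeline's own helper.
for t in self.progress_bar(timesteps):
    noise_pred = self.unet(latents, t).sample
    # Stochastic schedulers also accept `generator`, so sampling stays seedable.
    latents = self.scheduler.step(noise_pred, t, latents, generator=generator).prev_sample
```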
Member

Good call!

Comment thread .ai/models.md
## Common model conventions

- Models use `ModelMixin` with `register_to_config` for config serialization
When adding a new transformer (or reviewing one), skim `transformer_flux.py`, `transformer_flux2.py`, `transformer_qwenimage.py`, `transformer_wan.py` first to establish the pattern. Most conventions (mixin set, file structure, naming, gradient-checkpointing implementation, `_no_split_modules` / `_supports_cache_class` settings, etc.) are easiest to internalize by comparison rather than from a fixed list.
Member

Suggested change
When adding a new transformer (or reviewing one), skim `transformer_flux.py`, `transformer_flux2.py`, `transformer_qwenimage.py`, `transformer_wan.py` first to establish the pattern. Most conventions (mixin set, file structure, naming, gradient-checkpointing implementation, `_no_split_modules` / `_supports_cache_class` settings, etc.) are easiest to internalize by comparison rather than from a fixed list.
* Models use `ModelMixin` with `register_to_config` for config serialization.
* When adding a new transformer (or reviewing one), skim `src/diffusers/models/transformers/transformer_flux.py`, `src/diffusers/models/transformers/transformer_flux2.py`, `src/diffusers/models/transformers/transformer_qwenimage.py`, and `src/diffusers/models/transformers/transformer_wan.py` first to establish the pattern. Most conventions (mixin set, file structure, naming, gradient-checkpointing implementation, `_no_split_modules` settings, etc.) are easiest to internalize by comparison rather than from a fixed list.

Member

I removed _supports_cache_class because the attribute doesn't seem to exist on ModelMixin.

Comment thread .ai/models.md

- **No mask needed → pass `None`, not an all-zero tensor.** A dense 4D additive float mask of all `0.0` does no math but still hard-raises on `flash` / `_flash_3` / `_sage` (see `attention_dispatch.py:2328, 2544, 3266`). Only materialize a mask when it carries information. This is the Flux / Flux2 / Wan pattern: no mask, works on every backend, and relies on the model having been trained to tolerate consistent padding.
- **Padding mask → bool `(B, L)` or `(B, 1, 1, L)`.** Stays compatible with the `*_varlen` kernels via `_normalize_attn_mask` (`attention_dispatch.py:639`), which reduces bool masks to `cu_seqlens`. Dense additive-float masks *cannot* be reduced this way and so lose the varlen path. This is the Qwen pattern (`transformer_qwenimage.py:951`).
- **Structural mask (causal, sliding-window, band-diagonal) → dense `(1, 1, L, L)` is unavoidable.** Row-varying patterns can't be expressed as `(B, L)`. Expect SDPA/Flex-only for these layers; consider Flex's `sliding_window_mask_mod` or FA3's native `window_size=` kwarg if backend flexibility matters.
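A hedged sketch of the three cases above; shapes and lengths are illustrative, and the mask semantics (bool `True` = attend) are the convention the bullets describe:

```python
import torch

batch, seq_len, valid_len = 2, 10, 7

# Case 1: nothing to mask -> pass None, never a dense all-zero additive mask.
attention_mask = None

# Case 2: padding mask as bool (B, L): True = real token, False = padding.
# A bool mask of this shape can be reduced to cu_seqlens for the varlen
# kernels; a dense additive float mask cannot.
padding_mask = torch.zeros(batch, seq_len, dtype=torch.bool)
padding_mask[:, :valid_len] = True

# Case 3: structural (e.g. causal) patterns are row-varying, so a dense
# (1, 1, L, L) mask is unavoidable; expect SDPA/Flex-only for these layers.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))[None, None]
```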
Member

Suggested change
- **Structural mask (causal, sliding-window, band-diagonal) → dense `(1, 1, L, L)` is unavoidable.** Row-varying patterns can't be expressed as `(B, L)`. Expect SDPA/Flex-only for these layers; consider Flex's `sliding_window_mask_mod` or FA3's native `window_size=` kwarg if backend flexibility matters.
- **Structural mask (causal, sliding-window, band-diagonal) → dense `(1, 1, L, L)` is unavoidable.** Row-varying patterns can't be expressed as `(B, L)`. Expect SDPA/Flex-only for these layers; consider Flex's `sliding_window_mask_mod` or FA3's native `window_size=` kwarg if backend flexibility matters. Consult `src/diffusers/models/transformers/transformer_kandinsky.py` as a reference.

Comment thread .ai/models.md
6. **Config serialization gaps.** Every `__init__` parameter in a `ModelMixin` subclass must be captured by `register_to_config`. If you add a new param but forget to register it, `from_pretrained` will silently use the default instead of the saved value.

7. **Forgetting to update `_import_structure` and `_lazy_modules`.** The top-level `src/diffusers/__init__.py` has both -- missing either one causes partial import failures.
4. **Capability flags without matching implementation.** Class attributes like `_supports_cache_class`, `_no_split_modules`, `_supports_gradient_checkpointing`, `_supports_quantization` advertise a contract. Setting one without the corresponding logic (e.g. `_supports_gradient_checkpointing = True` with no `_set_gradient_checkpointing` method, or no `if gradient_checkpointing` branches in `forward`) means training code silently no-ops or errors deep inside `torch.utils.checkpoint`. For `_supports_cache_class` / `_no_split_modules` specifically, wrong values cause silent correctness bugs or OOM. Copy from a similar model, verify the corresponding logic is in place, and for inference-only ports just drop the flag.
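A hedged sketch of the capability-flag contract from the last gotcha above, simplified to a plain `nn.Module`: real models checkpoint full transformer blocks and let `ModelMixin.enable_gradient_checkpointing` flip the flag, but the key point is that the attribute means nothing without the matching branch in `forward`.

```python
import torch
import torch.nn as nn
import torch.utils.checkpoint


class MyBlockStack(nn.Module):
    # The flag advertises a contract...
    _supports_gradient_checkpointing = True

    def __init__(self, dim: int = 64, num_layers: int = 2):
        super().__init__()
        self.blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_layers))
        self.gradient_checkpointing = False  # flipped by enable_gradient_checkpointing()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            # ...and this branch is the matching implementation; without it,
            # the flag silently no-ops during training.
            if torch.is_grad_enabled() and self.gradient_checkpointing:
                hidden_states = torch.utils.checkpoint.checkpoint(
                    block, hidden_states, use_reentrant=False
                )
            else:
                hidden_states = block(hidden_states)
        return hidden_states
```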
Member

_supports_quantization -- we don't have this attribute.


Labels

size/M PR with diff < 200 LOC
