add SP support for `_flash_3_varlen_hub` backend by zhtmike · Pull Request #13809 · huggingface/diffusers

zhtmike · 2026-05-26T06:18:49Z

What does this PR do?

A follow-up to #13479. This PR adds _flash_3_varlen_hub support for the SP forward and backward passes.

Tested with the QwenImage pipeline; the resulting image is as expected.
Tested QwenImage training with SP; it runs without errors.
The unit tests for Flux and QwenImage pass.

Fixes # (issue)

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline?
Did you read our philosophy doc (important for complex PRs)?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

@sayakpaul

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

sayakpaul · 2026-05-29T11:38:17Z

-    if attn_mask is not None:
-        attn_mask = _normalize_attn_mask(attn_mask, batch_size, seq_len_kv)
-
-    (_, seqlens_k), (cu_seqlens_q, cu_seqlens_k), (max_seqlen_q, max_seqlen_k) = (
-        _prepare_for_flash_attn_or_sage_varlen(
-            batch_size, seq_len_q, seq_len_kv, attn_mask=attn_mask, device=query.device
-        )
-    )
-
-    key_valid, value_valid = [], []
-    for b in range(batch_size):
-        valid_len = seqlens_k[b]
-        key_valid.append(key[b, :valid_len])
-        value_valid.append(value[b, :valid_len])


It seems this should go under if _parallel_config is None and attn_mask is not None:?
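
For readers unfamiliar with the varlen layout, here is a minimal sketch of the packing step the removed lines perform, written with plain Python lists rather than tensors. It assumes right-padded sequences (valid tokens first), as key[b, :valid_len] implies; pack_valid_tokens and its arguments are hypothetical names for illustration, not diffusers helpers.

```python
from itertools import accumulate


def pack_valid_tokens(rows, mask):
    """Drop padded positions and return the packed tokens plus offsets.

    Hypothetical helper: assumes each row is right-padded, so the valid
    tokens of row b are rows[b][:seqlens[b]].
    """
    # Number of valid (unmasked) tokens per batch element.
    seqlens = [sum(1 for keep in m if keep) for m in mask]
    # cu_seqlens[i] is the offset at which sequence i starts in the packed list.
    cu_seqlens = [0] + list(accumulate(seqlens))
    packed = [tok for row, n in zip(rows, seqlens) for tok in row[:n]]
    return packed, cu_seqlens, max(seqlens, default=0)


keys = [["k00", "k01", "pad"], ["k10", "k11", "k12"]]
mask = [[True, True, False], [True, True, True]]
packed, cu_seqlens, max_seqlen = pack_valid_tokens(keys, mask)
```

Here the first sequence contributes two tokens and the second three, so the packed list has five entries and the offsets mark where each sequence begins.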

sayakpaul · 2026-05-29T11:39:14Z

+            max_seqlen_k=max_seqlen_k,
+            softmax_scale=scale,
+            causal=is_causal,
+            return_attn_probs=return_lse,


This seems like an extra argument?

sayakpaul · 2026-05-29T11:39:32Z

+            return_attn_probs=return_lse,
+        )
+        if return_lse:
+            out, lse, *_ = out


Why do we need to initialize lse = None above?

sayakpaul · 2026-05-29T11:41:23Z

@askserge could you do a review?

github-actions

🤗 Serge says:

This PR adds sequence parallel (context parallel) support for the _flash_3_varlen_hub attention backend, following the same pattern established by the existing _flash_varlen_hub (flash-attn2) and _flash_3_hub implementations. The code is well-structured and closely mirrors the existing patterns.

Correctness

indices_k on the no-mask path (not a bug). In _flash_attention_3_varlen_hub_forward_op, indices_k is never assigned when attn_mask is None, but line 1721 reads ctx.indices_k = indices_k if attn_mask is not None else None, and that conditional expression only evaluates indices_k when a mask is present. When _save_ctx is False, indices_k is not referenced at all on the no-mask path. This matches the flash-attn2 varlen pattern exactly.
Positional argument fragility in wrapped_forward_fn call. The non-varlen _flash_attention_3_hub_forward_op uses keyword arguments for causal, window_size_left, etc., but the new varlen forward op passes everything positionally (lines 1677–1712). This makes the code harder to read and more fragile if the upstream _flash_attn_forward signature changes. Consider using keyword arguments for at least the trailing parameters, consistent with the non-varlen version.
return_lse handling change in the non-SP path. The original code always unpacked out, lse, *_ from the function call. The new code passes return_attn_probs=return_lse and conditionally unpacks. This is a behavioral change for the non-SP path — if return_lse=False, the function now returns a single tensor instead of a tuple. This should be verified to work correctly with the flash_attn_varlen_func API. The flash-attn2 varlen hub uses return_attn_probs=return_lse similarly, so this is likely correct.

Tests

The new backend is properly added to ContextParallelAttentionBackendsTesterMixin and the ring_degree skip logic.
The _FLASH_3_VARLEN_HUB is added to the hub kernels set in tests/models/testing_utils/utils.py.

Minor Issues

Bug in existing test code (pre-existing). Line 413 (deleted): attention_backend in ("flash_varlen_hub"). Parentheses around a single string do not make a tuple, so this is just attention_backend in "flash_varlen_hub", which is a substring test rather than membership in a collection of names. It happens to be true for the exact name, but it is also true for any substring such as "flash". The fix on line 417 uses a proper tuple ("flash_varlen_hub", "_flash_3_varlen_hub"), which is correct. Good that it's fixed now.
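
The difference can be seen in a short sketch; the matches helper is hypothetical and exists only for illustration.

```python
single = ("flash_varlen_hub")    # parentheses only, so this is still a str
proper = ("flash_varlen_hub",)   # the trailing comma makes a one-element tuple


def matches(backend, candidates):
    """Return whether backend is 'in' candidates, whatever type that is."""
    return backend in candidates


# With a str on the right-hand side, `in` is a substring test.
substring_hit = matches("flash", single)
# With a tuple, `in` compares whole elements.
tuple_hit = matches("flash", proper)
exact_hit = matches("flash_varlen_hub", proper)
```

With the string form, "flash" matches by accident; with the tuple form, only the full backend name matches.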

Overall the implementation follows established patterns well and looks correct.

17 LLM turns · 18 tool calls · 92.5s · 309778 in / 4124 out tokens

github-actions · 2026-05-29T11:43:24Z

+        value_packed = value.flatten(0, 1)
+        seqlens_k = None
+
+    out_packed, softmax_lse, *_ = wrapped_forward_fn(


Nit: The non-varlen _flash_attention_3_hub_forward_op uses keyword arguments for the trailing parameters (causal=is_causal, window_size_left=window_size[0], etc.), but here everything is passed positionally with no inline comments explaining what each None corresponds to. This makes the code harder to audit and fragile if the upstream signature changes.

Consider either:

Using keyword arguments for at least the trailing parameters (like the non-varlen version does), or

Adding inline comments for the positional None values (like the non-varlen version does with # k_new, v_new, # cu_seqlens_q/k/k_new, etc.)
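
To illustrate the fragility concern, here is a minimal sketch using two hypothetical signatures, forward_v1 and forward_v2. Neither is the real _flash_attn_forward signature; they only model an upstream function that gains a parameter in the middle of its argument list.

```python
def forward_v1(q, k, v, softcap=0.0, num_splits=1):
    """Hypothetical older signature."""
    return {"softcap": softcap, "num_splits": num_splits}


def forward_v2(q, k, v, softcap=0.0, return_softmax=False, num_splits=1):
    """Hypothetical newer signature with return_softmax inserted."""
    return {
        "softcap": softcap,
        "return_softmax": return_softmax,
        "num_splits": num_splits,
    }


# A positional call written against v1 still runs against v2,
# but the value 4 now binds to return_softmax instead of num_splits.
positional = forward_v2("q", "k", "v", 0.0, 4)

# The keyword call keeps its meaning across both signatures.
keyword = forward_v2("q", "k", "v", softcap=0.0, num_splits=4)
```

The positional call fails silently rather than raising, which is why keyword arguments or inline comments make this kind of call easier to audit.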

github-actions · 2026-05-29T11:43:24Z

+        window_size[1],
+        0,
+        softcap,
+        True,


What does True correspond to here? Looking at the non-varlen version, the parameters after softcap are num_splits, pack_gqa, sm_margin — but here there are two extra positional args (True and None) before num_splits. This likely corresponds to return_softmax=True and perhaps gen_=None or similar. Please add an inline comment to clarify, or use keyword arguments.

github-actions · 2026-05-29T11:43:24Z

+            max_seqlen_k=max_seqlen_k,
+            softmax_scale=scale,
+            causal=is_causal,
+            return_attn_probs=return_lse,


Note: The original code always unpacked out, lse, *_ = func(...). Now with return_attn_probs=return_lse, when return_lse=False the return value may be different (single tensor vs tuple). Make sure flash_attn_varlen_func from flash-attn3 returns a single tensor (not a tuple) when return_attn_probs=False. The flash-attn2 varlen hub uses the same pattern, so this is likely fine, but worth verifying.
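
The conditional-unpacking pattern under discussion can be sketched as follows. fake_varlen_attention is a hypothetical stand-in, not the flash-attn API; it only assumes the behaviour described above, namely a tuple when return_attn_probs=True and a single value otherwise.

```python
def fake_varlen_attention(q, return_attn_probs=False):
    """Hypothetical stand-in for a varlen attention kernel."""
    out = [x * 2 for x in q]
    if return_attn_probs:
        # Tuple form: output, log-sum-exp, plus an extra trailing value.
        return out, [0.0] * len(q), None
    return out


def attention(q, return_lse=False):
    # Pre-initialising lse keeps the name defined on both branches.
    lse = None
    out = fake_varlen_attention(q, return_attn_probs=return_lse)
    if return_lse:
        # Only unpack when the callee actually returned a tuple.
        out, lse, *_ = out
    return (out, lse) if return_lse else out


plain = attention([1, 2])
with_lse = attention([1, 2], return_lse=True)
```

If the real function returned a tuple even when return_attn_probs=False, the non-unpacking branch would pass that tuple through as the output, which is the case the note asks to verify.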

add _flash_3_varlen_hub support for SP

40618ff

github-actions Bot added the size/L (PR with diff > 200 LOC), models, and tests labels and removed the size/L (PR with diff > 200 LOC) label on May 26, 2026

zhtmike wants to merge 1 commit into huggingface:main from zhtmike:fa3
