Fix the attention mask in ulysses SP for QwenImage#13278

Merged
dg845 merged 6 commits into huggingface:main from zhtmike:fix_sp
Mar 24, 2026
Conversation

@zhtmike
Contributor

@zhtmike zhtmike commented Mar 17, 2026

What does this PR do?

Fix issue #13277.

QwenImagePipeline cannot run Ulysses SP together with batched prompt inputs. The root cause is that the attention mask is not correctly broadcast: we need to expand it from [B, S] to [B, H, S_q, S_kv] (or simply [B, 1, 1, S_kv]) before feeding it into SDPA.
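The shape problem can be reproduced in isolation with plain PyTorch SDPA; the sizes below are small illustrative stand-ins for the real pipeline shapes:

```python
import torch
import torch.nn.functional as F

# Illustrative sizes, not the real 4222-token sequence from the issue.
B, H, S, D = 2, 12, 16, 64
q, k, v = (torch.randn(B, H, S, D) for _ in range(3))

# A per-token boolean mask as the pipeline produces it: shape [B, S].
mask_2d = torch.ones(B, S, dtype=torch.bool)
mask_2d[:, S // 2:] = False  # pretend the second half of each sequence is padding

# Passing the raw 2D mask makes SDPA try to broadcast [B, S] against
# [B, H, S_q, S_kv], which fails with a RuntimeError like the one in the issue.
# The fix: insert singleton head and query dims -> [B, 1, 1, S_kv].
mask_4d = mask_2d[:, None, None, :]
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask_4d)
print(out.shape)  # torch.Size([2, 12, 16, 64])
```

The [B, 1, 1, S_kv] shape broadcasts cleanly because the singleton dimensions expand to H and S_q under PyTorch broadcasting rules.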

Before Fix:

Running the code snippet from the issue raises the following error.

RuntimeError: The expanded size of the tensor (4222) must match the existing size (2) at non-singleton dimension 2.  Target sizes: [2, 12, 4222, 4222].  Tensor sizes: [2, 4222]

After Fix:

The images are correctly produced.
(output images: output_image_ulysses2_0, output_image_ulysses2_1)


Who can review?

@sayakpaul

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Comment on lines +816 to +817
if attn_mask is not None and attn_mask.dim() == 2:
attn_mask = attn_mask[:, None, None, :]
Member

Is this Qwen specific?

Contributor Author

I haven't tested with other models, but I think models with 2D mask inputs would have a similar problem.

Member

Possible to check out one other? And also run the

class TestFluxTransformerContextParallel(FluxTransformerTesterConfig, ContextParallelTesterMixin):
?

Contributor Author

Thanks. From a quick scan, most models seem to handle the mask shape correctly in their own implementations. So I’ve limited the modification to QwenImage only.

Should I run any test cases?

Member

Thanks! Maybe we could add a similar test to

class TestFluxTransformerContextParallel(FluxTransformerTesterConfig, ContextParallelTesterMixin):
?

I will give you a ping once it's refactored to follow the latest pattern.

Member

Sorry disregard my suggestion on using the CUDNN backend.

Yes, native attention x Ulysses works fine for single-prompt input. Currently batched inputs have a problem.

Is it the case just for Qwen or the same happens for Flux, as well? Also, the test under consideration -- does it not use a single prompt?

Contributor Author

Is it the case just for Qwen or the same happens for Flux, as well?

So far, I have only found this problem in Qwen. Other models, such as Z-Image and HunyuanImage, expand the attention mask in a similar way before entering the attention block. For Flux, I tested with the main branch, and it works fine with both CP and batched inputs.

Also, the test under consideration -- does it not use a single prompt?

I am wondering whether we should add a batched-input test if possible. For now, I think we should first ensure that all existing unit tests pass without modifying them.

The background of this bug is that we are working on the training engine based on the Diffusers backend, using QwenImage as the first example. Therefore, we may need a combination of batch inputs (for high throughput) as well as Ulysses SP support. This is why we encountered this bug during the forward process.

Member

Agreed and thanks so much for the context!

I am wondering whether we should add a batch input test if possible. At the beginning, I think we should first ensure that all unit tests pass without modifying them.

Would you like to take a crack at this? We'll be quick to review.

I think first we need to ensure that the test_context_parallel_inference() test is xfailed when ring attention is enabled with the SDPA backend. #13182 is adding a test suite for CP backends and attention backends.

And then a test for batched inputs.
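A batched-input test could be sketched along these lines (an assumed approach with toy tensors, not the actual diffusers test suite): with a correctly broadcast [B, 1, 1, S] mask, a batch of two samples must produce the same result as running each sample on its own.

```python
import torch
import torch.nn.functional as F

# Toy shapes for illustration only; per-sample masks differ so the
# broadcast path is actually exercised.
torch.manual_seed(0)
B, H, S, D = 2, 4, 8, 16
q, k, v = (torch.randn(B, H, S, D) for _ in range(3))
mask = torch.ones(B, S, dtype=torch.bool)
mask[1, S // 2:] = False  # second sample has padding

out_batched = F.scaled_dot_product_attention(q, k, v, attn_mask=mask[:, None, None, :])
out_single = torch.cat([
    F.scaled_dot_product_attention(
        q[i:i + 1], k[i:i + 1], v[i:i + 1],
        attn_mask=mask[i:i + 1][:, None, None, :],
    )
    for i in range(B)
])
# Batched and per-sample results must agree.
assert torch.allclose(out_batched, out_single, atol=1e-5)
```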

Then let's revisit this PR?

WDYT?

Contributor Author

Sure, no problem. I will add a unit test for batched input.

Contributor Author

Hi @sayakpaul , I have added a PR #13312 for this, could you take a look?

@zhtmike zhtmike changed the title Fix the attention mask in ulysses SP Fix the attention mask in ulysses SP for QwenImage Mar 17, 2026
@sayakpaul sayakpaul requested a review from kashif March 17, 2026 10:13
@sayakpaul
Member

@naykun if you want to take a look

batch_size, image_seq_len = hidden_states.shape[:2]
image_mask = torch.ones((batch_size, image_seq_len), dtype=torch.bool, device=hidden_states.device)
joint_attention_mask = torch.cat([encoder_hidden_states_mask, image_mask], dim=1)
joint_attention_mask = joint_attention_mask[:, None, None, :]
Member

Is this okay for non-CP?

Contributor Author
@zhtmike zhtmike Mar 17, 2026

Yes. The output image is the same with or without this change.
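This can be sanity-checked with a toy SDPA call (illustrative sizes, not the real model): an all-True mask broadcast to [B, 1, 1, S] masks nothing, so non-CP outputs are unaffected.

```python
import torch
import torch.nn.functional as F

# Sanity check: an all-True boolean mask reshaped to [B, 1, 1, S] masks
# nothing, so the output matches unmasked attention.
torch.manual_seed(0)
B, H, S, D = 2, 4, 8, 16
q, k, v = (torch.randn(B, H, S, D) for _ in range(3))

mask = torch.ones(B, S, dtype=torch.bool)[:, None, None, :]  # [B, 1, 1, S]
out_masked = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
out_plain = F.scaled_dot_product_attention(q, k, v)
print(torch.allclose(out_masked, out_plain, atol=1e-5))  # True
```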

@zhtmike
Contributor Author

zhtmike commented Mar 24, 2026

Following up on #13312: we can drop the xfail after fixing the QwenImage mask.

Running pytest tests/models/transformers/test_models_transformer_qwenimage.py:

=========================== short test summary info ============================
FAILED tests/models/transformers/test_models_transformer_qwenimage.py::TestQwenImageTransformerLoRAHotSwap::test_hotswapping_compiled_model_linear[11-11]
FAILED tests/models/transformers/test_models_transformer_qwenimage.py::TestQwenImageTransformerLoRAHotSwap::test_hotswapping_compiled_model_linear[7-13]
FAILED tests/models/transformers/test_models_transformer_qwenimage.py::TestQwenImageTransformerLoRAHotSwap::test_hotswapping_compiled_model_linear[13-7]
FAILED tests/models/transformers/test_models_transformer_qwenimage.py::TestQwenImageTransformerLoRAHotSwap::test_hotswapping_compiled_model_both_linear_and_other[11-11]
FAILED tests/models/transformers/test_models_transformer_qwenimage.py::TestQwenImageTransformerLoRAHotSwap::test_hotswapping_compiled_model_both_linear_and_other[7-13]
FAILED tests/models/transformers/test_models_transformer_qwenimage.py::TestQwenImageTransformerLoRAHotSwap::test_hotswapping_compiled_model_both_linear_and_other[13-7]
FAILED tests/models/transformers/test_models_transformer_qwenimage.py::TestQwenImageTransformerLoRAHotSwap::test_enable_lora_hotswap_called_after_adapter_added_warning
====== 7 failed, 69 passed, 54 skipped, 36 warnings in 182.34s (0:03:02) =======

The LoRAHotSwap failures seem unrelated.

@sayakpaul sayakpaul requested a review from dg845 March 24, 2026 04:20
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@sayakpaul
Member

I ran the tests on my end and they are passing (CP tests). The hotswapping test failures are unrelated.

Collaborator
@dg845 dg845 left a comment

Thanks for the PR!

@dg845
Collaborator

dg845 commented Mar 24, 2026

Merging as the CI is green.

@dg845 dg845 merged commit afdda57 into huggingface:main Mar 24, 2026
11 checks passed
@zhtmike zhtmike deleted the fix_sp branch March 24, 2026 09:33