Fix qwen encoder hidden states mask #12655
What does this PR do?
Fixes the QwenImage encoder to properly apply `encoder_hidden_states_mask` when it is passed to the model. Previously, the mask parameter was accepted but ignored, so padding tokens incorrectly influenced the attention computation.

Changes

Updated `QwenDoubleStreamAttnProcessor2_0` to create a 2D attention mask from the 1D `encoder_hidden_states_mask`, masking text padding tokens while keeping all image tokens unmasked.
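Concretely, the shape of the fix looks roughly like the sketch below. This is a minimal illustration rather than the exact diffusers code; `build_joint_attention_mask` is a hypothetical helper name, and it assumes text tokens precede image tokens in the joint sequence:

```python
import torch

def build_joint_attention_mask(
    encoder_hidden_states_mask: torch.Tensor,  # (batch, text_len); 1 = real token, 0 = padding
    image_seq_len: int,
) -> torch.Tensor:
    batch_size, _ = encoder_hidden_states_mask.shape
    # Image tokens are never padded, so they are always attendable.
    image_mask = encoder_hidden_states_mask.new_ones(batch_size, image_seq_len)
    # Joint [text, image] key mask, assuming text tokens come first.
    joint_mask = torch.cat([encoder_hidden_states_mask, image_mask], dim=1)
    # Broadcast over heads and query positions: (batch, 1, 1, total_len).
    keep = joint_mask[:, None, None, :].bool()
    # Additive form for scaled-dot-product attention: 0 where attendable, -inf at padding.
    additive = torch.zeros(keep.shape, dtype=torch.float32)
    additive.masked_fill_(~keep, float("-inf"))
    return additive

if __name__ == "__main__":
    # Two prompts padded to length 5; the second has two padding tokens.
    mask = torch.tensor([[1, 1, 1, 1, 1],
                         [1, 1, 1, 0, 0]])
    print(build_joint_attention_mask(mask, image_seq_len=4).shape)  # torch.Size([2, 1, 1, 9])
```

Masking only the key dimension means every query (text or image) simply cannot attend *to* padding positions, while padding queries remain harmless because their outputs are discarded downstream.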
Impact

This fix enables proper Classifier-Free Guidance (CFG) batching with variable-length text sequences, which is common when batching conditional and unconditional prompts together.
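To illustrate why CFG batching produces variable-length masks (a hedged sketch with made-up token counts, not code from this PR):

```python
import torch

# Conditional prompt with 12 real tokens; unconditional ("") prompt with 2
# (e.g. just BOS/EOS), both padded to the batch max length of 12.
cond_len, uncond_len, max_len = 12, 2, 12
mask = torch.zeros(2, max_len, dtype=torch.long)
mask[0, :cond_len] = 1    # conditional row: all real tokens
mask[1, :uncond_len] = 1  # unconditional row: mostly padding
# Without applying this mask in attention, the ten padding tokens in the
# unconditional row would leak into the attention weights and skew the
# unconditional branch of CFG.
print(mask)
```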
Benchmark Results
Overhead: +2.8% for mask processing without padding; +18.7% with actual padding (a realistic CFG scenario).
The higher overhead with padding is expected and acceptable: it is the cost of correctly handling variable-length sequences in batched inference. This is a correctness fix rather than an optimization. Tests were run on an RTX 4070 (12 GB).
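For reference, a generic way to measure this kind of relative overhead looks like the sketch below. This is an illustration of the methodology only, not the PR's actual benchmark script; `masked_forward` and `unmasked_forward` are placeholder callables:

```python
import time
import torch

def mean_latency(fn, warmup: int = 5, iters: int = 20) -> float:
    """Average wall-clock latency of fn(), synchronizing the GPU if one is present."""
    sync = torch.cuda.synchronize if torch.cuda.is_available() else (lambda: None)
    for _ in range(warmup):
        fn()
    sync()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    sync()
    return (time.perf_counter() - start) / iters

# overhead_pct = 100 * (mean_latency(masked_forward) / mean_latency(unmasked_forward) - 1)
```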
Fixes #12294
Before submitting
Who can review?
@yiyixuxu @sayakpaul - Would appreciate your review, especially of the benchmarking approach, where I used a custom benchmark rather than `BenchmarkMixin`.

Note: the benchmark file is named `benchmarking_qwenimage_mask.py` (with a "benchmarking" prefix) rather than `benchmark_qwenimage_mask.py` so that it is not picked up by `run_all.py`, since it doesn't use `BenchmarkMixin` and produces a different CSV schema. If you prefer, I can adapt it to the standard format instead.

Happy to adjust the approach if you have suggestions!