Use correct mask for packed inputs in Qwen-VL #44157
zucchini-nlp wants to merge 7 commits into huggingface:main
Conversation
run-slow: qwen2_vl, qwen2_5_vl

This comment contains models: ["models/qwen2_5_vl", "models/qwen2_vl"]

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

run-slow: ernie4_5_vl_moe, glm4v, glm4v_moe, glm_ocr, paddleocr_vl, qwen2_5_omni, qwen2_5_vl, qwen2_vl, qwen3_5, qwen3_5_moe, qwen3_omni_moe, qwen3_vl, qwen3_vl_moe, video_llama_3
hidden_states: torch.Tensor,
cu_seqlens: torch.Tensor,
rotary_pos_emb: torch.Tensor | None = None,
attention_mask: torch.Tensor | None = None,

Unused arg, probably a bad copy from other models. We only use position_embeddings.

Oops, very likely yes 😓 I started from qwen vl and opted to make it a bit cleaner back then, iirc.
This comment contains models: ["models/ernie4_5_vl_moe", "models/glm4v", "models/glm4v_moe", "models/glm_ocr", "models/paddleocr_vl", "models/qwen2_5_omni", "models/qwen2_5_vl", "models/qwen2_vl", "models/qwen3_5", "models/qwen3_5_moe", "models/qwen3_omni_moe", "models/qwen3_vl", "models/qwen3_vl_moe", "models/video_llama_3"]

[For maintainers] Suggested jobs to run (before merge) run-slow: ernie4_5_vl_moe, glm4v, glm4v_moe, glm_ocr, paddleocr_vl, qwen2_5_omni, qwen2_5_vl, qwen2_vl, qwen3_5, qwen3_5_moe, qwen3_omni_moe, qwen3_vl, qwen3_vl_moe, video_llama_3
vasqu left a comment
Some comments from my side; definitely the right way to go! We should aim to support this natively in our mask API (without vmap, i.e. without the and mask fn), otherwise we lose quite a lot of perf.
]
attn_output = torch.cat(attn_outputs, dim=1)
max_seqlen = (cu_seqlens[1:] - cu_seqlens[:-1]).max()  # FA-kwargs
attn_output, _ = attention_interface(

Just noticing it now, but we never collected the attention weights on the vision side 👀

In qwen-vl we did collect them, as a list of 4D tensors per layer. Probably it wasn't standard and wasn't copied everywhere.
cu_seq_lens_k=cu_seqlens,
max_length_q=max_seqlen,
max_length_k=max_seqlen,
is_causal=False,

Suggested change: remove the is_causal=False, line.

I know this comes from previous implementations, but we really shouldn't pass this manually ourselves; we should rely on self.is_causal instead.
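A minimal sketch of the suggestion above, with a hypothetical toy module (not the actual transformers code): the causality flag lives on the module, and call sites read it from self instead of hard-coding a literal.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVisionAttention(nn.Module):
    """Hypothetical toy module: the causality flag is a module attribute,
    so no call site ever hard-codes is_causal=False."""

    is_causal = False  # vision attention is bidirectional

    def forward(self, q, k, v):
        # The flag is read from self rather than passed as a literal.
        return F.scaled_dot_product_attention(q, k, v, is_causal=self.is_causal)

attn = ToyVisionAttention()
q = k = v = torch.randn(1, 2, 4, 8)
out = attn(q, k, v)  # same shape as the inputs
```

If a causal variant is ever needed, only the class attribute changes, and every call site follows automatically.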
config=self.config,
inputs_embeds=hidden_states[None, ...],
attention_mask=None,
and_mask_function=packed_sequence_mask_function(packed_sequence),
and_mask_function is quite expensive at runtime. I'd prefer if we could natively integrate it into our mask API instead, so as not to use vmap (should work OOB).

Hmm, what do you mean by "natively"? I could prepare the mask by looping over cu_seqlens and un-masking each block, and keep it as a small fn in the model file. In that case we don't use create_bidirectional_mask.
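The loop-over-cu_seqlens idea can be sketched as follows. This is a hypothetical helper, not the actual transformers API: it builds a block-diagonal bidirectional mask directly from the cumulative sequence lengths, with no vmap involved.

```python
import torch

def packed_block_diagonal_mask(cu_seqlens: torch.Tensor) -> torch.Tensor:
    """Build a (seq_len, seq_len) boolean mask where tokens attend only
    within their own packed segment, by un-masking one square block per
    segment instead of evaluating a mask function per index pair."""
    seq_len = int(cu_seqlens[-1])
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    for start, end in zip(cu_seqlens[:-1].tolist(), cu_seqlens[1:].tolist()):
        mask[start:end, start:end] = True  # un-mask this segment's block
    return mask

# Two packed sequences of lengths 2 and 3:
mask = packed_block_diagonal_mask(torch.tensor([0, 2, 5]))
```

Positions 0 and 1 can attend to each other, positions 2..4 can attend to each other, and nothing crosses the segment boundary.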
If you check the new mask API, we force vmap when we pass and/or mask functions:
transformers/src/transformers/masking_utils.py
Lines 1003 to 1017 in efdcbc7

Natively in this case would mean either a new kwarg that we can use to control packed sequences as well (instead of pos ids), or a new function similar to how sliding window is extended:
transformers/src/transformers/masking_utils.py
Line 1154 in efdcbc7

The problem with and/or masks is that we can have no idea about their functionality, and vmapping is powerful enough to support almost anything, so we are kinda forced to vmap for compatibility/safety.
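For contrast, an and-style mask function in the index-wise form the mask API vmaps over might look like this (hedged sketch; the name packed_sequence_mask_function comes from the diff above, but this body is an assumption, with packed_sequence assumed to hold one segment id per position):

```python
import torch

def packed_sequence_mask_function(packed_sequence: torch.Tensor):
    """Return an index-wise predicate: q_idx and kv_idx may attend to each
    other only if they carry the same segment id. Predicates of this shape
    are evaluated per (batch, head, q, kv) index, which is why the mask API
    vmaps them and why they cost more than a precomputed block mask."""
    def inner(batch_idx, head_idx, q_idx, kv_idx):
        return packed_sequence[batch_idx, q_idx] == packed_sequence[batch_idx, kv_idx]
    return inner

# One batch with two packed segments: ids [0, 0] and [1, 1, 1].
seg_ids = torch.tensor([[0, 0, 1, 1, 1]])
mask_fn = packed_sequence_mask_function(seg_ids)
```

Here mask_fn(0, 0, 0, 1) is true (same segment) while mask_fn(0, 0, 1, 2) is false (boundary crossed), matching the block-diagonal structure.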
for q, k, v in zip(*splits)
]
attn_output = torch.cat(attn_outputs, dim=1)
max_seqlen = (cu_seqlens[1:] - cu_seqlens[:-1]).max()  # FA-kwargs
Unsure how/where max_seqlen would be called down the line, but I remember a lot of .item() calls to match the FA signature (e.g. for compile); so it might be smarter to compute it once, before entering each attention module.

Yeah, might do it in the VisionModel once. IIRC FA wouldn't prepare the max length for us, so we have to pass it explicitly.
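Computing the FA kwargs once at the model level could look like this (hedged sketch; the variable names are illustrative, and .item() converts the tensor to the Python int that FlashAttention's varlen signature expects):

```python
import torch

# Segment boundaries for three packed sequences of lengths 4, 5 and 2.
cu_seqlens = torch.tensor([0, 4, 9, 11], dtype=torch.int32)

# Compute once before the layer loop, not inside every attention module.
seq_lens = cu_seqlens[1:] - cu_seqlens[:-1]   # per-segment lengths
max_seqlen = seq_lens.max().item()            # Python int for the FA signature

fa_kwargs = {
    "cu_seq_lens_q": cu_seqlens,
    "cu_seq_lens_k": cu_seqlens,
    "max_length_q": max_seqlen,
    "max_length_k": max_seqlen,
}
```

Every attention layer then receives the same precomputed kwargs, avoiding a repeated .max().item() (and the associated graph break under compile) per layer.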
What does this PR do?
As per title, gets rid of the if/else per attn implementation.