Fix Qwen2.5VL temporal grid positions #45400
zucchini-nlp wants to merge 10 commits into huggingface:main from
Conversation
        grid_thw[2].item() // spatial_merge_size,
    )

image_seq_length = llm_grid_h * llm_grid_w * llm_grid_t
fix repo from qwen2-vl, here and after this
run-slow: qwen2_vl, qwen2_5_vl, glm4v, qwen3_vl, ernie4_5_vl_moe

This comment contains models: ["models/ernie4_5_vl_moe", "models/glm4v", "models/qwen2_5_vl", "models/qwen2_vl", "models/qwen3_vl"]
def get_rope_index(
    self,
    input_ids: torch.LongTensor,
    mm_token_type_ids: torch.IntTensor,
same thing, just a bit shorter and easier to follow. Copied from 'qwen3-vl'
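To make the discussion concrete, here is a minimal sketch of what 3D (temporal, height, width) position ids for one vision grid look like after spatial merging. This is an illustrative stand-in, not the actual `get_rope_index` from transformers; the helper name and shapes are assumptions.

```python
import torch

def make_vision_position_ids(t: int, h: int, w: int, merge: int = 2) -> torch.Tensor:
    # Hypothetical helper (not the real get_rope_index): build the three
    # RoPE position-id rows (temporal, height, width) for one vision grid
    # after merging `merge` x `merge` spatial patches into one token.
    llm_h, llm_w = h // merge, w // merge
    # temporal index is constant within a frame, increments per frame
    t_ids = torch.arange(t).view(t, 1, 1).expand(t, llm_h, llm_w).reshape(-1)
    # height index repeats across widths, tiled over frames
    h_ids = torch.arange(llm_h).view(1, llm_h, 1).expand(t, llm_h, llm_w).reshape(-1)
    # width index cycles fastest
    w_ids = torch.arange(llm_w).view(1, 1, llm_w).expand(t, llm_h, llm_w).reshape(-1)
    return torch.stack([t_ids, h_ids, w_ids])  # shape: (3, t * llm_h * llm_w)
```

For a 2-frame, 4x4-patch grid with merge size 2, this yields 8 vision tokens whose temporal row is `[0, 0, 0, 0, 1, 1, 1, 1]`.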
vasqu left a comment
Imo this looks good, I just have a few remarks to get some details in. And let's really check all the models please, glm image as well; no need to be sparse about running tests here.
One concern: we change one integration test, so I just want to make sure this is a proper fix and not that we merely aligned the test with this fix.
# Repeat the positions per each grid and per video frame. Add start position for temporal grid
# Important to add start positions after applying `time_interval`, order matters
Let's move this comment above; I was thinking it should go directly on the first arange for the temporal dim.
Though it doesn't really apply to the arange; we are repeating and adding the start position afterwards?
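The "order matters" point above can be sketched with hypothetical values: the temporal positions must be scaled by the per-grid time interval before the grid's start position is added, otherwise the start offset gets scaled too. Function names and numbers here are illustrative, not the actual code.

```python
import torch

def scale_then_shift(n_frames: int, time_interval: int, start: int) -> list:
    # correct order: scale the frame indices first, then add the start offset
    return (torch.arange(n_frames) * time_interval + start).tolist()

def shift_then_scale(n_frames: int, time_interval: int, start: int) -> list:
    # wrong order: the start position gets multiplied by the interval too
    return ((torch.arange(n_frames) + start) * time_interval).tolist()
```

With 3 frames, interval 2 and start 5, the correct order gives `[5, 7, 9]` while the wrong order gives `[10, 12, 14]`.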
Not the same as the other ones?
Nope, it only supports image generation from text or another image. It is used only in diffusers as part of their pipeline.
{
    (None, None): [
        'system\nYou are a helpful assistant.\nuser\nWhat is shown in this video?\nassistant\nThe video shows an indoor tennis court with a person standing on the service line, preparing to serve. The individual is wearing athletic attire, including a white',
        'system\nYou are a helpful assistant.\nuser\nWhat is shown in this video?\nassistant\nThe video shows two individuals playing tennis on an indoor court. The player in the foreground, dressed in a white shirt and black shorts, is preparing to',
So this is intentional, was it changed before and we just went along?
Yeah, I changed the second temporal grid to a non-zero value by passing the video through the chat template, so it actually tests something now. Previously it wasn't testing positions with video frames at all 😢
(old processors had no way to decode video, which I added much later, so it's expected this was missed)
This test was passing with or without this PR, so its output was kinda useless...
[For maintainers] Suggested jobs to run (before merge) run-slow: ernie4_5_vl_moe, glm46v, glm4v, glm4v_moe, glm_image, glm_ocr, paddleocr_vl, qwen2_5_vl, qwen2_vl, qwen3_5, qwen3_5_moe, qwen3_vl, qwen3_vl_moe
I chose models with specific position ids, and from different families. The other ones are all just copies, or don't support videos. This fix has no effect when
Re: having better multimodal tests. AFAIK very few of the VLMs that support video have proper video testing with dummy weights. Nor do we have tests with many modalities in a single input sample. Unfortunately I never had time to push this further.
What does this PR do?
Fixes #45381, but it is weird; I remember checking position ids by value in qwen2.5 as well to verify that time-interval works 🤔
Update: I know why. The integration test we have uses `second_grid_its = 0.083`, which rounds to `0.0`, so the multiplication is zero no matter what value we get for vision positions. Great! For most models we didn't see any diff because each frame is separated by a timestamp and processed separately. Only the first two Qwen releases process all frames in bulk at once.
In any case, it is worth adding a fast test with expected positions; will do so.
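The vacuous-test problem described above can be reproduced in a few lines. This is a hedged illustration with hypothetical values, not the actual test code: if the per-grid timestamp is ~0.083, the scaled temporal positions all truncate to 0 when cast to integer ids, so any bug in the scaling is invisible to an assertion on the output.

```python
import torch

# Hypothetical values mirroring the description above: a tiny per-grid
# timestamp makes every scaled temporal position truncate to zero,
# so the temporal scaling is effectively untested.
second_grid_ts = 0.083
scaled = (torch.arange(4) * second_grid_ts).long()
print(scaled.tolist())  # [0, 0, 0, 0]
```

With any non-trivial timestamp (or more frames), the positions become non-zero and a wrong multiplication order or interval would actually change the expected values.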