Fix Qwen2.5VL temporal grid positions#45400

Open
zucchini-nlp wants to merge 10 commits intohuggingface:mainfrom
zucchini-nlp:qwen-time-positions
Conversation

@zucchini-nlp
Member

@zucchini-nlp zucchini-nlp commented Apr 13, 2026

What does this PR do?

Fixes #45381, but it is weird; I remember checking position ids by value in qwen2.5 as well to verify that the time interval works 🤔

Update: I know why. The integration test we have uses `second_grid_its = 0.083`, which rounds to 0.0, so the multiplication is zero no matter what value we get for the vision positions. Great!
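As a toy illustration of why the test couldn't catch the bug (this is not the actual test or library code; the function name and the integer cast are assumptions for illustration):

```python
# Toy sketch, not Qwen2.5-VL library code: names and the integer cast
# below are assumptions for illustration only.
def temporal_positions(num_temporal_patches, seconds_per_grid, tokens_per_second=2):
    # the per-grid time interval ends up as an integer position step
    interval = int(seconds_per_grid * tokens_per_second)
    return [t * interval for t in range(num_temporal_patches)]

# With a tiny seconds_per_grid the step truncates to 0, so correct and
# buggy position computations both yield all-zero temporal positions:
print(temporal_positions(3, seconds_per_grid=0.083))  # [0, 0, 0]
print(temporal_positions(3, seconds_per_grid=1.0))    # [0, 2, 4]
```

With a zero step, any error in how the temporal indices are combined is multiplied away, which is why the old expected values looked fine.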

For most models we didn't see any diff, because each frame is separated by a timestamp and processed separately. Only the first two Qwen releases process all frames in bulk at once.

In any case, it is worth adding a fast test with expected positions; will do so.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp zucchini-nlp added the for patch Tag issues / labels that should be included in the next patch label Apr 13, 2026
grid_thw[2].item() // spatial_merge_size,
)

image_seq_length = llm_grid_h * llm_grid_w * llm_grid_t
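For context, here is a minimal sketch of how a `(t, h, w)` patch grid like the one above maps to an LLM token count under spatial merging (a hand-rolled illustration, not the transformers implementation):

```python
# Illustrative sketch only: grid_thw is (t, h, w) in vision patches, and
# spatial_merge_size pools patches spatially before the LLM sees them.
def llm_grid(grid_thw, spatial_merge_size=2):
    llm_grid_t = grid_thw[0]
    llm_grid_h = grid_thw[1] // spatial_merge_size
    llm_grid_w = grid_thw[2] // spatial_merge_size
    image_seq_length = llm_grid_t * llm_grid_h * llm_grid_w
    return llm_grid_t, llm_grid_h, llm_grid_w, image_seq_length

# A 2-frame 16x16 patch grid with merge size 2 yields 2 * 8 * 8 = 128 tokens.
print(llm_grid((2, 16, 16)))  # (2, 8, 8, 128)
```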
Member Author


fix repo from qwen2-vl, here and below

@zucchini-nlp
Member Author

run-slow: qwen2_vl, qwen2_5_vl, glm4v, qwen3_vl, ernie4_5_vl_moe

@github-actions
Contributor

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/ernie4_5_vl_moe", "models/glm4v", "models/qwen2_5_vl", "models/qwen2_vl", "models/qwen3_vl"]
quantizations: []

@zucchini-nlp zucchini-nlp changed the title Qwen2.5VL temporal grid postions Fix Qwen2.5VL temporal grid positions Apr 13, 2026
@github-actions
Contributor

CI Results

Workflow Run ⚙️

Commit Info

Context Commit Description
RUN 02a2c776 workflow commit (merge commit)
PR 7039d95c branch commit (from PR)
main def8e6a2 base commit (on main)

✅ No failing test specific to this PR 🎉 👏 !

@zucchini-nlp zucchini-nlp requested a review from vasqu April 13, 2026 12:40
Comment on lines 880 to -884
def get_rope_index(
self,
input_ids: torch.LongTensor,
mm_token_type_ids: torch.IntTensor,
Member Author


same thing, just a bit shorter and easier to follow. Copied from 'qwen3-vl'

Contributor

@vasqu vasqu left a comment


Imo it looks good; I just have a few remarks to get some details in. Also, let's really check all models please, like glm image as well; no need to be sparse about running tests here.

One concern: we change one integration test. I just want to make sure this is a proper fix and that we are not simply aligning the test with this fix.

Comment on lines +981 to +982
# Repeat the positions per each grid and per video frame. Add start position for temporal grid
# Important to add start positions after applying `time_interval`, order matters
Contributor


Let's move this comment above; I was thinking it belongs directly on the first arange for the temporal dim.

Member Author


Though it doesn't really apply to the arange; we are repeating and adding the start position afterwards?
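The order dependence discussed here can be shown with a toy sketch (hypothetical helper names, not the PR's code): the time interval must scale only the per-frame temporal indices, while the start offset is added afterwards.

```python
# Toy sketch of the ordering issue; function names are hypothetical.
def scale_then_shift(num_frames, interval, start):
    # what the code comment describes: apply time_interval first, then add start
    return [t * interval + start for t in range(num_frames)]

def shift_then_scale(num_frames, interval, start):
    # wrong order: the running start offset gets scaled by the interval too
    return [(t + start) * interval for t in range(num_frames)]

print(scale_then_shift(3, 2, 10))  # [10, 12, 14]
print(shift_then_scale(3, 2, 10))  # [20, 22, 24]
```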

Contributor


Not the same as the other ones?

Member Author


Nope, it only supports image generation from text or from another image. It is used only in diffusers as part of their pipeline.

{
(None, None): [
'system\nYou are a helpful assistant.\nuser\nWhat is shown in this video?\nassistant\nThe video shows an indoor tennis court with a person standing on the service line, preparing to serve. The individual is wearing athletic attire, including a white',
'system\nYou are a helpful assistant.\nuser\nWhat is shown in this video?\nassistant\nThe video shows two individuals playing tennis on an indoor court. The player in the foreground, dressed in a white shirt and black shorts, is preparing to',
Contributor


So this is intentional; was it changed before and we just went along?

Member Author

@zucchini-nlp zucchini-nlp Apr 14, 2026


Yeah, I changed the second_temporal_grid to a non-zero value by passing the video through the chat template, so it actually tests something now. Previously it wasn't testing positions with video frames at all 😢
(old processors had no way to decode video, which I added much later, so it's expected this was missed)

This test was passing with or without this PR, so its output was kinda useless...

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: ernie4_5_vl_moe, glm46v, glm4v, glm4v_moe, glm_image, glm_ocr, paddleocr_vl, qwen2_5_vl, qwen2_vl, qwen3_5, qwen3_5_moe, qwen3_vl, qwen3_vl_moe

@zucchini-nlp
Member Author

zucchini-nlp commented Apr 14, 2026

I chose models with model-specific position ids, and from different families. The other ones are all just copies, or don't support videos.

This fix has no effect when temporal_grid == 1, which is true for almost all models after 2.5-VL since they rely on separate frames and timestamps. Same for images: images have no frames, hence no temporal grid.
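That claim is easy to sanity-check with the same kind of toy computation (not library code): with a single temporal grid the only temporal index is 0, so scaling by any interval is a no-op.

```python
# Toy sketch: with grid_t == 1 the interval never matters, which is why
# models that process frames one at a time were unaffected by the bug.
def temporal_ids(grid_t, interval):
    return [t * interval for t in range(grid_t)]

print(temporal_ids(1, 5))   # [0]
print(temporal_ids(1, 99))  # [0]
```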

@zucchini-nlp
Member Author

Re: this applies to having better multimodal tests. AFAIK very few of the VLMs that support video have proper video testing with dummy weights, nor do we have tests with many modalities in a single input sample. Unfortunately I never had time to push this further.

Development

Successfully merging this pull request may close these issues.

transformers==5.3.0, qwen2.5-vl video input vision_position_ids seems to be wrong