Fix Qwen2.5VL temporal grid positions #45400
zucchini-nlp wants to merge 10 commits into huggingface:main from
Conversation
        grid_thw[2].item() // spatial_merge_size,
    )

image_seq_length = llm_grid_h * llm_grid_w * llm_grid_t
fix repo from qwen2-vl, here and after this
run-slow: qwen2_vl, qwen2_5_vl, glm4v, qwen3_vl, ernie4_5_vl_moe

This comment contains models: ["models/ernie4_5_vl_moe", "models/glm4v", "models/qwen2_5_vl", "models/qwen2_vl", "models/qwen3_vl"]
def get_rope_index(
    self,
    input_ids: torch.LongTensor,
    mm_token_type_ids: torch.IntTensor,
same thing, just a bit shorter and easier to follow. Copied from 'qwen3-vl'
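To make the discussion concrete, here is a minimal sketch of what 3D (temporal, height, width) position ids for one vision grid look like after spatial merging. This is an illustrative stand-in, not the actual `get_rope_index` from transformers; the helper name and shapes are assumptions.

```python
import torch

def make_vision_position_ids(t: int, h: int, w: int, merge: int = 2) -> torch.Tensor:
    # Hypothetical helper (not the real get_rope_index): build the three
    # RoPE position-id rows (temporal, height, width) for one vision grid
    # after merging `merge` x `merge` spatial patches into one token.
    llm_h, llm_w = h // merge, w // merge
    # temporal index is constant within a frame, increments per frame
    t_ids = torch.arange(t).view(t, 1, 1).expand(t, llm_h, llm_w).reshape(-1)
    # height index repeats across widths, tiled over frames
    h_ids = torch.arange(llm_h).view(1, llm_h, 1).expand(t, llm_h, llm_w).reshape(-1)
    # width index cycles fastest
    w_ids = torch.arange(llm_w).view(1, 1, llm_w).expand(t, llm_h, llm_w).reshape(-1)
    return torch.stack([t_ids, h_ids, w_ids])  # shape: (3, t * llm_h * llm_w)
```

For a 2-frame, 4x4-patch grid with merge size 2, this yields 8 vision tokens whose temporal row is `[0, 0, 0, 0, 1, 1, 1, 1]`.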
vasqu left a comment
Imo this looks good, I just have a few remarks to get some details in. And let's really check all the models please, glm image as well; no need to be sparse about running tests here.
One concern: we change one integration test, so I just want to make sure this is a proper fix and not that we merely aligned the test with this fix.
# Repeat the positions per each grid and per video frame. Add start position for temporal grid
# Important to add start positions after applying `time_interval`, order matters
Let's move this comment above; I was thinking it should go directly on the first arange for the temporal dim.
Though it doesn't really apply to the arange; we are repeating and adding the start position afterwards?
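The "order matters" point above can be sketched with hypothetical values: the temporal positions must be scaled by the per-grid time interval before the grid's start position is added, otherwise the start offset gets scaled too. Function names and numbers here are illustrative, not the actual code.

```python
import torch

def scale_then_shift(n_frames: int, time_interval: int, start: int) -> list:
    # correct order: scale the frame indices first, then add the start offset
    return (torch.arange(n_frames) * time_interval + start).tolist()

def shift_then_scale(n_frames: int, time_interval: int, start: int) -> list:
    # wrong order: the start position gets multiplied by the interval too
    return ((torch.arange(n_frames) + start) * time_interval).tolist()
```

With 3 frames, interval 2 and start 5, the correct order gives `[5, 7, 9]` while the wrong order gives `[10, 12, 14]`.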
Not the same as the other ones?
Nope, it only supports image generation from text or another image. It is used only in diffusers as part of their pipeline.
{
    (None, None): [
        'system\nYou are a helpful assistant.\nuser\nWhat is shown in this video?\nassistant\nThe video shows an indoor tennis court with a person standing on the service line, preparing to serve. The individual is wearing athletic attire, including a white',
        'system\nYou are a helpful assistant.\nuser\nWhat is shown in this video?\nassistant\nThe video shows two individuals playing tennis on an indoor court. The player in the foreground, dressed in a white shirt and black shorts, is preparing to',
So this is intentional, was it changed before and we just went along?
Yeah, I changed the second temporal grid to a non-zero value by passing the video through the chat template, so it actually tests something now. Previously it wasn't testing positions with video frames at all 😢
(old processors had no way to decode video, which I added much later, so it's expected this was missed)
This test was passing with or without this PR, so its output was kinda useless...
[For maintainers] Suggested jobs to run (before merge) run-slow: ernie4_5_vl_moe, glm46v, glm4v, glm4v_moe, glm_image, glm_ocr, paddleocr_vl, qwen2_5_vl, qwen2_vl, qwen3_5, qwen3_5_moe, qwen3_vl, qwen3_vl_moe
I chose models with specific position ids, and from different families. The other ones are all just copies, or don't support videos. This fix has no effect when
Re: having better multimodal tests. AFAIK very few of the VLMs that support video have proper video testing with dummy weights. Nor do we have tests with many modalities in a single input sample. Unfortunately I never had time to push this further.
What does this PR do?
Fixes #45381, but it is weird; I remember checking position ids by value in qwen2.5 as well to verify that time-interval works 🤔
Update: I know why. The integration test we have uses `second_grid_its = 0.083`, which rounds to `0.0`, so the multiplication is zero no matter what value we get for vision positions. Great! For most models we didn't see any diff because each frame is separated by a timestamp and processed separately. Only the first two Qwen releases process all frames in bulk at once.
In any case, it is worth adding a fast test with expected positions; will do so.
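The vacuous-test problem described above can be reproduced in a few lines. This is a hedged illustration with hypothetical values, not the actual test code: if the per-grid timestamp is ~0.083, the scaled temporal positions all truncate to 0 when cast to integer ids, so any bug in the scaling is invisible to an assertion on the output.

```python
import torch

# Hypothetical values mirroring the description above: a tiny per-grid
# timestamp makes every scaled temporal position truncate to zero,
# so the temporal scaling is effectively untested.
second_grid_ts = 0.083
scaled = (torch.arange(4) * second_grid_ts).long()
print(scaled.tolist())  # [0, 0, 0, 0]
```

With any non-trivial timestamp (or more frames), the positions become non-zero and a wrong multiplication order or interval would actually change the expected values.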