Replace pixel_position_ids with image_position_ids for Gemma4 support #5452

qgallouedec merged 2 commits into main
```python
if num_images is not None and pixel_values is not None:
    model_inputs["image_position_ids"] = image_position_ids[img_start:img_end]
else:
    model_inputs["image_position_ids"] = image_position_ids[start : start + batch_size]
```
Dead code, AFAIK: `image_position_ids` is never used without `num_images` or `pixel_values`. Unless I'm missing something @sergiopaniego?
💡 Codex Review (reviewed commit: b068977e20)
trl/trainer/rloo_trainer.py (outdated)
```python
if pixel_position_ids is not None:
    model_inputs["pixel_position_ids"] = pixel_position_ids[start : start + batch_size]
if image_position_ids is not None:
    model_inputs["image_position_ids"] = image_position_ids[img_start:img_end]
```
Handle the non-`image_grid_thw` path before slicing image positions
In `RLOOTrainer._get_per_token_logps_and_entropies`, `image_position_ids` is now sliced with `image_position_ids[img_start:img_end]`, but `img_start`/`img_end` are only assigned in the `image_grid_thw is not None` branch. For Gemma-style VLM inputs (where `image_position_ids` is present but `image_grid_thw` is absent), this hits an `UnboundLocalError` at runtime and crashes loss/logprob computation for RLOO training.
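One way to avoid that failure mode is to derive the per-image slice bounds from `num_images` unconditionally, before any slicing. The sketch below is hypothetical — the function name and exact signature are assumptions, not TRL's actual code — but it shows the idea: convert sample-level offsets into cumulative image offsets so the bounds always exist.

```python
# Hypothetical sketch (names assumed, not TRL's actual code): compute the
# per-image slice bounds from num_images before slicing image_position_ids,
# so the Gemma-style path (no image_grid_thw) never hits an UnboundLocalError.
def slice_image_position_ids(image_position_ids, num_images, start, batch_size):
    """Slice the per-image-indexed image_position_ids for one micro-batch.

    num_images[i] is the image count of sample i; sample offsets are
    converted to cumulative image offsets before slicing.
    """
    if image_position_ids is None:
        return None
    # boundaries[i] = total number of images in samples [0, i)
    boundaries = [0]
    for n in num_images:
        boundaries.append(boundaries[-1] + n)
    img_start = boundaries[start]
    img_end = boundaries[min(start + batch_size, len(num_images))]
    return image_position_ids[img_start:img_end]
```

With `num_images = [1, 2, 3]`, a micro-batch starting at sample 1 with batch size 2 maps to image offsets 1..6, i.e. all five images of samples 1 and 2.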
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 9c5ae1e.
```python
    model_inputs["pixel_values"] = pixel_values[img_start:img_end]
else:
    model_inputs["pixel_values"] = pixel_values[start : start + batch_size]
```
Removed num_images slicing breaks per-image-indexed VLMs (High Severity)

The `elif pixel_values is not None` fallback branch now uses sample-indexed slicing (`pixel_values[start : start + batch_size]`) instead of the previous cumulative `num_images`-based slicing. The `split_pixel_values_by_grid` utility in `utils.py` confirms that models without `image_grid_thw` (its comment says "e.g. Gemma") have `pixel_values` indexed per-image, not per-sample. For any VLM with per-image-indexed `pixel_values` that lacks both `image_grid_thw` and `image_position_ids` (e.g., SmolVLM2/Idefics3 with multi-image inputs), this slicing provides the wrong images to the model, likely causing a runtime error from image-token count mismatches.
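The mismatch Bugbot describes is easy to see with a toy example. The data below is hypothetical (made up for illustration, not from TRL): with multi-image samples, per-sample slicing of a per-image-indexed `pixel_values` sequence selects images belonging to the wrong samples, while cumulative `num_images` offsets select the right ones.

```python
# Illustrative toy (hypothetical data, not TRL code): per-sample slicing of a
# per-image-indexed pixel_values sequence selects the wrong images when
# samples hold multiple images; cumulative num_images offsets fix this.
from itertools import accumulate

num_images = [2, 1, 3]  # images per sample
pixel_values = ["s0_img0", "s0_img1", "s1_img0", "s2_img0", "s2_img1", "s2_img2"]

start, batch_size = 1, 2  # micro-batch covers samples 1 and 2

# Wrong: treats pixel_values as one entry per sample
wrong = pixel_values[start : start + batch_size]

# Right: convert sample offsets to cumulative image offsets
cum = [0, *accumulate(num_images)]  # [0, 2, 3, 6]
img_start, img_end = cum[start], cum[start + batch_size]
right = pixel_values[img_start:img_end]

print(wrong)  # ['s0_img1', 's1_img0'] -- images from the wrong samples
print(right)  # all four images belonging to samples 1 and 2
```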
albertvillanova left a comment:
Thanks for addressing this and for the clear explanation of the discrepancy. 🤗
Regarding #5374, I think this is a good lesson for us going forward. For changes based on anticipated upstream behavior, it might be safer to keep such PRs on hold and only merge them once the model is officially released and the required interface is confirmed. This would help avoid introducing speculative code paths that we may need to clean up later.
What do you think?
also got this issue so thanks for the fix!
Sounds good @albertvillanova.

@albertvillanova thanks for the idea, I think it makes sense. Maybe we could keep a branch + open PR that clearly describes the upcoming changes, so we can point to it in releases (blogs, comms...) if needed.


pixel_position_idswas speculatively added in #5374 before the Gemma4 release. Now that Gemma4 is out, the actual key isimage_position_ids(they renamed it presumably) with slightly different semantics: it's indexed by image (likepixel_values), not by sample. #5437 partially addressed this by addingimage_position_idsto GRPO, but didn't remove the oldpixel_position_idsleaving both keys in GRPO and only the stale one in DPO/RLOO.Note
Medium Risk
Touches multimodal forward-kwargs plumbing in several trainers; incorrect slicing/indexing of per-image tensors could break vision-language training for some models/datasets, though the change is narrow and largely a key rename with clarified semantics.
Overview

Updates VLM training/inference plumbing to drop the stale `pixel_position_ids` input and consistently use Gemma4's `image_position_ids` across `DPOTrainer`, `GRPOTrainer`, and `RLOOTrainer`. Adjusts GRPO/RLOO batching logic so `image_position_ids` is sliced per-image (aligned with `pixel_values`/`num_images`) when building `model_inputs`, and updates related comments/docstrings (including `SFTTrainer`) to reflect the new key.
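The semantic change described in the overview — `image_position_ids` is per-image, aligned with `pixel_values`, unlike the old per-sample `pixel_position_ids` — can be captured as a simple invariant. The helper below is purely illustrative (an assumed name, not a TRL utility): when both keys are present, they should have the same length.

```python
# Hedged sketch of the semantics described above: image_position_ids is
# per-image (aligned with pixel_values), unlike the old per-sample
# pixel_position_ids. This check is illustrative, not TRL's validation.
def check_image_alignment(model_inputs):
    """Assert image_position_ids has one entry per image in pixel_values."""
    pv = model_inputs.get("pixel_values")
    ipi = model_inputs.get("image_position_ids")
    if pv is not None and ipi is not None:
        assert len(ipi) == len(pv), (
            f"per-image key mismatch: {len(ipi)} position ids for {len(pv)} images"
        )
```

A batching bug like the ones flagged above would trip this check as soon as a micro-batch mixes samples with different image counts.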