Replace pixel_position_ids with image_position_ids for Gemma4 support #5452

qgallouedec merged 2 commits into main
```python
if num_images is not None and pixel_values is not None:
    model_inputs["image_position_ids"] = image_position_ids[img_start:img_end]
else:
    model_inputs["image_position_ids"] = image_position_ids[start : start + batch_size]
```
Dead code, AFAIK: `image_position_ids` is never used without `num_images` or `pixel_values`. Unless I'm missing something @sergiopaniego?
💡 Codex Review (reviewed commit: b068977e20)
trl/trainer/rloo_trainer.py (outdated)
```python
if pixel_position_ids is not None:
    model_inputs["pixel_position_ids"] = pixel_position_ids[start : start + batch_size]
if image_position_ids is not None:
    model_inputs["image_position_ids"] = image_position_ids[img_start:img_end]
```
Handle the non-`image_grid_thw` path before slicing image positions
In `RLOOTrainer._get_per_token_logps_and_entropies`, `image_position_ids` is now sliced with `image_position_ids[img_start:img_end]`, but `img_start`/`img_end` are only assigned in the `image_grid_thw is not None` branch. For Gemma-style VLM inputs (where `image_position_ids` is present but `image_grid_thw` is absent), this hits an `UnboundLocalError` at runtime and crashes loss/logprob computation for RLOO training.
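One way to avoid that failure mode is to derive the per-image slice bounds from `num_images` unconditionally, before any slicing. The sketch below is hypothetical — the function name and exact signature are assumptions, not TRL's actual code — but it shows the idea: convert sample-level offsets into cumulative image offsets so the bounds always exist.

```python
# Hypothetical sketch (names assumed, not TRL's actual code): compute the
# per-image slice bounds from num_images before slicing image_position_ids,
# so the Gemma-style path (no image_grid_thw) never hits an UnboundLocalError.
def slice_image_position_ids(image_position_ids, num_images, start, batch_size):
    """Slice the per-image-indexed image_position_ids for one micro-batch.

    num_images[i] is the image count of sample i; sample offsets are
    converted to cumulative image offsets before slicing.
    """
    if image_position_ids is None:
        return None
    # boundaries[i] = total number of images in samples [0, i)
    boundaries = [0]
    for n in num_images:
        boundaries.append(boundaries[-1] + n)
    img_start = boundaries[start]
    img_end = boundaries[min(start + batch_size, len(num_images))]
    return image_position_ids[img_start:img_end]
```

With `num_images = [1, 2, 3]`, a micro-batch starting at sample 1 with batch size 2 maps to image offsets 1..6, i.e. all five images of samples 1 and 2.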
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 9c5ae1e.
```python
    model_inputs["pixel_values"] = pixel_values[img_start:img_end]
else:
    model_inputs["pixel_values"] = pixel_values[start : start + batch_size]
```
Removed num_images slicing breaks per-image-indexed VLMs (High Severity)

The `elif pixel_values is not None` fallback branch now uses sample-indexed slicing (`pixel_values[start : start + batch_size]`) instead of the previous cumulative `num_images`-based slicing. The `split_pixel_values_by_grid` utility in `utils.py` confirms that models without `image_grid_thw` (its comment says "e.g. Gemma") have `pixel_values` indexed per-image, not per-sample. For any VLM with per-image-indexed `pixel_values` that lacks both `image_grid_thw` and `image_position_ids` (e.g., SmolVLM2/Idefics3 with multi-image inputs), this slicing provides the wrong images to the model, likely causing a runtime error from image-token count mismatches.
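The mismatch Bugbot describes is easy to see with a toy example. The data below is hypothetical (made up for illustration, not from TRL): with multi-image samples, per-sample slicing of a per-image-indexed `pixel_values` sequence selects images belonging to the wrong samples, while cumulative `num_images` offsets select the right ones.

```python
# Illustrative toy (hypothetical data, not TRL code): per-sample slicing of a
# per-image-indexed pixel_values sequence selects the wrong images when
# samples hold multiple images; cumulative num_images offsets fix this.
from itertools import accumulate

num_images = [2, 1, 3]  # images per sample
pixel_values = ["s0_img0", "s0_img1", "s1_img0", "s2_img0", "s2_img1", "s2_img2"]

start, batch_size = 1, 2  # micro-batch covers samples 1 and 2

# Wrong: treats pixel_values as one entry per sample
wrong = pixel_values[start : start + batch_size]

# Right: convert sample offsets to cumulative image offsets
cum = [0, *accumulate(num_images)]  # [0, 2, 3, 6]
img_start, img_end = cum[start], cum[start + batch_size]
right = pixel_values[img_start:img_end]

print(wrong)  # ['s0_img1', 's1_img0'] -- images from the wrong samples
print(right)  # all four images belonging to samples 1 and 2
```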
albertvillanova left a comment:
Thanks for addressing this and for the clear explanation of the discrepancy. 🤗
Regarding #5374, I think this is a good lesson for us going forward. For changes based on anticipated upstream behavior, it might be safer to keep such PRs on hold and only merge them once the model is officially released and the required interface is confirmed. This would help avoid introducing speculative code paths that we may need to clean up later.
What do you think?
also got this issue so thanks for the fix!
Sounds good @albertvillanova.

@albertvillanova thanks for the idea, I think it makes sense. Maybe we could keep a branch + open PR that clearly describes the upcoming changes, so we can point to it in releases (blogs, comms...) if needed.


pixel_position_idswas speculatively added in #5374 before the Gemma4 release. Now that Gemma4 is out, the actual key isimage_position_ids(they renamed it presumably) with slightly different semantics: it's indexed by image (likepixel_values), not by sample. #5437 partially addressed this by addingimage_position_idsto GRPO, but didn't remove the oldpixel_position_idsleaving both keys in GRPO and only the stale one in DPO/RLOO.Note
Medium Risk
Touches multimodal forward-kwargs plumbing in several trainers; incorrect slicing/indexing of per-image tensors could break vision-language training for some models/datasets, though the change is narrow and largely a key rename with clarified semantics.
Overview

Updates VLM training/inference plumbing to drop the stale `pixel_position_ids` input and consistently use Gemma4's `image_position_ids` across `DPOTrainer`, `GRPOTrainer`, and `RLOOTrainer`. Adjusts GRPO/RLOO batching logic so `image_position_ids` is sliced per-image (aligned with `pixel_values`/`num_images`) when building `model_inputs`, and updates related comments/docstrings (including `SFTTrainer`) to reflect the new key.
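The semantic change described in the overview — `image_position_ids` is per-image, aligned with `pixel_values`, unlike the old per-sample `pixel_position_ids` — can be captured as a simple invariant. The helper below is purely illustrative (an assumed name, not a TRL utility): when both keys are present, they should have the same length.

```python
# Hedged sketch of the semantics described above: image_position_ids is
# per-image (aligned with pixel_values), unlike the old per-sample
# pixel_position_ids. This check is illustrative, not TRL's validation.
def check_image_alignment(model_inputs):
    """Assert image_position_ids has one entry per image in pixel_values."""
    pv = model_inputs.get("pixel_values")
    ipi = model_inputs.get("image_position_ids")
    if pv is not None and ipi is not None:
        assert len(ipi) == len(pv), (
            f"per-image key mismatch: {len(ipi)} position ids for {len(pv)} images"
        )
```

A batching bug like the ones flagged above would trip this check as soon as a micro-batch mixes samples with different image counts.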