
Replace pixel_position_ids with image_position_ids for Gemma4 support #5452

Merged
qgallouedec merged 2 commits into main from pixel-to-image_position_ids on Apr 4, 2026
Conversation

@qgallouedec (Member) commented Apr 4, 2026

pixel_position_ids was speculatively added in #5374 before the Gemma4 release. Now that Gemma4 is out, the actual key is image_position_ids (they presumably renamed it), with slightly different semantics: it is indexed by image (like pixel_values), not by sample. #5437 partially addressed this by adding image_position_ids to GRPO, but didn't remove the old pixel_position_ids, leaving both keys in GRPO and only the stale one in DPO/RLOO.


Note

Medium Risk
Touches multimodal forward-kwargs plumbing in several trainers; incorrect slicing/indexing of per-image tensors could break vision-language training for some models/datasets, though the change is narrow and largely a key rename with clarified semantics.

Overview
Updates VLM training/inference plumbing to drop the stale pixel_position_ids input and consistently use Gemma4’s image_position_ids across DPOTrainer, GRPOTrainer, and RLOOTrainer.

Adjusts GRPO/RLOO batching logic so image_position_ids is sliced per-image (aligned with pixel_values/num_images) when building model_inputs, and updates related comments/docstrings (including SFTTrainer) to reflect the new key.
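The per-sample vs per-image distinction at the heart of this PR can be sketched as follows. This is a minimal illustration, not the actual trainer code; the names `num_images`, `start`, and `batch_size` mirror the discussion above but are assumptions:

```python
def slice_per_sample(tensor, start, batch_size):
    # Per-sample tensors (e.g. input_ids) have one row per sample,
    # so a plain range slice selects the right rows.
    return tensor[start : start + batch_size]

def slice_per_image(tensor, num_images, start, batch_size):
    # Per-image tensors (e.g. pixel_values, image_position_ids) have one row
    # per image, and samples may contain different numbers of images. Convert
    # the sample range [start, start + batch_size) into an image range using
    # cumulative image counts.
    img_start = sum(num_images[:start])
    img_end = img_start + sum(num_images[start : start + batch_size])
    return tensor[img_start:img_end]

# Example: 4 samples with 2, 1, 3, 1 images each -> 7 image rows total.
rows = list(range(7))
num_images = [2, 1, 3, 1]
print(slice_per_image(rows, num_images, start=1, batch_size=2))  # -> [2, 3, 4, 5]
```

Slicing the same tensor per-sample (`rows[1:3]`) would return only two rows, which is why the batching logic must branch on how each tensor is indexed.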

Reviewed by Cursor Bugbot for commit 9c5ae1e.

if num_images is not None and pixel_values is not None:
model_inputs["image_position_ids"] = image_position_ids[img_start:img_end]
else:
model_inputs["image_position_ids"] = image_position_ids[start : start + batch_size]
@qgallouedec (Member, Author) commented on this diff:

Dead code, AFAIK: image_position_ids is never used without num_images or pixel_values. Unless I'm missing something, @sergiopaniego?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b068977e20


if pixel_position_ids is not None:
model_inputs["pixel_position_ids"] = pixel_position_ids[start : start + batch_size]
if image_position_ids is not None:
model_inputs["image_position_ids"] = image_position_ids[img_start:img_end]

P1: Handle non-image_grid_thw path before slicing image positions

In RLOOTrainer._get_per_token_logps_and_entropies, image_position_ids is now sliced with image_position_ids[img_start:img_end], but img_start/img_end are only assigned in the image_grid_thw is not None branch. For Gemma-style VLM inputs (where image_position_ids is present but image_grid_thw is absent), this hits an UnboundLocalError at runtime and crashes loss/logprob computation for RLOO training.

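The failure mode Codex describes can be reproduced in miniature. The following is a hedged sketch; the function and variable names are assumptions, not the actual RLOOTrainer code:

```python
def slice_image_positions_buggy(image_position_ids, image_grid_thw, num_images, start, batch_size):
    # img_start/img_end are only assigned inside the image_grid_thw branch...
    if image_grid_thw is not None:
        img_start = sum(num_images[:start])
        img_end = img_start + sum(num_images[start : start + batch_size])
    # ...so a Gemma-style input with image_position_ids but no image_grid_thw
    # reaches this line with img_start/img_end unbound -> UnboundLocalError.
    return image_position_ids[img_start:img_end]

def slice_image_positions_fixed(image_position_ids, image_grid_thw, num_images, start, batch_size):
    # Compute the per-image range whenever num_images is available,
    # independent of whether image_grid_thw is present.
    img_start = sum(num_images[:start])
    img_end = img_start + sum(num_images[start : start + batch_size])
    return image_position_ids[img_start:img_end]

ids = list(range(5))
print(slice_image_positions_fixed(ids, None, [2, 3], start=0, batch_size=1))  # -> [0, 1]
try:
    slice_image_positions_buggy(ids, None, [2, 3], start=0, batch_size=1)
except UnboundLocalError as e:
    print("crashes:", type(e).__name__)  # crashes: UnboundLocalError
```

The fix is simply to hoist the range computation out of the image_grid_thw branch so every per-image input takes the same path.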

qgallouedec added a commit that referenced this pull request Apr 4, 2026
@cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.



model_inputs["pixel_values"] = pixel_values[img_start:img_end]
else:
model_inputs["pixel_values"] = pixel_values[start : start + batch_size]

Removed num_images slicing breaks per-image-indexed VLMs

High Severity

The elif pixel_values is not None fallback branch now uses sample-indexed slicing (pixel_values[start : start + batch_size]) instead of the previous cumulative num_images-based slicing. The split_pixel_values_by_grid utility in utils.py confirms that models without image_grid_thw (comment says "e.g. Gemma") have pixel_values indexed per-image, not per-sample. For any VLM with per-image-indexed pixel_values that lacks both image_grid_thw and image_position_ids (e.g., SmolVLM2/Idefics3 with multi-image inputs), this slicing provides the wrong images to the model, likely causing a runtime error from image token count mismatches.
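To make the Bugbot finding concrete, here is a toy illustration (assumed data layout, not TRL code) of how sample-indexed slicing picks the wrong rows from a per-image-indexed tensor:

```python
# 3 samples with 1, 2, 1 images respectively; pixel_values is indexed
# per image, so it has 4 rows in total.
pixel_values = ["img0", "img1", "img2", "img3"]
num_images = [1, 2, 1]

start, batch_size = 1, 2  # we want samples 1..2, i.e. img1, img2, img3

# Sample-indexed slicing (the fallback branch after this PR): wrong rows.
wrong = pixel_values[start : start + batch_size]  # ['img1', 'img2']

# Cumulative num_images slicing (the removed behavior): correct rows.
img_start = sum(num_images[:start])                                # 1
img_end = img_start + sum(num_images[start : start + batch_size])  # 4
right = pixel_values[img_start:img_end]  # ['img1', 'img2', 'img3']
print(wrong, right)
```

With multi-image samples the two slices diverge, which is why a model expecting three images for these samples would see a token-count mismatch at runtime.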

Additional Locations (1)

@qgallouedec qgallouedec mentioned this pull request Apr 4, 2026
@albertvillanova (Member) left a comment


Thanks for addressing this and for the clear explanation of the discrepancy. 🤗

Regarding #5374, I think this is a good lesson for us going forward. For changes based on anticipated upstream behavior, it might be safer to keep such PRs on hold and only merge them once the model is officially released and the required interface is confirmed. This would help avoid introducing speculative code paths that we may need to clean up later.

What do you think?

@kashif (Collaborator) commented Apr 4, 2026

Also got this issue, so thanks for the fix!

@qgallouedec (Member, Author) commented:

Sounds good @albertvillanova.

@qgallouedec qgallouedec merged commit b03d6d2 into main Apr 4, 2026
14 checks passed
@qgallouedec qgallouedec deleted the pixel-to-image_position_ids branch April 4, 2026 14:04
@sergiopaniego (Member) commented:

@albertvillanova thanks for the idea, I think it makes sense. Maybe we could keep a branch + open PR that clearly describes the upcoming changes, so we can point to it in releases (blogs, comms...) if needed.
