
Conversation

lambertwjh
Contributor

What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@lambertwjh
Contributor Author

Implement a fast GLM-4.1V image processor with shape-based grouping, ensuring per-sub-batch size consistency and improving mixed-size batch efficiency.

@Rocketknight1
Member

cc @yonigozlan


@yonigozlan left a comment


Thanks for working on this @lambertwjh, very nice update! We just have to ensure we're not breaking backward compatibility.

stacked_images = self.resize(
    stacked_images,
    size=SizeDict(height=resized_height, width=resized_width),
)
Member


Looks like this might be breaking backward compatibility, as the resized size used to be computed as the max of all target sizes in the batch. I'm not exactly sure why this was the case in the first place, but let's make sure there are no edge cases here that would make this a breaking change. In particular, having the same resized size for all images in the batch ensured that we could stack the images at the end; I'm not sure that's still the case now.

Contributor Author


The "backward compatibility" concern is fundamentally invalid because the current behavior is already broken—it fails to process mixed-size batches entirely, while same-size batches remain completely unaffected since identical input dimensions produce identical output dimensions, and the Fast version has already proven the safety of this fix through successful implementation and testing.
The Fast version's proven approach:
group images by their original dimensions, process each group independently by applying smart_resize per group, maintain the original batch sequence order through proper reconstruction, and only stack dimensionally compatible tensors.
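For concreteness, here is a minimal sketch of the shape-based grouping idea; this is illustrative rather than the PR's actual code, smart_resize is stubbed, and the helper name is made up:

```python
from collections import defaultdict

import torch
import torch.nn.functional as F


def smart_resize(height, width, factor=28):
    # Stand-in for the real smart_resize: snap each side to a multiple of `factor`.
    return max(factor, round(height / factor) * factor), max(factor, round(width / factor) * factor)


def resize_by_shape_groups(images):
    """images: list of (C, H, W) float tensors, possibly with different H and W."""
    groups = defaultdict(list)  # (H, W) -> list of (original_index, image)
    for idx, img in enumerate(images):
        groups[tuple(img.shape[-2:])].append((idx, img))

    resized = [None] * len(images)
    for (h, w), members in groups.items():
        new_h, new_w = smart_resize(h, w)                 # one target size per shape group
        batch = torch.stack([img for _, img in members])  # same input shape, so stacking is safe
        batch = F.interpolate(batch, size=(new_h, new_w), mode="bicubic", antialias=True)
        for (idx, _), out in zip(members, batch):
            resized[idx] = out                            # restore the original batch order
    return resized
```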

Member


I'm not sure I follow you. When you say the current behavior is broken, do you mean it's incorrect because the images are not resized to the correct size, or that it crashes?
A big part of the issue is that there are no image processing tests for this model, for some reason. Adding a test file would make it clearer what works and what doesn't.
Would you mind adding this test file? You can look at the other test_image_processing_....py files to see how they should be written.
If you don't have the bandwidth for that, we can open a separate PR.
Thanks a lot!

@lambertwjh
Contributor Author

1. What exactly is broken?
Incorrectness vs. crash: it's a runtime concatenation failure on mixed-size batches. When images of different original sizes are processed, smart_resize produces a different (H, W) per shape group in the Fast processor. After patchification, that yields different sequence lengths (grid_t * grid_h * grid_w) across groups, so the final torch.cat along the batch dimension fails because the non-batch dimensions don't match (illustrated below).
Same-size batches are unaffected because all groups end up with identical resized (H, W), and therefore identical grid sizes.
2. Why did this not show up in the "standard" processor?
The standard processor computes the resize target from the first image and applies that same target size to the whole batch. That avoids the final stacking error, but it is semantically inconsistent for mixed-size batches (the first image dictates the grid for all images).

[For maintainers] Suggested jobs to run (before merge)

run-slow: glm4v


@yonigozlan left a comment


Hi again @lambertwjh, thanks for the explanation. After reading the code more thoroughly, I agree with your changes. I added tests to make sure the slow and fast processors stay equivalent, so we don't break this in the future.
Thanks for contributing, LGTM!
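For reference, a rough sketch of what such a slow/fast equivalence check might look like; the class names, output keys, and tolerance here are assumptions rather than text copied from the merged tests:

```python
import numpy as np
from transformers import Glm4vImageProcessor, Glm4vImageProcessorFast


def test_slow_fast_equivalence_mixed_sizes():
    # A mixed-size batch: the case discussed in the review thread above.
    images = [
        np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8),
        np.random.randint(0, 256, (720, 1280, 3), dtype=np.uint8),
    ]
    out_slow = Glm4vImageProcessor()(images, return_tensors="pt")
    out_fast = Glm4vImageProcessorFast()(images, return_tensors="pt")

    # Output keys assumed to follow the Qwen2-VL-style convention.
    assert (out_slow.image_grid_thw == out_fast.image_grid_thw).all()
    assert np.allclose(
        out_slow.pixel_values.numpy(), out_fast.pixel_values.numpy(), atol=1e-1
    )
```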

@yonigozlan enabled auto-merge (squash) September 10, 2025 20:56
@yonigozlan merged commit dae1ccf into huggingface:main Sep 10, 2025
15 checks passed
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

vijayabhaskar-ev pushed a commit to vijayabhaskar-ev/transformers that referenced this pull request Oct 2, 2025
* fix_image_processing_fast_for_glm4v

* fix(format): auto-ruff format

* add test image processing glm4v

* fix quality

---------

Co-authored-by: Your Name <you@example.com>
Co-authored-by: yonigozlan <yoni.gozlan@huggingface.co>
yuchenxie4645 pushed a commit to yuchenxie4645/transformers that referenced this pull request Oct 4, 2025