
Conversation

@zucchini-nlp (Member) commented Sep 11, 2025

What does this PR do?

It lays the groundwork for my next PR, where I am aiming to unify DefaultImagesKwargsFastInit and ImagesKwargs, since they are almost identical except for three kwargs (size_divisor, do_pad and pad_size).

This PR consolidates the naming of the above three kwargs. Specifically:

  • size_divisor is removed from the general kwargs, as it's used in only ~5 models
  • pad_size and do_pad are added to the default set of fast image processor kwargs, along with a default self.pad method (a usage sketch follows this list)
  • A few deprecations are added where the naming was inconsistent, e.g. size_divisibility instead of size_divisor
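
For illustration, here is a minimal usage sketch of how the consolidated kwargs could look from the caller's side. The checkpoint name, image sizes, and the extra do_resize=False flag are assumptions made for the example, not something this PR prescribes:

```python
from PIL import Image
from transformers import AutoImageProcessor

# Hypothetical example: any checkpoint whose fast image processor supports padding.
processor = AutoImageProcessor.from_pretrained("facebook/detr-resnet-50", use_fast=True)

images = [Image.new("RGB", (640, 480)), Image.new("RGB", (512, 512))]

# `do_pad` and `pad_size` now live in the default set of fast image processor kwargs,
# so they are spelled the same way across processors.
inputs = processor(
    images=images,
    do_resize=False,  # keep the original sizes so only padding changes the shape
    do_pad=True,
    pad_size={"height": 640, "width": 640},
    return_tensors="pt",
)
print(inputs["pixel_values"].shape)  # expected: both images padded to 640x640
```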

return pad(image, padding, data_format=data_format, input_data_format=input_data_format)

def pad(self, *args, **kwargs):
    logger.info("pad is deprecated and will be removed in version 4.27. Please use pad_image instead.")
Member Author:

4.27 🤯

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp (Member Author):

Done, finally! @yonigozlan, I'd like to ask for a first review from you, as this touches the fast image processors. To keep you aligned: the next PR will remove DefaultFastImageKwargs and replace it with ImagesKwargs everywhere. I have a draft locally but haven't pushed the changes to all models yet.

@yonigozlan (Member) left a comment:

Thanks a lot for the huge work!
First quick review: I didn't check every model, but overall I'm completely on board with this, very much needed.
Can't wait to unify DefaultFastImageProcessorKwargs and ImagesKwargs.

Comment on lines +245 to +311
def pad(
    self,
    images: "torch.Tensor",
    pad_size: SizeDict = None,
    fill_value: Optional[int] = 0,
    padding_mode: Optional[str] = "constant",
    return_mask: Optional[bool] = False,
    disable_grouping: Optional[bool] = False,
    **kwargs,
) -> "torch.Tensor":
    """
    Pads images to `(pad_size["height"], pad_size["width"])` or to the largest size in the batch.
    Args:
        images (`torch.Tensor`):
            Images to pad.
        pad_size (`SizeDict`, *optional*):
            Dictionary in the format `{"height": int, "width": int}` specifying the size of the output image.
        fill_value (`int`, *optional*, defaults to `0`):
            The constant value used to fill the padded area.
        padding_mode (`str`, *optional*, defaults to `"constant"`):
            The padding mode to use. Can be any of the modes supported by
            `torch.nn.functional.pad` (e.g. constant, reflection, replication).
        return_mask (`bool`, *optional*, defaults to `False`):
            Whether to return a pixel mask to denote padded regions.
        disable_grouping (`bool`, *optional*, defaults to `False`):
            Whether to disable grouping of images by size.
    Returns:
        `torch.Tensor`: The padded images, and optionally the pixel mask when `return_mask=True`.
    """
    if pad_size is not None:
        if not (pad_size.height and pad_size.width):
            raise ValueError(f"Pad size must contain 'height' and 'width' keys only. Got pad_size={pad_size}.")
        pad_size = (pad_size.height, pad_size.width)
    else:
        # pad everything to the largest height/width found in the batch
        pad_size = get_max_height_width(images)

    grouped_images, grouped_images_index = group_images_by_shape(images, disable_grouping=disable_grouping)
    processed_images_grouped = {}
    processed_masks_grouped = {}
    for shape, stacked_images in grouped_images.items():
        image_size = stacked_images.shape[-2:]
        padding_height = pad_size[0] - image_size[0]
        padding_width = pad_size[1] - image_size[1]
        if padding_height < 0 or padding_width < 0:
            raise ValueError(
                f"Padding dimensions are negative. Please make sure that the `pad_size` is larger than the "
                f"image size. Got pad_size={pad_size}, image_size={image_size}."
            )
        if image_size != pad_size:
            # pad on the right and bottom only
            padding = (0, 0, padding_width, padding_height)
            stacked_images = F.pad(stacked_images, padding, fill=fill_value, padding_mode=padding_mode)
        processed_images_grouped[shape] = stacked_images

        if return_mask:
            # keep only one channel in the pixel mask; 1 marks real pixels, 0 marks padding
            stacked_masks = torch.zeros_like(stacked_images, dtype=torch.int64)[..., 0, :, :]
            stacked_masks[..., : image_size[0], : image_size[1]] = 1
            processed_masks_grouped[shape] = stacked_masks

    processed_images = reorder_images(processed_images_grouped, grouped_images_index)
    if return_mask:
        processed_masks = reorder_images(processed_masks_grouped, grouped_images_index)
        return processed_images, processed_masks

    return processed_images
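
As a side note for readers, this is a small standalone sketch (not the PR's code) of the same idea: pad every image in a batch to the batch's largest height and width, on the right and bottom only. It uses plain torch.nn.functional.pad rather than the torchvision F used above, and the helper name is made up for the example:

```python
import torch
import torch.nn.functional as nnF  # note: different from the torchvision `F` used in the PR


def pad_to_max_size(images: list[torch.Tensor], fill_value: float = 0.0) -> torch.Tensor:
    """Pad a list of (C, H, W) tensors to the batch's max H/W and stack them."""
    max_h = max(img.shape[-2] for img in images)
    max_w = max(img.shape[-1] for img in images)
    padded = []
    for img in images:
        pad_h = max_h - img.shape[-2]
        pad_w = max_w - img.shape[-1]
        # torch.nn.functional.pad takes (left, right, top, bottom) for the last two dims,
        # so this pads only on the right and bottom, mirroring the method above.
        padded.append(nnF.pad(img, (0, pad_w, 0, pad_h), value=fill_value))
    return torch.stack(padded)


batch = pad_to_max_size([torch.rand(3, 480, 640), torch.rand(3, 512, 512)])
print(batch.shape)  # torch.Size([2, 3, 512, 640])
```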
Member:

Nice!

Member:

I was also wondering if we should include the whole group-and-reorder process in other functions like resize, or rescale and normalize. But that means we can't have several processing functions under one grouping loop. And a new grouping is only needed if the previous function changed the shape of the images (it's not needed after a rescale and normalize, for example), and groupings do cost a bit of processing time, so it might be best to leave this to the _preprocess function. What do you think?
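
To make the trade-off concrete, here is a toy sketch of the group, process, reorder pattern being discussed. The helper names only mirror the spirit of group_images_by_shape / reorder_images; they are simplified stand-ins written for this example, not the library implementations:

```python
import torch


def group_by_shape(images: list[torch.Tensor]):
    """Stack images that share a shape so each group can be processed as one batch."""
    lists, index = {}, []
    for img in images:
        shape = tuple(img.shape)
        index.append((shape, len(lists.setdefault(shape, []))))
        lists[shape].append(img)
    return {shape: torch.stack(group) for shape, group in lists.items()}, index


def reorder(grouped: dict, index: list) -> list[torch.Tensor]:
    """Undo the grouping, restoring the original image order."""
    return [grouped[shape][i] for shape, i in index]


images = [torch.rand(3, 4, 4), torch.rand(3, 2, 2), torch.rand(3, 4, 4)]
grouped, index = group_by_shape(images)

# Rescale + normalize keep shapes unchanged, so one grouping pass can cover both;
# only a shape-changing step (e.g. resize) would force a re-grouping afterwards.
grouped = {shape: (batch / 255.0 - 0.5) / 0.5 for shape, batch in grouped.items()}

restored = reorder(grouped, index)
print([tuple(t.shape) for t in restored])  # original order and shapes preserved
```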

Member Author:

The grouping for padding was added only because it's supposed to be called outside of the loop when we want to pad all images to the same size. I see what you mean, and it's partially related to the above thread. We need to handle two types of padding. WDYT of adding an arg like group_images=False?

Member:

Oh yes, that makes sense then, if we're padding to the maximum size. I don't think we need to add anything; we can just override the function when a different type of padding is needed.

Member Author:

Yep, that is what most processors do. Though it's not really "overriding": they use a function called pad_to_square, which we can actually standardize and add to the base class as well in the next iteration.
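
For context, here is a hedged sketch of what a standardized pad_to_square could look like. Real processors differ in the details (e.g. some fill with the image mean or a background color), so treat this as an illustration of the idea only:

```python
import torch
import torch.nn.functional as nnF


def pad_to_square(image: torch.Tensor, fill_value: float = 0.0) -> torch.Tensor:
    """Pad a (C, H, W) image to a square, keeping the original content centered."""
    height, width = image.shape[-2:]
    side = max(height, width)
    pad_left = (side - width) // 2
    pad_top = (side - height) // 2
    # (left, right, top, bottom) padding for the last two dimensions
    padding = (pad_left, side - width - pad_left, pad_top, side - height - pad_top)
    return nnF.pad(image, padding, value=fill_value)


square = pad_to_square(torch.rand(3, 480, 640))
print(square.shape)  # torch.Size([3, 640, 640])
```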

@zucchini-nlp zucchini-nlp requested a review from qubvel September 15, 2025 16:39
@Cyrilvallez (Member) left a comment:

Very nice! Aligned with the standardization vision we have for v5! I'm just confused about the default value of None. Why not just set it to False so it's a proper bool, instead of relying on the truth value of None being False?

@zucchini-nlp zucchini-nlp enabled auto-merge (squash) September 17, 2025 16:03
@zucchini-nlp (Member Author) commented Sep 17, 2025

Will merge to unblock myself for the next PR

EDIT: I don't know where my comment and its reply went. Anyway, the None default is simply for consistency with the other kwargs and with how fast image processors are implemented. We can batch-update the defaults in subsequent PRs.

Contributor:

[For maintainers] Suggested jobs to run (before merge)

run-slow: bridgetower, cohere2_vision, conditional_detr, convnext, deepseek_vl, deepseek_vl_hybrid, deformable_detr, depth_pro, detr, dinov3_vit, donut, dpt, fuyu, gemma3, got_ocr2, grounding_dino

@zucchini-nlp zucchini-nlp merged commit 8e837f6 into huggingface:main Sep 17, 2025
23 checks passed
ErfanBaghaei pushed a commit to ErfanBaghaei/transformers that referenced this pull request Sep 25, 2025
* use consistent naming for padding

* no validation on pad size

* add warnings

* fix

* fox copies

* another fix

* fix some tests

* fix more tests

* fix lasts tests

* fix copies

* better docstring

* delete print
vijayabhaskar-ev pushed a commit to vijayabhaskar-ev/transformers that referenced this pull request Oct 2, 2025
yuchenxie4645 pushed a commit to yuchenxie4645/transformers that referenced this pull request Oct 4, 2025