Conversation

@ydshieh ydshieh commented Sep 17, 2025

What does this PR do?

This integration test class takes > 3 hours to finish.

https://github.com/huggingface/transformers/actions/runs/17784986682/job/50551078690

The model is very large (despite being a MoE), and the tests load it with CPU/disk offloading.

Even with max_new_tokens=10, one test already takes 16 minutes.

This PR combines several tests into one, reducing the total number of tests to only 3.

The whole set of integration tests now runs in 30 minutes (still slow, however).

The disadvantage is that we don't have more complete outputs to compare against.
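
For illustration, here is a rough sketch of what the combined batched test looks like (method and attribute names follow the snippets discussed below; the expected-output constant is only a placeholder):

from transformers.testing_utils import torch_device

# Sketch of a method inside the integration test class: one batched generate call
# with a small max_new_tokens replaces several single-example tests.
def test_model_generation_batched(self):
    model = self.get_model()
    batch_messages = [self.message, self.message2, self.message_wo_image]
    inputs = self.processor.apply_chat_template(
        batch_messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
        padding=True,
    ).to(torch_device)

    # keep max_new_tokens small: every generated token pays the cpu/disk offloading cost
    output = model.generate(**inputs, max_new_tokens=10)
    decoded = self.processor.batch_decode(output, skip_special_tokens=True)
    self.assertEqual(decoded, EXPECTED_DECODED_TEXTS)  # placeholder expectation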

  {
      "type": "image",
-     "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
+     "url": "https://huggingface.co/datasets/hf-transformers-bot/ci_outputs/resolve/main/pipeline-cat-chonk.jpeg",
@ydshieh (Collaborator, Author):

I reduced the image size, but it doesn't help much. I might revert this.

@zucchini-nlp (Member) commented Sep 17, 2025:

It's because the processor resizes it to self.size. We can initialize the processor with a smaller size and nudge the images to be resized into fewer patches.

I think it is either the image_processor.size param or the image_processor.min_pixels/image_processor.max_pixels param for this model.
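
For example, something along these lines might work (the exact values are placeholders, and whether min_pixels/max_pixels are forwarded for this model is not certain):

from transformers import AutoProcessor

# Option 1: pass a smaller `size` so images are resized into fewer patches.
processor = AutoProcessor.from_pretrained(
    "zai-org/GLM-4.5V",
    size={"shortest_edge": 10800, "longest_edge": 10800},
)

# Option 2: if this image processor works with pixel budgets instead, cap them
# (placeholder values; the kwargs may or may not apply to this model).
processor = AutoProcessor.from_pretrained(
    "zai-org/GLM-4.5V",
    min_pixels=256 * 28 * 28,
    max_pixels=640 * 28 * 28,
)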

@ydshieh (Collaborator, Author):

Thank you @zucchini-nlp .

I first tried do_resize=False (with the smaller image I used), and it gives

(Pdb) inputs = self.processor.apply_chat_template(batch_messages, tokenize=True, add_generation_prompt=True, return_dict=True, return_tensors="pt", padding=True, do_resize=False)
*** ValueError: cannot reshape array of size 246240 into shape (1,2,3,6,2,14,8,2,14)

Then I tried changing patch_size from 14 to 56

inputs = self.processor.apply_chat_template(batch_messages, tokenize=True, add_generation_prompt=True, return_dict=True, return_tensors="pt", padding=True, patch_size=56)

which gives a shorter sequence, but then it fails during the model forward with

        # Add adapted position encoding to embeddings
>       embeddings = embeddings + adapted_pos_embed
E       RuntimeError: The size of tensor a (320) must match the size of tensor b (20) at non-singleton dimension 0

Something must be tightly coupled.

I would prefer this PR to go in as it is, since each run takes 30 minutes.

If the behavior observed above needs any fixes, they should go in a separate PR.

@ydshieh (Collaborator, Author):

Or if you have any idea what to do with

"size": {"shortest_edge": 12544, "longest_edge": 9633792},

I am happy to give it one last try.

@zucchini-nlp (Member):

Yep, size is the recommended way to control the max VRAM needed for this model, though I don't know how much it will change the generation time.

@ydshieh (Collaborator, Author):

I tried to change from

self.processor = AutoProcessor.from_pretrained("zai-org/GLM-4.5V")

(which gives size={"shortest_edge": 12544, "longest_edge": 9633792})

to

self.processor = AutoProcessor.from_pretrained("zai-org/GLM-4.5V", size={"shortest_edge": 10800, "longest_edge": 10800})

The input sequence length is reduced by a factor of 4 (when using my own smaller image), but the runtime only goes from 16 minutes to about 14 minutes (roughly 1m30s to 2m less).

It doesn't help much; I think the overhead is dominated by the CPU/disk <--> GPU offloading on each generated token.

I will still apply this change, but keep the short max_new_tokens (10 and 3).

There is not much we can do.
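
For context, this is roughly how the model ends up offloaded (the auto class, dtype, and offload folder below are assumptions, not the exact test code): with device_map="auto", weights that don't fit on the GPU live in CPU RAM or on disk, and every forward pass during generation streams them back.

import torch
from transformers import AutoModelForImageTextToText

# With device_map="auto", layers that don't fit in GPU memory are offloaded to
# CPU RAM and then to disk; each generated token has to move them back to the GPU.
model = AutoModelForImageTextToText.from_pretrained(
    "zai-org/GLM-4.5V",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    offload_folder="offload",  # hypothetical folder for the disk-offloaded weights
)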

  {
      "type": "image",
-     "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/coco_sample.png",
+     "url": "https://huggingface.co/datasets/hf-transformers-bot/ci_outputs/resolve/main/coco_sample.png",
@ydshieh (Collaborator, Author):

Same; I might revert this part.

  ]
- batched_messages = [self.message, message_wo_image]
  model = self.get_model()
+ batch_messages = [self.message, self.message2, self.message_wo_image]
@ydshieh (Collaborator, Author):

Combined several tests into this one, using a batch.


  # it should not matter whether two images are the same size or not
- output = model.generate(**inputs, max_new_tokens=30)
+ output = model.generate(**inputs, max_new_tokens=10)
@ydshieh (Collaborator, Author):

16 minutes for 10 tokens

"\nWhat kind of dog is this?\n<think>Got it, let's look at the image. The animal in the picture is not a dog; it's a cat. Specifically, it looks",
"\nWhat kind of dog is this?\n<think>Got it, let's look at the image. Wait, the animals here are cats, not dogs. The question is about a dog, but"
] # fmt: skip
output = model.generate(**inputs, max_new_tokens=3)
@ydshieh (Collaborator, Author):

3 tokens - let's not go crazy and make all the tests this slow.

Generating these 3 tokens already takes 7 minutes.

@zucchini-nlp (Member):

Wow. Btw, we can change the video size by setting a small num_frames when calling processor.apply_chat_template. I don't know for sure what the default sampling size is for this model, so maybe it is sampling a lot of frames.

I mean in this test and in the batched test above.
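
Something like this, roughly (video_messages is a placeholder for the video test's messages, and whether num_frames gets forwarded to the video processor for this model is an assumption):

# Sample only a few frames so the visual sequence stays short.
inputs = self.processor.apply_chat_template(
    video_messages,  # placeholder: the messages used in the video test
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
    num_frames=4,
)
output = model.generate(**inputs, max_new_tokens=3)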


  # it should not matter whether two images are the same size or not
- output = model.generate(**inputs, max_new_tokens=30)
+ output = model.generate(**inputs, max_new_tokens=3)
@ydshieh (Collaborator, Author):

Same: 3 tokens, 7 minutes.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp zucchini-nlp (Member) left a comment:

I left a few suggestions that might help reduce memory usage by making the images smaller and using fewer video frames. Up to you if you want to test them or just merge :)

Contributor:

[For maintainers] Suggested jobs to run (before merge)

run-slow: glm4v_moe

@ydshieh ydshieh merged commit ecc1d77 into main Sep 17, 2025
18 checks passed
@ydshieh ydshieh deleted the fix_glm4v_moe branch September 17, 2025 16:21
ErfanBaghaei pushed a commit to ErfanBaghaei/transformers that referenced this pull request Sep 25, 2025
* fix

* fix

* fix

* fix

* fix

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
vijayabhaskar-ev pushed a commit to vijayabhaskar-ev/transformers that referenced this pull request Oct 2, 2025
yuchenxie4645 pushed a commit to yuchenxie4645/transformers that referenced this pull request Oct 4, 2025