Conversation

@ydshieh ydshieh commented Sep 17, 2025

What does this PR do?

This integration test class takes > 3 hours to finish.

https://github.com/huggingface/transformers/actions/runs/17784986682/job/50551078690

The model is very large (despite being a MoE), and the tests load it with CPU/disk offloading.

Even with max_new_tokens=10, one test already takes 16 minutes.

This PR combines several tests into one, reducing the total number of tests to only 3.

The whole set of integration tests now runs in 30 minutes (still slow, however).

The disadvantage is that we don't have more complete outputs to compare against.
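
For illustration, here is a rough sketch of what the combined batched test looks like (method and attribute names follow the snippets discussed below; the expected-output constant is only a placeholder):

from transformers.testing_utils import torch_device

# Sketch of a method inside the integration test class: one batched generate call
# with a small max_new_tokens replaces several single-example tests.
def test_model_generation_batched(self):
    model = self.get_model()
    batch_messages = [self.message, self.message2, self.message_wo_image]
    inputs = self.processor.apply_chat_template(
        batch_messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
        padding=True,
    ).to(torch_device)

    # keep max_new_tokens small: every generated token pays the cpu/disk offloading cost
    output = model.generate(**inputs, max_new_tokens=10)
    decoded = self.processor.batch_decode(output, skip_special_tokens=True)
    self.assertEqual(decoded, EXPECTED_DECODED_TEXTS)  # placeholder expectation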

  {
      "type": "image",
-     "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
+     "url": "https://huggingface.co/datasets/hf-transformers-bot/ci_outputs/resolve/main/pipeline-cat-chonk.jpeg",
@ydshieh (Collaborator, Author):

I reduced the image size, but it doesn't help much. I might revert this.

@zucchini-nlp (Member) commented Sep 17, 2025:

It's because the processor resizes it to self.size. We can initialize the processor with a smaller size and nudge the images to be resized into fewer patches.

I think it is either the image_processor.size param or the image_processor.min_pixels/image_processor.max_pixels param for this model.
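
For example, something along these lines might work (the exact values are placeholders, and whether min_pixels/max_pixels are forwarded for this model is not certain):

from transformers import AutoProcessor

# Option 1: pass a smaller `size` so images are resized into fewer patches.
processor = AutoProcessor.from_pretrained(
    "zai-org/GLM-4.5V",
    size={"shortest_edge": 10800, "longest_edge": 10800},
)

# Option 2: if this image processor works with pixel budgets instead, cap them
# (placeholder values; the kwargs may or may not apply to this model).
processor = AutoProcessor.from_pretrained(
    "zai-org/GLM-4.5V",
    min_pixels=256 * 28 * 28,
    max_pixels=640 * 28 * 28,
)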

@ydshieh (Collaborator, Author):

Thank you @zucchini-nlp .

I first tried do_resize=False (with the smaller image I used), and it gives

(Pdb) inputs = self.processor.apply_chat_template(batch_messages, tokenize=True, add_generation_prompt=True, return_dict=True, return_tensors="pt", padding=True, do_resize=False)
*** ValueError: cannot reshape array of size 246240 into shape (1,2,3,6,2,14,8,2,14)

Then I tried changing patch_size from 14 to 56

inputs = self.processor.apply_chat_template(batch_messages, tokenize=True, add_generation_prompt=True, return_dict=True, return_tensors="pt", padding=True, patch_size=56)

which gives a shorter sequence, but then it fails during the model forward with

        # Add adapted position encoding to embeddings
>       embeddings = embeddings + adapted_pos_embed
E       RuntimeError: The size of tensor a (320) must match the size of tensor b (20) at non-singleton dimension 0

Something must be tightly coupled.

I would prefer this PR to go in as it is, since each run takes 30 minutes.

If the behavior observed above needs any fixes, they should go in a separate PR.

@ydshieh (Collaborator, Author):

Or if you have any idea what to do with

"size": {"shortest_edge": 12544, "longest_edge": 9633792},

I am happy to give it one last try.

@zucchini-nlp (Member):

Yep, size is the recommended way to control the max VRAM needed for this model, though I don't know how much it will change the generation time.

@ydshieh (Collaborator, Author):

I tried to change from

self.processor = AutoProcessor.from_pretrained("zai-org/GLM-4.5V")

(which gives size={"shortest_edge": 12544, "longest_edge": 9633792})

to

self.processor = AutoProcessor.from_pretrained("zai-org/GLM-4.5V", size={"shortest_edge": 10800, "longest_edge": 10800})

The input sequence length is reduced by a factor of 4 (when using my own smaller image), but the runtime only goes from 16 minutes to about 14 minutes (roughly 1m30s to 2m less).

It doesn't help much; I think the overhead is dominated by the CPU/disk <--> GPU offloading on each generated token.

I will still apply this change, but keep the short max_new_tokens (10 and 3).

There is not much we can do.
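
For context, this is roughly how the model ends up offloaded (the auto class, dtype, and offload folder below are assumptions, not the exact test code): with device_map="auto", weights that don't fit on the GPU live in CPU RAM or on disk, and every forward pass during generation streams them back.

import torch
from transformers import AutoModelForImageTextToText

# With device_map="auto", layers that don't fit in GPU memory are offloaded to
# CPU RAM and then to disk; each generated token has to move them back to the GPU.
model = AutoModelForImageTextToText.from_pretrained(
    "zai-org/GLM-4.5V",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    offload_folder="offload",  # hypothetical folder for the disk-offloaded weights
)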

  {
      "type": "image",
-     "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/coco_sample.png",
+     "url": "https://huggingface.co/datasets/hf-transformers-bot/ci_outputs/resolve/main/coco_sample.png",
@ydshieh (Collaborator, Author):

Same; I might revert this part.

  ]
- batched_messages = [self.message, message_wo_image]
  model = self.get_model()
+ batch_messages = [self.message, self.message2, self.message_wo_image]
@ydshieh (Collaborator, Author):

Combined several tests into this one, using a batch.


  # it should not matter whether two images are the same size or not
- output = model.generate(**inputs, max_new_tokens=30)
+ output = model.generate(**inputs, max_new_tokens=10)
@ydshieh (Collaborator, Author):

16 minutes for 10 tokens

"\nWhat kind of dog is this?\n<think>Got it, let's look at the image. The animal in the picture is not a dog; it's a cat. Specifically, it looks",
"\nWhat kind of dog is this?\n<think>Got it, let's look at the image. Wait, the animals here are cats, not dogs. The question is about a dog, but"
] # fmt: skip
output = model.generate(**inputs, max_new_tokens=3)
@ydshieh (Collaborator, Author):

3 tokens - let's not go crazy and make all the tests this slow.

Generating these 3 tokens already takes 7 minutes.

@zucchini-nlp (Member):

Wow. Btw, we can change the video size by setting a small num_frames when calling processor.apply_chat_template. I don't know for sure what the default sampling size is for this model, so maybe it is sampling a lot of frames.

I mean in this test and in the batched test above.
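
Something like this, roughly (video_messages is a placeholder for the video test's messages, and whether num_frames gets forwarded to the video processor for this model is an assumption):

# Sample only a few frames so the visual sequence stays short.
inputs = self.processor.apply_chat_template(
    video_messages,  # placeholder: the messages used in the video test
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
    num_frames=4,
)
output = model.generate(**inputs, max_new_tokens=3)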


  # it should not matter whether two images are the same size or not
- output = model.generate(**inputs, max_new_tokens=30)
+ output = model.generate(**inputs, max_new_tokens=3)
@ydshieh (Collaborator, Author):

Same: 3 tokens, 7 minutes.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp zucchini-nlp (Member) left a comment:

I left a few suggestions that might help reduce memory usage by making the images smaller and using fewer video frames. Up to you if you want to test them or just merge :)

Contributor:

[For maintainers] Suggested jobs to run (before merge)

run-slow: glm4v_moe

@ydshieh ydshieh merged commit ecc1d77 into main Sep 17, 2025
18 checks passed
@ydshieh ydshieh deleted the fix_glm4v_moe branch September 17, 2025 16:21
ErfanBaghaei pushed a commit to ErfanBaghaei/transformers that referenced this pull request Sep 25, 2025
* fix

* fix

* fix

* fix

* fix

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
vijayabhaskar-ev pushed a commit to vijayabhaskar-ev/transformers that referenced this pull request Oct 2, 2025
yuchenxie4645 pushed a commit to yuchenxie4645/transformers that referenced this pull request Oct 4, 2025