Fix Glm4vMoeIntegrationTest
#40930
Conversation
{
    "type": "image",
-   "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
+   "url": "https://huggingface.co/datasets/hf-transformers-bot/ci_outputs/resolve/main/pipeline-cat-chonk.jpeg",
I reduced the image size, but this doesn't help much. I might revert this.
It's because the processor resizes it to `self.size`. We can initialize the processor with smaller sizes and nudge the images to be resized into fewer patches. I think it is either the `image_processor.size` param or the `image_processor.min_pixels`/`image_processor.max_pixels` params for this model.
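A minimal sketch of the two knobs mentioned here; the exact kwargs and values are assumptions pieced together from this thread and have not been verified against the GLM-4.5V processor config:

```python
from transformers import AutoProcessor

# Option 1 (assumption): shrink the target size so images are resized to fewer patches.
processor = AutoProcessor.from_pretrained(
    "zai-org/GLM-4.5V",
    size={"shortest_edge": 10800, "longest_edge": 10800},
)

# Option 2 (assumption): bound the pixel budget directly, if this image processor
# exposes Qwen2-VL-style min_pixels/max_pixels kwargs. Values are illustrative only.
processor = AutoProcessor.from_pretrained(
    "zai-org/GLM-4.5V",
    min_pixels=56 * 56,
    max_pixels=10800,
)
```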
Thank you @zucchini-nlp .
I first tried `do_resize=False` (with the smaller image I used), and it gives

(Pdb) inputs = self.processor.apply_chat_template(batch_messages, tokenize=True, add_generation_prompt=True, return_dict=True, return_tensors="pt", padding=True, do_resize=False)
*** ValueError: cannot reshape array of size 246240 into shape (1,2,3,6,2,14,8,2,14)

Then I tried to change `patch_size` from 14 to 56

inputs = self.processor.apply_chat_template(batch_messages, tokenize=True, add_generation_prompt=True, return_dict=True, return_tensors="pt", padding=True, patch_size=56)

which gives a shorter sequence, but it fails during the model forward with

# Add adapted position encoding to embeddings
> embeddings = embeddings + adapted_pos_embed
E RuntimeError: The size of tensor a (320) must match the size of tensor b (20) at non-singleton dimension 0

Something must be tightly coupled here.
I would prefer this PR go in as it is, since each run takes 30 minutes.
If the behavior observed above needs fixing, that should go in separate PRs.
Or, if you have any idea of what to do with `"size": {"shortest_edge": 12544, "longest_edge": 9633792}`, I am happy to give it one last try.
Yep, the size is the recommended way to control the max VRAM needed for this model. Though I don't know how much it will change the generation time.
I tried to change from

self.processor = AutoProcessor.from_pretrained("zai-org/GLM-4.5V")

(which gives size={"shortest_edge": 12544, "longest_edge": 9633792})

to

self.processor = AutoProcessor.from_pretrained("zai-org/GLM-4.5V", size={"shortest_edge": 10800, "longest_edge": 10800})

The input sequence length is reduced by a factor of 4 (when using my own smaller image), but the run time only drops from 16m to 14m (about 1m30 to 2m less).
It doesn't help much; I think the overhead is mostly in the cpu/disk offloading <--> gpu transfers on each generated token.
I will still apply this change, but keep the short max_new_tokens values (10 and 3).
There is not much we can do.
{
    "type": "image",
-   "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/coco_sample.png",
+   "url": "https://huggingface.co/datasets/hf-transformers-bot/ci_outputs/resolve/main/coco_sample.png",
same, might revert this part
]
-   batched_messages = [self.message, message_wo_image]
    model = self.get_model()
+   batch_messages = [self.message, self.message2, self.message_wo_image]
I combined several tests into this one, but using a batch.
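For illustration, a rough sketch of what the combined batched test might look like; the message fixtures and the assertion target are assumptions pieced together from this thread, not the actual test body:

```python
def test_model_generation_batched(self):
    # Hypothetical shape of the combined batched integration test.
    model = self.get_model()
    batch_messages = [self.message, self.message2, self.message_wo_image]

    inputs = self.processor.apply_chat_template(
        batch_messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
        padding=True,
    ).to(model.device)

    # Keep max_new_tokens small: with cpu/disk offloading, even 10 tokens takes ~16 minutes.
    output = model.generate(**inputs, max_new_tokens=10)
    decoded = self.processor.batch_decode(output, skip_special_tokens=True)
    self.assertEqual(decoded, EXPECTED_DECODED_TEXTS)  # EXPECTED_DECODED_TEXTS: placeholder name
```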
    # it should not matter whether two images are the same size or not
-   output = model.generate(**inputs, max_new_tokens=30)
+   output = model.generate(**inputs, max_new_tokens=10)
16 minutes for 10 tokens
"\nWhat kind of dog is this?\n<think>Got it, let's look at the image. The animal in the picture is not a dog; it's a cat. Specifically, it looks", | ||
"\nWhat kind of dog is this?\n<think>Got it, let's look at the image. Wait, the animals here are cats, not dogs. The question is about a dog, but" | ||
] # fmt: skip | ||
output = model.generate(**inputs, max_new_tokens=3) |
3 tokens - let's not go crazy and make all the tests this slow. These 3 tokens already take 7 minutes.
Wow. Btw, we can change the video size by setting a small `num_frames` when calling `processor.apply_chat_template`. I don't know for sure what the default sampling size is for this model, so maybe it is sampling a lot. I mean in this test and in the batched test above.
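A minimal sketch of that suggestion; the `num_frames` value is illustrative, and whether the GLM-4.5V video processor honors this kwarg hasn't been verified here:

```python
# Assumption: apply_chat_template forwards num_frames to the video processor,
# so fewer frames are sampled and the visual token sequence gets shorter.
inputs = self.processor.apply_chat_template(
    batch_messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
    padding=True,
    num_frames=8,  # illustrative value, not taken from the PR
)
```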
    # it should not matter whether two images are the same size or not
-   output = model.generate(**inputs, max_new_tokens=30)
+   output = model.generate(**inputs, max_new_tokens=3)
same, 3 tokens, 7 minutes
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
I left a few suggestions that might help reduce memory usage by using smaller images and fewer video frames. Up to you if you want to test them or just merge :)
"\nWhat kind of dog is this?\n<think>Got it, let's look at the image. The animal in the picture is not a dog; it's a cat. Specifically, it looks", | ||
"\nWhat kind of dog is this?\n<think>Got it, let's look at the image. Wait, the animals here are cats, not dogs. The question is about a dog, but" | ||
] # fmt: skip | ||
output = model.generate(**inputs, max_new_tokens=3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Woow, btw, we can change the video size by setting small num_frames
when calling processor.apply_chat_template
. I dont know for sure what is the default sampling size for model, so maybe it is sampling a lot
I mean in this test and in the batched test above
[For maintainers] Suggested jobs to run (before merge): run-slow: glm4v_moe
* fix * fix * fix * fix * fix --------- Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
What does this PR do?
This integration test class takes > 3 hours to finish.
https://github.com/huggingface/transformers/actions/runs/17784986682/job/50551078690
The model is very large (despite being a MoE), and the tests load it with cpu/disk offloading.
Even with max_new_tokens=10, one test already takes 16 minutes.
This PR combines several tests into one and reduces the total number of tests to only 3.
The whole integration test suite now runs in 30 minutes (still slow, however).
The disadvantage is that we no longer have more complete outputs to compare against.