[model] Add PenguinVL implementation #44662
CyrilSterling wants to merge 6 commits into huggingface:main
Conversation
zucchini-nlp
left a comment
Nice! I was looking at the model a few days ago and most of the building blocks are similar to Qwen family. Could you try to use modular inheritance as much as possible, and I will do a first review around next week?
Thank you for your attention. I have implemented as many modules as possible through modular inheritance. However, some components are difficult to inherit, such as:
You can override certain attributes if needed with modular. For example, if PenguinVLVisionEmbeddings uses Conv2d instead of Conv3d, you could override just that attribute in the modular file. The same goes for other modules: you can add and remove methods or attributes.
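To make the override pattern concrete, here is a plain-Python sketch of how modular inheritance lets a child class swap a single attribute while keeping everything else from the parent. The class names mirror the ones discussed above, but the bodies are illustrative stand-ins, not the real transformers modules:

```python
# Illustrative stand-ins for the real transformers modules: the point is only
# that a modular child class can override one attribute and inherit the rest.

class Qwen3VisionEmbeddings:
    """Pretend base module that embeds patches with a 3D convolution."""
    conv_cls = "Conv3d"

    def describe(self):
        # Inherited unchanged by the child class below.
        return f"patch embedding via {self.conv_cls}"


class PenguinVLVisionEmbeddings(Qwen3VisionEmbeddings):
    """Override only the attribute that differs (Conv2d instead of Conv3d)."""
    conv_cls = "Conv2d"


print(Qwen3VisionEmbeddings().describe())      # patch embedding via Conv3d
print(PenguinVLVisionEmbeddings().describe())  # patch embedding via Conv2d
```

In a real modular file the override would be an `nn.Conv2d` layer assigned in `__init__`, but the inheritance mechanics are the same.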
stevhliu
left a comment
thanks, docs are very well-written! just have a few formatting nits
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Thank you for your suggestion. I have reviewed the code again and added two new inheritance relationships: PenguinVLVisionAttention now inherits from Qwen3Attention, and PenguinVLPreTrainedModel inherits from Qwen3PreTrainedModel. For the remaining components, implementing them via inheritance would require extensive rewriting, which would bring relatively limited benefits.
[For maintainers] Suggested jobs to run (before merge): run-slow: auto, penguinvl

View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=44662&sha=f0524a

Hello, may I ask about the current progress? @zucchini-nlp
reviewing today-tomorrow, got delayed by a few other models |
zucchini-nlp
left a comment
Hey, sorry for the delayed review. So many models being released recently!
I think the PR needs a few iterations of clean-up since the current API doesn't follow transformers best practices. I suggest rebasing on main first, as we also merged two big refactors recently. I added models that are similar to copy from or adapt from for each class in the comments. Also, we need to separate video and image processing from each other.
Please let me know if you have questions. I will unsubscribe from this PR to not flood my notification bar, so ping me again when you need a review :)
@@ -0,0 +1,310 @@
<!--Copyright 2025 Tencent and The HuggingFace Team. All rights reserved.
nit: can you make sure it's 2026 everywhere
### Single media inference
PenguinVL accepts both images and videos as input. Use `processor.process_vision_info` to extract visual inputs from messages **before** calling `apply_chat_template`.
simply calling `apply_chat_template` will extract and load all data, no need to call more utilities
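For reference, a sketch of a messages payload in the shape `apply_chat_template` consumes directly (the URL is a placeholder, and the exact processor call is an assumption based on the standard transformers chat format). With `tokenize=True` and `return_dict=True`, the processor fetches and preprocesses the image itself, so no separate extraction utility is needed:

```python
# Hypothetical payload; the URL is a placeholder. With a real processor,
# processor.apply_chat_template(messages, tokenize=True, return_dict=True,
#                               add_generation_prompt=True, return_tensors="pt")
# would load and preprocess the image on its own.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/cat.png"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Both the text prompt and the visual inputs come from this one structure.
print(messages[0]["content"][1]["text"])  # Describe this image.
```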
model = PenguinVLForConditionalGeneration.from_pretrained(
    "tencent/Penguin-VL-8B",
    torch_dtype=torch.bfloat16,
).to(model.device)
inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
if "pixel_values" in inputs:
    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
more concise: `inputs.to(device=model.device, dtype=torch.bfloat16)`
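The one-liner works because processor outputs are a `BatchFeature`, whose `.to()` moves every tensor to the target device but casts only floating-point tensors, so integer `input_ids` keep their dtype. A small sketch with toy tensors (the shapes are made up):

```python
import torch
from transformers import BatchFeature

# Toy stand-in for processor output: integer token ids plus float pixel values.
inputs = BatchFeature({
    "input_ids": torch.tensor([[1, 2, 3]]),
    "pixel_values": torch.rand(1, 3, 4, 4),
})

# One call replaces the manual dict comprehension plus the pixel_values
# special-case: tensors move to the device, and only floating-point tensors
# are cast to bfloat16.
inputs = inputs.to(device="cpu", dtype=torch.bfloat16)

print(inputs["pixel_values"].dtype)  # torch.bfloat16
print(inputs["input_ids"].dtype)     # torch.int64
```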
if "pixel_values" in inputs:
    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
output_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]
concise way: `output_ids[:, inputs.input_ids.shape[1]:]`
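The two spellings are equivalent whenever every prompt in the batch is padded to a single length, which is the case for processor output. A quick check with toy tensors (the shapes are made up):

```python
import torch

# Toy shapes: a batch of 2 prompts, each padded to 4 tokens; generate()
# returns the prompt plus new tokens, 7 per row here.
input_ids = torch.arange(8).reshape(2, 4)
output_ids = torch.arange(14).reshape(2, 7)

# Per-row list comprehension from the PR:
per_row = [out[len(inp):] for inp, out in zip(input_ids, output_ids)]

# Reviewer's one-line slice; valid because every row shares the padded length.
sliced = output_ids[:, input_ids.shape[1]:]

print(torch.equal(torch.stack(per_row), sliced))  # True
```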
@require_vision
@require_torch
class PenguinVLProcessorUnitTest(unittest.TestCase):
and a ProcessorTestMixin pls
@classmethod
def setUpClass(cls):
    from transformers import PenguinVLProcessor

"PenguinVLModel",  # Building part of bigger (tested) model. Tested implicitly through PenguinVLForConditionalGeneration.
"PenguinVLLanguageModel",  # Building part of bigger (tested) model. Tested implicitly through PenguinVLForConditionalGeneration.
"PenguinVLForConditionalGeneration",  # Tested in PenguinVLIntegrationTest (integration tests).
we need to test Model and ForConditionalGeneration. The language model is the same as Qwen and will be deleted
@@ -471,6 +468,7 @@
"Ernie4_5_VL_MoeForConditionalGeneration",  # BC Alias
"Ernie4_5_VL_MoeModel",  # BC Alias
"Ernie4_5_VL_MoeTextModel",  # BC Alias
"PenguinVLLanguageModel",  # Building part of a bigger model

frame_types: list | None
class PenguinVLImageProcessor(Qwen2VLImageProcessor):
actually, you can also take a look at the PR description here to check what needs to be changed (only some var names, the file name, and the parent class probably)
What does this PR do?
This PR supports PenguinVL model.
Paper: https://arxiv.org/abs/2603.06569
Github repo: https://github.com/tencent-ailab/Penguin-VL
HuggingFace Model: https://huggingface.co/collections/tencent/ai-lab
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
This PR may be related to:
Documentation: @stevhliu