[model] Add PenguinVL implementation by CyrilSterling · Pull Request #44662 · huggingface/transformers

CyrilSterling · 2026-03-13T13:02:26Z

What does this PR do?

This PR supports PenguinVL model.
Paper: https://arxiv.org/abs/2603.06569
Github repo: https://github.com/tencent-ailab/Penguin-VL
HuggingFace Model: https://huggingface.co/collections/tencent/ai-lab

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline,
Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link
to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the
documentation guidelines, and
here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

This PR may be related to:
Models:

text models: @ArthurZucker @Cyrilvallez
vision models: @yonigozlan @molbap
multimodal models: @zucchini-nlp

Library:

generate: @zucchini-nlp (visual-language models)

Documentation: @stevhliu

zucchini-nlp

Nice! I was looking at the model a few days ago and most of the building blocks are similar to Qwen family. Could you try to use modular inheritance as much as possible, and I will do a first review around next week?

CyrilSterling · 2026-03-13T13:34:01Z

Nice! I was looking at the model a few days ago and most of the building blocks are similar to Qwen family. Could you try to use modular inheritance as much as possible, and I will do a first review around next week?

Thank you for your attention. I have implemented as many inheritable modules as possible through inheritance. However, some models are difficult to inherit, such as:

PenguinVLVisionEmbeddings uses Conv2d instead of Conv3d, which is different from PatchEmbed in Qwen2VL.
PenguinVLVisionAttention and PenguinVLVisionModel are consistent with the Qwen3 language model, but require 2D-RoPE positional encoding and bidirectional attention.
Both PenguinVLProcessor and PenguinVLImageProcessor are inherited from Qwen2VL, with only necessary components added (e.g., TRA algorithm and resize method).

zucchini-nlp · 2026-03-13T15:21:20Z

@CyrilSterling

you can override certain attributes if needed with modular. For ex if "PenguinVLVisionEmbeddings uses Conv2d instead of Conv3d". then you could

class PenguinVLVisionEmbeddings(QwenPatchEmbed):
    def __init__(self, config):
        self.patch_embed = nn.Conv3D() # this ovevrwrites the qwen attr and creates a conv3d

Same goes for other modules, you can add and remove methods or attributes

stevhliu

thanks, docs are very well-written! just have a few formatting nits

docs/source/en/model_doc/penguinvl.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

CyrilSterling · 2026-03-13T19:18:37Z

@CyrilSterling

you can override certain attributes if needed with modular. For ex if "PenguinVLVisionEmbeddings uses Conv2d instead of Conv3d". then you could
class PenguinVLVisionEmbeddings(QwenPatchEmbed):
    def __init__(self, config):
        self.patch_embed = nn.Conv3D() # this ovevrwrites the qwen attr and creates a conv3d
Same goes for other modules, you can add and remove methods or attributes

Thank you for your suggestion. I have reviewed the code again and added two new inheritance relationships: PenguinVLVisionAttention now inherits from Qwen3Attention, and PenguinVLPreTrainedModel inherits from Qwen3PreTrainedModel. For the remaining components, implementing them via inheritance would require extensive rewriting, which would bring relatively limited benefits.
Please let me know if you have any further comments or suggestions. :)

github-actions · 2026-03-13T19:19:47Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, penguinvl

github-actions · 2026-03-13T19:31:36Z

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=44662&sha=f0524a

CyrilSterling · 2026-03-18T09:19:24Z

Hello, may I ask about the current progress? @zucchini-nlp

zucchini-nlp · 2026-03-18T10:56:02Z

reviewing today-tomorrow, got delayed by a few other models

zucchini-nlp

Hey, sorry for delayed review. So many model being released recently

I think the PR needs a few iterations of clean-up since the current API doesn't follow transformers best practices. I suggest to rebase on main first as we also merged wto big refactors recently. I added models that are similar to copy from or adapt from for each class on the comments. Also, we need to separate video and image processing form each other

Please let me know if you have questions. I will unsubscriobe from this PR to not flood my notification bar, so ping me again when you need a review :)

zucchini-nlp · 2026-03-19T17:58:58Z

docs/source/en/model_doc/penguinvl.md

@@ -0,0 +1,310 @@
+<!--Copyright 2025 Tencent and The HuggingFace Team. All rights reserved.


nit: can you make sure it's 2026 everywhere

zucchini-nlp · 2026-03-19T17:59:47Z

docs/source/en/model_doc/penguinvl.md

+
+### Single media inference
+
+PenguinVL accepts both images and videos as input. Use `processor.process_vision_info` to extract visual inputs from messages*before** calling `apply_chat_template`.


simply calling apply_chat_template will extract and load all data, no need to call more utilities

zucchini-nlp · 2026-03-19T18:00:01Z

docs/source/en/model_doc/penguinvl.md

+
+model = PenguinVLForConditionalGeneration.from_pretrained(
+    "tencent/Penguin-VL-8B",
+    torch_dtype=torch.bfloat16,


nit: dtype

zucchini-nlp · 2026-03-19T18:00:46Z

docs/source/en/model_doc/penguinvl.md

+).to(model.device)
+
+inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
+if "pixel_values" in inputs:
+    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)


more concise: inputs.to(device=model.device, dtype=torch.bfloat16)

zucchini-nlp · 2026-03-19T18:01:27Z

docs/source/en/model_doc/penguinvl.md

+if "pixel_values" in inputs:
+    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
+output_ids = model.generate(**inputs, max_new_tokens=128)
+generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]


concise way: output_ids[:, inputs.input_ids.shape[1]: ]

zucchini-nlp · 2026-03-19T19:26:11Z

tests/models/penguinvl/test_processing_penguinvl.py

+
+@require_vision
+@require_torch
+class PenguinVLProcessorUnitTest(unittest.TestCase):


and a ProcessorTestMixin pls

zucchini-nlp · 2026-03-19T19:26:29Z

tests/models/penguinvl/test_processing_penguinvl.py

+
+    @classmethod
+    def setUpClass(cls):
+        from transformers import PenguinVLProcessor


all import at the top pls

zucchini-nlp · 2026-03-19T19:27:06Z

utils/check_repo.py

+        "PenguinVLModel",  # Building part of bigger (tested) model. Tested implicitly through PenguinVLForConditionalGeneration.
+        "PenguinVLLanguageModel",  # Building part of bigger (tested) model. Tested implicitly through PenguinVLForConditionalGeneration.
+        "PenguinVLForConditionalGeneration",  # Tested in PenguinVLIntegrationTest (integration tests).


we need to test Model and ForConditionalGeneration. The language model is same as qwen and will be deleted

zucchini-nlp · 2026-03-19T19:27:15Z

utils/check_repo.py

@@ -471,6 +468,7 @@
    "Ernie4_5_VL_MoeForConditionalGeneration",  # BC Alias
    "Ernie4_5_VL_MoeModel",  # BC Alias
    "Ernie4_5_VL_MoeTextModel",  # BC Alias
+    "PenguinVLLanguageModel",  # Building part of a bigger model


will be deleted

zucchini-nlp · 2026-03-19T19:38:11Z

src/transformers/models/penguinvl/modular_penguinvl.py

+    frame_types: list | None
+
+
+class PenguinVLImageProcessor(Qwen2VLImageProcessor):


actually, you can also take a look at PR description here to check what needs to be changes (only some var names, file name and parent class prob)

zucchini-nlp

Hey, sorry for delayed review. So many model being released recently

I think the PR needs a few iterations of clean-up since the current API doesn't follow transformers best practices. I suggest to rebase on main first as we also merged wto big refactors recently. I added models that are similar to copy from or adapt from for each class on the comments. Also, we need to separate video and image processing form each other

Please let me know if you have questions. I will unsubscriobe from this PR to not flood my notification bar, so ping me again when you need a review :)

CyrilSterling added 2 commits March 13, 2026 20:05

Support PenguinVL

bdac5c6

update the docstring for PenguinVL

980dbca

zucchini-nlp reviewed Mar 13, 2026

View reviewed changes

fix problems using make fix-repo

edb7e1b

update modular script

ec3c1c7

stevhliu approved these changes Mar 13, 2026

View reviewed changes

CyrilSterling and others added 2 commits March 14, 2026 01:49

Apply suggestions from code review

405a462

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

Update inheritance, documentation, and fix issues raised by check-repo

f0524a8

zucchini-nlp reviewed Mar 20, 2026

View reviewed changes

		@@ -0,0 +1,310 @@
		<!--Copyright 2025 Tencent and The HuggingFace Team. All rights reserved.


		### Single media inference

		PenguinVL accepts both images and videos as input. Use `processor.process_vision_info` to extract visual inputs from messagesbefore* calling `apply_chat_template`.

		frame_types: list \| None


		class PenguinVLImageProcessor(Qwen2VLImageProcessor):

Conversation

CyrilSterling commented Mar 13, 2026

What does this PR do?

Before submitting

Who can review?

Uh oh!

zucchini-nlp left a comment

Choose a reason for hiding this comment

Uh oh!

CyrilSterling commented Mar 13, 2026

Uh oh!

zucchini-nlp commented Mar 13, 2026

Uh oh!

stevhliu left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

CyrilSterling commented Mar 13, 2026

Uh oh!

github-actions bot commented Mar 13, 2026

Uh oh!

github-actions bot commented Mar 13, 2026

Uh oh!

CyrilSterling commented Mar 18, 2026

Uh oh!

zucchini-nlp commented Mar 18, 2026

Uh oh!

zucchini-nlp left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zucchini-nlp left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants