Skip to content

[model] Add PenguinVL implementation#44662

Open
CyrilSterling wants to merge 6 commits intohuggingface:mainfrom
CyrilSterling:penguinvl
Open

[model] Add PenguinVL implementation#44662
CyrilSterling wants to merge 6 commits intohuggingface:mainfrom
CyrilSterling:penguinvl

Conversation

@CyrilSterling
Copy link
Copy Markdown

What does this PR do?

This PR supports PenguinVL model.
Paper: https://arxiv.org/abs/2603.06569
Github repo: https://github.com/tencent-ailab/Penguin-VL
HuggingFace Model: https://huggingface.co/collections/tencent/ai-lab

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

This PR may be related to:
Models:

Library:

Documentation: @stevhliu

Copy link
Copy Markdown
Member

@zucchini-nlp zucchini-nlp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I was looking at the model a few days ago and most of the building blocks are similar to Qwen family. Could you try to use modular inheritance as much as possible, and I will do a first review around next week?

@CyrilSterling
Copy link
Copy Markdown
Author

Nice! I was looking at the model a few days ago and most of the building blocks are similar to Qwen family. Could you try to use modular inheritance as much as possible, and I will do a first review around next week?

Thank you for your attention. I have implemented as many inheritable modules as possible through inheritance. However, some models are difficult to inherit, such as:

  1. PenguinVLVisionEmbeddings uses Conv2d instead of Conv3d, which is different from PatchEmbed in Qwen2VL.
  2. PenguinVLVisionAttention and PenguinVLVisionModel are consistent with the Qwen3 language model, but require 2D-RoPE positional encoding and bidirectional attention.
  3. Both PenguinVLProcessor and PenguinVLImageProcessor are inherited from Qwen2VL, with only necessary components added (e.g., TRA algorithm and resize method).

@zucchini-nlp
Copy link
Copy Markdown
Member

@CyrilSterling

you can override certain attributes if needed with modular. For ex if "PenguinVLVisionEmbeddings uses Conv2d instead of Conv3d". then you could

class PenguinVLVisionEmbeddings(QwenPatchEmbed):
    def __init__(self, config):
        self.patch_embed = nn.Conv3D() # this ovevrwrites the qwen attr and creates a conv3d

Same goes for other modules, you can add and remove methods or attributes

Copy link
Copy Markdown
Member

@stevhliu stevhliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, docs are very well-written! just have a few formatting nits

CyrilSterling and others added 2 commits March 14, 2026 01:49
@CyrilSterling
Copy link
Copy Markdown
Author

@CyrilSterling

you can override certain attributes if needed with modular. For ex if "PenguinVLVisionEmbeddings uses Conv2d instead of Conv3d". then you could

class PenguinVLVisionEmbeddings(QwenPatchEmbed):
    def __init__(self, config):
        self.patch_embed = nn.Conv3D() # this ovevrwrites the qwen attr and creates a conv3d

Same goes for other modules, you can add and remove methods or attributes

Thank you for your suggestion. I have reviewed the code again and added two new inheritance relationships: PenguinVLVisionAttention now inherits from Qwen3Attention, and PenguinVLPreTrainedModel inherits from Qwen3PreTrainedModel. For the remaining components, implementing them via inheritance would require extensive rewriting, which would bring relatively limited benefits.
Please let me know if you have any further comments or suggestions. :)

@github-actions
Copy link
Copy Markdown
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, penguinvl

@github-actions
Copy link
Copy Markdown
Contributor

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=44662&sha=f0524a

@CyrilSterling
Copy link
Copy Markdown
Author

Hello, may I ask about the current progress? @zucchini-nlp

@zucchini-nlp
Copy link
Copy Markdown
Member

reviewing today-tomorrow, got delayed by a few other models

Copy link
Copy Markdown
Member

@zucchini-nlp zucchini-nlp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, sorry for delayed review. So many model being released recently

I think the PR needs a few iterations of clean-up since the current API doesn't follow transformers best practices. I suggest to rebase on main first as we also merged wto big refactors recently. I added models that are similar to copy from or adapt from for each class on the comments. Also, we need to separate video and image processing form each other

Please let me know if you have questions. I will unsubscriobe from this PR to not flood my notification bar, so ping me again when you need a review :)

@@ -0,0 +1,310 @@
<!--Copyright 2025 Tencent and The HuggingFace Team. All rights reserved.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can you make sure it's 2026 everywhere


### Single media inference

PenguinVL accepts both images and videos as input. Use `processor.process_vision_info` to extract visual inputs from messages*before** calling `apply_chat_template`.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

simply calling apply_chat_template will extract and load all data, no need to call more utilities


model = PenguinVLForConditionalGeneration.from_pretrained(
"tencent/Penguin-VL-8B",
torch_dtype=torch.bfloat16,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: dtype

Comment on lines +73 to +77
).to(model.device)

inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
if "pixel_values" in inputs:
inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

more concise: inputs.to(device=model.device, dtype=torch.bfloat16)

if "pixel_values" in inputs:
inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
output_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

concise way: output_ids[:, inputs.input_ids.shape[1]: ]


@require_vision
@require_torch
class PenguinVLProcessorUnitTest(unittest.TestCase):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and a ProcessorTestMixin pls


@classmethod
def setUpClass(cls):
from transformers import PenguinVLProcessor
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all import at the top pls

Comment on lines +249 to +251
"PenguinVLModel", # Building part of bigger (tested) model. Tested implicitly through PenguinVLForConditionalGeneration.
"PenguinVLLanguageModel", # Building part of bigger (tested) model. Tested implicitly through PenguinVLForConditionalGeneration.
"PenguinVLForConditionalGeneration", # Tested in PenguinVLIntegrationTest (integration tests).
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to test Model and ForConditionalGeneration. The language model is same as qwen and will be deleted

@@ -471,6 +468,7 @@
"Ernie4_5_VL_MoeForConditionalGeneration", # BC Alias
"Ernie4_5_VL_MoeModel", # BC Alias
"Ernie4_5_VL_MoeTextModel", # BC Alias
"PenguinVLLanguageModel", # Building part of a bigger model
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will be deleted

frame_types: list | None


class PenguinVLImageProcessor(Qwen2VLImageProcessor):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, you can also take a look at PR description here to check what needs to be changes (only some var names, file name and parent class prob)

Copy link
Copy Markdown
Member

@zucchini-nlp zucchini-nlp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, sorry for delayed review. So many model being released recently

I think the PR needs a few iterations of clean-up since the current API doesn't follow transformers best practices. I suggest to rebase on main first as we also merged wto big refactors recently. I added models that are similar to copy from or adapt from for each class on the comments. Also, we need to separate video and image processing form each other

Please let me know if you have questions. I will unsubscriobe from this PR to not flood my notification bar, so ping me again when you need a review :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants