Conversation

@slin000111
Collaborator

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

Write the detailed information belonging to this PR.
PaddlePaddle/PaddleOCR-VL-1.5, #7975

Experiment results

Paste your experiment result here (if needed).

@gemini-code-assist
Contributor

Summary of Changes

Hello @slin000111, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates the PaddleOCR-VL-1.5 model into the system. The changes involve updating documentation to reflect the new model, defining necessary constants and architecture details, implementing specific loading and input processing logic for its multimodal capabilities, and adding a dedicated test case to ensure its proper functioning.

Highlights

  • New Model Support: The pull request introduces comprehensive support for the PaddleOCR-VL-1.5 multimodal model, enabling its integration and use within the system.
  • Documentation Updates: The Supported-models-and-datasets.md documentation, in both Chinese and English versions, has been updated to include PaddleOCR-VL-1.5 in the list of available models.
  • Core Model Integration: New constants for paddle_ocr_1_5 have been added to MLLMModelType and MLLMTemplateType, and its architecture details (language model, aligner, vision tower) are now registered in MLLMModelArch.
  • Model Loading and Template Implementation: A dedicated PaddleOCR1_5Loader class handles the model's configuration and loading, and a PaddleOCR1_5Template manages multimodal input encoding, specifically image embeddings (a generic sketch of this embedding step follows this list).
  • Test Coverage: A new test function, test_paddle_ocr_1_5(), has been added to tests/test_align/test_template/test_vision.py to validate the correct integration and functionality of the new model.
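
To make the template highlight above concrete, here is a generic, illustrative sketch of how multimodal templates typically splice image features into the text embedding sequence. The token id, tensor shapes, and function name below are assumptions for illustration only and are not taken from this PR's PaddleOCR1_5Template.

import torch

IMAGE_TOKEN_ID = 151655  # hypothetical placeholder token id, not the PR's actual value

def merge_image_embeddings(input_ids: torch.Tensor,
                           text_embeds: torch.Tensor,
                           image_embeds: torch.Tensor) -> torch.Tensor:
    # input_ids:    (seq_len,) token ids containing IMAGE_TOKEN_ID placeholders
    # text_embeds:  (seq_len, hidden) embeddings from the language model's embedding table
    # image_embeds: (num_placeholders, hidden) features from the vision tower and aligner
    merged = text_embeds.clone()
    positions = (input_ids == IMAGE_TOKEN_ID).nonzero(as_tuple=True)[0]
    assert positions.numel() == image_embeds.shape[0], 'placeholder/feature count mismatch'
    merged[positions] = image_embeds.to(dtype=text_embeds.dtype)
    return merged

The actual template in this PR may differ in details such as the placeholder token, batching, and where the merge happens.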

Changelog
  • docs/source/Instruction/Supported-models-and-datasets.md
    • Added an entry for PaddlePaddle/PaddleOCR-VL-1.5 to the list of supported models.
  • docs/source_en/Instruction/Supported-models-and-datasets.md
    • Added an entry for PaddlePaddle/PaddleOCR-VL-1.5 to the English list of supported models.
  • swift/model/constant.py
    • Introduced paddle_ocr_1_5 as a new MLLMModelType (a sketch of this kind of constant addition follows the changelog).
  • swift/model/model_arch.py
    • Added paddle_ocr_1_5 to MLLMModelArch.
    • Registered the specific architecture details for paddle_ocr_1_5, including its language model, aligner, and vision tower components.
  • swift/model/models/baidu.py
    • Imported PretrainedConfig.
    • Implemented PaddleOCR1_5Loader for loading the PaddleOCR-VL-1.5 model.
    • Registered the PaddleOCR-VL-1.5 model with its metadata and architecture.
  • swift/template/constant.py
    • Defined paddle_ocr_1_5 as a new MLLMTemplateType.
  • swift/template/templates/baidu.py
    • Created PaddleOCR1_5Template to handle multimodal input encoding for PaddleOCR-VL-1.5, specifically managing image embeddings.
    • Registered the new PaddleOCR1_5Template.
  • tests/test_align/test_template/test_vision.py
    • Added a new test function test_paddle_ocr_1_5() to validate the new model's functionality.
    • Updated the main execution block to include the test_paddle_ocr_1_5() test.
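
As a purely illustrative sketch of the constant additions listed in the changelog, the new type strings would look roughly like this; the surrounding class layout is an assumption, not the exact ms-swift source.

class MLLMModelType:
    # ... existing model types ...
    paddle_ocr_1_5 = 'paddle_ocr_1_5'

class MLLMTemplateType:
    # ... existing template types ...
    paddle_ocr_1_5 = 'paddle_ocr_1_5'
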
Activity
  • The pull request was initiated by slin000111 with the explicit goal of supporting PaddleOCR-VL-1.5.
  • The changes involve adding new code for model integration, updating documentation, and extending test coverage.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request adds support for the PaddleOCR-VL-1.5 model. The changes are comprehensive, including updates to documentation, model constants, architecture registration, a new loader, a new template, and an integration test. The implementation is mostly solid, but I've identified a potential bug in the new PaddleOCR1_5Loader where it overrides a base class method incorrectly. I've provided a suggestion to fix this, which also simplifies the code.

Comment on lines +88 to +98
class PaddleOCR1_5Loader(ModelLoader):

    def get_config(self, model_dir: str) -> PretrainedConfig:
        from transformers import AutoConfig
        auto_config_cls = self.auto_config_cls or AutoConfig
        return auto_config_cls.from_pretrained(model_dir)

    def get_model(self, model_dir: str, *args, **kwargs) -> PreTrainedModel:
        from transformers import AutoModelForImageTextToText
        self.auto_model_cls = self.auto_model_cls or AutoModelForImageTextToText
        return super().get_model(model_dir, *args, **kwargs)
Contributor


Severity: high

The get_config method overrides the base implementation and omits trust_remote_code=True, which is present in the base ModelLoader and important for loading many models from the hub. This could lead to errors. It's safer to remove this method and inherit the correct behavior from the base class (a small illustration of the failure mode follows the suggested change).

Suggested change

-class PaddleOCR1_5Loader(ModelLoader):
-
-    def get_config(self, model_dir: str) -> PretrainedConfig:
-        from transformers import AutoConfig
-        auto_config_cls = self.auto_config_cls or AutoConfig
-        return auto_config_cls.from_pretrained(model_dir)
-
-    def get_model(self, model_dir: str, *args, **kwargs) -> PreTrainedModel:
-        from transformers import AutoModelForImageTextToText
-        self.auto_model_cls = self.auto_model_cls or AutoModelForImageTextToText
-        return super().get_model(model_dir, *args, **kwargs)
+class PaddleOCR1_5Loader(ModelLoader):
+
+    def get_model(self, model_dir: str, *args, **kwargs) -> PreTrainedModel:
+        from transformers import AutoModelForImageTextToText
+        self.auto_model_cls = self.auto_model_cls or AutoModelForImageTextToText
+        return super().get_model(model_dir, *args, **kwargs)
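
To illustrate the failure mode the reviewer describes, here is a minimal sketch. It is not the ms-swift ModelLoader source; it only assumes, as the comment states, that the base loader forwards trust_remote_code=True while the override drops it.

from transformers import AutoConfig

def get_config_with_remote_code(model_dir: str):
    # What the base loader is described as doing: custom configuration code
    # shipped in a hub repository is allowed to run.
    return AutoConfig.from_pretrained(model_dir, trust_remote_code=True)

def get_config_without_remote_code(model_dir: str):
    # What the overriding method does: for repositories whose config requires
    # custom code, transformers raises a ValueError asking for trust_remote_code=True
    # instead of loading the config.
    return AutoConfig.from_pretrained(model_dir)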

Collaborator Author


The model code has been merged into the transformers library and differs from the code on the Model Hub, so trust_remote_code is not needed.
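
A minimal sanity check of that claim, assuming a transformers release that already ships the PaddleOCR-VL architecture; the model id below is taken from the PR description, and whether it resolves against your configured hub is an assumption.

from transformers import AutoConfig, AutoModelForImageTextToText

model_id = 'PaddlePaddle/PaddleOCR-VL-1.5'

# No trust_remote_code flag: the architecture is resolved from transformers itself,
# not from custom modeling code stored in the repository.
config = AutoConfig.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id)
print(type(model).__name__)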

@slin000111 slin000111 merged commit 4b7fc22 into modelscope:main Feb 4, 2026
2 of 3 checks passed