
Conversation

@Jintao-Huang (Collaborator)

@gemini-code-assist (Contributor)

Summary of Changes

Hello @Jintao-Huang, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request expands the system's capabilities by integrating the deepseek-ocr model. The changes define the model's type and architecture, register its configuration, and establish a corresponding template, so the system can apply the model's optical character recognition capabilities to image-based text.

Highlights

  • New Model Support: Added full support for the deepseek-ocr model, enabling its integration and usage within the system.
  • Model Type and Architecture Definition: Introduced deepseek_ocr into the MLLMModelType and MLLMModelArch enums, along with its specific multimodal architecture configuration.
  • Model Registration: Registered the deepseek-ai/DeepSeek-OCR model with its metadata, including the use of get_model_tokenizer_with_flash_attn and a 'vision' tag.
  • Template Integration: Created and registered a dedicated DeepseekOCR template to handle the specific prompting and processing requirements of the new model.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any comments that are incorrect or counterproductive; you can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in sharing feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

@gemini-code-assist (Contributor) left a comment


Code Review

This pull request adds support for the deepseek-ocr model by updating constants and adding new model and template registrations. While the changes are in the right direction, there are a couple of significant issues. The model registration uses an incorrect function for loading the multimodal processor and is missing the architectures definition. Additionally, the corresponding template class is empty, which will cause incorrect processing of image inputs. These issues need to be addressed for the model to function correctly.

Comment on lines 330 to 331
get_model_tokenizer_with_flash_attn,
model_arch=ModelArch.deepseek_ocr,

Severity: high

There are a couple of issues with the ModelMeta registration for deepseek_ocr:

  1. Incorrect get_function: get_model_tokenizer_with_flash_attn is used, which is for text-only models. Since deepseek-ocr is a multimodal model, you should use get_model_tokenizer_multimodal to correctly load the AutoProcessor which includes the image processor.
  2. Missing architectures: The architectures field is missing. Based on the model's config.json on Hugging Face, it should be ['LlamaForCausalLM']. Adding this is important for model type inference.
Suggested change:
- get_model_tokenizer_with_flash_attn,
- model_arch=ModelArch.deepseek_ocr,
+ get_model_tokenizer_multimodal,
+ architectures=['LlamaForCausalLM'],
+ model_arch=ModelArch.deepseek_ocr,
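
For context, a rough sketch of the full corrected registration as it might appear in the models file. It assumes that module's existing imports and mirrors the shape of neighboring multimodal entries in ms-swift; exact field order and names may differ from the merged code.

```python
# Sketch only: mirrors the shape of neighboring registrations in the
# file under review; relies on that module's existing imports.
register_model(
    ModelMeta(
        MLLMModelType.deepseek_ocr,
        [ModelGroup([Model('deepseek-ai/DeepSeek-OCR', 'deepseek-ai/DeepSeek-OCR')])],
        TemplateType.deepseek_ocr,
        get_model_tokenizer_multimodal,      # loads the AutoProcessor, per the review
        architectures=['LlamaForCausalLM'],  # from the model's config.json
        model_arch=ModelArch.deepseek_ocr,
        tags=['vision'],
    ))
```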

Comment on lines 237 to 238
class DeepseekOCR(Template):
pass

Severity: high

The DeepseekOCR template class is currently empty (pass). For a multimodal model, this is incorrect as it will fall back to the base Template's encoding logic, which is designed for text-only models and won't handle image inputs correctly. You need to override the _encode method to properly use the model's processor to handle image and text inputs. You can look at other multimodal templates in this file, like DeepseekVLTemplate, for an example of how to implement this.
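
To make the requirement concrete, below is a small self-contained illustration (all names hypothetical, not ms-swift API) of the placeholder-expansion step a multimodal _encode typically performs: each `<image>` placeholder id in the token sequence is replaced by the token block produced for that image.

```python
# Hypothetical, self-contained illustration of placeholder expansion,
# the core step a multimodal template's _encode must handle.
def expand_image_placeholders(input_ids, placeholder_id, per_image_tokens):
    """Replace each occurrence of placeholder_id with the next image's
    token block, so text tokens and image tokens interleave correctly."""
    expanded, image_iter = [], iter(per_image_tokens)
    for token_id in input_ids:
        if token_id == placeholder_id:
            expanded.extend(next(image_iter))  # one block per <image>
        else:
            expanded.append(token_id)
    return expanded

# Example: two <image> placeholders (id 128815), each expanding to a
# block of image tokens produced by the image preprocessor.
ids = [1, 128815, 42, 128815, 2]
blocks = [[128815] * 3, [128815] * 5]
print(expand_image_placeholders(ids, 128815, blocks))
```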

@Jintao-Huang (Collaborator, Author)

/gemini review

@gemini-code-assist (Contributor) left a comment

Code Review

This pull request adds support for the deepseek-ocr model, including its model type, architecture, template, and a corresponding test case. The documentation is also updated to include deepseek-ocr and other new models, and to reorganize the reranker models into their own section. The implementation is mostly solid, but I've identified a bug in the DeepseekOCR template where a configurable parameter is hardcoded, silently overriding the user's setting. I've also noted a couple of maintainability and consistency improvements in both the code and the documentation.

def _preprocess_image(self, images):
# Code borrowed from
# https://modelscope.cn/models/deepseek-ai/DeepSeek-OCR/file/view/master/modeling_deepseekocr.py?status=1
crop_mode = True

Severity: high

The crop_mode variable is hardcoded to True. This overrides the self.crop_mode attribute that is initialized from an environment variable in init_env_args, preventing users from disabling this feature. This should be updated to use the value from self.crop_mode.

Suggested change:
- crop_mode = True
+ crop_mode = self.crop_mode
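
For reference, a dependency-free sketch of the env-var pattern the comment alludes to (the variable name CROP_MODE and the parsing are assumptions; the real template reportedly wires this up in init_env_args):

```python
import os

# Read the flag once per run so users can disable cropping, e.g. CROP_MODE=0.
# The exact variable name and parsing here are assumptions for illustration.
crop_mode = os.environ.get('CROP_MODE', '1').lower() not in ('0', 'false')
```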


Severity: medium

For consistency, the JinaAI/jina-reranker-m0 model should be moved to the dedicated reranker models section. It is currently placed in the general Large Language Models table, but its tags reranker, vision indicate it belongs with the other rerankers.

|[google/gemma-3n-E4B-it](https://modelscope.cn/models/google/gemma-3n-E4B-it)|gemma3n|gemma3n|transformers>=4.53.1|&#x2718;|-|[google/gemma-3n-E4B-it](https://huggingface.co/google/gemma-3n-E4B-it)|
|[mistralai/Mistral-Small-3.1-24B-Base-2503](https://modelscope.cn/models/mistralai/Mistral-Small-3.1-24B-Base-2503)|mistral_2503|mistral_2503|transformers>=4.49|&#x2718;|-|[mistralai/Mistral-Small-3.1-24B-Base-2503](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Base-2503)|
|[mistralai/Mistral-Small-3.1-24B-Instruct-2503](https://modelscope.cn/models/mistralai/Mistral-Small-3.1-24B-Instruct-2503)|mistral_2503|mistral_2503|transformers>=4.49|&#x2718;|-|[mistralai/Mistral-Small-3.1-24B-Instruct-2503](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503)|
|[JinaAI/jina-reranker-m0](https://modelscope.cn/models/JinaAI/jina-reranker-m0)|jina_reranker_m0|jina_reranker_m0|-|&#x2718;|reranker, vision|[JinaAI/jina-reranker-m0](https://huggingface.co/JinaAI/jina-reranker-m0)|

ratio = 1 - ((max(w, h) - min(w, h)) / (max(w, h)))

image_transform = self.BasicImageTransform(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5), normalize=True)
image_token_id = 128815

Severity: medium

The image_token_id is hardcoded. While the value 128815 is correct for the current tokenizer, it's better practice to retrieve this value dynamically from the tokenizer to improve maintainability. This will prevent potential issues if the tokenizer is updated in the future. You can obtain it using self._tokenize('<image>')[0].

Suggested change:
- image_token_id = 128815
+ image_token_id = self._tokenize('<image>')[0]
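
The same lookup can be sanity-checked with a plain transformers tokenizer (assuming, as the suggestion above implies, that '<image>' maps to a single token in this vocabulary):

```python
from transformers import AutoTokenizer

# Assumption: the tokenizer for deepseek-ai/DeepSeek-OCR loads standalone
# and registers '<image>' in its vocabulary.
tokenizer = AutoTokenizer.from_pretrained('deepseek-ai/DeepSeek-OCR', trust_remote_code=True)
print(tokenizer.convert_tokens_to_ids('<image>'))  # expected: 128815 for the current tokenizer
```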

@Jintao-Huang (Collaborator, Author)

/gemini review

@gemini-code-assist (Contributor) left a comment

Code Review

This pull request adds support for the deepseek-ocr model, which is a great addition. The changes include model registration, a new template for handling OCR-specific preprocessing, an example script, and updates to the documentation. The implementation looks mostly solid, but I've identified a few areas for improvement, including a critical bug in the image preprocessing logic, some maintainability issues with magic numbers and file structure, and a concern about a strict dependency version. Addressing these points will make the integration more robust and user-friendly.


tokenized_image = ([image_token_id] * num_queries + [image_token_id]) * num_queries
tokenized_image += [image_token_id]
tokenized_str += tokenized_image

Severity: critical

There is an inconsistency in how tokenized_str is populated. In the if crop_mode: block, tokenized_image is appended as a list, but here in the else block, it's extended (+=). This will cause a TypeError when crop_mode is False and multiple images are processed, as the calling function expects a list of lists. This should be changed to append to ensure consistent behavior.

Suggested change:
- tokenized_str += tokenized_image
+ tokenized_str.append(tokenized_image)
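
A minimal standalone demonstration of why the two branches must agree (128815 is the image token id from the snippet above; the block size is made up):

```python
# `+=` splices token ids into the outer list; `append` keeps one
# sub-list per image, which is the shape the caller expects here.
tokenized_str = []
tokenized_image = [128815] * 4

tokenized_str += tokenized_image
print(tokenized_str)   # [128815, 128815, 128815, 128815] -- flattened, wrong shape

tokenized_str = []
tokenized_str.append(tokenized_image)
print(tokenized_str)   # [[128815, 128815, 128815, 128815]] -- one entry per image
```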

|[deepseek-ai/Janus-1.3B](https://modelscope.cn/models/deepseek-ai/Janus-1.3B)|deepseek_janus|deepseek_janus|-|&#x2718;|vision|[deepseek-ai/Janus-1.3B](https://huggingface.co/deepseek-ai/Janus-1.3B)|
|[deepseek-ai/Janus-Pro-1B](https://modelscope.cn/models/deepseek-ai/Janus-Pro-1B)|deepseek_janus_pro|deepseek_janus_pro|-|&#x2718;|vision|[deepseek-ai/Janus-Pro-1B](https://huggingface.co/deepseek-ai/Janus-Pro-1B)|
|[deepseek-ai/Janus-Pro-7B](https://modelscope.cn/models/deepseek-ai/Janus-Pro-7B)|deepseek_janus_pro|deepseek_janus_pro|-|&#x2718;|vision|[deepseek-ai/Janus-Pro-7B](https://huggingface.co/deepseek-ai/Janus-Pro-7B)|
|[deepseek-ai/DeepSeek-OCR](https://modelscope.cn/models/deepseek-ai/DeepSeek-OCR)|deepseek_ocr|deepseek_ocr|transformers==4.46.3, easydict|&#x2718;|vision|[deepseek-ai/DeepSeek-OCR](https://huggingface.co/deepseek-ai/DeepSeek-OCR)|

Severity: medium

Pinning the transformers dependency to an exact version (==4.46.3) can lead to dependency conflicts for users who might have other packages requiring a different version. It would be more flexible to specify a compatible range, such as >=4.46.3,<4.47.0. If this exact version is a strict requirement for the model to function correctly, it would be helpful to add a note explaining why.
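
To illustrate the difference between the exact pin and a compatible range (the range below is the one suggested above; whether every 4.46.x patch release actually works with the model is an untested assumption):

```python
from packaging.specifiers import SpecifierSet

pinned = SpecifierSet('==4.46.3')
ranged = SpecifierSet('>=4.46.3,<4.47.0')

# The range accepts future 4.46.x patch releases that the pin rejects.
for version in ('4.46.2', '4.46.3', '4.46.4', '4.47.0'):
    print(version, version in pinned, version in ranged)
```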

@@ -0,0 +1,30 @@
# 24GiB

Severity: medium

The example script for deepseek-ocr is located in the deepseek_vl2 directory. For better project structure and maintainability, it would be more appropriate to create a dedicated directory for this new model, such as examples/models/deepseek_ocr/, and place the script there. This will make it easier for users to find examples related to this specific model.

# Code borrowed from
# https://modelscope.cn/models/deepseek-ai/DeepSeek-OCR/file/view/master/modeling_deepseekocr.py?status=1
crop_mode = self.crop_mode
patch_size = 16

Severity: medium

The _preprocess_image method contains several magic numbers, such as 16 for patch_size and 4 for downsample_ratio, as well as various hardcoded image sizes and token counts (e.g., 640, 1024, 256, 100). To improve readability and maintainability, it's recommended to define these as named constants at the class level or configure them via init_env_args.
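
For instance, a hedged sketch of the named-constants approach (the values are the ones quoted in this review; the constant names and their meanings are guesses for illustration):

```python
# Illustrative only: names invented; values taken from the review text.
class DeepseekOCRConstants:
    PATCH_SIZE = 16          # vision-encoder patch edge, in pixels
    DOWNSAMPLE_RATIO = 4     # token downsampling factor after the encoder
    CROP_SIZE = 640          # per-tile resolution (guess)
    BASE_SIZE = 1024         # global-view resolution (guess)
```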

crop_mode = self.crop_mode
patch_size = 16
downsample_ratio = 4
valid_img_tokens = 0

Severity: medium

The variable valid_img_tokens is initialized and incremented within the _preprocess_image method, but its value is never used. This appears to be dead code and should be removed to improve code clarity.

@Jintao-Huang merged commit b9f4e2c into modelscope:main on Oct 21, 2025 (1 of 2 checks passed).