
Conversation

@Jintao-Huang (Collaborator)

@gemini-code-assist (Contributor)

Summary of Changes

Hello @Jintao-Huang, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request expands the system's capabilities by integrating the deepseek-ocr model. The changes define the model's type and architecture, register its configuration, and establish a corresponding template, so the system can apply the model's optical character recognition capabilities to image-based text.

Highlights

  • New Model Support: Added full support for the deepseek-ocr model, enabling its integration and usage within the system.
  • Model Type and Architecture Definition: Introduced deepseek_ocr into the MLLMModelType and MLLMModelArch enums, along with its specific multimodal architecture configuration.
  • Model Registration: Registered the deepseek-ai/DeepSeek-OCR model with its metadata, including the use of get_model_tokenizer_with_flash_attn and a 'vision' tag.
  • Template Integration: Created and registered a dedicated DeepseekOCR template to handle the specific prompting and processing requirements of the new model.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any comments that are incorrect or counterproductive; you can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in sharing feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

@gemini-code-assist (Contributor) left a comment


Code Review

This pull request adds support for the deepseek-ocr model by updating constants and adding new model and template registrations. While the changes are in the right direction, there are a couple of significant issues. The model registration uses an incorrect function for loading the multimodal processor and is missing the architectures definition. Additionally, the corresponding template class is empty, which will cause incorrect processing of image inputs. These issues need to be addressed for the model to function correctly.

Comment on lines 330 to 331
get_model_tokenizer_with_flash_attn,
model_arch=ModelArch.deepseek_ocr,

Severity: high

There are a couple of issues with the ModelMeta registration for deepseek_ocr:

  1. Incorrect get_function: get_model_tokenizer_with_flash_attn is used, which is for text-only models. Since deepseek-ocr is a multimodal model, you should use get_model_tokenizer_multimodal to correctly load the AutoProcessor which includes the image processor.
  2. Missing architectures: The architectures field is missing. Based on the model's config.json on Hugging Face, it should be ['LlamaForCausalLM']. Adding this is important for model type inference.
Suggested change:
- get_model_tokenizer_with_flash_attn,
- model_arch=ModelArch.deepseek_ocr,
+ get_model_tokenizer_multimodal,
+ architectures=['LlamaForCausalLM'],
+ model_arch=ModelArch.deepseek_ocr,
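
For context, a rough sketch of the full corrected registration as it might appear in the models file. It assumes that module's existing imports and mirrors the shape of neighboring multimodal entries in ms-swift; exact field order and names may differ from the merged code.

```python
# Sketch only: mirrors the shape of neighboring registrations in the
# file under review; relies on that module's existing imports.
register_model(
    ModelMeta(
        MLLMModelType.deepseek_ocr,
        [ModelGroup([Model('deepseek-ai/DeepSeek-OCR', 'deepseek-ai/DeepSeek-OCR')])],
        TemplateType.deepseek_ocr,
        get_model_tokenizer_multimodal,      # loads the AutoProcessor, per the review
        architectures=['LlamaForCausalLM'],  # from the model's config.json
        model_arch=ModelArch.deepseek_ocr,
        tags=['vision'],
    ))
```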

Comment on lines 237 to 238
class DeepseekOCR(Template):
pass

Severity: high

The DeepseekOCR template class is currently empty (pass). For a multimodal model, this is incorrect as it will fall back to the base Template's encoding logic, which is designed for text-only models and won't handle image inputs correctly. You need to override the _encode method to properly use the model's processor to handle image and text inputs. You can look at other multimodal templates in this file, like DeepseekVLTemplate, for an example of how to implement this.
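
To make the requirement concrete, below is a small self-contained illustration (all names hypothetical, not ms-swift API) of the placeholder-expansion step a multimodal _encode typically performs: each `<image>` placeholder id in the token sequence is replaced by the token block produced for that image.

```python
# Hypothetical, self-contained illustration of placeholder expansion,
# the core step a multimodal template's _encode must handle.
def expand_image_placeholders(input_ids, placeholder_id, per_image_tokens):
    """Replace each occurrence of placeholder_id with the next image's
    token block, so text tokens and image tokens interleave correctly."""
    expanded, image_iter = [], iter(per_image_tokens)
    for token_id in input_ids:
        if token_id == placeholder_id:
            expanded.extend(next(image_iter))  # one block per <image>
        else:
            expanded.append(token_id)
    return expanded

# Example: two <image> placeholders (id 128815), each expanding to a
# block of image tokens produced by the image preprocessor.
ids = [1, 128815, 42, 128815, 2]
blocks = [[128815] * 3, [128815] * 5]
print(expand_image_placeholders(ids, 128815, blocks))
```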

@Jintao-Huang (Collaborator, Author)

/gemini review

@gemini-code-assist (Contributor) left a comment

Code Review

This pull request adds support for the deepseek-ocr model, including its model type, architecture, template, and a corresponding test case. The documentation is also updated to include deepseek-ocr and other new models, and to reorganize the reranker models into their own section. The implementation is mostly solid, but I've identified a bug in the DeepseekOCR template where a configurable parameter is hardcoded, silently overriding the user's setting. I've also noted a couple of maintainability and consistency improvements in both the code and the documentation.

def _preprocess_image(self, images):
# Code borrowed from
# https://modelscope.cn/models/deepseek-ai/DeepSeek-OCR/file/view/master/modeling_deepseekocr.py?status=1
crop_mode = True

Severity: high

The crop_mode variable is hardcoded to True. This overrides the self.crop_mode attribute that is initialized from an environment variable in init_env_args, preventing users from disabling this feature. This should be updated to use the value from self.crop_mode.

Suggested change:
- crop_mode = True
+ crop_mode = self.crop_mode
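
For reference, a dependency-free sketch of the env-var pattern the comment alludes to (the variable name CROP_MODE and the parsing are assumptions; the real template reportedly wires this up in init_env_args):

```python
import os

# Read the flag once per run so users can disable cropping, e.g. CROP_MODE=0.
# The exact variable name and parsing here are assumptions for illustration.
crop_mode = os.environ.get('CROP_MODE', '1').lower() not in ('0', 'false')
```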


Severity: medium

For consistency, the JinaAI/jina-reranker-m0 model should be moved to the dedicated reranker models section. It is currently placed in the general Large Language Models table, but its tags reranker, vision indicate it belongs with the other rerankers.

|[google/gemma-3n-E4B-it](https://modelscope.cn/models/google/gemma-3n-E4B-it)|gemma3n|gemma3n|transformers>=4.53.1|&#x2718;|-|[google/gemma-3n-E4B-it](https://huggingface.co/google/gemma-3n-E4B-it)|
|[mistralai/Mistral-Small-3.1-24B-Base-2503](https://modelscope.cn/models/mistralai/Mistral-Small-3.1-24B-Base-2503)|mistral_2503|mistral_2503|transformers>=4.49|&#x2718;|-|[mistralai/Mistral-Small-3.1-24B-Base-2503](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Base-2503)|
|[mistralai/Mistral-Small-3.1-24B-Instruct-2503](https://modelscope.cn/models/mistralai/Mistral-Small-3.1-24B-Instruct-2503)|mistral_2503|mistral_2503|transformers>=4.49|&#x2718;|-|[mistralai/Mistral-Small-3.1-24B-Instruct-2503](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503)|
|[JinaAI/jina-reranker-m0](https://modelscope.cn/models/JinaAI/jina-reranker-m0)|jina_reranker_m0|jina_reranker_m0|-|&#x2718;|reranker, vision|[JinaAI/jina-reranker-m0](https://huggingface.co/JinaAI/jina-reranker-m0)|

ratio = 1 - ((max(w, h) - min(w, h)) / (max(w, h)))

image_transform = self.BasicImageTransform(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5), normalize=True)
image_token_id = 128815

Severity: medium

The image_token_id is hardcoded. While the value 128815 is correct for the current tokenizer, it's better practice to retrieve this value dynamically from the tokenizer to improve maintainability. This will prevent potential issues if the tokenizer is updated in the future. You can obtain it using self._tokenize('<image>')[0].

Suggested change:
- image_token_id = 128815
+ image_token_id = self._tokenize('<image>')[0]
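
The same lookup can be sanity-checked with a plain transformers tokenizer (assuming, as the suggestion above implies, that '<image>' maps to a single token in this vocabulary):

```python
from transformers import AutoTokenizer

# Assumption: the tokenizer for deepseek-ai/DeepSeek-OCR loads standalone
# and registers '<image>' in its vocabulary.
tokenizer = AutoTokenizer.from_pretrained('deepseek-ai/DeepSeek-OCR', trust_remote_code=True)
print(tokenizer.convert_tokens_to_ids('<image>'))  # expected: 128815 for the current tokenizer
```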

@Jintao-Huang (Collaborator, Author)

/gemini review

@gemini-code-assist (Contributor) left a comment

Code Review

This pull request adds support for the deepseek-ocr model, which is a great addition. The changes include model registration, a new template for handling OCR-specific preprocessing, an example script, and updates to the documentation. The implementation looks mostly solid, but I've identified a few areas for improvement, including a critical bug in the image preprocessing logic, some maintainability issues with magic numbers and file structure, and a concern about a strict dependency version. Addressing these points will make the integration more robust and user-friendly.


tokenized_image = ([image_token_id] * num_queries + [image_token_id]) * num_queries
tokenized_image += [image_token_id]
tokenized_str += tokenized_image

Severity: critical

There is an inconsistency in how tokenized_str is populated. In the if crop_mode: block, tokenized_image is appended as a list, but here in the else block, it's extended (+=). This will cause a TypeError when crop_mode is False and multiple images are processed, as the calling function expects a list of lists. This should be changed to append to ensure consistent behavior.

Suggested change:
- tokenized_str += tokenized_image
+ tokenized_str.append(tokenized_image)
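
A minimal standalone demonstration of why the two branches must agree (128815 is the image token id from the snippet above; the block size is made up):

```python
# `+=` splices token ids into the outer list; `append` keeps one
# sub-list per image, which is the shape the caller expects here.
tokenized_str = []
tokenized_image = [128815] * 4

tokenized_str += tokenized_image
print(tokenized_str)   # [128815, 128815, 128815, 128815] -- flattened, wrong shape

tokenized_str = []
tokenized_str.append(tokenized_image)
print(tokenized_str)   # [[128815, 128815, 128815, 128815]] -- one entry per image
```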

|[deepseek-ai/Janus-1.3B](https://modelscope.cn/models/deepseek-ai/Janus-1.3B)|deepseek_janus|deepseek_janus|-|&#x2718;|vision|[deepseek-ai/Janus-1.3B](https://huggingface.co/deepseek-ai/Janus-1.3B)|
|[deepseek-ai/Janus-Pro-1B](https://modelscope.cn/models/deepseek-ai/Janus-Pro-1B)|deepseek_janus_pro|deepseek_janus_pro|-|&#x2718;|vision|[deepseek-ai/Janus-Pro-1B](https://huggingface.co/deepseek-ai/Janus-Pro-1B)|
|[deepseek-ai/Janus-Pro-7B](https://modelscope.cn/models/deepseek-ai/Janus-Pro-7B)|deepseek_janus_pro|deepseek_janus_pro|-|&#x2718;|vision|[deepseek-ai/Janus-Pro-7B](https://huggingface.co/deepseek-ai/Janus-Pro-7B)|
|[deepseek-ai/DeepSeek-OCR](https://modelscope.cn/models/deepseek-ai/DeepSeek-OCR)|deepseek_ocr|deepseek_ocr|transformers==4.46.3, easydict|&#x2718;|vision|[deepseek-ai/DeepSeek-OCR](https://huggingface.co/deepseek-ai/DeepSeek-OCR)|

Severity: medium

Pinning the transformers dependency to an exact version (==4.46.3) can lead to dependency conflicts for users who might have other packages requiring a different version. It would be more flexible to specify a compatible range, such as >=4.46.3,<4.47.0. If this exact version is a strict requirement for the model to function correctly, it would be helpful to add a note explaining why.
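
To illustrate the difference between the exact pin and a compatible range (the range below is the one suggested above; whether every 4.46.x patch release actually works with the model is an untested assumption):

```python
from packaging.specifiers import SpecifierSet

pinned = SpecifierSet('==4.46.3')
ranged = SpecifierSet('>=4.46.3,<4.47.0')

# The range accepts future 4.46.x patch releases that the pin rejects.
for version in ('4.46.2', '4.46.3', '4.46.4', '4.47.0'):
    print(version, version in pinned, version in ranged)
```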

@@ -0,0 +1,30 @@
# 24GiB

Severity: medium

The example script for deepseek-ocr is located in the deepseek_vl2 directory. For better project structure and maintainability, it would be more appropriate to create a dedicated directory for this new model, such as examples/models/deepseek_ocr/, and place the script there. This will make it easier for users to find examples related to this specific model.

# Code borrowed from
# https://modelscope.cn/models/deepseek-ai/DeepSeek-OCR/file/view/master/modeling_deepseekocr.py?status=1
crop_mode = self.crop_mode
patch_size = 16

Severity: medium

The _preprocess_image method contains several magic numbers, such as 16 for patch_size and 4 for downsample_ratio, as well as various hardcoded image sizes and token counts (e.g., 640, 1024, 256, 100). To improve readability and maintainability, it's recommended to define these as named constants at the class level or configure them via init_env_args.
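
For instance, a hedged sketch of the named-constants approach (the values are the ones quoted in this review; the constant names and their meanings are guesses for illustration):

```python
# Illustrative only: names invented; values taken from the review text.
class DeepseekOCRConstants:
    PATCH_SIZE = 16          # vision-encoder patch edge, in pixels
    DOWNSAMPLE_RATIO = 4     # token downsampling factor after the encoder
    CROP_SIZE = 640          # per-tile resolution (guess)
    BASE_SIZE = 1024         # global-view resolution (guess)
```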

crop_mode = self.crop_mode
patch_size = 16
downsample_ratio = 4
valid_img_tokens = 0

Severity: medium

The variable valid_img_tokens is initialized and incremented within the _preprocess_image method, but its value is never used. This appears to be dead code and should be removed to improve code clarity.

@Jintao-Huang merged commit b9f4e2c into modelscope:main on Oct 21, 2025 (1 of 2 checks passed).