[model] Support deepseek-ocr #6238
Conversation
Code Review
This pull request adds support for the deepseek-ocr model by updating constants and adding new model and template registrations. While the changes are in the right direction, there are a couple of significant issues. The model registration uses an incorrect function for loading the multimodal processor and is missing the architectures definition. Additionally, the corresponding template class is empty, which will cause incorrect processing of image inputs. These issues need to be addressed for the model to function correctly.
swift/llm/model/model/deepseek.py (Outdated)
```python
        get_model_tokenizer_with_flash_attn,
        model_arch=ModelArch.deepseek_ocr,
```
There are a couple of issues with the `ModelMeta` registration for `deepseek_ocr`:

- Incorrect `get_function`: `get_model_tokenizer_with_flash_attn` is used, which is for text-only models. Since `deepseek-ocr` is a multimodal model, you should use `get_model_tokenizer_multimodal` to correctly load the `AutoProcessor`, which includes the image processor.
- Missing `architectures`: the `architectures` field is missing. Based on the model's `config.json` on Hugging Face, it should be `['LlamaForCausalLM']`. Adding this is important for model type inference.
Suggested change:

```diff
-        get_model_tokenizer_with_flash_attn,
-        model_arch=ModelArch.deepseek_ocr,
+        get_model_tokenizer_multimodal,
+        architectures=['LlamaForCausalLM'],
+        model_arch=ModelArch.deepseek_ocr,
```
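For orientation, here is a hedged sketch of what the full corrected registration could look like, following the `register_model`/`ModelMeta` pattern used elsewhere in `swift/llm/model`; everything outside the two suggested fields (the model group, template type, and tags) is an assumption, not taken from the PR:

```python
# Sketch only -- follows swift's register_model/ModelMeta pattern; the model
# group, template type, and tags are assumptions for illustration.
register_model(
    ModelMeta(
        MLLMModelType.deepseek_ocr,
        [ModelGroup([Model('deepseek-ai/DeepSeek-OCR', 'deepseek-ai/DeepSeek-OCR')])],
        TemplateType.deepseek_ocr,
        get_model_tokenizer_multimodal,  # loads AutoProcessor (image processor included)
        architectures=['LlamaForCausalLM'],
        model_arch=ModelArch.deepseek_ocr,
        tags=['vision'],
    ))
```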
```python
class DeepseekOCR(Template):
    pass
```
The `DeepseekOCR` template class is currently empty (`pass`). For a multimodal model this is incorrect, as it will fall back to the base `Template`'s encoding logic, which is designed for text-only models and won't handle image inputs correctly. You need to override the `_encode` method to properly use the model's processor to handle image and text inputs. You can look at other multimodal templates in this file, like `DeepseekVLTemplate`, for an example of how to implement this.
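A minimal hedged sketch of such an override; the `super()._encode` pattern mirrors other multimodal templates in this file, while the processor attribute and the exact field names below are assumptions rather than the PR's implementation:

```python
class DeepseekOCR(Template):

    def _encode(self, inputs):
        # Let the base class build the text side (input_ids, labels, ...).
        encoded = super()._encode(inputs)
        images = inputs.images
        if images:
            # Assumed call: run the model's image processor over the PIL
            # images and attach the pixel tensors for the forward pass.
            image_inputs = self.processor.image_processor(images, return_tensors='pt')
            encoded['pixel_values'] = image_inputs['pixel_values']
        return encoded
```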
/gemini review |
Code Review
This pull request adds support for the `deepseek-ocr` model, including its model type, architecture, template, and a corresponding test case. The documentation is also updated to include `deepseek-ocr` and other new models, and to reorganize the reranker models into their own section. The implementation is mostly solid, but I've identified a bug in the `DeepseekOCR` template where a configurable parameter is hardcoded, preventing users from changing it. I've also noted a couple of maintainability and consistency improvements in both the code and documentation.
```python
    def _preprocess_image(self, images):
        # Code borrowed from
        # https://modelscope.cn/models/deepseek-ai/DeepSeek-OCR/file/view/master/modeling_deepseekocr.py?status=1
        crop_mode = True
```
The `crop_mode` variable is hardcoded to `True`. This overrides the `self.crop_mode` attribute that is initialized from an environment variable in `init_env_args`, preventing users from disabling this feature. This should be updated to use the value from `self.crop_mode`.
Suggested change:

```diff
-        crop_mode = True
+        crop_mode = self.crop_mode
```
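For reference, a hypothetical illustration of the wiring this fix relies on; the environment-variable name `DEEPSEEK_OCR_CROP_MODE` is an assumption for illustration only:

```python
import os

class DeepseekOCR(Template):

    def init_env_args(self):
        super().init_env_args()
        # Hypothetical variable name: DEEPSEEK_OCR_CROP_MODE=false would
        # disable dynamic cropping; _preprocess_image should then read
        # self.crop_mode rather than hardcoding True.
        self.crop_mode = os.environ.get('DEEPSEEK_OCR_CROP_MODE', 'true').lower() == 'true'
```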
docs/source/Instruction/支持的模型和数据集.md (Outdated)
|[google/gemma-3n-E4B-it](https://modelscope.cn/models/google/gemma-3n-E4B-it)|gemma3n|gemma3n|transformers>=4.53.1|✘|-|[google/gemma-3n-E4B-it](https://huggingface.co/google/gemma-3n-E4B-it)|
|[mistralai/Mistral-Small-3.1-24B-Base-2503](https://modelscope.cn/models/mistralai/Mistral-Small-3.1-24B-Base-2503)|mistral_2503|mistral_2503|transformers>=4.49|✘|-|[mistralai/Mistral-Small-3.1-24B-Base-2503](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Base-2503)|
|[mistralai/Mistral-Small-3.1-24B-Instruct-2503](https://modelscope.cn/models/mistralai/Mistral-Small-3.1-24B-Instruct-2503)|mistral_2503|mistral_2503|transformers>=4.49|✘|-|[mistralai/Mistral-Small-3.1-24B-Instruct-2503](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503)|
|[JinaAI/jina-reranker-m0](https://modelscope.cn/models/JinaAI/jina-reranker-m0)|jina_reranker_m0|jina_reranker_m0|-|✘|reranker, vision|[JinaAI/jina-reranker-m0](https://huggingface.co/JinaAI/jina-reranker-m0)|
```python
        ratio = 1 - ((max(w, h) - min(w, h)) / (max(w, h)))
        # ...
        image_transform = self.BasicImageTransform(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5), normalize=True)
        image_token_id = 128815
```
The `image_token_id` is hardcoded. While the value `128815` is correct for the current tokenizer, it's better practice to retrieve this value dynamically from the tokenizer to improve maintainability. This will prevent potential issues if the tokenizer is updated in the future. You can obtain it using `self._tokenize('<image>')[0]`.
Suggested change:

```diff
-        image_token_id = 128815
+        image_token_id = self._tokenize('<image>')[0]
```
/gemini review |
Code Review
This pull request adds support for the `deepseek-ocr` model, which is a great addition. The changes include model registration, a new template for handling OCR-specific preprocessing, an example script, and updates to the documentation. The implementation looks mostly solid, but I've identified a few areas for improvement: a critical bug in the image preprocessing logic, some maintainability issues with magic numbers and file structure, and a concern about a strictly pinned dependency version. Addressing these points will make the integration more robust and user-friendly.
```python
            # ...
            tokenized_image = ([image_token_id] * num_queries + [image_token_id]) * num_queries
            tokenized_image += [image_token_id]
            tokenized_str += tokenized_image
```
There is an inconsistency in how `tokenized_str` is populated. In the `if crop_mode:` block, `tokenized_image` is appended as a list, but here in the `else` block it's extended (`+=`). This will cause a `TypeError` when `crop_mode` is `False` and multiple images are processed, as the calling function expects a list of lists. This should be changed to `append` to ensure consistent behavior.
Suggested change:

```diff
-            tokenized_str += tokenized_image
+            tokenized_str.append(tokenized_image)
```
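To see why this matters, a standalone illustration in plain Python (the token value is an arbitrary placeholder):

```python
image_token_id = 128815            # placeholder value
tokenized_image = [image_token_id] * 4

per_image = []
per_image.append(tokenized_image)  # [[128815, 128815, 128815, 128815]] -- list of lists
flattened = []
flattened += tokenized_image       # [128815, 128815, 128815, 128815]   -- flattened ints
# A caller that iterates over per-image token lists breaks on the flattened form.
```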
|[deepseek-ai/Janus-1.3B](https://modelscope.cn/models/deepseek-ai/Janus-1.3B)|deepseek_janus|deepseek_janus|-|✘|vision|[deepseek-ai/Janus-1.3B](https://huggingface.co/deepseek-ai/Janus-1.3B)|
|[deepseek-ai/Janus-Pro-1B](https://modelscope.cn/models/deepseek-ai/Janus-Pro-1B)|deepseek_janus_pro|deepseek_janus_pro|-|✘|vision|[deepseek-ai/Janus-Pro-1B](https://huggingface.co/deepseek-ai/Janus-Pro-1B)|
|[deepseek-ai/Janus-Pro-7B](https://modelscope.cn/models/deepseek-ai/Janus-Pro-7B)|deepseek_janus_pro|deepseek_janus_pro|-|✘|vision|[deepseek-ai/Janus-Pro-7B](https://huggingface.co/deepseek-ai/Janus-Pro-7B)|
|[deepseek-ai/DeepSeek-OCR](https://modelscope.cn/models/deepseek-ai/DeepSeek-OCR)|deepseek_ocr|deepseek_ocr|transformers==4.46.3, easydict|✘|vision|[deepseek-ai/DeepSeek-OCR](https://huggingface.co/deepseek-ai/DeepSeek-OCR)|
Pinning the `transformers` dependency to an exact version (`==4.46.3`) can lead to dependency conflicts for users who might have other packages requiring a different version. It would be more flexible to specify a compatible range, such as `>=4.46.3,<4.47.0`. If this exact version is a strict requirement for the model to function correctly, it would be helpful to add a note explaining why.
```diff
@@ -0,0 +1,30 @@
+# 24GiB
```
The example script for `deepseek-ocr` is located in the `deepseek_vl2` directory. For better project structure and maintainability, it would be more appropriate to create a dedicated directory for this new model, such as `examples/models/deepseek_ocr/`, and place the script there. This will make it easier for users to find examples related to this specific model.
```python
        # Code borrowed from
        # https://modelscope.cn/models/deepseek-ai/DeepSeek-OCR/file/view/master/modeling_deepseekocr.py?status=1
        crop_mode = self.crop_mode
        patch_size = 16
```
The `_preprocess_image` method contains several magic numbers, such as `16` for `patch_size` and `4` for `downsample_ratio`, as well as various hardcoded image sizes and token counts (e.g., `640`, `1024`, `256`, `100`). To improve readability and maintainability, it's recommended to define these as named constants at the class level or configure them via `init_env_args`; a sketch follows below.
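A hypothetical sketch of that refactor; the constant names, and which quoted number maps to which role, are illustrative assumptions rather than the PR's code:

```python
class DeepseekOCR(Template):
    # Illustrative names grouping the magic numbers quoted above.
    PATCH_SIZE = 16        # vision-encoder patch edge, in pixels
    DOWNSAMPLE_RATIO = 4   # token downsampling factor
    CROP_SIZE = 640        # per-tile image size (assumed role)
    BASE_SIZE = 1024       # global-view image size (assumed role)
    MAX_CROPS = 100        # upper bound on dynamic crops (assumed role)

    def _preprocess_image(self, images):
        patch_size = self.PATCH_SIZE
        downsample_ratio = self.DOWNSAMPLE_RATIO
        ...
```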
```python
        crop_mode = self.crop_mode
        patch_size = 16
        downsample_ratio = 4
        valid_img_tokens = 0
```
#6234