[docs] update qwen3_vl best practice #6206
Summary of Changes
Hello @Jintao-Huang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the documentation for Qwen3-VL, providing users with detailed best practices for both inference and training. It includes practical code examples for various setups, updates command-line parameter references, and introduces new training scripts, making it easier for users to work with Qwen3-VL models. A small but important code improvement also enhances error visibility during training setup.
Code Review
This pull request updates the best practice documentation for Qwen3-VL, adds corresponding example scripts, and includes a small bug fix. The documentation is comprehensive and well-structured. I've found a few issues in the code snippets within the documentation that would prevent them from running correctly, such as a missing import. I've also noted an incorrect parameter in one of the new example scripts and some untranslated text in the English documentation. The code change in swift/trainers/mixin.py is a good improvement for error logging. My detailed comments are below.
| import os | ||
| os.environ['CUDA_VISIBLE_DEVICES'] = '0' | ||
| from modelscope import snapshot_download | ||
| from qwen_vl_utils import process_vision_info |
In the transformers inference example, the process_vision_info function is called on line 49 but is not imported. This will cause a NameError when the code is executed. Please import it from qwen_vl_utils.
I suggest adding from qwen_vl_utils import process_vision_info after the transformers import.
| from qwen_vl_utils import process_vision_info | |
| from transformers import Qwen3VLForConditionalGeneration, AutoProcessor | |
| from qwen_vl_utils import process_vision_info |
| import os | ||
| os.environ['CUDA_VISIBLE_DEVICES'] = '0' | ||
| from modelscope import snapshot_download | ||
| from qwen_vl_utils import process_vision_info |
In the transformers inference example, the process_vision_info function is called on line 45 but is not imported. This will cause a NameError when the code is executed. Please import it from qwen_vl_utils.
I suggest adding from qwen_vl_utils import process_vision_info after the transformers import.
| from qwen_vl_utils import process_vision_info | |
| from transformers import Qwen3VLForConditionalGeneration, AutoProcessor | |
| from qwen_vl_utils import process_vision_info |
examples/models/qwen3_vl/mixed.sh
Outdated
| --lora_rank 8 \ | ||
| --lora_alpha 32 \ | ||
| --target_modules all-linear \ | ||
| --router_aux_loss_coef 1e-3 \ |
| Qwen3-VL's bbox output uses normalized 1000 relative coordinates. You can use the grounding dataset format provided by ms-swift, where the coordinates in "bbox" are absolute coordinates, and ms-swift will automatically convert absolute coordinates to normalized 1000 relative coordinates. For more information, please refer to the [Grounding Dataset Format Documentation](../Customization/Custom-dataset.md#grounding). | ||
| ```jsonl |
This JSONL example in the English documentation contains untranslated Chinese text. For consistency and clarity, it should be translated to English.
"找到图像中的<ref-object>" should be "Find <ref-object> in the image." and "羊" should be "sheep".
| ```jsonl | |
| {"messages": [{"role": "user", "content": "<image>Find <ref-object> in the image"}, {"role": "assistant", "content": "[\n\t{\"bbox_2d\": <bbox>, \"label\": \"<ref-object>\"}\n\t{\"bbox_2d\": <bbox>, \"label\": \"<ref-object>\"}\n]"}], "images": ["cat.png"], "objects": {"ref": ["sheep", "sheep", "sheep"], "bbox": [[90.9, 160.8, 135, 212.8], [360.9, 480.8, 495, 532.8]]}} |
/gemini review
Code Review
This pull request adds comprehensive documentation for Qwen3-VL best practices in both English and Chinese, including environment setup, inference, and training examples. It also adds corresponding shell scripts for training. The changes are well-structured and provide valuable information for users.
My review focuses on ensuring the accuracy and consistency of the provided examples and documentation. I've found a few minor inconsistencies in the example data and between the Chinese and English versions of the documentation. Addressing these will improve clarity and prevent potential confusion for users. The code change in swift/trainers/mixin.py to add logging for exceptions is a good improvement for debuggability.
| Qwen3-VL的bbox输出采用归一化1000的相对坐标。你可以使用ms-swift提供的grounding数据集格式,其中"bbox"中的坐标为绝对坐标,ms-swift会自动将绝对坐标转为归一化1000的相对坐标。更多信息请参考[grounding数据集格式文档](../Customization/自定义数据集.md#grounding)。 | ||
| ```jsonl | ||
| {"messages": [{"role": "user", "content": "<image>找到图像中的<ref-object>"}, {"role": "assistant", "content": "[\n\t{\"bbox_2d\": <bbox>, \"label\": \"<ref-object>\"}\n\t{\"bbox_2d\": <bbox>, \"label\": \"<ref-object>\"}\n]"}], "images": ["cat.png"], "objects": {"ref": ["羊", "羊", "羊"], "bbox": [[90.9, 160.8, 135, 212.8], [360.9, 480.8, 495, 532.8]]}} |
In this JSONL example, the ref array in the objects field contains three elements ('羊'), but the bbox array has only two coordinate pairs. Additionally, the assistant's response template only contains two bbox_2d entries. For consistency and to avoid confusion for users, the number of elements in ref, bbox, and the assistant's response should match. I suggest updating the ref array to contain two elements.
| {"messages": [{"role": "user", "content": "<image>找到图像中的<ref-object>"}, {"role": "assistant", "content": "[\n\t{\"bbox_2d\": <bbox>, \"label\": \"<ref-object>\"}\n\t{\"bbox_2d\": <bbox>, \"label\": \"<ref-object>\"}\n]"}], "images": ["cat.png"], "objects": {"ref": ["羊", "羊", "羊"], "bbox": [[90.9, 160.8, 135, 212.8], [360.9, 480.8, 495, 532.8]]}} | |
| {"messages": [{"role": "user", "content": "<image>找到图像中的<ref-object>"}, {"role": "assistant", "content": "[\n\t{\"bbox_2d\": <bbox>, \"label\": \"<ref-object>\"}\n\t{\"bbox_2d\": <bbox>, \"label\": \"<ref-object>\"}\n]"}], "images": ["cat.png"], "objects": {"ref": ["羊", "羊"], "bbox": [[90.9, 160.8, 135, 212.8], [360.9, 480.8, 495, 532.8]]}} |
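The counts discussed in this comment can be checked mechanically. The following is a minimal sketch, not ms-swift's own validation: the function name `count_grounding_fields` is hypothetical, and counting placeholders across all messages (the user prompt included) is an assumption about how placeholders are matched against the entries in `objects`.

```python
import json


def count_grounding_fields(line: str) -> dict:
    """Count `objects` entries and placeholders in one JSONL grounding sample."""
    sample = json.loads(line)
    objects = sample.get("objects", {})
    # Join every message so placeholders in the user prompt are counted too.
    text = "".join(message["content"] for message in sample["messages"])
    return {
        "ref": len(objects.get("ref", [])),
        "bbox": len(objects.get("bbox", [])),
        "ref_placeholders": text.count("<ref-object>"),
        "bbox_placeholders": text.count("<bbox>"),
    }


# Same structure as the suggested line above, with an English prompt and label.
sample = {
    "messages": [
        {"role": "user", "content": "<image>Find <ref-object> in the image."},
        {
            "role": "assistant",
            "content": '[\n\t{"bbox_2d": <bbox>, "label": "<ref-object>"}'
                       '\n\t{"bbox_2d": <bbox>, "label": "<ref-object>"}\n]',
        },
    ],
    "images": ["cat.png"],
    "objects": {
        "ref": ["sheep", "sheep"],
        "bbox": [[90.9, 160.8, 135, 212.8], [360.9, 480.8, 495, 532.8]],
    },
}

counts = count_grounding_fields(json.dumps(sample))
print(counts)
```

Note that the user prompt carries one `<ref-object>` placeholder in addition to the two in the assistant reply, so it may be worth comparing the `ref` length against the placeholder total across all messages before settling on a count.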
| ```python | ||
| import os | ||
| # os.environ['SWIFT_DEBUG'] = '1' |
There's an inconsistency between the English and Chinese documentation. In the Chinese version (Qwen3-VL最佳实践.md), this line is os.environ['SWIFT_DEBUG'] = '1', but here it's commented out. To ensure both documents provide the same guidance, this line should be uncommented.
| # os.environ['SWIFT_DEBUG'] = '1' | |
| os.environ['SWIFT_DEBUG'] = '1' |
| Qwen3-VL's bbox output uses normalized 1000 relative coordinates. You can use the grounding dataset format provided by ms-swift, where the coordinates in "bbox" are absolute coordinates, and ms-swift will automatically convert absolute coordinates to normalized 1000 relative coordinates. For more information, please refer to the [Grounding Dataset Format Documentation](../Customization/Custom-dataset.md#grounding). | ||
| ```jsonl | ||
| {"messages": [{"role": "user", "content": "<image>Find <ref-object> in the image."}, {"role": "assistant", "content": "[\n\t{\"bbox_2d\": <bbox>, \"label\": \"<ref-object>\"}\n\t{\"bbox_2d\": <bbox>, \"label\": \"<ref-object>\"}\n]"}], "images": ["cat.png"], "objects": {"ref": ["羊", "羊", "羊"], "bbox": [[90.9, 160.8, 135, 212.8], [360.9, 480.8, 495, 532.8]]}} |
This JSONL example has a couple of issues:
- The ref array in the objects field contains three elements, but the bbox array has only two coordinate pairs. The assistant's response template also suggests two bbox_2d entries. These should be consistent.
- The ref array contains the Chinese character '羊'. In an English document, this should be translated to 'sheep'.
I suggest correcting the number of elements and translating the character for clarity.
| {"messages": [{"role": "user", "content": "<image>Find <ref-object> in the image."}, {"role": "assistant", "content": "[\n\t{\"bbox_2d\": <bbox>, \"label\": \"<ref-object>\"}\n\t{\"bbox_2d\": <bbox>, \"label\": \"<ref-object>\"}\n]"}], "images": ["cat.png"], "objects": {"ref": ["羊", "羊", "羊"], "bbox": [[90.9, 160.8, 135, 212.8], [360.9, 480.8, 495, 532.8]]}} | |
| {"messages": [{"role": "user", "content": "<image>Find <ref-object> in the image."}, {"role": "assistant", "content": "[\n\t{\"bbox_2d\": <bbox>, \"label\": \"<ref-object>\"}\n\t{\"bbox_2d\": <bbox>, \"label\": \"<ref-object>\"}\n]"}], "images": ["cat.png"], "objects": {"ref": ["sheep", "sheep"], "bbox": [[90.9, 160.8, 135, 212.8], [360.9, 480.8, 495, 532.8]]}} |
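For readers unfamiliar with the "normalized 1000 relative coordinates" mentioned in the quoted documentation, here is a rough sketch of the conversion from absolute pixel coordinates. The helper name `to_norm1000`, the rounding behaviour, and the image size are assumptions for illustration; ms-swift performs this conversion internally and its exact rounding may differ.

```python
def to_norm1000(bbox, width, height):
    """Scale an absolute [x1, y1, x2, y2] box into the 0-1000 relative range."""
    x1, y1, x2, y2 = bbox
    return [
        round(x1 / width * 1000),
        round(y1 / height * 1000),
        round(x2 / width * 1000),
        round(y2 / height * 1000),
    ]


# Hypothetical image size; in practice it is read from the image file.
width, height = 1000, 800
normalized = to_norm1000([90.9, 160.8, 135, 212.8], width, height)
print(normalized)
```

The x values are divided by the image width and the y values by the image height, so the result is independent of the input resolution.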
| 使用命令行推理: | ||
| ```shell | ||
| CUDA_VISIBLE_DEVICES=4 \ |
Use CUDA_VISIBLE_DEVICES=0 here.
| ```shell | ||
| pip install "transformers>=4.57" "qwen_vl_utils>=0.0.14" | ||
| pip install "ms-swift>=3.9.1" |
[SUGGESTION] Add the dependency on vLLM >= 0.11.0 (optional).
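A quick way to confirm that an environment meets these minimums is sketched below. It uses only the standard library; the helper names are hypothetical, the version parsing is deliberately simplified (pre-release suffixes are ignored), and the distribution names are taken as written in the pip commands above.

```python
import re
from importlib import metadata

# Minimum versions quoted in the diff; vLLM is the optional one suggested here.
REQUIRED = {"transformers": "4.57", "qwen_vl_utils": "0.0.14", "ms-swift": "3.9.1"}
OPTIONAL = {"vllm": "0.11.0"}


def as_tuple(version: str) -> tuple:
    """Turn '4.57.0.dev0' into (4, 57, 0), stopping at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets(installed: str, minimum: str) -> bool:
    """Compare two versions numerically, padding the shorter one with zeros."""
    a, b = as_tuple(installed), as_tuple(minimum)
    size = max(len(a), len(b))
    a += (0,) * (size - len(a))
    b += (0,) * (size - len(b))
    return a >= b


def report(requirements: dict, optional: bool = False) -> None:
    """Print one status line per package."""
    for name, minimum in requirements.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            status = "not installed (optional)" if optional else "MISSING"
        else:
            status = "ok" if meets(installed, minimum) else f"too old ({installed})"
        print(f"{name}>={minimum}: {status}")


report(REQUIRED)
report(OPTIONAL, optional=True)
```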