
Conversation

@Jintao-Huang
Collaborator

No description provided.

@gemini-code-assist
Contributor

Summary of Changes

Hello @Jintao-Huang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the documentation for Qwen3-VL, providing users with detailed best practices for both inference and training. It includes practical code examples for various setups, updates command-line parameter references, and introduces new training scripts, making it easier for users to work with Qwen3-VL models. A small but important code improvement also enhances error visibility during training setup.

Highlights

  • New Qwen3-VL Best Practices Documentation: Added comprehensive documentation in both Chinese and English for Qwen3-VL, detailing environment setup, inference methods (using transformers, ms-swift PtEngine, and the command line), and training procedures; a hedged PtEngine sketch follows this list.
  • Training Guidance for Dense and MoE Models: The new documentation provides specific guidance and example scripts for fine-tuning Dense models using ms-swift (transformers backend) and MoE models using Megatron-SWIFT, including dataset formatting and conversion steps.
  • Updated Command-Line Parameters: The command-line parameters documentation has been updated to include internvl3_5 for relevant model configurations.
  • New Example Training Scripts: Two new shell scripts (mcore_full.sh and mixed.sh) have been added under examples/models/qwen3_vl to demonstrate full parameter training for MoE models and LoRA fine-tuning for Dense models, respectively.
  • Improved Error Handling in Trainer Mixin: A minor code change in swift/trainers/mixin.py now logs exceptions encountered during gradient checkpointing preparation, providing better debugging information instead of silently passing; a minimal sketch of this pattern also follows below.
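
To accompany the inference highlight above, here is a minimal, hedged sketch of multimodal inference through ms-swift's Python API. It assumes the `swift.llm` PtEngine/InferRequest/RequestConfig interface; the model id and image path are placeholders, and the actual examples added by this PR's documentation may differ.

```python
# Minimal sketch (assumptions noted above, not copied from the PR's docs).
from swift.llm import PtEngine, InferRequest, RequestConfig

engine = PtEngine('Qwen/Qwen3-VL-4B-Instruct')  # illustrative model id
request = InferRequest(
    messages=[{'role': 'user', 'content': '<image>Describe this image.'}],
    images=['cat.png'],  # placeholder local image path
)
resp = engine.infer([request], RequestConfig(max_tokens=512, temperature=0))
print(resp[0].choices[0].message.content)
```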
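
For the error-handling highlight, the following is a self-contained sketch of the logging pattern the PR describes; the helper below is hypothetical and merely stands in for the actual gradient-checkpointing setup in swift/trainers/mixin.py.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('swift')

def prepare_gradient_checkpointing(model):
    # Hypothetical stand-in for the setup step that can fail on some models.
    raise RuntimeError('model does not support gradient checkpointing')

try:
    prepare_gradient_checkpointing(model=None)
except Exception as e:
    # The spirit of the PR's change: log the exception instead of silently
    # passing, so failures during training setup are visible in the logs.
    logger.warning('Failed to prepare gradient checkpointing: %s', e)
```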

Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request updates the best practice documentation for Qwen3-VL, adds corresponding example scripts, and includes a small bug fix. The documentation is comprehensive and well-structured. I've found a few issues in the code snippets within the documentation that would prevent them from running correctly, such as a missing import. I've also noted an incorrect parameter in one of the new example scripts and some untranslated text in the English documentation. The code change in swift/trainers/mixin.py is a good improvement for error logging. My detailed comments are below.

```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
from modelscope import snapshot_download
from qwen_vl_utils import process_vision_info
```
Contributor


Severity: high

In the transformers inference example, the process_vision_info function is called on line 49 but is not imported. This will cause a NameError when the code is executed. Please import it from qwen_vl_utils.

I suggest adding from qwen_vl_utils import process_vision_info after the transformers import.

Suggested change:

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
```

```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
from modelscope import snapshot_download
from qwen_vl_utils import process_vision_info
```
Contributor


Severity: high

In the transformers inference example, the process_vision_info function is called on line 45 but is not imported. This will cause a NameError when the code is executed. Please import it from qwen_vl_utils.

I suggest adding from qwen_vl_utils import process_vision_info after the transformers import.

Suggested change:

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
```

```shell
--lora_rank 8 \
--lora_alpha 32 \
--target_modules all-linear \
--router_aux_loss_coef 1e-3 \
```
Contributor


Severity: high

The --router_aux_loss_coef parameter is used for Mixture-of-Experts (MoE) models to set the weight of the auxiliary loss. However, Qwen/Qwen3-VL-4B-Instruct is a dense model, not a MoE model. This parameter is unnecessary here and should be removed to avoid confusion and potential issues.
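
For reference, a hedged sketch of what the dense-model invocation could look like with that flag dropped; this is not the PR's actual script, and the dataset path is a placeholder.

```shell
CUDA_VISIBLE_DEVICES=0 \
swift sft \
    --model Qwen/Qwen3-VL-4B-Instruct \
    --train_type lora \
    --dataset train.jsonl \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear
```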


Qwen3-VL's bbox output uses normalized 1000 relative coordinates. You can use the grounding dataset format provided by ms-swift, where the coordinates in "bbox" are absolute coordinates, and ms-swift will automatically convert absolute coordinates to normalized 1000 relative coordinates. For more information, please refer to the [Grounding Dataset Format Documentation](../Customization/Custom-dataset.md#grounding).

Contributor


Severity: medium

This JSONL example in the English documentation contains untranslated Chinese text. For consistency and clarity, it should be translated to English.

  • 找到图像中的<ref-object> should be Find <ref-object> in the image.
  • 羊 should be sheep.
Suggested change:

```jsonl
{"messages": [{"role": "user", "content": "<image>Find <ref-object> in the image"}, {"role": "assistant", "content": "[\n\t{\"bbox_2d\": <bbox>, \"label\": \"<ref-object>\"}\n\t{\"bbox_2d\": <bbox>, \"label\": \"<ref-object>\"}\n]"}], "images": ["cat.png"], "objects": {"ref": ["sheep", "sheep", "sheep"], "bbox": [[90.9, 160.8, 135, 212.8], [360.9, 480.8, 495, 532.8]]}}
```

@Jintao-Huang
Collaborator Author

/gemini review

Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request adds comprehensive documentation for Qwen3-VL best practices in both English and Chinese, including environment setup, inference, and training examples. It also adds corresponding shell scripts for training. The changes are well-structured and provide valuable information for users.

My review focuses on ensuring the accuracy and consistency of the provided examples and documentation. I've found a few minor inconsistencies in the example data and between the Chinese and English versions of the documentation. Addressing these will improve clarity and prevent potential confusion for users. The code change in swift/trainers/mixin.py to add logging for exceptions is a good improvement for debuggability.


Qwen3-VL's bbox output uses normalized 1000 relative coordinates. You can use the grounding dataset format provided by ms-swift, where the coordinates in "bbox" are absolute coordinates, and ms-swift will automatically convert them to normalized 1000 relative coordinates. For more information, see the [grounding dataset format documentation](../Customization/自定义数据集.md#grounding).
```jsonl
{"messages": [{"role": "user", "content": "<image>找到图像中的<ref-object>"}, {"role": "assistant", "content": "[\n\t{\"bbox_2d\": <bbox>, \"label\": \"<ref-object>\"}\n\t{\"bbox_2d\": <bbox>, \"label\": \"<ref-object>\"}\n]"}], "images": ["cat.png"], "objects": {"ref": ["羊", "羊", "羊"], "bbox": [[90.9, 160.8, 135, 212.8], [360.9, 480.8, 495, 532.8]]}}
```
Contributor


Severity: medium

In this JSONL example, the ref array in the objects field contains three elements ('羊'), but the bbox array has only two coordinate pairs. Additionally, the assistant's response template only contains two bbox_2d entries. For consistency and to avoid confusion for users, the number of elements in ref, bbox, and the assistant's response should match. I suggest updating the ref array to contain two elements.

Suggested change:

```jsonl
{"messages": [{"role": "user", "content": "<image>找到图像中的<ref-object>"}, {"role": "assistant", "content": "[\n\t{\"bbox_2d\": <bbox>, \"label\": \"<ref-object>\"}\n\t{\"bbox_2d\": <bbox>, \"label\": \"<ref-object>\"}\n]"}], "images": ["cat.png"], "objects": {"ref": ["羊", "羊"], "bbox": [[90.9, 160.8, 135, 212.8], [360.9, 480.8, 495, 532.8]]}}
```


```python
import os
# os.environ['SWIFT_DEBUG'] = '1'
```
Contributor


Severity: medium

There's an inconsistency between the English and Chinese documentation. In the Chinese version (Qwen3-VL最佳实践.md), this line is os.environ['SWIFT_DEBUG'] = '1', but here it's commented out. To ensure both documents provide the same guidance, this line should be uncommented.

Suggested change:

```python
os.environ['SWIFT_DEBUG'] = '1'
```

Qwen3-VL's bbox output uses normalized 1000 relative coordinates. You can use the grounding dataset format provided by ms-swift, where the coordinates in "bbox" are absolute coordinates, and ms-swift will automatically convert absolute coordinates to normalized 1000 relative coordinates. For more information, please refer to the [Grounding Dataset Format Documentation](../Customization/Custom-dataset.md#grounding).

```jsonl
{"messages": [{"role": "user", "content": "<image>Find <ref-object> in the image."}, {"role": "assistant", "content": "[\n\t{\"bbox_2d\": <bbox>, \"label\": \"<ref-object>\"}\n\t{\"bbox_2d\": <bbox>, \"label\": \"<ref-object>\"}\n]"}], "images": ["cat.png"], "objects": {"ref": ["羊", "羊", "羊"], "bbox": [[90.9, 160.8, 135, 212.8], [360.9, 480.8, 495, 532.8]]}}
```
Contributor


Severity: medium

This JSONL example has a couple of issues:

  1. The ref array in the objects field contains three elements, but the bbox array has only two coordinate pairs. The assistant's response template also suggests two bbox_2d entries. These should be consistent.
  2. The ref array contains the Chinese character '羊'. In an English document, this should be translated to 'sheep'.

I suggest correcting the number of elements and translating the character for clarity.

Suggested change:

```jsonl
{"messages": [{"role": "user", "content": "<image>Find <ref-object> in the image."}, {"role": "assistant", "content": "[\n\t{\"bbox_2d\": <bbox>, \"label\": \"<ref-object>\"}\n\t{\"bbox_2d\": <bbox>, \"label\": \"<ref-object>\"}\n]"}], "images": ["cat.png"], "objects": {"ref": ["sheep", "sheep"], "bbox": [[90.9, 160.8, 135, 212.8], [360.9, 480.8, 495, 532.8]]}}
```


Inference using the command line:
```shell
CUDA_VISIBLE_DEVICES=4 \
```
Collaborator


=0 (i.e., CUDA_VISIBLE_DEVICES=0)

```shell
pip install "transformers>=4.57" "qwen_vl_utils>=0.0.14"

pip install "ms-swift>=3.9.1"
```
Collaborator


[SUGGESTION] Add the dependency on vLLM >= 0.11.0 (optional).
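
If that suggestion is adopted, the extra install step would presumably look like this (the version pin is taken from the comment above):

```shell
pip install "vllm>=0.11.0"
```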

@Jintao-Huang merged commit 06a8f59 into modelscope:main on Oct 19, 2025
1 of 2 checks passed
