
Conversation

@Jintao-Huang
Collaborator

No description provided.

@gemini-code-assist
Contributor

Summary of Changes

Hello @Jintao-Huang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the documentation for Qwen3-VL, providing users with detailed best practices for both inference and training. It includes practical code examples for various setups, updates command-line parameter references, and introduces new training scripts, making it easier for users to work with Qwen3-VL models. A small but important code improvement also enhances error visibility during training setup.

Highlights

  • New Qwen3-VL Best Practices Documentation: Added comprehensive documentation in both Chinese and English for Qwen3-VL, detailing environment setup, inference methods (using transformers, ms-swift PtEngine, and the command line), and training procedures; a hedged PtEngine sketch follows this list.
  • Training Guidance for Dense and MoE Models: The new documentation provides specific guidance and example scripts for fine-tuning Dense models using ms-swift (transformers backend) and MoE models using Megatron-SWIFT, including dataset formatting and conversion steps.
  • Updated Command-Line Parameters: The command-line parameters documentation has been updated to include internvl3_5 for relevant model configurations.
  • New Example Training Scripts: Two new shell scripts (mcore_full.sh and mixed.sh) have been added under examples/models/qwen3_vl to demonstrate full parameter training for MoE models and LoRA fine-tuning for Dense models, respectively.
  • Improved Error Handling in Trainer Mixin: A minor code change in swift/trainers/mixin.py now logs exceptions encountered during gradient checkpointing preparation, providing better debugging information instead of silently passing; a minimal sketch of this pattern also follows below.
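
To accompany the inference highlight above, here is a minimal, hedged sketch of multimodal inference through ms-swift's Python API. It assumes the `swift.llm` PtEngine/InferRequest/RequestConfig interface; the model id and image path are placeholders, and the actual examples added by this PR's documentation may differ.

```python
# Minimal sketch (assumptions noted above, not copied from the PR's docs).
from swift.llm import PtEngine, InferRequest, RequestConfig

engine = PtEngine('Qwen/Qwen3-VL-4B-Instruct')  # illustrative model id
request = InferRequest(
    messages=[{'role': 'user', 'content': '<image>Describe this image.'}],
    images=['cat.png'],  # placeholder local image path
)
resp = engine.infer([request], RequestConfig(max_tokens=512, temperature=0))
print(resp[0].choices[0].message.content)
```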
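
For the error-handling highlight, the following is a self-contained sketch of the logging pattern the PR describes; the helper below is hypothetical and merely stands in for the actual gradient-checkpointing setup in swift/trainers/mixin.py.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('swift')

def prepare_gradient_checkpointing(model):
    # Hypothetical stand-in for the setup step that can fail on some models.
    raise RuntimeError('model does not support gradient checkpointing')

try:
    prepare_gradient_checkpointing(model=None)
except Exception as e:
    # The spirit of the PR's change: log the exception instead of silently
    # passing, so failures during training setup are visible in the logs.
    logger.warning('Failed to prepare gradient checkpointing: %s', e)
```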

Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request updates the best practice documentation for Qwen3-VL, adds corresponding example scripts, and includes a small bug fix. The documentation is comprehensive and well-structured. I've found a few issues in the code snippets within the documentation that would prevent them from running correctly, such as a missing import. I've also noted an incorrect parameter in one of the new example scripts and some untranslated text in the English documentation. The code change in swift/trainers/mixin.py is a good improvement for error logging. My detailed comments are below.

```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
from modelscope import snapshot_download
from qwen_vl_utils import process_vision_info
```
Contributor


Severity: high

In the transformers inference example, the process_vision_info function is called on line 49 but is not imported. This will cause a NameError when the code is executed. Please import it from qwen_vl_utils.

I suggest adding from qwen_vl_utils import process_vision_info after the transformers import.

Suggested change:

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
```

```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
from modelscope import snapshot_download
from qwen_vl_utils import process_vision_info
```
Contributor


Severity: high

In the transformers inference example, the process_vision_info function is called on line 45 but is not imported. This will cause a NameError when the code is executed. Please import it from qwen_vl_utils.

I suggest adding from qwen_vl_utils import process_vision_info after the transformers import.

Suggested change:

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
```

```shell
--lora_rank 8 \
--lora_alpha 32 \
--target_modules all-linear \
--router_aux_loss_coef 1e-3 \
```
Contributor


Severity: high

The --router_aux_loss_coef parameter is used for Mixture-of-Experts (MoE) models to set the weight of the auxiliary loss. However, Qwen/Qwen3-VL-4B-Instruct is a dense model, not a MoE model. This parameter is unnecessary here and should be removed to avoid confusion and potential issues.
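
For reference, a hedged sketch of what the dense-model invocation could look like with that flag dropped; this is not the PR's actual script, and the dataset path is a placeholder.

```shell
CUDA_VISIBLE_DEVICES=0 \
swift sft \
    --model Qwen/Qwen3-VL-4B-Instruct \
    --train_type lora \
    --dataset train.jsonl \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear
```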


Qwen3-VL's bbox output uses normalized 1000 relative coordinates. You can use the grounding dataset format provided by ms-swift, where the coordinates in "bbox" are absolute coordinates, and ms-swift will automatically convert absolute coordinates to normalized 1000 relative coordinates. For more information, please refer to the [Grounding Dataset Format Documentation](../Customization/Custom-dataset.md#grounding).

Contributor


Severity: medium

This JSONL example in the English documentation contains untranslated Chinese text. For consistency and clarity, it should be translated to English.

  • 找到图像中的<ref-object> should be Find <ref-object> in the image.
  • 羊 should be sheep.
Suggested change:

```jsonl
{"messages": [{"role": "user", "content": "<image>Find <ref-object> in the image"}, {"role": "assistant", "content": "[\n\t{\"bbox_2d\": <bbox>, \"label\": \"<ref-object>\"}\n\t{\"bbox_2d\": <bbox>, \"label\": \"<ref-object>\"}\n]"}], "images": ["cat.png"], "objects": {"ref": ["sheep", "sheep", "sheep"], "bbox": [[90.9, 160.8, 135, 212.8], [360.9, 480.8, 495, 532.8]]}}
```

@Jintao-Huang
Collaborator Author

/gemini review

Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request adds comprehensive documentation for Qwen3-VL best practices in both English and Chinese, including environment setup, inference, and training examples. It also adds corresponding shell scripts for training. The changes are well-structured and provide valuable information for users.

My review focuses on ensuring the accuracy and consistency of the provided examples and documentation. I've found a few minor inconsistencies in the example data and between the Chinese and English versions of the documentation. Addressing these will improve clarity and prevent potential confusion for users. The code change in swift/trainers/mixin.py to add logging for exceptions is a good improvement for debuggability.


Qwen3-VL's bbox output uses normalized 1000 relative coordinates. You can use the grounding dataset format provided by ms-swift, where the coordinates in "bbox" are absolute coordinates, and ms-swift will automatically convert them to normalized 1000 relative coordinates. For more information, see the [grounding dataset format documentation](../Customization/自定义数据集.md#grounding).
```jsonl
{"messages": [{"role": "user", "content": "<image>找到图像中的<ref-object>"}, {"role": "assistant", "content": "[\n\t{\"bbox_2d\": <bbox>, \"label\": \"<ref-object>\"}\n\t{\"bbox_2d\": <bbox>, \"label\": \"<ref-object>\"}\n]"}], "images": ["cat.png"], "objects": {"ref": ["羊", "羊", "羊"], "bbox": [[90.9, 160.8, 135, 212.8], [360.9, 480.8, 495, 532.8]]}}
```
Contributor


Severity: medium

In this JSONL example, the ref array in the objects field contains three elements ('羊'), but the bbox array has only two coordinate pairs. Additionally, the assistant's response template only contains two bbox_2d entries. For consistency and to avoid confusion for users, the number of elements in ref, bbox, and the assistant's response should match. I suggest updating the ref array to contain two elements.

Suggested change:

```jsonl
{"messages": [{"role": "user", "content": "<image>找到图像中的<ref-object>"}, {"role": "assistant", "content": "[\n\t{\"bbox_2d\": <bbox>, \"label\": \"<ref-object>\"}\n\t{\"bbox_2d\": <bbox>, \"label\": \"<ref-object>\"}\n]"}], "images": ["cat.png"], "objects": {"ref": ["羊", "羊"], "bbox": [[90.9, 160.8, 135, 212.8], [360.9, 480.8, 495, 532.8]]}}
```


```python
import os
# os.environ['SWIFT_DEBUG'] = '1'
```
Contributor


Severity: medium

There's an inconsistency between the English and Chinese documentation. In the Chinese version (Qwen3-VL最佳实践.md), this line is os.environ['SWIFT_DEBUG'] = '1', but here it's commented out. To ensure both documents provide the same guidance, this line should be uncommented.

Suggested change:

```python
os.environ['SWIFT_DEBUG'] = '1'
```

Qwen3-VL's bbox output uses normalized 1000 relative coordinates. You can use the grounding dataset format provided by ms-swift, where the coordinates in "bbox" are absolute coordinates, and ms-swift will automatically convert absolute coordinates to normalized 1000 relative coordinates. For more information, please refer to the [Grounding Dataset Format Documentation](../Customization/Custom-dataset.md#grounding).

```jsonl
{"messages": [{"role": "user", "content": "<image>Find <ref-object> in the image."}, {"role": "assistant", "content": "[\n\t{\"bbox_2d\": <bbox>, \"label\": \"<ref-object>\"}\n\t{\"bbox_2d\": <bbox>, \"label\": \"<ref-object>\"}\n]"}], "images": ["cat.png"], "objects": {"ref": ["羊", "羊", "羊"], "bbox": [[90.9, 160.8, 135, 212.8], [360.9, 480.8, 495, 532.8]]}}
```
Contributor


Severity: medium

This JSONL example has a couple of issues:

  1. The ref array in the objects field contains three elements, but the bbox array has only two coordinate pairs. The assistant's response template also suggests two bbox_2d entries. These should be consistent.
  2. The ref array contains the Chinese character '羊'. In an English document, this should be translated to 'sheep'.

I suggest correcting the number of elements and translating the character for clarity.

Suggested change:

```jsonl
{"messages": [{"role": "user", "content": "<image>Find <ref-object> in the image."}, {"role": "assistant", "content": "[\n\t{\"bbox_2d\": <bbox>, \"label\": \"<ref-object>\"}\n\t{\"bbox_2d\": <bbox>, \"label\": \"<ref-object>\"}\n]"}], "images": ["cat.png"], "objects": {"ref": ["sheep", "sheep"], "bbox": [[90.9, 160.8, 135, 212.8], [360.9, 480.8, 495, 532.8]]}}
```


Inference using the command line:
```shell
CUDA_VISIBLE_DEVICES=4 \
```
Collaborator


=0 (i.e., CUDA_VISIBLE_DEVICES=0)

```shell
pip install "transformers>=4.57" "qwen_vl_utils>=0.0.14"

pip install "ms-swift>=3.9.1"
```
Collaborator


[SUGGESTION] Add the dependency on vLLM >= 0.11.0 (optional).
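
If that suggestion is adopted, the extra install step would presumably look like this (the version pin is taken from the comment above):

```shell
pip install "vllm>=0.11.0"
```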

@Jintao-Huang merged commit 06a8f59 into modelscope:main on Oct 19, 2025
1 of 2 checks passed
