
Conversation

@Jintao-Huang (Collaborator)

No description provided.

@gemini-code-assist (Contributor)

Summary of Changes

Hello @Jintao-Huang, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request primarily focuses on enhancing the compatibility and robustness of the system, particularly concerning the integration of Qwen-VL models with varying versions of the qwen_vl_utils library. It introduces a mechanism to dynamically adjust image and video token processing based on the utility library's version, ensuring smoother operation across different setups. Additionally, it refines error messaging for model type detection and improves the handling of metric reduction in distributed training, contributing to overall system stability and user experience.

Highlights

  • Qwen-VL Utility Compatibility: Implemented version-aware compatibility for the qwen_vl_utils library, specifically addressing how image and video token numbers are calculated for Qwen2VL, Qwen3VL, and Qwen3VLMoe models based on the installed utility version.
  • Dynamic Environment Variable Configuration: Introduced a new function, compat_qwen_vl_utils, which dynamically sets environment variables like IMAGE_MAX_TOKEN_NUM and VIDEO_MAX_TOKEN_NUM to ensure correct token processing based on the model's image_patch_size and SPATIAL_MERGE_SIZE.
  • Improved Error Messaging: Enhanced the error message displayed when the model_type cannot be automatically determined, providing clearer guidance and recommended model_types to the user.
  • Robust Metric Reduction: Refactored the _all_reduce_metric function in the Megatron trainer to build the reporting_metric tensor with torch.stack, making distributed metric aggregation more robust and potentially more efficient (see the sketch after this list).
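
The metric-reduction change is not shown anywhere else in this thread, so here is a minimal sketch of the torch.stack pattern the last bullet describes. The function name matches the PR, but the argument names, reduce op, and device handling are assumptions rather than the actual swift/megatron/trainers/base.py implementation.

import torch
import torch.distributed as dist

def _all_reduce_metric(metrics: dict) -> dict:
    # Stack every scalar metric into one 1-D tensor so a single all_reduce
    # call covers all of them, instead of reducing metric by metric.
    keys = list(metrics.keys())
    reporting_metric = torch.stack(
        [torch.as_tensor(metrics[k], dtype=torch.float32) for k in keys])
    if dist.is_available() and dist.is_initialized():
        # Average across ranks; the real trainer may use a different op or group.
        dist.all_reduce(reporting_metric, op=dist.ReduceOp.AVG)
    return {key: reporting_metric[i].item() for i, key in enumerate(keys)}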
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature                Command                Description
Code Review            /gemini review         Performs a code review for the current pull request in its current state.
Pull Request Summary   /gemini summary        Provides a summary of the current pull request in its current state.
Comment                @gemini-code-assist    Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                   /gemini help           Displays a list of available commands.

Customization

To customize your Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Jintao-Huang linked an issue on Nov 13, 2025 that may be closed by this pull request.
@gemini-code-assist (Contributor)

Summary of Changes


This pull request focuses on improving the robustness and compatibility of Qwen-VL models within the system. It implements a version-aware mechanism to handle different qwen_vl_utils library versions, ensuring that vision-language models correctly process image and video inputs by dynamically adjusting token limits. Additionally, it refines a user-facing error message to provide more helpful information when model types are ambiguous.

Highlights

  • Qwen-VL Compatibility: Introduced dynamic version checking for the qwen_vl_utils library to ensure compatibility with different versions, specifically for Qwen2.5-VL, Qwen3-VL, and Qwen3-MoE-VL models.
  • Vision Token Limit Adjustment: Added a new utility function, compat_qwen_vl_utils, which calculates and sets environment variables for image and video token limits based on image patch size and spatial merge size, crucial for proper vision model operation.
  • Improved Error Messaging: Enhanced the error message for cases where model_type cannot be automatically determined during model registration, providing clearer guidance to the user.

gemini-code-assist bot left a comment


Code Review

This pull request introduces compatibility changes for qwen_vl_utils, particularly for qwen2.5-vl. It adds a compatibility function compat_qwen_vl_utils to handle different versions of qwen_vl_utils by setting environment variables based on pixel limits. The changes also include version checking logic and updates to model initialization functions to use this new compatibility layer. Additionally, an error message in swift/llm/model/register.py is improved for clarity, and a minor optimization is made in swift/megatron/trainers/base.py. The changes look good overall. I have one suggestion to refactor some repetitive code for better maintainability.

Comment on lines 743 to 752
    spatial_merge_size = int(os.getenv('SPATIAL_MERGE_SIZE', '2'))
    image_factor = image_patch_size * spatial_merge_size
    if os.getenv('MAX_PIXELS'):
        os.environ['IMAGE_MAX_TOKEN_NUM'] = str(int(os.getenv('MAX_PIXELS')) // image_factor**2)
    if os.getenv('MIN_PIXELS'):
        os.environ['IMAGE_MIN_TOKEN_NUM'] = str(int(os.getenv('MIN_PIXELS')) // image_factor**2)
    if os.getenv('VIDEO_MAX_PIXELS'):
        os.environ['VIDEO_MAX_TOKEN_NUM'] = str(int(os.getenv('VIDEO_MAX_PIXELS')) // image_factor**2)
    if os.getenv('VIDEO_MIN_PIXELS'):
        os.environ['VIDEO_MIN_TOKEN_NUM'] = str(int(os.getenv('VIDEO_MIN_PIXELS')) // image_factor**2)

Severity: medium

This function contains repetitive logic for handling different environment variables. To improve readability and maintainability, you can refactor this into a loop using a mapping dictionary. This will make the code more concise and easier to extend in the future.

Suggested change
-    spatial_merge_size = int(os.getenv('SPATIAL_MERGE_SIZE', '2'))
-    image_factor = image_patch_size * spatial_merge_size
-    if os.getenv('MAX_PIXELS'):
-        os.environ['IMAGE_MAX_TOKEN_NUM'] = str(int(os.getenv('MAX_PIXELS')) // image_factor**2)
-    if os.getenv('MIN_PIXELS'):
-        os.environ['IMAGE_MIN_TOKEN_NUM'] = str(int(os.getenv('MIN_PIXELS')) // image_factor**2)
-    if os.getenv('VIDEO_MAX_PIXELS'):
-        os.environ['VIDEO_MAX_TOKEN_NUM'] = str(int(os.getenv('VIDEO_MAX_PIXELS')) // image_factor**2)
-    if os.getenv('VIDEO_MIN_PIXELS'):
-        os.environ['VIDEO_MIN_TOKEN_NUM'] = str(int(os.getenv('VIDEO_MIN_PIXELS')) // image_factor**2)
+    spatial_merge_size = int(os.getenv('SPATIAL_MERGE_SIZE', '2'))
+    image_factor = image_patch_size * spatial_merge_size
+    image_factor_sq = image_factor**2
+    env_map = {
+        'MAX_PIXELS': 'IMAGE_MAX_TOKEN_NUM',
+        'MIN_PIXELS': 'IMAGE_MIN_TOKEN_NUM',
+        'VIDEO_MAX_PIXELS': 'VIDEO_MAX_TOKEN_NUM',
+        'VIDEO_MIN_PIXELS': 'VIDEO_MIN_TOKEN_NUM',
+    }
+    for src_key, dest_key in env_map.items():
+        value = os.getenv(src_key)
+        if value:
+            os.environ[dest_key] = str(int(value) // image_factor_sq)
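
To make the conversion concrete (illustrative numbers, not taken from the PR): with image_patch_size=14 and the default SPATIAL_MERGE_SIZE=2, image_factor is 28, so MAX_PIXELS=602112 maps to IMAGE_MAX_TOKEN_NUM = 602112 // 28**2 = 768.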

@gemini-code-assist (Contributor)

Summary of Changes


This pull request introduces crucial compatibility enhancements for Qwen-VL models by intelligently adapting to different versions of the qwen_vl_utils library. It ensures that visual token limits are correctly configured based on the specific qwen_vl_utils version and model type, preventing potential issues. Additionally, it refines an error message in the model registration module to offer clearer instructions to users.

Highlights

  • qwen_vl_utils Version Compatibility: Implemented dynamic version checking for qwen_vl_utils to ensure compatibility with Qwen-VL models, specifically handling versions >=0.0.14 and <0.0.12 (a hedged sketch follows this list).
  • Dynamic Token Limit Adjustment: Introduced a new utility function, compat_qwen_vl_utils, to adjust image and video token limits based on model patch size and environment variables, improving flexibility for different Qwen-VL model configurations.
  • Enhanced Model Registration Error Message: Updated the error message in the model registration process to provide more explicit guidance when automatic model_type detection fails.
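
The exact gating logic is not quoted in this thread, so the following is only a sketch of a version gate over qwen_vl_utils using the 0.0.14 and 0.0.12 boundaries mentioned in the first bullet above. The branch bodies, the patch-size value, and the error wording are assumptions; compat_qwen_vl_utils is the helper defined in this PR (see the review comments below).

from importlib.metadata import version as get_version
from packaging.version import parse

# compat_qwen_vl_utils is assumed to be importable from the PR's qwen module.
qwen_vl_utils_version = parse(get_version('qwen_vl_utils'))
if qwen_vl_utils_version >= parse('0.0.14'):
    # Newer qwen_vl_utils works with token-number limits, so translate the
    # pixel-based environment variables (patch size 16 is an assumed example value).
    compat_qwen_vl_utils(image_patch_size=16)
elif qwen_vl_utils_version < parse('0.0.12'):
    # Placeholder branch for very old versions; the PR's actual handling may differ.
    raise ImportError('qwen_vl_utils is too old for Qwen3-VL; try `pip install -U qwen_vl_utils`.')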


gemini-code-assist bot left a comment


Code Review

This pull request adds compatibility for different versions of qwen_vl_utils by introducing version checks and a new compatibility function. It also includes a minor improvement to an error message and a beneficial refactoring in _all_reduce_metric for better clarity and robustness. The changes are generally good, but I have one suggestion to refactor the new compat_qwen_vl_utils function to reduce code duplication and improve maintainability.

Comment on lines 742 to 752
def compat_qwen_vl_utils(image_patch_size: int):
    spatial_merge_size = int(os.getenv('SPATIAL_MERGE_SIZE', '2'))
    image_factor = image_patch_size * spatial_merge_size
    if os.getenv('MAX_PIXELS'):
        os.environ['IMAGE_MAX_TOKEN_NUM'] = str(int(os.getenv('MAX_PIXELS')) // image_factor**2)
    if os.getenv('MIN_PIXELS'):
        os.environ['IMAGE_MIN_TOKEN_NUM'] = str(int(os.getenv('MIN_PIXELS')) // image_factor**2)
    if os.getenv('VIDEO_MAX_PIXELS'):
        os.environ['VIDEO_MAX_TOKEN_NUM'] = str(int(os.getenv('VIDEO_MAX_PIXELS')) // image_factor**2)
    if os.getenv('VIDEO_MIN_PIXELS'):
        os.environ['VIDEO_MIN_TOKEN_NUM'] = str(int(os.getenv('VIDEO_MIN_PIXELS')) // image_factor**2)


Severity: medium

The function compat_qwen_vl_utils contains several repetitive blocks for setting environment variables. This can be refactored into a loop over a configuration dictionary to reduce code duplication, which improves readability and makes it easier to add or modify environment variable mappings in the future.

Suggested change
-def compat_qwen_vl_utils(image_patch_size: int):
-    spatial_merge_size = int(os.getenv('SPATIAL_MERGE_SIZE', '2'))
-    image_factor = image_patch_size * spatial_merge_size
-    if os.getenv('MAX_PIXELS'):
-        os.environ['IMAGE_MAX_TOKEN_NUM'] = str(int(os.getenv('MAX_PIXELS')) // image_factor**2)
-    if os.getenv('MIN_PIXELS'):
-        os.environ['IMAGE_MIN_TOKEN_NUM'] = str(int(os.getenv('MIN_PIXELS')) // image_factor**2)
-    if os.getenv('VIDEO_MAX_PIXELS'):
-        os.environ['VIDEO_MAX_TOKEN_NUM'] = str(int(os.getenv('VIDEO_MAX_PIXELS')) // image_factor**2)
-    if os.getenv('VIDEO_MIN_PIXELS'):
-        os.environ['VIDEO_MIN_TOKEN_NUM'] = str(int(os.getenv('VIDEO_MIN_PIXELS')) // image_factor**2)
+def compat_qwen_vl_utils(image_patch_size: int):
+    spatial_merge_size = int(os.getenv('SPATIAL_MERGE_SIZE', '2'))
+    image_factor_sq = (image_patch_size * spatial_merge_size)**2
+    env_map = {
+        'MAX_PIXELS': 'IMAGE_MAX_TOKEN_NUM',
+        'MIN_PIXELS': 'IMAGE_MIN_TOKEN_NUM',
+        'VIDEO_MAX_PIXELS': 'VIDEO_MAX_TOKEN_NUM',
+        'VIDEO_MIN_PIXELS': 'VIDEO_MIN_TOKEN_NUM',
+    }
+    for src_env, dest_env in env_map.items():
+        pixel_val_str = os.getenv(src_env)
+        if pixel_val_str:
+            os.environ[dest_env] = str(int(pixel_val_str) // image_factor_sq)


gemini-code-assist bot left a comment


Code Review

This pull request introduces compatibility for newer versions of qwen_vl_utils by adding a compatibility function and updating version checks. The changes also include a more informative error message and a minor code improvement in Megatron. My review includes a suggestion to refactor the new compatibility function for better readability and to avoid redundant environment variable lookups.

Comment on lines 742 to 752
def compat_qwen_vl_utils(image_patch_size: int):
    spatial_merge_size = int(os.getenv('SPATIAL_MERGE_SIZE', '2'))
    image_factor = image_patch_size * spatial_merge_size
    if os.getenv('MAX_PIXELS'):
        os.environ['IMAGE_MAX_TOKEN_NUM'] = str(int(os.getenv('MAX_PIXELS')) // image_factor**2)
    if os.getenv('MIN_PIXELS'):
        os.environ['IMAGE_MIN_TOKEN_NUM'] = str(int(os.getenv('MIN_PIXELS')) // image_factor**2)
    if os.getenv('VIDEO_MAX_PIXELS'):
        os.environ['VIDEO_MAX_TOKEN_NUM'] = str(int(os.getenv('VIDEO_MAX_PIXELS')) // image_factor**2)
    if os.getenv('VIDEO_MIN_PIXELS'):
        os.environ['VIDEO_MIN_TOKEN_NUM'] = str(int(os.getenv('VIDEO_MIN_PIXELS')) // image_factor**2)


Severity: medium

The function compat_qwen_vl_utils repeatedly calls os.getenv for the same environment variables. This can be refactored to be more concise and efficient by retrieving each environment variable's value only once and using a loop to process them.

Suggested change
-def compat_qwen_vl_utils(image_patch_size: int):
-    spatial_merge_size = int(os.getenv('SPATIAL_MERGE_SIZE', '2'))
-    image_factor = image_patch_size * spatial_merge_size
-    if os.getenv('MAX_PIXELS'):
-        os.environ['IMAGE_MAX_TOKEN_NUM'] = str(int(os.getenv('MAX_PIXELS')) // image_factor**2)
-    if os.getenv('MIN_PIXELS'):
-        os.environ['IMAGE_MIN_TOKEN_NUM'] = str(int(os.getenv('MIN_PIXELS')) // image_factor**2)
-    if os.getenv('VIDEO_MAX_PIXELS'):
-        os.environ['VIDEO_MAX_TOKEN_NUM'] = str(int(os.getenv('VIDEO_MAX_PIXELS')) // image_factor**2)
-    if os.getenv('VIDEO_MIN_PIXELS'):
-        os.environ['VIDEO_MIN_TOKEN_NUM'] = str(int(os.getenv('VIDEO_MIN_PIXELS')) // image_factor**2)
+def compat_qwen_vl_utils(image_patch_size: int):
+    spatial_merge_size = int(os.getenv('SPATIAL_MERGE_SIZE', '2'))
+    image_factor = image_patch_size * spatial_merge_size
+    env_vars_to_process = {
+        'MAX_PIXELS': 'IMAGE_MAX_TOKEN_NUM',
+        'MIN_PIXELS': 'IMAGE_MIN_TOKEN_NUM',
+        'VIDEO_MAX_PIXELS': 'VIDEO_MAX_TOKEN_NUM',
+        'VIDEO_MIN_PIXELS': 'VIDEO_MIN_TOKEN_NUM',
+    }
+    for source_var, target_var in env_vars_to_process.items():
+        value = os.getenv(source_var)
+        if value:
+            os.environ[target_var] = str(int(value) // image_factor**2)

Jintao-Huang merged commit f031e4e into modelscope:main on Nov 13, 2025
1 of 2 checks passed
vx120 pushed a commit to vx120/ms-swift that referenced this pull request Nov 19, 2025


Development

Successfully merging this pull request may close these issues.

Support Qwen3-VL and Qwen2.5-VL in the same environment
