
Conversation

@Jintao-Huang (Collaborator)

No description provided.

@gemini-code-assist (Contributor)

Summary of Changes

Hello @Jintao-Huang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request adds support for mixed image and video inputs to Qwen3-VL training. It dynamically modifies the model's forward pass so that visual embeddings from both modalities are computed together and integrated with the text embeddings, with particular attention to compatibility when DeepSpeed is enabled. This broadens the model's applicability to multimodal scenarios that mix media types.

Highlights

  • Mixed Data Support for Qwen3-VL: This PR introduces compatibility for Qwen3-VL models to process both image and video data simultaneously, enhancing their multimodal capabilities, particularly when DeepSpeed is enabled.
  • Dynamic forward Method Patching: A new function, _compat_qwen3_vl_mixed_data, dynamically patches the Qwen3VL model's forward method. The patched method integrates visual embeddings from images and videos with the text embeddings, ensuring a unified processing flow (a minimal sketch of this patching pattern follows this list).
  • DeepSpeed Integration: The mixed data handling logic is specifically designed to be applied when DeepSpeed is enabled, optimizing its use in distributed training environments for large multimodal models.
  • Unified Visual Input Processing: The patched forward method consolidates the handling of pixel_values (images) and pixel_values_videos into a single, coherent processing flow, including the generation of visual masks and proper position ID calculations.
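
The patching pattern described in the highlights follows the usual wrap-and-rebind approach: keep a reference to the original bound forward, define a replacement that performs the extra visual pre-processing, and bind it back onto the model instance. Below is a minimal, self-contained sketch of that pattern; it is not the PR's actual _compat_qwen3_vl_mixed_data code, and the guard attribute and the pre-processing hook are assumptions made for illustration.

```python
# Minimal sketch of forward-method patching; the guard flag and the placeholder
# pre-processing hook are illustrative only, not the PR's implementation.
from types import MethodType

import torch
import torch.nn as nn


def patch_forward(model: nn.Module) -> None:
    if getattr(model, '_mixed_data_patched', False):
        return  # avoid patching the same instance twice
    origin_forward = model.forward  # keep the original bound method

    def forward(self, *args, **kwargs):
        # A real compat shim would compute image/video embeddings here and merge
        # them into inputs_embeds before delegating to the original forward.
        return origin_forward(*args, **kwargs)

    model.forward = MethodType(forward, model)
    model._mixed_data_patched = True


model = nn.Linear(4, 2)
patch_forward(model)
print(model(torch.randn(1, 4)).shape)  # torch.Size([1, 2])
```

Binding with MethodType keeps the patch local to one model instance, which is one way to scope this kind of compatibility shim to the models that need it (the PR applies its patch when DeepSpeed is enabled).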

@gemini-code-assist bot left a comment

Code Review

This pull request introduces support for mixed image and video data during training for qwen3-vl and qwen3-moe-vl models, specifically for DeepSpeed environments. The approach involves monkey-patching the model's forward pass to handle combined visual inputs, which is a reasonable strategy. However, I've identified a critical bug in the input validation logic that could lead to a runtime error. I've also noted a minor code redundancy that can be cleaned up for better maintainability. My review includes suggestions to address these points.

Comment on lines +887 to +888

```python
if (input_ids is None) ^ (inputs_embeds is not None):
    raise ValueError('You must specify exactly one of input_ids or inputs_embeds')
```
Severity: critical

The input validation `if (input_ids is None) ^ (inputs_embeds is not None):` is incorrect for this multimodal model. It allows `inputs_embeds` to be provided without `input_ids`, which causes the forward pass to fail later when `input_ids` is used to create `image_mask` and `video_mask`. The check should be changed to `if input_ids is None and inputs_embeds is None:`, which aligns with the original transformers implementation and ensures `input_ids` is always available when needed.

Suggested change

```diff
-if (input_ids is None) ^ (inputs_embeds is not None):
-    raise ValueError('You must specify exactly one of input_ids or inputs_embeds')
+if input_ids is None and inputs_embeds is None:
+    raise ValueError('You have to specify either input_ids or inputs_embeds')
```
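
To make the failure mode concrete, here is a small standalone illustration (made-up shapes and a placeholder token id, not code from the PR): the XOR check accepts a call that supplies only inputs_embeds, but the mask construction further down still needs a real input_ids tensor.

```python
import torch

input_ids = None
inputs_embeds = torch.randn(1, 8, 16)  # only embeddings are supplied
image_token_id = 151655                # placeholder id, not taken from the PR

# True ^ True == False, so the patched validation raises nothing here ...
assert not ((input_ids is None) ^ (inputs_embeds is not None))

# ... but building the visual mask later still needs input_ids to be a tensor:
image_mask = (input_ids == image_token_id)  # None == int -> plain Python False
# image_mask.unsqueeze(-1) would now fail, since a bool has no tensor methods.
```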

Comment on lines +930 to +937

```python
image_embeds = image_embeds.to(inputs_embeds.device, inputs_embeds.dtype)
image_mask = image_mask.to(inputs_embeds.device)
inputs_embeds = inputs_embeds.masked_scatter(image_mask, image_embeds)

if video_embeds is not None:
    video_embeds = video_embeds.to(inputs_embeds.device, inputs_embeds.dtype)
    video_mask = video_mask.to(inputs_embeds.device)
    inputs_embeds = inputs_embeds.masked_scatter(video_mask, video_embeds)
```
Severity: medium

The device placement calls `image_mask.to(inputs_embeds.device)` and `video_mask.to(inputs_embeds.device)` are redundant. Both masks are derived from `input_ids`, which should already be on the same device as `inputs_embeds`. Removing these unnecessary device transfers will make the code cleaner.

Suggested change

```diff
 image_embeds = image_embeds.to(inputs_embeds.device, inputs_embeds.dtype)
-image_mask = image_mask.to(inputs_embeds.device)
 inputs_embeds = inputs_embeds.masked_scatter(image_mask, image_embeds)

 if video_embeds is not None:
     video_embeds = video_embeds.to(inputs_embeds.device, inputs_embeds.dtype)
-    video_mask = video_mask.to(inputs_embeds.device)
     inputs_embeds = inputs_embeds.masked_scatter(video_mask, video_embeds)
```
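
For readers less familiar with masked_scatter, the toy example below (made-up shapes and a stand-in token id, not code from the PR) shows how the patched forward drops visual embeddings into the text sequence at the placeholder-token positions:

```python
import torch

hidden = 4
input_ids = torch.tensor([[1, 99, 99, 2, 3]])  # 99 stands in for the image token id
inputs_embeds = torch.zeros(1, 5, hidden)      # 5 text positions, all zeros
image_embeds = torch.arange(2 * hidden, dtype=torch.float32).reshape(2, hidden)

# Expand the boolean token mask to the embedding shape, then scatter the two
# visual embeddings, in order, into the masked positions.
image_mask = (input_ids == 99).unsqueeze(-1).expand_as(inputs_embeds)
inputs_embeds = inputs_embeds.masked_scatter(image_mask, image_embeds)
print(inputs_embeds[0, 1])  # tensor([0., 1., 2., 3.]) -> first visual embedding
print(inputs_embeds[0, 2])  # tensor([4., 5., 6., 7.]) -> second visual embedding
```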

@Jintao-Huang (Collaborator, Author)

/gemini review

@gemini-code-assist bot left a comment

Code Review

This pull request introduces support for mixed image and video data for qwen3-vl models during training with DeepSpeed and Megatron. It patches the model's forward pass to correctly process and combine visual embeddings from both images and videos. While the implementation is comprehensive, there is significant code duplication between the DeepSpeed patch (`swift/llm/model/model/qwen.py`) and the Megatron implementation (`swift/megatron/model/mm_gpt/qwen3_vl.py`) for handling mixed media data. I've suggested refactoring this duplicated logic into a shared utility function to improve maintainability.

Comment on lines +902 to +949

```python
else:
    if pixel_values is None:
        pixel_values_mixed = pixel_values_videos
        grid_thw = video_grid_thw
    elif pixel_values_videos is None:
        pixel_values_mixed = pixel_values
        grid_thw = image_grid_thw
    else:
        pixel_values_mixed = torch.concat([pixel_values, pixel_values_videos], dim=0)
        grid_thw = torch.concat([image_grid_thw, video_grid_thw], dim=0)
    pixel_values_mixed = pixel_values_mixed.type(dtype)
    mixed_embeds, deepstack_visual_embeds = self.visual(pixel_values_mixed, grid_thw=grid_thw)
    if pixel_values is None:
        image_embeds = None
        video_embeds = mixed_embeds
    elif pixel_values_videos is None:
        image_embeds = mixed_embeds
        video_embeds = None
    else:
        merge_length = processor.image_processor.merge_size**2
        image_tokens = (image_grid_thw.prod(dim=-1) // merge_length).sum()
        image_embeds = mixed_embeds[:image_tokens]
        video_embeds = mixed_embeds[image_tokens:]

    image_mask = (input_ids == self.config.image_token_id).unsqueeze(-1).expand_as(inputs_embeds)
    video_mask = (input_ids == self.config.video_token_id).unsqueeze(-1).expand_as(inputs_embeds)
    if image_embeds is not None:
        image_embeds = image_embeds.to(inputs_embeds.device, inputs_embeds.dtype)
        image_mask = image_mask.to(inputs_embeds.device)
        inputs_embeds = inputs_embeds.masked_scatter(image_mask, image_embeds)

    if video_embeds is not None:
        video_embeds = video_embeds.to(inputs_embeds.device, inputs_embeds.dtype)
        video_mask = video_mask.to(inputs_embeds.device)
        inputs_embeds = inputs_embeds.masked_scatter(video_mask, video_embeds)
    image_mask, video_mask = image_mask[..., 0], video_mask[..., 0]
    visual_pos_masks = image_mask | video_mask
    if image_embeds is not None and video_embeds is not None:
        deepstack_image_embeds = [tensor[:image_tokens] for tensor in deepstack_visual_embeds]
        deepstack_video_embeds = [tensor[image_tokens:] for tensor in deepstack_visual_embeds]
        deepstack_visual_embeds = []
        image_mask_joint = image_mask[visual_pos_masks]
        video_mask_joint = video_mask[visual_pos_masks]
        for img_embed, vid_embed in zip(deepstack_image_embeds, deepstack_video_embeds):
            embed_joint = img_embed.new_zeros(visual_pos_masks.sum(), img_embed.shape[-1]).to(img_embed.device)
            embed_joint[image_mask_joint, :] = img_embed
            embed_joint[video_mask_joint, :] = vid_embed
            deepstack_visual_embeds.append(embed_joint)
```
Severity: medium

The logic within this `else` block for handling mixed image and video data (concatenating pixel values, splitting embeddings, and processing deepstack embeddings) appears to be duplicated in `swift/megatron/model/mm_gpt/qwen3_vl.py` within the `_get_inputs_embeds` static method. To improve code maintainability and avoid redundancy, consider refactoring this shared logic into a common utility function. This would make the code cleaner and easier to manage in the future.
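
As a rough illustration of the suggested refactor, the shared encode-and-split step could be pulled into a helper along the following lines. This is only a sketch: the function name, signature, and placement are invented here and are not part of the PR.

```python
from typing import List, Optional, Tuple

import torch


def encode_mixed_visual_inputs(
    visual_encoder,                      # e.g. model.visual; returns (embeds, deepstack_embeds)
    pixel_values: Optional[torch.Tensor],
    pixel_values_videos: Optional[torch.Tensor],
    image_grid_thw: Optional[torch.Tensor],
    video_grid_thw: Optional[torch.Tensor],
    merge_size: int,
    dtype: torch.dtype,
) -> Tuple[Optional[torch.Tensor], Optional[torch.Tensor], List[torch.Tensor]]:
    """Encode images and videos in one visual() call and split the result back."""
    if pixel_values is None:
        mixed, grid_thw = pixel_values_videos, video_grid_thw
    elif pixel_values_videos is None:
        mixed, grid_thw = pixel_values, image_grid_thw
    else:
        mixed = torch.concat([pixel_values, pixel_values_videos], dim=0)
        grid_thw = torch.concat([image_grid_thw, video_grid_thw], dim=0)
    mixed_embeds, deepstack_embeds = visual_encoder(mixed.type(dtype), grid_thw=grid_thw)
    if pixel_values is None:
        return None, mixed_embeds, deepstack_embeds
    if pixel_values_videos is None:
        return mixed_embeds, None, deepstack_embeds
    image_tokens = (image_grid_thw.prod(dim=-1) // merge_size**2).sum()
    return mixed_embeds[:image_tokens], mixed_embeds[image_tokens:], deepstack_embeds
```

Both call sites could then, in principle, reuse this helper and keep only their framework-specific mask and deepstack handling.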

@Jintao-Huang changed the title from "[model] qwen3-vl support mixed data" to "[bugfix] qwen3-vl support mixed data #6161" on Oct 16, 2025.
@Jintao-Huang merged commit c1f2f3d into modelscope:main on Oct 16, 2025 (1 of 2 checks passed).
@Jintao-Huang changed the title from "[bugfix] qwen3-vl support mixed data #6161" to "[bugfix] qwen3-vl support mixed data" on Oct 17, 2025.
