
Conversation

@Jintao-Huang (Collaborator)

No description provided.

@gemini-code-assist (Contributor)

Summary of Changes

Hello @Jintao-Huang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request adds support for mixed image and video inputs to Qwen3-VL training. It dynamically modifies the model's forward pass so that visual embeddings from both modalities are computed together and integrated with the text embeddings, with particular attention to compatibility when DeepSpeed is enabled. This broadens the model's applicability to multimodal scenarios that mix media types.

Highlights

  • Mixed Data Support for Qwen3-VL: This PR introduces compatibility for Qwen3-VL models to process both image and video data simultaneously, enhancing their multimodal capabilities, particularly when DeepSpeed is enabled.
  • Dynamic forward Method Patching: A new function, _compat_qwen3_vl_mixed_data, dynamically patches the Qwen3VL model's forward method. The patched method integrates visual embeddings from images and videos with the text embeddings, ensuring a unified processing flow (a minimal sketch of this patching pattern follows this list).
  • DeepSpeed Integration: The mixed data handling logic is specifically designed to be applied when DeepSpeed is enabled, optimizing its use in distributed training environments for large multimodal models.
  • Unified Visual Input Processing: The patched forward method consolidates the handling of pixel_values (images) and pixel_values_videos into a single, coherent processing flow, including the generation of visual masks and proper position ID calculations.
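
The patching pattern described in the highlights follows the usual wrap-and-rebind approach: keep a reference to the original bound forward, define a replacement that performs the extra visual pre-processing, and bind it back onto the model instance. Below is a minimal, self-contained sketch of that pattern; it is not the PR's actual _compat_qwen3_vl_mixed_data code, and the guard attribute and the pre-processing hook are assumptions made for illustration.

```python
# Minimal sketch of forward-method patching; the guard flag and the placeholder
# pre-processing hook are illustrative only, not the PR's implementation.
from types import MethodType

import torch
import torch.nn as nn


def patch_forward(model: nn.Module) -> None:
    if getattr(model, '_mixed_data_patched', False):
        return  # avoid patching the same instance twice
    origin_forward = model.forward  # keep the original bound method

    def forward(self, *args, **kwargs):
        # A real compat shim would compute image/video embeddings here and merge
        # them into inputs_embeds before delegating to the original forward.
        return origin_forward(*args, **kwargs)

    model.forward = MethodType(forward, model)
    model._mixed_data_patched = True


model = nn.Linear(4, 2)
patch_forward(model)
print(model(torch.randn(1, 4)).shape)  # torch.Size([1, 2])
```

Binding with MethodType keeps the patch local to one model instance, which is one way to scope this kind of compatibility shim to the models that need it (the PR applies its patch when DeepSpeed is enabled).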

@gemini-code-assist bot left a comment

Code Review

This pull request introduces support for mixed image and video data during training for qwen3-vl and qwen3-moe-vl models, specifically for DeepSpeed environments. The approach involves monkey-patching the model's forward pass to handle combined visual inputs, which is a reasonable strategy. However, I've identified a critical bug in the input validation logic that could lead to a runtime error. I've also noted a minor code redundancy that can be cleaned up for better maintainability. My review includes suggestions to address these points.

Comment on lines +887 to +888

```python
if (input_ids is None) ^ (inputs_embeds is not None):
    raise ValueError('You must specify exactly one of input_ids or inputs_embeds')
```
Severity: critical

The input validation `if (input_ids is None) ^ (inputs_embeds is not None):` is incorrect for this multimodal model. It allows `inputs_embeds` to be provided without `input_ids`, which causes the forward pass to fail later when `input_ids` is used to create `image_mask` and `video_mask`. The check should be changed to `if input_ids is None and inputs_embeds is None:`, which aligns with the original transformers implementation and ensures `input_ids` is always available when needed.

Suggested change

```diff
-if (input_ids is None) ^ (inputs_embeds is not None):
-    raise ValueError('You must specify exactly one of input_ids or inputs_embeds')
+if input_ids is None and inputs_embeds is None:
+    raise ValueError('You have to specify either input_ids or inputs_embeds')
```
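
To make the failure mode concrete, here is a small standalone illustration (made-up shapes and a placeholder token id, not code from the PR): the XOR check accepts a call that supplies only inputs_embeds, but the mask construction further down still needs a real input_ids tensor.

```python
import torch

input_ids = None
inputs_embeds = torch.randn(1, 8, 16)  # only embeddings are supplied
image_token_id = 151655                # placeholder id, not taken from the PR

# True ^ True == False, so the patched validation raises nothing here ...
assert not ((input_ids is None) ^ (inputs_embeds is not None))

# ... but building the visual mask later still needs input_ids to be a tensor:
image_mask = (input_ids == image_token_id)  # None == int -> plain Python False
# image_mask.unsqueeze(-1) would now fail, since a bool has no tensor methods.
```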

Comment on lines +930 to +937

```python
image_embeds = image_embeds.to(inputs_embeds.device, inputs_embeds.dtype)
image_mask = image_mask.to(inputs_embeds.device)
inputs_embeds = inputs_embeds.masked_scatter(image_mask, image_embeds)

if video_embeds is not None:
    video_embeds = video_embeds.to(inputs_embeds.device, inputs_embeds.dtype)
    video_mask = video_mask.to(inputs_embeds.device)
    inputs_embeds = inputs_embeds.masked_scatter(video_mask, video_embeds)
```
Severity: medium

The device placement calls `image_mask.to(inputs_embeds.device)` and `video_mask.to(inputs_embeds.device)` are redundant. Both masks are derived from `input_ids`, which should already be on the same device as `inputs_embeds`. Removing these unnecessary device transfers will make the code cleaner.

Suggested change

```diff
 image_embeds = image_embeds.to(inputs_embeds.device, inputs_embeds.dtype)
-image_mask = image_mask.to(inputs_embeds.device)
 inputs_embeds = inputs_embeds.masked_scatter(image_mask, image_embeds)

 if video_embeds is not None:
     video_embeds = video_embeds.to(inputs_embeds.device, inputs_embeds.dtype)
-    video_mask = video_mask.to(inputs_embeds.device)
     inputs_embeds = inputs_embeds.masked_scatter(video_mask, video_embeds)
```
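
For readers less familiar with masked_scatter, the toy example below (made-up shapes and a stand-in token id, not code from the PR) shows how the patched forward drops visual embeddings into the text sequence at the placeholder-token positions:

```python
import torch

hidden = 4
input_ids = torch.tensor([[1, 99, 99, 2, 3]])  # 99 stands in for the image token id
inputs_embeds = torch.zeros(1, 5, hidden)      # 5 text positions, all zeros
image_embeds = torch.arange(2 * hidden, dtype=torch.float32).reshape(2, hidden)

# Expand the boolean token mask to the embedding shape, then scatter the two
# visual embeddings, in order, into the masked positions.
image_mask = (input_ids == 99).unsqueeze(-1).expand_as(inputs_embeds)
inputs_embeds = inputs_embeds.masked_scatter(image_mask, image_embeds)
print(inputs_embeds[0, 1])  # tensor([0., 1., 2., 3.]) -> first visual embedding
print(inputs_embeds[0, 2])  # tensor([4., 5., 6., 7.]) -> second visual embedding
```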

@Jintao-Huang (Collaborator, Author)

/gemini review

@gemini-code-assist bot left a comment

Code Review

This pull request introduces support for mixed image and video data for qwen3-vl models during training with DeepSpeed and Megatron. It patches the model's forward pass to correctly process and combine visual embeddings from both images and videos. While the implementation is comprehensive, there is significant code duplication between the DeepSpeed patch (`swift/llm/model/model/qwen.py`) and the Megatron implementation (`swift/megatron/model/mm_gpt/qwen3_vl.py`) for handling mixed media data. I've suggested refactoring this duplicated logic into a shared utility function to improve maintainability.

Comment on lines +902 to +949

```python
else:
    if pixel_values is None:
        pixel_values_mixed = pixel_values_videos
        grid_thw = video_grid_thw
    elif pixel_values_videos is None:
        pixel_values_mixed = pixel_values
        grid_thw = image_grid_thw
    else:
        pixel_values_mixed = torch.concat([pixel_values, pixel_values_videos], dim=0)
        grid_thw = torch.concat([image_grid_thw, video_grid_thw], dim=0)
    pixel_values_mixed = pixel_values_mixed.type(dtype)
    mixed_embeds, deepstack_visual_embeds = self.visual(pixel_values_mixed, grid_thw=grid_thw)
    if pixel_values is None:
        image_embeds = None
        video_embeds = mixed_embeds
    elif pixel_values_videos is None:
        image_embeds = mixed_embeds
        video_embeds = None
    else:
        merge_length = processor.image_processor.merge_size**2
        image_tokens = (image_grid_thw.prod(dim=-1) // merge_length).sum()
        image_embeds = mixed_embeds[:image_tokens]
        video_embeds = mixed_embeds[image_tokens:]

    image_mask = (input_ids == self.config.image_token_id).unsqueeze(-1).expand_as(inputs_embeds)
    video_mask = (input_ids == self.config.video_token_id).unsqueeze(-1).expand_as(inputs_embeds)
    if image_embeds is not None:
        image_embeds = image_embeds.to(inputs_embeds.device, inputs_embeds.dtype)
        image_mask = image_mask.to(inputs_embeds.device)
        inputs_embeds = inputs_embeds.masked_scatter(image_mask, image_embeds)

    if video_embeds is not None:
        video_embeds = video_embeds.to(inputs_embeds.device, inputs_embeds.dtype)
        video_mask = video_mask.to(inputs_embeds.device)
        inputs_embeds = inputs_embeds.masked_scatter(video_mask, video_embeds)
    image_mask, video_mask = image_mask[..., 0], video_mask[..., 0]
    visual_pos_masks = image_mask | video_mask
    if image_embeds is not None and video_embeds is not None:
        deepstack_image_embeds = [tensor[:image_tokens] for tensor in deepstack_visual_embeds]
        deepstack_video_embeds = [tensor[image_tokens:] for tensor in deepstack_visual_embeds]
        deepstack_visual_embeds = []
        image_mask_joint = image_mask[visual_pos_masks]
        video_mask_joint = video_mask[visual_pos_masks]
        for img_embed, vid_embed in zip(deepstack_image_embeds, deepstack_video_embeds):
            embed_joint = img_embed.new_zeros(visual_pos_masks.sum(), img_embed.shape[-1]).to(img_embed.device)
            embed_joint[image_mask_joint, :] = img_embed
            embed_joint[video_mask_joint, :] = vid_embed
            deepstack_visual_embeds.append(embed_joint)
```
Severity: medium

The logic within this `else` block for handling mixed image and video data (concatenating pixel values, splitting embeddings, and processing deepstack embeddings) appears to be duplicated in `swift/megatron/model/mm_gpt/qwen3_vl.py` within the `_get_inputs_embeds` static method. To improve code maintainability and avoid redundancy, consider refactoring this shared logic into a common utility function. This would make the code cleaner and easier to manage in the future.
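
As a rough illustration of the suggested refactor, the shared encode-and-split step could be pulled into a helper along the following lines. This is only a sketch: the function name, signature, and placement are invented here and are not part of the PR.

```python
from typing import List, Optional, Tuple

import torch


def encode_mixed_visual_inputs(
    visual_encoder,                      # e.g. model.visual; returns (embeds, deepstack_embeds)
    pixel_values: Optional[torch.Tensor],
    pixel_values_videos: Optional[torch.Tensor],
    image_grid_thw: Optional[torch.Tensor],
    video_grid_thw: Optional[torch.Tensor],
    merge_size: int,
    dtype: torch.dtype,
) -> Tuple[Optional[torch.Tensor], Optional[torch.Tensor], List[torch.Tensor]]:
    """Encode images and videos in one visual() call and split the result back."""
    if pixel_values is None:
        mixed, grid_thw = pixel_values_videos, video_grid_thw
    elif pixel_values_videos is None:
        mixed, grid_thw = pixel_values, image_grid_thw
    else:
        mixed = torch.concat([pixel_values, pixel_values_videos], dim=0)
        grid_thw = torch.concat([image_grid_thw, video_grid_thw], dim=0)
    mixed_embeds, deepstack_embeds = visual_encoder(mixed.type(dtype), grid_thw=grid_thw)
    if pixel_values is None:
        return None, mixed_embeds, deepstack_embeds
    if pixel_values_videos is None:
        return mixed_embeds, None, deepstack_embeds
    image_tokens = (image_grid_thw.prod(dim=-1) // merge_size**2).sum()
    return mixed_embeds[:image_tokens], mixed_embeds[image_tokens:], deepstack_embeds
```

Both call sites could then, in principle, reuse this helper and keep only their framework-specific mask and deepstack handling.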

@Jintao-Huang changed the title from "[model] qwen3-vl support mixed data" to "[bugfix] qwen3-vl support mixed data #6161" on Oct 16, 2025.
@Jintao-Huang merged commit c1f2f3d into modelscope:main on Oct 16, 2025 (1 of 2 checks passed).
@Jintao-Huang changed the title from "[bugfix] qwen3-vl support mixed data #6161" to "[bugfix] qwen3-vl support mixed data" on Oct 17, 2025.
