[template] Support qwen3 omni mixed data #6196
Conversation
Summary of Changes

Hello @Jintao-Huang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces enhanced support for Qwen3 Omni models, specifically enabling them to process mixed data inputs including images, videos, and audio. The core changes refactor the visual data handling into a shared utility function and integrate it, along with the audio processing logic, into the forward method of Qwen3OmniMoeThinkerCausalLM, ensuring robust multimodal capabilities for these models. Additionally, a new test case has been added to validate the changes.
Code Review
This pull request introduces support for mixed image and video data in Qwen3 Omni models. This is achieved by refactoring the visual data processing logic from the existing Qwen3 VL forward pass into a new shared function, _forward_qwen3_vl_or_qwen3_omni. A new compatibility patch, _compat_qwen3_omni_mixed_data, is added to handle the forward pass for Qwen3 Omni models, utilizing this new shared function. This is a good refactoring that promotes code reuse.
However, I've found a critical typo in the new _compat_qwen3_omni_mixed_data function that would cause visual embeddings to be discarded, breaking multimodal functionality. Please see the specific comment for details.
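For readers unfamiliar with how these Qwen-style multimodal forwards combine modalities, here is a minimal, self-contained sketch of the embedding-merge step a shared function like `_forward_qwen3_vl_or_qwen3_omni` would perform. The placeholder token id, shapes, and values below are invented for illustration, and the merge pattern is an assumption based on the usual Qwen-VL approach; the actual implementation lives in `swift/llm/model/model/qwen.py`.

```python
import torch

# Hypothetical illustration: scatter per-image visual embeddings into the text
# embedding sequence at the image-placeholder positions.
hidden = 8
input_ids = torch.tensor([[1, 2, 99, 99, 3]])   # 99 = made-up image placeholder id
inputs_embeds = torch.randn(1, 5, hidden)       # text embeddings from the embedding table
image_embeds = torch.randn(2, hidden)           # one row per placeholder token

image_mask = (input_ids == 99).unsqueeze(-1)    # (1, 5, 1), broadcast over the hidden dim
# masked_scatter consumes image_embeds row by row, in placeholder order
inputs_embeds = inputs_embeds.masked_scatter(image_mask, image_embeds)
```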
swift/llm/model/model/qwen.py
Outdated
```python
nputs_embeds, visual_pos_masks, visual_embeds_multiscale = _forward_qwen3_vl_or_qwen3_omni(
    self, processor, input_ids, inputs_embeds, pixel_values, pixel_values_videos, image_grid_thw,
    video_grid_thw)
```
There is a typo in the variable name. nputs_embeds should be inputs_embeds. Due to this typo, the visual embeddings computed by _forward_qwen3_vl_or_qwen3_omni are discarded. The subsequent audio embedding logic is then applied to the original text embeddings, leading to incorrect multimodal inputs where visual information is lost.
Suggested change:

```diff
-nputs_embeds, visual_pos_masks, visual_embeds_multiscale = _forward_qwen3_vl_or_qwen3_omni(
+inputs_embeds, visual_pos_masks, visual_embeds_multiscale = _forward_qwen3_vl_or_qwen3_omni(
     self, processor, input_ids, inputs_embeds, pixel_values, pixel_values_videos, image_grid_thw,
     video_grid_thw)
```
…mixed_data' into support_qwen3_omni_mixed_data
/gemini review
Code Review
This pull request introduces support for mixed image and video data in qwen3-omni models by refactoring and reusing logic from qwen3-vl. The changes are well-structured, extracting shared functionality into helper functions, which significantly improves code organization and maintainability. A new test for qwen3-guard is also included. My review includes a few suggestions to enhance code clarity and remove minor redundancies.
```python
image_mask_joint = image_mask[visual_pos_masks]
video_mask_joint = video_mask[visual_pos_masks]
for img_embed, vid_embed in zip(deepstack_image_embeds, deepstack_video_embeds):
    embed_joint = img_embed.new_zeros(visual_pos_masks.sum(), img_embed.shape[-1]).to(img_embed.device)
```
The .to(img_embed.device) call here is redundant. According to PyTorch documentation, torch.Tensor.new_zeros creates a tensor on the same device as the original tensor (img_embed in this case). Removing this will make the code slightly cleaner.
Suggested change:

```diff
-embed_joint = img_embed.new_zeros(visual_pos_masks.sum(), img_embed.shape[-1]).to(img_embed.device)
+embed_joint = img_embed.new_zeros(visual_pos_masks.sum(), img_embed.shape[-1])
```
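As a quick sanity check of the documented behavior this relies on, here is a minimal runnable snippet (the tensor shape and dtype are arbitrary stand-ins):

```python
import torch

img_embed = torch.randn(4, 16, dtype=torch.float16)   # stand-in for a real embed tensor
z = img_embed.new_zeros(3, img_embed.shape[-1])
# new_zeros inherits both dtype and device from img_embed,
# so a trailing .to(img_embed.device) would be a no-op
assert z.dtype == img_embed.dtype and z.device == img_embed.device
```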
```python
        video_grid_thw (`torch.LongTensor` of shape `(num_videos, 3)`, *optional*):
            The temporal, height and width of feature shape of each video in LLM.
        """
    ) -> Union[tuple, output_cls]:
```
The docstring for this forward function was removed. This function has a complex signature with several optional tensor arguments like image_grid_thw and video_grid_thw. Re-adding the docstring explaining these parameters would improve code readability and maintainability.
For example:

```python
) -> Union[tuple, output_cls]:
    r"""
    image_grid_thw (`torch.LongTensor` of shape `(num_images, 3)`, *optional*):
        The temporal, height and width of feature shape of each image in LLM.
    video_grid_thw (`torch.LongTensor` of shape `(num_videos, 3)`, *optional*):
        The temporal, height and width of feature shape of each video in LLM.
    """
```

```python
if (cache_position is None or (cache_position is not None and cache_position[0] == 0)
        or self.rope_deltas is None):
```
The condition in this if statement is a bit complex and can be simplified for better readability. The sub-expression (cache_position is None or (cache_position is not None and cache_position[0] == 0)) is logically equivalent to cache_position is None or cache_position[0] == 0. Simplifying this would make the overall condition easier to understand.
```python
if cache_position is None or cache_position[0] == 0 or self.rope_deltas is None:
```
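The equivalence can be verified exhaustively; a small standalone check, where `a` and `b` abbreviate the two sub-conditions:

```python
from itertools import product

# a = "cache_position is None", b = "cache_position[0] == 0"
for a, b in product([False, True], repeat=2):
    assert (a or (not a and b)) == (a or b)
```

Note that short-circuit evaluation keeps the simplified form safe: `cache_position[0]` is only evaluated when `cache_position` is not None.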
/gemini review
Code Review
This pull request introduces support for Qwen3 Omni models with mixed data types, including a new example script and necessary patches for DeepSpeed compatibility. The refactoring of media processing logic into a shared function is a good improvement for maintainability. A notable change is the bug fix in the multimodal optimizer, which now correctly handles parameter grouping when weight decay is zero. The overall changes are well-implemented and enhance the model's training capabilities.
```python
for use_wd, wd in zip([False, True], [0., args.weight_decay]):
    if use_wd:
        params = [p for n, p in parameters if n in decay_parameters]
```
This change fixes a subtle but important bug in how optimizer parameter groups are created. The previous logic would fail to correctly assign parameters to the weight decay group if args.weight_decay was 0.0. Specifically, it would add the non-decay parameters twice and never add the decay parameters. The new implementation using zip with a boolean flag use_wd correctly and more explicitly separates the parameters, ensuring the optimizer is configured correctly regardless of the weight_decay value.
Suggested change:

```diff
 for use_wd, wd in zip([False, True], [0., args.weight_decay]):
     if use_wd:
         params = [p for n, p in parameters if n in decay_parameters]
+    else:
+        params = [p for n, p in parameters if n not in decay_parameters]
```
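To make the fixed grouping concrete, here is a self-contained sketch mirroring the suggested code; the parameter names, the decay set, and the weight-decay value are invented for illustration (in the real trainer, `decay_parameters` would come from a helper such as transformers' `get_parameter_names`):

```python
import torch

parameters = [(n, torch.nn.Parameter(torch.randn(2))) for n in ['w1', 'b1', 'w2']]
decay_parameters = {'w1', 'w2'}   # weights decay; biases/norm params do not
weight_decay = 0.0                # the value that tripped up the old logic

optimizer_grouped_parameters = []
for use_wd, wd in zip([False, True], [0., weight_decay]):
    # Membership is decided by the boolean flag, not by the (possibly zero)
    # wd value, so every parameter lands in exactly one group.
    if use_wd:
        params = [p for n, p in parameters if n in decay_parameters]
    else:
        params = [p for n, p in parameters if n not in decay_parameters]
    optimizer_grouped_parameters.append({'params': params, 'weight_decay': wd})

# Every parameter is assigned exactly once across the two groups.
assert sum(len(g['params']) for g in optimizer_grouped_parameters) == len(parameters)
```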
No description provided.