[bugfix] qwen3-vl support mixed data #6161
Conversation
Summary of Changes

Hello @Jintao-Huang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the Qwen3-VL model by adding support for mixed data inputs, specifically images and videos, during training. It achieves this by dynamically modifying the model's forward pass to correctly embed and integrate visual information alongside textual data, with a particular focus on compatibility and optimization when DeepSpeed is used. This change broadens the model's applicability in complex multimodal scenarios.
Code Review
This pull request introduces support for mixed image and video data during training for qwen3-vl and qwen3-moe-vl models, specifically for DeepSpeed environments. The approach involves monkey-patching the model's forward pass to handle combined visual inputs, which is a reasonable strategy. However, I've identified a critical bug in the input validation logic that could lead to a runtime error. I've also noted a minor code redundancy that can be cleaned up for better maintainability. My review includes suggestions to address these points.
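For context, the general shape of such a forward-pass patch looks roughly like the following. This is a minimal sketch, not the PR's code: `apply_patch` is a hypothetical helper, and the single-argument `self.visual(...)` call is a simplification (the real Qwen3-VL visual tower also takes `grid_thw`).

```python
from types import MethodType

import torch


def patched_forward(self, input_ids=None, inputs_embeds=None, pixel_values=None, **kwargs):
    # Sketch: build text embeddings, scatter visual embeddings into the
    # image-token positions, then delegate to the original forward.
    if inputs_embeds is None:
        inputs_embeds = self.get_input_embeddings()(input_ids)
    if pixel_values is not None:
        image_embeds = self.visual(pixel_values)  # simplified; real signature differs
        image_mask = (input_ids == self.config.image_token_id).unsqueeze(-1)
        inputs_embeds = inputs_embeds.masked_scatter(
            image_mask.expand_as(inputs_embeds), image_embeds.to(inputs_embeds.dtype))
    return self._original_forward(inputs_embeds=inputs_embeds, **kwargs)


def apply_patch(model):
    # Hypothetical helper: swap in the patched forward on a model instance.
    model._original_forward = model.forward
    model.forward = MethodType(patched_forward, model)
```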
```python
if (input_ids is None) ^ (inputs_embeds is not None):
    raise ValueError('You must specify exactly one of input_ids or inputs_embeds')
```
The input validation `if (input_ids is None) ^ (inputs_embeds is not None):` is incorrect for this multimodal model. It allows `inputs_embeds` to be provided without `input_ids`, which will cause a TypeError later when `input_ids` is used to create `image_mask` and `video_mask`. This check should be changed to `if input_ids is None and inputs_embeds is None:`, which aligns with the original transformers implementation and ensures `input_ids` is always available when needed.
```diff
- if (input_ids is None) ^ (inputs_embeds is not None):
-     raise ValueError('You must specify exactly one of input_ids or inputs_embeds')
+ if input_ids is None and inputs_embeds is None:
+     raise ValueError('You have to specify either input_ids or inputs_embeds')
```
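To see the failure mode concretely, here is a standalone illustration (the token id `42` is arbitrary and stands in for `config.image_token_id`):

```python
import torch

input_ids = None
inputs_embeds = torch.randn(1, 8, 16)

# XOR form: (input_ids is None) is True and (inputs_embeds is not None) is True,
# so True ^ True == False and validation passes...
if (input_ids is None) ^ (inputs_embeds is not None):
    raise ValueError('You must specify exactly one of input_ids or inputs_embeds')

# ...but the downstream mask construction still dereferences input_ids:
image_mask = (input_ids == 42).unsqueeze(-1)  # fails: (None == 42) is a plain bool
```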
```python
image_embeds = image_embeds.to(inputs_embeds.device, inputs_embeds.dtype)
image_mask = image_mask.to(inputs_embeds.device)
inputs_embeds = inputs_embeds.masked_scatter(image_mask, image_embeds)

if video_embeds is not None:
    video_embeds = video_embeds.to(inputs_embeds.device, inputs_embeds.dtype)
    video_mask = video_mask.to(inputs_embeds.device)
    inputs_embeds = inputs_embeds.masked_scatter(video_mask, video_embeds)
```
The device placement calls `image_mask.to(inputs_embeds.device)` and `video_mask.to(inputs_embeds.device)` are redundant. Both masks are derived from `input_ids`, which should already be on the same device as `inputs_embeds`. Removing these unnecessary device transfers will make the code cleaner.
```diff
  image_embeds = image_embeds.to(inputs_embeds.device, inputs_embeds.dtype)
- image_mask = image_mask.to(inputs_embeds.device)
  inputs_embeds = inputs_embeds.masked_scatter(image_mask, image_embeds)
  if video_embeds is not None:
      video_embeds = video_embeds.to(inputs_embeds.device, inputs_embeds.dtype)
-     video_mask = video_mask.to(inputs_embeds.device)
      inputs_embeds = inputs_embeds.masked_scatter(video_mask, video_embeds)
```
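The claim is easy to verify in isolation. Assuming `input_ids` and `inputs_embeds` already share a device, as they do after the usual embedding lookup, the mask inherits that device automatically:

```python
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
input_ids = torch.tensor([[1, 42, 3]], device=device)
inputs_embeds = torch.randn(1, 3, 16, device=device)

# Comparison ops return a tensor on the same device as their operand,
# so no explicit .to(inputs_embeds.device) is needed for the mask.
image_mask = (input_ids == 42).unsqueeze(-1).expand_as(inputs_embeds)
assert image_mask.device == inputs_embeds.device
```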
/gemini review
Code Review
This pull request introduces support for mixed image and video data for qwen3-vl models during training with deepspeed and megatron. It patches the model's forward pass to correctly process and combine visual embeddings from both images and videos. While the implementation is comprehensive, there is significant code duplication between the deepspeed patch (`swift/llm/model/model/qwen.py`) and the megatron implementation (`swift/megatron/model/mm_gpt/qwen3_vl.py`) for handling mixed media data. I've suggested refactoring this duplicated logic into a shared utility function to improve maintainability.
```python
else:
    if pixel_values is None:
        pixel_values_mixed = pixel_values_videos
        grid_thw = video_grid_thw
    elif pixel_values_videos is None:
        pixel_values_mixed = pixel_values
        grid_thw = image_grid_thw
    else:
        pixel_values_mixed = torch.concat([pixel_values, pixel_values_videos], dim=0)
        grid_thw = torch.concat([image_grid_thw, video_grid_thw], dim=0)
    pixel_values_mixed = pixel_values_mixed.type(dtype)
    mixed_embeds, deepstack_visual_embeds = self.visual(pixel_values_mixed, grid_thw=grid_thw)
    if pixel_values is None:
        image_embeds = None
        video_embeds = mixed_embeds
    elif pixel_values_videos is None:
        image_embeds = mixed_embeds
        video_embeds = None
    else:
        merge_length = processor.image_processor.merge_size**2
        image_tokens = (image_grid_thw.prod(dim=-1) // merge_length).sum()
        image_embeds = mixed_embeds[:image_tokens]
        video_embeds = mixed_embeds[image_tokens:]

image_mask = (input_ids == self.config.image_token_id).unsqueeze(-1).expand_as(inputs_embeds)
video_mask = (input_ids == self.config.video_token_id).unsqueeze(-1).expand_as(inputs_embeds)
if image_embeds is not None:
    image_embeds = image_embeds.to(inputs_embeds.device, inputs_embeds.dtype)
    image_mask = image_mask.to(inputs_embeds.device)
    inputs_embeds = inputs_embeds.masked_scatter(image_mask, image_embeds)

if video_embeds is not None:
    video_embeds = video_embeds.to(inputs_embeds.device, inputs_embeds.dtype)
    video_mask = video_mask.to(inputs_embeds.device)
    inputs_embeds = inputs_embeds.masked_scatter(video_mask, video_embeds)
image_mask, video_mask = image_mask[..., 0], video_mask[..., 0]
visual_pos_masks = image_mask | video_mask
if image_embeds is not None and video_embeds is not None:
    deepstack_image_embeds = [tensor[:image_tokens] for tensor in deepstack_visual_embeds]
    deepstack_video_embeds = [tensor[image_tokens:] for tensor in deepstack_visual_embeds]
    deepstack_visual_embeds = []
    image_mask_joint = image_mask[visual_pos_masks]
    video_mask_joint = video_mask[visual_pos_masks]
    for img_embed, vid_embed in zip(deepstack_image_embeds, deepstack_video_embeds):
        embed_joint = img_embed.new_zeros(visual_pos_masks.sum(), img_embed.shape[-1]).to(img_embed.device)
        embed_joint[image_mask_joint, :] = img_embed
        embed_joint[video_mask_joint, :] = vid_embed
        deepstack_visual_embeds.append(embed_joint)
```
The logic within this `else` block for handling mixed image and video data (concatenating pixel values, splitting embeddings, and processing deepstack embeddings) appears to be duplicated in `swift/megatron/model/mm_gpt/qwen3_vl.py` within the `_get_inputs_embeds` static method. To improve code maintainability and avoid redundancy, consider refactoring this shared logic into a common utility function, for example along the lines of the sketch below.
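One possible shape for such a helper, as a sketch only (the function name, signature, and return convention here are suggestions, not the PR's code):

```python
import torch


def merge_mixed_visual_inputs(pixel_values, pixel_values_videos,
                              image_grid_thw, video_grid_thw,
                              visual, dtype, merge_size):
    """Run the visual tower once over concatenated image/video patches and
    split the result back into image and video embeddings.

    Mirrors the logic duplicated between the deepspeed patch and the
    megatron implementation; intended to live in a shared module.
    Returns (image_embeds, video_embeds, deepstack_visual_embeds, image_tokens).
    """
    if pixel_values is None:
        mixed, grid_thw = pixel_values_videos, video_grid_thw
    elif pixel_values_videos is None:
        mixed, grid_thw = pixel_values, image_grid_thw
    else:
        mixed = torch.concat([pixel_values, pixel_values_videos], dim=0)
        grid_thw = torch.concat([image_grid_thw, video_grid_thw], dim=0)
    mixed_embeds, deepstack_visual_embeds = visual(mixed.type(dtype), grid_thw=grid_thw)

    if pixel_values is None:
        return None, mixed_embeds, deepstack_visual_embeds, None
    if pixel_values_videos is None:
        return mixed_embeds, None, deepstack_visual_embeds, None
    # Mixed case: the first image_tokens rows belong to images, the rest to videos.
    image_tokens = (image_grid_thw.prod(dim=-1) // merge_size**2).sum()
    return (mixed_embeds[:image_tokens], mixed_embeds[image_tokens:],
            deepstack_visual_embeds, image_tokens)
```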
No description provided.