
Conversation


@Jintao-Huang (Collaborator) commented on Sep 19, 2025


Summary of Changes

Hello @Jintao-Huang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces full support for the Kimi VL multimodal model, enabling its use and conversion within the Megatron framework. The changes encompass the integration of the model's unique architecture, specialized handling for its vision and language components, and updates to the Megatron configuration and testing suite to accommodate this new model. The primary goal is to expand the range of supported multimodal large language models, facilitating research and deployment of Kimi VL.

Highlights

  • Kimi VL Model Integration: Comprehensive support for the Kimi VL (Vision-Language) model has been added, including dynamic loading, specific patching for input embeddings, and robust state conversion mechanisms between Hugging Face and Megatron Core formats.
  • Multimodal Input Handling: Specialized logic has been implemented within the Kimi VL template and the Megatron visual module to process and merge image features with text embeddings, which is critical for the model's multimodal capabilities; a generic sketch of this merge step follows this list.
  • Megatron Configuration Updates: The Megatron configuration has been updated to correctly recognize and manage the Kimi VL model architecture, incorporating specific settings for Mixture-of-Experts (MoE) routing and layer normalization.
  • Testing and Validation: A dedicated test case for the Kimi VL model has been introduced to ensure its proper functionality and alignment within the Megatron framework, verifying the integration's correctness.
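
As an illustration of the merge step described above, here is a minimal, generic sketch of how image features are typically spliced into text embeddings at placeholder-token positions. The function name and the image_token_id parameter are hypothetical; Kimi VL's actual implementation lives in the model's _merge_with_image_features method.

    import torch

    def merge_with_image_features(inputs_embeds: torch.Tensor, input_ids: torch.Tensor,
                                  image_features: torch.Tensor, image_token_id: int) -> torch.Tensor:
        # Positions in the sequence that hold the image placeholder token.
        mask = input_ids == image_token_id
        inputs_embeds = inputs_embeds.clone()
        # Scatter the vision-tower features into those positions; the number of
        # placeholder tokens must equal the number of image feature vectors.
        inputs_embeds[mask] = image_features.to(inputs_embeds.dtype)
        return inputs_embeds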


@gemini-code-assist (bot) left a comment


Code Review

This pull request adds support for the Kimi-VL model in Megatron. The changes look mostly good, including model registration, configuration conversion, and a new test case. However, I've identified a significant piece of duplicated image-embedding logic shared between the standard Hugging Face path and the Megatron path, and I've left a comment suggesting that it be refactored into a shared utility function to improve maintainability. Other changes, such as the improved precision testing, are positive additions.

Comment on lines 74 to 91
def _post_encode(self, model: nn.Module, inputs: Dict[str, Any]) -> Dict[str, Any]:
    input_ids = inputs['input_ids']
    pixel_values = inputs.get('pixel_values')
    inputs_embeds = model.get_input_embeddings()(input_ids)

    if pixel_values is not None and pixel_values.size(0) > 0:
        # Encode the real images and splice their features into the text embeddings.
        pixel_values = pixel_values.to(model.vision_tower.dtype)
        image_features: torch.Tensor = model._extract_image_features(pixel_values, inputs['image_grid_hws'])
        inputs_embeds = inputs_embeds.to(image_features[0].dtype)
        inputs_embeds = model._merge_with_image_features(inputs_embeds, input_ids, image_features)
    else:
        # Text-only batch: run a dummy image through the vision tower so it still
        # participates in the forward/backward pass (keeps distributed ranks in
        # sync); the zero-weighted term leaves the embeddings numerically unchanged.
        image_processor = self.processor.image_processor
        dummy_image = Image.new('RGB', (32, 32), (0, 0, 0))
        image_inputs = image_processor([dummy_image], return_tensors='pt')
        pixel_values = image_inputs['pixel_values'].to(model.vision_tower.dtype)
        image_features: torch.Tensor = model._extract_image_features(pixel_values, image_inputs['image_grid_hws'])
        inputs_embeds = inputs_embeds + image_features.mean() * 0.
    return {'inputs_embeds': inputs_embeds}
Severity: medium

There is significant code duplication between this _post_encode method and the get_inputs_embeds method in the new file swift/megatron/model/mm_gpt/kimi_vl.py. Both methods implement nearly identical logic for merging image features with text embeddings.

To improve maintainability and reduce redundancy, I suggest refactoring this logic into a shared utility function. This function could be called from both _post_encode and get_inputs_embeds, ensuring that any future changes to this logic only need to be made in one place.
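
A minimal sketch of what such a shared helper might look like, assuming it lives in a new module (the module path and function name are hypothetical, not part of this PR); the body is lifted from the two duplicated methods:

    from typing import Optional

    import torch
    from PIL import Image


    # Hypothetical location, e.g. swift/llm/template/vision_utils.py. Both
    # Template._post_encode and the Megatron get_inputs_embeds could delegate here.
    def merge_kimi_vl_image_features(model, image_processor, input_ids: torch.Tensor,
                                     inputs_embeds: torch.Tensor,
                                     pixel_values: Optional[torch.Tensor],
                                     image_grid_hws: Optional[torch.Tensor]) -> torch.Tensor:
        if pixel_values is not None and pixel_values.size(0) > 0:
            # Real images: encode them and splice the features into the text embeddings.
            pixel_values = pixel_values.to(model.vision_tower.dtype)
            image_features = model._extract_image_features(pixel_values, image_grid_hws)
            inputs_embeds = inputs_embeds.to(image_features[0].dtype).clone()
            return model._merge_with_image_features(inputs_embeds, input_ids, image_features)
        # Text-only batch: the dummy-image trick keeps the vision tower in the
        # computation graph on every rank while changing nothing numerically.
        dummy_image = Image.new('RGB', (32, 32), (0, 0, 0))
        image_inputs = image_processor([dummy_image], return_tensors='pt')
        pixel_values = image_inputs['pixel_values'].to(model.vision_tower.dtype)
        image_features = model._extract_image_features(pixel_values, image_inputs['image_grid_hws'])
        return inputs_embeds + image_features.mean() * 0.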

@Jintao-Huang (Collaborator, Author) commented:

/gemini review


@gemini-code-assist (bot) left a comment


Code Review

This pull request adds support for the Kimi-VL model in Megatron. The changes include registering the new model type, defining its template, implementing Megatron-specific conversion functions and model wrappers, and adding corresponding tests. The overall implementation looks good, but there is an opportunity to refactor some duplicated code to improve maintainability.

Comment on lines +38 to +54
def get_inputs_embeds(self, inputs_embeds, **kwargs):
    model = self._hf_model[0]
    input_ids = kwargs['input_ids']
    pixel_values = kwargs.get('pixel_values')
    if pixel_values is not None and pixel_values.size(0) > 0:
        # Encode the real images and splice their features into the text embeddings.
        pixel_values = pixel_values.to(model.vision_tower.dtype)
        image_features: torch.Tensor = model._extract_image_features(pixel_values, kwargs['image_grid_hws'])
        inputs_embeds = inputs_embeds.to(image_features[0].dtype).clone()
        inputs_embeds = model._merge_with_image_features(inputs_embeds, input_ids, image_features)
    else:
        # Text-only batch: the same dummy-image trick as in _post_encode, so the
        # vision tower still participates in the backward pass with zero effect.
        image_processor = self.processor.image_processor
        dummy_image = Image.new('RGB', (32, 32), (0, 0, 0))
        image_inputs = image_processor([dummy_image], return_tensors='pt')
        pixel_values = image_inputs['pixel_values'].to(model.vision_tower.dtype)
        image_features: torch.Tensor = model._extract_image_features(pixel_values, image_inputs['image_grid_hws'])
        inputs_embeds = inputs_embeds + image_features.mean() * 0.
    return inputs_embeds

Severity: medium

There is significant code duplication between this get_inputs_embeds method and the _post_encode method in swift/llm/template/template/moonshot.py. Both methods implement nearly identical logic for processing image features and handling dummy images for distributed training. To improve maintainability and reduce redundancy, this logic should be refactored into a shared utility function.

The duplicated logic can be found in swift/llm/template/template/moonshot.py on lines 76-93.
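
With the hypothetical merge_kimi_vl_image_features helper sketched in the earlier review comment, this Megatron-side method could collapse into a thin delegation (illustrative only; the helper is not part of this PR):

    def get_inputs_embeds(self, inputs_embeds, **kwargs):
        model = self._hf_model[0]
        return merge_kimi_vl_image_features(
            model, self.processor.image_processor, kwargs['input_ids'],
            inputs_embeds, kwargs.get('pixel_values'), kwargs.get('image_grid_hws'))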

@Jintao-Huang merged commit faf4b54 into modelscope:main on Sep 19, 2025
1 of 2 checks passed