
Conversation


@Jintao-Huang (Collaborator) commented on Sep 19, 2025


Summary of Changes

Hello @Jintao-Huang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces full support for the Kimi VL multimodal model, enabling its use and conversion within the Megatron framework. The changes encompass the integration of the model's unique architecture, specialized handling for its vision and language components, and updates to the Megatron configuration and testing suite to accommodate this new model. The primary goal is to expand the range of supported multimodal large language models, facilitating research and deployment of Kimi VL.

Highlights

  • Kimi VL Model Integration: Comprehensive support for the Kimi VL (Vision-Language) model has been added, including dynamic loading, specific patching for input embeddings, and robust state conversion mechanisms between Hugging Face and Megatron Core formats.
  • Multimodal Input Handling: Specialized logic has been implemented within the Kimi VL template and the Megatron visual module to process and merge image features with text embeddings, which is critical for the model's multimodal capabilities; a generic sketch of this merge step follows this list.
  • Megatron Configuration Updates: The Megatron configuration has been updated to correctly recognize and manage the Kimi VL model architecture, incorporating specific settings for Mixture-of-Experts (MoE) routing and layer normalization.
  • Testing and Validation: A dedicated test case for the Kimi VL model has been introduced to ensure its proper functionality and alignment within the Megatron framework, verifying the integration's correctness.
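
As an illustration of the merge step described above, here is a minimal, generic sketch of how image features are typically spliced into text embeddings at placeholder-token positions. The function name and the image_token_id parameter are hypothetical; Kimi VL's actual implementation lives in the model's _merge_with_image_features method.

    import torch

    def merge_with_image_features(inputs_embeds: torch.Tensor, input_ids: torch.Tensor,
                                  image_features: torch.Tensor, image_token_id: int) -> torch.Tensor:
        # Positions in the sequence that hold the image placeholder token.
        mask = input_ids == image_token_id
        inputs_embeds = inputs_embeds.clone()
        # Scatter the vision-tower features into those positions; the number of
        # placeholder tokens must equal the number of image feature vectors.
        inputs_embeds[mask] = image_features.to(inputs_embeds.dtype)
        return inputs_embeds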


@gemini-code-assist (bot) left a comment


Code Review

This pull request adds support for the Kimi-VL model in Megatron. The changes look mostly good, including model registration, configuration conversion, and a new test case. However, I've identified a significant piece of duplicated image-embedding logic shared between the standard Hugging Face path and the Megatron path, and I've left a comment suggesting that it be refactored into a shared utility function to improve maintainability. Other changes, such as the improved precision testing, are positive additions.

Comment on lines 74 to 91
def _post_encode(self, model: nn.Module, inputs: Dict[str, Any]) -> Dict[str, Any]:
    input_ids = inputs['input_ids']
    pixel_values = inputs.get('pixel_values')
    inputs_embeds = model.get_input_embeddings()(input_ids)

    if pixel_values is not None and pixel_values.size(0) > 0:
        # Encode the real images and splice their features into the text embeddings.
        pixel_values = pixel_values.to(model.vision_tower.dtype)
        image_features: torch.Tensor = model._extract_image_features(pixel_values, inputs['image_grid_hws'])
        inputs_embeds = inputs_embeds.to(image_features[0].dtype)
        inputs_embeds = model._merge_with_image_features(inputs_embeds, input_ids, image_features)
    else:
        # Text-only batch: run a dummy image through the vision tower so it still
        # participates in the forward/backward pass (keeps distributed ranks in
        # sync); the zero-weighted term leaves the embeddings numerically unchanged.
        image_processor = self.processor.image_processor
        dummy_image = Image.new('RGB', (32, 32), (0, 0, 0))
        image_inputs = image_processor([dummy_image], return_tensors='pt')
        pixel_values = image_inputs['pixel_values'].to(model.vision_tower.dtype)
        image_features: torch.Tensor = model._extract_image_features(pixel_values, image_inputs['image_grid_hws'])
        inputs_embeds = inputs_embeds + image_features.mean() * 0.
    return {'inputs_embeds': inputs_embeds}
Severity: medium

There is significant code duplication between this _post_encode method and the get_inputs_embeds method in the new file swift/megatron/model/mm_gpt/kimi_vl.py. Both methods implement nearly identical logic for merging image features with text embeddings.

To improve maintainability and reduce redundancy, I suggest refactoring this logic into a shared utility function. This function could be called from both _post_encode and get_inputs_embeds, ensuring that any future changes to this logic only need to be made in one place.
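
A minimal sketch of what such a shared helper might look like, assuming it lives in a new module (the module path and function name are hypothetical, not part of this PR); the body is lifted from the two duplicated methods:

    from typing import Optional

    import torch
    from PIL import Image


    # Hypothetical location, e.g. swift/llm/template/vision_utils.py. Both
    # Template._post_encode and the Megatron get_inputs_embeds could delegate here.
    def merge_kimi_vl_image_features(model, image_processor, input_ids: torch.Tensor,
                                     inputs_embeds: torch.Tensor,
                                     pixel_values: Optional[torch.Tensor],
                                     image_grid_hws: Optional[torch.Tensor]) -> torch.Tensor:
        if pixel_values is not None and pixel_values.size(0) > 0:
            # Real images: encode them and splice the features into the text embeddings.
            pixel_values = pixel_values.to(model.vision_tower.dtype)
            image_features = model._extract_image_features(pixel_values, image_grid_hws)
            inputs_embeds = inputs_embeds.to(image_features[0].dtype).clone()
            return model._merge_with_image_features(inputs_embeds, input_ids, image_features)
        # Text-only batch: the dummy-image trick keeps the vision tower in the
        # computation graph on every rank while changing nothing numerically.
        dummy_image = Image.new('RGB', (32, 32), (0, 0, 0))
        image_inputs = image_processor([dummy_image], return_tensors='pt')
        pixel_values = image_inputs['pixel_values'].to(model.vision_tower.dtype)
        image_features = model._extract_image_features(pixel_values, image_inputs['image_grid_hws'])
        return inputs_embeds + image_features.mean() * 0.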

@Jintao-Huang (Collaborator, Author) commented:

/gemini review


@gemini-code-assist (bot) left a comment


Code Review

This pull request adds support for the Kimi-VL model in Megatron. The changes include registering the new model type, defining its template, implementing Megatron-specific conversion functions and model wrappers, and adding corresponding tests. The overall implementation looks good, but there is an opportunity to refactor some duplicated code to improve maintainability.

Comment on lines +38 to +54
def get_inputs_embeds(self, inputs_embeds, **kwargs):
    model = self._hf_model[0]
    input_ids = kwargs['input_ids']
    pixel_values = kwargs.get('pixel_values')
    if pixel_values is not None and pixel_values.size(0) > 0:
        # Encode the real images and splice their features into the text embeddings.
        pixel_values = pixel_values.to(model.vision_tower.dtype)
        image_features: torch.Tensor = model._extract_image_features(pixel_values, kwargs['image_grid_hws'])
        inputs_embeds = inputs_embeds.to(image_features[0].dtype).clone()
        inputs_embeds = model._merge_with_image_features(inputs_embeds, input_ids, image_features)
    else:
        # Text-only batch: the same dummy-image trick as in _post_encode, so the
        # vision tower still participates in the backward pass with zero effect.
        image_processor = self.processor.image_processor
        dummy_image = Image.new('RGB', (32, 32), (0, 0, 0))
        image_inputs = image_processor([dummy_image], return_tensors='pt')
        pixel_values = image_inputs['pixel_values'].to(model.vision_tower.dtype)
        image_features: torch.Tensor = model._extract_image_features(pixel_values, image_inputs['image_grid_hws'])
        inputs_embeds = inputs_embeds + image_features.mean() * 0.
    return inputs_embeds

Severity: medium

There is significant code duplication between this get_inputs_embeds method and the _post_encode method in swift/llm/template/template/moonshot.py. Both methods implement nearly identical logic for processing image features and handling dummy images for distributed training. To improve maintainability and reduce redundancy, this logic should be refactored into a shared utility function.

The duplicated logic can be found in swift/llm/template/template/moonshot.py on lines 76-93.
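
With the hypothetical merge_kimi_vl_image_features helper sketched in the earlier review comment, this Megatron-side method could collapse into a thin delegation (illustrative only; the helper is not part of this PR):

    def get_inputs_embeds(self, inputs_embeds, **kwargs):
        model = self._hf_model[0]
        return merge_kimi_vl_image_features(
            model, self.processor.image_processor, kwargs['input_ids'],
            inputs_embeds, kwargs.get('pixel_values'), kwargs.get('image_grid_hws'))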

@Jintao-Huang merged commit faf4b54 into modelscope:main on Sep 19, 2025
1 of 2 checks passed