[megatron] Support kimi vl megatron #5872
Conversation
Summary of Changes

Hello @Jintao-Huang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces full support for the Kimi VL multimodal model, enabling its use and conversion within the Megatron framework. The changes encompass the integration of the model's unique architecture, specialized handling for its vision and language components, and updates to the Megatron configuration and test suite to accommodate this new model. The primary goal is to expand the range of supported multimodal large language models, facilitating research on and deployment of Kimi VL.
Code Review
This pull request adds support for the Kimi-VL model in Megatron. The changes look mostly good, including model registration, configuration conversion, and a new test case. However, I've identified significant duplication in the image-embedding handling between the standard Hugging Face path and the Megatron path, and I've left a comment suggesting that this logic be refactored into a shared utility function to improve maintainability. Other changes, such as the improved precision testing, are positive additions.
```python
def _post_encode(self, model: nn.Module, inputs: Dict[str, Any]) -> Dict[str, Any]:
    input_ids = inputs['input_ids']
    pixel_values = inputs.get('pixel_values')
    inputs_embeds = model.get_input_embeddings()(input_ids)

    if pixel_values is not None and pixel_values.size(0) > 0:
        pixel_values = pixel_values.to(model.vision_tower.dtype)
        image_features: torch.Tensor = model._extract_image_features(pixel_values, inputs['image_grid_hws'])
        inputs_embeds = inputs_embeds.to(image_features[0].dtype)
        inputs_embeds = model._merge_with_image_features(inputs_embeds, input_ids, image_features)
    else:
        image_processor = self.processor.image_processor
        dummy_image = Image.new('RGB', (32, 32), (0, 0, 0))
        image_inputs = image_processor([dummy_image], return_tensors='pt')
        pixel_values = image_inputs['pixel_values'].to(model.vision_tower.dtype)
        image_features: torch.Tensor = model._extract_image_features(pixel_values, image_inputs['image_grid_hws'])
        inputs_embeds = inputs_embeds + image_features.mean() * 0.
    return {'inputs_embeds': inputs_embeds}
```
There is significant code duplication between this `_post_encode` method and the `get_inputs_embeds` method in the new file `swift/megatron/model/mm_gpt/kimi_vl.py`. Both methods implement nearly identical logic for merging image features with text embeddings.

To improve maintainability and reduce redundancy, I suggest refactoring this logic into a shared utility function. This function could be called from both `_post_encode` and `get_inputs_embeds`, ensuring that any future changes to this logic only need to be made in one place.
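As a sketch of what such a shared helper might look like (the name `merge_image_features` and its signature are illustrative, not taken from the PR), the common branches of both methods could be hoisted into one function that each call site delegates to:

```python
from typing import Any, Optional

import torch
from PIL import Image


def merge_image_features(model, image_processor, input_ids: torch.Tensor,
                         inputs_embeds: torch.Tensor,
                         pixel_values: Optional[torch.Tensor],
                         image_grid_hws: Any) -> torch.Tensor:
    """Hypothetical shared helper: merge vision features into text embeddings.

    Falls back to a dummy image for text-only batches so the vision tower
    still participates in the backward pass.
    """
    if pixel_values is not None and pixel_values.size(0) > 0:
        pixel_values = pixel_values.to(model.vision_tower.dtype)
        image_features = model._extract_image_features(pixel_values, image_grid_hws)
        inputs_embeds = inputs_embeds.to(image_features[0].dtype).clone()
        inputs_embeds = model._merge_with_image_features(inputs_embeds, input_ids, image_features)
    else:
        # Text-only batch: run a black 32x32 image through the vision tower
        # and add a zero-scaled term so its parameters receive gradients.
        dummy_image = Image.new('RGB', (32, 32), (0, 0, 0))
        image_inputs = image_processor([dummy_image], return_tensors='pt')
        pixel_values = image_inputs['pixel_values'].to(model.vision_tower.dtype)
        image_features = model._extract_image_features(pixel_values, image_inputs['image_grid_hws'])
        inputs_embeds = inputs_embeds + image_features.mean() * 0.
    return inputs_embeds
```

With this in place, `_post_encode` would return `{'inputs_embeds': merge_image_features(...)}` and `get_inputs_embeds` would return the helper's result directly, so the two code paths can no longer drift apart.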
/gemini review
Code Review
This pull request adds support for the Kimi-VL model in Megatron. The changes include registering the new model type, defining its template, implementing Megatron-specific conversion functions and model wrappers, and adding corresponding tests. The overall implementation looks good, but there is an opportunity to refactor some duplicated code to improve maintainability.
```python
def get_inputs_embeds(self, inputs_embeds, **kwargs):
    model = self._hf_model[0]
    input_ids = kwargs['input_ids']
    pixel_values = kwargs.get('pixel_values')
    if pixel_values is not None and pixel_values.size(0) > 0:
        pixel_values = pixel_values.to(model.vision_tower.dtype)
        image_features: torch.Tensor = model._extract_image_features(pixel_values, kwargs['image_grid_hws'])
        inputs_embeds = inputs_embeds.to(image_features[0].dtype).clone()
        inputs_embeds = model._merge_with_image_features(inputs_embeds, input_ids, image_features)
    else:
        image_processor = self.processor.image_processor
        dummy_image = Image.new('RGB', (32, 32), (0, 0, 0))
        image_inputs = image_processor([dummy_image], return_tensors='pt')
        pixel_values = image_inputs['pixel_values'].to(model.vision_tower.dtype)
        image_features: torch.Tensor = model._extract_image_features(pixel_values, image_inputs['image_grid_hws'])
        inputs_embeds = inputs_embeds + image_features.mean() * 0.
    return inputs_embeds
```
There is significant code duplication between this `get_inputs_embeds` method and the `_post_encode` method in `swift/llm/template/template/moonshot.py`. Both methods implement nearly identical logic for processing image features and handling dummy images for distributed training. To improve maintainability and reduce redundancy, this logic should be refactored into a shared utility function.

The duplicated logic can be found in `swift/llm/template/template/moonshot.py` on lines 76-93.
MoonshotAI/Kimi-VL#49