[feat] support dense/moe mixed models in Megatron GKD #7241
hjh0119 merged 7 commits into modelscope:main
Conversation
Summary of Changes

Hello @hjh0119, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request enhances the Megatron GKD framework by introducing support for mixed Dense and Mixture-of-Experts (MoE) models. It addresses compatibility issues arising from Megatron's strict validation of Expert Parallelism (EP) settings, particularly when a Dense model is initialized in an EP-enabled environment. The changes involve dynamically adjusting EP parameters for Dense student models, handling MoE and EP configuration keys during teacher model loading, and refactoring argument defaults to ensure seamless operation across diverse model architectures.
Code Review
The pull request introduces documentation for GKD (Generalized Knowledge Distillation) and refactors MoE (Mixture of Experts) argument defaults in megatron_args.py. The core changes address compatibility issues when using a Dense student model with an MoE teacher model, particularly around Expert Parallelism (EP) settings. A new method, _adjust_ep_for_dense_student, is added to gkd_trainer.py to temporarily reset EP parameters to 1 for a Dense student before Megatron's initialization, preventing validation errors. The _load_teacher_model method is also enhanced to save and restore MoE- and EP-related arguments when loading the teacher model, ensuring the correct model architecture is created. A patched_validate_args method was introduced in gkd_trainer.py that also resets EP parameters for a Dense student; a review comment notes this logic is redundant, since _adjust_ep_for_dense_student already handles it, and recommends removing it to avoid duplication.
/gemini review
Code Review
This pull request introduces documentation for the new GKD feature and refactors Megatron argument handling by moving MoE default settings into a dedicated static method and changing moe_layer_freq to an integer. The primary changes are in gkd_trainer.py, which implements loading a teacher model in a knowledge-distillation setup and specifically addresses the interaction between student and teacher models with respect to Mixture of Experts (MoE) and Expert Parallelism (EP) configurations. The review comments identified a critical bug: the student model properties (student_is_moe, student_ep_size, student_etp_size) were accessed in _load_teacher_model before being initialized, raising an AttributeError. The reviewer proposed deriving these properties from the original_config within _load_teacher_model and removing the now-redundant instance-variable assignments from the patched_validate_args method.
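The proposed fix for the AttributeError can be sketched as a small helper that derives the student's MoE/EP properties from the saved original config at the point of use, rather than reading instance attributes that a patched validate_args may never have set. The helper name and attribute names below are assumptions for illustration, not the PR's exact code.

```python
def derive_student_moe_properties(original_config):
    """Derive the student's MoE/EP properties from the saved original config.

    Computing these inside _load_teacher_model (from original_config)
    avoids depending on instance attributes assigned elsewhere, which was
    the source of the AttributeError flagged in review.
    """
    return {
        'student_is_moe': getattr(original_config, 'num_experts', None) is not None,
        'student_ep_size': getattr(original_config, 'expert_model_parallel_size', 1),
        'student_etp_size': getattr(original_config, 'expert_tensor_parallel_size', None),
    }
```

With this in place, the assignments to self.student_is_moe and friends inside patched_validate_args become redundant and can be removed, as the reviewer suggested.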