[bugfix] fix megatron grpo rollout_group #6655
Conversation
Summary of Changes

Hello @hjh0119, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request addresses a bug in the Megatron GRPO trainer by refining how distributed process groups, specifically "rollout groups," are managed. The changes remove an apparently unnecessary data collator preparation step and, more significantly, overhaul the rollout group creation mechanism to align with Megatron-LM's RankGenerator pattern.
Code Review
This pull request introduces a bugfix for the rollout_group creation in Megatron GRPO training. The previous implementation relied on fragile assumptions about rank layout. The new implementation robustly creates groups using mpu.RankGenerator based on data parallelism, which is a significant improvement. The docstrings have also been updated for clarity, and some related code cleanup has been performed. Overall, this is a solid fix. I have one suggestion to further improve the robustness of the rollout_group creation by considering expert parallelism.
```diff
 cp_size = mpu.get_context_parallel_world_size()

-# Calculate the rank range for my rollout group
-group_start = my_dp_block_index * ranks_per_dp_group
+# Create RankGenerator following Megatron-LM pattern
+# Order: tp-cp-ep-dp-pp (default in Megatron-LM)
+decoder_rank_generator = mpu.RankGenerator(
+    tp=tp_size,
+    ep=1,
+    dp=dp_size,
+    pp=pp_size,
+    cp=cp_size,
+    order='tp-cp-ep-dp-pp',
+    rank_offset=0,
+)
```
The RankGenerator is initialized with a hardcoded ep=1. This could lead to incorrect rollout group creation if expert parallelism is used (i.e., expert_model_parallel_size > 1). The rollout group should encompass all model parallel dimensions for a given data parallel rank, including expert parallelism.
To make this more robust, I suggest fetching the expert parallel size from the arguments and using it to initialize the RankGenerator.
Suggested change:

```diff
 cp_size = mpu.get_context_parallel_world_size()
-# Calculate the rank range for my rollout group
-group_start = my_dp_block_index * ranks_per_dp_group
+ep_size = self.args.expert_model_parallel_size
 # Create RankGenerator following Megatron-LM pattern
 # Order: tp-cp-ep-dp-pp (default in Megatron-LM)
 decoder_rank_generator = mpu.RankGenerator(
     tp=tp_size,
-    ep=1,
+    ep=ep_size,
     dp=dp_size,
     pp=pp_size,
     cp=cp_size,
     order='tp-cp-ep-dp-pp',
     rank_offset=0,
 )
```
Same as in Megatron-LM's initialize_model_parallel; ignoring this suggestion.
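For reference, a condensed sketch of the pattern being referred to in Megatron-LM's initialize_model_parallel (variable and argument names are approximate and may differ across Megatron-LM versions): the decoder RankGenerator is built with ep=1, and expert parallelism is handled by a separate expert rank generator.

```python
# Condensed from Megatron-LM's parallel_state.initialize_model_parallel;
# names are illustrative and vary by version.
decoder_rank_generator = RankGenerator(
    tp=tensor_model_parallel_size,
    ep=1,  # expert parallelism intentionally excluded from the decoder generator
    dp=data_parallel_size,
    pp=pipeline_model_parallel_size,
    cp=context_parallel_size,
    order=order,  # e.g. 'tp-cp-ep-dp-pp'
)
# Expert layers use their own generator; ep > 1 only appears here.
expert_decoder_rank_generator = RankGenerator(
    tp=expert_tensor_parallel_size,
    ep=expert_model_parallel_size,
    dp=expert_data_parallel_size,
    pp=pipeline_model_parallel_size,
    cp=1,
    order=order,
)
```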