
Conversation

@hjh0119 (Collaborator) commented Nov 18, 2025

No description provided.

@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @hjh0119, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request fixes a bug in the Megatron GRPO trainer by refining how distributed process groups, specifically 'rollout groups', are managed. The changes remove an apparently unnecessary data collator preparation step and, more significantly, overhaul the rollout group creation so that it follows Megatron-LM's RankGenerator for data distribution and collection across parallel ranks. Ranks that process identical data samples are now grouped together for coordination during the rollout phase, making distributed training more robust and correct.

Highlights

  • Data Collator Removal: The _prepare_template_data_collator method and its corresponding call in the __init__ method have been removed, streamlining the trainer's initialization process.
  • Rollout Group Refactoring: The rollout group creation logic in _get_rollout_group has been completely refactored to use mpu.RankGenerator, ensuring that distributed process groups are formed according to Megatron-LM's parallelization scheme for consistent data handling (see the sketch after these highlights).
  • Documentation Clarity: Comments within the _get_rollout_group method have been updated to provide clearer explanations of its purpose and behavior, emphasizing the importance of grouping ranks that process the same data samples.
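
For context, a rollout group is intended to contain every rank that receives the same data samples, i.e. all tensor-, context-, and pipeline-parallel ranks belonging to one data-parallel replica. Below is a minimal, illustrative sketch of how such groups could be assembled from RankGenerator output; it assumes Megatron-LM's mpu.RankGenerator and its get_ranks() method (and the 'tp-cp-pp' token), and is not the exact code merged in this PR.

import torch.distributed as dist
from megatron.core import parallel_state as mpu

def build_rollout_groups():
    # Illustrative only: one process group per data-parallel replica,
    # covering all tp/cp/pp ranks that see the same rollout samples.
    rank_generator = mpu.RankGenerator(
        tp=mpu.get_tensor_model_parallel_world_size(),
        ep=1,
        dp=mpu.get_data_parallel_world_size(),
        pp=mpu.get_pipeline_model_parallel_world_size(),
        cp=mpu.get_context_parallel_world_size(),
        order='tp-cp-ep-dp-pp',
        rank_offset=0,
    )
    rollout_group = None
    # get_ranks('tp-cp-pp') yields, for each data-parallel index, the list of
    # ranks that differ only in their model-parallel coordinates.
    for ranks in rank_generator.get_ranks('tp-cp-pp'):
        group = dist.new_group(ranks)  # every rank must call new_group for every group
        if dist.get_rank() in ranks:
            rollout_group = group
    return rollout_group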

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces a bugfix for the rollout_group creation in Megatron GRPO training. The previous implementation relied on fragile assumptions about rank layout. The new implementation robustly creates groups using mpu.RankGenerator based on data parallelism, which is a significant improvement. The docstrings have also been updated for clarity, and some related code cleanup has been performed. Overall, this is a solid fix. I have one suggestion to further improve the robustness of the rollout_group creation by considering expert parallelism.

Comment on lines +382 to +394
cp_size = mpu.get_context_parallel_world_size()

# Calculate the rank range for my rollout group
group_start = my_dp_block_index * ranks_per_dp_group
# Create RankGenerator following Megatron-LM pattern
# Order: tp-cp-ep-dp-pp (default in Megatron-LM)
decoder_rank_generator = mpu.RankGenerator(
    tp=tp_size,
    ep=1,
    dp=dp_size,
    pp=pp_size,
    cp=cp_size,
    order='tp-cp-ep-dp-pp',
    rank_offset=0,
)
@gemini-code-assist (Contributor) commented, severity: high

The RankGenerator is initialized with a hardcoded ep=1. This could lead to incorrect rollout group creation if expert parallelism is used (i.e., expert_model_parallel_size > 1). The rollout group should encompass all model parallel dimensions for a given data parallel rank, including expert parallelism.

To make this more robust, I suggest fetching the expert parallel size from the arguments and using it to initialize the RankGenerator.

Suggested change

Before:
    cp_size = mpu.get_context_parallel_world_size()
    # Calculate the rank range for my rollout group
    group_start = my_dp_block_index * ranks_per_dp_group
    # Create RankGenerator following Megatron-LM pattern
    # Order: tp-cp-ep-dp-pp (default in Megatron-LM)
    decoder_rank_generator = mpu.RankGenerator(
        tp=tp_size,
        ep=1,
        dp=dp_size,
        pp=pp_size,
        cp=cp_size,
        order='tp-cp-ep-dp-pp',
        rank_offset=0,
    )

After:
    cp_size = mpu.get_context_parallel_world_size()
    ep_size = self.args.expert_model_parallel_size
    # Create RankGenerator following Megatron-LM pattern
    # Order: tp-cp-ep-dp-pp (default in Megatron-LM)
    decoder_rank_generator = mpu.RankGenerator(
        tp=tp_size,
        ep=ep_size,
        dp=dp_size,
        pp=pp_size,
        cp=cp_size,
        order='tp-cp-ep-dp-pp',
        rank_offset=0,
    )

@hjh0119 (Collaborator, Author) replied:

Same as in Megatron-LM initialize_model_parallel; ignoring.
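
For reference, the pattern the author points to appears in recent Megatron-LM versions of initialize_model_parallel, where the dense (decoder) RankGenerator is built with ep=1 and expert parallelism is handled by a separate generator. A rough, paraphrased sketch follows; variable names are approximated from memory, not a verbatim copy of Megatron-LM.

# Paraphrased from Megatron-LM initialize_model_parallel (recent versions);
# names are approximate. The dense (decoder) generator hardcodes ep=1 ...
decoder_rank_generator = RankGenerator(
    tp=tensor_model_parallel_size,
    ep=1,
    dp=data_parallel_size,
    pp=pipeline_model_parallel_size,
    cp=context_parallel_size,
    order=order,
)
# ... while a separate generator covers the expert-parallel dimensions.
expert_rank_generator = RankGenerator(
    tp=expert_tensor_parallel_size,
    ep=expert_model_parallel_size,
    dp=expert_data_parallel_size,
    pp=pipeline_model_parallel_size,
    cp=1,
    order=order,
)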

@hjh0119 hjh0119 merged commit 717c284 into modelscope:main Nov 18, 2025
1 of 2 checks passed
@hjh0119 hjh0119 deleted the fix-mg-cp-mrope branch November 18, 2025 13:50
vx120 pushed a commit to vx120/ms-swift that referenced this pull request Nov 19, 2025