[bugfix] fix megatron grpo rollout_group #6655
Conversation
Summary of Changes

Hello @hjh0119, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request addresses a bug in the Megatron GRPO trainer by refining how distributed process groups, specifically "rollout groups," are managed. The changes remove an apparently unnecessary data collator preparation step and, more significantly, overhaul the rollout group creation mechanism to align with Megatron-LM's RankGenerator pattern.
Code Review
This pull request introduces a bugfix for the rollout_group creation in Megatron GRPO training. The previous implementation relied on fragile assumptions about rank layout. The new implementation robustly creates groups using mpu.RankGenerator based on data parallelism, which is a significant improvement. The docstrings have also been updated for clarity, and some related code cleanup has been performed. Overall, this is a solid fix. I have one suggestion to further improve the robustness of the rollout_group creation by considering expert parallelism.
```diff
 cp_size = mpu.get_context_parallel_world_size()

-# Calculate the rank range for my rollout group
-group_start = my_dp_block_index * ranks_per_dp_group
+# Create RankGenerator following Megatron-LM pattern
+# Order: tp-cp-ep-dp-pp (default in Megatron-LM)
+decoder_rank_generator = mpu.RankGenerator(
+    tp=tp_size,
+    ep=1,
+    dp=dp_size,
+    pp=pp_size,
+    cp=cp_size,
+    order='tp-cp-ep-dp-pp',
+    rank_offset=0,
+)
```
The RankGenerator is initialized with a hardcoded ep=1. This could lead to incorrect rollout group creation if expert parallelism is used (i.e., expert_model_parallel_size > 1). The rollout group should encompass all model parallel dimensions for a given data parallel rank, including expert parallelism.
To make this more robust, I suggest fetching the expert parallel size from the arguments and using it to initialize the RankGenerator.
Suggested change:

```diff
 cp_size = mpu.get_context_parallel_world_size()
-# Calculate the rank range for my rollout group
-group_start = my_dp_block_index * ranks_per_dp_group
+ep_size = self.args.expert_model_parallel_size
 # Create RankGenerator following Megatron-LM pattern
 # Order: tp-cp-ep-dp-pp (default in Megatron-LM)
 decoder_rank_generator = mpu.RankGenerator(
     tp=tp_size,
-    ep=1,
+    ep=ep_size,
     dp=dp_size,
     pp=pp_size,
     cp=cp_size,
     order='tp-cp-ep-dp-pp',
     rank_offset=0,
 )
```
Same as in Megatron-LM's initialize_model_parallel; ignoring this suggestion.
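For reference, a condensed sketch of the pattern being referred to in Megatron-LM's initialize_model_parallel (variable and argument names are approximate and may differ across Megatron-LM versions): the decoder RankGenerator is built with ep=1, and expert parallelism is handled by a separate expert rank generator.

```python
# Condensed from Megatron-LM's parallel_state.initialize_model_parallel;
# names are illustrative and vary by version.
decoder_rank_generator = RankGenerator(
    tp=tensor_model_parallel_size,
    ep=1,  # expert parallelism intentionally excluded from the decoder generator
    dp=data_parallel_size,
    pp=pipeline_model_parallel_size,
    cp=context_parallel_size,
    order=order,  # e.g. 'tp-cp-ep-dp-pp'
)
# Expert layers use their own generator; ep > 1 only appears here.
expert_decoder_rank_generator = RankGenerator(
    tp=expert_tensor_parallel_size,
    ep=expert_model_parallel_size,
    dp=expert_data_parallel_size,
    pp=pipeline_model_parallel_size,
    cp=1,
    order=order,
)
```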