Skip to content

[bugfix] fix minimax qk_norm sp#83

Merged
Jintao-Huang merged 8 commits into
modelscope:mainfrom
Jintao-Huang:fix_minimax_qk_norm
May 17, 2026
Merged

[bugfix] fix minimax qk_norm sp#83
Jintao-Huang merged 8 commits into
modelscope:mainfrom
Jintao-Huang:fix_minimax_qk_norm

Conversation

@Jintao-Huang
Copy link
Copy Markdown
Collaborator

No description provided.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request standardizes the use of 'duplicated' parallel mode for linear layers and refines sequence parallel gathering logic across the bridge. In gpt_model.py, the gathering logic for sequence classification and embedding tasks was moved from hidden_states to logits. In minimax_m2.py, normalization parameters are now explicitly excluded from sequence parallel gradient reduction to prevent gradient amplification. Feedback highlights a critical issue in gpt_model.py where the new gathering logic may perform a redundant AllGather during dynamic batching inference if sequence_parallel_override is active, potentially leading to incorrect output shapes.

Comment thread src/mcore_bridge/model/gpt_model.py
@Jintao-Huang
Copy link
Copy Markdown
Collaborator Author

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request modifies the GPT model and LoRA tuners to refine sequence parallelism (SP) and tensor parallelism (TP) handling, including changes to parallel_mode settings and the placement of sequence parallel gathering. Critical feedback was provided regarding potential weight divergence in the output layer and LoRA adapters when SP is active, as well as a potential double-gather issue during inference in the _postprocess method.

Comment thread src/mcore_bridge/model/gpt_model.py
Comment thread src/mcore_bridge/model/gpt_model.py
Comment thread src/mcore_bridge/tuners/lora.py
@Jintao-Huang
Copy link
Copy Markdown
Collaborator Author

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors model parallelism configurations, specifically setting parallel_mode to 'duplicated' for output layers and LoRA linear layers. It also adjusts the sequence parallel gathering logic in GPTModel and disables sequence parallelism for specific normalization layers in MinimaxM2 to prevent gradient amplification. Additionally, the deepcopy utility was updated to preserve parameter metadata. Feedback suggests ensuring that all parameter attributes are overwritten during deep copying to correctly mirror the source state, even if default values exist in the target.

Comment on lines +97 to +98
if not hasattr(res_param, k):
setattr(res_param, k, v)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The check if not hasattr(res_param, k) prevents copying attributes that might have been initialized with default values in the constructor of the new module. For example, if a parameter is initialized with sequence_parallel=False by default, it will not be updated even if the source parameter has it set to True. Since __dict__ on an nn.Parameter typically only contains custom metadata added by the user or the framework (such as Megatron's parallelism flags), it is safer to overwrite these attributes to ensure the state is correctly mirrored in the deep-copied module.

Suggested change
if not hasattr(res_param, k):
setattr(res_param, k, v)
for k, v in param.__dict__.items():
setattr(res_param, k, v)

@Jintao-Huang
Copy link
Copy Markdown
Collaborator Author

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refines model parallelism and LoRA adapter logic, specifically moving sequence parallel gathering to the logit stage in gpt_model.py, adjusting gradient averaging for normalization layers in minimax_m2.py, and disabling communication overlap for certain LoRA configurations. It also improves the deepcopy utility to preserve parameter attributes. Feedback highlights a potential double-gather issue in gpt_model.py during inference for sequence classification tasks, suggesting a conditional check and explicit process group specification to ensure correct tensor shapes.

Comment thread src/mcore_bridge/model/gpt_model.py
@Jintao-Huang Jintao-Huang merged commit e746e89 into modelscope:main May 17, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants