[bugfix] fix minimax qk_norm sp#83
Conversation
There was a problem hiding this comment.
Code Review
This pull request standardizes the use of 'duplicated' parallel mode for linear layers and refines sequence parallel gathering logic across the bridge. In gpt_model.py, the gathering logic for sequence classification and embedding tasks was moved from hidden_states to logits. In minimax_m2.py, normalization parameters are now explicitly excluded from sequence parallel gradient reduction to prevent gradient amplification. Feedback highlights a critical issue in gpt_model.py where the new gathering logic may perform a redundant AllGather during dynamic batching inference if sequence_parallel_override is active, potentially leading to incorrect output shapes.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request modifies the GPT model and LoRA tuners to refine sequence parallelism (SP) and tensor parallelism (TP) handling, including changes to parallel_mode settings and the placement of sequence parallel gathering. Critical feedback was provided regarding potential weight divergence in the output layer and LoRA adapters when SP is active, as well as a potential double-gather issue during inference in the _postprocess method.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request refactors model parallelism configurations, specifically setting parallel_mode to 'duplicated' for output layers and LoRA linear layers. It also adjusts the sequence parallel gathering logic in GPTModel and disables sequence parallelism for specific normalization layers in MinimaxM2 to prevent gradient amplification. Additionally, the deepcopy utility was updated to preserve parameter metadata. Feedback suggests ensuring that all parameter attributes are overwritten during deep copying to correctly mirror the source state, even if default values exist in the target.
| if not hasattr(res_param, k): | ||
| setattr(res_param, k, v) |
There was a problem hiding this comment.
The check if not hasattr(res_param, k) prevents copying attributes that might have been initialized with default values in the constructor of the new module. For example, if a parameter is initialized with sequence_parallel=False by default, it will not be updated even if the source parameter has it set to True. Since __dict__ on an nn.Parameter typically only contains custom metadata added by the user or the framework (such as Megatron's parallelism flags), it is safer to overwrite these attributes to ensure the state is correctly mirrored in the deep-copied module.
| if not hasattr(res_param, k): | |
| setattr(res_param, k, v) | |
| for k, v in param.__dict__.items(): | |
| setattr(res_param, k, v) |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request refines model parallelism and LoRA adapter logic, specifically moving sequence parallel gathering to the logit stage in gpt_model.py, adjusting gradient averaging for normalization layers in minimax_m2.py, and disabling communication overlap for certain LoRA configurations. It also improves the deepcopy utility to preserve parameter attributes. Feedback highlights a potential double-gather issue in gpt_model.py during inference for sequence classification tasks, suggesting a conditional check and explicit process group specification to ensure correct tensor shapes.
No description provided.