Skip to content

[bugfix] Fix megatron save_total_limit & pp#9175

Merged
Jintao-Huang merged 4 commits into
modelscope:mainfrom
Jintao-Huang:fix_megatron_save_total_limit
Apr 22, 2026
Merged

[bugfix] Fix megatron save_total_limit & pp#9175
Jintao-Huang merged 4 commits into
modelscope:mainfrom
Jintao-Huang:fix_megatron_save_total_limit

Conversation

@Jintao-Huang
Copy link
Copy Markdown
Collaborator

No description provided.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the documentation for flash_attention_3 and introduces distributed synchronization for tracking the best metric in the Megatron trainer. A critical issue was identified in the _determine_best_metric implementation where the trainer state is unconditionally overwritten with zero values during synchronization, which could lead to incorrect checkpoint rotation and the loss of the actual best metric value.

Comment thread swift/megatron/trainers/base.py Outdated
@Jintao-Huang Jintao-Huang changed the title [bugfix] Fix megatron save_total_limit [bugfix] Fix megatron save_total_limit & pp Apr 22, 2026
@Jintao-Huang
Copy link
Copy Markdown
Collaborator Author

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the documentation for flash_attention_3 and implements distributed synchronization for tracking the best metric and global step in the Megatron trainer. The feedback suggests using torch.float64 instead of torch.float32 for the synchronization tensors to ensure precision for large step counts, which is a valid improvement for long-running training jobs.

Comment thread swift/megatron/trainers/base.py Outdated
Comment thread swift/megatron/trainers/base.py Outdated
@Jintao-Huang Jintao-Huang merged commit 2596236 into modelscope:main Apr 22, 2026
2 of 3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants