[bugfix] Fix megatron save_total_limit & pp #9175
Conversation
Code Review
This pull request updates the documentation for `flash_attention_3` and introduces distributed synchronization for tracking the best metric in the Megatron trainer. A critical issue was identified in the `_determine_best_metric` implementation: the trainer state is unconditionally overwritten with zero values during synchronization, which could lead to incorrect checkpoint rotation and loss of the actual best metric value.
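For context, here is a minimal sketch of the pattern at issue, assuming a `torch.distributed` process group is already initialized; the names `sync_best_metric`, `state`, and `src_rank` are hypothetical stand-ins, not the trainer's actual API. The point is that the broadcast buffer starts out zero-filled on every rank, so writing it back into the trainer state unconditionally clobbers real values; a validity flag set by the source rank avoids this.

```python
import torch
import torch.distributed as dist

def sync_best_metric(state, src_rank: int = 0):
    """Broadcast the best metric from src_rank to all ranks.

    Hypothetical sketch; `state` and its fields stand in for the
    trainer state and are not the actual _determine_best_metric code.
    """
    device = torch.device('cuda', torch.cuda.current_device())
    buf = torch.zeros(2, dtype=torch.float32, device=device)
    if dist.get_rank() == src_rank and state.best_metric is not None:
        buf[0] = state.best_metric
        buf[1] = 1.0  # validity flag: the source rank holds a best metric
    dist.broadcast(buf, src=src_rank)
    # Buggy pattern: assigning `state.best_metric = buf[0].item()`
    # unconditionally would overwrite a real value with 0.0 whenever
    # the source rank had nothing to send.
    if buf[1].item() == 1.0:
        state.best_metric = buf[0].item()
```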
/gemini review
Code Review
This pull request updates the documentation for `flash_attention_3` and implements distributed synchronization for tracking the best metric and global step in the Megatron trainer. The feedback suggests using `torch.float64` instead of `torch.float32` for the synchronization tensors so that large step counts stay exact, which is a valid improvement for long-running training jobs.
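The precision concern is easy to demonstrate: float32 has a 24-bit significand, so integer step counts above 2**24 (about 16.7 million) can no longer be represented exactly, while float64 stays exact far beyond any realistic global step. A minimal, self-contained check:

```python
import torch

step = 2**24 + 1  # 16_777_217: the first integer float32 cannot represent
print(torch.tensor(step, dtype=torch.float32).item())  # 16777216.0 (rounded)
print(torch.tensor(step, dtype=torch.float64).item())  # 16777217.0 (exact)
```

Packing both the best metric and the global step into a single `torch.float64` buffer before the broadcast avoids this silent rounding without any extra communication.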