[bugfix] Fix megatron save_total_limit & pp #9175
Conversation
Code Review
This pull request updates the documentation for `flash_attention_3` and introduces distributed synchronization for tracking the best metric in the Megatron trainer. A critical issue was identified in the `_determine_best_metric` implementation: the trainer state is unconditionally overwritten with zero values during synchronization, which could lead to incorrect checkpoint rotation and loss of the actual best metric value.
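For context, here is a minimal sketch of the pattern at issue, assuming a `torch.distributed` process group is already initialized; the names `sync_best_metric`, `state`, and `src_rank` are hypothetical stand-ins, not the trainer's actual API. The point is that the broadcast buffer starts out zero-filled on every rank, so writing it back into the trainer state unconditionally clobbers real values; a validity flag set by the source rank avoids this.

```python
import torch
import torch.distributed as dist

def sync_best_metric(state, src_rank: int = 0):
    """Broadcast the best metric from src_rank to all ranks.

    Hypothetical sketch; `state` and its fields stand in for the
    trainer state and are not the actual _determine_best_metric code.
    """
    device = torch.device('cuda', torch.cuda.current_device())
    buf = torch.zeros(2, dtype=torch.float32, device=device)
    if dist.get_rank() == src_rank and state.best_metric is not None:
        buf[0] = state.best_metric
        buf[1] = 1.0  # validity flag: the source rank holds a best metric
    dist.broadcast(buf, src=src_rank)
    # Buggy pattern: assigning `state.best_metric = buf[0].item()`
    # unconditionally would overwrite a real value with 0.0 whenever
    # the source rank had nothing to send.
    if buf[1].item() == 1.0:
        state.best_metric = buf[0].item()
```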
/gemini review
Code Review
This pull request updates the documentation for `flash_attention_3` and implements distributed synchronization for tracking the best metric and global step in the Megatron trainer. The feedback suggests using `torch.float64` instead of `torch.float32` for the synchronization tensors so that large step counts stay exact, which is a valid improvement for long-running training jobs.
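The precision concern is easy to demonstrate: float32 has a 24-bit significand, so integer step counts above 2**24 (about 16.7 million) can no longer be represented exactly, while float64 stays exact far beyond any realistic global step. A minimal, self-contained check:

```python
import torch

step = 2**24 + 1  # 16_777_217: the first integer float32 cannot represent
print(torch.tensor(step, dtype=torch.float32).item())  # 16777216.0 (rounded)
print(torch.tensor(step, dtype=torch.float64).item())  # 16777217.0 (exact)
```

Packing both the best metric and the global step into a single `torch.float64` buffer before the broadcast avoids this silent rounding without any extra communication.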