Merged
2 changes: 2 additions & 0 deletions swift/megatron/train/sft.py
@@ -4,6 +4,7 @@
 from typing import List, Optional, Union

 import torch
+import torch.distributed as dist

 from swift.llm import TEMPLATE_MAPPING
 from swift.llm.train import SwiftSft
@@ -60,6 +61,7 @@ def run(self):

     try:
         self.trainer.train(train_dataset, val_dataset, data_collator)
+        dist.barrier()  # Ensure all weights are saved completely
Contributor

Severity: high

While adding a barrier here correctly synchronizes processes after training and ensures the weights are saved completely, its placement inside the try block can lead to a race condition. After passing the barrier, processes other than the last rank execute a finally block that is a no-op for them and may exit, while the last-rank process is still running the visualization logic. If a job scheduler sees that most processes have terminated, it may kill the entire job, interrupting the visualization task on the last rank.

A more robust approach is to place the dist.barrier() after the finally block. This ensures all processes, including the last rank, complete their final tasks before synchronization and exit. As the finally block is not part of this diff, I cannot provide a direct code suggestion, but I recommend moving the barrier to improve the robustness of the shutdown sequence.
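The recommended ordering can be sketched as follows. This is a minimal illustration of the shutdown sequence, not the actual swift code: the stub trainer and the `visualize`/`barrier` callables stand in for `self.trainer.train(...)`, the visualization logic, and `dist.barrier()`.

```python
# Sketch of the reviewer's recommendation: the barrier runs after the
# finally block, so every rank (including the last rank's visualization
# step) finishes its final work before any rank synchronizes and exits.
class _StubTrainer:
    """Illustrative stand-in for the real trainer."""
    def __init__(self, log):
        self.log = log

    def train(self):
        self.log.append("train")

def run(trainer, is_last_rank, visualize, barrier):
    try:
        trainer.train()
    finally:
        # Per-rank final work (visualization on the last rank) happens
        # before synchronization.
        if is_last_rank():
            visualize()
    # In the real code this would be dist.barrier(); every rank reaches
    # it only after its finally block has completed, so no process can
    # exit while another is still visualizing.
    barrier()

log = []
run(_StubTrainer(log),
    is_last_rank=lambda: True,
    visualize=lambda: log.append("visualize"),
    barrier=lambda: log.append("barrier"))
# log == ["train", "visualize", "barrier"]
```

With the barrier inside the try block instead, `barrier` would be logged before `visualize`, which is exactly the ordering the comment flags as risky.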

     finally:
         # Visualization
         if is_last_rank():