swift/megatron/trainers/grpo_trainer.py (4 changes: 2 additions & 2 deletions)
@@ -245,7 +245,7 @@ def _export_and_load_weights(self):
             # Colocate mode: load_weights supports iterator, pass directly
             llm_model = self.engine.inner_model
             llm_model.load_weights(weight_iterator)
-        elif self.vllm_mode == 'server' and self.is_main_process:
+        elif self.vllm_mode == 'server':
             # Server mode: process in buckets and sync with flattened tensors
             self._load_weights_to_server_in_buckets(weight_iterator)

@@ -285,7 +285,7 @@ def _sync_bucket_to_server(self, bucket_params: List[Tuple[str, torch.Tensor]]):
         Args:
             bucket_params: List of (name, tensor) tuples to sync
         """
-        if not bucket_params:
+        if not bucket_params or not self.is_main_process:
Contributor
critical

This change, combined with the modification at line 248, appears to introduce a critical bug. The change at line 248 causes _load_weights_to_server_in_buckets to be called on all processes, which is likely correct if export_weights is a collective operation. However, the added condition (or not self.is_main_process) makes _sync_bucket_to_server return immediately on non-main processes. As a result, only the weights from the main process's weight_iterator are synchronized with the vLLM server, while the weights from all other processes are silently ignored. This would leave an incompletely updated model on the server.

To fix this, you should gather bucket_params from all processes onto the main process before syncing. This would likely require changes in the calling function, _load_weights_to_server_in_buckets, to orchestrate the gathering of buckets from all ranks before the main rank performs the synchronization.
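
A minimal sketch of the gather-then-sync pattern suggested above, assuming torch.distributed is already initialized and rank 0 is the main process. The method name _gather_and_sync_bucket and the call sites are illustrative, not the repository's actual code:

```python
from typing import List, Tuple

import torch
import torch.distributed as dist


def _gather_and_sync_bucket(self, bucket_params: List[Tuple[str, torch.Tensor]]):
    # Detach and move tensors to CPU so they can be pickled by gather_object.
    local_bucket = [(name, t.detach().cpu()) for name, t in bucket_params]

    # Only the destination rank allocates the receive list.
    gathered = [None] * dist.get_world_size() if self.is_main_process else None
    dist.gather_object(local_bucket, gathered, dst=0)

    if self.is_main_process:
        # Flatten the per-rank buckets and push one merged bucket to the server.
        merged = [pair for rank_bucket in gathered for pair in rank_bucket]
        self._sync_bucket_to_server(merged)
```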

Collaborator Author

The complete weights have already been gathered in the calling _load_weights_to_server_in_buckets method, so this is not an issue.
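
A rough sketch of the control flow the author describes, assuming export_weights is a collective call whose iterator yields full (already gathered) tensors on every rank. The bucket-size parameter and loop structure are hypothetical, not taken from the repository:

```python
def _load_weights_to_server_in_buckets(self, weight_iterator, bucket_mb: int = 512):
    bucket, bucket_bytes = [], 0
    for name, tensor in weight_iterator:  # full tensors, gathered upstream
        bucket.append((name, tensor))
        bucket_bytes += tensor.numel() * tensor.element_size()
        if bucket_bytes >= bucket_mb * 1024 * 1024:
            # Every rank reaches this call, but the early return added in this
            # PR means only the main process pushes the bucket to the vLLM
            # server; the other ranks simply fall through.
            self._sync_bucket_to_server(bucket)
            bucket, bucket_bytes = [], 0
    if bucket:
        self._sync_bucket_to_server(bucket)
```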

             return

         # Create FlattenedTensorBucket for efficient transfer