Fix the MoE-params gradient-scaling (#4957)
This PR fixes a bug that I introduced in a previous
[PR](#4695). The MoE params' gradients were accidentally
double-scaled because `self.ipg_bucket_has_moe_params` was passed as the
`divide` argument to the all-reduce functions. Since we have already scaled
the MoE parameter gradients
[here](https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/stage_1_and_2.py#L1054),
we can safely pass `divide=False`. The `divide` argument may no longer be
needed, but I left it in place since it may still be required for the
sequence-parallelism accuracy/stability adjustments.
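
For intuition, here is a minimal, self-contained sketch (plain Python, not DeepSpeed code; all names in it are hypothetical) of why gradients that have already been pre-scaled must not be divided again inside the reduce path:

```python
# Hypothetical illustration of the double-scaling bug; none of these names
# come from DeepSpeed itself.
world_size = 4
local_grads = [8.0] * world_size  # the same local gradient on every rank

def prescale(g, n):
    # What the MoE branch already does before bucketing: divide by the group size.
    return g / n

def average_tensor(grads_per_rank, divide, n):
    # Stand-in for the reduce path: all-reduce SUM, optionally dividing by n.
    total = sum(grads_per_rank)
    return total / n if divide else total

prescaled = [prescale(g, world_size) for g in local_grads]

buggy = average_tensor(prescaled, divide=True, n=world_size)   # 2.0 -> divided twice
fixed = average_tensor(prescaled, divide=False, n=world_size)  # 8.0 -> correct average
print(buggy, fixed)
```

With `divide=True` the already-scaled gradients are divided a second time by the group size, which is exactly what passing `divide=False` avoids.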

cc: @tjruwase
RezaYazdaniAminabadi committed Jan 20, 2024
1 parent 7fb5bad commit 9d2660d
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions deepspeed/runtime/zero/stage_1_and_2.py
@@ -1109,14 +1109,14 @@ def average_tensor(self, tensor):
             if self.use_multi_rank_bucket_allreduce:
                 self.allreduce_and_scatter(buckets[bucket_key],
                                            numel_per_bucket=self.reduce_bucket_size,
-                                           divide=self.ipg_bucket_has_moe_params,
+                                           divide=False,
                                            process_group=bucket_key)
             else:
                 dst, process_group = bucket_key
                 self.allreduce_no_retain(buckets[bucket_key],
                                          numel_per_bucket=self.reduce_bucket_size,
                                          rank=dst,
-                                         divide=self.ipg_bucket_has_moe_params,
+                                         divide=False,
                                          process_group=process_group)

##############################################################################
