Fix the MoE-params gradient-scaling (#4957)
This PR fixes a bug that I introduced in a previous
[PR](#4695). The MoE params' gradients were accidentally
double-scaled because `self.ipg_bucket_has_moe_params` was passed as the
`divide` argument to the all-reduce functions. Since we have already scaled
the MoE parameter gradients
[here](https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/stage_1_and_2.py#L1054),
we can safely pass `divide=False`. The `divide` argument may no longer be
needed, but I left it in place since it may still be required for the
sequence-parallelism accuracy/stability adjustments.
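
For intuition, here is a minimal, self-contained sketch (plain Python, not DeepSpeed code; all names in it are hypothetical) of why gradients that have already been pre-scaled must not be divided again inside the reduce path:

```python
# Hypothetical illustration of the double-scaling bug; none of these names
# come from DeepSpeed itself.
world_size = 4
local_grads = [8.0] * world_size  # the same local gradient on every rank

def prescale(g, n):
    # What the MoE branch already does before bucketing: divide by the group size.
    return g / n

def average_tensor(grads_per_rank, divide, n):
    # Stand-in for the reduce path: all-reduce SUM, optionally dividing by n.
    total = sum(grads_per_rank)
    return total / n if divide else total

prescaled = [prescale(g, world_size) for g in local_grads]

buggy = average_tensor(prescaled, divide=True, n=world_size)   # 2.0 -> divided twice
fixed = average_tensor(prescaled, divide=False, n=world_size)  # 8.0 -> correct average
print(buggy, fixed)
```

With `divide=True` the already-scaled gradients are divided a second time by the group size, which is exactly what passing `divide=False` avoids.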

cc: @tjruwase
RezaYazdaniAminabadi committed Jan 20, 2024
1 parent 7fb5bad commit 9d2660d
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions deepspeed/runtime/zero/stage_1_and_2.py
@@ -1109,14 +1109,14 @@ def average_tensor(self, tensor):
             if self.use_multi_rank_bucket_allreduce:
                 self.allreduce_and_scatter(buckets[bucket_key],
                                            numel_per_bucket=self.reduce_bucket_size,
-                                           divide=self.ipg_bucket_has_moe_params,
+                                           divide=False,
                                            process_group=bucket_key)
             else:
                 dst, process_group = bucket_key
                 self.allreduce_no_retain(buckets[bucket_key],
                                          numel_per_bucket=self.reduce_bucket_size,
                                          rank=dst,
-                                         divide=self.ipg_bucket_has_moe_params,
+                                         divide=False,
                                          process_group=process_group)

##############################################################################
