Fix the MoE-params gradient-scaling #4957

RezaYazdaniAminabadi · 2024-01-16T06:18:04Z

This PR fixes a bug that I introduced in a previous [PR](microsoft#4695). The MoE parameters' gradients were accidentally double-scaled because `self.ipg_bucket_has_moe_params` was passed to the all_reduce functions. Since the MoE parameters' gradients are already scaled [here](https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/stage_1_and_2.py#L1054), we can safely pass `divide=False`. The `divide` argument may no longer be needed, but I left it in place because I think it may be needed for the sequence-parallelism accuracy and stability adjustments.

cc: @tjruwase
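To make the double-scaling concrete, here is a minimal pure-Python sketch of the failure mode described above. It is not DeepSpeed's code: `all_reduce_sum`, `reduce_bucket`, and the use of a single `world_size` as the scaling factor are assumptions made for illustration, and only the `divide` flag name is taken from the description.

```python
# Illustrative sketch only; these helpers are hypothetical, not DeepSpeed's API.

def all_reduce_sum(per_rank_grads):
    """Simulate an all-reduce (sum): every rank ends up with the total."""
    total = sum(per_rank_grads)
    return [total] * len(per_rank_grads)


def reduce_bucket(per_rank_grads, world_size, divide):
    """Sum gradients across ranks; optionally divide by world_size afterwards."""
    reduced = all_reduce_sum(per_rank_grads)
    if divide:
        reduced = [g / world_size for g in reduced]
    return reduced


world_size = 4
raw_grads = [8.0, 8.0, 8.0, 8.0]  # same gradient on every rank, for simplicity

# The gradients are scaled once before they reach the reduce step.
prescaled = [g / world_size for g in raw_grads]

# Buggy path: the reduce divides again, so the result is world_size times too small.
double_scaled = reduce_bucket(prescaled, world_size, divide=True)

# Fixed path: the reduce only sums, because the scaling has already happened.
correct = reduce_bucket(prescaled, world_size, divide=False)

print(double_scaled[0], correct[0])
```

With identical gradients on every rank, the fixed path recovers the per-rank value (the true average), while the buggy path shrinks it by a further factor of `world_size`.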

RezaYazdaniAminabadi · 2024-01-17T01:01:28Z

@tjruwase, can we merge this please? thanks :)

RezaYazdaniAminabadi · 2024-01-19T00:38:16Z

@mrwyattii can you please approve the workflows here? Thanks

RezaYazdaniAminabadi · 2024-01-20T21:55:24Z

kind ping on this @tjruwase @mrwyattii

Fix the MoE-params gradient-scaling

ecd102f

RezaYazdaniAminabadi requested review from tjruwase and mrwyattii as code owners January 16, 2024 06:18

Merge branch 'master' into fix-moe-grad-scaling

35b7979

tjruwase approved these changes Jan 17, 2024

tjruwase added this pull request to the merge queue Jan 17, 2024

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 18, 2024

Merge branch 'master' into fix-moe-grad-scaling

a0868e3

RezaYazdaniAminabadi added 2 commits January 18, 2024 22:09

Merge branch 'master' into fix-moe-grad-scaling

cf2af49

Merge branch 'master' into fix-moe-grad-scaling

f8f8ef9

tjruwase added this pull request to the merge queue Jan 20, 2024

Merged via the queue into microsoft:master with commit 9d2660d Jan 20, 2024
12 checks passed

tkdcjf159 mentioned this pull request Mar 9, 2024

[BUG] Sequence Parallel(Ulysses) Training Gradient Scaling Issue #5248

Open
