MOE gate fixes and enhancements #5156
Conversation
Force-pushed from 4706fa2 to 8f9d75c (compare)
Currently, disabling token dropping is only integrated into the top1 gate logic. This commit integrates disabled token dropping into the top2 logic. Signed-off-by: Moshe Island <misland@habana.ai>
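A minimal sketch of the idea (the helper name and group argument are illustrative, not the PR's actual code): when token dropping is disabled, the capacity is grown to the load of the busiest expert, and the value is all-reduced so every rank agrees on the same capacity.

    import torch
    import torch.distributed as dist

    def adjust_capacity_no_drop(exp_counts: torch.Tensor, capacity: int, group=None) -> int:
        """Grow the capacity to the busiest expert's token count so no token is dropped."""
        new_capacity = torch.max(exp_counts)
        if dist.is_available() and dist.is_initialized():
            # All ranks must use the same capacity, so take the global max.
            dist.all_reduce(new_capacity, op=dist.ReduceOp.MAX, group=group)
        return max(int(new_capacity.item()), capacity)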
MoE aux loss is based on https://arxiv.org/pdf/2006.16668.pdf, Algo 1. For top1, the aux loss is implemented as: l_aux = torch.sum(me * ce) * num_experts, whereas for top2 the aux loss is implemented as: l_aux = torch.sum(me * ce) * num_experts * num_experts. Based on Algo 1, there is no reason for the extra multiplication by num_experts. Signed-off-by: Moshe Island <misland@habana.ai>
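A small self-contained sketch of the top1 formulation referenced above (variable names follow the snippet in the commit message; this is illustrative rather than the file's exact code):

    import torch

    def gshard_aux_loss(gates: torch.Tensor, mask1: torch.Tensor, num_experts: int) -> torch.Tensor:
        # gates: [tokens, experts] softmax routing probabilities
        # mask1: [tokens, experts] one-hot mask of each token's top-1 expert
        me = torch.mean(gates, dim=0)          # mean gate probability per expert
        ce = torch.mean(mask1.float(), dim=0)  # fraction of tokens routed to each expert
        return torch.sum(me * ce) * num_experts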
DeepSpeed's MoE top2 gating performs sampling to select the 2nd expert. Support disabling this sampling (i.e. using plain argmax). This is configurable, with the default being to perform 2nd expert sampling. Signed-off-by: Moshe Island <misland@habana.ai>
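A sketch of the two selection modes (the function name and flag are hypothetical; DeepSpeed's actual code uses its own Gumbel-sampling helper): with sampling, the logits are perturbed with Gumbel noise before the argmax; without sampling, a plain argmax is taken over the experts that remain after masking out each token's top-1 choice.

    import torch

    def select_second_expert(logits: torch.Tensor, mask1: torch.Tensor, sample: bool = True) -> torch.Tensor:
        if sample:
            # Gumbel noise makes the 2nd-expert choice stochastic.
            u = torch.rand_like(logits).clamp_(1e-10, 1.0 - 1e-10)
            logits = logits + (-torch.log(-torch.log(u)))
        # Exclude each token's 1st expert, then pick the best remaining one.
        masked = logits.masked_fill(mask1.bool(), float("-inf"))
        return torch.argmax(masked, dim=1)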
When non-expert layers use TP and the experts do not use TP, we drop duplicate tokens sent to the experts. Dropping duplicate tokens is done by slicing the tokens tensor sent to the experts, where each expert handles only 1/TP of the tokens. For that, however, we need to make sure the capacity is divisible by TP. Signed-off-by: Moshe Island <misland@habana.ai>
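For illustration, aligning the capacity could look like the following (hypothetical helper name, not the PR's actual code):

    import math

    def align_capacity_to_tp(capacity: int, tp: int) -> int:
        # Round up so the capacity dimension can be split evenly into TP slices.
        return int(math.ceil(capacity / tp)) * tp

    # e.g. align_capacity_to_tp(10, 4) == 12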
Currently, during forward, the TopKGate gate linear layer is converted to fp32. This is not allowed, since the linear layer's params are a view into DeepSpeed's flat parameter buffer. To fix it, use torch.nn.functional.linear with gate.weight.float(). Signed-off-by: Moshe Island <misland@habana.ai>
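The fix described above amounts to something like this sketch (assumed names; the point is casting a temporary copy of the weight at forward time instead of converting the layer, so the parameter stays a view into the flat fp16/bf16 buffer):

    import torch
    import torch.nn.functional as F

    def gate_logits(wg: torch.nn.Linear, hidden: torch.Tensor) -> torch.Tensor:
        # Do the gating matmul in fp32 without touching wg.weight's storage or dtype.
        return F.linear(hidden.float(), wg.weight.float())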
Force-pushed from 8f9d75c to aab9fc3 (compare)
        super().__init__()

        # Only top-1 and top-2 are supported at the moment.
        if k != 1 and k != 2:
            raise ValueError('Only top-1 and top-2 gatings are supported.')
-       self.wg = torch.nn.Linear(model_dim, num_experts, bias=False).float()
+       self.wg = torch.nn.Linear(model_dim, num_experts, bias=False)
@ykim362 - can you please review this part? I remember we forced .float() here because we wanted the gate weight to always be fp32 even if everything else was fp16.
deepspeed/moe/sharded_moe.py (outdated diff)
        # Compute l_aux
        me = torch.mean(gates, dim=0)
        ce = torch.mean(mask1.float(), dim=0)
-       l_aux = torch.mean(me * ce) * num_experts * num_experts
+       l_aux = torch.mean(me * ce) * num_experts
@ykim362 - FYI. I think the current value of l_aux in top-2 was giving us good convergence. I am not sure about changing it; if we do, we will need to run training experiments to verify there is no regression in loss.
@mosheisland - have you trained models using top-2 and seen that l_aux with your change gives better convergence/loss?
I did not run full training before and after my changes to compare.
I added this change to better align with the original paper.
Since you saw better results with the current formulation, let's keep it as-is.
I will upload a new commit that reverts this change.
BTW, when the same num_experts is used across all MoE layers, this can be compensated for by changing --moe-loss-coeff.
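In other words (a simple worked example, assuming the aux loss is multiplied by the coefficient and num_experts is identical in every MoE layer), the extra factor of num_experts can be folded into the coefficient:

    # coeff * (sum(me * ce) * E * E)  ==  (coeff * E) * (sum(me * ce) * E)
    E = 8
    old_coeff = 0.01                        # used with the E*E formulation
    equivalent_new_coeff = old_coeff * E    # used with the single-E formulation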
@mosheisland -- Thank you very much for working on this PR! Overall, it looks good to me, but given that we don't have the bandwidth to test convergence, it will really help us and users if you can do two things:
l_aux - I will revert this change.
capacity padding - this bug is reproduced when you use drop-tokens=False and non-expert TP > 1.
l_aux - no need, I will revert.
This reverts commit 692d42d.
Reverted l_aux commit.
Fixes the following issues:
- Fix capacity when using TP for non-MoE by aligning the capacity to TP
- Fix TopKGate.wg (gate weight) when using ZeRO with fp16 or bf16
- Fix top2 aux loss to be similar to top1 aux loss

Following are a few configurable enhancements:
- Support top2 with disabled token dropping
- Support disabling top2 2nd expert sampling

---------

Signed-off-by: Moshe Island <misland@habana.ai>
Co-authored-by: Moshe Island <misland@habana.ai>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>