[REQUEST] Expert Choice Routing for MoE #2517
Comments
The authors claim a 2x faster convergence rate with EC routing: https://ai.googleblog.com/2022/11/mixture-of-experts-with-expert-choice.html I hope this incentivizes implementing it in DeepSpeed.
In case it helps, there is a TL;DR in Lilian Weng's blog post.
No @ykim362, but I would like to experiment with it and share the results.
@clumsy you can take a look at this experimental branch: https://github.com/ykim362/DeepSpeed/tree/youki/expc
Hey, Google has an implementation of expert choice routing here: https://github.com/google/flaxformer/blob/main/flaxformer/architectures/moe/routing.py#L647-L717 They note that it should not be used in decoder blocks; maybe that was the reason for the poor results in your experiments?
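For anyone skimming the thread, here is a minimal PyTorch sketch of the expert-choice idea: instead of each token picking its top-k experts (as in GShard/Switch gating), each expert picks its top-k tokens, so no expert is starved of inputs. This is only an illustration of the paper's routing, not DeepSpeed's gate or the flaxformer code; the class name, the capacity formula, and the return shapes are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpertChoiceRouter(nn.Module):
    """Illustrative expert-choice router: each expert selects its top-k tokens."""

    def __init__(self, d_model: int, n_experts: int, capacity_factor: float = 2.0):
        super().__init__()
        self.w_gate = nn.Linear(d_model, n_experts, bias=False)
        self.n_experts = n_experts
        self.capacity_factor = capacity_factor

    def forward(self, tokens: torch.Tensor):
        # tokens: [n_tokens, d_model], batch and sequence dims already flattened.
        n_tokens = tokens.shape[0]

        # Per-expert capacity k: roughly capacity_factor slots per token,
        # divided evenly across experts (assumed formula for this sketch).
        k = max(1, int(n_tokens * self.capacity_factor / self.n_experts))

        # Token-to-expert affinities, normalized over the expert dimension:
        # shape [n_tokens, n_experts].
        scores = F.softmax(self.w_gate(tokens), dim=-1)

        # Flip to [n_experts, n_tokens] and let every expert select its
        # top-k tokens, so every expert always gets a full set of inputs.
        gates, token_idx = torch.topk(scores.t(), k, dim=-1)  # both [n_experts, k]

        # Gather the chosen tokens for each expert: [n_experts, k, d_model].
        dispatched = tokens[token_idx]

        # `gates` weights each expert's output when scattering results back to
        # token positions; tokens chosen by no expert fall through untouched.
        return dispatched, gates, token_idx
```

The decoder caveat in the flaxformer note also falls out of this sketch: the top-k is taken over the whole token dimension, so an expert's selection at one position can depend on scores from later positions, which breaks causality for autoregressive decoding.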
Is your feature request related to a problem? Please describe.
A paper was published describing a potentially better token-to-expert routing scheme for MoE that leaves fewer experts under-trained.
Describe the solution you'd like
In addition to GShard's top-2 and Switch Transformer's top-1 per-token expert routing, add an expert-choice routing option.
Describe alternatives you've considered
N/A
Additional context
N/A