Adding Expert Prototyping to FastMoE #69

Open
JustinLin610 opened this issue Aug 23, 2021 · 1 comment
@JustinLin610

Hi, thanks for providing an end-to-end PyTorch training framework for MoE models. We recently implemented MoE in TensorFlow and found that categorizing experts into different groups can improve model quality. More details are in our paper https://arxiv.org/abs/2105.15082. I wonder whether it would be possible to add this feature, since FastMoE really facilitates research in sparse expert models.

Generally, this strategy categorizes experts into different groups, each of which has its own gating function for routing. It is compatible with conventional routing methods like Switch or top-2 routing, since you can simply set the group number to 1. We find that increasing the value of k in top-k can improve model performance, and that k top-1 (k groups, each routing with its own top-1 gate) achieves a similar effect. It is also possible to try more complex strategies, say k top-k' and so on.

We have a code snippet in the appendix, which may be helpful.
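
Just to make the idea concrete, here is a rough PyTorch sketch of what a k top-1 gate might look like. This is my own illustration, not the snippet from the paper's appendix, and the class and argument names (`PrototypeGate`, `num_groups`, etc.) are made up:

```python
import torch
import torch.nn as nn


class PrototypeGate(nn.Module):
    """Illustrative sketch of 'k top-1' expert prototyping:
    experts are split into num_groups groups, each group owns its own
    top-1 gate, and every token activates one expert per group."""

    def __init__(self, d_model, num_experts, num_groups):
        super().__init__()
        assert num_experts % num_groups == 0
        self.num_groups = num_groups
        self.group_size = num_experts // num_groups
        # one linear gate per group (prototype)
        self.gates = nn.ModuleList(
            [nn.Linear(d_model, self.group_size) for _ in range(num_groups)]
        )

    def forward(self, x):
        # x: (tokens, d_model)
        indices, weights = [], []
        for g, gate in enumerate(self.gates):
            logits = gate(x)                      # (tokens, group_size)
            probs = torch.softmax(logits, dim=-1)
            w, idx = probs.max(dim=-1)            # top-1 within this group
            # map the local index back to a global expert id
            indices.append(idx + g * self.group_size)
            weights.append(w)
        # (tokens, num_groups) selected expert ids and their gate weights
        return torch.stack(indices, dim=-1), torch.stack(weights, dim=-1)
```

In FastMoE this would presumably be plugged in wherever the standard top-k gate lives, with the selected indices and weights fed into the usual expert dispatch; the exact gate interface would of course need to be checked against the FastMoE code.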

@xptree added the enhancement (New feature or request) label on Aug 23, 2021
@xptree (Collaborator) commented on Sep 6, 2021

Here is another recent work on MoE.

DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning
https://arxiv.org/abs/2106.03760

The idea is to activate all experts at the beginning of training but quickly converge to sparse activation. I wonder whether such a mechanism could help train better pre-trained models when our expert pool is not that large.
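
Not the DSelect-k algorithm itself (which is built on a smooth-step based differentiable selection), just a toy sketch of the dense-to-sparse behavior described above, with made-up names, in case it helps the discussion:

```python
import torch
import torch.nn as nn


class DenseToSparseGate(nn.Module):
    """Toy illustration (NOT DSelect-k): start with a high-temperature,
    nearly uniform softmax over all experts, then anneal the temperature
    so the routing distribution sharpens toward near top-1."""

    def __init__(self, d_model, num_experts, start_temp=5.0, min_temp=0.1):
        super().__init__()
        self.proj = nn.Linear(d_model, num_experts)
        self.start_temp = start_temp
        self.min_temp = min_temp

    def forward(self, x, progress):
        # progress in [0, 1]: fraction of training completed.
        # Early on (progress ~ 0) the weights are spread over many experts;
        # later they concentrate on a few, approximating sparse routing.
        temp = max(self.min_temp, self.start_temp * (1.0 - progress))
        return torch.softmax(self.proj(x) / temp, dim=-1)
```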

Let me know what you think about it.
