Adding Expert Prototyping to FastMoE #69

Open
JustinLin610 opened this issue Aug 23, 2021 · 1 comment
@JustinLin610

Hi, thanks for providing an end-to-end PyTorch training framework for MoE models. We recently implemented MoE in TensorFlow and found that categorizing experts into different groups can improve model quality. More details are in our paper https://arxiv.org/abs/2105.15082. I wonder whether it would be possible to add this feature, since FastMoE really facilitates research in sparse expert models.

Generally, this strategy categorizes experts into different groups, each of which has its own gating function for routing. It is compatible with conventional routing methods like Switch or top-2 routing, since you can simply set the group number to 1. We find that increasing the value of k in top-k can improve model performance, and that k top-1 (k groups, each routing with its own top-1 gate) achieves a similar effect. It is also possible to try more complex strategies, say k top-k' and so on.

We have a code snippet in the appendix, which may be helpful.
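
Just to make the idea concrete, here is a rough PyTorch sketch of what a k top-1 gate might look like. This is my own illustration, not the snippet from the paper's appendix, and the class and argument names (`PrototypeGate`, `num_groups`, etc.) are made up:

```python
import torch
import torch.nn as nn


class PrototypeGate(nn.Module):
    """Illustrative sketch of 'k top-1' expert prototyping:
    experts are split into num_groups groups, each group owns its own
    top-1 gate, and every token activates one expert per group."""

    def __init__(self, d_model, num_experts, num_groups):
        super().__init__()
        assert num_experts % num_groups == 0
        self.num_groups = num_groups
        self.group_size = num_experts // num_groups
        # one linear gate per group (prototype)
        self.gates = nn.ModuleList(
            [nn.Linear(d_model, self.group_size) for _ in range(num_groups)]
        )

    def forward(self, x):
        # x: (tokens, d_model)
        indices, weights = [], []
        for g, gate in enumerate(self.gates):
            logits = gate(x)                      # (tokens, group_size)
            probs = torch.softmax(logits, dim=-1)
            w, idx = probs.max(dim=-1)            # top-1 within this group
            # map the local index back to a global expert id
            indices.append(idx + g * self.group_size)
            weights.append(w)
        # (tokens, num_groups) selected expert ids and their gate weights
        return torch.stack(indices, dim=-1), torch.stack(weights, dim=-1)
```

In FastMoE this would presumably be plugged in wherever the standard top-k gate lives, with the selected indices and weights fed into the usual expert dispatch; the exact gate interface would of course need to be checked against the FastMoE code.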

@xptree added the enhancement (New feature or request) label on Aug 23, 2021
@xptree (Collaborator) commented on Sep 6, 2021

Here is another recent work on MoE.

DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning
https://arxiv.org/abs/2106.03760

The idea is to activate all experts at the beginning of training but quickly converge to sparse activation. I wonder whether such a mechanism could help train better pre-trained models when our expert pool is not that large.
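
Not the DSelect-k algorithm itself (which is built on a smooth-step based differentiable selection), just a toy sketch of the dense-to-sparse behavior described above, with made-up names, in case it helps the discussion:

```python
import torch
import torch.nn as nn


class DenseToSparseGate(nn.Module):
    """Toy illustration (NOT DSelect-k): start with a high-temperature,
    nearly uniform softmax over all experts, then anneal the temperature
    so the routing distribution sharpens toward near top-1."""

    def __init__(self, d_model, num_experts, start_temp=5.0, min_temp=0.1):
        super().__init__()
        self.proj = nn.Linear(d_model, num_experts)
        self.start_temp = start_temp
        self.min_temp = min_temp

    def forward(self, x, progress):
        # progress in [0, 1]: fraction of training completed.
        # Early on (progress ~ 0) the weights are spread over many experts;
        # later they concentrate on a few, approximating sparse routing.
        temp = max(self.min_temp, self.start_temp * (1.0 - progress))
        return torch.softmax(self.proj(x) / temp, dim=-1)
```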

Let me know what you think about it.
