
[Question] Comparison to FasterMoE #232

Open
Guodanding opened this issue Apr 17, 2024 · 4 comments

Comments

@Guodanding

Hello! I am new to MoE, and I am interested in the following question:

What do you think are the differences between Tutel (or Megatron-DeepSpeed, which uses dp+tp+ep in MoE layers) and FastMoE/FasterMoE? In my opinion, Tutel is better at scalability, as it uses a fixed but searchable parallel solution, while FasterMoE is more elegant and fine-grained but not as scalable, because its fine-grained design introduces extra communication (the cost of shadowing) and disrupts the ep+tp+dp communication pattern. (I am not sure about this.) And maybe in a limited-resource situation, FasterMoE can do better?

Please correct me if I misunderstand something! :)

@ghostplant
Contributor

ghostplant commented Apr 17, 2024

The main difference is what assumption each one is based on.

The assumption of Tutel MoE is, essentially, no assumption: it allows switching execution approaches at runtime without affecting the designed accuracy, and with no extra penalty for any switch.
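To illustrate the idea (a minimal single-process sketch, not Tutel's actual code; all names and shapes here are illustrative): two math-equivalent execution plans for the same expert computation return identical tensors, so a runtime can pick whichever is faster for the current step without touching accuracy.

```python
# Hedged sketch: two "execution plans" for the same MoE expert computation.
import torch

E, C, D = 4, 8, 16                       # experts, capacity per expert, hidden dim
x = torch.randn(E, C, D)                 # tokens already dispatched to their experts
w = torch.randn(E, D, D)                 # one weight matrix per expert

def plan_looped(x, w):
    # plan A: run experts one by one (lower peak memory, more kernel launches)
    return torch.stack([x[e] @ w[e] for e in range(E)])

def plan_batched(x, w):
    # plan B: run all experts in one batched matmul (fewer launches, more memory)
    return torch.bmm(x, w)

assert torch.allclose(plan_looped(x, w), plan_batched(x, w), atol=1e-4)
# Because the plans are math-equivalent, choosing between them per step (e.g.
# via a cost model) is purely a throughput decision and never changes accuracy.
```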

FasterMoE assumes it can make expert data migration decisions that may take extra time to complete; thus, when an MoE-based model tends to choose a fixed subset of experts, this migration decision and its cost can pay off.
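A rough sketch of that shadowing trade-off (illustrative only, not FasterMoE's code): replicating a "hot" expert's weights on every rank pays off when shipping the weights once costs less than shipping that expert's incoming tokens across ranks every step.

```python
# Hedged sketch of the shadowing decision (illustrative, not FasterMoE code):
# shadow an expert when broadcasting its weights is cheaper than moving its
# incoming tokens across ranks each step. Ignores gradient-sync overhead.
def should_shadow(tokens_to_expert: int, hidden_dim: int,
                  expert_param_bytes: int, world_size: int,
                  bytes_per_elem: int = 2) -> bool:
    token_traffic = tokens_to_expert * hidden_dim * bytes_per_elem    # per step
    weight_traffic = expert_param_bytes * (world_size - 1)            # one-off broadcast
    return weight_traffic < token_traffic

# e.g. a very popular expert: 200k tokens of dim 4096 vs. a ~50 MB expert on 16 ranks
print(should_shadow(200_000, 4096, 50 * 2**20, 16))   # True -> shadowing wins here
```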

FasterMoE also assumes it can occasionally do intra-node all2all even when cross-node all2all is expected, because that saves inter-node bandwidth and thus yields better throughput. The penalty is that some models may see an accuracy drop, since less information is exchanged across nodes.
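The intra-node idea, again as a hedged illustration rather than FasterMoE's actual implementation: if the gate only considers experts hosted on the local node, the all2all stays inside the node and saves inter-node bandwidth, at the risk of the accuracy drop mentioned above.

```python
import torch

def local_top1(logits: torch.Tensor, node_rank: int, experts_per_node: int):
    # Mask out experts hosted on other nodes, then pick the best local expert.
    lo = node_rank * experts_per_node
    masked = torch.full_like(logits, float("-inf"))
    masked[:, lo:lo + experts_per_node] = logits[:, lo:lo + experts_per_node]
    return masked.argmax(dim=-1)

logits = torch.randn(10, 8)                                  # 10 tokens, 8 global experts
print(local_top1(logits, node_rank=1, experts_per_node=4))   # ids stay in [4, 8)
```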

@Guodanding
Author


Does that mean Tutel focuses on the more general situation, while FasterMoE focuses on, and does better in, a special situation?

@ghostplant
Contributor

ghostplant commented Apr 18, 2024

Tutel only integrates math-equivalent optimizations for the standard MoE algorithm, while FasterMoE explores algorithm-wise changes as well as data-wise prediction of expert selection, expecting both ideas together to achieve shorter end-to-end training time with comparable accuracy. In other words, the gain from Tutel benefits the general situation for sure, while the gain from FasterMoE depends on experimental factors, e.g. predictor accuracy, weight migration penalty, dataset specialty, all2all differences, etc. When these factors work well together, it can be a lot faster than the standard MoE algorithm.
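A back-of-envelope way to see how those factors interact (my own toy model, not from either paper): the gain only materializes when the predictor is usually right and each migration is cheap relative to the time it saves.

```python
# Toy model (illustrative assumption, not from Tutel or FasterMoE): expected
# step time with prediction-driven migration. "hit_rate" is predictor accuracy.
def expected_step_ms(base_ms, saved_ms_per_hit, hit_rate, migration_ms, migrations_per_step):
    return base_ms - hit_rate * saved_ms_per_hit + migrations_per_step * migration_ms

print(expected_step_ms(100.0, 30.0, hit_rate=0.9, migration_ms=2.0, migrations_per_step=2))  # 77.0  -> faster
print(expected_step_ms(100.0, 30.0, hit_rate=0.2, migration_ms=8.0, migrations_per_step=2))  # 110.0 -> slower
```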

@Guodanding
Author


I get the point. Thanks!

Regarding "FasterMoE explores algorithm-wise changes as well as data-wise prediction of expert selection": do the algorithm-wise changes refer to the topology-aware gate, and does the data-wise prediction refer to shadow experts? If so, the shadow policy is decided after gating, so maybe it is not really a prediction.

By the way, since Tutel and FasterMoE (along with others like SmartMoE, MegaBlocks, Janus) emerged in 2022-2023, are there any newer state-of-the-art frameworks designed to accelerate MoE training? What are the remaining challenges in MoE training? And what type of framework does industry prefer for training MoE?

Thanks :)!
