The program hangs at the forward function when using model parallel in Megatron-LM #58
Comments
Is it possible that two NCCL calls happen concurrently when enabling the pipeline for Megatron? @laekov (i.e., Megatron calls send/recv for pipelining, and fastmoe calls send/recv for its all-to-all) |
@ymjiang We haven't officially tested fmoe with pipeline parallelism. The communication of pipeline mp happens between transformer layers, while the communication of fmoe happens within each transformer layer, so it is a little bit weird to me that they would call NCCL concurrently. There could be other possible reasons, e.g., the CUDA version. @laekov have we tested fmoe on CUDA 11? |
I suppose a case like this: Node-1 just received the inputs (using NCCL) from its upstream Node-0 and begins fmoe's NCCL call. Meanwhile, Node-1 continuously waits for new inputs from Node-0 (still using NCCL). I think this is likely to happen, given that Megatron v2.2 already enables pipeline parallelism by breaking data into micro-batches. That said, I haven't dived into the code further, so it is just a guess. |
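The guessed scenario can be illustrated with a toy model (invented here, not FastMoE or NCCL code): each blocking NCCL call only completes when its peer posts the matching call, so two ranks that disagree on the order of operations block each other forever.

```python
# Toy model of the suspected hang: blocking collectives must be posted in
# the same relative order on both ranks, or neither can make progress.

def matches(a, b):
    """A send pairs with a recv; an all-to-all pairs with an all-to-all."""
    return (a, b) in {("send", "recv"), ("recv", "send"),
                      ("alltoall", "alltoall")}

def deadlocks(rank0_ops, rank1_ops):
    """Return True if the two blocking-op sequences can never complete."""
    i = j = 0
    while i < len(rank0_ops) and j < len(rank1_ops):
        if matches(rank0_ops[i], rank1_ops[j]):
            i += 1  # both calls matched; both ranks proceed
            j += 1
        else:
            return True  # mismatched heads: both ranks wait forever
    return False

# Node-0 enters fmoe's all-to-all before sending the next micro-batch,
# while Node-1 first waits to receive that micro-batch: nothing matches.
stuck = deadlocks(["alltoall", "send"], ["recv", "alltoall"])
```

This crude two-rank model ignores NCCL streams and communicator details, but it captures why interleaving pipeline send/recv with an expert all-to-all can hang when the ordering is not globally consistent.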
Have you tested the model parameters on each machine when model parallelism is enabled? @xptree @ymjiang @laekov In actual use, I found some strange behavior:
But after fastmoe is enabled (the number of experts is 12),
|
I am running fastmoe with cuda@11 and nccl@2.9.9, so it should not be a CUDA issue. @xptree FastMoE currently does not support pipeline parallelism in Megatron. I am not sure about its behavior. I will inspect it now and see if we can support it without much burden. @ymjiang @seanM29 For tensor model parallel, the experts are not divided into pieces like what Megatron does. Each GPU holds different experts instead. However, the attention layer is partitioned in the same way as Megatron. So, in your observation, I suppose |
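The expert layout described above can be made concrete with a small sketch (illustrative only, not FastMoE's actual code): each rank owns a disjoint set of whole experts, rather than a slice of one weight matrix as in Megatron's tensor parallelism.

```python
def expert_placement(num_expert, world_size):
    """Global expert indices held by each rank: FastMoE keeps experts
    whole and gives every rank its own disjoint set, instead of slicing
    one weight matrix across GPUs as Megatron's tensor parallelism does."""
    return {rank: list(range(rank * num_expert, (rank + 1) * num_expert))
            for rank in range(world_size)}

placement = expert_placement(num_expert=2, world_size=4)
# e.g. rank 0 holds experts [0, 1] and rank 3 holds experts [6, 7]
```

This is why the per-GPU parameter counts differ from plain Megatron: the MLP parameters are not sharded, only distributed expert-by-expert.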
I think I have an idea of why it gets stuck. Given that we have data parallel (DP), tensor model parallel (MP) and pipeline parallel (PP), |
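The comment is cut off here, but the DP/MP/PP split it refers to can be sketched with Megatron-style rank grouping (a hypothetical sketch; the real code builds `torch.distributed` process groups in `initialize_model_parallel`):

```python
def parallel_groups(world_size, tp, pp):
    """Sketch of Megatron-style rank grouping: tensor-parallel ranks are
    adjacent, pipeline peers are strided across the world, and
    data-parallel replicas share a (tp index, pp stage) position."""
    assert world_size % (tp * pp) == 0
    block = world_size // pp                      # ranks per pipeline stage
    tp_groups = [list(range(g * tp, (g + 1) * tp))
                 for g in range(world_size // tp)]
    pp_groups = [list(range(i, world_size, block))
                 for i in range(block)]
    dp_groups = [list(range(s * block + j, (s + 1) * block, tp))
                 for s in range(pp) for j in range(tp)]
    return tp_groups, pp_groups, dp_groups

# 8 GPUs, tensor parallel 2, pipeline parallel 2 -> data parallel 2:
tp_g, pp_g, dp_g = parallel_groups(8, 2, 2)
```

A hang is easy to produce if fmoe's all-to-all spans one of these groups while a pipeline send/recv spans another and the two orderings interleave differently on different ranks.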
Should be resolved in #59 |
Thank you very much for your reply, but looking at the pictures in the README, FastMoE supports different experts and puts them on different machines? If I want to split the experts into pieces like what Megatron does, do you have any suggestions?
In the current version, you have 3 ways to give experts to FastMoE: you give it a single expert class, you give it a list of N experts, or you have a fused expert (see
I am not sure, but most likely you'll need to use a customized fused expert. I think it would cause some issues, though, since you might need to add some extra communication between the different shards. Someone with more experience with Megatron might be able to help you better.
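The three ways of supplying experts mentioned above could be normalized roughly like this (names and signatures invented for illustration; check FastMoE's own documentation for the real API):

```python
class Expert:
    """Stand-in for an expert module (in FastMoE these are torch.nn.Modules)."""
    def __init__(self, d_model):
        self.d_model = d_model

def build_experts(expert, num_expert, d_model):
    """Hypothetical normalization of the three forms: a single expert
    class (instantiated num_expert times), a list of N ready-made
    experts, or one fused module that handles all experts at once."""
    if isinstance(expert, list):                  # a list of N experts
        assert len(expert) == num_expert
        return expert, False
    if isinstance(expert, type):                  # a single expert class
        return [expert(d_model) for _ in range(num_expert)], False
    return expert, True                           # one fused expert module

experts, fused = build_experts(Expert, 4, 16)     # class -> 4 instances
```

A customized fused expert would correspond to the third branch: one module that internally decides how its parameters are laid out, which is also where Megatron-style sharding (and its extra communication) would have to live.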
Megatron-LM uses column and row partitioning for the two layers of the MLP, which is equivalent to activating a group of MLPs (or experts) at the same time. The easiest way of doing this is to develop a new gate with a smaller number of logical experts and activate a group of experts instead of one specific expert by
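The grouped-gate idea can be sketched as follows (a hypothetical sketch of the suggestion, not FastMoE code): the gate chooses among fewer logical experts, and each logical choice fans out to a whole group of physical experts, mimicking how Megatron's sliced MLP partitions always run together.

```python
def group_gate(logical_choice, experts_per_group):
    """Map one logical expert index to the physical experts it activates:
    picking logical expert g dispatches the token to every physical
    expert in group g."""
    start = logical_choice * experts_per_group
    return list(range(start, start + experts_per_group))

# With 32 physical experts in groups of 4, the gate sees 8 logical experts;
# choosing logical expert 2 activates physical experts 8..11.
chosen = group_gate(2, 4)
```

The remaining work, as noted above, is combining the partial results of the group, which is the extra communication a Megatron-style sharded expert needs.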
I saw some public reports that Wenlan used fastmoe to reach 1.75 trillion parameters. @laekov @TiagoMAntunes In my experiment, on a V100, with a 345-million-parameter BERT and 12 experts, the GPU memory is basically full, and it only reached 1.3 billion parameters. |
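As a rough sanity check on such parameter counts, one can use a back-of-the-envelope formula (an assumption, not a measurement: it supposes only the FFN part of each layer is replicated per expert, and the FFN fraction is a guess; the observed 1.3B suggests the real setup replicates less than this naive estimate, e.g. MoE in only some layers):

```python
def moe_param_count(base_params, ffn_fraction, num_expert):
    """Naive estimate: an MoE model replicates only the FFN part of each
    layer, so total = non-FFN params + num_expert * FFN params.
    ffn_fraction is an assumed value, not read from any model config."""
    ffn = base_params * ffn_fraction
    return (base_params - ffn) + num_expert * ffn

# If roughly 2/3 of a dense transformer's parameters sit in the FFNs,
# a 345M model with 12 experts per FFN lands in the low billions.
est = moe_param_count(345e6, 2 / 3, 12)
```

Reaching trillions of parameters this way relies on the experts being spread over many GPUs, since each GPU only stores its own experts.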
They are using a private version of FastMoE and Megatron on SunWay platform, which is quite different from NVIDIA's stuff. |
Thank you very much for your reply @laekov So sorry to disturb you again. I want to confirm: when setting num_expert=4 and using 8 GPUs, self.tot_expert in BaseGate should be 8*4=32, which seems to mean 32 different experts. Does the whole network have 4*8=32 different experts, or are there still only 4? |
exactly
yup, there are 32 experts in total |
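So the arithmetic confirmed above is simply (an illustrative restatement, not BaseGate's source):

```python
def total_experts(num_expert, world_size):
    """tot_expert in BaseGate: every one of the world_size workers holds
    num_expert distinct experts, and none of them are replicated."""
    return num_expert * world_size

# num_expert=4 on 8 GPUs -> 32 different experts across the whole network
n = total_experts(4, 8)
```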
Thanks for your work!!! I love it very much!!
I met a problem; hope you can help me. Thanks a lot!