[NCCL Error] Enable distributed expert feature #27
This may be because you did not initialize NCCL. Can you please provide a minimal script that can reproduce the error? Thanks.
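For reference, a minimal sketch of the NCCL initialization being asked about; the launcher environment variables assumed here come from standard `torchrun` usage and are not taken from this thread:

```python
# Minimal sketch: bring up torch.distributed with the NCCL backend before
# constructing any FastMoE layer that communicates across workers.
import os
import torch
import torch.distributed as dist

def setup_distributed():
    # torchrun / torch.distributed.launch export RANK, WORLD_SIZE, LOCAL_RANK
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
    return dist.get_rank(), dist.get_world_size()

if __name__ == "__main__":
    rank, world_size = setup_distributed()
    print(f"rank {rank}/{world_size} initialized with NCCL")
```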
"The repository is currently tested with PyTorch v1.8.0 and CUDA 10, with designed compatibility to older versions." |
We built a Docker image with PyTorch 1.8.0, CUDA 10.2, and NCCL 2.7.8, and we have verified that it can be used directly to install FastMoE with the distributed expert feature. It can be found on Docker Hub: co1lin/fastmoe:pytorch1.8.0-cuda10.2-cudnn7-nccl2708
Thanks for the docker image. I installed fastmoe with USE_NCCL=1, but when I run GPT-2 (L12-H768, intermediate size 1536, top-2) on an 8-GPU machine, the largest expert number I can reach is 32, whereas 96 experts are reported in the FastMoE paper. When I increase the expert number to 48 (batch size per GPU: 1), CUDA OOM occurs. It seems that the distributed expert feature was not activated. Do you have any suggestions?
The distributed experts feature is by default enabled in
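To illustrate what "distributed experts" implies for the numbers above, here is a hedged sketch assuming the FMoETransformerMLP constructor as shown in the FastMoE README, where num_expert counts experts per worker; the exact parameter names should be checked against the installed version:

```python
# Hedged sketch: with distributed experts, `num_expert` is the number of experts
# hosted on *each* worker, so the global count is num_expert * world_size.
# Assumes torch.distributed is already initialized with NCCL (see the sketch above).
import torch.distributed as dist
from fmoe import FMoETransformerMLP  # name assumed from the FastMoE README

world_size = dist.get_world_size()

# 96 experts in total on 8 GPUs -> 12 experts stored on each worker.
moe_mlp = FMoETransformerMLP(
    num_expert=96 // world_size,
    d_model=768,
    d_hidden=1536,
    world_size=world_size,  # > 1 turns on the cross-worker (NCCL) token exchange
    top_k=2,
).cuda()
```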
Hi,
I installed fastmoe using
USE_NCCL=1 python setup.py install
When I set "expert_dp_comm" to "dp", the training process runs fine. But when I set "expert_dp_comm" to "none" (i.e., each worker serves several unique expert networks), the process fails with an NCCL error:
NCCL Error at /home/h/code_gpt/fmoe-package/cuda/moe_comm_kernel.cu:29 value 4
I'm looking forward to your help!
My environment:
PyTorch 1.8
NCCL 2.8.3
CUDA 10.1
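For context, a minimal sketch of how the "expert_dp_comm" setting above is typically applied; the dp_comm attribute and the DistributedGroupedDataParallel class are assumptions based on the FastMoE source tree, so verify them against the installed version:

```python
# Hedged sketch: parameters tagged dp_comm="dp" are all-reduced like ordinary
# data-parallel weights, while dp_comm="none" keeps each worker's experts unique,
# so the forward pass exchanges tokens over NCCL and needs an initialized
# process group. Names below are assumptions from the FastMoE source.
from fmoe.distributed import DistributedGroupedDataParallel as DGDP

def wrap_with_expert_comm(model, expert_dp_comm="none"):
    # Hypothetical helper: tag expert parameters with the requested comm mode
    # and leave everything else as plain data parallel.
    for name, p in model.named_parameters():
        p.dp_comm = expert_dp_comm if "experts" in name else "dp"
    return DGDP(model)
```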