nccl.h is not found or ncclUnhandledCudaError: Call to CUDA function failed #119
Comments
So, there are two issues. The first one is that FastMoE cannot find nccl.h, and you addressed that by installing NCCL. Then, PyTorch gets into trouble with its NCCL AllGather operator. You can first check whether PyTorch's distributed.all_gather works in a minimal reproduction script without FastMoE, along the lines of the sketch below.
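A minimal sketch of such a check, assuming 2 GPUs and the nccl backend (the world size and tensor shapes are illustrative, not from this thread):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # Each process binds to one GPU and joins the NCCL process group.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # all_gather collects one tensor from every rank into a list.
    x = torch.full((4,), float(rank), device="cuda")
    out = [torch.empty_like(x) for _ in range(world_size)]
    dist.all_gather(out, x)
    print(f"rank {rank} gathered: {[t[0].item() for t in out]}")

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2  # illustrative; set to your GPU count
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```

If this already fails with ncclUnhandledCudaError, the problem is in the PyTorch/NCCL/driver setup rather than in FastMoE.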
To see the NCCL debug info, you are supposed to set that environment variable in the environment of the processes that actually run the test, before they are spawned.
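For instance, a short sketch (assuming the test is launched from Python; NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL variables):

```python
import os

# NCCL prints initialization and transport diagnostics when this is set.
os.environ["NCCL_DEBUG"] = "INFO"
# Optionally narrow the output to specific subsystems.
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,NET"
```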
How many GPUs do you have? The default device_count in the test is 4:

fastmoe/tests/test_ddp.py, line 28 in 670e140:
device_count = 4
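If you have fewer GPUs, a quick sanity check along these lines may help (a sketch; device_count is the variable from the test, the clamping is illustrative):

```python
import torch

# The test spawns one worker per device; it fails with CUDA errors
# when device_count exceeds the GPUs actually visible.
available = torch.cuda.device_count()
device_count = min(4, available)  # clamp the test's default to the real count
print(f"visible GPUs: {available}, using: {device_count}")
```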
We also tried using Docker to work around the problem.
I finally begin to understand the issue. We updated the distributed parameter initialization with a broadcast in https://github.com/laekov/fastmoe/blob/master/fmoe/distributed.py#L100, which is not correct: in PyTorch's distributed module, you are supposed to pass a global rank as the broadcast source, even when a process group is given. I will have that fixed later today.
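For reference, a minimal sketch of the pitfall, assuming a process group covering a subset of ranks (the helper name and group layout are illustrative, not FastMoE's actual code):

```python
import torch.distributed as dist

def broadcast_params(module, group, group_root=0):
    """Broadcast parameters from the group's root to the other members.

    Pitfall: dist.broadcast's `src` is a GLOBAL rank, not a rank local
    to `group`, even when `group` is passed.
    """
    # Translate the group-local root into its global rank. (Recent
    # PyTorch exposes dist.get_global_rank(group, group_rank) for this.)
    global_src = dist.get_global_rank(group, group_root)
    for p in module.parameters():
        dist.broadcast(p.data, src=global_src, group=group)
```

Passing the group-local rank as `src` silently targets the wrong process (or an invalid one), which can surface as NCCL errors like the one reported here.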
Describe the bug
'nccl.h' file is not found or ncclUnhandledCudaError: Call to CUDA function failed
To Reproduce
Steps to reproduce the behavior:
Logs
Try to fix
Platform
Additional context
Could some necessary environment variables be lost when the test processes are spawned via subprocess.Popen?
fastmoe/tests/test_ddp.py, line 44 in 670e140
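A sketch of preserving the environment when spawning (the command is illustrative; test_ddp.py's actual invocation may differ):

```python
import os
import subprocess

# Copy the parent environment so CUDA/NCCL settings (e.g. LD_LIBRARY_PATH,
# CUDA_VISIBLE_DEVICES, NCCL_DEBUG) survive in the child process.
env = os.environ.copy()
env["NCCL_DEBUG"] = "INFO"

proc = subprocess.Popen(
    ["python", "tests/test_ddp.py"],  # illustrative command
    env=env,  # passing a partial dict instead would silently drop
              # variables that NCCL and the CUDA runtime need
)
proc.wait()
```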