ncclCommInitRank failed: unhandled system error with 3 local nodes (HOROVOD_GPU_OPERATIONS=NCCL, 0.20.3), hang with 4 nodes (HOROVOD_GPU_OPERATIONS=NCCL 0.20.3, HOROVOD_GPU_BROADCAST=NCCL 0.19.5)
#2395
bioothod changed the title on Oct 22, 2020
from: ncclCommInitRank failed: unhandled system error with 3 local nodes, hang with 4 nodes (with HOROVOD_GPU_BROADCAST=NCCL)
to: ncclCommInitRank failed: unhandled system error with 3 local nodes (HOROVOD_GPU_OPERATIONS=NCCL, 0.20.3), hang with 4 nodes (HOROVOD_GPU_OPERATIONS=NCCL 0.20.3, HOROVOD_GPU_BROADCAST=NCCL 0.19.5)
Hey @bioothod, can you also share the output of
Sure
Hi, any progress on this? Do you need more information, debugging, or tests?
Hey @bioothod, going through the logs, the relevant bit appears to be here:
Seems very similar to NVIDIA/nccl#290. Can you take a look at that issue and see if the suggestions apply to your environment?
Yes,
Environment:
Checklist:
Yes, you didn't answer:
#2255 (comment)
#1651 (comment)
If your question is about hang, did you read this doc?
It is not a hang
If your question is about docker, did you read this doc?
No, it is not
Did you check if your question is answered in the troubleshooting guide?
It is not listed there
Bug report:
Running
horovodrun -np 3 -H localhost:3 python keras_mnist_advanced.py
immediately raises an "ncclCommInitRank failed: unhandled system error" exception with NCCL-enabled Horovod (installed with HOROVOD_GPU_OPERATIONS=NCCL pip3 install horovod). The full trace is attached: trace.txt
Running the same command with 2 nodes works very well. Horovod 0.20.3 built with HOROVOD_GPU_OPERATIONS=NCCL performs significantly better (5-6 times faster) than version 0.19.5 built with HOROVOD_GPU_ALLREDUCE=NCCL + HOROVOD_GPU_BROADCAST=NCCL.
There is also a second bug, which may be related: running either version 0.20.3 or 0.19.5 (with HOROVOD_GPU_BROADCAST=NCCL) with 4 local nodes gets stuck in the initial weight broadcast. It stays stuck forever (2, sometimes 3 nodes each use 100/200% of a CPU), so it looks like GPU-enabled broadcasting only works with 2 nodes, but this may be a separate issue.
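For anyone trying to reproduce the three cases above (2, 3 and 4 local processes), the following is a minimal sketch that only builds the launch commands and an environment with NCCL's NCCL_DEBUG=INFO logging variable set; it does not run anything. The helper name build_repro_command is hypothetical, and the sketch assumes that locally launched workers inherit the caller's environment, which may not hold for every setup.

```python
import os
import shlex


def build_repro_command(num_procs, script="keras_mnist_advanced.py"):
    """Return (env, argv) for a single-host horovodrun launch.

    Hypothetical helper for illustration; it mirrors the command
    from the report and adds NCCL_DEBUG=INFO to the environment.
    """
    env = dict(os.environ)
    # NCCL_DEBUG is NCCL's log-level variable; INFO prints init details.
    env["NCCL_DEBUG"] = "INFO"
    argv = [
        "horovodrun",
        "-np", str(num_procs),
        "-H", f"localhost:{num_procs}",
        "python", script,
    ]
    return env, argv


if __name__ == "__main__":
    # 2 processes: reported to work; 3: ncclCommInitRank error; 4: hang.
    for n in (2, 3, 4):
        _, argv = build_repro_command(n)
        print("NCCL_DEBUG=INFO " + shlex.join(argv))
```

The returned pair can be passed to subprocess.run(argv, env=env) on a machine with Horovod installed; the extra NCCL log lines are usually where an "unhandled system error" gets a more specific cause.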