ncclCommInitRank failed: unhandled system error with 3 local nodes (HOROVOD_GPU_OPERATIONS=NCCL, 0.20.3), hang with 4 nodes (HOROVOD_GPU_OPERATIONS=NCCL 0.20.3, HOROVOD_GPU_BROADCAST=NCCL 0.19.5) #2395

Closed
bioothod opened this issue Oct 22, 2020 · 5 comments
Comments

@bioothod

Environment:

  1. Framework: (TensorFlow, Keras, PyTorch, MXNet) tensorflow
  2. Framework version: 2.3.0
  3. Horovod version: 0.20.3
  4. MPI version: openmpi 4.0.2
  5. CUDA version: 10.2
  6. NCCL version: 2.7.8 (2.7.8-1+cuda11.0 in ubuntu 18.04.3)
  7. Python version: 3.6.9
  8. Spark / PySpark version:
  9. OS and version: ubuntu 18.04.3, standard tensorflow container
  10. GCC version: 7.5.0
  11. CMake version: 3.10.2

Checklist:

  1. Did you search issues to find if somebody asked this question before?

    Yes, but these were left unanswered:
    #2255 (comment)
    #1651 (comment)

  2. If your question is about a hang, did you read this doc?
    It is not a hang

  3. If your question is about Docker, did you read this doc?
    No, it is not about Docker

  4. Did you check if your question is answered in the troubleshooting guide?
    It is not listed there

Bug report:

Running horovodrun -np 3 -H localhost:3 python keras_mnist_advanced.py immediately raises an ncclCommInitRank failed: unhandled system error exception with the NCCL-enabled build of Horovod (HOROVOD_GPU_OPERATIONS=NCCL pip3 install horovod):

[1,2]<stderr>:tensorflow.python.framework.errors_impl.UnknownError:  ncclCommInitRank failed: unhandled system error
[1,2]<stderr>:   [[{{node PartitionedCall/DistributedSGD_Allreduce/cond_213/then/_2133/DistributedSGD_Allreduce/cond_213/HorovodAllreduce_grads_213_0}}]] [Op:__inference_train_function_14460]

Full trace attached: trace.txt

Running the same command with 2 local nodes works well; HOROVOD_GPU_OPERATIONS=NCCL with Horovod 0.20.3 performs significantly better (5-6 times faster) than the HOROVOD_GPU_ALLREDUCE=NCCL + HOROVOD_GPU_BROADCAST=NCCL build of 0.19.5.

But here comes a second bug, which may be related: running either the 0.20.3 or the 0.19.5 version (with HOROVOD_GPU_BROADCAST=NCCL) with 4 local nodes gets stuck in the initial weight broadcast.
It stays stuck forever (2, sometimes 3, processes each consume 100-200% of a CPU), so it looks like GPU-enabled broadcasting only works with 2 nodes, but this may be a different issue.
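
For anyone reproducing this, more detail on the NCCL initialization failure can usually be captured by enabling NCCL debug logging before the run (a sketch; this assumes the variable is forwarded to the local worker processes, which is the case for a single-host run):

NCCL_DEBUG=INFO horovodrun -np 3 -H localhost:3 python keras_mnist_advanced.py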

@bioothod bioothod added the bug label Oct 22, 2020
@bioothod bioothod changed the title ncclCommInitRank failed: unhandled system error with 3 local nodes, hang with 4 nodes (with HOROVOD_GPU_BROADCAST=NCCL) ncclCommInitRank failed: unhandled system error with 3 local nodes (HOROVOD_GPU_OPERATIONS=NCCL, 0.20.3), hang with 4 nodes (HOROVOD_GPU_OPERATIONS=NCCL 0.20.3, HOROVOD_GPU_BROADCAST=NCCL 0.19.5) Oct 22, 2020
@tgaddair tgaddair added question and removed bug labels Oct 23, 2020
@tgaddair
Collaborator

Hey @bioothod, can you also share the output of nvidia-smi here?

@bioothod
Author

Sure

nvidia-smi 
Fri Oct 23 17:06:15 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.26       Driver Version: 430.26       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN RTX           Off  | 00000000:18:00.0 Off |                  N/A |
| 43%   39C    P0    60W / 280W |      0MiB / 24220MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN RTX           Off  | 00000000:3B:00.0 Off |                  N/A |
| 49%   44C    P0    71W / 280W |      0MiB / 24220MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
|   2  TITAN RTX           Off  | 00000000:86:00.0 Off |                  N/A |
| 40%   43C    P0    68W / 280W |      0MiB / 24220MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
|   3  TITAN RTX           Off  | 00000000:AF:00.0 Off |                  N/A |
| 22%   41C    P0     1W / 280W |      0MiB / 24220MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

@bioothod
Author

bioothod commented Nov 6, 2020

Hi, any progress on this? Do you need more information, debugging, or tests?

@tgaddair
Collaborator

tgaddair commented Nov 6, 2020

Hey @bioothod, going through the logs, the relevant bit appears to be here:

[1,0]<stdout>:localhost:14339:14471 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
[1,0]<stdout>:localhost:14339:14471 [0] NCCL INFO include/shm.h:41 -> 2
[1,0]<stdout>:
[1,0]<stdout>:localhost:14339:14471 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-64e738791311722a-0-2-0 (size 9637888)

Seems very similar to NVIDIA/nccl#290. Can you take a look at that issue and see if the suggestions apply to your environment?
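
For context, that NCCL issue points at the container's /dev/shm being too small for NCCL's shared-memory transport; Docker's default shared-memory size is 64 MB, which is easily exhausted once several ranks allocate segments like the ~9.6 MB one in the warning above. The space available inside the container can be checked with:

df -h /dev/shm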

@bioothod
Author

bioothod commented Nov 9, 2020

Yes, the --shm-size= option fixes this.
Thanks a lot!
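
For anyone hitting the same error: a sketch of launching the standard TensorFlow container with a larger shared-memory segment (the image tag and the 1g size are illustrative; pick a size large enough for the number of ranks):

docker run --gpus all --shm-size=1g -it tensorflow/tensorflow:2.3.0-gpu bash

Alternatively, --ipc=host avoids the limit entirely by sharing the host's /dev/shm with the container.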
