ncclCommInitRank failed: unhandled system error with 3 local nodes (HOROVOD_GPU_OPERATIONS=NCCL, 0.20.3), hang with 4 nodes (HOROVOD_GPU_OPERATIONS=NCCL 0.20.3, HOROVOD_GPU_BROADCAST=NCCL 0.19.5) #2395

Closed
bioothod opened this issue Oct 22, 2020 · 5 comments
Comments

@bioothod

Environment:

  1. Framework: (TensorFlow, Keras, PyTorch, MXNet) tensorflow
  2. Framework version: 2.3.0
  3. Horovod version: 0.20.3
  4. MPI version: openmpi 4.0.2
  5. CUDA version: 10.2
  6. NCCL version: 2.7.8 (2.7.8-1+cuda11.0 in ubuntu 18.04.3)
  7. Python version: 3.6.9
  8. Spark / PySpark version:
  9. OS and version: ubuntu 18.04.3, standard tensorflow container
  10. GCC version: 7.5.0
  11. CMake version: 3.10.2

Checklist:

  1. Did you search issues to find if somebody asked this question before?

    Yes, but these were left unanswered:
    #2255 (comment)
    #1651 (comment)

  2. If your question is about a hang, did you read this doc?
    It is not a hang

  3. If your question is about Docker, did you read this doc?
    No, it is not about Docker

  4. Did you check if your question is answered in the troubleshooting guide?
    It is not listed there

Bug report:

Running horovodrun -np 3 -H localhost:3 python keras_mnist_advanced.py immediately raises an ncclCommInitRank failed: unhandled system error exception with the NCCL-enabled build of Horovod (HOROVOD_GPU_OPERATIONS=NCCL pip3 install horovod):

[1,2]<stderr>:tensorflow.python.framework.errors_impl.UnknownError:  ncclCommInitRank failed: unhandled system error
[1,2]<stderr>:   [[{{node PartitionedCall/DistributedSGD_Allreduce/cond_213/then/_2133/DistributedSGD_Allreduce/cond_213/HorovodAllreduce_grads_213_0}}]] [Op:__inference_train_function_14460]

Full trace attached: trace.txt

Running the same command with 2 local nodes works well; HOROVOD_GPU_OPERATIONS=NCCL with Horovod 0.20.3 performs significantly better (5-6 times faster) than the HOROVOD_GPU_ALLREDUCE=NCCL + HOROVOD_GPU_BROADCAST=NCCL build of 0.19.5.

But here comes a second bug, which may be related: running either the 0.20.3 or the 0.19.5 version (with HOROVOD_GPU_BROADCAST=NCCL) with 4 local nodes gets stuck in the initial weight broadcast.
It stays stuck forever (2, sometimes 3, processes each consume 100-200% of a CPU), so it looks like GPU-enabled broadcasting only works with 2 nodes, but this may be a different issue.
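
For anyone reproducing this, more detail on the NCCL initialization failure can usually be captured by enabling NCCL debug logging before the run (a sketch; this assumes the variable is forwarded to the local worker processes, which is the case for a single-host run):

NCCL_DEBUG=INFO horovodrun -np 3 -H localhost:3 python keras_mnist_advanced.py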

@bioothod bioothod added the bug label Oct 22, 2020
@bioothod bioothod changed the title ncclCommInitRank failed: unhandled system error with 3 local nodes, hang with 4 nodes (with HOROVOD_GPU_BROADCAST=NCCL) ncclCommInitRank failed: unhandled system error with 3 local nodes (HOROVOD_GPU_OPERATIONS=NCCL, 0.20.3), hang with 4 nodes (HOROVOD_GPU_OPERATIONS=NCCL 0.20.3, HOROVOD_GPU_BROADCAST=NCCL 0.19.5) Oct 22, 2020
@tgaddair tgaddair added question and removed bug labels Oct 23, 2020
@tgaddair
Collaborator

Hey @bioothod, can you also share the output of nvidia-smi here?

@bioothod
Author

Sure

nvidia-smi 
Fri Oct 23 17:06:15 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.26       Driver Version: 430.26       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN RTX           Off  | 00000000:18:00.0 Off |                  N/A |
| 43%   39C    P0    60W / 280W |      0MiB / 24220MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN RTX           Off  | 00000000:3B:00.0 Off |                  N/A |
| 49%   44C    P0    71W / 280W |      0MiB / 24220MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
|   2  TITAN RTX           Off  | 00000000:86:00.0 Off |                  N/A |
| 40%   43C    P0    68W / 280W |      0MiB / 24220MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
|   3  TITAN RTX           Off  | 00000000:AF:00.0 Off |                  N/A |
| 22%   41C    P0     1W / 280W |      0MiB / 24220MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

@bioothod
Author

bioothod commented Nov 6, 2020

Hi, any progress on this? Do you need more information, debugging, or tests?

@tgaddair
Collaborator

tgaddair commented Nov 6, 2020

Hey @bioothod, going through the logs, the relevant bit appears to be here:

[1,0]<stdout>:localhost:14339:14471 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
[1,0]<stdout>:localhost:14339:14471 [0] NCCL INFO include/shm.h:41 -> 2
[1,0]<stdout>:
[1,0]<stdout>:localhost:14339:14471 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-64e738791311722a-0-2-0 (size 9637888)

Seems very similar to NVIDIA/nccl#290. Can you take a look at that issue and see if the suggestions apply to your environment?
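
For context, that NCCL issue points at the container's /dev/shm being too small for NCCL's shared-memory transport; Docker's default shared-memory size is 64 MB, which is easily exhausted once several ranks allocate segments like the ~9.6 MB one in the warning above. The space available inside the container can be checked with:

df -h /dev/shm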

@bioothod
Author

bioothod commented Nov 9, 2020

Yes, the --shm-size= option fixes this.
Thanks a lot!
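
For anyone hitting the same error: a sketch of launching the standard TensorFlow container with a larger shared-memory segment (the image tag and the 1g size are illustrative; pick a size large enough for the number of ranks):

docker run --gpus all --shm-size=1g -it tensorflow/tensorflow:2.3.0-gpu bash

Alternatively, --ipc=host avoids the limit entirely by sharing the host's /dev/shm with the container.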
