
Horovod hangs -- include/shm.h:42 NCCL WARN Cuda failure 'invalid argument' #893

Closed
maxhgerlach opened this issue Mar 7, 2019 · 6 comments


@maxhgerlach
Collaborator

Hi,

I am trying to get Horovod training jobs running on a cluster built from NVIDIA RTX 2080 Ti GPUs with InfiniBand interconnects.

Installed software:

  • Ubuntu 16.04.5 LTS
  • CUDA 10.0.130-1
  • NCCL 2.4.2-1+cuda10
  • Open MPI 3.1.2
  • Mellanox OFED 4.4-2.0.7.0
  • TensorFlow 1.13.1, custom built without XLA support
  • Horovod 0.16

This is the device topology (PCI express dual root):

$ nvidia-smi topo -m
	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	mlx4_0	CPU Affinity
GPU0	 X 	PIX	PIX	PIX	SYS	SYS	SYS	SYS	SYS	0-7,16-23
GPU1	PIX	 X 	PIX	PIX	SYS	SYS	SYS	SYS	SYS	0-7,16-23
GPU2	PIX	PIX	 X 	PIX	SYS	SYS	SYS	SYS	SYS	0-7,16-23
GPU3	PIX	PIX	PIX	 X 	SYS	SYS	SYS	SYS	SYS	0-7,16-23
GPU4	SYS	SYS	SYS	SYS	 X 	PIX	PIX	PIX	PIX	8-15,24-31
GPU5	SYS	SYS	SYS	SYS	PIX	 X 	PIX	PIX	PIX	8-15,24-31
GPU6	SYS	SYS	SYS	SYS	PIX	PIX	 X 	PIX	PIX	8-15,24-31
GPU7	SYS	SYS	SYS	SYS	PIX	PIX	PIX	 X 	PIX	8-15,24-31
mlx4_0	SYS	SYS	SYS	SYS	PIX	PIX	PIX	PIX	 X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks

Horovod works fine on a single machine, and also across multiple machines as long as fewer than four GPUs are used on each host. With four or more GPUs per host, however, Horovod hangs and NCCL emits warnings indicating a failure.

Here's an example; there is no progress after the last message:

horovod-0.16.0/examples$ mpirun --prefix /opt/openmpi/ -np 8 -H heinzel135:4,heinzel136:4 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib python tensorflow_mnist.py

# ...

INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Graph was finalized.
2019-03-07 14:06:11.418743: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:1e:00.0
totalMemory: 10.73GiB freeMemory: 10.53GiB
2019-03-07 14:06:11.418823: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 3
2019-03-07 14:06:11.436803: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:1d:00.0
totalMemory: 10.73GiB freeMemory: 10.53GiB
2019-03-07 14:06:11.436911: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 2
2019-03-07 14:06:11.443926: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:1b:00.0
totalMemory: 10.73GiB freeMemory: 10.53GiB
2019-03-07 14:06:11.444031: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-03-07 14:06:11.444763: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:1c:00.0
totalMemory: 10.73GiB freeMemory: 10.53GiB
2019-03-07 14:06:11.444866: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 1
2019-03-07 14:06:11.481025: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:1b:00.0
totalMemory: 10.73GiB freeMemory: 10.53GiB
2019-03-07 14:06:11.481102: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-03-07 14:06:11.481267: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:1e:00.0
totalMemory: 10.73GiB freeMemory: 10.53GiB
2019-03-07 14:06:11.481338: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 3
2019-03-07 14:06:11.481805: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:1c:00.0
totalMemory: 10.73GiB freeMemory: 10.53GiB
2019-03-07 14:06:11.481896: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 1
2019-03-07 14:06:11.481982: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:1d:00.0
totalMemory: 10.73GiB freeMemory: 10.53GiB
2019-03-07 14:06:11.482091: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 2
2019-03-07 14:06:11.643322: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-07 14:06:11.643368: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      2
2019-03-07 14:06:11.643377: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 2:   N
2019-03-07 14:06:11.643736: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10233 MB memory) -> physical GPU (device: 2, name: GeForce RTX 2080 Ti, pci bus id: 0000:1d:00.0, compute capability: 7.5)
2019-03-07 14:06:11.651778: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-07 14:06:11.651825: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0
2019-03-07 14:06:11.651853: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N
2019-03-07 14:06:11.652081: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-07 14:06:11.652123: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      1
2019-03-07 14:06:11.652130: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1:   N
2019-03-07 14:06:11.652199: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10233 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:1b:00.0, compute capability: 7.5)
2019-03-07 14:06:11.652466: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10233 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:1c:00.0, compute capability: 7.5)
2019-03-07 14:06:11.653144: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-07 14:06:11.653210: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      3
2019-03-07 14:06:11.653223: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 3:   N
2019-03-07 14:06:11.653666: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10233 MB memory) -> physical GPU (device: 3, name: GeForce RTX 2080 Ti, pci bus id: 0000:1e:00.0, compute capability: 7.5)
2019-03-07 14:06:11.725407: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-07 14:06:11.725458: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0
2019-03-07 14:06:11.725482: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N
2019-03-07 14:06:11.725800: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10233 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:1b:00.0, compute capability: 7.5)
2019-03-07 14:06:11.727268: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-07 14:06:11.727301: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      1
2019-03-07 14:06:11.727309: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1:   N
2019-03-07 14:06:11.727627: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10233 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:1c:00.0, compute capability: 7.5)
2019-03-07 14:06:11.729763: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-07 14:06:11.729805: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      3
2019-03-07 14:06:11.729813: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 3:   N
2019-03-07 14:06:11.730141: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10233 MB memory) -> physical GPU (device: 3, name: GeForce RTX 2080 Ti, pci bus id: 0000:1e:00.0, compute capability: 7.5)
2019-03-07 14:06:11.733538: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-07 14:06:11.733583: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      2
2019-03-07 14:06:11.733596: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 2:   N
WARNING:tensorflow:From /learndata/virtualenv/tf1.13.1_no_xla/lib/python2.7/site-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
INFO:tensorflow:Restoring parameters from ./checkpoints/model.ckpt-0
2019-03-07 14:06:11.741786: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10233 MB memory) -> physical GPU (device: 2, name: GeForce RTX 2080 Ti, pci bus id: 0000:1d:00.0, compute capability: 7.5)
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Done running local_init_op.
WARNING:tensorflow:From /learndata/virtualenv/tf1.13.1_no_xla/lib/python2.7/site-packages/tensorflow/python/training/saver.py:1070: get_checkpoint_mtimes (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file utilities to get mtimes.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into ./checkpoints/model.ckpt.
2019-03-07 14:06:14.336218: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2019-03-07 14:06:14.348293: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2019-03-07 14:06:14.352477: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2019-03-07 14:06:14.391831: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2019-03-07 14:06:14.404225: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2019-03-07 14:06:14.509689: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2019-03-07 14:06:14.550282: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2019-03-07 14:06:14.609623: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
heinzel135:151778:151927 [0] NCCL INFO NET/Socket : Using [0]ib0:192.168.21.135<0>
heinzel135:151778:151927 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
heinzel135:151778:151927 [0] NCCL INFO NET/IB : Using [0]mlx4_0:1/IB ; OOB ib0:192.168.21.135<0>
NCCL version 2.4.2+cuda10.0
heinzel135:151779:151929 [1] NCCL INFO NET/Socket : Using [0]ib0:192.168.21.135<0>
heinzel135:151780:151928 [2] NCCL INFO NET/Socket : Using [0]ib0:192.168.21.135<0>
heinzel135:151781:151926 [3] NCCL INFO NET/Socket : Using [0]ib0:192.168.21.135<0>
heinzel135:151780:151928 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
heinzel135:151781:151926 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
heinzel135:151779:151929 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
heinzel136:84299:84449 [0] NCCL INFO NET/Socket : Using [0]ib0:192.168.21.136<0>
heinzel136:84300:84450 [1] NCCL INFO NET/Socket : Using [0]ib0:192.168.21.136<0>
heinzel136:84301:84448 [2] NCCL INFO NET/Socket : Using [0]ib0:192.168.21.136<0>
heinzel136:84302:84451 [3] NCCL INFO NET/Socket : Using [0]ib0:192.168.21.136<0>
heinzel136:84301:84448 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
heinzel136:84300:84450 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
heinzel136:84302:84451 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
heinzel136:84299:84449 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
heinzel135:151780:151928 [2] NCCL INFO NET/IB : Using [0]mlx4_0:1/IB ; OOB ib0:192.168.21.135<0>
heinzel135:151779:151929 [1] NCCL INFO NET/IB : Using [0]mlx4_0:1/IB ; OOB ib0:192.168.21.135<0>
heinzel136:84301:84448 [2] NCCL INFO NET/IB : Using [0]mlx4_0:1/IB ; OOB ib0:192.168.21.136<0>
heinzel136:84300:84450 [1] NCCL INFO NET/IB : Using [0]mlx4_0:1/IB ; OOB ib0:192.168.21.136<0>
heinzel135:151781:151926 [3] NCCL INFO NET/IB : Using [0]mlx4_0:1/IB ; OOB ib0:192.168.21.135<0>
heinzel136:84302:84451 [3] NCCL INFO NET/IB : Using [0]mlx4_0:1/IB ; OOB ib0:192.168.21.136<0>
heinzel136:84299:84449 [0] NCCL INFO NET/IB : Using [0]mlx4_0:1/IB ; OOB ib0:192.168.21.136<0>
heinzel135:151778:151927 [0] NCCL INFO Setting affinity for GPU 0 to ff00ff
heinzel135:151778:151927 [0] NCCL INFO comm 0x7f0eb8340630 rank 0 nranks 8 cudaDev 0 nvmlDev 0
heinzel136:84301:84448 [2] NCCL INFO Setting affinity for GPU 2 to ff00ff
heinzel136:84301:84448 [2] NCCL INFO comm 0x7fdc24337090 rank 6 nranks 8 cudaDev 2 nvmlDev 2
heinzel135:151780:151928 [2] NCCL INFO Setting affinity for GPU 2 to ff00ff
heinzel135:151780:151928 [2] NCCL INFO comm 0x7f815833d0c0 rank 2 nranks 8 cudaDev 2 nvmlDev 2
heinzel135:151779:151929 [1] NCCL INFO Setting affinity for GPU 1 to ff00ff
heinzel135:151779:151929 [1] NCCL INFO comm 0x7f2f4432eb80 rank 1 nranks 8 cudaDev 1 nvmlDev 1
heinzel136:84302:84451 [3] NCCL INFO Setting affinity for GPU 3 to ff00ff
heinzel136:84302:84451 [3] NCCL INFO comm 0x7f63ec349120 rank 7 nranks 8 cudaDev 3 nvmlDev 3
heinzel135:151781:151926 [3] NCCL INFO Setting affinity for GPU 3 to ff00ff
heinzel135:151781:151926 [3] NCCL INFO comm 0x7fad2c36eb50 rank 3 nranks 8 cudaDev 3 nvmlDev 3
heinzel136:84299:84449 [0] NCCL INFO Setting affinity for GPU 0 to ff00ff
heinzel136:84299:84449 [0] NCCL INFO comm 0x7fe7fc34a900 rank 4 nranks 8 cudaDev 0 nvmlDev 0
heinzel136:84300:84450 [1] NCCL INFO Setting affinity for GPU 1 to ff00ff
heinzel136:84300:84450 [1] NCCL INFO comm 0x7f392433ae40 rank 5 nranks 8 cudaDev 1 nvmlDev 1
heinzel136:84302:84451 [3] NCCL INFO CUDA Dev 3[3], IB NIC distance :  SOC
heinzel136:84301:84448 [2] NCCL INFO CUDA Dev 2[2], IB NIC distance :  SOC
heinzel136:84300:84450 [1] NCCL INFO CUDA Dev 1[1], IB NIC distance :  SOC
heinzel136:84299:84449 [0] NCCL INFO CUDA Dev 0[0], IB NIC distance :  SOC
heinzel135:151778:151927 [0] NCCL INFO CUDA Dev 0[0], IB NIC distance :  SOC
heinzel135:151780:151928 [2] NCCL INFO CUDA Dev 2[2], IB NIC distance :  SOC
heinzel135:151781:151926 [3] NCCL INFO CUDA Dev 3[3], IB NIC distance :  SOC
heinzel135:151779:151929 [1] NCCL INFO CUDA Dev 1[1], IB NIC distance :  SOC
heinzel135:151778:151927 [0] NCCL INFO Channel 00 :    0   1   2   3   4   5   6   7
heinzel135:151779:151929 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via direct shared memory
heinzel136:84299:84449 [0] NCCL INFO Ring 00 : 3 -> 4 [receive] via NET/IB/0
heinzel135:151778:151927 [0] NCCL INFO Ring 00 : 7 -> 0 [receive] via NET/IB/0
heinzel135:151780:151928 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via direct shared memory
heinzel135:151778:151927 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via direct shared memory
heinzel136:84299:84449 [0] NCCL INFO Ring 00 : 4[0] -> 5[1] via direct shared memory
heinzel136:84301:84448 [2] NCCL INFO Ring 00 : 6[2] -> 7[3] via direct shared memory
heinzel136:84300:84450 [1] NCCL INFO Ring 00 : 5[1] -> 6[2] via direct shared memory
heinzel135:151779:151929 [1] NCCL INFO Ring 00 : 1[1] -> 0[0] via direct shared memory
heinzel135:151780:151928 [2] NCCL INFO Ring 00 : 2[2] -> 1[1] via direct shared memory

heinzel135:151780:151928 [2] include/shm.h:42 NCCL WARN Cuda failure 'invalid argument'
heinzel135:151780:151928 [2] NCCL INFO transport/shm.cu:187 -> 1
heinzel135:151780:151928 [2] NCCL INFO init.cu:658 -> 1
heinzel135:151780:151928 [2] NCCL INFO init.cu:804 -> 1
heinzel135:151780:151928 [2] NCCL INFO init.cu:915 -> 1
heinzel135:151781:151926 [3] NCCL INFO Ring 00 : 3 -> 4 [send] via NET/IB/0
heinzel136:84301:84448 [2] NCCL INFO Ring 00 : 6[2] -> 5[1] via direct shared memory
heinzel136:84302:84451 [3] NCCL INFO Ring 00 : 7 -> 0 [send] via NET/IB/0
heinzel136:84300:84450 [1] NCCL INFO Ring 00 : 5[1] -> 4[0] via direct shared memory
heinzel135:151781:151926 [3] NCCL INFO Ring 00 : 3[3] -> 2[2] via direct shared memory
heinzel135:151781:151926 [3] NCCL INFO Trees [0] 2->3->-1/-1/-1
heinzel136:84302:84451 [3] NCCL INFO Ring 00 : 7[3] -> 6[2] via direct shared memory
heinzel136:84301:84448 [2] NCCL INFO Trees [0] 5->6->7/-1/-1
heinzel135:151781:151926 [3] NCCL INFO comm 0x7fad2c36eb50 rank 3 nranks 8 cudaDev 3 nvmlDev 3 - Init COMPLETE
heinzel136:84302:84451 [3] NCCL INFO Trees [0] 6->7->-1/-1/-1
heinzel136:84301:84448 [2] NCCL INFO comm 0x7fdc24337090 rank 6 nranks 8 cudaDev 2 nvmlDev 2 - Init COMPLETE

heinzel135:151779:151929 [1] include/shm.h:42 NCCL WARN Cuda failure 'invalid argument'
heinzel135:151779:151929 [1] NCCL INFO transport/shm.cu:211 -> 1
heinzel135:151779:151929 [1] NCCL INFO init.cu:667 -> 1
heinzel135:151779:151929 [1] NCCL INFO init.cu:804 -> 1
heinzel135:151779:151929 [1] NCCL INFO init.cu:915 -> 1

heinzel136:84300:84450 [1] include/shm.h:42 NCCL WARN Cuda failure 'invalid argument'
heinzel136:84300:84450 [1] NCCL INFO transport/shm.cu:211 -> 1
heinzel136:84300:84450 [1] NCCL INFO init.cu:667 -> 1
heinzel136:84300:84450 [1] NCCL INFO init.cu:804 -> 1
heinzel136:84300:84450 [1] NCCL INFO init.cu:915 -> 1
heinzel135:151778:151927 [0] NCCL INFO Ring 00 : 4 -> 0 [receive] via NET/IB/0
heinzel136:84302:84451 [3] NCCL INFO comm 0x7f63ec349120 rank 7 nranks 8 cudaDev 3 nvmlDev 3 - Init COMPLETE
heinzel136:84299:84449 [0] NCCL INFO Ring 00 : 4 -> 0 [send] via NET/IB/0
heinzel136:84299:84449 [0] NCCL INFO Ring 00 : 0 -> 4 [receive] via NET/IB/0
heinzel135:151778:151927 [0] NCCL INFO Ring 00 : 0 -> 4 [send] via NET/IB/0
heinzel135:151778:151927 [0] NCCL INFO Trees [0] -1->0->1/4/-1
heinzel135:151778:151927 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees enabled up to size 149999
heinzel136:84299:84449 [0] NCCL INFO Trees [0] 0->4->5/-1/-1
heinzel135:151778:151927 [0] NCCL INFO comm 0x7f0eb8340630 rank 0 nranks 8 cudaDev 0 nvmlDev 0 - Init COMPLETE
heinzel136:84299:84449 [0] NCCL INFO comm 0x7fe7fc34a900 rank 4 nranks 8 cudaDev 0 nvmlDev 0 - Init COMPLETE
@maxhgerlach
Collaborator Author

I downgraded NCCL from 2.4.2-1 to 2.3.7-1 and reinstalled Horovod.

That seems to have fixed the problem!

@alsrgv
Member

alsrgv commented Mar 13, 2019

@maxhgerlach, did you happen to run in a container environment? Since NCCL is now open source, we can find the failing line, which is:

  CUDACHECKGOTO(cudaHostRegister(ptr, shmsize, cudaHostRegisterMapped), res, cudaError);

Is it possible that you had insufficient shared memory provisioned?
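
For reference, the same call can be exercised outside NCCL with a small standalone program. This is only a minimal sketch: the segment name and the 1 MiB size below are arbitrary choices, not taken from NCCL.

// shm_register_test.cu -- hypothetical standalone check of cudaHostRegister
// on a POSIX shared-memory mapping, roughly mimicking what NCCL's include/shm.h does.
// Build with: nvcc shm_register_test.cu -o shm_register_test   (add -lrt on older glibc)
#include <cuda_runtime.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
  const size_t shmsize = 1 << 20;  // 1 MiB test segment, arbitrary size
  int fd = shm_open("/nccl-shm-test", O_CREAT | O_RDWR, 0600);
  if (fd < 0) { perror("shm_open"); return 1; }
  if (ftruncate(fd, shmsize) != 0) { perror("ftruncate"); return 1; }
  void *ptr = mmap(NULL, shmsize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  if (ptr == MAP_FAILED) { perror("mmap"); return 1; }

  // The call that NCCL wraps in CUDACHECKGOTO above.
  cudaError_t err = cudaHostRegister(ptr, shmsize, cudaHostRegisterMapped);
  printf("cudaHostRegister: %s\n", cudaGetErrorString(err));

  if (err == cudaSuccess) cudaHostUnregister(ptr);
  munmap(ptr, shmsize);
  close(fd);
  shm_unlink("/nccl-shm-test");
  return 0;
}

If this also reports 'invalid argument', the problem is not specific to Horovod or NCCL; if it succeeds, the failure likely depends on how NCCL sizes or maps its own segments.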

@maxhgerlach
Collaborator Author

@alsrgv, thanks for looking into this.

We are not using any container technology. I believe there are no shared memory limits in place -- here's what I checked on the four hosts:

$ ipcs -lm

------ Shared Memory Limits --------
max number of segments = 4096
max seg size (kbytes) = 18014398509465599
max total shared memory (kbytes) = 18014398442373116
min seg size (bytes) = 1

$ ulimit -a | grep memory
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
virtual memory          (kbytes, -v) unlimited

@alsrgv
Member

alsrgv commented Mar 15, 2019

That's pretty odd. @sjeaugey, does anything jump out to you as a possible cause for cudaHostRegister to fail?

@sjeaugey

This is a known, fixed bug: NVIDIA/nccl#185.
To get the fix, use the latest version on GitHub.
If you do not want to recompile NCCL (and likely Horovod), you can work around the issue in one of the following ways (see the example command after this list):

  • Disabling trees: NCCL_TREE_THRESHOLD=0 (but performance may be lower)
  • Forcing P2P (instead of SHM) across all GPUs, if you have a Skylake CPU with a proper BIOS: NCCL_P2P_LEVEL=5
  • Rolling back to NCCL 2.3.
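
For example, applied to the mpirun invocation from the original report, the tree workaround would just add one more exported variable. This is a sketch based on that command, not verified here:

$ mpirun --prefix /opt/openmpi/ -np 8 -H heinzel135:4,heinzel136:4 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x NCCL_TREE_THRESHOLD=0 -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib python tensorflow_mnist.py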

@maxhgerlach
Collaborator Author

@sjeaugey -- thanks for letting me know about that known issue!

We will stick with NCCL 2.3 for now and reconsider 2.4 once the fix is released.
