
Horovod hangs -- include/shm.h:42 NCCL WARN Cuda failure 'invalid argument' #893

Closed
maxhgerlach opened this issue Mar 7, 2019 · 6 comments


@maxhgerlach
Collaborator

Hi,

I am trying to get Horovod training jobs running on a cluster built from NVIDIA RTX 2080 Ti GPUs with InfiniBand interconnects.

Installed software:

  • Ubuntu 16.04.5 LTS
  • CUDA 10.0.130-1
  • NCCL 2.4.2-1+cuda10
  • Open MPI 3.1.2
  • Mellanox OFED 4.4-2.0.7.0
  • TensorFlow 1.13.1, custom built without XLA support
  • Horovod 0.16

This is the device topology (PCI express dual root):

$ nvidia-smi topo -m
	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	mlx4_0	CPU Affinity
GPU0	 X 	PIX	PIX	PIX	SYS	SYS	SYS	SYS	SYS	0-7,16-23
GPU1	PIX	 X 	PIX	PIX	SYS	SYS	SYS	SYS	SYS	0-7,16-23
GPU2	PIX	PIX	 X 	PIX	SYS	SYS	SYS	SYS	SYS	0-7,16-23
GPU3	PIX	PIX	PIX	 X 	SYS	SYS	SYS	SYS	SYS	0-7,16-23
GPU4	SYS	SYS	SYS	SYS	 X 	PIX	PIX	PIX	PIX	8-15,24-31
GPU5	SYS	SYS	SYS	SYS	PIX	 X 	PIX	PIX	PIX	8-15,24-31
GPU6	SYS	SYS	SYS	SYS	PIX	PIX	 X 	PIX	PIX	8-15,24-31
GPU7	SYS	SYS	SYS	SYS	PIX	PIX	PIX	 X 	PIX	8-15,24-31
mlx4_0	SYS	SYS	SYS	SYS	PIX	PIX	PIX	PIX	 X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks

Horovod works fine on a single machine, and also across multiple machines as long as fewer than four GPUs are used on each host. With four or more GPUs per host, however, Horovod hangs and NCCL emits warnings indicating a failure.

Here's an example; there is no progress after the last message:

horovod-0.16.0/examples$ mpirun --prefix /opt/openmpi/ -np 8 -H heinzel135:4,heinzel136:4 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib python tensorflow_mnist.py

# ...

INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Graph was finalized.
2019-03-07 14:06:11.418743: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:1e:00.0
totalMemory: 10.73GiB freeMemory: 10.53GiB
2019-03-07 14:06:11.418823: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 3
2019-03-07 14:06:11.436803: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:1d:00.0
totalMemory: 10.73GiB freeMemory: 10.53GiB
2019-03-07 14:06:11.436911: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 2
2019-03-07 14:06:11.443926: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:1b:00.0
totalMemory: 10.73GiB freeMemory: 10.53GiB
2019-03-07 14:06:11.444031: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-03-07 14:06:11.444763: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:1c:00.0
totalMemory: 10.73GiB freeMemory: 10.53GiB
2019-03-07 14:06:11.444866: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 1
2019-03-07 14:06:11.481025: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:1b:00.0
totalMemory: 10.73GiB freeMemory: 10.53GiB
2019-03-07 14:06:11.481102: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-03-07 14:06:11.481267: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:1e:00.0
totalMemory: 10.73GiB freeMemory: 10.53GiB
2019-03-07 14:06:11.481338: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 3
2019-03-07 14:06:11.481805: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:1c:00.0
totalMemory: 10.73GiB freeMemory: 10.53GiB
2019-03-07 14:06:11.481896: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 1
2019-03-07 14:06:11.481982: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:1d:00.0
totalMemory: 10.73GiB freeMemory: 10.53GiB
2019-03-07 14:06:11.482091: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 2
2019-03-07 14:06:11.643322: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-07 14:06:11.643368: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      2
2019-03-07 14:06:11.643377: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 2:   N
2019-03-07 14:06:11.643736: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10233 MB memory) -> physical GPU (device: 2, name: GeForce RTX 2080 Ti, pci bus id: 0000:1d:00.0, compute capability: 7.5)
2019-03-07 14:06:11.651778: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-07 14:06:11.651825: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0
2019-03-07 14:06:11.651853: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N
2019-03-07 14:06:11.652081: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-07 14:06:11.652123: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      1
2019-03-07 14:06:11.652130: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1:   N
2019-03-07 14:06:11.652199: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10233 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:1b:00.0, compute capability: 7.5)
2019-03-07 14:06:11.652466: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10233 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:1c:00.0, compute capability: 7.5)
2019-03-07 14:06:11.653144: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-07 14:06:11.653210: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      3
2019-03-07 14:06:11.653223: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 3:   N
2019-03-07 14:06:11.653666: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10233 MB memory) -> physical GPU (device: 3, name: GeForce RTX 2080 Ti, pci bus id: 0000:1e:00.0, compute capability: 7.5)
2019-03-07 14:06:11.725407: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-07 14:06:11.725458: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0
2019-03-07 14:06:11.725482: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N
2019-03-07 14:06:11.725800: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10233 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:1b:00.0, compute capability: 7.5)
2019-03-07 14:06:11.727268: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-07 14:06:11.727301: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      1
2019-03-07 14:06:11.727309: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1:   N
2019-03-07 14:06:11.727627: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10233 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:1c:00.0, compute capability: 7.5)
2019-03-07 14:06:11.729763: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-07 14:06:11.729805: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      3
2019-03-07 14:06:11.729813: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 3:   N
2019-03-07 14:06:11.730141: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10233 MB memory) -> physical GPU (device: 3, name: GeForce RTX 2080 Ti, pci bus id: 0000:1e:00.0, compute capability: 7.5)
2019-03-07 14:06:11.733538: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-07 14:06:11.733583: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      2
2019-03-07 14:06:11.733596: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 2:   N
WARNING:tensorflow:From /learndata/virtualenv/tf1.13.1_no_xla/lib/python2.7/site-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
INFO:tensorflow:Restoring parameters from ./checkpoints/model.ckpt-0
2019-03-07 14:06:11.741786: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10233 MB memory) -> physical GPU (device: 2, name: GeForce RTX 2080 Ti, pci bus id: 0000:1d:00.0, compute capability: 7.5)
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Done running local_init_op.
WARNING:tensorflow:From /learndata/virtualenv/tf1.13.1_no_xla/lib/python2.7/site-packages/tensorflow/python/training/saver.py:1070: get_checkpoint_mtimes (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file utilities to get mtimes.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into ./checkpoints/model.ckpt.
2019-03-07 14:06:14.336218: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2019-03-07 14:06:14.348293: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2019-03-07 14:06:14.352477: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2019-03-07 14:06:14.391831: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2019-03-07 14:06:14.404225: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2019-03-07 14:06:14.509689: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2019-03-07 14:06:14.550282: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2019-03-07 14:06:14.609623: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
heinzel135:151778:151927 [0] NCCL INFO NET/Socket : Using [0]ib0:192.168.21.135<0>
heinzel135:151778:151927 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
heinzel135:151778:151927 [0] NCCL INFO NET/IB : Using [0]mlx4_0:1/IB ; OOB ib0:192.168.21.135<0>
NCCL version 2.4.2+cuda10.0
heinzel135:151779:151929 [1] NCCL INFO NET/Socket : Using [0]ib0:192.168.21.135<0>
heinzel135:151780:151928 [2] NCCL INFO NET/Socket : Using [0]ib0:192.168.21.135<0>
heinzel135:151781:151926 [3] NCCL INFO NET/Socket : Using [0]ib0:192.168.21.135<0>
heinzel135:151780:151928 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
heinzel135:151781:151926 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
heinzel135:151779:151929 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
heinzel136:84299:84449 [0] NCCL INFO NET/Socket : Using [0]ib0:192.168.21.136<0>
heinzel136:84300:84450 [1] NCCL INFO NET/Socket : Using [0]ib0:192.168.21.136<0>
heinzel136:84301:84448 [2] NCCL INFO NET/Socket : Using [0]ib0:192.168.21.136<0>
heinzel136:84302:84451 [3] NCCL INFO NET/Socket : Using [0]ib0:192.168.21.136<0>
heinzel136:84301:84448 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
heinzel136:84300:84450 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
heinzel136:84302:84451 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
heinzel136:84299:84449 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
heinzel135:151780:151928 [2] NCCL INFO NET/IB : Using [0]mlx4_0:1/IB ; OOB ib0:192.168.21.135<0>
heinzel135:151779:151929 [1] NCCL INFO NET/IB : Using [0]mlx4_0:1/IB ; OOB ib0:192.168.21.135<0>
heinzel136:84301:84448 [2] NCCL INFO NET/IB : Using [0]mlx4_0:1/IB ; OOB ib0:192.168.21.136<0>
heinzel136:84300:84450 [1] NCCL INFO NET/IB : Using [0]mlx4_0:1/IB ; OOB ib0:192.168.21.136<0>
heinzel135:151781:151926 [3] NCCL INFO NET/IB : Using [0]mlx4_0:1/IB ; OOB ib0:192.168.21.135<0>
heinzel136:84302:84451 [3] NCCL INFO NET/IB : Using [0]mlx4_0:1/IB ; OOB ib0:192.168.21.136<0>
heinzel136:84299:84449 [0] NCCL INFO NET/IB : Using [0]mlx4_0:1/IB ; OOB ib0:192.168.21.136<0>
heinzel135:151778:151927 [0] NCCL INFO Setting affinity for GPU 0 to ff00ff
heinzel135:151778:151927 [0] NCCL INFO comm 0x7f0eb8340630 rank 0 nranks 8 cudaDev 0 nvmlDev 0
heinzel136:84301:84448 [2] NCCL INFO Setting affinity for GPU 2 to ff00ff
heinzel136:84301:84448 [2] NCCL INFO comm 0x7fdc24337090 rank 6 nranks 8 cudaDev 2 nvmlDev 2
heinzel135:151780:151928 [2] NCCL INFO Setting affinity for GPU 2 to ff00ff
heinzel135:151780:151928 [2] NCCL INFO comm 0x7f815833d0c0 rank 2 nranks 8 cudaDev 2 nvmlDev 2
heinzel135:151779:151929 [1] NCCL INFO Setting affinity for GPU 1 to ff00ff
heinzel135:151779:151929 [1] NCCL INFO comm 0x7f2f4432eb80 rank 1 nranks 8 cudaDev 1 nvmlDev 1
heinzel136:84302:84451 [3] NCCL INFO Setting affinity for GPU 3 to ff00ff
heinzel136:84302:84451 [3] NCCL INFO comm 0x7f63ec349120 rank 7 nranks 8 cudaDev 3 nvmlDev 3
heinzel135:151781:151926 [3] NCCL INFO Setting affinity for GPU 3 to ff00ff
heinzel135:151781:151926 [3] NCCL INFO comm 0x7fad2c36eb50 rank 3 nranks 8 cudaDev 3 nvmlDev 3
heinzel136:84299:84449 [0] NCCL INFO Setting affinity for GPU 0 to ff00ff
heinzel136:84299:84449 [0] NCCL INFO comm 0x7fe7fc34a900 rank 4 nranks 8 cudaDev 0 nvmlDev 0
heinzel136:84300:84450 [1] NCCL INFO Setting affinity for GPU 1 to ff00ff
heinzel136:84300:84450 [1] NCCL INFO comm 0x7f392433ae40 rank 5 nranks 8 cudaDev 1 nvmlDev 1
heinzel136:84302:84451 [3] NCCL INFO CUDA Dev 3[3], IB NIC distance :  SOC
heinzel136:84301:84448 [2] NCCL INFO CUDA Dev 2[2], IB NIC distance :  SOC
heinzel136:84300:84450 [1] NCCL INFO CUDA Dev 1[1], IB NIC distance :  SOC
heinzel136:84299:84449 [0] NCCL INFO CUDA Dev 0[0], IB NIC distance :  SOC
heinzel135:151778:151927 [0] NCCL INFO CUDA Dev 0[0], IB NIC distance :  SOC
heinzel135:151780:151928 [2] NCCL INFO CUDA Dev 2[2], IB NIC distance :  SOC
heinzel135:151781:151926 [3] NCCL INFO CUDA Dev 3[3], IB NIC distance :  SOC
heinzel135:151779:151929 [1] NCCL INFO CUDA Dev 1[1], IB NIC distance :  SOC
heinzel135:151778:151927 [0] NCCL INFO Channel 00 :    0   1   2   3   4   5   6   7
heinzel135:151779:151929 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via direct shared memory
heinzel136:84299:84449 [0] NCCL INFO Ring 00 : 3 -> 4 [receive] via NET/IB/0
heinzel135:151778:151927 [0] NCCL INFO Ring 00 : 7 -> 0 [receive] via NET/IB/0
heinzel135:151780:151928 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via direct shared memory
heinzel135:151778:151927 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via direct shared memory
heinzel136:84299:84449 [0] NCCL INFO Ring 00 : 4[0] -> 5[1] via direct shared memory
heinzel136:84301:84448 [2] NCCL INFO Ring 00 : 6[2] -> 7[3] via direct shared memory
heinzel136:84300:84450 [1] NCCL INFO Ring 00 : 5[1] -> 6[2] via direct shared memory
heinzel135:151779:151929 [1] NCCL INFO Ring 00 : 1[1] -> 0[0] via direct shared memory
heinzel135:151780:151928 [2] NCCL INFO Ring 00 : 2[2] -> 1[1] via direct shared memory

heinzel135:151780:151928 [2] include/shm.h:42 NCCL WARN Cuda failure 'invalid argument'
heinzel135:151780:151928 [2] NCCL INFO transport/shm.cu:187 -> 1
heinzel135:151780:151928 [2] NCCL INFO init.cu:658 -> 1
heinzel135:151780:151928 [2] NCCL INFO init.cu:804 -> 1
heinzel135:151780:151928 [2] NCCL INFO init.cu:915 -> 1
heinzel135:151781:151926 [3] NCCL INFO Ring 00 : 3 -> 4 [send] via NET/IB/0
heinzel136:84301:84448 [2] NCCL INFO Ring 00 : 6[2] -> 5[1] via direct shared memory
heinzel136:84302:84451 [3] NCCL INFO Ring 00 : 7 -> 0 [send] via NET/IB/0
heinzel136:84300:84450 [1] NCCL INFO Ring 00 : 5[1] -> 4[0] via direct shared memory
heinzel135:151781:151926 [3] NCCL INFO Ring 00 : 3[3] -> 2[2] via direct shared memory
heinzel135:151781:151926 [3] NCCL INFO Trees [0] 2->3->-1/-1/-1
heinzel136:84302:84451 [3] NCCL INFO Ring 00 : 7[3] -> 6[2] via direct shared memory
heinzel136:84301:84448 [2] NCCL INFO Trees [0] 5->6->7/-1/-1
heinzel135:151781:151926 [3] NCCL INFO comm 0x7fad2c36eb50 rank 3 nranks 8 cudaDev 3 nvmlDev 3 - Init COMPLETE
heinzel136:84302:84451 [3] NCCL INFO Trees [0] 6->7->-1/-1/-1
heinzel136:84301:84448 [2] NCCL INFO comm 0x7fdc24337090 rank 6 nranks 8 cudaDev 2 nvmlDev 2 - Init COMPLETE

heinzel135:151779:151929 [1] include/shm.h:42 NCCL WARN Cuda failure 'invalid argument'
heinzel135:151779:151929 [1] NCCL INFO transport/shm.cu:211 -> 1
heinzel135:151779:151929 [1] NCCL INFO init.cu:667 -> 1
heinzel135:151779:151929 [1] NCCL INFO init.cu:804 -> 1
heinzel135:151779:151929 [1] NCCL INFO init.cu:915 -> 1

heinzel136:84300:84450 [1] include/shm.h:42 NCCL WARN Cuda failure 'invalid argument'
heinzel136:84300:84450 [1] NCCL INFO transport/shm.cu:211 -> 1
heinzel136:84300:84450 [1] NCCL INFO init.cu:667 -> 1
heinzel136:84300:84450 [1] NCCL INFO init.cu:804 -> 1
heinzel136:84300:84450 [1] NCCL INFO init.cu:915 -> 1
heinzel135:151778:151927 [0] NCCL INFO Ring 00 : 4 -> 0 [receive] via NET/IB/0
heinzel136:84302:84451 [3] NCCL INFO comm 0x7f63ec349120 rank 7 nranks 8 cudaDev 3 nvmlDev 3 - Init COMPLETE
heinzel136:84299:84449 [0] NCCL INFO Ring 00 : 4 -> 0 [send] via NET/IB/0
heinzel136:84299:84449 [0] NCCL INFO Ring 00 : 0 -> 4 [receive] via NET/IB/0
heinzel135:151778:151927 [0] NCCL INFO Ring 00 : 0 -> 4 [send] via NET/IB/0
heinzel135:151778:151927 [0] NCCL INFO Trees [0] -1->0->1/4/-1
heinzel135:151778:151927 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees enabled up to size 149999
heinzel136:84299:84449 [0] NCCL INFO Trees [0] 0->4->5/-1/-1
heinzel135:151778:151927 [0] NCCL INFO comm 0x7f0eb8340630 rank 0 nranks 8 cudaDev 0 nvmlDev 0 - Init COMPLETE
heinzel136:84299:84449 [0] NCCL INFO comm 0x7fe7fc34a900 rank 4 nranks 8 cudaDev 0 nvmlDev 0 - Init COMPLETE
@maxhgerlach
Collaborator Author

I downgraded NCCL from 2.4.2-1 to 2.3.7-1 and reinstalled Horovod.

That seems to have fixed the problem!

@alsrgv
Member

alsrgv commented Mar 13, 2019

@maxhgerlach, did you happen to run in a container environment? Since NCCL is now open source, we can find the failing line, which is:

  CUDACHECKGOTO(cudaHostRegister(ptr, shmsize, cudaHostRegisterMapped), res, cudaError);

Is it possible that you had insufficient shared memory provisioned?
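
For reference, the same call can be exercised outside NCCL with a small standalone program. This is only a minimal sketch: the segment name and the 1 MiB size below are arbitrary choices, not taken from NCCL.

// shm_register_test.cu -- hypothetical standalone check of cudaHostRegister
// on a POSIX shared-memory mapping, roughly mimicking what NCCL's include/shm.h does.
// Build with: nvcc shm_register_test.cu -o shm_register_test   (add -lrt on older glibc)
#include <cuda_runtime.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
  const size_t shmsize = 1 << 20;  // 1 MiB test segment, arbitrary size
  int fd = shm_open("/nccl-shm-test", O_CREAT | O_RDWR, 0600);
  if (fd < 0) { perror("shm_open"); return 1; }
  if (ftruncate(fd, shmsize) != 0) { perror("ftruncate"); return 1; }
  void *ptr = mmap(NULL, shmsize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  if (ptr == MAP_FAILED) { perror("mmap"); return 1; }

  // The call that NCCL wraps in CUDACHECKGOTO above.
  cudaError_t err = cudaHostRegister(ptr, shmsize, cudaHostRegisterMapped);
  printf("cudaHostRegister: %s\n", cudaGetErrorString(err));

  if (err == cudaSuccess) cudaHostUnregister(ptr);
  munmap(ptr, shmsize);
  close(fd);
  shm_unlink("/nccl-shm-test");
  return 0;
}

If this also reports 'invalid argument', the problem is not specific to Horovod or NCCL; if it succeeds, the failure likely depends on how NCCL sizes or maps its own segments.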

@maxhgerlach
Collaborator Author

@alsrgv, thanks for looking into this.

We are not using any container technology. I believe there are no shared memory limits in place -- here's what I checked on the four hosts:

$ ipcs -lm

------ Shared Memory Limits --------
max number of segments = 4096
max seg size (kbytes) = 18014398509465599
max total shared memory (kbytes) = 18014398442373116
min seg size (bytes) = 1

$ ulimit -a | grep memory
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
virtual memory          (kbytes, -v) unlimited

@alsrgv
Member

alsrgv commented Mar 15, 2019

That's pretty odd. @sjeaugey, does anything jump out to you as a possible cause for cudaHostRegister to fail?

@sjeaugey

This is a known, fixed bug: NVIDIA/nccl#185.
To get the fix, use the latest version on GitHub.
If you do not want to recompile NCCL (and likely Horovod), you can work around the issue in one of the following ways (see the example command after this list):

  • Disabling trees: NCCL_TREE_THRESHOLD=0 (but performance may be lower)
  • Forcing P2P (instead of SHM) across all GPUs, if you have a Skylake CPU with a proper BIOS: NCCL_P2P_LEVEL=5
  • Rolling back to NCCL 2.3.
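
For example, applied to the mpirun invocation from the original report, the tree workaround would just add one more exported variable. This is a sketch based on that command, not verified here:

$ mpirun --prefix /opt/openmpi/ -np 8 -H heinzel135:4,heinzel136:4 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x NCCL_TREE_THRESHOLD=0 -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib python tensorflow_mnist.py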

@maxhgerlach
Collaborator Author

@sjeaugey -- thanks for letting me know about that known issue!

We will stick with NCCL 2.3 for now and reconsider 2.4 once the fix is released.
