
Why it doesn't show connection via NET/IB/0/GDRDMA #1523

Closed
vilmara opened this issue Nov 19, 2019 · 6 comments


vilmara commented Nov 19, 2019

Environment:

  1. Framework: TensorFlow
  2. Framework version: TF 1.4
  3. Horovod version: 0.18.2 via Horovod in docker
  4. MPI version: 4.0.0
  5. CUDA version: 10.0
  6. NCCL version: 2.4.7-1
  7. Python version: 2.7
  8. OS and version: Ubuntu 18.04
  9. GCC version: 4.8
  10. Mellanox OFED 4.7.1
  11. GPUDirect RDMA - nvidia-peer-memory_1.0-8

Your question:
I am running the TF benchmarks in multi-node mode with the latest version of Horovod via Docker, but I am not seeing the connection reported via NET/IB/0/GDRDMA as I did in #288. What else am I missing to activate GPUDirect RDMA with the new software stack? See the tracelog below.

Tracelog
master_node:20:289 [0] NCCL INFO NET/Socket : Using [0]ib0:192.168.11.1<0>
master_node:20:289 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
master_node:20:289 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
master_node:20:289 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.11.1<0>
NCCL version 2.4.7+cuda10.0
master_node:22:295 [2] NCCL INFO NET/Socket : Using [0]ib0:192.168.11.1<0>
master_node:22:295 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
master_node:21:290 [1] NCCL INFO NET/Socket : Using [0]ib0:192.168.11.1<0>
master_node:23:288 [3] NCCL INFO NET/Socket : Using [0]ib0:192.168.11.1<0>
master_node:21:290 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
master_node:23:288 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
master_node:22:295 [2] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
master_node:21:290 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
master_node:23:288 [3] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
secondary_node:44:311 [3] NCCL INFO NET/Socket : Using [0]ib0:192.168.11.2<0>
secondary_node:41:312 [0] NCCL INFO NET/Socket : Using [0]ib0:192.168.11.2<0>
secondary_node:42:310 [1] NCCL INFO NET/Socket : Using [0]ib0:192.168.11.2<0>
secondary_node:43:309 [2] NCCL INFO NET/Socket : Using [0]ib0:192.168.11.2<0>
secondary_node:42:310 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
secondary_node:43:309 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
secondary_node:44:311 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
secondary_node:41:312 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
secondary_node:43:309 [2] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
secondary_node:44:311 [3] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
secondary_node:42:310 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
secondary_node:41:312 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
master_node:22:295 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.11.1<0>
master_node:23:288 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.11.1<0>
master_node:21:290 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.11.1<0>
secondary_node:43:309 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.11.2<0>
secondary_node:44:311 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.11.2<0>
secondary_node:41:312 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.11.2<0>
secondary_node:42:310 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.11.2<0>
master_node:20:289 [0] NCCL INFO Setting affinity for GPU 0 to 5555,55555555,55555555
master_node:23:288 [3] NCCL INFO Setting affinity for GPU 3 to aaaa,aaaaaaaa,aaaaaaaa
master_node:21:290 [1] NCCL INFO Setting affinity for GPU 1 to 5555,55555555,55555555
master_node:22:295 [2] NCCL INFO Setting affinity for GPU 2 to aaaa,aaaaaaaa,aaaaaaaa
secondary_node:44:311 [3] NCCL INFO Setting affinity for GPU 3 to aaaa,aaaaaaaa,aaaaaaaa
secondary_node:43:309 [2] NCCL INFO Setting affinity for GPU 2 to aaaa,aaaaaaaa,aaaaaaaa
secondary_node:41:312 [0] NCCL INFO Setting affinity for GPU 0 to 5555,55555555,55555555
secondary_node:42:310 [1] NCCL INFO Setting affinity for GPU 1 to 5555,55555555,55555555
secondary_node:41:312 [0] NCCL INFO CUDA Dev 0[0], IB NIC distance : SYS
secondary_node:44:311 [3] NCCL INFO CUDA Dev 3[3], IB NIC distance : NODE
secondary_node:42:310 [1] NCCL INFO CUDA Dev 1[1], IB NIC distance : SYS
secondary_node:43:309 [2] NCCL INFO CUDA Dev 2[2], IB NIC distance : NODE
master_node:22:295 [2] NCCL INFO CUDA Dev 2[2], IB NIC distance : NODE
master_node:23:288 [3] NCCL INFO CUDA Dev 3[3], IB NIC distance : NODE
master_node:21:290 [1] NCCL INFO CUDA Dev 1[1], IB NIC distance : SYS
master_node:20:289 [0] NCCL INFO CUDA Dev 0[0], IB NIC distance : SYS
master_node:20:289 [0] NCCL INFO Channel 00 : 0 1 3 6 4 5 7 2
master_node:20:289 [0] NCCL INFO Channel 01 : 0 1 3 6 4 5 7 2
master_node:22:295 [2] NCCL INFO Ring 00 : 7 -> 2 [receive] via NET/IB/0
master_node:22:295 [2] NCCL INFO Ring 00 : 2[2] -> 0[0] via P2P/IPC
secondary_node:43:309 [2] NCCL INFO Ring 00 : 3 -> 6 [receive] via NET/IB/0
master_node:21:290 [1] NCCL INFO Ring 00 : 1[1] -> 3[3] via P2P/IPC
master_node:20:289 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/IPC
master_node:23:288 [3] NCCL INFO Ring 00 : 3 -> 6 [send] via NET/IB/0
master_node:23:288 [3] NCCL INFO Ring 00 : 3[3] -> 1[1] via P2P/IPC
secondary_node:43:309 [2] NCCL INFO Ring 00 : 6[2] -> 4[0] via P2P/IPC
master_node:21:290 [1] NCCL INFO Ring 00 : 1[1] -> 0[0] via P2P/IPC
master_node:20:289 [0] NCCL INFO Ring 00 : 0[0] -> 2[2] via P2P/IPC
master_node:21:290 [1] NCCL INFO Ring 01 : 1[1] -> 3[3] via P2P/IPC
master_node:23:288 [3] NCCL INFO Ring 01 : 3 -> 6 [send] via NET/IB/0
secondary_node:42:310 [1] NCCL INFO Ring 00 : 5[1] -> 7[3] via P2P/IPC
secondary_node:41:312 [0] NCCL INFO Ring 00 : 4[0] -> 5[1] via P2P/IPC
secondary_node:44:311 [3] NCCL INFO Ring 00 : 7 -> 2 [send] via NET/IB/0
master_node:22:295 [2] NCCL INFO Ring 00 : 6 -> 2 [receive] via NET/IB/0
master_node:20:289 [0] NCCL INFO Ring 01 : 0[0] -> 1[1] via P2P/IPC
master_node:21:290 [1] NCCL INFO Ring 01 : 1[1] -> 0[0] via P2P/IPC
secondary_node:44:311 [3] NCCL INFO Ring 00 : 7[3] -> 5[1] via P2P/IPC
secondary_node:43:309 [2] NCCL INFO Ring 00 : 6 -> 2 [send] via NET/IB/0
secondary_node:42:310 [1] NCCL INFO Ring 00 : 5[1] -> 4[0] via P2P/IPC
secondary_node:41:312 [0] NCCL INFO Ring 00 : 4[0] -> 6[2] via P2P/IPC
secondary_node:43:309 [2] NCCL INFO Ring 00 : 2 -> 6 [receive] via NET/IB/0
master_node:22:295 [2] NCCL INFO Ring 00 : 2 -> 6 [send] via NET/IB/0
master_node:22:295 [2] NCCL INFO Ring 01 : 7 -> 2 [receive] via NET/IB/0
master_node:22:295 [2] NCCL INFO Ring 01 : 2[2] -> 0[0] via P2P/IPC
secondary_node:43:309 [2] NCCL INFO Ring 01 : 3 -> 6 [receive] via NET/IB/0
master_node:23:288 [3] NCCL INFO Ring 01 : 3[3] -> 1[1] via P2P/IPC
master_node:21:290 [1] NCCL INFO Trees [0] 0->1->3/-1/-1 [1] 0->1->3/-1/-1
secondary_node:44:311 [3] NCCL INFO Ring 01 : 7 -> 2 [send] via NET/IB/0
master_node:23:288 [3] NCCL INFO Trees [0] 1->3->-1/-1/-1 [1] 1->3->-1/-1/-1
master_node:20:289 [0] NCCL INFO Ring 01 : 0[0] -> 2[2] via P2P/IPC
secondary_node:43:309 [2] NCCL INFO Ring 01 : 6[2] -> 4[0] via P2P/IPC
master_node:21:290 [1] NCCL INFO comm 0x7f4d6839f060 rank 1 nranks 8 cudaDev 1 nvmlDev 1 - Init COMPLETE
master_node:23:288 [3] NCCL INFO comm 0x7f48503a3650 rank 3 nranks 8 cudaDev 3 nvmlDev 3 - Init COMPLETE
master_node:20:289 [0] NCCL INFO Trees [0] 2->0->1/-1/-1 [1] 2->0->1/-1/-1
master_node:20:289 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees enabled for all sizes
secondary_node:42:310 [1] NCCL INFO Ring 01 : 5[1] -> 7[3] via P2P/IPC
secondary_node:41:312 [0] NCCL INFO Ring 01 : 4[0] -> 5[1] via P2P/IPC
master_node:20:289 [0] NCCL INFO comm 0x7f5450362840 rank 0 nranks 8 cudaDev 0 nvmlDev 0 - Init COMPLETE
master_node:22:295 [2] NCCL INFO Ring 01 : 2 -> 6 [send] via NET/IB/0
secondary_node:44:311 [3] NCCL INFO Ring 01 : 7[3] -> 5[1] via P2P/IPC
secondary_node:43:309 [2] NCCL INFO Ring 01 : 2 -> 6 [receive] via NET/IB/0
secondary_node:44:311 [3] NCCL INFO Trees [0] 5->7->-1/-1/-1 [1] 5->7->-1/-1/-1
master_node:22:295 [2] NCCL INFO Ring 01 : 6 -> 2 [receive] via NET/IB/0
secondary_node:42:310 [1] NCCL INFO Ring 01 : 5[1] -> 4[0] via P2P/IPC
secondary_node:41:312 [0] NCCL INFO Ring 01 : 4[0] -> 6[2] via P2P/IPC
secondary_node:44:311 [3] NCCL INFO comm 0x7ff2c43f7c00 rank 7 nranks 8 cudaDev 3 nvmlDev 3 - Init COMPLETE
secondary_node:42:310 [1] NCCL INFO Trees [0] 4->5->7/-1/-1 [1] 4->5->7/-1/-1
secondary_node:41:312 [0] NCCL INFO Trees [0] 6->4->5/-1/-1 [1] 6->4->5/-1/-1
secondary_node:41:312 [0] NCCL INFO comm 0x7fd8dc3c6740 rank 4 nranks 8 cudaDev 0 nvmlDev 0 - Init COMPLETE
secondary_node:43:309 [2] NCCL INFO Ring 01 : 6 -> 2 [send] via NET/IB/0
secondary_node:43:309 [2] NCCL INFO Trees [0] 2->6->4/-1/-1 [1] -1->6->4/2/-1
secondary_node:42:310 [1] NCCL INFO comm 0x7fa7cc422c90 rank 5 nranks 8 cudaDev 1 nvmlDev 1 - Init COMPLETE
secondary_node:43:309 [2] NCCL INFO comm 0x7fce9c438c90 rank 6 nranks 8 cudaDev 2 nvmlDev 2 - Init COMPLETE
master_node:22:295 [2] NCCL INFO Trees [0] -1->2->0/6/-1 [1] 6->2->0/-1/-1
master_node:22:295 [2] NCCL INFO comm 0x7fd8f038f460 rank 2 nranks 8 cudaDev 2 nvmlDev 2 - Init COMPLETE
master_node:20:289 [0] NCCL INFO Launch mode Parallel
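As a quick sanity check on a tracelog like the one above, a short script (a sketch written for this thread, not part of Horovod or NCCL) can flag whether any inter-node ring connection actually negotiated GDRDMA; in the log above, every inter-node hop reports plain NET/IB/0:

```python
# Sketch: scan NCCL INFO output and report which transport each
# inter-node ring connection used. The log format is taken from the
# tracelog above; this is not a Horovod/NCCL API.
import re

def ring_transports(log_text):
    """Return a list of (log_line, transport) for NET connections."""
    results = []
    for line in log_text.splitlines():
        m = re.search(r"Ring \d+ : .*via (NET/\S+)", line)
        if m:
            results.append((line.strip(), m.group(1)))
    return results

def uses_gdrdma(log_text):
    """True only if every inter-node connection negotiated GDRDMA."""
    transports = [t for _, t in ring_transports(log_text)]
    return bool(transports) and all(t.endswith("GDRDMA") for t in transports)

sample = """\
master_node:23:288 [3] NCCL INFO Ring 00 : 3 -> 6 [send] via NET/IB/0
master_node:22:295 [2] NCCL INFO Ring 00 : 2[2] -> 0[0] via P2P/IPC
"""
print(uses_gdrdma(sample))  # False: plain NET/IB/0, no GDRDMA suffix
```

Intra-node P2P/IPC lines are ignored on purpose; only the NET hops tell you whether GPUDirect RDMA is in play.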


tgaddair commented Dec 1, 2019

Hey @romerojosh or @DEKHTIARJonathan, do either of you know anyone who may be able to help with this GPUDirect issue?

DEKHTIARJonathan (Collaborator) commented:

Have you looked into the official guide from Mellanox?

https://docs.mellanox.com/m/view-rendered-page.action?abstractPageId=15049724

If I understand correctly, they even run the same benchmark as you do.


vilmara commented Dec 2, 2019

Hi @DEKHTIARJonathan, yes, I have looked at the official guide from Mellanox, and I have successfully enabled this before (see issue #288). However, for some unknown reason, with the new software versions there is no connection via NET/IB/0/GDRDMA. Could you please assist?


vilmara commented Jan 15, 2020

Hi @DEKHTIARJonathan / @tgaddair, thanks for your support. I got GPUDirect RDMA enabled after rebuilding my multi-node system from scratch with the configuration below:

Environment:

Framework: TensorFlow
Framework version: TF 1.4
Horovod version: 0.18.2 via Horovod in docker
MPI version: 4.0.0
CUDA version: 10.0
NCCL version: 2.5.6
CUDNN: 7.6.5
Python version: 2.7
OS and version: Ubuntu 18.04
GCC version: 4.8
Mellanox OFED 4.7-3.2.9.0
GPUDirect RDMA - nvidia-peer-memory_1.0-8

Flags:
-x NCCL_NET_GDR_LEVEL=3 -x NCCL_DEBUG_SUBSYS=NET -x NCCL_IB_DISABLE=0 -mca btl_tcp_if_include ib0 -x NCCL_SOCKET_IFNAME=ib0 -x NCCL_DEBUG=INFO --bind-to none --map-by slot --mca plm_rsh_args "-p 12345"

Snipped tracelog:

master_node:3503:3838 [2] NCCL INFO NET/IB : GPU Direct RDMA Enabled for GPU 86000 / HCA 0 (distance 2 < 3), read 0
master_node:3504:3835 [3] NCCL INFO NET/IB : GPU Direct RDMA Enabled for GPU af000 / HCA 0 (distance 2 < 3), read 1
master_node:3503:3838 [2] NCCL INFO Ring 00 : 7[af000] -> 2[86000] [receive] via NET/IB/0/GDRDMA
master_node:3504:3835 [3] NCCL INFO Ring 00 : 3[af000] -> 6[86000] [send] via NET/IB/0/GDRDMA
secondary_node:67315:67835 [2] NCCL INFO NET/IB : GPU Direct RDMA Enabled for GPU 86000 / HCA 0 (distance 2 < 3), read 0
secondary_node:67316:67834 [3] NCCL INFO NET/IB : GPU Direct RDMA Enabled for GPU af000 / HCA 0 (distance 2 < 3), read 1
secondary_node:67315:67835 [2] NCCL INFO Ring 00 : 3[af000] -> 6[86000] [receive] via NET/IB/0/GDRDMA
secondary_node:67316:67834 [3] NCCL INFO Ring 00 : 7[af000] -> 2[86000] [send] via NET/IB/0/GDRDMA
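The "(distance 2 < 3)" in the log lines above hints at how NCCL decides: GPU Direct RDMA is enabled when the GPU-to-HCA distance is strictly below NCCL_NET_GDR_LEVEL (set to 3 via the flags above). A toy restatement of that comparison, with the strict-less-than semantics inferred from the log format rather than from NCCL's source:

```python
# Toy model of the check printed in the tracelog, e.g.
# "GPU Direct RDMA Enabled for GPU 86000 / HCA 0 (distance 2 < 3)".
# The strict-less-than semantics are inferred from that log line only.
def gdr_enabled(gpu_hca_distance, net_gdr_level=3):
    """True when the GPU sits close enough to the HCA for GDR."""
    return gpu_hca_distance < net_gdr_level

print(gdr_enabled(2, 3))  # True, as in the log: distance 2 < 3
print(gdr_enabled(4, 3))  # False: GPU too far from the HCA, GDR stays off
```

This is why raising NCCL_NET_GDR_LEVEL can enable GDRDMA on topologies where the GPU and NIC do not share a PCIe switch.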

ajtarraga commented:

Hi @vilmara, I have been following your issues related to Horovod and GPUDirect RDMA. I am trying a similar configuration, and I am not able to see the difference between training locally and with GPUDirect RDMA, because I obtain 100% scaling while using GPUDirect RDMA.

I mean:
1 GPU trains at 600 images/sec
2 GPUs with GPUDirect RDMA train at 1,200 images/sec

It looks as if there were no communication penalty with GPUDirect RDMA, as though the GPUs were not communicating at all.

Could you please post the final command that you use with GPUDirect RDMA? Thank you in advance!


vilmara commented Jan 26, 2023

Hi @ajtarraga, could you please provide the current throughput (images/sec) as below:
Throughput 1x GPU within a node:
Throughput 1x GPU across the nodes:
Throughput 2x GPUs total in multi-node:
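For reference, scaling efficiency is simply the measured multi-GPU throughput divided by the ideal (number of GPUs times the single-GPU rate). A minimal sketch using the illustrative 600 and 1,200 images/sec figures from the previous comment:

```python
def scaling_efficiency(single_gpu_ips, multi_gpu_ips, num_gpus):
    """Fraction of ideal linear scaling achieved (1.0 means perfect)."""
    return multi_gpu_ips / (num_gpus * single_gpu_ips)

# 600 img/s on 1 GPU vs 1,200 img/s on 2 GPUs, as in the comment above.
eff = scaling_efficiency(600.0, 1200.0, 2)
print(f"{eff:.0%}")  # 100%: communication cost is fully hidden (or absent)
```

Anything noticeably below 100% across nodes points at communication overhead that GPUDirect RDMA should help reduce.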

Also, when running the multi-node training, make sure you see similar stdout information on both servers, as shown below:

master_node:3504:3835 [3] NCCL INFO Ring 00 : 3[af000] -> 6[86000] [send] via NET/IB/0/GDRDMA
secondary_node:67316:67834 [3] NCCL INFO Ring 00 : 7[af000] -> 2[86000] [send] via NET/IB/0/GDRDMA
