Why doesn't it show a connection via NET/IB/0/GDRDMA? #1523
Comments
Hey @romerojosh or @DEKHTIARJonathan, do either of you know anyone who may be able to help with this GPUDirect issue?
Have you looked into the official guide from Mellanox? https://docs.mellanox.com/m/view-rendered-page.action?abstractPageId=15049724 If I understand correctly, they even work with the same benchmark as you do.
Hi @DEKHTIARJonathan, yes, I have looked at the official guide from Mellanox, and I have successfully done this before (see issue #288); however, for some unknown reason, with the new SW version there is no connection via NET/IB/0/GDRDMA. Could you please assist?
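A minimal sketch of the checks this kind of debugging usually starts from, assuming a Mellanox OFED stack where GPUDirect RDMA is provided by the nv_peer_mem kernel module (module and tool names may differ on other stacks):
# Verify the GPUDirect RDMA kernel module is loaded (assumes Mellanox nv_peer_mem)
lsmod | grep nv_peer_mem
# Check that the InfiniBand device and port are active
ibstat
# Make NCCL log its transport choices; the GDRDMA suffix only appears in this output
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET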
hi @DEKHTIARJonathan / @tgaddair, thanks for your support. I got GPUDirect RDMA enabled after rebuilding my multi-node system from scratch with the configuration below:
Environment:
Flags:
Snipped tracelog:
Hi @vilmara, I have been following your issues related to Horovod and GPUDirect RDMA. I am trying a configuration and I am not able to see the difference between training locally and training with GPUDirect RDMA, because I obtain 100% performance while using GPUDirect RDMA. In other words, it looks as if there were no penalty with GPUDirect RDMA, because the GPUs are not communicating. Could you please post the final command that you use with GPUDirect RDMA? Thank you in advance!
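One way to check whether the GPUs are actually exchanging data across nodes is to run the NCCL perf tests directly; a minimal sketch, assuming nccl-tests (https://github.com/NVIDIA/nccl-tests) is built on both hosts and OpenMPI is the launcher (host names, slot counts, and install paths are placeholders):
# Build the tests against the installed MPI/CUDA/NCCL (paths are assumptions)
make MPI=1 MPI_HOME=/usr/local/mpi CUDA_HOME=/usr/local/cuda NCCL_HOME=/usr
# 8-rank allreduce across two 4-GPU nodes; with NCCL_DEBUG=INFO the log shows
# whether the inter-node rings use NET/IB/0/GDRDMA or plain NET/IB/0
mpirun -np 8 -H node1:4,node2:4 -x NCCL_DEBUG=INFO \
    ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1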
hi @tgaddair, could you please provide the current throughput (images/sec) as below? Also, when running the multi-node training, make sure you see similar stdout information for both servers, as shown below:
Environment:
Your question:
I am running the TF benchmarks in multi-node mode with the latest version of Horovod via Docker, but I am not seeing the output connection via NET/IB/0/GDRDMA as I did in #288. What else am I missing to activate GPUDirect RDMA with the new software stack? See the tracelog below, and the launch sketch after it.
Tracelog
master_node:20:289 [0] NCCL INFO NET/Socket : Using [0]ib0:192.168.11.1<0>
master_node:20:289 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
master_node:20:289 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
master_node:20:289 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.11.1<0>
NCCL version 2.4.7+cuda10.0
master_node:22:295 [2] NCCL INFO NET/Socket : Using [0]ib0:192.168.11.1<0>
master_node:22:295 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
master_node:21:290 [1] NCCL INFO NET/Socket : Using [0]ib0:192.168.11.1<0>
master_node:23:288 [3] NCCL INFO NET/Socket : Using [0]ib0:192.168.11.1<0>
master_node:21:290 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
master_node:23:288 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
master_node:22:295 [2] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
master_node:21:290 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
master_node:23:288 [3] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
secondary_node:44:311 [3] NCCL INFO NET/Socket : Using [0]ib0:192.168.11.2<0>
secondary_node:41:312 [0] NCCL INFO NET/Socket : Using [0]ib0:192.168.11.2<0>
secondary_node:42:310 [1] NCCL INFO NET/Socket : Using [0]ib0:192.168.11.2<0>
secondary_node:43:309 [2] NCCL INFO NET/Socket : Using [0]ib0:192.168.11.2<0>
secondary_node:42:310 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
secondary_node:43:309 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
secondary_node:44:311 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
secondary_node:41:312 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
secondary_node:43:309 [2] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
secondary_node:44:311 [3] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
secondary_node:42:310 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
secondary_node:41:312 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
master_node:22:295 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.11.1<0>
master_node:23:288 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.11.1<0>
master_node:21:290 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.11.1<0>
secondary_node:43:309 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.11.2<0>
secondary_node:44:311 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.11.2<0>
secondary_node:41:312 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.11.2<0>
secondary_node:42:310 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.11.2<0>
master_node:20:289 [0] NCCL INFO Setting affinity for GPU 0 to 5555,55555555,55555555
master_node:23:288 [3] NCCL INFO Setting affinity for GPU 3 to aaaa,aaaaaaaa,aaaaaaaa
master_node:21:290 [1] NCCL INFO Setting affinity for GPU 1 to 5555,55555555,55555555
master_node:22:295 [2] NCCL INFO Setting affinity for GPU 2 to aaaa,aaaaaaaa,aaaaaaaa
secondary_node:44:311 [3] NCCL INFO Setting affinity for GPU 3 to aaaa,aaaaaaaa,aaaaaaaa
secondary_node:43:309 [2] NCCL INFO Setting affinity for GPU 2 to aaaa,aaaaaaaa,aaaaaaaa
secondary_node:41:312 [0] NCCL INFO Setting affinity for GPU 0 to 5555,55555555,55555555
secondary_node:42:310 [1] NCCL INFO Setting affinity for GPU 1 to 5555,55555555,55555555
secondary_node:41:312 [0] NCCL INFO CUDA Dev 0[0], IB NIC distance : SYS
secondary_node:44:311 [3] NCCL INFO CUDA Dev 3[3], IB NIC distance : NODE
secondary_node:42:310 [1] NCCL INFO CUDA Dev 1[1], IB NIC distance : SYS
secondary_node:43:309 [2] NCCL INFO CUDA Dev 2[2], IB NIC distance : NODE
master_node:22:295 [2] NCCL INFO CUDA Dev 2[2], IB NIC distance : NODE
master_node:23:288 [3] NCCL INFO CUDA Dev 3[3], IB NIC distance : NODE
master_node:21:290 [1] NCCL INFO CUDA Dev 1[1], IB NIC distance : SYS
master_node:20:289 [0] NCCL INFO CUDA Dev 0[0], IB NIC distance : SYS
master_node:20:289 [0] NCCL INFO Channel 00 : 0 1 3 6 4 5 7 2
master_node:20:289 [0] NCCL INFO Channel 01 : 0 1 3 6 4 5 7 2
master_node:22:295 [2] NCCL INFO Ring 00 : 7 -> 2 [receive] via NET/IB/0
master_node:22:295 [2] NCCL INFO Ring 00 : 2[2] -> 0[0] via P2P/IPC
secondary_node:43:309 [2] NCCL INFO Ring 00 : 3 -> 6 [receive] via NET/IB/0
master_node:21:290 [1] NCCL INFO Ring 00 : 1[1] -> 3[3] via P2P/IPC
master_node:20:289 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/IPC
master_node:23:288 [3] NCCL INFO Ring 00 : 3 -> 6 [send] via NET/IB/0
master_node:23:288 [3] NCCL INFO Ring 00 : 3[3] -> 1[1] via P2P/IPC
secondary_node:43:309 [2] NCCL INFO Ring 00 : 6[2] -> 4[0] via P2P/IPC
master_node:21:290 [1] NCCL INFO Ring 00 : 1[1] -> 0[0] via P2P/IPC
master_node:20:289 [0] NCCL INFO Ring 00 : 0[0] -> 2[2] via P2P/IPC
master_node:21:290 [1] NCCL INFO Ring 01 : 1[1] -> 3[3] via P2P/IPC
master_node:23:288 [3] NCCL INFO Ring 01 : 3 -> 6 [send] via NET/IB/0
secondary_node:42:310 [1] NCCL INFO Ring 00 : 5[1] -> 7[3] via P2P/IPC
secondary_node:41:312 [0] NCCL INFO Ring 00 : 4[0] -> 5[1] via P2P/IPC
secondary_node:44:311 [3] NCCL INFO Ring 00 : 7 -> 2 [send] via NET/IB/0
master_node:22:295 [2] NCCL INFO Ring 00 : 6 -> 2 [receive] via NET/IB/0
master_node:20:289 [0] NCCL INFO Ring 01 : 0[0] -> 1[1] via P2P/IPC
master_node:21:290 [1] NCCL INFO Ring 01 : 1[1] -> 0[0] via P2P/IPC
secondary_node:44:311 [3] NCCL INFO Ring 00 : 7[3] -> 5[1] via P2P/IPC
secondary_node:43:309 [2] NCCL INFO Ring 00 : 6 -> 2 [send] via NET/IB/0
secondary_node:42:310 [1] NCCL INFO Ring 00 : 5[1] -> 4[0] via P2P/IPC
secondary_node:41:312 [0] NCCL INFO Ring 00 : 4[0] -> 6[2] via P2P/IPC
secondary_node:43:309 [2] NCCL INFO Ring 00 : 2 -> 6 [receive] via NET/IB/0
master_node:22:295 [2] NCCL INFO Ring 00 : 2 -> 6 [send] via NET/IB/0
master_node:22:295 [2] NCCL INFO Ring 01 : 7 -> 2 [receive] via NET/IB/0
master_node:22:295 [2] NCCL INFO Ring 01 : 2[2] -> 0[0] via P2P/IPC
secondary_node:43:309 [2] NCCL INFO Ring 01 : 3 -> 6 [receive] via NET/IB/0
master_node:23:288 [3] NCCL INFO Ring 01 : 3[3] -> 1[1] via P2P/IPC
master_node:21:290 [1] NCCL INFO Trees [0] 0->1->3/-1/-1 [1] 0->1->3/-1/-1
secondary_node:44:311 [3] NCCL INFO Ring 01 : 7 -> 2 [send] via NET/IB/0
master_node:23:288 [3] NCCL INFO Trees [0] 1->3->-1/-1/-1 [1] 1->3->-1/-1/-1
master_node:20:289 [0] NCCL INFO Ring 01 : 0[0] -> 2[2] via P2P/IPC
secondary_node:43:309 [2] NCCL INFO Ring 01 : 6[2] -> 4[0] via P2P/IPC
master_node:21:290 [1] NCCL INFO comm 0x7f4d6839f060 rank 1 nranks 8 cudaDev 1 nvmlDev 1 - Init COMPLETE
master_node:23:288 [3] NCCL INFO comm 0x7f48503a3650 rank 3 nranks 8 cudaDev 3 nvmlDev 3 - Init COMPLETE
master_node:20:289 [0] NCCL INFO Trees [0] 2->0->1/-1/-1 [1] 2->0->1/-1/-1
master_node:20:289 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees enabled for all sizes
secondary_node:42:310 [1] NCCL INFO Ring 01 : 5[1] -> 7[3] via P2P/IPC
secondary_node:41:312 [0] NCCL INFO Ring 01 : 4[0] -> 5[1] via P2P/IPC
master_node:20:289 [0] NCCL INFO comm 0x7f5450362840 rank 0 nranks 8 cudaDev 0 nvmlDev 0 - Init COMPLETE
master_node:22:295 [2] NCCL INFO Ring 01 : 2 -> 6 [send] via NET/IB/0
secondary_node:44:311 [3] NCCL INFO Ring 01 : 7[3] -> 5[1] via P2P/IPC
secondary_node:43:309 [2] NCCL INFO Ring 01 : 2 -> 6 [receive] via NET/IB/0
secondary_node:44:311 [3] NCCL INFO Trees [0] 5->7->-1/-1/-1 [1] 5->7->-1/-1/-1
master_node:22:295 [2] NCCL INFO Ring 01 : 6 -> 2 [receive] via NET/IB/0
secondary_node:42:310 [1] NCCL INFO Ring 01 : 5[1] -> 4[0] via P2P/IPC
secondary_node:41:312 [0] NCCL INFO Ring 01 : 4[0] -> 6[2] via P2P/IPC
secondary_node:44:311 [3] NCCL INFO comm 0x7ff2c43f7c00 rank 7 nranks 8 cudaDev 3 nvmlDev 3 - Init COMPLETE
secondary_node:42:310 [1] NCCL INFO Trees [0] 4->5->7/-1/-1 [1] 4->5->7/-1/-1
secondary_node:41:312 [0] NCCL INFO Trees [0] 6->4->5/-1/-1 [1] 6->4->5/-1/-1
secondary_node:41:312 [0] NCCL INFO comm 0x7fd8dc3c6740 rank 4 nranks 8 cudaDev 0 nvmlDev 0 - Init COMPLETE
secondary_node:43:309 [2] NCCL INFO Ring 01 : 6 -> 2 [send] via NET/IB/0
secondary_node:43:309 [2] NCCL INFO Trees [0] 2->6->4/-1/-1 [1] -1->6->4/2/-1
secondary_node:42:310 [1] NCCL INFO comm 0x7fa7cc422c90 rank 5 nranks 8 cudaDev 1 nvmlDev 1 - Init COMPLETE
secondary_node:43:309 [2] NCCL INFO comm 0x7fce9c438c90 rank 6 nranks 8 cudaDev 2 nvmlDev 2 - Init COMPLETE
master_node:22:295 [2] NCCL INFO Trees [0] -1->2->0/6/-1 [1] 6->2->0/-1/-1
master_node:22:295 [2] NCCL INFO comm 0x7fd8f038f460 rank 2 nranks 8 cudaDev 2 nvmlDev 2 - Init COMPLETE
master_node:20:289 [0] NCCL INFO Launch mode Parallel
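For comparison, a minimal sketch of a launch command that asks NCCL for GPUDirect RDMA explicitly, assuming OpenMPI and the tf_cnn_benchmarks script; the host names and flag values are illustrative, not the configuration from this thread. NCCL only picks GDRDMA when the GPU-to-NIC distance is at or below NCCL_NET_GDR_LEVEL, and the "IB NIC distance" lines above report NODE and SYS, which typically exceed the default threshold:
# Illustrative only: two-node, 8-GPU Horovod run requesting GPUDirect RDMA.
# Host names, the HCA name mlx5_0, and the interface name ib0 are placeholders.
mpirun -np 8 -H master_node:4,secondary_node:4 \
    -x NCCL_DEBUG=INFO \
    -x NCCL_IB_DISABLE=0 \
    -x NCCL_IB_HCA=mlx5_0 \
    -x NCCL_SOCKET_IFNAME=ib0 \
    -x NCCL_NET_GDR_LEVEL=SYS \
    python tf_cnn_benchmarks.py --model resnet50 --batch_size 64 \
        --variable_update horovod
# NCCL_NET_GDR_LEVEL=SYS lifts the GPU-to-NIC distance limit entirely;
# older NCCL releases (such as the 2.4.7 in this log) take an integer here.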