Distributed training speed slows down compared to one node? #476

Closed
tingweiwu opened this issue Sep 3, 2018 · 8 comments


tingweiwu commented Sep 3, 2018

I ran tf_cnn_benchmarks.py:

On one node with 1 GPU, the speed is total images/sec: 195.11.

On one node with 8 GPUs, the speed is total images/sec: 1188.51.

Then I ran it on two nodes, each with 8 GPUs, using Docker with host networking; the speed is total images/sec: 745.06.
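
For reference, an ideal 2-node run would be roughly 2 × 1188.51 ≈ 2377 images/sec, so 745.06 images/sec works out to about 745.06 / 2377 ≈ 31% scaling efficiency, i.e. the 2-node run is only ~0.63× the single-node throughput.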

GPU: V100
HCA: InfiniBand 100 Gb/s
Ethernet NIC: 2 × 10 Gb/s

Run commands:
On the primary node:

mpirun -np 16 -H IP1:8,IP2:8 -bind-to none -map-by slot -x NCCL_SOCKET_IFNAME=ib0 -mca btl_tcp_if_exclude docker0,tunl0,lo -x NCCL_DEBUG=ERROR -x LD_LIBRARY_PATH -mca plm_rsh_args "-p 12345 -vvvv" python tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod

On the secondary node:

bash -c "/usr/sbin/sshd -p 12345; sleep infinity"

Here is the timeline:
#466 (comment)
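
A timeline like this is captured by setting the HOROVOD_TIMELINE environment variable and forwarding it through mpirun; roughly, with the other options the same as in the command above:

HOROVOD_TIMELINE=/code/timeline.json mpirun -np 16 -H IP1:8,IP2:8 ... -x HOROVOD_TIMELINE ... python tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod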

I would appreciate any suggestions.

tingweiwu (Author) commented:

@tgaddair could you take a look at the timeline and give me some ideas on how to look into this issue? I've been stuck on this for days. Thanks a lot.

tgaddair (Collaborator) commented Sep 5, 2018

Hey @tingweiwu, I'll try and take a look today. Been pretty swamped recently. Thanks for bearing with me.

tgaddair (Collaborator) commented Sep 6, 2018

I took a look at your timeline, and ran the same experiment on GCP with 2 nodes and 8 GPUs per node. The scaling efficiency I saw was nearly 100%.

Here's what my timeline is showing:

[screenshot: timeline from the 2-node GCP run]

In contrast, here's what we're seeing in your timeline:

[screenshot: timeline from the reported 2-node run]

There is sporadic queuing occurring that is causing the slowdown. This happens when a reduction is submitted to NCCL but the previous NCCL operation hasn't finished yet.

One thing I'd recommend is playing around with the -mca btl_openib_receive_queues MPI option. Per this thread, one common suggestion with InfiniBand is to set -mca btl_openib_receive_queues P,128,32:P,2048,32:P,12288,32:P,131072,32.
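
For reference, each colon-separated entry in that value is a per-peer ("P") receive queue of the form P,<buffer_size_bytes>,<num_buffers>, so the setting above adds queues sized for 128 B, 2 KB, 12 KB, and 128 KB messages (a rough reading of the Open MPI option; ompi_info --param btl openib --level 9 should show the exact semantics for your build). Added to your existing command it would look like:

mpirun -np 16 -H IP1:8,IP2:8 -bind-to none -map-by slot -x NCCL_SOCKET_IFNAME=ib0 -mca btl_tcp_if_exclude docker0,tunl0,lo -x NCCL_DEBUG=ERROR -x LD_LIBRARY_PATH -mca btl_openib_receive_queues P,128,32:P,2048,32:P,12288,32:P,131072,32 -mca plm_rsh_args "-p 12345 -vvvv" python tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod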

If you get a chance, try a few different settings and let me know if it has an effect on performance.

tingweiwu (Author) commented Sep 6, 2018

@tgaddair I have tried the -mca btl_openib_receive_queues option:

HOROVOD_TIMELINE=/code/timeline.json mpirun -np 16 -H IP1:8,IP2:8 -bind-to none -map-by slot -x HOROVOD_TIMELINE -x NCCL_SOCKET_IFNAME=ib0 -mca btl_tcp_if_exclude docker0,tunl0,lo -x NCCL_DEBUG=ERROR -x LD_LIBRARY_PATH -mca btl_openib_receive_queues P,128,32:P,2048,32:P,12288,32:P,131072,32 -mca plm_rsh_args "-p 12345 -vvvv" python /code/tf_cnn_benchmarks.py --model=resnet101 --batch_size=64 --variable_update=horovod

The speed is total images/sec: 440.62,

and the timeline shows that the queuing still occurs:
[screenshot: timeline still showing queuing]

How do I choose the value of -mca btl_openib_receive_queues? The value P,128,32:P,2048,32:P,12288,32:P,131072,32 you mentioned did not work for me.

tingweiwu (Author) commented:

@alsrgv hi, I saw you said here that you use -mca btl_openib_receive_queues P,128,32:P,2048,32:P,12288,32:P,131072,32 across 4 to 32 servers and get 90%+ scaling efficiency.

You said you came up with it by reading the Open MPI source code. Could you give me some idea of how to change this parameter, since -mca btl_openib_receive_queues P,128,32:P,2048,32:P,12288,32:P,131072,32 did not work for me?
Thanks a lot.

tgaddair (Collaborator) commented Sep 6, 2018

Hey @tingweiwu, have you tried running the benchmark on just the second node (IP2)? If I remember correctly, you've run the job on IP1 and IP1 + IP2. I was just chatting with @alsrgv, and we're wondering if it might be a hardware issue of some kind.

Could you try running the benchmark on just the second node and let us know how it goes? Thanks.
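
For example, something like this should run the same benchmark on IP2 alone (a sketch based on your original command; adjust paths and options to match your setup):

mpirun -np 8 -H IP2:8 -bind-to none -map-by slot -x NCCL_DEBUG=ERROR -x LD_LIBRARY_PATH -mca plm_rsh_args "-p 12345 -vvvv" python tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod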

tgaddair (Collaborator) commented:

Hey @tingweiwu, glad you managed to get it working. And thanks for documenting everything; this will be very useful to users with similar issues in the future.

I can't say for sure why you saw such a huge difference between RDMA and TCP. In the other issue you referenced, that user was seeing about 77% scaling efficiency, which was significantly better than what you were seeing.

It's possible that using plain sockets instead of TCP on top of IPoIB, in addition to not using RDMA, contributed to the noticeably larger difference you saw.
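
If it helps, one way to confirm which transport NCCL actually picked is to raise the NCCL debug level, e.g. by swapping the debug flag in the earlier command (a sketch, not something we ran here):

mpirun -np 16 -H IP1:8,IP2:8 -bind-to none -map-by slot -x NCCL_SOCKET_IFNAME=ib0 -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -mca plm_rsh_args "-p 12345 -vvvv" python tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod

The startup log should then include NET/IB or NET/Socket lines indicating whether the InfiniBand or the plain socket path was used.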

RookieoftheYear commented Nov 18, 2020

"You said you came up with it by reading the Open MPI source code"

Hi @tingweiwu, I am suffering from the same issue: I get lots of QUEUE entries in the timeline, 2 nodes are much slower than 1 node, and the option -mca btl_openib_receive_queues P,128,32:P,2048,32:P,12288,32:P,131072,32 did not work for me. Could you please tell me how you fixed it? Thanks!
