Distributed training speed slows down compared to one node? #476

Closed
tingweiwu opened this issue Sep 3, 2018 · 8 comments


tingweiwu commented Sep 3, 2018

I ran tf_cnn_benchmarks.py:

On one node with 1 GPU, the speed is total images/sec: 195.11.

On one node with 8 GPUs, the speed is total images/sec: 1188.51.

Then I ran it on two nodes, each with 8 GPUs, using Docker with host networking; the speed is total images/sec: 745.06.
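
For reference, an ideal 2-node run would be roughly 2 × 1188.51 ≈ 2377 images/sec, so 745.06 images/sec works out to about 745.06 / 2377 ≈ 31% scaling efficiency, i.e. the 2-node run is only ~0.63× the single-node throughput.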

GPU: V100
HCA: InfiniBand 100 Gb/s
Ethernet NIC: 2 × 10 Gb/s

Run commands:
On the primary node:

mpirun -np 16 -H IP1:8,IP2:8 -bind-to none -map-by slot -x NCCL_SOCKET_IFNAME=ib0 -mca btl_tcp_if_exclude docker0,tunl0,lo -x NCCL_DEBUG=ERROR -x LD_LIBRARY_PATH -mca plm_rsh_args "-p 12345 -vvvv" python tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod

On the secondary node:

bash -c "/usr/sbin/sshd -p 12345; sleep infinity"

Here is the timeline:
#466 (comment)
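
A timeline like this is captured by setting the HOROVOD_TIMELINE environment variable and forwarding it through mpirun; roughly, with the other options the same as in the command above:

HOROVOD_TIMELINE=/code/timeline.json mpirun -np 16 -H IP1:8,IP2:8 ... -x HOROVOD_TIMELINE ... python tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod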

I would appreciate any suggestions.

tingweiwu (Author) commented:

@tgaddair could you take a look at the timeline and give me some ideas on how to look into this issue? I've been stuck on this for days. Thanks a lot.

tgaddair (Collaborator) commented Sep 5, 2018

Hey @tingweiwu, I'll try and take a look today. Been pretty swamped recently. Thanks for bearing with me.

tgaddair (Collaborator) commented Sep 6, 2018

I took a look at your timeline, and ran the same experiment on GCP with 2 nodes and 8 GPUs per node. The scaling efficiency I saw was nearly 100%.

Here's what my timeline is showing:

[screenshot: timeline from the 2-node GCP run]

In contrast, here's what we're seeing in your timeline:

[screenshot: timeline from the reported 2-node run]

There is sporadic queuing occurring that is causing the slowdown. This happens when a reduction is submitted to NCCL but the previous NCCL operation hasn't finished yet.

One thing I'd recommend is playing around with the -mca btl_openib_receive_queues MPI option. Per this thread, one common suggestion with InfiniBand is to set -mca btl_openib_receive_queues P,128,32:P,2048,32:P,12288,32:P,131072,32.
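
For reference, each colon-separated entry in that value is a per-peer ("P") receive queue of the form P,<buffer_size_bytes>,<num_buffers>, so the setting above adds queues sized for 128 B, 2 KB, 12 KB, and 128 KB messages (a rough reading of the Open MPI option; ompi_info --param btl openib --level 9 should show the exact semantics for your build). Added to your existing command it would look like:

mpirun -np 16 -H IP1:8,IP2:8 -bind-to none -map-by slot -x NCCL_SOCKET_IFNAME=ib0 -mca btl_tcp_if_exclude docker0,tunl0,lo -x NCCL_DEBUG=ERROR -x LD_LIBRARY_PATH -mca btl_openib_receive_queues P,128,32:P,2048,32:P,12288,32:P,131072,32 -mca plm_rsh_args "-p 12345 -vvvv" python tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod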

If you get a chance, try a few different settings and let me know if it has an effect on performance.

tingweiwu (Author) commented Sep 6, 2018

@tgaddair I have tried the -mca btl_openib_receive_queues option:

HOROVOD_TIMELINE=/code/timeline.json mpirun -np 16 -H IP1:8,IP2:8 -bind-to none -map-by slot -x HOROVOD_TIMELINE -x NCCL_SOCKET_IFNAME=ib0 -mca btl_tcp_if_exclude docker0,tunl0,lo -x NCCL_DEBUG=ERROR -x LD_LIBRARY_PATH -mca btl_openib_receive_queues P,128,32:P,2048,32:P,12288,32:P,131072,32 -mca plm_rsh_args "-p 12345 -vvvv" python /code/tf_cnn_benchmarks.py --model=resnet101 --batch_size=64 --variable_update=horovod

The speed is total images/sec: 440.62,

and the timeline shows that the queuing still occurs:
[screenshot: timeline still showing queuing]

How do I choose the value of -mca btl_openib_receive_queues? The value P,128,32:P,2048,32:P,12288,32:P,131072,32 you mentioned did not work for me.

tingweiwu (Author) commented:

@alsrgv hi, I saw you said here that you use -mca btl_openib_receive_queues P,128,32:P,2048,32:P,12288,32:P,131072,32 across 4 to 32 servers and get 90%+ scaling efficiency.

You said you came up with it by reading the Open MPI source code. Could you give me some idea of how to change this parameter, since -mca btl_openib_receive_queues P,128,32:P,2048,32:P,12288,32:P,131072,32 did not work for me?
Thanks a lot.

tgaddair (Collaborator) commented Sep 6, 2018

Hey @tingweiwu, have you tried running the benchmark on just the second node (IP2)? If I remember correctly, you've run the job on IP1 and IP1 + IP2. I was just chatting with @alsrgv, and we're wondering if it might be a hardware issue of some kind.

Could you try running the benchmark on just the second node and let us know how it goes? Thanks.
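
For example, something like this should run the same benchmark on IP2 alone (a sketch based on your original command; adjust paths and options to match your setup):

mpirun -np 8 -H IP2:8 -bind-to none -map-by slot -x NCCL_DEBUG=ERROR -x LD_LIBRARY_PATH -mca plm_rsh_args "-p 12345 -vvvv" python tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod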

tgaddair (Collaborator) commented:

Hey @tingweiwu, glad you managed to get it working. And thanks for documenting everything; this will be very useful to users with similar issues in the future.

I can't say for sure why you saw such a huge difference between RDMA and TCP. In the other issue you referenced, that user was seeing about 77% scaling efficiency, which was significantly better than what you were seeing.

It's possible that using plain sockets instead of TCP on top of IPoIB, in addition to not using RDMA, contributed to the noticeably larger difference you saw.
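
If it helps, one way to confirm which transport NCCL actually picked is to raise the NCCL debug level, e.g. by swapping the debug flag in the earlier command (a sketch, not something we ran here):

mpirun -np 16 -H IP1:8,IP2:8 -bind-to none -map-by slot -x NCCL_SOCKET_IFNAME=ib0 -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -mca plm_rsh_args "-p 12345 -vvvv" python tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod

The startup log should then include NET/IB or NET/Socket lines indicating whether the InfiniBand or the plain socket path was used.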

RookieoftheYear commented Nov 18, 2020

"You said you came up with it by reading the Open MPI source code"

Hi @tingweiwu, I am suffering from the same issue: I get lots of QUEUE entries in the timeline, 2 nodes are much slower than 1 node, and the option -mca btl_openib_receive_queues P,128,32:P,2048,32:P,12288,32:P,131072,32 did not work for me. Could you please tell me how you fixed it? Thanks!
