Distributed training speed slows down compared to one node? #476
Comments
@tgaddair, could you take a look at the timeline and give me some ideas for looking into this issue? I've been stuck for days. Thanks a lot if you reply.
Hey @tingweiwu, I'll try to take a look today. Been pretty swamped recently. Thanks for bearing with me.
I took a look at your timeline and ran the same experiment on GCP with 2 nodes and 8 GPUs per node. The scaling efficiency I saw was nearly 100%. In your timeline, by contrast, there's sporadic queuing occurring, and that's what's causing the slowdown. This happens when a reduction is ready to be done with NCCL, but the previous NCCL operation hasn't yet finished. One thing I'd recommend is playing around with the relevant Horovod tuning parameter. If you get a chance, try a few different settings and let me know if it has an effect on performance.
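The comment above doesn't spell out which parameter is meant, so the following is only a minimal sketch assuming it refers to Horovod's tensor-fusion knobs, the HOROVOD_FUSION_THRESHOLD and HOROVOD_CYCLE_TIME environment variables; the values are illustrative, not recommendations:

```python
# Minimal sketch: experimenting with Horovod's tensor-fusion settings.
# HOROVOD_FUSION_THRESHOLD (bytes) and HOROVOD_CYCLE_TIME (milliseconds)
# are assumed to be the kind of knob being recommended above; the exact
# values are placeholders to experiment with.
import os

# These must be set before Horovod initializes, e.g. at the top of the
# benchmark script or exported in the launch environment.
os.environ["HOROVOD_FUSION_THRESHOLD"] = str(128 * 1024 * 1024)  # fuse up to 128 MB per cycle
os.environ["HOROVOD_CYCLE_TIME"] = "3.5"                         # look for fusible tensors every 3.5 ms

import horovod.tensorflow as hvd

hvd.init()
if hvd.rank() == 0:
    print("Horovod initialized with custom fusion settings; "
          "re-run the benchmark and compare timelines.")
```

Exporting the same variables before `mpirun` (and forwarding them with `-x`) achieves the same effect without touching the script.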
@tgaddair I have tried that. The speed is about the same, and the timeline shows that the queuing still occurs. How do I choose the value of this parameter?
@alsrgv hi, I saw what you said here. As you said, you came up with that value by reading the Open MPI source code. Could you give me some idea of how to change this parameter?
Hey @tingweiwu, have you tried running the benchmark on just the second node (IP2)? If I remember correctly, you've run the job on IP1 and IP1 + IP2. I was just chatting with @alsrgv, and we're wondering if it might be a hardware issue of some kind. Could you try running the benchmark on just the second node and let us know how it goes? Thanks.
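As a complementary sanity check (not part of tf_cnn_benchmarks, just a hypothetical sketch assuming Horovod with the TensorFlow 2.x eager API), a small allreduce microbenchmark run on each node separately, e.g. with `horovodrun -np 8 python allreduce_check.py`, would show whether one machine's GPU/NCCL path is noticeably slower than the other's:

```python
# Hypothetical single-node sanity check: time a few large allreduces on one
# machine so the two nodes can be compared in isolation.
import time

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin each Horovod process to one local GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

payload = tf.random.uniform([16 * 1024 * 1024])  # ~64 MB of float32

hvd.allreduce(payload)  # warm-up: NCCL communicator setup happens here

iters = 20
start = time.time()
for _ in range(iters):
    result = hvd.allreduce(payload)
_ = result.numpy()  # make sure the last allreduce has completed
elapsed = time.time() - start

if hvd.rank() == 0:
    print(f"{iters} x 64 MB allreduce over {hvd.size()} GPUs: "
          f"{elapsed / iters * 1000:.2f} ms per operation")
```

If the per-operation time on the second node alone is much worse than on the first node alone, that points at a hardware or driver problem rather than the interconnect.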
Hey @tingweiwu, glad you managed to get it working. And thanks for documenting everything; this will be very useful to users with similar issues in the future. I can't say for sure why you saw such a huge difference between RDMA and TCP. In the other issue you referenced, that user was seeing about 77% scaling efficiency, which was significantly better than what you were seeing. It's possible that using plain sockets rather than TCP on top of IPoIB, in addition to not using RDMA, contributed to the noticeably larger difference you saw.
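One way to make that RDMA-versus-sockets comparison concrete is to run the same job twice and force NCCL off InfiniBand for one of the runs. This is a rough sketch using NCCL's standard NCCL_IB_DISABLE and NCCL_DEBUG environment variables (NCCL's own knobs, not something taken from this thread):

```python
# Rough sketch: force NCCL onto plain sockets (no InfiniBand/RDMA) so the
# benchmark can be timed once with and once without the fast transport.
# The NCCL_DEBUG=INFO log lines show which transport and interface NCCL
# actually picked for each run.
import os

os.environ["NCCL_DEBUG"] = "INFO"    # log the transport/interface selection
os.environ["NCCL_IB_DISABLE"] = "1"  # "1" forces sockets; unset (or "0") for the IB/RDMA run

import horovod.tensorflow as hvd

hvd.init()
if hvd.rank() == 0:
    print("NCCL restricted to sockets for this run; compare images/sec "
          "against a run with NCCL_IB_DISABLE unset.")
```

Comparing the two runs' images/sec gives a direct measure of how much of the gap is due to the transport.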
hi @tingweiwu, I am suffering from the same issue; I also get lots of this queuing in my timeline.
I run tf_cnn_benchmarks.py.

On one node with 1 GPU, the speed is:
total images/sec: 195.11

On one node with 8 GPUs, the speed is:
total images/sec: 1188.51

Then I run it on two nodes, each with 8 GPUs, using Docker with host networking. The speed is:
total images/sec: 745.06
GPU: V100
HCA: InfiniBand 100 Gb/s
Ethernet NIC: 2×10 Gb/s
Run command:
On the primary node:
On the secondary node:
Here is the timeline: #466 (comment)

I would appreciate it if you could give some suggestions.
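Putting the reported throughputs side by side, the scaling efficiency can be worked out directly. The percentages below are derived from the figures above, not quoted from the thread:

```python
# Worked example: scaling efficiency implied by the reported throughputs.
single_gpu = 195.11    # images/sec, 1 node x 1 GPU
single_node = 1188.51  # images/sec, 1 node x 8 GPUs
two_nodes = 745.06     # images/sec, 2 nodes x 8 GPUs

# Within one node: 8 GPUs versus 8x a single GPU.
intra_node = single_node / (8 * single_gpu)

# Across nodes: ideal would be 2x the single-node, 8-GPU throughput.
inter_node = two_nodes / (2 * single_node)

print(f"intra-node efficiency:   {intra_node:.1%}")  # ~76%
print(f"node-to-node efficiency: {inter_node:.1%}")  # ~31%
```

The two-node run is not just sub-linear; at roughly 31% efficiency it is slower than a single node, which is why the discussion above focuses on the NCCL/network path rather than the model itself.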