
Horovod performance decreases dramatically when run on multiple servers #221

Closed
scotthuang1989 opened this issue Mar 23, 2018 · 13 comments

@scotthuang1989 commented Mar 23, 2018

I have 2 servers, each with 4 GPUs. If I run Horovod on a single server, 1 epoch takes 10 seconds, but if I run it on 2 servers, it takes 60 seconds. I am not familiar with MPI, so I can only debug by observing system resources when I run it on multiple servers:

  1. CPU utilization is around 60%, which is a bit higher than when running on a single server (40%).
  2. Network transmit/receive is about 20M/s; I have a 10G network card, so it should not be the bottleneck.
  3. Most of the time, GPU utilization is almost 0.

Is there any tool or documentation to debug this issue?
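
For anyone debugging a slowdown like this, Horovod's timeline can show whether step time is going into negotiation/allreduce rather than GPU compute. A minimal sketch, assuming Open MPI as the launcher and server1/server2 as placeholder hostnames:

```
# Record a Horovod timeline (open it in chrome://tracing) to see how much
# of each step is spent in NEGOTIATE/ALLREDUCE versus compute.
# server1/server2 and train.py are placeholders for your hosts and script.
HOROVOD_TIMELINE=/tmp/timeline.json \
    mpirun -np 8 -H server1:4,server2:4 \
           -bind-to none -map-by slot \
           -x HOROVOD_TIMELINE -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
           python train.py
```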

@alsrgv (Member) commented Mar 23, 2018

@scotthuang1989, what model are you training and are you using NCCL?
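
As a quick sanity check, the build flag and NCCL's own logging can confirm whether the NCCL path is actually in use. A sketch, with placeholder hostnames and an assumed NCCL install path:

```
# Rebuild Horovod with the NCCL allreduce path enabled
# (/usr/local/nccl is a placeholder for your NCCL install location):
HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_NCCL_HOME=/usr/local/nccl \
    pip install --no-cache-dir horovod

# Then have NCCL log which transports and rings it picks at startup:
mpirun -np 8 -H server1:4,server2:4 -x NCCL_DEBUG=INFO python train.py
```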

@alsrgv added the question label Mar 23, 2018
@scotthuang1989 (Author) commented Mar 24, 2018

Just a simple LSTM model, which is fine when run locally. One of my colleagues told me it may be because of the network cable, but an administrator needs to go to the server room to check it. I will update the status when they have confirmation.
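
While waiting on the cable check, the link can also be verified from the shell. A sketch assuming ethtool and iperf3 are installed and eth0 is the relevant interface on both hosts:

```
# Report the negotiated link speed of the NIC (eth0 is a placeholder):
ethtool eth0 | grep -i speed

# Measure raw TCP throughput between the two hosts:
iperf3 -s                                  # on the first server
iperf3 -c <first-server-ip> -P 4 -t 30     # on the second server
```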

@alsrgv (Member) commented Mar 25, 2018

@scotthuang1989, how large is the checkpoint of your model?

@scotthuang1989 (Author) commented Mar 25, 2018

Around 16M. I have 2 servers, each with 4 GPUs.

@alsrgv (Member) commented Mar 25, 2018

I see. You may have trouble scaling this small model over regular 10GbE because of latency; you may need a low-latency network such as RoCE or InfiniBand. Training a bigger model may help with scaling, but then you may become bandwidth constrained.

Rule of thumb: 25GbE for 1080 Ti, 50GbE for P100, 100GbE for V100.
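
For latency-bound small models, Horovod's tensor fusion settings are one knob worth trying before changing hardware. A sketch with illustrative (not tuned) values:

```
# A larger fusion buffer / longer cycle time means fewer, bigger allreduces,
# trading a little per-batch latency for fewer network round-trips.
# The values below are illustrative, not recommendations.
mpirun -np 8 -H server1:4,server2:4 \
       -x HOROVOD_FUSION_THRESHOLD=134217728 \
       -x HOROVOD_CYCLE_TIME=10 \
       python train.py
```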

@scotthuang1989 (Author) commented Mar 25, 2018

My GPUs are 1080 Ti. Sounds like a long way to go...

@alsrgv (Member) commented Mar 25, 2018

Yeah, the network is quite demanding for distributed deep learning. How many epochs are you training? I wish my epochs took 10 seconds to run.

@scotthuang1989 (Author) commented Mar 26, 2018

In this example, an epoch takes 10 seconds if I run my model only on local GPUs; if I go distributed, it takes 60 seconds.

@YesterdayxD commented Jun 14, 2019

@scotthuang1989 Hi, I have the same problem as you. Did you change your NIC? Has your performance improved?

@cham-3 commented Jun 23, 2019

I met the same problem.

@scotthuang1989 (Author) commented Jun 24, 2019

@cham-3, I'm sorry, I don't have the answer. I ran into this problem a year ago, when our company had just set up the servers. Over the following months we adjusted our hardware and software (including the NIC, network cables, router, hard disks, etc.) for several reasons, and the problem disappeared at some point, so I don't know for sure what caused it.

@YesterdayxD commented Jun 24, 2019

When I changed the NIC from a 1Gb NIC to a 25Gb one, performance actually improved, but the NIC's I/O is only 300MB/s. @scotthuang1989 @cham-3

@abhi278 commented Jan 31, 2020

Tried Horovod again today with Docker. Following are the results:
Versions:
horovod: 0.19.0
mpirun: 4.0.0
environment: Docker
example: tensorflow2_synthetic_benchmark.py provided in horovod/examples
GPU: NVIDIA 1080 Ti - one per node

  • Scenario 1: 1 GPU, local machine
    Command:
    horovodrun -np 1 -H localhost:1 python3 tensorflow2_synthetic_benchmark.py
    Output:
[1,0]<stdout>:Iter #0: 189.8 img/sec per GPU
[1,0]<stdout>:Iter #1: 188.2 img/sec per GPU
[1,0]<stdout>:Iter #2: 187.9 img/sec per GPU
[1,0]<stdout>:Iter #3: 187.8 img/sec per GPU
[1,0]<stdout>:Iter #4: 187.7 img/sec per GPU
[1,0]<stdout>:Iter #5: 188.0 img/sec per GPU
[1,0]<stdout>:Iter #6: 187.9 img/sec per GPU
[1,0]<stdout>:Iter #7: 187.8 img/sec per GPU
[1,0]<stdout>:Iter #8: 187.9 img/sec per GPU
[1,0]<stdout>:Iter #9: 187.9 img/sec per GPU
[1,0]<stdout>:Img/sec per GPU: 188.1 +-1.1
[1,0]<stdout>:Total img/sec on 1 GPU(s): 188.1 +-1.1
  • Scenario 2: 2 GPUs, 1 on the local machine, 1 on another machine in the LAN
    Command:
    horovodrun -np 2 -H localhost:1,ml03:1 -p 12345 python3 tensorflow2_synthetic_benchmark.py
    Output:
[1,0]<stdout>:Iter #0: 33.1 img/sec per GPU
[1,0]<stdout>:Iter #1: 33.2 img/sec per GPU
[1,0]<stdout>:Iter #2: 33.1 img/sec per GPU
[1,0]<stdout>:Iter #3: 33.2 img/sec per GPU
[1,0]<stdout>:Iter #4: 33.3 img/sec per GPU
[1,0]<stdout>:Iter #5: 33.1 img/sec per GPU
[1,0]<stdout>:Iter #6: 33.2 img/sec per GPU
[1,0]<stdout>:Iter #7: 33.2 img/sec per GPU
[1,0]<stdout>:Iter #8: 33.1 img/sec per GPU
[1,0]<stdout>:Iter #9: 33.1 img/sec per GPU
[1,0]<stdout>:Img/sec per GPU: 33.2 +-0.1
[1,0]<stdout>:Total img/sec on 2 GPU(s): 66.3 +-0.2

Conclusion:
Horovod is slower when run on multiple machines for small networks; it probably requires higher network bandwidth. Ref: #752, #221
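
One thing worth ruling out in Docker setups is that traffic is leaving over the Docker bridge instead of the fast LAN interface. A sketch, assuming eth0 is the fast NIC and a Horovod version whose horovodrun supports --network-interface:

```
# Pin communication to the fast NIC instead of docker0 (eth0 is a placeholder).
# With mpirun, the rough equivalent is:
#   -x NCCL_SOCKET_IFNAME=eth0 -mca btl_tcp_if_include eth0
horovodrun -np 2 -H localhost:1,ml03:1 -p 12345 \
           --network-interface eth0 \
           python3 tensorflow2_synthetic_benchmark.py
```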
