This issue was moved to a discussion.
poor scaling performance in CPU multi-node #2739
Comments
From the timeline, I can see that the wall duration of allgather and allreduce is more than 100 ms. Is that normal? Network throughput is only about 1.5 Gb/s, while my network is 100 Gb/s.
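For context, here is a rough estimate of the per-step allreduce traffic (a sketch, not a measurement; it assumes the benchmark's default ResNet-50 with roughly 25.6M fp32 parameters and a ring allreduce):

```python
PARAMS = 25.6e6                 # approximate ResNet-50 parameter count
GRAD_BYTES = PARAMS * 4         # fp32 gradients
n = 8                           # total ranks

# A ring allreduce sends about 2*(n-1)/n of the buffer size per rank per step.
ring_bytes_per_rank = 2 * (n - 1) / n * GRAD_BYTES
print(f"{ring_bytes_per_rank / 1e6:.0f} MB sent per rank per step")  # → 179 MB
```

At a fraction of a training step per second, that volume is nowhere near enough to saturate a 25-100 Gb/s link, which is consistent with the low observed utilization.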
It seems that you have a high-speed fabric other than Ethernet? In that case, you have to use
@chongxiaoc oh, sorry, my network should be up to 25 Gb/s. I am not using RDMA.
How many CPU cores per node?
96 vCores per node, which should be quite enough.
Did you try fixing the number of torch threads per rank? I mean this example: I think we have to fix the number of threads per rank, then see how it scales.
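The suggestion above could be sketched like this (a minimal example; `threads_per_rank` is a hypothetical helper, and it assumes `OMP_NUM_THREADS` is read by the framework at import time):

```python
import os

def threads_per_rank(cores_per_node: int, local_ranks: int) -> int:
    # Hypothetical helper: split the node's cores evenly across local ranks.
    return max(1, cores_per_node // local_ranks)

# With 96 vCores and 4 ranks per node, pin each rank to 24 threads.
n = threads_per_rank(96, 4)
os.environ["OMP_NUM_THREADS"] = str(n)  # set before importing the framework
# In PyTorch specifically, one can also call torch.set_num_threads(n).
```

Without an explicit cap, each rank may try to use every core on the node, and the ranks then thrash against each other.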
Is there a similar interface in TensorFlow to fix the number of threads? I am using TensorFlow 2.4.1.
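For reference, TensorFlow 2 does expose thread-pool settings via `tf.config.threading` (a configuration sketch; the values here are illustrative, and the calls must run before TensorFlow initializes its thread pools):

```python
import tensorflow as tf

# Must be called before any op executes, i.e. at the top of the script.
tf.config.threading.set_intra_op_parallelism_threads(24)  # threads within one op
tf.config.threading.set_inter_op_parallelism_threads(2)   # ops run concurrently
```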
@chongxiaoc Thanks very much for your info. By the way, besides this example, what I would really like to run is W&D training using TensorFlow, with the following key scripts:
Besides the scaling problem, there is another strange phenomenon: when np=1, the single process can use up to 50 cores, but when scaling up to 2 nodes with np=8, each process only uses ~4 cores. Why does this happen? Maybe because each process handles less data when np is larger?
I'm not sure I understand this question.
for np=1, for np=8 on 2 nodes,
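One way to check whether the launcher is restricting each rank's CPU affinity (a common cause of the "only ~4 cores per process" symptom when MPI binds ranks to cores) is to inspect the affinity mask from inside a rank. A minimal sketch, using the Linux-only stdlib call `os.sched_getaffinity`:

```python
import os

# If the MPI/Gloo launcher binds each rank to a subset of cores, that rank
# can only ever use those cores, regardless of any thread settings.
usable = len(os.sched_getaffinity(0))
print(f"this process may run on {usable} cores")
```

If this prints a small number under np=8, the fix is usually a launcher binding option (e.g. disabling core binding), not a framework setting.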
Environment:
Checklist:
Your question:
I tried the benchmark examples/tensorflow2/tensorflow2_synthetic_benchmark.py to measure scaling performance, but it becomes slower as it scales up. What would be the possible root causes?
horovodrun -np 1 -H localhost:1 python tensorflow2_synthetic_benchmark.py
[1,0]:Iter #0: 11.5 img/sec per GPU
[1,0]:Iter #1: 11.5 img/sec per GPU
[1,0]:Iter #2: 11.6 img/sec per GPU
[1,0]:Iter #3: 11.8 img/sec per GPU
[1,0]:Iter #4: 11.7 img/sec per GPU
[1,0]:Iter #5: 11.8 img/sec per GPU
[1,0]:Iter #6: 11.6 img/sec per GPU
[1,0]:Iter #7: 12.0 img/sec per GPU
[1,0]:Iter #8: 12.1 img/sec per GPU
[1,0]:Iter #9: 12.1 img/sec per GPU
[1,0]:Img/sec per GPU: 11.8 +-0.4
Scaling to np=4: horovodrun -np 4 -H localhost:4 python tensorflow2_synthetic_benchmark.py
[1,0]:Iter #0: 6.9 img/sec per GPU
[1,0]:Iter #1: 6.9 img/sec per GPU
Scaling up to np=8 across 2 nodes: horovodrun -np 8 -H localhost:4,sr225:4 python tensorflow2_synthetic_benchmark.py
[1,0]:Iter #0: 6.7 img/sec per GPU
[1,0]:Iter #1: 6.6 img/sec per GPU
[1,0]:Iter #2: 6.8 img/sec per GPU
[1,0]:Iter #3: 6.6 img/sec per GPU
Thanks!