Performance drops when TensorFlow experimental XLA JIT is enabled. #300
Comments
Thanks for reporting this. I can confirm that this issue can be reproduced on our side. We have duplicated the same setting in Horovod's benchmark and see the same performance drop with XLA when using Horovod, which means the issue is not specific to KungFu. I have explored different settings for XLA; only the JIT speeds up the training throughput of the ResNet-50 model. The options I played with are:
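As a hedged illustration (these are not necessarily the exact options referred to above), the usual XLA knobs in TensorFlow 1.x are the session-level global JIT flag and the process-wide auto-clustering environment flag:

```python
import tensorflow as tf

# Session-level JIT: ask the graph optimizer to compile eligible ops with XLA.
# Levels are OFF (default), ON_1 and ON_2; higher levels cluster more aggressively.
config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

# Alternative: enable XLA auto-clustering for the whole process via an
# environment flag, set in the shell before launching the script:
#   export TF_XLA_FLAGS=--tf_xla_auto_jit=2
```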
We are studying why XLA JIT causes this performance drop when using multiple GPUs with Horovod and KungFu. The tested environment has TensorFlow 1.13.2 and Horovod 0.16.0.
Just tried TensorFlow 1.15.2 and Horovod 0.16.1. Still the same issue.
The Horovod and Nvidia teams found the same issue. The "cluster of ops" feature in XLA JIT unfortunately prevents communication and computation operators from being overlapped. The same problem applies to KungFu, which is why both Horovod and KungFu lose performance with XLA JIT. Here is the quoted reply from the Nvidia team (horovod/horovod#1673):
Similar issues are reported by other Horovod users:
Similar issues are also reported by TensorFlow users when using XLA JIT in large clusters:
Result of running the Horovod benchmark on 2 DGX nodes:

Horovod 1 DGX result:

KungFu 1 DGX result, with XLA JIT:

Without XLA JIT:
@lgarithm @luomai
You can test this using your kungfu_benchmark.py by adding one line, as shown in the sketch below.
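A minimal sketch of what that one-line change could look like, assuming the benchmark builds a tf.ConfigProto for its training session; the helper below is hypothetical and not the actual code of kungfu_benchmark.py:

```python
import tensorflow as tf

def make_session_config(use_xla_jit):
    # Hypothetical helper; the real script builds a similar tf.ConfigProto.
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    if use_xla_jit:
        # The single added line: turn on experimental XLA JIT compilation.
        config.graph_options.optimizer_options.global_jit_level = \
            tf.OptimizerOptions.ON_1
    return config

sess = tf.Session(config=make_session_config(use_xla_jit=True))
```

Running the benchmark once with and once without that line should reproduce the throughput gap reported below.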
My test ran on a node with 8 V100 GPUs; the results are as follows.
Configuration:
optimizer=sync-sgd
batch-size=64
kungfu:
  no xla:
  with xla:
horovod:
  no xla:
  with xla: