
Performance drops when TensorFlow experimental XLA JIT is enabled. #300

Closed
rankeey opened this issue Jun 23, 2020 · 6 comments
@rankeey (Collaborator) commented Jun 23, 2020

@lgarithm @luomai
You can reproduce this with your kungfu_benchmark.py by adding one line to the session config:

args = parser.parse_args()
args.cuda = not args.no_cuda

config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1  # the added line
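
For reference, here is a minimal, self-contained TF 1.x sketch of the same JIT setting; the toy matmul graph is only a stand-in for the benchmark model (ResNet-50 in the real runs), and the exact session wiring in kungfu_benchmark.py may differ:

import tensorflow as tf

# Session config with experimental XLA JIT enabled (TF 1.x API).
config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

# Toy graph standing in for the benchmark model.
w = tf.Variable(tf.random_normal([1024, 1024]))
x = tf.random_normal([64, 1024])
loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

# The config must be passed to the session for the JIT level to take effect.
with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_op)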

I tested on a node with 8 V100 GPUs. The results are as follows.
Configuration:
optimizer=sync-sgd
batch-size=64

kungfu:
no xla:

[127.0.0.1.10000::stdout] Iter #3: 313.9 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Iter #4: 312.8 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Iter #5: 316.1 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Iter #6: 311.5 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Iter #7: 314.6 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Iter #8: 315.3 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Iter #9: 313.4 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Img/sec per /gpu:0: 313.8 +-2.4

with xla:

[127.0.0.1.10000::stdout] Iter #5: 230.2 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Iter #6: 230.9 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Iter #7: 231.4 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Iter #8: 229.6 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Iter #9: 230.1 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Img/sec per /gpu:0: 230.1 +-1.5

horovod:
no xla:

Iter #35: 334.8 img/sec per GPU
Iter #36: 335.5 img/sec per GPU
Iter #37: 327.0 img/sec per GPU
Iter #38: 327.9 img/sec per GPU
Iter #39: 335.2 img/sec per GPU
Iter #40: 334.9 img/sec per GPU
Iter #41: 335.0 img/sec per GPU
Iter #42: 334.9 img/sec per GPU
Iter #43: 335.4 img/sec per GPU
Iter #44: 335.3 img/sec per GPU
Iter #45: 331.8 img/sec per GPU
Iter #46: 334.7 img/sec per GPU
Iter #47: 335.3 img/sec per GPU
Iter #48: 334.8 img/sec per GPU
Iter #49: 335.2 img/sec per GPU
Img/sec per GPU: 334.2 +-5.4

with xla:

Iter #39: 372.4 img/sec per GPU
Iter #40: 379.9 img/sec per GPU
Iter #41: 379.0 img/sec per GPU
Iter #42: 380.2 img/sec per GPU
Iter #43: 378.8 img/sec per GPU
Iter #44: 379.9 img/sec per GPU
Iter #45: 380.3 img/sec per GPU
Iter #46: 379.7 img/sec per GPU
Iter #47: 379.4 img/sec per GPU
Iter #48: 379.5 img/sec per GPU
Iter #49: 379.1 img/sec per GPU
Img/sec per GPU: 379.7 +-5.1
@luomai self-assigned this Jun 23, 2020
@luomai (Member) commented Jun 23, 2020

Thanks for reporting this. I can confirm that the issue is reproducible on our side.

We have replicated the same setting in Horovod's benchmark and see the same performance drop with XLA when using Horovod, so the issue is not specific to KungFu.

I have explored different settings for XLA. Only the JIT level affects the training throughput of the ResNet-50 model. The options I tried are listed below (a sketch applying them together follows the list):

config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1 # Only this makes a difference
config.graph_options.optimizer_options.opt_level = tf.OptimizerOptions.L1
config.graph_options.optimizer_options.do_common_subexpression_elimination = True # No change
config.graph_options.optimizer_options.do_constant_folding = True # No change
config.graph_options.optimizer_options.do_function_inlining = True # No change
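
For completeness, a sketch applying these options together on a single ConfigProto (TF 1.x), as referenced above; in these runs only the JIT level changed ResNet-50 throughput:

import tensorflow as tf

config = tf.ConfigProto()
opts = config.graph_options.optimizer_options

# The only option that changed ResNet-50 throughput in these experiments.
opts.global_jit_level = tf.OptimizerOptions.ON_1

# Classic graph-optimizer options that made no measurable difference here.
opts.opt_level = tf.OptimizerOptions.L1
opts.do_common_subexpression_elimination = True
opts.do_constant_folding = True
opts.do_function_inlining = True

# `config` is then passed to the tf.Session used by the benchmark.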

We are studying why XLA JIT causes this performance drop when multiple GPUs are used with Horovod and KungFu.

The test environment uses TensorFlow 1.13.2 and Horovod 0.16.0.

@luomai (Member) commented Jun 24, 2020

Just tried TensorFlow 1.15.2 and Horovod 0.16.1. Still the same issue.

@luomai (Member) commented Jun 24, 2020

The Horovod and Nvidia teams have found the same issue. The "clustering of ops" behaviour of XLA-JIT unfortunately prevents communication and computation operators from overlapping. The same problem applies to KungFu, which is why both Horovod and KungFu lose performance with XLA-JIT.

Here is the quoted reply from the Nvidia team (horovod/horovod#1673):

Due to clustering of ops by XLA, enabling XLA can cause Horovod ops to no longer overlap (or overlap less efficiently) with computation, causing degradation in scaling performance. The issue is that Horovod will only be informed of tensors needing processing between XLA clusters. With that being said, depending on the scale you are running at, the increase in performance provided by XLA may outweigh the loss in scalability, resulting in higher raw throughput (as seen in @vilmara's results where enabling XLA reduces scaling efficiency, but achieves a much higher throughput).

You can try limiting the XLA cluster size by setting the following environment variable TF_XLA_FLAGS="--tf_xla_max_cluster_size=N" where you set N to a moderately size value like 500, 1000, or maybe more. Limiting the max cluster size can enable Horovod to overlap communication more, but may reduce raw performance so you'll have to experiment to see if the tradeoff is worth it for your application.
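
A minimal way to try that suggestion from Python (the flag can equally be exported in the shell before launching the benchmark); the cluster-size value of 500 is only a starting point and should be tuned per model and scale:

import os

# Limit the maximum XLA cluster size so tensors are handed to the
# allreduce layer (Horovod/KungFu) between clusters more often.
os.environ["TF_XLA_FLAGS"] = "--tf_xla_max_cluster_size=500"

import tensorflow as tf  # set the flag before importing TensorFlow, to be safe

config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1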

Similar issues have been reported by other Horovod users, and by TensorFlow users running XLA-JIT on large clusters.

@luomai changed the title from "When using XLA acceleration, performance degradation is severe" to "Performance drops when TensorFlow experimental XLA JIT is enabled." Jun 24, 2020
@lgarithm (Collaborator) commented:

Result of running the Horovod benchmark on 2 DGX nodes:

# with `config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1`
RESULT: 224.546734 +-14.929252 {"framework":"horovod","version":"0.16.1","np":16,"bs":32,"model":"ResNet50"}

# without XLA
RESULT: 240.788208 +-6.951544 {"framework":"horovod","version":"0.16.1","np":16,"bs":32,"model":"ResNet50"}

@lgarithm (Collaborator) commented Jun 25, 2020

Horovod 1 DGX result:

# with `config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1`
RESULT: 330.247851 +-8.340494 {"framework":"horovod","version":"0.16.1","np":8,"bs":32,"model":"ResNet50"}

# without XLA
RESULT: 292.453241 +-2.209386 {"framework":"horovod","version":"0.16.1","np":8,"bs":32,"model":"ResNet50"}

lgarithm added commits that referenced this issue Jun 25, 2020
@luomai (Member) commented Jun 25, 2020

KungFu 1 DGX result:

with XLA JIT:

RESULT: 298.445066 +-3.038366 {"framework":"kungfu","np":8,"strategy":"BINARY_TREE_STAR","bs":32,"model":"ResNet50","xla":false,"kf-opt":"sync-sgd-nccl","fuse":false,"nvlink":"true"}

without XLA JIT:

RESULT: 311.254795 +-3.357482 {"framework":"kungfu","np":8,"strategy":"BINARY_TREE_STAR","bs":32,"model":"ResNet50","xla":true,"kf-opt":"sync-sgd-nccl","fuse":false,"nvlink":"true"}

@luomai closed this as completed Jun 27, 2020