
Performance drops when TensorFlow experimental XLA JIT is enabled. #300

Closed
rankeey opened this issue Jun 23, 2020 · 6 comments
@rankeey (Collaborator) commented Jun 23, 2020

@lgarithm @luomai
You can reproduce this with your kungfu_benchmark.py by adding one line to the session config:

args = parser.parse_args()
args.cuda = not args.no_cuda

config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1  # the added line
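
For reference, here is a minimal, self-contained TF 1.x sketch of the same JIT setting; the toy matmul graph is only a stand-in for the benchmark model (ResNet-50 in the real runs), and the exact session wiring in kungfu_benchmark.py may differ:

import tensorflow as tf

# Session config with experimental XLA JIT enabled (TF 1.x API).
config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

# Toy graph standing in for the benchmark model.
w = tf.Variable(tf.random_normal([1024, 1024]))
x = tf.random_normal([64, 1024])
loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

# The config must be passed to the session for the JIT level to take effect.
with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_op)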

I tested on a node with 8 V100 GPUs. The results are as follows.
Configuration:
optimizer=sync-sgd
batch-size=64

kungfu:
no xla:

[127.0.0.1.10000::stdout] Iter #3: 313.9 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Iter #4: 312.8 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Iter #5: 316.1 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Iter #6: 311.5 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Iter #7: 314.6 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Iter #8: 315.3 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Iter #9: 313.4 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Img/sec per /gpu:0: 313.8 +-2.4

with xla:

[127.0.0.1.10000::stdout] Iter #5: 230.2 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Iter #6: 230.9 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Iter #7: 231.4 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Iter #8: 229.6 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Iter #9: 230.1 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Img/sec per /gpu:0: 230.1 +-1.5

horovod:
no xla:

Iter #35: 334.8 img/sec per GPU
Iter #36: 335.5 img/sec per GPU
Iter #37: 327.0 img/sec per GPU
Iter #38: 327.9 img/sec per GPU
Iter #39: 335.2 img/sec per GPU
Iter #40: 334.9 img/sec per GPU
Iter #41: 335.0 img/sec per GPU
Iter #42: 334.9 img/sec per GPU
Iter #43: 335.4 img/sec per GPU
Iter #44: 335.3 img/sec per GPU
Iter #45: 331.8 img/sec per GPU
Iter #46: 334.7 img/sec per GPU
Iter #47: 335.3 img/sec per GPU
Iter #48: 334.8 img/sec per GPU
Iter #49: 335.2 img/sec per GPU
Img/sec per GPU: 334.2 +-5.4

with xla:

Iter #39: 372.4 img/sec per GPU
Iter #40: 379.9 img/sec per GPU
Iter #41: 379.0 img/sec per GPU
Iter #42: 380.2 img/sec per GPU
Iter #43: 378.8 img/sec per GPU
Iter #44: 379.9 img/sec per GPU
Iter #45: 380.3 img/sec per GPU
Iter #46: 379.7 img/sec per GPU
Iter #47: 379.4 img/sec per GPU
Iter #48: 379.5 img/sec per GPU
Iter #49: 379.1 img/sec per GPU
Img/sec per GPU: 379.7 +-5.1
@luomai self-assigned this Jun 23, 2020
@luomai (Member) commented Jun 23, 2020

Thanks for reporting this. I can confirm that the issue is reproducible on our side.

We have replicated the same setting in Horovod's benchmark and see the same performance drop with XLA when using Horovod, so the issue is not specific to KungFu.

I have explored different settings for XLA. Only the JIT level affects the training throughput of the ResNet-50 model. The options I tried are listed below (a sketch applying them together follows the list):

config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1 # Only this makes a difference
config.graph_options.optimizer_options.opt_level = tf.OptimizerOptions.L1
config.graph_options.optimizer_options.do_common_subexpression_elimination = True # No change
config.graph_options.optimizer_options.do_constant_folding = True # No change
config.graph_options.optimizer_options.do_function_inlining = True # No change
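
For completeness, a sketch applying these options together on a single ConfigProto (TF 1.x), as referenced above; in these runs only the JIT level changed ResNet-50 throughput:

import tensorflow as tf

config = tf.ConfigProto()
opts = config.graph_options.optimizer_options

# The only option that changed ResNet-50 throughput in these experiments.
opts.global_jit_level = tf.OptimizerOptions.ON_1

# Classic graph-optimizer options that made no measurable difference here.
opts.opt_level = tf.OptimizerOptions.L1
opts.do_common_subexpression_elimination = True
opts.do_constant_folding = True
opts.do_function_inlining = True

# `config` is then passed to the tf.Session used by the benchmark.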

We are studying why XLA JIT causes this performance drop when multiple GPUs are used with Horovod and KungFu.

The test environment uses TensorFlow 1.13.2 and Horovod 0.16.0.

@luomai (Member) commented Jun 24, 2020

Just tried TensorFlow 1.15.2 and Horovod 0.16.1. Still the same issue.

@luomai (Member) commented Jun 24, 2020

The Horovod and Nvidia teams have found the same issue. The "clustering of ops" behaviour of XLA-JIT unfortunately prevents communication and computation operators from overlapping. The same problem applies to KungFu, which is why both Horovod and KungFu lose performance with XLA-JIT.

Here is the quoted reply from the Nvidia team (horovod/horovod#1673):

Due to clustering of ops by XLA, enabling XLA can cause Horovod ops to no longer overlap (or overlap less efficiently) with computation, causing degradation in scaling performance. The issue is that Horovod will only be informed of tensors needing processing between XLA clusters. With that being said, depending on the scale you are running at, the increase in performance provided by XLA may outweigh the loss in scalability, resulting in higher raw throughput (as seen in @vilmara's results where enabling XLA reduces scaling efficiency, but achieves a much higher throughput).

You can try limiting the XLA cluster size by setting the following environment variable TF_XLA_FLAGS="--tf_xla_max_cluster_size=N" where you set N to a moderately size value like 500, 1000, or maybe more. Limiting the max cluster size can enable Horovod to overlap communication more, but may reduce raw performance so you'll have to experiment to see if the tradeoff is worth it for your application.
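
A minimal way to try that suggestion from Python (the flag can equally be exported in the shell before launching the benchmark); the cluster-size value of 500 is only a starting point and should be tuned per model and scale:

import os

# Limit the maximum XLA cluster size so tensors are handed to the
# allreduce layer (Horovod/KungFu) between clusters more often.
os.environ["TF_XLA_FLAGS"] = "--tf_xla_max_cluster_size=500"

import tensorflow as tf  # set the flag before importing TensorFlow, to be safe

config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1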

Similar issues have been reported by other Horovod users, and by TensorFlow users running XLA-JIT on large clusters.

@luomai changed the title from "When using XLA acceleration, performance degradation is severe" to "Performance drops when TensorFlow experimental XLA JIT is enabled." Jun 24, 2020
@lgarithm (Collaborator) commented:

Result of running the Horovod benchmark on 2 DGX nodes:

# with `config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1`
RESULT: 224.546734 +-14.929252 {"framework":"horovod","version":"0.16.1","np":16,"bs":32,"model":"ResNet50"}

# without XLA
RESULT: 240.788208 +-6.951544 {"framework":"horovod","version":"0.16.1","np":16,"bs":32,"model":"ResNet50"}

@lgarithm (Collaborator) commented Jun 25, 2020

Horovod 1 DGX result:

# with `config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1`
RESULT: 330.247851 +-8.340494 {"framework":"horovod","version":"0.16.1","np":8,"bs":32,"model":"ResNet50"}

# without XLA
RESULT: 292.453241 +-2.209386 {"framework":"horovod","version":"0.16.1","np":8,"bs":32,"model":"ResNet50"}

lgarithm added commits that referenced this issue Jun 25, 2020
@luomai (Member) commented Jun 25, 2020

KungFu 1 DGX result:

with XLA JIT:

RESULT: 298.445066 +-3.038366 {"framework":"kungfu","np":8,"strategy":"BINARY_TREE_STAR","bs":32,"model":"ResNet50","xla":false,"kf-opt":"sync-sgd-nccl","fuse":false,"nvlink":"true"}

without XLA JIT:

RESULT: 311.254795 +-3.357482 {"framework":"kungfu","np":8,"strategy":"BINARY_TREE_STAR","bs":32,"model":"ResNet50","xla":true,"kf-opt":"sync-sgd-nccl","fuse":false,"nvlink":"true"}

@luomai closed this as completed Jun 27, 2020