Your question:
I am running the TF benchmarks with Horovod in distributed mode (2 nodes, each with 4x V100 GPUs). Scaling efficiency drops to ~78% when XLA is enabled, versus ~90% with XLA disabled. Throughput numbers below:
With --xla=True
1 GPU (single node): 1236.81 images/sec
8 GPUs (across both nodes): 7691.81 images/sec (~78% scaling efficiency; the drop appears only when crossing nodes)
Without --xla
1 GPU (single node): 831.15 images/sec
8 GPUs (across both nodes): 5996.44 images/sec (~90% scaling efficiency)
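For reference, the efficiency percentages above follow from the standard weak-scaling formula: multi-GPU throughput divided by N times the single-GPU throughput. A minimal sketch using the reported images/sec numbers:

```python
# Scaling efficiency = multi-GPU throughput / (N * single-GPU throughput),
# computed from the images/sec figures reported above.
def scaling_efficiency(single_gpu: float, multi_gpu: float, n_gpus: int) -> float:
    return multi_gpu / (n_gpus * single_gpu)

eff_xla = scaling_efficiency(1236.81, 7691.81, 8)    # with --xla=True
eff_no_xla = scaling_efficiency(831.15, 5996.44, 8)  # without --xla

print(f"XLA: {eff_xla:.1%}, no XLA: {eff_no_xla:.1%}")
# → XLA: 77.7%, no XLA: 90.2%
```

Note that despite the lower efficiency, the XLA run still delivers ~28% higher raw throughput (7691.81 vs 5996.44 images/sec).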
Due to clustering of ops by XLA, enabling XLA can cause Horovod ops to no longer overlap (or overlap less efficiently) with computation, causing degradation in scaling performance. The issue is that Horovod will only be informed of tensors needing processing between XLA clusters. With that being said, depending on the scale you are running at, the increase in performance provided by XLA may outweigh the loss in scalability, resulting in higher raw throughput (as seen in @vilmara's results where enabling XLA reduces scaling efficiency, but achieves a much higher throughput).
You can try limiting the XLA cluster size by setting the environment variable TF_XLA_FLAGS="--tf_xla_max_cluster_size=N", where N is a moderately sized value like 500 or 1000 (or perhaps more). Limiting the max cluster size can let Horovod overlap more communication with computation, but may reduce raw performance, so you'll have to experiment to see whether the tradeoff is worth it for your application.
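A minimal sketch of setting this flag for a 2-node x 4-GPU run. The TF_XLA_FLAGS variable and --tf_xla_max_cluster_size flag come from the suggestion above; the hostnames, script name, and horovodrun invocation are placeholders to adapt to your setup:

```shell
# Cap XLA cluster size so Horovod allreduce ops surface between clusters
# and can overlap with remaining computation. Start with 500-1000 and sweep.
export TF_XLA_FLAGS="--tf_xla_max_cluster_size=1000"

# Hypothetical launch matching the 2-node, 4x V100 setup described above:
# horovodrun -np 8 -H node1:4,node2:4 python tf_cnn_benchmarks.py --xla=True

echo "$TF_XLA_FLAGS"
```

Make sure the variable is actually propagated to all ranks (e.g. via your launcher's environment-forwarding option), otherwise only the local node picks it up.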