Scaling Efficiency drops with xla enabled #1673

Closed
vilmara opened this issue Jan 15, 2020 · 2 comments
vilmara commented Jan 15, 2020

Environment:

  1. Framework: TensorFlow
  2. Framework version: TF 1.4
  3. Horovod version: 0.18.2 via Horovod in docker
  4. MPI version: 4.0.0
  5. CUDA version: 10.0
  6. NCCL version: 2.5.6
  7. Python version: 2.7
  8. OS and version: Ubuntu 18.04
  9. GCC version: 4.8
  10. CUDNN version: 7.6.5
  11. Mellanox OFED 4.7-3.2.9.0
  12. GPUDirect RDMA - nvidia-peer-memory_1.0-8

Your question:
I am running the TF benchmarks with Horovod in distributed mode (2 nodes, each with 4x V100 GPUs). Scaling efficiency drops to 78% with XLA enabled, versus 90% with XLA disabled. Throughput numbers below:

With --xla=True
1 GPU within a node, total images/sec: 1236.81
8 GPUs across both nodes, total images/sec: 7691.81 (~78% scaling efficiency; the drop appears only when crossing nodes)

Without --xla
1 GPU within a node, total images/sec: 831.15
8 GPUs across both nodes, total images/sec: 5996.44 (~90% scaling efficiency)
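The efficiency percentages above follow from dividing the measured multi-GPU throughput by ideal linear scaling (N times the single-GPU throughput). A small sketch using the images/sec figures reported in this issue:

```python
# Scaling efficiency = measured total throughput / (N * single-GPU throughput).
# Throughput numbers are the images/sec values reported above.

def scaling_efficiency(single_gpu_ips, n_gpus, total_ips):
    """Fraction of ideal linear scaling actually achieved."""
    return total_ips / (n_gpus * single_gpu_ips)

xla_eff = scaling_efficiency(1236.81, 8, 7691.81)
no_xla_eff = scaling_efficiency(831.15, 8, 5996.44)

print(f"XLA on:  {xla_eff:.0%}")   # 78%
print(f"XLA off: {no_xla_eff:.0%}")  # 90%
```

Note that even at 78% efficiency, the XLA run's raw throughput (7691.81 images/sec) is still well above the non-XLA run's (5996.44 images/sec).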

@chengdianxuezi

Has this problem been solved? I also have this problem.

@romerojosh
Collaborator

Because XLA clusters ops together, enabling XLA can cause Horovod ops to no longer overlap (or to overlap less efficiently) with computation, degrading scaling performance. The issue is that Horovod is only informed of tensors needing processing between XLA clusters. That said, depending on the scale you are running at, the performance increase provided by XLA may outweigh the loss in scalability, resulting in higher raw throughput (as in @vilmara's results, where enabling XLA reduces scaling efficiency but achieves much higher throughput).

You can try limiting the XLA cluster size by setting the environment variable TF_XLA_FLAGS="--tf_xla_max_cluster_size=N", where N is a moderately sized value like 500, 1000, or maybe more. Limiting the max cluster size can let Horovod overlap communication more, but it may also reduce raw performance, so you'll have to experiment to see whether the tradeoff is worth it for your application.
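A minimal sketch of applying the suggestion above from inside a training script. The flag `--tf_xla_max_cluster_size` is the one named in this thread; the value 500 is just an assumed starting point to tune, and the TensorFlow import is shown commented out because the flag must be set before TensorFlow initializes XLA:

```python
import os

# Cap XLA cluster sizes so Horovod sees tensors between smaller clusters
# and can overlap communication with computation more often.
# 500 is an arbitrary starting value; try 1000 or more and benchmark.
os.environ["TF_XLA_FLAGS"] = "--tf_xla_max_cluster_size=500"

# Import TensorFlow only AFTER setting TF_XLA_FLAGS:
# import tensorflow as tf

print(os.environ["TF_XLA_FLAGS"])
```

Alternatively, export the variable in the shell (or pass it through your `mpirun`/`horovodrun` environment) before launching the benchmark, so all ranks pick it up.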
