Your question:
I am running the TF benchmarks with Horovod in distributed mode (2 nodes, each with 4x V100 GPUs). Scaling efficiency drops to ~78% when XLA is enabled, versus ~90% with XLA disabled. Throughput numbers below:
With --xla=True
1 GPU (single node): 1236.81 images/sec
8 GPUs (across both nodes): 7691.81 images/sec (~78% scaling efficiency; the drop appears only when crossing nodes)
Without --xla
1 GPU (single node): 831.15 images/sec
8 GPUs (across both nodes): 5996.44 images/sec (~90% scaling efficiency)
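For reference, the efficiency percentages above follow from the standard weak-scaling formula: multi-GPU throughput divided by N times the single-GPU throughput. A minimal sketch using the reported images/sec numbers:

```python
# Scaling efficiency = multi-GPU throughput / (N * single-GPU throughput),
# computed from the images/sec figures reported above.
def scaling_efficiency(single_gpu: float, multi_gpu: float, n_gpus: int) -> float:
    return multi_gpu / (n_gpus * single_gpu)

eff_xla = scaling_efficiency(1236.81, 7691.81, 8)    # with --xla=True
eff_no_xla = scaling_efficiency(831.15, 5996.44, 8)  # without --xla

print(f"XLA: {eff_xla:.1%}, no XLA: {eff_no_xla:.1%}")
# → XLA: 77.7%, no XLA: 90.2%
```

Note that despite the lower efficiency, the XLA run still delivers ~28% higher raw throughput (7691.81 vs 5996.44 images/sec).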
Due to clustering of ops by XLA, enabling XLA can cause Horovod ops to no longer overlap (or overlap less efficiently) with computation, causing degradation in scaling performance. The issue is that Horovod will only be informed of tensors needing processing between XLA clusters. With that being said, depending on the scale you are running at, the increase in performance provided by XLA may outweigh the loss in scalability, resulting in higher raw throughput (as seen in @vilmara's results where enabling XLA reduces scaling efficiency, but achieves a much higher throughput).
You can try limiting the XLA cluster size by setting the environment variable TF_XLA_FLAGS="--tf_xla_max_cluster_size=N", where N is a moderately sized value like 500 or 1000 (or perhaps more). Limiting the max cluster size can let Horovod overlap more communication with computation, but may reduce raw performance, so you'll have to experiment to see whether the tradeoff is worth it for your application.
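A minimal sketch of setting this flag for a 2-node x 4-GPU run. The TF_XLA_FLAGS variable and --tf_xla_max_cluster_size flag come from the suggestion above; the hostnames, script name, and horovodrun invocation are placeholders to adapt to your setup:

```shell
# Cap XLA cluster size so Horovod allreduce ops surface between clusters
# and can overlap with remaining computation. Start with 500-1000 and sweep.
export TF_XLA_FLAGS="--tf_xla_max_cluster_size=1000"

# Hypothetical launch matching the 2-node, 4x V100 setup described above:
# horovodrun -np 8 -H node1:4,node2:4 python tf_cnn_benchmarks.py --xla=True

echo "$TF_XLA_FLAGS"
```

Make sure the variable is actually propagated to all ranks (e.g. via your launcher's environment-forwarding option), otherwise only the local node picks it up.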