Your question:
I used Horovod to train a distributed BERT model using NVIDIA's source code at https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/LanguageModeling/BERT. When XLA is not used, Horovod+BERT scales very well (~1.92x going from 1 to 16 GPUs). nvprof shows good overlap between compute and communication (shown as 'mem' in the screenshot below), as expected.
NOTE: the screenshots below are nvprof results for a whole step.
However, when XLA is used, Horovod+BERT scales much worse (~1.8x going from 1 to 16 GPUs). nvprof shows very little overlap between compute and communication.
My question is: what causes the loss of compute/communication overlap when XLA is used?
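For context, here is a minimal sketch of the kind of Horovod + XLA setup described above, written against TensorFlow 1.x. The Horovod and TensorFlow calls are standard APIs, but the tiny stand-in model, learning rate, and step count are illustrative assumptions, not taken from NVIDIA's scripts:

```python
# Minimal Horovod + XLA training sketch (TensorFlow 1.x).
# Assumption: this mirrors the general pattern of the NVIDIA BERT scripts,
# but the model below is a toy stand-in so the script runs end to end.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin each Horovod process to a single GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
# Enable XLA JIT compilation. Fused XLA clusters can change when individual
# gradient tensors become available, which affects how much the allreduce
# communication can overlap with backprop compute.
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

# Toy model in place of BERT.
x = tf.random_normal([32, 128])
y = tf.layers.dense(x, 1)
loss = tf.reduce_mean(tf.square(y))

opt = tf.train.AdamOptimizer(1e-4 * hvd.size())
# DistributedOptimizer wraps gradient computation with allreduce ops.
opt = hvd.DistributedOptimizer(opt)
train_op = opt.minimize(loss, global_step=tf.train.get_or_create_global_step())

hooks = [
    hvd.BroadcastGlobalVariablesHook(0),   # sync initial weights from rank 0
    tf.train.StopAtStepHook(last_step=100),
]
with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    while not sess.should_stop():
        sess.run(train_op)
```

With and without the `global_jit_level` line, profiling each rank under nvprof is one way to compare how much of the NCCL allreduce time overlaps with compute.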