Your question:
I used Horovod to train a distributed BERT model using NVIDIA's source code at https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/LanguageModeling/BERT. When XLA is not used, Horovod+BERT scales very well (~1.92x going from 1 to 16 GPUs). nvprof shows good overlap between compute and communication (shown as 'mem' in the screenshot below), as expected.
NOTE: the screenshots below are nvprof results for a whole step.
However, when XLA is used, Horovod+BERT scales much worse (~1.8x going from 1 to 16 GPUs). nvprof shows very little overlap between compute and communication.
My question is: what causes the loss of compute/communication overlap when XLA is used?
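For context, here is a minimal sketch of the kind of Horovod + XLA setup described above, written against TensorFlow 1.x. The Horovod and TensorFlow calls are standard APIs, but the tiny stand-in model, learning rate, and step count are illustrative assumptions, not taken from NVIDIA's scripts:

```python
# Minimal Horovod + XLA training sketch (TensorFlow 1.x).
# Assumption: this mirrors the general pattern of the NVIDIA BERT scripts,
# but the model below is a toy stand-in so the script runs end to end.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin each Horovod process to a single GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
# Enable XLA JIT compilation. Fused XLA clusters can change when individual
# gradient tensors become available, which affects how much the allreduce
# communication can overlap with backprop compute.
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

# Toy model in place of BERT.
x = tf.random_normal([32, 128])
y = tf.layers.dense(x, 1)
loss = tf.reduce_mean(tf.square(y))

opt = tf.train.AdamOptimizer(1e-4 * hvd.size())
# DistributedOptimizer wraps gradient computation with allreduce ops.
opt = hvd.DistributedOptimizer(opt)
train_op = opt.minimize(loss, global_step=tf.train.get_or_create_global_step())

hooks = [
    hvd.BroadcastGlobalVariablesHook(0),   # sync initial weights from rank 0
    tf.train.StopAtStepHook(last_step=100),
]
with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    while not sess.should_stop():
        sess.run(train_op)
```

With and without the `global_jit_level` line, profiling each rank under nvprof is one way to compare how much of the NCCL allreduce time overlaps with compute.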