
Distributed training bert model with multi-node is so slow... #1577

Closed
y-rok opened this issue Dec 8, 2019 · 1 comment

y-rok commented Dec 8, 2019

Environment:

  1. Framework: TensorFlow
  2. Framework version:
  3. Horovod version: 0.18.2
  4. MPI version:
  5. CUDA version: 10
  6. NCCL version:
  7. Python version: 3.7.5
  8. OS and version:
  9. GCC version:

Hello.
I am trying to speed up BERT training with Horovod.
(I modified Google's BERT code to support distributed training with Horovod, as in this link.)
I have two servers, each with two 2080 Ti GPUs.
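For context, the changes follow the standard Horovod TF 1.x pattern, roughly like the sketch below (this is not my exact diff; the learning rate value is just an example):

```python
import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod (one process per GPU)
hvd.init()

# Pin each process to a single local GPU
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Scale the learning rate by the number of workers and wrap the optimizer
optimizer = tf.train.AdamOptimizer(learning_rate=2e-5 * hvd.size())
optimizer = hvd.DistributedOptimizer(optimizer)

# Broadcast initial variables from rank 0 so all workers start from the same state
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
```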

When I run distributed training on a single server with 2 GPUs, it is very fast (batch size 4, ~1 second per iteration).
However, training is very slow when I use 2 servers (batch size 4, ~11 seconds per iteration).
In the Horovod timeline, the NCCL allreduce takes very long because of a long "queue" phase.

I would like to know whether this indicates a problem with Horovod or whether it is expected behavior.
(As far as I know, GeForce cards do not support GPUDirect RDMA, so I guess multi-node distributed training could be slow for that reason.)

@y-rok y-rok added the question label Dec 8, 2019
@y-rok y-rok changed the title Distributed training bert with multi-node. Distributed training bert model with multi-node is so slow... Dec 8, 2019
Collaborator

tgaddair commented Dec 8, 2019

Hey @y-rok, what kind of interconnect do you have between your servers? Lack of GPUDirect will definitely cause longer queue times as well.

Since you seem to be bottlenecked by the network, I would suggest trying fp16 compression (hvd.DistributedOptimizer(..., compression=hvd.Compression.fp16)). You may also want to run with horovodrun --autotune to explore different tensor fusion buffer settings.
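Something along these lines (a rough sketch; the base optimizer, host names, and script name are just placeholders):

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

opt = tf.train.AdamOptimizer(learning_rate=2e-5 * hvd.size())

# Compress gradients to fp16 before allreduce to roughly halve the traffic on the wire
opt = hvd.DistributedOptimizer(opt, compression=hvd.Compression.fp16)

# Then launch with autotuning enabled to explore fusion buffer / cycle time settings, e.g.:
#   horovodrun --autotune -np 4 -H server1:2,server2:2 python train.py
```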
