
Distributed training bert model with multi-node is so slow... #1577

Closed
y-rok opened this issue Dec 8, 2019 · 1 comment

y-rok commented Dec 8, 2019

Environment:

  1. Framework: TensorFlow
  2. Framework version:
  3. Horovod version: 0.18.2
  4. MPI version:
  5. CUDA version: 10
  6. NCCL version:
  7. Python version: 3.7.5
  8. OS and version:
  9. GCC version:

Hello.
I am trying to speed up BERT training with Horovod.
(I modified Google's BERT code to support distributed training with Horovod, as in this link.)
I have two servers, each with two 2080 Ti GPUs.
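For context, the changes follow the standard Horovod TF 1.x pattern, roughly like the sketch below (this is not my exact diff; the learning rate value is just an example):

```python
import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod (one process per GPU)
hvd.init()

# Pin each process to a single local GPU
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Scale the learning rate by the number of workers and wrap the optimizer
optimizer = tf.train.AdamOptimizer(learning_rate=2e-5 * hvd.size())
optimizer = hvd.DistributedOptimizer(optimizer)

# Broadcast initial variables from rank 0 so all workers start from the same state
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
```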

When I run distributed training on a single server with 2 GPUs, it is very fast (batch size 4, ~1 second per iteration).
However, training is very slow when I use 2 servers (batch size 4, ~11 seconds per iteration).
In the Horovod timeline, the NCCL allreduce takes very long because of a long "queue" phase.

I would like to know whether this indicates a problem with Horovod or whether it is expected behavior.
(As far as I know, GeForce cards do not support GPUDirect RDMA, so I guess multi-node distributed training could be slow for that reason.)

@y-rok y-rok added the question label Dec 8, 2019
@y-rok y-rok changed the title Distributed training bert with multi-node. Distributed training bert model with multi-node is so slow... Dec 8, 2019
Collaborator

tgaddair commented Dec 8, 2019

Hey @y-rok, what kind of interconnect do you have between your servers? Lack of GPUDirect will definitely cause longer queue times as well.

Since you seem to be bottlenecked by the network, I would suggest trying fp16 compression (hvd.DistributedOptimizer(..., compression=hvd.Compression.fp16)). You may also want to run with horovodrun --autotune to explore different tensor fusion buffer settings.
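Something along these lines (a rough sketch; the base optimizer, host names, and script name are just placeholders):

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

opt = tf.train.AdamOptimizer(learning_rate=2e-5 * hvd.size())

# Compress gradients to fp16 before allreduce to roughly halve the traffic on the wire
opt = hvd.DistributedOptimizer(opt, compression=hvd.Compression.fp16)

# Then launch with autotuning enabled to explore fusion buffer / cycle time settings, e.g.:
#   horovodrun --autotune -np 4 -H server1:2,server2:2 python train.py
```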
