Hello.
I am trying to speed up BERT model training with Horovod.
(I modified Google's BERT code to support Horovod distributed training, like this link.)
I have two servers, each with two RTX 2080 Ti GPUs.
When I run distributed training on a single server with 2 GPUs, it is very fast (batch size 4, ~1 second per iteration).
However, it becomes very slow when I use both servers (batch size 4, ~11 seconds per iteration).
In the Horovod timeline, the NCCL allreduce time is very long because of a long "queue" phase.
I would like to know whether this indicates a problem with my Horovod setup or whether it is expected behavior.
(As far as I know, GeForce cards do not support GPUDirect RDMA, so I suspect multi-node distributed training could be slow for that reason.)
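For context, the Horovod-related changes in my script follow the standard TF1 estimator-style integration, roughly like the simplified sketch below (the learning-rate value is just a placeholder):

```python
# Minimal sketch of the usual Horovod changes to a TF1-style BERT training script.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin each process to a single GPU based on its local rank.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

base_learning_rate = 5e-5  # placeholder value
# Scale the learning rate by the number of workers.
optimizer = tf.train.AdamOptimizer(learning_rate=base_learning_rate * hvd.size())
# Wrap the optimizer so gradients are averaged across workers via allreduce.
optimizer = hvd.DistributedOptimizer(optimizer)

# Broadcast initial variables from rank 0 so all workers start from the same state.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
```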
Hey @y-rok, what kind of interconnect do you have between your servers? Lack of GPUDirect will definitely cause longer queue times as well.
Since you seem to be bottlenecked by the network, I would suggest trying fp16 compression (hvd.DistributedOptimizer(..., compression=hvd.Compression.fp16)). You may also want to try horovodrun --autotune to explore different fusion buffer settings.
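The compression argument is passed when wrapping the optimizer, roughly as in this sketch (the learning-rate value is illustrative):

```python
# Sketch: enable fp16 gradient compression, which halves the bytes sent
# over the wire during each allreduce.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

optimizer = tf.train.AdamOptimizer(learning_rate=5e-5 * hvd.size())
optimizer = hvd.DistributedOptimizer(
    optimizer,
    compression=hvd.Compression.fp16,  # compress gradients to float16 for the allreduce
)
```

Autotuning can then be launched with something like `horovodrun --autotune -np 4 -H server1:2,server2:2 python run_pretraining.py ...` (hosts and script shown for illustration; adjust them to your setup).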