Horovod training stalled #563

LucasSloan · 2020-07-04T19:25:46Z

I attempted to train with multiple GPUs and got this error message:

[1,0]<stderr>:[2020-07-04 19:23:57.817885: W horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock. 
[1,0]<stderr>:Stalled ranks:
[1,0]<stderr>:0: [HorovodBroadcast_box_net_box_0_bias_0, HorovodBroadcast_box_net_box_0_bias_Momentum_0, HorovodBroadcast_box_net_box_0_bn_3_beta_0, HorovodBroadcast_box_net_box_0_bn_3_beta_Momentum_0, HorovodBroadcast_box_net_box_0_bn_3_gamma_0, HorovodBroadcast_box_net_box_0_bn_3_gamma_Momentum_0 ...]
[1,0]<stderr>:1: [DistributedMomentumOptimizer_Allreduce/HorovodAllreduce_gradients_AddN_100_0, DistributedMomentumOptimizer_Allreduce/HorovodAllreduce_gradients_AddN_101_0, DistributedMomentumOptimizer_Allreduce/HorovodAllreduce_gradients_AddN_102_0, DistributedMomentumOptimizer_Allreduce/HorovodAllreduce_gradients_AddN_103_0, DistributedMomentumOptimizer_Allreduce/HorovodAllreduce_gradients_AddN_124_0, DistributedMomentumOptimizer_Allreduce/HorovodAllreduce_gradients_AddN_125_0 ...]

GPU utilization on both cards is 0%.
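
One way to read the "Stalled ranks" lines above is to group the tensor names listed under each rank by collective type. The following is a minimal sketch, assuming the log format shown in this report; the helper name `summarize_stalled_ranks` and the regular expressions are illustrative and not part of Horovod.

```python
import re
from collections import Counter

# Matches per-rank lines such as "[1,0]<stderr>:0: [HorovodBroadcast_..., ...]".
# The timestamped warning line and the "Stalled ranks:" header do not match.
RANK_LINE = re.compile(r"<stderr>:(\d+): \[(.*)\]\s*$")
# Collective op names as they appear inside the tensor names in this log.
OP_NAME = re.compile(r"Horovod(Broadcast|Allreduce|Allgather)")


def summarize_stalled_ranks(log_text):
    """Return {rank: {collective: count}} for the per-rank lines in log_text."""
    summary = {}
    for line in log_text.splitlines():
        match = RANK_LINE.search(line)
        if not match:
            continue
        rank = int(match.group(1))
        ops = Counter()
        for name in match.group(2).split(","):
            op = OP_NAME.search(name)
            if op:
                ops[op.group(1)] += 1
        summary[rank] = dict(ops)
    return summary


# Shortened excerpt of the log above (two tensor names kept per rank).
SAMPLE_LOG = """\
[1,0]<stderr>:Stalled ranks:
[1,0]<stderr>:0: [HorovodBroadcast_box_net_box_0_bias_0, HorovodBroadcast_box_net_box_0_bias_Momentum_0 ...]
[1,0]<stderr>:1: [DistributedMomentumOptimizer_Allreduce/HorovodAllreduce_gradients_AddN_100_0, DistributedMomentumOptimizer_Allreduce/HorovodAllreduce_gradients_AddN_101_0 ...]
"""

print(summarize_stalled_ranks(SAMPLE_LOG))
```

On this log, rank 0 is listed only with broadcast tensors and rank 1 only with allreduce tensors, which is consistent with the warning's description of ranks submitting different tensors.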


SiBensberg · 2020-07-06T12:39:03Z

I am facing the same problem, but only with:
--mode=train_and_eval
With --mode=train it works.
As a workaround, I start a separate process just for evaluation.

romainvo · 2020-07-14T12:05:57Z

I have the same error too: see #427.

drwaltman mentioned this issue Aug 2, 2020

Problems with training on multi gpus with horovod #427

Closed

fsx950223 closed this as completed Sep 18, 2020
