Horovod training stalled #563

Closed
LucasSloan opened this issue Jul 4, 2020 · 2 comments

Comments

@LucasSloan
Collaborator

I attempted to train with multiple GPUs and got this error message:

[1,0]<stderr>:[2020-07-04 19:23:57.817885: W horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock. 
[1,0]<stderr>:Stalled ranks:
[1,0]<stderr>:0: [HorovodBroadcast_box_net_box_0_bias_0, HorovodBroadcast_box_net_box_0_bias_Momentum_0, HorovodBroadcast_box_net_box_0_bn_3_beta_0, HorovodBroadcast_box_net_box_0_bn_3_beta_Momentum_0, HorovodBroadcast_box_net_box_0_bn_3_gamma_0, HorovodBroadcast_box_net_box_0_bn_3_gamma_Momentum_0 ...]
[1,0]<stderr>:1: [DistributedMomentumOptimizer_Allreduce/HorovodAllreduce_gradients_AddN_100_0, DistributedMomentumOptimizer_Allreduce/HorovodAllreduce_gradients_AddN_101_0, DistributedMomentumOptimizer_Allreduce/HorovodAllreduce_gradients_AddN_102_0, DistributedMomentumOptimizer_Allreduce/HorovodAllreduce_gradients_AddN_103_0, DistributedMomentumOptimizer_Allreduce/HorovodAllreduce_gradients_AddN_124_0, DistributedMomentumOptimizer_Allreduce/HorovodAllreduce_gradients_AddN_125_0 ...]

GPU utilization on both cards is 0%.
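
To make the stall report above easier to read, here is a minimal sketch (not part of Horovod; the function name `summarize_stalled_ranks` and the regular expressions are my own assumptions about the log layout shown above) that counts, for each rank in the "Stalled ranks" list, which Horovod operation types appear in the tensor names listed for it:

```python
import re
from collections import Counter

# Matches the per-rank lines of the stall report, e.g.
# "[1,0]<stderr>:0: [name_a, name_b ...]". The layout is assumed from the log above.
RANK_LINE = re.compile(r"<stderr>:(\d+): \[(.*?)(?: \.\.\.)?\]\s*$")
OP_KIND = re.compile(r"Horovod(Allreduce|Allgather|Broadcast)")


def summarize_stalled_ranks(log_text):
    """Return {rank: Counter of op kinds} for the tensor names listed under each rank."""
    summary = {}
    for line in log_text.splitlines():
        match = RANK_LINE.search(line)
        if match is None:
            continue  # not a per-rank line (e.g. the warning text itself)
        rank = int(match.group(1))
        names = [n.strip() for n in match.group(2).split(",") if n.strip()]
        kinds = Counter()
        for name in names:
            op = OP_KIND.search(name)
            kinds[op.group(1) if op else "unknown"] += 1
        summary[rank] = kinds
    return summary


# Shortened sample built from tensor names in the log above.
SAMPLE = (
    "[1,0]<stderr>:Stalled ranks:\n"
    "[1,0]<stderr>:0: [HorovodBroadcast_box_net_box_0_bias_0, "
    "HorovodBroadcast_box_net_box_0_bias_Momentum_0 ...]\n"
    "[1,0]<stderr>:1: [DistributedMomentumOptimizer_Allreduce/"
    "HorovodAllreduce_gradients_AddN_100_0 ...]\n"
)

if __name__ == "__main__":
    for rank, kinds in sorted(summarize_stalled_ranks(SAMPLE).items()):
        print(rank, dict(kinds))
```

For the names visible in the excerpt, rank 0 is listed only with broadcast tensors and rank 1 only with allreduce tensors, which is consistent with the warning's remark that different ranks may be submitting different tensors.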

@SiBensberg

SiBensberg commented Jul 6, 2020

I am facing the same problem, but only with:
--mode=train_and_eval
With --mode=train it works.
As a workaround, I start a separate process just for evaluation.
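
A rough sketch of that workaround, under several assumptions: the script path, the launcher prefix, and the evaluation-only mode value are all hypothetical placeholders supplied by the caller (only --mode=train and --mode=train_and_eval appear in this thread), and the evaluation here runs after training finishes rather than alongside it:

```python
import subprocess
import sys


def build_command(script, mode, extra_args=(), launcher=()):
    """Build the argv for one run; `launcher` is whatever multi-GPU prefix you already use."""
    return [*launcher, sys.executable, script, f"--mode={mode}", *extra_args]


def run_train_then_eval(script, eval_mode, extra_args=(), launcher=()):
    """Run training and evaluation as two separate processes instead of train_and_eval.

    `eval_mode` is an assumption: pass whatever evaluation-only mode your script accepts.
    """
    # Training step, started with the (optional) multi-GPU launcher prefix.
    subprocess.run(build_command(script, "train", extra_args, launcher), check=True)
    # Evaluation step, started as its own process without the launcher.
    subprocess.run(build_command(script, eval_mode, extra_args), check=True)


if __name__ == "__main__":
    # Only prints the commands; "main.py" and "eval" are hypothetical placeholders.
    print(build_command("main.py", "train"))
    print(build_command("main.py", "eval"))
```

Keeping the two steps in separate processes means the evaluation never takes part in the collective operations of the training job.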

@romainvo

I have the same error too: #427
