
tf-keras example does not work well when scaling workers up in elastic mode #2285

Closed
BobLiu20 opened this issue Sep 17, 2020 · 8 comments · Fixed by #2289

@BobLiu20
Contributor

Environment:

  1. Framework: tf-keras
  2. Framework version: 1.15.0
  3. Horovod version: v0.20.0

Bug report:
Just run the example examples/elastic/tensorflow_keras_mnist_elastic.py in elastic mode. It blocks when trying to scale workers up.

FYI: I am trying to resolve this issue myself, but I can't find a way to reset the unique id used in the op names. For example, the first time the state is synced the name is dict.sz, the second time it is dict.sz_1, and so on. A newly added worker starts again at dict.sz, so its names no longer match those of the old workers.
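A quick illustration of the naming behaviour I mean, in plain TF 1.x graph mode (dict.sz here is just an illustrative name):

```python
import tensorflow as tf  # TF 1.15, graph mode

g = tf.Graph()
with g.as_default():
    a = tf.constant(0, name='dict.sz')
    b = tf.constant(0, name='dict.sz')

print(a.op.name)  # dict.sz
print(b.op.name)  # dict.sz_1 -- the graph uniquifies the repeated name
```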
Any idea for this issue?

@BobLiu20 BobLiu20 added the bug label Sep 17, 2020
@tgaddair
Collaborator

Thanks for raising this issue @BobLiu20, let me take a look. We definitely have this working in CI, so it may be we just need to tweak this script to use the same structure as our tests. Worst case, we may need to call tf.keras.backend.reset_uids to avoid the incrementing. I will take a look today.
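For reference, a minimal sketch of the per-prefix counters that tf.keras.backend.reset_uids clears (the dict.sz prefix is only illustrative, and whether this is exactly where the suffix comes from in Horovod is my assumption):

```python
from tensorflow.keras import backend as K

print(K.get_uid('dict.sz'))  # 1 on the first call for this prefix
print(K.get_uid('dict.sz'))  # 2 -- the counter keeps incrementing

K.reset_uids()               # clears all per-prefix counters
print(K.get_uid('dict.sz'))  # 1 again, matching a freshly started worker
```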

@tgaddair
Collaborator

Hmm, looks like we don't test TensorFlow Keras v1 under CI for elastic mode. Let me add that and see if we can fix this along the way.

@tgaddair
Collaborator

Hey @BobLiu20, can you try #2289 and let me know if it works for you?

@BobLiu20
Contributor Author

BobLiu20 commented Sep 18, 2020

@tgaddair Thanks. The issue is still there when scaling workers from 2 to 3 (each worker has 4 processes):

Fri Sep 18 03:11:09 2020[0]<stderr>:[2020-09-18 03:11:09.347196: W /tmp/pip-req-build-dc48wa7r/horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
Fri Sep 18 03:11:09 2020[0]<stderr>:Missing ranks:
Fri Sep 18 03:11:09 2020[0]<stderr>:0: [broadcast_object_fn.sz]
Fri Sep 18 03:11:09 2020[0]<stderr>:1: [broadcast_object_fn.sz]
Fri Sep 18 03:11:09 2020[0]<stderr>:2: [broadcast_object_fn.sz]
Fri Sep 18 03:11:09 2020[0]<stderr>:3: [broadcast_object_fn.sz]
Fri Sep 18 03:11:09 2020[0]<stderr>:4: [broadcast_object_fn.sz]
Fri Sep 18 03:11:09 2020[0]<stderr>:5: [broadcast_object_fn.sz]
Fri Sep 18 03:11:09 2020[0]<stderr>:6: [broadcast_object_fn.sz]
Fri Sep 18 03:11:09 2020[0]<stderr>:7: [broadcast_object_fn.sz]
Fri Sep 18 03:11:09 2020[0]<stderr>:8: [training/DistributedAdadelta_Allreduce/cond/HorovodAllreduce_training_Adadelta_gradients_gradients_conv2d_Conv2D_grad_Conv2DBackpropFilter_0, training/DistributedAdadelta_Allreduce/cond_1/HorovodAllreduce_training_Adadelta_gradients_gradients_conv2d_BiasAdd_grad_BiasAddGrad_0, training/DistributedAdadelta_Allreduce/cond_2/HorovodAllreduce_training_Adadelta_gradients_gradients_conv2d_1_Conv2D_grad_Conv2DBackpropFilter_0, training/DistributedAdadelta_Allreduce/cond_3/HorovodAllreduce_training_Adadelta_gradients_gradients_conv2d_1_BiasAdd_grad_BiasAddGrad_0, training/DistributedAdadelta_Allreduce/cond_4/HorovodAllreduce_training_Adadelta_gradients_gradients_dense_MatMul_grad_MatMul_1_0, training/DistributedAdadelta_Allreduce/cond_5/HorovodAllreduce_training_Adadelta_gradients_gradients_dense_BiasAdd_grad_BiasAddGrad_0 ...]
Fri Sep 18 03:11:09 2020[0]<stderr>:9: [training/DistributedAdadelta_Allreduce/cond/HorovodAllreduce_training_Adadelta_gradients_gradients_conv2d_Conv2D_grad_Conv2DBackpropFilter_0, training/DistributedAdadelta_Allreduce/cond_1/HorovodAllreduce_training_Adadelta_gradients_gradients_conv2d_BiasAdd_grad_BiasAddGrad_0, training/DistributedAdadelta_Allreduce/cond_2/HorovodAllreduce_training_Adadelta_gradients_gradients_conv2d_1_Conv2D_grad_Conv2DBackpropFilter_0, training/DistributedAdadelta_Allreduce/cond_3/HorovodAllreduce_training_Adadelta_gradients_gradients_conv2d_1_BiasAdd_grad_BiasAddGrad_0, training/DistributedAdadelta_Allreduce/cond_4/HorovodAllreduce_training_Adadelta_gradients_gradients_dense_MatMul_grad_MatMul_1_0, training/DistributedAdadelta_Allreduce/cond_5/HorovodAllreduce_training_Adadelta_gradients_gradients_dense_BiasAdd_grad_BiasAddGrad_0 ...]
Fri Sep 18 03:11:09 2020[0]<stderr>:10: [training/DistributedAdadelta_Allreduce/cond/HorovodAllreduce_training_Adadelta_gradients_gradients_conv2d_Conv2D_grad_Conv2DBackpropFilter_0, training/DistributedAdadelta_Allreduce/cond_1/HorovodAllreduce_training_Adadelta_gradients_gradients_conv2d_BiasAdd_grad_BiasAddGrad_0, training/DistributedAdadelta_Allreduce/cond_2/HorovodAllreduce_training_Adadelta_gradients_gradients_conv2d_1_Conv2D_grad_Conv2DBackpropFilter_0, training/DistributedAdadelta_Allreduce/cond_3/HorovodAllreduce_training_Adadelta_gradients_gradients_conv2d_1_BiasAdd_grad_BiasAddGrad_0, training/DistributedAdadelta_Allreduce/cond_4/HorovodAllreduce_training_Adadelta_gradients_gradients_dense_MatMul_grad_MatMul_1_0, training/DistributedAdadelta_Allreduce/cond_5/HorovodAllreduce_training_Adadelta_gradients_gradients_dense_BiasAdd_grad_BiasAddGrad_0 ...]
Fri Sep 18 03:11:09 2020[0]<stderr>:11: [training/DistributedAdadelta_Allreduce/cond/HorovodAllreduce_training_Adadelta_gradients_gradients_conv2d_Conv2D_grad_Conv2DBackpropFilter_0, training/DistributedAdadelta_Allreduce/cond_1/HorovodAllreduce_training_Adadelta_gradients_gradients_conv2d_BiasAdd_grad_BiasAddGrad_0, training/DistributedAdadelta_Allreduce/cond_2/HorovodAllreduce_training_Adadelta_gradients_gradients_conv2d_1_Conv2D_grad_Conv2DBackpropFilter_0, training/DistributedAdadelta_Allreduce/cond_3/HorovodAllreduce_training_Adadelta_gradients_gradients_conv2d_1_BiasAdd_grad_BiasAddGrad_0, training/DistributedAdadelta_Allreduce/cond_4/HorovodAllreduce_training_Adadelta_gradients_gradients_dense_MatMul_grad_MatMul_1_0, training/DistributedAdadelta_Allreduce/cond_5/HorovodAllreduce_training_Adadelta_gradients_gradients_dense_BiasAdd_grad_BiasAddGrad_0 ...]
Fri Sep 18 03:12:09 2020[0]<stderr>:[2020-09-18 03:12:09.351095: W /tmp/pip-req-build-dc48wa7r/horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.

@BobLiu20
Contributor Author

@tgaddair After some debugging, it seems the new worker triggers on_batch_end in CommitStateCallbackImpl before the weights are synced, so the new worker calls check_host_updates() immediately while the old workers do not.

I don't know why...
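To make the failure mode concrete, here is a simplified sketch of a commit-on-batch-end callback; this is not the real CommitStateCallbackImpl, and the batches_per_commit counter handling is my assumption:

```python
import tensorflow as tf

class CommitStateCallbackSketch(tf.keras.callbacks.Callback):
    """Simplified stand-in for a callback that commits elastic state."""

    def __init__(self, state, batches_per_commit=1):
        super(CommitStateCallbackSketch, self).__init__()
        self.state = state
        self.batches_per_commit = batches_per_commit
        self.batches_remaining = batches_per_commit

    def on_batch_end(self, batch, logs=None):
        self.batches_remaining -= 1
        if self.batches_remaining == 0:
            # commit() is a collective operation: if a newly added worker
            # reaches it one step earlier than the existing workers, both
            # sides end up waiting on each other and training stalls.
            self.state.commit()
            self.batches_remaining = self.batches_per_commit
```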

@tgaddair
Collaborator

Thanks for digging into this @BobLiu20, and apologies for the ongoing issues. I'll look into this a bit more today and see if I can figure out what's going on.

@tgaddair
Collaborator

Hey @BobLiu20, I was able to repro the issue in my environment and resolve it. It appears the problem was due to the Keras Callback state not being reinitialized when workers were added or removed. As a result, the old workers were committing at different steps from the new workers. I have updated the PR with changes to address this, please try it again and let me know if it works this time. Thanks.
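For anyone who hits this before the PR merges, a minimal sketch of the pattern that avoids the mismatch, as I understand it: create the elastic Keras callbacks inside the @hvd.elastic.run-wrapped function so they are rebuilt with fresh internal state whenever workers are added or removed. The tiny dense model and random data below are placeholders; the real example uses the MNIST convnet.

```python
import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Placeholder model and data, just to make the sketch self-contained.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation='softmax', input_shape=(784,)),
])
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adadelta(1.0))
model.compile(loss='sparse_categorical_crossentropy', optimizer=opt)

x = np.random.rand(512, 784).astype('float32')
y = np.random.randint(0, 10, size=(512,)).astype('int64')

state = hvd.elastic.KerasState(model, batch=0, epoch=0)

@hvd.elastic.run
def train(state):
    # The callbacks are created inside the elastic reset boundary, so every
    # time workers join or leave they start with fresh internal counters and
    # all workers commit state on the same steps.
    callbacks = [
        hvd.elastic.CommitStateCallback(state),
        hvd.elastic.UpdateEpochStateCallback(state),
    ]
    model.fit(x, y, batch_size=32, callbacks=callbacks,
              initial_epoch=state.epoch, epochs=5,
              verbose=1 if hvd.rank() == 0 else 0)

train(state)
```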

@BobLiu20
Contributor Author

@tgaddair Cool. It is working well now. Thanks
