
tf-keras example does not work well when scaling workers up in elastic mode #2285

Closed
BobLiu20 opened this issue Sep 17, 2020 · 8 comments · Fixed by #2289

@BobLiu20
Contributor

Environment:

  1. Framework: tf-keras
  2. Framework version: 1.15.0
  3. Horovod version: v0.20.0

Bug report:
Just run the example examples/elastic/tensorflow_keras_mnist_elastic.py in elastic mode. It blocks when trying to scale workers up.

FYI: I am trying to resolve this issue myself, but I can't find a way to reset the unique id used in the op names. For example, the first time the state is synced the name is dict.sz, the second time it is dict.sz_1, and so on. A newly added worker starts again at dict.sz, so its names no longer match those of the old workers.
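A quick illustration of the naming behaviour I mean, in plain TF 1.x graph mode (dict.sz here is just an illustrative name):

```python
import tensorflow as tf  # TF 1.15, graph mode

g = tf.Graph()
with g.as_default():
    a = tf.constant(0, name='dict.sz')
    b = tf.constant(0, name='dict.sz')

print(a.op.name)  # dict.sz
print(b.op.name)  # dict.sz_1 -- the graph uniquifies the repeated name
```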
Any idea for this issue?

@BobLiu20 BobLiu20 added the bug label Sep 17, 2020
@tgaddair
Collaborator

Thanks for raising this issue @BobLiu20, let me take a look. We definitely have this working in CI, so it may be we just need to tweak this script to use the same structure as our tests. Worst case, we may need to call tf.keras.backend.reset_uids to avoid the incrementing. I will take a look today.
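For reference, a minimal sketch of the per-prefix counters that tf.keras.backend.reset_uids clears (the dict.sz prefix is only illustrative, and whether this is exactly where the suffix comes from in Horovod is my assumption):

```python
from tensorflow.keras import backend as K

print(K.get_uid('dict.sz'))  # 1 on the first call for this prefix
print(K.get_uid('dict.sz'))  # 2 -- the counter keeps incrementing

K.reset_uids()               # clears all per-prefix counters
print(K.get_uid('dict.sz'))  # 1 again, matching a freshly started worker
```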

@tgaddair
Collaborator

Hmm, looks like we don't test TensorFlow Keras v1 under CI for elastic mode. Let me add that and see if we can fix this along the way.

@tgaddair
Collaborator

Hey @BobLiu20, can you try #2289 and let me know if it works for you?

@BobLiu20
Contributor Author

BobLiu20 commented Sep 18, 2020

@tgaddair Thanks. The issue is still there when scaling workers from 2 to 3 (each worker has 4 processes):

Fri Sep 18 03:11:09 2020[0]<stderr>:[2020-09-18 03:11:09.347196: W /tmp/pip-req-build-dc48wa7r/horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
Fri Sep 18 03:11:09 2020[0]<stderr>:Missing ranks:
Fri Sep 18 03:11:09 2020[0]<stderr>:0: [broadcast_object_fn.sz]
Fri Sep 18 03:11:09 2020[0]<stderr>:1: [broadcast_object_fn.sz]
Fri Sep 18 03:11:09 2020[0]<stderr>:2: [broadcast_object_fn.sz]
Fri Sep 18 03:11:09 2020[0]<stderr>:3: [broadcast_object_fn.sz]
Fri Sep 18 03:11:09 2020[0]<stderr>:4: [broadcast_object_fn.sz]
Fri Sep 18 03:11:09 2020[0]<stderr>:5: [broadcast_object_fn.sz]
Fri Sep 18 03:11:09 2020[0]<stderr>:6: [broadcast_object_fn.sz]
Fri Sep 18 03:11:09 2020[0]<stderr>:7: [broadcast_object_fn.sz]
Fri Sep 18 03:11:09 2020[0]<stderr>:8: [training/DistributedAdadelta_Allreduce/cond/HorovodAllreduce_training_Adadelta_gradients_gradients_conv2d_Conv2D_grad_Conv2DBackpropFilter_0, training/DistributedAdadelta_Allreduce/cond_1/HorovodAllreduce_training_Adadelta_gradients_gradients_conv2d_BiasAdd_grad_BiasAddGrad_0, training/DistributedAdadelta_Allreduce/cond_2/HorovodAllreduce_training_Adadelta_gradients_gradients_conv2d_1_Conv2D_grad_Conv2DBackpropFilter_0, training/DistributedAdadelta_Allreduce/cond_3/HorovodAllreduce_training_Adadelta_gradients_gradients_conv2d_1_BiasAdd_grad_BiasAddGrad_0, training/DistributedAdadelta_Allreduce/cond_4/HorovodAllreduce_training_Adadelta_gradients_gradients_dense_MatMul_grad_MatMul_1_0, training/DistributedAdadelta_Allreduce/cond_5/HorovodAllreduce_training_Adadelta_gradients_gradients_dense_BiasAdd_grad_BiasAddGrad_0 ...]
Fri Sep 18 03:11:09 2020[0]<stderr>:9: [training/DistributedAdadelta_Allreduce/cond/HorovodAllreduce_training_Adadelta_gradients_gradients_conv2d_Conv2D_grad_Conv2DBackpropFilter_0, training/DistributedAdadelta_Allreduce/cond_1/HorovodAllreduce_training_Adadelta_gradients_gradients_conv2d_BiasAdd_grad_BiasAddGrad_0, training/DistributedAdadelta_Allreduce/cond_2/HorovodAllreduce_training_Adadelta_gradients_gradients_conv2d_1_Conv2D_grad_Conv2DBackpropFilter_0, training/DistributedAdadelta_Allreduce/cond_3/HorovodAllreduce_training_Adadelta_gradients_gradients_conv2d_1_BiasAdd_grad_BiasAddGrad_0, training/DistributedAdadelta_Allreduce/cond_4/HorovodAllreduce_training_Adadelta_gradients_gradients_dense_MatMul_grad_MatMul_1_0, training/DistributedAdadelta_Allreduce/cond_5/HorovodAllreduce_training_Adadelta_gradients_gradients_dense_BiasAdd_grad_BiasAddGrad_0 ...]
Fri Sep 18 03:11:09 2020[0]<stderr>:10: [training/DistributedAdadelta_Allreduce/cond/HorovodAllreduce_training_Adadelta_gradients_gradients_conv2d_Conv2D_grad_Conv2DBackpropFilter_0, training/DistributedAdadelta_Allreduce/cond_1/HorovodAllreduce_training_Adadelta_gradients_gradients_conv2d_BiasAdd_grad_BiasAddGrad_0, training/DistributedAdadelta_Allreduce/cond_2/HorovodAllreduce_training_Adadelta_gradients_gradients_conv2d_1_Conv2D_grad_Conv2DBackpropFilter_0, training/DistributedAdadelta_Allreduce/cond_3/HorovodAllreduce_training_Adadelta_gradients_gradients_conv2d_1_BiasAdd_grad_BiasAddGrad_0, training/DistributedAdadelta_Allreduce/cond_4/HorovodAllreduce_training_Adadelta_gradients_gradients_dense_MatMul_grad_MatMul_1_0, training/DistributedAdadelta_Allreduce/cond_5/HorovodAllreduce_training_Adadelta_gradients_gradients_dense_BiasAdd_grad_BiasAddGrad_0 ...]
Fri Sep 18 03:11:09 2020[0]<stderr>:11: [training/DistributedAdadelta_Allreduce/cond/HorovodAllreduce_training_Adadelta_gradients_gradients_conv2d_Conv2D_grad_Conv2DBackpropFilter_0, training/DistributedAdadelta_Allreduce/cond_1/HorovodAllreduce_training_Adadelta_gradients_gradients_conv2d_BiasAdd_grad_BiasAddGrad_0, training/DistributedAdadelta_Allreduce/cond_2/HorovodAllreduce_training_Adadelta_gradients_gradients_conv2d_1_Conv2D_grad_Conv2DBackpropFilter_0, training/DistributedAdadelta_Allreduce/cond_3/HorovodAllreduce_training_Adadelta_gradients_gradients_conv2d_1_BiasAdd_grad_BiasAddGrad_0, training/DistributedAdadelta_Allreduce/cond_4/HorovodAllreduce_training_Adadelta_gradients_gradients_dense_MatMul_grad_MatMul_1_0, training/DistributedAdadelta_Allreduce/cond_5/HorovodAllreduce_training_Adadelta_gradients_gradients_dense_BiasAdd_grad_BiasAddGrad_0 ...]
Fri Sep 18 03:12:09 2020[0]<stderr>:[2020-09-18 03:12:09.351095: W /tmp/pip-req-build-dc48wa7r/horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.

@BobLiu20
Contributor Author

@tgaddair After some debugging, it seems the new worker triggers on_batch_end in CommitStateCallbackImpl before the weights are synced, so the new worker calls check_host_updates() immediately while the old workers do not.

I don't know why...
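To make the failure mode concrete, here is a simplified sketch of a commit-on-batch-end callback; this is not the real CommitStateCallbackImpl, and the batches_per_commit counter handling is my assumption:

```python
import tensorflow as tf

class CommitStateCallbackSketch(tf.keras.callbacks.Callback):
    """Simplified stand-in for a callback that commits elastic state."""

    def __init__(self, state, batches_per_commit=1):
        super(CommitStateCallbackSketch, self).__init__()
        self.state = state
        self.batches_per_commit = batches_per_commit
        self.batches_remaining = batches_per_commit

    def on_batch_end(self, batch, logs=None):
        self.batches_remaining -= 1
        if self.batches_remaining == 0:
            # commit() is a collective operation: if a newly added worker
            # reaches it one step earlier than the existing workers, both
            # sides end up waiting on each other and training stalls.
            self.state.commit()
            self.batches_remaining = self.batches_per_commit
```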

@tgaddair
Collaborator

Thanks for digging into this @BobLiu20, and apologies for the ongoing issues. I'll look into this a bit more today and see if I can figure out what's going on.

@tgaddair
Collaborator

Hey @BobLiu20, I was able to repro the issue in my environment and resolve it. It appears the problem was due to the Keras Callback state not being reinitialized when workers were added or removed. As a result, the old workers were committing at different steps from the new workers. I have updated the PR with changes to address this, please try it again and let me know if it works this time. Thanks.
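For anyone who hits this before the PR merges, a minimal sketch of the pattern that avoids the mismatch, as I understand it: create the elastic Keras callbacks inside the @hvd.elastic.run-wrapped function so they are rebuilt with fresh internal state whenever workers are added or removed. The tiny dense model and random data below are placeholders; the real example uses the MNIST convnet.

```python
import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Placeholder model and data, just to make the sketch self-contained.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation='softmax', input_shape=(784,)),
])
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adadelta(1.0))
model.compile(loss='sparse_categorical_crossentropy', optimizer=opt)

x = np.random.rand(512, 784).astype('float32')
y = np.random.randint(0, 10, size=(512,)).astype('int64')

state = hvd.elastic.KerasState(model, batch=0, epoch=0)

@hvd.elastic.run
def train(state):
    # The callbacks are created inside the elastic reset boundary, so every
    # time workers join or leave they start with fresh internal counters and
    # all workers commit state on the same steps.
    callbacks = [
        hvd.elastic.CommitStateCallback(state),
        hvd.elastic.UpdateEpochStateCallback(state),
    ]
    model.fit(x, y, batch_size=32, callbacks=callbacks,
              initial_epoch=state.epoch, epochs=5,
              verbose=1 if hvd.rank() == 0 else 0)

train(state)
```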

@BobLiu20
Contributor Author

@tgaddair Cool. It is working well now. Thanks
