tf-keras example is not working well when scale worker up in elastic mode #2285
Comments
Thanks for raising this issue @BobLiu20, let me take a look. We definitely have this working in CI, so it may be we just need to tweak this script to use the same structure as our tests. Worst case, we may need to call tf.keras.backend.reset_uids to avoid the incrementing. I will take a look today.
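For context on the uid behavior discussed here, below is a minimal plain-Python mimic (no TensorFlow required). `get_uid` and `reset_uids` are simplified stand-ins for the Keras internals, not the real implementation: Keras keeps a per-prefix counter, so repeated op/layer names come back as `dict.sz`, then `dict.sz_1`, and so on, and `tf.keras.backend.reset_uids()` clears those counters.

```python
from collections import defaultdict

# Per-prefix counters, mimicking Keras's per-graph uid map.
_uid_counts = defaultdict(int)

def get_uid(prefix):
    """Return a Keras-style unique name: first call gives the bare
    prefix, later calls append an incrementing _N suffix."""
    _uid_counts[prefix] += 1
    count = _uid_counts[prefix]
    return prefix if count == 1 else f"{prefix}_{count - 1}"

def reset_uids():
    """Clear all counters, mimicking tf.keras.backend.reset_uids()."""
    _uid_counts.clear()
```

A new worker starts with fresh counters and generates `dict.sz`, while a long-running worker has already advanced to `dict.sz_1` or beyond, hence the mismatch; resetting the counters on every worker realigns the names.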
Hmm, looks like we don't test TensorFlow Keras v1 under CI for elastic mode; let me add that and see if we can fix this along the way.
@tgaddair Thanks. The issue is still here:
@tgaddair After debugging, it seems the new worker triggers on_batch_end in CommitStateCallbackImpl before the weights are synced. So the new worker calls check_host_updates() immediately, but the old workers do not. I don't know why...
Thanks for digging into this @BobLiu20, and apologies for the ongoing issues. I'll look into this a bit more today and see if I can figure out what's going on.
Hey @BobLiu20, I was able to repro the issue in my environment and resolve it. It appears the problem was due to the Keras Callback state not being reinitialized when workers were added or removed. As a result, the old workers were committing at different steps from the new workers. I have updated the PR with changes to address this, please try it again and let me know if it works this time. Thanks.
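To illustrate the idea behind the fix, here is a hedged sketch (class and method names below are hypothetical, not Horovod's actual API): a commit-state callback counts batches since its last commit, and if that counter is not reinitialized when the worker set changes, old and new workers commit at different steps and the sync stalls.

```python
class CommitStateCallback:
    """Toy model of a state-committing callback (not Horovod's real
    CommitStateCallbackImpl)."""

    def __init__(self, commit_every=100):
        self.commit_every = commit_every
        self.batches_since_commit = 0
        self.commits = 0

    def on_batch_end(self, batch):
        self.batches_since_commit += 1
        if self.batches_since_commit >= self.commit_every:
            self.commits += 1            # commit shared state here
            self.batches_since_commit = 0

    def on_worker_membership_change(self):
        # The essence of the fix: reset per-callback counters so every
        # worker counts from the same point after a rescale event.
        self.batches_since_commit = 0
```

Without the reset, a worker that joins mid-epoch starts its counter at zero while existing workers are partway through an interval, so their commit steps never line up.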
@tgaddair Cool. It is working well now. Thanks |
Environment:
Bug report:
Just run this example
examples/elastic/tensorflow_keras_mnist_elastic.py
in elastic mode. It blocks when trying to scale workers up.

FYI, I am trying to resolve this issue myself. Unfortunately, I can't find out how to reset the unique id of the op name. For example, the first time state is synced the name is `dict.sz`, but the second time it is `dict.sz_1`, and so on. Meanwhile the new worker's name is `dict.sz` on its first sync, so the names mismatch between the new and old workers. Any idea for this issue?