TensorFlow sync batch norm elastic compatibility #2100
Comments
Does TF 1.15 support BN?
Yes, it should be supported in TF 1.15. @romerojosh can you confirm which versions of TF are supported?
Does graph mode also work in TF 1.15?
I tested BN in graph mode (TF 1.14), but it does not work. Will graph mode be supported in the future?
Hey @weiminggao, what's the error you're seeing? We have tests that run with graph mode on TF 1.14 and 1.15 here, so it should be working.
Thanks very much, Sync_BN works now. But when I use it for training, it shows "Stalled ranks:". I added the UPDATE_OPS dependency. What is the right way to use Sync_BN? @tgaddair
My code:
@romerojosh can you take a look at the usage of Sync Batch Norm here?
I will take a look and report back.
Considering the
Thanks, but it does not work when I train this way; it shows "Stalled ranks:". I use TF 1.14. Can you test it the same way? @romerojosh
I also tried deleting UPDATE_OPS, but the same problem happens during training. @romerojosh
Running script:
Hi @weiminggao,
Running this script with
If you try this script, does it still stall? Also, in your original case with the stall, how long did you wait before cancelling the run? It is possible the stall message is just due to rank 0 taking more time to start up than the other ranks.
Thanks very much, it works well now.
@tgaddair TensorFlow now mainly encourages tf.keras APIs. There are
Following #2075, TensorFlow now supports sync batch norm.
Currently we use the `size()` constant to determine whether to do sync batch norm and how to scale. This works in eager mode, but not graph mode. We should use the newly introduced `size_op()` instead.
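
To make the concern concrete, here is a minimal, framework-free sketch of why a size captured once at build time can go stale after an elastic resize, while a size looked up on every step cannot. It does not call the real library: `FakeCluster`, `group_moments`, `build_with_constant`, and `build_with_runtime_size` are hypothetical names made up for this illustration, and the moment-combining formula assumes every worker holds the same number of samples.

```python
from statistics import fmean, pvariance


class FakeCluster:
    """Hypothetical stand-in for the worker set; not a real library API."""

    def __init__(self, size):
        self.current_size = size

    def size(self):
        return self.current_size


def group_moments(worker_batches):
    """Combine per-worker moments as an averaging allreduce would.

    Assumption: all workers hold equally sized batches.
    """
    means = [fmean(b) for b in worker_batches]
    # E[x^2] per worker, recovered from its variance and mean.
    mean_squares = [pvariance(b) + fmean(b) ** 2 for b in worker_batches]
    group_mean = fmean(means)
    group_var = fmean(mean_squares) - group_mean ** 2
    return group_mean, group_var


def local_moments(worker_batches):
    """Moments of the first worker only (no synchronization)."""
    batch = worker_batches[0]
    return fmean(batch), pvariance(batch)


def build_with_constant(cluster):
    # Analogous to reading a constant size while the graph is built:
    # the decision is frozen here and never revisited.
    sync = cluster.size() > 1

    def moments(worker_batches):
        return group_moments(worker_batches) if sync else local_moments(worker_batches)

    return moments


def build_with_runtime_size(cluster):
    # Analogous to using an op: the size is re-read on every step.
    def moments(worker_batches):
        if cluster.size() > 1:
            return group_moments(worker_batches)
        return local_moments(worker_batches)

    return moments


cluster = FakeCluster(size=1)
stale = build_with_constant(cluster)
live = build_with_runtime_size(cluster)

cluster.current_size = 2  # simulated elastic resize after "graph build"
batches = [[1.0, 3.0], [5.0, 7.0]]
stale_result = stale(batches)
live_result = live(batches)
```

After the simulated resize, `stale_result` still reflects only the first worker's batch, while `live_result` matches the mean and variance of all four samples. This is only an analogy for the graph-mode behavior described above, not a reproduction of it.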