
TF 2.x BroadcastGlobalVariablesHook Fix #1265

Merged
merged 21 commits into horovod:master on Aug 13, 2019

Conversation

@DEKHTIARJonathan (Contributor) commented Jul 31, 2019

Hi,

I found that BroadcastGlobalVariablesHook does not work with TensorFlow 2.x.

Issue:

  File "cnn_session.py", line 169, in <module>
    training_hooks.append(hvd.BroadcastGlobalVariablesHook(0))
AttributeError: module 'horovod.tensorflow' has no attribute 'BroadcastGlobalVariablesHook'

Same issue with: broadcast_global_variables

  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 167, in begin
    self.bcast_op = broadcast_global_variables(self.root_rank)
NameError: name 'broadcast_global_variables' is not defined

Same issue with: get_default_graph

  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 172, in begin
    if not self.bcast_op or self.bcast_op.graph != tf.get_default_graph():
AttributeError: module 'tensorflow' has no attribute 'get_default_graph'

In short, the hook is never defined under TF 2.x. This PR implements fixes for these issues.

Tested on MaskRCNN, and it works as expected.
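
As context, here is a minimal sketch (my own, not the PR's actual diff) of the kind of compatibility shim the errors above call for: resolve the graph-mode helpers through tf.compat.v1 when the top-level TF 1.x names no longer exist, as in TF 2.x.

    # Sketch only: resolve TF 1.x graph-mode helpers in a TF-2.x-compatible way.
    import tensorflow as tf

    try:
        _get_default_graph = tf.compat.v1.get_default_graph   # TF 1.14+ and TF 2.x
        _global_variables = tf.compat.v1.global_variables
    except AttributeError:
        # Older TF releases that predate the compat.v1 namespace.
        _get_default_graph = tf.get_default_graph
        _global_variables = tf.global_variables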

DEKHTIARJonathan added some commits Jul 31, 2019

TF BroadcastGlobalVariablesHook Fix
Signed-off-by: DEKHTIARJonathan <jdekhtiar@nvidia.com>
tf.global_variables() fix and uniformization try/except
Signed-off-by: DEKHTIARJonathan <jdekhtiar@nvidia.com>

@DEKHTIARJonathan DEKHTIARJonathan force-pushed the DEKHTIARJonathan:tf2_fix branch from 82b02ce to 9aad5b2 Jul 31, 2019

get_default_graph issue fix
Signed-off-by: DEKHTIARJonathan <jdekhtiar@nvidia.com>
@alsrgv (Member) commented Aug 2, 2019

@DEKHTIARJonathan, the reason these were excluded is that when I examined tf.compat.v1.global_variables(), I found it was always empty. So having this hook gives the user a false impression that variable synchronization is happening when, in reality, it is not.

Do you see tf.compat.v1.global_variables() returning a non-empty list in your case?

@alsrgv (Member) commented Aug 2, 2019

This is the more TF 2.0 way of broadcasting model & optimizer weights: https://github.com/horovod/horovod/blob/master/examples/tensorflow2_mnist.py#L75
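
For reference, a sketch of the eager-mode pattern that example follows (the tiny Keras model below is a placeholder of mine, not from the example): wrap the tape with Horovod and broadcast model and optimizer state from rank 0 after the first step, once the variables exist.

    import tensorflow as tf
    import horovod.tensorflow as hvd

    hvd.init()
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])   # placeholder model
    opt = tf.optimizers.Adam(0.001 * hvd.size())

    def training_step(images, labels, first_batch):
        with tf.GradientTape() as tape:
            logits = model(images, training=True)
            loss = tf.reduce_mean(tf.keras.losses.sparse_categorical_crossentropy(
                labels, logits, from_logits=True))
        # Average gradients across workers.
        tape = hvd.DistributedGradientTape(tape)
        grads = tape.gradient(loss, model.trainable_variables)
        opt.apply_gradients(zip(grads, model.trainable_variables))
        if first_batch:
            # Variables are created by now, so rank 0's values can be broadcast.
            hvd.broadcast_variables(model.variables, root_rank=0)
            hvd.broadcast_variables(opt.variables(), root_rank=0)
        return loss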

@DEKHTIARJonathan (Contributor, Author) commented Aug 3, 2019

In TensorFlow 2 you have the option to disable eager execution, and if you do, GradientTape is not the way to go. In that case, tf.compat.v1.global_variables() returns everything you expect.
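
For illustration, a short sketch (my own, assuming TF 2.x; the variable "v" is just a placeholder) of what disabling eager execution looks like and why the global-variables collection is populated again:

    import tensorflow as tf

    tf.compat.v1.disable_eager_execution()   # back to graph mode

    v = tf.compat.v1.get_variable("v", shape=[],
                                  initializer=tf.compat.v1.zeros_initializer())
    with tf.compat.v1.Session() as sess:
        sess.run(tf.compat.v1.global_variables_initializer())
        # In graph mode the collection is populated, e.g. [<tf.Variable 'v:0' ...>]
        print(tf.compat.v1.global_variables())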

We are currently testing TF 2.0 and porting MaskRCNN to it, and it looks very stable and consistent with our results in TF 1.x (I am using the modifications in this PR to make it work).

@alsrgv: maybe we could issue a user warning if we detect that tf.compat.v1.global_variables() is empty, thereby protecting the user from a silent failure.

@alsrgv (Member) commented Aug 4, 2019

I see. But it seems to defeat the purpose of migrating to TensorFlow 2.0. Why not keep using TensorFlow 1.14?

I'm worried that bringing back these hooks will be very misleading. Unless eager execution is disabled, they will silently do nothing.

A middle ground could be throwing an error with TensorFlow 2.0 if eager mode was not disabled.

@DEKHTIARJonathan (Contributor, Author) commented Aug 4, 2019

Going in this direction would basically block everyone using Horovod from migrating to TF 2.x without completely rewriting their project for eager execution. In addition, it is still unclear whether eager execution has any performance cost.

TensorFlow provides a script, tf_upgrade_v2, which converts most of the API calls to tf.compat.v1.*. Enforcing "if you use TF 2.x you must use eager execution", without any technical justification, would block numerous people from using Horovod with TF 2.x or from upgrading to TF 2.x at all.

As an additional data point, tf.estimator is not deprecated and still exists, and this API runs only in graph mode: https://github.com/tensorflow/estimator/blob/r2.0/tensorflow_estimator/python/estimator/estimator.py#L340. Preventing use of the Horovod hook would effectively forbid using tf.estimator with Horovod in TF 2.x, even though it is not deprecated.
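
To make that use case concrete, here is a sketch (mine, not code from the PR; the toy model_fn and input_fn are placeholders) of an Estimator training run that relies on the hook being available:

    import numpy as np
    import tensorflow as tf
    import horovod.tensorflow as hvd

    hvd.init()

    def model_fn(features, labels, mode):
        # Minimal linear model, illustrative only.
        logits = tf.compat.v1.layers.dense(features, 2)
        loss = tf.compat.v1.losses.sparse_softmax_cross_entropy(labels, logits)
        opt = hvd.DistributedOptimizer(tf.compat.v1.train.AdagradOptimizer(0.01))
        train_op = opt.minimize(
            loss, global_step=tf.compat.v1.train.get_or_create_global_step())
        return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

    def input_fn():
        x = np.random.rand(256, 4).astype(np.float32)
        y = np.random.randint(0, 2, size=(256,)).astype(np.int64)
        return tf.data.Dataset.from_tensor_slices((x, y)).repeat().batch(32)

    # Only rank 0 writes checkpoints; the hook broadcasts rank 0's initial state.
    estimator = tf.estimator.Estimator(
        model_fn=model_fn,
        model_dir='./ckpt' if hvd.rank() == 0 else None)
    estimator.train(
        input_fn=input_fn,
        steps=100 // hvd.size(),
        hooks=[hvd.BroadcastGlobalVariablesHook(0)])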

I agree it can be misleading/dangerous, hence maybe we could issue a warning or raise an exception if we detect eager mode. Nonetheless, blocking it completely does not seem like the best strategy here.

@alsrgv (Member) commented Aug 4, 2019

That's fair. Can we throw an error if the hook/functions are used outside of graph mode?

@DEKHTIARJonathan (Contributor, Author) commented Aug 4, 2019

I think it's doable. Let me look into it ;)

@DEKHTIARJonathan (Contributor, Author) commented Aug 4, 2019

@alsrgv Seems like yes: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/eager/context.py#L648
https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/executing_eagerly
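
For illustration, a sketch of the kind of guard being discussed (not the PR's exact code): refuse to build the broadcast op when executing eagerly instead of silently doing nothing.

    import tensorflow as tf
    import horovod.tensorflow as hvd

    def broadcast_global_variables(root_rank):
        # Guard: the op-based broadcast only makes sense in graph mode.
        if tf.executing_eagerly():
            raise RuntimeError(
                "broadcast_global_variables() does not support eager execution; "
                "broadcast an explicit variable list with hvd.broadcast_variables() "
                "instead.")
        return hvd.broadcast_variables(tf.compat.v1.global_variables(), root_rank)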

I'll try to edit my PR this week and test it to see if it works ;)

Out of curiosity, is it even possible to use a TF hook in eager mode?

@alsrgv (Member) commented Aug 5, 2019

Sounds good. We should add tests for these scenarios.

DEKHTIARJonathan added some commits Aug 5, 2019

Eager Execution Protection for TF Hook
Signed-off-by: DEKHTIARJonathan <contact@jonathandekhtiar.eu>
Merge remote-tracking branch 'origin/master' into tf2_fix
Signed-off-by: DEKHTIARJonathan <contact@jonathandekhtiar.eu>
Missing Symbol Fix
Signed-off-by: DEKHTIARJonathan <contact@jonathandekhtiar.eu>
@DEKHTIARJonathan (Contributor, Author) commented Aug 5, 2019

@alsrgv I've looked into it, and I don't know how to add a unit test for this. It's impossible to use a TF hook in eager mode, at least to the best of my knowledge.

Let me know if there's anything else I can do to get this PR merged.

Missing Symbol Fix - TF 1.6.0
Signed-off-by: DEKHTIARJonathan <contact@jonathandekhtiar.eu>
@alsrgv (Member) left a review comment

What if you enable eager mode and start using MonitoredTrainingSession with hooks? E.g. https://github.com/horovod/horovod/blob/master/examples/tensorflow_mnist.py#L150

Left more comments in-line.
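
For reference, a sketch of that graph-mode MonitoredTrainingSession pattern, written against tf.compat.v1 so it also runs on TF 2.x with eager execution disabled (the toy loss is a placeholder of mine, not from the example):

    import tensorflow as tf
    import horovod.tensorflow as hvd

    hvd.init()
    tf.compat.v1.disable_eager_execution()   # needed on TF 2.x

    x = tf.compat.v1.get_variable("x", shape=[],
                                  initializer=tf.compat.v1.ones_initializer())
    loss = tf.square(x - 3.0)                # placeholder objective
    opt = hvd.DistributedOptimizer(tf.compat.v1.train.GradientDescentOptimizer(0.1))
    train_op = opt.minimize(
        loss, global_step=tf.compat.v1.train.get_or_create_global_step())

    hooks = [hvd.BroadcastGlobalVariablesHook(0),
             tf.compat.v1.train.StopAtStepHook(last_step=100)]

    with tf.compat.v1.train.MonitoredTrainingSession(hooks=hooks) as sess:
        while not sess.should_stop():
            sess.run(train_op)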

Review thread on horovod/tensorflow/__init__.py (outdated):
)

if not self.bcast_op or self.bcast_op.graph != _get_default_graph():

@alsrgv (Member) commented Aug 6, 2019

Nit: remove space

Two more review threads on horovod/tensorflow/__init__.py were marked resolved.
@DEKHTIARJonathan (Contributor, Author) commented Aug 6, 2019

@alsrgv You can't use the session in eager mode. That's the whole point, actually.

@alsrgv (Member) commented Aug 6, 2019

I see, that's good. What about estimators: can you use an estimator in eager mode?

@DEKHTIARJonathan (Contributor, Author) commented Aug 6, 2019

If you use an estimator, graph mode is enforced:

https://github.com/tensorflow/estimator/blob/r2.0/tensorflow_estimator/python/estimator/estimator.py#L34

Actually, this check on eager execution is useless in that case...

@alsrgv (Member) commented Aug 6, 2019

OK, gotcha. That's good then. Let's just make sure broadcast_global_variables() throws an exception in eager mode, and we should be good to go.

@DEKHTIARJonathan (Contributor, Author) commented Aug 6, 2019

I'll apply the requested changes tomorrow, and you can merge if it's good with you.


DEKHTIARJonathan added some commits Aug 7, 2019

Changed requested before merge
Signed-off-by: DEKHTIARJonathan <jdekhtiar@nvidia.com>
Merge branch 'master' into tf2_fix
Signed-off-by: DEKHTIARJonathan <jdekhtiar@nvidia.com>
@DEKHTIARJonathan (Contributor, Author) commented Aug 7, 2019

@alsrgv I have done the following:

@@ -26,7 +26,7 @@
 def _executing_eagerly():
     """Returns true if eager execution is supported and enabled."""
-    return _has_eager and context.in_eager_mode()
+    return _has_eager and context.executing_eagerly()()

@alsrgv (Member) commented Aug 7, 2019

Too many parentheses?

@DEKHTIARJonathan (Contributor, Author) commented Aug 7, 2019

Oops. Sent a patch.

if hasattr(tf, 'estimator') and hasattr(tf.estimator, 'SessionRunHook'):
    _SessionRunHook = tf.estimator.SessionRunHook
else:
    if _executing_eagerly():

@alsrgv (Member) commented Aug 7, 2019

Great! Can you add a test for this functionality (verify that it fails in eager mode)?

Typo Fix
Signed-off-by: DEKHTIARJonathan <jdekhtiar@nvidia.com>
@alsrgv (Member) commented Aug 11, 2019

@DEKHTIARJonathan, bump. We're planning to release Horovod 0.17.0 next week, and it would be great to include this fix. We just need to add a test which validates the exception, like this :-)

@DEKHTIARJonathan (Contributor, Author) commented Aug 11, 2019

I'll do my best to send something tomorrow. I really hope it will make it into 0.17.0.

DEKHTIARJonathan added some commits Aug 12, 2019

Merge remote-tracking branch 'origin/master' into tf2_fix
Signed-off-by: DEKHTIARJonathan <jdekhtiar@nvidia.com>
Unittests added
Signed-off-by: DEKHTIARJonathan <jdekhtiar@nvidia.com>
Unittest Fix
Signed-off-by: DEKHTIARJonathan <jdekhtiar@nvidia.com>
Unittest Fix
Signed-off-by: DEKHTIARJonathan <jdekhtiar@nvidia.com>
Unittest Fix
Signed-off-by: DEKHTIARJonathan <jdekhtiar@nvidia.com>
Python2 Fix Unittest
Signed-off-by: DEKHTIARJonathan <contact@jonathandekhtiar.eu>

@DEKHTIARJonathan DEKHTIARJonathan force-pushed the DEKHTIARJonathan:tf2_fix branch from 1518ab6 to b7961b7 Aug 12, 2019

@DEKHTIARJonathan (Contributor, Author) commented Aug 12, 2019

@alsrgv: Done. If it's good with you, please merge; otherwise feel free to modify directly.
Took me a few iterations ;)

Unittest Fix
Signed-off-by: DEKHTIARJonathan <jdekhtiar@nvidia.com>
@@ -19,6 +19,7 @@
 from __future__ import division
 from __future__ import print_function

+from distutils.version import LooseVersion

@alsrgv (Member) commented Aug 13, 2019

Remove?

@DEKHTIARJonathan (Contributor, Author) commented Aug 13, 2019

Indeed, unused

"""Test that tries to broadcast tensorflow global variables
in eager execution mode. This call should raise a RuntimeError."""

if not hvd.util._has_eager:

@alsrgv (Member) commented Aug 13, 2019

These tests are already executed both eagerly and in graph mode (this can be checked via executing_eagerly()); maybe we can just check for that and skip the test if there is a mode mismatch?
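
A sketch of that suggested pattern (the test class and method names are hypothetical): skip the check unless the test process is already running eagerly, and assert that the call raises.

    import unittest
    import tensorflow as tf
    import horovod.tensorflow as hvd

    class BroadcastEagerTest(unittest.TestCase):
        def test_broadcast_global_variables_raises_in_eager_mode(self):
            if not tf.executing_eagerly():
                self.skipTest("Only meaningful when the suite runs eagerly.")
            hvd.init()
            with self.assertRaises(RuntimeError):
                hvd.broadcast_global_variables(root_rank=0)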

@DEKHTIARJonathan (Contributor, Author) commented Aug 13, 2019

Oh, that explains why I had so much trouble enabling/disabling eager execution. I'll update my code :)


from tensorflow.python.eager import context
with context.eager_mode():

@alsrgv (Member) commented Aug 13, 2019

Style: avoid an empty line after the colon.

@alsrgv (Member) commented Aug 13, 2019

@DEKHTIARJonathan, just a few comments.

DEKHTIARJonathan added some commits Aug 13, 2019

Requested Changes Applied
Signed-off-by: DEKHTIARJonathan <jdekhtiar@nvidia.com>
Merge remote-tracking branch 'origin/master' into tf2_fix
Signed-off-by: DEKHTIARJonathan <jdekhtiar@nvidia.com>
@DEKHTIARJonathan (Contributor, Author) commented Aug 13, 2019

@alsrgv Requested changes applied.
@nluehr FYI.

@alsrgv (Member) approved these changes on Aug 13, 2019 and left a comment

LGTM, thanks!

Remove spaces
Signed-off-by: Alex Sergeev <alexander.sergeev@live.com>
@DEKHTIARJonathan (Contributor, Author) commented Aug 13, 2019

Awesome! Let's merge and make sure it makes it into the 0.17.0 release :)

@alsrgv alsrgv merged commit 6078b46 into horovod:master Aug 13, 2019

2 checks passed:

- DCO: passed
- buildkite/horovod/pr: Build #947 passed (1 hour, 51 minutes, 58 seconds)

@DEKHTIARJonathan DEKHTIARJonathan deleted the DEKHTIARJonathan:tf2_fix branch Aug 13, 2019
