Loading optimizer state without using model.compile #15917

Closed
ttdd11 opened this issue Jan 19, 2022 · 5 comments
Labels: stale; stat:awaiting response from contributor; type:support (User is asking for help / asking an implementation question. Stackoverflow would be better suited.)

Comments

ttdd11 commented Jan 19, 2022

TF: 2.5, compiled
Environment: GCP cloud TPU V2-32

The optimizer state is saved when you call compile on the model and then save it. However, when training with the optimizer's apply_gradients method instead of fit, no compile step is required.

In this case, we save the optimizer state using np.save(PATH, optimizer.get_weights()). When resuming training with a distributed strategy, loading that state with optimizer.set_weights isn't working. Things we have tried to resolve this:

  1. Calling set_weights directly after initializing the optimizer. The optimizer's weight list has length 0 at this point, so this doesn't work.
  2. Calling set_weights once after the first strategy.run call. The weights are the correct length now, but the call hangs.
  3. Creating the optimizer weights with dummy (zero) gradients before the optimizer has run once, then calling set_weights, using the same strategy.run:
    opt_weights = np.load(opt_path, allow_pickle=True)
    grad_vars = model.trainable_weights
    zero_grads = [tf.zeros_like(w) for w in grad_vars]
    optimizer.apply_gradients(zip(zero_grads, grad_vars))
    optimizer.set_weights(opt_weights)

But this results in NotImplementedError: TPUStrategy.run(fn, ...) does not support pure eager execution. please make sure the function passed into strategy.run is a tf.function or strategy.run is called inside a tf.function if eager behavior is enabled

That error is pretty self-explanatory, so we also tried:

  4. Turning eager mode off using tf.compat.v1.disable_eager_execution(). Calling optimizer.set_weights(...) then raises InaccessibleTensorError: Operation 'LogicalAnd_30' has been marked as not fetchable. Typically this happens when it is defined in another function or code block. Use return values, explicit Python locals or TensorFlow collections to access it.
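On the saving side, one detail may be worth checking independently of the TPU problem. optimizer.get_weights() returns a list of arrays with different shapes, and on recent NumPy versions passing such a ragged list straight to np.save can raise a ValueError instead of silently producing an object array. The following is a minimal sketch of an explicit round trip; the helper names save_weight_list and load_weight_list are hypothetical, and the toy arrays stand in for real optimizer weights.

```python
import os
import tempfile

import numpy as np


def save_weight_list(path, weights):
    """Store a list of differently shaped arrays in one .npy file."""
    # Build the object array explicitly so NumPy never tries to stack
    # the (ragged) list into a single rectangular array.
    packed = np.empty(len(weights), dtype=object)
    for i, w in enumerate(weights):
        packed[i] = np.asarray(w)
    np.save(path, packed, allow_pickle=True)


def load_weight_list(path):
    """Load the list written by save_weight_list."""
    return list(np.load(path, allow_pickle=True))


# Toy stand-ins for optimizer.get_weights(); shapes differ on purpose.
weights = [np.zeros((3, 2)), np.ones(2), np.arange(4.0)]
path = os.path.join(tempfile.mkdtemp(), "optimizer_weights.npy")

save_weight_list(path, weights)
loaded = load_weight_list(path)
```

This only addresses serialization; it does not change when or how set_weights can be called under a distribution strategy.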

Any advice on how to load optimizer weights when doing distributed training on TPUs would be greatly appreciated. Even a workaround that temporarily compiles and saves a model would be okay for now. We run many expensive experiments, and not being able to restore the optimizer state to restart and tune them is a big challenge.
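One possible route, not confirmed anywhere in this thread, is tf.train.Checkpoint, which tracks a model and an optimizer together and does not require model.compile. The sketch below runs on the default (CPU) strategy with a toy model; make_model_and_optimizer and train_step are hypothetical helpers. It assumes that on a TPU the objects would be created inside strategy.scope(), that the step would run through strategy.run inside a tf.function, and that the checkpoint directory would be a location the TPU workers can reach (for example a GCS bucket). Whether this avoids the hang described above on a TPU pod is untested here.

```python
import os
import tempfile

import tensorflow as tf


def make_model_and_optimizer():
    # Hypothetical toy model; it stands in for the real network.
    model = tf.keras.Sequential([tf.keras.layers.Dense(2)])
    model.build((None, 3))
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9)
    return model, optimizer


def train_step(model, optimizer, x, y):
    # Custom training step: no model.compile, only apply_gradients.
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(model(x) - y))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss


x = tf.ones((4, 3))
y = tf.zeros((4, 2))

# First run: train one step, then save model and optimizer together.
model, optimizer = make_model_and_optimizer()
train_step(model, optimizer, x, y)
saved_weights = [v.numpy() for v in model.trainable_variables]

ckpt_dir = tempfile.mkdtemp()
save_path = tf.train.Checkpoint(model=model, optimizer=optimizer).save(
    os.path.join(ckpt_dir, "ckpt"))

# Restart: build fresh objects and restore into them.
restored_model, restored_optimizer = make_model_and_optimizer()
status = tf.train.Checkpoint(
    model=restored_model, optimizer=restored_optimizer).restore(save_path)
# Optimizer variables may not exist yet at this point, so do not
# insist that every saved value has already been matched.
status.expect_partial()
restored_weights = [v.numpy() for v in restored_model.trainable_variables]

# Training continues from the restored state.
train_step(restored_model, restored_optimizer, x, y)
```

The design point is that restoring is attached to the objects rather than to a weight list, so there is no need to call set_weights at a particular moment relative to the first optimizer step.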

I'm not sure if this is best posted here or on the TF issues board.

jvishnuvardhan self-assigned this Jan 26, 2022
jvishnuvardhan added the type:support label Jan 26, 2022
@jvishnuvardhan
Contributor

@ttdd11 It looks like this is more related to the TPU distribution strategy. Can you open this issue on TensorFlow, where I can triage it to the TPU team? Thanks!

ttdd11 commented Jan 27, 2022

@jvishnuvardhan I did a few days ago, sorry that I didn't mention it here: tensorflow/tensorflow#53844. Still waiting to hear back. Any added visibility would be greatly appreciated, as all of our compute experiments are currently on hold.

@google-ml-butler

This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you.

@google-ml-butler

Closing as stale. Please reopen if you'd like to work on this further.

