Loading optimizer state without using model.compile #15917

ttdd11 · 2022-01-19T14:22:44Z

TF: 2.5, compiled
Environment: GCP cloud TPU V2-32

The optimizer state is saved with the model when the model has been compiled and is then saved. However, when training with the optimizer's apply_gradients method instead of fit, compile is not required.

In this case, we save the optimizer state using np.save(PATH, optimizer.get_weights()). When resuming training with a distribution strategy, loading that state with optimizer.set_weights does not work. Things we have tried to resolve this:

Calling set_weights directly after initializing the optimizer -> the optimizer's weights have length 0 at this point, so this doesn't work.
Calling set_weights once after the first strategy.run call -> the weights have the correct length now, but the call hangs.
Applying zero gradients inside the same strategy.run so that the optimizer's weights exist before it has run a real step, and then calling set_weights, like:
opt_weights = np.load(opt_path, allow_pickle=True)
grad_vars = model.trainable_weights
zero_grads = [tf.zeros_like(w) for w in grad_vars]
optimizer.apply_gradients(zip(zero_grads, grad_vars))
optimizer.set_weights(opt_weights)

But this results in NotImplementedError: TPUStrategy.run(fn, ...) does not support pure eager execution. please make sure the function passed into strategy.run is a tf.function or strategy.run is called inside a tf.function if eager behavior is enabled

Which is pretty self-explanatory, so we also tried:

Turning eager mode off using tf.compat.v1.disable_eager_execution() -> calling optimizer.set_weights(..) then results in InaccessibleTensorError: Operation 'LogicalAnd_30' has been marked as not fetchable. Typically this happens when it is defined in another function or code block. Use return values,explicit Python locals or TensorFlow collections to access it.
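For reference, the np.save / np.load round trip described above can be sketched with NumPy alone. This is only an illustration under assumptions: the helper names save_weight_list and load_weight_list are hypothetical, and example_weights is a stand-in for whatever optimizer.get_weights() returns. The list is packed into an explicit object array first because, as far as I know, recent NumPy releases refuse to build an array from a list of differently shaped arrays unless dtype=object is requested.

```python
import os
import tempfile

import numpy as np


def save_weight_list(path, weights):
    """Save a list of arrays with differing shapes as one object array."""
    packed = np.empty(len(weights), dtype=object)
    for i, w in enumerate(weights):
        packed[i] = np.asarray(w)
    # Object arrays are stored via pickle, so allow_pickle must stay enabled.
    np.save(path, packed, allow_pickle=True)


def load_weight_list(path):
    """Load the list back; allow_pickle=True is required for object arrays."""
    return list(np.load(path, allow_pickle=True))


# Stand-in for optimizer.get_weights(): arrays of different shapes.
example_weights = [
    np.ones((4, 2), dtype=np.float32),
    np.zeros((2,), dtype=np.float32),
    np.full((3, 3), 0.5, dtype=np.float32),
]

opt_path = os.path.join(tempfile.mkdtemp(), "optimizer_state.npy")
save_weight_list(opt_path, example_weights)
restored = load_weight_list(opt_path)
```

The restored list has the same length and element shapes as the saved one, which is what optimizer.set_weights expects once the optimizer's own weights exist.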

Any advice on how to load weights for an optimizer when doing distributed learning on TPUs would be greatly appreciated. Even a workaround that temporarily compiles and saves a model would be okay for now. We run many expensive experiments, and not being able to restore the optimizer weights to restart and tune is a big challenge.
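One avenue that may be worth trying, and that nobody in this thread has confirmed on a TPU, is to skip get_weights / set_weights and track the optimizer with tf.train.Checkpoint. Variables that do not exist yet at restore time, such as optimizer slots, are then filled in when the first training step creates them. The sketch below is a minimal illustration under assumptions: it uses the default single-device strategy as a stand-in for TPUStrategy, a toy model and data, a temporary local directory, and the helper names build and make_train_step are hypothetical. It targets the TF 2.x tf.keras optimizer API used in this issue and may behave differently on other versions.

```python
import os
import tempfile

import tensorflow as tf

# Default strategy as a stand-in; on a TPU this would be a TPUStrategy.
strategy = tf.distribute.get_strategy()


def build():
    """Create a small model and optimizer under the strategy scope."""
    with strategy.scope():
        model = tf.keras.Sequential(
            [tf.keras.Input(shape=(4,)), tf.keras.layers.Dense(1)])
        optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
    return model, optimizer


def make_train_step(model, optimizer):
    """Return a tf.function that runs one step through strategy.run."""

    def step_fn(x, y):
        with tf.GradientTape() as tape:
            loss = tf.reduce_mean(tf.square(model(x, training=True) - y))
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss

    @tf.function
    def train_step(x, y):
        return strategy.run(step_fn, args=(x, y))

    return train_step


x = tf.ones((8, 4))
y = tf.zeros((8, 1))

# Train for a few steps without compile, then checkpoint model and optimizer.
model, optimizer = build()
train_step = make_train_step(model, optimizer)
for _ in range(3):
    train_step(x, y)
checkpoint = tf.train.Checkpoint(model=model, optimizer=optimizer)
save_path = checkpoint.save(os.path.join(tempfile.mkdtemp(), "ckpt"))

# Fresh objects: restore first, then run a step so the optimizer state
# is created and picks up the saved values.
restored_model, restored_optimizer = build()
restored_checkpoint = tf.train.Checkpoint(
    model=restored_model, optimizer=restored_optimizer)
restored_checkpoint.restore(save_path)
restored_step = make_train_step(restored_model, restored_optimizer)
restored_step(x, y)
```

Because the step runs inside a tf.function via strategy.run, this also avoids the pure eager call that TPUStrategy rejects.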

I'm not sure if this is best posted here or on the TF issues board.


jvishnuvardhan · 2022-01-26T06:04:53Z

@ttdd11 It looks like this is more related to the TPU distribution strategy. Can you open this issue in the TensorFlow repository, where I can triage it to the TPU team? Thanks!

ttdd11 · 2022-01-27T13:40:32Z

@jvishnuvardhan I did a few days ago, sorry that I didn't mention it here: tensorflow/tensorflow#53844. Still waiting to hear back. Any added visibility would be greatly appreciated, as all of our compute experiments are currently on hold.

google-ml-butler · 2022-02-03T14:24:42Z

This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you.

google-ml-butler · 2022-02-10T14:28:37Z

Closing as stale. Please reopen if you'd like to work on this further.

google-ml-butler · 2022-02-10T14:28:39Z

Are you satisfied with the resolution of your issue?

jvishnuvardhan self-assigned this Jan 26, 2022

jvishnuvardhan added the type:support label Jan 26, 2022

jvishnuvardhan added the stat:awaiting response from contributor label Jan 26, 2022

google-ml-butler bot added the stale label Feb 3, 2022

google-ml-butler bot closed this as completed Feb 10, 2022
