TerminateOnNaN epoch level callback #17384
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up-to-date status, view the checks section at the bottom of the pull request.
Fixed formatting issue.
Thanks for the PR! Can you add a test to make sure the newly added paths are covered and the test is passing?
Thanks for the Pull Request!
A couple of suggestions:
Added tests, and they are passing.
@hertschuh @rchao
Sorry for the delay! It's been a busy week. Largely looks good, just a couple quick suggestions.
Looking at the original issue,
why not use `RemoteValue.get` to convert the RemoteValue to a concrete value in the batch hook? Why can't ParameterServer and other distributed strategies support batch-level terminate on NaN?
I can include this as well, but calling |
Yes, but that is controllable by check_freq. If you use a check_freq of 1 and call .get() every batch, training will probably become slow. If you use a check_freq of 100-1000 batches, the sync effect is more likely to be minor. The exact impact will depend a lot on your model and dataset, but you can adjust it to minimize the performance impact.
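To make the trade-off above concrete, here is a small framework-free sketch. It is only an illustration: `FakeRemoteValue`, `check_loss_every`, and the way `check_freq` is applied are assumptions made for this example, not the code in this PR. It counts how many blocking `.get()` calls happen for a given `check_freq` and at which batch an invalid loss is first noticed.

```python
import math


class FakeRemoteValue:
    """Hypothetical stand-in for a remote value; get() models a blocking sync."""

    def __init__(self, value):
        self._value = value

    def get(self):
        return self._value


def check_loss_every(losses, check_freq):
    """Check the loss on every `check_freq`-th batch.

    Returns (first_bad_batch, sync_count). first_bad_batch is None when no
    NaN/Inf loss was seen at a checked batch.
    """
    sync_count = 0
    for batch, loss in enumerate(losses):
        # Skip batches that are not on a check boundary, so no sync happens.
        if (batch + 1) % check_freq != 0:
            continue
        value = FakeRemoteValue(loss).get()  # one blocking sync per check
        sync_count += 1
        if math.isnan(value) or math.isinf(value):
            return batch, sync_count
    return None, sync_count


# Once the loss becomes NaN it typically stays NaN for later batches.
losses = [0.9, 0.7, 0.6, float("nan"), float("nan"), float("nan")]
print(check_loss_every(losses, check_freq=1))
print(check_loss_every(losses, check_freq=3))
```

With a larger `check_freq` the invalid loss is noticed later, but fewer syncs are paid for, which is the tuning knob described in the comment above.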
@rchao @hertschuh
Hi @rchao / @hertschuh, can you please review this PR? Thank you!
Largely looks good to me, just a couple of suggestions on the docs. Thanks and sorry about the delay.
def __init__(self):
Args:
Let's be consistent with other classes and move the Args section to the class doc.
It is already in the class doc.
Hi @abinthomasonline, can you please check @rchao's comments and keep us posted? Thank you!
Update: pradeepkuppla@ is helping with some internal testing, and we'll circle back once we have more updates.
@rchao, this is failing the TAP test with the error below.
Imported from GitHub PR #17384

`epoch` frequency support for TerminateOnNaN callback. Batch level hook fails when used with distributed strategies, as `logs` is not of type `dict` (tensorflow.python.distribute.coordinator.values.RemoteValueImpl)

Copybara import of the project:

-- 57f6741 by abinthomasonline <abinthomasonline@gmail.com>: TerminateOnNaN epoch level callback
-- 1fa08cd by Abin Thomas <abinthomasonline@gmail.com>: formatting
-- 9601771 by abinthomasonline <abinthomasonline@gmail.com>: better error message, better variable name
-- bee433d by abinthomasonline <abinthomasonline@gmail.com>: typo
-- 3383c03 by abinthomasonline <abinthomasonline@gmail.com>: tests
-- 8b5470b by abinthomasonline <abinthomasonline@gmail.com>: reformat
-- c328382 by abinthomasonline <abinthomasonline@gmail.com>: sync logs if remote value - terminateonnan callback
-- 6b02a18 by abinthomasonline <abinthomasonline@gmail.com>: log epoch number
-- 501ce74 by abinthomasonline <abinthomasonline@gmail.com>: formatting
-- 663660f by abinthomasonline <abinthomasonline@gmail.com>: add batch trigger check to CallBack
-- 1781f39 by abinthomasonline <abinthomasonline@gmail.com>: update backupandrestore callback
-- 09fc4f3 by abinthomasonline <abinthomasonline@gmail.com>: use model._train_counter in callback_test
-- a72672d by abinthomasonline <abinthomasonline@gmail.com>: doc fixes

Merging this change closes #17384

FUTURE_COPYBARA_INTEGRATE_REVIEW=#17384 from abinthomasonline:terminate-on-nan-epoch-support a72672d
PiperOrigin-RevId: 535277817
@abinthomasonline, can you check if following the instructions here helps with resolving this?
Hi @abinthomasonline, any update on this PR, please? Thank you!
Hello, Thank you for submitting a pull request. We're currently in the process of migrating the new |
`epoch` frequency support for TerminateOnNaN callback. Batch level hook fails when used with distributed strategies, as `logs` is not of type `dict` (tensorflow.python.distribute.coordinator.values.RemoteValueImpl).
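For readers unfamiliar with the failure mode, the following is a rough, framework-free sketch of an epoch-level check that also tolerates a non-`dict` `logs` object. `EpochNaNCheck` and `StubRemoteLogs` are hypothetical names invented for this illustration, and `stop_training` here stands in for the `model.stop_training` flag a Keras callback would set. It is not the implementation from this PR.

```python
import math


class StubRemoteLogs:
    """Hypothetical stand-in for a remote logs object that must be resolved."""

    def __init__(self, logs):
        self._logs = logs

    def get(self):
        return self._logs


class EpochNaNCheck:
    """Sketch of an epoch-level NaN/Inf check (not the PR's actual code)."""

    def __init__(self):
        self.stop_training = False  # stands in for model.stop_training

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        # Assumption: a non-dict logs object exposes get() to fetch a dict.
        if not isinstance(logs, dict) and hasattr(logs, "get"):
            logs = logs.get()
        loss = logs.get("loss")
        if loss is not None and (math.isnan(loss) or math.isinf(loss)):
            print(f"Epoch {epoch}: invalid loss, terminating training")
            self.stop_training = True


check = EpochNaNCheck()
check.on_epoch_end(0, {"loss": 0.42})
check.on_epoch_end(1, StubRemoteLogs({"loss": float("nan")}))
```

Checking once per epoch means at most one resolution of the logs per epoch, at the cost of training continuing until the epoch finishes after the loss has already become invalid.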