Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TerminateOnNaN epoch level callback #17384

Conversation

abinthomasonline
Copy link

epoch frequency support for TerminateOnNaN callback.

Batch level hook fails when used with distributed strategies, as logs is not of type dict (tensorflow.python.distribute.coordinator.values.RemoteValueImpl)

@google-cla
Copy link

google-cla bot commented Dec 30, 2022

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@gbaned gbaned added this to Assigned Reviewer in PR Queue via automation Dec 30, 2022
@gbaned gbaned requested a review from fchollet December 30, 2022 13:50
@google-ml-butler google-ml-butler bot added the keras-team-review-pending Pending review by a Keras team member. label Dec 30, 2022
@abinthomasonline
Copy link
Author

fixed formatting issue

@hertschuh hertschuh requested a review from rchao January 5, 2023 18:20
Copy link
Contributor

@rchao rchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Can you add a test to make sure the newly added paths are covered and the test is passing?

PR Queue automation moved this from Assigned Reviewer to Reviewer Requested Changes Jan 5, 2023
@rchao rchao removed the keras-team-review-pending Pending review by a Keras team member. label Jan 5, 2023
Copy link
Contributor

@hertschuh hertschuh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the Pull Request!
A couple of suggestions:

keras/callbacks.py Outdated Show resolved Hide resolved
keras/callbacks.py Outdated Show resolved Hide resolved
@abinthomasonline
Copy link
Author

Thanks for the PR! Can you add a test to make sure the newly added paths are covered and the test is passing?

added tests and they are passing.

@abinthomasonline
Copy link
Author

@hertschuh @rchao
anything else needed here?

Copy link
Contributor

@rchao rchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay! It's been a busy week. Largely looks good, just a couple quick suggestions.

keras/callbacks.py Outdated Show resolved Hide resolved
keras/callbacks.py Outdated Show resolved Hide resolved
@hmc-cs-mdrissi
Copy link

Looking at original issue,

Batch level hook fails when used with distributed strategies, as logs is not of type dict (tensorflow.python.distribute.coordinator.values.RemoteValueImpl)

why not use RemoveValue.get to convert removevalue to concrete one in batch hook? Why can't Parameter server/other distributed strategies not support batch level terminate on nan?

@abinthomasonline
Copy link
Author

Looking at original issue,

Batch level hook fails when used with distributed strategies, as logs is not of type dict (tensorflow.python.distribute.coordinator.values.RemoteValueImpl)

why not use RemoveValue.get to convert removevalue to concrete one in batch hook? Why can't Parameter server/other distributed strategies not support batch level terminate on nan?

I can include this as well, but calling .get() on remote value comes with sync cost and can slow down distributed training.

@hmc-cs-mdrissi
Copy link

Yes but that is controllable by check_freq. If you use check_freq of 1 and call .get() every batch then training will probably become slow. If you use check_freq of 100-1000 batches then it’s more likely sync effect will be minor. Exact impact will depend a lot on your model/dataset but you can adjust it to minimize performance impact.

@abinthomasonline
Copy link
Author

@rchao @hertschuh
anything remaining?

@gbaned
Copy link
Collaborator

gbaned commented Mar 21, 2023

Hi @rchao / @hertschuh Can you please review this PR ? Thank you!

Copy link
Contributor

@rchao rchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Largely looks good to me, just a couple suggestion on the docs. Thanks and sorry about the delay.


def __init__(self):
Args:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's be consistent with other classes and move the Args section to class doc

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is already in the class doc

keras/callbacks.py Outdated Show resolved Hide resolved
@gbaned
Copy link
Collaborator

gbaned commented Apr 13, 2023

Hi @abinthomasonline Can you please check @rchao's comments and keep us posted ? Thank you!

@abinthomasonline
Copy link
Author

@gbaned @rchao Updated.
Sorry, I missed the updates on this thread.

@gbaned gbaned requested a review from rchao April 17, 2023 14:34
@google-ml-butler google-ml-butler bot added the keras-team-review-pending Pending review by a Keras team member. label Apr 17, 2023
@hertschuh hertschuh removed the keras-team-review-pending Pending review by a Keras team member. label Apr 20, 2023
@rchao
Copy link
Contributor

rchao commented Apr 20, 2023

Update: pradeepkuppla@ is helping with some internal testing, and we'll circle back once we have more updates.

@sachinprasadhs sachinprasadhs added the ready to pull Ready to be merged into the codebase label Apr 29, 2023
@sachinprasadhs
Copy link
Collaborator

@rchao , this is failing TAP test with below error.
Is it something actionable from user side?

third_party/py/keras/api/tests/api_compatibility_test.py", [line 293](https://cs.corp.google.com/piper///depot/google3/third_party/py/keras/api/tests/api_compatibility_test.py?l=293&ws=tap-presubmit-server/54210978&snapshot=2), in _AssertProtoDictEquals
    self.fail(
AssertionError: 1 differences found between API and golden.

copybara-service bot pushed a commit that referenced this pull request May 26, 2023
Imported from GitHub PR #17384

`epoch` frequency support for TerminateOnNaN callback.

Batch level hook fails when used with distributed strategies, as `logs` is not of type `dict` (tensorflow.python.distribute.coordinator.values.RemoteValueImpl)

Copybara import of the project:

--
57f6741 by abinthomasonline <abinthomasonline@gmail.com>:

TerminateOnNaN epoch level callback

--
1fa08cd by Abin Thomas <abinthomasonline@gmail.com>:

formatting
--
9601771 by abinthomasonline <abinthomasonline@gmail.com>:

better error message, better variable name

--
bee433d by abinthomasonline <abinthomasonline@gmail.com>:

typo

--
3383c03 by abinthomasonline <abinthomasonline@gmail.com>:

tests

--
8b5470b by abinthomasonline <abinthomasonline@gmail.com>:

reformat

--
c328382 by abinthomasonline <abinthomasonline@gmail.com>:

sync logs if remote value - terminateonnan callback

--
6b02a18 by abinthomasonline <abinthomasonline@gmail.com>:

log epoch number

--
501ce74 by abinthomasonline <abinthomasonline@gmail.com>:

formatting

--
663660f by abinthomasonline <abinthomasonline@gmail.com>:

add batch trigger check to CallBack

--
1781f39 by abinthomasonline <abinthomasonline@gmail.com>:

update backupandrestore callback

--
09fc4f3 by abinthomasonline <abinthomasonline@gmail.com>:

use model._train_counter in callback_test

--
a72672d by abinthomasonline <abinthomasonline@gmail.com>:

doc fixes

Merging this change closes #17384

FUTURE_COPYBARA_INTEGRATE_REVIEW=#17384 from abinthomasonline:terminate-on-nan-epoch-support a72672d
PiperOrigin-RevId: 535277817
@rchao
Copy link
Contributor

rchao commented Jun 7, 2023

@rchao , this is failing TAP test with below error. Is it something actionable from user side?

third_party/py/keras/api/tests/api_compatibility_test.py", [line 293](https://cs.corp.google.com/piper///depot/google3/third_party/py/keras/api/tests/api_compatibility_test.py?l=293&ws=tap-presubmit-server/54210978&snapshot=2), in _AssertProtoDictEquals
    self.fail(
AssertionError: 1 differences found between API and golden.

@abinthomasonline can you check if following the instruction here helps with resolving this?

@gbaned
Copy link
Collaborator

gbaned commented Aug 18, 2023

Hi @abinthomasonline Any update on this PR? Please. Thank you!

@sachinprasadhs
Copy link
Collaborator

Hello, Thank you for submitting a pull request.

We're currently in the process of migrating the new Keras 3 code base from keras-team/keras-core to keras-team/keras.
Consequently, merging this PR is not possible at the moment. After the migration is successfully completed, feel free to reopen this PR at keras-team/keras if you believe it remains relevant to the Keras 3 code base. If instead this PR fixes a bug or security issue in legacy tf.keras, you can instead reopen the PR at keras-team/tf-keras, which hosts the TensorFlow-only, legacy version of Keras.

PR Queue automation moved this from Reviewer Requested Changes to Closed/Rejected Sep 19, 2023
@google-ml-butler google-ml-butler bot removed the ready to pull Ready to be merged into the codebase label Sep 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
PR Queue
Closed/Rejected
Development

Successfully merging this pull request may close these issues.

None yet

6 participants