TerminateOnNaN epoch level callback #17384

abinthomasonline · 2022-12-30T09:44:15Z

Adds `epoch` frequency support to the TerminateOnNaN callback.

The batch-level hook fails when used with distributed strategies, because `logs` is not a `dict` but a `tensorflow.python.distribute.coordinator.values.RemoteValueImpl`.
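
For readers following along, here is a minimal sketch of what an epoch-level check could look like. It is not the code from this PR: `EpochLevelNaNCheck` is a hypothetical, framework-free stand-in, and it assumes that `logs` arrives as a plain `dict` at epoch end (in Keras the `stop_training` flag lives on the model, not on the callback).

```python
import math


class EpochLevelNaNCheck:
    """Hypothetical stand-in for an epoch-level NaN check (not the PR's code)."""

    def __init__(self):
        # In Keras this flag lives on the model; it is kept here for simplicity.
        self.stop_training = False

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        loss = logs.get("loss")
        # math.isfinite is False for both NaN and +/-inf.
        if loss is not None and not math.isfinite(loss):
            print(f"Epoch {epoch}: invalid loss, terminating training")
            self.stop_training = True


check = EpochLevelNaNCheck()
check.on_epoch_end(0, {"loss": 0.42})          # finite loss, training continues
check.on_epoch_end(1, {"loss": float("nan")})  # NaN loss, flag is set
```

Checking once per epoch avoids touching the per-batch `logs` object at all, at the price of noticing a NaN loss later than a batch-level check would.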

google-cla · 2022-12-30T09:44:18Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

abinthomasonline · 2022-12-30T14:21:06Z

fixed formatting issue

rchao

Thanks for the PR! Can you add a test to make sure the newly added paths are covered and the test is passing?

hertschuh

Thanks for the Pull Request!
A couple of suggestions:

keras/callbacks.py

abinthomasonline · 2023-01-06T10:28:21Z

> Thanks for the PR! Can you add a test to make sure the newly added paths are covered and the test is passing?

added tests and they are passing.

abinthomasonline · 2023-01-13T12:38:33Z

@hertschuh @rchao
anything else needed here?

rchao

Sorry for the delay! It's been a busy week. Largely looks good, just a couple quick suggestions.

keras/callbacks.py

hmc-cs-mdrissi · 2023-01-19T03:59:52Z

Looking at the original issue,

> Batch level hook fails when used with distributed strategies, as logs is not of type dict (tensorflow.python.distribute.coordinator.values.RemoteValueImpl)

Why not use `RemoteValue.get` to convert the remote value to a concrete one in the batch hook? Why can't parameter server and other distributed strategies support batch-level terminate-on-NaN?

abinthomasonline · 2023-01-20T06:14:56Z

> Looking at the original issue,
>
> > Batch level hook fails when used with distributed strategies, as logs is not of type dict (tensorflow.python.distribute.coordinator.values.RemoteValueImpl)
>
> Why not use `RemoteValue.get` to convert the remote value to a concrete one in the batch hook? Why can't parameter server and other distributed strategies support batch-level terminate-on-NaN?

I can include this as well, but calling `.get()` on a remote value comes with a sync cost and can slow down distributed training.

hmc-cs-mdrissi · 2023-01-20T07:21:45Z

Yes, but that is controllable by check_freq. If you use a check_freq of 1 and call .get() every batch, then training will probably become slow. If you use a check_freq of 100-1000 batches, then it’s more likely the sync effect will be minor. The exact impact will depend a lot on your model/dataset, but you can adjust it to minimize the performance impact.
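
The trade-off described above can be sketched without TensorFlow. Everything here is illustrative: `FakeRemoteValue` is a stand-in for a coordinator remote value whose `get()` blocks until the result is available, and `TerminateOnNaNSketch` with its `check_freq` argument is a hypothetical shape for the callback, not the implementation in this PR.

```python
import math


class FakeRemoteValue:
    """Stand-in for a remote value; get() models a blocking sync."""

    def __init__(self, value):
        self._value = value
        self.sync_count = 0

    def get(self):
        self.sync_count += 1
        return self._value


class FakeModel:
    stop_training = False


class TerminateOnNaNSketch:
    """Hypothetical callback that checks the loss every `check_freq` batches."""

    def __init__(self, check_freq=100):
        if check_freq < 1:
            raise ValueError("check_freq must be a positive integer")
        self.check_freq = check_freq
        self.model = FakeModel()
        self._batches_seen = 0

    def on_train_batch_end(self, batch, logs=None):
        self._batches_seen += 1
        if self._batches_seen % self.check_freq != 0:
            return  # skipped batch: no sync and no check
        if logs is not None and not isinstance(logs, dict):
            logs = logs.get()  # blocking sync, paid only on checked batches
        loss = (logs or {}).get("loss")
        if loss is not None and not math.isfinite(loss):
            self.model.stop_training = True


callback = TerminateOnNaNSketch(check_freq=3)
batches = [FakeRemoteValue({"loss": v}) for v in (0.5, float("nan"), float("nan"))]
for i, remote_logs in enumerate(batches):
    callback.on_train_batch_end(i, remote_logs)
```

With `check_freq=3`, only the third batch is synced, so the NaN in the second batch goes unnoticed until the next checked batch. A larger `check_freq` lowers the sync cost and increases that detection delay.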

abinthomasonline · 2023-03-04T05:43:48Z

@rchao @hertschuh
anything remaining?

gbaned · 2023-03-21T17:05:47Z

Hi @rchao / @hertschuh Can you please review this PR ? Thank you!

rchao

Largely looks good to me, just a couple of suggestions on the docs. Thanks, and sorry about the delay.

rchao · 2023-03-21T17:37:31Z

keras/callbacks.py


-    def __init__(self):
+    Args:


Let's be consistent with other classes and move the Args section to class doc

abinthomasonline · 2023-04-13

It is already in the class doc.

keras/callbacks.py

gbaned · 2023-04-13T11:32:10Z

Hi @abinthomasonline Can you please check @rchao's comments and keep us posted ? Thank you!

abinthomasonline · 2023-04-13T13:23:59Z

@gbaned @rchao Updated.
Sorry, I missed the updates on this thread.

rchao · 2023-04-20T18:35:45Z

Update: pradeepkuppla@ is helping with some internal testing, and we'll circle back once we have more updates.

sachinprasadhs · 2023-05-26T18:03:34Z

@rchao, this is failing the TAP test with the error below.
Is this something actionable from the user side?

third_party/py/keras/api/tests/api_compatibility_test.py", [line 293](https://cs.corp.google.com/piper///depot/google3/third_party/py/keras/api/tests/api_compatibility_test.py?l=293&ws=tap-presubmit-server/54210978&snapshot=2), in _AssertProtoDictEquals
    self.fail(
AssertionError: 1 differences found between API and golden.

Imported from GitHub PR #17384

`epoch` frequency support for TerminateOnNaN callback.

Batch level hook fails when used with distributed strategies, as `logs` is not of type `dict` (tensorflow.python.distribute.coordinator.values.RemoteValueImpl)

Copybara import of the project:

-- 57f6741 by abinthomasonline <abinthomasonline@gmail.com>: TerminateOnNaN epoch level callback
-- 1fa08cd by Abin Thomas <abinthomasonline@gmail.com>: formatting
-- 9601771 by abinthomasonline <abinthomasonline@gmail.com>: better error message, better variable name
-- bee433d by abinthomasonline <abinthomasonline@gmail.com>: typo
-- 3383c03 by abinthomasonline <abinthomasonline@gmail.com>: tests
-- 8b5470b by abinthomasonline <abinthomasonline@gmail.com>: reformat
-- c328382 by abinthomasonline <abinthomasonline@gmail.com>: sync logs if remote value - terminateonnan callback
-- 6b02a18 by abinthomasonline <abinthomasonline@gmail.com>: log epoch number
-- 501ce74 by abinthomasonline <abinthomasonline@gmail.com>: formatting
-- 663660f by abinthomasonline <abinthomasonline@gmail.com>: add batch trigger check to CallBack
-- 1781f39 by abinthomasonline <abinthomasonline@gmail.com>: update backupandrestore callback
-- 09fc4f3 by abinthomasonline <abinthomasonline@gmail.com>: use model._train_counter in callback_test
-- a72672d by abinthomasonline <abinthomasonline@gmail.com>: doc fixes

Merging this change closes #17384

FUTURE_COPYBARA_INTEGRATE_REVIEW=#17384 from abinthomasonline:terminate-on-nan-epoch-support a72672d
PiperOrigin-RevId: 535277817

rchao · 2023-06-07T16:52:49Z

> @rchao, this is failing the TAP test with the error below. Is this something actionable from the user side?
>
> third_party/py/keras/api/tests/api_compatibility_test.py", [line 293](https://cs.corp.google.com/piper///depot/google3/third_party/py/keras/api/tests/api_compatibility_test.py?l=293&ws=tap-presubmit-server/54210978&snapshot=2), in _AssertProtoDictEquals
>     self.fail(
> AssertionError: 1 differences found between API and golden.

@abinthomasonline can you check if following the instruction here helps with resolving this?

gbaned · 2023-08-18T10:36:52Z

Hi @abinthomasonline Any update on this PR? Please. Thank you!

sachinprasadhs · 2023-09-19T16:31:48Z

Hello, Thank you for submitting a pull request.

We're currently in the process of migrating the new Keras 3 code base from keras-team/keras-core to keras-team/keras.
Consequently, merging this PR is not possible at the moment. After the migration is successfully completed, feel free to reopen this PR at keras-team/keras if you believe it remains relevant to the Keras 3 code base. If instead this PR fixes a bug or security issue in legacy tf.keras, you can instead reopen the PR at keras-team/tf-keras, which hosts the TensorFlow-only, legacy version of Keras.

TerminateOnNaN epoch level callback

57f6741

google-ml-butler bot added the size:M label Dec 30, 2022

google-ml-butler bot assigned gbaned Dec 30, 2022

abinthomasonline closed this Dec 30, 2022

abinthomasonline reopened this Dec 30, 2022

gbaned added this to Assigned Reviewer in PR Queue via automation Dec 30, 2022

gbaned requested a review from fchollet December 30, 2022 13:50

google-ml-butler bot added the keras-team-review-pending Pending review by a Keras team member. label Dec 30, 2022

formatting

1fa08cd

hertschuh requested a review from rchao January 5, 2023 18:20

rchao suggested changes Jan 5, 2023

PR Queue automation moved this from Assigned Reviewer to Reviewer Requested Changes Jan 5, 2023

rchao removed the keras-team-review-pending Pending review by a Keras team member. label Jan 5, 2023

hertschuh suggested changes Jan 5, 2023

abinthomasonline added 5 commits January 6, 2023 10:00

better error message, better variable name

9601771

typo

bee433d

tests

3383c03

Merge branch 'keras-team:master' into terminate-on-nan-epoch-support

151dc3a

reformat

8b5470b

rchao suggested changes Jan 18, 2023

abinthomasonline added 3 commits January 20, 2023 11:49

Merge branch 'keras-team:master' into terminate-on-nan-epoch-support

a34e0f1

sync logs if remote value - terminateonnan callback

c328382

log epoch number

6b02a18

abinthomasonline added 6 commits January 20, 2023 09:49

add batch trigger check to CallBack

663660f

update backupandrestore callback

1781f39

use model._train_counter in callback_test

09fc4f3

Merge branch 'master' into terminate-on-nan-epoch-support

ea21f65

Merge branch 'keras-team:master' into terminate-on-nan-epoch-support

5dc1144

Merge branch 'keras-team:master' into terminate-on-nan-epoch-support

81880f2

rchao suggested changes Mar 21, 2023

abinthomasonline added 2 commits April 13, 2023 18:42

Merge branch 'keras-team:master' into terminate-on-nan-epoch-support

18b6c85

doc fixes

a72672d

gbaned requested a review from rchao April 17, 2023 14:34

google-ml-butler bot added the keras-team-review-pending Pending review by a Keras team member. label Apr 17, 2023

hertschuh removed the keras-team-review-pending Pending review by a Keras team member. label Apr 20, 2023

sachinprasadhs added the ready to pull Ready to be merged into the codebase label Apr 29, 2023

sachinprasadhs added the pending internal tests label May 26, 2023

sachinprasadhs added failing internal tests and removed pending internal tests labels May 26, 2023

copybara-service bot mentioned this pull request May 26, 2023

PR #17384: TerminateOnNaN epoch level callback #18161

Closed

sachinprasadhs closed this Sep 19, 2023

PR Queue automation moved this from Reviewer Requested Changes to Closed/Rejected Sep 19, 2023

google-ml-butler bot removed the ready to pull Ready to be merged into the codebase label Sep 19, 2023
