
Fix ModelCheckpoint trained-on batch counting when using steps_per_execution>1 #17632

Merged
merged 3 commits into from Aug 17, 2023

Conversation

jasnyj
Contributor

@jasnyj jasnyj commented Mar 4, 2023

When setting the steps_per_execution argument to a value N>1 when calling model.compile(), the on_train_batch_end() method of model.fit()'s callbacks only gets called every N batches with an argument batch equal to the 0-indexed index of the last batch which has been trained on. That is, after the first N trained-on batches, on_train_batch_end() gets called with its batch argument equal to N-1, then N trained-on batches later, to 2N-1, etc. until the end of the epoch.
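This invocation pattern can be sketched with a plain-Python stand-in (a hypothetical helper for illustration, not the actual Keras internals):

```python
# Sketch of when model.fit() invokes on_train_batch_end() with
# steps_per_execution=N: the hook fires once every N batches, and its
# `batch` argument is the 0-indexed position of the last batch executed.
def batch_end_calls(num_batches, steps_per_execution):
    """Return the sequence of `batch` arguments passed to on_train_batch_end()."""
    return list(range(steps_per_execution - 1, num_batches, steps_per_execution))

# With N=4 over 12 batches, the hook sees batch indices 3, 7 and 11.
print(batch_end_calls(12, 4))  # [3, 7, 11]
# With N=1 it fires on every batch, as usual.
print(batch_end_calls(5, 1))   # [0, 1, 2, 3, 4]
```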

In order to handle this situation, ModelCheckpoint uses a _last_batch_seen member integer variable to record the value of the batch argument of its on_train_batch_end() method the last time this method was called. When on_train_batch_end() is called again, ModelCheckpoint then computes (in its _should_save_on_batch() method) add_batches = batch - self._last_batch_seen in order to know the number of batches which have been trained on between two consecutive calls to its on_train_batch_end() method.

However, the _last_batch_seen member variable is initialized to 0, which means that, when using steps_per_execution=N, the first time on_train_batch_end() is called after N batches have been trained on (with a batch argument equal to N-1), only N-1 batches are counted, since add_batches = batch - self._last_batch_seen = (N-1) - 0 = N-1 instead of N. This makes ModelCheckpoint miss one batch in its count and effectively offsets its save_freq constructor argument by 1. Therefore an initialization value of -1 is needed.

In the special cases of steps_per_execution=1 or steps_per_execution=None (which are equivalent), the bug was hidden by the fact that the condition to check for a new epoch (batch <= self._last_batch_seen) was true, since on the first call to on_train_batch_end() both the batch argument and the _last_batch_seen variable were equal to 0. In this case, the number of batches trained on is counted by computing add_batches = batch + 1 = 1, which is indeed the correct result.
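The off-by-one can be reproduced with a stand-alone replay of the counting logic described above (simplified for illustration; not the actual Keras source):

```python
# Replay of the add_batches arithmetic from _should_save_on_batch(),
# reduced to the counting itself.
def count_batches(batch_ends, initial_last_seen):
    """Total batches counted over a sequence of on_train_batch_end() `batch` args."""
    last_batch_seen = initial_last_seen
    batches_so_far = 0
    for batch in batch_ends:
        if batch <= last_batch_seen:              # new-epoch condition
            add_batches = batch + 1
        else:
            add_batches = batch - last_batch_seen
        batches_so_far += add_batches
        last_batch_seen = batch
    return batches_so_far

# steps_per_execution=4 over 12 batches: the hook fires at 3, 7, 11.
print(count_batches([3, 7, 11], initial_last_seen=0))   # 11 (one batch lost)
print(count_batches([3, 7, 11], initial_last_seen=-1))  # 12 (correct)
# steps_per_execution=1: both initializations give the right answer,
# which is why the bug stayed hidden in that case.
print(count_batches([0, 1, 2], initial_last_seen=-1))   # 3
```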

@google-cla

google-cla bot commented Mar 4, 2023

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@fchollet
Member

fchollet commented Mar 5, 2023

Thanks for the PR! Please add a unit test for the change.

@jasnyj
Contributor Author

jasnyj commented Mar 6, 2023

I'm new to writing unit tests for Keras. Should I create a new method named something like test_fit_with_ModelCheckpoint_with_steps_per_execution in the KerasCallbacksTest class of the keras/callbacks_test.py file? I guess I should create a test which fails with the current Keras nightly (demonstrating the bug) but passes with my change, using asserts to check that the checkpoint was correctly created.

@gbaned gbaned added this to Assigned Reviewer in PR Queue via automation Mar 7, 2023
@gbaned gbaned requested a review from fchollet March 7, 2023 18:42
@google-ml-butler google-ml-butler bot added the keras-team-review-pending Pending review by a Keras team member. label Mar 7, 2023
@fchollet
Member

fchollet commented Mar 7, 2023

Yes, exactly. You can use a tmp folder for saving the checkpoints (see how it's done in other tests).

@qlzh727 qlzh727 added stat:awaiting response from contributor and removed keras-team-review-pending Pending review by a Keras team member. labels Mar 8, 2023
@jasnyj
Contributor Author

jasnyj commented Mar 11, 2023

I have just added unit tests which correctly fail without the fix and pass with it.
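For intuition about what such a test asserts, here is a stand-alone replay (a hypothetical helper, with save_freq taken as an integer batch count; the real test builds a model, fits it, and checks the checkpoint files in a tmp folder):

```python
# Replay of the _should_save_on_batch() decision: given the sequence of
# `batch` args seen by on_train_batch_end(), return the batches at which
# an integer save_freq would trigger a checkpoint save.
def save_points(batch_ends, save_freq, initial_last_seen=-1):
    last_batch_seen = initial_last_seen
    batches_since_last_save = 0
    saves = []
    for batch in batch_ends:
        if batch <= last_batch_seen:              # new-epoch condition
            add_batches = batch + 1
        else:
            add_batches = batch - last_batch_seen
        batches_since_last_save += add_batches
        last_batch_seen = batch
        if batches_since_last_save >= save_freq:
            saves.append(batch)
            batches_since_last_save = 0
    return saves

# save_freq=4 with steps_per_execution=4 (hook fires at 3, 7, 11):
print(save_points([3, 7, 11], 4))                        # [3, 7, 11]
# With the buggy initialization of 0, the first save is missed:
print(save_points([3, 7, 11], 4, initial_last_seen=0))   # [7, 11]
```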

Member

@fchollet fchollet left a comment


LGTM, thank you!

PR Queue automation moved this from Assigned Reviewer to Approved by Reviewer Mar 14, 2023
@google-ml-butler google-ml-butler bot added kokoro:force-run ready to pull Ready to be merged into the codebase labels Mar 14, 2023
copybara-service bot pushed a commit that referenced this pull request Mar 15, 2023
…teps_per_execution>1

Imported from GitHub PR #17632

Copybara import of the project:

--
8b9f81d by Maël A <86840696+jasnyj@users.noreply.github.com>:

Fix ModelCheckpoint trained-on batch counting


--
b342b3a by Maël A <86840696+jasnyj@users.noreply.github.com>:

Test ModelCheckpoint with steps_per_execution

Merging this change closes #17632

FUTURE_COPYBARA_INTEGRATE_REVIEW=#17632 from jasnyj:master b342b3a
PiperOrigin-RevId: 516873881
@gbaned
Collaborator

gbaned commented Mar 21, 2023

Here are the internal errors, @jasnyj can you please verify ? Thank you!

Traceback (most recent call last):
  File "/py/absl/testing/parameterized.py", line 320, in bound_param_test
    return test_method(self, *testcase_params)
  File "/keras/testing_infra/test_combinations.py", line 291, in decorated
    _test_sequential_model_type(f, self, *args, **kwargs)
  File "/keras/testing_infra/test_combinations.py", line 312, in _test_sequential_model_type
    f(test_or_class, *args, **kwargs)
  File "/keras/callbacks_test.py", line 1762, in test_fit_with_ModelCheckpoint_with_steps_per_execution
    self._run_fit_with_ModelCheckpoint_with_steps_per_execution(
  File "/keras/callbacks_test.py", line 1712, in _run_fit_with_ModelCheckpoint_with_steps_per_execution
    model.compile(
  File "/tensorflow/python/trackable/base.py", line 205, in _method_wrapper
    result = method(self, *args, **kwargs)
  File "/py/keras/engine/training_v1.py", line 308, in compile
    raise TypeError(
TypeError: Invalid keyword argument(s) in compile: {'steps_per_execution'}

@google-ml-butler google-ml-butler bot removed the ready to pull Ready to be merged into the codebase label Mar 22, 2023
The steps_per_execution argument to model.compile(...) is only available
on Keras>=2.4.0. Unit tests which are using this argument are therefore
causing errors in v1 mode and should not be run in this mode.
@jasnyj
Contributor Author

jasnyj commented Mar 22, 2023

@gbaned Indeed. The steps_per_execution argument to model.compile(...) only exists in Keras>=2.4.0. Given the stack trace, it seems the test was being executed in v1 mode, while it shouldn't be, as the functionality does not exist in v1. I therefore made the test run only in v2 mode with a @test_utils.run_v2_only annotation.
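The shape of that fix can be sketched with a minimal stand-in (run_v2_only re-implemented here as a plain unittest skip for illustration; the real decorator lives in keras.testing_infra.test_utils and checks TF2 behavior):

```python
import unittest

# Stand-in flag for "TF2 behavior enabled"; the real decorator queries
# TensorFlow rather than a module constant.
V2_ENABLED = True

def run_v2_only(test_func):
    """Skip the decorated test unless v2 behavior is enabled."""
    return unittest.skipUnless(V2_ENABLED, "requires TF2 behavior")(test_func)

class CallbacksTest(unittest.TestCase):
    @run_v2_only
    def test_fit_with_ModelCheckpoint_with_steps_per_execution(self):
        # The real test compiles a model with steps_per_execution,
        # fits it, and asserts the checkpoint files exist.
        self.assertTrue(True)
```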

I was, however, unable to reproduce the error locally (and therefore to test whether it went away): bazel test keras:callbacks_test doesn't seem to run tests in v1 mode by default in my environment. How can I run tests in v1 mode in the future to catch these errors locally?

(If you are wondering why I force pushed, it's just that I made the last commit with a wrong email address configured and I had to amend the commit to change it.)

@gbaned
Collaborator

gbaned commented Apr 6, 2023

Hi @fchollet Can you please assist on above comments from @jasnyj. Thank you!

@jasnyj jasnyj requested a review from fchollet April 19, 2023 20:03
@google-ml-butler google-ml-butler bot added the keras-team-review-pending Pending review by a Keras team member. label Apr 19, 2023
@mihirparadkar mihirparadkar removed the keras-team-review-pending Pending review by a Keras team member. label Apr 26, 2023
@gbaned gbaned added keras-team-review-pending Pending review by a Keras team member. stat:awaiting keras-eng Awaiting response from Keras engineer and removed keras-team-review-pending Pending review by a Keras team member. labels May 4, 2023
Member

@fchollet fchollet left a comment


LGTM, thanks

@google-ml-butler google-ml-butler bot added kokoro:force-run ready to pull Ready to be merged into the codebase labels Aug 16, 2023
copybara-service bot pushed a commit that referenced this pull request Aug 17, 2023
…teps_per_execution>1

Imported from GitHub PR #17632

Copybara import of the project:

--
8b9f81d by Maël A <86840696+jasnyj@users.noreply.github.com>:

Fix ModelCheckpoint trained-on batch counting


--
b342b3a by Maël A <86840696+jasnyj@users.noreply.github.com>:

Test ModelCheckpoint with steps_per_execution

--
d290db4 by Maël A <86840696+jasnyj@users.noreply.github.com>:

Do not run steps_per_execution tests in v1 mode


Merging this change closes #17632

FUTURE_COPYBARA_INTEGRATE_REVIEW=#17632 from jasnyj:master d290db4
PiperOrigin-RevId: 557831739
@copybara-service copybara-service bot merged commit b4d2fd9 into keras-team:master Aug 17, 2023
7 checks passed
PR Queue automation moved this from Approved by Reviewer to Merged Aug 17, 2023