
Fix ModelCheckpoint trained-on batch counting when using steps_per_execution>1 #17632

Merged
merged 3 commits into from Aug 17, 2023

Conversation

jasnyj
Contributor

@jasnyj jasnyj commented Mar 4, 2023

When setting the steps_per_execution argument to a value N>1 when calling model.compile(), the on_train_batch_end() method of model.fit()'s callbacks only gets called every N batches with an argument batch equal to the 0-indexed index of the last batch which has been trained on. That is, after the first N trained-on batches, on_train_batch_end() gets called with its batch argument equal to N-1, then N trained-on batches later, to 2N-1, etc. until the end of the epoch.
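This invocation pattern can be sketched with a plain-Python stand-in (a hypothetical helper for illustration, not the actual Keras internals):

```python
# Sketch of when model.fit() invokes on_train_batch_end() with
# steps_per_execution=N: the hook fires once every N batches, and its
# `batch` argument is the 0-indexed position of the last batch executed.
def batch_end_calls(num_batches, steps_per_execution):
    """Return the sequence of `batch` arguments passed to on_train_batch_end()."""
    return list(range(steps_per_execution - 1, num_batches, steps_per_execution))

# With N=4 over 12 batches, the hook sees batch indices 3, 7 and 11.
print(batch_end_calls(12, 4))  # [3, 7, 11]
# With N=1 it fires on every batch, as usual.
print(batch_end_calls(5, 1))   # [0, 1, 2, 3, 4]
```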

In order to handle this situation, ModelCheckpoint uses a _last_batch_seen member integer variable to record the value of the batch argument of its on_train_batch_end() method the last time this method was called. When on_train_batch_end() is called again, ModelCheckpoint then computes (in its _should_save_on_batch() method) add_batches = batch - self._last_batch_seen in order to know the number of batches which have been trained on between two consecutive calls to its on_train_batch_end() method.

However, the _last_batch_seen member variable is initialized to 0, which means that, when using steps_per_execution=N, the first time on_train_batch_end() is called after N batches have been trained on (with a batch argument equal to N-1), only N-1 batches are counted, since add_batches = batch - self._last_batch_seen = (N-1) - 0 = N-1 instead of N. This makes ModelCheckpoint miss one batch in its count and effectively offsets its save_freq constructor argument by 1. Therefore an initialization value of -1 is needed.

In the special cases of steps_per_execution=1 or steps_per_execution=None (which are equivalent), the bug was hidden by the fact that the condition to check for a new epoch (batch <= self._last_batch_seen) was true, since on the first call to on_train_batch_end() both the batch argument and the _last_batch_seen variable were equal to 0. In this case, the number of batches trained on is counted by computing add_batches = batch + 1 = 1, which is indeed the correct result.
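The off-by-one can be reproduced with a stand-alone replay of the counting logic described above (simplified for illustration; not the actual Keras source):

```python
# Replay of the add_batches arithmetic from _should_save_on_batch(),
# reduced to the counting itself.
def count_batches(batch_ends, initial_last_seen):
    """Total batches counted over a sequence of on_train_batch_end() `batch` args."""
    last_batch_seen = initial_last_seen
    batches_so_far = 0
    for batch in batch_ends:
        if batch <= last_batch_seen:              # new-epoch condition
            add_batches = batch + 1
        else:
            add_batches = batch - last_batch_seen
        batches_so_far += add_batches
        last_batch_seen = batch
    return batches_so_far

# steps_per_execution=4 over 12 batches: the hook fires at 3, 7, 11.
print(count_batches([3, 7, 11], initial_last_seen=0))   # 11 (one batch lost)
print(count_batches([3, 7, 11], initial_last_seen=-1))  # 12 (correct)
# steps_per_execution=1: both initializations give the right answer,
# which is why the bug stayed hidden in that case.
print(count_batches([0, 1, 2], initial_last_seen=-1))   # 3
```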

@google-cla

google-cla bot commented Mar 4, 2023

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@fchollet
Member

fchollet commented Mar 5, 2023

Thanks for the PR! Please add a unit test for the change.

@jasnyj
Contributor Author

jasnyj commented Mar 6, 2023

I'm new to writing unit tests for Keras. Should I create a new method named something like test_fit_with_ModelCheckpoint_with_steps_per_execution in the KerasCallbacksTest class of the keras/callbacks_test.py file? I guess I should create a test which fails with the current Keras nightly (demonstrating the bug) but passes with my change, using asserts to check that the checkpoint was correctly created.

@gbaned gbaned added this to Assigned Reviewer in PR Queue via automation Mar 7, 2023
@gbaned gbaned requested a review from fchollet March 7, 2023 18:42
@google-ml-butler google-ml-butler bot added the keras-team-review-pending Pending review by a Keras team member. label Mar 7, 2023
@fchollet
Member

fchollet commented Mar 7, 2023

Yes, exactly. You can use a tmp folder for saving the checkpoints (see how it's done in other tests).

@qlzh727 qlzh727 added stat:awaiting response from contributor and removed keras-team-review-pending Pending review by a Keras team member. labels Mar 8, 2023
@jasnyj
Contributor Author

jasnyj commented Mar 11, 2023

I have just added unit tests which correctly fail without the fix and pass with it.
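For intuition about what such a test asserts, here is a stand-alone replay (a hypothetical helper, with save_freq taken as an integer batch count; the real test builds a model, fits it, and checks the checkpoint files in a tmp folder):

```python
# Replay of the _should_save_on_batch() decision: given the sequence of
# `batch` args seen by on_train_batch_end(), return the batches at which
# an integer save_freq would trigger a checkpoint save.
def save_points(batch_ends, save_freq, initial_last_seen=-1):
    last_batch_seen = initial_last_seen
    batches_since_last_save = 0
    saves = []
    for batch in batch_ends:
        if batch <= last_batch_seen:              # new-epoch condition
            add_batches = batch + 1
        else:
            add_batches = batch - last_batch_seen
        batches_since_last_save += add_batches
        last_batch_seen = batch
        if batches_since_last_save >= save_freq:
            saves.append(batch)
            batches_since_last_save = 0
    return saves

# save_freq=4 with steps_per_execution=4 (hook fires at 3, 7, 11):
print(save_points([3, 7, 11], 4))                        # [3, 7, 11]
# With the buggy initialization of 0, the first save is missed:
print(save_points([3, 7, 11], 4, initial_last_seen=0))   # [7, 11]
```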

Member

@fchollet fchollet left a comment


LGTM, thank you!

PR Queue automation moved this from Assigned Reviewer to Approved by Reviewer Mar 14, 2023
@google-ml-butler google-ml-butler bot added kokoro:force-run ready to pull Ready to be merged into the codebase labels Mar 14, 2023
copybara-service bot pushed a commit that referenced this pull request Mar 15, 2023
…teps_per_execution>1

Imported from GitHub PR #17632

Copybara import of the project:

--
8b9f81d by Maël A <86840696+jasnyj@users.noreply.github.com>:

Fix ModelCheckpoint trained-on batch counting


--
b342b3a by Maël A <86840696+jasnyj@users.noreply.github.com>:

Test ModelCheckpoint with steps_per_execution

Merging this change closes #17632

FUTURE_COPYBARA_INTEGRATE_REVIEW=#17632 from jasnyj:master b342b3a
PiperOrigin-RevId: 516873881
@gbaned
Collaborator

gbaned commented Mar 21, 2023

Here are the internal errors, @jasnyj can you please verify ? Thank you!

Traceback (most recent call last):
  File "/py/absl/testing/parameterized.py", line 320, in bound_param_test
    return test_method(self, *testcase_params)
  File "/keras/testing_infra/test_combinations.py", line 291, in decorated
    _test_sequential_model_type(f, self, *args, **kwargs)
  File "/keras/testing_infra/test_combinations.py", line 312, in _test_sequential_model_type
    f(test_or_class, *args, **kwargs)
  File "/keras/callbacks_test.py", line 1762, in test_fit_with_ModelCheckpoint_with_steps_per_execution
    self._run_fit_with_ModelCheckpoint_with_steps_per_execution(
  File "/keras/callbacks_test.py", line 1712, in _run_fit_with_ModelCheckpoint_with_steps_per_execution
    model.compile(
  File "/tensorflow/python/trackable/base.py", line 205, in _method_wrapper
    result = method(self, *args, **kwargs)
  File "/py/keras/engine/training_v1.py", line 308, in compile
    raise TypeError(
TypeError: Invalid keyword argument(s) in compile: {'steps_per_execution'}

@google-ml-butler google-ml-butler bot removed the ready to pull Ready to be merged into the codebase label Mar 22, 2023
The steps_per_execution argument to model.compile(...) is only available
on Keras>=2.4.0. Unit tests which are using this argument are therefore
causing errors in v1 mode and should not be run in this mode.
@jasnyj
Contributor Author

jasnyj commented Mar 22, 2023

@gbaned Indeed. The steps_per_execution argument to model.compile(...) only exists in Keras>=2.4.0. Given the stack trace, it seems the test was being executed in v1 mode, while it shouldn't be, as the functionality does not exist in v1. I therefore made the test run only in v2 mode with a @test_utils.run_v2_only annotation.
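The shape of that fix can be sketched with a minimal stand-in (run_v2_only re-implemented here as a plain unittest skip for illustration; the real decorator lives in keras.testing_infra.test_utils and checks TF2 behavior):

```python
import unittest

# Stand-in flag for "TF2 behavior enabled"; the real decorator queries
# TensorFlow rather than a module constant.
V2_ENABLED = True

def run_v2_only(test_func):
    """Skip the decorated test unless v2 behavior is enabled."""
    return unittest.skipUnless(V2_ENABLED, "requires TF2 behavior")(test_func)

class CallbacksTest(unittest.TestCase):
    @run_v2_only
    def test_fit_with_ModelCheckpoint_with_steps_per_execution(self):
        # The real test compiles a model with steps_per_execution,
        # fits it, and asserts the checkpoint files exist.
        self.assertTrue(True)
```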

I was, however, unable to reproduce the error locally (and therefore to test whether it went away): bazel test keras:callbacks_test doesn't seem to run tests in v1 mode by default in my environment. How can I run tests in v1 mode in the future to catch these errors locally?

(If you are wondering why I force pushed, it's just that I made the last commit with a wrong email address configured and I had to amend the commit to change it.)

@gbaned
Collaborator

gbaned commented Apr 6, 2023

Hi @fchollet Can you please assist on above comments from @jasnyj. Thank you!

@jasnyj jasnyj requested a review from fchollet April 19, 2023 20:03
@google-ml-butler google-ml-butler bot added the keras-team-review-pending Pending review by a Keras team member. label Apr 19, 2023
@mihirparadkar mihirparadkar removed the keras-team-review-pending Pending review by a Keras team member. label Apr 26, 2023
@gbaned gbaned added keras-team-review-pending Pending review by a Keras team member. stat:awaiting keras-eng Awaiting response from Keras engineer and removed keras-team-review-pending Pending review by a Keras team member. labels May 4, 2023
Member

@fchollet fchollet left a comment


LGTM, thanks

@google-ml-butler google-ml-butler bot added kokoro:force-run ready to pull Ready to be merged into the codebase labels Aug 16, 2023
copybara-service bot pushed a commit that referenced this pull request Aug 17, 2023
…teps_per_execution>1

Imported from GitHub PR #17632

Copybara import of the project:

--
8b9f81d by Maël A <86840696+jasnyj@users.noreply.github.com>:

Fix ModelCheckpoint trained-on batch counting


--
b342b3a by Maël A <86840696+jasnyj@users.noreply.github.com>:

Test ModelCheckpoint with steps_per_execution

--
d290db4 by Maël A <86840696+jasnyj@users.noreply.github.com>:

Do not run steps_per_execution tests in v1 mode


Merging this change closes #17632

FUTURE_COPYBARA_INTEGRATE_REVIEW=#17632 from jasnyj:master d290db4
PiperOrigin-RevId: 557831739
@copybara-service copybara-service bot merged commit b4d2fd9 into keras-team:master Aug 17, 2023
7 checks passed
PR Queue automation moved this from Approved by Reviewer to Merged Aug 17, 2023