Fixed Elastic tf.keras and added additional integration tests #2289
Signed-off-by: Travis Addair <taddair@uber.com>
e2c84ea to 712b9f9
-            return averaged_gradients
-        else:
-            return gradients
+        return self._allreduce_grads(grads)
@romerojosh this may be relevant to you, since you recently worked on this part of the code.
Excellent! Glad to see this duplicate code being removed.
 def on_epoch_end(self, epoch, logs=None):
-    self.state.epoch = epoch
+    self.state.epoch = self.initial_epoch + epoch + 1
Is `epoch` the epoch that ended? Why `self.state.epoch = self.initial_epoch + epoch + 1`?
Yes, `epoch` is the epoch that just ended. The idea is that at the end of the epoch, we want to `commit` to save our progress. We could avoid the `+ 1` here if we called this in `on_epoch_begin`, but if we commit before incrementing the epoch number (but after resetting the batch number), then we would end up repeating the last epoch. This way we ensure that when we commit, the state is fully up to date in case an error occurs.
The `initial_epoch` term is there because the `epoch` provided by the callback is a relative epoch (we cannot offset it by an initial epoch). So if we complete some number of epochs and then reset, we do not want to lose our progress.
I will add comments to explain this in the code.
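For readers following the thread, here is a minimal, framework-free sketch of the bookkeeping described above. It is not the code from this PR: `SketchState`, `SketchEpochCallback`, and `run` are hypothetical names invented for illustration, and the commit is folded into `on_epoch_end` for brevity, which may not match how the real callbacks are split. It only shows why `initial_epoch + epoch + 1` lets a restarted run resume at the next unfinished epoch.

```python
class SketchState:
    """Hypothetical stand-in for an elastic training state (not Horovod's class)."""

    def __init__(self):
        self.epoch = 0          # absolute number of epochs completed so far
        self.batch = 0
        self._committed = (0, 0)

    def commit(self):
        # Save progress so that a later failure can roll back to this point.
        self._committed = (self.epoch, self.batch)

    def restore(self):
        self.epoch, self.batch = self._committed


class SketchEpochCallback:
    """Hypothetical callback mirroring the update discussed in the review."""

    def __init__(self, state):
        self.state = state
        self.initial_epoch = 0

    def on_train_begin(self):
        # Remember the absolute epoch at which this (re)started run begins.
        self.initial_epoch = self.state.epoch

    def on_epoch_end(self, epoch):
        # `epoch` is the relative index of the epoch that just ended, so the
        # next epoch to run is initial_epoch + epoch + 1.
        self.state.epoch = self.initial_epoch + epoch + 1
        self.state.batch = 0
        self.state.commit()  # assumption: commit happens right after the update


def run(state, total_epochs, fail_at=None):
    """Train until total_epochs; optionally simulate a failure at a relative epoch."""
    callback = SketchEpochCallback(state)
    callback.on_train_begin()
    trained = []
    for relative_epoch in range(total_epochs - state.epoch):
        if fail_at is not None and relative_epoch == fail_at:
            state.restore()  # roll back to the last committed state
            return trained, False
        trained.append(callback.initial_epoch + relative_epoch)
        callback.on_epoch_end(relative_epoch)
    return trained, True


state = SketchState()
first, finished_first = run(state, total_epochs=5, fail_at=2)
second, finished_second = run(state, total_epochs=5)
print(first, second, state.epoch)
```

In this sketch the first run fails while working on its third epoch, and the second run picks up from the committed epoch counter rather than starting over or repeating a finished epoch.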
LGTM 👍
Fixes #2285.