Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed Elastic tf.keras and added additional integration tests #2289

Merged
merged 9 commits into from Sep 21, 2020

Conversation

tgaddair
Copy link
Collaborator

Fixes #2285.

Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
return averaged_gradients
else:
return gradients
return self._allreduce_grads(grads)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@romerojosh this part may be relevant to you since you recently did some work on this part of the code.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent! Glad to see this duplicate code being removed.

@github-actions
Copy link

Unit Test Results

   427 files  +  12     427 suites  +12   4h 29m 6s ⏱️ + 54m 16s
   491 tests +    7     465 ✔️ +    4       25 💤 +  2  1 ✖️ +1 
8 713 runs  +152  7 489 ✔️ +133  1 223 💤 +18  1 ✖️ +1 

results for commit 712b9f9 ± comparison against base commit 04ec434


def on_epoch_end(self, epoch, logs=None):
self.state.epoch = epoch
self.state.epoch = self.initial_epoch + epoch + 1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is epoch the epoch that ended? why self.state.epoch = self.initial_epoch + epoch + 1

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, epoch is the epoch that just ended. The idea is that at the end of the epoch, we want to commit to save our progress. We could avoid having to do the + 1 here if we called this on_epoch_begin, but if we commit before incrementing the epoch number (but after resetting the batch number), then we would end up repeating the last epoch. So this way we ensure that when we commit, the state is fully up to date in case an error occurs.

The initial_epoch is because the epoch provided by the callback is a relative epoch (we cannot offset it at an initial epoch). So if we complete some number of epochs and then reset, we do not want to lose our progress.

I will add comments to explain this in the code.

Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
@github-actions
Copy link

Unit Test Results

0 files  -   415  0 suites  -415   0s ⏱️ - 3h 34m 50s
0 tests -   484  0 ✔️ -   461  0 💤 -     23  0 ✖️ ±0 
0 runs  -8 561  0 ✔️ -7 356  0 💤 -1 205  0 ✖️ ±0 

results for commit 7121295 ± comparison against base commit 04ec434

Signed-off-by: Travis Addair <taddair@uber.com>
@github-actions
Copy link

Unit Test Results

0 files  -   415  0 suites  -415   0s ⏱️ - 3h 34m 50s
0 tests -   484  0 ✔️ -   461  0 💤 -     23  0 ✖️ ±0 
0 runs  -8 561  0 ✔️ -7 356  0 💤 -1 205  0 ✖️ ±0 

results for commit 06ab66b ± comparison against base commit 04ec434

@github-actions
Copy link

Unit Test Results

   472 files  +  57     472 suites  +57   4h 29m 50s ⏱️ + 55m 0s
   491 tests +    7     466 ✔️ +    5       25 💤 +    2  0 ✖️ ±0 
9 411 runs  +850  8 104 ✔️ +748  1 307 💤 +102  0 ✖️ ±0 

results for commit 323f0cc ± comparison against base commit 04ec434

Copy link
Collaborator

@romerojosh romerojosh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@tgaddair tgaddair merged commit 0f37436 into master Sep 21, 2020
@tgaddair tgaddair deleted the elastic_tf_keras branch September 21, 2020 23:12
@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

tf-keras example is not working well when scale worker up in elastic mode
3 participants