
Keras estimator: set reader's epoch = 1 to avoid sample duplication and drop-out in a single epoch #2896

Merged
merged 6 commits into horovod:master on May 17, 2021

Conversation

@chongxiaoc (Collaborator) commented May 5, 2021

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

We set infinite epochs for the petastorm reader in Horovod due to the known issue that data partitions are not guaranteed to be of equal size. TF's dataset shuffle function automatically refills the shuffle buffer after retrieving a single element.
Code in horovod:

            if shuffle:
                dataset = dataset.shuffle(shuffle_buffer_size)

Mixing infinite iteration from the reader with TF's shuffle function introduces sample duplication and drop-out within a single epoch. For example, suppose the training dataset is [0,1,2,3,4,5,6,7] and the shuffle buffer size is 8.
The shuffle buffer is initialized as [0,1,2,3,4,5,6,7].
It randomly picks 7 and refills the freed slot with 0 from the next epoch: [0,1,2,3,4,5,6,0].
Now 0 is very likely to be selected twice in the next few iterations, even before other values are selected once.
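
The effect can be reproduced with plain tf.data. A minimal sketch, assuming TensorFlow 2.x in eager mode; the toy dataset and buffer size are illustrative, not Horovod's actual reader:

    import tensorflow as tf

    data = tf.data.Dataset.range(8)

    # Infinite reader epochs behave like repeat() placed before shuffle(): the
    # shuffle buffer is refilled with elements of the *next* epoch, so a value
    # can be drawn twice before every value of the current epoch is seen once.
    cross_epoch = data.repeat().shuffle(8).take(8)
    print(sorted(int(x) for x in cross_epoch))   # often contains duplicates / missing values

    # Shuffling a single finite epoch and repeating afterwards keeps each
    # epoch a clean permutation.
    per_epoch = data.shuffle(8).repeat().take(8)
    print(sorted(int(x) for x in per_epoch))     # always [0, 1, 2, 3, 4, 5, 6, 7]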

This PR fixes the above issue by setting the reader's epochs to 1 while looping infinitely inside the data loader via tf.data.Dataset's repeat().
We also introduce cache() support from tf.data.Dataset, since it speeds up throughput.
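
Roughly, the fixed pipeline looks like the sketch below. This is hand-written to illustrate the idea rather than the exact Horovod code; make_batch_reader and make_petastorm_dataset are real petastorm APIs, but the placeholder variables, the inmemory_cache_all flag, and the exact ordering of cache()/shuffle()/repeat() are assumptions:

    from petastorm import make_batch_reader
    from petastorm.tf_utils import make_petastorm_dataset

    # Illustrative placeholders, not Horovod's real configuration.
    dataset_url = 'file:///tmp/train_data'
    shuffle, shuffle_buffer_size = True, 10000
    inmemory_cache_all = True

    # The reader now yields exactly one epoch; repetition moves into tf.data.
    with make_batch_reader(dataset_url, num_epochs=1) as reader:
        dataset = make_petastorm_dataset(reader)
        if inmemory_cache_all:
            # cache() keeps decoded rows in memory, so later epochs skip I/O.
            dataset = dataset.cache()
        if shuffle:
            dataset = dataset.shuffle(shuffle_buffer_size)
        # repeat() loops over the single cached epoch indefinitely; shuffle()
        # reshuffles on every pass, and no cross-epoch mixing can occur.
        dataset = dataset.repeat()
        # model.fit(dataset, steps_per_epoch=..., epochs=...) then consumes it.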

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.

@github-actions (bot) commented May 5, 2021

Unit Test Results

764 files ±0    764 suites ±0    5h 48m 33s ⏱️ ±0s
592 tests ±0    556 ✔️ ±0    35 💤 ±0    1 ❌ ±0
15 780 runs ±0    11 912 ✔️ ±0    3 867 💤 ±0    1 ❌ ±0

For more details on these failures, see this check.

Results for commit 41af508. ± Comparison against base commit 41af508.

♻️ This comment has been updated with latest results.

@chongxiaoc (Collaborator, Author)

This PR should be tested again after petastorm's tf dataset supports repeat().
Otherwise it only works for the inmemory_cache=True case.
Pending petastorm PR: uber/petastorm#677

@chongxiaoc (Collaborator, Author)

Rebased and changed the PyTorch in-memory caching usage per petastorm v0.11.0rc6.

@tgaddair (Collaborator) left a comment


LGTM!

Using epochs > 1 for the reader introduces duplicated samples in every
epoch, since tf.data.Dataset.shuffle() automatically refills the
shuffling buffer with the next available element (from the next epoch).
Set the reader's epochs to 1, and use tf.data.Dataset's cache() and
repeat() to loop indefinitely instead.

Also change examples/spark/keras/keras_spark_mnist.py to use the
inmemory-cache option, since we can load all MNIST data in memory.

Signed-off-by: Chongxiao Cao <chongxiaoc@uber.com>
Signed-off-by: Chongxiao Cao <chongxiaoc@uber.com>
Signed-off-by: Chongxiao Cao <chongxiaoc@uber.com>
Petastorm v0.11.0rc6 refactored in-memory caching usage for pytorch
dataloader.

Signed-off-by: Chongxiao Cao <chongxiaoc@uber.com>
Signed-off-by: Chongxiao Cao <chongxiaoc@uber.com>
@chongxiaoc (Collaborator, Author)

Also fixes #2909

@@ -112,7 +112,7 @@ def build_extensions(self):
 pyspark_require_list = ['pyspark>=2.3.2;python_version<"3.8"',
                         'pyspark>=3.0.0;python_version>="3.8"']
 # Pin h5py: https://github.com/h5py/h5py/issues/1732
-spark_require_list = ['h5py<3', 'numpy', 'petastorm>=0.9.8,<0.11.0', 'pyarrow>=0.15.0']
+spark_require_list = ['h5py<3', 'numpy', 'petastorm>=0.11.0', 'pyarrow>=0.15.0']
Collaborator

@tgaddair do you think requiring latest petastorm for Horovod on Spark is too much of a constraint? Should we support petastorm <0.11 for a transition period?

If we go forward with this, that "limitation" must be recorded in the CHANGELOG and release notes.

@chongxiaoc (Collaborator, Author)

The only difference in petastorm v0.11.0 is the refactored PyTorch in-memory dataloader, which I believe is not used frequently. Yep, I agree we should add it to the CHANGELOG and release notes.

@chongxiaoc (Collaborator, Author)

Updated the CHANGELOG to note the PyTorch in-memory dataloader change.

Collaborator

I think this is fine. Petastorm is pretty lightweight with few overlapping dependencies, so it shouldn't be an issue. Unfortunately, it's frequently the case that Horovod needs to be updated to the latest Petastorm to take advantage of new features.

Signed-off-by: Chongxiao Cao <chongxiaoc@uber.com>
@tgaddair tgaddair merged commit 41af508 into horovod:master May 17, 2021