Keras estimator: set reader's epoch = 1 to avoid sample duplication and drop-out in a single epoch #2896
Conversation
Unit Test Results: 764 files ±0, 764 suites ±0, 5h 48m 33s ⏱️ ±0s. For more details on these failures, see this check. Results for commit 41af508. ± Comparison against base commit 41af508. ♻️ This comment has been updated with latest results.
This PR should be tested again after petastorm's tf dataset supports repeat().
Force-pushed from d4f97b5 to 2ce2191.
Rebased, and changed the pytorch in-memory caching usage from petastorm.
LGTM!
Using epochs > 1 for the reader introduces duplicated samples in every epoch, since tf.data.Dataset.shuffle() automatically refills the shuffle buffer with the next available element (from the next epoch). Set the reader's epochs to 1, and use tf.data.Dataset's cache() and repeat() to loop indefinitely instead. Also change examples/spark/keras/keras_spark_mnist.py to use the in-memory cache option, since all MNIST data fits in memory. Signed-off-by: Chongxiao Cao <chongxiaoc@uber.com>
Petastorm v0.11.0rc6 refactored in-memory caching usage for pytorch dataloader. Signed-off-by: Chongxiao Cao <chongxiaoc@uber.com>
Force-pushed from 2ce2191 to 8562db9.
Also fixes #2909
@@ -112,7 +112,7 @@ def build_extensions(self):
 pyspark_require_list = ['pyspark>=2.3.2;python_version<"3.8"',
                         'pyspark>=3.0.0;python_version>="3.8"']
 # Pin h5py: https://github.com/h5py/h5py/issues/1732
-spark_require_list = ['h5py<3', 'numpy', 'petastorm>=0.9.8,<0.11.0', 'pyarrow>=0.15.0']
+spark_require_list = ['h5py<3', 'numpy', 'petastorm>=0.11.0', 'pyarrow>=0.15.0']
@tgaddair do you think requiring latest petastorm for Horovod on Spark is too much of a constraint? Should we support petastorm <0.11 for a transition period?
If we go forward with this, that "limitation" must be recorded in the CHANGELOG and release notes.
The only difference in petastorm v0.11.0 is the refactored pytorch in-memory dataloader, which I believe is not used frequently. Yep, I agree to add CHANGELOG and release notes.
Updated CHANGELOG to include the pytorch in-memory dataloader change.
I think this is fine. Petastorm is pretty lightweight with few overlapping dependencies, so shouldn't be an issue. Unfortunately, it's frequently the case that Horovod needs to be updated to the latest Petastorm to take advantage of new features.
Checklist before submitting
Description
We set infinite epochs for the petastorm reader in Horovod, due to the known issue that data partitions are not guaranteed to be of equal size. TF's dataset shuffle function automatically refills the shuffle buffer after retrieving each element.
Code in horovod:
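(The original snippet is not reproduced here; below is a minimal sketch of the assumed pattern, where `dataset_url`, `shuffle_buffer_size`, and `batch_size` are placeholder names.)

```python
# Sketch of the previous behavior (assumed, simplified): the petastorm
# reader iterates forever, and tf.data shuffles on top of that stream.
from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset

# num_epochs=None tells petastorm to loop over the data indefinitely.
reader = make_batch_reader(dataset_url, num_epochs=None)
dataset = make_petastorm_dataset(reader) \
    .shuffle(shuffle_buffer_size) \
    .batch(batch_size)
```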
Mixing infinite iterations from the reader with TF's shuffle function introduces sample duplication and drop-out within a single epoch. For example, suppose the training dataset is [0,1,2,3,4,5,6,7] and the shuffle buffer size is 8. The shuffle buffer is initialized as [0,1,2,3,4,5,6,7]. Shuffle randomly picks 7, then refills the buffer with 0 from the next epoch: [0,1,2,3,4,5,6,0]. Now 0 is likely to be selected twice in the next few iterations, possibly before some other values are selected even once.
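This effect is easy to reproduce with plain tf.data (illustrative only, not Horovod code):

```python
# Shuffling an infinitely repeated dataset mixes elements across epoch
# boundaries, so duplicates appear within the first "epoch" of output.
import tensorflow as tf

data = tf.data.Dataset.range(8).repeat()   # infinite stream, like reader epochs=None
shuffled = data.shuffle(buffer_size=8)

first_epoch = [int(x) for x in shuffled.take(8)]
print(first_epoch)  # typically something like [3, 0, 6, 0, 2, 5, 1, 4]:
                    # 0 appears twice and 7 was dropped
```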
This PR fixes the above issue by setting the reader's epochs to 1, while looping indefinitely inside the data loader.
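A minimal sketch of the fixed pattern (assumed, simplified; same placeholder names as above):

```python
# Read the underlying data exactly once, cache it, and let tf.data do the
# looping, so shuffle() only ever sees one copy of each sample per epoch.
from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset

reader = make_batch_reader(dataset_url, num_epochs=1)  # single pass over the data
dataset = make_petastorm_dataset(reader) \
    .cache() \
    .shuffle(shuffle_buffer_size) \
    .batch(batch_size) \
    .repeat()  # loop indefinitely in tf.data instead of in the reader
```

Because shuffle() now sits between cache() and repeat(), each repetition reshuffles a complete, finite copy of the data, and epochs no longer bleed into each other.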
Also, we introduced cache() support for the tf.data.Dataset pipeline, since it speeds up throughput.
Review process to land