Keras estimator: set reader's epoch = 1 to avoid sample duplication and drop-out in a single epoch #2896
Conversation
Unit Test Results: 764 files ±0, 764 suites ±0, 5h 48m 33s ⏱️ ±0s. For more details on these failures, see this check. Results for commit 41af508. ± Comparison against base commit 41af508. ♻️ This comment has been updated with latest results.
This PR should be tested again after petastorm's tf dataset supports repeat().
Force-pushed from d4f97b5 to 2ce2191.
Rebased, and changed the pytorch in-memory caching usage from petastorm.
LGTM!
Using epochs > 1 for the reader introduces duplicated samples in every epoch, since tf.data.Dataset.shuffle() automatically refills the shuffle buffer with the next available element (from the next epoch). Set the reader's epochs to 1, and use tf.data.Dataset's cache() and repeat() to loop indefinitely instead. Also change examples/spark/keras/keras_spark_mnist.py to use the in-memory cache option, since all MNIST data fits in memory. Signed-off-by: Chongxiao Cao <chongxiaoc@uber.com>
Petastorm v0.11.0rc6 refactored in-memory caching usage for pytorch dataloader. Signed-off-by: Chongxiao Cao <chongxiaoc@uber.com>
Force-pushed from 2ce2191 to 8562db9.
Also fixes #2909
@@ -112,7 +112,7 @@ def build_extensions(self):
 pyspark_require_list = ['pyspark>=2.3.2;python_version<"3.8"',
                         'pyspark>=3.0.0;python_version>="3.8"']
 # Pin h5py: https://github.com/h5py/h5py/issues/1732
-spark_require_list = ['h5py<3', 'numpy', 'petastorm>=0.9.8,<0.11.0', 'pyarrow>=0.15.0']
+spark_require_list = ['h5py<3', 'numpy', 'petastorm>=0.11.0', 'pyarrow>=0.15.0']
@tgaddair do you think requiring latest petastorm for Horovod on Spark is too much of a constraint? Should we support petastorm <0.11 for a transition period?
If we go forward with this, that "limitation" must be recorded in the CHANGELOG and release notes.
The only difference in petastorm v0.11.0 is the refactored pytorch in-memory dataloader, which I believe is not used frequently. Yep, I agree to add CHANGELOG and release notes.
Updated CHANGELOG to include the pytorch in-memory dataloader change.
I think this is fine. Petastorm is pretty lightweight with few overlapping dependencies, so shouldn't be an issue. Unfortunately, it's frequently the case that Horovod needs to be updated to the latest Petastorm to take advantage of new features.
Checklist before submitting
Description
We set infinite epochs for the petastorm reader in Horovod, due to the known issue that data partitions are not guaranteed to be of equal size. TF's dataset shuffle function automatically refills the shuffle buffer after retrieving each element.
Code in horovod:
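(The original snippet is not reproduced here; below is a minimal sketch of the assumed pattern, where `dataset_url`, `shuffle_buffer_size`, and `batch_size` are placeholder names.)

```python
# Sketch of the previous behavior (assumed, simplified): the petastorm
# reader iterates forever, and tf.data shuffles on top of that stream.
from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset

# num_epochs=None tells petastorm to loop over the data indefinitely.
reader = make_batch_reader(dataset_url, num_epochs=None)
dataset = make_petastorm_dataset(reader) \
    .shuffle(shuffle_buffer_size) \
    .batch(batch_size)
```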
Mixing infinite iterations from the reader with TF's shuffle function introduces sample duplication and drop-out within a single epoch. For example, suppose the training dataset is [0,1,2,3,4,5,6,7] and the shuffle buffer size is 8. The shuffle buffer is initialized as [0,1,2,3,4,5,6,7]. Shuffle randomly picks 7, then refills the buffer with 0 from the next epoch: [0,1,2,3,4,5,6,0]. Now 0 is likely to be selected twice in the next few iterations, possibly before some other values are selected even once.
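This effect is easy to reproduce with plain tf.data (illustrative only, not Horovod code):

```python
# Shuffling an infinitely repeated dataset mixes elements across epoch
# boundaries, so duplicates appear within the first "epoch" of output.
import tensorflow as tf

data = tf.data.Dataset.range(8).repeat()   # infinite stream, like reader epochs=None
shuffled = data.shuffle(buffer_size=8)

first_epoch = [int(x) for x in shuffled.take(8)]
print(first_epoch)  # typically something like [3, 0, 6, 0, 2, 5, 1, 4]:
                    # 0 appears twice and 7 was dropped
```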
This PR fixes the above issue by setting the reader's epochs to 1, while looping indefinitely inside the data loader.
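A minimal sketch of the fixed pattern (assumed, simplified; same placeholder names as above):

```python
# Read the underlying data exactly once, cache it, and let tf.data do the
# looping, so shuffle() only ever sees one copy of each sample per epoch.
from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset

reader = make_batch_reader(dataset_url, num_epochs=1)  # single pass over the data
dataset = make_petastorm_dataset(reader) \
    .cache() \
    .shuffle(shuffle_buffer_size) \
    .batch(batch_size) \
    .repeat()  # loop indefinitely in tf.data instead of in the reader
```

Because shuffle() now sits between cache() and repeat(), each repetition reshuffles a complete, finite copy of the data, and epochs no longer bleed into each other.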
Also, we introduced cache() support for the tf.data.Dataset pipeline, since it speeds up throughput.
Review process to land