
add customized data loader #2923

Merged
irasit merged 5 commits into master from pl_model_checkpoint on May 30, 2021

Conversation

@irasit irasit (Collaborator) commented May 20, 2021

Signed-off-by: Peng Zhang <pengz@uber.com>

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

To fix the deadlock caused by uneven data in the model checkpointing and early stopping callbacks:

  • Make the data loader's num_epoch configurable, defaulting to an infinite loop.
  • Make train_steps_per_epoch configurable, defaulting to row_count / batch_size / hvd.size() (see the sketch after this list).
  • Make the data loader configurable in the estimator, defaulting to the new async data loader.
  • Add a common data loader interface and a mixin class that loads and produces batches asynchronously.
    This mainly decouples the data loader from the trainer; a Ray data loader could reuse it in the future.
  • Enable the model checkpoint callback and early stopping callback test cases.
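
For illustration, a minimal sketch of the train_steps_per_epoch default named above; the helper function and its rounding are assumptions for this description, not the PR's actual implementation:

import horovod.torch as hvd  # assumes hvd.init() has already been called in the training process

def default_train_steps_per_epoch(row_count, batch_size):
    # Each of the hvd.size() workers consumes an equal shard of the rows, so one
    # "epoch" on a single worker is roughly row_count / batch_size / hvd.size() steps.
    return max(1, row_count // batch_size // hvd.size())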

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.

github-actions bot commented May 20, 2021

Unit Test Results

792 files ±0    792 suites ±0    6h 3m 41s ⏱️ ±0s
600 tests ±0    564 ✔️ ±0    35 💤 ±0    1 ❌ ±0
16 473 runs ±0    12 430 ✔️ ±0    4 042 💤 ±0    1 ❌ ±0

For more details on these failures, see this check.

Results for commit 52d0b27. ± Comparison against base commit 52d0b27.

♻️ This comment has been updated with latest results.

@irasit irasit force-pushed the pl_model_checkpoint branch 4 times, most recently from 6f0ba55 to 36350e2 on May 24, 2021 17:08
@chongxiaoc chongxiaoc (Collaborator) left a comment

Overall it looks good to me. I'll leave a few comments open for discussion.

horovod/spark/common/data_loader.py (outdated, resolved)
horovod/spark/common/data_loader.py (outdated, resolved)
horovod/spark/lightning/estimator.py (outdated, resolved)
horovod/spark/lightning/remote.py (outdated, resolved)
horovod/spark/common/data_loader.py (outdated, resolved)
horovod/spark/common/data_loader.py (outdated, resolved)
@chongxiaoc (Collaborator) commented:
Also, I think we should add a section in the README or elsewhere in the docs introducing the base PyTorch data loader class, so users know how to inherit from it and implement a data loader of their own.
For example,

# The exact import path is illustrative; the base class lives in horovod/spark/common/data_loader.py in this PR.
from horovod.spark.common.data_loader import BaseDataLoader

class MyDataLoader(BaseDataLoader):
    # customize batch loading below
    ...

Signed-off-by: Peng Zhang <pengz@uber.com>
Signed-off-by: Peng Zhang <pengz@uber.com>
Signed-off-by: Peng Zhang <pengz@uber.com>
Signed-off-by: Peng Zhang <pengz@uber.com>
self.reader.reset()

# Re-create the data loader for each iterate. There maybe some left over data
# from last epoch which will cause petastorm's BatchedDataLoader fail to reset.
Collaborator comment:

The comment means this dataloader is expected to fail in some corner cases?

Collaborator reply:

Your change guarantees that reset() is called once the last row is consumed, so leaving this comment here is confusing. How about just saying "need to reset reader() once last row is consumed"?

self.reader.reset()

# Re-create the data loader for each iterate. There maybe some left over data
# from last epoch which will cause petastorm's BatchedDataLoader fail to reset.
Collaborator comment:

Same as the discussion above for this comment.

make_petastorm_reader = _make_petastorm_reader_fn(transformation, schema_fields,
                                                  batch_size, calculate_shuffle_buffer_size,
-                                                 dataloader_cls)
+                                                 data_loader_cls, loader_num_epochs)
Collaborator comment:

I did a quick search here; make_petastorm_reader is not used anymore?
This function actually sets the data loader as an attribute of the lightning model.
If I understand correctly, we should rename this function.

@chongxiaoc chongxiaoc (Collaborator) left a comment

Leaving a few comments for the 2nd round.
Thanks for refactoring the code; it looks better.

and add mixin?

@irasit irasit force-pushed the pl_model_checkpoint branch 5 times, most recently from 6d2503c to e3986c3 on May 28, 2021 05:47
@thuningxu (Collaborator) commented:
Like it! Looks good to me but also want @tgaddair to take a look.

@irasit irasit force-pushed the pl_model_checkpoint branch 2 times, most recently from 981a777 to d3164ad on May 28, 2021 08:05
class PytorchAsyncDataLoader(AsyncDataLoaderMixin, PytorchDataLoader):
    """

    def __init__(self, async_loader_queue_size=64, *args, **kwargs):
Collaborator comment:

Question: I remember we discussed this offline but I forgot: why are *args and **kwargs needed in the constructor?

Collaborator reply:

In some cases you may have multiple inheritance, in which case one of the other superclasses could need initialization as well, so I think this is good practice (see https://www.educative.io/edpresso/what-is-mro-in-python).
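
To illustrate that point, here is a minimal sketch of cooperative __init__ chaining under Python's MRO; the class names are made up for the example and are not the PR's:

class AsyncMixin:
    def __init__(self, queue_size=64, **kwargs):
        self.queue_size = queue_size
        super().__init__(**kwargs)  # forward the remaining kwargs to the next class in the MRO

class Loader:
    def __init__(self, batch_size=32, **kwargs):
        self.batch_size = batch_size
        super().__init__(**kwargs)

class AsyncLoader(AsyncMixin, Loader):
    pass

loader = AsyncLoader(queue_size=16, batch_size=8)  # each __init__ picks up its own kwargs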


if self.async_loader_queue_size > 0:
    self.finished_event = Event()
    self.queue = Queue(self.async_loader_queue_size)
Collaborator comment:

Should the option be provided to use a Process instead of a Thread here? Did you try both in your tests? I suppose if most of the work is on I/O it should be okay to use a thread.
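
As a rough sketch of the thread-plus-queue pattern being discussed here (names are illustrative, not the PR's implementation), a background thread fills a bounded queue while the training loop consumes from it; an I/O-bound producer usually works fine in a thread despite the GIL:

from queue import Queue
from threading import Event, Thread

class AsyncBatchProducer:
    def __init__(self, batch_iter, queue_size=64):
        self.queue = Queue(queue_size)
        self.finished_event = Event()
        self.thread = Thread(target=self._produce, args=(batch_iter,), daemon=True)
        self.thread.start()

    def _produce(self, batch_iter):
        for batch in batch_iter:
            if self.finished_event.is_set():
                break
            self.queue.put(batch)
        self.queue.put(None)  # sentinel: no more batches

    def __iter__(self):
        # consumer side: yields batches as the background thread produces them
        while True:
            batch = self.queue.get()
            if batch is None:
                break
            yield batch

# usage: for batch in AsyncBatchProducer(iter(my_batches)): ...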

from threading import Thread, Event


class BaseDataLoader(object):
Collaborator comment:

Is this stuff really Spark specific? Maybe we can move this into a new horovod.data module. What do you think?

docs/spark.rst Outdated
@@ -96,6 +96,8 @@ logging (for Tensorboard) using the Estimator ``Store`` abstraction. Stores are
artifacts including intermediate representations of the training data. Horovod natively supports stores for HDFS
and local filesystems.

A Petastorm-based data loader is used by default, but users can define a custom data loader by overriding the `base_data_loader` interface.
Collaborator comment:

Nit: link Petastorm to the GitHub page.

docs/pytorch.rst Outdated
@@ -139,3 +139,5 @@ Start the training job and specify the number of workers on the command line as
You can find an example of use pytorch lightning trainer with horovod backend in `pytorch_lightning_mnist.py script <../examples/pytorch/pytorch_lightning_mnist.py>`__

See the PyTorch Lightning `docs <https://pytorch-lightning.readthedocs.io/en/stable/multi_gpu.html#horovod>`_ for more details.

A pytorch-lightning based spark estimator trainer is also added; an example is in `pytorch_lightning_spark_mnist.py <../examples/spark/pytorch/pytorch_lightning_spark_mnist.py>`__
Collaborator comment:

Nit: capitalize PyTorch Lightning.

data_loader_class = Param(Params._dummy(), 'data_loader_class',
                          'Name of the dataloader class.')

loader_num_epochs = Param(Params._dummy(), 'loader_num_epochs',
Collaborator comment:

When would the user want to set this? It seems we would always want to set it to None and let the trainer decide when to stop reading.

                          typeConverter=TypeConverters.toInt)
                          typeConverter=TypeConverters.toInt)

data_loader_class = Param(Params._dummy(), 'data_loader_class',
Collaborator comment:

Is this a name (string) or is it a class / function? Seems it would be easier to pass around a class directly.
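
A hypothetical sketch contrasting the two options raised here, resolving a loader from a string name versus receiving the class object directly (the helper below is illustrative, not part of the PR):

import importlib

def resolve_loader(data_loader_class):
    # string form: 'package.module.ClassName' has to be imported and looked up
    if isinstance(data_loader_class, str):
        module_name, _, class_name = data_loader_class.rpartition('.')
        return getattr(importlib.import_module(module_name), class_name)
    # class form: nothing to resolve, just use it
    return data_loader_class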

@tgaddair tgaddair (Collaborator) left a comment

Looks good, just a few minor things.

Signed-off-by: Peng Zhang <pengz@uber.com>
@irasit irasit merged commit 52d0b27 into master May 30, 2021