
Add support for Ray Train and Ray Datasets in training #1391

Merged: 47 commits into master on Oct 22, 2021

Conversation

@tgaddair (Collaborator) commented on Oct 15, 2021:

Fixes #1354.
Fixes #1331.

@tgaddair changed the title from "Add support for RaySGD and RayDatasets in training" to "Add support for Ray Train and Ray Datasets in training" on Oct 20, 2021
@tgaddair marked this pull request as ready for review on October 20, 2021 16:31
@tgaddair requested a review from ShreyaR on October 20, 2021 16:31
@ShreyaR (Contributor) left a comment:

This looks awesome! Added mostly minor comments -- feel free to merge whenever updated!

    min_cpus = min(r['CPU'] for r in resources)
    num_workers = len(resources)
    resources_per_worker = {
        'CPU': min(min_cpus / 2 + 1, min_cpus)
    }
@ShreyaR (Contributor) commented:

resources_per_worker could potentially be fractional here -- you could use min_cpus // 2.

@tgaddair (Collaborator, Author) replied:

Good point. Ray does allow fractional CPUs, though, so I think it's fine to leave as is.
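
For reference, a minimal sketch (using the plain ray.remote task API, not the trainer code from this PR) showing that Ray accepts fractional CPU requests:

    import ray

    ray.init(num_cpus=2)

    # Fractional CPU requests are valid in Ray: each task reserves half a
    # CPU, so up to four of them can run concurrently on a 2-CPU node.
    @ray.remote(num_cpus=0.5)
    def probe(i):
        return i

    print(ray.get([probe.remote(i) for i in range(4)]))  # [0, 1, 2, 3]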

    @@ -110,14 +154,121 @@ def train_online(self, *args, **kwargs):
            return results


    class RayTrainer(BaseTrainer):
        def __init__(self, horovod_kwargs, trainer_kwargs):
            def train_fn(executable_kwargs=None, remote_model=None, train_shards=None, val_shards=None, test_shards=None, **kwargs):
@ShreyaR (Contributor) commented:

Could you add type hints here? It's not super clear from the code what train_shards, val_shards, etc. should be.

@tgaddair (Collaborator, Author) replied:

Yes! Good catch.
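
For illustration, a hypothetical type-hinted version of the signature (the concrete shard types are an assumption and may differ from what the PR ends up using):

    from typing import Any, Dict, List, Optional

    import ray
    from ray.data import Dataset

    # Hypothetical annotations: executable_kwargs is assumed to be a plain
    # dict of kwargs, remote_model a Ray object reference to the model, and
    # the *_shards arguments per-worker lists of Ray Dataset shards.
    def train_fn(
        executable_kwargs: Optional[Dict[str, Any]] = None,
        remote_model: Optional[ray.ObjectRef] = None,
        train_shards: Optional[List[Dataset]] = None,
        val_shards: Optional[List[Dataset]] = None,
        test_shards: Optional[List[Dataset]] = None,
        **kwargs,
    ):
        ...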

    training_set = split_dataset(dataset, split, 0)
    validation_set = split_dataset(dataset, split, 1)
    test_set = split_dataset(dataset, split, 2)
    distinct_values = dataset[split].drop_duplicates()
@ShreyaR (Contributor) commented:

General Q in case we want to keep duplicates: is it possible that duplicates in the dataset were an artifact of upsampling some points in the dataset?

@tgaddair (Collaborator, Author) replied:

Here the drop_duplicates is only on the split column. The idea is that this is a more efficient way to determine whether or not the dataset has, for example, a validation split, so we don't have to call len(val_df) (which is expensive).

In general, I don't think Ludwig does any duplicate removal across the entire dataset.
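
As an illustration, a minimal pandas sketch of the pattern (toy data, not the PR's code):

    import pandas as pd

    # Toy dataframe with a split column: 0 = train, 1 = validation, 2 = test.
    df = pd.DataFrame({"feature": range(6), "split": [0, 0, 0, 1, 2, 2]})

    # drop_duplicates on the split column alone reveals which splits exist,
    # without materializing each split just to call len() on it.
    distinct_values = df["split"].drop_duplicates()
    has_validation = 1 in distinct_values.values
    has_test = 2 in distinct_values.values
    print(has_validation, has_test)  # True True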

@tgaddair (Collaborator, Author) followed up:

Added comment.

        training_set_metadata.get(DATA_TRAIN_HDF5_FP)
    )

    def create_inference_dataset(
@ShreyaR (Contributor) commented:

Looks like create_inference_dataset and create have the same functionality. The tag argument looks like it isn't being used. Is it possible to consolidate both methods?

@tgaddair (Collaborator, Author) replied:

Yeah, it should be possible once we remove Petastorm. We need it for Petastorm (and, by extension, here to fit the interface) because we use different datasets for training vs. prediction. Will add a TODO.

@tgaddair (Collaborator, Author) followed up:

Changed create_inference_dataset to call create.
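
A hypothetical sketch of what that consolidation could look like (the parameter lists here are assumptions, not the PR's exact signatures):

    # Hypothetical: create_inference_dataset delegates to create, keeping the
    # unused tag argument only to satisfy the shared Petastorm-era interface.
    def create(self, dataset, config, training_set_metadata):
        ...  # build and return the dataset

    def create_inference_dataset(self, dataset, tag, config, training_set_metadata):
        # TODO: fold this into create once Petastorm support is removed.
        return self.create(dataset, config, training_set_metadata)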


    @contextlib.contextmanager
    def initialize_batcher(self, batch_size=128,
                           should_shuffle=True,
@ShreyaR (Contributor) commented:

nit: It looks like some of the arguments here aren't used.

@tgaddair (Collaborator, Author) replied:

True, but we need it for the interface. Some of these params will be removed when we drop Petastorm.
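
A hypothetical illustration of that pattern (the helper and body are invented for the sketch):

    import contextlib

    @contextlib.contextmanager
    def initialize_batcher(self, batch_size=128, should_shuffle=True, **kwargs):
        # should_shuffle and **kwargs are accepted only to match the batcher
        # interface shared with the Petastorm backend; this implementation
        # ignores them, and they can be dropped along with Petastorm.
        batcher = self._create_batcher(batch_size)  # hypothetical helper
        yield batcher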

@tgaddair merged commit 8e6420c into master on Oct 22, 2021
@tgaddair deleted the ray-train branch on October 22, 2021 18:47
Development

Successfully merging this pull request may close these issues:

[ray] Use RayDatasets for training
[ray] Migrate RayExecutor to Ray SGD