Added Ray remote backend and Dask distributed preprocessing #1090
Conversation
# TODO ray: select this more intelligently,
# must be greater than or equal to number of Horovod workers
return dict(
    parallelism=int(ray.cluster_resources()['CPU'])
)
@clarkzinzow does this make sense as the default `repartition` value? One partition per CPU? Not sure if there's a more reasonable heuristic for this. The one restriction we have is that for Petastorm, we must have at least one row group per Horovod worker, and the safest way to guarantee this at the moment is to repartition the dataframe.
That's the typical heuristic, yes, under the soft constraint of those chunks/partitions fitting nicely into each worker's memory.
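For context, a minimal sketch of the one-partition-per-CPU heuristic discussed above, assuming a Dask dataframe and a locally started Ray cluster (the dataframe and column names here are made up for illustration):

```python
import ray
import pandas as pd
import dask.dataframe as dd

# Start a local Ray instance; in a real deployment this would connect to an
# existing cluster, e.g. ray.init(address="auto").
ray.init()

# Toy dataframe standing in for the preprocessed dataset.
df = dd.from_pandas(pd.DataFrame({"x": range(1000)}), npartitions=1)

# One partition per CPU reported by the cluster, which also guarantees at
# least one row group per Horovod worker as long as workers <= CPUs.
parallelism = int(ray.cluster_resources()["CPU"])
df = df.repartition(npartitions=parallelism)
```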
Absolutely amazing job with this! Added a couple of very minor comments.
features = get_combined_features(config)
for feature in features:
    name = feature[NAME]
    proc_column = feature[PROC_COLUMN]
    reshape = training_set_metadata[name].get('reshape')
    if reshape is not None:
        dataset[proc_column] = self.map_objects(dataset[proc_column], lambda x: x.reshape(-1))
Curious about this, what is it a workaround for? (Probably worth adding a comment about it too.)
Basically, PyArrow cannot serialize the data as multi-dimensional tensors, so we need to store the shape so that when we read it back, we can reshape it into its correct form. Will add a comment to this effect.
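A minimal sketch of that workaround, with made-up column and metadata names standing in for the real ones: the tensor is flattened before it is written through Arrow, and the recorded shape is used to restore it on read.

```python
import numpy as np
import pyarrow as pa

# Original multi-dimensional tensor and the shape recorded in the metadata
# (stands in for training_set_metadata[name]['reshape']).
original = np.arange(12, dtype=np.float32).reshape(3, 4)
metadata = {"reshape": original.shape}

# PyArrow stores the value as a flat list, so flatten before writing.
flat = original.reshape(-1)
table = pa.table({"proc_column": [flat.tolist()]})

# On read, restore the original shape from the recorded metadata.
restored = np.asarray(table["proc_column"][0].as_py(),
                      dtype=np.float32).reshape(metadata["reshape"])
assert np.array_equal(restored, original)
```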
t = getattr(sample, feature_name)
reshape_dim = self.reshape_features.get(feature_name)
if reshape_dim is not None:
    t = tf.reshape(t, reshape_dim)
I guess my previous question had to do with this, right? Higher-rank tensors get flattened and then put back into shape, right?
Yes, exactly, I can add a comment here.
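For illustration, a minimal sketch of that read-time reshape, with a hypothetical feature name and shape standing in for the entries in `self.reshape_features`:

```python
import tensorflow as tf

# Hypothetical mapping of feature name -> original shape recorded during
# preprocessing (stands in for self.reshape_features).
reshape_features = {"image_feature": (28, 28, 1)}

def restore_shape(sample, feature_name):
    """Reshape a flattened feature back to its original rank."""
    t = tf.convert_to_tensor(sample[feature_name])
    reshape_dim = reshape_features.get(feature_name)
    if reshape_dim is not None:
        t = tf.reshape(t, reshape_dim)
    return t

sample = {"image_feature": tf.zeros([28 * 28])}
print(restore_shape(sample, "image_feature").shape)  # (28, 28, 1)
```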