
Perform inference with Dask when using the Ray backend #1128

Merged · 60 commits · May 14, 2021

Conversation

tgaddair (Collaborator) commented Mar 21, 2021

This PR introduces a new PartitionedDataset constructed by the backend when calling create_inference_dataset within preprocess_for_prediction. Unlike other datasets, the PartitionedDataset runs the batch prediction in parallel on each partition of the distributed DataFrame.

A consequence of this rearchitecture is that instead of Predictor.batch_predict returning a dict, it now returns a DataFrame (as the dataset could be large enough to not fit into memory). Therefore, all the postprocessing now happens on the DataFrame with a flat structure, instead of the nested dict structure.
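
For illustration, here is a minimal sketch of the per-partition prediction idea using plain Dask. `predict_partition`, the `feature` column, and the toy threshold "model" are made up for this example and are not the actual Ludwig API:

```python
import dask.dataframe as dd
import pandas as pd

def predict_partition(partition: pd.DataFrame) -> pd.DataFrame:
    # Each Dask partition arrives as a plain pandas DataFrame, so any
    # per-batch prediction function can run on it independently. Here a
    # toy "model" just thresholds a feature; in the PR this would be the
    # Predictor running a model replica on the partition.
    out = pd.DataFrame(index=partition.index)
    out["label_predictions"] = (partition["feature"] > 0.5).astype(int)
    return out

# A toy distributed dataset standing in for the preprocessed inference data.
df = dd.from_pandas(pd.DataFrame({"feature": [0.1, 0.7, 0.4, 0.9]}), npartitions=2)

# map_partitions schedules the prediction on every partition in parallel;
# the result is itself a (lazy) distributed DataFrame with flat columns,
# mirroring the new DataFrame return type of Predictor.batch_predict.
predictions = df.map_partitions(predict_partition)
print(predictions.compute())
```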

There are a few API limitations in the current implementation:

  1. No activation collection for Ray or Dask (usually you wouldn't need to run this on a massive dataset anyway). We can add this in the future.
  2. No batch evaluation when using Dask backend. The evaluation requires some tricky state management to collect the metrics across partitions, which we can do using Ray actors (though it is not ideal). We should rearchitect this in the future so that metrics collection can occur after we run the prediction.
  3. No overall stats collection for partitioned datasets. It's not impossible, but will require more work, and is not a top priority at this time.

Another limitation we should address soon is that this implementation does not optimize for GPU inference. Because it runs a separate model replica on each partition, we have no way of preventing multiple model replicas from landing on the same GPU. With standalone Dask, this is possible when configuring your workers (see here). However, this approach doesn't work for Ray at the moment. One possibility is to create workers for each GPU and then use ray.remote to send partitions to GPU workers. If we go this route, we need to be smart about how we route partitions to model replicas to ensure minimal network overhead. We can do this in a follow-up PR.
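
As a rough sketch of that follow-up idea (not the PR's code): pin one Ray actor per GPU via `num_gpus=1` and ship partitions to the actors round-robin. `GPUPredictor` and `predict_on_gpus` are hypothetical names, and the prediction body is a placeholder:

```python
import pandas as pd
import ray

@ray.remote(num_gpus=1)
class GPUPredictor:
    # Hypothetical per-GPU worker: num_gpus=1 makes Ray schedule exactly
    # one of these actors per GPU, so model replicas never share a device.
    def predict(self, partition: pd.DataFrame) -> pd.DataFrame:
        # Placeholder for batch prediction with the GPU-resident replica.
        return pd.DataFrame({"predictions": [0] * len(partition)},
                            index=partition.index)

def predict_on_gpus(partitions):
    # Route partitions to GPU actors round-robin. A smarter router would
    # also consider data locality to keep the network overhead minimal.
    ray.init(ignore_reinit_error=True)
    num_gpus = int(ray.cluster_resources().get("GPU", 0))
    if num_gpus == 0:
        raise RuntimeError("no GPUs available in the Ray cluster")
    workers = [GPUPredictor.remote() for _ in range(num_gpus)]
    futures = [
        workers[i % num_gpus].predict.remote(part)
        for i, part in enumerate(partitions)
    ]
    return pd.concat(ray.get(futures))
```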

ifokeev (Contributor) commented Mar 22, 2021

Such a cool feature!

@tgaddair tgaddair marked this pull request as ready for review April 9, 2021 19:43
@tgaddair tgaddair requested a review from w4nderlust April 9, 2021 19:43
)
return postprocessed

# Save any new columns but do not save the original columns again

Collaborator

Why do we need to do this twice? Can't we do it only once at the end?
That would likely also remove the need for saved_keys.

Collaborator Author

At least in the original implementation, existing columns were written to numpy before postprocessing, while new columns were written after postprocessing. This change is meant to preserve that behavior, but we can definitely remove the earlier lines if that change is acceptable.
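
A tiny sketch of that intent, with hypothetical names (`saved_keys`, `save_fn`): the columns present before postprocessing are recorded first, and afterwards only the columns that postprocessing added get written, so nothing is saved twice:

```python
import pandas as pd

def save_new_columns(df: pd.DataFrame, saved_keys: set, save_fn) -> None:
    # saved_keys holds the column names already written before
    # postprocessing; afterwards only columns added by postprocessing
    # are saved, so the original columns are not written a second time.
    for col in df.columns:
        if col not in saved_keys:
            save_fn(col, df[col].to_numpy())
            saved_keys.add(col)
```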

if top_k_col in predictions:
    if 'idx2str' in metadata:
        predictions[top_k_col] = backend.df_engine.map_objects(
            predictions[top_k_col],

Collaborator

Not 100% sure how the content of the top-k column looks, but before, each row was a list of values, so in line 431 there was a loop over each value, while here it seems like there's only a loop over preds and not inside each pred.

Collaborator Author

Good catch, every row here should be a pred_top_k. Fixed.
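
To illustrate the fix with made-up data: each row of the top-k column is itself a list of class indices, so the `idx2str` lookup has to be applied to every value inside each row, which is what the per-row lambda passed to `map_objects` now does (approximated here with a plain pandas `map`):

```python
import pandas as pd

idx2str = ["cat", "dog", "bird"]  # assumed index-to-label vocabulary

# Each row of the top-k column holds a list of predicted class indices.
top_k_col = pd.Series([[0, 2], [1, 0]])

# Loop over rows, and inside each row loop over the individual indices.
decoded = top_k_col.map(lambda pred_top_k: [idx2str[i] for i in pred_top_k])
print(decoded.tolist())  # [['cat', 'bird'], ['dog', 'cat']]
```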

        self.metrics.append(metrics)

    def collect(self):
        return sum_dicts(

Collaborator

This still does not convince me 100%, as most of the metrics summed / averaged across partitions do not make much sense. Maybe we can deactivate evaluation in the Ray case as well, think it through, and then reactivate it.

Collaborator Author

I agree, though this implies it is currently broken in the normal Horovod case as well:

merged_output_metrics = sum_dicts(

I believe once we adopt the Metrics package for PyTorch, which natively supports distributed aggregation, this problem will be addressed. Until then, I agree that disabling makes sense.

Collaborator

I think it depends on what we are aggregating. If it's accuracies, for instance, this wouldn't be correct; if it's aggregation by summing the counts of correct predictions and of all predictions, and then later dividing correct predictions by all predictions, that would be correct. But I believe we are doing the former.
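
A quick worked example of that distinction, with made-up partition counts: averaging per-partition accuracies is wrong when partition sizes differ, while summing correct/total counts across partitions and dividing once at the end is exact:

```python
# Two partitions of different sizes (made-up numbers):
#   partition A: 90 correct out of 100
#   partition B:  1 correct out of  10
partitions = [(90, 100), (1, 10)]

# Averaging the per-partition accuracies (the problematic case):
avg_of_accuracies = sum(c / n for c, n in partitions) / len(partitions)  # 0.5

# Summing counts across partitions, then dividing once (the correct case):
total_correct = sum(c for c, _ in partitions)   # 91
total_seen = sum(n for _, n in partitions)      # 110
overall_accuracy = total_correct / total_seen   # ~0.827

print(avg_of_accuracies, overall_accuracy)
```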
