Improved performance of Dask preprocessing by adding parallelism #1193
This change addresses a fundamental performance bottleneck in the Dask preprocessing pipeline. Previously, every feature was processed sequentially, resulting in a task graph of the following form:

The problem is that, because of Dask's lazy execution, any step that needs to be materialized (for example, computing a metadata statistic) forces the entire graph up to that point to be re-executed. As a result, adding new features slows preprocessing down quadratically: every feature adds a linear amount of redundant work.
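For illustration, the sequential pattern looks roughly like the sketch below. It uses pandas so the snippet is self-contained, but the real pipeline operates on Dask DataFrames, where each `assign` adds another link to a single long task chain; `process` is a hypothetical per-feature transform, not the actual preprocessing function.

```python
import pandas as pd

def process(series):
    # Hypothetical per-feature transform (e.g. standardization).
    return (series - series.mean()) / series.std()

df = pd.DataFrame({"f1": [1.0, 2.0, 3.0], "f2": [4.0, 5.0, 6.0]})

# Sequential pattern: each iteration assigns onto the df produced by the
# previous iteration, so with Dask every step depends on all prior steps.
for name in ["f1", "f2"]:
    df = df.assign(**{f"{name}_proc": process(df[name])})
```

With lazy execution, materializing any intermediate in this loop re-runs every `assign` that came before it.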
The culprit is the use of `assign`, which occurs when we assign a series to a Dask DataFrame. Because each subsequent operation needs to reuse the `df` from the previous iteration, every step creates a task dependency, so no steps can run in parallel and the task graph is one long chain.

The change here is to instead split each feature into an independent subgraph of computation, which (1) improves overall parallelism, and (2) means that computing statistics only needs to process the part of the graph relevant to the given feature. Now the graph looks like the following:
We accomplish this by replacing the intermediate dataframe with a `dict` that stores the processed series for each feature. Once every feature's processed series is available, we assign them all to the final output dataframe in a single "join" operation.

Before this change, processing the Higgs dataset with Dask took over 45 minutes; it is now down to about 30 seconds when running locally, on par with Pandas.
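The restructured pattern can be sketched as follows (again in pandas for a self-contained example, with a hypothetical `process` transform; in the real pipeline these are Dask series, so each dict entry is an independent subgraph that can be computed in parallel):

```python
import pandas as pd

def process(series):
    # Hypothetical per-feature transform (e.g. standardization).
    return (series - series.mean()) / series.std()

df = pd.DataFrame({"f1": [1.0, 2.0, 3.0], "f2": [4.0, 5.0, 6.0]})

# Independent subgraphs: every processed series is derived from the
# original df only, never from a partially-processed intermediate.
processed = {name: process(df[name]) for name in ["f1", "f2"]}

# Single "join" at the end: assign all processed series at once.
out = df.assign(**processed)
```

Because no entry in `processed` depends on any other, materializing a statistic for one feature touches only that feature's subgraph.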
cc @clarkzinzow