
Improved performance of Dask preprocessing by adding parallelism #1193

Merged: 9 commits, Jun 4, 2021

Conversation

@tgaddair (Collaborator) commented Jun 3, 2021

This change addresses a fundamental performance bottleneck in the Dask preprocessing pipeline. Before, every feature was being processed sequentially, resulting in a task graph with the following form:

[Screenshot: sequential task graph before the change]

The problem is that, because of Dask's lazy execution, whenever any step needs to be materialized (for example, to compute a metadata statistic), the entire graph up to that point must be re-executed. As a result, adding new features slows preprocessing quadratically: each new feature adds a linear amount of redundant work.

The culprit is the implicit assign that occurs when we set a series on a Dask DataFrame:

df[col] = process(df[col])

Because each subsequent operation reuses the df from the previous iteration, this creates a task dependency: steps cannot run in parallel, and the task graph becomes one long chain.

The change is to instead split each feature into an independent subgraph of computation, which (1) improves overall parallelism, and (2) means that computing statistics only needs to process the part of the graph relevant to the given feature. The graph now looks like the following:

[Screenshot: parallel per-feature task graph after the change]

We accomplish this by replacing the intermediate dataframe with a dict that stores the processed series for each feature. Then, once every feature's processed series is available, we assign them all to the final output dataframe in a single "join" operation.

Before this change, processing the Higgs dataset with Dask took over 45 minutes when running locally; it is now down to about 30 seconds, on par with Pandas.

cc @clarkzinzow

@tgaddair tgaddair requested a review from w4nderlust June 4, 2021 00:13
@w4nderlust w4nderlust merged commit 9b3c83b into master Jun 4, 2021
@w4nderlust w4nderlust deleted the dask-parallel branch June 4, 2021 01:42