Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Run analysis steps in parallel with Dask #311

Merged
merged 6 commits into from
Nov 16, 2020
Merged

Run analysis steps in parallel with Dask #311

merged 6 commits into from
Nov 16, 2020

Conversation

scholtzan
Copy link
Collaborator

@scholtzan scholtzan commented Nov 6, 2020

Supersedes #299

I ran some benchmarks to see how much speedup this gives us. The actual speedup depends on available CPU resources, but I got up to a 50% speed increase for computation intensive tasks (e.g. the weekly analysis run for bug-1664289-pref-activity-stream-pocket-new-tab-en-global-feed-release-81-82 and bug-1665555-pref-etp4-retention-study-release-81-83).

An even better solution here would be to use Dask Kubernetes which would deploy Dask workers on the cluster instead of just the single pod running the analysis for an experiment. I did try to get this working, however was unable to get the Dask cluster manager working. It's probably worth keeping this option in mind for future improvements though.

jetstream/analysis.py Outdated Show resolved Hide resolved
jetstream/analysis.py Outdated Show resolved Hide resolved
jetstream/analysis.py Outdated Show resolved Hide resolved
Dockerfile Outdated Show resolved Hide resolved
@scholtzan scholtzan mentioned this pull request Nov 6, 2020
Copy link
Contributor

@tdsmith tdsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a big commit and I'll have to come back to it later but here's some initial feedback. A concern I have about the approach is that I don't think there's anything intrinsically unserializable about an AnalysisConfiguration and I don't think we should have to try so hard to avoid holding onto one.

One thing we could consider is to separate "resolving" an AnalysisSpec from the process of binding it to an experiment -- we could attach an Experiment to an AnalysisSpec as a separate step, producing a, idk, FinalizedAnalysisSpec object that's still trivially serializable, and delay instantiation of metrics and statistics to the child processes.

I'm curious what it's choking on and if it points to design problems, though.

Dockerfile Outdated Show resolved Hide resolved
bin/entrypoint Show resolved Hide resolved
jetstream/analysis.py Outdated Show resolved Hide resolved
jetstream/analysis.py Outdated Show resolved Hide resolved
@scholtzan
Copy link
Collaborator Author

I figured out a way to make AnalysisConfiguration pickleable, which reduces the changes necessary here. So the main change here is the re-written analysis logic to make it compatible with dask.delayed

Copy link
Contributor

@tdsmith tdsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, a few questions

jetstream/analysis.py Show resolved Hide resolved
from jetstream.bigquery_client import BigQueryClient
from pandas import DataFrame
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

meganit: i'd normally sort this with the second group of imports according to an idea of "stdlib on top, then third-party, and then same-package at the end," which is a pep-8-ism https://www.python.org/dev/peps/pep-0008/#imports

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about using isort?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally in favour!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, I applied isort here. It's in a separate commit so can easily be reverted if it doesn't look great.

jetstream/analysis.py Outdated Show resolved Hide resolved
jetstream/analysis.py Outdated Show resolved Hide resolved
jetstream/analysis.py Outdated Show resolved Hide resolved
jetstream/analysis.py Outdated Show resolved Hide resolved
jetstream/analysis.py Show resolved Hide resolved
jetstream/config.py Show resolved Hide resolved
jetstream/analysis.py Outdated Show resolved Hide resolved
Copy link
Contributor

@tdsmith tdsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💪:shipit:

jetstream/analysis.py Outdated Show resolved Hide resolved
@scholtzan scholtzan merged commit 28118e1 into main Nov 16, 2020
@scholtzan scholtzan deleted the dask2 branch November 16, 2020 20:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants