-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Run analysis steps in parallel with Dask #311
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a big commit and I'll have to come back to it later but here's some initial feedback. A concern I have about the approach is that I don't think there's anything intrinsically unserializable about an AnalysisConfiguration and I don't think we should have to try so hard to avoid holding onto one.
One thing we could consider is to separate "resolving" an AnalysisSpec from the process of binding it to an experiment -- we could attach an Experiment to an AnalysisSpec as a separate step, producing a, idk, FinalizedAnalysisSpec
object that's still trivially serializable, and delay instantiation of metrics and statistics to the child processes.
I'm curious what it's choking on and if it points to design problems, though.
I figured out a way to make |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good, a few questions
jetstream/analysis.py
Outdated
from jetstream.bigquery_client import BigQueryClient | ||
from pandas import DataFrame |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
meganit: i'd normally sort this with the second group of imports according to an idea of "stdlib on top, then third-party, and then same-package at the end," which is a pep-8-ism https://www.python.org/dev/peps/pep-0008/#imports
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about using isort?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
generally in favour!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, I applied isort here. It's in a separate commit so can easily be reverted if it doesn't look great.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💪:shipit:
Supersedes #299
I ran some benchmarks to see how much speedup this gives us. The actual speedup depends on available CPU resources, but I got up to a 50% speed increase for computation intensive tasks (e.g. the weekly analysis run for
bug-1664289-pref-activity-stream-pocket-new-tab-en-global-feed-release-81-82
andbug-1665555-pref-etp4-retention-study-release-81-83
).An even better solution here would be to use Dask Kubernetes which would deploy Dask workers on the cluster instead of just the single pod running the analysis for an experiment. I did try to get this working, however was unable to get the Dask cluster manager working. It's probably worth keeping this option in mind for future improvements though.