Improve the performance of bayesian bootstrap by 10x #177
Conversation
0d8c1a9 to f6fbaac
The vast majority of the current time is spent in `np.random.RandomState(unique_seed)`. Previously, when generation of the sample weights was parallelized, it made sense to reseed for each sampling so that the results were deterministic. However, that parallelization was removed in e33914d. With it gone, we can drop the reseeding entirely, which takes 16 or so 100-sample calls to `compare_branches` from 50 seconds down to 5.
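The cost being removed can be illustrated with a minimal sketch. This is not the actual mozanalysis code: the function names and the Dirichlet-weight generation are assumptions standing in for the real sampling logic. The old pattern constructs a fresh `RandomState` on every call; the new pattern reuses one generator, which is safe once sampling is serial.

```python
import timeit
import numpy as np

def weights_reseeded(seed):
    # Old pattern (hypothetical sketch): build a fresh RandomState per
    # resample so results stay deterministic under parallel execution.
    rs = np.random.RandomState(seed)
    return rs.dirichlet(np.ones(1000))

_shared = np.random.RandomState(42)

def weights_shared():
    # New pattern (hypothetical sketch): reuse one generator; fine now
    # that the sampling loop runs serially in a single process.
    return _shared.dirichlet(np.ones(1000))

t_old = timeit.timeit(lambda: weights_reseeded(123), number=2000)
t_new = timeit.timeit(weights_shared, number=2000)
print(f"per-call reseed: {t_old:.3f}s  shared generator: {t_new:.3f}s")
```

The exact ratio depends on the weight-vector size and hardware, so the 10x figure in the PR title should be read as specific to the `compare_branches` workload described above.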
cc @danielkberry: you are probably most familiar with this
Hi there, thanks for making this PR. I tested the change under conditions like we might encounter in normal use of this code. The test simulated an experiment and measured the timings. As we can see from the timings in the notebook, they are not statistically different from each other, so this method does not appear to provide any speedup.
Please demonstrate that the code changes improve performance
Your notebook doesn't actually use the new code. If I fix that with:

mozanalysis.bayesian_stats.bayesian_bootstrap.get_bootstrap_samples = get_bootstrap_samples
mozanalysis.bayesian_stats.bayesian_bootstrap._resample_and_agg_once = _resample_and_agg_once

I get:

Before: (timing screenshot)
After: (timing screenshot)

Here's an updated notebook: https://colab.research.google.com/drive/1p5Zl6vnmwVG8daoF4LTrNHF844zqFcCG?usp=sharing
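The benchmarking pitfall above is worth spelling out: redefining a function in a notebook does not change what a library executes internally, because the library's own call sites still resolve through its module attribute. The patch lines shown for mozanalysis fix that. A self-contained sketch of the same idea, with hypothetical names (`pkg`, `helper`, `run`) standing in for the real module:

```python
import types

# Stand-in for the library module under test.
pkg = types.ModuleType("pkg")

def _old(x):
    return x + 1

pkg.helper = _old

def run():
    # Library entry point: resolves helper through the module attribute,
    # not through the notebook's local namespace.
    return pkg.helper(1)

def _new(x):
    # "Optimized" replacement defined in the notebook.
    return x + 2

# Merely defining _new changes nothing; run() still uses the old helper.
assert run() == 2

# Patching the module attribute makes existing call sites pick it up.
pkg.helper = _new
assert run() == 3
```

This is why the first benchmark showed no difference: without the monkeypatch, both timing runs exercised the unmodified library code.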
Thanks for clearing that up. I did some comparisons of the new method to make sure nothing was obviously different from the previous one (as an end-to-end test) and it checks out. Graphs in this notebook: https://colab.research.google.com/drive/1eKfa_nGBzKduzN_5fDSmjbVt0b3TcL-q
The changes provide a speedup and do not appear to impact results, at least running locally. To summarize, the changes no longer re-initialize the sampler on each call, which saves time.
Based on my knowledge of how jetstream runs in prod on k8s, all calculations for a particular metric are done within a pod (we run Dask in a LocalCluster configuration), so they are not parallelized across machines as they were with our previous Spark deployment. Therefore, I do not believe we need to be as careful with sampling as we needed to be in the past.
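The determinism point deserves a brief illustration: when the bootstrap loop runs serially, seeding the generator once at the top still yields fully reproducible results, so the per-sample reseeding really is redundant. A minimal sketch with hypothetical names (`bootstrap_means` is not a mozanalysis function):

```python
import numpy as np

def bootstrap_means(data, n_samples, seed):
    # Seed once; with serial execution the stream of draws is
    # deterministic, so per-sample reseeding is unnecessary.
    # (Hypothetical sketch, not the actual mozanalysis implementation.)
    rs = np.random.RandomState(seed)
    return [
        float(np.dot(rs.dirichlet(np.ones(len(data))), data))
        for _ in range(n_samples)
    ]

data = np.arange(10.0)
a = bootstrap_means(data, 5, seed=42)
b = bootstrap_means(data, 5, seed=42)
assert a == b  # identical runs without any per-sample reseeding
```

Per-sample seeds only matter when samples may be drawn out of order or on different workers, which is exactly the Spark-era situation the comment above says no longer applies.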