
Improve the performance of Bayesian bootstrap by 10x #177

Merged 1 commit into mozilla:main on Aug 1, 2023

Conversation

@jrmuizel (Contributor) commented Apr 3, 2023

The vast majority of the current time is spent in:

np.random.RandomState(unique_seed)

Previously, when generation of the sample weights was parallelized, it made sense to reseed for each sampling so that it was deterministic. However, that parallelization was removed in e33914d.

With that gone we can drop the reseeding entirely, which takes 16 or so 100-sample calls to compare_branches from 50 seconds down to 5.
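
For illustration, here is a rough sketch of the kind of change being described. The function names and the Dirichlet-weighted mean below are simplified stand-ins for mozanalysis's resampling helper, not the actual diff:

import numpy as np

# Old approach (sketch): build a fresh RandomState for every bootstrap sample
# so that each parallel task was deterministic. Constructing a RandomState is
# expensive, and with hundreds of samples it dominates the runtime.
def resample_and_agg_once_old(values, unique_seed):
    rng = np.random.RandomState(unique_seed)       # re-seeded on every call
    weights = rng.dirichlet(np.ones(len(values)))  # Bayesian bootstrap weights
    return np.dot(weights, values)

# New approach (sketch): with per-sample parallelism gone, a single generator
# can be reused across all samples, skipping the per-call reseeding entirely.
_rng = np.random.RandomState()

def resample_and_agg_once_new(values):
    weights = _rng.dirichlet(np.ones(len(values)))
    return np.dot(weights, values)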

@jrmuizel force-pushed the faster branch 2 times, most recently from 0d8c1a9 to f6fbaac on April 3, 2023 at 17:05
@scholtzan (Contributor) commented:

cc @danielkberry: you are probably most familiar with this

@danielkberry (Contributor) commented:

Hi there, thanks for making this PR. I tested the change under conditions like those we might encounter in normal use of this code: the test simulated an experiment and measured the timing of compare_branches under both the old code and the change proposed here (as described in the first comment).

As the notebook shows, the timings are not statistically different from each other, so this change does not appear to provide any speedup.

@danielkberry (Contributor) left a review:

Please demonstrate that the code changes improve performance

@jrmuizel (Contributor, author) commented Aug 1, 2023

Your notebook doesn't actually use the new code. If I fix that with:

import mozanalysis.bayesian_stats.bayesian_bootstrap

# Point the installed module at the functions redefined in the notebook so
# that compare_branches actually exercises the new code path.
mozanalysis.bayesian_stats.bayesian_bootstrap.get_bootstrap_samples = get_bootstrap_samples
mozanalysis.bayesian_stats.bayesian_bootstrap._resample_and_agg_once = _resample_and_agg_once

I get:

Before:

%%timeit
compare_branches(data, 'metric_a')
11.9 s ± 86 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
compare_branches(data, 'metric_b')
5.78 s ± 2.11 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

After:

%%timeit
compare_branches(data, 'metric_a')
6.53 s ± 599 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
compare_branches(data, 'metric_b')
358 ms ± 84.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Here's an updated notebook: https://colab.research.google.com/drive/1p5Zl6vnmwVG8daoF4LTrNHF844zqFcCG?usp=sharing

@danielkberry (Contributor) commented:

Thanks for clearing that up. I did some comparisons of the new method to make sure nothing was obviously different from the previous one (as an end-to-end test) and it checks out. Graphs in this notebook: https://colab.research.google.com/drive/1eKfa_nGBzKduzN_5fDSmjbVt0b3TcL-q

@danielkberry (Contributor) left a review:

The changes provide a speedup and do not appear to impact results, at least running locally. To summarize: the code no longer re-initializes the sampler for each sample, which is where the time savings come from.

Based on my knowledge of how jetstream runs in prod on k8s, all calculations for a particular metric are done within a single pod (we run Dask in a LocalCluster configuration), so they are not parallelized across machines as they were with our previous Spark deployment. Therefore, I do not believe we need to be as careful with sampling as we needed to be in the past.
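
As an aside, and only as an illustration of that point (not the actual jetstream code): in a single process, reproducibility can still be had cheaply by seeding one generator up front and reusing it for every draw, rather than constructing a RandomState per sample.

import numpy as np

# Seed once per process; every subsequent Dirichlet draw comes from this one
# generator, so a run stays reproducible without paying the per-sample
# RandomState construction cost.
rng = np.random.RandomState(42)
samples = [rng.dirichlet(np.ones(1000)) for _ in range(100)]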

@danielkberry merged commit 347642b into mozilla:main on Aug 1, 2023