
Improve the performance of Bayesian bootstrap by 10x #177

Merged 1 commit into mozilla:main on Aug 1, 2023

Conversation

@jrmuizel (Contributor) commented Apr 3, 2023

The vast majority of the current time is spent in:

np.random.RandomState(unique_seed)

Previously, when generation of the sample weights was parallelized, it made sense to reseed for each sampling so that it was deterministic. However, that parallelization was removed in e33914d.

With that gone we can drop the reseeding entirely, which takes 16 or so 100-sample calls to compare_branches from 50 seconds down to 5.
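
For illustration, here is a rough sketch of the kind of change being described. The function names and the Dirichlet-weighted mean below are simplified stand-ins for mozanalysis's resampling helper, not the actual diff:

import numpy as np

# Old approach (sketch): build a fresh RandomState for every bootstrap sample
# so that each parallel task was deterministic. Constructing a RandomState is
# expensive, and with hundreds of samples it dominates the runtime.
def resample_and_agg_once_old(values, unique_seed):
    rng = np.random.RandomState(unique_seed)       # re-seeded on every call
    weights = rng.dirichlet(np.ones(len(values)))  # Bayesian bootstrap weights
    return np.dot(weights, values)

# New approach (sketch): with per-sample parallelism gone, a single generator
# can be reused across all samples, skipping the per-call reseeding entirely.
_rng = np.random.RandomState()

def resample_and_agg_once_new(values):
    weights = _rng.dirichlet(np.ones(len(values)))
    return np.dot(weights, values)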

@jrmuizel force-pushed the faster branch 2 times, most recently from 0d8c1a9 to f6fbaac on April 3, 2023 at 17:05
@scholtzan (Contributor) commented:

cc @danielkberry: you are probably most familiar with this

@danielkberry (Contributor) commented:

Hi there, thanks for making this PR. I tested the change under conditions like those we might encounter in normal use of this code: the test simulated an experiment and measured the timing of compare_branches under both the old code and the change proposed here (as described in the first comment).

As the notebook shows, the timings are not statistically different from each other, so this change does not appear to provide any speedup.

@danielkberry (Contributor) left a review:

Please demonstrate that the code changes improve performance

@jrmuizel (Contributor, author) commented Aug 1, 2023

Your notebook doesn't actually use the new code. If I fix that with:

import mozanalysis.bayesian_stats.bayesian_bootstrap

# Point the installed module at the functions redefined in the notebook so
# that compare_branches actually exercises the new code path.
mozanalysis.bayesian_stats.bayesian_bootstrap.get_bootstrap_samples = get_bootstrap_samples
mozanalysis.bayesian_stats.bayesian_bootstrap._resample_and_agg_once = _resample_and_agg_once

I get:

Before:

%%timeit
compare_branches(data, 'metric_a')
11.9 s ± 86 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
compare_branches(data, 'metric_b')
5.78 s ± 2.11 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

After:

%%timeit
compare_branches(data, 'metric_a')
6.53 s ± 599 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
compare_branches(data, 'metric_b')
358 ms ± 84.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Here's an updated notebook: https://colab.research.google.com/drive/1p5Zl6vnmwVG8daoF4LTrNHF844zqFcCG?usp=sharing

@danielkberry (Contributor) commented:

Thanks for clearing that up. I did some comparisons of the new method to make sure nothing was obviously different from the previous one (as an end-to-end test) and it checks out. Graphs in this notebook: https://colab.research.google.com/drive/1eKfa_nGBzKduzN_5fDSmjbVt0b3TcL-q

@danielkberry (Contributor) left a review:

The changes provide a speedup and do not appear to impact results, at least running locally. To summarize: the code no longer re-initializes the sampler for each sample, which is where the time savings come from.

Based on my knowledge of how jetstream runs in prod on k8s, all calculations for a particular metric are done within a single pod (we run Dask in a LocalCluster configuration), so they are not parallelized across machines as they were with our previous Spark deployment. Therefore, I do not believe we need to be as careful with sampling as we needed to be in the past.
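
As an aside, and only as an illustration of that point (not the actual jetstream code): in a single process, reproducibility can still be had cheaply by seeding one generator up front and reusing it for every draw, rather than constructing a RandomState per sample.

import numpy as np

# Seed once per process; every subsequent Dirichlet draw comes from this one
# generator, so a run stays reproducible without paying the per-sample
# RandomState construction cost.
rng = np.random.RandomState(42)
samples = [rng.dirichlet(np.ones(1000)) for _ in range(100)]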

@danielkberry merged commit 347642b into mozilla:main on Aug 1, 2023