Use better measures than average and standard deviation #1
Just for the record, criterion, a Haskell benchmarking library, also does mean and SD, but complements them with bootstrap confidence intervals. This approach is more computationally intensive than median and MAD, but I think it could be more useful in practice, because ranges carry more information than point estimates.
That's interesting. We looked at bootstrap methods for our performance analysis at work (and my benchmarks in general), but found them to be heavy-handed, not to mention misleading with empirical performance measurements. For instance, criterion performs linear regression analysis, but what does that mean for nonlinear cases? With 99% of performance work, especially microbenchmarks, I'm interested in the exact measurements, and will handle the predictions myself, thank you. Love criterion's KDEs though. Not sure if that could ever work in the console, but one can dream. :)
Can you elaborate on your experience with bootstrap/regression methods for benchmarking? Specific examples leading to inaccurate estimates would be very helpful.
The primary question is one of audience. Bootstrapping to get confidence intervals is often misleading because the process and the results are nontrivial to interpret. Criterion tries hard to work around this, but I still don't think it closes the gap nearly enough to be useful as a timeit stand-in.

Adding a second dimension, I have a philosophical question mark over whether bootstrapping even applies. It can be applied, the numbers go into the equations, but is it valuable to apply it? Looking at the origins of bootstrapping, it's a frequentist formulation designed to make threadbare samples more usable with other frequentist techniques, as the name suggests. I think there is an argument to be made that in many performance cases we don't have an imperfect sample: we often have the ability to sample perfectly, or even compute statistics on the entire population.

Bootstrapping and jackknifing are certainly interesting. I initially had a section for them in Statistics for Software, but the more I ran it past other developers, the more I concluded that they were not practical performance tools, and arguably added to the confusion around an already confusing topic. Better to stick to the nonparametric fundamentals.
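For readers following along, here is a minimal sketch of what the bootstrap under discussion actually does: a percentile bootstrap confidence interval for the median, in plain Python. The function name, the resample count, and the 95% level are illustrative choices, not anything taken from perf or criterion.

```python
import random
import statistics

def bootstrap_median_ci(samples, n_resamples=10_000, confidence=0.95):
    """Percentile-bootstrap confidence interval for the median.

    Resamples the data with replacement many times, computes the median
    of each resample, and reads the interval off the empirical quantiles.
    """
    medians = []
    for _ in range(n_resamples):
        resample = random.choices(samples, k=len(samples))
        medians.append(statistics.median(resample))
    medians.sort()
    lo_idx = int((1 - confidence) / 2 * n_resamples)
    hi_idx = int((1 + confidence) / 2 * n_resamples) - 1
    return medians[lo_idx], medians[hi_idx]

# Example: made-up timings in seconds, with one slow outlier.
timings = [0.101, 0.102, 0.099, 0.100, 0.103, 0.150]
print(bootstrap_median_ci(timings))
```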
@mahmoud: "I recommended median and median absolute deviation." I don't really understand the difference between median and mean. I don't understand which one is the "good one": more representative of the performance that users will see on their computer. I plotted the distribution of one benchmark, "bench_dict.py if -p 250 -n 10" (2500 samples), on a very busy system without CPU isolation. I plotted a lot of files with various configurations: CPU isolation on/off, system idle/busy, different programs, etc. This shape is quite common in my tests: the curve is cut off at the left and has a more or less long tail at the right. The red line is the median, the blue line is the mean. The mean line looks like it is the mode, whereas the median is to the left. I counted the number of samples:
Re: mean and median, the short answer is that the median is the better one. If there were an anomaly in testing and one test ran very fast or very slow, the mean would be the one that is skewed. A similar relationship exists between standard deviation and median absolute deviation (MAD). You can see this in your chart: the blue mean line is overrepresenting the slower runs, even though they are in the minority. The median gives equal representation, and is the more appropriate indicator of central value.

Re: curve shape, yes, that looks about right for a lot of performance measurements. As I mention (offhandedly) in the statistics post, the two most helpful distributions for modelling performance are the Poisson and Erlang. That said, nonparametric statistics like the minimum and median are the easiest to interpret and build on.
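To make the skew concrete, here is a small plain-Python sketch (the timing values are made up) showing how a single slow outlier drags the mean and standard deviation while the median and MAD barely move:

```python
import statistics

# Hypothetical timings in milliseconds: one run hit system noise.
timings = [10.1, 10.2, 10.0, 10.3, 10.1, 10.2, 25.0]

mean = statistics.mean(timings)
stdev = statistics.stdev(timings)
median = statistics.median(timings)
# MAD: the median of absolute deviations from the median.
mad = statistics.median(abs(t - median) for t in timings)

print(f"mean   = {mean:.2f} ms, stdev = {stdev:.2f} ms")   # dragged up by the outlier
print(f"median = {median:.2f} ms, MAD   = {mad:.2f} ms")   # barely affected
```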
myself:
Oops, it's just that the curve was explicitly centered on the mean in my code, that's why the mean fits well on this curve :-D So ignore that sentence of my previous message... I never really used numpy+scipy before, I'm still learning how to plot things :-)
@superbobry: "Just for the record, criterion, a Haskell benchmarking library, also does mean and SD, but complements them with bootstrap confidence intervals. This approach is more computationally intensive than median and MAD, but I think it could be more useful in practice, because ranges carry more information than point estimates." There are many ways to display data: draw a histogram, display the mean, etc. But let me explain the basic use case: the perf module is written for developers who want to check if an optimization makes the code faster or slower, and so compare two benchmark results. Currently, most Python developers use the minimum since it's the de facto standard from the timeit module. Using the minimum is simple to understand and, with enough samples, it's a little bit reliable. Multiple ranges seem overcomplicated for developers. Most of them don't know anything about statistics and only want a single number. I'm trying to do better by providing two numbers: mean + standard deviation. The perf module is too young to know if users and developers are able to understand that.
I had a discussion on the #pypy IRC channel to try to understand the difference between mean and median:
I understood that there are two different and incompatible points of view:
(1) estimate the performance that users will actually see on their computer, system noise included;
(2) compare two benchmark results reliably, which means minimizing the impact of system noise.
If I had to make a choice between the two use cases, I would pick (2) because it's more important to compare results and get reliable and reproducible benchmarks. If I understood correctly, the mean should be used for (1) and the median for (2).
About the median absolute deviation (MAD): it's still a little bit hard for me to understand why MAD would be better than the standard deviation. If I use mean+std dev or median+std dev, the range usually contains between 75% and 90% of the samples. If I use median+MAD, the range contains 50% of the samples. I guess that comes from the definition of MAD. So the MAD range looks smaller than the std dev range. Why do you consider that it's better to have a smaller range? Would you be OK with using the median to ignore outliers caused by the "system noise", but the standard deviation to "show" the system noise? In the development version, I'm now using the standard deviation to warn users when the benchmark "looks" unstable. Example of an unstable benchmark because the number of loops is too low:
Note: 11% comes from stdev(samples) / mean(samples). I'm not sure if this check is a reliable way to detect "unstable" benchmarks.
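That ratio is the coefficient of variation. A minimal sketch of the idea in plain Python; the 10% threshold and the function name are illustrative, not perf's actual implementation:

```python
import statistics

def looks_unstable(samples, threshold=0.10):
    """Flag a benchmark as unstable when the relative spread is large.

    Uses the coefficient of variation: stdev / mean. A value above the
    threshold (here 10%, an arbitrary cut-off) suggests noisy samples.
    """
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    return stdev / mean > threshold

print(looks_unstable([10.0, 10.1, 10.2, 10.1]))   # False: ~1% spread
print(looks_unstable([10.0, 12.5, 9.0, 14.0]))    # True: ~20% spread
```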
It looks like mean and median are almost equal when the benchmark is run with isolated CPUs. Example:
The difference is bigger without isolated CPUs, i.e. with the "system noise". Example on a very busy system:
Re: the two points of view: I agree that you want the second option, minimizing system noise for more comparability. Not all noise is made equal, so the median is the way to go. Re: MAD, if you're trying to detect overall spread as an indicator of system noise, then I agree that standard deviation might do a better job there. Really though, some benchmarks may naturally have a 10%+ normalized standard deviation, and what you're trying to do is detect the outliers. Consider printing the min-max range with the proportional deviation.
For boltons, I have a describe method inspired by Pandas, with the addition of MAD. It provides all the greatest hits:
But I understand if that's too much output. With decent histogram binning, the min comes right out and the max is usually not hard to infer. All said and done, I see the appeal of standard deviation for this case. It provides somewhat dampened noise detection.
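For reference, a rough sketch of how that boltons.statsutils summary could be used; the attribute and method names below are my recollection of the library's API, so double-check them against the version you install:

```python
# A sketch, assuming boltons is installed (pip install boltons).
from boltons.statsutils import Stats

timings = [10.1, 10.2, 10.0, 10.3, 10.1, 10.2, 25.0]  # made-up samples
stats = Stats(timings)

print(stats.median, stats.median_abs_dev)  # robust center and spread
print(stats.mean, stats.std_dev)           # classic center and spread
print(stats.describe())                    # the "greatest hits" summary
```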
Also it bears repeating for the record that the minimum was not simply the default choice that timeit arbitrarily made.
I'm not saying that we can't do better, but for a single number, I just wanted to recognize that the minimum has some strong points and a strong history. (Even Kevin concludes that he still uses the minimum.) Still, maybe the time for that has passed. Personally, I can't think of a more pragmatic advancement than more runs in separate processes with a histogram. That's what makes perf great!
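For context, the classic pattern being referenced is taking the minimum over several timeit repeats. A minimal sketch (the timed statement and the repeat counts are arbitrary):

```python
import timeit

# The long-standing idiom: repeat the measurement in-process and keep
# the fastest run, on the theory that the minimum is the least noisy.
runs = timeit.repeat("sum(range(100))", repeat=5, number=100_000)
print(f"best of 5: {min(runs):.4f} s for 100,000 loops")
```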
There are other sources of randomness than hash randomization, like address space randomization, command line arguments, environment variables, etc.
Oh, I've read the post! ASLR is real. But its performance impact is not made clear in that post. And regardless, compared to most big-name runtimes, CPython is still remarkably consistent. Not perfectly consistent, but remarkably consistent nonetheless. But the world is bigger than CPython 2 now.
(Oh, it looks like I lost my long comment. I had to rewrite it from scratch :-( ) I ran the bm_regex_v8.py benchmark on a very busy system. I ran random scripts to make my computer much slower, but not constantly: I stopped the scripts, then restarted them, etc. Result: I see two Gaussian curves:
In short, the benchmark is 1.6x slower when the system is busy. Statistics:
The blue median+MAD curve is the closest to the left Gaussian curve (performance when the system is idle). The red mean+std dev curve contains more samples than the two median curves. The green median+std dev curve is somehow a compromise between the two. Now the million-dollar question is: what is the "correct" curve? This example was created to exaggerate the effect of the system noise and to understand the purpose of mean/median and std dev/MAD. State of my mind:
Once you get into bimodal/multimodal data, such as that which inconsistent load creates, the game gets massively complex. The "center" loses meaning, because you really have two centers. This is where the histogram is absolutely necessary and maybe not sufficient. I still say median + standard deviation for providing a general two-number figure that can help with noise detection. The min-max range, interquartile range, and histogram would be my next steps.
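A small plain-Python sketch of those next steps: the min-max range, the interquartile range, and a crude text histogram. The sample values and the 1 ms bin width are arbitrary illustrations, and statistics.quantiles needs Python 3.8+.

```python
import statistics
from collections import Counter

samples = [10.1, 10.2, 10.0, 10.3, 16.1, 16.3, 15.9, 10.2]  # bimodal-ish

# Min-max range and interquartile range (IQR).
lo, hi = min(samples), max(samples)
q1, _, q3 = statistics.quantiles(samples, n=4)
print(f"range = {lo} .. {hi}, IQR = {q3 - q1:.2f}")

# Crude text histogram: bucket samples into 1 ms wide bins by truncation.
bins = Counter(int(s) for s in samples)
for bucket in sorted(bins):
    print(f"{bucket:>3} ms: {'#' * bins[bucket]}")
```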
Some other examples:
- call_simple: on this one, std dev seems inaccurate because it includes the "noise".
- telco: nice full Gaussian curve :-)
- timeit: "pass" instruction on Python 2.7, min=10 ns, max=11 ns, 1000 samples.
- timeit 2: "pass" on Python 3 with CPU isolation; 98% of 2500 samples have the value 14.4 ns! Raw data:
timeit 2, zoomed: again, here the std dev doesn't seem accurate since it includes the "noise", the long tail on the right.
FYI I started to work on a "stats" command for perf:
The "range buckets" part should go away as soon as this issue is fixed. We can extend easily this command since it will not be the default. |
That's great! I may have a pull request for you in the near future in that case.
Issue #1:
* Replace mean() with median() to display the summary of a benchmark
* Add Benchmark.mean() method
I released perf 0.4, which uses boltons.statsutils (if available) in the stats and hist_scipy commands. The stats command computes the mean, median, standard deviation, median absolute deviation and skewness. Just after the perf 0.4 release, I replaced mean with median in the development branch.
I removed mean from perf and finished replacing mean with median everywhere. I plotted a lot of benchmarks, and I'm still not fully convinced that the median absolute deviation (MAD) is "better" than the standard deviation (stdev). I kept the standard deviation and removed MAD from perf. In my experience, stdev makes it easier to identify when a benchmark is unstable, and it better represents the actual benchmark results. IMO MAD hides too much information. The temptation to ignore/remove/hide outliers is high, but I'm not convinced yet that it is correct to do that. Anyway, this discussion was helpful: at least, I replaced mean with median :-) I also replaced "Average: ... +- ..." with "Median +- std dev: ... +- ..." to be very explicit about what is displayed, so it becomes possible to change that later.
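A minimal sketch of computing and formatting such a summary line in plain Python; the helper name and the unit are illustrative, not perf's actual code:

```python
import statistics

def format_summary(samples, unit="ns"):
    """Summarize samples as 'Median +- std dev: <median> +- <stdev> <unit>'."""
    median = statistics.median(samples)
    stdev = statistics.stdev(samples)
    return f"Median +- std dev: {median:.1f} +- {stdev:.1f} {unit}"

print(format_summary([14.4, 14.4, 14.5, 14.4, 16.2]))
```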
Yep! Couldn't have said it better myself. Glad this helped!
Example where MAD seems over-optimistic. The benchmark was run on a very busy system; it was a deliberate test to check how perf describes such "random" samples with a lot of variation. MAD is "+- 1.2 ms", whereas std dev is "+- 112.8 ms".
I mean, that's fine. You're designing the stats to match a certain intuition, which is probably wise, given how people will use them. You could get a similar, if not better, nonparametric result with ranges or trimmed ranges. Standard deviation isn't the worst for this case. And standard deviation feels familiar. But how intuitive is it really? In a normal distribution, the standard deviation is ~33% of the range. That's not really an intuitive calculation, and that's the one that's most commonly taught. It gets harder for other distributions. So really the standard deviation here is being used as a semi-arbitrary indicator of variation that enables some comparison between runs. It's not the worst thing. Still, I'm going to look to the median and histogram the most. :) Keep up the good work!
Hey Victor! Great library you've got going here (not a surprise; I've been a fan since long before faulthandler was in the stdlib).
As I mentioned on Twitter, I think perf would benefit from using a more robust set of measures than average + standard deviation. I recommended the median and the median absolute deviation. I have a well-rounded, Python 2/3 compatible implementation in boltons.statsutils. You can lift the whole module if you'd like, or just reimplement those functions.
In addition to making the statistics more robust, this also gets rid of the pesky problem of whether to use Bessel's correction. The Python 3 statistics module is rife with it, and in this domain it raises more questions than it answers.
I've written a bit about statistics for exactly cases like these, so I'm obligated to link: Statistics for Software. Let me know if that helps and/or if I can help clarify further.
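To make the recommendation concrete, here is a minimal plain-Python sketch of reimplementing those two functions from their usual definitions (the median, and MAD as the median of absolute deviations from the median); note that no Bessel-style n vs. n-1 choice ever comes up. This is a sketch, not boltons' actual code.

```python
def median(data):
    """Middle value of the sorted data (mean of the two middle values for even n)."""
    values = sorted(data)
    n = len(values)
    mid = n // 2
    if n % 2:
        return values[mid]
    return (values[mid - 1] + values[mid]) / 2

def median_abs_dev(data):
    """Median absolute deviation: median of |x - median(data)|.

    Unlike the standard deviation, there is no n vs. n-1 (Bessel's
    correction) question here, and a single outlier barely moves it.
    """
    data = list(data)
    center = median(data)
    return median(abs(x - center) for x in data)

# Example: adding one big outlier barely changes the MAD.
print(median_abs_dev([10.1, 10.2, 10.0, 10.3, 10.1]))        # ~0.1
print(median_abs_dev([10.1, 10.2, 10.0, 10.3, 10.1, 99.0]))  # still ~0.1
```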