x/perf/benchstat: bogus "no statistical difference" report when times are the same #19634
Moving from rsc/benchstat#7, which now appears to be a dead issue tracker.
In short: if your benchmark times all happen to be the same,
new.txt (note that all the times are the same: 78.8 ns/op):
benchstat old.txt new.txt gives:
That is, it reports "no statistically significant improvement", which is clearly wrong.
The problem here isn't the significance test, it's the outlier rejection combined with the small sample size.
As an order test, Mann-Whitney has a floor on the p-value that depends on the number of samples (not their values). p=0.079 is simply the lowest p-value you can get with n=4, m=5. The significance test isn't failing. It's genuinely saying that with so few samples, the chance of getting that order randomly is 0.079.
If you change the 114 ns/op to 115 ns/op, there's even less variance, but now outlier rejection doesn't kick in, so you get n=5, m=5 and a p-value of 0.008, which is considered significant.
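To make the sample-size floor concrete, here is a minimal sketch in Go. It assumes an exact, two-sided Mann-Whitney U test on tie-free samples, where only the two fully separated orderings are the most extreme ones out of C(n+m, n) equally likely orderings. The function name `minTwoSidedP` is made up for illustration and is not part of benchstat. Tied values, such as the repeated 78.8 ns/op above, can push the attainable floor higher than this tie-free bound, so treat the numbers as a best case.

```go
package main

import (
	"fmt"
	"math/big"
)

// minTwoSidedP returns the smallest two-sided p-value that an exact,
// tie-free Mann-Whitney U test can report for sample sizes n and m.
// Hypothetical helper for illustration; not benchstat's API.
func minTwoSidedP(n, m int64) float64 {
	// Number of equally likely orderings of the pooled samples.
	orderings := new(big.Int).Binomial(n+m, n)
	f, _ := new(big.Float).SetInt(orderings).Float64()
	// Two orderings are maximally extreme: all of one sample below
	// the other, or all of it above.
	return 2 / f
}

func main() {
	for _, s := range [][2]int64{{3, 3}, {4, 4}, {5, 5}} {
		fmt.Printf("n=%d m=%d floor=%.4f\n", s[0], s[1], minTwoSidedP(s[0], s[1]))
	}
}
```

For n=5, m=5 this gives 2/252, roughly 0.008, which matches the significant result quoted above. The values of the samples never enter the computation, only their counts.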
I think the real bug here is that we're doing outlier rejection before computing an order statistic. We probably shouldn't do that. But if we still want to do outlier rejection for computing the mean ± x%, then I'm not sure how to present the sample size. Maybe we shouldn't be doing outlier rejection for that either. Perhaps we should be reporting a trimmed mean and its standard error?
Independently, perhaps benchstat should report when the sample sizes are too small to ever get a significant result.
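A rough sketch of what such a check could look like, in Go. It assumes an exact, tie-free, two-sided Mann-Whitney U test and a significance threshold of 0.05. `canReachSignificance` is a hypothetical name, not an existing benchstat function, and ties in real data could make the true requirement stricter.

```go
package main

import (
	"fmt"
	"math/big"
)

// canReachSignificance reports whether samples of size n and m could
// ever yield p < alpha under an exact, tie-free, two-sided
// Mann-Whitney U test. Hypothetical helper; not benchstat's API.
func canReachSignificance(n, m int64, alpha float64) bool {
	// The best achievable p-value is 2 / C(n+m, n).
	orderings := new(big.Int).Binomial(n+m, n)
	f, _ := new(big.Float).SetInt(orderings).Float64()
	return 2/f < alpha
}

func main() {
	const alpha = 0.05 // assumed threshold
	for n := int64(1); n <= 5; n++ {
		fmt.Printf("n=m=%d reachable=%v\n", n, canReachSignificance(n, n, alpha))
	}
}
```

Under these assumptions, three runs per side can never reach p < 0.05 (the floor is 2/20 = 0.1), while four runs per side can (2/70, about 0.029). A warning keyed on this kind of check would tell users up front that more runs are needed.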
I hit the same or a similar problem with a "macro" benchmark. My data:
@AlekSi that is expected - you'll need to run each benchmark multiple times - that's what the