x/perf/benchstat: bogus "no statistical difference" report when times are the same #19634

Open · mvdan opened this issue Mar 21, 2017 · 10 comments · 5 participants

mvdan commented Mar 21, 2017

Moving from rsc/benchstat#7, which now appears to be a dead issue tracker.

In short: if the benchmark times happen to all be the same, benchstat seems to discard them.

old.txt:

BenchmarkFloatSub/100-4         20000000           115 ns/op          64 B/op          1 allocs/op
BenchmarkFloatSub/100-4         20000000           114 ns/op          64 B/op          1 allocs/op
BenchmarkFloatSub/100-4         20000000           115 ns/op          64 B/op          1 allocs/op
BenchmarkFloatSub/100-4         20000000           115 ns/op          64 B/op          1 allocs/op
BenchmarkFloatSub/100-4         20000000           115 ns/op          64 B/op          1 allocs/op

new.txt (note that all the times are the same: 78.8 ns/op):

BenchmarkFloatSub/100-4         20000000            78.8 ns/op         0 B/op          0 allocs/op
BenchmarkFloatSub/100-4         20000000            78.8 ns/op         0 B/op          0 allocs/op
BenchmarkFloatSub/100-4         20000000            78.8 ns/op         0 B/op          0 allocs/op
BenchmarkFloatSub/100-4         20000000            78.8 ns/op         0 B/op          0 allocs/op
BenchmarkFloatSub/100-4         20000000            78.8 ns/op         0 B/op          0 allocs/op

benchstat old.txt new.txt gives:

name            old time/op    new time/op    delta
FloatSub/100-4     115ns ± 0%      79ns ± 0%      ~     (p=0.079 n=4+5)

i.e. it reports no statistically significant difference (the "~"), which is clearly wrong.

cc @ALTree @rsc

mvdan commented Mar 21, 2017

Also see Alberto's suggested fix at rsc/benchstat#7 (comment).

mvdan added this to the Unreleased milestone Mar 21, 2017

mvdan commented Aug 23, 2017

Friendly ping @rsc - do you think Alberto's suggested fix above would make sense?

@ALTree have you considered posting your patch to Gerrit? I'd be happy to help if you want, but I assume you'd want to do it yourself to keep authorship.

ALTree commented Aug 23, 2017

@mvdan I never sent a patch because I'm not 100% sure the explanation I gave on the old repo's issue tracker is correct. If you're convinced it is correct, feel free to send a change.

mvdan commented Aug 23, 2017

Ah, best not to send it then.

rsc commented Aug 24, 2017

@aclements

aclements commented Aug 24, 2017

The problem here isn't the significance test; it's the outlier rejection combined with the small sample size.

As an order test, Mann-Whitney has a floor on the p-value that depends on the number of samples (not their values). p=0.079 is simply the lowest p-value you can get with n=4, m=5. The significance test isn't failing: it's genuinely saying that, with so few samples, the chance of getting that order randomly is 0.079.

If you change the 114 ns/op to 115 ns/op, there's even less variance, but now outlier rejection doesn't kick in, so you get n=5, m=5 and a p-value of 0.008, which is considered significant.
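
For reference, here is a rough sketch of the 1.5×IQR filtering described above (not benchstat's actual code; the quartile interpolation details may differ):

// Rough sketch of 1.5×IQR outlier filtering, for illustration only.
package main

import (
	"fmt"
	"sort"
)

// removeOutliers keeps only values inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
func removeOutliers(values []float64) []float64 {
	sorted := append([]float64(nil), values...)
	sort.Float64s(sorted)
	q1 := percentile(sorted, 0.25)
	q3 := percentile(sorted, 0.75)
	lo, hi := q1-1.5*(q3-q1), q3+1.5*(q3-q1)
	var kept []float64
	for _, v := range values {
		if lo <= v && v <= hi {
			kept = append(kept, v)
		}
	}
	return kept
}

// percentile does simple linear interpolation on a sorted slice.
func percentile(sorted []float64, p float64) float64 {
	pos := p * float64(len(sorted)-1)
	i := int(pos)
	if i+1 >= len(sorted) {
		return sorted[len(sorted)-1]
	}
	frac := pos - float64(i)
	return sorted[i]*(1-frac) + sorted[i+1]*frac
}

func main() {
	old := []float64{115, 114, 115, 115, 115}
	fmt.Println(removeOutliers(old)) // prints [115 115 115 115]
}

With old.txt's samples, Q1 = Q3 = 115 ns/op, so the accepted interval collapses to a single point and the lone 114 ns/op measurement is dropped, leaving n=4.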

I think the real bug here is that we're doing outlier rejection before computing an order statistic. We probably shouldn't do that. But if we still want to do outlier rejection for computing the mean ± x%, then I'm not sure how to present the sample size. Maybe we shouldn't be doing outlier rejection for that either. Perhaps we should be reporting a trimmed mean and its standard error?
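
For concreteness, one textbook formulation of that last idea is a trimmed mean with a standard error derived from the Winsorized variance (Tukey–McLaughlin style). This is only an illustration, not a concrete proposal for benchstat:

// Trimmed mean and a Winsorized-variance standard error, for illustration only.
package main

import (
	"fmt"
	"math"
	"sort"
)

// trimmedMeanSE trims g values from each end of vals and returns the mean
// of the remaining values plus an estimate of its standard error.
func trimmedMeanSE(vals []float64, g int) (mean, se float64) {
	n := len(vals)
	sorted := append([]float64(nil), vals...)
	sort.Float64s(sorted)

	// Trimmed mean: average of the middle n-2g values.
	for _, v := range sorted[g : n-g] {
		mean += v
	}
	mean /= float64(n - 2*g)

	// Winsorized sample: clamp the trimmed tails to the nearest kept value.
	win := append([]float64(nil), sorted...)
	for i := 0; i < g; i++ {
		win[i] = sorted[g]
		win[n-1-i] = sorted[n-1-g]
	}
	var wmean, wvar float64
	for _, v := range win {
		wmean += v
	}
	wmean /= float64(n)
	for _, v := range win {
		wvar += (v - wmean) * (v - wmean)
	}
	wvar /= float64(n - 1)

	// Standard error of the trimmed mean: s_w / ((1 - 2g/n) * sqrt(n)).
	se = math.Sqrt(wvar) / ((1 - 2*float64(g)/float64(n)) * math.Sqrt(float64(n)))
	return mean, se
}

func main() {
	old := []float64{115, 114, 115, 115, 115}
	mean, se := trimmedMeanSE(old, 1)
	fmt.Printf("trimmed mean %.2f ± %.2f ns/op\n", mean, se)
}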

Independently, perhaps benchstat should report when the sample sizes are too small to ever produce a significant result.

AlekSi commented Jan 18, 2018

I hit the same or a similar problem with a "macro" benchmark. My data:

old.txt:

BenchmarkDecode-4             	       5	2203036637 ns/op	1885067576 B/op	13009444 allocs/op
BenchmarkDecodeConcurrent-4   	       5	2258279089 ns/op	1885066763 B/op	13009433 allocs/op

new.txt:

BenchmarkDecode-4             	      10	1543669126 ns/op	1917568089 B/op	13009415 allocs/op
BenchmarkDecodeConcurrent-4   	      10	1561361127 ns/op	1917567531 B/op	13009413 allocs/op

benchcmp works as expected:

-> benchcmp old.txt new.txt
benchmark                       old ns/op      new ns/op      delta
BenchmarkDecode-4               2203036637     1543669126     -29.93%
BenchmarkDecodeConcurrent-4     2258279089     1561361127     -30.86%

benchmark                       old allocs     new allocs     delta
BenchmarkDecode-4               13009444       13009415       -0.00%
BenchmarkDecodeConcurrent-4     13009433       13009413       -0.00%

benchmark                       old bytes      new bytes      delta
BenchmarkDecode-4               1885067576     1917568089     +1.72%
BenchmarkDecodeConcurrent-4     1885066763     1917567531     +1.72%

benchstat doesn't think that -30% is significant:

-> benchstat old.txt new.txt
name                old time/op    new time/op    delta
Decode-4               2.20s ± 0%     1.54s ± 0%   ~     (p=1.000 n=1+1)
DecodeConcurrent-4     2.26s ± 0%     1.56s ± 0%   ~     (p=1.000 n=1+1)

name                old alloc/op   new alloc/op   delta
Decode-4              1.89GB ± 0%    1.92GB ± 0%   ~     (p=1.000 n=1+1)
DecodeConcurrent-4    1.89GB ± 0%    1.92GB ± 0%   ~     (p=1.000 n=1+1)

name                old allocs/op  new allocs/op  delta
Decode-4               13.0M ± 0%     13.0M ± 0%   ~     (p=1.000 n=1+1)
DecodeConcurrent-4     13.0M ± 0%     13.0M ± 0%   ~     (p=1.000 n=1+1)

mvdan commented Jan 18, 2018

@AlekSi that is expected: you need to run each benchmark multiple times; that's what the n at the end of each line means (before+after). Usually you'll need at least 4 or 5 runs to get a low enough p-value. If the numbers vary quite a bit, you might need 10 or 20 runs for benchstat to be happy with the p-value.
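
For example (the benchmark pattern, package path, and file names here are only placeholders), something like:

go test -run='^$' -bench=Decode -count=10 . > new.txt
benchstat old.txt new.txt

Do the same for the old code so that both files contain around ten samples per benchmark; -count is the standard go test flag for repeating each benchmark, and -run='^$' skips the regular tests.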

AlekSi commented Jan 18, 2018

Thank you. I wish this were clearer in the command output and documentation, though.

mvdan commented Jan 18, 2018

@AlekSi good point; I also struggled to use benchstat at first. I have filed #23471 for better docs.
