x/perf/cmd/benchstat: tips or quickstart for newcomers #23471

Open · mvdan opened this issue Jan 18, 2018 · 14 comments · 6 participants
@mvdan (Member) commented Jan 18, 2018

benchstat is a very useful tool, but if you're not familiar with what it does, it can be very confusing to use for the first time.

One such point of confusion is "how many times should my benchmarks run?". If one is used to benchcmp, where each benchmark is run exactly once before and after the change, trying the same workflow with benchstat results in something confusing like:

name                old time/op    new time/op    delta
Decode-4               2.20s ± 0%     1.54s ± 0%   ~     (p=1.000 n=1+1)

The answer here is that the user should run the benchmarks more times - at least 3 or 4 - to get p-values low enough for a meaningful result.

However, neither benchstat -h nor the godoc page is very clear on this, and neither offers a "quickstart" guide. The godoc page does show an example input with many runs, and does talk about "a number of runs" and p-values, but it's not very clear if you're unfamiliar with statistics and benchmarking.

I believe that a quick guide would greatly improve the usability of the tool - for example:

$ go test -bench=. -count 5 >old.txt
$ <apply source changes>
$ go test -bench=. -count 5 >new.txt
$ benchstat old.txt new.txt

I think it should also introduce other best practices, such as the following (a combined command is sketched after the list):

  • Using higher -count values if the benchmark numbers aren't stable
  • Using -benchmem to also get stats on allocated objects and space
  • Running the benchmarks on an idle machine that isn't on battery power (and with power management off?)
  • Adding -run='^$' or -run=- to each go test command to avoid also running the tests
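
Putting those together, a combined invocation might look like this (a sketch; the count of 5 is illustrative, and higher values only help):

$ go test -run='^$' -bench=. -benchmem -count 5 >old.txt
$ <apply source changes>
$ go test -run='^$' -bench=. -benchmem -count 5 >new.txt
$ benchstat old.txt new.txt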

I realise that some of these tips are more about benchmarking in general than about benchstat itself. But I think it's fine to have them all in one place, since you're going to be using benchstat alongside them anyway.

/cc @rsc @ALTree @aclements @AlekSi

@aclements (Member) commented Jan 18, 2018

These all sound like good things to include in the docs to me. If I don't get around to this myself, I'd be happy to review a CL.

I'd recommend doing at least 10 runs, if not 20 (where practical). Otherwise you might be tempted to compare them, see that the p-values are too high, and then get more samples, falling prey to the multiple testing problem. (I'd love for benchstat to support some form of sequential hypothesis testing, but alas.)

@AlekSi (Contributor) commented Jan 18, 2018

Daniel, thank you for creating this issue!

May I also request a short warning on stderr if the number of runs is not enough? Similar to the "no test files" warning from go test.

@ALTree (Member) commented Jan 18, 2018

Where would we put this? I'd say the README.md file in the repo may be a good place: it's already quite wordy, but at the moment it's not terribly useful as an introduction to benchstat. It could be reorganized into a small tutorial-style document.

@mvdan (Member) commented Jan 18, 2018

Well, I was thinking just the package godoc. That's where the current information on the tool is.

@mvdan (Member) commented Jan 18, 2018

@AlekSi that might work, but it's trickier than it may sound. What should the minimum be? If the numbers are very stable (as the number of allocations in a deterministic program tends to be, for example), 5 runs may be enough. Or you might not get any low p-values even if you do 20 runs.

However, such a warning could be added for 1 run, as that seems to be a common mistake.

@aclements 10 runs is probably better general advice. My default is 6 simply because I tend to split the benchmarks into multiple sub-benchmarks, meaning that even just 6 runs stalls me for a good minute. And I tend to go after substantial performance wins, in which case fewer than 10 runs are often enough.

I didn't know about the multiple testing problem. Is that basically saying that it's much worse to benchmark twice with -count 5 than once with -count 10, because with the former you are more likely to get wrong results?

@aclements (Member) commented Jan 18, 2018

I didn't know about the multiple testing problem. Is that basically saying that it's much worse to benchmark twice with -count 5 than once with -count 10, because with the former you are more likely to get wrong results?

Benchmarking twice with -count 5 is fine, as long as you don't look at the results after the first run. :) (In fact, just in terms of running benchmarks, it's better to alternate individual runs back and forth in case there are systemic effects.)

The multiple testing problem arises when you look at the results and then decide whether to run more iterations based on the p-value. Since you'll be happy and stop if the p-value is small (or sad, if it got worse, but the same argument applies :), you bias toward accepting a "significant" result. But with a p-value of 0.05, you have a 5% chance that any random result will be "significant". In the extreme, if you imagine that you just keep running iterations until p < 0.05, you'll eventually get what you're looking for, even if it's not real.

There are ways to deal with this. For example, you can keep lowering the p-value threshold you require each time you look at the results (the "Bonferroni correction"). But this sort of thing gets really complicated really fast, so the easiest rule is just to pick an n and stick with it. :)

In fact! The exact same problem arises if you run a whole bunch of separate benchmarks and look at whether any of them changed. You should expect 5% of them to show a statistically significant change, even if it's just noise. I don't think all of the sequential testing stuff I said above is worth explaining in the benchstat docs, but it might be worth mentioning this.
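
(For concreteness: treating the benchmarks as independent, the chance that at least one of n unchanged benchmarks still looks "significant" at p < 0.05 is 1 - 0.95^n - roughly 23% for n=5, 40% for n=10, and 64% for n=20.)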

@mvdan (Member) commented Jan 18, 2018

You should expect 5% of them to show a statistically significant change, even if it's just noise.

Welp, I'm likely doing this wrong now then. Instead of having five benchmarks with different inputs for a parser, perhaps I should just have a single benchmark with a mix of all the kinds of input. That would also make it easier to use higher -count values.

This should definitely be an item in the best practices list :)

@dgryski (Contributor) commented Jan 18, 2018

I find git stash helps the workflow while benchmarking old vs new code.

@mvdan (Member) commented Jan 18, 2018

I use git stash for this too, but I have found that it can be confusing for someone who isn't used to it. Perhaps it could be another one of the tips, rather than part of the quickstart.
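
For reference, a minimal sketch of that workflow (assuming the old code is committed and your changes are still uncommitted):

$ go test -run='^$' -bench=. -count 5 >new.txt
$ git stash        # put the working tree back to the old code
$ go test -run='^$' -bench=. -count 5 >old.txt
$ git stash pop    # restore the changes
$ benchstat old.txt new.txt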

@aclements (Member) commented Jan 18, 2018

Welp, I'm likely doing this wrong now then. Instead of having five benchmarks with different inputs for a parser, perhaps I should just have a single benchmark with a mix of all the kinds of input.

That's not necessarily wrong, you just have to understand that there's always a false positive rate. Even if it's just a single benchmark, there's a 5% chance of benchstat reporting noise as a significant change. It's just that if you have a whole benchmark set and each individual benchmark has a 5% chance of a false positive, then there's a higher collective chance that some benchmark in the set is going to have a false positive.

@mvdan (Member) commented Jan 20, 2018

Another piece of advice could be whether to do go test -bench=... -count=N or for i in {1..N}; do go test -bench=...; done.

I assume calling go test multiple times is slower, though. I wonder if an option to alternate benchmark runs would be a good idea, as -count currently just runs the same benchmark multiple times in a row.
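
For reference, the two forms side by side (a sketch; {1..10} assumes a shell with brace expansion, such as bash):

$ go test -run='^$' -bench=. -count=10 >old.txt
$ for i in {1..10}; do go test -run='^$' -bench=. >>old.txt; done

benchstat parses the concatenated output the same way in both cases, but the loop re-invokes the go tool on every iteration, which is typically slower.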

@dgryski (Contributor) commented Jan 20, 2018

Rather than calling go test multiple times, run go test -c and run the resulting binary multiple times. If you keep the original test binary around, you don't need to worry about swapping back and forth between source code changes with git stash.
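
A sketch of that, combined with the earlier suggestion to alternate individual runs back and forth (note that a compiled test binary takes the usual flags with a test. prefix):

$ go test -c -o old.test
$ <apply source changes>
$ go test -c -o new.test
$ for i in {1..10}; do ./old.test -test.run='^$' -test.bench=. >>old.txt; ./new.test -test.run='^$' -test.bench=. >>new.txt; done
$ benchstat old.txt new.txt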

@josharian (Contributor) commented Jan 21, 2018

I wonder if an option to alternate benchmark runs would be a good idea, as -count just runs the same benchmark multiple times in a row at the moment.

I believe that e.g. -cpu=8,8,8,8 should cycle through the benchmarks four times. Compare with -count=4, which runs each benchmark four times before moving on to the next.

Or write a driver script, a la compilebench. See also my abandoned attempts at writing tooling for a more generic driver script at #10930.

run go test -c and run the binary multiple times

And don't forget -run=NONE.

@mvdan (Member) commented Jan 29, 2018

Thanks for the tips regarding cycling through benchmarks. Whatever the best solution at the moment would be, we should document it. And if a better tool or method surfaces at a later point, we can just change the docs.

And don't forget -run=NONE.

Already mentioned in the list in my initial post, though I prefer -run=^$ or -run=- since those leave no possibility of actually matching a test.

Quasilyte added a commit to Quasilyte/kfu-go-2018 that referenced this issue Apr 1, 2018 (tasks: add several new tracked tasks)