testing: autodetect appropriate benchtime #10930

Open
josharian opened this Issue May 22, 2015 · 14 comments

josharian (Contributor) commented May 22, 2015

For discussion:

There is tension in how long to run benchmarks for. You want to run long, in order to make any overhead irrelevant and to reduce run-to-run variance. You want to run short, so that it takes less time; if you have a fixed amount of computing time, it'd be better to run multiple short tests, so you can do better analysis than taking the mean, perhaps by using benchstat.

Right now we use a fixed duration, which is ok, but we could do better. For example, many of the microbenchmarks in strconv appear stable at 10ms, which is 100x faster than the default of 1s.

Rough idea, input welcomed:

The time to run a benchmark is V+C*b.N, where b.N is the number of iterations and V and C are random variables -- V for overhead, C for code execution time. We can take measurements using different b.N (starting small and growing) and estimate V and C. Based on that, we can calculate what b.N value is required to reduce the contribution of V to the sum to some fixed limit, say 1%.

This should allow stable, fast benchmarks to execute very quickly. Slower benchmarks would get slower (you have to execute with b.N=1 and 2 at a bare minimum), but that's better than accidentally misleading the user into thinking that they have a meaningful performance number, which is what can currently happen.

We would probably want to change benchtime to be a cap on running time and increase the default value substantially. If stable numbers are not achievable within the provided benchtime, we would warn the user, who could increase the benchtime or change the benchmark.

I put together a quick-and-dirty version of this using linear regression to estimate V and C. It almost immediately caught a badly behaved benchmark (fixed in CL 10053), when it estimated that the benchmark would take hours to run in order to be reliable. I haven't run it outside the encoding/json package; I imagine that there are other benchmarks that need fixing.

Again, input welcomed. I'm not a statistician; I don't even play one on TV.

@josharian josharian added this to the Go1.6 milestone May 22, 2015

minux (Member) commented May 22, 2015

I have another idea.

We can add two new interfaces to the testing package. One lists the benchmarks/tests/examples within a test binary; this is trivial. The other specifies how many times to run a given benchmark and reports the result no matter what (disregarding benchtime).

Then we can have an external tool drive the benchmarks and do proper statistical analysis.

I really want benchcmp or some other tool (e.g. Russ's benchstat) to be given old and new binaries, automagically do the right thing, and give me the result.

The benefit is that we decouple the statistics engine from the testing package.

If we want to do better, we could even make a benchmark server listen for JSON-RPC calls, but that's probably too much for the testing package.

josharian (Contributor) commented May 22, 2015

The prototype implementation I referred to was done by hacking in a stupid line-oriented benchmark server and writing a simple external driver program. So we're thinking along similar lines.

The advantage of a server is that it avoids the overhead of executing the binary each time, which might prove significant when we're talking about 5ns-per-op benchmarks.

I'm agnostic about whether the statistics engine should be internal or not. I just want it to be good, and I know that (1) that requires testing support and (2) I personally lack the domain expertise to make it awesome. :)

josharian (Contributor) commented May 22, 2015

Oh, and I have a simple driver script that invokes benchmarks in a loop and uses benchstat to print rolling results. I'd love to invest in making it sophisticated and generally useful (right now it is very tuned to my personal setup and habits), but again I lack the statistics expertise to design it correctly. I agree that accepting two test binaries is the right API.

minux (Member) commented May 23, 2015

josharian (Contributor) commented May 26, 2015

I'm interested in doing this, but not using unsafe.

minux (Member) commented May 26, 2015

josharian (Contributor) commented May 26, 2015

Some simple codegen could provide the flag and the list of benchmarks. The basic implementation of benchmarking is pretty simple and could be copy/pasted; CL 10020 makes it simpler yet.

I'd still rather design proper support for core building blocks into the stdlib, but I guess this would be better than nothing.

Any interest in working on this with me?

josharian (Contributor) commented May 28, 2015

@minux well, a combination of unsafe, reflection, and testing.M seemed like the best approach in the end. Please see package benchserve (godoc) for an initial implementation. Feedback most welcome. If it looks good to you, then we can turn our attention to making a good test driver, probably, as a first pass, a combination of bench and rod.

minux (Member) commented May 28, 2015

josharian (Contributor) commented May 28, 2015

I've updated benchserve to support the case in which there is already a TestMain. I also added an (aspirational, unwritten) client API to make writing drivers easier.

gaillard commented Jul 12, 2017

What about adding a -minbenchiterations flag that takes precedence over benchtime, so faster benchmarks can complete quickly and longer ones can still run enough times for a meaningful number?

josharian (Contributor) commented Jul 12, 2017

@gaillard I think -benchsplit (#19128) will help on this front.

gaillard commented Jul 14, 2017

@josharian Say I have two benchmarks A and B: A takes 10s to reach 1k iterations and B takes 1s to reach 1k iterations. I'd like each to run for just that long.
With benchsplit = 10, in order to get enough iterations I'd still have to specify benchtime=10s, and then B would also take 10s, right?

azavorotnii commented Jul 14, 2017

I think when we have benchmarks of different sizes, it would be much easier to have basic control from inside the benchmark itself. For example:

1. The ability to change b.duration:

```go
func BenchmarkSomethingBig(b *testing.B) {
	b.Duration = 10 * time.Second
}
```

2. The ability to set a minimal b.N:

```go
func BenchmarkSomethingUnstable(b *testing.B) {
	b.MinimalN(1000)
}
```

To me, it is much more convenient to adjust specific variables than to apply command-line options to all benchmarks at once.
