testing: add -benchtime=100x (x suffix for exact count) #24735

Closed
wmorrow opened this issue Apr 6, 2018 · 33 comments

@wmorrow commented Apr 6, 2018

Currently, benchmarks built using testing's Benchmark interface run a variable number of iterations, auto-adjusting until the run lasts at least -benchtime seconds. This complicates HW PMU counter collection and A/B comparisons because the amount of work is (potentially) variable and the time-adjustment code can muddle the benchmark results. It can also easily overshoot the amount of time you expect it to take:

$ go test archive/zip -benchtime 30s -run=Benchmark -bench=BenchmarkZip64Test$
goos: linux
goarch: amd64
pkg: archive/zip
BenchmarkZip64Test-8        1000          59382178 ns/op
PASS
ok      archive/zip     65.380s

The proposal

The proposal is to add a new flag to go test that circumvents the adjustment process and runs a benchmark for an exact, user-defined number of iterations:

$ go test archive/zip -benchiterations 500 -run=Benchmark -bench=BenchmarkZip64Test$
goos: linux
goarch: amd64
pkg: archive/zip
BenchmarkZip64Test-8         500          59419401 ns/op
PASS
ok      archive/zip     29.775s

References

A draft of this change is already on Gerrit (+92617)

@gopherbot added this to the Proposal milestone Apr 6, 2018

@gopherbot added the Proposal label Apr 6, 2018

@bcmills (Member) commented Apr 6, 2018

(CC @rsc @josharian; see also #19128 and #10930.)

@rsc (Contributor) commented Apr 9, 2018

I'm reluctant to add more complexity to the benchmark flags. We just added benchsplit last round.
/cc @aclements

@josharian (Contributor) commented Apr 11, 2018

The main time I've wanted this recently is when working around benchmarks that are non-linear (i.e. broken) and hard to fix. But I share the reluctance to add more complexity here.

@aclements (Member) commented Apr 11, 2018

@dr2chase had some recent situation where this would have been useful, though I'm blanking on the details.

I'm also reluctant to add more flags. Sometimes, however, the non-linearity isn't the benchmark's fault, but GC's. For example, 1,000 iterations vs 2,000 iterations could be the difference between 1 GC cycle and 3 GC cycles.

@rsc (Contributor) commented Apr 16, 2018

It bothers me that this is only meaningful for a single benchmark at a time. Is there something different we can do that addresses the underlying problem?

@aclements (Member) commented Apr 16, 2018

If we were less aggressive about rounding to "nice" iteration counts, the effect of this would be damped.

For smoothing out GCs specifically, we could track the allocation rate of the benchmark to predict the number of GC cycles. If the number of cycles would be small, bias the next selected iteration count toward a nice value that would stay away from the edge between n and n+1 GC cycles.

@cespare (Contributor) commented Apr 16, 2018

Possibly related to #23423.

> If we were less aggressive about rounding to "nice" iteration counts, the effect of this would be damped.

This behavior has annoyed me in the past. On #23423 I wrote

> [...] it would've been nice if go test had picked a smaller value than 1000 here. It took 10s when I asked for 5s.

@wmorrow (Author) commented Apr 16, 2018

From my perspective the goal is to have some way to perform a set amount of work in a benchmark, rather than a (potentially) variable amount based on execution time or some other heuristic. Anything that can fix that amount of work to a reproducible value would be fine.

Specifically I'm thinking of cases where I want to compare hardware counters on two different hosts or between two versions of Go (with and without some patch). Assuming the two instances (hosts/toolchains/etc) have some different performance profile, the faster instance will likely perform a different amount of work than the slower instance if only because it uses different iteration counts when scaling. If the two instances also settle to different iteration counts the problem is exacerbated.

Another strategy I can think of off the top of my head is to read some datafile (perhaps a previous log?) and use the number of iterations set there. It's certainly not elegant but I'm also dubious that a heuristic based on measured performance of the instance can accurately reproduce a set amount of work. I think at some point you'll have to have an iteration count recorded or passed somewhere.

@meirf (Contributor) commented Apr 21, 2018

> I'm reluctant to add more complexity to the benchmark flags. We just added benchsplit last round.

As a quick note, the benchsplit proposal was accepted but the feature was not added; the review hasn't been merged. I'm not mentioning this as a nit; rather, there is some debate on that review that relates to the complexity/reluctance (and needs a decision to move forward).

https://go-review.googlesource.com/c/go/+/47411/2/src/testing/benchmark.go#317
https://go-review.googlesource.com/c/go/+/47411/5/src/testing/benchmark.go#296

@rsc (Contributor) commented May 7, 2018

It seems like if you want to do right by performance counters generally, you need to have the testing package collect them around the runs and report them, including dividing by the iteration count. For example the 1000 iterations are preceded by 1, 10, and 100 iterations, trying to guess how many will be needed for benchtime. Really you don't want to see the performance counters for those trial runs at all. I don't know of any proposals for adding performance counters, though, and that would be quite a complex API.

Alternatively, the big jumps you see, where sometimes a benchmark runs for 1000 iterations and sometimes 2000 to reach benchtime, would go away if we stopped rounding so aggressively, as @aclements suggested in #24735 (comment). Would that be enough to address the issue here, @wmorrow?

Originally I put the "nice" numbers in for eye-balling times more easily, but it really makes little sense given that you need to run 10 or so and run them through benchstat to have reliable statistics anyway.

@josharian (Contributor) commented May 7, 2018

a12cc71 took a small step toward removing aggressive rounding. I'm all for ditching it entirely. We should also consider eliminating or reducing the multiplier. (See that commit message for details.)

@aclements (Member) commented May 7, 2018

> I don't know of any proposals for adding performance counters, though, and that would be quite a complex API.

#21295

@wmorrow, it seems like any performance counter comparison needs to account for warm-up and divide by the number of iterations anyway. If you're using perf for this, it seems like perf record --switch-output (maybe with -s if you just want counts) would make it easy to capture just the last run of the benchmark, and then the counts should be divided by the iteration count reported in the benchmark output.

@gopherbot commented May 8, 2018

Change https://golang.org/cl/112155 mentions this issue: testing: stop rounding b.N

@josharian (Contributor) commented May 8, 2018

I wanted to see the impact, so I hacked together CL 112155.

Thinking about that CL raises another concern.

Suppose we have a benchmark with high variance. We use our estimates to try to get near the benchtime. We are more likely to exceed the benchtime if we get a particularly slow run. A particularly fast run is more likely to trigger another benchmark run.

The current approach thus introduces bias.

One simple way to fix this would be to decide when our estimate is going to be "close enough", that is, when we are one iteration away from being done, and then stick with that final iteration even if it falls short of the benchtime.

There are probably other fixes, too...including setting a fixed number of iterations up front. :)

@josharian (Contributor) commented May 8, 2018

Oh, and note that making our iteration estimates ever more accurate (e.g. by eliminating rounding) actually exacerbates the bias-for-slow-runs problem, rather than ameliorating it.

@wmorrow (Author) commented May 8, 2018

> I don't know of any proposals for adding performance counters, though, and that would be quite a complex API.

I agree, I think you'd end up having to rewrite perf/etc or call out to it and probably lose some expressiveness in the process.

> If you're using perf for this, it seems like perf record --switch-output (maybe with -s if you just want counts) would make it easy to capture just the last run of the benchmark, and then the counts should be divided by the iteration count reported in the benchmark output.

If I'm reading this correctly you're suggesting adding a SIGUSR2 (or some other signal) when restarting the benchmark with some new N so that the outer perf can segment the perf.data files accordingly?

That could work to some extent (at least to ensure there's some known amount of work represented in the perf record), but it seems like a complicated solution to the problem. I suppose you save on having to define iteration counts for each benchmark but you have to do more work after the run to ensure you're matching the right perf data to the right benchmark. It's also not portable for collecting data in situations where we're not using perf (as an example we use MAMBO to track visited basic blocks).

josharian added a commit to josharian/go that referenced this issue May 8, 2018

testing: stop rounding b.N
The original goal of rounding to readable b.N
was to make it easier to eyeball times.
However, proper analysis requires tooling
(such as benchstat) anyway.

Instead, take b.N as it comes.
This will reduce the impact of external noise
such as GC on benchmarks.

This requires doing our iteration estimation in floats instead of ints.
When using ints, extremely fast (sub-nanosecond) benchmarks
always get an estimated ns/op of 0, and with the reduced
rounding up, they converged much too slowly.

It also reduces the wall time required to run benchmarks.

Here's the impact of this CL on the wall time to run
all benchmarks once with benchtime=1s on some std packages:

name           old time/op       new time/op       delta
bytes                 306s ± 1%         238s ± 1%  -22.24%  (p=0.000 n=10+10)
encoding/json         112s ± 8%          99s ± 7%  -11.64%  (p=0.000 n=10+10)
net/http             54.7s ± 7%        44.9s ± 4%  -17.94%  (p=0.000 n=10+9)
runtime               957s ± 1%         714s ± 0%  -25.38%  (p=0.000 n=10+9)
strings               262s ± 1%         201s ± 1%  -23.27%  (p=0.000 n=10+10)
[Geo mean]            216s              172s       -20.23%

Updates golang#24735

Change-Id: I7e38efb8e23c804046bf4fc065b3f5f3991d0a15
@warpfork commented Jul 21, 2018

I have a very similar desire to what @wmorrow describes: I want to compare profiling info across code changes, or across packages which implement similar interfaces, or of the same code on different hardware, and so forth -- all situations where the iteration count detected by the bench tool is (quite sensibly) going to vary -- and have all the numbers be reasonably comparable with a bare minimum of further normalization.

In order to get stable human-handy info like this, it seems I want to either be able to declare a fixed number for b.N; or, have some way to get that number out afterwards so I can divide e.g. prof stats into average-per-op numbers.

A potential hack to work around this would be to also write Test1000Times_Foo functions to match every Benchmark_Foo function I have in the codebase; but this seems like a lot of boilerplate. It's a practical truism that the subjects of my benchmark functions are also what I want to evaluate any pprof'ing, etc, against, and so it would be nice if the test and benchmark run tooling would DWIM for this.

@rsc (Contributor) commented Jul 23, 2018

We're discussing a lot of complexity but even if we add it, it's hard to see whether it solves the real problems with warm-ups and the like. The SIGUSR2 is kind of weird but interesting. I bet if we sent SIGUSR2 to test binaries in general some packages would get mad, like when we tried to have package testing install a SIGINT handler.

@theckman (Contributor) commented Aug 5, 2018

To add my color from the other issue I opened (since GitHub search didn't lead me here):

I'm in the process of helping do a technical review of a Go book for newbies, and the author wants to show the performance benefits of using concurrency for web requests over serializing them. Originally, the author wanted to teach readers how to write benchmarks first, and to then use these benchmarking skills to compare a concurrent vs serial implementation. The issue here is that because b.N can't be set to a constant value of 1, the author doesn't want to cause a DDoS against golang.org (and another site) from people running this benchmark. Because we're looking to measure the performance difference on the magnitude of milliseconds, running the benchmark once should give us an accurate-enough result to show the performance improvement.

I have been in a similar spot too, where I wanted to benchmark something but only wanted the high-level/one run result because of the overall complexity of the thing I'm benchmarking. It would have at least reduced the time I was waiting for benchmarks to finish. As I recall, I ended up having to implement a package main that invoked my function with a time.Since. I would have much rather written a benchmark and invoked go test with the right inputs to do what I wanted.

Edit: For posterity, here's the other reason I'd seen this asked for (from my comment in the other issue):

While I'm less convinced of this use case, another ask I've seen in the Go community for this functionality is around using a constant value for your benchmarks so that you can compare the total wallclock time of the benchmarks. The reason I'm not convinced is that most people do (and should) care about the ns/op result on a per-benchmark basis, not the total wallclock time it took the benchmark suite to run. I think they wanted to use this as a way to automatically detect large changes in performance, without needing to parse the results and keep track of each benchmark between runs.

@josharian (Contributor) commented Aug 5, 2018

FWIW, my reaction to the "I really just want it to run once" style of benchmarking is that package main and time.Since is the right answer. Otherwise you'll always have to remember to set b.N=1 on every benchmark invocation; the opportunity for accidental DDOS is high.

I'd argue that doing so might also make for good pedagogy: Start with time.Since, discuss the shortcomings of that approach, graduate to benchmarks, and discuss their benefits.

@bcmills (Member) commented Aug 7, 2018

> because b.N can't be set to a constant value of 1, the author doesn't want to cause a DDoS against golang.org (and another site) from people running this benchmark.

If the author is concerned about the amount of traffic, then:
a. Congratulations are in order for a successful book! 🙂
b. They should provide a server that readers can start locally to produce (or simulate) the desired latency characteristics.
c. Or, they should target the benchmark against the book's own website. Since traffic is proportional to readers, and readers are proportional to sales, the more traffic the book produces, the more machines the author can afford in order to handle the traffic!

More seriously, though: benchmarks are experiments, and a low N is bad for benchmarking for exactly the same reason that it's bad for a scientific study. Ideally we should compute it based on the measured variability rather than a fixed duration, but deliberately setting it much too low is not a good practice to teach either way.

> Because we're looking to measure the performance difference on the magnitude of milliseconds, running the benchmark once should give us an accurate-enough result to show the performance improvement.

-benchtime already fills a similar niche: if you know that the operation only takes a few hundred milliseconds, just run it for a few hundred milliseconds.

@theckman (Contributor) commented Aug 7, 2018

The point was to keep the example simple and digestible for readers who are newer to programming. Adding more complexity to the example makes it easier for people to become overwhelmed.

Thank you for the feedback.

@magical (Contributor) commented Aug 14, 2018

Like @warpfork, I'm interested in an option like this to help with profiling.

go test -cpuprofile makes it very easy to identify hotspots in the code, but once identified i want to be able to reliably compare profiles both before and after a change. go tool pprof reports time in two units: percentage of runtime, and absolute runtime. It can't report time per iteration like go test -bench does because it has no concept of iterations. Percentage is out because, due to the potato paradox, comparing percentages can be extremely counterintuitive (if a function accounts for 99% of runtime and you make it twice as fast, it now takes 98%). That leaves absolute runtime, but if benchmarks always run for a constant amount of time, then this becomes as useless as percentages.

Things mostly work out today because, due to rounding, it is likely that go test -bench will select the same number of iterations each time, as long as your optimizations aren't extreme enough to kick you into a higher bucket. If CL 112155 goes in and rounding goes away, it seems like this problem will get worse.

My desired workflow goes something like this

  1. pick a benchmark to work on
  2. run go test -bench=foo to find a suitable number of iterations N to run it for
  3. run go test -bench=foo -benchiter=N -run=NONE -cpuprofile=cpu.prof; go tool pprof -top cpu.prof to identify hot functions
  4. make changes to the code
  5. run go test -bench=foo -benchiter=N -run=NONE -cpuprofile=cpu.prof; go tool pprof -top cpu.prof again to see if changes had the intended effect by comparing absolute run times from step 3
  6. repeat steps 4-5

I worked around this issue at first by fudging with -benchtime to get the number of iterations i wanted, but later took the more direct route of patching a -benchiter option directly into go test.

I could do what @josharian suggested and write a main program which starts a CPU profile and runs a copy of the benchmark code for a fixed number of iterations, but it seems a shame to give up all the tooling around go test.

@rsc (Contributor) commented Aug 20, 2018

@bradfitz suggested changing -benchtime to accept a little bit more than a time.Duration, so that you can say -benchtime=100x. Maybe that's good enough?

@josharian (Contributor) commented Aug 21, 2018

SGTM. I'd still like to finish up CL 112155 and think more about the bias concerns raised in #24735 (comment).

@magical (Contributor) commented Aug 21, 2018

SGTM too. I'd be happy to take a stab at implementing this.

Another thought: for the examples @theckman brought up about wanting to run a benchmark only once, it seems like a workaround would be to use -benchtime=0s, since benchmarks are always run at least once.

@gopherbot commented Aug 22, 2018

Change https://golang.org/cl/130675 mentions this issue: testing: allow passing an iteration count to -benchtime

@magical (Contributor) commented Aug 22, 2018

Oh, I didn't notice @mrosier-qdt had already uploaded a CL. Oh well, here's my take on it.

@josharian (Contributor) commented Aug 23, 2018

> SGTM. I'd still like to finish up CL 112155 and think more about the bias concerns raised in #24735 (comment).

I've broken out the bias concern as #27168. (And I'll eventually get back to the CL.)

@rsc changed the title from "proposal: testing: add option for running benchmarks a fixed number of iterations" to "testing: add -benchtime=100x (x suffix for exact count)" Sep 19, 2018

@rsc modified the milestones: Proposal → Go1.12 Sep 19, 2018

@rsc (Contributor) commented Sep 19, 2018

Looks like people generally agree with -benchtime=100x and there is a pending CL. Proposal accepted.

@rsc added the NeedsFix label Sep 26, 2018

@gopherbot commented Oct 3, 2018

Change https://golang.org/cl/139258 mentions this issue: testing: implement -benchtime=100x

@magical (Contributor) commented Oct 9, 2018

@rsc I see you've uploaded a CL of your own. Should I abandon https://golang.org/cl/130675?

@bradfitz (Member) commented Oct 9, 2018

@magical, sorry, I don't think we realized a CL was already open for it. I'll let @rsc decide how he wants to go forward.

@gopherbot closed this in commit 8e0aea1 Oct 12, 2018

gopherbot pushed a commit that referenced this issue Mar 20, 2019

testing: stop rounding b.N
The original goal of rounding to readable b.N
was to make it easier to eyeball times.
However, proper analysis requires tooling
(such as benchstat) anyway.

Instead, take b.N as it comes.
This will reduce the impact of external noise
such as GC on benchmarks.

This requires reworking our iteration estimates.
We used to calculate the estimated ns/op
and then divide our target ns by that estimate.
However, this order of operations was destructive
when the ns/op was very small; rounding could
hide almost an order of magnitude of variation.
Instead, multiply first, then divide.
Also, make n an int64 to avoid overflow.

Prior to this change, we attempted to cap b.N at 1e9.
Due to rounding up, it was possible to get b.N as high as 2e9.
This change consistently enforces the 1e9 cap.

This change also reduces the wall time required to run benchmarks.

Here's the impact of this change on the wall time to run
all benchmarks once with benchtime=1s on some std packages:

name           old time/op       new time/op       delta
bytes                 306s ± 1%         238s ± 1%  -22.24%  (p=0.000 n=10+10)
encoding/json         112s ± 8%          99s ± 7%  -11.64%  (p=0.000 n=10+10)
net/http             54.7s ± 7%        44.9s ± 4%  -17.94%  (p=0.000 n=10+9)
runtime               957s ± 1%         714s ± 0%  -25.38%  (p=0.000 n=10+9)
strings               262s ± 1%         201s ± 1%  -23.27%  (p=0.000 n=10+10)
[Geo mean]            216s              172s       -20.23%

Updates #24735

Change-Id: I7e38efb8e23c804046bf4fc065b3f5f3991d0a15
Reviewed-on: https://go-review.googlesource.com/c/go/+/112155
Reviewed-by: Austin Clements <austin@google.com>