testing: add -benchtime=100x (x suffix for exact count) #24735
(CC @rsc @josharian; see also #19128 and #10930.)
I'm reluctant to add more complexity to the benchmark flags. We just added benchsplit last round.
The main time I've wanted this recently is when working around benchmarks that are non-linear (i.e. broken) and hard to fix. But I share the reluctance to add more complexity here.
@dr2chase had some recent situation where this would have been useful, though I'm blanking on the details. I'm also reluctant to add more flags. Sometimes, however, the non-linearity isn't the benchmark's fault, but GC's. For example, 1,000 iterations vs 2,000 iterations could be the difference between 1 GC cycle and 3 GC cycles.
It bothers me that this is only meaningful for a single benchmark at a time. Is there something different we can do that addresses the underlying problem?
If we were less aggressive about rounding to "nice" iteration counts, the effect of this would be damped. For smoothing out GCs specifically, we could track the allocation rate of the benchmark to predict the number of GC cycles. If the number of cycles would be small, bias the next selected iteration count toward a nice value that stays away from the edge between n and n+1 GC cycles.
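To make that failure mode concrete, here is a hypothetical allocation-heavy benchmark (the name and allocation size are invented): how many GC cycles land inside the timed loop depends on b.N, so nearby iteration counts such as 1,000 and 2,000 can report noticeably different ns/op even though the loop body itself is perfectly linear.

```go
package example

import "testing"

var sink []byte

// BenchmarkAllocHeavy allocates on every iteration, so the number of GC
// cycles that run during the timed loop grows with b.N. If 1,000 iterations
// happen to span 1 cycle and 2,000 iterations span 3, the amortized GC cost
// per op differs between the two runs, which is the non-linearity described
// above.
func BenchmarkAllocHeavy(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = make([]byte, 64<<10) // ~64 KiB per iteration keeps the GC busy
	}
}
```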
Possibly related to #23423.
This behavior has annoyed me in the past. On #23423 I wrote:
From my perspective the goal is to have some way to perform a set amount of work in a benchmark, rather than a (potentially) variable amount based on execution time or some other heuristic. Anything that can fix that amount of work to a reproducible value would be fine. Specifically I'm thinking of cases where I want to compare hardware counters on two different hosts or between two versions of Go (with and without some patch). Assuming the two instances (hosts/toolchains/etc) have some different performance profile, the faster instance will likely perform a different amount of work than the slower instance, if only because it uses different iteration counts when scaling. If the two instances also settle to different iteration counts the problem is exacerbated.

Another strategy I can think of off the top of my head is to read some datafile (perhaps a previous log?) and use the number of iterations set there. It's certainly not elegant, but I'm also dubious that a heuristic based on the measured performance of the instance can accurately reproduce a set amount of work. I think at some point you'll have to have an iteration count recorded or passed somewhere.
As a quick note, see https://go-review.googlesource.com/c/go/+/47411/2/src/testing/benchmark.go#317
It seems like if you want to do right by performance counters generally, you need to have the testing package collect them around the runs and report them, including dividing by the iteration count. For example, the 1000 iterations are preceded by trial runs of 1, 10, and 100 iterations while the package guesses how many will be needed to fill benchtime, and you really don't want to see the performance counters for those trial runs at all. I don't know of any proposals for adding performance counters, though, and that would be quite a complex API.

Alternatively, the big jumps you see, where sometimes a benchmark runs for 1000 iterations and sometimes 2000 to reach benchtime, would go away if we stopped rounding so aggressively, as @aclements suggested in #24735 (comment). Would that be enough to address the issue here, @wmorrow? Originally I put the "nice" numbers in to make eyeballing times easier, but that makes little sense given that you need ten or so runs through benchstat to get reliable statistics anyway.
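For readers less familiar with how package testing arrives at b.N, the following is a rough sketch of the ramp-up step being described, not the actual src/testing/benchmark.go code: each trial run's duration feeds a prediction of the b.N needed to fill -benchtime, growth per step is capped, and the result has historically been rounded up to a "nice" number, which is how a benchmark ends up running 1, 10, and 100 iterations before the 1,000-iteration measurement.

```go
package example

import "time"

// predictN is a simplified model of one ramp-up step: given how long the last
// trial of lastN iterations took, guess how many iterations are needed to
// reach benchtime. The real testing package additionally rounds the result up
// to a "nice" number, which produces the 1, 10, 100, 1000 pattern.
func predictN(benchtime, lastDuration time.Duration, lastN int) int {
	if lastDuration <= 0 {
		lastDuration = 1
	}
	n := int(int64(lastN) * int64(benchtime) / int64(lastDuration))
	n += n / 5 // a little headroom so the next run doesn't fall just short
	if n > 100*lastN {
		n = 100 * lastN // don't grow more than 100x per step
	}
	if n <= lastN {
		n = lastN + 1 // always make progress
	}
	return n
}
```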
a12cc71 took a small step towards removing aggressive rounding. I’m all for ditching it entirely. We should consider eliminating or reducing the multiplier as well. (See that commit message for details.)
@wmorrow, it seems like any performance counter comparison needs to account for warm-up and divide by the number of iterations anyway. If you're using perf for this, it seems like
Change https://golang.org/cl/112155 mentions this issue:
I wanted to see the impact, so I hacked together CL 112155. Thinking about that CL raises another concern. Suppose we have a benchmark with high variance. We use our estimates to try to get near the benchtime. We are more likely to exceed the benchtime if we get a particularly slow run. A particularly fast run is more likely to trigger another benchmark run. The current approach thus introduces bias. One simple way to fix this would be to decide when our estimate is going to be "close enough", that is, when we are one iteration away from being done, and then stick with that final iteration even if it falls short of the benchtime. There are probably other fixes, too...including setting a fixed number of iterations up front. :)
Oh, and note that making our iteration estimates ever more accurate (e.g. by eliminating rounding) actually exacerbates the bias-for-slow-runs problem, rather than ameliorating it.
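Here is a toy simulation of that bias, under invented assumptions (a benchmark whose true cost is 1000 ns/op with ±20% uniform noise per run, and a stop rule that accepts the first run meeting -benchtime): because a run that comes in fast triggers yet another, larger run while a slow run is accepted as-is, the accepted measurements skew slow.

```go
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	const (
		benchtime = 1e9    // target: 1s worth of iterations
		trueNsOp  = 1000.0 // the benchmark's "real" cost per op
	)
	const trials = 10000
	var sum float64
	for t := 0; t < trials; t++ {
		n := 1
		for {
			// Each run observes a noisy per-op cost around the true cost.
			obs := trueNsOp * (0.8 + 0.4*rand.Float64())
			if obs*float64(n) >= benchtime {
				sum += obs // this run is accepted and reported
				break
			}
			// Too short: predict the n needed to reach benchtime and rerun.
			n = int(benchtime/obs) + 1
		}
	}
	fmt.Printf("mean reported ns/op: %.1f (true: %.1f)\n", sum/trials, trueNsOp)
}
```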
I agree, I think you'd end up having to rewrite perf/etc or call out to it and probably lose some expressiveness in the process.
If I'm reading this correctly you're suggesting adding a SIGUSR2 (or some other signal) when restarting the benchmark with some new N so that the outer perf can segment the perf.data files accordingly? That could work to some extent (at least to ensure there's some known amount of work represented in the perf record), but it seems like a complicated solution to the problem. I suppose you save on having to define iteration counts for each benchmark but you have to do more work after the run to ensure you're matching the right perf data to the right benchmark. It's also not portable for collecting data in situations where we're not using perf (as an example we use MAMBO to track visited basic blocks).
The original goal of rounding to readable b.N was to make it easier to eyeball times. However, proper analysis requires tooling (such as benchstat) anyway. Instead, take b.N as it comes. This will reduce the impact of external noise such as GC on benchmarks.

This requires doing our iteration estimation in floats instead of ints. When using ints, extremely fast (sub-nanosecond) benchmarks always get an estimated ns/op of 0, and with the reduced rounding up, they converged much too slowly.

It also reduces the wall time required to run benchmarks. Here's the impact of this CL on the wall time to run all benchmarks once with benchtime=1s on some std packages:

name           old time/op  new time/op  delta
bytes           306s ± 1%    238s ± 1%   -22.24%  (p=0.000 n=10+10)
encoding/json   112s ± 8%     99s ± 7%   -11.64%  (p=0.000 n=10+10)
net/http       54.7s ± 7%   44.9s ± 4%   -17.94%  (p=0.000 n=10+9)
runtime         957s ± 1%    714s ± 0%   -25.38%  (p=0.000 n=10+9)
strings         262s ± 1%    201s ± 1%   -23.27%  (p=0.000 n=10+10)
[Geo mean]      216s         172s        -20.23%

Updates golang#24735

Change-Id: I7e38efb8e23c804046bf4fc065b3f5f3991d0a15
I have a very similar desire to what @wmorrow describes: I want to compare profiling info across code changes, or across packages which implement similar interfaces, or of the same code on different hardware, and so forth -- all situations where the iteration count detected by the bench tool is (quite sensibly) going to vary -- and have all the numbers be reasonably comparable with a bare minimum of further normalization.

In order to get stable, human-handy info like this, it seems I want to either be able to declare a fixed number for b.N, or have some way to get that number out afterwards so I can divide e.g. prof stats into average-per-op numbers. A potential hack to work around this would be to also write Test1000Times_Foo functions to match every Benchmark_Foo function I have in the codebase, but this seems like a lot of boilerplate. It's a practical truism that the subjects of my benchmark functions are also what I want to evaluate any pprof'ing, etc., against, and so it would be nice if the test and benchmark run tooling would DWIM for this.
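As a rough illustration of the boilerplate workaround described above (all names hypothetical), every benchmark would need a parallel test that pins the iteration count, so that a profile collected around the test always covers exactly the same amount of work:

```go
package example

import "testing"

// doFoo stands in for the code under measurement.
func doFoo() {}

func BenchmarkFoo(b *testing.B) {
	for i := 0; i < b.N; i++ {
		doFoo()
	}
}

// Test1000Times_Foo mirrors BenchmarkFoo with a fixed count, so pprof output
// gathered while running it can be divided by exactly 1000 to get per-op
// figures, regardless of host speed or toolchain.
func Test1000Times_Foo(t *testing.T) {
	for i := 0; i < 1000; i++ {
		doFoo()
	}
}
```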
We're discussing a lot of complexity, but even if we add it, it's hard to see whether it solves the real problems with warm-ups and the like. The SIGUSR2 idea is kind of weird but interesting. I bet if we sent SIGUSR2 to test binaries in general some packages would get mad, like when we tried to have package testing install a SIGINT handler.
To add my color from the other issue I opened (since GitHub search didn't lead me here):
I have been in a similar spot too, where I wanted to benchmark something but only wanted the high-level/one-run result because of the overall complexity of the thing I'm benchmarking. It would have at least reduced the time I was waiting for benchmarks to finish. As I recall, I ended up having to implement a workaround myself.

Edit: For posterity, here's the other reason I'd seen this asked for (from my comment in the other issue):
FWIW, my reaction to the "I really just want it to run once" style of benchmarking is that

I'd argue that doing so might also make for good pedagogy: start with
If the author is concerned about the amount of traffic, then:

More seriously, though: benchmarks are experiments, and a low
The point was to keep the example simple and digestible for readers who are newer to programming. Adding more complexity to the example makes it easier for people to become overwhelmed. Thank you for the feedback.
Like @warpfork, I'm interested in an option like this to help with profiling.
Things mostly work out today because, due to rounding, it is likely that

My desired workflow goes something like this
I worked around this issue at first by fudging with

I could do what @josharian suggested and write a
@bradfitz suggested changing -benchtime to accept a little bit more than a time.Duration, so that you can say -benchtime=100x. Maybe that's good enough?
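For reference, assuming the flag lands roughly as suggested, usage would look like the commands in the comment below (BenchmarkParse is a placeholder name); the existing duration forms of -benchtime would keep working as before.

```go
package example

import "testing"

// BenchmarkParse is a placeholder benchmark used only to show the flag.
//
// Run it for exactly 100 iterations, skipping tests:
//
//	go test -run='^$' -bench=BenchmarkParse -benchtime=100x
//
// or keep the existing time-based behavior:
//
//	go test -run='^$' -bench=BenchmarkParse -benchtime=2s
func BenchmarkParse(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_ = len("some input") // stand-in for the real work
	}
}
```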
SGTM. I'd still like to finish up CL 112155 and think more about the bias concerns raised in #24735 (comment).
SGTM too. I'd be happy to take a stab at implementing this. Another thought: for the examples @theckman brought up about wanting to run a benchmark only once, it seems like a workaround would be to use
Change https://golang.org/cl/130675 mentions this issue:
Oh, I didn't notice @mrosier-qdt had already uploaded a CL. Oh well, here's my take on it.
Looks like people generally agree with -benchtime=100x and there is a pending CL. Proposal accepted.
Change https://golang.org/cl/139258 mentions this issue:
@rsc I see you've uploaded a CL of your own. Should I abandon https://golang.org/cl/130675?
The original goal of rounding to readable b.N was to make it easier to eyeball times. However, proper analysis requires tooling (such as benchstat) anyway. Instead, take b.N as it comes. This will reduce the impact of external noise such as GC on benchmarks.

This requires reworking our iteration estimates. We used to calculate the estimated ns/op and then divide our target ns by that estimate. However, this order of operations was destructive when the ns/op was very small; rounding could hide almost an order of magnitude of variation. Instead, multiply first, then divide. Also, make n an int64 to avoid overflow.

Prior to this change, we attempted to cap b.N at 1e9. Due to rounding up, it was possible to get b.N as high as 2e9. This change consistently enforces the 1e9 cap.

This change also reduces the wall time required to run benchmarks. Here's the impact of this change on the wall time to run all benchmarks once with benchtime=1s on some std packages:

name           old time/op  new time/op  delta
bytes           306s ± 1%    238s ± 1%   -22.24%  (p=0.000 n=10+10)
encoding/json   112s ± 8%     99s ± 7%   -11.64%  (p=0.000 n=10+10)
net/http       54.7s ± 7%   44.9s ± 4%   -17.94%  (p=0.000 n=10+9)
runtime         957s ± 1%    714s ± 0%   -25.38%  (p=0.000 n=10+9)
strings         262s ± 1%    201s ± 1%   -23.27%  (p=0.000 n=10+10)
[Geo mean]      216s         172s        -20.23%

Updates #24735

Change-Id: I7e38efb8e23c804046bf4fc065b3f5f3991d0a15
Reviewed-on: https://go-review.googlesource.com/c/go/+/112155
Reviewed-by: Austin Clements <austin@google.com>
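To make the order-of-operations point concrete, here is a toy calculation with invented numbers for a sub-nanosecond-per-op benchmark: dividing first truncates the estimated ns/op to zero in integer arithmetic, while multiplying first preserves the prediction.

```go
package main

import "fmt"

func main() {
	const (
		benchtime = int64(1e9)       // 1s target, in nanoseconds
		prevN     = int64(300000000) // iterations in the previous run
		prevTotal = int64(150000000) // that run took 0.15s => 0.5 ns/op
	)

	// Old order: estimate ns/op first, then divide the target by it.
	// In integer math the 0.5 ns/op estimate truncates to 0 and is useless.
	nsPerOp := prevTotal / prevN
	fmt.Println("ns/op estimate:", nsPerOp) // 0

	// New order: multiply first, then divide, keeping everything in int64.
	next := prevN * benchtime / prevTotal
	fmt.Println("predicted N:", next) // 2000000000 (then capped at 1e9)
}
```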
Currently benchmarks built using testing's Benchmark interface run a variable number of iterations, auto-adjusting to run for at least -benchtime seconds. This complicates HW PMU counter collection and A/B comparisons because the amount of work is (potentially) variable and the time adjustment code can muddle the benchmark results. It can also easily overshoot the amount of time you expect it to take.

The proposal is to add a new flag to go test that circumvents the adjustment process and runs a benchmark for some exact user-defined number of iterations.

References
A draft of this change is already on Gerrit (+92617)