Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: testing: add testing.B.Loop for iteration #61515

Open
aclements opened this issue Jul 21, 2023 · 5 comments
Open

proposal: testing: add testing.B.Loop for iteration #61515

aclements opened this issue Jul 21, 2023 · 5 comments
Labels
Milestone

Comments

@aclements
Copy link
Member

aclements commented Jul 21, 2023

Update July 26, 2023: See this comment for the latest Loop API. The motivation and arguments for this proposal still otherwise apply in full, but the API has switched to the for b.Loop() { ... } form proposed in the Alternatives section.

Currently, Go benchmarks are required to repeat the body of the benchmark (*testing.B).N times. This approach minimizes measurement overhead, but it’s error-prone and has many limitations:

  • As we discovered in cmd/vet: flag benchmarks that don’t use b #38677, it’s surprisingly common for benchmarks to simply forget to use b.N.

  • While a vet check can pretty reliably detect forgotten uses of b.N, there’s some evidence that many benchmarks use b.N incorrectly, such as using it to size the input to an algorithm, rather than as an iteration count.

  • Because the benchmark framework doesn’t know when the b.N loop starts, if a benchmark has any non-trivial setup, it’s important for it to use (*testing.B).ResetTimer. It’s generally not clear what counts as non-trivial setup, and very hard to detect when ResetTimer is necessary.

Proposal

I propose that we add the following method to testing.B and encourage its use over b.N:

// Loop invokes op repeatedly and reports the time and (optionally) allocations per invocation
// as the results of benchmark b.
// Loop must be called only once per benchmark or sub-benchmark.
//
// A benchmark should either use Loop or contain an explicit loop from 0 to b.N, but not both.
// After b.Loop returns, b.N will contain the total number of calls to op, so the benchmark
// may use b.N to compute other average metrics.
func (b *B) Loop(op func())

This API has several advantages over b.N loops:

  • It cannot be misused for something other than an iteration count. It’s still possible for a benchmark to forget entirely to use b.Loop, but that can be detected reliably by vet.

  • The benchmarking framework can record time and other metrics around only the benchmarked operation, so benchmarks no longer need to use ResetTimer or be careful about their setup.

  • Iteration ramp-up can be done entirely within b.Loop, which means that benchmark setup before b.Loop will happen once and only once, rather than at each ramp-up step. For benchmarks with non-trivial setup, this saves a lot of time. Notably, benchmarks with expensive setup can run for far longer than the specified -benchtime because of the large number of ramp-up steps (setup time is not counted toward the -benchtime threshold). It’s also less error-prone than using a global sync.Once to reduce setup cost, which can have side effects on GC timing and other benchmarks if the computed results are large.

  • As suggested by @rsc, b.Loop could be a clear signal to the compiler not to perform certain optimizations in the loop body that often quietly invalidate benchmark results.

  • In the long term, we could collect distributions rather than just averages for benchmark metrics, which would enable deeper insights into benchmark results and far more powerful statistical methods, such as stationarity tests. The way this would work is that b.Loop would perform iteration ramp-up only to the point where it can amortize its measurement overhead (ramping up to, say, 1ms), and then repeat this short measurement loop many times until the total time reaches the specified -benchtime. For short benchmarks, this could easily gather 1,000 samples, rather than just a mean.

Alternatives

This proposal is complementary to testing.Keep (#61179). It’s an alternative to testing.B.Iterate (originally in #48768, with discussion now merged into #61179), which essentially combines Keep and Loop. I believe Iterate could have all of the same benefits as Loop, but it’s much clearer how to make Loop low-overhead. If Loop implicitly inhibits compiler optimizations in the body of its callback, then it has similar deoptimization benefits as Iterate. I would argue that Loop has a shallower learning curve than Iterate, though probably once users get used to either they would have similar usability.

If #61405 (range-over-func) is accepted, it may be that we want the signature of Loop to be Loop(op func() bool) bool, which would allow benchmarks to be written as:

func Benchmark(b *testing.B) {
    for range b.Loop {
        // … benchmark body …
    }
}

It’s not clear to me what this form should do if the body attempts to break or return.

Another option is to mimic testing.PB.Next. Here, the signature of Loop would be Loop() bool and it would be used as:

func Benchmark(b *testing.B) {
    for b.Loop() {
        // … benchmark body …
    }
}

This is slightly harder to implement, but perhaps more ergonomic to use. It's more possible to misuse than the version of Loop that takes a callback (e.g., code could do something wrong with the result, or break out of the loop early). But unlike b.N, which is easy to misuse, this seems much harder to use incorrectly than to use correctly.

cc @bcmills @rsc

@gopherbot gopherbot added this to the Proposal milestone Jul 21, 2023
@bcmills
Copy link
Member

bcmills commented Jul 21, 2023

b.Loop could be a clear signal to the compiler not to perform certain optimizations in the loop body that often quietly invalidate benchmark results.

It's not clear to me which optimizations those would be. Certainly an unused return value should not be eliminated, but should function arguments be inlined?

I have seen benchmarks that explicitly want to check inlining behavior, and also benchmarks that are accidentally-inlinable. I expect that as inlining improves (#61502), the problem of accidentally-inlined arguments will only get worse — but if the Loop function inhibits inlining, it won't be as obvious how to allow particular arguments to be inlined.

(I assume the user would have to create a closure, and then call that closure within the Loop body?)

@aclements
Copy link
Member Author

Since this is so closely related to testing.Keep and testing.B.Iterate, which are both being discussed on #61179, let's keep discussions of trade-offs over on #61179. (It's probably fine to discuss things specific to this Loop here.)

@aclements
Copy link
Member Author

It's not clear to me which optimizations those would be. Certainly an unused return value should not be eliminated, but should function arguments be inlined?

I think we would implicitly apply Keep to every function argument and result in the closure passed directly to Loop. I'm pretty sure that's what @rsc was thinking, too.

I expect that as inlining improves (#61502), the problem of accidentally-inlined arguments will only get worse — but if the Loop function inhibits inlining, it won't be as obvious how to allow particular arguments to be inlined.

I definitely agree that this problem is only going to get worse. I'm not sure we need to inhibit inlining, but we need to inhibit optimizations that propagate information across this inlining boundary. That's why I think it works to think of this as applying an implicit Keep to every function argument and result in the closure, and to think of Keep as having a used and non-constant result. (Obviously that's not a very precise specification, though!)

(I assume the user would have to create a closure, and then call that closure within the Loop body?)

Right. I think if the user specifically wants to benchmark the effect of constant propagation into an inlined function, they would add another layer of function call. We'd only apply the implicit Keep to the direct argument to Loop. That's fairly subtle, but I think such benchmarks are also rare.

My other concern with the deoptimization aspect of this is what to do if Loop is called with something other than a function literal. We could say the implicit Keep only applies if it's called directly with a function literal, but that feels.. even weirder. 😅 It may be that for b.Loop() { ... } alternative I gave at the end is less weird in this context because the lexical relationship is much clearer.

@rsc
Copy link
Contributor

rsc commented Jul 25, 2023

When b.Loop inlines, for b.Loop() { ... } has almost no overhead while enabling multiple trials and any number of other interesting measurements. I really like it a lot. It's unfortunate that PB's method is Next, since Loop seems like a much better name. PB is not used much; it may be fine to be different.

@rsc
Copy link
Contributor

rsc commented Jul 26, 2023

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Active
Development

No branches or pull requests

4 participants