Currently, benchmarks defined with the testing package show only ns/op by default, optionally reporting the average allocation count and size per operation (allocs/op, bytes/op). Users may also export a custom metric via testing.(*B).ReportMetric.
A common issue with benchmarks defined by the testing package, especially microbenchmarks, is that they might execute only a handful of GC cycles. If internal or external noise (scheduling, for example) causes an extra GC cycle to run, it can cause a significant shift in the results (typically ns/op). As a result, it's harder to trust the results from such benchmarks: they need either more iterations to be useful, or a change in how they're defined (say, a bigger GOGC value).
I propose exposing the ~~total number of GC cycles~~ average number of GC cycles that passed during a given benchmark run to help expose such issues. One issue with making the total a benchmark metric is that it depends on the benchmark time, or the number of benchmark iterations, so this number is not comparable if those are different between runs. I think this is OK, because this metric is just supposed to flag an issue, not to be used in actual performance comparisons. Tooling such as benchstat could simply understand this, or we could export some benchmark unit metadata that makes this explicit.
I don't have strong feelings about the details, such as the metric name (maybe ~~total-GC-cycles~~ GCs/op?). While this metric potentially adds noise to benchmark output, I believe that cost is outweighed by the value of being able to identify this kind of noise early.
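For reference, a rough sketch of what approximating this by hand looks like today with ReportMetric (the benchmark name and body below are just an arbitrary stand-in workload, not part of the proposal):

```go
package example_test

import (
	"runtime"
	"strconv"
	"testing"
)

var sink []byte

func BenchmarkAppendInt(b *testing.B) {
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		sink = strconv.AppendInt(sink[:0], int64(i), 10)
	}
	b.StopTimer()
	runtime.ReadMemStats(&after)
	// MemStats.NumGC counts completed GC cycles since the program started,
	// so the difference is the number of cycles during the timed region.
	b.ReportMetric(float64(after.NumGC-before.NumGC)/float64(b.N), "GCs/op")
}
```

The proposal is essentially for the testing package to do this bookkeeping automatically.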
Alternatively, we could make it the job of the testing package to flag issues around such performance cliffs, but I suspect it's cleaner to just export the raw data and have it be interpreted by downstream tooling.
One might think that the testing package could just add iterations to adjust for these cliffs, but this is very difficult to do in the general case. It would involve a lot of heuristics that would generally be fragile and depend on the GC implementation. On the other hand, it's straightforward for tools to find issues after the fact, by identifying outliers and correlating them with relatively large (by %) differences in GC cycle count.
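As a made-up illustration (hypothetical benchmark and numbers), an outlier becomes easy to diagnose once the cycle count is reported alongside it:

```
BenchmarkParse-8    500000    2390 ns/op    0.000002 GCs/op
BenchmarkParse-8    500000    2405 ns/op    0.000002 GCs/op
BenchmarkParse-8    500000    3170 ns/op    0.000004 GCs/op
```

The third run's extra GC cycle (0.000004 vs. 0.000002 GCs/op) suggests the ns/op jump is GC noise rather than a real regression.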
Note that the average number of GC cycles per op would not work, because in many scenarios where this matters, it wouldn't properly flag an issue (the average would just be zero, or very, very close to zero).

EDIT: Switched to average number of GC cycles.

CC @aclements @prattmic @dr2chase
> Note that the average number of GC cycles per op would not work, because in many scenarios where this matters, it wouldn't properly flag an issue (the average would just be zero, or very, very close to zero).
Is this actually the case? It sounds mostly like an issue of ensuring we include enough significant figures. As long as there are enough significant figures, we should be able to tell that a doubling of 0.00001 GC/op to 0.00002 GC/op is problematic.
That's true. It looks a little weird but I think you're right that as long as there are enough significant figures it's fine. I think even with a float64 we're fine on that front.
mknyszek changed the title from "proposal: testing: add GC cycle count as a standard benchmark metric" to "proposal: testing: add GC cycles per op as a standard benchmark metric" on Apr 21, 2022.
How would this behave when a mark phase is active at the start, or at the end, of the benchmark loop? (It's possible to wait for the cycle to finish before entering the benchmark, but maybe that's not possible at the end. Maybe that's something more to report, so a post-processor could choose to discard those results?)
How does it interact with B.{Stop,Start,Reset}Timer, especially around an in-progress mark phase?
Thanks @rhysh. I think I'd lean toward not trying to do anything about in-progress mark phases and just measuring completed GC cycles between {Start,Reset}Timer and StopTimer. It's noisy, but ultimately what we're trying to flag is that noise. If we blocked until the end of a mark phase, it would create a much more stable condition for the number of GC cycles, which might mask the problem. If the mark phases are short (true most of the time), then masking isn't too much of a problem, so I don't feel particularly strongly about this if there are other good reasons to block.
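To make the "completed cycles between StartTimer and StopTimer" idea concrete, here's a rough sketch of the kind of bookkeeping I have in mind (not the actual testing internals, and the names are illustrative; it reads the runtime/metrics /gc/cycles/total:gc-cycles counter, so an in-progress mark phase simply isn't counted until it completes):

```go
package sketch

import "runtime/metrics"

// gcCycles accumulates the number of GC cycles that complete while the
// benchmark timer is running, mirroring how the timer itself is handled.
type gcCycles struct {
	sample  [1]metrics.Sample
	start   uint64 // counter value at the most recent StartTimer
	total   uint64 // completed cycles observed while the timer was running
	running bool
}

func (c *gcCycles) read() uint64 {
	if c.sample[0].Name == "" {
		c.sample[0].Name = "/gc/cycles/total:gc-cycles"
	}
	metrics.Read(c.sample[:])
	return c.sample[0].Value.Uint64()
}

// startTimer would be called from B.StartTimer (and B.ResetTimer, after
// zeroing total).
func (c *gcCycles) startTimer() {
	c.start = c.read()
	c.running = true
}

// stopTimer would be called from B.StopTimer and at the end of the run.
func (c *gcCycles) stopTimer() {
	if c.running {
		c.total += c.read() - c.start
		c.running = false
	}
}
```

Dividing total by b.N at reporting time would give the GCs/op value.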