testing: benchmark unit properties #43744
I propose extending the Go benchmark format with a way to specify properties of units. Especially now that the
I don't have strong opinions about the syntax. I would like the syntax for this to make sense both in the benchmark format itself, and on the command line of tools. To have something concrete, I propose the following for the benchmark format
Unit unit key=value key=value ...
and the following for command-line flags
-unit "unit key=value key=value ..."
Starting the line with
I propose it is an error to specify conflicting properties about a unit.
I also propose that a unit property applies to all following benchmark results, and that it is unspecified whether it applies to earlier benchmark results. The tools I have built that would benefit from unit properties aggregate all results before presenting them, so unit properties naturally apply to all results. But I don't want to force purely streaming tools to read all results before emitting anything just in case there's a later unit property line. Alternatively, we could say it's an error if a property is supplied for any unit that's already been seen, but this seems overly restrictive for aggregating tools.
For specific properties, I propose
Some concrete examples:
The text was updated successfully, but these errors were encountered:
I don't understand how this flag would be used. What tools do you envision supporting this flag?
Should this proposal address how benchmark code can control the properties for the units it uses?
I am wondering how a benchmark that doesn't use the default units would communicate its unit properties under this proposal. Take, for example,
I am reminded of this conversation: https://go-review.googlesource.com/c/go/+/232298/8..12/src/time/sleep_test.go#b585
benchstat, namely. I don't plan on adding support for this to the current version of benchstat, but I have a complete rewrite almost done that I would like to add support to. We're also working on automated performance monitoring (#48803), which wouldn't use the command-line flag specifically, but we would like to have unit properties for.
Support for "exact" measurements in particular has been a sore spot in benchstat for benchmarks that record things like "binary size", for example. Since that's not going to change with repeated measurement, we only report one sample, but that means that benchstat will refuse to show you deltas because there isn't enough data for it to statistically significant under general distributional assumptions.
They can actually just fmt.Print a "Unit" line in a
Heh, indeed. Benchstat v2 will handle this much better, without the need for unit properties. It parses units, and will normalize "ns" to "sec", then apply reasonable scaling to the result.
@ChrisHines , I agree it's not particularly ergonomic, though I think it's ultimately a follow-on issue to this proposal. We added format and tooling support for custom metrics long before the testing package had
I would add the unit properties specification to the Go benchmark format spec and could certainly mention there that
@aclements What is the target user base is for this feature?
If we want people writing benchmarks for their own code to know about and use this feature then I think the
If the target user base is narrower for some reason, perhaps just those contributing to the Go itself, then maybe it belongs in a different place.
From that I think I understand that this proposal is limited to the Go benchmark data format and benchstat CLI flags. The proposal title "testing: benchmark unit properties" gave me the impression that the testing package itself was in scope. My apologies if I got too excited and jumped the gun.
I am intending this for anyone who writes Go benchmarks, generally more sophisticated ones (certainly if you're not using custom units you don't have much need for this!). Curiously, we don't actually reference the Go benchmark format or benchstat anywhere in the testing package documentation. It's really quite non-prescriptive. I think you're right that documenting this in the testing package is the right thing to do, though I think we'll have to pull in a bit more of the bigger Go benchmarking picture at the same time (which isn't a bad thing).
That's fair. Maybe I should have titled this "x/perf:" or something. But I'm not sure you need to curb your enthusiasm. These things are all intentionally weakly coupled: as you point out, it won't initially be super ergonomic to access this from the testing package, but you can, and if this proposal is accepted, the broader tooling will make use of this.
This is a complete rewrite of benchstat. Basic usage remains the same, as does the core idea of showing statistical benchmark summaries and A/B comparisons in a table, but there are several major improvements. The statistics is now more robust. Previously, benchstat used IQR-based outlier rejection, showed the mean of the reduced sample, its range, and did a non-parametric difference-of-distribution test on reduced samples. Any form of outlier rejection must start with distributional assumptions, in this case assuming normality, which is generally not sound for benchmark data. Hence, now benchstat does not do any outlier rejection. As a result, it must use robust summary statistics as well, so benchstat now uses the median and confidence interval of the median as summary statistics. Benchstat continues to use the same Mann-Whitney U-test for the delta, but now does it on the full samples since the U-test is already non-parametric, hence increasing the power of this test. As part of these statistical improvements, benchstat now detects and warns about several common mistakes, such as having too few samples for meaningful statistical results, or having incomparable geomeans. The output format is more consistent. Previously, benchstat transformed units like "ns/op" into a metric like "time/op", which it used as a column header; and a numerator like "sec", which it used to label each measurement. This was easy enough for the standard units used by the testing framework, but was basically impossible to generalize to custom units. Now, benchstat does unit scaling, but otherwise leaves units alone. The full (scaled) unit is used as a column header and each measurement is simply a scaled value shown with an SI prefix. This also means that the text and CSV formats can be much more similar while still allowing the CSV format to be usefully machine-readable. Benchstat will also now do A/B comparisons even if there are more than two inputs. It shows a comparison to the base in the second and all subsequent columns. This approach is consistent for any number of inputs. Benchstat now supports the full Go benchmark format, including sophisticated control over exactly how it structures the results into rows, columns, and tables. This makes it easy to do meaningful comparisons across benchmark data that isn't simply structured into two input files, and gives significantly more control over how results are sorted. The default behavior is still to turn each input file into a column and each benchmark into a row. Fixes golang/go#19565 by showing all results, even if the benchmark sets don't match across columns, and warning when geomean sets are incompatible. Fixes golang/go#19634 by no longer doing outlier rejection and clearly reporting when there are not enough samples to do a meaningful difference test. Updates golang/go#23471 by providing more through command documentation. I'm not sure it quite fixes this issue, but it's much better than it was. Fixes golang/go#30368 because benchstat now supports filter expressions, which can also filter down units. Fixes golang/go#33169 because benchstat now always shows file configuration labels. Updates golang/go#43744 by integrating unit metadata to control statistical assumptions into the main tool that implements those assumptions. Fixes golang/go#48380 by introducing a way to override labels from the command line rather than always using file names. Change-Id: Ie2c5a12024e84b4918e483df2223eb1f10413a4f