testing: parallel benchmark results are poorly documented #31884
The way parallel benchmark results are reported is easy to misinterpret, and the correct way to interpret them is not documented. Typically this makes perfectly scalable benchmarks appear to perform much worse at low parallelism, and makes benchmarks that don't scale at all appear to be doing fine. For example, this came up in #31820 (comment).
Specifically, the "ns/op" reported is not CPU-ns/op, it's wall-ns/op. For example, suppose each op takes exactly 100 ns, regardless of parallelism. If the single-threaded benchmark runs for 1 sec, it will execute 10,000,000 ops, so ns/op = 1s/10,000,000 ops = 100 ns/op. But if the same benchmark runs 4-way parallel for 1 sec, it will execute 40,000,000 ops, so ns/op = 1s/40,000,000 = 25 ns/op. (I really wish it didn't work this way...)
Interpreting the results of CPU-bound parallel benchmarks is further complicated by hyper-threading (though this isn't the fault of the testing package).
I don't think we can change the reported ns/op at this point. We could perhaps introduce a new metric for parallel benchmarks. At the very least, we should document this.