x/build: noise in first perf dashboard result #20414
[continued from #20412]
Looking at https://perf.golang.org/search?q=cl:43491+try:Tfbd4a8a6e, which is the perf dashboard result for CL 43491, I see some unlikely things:
Observe the very high variance in the "after" column; it seems like one of the 5 runs went amok somehow due to external factors. I'm not sure how to reproduce this locally, but the x/benchmarks/http benchmarks all came back neutral on my machine before/after that CL, including for memory:
(Btw, I'm excited to see percentiles in there!)
I also see some crazy geomeans in there.
I point all these out to illustrate the key meta-problem of this dashboard: how to interpret this data meaningfully and separate the relevant from the irrelevant, while still not missing important but unexpected results.
Look at the HTTP result more closely:
See the "n=4+5"? The benchstat code threw out one of the points from the old set as an outlier, but did not throw out any points on the right. If you grab the raw data, you will see that both old and new exhibit similar variability. I think this is probably just because the metric is very noisy - it depends on exactly when each GC happens to run. Also keep in mind that the benchmark is being run in parallel, which further exacerbates the nondeterminism introduced by the GC.
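For reference, a common way to get asymmetric outlier removal like "n=4+5" is Tukey's IQR fences applied to each column independently: any point outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is dropped. This is a minimal sketch of that approach, not necessarily benchstat's exact code:

```go
package main

import (
	"fmt"
	"sort"
)

// quantile returns the q-th quantile (0 <= q <= 1) of sorted data,
// using linear interpolation between adjacent points.
func quantile(sorted []float64, q float64) float64 {
	pos := q * float64(len(sorted)-1)
	i := int(pos)
	if i+1 >= len(sorted) {
		return sorted[len(sorted)-1]
	}
	frac := pos - float64(i)
	return sorted[i]*(1-frac) + sorted[i+1]*frac
}

// rejectOutliers drops points outside Tukey's fences
// [Q1 - 1.5*IQR, Q3 + 1.5*IQR], computed per sample set.
func rejectOutliers(data []float64) []float64 {
	sorted := append([]float64(nil), data...)
	sort.Float64s(sorted)
	q1 := quantile(sorted, 0.25)
	q3 := quantile(sorted, 0.75)
	iqr := q3 - q1
	lo, hi := q1-1.5*iqr, q3+1.5*iqr
	var kept []float64
	for _, v := range data {
		if v >= lo && v <= hi {
			kept = append(kept, v)
		}
	}
	return kept
}

func main() {
	// Hypothetical ns/op samples: one run "went amok".
	old := []float64{100, 101, 99, 102, 180}
	fmt.Println(rejectOutliers(old)) // the 180 point is dropped, leaving n=4
}
```

Because the fences are derived from each column's own quartiles, a noisy "new" column with no single extreme point keeps all five runs while the "old" column loses one, producing exactly the n=4+5 shape above.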
Your repro doesn't actually include
Testing on my own linux machine, I definitely get similar variance in
Your overall point is still well-taken, though. We need to think hard about how to meaningfully analyze the data for significance. So far, I have entirely punted this decision to the existing code in benchstat. The geomean lines currently do no significance testing at all; they always report the geomean regardless of variance. Note also that the geomean line shows the percentage difference of the raw values, instead of the geomean of the percentages.
Another data point here, from a trybot run for CL 43637, which shouldn't impact any go1 benchmarks: GobDecode is -25%, GobEncode is -20%, TimeFormat is +23%. And the variances for all the go1 benchmarks are super high; typical variances on my machine are 1-3%.
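One cheap sanity check the dashboard could apply (a sketch of an idea, not something it does today): compute each benchmark's relative standard deviation across runs and flag anything far above the 1-3% typical of a quiet machine. The sample data here is hypothetical:

```go
package main

import (
	"fmt"
	"math"
)

// relStdDev returns the sample standard deviation of xs expressed as a
// percentage of the mean: a quick per-benchmark noise estimate.
func relStdDev(xs []float64) float64 {
	mean := 0.0
	for _, x := range xs {
		mean += x
	}
	mean /= float64(len(xs))
	varSum := 0.0
	for _, x := range xs {
		d := x - mean
		varSum += d * d
	}
	sd := math.Sqrt(varSum / float64(len(xs)-1))
	return sd / mean * 100
}

func main() {
	// Hypothetical ns/op samples from 5 runs of one benchmark.
	runs := []float64{100, 101, 99, 100, 100}
	fmt.Printf("noise: %.2f%% of mean\n", relStdDev(runs))
}
```

A builder whose go1 benchmarks consistently show noise well above a few percent is probably telling you something about the machine, not about the CL under test.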