
x/build: noise in first perf dashboard result #20414

Closed
josharian opened this issue May 18, 2017 · 3 comments

Comments

Projects
None yet
4 participants
@josharian
Contributor

commented May 18, 2017

[continued from #20412]

Looking at https://perf.golang.org/search?q=cl:43491+try:Tfbd4a8a6e, which is the perf dashboard result for CL 43491, I see some unlikely things:

name     old peak-VM-bytes   new peak-VM-bytes   delta
HTTP-2           180M ± 3%           230M ±21%   +27.37%  (p=0.032 n=4+5)

Observe the very high variance in the "after" column; it looks like one of the 5 runs went amok somehow due to external factors. I'm not sure how to reproduce this locally, but the x/benchmarks/http benchmarks all came back neutral on my machine before/after that CL, including for memory:

name    old time/op                  new time/op                  delta
HTTP-8                  14.1µs ± 1%                  14.1µs ± 0%   ~     (p=0.690 n=5+5)

name    old GC-bytes-from-system     new GC-bytes-from-system     delta
HTTP-8                   1.47M ± 0%                   1.47M ± 0%   ~     (all equal)

name    old P50-time/op              new P50-time/op              delta
HTTP-8                   439µs ± 1%                   439µs ± 0%   ~     (p=0.690 n=5+5)

name    old P95-time/op              new P95-time/op              delta
HTTP-8                   553µs ± 4%                   557µs ± 6%   ~     (p=1.000 n=5+5)

name    old P99-time/op              new P99-time/op              delta
HTTP-8                   980µs ± 1%                   990µs ± 2%   ~     (p=0.421 n=5+5)

name    old STW-ns/GC                new STW-ns/GC                delta
HTTP-8                   79.7k ± 1%                   81.2k ± 6%   ~     (p=0.151 n=5+5)

name    old STW-time/op              new STW-time/op              delta
HTTP-8                  40.4ns ± 1%                  41.0ns ± 5%   ~     (p=0.238 n=5+5)

name    old allocated-bytes/op       new allocated-bytes/op       delta
HTTP-8                   4.80k ± 0%                   4.80k ± 0%   ~     (all equal)

name    old allocs/op                new allocs/op                delta
HTTP-8                    61.0 ± 0%                    61.0 ± 0%   ~     (all equal)

name    old bytes-from-system        new bytes-from-system        delta
HTTP-8                   44.9M ± 0%                   44.9M ± 0%   ~     (all equal)

name    old heap-bytes-from-system   new heap-bytes-from-system   delta
HTTP-8                   38.0M ± 0%                   38.0M ± 0%   ~     (p=0.103 n=5+5)

name    old other-bytes-from-system  new other-bytes-from-system  delta
HTTP-8                   3.87M ± 0%                   3.87M ± 0%   ~     (all equal)

name    old peak-RSS-bytes           new peak-RSS-bytes           delta
HTTP-8                   47.5M ± 0%                   47.5M ± 0%   ~     (p=0.619 n=5+4)

name    old stack-bytes-from-system  new stack-bytes-from-system  delta
HTTP-8                   1.55M ± 3%                   1.61M ± 4%   ~     (p=0.103 n=5+5)

name    old user+sys-time/op         new user+sys-time/op         delta
HTTP-8                  87.2µs ± 0%                  86.5µs ± 2%   ~     (p=0.690 n=5+5)

(Btw, I'm excited to see percentiles in there!)

I also see some implausible geomeans in there: stack-bytes-from-system -9.8%, STW-ns/op and STW-ns/GC +6%. It is very unlikely that this CL, which only changes code layout, impacted either of those. Note that the contributing results all have very high variances. Unless/until we can get those variances down, we should probably just suppress geomeans there. Or run more iterations? Not sure.

I point all these out to illustrate the key meta-problem of this dashboard: how to interpret the data meaningfully and separate the relevant from the irrelevant, while still not missing important but unexpected results.

@quentinmit @bradfitz

@gopherbot gopherbot added this to the Unreleased milestone May 18, 2017

@gopherbot gopherbot added the Builders label May 18, 2017

@quentinmit

Contributor

commented May 18, 2017

Looking at the HTTP result more closely:

name     old peak-VM-bytes   new peak-VM-bytes   delta
HTTP-2           180M ± 3%           230M ±21%   +27.37%  (p=0.032 n=4+5)

See the "n=4+5"? The benchstat code threw out one of the points from the old set as an outlier, but did not throw out any points from the new set. If you grab the raw data, you will see that both old and new exhibit similar variability. I think this is probably just because the metric is very noisy; it depends on exactly when each GC happens to run. Also keep in mind that the benchmark is run in parallel, which further exacerbates the nondeterminism of the GC.
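The mismatched n comes from that outlier filtering. A hedged sketch of the idea, assuming benchstat uses something like the standard 1.5×IQR (Tukey) fence; this is illustrative, not benchstat's exact code:

```go
package main

import (
	"fmt"
	"sort"
)

// removeOutliers drops values outside [Q1-1.5*IQR, Q3+1.5*IQR],
// the standard Tukey fence. This is why the old and new columns
// can end up with different sample counts (e.g. n=4+5).
func removeOutliers(vals []float64) []float64 {
	sorted := append([]float64(nil), vals...)
	sort.Float64s(sorted)
	q1 := quantile(sorted, 0.25)
	q3 := quantile(sorted, 0.75)
	lo, hi := q1-1.5*(q3-q1), q3+1.5*(q3-q1)
	var kept []float64
	for _, v := range vals {
		if v >= lo && v <= hi {
			kept = append(kept, v)
		}
	}
	return kept
}

// quantile does simple linear interpolation on sorted data.
func quantile(sorted []float64, q float64) float64 {
	pos := q * float64(len(sorted)-1)
	i := int(pos)
	if i+1 >= len(sorted) {
		return sorted[len(sorted)-1]
	}
	frac := pos - float64(i)
	return sorted[i]*(1-frac) + sorted[i+1]*frac
}

func main() {
	// Five peak-VM-style samples: the one stray point is
	// dropped, leaving n=4 as in the "old" column above.
	old := []float64{180, 182, 178, 181, 120}
	fmt.Println(len(removeOutliers(old)))
	// prints 4
}
```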

Your repro doesn't actually include peak-VM-bytes. It's included or not depending on which OS you're running; it looks like it is currently only implemented for GOOS=linux. Were you trying to reproduce on a different OS?

Testing on my own linux machine, I definitely get similar variance in peak-VM-bytes. It seems to be anticorrelated with GOMAXPROCS - larger GOMAXPROCS produces less variance. You ran with 8, the trybots ran with 2. Results here.

Your overall point is still well taken, though. We need to think hard about how to meaningfully analyze the data for significance. So far, I have entirely punted that decision to the existing code in benchstat. The geomean lines currently do no significance testing at all; they always report the geomean regardless of variance. Note also that the geomean line shows the percentage difference of the raw values, not the geomean of the percentages.

@josharian

Contributor Author

commented May 18, 2017

Another data point here, from a trybot run for CL 43637, which shouldn't impact any go1 benchmarks: GobDecode is -25%, GobEncode is -20%, TimeFormat is +23%. And the variances for all the go1 benchmarks are extremely high; typical variances on my machine are 1-3%.

@bradfitz

Member

commented May 21, 2018

Closing old performance dashboard bugs. #20552 tracks bringing it back one day.

@bradfitz bradfitz closed this May 21, 2018

@golang golang locked and limited conversation to collaborators May 21, 2019
