runtime: first benchmark run is much slower than subsequent runs #28595

josharian · 2018-11-05T04:15:15Z

$ cd $GOROOT/test/bench/go1
$ go test -bench=TimeFormat -count=5

Want:

The first benchmark run is similar to the others, maybe a bit slower because caches are cold. Sample:

goos: darwin
goarch: amd64
BenchmarkTimeFormat-8   	 5000000	       261 ns/op
BenchmarkTimeFormat-8   	 5000000	       254 ns/op
BenchmarkTimeFormat-8   	 5000000	       253 ns/op
BenchmarkTimeFormat-8   	 5000000	       252 ns/op
BenchmarkTimeFormat-8   	 5000000	       249 ns/op
PASS
ok  	_/Users/josh/go/tip/test/bench/go1	8.967s

Got:

The first benchmark run is drastically slower. Sample:

goos: darwin
goarch: amd64
BenchmarkTimeFormat-8   	 5000000	       504 ns/op
BenchmarkTimeFormat-8   	 5000000	       243 ns/op
BenchmarkTimeFormat-8   	 5000000	       244 ns/op
BenchmarkTimeFormat-8   	 5000000	       241 ns/op
BenchmarkTimeFormat-8   	 5000000	       239 ns/op
PASS
ok  	_/Users/josh/go/tip/test/bench/go1	10.259s

I noticed this while trying to understand some wild swings (50%+) in go1 benchmark performance that are not limited to the first benchmark run. I don't know whether the root cause is the same or not, but this issue at least reproduces reliably.

Bisect points to CL 139719 / c803ffc.

cc @mknyszek

The text was updated successfully, but these errors were encountered:

josharian · 2018-11-05T04:16:54Z

FYI @martisch @randall77 this is part of the puzzling go1 benchmark results I mentioned in CL 147360

mvdan · 2018-11-05T08:34:19Z

Thanks for spotting and reporting this, @josharian! I'm sure you've saved countless head-scratching hours :)

mknyszek · 2018-11-05T16:06:20Z

Thanks for bisecting and for reporting the issue!

I suspect that the regession you're seeing actually is the result of the runtime scavenging (i.e. releasing pages back to the OS) immediately after the first heap growth. For small heaps this shouldn't really be an issue since the scavenging routine should back out quite fast, but it seems that it's doing more work than expected in this case and perhaps being a little too aggressive.

I'll investigate and see what I can do.

mknyszek · 2018-11-06T00:07:50Z

I found the issue while I was writing this, so it the rest of the comment reads a little more like "how I figured it out". TL;DR is that we're calling madvise way too often due to some bad logic on my part. CL incoming.

It seems that on x64 Linux it's not so bad.

$ perflock ~/toolchain/bin/go test -bench=TimeFormat -count=5                                                                     
goos: linux
goarch: amd64
BenchmarkTimeFormat-8            5000000               318 ns/op
BenchmarkTimeFormat-8            5000000               297 ns/op
BenchmarkTimeFormat-8            5000000               299 ns/op
BenchmarkTimeFormat-8            5000000               298 ns/op
BenchmarkTimeFormat-8            5000000               296 ns/op
PASS
ok      _/usr/local/google/home/mknyszek/toolchain/test/bench/go1       10.911s

On Darwin I was able to reproduce it, and I confirmed the cause is my change. If I run the same command with -count=1 and -cpuprofile cpu.prof, then put cpu.prof through pprof, I get:

      flat  flat%   sum%        cum   cum%
     1.42s 47.65% 47.65%      1.42s 47.65%  runtime.memclrNoHeapPointers
     0.68s 22.82% 70.47%      0.68s 22.82%  runtime.madvise
     0.37s 12.42% 82.89%      0.37s 12.42%  runtime.memmove
     0.11s  3.69% 86.58%      0.25s  8.39%  time.Time.AppendFormat
     0.10s  3.36% 89.93%      0.10s  3.36%  runtime.(*mspan).init
     0.09s  3.02% 92.95%      0.09s  3.02%  runtime.(*mspan).refillAllocCache
     0.03s  1.01% 93.96%      0.03s  1.01%  time.(*Location).lookup
     0.03s  1.01% 94.97%      0.03s  1.01%  time.absDate
     0.02s  0.67% 95.64%      0.02s  0.67%  runtime.mmap
     0.02s  0.67% 96.31%      0.02s  0.67%  time.nextStdChunk

The runtime.madvise being so high up definitely points to my change(s). madvise might be more expensive on Darwin than on Linux. If I run with -count=5:

Showing top 10 nodes out of 47                                               
      flat  flat%   sum%        cum   cum%                         
    2080ms 17.48% 17.48%     2080ms 17.48%  runtime.memclrNoHeapPointers
    1670ms 14.03% 31.51%     6310ms 53.03%  time.Time.AppendFormat
    1620ms 13.61% 45.13%     1620ms 13.61%  runtime.memmove  
    1050ms  8.82% 53.95%     1090ms  9.16%  time.nextStdChunk
     800ms  6.72% 60.67%     1050ms  8.82%  time.appendInt
     690ms  5.80% 66.47%      800ms  6.72%  time.(*Location).lookup
     560ms  4.71% 71.18%      560ms  4.71%  runtime.madvise
     530ms  4.45% 75.63%      540ms  4.54%  time.absDate
     430ms  3.61% 79.24%      430ms  3.61%  runtime.(*mspan).refillAllocCache
     420ms  3.53% 82.77%      420ms  3.53%  runtime.(*mspan).init

Then it's not quite so high up, since those madvise calls are likely mostly made during the first run. Maybe during a heap growth.

To make sure, I tracked down where the madvise calls were coming from with some print debugging. Turns out I was wrong! The heap-growth scavenging routine only actually scavenges one enormous span. The real problem arises primarily from a performance bug in a previous change of mine.

When freeing a span, the runtime will may try to coalesce a scavenged span with a neighboring unscavenged span, so on merging, we re-scavenge the whole thing. Freeing a span happens more frequently than you might think, since when we're allocating a new span we search for a "best fit" free span, and if it's not of the same size we trim off the excess. In this particular case, we have one enormous free span which we scavenge. Then, lots of smaller allocations come in, and they end up hitting the space where that scavenged span lived, trimming it and returning the trimmings (which are considered scavenged) each time, only for those trimmings to get allocated out of again.

In this freeing process, due to a bug, we actually end up re-scavenging the span we're freeing if it's already scavenged, even if it didn't coalesce with anything. So, we basically have a lot of unnecessary additional madvise calls happening, which has a big impact if madvise is expensive of the platform.

I hacked together a fix for the freeing logic locally and it does in fact fix this issue. CL incoming.

gopherbot · 2018-11-06T00:16:02Z

Change https://golang.org/cl/147559 mentions this issue: runtime: stop unnecessary span scavenges on free

josharian added the release-blocker label Nov 5, 2018

josharian added this to the Go1.12 milestone Nov 5, 2018

josharian mentioned this issue Nov 8, 2018

runtime: stack copying performance regressions #28678

Closed

gopherbot closed this as completed in 06be7cb Nov 9, 2018

golang locked and limited conversation to collaborators Nov 9, 2019

gopherbot added the FrozenDueToAge label Nov 9, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

runtime: first benchmark run is much slower than subsequent runs #28595

runtime: first benchmark run is much slower than subsequent runs #28595

josharian commented Nov 5, 2018

josharian commented Nov 5, 2018

mvdan commented Nov 5, 2018

mknyszek commented Nov 5, 2018

mknyszek commented Nov 6, 2018

gopherbot commented Nov 6, 2018

runtime: first benchmark run is much slower than subsequent runs #28595

runtime: first benchmark run is much slower than subsequent runs #28595

Comments

josharian commented Nov 5, 2018

josharian commented Nov 5, 2018

mvdan commented Nov 5, 2018

mknyszek commented Nov 5, 2018

mknyszek commented Nov 6, 2018

gopherbot commented Nov 6, 2018