Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: first benchmark run is much slower than subsequent runs #28595

Closed
josharian opened this issue Nov 5, 2018 · 5 comments
Closed

runtime: first benchmark run is much slower than subsequent runs #28595

josharian opened this issue Nov 5, 2018 · 5 comments

Comments

@josharian
Copy link
Contributor

@josharian josharian commented Nov 5, 2018

$ cd $GOROOT/test/bench/go1
$ go test -bench=TimeFormat -count=5

Want:

The first benchmark run is similar to the others, maybe a bit slower because caches are cold. Sample:

goos: darwin
goarch: amd64
BenchmarkTimeFormat-8   	 5000000	       261 ns/op
BenchmarkTimeFormat-8   	 5000000	       254 ns/op
BenchmarkTimeFormat-8   	 5000000	       253 ns/op
BenchmarkTimeFormat-8   	 5000000	       252 ns/op
BenchmarkTimeFormat-8   	 5000000	       249 ns/op
PASS
ok  	_/Users/josh/go/tip/test/bench/go1	8.967s

Got:

The first benchmark run is drastically slower. Sample:

goos: darwin
goarch: amd64
BenchmarkTimeFormat-8   	 5000000	       504 ns/op
BenchmarkTimeFormat-8   	 5000000	       243 ns/op
BenchmarkTimeFormat-8   	 5000000	       244 ns/op
BenchmarkTimeFormat-8   	 5000000	       241 ns/op
BenchmarkTimeFormat-8   	 5000000	       239 ns/op
PASS
ok  	_/Users/josh/go/tip/test/bench/go1	10.259s

I noticed this while trying to understand some wild swings (50%+) in go1 benchmark performance that are not limited to the first benchmark run. I don't know whether the root cause is the same or not, but this issue at least reproduces reliably.

Bisect points to CL 139719 / c803ffc.

cc @mknyszek

@josharian josharian added this to the Go1.12 milestone Nov 5, 2018
@josharian

This comment has been minimized.

Copy link
Contributor Author

@josharian josharian commented Nov 5, 2018

FYI @martisch @randall77 this is part of the puzzling go1 benchmark results I mentioned in CL 147360

@mvdan

This comment has been minimized.

Copy link
Member

@mvdan mvdan commented Nov 5, 2018

Thanks for spotting and reporting this, @josharian! I'm sure you've saved countless head-scratching hours :)

@mknyszek

This comment has been minimized.

Copy link
Contributor

@mknyszek mknyszek commented Nov 5, 2018

Thanks for bisecting and for reporting the issue!

I suspect that the regession you're seeing actually is the result of the runtime scavenging (i.e. releasing pages back to the OS) immediately after the first heap growth. For small heaps this shouldn't really be an issue since the scavenging routine should back out quite fast, but it seems that it's doing more work than expected in this case and perhaps being a little too aggressive.

I'll investigate and see what I can do.

@mknyszek

This comment has been minimized.

Copy link
Contributor

@mknyszek mknyszek commented Nov 6, 2018

I found the issue while I was writing this, so it the rest of the comment reads a little more like "how I figured it out". TL;DR is that we're calling madvise way too often due to some bad logic on my part. CL incoming.

It seems that on x64 Linux it's not so bad.

$ perflock ~/toolchain/bin/go test -bench=TimeFormat -count=5                                                                     
goos: linux
goarch: amd64
BenchmarkTimeFormat-8            5000000               318 ns/op
BenchmarkTimeFormat-8            5000000               297 ns/op
BenchmarkTimeFormat-8            5000000               299 ns/op
BenchmarkTimeFormat-8            5000000               298 ns/op
BenchmarkTimeFormat-8            5000000               296 ns/op
PASS
ok      _/usr/local/google/home/mknyszek/toolchain/test/bench/go1       10.911s        

On Darwin I was able to reproduce it, and I confirmed the cause is my change. If I run the same command with -count=1 and -cpuprofile cpu.prof, then put cpu.prof through pprof, I get:

      flat  flat%   sum%        cum   cum%
     1.42s 47.65% 47.65%      1.42s 47.65%  runtime.memclrNoHeapPointers
     0.68s 22.82% 70.47%      0.68s 22.82%  runtime.madvise
     0.37s 12.42% 82.89%      0.37s 12.42%  runtime.memmove
     0.11s  3.69% 86.58%      0.25s  8.39%  time.Time.AppendFormat
     0.10s  3.36% 89.93%      0.10s  3.36%  runtime.(*mspan).init
     0.09s  3.02% 92.95%      0.09s  3.02%  runtime.(*mspan).refillAllocCache
     0.03s  1.01% 93.96%      0.03s  1.01%  time.(*Location).lookup
     0.03s  1.01% 94.97%      0.03s  1.01%  time.absDate
     0.02s  0.67% 95.64%      0.02s  0.67%  runtime.mmap
     0.02s  0.67% 96.31%      0.02s  0.67%  time.nextStdChunk

The runtime.madvise being so high up definitely points to my change(s). madvise might be more expensive on Darwin than on Linux. If I run with -count=5:

Showing top 10 nodes out of 47                                               
      flat  flat%   sum%        cum   cum%                         
    2080ms 17.48% 17.48%     2080ms 17.48%  runtime.memclrNoHeapPointers
    1670ms 14.03% 31.51%     6310ms 53.03%  time.Time.AppendFormat
    1620ms 13.61% 45.13%     1620ms 13.61%  runtime.memmove  
    1050ms  8.82% 53.95%     1090ms  9.16%  time.nextStdChunk
     800ms  6.72% 60.67%     1050ms  8.82%  time.appendInt
     690ms  5.80% 66.47%      800ms  6.72%  time.(*Location).lookup
     560ms  4.71% 71.18%      560ms  4.71%  runtime.madvise
     530ms  4.45% 75.63%      540ms  4.54%  time.absDate
     430ms  3.61% 79.24%      430ms  3.61%  runtime.(*mspan).refillAllocCache
     420ms  3.53% 82.77%      420ms  3.53%  runtime.(*mspan).init

Then it's not quite so high up, since those madvise calls are likely mostly made during the first run. Maybe during a heap growth.

To make sure, I tracked down where the madvise calls were coming from with some print debugging. Turns out I was wrong! The heap-growth scavenging routine only actually scavenges one enormous span. The real problem arises primarily from a performance bug in a previous change of mine.

When freeing a span, the runtime will may try to coalesce a scavenged span with a neighboring unscavenged span, so on merging, we re-scavenge the whole thing. Freeing a span happens more frequently than you might think, since when we're allocating a new span we search for a "best fit" free span, and if it's not of the same size we trim off the excess. In this particular case, we have one enormous free span which we scavenge. Then, lots of smaller allocations come in, and they end up hitting the space where that scavenged span lived, trimming it and returning the trimmings (which are considered scavenged) each time, only for those trimmings to get allocated out of again.

In this freeing process, due to a bug, we actually end up re-scavenging the span we're freeing if it's already scavenged, even if it didn't coalesce with anything. So, we basically have a lot of unnecessary additional madvise calls happening, which has a big impact if madvise is expensive of the platform.

I hacked together a fix for the freeing logic locally and it does in fact fix this issue. CL incoming.

@gopherbot

This comment has been minimized.

Copy link

@gopherbot gopherbot commented Nov 6, 2018

Change https://golang.org/cl/147559 mentions this issue: runtime: stop unnecessary span scavenges on free

@gopherbot gopherbot closed this in 06be7cb Nov 9, 2018
@golang golang locked and limited conversation to collaborators Nov 9, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
4 participants
You can’t perform that action at this time.