go version devel +ae9712c766 Thu Dec 1 22:20:31 2016 +0000 darwin/amd64, and linux/amd64
I wrote a test case for the GC latency described by @WillSewell in https://blog.pusher.com/golangs-real-time-gc-in-theory-and-practice/. I recorded timestamps for the pauses observed during allocation and matched them against GODEBUG=gctrace=1 output.
The pauses come after the GC cycle, during the concurrent sweep phase. Also, they happen as the heap grows and do not happen once the heap size is stable. And the pauses are often longer than the clock time of any single gctrace line.
A goroutine that needs to allocate memory right after a GC cycle is required to help with the sweep work. This means looking through the heap until suitable memory is found. If the heap is growing rapidly with little or no garbage created, this mutator goroutine will need to search for a long time: it can end up sweeping half of the heap (shared with the background sweep worker).
The GC is pretty good at controlling the mark assist burden on any single goroutine, and at starting the GC cycle before the allowable heap growth is exhausted, which lets it trade off assist burden against heap growth during the cycle. These protections don't seem to be in place for sweep assists.
The execution tracer's view of this is confusing, since it emits events for each call to runtime.(*mspan).sweep, which is called in a loop from runtime.(*mcentral).cacheSpan. This makes it look like the user goroutine is making a lot of progress on sweep work, since a single call to runtime.mallocgc can lead to thousands of tiny "SWEEP" blocks.
Frustrated assists like this are perhaps more likely to happen in microbenchmarks than in production applications, but the way sweep work is shown in the execution tracer makes it hard to say for sure. This could be a contributing factor to #14812 — one of the goroutines in the recent execution trace spent over 17ms in sweep assists.
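To make the cacheSpan picture concrete, here is a heavily simplified, standalone sketch of that loop structure. The names span, sweep, and cacheSpanSketch are invented for illustration and the real runtime code differs considerably; the point is only that one allocation can sweep many spans, each showing up as a separate tiny "SWEEP" event.

package main

import "fmt"

// span is a stand-in for runtime.mspan: a block of memory that may need
// sweeping before its free slots can be reused.
type span struct {
	needsSweep bool
	freeSlots  int
}

// sweep stands in for runtime.(*mspan).sweep. The real sweep reclaims dead
// objects; if there is no garbage, it frees nothing.
func sweep(s *span) {
	s.needsSweep = false
}

// cacheSpanSketch mimics the shape of runtime.(*mcentral).cacheSpan: the
// allocating goroutine walks span lists, sweeping spans as it goes, until it
// finds one with free space. Each sweep call is one trace event.
func cacheSpanSketch(spans []*span) *span {
	for _, s := range spans {
		if s.needsSweep {
			sweep(s)
		}
		if s.freeSlots > 0 {
			return s // suitable memory found; allocation can proceed
		}
		// No free slots: keep walking. On a rapidly growing heap with
		// little garbage, this can cover a large fraction of the heap.
	}
	return nil // caller would grow the heap instead
}

func main() {
	// A heap where almost every span is full: the allocator must sweep many
	// spans before finding free space, mirroring the stalls in the test below.
	spans := make([]*span, 10000)
	for i := range spans {
		spans[i] = &span{needsSweep: true}
	}
	spans[len(spans)-1].freeSlots = 1
	fmt.Println("found span with free slots:", cacheSpanSketch(spans) != nil)
}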
The test is below, followed by some example runs.
package gc_test

import (
	"flag"
	"testing"
	"time"
)

var (
	start       = time.Now()
	allocSize   = flag.Int("alloc-size", 1000, "Bytes per allocation")
	retainCount = flag.Int("retain-count", 1000000, "Number of allocations to retain")
	loopCount   = flag.Int("loop-count", 10, "Number of times to execute loop")
	preallocate = flag.Bool("preallocate", false, "Preallocate retention slice")
)

func TestHeapGrowthStalls(t *testing.T) {
	var mem [][]byte
	if *preallocate {
		mem = make([][]byte, *retainCount)
	}
	var slowDelta time.Duration
	for i := 0; i < (*loopCount)*(*retainCount); i++ {
		// Allow the program's heap to start small, instead of pre-allocating
		// the entire slice.
		if len(mem) < *retainCount {
			mem = append(mem, nil)
		}
		j := i % len(mem)
		before := time.Now()
		mem[j] = make([]byte, *allocSize)
		delta := time.Since(before)
		// Log allocations that set a new record for slowness, or that take
		// over 1ms and come within half of the current record.
		if delta > slowDelta || (delta > time.Millisecond && delta > slowDelta/2) {
			slowDelta = delta
			at := time.Since(start)
			t.Logf("i=%-8d at=%.3fs delta=%.3fms",
				i,
				at.Seconds(),
				delta.Seconds()*float64(time.Second/time.Millisecond))
		}
	}
}
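The test binary in the runs below was presumably built ahead of time with the standard go test -c, along the lines of the following (the /tmp/gc.test output path just matches the first invocation):

$ go test -c -o /tmp/gc.test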
On a darwin/amd64 laptop:
$ GODEBUG=gctrace=1 /tmp/gc.test -test.v
=== RUN TestHeapGrowthStalls
gc 1 @0.002s 1%: 0.035+0.14+0.019 ms clock, 0.10+0.047/0.12/0.12+0.059 ms cpu, 4->4->3 MB, 5 MB goal, 8 P
gc 2 @0.003s 2%: 0.003+0.63+0.028 ms clock, 0.024+0.054/0.17/0.14+0.22 ms cpu, 6->7->6 MB, 7 MB goal, 8 P
gc 3 @0.004s 3%: 0.003+0.29+0.034 ms clock, 0.026+0.16/0.16/0.095+0.27 ms cpu, 12->12->12 MB, 13 MB goal, 8 P
gc 4 @0.007s 4%: 0.004+1.2+0.048 ms clock, 0.033+0.94/0.059/0.29+0.39 ms cpu, 23->23->22 MB, 24 MB goal, 8 P
gc 5 @0.012s 3%: 0.005+0.59+0.024 ms clock, 0.044+0.054/0.59/0.58+0.19 ms cpu, 44->44->42 MB, 45 MB goal, 8 P
gc 6 @0.020s 2%: 0.005+5.9+0.033 ms clock, 0.040+0.093/1.1/5.7+0.26 ms cpu, 84->84->80 MB, 85 MB goal, 8 P
gc 7 @0.045s 2%: 0.006+1.7+0.036 ms clock, 0.051+0.25/2.2/2.3+0.29 ms cpu, 156->157->152 MB, 160 MB goal, 8 P
gc 8 @0.077s 1%: 0.006+2.1+0.027 ms clock, 0.055+1.0/2.9/7.3+0.21 ms cpu, 297->300->286 MB, 305 MB goal, 8 P
gc 9 @0.154s 1%: 0.008+4.4+0.044 ms clock, 0.070+2.3/4.6/18+0.35 ms cpu, 558->563->537 MB, 573 MB goal, 8 P
gc 10 @0.268s 1%: 0.008+7.5+0.044 ms clock, 0.067+3.7/12/32+0.35 ms cpu, 1047->1056->1004 MB, 1074 MB goal, 8 P
gc 11 @0.428s 1%: 0.008+7.8+0.043 ms clock, 0.070+1.9/13/36+0.34 ms cpu, 1959->1974->1018 MB, 2009 MB goal, 8 P
gc 12 @0.924s 0%: 0.008+8.6+0.17 ms clock, 0.071+1.8/15/38+1.3 ms cpu, 1985->2004->1022 MB, 2036 MB goal, 8 P
gc 13 @1.426s 0%: 0.006+9.0+0.034 ms clock, 0.053+1.8/17/41+0.27 ms cpu, 1993->2000->1010 MB, 2044 MB goal, 8 P
gc 14 @1.658s 0%: 0.008+9.4+0.026 ms clock, 0.070+1.8/15/45+0.21 ms cpu, 1969->1982->1015 MB, 2020 MB goal, 8 P
gc 15 @1.905s 0%: 0.015+9.5+0.075 ms clock, 0.12+1.9/18/41+0.60 ms cpu, 1980->1994->1017 MB, 2031 MB goal, 8 P
gc 16 @2.143s 0%: 0.016+11+0.045 ms clock, 0.12+1.8/21/54+0.36 ms cpu, 1983->1999->1019 MB, 2034 MB goal, 8 P
gc 17 @2.385s 0%: 0.015+10+0.066 ms clock, 0.12+2.4/20/49+0.53 ms cpu, 1987->2002->1018 MB, 2038 MB goal, 8 P
gc 18 @2.616s 0%: 0.015+11+0.077 ms clock, 0.12+2.4/15/59+0.61 ms cpu, 1985->2004->1021 MB, 2036 MB goal, 8 P
--- PASS: TestHeapGrowthStalls (2.85s)
gc_test.go:42: i=0 at=0.000s delta=0.000ms
gc_test.go:42: i=2 at=0.000s delta=0.001ms
gc_test.go:42: i=18 at=0.000s delta=0.002ms
gc_test.go:42: i=26 at=0.001s delta=0.002ms
gc_test.go:42: i=114 at=0.001s delta=0.003ms
gc_test.go:42: i=170 at=0.001s delta=0.004ms
gc_test.go:42: i=218 at=0.001s delta=0.014ms
gc_test.go:42: i=3226 at=0.002s delta=0.157ms
gc_test.go:42: i=12016 at=0.005s delta=0.385ms
gc_test.go:42: i=22534 at=0.009s delta=0.408ms
gc_test.go:42: i=41997 at=0.014s delta=0.797ms
gc_test.go:42: i=79541 at=0.028s delta=1.451ms
gc_test.go:42: i=152493 at=0.051s delta=3.489ms
gc_test.go:42: i=286213 at=0.084s delta=5.107ms
gc_test.go:42: i=286229 at=0.087s delta=2.922ms
gc_test.go:42: i=536053 at=0.169s delta=10.281ms
gc_test.go:42: i=1001709 at=0.295s delta=20.094ms
PASS
The at=0.169s delta=10.281ms line matches up with gc 9 @0.154s 1%: 0.008+4.4+0.044 ms clock. The GC cycle began at 0.154s and, with roughly 4.5ms of total clock time, completed around 0.158s. The record-breaking allocation delay finished at 0.169s, so it began around 0.159s (0.169s minus 10.281ms), right after the GC cycle ended, during the concurrent sweep.
On a linux/amd64 server:
$ GOMAXPROCS=8 GODEBUG=gctrace=1 ./gc.test.linux_amd64 -test.v -retain-count=20000000 -preallocate=true -loop-count=3
=== RUN TestHeapGrowthStalls
gc 1 @0.045s 11%: 0.11+21+0.072 ms clock, 0.22+20/43/106+0.14 ms cpu, 458->459->458 MB, 459 MB goal, 8 P
gc 2 @0.068s 14%: 0.004+27+0.10 ms clock, 0.036+0.063/46/93+0.87 ms cpu, 459->513->513 MB, 917 MB goal, 8 P
gc 3 @0.172s 10%: 0.019+30+0.084 ms clock, 0.15+0.20/54/135+0.67 ms cpu, 739->785->784 MB, 1027 MB goal, 8 P
gc 4 @0.389s 6%: 0.014+33+0.039 ms clock, 0.11+0.71/64/161+0.31 ms cpu, 1315->1388->1388 MB, 1569 MB goal, 8 P
gc 5 @0.795s 4%: 0.026+33+0.044 ms clock, 0.21+1.6/62/164+0.35 ms cpu, 2486->2557->2557 MB, 2777 MB goal, 8 P
gc 6 @1.589s 3%: 0.025+84+0.046 ms clock, 0.20+2.3/159/406+0.37 ms cpu, 4780->4915->4915 MB, 5115 MB goal, 8 P
gc 7 @3.202s 2%: 0.015+63+0.041 ms clock, 0.12+4.3/126/313+0.32 ms cpu, 9377->9499->9499 MB, 9830 MB goal, 8 P
gc 8 @6.350s 1%: 0.018+204+0.091 ms clock, 0.14+55/395/998+0.72 ms cpu, 18438->18685->18685 MB, 18999 MB goal, 8 P
gc 9 @13.058s 1%: 0.019+300+0.066 ms clock, 0.15+112/565/1496+0.52 ms cpu, 36436->36706->20259 MB, 37370 MB goal, 8 P
gc 10 @34.001s 0%: 0.017+265+0.11 ms clock, 0.13+83/507/1313+0.91 ms cpu, 39506->39732->20215 MB, 40519 MB goal, 8 P
--- PASS: TestHeapGrowthStalls (53.37s)
gc_test.go:42: i=0 at=0.045s delta=0.581ms
gc_test.go:42: i=57055 at=0.097s delta=2.220ms
gc_test.go:42: i=331266 at=0.200s delta=2.323ms
gc_test.go:42: i=334871 at=0.221s delta=19.590ms
gc_test.go:42: i=952975 at=0.454s delta=32.232ms
gc_test.go:42: i=2150327 at=0.889s delta=61.852ms
gc_test.go:42: i=4564455 at=1.799s delta=126.545ms
gc_test.go:42: i=9258703 at=3.533s delta=268.414ms
gc_test.go:42: i=18664799 at=7.354s delta=799.231ms
PASS
/cc @aclements @RLH