runtime: latency in sweep assists when there's no garbage #18155

Open
rhysh opened this Issue Dec 2, 2016 · 30 comments

@rhysh
Contributor

rhysh commented Dec 2, 2016

go version devel +ae9712c766 Thu Dec 1 22:20:31 2016 +0000 darwin/amd64, and linux/amd64

I wrote a test case for the GC latency described by @WillSewell in https://blog.pusher.com/golangs-real-time-gc-in-theory-and-practice/ . I recorded the timestamps of when pauses (during allocation) were observed, and matched them against GODEBUG=gctrace=1 output.

The pauses come after the GC cycle, during the concurrent sweep phase. They happen while the heap is growing and stop once the heap size is stable, and they are often longer than the clock time of any single gctrace line.

A goroutine that needs to allocate memory right after a GC cycle is required to help with the sweep work. This means looking through the heap until suitable memory is found. If the heap is growing rapidly with little or no garbage created, this mutator goroutine will need to search for a long time: it can end up sweeping half of the heap (shared with the background sweep worker).

The GC is pretty good at controlling the mark assist burden on any single goroutine, and at starting the GC cycle before the allowable heap growth is exhausted, which allows trading off assist burden against heap growth during the cycle. These protections don't seem to be in place for sweep assists.

The execution tracer's view of this is confusing since it emits events for each call to runtime.(*mspan).sweep, which is called in a loop from runtime.(*mcentral).cacheSpan. This makes it look like the user goroutine is making a lot of progress on sweep work, since a single call to runtime.mallocgc can lead to thousands of tiny "SWEEP" blocks.

Frustrated assists like this are perhaps more likely to happen in microbenchmarks than in production applications, but the way sweep work is shown in the execution tracer makes it hard to say for sure. This could be a contributing factor to #14812 — one of the goroutines in the recent execution trace spent over 17ms in sweep assists.
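
To make the failure mode concrete, here is a toy model of the search an allocating goroutine performs (a simplified sketch in Go, not the actual runtime code): it keeps sweeping spans of its size class until one yields free space, and nothing bounds how many full spans it sweeps along the way.

package main

import "fmt"

// span is a toy stand-in for a runtime mspan: "sweeping" it tells us how
// many free object slots it has.
type span struct {
	swept bool
	free  int
}

// cacheSpanToy is a simplified model of the loop that refills an mcache
// from an mcentral: keep sweeping spans of one size class until one
// yields free space. Nothing bounds how many full spans a single
// allocating goroutine sweeps along the way.
func cacheSpanToy(spans []*span) (*span, int) {
	swept := 0
	for _, s := range spans {
		if !s.swept {
			s.swept = true // "sweep": scan mark bits, build a free list
			swept++
		}
		if s.free > 0 {
			return s, swept // found space; the allocation can proceed
		}
	}
	return nil, swept // nothing free anywhere: grow the heap instead
}

func main() {
	// A growing heap with no garbage: every existing span is full.
	spans := make([]*span, 100000)
	for i := range spans {
		spans[i] = &span{}
	}
	_, swept := cacheSpanToy(spans)
	fmt.Println("spans swept by a single allocation:", swept)
}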

The test is below, followed by some example runs.

package gc_test

import (
	"flag"
	"testing"
	"time"
)

var (
	start       = time.Now()
	allocSize   = flag.Int("alloc-size", 1000, "Bytes per allocation")
	retainCount = flag.Int("retain-count", 1000000, "Number of allocations to retain")
	loopCount   = flag.Int("loop-count", 10, "Number of times to execute loop")
	preallocate = flag.Bool("preallocate", false, "Preallocate retention slice")
)

func TestHeapGrowthStalls(t *testing.T) {
	var mem [][]byte
	if *preallocate {
		mem = make([][]byte, *retainCount)
	}

	var slowDelta time.Duration

	for i := 0; i < (*loopCount)*(*retainCount); i++ {
		// Allow the program's heap to start small, instead of pre-allocating
		// the entire slice.
		if len(mem) < *retainCount {
			mem = append(mem, nil)
		}

		j := i % len(mem)
		before := time.Now()
		mem[j] = make([]byte, *allocSize)
		delta := time.Since(before)
		if delta > slowDelta || (delta > time.Millisecond && delta > slowDelta/2) {
			slowDelta = delta
			at := time.Since(start)
			t.Logf("i=%-8d at=%.3fs   delta=%.3fms",
				i,
				at.Seconds(),
				delta.Seconds()*float64(time.Second/time.Millisecond))
		}
	}
}

On a darwin/amd64 laptop:

$ GODEBUG=gctrace=1 /tmp/gc.test -test.v
=== RUN   TestHeapGrowthStalls
gc 1 @0.002s 1%: 0.035+0.14+0.019 ms clock, 0.10+0.047/0.12/0.12+0.059 ms cpu, 4->4->3 MB, 5 MB goal, 8 P
gc 2 @0.003s 2%: 0.003+0.63+0.028 ms clock, 0.024+0.054/0.17/0.14+0.22 ms cpu, 6->7->6 MB, 7 MB goal, 8 P
gc 3 @0.004s 3%: 0.003+0.29+0.034 ms clock, 0.026+0.16/0.16/0.095+0.27 ms cpu, 12->12->12 MB, 13 MB goal, 8 P
gc 4 @0.007s 4%: 0.004+1.2+0.048 ms clock, 0.033+0.94/0.059/0.29+0.39 ms cpu, 23->23->22 MB, 24 MB goal, 8 P
gc 5 @0.012s 3%: 0.005+0.59+0.024 ms clock, 0.044+0.054/0.59/0.58+0.19 ms cpu, 44->44->42 MB, 45 MB goal, 8 P
gc 6 @0.020s 2%: 0.005+5.9+0.033 ms clock, 0.040+0.093/1.1/5.7+0.26 ms cpu, 84->84->80 MB, 85 MB goal, 8 P
gc 7 @0.045s 2%: 0.006+1.7+0.036 ms clock, 0.051+0.25/2.2/2.3+0.29 ms cpu, 156->157->152 MB, 160 MB goal, 8 P
gc 8 @0.077s 1%: 0.006+2.1+0.027 ms clock, 0.055+1.0/2.9/7.3+0.21 ms cpu, 297->300->286 MB, 305 MB goal, 8 P
gc 9 @0.154s 1%: 0.008+4.4+0.044 ms clock, 0.070+2.3/4.6/18+0.35 ms cpu, 558->563->537 MB, 573 MB goal, 8 P
gc 10 @0.268s 1%: 0.008+7.5+0.044 ms clock, 0.067+3.7/12/32+0.35 ms cpu, 1047->1056->1004 MB, 1074 MB goal, 8 P
gc 11 @0.428s 1%: 0.008+7.8+0.043 ms clock, 0.070+1.9/13/36+0.34 ms cpu, 1959->1974->1018 MB, 2009 MB goal, 8 P
gc 12 @0.924s 0%: 0.008+8.6+0.17 ms clock, 0.071+1.8/15/38+1.3 ms cpu, 1985->2004->1022 MB, 2036 MB goal, 8 P
gc 13 @1.426s 0%: 0.006+9.0+0.034 ms clock, 0.053+1.8/17/41+0.27 ms cpu, 1993->2000->1010 MB, 2044 MB goal, 8 P
gc 14 @1.658s 0%: 0.008+9.4+0.026 ms clock, 0.070+1.8/15/45+0.21 ms cpu, 1969->1982->1015 MB, 2020 MB goal, 8 P
gc 15 @1.905s 0%: 0.015+9.5+0.075 ms clock, 0.12+1.9/18/41+0.60 ms cpu, 1980->1994->1017 MB, 2031 MB goal, 8 P
gc 16 @2.143s 0%: 0.016+11+0.045 ms clock, 0.12+1.8/21/54+0.36 ms cpu, 1983->1999->1019 MB, 2034 MB goal, 8 P
gc 17 @2.385s 0%: 0.015+10+0.066 ms clock, 0.12+2.4/20/49+0.53 ms cpu, 1987->2002->1018 MB, 2038 MB goal, 8 P
gc 18 @2.616s 0%: 0.015+11+0.077 ms clock, 0.12+2.4/15/59+0.61 ms cpu, 1985->2004->1021 MB, 2036 MB goal, 8 P
--- PASS: TestHeapGrowthStalls (2.85s)
	gc_test.go:42: i=0        at=0.000s   delta=0.000ms
	gc_test.go:42: i=2        at=0.000s   delta=0.001ms
	gc_test.go:42: i=18       at=0.000s   delta=0.002ms
	gc_test.go:42: i=26       at=0.001s   delta=0.002ms
	gc_test.go:42: i=114      at=0.001s   delta=0.003ms
	gc_test.go:42: i=170      at=0.001s   delta=0.004ms
	gc_test.go:42: i=218      at=0.001s   delta=0.014ms
	gc_test.go:42: i=3226     at=0.002s   delta=0.157ms
	gc_test.go:42: i=12016    at=0.005s   delta=0.385ms
	gc_test.go:42: i=22534    at=0.009s   delta=0.408ms
	gc_test.go:42: i=41997    at=0.014s   delta=0.797ms
	gc_test.go:42: i=79541    at=0.028s   delta=1.451ms
	gc_test.go:42: i=152493   at=0.051s   delta=3.489ms
	gc_test.go:42: i=286213   at=0.084s   delta=5.107ms
	gc_test.go:42: i=286229   at=0.087s   delta=2.922ms
	gc_test.go:42: i=536053   at=0.169s   delta=10.281ms
	gc_test.go:42: i=1001709  at=0.295s   delta=20.094ms
PASS

The at=0.169s delta=10.281ms line matches up with gc 9 @0.154s 1%: 0.008+4.4+0.044 ms clock. The GC cycle began at 0.154s and completed around 0.158s. The record-breaking allocation delay finished at 0.169s and began at 0.159s.

On a linux/amd64 server:

$ GOMAXPROCS=8 GODEBUG=gctrace=1 ./gc.test.linux_amd64 -test.v -retain-count=20000000 -preallocate=true -loop-count=3
=== RUN   TestHeapGrowthStalls
gc 1 @0.045s 11%: 0.11+21+0.072 ms clock, 0.22+20/43/106+0.14 ms cpu, 458->459->458 MB, 459 MB goal, 8 P
gc 2 @0.068s 14%: 0.004+27+0.10 ms clock, 0.036+0.063/46/93+0.87 ms cpu, 459->513->513 MB, 917 MB goal, 8 P
gc 3 @0.172s 10%: 0.019+30+0.084 ms clock, 0.15+0.20/54/135+0.67 ms cpu, 739->785->784 MB, 1027 MB goal, 8 P
gc 4 @0.389s 6%: 0.014+33+0.039 ms clock, 0.11+0.71/64/161+0.31 ms cpu, 1315->1388->1388 MB, 1569 MB goal, 8 P
gc 5 @0.795s 4%: 0.026+33+0.044 ms clock, 0.21+1.6/62/164+0.35 ms cpu, 2486->2557->2557 MB, 2777 MB goal, 8 P
gc 6 @1.589s 3%: 0.025+84+0.046 ms clock, 0.20+2.3/159/406+0.37 ms cpu, 4780->4915->4915 MB, 5115 MB goal, 8 P
gc 7 @3.202s 2%: 0.015+63+0.041 ms clock, 0.12+4.3/126/313+0.32 ms cpu, 9377->9499->9499 MB, 9830 MB goal, 8 P
gc 8 @6.350s 1%: 0.018+204+0.091 ms clock, 0.14+55/395/998+0.72 ms cpu, 18438->18685->18685 MB, 18999 MB goal, 8 P
gc 9 @13.058s 1%: 0.019+300+0.066 ms clock, 0.15+112/565/1496+0.52 ms cpu, 36436->36706->20259 MB, 37370 MB goal, 8 P
gc 10 @34.001s 0%: 0.017+265+0.11 ms clock, 0.13+83/507/1313+0.91 ms cpu, 39506->39732->20215 MB, 40519 MB goal, 8 P
--- PASS: TestHeapGrowthStalls (53.37s)
        gc_test.go:42: i=0        at=0.045s   delta=0.581ms
        gc_test.go:42: i=57055    at=0.097s   delta=2.220ms
        gc_test.go:42: i=331266   at=0.200s   delta=2.323ms
        gc_test.go:42: i=334871   at=0.221s   delta=19.590ms
        gc_test.go:42: i=952975   at=0.454s   delta=32.232ms
        gc_test.go:42: i=2150327  at=0.889s   delta=61.852ms
        gc_test.go:42: i=4564455  at=1.799s   delta=126.545ms
        gc_test.go:42: i=9258703  at=3.533s   delta=268.414ms
        gc_test.go:42: i=18664799 at=7.354s   delta=799.231ms
PASS

/cc @aclements @RLH

@aclements

Member

aclements commented Dec 2, 2016

Very interesting. I knew this problem existed for large allocations, but hadn't really thought about it with small allocations.

Assuming I understand what's going on, this isn't really a "sweep assist" problem, since it would happen even if the sweep ratio was 0. Sweep assists don't care if they found more space or not; you get credit for sweeping a span regardless of the outcome. The problem is that allocation ultimately has to find a span with some free space on it and currently it will keep sweeping until it does.

I might have a simple fix: put a threshold on how many spans allocation is willing to sweep before it gives up and just allocates a fresh span from the heap. The heap overhead of this is bounded at 1/threshold.
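
A rough sketch of that threshold idea, reusing toy types rather than the real runtime structures (the constant, names, and growHeap hook here are invented for illustration):

package sketch

type span struct {
	swept bool
	free  int
}

const sweepLimit = 100 // hypothetical bound; a real change would tune this

// cacheSpanBounded sketches the proposal: stop sweeping after sweepLimit
// spans and fall back to a fresh span from the heap, bounding the heap
// overhead at roughly 1/sweepLimit.
func cacheSpanBounded(spans []*span, growHeap func() *span) *span {
	swept := 0
	for _, s := range spans {
		if !s.swept {
			s.swept = true // sweep: scan mark bits, build a free list
			swept++
		}
		if s.free > 0 {
			return s // found space within the budget
		}
		if swept >= sweepLimit {
			break // give up: don't let one allocation sweep forever
		}
	}
	return growHeap() // allocate a fresh span from the heap instead
}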

@aclements

Member

aclements commented Dec 6, 2016

There are actually a few distinct causes of this.

  • One is when sweep has to process a lot of full spans from an mcache before finding one that has space to allocate from. My proposed fix targets this, but doesn't actually work because it just shifts the time into the next problem. (All + timestamps below are in ns.)
long cacheSpan @0.286 10ms, 83577 swept, 0 reclaimed, 1 grow
+27 (p 0) lock mcentral
+71 (p 0) locked; scanning lists
+10539832 (p 0) grow()
+10539993 (p 0) mheap.alloc()
+10540090 (p 0) mheap.alloc_m()
+10540149 (p 0) got lock
+10540220 (p 0) reclaim
+10542490 (p 0) reclaim done
+10542536 (p 0) allocSpanLocked
+10542977 (p 0) initialize span
+10543065 (p 0) unlock
+10543110 (p 0) mheap.alloc_m done
+10543166 (p 0) mheap.alloc done
+10543204 (p 0) initSpan
+10543524 (p 0) grow done
+10543588 (p 0) have span
  • When we go to allocate a fresh span, either because of the fallback in my fix or because we're genuinely out of spans of that size class, the heap also tries to sweep. This is not to find individual free objects, but to free up whole spans before allocating a new one and potentially growing the RSS. This is, in a sense, even more aggressive since it's trying to find completely empty spans. I've been aware of this one for a while and have various schemes for how to fix it, but none that are simple. This doesn't happen very often, but when it does it's really expensive.
long cacheSpan @2.890 22ms, 0 swept, 1 reclaimed, 1 grow
+25 (p 0) lock mcentral
+70 (p 0) locked; scanning lists
+121 (p 0) grow()
+263 (p 0) mheap.alloc()
+380 (p 0) mheap.alloc_m()
+480 (p 0) got lock
+525 (p 0) reclaim
+22056552 (p 0) reclaim done
+22056982 (p 0) allocSpanLocked
+22057164 (p 0) initialize span
+22057377 (p 0) unlock
+22057418 (p 0) mheap.alloc_m done
+22057638 (p 0) memclr
+22058327 (p 0) mheap.alloc done
+22058448 (p 0) initSpan
+22058975 (p 0) grow done
+22059177 (p 0) have span

With my proposed fix, we fall into this case any time we would otherwise have fallen into the previous case. For this particular application, the reclaim sweep has just as much trouble finding free space as the mcentral sweep. This may be less true in a more heterogeneous application.

long cacheSpan @0.161 5ms, 50+0 swept, 0 skipped, 1 grow
+27 (p 0) lock mcentral
+66 (p 0) locked; scanning lists
+5336 (p 0) retry failed
+5375 (p 0) grow()
+5425 (p 0) mheap.alloc()
+5485 (p 0) mheap.alloc_m()
+5525 (p 0) got lock
+5556 (p 0) reclaim
+5092799 (p 0) reclaim done
+5093019 (p 0) allocSpanLocked
+5093269 (p 0) initialize span
+5093403 (p 0) unlock
+5093441 (p 0) mheap.alloc_m done
+5093631 (p 0) memclr
+5098431 (p 0) mheap.alloc done
+5098462 (p 0) initSpan
+5100228 (p 0) grow done
+5100424 (p 0) have span
  • Every once in a while when I run the benchmark, I get dozens of ~10ms stalls. When this happens, it's because memclr in mheap.alloc_m takes 5--14ms. This almost certainly has something to do with page faults and kernel state, though I don't know what. I have a hunch it has to do with transparent huge page scavenging (not clearing; clearing a huge page takes 40 us), but that's purely a hunch.
long cacheSpan @1.643 9ms, 0 swept, 0 reclaimed, 1 grow
+28 (p 0) lock mcentral
+74 (p 0) locked; scanning lists
+194 (p 0) grow()
+247 (p 0) mheap.alloc()
+307 (p 0) mheap.alloc_m()
+350 (p 0) got lock
+391 (p 0) allocSpanLocked
+701 (p 0) initialize span
+761 (p 0) unlock
+804 (p 0) mheap.alloc_m done
+843 (p 0) memclr
+9365603 (p 0) mheap.alloc done
+9366045 (p 0) initSpan
+9366973 (p 0) grow done
+9367295 (p 0) have span

@aclements aclements added the NeedsFix label Dec 6, 2016

@aclements aclements added this to the Go1.8Maybe milestone Dec 6, 2016

@gopherbot

gopherbot commented Dec 12, 2016

CL https://golang.org/cl/34291 mentions this issue.

@aclements

Member

aclements commented Dec 12, 2016

@rhysh, care to give CL 34291 a shot? I don't think that's entirely the right approach, but it definitely makes things better.

@aclements aclements modified the milestones: Go1.9, Go1.8Maybe Dec 13, 2016

@rhysh

Contributor

rhysh commented Jan 20, 2017

Hi @aclements, yes, CL 34291 PS 1 makes things better. It's particularly apparent when the heap is large, and the stalls of over 500ms are reduced to less than 100ms.

Here's the output I got from go1.8rc2 with CL 34291 PS 1 cherry-picked on linux/amd64:

$ GOMAXPROCS=8 GODEBUG=gctrace=1 ./gc.test.go18rc2.cl34291ps1.linux_amd64 -test.v -retain-count=20000000 -preallocate=true -loop-count=3
=== RUN   TestHeapGrowthStalls
gc 1 @0.043s 13%: 0.17+25+0.070 ms clock, 0.35+23/49/122+0.14 ms cpu, 458->459->458 MB, 459 MB goal, 8 P
gc 2 @0.069s 15%: 0.007+34+0.13 ms clock, 0.057+0.12/57/119+1.1 ms cpu, 459->524->524 MB, 917 MB goal, 8 P
gc 3 @0.182s 11%: 0.008+29+0.054 ms clock, 0.070+0.29/57/144+0.43 ms cpu, 748->800->800 MB, 1048 MB goal, 8 P
gc 4 @0.397s 7%: 0.016+29+0.043 ms clock, 0.13+0.75/56/142+0.34 ms cpu, 1332->1375->1375 MB, 1601 MB goal, 8 P
gc 5 @0.801s 4%: 0.013+33+0.066 ms clock, 0.10+1.5/64/162+0.53 ms cpu, 2482->2555->2555 MB, 2750 MB goal, 8 P
gc 6 @1.604s 3%: 0.013+41+0.033 ms clock, 0.11+2.3/79/202+0.26 ms cpu, 4792->4876->4876 MB, 5111 MB goal, 8 P
gc 7 @3.199s 2%: 0.020+85+0.063 ms clock, 0.16+4.3/158/368+0.50 ms cpu, 9367->9524->9524 MB, 9752 MB goal, 8 P
gc 8 @6.374s 1%: 0.019+149+0.10 ms clock, 0.15+19/280/682+0.80 ms cpu, 18515->18767->18767 MB, 19049 MB goal, 8 P
gc 9 @12.683s 1%: 0.018+169+0.054 ms clock, 0.14+31/334/834+0.43 ms cpu, 36597->36799->20191 MB, 37535 MB goal, 8 P
gc 10 @37.305s 0%: 0.026+190+0.040 ms clock, 0.21+39/363/912+0.32 ms cpu, 39373->39596->20212 MB, 40382 MB goal, 8 P
--- PASS: TestHeapGrowthStalls (55.54s)
	gc_test.go:42: i=0        at=0.043s   delta=0.710ms
	gc_test.go:42: i=68087    at=0.107s   delta=0.870ms
	gc_test.go:42: i=331280   at=0.202s   delta=2.343ms
	gc_test.go:42: i=351240   at=0.230s   delta=1.670ms
	gc_test.go:42: i=940256   at=0.456s   delta=4.094ms
	gc_test.go:42: i=2150448  at=0.898s   delta=7.048ms
	gc_test.go:42: i=4529512  at=1.789s   delta=22.685ms
	gc_test.go:42: i=9294784  at=3.549s   delta=39.208ms
	gc_test.go:42: i=18770456 at=7.122s   delta=88.855ms
PASS

The longest stall, 89ms, started around (7.122s-0.089s = 7.033s), which is not immediately after the previous GC, which ended around (6.374s+0.149s = 6.523s). It's likely still during the concurrent sweep phase (which takes around 560ms, based on the go1.8rc2 run below), but it doesn't start at the beginning of the sweep phase.

Compared with the output from vanilla go1.8rc2:

$ GOMAXPROCS=8 GODEBUG=gctrace=1 ./gc.test.go18rc2.linux_amd64 -test.v -retain-count=20000000 -preallocate=true -loop-count=3
=== RUN   TestHeapGrowthStalls
gc 1 @0.038s 13%: 0.17+21+0.080 ms clock, 0.35+20/42/105+0.16 ms cpu, 458->459->458 MB, 459 MB goal, 8 P
gc 2 @0.061s 15%: 0.006+29+0.076 ms clock, 0.055+0.21/49/99+0.61 ms cpu, 459->502->502 MB, 917 MB goal, 8 P
gc 3 @0.171s 10%: 0.009+26+0.051 ms clock, 0.073+0.25/48/121+0.41 ms cpu, 729->785->785 MB, 1005 MB goal, 8 P
gc 4 @0.369s 6%: 0.013+29+0.10 ms clock, 0.10+0.69/57/143+0.83 ms cpu, 1311->1347->1347 MB, 1570 MB goal, 8 P
gc 5 @0.785s 4%: 0.015+48+0.062 ms clock, 0.12+1.6/82/175+0.49 ms cpu, 2441->2506->2505 MB, 2694 MB goal, 8 P
gc 6 @1.594s 2%: 0.011+43+0.060 ms clock, 0.090+2.5/82/211+0.48 ms cpu, 4716->4805->4805 MB, 5011 MB goal, 8 P
gc 7 @3.143s 2%: 0.016+73+0.030 ms clock, 0.13+4.8/145/364+0.24 ms cpu, 9238->9366->9365 MB, 9610 MB goal, 8 P
gc 8 @6.224s 1%: 0.019+139+0.037 ms clock, 0.15+20/267/697+0.30 ms cpu, 18241->18472->18472 MB, 18731 MB goal, 8 P
gc 9 @12.331s 1%: 0.020+152+0.038 ms clock, 0.16+30/302/755+0.30 ms cpu, 36020->36228->20196 MB, 36944 MB goal, 8 P
gc 10 @32.834s 0%: 0.013+209+0.058 ms clock, 0.10+36/397/1053+0.46 ms cpu, 39383->39624->20229 MB, 40393 MB goal, 8 P
--- PASS: TestHeapGrowthStalls (55.93s)
	gc_test.go:42: i=0        at=0.039s   delta=1.323ms
	gc_test.go:42: i=45710    at=0.091s   delta=1.879ms
	gc_test.go:42: i=331286   at=0.195s   delta=2.284ms
	gc_test.go:42: i=335222   at=0.206s   delta=9.855ms
	gc_test.go:42: i=910622   at=0.446s   delta=48.938ms
	gc_test.go:42: i=2097248  at=0.904s   delta=70.692ms
	gc_test.go:42: i=4451669  at=1.777s   delta=140.077ms
	gc_test.go:42: i=9121877  at=3.487s   delta=271.325ms
	gc_test.go:42: i=18446653 at=6.923s   delta=560.603ms
PASS

I'm not aware of any production applications that this issue affects: observing the issue (if I understand it correctly) seems to require the unusual combination of 1) a growing heap, 2) very little garbage to reclaim, and 3) caring about latency in individual goroutines.

@WillSewell

WillSewell commented Feb 14, 2017

I'm trying to understand this problem, and there are a few things that are confusing me. In my original benchmark, when I visualise a run of the program with go tool trace, the pauses look like this:

[go tool trace screenshot]

i.e. all processors are running a combination of runtime.gcBgMarkWorker and MARK and GC.

When I visualise the benchmark listed at the top of this issue, the pauses look like this:

[go tool trace screenshot]

This also has all processors running runtime.gcBgMarkWorker at the start, and processors running MARK at the end, but for a very short time (<0.05ms). But one processor runs runtime.gcBgMarkWorker and one runs GC for relatively much longer.

A couple of things about this confuse me:

  1. Why is the work done during the pauses not quite the same? I appreciate this could just be due to a larger heap size.
  2. From what I can tell (just from the names of the events), all the work is to do with the mark phase. What about this indicates it's pausing to do sweep assist work? Is it to do with the mysteriously named GC event?
  3. I came to the conclusion that the problem we are seeing in the visualisation is this: #16528. Is this not correct? Or am I seeing #16528 in the above screenshots, and I'm missing the issue discussed here in the trace visualisations?

Happy to take this offline if someone would be willing to explain this to me.

@gopherbot

gopherbot commented Aug 31, 2017

Change https://golang.org/cl/60790 mentions this issue: cmd/trace: add minimum mutator utilization (MMU) plot

@rsc rsc modified the milestones: Go1.10, Go1.11 Nov 22, 2017

@cespare

Contributor

cespare commented Feb 24, 2018

I have what I think is an acute case of this problem. (If that's not the case -- apologies; please ignore.)

Over on #23044 I described my current workaround for the "small heap problem" -- where a program generates a lot of garbage but keeps a very small amount of live data, leading to extremely frequent GCs.

As I mentioned in #23044 I'm currently using heap ballast as a horrible workaround to force the garbage collector to set a much higher size target.
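
For reference, the ballast workaround amounts to something like this minimal sketch (not the production code; the size and function names are arbitrary):

package main

import "runtime"

func main() {
	// Ballast: a large allocation that stays live for the life of the
	// process. It inflates the live heap, so the GC's heap goal
	// (roughly live heap * (1 + GOGC/100)) is much larger and
	// collections run far less often. The bytes are never touched, so
	// on most operating systems they cost little physical memory.
	ballast := make([]byte, 1<<30) // 1 GiB; size chosen per workload

	runRealWorkload() // stand-in for the actual server loop

	runtime.KeepAlive(ballast) // keep the ballast reachable until exit
}

func runRealWorkload() {
	// ... the real request-handling code would go here ...
}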

The latest system where I added ballast is described in #23044 (comment):

I ran into this again today with another service that was doing 15 collections/sec. This one is a fairly simple webserver that receives 10-15k requests per second and turns them into protobuf messages that are sent along the network to other systems for processing.

So my program does a bunch of small allocations that become garbage right away (its work) and a few giant allocations every once in a while which tend to stay live for a long time (ballast). It seems like sometimes, after more ballast is added, my program runs into the situation that @rhysh described at the top, and all the mutators can spend all their time in sweep assists.

See this unpleasant trace:

[execution trace screenshot]

In this case no goroutine did productive work for about 600ms.

This particular case isn't a big deal for us -- I can mitigate this in a few different ways -- but I thought it might be interesting as an extreme example.

@aclements

Member

aclements commented Feb 24, 2018

@cespare

Contributor

cespare commented Feb 28, 2018

Quick question: could you sample a few of those sweeps and report the metadata associated with them? I'm particularly interested in their duration, how many pages were swept, and how many bytes were recovered.

Here are 6 sequential sweeps for one of the procs:

Start 4,116,135,261 ns
Wall Duration 13,317,631 ns
Swept bytes 5201920
Reclaimed bytes	5104656

Start 4,129,453,818 ns
Wall Duration 5,827 ns
Swept bytes 8192
Reclaimed bytes 8064

Start 4,129,479,153 ns
Wall Duration 13,572,150 ns
Swept bytes 6684672
Reclaimed bytes 6553744

Start 4,143,126,160 ns
Wall Duration 13,116,497 ns
Swept bytes 6479872
Reclaimed bytes 6353504

Start 4,156,243,473 ns
Wall Duration 6,180 ns
Swept bytes 8192
Reclaimed bytes 8096

Start 4,156,321,795 ns
Wall Duration 14,072,471 ns
Swept bytes 7069696
Reclaimed bytes 6929616

Also, is there anything between the obvious, long sweeps that are visible in the screenshot? (E.g., if you select one and hit the right arrow key, is the next event just the next big sweep or something else?)

Yeah -- as shown above, between the big sweeps there are often interspersed some small sweeps. (But not always -- sometimes there are two big sweeps next to each other.)

I also mailed you the trace (this is private code).

@gopherbot gopherbot modified the milestones: Go1.11, Unplanned May 23, 2018

@danieltahara

danieltahara commented Aug 27, 2018

We're writing a MySQL wire-protocol compatible proxy at Dropbox, and it looks like we're hitting this same issue in production on go 1.10.3.

Some "oddities" about the service -- being effectively a MySQL server, there is no connection multiplexing and there is a ton of per-connection buffers and state (see vitessio/vitess#4138). We're working on reducing them, but something like 90% of the in-use heap memory at any given time is tied up in unreclaimable, effectively constant memory, and 50+% of the in-use objects are from the same.

Our traces look like this:

[execution trace screenshot]

Zooming in, the concurrent SWEEP happens on the first call to MakeString(...) (copying cgo memory into a go string). The gap between GC finishing and the SWEEPs is due to the fact that most of the goroutines (one per connection) don't have any data waiting on their sockets, so they don't require any allocations:

[execution trace screenshot]

GC happens every few seconds (GOGC=75, since we don't have much memory to spare), and concurrent mark and scan is itself expensive (and borne mostly by mutators), but it still pales in comparison to the concurrent sweep stalls, which seem to tie up every core for about 100-200ms.

gc 62645 @227962.290s 4%: 0.28+223+0.23 ms clock, 5.6+2319/1090/7.2+4.6 ms cpu, 6995->7052->4089 MB, 7146 MB goal, 20 P
gc 62646 @227965.313s 4%: 0.30+243+0.26 ms clock, 6.1+2406/1188/0.97+5.2 ms cpu, 7004->7061->4083 MB, 7156 MB goal, 20 P
gc 62647 @227968.497s 4%: 0.14+239+0.25 ms clock, 2.8+2290/1170/0.024+5.1 ms cpu, 6993->7051->4083 MB, 7145 MB goal, 20 P
gc 62648 @227971.571s 4%: 0.26+251+0.26 ms clock, 5.3+2557/1233/0.29+5.2 ms cpu, 6994->7050->4085 MB, 7146 MB goal, 20 P
gc 62649 @227974.337s 4%: 0.16+242+0.24 ms clock, 3.2+2376/1175/1.8+4.8 ms cpu, 6997->7056->4086 MB, 7148 MB goal, 20 P
gc 62650 @227977.309s 4%: 0.35+291+0.31 ms clock, 7.1+2466/1422/1.3+6.3 ms cpu, 6999->7055->4084 MB, 7151 MB goal, 20 P
@RLH

Contributor

RLH commented Aug 29, 2018

@danieltahara

danieltahara commented Aug 29, 2018

The objects are in the range of a few hundred bytes to single digit KB. The best specific data I can get is for row size (as opposed to column size). That distribution (in bytes) is here:

p99 | p90 | p75 | p50
--- | --- | --- | ---
84511.8414031 | 6485.345 | 1126.004 | 379.099280115
@RLH

Contributor

RLH commented Aug 29, 2018

@danieltahara

danieltahara commented Aug 29, 2018

Queries in last hour, in bytes:

p99 | p95 | p90 | p75 | p50 | p25 | p10
--- | --- | --- | --- | --- | --- | ---
72700.6023324 | 10662.0823733 | 6609.87482699 | 1130.83317386 | 379.689311459 | 101.527534113 | 46.6143729713

To reiterate -- these are stats for row size, not column/value size.

@RLH

Contributor

RLH commented Aug 29, 2018

@danieltahara

danieltahara commented Aug 29, 2018

@LK4D4

Contributor

LK4D4 commented Aug 29, 2018

@RLH Actually, we took another look and it looks like MakeString doesn't allocate that much:
https://github.com/dropbox/godropbox/blob/master/database/sqltypes/sqltypes.go#L69
Just a structure with an interface field, where the interface implementation is:

type String struct {
	data   []byte
	isUtf8 bool
}
@RLH

Contributor

RLH commented Aug 30, 2018

@danieltahara

danieltahara commented Sep 4, 2018

Okay, sorry for the back and forth here. The SWEEP stack is different from the goroutine stack:

[screenshot of the SWEEP event stack]

I clicked through the large sweeps and they're all 8kb allocations with 10s-100s of MB swept.

Zooming in, there are other tiny sweeps, but they look more like this:

Swept bytes | Reclaimed bytes
--- | ---
303104 | 303104
40960 | 40960
8192 | 8160
8192 | 1456
@danieltahara

danieltahara commented Sep 4, 2018

Here's how this all ties together.

I mentioned in my first message (#18155 (comment)) that the process allocates a few buffers per inbound connection, and that memory is never reclaimable until a connection closes (duration is measured in hours). These buffers happen to be 8kb in size, and each connection has 3 of them.

This means we have about 500,000 such buffers allocated and unreclaimable in steady state. When we finish a GC and then go to allocate an 8kb buffer, we attempt to sweep all of these 8kb spans and find nothing available for allocation. This is, as far as I understand, the issue in @rhysh's benchmark and the trace from @cespare, just manifesting itself in a slightly quirky way due to the memory profile of our application. I suspect it's exacerbated by the fact that 8kb objects are allocated in spans of exactly 8kb in size, so we don't get any benefit from the gc bitmap.
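
A hypothetical sketch of the allocation pattern described above (types and numbers invented for illustration): a few long-lived 8 KiB buffers per connection mean the 8 KiB size class consists almost entirely of full spans at every GC, so a sweep assist looking for a free 8 KiB slot has to walk a huge number of full spans.

package main

import "time"

// conn is a hypothetical stand-in for the proxy's per-connection state:
// three 8 KiB buffers that stay live for the lifetime of the connection
// (hours), so the spans holding them are never freed by sweeping.
type conn struct {
	readBuf  []byte
	writeBuf []byte
	scratch  []byte
}

func newConn() *conn {
	return &conn{
		readBuf:  make([]byte, 8<<10),
		writeBuf: make([]byte, 8<<10),
		scratch:  make([]byte, 8<<10),
	}
}

func main() {
	// ~170,000 connections x 3 buffers = ~500,000 live 8 KiB buffers
	// (several GB of heap) that are unreclaimable in steady state.
	conns := make([]*conn, 0, 170000)
	for i := 0; i < 170000; i++ {
		conns = append(conns, newConn())
	}
	time.Sleep(time.Second) // stand-in for serving traffic
	_ = conns
}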

@RLH

Contributor

RLH commented Sep 6, 2018

STW behaviour during sweeping

This is a brief summary of this issue, the suspected causes, and several proposed fixes. If contributors believe their application does or does not match this summary, please speak up.

The issue: Sweeping can exhibit stop-the-world latencies that exceed Go’s sub-millisecond SLO.

Diagnosis: When there are many full spans of the same size class, the mutator spends an unbounded amount of time sweeping spans looking for free space. The mutator repeatedly asks the central store for a span to sweep and repeatedly finds it full; the result manifests as a stopped goroutine, and if many goroutines are stopped, as a stop-the-world. The number of spans swept is bounded only by the size of the heap. Since the central store is a lock-based data structure, lock contention may occur when many goroutines try to allocate and the time spent sweeping each span is small. This problem is exacerbated when the number of spans of a particular size class with free objects is small, so that on average one needs to sweep many spans in order to find one that is not full.

The poster child for this would be a system where many gigabytes of same-sized objects are allocated at initialization and are immortal. Add to this a modest amount of new objects of this size allocated between GC cycles, so that the percentage of full spans of this size class is very high.

Several solutions have been suggested; some are partial and some overlap. The way forward is likely a combination of these approaches.

“Back-off sweeping.” Don’t sweep a span if past history indicates it is likely to be full. This can be done using an exponential backoff. If sweep results in a full span, do not sweep that span again for a time proportional to the time since it was last not full. The downside is that we may have floating garbage proportional to the lifetime of the object. To be effective the spans that need sweeping should be segregated from the spans that do not so that obtaining a span from the mcentral only touches and returns spans that need sweeping.

“Give up and go to the heap”. After unsuccessfully sweeping n spans, ask for a free span from the heap. Currently, this may cause the heap to continue sweeping looking for a free span to allocate, which is even harder to find than a partially-free span. Hence, this would also need a way to find free spans quickly. This could be done by keeping a mark per span in a bitmap. This bitmap could be scanned very quickly to find free spans, but maintaining this bitmap may very slightly slow down marking.
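
A rough sketch of the kind of bitmap scan this proposal relies on (toy data structures, not the runtime implementation): with an in-use bit and a mark bit per span, wholly free spans are exactly the set bits of inUse &^ marked, which a sequential word-by-word scan can find very quickly.

package sketch

import "math/bits"

// freeSpans returns the indexes of spans that are in use but contain no
// marked objects, i.e. spans that can be freed without sweeping their
// objects individually. Each bit position is a span index; inUse and
// marked are parallel bitmaps.
func freeSpans(inUse, marked []uint64) []int {
	var free []int
	for w := range inUse {
		b := inUse[w] &^ marked[w] // in use, but nothing marked
		for b != 0 {
			i := bits.TrailingZeros64(b)
			free = append(free, w*64+i)
			b &= b - 1 // clear the lowest set bit
		}
	}
	return free
}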

“Give up and go to the OS.” After unsuccessfully sweeping n spans, immediately ask the OS for more allocation space. This will cause heap growth that would otherwise not be needed as it trades memory for latency. This was prototyped by CL 34291, in which we discovered that getting more memory from the OS can sometimes cost more than just doing more sweep work.

“Ordered sweep.” Sweep the spans in a random order. If the sweep order favors doing full spans first, this will ameliorate that concern. This solution produces non-deterministic behaviour. If we come up with an order that produces a free span immediately, we would end up having to sweep all the full spans near the end of the cycle. This also doesn’t help if an entire size class is allocated very densely. Ultimately this doesn’t seem like a viable approach.

“Reduce contention.” The per-size-class mcentral may have lock contention. When sweeping, fetch a group of spans equal to the number of spans that can reasonably be swept in a small, bounded amount of time. Even if we go to a lock-free algorithm, this grouping seems like a good idea and it should combine well with whatever other approach we use.

“Reserve scheduling.” Schedule 25% of the CPU for background sweeping similar to what we do with marking. Currently background sweeping has no CPU reservation and simply runs at the same priority as normal goroutines. This will potentially take CPU away from mutator goroutines and will not help in situations where all the goroutines are sweeping anyway. This seems less useful than the Async Sweep approach.

“Async sweep” or “Sweep and Swap.” Allocate from the previous cycle’s freed slots while we sweep to find new free slots and extend the GC cycle to include sweeping. Sweeping could make free slots available immediately after sweeping each span, or there could be an atomic swap once all sweeping is done. We haven’t worked out the details of this, but allocation during sweep would have to allocate black (like during the mark phase). The write barrier can be turned off. Since free slots wouldn’t become available immediately after marking, this would have to be integrated into the GC trigger and would cause marking to trigger slightly earlier.

All of these approaches are constrained by not being able to severely degrade current performance.

@danieltahara

danieltahara commented Sep 6, 2018

This is a brief summary of this issue, the suspected causes, and several proposed fixes. If contributors believe their application does or does not match this summary, please speak up.

Diagnosis: When there are many full spans of the same size class, the mutator spends an unbounded amount of time sweeping spans looking for free space. The mutator repeatedly asks the central store for a span to sweep and repeatedly finds it full; the result manifests as a stopped goroutine, and if many goroutines are stopped, as a stop-the-world. The number of spans swept is bounded only by the size of the heap. Since the central store is a lock-based data structure, lock contention may occur when many goroutines try to allocate and the time spent sweeping each span is small. This problem is exacerbated when the number of spans of a particular size class with free objects is small, so that on average one needs to sweep many spans in order to find one that is not full.

This matches our application memory pattern and GC behavior.

@aclements

Member

aclements commented Sep 19, 2018

“Async sweep” or “Sweep and Swap.” ... Since free slots wouldn’t become available immediately after marking, this would have to be integrated into the GC trigger and would cause marking to trigger slightly earlier.

I've been working on an implementation of this and I believe doing async sweeping is actually quite clean (that or I'm just taking the opportunity to clean up the sweep code; it's hard to say :). My main concern is about how this fits into the pacer. With this approach, the heap can continue to grow after mark termination, but it's very hard to predict exactly how. For example, if async sweeping immediately finds free slots, it may grow very little, but if async sweeping takes a long time to find free slots, it may grow quite a bit.

This leaves me wondering if it makes sense to integrate this into the GC pacer, or if we should try to use some other technique to mitigate this. For example, we could easily combine this with page marking so that wholly unused spans can be returned to the heap very quickly after mark termination. Or we could combine it with ordering heuristics to sweep spans that are more likely to produce free slots (like the age heuristics in "back-off sweeping" but just for determining an order).

@gopherbot

gopherbot commented Sep 27, 2018

Change https://golang.org/cl/138076 mentions this issue: runtime: track all heap arenas in a slice

@gopherbot

gopherbot commented Oct 2, 2018

Change https://golang.org/cl/138958 mentions this issue: runtime: mark span when marking any object on the span

@gopherbot

gopherbot commented Oct 2, 2018

Change https://golang.org/cl/138961 mentions this issue: runtime: helper to compute span's "fullness"

@gopherbot

gopherbot commented Oct 2, 2018

Change https://golang.org/cl/138962 mentions this issue: runtime: use asynchronous sweeping for small allocations

@gopherbot

gopherbot commented Oct 2, 2018

Change https://golang.org/cl/138959 mentions this issue: runtime: add an efficient whole-page reclaimer

@gopherbot

gopherbot commented Oct 2, 2018

Change https://golang.org/cl/138957 mentions this issue: runtime: record in-use spans in a page-indexed bitmap

gopherbot pushed a commit that referenced this issue Nov 5, 2018

cmd/trace: add minimum mutator utilization (MMU) plot
This adds an endpoint to the trace tool that plots the minimum mutator
utilization curve using information on mark assists and GC pauses from
the trace.

This commit implements a fairly straightforward O(nm) algorithm for
computing the MMU (and tests against an even more direct but slower
algorithm). Future commits will extend and optimize this algorithm.

This should be useful for debugging and understanding mutator
utilization issues like #14951, #14812, #18155, #18534, #21107,
particularly once follow-up CLs add trace cross-referencing.

Change-Id: Ic2866869e7da1e6c56ba3e809abbcb2eb9c4923a
Reviewed-on: https://go-review.googlesource.com/c/60790
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Hyang-Ah Hana Kim <hyangah@gmail.com>

gopherbot pushed a commit that referenced this issue Nov 15, 2018

runtime: track all heap arenas in a slice
Currently, there's no efficient way to iterate over the Go heap. We're
going to need this for fast free page sweeping, so this CL adds a
slice of all allocated heap arenas. This will also be useful for
generational GC.

For #18155.

Change-Id: I58d126cfb9c3f61b3125d80b74ccb1b2169efbcc
Reviewed-on: https://go-review.googlesource.com/c/138076
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>

gopherbot pushed a commit that referenced this issue Nov 15, 2018

runtime: record in-use spans in a page-indexed bitmap
This adds a bitmap indexed by page number that marks the starts of
in-use spans. This will be used to quickly find in-use spans with no
marked objects for sweeping.

For #18155.

Change-Id: Icee56f029cde502447193e136fa54a74c74326dd
Reviewed-on: https://go-review.googlesource.com/c/138957
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>

gopherbot pushed a commit that referenced this issue Nov 15, 2018

runtime: mark span when marking any object on the span
This adds a mark bit for each span that is set if any objects on the
span are marked. This will be used for sweeping.

For #18155.

The impact of this is negligible for most benchmarks, and < 1% for
GC-heavy benchmarks.

name                       old time/op  new time/op  delta
Garbage/benchmem-MB=64-12  2.18ms ± 0%  2.20ms ± 1%  +0.88%  (p=0.000 n=16+18)

(https://perf.golang.org/search?q=upload:20180928.1)

name                      old time/op    new time/op    delta
BinaryTree17-12              2.68s ± 1%     2.68s ± 1%    ~     (p=0.707 n=17+19)
Fannkuch11-12                2.28s ± 0%     2.39s ± 0%  +4.95%  (p=0.000 n=19+18)
FmtFprintfEmpty-12          40.3ns ± 4%    39.4ns ± 2%  -2.27%  (p=0.000 n=17+18)
FmtFprintfString-12         67.9ns ± 1%    68.3ns ± 1%  +0.55%  (p=0.000 n=18+19)
FmtFprintfInt-12            75.7ns ± 1%    76.1ns ± 1%  +0.44%  (p=0.005 n=18+19)
FmtFprintfIntInt-12          123ns ± 1%     121ns ± 1%  -1.00%  (p=0.000 n=18+18)
FmtFprintfPrefixedInt-12     150ns ± 0%     148ns ± 0%  -1.33%  (p=0.000 n=16+13)
FmtFprintfFloat-12           208ns ± 0%     204ns ± 0%  -1.92%  (p=0.000 n=13+17)
FmtManyArgs-12               501ns ± 1%     498ns ± 0%  -0.55%  (p=0.000 n=19+17)
GobDecode-12                6.24ms ± 0%    6.25ms ± 1%    ~     (p=0.113 n=20+19)
GobEncode-12                5.33ms ± 0%    5.29ms ± 1%  -0.72%  (p=0.000 n=20+18)
Gzip-12                      220ms ± 1%     218ms ± 1%  -1.02%  (p=0.000 n=19+19)
Gunzip-12                   35.5ms ± 0%    35.7ms ± 0%  +0.45%  (p=0.000 n=16+18)
HTTPClientServer-12         77.9µs ± 1%    77.7µs ± 1%  -0.30%  (p=0.047 n=20+19)
JSONEncode-12               8.82ms ± 0%    8.93ms ± 0%  +1.20%  (p=0.000 n=18+17)
JSONDecode-12               47.3ms ± 0%    47.0ms ± 0%  -0.49%  (p=0.000 n=17+18)
Mandelbrot200-12            3.69ms ± 0%    3.68ms ± 0%  -0.25%  (p=0.000 n=19+18)
GoParse-12                  3.13ms ± 1%    3.13ms ± 1%    ~     (p=0.640 n=20+20)
RegexpMatchEasy0_32-12      76.2ns ± 1%    76.2ns ± 1%    ~     (p=0.818 n=20+19)
RegexpMatchEasy0_1K-12       226ns ± 0%     226ns ± 0%  -0.22%  (p=0.001 n=17+18)
RegexpMatchEasy1_32-12      71.9ns ± 1%    72.0ns ± 1%    ~     (p=0.653 n=18+18)
RegexpMatchEasy1_1K-12       355ns ± 1%     356ns ± 1%    ~     (p=0.160 n=18+19)
RegexpMatchMedium_32-12      106ns ± 1%     106ns ± 1%    ~     (p=0.325 n=17+20)
RegexpMatchMedium_1K-12     31.1µs ± 2%    31.2µs ± 0%  +0.59%  (p=0.007 n=19+15)
RegexpMatchHard_32-12       1.54µs ± 2%    1.53µs ± 2%  -0.78%  (p=0.021 n=17+18)
RegexpMatchHard_1K-12       46.0µs ± 1%    45.9µs ± 1%  -0.31%  (p=0.025 n=17+19)
Revcomp-12                   391ms ± 1%     394ms ± 2%  +0.80%  (p=0.000 n=17+19)
Template-12                 59.9ms ± 1%    59.9ms ± 1%    ~     (p=0.428 n=20+19)
TimeParse-12                 304ns ± 1%     312ns ± 0%  +2.88%  (p=0.000 n=20+17)
TimeFormat-12                318ns ± 0%     326ns ± 0%  +2.64%  (p=0.000 n=20+17)

(https://perf.golang.org/search?q=upload:20180928.2)

Change-Id: I336b9bf054113580a24103192904c8c76593e90e
Reviewed-on: https://go-review.googlesource.com/c/138958
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>

gopherbot pushed a commit that referenced this issue Nov 15, 2018

runtime: implement efficient page reclaimer
When we attempt to allocate an N page span (either for a large
allocation or when an mcentral runs dry), we first try to sweep spans
to release N pages. Currently, this can be extremely expensive:
sweeping a span to emptiness is the hardest thing to ask for and the
sweeper generally doesn't know where to even look for potentially
fruitful results. Since this is on the critical path of many
allocations, this is unfortunate.

This CL changes how we reclaim empty spans. Instead of trying lots of
spans and hoping for the best, it uses the newly introduced span marks
to efficiently find empty spans. The span marks (and in-use bits) are
in a dense bitmap, so these spans can be found with an efficient
sequential memory scan. This approach can scan for unmarked spans at
about 300 GB/ms and can free unmarked spans at about 32 MB/ms. We
could probably significantly improve the rate at which it can free
unmarked spans, but that's a separate issue.

Like the current reclaimer, this is still linear in the number of
spans that are swept, but the constant factor is now so vanishingly
small that it doesn't matter.

The benchmark in #18155 demonstrates both significant page reclaiming
delays, and object reclaiming delays. With "-retain-count=20000000
-preallocate=true -loop-count=3", the benchmark demonstrates several
page reclaiming delays on the order of 40ms. After this change, the
page reclaims are insignificant. The longest sweeps are still ~150ms,
but are object reclaiming delays. We'll address those in the next
several CLs.

Updates #18155.

Fixes #21378 by completely replacing the logic that had that bug.

Change-Id: Iad80eec11d7fc262d02c8f0761ac6998425c4064
Reviewed-on: https://go-review.googlesource.com/c/138959
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>

bradfitz pushed a commit that referenced this issue Nov 21, 2018

runtime: track all heap arenas in a slice
Currently, there's no efficient way to iterate over the Go heap. We're
going to need this for fast free page sweeping, so this CL adds a
slice of all allocated heap arenas. This will also be useful for
generational GC.

For #18155.

Change-Id: I58d126cfb9c3f61b3125d80b74ccb1b2169efbcc
Reviewed-on: https://go-review.googlesource.com/c/138076
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>

bradfitz pushed a commit that referenced this issue Nov 21, 2018

runtime: record in-use spans in a page-indexed bitmap
This adds a bitmap indexed by page number that marks the starts of
in-use spans. This will be used to quickly find in-use spans with no
marked objects for sweeping.

For #18155.

Change-Id: Icee56f029cde502447193e136fa54a74c74326dd
Reviewed-on: https://go-review.googlesource.com/c/138957
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>

bradfitz pushed a commit that referenced this issue Nov 21, 2018

runtime: mark span when marking any object on the span
This adds a mark bit for each span that is set if any objects on the
span are marked. This will be used for sweeping.

For #18155.

The impact of this is negligible for most benchmarks, and < 1% for
GC-heavy benchmarks.

name                       old time/op  new time/op  delta
Garbage/benchmem-MB=64-12  2.18ms ± 0%  2.20ms ± 1%  +0.88%  (p=0.000 n=16+18)

(https://perf.golang.org/search?q=upload:20180928.1)

name                      old time/op    new time/op    delta
BinaryTree17-12              2.68s ± 1%     2.68s ± 1%    ~     (p=0.707 n=17+19)
Fannkuch11-12                2.28s ± 0%     2.39s ± 0%  +4.95%  (p=0.000 n=19+18)
FmtFprintfEmpty-12          40.3ns ± 4%    39.4ns ± 2%  -2.27%  (p=0.000 n=17+18)
FmtFprintfString-12         67.9ns ± 1%    68.3ns ± 1%  +0.55%  (p=0.000 n=18+19)
FmtFprintfInt-12            75.7ns ± 1%    76.1ns ± 1%  +0.44%  (p=0.005 n=18+19)
FmtFprintfIntInt-12          123ns ± 1%     121ns ± 1%  -1.00%  (p=0.000 n=18+18)
FmtFprintfPrefixedInt-12     150ns ± 0%     148ns ± 0%  -1.33%  (p=0.000 n=16+13)
FmtFprintfFloat-12           208ns ± 0%     204ns ± 0%  -1.92%  (p=0.000 n=13+17)
FmtManyArgs-12               501ns ± 1%     498ns ± 0%  -0.55%  (p=0.000 n=19+17)
GobDecode-12                6.24ms ± 0%    6.25ms ± 1%    ~     (p=0.113 n=20+19)
GobEncode-12                5.33ms ± 0%    5.29ms ± 1%  -0.72%  (p=0.000 n=20+18)
Gzip-12                      220ms ± 1%     218ms ± 1%  -1.02%  (p=0.000 n=19+19)
Gunzip-12                   35.5ms ± 0%    35.7ms ± 0%  +0.45%  (p=0.000 n=16+18)
HTTPClientServer-12         77.9µs ± 1%    77.7µs ± 1%  -0.30%  (p=0.047 n=20+19)
JSONEncode-12               8.82ms ± 0%    8.93ms ± 0%  +1.20%  (p=0.000 n=18+17)
JSONDecode-12               47.3ms ± 0%    47.0ms ± 0%  -0.49%  (p=0.000 n=17+18)
Mandelbrot200-12            3.69ms ± 0%    3.68ms ± 0%  -0.25%  (p=0.000 n=19+18)
GoParse-12                  3.13ms ± 1%    3.13ms ± 1%    ~     (p=0.640 n=20+20)
RegexpMatchEasy0_32-12      76.2ns ± 1%    76.2ns ± 1%    ~     (p=0.818 n=20+19)
RegexpMatchEasy0_1K-12       226ns ± 0%     226ns ± 0%  -0.22%  (p=0.001 n=17+18)
RegexpMatchEasy1_32-12      71.9ns ± 1%    72.0ns ± 1%    ~     (p=0.653 n=18+18)
RegexpMatchEasy1_1K-12       355ns ± 1%     356ns ± 1%    ~     (p=0.160 n=18+19)
RegexpMatchMedium_32-12      106ns ± 1%     106ns ± 1%    ~     (p=0.325 n=17+20)
RegexpMatchMedium_1K-12     31.1µs ± 2%    31.2µs ± 0%  +0.59%  (p=0.007 n=19+15)
RegexpMatchHard_32-12       1.54µs ± 2%    1.53µs ± 2%  -0.78%  (p=0.021 n=17+18)
RegexpMatchHard_1K-12       46.0µs ± 1%    45.9µs ± 1%  -0.31%  (p=0.025 n=17+19)
Revcomp-12                   391ms ± 1%     394ms ± 2%  +0.80%  (p=0.000 n=17+19)
Template-12                 59.9ms ± 1%    59.9ms ± 1%    ~     (p=0.428 n=20+19)
TimeParse-12                 304ns ± 1%     312ns ± 0%  +2.88%  (p=0.000 n=20+17)
TimeFormat-12                318ns ± 0%     326ns ± 0%  +2.64%  (p=0.000 n=20+17)

(https://perf.golang.org/search?q=upload:20180928.2)

Change-Id: I336b9bf054113580a24103192904c8c76593e90e
Reviewed-on: https://go-review.googlesource.com/c/138958
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>

bradfitz pushed a commit that referenced this issue Nov 21, 2018

runtime: implement efficient page reclaimer
When we attempt to allocate an N page span (either for a large
allocation or when an mcentral runs dry), we first try to sweep spans
to release N pages. Currently, this can be extremely expensive:
sweeping a span to emptiness is the hardest thing to ask for and the
sweeper generally doesn't know where to even look for potentially
fruitful results. Since this is on the critical path of many
allocations, this is unfortunate.

This CL changes how we reclaim empty spans. Instead of trying lots of
spans and hoping for the best, it uses the newly introduced span marks
to efficiently find empty spans. The span marks (and in-use bits) are
in a dense bitmap, so these spans can be found with an efficient
sequential memory scan. This approach can scan for unmarked spans at
about 300 GB/ms and can free unmarked spans at about 32 MB/ms. We
could probably significantly improve the rate at which it can free
unmarked spans, but that's a separate issue.

Like the current reclaimer, this is still linear in the number of
spans that are swept, but the constant factor is now so vanishingly
small that it doesn't matter.

The benchmark in #18155 demonstrates both significant page reclaiming
delays, and object reclaiming delays. With "-retain-count=20000000
-preallocate=true -loop-count=3", the benchmark demonstrates several
page reclaiming delays on the order of 40ms. After this change, the
page reclaims are insignificant. The longest sweeps are still ~150ms,
but are object reclaiming delays. We'll address those in the next
several CLs.

Updates #18155.

Fixes #21378 by completely replacing the logic that had that bug.

Change-Id: Iad80eec11d7fc262d02c8f0761ac6998425c4064
Reviewed-on: https://go-review.googlesource.com/c/138959
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>