runtime: unbounded heap growth with concurrent GC #11911

dvyukov · 2015-07-28T17:43:00Z

The following program has live set of at most 100MB . With GODEBUG=gcstoptheworld=1 RSS stabilizes at ~204MB according to top, which makes sense. However with GODEBUG=gcstoptheworld=0 RSS steadily chaotically grows over time, I waited only for few minutes and RSS grew to 1106MB, one of the jumps was from 212MB to 678MB in one second. GOMAXPROCS value may have effect on this, I've get the worst numbers with GOMAXPROCS=2/4, however I did not make an exhaustive investigation, maybe GOMAXPROCS is red herring.

package main

import (
    "math/rand"
    "sync"
    "sync/atomic"
    "unsafe"
)

var (
    gs, retained, seed int32
    mu                 sync.Mutex
    blocks             [][]*byte
    sink               = make([]byte, 4096)
)

const (
    maxGs       = 100
    maxRetained = 100 << 20
    maxBlock    = 64 << 10
)

func main() {
    go worker()
    select {}
}

func worker() {
    rnd := rand.New(rand.NewSource(int64(atomic.AddInt32(&seed, 1))))
    ss := unsafe.Sizeof(&gs)
    var local [][]*byte
loop:
    for {
        switch r := rnd.Intn(1000); {
        case r == 0 || r == 1:
            if !tryAdd(&gs, 1, maxGs, 1) {
                continue loop
            }
            go worker()
        case r == 2:
            if !tryAdd(&gs, 1, maxGs, -1) {
                continue loop
            }
            s := 0
            for _, b := range local {
                s += len(b)
            }
            local = nil
            if !tryAdd(&retained, 0, maxRetained, -int32(s)*int32(ss)) {
                panic("bad")
            }
            return
        case r < 300:
            s := maxBlock / int(ss)
            if !tryAdd(&retained, 0, maxRetained, int32(s)*int32(ss)) {
                continue loop
            }
            b := make([]*byte, s)
            p := &sink[len(sink)-1]
            for i := range b {
                b[i] = p
            }
            mu.Lock()
            blocks = append(blocks, b)
            mu.Unlock()
        case r < 500:
            s := 0
            mu.Lock()
            if len(blocks) > 0 {
                idx := rnd.Intn(len(blocks))
                s = len(blocks[idx])
                blocks[idx] = blocks[len(blocks)-1]
                blocks[len(blocks)-1] = nil
                blocks = blocks[:len(blocks)-1]
            }
            mu.Unlock()
            if !tryAdd(&retained, 0, maxRetained, -int32(s)*int32(ss)) {
                panic("bad")
            }
        case r < 800:
            s := maxBlock / int(ss)
            if !tryAdd(&retained, 0, maxRetained, int32(s)*int32(ss)) {
                continue loop
            }
            b := make([]*byte, s)
            p := &sink[len(sink)-1]
            for i := range b {
                b[i] = p
            }
            local = append(local, b)
        default:
            s := 0
            if len(local) > 0 {
                idx := rnd.Intn(len(local))
                s = len(local[idx])
                local[idx] = local[len(local)-1]
                local[len(local)-1] = nil
                local = local[:len(local)-1]
            }
            if !tryAdd(&retained, 0, maxRetained, -int32(s)*int32(ss)) {
                panic("bad")
            }
        }
    }
}

func tryAdd(p *int32, min, max, v int32) bool {
    for {
        old := atomic.LoadInt32(p)
        xchg := old + v
        if xchg < min || xchg > max {
            return false
        }
        if atomic.CompareAndSwapInt32(p, old, xchg) {
            return true
        }
    }
}

go version devel +d0729a6 Tue Jul 28 06:23:38 2015 +0000 linux/amd64

RLH · 2015-07-29T13:59:32Z

Interesting reproduction. If I do a ^s to stop output at gc 6 (.391 seconds into the run) then do a ^q to restart then the heap size immediately jumps to many gigabytes but quickly returns to normal. Where the ^s happens seems not to make a difference, .391 is just right after I start the run. rlh@rlh0:~/work/code/issue11911/src$ GODEBUG=gctrace=1 GOMAXPROCS=2 ./issue11911 gc 1 @0.026s 48%: 0.036+0.22+0.006+26+0.28 ms clock, 0.072+0.22+0+33/18/0.001+0.57 ms cpu, 52->52->17 MB, 4 MB goal, 2 P gc 2 @0.054s 46%: 0.025+0.11+0.049+102+0.31 ms clock, 0.050+0.11+0+41/50/11+0.62 ms cpu, 17->35->35 MB, 33 MB goal, 2 P gc 3 @0.157s 56%: 0.009+7.7+0.002+35+0.24 ms clock, 0.019+7.7+0+61/10/0.001+0.48 ms cpu, 35->35->23 MB, 34 MB goal, 2 P gc 4 @0.202s 57%: 0.027+0.76+0.004+81+0.33 ms clock, 0.054+0.76+0+61/40/0+0.67 ms cpu, 32->43->34 MB, 45 MB goal, 2 P gc 5 @0.285s 58%: 0.010+0.80+0.79+103+0.47 ms clock, 0.021+0.80+0+70/59/0+0.95 ms cpu, 34->48->44 MB, 47 MB goal, 2 P gc 6 @0.391s 59%: 0.010+1.4+0.012+116+0.84 ms clock, 0.021+1.4+0+86/59/0+1.6 ms cpu, 44->60->49 MB, 60 MB goal, 2 P gc 7 @3.748s 12%: 0.024+120+196+0.16+2.2 ms clock, 0.048+120+0+316/0.001/0+4.4 ms cpu, 7403->7404->101 MB, 68 MB goal, 2 P gc 8 @5.021s 14%: 1.0+234+0.006+11+2.6 ms clock, 2.1+234+0+245/10/0+5.3 ms cpu, 4727->4730->103 MB, 201 MB goal, 2 P gc 9 @5.466s 17%: 1.0+183+49+12+2.0 ms clock, 2.1+183+0+245/10/0.004+4.1 ms cpu, 1013->1014->97 MB, 199 MB goal, 2 P gc 10 @5.727s 20%: 1.1+217+0.005+36+2.4 ms clock, 2.2+217+0+253/0/0+4.8 ms cpu, 97->97->97 MB, 193 MB goal, 2 P gc 11 @5.999s 22%: 1.0+2.2+0.67+129+1.8 ms clock, 2.1+2.2+0+227/33/0+3.7 ms cpu, 144->144->100 MB, 194 MB goal, 2 P gc 12 @6.153s 25%: 1.0+216+0.11+36+2.1 ms clock, 2.1+216+0+260/13/0.002+4.3 ms cpu, 174->176->102 MB, 201 MB goal, 2 P gc 13 @6.428s 26%: 1.2+2.1+0.003+142+2.7 ms clock, 2.4+2.1+0+229/39/0+5.5 ms cpu, 176->179->103 MB, 200 MB goal, 2 P gc 14 @6.595s 28%: 1.0+3.7+7.8+134+3.0 ms clock, 2.0+3.7+0+218/51/0+6.0 ms cpu, 189->191->102 MB, 200 MB goal, 2 P ^Crlh@rlh0:~/work/code/issue11911/src$

aclements · 2015-07-29T14:08:14Z

The ^s overallocation thing could happen for any program if you have gctrace on. That requests that the kernel block writes to the tty (or pty, in this case), which means our gctrace writes get blocked, which prevents a new GC from starting. It probably wreaks all sorts of other havoc, too, since the runtime's print isn't expecting this.

aclements · 2015-07-29T15:04:31Z

It looks like the start of GC is being delayed sometimes. The jumps in RSS correspond to large heap sizes at the start of a GC cycle (well beyond the heap trigger) and these cycles tend to start much longer after the previous cycle than normal (~500ms instead of ~150ms). We successfully collect back to ~100 MB (plus a little for floating garbage), but the memory doesn't get returned to the system because the scavenger is conservative (and may never get returned if these bumps keep happening).

@RLH is adding instrumentation to pin down exactly where the delay happens.

aclements · 2015-07-29T20:08:27Z

It looks like this is usually caused by a long semacquire(&worldsema). I've seen this take as long as 352ms with your example program. I don't know what's causing this, since this program doesn't obviously use anything else that would be acquiring worldsema.

I've seen other causes, too, though. In one run, it took 86ms to get from the Gosched in startGC to the actually running the GC goroutine. I suspect the GC goroutine got kicked out of or stolen from runnext or delayed by something considered higher priority by schedule.

RLH · 2015-07-29T21:53:16Z

I am seeing the sweep prior to the worldsema being acquired by the GC
taking a lot of time and there is no throttling of the mutator since
bggc.work is set to 1.

On Wed, Jul 29, 2015 at 4:08 PM, Austin Clements notifications@github.com
wrote:

It looks like this is usually caused by a long semacquire(&worldsema).
I've seen this take as long as 352ms with your example program. I don't
know what's causing this, since this program doesn't obviously use anything
else that would be acquiring worldsema.

I've seen other causes, too, though. In one run, it took 86ms to get from
the Gosched in startGC to the actually running the GC goroutine. I suspect
the GC goroutine got kicked out of or stolen from runnext or delayed by
something considered higher priority by schedule.

—
Reply to this email directly or view it on GitHub
#11911 (comment).

aclements · 2015-07-29T22:14:25Z

You mean the sweep after the worldsema acquire, but, yes, I can confirm that I'm also seeing that problem. The 352ms I saw before could actually have been this (my tracing wasn't high enough fidelity to separate them). I am also seeing that it sometimes takes a long time to acquire worldsema (33ms, 40ms).

However, the problem is not that there are unswept spans at this point. Even when we take a long time to get through the loop, we don't actually execute any iterations of the loop. The time is spent in the single call to gosweepone(), which ultimately returns -1.

aclements · 2015-07-30T02:32:06Z

I added a probe when we preempt the backgroundgc goroutine. 5 out of 5 times I've run it there's been an involuntary preemption of the GC startup operation that takes a long time and lets the heap overflow. It's hard to say if this is cause or effect—the preemption could be happening because we're taking too long to do an operation—but when it happens it certainly exacerbates the situation since the large number of runnable goroutines in this test means the GC goroutine won't get rescheduled for some time.

RLH · 2015-07-30T18:23:37Z

It looks like the preemption can happen immediately after the backgroundgc
is unparked. This is before it has any time to do much of anything beyond
setting a few globals.

The scheduler can suppress the preemption of the background GC during times
the mutator does not have any back pressure or ask/force the mutator to do
assists where we now have no back pressure. Currently this is when we are
entering or exiting the GC. The assist may well take the form of self
preemption if it notices the GC should be running on a P but isn't.

Both approaches have risk, whether the risk of not doing anything is higher
is unknown.

On Wed, Jul 29, 2015 at 10:32 PM, Austin Clements notifications@github.com
wrote:

I added a probe when we preempt the backgroundgc goroutine. 5 out of 5
times I've run it there's been an involuntary preemption of the GC startup
operation that takes a long time and lets the heap overflow. It's hard to
say if this is cause or effect—the preemption could be happening because
we're taking too long to do an operation—but when it happens it certainly
exacerbates the situation since the large number of runnable goroutines in
this test means the GC goroutine won't get rescheduled for some time.

—
Reply to this email directly or view it on GitHub
#11911 (comment).

aclements · 2015-07-30T20:50:38Z

@RLH and I discussed and the plan is to set a flag to indicate when we want to be GCing, but can't have mutator assists enabled (because we're either just starting or just finishing GC), and make the mutator goroutine yield if it would otherwise assist. This means 1) from when we surpass the heap trigger and ready the GC goroutine to just after the first stop-the-world, and 2) from just before the last start-the-world to right before the GC goroutine parks.

A more aggressive form of this is to have the mutator repeatedly yield until the GC says it's ready, but we're worried that might be too risky at this point.

I think there are other causes mixed in with this, but we'll see if this mitigates at least part of the problem.

I'm writing a Go1.6Early issue about ways to fix this more properly.

aclements · 2015-07-30T21:27:58Z

With this fix I was able to get through 3,171 GC cycles with the test program before the heap size blew up and when it did blow up, assists were enabled, which suggests a second problem. But, still, this is much better than it was.

aclements · 2015-07-30T21:54:40Z

I've reproduced the remaining assist failure twice now and both times in the failing cycle we also swept several thousand spans at the beginning of that cycle. We're not supposed to be sweeping any spans (and usually we don't). This suggests that some accounting is off and it's throwing off both the sweep ratio and the assist ratio. This sweeping is also bad for the mutator yielding trick because mutators that are allocating wind up yielding a lot during this sweep process.

RLH · 2015-07-30T21:59:55Z

At times I've suspected that there might be a race between one GC ending
and another one starting immediately even though there were spans holding
lots of space. Not sure how to eliminate this possibility.

On Thursday, July 30, 2015, Austin Clements notifications@github.com
wrote:

I've reproduced the remaining assist failure twice now and both times in
the failing cycle we also swept several thousand spans at the beginning of
that cycle. We're not supposed to be sweeping any spans (and usually we
don't). This suggests that some accounting is off and it's throwing off
both the sweep ratio and the assist ratio. This sweeping is also bad for
the mutator yielding trick because mutators that are allocating wind up
yielding a lot during this sweep process.

—
Reply to this email directly or view it on GitHub
#11911 (comment).

rsc · 2015-07-31T02:44:11Z

Can the mutator sweepone+yield instead of just yielding?

gopherbot · 2015-07-31T19:00:30Z

CL https://golang.org/cl/13026 mentions this issue.

gopherbot · 2015-07-31T19:00:38Z

CL https://golang.org/cl/13025 mentions this issue.

aclements · 2015-07-31T19:05:09Z

Can the mutator sweepone+yield instead of just yielding?

Usually the proportional sweep is doing its job and there is no sweep work to do (usually this issue is caused by bad scheduling, not by long sweeps). If I can't figure out why proportional sweep sometimes falls short, adding a sweepone to CL 13026 may at least improves the situation.

aclements · 2015-07-31T20:03:31Z

I believe I understand the cause of the unfinished concurrent sweep. It happens when next_gc (the GC trigger heap size) is within a few megs of the size of the retained heap at the end of GC. The proportional sweep works by using this "heap distance" and the number of spans that need to be swept to compute a sweep ratio. We perform extra sweeping to satisfy this ratio in mCentral_CacheSpan, which is what gets a span for a P to allocate from. However, it computes the extra pages to sweep based on how many heap bytes have been allocated since the end of the last GC. This is where the problem lies. During mark termination GC flushes all mcaches, so as soon as it starts the world again all of the Ps will repopulate their mcaches and go through this sweep debt calculation. But at this point there have hardly been any bytes allocated, so very little is swept. Since the distance to the next GC trigger is very small, the Ps are able to allocate right up to the GC trigger just from these cached spans. Since they don't get new cached spans, proportional sweep simply doesn't kick in and we wind up at the GC trigger with a bunch of unswept pages.

aclements · 2015-07-31T20:16:37Z

(And, yes, this is another instance of operating on debt rather than on credit, though I didn't realize this until this afternoon. Luckily, we've tripped over the only two credit systems in the runtime at this point, so I don't think there are other instances of this problem. :)

aclements · 2015-07-31T22:19:14Z

I believe I understand the cause of the unfinished concurrent sweep.

Okay, I have a fix for this that bases the sweep on how many bytes of spans have been allocated from the mheap, rather than how many bytes of objects have been allocated from spans. I've run it for a few thousand GC cycles and the concurrent sweep has always finished before GC starts.

gopherbot · 2015-08-03T16:00:28Z

CL https://golang.org/cl/13043 mentions this issue.

gopherbot · 2015-08-03T16:00:36Z

CL https://golang.org/cl/13044 mentions this issue.

gopherbot · 2015-08-03T23:00:30Z

CL https://golang.org/cl/13047 mentions this issue.

gopherbot · 2015-08-03T23:00:34Z

CL https://golang.org/cl/13046 mentions this issue.

Currently allocation checks the GC trigger speculatively during allocation and then triggers the GC without rechecking. As a result, it's possible for G 1 and G 2 to detect the trigger simultaneously, both enter startGC, G 1 actually starts GC while G 2 gets preempted until after the whole GC cycle, then G 2 immediately starts another GC cycle even though the heap is now well under the trigger. Fix this by re-checking the GC trigger non-speculatively just before actually kicking off a new GC cycle. This contributes to #11911 because when this happens, we definitely don't finish the background sweep before starting the next GC cycle, which can significantly delay the start of concurrent scan. Change-Id: I560ab79ba5684ba435084410a9765d28f5745976 Reviewed-on: https://go-review.googlesource.com/13025 Reviewed-by: Russ Cox <rsc@golang.org> Reviewed-by: Rick Hudson <rlh@golang.org>

Currently there are two sensitive periods during which a mutator can allocate past the heap goal but mutator assists can't be enabled: 1) at the beginning of GC between when the heap first passes the heap trigger and sweep termination and 2) at the end of GC between mark termination and when the background GC goroutine parks. During these periods there's no back-pressure or safety net, so a rapidly allocating mutator can allocate past the heap goal. This is exacerbated if there are many goroutines because the GC coordinator is scheduled as any other goroutine, so if it gets preempted during one of these periods, it may stay preempted for a long period (10s or 100s of milliseconds). Normally the mutator does scan work to create back-pressure against allocation, but there is no scan work during these periods. Hence, as a fall back, if a mutator would assist but can't yet, simply yield the CPU. This delays the mutator somewhat, but more importantly gives more CPU time to the GC coordinator for it to complete the transition. This is obviously a workaround. Issue #11970 suggests a far better but far more invasive way to fix this. Updates #11911. (This very nearly fixes the issue, but about once every 15 minutes I get a GC cycle where the assists are enabled but don't do enough work.) Change-Id: I9768b79e3778abd3e06d306596c3bd77f65bf3f1 Reviewed-on: https://go-review.googlesource.com/13026 Reviewed-by: Russ Cox <rsc@golang.org> Reviewed-by: Rick Hudson <rlh@golang.org>

Currently it's possible for the next_gc heap size trigger computed for the next GC cycle to be less than the current allocated heap size. This means the next cycle will start immediately, which means there's no time to perform the concurrent sweep between GC cycles. This places responsibility for finishing the sweep on GC itself, which delays GC start-up and hence delays mutator assist. Fix this by ensuring that next_gc is always at least a little higher than the allocated heap size, so we won't trigger the next cycle instantly. Updates #11911. Change-Id: I74f0b887bf187518d5fedffc7989817cbcf30592 Reviewed-on: https://go-review.googlesource.com/13043 Reviewed-by: Rick Hudson <rlh@golang.org> Reviewed-by: Russ Cox <rsc@golang.org>

Proportional concurrent sweep is currently based on a ratio of spans to be swept per bytes of object allocation. However, proportional sweeping is performed during span allocation, not object allocation, in order to minimize contention and overhead. Since objects are allocated from spans after those spans are allocated, the system tends to operate in debt, which means when the next GC cycle starts, there is often sweep debt remaining, so GC has to finish the sweep, which delays the start of the cycle and delays enabling mutator assists. For example, it's quite likely that many Ps will simultaneously refill their span caches immediately after a GC cycle (because GC flushes the span caches), but at this point, there has been very little object allocation since the end of GC, so very little sweeping is done. The Ps then allocate objects from these cached spans, which drives up the bytes of object allocation, but since these allocations are coming from cached spans, nothing considers whether more sweeping has to happen. If the sweep ratio is high enough (which can happen if the next GC trigger is very close to the retained heap size), this can easily represent a sweep debt of thousands of pages. Fix this by making proportional sweep proportional to the number of bytes of spans allocated, rather than the number of bytes of objects allocated. Prior to allocating a span, both the small object path and the large object path ensure credit for allocating that span, so the system operates in the black, rather than in the red. Combined with the previous commit, this should eliminate all sweeping from GC start up. On the stress test in issue #11911, this reduces the time spent sweeping during GC (and delaying start up) by several orders of magnitude: mean 99%ile max pre fix 1 ms 11 ms 144 ms post fix 270 ns 735 ns 916 ns Updates #11911. Change-Id: I89223712883954c9d6ec2a7a51ecb97172097df3 Reviewed-on: https://go-review.googlesource.com/13044 Reviewed-by: Rick Hudson <rlh@golang.org> Reviewed-by: Russ Cox <rsc@golang.org>

This was useful in debugging the mutator assist behavior for #11911, and it fits with the other gcpacertrace output. Change-Id: I1e25590bb4098223a160de796578bd11086309c7 Reviewed-on: https://go-review.googlesource.com/13046 Reviewed-by: Rick Hudson <rlh@golang.org> Reviewed-by: Russ Cox <rsc@golang.org>

dvyukov assigned aclements Jul 28, 2015

dvyukov added this to the Go1.5 milestone Jul 28, 2015

aclements mentioned this issue Jul 31, 2015

runtime: replace GC coordinator with state machine #11970

Closed

rsc mentioned this issue Aug 3, 2015

runtime: fatal error: runtime: address space conflict #11993

Closed

aclements closed this as completed in 1fb01a8 Aug 4, 2015

golang locked and limited conversation to collaborators Aug 5, 2016

gopherbot added the FrozenDueToAge label Aug 5, 2016

rsc unassigned aclements Jun 23, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

runtime: unbounded heap growth with concurrent GC #11911

runtime: unbounded heap growth with concurrent GC #11911

dvyukov commented Jul 28, 2015

RLH commented Jul 29, 2015 via email

aclements commented Jul 29, 2015 via email

aclements commented Jul 29, 2015

aclements commented Jul 29, 2015

RLH commented Jul 29, 2015

aclements commented Jul 29, 2015

aclements commented Jul 30, 2015

RLH commented Jul 30, 2015

aclements commented Jul 30, 2015

aclements commented Jul 30, 2015

aclements commented Jul 30, 2015

RLH commented Jul 30, 2015

rsc commented Jul 31, 2015

gopherbot commented Jul 31, 2015

gopherbot commented Jul 31, 2015

aclements commented Jul 31, 2015

aclements commented Jul 31, 2015

aclements commented Jul 31, 2015

aclements commented Jul 31, 2015

gopherbot commented Aug 3, 2015

gopherbot commented Aug 3, 2015

gopherbot commented Aug 3, 2015

gopherbot commented Aug 3, 2015

runtime: unbounded heap growth with concurrent GC #11911

runtime: unbounded heap growth with concurrent GC #11911

Comments

dvyukov commented Jul 28, 2015

RLH commented Jul 29, 2015 via email

aclements commented Jul 29, 2015 via email

aclements commented Jul 29, 2015

aclements commented Jul 29, 2015

RLH commented Jul 29, 2015

aclements commented Jul 29, 2015

aclements commented Jul 30, 2015

RLH commented Jul 30, 2015

aclements commented Jul 30, 2015

aclements commented Jul 30, 2015

aclements commented Jul 30, 2015

RLH commented Jul 30, 2015

rsc commented Jul 31, 2015

gopherbot commented Jul 31, 2015

gopherbot commented Jul 31, 2015

aclements commented Jul 31, 2015

aclements commented Jul 31, 2015

aclements commented Jul 31, 2015

aclements commented Jul 31, 2015

gopherbot commented Aug 3, 2015

gopherbot commented Aug 3, 2015

gopherbot commented Aug 3, 2015

gopherbot commented Aug 3, 2015