
runtime: unbounded heap growth with concurrent GC #11911

Closed
dvyukov opened this issue Jul 28, 2015 · 23 comments

@dvyukov
Member

dvyukov commented Jul 28, 2015

The following program has a live set of at most 100MB. With GODEBUG=gcstoptheworld=1, RSS stabilizes at ~204MB according to top, which makes sense. However, with GODEBUG=gcstoptheworld=0, RSS grows steadily but chaotically over time: after only a few minutes it had reached 1106MB, and one of the jumps was from 212MB to 678MB in a single second. The GOMAXPROCS value may have an effect on this; I got the worst numbers with GOMAXPROCS=2/4, but I did not make an exhaustive investigation, so maybe GOMAXPROCS is a red herring.

package main

import (
    "math/rand"
    "sync"
    "sync/atomic"
    "unsafe"
)

var (
    gs, retained, seed int32
    mu                 sync.Mutex
    blocks             [][]*byte
    sink               = make([]byte, 4096)
)

const (
    maxGs       = 100       // max concurrent worker goroutines
    maxRetained = 100 << 20 // cap on retained block bytes (100MB)
    maxBlock    = 64 << 10  // size of each allocated block in bytes (64KB)
)

func main() {
    go worker()
    select {}
}

// worker randomly spawns new workers, exits, and allocates or frees
// 64KB blocks of pointers, keeping total retained memory under maxRetained.
func worker() {
    rnd := rand.New(rand.NewSource(int64(atomic.AddInt32(&seed, 1))))
    ss := unsafe.Sizeof(&gs)
    var local [][]*byte
loop:
    for {
        switch r := rnd.Intn(1000); {
        case r == 0 || r == 1:
            if !tryAdd(&gs, 1, maxGs, 1) {
                continue loop
            }
            go worker()
        case r == 2:
            if !tryAdd(&gs, 1, maxGs, -1) {
                continue loop
            }
            s := 0
            for _, b := range local {
                s += len(b)
            }
            local = nil
            if !tryAdd(&retained, 0, maxRetained, -int32(s)*int32(ss)) {
                panic("bad")
            }
            return
        case r < 300:
            s := maxBlock / int(ss)
            if !tryAdd(&retained, 0, maxRetained, int32(s)*int32(ss)) {
                continue loop
            }
            b := make([]*byte, s)
            p := &sink[len(sink)-1]
            for i := range b {
                b[i] = p
            }
            mu.Lock()
            blocks = append(blocks, b)
            mu.Unlock()
        case r < 500:
            s := 0
            mu.Lock()
            if len(blocks) > 0 {
                idx := rnd.Intn(len(blocks))
                s = len(blocks[idx])
                blocks[idx] = blocks[len(blocks)-1]
                blocks[len(blocks)-1] = nil
                blocks = blocks[:len(blocks)-1]
            }
            mu.Unlock()
            if !tryAdd(&retained, 0, maxRetained, -int32(s)*int32(ss)) {
                panic("bad")
            }
        case r < 800:
            s := maxBlock / int(ss)
            if !tryAdd(&retained, 0, maxRetained, int32(s)*int32(ss)) {
                continue loop
            }
            b := make([]*byte, s)
            p := &sink[len(sink)-1]
            for i := range b {
                b[i] = p
            }
            local = append(local, b)
        default:
            s := 0
            if len(local) > 0 {
                idx := rnd.Intn(len(local))
                s = len(local[idx])
                local[idx] = local[len(local)-1]
                local[len(local)-1] = nil
                local = local[:len(local)-1]
            }
            if !tryAdd(&retained, 0, maxRetained, -int32(s)*int32(ss)) {
                panic("bad")
            }
        }
    }
}

// tryAdd atomically adds v to *p, failing (and leaving *p unchanged)
// if the result would fall outside [min, max].
func tryAdd(p *int32, min, max, v int32) bool {
    for {
        old := atomic.LoadInt32(p)
        xchg := old + v
        if xchg < min || xchg > max {
            return false
        }
        if atomic.CompareAndSwapInt32(p, old, xchg) {
            return true
        }
    }
}

go version devel +d0729a6 Tue Jul 28 06:23:38 2015 +0000 linux/amd64

@dvyukov dvyukov added this to the Go1.5 milestone Jul 28, 2015
@RLH
Contributor

RLH commented Jul 29, 2015 via email

@aclements
Member

aclements commented Jul 29, 2015 via email

@aclements
Member

It looks like the start of GC is being delayed sometimes. The jumps in RSS correspond to large heap sizes at the start of a GC cycle (well beyond the heap trigger) and these cycles tend to start much longer after the previous cycle than normal (~500ms instead of ~150ms). We successfully collect back to ~100 MB (plus a little for floating garbage), but the memory doesn't get returned to the system because the scavenger is conservative (and may never get returned if these bumps keep happening).

@RLH is adding instrumentation to pin down exactly where the delay happens.
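
For readers reproducing this, here is a minimal sketch (not part of the original report) of how one might watch the gap between the live heap and the GC trigger from inside the test program. The monitor function and its 100ms interval are illustrative additions; HeapAlloc and NextGC are standard runtime.MemStats fields.

// monitor can be dropped into the test program above (add "fmt", "runtime",
// and "time" to the imports and start it with `go monitor()` from main).
// It periodically reports how far the allocated heap has run past the point
// where the next GC cycle should have started; large positive overshoots
// correspond to the RSS jumps described above.
func monitor() {
    var m runtime.MemStats
    for range time.Tick(100 * time.Millisecond) {
        runtime.ReadMemStats(&m)
        overshoot := int64(m.HeapAlloc) - int64(m.NextGC)
        fmt.Printf("heap=%dMB trigger=%dMB overshoot=%dMB\n",
            m.HeapAlloc>>20, m.NextGC>>20, overshoot/(1<<20))
    }
}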

@aclements
Member

It looks like this is usually caused by a long semacquire(&worldsema). I've seen this take as long as 352ms with your example program. I don't know what's causing this, since this program doesn't obviously use anything else that would be acquiring worldsema.

I've seen other causes, too, though. In one run, it took 86ms to get from the Gosched in startGC to actually running the GC goroutine. I suspect the GC goroutine got kicked out of or stolen from runnext, or delayed by something considered higher priority by schedule.

@RLH
Contributor

RLH commented Jul 29, 2015

I am seeing the sweep prior to the worldsema being acquired by the GC taking a lot of time, and there is no throttling of the mutator since bggc.work is set to 1.


@aclements
Member

You mean the sweep after the worldsema acquire, but, yes, I can confirm that I'm also seeing that problem. The 352ms I saw before could actually have been this (my tracing didn't have high enough fidelity to separate them). I am also seeing that it sometimes takes a long time to acquire worldsema (33ms, 40ms).

However, the problem is not that there are unswept spans at this point. Even when we take a long time to get through the loop, we don't actually execute any iterations of the loop. The time is spent in the single call to gosweepone(), which ultimately returns -1.

@aclements
Member

I added a probe when we preempt the backgroundgc goroutine. 5 out of 5 times I've run it there's been an involuntary preemption of the GC startup operation that takes a long time and lets the heap overflow. It's hard to say if this is cause or effect—the preemption could be happening because we're taking too long to do an operation—but when it happens it certainly exacerbates the situation since the large number of runnable goroutines in this test means the GC goroutine won't get rescheduled for some time.

@RLH
Contributor

RLH commented Jul 30, 2015

It looks like the preemption can happen immediately after the backgroundgc goroutine is unparked, before it has had time to do much of anything beyond setting a few globals.

The scheduler could suppress preemption of the background GC during the periods when the mutator has no back pressure, or we could ask/force the mutator to do assists during those periods where we currently have no back pressure; at the moment those periods are when we are entering or exiting the GC. The assist may well take the form of self-preemption if the mutator notices the GC should be running on a P but isn't.

Both approaches have risk; whether the risk of doing nothing is higher is unknown.


@aclements
Member

@RLH and I discussed and the plan is to set a flag to indicate when we want to be GCing, but can't have mutator assists enabled (because we're either just starting or just finishing GC), and make the mutator goroutine yield if it would otherwise assist. This means 1) from when we surpass the heap trigger and ready the GC goroutine to just after the first stop-the-world, and 2) from just before the last start-the-world to right before the GC goroutine parks.

A more aggressive form of this is to have the mutator repeatedly yield until the GC says it's ready, but we're worried that might be too risky at this point.

I think there are other causes mixed in with this, but we'll see if this mitigates at least part of the problem.

I'm writing a Go1.6Early issue about ways to fix this more properly.
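
Roughly, that workaround has the following shape. This is a sketch only, not the actual runtime patch (which lives in the allocation/assist path); the names gcTransitioning and maybeAssist are made up for illustration.

package sketch

import (
    "runtime"
    "sync/atomic"
)

// gcTransitioning stands in for the flag described above: set while GC wants
// to run but mutator assists cannot be enabled yet, i.e. from passing the
// heap trigger until just after the first stop-the-world, and again from just
// before the last start-the-world until the GC goroutine parks.
var gcTransitioning int32

// maybeAssist sketches the mutator-side check: if an assist is owed but the
// GC is still in one of those windows, yield the CPU instead of allocating
// further past the goal, giving the GC coordinator time to finish.
func maybeAssist(assistOwed bool) {
    if !assistOwed {
        return
    }
    if atomic.LoadInt32(&gcTransitioning) == 1 {
        runtime.Gosched()
        return
    }
    // ...otherwise perform the normal mark assist work...
}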

@aclements
Member

With this fix I was able to get through 3,171 GC cycles with the test program before the heap size blew up, and when it did blow up, assists were enabled, which suggests a second problem. But, still, this is much better than it was.

@aclements
Member

I've reproduced the remaining assist failure twice now and both times in the failing cycle we also swept several thousand spans at the beginning of that cycle. We're not supposed to be sweeping any spans (and usually we don't). This suggests that some accounting is off and it's throwing off both the sweep ratio and the assist ratio. This sweeping is also bad for the mutator yielding trick because mutators that are allocating wind up yielding a lot during this sweep process.

@RLH
Contributor

RLH commented Jul 30, 2015

At times I've suspected that there might be a race between one GC ending and another one starting immediately, even though there were spans holding lots of space. Not sure how to eliminate this possibility.


@rsc
Contributor

rsc commented Jul 31, 2015

Can the mutator sweepone+yield instead of just yielding?

@gopherbot
Contributor

CL https://golang.org/cl/13026 mentions this issue.

@gopherbot
Contributor

CL https://golang.org/cl/13025 mentions this issue.

@aclements
Member

Can the mutator sweepone+yield instead of just yielding?

Usually the proportional sweep is doing its job and there is no sweep work to do (usually this issue is caused by bad scheduling, not by long sweeps). If I can't figure out why proportional sweep sometimes falls short, adding a sweepone to CL 13026 may at least improve the situation.

@aclements
Member

I believe I understand the cause of the unfinished concurrent sweep. It happens when next_gc (the GC trigger heap size) is within a few megs of the size of the retained heap at the end of GC. The proportional sweep works by using this "heap distance" and the number of spans that need to be swept to compute a sweep ratio. We perform extra sweeping to satisfy this ratio in mCentral_CacheSpan, which is what gets a span for a P to allocate from. However, it computes the extra pages to sweep based on how many heap bytes have been allocated since the end of the last GC. This is where the problem lies.

During mark termination GC flushes all mcaches, so as soon as it starts the world again all of the Ps will repopulate their mcaches and go through this sweep debt calculation. But at this point there have hardly been any bytes allocated, so very little is swept. Since the distance to the next GC trigger is very small, the Ps are able to allocate right up to the GC trigger just from these cached spans. Since they don't get new cached spans, proportional sweep simply doesn't kick in and we wind up at the GC trigger with a bunch of unswept pages.
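
Concretely, the debt-based arithmetic described above looks roughly like this (a sketch with made-up names, not the runtime's code):

package sketch

// sweepPagesOwed sketches the debt-based pacing described above: the sweep
// ratio is pages-to-sweep per byte of heap growth, and the debt is charged
// against bytes of objects allocated since the last GC. Immediately after GC
// the mcaches have been flushed, so every P refills its span cache while
// objectBytesSinceGC is still ~0, and this returns ~0 pages even though the
// trigger may be only a few MB away.
func sweepPagesOwed(sweepRatio float64, objectBytesSinceGC, pagesSweptThisCycle uintptr) int {
    owed := int(sweepRatio*float64(objectBytesSinceGC)) - int(pagesSweptThisCycle)
    if owed < 0 {
        return 0
    }
    return owed
}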

@aclements
Member

(And, yes, this is another instance of operating on debt rather than on credit, though I didn't realize this until this afternoon. Luckily, we've tripped over the only two credit systems in the runtime at this point, so I don't think there are other instances of this problem. :)

@aclements
Member

I believe I understand the cause of the unfinished concurrent sweep.

Okay, I have a fix for this that bases the sweep on how many bytes of spans have been allocated from the mheap, rather than how many bytes of objects have been allocated from spans. I've run it for a few thousand GC cycles and the concurrent sweep has always finished before GC starts.
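
The credit-based variant that fix describes would look something like this (again a sketch; sweepForSpanAlloc and sweepOnePage are illustrative stand-ins, not runtime functions):

package sketch

// sweepForSpanAlloc sketches the credit-based pacing from the fix above:
// before a span of spanBytes is taken from the heap, sweep enough pages to
// cover that growth, so the sweep work is paid up front ("in the black")
// instead of accumulating as debt that GC must clear at the start of the
// next cycle.
func sweepForSpanAlloc(sweepRatio float64, spanBytes uintptr, sweepOnePage func() bool) {
    pages := int(sweepRatio * float64(spanBytes))
    for i := 0; i < pages; i++ {
        if !sweepOnePage() {
            return // nothing left to sweep
        }
    }
}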

@gopherbot
Contributor

CL https://golang.org/cl/13043 mentions this issue.

@gopherbot
Contributor

CL https://golang.org/cl/13044 mentions this issue.

@gopherbot
Contributor

CL https://golang.org/cl/13047 mentions this issue.

@gopherbot
Contributor

CL https://golang.org/cl/13046 mentions this issue.

aclements added a commit that referenced this issue Aug 4, 2015
Currently allocation checks the GC trigger speculatively during
allocation and then triggers the GC without rechecking. As a result,
it's possible for G 1 and G 2 to detect the trigger simultaneously,
both enter startGC, G 1 actually starts GC while G 2 gets preempted
until after the whole GC cycle, then G 2 immediately starts another GC
cycle even though the heap is now well under the trigger.

Fix this by re-checking the GC trigger non-speculatively just before
actually kicking off a new GC cycle.

This contributes to #11911 because when this happens, we definitely
don't finish the background sweep before starting the next GC cycle,
which can significantly delay the start of concurrent scan.

Change-Id: I560ab79ba5684ba435084410a9765d28f5745976
Reviewed-on: https://go-review.googlesource.com/13025
Reviewed-by: Russ Cox <rsc@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
aclements added a commit that referenced this issue Aug 4, 2015
Currently there are two sensitive periods during which a mutator can
allocate past the heap goal but mutator assists can't be enabled: 1)
at the beginning of GC between when the heap first passes the heap
trigger and sweep termination and 2) at the end of GC between mark
termination and when the background GC goroutine parks. During these
periods there's no back-pressure or safety net, so a rapidly
allocating mutator can allocate past the heap goal. This is
exacerbated if there are many goroutines because the GC coordinator is
scheduled as any other goroutine, so if it gets preempted during one
of these periods, it may stay preempted for a long period (10s or 100s
of milliseconds).

Normally the mutator does scan work to create back-pressure against
allocation, but there is no scan work during these periods. Hence, as
a fall back, if a mutator would assist but can't yet, simply yield the
CPU. This delays the mutator somewhat, but more importantly gives more
CPU time to the GC coordinator for it to complete the transition.

This is obviously a workaround. Issue #11970 suggests a far better but
far more invasive way to fix this.

Updates #11911. (This very nearly fixes the issue, but about once
every 15 minutes I get a GC cycle where the assists are enabled but
don't do enough work.)

Change-Id: I9768b79e3778abd3e06d306596c3bd77f65bf3f1
Reviewed-on: https://go-review.googlesource.com/13026
Reviewed-by: Russ Cox <rsc@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
aclements added a commit that referenced this issue Aug 4, 2015
Currently it's possible for the next_gc heap size trigger computed for
the next GC cycle to be less than the current allocated heap size.
This means the next cycle will start immediately, which means there's
no time to perform the concurrent sweep between GC cycles. This places
responsibility for finishing the sweep on GC itself, which delays GC
start-up and hence delays mutator assist.

Fix this by ensuring that next_gc is always at least a little higher
than the allocated heap size, so we won't trigger the next cycle
instantly.

Updates #11911.

Change-Id: I74f0b887bf187518d5fedffc7989817cbcf30592
Reviewed-on: https://go-review.googlesource.com/13043
Reviewed-by: Rick Hudson <rlh@golang.org>
Reviewed-by: Russ Cox <rsc@golang.org>
aclements added a commit that referenced this issue Aug 4, 2015
Proportional concurrent sweep is currently based on a ratio of spans
to be swept per bytes of object allocation. However, proportional
sweeping is performed during span allocation, not object allocation,
in order to minimize contention and overhead. Since objects are
allocated from spans after those spans are allocated, the system tends
to operate in debt, which means when the next GC cycle starts, there
is often sweep debt remaining, so GC has to finish the sweep, which
delays the start of the cycle and delays enabling mutator assists.

For example, it's quite likely that many Ps will simultaneously refill
their span caches immediately after a GC cycle (because GC flushes the
span caches), but at this point, there has been very little object
allocation since the end of GC, so very little sweeping is done. The
Ps then allocate objects from these cached spans, which drives up the
bytes of object allocation, but since these allocations are coming
from cached spans, nothing considers whether more sweeping has to
happen. If the sweep ratio is high enough (which can happen if the
next GC trigger is very close to the retained heap size), this can
easily represent a sweep debt of thousands of pages.

Fix this by making proportional sweep proportional to the number of
bytes of spans allocated, rather than the number of bytes of objects
allocated. Prior to allocating a span, both the small object path and
the large object path ensure credit for allocating that span, so the
system operates in the black, rather than in the red.

Combined with the previous commit, this should eliminate all sweeping
from GC start up. On the stress test in issue #11911, this reduces the
time spent sweeping during GC (and delaying start up) by several
orders of magnitude:

                mean    99%ile     max
    pre fix      1 ms    11 ms   144 ms
    post fix   270 ns   735 ns   916 ns

Updates #11911.

Change-Id: I89223712883954c9d6ec2a7a51ecb97172097df3
Reviewed-on: https://go-review.googlesource.com/13044
Reviewed-by: Rick Hudson <rlh@golang.org>
Reviewed-by: Russ Cox <rsc@golang.org>
aclements added a commit that referenced this issue Aug 4, 2015
This was useful in debugging the mutator assist behavior for #11911,
and it fits with the other gcpacertrace output.

Change-Id: I1e25590bb4098223a160de796578bd11086309c7
Reviewed-on: https://go-review.googlesource.com/13046
Reviewed-by: Rick Hudson <rlh@golang.org>
Reviewed-by: Russ Cox <rsc@golang.org>
@golang golang locked and limited conversation to collaborators Aug 5, 2016