runtime: concurrent GC leads to 5x memory consumption #11677

Closed
mikioh opened this issue Jul 11, 2015 · 22 comments
@mikioh
Contributor

mikioh commented Jul 11, 2015

go version devel +2ae7737 Sat Jul 11 14:36:33 2015 +0000 freebsd/386

# go test -v -run=Stress$
=== RUN   TestTraceStress
runtime: memory allocated by OS (2188484608) not in usable range [0x93100000,0x13100000)
runtime: out of memory: cannot allocate 65536-byte block (640802816 in use)
fatal error: out of memory
(snip)
@mikioh mikioh modified the milestones: Go1.5, Go1.5Maybe Jul 11, 2015
@mikioh
Contributor Author

mikioh commented Jul 11, 2015

/CC @dvyukov

@bradfitz bradfitz modified the milestones: Go1.5, Go1.5Maybe Jul 18, 2015
@dvyukov dvyukov self-assigned this Jul 19, 2015
@dvyukov dvyukov changed the title from "runtime/pprof: TestTraceStress crashes with "fatal error: out of memory"" to "runtime: concurrent GC leads to 3x memory consumption" Jul 19, 2015
@dvyukov dvyukov assigned aclements and unassigned dvyukov Jul 19, 2015
@dvyukov dvyukov changed the title from "runtime: concurrent GC leads to 3x memory consumption" to "runtime: concurrent GC leads to 5x memory consumption" Jul 19, 2015
@dvyukov
Member

dvyukov commented Jul 19, 2015

That's a concurrent GC issue.

I've run:
$ GODEBUG=gctrace=1 ./pprof.test -test.run=TestTraceStress$ -test.cpu=1,1,1,1,.......,1

If I sprinkle runtime.GC calls in the test, in particular inside of the following loop:

    for i := 0; i < 1e3; i++ {
        _ = make([]byte, 1<<20)
    }
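For reference, a minimal sketch of what the instrumented loop might look like (the exact placement of the runtime.GC call is my assumption, not taken from the test):

    for i := 0; i < 1e3; i++ {
        _ = make([]byte, 1<<20)
        runtime.GC() // hypothetical: force a full collection after each 1 MB allocation
    }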

Memory usage stabilizes at ~68MB.

But if I remove the calls, and let the concurrent GC do its work, memory steadily grows to 240MB, then to 300MB, then to 340MB.

If I enable gctracing, then I see lines like this:

gc #1095 @60.193s 9%: 0.35+0.59+0.66+7.4+0.61 ms clock, 1.4+0.59+0+0/1.3/0.16+2.4 ms cpu, 4->125->2 MB, 4 MB goal, 10 P
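For context, reading the gctrace fields: "4->125->2 MB" is the heap size at GC start, the heap size at GC end, and the live heap after marking. With roughly 2 MB live and the default GOGC=100, the pacer's goal is about live × (1 + GOGC/100) ≈ 4 MB, which matches the "4 MB goal" above, yet this cycle actually finished at 125 MB.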

GC is basically 30x off of the goal.

GC must not ever let heap grow above the GOGC goal. If it needs some reserve room, it should aim at 0.9 GOGC or so, but then cap the heap at GOGC if it fails to reclaim memory in time.

@RLH
Contributor

RLH commented Jul 20, 2015

This is certainly worth looking into, but stating that "GC must not ever let heap grow above the GOGC goal" is promising something that Go can only promise in an operational sense based on the current implementation. That definition seems like a rat hole. Keeping GOGC as a goal met on a quality-of-service basis makes more sense.


@dvyukov
Member

dvyukov commented Jul 20, 2015

If it is not a hard limit, then it is not possible to provision a machine for a Go program.

@RLH
Contributor

RLH commented Jul 20, 2015

Yet machines run Go programs. What is the use case? The example is worth investigating, but I don't believe it is likely to happen in the wild.


@bradfitz
Contributor

@RLH, tests are failing regularly. The problem is not hypothetical. Very senior Go developers can't reason about memory usage and write reliable tests. Surely there is some problem here. It may not be the GC, but let's not be so quick to dismiss it as a possibility. We need to get the world not flaky.

@aclements
Member

We can argue all we want about GOGC and provisioning, but there's definitely a bug here:

gc #1095 @60.193s 9%: 0.35+0.59+0.66+7.4+0.61 ms clock, 1.4+0.59+0+0/1.3/0.16+2.4 ms cpu, 4->125->2 MB, 4 MB goal, 10 P

It was supposed to finish at 4 MB but finished at 125 MB. It shouldn't be possible to miss by that much. We're probably mis-accounting something. We spent less than a microsecond in assists (probably we spent no time in assists), which should have acted as the back-pressure against the heap going this far past the goal.

Another odd thing here is that the duration of the mark phase was 7.4 ms, but we only spent 1.46 ms of CPU time in GC during that phase. Those times are small enough that it's possible it's just scheduling noise delaying the start of the mark workers (from the OS, the runtime, or both), but that shouldn't impact assists.

@RLH
Contributor

RLH commented Jul 20, 2015

The role of the GC in this issue is real. Whether the GC can ever exceed some hard memory limit is not.


@rsc
Contributor

rsc commented Jul 21, 2015

Let's make this about the bug that Austin found and not about GC philosophy.

@aclements
Member

Here's what's going on. The loop that allocates 1,000 1MB byte slices does enter the assist and the assist knows that it needs to do a lot of work to pay for the allocation, but it finds no work to do, so it returns immediately. It keeps building up assist debt, but also keeps finding no work to do, so there's no back pressure.

The reason it's not finding any work to do appears to be related to #11694. There's no work to do because we're actually out of work. But we haven't entered mark termination yet, so the mutator is still running. This can happen simply because it takes us a while to go into mark termination when we run out of work, because we do things like rescanning globals to pick up more work to do in concurrent mark.

It's also theoretically possible for this to happen when there is work to do, but it's not available to the assist. For example, the code in #11694 creates a heap that's almost entirely a long singly-linked list. A dedicated mark worker could be working on that list, but there's no way to hand off that work, so the assist won't be able to do anything, regardless of the assist debt.

So it seems like we need to do something if we're in the assist and can't find enough work to pay down the assist debt.
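A rough sketch of the failure mode described above (illustrative Go-style pseudocode, not the actual runtime code; tryGetMarkWork and scanAndCredit are made-up helpers):

    // The assist is supposed to pay for recent allocation by doing mark work.
    func gcAssistAlloc(debt int64) {
        for debt > 0 {
            work := tryGetMarkWork() // hypothetical: grab an object/span to scan
            if work == nil {
                // The bug described above: no work is available, so the assist
                // returns immediately. The allocation proceeds anyway, the debt
                // keeps growing, and there is no back-pressure on the mutator.
                return
            }
            debt -= scanAndCredit(work) // hypothetical: scan it and earn credit
        }
    }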

@RLH
Contributor

RLH commented Jul 22, 2015

#11694 is related. #11694 has trace work to do but the GC just can't find it. The mutator can occasionally hide a "gold" pointer long enough for the GC to progress to mark termination, where it finds the gold pointer along with the 10 million untraced objects it is connected to.

The solution to both probably includes back pressure that forces the mutator to be more cooperative; for #11677 this probably means putting the mutator to sleep for some period proportional to its debt, in effect slowing allocation by this goroutine down. For #11694 it means the same, with the added twist that before it sleeps it needs to do a gcw.dispose or in some other way reveal the gold pointer to the GC.

The trick of course is to minimize the cost to more mainstream applications so we can deal with these corner-case devils gracefully.


@dvyukov
Member

dvyukov commented Jul 22, 2015

This can happen simply because it takes us a while to go into mark termination when we run out of work, because we do things like rescanning globals to pick up more work to do in concurrent mark.

I think for 1.6 we need to make all work equally available to all goroutines. That is, allow mutators to rescan stacks and globals if that's the next piece of work that GC needs to perform.

The solution to both probably includes back pressure that forces the mutator to be more cooperative; for #11677 this probably means putting the mutator to sleep for some period proportional to its debt, in effect slowing allocation by this goroutine down.

We need to put them to sleep before the allocation. That is, either you already have credit to pay for the allocation, or you are not allowed to do it.

I am also worried about the transition period when the heap profile/size changes. If the controller parameters are incorrect, can't mutators under-help and lead to overallocation?

@aclements
Member

There appears to be one other cause of this. We don't enable assists until the concurrent mark phase, which means there's some time between when we hit the GC trigger heap size and when we finish the scan phase and go into mark, during which the mutator can continue to allocate without back-pressure. With my prototype fix for the "no work" problem, I see that when we still go over the heap goal, we're over by exactly the amount that was allocated between the GC trigger and the beginning of concurrent mark.

@aclements
Member

The solution to both probably includes back pressure that forces the mutator to be more cooperative; for #11677 this probably means putting the mutator to sleep for some period proportional to its debt, in effect slowing allocation by this goroutine down.

Currently, there are three ways to satisfy an assist: 1) the mutator steals credit from background GC, 2) the mutator actually does GC work, and 3) there is no more work available. 3 is one of the causes of overshooting the heap goal (and was never intended as a way to satisfy an assist). I have a prototype solution that disallows 3 by making the assist block until it can satisfy 1 or 2 or GC terminates. This actually looks a lot like the getfull barrier, except that it also repeatedly tries to steal credit. Unfortunately it's another spin-with-sleep loop like getfull, for the same reasons getfull is. I'm writing up a Go1.6Early issue about improving the GC barrier overall, since it's really showing its inappropriateness for concurrent GC.
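A sketch of that blocking behavior (my simplification of the idea, with made-up helper names; not the actual change):

    func gcAssistAlloc(debt int64) {
        for debt > 0 && gcMarkPhaseActive() { // hypothetical: loop until paid off or the cycle ends
            if stealBackgroundCredit(&debt) { // way 1: steal credit earned by background GC
                continue
            }
            if work := tryGetMarkWork(); work != nil { // way 2: do mark work directly
                debt -= scanAndCredit(work)
                continue
            }
            // No credit and no work: instead of returning (the old way 3),
            // block briefly and retry, so the allocation cannot race ahead
            // of the heap goal while the GC is still running.
            sleepBriefly()
        }
    }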

@aclements
Member

I think for 1.6 we need to make all work equally available to all goroutines. That is, allow mutators to rescan stacks and globals if that's the next piece of work that GC needs to perform.

I think this is a good idea.

We need to put them to sleep before the allocation. That is, you already have credit to pay for the allocation, or you are not allowed to do it.

Perhaps. Currently, we only do assists if we just did a large allocation or got a new span to keep it off the common path, which means it would be tricky to do the assist before we do the allocation. But we could change this so it updates the assist debt first and if it's over some small-ish threshold, do the assist (or any number of variations on this theme).
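As a sketch of that variation (hypothetical structure, only to illustrate charging the debt before the allocation):

    func allocWithAssist(size int) []byte {
        chargeAssistDebt(size)              // hypothetical: account for the bytes up front
        if assistDebt() > assistThreshold { // the "small-ish threshold" mentioned above
            gcAssistAlloc(assistDebt())     // pay the debt down before allocating
        }
        return make([]byte, size)
    }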

I am also worried about the transition period when the heap profile/size changes. If the controller parameters are incorrect, can't mutators under-help and lead to overallocation?

No, modulo bugs like this one. We conservatively base the total assist work on the entire allocated heap size. (Note that this is different from what went out in the original GC pacing proposal, which was much less conservative and occasionally had very bad results.)

@aclements
Member

We don't enable assists until the concurrent mark phase, which means there's some time between when we hit the GC trigger heap size and when we finish the scan phase and go into mark, during which the mutator can continue to allocate without back-pressure.

My plan to fix this was to enable assists earlier. The write barrier must be on before we enable assists, but since we turn it on in the scan phase, we can in principle enable assists quite early. Unfortunately, this introduces other problems. For example, it's possible for assists during the scan phase to drain everything we have so far and signal completion even though the scan is still collecting pointers. I can work around this, but this change still has deleterious effects I don't yet understand like doubling the duration and heap size of TestTraceStress.

@gopherbot
Contributor

CL https://golang.org/cl/12672 mentions this issue.

@gopherbot
Contributor

CL https://golang.org/cl/12674 mentions this issue.

@gopherbot
Contributor

CL https://golang.org/cl/12671 mentions this issue.

@gopherbot
Contributor

CL https://golang.org/cl/12670 mentions this issue.

aclements added a commit that referenced this issue Jul 27, 2015
Currently it's possible for the GC assist to signal completion of the
mark phase, which puts the GC coordinator goroutine on the current P's
run queue, and then return to mutator code that delays until the next
forced preemption before actually yielding control to the GC
coordinator, dragging out completion of the mark phase. This delay can
be further exacerbated if the mutator makes other goroutines runnable
before yielding control, since this will push the GC coordinator on
the back of the P's run queue.

To fix this, this adds a Gosched to the assist if it completed the
mark phase. This immediately and directly yields control to the GC
coordinator. This already happens implicitly in the background mark
workers because they park immediately after completing the mark.

This is one of the reasons completion of the mark phase is being
dragged out and allowing the mutator to allocate without assisting,
leading to the large heap goal overshoot in issue #11677. This is also
a prerequisite to making the assist block when it can't pay off its
debt.

Change-Id: I586adfbecb3ca042a37966752c1dc757f5c7fc78
Reviewed-on: https://go-review.googlesource.com/12670
Reviewed-by: Russ Cox <rsc@golang.org>
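Roughly, the change described above amounts to something like this at the end of the assist (a simplified sketch of the idea, not the actual diff; markPhaseCompleted is a made-up flag for illustration):

    if markPhaseCompleted {
        // Yield immediately so the GC coordinator goroutine, which was just
        // made runnable on this P's run queue, can start mark termination
        // without waiting for the next forced preemption point.
        Gosched()
    }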
aclements added a commit that referenced this issue Jul 27, 2015
Currently, there are three ways to satisfy a GC assist: 1) the mutator
steals credit from background GC, 2) the mutator actually does GC
work, and 3) there is no more work available. 3 was never really
intended as a way to satisfy an assist, and it causes problems: there
are periods when it's expected that the GC won't have any work, such
as when transitioning from mark 1 to mark 2 and from mark 2 to mark
termination. During these periods, there's no back-pressure on rapidly
allocating mutators, which lets them race ahead of the heap goal.

For example, test/init1.go and the runtime/trace test both have small
reachable heaps and contain loops that rapidly allocate large garbage
byte slices. This bug lets these tests exceed the heap goal by several
orders of magnitude.

Fix this by forcing the assist (and hence the allocation) to block
until it can satisfy its debt via either 1 or 2, or the GC cycle
terminates.

This fixes one of the causes of #11677. It's still possible to overshoot
the GC heap goal, but with this change the overshoot is almost exactly
by the amount of allocation that happens during the concurrent scan
phase, between when the heap passes the GC trigger and when the GC
enables assists.

Change-Id: I5ef4edcb0d2e13a1e432e66e8245f2bd9f8995be
Reviewed-on: https://go-review.googlesource.com/12671
Reviewed-by: Russ Cox <rsc@golang.org>
aclements added a commit that referenced this issue Jul 27, 2015
Currently the GC coordinator enables GC assists at the same time it
enables background mark workers, after the concurrent scan phase is
done. However, this means a rapidly allocating mutator has the entire
scan phase during which to allocate beyond the heap trigger and
potentially beyond the heap goal with no back-pressure from assists.
This prevents the feedback system that's supposed to keep the heap
size under the heap goal from doing its job.

Fix this by enabling mutator assists during the scan phase. This is
safe because the write barrier is already enabled and globally
acknowledged at this point.

There's still a very small window between when the heap size reaches
the heap trigger and when the GC coordinator is able to stop the world
during which the mutator can allocate unabated. This allows *very*
rapidly allocating mutators like TestTraceStress to still occasionally
exceed the heap goal by a small amount (~20 MB at most for
TestTraceStress). However, this seems like a corner case.

Fixes #11677.

Change-Id: I0f80d949ec82341cd31ca1604a626efb7295a819
Reviewed-on: https://go-review.googlesource.com/12674
Reviewed-by: Russ Cox <rsc@golang.org>
@aclements
Member

For future reference, this will print all GC cycles that go over the goal:

awk '{split($11, H, "->"); if (H[2] > $13) print}' < gctrace
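(This assumes the gctrace output, which the runtime writes to standard error, has been saved to a file named gctrace. In each line, $11 is the "start->end->live" heap field in MB and $13 is the goal in MB, so the command prints every cycle whose final heap size exceeded its goal.)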

@gopherbot
Contributor

CL https://golang.org/cl/16297 mentions this issue.

aclements added a commit that referenced this issue Nov 4, 2015
Currently dedicated mark workers participate in the getfull barrier
during concurrent mark. However, the getfull barrier wasn't designed
for concurrent work and this causes no end of headaches.

In the concurrent setting, participants come and go. This makes mark
completion susceptible to live-lock: since dedicated workers are only
periodically polling for completion, it's possible for the program to
be in some transient worker each time one of the dedicated workers
wakes up to check if it can exit the getfull barrier. It also
complicates reasoning about the system because dedicated workers
participate directly in the getfull barrier, but transient workers
must instead use trygetfull because they have exit conditions that
aren't captured by getfull (e.g., fractional workers exit when
preempted). The complexity of implementing these exit conditions
contributed to #11677. Furthermore, the getfull barrier is inefficient
because we could be running user code instead of spinning on a P. In
effect, we're dedicating 25% of the CPU to marking even if that means
we have to spin to make that 25%. It also causes issues on Windows
because we can't actually sleep for 100µs (#8687).

Fix this by making dedicated workers no longer participate in the
getfull barrier. Instead, dedicated workers simply return to the
scheduler when they fail to get more work, regardless of what other
workers are doing, and the scheduler only starts new dedicated workers
if there's work available. Everything that needs to be handled by this
barrier is already handled by detection of mark completion.

This makes the system much more symmetric because all workers and
assists now use trygetfull during concurrent mark. It also loosens the
25% CPU target so that we can give some of that 25% back to user code
if there isn't enough work to keep the mark worker busy. And it
eliminates the problematic 100µs sleep on Windows during concurrent
mark (though not during mark termination).

The downside of this is that if we hit a bottleneck in the heap graph
that then expands back out, the system may shut down dedicated workers
and take a while to start them back up. We'll address this in the next
commit.

Updates #12041 and #8687.

No effect on the go1 benchmarks. This slows down the garbage benchmark
by 9%, but we'll more than make it up in the next commit.

name              old time/op  new time/op  delta
XBenchGarbage-12  5.80ms ± 2%  6.32ms ± 4%  +9.03%  (p=0.000 n=20+20)

Change-Id: I65100a9ba005a8b5cf97940798918672ea9dd09b
Reviewed-on: https://go-review.googlesource.com/16297
Reviewed-by: Rick Hudson <rlh@golang.org>
@golang golang locked and limited conversation to collaborators Oct 26, 2016