runtime: concurrent GC leads to 5x memory consumption #11677

Closed
mikioh opened this issue Jul 11, 2015 · 22 comments
@mikioh
Contributor

mikioh commented Jul 11, 2015

go version devel +2ae7737 Sat Jul 11 14:36:33 2015 +0000 freebsd/386

# go test -v -run=Stress$
=== RUN   TestTraceStress
runtime: memory allocated by OS (2188484608) not in usable range [0x93100000,0x13100000)
runtime: out of memory: cannot allocate 65536-byte block (640802816 in use)
fatal error: out of memory
(snip)
@mikioh mikioh modified the milestones: Go1.5, Go1.5Maybe Jul 11, 2015
@mikioh
Contributor Author

mikioh commented Jul 11, 2015

/CC @dvyukov

@bradfitz bradfitz modified the milestones: Go1.5, Go1.5Maybe Jul 18, 2015
@dvyukov dvyukov self-assigned this Jul 19, 2015
@dvyukov dvyukov changed the title from "runtime/pprof: TestTraceStress crashes with "fatal error: out of memory"" to "runtime: concurrent GC leads to 3x memory consumption" Jul 19, 2015
@dvyukov dvyukov assigned aclements and unassigned dvyukov Jul 19, 2015
@dvyukov dvyukov changed the title from "runtime: concurrent GC leads to 3x memory consumption" to "runtime: concurrent GC leads to 5x memory consumption" Jul 19, 2015
@dvyukov
Member

dvyukov commented Jul 19, 2015

That's a concurrent GC issue.

I've run:
$ GODEBUG=gctrace=1 ./pprof.test -test.run=TestTraceStress$ -test.cpu=1,1,1,1,.......,1

If I sprinkle runtime.GC calls in the test, in particular inside of the following loop:

    for i := 0; i < 1e3; i++ {
        _ = make([]byte, 1<<20)
    }
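For reference, a minimal sketch of what the instrumented loop might look like (the exact placement of the runtime.GC call is my assumption, not taken from the test):

    for i := 0; i < 1e3; i++ {
        _ = make([]byte, 1<<20)
        runtime.GC() // hypothetical: force a full collection after each 1 MB allocation
    }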

Memory usage stabilizes at ~68MB.

But if I remove the calls, and let the concurrent GC do its work, memory steadily grows to 240MB, then to 300MB, then to 340MB.

If I enable gctracing, then I see lines like this:

gc #1095 @60.193s 9%: 0.35+0.59+0.66+7.4+0.61 ms clock, 1.4+0.59+0+0/1.3/0.16+2.4 ms cpu, 4->125->2 MB, 4 MB goal, 10 P
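For context, reading the gctrace fields: "4->125->2 MB" is the heap size at GC start, the heap size at GC end, and the live heap after marking. With roughly 2 MB live and the default GOGC=100, the pacer's goal is about live × (1 + GOGC/100) ≈ 4 MB, which matches the "4 MB goal" above, yet this cycle actually finished at 125 MB.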

GC is basically 30x off of the goal.

GC must not ever let heap grow above the GOGC goal. If it needs some reserve room, it should aim at 0.9 GOGC or so, but then cap the heap at GOGC if it fails to reclaim memory in time.

@RLH
Contributor

RLH commented Jul 20, 2015

This is certainly worth looking into, but stating that "GC must not ever let heap grow above the GOGC goal" is promising something that Go can only promise in an operational sense based on the current implementation. That definition seems like a rat hole. Keeping GOGC as a goal met on a quality-of-service basis makes more sense.


@dvyukov
Member

dvyukov commented Jul 20, 2015

If it is not a hard limit, then it is not possible to provision a machine for a Go program.

@RLH
Contributor

RLH commented Jul 20, 2015

Yet machines run Go programs. What is the use case? The example is worth investigating, but I don't believe it is likely to happen in the wild.


@bradfitz
Contributor

@RLH, tests are failing regularly. The problem is not hypothetical. Very senior Go developers can't reason about memory usage and write reliable tests. Surely there is some problem here. It may not be the GC, but let's not be so quick to dismiss it as a possibility. We need to get the world not flaky.

@aclements
Member

We can argue all we want about GOGC and provisioning, but there's definitely a bug here:

gc #1095 @60.193s 9%: 0.35+0.59+0.66+7.4+0.61 ms clock, 1.4+0.59+0+0/1.3/0.16+2.4 ms cpu, 4->125->2 MB, 4 MB goal, 10 P

It was supposed to finish at 4 MB but finished at 125 MB. It shouldn't be possible to miss by that much. We're probably mis-accounting something. We spent less than a microsecond in assists (probably we spent no time in assists), which should have acted as the back-pressure against the heap going this far past the goal.

Another odd thing here is that the duration of the mark phase was 7.4 ms, but we only spent 1.46 ms of CPU time in GC during that phase. Those times are small enough that it's possible it's just scheduling noise delaying the start of the mark workers (from the OS, the runtime, or both), but that shouldn't impact assists.

@RLH
Contributor

RLH commented Jul 20, 2015

The role of the GC in this issue is real. Whether the GC can ever exceed some hard memory limit is not.


@rsc
Contributor

rsc commented Jul 21, 2015

Let's make this about the bug that Austin found and not about GC philosophy.

@aclements
Member

Here's what's going on. The loop that allocates 1,000 1MB byte slices does enter the assist and the assist knows that it needs to do a lot of work to pay for the allocation, but it finds no work to do, so it returns immediately. It keeps building up assist debt, but also keeps finding no work to do, so there's no back pressure.

The reason it's not finding any work to do appears to be related to #11694. There's no work to do because we're actually out of work. But we haven't entered mark termination yet, so the mutator is still running. This can happen simply because it takes us a while to go into mark termination when we run out of work, because we do things like rescanning globals to pick up more work to do in concurrent mark.

It's also theoretically possible for this to happen when there is work to do, but it's not available to the assist. For example, the code in #11694 creates a heap that's almost entirely a long singly-linked list. A dedicated mark worker could be working on that list, but there's no way to hand off that work, so the assist won't be able to do anything, regardless of the assist debt.

So it seems like we need to do something if we're in the assist and can't find enough work to pay down the assist debt.
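A rough sketch of the failure mode described above (illustrative Go-style pseudocode, not the actual runtime code; tryGetMarkWork and scanAndCredit are made-up helpers):

    // The assist is supposed to pay for recent allocation by doing mark work.
    func gcAssistAlloc(debt int64) {
        for debt > 0 {
            work := tryGetMarkWork() // hypothetical: grab an object/span to scan
            if work == nil {
                // The bug described above: no work is available, so the assist
                // returns immediately. The allocation proceeds anyway, the debt
                // keeps growing, and there is no back-pressure on the mutator.
                return
            }
            debt -= scanAndCredit(work) // hypothetical: scan it and earn credit
        }
    }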

@RLH
Contributor

RLH commented Jul 22, 2015

#11694 is related. #11694 has trace work to do but the GC just can't find it. The mutator can occasionally hide a "gold" pointer long enough for the GC to progress to mark termination, where it finds the gold pointer along with the 10 million untraced objects it is connected to.

The solution to both probably includes back pressure that forces the mutator to be more cooperative; for #11677 this probably means putting the mutator to sleep for some period proportional to its debt, in effect slowing allocation by this goroutine down. For #11694 it means the same, with the added twist that before it sleeps it needs to do a gcw.dispose or in some other way reveal the gold pointer to the GC.

The trick of course is to minimize the cost to more mainstream applications so we can deal with these corner-case devils gracefully.


@dvyukov
Member

dvyukov commented Jul 22, 2015

This can happen simply because it takes us a while to go into mark termination when we run out of work, because we do things like rescanning globals to pick up more work to do in concurrent mark.

I think for 1.6 we need to make all work equally available to all goroutines. That is, allow mutators to rescan stacks and globals if that's the next piece of work that GC needs to perform.

The solution to both probably includes back pressure that forces the mutator to be more cooperative; for #11677 this probably means putting the mutator to sleep for some period proportional to its debt, in effect slowing allocation by this goroutine down.

We need to put them to sleep before the allocation. That is, either you already have credit to pay for the allocation, or you are not allowed to do it.

I am also worried about the transition period when the heap profile/size changes. If the controller parameters are incorrect, can't mutators under-help and lead to overallocation?

@aclements
Member

There appears to be one other cause of this. We don't enable assists until the concurrent mark phase, which means there's some time between when we hit the GC trigger heap size and when we finish the scan phase and go into mark, during which the mutator can continue to allocate without back-pressure. With my prototype fix for the "no work" problem, I see that when we still go over the heap goal, we're over by exactly the amount that was allocated between the GC trigger and the beginning of concurrent mark.

@aclements
Member

The solution to both probably includes back pressure that forces the mutator to be more cooperative; for #11677 this probably means putting the mutator to sleep for some period proportional to its debt, in effect slowing allocation by this goroutine down.

Currently, there are three ways to satisfy an assist: 1) the mutator steals credit from background GC, 2) the mutator actually does GC work, and 3) there is no more work available. 3 is one of the causes of overshooting the heap goal (and was never intended as a way to satisfy an assist). I have a prototype solution that disallows 3 by making the assist block until it can satisfy 1 or 2 or GC terminates. This actually looks a lot like the getfull barrier, except that it also repeatedly tries to steal credit. Unfortunately it's another spin-with-sleep loop like getfull, for the same reasons getfull is. I'm writing up a Go1.6Early issue about improving the GC barrier overall, since it's really showing its inappropriateness for concurrent GC.
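A sketch of that blocking behavior (my simplification of the idea, with made-up helper names; not the actual change):

    func gcAssistAlloc(debt int64) {
        for debt > 0 && gcMarkPhaseActive() { // hypothetical: loop until paid off or the cycle ends
            if stealBackgroundCredit(&debt) { // way 1: steal credit earned by background GC
                continue
            }
            if work := tryGetMarkWork(); work != nil { // way 2: do mark work directly
                debt -= scanAndCredit(work)
                continue
            }
            // No credit and no work: instead of returning (the old way 3),
            // block briefly and retry, so the allocation cannot race ahead
            // of the heap goal while the GC is still running.
            sleepBriefly()
        }
    }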

@aclements
Member

I think for 1.6 we need to make all work equally available to all goroutines. That is, allow mutators to rescan stacks and globals if that's the next piece of work that GC needs to perform.

I think this is a good idea.

We need to put them to sleep before the allocation. That is, you already have credit to pay for the allocation, or you are not allowed to do it.

Perhaps. Currently, we only do assists if we just did a large allocation or got a new span to keep it off the common path, which means it would be tricky to do the assist before we do the allocation. But we could change this so it updates the assist debt first and if it's over some small-ish threshold, do the assist (or any number of variations on this theme).
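As a sketch of that variation (hypothetical structure, only to illustrate charging the debt before the allocation):

    func allocWithAssist(size int) []byte {
        chargeAssistDebt(size)              // hypothetical: account for the bytes up front
        if assistDebt() > assistThreshold { // the "small-ish threshold" mentioned above
            gcAssistAlloc(assistDebt())     // pay the debt down before allocating
        }
        return make([]byte, size)
    }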

I am also worried about the transition period when the heap profile/size changes. If the controller parameters are incorrect, can't mutators under-help and lead to overallocation?

No, modulo bugs like this one. We conservatively base the total assist work on the entire allocated heap size. (Note that this is different from what went out in the original GC pacing proposal, which was much less conservative and occasionally had very bad results.)

@aclements
Member

We don't enable assists until the concurrent mark phase, which means there's some time between when we hit the GC trigger heap size and when we finish the scan phase and go into mark, during which the mutator can continue to allocate without back-pressure.

My plan to fix this was to enable assists earlier. The write barrier must be on before we enable assists, but since we turn it on in the scan phase, we can in principle enable assists quite early. Unfortunately, this introduces other problems. For example, it's possible for assists during the scan phase to drain everything we have so far and signal completion even though the scan is still collecting pointers. I can work around this, but this change still has deleterious effects I don't yet understand like doubling the duration and heap size of TestTraceStress.

@gopherbot
Contributor

CL https://golang.org/cl/12672 mentions this issue.

@gopherbot
Contributor

CL https://golang.org/cl/12674 mentions this issue.

@gopherbot
Contributor

CL https://golang.org/cl/12671 mentions this issue.

@gopherbot
Contributor

CL https://golang.org/cl/12670 mentions this issue.

aclements added a commit that referenced this issue Jul 27, 2015
Currently it's possible for the GC assist to signal completion of the
mark phase, which puts the GC coordinator goroutine on the current P's
run queue, and then return to mutator code that delays until the next
forced preemption before actually yielding control to the GC
coordinator, dragging out completion of the mark phase. This delay can
be further exacerbated if the mutator makes other goroutines runnable
before yielding control, since this will push the GC coordinator on
the back of the P's run queue.

To fix this, this adds a Gosched to the assist if it completed the
mark phase. This immediately and directly yields control to the GC
coordinator. This already happens implicitly in the background mark
workers because they park immediately after completing the mark.

This is one of the reasons completion of the mark phase is being
dragged out and allowing the mutator to allocate without assisting,
leading to the large heap goal overshoot in issue #11677. This is also
a prerequisite to making the assist block when it can't pay off its
debt.

Change-Id: I586adfbecb3ca042a37966752c1dc757f5c7fc78
Reviewed-on: https://go-review.googlesource.com/12670
Reviewed-by: Russ Cox <rsc@golang.org>
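Roughly, the change described above amounts to something like this at the end of the assist (a simplified sketch of the idea, not the actual diff; markPhaseCompleted is a made-up flag for illustration):

    if markPhaseCompleted {
        // Yield immediately so the GC coordinator goroutine, which was just
        // made runnable on this P's run queue, can start mark termination
        // without waiting for the next forced preemption point.
        Gosched()
    }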
aclements added a commit that referenced this issue Jul 27, 2015
Currently, there are three ways to satisfy a GC assist: 1) the mutator
steals credit from background GC, 2) the mutator actually does GC
work, and 3) there is no more work available. 3 was never really
intended as a way to satisfy an assist, and it causes problems: there
are periods when it's expected that the GC won't have any work, such
as when transitioning from mark 1 to mark 2 and from mark 2 to mark
termination. During these periods, there's no back-pressure on rapidly
allocating mutators, which lets them race ahead of the heap goal.

For example, test/init1.go and the runtime/trace test both have small
reachable heaps and contain loops that rapidly allocate large garbage
byte slices. This bug lets these tests exceed the heap goal by several
orders of magnitude.

Fix this by forcing the assist (and hence the allocation) to block
until it can satisfy its debt via either 1 or 2, or the GC cycle
terminates.

This fixes one of the causes of #11677. It's still possible to overshoot
the GC heap goal, but with this change the overshoot is almost exactly
by the amount of allocation that happens during the concurrent scan
phase, between when the heap passes the GC trigger and when the GC
enables assists.

Change-Id: I5ef4edcb0d2e13a1e432e66e8245f2bd9f8995be
Reviewed-on: https://go-review.googlesource.com/12671
Reviewed-by: Russ Cox <rsc@golang.org>
aclements added a commit that referenced this issue Jul 27, 2015
Currently the GC coordinator enables GC assists at the same time it
enables background mark workers, after the concurrent scan phase is
done. However, this means a rapidly allocating mutator has the entire
scan phase during which to allocate beyond the heap trigger and
potentially beyond the heap goal with no back-pressure from assists.
This prevents the feedback system that's supposed to keep the heap
size under the heap goal from doing its job.

Fix this by enabling mutator assists during the scan phase. This is
safe because the write barrier is already enabled and globally
acknowledged at this point.

There's still a very small window between when the heap size reaches
the heap trigger and when the GC coordinator is able to stop the world
during which the mutator can allocate unabated. This allows *very*
rapidly allocating mutators like TestTraceStress to still occasionally
exceed the heap goal by a small amount (~20 MB at most for
TestTraceStress). However, this seems like a corner case.

Fixes #11677.

Change-Id: I0f80d949ec82341cd31ca1604a626efb7295a819
Reviewed-on: https://go-review.googlesource.com/12674
Reviewed-by: Russ Cox <rsc@golang.org>
@aclements
Member

For future reference, this will print all GC cycles that go over the goal:

awk '{split($11, H, "->"); if (H[2] > $13) print}' < gctrace
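(This assumes the gctrace output, which the runtime writes to standard error, has been saved to a file named gctrace. In each line, $11 is the "start->end->live" heap field in MB and $13 is the goal in MB, so the command prints every cycle whose final heap size exceeded its goal.)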

@gopherbot
Contributor

CL https://golang.org/cl/16297 mentions this issue.

aclements added a commit that referenced this issue Nov 4, 2015
Currently dedicated mark workers participate in the getfull barrier
during concurrent mark. However, the getfull barrier wasn't designed
for concurrent work and this causes no end of headaches.

In the concurrent setting, participants come and go. This makes mark
completion susceptible to live-lock: since dedicated workers are only
periodically polling for completion, it's possible for the program to
be in some transient worker each time one of the dedicated workers
wakes up to check if it can exit the getfull barrier. It also
complicates reasoning about the system because dedicated workers
participate directly in the getfull barrier, but transient workers
must instead use trygetfull because they have exit conditions that
aren't captured by getfull (e.g., fractional workers exit when
preempted). The complexity of implementing these exit conditions
contributed to #11677. Furthermore, the getfull barrier is inefficient
because we could be running user code instead of spinning on a P. In
effect, we're dedicating 25% of the CPU to marking even if that means
we have to spin to make that 25%. It also causes issues on Windows
because we can't actually sleep for 100µs (#8687).

Fix this by making dedicated workers no longer participate in the
getfull barrier. Instead, dedicated workers simply return to the
scheduler when they fail to get more work, regardless of what other
workers are doing, and the scheduler only starts new dedicated workers
if there's work available. Everything that needs to be handled by this
barrier is already handled by detection of mark completion.

This makes the system much more symmetric because all workers and
assists now use trygetfull during concurrent mark. It also loosens the
25% CPU target so that we can give some of that 25% back to user code
if there isn't enough work to keep the mark worker busy. And it
eliminates the problematic 100µs sleep on Windows during concurrent
mark (though not during mark termination).

The downside of this is that if we hit a bottleneck in the heap graph
that then expands back out, the system may shut down dedicated workers
and take a while to start them back up. We'll address this in the next
commit.

Updates #12041 and #8687.

No effect on the go1 benchmarks. This slows down the garbage benchmark
by 9%, but we'll more than make it up in the next commit.

name              old time/op  new time/op  delta
XBenchGarbage-12  5.80ms ± 2%  6.32ms ± 4%  +9.03%  (p=0.000 n=20+20)

Change-Id: I65100a9ba005a8b5cf97940798918672ea9dd09b
Reviewed-on: https://go-review.googlesource.com/16297
Reviewed-by: Rick Hudson <rlh@golang.org>
@golang golang locked and limited conversation to collaborators Oct 26, 2016