Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: replace GC coordinator with state machine #11970

Closed
aclements opened this issue Jul 31, 2015 · 8 comments
Closed

runtime: replace GC coordinator with state machine #11970

aclements opened this issue Jul 31, 2015 · 8 comments
Milestone

Comments

@aclements
Copy link
Member

@aclements aclements commented Jul 31, 2015

As of 1.5, all GC phase changes are managed through straight-line code in func gc, which runs on a dedicated GC goroutine. However, much of the real work is done in other goroutines, and the coordination between these goroutines and the main GC goroutine delays phase changes and opens windows where, for example, the mutator can allocate uncontrolled, or nothing can be accomplished because everything is waiting on the coordinator to wake up. This has led to bugs like #11677 and #11911. We've tried to mitigate this by handing control directly to the coordinator goroutine when we wake it up, but the scheduler isn't designed for this sort of explicit co-routine scheduling, so this doesn't always work and it's more likely to fall apart under stress.

For 1.6, we should consider decentralizing the role of the coordinator so that the goroutine that notices it's time to make a transition takes care of performing the transition.

This could take the form of a state machine where whichever goroutine detects the termination condition of the current state is responsible for moving the system to the next state. Because these transitions could happen on user goroutines, we'll want to make sure that these state transitions are all short (or STW), which is currently not true of everything the coordinator does. The states would roughly correspond to the GC phases, but would further subdivide some of them (especially the concurrent ones).

I think we should further design this state machine such that if more than one goroutine detects the termination condition for the current state (including while another goroutine is working on transitioning out of it) it should block until the goroutine performing the transition finishes the transition (or, if possible, help with the transition), rather than simply continuing execution. We have various problems right now because we don't do this: for example, the first G to detect that the system is over the heap trigger will start the transition from "GC off" to sweep termination, but before we're actually in sweep termination, other Gs will also detect this condition as well but simply continue allocating, allowing them to over-allocate [1]. If they instead blocked or helped until this transition was complete, this wouldn't happen.

[1] See issue #11911 and all of its linked CLs.

@RLH @rsc

@gopherbot

This comment has been minimized.

Copy link

@gopherbot gopherbot commented Jul 31, 2015

CL https://golang.org/cl/13026 mentions this issue.

aclements added a commit that referenced this issue Aug 4, 2015
Currently there are two sensitive periods during which a mutator can
allocate past the heap goal but mutator assists can't be enabled: 1)
at the beginning of GC between when the heap first passes the heap
trigger and sweep termination and 2) at the end of GC between mark
termination and when the background GC goroutine parks. During these
periods there's no back-pressure or safety net, so a rapidly
allocating mutator can allocate past the heap goal. This is
exacerbated if there are many goroutines because the GC coordinator is
scheduled as any other goroutine, so if it gets preempted during one
of these periods, it may stay preempted for a long period (10s or 100s
of milliseconds).

Normally the mutator does scan work to create back-pressure against
allocation, but there is no scan work during these periods. Hence, as
a fall back, if a mutator would assist but can't yet, simply yield the
CPU. This delays the mutator somewhat, but more importantly gives more
CPU time to the GC coordinator for it to complete the transition.

This is obviously a workaround. Issue #11970 suggests a far better but
far more invasive way to fix this.

Updates #11911. (This very nearly fixes the issue, but about once
every 15 minutes I get a GC cycle where the assists are enabled but
don't do enough work.)

Change-Id: I9768b79e3778abd3e06d306596c3bd77f65bf3f1
Reviewed-on: https://go-review.googlesource.com/13026
Reviewed-by: Russ Cox <rsc@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
@gopherbot

This comment has been minimized.

Copy link

@gopherbot gopherbot commented Oct 26, 2015

CL https://golang.org/cl/16293 mentions this issue.

@gopherbot

This comment has been minimized.

Copy link

@gopherbot gopherbot commented Oct 27, 2015

CL https://golang.org/cl/16355 mentions this issue.

@gopherbot

This comment has been minimized.

Copy link

@gopherbot gopherbot commented Oct 27, 2015

CL https://golang.org/cl/16356 mentions this issue.

@gopherbot

This comment has been minimized.

Copy link

@gopherbot gopherbot commented Oct 27, 2015

CL https://golang.org/cl/16391 mentions this issue.

@gopherbot

This comment has been minimized.

Copy link

@gopherbot gopherbot commented Oct 27, 2015

CL https://golang.org/cl/16357 mentions this issue.

@gopherbot

This comment has been minimized.

Copy link

@gopherbot gopherbot commented Oct 27, 2015

CL https://golang.org/cl/16392 mentions this issue.

@gopherbot

This comment has been minimized.

Copy link

@gopherbot gopherbot commented Oct 27, 2015

CL https://golang.org/cl/16393 mentions this issue.

gopherbot pushed a commit to golang/proposal that referenced this issue Oct 29, 2015
Updates golang/go#11970.

Change-Id: Ief40300c9c0da583d940823a2290fbe5d5ad02bf
Reviewed-on: https://go-review.googlesource.com/16293
Reviewed-by: Russ Cox <rsc@golang.org>
aclements added a commit that referenced this issue Nov 5, 2015
This begins the conversion of the centralized GC coordinator to a
decentralized state machine by introducing the internal API that
triggers the first state transition from _GCoff to _GCmark (or
_GCmarktermination).

This change introduces the transition lock, the off->mark transition
condition (which is very similar to shouldtriggergc()), and the
general structure of a state transition. Since we're doing this
conversion in stages, it then falls back to the GC coordinator to
actually execute the cycle. We'll start moving logic out of the GC
coordinator and in to transition functions next.

This fixes a minor bug in gcstoptheworld debug mode where passing the
heap trigger once could trigger multiple STW GCs.

Updates #11970.

Change-Id: I964087dd190a639eb5766398f8e1bbf8b352902f
Reviewed-on: https://go-review.googlesource.com/16355
Reviewed-by: Rick Hudson <rlh@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
aclements added a commit that referenced this issue Nov 5, 2015
This moves concurrent sweep termination from the coordinator to the
off->mark transition. This allows it to be performed by all Gs
attempting to start the GC.

Updates #11970.

Change-Id: I24428e8599a759398c2ef7ec996ba755a448f947
Reviewed-on: https://go-review.googlesource.com/16356
Reviewed-by: Rick Hudson <rlh@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
aclements added a commit that referenced this issue Nov 5, 2015
This moves all of GC initialization, sweep termination, and the
transition to concurrent marking in to the off->mark transition
function. This means it's now handled on the goroutine that detected
the state exit condition.

As a result, malloc no longer needs to Gosched() at the beginning of
the GC cycle to prevent over-allocation while the GC is starting up
because it will now *help* the GC to start up. The Gosched hack is
still necessary during GC shutdown (this is easy to test by enabling
gctrace and hitting Ctrl-S to block the gctrace output).

At this point, the GC coordinator still handles later phases. This
requires a small tweak to how we start the GC coordinator. Currently,
starting the GC coordinator is best-effort and may fail if the
coordinator is about to park from the previous cycle but hasn't yet.
We fix this by replacing the park/ready to wake up the coordinator
with a semaphore. This is temporary since the coordinator will be
going away in a few commits.

Updates #11970.

Change-Id: I2c6a11c91e72dfbc59c2d8e7c66146dee9a444fe
Reviewed-on: https://go-review.googlesource.com/16357
Reviewed-by: Rick Hudson <rlh@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
@aclements aclements closed this in c99d7f7 Nov 5, 2015
aclements added a commit that referenced this issue Nov 5, 2015
Currently mallocgc detects if the GC is in a state where it can't
assist, but also can't allocate uncontrolled and yields to help out
the GC. This was a workaround for periods when we were trying to
schedule the GC coordinator. It is no longer necessary because there
is no GC coordinator and malloc can always assist with any GC
transitions that are necessary.

Updates #11970.

Change-Id: I4f7beb7013e85e50ae99a3a8b0bb708ba49cbcd4
Reviewed-on: https://go-review.googlesource.com/16392
Reviewed-by: Rick Hudson <rlh@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
aclements added a commit that referenced this issue Nov 5, 2015
These are now unused.

Updates #11970.

Change-Id: I43e5c4e5bcda9581bacc63364f96bb4855ab779f
Reviewed-on: https://go-review.googlesource.com/16393
Reviewed-by: Rick Hudson <rlh@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
@golang golang locked and limited conversation to collaborators Nov 4, 2016
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
2 participants
You can’t perform that action at this time.