Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
design: proposal for concurrent stack re-scanning
Updates golang/go#17505. Change-Id: I353edb79d23b1ef1a6bab57485cfd74089b67fa0 Reviewed-on: https://go-review.googlesource.com/31360 Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
- Loading branch information
Showing
1 changed file
with
323 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,323 @@ | ||
# Proposal: Concurrent stack re-scanning | ||
|
||
Author(s): Austin Clements, Rick Hudson | ||
|
||
Last updated: 2016-10-18 | ||
|
||
Discussion at https://golang.org/issue/17505. | ||
|
||
**Note:** We are not actually proposing this. | ||
This design was developed before proposal #17503, which is a | ||
dramatically simpler solution to the problem of stack re-scanning. | ||
We're posting this design doc for its historical value. | ||
|
||
|
||
## Abstract | ||
|
||
Since the release of the concurrent garbage collector in Go 1.5, each | ||
subsequent release has further reduced stop-the-world (STW) time by | ||
moving more tasks to the concurrent phase. | ||
As of Go 1.7, the only non-trivial STW task is stack re-scanning. | ||
We propose to make stack re-scanning concurrent for Go 1.8, likely | ||
resulting in sub-millisecond worst-case STW times. | ||
|
||
|
||
## Background | ||
|
||
Go's concurrent garbage collector consists of four phases: mark, mark | ||
termination, sweep, and sweep termination. | ||
The mark and sweep phases are *concurrent*, meaning that the | ||
application (the *mutator*) continues to run during these phases, | ||
while the mark termination and sweep termination phases are | ||
*stop-the-world* (STW), meaning that the garbage collector pauses the | ||
mutator for the duration of the phase. | ||
|
||
Since Go 1.5, we've been steadily moving tasks from the STW phases to | ||
the concurrent phases, with a particular focus on tasks that take time | ||
proportional to something under application control, such as heap size | ||
or number of goroutines. | ||
As a result, in Go 1.7, most applications have sub-millisecond STW | ||
times. | ||
|
||
As of Go 1.7, the only remaining application-controllable STW task is | ||
*stack re-scanning*. | ||
Because of this one task, applications with large numbers of active | ||
goroutines can still experience STW times in excess of 10ms. | ||
|
||
Stack re-scanning is necessary because stacks are *permagray* in the | ||
Go garbage collector. | ||
Specifically, for performance reasons, there are no write barriers for | ||
writes to pointers in the current stack frame. | ||
As a result, even though the garbage collector scans all stacks at the | ||
beginning of the mark phase, it must re-scan all modified stacks with | ||
the world is stopped to catch any pointers the mutator "hid" on the | ||
stack. | ||
|
||
Unfortunately, this makes STW time proportional to the total amount of | ||
stack that needs to be rescanned. | ||
Worse, stack scanning is relatively expensive (~5ms/MB). | ||
Hence, applications with a large number of active goroutines can | ||
quickly drive up STW time. | ||
|
||
|
||
## Proposal | ||
|
||
We propose to make stack re-scanning concurrent using a *transitive | ||
mark* write barrier. | ||
|
||
In this design, we add a new concurrent phase between mark and mark | ||
termination called *stack re-scan*. | ||
This phase starts as soon as the mark phase has marked all objects | ||
reachable from roots *other than stacks*. | ||
The phase re-scans stacks that have been modified since their initial | ||
scan, and enables a special *transitive mark* write barrier. | ||
|
||
Re-scanning and the write barrier ensure the following invariant | ||
during this phase: | ||
|
||
> *After a goroutine stack G has been re-scanned, all objects locally | ||
> reachable to G are black.* | ||
This depends on a goroutine-local notion of reachability, which is the | ||
set of objects reachable from globals or a given goroutine's stack or | ||
registers. | ||
Unlike regular global reachability, this is not stable: as goroutines | ||
modify heap pointers or communicate, an object that was locally | ||
unreachable to a given goroutine may become locally reachable. | ||
However, the concepts are closely related: a globally reachable object | ||
must be locally reachable by at least one goroutine, and, conversely, | ||
an object that is not locally reachable by any goroutine is not | ||
globally reachable. | ||
|
||
This invariant ensures that re-scanning a stack *blackens* that stack, | ||
and that the stack remains black since the goroutine has no way to | ||
find a white object once its stack has been re-scanned. | ||
|
||
Furthermore, once every goroutine stack has been re-scanned, marking | ||
is complete. | ||
Every globally reachable object must be locally reachable by some | ||
goroutine and, once every stack has been re-scanned, every object | ||
locally reachable by some goroutine is black, so it follows that every | ||
globally reachable object is black once every stack has been | ||
re-scanned. | ||
|
||
### Transitive mark write barrier | ||
|
||
The transitive mark write barrier for an assignment `*dst = src` | ||
(where `src` is a pointer) ensures that all objects reachable from | ||
`src` are black *before* writing `src` to `*dst`. | ||
Writing `src` to `*dst` may make any object reachable from `src` | ||
(including `src` itself) locally reachable to some goroutine that has | ||
been re-scanned. | ||
Hence, to maintain the invariant, we must ensure these objects are all | ||
black. | ||
|
||
To do this, the write barrier greys `src` and then drains the mark | ||
work queue until there are no grey objects (using the same work queue | ||
logic that drives the mark phase). | ||
At this point, it writes `src` to `*dst` and allows the goroutine to | ||
proceed. | ||
|
||
The write barrier must not perform the write until all simultaneous | ||
write barriers are also ready to perform the write. | ||
We refer to this *mark quiescence*. | ||
To see why this is necessary, consider two simultaneous write barriers | ||
for `*D1 = S1` and `*D2 = S2` on an object graph that looks like this: | ||
|
||
G1 [b] → D1 [b] S1 [w] | ||
↘ | ||
O1 [w] → O2 [w] → O3 [w] | ||
↗ | ||
D2 [b] S2 [w] | ||
|
||
Goroutine *G1* has been re-scanned (so *D1* must be black), while *Sn* | ||
and *On* are all white. | ||
|
||
Suppose the *S2* write barrier blackens *S2* and *O1* and greys *O2*, | ||
then the *S1* write barrier blackens *S1* and observes that *O1* is | ||
already black: | ||
|
||
G1 [b] → D1 [b] S1 [b] | ||
↘ | ||
O1 [b] → O2 [g] → O3 [w] | ||
↗ | ||
D2 [b] S2 [b] | ||
|
||
At this point, the *S1* barrier has run out of local work, but the | ||
*S2* barrier is still going. | ||
If *S1* were to complete and write `*D1 = S1` at this point, it would | ||
make white object *O3* reachable to goroutine *G1*, violating the | ||
invariant. | ||
Hence, the *S1* barrier cannot complete until the *S2* barrier is also | ||
ready to complete. | ||
|
||
This requirement sounds onerous, but it can be achieved in a simple | ||
and reasonably efficient manner by sharing a global mark work queue | ||
between the write barriers. | ||
This reuses the existing mark work queue and quiescence logic and | ||
allows write barriers to help each other to completion. | ||
|
||
### Stack re-scanning | ||
|
||
The stack re-scan phase re-scans the stacks of all goroutines that | ||
have run since the initial stack scan to find pointers to white | ||
objects. | ||
The process of re-scanning a stack is identical to that of the initial | ||
scan, except that it must participate in mark quiescence. | ||
Specifically, the re-scanned goroutine must not resume execution until | ||
the system has reached mark quiescence (even if no white pointers are | ||
found on the stack). | ||
Otherwise, the same sorts of races that were described above are | ||
possible. | ||
|
||
There are multiple ways to realize this. | ||
The whole stack scan could participate in mark quiescence, but this | ||
would block any contemporaneous stack scans or write barriers from | ||
completing during a stack scan if any white pointers were found. | ||
Alternatively, each white pointer found on the stack could participate | ||
individually in mark quiescence, blocking the stack scan at that | ||
pointer until mark quiescence, and the stack scan could again | ||
participate in mark quiescence once all frames had been scanned. | ||
|
||
We propose an intermediate: gather small batches of white pointers | ||
from a stack at a time and reach mark quiescence on each batch | ||
individually, as well as at the end of the stack scan (even if the | ||
final batch is empty). | ||
|
||
### Other considerations | ||
|
||
Goroutines that start during stack re-scanning cannot reach any white | ||
objects, so their stacks are immediately considered black. | ||
|
||
Goroutines can also share pointers through channels, which are often | ||
implemented as direct stack-to-stack copies. | ||
Hence, channel receives also require write barriers in order to | ||
maintain the invariant. | ||
Channel receives already have write barriers to maintain stack | ||
barriers, so there is no additional work here. | ||
|
||
|
||
## Rationale | ||
|
||
The primary drawback of this approach to concurrent stack re-scanning | ||
is that a write barrier during re-scanning could introduce significant | ||
mutator latency if the transitive mark finds a large unmarked region | ||
of the heap, or if overlapping write barriers significantly delay mark | ||
quiescence. | ||
However, we consider this situation unlikely in non-adversarial | ||
applications. | ||
Furthermore, the resulting delay should be no worse than the mark | ||
termination STW time applications currently experience, since mark | ||
termination has to do exactly the same amount of marking work, in | ||
addition to the cost of stack scanning. | ||
|
||
### Alternative approaches | ||
|
||
An alternative solution to concurrent stack re-scanning would be to | ||
adopt DMOS-style quiescence [Hudson '97]. | ||
In this approach, greying any object during stack re-scanning (either | ||
by finding a pointer to a white object on a stack or by installing a | ||
pointer to a white object in the heap) forces the GC to drain this | ||
marking work and *restart* the stack re-scanning phase. | ||
|
||
This approach has a much simpler write barrier implementation that is | ||
constant time, so the write barrier would not induce significant | ||
mutator latency. | ||
However, unlike the proposed approach, the amount of work performed by | ||
DMOS-style stack re-scanning is potentially unbounded. | ||
This interacts poorly with Go's GC pacer. | ||
The pacer enforces the goal heap size making allocating and GC work | ||
proportional, but this requires an upper bound on possible GC work. | ||
As a result, if the pacer underestimates the amount of re-scanning | ||
work, it may need to block allocation entirely to avoid exceeding the | ||
goal heap size. | ||
This would be an effective STW. | ||
|
||
There is also a hybrid solution: we could use the proposed transitive | ||
marking write barrier, but bound the amount of work it can do (and | ||
hence the latency it can induce). | ||
If the write barrier exceeds this bound, it performs a DMOS-style | ||
restart. | ||
This is likely to get the best of both worlds, but also inherits the | ||
sum of their complexity. | ||
|
||
A final alternative would be to eliminate concurrent stack re-scanning | ||
entirely by adopting a *deletion-style* write barrier [Yuasa '90]. | ||
This style of write barrier allows the initial stack scan to *blacken* | ||
the stack, rather than merely greying it (still without the need for | ||
stack write barriers). | ||
For full details, see proposal #17503. | ||
|
||
|
||
## Compatibility | ||
|
||
This proposal does not affect the language or any APIs and hence | ||
satisfies the Go 1 compatibility guidelines. | ||
|
||
|
||
## Implementation | ||
|
||
We do not plan to implement this proposal. | ||
Instead, we plan to implement proposal #17503. | ||
|
||
The implementation steps are as follows: | ||
|
||
1. While not strictly necessary, first make GC assists participate in | ||
stack scanning. | ||
Currently this is not possible, which increases mutator latency at | ||
the beginning of the GC cycle. | ||
This proposal would compound this effect by also blocking GC | ||
assists at the end of the GC cycle, causing an effective STW. | ||
|
||
2. Modify the write barrier to be pre-publication instead of | ||
post-publication. | ||
Currently the write barrier occurs after the write of a pointer, | ||
but this proposal requires that the write barrier complete | ||
transitive marking *before* writing the pointer to its destination. | ||
A pre-publication barrier is also necessary for | ||
[ROC](https://golang.org/s/gctoc). | ||
|
||
3. Make the mark completion condition precise. | ||
Currently it's possible (albeit unlikely) to enter mark termination | ||
before all heap pointers have been marked. | ||
This proposal requires that we not start stack re-scanning until | ||
all objects reachable from globals are marked, which requires a | ||
precise completion condition. | ||
|
||
4. Implement the transitive mark write barrier. | ||
This can reuse the existing work buffer pool lists and logic, | ||
including the global quiescence barrier in getfull. | ||
It may be necessary to improve the performance characteristics of | ||
the getfull barrier, since this proposal will lean far more heavily | ||
on this barrier than we currently do. | ||
|
||
5. Check stack re-scanning code and make sure it is safe during | ||
non-STW. | ||
Since this only runs during STW right now, it may omit | ||
synchronization that will be necessary when running during non-STW. | ||
This is likely to be minimal, since most of the code is shared with | ||
the initial stack scan, which does run concurrently. | ||
|
||
6. Make stack re-scanning participate in write barrier quiescence. | ||
|
||
7. Create a new stack re-scanning phase. | ||
Make mark 2 completion transition to stack re-scanning instead of | ||
mark termination and enqueue stack re-scanning root jobs. | ||
Once all stack re-scanning jobs are complete, transition to mark | ||
termination. | ||
|
||
|
||
## Acknowledgments | ||
|
||
We would like to thank Rhys Hiltner (@rhysh) for suggesting the idea | ||
of a transitive mark write barrier. | ||
|
||
|
||
## References | ||
|
||
[Hudson '97] R. L. Hudson, R. Morrison, J. E. B. Moss, and D. S. | ||
Munro. Garbage collecting the world: One car at a time. In *ACM | ||
SIGPLAN Notices* 32(10):162–175, October 1997. | ||
|
||
[Yuasa '90] T. Yuasa. Real-time garbage collection on general-purpose | ||
machines. *Journal of Systems and Software*, 11(3):181–198, 1990. |