runtime: shrinkstack during mark termination significantly increases GC STW time #12967
CC @RLH @aclements

We could move stack shrinking to the phase where we stop one goroutine at a time.

I agree. I don't think there would be any problem with shrinking each stack right before we do the concurrent scan of that stack. We have to be careful not to use any cross-stack pointers between when we shrink our own stack and when we return from the system stack, but that may actually be easier to ensure if we do the shrink as part of stack scanning.
I've written a reproducer, included below. Its perf_events profile, collected on linux/amd64, looks very similar to what I see for my production program in terms of CPU time spent in runtime.freeStackSpans (10%), in runtime.shrinkstack (83%), in runtime.stackalloc (37%), and in runtime.stackfree (37%). The reproducer starts 10,000 goroutines which each repeatedly grow and shrink their stack by 10kB. The production program has around 200,000 goroutines and has a garbage collection every 60-120 seconds, so any goroutines that grow and shrink their stacks during that time will contribute to the mark termination pause. (Ironic that generating more garbage, which would lead to more frequent collections, could reduce the duration of each pause.) Here's output from some runs of the program on darwin/amd64.
Disabling stack shrinking makes the mark termination STW pauses significantly shorter.
I thought of one problem with shrinking stacks during concurrent stack scan. If the goroutine is blocked in a channel operation or select, there are pointers into that G's stack from runtime structures, and updating these pointers during shrinking could race with another G completing that channel operation or select. This isn't an issue for stack growing because that's always done synchronously. I can think of a few solutions:

Does your first solution mean that the stacks would be shrunk during a STW phase, or that they would not be shrunk at all? Most of the goroutines in the application I described are blocked on channel operations nearly all the time. The goroutines occasionally get a message on their channel that makes them call a function with a large stack requirement (and which returns quickly). Locking all goroutines that are attempting to communicate on a channel, as described in the second solution, would make "quit" channels more expensive if shared across many goroutines: stack shrinking on any one of the participating goroutines would delay all of the others. The third solution seems easy to reason about, and sounds like it will have predictable performance.

Per Austin, this is too invasive for Go 1.6. He may look into having it ready for the start of the 1.7 cycle.
So as I see it there are a couple of options to enable shrinking outside a STW phase.

Option 1 means an allocation per channel receive. As others have also pointed out, I don't think preventing shrinking while in receive is a viable option.
You probably considered this, but everything flows through the sudog, so we could add a lock there. Once the sudog is selected for a channel op, we can lock it. Stack shrinking already knows about the sudogs associated with the channel; it could lock all of them first. If stack shrinking sees that a sudog is already locked, it can punt and wait until next time. If the channel op sees that the sudog is locked, we know there can be at most one channel op pending for a goroutine, so the channel op can sleep on a note until the stack shrinking is complete. So I think the sudog lock can be a single atomic integer managed through CAS, plus a note. I think the overhead in the normal case is one additional atomic CAS and one atomic store per channel send.
@RLH and I were talking about another solution that sort of combines @randall77's first suggestion and @ianlancetaylor's suggestion: reserve a few words in the sudog for receiving small values (which are presumably the majority of channel ops) and otherwise allocate space on the heap for a slot. The receiver would then copy the value into its own stack when it wakes up. This would eliminate cross-stack writes, at the cost of slightly larger sudogs and a heap allocation for passing large values.
I suspect that we can also make the sudog slot solution lock-free, just CAS.

@RLH, I think the sudog solution doesn't introduce any additional locks or even atomic ops over what we have now. Existing locking already ensures the sudog isn't going anywhere, and there's no competition for writing to a slot in it since the sudog represents a single (goroutine, channel) pair.
The trouble with putting some space in the sudog is that it needs to be typed correctly for GC. So if we wanted 4 words of space, we'd need 16 sudog types to encode ptr/nonptr for each of those 4 slots. This isn't a showstopper, but it complicates things like acquireSudog.

Of course, if we went the route of different sudog types, we could make them exactly the required size.
@davecheney's email (https://groups.google.com/d/topic/golang-dev/SJ3LAW5ZXDs/discussion) got me thinking about this again; specifically @randall77's 5th option of shrinking stacks in place. The benefit is of course that this would turn a fairly expensive moving operation into a cheap metadata operation.

The concern with shrinking stacks in place is that it would increase fragmentation. However, I wonder how bad of a problem this really is and to what extent we could mitigate it. I think the worst case here is a large number of goroutines start and grow their stacks to 32K, then all shrink their stacks to 2K. If we do this all in place, each 2K stack will pin a 32K span for stacks, a 16X fragmentation ratio. However, this is nothing new; you can do the same thing by starting lots of goroutines and selectively exiting them, or even carefully allocating and releasing objects. But all of these situations naturally remedy themselves as a program continues doing real work. Fragmentation will decrease as other goroutines start (and use the free slots in the existing stack spans) or as the running goroutines grow their stacks (forcing them to move) or exit.

We could also actively mitigate fragmentation. We could move stacks if we detect large fragmentation, but I'd just as soon not. Alternatively, we could combat fragmentation generally by switching from the current size-segregated stack free lists to a buddy allocator. The constraints of stack allocation fit quite nicely with a buddy allocator. This would let one span be used for multiple stack size classes, reducing size calcification. It would efficiently merge neighboring free slots and break up large slots when necessary, while keeping allocation and freeing very efficient. It would only require 16 bits of metadata per stack span, which we could easily steal from several places. We could even do opportunistic in-place stack growth for small stacks.

This wouldn't prevent the above worst-case fragmentation, but it would remedy it faster by allowing the free space to be used for other size classes.
Well that turned out to be hairier than I expected. @rhysh, care to give https://go-review.googlesource.com/20044 a try? For me it reduces the STW time on your test program to under 2ms.

CL https://golang.org/cl/20033 mentions this issue.
CL https://golang.org/cl/20041 mentions this issue.
CL https://golang.org/cl/20034 mentions this issue.
CL https://golang.org/cl/20039 mentions this issue.
CL https://golang.org/cl/20042 mentions this issue.
CL https://golang.org/cl/20038 mentions this issue.
CL https://golang.org/cl/20040 mentions this issue.
CL https://golang.org/cl/20037 mentions this issue.
CL https://golang.org/cl/20035 mentions this issue.
CL https://golang.org/cl/20044 mentions this issue.
Thanks @aclements, your changes result in a big improvement in mark termination time for my application, bringing it down to around 20ms with ~120k goroutines and ~500MB live heap (compared with 50ms at 90k goroutines and 330MB heap on go1.6). Some of the extra pause time may be attributable to #14419 or to #14406, which I did not mitigate while testing this change. I've been testing with CL 20044 PS 2. go1.6 with framepointer, ~90k goroutines:
CL 20044 PS 2, devel +f9e73e9 Mon Feb 29 11:13:35 2016 -0500 with framepointer, ~120k goroutines:
Currently copystack adjusts pointers in the old stack and then copies the adjusted stack to the new stack. In addition to being generally confusing, this is going to make concurrent stack shrinking harder. Switch this around so that we first copy the stack and then adjust pointers on the new stack (never writing to the old stack). This reprises CL 15996, but takes a different and simpler approach. CL 15996 still walked the old stack while adjusting pointers on the new stack. In this CL, we adjust auxiliary structures before walking the stack, so we can just walk the new stack. For golang#12967. Change-Id: I94fa86f823ba9ee478e73b2ba509eed3361c43df Reviewed-on: https://go-review.googlesource.com/20033 Reviewed-by: Rick Hudson <rlh@golang.org>
In particular, write down the rules for stack ownership because the details of this are about to get very important with concurrent stack shrinking. (Interestingly, the details don't actually change, but anything that's currently skating on thin ice is likely to fall through.) For #12967. Change-Id: I561e2610e864295e9faba07717a934aabefcaab9 Reviewed-on: https://go-review.googlesource.com/20034 Reviewed-by: Rick Hudson <rlh@golang.org>
Given a G, there's currently no way to find the channel it's blocking on. We'll need this information to fix a (probably theoretical) bug in select and to implement concurrent stack shrinking, so record the channel in the sudog. For #12967. Change-Id: If8fb63a140f1d07175818824d08c0ebeec2bdf66 Reviewed-on: https://go-review.googlesource.com/20035 Reviewed-by: Rick Hudson <rlh@golang.org> Run-TryBot: Austin Clements <austin@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org>
Currently the g.waiting list created by a select is in poll order. However, nothing depends on this, and we're going to need access to the channel lock order in other places shortly, so modify select to put the waiting list in channel lock order. For #12967. Change-Id: If0d38816216ecbb37a36624d9b25dd96e0a775ec Reviewed-on: https://go-review.googlesource.com/20037 Reviewed-by: Rick Hudson <rlh@golang.org> Reviewed-by: Keith Randall <khr@golang.org> Run-TryBot: Austin Clements <austin@google.com>
gopark calls the unlock function after setting the G to _Gwaiting. This means it's generally unsafe to access the G's stack from the unlock function because the G may start running on another P. Once we start shrinking stacks concurrently, a stack shrink could also move the stack the moment after it enters _Gwaiting and before the unlock function is called. Document this restriction and fix the two places where we currently violate it.

This is unlikely to be a problem in practice for these two places right now, but they're already skating on thin ice. For example, the following sequence could in principle cause corruption, deadlock, or a panic in the select code:

On M1/P1:
1. G1 selects on channels A and B.
2. selectgoImpl calls gopark.
3. gopark puts G1 in _Gwaiting.
4. gopark calls selparkcommit.
5. selparkcommit releases the lock on channel A.

On M2/P2:
6. G2 sends to channel A.
7. The send puts G1 in _Grunnable and puts it on P2's run queue.
8. The scheduler runs, selects G1, puts it in _Grunning, and resumes G1.
9. On G1, the sellock immediately following the gopark gets called.
10. sellock grows and moves the stack.

On M1/P1:
11. selparkcommit continues to scan the lock order for the next channel to unlock, but it's now reading from a freed (and possibly reused) stack.

This shouldn't happen in practice because step 10 isn't the first call to sellock, so the stack should already be big enough. However, once we start shrinking stacks concurrently, this reasoning won't work any more.

For #12967. Change-Id: I3660c5be37e5be9f87433cb8141bdfdf37fadc4c Reviewed-on: https://go-review.googlesource.com/20038 Reviewed-by: Rick Hudson <rlh@golang.org> Run-TryBot: Austin Clements <austin@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org>
With concurrent stack shrinking, the stack can move the instant after a G enters _Gwaiting. There are only two places that put a G into _Gwaiting: gopark and newstack. We fixed uses of gopark. This commit fixes newstack by simplifying its G transitions and, in particular, eliminating or narrowing the transient _Gwaiting states it passes through so it's clear nothing in the G is accessed while in _Gwaiting. For #12967. Change-Id: I2440ead411d2bc61beb1e2ab020ebe3cb3481af9 Reviewed-on: https://go-review.googlesource.com/20039 Reviewed-by: Rick Hudson <rlh@golang.org> Run-TryBot: Austin Clements <austin@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org>
Currently sudog.elem is never accessed concurrently, so in several cases we drop the channel lock just before reading/writing the sent/received value from/to sudog.elem. However, concurrent stack shrinking is going to have to adjust sudog.elem to point to the new stack, which means it needs a way to synchronize with accesses to sudog.elem. Hence, add sudog.elem to the fields protected by hchan.lock and scoot the unlocks down past the uses of sudog.elem. While we're here, better document the channel synchronization rules. For #12967. Change-Id: I3ad0ca71f0a74b0716c261aef21b2f7f13f74917 Reviewed-on: https://go-review.googlesource.com/20040 Reviewed-by: Keith Randall <khr@golang.org> Run-TryBot: Austin Clements <austin@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org>
Currently, locking a G's stack by setting its status to _Gcopystack or _Gscan is unordered with respect to channel locks. However, when we make stack shrinking concurrent, stack shrinking will need to lock the G and then acquire channel locks, which imposes an order on these. Document this lock ordering and fix closechan to respect it. Everything else already happens to respect it. For #12967. Change-Id: I4dd02675efffb3e7daa5285cf75bf24f987d90d4 Reviewed-on: https://go-review.googlesource.com/20041 Reviewed-by: Rick Hudson <rlh@golang.org> Run-TryBot: Austin Clements <austin@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org>
Currently shrinkstack is only safe during STW because it adjusts channel-related stack pointers and moves send/receive stack slots without synchronizing with the channel code. Make it safe to use when the world isn't stopped by:

1) Locking all channels the G is blocked on while adjusting the sudogs and copying the area of the stack that may contain send/receive slots.

2) For any stack frames that may contain send/receive slots, using an atomic CAS to adjust pointers to prevent races between adjusting a pointer in a receive slot and a concurrent send writing to that receive slot.

In principle, the synchronization could be finer-grained. For example, we considered synchronizing around the sudogs, which would allow channel operations involving other Gs to continue if the G being shrunk was far enough down the send/receive queue. However, using the channel lock means no additional locks are necessary in the channel code. Furthermore, the stack shrinking code holds the channel lock for a very short time (much less than the time required to shrink the stack).

This does not yet make stack shrinking concurrent; it merely makes doing so safe. This has negligible effect on the go1 and garbage benchmarks.

For #12967. Change-Id: Ia49df3a8a7be4b36e365aac4155a2416b94b988c Reviewed-on: https://go-review.googlesource.com/20042 Reviewed-by: Keith Randall <khr@golang.org> Run-TryBot: Austin Clements <austin@google.com>
I have a process that experiences significant STW times during the mark termination phase of garbage collection. When it has around 100,000 goroutines and a live heap of around 500MB, its STW pauses take tens of milliseconds. The process runs on linux/amd64 with GOMAXPROCS=18, with a recent version of Go tip (30b9663).
I profiled its CPU usage with perf_events, looking specifically at samples with runtime.gcMark on the stack. 7.5% of samples include runtime.gcMark -> runtime.freeStackSpans (non-parallelized work, discussed in #11465). 90% of samples include runtime.gcMark -> runtime.parfordo -> runtime.markroot, and 75% of samples include runtime.gcMark -> runtime.parfordo -> runtime.markroot -> runtime.shrinkstack.
The work for stack shrinking is split 18 ways in my case, but at 75% of CPU cycles it's still a huge contributor to mark termination pauses, and the pauses my program sees are well beyond the 10ms goal with only a modest heap size.
Does stack shrinking need to take place while the world is stopped?
@aclements