runtime: eliminate stack rescanning #17503
One of the largest remaining contributors to GC STW time is stack rescanning. I have an approach for eliminating this entirely. This is a tracking bug for implementing this approach.
I will upload a design document and proof soon, and I have a working implementation that I plan to have cleaned up and mailed out in a day or two.
I'm marking this Go 1.9. My current plan is to get the change in for Go 1.8, but have a GODEBUG flag to fall back to the current algorithm for debugging purposes (and in case something goes wrong). Assuming things go smoothly, we'll actually rip out the stack rescanning code when Go 1.9 opens.
Edit: Design doc
Edit: Things to follow up on in Go 1.9+:
The text was updated successfully, but these errors were encountered:
We used the "double barrier" to address stack scanning issues in the IBM J9 implementation of Metronome. It's in section 4.3 of "Design and implementation of a comprehensive real-time java virtual machine" by Auerbach et al, section 4.3. In that case we were incrementalizing over many thread's stacks, as opposed to the individual stack, but it solves the same problem.
It worked well, although the extra barrier overhead was annoying. Let me know if you have any questions.
@rasky, it's about a 1.7% performance hit on the x/benchmarks garbage benchmark (which, as the name suggests, is designed to hammer the garbage collector). I haven't checked, but I suspect we've gained more than that from other optimizations since Go 1.7.
They are harder to eliminate at compile time. I completely disabled the optimizations that don't carry over directly to the hybrid barrier and binaries got about 1% larger. We could eliminate some of these, but it requires flow analysis and the current insertion code doesn't do any flow analysis. OTOH, the places where we can eliminate write barriers with the current write barrier aren't all that common anyway (how often do you write the address of a global to something?), so we're not losing much.
morestack writes the context pointer to gobuf.ctxt, but since morestack is written in assembly (and has to be very careful with state), it does *not* invoke the requisite write barrier for this write. Instead, we patch this up later, in newstack, where we invoke an explicit write barrier for ctxt. This already requires some subtle reasoning, and it's going to get a lot hairier with the hybrid barrier. Fix this by simplifying the whole mechanism. Instead of writing gobuf.ctxt in morestack, just pass the value of the context register to newstack and let it write it to gobuf.ctxt. This is a normal Go pointer write, so it gets the normal Go write barrier. No subtle reasoning required. Updates #17503. Change-Id: Ia6bf8459bfefc6828f53682ade32c02412e4db63 Reviewed-on: https://go-review.googlesource.com/31550 Run-TryBot: Austin Clements <firstname.lastname@example.org> TryBot-Result: Gobot Gobot <email@example.com> Reviewed-by: Cherry Zhang <firstname.lastname@example.org>
Since we're no longer stealing space for the stack barrier array from the stack allocation, the stack allocation is simply g.stack.hi-g.stack.lo. Updates #17503. Change-Id: Id9b450ae12c3df9ec59cfc4365481a0a16b7c601 Reviewed-on: https://go-review.googlesource.com/36621 Run-TryBot: Austin Clements <email@example.com> TryBot-Result: Gobot Gobot <firstname.lastname@example.org> Reviewed-by: Rick Hudson <email@example.com>
The mark 2 phase was originally introduced as a way to reduce the chance of entering STW mark termination while there was still marking work to do. It works by flushing and disabling all local work caches so that all enqueued work becomes immediately globally visible. However, mark 2 is not only slow–disabling caches makes marking and the write barrier both much more expensive–but also imperfect. There is still a rare but possible race (~once per all.bash) that can cause GC to enter mark termination while there is still marking work. This race is detailed at https://github.com/golang/proposal/blob/master/design/17503-eliminate-rescan.md#appendix-mark-completion-race The effect of this is that mark termination must still cope with the possibility that there may be work remaining after a concurrent mark phase. Dealing with this increases STW pause time and increases the complexity of mark termination. Furthermore, a similar but far more likely race can cause early transition from mark 1 to mark 2. This is unfortunate because it causes performance instability because of the cost of mark 2. This CL fixes this by replacing mark 2 with a distributed termination detection algorithm. This algorithm is correct, so it eliminates the mark termination race, and doesn't require disabling local caches. It ensures that there are no grey objects upon entering mark termination. With this change, we're one step closer to eliminating marking from mark termination entirely (it's still used by STW GC and checkmarks mode). This CL does not eliminate the gcBlackenPromptly global flag, though it is always set to false now. It will be removed in a cleanup CL. This led to only minor variations in the go1 benchmarks (https://perf.golang.org/search?q=upload:20180909.1) and compilebench benchmarks (https://perf.golang.org/search?q=upload:20180910.2). This significantly improves performance of the garbage benchmark, with no impact on STW times: name old time/op new time/op delta Garbage/benchmem-MB=64-12 2.21ms ± 1% 2.05ms ± 1% -7.38% (p=0.000 n=18+19) Garbage/benchmem-MB=1024-12 2.30ms ±16% 2.20ms ± 7% -4.51% (p=0.001 n=20+20) name old STW-ns/GC new STW-ns/GC delta Garbage/benchmem-MB=64-12 138k ±44% 141k ±23% ~ (p=0.309 n=19+20) Garbage/benchmem-MB=1024-12 159k ±25% 178k ±98% ~ (p=0.798 n=16+18) name old STW-ns/op new STW-ns/op delta Garbage/benchmem-MB=64-12 4.42k ±44% 4.24k ±23% ~ (p=0.531 n=19+20) Garbage/benchmem-MB=1024-12 591 ±24% 636 ±111% ~ (p=0.309 n=16+18) (https://perf.golang.org/search?q=upload:20180910.1) Updates #26903. Updates #17503. Change-Id: Icbd1e12b7a12a76f423c9bf033b13cb363e4cd19 Reviewed-on: https://go-review.googlesource.com/c/134318 Run-TryBot: Austin Clements <firstname.lastname@example.org> TryBot-Result: Gobot Gobot <email@example.com> Reviewed-by: Rick Hudson <firstname.lastname@example.org>
Currently, setting GODEBUG=gcrescanstacks=1 enables a debugging mode where the garbage collector re-scans goroutine stacks during mark termination. This was introduced in Go 1.8 to debug the hybrid write barrier, but I don't think we ever used it. Now it's one of the last sources of mark work during mark termination. This CL removes it. Updates #26903. This is preparation for unifying STW GC and concurrent GC. Updates #17503. Change-Id: I6ae04d3738aa9c448e6e206e21857a33ecd12acf Reviewed-on: https://go-review.googlesource.com/c/134777 Run-TryBot: Austin Clements <email@example.com> TryBot-Result: Gobot Gobot <firstname.lastname@example.org> Reviewed-by: Rick Hudson <email@example.com>
Now that we do no mark work during mark termination, we no longer need the gchelper mechanism. Updates #26903. Updates #17503. Change-Id: Ie94e5c0f918cfa047e88cae1028fece106955c1b Reviewed-on: https://go-review.googlesource.com/c/134785 Run-TryBot: Austin Clements <firstname.lastname@example.org> TryBot-Result: Gobot Gobot <email@example.com> Reviewed-by: Rick Hudson <firstname.lastname@example.org>