runtime: optimize write barrier #22460
This issue is to track work on optimizing the write barrier.
Our current write barrier is fairly inefficient simply because not much effort has been put into optimizing it. For most applications, the write barrier has little overhead simply because it's rarely enabled. However, for some applications (e.g., the compiler) it consumes a few percent of the CPU. Furthermore, to fix #14951, we have to reduce the CPU consumed by GC, which necessarily means that GC and the write barrier will be enabled more of the time, increasing the impact of write barrier overhead.
I propose switching to a "buffered" write barrier, in which a fast path simply enqueues the necessary pointers to a per-P buffer. This fast path can be done without a normal Go call, avoiding the cost of spilling registers around the write barrier. When the buffer fills up, the write barrier will enter the slow path, which will spill all registers and enter the runtime to flush the buffer. We can disallow stack splits and safe-points during flushing so we don't need type information for the spilled registers.
I've already implemented this for the single pointer barrier on amd64. It's ~4X faster than the current barrier, speeds up GC-heavy applications by ~2%, and reduces binary size by ~1.5%. I haven't yet implemented it for other architectures, or used these techniques to improve the bulk write barriers.
The text was updated successfully, but these errors were encountered:
Sorry, I thought you were going to enqueue in generated code. Something like:
That's certainly possible. I went with calling into the runtime for a few reasons: the fast path is still enough code that I was worried about binary bloat and icache pressure, without the spills or stack writes I figured the CALL was probably virtually free, and it was easier for me. :)
It could be worth revisiting, though at this point I think the write barrier might be fast enough that it doesn't matter.
Another possible optimization worth revisiting in the future: always enqueue to the buffer, even if the write barrier is disabled, and only check whether the write barrier is enabled in the slow path.
One possible complication with this would be during early runtime init, before there's a P. That would have to be nowrtiebarrierrec, and I found out this week (the hard way) that it definitely has some write barriers in it. We might be able to improve that complication by setting up a P just for its buffer really early during init. Or, I was considering caching a pointer to the buffer in the G anyway, and then it would be easy to point g0 to a temporary buffer during init.
Currently, newstack and gogo have write barriers for maintaining the context register saved in g.sched.ctxt. This is troublesome, because newstack can be called from go:nowritebarrierrec places that can't allow write barriers. It happens to be benign because g.sched.ctxt will always be nil on entry to newstack *and* it so happens the incoming ctxt will also always be nil in these contexts (I think/hope), but this is playing with fire. It's also desirable to mark newstack go:nowritebarrierrec to prevent any other, non-benign write barriers from creeping in, but we can't do that right now because of this one write barrier. Fix all of this by observing that g.sched.ctxt is really just a saved live pointer register. Hence, we can shade it when we scan g's stack and otherwise move it back and forth between the actual context register and g.sched.ctxt without write barriers. This means we can save it in morestack along with all of the other g.sched, eliminate the save from newstack along with its troublesome write barrier, and eliminate the shenanigans in gogo to invoke the write barrier when restoring it. Once we've done all of this, we can mark newstack go:nowritebarrierrec. Fixes #22385. For #22460. Change-Id: I43c24958e3f6785b53c1350e1e83c2844e0d1522 Reviewed-on: https://go-review.googlesource.com/72553 Run-TryBot: Austin Clements <email@example.com> TryBot-Result: Gobot Gobot <firstname.lastname@example.org> Reviewed-by: Rick Hudson <email@example.com> Reviewed-by: Cherry Zhang <firstname.lastname@example.org>
Currently most of these are marked go:nowritebarrier as a hint, but it's actually important that these not invoke write barriers recursively. The danger is that some gcWork method would invoke the write barrier while the gcWork is in an inconsistent state and that the write barrier would in turn invoke some other gcWork method, which would crash or permanently corrupt the gcWork. Simply marking the write barrier itself as go:nowritebarrierrec isn't sufficient to prevent this if the write barrier doesn't use the outer method. Thankfully, this doesn't cause any build failures, so we were getting this right. :) For #22460. Change-Id: I35a7292a584200eb35a49507cd3fe359ba2206f6 Reviewed-on: https://go-review.googlesource.com/72554 Run-TryBot: Austin Clements <email@example.com> TryBot-Result: Gobot Gobot <firstname.lastname@example.org> Reviewed-by: Rick Hudson <email@example.com>
We're about to start tracking nowritebarrierrec through systemstack calls, which will reveal write barriers in startpanic_m prohibited by various callers. We actually can allow write barriers here because the write barrier is a no-op when we're panicking. Let the compiler know. Updates #22384. For #22460. Change-Id: Ifb3a38d3dd9a4125c278c3680f8648f987a5b0b8 Reviewed-on: https://go-review.googlesource.com/72770 Run-TryBot: Austin Clements <firstname.lastname@example.org> TryBot-Result: Gobot Gobot <email@example.com> Reviewed-by: Rick Hudson <firstname.lastname@example.org>
We're about to start tracking nowritebarrierrec through systemstack calls, which will reveal write barriers in persistentalloc prohibited by various callers. The pointers manipulated by persistentalloc are always to off-heap memory, so this removes these write barriers statically by introducing a new go:notinheap type to represent generic off-heap memory. Updates #22384. For #22460. Change-Id: Id449d9ebf145b14d55476a833e7f076b0d261d57 Reviewed-on: https://go-review.googlesource.com/72771 Run-TryBot: Austin Clements <email@example.com> TryBot-Result: Gobot Gobot <firstname.lastname@example.org> Reviewed-by: Rick Hudson <email@example.com>
We're about to start tracking nowritebarrierrec through systemstack calls, which detects that we're calling markroot (which has write barriers) from gchelper, which is called from the scheduler during STW apparently without a P. But it turns out that func helpgc, which wakes up blocked Ms to run gchelper, installs a P for gchelper to use. This means there *is* a P when gchelper runs, so it is allowed to have write barriers. Tell the compiler this by marking gchelper go:yeswritebarrierrec. Also, document the call to gchelper so I don't have to spend another half a day puzzling over how on earth this could possibly work before discovering the spooky action-at-a-distance in helpgc. Updates #22384. For #22460. Change-Id: I7394c9b4871745575f87a2d4fbbc5b8e54d669f7 Reviewed-on: https://go-review.googlesource.com/72772 Run-TryBot: Austin Clements <firstname.lastname@example.org> TryBot-Result: Gobot Gobot <email@example.com> Reviewed-by: Rick Hudson <firstname.lastname@example.org>
The current go:nowritebarrierrec checker has two problems that limit its coverage: 1. It doesn't understand that systemstack calls its argument, which means there are several cases where we fail to detect prohibited write barriers. 2. It only observes calls in the AST, so calls constructed during lowering by SSA aren't followed. This CL completely rewrites this checker to address these issues. The current checker runs entirely after walk and uses visitBottomUp, which introduces several problems for checking across systemstack. First, visitBottomUp itself doesn't understand systemstack calls, so the callee may be ordered after the caller, causing the checker to fail to propagate constraints. Second, many systemstack calls are passed a closure, which is quite difficult to resolve back to the function definition after transformclosure and walk have run. Third, visitBottomUp works exclusively on the AST, so it can't observe calls created by SSA. To address these problems, this commit splits the check into two phases and rewrites it to use a call graph generated during SSA lowering. The first phase runs before transformclosure/walk and simply records systemstack arguments when they're easy to get. Then, it modifies genssa to record static call edges at the point where we're lowering to Progs (which is the latest point at which position information is conveniently available). Finally, the second phase runs after all functions have been lowered and uses a direct BFS walk of the call graph (combining systemstack calls with static calls) to find prohibited write barriers and construct nice error messages. Fixes #22384. For #22460. Change-Id: I39668f7f2366ab3c1ab1a71eaf25484d25349540 Reviewed-on: https://go-review.googlesource.com/72773 Run-TryBot: Austin Clements <email@example.com> TryBot-Result: Gobot Gobot <firstname.lastname@example.org> Reviewed-by: Matthew Dempsky <email@example.com>
recordspan has two remaining write barriers from writing to the pointer to the backing store of h.allspans. However, h.allspans is always backed by off-heap memory, so let the compiler know this. Unfortunately, this isn't quite as clean as most go:notinheap uses because we can't directly name the backing store of a slice, but we can get it done with some judicious casting. For #22460. Change-Id: I296f92fa41cf2cb6ae572b35749af23967533877 Reviewed-on: https://go-review.googlesource.com/73414 Reviewed-by: Rick Hudson <firstname.lastname@example.org>
This implements runtime support for buffered write barriers on amd64. The buffered write barrier has a fast path that simply enqueues pointers in a per-P buffer. Unlike the current write barrier, this fast path is *not* a normal Go call and does not require the compiler to spill general-purpose registers or put arguments on the stack. When the buffer fills up, the write barrier takes the slow path, which spills all general purpose registers and flushes the buffer. We don't allow safe-points or stack splits while this frame is active, so it doesn't matter that we have no type information for the spilled registers in this frame. One minor complication is cgocheck=2 mode, which uses the write barrier to detect Go pointers being written to non-Go memory. We obviously can't buffer this, so instead we set the buffer to its minimum size, forcing the write barrier into the slow path on every call. For this specific case, we pass additional information as arguments to the flush function. This also requires enabling the cgo write barrier slightly later during runtime initialization, after Ps (and the per-P write barrier buffers) have been initialized. The code in this CL is not yet active. The next CL will modify the compiler to generate calls to the new write barrier. This reduces the average cost of the write barrier by roughly a factor of 4, which will pay for the cost of having it enabled more of the time after we make the GC pacer less aggressive. (Benchmarks will be in the next CL.) Updates #14951. Updates #22460. Change-Id: I396b5b0e2c5e5c4acfd761a3235fd15abadc6cb1 Reviewed-on: https://go-review.googlesource.com/73711 Run-TryBot: Austin Clements <email@example.com> TryBot-Result: Gobot Gobot <firstname.lastname@example.org> Reviewed-by: Rick Hudson <email@example.com>
This CL implements the compiler support for calling the buffered write barrier added by the previous CL. Since the buffered write barrier is only implemented on amd64 right now, this still supports the old, eager write barrier as well. There's little overhead to supporting both and this way a few tests in test/fixedbugs that expect to have liveness maps at write barrier calls can easily opt-in to the old, eager barrier. This significantly improves the performance of the write barrier: name old time/op new time/op delta WriteBarrier-12 73.5ns ±20% 19.2ns ±27% -73.90% (p=0.000 n=19+18) It also reduces the size of binaries because the write barrier call is more compact: name old object-bytes new object-bytes delta Template 398k ± 0% 393k ± 0% -1.14% (p=0.008 n=5+5) Unicode 208k ± 0% 206k ± 0% -1.00% (p=0.008 n=5+5) GoTypes 1.18M ± 0% 1.15M ± 0% -2.00% (p=0.008 n=5+5) Compiler 4.05M ± 0% 3.88M ± 0% -4.26% (p=0.008 n=5+5) SSA 8.25M ± 0% 8.11M ± 0% -1.59% (p=0.008 n=5+5) Flate 228k ± 0% 224k ± 0% -1.83% (p=0.008 n=5+5) GoParser 295k ± 0% 284k ± 0% -3.62% (p=0.008 n=5+5) Reflect 1.00M ± 0% 0.99M ± 0% -0.70% (p=0.008 n=5+5) Tar 339k ± 0% 333k ± 0% -1.67% (p=0.008 n=5+5) XML 404k ± 0% 395k ± 0% -2.10% (p=0.008 n=5+5) [Geo mean] 704k 690k -2.00% name old exe-bytes new exe-bytes delta HelloSize 1.05M ± 0% 1.04M ± 0% -1.55% (p=0.008 n=5+5) https://perf.golang.org/search?q=upload:20171027.1 (Amusingly, this also reduces compiler allocations by 0.75%, which, combined with the better write barrier, speeds up the compiler overall by 2.10%. See the perf link.) It slightly improves the performance of most of the go1 benchmarks and improves the performance of the x/benchmarks: name old time/op new time/op delta BinaryTree17-12 2.40s ± 1% 2.47s ± 1% +2.69% (p=0.000 n=19+19) Fannkuch11-12 2.95s ± 0% 2.95s ± 0% +0.21% (p=0.000 n=20+19) FmtFprintfEmpty-12 41.8ns ± 4% 41.4ns ± 2% -1.03% (p=0.014 n=20+20) FmtFprintfString-12 68.7ns ± 2% 67.5ns ± 1% -1.75% (p=0.000 n=20+17) FmtFprintfInt-12 79.0ns ± 3% 77.1ns ± 1% -2.40% (p=0.000 n=19+17) FmtFprintfIntInt-12 127ns ± 1% 123ns ± 3% -3.42% (p=0.000 n=20+20) FmtFprintfPrefixedInt-12 152ns ± 1% 150ns ± 1% -1.02% (p=0.000 n=18+17) FmtFprintfFloat-12 211ns ± 1% 209ns ± 0% -0.99% (p=0.000 n=20+16) FmtManyArgs-12 500ns ± 0% 496ns ± 0% -0.73% (p=0.000 n=17+20) GobDecode-12 6.44ms ± 1% 6.53ms ± 0% +1.28% (p=0.000 n=20+19) GobEncode-12 5.46ms ± 0% 5.46ms ± 1% ~ (p=0.550 n=19+20) Gzip-12 220ms ± 1% 216ms ± 0% -1.75% (p=0.000 n=19+19) Gunzip-12 38.8ms ± 0% 38.6ms ± 0% -0.30% (p=0.000 n=18+19) HTTPClientServer-12 79.0µs ± 1% 78.2µs ± 1% -1.01% (p=0.000 n=20+20) JSONEncode-12 11.9ms ± 0% 11.9ms ± 0% -0.29% (p=0.000 n=20+19) JSONDecode-12 52.6ms ± 0% 52.2ms ± 0% -0.68% (p=0.000 n=19+20) Mandelbrot200-12 3.69ms ± 0% 3.68ms ± 0% -0.36% (p=0.000 n=20+20) GoParse-12 3.13ms ± 1% 3.18ms ± 1% +1.67% (p=0.000 n=19+20) RegexpMatchEasy0_32-12 73.2ns ± 1% 72.3ns ± 1% -1.19% (p=0.000 n=19+18) RegexpMatchEasy0_1K-12 241ns ± 0% 239ns ± 0% -0.83% (p=0.000 n=17+16) RegexpMatchEasy1_32-12 68.6ns ± 1% 69.0ns ± 1% +0.47% (p=0.015 n=18+16) RegexpMatchEasy1_1K-12 364ns ± 0% 361ns ± 0% -0.67% (p=0.000 n=16+17) RegexpMatchMedium_32-12 104ns ± 1% 103ns ± 1% -0.79% (p=0.001 n=20+15) RegexpMatchMedium_1K-12 33.8µs ± 3% 34.0µs ± 2% ~ (p=0.267 n=20+19) RegexpMatchHard_32-12 1.64µs ± 1% 1.62µs ± 2% -1.25% (p=0.000 n=19+18) RegexpMatchHard_1K-12 49.2µs ± 0% 48.7µs ± 1% -0.93% (p=0.000 n=19+18) Revcomp-12 391ms ± 5% 396ms ± 7% ~ (p=0.154 n=19+19) Template-12 63.1ms ± 0% 59.5ms ± 0% -5.76% (p=0.000 n=18+19) TimeParse-12 307ns ± 0% 306ns ± 0% -0.39% (p=0.000 n=19+17) TimeFormat-12 325ns ± 0% 323ns ± 0% -0.50% (p=0.000 n=19+19) [Geo mean] 47.3µs 46.9µs -0.67% https://perf.golang.org/search?q=upload:20171026.1 name old time/op new time/op delta Garbage/benchmem-MB=64-12 2.25ms ± 1% 2.20ms ± 1% -2.31% (p=0.000 n=18+18) HTTP-12 12.6µs ± 0% 12.6µs ± 0% -0.72% (p=0.000 n=18+17) JSON-12 11.0ms ± 0% 11.0ms ± 1% -0.68% (p=0.000 n=17+19) https://perf.golang.org/search?q=upload:20171026.2 Updates #14951. Updates #22460. Change-Id: Id4c0932890a1d41020071bec73b8522b1367d3e7 Reviewed-on: https://go-review.googlesource.com/73712 Run-TryBot: Austin Clements <firstname.lastname@example.org> TryBot-Result: Gobot Gobot <email@example.com> Reviewed-by: Cherry Zhang <firstname.lastname@example.org>
This adds a benchmark of typedslicecopy and its bulk write barriers. For #22460. Change-Id: I439ca3b130bb22944468095f8f18b464e5bb43ca Reviewed-on: https://go-review.googlesource.com/74051 Run-TryBot: Austin Clements <email@example.com> TryBot-Result: Gobot Gobot <firstname.lastname@example.org> Reviewed-by: Rick Hudson <email@example.com>
Currently, typedslicecopy meticulously performs a typedmemmove on every element of the slice. This probably used to be necessary because we only had an individual element's type, but now we use the heap bitmap, so we only need to know whether the type has any pointers and how big it is. Hence, this CL rewrites typedslicecopy to simply perform one bulk barrier and one memmove. This also has a side-effect of eliminating two unnecessary write barriers per slice element that were coming from updates to dstp and srcp, which were stored in the parent stack frame. However, most of the win comes from eliminating the loops. name old time/op new time/op delta BulkWriteBarrier-12 7.83ns ±10% 7.33ns ± 6% -6.45% (p=0.000 n=20+20) Updates #22460. Change-Id: Id3450e9f36cc8e0892f268319b136f0d8f5464b8 Reviewed-on: https://go-review.googlesource.com/73831 Run-TryBot: Austin Clements <firstname.lastname@example.org> TryBot-Result: Gobot Gobot <email@example.com> Reviewed-by: Rick Hudson <firstname.lastname@example.org>
This modifies bulkBarrierPreWrite to use the buffered write barrier instead of the eager write barrier. This reduces the number of system stack switches and sanity checks by a factor of the buffer size (currently 256). This affects both typedmemmove and typedmemclr. Since this is purely a runtime change, it applies to all arches (unlike the pointer write barrier). name old time/op new time/op delta BulkWriteBarrier-12 7.33ns ± 6% 4.46ns ± 9% -39.10% (p=0.000 n=20+19) Updates #22460. Change-Id: I6a686a63bbf08be02b9b97250e37163c5a90cdd8 Reviewed-on: https://go-review.googlesource.com/73832 Run-TryBot: Austin Clements <email@example.com> TryBot-Result: Gobot Gobot <firstname.lastname@example.org> Reviewed-by: Rick Hudson <email@example.com>
Calls to writebarrierptr can simply be actual pointer writes. Calls to writebarrierptr_prewrite need to go through the write barrier buffer. Updates #22460. Change-Id: I92cee4da98c5baa499f1977563757c76f95bf0ca Reviewed-on: https://go-review.googlesource.com/92704 Run-TryBot: Austin Clements <firstname.lastname@example.org> Reviewed-by: Rick Hudson <email@example.com>
Allow the compiler to generate code like CMPQ 16(AX), $7 It's tricky because it's difficult to spill such a comparison during flagalloc, because the same memory state might not be available at the restore locations. Solve this problem by decomposing the compare+load back into its parts if it needs to be spilled. The big win is that the write barrier test goes from: MOVL runtime.writeBarrier(SB), CX TESTL CX, CX JNE 60 to CMPL runtime.writeBarrier(SB), $0 JNE 59 It's one instruction and one byte smaller. Fixes #19485 Fixes #15245 Update #22460 Binaries are about 0.15% smaller. Change-Id: I4fd8d1111b6b9924d52f9a0901ca1b2e5cce0836 Reviewed-on: https://go-review.googlesource.com/86035 Reviewed-by: Cherry Zhang <firstname.lastname@example.org> Reviewed-by: Ilya Tocar <email@example.com>