runtime: mutator assists are over-aggressive, especially at high GOGC #14951
Some approaches we've taken or thought about:
I'm not sure I understand the desire to hit the heap goal if the live set is growing. If the live set is growing, we're going to set a higher heap goal for the next cycle. Why not go over the heap goal this cycle in that case? At the last GC, assume we had a live set of size L, so we set the heap goal to 2L (for GOGC=100). Just set the trigger & assist ratio so that we scan L bytes by the time total allocation is up to 2L. Then if we reach a total allocation of 2L and scanning still isn't done, we know the live set is larger than L and we can grow the heap (i.e., miss the heap goal) with a clean conscience.
@randall77, this is basically the idea behind proposal 3. However, it's important to have the more aggressive fallback if the heap is actually growing. In fact, in the original pacer design, I set the assist ratio to target scanning the expected reachable heap (the marked heap from last cycle) by the heap goal, and this let aggressive mutators spiral the heap size out of control. I don't remember the exact details of this failure mode, but it may have happened if the system got into a state where the assist ratio was less than 1 (or only slightly larger than 1).
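A minimal sketch of the scheme being discussed, with entirely hypothetical names (the runtime's real pacer in runtime/mgc.go is far more involved): pace assists so the expected scan work finishes by the heap goal, but fall back to an aggressive fixed ratio once allocation reaches the goal with scanning unfinished, so a growing live set can't outpace marking.

```go
package main

import "fmt"

// assistRatio returns the scan work (bytes) a mutator must perform per
// byte it allocates, given the remaining scan work and the allocation
// headroom left before the heap goal.
func assistRatio(scanWorkLeft, heapGoal, heapLive int64) float64 {
	headroom := heapGoal - heapLive
	if headroom <= 0 {
		// We reached the goal with scanning unfinished, so the live
		// set must be larger than last cycle's. Let the heap grow past
		// the goal, but assist aggressively so the mutator cannot
		// outpace the collector.
		return 2.0 // illustrative fallback ratio, comfortably > 1
	}
	return float64(scanWorkLeft) / float64(headroom)
}

func main() {
	// Steady state: scan L bytes over the remaining runway to 2L.
	fmt.Println(assistRatio(32<<20, 128<<20, 64<<20)) // 0.5
	// At the goal with work left: aggressive fallback kicks in.
	fmt.Println(assistRatio(16<<20, 128<<20, 128<<20)) // 2
}
```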
CL https://golang.org/cl/21324 mentions this issue.
These numbers show various approaches to dealing with increasing CPU availability.

The benchMasterMar28 runs show the performance of tip on March 28. If we change benchMasterMar28 to allocate black during the GC cycle, the result is the benchAllocatBlackOrig run below.

Here is the timing grid that considers throughput. The 1.6 baseline is 2.70. Master is 2.45. The dev:garbage branch is 2.352, 3% better. The fixed-ratio schemes scan either one word or two words for each word allocated.

Here is the grid from a gctrace sampling somewhere near 3.5 seconds; the traces nearby are similar. The configurations compared are: Master White, Master Black, Scan1for1 Black, Scan2for1 Black, Scan1for1 White, Scan2for1 White.

Time spent doing assists dropped when we went to either the 2-for-1 or the 1-for-1 fixed ratio. Allocating black shortened the GC since the amount of scan work decreases. The more of the CPUs we make available to the mutators during a GC, the more advantage there is. We missed the target heap size in all of the fixed-ratio tests, sometimes by quite a bit. CL 21287 shows where these changes were made.

Here is the raw data from the various runs.

cd benchmarks/bench/bin
./benchMasterMar28 -bench=garbage
./benchAllocatBlackOrig -bench=garbage
gc 56 @3.044s 24%: 0.027+18+0.25 ms clock, 0.081+128/53/0.46+0.75 ms cpu, 121->123->65 MB, 125 MB goal, 12 P
./benchDev -bench=garbage # March 28 best numbers using the dev branch with Ctz optimizations.
gc 53 @3.056s 23%: 0.031+17+0.38 ms clock, 0.15+132/52/0+1.9 ms cpu, 125->126->64 MB, 128 MB goal, 12 P
./benchAllocatBlackScan1for1 -bench=garbage
./benchAllocatBlackScan2for1 -bench=garbage
./benchAllocatWhiteScan1for1 -bench=garbage
./benchAllocatWhiteScan2for1 -bench=garbage
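For concreteness, a minimal sketch of the fixed-ratio assist scheme tested above (Scan1for1 / Scan2for1), with hypothetical names: every allocation charges scan-work debt at a fixed ratio, independent of the heap goal, and the allocating goroutine pays the debt down by scanning.

```go
package pacer

// scanPerAllocByte is 1 for the Scan1for1 scheme, 2 for Scan2for1.
const scanPerAllocByte = 1

type gcAssistState struct {
	scanDebt int64 // bytes of scan work owed by this goroutine
}

// onMalloc charges scan debt for an allocation and pays it down.
// scan(n) scans up to n bytes of the heap and reports how many it
// actually scanned (it may stop early if mark work runs out).
func (g *gcAssistState) onMalloc(allocBytes int64, scan func(n int64) int64) {
	g.scanDebt += allocBytes * scanPerAllocByte
	if g.scanDebt > 0 {
		g.scanDebt -= scan(g.scanDebt)
	}
}
```

Because the ratio is fixed rather than derived from the remaining runway, this scheme bounds per-allocation assist latency, at the cost of no longer guaranteeing the heap goal is met — which matches the heap-size misses reported above.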
@RLH did you mean to close this?
No. How do I fix this?
You don't see the "Reopen Issue" button? I pressed it, in any case. Maybe you're not in the right groups.
Currently, when we compute the trigger for the next GC, we do it based on an estimate of the reachable heap size at the start of the GC cycle, which is itself based on an estimate of the floating garbage. This was introduced by 4655aad to fix a bad feedback loop that allowed the heap to grow to many times the true reachable size.

However, this estimate gets easily confused by rapidly allocating applications and, worse, it's different from the heap size the trigger controller uses to compute the trigger itself. This results in the trigger controller often thinking that GC finished before it started. Since this would be a pretty great outcome from its perspective, it sets the trigger for the next cycle as close to the next goal as possible (which is limited to 95% of the goal).

Furthermore, the bad feedback loop this estimate originally fixed seems not to happen any more, suggesting it was fixed more correctly by some other change in the meantime. Finally, with the change to allocate black, it shouldn't even be theoretically possible for this bad feedback loop to occur.

Hence, eliminate the floating garbage estimate and simply consider the reachable heap to be the marked heap. This harms overall throughput slightly for allocation-heavy benchmarks, but significantly improves mutator availability.

Fixes #12204. This brings the average trigger in this benchmark from 0.95 (the cap) to 0.7 and the active GC utilization from ~90% to ~45%.

Updates #14951. This makes the trigger controller much better behaved, so it pulls the trigger lower if assists are consuming a lot of CPU, like it's supposed to, increasing mutator availability.

name              old time/op  new time/op  delta
XBenchGarbage-12  2.21ms ± 1%  2.28ms ± 3%  +3.29%  (p=0.000 n=17+17)

Some of this slowdown we paid for in earlier commits. Relative to the start of the series to switch to allocate-black (the parent of "count black allocations toward scan work"), the garbage benchmark is 2.62% slower.
name                      old time/op  new time/op  delta
BinaryTree17-12           2.53s ± 3%   2.53s ± 3%   ~       (p=0.708 n=20+19)
Fannkuch11-12             2.08s ± 0%   2.08s ± 0%   -0.22%  (p=0.002 n=19+18)
FmtFprintfEmpty-12        45.3ns ± 2%  45.2ns ± 3%  ~       (p=0.505 n=20+20)
FmtFprintfString-12       129ns ± 0%   131ns ± 2%   +1.80%  (p=0.000 n=16+19)
FmtFprintfInt-12          121ns ± 2%   121ns ± 2%   ~       (p=0.768 n=19+19)
FmtFprintfIntInt-12       186ns ± 1%   188ns ± 3%   +0.99%  (p=0.000 n=19+19)
FmtFprintfPrefixedInt-12  188ns ± 1%   188ns ± 1%   ~       (p=0.947 n=18+16)
FmtFprintfFloat-12        254ns ± 1%   255ns ± 1%   +0.30%  (p=0.002 n=19+17)
FmtManyArgs-12            763ns ± 0%   770ns ± 0%   +0.92%  (p=0.000 n=18+18)
GobDecode-12              7.00ms ± 1%  7.04ms ± 1%  +0.61%  (p=0.049 n=20+20)
GobEncode-12              5.88ms ± 1%  5.88ms ± 0%  ~       (p=0.641 n=18+19)
Gzip-12                   214ms ± 1%   215ms ± 1%   +0.43%  (p=0.002 n=18+19)
Gunzip-12                 37.6ms ± 0%  37.6ms ± 0%  +0.11%  (p=0.015 n=17+18)
HTTPClientServer-12       76.9µs ± 2%  78.1µs ± 2%  +1.44%  (p=0.000 n=20+18)
JSONEncode-12             15.2ms ± 2%  15.1ms ± 1%  ~       (p=0.271 n=19+18)
JSONDecode-12             53.1ms ± 1%  53.3ms ± 0%  +0.49%  (p=0.000 n=18+19)
Mandelbrot200-12          4.04ms ± 1%  4.03ms ± 0%  -0.33%  (p=0.005 n=18+18)
GoParse-12                3.29ms ± 1%  3.28ms ± 1%  ~       (p=0.146 n=16+17)
RegexpMatchEasy0_32-12    69.9ns ± 3%  69.5ns ± 1%  ~       (p=0.785 n=20+19)
RegexpMatchEasy0_1K-12    237ns ± 0%   237ns ± 0%   ~       (p=1.000 n=18+18)
RegexpMatchEasy1_32-12    69.5ns ± 1%  69.2ns ± 1%  -0.44%  (p=0.020 n=16+19)
RegexpMatchEasy1_1K-12    372ns ± 1%   371ns ± 2%   ~       (p=0.086 n=20+19)
RegexpMatchMedium_32-12   108ns ± 3%   107ns ± 1%   -1.00%  (p=0.004 n=19+14)
RegexpMatchMedium_1K-12   34.2µs ± 4%  34.0µs ± 2%  ~       (p=0.380 n=19+20)
RegexpMatchHard_32-12     1.77µs ± 4%  1.76µs ± 3%  ~       (p=0.558 n=18+20)
RegexpMatchHard_1K-12     53.4µs ± 4%  52.8µs ± 2%  -1.10%  (p=0.020 n=18+20)
Revcomp-12                359ms ± 4%   377ms ± 0%   +5.19%  (p=0.000 n=20+18)
Template-12               63.7ms ± 2%  62.9ms ± 2%  -1.27%  (p=0.005 n=18+20)
TimeParse-12              316ns ± 2%   313ns ± 1%   ~       (p=0.059 n=20+16)
TimeFormat-12             329ns ± 0%   331ns ± 0%   +0.39%  (p=0.000 n=16+18)
[Geo mean]                51.6µs       51.7µs       +0.18%

Change-Id: I1dce4640c8205d41717943b021039fffea863c57
Reviewed-on: https://go-review.googlesource.com/21324
Reviewed-by: Rick Hudson <rlh@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
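The goal computation this commit leaves behind is simple enough to sketch. With the floating-garbage estimate gone, the reachable heap is taken to be last cycle's marked heap (hypothetical names below; the real code is in runtime/mgc.go):

```go
package pacer

// nextHeapGoal sets the next cycle's goal directly from the marked
// heap, with no floating-garbage correction.
func nextHeapGoal(heapMarked, gogc int64) int64 {
	return heapMarked + heapMarked*gogc/100 // 2*heapMarked for GOGC=100
}

// The trigger controller then places the next trigger somewhere
// between heapMarked and this goal, capped at 95% of the goal.
```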
Change https://golang.org/cl/59970 mentions this issue.

Change https://golang.org/cl/59971 mentions this issue.

Change https://golang.org/cl/60790 mentions this issue.

Change https://golang.org/cl/73711 mentions this issue.

Change https://golang.org/cl/73712 mentions this issue.
This implements runtime support for buffered write barriers on amd64. The buffered write barrier has a fast path that simply enqueues pointers in a per-P buffer. Unlike the current write barrier, this fast path is *not* a normal Go call and does not require the compiler to spill general-purpose registers or put arguments on the stack. When the buffer fills up, the write barrier takes the slow path, which spills all general-purpose registers and flushes the buffer. We don't allow safe-points or stack splits while this frame is active, so it doesn't matter that we have no type information for the spilled registers in this frame.

One minor complication is cgocheck=2 mode, which uses the write barrier to detect Go pointers being written to non-Go memory. We obviously can't buffer this, so instead we set the buffer to its minimum size, forcing the write barrier into the slow path on every call. For this specific case, we pass additional information as arguments to the flush function. This also requires enabling the cgo write barrier slightly later during runtime initialization, after Ps (and the per-P write barrier buffers) have been initialized.

The code in this CL is not yet active. The next CL will modify the compiler to generate calls to the new write barrier.

This reduces the average cost of the write barrier by roughly a factor of 4, which will pay for the cost of having it enabled more of the time after we make the GC pacer less aggressive. (Benchmarks will be in the next CL.)

Updates #14951.
Updates #22460.

Change-Id: I396b5b0e2c5e5c4acfd761a3235fd15abadc6cb1
Reviewed-on: https://go-review.googlesource.com/73711
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
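A simplified Go-level sketch of the buffer's shape (hypothetical and schematic: the real fast path is generated assembly, and the buffer logic lives in the runtime). The point is that the common case is a plain store and increment; only a full buffer pays for the expensive flush:

```go
package pacer

type wbBuf struct {
	next int            // index of the next free slot
	buf  [512]uintptr   // pointers waiting to be marked grey
}

// record is the fast path: append ptr and, only when the buffer is
// full, take the slow path. In the real barrier this is done without a
// normal Go call, so no registers are spilled on the fast path.
func (b *wbBuf) record(ptr uintptr) {
	b.buf[b.next] = ptr
	b.next++
	if b.next == len(b.buf) {
		b.flush() // slow path: spill registers, drain the buffer
	}
}

// flush drains the buffer into GC mark work and resets it.
func (b *wbBuf) flush() {
	for _, p := range b.buf[:b.next] {
		shade(p) // hypothetical helper: grey the object containing p
	}
	b.next = 0
}

func shade(p uintptr) { /* stub: enqueue p on the GC work queue */ }
```

The cgocheck=2 behavior described above corresponds to shrinking the buffer so `record` hits `flush` on every call.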
This CL implements the compiler support for calling the buffered write barrier added by the previous CL. Since the buffered write barrier is only implemented on amd64 right now, this still supports the old, eager write barrier as well. There's little overhead to supporting both, and this way a few tests in test/fixedbugs that expect to have liveness maps at write barrier calls can easily opt in to the old, eager barrier.

This significantly improves the performance of the write barrier:

name            old time/op  new time/op  delta
WriteBarrier-12 73.5ns ±20%  19.2ns ±27%  -73.90%  (p=0.000 n=19+18)

It also reduces the size of binaries because the write barrier call is more compact:

name        old object-bytes  new object-bytes  delta
Template    398k ± 0%         393k ± 0%         -1.14%  (p=0.008 n=5+5)
Unicode     208k ± 0%         206k ± 0%         -1.00%  (p=0.008 n=5+5)
GoTypes     1.18M ± 0%        1.15M ± 0%        -2.00%  (p=0.008 n=5+5)
Compiler    4.05M ± 0%        3.88M ± 0%        -4.26%  (p=0.008 n=5+5)
SSA         8.25M ± 0%        8.11M ± 0%        -1.59%  (p=0.008 n=5+5)
Flate       228k ± 0%         224k ± 0%         -1.83%  (p=0.008 n=5+5)
GoParser    295k ± 0%         284k ± 0%         -3.62%  (p=0.008 n=5+5)
Reflect     1.00M ± 0%        0.99M ± 0%        -0.70%  (p=0.008 n=5+5)
Tar         339k ± 0%         333k ± 0%         -1.67%  (p=0.008 n=5+5)
XML         404k ± 0%         395k ± 0%         -2.10%  (p=0.008 n=5+5)
[Geo mean]  704k              690k              -2.00%

name       old exe-bytes  new exe-bytes  delta
HelloSize  1.05M ± 0%     1.04M ± 0%     -1.55%  (p=0.008 n=5+5)

https://perf.golang.org/search?q=upload:20171027.1

(Amusingly, this also reduces compiler allocations by 0.75%, which, combined with the better write barrier, speeds up the compiler overall by 2.10%. See the perf link.)

It slightly improves the performance of most of the go1 benchmarks and improves the performance of the x/benchmarks:

name                      old time/op  new time/op  delta
BinaryTree17-12           2.40s ± 1%   2.47s ± 1%   +2.69%  (p=0.000 n=19+19)
Fannkuch11-12             2.95s ± 0%   2.95s ± 0%   +0.21%  (p=0.000 n=20+19)
FmtFprintfEmpty-12        41.8ns ± 4%  41.4ns ± 2%  -1.03%  (p=0.014 n=20+20)
FmtFprintfString-12       68.7ns ± 2%  67.5ns ± 1%  -1.75%  (p=0.000 n=20+17)
FmtFprintfInt-12          79.0ns ± 3%  77.1ns ± 1%  -2.40%  (p=0.000 n=19+17)
FmtFprintfIntInt-12       127ns ± 1%   123ns ± 3%   -3.42%  (p=0.000 n=20+20)
FmtFprintfPrefixedInt-12  152ns ± 1%   150ns ± 1%   -1.02%  (p=0.000 n=18+17)
FmtFprintfFloat-12        211ns ± 1%   209ns ± 0%   -0.99%  (p=0.000 n=20+16)
FmtManyArgs-12            500ns ± 0%   496ns ± 0%   -0.73%  (p=0.000 n=17+20)
GobDecode-12              6.44ms ± 1%  6.53ms ± 0%  +1.28%  (p=0.000 n=20+19)
GobEncode-12              5.46ms ± 0%  5.46ms ± 1%  ~       (p=0.550 n=19+20)
Gzip-12                   220ms ± 1%   216ms ± 0%   -1.75%  (p=0.000 n=19+19)
Gunzip-12                 38.8ms ± 0%  38.6ms ± 0%  -0.30%  (p=0.000 n=18+19)
HTTPClientServer-12       79.0µs ± 1%  78.2µs ± 1%  -1.01%  (p=0.000 n=20+20)
JSONEncode-12             11.9ms ± 0%  11.9ms ± 0%  -0.29%  (p=0.000 n=20+19)
JSONDecode-12             52.6ms ± 0%  52.2ms ± 0%  -0.68%  (p=0.000 n=19+20)
Mandelbrot200-12          3.69ms ± 0%  3.68ms ± 0%  -0.36%  (p=0.000 n=20+20)
GoParse-12                3.13ms ± 1%  3.18ms ± 1%  +1.67%  (p=0.000 n=19+20)
RegexpMatchEasy0_32-12    73.2ns ± 1%  72.3ns ± 1%  -1.19%  (p=0.000 n=19+18)
RegexpMatchEasy0_1K-12    241ns ± 0%   239ns ± 0%   -0.83%  (p=0.000 n=17+16)
RegexpMatchEasy1_32-12    68.6ns ± 1%  69.0ns ± 1%  +0.47%  (p=0.015 n=18+16)
RegexpMatchEasy1_1K-12    364ns ± 0%   361ns ± 0%   -0.67%  (p=0.000 n=16+17)
RegexpMatchMedium_32-12   104ns ± 1%   103ns ± 1%   -0.79%  (p=0.001 n=20+15)
RegexpMatchMedium_1K-12   33.8µs ± 3%  34.0µs ± 2%  ~       (p=0.267 n=20+19)
RegexpMatchHard_32-12     1.64µs ± 1%  1.62µs ± 2%  -1.25%  (p=0.000 n=19+18)
RegexpMatchHard_1K-12     49.2µs ± 0%  48.7µs ± 1%  -0.93%  (p=0.000 n=19+18)
Revcomp-12                391ms ± 5%   396ms ± 7%   ~       (p=0.154 n=19+19)
Template-12               63.1ms ± 0%  59.5ms ± 0%  -5.76%  (p=0.000 n=18+19)
TimeParse-12              307ns ± 0%   306ns ± 0%   -0.39%  (p=0.000 n=19+17)
TimeFormat-12             325ns ± 0%   323ns ± 0%   -0.50%  (p=0.000 n=19+19)
[Geo mean]                47.3µs       46.9µs       -0.67%

https://perf.golang.org/search?q=upload:20171026.1

name                      old time/op  new time/op  delta
Garbage/benchmem-MB=64-12 2.25ms ± 1%  2.20ms ± 1%  -2.31%  (p=0.000 n=18+18)
HTTP-12                   12.6µs ± 0%  12.6µs ± 0%  -0.72%  (p=0.000 n=18+17)
JSON-12                   11.0ms ± 0%  11.0ms ± 1%  -0.68%  (p=0.000 n=17+19)

https://perf.golang.org/search?q=upload:20171026.2

Updates #14951.
Updates #22460.

Change-Id: Id4c0932890a1d41020071bec73b8522b1367d3e7
Reviewed-on: https://go-review.googlesource.com/73712
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Currently, GC pacing is based on a single hard heap limit computed from GOGC. In order to achieve this hard limit, assist pacing makes the conservative assumption that the entire heap is live. However, in the steady state (with GOGC=100), only half of the heap is live. As a result, the garbage collector works twice as hard as necessary and finishes half way between the trigger and the goal. Since this is a stable state for the trigger controller, this repeats from cycle to cycle. Matters are even worse if GOGC is higher. For example, if GOGC=200, only a third of the heap is live in steady state, so the GC will work three times harder than necessary and finish only a third of the way between the trigger and the goal.

Since this causes the garbage collector to consume ~50% of the available CPU during marking instead of the intended 25%, about 25% of the CPU goes to mutator assists. This high mutator assist cost causes high mutator latency variability.

This commit improves the situation by separating the heap goal into two goals: a soft goal and a hard goal. The soft goal is set based on GOGC, just like the current goal is, and the hard goal is set at a 10% larger heap than the soft goal. Prior to the soft goal, assist pacing assumes the heap is in steady state (e.g., only half of it is live). Between the soft goal and the hard goal, assist pacing switches to the current conservative assumption that the entire heap is live.

In benchmarks, this nearly eliminates mutator assists. However, since background marking is fixed at 25% CPU, this causes the trigger controller to saturate, which leads to somewhat higher variability in heap size. The next commit will address this.

The lower CPU usage of course leads to longer mark cycles, though really it means the mark cycles are as long as they should have been in the first place. This does, however, lead to two potential down-sides compared to the current pacing policy:

1. The total overhead of the write barrier is higher because it's enabled more of the time.
2. The heap size may be larger because there's more floating garbage.

We addressed 1 by significantly improving the performance of the write barrier in the preceding commits. 2 can be demonstrated in intense GC benchmarks, but doesn't seem to be a problem in any real applications.

Updates #14951.
Updates #14812 (fixes?).
Fixes #18534.

This has no significant effect on the throughput of the github.com/dr2chase/bent benchmarks-50.
This has little overall throughput effect on the go1 benchmarks:

name                      old time/op  new time/op  delta
BinaryTree17-12           2.41s ± 0%   2.40s ± 0%   -0.22%  (p=0.007 n=20+18)
Fannkuch11-12             2.95s ± 0%   2.95s ± 0%   +0.07%  (p=0.003 n=17+18)
FmtFprintfEmpty-12        41.7ns ± 3%  42.2ns ± 0%  +1.17%  (p=0.002 n=20+15)
FmtFprintfString-12       66.5ns ± 0%  67.9ns ± 2%  +2.16%  (p=0.000 n=16+20)
FmtFprintfInt-12          77.6ns ± 2%  75.6ns ± 3%  -2.55%  (p=0.000 n=19+19)
FmtFprintfIntInt-12       124ns ± 1%   123ns ± 1%   -0.98%  (p=0.000 n=18+17)
FmtFprintfPrefixedInt-12  151ns ± 1%   148ns ± 1%   -1.75%  (p=0.000 n=19+20)
FmtFprintfFloat-12        210ns ± 1%   212ns ± 0%   +0.75%  (p=0.000 n=19+16)
FmtManyArgs-12            501ns ± 1%   499ns ± 1%   -0.30%  (p=0.041 n=17+19)
GobDecode-12              6.50ms ± 1%  6.49ms ± 1%  ~       (p=0.234 n=19+19)
GobEncode-12              5.43ms ± 0%  5.47ms ± 0%  +0.75%  (p=0.000 n=20+19)
Gzip-12                   216ms ± 1%   220ms ± 1%   +1.71%  (p=0.000 n=19+20)
Gunzip-12                 38.6ms ± 0%  38.8ms ± 0%  +0.66%  (p=0.000 n=18+19)
HTTPClientServer-12       78.1µs ± 1%  78.5µs ± 1%  +0.49%  (p=0.035 n=20+20)
JSONEncode-12             12.1ms ± 0%  12.2ms ± 0%  +1.05%  (p=0.000 n=18+17)
JSONDecode-12             53.0ms ± 0%  52.3ms ± 0%  -1.27%  (p=0.000 n=19+19)
Mandelbrot200-12          3.74ms ± 0%  3.69ms ± 0%  -1.17%  (p=0.000 n=18+19)
GoParse-12                3.17ms ± 1%  3.17ms ± 1%  ~       (p=0.569 n=19+20)
RegexpMatchEasy0_32-12    73.2ns ± 1%  73.7ns ± 0%  +0.76%  (p=0.000 n=18+17)
RegexpMatchEasy0_1K-12    239ns ± 0%   238ns ± 0%   -0.27%  (p=0.000 n=13+17)
RegexpMatchEasy1_32-12    69.0ns ± 2%  69.1ns ± 1%  ~       (p=0.404 n=19+19)
RegexpMatchEasy1_1K-12    367ns ± 1%   365ns ± 1%   -0.60%  (p=0.000 n=19+19)
RegexpMatchMedium_32-12   105ns ± 1%   104ns ± 1%   -1.24%  (p=0.000 n=19+16)
RegexpMatchMedium_1K-12   34.1µs ± 2%  33.6µs ± 3%  -1.60%  (p=0.000 n=20+20)
RegexpMatchHard_32-12     1.62µs ± 1%  1.67µs ± 1%  +2.75%  (p=0.000 n=18+18)
RegexpMatchHard_1K-12     48.8µs ± 1%  50.3µs ± 2%  +3.07%  (p=0.000 n=20+19)
Revcomp-12                386ms ± 0%   384ms ± 0%   -0.57%  (p=0.000 n=20+19)
Template-12               59.9ms ± 1%  61.1ms ± 1%  +2.01%  (p=0.000 n=20+19)
TimeParse-12              301ns ± 2%   307ns ± 0%   +2.11%  (p=0.000 n=20+19)
TimeFormat-12             323ns ± 0%   323ns ± 0%   ~       (all samples are equal)
[Geo mean]                47.0µs       47.1µs       +0.23%

https://perf.golang.org/search?q=upload:20171030.1

Likewise, the throughput effect on the x/benchmarks is minimal (and reasonably positive on the garbage benchmark with a large heap):

name                        old time/op  new time/op  delta
Garbage/benchmem-MB=1024-12 2.40ms ± 4%  2.29ms ± 3%  -4.57%  (p=0.000 n=19+18)
Garbage/benchmem-MB=64-12   2.23ms ± 1%  2.24ms ± 2%  +0.59%  (p=0.016 n=19+18)
HTTP-12                     12.5µs ± 1%  12.6µs ± 1%  ~       (p=0.326 n=20+19)
JSON-12                     11.1ms ± 1%  11.3ms ± 2%  +2.15%  (p=0.000 n=16+17)

It does increase the heap size of the garbage benchmarks, but seems to have relatively little impact on more realistic programs. Also, we'll gain some of this back with the next commit.

name                        old peak-RSS-bytes  new peak-RSS-bytes  delta
Garbage/benchmem-MB=1024-12 1.21G ± 1%          1.88G ± 2%          +55.59%  (p=0.000 n=19+20)
Garbage/benchmem-MB=64-12   168M ± 3%           248M ± 8%           +48.08%  (p=0.000 n=18+20)
HTTP-12                     45.6M ± 9%          47.0M ±27%          ~        (p=0.925 n=20+20)
JSON-12                     193M ±11%           206M ±11%           +7.06%   (p=0.001 n=20+20)

https://perf.golang.org/search?q=upload:20171030.2

Change-Id: Ic78904135f832b4d64056cbe734ab979f5ad9736
Reviewed-on: https://go-review.googlesource.com/59970
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
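The soft/hard goal split described in that commit can be sketched in a few lines (hypothetical names; a simplification of the actual pacer): assists pace against the steady-state scan expectation until the soft goal, then fall back to the conservative whole-heap assumption until the hard goal.

```go
package pacer

// heapGoals derives the hard goal from the GOGC-based soft goal.
func heapGoals(softGoal int64) (soft, hard int64) {
	return softGoal, softGoal + softGoal/10 // hard goal ~10% above soft
}

// expectedScanWork is what assist pacing targets finishing by the
// relevant goal.
func expectedScanWork(heapLive, softGoal, heapMarked, heapScannable int64) int64 {
	if heapLive <= softGoal {
		// Steady-state assumption: only last cycle's marked heap is
		// live, so roughly half the heap at GOGC=100.
		return heapMarked
	}
	// Past the soft goal: revert to the conservative assumption that
	// the entire scannable heap may be live.
	return heapScannable
}
```

The fallback region is what preserves the original safety property: a genuinely growing heap still can't outrun the collector by more than ~10% of the goal.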
Currently, the background mark worker CPU and the goal GC CPU are both fixed at 25%. The trigger controller's goal is to achieve the goal CPU usage, and with the previous commit it can actually achieve this. But this means there are *no* assists, which sounds ideal but actually causes problems for the trigger controller. Since the controller can't lower CPU usage below the background mark worker CPU, it saturates at the CPU goal and no longer gets feedback, which translates into higher variability in heap growth.

This commit fixes this by allowing assists 5% CPU beyond the 25% fixed background mark. This avoids saturating the trigger controller, since it can now get feedback from both sides of the CPU goal. This leads to low variability in both CPU usage and heap growth, at the cost of reintroducing a low rate of mark assists. We also experimented with 20% background plus 5% assist, but 25%+5% clearly performed better in benchmarks.

Updates #14951.
Updates #14812.
Updates #18534.

Combined with the previous CL, this significantly improves tail mutator utilization in the x/benchmarks garbage benchmark. On a sample trace, it increased the 99.9%ile mutator utilization at 10ms from 26% to 59%, and at 5ms from 17% to 52%. It reduced the 99.9%ile zero utilization window from 2ms to 700µs. It also helps the mean mutator utilization: it increased the 10s mutator utilization from 83% to 94%. The minimum mutator utilization is also somewhat improved, though there is still some unknown artifact that causes a minuscule fraction of mutator assists to take 5-10ms (in fact, there was exactly one 10ms mutator assist in my sample trace).

This has no significant effect on the throughput of the github.com/dr2chase/bent benchmarks-50.

This has little effect on the go1 benchmarks (and the slight overall improvement makes up for the slight overall slowdown from the previous commit):

name                      old time/op  new time/op  delta
BinaryTree17-12           2.40s ± 0%   2.41s ± 1%   +0.26%  (p=0.010 n=18+19)
Fannkuch11-12             2.95s ± 0%   2.93s ± 0%   -0.62%  (p=0.000 n=18+15)
FmtFprintfEmpty-12        42.2ns ± 0%  42.3ns ± 1%  +0.37%  (p=0.001 n=15+14)
FmtFprintfString-12       67.9ns ± 2%  67.2ns ± 3%  -1.03%  (p=0.002 n=20+18)
FmtFprintfInt-12          75.6ns ± 3%  76.8ns ± 2%  +1.59%  (p=0.000 n=19+17)
FmtFprintfIntInt-12       123ns ± 1%   124ns ± 1%   +0.77%  (p=0.000 n=17+14)
FmtFprintfPrefixedInt-12  148ns ± 1%   150ns ± 1%   +1.28%  (p=0.000 n=20+20)
FmtFprintfFloat-12        212ns ± 0%   211ns ± 1%   -0.67%  (p=0.000 n=16+17)
FmtManyArgs-12            499ns ± 1%   500ns ± 0%   +0.23%  (p=0.004 n=19+16)
GobDecode-12              6.49ms ± 1%  6.51ms ± 1%  +0.32%  (p=0.008 n=19+19)
GobEncode-12              5.47ms ± 0%  5.43ms ± 1%  -0.68%  (p=0.000 n=19+20)
Gzip-12                   220ms ± 1%   216ms ± 1%   -1.66%  (p=0.000 n=20+19)
Gunzip-12                 38.8ms ± 0%  38.5ms ± 0%  -0.80%  (p=0.000 n=19+20)
HTTPClientServer-12       78.5µs ± 1%  78.1µs ± 1%  -0.53%  (p=0.008 n=20+19)
JSONEncode-12             12.2ms ± 0%  11.9ms ± 0%  -2.38%  (p=0.000 n=17+19)
JSONDecode-12             52.3ms ± 0%  53.3ms ± 0%  +1.84%  (p=0.000 n=19+20)
Mandelbrot200-12          3.69ms ± 0%  3.69ms ± 0%  -0.19%  (p=0.000 n=19+19)
GoParse-12                3.17ms ± 1%  3.19ms ± 1%  +0.61%  (p=0.000 n=20+20)
RegexpMatchEasy0_32-12    73.7ns ± 0%  73.2ns ± 1%  -0.66%  (p=0.000 n=17+20)
RegexpMatchEasy0_1K-12    238ns ± 0%   239ns ± 0%   +0.32%  (p=0.000 n=17+16)
RegexpMatchEasy1_32-12    69.1ns ± 1%  69.2ns ± 1%  ~       (p=0.669 n=19+13)
RegexpMatchEasy1_1K-12    365ns ± 1%   367ns ± 1%   +0.49%  (p=0.000 n=19+19)
RegexpMatchMedium_32-12   104ns ± 1%   105ns ± 1%   +1.33%  (p=0.000 n=16+20)
RegexpMatchMedium_1K-12   33.6µs ± 3%  34.1µs ± 4%  +1.67%  (p=0.001 n=20+20)
RegexpMatchHard_32-12     1.67µs ± 1%  1.62µs ± 1%  -2.78%  (p=0.000 n=18+17)
RegexpMatchHard_1K-12     50.3µs ± 2%  48.7µs ± 1%  -3.09%  (p=0.000 n=19+18)
Revcomp-12                384ms ± 0%   386ms ± 0%   +0.59%  (p=0.000 n=19+19)
Template-12               61.1ms ± 1%  60.5ms ± 1%  -1.02%  (p=0.000 n=19+20)
TimeParse-12              307ns ± 0%   303ns ± 1%   -1.23%  (p=0.000 n=19+15)
TimeFormat-12             323ns ± 0%   323ns ± 0%   -0.12%  (p=0.011 n=15+20)
[Geo mean]                47.1µs       47.0µs       -0.20%

https://perf.golang.org/search?q=upload:20171030.4

It slightly improves the performance of the x/benchmarks:

name                        old time/op  new time/op  delta
Garbage/benchmem-MB=1024-12 2.29ms ± 3%  2.22ms ± 2%  -2.97%  (p=0.000 n=18+18)
Garbage/benchmem-MB=64-12   2.24ms ± 2%  2.21ms ± 2%  -1.64%  (p=0.000 n=18+18)
HTTP-12                     12.6µs ± 1%  12.6µs ± 1%  ~       (p=0.690 n=19+17)
JSON-12                     11.3ms ± 2%  11.3ms ± 1%  ~       (p=0.163 n=17+18)

and fixes some of the heap size bloat caused by the previous commit:

name                        old peak-RSS-bytes  new peak-RSS-bytes  delta
Garbage/benchmem-MB=1024-12 1.88G ± 2%          1.77G ± 2%          -5.52%  (p=0.000 n=20+18)
Garbage/benchmem-MB=64-12   248M ± 8%           226M ± 5%           -8.93%  (p=0.000 n=20+20)
HTTP-12                     47.0M ±27%          47.2M ±12%          ~       (p=0.512 n=20+20)
JSON-12                     206M ±11%           206M ±10%           ~       (p=0.841 n=20+20)

https://perf.golang.org/search?q=upload:20171030.5

Combined with the change to add a soft goal in the previous commit, this achieves a decent performance improvement on the garbage benchmark:

name                        old time/op  new time/op  delta
Garbage/benchmem-MB=1024-12 2.40ms ± 4%  2.22ms ± 2%  -7.40%  (p=0.000 n=19+18)
Garbage/benchmem-MB=64-12   2.23ms ± 1%  2.21ms ± 2%  -1.06%  (p=0.000 n=19+18)
HTTP-12                     12.5µs ± 1%  12.6µs ± 1%  ~       (p=0.330 n=20+17)
JSON-12                     11.1ms ± 1%  11.3ms ± 1%  +1.87%  (p=0.000 n=16+18)

https://perf.golang.org/search?q=upload:20171030.6

Change-Id: If04ddb57e1e58ef2fb9eec54c290eb4ae4bea121
Reviewed-on: https://go-review.googlesource.com/59971
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
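The CPU split in that commit reduces to two constants (these mirror the description above; treat the framing as illustrative rather than a quote of the source): dedicated background workers stay at 25% of CPU, while the controller targets 30% total, so roughly 5% of CPU arrives as assists and the controller sees error on both sides of its goal.

```go
package pacer

const (
	gcBackgroundUtilization = 0.25 // fixed-size dedicated mark workers
	gcGoalUtilization       = 0.30 // total GC CPU goal: background + assists
)
```

Because measured utilization can now land above or below the goal instead of bottoming out at the background floor, the proportional trigger controller no longer saturates, which is what restores the low variability in heap growth.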
Commit 03eb948 fixed this issue (not sure why I only marked it "updates"). I have a design doc that I hope to clean up and publish under this issue in the near future.

Change https://golang.org/cl/122796 mentions this issue.
This is a design document I wrote a long time ago but hadn't published yet (this change was released in Go 1.10).

Updates golang/go#14951.

Change-Id: Ib5d648eede153d68af503eaada455069c311e27e
Reviewed-on: https://go-review.googlesource.com/122796
Reviewed-by: Ian Lance Taylor <iant@golang.org>
This adds an endpoint to the trace tool that plots the minimum mutator utilization curve using information on mark assists and GC pauses from the trace.

This commit implements a fairly straightforward O(nm) algorithm for computing the MMU (and tests against an even more direct but slower algorithm). Future commits will extend and optimize this algorithm.

This should be useful for debugging and understanding mutator utilization issues like #14951, #14812, #18155, #18534, and #21107, particularly once follow-up CLs add trace cross-referencing.

Change-Id: Ic2866869e7da1e6c56ba3e809abbcb2eb9c4923a
Reviewed-on: https://go-review.googlesource.com/c/60790
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Hyang-Ah Hana Kim <hyangah@gmail.com>
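To make the MMU metric concrete, here is a direct sketch over uniformly sampled mutator utilization (a hypothetical simplification: the trace tool works from exact assist and pause intervals, not uniform samples). `util[i]` is the fraction of CPU the mutator had during sample i; `window` is the window length in samples; the result is the worst mean utilization over any window.

```go
package pacer

// mmu computes the minimum mutator utilization for a given window size
// by sliding a window across the samples and tracking the worst mean.
func mmu(util []float64, window int) float64 {
	min, sum := 1.0, 0.0
	for i, u := range util {
		sum += u
		if i >= window {
			sum -= util[i-window] // slide the window forward
		}
		if i >= window-1 {
			if m := sum / float64(window); m < min {
				min = m
			}
		}
	}
	return min
}
```

Plotting mmu over a range of window sizes gives the MMU curve: small windows expose pauses and assist bursts (they drive the minimum toward zero), while large windows converge toward mean utilization.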
@RLH and I have been discussing this problem for the past week. This issue is to track our thoughts.
The current GC pacer is conservative about scheduling mutator assists and this leads to a cycle where we over-assist (use too much CPU for GC) and under-shoot the heap goal.
Specifically, GC currently assumes it may have to scan the entire live heap before the live heap size reaches the heap goal. It sets the assist ratio accordingly. If the reachable heap is growing, this conservative assumption is necessary to prevent the mutator from outpacing the garbage collector and spiraling the heap size up. However, in steady state (where the reachable heap size is roughly the same as it was the last GC cycle), this means we'll finish garbage collection when we're less than 100/(GOGC+100) of the way from the trigger to the goal (1/2 way for GOGC=100; 1/4 way for GOGC=300; etc).
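Making that fraction concrete (a sketch of the arithmetic, not runtime code): with GOGC=g the goal is (1+g/100) times the live heap, but in steady state only the live heap actually needs scanning, so pacing that assumes the whole heap is live finishes after a 100/(g+100) fraction of the trigger-to-goal runway.

```go
package pacer

// finishFraction is how far between the trigger and the goal GC
// finishes in steady state under the conservative pacing assumption.
func finishFraction(gogc float64) float64 {
	return 100 / (gogc + 100)
}

// finishFraction(100) == 0.5  -> half way  (GOGC=100)
// finishFraction(300) == 0.25 -> a quarter of the way (GOGC=300)
```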
You might think the trigger controller would adapt by setting the GC trigger lower in response to the over-utilization, but in fact this drives the trigger controller into a state of persistent CPU over-utilization and heap under-shoot. Currently, the trigger controller will leave the trigger where it is if we're anywhere on the white edge between CPU utilization and heap over-shoot shown below:

[figure: trade-off curve between GC CPU utilization and heap over-shoot]
The under-shoot caused by the over-assist forces the trigger controller to the left of the ideal point on this curve. Since this persists, we eventually settle well to the left of the ideal, repeatedly setting the trigger too close to the goal, which leads to the next cycle setting a high assist ratio and again finishing early.
This is particularly visible in applications with a high allocation rate and/or high GOGC.