runtime: mutator assists are over-aggressive, especially at high GOGC #14951

Closed

aclements opened this issue Mar 24, 2016 · 15 comments
@aclements
Member

@RLH and I have been discussing this problem for the past week. This issue is to track our thoughts.

The current GC pacer is conservative about scheduling mutator assists and this leads to a cycle where we over-assist (use too much CPU for GC) and under-shoot the heap goal.

Specifically, GC currently assumes it may have to scan the entire live heap before the live heap size reaches the heap goal. It sets the assist ratio accordingly. If the reachable heap is growing, this conservative assumption is necessary to prevent the mutator from outpacing the garbage collector and spiraling the heap size up. However, in steady state (where the reachable heap size is roughly the same as it was the last GC cycle), this means we'll finish garbage collection when we're less than 100/(GOGC+100) of the way from the trigger to the goal (1/2 way for GOGC=100; 1/4 way for GOGC=300; etc).
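
To make that arithmetic concrete, here's a tiny back-of-the-envelope sketch (illustrative only; this is not the runtime's pacer code and the names are made up). It shows why the conservative assumption finishes the cycle at roughly 100/(GOGC+100) of the trigger-to-goal distance in steady state:

package main

import "fmt"

// finishFraction estimates how far between the trigger and the goal a GC
// cycle finishes in steady state if assists are paced for scanning the
// entire heap rather than just the live heap. Illustrative arithmetic only.
func finishFraction(gogc float64) float64 {
	live := 1.0                   // live heap at the start of the cycle (normalized)
	goal := live * (1 + gogc/100) // heap goal per GOGC
	trigger := 0.9 * goal         // any trigger below the goal; it cancels out

	assumedWork := goal // conservative: everything up to the goal might need scanning
	actualWork := live  // steady state: only the live heap actually needs scanning

	assistRatio := assumedWork / (goal - trigger) // scan bytes required per byte allocated
	allocAllowed := actualWork / assistRatio      // allocation permitted before scanning finishes
	return allocAllowed / (goal - trigger)        // fraction of the trigger->goal distance used
}

func main() {
	for _, gogc := range []float64{100, 300} {
		fmt.Printf("GOGC=%.0f: GC finishes %.2f of the way from trigger to goal\n",
			gogc, finishFraction(gogc))
	}
}

This prints 0.50 for GOGC=100 and 0.25 for GOGC=300, matching the fractions above; the choice of trigger cancels out.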

[figure: overassist]

You might think the trigger controller would adapt by setting the GC trigger lower in response to the over-utilization, but in fact, this drives the trigger controller into a state of persistent CPU over-utilization and heap under-shoot. Currently, the trigger controller will leave the trigger where it is if we're anywhere on the white edge between CPU utilization and heap over-shoot shown below:

[figure: trigger]

The under-shoot caused by the over-assist forces the trigger controller to the left of the ideal point on this curve. Since this persists, we eventually settle well to the left of the ideal, repeatedly setting the trigger too close to the goal, which leads to the next cycle setting a high assist ratio and again finishing early.
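
For reference, the cross-cycle feedback in question is roughly a damped proportional controller along the lines of the sketch below (a simplification with assumed names, not the actual runtime code): the trigger ratio for the next cycle moves based on how far heap growth and GC CPU utilization ended up from their goals.

// nextTriggerRatio is a rough sketch of the proportional trigger feedback
// described above; the exact error term and the names are approximations.
func nextTriggerRatio(triggerRatio, goalGrowth, actualGrowth, util, goalUtil float64) float64 {
	const triggerGain = 0.5 // damping for the proportional response
	// How far the trigger was from where it "should" have been, scaling the
	// actual heap growth by how far off the CPU-utilization goal we were.
	err := goalGrowth - triggerRatio - util/goalUtil*(actualGrowth-triggerRatio)
	return triggerRatio + triggerGain*err
}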

This is particularly visible in applications with a high allocation rate and/or high GOGC.

@aclements
Member Author

Some approaches we've taken or thought about:

  1. Multiply the assist ratio by an "aggressiveness" that starts out 0 at the trigger and grows linearly to 1 at the goal. The idea is that this would damp the assists early in the cycle, let the background mark workers get ahead, and hopefully finish closer to the goal. This didn't work well because allocation proceeded so quickly early in the GC cycle without the assists to slow it down that the assist ratio quickly ramped up anyway.
  2. A variation of this was to set the assist ratio to 0 until live heap reached half-way between the trigger and the goal. This had the same problem because allocation was completely unimpeded.
  3. Let GC go into "overtime" if the heap is growing. When live heap is between the trigger and the goal, compute the assist ratio assuming we only have to scan the same amount of heap as the last GC cycle marked. If we pass the goal, set an "overtime" goal to something like 2*goal and set the assist ratio assuming we have to scan the whole heap by the time we reach the overtime goal.
  4. Set (or cap) the assist ratio to 2 (scan 2 bytes for every 1 byte allocated). The idea is to eliminate the within-cycle feedback loop, allow GC to pass the goal, and use this feedback to adjust the trigger between cycles. In principle, the cross-cycle feedback loop should adjust in steady state so that the GC stops passing the goal after a few cycles. By setting the assist ratio to > 1, we still bound the total size the heap can grow to during a cycle. Setting it to 2 specifically bounds it to 2*trigger (see the sketch after this list).
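
A minimal sketch of what approach 4 amounts to (illustrative names only; this is not the runtime's pacer code):

// assistRatioCapped computes the usual scan-work-per-allocated-byte ratio but
// caps it at 2, so assists never demand more than 2 bytes of scanning per
// byte allocated. Sketch of approach 4 above.
func assistRatioCapped(scanWorkRemaining, heapDistanceToGoal float64) float64 {
	const maxAssistRatio = 2
	if heapDistanceToGoal <= 0 {
		return maxAssistRatio // at or past the goal: just use the cap
	}
	r := scanWorkRemaining / heapDistanceToGoal
	if r > maxAssistRatio {
		r = maxAssistRatio
	}
	return r
}

Since each allocated byte must pay for two bytes of scanning, allocation during the cycle is bounded by half of the total scan work, which is where the 2*trigger bound above comes from.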

@randall77
Contributor

I'm not sure I understand the desire to hit the heap goal if the live set is growing. If the live set is growing, we're going to set a higher heap goal for the next cycle. Why not go over the heap goal this cycle in that case?

At the last GC, assume we had a live set of size L, so we set the heap goal to be 2L (for GOGC=100). Just set the trigger & assist ratio so that we scan L bytes by the time total allocation is up to 2L. If we reach a total allocation of 2L and scanning still isn't done, then we know the live set is larger than L and we can grow the heap (i.e. miss the heap goal) with a clean conscience.
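
As a small illustrative calculation (not runtime code; the names are made up), that proposal sizes assists as:

// proposedAssistRatio sizes assists so that scanning last cycle's live set L
// completes just as total allocation reaches the goal (2L for GOGC=100).
func proposedAssistRatio(liveLast, trigger, goal float64) float64 {
	return liveLast / (goal - trigger) // scan bytes per allocated byte
}

For example, with L = 64 MB, a 128 MB goal, and a 115 MB trigger this gives about 4.9 scan bytes per allocated byte; if scanning still isn't done when allocation reaches 128 MB, the live set must be larger than L.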

@aclements
Member Author

@randall77, this is basically the idea behind proposal 3. However, it's important to have the more aggressive fallback if the heap is actually growing. In fact, in the original pacer design, I set the assist ratio to target scanning the expected reachable heap (the marked heap from last cycle) by the heap goal and this let aggressive mutators spiral the heap size out of control. I don't remember the exact details of this failure mode, but it may have happened if the system got into a state where the assist ratio was less than 1 (or only slightly larger than 1).

aclements added this to the Unplanned milestone Mar 27, 2016
@gopherbot
Contributor

CL https://golang.org/cl/21324 mentions this issue.

@RLH
Contributor

RLH commented Mar 30, 2016

These numbers show various approaches to dealing with increasing CPU availability
for the mutators. While the numbers have not led to an obvious solution, they are
valuable in understanding the issue. As a baseline I ran the garbage benchmark on
1.6. Whatever solution we come up with should have comparable throughput to 1.6.

The benchMasterMar28 runs show the performance of tip on March 28, which is
approximately 9% faster than 1.6. The dev:garbage branch is an additional 3% on top
of the 9%. The bottom line is that we have 3% of wiggle room in trading GC throughput
for increased concurrency.

If we change benchMasterMar28 to allocate black during the GC cycle, the result is
not significantly different from allocating white, because the assists are so high
that mutator availability and allocation during the cycle are very low.

Here is the timing grid that considers throughput.

The 1.6 baseline is 2.70. Master is 2.45. The dev:garbage branch is 2.352, 3% better
than Master. The large improvement seen between allocating white and allocating
black for the scan-assist schemes is significant, and we are investigating it.

The fixed-ratio schemes scan either one word or two words for each word
allocated.

              White    Black
Master         2.45     2.41
Scan1for1      3.78     2.92
Scan2for1      3.07     2.61

Here is the grid from a gctrace sampling somewhere near 3.5 seconds. The traces nearby
all exhibited similar data.
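
For reference, the gctrace line format of that era (per the runtime package's GODEBUG=gctrace=1 documentation) is roughly:

gc # @#s #%: #+#+# ms clock, #+#/#/#+# ms cpu, #->#-># MB, # MB goal, # P

where the three clock/CPU phases are STW sweep termination, concurrent mark/scan, and STW mark termination; the mark/scan CPU time is broken down into assist, background, and idle GC time; and the three heap sizes are the heap at GC start, the heap at GC end, and the live (marked) heap.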

Master White
gc 63 @3.520s 25%: 0.062+20+0.42 ms clock, 0.12+143/62/0+0.84 ms cpu, 122->125->70 MB, 127 MB goal, 12 P

Master Black
gc 66 @3.600s 24%: 0.20+20+0.63 ms clock, 0.41+132/60/0+1.2 ms cpu, 132->135->68 MB, 139 MB goal, 12 P

Scan1for1 Black
gc 53 @3.349s 21%: 0.032+40+0.29 ms clock, 0.096+64/120/0+0.87 ms cpu, 129->150->84 MB, 132 MB goal, 12 P

Scan2for1 Black
gc 62 @3.584s 22%: 0.015+30+0.44 ms clock, 0.18+91/89/0+5.3 ms cpu, 125->138->75 MB, 128 MB goal, 12 P

Scan1for1 White
gc 45 @3.573s 22%: 0.10+75+0.44 ms clock, 0.31+23/222/0+1.3 ms cpu, 117->154->95 MB, 118 MB goal, 12 P

Scan2for1 White
gc 57 @3.487s 24%: 0.17+33+0.66 ms clock, 0.34+99/100/14+1.3 ms cpu, 128->142->79 MB, 132 MB goal, 12 P

Time spent doing assists dropped when we went to either the 2-for-1 or the 1-for-1
fixed-ratio approach. At the same time the length of the GC significantly increased over
baseline, and, not surprisingly, the 2-for-1 was shorter than the 1-for-1 due to increased
assists.

Allocating black shortened the GC since the amount of scan work decreases.

The more of the CPUs we make available to the mutators during a GC, the more advantage
there is in allocating black, probably because there is less to scan and less time with
the write barrier on. The literature argues for allocating white to decrease floating
garbage. These numbers do not support that conclusion.

We missed the target heap size in all of the fixed-ratio tests, sometimes by quite a bit.
The 2-for-1 approaches used 25% of the CPU for the GC proper and 25% for the assists, so
even with 50% of the CPUs dedicated to the GC we were unable to meet our heap size goals.

CL 21287 shows where to make the changes if one wishes to duplicate these numbers.

Here is the raw data from the various runs.

cd benchmarks/bench/bin
bench1.6 bench=garbage
GOPERF-METRIC:allocated=2801684
GOPERF-METRIC:allocs=68905
GOPERF-METRIC:cputime=31848608
GOPERF-METRIC:gc-pause-one=704841
GOPERF-METRIC:gc-pause-total=31153
GOPERF-METRIC:rss=193101824
GOPERF-METRIC:sys-gc=5554176
GOPERF-METRIC:sys-heap=170098688
GOPERF-METRIC:sys-other=15517896
GOPERF-METRIC:sys-stack=884736
GOPERF-METRIC:sys-total=192055496
GOPERF-METRIC:time=2699980
GOPERF-METRIC:virtual-mem=763604992
GOPERF-FILE:cpuprof=/tmp/46.prof.txt
GOPERF-FILE:memprof=/tmp/47.prof.txt
gc 47 @3.017s 25%: 0.079+24+0.62 ms clock, 0.15+166/72/0.068+1.2 ms cpu, 128->130->67 MB, 132 MB goal, 12 P
gc 48 @3.087s 25%: 0.10+26+0.55 ms clock, 0.20+179/77/0.092+1.1 ms cpu, 127->130->69 MB, 133 MB goal, 12 P
gc 49 @3.161s 25%: 0.086+27+0.70 ms clock, 0.17+184/76/0+1.4 ms cpu, 130->133->65 MB, 136 MB goal, 12 P
gc 50 @3.237s 25%: 0.12+25+0.72 ms clock, 0.25+169/71/0+1.4 ms cpu, 126->128->69 MB, 129 MB goal, 12 P
gc 51 @3.312s 25%: 0.21+24+0.52 ms clock, 0.42+166/69/2.8+1.0 ms cpu, 134->135->67 MB, 137 MB goal, 12 P
gc 52 @3.384s 25%: 0.10+22+0.59 ms clock, 0.20+164/65/0.021+1.1 ms cpu, 131->133->63 MB, 135 MB goal, 12 P
gc 53 @3.454s 25%: 0.10+23+0.55 ms clock, 0.20+169/67/0.067+1.1 ms cpu, 123->125->66 MB, 127 MB goal, 12 P
gc 54 @3.522s 25%: 0.085+23+0.49 ms clock, 0.17+148/69/0.047+0.99 ms cpu, 127->129->65 MB, 132 MB goal, 12 P
gc 55 @3.591s 25%: 0.085+25+0.63 ms clock, 0.17+153/74/0.022+1.2 ms cpu, 123->126->63 MB, 128 MB goal, 12 P
gc 56 @3.663s 25%: 0.10+21+0.55 ms clock, 0.20+162/60/0.047+1.1 ms cpu, 121->123->62 MB, 124 MB goal, 12 P
gc 57 @3.730s 25%: 0.12+23+0.58 ms clock, 0.25+182/67/0.15+1.1 ms cpu, 122->123->64 MB, 125 MB goal, 12 P
gc 58 @3.799s 25%: 0.14+24+0.66 ms clock, 0.28+177/70/0.004+1.3 ms cpu, 125->127->66 MB, 129 MB goal, 12 P
gc 59 @3.866s 25%: 0.083+25+0.78 ms clock, 0.16+175/74/0+1.5 ms cpu, 125->128->66 MB, 132 MB goal, 12 P
gc 60 @3.937s 25%: 0.65+31+0.50 ms clock, 1.3+171/93/0+1.0 ms cpu, 121->124->68 MB, 128 MB goal, 12 P

./benchMasterMar28 -bench=garbage
GOPERF-METRIC:allocated=2780204
GOPERF-METRIC:allocs=68624
GOPERF-METRIC:cputime=28810325
GOPERF-METRIC:gc-pause-one=575984
GOPERF-METRIC:gc-pause-total=25343
GOPERF-METRIC:rss=196517888
GOPERF-METRIC:sys-gc=5629952
GOPERF-METRIC:sys-heap=173768704
GOPERF-METRIC:sys-other=11366928
GOPERF-METRIC:sys-stack=884736
GOPERF-METRIC:sys-total=191650320
GOPERF-METRIC:time=2452319
GOPERF-METRIC:virtual-mem=775798784
GOPERF-FILE:cpuprof=/tmp/46.prof.txt
GOPERF-FILE:memprof=/tmp/47.prof.txt
gc 54 @3.010s 25%: 0.025+18+0.46 ms clock, 0.10+135/53/0.072+1.8 ms cpu, 122->124->65 MB, 125 MB goal, 12 P
gc 55 @3.068s 25%: 0.031+19+0.39 ms clock, 0.063+142/56/0.066+0.79 ms cpu, 127->129->67 MB, 130 MB goal, 12 P
gc 56 @3.127s 25%: 0.036+18+0.44 ms clock, 0.073+134/56/0+0.88 ms cpu, 130->132->65 MB, 135 MB goal, 12 P
gc 57 @3.182s 25%: 0.026+18+0.49 ms clock, 0.15+125/53/0+2.9 ms cpu, 124->127->62 MB, 129 MB goal, 12 P
gc 58 @3.236s 25%: 0.037+18+0.44 ms clock, 0.34+107/54/19+4.0 ms cpu, 118->121->63 MB, 123 MB goal, 12 P
gc 59 @3.291s 25%: 0.017+18+0.48 ms clock, 0.21+129/54/4.1+5.8 ms cpu, 122->123->65 MB, 125 MB goal, 12 P
gc 60 @3.351s 25%: 0.042+18+0.58 ms clock, 0.085+141/54/0.006+1.1 ms cpu, 128->129->66 MB, 131 MB goal, 12 P
gc 61 @3.407s 25%: 0.045+18+0.48 ms clock, 0.090+143/54/0+0.97 ms cpu, 128->129->66 MB, 133 MB goal, 12 P
gc 62 @3.465s 25%: 0.35+17+0.39 ms clock, 0.70+130/53/0+0.79 ms cpu, 128->130->64 MB, 133 MB goal, 12 P
gc 63 @3.520s 25%: 0.062+20+0.42 ms clock, 0.12+143/62/0+0.84 ms cpu, 122->125->70 MB, 127 MB goal, 12 P
gc 64 @3.578s 25%: 0.038+19+0.40 ms clock, 0.077+129/57/0+0.80 ms cpu, 130->133->64 MB, 137 MB goal, 12 P
gc 65 @3.634s 25%: 0.032+18+0.23 ms clock, 0.19+122/54/11+1.4 ms cpu, 123->125->64 MB, 126 MB goal, 12 P
gc 66 @3.691s 26%: 0.020+18+0.29 ms clock, 0.12+145/55/2.2+1.7 ms cpu, 126->128->69 MB, 129 MB goal, 12 P
gc 67 @3.750s 26%: 0.035+18+0.43 ms clock, 0.071+130/54/0+0.87 ms cpu, 133->135->64 MB, 138 MB goal, 12 P
gc 68 @3.805s 26%: 0.027+18+0.89 ms clock, 0.13+134/54/0+4.4 ms cpu, 121->123->65 MB, 127 MB goal, 12 P
gc 69 @3.861s 26%: 0.19+19+0.63 ms clock, 0.96+135/55/0.063+3.1 ms cpu, 124->127->67 MB, 129 MB goal, 12 P
gc 70 @3.920s 26%: 0.027+19+0.60 ms clock, 0.10+139/57/0+2.4 ms cpu, 127->130->67 MB, 131 MB goal, 12 P
gc 71 @3.979s 26%: 0.077+19+0.45 ms clock, 0.69+138/56/0.073+4.0 ms cpu, 129->131->68 MB, 132 MB goal, 12 P

./benchAllocatBlackOrig -bench=garbage
GOPERF-METRIC:allocated=2780656
GOPERF-METRIC:allocs=68649
GOPERF-METRIC:cputime=28378118
GOPERF-METRIC:gc-pause-one=606225
GOPERF-METRIC:gc-pause-total=26552
GOPERF-METRIC:rss=190029824
GOPERF-METRIC:sys-gc=5603328
GOPERF-METRIC:sys-heap=173146112
GOPERF-METRIC:sys-other=11368768
GOPERF-METRIC:sys-stack=786432
GOPERF-METRIC:sys-total=190904640
GOPERF-METRIC:time=2414792
GOPERF-METRIC:virtual-mem=814186496
GOPERF-FILE:cpuprof=/tmp/46.prof.txt
GOPERF-FILE:memprof=/tmp/47.prof.txt

gc 56 @3.044s 24%: 0.027+18+0.25 ms clock, 0.081+128/53/0.46+0.75 ms cpu, 121->123->65 MB, 125 MB goal, 12 P
gc 57 @3.099s 24%: 0.026+18+0.41 ms clock, 0.20+129/53/0.063+3.3 ms cpu, 125->127->64 MB, 129 MB goal, 12 P
gc 58 @3.155s 24%: 0.024+16+0.27 ms clock, 0.19+123/49/13+2.2 ms cpu, 125->126->65 MB, 128 MB goal, 12 P
gc 59 @3.210s 24%: 0.021+18+0.25 ms clock, 0.14+133/54/0+1.8 ms cpu, 127->129->66 MB, 130 MB goal, 12 P
gc 60 @3.266s 24%: 0.025+18+0.45 ms clock, 0.050+135/54/0+0.90 ms cpu, 127->129->66 MB, 132 MB goal, 12 P
gc 61 @3.323s 24%: 0.022+17+0.41 ms clock, 0.22+128/52/0+4.1 ms cpu, 127->129->64 MB, 132 MB goal, 12 P
gc 62 @3.377s 24%: 0.027+16+0.81 ms clock, 0.32+123/49/0+9.7 ms cpu, 124->126->62 MB, 128 MB goal, 12 P
gc 63 @3.431s 24%: 0.023+17+0.57 ms clock, 0.16+127/52/0.023+3.9 ms cpu, 120->121->63 MB, 123 MB goal, 12 P
gc 64 @3.486s 24%: 0.028+18+0.59 ms clock, 0.11+134/55/0.051+2.3 ms cpu, 122->124->66 MB, 127 MB goal, 12 P
gc 65 @3.541s 24%: 0.033+21+0.36 ms clock, 0.23+139/63/0+2.5 ms cpu, 124->128->71 MB, 131 MB goal, 12 P
gc 66 @3.600s 24%: 0.20+20+0.63 ms clock, 0.41+132/60/0+1.2 ms cpu, 132->135->68 MB, 139 MB goal, 12 P
gc 67 @3.659s 24%: 0.036+17+0.44 ms clock, 0.073+128/52/0+0.88 ms cpu, 129->131->63 MB, 133 MB goal, 12 P
gc 68 @3.714s 24%: 0.030+19+0.48 ms clock, 0.061+135/55/1.0+0.96 ms cpu, 124->125->67 MB, 127 MB goal, 12 P
gc 69 @3.774s 24%: 0.029+18+0.53 ms clock, 0.17+137/54/0+3.1 ms cpu, 130->132->66 MB, 133 MB goal, 12 P
gc 70 @3.831s 24%: 0.025+17+0.42 ms clock, 0.10+135/52/0.024+1.7 ms cpu, 129->131->66 MB, 133 MB goal, 12 P
gc 71 @3.887s 24%: 0.026+19+0.52 ms clock, 0.10+122/56/0+2.0 ms cpu, 126->129->64 MB, 131 MB goal, 12 P
gc 72 @3.943s 24%: 0.017+19+0.28 ms clock, 0.17+115/56/16+2.8 ms cpu, 122->125->67 MB, 127 MB goal, 12 P

./benchDev -bench=garbage # March 28 best numbers using the dev branch with Ctz optimizations.
GOPERF-METRIC:allocated=3074068
GOPERF-METRIC:allocs=75288
GOPERF-METRIC:cputime=27570650
GOPERF-METRIC:gc-pause-one=626263
GOPERF-METRIC:gc-pause-total=27555
GOPERF-METRIC:rss=191041536
GOPERF-METRIC:sys-gc=5519360
GOPERF-METRIC:sys-heap=171081728
GOPERF-METRIC:sys-other=12169368
GOPERF-METRIC:sys-stack=819200
GOPERF-METRIC:sys-total=189589656
GOPERF-METRIC:time=2351587
GOPERF-METRIC:virtual-mem=805724160
GOPERF-FILE:cpuprof=/tmp/46.prof.txt
GOPERF-FILE:memprof=/tmp/47.prof.txt

gc 53 @3.056s 23%: 0.031+17+0.38 ms clock, 0.15+132/52/0+1.9 ms cpu, 125->126->64 MB, 128 MB goal, 12 P
gc 54 @3.110s 23%: 0.20+17+0.39 ms clock, 0.41+120/50/4.2+0.78 ms cpu, 125->126->62 MB, 128 MB goal, 12 P
gc 55 @3.163s 23%: 0.024+20+0.45 ms clock, 0.19+142/59/0.050+3.6 ms cpu, 119->122->70 MB, 123 MB goal, 12 P
gc 56 @3.221s 23%: 0.23+21+0.22 ms clock, 0.46+140/53/0+0.45 ms cpu, 131->134->69 MB, 138 MB goal, 12 P
gc 57 @3.279s 23%: 0.71+19+0.54 ms clock, 2.8+141/57/0+2.1 ms cpu, 130->132->67 MB, 135 MB goal, 12 P
gc 58 @3.337s 23%: 0.17+21+0.55 ms clock, 1.4+143/62/0+4.4 ms cpu, 130->132->68 MB, 133 MB goal, 12 P
gc 59 @3.397s 23%: 0.045+17+0.35 ms clock, 0.090+133/52/0+0.70 ms cpu, 133->135->63 MB, 137 MB goal, 12 P
gc 60 @3.452s 23%: 0.033+17+0.58 ms clock, 0.16+129/50/2.6+2.9 ms cpu, 124->125->63 MB, 127 MB goal, 12 P
gc 61 @3.505s 23%: 0.46+18+0.50 ms clock, 1.4+130/55/0.075+1.5 ms cpu, 122->124->63 MB, 126 MB goal, 12 P
gc 62 @3.560s 23%: 0.15+20+0.61 ms clock, 0.92+120/59/0+3.6 ms cpu, 121->124->66 MB, 126 MB goal, 12 P
gc 63 @3.617s 23%: 0.063+20+0.48 ms clock, 0.18+134/58/0.050+1.4 ms cpu, 124->127->65 MB, 129 MB goal, 12 P
gc 64 @3.673s 24%: 0.019+20+0.44 ms clock, 0.21+142/58/0+4.9 ms cpu, 123->125->69 MB, 127 MB goal, 12 P
gc 65 @3.732s 24%: 0.16+20+0.49 ms clock, 0.33+141/58/0.001+0.99 ms cpu, 133->135->67 MB, 137 MB goal, 12 P
gc 66 @3.798s 24%: 0.30+17+0.82 ms clock, 0.60+98/50/8.2+1.6 ms cpu, 132->134->58 MB, 135 MB goal, 12 P
gc 67 @3.851s 24%: 0.029+18+0.54 ms clock, 0.20+115/55/8.2+3.7 ms cpu, 114->116->64 MB, 117 MB goal, 12 P
gc 68 @3.905s 24%: 0.19+19+0.60 ms clock, 1.1+119/55/0.95+3.6 ms cpu, 122->125->65 MB, 128 MB goal, 12 P
gc 69 @3.960s 24%: 0.11+20+0.64 ms clock, 0.83+126/61/0+4.4 ms cpu, 122->125->65 MB, 128 MB goal, 12 P

./benchAllocatBlackScan1for1 -bench=garbage
GOPERF-METRIC:allocated=2780205
GOPERF-METRIC:allocs=68623
GOPERF-METRIC:cputime=34221560
GOPERF-METRIC:gc-pause-one=735410
GOPERF-METRIC:gc-pause-total=32505
GOPERF-METRIC:rss=220323840
GOPERF-METRIC:sys-gc=10213376
GOPERF-METRIC:sys-heap=193134592
GOPERF-METRIC:sys-other=11352352
GOPERF-METRIC:sys-stack=786432
GOPERF-METRIC:sys-total=215486752
GOPERF-METRIC:time=2920883
GOPERF-METRIC:virtual-mem=943915008
GOPERF-FILE:cpuprof=/tmp/46.prof.txt
GOPERF-FILE:memprof=/tmp/47.prof.txt
gc 48 @3.016s 21%: 0.56+34+0.29 ms clock, 2.2+57/101/0+1.1 ms cpu, 128->144->73 MB, 131 MB goal, 12 P
gc 49 @3.077s 21%: 0.67+42+0.23 ms clock, 5.3+51/126/0+1.8 ms cpu, 112->135->85 MB, 115 MB goal, 12 P
gc 50 @3.145s 21%: 0.023+41+0.28 ms clock, 0.23+35/122/13+2.8 ms cpu, 125->148->83 MB, 128 MB goal, 12 P
gc 51 @3.212s 21%: 0.12+43+0.38 ms clock, 1.0+21/128/38+3.0 ms cpu, 121->146->87 MB, 124 MB goal, 12 P
gc 52 @3.281s 21%: 0.017+40+0.25 ms clock, 0.21+62/106/25+3.1 ms cpu, 125->145->84 MB, 128 MB goal, 12 P
gc 53 @3.349s 21%: 0.032+40+0.29 ms clock, 0.096+64/120/0+0.87 ms cpu, 129->150->84 MB, 132 MB goal, 12 P
gc 54 @3.417s 21%: 0.43+41+0.29 ms clock, 5.1+47/124/0+3.5 ms cpu, 124->147->83 MB, 128 MB goal, 12 P
gc 55 @3.483s 21%: 0.16+34+0.28 ms clock, 1.6+25/95/61+2.8 ms cpu, 120->135->77 MB, 123 MB goal, 12 P
gc 56 @3.548s 21%: 0.021+43+0.27 ms clock, 0.085+58/124/0+1.0 ms cpu, 124->148->87 MB, 127 MB goal, 12 P
gc 57 @3.616s 21%: 1.0+40+0.30 ms clock, 12+64/121/0+3.6 ms cpu, 127->147->83 MB, 130 MB goal, 12 P
gc 58 @3.684s 21%: 0.65+37+0.33 ms clock, 5.2+61/111/0+2.6 ms cpu, 124->143->79 MB, 127 MB goal, 12 P
gc 59 @3.750s 21%: 0.028+38+0.27 ms clock, 0.22+60/103/0+2.1 ms cpu, 120->140->79 MB, 123 MB goal, 12 P
gc 60 @3.813s 21%: 7.0+35+0.30 ms clock, 21+74/103/0+0.91 ms cpu, 116->134->79 MB, 119 MB goal, 12 P
gc 61 @3.883s 21%: 0.031+38+0.29 ms clock, 0.22+64/114/0+2.0 ms cpu, 122->142->82 MB, 125 MB goal, 12 P
gc 62 @3.949s 21%: 0.39+38+0.24 ms clock, 1.1+70/115/0+0.73 ms cpu, 123->143->81 MB, 126 MB goal, 12 P

./benchAllocatBlackScan2for1 -bench=garbage
GOPERF-METRIC:allocated=2780980
GOPERF-METRIC:allocs=68669
GOPERF-METRIC:cputime=30555106
GOPERF-METRIC:gc-pause-one=710828
GOPERF-METRIC:gc-pause-total=30992
GOPERF-METRIC:rss=205619200
GOPERF-METRIC:sys-gc=6676480
GOPERF-METRIC:sys-heap=183631872
GOPERF-METRIC:sys-other=11679120
GOPERF-METRIC:sys-stack=851968
GOPERF-METRIC:sys-total=202839440
GOPERF-METRIC:time=2608209
GOPERF-METRIC:virtual-mem=1094815744
GOPERF-FILE:cpuprof=/tmp/46.prof.txt
GOPERF-FILE:memprof=/tmp/47.prof.txt
gc 53 @3.041s 21%: 3.2+28+0.31 ms clock, 13+103/84/0+1.2 ms cpu, 124->135->74 MB, 127 MB goal, 12 P
gc 54 @3.106s 21%: 0.18+28+0.41 ms clock, 0.37+103/85/0+0.83 ms cpu, 127->139->76 MB, 131 MB goal, 12 P
gc 55 @3.169s 21%: 0.026+30+0.25 ms clock, 0.18+83/88/20+1.8 ms cpu, 129->141->78 MB, 132 MB goal, 12 P
gc 56 @3.233s 22%: 0.024+28+0.25 ms clock, 0.14+121/81/0+1.5 ms cpu, 132->142->78 MB, 135 MB goal, 12 P
gc 57 @3.297s 22%: 0.030+20+0.60 ms clock, 0.060+97/61/0+1.2 ms cpu, 135->142->61 MB, 139 MB goal, 12 P
gc 58 @3.350s 22%: 0.036+28+0.24 ms clock, 0.072+97/81/0+0.49 ms cpu, 109->121->73 MB, 112 MB goal, 12 P
gc 59 @3.411s 22%: 0.15+24+0.46 ms clock, 0.95+101/71/0+2.8 ms cpu, 124->134->70 MB, 128 MB goal, 12 P
gc 60 @3.468s 22%: 0.025+25+0.44 ms clock, 0.17+84/76/0+3.0 ms cpu, 121->131->67 MB, 124 MB goal, 12 P
gc 61 @3.524s 22%: 0.030+27+0.58 ms clock, 0.060+80/82/17+1.1 ms cpu, 114->125->73 MB, 117 MB goal, 12 P
gc 62 @3.584s 22%: 0.015+30+0.44 ms clock, 0.18+91/89/0+5.3 ms cpu, 125->138->75 MB, 128 MB goal, 12 P
gc 63 @3.646s 22%: 0.024+30+0.43 ms clock, 0.14+81/89/1.0+2.6 ms cpu, 125->138->73 MB, 128 MB goal, 12 P
gc 64 @3.706s 22%: 0.16+26+0.42 ms clock, 0.32+93/78/0+0.84 ms cpu, 122->132->70 MB, 124 MB goal, 12 P
gc 65 @3.764s 22%: 0.013+26+0.49 ms clock, 0.16+69/78/22+5.9 ms cpu, 120->129->69 MB, 123 MB goal, 12 P
gc 66 @3.822s 22%: 0.017+26+0.49 ms clock, 0.17+92/79/0.001+4.9 ms cpu, 119->130->70 MB, 122 MB goal, 12 P
gc 67 @3.879s 22%: 5.7+26+0.52 ms clock, 11+121/77/0+1.0 ms cpu, 119->127->71 MB, 122 MB goal, 12 P
gc 68 @3.946s 22%: 0.028+26+0.31 ms clock, 0.056+104/72/0+0.63 ms cpu, 126->135->70 MB, 129 MB goal, 12 P

./benchAllocatWhiteScan1for1 -bench=garbage
GOPERF-METRIC:allocated=2780554
GOPERF-METRIC:allocs=68638
GOPERF-METRIC:cputime=44090842
GOPERF-METRIC:gc-pause-one=1437192
GOPERF-METRIC:gc-pause-total=66829
GOPERF-METRIC:rss=232046592
GOPERF-METRIC:sys-gc=7659520
GOPERF-METRIC:sys-heap=209879040
GOPERF-METRIC:sys-other=10847632
GOPERF-METRIC:sys-stack=753664
GOPERF-METRIC:sys-total=229139856
GOPERF-METRIC:time=3779232
GOPERF-METRIC:virtual-mem=766160896
GOPERF-FILE:cpuprof=/tmp/46.prof.txt
GOPERF-FILE:memprof=/tmp/47.prof.txt
gc 40 @3.072s 22%: 0.027+97+0.47 ms clock, 0.27+39/254/0+4.7 ms cpu, 110->161->109 MB, 112 MB goal, 12 P
gc 41 @3.182s 22%: 0.11+88+0.55 ms clock, 0.89+67/237/0.078+4.4 ms cpu, 116->163->103 MB, 119 MB goal, 12 P
gc 42 @3.284s 22%: 0.031+70+0.63 ms clock, 0.063+33/193/0+1.2 ms cpu, 114->148->86 MB, 116 MB goal, 12 P
gc 43 @3.375s 22%: 0.041+77+0.95 ms clock, 0.083+18/214/0+1.9 ms cpu, 105->140->90 MB, 107 MB goal, 12 P
gc 44 @3.472s 22%: 0.22+87+0.44 ms clock, 0.44+63/224/0+0.89 ms cpu, 109->152->100 MB, 112 MB goal, 12 P
gc 45 @3.573s 22%: 0.10+75+0.44 ms clock, 0.31+23/222/0+1.3 ms cpu, 117->154->95 MB, 118 MB goal, 12 P
gc 46 @3.667s 22%: 0.047+79+0.60 ms clock, 0.28+41/235/0+3.6 ms cpu, 115->160->106 MB, 117 MB goal, 12 P
gc 47 @3.762s 22%: 0.037+72+0.61 ms clock, 0.075+42/215/0+1.2 ms cpu, 121->157->92 MB, 124 MB goal, 12 P
gc 48 @3.855s 22%: 0.024+70+0.59 ms clock, 0.12+47/210/0+2.9 ms cpu, 113->150->89 MB, 116 MB goal, 12 P
gc 49 @3.941s 22%: 0.067+69+0.51 ms clock, 0.33+44/206/0+2.5 ms cpu, 106->144->95 MB, 109 MB goal, 12 P
gc 50 @4.026s 22%: 0.11+58+0.44 ms clock, 0.77+40/176/0+3.1 ms cpu, 114->145->83 MB, 117 MB goal, 12 P

./benchAllocatWhiteScan2for1 -bench=garbage
GOPERF-METRIC:allocated=2780262
GOPERF-METRIC:allocs=68629
GOPERF-METRIC:cputime=32769944
GOPERF-METRIC:gc-pause-one=861449
GOPERF-METRIC:gc-pause-total=37559
GOPERF-METRIC:rss=204873728
GOPERF-METRIC:sys-gc=6516736
GOPERF-METRIC:sys-heap=185761792
GOPERF-METRIC:sys-other=11133096
GOPERF-METRIC:sys-stack=819200
GOPERF-METRIC:sys-total=204230824
GOPERF-METRIC:time=3069541
GOPERF-METRIC:virtual-mem=719392768
GOPERF-FILE:cpuprof=/tmp/46.prof.txt
GOPERF-FILE:memprof=/tmp/47.prof.txt
gc 50 @3.020s 23%: 0.026+34+0.61 ms clock, 0.13+121/102/0+3.0 ms cpu, 122->136->83 MB, 125 MB goal, 12 P
gc 51 @3.088s 23%: 0.028+36+0.49 ms clock, 0.17+105/108/0+2.9 ms cpu, 138->154->78 MB, 141 MB goal, 12 P
gc 52 @3.155s 23%: 0.023+39+0.54 ms clock, 0.19+96/117/10+4.3 ms cpu, 123->142->83 MB, 127 MB goal, 12 P
gc 53 @3.224s 23%: 0.043+32+0.61 ms clock, 0.39+101/95/0+5.5 ms cpu, 130->143->75 MB, 133 MB goal, 12 P
gc 54 @3.287s 23%: 0.017+36+0.45 ms clock, 0.17+72/107/28+4.5 ms cpu, 123->138->76 MB, 126 MB goal, 12 P
gc 55 @3.352s 23%: 0.041+35+0.46 ms clock, 0.41+104/104/0+4.6 ms cpu, 121->136->78 MB, 124 MB goal, 12 P
gc 56 @3.418s 24%: 2.0+35+0.54 ms clock, 4.0+114/107/0+1.0 ms cpu, 125->140->80 MB, 128 MB goal, 12 P
gc 57 @3.487s 24%: 0.17+33+0.66 ms clock, 0.34+99/100/14+1.3 ms cpu, 128->142->79 MB, 132 MB goal, 12 P
gc 58 @3.555s 24%: 0.10+29+0.45 ms clock, 0.43+99/87/0+1.8 ms cpu, 131->143->70 MB, 135 MB goal, 12 P
gc 59 @3.617s 24%: 0.043+31+0.50 ms clock, 0.39+93/94/0+4.5 ms cpu, 116->130->71 MB, 119 MB goal, 12 P
gc 60 @3.678s 24%: 0.32+36+0.49 ms clock, 2.6+102/109/0+3.9 ms cpu, 114->130->76 MB, 117 MB goal, 12 P
gc 61 @3.743s 24%: 3.0+29+0.53 ms clock, 6.0+98/88/7.0+1.0 ms cpu, 122->133->71 MB, 124 MB goal, 12 P
gc 62 @3.807s 24%: 0.17+34+0.43 ms clock, 1.0+108/101/0+2.6 ms cpu, 119->134->78 MB, 122 MB goal, 12 P
gc 63 @3.872s 24%: 2.0+38+0.71 ms clock, 20+102/114/0+7.1 ms cpu, 127->145->80 MB, 130 MB goal, 12 P
gc 64 @3.941s 24%: 0.027+32+0.53 ms clock, 0.24+103/97/0+4.8 ms cpu, 124->138->73 MB, 127 MB goal, 12 P

RLH closed this as completed Mar 30, 2016
@cespare
Contributor

cespare commented Mar 30, 2016

@RLH did you mean to close this?

@RLH
Contributor

RLH commented Mar 30, 2016

No. How do I fix this?


bradfitz reopened this Mar 30, 2016
@bradfitz
Contributor

You don't see the "Reopen Issue" button? I pressed it, in any case. Maybe you're not in the right groups.

gopherbot pushed a commit that referenced this issue Apr 21, 2016
Currently when we compute the trigger for the next GC, we do it based
on an estimate of the reachable heap size at the start of the GC
cycle, which is itself based on an estimate of the floating garbage.
This was introduced by 4655aad to fix a bad feedback loop that allowed
the heap to grow to many times the true reachable size.

However, this estimate gets easily confused by rapidly allocating
applications and, worse, it's different from the heap size the trigger
controller uses to compute the trigger itself. This results in the
trigger controller often thinking that GC finished before it started.
Since this would be a pretty great outcome from its perspective, it
sets the trigger for the next cycle as close to the next goal as
possible (which is limited to 95% of the goal).

Furthermore, the bad feedback loop this estimate originally fixed
seems not to happen any more, suggesting it was fixed more correctly
by some other change in the meantime. Finally, with the change to
allocate black, it shouldn't even be theoretically possible for this
bad feedback loop to occur.

Hence, eliminate the floating garbage estimate and simply consider the
reachable heap to be the marked heap. This harms overall throughput
slightly for allocation-heavy benchmarks, but significantly improves
mutator availability.

Fixes #12204. This brings the average trigger in this benchmark from
0.95 (the cap) to 0.7 and the active GC utilization from ~90% to ~45%.

Updates #14951. This makes the trigger controller much better behaved,
so it pulls the trigger lower if assists are consuming a lot of CPU
like it's supposed to, increasing mutator availability.

name              old time/op  new time/op  delta
XBenchGarbage-12  2.21ms ± 1%  2.28ms ± 3%  +3.29%  (p=0.000 n=17+17)

Some of this slowdown we paid for in earlier commits. Relative to the
start of the series to switch to allocate-black (the parent of "count
black allocations toward scan work"), the garbage benchmark is 2.62%
slower.

name                      old time/op    new time/op    delta
BinaryTree17-12              2.53s ± 3%     2.53s ± 3%    ~     (p=0.708 n=20+19)
Fannkuch11-12                2.08s ± 0%     2.08s ± 0%  -0.22%  (p=0.002 n=19+18)
FmtFprintfEmpty-12          45.3ns ± 2%    45.2ns ± 3%    ~     (p=0.505 n=20+20)
FmtFprintfString-12          129ns ± 0%     131ns ± 2%  +1.80%  (p=0.000 n=16+19)
FmtFprintfInt-12             121ns ± 2%     121ns ± 2%    ~     (p=0.768 n=19+19)
FmtFprintfIntInt-12          186ns ± 1%     188ns ± 3%  +0.99%  (p=0.000 n=19+19)
FmtFprintfPrefixedInt-12     188ns ± 1%     188ns ± 1%    ~     (p=0.947 n=18+16)
FmtFprintfFloat-12           254ns ± 1%     255ns ± 1%  +0.30%  (p=0.002 n=19+17)
FmtManyArgs-12               763ns ± 0%     770ns ± 0%  +0.92%  (p=0.000 n=18+18)
GobDecode-12                7.00ms ± 1%    7.04ms ± 1%  +0.61%  (p=0.049 n=20+20)
GobEncode-12                5.88ms ± 1%    5.88ms ± 0%    ~     (p=0.641 n=18+19)
Gzip-12                      214ms ± 1%     215ms ± 1%  +0.43%  (p=0.002 n=18+19)
Gunzip-12                   37.6ms ± 0%    37.6ms ± 0%  +0.11%  (p=0.015 n=17+18)
HTTPClientServer-12         76.9µs ± 2%    78.1µs ± 2%  +1.44%  (p=0.000 n=20+18)
JSONEncode-12               15.2ms ± 2%    15.1ms ± 1%    ~     (p=0.271 n=19+18)
JSONDecode-12               53.1ms ± 1%    53.3ms ± 0%  +0.49%  (p=0.000 n=18+19)
Mandelbrot200-12            4.04ms ± 1%    4.03ms ± 0%  -0.33%  (p=0.005 n=18+18)
GoParse-12                  3.29ms ± 1%    3.28ms ± 1%    ~     (p=0.146 n=16+17)
RegexpMatchEasy0_32-12      69.9ns ± 3%    69.5ns ± 1%    ~     (p=0.785 n=20+19)
RegexpMatchEasy0_1K-12       237ns ± 0%     237ns ± 0%    ~     (p=1.000 n=18+18)
RegexpMatchEasy1_32-12      69.5ns ± 1%    69.2ns ± 1%  -0.44%  (p=0.020 n=16+19)
RegexpMatchEasy1_1K-12       372ns ± 1%     371ns ± 2%    ~     (p=0.086 n=20+19)
RegexpMatchMedium_32-12      108ns ± 3%     107ns ± 1%  -1.00%  (p=0.004 n=19+14)
RegexpMatchMedium_1K-12     34.2µs ± 4%    34.0µs ± 2%    ~     (p=0.380 n=19+20)
RegexpMatchHard_32-12       1.77µs ± 4%    1.76µs ± 3%    ~     (p=0.558 n=18+20)
RegexpMatchHard_1K-12       53.4µs ± 4%    52.8µs ± 2%  -1.10%  (p=0.020 n=18+20)
Revcomp-12                   359ms ± 4%     377ms ± 0%  +5.19%  (p=0.000 n=20+18)
Template-12                 63.7ms ± 2%    62.9ms ± 2%  -1.27%  (p=0.005 n=18+20)
TimeParse-12                 316ns ± 2%     313ns ± 1%    ~     (p=0.059 n=20+16)
TimeFormat-12                329ns ± 0%     331ns ± 0%  +0.39%  (p=0.000 n=16+18)
[Geo mean]                  51.6µs         51.7µs       +0.18%

Change-Id: I1dce4640c8205d41717943b021039fffea863c57
Reviewed-on: https://go-review.googlesource.com/21324
Reviewed-by: Rick Hudson <rlh@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
@gopherbot
Contributor

Change https://golang.org/cl/59970 mentions this issue: runtime: separate soft and hard heap limits

@gopherbot
Contributor

Change https://golang.org/cl/59971 mentions this issue: runtime: reduce background mark work from 25% to 20%

@gopherbot
Contributor

Change https://golang.org/cl/60790 mentions this issue: cmd/trace: add minimum mutator utilization (MMU) plot

@gopherbot
Contributor

Change https://golang.org/cl/73711 mentions this issue: runtime: buffered write barrier implementation

@gopherbot
Contributor

Change https://golang.org/cl/73712 mentions this issue: cmd/compile: compiler support for buffered write barrier

gopherbot pushed a commit that referenced this issue Oct 30, 2017
This implements runtime support for buffered write barriers on amd64.
The buffered write barrier has a fast path that simply enqueues
pointers in a per-P buffer. Unlike the current write barrier, this
fast path is *not* a normal Go call and does not require the compiler
to spill general-purpose registers or put arguments on the stack. When
the buffer fills up, the write barrier takes the slow path, which
spills all general purpose registers and flushes the buffer. We don't
allow safe-points or stack splits while this frame is active, so it
doesn't matter that we have no type information for the spilled
registers in this frame.
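
Conceptually, the fast path is just an append into a per-P buffer, with a flush to the garbage collector's work queues when the buffer fills. A minimal sketch in ordinary Go (the real fast path is hand-written with a special ABI; the names here are assumptions for illustration):

// wbBuf is a conceptual per-P buffer of pointers recorded by the write
// barrier. Sketch only; not the runtime's actual data structure.
type wbBuf struct {
	buf  []uintptr
	next int
}

// record is the conceptual fast path: store the pointer and bump an index.
// When the buffer is full, take the slow path, which hands the whole batch
// to the garbage collector as mark work and resets the buffer.
func (b *wbBuf) record(ptr uintptr, flush func([]uintptr)) {
	b.buf[b.next] = ptr
	b.next++
	if b.next == len(b.buf) {
		flush(b.buf) // slow path: spill registers, enqueue pointers, reset
		b.next = 0
	}
}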

One minor complication is cgocheck=2 mode, which uses the write
barrier to detect Go pointers being written to non-Go memory. We
obviously can't buffer this, so instead we set the buffer to its
minimum size, forcing the write barrier into the slow path on every
call. For this specific case, we pass additional information as
arguments to the flush function. This also requires enabling the cgo
write barrier slightly later during runtime initialization, after Ps
(and the per-P write barrier buffers) have been initialized.

The code in this CL is not yet active. The next CL will modify the
compiler to generate calls to the new write barrier.

This reduces the average cost of the write barrier by roughly a factor
of 4, which will pay for the cost of having it enabled more of the
time after we make the GC pacer less aggressive. (Benchmarks will be
in the next CL.)

Updates #14951.
Updates #22460.

Change-Id: I396b5b0e2c5e5c4acfd761a3235fd15abadc6cb1
Reviewed-on: https://go-review.googlesource.com/73711
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
gopherbot pushed a commit that referenced this issue Oct 30, 2017
This CL implements the compiler support for calling the buffered write
barrier added by the previous CL.

Since the buffered write barrier is only implemented on amd64 right
now, this still supports the old, eager write barrier as well. There's
little overhead to supporting both and this way a few tests in
test/fixedbugs that expect to have liveness maps at write barrier
calls can easily opt-in to the old, eager barrier.

This significantly improves the performance of the write barrier:

name             old time/op  new time/op  delta
WriteBarrier-12  73.5ns ±20%  19.2ns ±27%  -73.90%  (p=0.000 n=19+18)

It also reduces the size of binaries because the write barrier call is
more compact:

name        old object-bytes  new object-bytes  delta
Template           398k ± 0%         393k ± 0%  -1.14%  (p=0.008 n=5+5)
Unicode            208k ± 0%         206k ± 0%  -1.00%  (p=0.008 n=5+5)
GoTypes           1.18M ± 0%        1.15M ± 0%  -2.00%  (p=0.008 n=5+5)
Compiler          4.05M ± 0%        3.88M ± 0%  -4.26%  (p=0.008 n=5+5)
SSA               8.25M ± 0%        8.11M ± 0%  -1.59%  (p=0.008 n=5+5)
Flate              228k ± 0%         224k ± 0%  -1.83%  (p=0.008 n=5+5)
GoParser           295k ± 0%         284k ± 0%  -3.62%  (p=0.008 n=5+5)
Reflect           1.00M ± 0%        0.99M ± 0%  -0.70%  (p=0.008 n=5+5)
Tar                339k ± 0%         333k ± 0%  -1.67%  (p=0.008 n=5+5)
XML                404k ± 0%         395k ± 0%  -2.10%  (p=0.008 n=5+5)
[Geo mean]         704k              690k       -2.00%

name        old exe-bytes     new exe-bytes     delta
HelloSize         1.05M ± 0%        1.04M ± 0%  -1.55%  (p=0.008 n=5+5)

https://perf.golang.org/search?q=upload:20171027.1

(Amusingly, this also reduces compiler allocations by 0.75%, which,
combined with the better write barrier, speeds up the compiler overall
by 2.10%. See the perf link.)

It slightly improves the performance of most of the go1 benchmarks and
improves the performance of the x/benchmarks:

name                      old time/op    new time/op    delta
BinaryTree17-12              2.40s ± 1%     2.47s ± 1%  +2.69%  (p=0.000 n=19+19)
Fannkuch11-12                2.95s ± 0%     2.95s ± 0%  +0.21%  (p=0.000 n=20+19)
FmtFprintfEmpty-12          41.8ns ± 4%    41.4ns ± 2%  -1.03%  (p=0.014 n=20+20)
FmtFprintfString-12         68.7ns ± 2%    67.5ns ± 1%  -1.75%  (p=0.000 n=20+17)
FmtFprintfInt-12            79.0ns ± 3%    77.1ns ± 1%  -2.40%  (p=0.000 n=19+17)
FmtFprintfIntInt-12          127ns ± 1%     123ns ± 3%  -3.42%  (p=0.000 n=20+20)
FmtFprintfPrefixedInt-12     152ns ± 1%     150ns ± 1%  -1.02%  (p=0.000 n=18+17)
FmtFprintfFloat-12           211ns ± 1%     209ns ± 0%  -0.99%  (p=0.000 n=20+16)
FmtManyArgs-12               500ns ± 0%     496ns ± 0%  -0.73%  (p=0.000 n=17+20)
GobDecode-12                6.44ms ± 1%    6.53ms ± 0%  +1.28%  (p=0.000 n=20+19)
GobEncode-12                5.46ms ± 0%    5.46ms ± 1%    ~     (p=0.550 n=19+20)
Gzip-12                      220ms ± 1%     216ms ± 0%  -1.75%  (p=0.000 n=19+19)
Gunzip-12                   38.8ms ± 0%    38.6ms ± 0%  -0.30%  (p=0.000 n=18+19)
HTTPClientServer-12         79.0µs ± 1%    78.2µs ± 1%  -1.01%  (p=0.000 n=20+20)
JSONEncode-12               11.9ms ± 0%    11.9ms ± 0%  -0.29%  (p=0.000 n=20+19)
JSONDecode-12               52.6ms ± 0%    52.2ms ± 0%  -0.68%  (p=0.000 n=19+20)
Mandelbrot200-12            3.69ms ± 0%    3.68ms ± 0%  -0.36%  (p=0.000 n=20+20)
GoParse-12                  3.13ms ± 1%    3.18ms ± 1%  +1.67%  (p=0.000 n=19+20)
RegexpMatchEasy0_32-12      73.2ns ± 1%    72.3ns ± 1%  -1.19%  (p=0.000 n=19+18)
RegexpMatchEasy0_1K-12       241ns ± 0%     239ns ± 0%  -0.83%  (p=0.000 n=17+16)
RegexpMatchEasy1_32-12      68.6ns ± 1%    69.0ns ± 1%  +0.47%  (p=0.015 n=18+16)
RegexpMatchEasy1_1K-12       364ns ± 0%     361ns ± 0%  -0.67%  (p=0.000 n=16+17)
RegexpMatchMedium_32-12      104ns ± 1%     103ns ± 1%  -0.79%  (p=0.001 n=20+15)
RegexpMatchMedium_1K-12     33.8µs ± 3%    34.0µs ± 2%    ~     (p=0.267 n=20+19)
RegexpMatchHard_32-12       1.64µs ± 1%    1.62µs ± 2%  -1.25%  (p=0.000 n=19+18)
RegexpMatchHard_1K-12       49.2µs ± 0%    48.7µs ± 1%  -0.93%  (p=0.000 n=19+18)
Revcomp-12                   391ms ± 5%     396ms ± 7%    ~     (p=0.154 n=19+19)
Template-12                 63.1ms ± 0%    59.5ms ± 0%  -5.76%  (p=0.000 n=18+19)
TimeParse-12                 307ns ± 0%     306ns ± 0%  -0.39%  (p=0.000 n=19+17)
TimeFormat-12                325ns ± 0%     323ns ± 0%  -0.50%  (p=0.000 n=19+19)
[Geo mean]                  47.3µs         46.9µs       -0.67%

https://perf.golang.org/search?q=upload:20171026.1

name                       old time/op  new time/op  delta
Garbage/benchmem-MB=64-12  2.25ms ± 1%  2.20ms ± 1%  -2.31%  (p=0.000 n=18+18)
HTTP-12                    12.6µs ± 0%  12.6µs ± 0%  -0.72%  (p=0.000 n=18+17)
JSON-12                    11.0ms ± 0%  11.0ms ± 1%  -0.68%  (p=0.000 n=17+19)

https://perf.golang.org/search?q=upload:20171026.2

Updates #14951.
Updates #22460.

Change-Id: Id4c0932890a1d41020071bec73b8522b1367d3e7
Reviewed-on: https://go-review.googlesource.com/73712
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
gopherbot pushed a commit that referenced this issue Oct 31, 2017
Currently, GC pacing is based on a single hard heap limit computed
based on GOGC. In order to achieve this hard limit, assist pacing
makes the conservative assumption that the entire heap is live.
However, in the steady state (with GOGC=100), only half of the heap is
live. As a result, the garbage collector works twice as hard as
necessary and finishes half way between the trigger and the goal.
Since this is a stable state for the trigger controller, this repeats
from cycle to cycle. Matters are even worse if GOGC is higher. For
example, if GOGC=200, only a third of the heap is live in steady
state, so the GC will work three times harder than necessary and
finish only a third of the way between the trigger and the goal.

Since this causes the garbage collector to consume ~50% of the
available CPU during marking instead of the intended 25%, about 25% of
the CPU goes to mutator assists. This high mutator assist cost causes
high mutator latency variability.

This commit improves the situation by separating the heap goal into
two goals: a soft goal and a hard goal. The soft goal is set based on
GOGC, just like the current goal is, and the hard goal is set at a 10%
larger heap than the soft goal. Prior to the soft goal, assist pacing
assumes the heap is in steady state (e.g., only half of it is live).
Between the soft goal and the hard goal, assist pacing switches to the
current conservative assumption that the entire heap is live.
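
A minimal sketch of the two-zone assist pacing described above (the structure and names are assumptions for illustration, not the runtime's actual pacer code):

// assistRatio sketches the soft/hard goal split: before the soft goal, budget
// only for steady-state scan work; between the soft and hard goals, fall back
// to assuming the entire scannable heap is live. Overshoot past the hard goal
// is not handled in this sketch.
func assistRatio(heapLive, heapScan, softGoal, scanLastCycle float64) float64 {
	hardGoal := 1.1 * softGoal
	if heapLive < softGoal {
		// Steady-state assumption: roughly last cycle's scan work remains.
		return scanLastCycle / (softGoal - heapLive)
	}
	// Conservative assumption: the whole scannable heap must be finished
	// by the hard goal.
	return heapScan / (hardGoal - heapLive)
}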

In benchmarks, this nearly eliminates mutator assists. However, since
background marking is fixed at 25% CPU, this causes the trigger
controller to saturate, which leads to somewhat higher variability in
heap size. The next commit will address this.

The lower CPU usage of course leads to longer mark cycles, though
really it means the mark cycles are as long as they should have been
in the first place. This does, however, lead to two potential
down-sides compared to the current pacing policy: 1. the total
overhead of the write barrier is higher because it's enabled more of
the time and 2. the heap size may be larger because there's more
floating garbage. We addressed 1 by significantly improving the
performance of the write barrier in the preceding commits. 2 can be
demonstrated in intense GC benchmarks, but doesn't seem to be a
problem in any real applications.

Updates #14951.
Updates #14812 (fixes?).
Fixes #18534.

This has no significant effect on the throughput of the
github.com/dr2chase/bent benchmarks-50.

This has little overall throughput effect on the go1 benchmarks:

name                      old time/op    new time/op    delta
BinaryTree17-12              2.41s ± 0%     2.40s ± 0%  -0.22%  (p=0.007 n=20+18)
Fannkuch11-12                2.95s ± 0%     2.95s ± 0%  +0.07%  (p=0.003 n=17+18)
FmtFprintfEmpty-12          41.7ns ± 3%    42.2ns ± 0%  +1.17%  (p=0.002 n=20+15)
FmtFprintfString-12         66.5ns ± 0%    67.9ns ± 2%  +2.16%  (p=0.000 n=16+20)
FmtFprintfInt-12            77.6ns ± 2%    75.6ns ± 3%  -2.55%  (p=0.000 n=19+19)
FmtFprintfIntInt-12          124ns ± 1%     123ns ± 1%  -0.98%  (p=0.000 n=18+17)
FmtFprintfPrefixedInt-12     151ns ± 1%     148ns ± 1%  -1.75%  (p=0.000 n=19+20)
FmtFprintfFloat-12           210ns ± 1%     212ns ± 0%  +0.75%  (p=0.000 n=19+16)
FmtManyArgs-12               501ns ± 1%     499ns ± 1%  -0.30%  (p=0.041 n=17+19)
GobDecode-12                6.50ms ± 1%    6.49ms ± 1%    ~     (p=0.234 n=19+19)
GobEncode-12                5.43ms ± 0%    5.47ms ± 0%  +0.75%  (p=0.000 n=20+19)
Gzip-12                      216ms ± 1%     220ms ± 1%  +1.71%  (p=0.000 n=19+20)
Gunzip-12                   38.6ms ± 0%    38.8ms ± 0%  +0.66%  (p=0.000 n=18+19)
HTTPClientServer-12         78.1µs ± 1%    78.5µs ± 1%  +0.49%  (p=0.035 n=20+20)
JSONEncode-12               12.1ms ± 0%    12.2ms ± 0%  +1.05%  (p=0.000 n=18+17)
JSONDecode-12               53.0ms ± 0%    52.3ms ± 0%  -1.27%  (p=0.000 n=19+19)
Mandelbrot200-12            3.74ms ± 0%    3.69ms ± 0%  -1.17%  (p=0.000 n=18+19)
GoParse-12                  3.17ms ± 1%    3.17ms ± 1%    ~     (p=0.569 n=19+20)
RegexpMatchEasy0_32-12      73.2ns ± 1%    73.7ns ± 0%  +0.76%  (p=0.000 n=18+17)
RegexpMatchEasy0_1K-12       239ns ± 0%     238ns ± 0%  -0.27%  (p=0.000 n=13+17)
RegexpMatchEasy1_32-12      69.0ns ± 2%    69.1ns ± 1%    ~     (p=0.404 n=19+19)
RegexpMatchEasy1_1K-12       367ns ± 1%     365ns ± 1%  -0.60%  (p=0.000 n=19+19)
RegexpMatchMedium_32-12      105ns ± 1%     104ns ± 1%  -1.24%  (p=0.000 n=19+16)
RegexpMatchMedium_1K-12     34.1µs ± 2%    33.6µs ± 3%  -1.60%  (p=0.000 n=20+20)
RegexpMatchHard_32-12       1.62µs ± 1%    1.67µs ± 1%  +2.75%  (p=0.000 n=18+18)
RegexpMatchHard_1K-12       48.8µs ± 1%    50.3µs ± 2%  +3.07%  (p=0.000 n=20+19)
Revcomp-12                   386ms ± 0%     384ms ± 0%  -0.57%  (p=0.000 n=20+19)
Template-12                 59.9ms ± 1%    61.1ms ± 1%  +2.01%  (p=0.000 n=20+19)
TimeParse-12                 301ns ± 2%     307ns ± 0%  +2.11%  (p=0.000 n=20+19)
TimeFormat-12                323ns ± 0%     323ns ± 0%    ~     (all samples are equal)
[Geo mean]                  47.0µs         47.1µs       +0.23%

https://perf.golang.org/search?q=upload:20171030.1

Likewise, the throughput effect on the x/benchmarks is minimal (and
reasonably positive on the garbage benchmark with a large heap):

name                         old time/op  new time/op  delta
Garbage/benchmem-MB=1024-12  2.40ms ± 4%  2.29ms ± 3%  -4.57%  (p=0.000 n=19+18)
Garbage/benchmem-MB=64-12    2.23ms ± 1%  2.24ms ± 2%  +0.59%  (p=0.016 n=19+18)
HTTP-12                      12.5µs ± 1%  12.6µs ± 1%    ~     (p=0.326 n=20+19)
JSON-12                      11.1ms ± 1%  11.3ms ± 2%  +2.15%  (p=0.000 n=16+17)

It does increase the heap size of the garbage benchmarks, but seems to
have relatively little impact on more realistic programs. Also, we'll
gain some of this back with the next commit.

name                         old peak-RSS-bytes  new peak-RSS-bytes  delta
Garbage/benchmem-MB=1024-12          1.21G ± 1%          1.88G ± 2%  +55.59%  (p=0.000 n=19+20)
Garbage/benchmem-MB=64-12             168M ± 3%           248M ± 8%  +48.08%  (p=0.000 n=18+20)
HTTP-12                              45.6M ± 9%          47.0M ±27%     ~     (p=0.925 n=20+20)
JSON-12                               193M ±11%           206M ±11%   +7.06%  (p=0.001 n=20+20)

https://perf.golang.org/search?q=upload:20171030.2

Change-Id: Ic78904135f832b4d64056cbe734ab979f5ad9736
Reviewed-on: https://go-review.googlesource.com/59970
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
gopherbot pushed a commit that referenced this issue Oct 31, 2017
Currently, the background mark worker and the goal GC CPU are both
fixed at 25%. The trigger controller's goal is to achieve the
goal CPU usage, and with the previous commit it can actually achieve
this. But this means there are *no* assists, which sounds ideal but
actually causes problems for the trigger controller. Since the
controller can't lower CPU usage below the background mark worker CPU,
it saturates at the CPU goal and no longer gets feedback, which
translates into higher variability in heap growth.

This commit fixes this by allowing assists 5% CPU beyond the 25% fixed
background mark. This avoids saturating the trigger controller, since
it can now get feedback from both sides of the CPU goal. This leads to
low variability in both CPU usage and heap growth, at the cost of
reintroducing a low rate of mark assists.
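
In terms of the pacer's tuning, the change amounts to constants along these lines (shown for illustration; treat the names as assumptions rather than a quote of the runtime source):

const (
	// CPU fraction used by dedicated + fractional background mark workers.
	gcBackgroundUtilization = 0.25
	// Total GC CPU the trigger controller aims for; the extra 5% is budgeted
	// for mutator assists so the controller gets feedback from both sides.
	gcGoalUtilization = 0.30
)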

We also experimented with 20% background plus 5% assist, but 25%+5%
clearly performed better in benchmarks.

Updates #14951.
Updates #14812.
Updates #18534.

Combined with the previous CL, this significantly improves tail
mutator utilization in the x/benchmarks garbage benchmark. On a sample
trace, it increased the 99.9%ile mutator utilization at 10ms from 26%
to 59%, and at 5ms from 17% to 52%. It reduced the 99.9%ile zero
utilization window from 2ms to 700µs. It also helps the mean mutator
utilization: it increased the 10s mutator utilization from 83% to 94%.
The minimum mutator utilization is also somewhat improved, though
there is still some unknown artifact that causes a minuscule fraction
of mutator assists to take 5--10ms (in fact, there was exactly one
10ms mutator assist in my sample trace).

This has no significant effect on the throughput of the
github.com/dr2chase/bent benchmarks-50.

This has little effect on the go1 benchmarks (and the slight overall
improvement makes up for the slight overall slowdown from the previous
commit):

name                      old time/op    new time/op    delta
BinaryTree17-12              2.40s ± 0%     2.41s ± 1%  +0.26%  (p=0.010 n=18+19)
Fannkuch11-12                2.95s ± 0%     2.93s ± 0%  -0.62%  (p=0.000 n=18+15)
FmtFprintfEmpty-12          42.2ns ± 0%    42.3ns ± 1%  +0.37%  (p=0.001 n=15+14)
FmtFprintfString-12         67.9ns ± 2%    67.2ns ± 3%  -1.03%  (p=0.002 n=20+18)
FmtFprintfInt-12            75.6ns ± 3%    76.8ns ± 2%  +1.59%  (p=0.000 n=19+17)
FmtFprintfIntInt-12          123ns ± 1%     124ns ± 1%  +0.77%  (p=0.000 n=17+14)
FmtFprintfPrefixedInt-12     148ns ± 1%     150ns ± 1%  +1.28%  (p=0.000 n=20+20)
FmtFprintfFloat-12           212ns ± 0%     211ns ± 1%  -0.67%  (p=0.000 n=16+17)
FmtManyArgs-12               499ns ± 1%     500ns ± 0%  +0.23%  (p=0.004 n=19+16)
GobDecode-12                6.49ms ± 1%    6.51ms ± 1%  +0.32%  (p=0.008 n=19+19)
GobEncode-12                5.47ms ± 0%    5.43ms ± 1%  -0.68%  (p=0.000 n=19+20)
Gzip-12                      220ms ± 1%     216ms ± 1%  -1.66%  (p=0.000 n=20+19)
Gunzip-12                   38.8ms ± 0%    38.5ms ± 0%  -0.80%  (p=0.000 n=19+20)
HTTPClientServer-12         78.5µs ± 1%    78.1µs ± 1%  -0.53%  (p=0.008 n=20+19)
JSONEncode-12               12.2ms ± 0%    11.9ms ± 0%  -2.38%  (p=0.000 n=17+19)
JSONDecode-12               52.3ms ± 0%    53.3ms ± 0%  +1.84%  (p=0.000 n=19+20)
Mandelbrot200-12            3.69ms ± 0%    3.69ms ± 0%  -0.19%  (p=0.000 n=19+19)
GoParse-12                  3.17ms ± 1%    3.19ms ± 1%  +0.61%  (p=0.000 n=20+20)
RegexpMatchEasy0_32-12      73.7ns ± 0%    73.2ns ± 1%  -0.66%  (p=0.000 n=17+20)
RegexpMatchEasy0_1K-12       238ns ± 0%     239ns ± 0%  +0.32%  (p=0.000 n=17+16)
RegexpMatchEasy1_32-12      69.1ns ± 1%    69.2ns ± 1%    ~     (p=0.669 n=19+13)
RegexpMatchEasy1_1K-12       365ns ± 1%     367ns ± 1%  +0.49%  (p=0.000 n=19+19)
RegexpMatchMedium_32-12      104ns ± 1%     105ns ± 1%  +1.33%  (p=0.000 n=16+20)
RegexpMatchMedium_1K-12     33.6µs ± 3%    34.1µs ± 4%  +1.67%  (p=0.001 n=20+20)
RegexpMatchHard_32-12       1.67µs ± 1%    1.62µs ± 1%  -2.78%  (p=0.000 n=18+17)
RegexpMatchHard_1K-12       50.3µs ± 2%    48.7µs ± 1%  -3.09%  (p=0.000 n=19+18)
Revcomp-12                   384ms ± 0%     386ms ± 0%  +0.59%  (p=0.000 n=19+19)
Template-12                 61.1ms ± 1%    60.5ms ± 1%  -1.02%  (p=0.000 n=19+20)
TimeParse-12                 307ns ± 0%     303ns ± 1%  -1.23%  (p=0.000 n=19+15)
TimeFormat-12                323ns ± 0%     323ns ± 0%  -0.12%  (p=0.011 n=15+20)
[Geo mean]                  47.1µs         47.0µs       -0.20%

https://perf.golang.org/search?q=upload:20171030.4

It slightly improves the performance of the x/benchmarks:

name                         old time/op  new time/op  delta
Garbage/benchmem-MB=1024-12  2.29ms ± 3%  2.22ms ± 2%  -2.97%  (p=0.000 n=18+18)
Garbage/benchmem-MB=64-12    2.24ms ± 2%  2.21ms ± 2%  -1.64%  (p=0.000 n=18+18)
HTTP-12                      12.6µs ± 1%  12.6µs ± 1%    ~     (p=0.690 n=19+17)
JSON-12                      11.3ms ± 2%  11.3ms ± 1%    ~     (p=0.163 n=17+18)

and fixes some of the heap size bloat caused by the previous commit:

name                         old peak-RSS-bytes  new peak-RSS-bytes  delta
Garbage/benchmem-MB=1024-12          1.88G ± 2%          1.77G ± 2%  -5.52%  (p=0.000 n=20+18)
Garbage/benchmem-MB=64-12             248M ± 8%           226M ± 5%  -8.93%  (p=0.000 n=20+20)
HTTP-12                              47.0M ±27%          47.2M ±12%    ~     (p=0.512 n=20+20)
JSON-12                               206M ±11%           206M ±10%    ~     (p=0.841 n=20+20)

https://perf.golang.org/search?q=upload:20171030.5

Combined with the change to add a soft goal in the previous commit,
this achieves a decent performance improvement on the garbage
benchmark:

name                         old time/op  new time/op  delta
Garbage/benchmem-MB=1024-12  2.40ms ± 4%  2.22ms ± 2%  -7.40%  (p=0.000 n=19+18)
Garbage/benchmem-MB=64-12    2.23ms ± 1%  2.21ms ± 2%  -1.06%  (p=0.000 n=19+18)
HTTP-12                      12.5µs ± 1%  12.6µs ± 1%    ~     (p=0.330 n=20+17)
JSON-12                      11.1ms ± 1%  11.3ms ± 1%  +1.87%  (p=0.000 n=16+18)

https://perf.golang.org/search?q=upload:20171030.6

Change-Id: If04ddb57e1e58ef2fb9eec54c290eb4ae4bea121
Reviewed-on: https://go-review.googlesource.com/59971
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
@aclements
Member Author

Commit 03eb948 fixed this issue (not sure why I only marked it "updates"). I have a design doc that I hope to clean up and publish under this issue in the near future.

@gopherbot
Contributor

Change https://golang.org/cl/122796 mentions this issue: design: soft/hard heap limits design

gopherbot pushed a commit to golang/proposal that referenced this issue Jul 10, 2018
This is a design document I wrote a long time ago but hadn't published
yet (this change was released in Go 1.10).

Updates golang/go#14951.

Change-Id: Ib5d648eede153d68af503eaada455069c311e27e
Reviewed-on: https://go-review.googlesource.com/122796
Reviewed-by: Ian Lance Taylor <iant@golang.org>
gopherbot pushed a commit that referenced this issue Nov 5, 2018
This adds an endpoint to the trace tool that plots the minimum mutator
utilization curve using information on mark assists and GC pauses from
the trace.

This commit implements a fairly straightforward O(nm) algorithm for
computing the MMU (and tests against an even more direct but slower
algorithm). Future commits will extend and optimize this algorithm.

This should be useful for debugging and understanding mutator
utilization issues like #14951, #14812, #18155, #18534, #21107,
particularly once follow-up CLs add trace cross-referencing.

Change-Id: Ic2866869e7da1e6c56ba3e809abbcb2eb9c4923a
Reviewed-on: https://go-review.googlesource.com/c/60790
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Hyang-Ah Hana Kim <hyangah@gmail.com>
golang locked and limited conversation to collaborators Jul 9, 2019