Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: concurrent GC starves mutator and netpoller #16432

Closed
rhysh opened this issue Jul 19, 2016 · 26 comments

Comments

Projects
None yet
8 participants
@rhysh
Copy link
Contributor

commented Jul 19, 2016

Please answer these questions before submitting your issue. Thanks!

  1. What version of Go are you using (go version)?

    go1.7rc2

  2. What operating system and processor architecture are you using (go env)?

    linux/amd64

  3. What did you do?

    I've got a program that accepts data over 200–300 active network connections, unmarshals the data, reformats and remarshals the data, and sends it over a small number of http connections. It usually has around 1300 goroutines and 200MB of live heap, and triggers a garbage collection about twice a second (the apis for encoding/json and others force a lot of allocations). It runs on a 36-core machine with GOMAXPROCS=36.

    I took an execution trace from /debug/pprof/trace to try to track down a different GC issue.

  4. What did you expect to see?

    I expected the program's mutator goroutines continue to be scheduled as usual during the concurrent mark phase. I'd expect most goroutines to be scheduled for close to the 50–2000µs they need to do their normal work. I expected the "Goroutines" row to indicate that no goroutines were getting stuck in the run queue for long periods of time.

    I'd expect no more than 25% of threads to be executing runtime.gcBgMarkWorker goroutines.

    I expected the "Network" row to show incoming network events continue to occur during GC, and for the goroutines that are unblocked by network events to be scheduled immediately afterwards.

  5. What did you see instead?

    Goroutines that are scheduled during garbage collection run for around 10ms. When runtime.gcBgMarkWorker goroutines run, they go for longer than 10ms. The "Goroutines" row grows and then stays steady, only draining when the GC is nearly complete.

    Occasionally, as shown in the execution trace screenshot below, nearly all threads are running a runtime.gcBgMarkWorker goroutine. This one has 35 out of 36 threads running their copy of that goroutine—20 of the 21 pictured there, plus the 15 others offscreen.

    The "Network" row shows less inbound network traffic during the GC, which seems to indicate that the netpoller is also getting starved—I don't know a lot yet about the netpoller's operation, but I don't think that there'd be any less inbound network traffic to this process during garbage collections. Additionally, when a network even is detected it takes a long time to run the goroutine that was unblocked by that network event. The red arrow in the execution tracer screenshot shows a delay of around 10ms between data coming in over the network and the correct goroutine running.


I've included some output from running the program with GODEBUG=gctrace=1,gcpacertrace=1. I'm not yet very familiar with the pacer, but it looks like "assist ratio=+1.306319e+001" means that when a goroutine goes into allocation debt that it will end up scanning 13,000,000 scan-units before it's allowed to run again per gcAssistAlloc and gcOverAssistBytes.

screen shot 2016-07-19 at 1 42 23 pm redacted

pacer: sweep done at heap size 270MB; allocated 28MB of spans; swept 61108 pages at +2.548577e-004 pages/byte
pacer: assist ratio=+1.306319e+001 (scan 156 MB in 471->483 MB) workers=9+0
pacer: H_m_prev=253496440 h_t=+9.500000e-001 H_T=494318058 h_a=+9.582826e-001 H_a=496417664 h_g=+1.000000e+000 H_g=506992880 u_a=+6.990956e-001 u_g=+2.500000e-001 W_a=85708136 goalΔ=+5.000000e-002 actualΔ=+8.282586e-003 u_a/u_g=+2.796383e+000
gc 2106 @1021.618s 4%: 0.27+30+1.4 ms clock, 8.5+494/256/0+46 ms cpu, 471->473->237 MB, 483 MB goal, 36 P
pacer: sweep done at heap size 268MB; allocated 31MB of spans; swept 61104 pages at +2.595642e-004 pages/byte
pacer: assist ratio=+1.269421e+001 (scan 150 MB in 462->474 MB) workers=9+0
pacer: H_m_prev=248903752 h_t=+9.500000e-001 H_T=485362316 h_a=+9.559496e-001 H_a=486843184 h_g=+1.000000e+000 H_g=497807504 u_a=+5.780640e-001 u_g=+2.500000e-001 W_a=85396448 goalΔ=+5.000000e-002 actualΔ=+5.949559e-003 u_a/u_g=+2.312256e+000
gc 2107 @1022.157s 4%: 0.084+30+1.3 ms clock, 2.7+368/272/246+43 ms cpu, 462->464->238 MB, 474 MB goal, 36 P
pacer: sweep done at heap size 263MB; allocated 24MB of spans; swept 60513 pages at +2.558500e-004 pages/byte
pacer: assist ratio=+1.294395e+001 (scan 153 MB in 465->476 MB) workers=9+0
pacer: H_m_prev=250069528 h_t=+9.500000e-001 H_T=487635579 h_a=+9.529223e-001 H_a=488366352 h_g=+1.000000e+000 H_g=500139056 u_a=+4.340543e-001 u_g=+2.500000e-001 W_a=83280464 goalΔ=+5.000000e-002 actualΔ=+2.922277e-003 u_a/u_g=+1.736217e+000
gc 2108 @1022.667s 4%: 0.17+27+1.2 ms clock, 5.4+182/239/482+37 ms cpu, 465->465->235 MB, 476 MB goal, 36 P
pacer: sweep done at heap size 257MB; allocated 21MB of spans; swept 60631 pages at +2.596270e-004 pages/byte
pacer: assist ratio=+1.270776e+001 (scan 149 MB in 459->470 MB) workers=9+0
pacer: H_m_prev=246926040 h_t=+9.500000e-001 H_T=481505778 h_a=+9.583699e-001 H_a=483572528 h_g=+1.000000e+000 H_g=493852080 u_a=+3.990433e-001 u_g=+2.500000e-001 W_a=83278560 goalΔ=+5.000000e-002 actualΔ=+8.369915e-003 u_a/u_g=+1.596173e+000
gc 2109 @1023.185s 4%: 0.18+29+1.4 ms clock, 5.9+159/249/500+44 ms cpu, 459->461->229 MB, 470 MB goal, 36 P
pacer: sweep done at heap size 258MB; allocated 28MB of spans; swept 60538 pages at +2.662955e-004 pages/byte
pacer: assist ratio=+1.290206e+001 (scan 147 MB in 447->458 MB) workers=9+0
pacer: H_m_prev=240402624 h_t=+9.500000e-001 H_T=468785116 h_a=+9.540130e-001 H_a=469749856 h_g=+1.000000e+000 H_g=480805248 u_a=+4.026614e-001 u_g=+2.500000e-001 W_a=82234064 goalΔ=+5.000000e-002 actualΔ=+4.013014e-003 u_a/u_g=+1.610646e+000
gc 2110 @1023.709s 4%: 0.18+26+1.1 ms clock, 5.9+146/228/474+35 ms cpu, 447->447->234 MB, 458 MB goal, 36 P

The combination of high assist ratio and all goroutines on all Ps running slowly makes this look similar to #14812, with the added twist that sometimes nearly all Ps are running their gcBgMarkWorker goroutine.

/cc @aclements

@ianlancetaylor ianlancetaylor added this to the Go1.8 milestone Jul 19, 2016

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Jul 19, 2016

CC @RLH

@aclements

This comment has been minimized.

Copy link
Member

commented Jul 21, 2016

Given how much time is being spent in gcBgMarkWorker, I think this is actually unrelated to assists (or at least the gcBgMarkWorker is masking the assists). Given the duration of many of the gcBgMarkWorker runs and the amount of idle time shown outside of GC in the trace, I would guess that these are idle workers, which run for 10ms when invoked (which was probably a bad idea). @rhysh, could you try https://go-review.googlesource.com/c/24706/? That isn't really the "right" fix for this problem, but it should make it clear in the trace if this is the problem by chopping most of the gcBgMarkWorkers into much shorter windows, and it may help somewhat with latency.

@rhysh

This comment has been minimized.

Copy link
Contributor Author

commented Jul 21, 2016

Hi @aclements, I tried CL 24706 PS 1. When I checked it out and built/ran my program, the program consumed a lot more CPU. An execution trace showed GC runs lasting for about 25ms with about 11ms of application execution between them. During the 1/3rd of the time for which the app was running, the user goroutines were busy doing sweep work.

I cherry-picked CL 24706 PS 1 onto go1.7rc2, and it didn't seem to have the effect you describe in the commit message of preempting the gcBgMarkWorker every 1ms. It may well not apply to go1.7rc2 in a meaningful way, I haven't looked into the patches it's based on.

Below is an execution trace of the program with go1.7rc2 with CL 24706 PS 1 cherry-picked on top. It doesn't exhibit the behavior of many Ps running gcBgMarkWorker (which sounds like it's not a bug in the first place?), and it also doesn't show the gcBgMarkWorker goroutines allowing themselves to be descheduled after 1ms of work. It still shows a decrease in the frequency of "Network" events, which (not pictured) still have a multi-ms delay (I've found up to 7ms) before the corresponding goroutines are scheduled, vs the <50µs delay typical when the GC is not running.

screen shot 2016-07-21 at 12 50 12 pm redacted

@aclements

This comment has been minimized.

Copy link
Member

commented Jul 22, 2016

Actually, that trace looks like CL 24706 is working just fine; it's just shifted the problem from long idle background marks to long assists (the work has to happen somewhere). I would expect 9 of the Ps to run a dedicated gcBgMarkWorker for the whole cycle, which probably agrees with the trace (clearly you need a bigger monitor :).

What's weird (and what agrees with #14812), is that these assists take so long and that the pacer isn't pulling back the start time to reduce the CPU usage. Do you have the gctrace and gcpacertrace from the run with CL 24706? Even with an assist ratio of ~13 (which, you're right, means it will do about 13 million units of scan work per assist), I would only expect it to take ~10ms per assist. That's clearly still too long, but the trace is showing assists much longer than this.

You mentioned that this is basically an RPC translator. Any idea how difficult it would be to distill this into a public benchmark that might exhibit a similar pattern of allocation and goroutine activation?

@rhysh

This comment has been minimized.

Copy link
Contributor Author

commented Jul 22, 2016

I've run the program with go1.7rc3 with CL 24706 PS 1 cherry-picked on top. gctrace and gcpacertrace output are below, along with an execution trace zoomed out to show what all 36 Ps are doing. There are indeed 9 that run a gcBgMarkWorker goroutine.

I think I'll be able to distill this into a public benchmark, though I don't yet know how the GC pacer works. Can you give me some direction on what sorts of stimuli cause the assist ratio to be high or low?

The GC's behavior from what I'd expect in a few ways. Can you clarify what behavior I should expect?

  1. Allocation during a GC cycle is several orders of magnitude more expensive than allocation between GC cycles. I'd expect the cost difference to be less than 10x.
  2. The delay between a goroutine being runnable and it getting CPU time ("scheduling latency"?) is several orders of magnitude higher when a GC is running—say 7ms to receive inbound network traffic vs <50µs typically. I'd expect the latency difference to be less than 10x for applications that are not CPU bound when the GC isn't running.
  3. In the execution trace below, all Ps are busy helping the collector between 50% and 80% of the way through the trace (550–562ms)—either directly via gcBgMarkWorker, or via assist taxes. I'd expect the GC's CPU usage to be relatively consistent throughout the cycle, for it to immediately enlist the help it needs, or to not enlist the help at all (at least not to the extent it does).
  4. At the very end of the concurrent GC phase, between 564ms and 566ms in the trace, the goroutine queue drains. The goroutines are each scheduled for more time than they are when the GC isn't running (compare with 568–570ms), but they each run quickly enough that the run queue drains. I'd expect the whole concurrent GC cycle to look like this, at least because it looks like confusion (1) isn't in play there.

I think that (1) is the root of my surprise.

Is (4) what you mean by "the pacer pulling back"? Does the pacer divide the GC into two (or more) phases, during which it imposes different allocation taxes?

Looking again at runtime.gcAssistAlloc:

    // Compute the amount of scan work we need to do to make the
    // balance positive. We over-assist to build up credit for
    // future allocations and amortize the cost of assisting.
    debtBytes := -gp.gcAssistBytes + gcOverAssistBytes
    scanWork := int64(gcController.assistWorkPerByte * float64(debtBytes))

I wonder if scanWork should be calculated differently. It looks like the goals the current code meets are:

  • get out of debt locally
  • stay on pace to get out of debt globally
  • don't get back into debt for a little while longer to amortize the cost of assisting

I think the problems I'm seeing in my program are because:

  • the credit build-up for amortizing the cost of assisting is too high
  • the credit build-up is affected by the assist ratio, so becomes even higher (and is paid by a small number of goroutines)

I'd propose something like the following, possibly with reducing the value of gcOverAssistBytes.

    debtBytes := -gp.gcAssistBytes
    scanWork := int64(gcController.assistWorkPerByte * float64(debtBytes))
    if scanWork < gcOverAssistBytes {
        scanWork = gcOverAssistBytes
        debtBytes = int64(gcController.assistBytesPerWork*float64(scanWork))
    }

Would a change like that break other invariants or goals of the GC?


pacer: assist ratio=+1.231154e+001 (scan 143 MB in 455->467 MB) workers=9+0
pacer: H_m_prev=245126496 h_t=+9.500000e-001 H_T=477996667 h_a=+9.557673e-001 H_a=479410384 h_g=+1.000000e+000 H_g=490252992 u_a=+8.475515e-001 u_g=+2.500000e-001 W_a=82463928 goalΔ=+5.000000e-002 actualΔ=+5.767295e-003 u_a/u_g=+3.390206e+000
gc 899 @482.825s 4%: 0.14+26+1.2 ms clock, 4.7+572/227/0+41 ms cpu, 455->457->236 MB, 467 MB goal, 36 P
pacer: sweep done at heap size 254MB; allocated 18MB of spans; swept 60049 pages at +2.562095e-004 pages/byte
pacer: assist ratio=+1.252757e+001 (scan 147 MB in 460->472 MB) workers=9+0
pacer: H_m_prev=247813872 h_t=+9.500000e-001 H_T=483237050 h_a=+9.561490e-001 H_a=484760864 h_g=+1.000000e+000 H_g=495627744 u_a=+7.601246e-001 u_g=+2.500000e-001 W_a=83029592 goalΔ=+5.000000e-002 actualΔ=+6.149025e-003 u_a/u_g=+3.040498e+000
gc 900 @483.353s 4%: 0.31+28+1.4 ms clock, 10+530/243/77+44 ms cpu, 460->462->230 MB, 472 MB goal, 36 P
pacer: sweep done at heap size 255MB; allocated 24MB of spans; swept 60107 pages at +2.626074e-004 pages/byte
pacer: assist ratio=+1.273292e+001 (scan 146 MB in 450->461 MB) workers=9+0
pacer: H_m_prev=242035808 h_t=+9.500000e-001 H_T=471969825 h_a=+9.541013e-001 H_a=472962480 h_g=+1.000000e+000 H_g=484071616 u_a=+8.154352e-001 u_g=+2.500000e-001 W_a=81893424 goalΔ=+5.000000e-002 actualΔ=+4.101271e-003 u_a/u_g=+3.261741e+000
gc 901 @483.877s 4%: 0.093+25+1.2 ms clock, 2.9+532/225/1.0+38 ms cpu, 450->451->236 MB, 461 MB goal, 36 P
pacer: sweep done at heap size 265MB; allocated 29MB of spans; swept 58896 pages at +2.516126e-004 pages/byte
pacer: assist ratio=+1.214765e+001 (scan 143 MB in 460->472 MB) workers=9+0
pacer: H_m_prev=247497552 h_t=+9.500000e-001 H_T=482620226 h_a=+9.558650e-001 H_a=484071792 h_g=+1.000000e+000 H_g=494995104 u_a=+8.234678e-001 u_g=+2.500000e-001 W_a=81130152 goalΔ=+5.000000e-002 actualΔ=+5.864970e-003 u_a/u_g=+3.293871e+000
gc 902 @484.398s 4%: 0.17+24+1.2 ms clock, 5.4+522/217/41+39 ms cpu, 460->461->238 MB, 472 MB goal, 36 P
pacer: sweep done at heap size 265MB; allocated 27MB of spans; swept 60644 pages at +2.568556e-004 pages/byte
pacer: assist ratio=+1.269908e+001 (scan 144 MB in 464->476 MB) workers=9+0
pacer: H_m_prev=249631648 h_t=+9.500000e-001 H_T=486781713 h_a=+9.570594e-001 H_a=488543968 h_g=+1.000000e+000 H_g=499263296 u_a=+9.082547e-001 u_g=+2.500000e-001 W_a=81885624 goalΔ=+5.000000e-002 actualΔ=+7.059419e-003 u_a/u_g=+3.633019e+000
gc 903 @484.949s 4%: 0.43+23+1.3 ms clock, 2.5+566/204/0+8.0 ms cpu, 464->465->234 MB, 476 MB goal, 36 P
pacer: sweep done at heap size 259MB; allocated 25MB of spans; swept 61043 pages at +2.629836e-004 pages/byte
pacer: assist ratio=+1.248932e+001 (scan 145 MB in 456->468 MB) workers=9+0
pacer: H_m_prev=245437576 h_t=+9.500000e-001 H_T=478603273 h_a=+9.551745e-001 H_a=479873296 h_g=+1.000000e+000 H_g=490875152 u_a=+7.491987e-001 u_g=+2.500000e-001 W_a=81006200 goalΔ=+5.000000e-002 actualΔ=+5.174525e-003 u_a/u_g=+2.996795e+000
gc 904 @485.456s 4%: 0.15+26+1.3 ms clock, 4.9+478/223/17+41 ms cpu, 456->457->238 MB, 468 MB goal, 36 P
pacer: sweep done at heap size 263MB; allocated 24MB of spans; swept 59909 pages at +2.528582e-004 pages/byte
pacer: assist ratio=+1.322047e+001 (scan 144 MB in 466->477 MB) workers=9+0
pacer: H_m_prev=250500912 h_t=+9.500000e-001 H_T=488476778 h_a=+9.621801e-001 H_a=491527904 h_g=+1.000000e+000 H_g=501001824 u_a=+8.063436e-001 u_g=+2.500000e-001 W_a=82953048 goalΔ=+5.000000e-002 actualΔ=+1.218010e-002 u_a/u_g=+3.225375e+000
gc 905 @486.013s 4%: 0.77+28+1.4 ms clock, 24+581/245/0+46 ms cpu, 466->468->235 MB, 477 MB goal, 36 P
pacer: sweep done at heap size 253MB; allocated 17MB of spans; swept 61285 pages at +2.621329e-004 pages/byte
pacer: assist ratio=+1.300706e+001 (scan 149 MB in 460->471 MB) workers=9+0
pacer: H_m_prev=247202336 h_t=+9.500000e-001 H_T=482044555 h_a=+9.558154e-001 H_a=483482128 h_g=+1.000000e+000 H_g=494404672 u_a=+7.913021e-001 u_g=+2.500000e-001 W_a=84369712 goalΔ=+5.000000e-002 actualΔ=+5.815369e-003 u_a/u_g=+3.165208e+000
gc 906 @486.606s 4%: 0.22+25+1.3 ms clock, 7.2+510/225/16+43 ms cpu, 460->461->236 MB, 471 MB goal, 36 P
pacer: sweep done at heap size 263MB; allocated 26MB of spans; swept 59549 pages at +2.537141e-004 pages/byte

screen shot 2016-07-22 at 2 56 52 pm redacted

@rhysh

This comment has been minimized.

Copy link
Contributor Author

commented Jul 23, 2016

I tried out the change I proposed, and it seems to do the trick. Incoming network events continue at their usual rate during the concurrent GC, rather than becoming less frequent. The delay between incoming network events and the resumption of the goroutines they unblock is often several microseconds rather than several milliseconds. Goroutines that are scheduled during the concurrent GC often run for just a few hundred microseconds (I dialed down gcOverAssistBytes to 64kiB as well).

I'll share an execution trace and some output of gctrace and gcpacertrace soon, and mail a CL.

Are there any circumstances under which a fix for this would be part of go1.7?

@rhysh

This comment has been minimized.

Copy link
Contributor Author

commented Jul 23, 2016

I mailed CL 25155. With that change (at PS 1) applied to go1.7rc3, the runtime execution trace of my program shows inbound network events as continuing at the same rate whether or not a concurrent GC is executing. The delay between inbound network events and the execution of the goroutines they unblock is generally less than 10µs even when a garbage collection is in progress. With go1.7rc3 without the change, the delays are often near 10ms.

The gctrace output shows that the heap grew more during the concurrent mark phase than with go1.7rc3. In the other gctrace outputs I shared, the GC cycles started when the goal was around 10MB away and completed after growing 1–2MB: the collection finished about 5–10x faster than expected. With this change, the gctrace output shows that the GC cycles start about 20MB before the goal and complete after the 10MB has been allocated: the collection finished only 2x faster than the pacer expected.

Finally, the runtime execution trace shows the goroutine run queue growing and shrinking during the concurrent GC, rather than waiting until almost the end of the cycle before draining. It shows many goroutines each being scheduled for a few hundred microseconds, and yielding once their work is done to allow other goroutines to have some CPU time.

pacer: assist ratio=+5.599321e+000 (scan 155 MB in 468->496 MB) workers=9+0
pacer: H_m_prev=260271008 h_t=+8.876642e-001 H_T=491304256 h_a=+9.315716e-001 H_a=502732080 h_g=+1.000000e+000 H_g=520542016 u_a=+6.200409e-001 u_g=+2.500000e-001 W_a=88987328 goalΔ=+1.123358e-001 actualΔ=+4.390740e-002 u_a/u_g=+2.480164e+000
gc 1199 @688.817s 3%: 4.7+28+1.1 ms clock, 150+447/255/0.030+35 ms cpu, 468->479->259 MB, 496 MB goal, 36 P
pacer: sweep done at heap size 271MB; allocated 11MB of spans; swept 62143 pages at +2.583740e-004 pages/byte
pacer: assist ratio=+5.534240e+000 (scan 158 MB in 489->518 MB) workers=9+0
pacer: H_m_prev=271608696 h_t=+8.893833e-001 H_T=513172939 h_a=+9.295383e-001 H_a=524079392 h_g=+1.000000e+000 H_g=543217392 u_a=+5.818598e-001 u_g=+2.500000e-001 W_a=87967656 goalΔ=+1.106167e-001 actualΔ=+4.015502e-002 u_a/u_g=+2.327439e+000
gc 1200 @689.361s 3%: 0.065+28+1.1 ms clock, 2.0+335/248/155+37 ms cpu, 489->499->245 MB, 518 MB goal, 36 P
pacer: sweep done at heap size 266MB; allocated 20MB of spans; swept 64830 pages at +2.813143e-004 pages/byte
pacer: assist ratio=+6.188041e+000 (scan 154 MB in 466->491 MB) workers=9+0
pacer: H_m_prev=257808720 h_t=+8.979625e-001 H_T=489311275 h_a=+9.371894e-001 H_a=499424320 h_g=+1.000000e+000 H_g=515617440 u_a=+6.944268e-001 u_g=+2.500000e-001 W_a=85881912 goalΔ=+1.020375e-001 actualΔ=+3.922693e-002 u_a/u_g=+2.777707e+000
gc 1201 @689.904s 3%: 0.20+28+1.2 ms clock, 6.2+468/255/15+38 ms cpu, 466->476->233 MB, 491 MB goal, 36 P
pacer: sweep done at heap size 253MB; allocated 20MB of spans; swept 62384 pages at +2.868048e-004 pages/byte
pacer: assist ratio=+6.014401e+000 (scan 147 MB in 441->466 MB) workers=9+0
pacer: H_m_prev=244340048 h_t=+8.945008e-001 H_T=462902409 h_a=+9.222571e-001 H_a=469684400 h_g=+1.000000e+000 H_g=488680096 u_a=+2.575276e-001 u_g=+2.500000e-001 W_a=88528928 goalΔ=+1.054992e-001 actualΔ=+2.775636e-002 u_a/u_g=+1.030110e+000
gc 1202 @690.446s 3%: 0.24+27+1.3 ms clock, 7.8+7.5/243/507+42 ms cpu, 441->447->239 MB, 466 MB goal, 36 P
pacer: sweep done at heap size 249MB; allocated 10MB of spans; swept 59243 pages at +2.540863e-004 pages/byte
pacer: assist ratio=+1.224502e+001 (scan 155 MB in 466->478 MB) workers=9+0
pacer: H_m_prev=251040688 h_t=+9.329543e-001 H_T=485250185 h_a=+9.722309e-001 H_a=495110208 h_g=+1.000000e+000 H_g=502081376 u_a=+7.129376e-001 u_g=+2.500000e-001 W_a=91107928 goalΔ=+6.704567e-002 actualΔ=+3.927659e-002 u_a/u_g=+2.851750e+000
gc 1203 @691.011s 3%: 0.89+29+1.0 ms clock, 28+503/256/0+32 ms cpu, 466->472->249 MB, 478 MB goal, 36 P
pacer: sweep done at heap size 264MB; allocated 14MB of spans; swept 61019 pages at +2.573175e-004 pages/byte
pacer: assist ratio=+7.232023e+000 (scan 160 MB in 476->498 MB) workers=9+0
pacer: H_m_prev=261604112 h_t=+9.104736e-001 H_T=499787761 h_a=+9.500183e-001 H_a=510132800 h_g=+1.000000e+000 H_g=523208224 u_a=+6.067982e-001 u_g=+2.500000e-001 W_a=88859424 goalΔ=+8.952635e-002 actualΔ=+3.954463e-002 u_a/u_g=+2.427193e+000
gc 1204 @691.602s 3%: 0.10+32+1.1 ms clock, 3.2+421/284/88+37 ms cpu, 476->486->242 MB, 498 MB goal, 36 P
pacer: sweep done at heap size 264MB; allocated 22MB of spans; swept 62582 pages at +2.729750e-004 pages/byte
pacer: assist ratio=+6.720931e+000 (scan 150 MB in 461->484 MB) workers=9+0
pacer: H_m_prev=253853736 h_t=+9.072456e-001 H_T=484161420 h_a=+9.460856e-001 H_a=494021104 h_g=+1.000000e+000 H_g=507707472 u_a=+5.527317e-001 u_g=+2.500000e-001 W_a=90982696 goalΔ=+9.275440e-002 actualΔ=+3.884002e-002 u_a/u_g=+2.210927e+000
gc 1205 @692.146s 3%: 0.11+31+1.5 ms clock, 3.6+348/276/166+50 ms cpu, 461->471->246 MB, 484 MB goal, 36 P
pacer: sweep done at heap size 259MB; allocated 15MB of spans; swept 61663 pages at +2.635020e-004 pages/byte
pacer: assist ratio=+7.261088e+000 (scan 159 MB in 470->492 MB) workers=9+0
pacer: H_m_prev=258115136 h_t=+9.106866e-001 H_T=493177126 h_a=+9.494011e-001 H_a=503169936 h_g=+1.000000e+000 H_g=516230272 u_a=+5.632074e-001 u_g=+2.500000e-001 W_a=91345616 goalΔ=+8.931342e-002 actualΔ=+3.871454e-002 u_a/u_g=+2.252830e+000
gc 1206 @692.652s 3%: 0.15+29+1.2 ms clock, 4.7+339/261/229+39 ms cpu, 470->479->253 MB, 492 MB goal, 36 P
pacer: sweep done at heap size 273MB; allocated 19MB of spans; swept 61908 pages at +2.565693e-004 pages/byte

screen shot 2016-07-22 at 7 46 11 pm redacted

@gopherbot

This comment has been minimized.

Copy link

commented Jul 23, 2016

CL https://golang.org/cl/25155 mentions this issue.

@aclements

This comment has been minimized.

Copy link
Member

commented Jul 23, 2016

That's really interesting! I had tried a very similar change when I was looking at #14812: https://go-review.googlesource.com/c/24704/, but found that it didn't help. It's structured slightly differently, but should be equivalent except for two things: I only decreased the over-assist to 512k instead of 64k, and I always added the over-assist rather doing min(over-assist, required work). Would you mind giving CL 24704 a try on your workload, perhaps both with the 512k over-assist and with the 64k over-assist? I want to make sure we really understand what's going on and what the effect of these changes is.

@aclements

This comment has been minimized.

Copy link
Member

commented Jul 23, 2016

Are there any circumstances under which a fix for this would be part of go1.7?

It's not impossible. Code-wise it's pretty low risk. It clearly satisfies the invariants there. The only risk in my mind is that touching anything related to the pacer can have surprising consequences, but even there I think this is pretty low risk. I'd like to take a long stare at it and have at least one other person do so, too.

Allocation during a GC cycle is several orders of magnitude more expensive than allocation between GC cycles. I'd expect the cost difference to be less than 10x.

Is that true of all allocations? Given what's currently on master, I would expect some allocations to be very expensive, but for the mean cost of allocation to be only slightly higher when GC is running.

At the very end of the concurrent GC phase, between 564ms and 566ms in the trace, the goroutine queue drains. The goroutines are each scheduled for more time than they are when the GC isn't running (compare with 568–570ms), but they each run quickly enough that the run queue drains. I'd expect the whole concurrent GC cycle to look like this, at least because it looks like confusion (1) isn't in play there.

There's a fairness issue in the way assists currently work that's probably causing this. If an assist isn't able to fully satisfy itself, it goes on to the assist queue. But an assist that starts later may find more mark work and satisfy itself right away, even though there are older unsatisfied assists on the queue. Since only the background mark worker can satisfy queued assists, if it's not getting enough time because every P is busy doing assists, those queued assists just sit there. This is a lot like very bad lock barging. The result is that you get a pile of blocked assists, and there's a huge spike in runnable goroutines when those get unblocked at GC termination.

This is on my list of things to fix for 1.8. My plan is to effectively queue all assists and do all mark work in the background mark worker. This will make the queuing fair, centralize all mark work in one place in the code, and make it possible to satisfy an assist even in the middle of an indivisible scanning block (which will, in turn, make it more practical to count root scanning work towards assists, which will help satisfy assists at the beginning of the GC cycle).

Is (4) what you mean by "the pacer pulling back"? Does the pacer divide the GC into two (or more) phases, during which it imposes different allocation taxes?

No, actually. What I meant is that the high CPU utilization over the whole GC cycle should cause the pacer to start the next cycle sooner. But you mentioned later that it's also finishing 5--10x sooner than expected, which is probably why this isn't happening.

Would a change like that break other invariants or goals of the GC?

Your proposed change is fine invariant-wise. The only requirement is that a goroutine is not allowed to return from the assist until gcAssistBytes is non-negative. And using the min(over-assist, required work) should accomplish the desired amortization just as well as (if not better than) the current approach of always adding the over-assist.

@aclements

This comment has been minimized.

Copy link
Member

commented Jul 23, 2016

I forgot (not enough text yet!): I wrote a little self-contained RPC-like benchmark yesterday that I think exhibits equivalent symptoms. It's at https://github.com/aclements/go-gcbench/blob/master/gcbench/progs/rpc.go. I run it with GOMAXPROCS set to half my CPUs (since the client and server are separate processes) and fiddle with -reqs-per-sec until I get about 1 GC/sec, which is about -reqs-per-sec 5000 on my laptop with GOMAXPROCS=2.

I haven't tried any runtime changes with it yet.

@aclements

This comment has been minimized.

Copy link
Member

commented Jul 26, 2016

Trying with my RPC benchmark, I get the following client-observed latency tails:

change P99 P99.9 max
master 32ms 61ms 202ms
gcOverAssistBytes = 64k 11ms 40ms 70ms
cl/25155/2 9ms 18ms 32ms
cl/24704/1 (rebased) 33ms 61ms 81ms
cl/24704/1 + 64k over-assist 8ms 22ms 28ms

P50 is 50--100µs in all cases.

So, just setting over-assist to 64k helps but isn't sufficient, and just doing over-assist in terms of scan work instead of allocation bytes (cl/24704/1) also isn't sufficient. We need both. And it looks like your CL (25155) behaves very similarly to my CL (24704) with over-assist turned down to 64k, which is good, because it should. :)

I think we should move forward with your CL (I'll suggest a few mostly documentation things to pull over from 24704). I'm going to run the benchmark battery against it. And I'll see what other people think of putting this in 1.7.

@rhysh

This comment has been minimized.

Copy link
Contributor Author

commented Jul 26, 2016

Would you mind giving CL 24704 a try on your workload, perhaps both with the 512k over-assist and with the 64k over-assist? I want to make sure we really understand what's going on and what the effect of these changes is.

I cherry-picked CL 24704 PS 1 onto go1.7rc3, and it helps my application. Objectively, the delay between inbound network events and the execution of the goroutines they unblock is reduced to only a few microseconds, down from several milliseconds. Subjectively, the frequency of inbound network traffic is still lower than when the GC isn't running, but it's not as much of a decrease as with go1.7rc3.

With go1.7rc3 plus cherry-pick of CL 24704 PS 1 plus reducing the over-assist to 64k, the inbound network activity continues at just about the same rate whether or not the GC is running. This is somewhat subjective, so we might be in the realm of tuning here.

Is that true of all allocations? Given what's currently on master, I would expect some allocations to be very expensive, but for the mean cost of allocation to be only slightly higher when GC is running.

As far as I can tell, it's not true of all allocations—just the allocations that happen. Specifically, it looks with go1.7rc3 like allocations near the beginning of concurrent mark are very expensive. There are some goroutines that usually run for a few hundred microseconds, but if they're scheduled at the beginning of concurrent mark then they'll end up taking 10ms to do their work. I don't know what would change for those goroutines other than the per-byte cost of allocation.

Because the allocations early in the cycle cost so much, there isn't much time left for less expensive ones. It's possible that these average out to something reasonable, but the 10ms penalty for some goroutines early in the cycle is surprising, and doesn't seem to be in the spirit of a low-latency garbage collector. It seems in practice like each garbage collection ends up with three STW phases, with the new and unfortunate "gcAssistAlloc" phase being the longest by far (around 10ms, vs around 1–2ms for mark termination and 100–200µs for sweep termination). Each P will run goroutines (quickly) until it runs a goroutine that allocates, so it's easy for all Ps to end up busy like this, with no progress on interactive user code.


In preparing CL 25155, I tried an over-assist of 1<<20 (as in master) and 64<<10 (as in PS2). I picked it because I remember you saying that the GC can scan about 1MB per second, and some old recommendations (from Dmitry Vyukov?) about splitting work into pieces that can be done in about 100µs. Should the new over-assist figure be calculated more scientifically?


Here's what the "Network" row looks like during GCs for a few flavors of the Go runtime:

go1.7rc3, GC is around t=70–100ms. Note how network activity nearly stops during this time.

screen shot 2016-07-26 at 1 39 58 pm

go1.7rc3 with cherry-pick of cl/24704/1, GC is around t=270–310ms

screen shot 2016-07-26 at 1 40 20 pm

go1.7rc3 with cherry-pick of cl/24704/1 plus reducing the over-assist to 64k, GC is around t=340–370ms

screen shot 2016-07-26 at 1 40 33 pm

@aclements

This comment has been minimized.

Copy link
Member

commented Jul 26, 2016

With go1.7rc3 plus cherry-pick of CL 24704 PS 1 plus reducing the over-assist to 64k, the inbound network activity continues at just about the same rate whether or not the GC is running. This is somewhat subjective, so we might be in the realm of tuning here.

Great. It sounds like my RPC benchmark was getting similar results to the real thing.

Specifically, it looks with go1.7rc3 like allocations near the beginning of concurrent mark are very expensive. ... Because the allocations early in the cycle cost so much, there isn't much time left for less expensive ones.

Yes, these are the allocations I would expect to be more expensive, since nobody's built up credit at that point (Rick and I tried lots of "accounting tricks" to spread this out, but invariably mutators just blew through any slack we gave them and ran into the same problem). But you have an interesting point about these consuming so much of the cycle that there isn't much time to bring the average back down.

In preparing CL 25155, I tried an over-assist of 1<<20 (as in master) and 64<<10 (as in PS2). I picked it because I remember you saying that the GC can scan about 1MB per second, and some old recommendations (from Dmitry Vyukov?) about splitting work into pieces that can be done in about 100µs.

1MB per millisecond. :) But, yes, this is about right and splitting work up into 100µs chunks is generally pretty reasonable.

Should the new over-assist figure be calculated more scientifically?

I tried 16k, 32k, and 128k with my RPC benchmark and the differences were in the noise.

For Go 1.8, I'm considering tying assists to span acquisition, which will amortize the assist cost using a completely different mechanism. If that works, the over-assist knob will be gone anyway.

@rhysh

This comment has been minimized.

Copy link
Contributor Author

commented Jul 27, 2016

Great. It sounds like my RPC benchmark was getting similar results to the real thing.

Yes. I ran your RPC benchmark (replacing the "golang.org/x/benchmarks/gcbench.newer/heapgen" import with "github.com/aclements/go-gcbench/gcbench/heapgen") and its behavior is similar to what I see in my application.

Its behavior also reopens one of my initial questions in this issue: is it right for the GC to consume all available resources at the expense of mutator latency?

When run with go1.7rc3, the RPC benchmark shows a few Ps making very slow progress on a few Gs, as the run queue grows. After the GC, there's a flurry of mutator activity as the run queue drains.

screen shot 2016-07-26 at 5 47 41 pm

go run .../github.com/aclements/go-gcbench/gcbench/progs/rpc.go -trace /tmp/trace.bin.go17rc3 -benchtime 10s
BenchmarkRPC/reqs-per-sec:10000/ballast:64MB/gomaxprocs:4
1
13.0 GCs/op
1.27 GCs/sec
33000 95%ile-ns/sweepTerm
160000 95%ile-ns/markTerm
525 MB-marked/CPU/sec
-0.0263 95%ile-heap-overshoot
-0.231 5%ile-heap-overshoot
0.879 95%ile-CPU-util
9721 reqs/sec
Warning: -0.231 5%ile-heap-overshoot < -0.200
Warning: 0.879 95%ile-CPU-util > 0.500
2016/07/26 17:30:13 cleaning up client
2016/07/26 17:30:13 closing 294 connections
client-measured request latency:
[    13.803µs,    16.982µs) 1
[    16.982µs,    20.892µs) 1
[    20.892µs,    25.703µs) 1
[    25.703µs,    31.622µs) 1
[    31.622µs,    38.904µs) 8
[    38.904µs,    47.863µs) 3134
[    47.863µs,    58.884µs) 35491
[    58.884µs,    72.443µs) 23327
[    72.443µs,    89.125µs) 16944
[    89.125µs,   109.647µs) 10246
[   109.647µs,   134.896µs) 4406
[   134.896µs,   165.958µs) 1275
[   165.958µs,   204.173µs) 269
[   204.173µs,   251.188µs) 64
[   251.188µs,   309.029µs) 34
[   309.029µs,   380.189µs) 38
[   380.189µs,   467.735µs) 30
[   467.735µs,   575.439µs) 35
[   575.439µs,   707.945µs) 28
[   707.945µs,   870.963µs) 32
[   870.963µs,  1.071519ms) 29
[  1.071519ms,  1.318256ms) 39
[  1.318256ms,   1.62181ms) 42
[   1.62181ms,  1.995262ms) 33
[  1.995262ms,  2.454708ms) 33
[  2.454708ms,  3.019951ms) 46
[  3.019951ms,  3.715352ms) 45
[  3.715352ms,  4.570881ms) 58
[  4.570881ms,  5.623413ms) 74
[  5.623413ms,  6.918309ms) 64
[  6.918309ms,   8.51138ms) 92
[   8.51138ms, 10.471285ms) 99
[ 10.471285ms, 12.882495ms) 134
[ 12.882495ms, 15.848931ms) 165
[ 15.848931ms, 19.498445ms) 208
[ 19.498445ms, 23.988329ms) 257
[ 23.988329ms, 29.512092ms) 255
[ 29.512092ms, 36.307805ms) 177

When run with cl/25155/3, the execution trace shows all Ps running their gcBgMarkWorker for about 10ms, then a flurry of mutator activity, then another 10–20ms of all Ps running gcBgMarkWorker.

screen shot 2016-07-26 at 5 48 07 pm

go run ../.../github.com/aclements/go-gcbench/gcbench/progs/rpc.go -trace /tmp/trace.bin25155 -benchtime 10s)
BenchmarkRPC/reqs-per-sec:10000/ballast:64MB/gomaxprocs:4
1
11.0 GCs/op
1.08 GCs/sec
39000 95%ile-ns/sweepTerm
260000 95%ile-ns/markTerm
484 MB-marked/CPU/sec
-0.0328 95%ile-heap-overshoot
-0.297 5%ile-heap-overshoot
0.316 95%ile-CPU-util
9726 reqs/sec
Warning: -0.297 5%ile-heap-overshoot < -0.200
2016/07/26 17:30:47 cleaning up client
2016/07/26 17:30:47 closing 182 connections
client-measured request latency:
[    13.803µs,    16.982µs) 1
[    16.982µs,    20.892µs) 0
[    20.892µs,    25.703µs) 0
[    25.703µs,    31.622µs) 0
[    31.622µs,    38.904µs) 9
[    38.904µs,    47.863µs) 2503
[    47.863µs,    58.884µs) 32406
[    58.884µs,    72.443µs) 24280
[    72.443µs,    89.125µs) 17620
[    89.125µs,   109.647µs) 10894
[   109.647µs,   134.896µs) 5123
[   134.896µs,   165.958µs) 1744
[   165.958µs,   204.173µs) 387
[   204.173µs,   251.188µs) 129
[   251.188µs,   309.029µs) 92
[   309.029µs,   380.189µs) 73
[   380.189µs,   467.735µs) 48
[   467.735µs,   575.439µs) 72
[   575.439µs,   707.945µs) 60
[   707.945µs,   870.963µs) 58
[   870.963µs,  1.071519ms) 34
[  1.071519ms,  1.318256ms) 44
[  1.318256ms,   1.62181ms) 77
[   1.62181ms,  1.995262ms) 92
[  1.995262ms,  2.454708ms) 107
[  2.454708ms,  3.019951ms) 94
[  3.019951ms,  3.715352ms) 98
[  3.715352ms,  4.570881ms) 109
[  4.570881ms,  5.623413ms) 125
[  5.623413ms,  6.918309ms) 211
[  6.918309ms,   8.51138ms) 231
[   8.51138ms, 10.471285ms) 183
[ 10.471285ms, 12.882495ms) 215
[ 12.882495ms, 15.848931ms) 131
[ 15.848931ms, 19.498445ms) 44

cl/25155/3 is very helpful for my production application, I'd love to see it merged for Go 1.7.


1MB per millisecond. :) But, yes, this is about right and splitting work up into 100µs chunks is generally pretty reasonable.

Whoops, yes! Work segments of 100µs sound good.

I'm surprised that the gcBgMarkWorker goroutines are able to consume all Ps for 10ms at a time—and possibly longer, the cl/25155/3 screenshot at 1880–1890ms shows a non-zero run queue while all 4 Ps continue to run their gcBgMarkWorkers for more than 10ms. I know you have CL 24706 to change this to 1ms, which should help. A delay of 1ms is much less noticeable in most apps than one of 10ms, but it's still a constant delay that's not under the user's control.

The runtime reserves 25% of CPU time for the GC; does it make sense to reserve 25% for the application, preventing the GC from using more than 75%?

Here's the execution trace of cl/25155/3 with cl/24706/1 cherry-picked on top. Of the four Ps, three of them alternate between running gcBgMarkWorkers for short periods, and running mutator code. It attenuates the latency mode near 10ms by a factor of around 100–1000, all but eliminating it (although there's now a similarly-shaped lump in the response time population near 1ms).

screen shot 2016-07-26 at 6 14 40 pm

go run .../github.com/aclements/go-gcbench/gcbench/progs/rpc.go -trace /tmp/trace.bin.25155.24706 -benchtime 10s
BenchmarkRPC/reqs-per-sec:10000/ballast:64MB/gomaxprocs:4
1
11.0 GCs/op
1.08 GCs/sec
25000 95%ile-ns/sweepTerm
100000 95%ile-ns/markTerm
515 MB-marked/CPU/sec
-0.0182 95%ile-heap-overshoot
-0.320 5%ile-heap-overshoot
0.315 95%ile-CPU-util
9724 reqs/sec
Warning: -0.320 5%ile-heap-overshoot < -0.200
2016/07/26 18:12:50 cleaning up client
2016/07/26 18:12:50 closing 18 connections
client-measured request latency:
[    31.622µs,    38.904µs) 10
[    38.904µs,    47.863µs) 2700
[    47.863µs,    58.884µs) 31665
[    58.884µs,    72.443µs) 25215
[    72.443µs,    89.125µs) 17765
[    89.125µs,   109.647µs) 11147
[   109.647µs,   134.896µs) 5184
[   134.896µs,   165.958µs) 1724
[   165.958µs,   204.173µs) 417
[   204.173µs,   251.188µs) 185
[   251.188µs,   309.029µs) 159
[   309.029µs,   380.189µs) 157
[   380.189µs,   467.735µs) 132
[   467.735µs,   575.439µs) 145
[   575.439µs,   707.945µs) 158
[   707.945µs,   870.963µs) 165
[   870.963µs,  1.071519ms) 170
[  1.071519ms,  1.318256ms) 121
[  1.318256ms,   1.62181ms) 18
[   1.62181ms,  1.995262ms) 5
[  1.995262ms,  2.454708ms) 6
[  2.454708ms,  3.019951ms) 0
[  3.019951ms,  3.715352ms) 0
[  3.715352ms,  4.570881ms) 0
[  4.570881ms,  5.623413ms) 0
[  5.623413ms,  6.918309ms) 0
[  6.918309ms,   8.51138ms) 0
[   8.51138ms, 10.471285ms) 0
[ 10.471285ms, 12.882495ms) 1
@aclements

This comment has been minimized.

Copy link
Member

commented Jul 27, 2016

I ran your RPC benchmark (replacing the "golang.org/x/benchmarks/gcbench.newer/heapgen" import with "github.com/aclements/go-gcbench/gcbench/heapgen")

Whoops! goimports gone wrong. I pushed a fix for this (and delete the old gcbench.newer so it doesn't happen agian). I also added printing the 99th and 99.9th percentile numbers.

Its behavior also reopens one of my initial questions in this issue: is it right for the GC to consume all available resources at the expense of mutator latency? ... The runtime reserves 25% of CPU time for the GC; does it make sense to reserve 25% for the application, preventing the GC from using more than 75%?

Well, it's not supposed to do this, but when it does, it doesn't have much choice. If there's any strict limit on GC CPU utilization, then allocation can outpace the garbage collector, leading to overshooting the goal heap size in the best case and never terminating GC in the worst case (until you OOM). It's supposed to prevent this by starting the GC cycle early enough that it can spread out the work over enough time, but for complicated reasons that I think I understand but haven't figured out how to fix yet, this doesn't work very well in situations with really high allocation rates.

cl/25155/3 is very helpful for my production application, I'd love to see it merged for Go 1.7.

Rick and I have agreed that it seems low risk. He's going to take a close look at it tomorrow morning.

I'm surprised that the gcBgMarkWorker goroutines are able to consume all Ps for 10ms at a time—and possibly longer, the cl/25155/3 screenshot at 1880–1890ms shows a non-zero run queue while all 4 Ps continue to run their gcBgMarkWorkers for more than 10ms. I know you have CL 24706 to change this to 1ms, which should help. A delay of 1ms is much less noticeable in most apps than one of 10ms, but it's still a constant delay that's not under the user's control.

What you're seeing are the idle mark workers kicking in because there's a moment on each P when it's not running anything. Unfortunately, these currently run for 10ms when they run, which was a bad design decision (expedient at the time; I didn't realize at the time that it would have this consequence). The 1ms preemption was a stop-gap experiment, which, oddly, didn't help the RPC application we were looking at at the time. But it's certainly not the right solution for the idle worker (it may be okay for the fractional worker). On my list for Go 1.8 is to figure out how to get the idle mark worker to really behave like an idle-priority task and give up the P the moment there's anything else to do.

@rhysh

This comment has been minimized.

Copy link
Contributor Author

commented Jul 27, 2016

Well, it's not supposed to do this, but when it does, it doesn't have much choice. If there's any strict limit on GC CPU utilization, then allocation can outpace the garbage collector, leading to overshooting the goal heap size in the best case and never terminating GC in the worst case (until you OOM). It's supposed to prevent this by starting the GC cycle early enough that it can spread out the work over enough time, but for complicated reasons that I think I understand but haven't figured out how to fix yet, this doesn't work very well in situations with really high allocation rates.

I may have been too imprecise. What I had in mind was a limit on the CPU usage of the idle workers. It seems like the assist is what ensures that the mutator doesn't allocate too quickly, which means that mechanism would have to stay in place. I was curious as to whether the runtime could prevent more than 3/4ths of the Ps from concurrently running their idle mark workers—separate from the mandatory assists.

Your plan to preempt the idle mark worker once there's other work to do sounds much better all around.

Rick and I have agreed that it seems low risk. He's going to take a close look at it tomorrow morning.

Great! I'll try to be on-hand for edits.

@aclements

This comment has been minimized.

Copy link
Member

commented Jul 27, 2016

I may have been too imprecise. What I had in mind was a limit on the CPU usage of the idle workers.

Oh, I see. Yes, that could work, though it may not work so well with GOMAXPROCS=1. One possible approach to preempting the idle worker as soon as there's work is to only run the idle worker on N-1 of the idle Ps, keeping one P truly idle. If that one gets woken up to do work, it would actively halt one of the idle workers to make another P truly idle.

@gopherbot gopherbot closed this in ccca9c9 Jul 27, 2016

@dvyukov

This comment has been minimized.

Copy link
Member

commented Jul 27, 2016

On a related note, I always wondered why GC credit/debit is per-g rather than per-p. What's the reasoning behind that? Have we tried to make it per-p? There are obviously pros and cons. But per-p looks more reasonable to me. Is it the case that per-g credit can be left unused for lots of goroutines? Or there is something that prevents that?

@aclements

This comment has been minimized.

Copy link
Member

commented Jul 27, 2016

Are there any circumstances under which a fix for this would be part of go1.7?

It's now been pushed to master and should get cherry-picked in the last batch before the release. We're keeping a close eye on it and running internal tests, which I should have the results of tomorrow.

On a related note, I always wondered why GC credit/debit is per-g rather than per-p. What's the reasoning behind that? Have we tried to make it per-p? There are obviously pros and cons. But per-p looks more reasonable to me.

I don't think it would be bad to do it per P. The reason we do it per G is that the time a goroutine spends helping the garbage collector is supposed to be proportional to its own allocation rate, so that the cost of assisting roughly balances across the bytes you allocate. Whether or not this actually matters, I'm really not sure. Doing it per P would still maintain the really important invariant that allocation not outpace scanning.

I have been considering moving the assist to when a span is acquired from the mcentral to reduce the overhead to the hot allocation path. In that case, it may make more sense to track it per P, along with the mcache. If I also change assists to always queue if there's not enough global credit (and not do any scan work of their own), then there might not be any need to track assist debt at all, per P or per G.

Is it the case that per-g credit can be left unused for lots of goroutines? Or there is something that prevents that?

Currently, left over credit is recycled along with the g struct. This seems to work fine in practice, but of course, if lots of goroutines build up spare credit and then exit and new goroutines don't replace the exited goroutines, we'll lose track of this built-up credit. A while ago I wrote https://go-review.googlesource.com/c/24703/ to flush this credit to the global credit pool on goroutine exit, but it didn't seem to have any effect, so it never got merged.

@rhysh

This comment has been minimized.

Copy link
Contributor Author

commented Jul 27, 2016

It's now been pushed to master and should get cherry-picked in the last batch before the release. We're keeping a close eye on it and running internal tests, which I should have the results of tomorrow.

That's great news, thank you! Its behavior looks good in the few services I operate.

I still see that nearly all Ps can end up running their gcBgMarkWorker goroutines, largely starving the mutator of CPU time. I've got an example of 35/36 Ps doing that for around 16ms while the run queue (per "Goroutines" row) is non-empty. It doesn't happen on every GC, so most collections (at ccca9c9, with CL 25155 merged) allow the mutator to continue running with low latency.

@aclements do you have a separate issue open to track your plans for preempting the idle mark worker?

@aclements

This comment has been minimized.

Copy link
Member

commented Jul 28, 2016

@aclements do you have a separate issue open to track your plans for preempting the idle mark worker?

I just created #16528 to track this. You've got some pretty great data on this problem here. If you've got time, feel free to add to the new issue.

@Dieterbe

This comment has been minimized.

Copy link
Contributor

commented Jul 30, 2016

Great work @rhysh and @aclements !
Do we think this should also solve #14812 ? If so I will happily try to repro again and try out the fix to confirm.

@quentinmit

This comment has been minimized.

Copy link
Contributor

commented Jul 30, 2016

@Dieterbe I expect it will be a noticeable improvement. Not sure if it will rise to the level of "solved" :)

You should see fewer latency spikes at the cost of slightly more time spent in the GC.

@aclements

This comment has been minimized.

Copy link
Member

commented Jul 30, 2016

It would be great to give it a try. There are still some known latency
issues, but I expect this to at least help. You can either grab master now
or just wait a few days and try the 1.7 release.

On Jul 30, 2016 6:47 AM, "Dieter Plaetinck" notifications@github.com
wrote:

Great work @rhysh https://github.com/rhysh and @aclements
https://github.com/aclements !
Do we think this should also solve #14812
#14812 ? If so I will happily try to
repro again and try out the fix to confirm.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#16432 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/ACkFO29D-sFNPFP62l9fp2MHkea3MPowks5qayuggaJpZM4JQQy0
.

@RLH

This comment has been minimized.

Copy link
Contributor

commented Jul 30, 2016

It would be great to try master. If there are problems the sooner we know
about them the more options we have.

On Sat, Jul 30, 2016 at 12:06 PM, Austin Clements notifications@github.com
wrote:

It would be great to give it a try. There are still some known latency
issues, but I expect this to at least help. You can either grab master now
or just wait a few days and try the 1.7 release.

On Jul 30, 2016 6:47 AM, "Dieter Plaetinck" notifications@github.com
wrote:

Great work @rhysh https://github.com/rhysh and @aclements
https://github.com/aclements !
Do we think this should also solve #14812
#14812 ? If so I will happily try
to
repro again and try out the fix to confirm.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#16432 (comment), or
mute
the thread
<
https://github.com/notifications/unsubscribe-auth/ACkFO29D-sFNPFP62l9fp2MHkea3MPowks5qayuggaJpZM4JQQy0

.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#16432 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/AA7WnxFO7u_SISbW_7pD4d3nSxH7Vradks5qa3aSgaJpZM4JQQy0
.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
You can’t perform that action at this time.