
runtime: assisting mark and sweep make GC uncontrollable in the critical path #21107

Closed
zhaozhiqianghw opened this issue Jul 21, 2017 · 19 comments

9 participants
@zhaozhiqianghw commented Jul 21, 2017

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

Go 1.8

@rhysh gave a talk (https://github.com/gophercon/2017-talks/blob/master/RhysHiltner-AnIntroductionToGoToolTrace/Rhys-Hiltner-go-tool-trace-GopherCon-2017.pdf) at GopherCon 2017 about using trace to show the influence of assist marking and sweeping. These two stages can cause a long stall in any mutator goroutine, and our team has also observed this bad effect. These strategies exist to throttle the mutators and keep the heap from growing too fast, but they really make the GC uncontrollable on the critical path.

So does Go have any plan to improve this? @aclements

cc @rhysh @aclements

@odeke-em odeke-em changed the title GC: the influence of assist marking and sweeping runtime: assisting mark and sweep make GC uncontrollable in the critical path Jul 21, 2017

@odeke-em (Member) commented Jul 21, 2017

I've updated the title to "runtime: assisting mark and sweep make GC uncontrollable in the critical path", anyone, please correct it if needed. I'll also mark this for Go1.10 so that we can take a look at it when the tree opens.

@odeke-em odeke-em added this to the Go1.10 milestone Jul 21, 2017

@ianlancetaylor (Contributor) commented Jul 21, 2017

@zhaozhiqianghw Can you provide a runnable test case that shows the problem? Right now I don't see a clear bug here. If you just want to discuss the GC, that should be done on the golang-dev mailing list, not in the issue tracker. Thanks.

@zhaozhiqianghw (Author) commented Jul 29, 2017

@ianlancetaylor I think this binarytrees example (http://benchmarksgame.alioth.debian.org/u64q/program.php?test=binarytrees&lang=go&id=4) shows the case clearly. Although STW1 and STW2 are really short, the sweep stage and assist stage occupy all of the Ps. This has the same effect as a STW, but it is hidden from the statistics. So we can still say that Go's GC has a long STW, just not in the mark phase: it is actually in the sweep phase, and sometimes in the mark phase as well.

We just enabled tracing in this example to see the case. I ran the example with parameter 8 on 4 cores and found that at some points all four cores do GC sweep for 1 ms; that means the "STW" is at least 1 ms. I used go1.9rc1 so that the GC ASSIST events are shown.

@aclements (Member) commented Jul 31, 2017

@zhaozhiqianghw, can you describe a specific problem you're seeing with binarytree? (Bear in mind that microbenchmarks are important only as far as they model the behavior of things people actually care about.)

This may be a dup of #14951, but I can't really say without more detail.

@zhaozhiqianghw (Author) commented Jul 31, 2017

[screenshot: execution trace showing Ps doing assist sweep after marking]

What I want to show is above. After GC marking, almost all Ps "stopped" to do assist sweeping. This is effectively a STW for the mutator. For a program that cares about latency, this sweeping "STW" introduces about 1 ms of latency into critical mutators. We really have come across this situation; I just used a common benchmark to demonstrate it.

So am I clear? I think this is a problem, and I want to know your opinion. @aclements

@RLH (Contributor) commented Jul 31, 2017

[comment minimized; text not shown]

@zhaozhiqianghw (Author) commented Jul 31, 2017

Actually I can see this in both Go 1.8.x and 1.9rc1. My case is not the same as the one you mentioned, @RLH. I display the gctrace below. As I analyzed the code, after GC marking the mutators get pulled into sweeping, because lots of spans of the same kind need to be swept.

gc 81 @2.492s 31%: 0.020+0.099+0.075 ms clock, 0.16+0.071/0.042/0.067+0.60 ms cpu, 4->4->0 MB, 5 MB goal, 8 P
gc 82 @2.532s 30%: 0.021+0.10+0.072 ms clock, 0.17+0.086/0.065/0.039+0.57 ms cpu, 4->4->0 MB, 5 MB goal, 8 P
gc 83 @2.574s 30%: 0.021+0.12+0.083 ms clock, 0.17+0.094/0.059/0.059+0.66 ms cpu, 4->4->0 MB, 5 MB goal, 8 P
gc 84 @2.617s 29%: 0.023+0.13+0.062 ms clock, 0.19+0.10/0.066/0.088+0.49 ms cpu, 4->4->0 MB, 5 MB goal, 8 P
gc 85 @2.658s 29%: 0.029+0.13+0.097 ms clock, 0.23+0.10/0.073/0.033+0.77 ms cpu, 4->4->0 MB, 5 MB goal, 8 P
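For reference, the fields in gctrace lines like these decode as follows (per the runtime package's GODEBUG=gctrace documentation):

```
gc # @#s #%: #+#+# ms clock, #+#/#/#+# ms cpu, #->#-># MB, # MB goal, # P

  #%          fraction of total CPU time spent in GC since program start
  ms clock    wall-clock time for the three phases:
              STW sweep termination + concurrent mark/scan + STW mark termination
  ms cpu      CPU time for the same phases; the mark/scan portion is split
              as assist / background / idle GC time
  #->#-># MB  heap size at GC start -> heap size at GC end -> live heap
  # MB goal   heap-size goal for this cycle
  # P         number of processors used
```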
@RLH (Contributor) commented Jul 31, 2017

Thanks for running this on 1.9rc. Can you confirm that this is running on an 8-HW-thread machine, probably a 4-core AMD64 box? Is anything else running on the machine? I'm trying to rule out the possibility that the OS is not making GOMAXPROCS worth of HW threads available to the Go application.

Continuing on that note, is the code that demonstrates this issue the same as the one mentioned above at http://benchmarksgame.alioth.debian.org/u64q/program.php?test=binarytrees&lang=go&id=4 ?
If so, the line in main
runtime.GOMAXPROCS(runtime.NumCPU() * 2)
may be the problem, since it basically tells the runtime that there are twice as many CPUs available as there really are. If the OS takes HW threads away from the runtime for 1 millisecond, a typical OS time slice, then that might explain what is happening.

@aclements (Member) commented Jul 31, 2017

@zhaozhiqianghw, how are you running binarytree? When I run it without arguments, it finishes in 0.02 seconds and does no GCs. If I run it with the argument "21" suggested by the shootout, the heap is significantly larger than the 4MB shown in your gctrace and retains significantly more than the 0MB shown in your gctrace.

@zhaozhiqianghw (Author) commented Jul 31, 2017

Let's make the details clearer. @RLH, to reduce the influence of the OS, I removed the runtime.GOMAXPROCS(runtime.NumCPU()) call and ran the program with GOGC=100 GODEBUG=gctrace=1 taskset -c 4-7 ./main 21 to confirm that it runs on 4 HW threads. Then let's see whether the problem occurs or not.

[screenshot: trace showing simultaneous sweep across all Ps]

GOGC=100 GODEBUG=gctrace=1 taskset -c 4-7 ./main 21
gc 1 @0.051s 4%: 0.10+10+0.10 ms clock, 0.31+0.060/9.9/13+0.30 ms cpu, 4->4->4 MB, 5 MB goal, 4 P
gc 2 @0.085s 11%: 0.027+31+0.072 ms clock, 0.11+13/31/2.9+0.29 ms cpu, 7->8->8 MB, 8 MB goal, 4 P
gc 3 @0.160s 16%: 0.032+65+0.11 ms clock, 0.12+23/64/6.6+0.44 ms cpu, 15->17->17 MB, 17 MB goal, 4 P
gc 4 @0.303s 18%: 0.032+139+0.11 ms clock, 0.13+42/138/0.89+0.45 ms cpu, 29->35->35 MB, 35 MB goal, 4 P
gc 5 @0.586s 16%: 0.034+188+0.15 ms clock, 0.13+0.44/187/174+0.61 ms cpu, 57->67->67 MB, 70 MB goal, 4 P
stretch tree of depth 22       check: 8388607
gc 6 @1.074s 15%: 0.036+373+0.097 ms clock, 0.14+0.76/373/338+0.38 ms cpu, 113->128->128 MB, 135 MB goal, 4 P
gc 7 @1.858s 15%: 0.072+117+0.054 ms clock, 0.072+195/117/0+0.054 ms cpu, 221->232->81 MB, 257 MB goal, 4 P
gc 8 @2.064s 17%: 0.085+122+0.085 ms clock, 0.085+231/122/0+0.085 ms cpu, 141->154->97 MB, 162 MB goal, 4 P
gc 9 @2.274s 20%: 0.062+148+0.067 ms clock, 0.12+233/148/0+0.13 ms cpu, 161->180->113 MB, 194 MB goal, 4 P
gc 10 @2.514s 21%: 0.050+172+0.075 ms clock, 0.10+223/172/0+0.15 ms cpu, 177->203->122 MB, 226 MB goal, 4 P
gc 11 @2.781s 22%: 0.054+152+0.037 ms clock, 0.10+153/152/0+0.075 ms cpu, 186->209->97 MB, 245 MB goal, 4 P
gc 12 @3.015s 23%: 0.069+156+0.098 ms clock, 0.13+188/156/0+0.19 ms cpu, 152->176->107 MB, 195 MB goal, 4 P
gc 13 @3.250s 24%: 0.070+189+0.075 ms clock, 0.14+202/189/0+0.15 ms cpu, 163->193->125 MB, 215 MB goal, 4 P
gc 14 @3.525s 25%: 0.050+205+0.035 ms clock, 0.10+198/205/0+0.071 ms cpu, 184->220->135 MB, 251 MB goal, 4 P
gc 15 @3.819s 25%: 0.059+173+0.042 ms clock, 0.11+147/173/0+0.084 ms cpu, 196->226->108 MB, 271 MB goal, 4 P
gc 16 @4.077s 26%: 0.053+151+0.069 ms clock, 0.10+190/151/0+0.13 ms cpu, 164->191->109 MB, 216 MB goal, 4 P
gc 17 @4.308s 26%: 0.082+192+0.091 ms clock, 0.16+188/192/0+0.18 ms cpu, 163->195->124 MB, 219 MB goal, 4 P
gc 18 @4.588s 27%: 0.076+176+0.077 ms clock, 0.15+155/176/0+0.15 ms cpu, 181->212->112 MB, 249 MB goal, 4 P
gc 19 @4.848s 27%: 0.081+147+0.087 ms clock, 0.16+173/147/0+0.17 ms cpu, 167->194->104 MB, 224 MB goal, 4 P
gc 20 @5.071s 27%: 0.091+169+0.092 ms clock, 0.18+194/169/0+0.18 ms cpu, 156->187->120 MB, 209 MB goal, 4 P
gc 21 @5.316s 28%: 0.050+210+0.086 ms clock, 0.10+193/210/0+0.17 ms cpu, 172->211->137 MB, 240 MB goal, 4 P
gc 22 @5.616s 28%: 0.060+158+0.076 ms clock, 0.12+129/158/0+0.15 ms cpu, 193->223->100 MB, 274 MB goal, 4 P
gc 23 @5.854s 28%: 0.062+140+0.10 ms clock, 0.12+168/139/0+0.21 ms cpu, 151->175->100 MB, 200 MB goal, 4 P

@aclements The reason is that I had posted the gctrace from the final stage. I re-collected the gctrace from the beginning, and I think it's the same as yours. From the result above we can still find the problem. And the sweep "STW" is sometimes even longer (2.3 ms), so I think this can reduce the time slice of OS @RLH. From the image we can also see that the problem does not always occur, but we can always find it somewhere. @aclements I'm pretty sure you can reproduce it, with either 18 or 21.

[screenshot: trace showing mark assists during GC]

Besides, the current credit system in the GC can also introduce uncontrollable latency. We came across this in Go 1.8.0. The reason I ran this benchmark is that Go 1.9 makes the GC ASSIST visible. From the result above, during GC, alongside the dedicated mark Ps, assist marking introduces uncontrollable latency into critical mutators. For a real-time program, I think leaving 25% of CPU time for the GC is quite acceptable, but injecting uncontrollable GC assist marking and assist sweeping has a big effect.

@aclements (Member) commented Aug 3, 2017

Thanks for the additional details @zhaozhiqianghw. I can now confirm that I can reproduce similar behavior.

> to reduce the influence of the OS, I removed the runtime.GOMAXPROCS(runtime.NumCPU()) call

I did the same and found that the original over-subscription had a significant effect on the trace because the OS was much more likely to context-switch at an unfortunate time. But this is just a tracing artifact, not a real problem, so we should focus on the version without over-subscription.

> The reason is that I had posted the gctrace from the final stage.

I still don't see any issues in the GC trace you posted. Am I missing something?

> And the sweep "STW" is sometimes even longer (2.3 ms), so I think this can reduce the time slice of OS @RLH.

I'm not sure what you mean by "this can reduce the time slice of OS". However, you're right that this simultaneous sweep is an issue. When I reproduce this locally, I can see from the event details that these are all very large sweeps that reclaim very little. This is issue #18155. Thanks for finding another reproducer. :)

That said, this only happens because this benchmark has an extremely specific allocation pattern that I think is unlikely to happen outside a micro-benchmark. This long sweep is because the entire longLivedTree is allocated completely sequentially from the heap, which only happens because there are absolutely no intervening short-lived allocations. If there were even a handful of short-lived allocations, we wouldn't see this pattern. Are you observing this in a real application?

> From the result above, during GC, alongside the dedicated mark Ps, assist marking introduces uncontrollable latency into critical mutators.

The marking you've highlighted is 0.23ms, which is well within the realm of noise from other parts of the system, like the OS. I do think we should reduce this (#14951), but it's not at all obvious to me that 0.23ms is a problem.

@zhaozhiqianghw (Author) commented Aug 3, 2017

I think the impact of the assist sweeping stage is quite strong. And about your question:

> Are you observing this in a real application?

Yes, I observed this in my application and found long sweeping in the mutator, but it's hard to show my application. And I found that @rhysh also reported this problem, so I use this micro-benchmark to show it to you. Do you have any plan to reduce the influence of this simultaneous sweeping and of long sweeps in the mutator?

@bryanpkc (Contributor) commented Aug 3, 2017

@aclements Let me try to clarify; this is the problem I mentioned to you in my July 13 email.

If you look at @zhaozhiqianghw's MARK screenshot carefully: even though each time the mutator only does 230µs worth of marking, the same mutator has to do this repeatedly, a large number of times, because it performs a large number of allocations. The effect of this is that application requests see a 50x latency increase when they happen to run within a GC cycle. From the application's perspective, it might as well have been a 12 ms STW pause.

We account for the dedicated GC goroutine(s) by using more cores than are necessary for the application workload, but the mutators still must do excessive marking during a GC cycle. Outside of a GC cycle, the request latencies are much more reasonable, except for the simultaneous sweeping just after a GC cycle. Again, when this happens, there is an effective pause in the application, and the latencies of all requests during that time suffer as a result.

Our application requires very consistent latency, and we would very much appreciate any suggestion from you on reducing or eliminating such wild latency fluctuations.

@rhysh (Contributor) commented Aug 3, 2017

> Yes, I observed this in my application and found long sweeping in the mutator. But it's hard to show my application. And I found @rhysh also reported this problem.

I've reported similar behavior, but I haven't seen it in a production application. (I filed #18155 for someone's blog post about a micro-benchmark.)

It would certainly be good to handle worst-case scenarios like this one—demonstrating that it happens in a production application would likely increase the priority of a fix.

Are the execution traces and gctrace lines in this issue all for the binarytree benchmark, or are some from the (likely closed-source) application where you first encountered this issue?

The execution traces in this issue show two different GC-related contributions to application latency: some in the mark phase, some in the sweep phase.

I'm going to lay out my current understanding of the problems so we're all on the same page. (I don't know that it'll be the right page on the first try—@aclements and others will hopefully correct it if I'm mistaken.)


First, the mark phase.

The way mark assists are currently tuned means that the application code (mutator) ends up doing around the same amount of work as the dedicated GC worker goroutines. You can see this in the gctrace lines.

The "gc 22" line ends with "193->223->100 MB, 274 MB goal" and the "gc 23" line ends with "151->175->100 MB, 200 MB goal". At the end of gc 22, there's 100 MB of live heap memory. The default setting of GOGC means that the memory usage is allowed to double before the end of the next GC. This becomes the "200 MB goal" for gc 23.
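This arithmetic is just the documented GOGC relationship; a minimal sketch (the helper name is mine, not a runtime API):

```go
package main

import "fmt"

// heapGoal computes the next cycle's heap goal from the live heap at the
// end of the previous cycle, per the documented GOGC relationship:
//   goal = live * (1 + GOGC/100)
func heapGoal(liveMB, gogc int) int {
	return liveMB * (100 + gogc) / 100
}

func main() {
	// gc 23 in the trace: 100 MB live after gc 22, GOGC=100 -> 200 MB goal.
	fmt.Println(heapGoal(100, 100)) // 200
	// A lower GOGC trades more frequent GCs for a smaller goal.
	fmt.Println(heapGoal(100, 50)) // 150
}
```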

At the end of gc 22, the GC "pacer" sets the goal of 200 MB, and based on the allocation rate of the program and a bunch of other inputs I don't yet understand, calculates that it will need to begin the 23rd GC cycle when the heap size is 151 MB. However, the live heap size at the end of that cycle remains 100 MB. Since the mark process only involves inspecting live memory, it happened twice as fast as the worst case. If all of the heap were still reachable, the mark phase would have taken longer, so the heap would have grown all the way to 200 MB instead of just to 175 MB.

That's exactly the kind of equilibrium the current pacer algorithm is designed to find. One consequence is that the pacer finishes the GC cycle early.

The other, shown in the "0.12+168/139/0+0.21 ms cpu" portion of the gctrace lines, is that the background workers are only doing around half of the work (139+0 ms on that line), with the other half coming from mutator assists (168 ms on that line).

Issue #14951 gets pretty deep pretty fast, but is there to track progress on (among other things) improving the pacer algorithm so it doesn't slow down applications too much via assists, when the background workers could do the work instead.

Ideally, I think, mark assists would only activate when a program's behavior changed, and would rarely occur for applications that were maintaining a steady heap size and allocation rate/pattern.

I don't know of a general workaround for this source of mutator latency, beyond "don't make large allocations in the program's critical path".
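One common way to follow that advice, reducing allocation volume on the critical path, is to recycle buffers with sync.Pool. This is a general mitigation added here for illustration, not something prescribed in this thread:

```go
package main

import (
	"fmt"
	"sync"
)

// bufPool reuses byte slices so the critical path allocates less, which
// in turn lowers the mark-assist debt charged to the mutator.
var bufPool = sync.Pool{
	New: func() interface{} { return make([]byte, 4096) },
}

// handle is a hypothetical request handler that borrows a buffer from the
// pool instead of allocating a fresh one per request.
func handle(req []byte) int {
	buf := bufPool.Get().([]byte)
	defer bufPool.Put(buf)
	n := copy(buf, req)
	return n
}

func main() {
	fmt.Println(handle([]byte("hello"))) // 5
}
```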


Next, delays during the sweep phase.

After the mark phase completes, the runtime prefers to hand out memory allocations from newly-reclaimed memory rather than getting new regions from the OS. Although it's clear from looking at a piece of memory whether it's available for reuse (the mark bit will not be set), it's not clear where these pieces of memory might be found. The work to find them is called "sweep".

In order to allocate memory right after the mark phase finishes, a goroutine needs two things. First, if not much sweep work has been done, the goroutine may need to do some sweep work to build up credit to offset the allocation it's about to make. Second, there has to be a piece of memory available to hold the allocation.

From the discussion in #18155, it sounds like this isn't a problem with the credit so much as actually finding the memory to use for the allocation. Simple "microbenchmarks" like those discussed in this issue and in #18155 can flood large sections of memory with identically-sized allocations that all remain live, making it hard to find a free piece of memory to use for new allocations.

I haven't seen this problem in production applications, and it seems to require a memory allocation pattern so consistent and so simple that I don't (yet) know how it could happen outside of a microbenchmark. It would be very interesting to learn of other sources of this issue, if they exist. Without seeing it in a production application, I don't know how to offer a workaround for that case.
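For concreteness, the kind of pattern being described, a large fully-live structure allocated with no intervening short-lived allocations, can be sketched like this (a simplified stand-in for the benchmark's longLivedTree):

```go
package main

import "fmt"

type node struct {
	left, right *node
}

// build allocates a complete binary tree of the given depth with no
// intervening short-lived allocations, so the nodes end up densely packed
// in same-sized spans that sweep must later scan at length for free slots.
func build(depth int) *node {
	if depth == 0 {
		return &node{}
	}
	return &node{left: build(depth - 1), right: build(depth - 1)}
}

func count(n *node) int {
	if n == nil {
		return 0
	}
	return 1 + count(n.left) + count(n.right)
}

func main() {
	longLived := build(10) // every node stays reachable
	fmt.Println(count(longLived)) // 2047 nodes for depth 10
}
```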

@gopherbot commented Aug 31, 2017

Change https://golang.org/cl/60790 mentions this issue: cmd/trace: add minimum mutator utilization (MMU) plot

@aclements (Member) commented Nov 8, 2017

@rhysh, thanks for the detailed analysis, as always. :)

I verified that the mark phase issues are now fixed on master (to be released in Go 1.10).

The sweep phase issues aren't, but as @rhysh said, we've only ever observed these in microbenchmarks with incredibly simple allocation and retention patterns. If people are observing this in real applications, that would be very useful to know. At any rate, this part is a dup of #18155.

Thanks for all the traces and analyses!

@aclements aclements closed this Nov 8, 2017

@zhaozhiqianghw (Author) commented Nov 9, 2017

@aclements Could you please point to the commits that fixed the mark phase? I only found this commit f10d99f . I really care about this topic.

@aclements (Member) commented Nov 13, 2017

@zhaozhiqianghw: there were two major fixes, plus one dependency:

  1. The series that ends with https://golang.org/cl/68574 (commits 2792348^..e09dbaa) modified how fractional workers were scheduled to strongly prefer dedicated workers and to preempt fractional workers more aggressively.

  2. The series that ends with https://golang.org/cl/73832 (commits 15d6ab6^..877387e) is important for the performance of the next fix, but not directly related to mark phase latency.

  3. The series that ends with https://golang.org/cl/59971 (commits 03eb948^..af192a3) modified the pacer to significantly reduce mark assist load in steady state.

@gopherbot commented Sep 27, 2018

Change https://golang.org/cl/138076 mentions this issue: runtime: track all heap arenas in a slice

gopherbot pushed a commit that referenced this issue Nov 5, 2018

cmd/trace: add minimum mutator utilization (MMU) plot
This adds an endpoint to the trace tool that plots the minimum mutator
utilization curve using information on mark assists and GC pauses from
the trace.

This commit implements a fairly straightforward O(nm) algorithm for
computing the MMU (and tests against an even more direct but slower
algorithm). Future commits will extend and optimize this algorithm.

This should be useful for debugging and understanding mutator
utilization issues like #14951, #14812, #18155, #18534, #21107,
particularly once follow-up CLs add trace cross-referencing.

Change-Id: Ic2866869e7da1e6c56ba3e809abbcb2eb9c4923a
Reviewed-on: https://go-review.googlesource.com/c/60790
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Hyang-Ah Hana Kim <hyangah@gmail.com>