runtime: excessive scavengeOne work slows mutator progress #57069
Some background: I have an app that serves interactive HTTP traffic (typical response time is tens of milliseconds), and which also does periodic work to refresh the state necessary to serve that interactive traffic. Its live heap is usually around 8 GiB. The work to refresh the state spans several GC cycles and can involve allocating hundreds of MiBs or even a couple of GiBs over a few hundred milliseconds, which gives the pacer a hard time and can lead to it choosing a high assist factor. To work around that, we're 1/ calling `runtime.GC` ahead of the refresh work and 2/ running with a soft memory limit (`GOMEMLIMIT`).
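For concreteness, here's a minimal sketch of that arrangement. The names, the limit value, and the scheduling are illustrative stand-ins, not details from the report:

```go
package main

import (
	"runtime"
	"runtime/debug"
	"time"
)

// refreshState is a hypothetical stand-in for the periodic work described
// above: it allocates hundreds of MiB over a few hundred milliseconds.
func refreshState() { /* ... */ }

func main() {
	// Soft memory limit (equivalent to setting GOMEMLIMIT); the value is
	// illustrative.
	debug.SetMemoryLimit(12 << 30) // 12 GiB

	for {
		// Force a full GC just before the allocation burst so it starts
		// from a freshly swept heap rather than mid-cycle.
		runtime.GC()
		refreshState()

		time.Sleep(10 * time.Minute) // stand-in for the app's real schedule
	}
}
```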
The problem: There appears to be contention on the runtime's heap lock when many goroutines end up doing allocate-time scavenging (`scavengeOne`) at once to stay under the memory limit, and that work noticeably slows mutator progress.
CC @golang/runtime and @mknyszek
What version of Go are you using (`go version`)?
That's really interesting. It seems like just after a forced GC call, a lot of memory is suddenly available to scavenge, and during the rapid allocation that follows, the runtime thinks all that memory is going to cause it to exceed the memory limit. That's why all the `scavengeOne` work is happening on the allocation path.
Back of the envelope, I would expect the worst case for scavenging 2 MiB at once to be a ~5 ms delay (which is kind of bad). The lock contention is a little weird, though, because the scavenger shouldn't ever be holding the heap lock across the call that actually returns memory to the OS.
Does the application hang completely, or does it recover after some time? I'd love to get a page trace of this, if that's at all feasible.
On that topic, the fix for #55328 might help in other ways here. I think the fix there also results in a much better scavenging heuristic, which would cause less already-scavenged memory to be allocated (the thing that triggers these calls) in the first place. However, it does not tackle the root of the problem (or problems).
One problem is definitely the scalability of the scavenging path itself (`scavengeOne`).
Another problem is getting into this state where the scavenger is being rapidly called into by every P because of the memory limit. This is harder to avoid, because the memory limit still needs to be maintained. One idea I have here, though, is to have the sweeper help out by returning to the OS memory that it doesn't think will be used, to prevent this situation from arising. The trouble is identifying what to return.
Can you say a bit more please? Does this mean that forcing a GC causes memory to become available to the scavenger, in a way that automatic GCs do not?
I don't have data on the size of the allocated heap before this GC, or on how much memory the runtime has requested from the OS (the sort of thing the runtime/metrics memory classes report).
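For reference, a sketch of the kind of snapshot that would answer the second part, using runtime/metrics; the metric names are ones I believe are relevant here, not ones from the report:

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

// dumpMemClasses prints a few memory-class metrics: how much the runtime has
// mapped in total, and how much free heap memory has (or hasn't) been
// returned to the OS.
func dumpMemClasses() {
	samples := []metrics.Sample{
		{Name: "/memory/classes/total:bytes"},         // all memory mapped by the runtime
		{Name: "/memory/classes/heap/objects:bytes"},  // live and not-yet-swept heap objects
		{Name: "/memory/classes/heap/free:bytes"},     // free heap memory not yet returned to the OS
		{Name: "/memory/classes/heap/released:bytes"}, // free heap memory already returned to the OS
	}
	metrics.Read(samples)
	for _, s := range samples {
		fmt.Printf("%-42s %d\n", s.Name, s.Value.Uint64())
	}
}

func main() { dumpMemClasses() }
```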
But I've also seen CPU profiles that show lots of time in `scavengeOne`.
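In case it's useful, here's a generic sketch (not the application's actual instrumentation) of capturing a CPU profile that brackets just the refresh window, so the scavenge-related samples stand out:

```go
package main

import (
	"log"
	"os"
	"runtime/pprof"
)

// refreshState is a hypothetical stand-in for the allocation-heavy work.
func refreshState() { /* ... */ }

func main() {
	f, err := os.Create("refresh.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Profile only the suspect window.
	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	refreshState()
	pprof.StopCPUProfile()
}
```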
I'll see if the team is able to run with a modified toolchain that reduces that window.
It recovers, yes. Certainly within two minutes (at its next scheduled telemetry dump time, which showed unremarkable behavior).
I think this is going to be hard, especially before Go 1.20. I haven't seen this behavior outside of the team's production environment, and the team isn't set up well to run like this (tip, plus a GOEXPERIMENT, plus a large/growing file on disk) in production.
This sounds like you're saying that our use of forced GCs (`runtime.GC`) is part of what sets this up?
Kind of. Here's what I was thinking (though I'm less convinced this specifically is an issue now; see my other replies below): forced GCs ensure the sweep phase, not just the mark phase, is complete before continuing. Sweeping frees pages, which makes them available for scavenging. The runtime kicks the background scavenger awake at the end of each sweep phase, but there's potentially a multi-millisecond delay, since sysmon is responsible for actually waking it.
If the background scavenger is asleep during a sweep phase, this can in theory also happen with automatic GCs. But in your particular case, where there's this sort of "calm before the storm" as a GC gets forced, I suspect the scavenger is more likely to be asleep, leaving all of that maintenance work to the allocating goroutines.
Writing this out makes me think that maybe the memory allocator should immediately kick the background scavenger awake whenever it has to start scavenging itself. Though, we should confirm that the background scavenger really is asleep on the job in this case. It could be that even if it were constantly working, it wouldn't be doing enough to avoid this.
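One way to sanity-check the "memory suddenly becomes available to scavenge after a forced GC" part from outside the runtime is to watch MemStats around the forced GC. A rough sketch, not taken from the report:

```go
package main

import (
	"fmt"
	"runtime"
)

// snapshot prints the MemStats fields most relevant to scavenging: HeapIdle
// is heap memory not currently holding objects, and HeapReleased is the
// portion of that which has already been returned to the OS.
func snapshot(label string) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("%-7s HeapIdle=%d HeapReleased=%d\n", label, m.HeapIdle, m.HeapReleased)
}

func main() {
	snapshot("before")
	runtime.GC() // forced GC: both mark and sweep complete before this returns
	snapshot("after")
	// A large gap between HeapIdle and HeapReleased right after the forced GC
	// is memory that's now a candidate for scavenging.
}
```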
That's really interesting. Allocations that large definitely do have the potential to increase fragmentation a good bit, making it more likely that the allocator has to find new address space. That in turn will mean more frequent scavenging to stay below the memory limit.
Yeah, I figured. A tool of last resort I suppose. I could also hack together a version that enables the page tracer with the execution trace. It should also be possible to make the page trace tooling a little more resilient to partial traces, so no crazy changes to the format are necessary. Let me know if this is of interest to you.
Yeah. That call path (allocating and then scavenging to stay under the memory limit) is consistent with what I described above.
I think your description is painting a fairly clear picture: it sounds like fragmentation and the sudden need to allocate larger things are forcing the runtime to scramble for memory to release in order to stay under the memory limit. The background scavenger in this scenario is instructed to scavenge until it's 5% under the memory limit, which is supposed to help avoid these situations, but with an allocation spike this rapid it evidently can't keep up.
I think the runtime might be doing something wrong here: it should probably be reserving a little more space off the heap limit as a hedge against future fragmentation. I suspect at your heap size this wouldn't cost you much in terms of GC frequency, but it would make this transition into the periodic work much smoother.
More specifically, the fixed headroom found at https://cs.opensource.google/go/go/+/master:src/runtime/mgcpacer.go;l=1036 should probably be proportional to the heap limit (something like 3-5%, i.e. multiplying the limit by 0.95-0.97). I was originally reluctant to do this, but I think your issue is solid evidence in favor of it. I also think it might help with another memory limit issue, namely that the limit works less well for smaller heaps (around 64 MiB or less). It's only a 2-3 line change (well, probably a bit more including a pacer test) and it might resolve this particular issue. Of course, no single number here will be perfect for every scenario, but it's a start. A more adaptive solution would be worth considering in the future.
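To make the shape of that change concrete, here is an illustrative sketch of the difference between a fixed and a proportional headroom when deriving the memory-limit-based heap goal. It is not the actual runtime code, which also has to account for non-heap overheads:

```go
package main

import "fmt"

const fixedHeadroomBytes = 1 << 20 // hypothetical small constant headroom

// goalFixed mirrors the current shape: the memory limit minus a constant.
func goalFixed(memoryLimit uint64) uint64 {
	return memoryLimit - fixedHeadroomBytes
}

// goalProportional keeps ~3% of the limit in reserve (i.e. multiplies by
// 0.97), hedging against fragmentation during allocation spikes.
func goalProportional(memoryLimit uint64) uint64 {
	return memoryLimit / 100 * 97
}

func main() {
	for _, limit := range []uint64{64 << 20, 8 << 30} { // 64 MiB and 8 GiB
		fmt.Printf("limit=%d fixed=%d proportional=%d\n",
			limit, goalFixed(limit), goalProportional(limit))
	}
}
```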
Also, this follows a more general pattern of the pacer pacing for the "edge" (#56966) and not hedging enough for noise or changes in the steady-state in general.