runtime: excessive scavengeOne work slows mutator progress #57069

Open
rhysh opened this issue Dec 3, 2022 · 6 comments

rhysh commented Dec 3, 2022

Some background: I have an app that serves interactive HTTP traffic (typical response time is tens of milliseconds), and which also does periodic work to refresh the state necessary to serve that interactive traffic. Its live heap is usually around 8 GiB. The work to refresh the state spans over several GC cycles and can involve allocating hundreds of MiBs or even a couple GiBs over a few hundred milliseconds, which gives the pacer a hard time and can lead to the pacer choosing a high assist factor. To work around that, we're 1/ calling runtime.GC at a few key points in the state-refresh process, and 2/ running with GOMEMLIMIT=20GiB and GOGC=off. The app runs on a machine with 16 hardware threads (GOMAXPROCS=16).
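
As a minimal sketch of that configuration (refreshState is a hypothetical stand-in for the state-refresh work), the same settings can also be applied in-process:

package main

import (
	"runtime"
	"runtime/debug"
)

func main() {
	// Same effect as GOMEMLIMIT=20GiB and GOGC=off, set programmatically.
	debug.SetMemoryLimit(20 << 30)
	debug.SetGCPercent(-1)

	// Workaround 1/: force a full GC (mark + sweep) at a key point,
	// then continue with the allocation-heavy refresh.
	runtime.GC()
	refreshState()
}

// refreshState is a hypothetical stand-in for the allocation-heavy
// state-refresh work described above.
func refreshState() {}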

The problem: There appears to be contention on the mheap lock in runtime.(*pageAlloc).scavengeOne. Execution traces show this can tie up all 16 Ps for hundreds of milliseconds.

CC @golang/runtime and @mknyszek

What version of Go are you using (go version)?

go1.19.3 linux/amd64

Does this issue reproduce with the latest release?

Yes: go1.19.3 is the latest stable release. I have not tried with tip.

What operating system and processor architecture are you using (go env)?

linux/amd64

What did you do?

GOMAXPROCS=16, GOMEMLIMIT=20GiB, GOGC=off

Manual call to runtime.GC, followed (after that call returns, indicating that the sweep work is done) by allocating memory especially quickly.

All while serving interactive (tens of milliseconds typical response time) HTTP requests.

What did you expect to see?

Consistent responsiveness for the interactive HTTP requests this process serves, regardless of which phase the GC is executing.

When the app has its own runnable goroutines, I expect the runtime to stay close to its goal of spending 25% of P time on GC-related work.

What did you see instead?

The app's Ps are kept busy in a nearly-opaque form of work. Rather than running a G for around 100 µs and then moving on to the next G, each P has a single G that it claims to be running for hundreds of milliseconds without interruption. (Note, if those goroutines had "normal" work, I'd still expect them to be preempted by sysmon every 20 ms or so.) At times, every P in the app is kept busy this way.

When the Ps are busy this way, their CPUSample events in the execution trace (each of which represents 10 milliseconds of on-CPU time) are spread tens of milliseconds apart, and sometimes more than 100 ms. That suggests that these Ps are not spending their time executing any code, even within the kernel, and that instead they're asleep waiting for a lock that is not visible to the execution trace (such as a runtime-internal lock).

The CPUSample events during those periods show the Ps are trying to allocate new spans to use for new memory allocations. Most samples are within runtime.(*pageAlloc).scavengeOne and show a call to runtime.madvise. A few show time in that function's calls to runtime.lock and runtime.unlock.

Because those periods include so little on-CPU time outside of runtime.(*pageAlloc).scavengeOne, show so little on-CPU time in total, and show some on-CPU time interacting with the mheap lock, I take this to mean that the Ps are spending their time with runtime.(*pageAlloc).scavengeOne on their stack but sleeping, likely waiting for the mheap lock.

Here's one of the call stacks from a CPUSample event that arrived while a P was busy for 100+ milliseconds (showing from mallocgc):

runtime.madvise:546
runtime.sysUnusedOS:111
runtime.sysUnused:62
runtime.(*pageAlloc).scavengeOne:741
runtime.(*pageAlloc).scavenge.func1:647
runtime.(*pageAlloc).scavenge:646
runtime.(*mheap).allocSpan:1317
runtime.(*mheap).alloc.func1:910
runtime.systemstack:492
runtime.(*mheap).alloc:904
runtime.(*mcentral).grow:244
runtime.(*mcentral).cacheSpan:164
runtime.(*mcache).refill:181
runtime.(*mcache).nextFree:819
runtime.mallocgc:1018

Here's how the execution trace looks. The sweep work ends around 275 ms. That also means the app's manual runtime.GC call returns, which allows the allocation-intensive refresh procedure to continue. Procs 1, 12, 15 do some allocation-heavy work and then get stuck in scavengeOne from the 400 to 800 ms marks. (Some other Procs, like 3 and 11, do heavy allocations around the 400 ms mark but do not get stuck.) Around 800 ms, all 16 Procs get stuck and the queue of runnable goroutines grows. And around 1000 ms, a timer should have triggered to call runtime/trace.Stop, but the trace continues beyond the 1300 ms mark; the app is no longer responsive.

[Screenshot: execution trace, 2022-12-03 11:39:30 AM]
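
For context on how that trace window was bounded in time, a trace like this can be started on demand and stopped by a timer via runtime/trace; a minimal sketch (the output path and one-second duration are illustrative):

package main

import (
	"os"
	"runtime/trace"
	"time"
)

func main() {
	f, err := os.Create("trace.out") // illustrative output path
	if err != nil {
		panic(err)
	}
	defer f.Close()

	if err := trace.Start(f); err != nil {
		panic(err)
	}
	// Stop the trace after roughly one second; in the incident above the
	// Stop call itself was delayed because every P was stuck in scavengeOne.
	time.AfterFunc(time.Second, trace.Stop)

	time.Sleep(2 * time.Second) // stand-in for the application's real work
}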

And here's the call_tree view from under mallocgc. This is the runtime/pprof output of the same CPU profile that is shown in the execution trace; these samples will cover the same time period as that entire execution trace, not filtered to any smaller time range.

$ ln -s /usr/local/go/bin/go /tmp/redacted
$ go tool pprof '-focus=^runtime\.mallocgc$' '-show_from=^runtime\.mallocgc$' -call_tree \
    -nodefraction=0 -edgefraction=0 -nodecount=1000 /tmp/redacted pprof/profile-during-trace

[call_tree output redacted; the samples fall under runtime.madvise]

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Dec 3, 2022

mknyszek commented Dec 5, 2022

That's really interesting. It seems like just after a forced GC call, a lot of memory is suddenly available to scavenge, and in the rapid allocation the runtime thinks all that memory is going to cause it to exceed the memory limit. That's why all the scavengeOne calls are coming from allocSpan. I wonder how big the ranges are that it's trying to scavenge. It should be at most 2 MiB, though in that case the runtime is only guessing that those memory regions are backed by huge pages (I'm hoping this will improve significantly with the solution to #55328, which I have in flight).

Back of the envelope, I would expect the worst case for 2 MiB at once to be a ~5 ms delay (which is kind of bad). However, the lock contention is a little weird, because the scavenger should never hold the lock across the madvise call, though it does grab it twice: once before madvise and once after. Failing to make progress on this entirely is somewhat surprising; at some point the GC CPU limiter should kick in, though the window for that is large at 1 CPU-second per P. This makes me think that window should be smaller.

Does the application hang completely, or does it recover after some time? I'd love to get a GODEBUG=pagetrace=/path/to/file out of this (need to build with GOEXPERIMENT=pagetrace on tip), but unfortunately that currently requires starting the trace from the very start of execution, so it might require a lot of disk. I think that would essentially tell the full story.

On that topic, the fix to #55328 might help in other ways here. I think the fix there results in a much better scavenging heuristic too, which would cause less scavenged memory to be allocated (causing these calls) in the first place. However, it does not tackle the root of the problem (or problems).

One problem is definitely scalability of scavengeOne. Right now, it needs that very central lock. It doesn't hold it over the madvise, but it's still clearly having a large effect. I think this might be the easiest problem to tackle, but it's still not that easy, as it requires some page allocator restructuring.

Another problem is getting into this state where the scavenger is being rapidly called into by every P because of the memory limit. This is harder to avoid because the memory limit still needs to be maintained. One idea I have here, though, is to have the sweeper help return to the OS memory that it doesn't think will be used, to help prevent this situation. The trouble is identifying what to return to the OS.

@prattmic prattmic added Performance NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Dec 5, 2022

prattmic commented Dec 5, 2022

cc @golang/runtime @mknyszek @aclements


rhysh commented Dec 5, 2022

It seems like just after a forced GC call, a lot of memory is suddenly available to scavenge

Can you say a bit more please? Does this mean that forcing a GC causes memory to become available to the scavenger, in a way that automatic GCs do not?

in the rapid allocation the runtime thinks all that memory is going to cause it to exceed the memory limit

I don't have data on the size of the allocated heap before this GC, or of how much the runtime has requested from the OS (like /memory/classes/total:bytes).
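
For the next occurrence, here's a minimal sketch of how those numbers could be sampled with the runtime/metrics package (the metric names are taken from the runtime/metrics catalog):

package main

import (
	"fmt"
	"runtime/metrics"
)

func main() {
	samples := []metrics.Sample{
		{Name: "/memory/classes/total:bytes"},        // total memory mapped by the runtime
		{Name: "/memory/classes/heap/objects:bytes"}, // live plus not-yet-swept heap objects
		{Name: "/gc/heap/goal:bytes"},                // current heap goal
	}
	metrics.Read(samples)
	for _, s := range samples {
		fmt.Printf("%s = %d\n", s.Name, s.Value.Uint64())
	}
}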

I wonder how big the ranges are that it's trying to scavenge.

Focusing on scavengeOne samples, the CPU profile associated with this execution trace shows time below runtime.(*mcache).nextFree, and it also shows time below runtime.(*mcache).allocLarge. The "large" allocations are in the neighborhood of 190 KiB, but looking at the heap profiles this process instance generated I see a lot of variation in that allocation's exact size. Because we're not rounding those up to, say, the nearest power of 2, I wonder if the runtime isn't often able to reuse those and if they're a big cause of the need for scavenging in the first place. I'm not very familiar with the scavenger or large allocations — does that sound right to you?

But I've also seen CPU profiles that show lots of time in runtime.(*pageAlloc).scavengeOne and almost no samples in runtime.(*mcache).allocLarge. Though that could be a matter of timing, with the large allocs taking place either before or after that profile's time window.

Failing to make progress on this entirely is somewhat surprising; at some point the GC CPU limiter should kick in, though the window for that is large at 1 CPU-second per P. This makes me think that window should be smaller.

I'll see if the team is able to run with a modified toolchain that reduces that window.

Does the application hang completely, or does it recover after some time?

It recovers, yes. Certainly within two minutes (at its next scheduled telemetry dump time, which showed unremarkable behavior).

I'd love to get a GODEBUG=pagetrace=/path/to/file out of this (need to build with GOEXPERIMENT=pagetrace on tip), but unfortunately that currently requires starting the trace from the very start of execution, so it might require a lot of disk. I think that would essentially tell the full story.

I think this is going to be hard, especially before Go 1.20. I haven't seen this behavior outside of the team's production environment, and the team isn't set up well to run like this (tip, plus a GOEXPERIMENT, plus a large/growing file on disk) in production.

Another problem is getting into this state where the scavenger is being rapidly called into by every P because of the memory limit. This is harder to avoid because the memory limit still needs to be maintained.

This sounds like you're saying that our use of GOMEMLIMIT and GOGC=off is directly responsible for the large amount of scavengeOne work. Is that right?

Thank you!


mknyszek commented Dec 6, 2022

It seems like just after a forced GC call, a lot of memory is suddenly available to scavenge

Can you say a bit more please? Does this mean that forcing a GC causes memory to become available to the scavenger, in a way that automatic GCs do not?

Kind of. Here's what I was thinking (but I'm less convinced this specifically is an issue now, see my other replies below): forced GCs ensure the sweep phase is complete in addition to the mark phase before continuing. Sweeping frees pages, which makes them available for scavenging. The runtime kicks the background scavenger awake at the end of each sweep phase, but there's potentially a multi-millisecond delay as sysmon is responsible for waking it.

If the background scavenger is asleep during a sweep phase, this can in theory also happen with automatic GCs. However, in your particular case, where there's this sort of "calm before the storm" as a GC gets forced, I suspect the scavenger is more likely to be asleep, leaving all the maintenance work to allocating goroutines.

Writing this out makes me think that maybe, when the memory allocator starts scavenging, it should immediately kick the background scavenger awake. Though, we should confirm that the background scavenger is indeed asleep on the job in this case. It could be that even if it were constantly working, it wouldn't be doing enough to avoid this.

in the rapid allocation the runtime thinks all that memory is going to cause it to exceed the memory limit

I don't have data on the size of the allocated heap before this GC, or of how much the runtime has requested from the OS (like /memory/classes/total:bytes).

I wonder how big the ranges are that it's trying to scavenge.

Focusing on scavengeOne samples, the CPU profile associated with this execution trace shows time below runtime.(*mcache).nextFree, and it also shows time below runtime.(*mcache).allocLarge. The "large" allocations are in the neighborhood of 190 KiB, but looking at the heap profiles this process instance generated I see a lot of variation in that allocation's exact size. Because we're not rounding those up to, say, the nearest power of 2, I wonder if the runtime isn't often able to reuse those and if they're a big cause of the need for scavenging in the first place. I'm not very familiar with the scavenger or large allocations — does that sound right to you?

That's really interesting. Allocations that large definitely do have the potential to increase fragmentation a good bit, making it more likely that the allocator has to find new address space. That in turn will mean more frequent scavenging to stay below the memory limit.

But I've also seen CPU profiles that show lots of time in runtime.(*pageAlloc).scavengeOne and almost no samples in runtime.(*mcache).allocLarge. Though that could be a matter of timing, with the large allocs taking place either before or after that profile's time window.
Hm, yeah, it could be timing. Fragmentation can still happen even outside the allocLarge case, if a lot of your objects tend to be at the larger end of the size classes.

Failing to make progress on this entirely is somewhat surprising; at some point the GC CPU limiter should kick in, though the window for that is large at 1 CPU-second per P. This makes me think that window should be smaller.

I'll see if the team is able to run with a modified toolchain that reduces that window.

Thanks!

Does the application hang completely, or does it recover after some time?

It recovers, yes. Certainly within two minutes (at its next scheduled telemetry dump time, which showed unremarkable behavior).
Got it, just wanted to make sure this wasn't some kind of deadlock.

I'd love to get a GODEBUG=pagetrace=/path/to/file out of this (need to build with GOEXPERIMENT=pagetrace on tip), but unfortunately that currently requires starting the trace from the very start of execution, so it might require a lot of disk. I think that would essentially tell the full story.

I think this is going to be hard, especially before Go 1.20. I haven't seen this behavior outside of the team's production environment, and the team isn't set up well to run like this (tip, plus a GOEXPERIMENT, plus a large/growing file on disk) in production.

Yeah, I figured. A tool of last resort I suppose. I could also hack together a version that enables the page tracer with the execution trace. It should also be possible to make the page trace tooling a little more resilient to partial traces, so no crazy changes to the format are necessary. Let me know if this is of interest to you.

Another problem is getting into this state where the scavenger is being rapidly called into by every P because of the memory limit. This is harder to avoid because the memory limit still needs to be maintained.

This sounds like you're saying that our use of GOMEMLIMIT and GOGC=off is directly responsible for the large amount of scavengeOne work. Is that right?

Yeah. That call path (allocSpan -> scavenge -> scavengeOne) only kicks in when an allocation would cause the memory limit to be exceeded.

I think your description is painting a fairly clear picture: it sounds like fragmentation and the sudden need to allocate larger things are forcing the runtime to scramble to find memory to release to stay under the memory limit. The background scavenger in this scenario is instructed to scavenge until it's 5% under the memory limit, which is supposed to help avoid these scenarios, but with GOMEMLIMIT and GOGC=off the heap limit can easily eat up that 5% each GC cycle.
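
A conceptual sketch of that allocation-path trigger (this is not the runtime's actual code, just the shape of the check being described):

package main

import "fmt"

const memoryLimit = 20 << 30 // as in GOMEMLIMIT=20GiB

// allocSpanSketch mimics the decision described above: if mapping newBytes
// would push the runtime over the memory limit, the allocating goroutine
// must first scavenge (release pages to the OS) to make room.
func allocSpanSketch(mappedBytes, newBytes uint64) uint64 {
	if mappedBytes+newBytes > memoryLimit {
		need := mappedBytes + newBytes - memoryLimit
		mappedBytes -= scavengeSketch(need) // the work showing up under scavengeOne
	}
	return mappedBytes + newBytes
}

// scavengeSketch stands in for releasing at least `atLeast` bytes of free
// pages back to the OS via madvise; it returns the number of bytes released.
func scavengeSketch(atLeast uint64) uint64 { return atLeast }

func main() {
	// Just under the limit, then a 4 MiB span request forces scavenging first.
	fmt.Println(allocSpanSketch(memoryLimit-1<<20, 4<<20))
}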

I think the runtime might be doing something wrong here: it should probably reserve a little more space below the heap limit as a hedge against future fragmentation. I suspect that at your heap size this wouldn't really cost you very much in terms of GC frequency, but it would make this transition to periodic work much smoother.

More specifically, the fixed headroom found at https://cs.opensource.google/go/go/+/master:src/runtime/mgcpacer.go;l=1036 should probably be proportional to the heap limit (something like 3-5%, i.e. multiplying the limit by 0.95-0.97). I was originally reluctant to do this, but I think your issue is solid evidence in favor. I also think it might help with another memory limit issue, which is that it works less well for smaller heaps (around 64 MiB or less). It's only a 2-3 line change (well, probably a little more including a pacer test) and it might resolve this particular issue. Of course, no single number here will be perfect for every scenario, but it's a start. A more adaptive solution would be good to consider for the future.
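
An illustrative sketch of the difference (not the actual mgcpacer.go code; the 1 MiB fixed headroom below is an assumption for illustration):

package main

import "fmt"

const memoryLimit = 20 << 30 // GOMEMLIMIT=20GiB

// Current behavior (roughly): subtract a small fixed headroom from the limit
// when computing the memory-limit-based heap goal. 1 MiB is assumed here.
func goalWithFixedHeadroom(limit uint64) uint64 {
	const headroom = 1 << 20
	return limit - headroom
}

// Proposed behavior: make the headroom proportional to the limit,
// e.g. reserve ~3% by multiplying the limit by 0.97.
func goalWithProportionalHeadroom(limit uint64) uint64 {
	return limit * 97 / 100
}

func main() {
	fmt.Println(goalWithFixedHeadroom(memoryLimit))        // only slightly below the limit
	fmt.Println(goalWithProportionalHeadroom(memoryLimit)) // roughly 3% below the limit
}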

Also, this follows a more general pattern of the pacer pacing for the "edge" (#56966) and not hedging enough for noise or changes in the steady-state in general.

Thank you!

@mknyszek mknyszek added this to the Go1.21 milestone Dec 7, 2022
@mknyszek mknyszek self-assigned this Dec 7, 2022

mknyszek commented Jan 3, 2023

Here's an attempt at a fix if you'd like to try it out: https://go.dev/cl/460375

@gopherbot

Change https://go.dev/cl/460375 mentions this issue: runtime: make the memory limit heap goal headroom proportional
