Description
Some background: I have an app that serves interactive HTTP traffic (typical response time is tens of milliseconds), and which also does periodic work to refresh the state necessary to serve that interactive traffic. Its live heap is usually around 8 GiB. The work to refresh the state spans over several GC cycles and can involve allocating hundreds of MiBs or even a couple GiBs over a few hundred milliseconds, which gives the pacer a hard time and can lead to the pacer choosing a high assist factor. To work around that, we're 1/ calling runtime.GC
at a few key points in the state-refresh process, and 2/ running with GOMEMLIMIT=20GiB and GOGC=off. The app runs on a machine with 16 hardware threads (GOMAXPROCS=16).
The problem: There appears to be contention on the mheap
lock in runtime.(*pageAlloc).scavengeOne
. Execution traces show this can tie up all 16 Ps for hundreds of milliseconds.
CC @golang/runtime and @mknyszek
What version of Go are you using (go version
)?
go1.19.3 linux/amd64
Does this issue reproduce with the latest release?
Yes: go1.19.3 is the latest stable release. I have not tried with tip.
What operating system and processor architecture are you using (go env
)?
linux/amd64
What did you do?
GOMAXPROCS=16, GOMEMLIMIT=20GiB, GOGC=off
Manual call to runtime.GC
, followed (after that call returns, indicating that the sweep work is done) by allocating memory especially quickly.
All while serving interactive (tens of milliseconds typical response time) HTTP requests.
What did you expect to see?
Consistent responsiveness for the interactive HTTP requests this process serves, regardless of which phase the GC is executing.
When the app has its own runnable goroutines, I expect the runtime to stay close to within its goal of 25% of P time spent on GC-related work.
What did you see instead?
The app's Ps are kept busy in a nearly-opaque form of work. Rather than running a G for around 100 µs and then moving on to the next G, each P has a single G that it claims to be running for hundreds of milliseconds without interruption. (Note, if those goroutines had "normal" work, I'd still expect them to be preempted by sysmon
every 20 ms or so.) At times, every P in the app is kept busy this way.
When the Ps are busy this way, their CPUSample events in the execution trace (each of which represents 10 milliseconds of on-CPU time) are spread tens of milliseconds apart, and sometimes more than 100 ms. That suggests that these Ps are not spending their time executing any code, even within the kernel, and that instead they're asleep waiting for a lock that is not visible to the execution trace (such as a runtime-internal lock).
The CPUSample events during those periods show the Ps are trying to allocate new spans to use for new memory allocations. Most samples are within runtime.(*pageAlloc).scavengeOne
and show a call to runtime.madvise
. A few show time in that function's calls to runtime.lock
and runtime.unlock
.
Because those periods include so little on-CPU time outside of runtime.(*pageAlloc).scavengeOne
, and also show so little on-CPU time in total, and show some on-CPU time interacting with the mheap
lock, I take this to mean that the Ps are spending their with runtime.(*pageAlloc).scavengeOne
on their stack, but sleeping, likely waiting for the mheap
lock.
Here's one of the call stacks from a CPUSample event that arrived while a P was busy for 100+ milliseconds (showing from mallocgc
):
runtime.madvise:546
runtime.sysUnusedOS:111
runtime.sysUnused:62
runtime.(*pageAlloc).scavengeOne:741
runtime.(*pageAlloc).scavenge.func1:647
runtime.(*pageAlloc).scavenge:646
runtime.(*mheap).allocSpan:1317
runtime.(*mheap).alloc.func1:910
runtime.systemstack:492
runtime.(*mheap).alloc:904
runtime.(*mcentral).grow:244
runtime.(*mcentral).cacheSpan:164
runtime.(*mcache).refill:181
runtime.(*mcache).nextFree:819
runtime.mallocgc:1018
Here's how the execution trace looks. The sweep work ends around 275 ms. That also means the app's manual runtime.GC
call returns, which allows the allocation-intensive refresh procedure to continue. Procs 1, 12, 15 do some allocation-heavy work and then get stuck in scavengeOne
from the 400 to 800 ms marks. (Some other Procs, like 3 and 11, do heavy allocations around the 400 ms mark but do not get stuck.) Around 800 ms, all 16 Procs get stuck and the queue of runnable goroutines grows. And around 1000 ms, a timer should have triggered to call runtime/trace.Stop
, but the trace continues beyond the 1300 ms mark; the app is no longer responsive.
And here's the call_tree
view from under mallocgc
. This is the runtime/pprof output of the same CPU profile that is shown in the execution trace; these samples will cover the same time period as that entire execution trace, not filtered to any smaller time range.
$ ln -s /usr/local/go/bin/go /tmp/redacted
$ go tool pprof '-focus=^runtime\.mallocgc$' '-show_from=^runtime\.mallocgc$' -call_tree \
-nodefraction=0 -edgefraction=0 -nodecount=1000 /tmp/redacted pprof/profile-during-trace