Description
The /gc/pauses:seconds metric gives a count of GC-related stop-the-world events (and the distribution of their durations), but there are several other ways for an application to trigger a STW. The prevalence and duration of those is also worth tracking: sometimes the work done while the world is stopped is expensive (#33250), and sometimes the stopping itself is expensive (#31222).
In the runtime today, I see runtime.GOMAXPROCS, runtime.ReadMemStats, runtime/pprof.Lookup("goroutine").WriteTo, runtime/trace.{Start,Stop} (runtime.{Start,Stop}Trace), runtime.Stack(..., true), runtime/debug.WriteHeapDump, syscall.AllThreadsSyscall{,6}, plus GC phase transitions, as being able to cause a STW. It's not unusual for latency-sensitive apps to call one or more of those on a regular basis: in my own experience that's usually goroutine profiles, ReadMemStats (via the expvar package), and execution traces.
I propose adding a new histogram metric that reports on the full duration of all STW events, including the time to bring the world to a stop and the time with the world stopped, tentatively named /sched/pauses:seconds. This will make it easier to identify the presence and impact of stop-the-world pauses, even when they're not related to the GC.
Existing:
/gc/pauses:seconds
Distribution of individual GC-related stop-the-world pause
latencies. Bucket counts increase monotonically.
Proposed addition:
/sched/pauses:seconds
Distribution of individual stop-the-world pause latencies.
Bucket counts increase monotonically.