runtime/metrics: provide histogram of all STW events #63340

@rhysh

The /gc/pauses:seconds metric gives a count of GC-related stop-the-world (STW) events and the distribution of their durations, but there are several other ways for an application to trigger a STW. The prevalence and duration of those are also worth tracking: sometimes the work done while the world is stopped is expensive (#33250), and sometimes the stopping itself is expensive (#31222).

In the runtime today, I see the following as being able to cause a STW: runtime.GOMAXPROCS, runtime.ReadMemStats, runtime/pprof.Lookup("goroutine").WriteTo, runtime/trace.{Start,Stop} (runtime.{Start,Stop}Trace), runtime.Stack(..., true), runtime/debug.WriteHeapDump, syscall.AllThreadsSyscall{,6}, plus GC phase transitions. It's not unusual for latency-sensitive apps to call one or more of those on a regular basis: in my own experience that's usually goroutine profiles, ReadMemStats (via the expvar package), and execution traces.
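For illustration, here is a minimal sketch of the kind of routine monitoring code that exercises three of the calls listed above. The function name stopTheWorldCalls is hypothetical, and whether (and for how long) each call stops the world depends on the Go version in use.

```go
package main

import (
	"fmt"
	"io"
	"runtime"
	"runtime/pprof"
)

// stopTheWorldCalls (a hypothetical helper) invokes three of the APIs named
// in this issue as able to stop the world. None of these pauses are
// GC-related, so /gc/pauses:seconds is not expected to account for them.
func stopTheWorldCalls() (heapAlloc uint64, stackBytes int, err error) {
	// Memory statistics, as collected by packages such as expvar.
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)

	// Goroutine profile, discarded here; real code would write it somewhere.
	err = pprof.Lookup("goroutine").WriteTo(io.Discard, 0)

	// Stack traces of all goroutines (second argument true).
	buf := make([]byte, 64<<10)
	stackBytes = runtime.Stack(buf, true)

	return ms.HeapAlloc, stackBytes, err
}

func main() {
	heap, n, err := stopTheWorldCalls()
	if err != nil {
		fmt.Println("goroutine profile failed:", err)
		return
	}
	fmt.Printf("heap in use: %d bytes, stack dump: %d bytes\n", heap, n)
}
```

An application that runs something like this on a timer incurs STW events on every tick, which is the sort of activity the proposed metric is meant to surface.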

I propose adding a new histogram metric that reports on the full duration of all STW events, including the time to bring the world to a stop and the time with the world stopped, tentatively named /sched/pauses:seconds. This will make it easier to identify the presence and impact of stop-the-world pauses, even when they're not related to the GC.

Existing:

/gc/pauses:seconds
	Distribution of individual GC-related stop-the-world pause
	latencies. Bucket counts increase monotonically.

Proposed addition:

/sched/pauses:seconds
	Distribution of individual stop-the-world pause latencies.
	Bucket counts increase monotonically.
