proposal: runtime/metrics: add metrics for timers #75956

@mknyszek

We currently don't expose any metrics for timers created by the time package.

Counters

I propose the addition of six new metrics:

/sched/timers/classes/expired:timers
    Monotonically increasing count of timers that successfully expired.

/sched/timers/classes/stopped:timers
    Monotonically increasing count of timers that were stopped early.

/sched/timers/classes/pending:timers
    Count of outstanding timers. Resetting a pending timer does not change stopped or pending, but does
    increment /sched/timers/resets:resets.

/sched/timers/classes/total:timers
    Monotonically increasing count of instances of scheduling a timer. Each call to time.NewTimer,
    time.(*Timer).Reset, time.After, and time.AfterFunc counts as scheduling a timer.
    Each tick (read from the channel or not) in a time.Ticker counts as scheduling a timer.
    This is equal to the sum of the other /sched/timers/classes/* metrics (expired, stopped, and pending).

/sched/timers/resets:resets
    Monotonically increasing count of all resets of pending timers.

/sched/timers/tracked:timers
    Count of timers currently tracked by the scheduler. This includes pending timers as tracked in
    /sched/timers/classes/pending:timers as well as stopped timers ("zombies") that the scheduler is
    still tracking and has not yet disposed of.
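
As a rough sketch of how a program might consume these counters: the names below are the ones proposed here and are not part of any released Go version that I know of, so the code treats an unsupported name as a normal outcome. It relies only on the existing runtime/metrics behavior that an unknown name yields a value of kind metrics.KindBad. The helper name readUint64Metrics is made up for illustration.

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

// proposedTimerCounters lists the metric names from this proposal.
// Assumption: none of them exist in the running runtime yet.
var proposedTimerCounters = []string{
	"/sched/timers/classes/expired:timers",
	"/sched/timers/classes/stopped:timers",
	"/sched/timers/classes/pending:timers",
	"/sched/timers/classes/total:timers",
	"/sched/timers/resets:resets",
	"/sched/timers/tracked:timers",
}

// readUint64Metrics (hypothetical helper) returns the value of every named
// metric that the running runtime supports and reports as a uint64.
// Unsupported names come back as metrics.KindBad and are skipped.
func readUint64Metrics(names []string) map[string]uint64 {
	samples := make([]metrics.Sample, len(names))
	for i, name := range names {
		samples[i].Name = name
	}
	metrics.Read(samples)

	out := make(map[string]uint64)
	for _, s := range samples {
		if s.Value.Kind() == metrics.KindUint64 {
			out[s.Name] = s.Value.Uint64()
		}
	}
	return out
}

func main() {
	got := readUint64Metrics(proposedTimerCounters)
	for _, name := range proposedTimerCounters {
		if v, ok := got[name]; ok {
			fmt.Printf("%s = %d\n", name, v)
		} else {
			fmt.Printf("%s: not supported by this runtime\n", name)
		}
	}
}
```

On a runtime without these metrics, every name should fall into the "not supported" branch.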

Together, these metrics reflect the timer state machine and account for all timer scheduling activity. Note that creating a timer and doing nothing with it does not update any metrics. At the same time, stopping and subsequently resetting a timer counts as an additional "timer scheduling" event, whereas resetting a pending timer does not.

Optional: distributions

I also want to float the idea of adding one or more metrics containing a distribution of timer durations. Producing a meaningful metric here is a little subtle due to the ability to reset a pending timer. I propose the following metrics.

/sched/timers/expired-by-duration:seconds
    Distribution of the time between a timer first pending and finally expiring.

/sched/timers/stopped-by-duration:seconds
    Distribution of the time between a timer first pending and finally being stopped.

/sched/timers/pending-by-duration:seconds
    Distribution of how long pending timers have been pending for.

/sched/timers/pending-by-expiry:seconds
    Distribution of the initial duration of all pending timers.
    That is, the value passed to time.(*Timer).Reset, time.NewTimer, time.After,
    time.AfterFunc, or time.NewTicker.
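
If these distributions were exposed the way existing runtime/metrics distributions are, reading one might look like the sketch below. That is an assumption: the proposal does not fix the value kind. The helper names readHistogram and totalCount are hypothetical, and the existing /sched/latencies:seconds metric is used as a fallback only to show the code path on current runtimes.

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

// readHistogram (hypothetical helper) reads one metric and returns its
// histogram, or false if the runtime does not support the name or the
// metric is not a Float64Histogram.
func readHistogram(name string) (*metrics.Float64Histogram, bool) {
	samples := []metrics.Sample{{Name: name}}
	metrics.Read(samples)
	if samples[0].Value.Kind() != metrics.KindFloat64Histogram {
		return nil, false
	}
	return samples[0].Value.Float64Histogram(), true
}

// totalCount sums all bucket counts in a histogram.
func totalCount(h *metrics.Float64Histogram) uint64 {
	var n uint64
	for _, c := range h.Counts {
		n += c
	}
	return n
}

func main() {
	names := []string{
		"/sched/timers/expired-by-duration:seconds", // proposed, likely unsupported
		"/sched/latencies:seconds",                  // existing metric, for comparison
	}
	for _, name := range names {
		h, ok := readHistogram(name)
		if !ok {
			fmt.Printf("%s: not supported by this runtime\n", name)
			continue
		}
		fmt.Printf("%s: %d samples in %d buckets\n", name, totalCount(h), len(h.Counts))
	}
}
```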

These distributional metrics will require an extra field in the timer to track when it was scheduled.

It's less clear to me how important these are, but I think they at least help motivate the naming of the previous metrics. That is, we put the counts under classes to leave space for other metrics under /sched/timers.
