Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: API for unstable metrics #37112

Open
mknyszek opened this issue Feb 7, 2020 · 22 comments
Open

runtime: API for unstable metrics #37112

mknyszek opened this issue Feb 7, 2020 · 22 comments

Comments

@mknyszek
Copy link
Contributor

@mknyszek mknyszek commented Feb 7, 2020

Proposal: API for unstable runtime metrics

Background & Motivation

Today runtime metrics are exposed in two ways.

The first way is via the struct-based sampling APIs runtime.ReadMemStats and runtime/debug.GCStats. These functions accept a pointer to a struct and then populate the struct with data from the runtime.

The problems with this type of API are:

  • Removing/renaming old metrics from the structs is impossible.
    • For example, MemStats.BySize is hard-coded to 61 size classes when there are currently 83. We cannot ever change BySize.
  • Adding implementation-specific metrics to the structs is discouraged, because it pollutes the API when inevitably they'll be deprecated.
  • runtime.ReadMemStats has a global effect on the application because it forces a STW. This has a direct effect on latency. Being able to tease apart which metrics actually need gives users more control over performance.

The good things about this type of API are:

  • Protected by the Go 1 compatibility promise.
  • Easy for applications to ingest, use for their own purposes, or push to a metrics collection service or log.

The second is via GODEBUG flags which emit strings containing metrics to standard error (e.g. gctrace, gcpacertrace, scavtrace).

The problems with this type of API are:

  • Difficult for an application to ingest because it must be parsed.
  • Format of the output is not protected by the Go 1 backwards compatibility promise.

The good things about this type of API are:

  • We can freely change it and add implementation-specific metrics.
  • We never have to live with bad decisions.

I would like to propose a new API which takes the best of both approaches.

Requirements

  • The API should be easily extendable with new metrics.
  • The API should be easily retractable, to deprecate old metrics.
    • Removing a metric should not break any Go applications as per the Go 1 compatibility promise.
  • The API should be discoverable, to obtain a list of currently relevant metrics.
  • The API should be rich, allowing a variety of metrics (e.g. distributions).
  • The API implementation should minimize CPU/memory usage, such that it does not appreciably
    affect any of the metrics being measured.
  • The API should include useful existing metrics already exposed by the runtime.

Goals

Given the requirements, I suggest we prioritize the following concerns when designing the API in the following order.

  1. Extensibility.
    • Metrics are “unstable” and therefore it should always be compatible to add or remove metrics.
    • Since metrics will tend to be implementation-specific, this feature is critical.
  2. Discoverability.
    • Because these metrics are “unstable,” there must be a way for the application, and for the human writing the application, to discover the set of usable metrics and be able to do something useful with that information (e.g. log the metric).
    • The API should enable collecting a subset of metrics programmatically. For example, one might want to “collect all memory-related metrics” or “collect all metrics which are efficient to collect”.
  3. Performance.
    • Must have a minimized effect on the metrics it returns in the steady-state.
    • Should scale up to 100s of metrics, an amount that a human might consider “a lot.”
      • Note that picking the right types to expose can limit the amount of metrics we need to expose. For example, a distribution type would significantly reduce the number of metrics.
  4. Ergonomics.
    • The API should be as easy to use as it can be, given the above.

Design

See full design document at https://golang.org/design/37112-unstable-runtime-metrics.

Highlights:

  • Expose a new sampling-based API in a new package, the runtime/metrics package.
  • Use string keys for each metric which include the unit of the metric in an easily-parseable format.
  • Expose a discovery API which provides metadata about each metric at runtime, such as whether it requires a STW and whether it's cumulative (counter as opposed to a gauge).
  • Add a Histogram interface to the package which represents a distribution.
  • Support for event-based metrics is discussed and left open, but considered outside the scope of this proposal.

Backwards Compatibility

Note that although the set of metrics the runtime exposes will not be stable across Go versions, the API to discover and access those metrics will be.

Therefore, this proposal strictly increases the API surface of the Go standard library without changing any existing functionality and is therefore Go 1 compatible.

@gopherbot gopherbot added this to the Proposal milestone Feb 7, 2020
@gopherbot
Copy link

@gopherbot gopherbot commented Feb 8, 2020

Change https://golang.org/cl/218677 mentions this issue: design/37112-unstable-runtime-metrics.md: add proposal

@mknyszek mknyszek added this to Incoming in Proposals Feb 27, 2020
@rsc rsc changed the title proposal: API for unstable runtime metrics proposal: runtime: API for unstable metrics Mar 4, 2020
@rsc rsc moved this from Incoming to Active in Proposals Mar 4, 2020
@rsc
Copy link
Contributor

@rsc rsc commented Mar 11, 2020

@mknyszek, what's the status of this? Is the doc ready for review and discussion (if so please check it in)?

@mknyszek
Copy link
Contributor Author

@mknyszek mknyszek commented Mar 11, 2020

The doc is ready for review and discussion, I just don't have +2 or a review to land it yet. I'll try to ping more people to look at the CL.

gopherbot pushed a commit to golang/proposal that referenced this issue Mar 18, 2020
This change adds a proposal and design document for addition of a
runtime metrics package which is designed to support adding and removing
runtime metrics without breaking the Go 1 compatibility promise.

Updates golang/go#37112.

Change-Id: I550740970ad74c5c71d712735e4984808bdbf463
Reviewed-on: https://go-review.googlesource.com/c/proposal/+/218677
Reviewed-by: Austin Clements <austin@google.com>
@bcmills
Copy link
Member

@bcmills bcmills commented Mar 19, 2020

A couple of high-level comments:

  1. The Histogram interface seems too abstract. The user will need to know the space of possible concrete types anyway, and won't be able to use a given Histogram without knowing its concrete element type. Instead, I would be inclined to provide concrete Float64Histogram and TimeHistogram types — and then you don't need to worry about whether you can add methods to the interface in the future.

@bcmills
Copy link
Member

@bcmills bcmills commented Mar 19, 2020

  1. Should Metric type include type information about the metric values? That would allow the user to skip reading metrics that they won't be able to interpret anyway.

Actually, perhaps the Metric should include an allocation function..? That would make the allocation strategy a bit less implicit, and might reduce the risk of aliasing bugs in user code.

@mknyszek
Copy link
Contributor Author

@mknyszek mknyszek commented Mar 19, 2020

A couple of high-level comments:

  1. The Histogram interface seems too abstract. The user will need to know the space of possible concrete types anyway, and won't be able to use a given Histogram without knowing its concrete element type. Instead, I would be inclined to provide concrete Float64Histogram and TimeHistogram types — and then you don't need to worry about whether you can add methods to the interface in the future.

Given that we probably need to copy data out of the runtime anyway, I think you're right that it makes sense to just have concrete types.

  1. Should Metric type include type information about the metric values? That would allow the user to skip reading metrics that they won't be able to interpret anyway.

Actually, perhaps the Metric should include an allocation function..? That would make the allocation strategy a bit less implicit, and might reduce the risk of aliasing bugs in user code.

I considered including type information in the Metric in order to let users filter out metrics they don't understand, and I'm certainly not opposed. An allocation function is also a neat idea. One less potential footgun, but it is more work the user has to do to set this all up.

@bcmills
Copy link
Member

@bcmills bcmills commented Mar 19, 2020

It may also be worthwhile to have Read report an explicit error if a requested metric does not exist.

(I could imagine that someone might try to call Read with a hard-coded list of metrics, and then the user code may produce an obscure panic when trying to read the unpopulated value.)

@rsc
Copy link
Contributor

@rsc rsc commented Apr 1, 2020

I know there are minor details still being worked out, like whether Read returns an error in some cases. Other than those, it seems like people are generally in favor of this proposal.

Am I reading this wrong? Does anyone object to accepting this proposal?

@rsc
Copy link
Contributor

@rsc rsc commented Apr 8, 2020

OK, then based on the discussion above this seems like a likely accept.

@rsc rsc moved this from Active to Likely Accept in Proposals Apr 8, 2020
@rsc
Copy link
Contributor

@rsc rsc commented Apr 15, 2020

No change in consensus, so accepted.

@rsc rsc moved this from Likely Accept to Accepted in Proposals Apr 15, 2020
@rsc rsc removed this from the Proposal milestone Apr 15, 2020
@rsc rsc added this to the Backlog milestone Apr 15, 2020
@rsc rsc changed the title proposal: runtime: API for unstable metrics runtime: API for unstable metrics Apr 15, 2020
@mknyszek mknyszek self-assigned this Apr 24, 2020
@gopherbot
Copy link

@gopherbot gopherbot commented Aug 5, 2020

Change https://golang.org/cl/247041 mentions this issue: runtime,runtime/metrics: add memory metrics

@gopherbot
Copy link

@gopherbot gopherbot commented Aug 5, 2020

Change https://golang.org/cl/247040 mentions this issue: runtime/metrics: add package interface

@gopherbot
Copy link

@gopherbot gopherbot commented Oct 26, 2020

Change https://golang.org/cl/247045 mentions this issue: runtime,runtime/metrics: add object size distribution metrics

@gopherbot
Copy link

@gopherbot gopherbot commented Oct 26, 2020

Change https://golang.org/cl/247049 mentions this issue: runtime,runtime/metrics: export goroutine count as a metric

@gopherbot
Copy link

@gopherbot gopherbot commented Oct 26, 2020

Change https://golang.org/cl/247047 mentions this issue: runtime,runtime/metrics: add metric for distribution of GC pauses

@gopherbot
Copy link

@gopherbot gopherbot commented Oct 26, 2020

Change https://golang.org/cl/247046 mentions this issue: runtime: add timeHistogram type

@gopherbot
Copy link

@gopherbot gopherbot commented Oct 26, 2020

Change https://golang.org/cl/247043 mentions this issue: runtime,runtime/metrics: add heap object count metric

@gopherbot
Copy link

@gopherbot gopherbot commented Oct 26, 2020

Change https://golang.org/cl/247044 mentions this issue: runtime,runtime/metrics: add heap goal and GC cycle metrics

@mknyszek
Copy link
Contributor Author

@mknyszek mknyszek commented Oct 26, 2020

It may also be worthwhile to have Read report an explicit error if a requested metric does not exist.

(I could imagine that someone might try to call Read with a hard-coded list of metrics, and then the user code may produce an obscure panic when trying to read the unpopulated value.)

Oops, I totally forgot about this. Luckily, the case you're worried about came up in the intervening months, and the way I addressed it is by making the resulting Value have Kind KindBad if a metric doesn't exist. It's not an explicit error value returned by Read, but it does signal the issue (and it does seem right to set this anyway, even if we do decide on returning an explicit error value, too). WDYT? @bcmills

gopherbot pushed a commit that referenced this issue Oct 26, 2020
This change creates the runtime/metrics package and adds the initial
interface as laid out in the design document.

For #37112.

Change-Id: I202dcee08ab008dd63bf96f7a4162f5b5f813637
Reviewed-on: https://go-review.googlesource.com/c/go/+/247040
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Trust: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
gopherbot pushed a commit that referenced this issue Oct 26, 2020
This change adds support for a variety of runtime memory metrics and
contains the base implementation of Read for the runtime/metrics
package, which lives in the runtime.

It also adds testing infrastructure for the metrics package, and a bunch
of format and documentation tests.

For #37112.

Change-Id: I16a2c4781eeeb2de0abcb045c15105f1210e2d8a
Reviewed-on: https://go-review.googlesource.com/c/go/+/247041
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
Trust: Michael Knyszek <mknyszek@google.com>
gopherbot pushed a commit that referenced this issue Oct 26, 2020
For #37112.

Change-Id: Idd3dd5c84215ddd1ab05c2e76e848aa0a4d40fb0
Reviewed-on: https://go-review.googlesource.com/c/go/+/247043
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Trust: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
gopherbot pushed a commit that referenced this issue Oct 26, 2020
This change adds three new metrics: the heap goal, GC cycle count, and
forced GC count. These metrics are identical to their MemStats
counterparts.

For #37112.

Change-Id: I5a5e8dd550c0d646e5dcdbdf38274895e27cdd88
Reviewed-on: https://go-review.googlesource.com/c/go/+/247044
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Trust: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
gopherbot pushed a commit that referenced this issue Oct 26, 2020
This change adds metrics for the distribution of objects allocated and
freed by size, mirroring MemStats' BySize field.

For #37112.

Change-Id: Ibaf1812da93598b37265ec97abc6669c1a5efcbf
Reviewed-on: https://go-review.googlesource.com/c/go/+/247045
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Trust: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
gopherbot pushed a commit that referenced this issue Oct 26, 2020
This change adds a concurrent HDR time histogram to the runtime with
tests. It also adds a function to generate boundaries for use by the
metrics package.

For #37112.

Change-Id: Ifbef8ddce8e3a965a0dcd58ccd4915c282ae2098
Reviewed-on: https://go-review.googlesource.com/c/go/+/247046
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Trust: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
gopherbot pushed a commit that referenced this issue Oct 26, 2020
For #37112.

Change-Id: Ibb0425c9c582ae3da3b2662d5bbe830d7df9079c
Reviewed-on: https://go-review.googlesource.com/c/go/+/247047
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Trust: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
gopherbot pushed a commit that referenced this issue Oct 26, 2020
For #37112.

Change-Id: I994dfe848605b95ef6aec24f53869e929247e987
Reviewed-on: https://go-review.googlesource.com/c/go/+/247049
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Trust: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
@bcmills
Copy link
Member

@bcmills bcmills commented Nov 3, 2020

KindBad seems reasonable to me.

@narqo
Copy link
Contributor

@narqo narqo commented Dec 18, 2020

Would it be possible to update the proposal and the documentation describing the relationship between expvar and runtime/metrics?

expvar already exposes runtime.Memstats in some predefined format. Some projects use expvar as a vendor-neutral layer for exposing their own metrics. And there are third-party tooling, that pulls the data from expvar into different metrics collectors.

Now it feels confusing why std has two ways of exposing its internal stats.

@mknyszek
Copy link
Contributor Author

@mknyszek mknyszek commented Jan 6, 2021

@narqo Yeah, I'm happy to update both the proposal and the documentation. I'll make sure it happens before the release.

The short answer to the relationship between the two is that runtime/metrics is low-level, and intended to be a replacement for ReadMemStats. ReadMemStats has problems for us in terms of exposing statistics that are useful to us for both debugging and evolving the runtime. Future evolution of runtime/metrics could also involve "push" metrics ("time series data") instead of just "pull" metrics ("sampled data") which doesn't quite fit with expvar's model, so I think having a separate, underlying package still makes sense.

Another point here is that despite its use of string keys, runtime/metrics is actually much lower-level API that puts a lot of focus on efficiency at the price of usability. AFAICT, isn't really the case with expvar (despite some performance improvements I see that were made a few years ago).

I filed another issue to try to figure out how runtime/metrics should be exported through expvar (#43555) because I am honestly not 100% sure. I think I'm personally slightly leaning toward re-exporting runtime/metrics up through expvar as one big blob, but I have some concerns with just doing that.

Thank you for your input here, and please feel free to start/continue the discussion on #43555.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Proposals
Accepted
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
5 participants