
Counter, UpDownCounter, and Gauge instruments compared #156

Closed
wants to merge 4 commits into from

Conversation

jmacd
Contributor

@jmacd jmacd commented May 15, 2021

This is written in support of open-telemetry/opentelemetry-specification#1297.

@jmacd jmacd marked this pull request as ready for review May 17, 2021 23:16
@jmacd jmacd requested review from a team as code owners May 17, 2021 23:16

@ltyson ltyson left a comment


A few overall comments:

  • the introduction feels too long and doesn't really make it clear what the problem is -- I think that's the weakest part right now
  • I would suggest adding concrete examples of real-world metrics for each of the instrument types, to help example-based learners (like me) understand more easily


## Background

OpenTelemetry has a founding principal that the interface (API) should

tiny nit: 'principal' => 'principle'

used with Counter and UpDownCounter instruments, there is an assumed
relationship between the aggregation temporality and the choice of
synchronous or asynchronous API. Inputs to synchronous
(UpDown)Counter instruments are the changes of a Sum aggregation

thought: I find the phrase changes of a Sum aggregation difficult to mentally parse. What about adding the word incremental to more explicitly distinguish it from the behavior of an asynchronous counter, e.g.

Inputs to synchronous (UpDown)Counter instruments are the incremental changes to a Sum aggregation (i.e., deltas)

Member


+1.

I would suggest that we avoid "incremental" since it might suggest monotonicity (or we will have to put something like "incremental/decremental").
My suggestion would be "Inputs to synchronous (UpDown)Counter instruments are the deltas, while inputs to asynchronous instruments are always the absolute value (e.g. the total sum)."

synchronous or asynchronous API. Inputs to synchronous
(UpDown)Counter instruments are the changes of a Sum aggregation
(i.e., deltas). Inputs to asynchronous (UpDown)Counter instruments
are the totals of a Sum aggregation (i.e., cumulatives).

It might be useful to explain (or link to an explanation of) why synchronous Counters are assumed to be deltas vs why async Counters are assumed to be cumulative, to help the reader follow the reasoning of why those things are true (or common).
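The delta-vs-cumulative distinction discussed above can be sketched in a few lines. This is a toy illustration, not the real OpenTelemetry API: the class names and methods here are invented for the sketch. A synchronous counter receives increments (deltas) as events happen, while an asynchronous counter's callback reports the absolute total (cumulative) at collection time.

```python
class SyncCounter:
    """Callers pass increments (deltas); the SDK accumulates the Sum."""
    def __init__(self):
        self.total = 0

    def add(self, delta):
        self.total += delta


class AsyncCounter:
    """A callback reports the current total (cumulative) at collection time."""
    def __init__(self, observe_total):
        self.observe_total = observe_total

    def collect(self):
        return self.observe_total()


# Synchronous: the instrument is told about changes as they occur.
requests = SyncCounter()
requests.add(3)  # "3 more requests happened"
requests.add(2)

# Asynchronous: the instrument reads an externally maintained total,
# e.g. a value kept by the OS or a library.
bytes_total = 0

def read_total():
    return bytes_total

bytes_counter = AsyncCounter(read_total)
bytes_total = 1024  # the external total moves on its own

assert requests.total == 5           # built up from deltas
assert bytes_counter.collect() == 1024  # observed as an absolute total
```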

Sum points imply a linear scale of measurement. A Sum value that is
twice the amount of another actually means twice as much of the
variable was counted. Linear interpolation is considered to preserve
meaning.

You might consider adding a concrete example of a real life metric to help the reader follow the precise but very technical description above (maybe even a diagram, if it lends itself to that). There are some examples I've seen in other places that you could probably just steal, e.g. https://lightstep.com/blog/opentelemetry-101-what-are-metrics/

I'm thinking that would help people like me who learn more readily through examples. Same below with Gauge events.
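As a concrete illustration of the "linear scale" property described in the quoted text, here is a hypothetical sketch (the metric and its attribute values are invented): Sum points subdivided by an attribute can be re-merged by plain addition, and linear interpolation between two cumulative Sum points stays meaningful under the linearity assumption.

```python
# Requests subdivided by a (hypothetical) status-code attribute.
per_status = {"200": 940, "404": 50, "500": 10}

# Merging subdivided Sums is plain addition: twice the value
# really means twice as much was counted.
total = sum(per_status.values())
assert total == 1000

# Linear interpolation between two cumulative Sum points: halfway
# through the interval corresponds to half the counted amount.
t0, t1 = 0.0, 10.0
sum0, sum1 = 0, 1000
t = 5.0
interpolated = sum0 + (sum1 - sum0) * (t - t0) / (t1 - t0)
assert interpolated == 500.0
```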


## New measurements: Counter and UpDownCounter instruments

Sum points have been defined to have linear scale of measurement,

I had some trouble deciphering the phrase "linear scale of measurement". I understand the concept, after reading the rest of the paragraph, but I wonder if there's a different way to say this, maybe by adding an explanatory aside, something like this (although it probably doesn't use the right terminology)

"Sum points have been defined to have linear scale of measurement, meaning the same Sum point value could be obtained through many different combinations of metric event values. This property can also be applied in reverse, meaning that Sum points can be subdivided."

instrumentation library with or without subdivided Sums and to
meaningfully aggregate data with a mixture of attributes.

## New measurements: Gauge instruments

Nice! This section is super clear and easy to understand.

nit: "New measurements" => "Adding new measurements" for clarity? Same below.

The OpenTelemetry Metrics API introduces a new kind of instrument, the
UpDownCounter, that behaves like a Counter, meaning that attributes
subdivide the variable being counted, but their primary interpretation
is like that of a Gauge.

At this point I'm thinking, "ok, that's nice, but why should I care?"

I would suggest starting this doc at the very top by briefly illustrating the problem that exists when you only have Counter and Gauge instruments, to help the reader see that things aren't fine the way they are, and motivate them to understand the problem and care about this solution.

Actually, you might just start with the use-case that you wrote up for me in our slack chat about this.

The user journey I'm after is this:
1. As a user I know nothing about metrics named random_oss_software.* but they were produced with OTel SDKs.
2. I click "Create a dashboard"
3. I enter the "random_oss.software.*" pattern
4. I get a good dashboard.

and then show at least one scenario (doesn't have to be exhaustive) where this is not possible to do without the UpDownCounter instrument.

Then you could move into the rest of the doc as written.

the meaning in the original events is preserved in the complete set of
timeseries.

Removing attributes from metric streams without changing meaning

I'd suggest moving this down below the example, and adding a sentence afterward to draw attention to the problem, e.g.

"...which means applying the natural aggregation function to merge metric streams. But how do we know which is the "natural" aggregation function? For Counters the answer is always SUM (for: reason), however for Gauges it might be SUM or MEAN depending on the semantics of the values represented by the particular metric. (for: reason). Therein lies the problem."
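The "natural aggregation" idea in the suggested text above can be sketched as a group-by that erases an attribute. Everything here (the `host` and `route` attributes, the point values) is hypothetical; the point is that the merge function differs by instrument kind: SUM for Counters, and possibly MEAN for a gauge-like metric whose values are not summable.

```python
from collections import defaultdict

# Hypothetical points: (attributes, value), subdivided by "host" and "route".
points = [
    ({"host": "a", "route": "/x"}, 3),
    ({"host": "b", "route": "/x"}, 5),
    ({"host": "a", "route": "/y"}, 2),
]

def group_by(points, keep_keys, agg):
    """Erase all attributes except keep_keys, merging streams with agg."""
    groups = defaultdict(list)
    for attrs, value in points:
        key = tuple(sorted((k, v) for k, v in attrs.items() if k in keep_keys))
        groups[key].append(value)
    return {key: agg(values) for key, values in groups.items()}

# Counter: SUM is the natural aggregation when dropping "host".
counter_view = group_by(points, {"route"}, sum)
assert counter_view[(("route", "/x"),)] == 8

# Gauge-like metric: MEAN may be the natural aggregation instead.
gauge_view = group_by(points, {"route"}, lambda vs: sum(vs) / len(vs))
assert gauge_view[(("route", "/x"),)] == 4.0
```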

- Ratios are not necessarily meaningful
- Linear interpolation is not necessarily supported.

## Attributes are used for interpretation

This section header doesn't quite seem to flow naturally with the text, maybe call this section something
like "Adding attributes to events for additional meaning" to help tie it to previous and subsequent sections.

_Metric timeseries_ is the output of aggregating a stream of data
points for a specific set of resource and attribute dimensions.

## Meaning and interpretation of Counter and UpDownCounter events

I don't know if this is what you're going for, but if your main goal is to explain the need for UpDownCounter, it might make sense to omit it from this section and wait to present it later as a solution to the problem described in this doc. If that's the goal, it feels a bit premature to describe it here, because I as a reader don't yet know why it's necessary or useful and I get a bit lost.


_Metric data stream_ is a collection of data points, written by a
writer, having an identity that consists of the instrument's name, the
instrumentation library, resource attributes, and metric attributes.


Is this meant to help define an interface/standard or simply saying that a data stream encompasses all of the data and the source?


_Metric data points_ are the items in a stream, each has a point
kind. For the purpose of this text, the point kind is Sum or Gauge.
Sum points have two options: Temporality and Monotonicity.


Temporality is previously defined, but not Monotonicity.

measurements. No implied linear scale of measurement, therefore:

- Rate-of-change may not be well defined
- Ratios are not necessarily meaningful


These statements feel too strong to me. I can think of instances where gauges are very useful in context of rate, ratios, and trend modeling. However, I think it's important to say that Gauges are signed with their own time stamp. Valuable analysis likely needs some data alignment (aggregating by some method to regular time intervals) in order to make meaningful comparisons


The same aggregation can be applied when removing an attributes from
metric streams forces reaggregation. The most current value should be
selected. In case of identical timestamps, a random value should be


The random value sample for time collisions is an important decision that should be clearly documented. There's a lot of behavior that surfaces when this happens that is difficult to understand. I do think it's a good solution, just needs to be clearly documented.
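The "most current value wins, random on ties" rule discussed above can be sketched directly. This is a toy merge function (the function name and point format are invented for the sketch), merging Gauge points from streams whose distinguishing attribute has been erased.

```python
import random

def merge_gauge_points(points):
    """points: list of (timestamp, value) from the streams being merged.

    Keep the value with the latest timestamp; if several points share
    the latest timestamp, pick one of them at random (the documented
    tie-break in the text above)."""
    latest = max(ts for ts, _ in points)
    candidates = [v for ts, v in points if ts == latest]
    return random.choice(candidates)

# Unique latest timestamp: deterministic result.
merged = merge_gauge_points([(10, 3.0), (12, 7.0), (11, 5.0)])
assert merged == 7.0

# Timestamp collision: the result is one of the tied values, chosen randomly.
tied = merge_gauge_points([(12, 1.0), (12, 2.0)])
assert tied in (1.0, 2.0)
```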


nit: removing an attributes => either "removing an attribute" or "removing attributes"


@ocelotl ocelotl left a comment


There are several things that are hard for me to understand. It is very likely because of how little I know of the concepts of metrics and the terminology used to describe them. Nevertheless, maybe this is useful information to the author in case it is possible to rephrase some of this OTEP to make it a bit clearer to more inexperienced readers 🙂


## Glossary

_Meaning_: Metrics API events have a semantic definition that dictates

This would be a bit easier to understand if the glossary also had an entry for Metrics API events.

Metrics API instruments. For the purpose of this text, it is a
Counter, an UpDownCounter, or a Gauge.

_Metric attributes_ can be applied to Metric API events, which allows

This describes what can be done with metric attributes (can be applied to metric API events), but it does not actually define what metric attributes are.


I would imagine attributes are specific to each metric so the spec wouldn't define what the attributes are. This is what is referred to as labels in the ASCII art in the spec but the agreed-upon name going forward is attributes.

attribute dimensions.

_Metric data stream_ is a collection of data points, written by a
writer, having an identity that consists of the instrument's name, the

Is "written by a writer" redundant?

attribute dimensions.

_Metric data stream_ is a collection of data points, written by a
writer, having an identity that consists of the instrument's name, the

...having an identity that consists of the instrument's name

What instrument is this? How is this instrument related to the metric data stream?

attribute dimensions.

_Metric data stream_ is a collection of data points, written by a
writer, having an identity that consists of the instrument's name, the

Who has an identity that consists of the instrument's name, the instrumentation library, resource attributes, and metric attributes? The Metric data stream or the data points?


The rate interpretation is preferred for monotonic Sum points, and
the current total interpretation is preferred for non-monotonic Sum
points. Both interpretations are meaningful and useful for both kinds

Bit confused here, first this says that one interpretation is preferred for a certain kind of sum point and the other one is preferred for the other kind of sum point. Then it says that both are "meaningful and useful" for both kinds of sum point. Then, why is one preferred over the other? 🤷


Sum points imply a linear scale of measurement. A Sum value that is
twice the amount of another actually means twice as much of the
variable was counted. Linear interpolation is considered to preserve

Sorry, what do you mean with "linear interpolation is considered to preserve meaning"? Do you mean that it is possible to interpolate linearly between two sum points?

Gauge instruments produce Gauge metric data points, which are taken
to have meaning in a metric stream as follows:

- Gauge point values are individual measurements captured at an instant in time

Aren't sum points also individual measurements captured at an instant in time?

synchronous and asynchronous measurements. When recording Gauge
values through a synchronous API, the interpretation is "last known
value", and when recording Gauge values through an asynchronous API
the interpretation is "current value".

Hm, "last known value" and "current value" are very similar. It is hard to understand the difference between synchronous gauge and asynchronous gauge using this concept.

the interpretation is "current value".

The distinction between last known value (synchronous) and current
value (asynchronous) is considered not significant in the data model.

Hm, but it is significant for the reader of this document to understand the difference between synchronous gauge and asynchronous gauge...

Contributor

@jsuereth jsuereth left a comment


I like a lot of things in this document, but I think it suffers from a few things.

  1. the bottom section outlining when to use Gauge vs. UpDownCounter is gold and should be teased apart or put first.

  2. The notion of "safe operations" to perform regarding Attributes (like removal/addition) should be formalized as part of the DataModel. I think you took a good first crack, but we may want to flesh out ALL of these things so that it's clear what operations are allowed in "telemetry schema" and which are considered "compatible"

  3. The distinction of "last value" vs. "current value" is (currently) a bit confusingly worded. It also makes me wonder if we should represent SyncGauge differently in OTLP regarding "start timestamp" and "current timestamp". E.g. should we ALWAYS have sync gauge use the timestamp when value was recorded vs. when it is reported externally?

zero with each interval (_delta_) or accumulated over a sequence of
intervals (_cumulative_). Both forms of temporality are considered
important, as they offer a useful tradeoff between cost and
reliability. The data model specifies that a change of temporality

If you're talking version of a service, possibly. I could code service A to use DELTAs, then decide to switch to cumulatives in version 2.

Practically you shouldn't expect a service to be going back and forth between delta + cumulative points.
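The temporality change the data model allows (quoted above) is mechanically simple, and a short sketch may help: a delta stream becomes cumulative via a running sum, and the reverse is pairwise differencing. The example values are invented.

```python
from itertools import accumulate

# A delta stream: each point is the change since the previous interval.
deltas = [4, 1, 0, 3]

# Delta -> cumulative: running sum over the sequence of intervals.
cumulatives = list(accumulate(deltas))
assert cumulatives == [4, 5, 5, 8]

# Cumulative -> delta: first point as-is, then pairwise differences.
recovered = [cumulatives[0]] + [
    b - a for a, b in zip(cumulatives, cumulatives[1:])
]
assert recovered == deltas
```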

Note that these two statements imply different interpretation for
synchronous and asynchronous measurements. When recording Gauge
values through a synchronous API, the interpretation is "last known
value", and when recording Gauge values through an asynchronous API

Not sure I buy this. In a synchronous Gauge, recording a value is "Here's the current value".

The difference in Async vs. Sync is whether you can have the "current value" or "last known value" be sampled on-demand.

i.e. I think the terminology here is being phrased from an SDK/exporter perspective vs. the instrument's perspective (which I assume is closer to a user of metrics API)

the interpretation is "current value".

The distinction between last known value (synchronous) and current
value (asynchronous) is considered not significant in the data model.

I think you're implicitly saying that when you export metrics, you're setting the Timestamp for a synchronous Gauge to "export time", whereas the actual timestamp was when the synchronous instrument sent the last piece of data.

E.g. reporting interval:

|       X       X           |
t0     t1      t2           t3

Here, the synchronous gauge is reporting its value at t1 and t2. For the collection interval t0 -> t3, we report the value recorded at t2 BUT you're saying for DELTA + CUMULATIVE it would report its timestamp at t3, whereas an asynchronous instrument would just capture a value at t3.

I think this needs some kind of picture.

data requires attention to the meaning being conveyed.

Safe attribute erasure for OpenTelemetry Metrics streams is specified
in a way that preserves meaning while removing only the forms of

What if "removing the forms of interpretation that made use of the erased attribute" basically is destroying the practical value of the metric?

I think this is implying that removing an attribute from a metric is an ok thing to do (it' s not). The meaning that IS preserved is relevant to reaggregation + processing, but could be disastrous to dashboards and users.


In database systems, this process is referred to as performing a
"Group-By", where aggregation is used to combine streams within each
distinct set of grouped attributes. For the benefit of OpenTelemetry

This kind of falls over for me in practice. Database Group-By is a full query language where NORMALLY you need to provide your own Aggregation semantic for every column of data or the query will fail.

What we're suggesting here is instead of:

SELECT AVG(count) FROM x GROUP BY label

The default aggregation function is inferred by OTel based on the metric type.

Two things:

  1. We shouldn't prevent users from using different aggregation functions in practice. Indeed, once they make it to PromQL (or MQL for GCP), they'll be able to do whatever they want here and need to track their own meaning.
  2. I assume a lot of the gymnastics we go through around default aggregation is to avoid exposing a query-like language for "rewrite rule" style collection behavior.

My $.02 here is the focus on giving users a way to solve "rewrite rules" use cases is good. Making it as easy as possible is good. If we can't explain what is and isn't safe in very simple terms, we might be in trouble. If the "meaning" we retain isn't the one users wanted to retain, then we're not really adding value.

When conveying meaning to the user by grouping and aggregating over a
subset of attribute keys, the default aggregation selected should be
one that preserves meaning. For monotonic Counter instruments, this
means conveying the combined rate of each group. For UpDownCounter

Why does monotonic counter instruments turn into a rate? Did I miss an above description on how UpDown vs. Counter have different aggregation meaning?


Maybe this description?

@jmacd
Contributor Author

jmacd commented Sep 8, 2021

I will close this until such a point in time as someone convinces me it's needed.
