
Counter, UpDownCounter, and Gauge instruments compared #156

Closed
wants to merge 4 commits into from

Conversation

jmacd
Contributor

@jmacd jmacd commented May 15, 2021

This is written in support of open-telemetry/opentelemetry-specification#1297.

@jmacd jmacd marked this pull request as ready for review May 17, 2021 23:16
@jmacd jmacd requested review from a team as code owners May 17, 2021 23:16

@ltyson ltyson left a comment


A few overall comments:

  • the introduction feels too long and doesn't really make it clear what the problem is -- I think that's the weakest part right now
  • I would suggest adding concrete examples of real-world metrics for each of the instrument types, to help example-based learners (like me) understand more easily


## Background

OpenTelemetry has a founding principal that the interface (API) should

tiny nit: 'principal' => 'principle'

used with Counter and UpDownCounter instruments, there is an assumed
relationship between the aggregation temporality and the choice of
synchronous or asynchronous API. Inputs to synchronous
(UpDown)Counter instruments are the changes of a Sum aggregation

thought: I find the phrase changes of a Sum aggregation difficult to mentally parse. What about adding the word incremental to more explicitly distinguish it from the behavior of an asynchronous counter, e.g.

Inputs to synchronous (UpDown)Counter instruments are the incremental changes to a Sum aggregation (i.e., deltas)

Member


+1.

I would suggest that we avoid "incremental" since it might suggest monotonicity (or we will have to put something like "incremental/decremental").
My suggestion would be "Inputs to synchronous (UpDown)Counter instruments are the deltas, while inputs to asynchronous instruments are always the absolute value (e.g. the total sum)."

synchronous or asynchronous API. Inputs to synchronous
(UpDown)Counter instruments are the changes of a Sum aggregation
(i.e., deltas). Inputs to asynchronous (UpDown)Counter instruments
are the totals of a Sum aggregation (i.e., cumulatives).

It might be useful to explain (or link to an explanation of) why synchronous Counters are assumed to be deltas vs why async Counters are assumed to be cumulative, to help the reader follow the reasoning of why those things are true (or common).
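The delta-vs-cumulative distinction discussed above can be sketched in a few lines. This is a toy illustration, not the real OpenTelemetry API: the class names and methods here are invented for the sketch. A synchronous counter receives increments (deltas) as events happen, while an asynchronous counter's callback reports the absolute total (cumulative) at collection time.

```python
class SyncCounter:
    """Callers pass increments (deltas); the SDK accumulates the Sum."""
    def __init__(self):
        self.total = 0

    def add(self, delta):
        self.total += delta


class AsyncCounter:
    """A callback reports the current total (cumulative) at collection time."""
    def __init__(self, observe_total):
        self.observe_total = observe_total

    def collect(self):
        return self.observe_total()


# Synchronous: the instrument is told about changes as they occur.
requests = SyncCounter()
requests.add(3)  # "3 more requests happened"
requests.add(2)

# Asynchronous: the instrument reads an externally maintained total,
# e.g. a value kept by the OS or a library.
bytes_total = 0

def read_total():
    return bytes_total

bytes_counter = AsyncCounter(read_total)
bytes_total = 1024  # the external total moves on its own

assert requests.total == 5           # built up from deltas
assert bytes_counter.collect() == 1024  # observed as an absolute total
```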

Sum points imply a linear scale of measurement. A Sum value that is
twice the amount of another actually means twice as much of the
variable was counted. Linear interpolation is considered to preserve
meaning.

You might consider adding a concrete example of a real life metric to help the reader follow the precise but very technical description above (maybe even a diagram, if it lends itself to that). There are some examples I've seen in other places that you could probably just steal, e.g. https://lightstep.com/blog/opentelemetry-101-what-are-metrics/

I'm thinking that would help people like me who learn more readily through examples. Same below with Gauge events.
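As a concrete illustration of the "linear scale" property described in the quoted text, here is a hypothetical sketch (the metric and its attribute values are invented): Sum points subdivided by an attribute can be re-merged by plain addition, and linear interpolation between two cumulative Sum points stays meaningful under the linearity assumption.

```python
# Requests subdivided by a (hypothetical) status-code attribute.
per_status = {"200": 940, "404": 50, "500": 10}

# Merging subdivided Sums is plain addition: twice the value
# really means twice as much was counted.
total = sum(per_status.values())
assert total == 1000

# Linear interpolation between two cumulative Sum points: halfway
# through the interval corresponds to half the counted amount.
t0, t1 = 0.0, 10.0
sum0, sum1 = 0, 1000
t = 5.0
interpolated = sum0 + (sum1 - sum0) * (t - t0) / (t1 - t0)
assert interpolated == 500.0
```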


## New measurements: Counter and UpDownCounter instruments

Sum points have been defined to have linear scale of measurement,

I had some trouble deciphering the phrase "linear scale of measurement". I understand the concept, after reading the rest of the paragraph, but I wonder if there's a different way to say this, maybe by adding an explanatory aside, something like this (although it probably doesn't use the right terminology)

"Sum points have been defined to have linear scale of measurement, meaning the same Sum point value could be obtained through many different combinations of metric event values. This property can also be applied in reverse, meaning that Sum points can be subdivided."

instrumentation library with or without subdivided Sums and to
meaningfully aggregate data with a mixture of attributes.

## New measurements: Gauge instruments

Nice! This section is super clear and easy to understand.

nit: "New measurements" => "Adding new measurements" for clarity? Same below.

The OpenTelemetry Metrics API introduces a new kind of instrument, the
UpDownCounter, that behaves like a Counter, meaning that attributes
subdivide the variable being counted, but their primary interpretation
is like that of a Gauge.

At this point I'm thinking, "ok, that's nice, but why should I care?"

I would suggest starting this doc at the very top by briefly illustrating the problem that exists when you only have Counter and Gauge instruments, to help the reader see that things aren't fine the way they are, and motivate them to understand the problem and care about this solution.

Actually, you might just start with the use-case that you wrote up for me in our slack chat about this.

The user journey I'm after is this:
1. As a user I know nothing about metrics named random_oss_software.* but they were produced with OTel SDKs.
2. I click "Create a dashboard"
3. I enter the "random_oss.software.*" pattern
4. I get a good dashboard.

and then show at least one scenario (doesn't have to be exhaustive) where this is not possible to do without the UpDownCounter instrument.

Then you could move into the rest of the doc as written.

the meaning in the original events is preserved in the complete set of
timeseries.

Removing attributes from metric streams without changing meaning

I'd suggest moving this down below the example, and adding a sentence afterward to draw attention to the problem, e.g.

"...which means applying the natural aggregation function to merge metric streams. But how do we know which is the "natural" aggregation function? For Counters the answer is always SUM (for: reason), however for Gauges it might be SUM or MEAN depending on the semantics of the values represented by the particular metric. (for: reason). Therein lies the problem."
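The "natural aggregation" idea in the suggested text above can be sketched as a group-by that erases an attribute. Everything here (the `host` and `route` attributes, the point values) is hypothetical; the point is that the merge function differs by instrument kind: SUM for Counters, and possibly MEAN for a gauge-like metric whose values are not summable.

```python
from collections import defaultdict

# Hypothetical points: (attributes, value), subdivided by "host" and "route".
points = [
    ({"host": "a", "route": "/x"}, 3),
    ({"host": "b", "route": "/x"}, 5),
    ({"host": "a", "route": "/y"}, 2),
]

def group_by(points, keep_keys, agg):
    """Erase all attributes except keep_keys, merging streams with agg."""
    groups = defaultdict(list)
    for attrs, value in points:
        key = tuple(sorted((k, v) for k, v in attrs.items() if k in keep_keys))
        groups[key].append(value)
    return {key: agg(values) for key, values in groups.items()}

# Counter: SUM is the natural aggregation when dropping "host".
counter_view = group_by(points, {"route"}, sum)
assert counter_view[(("route", "/x"),)] == 8

# Gauge-like metric: MEAN may be the natural aggregation instead.
gauge_view = group_by(points, {"route"}, lambda vs: sum(vs) / len(vs))
assert gauge_view[(("route", "/x"),)] == 4.0
```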

- Ratios are not necessarily meaningful
- Linear interpolation is not necessarily supported.

## Attributes are used for interpretation

This section header doesn't quite seem to flow naturally with the text, maybe call this section something
like "Adding attributes to events for additional meaning" to help tie it to previous and subsequent sections.

_Metric timeseries_ is the output of aggregating a stream of data
points for a specific set of resource and attribute dimensions.

## Meaning and interpretation of Counter and UpDownCounter events

I don't know if this is what you're going for, but if your main goal is to explain the need for UpDownCounter, it might make sense to omit it from this section and wait to present it later as a solution to the problem described in this doc. If that's the goal, it feels a bit premature to describe it here, because I as a reader don't yet know why it's necessary or useful and I get a bit lost.


_Metric data stream_ is a collection of data points, written by a
writer, having an identity that consists of the instrument's name, the
instrumentation library, resource attributes, and metric attributes.


Is this meant to help define an interface/standard or simply saying that a data stream encompasses all of the data and the source?


_Metric data points_ are the items in a stream, each has a point
kind. For the purpose of this text, the point kind is Sum or Gauge.
Sum points have two options: Temporality and Monotonicity.


Temporality is previously defined, but not Monotonicity.

measurements. No implied linear scale of measurement, therefore:

- Rate-of-change may not be well defined
- Ratios are not necessarily meaningful


These statements feel too strong to me. I can think of instances where gauges are very useful in context of rate, ratios, and trend modeling. However, I think it's important to say that Gauges are signed with their own time stamp. Valuable analysis likely needs some data alignment (aggregating by some method to regular time intervals) in order to make meaningful comparisons


The same aggregation can be applied when removing an attributes from
metric streams forces reaggregation. The most current value should be
selected. In case of identical timestamps, a random value should be


The random value sample for time collisions is an important decision that should be clearly documented. There's a lot of behavior that surfaces when this happens that is difficult to understand. I do think it's a good solution, just needs to be clearly documented.
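The "most current value wins, random on ties" rule discussed above can be sketched directly. This is a toy merge function (the function name and point format are invented for the sketch), merging Gauge points from streams whose distinguishing attribute has been erased.

```python
import random

def merge_gauge_points(points):
    """points: list of (timestamp, value) from the streams being merged.

    Keep the value with the latest timestamp; if several points share
    the latest timestamp, pick one of them at random (the documented
    tie-break in the text above)."""
    latest = max(ts for ts, _ in points)
    candidates = [v for ts, v in points if ts == latest]
    return random.choice(candidates)

# Unique latest timestamp: deterministic result.
merged = merge_gauge_points([(10, 3.0), (12, 7.0), (11, 5.0)])
assert merged == 7.0

# Timestamp collision: the result is one of the tied values, chosen randomly.
tied = merge_gauge_points([(12, 1.0), (12, 2.0)])
assert tied in (1.0, 2.0)
```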


nit: removing an attributes => either "removing an attribute" or "removing attributes"


@ocelotl ocelotl left a comment


There are several things that are hard for me to understand. It is very likely because of how little I know of the concepts of metrics and the terminology used to describe them. Nevertheless, maybe this is useful information to the author in case it is possible to rephrase some of this OTEP to make it a bit clearer to more inexperienced readers 🙂


## Glossary

_Meaning_: Metrics API events have a semantic definition that dictates

This would be a bit easier to understand if the glossary also had an entry for Metrics API events.

Metrics API instruments. For the purpose of this text, it is a
Counter, an UpDownCounter, or a Gauge.

_Metric attributes_ can be applied to Metric API events, which allows

This describes what can be done with metric attributes (can be applied to metric API events), but it does not actually define what metric attributes are.


I would imagine attributes are specific to each metric so the spec wouldn't define what the attributes are. This is what is referred to as labels in the ASCII art in the spec but the agreed-upon name going forward is attributes.

attribute dimensions.

_Metric data stream_ is a collection of data points, written by a
writer, having an identity that consists of the instrument's name, the

Is "written by a writer" redundant?

attribute dimensions.

_Metric data stream_ is a collection of data points, written by a
writer, having an identity that consists of the instrument's name, the

...having an identity that consists of the instrument's name

What instrument is this? How is this instrument related to the metric data stream?

attribute dimensions.

_Metric data stream_ is a collection of data points, written by a
writer, having an identity that consists of the instrument's name, the

Who has an identity that consists of the instrument's name, the instrumentation library, resource attributes, and metric attributes? The Metric data stream or the data points?


The rate interpretation is preferred for monotonic Sum points, and
the current total interpretation is preferred for non-monotonic Sum
points. Both interpretations are meaningful and useful for both kinds

Bit confused here, first this says that one interpretation is preferred for a certain kind of sum point and the other one is preferred for the other kind of sum point. Then it says that both are "meaningful and useful" for both kinds of sum point. Then, why is one preferred over the other? 🤷


Sum points imply a linear scale of measurement. A Sum value that is
twice the amount of another actually means twice as much of the
variable was counted. Linear interpolation is considered to preserve

Sorry, what do you mean with "linear interpolation is considered to preserve meaning"? Do you mean that it is possible to interpolate linearly between two sum points?

Gauge instruments produce Gauge metric data points, which are taken
to have meaning in a metric stream as follows:

- Gauge point values are individual measurements captured at an instant in time

Aren't sum points also individual measurements captured at an instant in time?

synchronous and asynchronous measurements. When recording Gauge
values through a synchronous API, the interpretation is "last known
value", and when recording Gauge values through an asynchronous API
the interpretation is "current value".

Hm, "last known value" and "current value" are very similar. It is hard to understand the difference between synchronous gauge and asynchronous gauge using this concept.

the interpretation is "current value".

The distinction between last known value (synchronous) and current
value (asynchronous) is considered not significant in the data model.

Hm, but it is significant for the reader of this document to understand the difference between synchronous gauge and asynchronous gauge...

Contributor

@jsuereth jsuereth left a comment


I like a lot of things in this document, but I think it suffers from a few things.

  1. the bottom section outlining when to use Gauge vs. UpDownCounter is gold and should be teased apart or put first.

  2. The notion of "safe operations" to perform regarding Attributes (like removal/addition) should be formalized as part of the DataModel. I think you took a good first crack, but we may want to flesh out ALL of these things so that it's clear what operations are allowed in "telemetry schema" and which are considered "compatible"

  3. The distinction of "last value" vs. "current value" is (currently) a bit confusingly worded. It also makes me wonder if we should represent SyncGauge differently in OTLP regarding "start timestamp" and "current timestamp". E.g. should we ALWAYS have sync gauge use the timestamp when value was recorded vs. when it is reported externally?

zero with each interval (_delta_) or accumulated over a sequence of
intervals (_cumulative_). Both forms of temporality are considered
important, as they offer a useful tradeoff between cost and
reliability. The data model specifies that a change of temporality

If you're talking version of a service, possibly. I could code service A to use DELTAs, then decide to switch to cumulatives in version 2.

Practically you shouldn't expect a service to be going back and forth between delta + cumulative points.
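The temporality change the data model allows (quoted above) is mechanically simple, and a short sketch may help: a delta stream becomes cumulative via a running sum, and the reverse is pairwise differencing. The example values are invented.

```python
from itertools import accumulate

# A delta stream: each point is the change since the previous interval.
deltas = [4, 1, 0, 3]

# Delta -> cumulative: running sum over the sequence of intervals.
cumulatives = list(accumulate(deltas))
assert cumulatives == [4, 5, 5, 8]

# Cumulative -> delta: first point as-is, then pairwise differences.
recovered = [cumulatives[0]] + [
    b - a for a, b in zip(cumulatives, cumulatives[1:])
]
assert recovered == deltas
```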

Note that these two statements imply different interpretation for
synchronous and asynchronous measurements. When recording Gauge
values through a synchronous API, the interpretation is "last known
value", and when recording Gauge values through an asynchronous API

Not sure I buy this. In a synchronous Gauge, recording a value is "Here's the current value".

The difference in Async vs. Sync is whether you can have the "current value" or "last known value" be sampled on-demand.

i.e. I think the terminology here is being phrased from an SDK/exporter perspective vs. the instrument's perspective (which I assume is closer to a user of metrics API)

the interpretation is "current value".

The distinction between last known value (synchronous) and current
value (asynchronous) is considered not significant in the data model.

I think you're implicitly saying that when you export metrics, you're setting the Timestamp for a synchronous Gauge to "export time", whereas the actual timestamp was when the synchronous instrument sent the last piece of data.

E.g. reporting interval:

|       X       X           |
t0     t1      t2           t3

Here, the synchronous gauge is reporting its value at t1 and t2. For the collection interval t0 -> t3, we report the value recorded at t2 BUT you're saying for DELTA + CUMULATIVE it would report its timestamp at t3, whereas an asynchronous instrument would just capture a value at t3.

I think this needs some kind of picture.

data requires attention to the meaning being conveyed.

Safe attribute erasure for OpenTelemetry Metrics streams is specified
in a way that preserves meaning while removing only the forms of

What if "removing the forms of interpretation that made use of the erased attribute" basically is destroying the practical value of the metric?

I think this is implying that removing an attribute from a metric is an ok thing to do (it' s not). The meaning that IS preserved is relevant to reaggregation + processing, but could be disastrous to dashboards and users.


In database systems, this process is referred to as performing a
"Group-By", where aggregation is used to combine streams within each
distinct set of grouped attributes. For the benefit of OpenTelemetry

This kind of falls over for me in practice. Database Group-By is a full query language where NORMALLY you need to provide your own Aggregation semantic for every column of data or the query will fail.

What we're suggesting here is instead of:

SELECT AVG(count) FROM x GROUP BY label

The default aggregation function is inferred by OTel based on the metric type.

Two things:

  1. We shouldn't prevent users from using different aggregation functions in practice. Indeed, once they make it to PromQL (or MQL for GCP), they'll be able to do whatever they want here and need to track their own meaning.
  2. I assume a lot of the gymnastics we go through around default aggregation is to avoid exposing a query-like language for "rewrite rule" style collection behavior.

My $.02 here is the focus on giving users a way to solve "rewrite rules" use cases is good. Making it as easy as possible is good. If we can't explain what is and isn't safe in very simple terms, we might be in trouble. If the "meaning" we retain isn't the one users wanted to retain, then we're not really adding value.

When conveying meaning to the user by grouping and aggregating over a
subset of attribute keys, the default aggregation selected should be
one that preserves meaning. For monotonic Counter instruments, this
means conveying the combined rate of each group. For UpDownCounter

Why does monotonic counter instruments turn into a rate? Did I miss an above description on how UpDown vs. Counter have different aggregation meaning?


Maybe this description?

@jmacd
Contributor Author

jmacd commented Sep 8, 2021

I will close this until such a point in time as someone convinces me it's needed.
