Metrics Proto Does Not Support the OpenTelemetry Specification #125

MrAlias · 2020-03-13T19:37:43Z

As language SIGs include support to export metrics using the OTLP it is being discovered that the native export format they are exporting from do not fit nicely or at all into the OTLP structure. This issue is intended to track these friction points, proposed changes, and action items to resolution.

Ideally the OTLP would natively support the output of all combination of Instruments and Aggregators defined in the OpenTelemetry Specification. Currently, this is not the case.

Identified Issues

First Class Support For Min and Max Values of a MinMaxSumCount Aggregation

There is no metric kind for a MinMaxSumCount aggregator that outputs the minimum, maximum, sum, and count of events observed.

It has been suggested that a Summary metric kind can be used to transport these values. The Summary kind does have fields for the count and sum, however, it does not have dedicated fields for the maximum nor minimum.

The suggested work around here is to send the maximum as the 100th percentile and the minimum as the 0th percentile. Both of which are not obvious and the latter is mathematically incorrect (the 0th percentile is the value which 0% of events occured, where as the minimum is the minimal value where at least 1 event occured).

HistogramValue.Bucket.Exemplar is inadequate

As outlined in the linked ticket:

It's not possible to provide exemplars for non-histograms
It's not possible to provide more than one exemplar per bucket
It requires converting certain data types to a string, which will require additional specification

Instrumentation Instead of Aggregation

The OTLP metric kinds are centered around instruments not output from Aggregation:

    ...
    GAUGE_INT64 = 1;
    GAUGE_DOUBLE = 2;
    GAUGE_HISTOGRAM = 3;
    COUNTER_INT64 = 4;
    COUNTER_DOUBLE = 5;
    CUMULATIVE_HISTOGRAM = 6;
    SUMMARY = 7;

The incongruence of the instruments these kinds were modeled after and the actual instruments of the OpenTelemetry Specification is secondary to the fact that they are modeled after instruments in the first place.

If the goal of the OTLP is to transport the output of Instrument -> Aggregator, it should be modeled after the output of the Aggregators, aggregations. And while, yes, the Histogram is one of these aggregations, the included histogram kinds are conflated with instrument qualifiers (Gauge, Cumulative).

Nebulously Defined Need to be Compatible with OpenMetrics

A common refrain for the design of the OTLP is "We are trying to also stay compatible with OpenMetrics protocol".

This desire to remain compatible is an admirable one, it is negatively impacting this project. The OTLP in its pursuit to maintain easy translation to OpenMetrics has made it difficult to translate from OpenTelemetry into it. Additionally, it has led to a design that is bloated with OpenMetrics metric kinds (GAUGE_INT64, GAUGE_DOUBLE) that OpenTelemetry does not use.

It seems as though if support for the OpenMetrics protocol is desired, the implementations of the OpenTelemetry Specification should support that protocol. Instead of having the OTLP try to match that projects decisions.

Steps Forward

I'm hoping to have this Issue start the discussion on how we can resolve these issues. I see the metrics proto in need of a major overhaul to provide first class support for OpenTelemetry. Possibly being done in a v2 of the proto.

cc @bogdandrutu, @jmacd, @jkwatson, @c24t

The text was updated successfully, but these errors were encountered:

tigrannajaryan · 2020-03-13T20:44:30Z

@MrAlias our approach so far has been that we file individual issues and PRs that fix them, gradually moving towards the desirable final proto. We had more time to do this for traces and it looks like it went well. With metric we had less time and there are issues that need to be addressed.

If you know what the right fixes to the identified issues are it would be great if you could submit corresponding PRs.

major overhaul to provide first class support for OpenTelemetry. Possibly being done in a v2 of the proto.

Is there such a disconnect that requires a major overhaul? I am not a metric spec/SDK contributor, so I cannot assess myself how big is the disconnect.

@open-telemetry/specs-approvers whatever changes we want to do to the proto it is very desirable to do them quickly. I know multiple people who are currently ramping up to write code that assumes the current proto. If we do not finalize the proto quickly we will cause pain by introducing breaking changes.

If we are unable to have the final metric proto version within the next few days it is best that we announce the metric part of the protocol as not ready for implementation.

c24t · 2020-04-09T19:55:57Z

If we are unable to have the final metric proto version within the next few days it is best that we announce the metric part of the protocol as not ready for implementation.

@tigrannajaryan what did you decide on this? As it's written now, the types in the metrics proto are very different than the instruments and aggregations in the spec or OTEPS (e.g. open-telemetry/oteps#88 and open-telemetry/oteps#93).

Unless we expect to change the spec to match the proto this looks to me like it's not ready for implementation.

tigrannajaryan · 2020-04-13T19:20:28Z

@c24t I do not know enough about the metrics to make a call on this. @bogdandrutu @jmacd can you help?

MrAlias · 2020-04-13T20:17:42Z

@tigrannajaryan: I started in on a proposal refactor of the proto, but have not found the time to get it into a presentable state. The rough structure of the proposal is to switch to something closer to the Stackdriver proto model for metrics and update the data points to be more along the lines of instantaneous, delta, and cumulative values. I'm hoping to get this more formalized and will share when it is.

I agree with @c24t in that it might be wasted work to implement at this point as a I think there needs to be a refactor to support the OpenTelemetry metric data.

jmacd · 2020-04-13T22:40:31Z

I look forward to seeing this proposal @MrAlias.
I've updated OTEPs 93 and 96 with the terms "delta", "cumulative", and "instantaneous", fwiw.

tigrannajaryan · 2020-04-14T14:01:14Z

Do note that we have an implementation of metrics proto in Collector already and increasingly large amount of code depends on it because it is the foundation of internal representation.

The longer we wait for the proposal the more code needs to be rewritten. If the proposal doesn't happen quickly them we have a problem, we do not have time rewrite thousands of lines of cod (depending on how big the changes are).

tigrannajaryan · 2020-04-14T14:25:26Z

If this gets delayed more we may end up staying with OpenCensus representation in the Collector, since I don't think we will have time or people to rewrite all that code. The consequence will be that metrics will work significantly slower in the Collector.

jmacd · 2020-04-14T17:58:35Z

I see. Using a data-transfer-object as the primary internal representation for a large server has caused this headache for me in the past. I wonder if we could use a protocol-independent type internally.

…

On Tue, Apr 14, 2020 at 7:25 AM Tigran Najaryan ***@***.***> wrote: If this gets delayed more we may end up staying with OpenCensus representation in the Collector, since I don't think we will have time or people to rewrite all that code. The consequence will be that metrics will work significantly slower in the Collector. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#125 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AA3WFCIFRZT7GR6WIC5AHRTRMRW6PANCNFSM4LHJALVA> .

jkwatson · 2020-04-14T19:47:00Z

@jmacd do you think the current proto definition is robust enough to encode all of the aggregations you envision for OTEP-88 and following?

jmacd · 2020-04-14T20:01:01Z

I've been thinking about this today. :-) The descriptor structure worries me. The fact that MinMaxSumCount has to be expressed as a Summary doesn't worry me. I haven't reached any conclusions, but I started looking into the collector codebase.

…

On Tue, Apr 14, 2020 at 12:47 PM John Watson ***@***.***> wrote: @jmacd <https://github.com/jmacd> do you think the current proto definition is robust enough to encode all of the aggregations you envision for OTEP-88 and following? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#125 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AA3WFCNPSBTZAAZYG6PM4TLRMS4ULANCNFSM4LHJALVA> .

Update metric descriptor to contain information on the data qualities. This includes: - The meaning of the time interval measurements were made over is now described by the `interval` value. - The monotonic nature of the data is captured if known. - If the data values are restricted to a subset of numbers. The `SummaryDataPoint` and `HistogramDataPoint` were merged into a unified `Distribution` message that contains statistics about the observed population of values. Adds support for multiple histogram bucket definitions and adds a linear bucket definition message. Includes an explicit minimum and maximum for a `Distribution` (open-telemetry#122). Removes the exemplar code (open-telemetry#81). This can be added back in as a more generalized case now given the new `Data` structure of the Metrics. But, it has been left for a separate PR. Unifies the common parts of the previous `*DataPoints` into a unified `Data` message. This is not complete. It likely needs better names and it certainly needs comments. Related to open-telemetry#125 Fixes to open-telemetry#81

jmacd · 2020-05-28T08:26:29Z

After some progress in both the set of instruments and the terminology used to explain them, I've come to think that the ideal set of properties for describing an aggregation are:

Collective vs Additive representation
Made synchronously vs asynchronously
Monotonic output.

These can be stored using 3 bits. These bits are independent of whether a unit of metric data is considered as "cumulative" or "delta", the so-called temporal quality. In my thinking, both kinds of data are characterized by two timestamps: the start of the interval, and the end of the interval over which the data applies. A cumulative expression of metric data will re-use the same start time, as in: [T0, T1], [T0, T2], [T0, T3], ...

A delta expression of metric data will not re-use the start time:
[T0, T1], [T1, T2], [T2, T3], ...

bogdandrutu · 2020-07-30T22:28:52Z

@MrAlias this is very hard to track, do you mind splitting in smaller PRs so we make sure every issue is track accordingly.

bogdandrutu · 2020-08-05T23:09:48Z

If the goal of the OTLP is to transport the output of Instrument -> Aggregator, it should be modeled after the output of the Aggregators, aggregations. And while, yes, the Histogram is one of these aggregations, the included histogram kinds are conflated with instrument qualifiers (Gauge, Cumulative).

I think the goal of OTLP should be:

Support Aggregator output - in a generic way and not having 100s of combinations like SumCount, SumCountMin, etc.
There is a request to also support the output for synchronous Instruments -> to a possible Aggregator. This is still under discussion but will be addressed.

bogdandrutu · 2020-08-05T23:47:32Z

@MrAlias is there anything left in this issue, I feel like more than 80-90% of the concerns were addressed. Would like to close this and recommend open a new issue if there is anything else left.

MrAlias · 2020-08-05T23:48:19Z

@MrAlias is there anything left in this issue, I feel like more than 80-90% of the concerns were addressed. Would like to close this and recommend open a new issue if there is anything else left.

SGTM. Was just going to recommend the same 👍

MrAlias mentioned this issue Mar 13, 2020

Add MinMaxSumCount data type #122

Closed

MrAlias mentioned this issue Mar 13, 2020

Update SummaryDataPoint percentile comment #127

Merged

MrAlias mentioned this issue Apr 15, 2020

Proposal for metric proto refactor #137

Closed

3 tasks

This was referenced Apr 24, 2020

Add temporality to MetricDescriptor #140

Merged

Add Monotonic Types #145

Merged

jmacd added the spec:metrics label May 4, 2020

MrAlias mentioned this issue Jun 15, 2020

Remove Temporality and add Structure #158

Closed

c24t mentioned this issue Jun 16, 2020

Integrate Exemplars with Metrics SDK open-telemetry/oteps#113

Merged

jmacd added release:required-for-ga and removed release:required-for-ga labels Jun 29, 2020

jmacd mentioned this issue Jul 1, 2020

OTLP Metrics update #162

Closed

jmacd mentioned this issue Jul 15, 2020

Replace Temporality with Kind; Type with ValueType #168

Closed

bogdandrutu removed the release:required-for-ga label Jul 30, 2020

bogdandrutu added the release:required-for-ga Must be resolved before GA release, or nice to have before GA label Jul 30, 2020

bogdandrutu closed this as completed Aug 5, 2020

andrewhsu added this to Required/Allowed for GA, resolved. in GA Spec Burndown Nov 12, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Metrics Proto Does Not Support the OpenTelemetry Specification #125

Metrics Proto Does Not Support the OpenTelemetry Specification #125

MrAlias commented Mar 13, 2020

tigrannajaryan commented Mar 13, 2020

c24t commented Apr 9, 2020

tigrannajaryan commented Apr 13, 2020

MrAlias commented Apr 13, 2020

jmacd commented Apr 13, 2020

tigrannajaryan commented Apr 14, 2020

tigrannajaryan commented Apr 14, 2020

jmacd commented Apr 14, 2020 via email

jkwatson commented Apr 14, 2020

jmacd commented Apr 14, 2020 via email

jmacd commented May 28, 2020

bogdandrutu commented Jul 30, 2020

bogdandrutu commented Aug 5, 2020

bogdandrutu commented Aug 5, 2020

MrAlias commented Aug 5, 2020

Metrics Proto Does Not Support the OpenTelemetry Specification #125

Metrics Proto Does Not Support the OpenTelemetry Specification #125

Comments

MrAlias commented Mar 13, 2020

Identified Issues

First Class Support For Min and Max Values of a MinMaxSumCount Aggregation

HistogramValue.Bucket.Exemplar is inadequate

Instrumentation Instead of Aggregation

Nebulously Defined Need to be Compatible with OpenMetrics

Steps Forward

tigrannajaryan commented Mar 13, 2020

c24t commented Apr 9, 2020

tigrannajaryan commented Apr 13, 2020

MrAlias commented Apr 13, 2020

jmacd commented Apr 13, 2020

tigrannajaryan commented Apr 14, 2020

tigrannajaryan commented Apr 14, 2020

jmacd commented Apr 14, 2020 via email

jkwatson commented Apr 14, 2020

jmacd commented Apr 14, 2020 via email

jmacd commented May 28, 2020

bogdandrutu commented Jul 30, 2020

bogdandrutu commented Aug 5, 2020

bogdandrutu commented Aug 5, 2020

MrAlias commented Aug 5, 2020