Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate Exemplars with Metrics SDK #113

Merged
merged 8 commits into from
Jul 10, 2020

Conversation

cnnradams
Copy link
Member

This OTEP defines exemplars within OpenTelemetry and specifies behaviour for exemplars for the default set of aggregations: Proposal doc

@MrAlias
Copy link
Contributor

MrAlias commented Jun 4, 2020

Relevant to this: open-telemetry/opentelemetry-proto#81

@arminru arminru added the metrics Relates to the Metrics API/SDK label Jun 5, 2020
@jmacd
Copy link
Contributor

jmacd commented Jun 5, 2020

This is great. I added a bunch of comments describing how I think we can leave room for statistical interpretation based on exemplars in the future.

@jmacd
Copy link
Contributor

jmacd commented Jun 5, 2020

Another related remark: In the extreme case of supporting one exemplar per metric event, can the protocol degenerate into the raw structured event encoding that has been requested?

open-telemetry/opentelemetry-specification#617

I hadn't thought about this before making comments on the document above, but I think it might be possible to combine these ideas. If the value of an aggregation can optionally include raw values in addition to another aggregation, then raw values are the exemplars. @mxiamxia ^^^

@mxiamxia
Copy link
Member

mxiamxia commented Jun 5, 2020

Another related remark: In the extreme case of supporting one exemplar per metric event, can the protocol degenerate into the raw structured event encoding that has been requested?

open-telemetry/opentelemetry-specification#617

I hadn't thought about this before making comments on the document above, but I think it might be possible to combine these ideas. If the value of an aggregation can optionally include raw values in addition to another aggregation, then raw values are the exemplars. @mxiamxia ^^^

Hi @jmacd, If I get you right, we can store the raw event values as Exemplar along with the aggregations. For example, for any aggregated metrics data point, we can attach all the raw data as list of Exemplars (by preference in SDK) within a collection interval to aggregated data point. So we can have both raw data values and the original purpose of Exemplar, is it right? Thanks.

Exact
The Exact aggregator does not aggregate measurements. If exemplars are enabled, implementations may attach a separate exemplar to each measurement in an exact aggregation including the trace context and full set of labels.

Copy link
Contributor

@MrAlias MrAlias left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great to me.

@jmacd I don't see you comments?

text/metrics/0113-exemplars.md Outdated Show resolved Hide resolved
text/metrics/0113-exemplars.md Outdated Show resolved Hide resolved
text/metrics/0113-exemplars.md Outdated Show resolved Hide resolved
text/metrics/0113-exemplars.md Outdated Show resolved Hide resolved
text/metrics/0113-exemplars.md Show resolved Hide resolved
text/metrics/0113-exemplars.md Show resolved Hide resolved
text/metrics/0113-exemplars.md Outdated Show resolved Hide resolved
text/metrics/0113-exemplars.md Outdated Show resolved Hide resolved
@MrAlias
Copy link
Contributor

MrAlias commented Jun 5, 2020

@jmacd I don't see you comments?

Oh, 🤦, they're in the linked Google doc.

@MrAlias
Copy link
Contributor

MrAlias commented Jun 5, 2020

Another related remark: In the extreme case of supporting one exemplar per metric event, can the protocol degenerate into the raw structured event encoding that has been requested?

open-telemetry/opentelemetry-specification#617

I hadn't thought about this before making comments on the document above, but I think it might be possible to combine these ideas. If the value of an aggregation can optionally include raw values in addition to another aggregation, then raw values are the exemplars. @mxiamxia ^^^

Agreed, I was thinking the same about merging raw metric exports and this message type.

What about calling the exemplar message Raw or RawValues?

@cnnradams
Copy link
Member Author

Another related remark: In the extreme case of supporting one exemplar per metric event, can the protocol degenerate into the raw structured event encoding that has been requested?
open-telemetry/opentelemetry-specification#617
I hadn't thought about this before making comments on the document above, but I think it might be possible to combine these ideas. If the value of an aggregation can optionally include raw values in addition to another aggregation, then raw values are the exemplars. @mxiamxia ^^^

Agreed, I was thinking the same about merging raw metric exports and this message type.

What about calling the exemplar message Raw or RawValues?

If I'm understanding this correctly, in any case where we export raw measurements it would use the RawValue type, including for exemplars? That sounds good to me :)

Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com>
text/metrics/0113-exemplars.md Outdated Show resolved Hide resolved
text/metrics/0113-exemplars.md Show resolved Hide resolved
text/metrics/0113-exemplars.md Show resolved Hide resolved
@jmacd
Copy link
Contributor

jmacd commented Jun 12, 2020

As discussed in the SIG call today, one of my latest concerns here is that the proposed structure for an Exemplar is nearly identical to the structure of Int64DataPoint and DoubleDataPoint. This is one reason I asked if span_id and trace_id could just be labels, because if you treat those as labels, there's only one new field that distinguishes an exemplar from an ordinary data point: the sample_count.

I'm trying to think how I'd like to see these unified. This comment #113 (comment) brings me back to an earlier conversation about how to represent MinMaxSumCount aggregation in OTLP. One answer was to use a Summary with quantiles at 0 and 100%. But another answer would be to treat these four values as four data points: two of them are exemplars (MIN and MAX) and two of them are aggregations (COUNT and SUM).

This has me thinking that we're missing some field in the OTLP structure to say the kind of aggregation or exemplar selection that produced data point. I'm starting to think that with two new fields in the Int64DataPoint and DoubleDataPoint structures, we could unify these ideas.

The structure would look like:

// Int64DataPoint is a single data point in a timeseries that describes the time-varying
// values of a int64 metric.
message Int64DataPoint {
  // The set of labels that uniquely identify this point (in addition to any common labels in 
  // the enclosing Metric). This may include "traceid" and "spanid" labels for correlation with
  // Tracing.
  repeated opentelemetry.proto.common.v1.StringKeyValue labels = 1;

  // If this is an aggregation, not a raw point, the timestamp of the beginning of the interval.
  // For Deltas: This will be the same as a prior data point's `time_unix_nano`
  // For Cumulatives: This will be the same as the prior data point's `start_time_unix_nano`
  fixed64 start_time_unix_nano = 2;

  // If this is a raw point, the timestamp when the point was generated.
  // If this is an aggregation, this is the timestamp at the end of the interval.
  fixed64 time_unix_nano = 3;

  // value itself.
  int64 value = 4;

  // If this is a raw point, this indicates the representivity
  double sample_count = 5;

  // THE KIND OF POINT IS NEW HERE
  PointKind point_kind = 6;
}

enum PointKind {
   // AGGREGATIONS

   SUM
   COUNT

   // EXEMPLARS

   // Sample is a statistically meaningful exemplar, its sample_count is set.
   SAMPLE
   // MIN and MAX are not statistical samples, they have sample_count unset.
   MIN
   MAX
}

I wrote "(in addition to any common labels in the enclosing Metric)" about labels in these points, because I think it's likely that many of the entries for a Metric have some labels in common. The scenario I'm thinking of is where you've configured in-process aggregation by two label keys ("A" and "B") so that a number of aggregation values will have two labels in common. However, the exemplars will have all their labels, meaning they will repeat the ones used in the aggregation. It suggests a structure like in the comment here: #113 (comment)

Metric {
  Descriptor
  repeated Labeled {
     // These are common labels
     repeated Labels string
     // Some of these points will have no extra labels, because they are aggregations
     // and some of these points will have extra labels, because they are exemplars associated
     // with the common labels used in the aggregations.
     repeated Int64DataPoint int64_data_points
     ...
  }
}

@jmacd
Copy link
Contributor

jmacd commented Jun 12, 2020

Regarding the above comment, I'd like to request input from people that have been involved with the Metric Views API both here as well as in OpenCensus. I do not see any way in the current OTLP protocol to encode information about which aggregation has been applied. In today's protocol how would you know the difference between a Sum and a LastValue aggregation? Possibly this is decided based on knowing whether the descriptor is a "grouping" or "adding" kind (which distinguishes Sum and LastValue), but this won't help us if we want to distinguish "LastValue" vs "RandomValue".

Is something missing from the OTLP protocol? How does the suggestion above (#113 (comment)) look? @bogdandrutu @c24t @rghetia @cijothomas @lzchen

Copy link
Member

@c24t c24t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You've got my LGTM on the doc already, just making it official here.

text/metrics/0113-exemplars.md Outdated Show resolved Hide resolved
@c24t
Copy link
Member

c24t commented Jun 16, 2020

I do not see any way in the current OTLP protocol to encode information about which aggregation has been applied. In today's protocol how would you know the difference between a Sum and a LastValue aggregation?

As I understand it, this is a known limitation of today's protocol and something we'll need to change to support nonstandard aggregations, with or without exemplars. The discussion from open-telemetry/opentelemetry-proto#81, open-telemetry/opentelemetry-proto#125, and open-telemetry/opentelemetry-proto#137 is relevant here.

This has me thinking that we're missing some field in the OTLP structure to say the kind of aggregation or exemplar selection that produced data point

I think we're missing something too, but I don't think the right solution is to overload Int64DataPoint/DoubleDataPoint. The structure is close to what we'd need for exemplars, but the semantics are different enough that I think it would cause more harm than good to shoehorn both kinds of data into the same type. I think we should prefer more types over interpreting the same type differently depending on point_kind.

It looks like there are a lot of questions about the implementation that need to be solved in the proto (Do we still need Histogram and Summary types? Should these types represent both raw and aggregated data? Do we need separate types for each aggregator, and if so how do we handle nonstandard aggregations?), but there's enough support for this description of exemplars to move from the OTEP to the spec.

@jmacd
Copy link
Contributor

jmacd commented Jun 17, 2020

I think this is good to submit. My last comment on a potential restructuring that would avoid duplication of labels can be ignored, I think: #113 (comment)

I discussed the matter with several people who agreed that in the long run, OTLP should have a way to avoid duplication and repetition of label sets in general and that we can ignore this question for now.

Co-authored-by: Chris Kleinknecht <libc@google.com>
@jmacd
Copy link
Contributor

jmacd commented Jun 19, 2020

@bogdandrutu It's the same story with this PR. We have enough metrics approvers' approvals.

Copy link
Member

@cijothomas cijothomas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!
left a small question, not a blocker.

@jmacd
Copy link
Contributor

jmacd commented Jun 22, 2020

This has enough metrics reviewer approvals. Please merge. We have to keep interns unblocked!


## Definition

Exemplars are example data points for aggregated data. They provide specific context to otherwise general aggregations. For histogram-type metrics, exemplars are points associated with each bucket in the histogram giving an example of what was aggregated into the bucket. Exemplars are augmented beyond just measurements with references to the sampled trace where the measurement was recorded and labels that were attached to the measurement.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exemplars are example data points for aggregated data.

Are they this "generic" thing, or are they "traces"? The proto schema below suggests the latter. For example, could I store "customer id" as exemplar, so that I could answer the question "which sample customer IDs have latencies in this bucket"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The RawValue representation is a generic way to represent sampled metric events. There will some SDK-specific/custom selection logic that may decide to select only exemplars that have trace context, or they can decide to focus on the distribution of customer IDs. The customer ID would be represented by a label value.

When the aggregator is a histogram:
The SDK can select samples using fixed-size uniform selection on a per-bucket basis, or it can select items probabilistically so as to produce an expected number of exemplars per bucket that is equal (the latter is likely to have better coverage in the case where there are empty buckets--this can be accomplished using Weighted Sampling and inverse-probability weights, for example).

Let's suppose you configure exemplar selection to choose 100 exemplars per bucket per collection period. If the selection is unbiased and the sample_count fields are accurate, you will be able to summarize the contribution to each bucket by up to 100 customers.

Suppose it's a Counter producing a Sum aggregation, instead of a histogram. You could use exemplars selected from the Counter to summarize the contribution to a sum by customer ID. There are lots of ways to sample, and I believe this representation will support a large number of useful approaches.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Customer ID as a label value would kill most metric backends. To me that's the whole point of exemplars, to allow associating samples of high-cardinality values with metrics. I am fine if we limit this high-cardinality dimension to trace IDs for now, but I am not seeing a "generic" solution here that would support exemplars on other dimensions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the comment above had the same concern as you? Would adding correlation context as an attribute on RawValue that can have the customer ID solve the problem?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SDK has built-in support for aggregation so that high-cardinality labels can be eliminated before they reach most metric backends. The Sum, Histogram, or Summary that you export can be aggregated so that customer_id does not appear in the aggregation value.

Exemplars selected from the same series of events (that were summarized without customer_id) can include the customer_id, and the exemplars may be used to approximate the distribution of customer_id and other dimensions that were aggregated away in the Sum, Histogram or Summary value. One of the nice properties of the approach described here is that by limiting the number of exemplars, we limit cardinality reported in a single collection interval. For example, you could select 100 exemplars and even if there are 1000 actual customer_ids, you will collect at most 100 distinct values, and if chosen probabilistically, we can expect to recover the customer_ids that were most representative of the actual distribution (i.e., the "heavy hitters", the top of the distribution).

I want to emphasize that the API and the Protocol should not discourage the use of high-cardinality metrics. Given that I see exemplars as exactly the tool for addressing high cardinality, I'm confused by:

I am fine if we limit this high-cardinality dimension to trace IDs for now

What do you think we should restrict?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue I have with this schema is that it makes no distinction between regular label dimensions (which should survive aggregation) and the exemplar dimensions like customer-id. Only trace id is explicitly separated as exemplar dimensions. That makes it very easy for a user to shoot themselves in the foot and send an explosion of dimensions to the backend. The only way to avoid it is by carefully defining custom aggregation rules in the SDK and explicitly defining which labels should be treated as time series dimensions vs. exemplar attributes. While it minimizes the API surface, I feel that it makes the API more dangerous to use. Why not allow specifying exemplar labels explicitly from the beginning, and keep them separate from regular labels?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying. I understand the concern, now.

By the way, an earlier draft of the metrics API allowed the application writer to recommend aggregation dimensions, by the name "Recommended Keys". It was removed: open-telemetry/opentelemetry-specification#463. The reason these keys were recommended is that we do not believe the author of the code knows which labels the system or the operator wants to monitor. If we ask the developer decide which dimensions are for aggregation and which are for exemplars, we make a semantic distinction out of a performance limitation (and not a universal one, as far as I know).

There is a practical reason to support arbitrary labels and deal with them through configuration: this is the natural thing to do when creating Metric events from Spans. Span attributes simply become Metric labels. We are adding a semantic convention to cover duration measurements: open-telemetry/opentelemetry-specification#657

The span-to-metrics issue is discussed here: open-telemetry/opentelemetry-specification#381

One way to address your concern would be to set the default to aggregate over zero dimensions, so that all labels are exemplar labels unless configured otherwise.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bogdandrutu I would like your opinion on this topic. We introduced RecommendedKeys() to address a perceived need in Prometheus, since Prometheus clients actually enforce pre-declared label keys. We discovered that the Prometheus protocol does not have any such restriction, which made it appealing to remove a feature. In the (dog)statsd world, it's common to add labels as needed. In modern terminology, we had created (proposed) Metrics Processors named "defaultkeys" that would use the developer-provided recommended keys, and named "ungrouped" that would use all the keys when exporting metrics. Removing recommended keys brought us back to a single basic metrics processor.

With this proposal, we begin to see a "Sampler API" for metrics, that is one that takes a full set of labels, applies a sampling decision (whether to select an exemplar or not) and then returns the set of labels to use for aggregation

If we have a choice between:

(1) asking the user to choose which labels are significant for aggregation and which are not
(2) making it really easy to configure which labels are used for aggregation

I would absolutely prefer the second choice--whether aggregation is configured by a dynamic configuration API, by a static configuration API, or by hard-coding a View in your main() function, any of these should be viable and relatively easy, and all of these are more appealing to me than asking the user to distinguish two kinds of label.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bogdandrutu Thanks for merging, but I think we should capture this discussion or at least address the question.


## Open questions

- Exemplars usually refer to a span in a sampled trace. While using the collector to perform tail-sampling, the sampling decision may be deferred until after the metric would be exported. How do we create exemplars in this case?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we address this question from the resolved discussion?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't feel that this is the place to describe a fancy SDK approach to this problem. This question leads to arbitrarily complex approaches that are also found in the discussion about tail sampling itself. How should we decide to propagate a trace-is-sampled bit in-band when making child spans during a span lifetime? It's almost the same question.

A simple approach would be to maintain a per-span sample of metric events and buffer metric data until the span ends.

Another approach would use the statistics of the spans that are being selected by the tail sampler to form an unequal probability sampling scheme. Select sample metric events that are likely to be associated with spans that match the tail-sampling decision. E.g., if tail latency is used to select exemplars, and a high correlation is observed between latency and label X, then use label X to boost sample weight on metric events. This leads to a speculative approach where you try to choose exemplars that will have interesting traces.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cnnradams Would you be willing to add an answer to this question? If head sampling, the logic for selecting trace contexts that are also being sampled is simple. If tail sampling, the logic for selecting metric samples has to be coordinated with tracing, delayed, or somehow speculative--and this decision is practically the same as deciding to what to tell your child in a span before the span is finished.

Copy link
Member Author

@cnnradams cnnradams Jul 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed this 😬

To the depth that this OTEP goes, yes, this question is answered by "either the tail sampler needs to pick traces with exemplar choices in mind, or exemplars will need to be picked without a guarantee that they will have a trace". But the actual details of this still need to be worked out as far as I'm aware. I can't really mark this as answered now that its merged, so this will have to do 🤷

Another approach would use the statistics of the spans that are being selected by the tail sampler to form an unequal probability sampling scheme.

How would you have knowledge of the spans that were chosen to be sampled when that decision was made in a different process without your input?


## Definition

Exemplars are example data points for aggregated data. They provide specific context to otherwise general aggregations. For histogram-type metrics, exemplars are points associated with each bucket in the histogram giving an example of what was aggregated into the bucket. Exemplars are augmented beyond just measurements with references to the sampled trace where the measurement was recorded and labels that were attached to the measurement.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The RawValue representation is a generic way to represent sampled metric events. There will some SDK-specific/custom selection logic that may decide to select only exemplars that have trace context, or they can decide to focus on the distribution of customer IDs. The customer ID would be represented by a label value.

When the aggregator is a histogram:
The SDK can select samples using fixed-size uniform selection on a per-bucket basis, or it can select items probabilistically so as to produce an expected number of exemplars per bucket that is equal (the latter is likely to have better coverage in the case where there are empty buckets--this can be accomplished using Weighted Sampling and inverse-probability weights, for example).

Let's suppose you configure exemplar selection to choose 100 exemplars per bucket per collection period. If the selection is unbiased and the sample_count fields are accurate, you will be able to summarize the contribution to each bucket by up to 100 customers.

Suppose it's a Counter producing a Sum aggregation, instead of a histogram. You could use exemplars selected from the Counter to summarize the contribution to a sum by customer ID. There are lots of ways to sample, and I believe this representation will support a large number of useful approaches.

An exemplar is defined as:

```
message RawValue {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The WIP open-telemetry/opentelemetry-proto#162 specifies that RawValue messages may be used in two ways.

  1. As exemplars that add additional information to SCALAR, HISTOGRAM, and SUMMARY data points.
  2. As raw data in its own right.

The SDK spec discusses three types of aggregation that can be represented as a scalar value: Sum, LastValue, and Exact. The exact representation would use RawValues with sample_count == 1.


## Open questions

- Exemplars usually refer to a span in a sampled trace. While using the collector to perform tail-sampling, the sampling decision may be deferred until after the metric would be exported. How do we create exemplars in this case?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't feel that this is the place to describe a fancy SDK approach to this problem. This question leads to arbitrarily complex approaches that are also found in the discussion about tail sampling itself. How should we decide to propagate a trace-is-sampled bit in-band when making child spans during a span lifetime? It's almost the same question.

A simple approach would be to maintain a per-span sample of metric events and buffer metric data until the span ends.

Another approach would use the statistics of the spans that are being selected by the tail sampler to form an unequal probability sampling scheme. Select sample metric events that are likely to be associated with spans that match the tail-sampling decision. E.g., if tail latency is used to select exemplars, and a high correlation is observed between latency and label X, then use label X to boost sample weight on metric events. This leads to a speculative approach where you try to choose exemplars that will have interesting traces.

@bogdandrutu bogdandrutu requested review from a team as code owners July 7, 2020 15:36
@bogdandrutu
Copy link
Member

@jmacd can I merge this or you need answers/actions for your comments?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
metrics Relates to the Metrics API/SDK
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet