Decouple Number Type from Point Kind in OTLP protocol #264

Closed
victlu opened this issue Feb 17, 2021 · 25 comments · Fixed by #278
Comments

@victlu
Contributor

victlu commented Feb 17, 2021

The OTLP message Metric has the following definition for instrument types...

spec

  oneof data {
    IntGauge int_gauge = 4;
    DoubleGauge double_gauge = 5;
    IntSum int_sum = 6;
    DoubleSum double_sum = 7;
    IntHistogram int_histogram = 8;
    DoubleHistogram double_histogram = 9;
    DoubleSummary double_summary = 11;
  }

I propose we only define Instrument Types (i.e. Gauge, Sum, etc...) at the message Metric level.

The data type (i.e. int, double, etc...) should be lower level, maybe at the message DataPoint level.

i.e.

message DataPoint {
  // ...

  // The value itself.
  // sfixed64 value = 4;     <= replaced by the oneof below

  oneof value {
     sfixed64 int64_value = 4;
     double double_value = 5;
  }
}
@rakyll

rakyll commented Feb 22, 2021

This would require at least one data point to arrive in order for exporters to create metric descriptors in the backend. What are the use cases where the current data model is not working for you?

@Oberon00 Oberon00 transferred this issue from open-telemetry/opentelemetry-specification Feb 23, 2021
@jmacd
Contributor

jmacd commented Feb 23, 2021

I've proposed something similar, #172, but at this point I'm happy with what we have.

@jmacd jmacd changed the title Decouple Data type from Instrument Kind in OTLP protocol Decouple Number Type from Point Kind in OTLP protocol Feb 23, 2021
@bogdandrutu
Member

@victlu the problem with this approach is that systems that consider the data point's number type part of a metric's identity will no longer be able to express that. For example, Stackdriver in GCP does this: it considers an int sum different from a double sum, and it needs to know up front which one a metric will contain.

@victlu
Contributor Author

victlu commented Feb 24, 2021

@bogdandrutu, In your case, I would think it serves them better to explicitly name their metrics accordingly (since they want two time series) rather than relying on the data type.

I think this question is related to #1366: how do we consider the "name" to be unique in identifying a time series?

@victlu
Contributor Author

victlu commented Feb 24, 2021

This would require at least one data point to arrive in order for exporters to create metric descriptors in the backend. What are the use cases where the current data model is not working for you?

@rakyll Why can't metric descriptors be created without the data type? Or are you saying a descriptor MUST include the data type? In which case, that's the argument I'm making here...

IMHO, metric backend systems ultimately show/graph/alert on/store numeric time series. For performance and efficiency, they want to use the most efficient representation of the "number", so promotion of data types is common, especially for storage considerations. The important thing is that it's just a number, regardless of whether it's transmitted as a byte, short, ushort, int, uint, long, ulong, float, or double.

@bogdandrutu
Member

Or are you saying a descriptor MUST include the data type? In which case, that's the argument I'm making here...

If the type is included in the metric descriptor/name, what is the benefit of what you suggested? Your suggestion also likely adds an extra check: you need to verify that the type in the descriptor matches the type in the points.

On top of everything, it adds extra allocations when serializing/deserializing from proto bytes.

@rakyll

rakyll commented Feb 24, 2021

@victlu Some backends, like Stackdriver, see the type as an integral part of the metric. See the value_type at https://cloud.google.com/monitoring/api/ref_v3/rpc/google.api#google.api.MetricDescriptor. This model gives us an easy, cost-effective way of identifying the data type at metric creation. These backends will reject the write if you send a floating-point number to an integer metric, so it would require validation of the data points in the exporter, as @bogdandrutu suggests.
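For reference, the relevant part of that descriptor looks roughly like this (an abridged sketch from memory of the linked google.api.MetricDescriptor reference; treat field numbers and enum values as illustrative, and consult the linked docs for the authoritative definition):

```proto
// Abridged sketch of google.api.MetricDescriptor.
message MetricDescriptor {
  enum ValueType {
    VALUE_TYPE_UNSPECIFIED = 0;
    BOOL = 1;
    INT64 = 2;
    DOUBLE = 3;
    STRING = 4;
    DISTRIBUTION = 5;
    MONEY = 6;
  }

  // The number type is part of the descriptor, and therefore part of the
  // metric's identity in this backend.
  ValueType value_type = 4;
  // ...
}
```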

@victlu
Contributor Author

victlu commented Mar 2, 2021

I am trying to understand, but I have a few questions...

From the last few responses, it seems we are specializing the OTLP protocol around a specific vendor's backend implementation. I assume we would do this for performance? We should call out which backend systems we are optimizing for in this case.

What happens if a vendor does not need to separate ints and doubles? It would then need to merge them back together, which does not seem performant either.

I would think the spec should be "higher-level", conveying the meaning and semantics of metrics. The data type of data points seems more like an implementation detail.

Looking at a few existing client APIs (e.g. Prometheus, Micrometer), I don't see Counters/Gauges/Histograms/etc. decorated with a data type (i.e. IntCounter/DoubleCounter/etc.), or they just take a double to cover both int and double. What should we do here?

How will this work if I'm using a dynamically typed language (e.g. JavaScript)? I assume I would need to separate out ints/doubles/etc. into different metrics?

I'm also not suggesting every data point needs to be type-less. I just don't see the data type belonging at the topmost level alongside semantics like Sum/Gauge/Histogram/etc. Maybe we have a oneof of IntTimeSeries/DoubleTimeSeries under each of Sum/Gauge/Histogram/etc.
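That last alternative might look something like this (a hypothetical sketch: IntTimeSeries and DoubleTimeSeries are illustrative names, not existing OTLP messages, and the field numbers are arbitrary):

```proto
// Hypothetical: instrument kind stays at the top level, and the number
// type is pushed down into a nested oneof per instrument.
message Sum {
  AggregationTemporality aggregation_temporality = 1;
  bool is_monotonic = 2;

  oneof points {
    IntTimeSeries int_time_series = 3;
    DoubleTimeSeries double_time_series = 4;
  }
}
```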

@victlu
Contributor Author

victlu commented Mar 2, 2021

I propose we change to the following...

message Metric {
  // ...

  oneof data {
    Gauge gauge = 4;
    Sum sum = 5;
    Histogram histogram = 6;
    Summary summary = 7;
  }
}

message Gauge {
  repeated IntDataPoint int_data_points = 1;
  repeated DoubleDataPoint double_data_points = 2;
}

message Sum {
  // aggregation_temporality describes if the aggregator reports delta changes
  // since last report time, or cumulative changes since a fixed start time.
  AggregationTemporality aggregation_temporality = 2;

  // If "true" means that the sum is monotonic.
  bool is_monotonic = 3;

  repeated IntDataPoint int_data_points = 4;
  repeated DoubleDataPoint double_data_points = 5;
}

message Histogram {
  // aggregation_temporality describes if the aggregator reports delta changes
  // since last report time, or cumulative changes since a fixed start time.
  AggregationTemporality aggregation_temporality = 2;

  repeated IntDataPoint int_data_points = 3;
  repeated DoubleDataPoint double_data_points = 4;
}

message Summary {
  repeated DoubleDataPoint double_data_points = 1;
}

@bogdandrutu
Member

For Summary see #269 and for Histogram see #270

@bogdandrutu
Member

bogdandrutu commented Mar 3, 2021

@victlu I kind of like your proposal, but I would like to understand how you see this fixing an issue. Is this just a nice cosmetic improvement, or does it solve a real problem? :)

Update:
@victlu how do we deal with the case where both repeated fields contain values?

@bogdandrutu
Member

@jmacd what do you think about this readability improvement, #264 (comment)?

@victlu
Contributor Author

victlu commented Mar 4, 2021

The change I'm trying to effect is to decouple the semantics of an instrument (i.e. Counter vs. Gauge vs. Histogram) from the details of the data type (byte/int/long/double/float/etc.). I think it may have these beneficial consequences:

  • Helps clarify the "identity" of a metric/data collection: it is based on semantics, not on data type.
  • Can translate to a simpler API surface (and resolves the conflict for dynamically typed languages).
  • Allows easier future growth to new data-type time series (e.g. bytes, BigInts, or sketches) while retaining the scope and semantics of the instrument, potentially allowing data to map to a more optimal encoding.
  • Allows vendors the freedom to merge or keep separate these time series (answering @bogdandrutu's question about both repeated fields containing values).
  • Minimizes code/performance changes.

@victlu
Contributor Author

victlu commented Mar 4, 2021

Separately, I think we should group by labels before the data points to minimize buffer size. I'll file a separate issue for this.

@jmacd
Contributor

jmacd commented Mar 4, 2021

@jmacd what do you think about this improvement in readability?

I like it, but it raises the question you asked.

@victlu how do we deal with a case where both repeated fields contain values?

That's a tough one. We discussed in yesterday's data model meeting that SDKs should not generate conflicting types (loosely defined), and we discussed that we would like the collector to pass-through conflicting types (again loosely defined) as if they were separate metrics. This appearance of both double and integer repeated fields makes it possible to have a single Metric that looks "not separate", in the sense that it contains multiple number types. I think if we allow what @victlu proposes, we should firmly state that OTel welcomes you to mix integer and double metrics--that this is not a semantic conflict--just that we would like you not to do so in the same SDK.

Naturally we could use a oneof to prevent mixing integer and double values. We did a lot of benchmarking last summer to talk ourselves out of oneof messages as much as possible.

Even when we allow mixing repeated integer- with repeated double-values, it will be difficult to work with this mixture of values without combining and interleaving them (sort by time) into an intermediate representation, which probably brings up the problem discussed next.
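A point-level oneof along those lines might look like this (a hypothetical sketch: the message and field names here are illustrative, not a committed design):

```proto
// Hypothetical: a single data point message whose number type is a oneof,
// so integer and double values cannot be mixed within one point.
message NumberDataPoint {
  // ... timestamps, labels, exemplars ...

  oneof value {
    sfixed64 as_int = 4;
    double as_double = 5;
  }
}
```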

Separately, I think we should group by Labels first before we have DataPoints to minimize on buffer size. I'll file a separate issue for this.

Yes, let's keep that separate. I recall a historical conversation, but would prefer not to dig up old threads. @bogdandrutu, do you have thoughts? I believe we were avoiding additional message layers where possible. The best argument in favor of the current design that I remember is that an SDK should never output more than one point per metric and label set per collection period. So, under a simple collection model, the extra layer for the label set will always have exactly one entry in the encoding @victlu proposes. I haven't thought through the implications of this in the collector.

I think the greater win would be to move labels and label sets into separate regions in the Export request, with a more-compact encoding.
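Such a compact encoding might, for example, intern label strings once per request (purely a hypothetical sketch; none of these message or field names exist in OTLP):

```proto
// Hypothetical: deduplicate label keys and values into a per-request
// string table and reference them by index from each series.
message ExportMetricsServiceRequest {
  repeated string string_table = 1;  // interned label keys and values
  // ... resource metrics referencing the table ...
}

message LabeledSeries {
  // Pairs of (key index, value index) into string_table.
  repeated uint32 label_refs = 1;
  repeated DataPoint points = 2;
}
```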

@jsuereth
Contributor

jsuereth commented Mar 4, 2021

@victlu When you first mentioned this I was skeptical, but I do like the simplicity proposed here. My only concern is the amount of churn this could cause right now. Ideally this should be done via a "deprecation" and "transition" period, as we DO have implementations that folks are trying to use with metrics.

Also, to @jmacd and @bogdandrutu: IIUC, "oneof" is slow in the Go implementation, right? That's not a fundamental limitation of protocol buffers AFAIK; it's basically the same as "optional" but with some runtime checks.

@jmacd
Contributor

jmacd commented Mar 5, 2021

Like @jsuereth I'm worried about churn. I'd like to connect the question posed here, about reducing the number of types inside the Metric oneof with the question @bogdandrutu poses about histograms in #259. I propose we adopt the same approach for both of these, whichever it is.

On the one hand, we can have just the four semantically distinct types (Counter, Gauge, Histogram, Summary) and push all their variation into the types themselves. This is Victor's proposal for Gauge/Sum, and this is my proposal for Histogram in #272.

On the other hand, we can go with a distinct type for each variation in the Metric oneof. Right now, we have a distinct type for IntGauge and DoubleGauge, for example, and the question is whether we should keep that approach or not. If we keep that approach, then @bogdandrutu's point in #259 should be addressed and each distinct form of Histogram buckets gets its own type.

@jsuereth
Contributor

jsuereth commented Mar 5, 2021

Agree w/ @jmacd on consistency between histogram and metric types. If we go with #272 then we should also take this proposal/direction (albeit I'd like to see a smooth transition accompany the proposal).

@victlu If you want to take a crack at what migration from existing OTLP to this proposal looks like, that'd be good. This will disrupt a lot of the ecosystem in OTel and I know there's already some users of metrics where such a large change could break them if it's not done smoothly. I realize metrics isn't marked "stable" but users are users. If you want help thinking through a smooth-transition plan, I'm happy to take time to brainstorm with you.

@victlu
Contributor Author

victlu commented Mar 5, 2021

I'll take a crack at it. @jsuereth I'll reach out on slack as I am sure I'm not up-to-speed on all the intricacies of this task.

@hdost
Contributor

hdost commented Mar 5, 2021

I'd agree with an approach similar to #272: add the new form with simpler names.
I assume there would need to be changes in the OTLP exporters. Should there be a related ticket for the "supported" languages? Which of those are there?
** Maybe a config option added to the exporter in order to switch them in the short term, dual write, etc.
How long of a transition period would be needed? If there's supposed to be a stable data model (https://github.com/open-telemetry/opentelemetry-specification/milestone/16), 1 month seems a little tight given the number of open areas.

Edit:
I have created issue #275 to discuss this further since it seems to span a few issues.

@victlu
Contributor Author

victlu commented Mar 9, 2021

Proposal for smooth transition for this (and other) protocol changes.

Migrating OTLP protocol

Strategy

To deliver OTLP protocol changes, the following proposed steps should be followed
to achieve a smooth transition/migration for all affected components.

  1. Apply changes to OTLP protocol

    • Include new OTLP messages (must be "additive"; expect no incompatibilities
      with existing components)
    • Comment/mark old OTLP messages "deprecated" with an expected transition date
    • Release changes
      • Coordinate multiple changes into appropriate batches as necessary
  2. Update and release all dependent components

    • Components remove support for old messages/protocol
    • Components support new messages/protocol
  3. Wait until after the transition date

    • Review that all affected components are updated
    • Remove old OTLP messages marked "deprecated"
    • Release final changes
  4. Repeat as necessary

For issue #264

Mark old messages as "deprecated" and add new messages...

oneof data {
    // DEPRECATED BY 4/15/2021
    IntGauge int_gauge = 4;
    DoubleGauge double_gauge = 5;
    IntSum int_sum = 6;
    DoubleSum double_sum = 7;
    IntHistogram int_histogram = 8;
    DoubleHistogram double_histogram = 9;
    DoubleSummary double_summary = 11;

    // New messages
    Gauge gauge = 12;
    Sum sum = 13;
    Summary summary = 14;
    Histogram histogram = 15;
}

Add new messages...

message Gauge {
  repeated IntDataPoint int_data_points = 1;
  repeated DoubleDataPoint double_data_points = 2;
}

message Sum {
  // aggregation_temporality describes if the aggregator reports delta changes
  // since last report time, or cumulative changes since a fixed start time.
  AggregationTemporality aggregation_temporality = 2;

  // If "true" means that the sum is monotonic.
  bool is_monotonic = 3;

  repeated IntDataPoint int_data_points = 4;
  repeated DoubleDataPoint double_data_points = 5;
}

message Histogram {
  // aggregation_temporality describes if the aggregator reports delta changes
  // since last report time, or cumulative changes since a fixed start time.
  AggregationTemporality aggregation_temporality = 2;

  repeated IntDataPoint int_data_points = 3;
  repeated DoubleDataPoint double_data_points = 4;
}

message Summary {
  repeated DoubleDataPoint double_data_points = 1;
}

@jsuereth
Copy link
Contributor

jsuereth commented Mar 9, 2021

@victlu I think the transition is a bit more dramatic here. Remember that OTLP is an in-flight protocol and deprecations have a slow rollout between producers and consumers, so you need to call that out in the transition. I.e., what you defined may be a little too aggressive/churny for end users.

Here's an abridged update to what you had. I think maybe this deserves a full OTEP, but for the purposes of your change we just need to agree on the general outline.

Migrating OTLP protocol

Strategy

To deliver OTLP protocol changes, the following proposed steps should be followed
to achieve a smooth transition/migration for all affected components.

  1. Apply changes to OTLP protocol
    • Include new OTLP messages (must be "additive"; expect no incompatibilities
      with existing components)
      • For the Protocol Buffer protocol this means message names can change, but not numbers
      • For JSON protocol, definition is TBD. (Note: JSON is not marked stable yet)
    • Denote all deprecations semantics
      • Mark fields/messages as deprecated
        • for PB, use [deprecated=true]; optionally rename the field with a deprecated_ prefix
        • for JSON, TBD.
      • Annotate interpretation semantics for deprecated fields:
        • If necessary describe how to interpret from deprecated field to new field
        • Define who is allowed to use deprecated semantics, e.g. "As of release X, new
          clients must not use this field"
    • Release changes
      • Coordinate multiple changes into appropriate batches as necessary
  2. Update all consumer components of OTLP
    • Update "core" code to use "new" fields/messages.
    • Ensure components appropriately transition from deprecated messages/fields to "new" or "replacement" fields/messages.
    • Consumers MUST NOT remove deprecated field handling until EOL of the protocol they were introduced in.
  3. Update all producers of OTLP
    • Update all producers of OTLP (SDKs/Exporters) to only produce new/replacement fields/messages
    • All producers MUST NOT use deprecated messages to declare OTLP version compliance
  4. Wait until the deprecation EOL date
    • Remove handling from consumer components of OTLP and release them.
    • Remove old OTLP messages marked "deprecated"
    • Release final changes
  5. Repeat as necessary
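Step 1's deprecation marking in Protocol Buffers might look like this (an illustrative sketch: the rename and field numbers follow the proposal in this thread, not a merged change):

```proto
// Illustrative: the old field keeps its number, gains the deprecated
// option, and is optionally renamed with a deprecated_ prefix.
message Metric {
  // ...
  IntGauge deprecated_int_gauge = 4 [deprecated = true];

  // Replacement field introduced alongside the deprecated one.
  Gauge gauge = 12;
}
```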

@jmacd
Contributor

jmacd commented Mar 15, 2021

Please review #278.

@victlu
Contributor Author

victlu commented Mar 15, 2021

With #278, my concern has been resolved! Thank you @jmacd .

@victlu victlu closed this as completed Mar 15, 2021
@jmacd
Contributor

jmacd commented Mar 15, 2021

Let's leave this open until the PR merges 😀

@jmacd jmacd reopened this Mar 15, 2021
@jsuereth jsuereth added this to Types, Schema and Metadata in Spec - Metrics Data Model and Protocol Mar 16, 2021
@jsuereth jsuereth moved this from Types, Schema and Metadata to Done in Spec - Metrics Data Model and Protocol Mar 22, 2021

7 participants