Semantic conventions for telemetry pipeline monitoring #238

@jmacd jmacd commented Oct 31, 2023

This is a continuation of open-telemetry/semantic-conventions#184, which developed into a longer discussion than originally planned.

This text includes results from auditing the OpenTelemetry Collector and makes recommendations, consistent with existing practice, that extend our ability to configure basic-, normal-, and detailed-level metrics about OpenTelemetry data pipelines.


The proposed metric names are:

`otel.{station}.received`: Inclusive count of items entering the pipeline at a station.


Apologies if this has been discussed before. Having wildcards in the name seems like a bad precedent to set, and makes constructing dashboards, etc. more difficult. It seems OK to have two versions of this, one for SDK and one for collector, but labels seem like a much better solution if this can actually be any string value.
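
For instance, something along these lines (names are placeholders, not a concrete proposal):

```
# Name-per-station, as proposed:
otel.sdk.received
otel.collector.received

# Single name plus an attribute (hypothetical alternative):
otel.pipeline.received{otel.station="sdk"}
otel.pipeline.received{otel.station="collector"}
```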

Contributor Author


For the semantic-conventions repo, I would list three standard station names and give guidance on how to create more station names that follow the pattern. I don't think this will happen often, but I think it should be clear that automatic pipeline monitoring will break if ever the same metric name counts items at more than one location in a pipeline.

(This uniqueness requirement is also the reason why I am not proposing to count the number of items that enter and exit each processor -- because then a chain of processors would require per-processor metrics for automatic monitoring to work. For this reason, it makes sense (as proposed here IMO) to count only the items dropped by processors.)
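
As a rough sketch of why uniqueness matters (using metric names that appear later in this proposal, purely for illustration), automatic monitoring relies on relationships like:

```
# Illustration only: a per-station drop ratio computed from station-unique names
SdkDropRatio = otel.sdk.dropped / otel.sdk.received

# If the same metric name counted items at two different locations in the
# pipeline, this ratio would silently mix both locations and stop being meaningful.
```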

When there is more than one component of a given type active in a
pipeline having the same `domain` and `signal` attributes, the `name`
should include additional information to disambiguate the multiple
instances using the syntax `<type>/<instance>`. For example, if there
Member


In the collector, this is not necessarily enough to differentiate all components in a pipeline, because we may have different components with the same type, e.g. an otlp receiver and an otlp exporter in the same pipeline.

Should this be addressed here, or if the problem is unique to the collector, should the collector adopt a strategy to incorporate the class into the type?
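
For example, something like this (purely hypothetical):

```
# Hypothetical disambiguation, folding the component class into the name:
otel.name = "receiver/otlp"
otel.name = "exporter/otlp"
```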

Contributor Author


Good question. Since I've heavily updated this document and plan to present it in tomorrow's Collector SIG, I'm going to leave it open. As designed, it should be clear that received metrics are from receivers and exported metrics are from exporters. I want to say that everything else is a processor, but I realize this is a thorny question.

PipelineLossRate = LastStageTotal{success=false} / FirstStageTotal{*}
```

Since total loss can be calculated with only a single timneseries per
Member


Suggested change
Since total loss can be calculated with only a single timneseries per
Since total loss can be calculated with only a single timeseries per


#### Practice of error suppression

There is a accepted practice in the OpenTelemetry Collector of
Member


Suggested change
There is a accepted practice in the OpenTelemetry Collector of
There is an accepted practice in the OpenTelemetry Collector of


mtwo commented Nov 20, 2023

@jack-berg @carlosalberto can we mark this as triaged with priority p1?

the pipeline should match the number of items successfully received,
otherwise the system is capable of reporting the combined losses to
the user.

Contributor


This is a vital paragraph for understanding why the defaults specify different levels of detail for each station. Putting an end-to-end example here, as you did with https://github.com/jmacd/oteps/blob/jmacd/drops/text/metrics/0238-pipeline-monitoring.md#pipeline-monitoring-diagram, would work great (albeit with numbers/metrics instead of charts).


@djaglowski djaglowski left a comment


To summarize my other comments and provide a suggestion:

  • Per-component metrics are valuable and, if anything, should be enhanced.
  • The notion of diffing values between stations does not seem to allow sufficient accuracy in describing the way data may actually flow through a collector, because it assumes linearity of data flow.

What seems like a better solution would be to add metrics which directly describe the aggregate behavior of the processors in each of the collector's pipelines.

  1. The total number of items flowing into the first processor.
  2. The total number of items added by the processors.
  3. The total number of items dropped by the processors.
  4. (Implied by 1-3, as sketched below) The total number of items flowing out from the last processor.
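
A minimal sketch of the implied relationship, using hypothetical metric names just to make the arithmetic concrete:

```
# Hypothetical names for the three directly-emitted totals (1-3):
pipeline.processors.incoming   # items flowing into the first processor
pipeline.processors.inserted   # items added by the processors
pipeline.processors.dropped    # items dropped by the processors

# The implied outflow from the last processor (4):
pipeline.processors.outgoing = incoming + inserted - dropped
```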

Comment on lines 186 to 187
integrity of the pipeline. Stations allow data to enter a pipeline
only through receiver components. Stations are never responsible for
Member


Stations allow data to enter a pipeline only through receiver components.

Generally this makes sense but I have to question whether this models the reality of what the collector may do. Specifically, the notion of aggregation comes to mind.

Suppose we have 10 data points which a processor will aggregate into 1, according to our data model. Should this be communicated as 9 data points dropped? In my opinion this is an inaccurate characterization of a valid operation. A more accurate description might require a notion of adding 1 while dropping 10.

Additionally, I think there may be some valid cases where a processor would add data into a stream. For example, here is a proposed component which would augment a data stream. Perhaps more debate is necessary there, but it seems reasonable to me that in some cases we may generate additional items and insert them into the data stream, when they are naturally complementary.

A simpler case of adding data to a stream would involve computed metrics. Say we have a metric w/ "free" and "used" data points, and we wish to generate a "% utilized" metric. I think this could reasonably be done in a processor as well.
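
To make that accounting concrete (the numbers are made up):

```
# 10 data points aggregated into 1 by a processor:
incoming = 10, added = 1, dropped = 10
outgoing = incoming + added - dropped = 10 + 1 - 10 = 1   # not "9 dropped"

# A computed "% utilized" metric added alongside "free" and "used":
incoming = 2, added = 1, dropped = 0
outgoing = 2 + 1 - 0 = 3
```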

Comment on lines 142 to 164
#### Collector perspective

Collector counters are exclusive. Like for SDKs, items that enter a
processor are counted in one of three ways and to compute a meaningful
ratio requires all three timeseries. If the processor is a sampler,
for example, the effective sampling rate is computed as
`(accepted+refused)/(accepted+refused+dropped)`.

While the collector defines and emits metrics sufficient for
monitoring the individual pipeline component--taken as a whole, there
is substantial redundancy in having so many exclusive counters. For
example, when a collector pipeline features no processors, the
receiver's `refused` count is expected to equal the exporter's
`send_failed` count.

When there are several processors, it is primarily the number of
dropped items that we are interested in counting. When there are
multiple sequential processors in a pipeline, however, counting the
total number of items at each stage in a multi-processor pipeline
leads to over-counting in aggregate. For example, if you combine
`accepted` and `refused` for two adjacent processors, then remove the
metric attribute which distinguishes them, the resulting sum will be
twice the number of items processed by the pipeline.
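
To make the over-counting concrete, consider hypothetical counts for 100 items traversing two adjacent processors A and B:

```
# Hypothetical per-processor counts, 100 items traversing A then B:
accepted{processor="A"} = 90    refused{processor="A"} = 10
accepted{processor="B"} = 85    refused{processor="B"} = 5

# Removing the distinguishing attribute and summing:
sum(accepted) + sum(refused) = 175 + 15 = 190   # roughly twice the 100 items
```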
Member


In my opinion this describes too limited a view of how the collector is expected to function. I'm concerned about establishing a model which cannot describe the reality of what the collector does.

For example, when a collector pipeline features no processors, the receiver's refused count is expected to equal the exporter's send_failed count.

Is this true when a pipeline contains multiple exporters, each of which can fail independently? Similarly, what about when a receiver is used in multiple pipelines?

For example, if you combine accepted and refused for two adjacent processors, then remove the metric attribute which distinguishes them, the resulting sum will be twice the number of items processed by the pipeline.

My understanding is that this is how our data model is intended to work. If we had a similar situation, for example "request" counts from a set of switches, we could use the same aggregation mechanism to understand the total number of requests processed by the switches, but we would not simultaneously expect this total to account for packets which flowed through multiple switches.

This is not to say that there is no way to represent net behavior across a linear sequence of processors, but I do not buy that there is a problem with counts for each processor.

Because of station integrity, we can make the following assertions:

1. Data that enters a station is eventually exported or dropped.
2. No other outcomes are possible.
Member


As noted elsewhere, I think we need to account for other possibilities:

  1. Duplicated data (fanned out to multiple exporters within a pipeline, or from a receiver to multiple pipelines).
  2. Added data.
  3. Aggregated data (dropped and added)
  4. Data which is exported successfully by a subset of exporters, and rejected by a different subset.

Comment on lines 197 to 201
metric detail is configured, to avoid redundancy. For simple
pipelines, the number of items exported equals the number of items
received minus the number of items dropped, and for simple pipelines
it is sufficient to observe only successes and failures by receiver as
well as items dropped by processors.
Member


I'm not convinced the concept of a "simple" pipeline is worthy of a special model. In my opinion, we should find a solution which works more generally.

Comment on lines 187 to 188
only through receiver components. Stations are never responsible for
dropping data, because only processor components drop data. Stations
Member


SDK processors do not drop data, and processors aren't responsible for passing data to the next process in a pipeline. Instead, higher order SDK code ensures that each registered processor is called. As the spec puts it, "Each processor registered on TracerProvider is a start of pipeline that consist of span processor and optional exporter".

The architectural differences between SDK and collector processors may impact your design. For example, further on you state:

  1. Data that enters a station is eventually exported or dropped.
  1. No other outcomes are possible.

For SDKs, it's perfectly valid for a processor to do nothing besides extract baggage and add it to the span. No filtering, no exporting.

instances using the syntax `<type>/<instance>`. For example, if there
were two `batch` processors in a collection pipeline (e.g., one for
error spans and one for non-error spans) they might use the names
`batch/error` and `batch/noerror`.
Member


In the SDK, if two batch processors are present, they're only differentiable by the associated exporter. You can't configure the SDK to send some spans to one batch processor, and others to another. So maybe the name would have to be something like batch/otlp, batch/zipkin, etc.

| `otel.name` | Type, name, or "type/name" of the component | Normal (Opt-out) | `probabilitysampler`, `batch`, `otlp/grpc` |
| `otel.success` | Boolean: item considered success? | Normal (Opt-out) | `true`, `false` |
| `otel.reason` | Explanation of success/failures. | Detailed (Opt-in) | `ok`, `timeout`, `permission_denied`, `resource_exhausted` |
| `otel.scope` | Name of instrumentation. | Detailed (Opt-in) | `opentelemetry.io/library` |
Member


Is this intended to be different from the scope name? It seems like it's the same. If so, then we can't opt in or out of it. Maybe the prometheus exporter has options for that, but the scope isn't an optional part of the data model.

| Attributes | Meaning | Level of detail (Optional) | Examples |
|----------------|---------------------------------------------|----------------------------|------------------------------------------------------------|
| `otel.signal` | Name of the telemetry signal | Basic (Opt-out) | `traces`, `logs`, `metrics` |
| `otel.name` | Type, name, or "type/name" of the component | Normal (Opt-out) | `probabilitysampler`, `batch`, `otlp/grpc` |
Member


otel.component.name?

|---------------------|------------------|
| `otel.sdk.received` | Basic |
| `otel.sdk.dropped` | Normal |
| `otel.sdk.exported` | Detailed |
Member


I'm struggling to see what metrics I would expect to get out of an SDK with different requirement levels. Would you be able to include an example with a typical SDK configuration? I'm imagining an SDK with logs, metrics, and traces enabled, each configured to export data via an OTLP exporter. What metrics and series do I see at Basic, Normal, and Detailed?

way than `otel.success`, with recommended values specified below.
- `otel.signal` (string): This is the name of the signal (e.g., "logs",
"metrics", "traces")
- `otel.name` (string): Name of the component in a pipeline.
Member


Suggested change
- `otel.name` (string): Name of the component in a pipeline.
- `otel.component` (string): Name of the component in a pipeline.

- `consumed`: Indicates a normal, synchronous request success case.
The item was consumed by the next stage of the pipeline, which
returned success.
- `unsampled`: Indicates a successful drop case, due to sampling.
Member


When the Filter processor filters out data, should it count it as unsampled? The name doesn't sound right in this case.

EDIT: How about discarded instead of unsampled?


- `otelsdk.producer.items`: count of successful and failed items of
telemetry produced, by signal type, by an OpenTelemetry SDK.
- `otelcol.receiver.items`: count of successful and failed items of
Member


If I have a connector configured, I assume I get both the otelcol.exporter.items and otelcol.receiver.items metrics emitted for the connector, right?

Let's say I configured the Count connector on a traces pipeline, as described in the example in the component's README. The count connector then accepts traces on the traces/in pipeline and creates metrics on the metrics/out pipeline.

I imagine the otelcol.exporter.items metric for the count connector would count the incoming spans on the traces pipeline. What would be the otel.outcome for those correctly consumed spans? Would it be consumed or rather unsampled? These spans aren't shipped anywhere by the component; they are "swallowed" by the connector, if I understand correctly.

I imagine the otelcol.receiver.items metric for the count connector would count the metrics created on the metrics pipeline, with the otel.outcome set to consumed.

Contributor Author


Is the consume operation synchronous? I think the traces/in pipeline will wait until the metrics/out pipeline finishes the consume operation, so the outcome for traces/in will depend on the outcome for metrics/out. If metrics/out fails w/ a retryable status, maybe the producer will retry.

Since the count connector can produce more or fewer metric data points than arriving spans, I do not expect the item counts to match between the exporter and receiver, but I think the outcomes could match for synchronous operations. If the operation is asynchronous, the rules discussed in this proposal would apply -- the traces/in might see consumed while the metrics/out sees some sort of failure.

I don't see any problems, per se, just that the monitoring equations for connectors don't apply. I can't assume that the items_in == items_dropped + items_out.
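
For instance, with hypothetical numbers for the count connector example above (the `pipeline` attribute here is only for illustration):

```
# Hypothetical counts at the connector boundary (traces/in -> metrics/out):
otelcol.exporter.items{pipeline="traces/in"}   = 1000   # spans consumed
otelcol.receiver.items{pipeline="metrics/out"} =   40   # data points produced

# The usual identity does not hold across the connector:
# items_in != items_dropped + items_out   (1000 != 0 + 40)
```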

what would ordinarily count as failure. This behavior makes automatic
component health status reporting more difficult than necessary.

One goal if this proposal is that Collector component health could be
Member


Suggested change
One goal if this proposal is that Collector component health could be
One goal of this proposal is that Collector component health could be

Comment on lines 147 to 151
which adds to the confusion -- it is not standard practice for
receivers to retry in the OpenTelemetry collector, that is the duty of
exporters in our current practice. So, the memory limiter component,
to be consistent, should count "failure drops" to indicate that the
next stage of the pipeline did not see the data.
Member


It depends on the type of receiver. The push-based receivers respond with a retriable error code. Some pull-based receivers retry themselves, for example, the filelog receiver.


jmacd commented Feb 2, 2024

This is a work-in-progress. I will re-open it when it is ready for re-review.

@jmacd jmacd closed this Feb 2, 2024

jmacd commented Feb 2, 2024

(I am incorporating all the feedback received so far. Thank you, reviewers!)

codeboten pushed a commit to open-telemetry/opentelemetry-collector that referenced this pull request Feb 6, 2024
Changes the treatment of
[PartialSuccess](https://opentelemetry.io/docs/specs/otlp/#partial-success),
making them successful and logging a warning instead of returning an
error to the caller. These responses are meant to convey successful
receipt of valid data which could not be accepted for other reasons,
specifically situations where the OpenTelemetry SDK and Collector have
done nothing wrong and where retries should be avoided. While the
existing OTLP exporter returns a permanent error (which also avoids
retries), it makes the situation look like a total failure when in fact
it is more nuanced.

As discussed in the tracking issue, it is a lot of work to propagate
these "partial" successes backwards in a pipeline, so the appropriate
simple way to handle these items is to return success.

In this PR, we log a warning. In a future PR, (IMO) as discussed in
open-telemetry/oteps#238, we should count the
spans/metrics/logs that are rejected in this way using a dedicated
outcome label.

**Link to tracking Issue:**
Part of #9243

**Testing:** Tests for the "partial success" warning have been added.

**Documentation:** PartialSuccess behavior was not documented. Given the
level of detail in the README, it feels appropriate to continue not
documenting it; otherwise, lots of new details would need to be added.

---------

Co-authored-by: Alex Boten <aboten@lightstep.com>

jmacd commented Apr 11, 2024

@kristinapathak will resume this effort, thank you!
