Semantic conventions for telemetry pipeline monitoring #238

@jmacd jmacd commented Oct 31, 2023

This is a continuation of open-telemetry/semantic-conventions#184, which developed into a longer discussion than originally planned.

This text includes results from auditing the OpenTelemetry Collector and makes recommendations, consistent with existing practice, that extend our ability to configure basic-, normal-, and detailed-level metrics about OpenTelemetry data pipelines.


The proposed metric names are:

`otel.{station}.received`: Inclusive count of items entering the pipeline at a station.


Apologies if this has been discussed before. Having wildcards in the name seems like a bad precedent to set, and makes constructing dashboards, etc. more difficult. It seems OK to have two versions of this, one for SDK and one for collector, but labels seem like a much better solution if this can actually be any string value.
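
For instance, something along these lines (names are placeholders, not a concrete proposal):

```
# Name-per-station, as proposed:
otel.sdk.received
otel.collector.received

# Single name plus an attribute (hypothetical alternative):
otel.pipeline.received{otel.station="sdk"}
otel.pipeline.received{otel.station="collector"}
```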

Contributor Author


For the semantic-conventions repo, I would list three standard station names and give guidance on how to create more station names that follow the pattern. I don't think this will happen often, but I think it should be clear that automatic pipeline monitoring will break if ever the same metric name counts items at more than one location in a pipeline.

(This uniqueness requirement is also the reason why I am not proposing to count the number of items that enter and exit each processor -- because then a chain of processors would require per-processor metrics for automatic monitoring to work. For this reason, it makes sense (as proposed here IMO) to count only the items dropped by processors.)
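
As a rough sketch of why uniqueness matters (using metric names that appear later in this proposal, purely for illustration), automatic monitoring relies on relationships like:

```
# Illustration only: a per-station drop ratio computed from station-unique names
SdkDropRatio = otel.sdk.dropped / otel.sdk.received

# If the same metric name counted items at two different locations in the
# pipeline, this ratio would silently mix both locations and stop being meaningful.
```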

When there is more than one component of a given type active in a
pipeline having the same `domain` and `signal` attributes, the `name`
should include additional information to disambiguate the multiple
instances using the syntax `<type>/<instance>`. For example, if there
Member


In the collector, this is not necessarily enough to differentiate all components in a pipeline, because we may have different components with the same type, e.g. an otlp receiver and an otlp exporter in the same pipeline.

Should this be addressed here, or if the problem is unique to the collector, should the collector adopt a strategy to incorporate the class into the type?
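
For example, something like this (purely hypothetical):

```
# Hypothetical disambiguation, folding the component class into the name:
otel.name = "receiver/otlp"
otel.name = "exporter/otlp"
```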

Contributor Author


Good question. Since I've heavily updated this document and plan to present it in tomorrow's Collector SIG, I'm going to leave it open. As designed, it should be clear that received metrics are from receivers and exported metrics are from exporters. I want to say that everything else is a processor, but I realize this is a thorny question.

PipelineLossRate = LastStageTotal{success=false} / FirstStageTotal{*}
```

Since total loss can be calculated with only a single timneseries per
Member


Suggested change
Since total loss can be calculated with only a single timneseries per
Since total loss can be calculated with only a single timeseries per


#### Practice of error suppression

There is a accepted practice in the OpenTelemetry Collector of
Member


Suggested change
There is a accepted practice in the OpenTelemetry Collector of
There is an accepted practice in the OpenTelemetry Collector of


mtwo commented Nov 20, 2023

@jack-berg @carlosalberto can we mark this as triaged with priority p1?

the pipeline should match the number of items successfully received,
otherwise the system is capable of reporting the combined losses to
the user.

Contributor


This is a vital paragraph for understanding why the defaults specify different levels of detail for each station. Putting an end-to-end example here, as you did with https://github.com/jmacd/oteps/blob/jmacd/drops/text/metrics/0238-pipeline-monitoring.md#pipeline-monitoring-diagram, would work great (albeit with numbers/metrics instead of charts).


@djaglowski djaglowski left a comment


To summarize my other comments and provide a suggestion:

  • Per-component metrics are valuable and, if anything, should be enhanced.
  • The notion of diffing values between stations does not seem to allow sufficient accuracy in describing the way data may actually flow through a collector, because it assumes linearity of data flow.

What seems like a better solution would be to add metrics which directly describe the aggregate behavior of the processors in each of the collector's pipelines.

  1. The total number of items flowing into the first processor.
  2. The total number of items added by the processors.
  3. The total number of items dropped by the processors.
  4. (Implied by 1-3, as sketched below) The total number of items flowing out from the last processor.
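
A minimal sketch of the implied relationship, using hypothetical metric names just to make the arithmetic concrete:

```
# Hypothetical names for the three directly-emitted totals (1-3):
pipeline.processors.incoming   # items flowing into the first processor
pipeline.processors.inserted   # items added by the processors
pipeline.processors.dropped    # items dropped by the processors

# The implied outflow from the last processor (4):
pipeline.processors.outgoing = incoming + inserted - dropped
```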

Comment on lines 186 to 187
integrity of the pipeline. Stations allow data to enter a pipeline
only through receiver components. Stations are never responsible for
Member


Stations allow data to enter a pipeline only through receiver components.

Generally this makes sense but I have to question whether this models the reality of what the collector may do. Specifically, the notion of aggregation comes to mind.

Suppose we have 10 data points which a processor will aggregate into 1, according to our data model. Should this be communicated as 9 data points dropped? In my opinion this is an inaccurate characterization of a valid operation. A more accurate description might require a notion of adding 1 while dropping 10.

Additionally, I think there may be some valid cases where a processor would add data into a stream. For example, here is a proposed component which would augment a data stream. Perhaps more debate is necessary there, but it seems reasonable to me that in some cases we may generate additional items and insert them into the data stream, when they are naturally complementary.

A simpler case of adding data to a stream would involve computed metrics. Say we have a metric w/ "free" and "used" data points, and we wish to generate a "% utilized" metric. I think this could reasonably be done in a processor as well.
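
To make that accounting concrete (the numbers are made up):

```
# 10 data points aggregated into 1 by a processor:
incoming = 10, added = 1, dropped = 10
outgoing = incoming + added - dropped = 10 + 1 - 10 = 1   # not "9 dropped"

# A computed "% utilized" metric added alongside "free" and "used":
incoming = 2, added = 1, dropped = 0
outgoing = 2 + 1 - 0 = 3
```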

Comment on lines 142 to 164
#### Collector perspective

Collector counters are exclusive. Like for SDKs, items that enter a
processor are counted in one of three ways and to compute a meaningful
ratio requires all three timeseries. If the processor is a sampler,
for example, the effective sampling rate is computed as
`(accepted+refused)/(accepted+refused+dropped)`.

While the collector defines and emits metrics sufficient for
monitoring the individual pipeline component--taken as a whole, there
is substantial redundancy in having so many exclusive counters. For
example, when a collector pipeline features no processors, the
receiver's `refused` count is expected to equal the exporter's
`send_failed` count.

When there are several processors, it is primarily the number of
dropped items that we are interested in counting. When there are
multiple sequential processors in a pipeline, however, counting the
total number of items at each stage in a multi-processor pipeline
leads to over-counting in aggregate. For example, if you combine
`accepted` and `refused` for two adjacent processors, then remove the
metric attribute which distinguishes them, the resulting sum will be
twice the number of items processed by the pipeline.
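
To make the over-counting concrete, consider hypothetical counts for 100 items traversing two adjacent processors A and B:

```
# Hypothetical per-processor counts, 100 items traversing A then B:
accepted{processor="A"} = 90    refused{processor="A"} = 10
accepted{processor="B"} = 85    refused{processor="B"} = 5

# Removing the distinguishing attribute and summing:
sum(accepted) + sum(refused) = 175 + 15 = 190   # roughly twice the 100 items
```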
Member


In my opinion this describes too limited a view of how the collector is expected to function. I'm concerned about establishing a model which cannot describe the reality of what the collector does.

For example, when a collector pipeline features no processors, the receiver's refused count is expected to equal the exporter's send_failed count.

Is this true when a pipeline contains multiple exporters, each of which can fail independently? Similarly, what about when a receiver is used in multiple pipelines?

For example, if you combine accepted and refused for two adjacent processors, then remove the metric attribute which distinguishes them, the resulting sum will be twice the number of items processed by the pipeline.

My understanding is that this is how our data model is intended to work. If we had a similar situation, for example "request" counts from a set of switches, we could use the same aggregation mechanism to understand the total number of requests processed by the switches, but we would not simultaneously expect this total to account for packets which flowed through multiple switches.

This is not to say that there is no way to represent net behavior across a linear sequence of processors, but I do not buy that there is a problem with counts for each processor.

Because of station integrity, we can make the following assertions:

1. Data that enters a station is eventually exported or dropped.
2. No other outcomes are possible.
Member


As noted elsewhere, I think we need to account for other possibilities:

  1. Duplicated data (fanned out to multiple exporters within a pipeline, or from a receiver to multiple pipelines).
  2. Added data.
  3. Aggregated data (dropped and added)
  4. Data which is exported successfully by a subset of exporters, and rejected by a different subset.

Comment on lines 197 to 201
metric detail is configured, to avoid redundancy. For simple
pipelines, the number of items exported equals the number of items
received minus the number of items dropped, and for simple pipelines
it is sufficient to observe only successes and failures by receiver as
well as items dropped by processors.
Member


I'm not convinced the concept of a "simple" pipeline is worthy of a special model. In my opinion, we should find a solution which works more generally.

Comment on lines 187 to 188
only through receiver components. Stations are never responsible for
dropping data, because only processor components drop data. Stations
Member


SDK processors do not drop data, and processors aren't responsible for passing data to the next process in a pipeline. Instead, higher order SDK code ensures that each registered processor is called. As the spec puts it, "Each processor registered on TracerProvider is a start of pipeline that consist of span processor and optional exporter".

The architectural differences between SDK and collector processors may impact your design. For example, further on you state:

  1. Data that enters a station is eventually exported or dropped.
  1. No other outcomes are possible.

For SDKs, it's perfectly valid for a processor to do nothing besides extract baggage and add it to the span. No filtering, no exporting.

instances using the syntax `<type>/<instance>`. For example, if there
were two `batch` processors in a collection pipeline (e.g., one for
error spans and one for non-error spans) they might use the names
`batch/error` and `batch/noerror`.
Member


In the SDK, if two batch processors are present, they're only differentiable by the associated exporter. You can't configure the SDK to send some spans to one batch processor, and others to another. So maybe the name would have to be something like batch/otlp, batch/zipkin, etc.

| `otel.name` | Type, name, or "type/name" of the component | Normal (Opt-out) | `probabilitysampler`, `batch`, `otlp/grpc` |
| `otel.success` | Boolean: item considered success? | Normal (Opt-out) | `true`, `false` |
| `otel.reason` | Explanation of success/failures. | Detailed (Opt-in) | `ok`, `timeout`, `permission_denied`, `resource_exhausted` |
| `otel.scope` | Name of instrumentation. | Detailed (Opt-in) | `opentelemetry.io/library` |
Member


Is this intended to be different from the scope name? It seems like it's the same. If so, then we can't opt in or out of it. Maybe the prometheus exporter has options for that, but the scope isn't an optional part of the data model.

| Attributes | Meaning | Level of detail (Optional) | Examples |
|----------------|---------------------------------------------|----------------------------|------------------------------------------------------------|
| `otel.signal` | Name of the telemetry signal | Basic (Opt-out) | `traces`, `logs`, `metrics` |
| `otel.name` | Type, name, or "type/name" of the component | Normal (Opt-out) | `probabilitysampler`, `batch`, `otlp/grpc` |
Member


otel.component.name?

|---------------------|------------------|
| `otel.sdk.received` | Basic |
| `otel.sdk.dropped` | Normal |
| `otel.sdk.exported` | Detailed |
Member


I'm struggling to see what metrics I would expect to get out of an SDK with different requirement levels. Would you be able to include an example with a typical SDK configuration? I'm imagining an SDK with logs, metrics, and traces enabled, each configured to export data via an OTLP exporter. What metrics and series do I see at Basic, Normal, and Detailed?

way than `otel.success`, with recommended values specified below.
- `otel.signal` (string): This is the name of the signal (e.g., "logs",
"metrics", "traces")
- `otel.name` (string): Name of the component in a pipeline.
Member


Suggested change
- `otel.name` (string): Name of the component in a pipeline.
- `otel.component` (string): Name of the component in a pipeline.

- `consumed`: Indicates a normal, synchronous request success case.
The item was consumed by the next stage of the pipeline, which
returned success.
- `unsampled`: Indicates a successful drop case, due to sampling.
Member


When the Filter processor filters out data, should it count it as unsampled? The name doesn't sound right in this case.

EDIT: How about discarded instead of unsampled?


- `otelsdk.producer.items`: count of successful and failed items of
telemetry produced, by signal type, by an OpenTelemetry SDK.
- `otelcol.receiver.items`: count of successful and failed items of
Member


If I have a connector configured, I assume I get both the otelcol.exporter.items and otelcol.receiver.items metrics emitted for the connector, right?

Let's say I configured the Count connector on a traces pipeline, as described in the example in the component's README. The count connector then accepts traces on the traces/in pipeline and creates metrics on the metrics/out pipeline.

I imagine the otelcol.exporter.items metric for the count connector would count the incoming spans on the traces pipeline. What would be the otel.outcome for those correctly consumed spans? Would it be consumed or rather unsampled? These spans aren't shipped anywhere by the component; they are "swallowed" by the connector, if I understand correctly.

I imagine the otelcol.receiver.items metric for the count connector would count the metrics created on the metrics pipeline, with the otel.outcome set to consumed.

Contributor Author


Is the consume operation synchronous? I think the traces/in pipeline will wait until the metrics/out pipeline finishes the consume operation, so the outcome for traces/in will depend on the outcome for metrics/out. If metrics/out fails w/ a retryable status, maybe the producer will retry.

Since the count connector can produce more or fewer metric data points than arriving spans, I do not expect the item counts to match between the exporter and receiver, but I think the outcomes could match for synchronous operations. If the operation is asynchronous, the rules discussed in this proposal would apply -- the traces/in might see consumed while the metrics/out sees some sort of failure.

I don't see any problems, per se, just that the monitoring equations for connectors don't apply. I can't assume that the items_in == items_dropped + items_out.
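
For instance, with hypothetical numbers for the count connector example above (the `pipeline` attribute here is only for illustration):

```
# Hypothetical counts at the connector boundary (traces/in -> metrics/out):
otelcol.exporter.items{pipeline="traces/in"}   = 1000   # spans consumed
otelcol.receiver.items{pipeline="metrics/out"} =   40   # data points produced

# The usual identity does not hold across the connector:
# items_in != items_dropped + items_out   (1000 != 0 + 40)
```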

what would ordinarily count as failure. This behavior makes automatic
component health status reporting more difficult than necessary.

One goal if this proposal is that Collector component health could be
Member


Suggested change
One goal if this proposal is that Collector component health could be
One goal of this proposal is that Collector component health could be

Comment on lines 147 to 151
which adds to the confusion -- it is not standard practice for
receivers to retry in the OpenTelemetry collector, that is the duty of
exporters in our current practice. So, the memory limiter component,
to be consistent, should count "failure drops" to indicate that the
next stage of the pipeline did not see the data.
Member


It depends on the type of receiver. The push-based receivers respond with a retriable error code. Some pull-based receivers retry themselves, for example, the filelog receiver.


jmacd commented Feb 2, 2024

This is a work-in-progress. I will re-open it when it is ready for re-review.

@jmacd jmacd closed this Feb 2, 2024

jmacd commented Feb 2, 2024

(I am incorporating all the feedback received so far. Thank you, reviewers!)

codeboten pushed a commit to open-telemetry/opentelemetry-collector that referenced this pull request Feb 6, 2024
Changes the treatment of
[PartialSuccess](https://opentelemetry.io/docs/specs/otlp/#partial-success),
making them successful and logging a warning instead of returning an
error to the caller. These responses are meant to convey successful
receipt of valid data which could not be accepted for other reasons,
specifically situations where the OpenTelemetry SDK and Collector have
done nothing wrong and where retries should be avoided. While the
existing OTLP exporter returns a permanent error (which also avoids
retries), it makes the situation look like a total failure when in fact
it is more nuanced.

As discussed in the tracking issue, it is a lot of work to propagate
these "partial" successes backwards in a pipeline, so the appropriate
simple way to handle these items is to return success.

In this PR, we log a warning. In a future PR, (IMO) as discussed in
open-telemetry/oteps#238, we should count the
spans/metrics/logs that are rejected in this way using a dedicated
outcome label.

**Link to tracking Issue:**
Part of #9243

**Testing:** Tests for the "partial success" warning have been added.

**Documentation:** PartialSuccess behavior was not documented. Given the
level of detail in the README, it feels appropriate to continue not
documenting it; otherwise, lots of new details would need to be added.

---------

Co-authored-by: Alex Boten <aboten@lightstep.com>

jmacd commented Apr 11, 2024

@kristinapathak will resume this effort, thank you!
