WIP: Pipeline monitoring metrics #249

Closed
wants to merge 21 commits

Conversation

@jmacd (Contributor) commented Feb 7, 2024

Derived from #238. Ready for interested reviewers.
Still needs examples.
Diagram needs to be updated.

The alternative, which uses one metric instrument per producer outcome
and one metric instrument per consumer outcome, has known
difficulties. To define a ratio between any one outcome and the total
requires a metric formula defined by all the outcomes. On other hand,
Member:

Suggested change:
- requires a metric formula defined by all the outcomes. On other hand,
+ requires a metric formula defined by all the outcomes. On the other hand,


#### Producer and Consumer instruments

We choose to specify two metric instruments for use counting outcomes,
Member:

Not sure if this is what was intended but this doesn't read right to me.

Suggested change:
- We choose to specify two metric instruments for use counting outcomes,
+ We choose to specify two metric instruments for use in counting outcomes,

Comment on lines 12 to 13
- `otelcol_consumed_items`: Received and inserted data items (Collector)
- `otelcol_produced_items`: Exported, dropped, and discarded items (Collector)
Member:

The producer/consumer terminology makes these definitions a bit confusing for me. Intuitively I would expect inserted items to be a producer behavior, and dropped/discarded items to be a consumer behavior.

Contributor Author:

Yeah -- I had this same realization; the terms feel ambiguous.

How would you feel about `otelcol_input_items` and `otelcol_output_items`?

Member:

Much clearer

Comment on lines 54 to 57
consumer outcomes. In an ideal pipeline, a conservation rule exists
between what goes in (i.e., is consumed) and what goes out (i.e., is
produced). The use of producer and consumer metric instruments is
designed to enable this form of consistency check. When the pipeline
Member:

From an accounting perspective, I see why we would want to group received + inserted items (so that this total matches exported + dropped + discarded). But the language here is difficult to reconcile with the external vs internal nature of the operations.

Taking a step back, I agree with the categories you've identified (received, exported, inserted, discarded, dropped), but there are several ways to organize them. This proposal organizes the categories in terms of incremental (received, inserted) vs decremental (discarded, dropped, exported) because it gives us the desirable property that the two instruments should be equal. However, I wonder if these same categories can be modeled in a different way while still giving us the ability to check consistency.

Would it be enough that all categories should sum to 0 by subtracting the decremental operations from the incremental ones? Organized according to real data flow, it would be received - discarded + inserted - (dropped + exported) = 0. I think that by separating the incremental from the decremental, it allows this to work for backends, but alternately, could we require that the decremental categories are reported as negative numbers within the same instrument? To me this seems more intuitive but I'm not sure all backends can handle this.
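To make this concrete, here is a minimal sketch of the proposed consistency check in Go (the struct, function, and totals are purely illustrative, not part of the proposal):

```go
package main

import "fmt"

// Hypothetical per-segment totals for the five categories discussed above.
type segmentTotals struct {
	received, discarded, inserted, dropped, exported int64
}

// balanced applies the proposed check: incremental categories (received,
// inserted) minus decremental categories (discarded, dropped, exported)
// should sum to zero.
func balanced(s segmentTotals) bool {
	return s.received-s.discarded+s.inserted-(s.dropped+s.exported) == 0
}

func main() {
	s := segmentTotals{received: 30, discarded: 25, inserted: 0, dropped: 0, exported: 5}
	fmt.Println(balanced(s)) // true: 30 - 25 + 0 - (0 + 5) == 0
}
```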

Contributor Author:

I'm happy with the equation received - discarded + inserted - (dropped + exported) = 0.

I don't think I see a difference between received and inserted. If the telemetry has the component name, it'll be clear whether it was a processor or a receiver, so this could be just a semantic question. If we added another attribute to identify the kind of component, or required it to be included in the otel.component attribute, is that enough to distinguish received and inserted?

My thinking, in creating the discarded and dropped designations specifically, was to have enough decomposition in the data that you could perform the equation as you wrote it: count received (receivers), subtract discarded, add inserted (processors), subtract dropped, leaving exported, which is the thing you'll compare with the next segment, potentially.

Contributor Author:

Continuing --

Your suggestion about negative-values, as opposed to the positive-only expression I've used, brings to mind several related topics. I think this is the "best" way to do it from a metrics data model perspective, but I want to point out other ways we can map these metric events.

Consider that each item of telemetry entering the pipeline has an associated trace context. There is:

a. The UpDownCounter formulation -- for every item arriving, add 1; for every item departing, subtract 1. This can tell us the number of items for attribute sets that are symmetric. If we add one for every item that is input/consumed, then subtract one for every item that is output/produced, the resulting tally is the number of in-flight items, but this mapping has to ignore the outcome/success labels for the +1/-1 to balance out.
b. The Span formulation -- when the receiver starts a new request (or the processor inserts some new data), there is an effective span start event (or a log about the arrival of some telemetry) for some items of telemetry. When the outcome is known for those points (having called the follower), there is a span finish event which can be annotated with the subtotal for each outcome/success matching the number of items consumed.
c. The LogRecord formulation -- same as the span formulation, but one log record per event, vs. span start/end events.
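
As a concrete sketch of formulation (a), assuming the OpenTelemetry Go metrics API (the instrument name and attribute below are hypothetical, not part of this proposal):

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

func main() {
	ctx := context.Background()
	meter := otel.Meter("pipeline-monitoring-example")

	// Hypothetical instrument: +1 per item consumed, -1 per item produced.
	inflight, err := meter.Int64UpDownCounter("otelcol_inflight_items")
	if err != nil {
		panic(err)
	}

	// The attribute set must be symmetric between the add and the subtract,
	// so outcome/success labels are deliberately omitted.
	attrs := metric.WithAttributes(attribute.String("otel.component", "batch"))

	inflight.Add(ctx, 10, attrs)  // 10 items arrive (consumed)
	inflight.Add(ctx, -10, attrs) // 10 items depart (produced); tally returns to zero
}
```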

I'm afraid to keep adding text to the document, but I would go further with the above suggestions. If we are using metrics to monitor the health of all the SDKs, then we will be missing a signal when the metrics SDK itself is failing. I want the metrics SDK to have a span encapsulating each export operation.

Member:

> I don't think I see a difference between received and inserted. If the telemetry has the component name, it'll be clear whether it was a processor or a receiver, so this could be just a semantic question. If we added another attribute to identify the kind of component, or required it to be included in the otel.component attribute, is that enough to distinguish received and inserted?

Looks like I missed an important part of the design: processors are responsible for counting items only when the number changes while passing through a processor.

I was thinking that we should report "received" and "exported" for processors in order to account for situations where data streams are merged. For example, a collector pipeline with two receivers will combine streams into the first processor, so from that processor's perspective it seems important to report the total "received". Likewise, similar problems could arise from receivers or exporters used in multiple pipelines.

To use a concrete example:

pipelines:
  logs/1:
    receivers: [R1, R2]
    processors: [P1]
    exporters: [E1, E2]
  logs/2:
    receivers: [R1]
    processors: [P2]
    exporters: [E1]
| component | received | discarded | inserted | dropped | exported |
|-----------|----------|-----------|----------|---------|----------|
| R1        | 10       | -         | -        | -       | -        |
| R2        | 20       | -         | -        | -       | -        |
| P1        | 30       | 25        | 0        | -       | 5        |
| P2        | 10       | 10        | 2        | -       | 2        |
| E1        | -        | -         | -        | 0       | 7        |
| E2        | -        | -         | -        | 0       | 5        |

In this example, it seems much easier to understand what's going on with P1 when it reports receiving 30.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My earlier design for this proposal included what you're suggesting -- the idea that every processor in the pipeline will independently report complete totals. I think this is excessive, since there is a lot of redundancy, but the problem can be framed this way. In fact, the current design can be applied the way you describe by a simple redefinition rule -- if you consider a pipeline segment to be an individual receiver, an individual processor, or an individual exporter, you'll get the metrics you're expecting. I think this might even be appropriate in complex pipelines.

The defect I'm aware of, when each processor counts independent totals, is that it becomes easy to aggregate adjacent pipeline segments together, which results in overcounting from a pipeline perspective. This is not a problem unique to processor metrics -- the problem arises when a metric query aggregates more than one collector belonging to the same pipeline, or more than one exporter, or more than one processor. My goal is to make it easy to write queries that encompass whole pipeline segments.

In my current proposal, if you aggregate the total for otelcol_consumed_items grouping by all attributes to a single total, the result will be the number of collector pipeline segments times the number of items. If you restrict your query to one segment (meaning one pipeline and one collector), then the aggregate equals the number of items. This property holds because each segment has one exporter and one receiver.

Since there are multiple processors in a pipeline segment, if each processor counts a total, then the aggregate for that segment will equal the number of processors times the number of items, which is not a useful measure to compare against adjacent pipeline segments. When each processor reports a total, you have to aggregate down to an individual processor to understand its behavior. But then, the logic to check whether the receiver and exporter are consistent, given processor behavior, becomes complicated at best--the aggregation would have to filter the dropped and discarded categories from the processor metrics, and then we'd be able to recover the pipeline equations in this proposal.

This is why I ended up proposing that processors count changes in item count, because the changes in item count aggregate correctly despite multiple processors.
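
A minimal sketch of what that could look like inside a processor, again assuming the Go metrics API (the `otel.outcome` attribute name is an assumption for illustration; the instrument names follow this proposal):

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

func main() {
	ctx := context.Background()
	meter := otel.Meter("pipeline-monitoring-example")

	// Processors report only the change in item count, so aggregating a
	// segment's metrics does not multiply totals by the processor count.
	consumed, _ := meter.Int64Counter("otelcol_consumed_items")
	produced, _ := meter.Int64Counter("otelcol_produced_items")

	inCount, outCount := 30, 5 // e.g., a filter processor discarding 25 of 30 items

	if delta := outCount - inCount; delta > 0 {
		// Items inserted by the processor count toward the consumer instrument.
		consumed.Add(ctx, int64(delta), metric.WithAttributes(attribute.String("otel.outcome", "inserted")))
	} else if delta < 0 {
		// Items discarded by the processor count toward the producer instrument.
		produced.Add(ctx, int64(-delta), metric.WithAttributes(attribute.String("otel.outcome", "discarded")))
	}
}
```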

Member:

Thanks for explaining further. The tradeoffs are tough here, but if we're defining a segment as having only one receiver and one exporter, it excludes a large percentage (maybe a substantial majority?) of collector configurations. Even in a simple pipeline like the one below, change counts for P1 have little meaning.

    receivers: [R1, R2]
    processors: [P1]
    exporters: [E1]

Contributor Author:

Question about the example, specifically.
Why are there two paths between R1 and E1? This fact will make it difficult to monitor the pipeline, because it appears to double the input on purpose. The pipeline equations will show this happening, but it will be up to interpretation to say whether it's on purpose or not.

The way I would monitor the setup in your example is to compute all the paths for which I expect the conservation rule to hold. They are:

(R1 + R2) -> P1 -> E1
(R1 + R2) -> P1 -> E2
R1 -> P2 -> E1

Since two paths lead to E1, the pipeline equations have to be combined. For E1, the equation will include a factor of 2 for R1.

2*Received(R1) + Received(R2) = Dropped(P1) + Dropped(P2) + Exported(E1)

This kind of calculation can be automated and derived from the metrics I'm proposing, if you have the graph. I mean, if you want to know that P1 received 30 items of telemetry, just add R1 and R2's consumed item totals; that should be easy.

Contributor Author:

> we're defining a segment as having only one receiver and one exporter

This is an interesting statement -- I've definitely not been clear on this topic. I didn't mean to say one receiver and one exporter. I meant all receivers and one exporter, because that's where the conservation principle holds. The sum of all receivers reaches every exporter, and that is a pipeline segment, so your second example,

    receivers: [R1, R2]
    processors: [P1]
    exporters: [E1]

is exactly the kind of simple pipeline segment that will be easy to monitor, and it will be easy to monitor even if it has a bunch of processors too.

Member:

> Why are there two paths between R1 and E1?

I agree it's likely not useful. It's a contrived example, but I wanted to include the full set of possible stream merges and fan-outs:

- single pipeline concerns
  - merge before first processor
  - fan out after last processor
- inter-pipeline concerns
  - fan out after receiver shared by multiple pipelines
  - merge before exporter shared by multiple pipelines

> (R1 + R2) -> P1 -> E1
> (R1 + R2) -> P1 -> E2

I think this is perhaps where I'm getting tripped up. Could we define a segment as being able to have more than one receiver? This still aggregates correctly. I see why we cannot include multiple exporters, because data is fanned out within the segment, but the fanout that occurs when a receiver is shared between pipelines does not affect the counts for an individual pipeline.

Member:

> I didn't mean to say one receiver and one exporter. I meant all receivers and one exporter, because that's where the conservation principle holds.

I commented before seeing this but I see we arrived at the same conclusion. 👍

@codeboten (Contributor) left a comment:

Thanks for re-writing this @jmacd -- just a few comments. Will the diagram included in this PR be updated to represent the concepts of producers/consumers?


## Explanation

This document proposes two metric instrument semantics threefour
Contributor:

is it three or four?

Contributor Author:

Trying to say I've only defined two semantics, consumed and produced. Then, I prefix the SDK or Collector part to make 4 logical metric instruments, but then I exclude one (reasons stated in "SDK-specific considerations"), leaving three.

Member:

I believe @codeboten is referring to the threefour on line 16

Pipeline components included in this specification are:

- OpenTelemetry SDKs: As telemetry producers, these components are the
start of a pipeline. These components also
Contributor:

missing end of the sentence

The first equation:

```
Consumed(Segment) == Recieved(Segment) + Inserted(Segment)
```
Contributor:

Suggested change:
- Consumed(Segment) == Recieved(Segment) + Inserted(Segment)
+ Consumed(Segment) == Received(Segment) + Inserted(Segment)


The producer categories, leading to the second pipeline segment equation:

- **Exported**: An attempt was made to export the telemetry to a following pipeline segment
Contributor:

Is this an attempt, or rather that the data was successfully exported to the follower?

Contributor Author:

An attempt. When the attempt is made, there is at least some expectation that the next pipeline segment has seen the data. Exported includes both successful and failed cases, and I'm not sure how I can change the words to improve this understanding. I mean to count cases where an RPC was made, essentially, whether it fails or not, because it sets up our expectation for the next segment.

Comment:

So, just to be clear, which metric do we use for an exporter that failed to even establish a connection to a downstream receiver?
For example, if I configure the collector with an OTLP exporter with a bad endpoint, and the HTTP/GRPC connection cannot be made, the export will "fail" but there is no expectation that any following receiver will ever see the data (so won't count it).
It seems Exported doesn't fit here by your definition. Would it be Dropped?

Reply:

> It seems Exported doesn't fit here by your definition. Would it be Dropped?

Yes

possible to verify this and warn about improper accounting during
shutdown.

These equations allow are useful in the abstract, because , without ordering
Contributor:

Suggested change:
- These equations allow are useful in the abstract, because , without ordering
+ These equations allow are useful in the abstract, without ordering

@TylerHelmuth (Member) left a comment:

@jmacd thanks for working on this. After yesterday's collector SIG meeting, it is important to move this work forward so we can get to a stable semantic convention the collector can rely on, and sort out its metric names once and for all. Let me know how I can help.



Comment on lines +288 to +290
- `otelcol_consumed_items`: The number of items received or inserted into a pipeline.
- `otelcol_produced_items`: The number of items discarded, dropped, or exported by a Collector pipeline segment.
- `otelsdk_produced_items`: The number of items discarded, dropped, or exported by a SDK pipeline segment.
Member:

If otelcol and otelsdk are namespacing these metrics, should the names be:

- `otelcol.consumed_items`
- `otelcol.produced_items`
- `otelsdk.produced_items`


### Recommended conventional attributes

- `otel.success` (boolean): This is true or false depending on whether the
Member:

Is `otel` being used to namespace these attributes so they wouldn't conflict with other attribute names? I think we should add some more clarity in the name to make it clear these are attributes of an otel pipeline: how do you feel about the `otel.pipeline.` prefix?

@jmacd (Contributor Author) commented May 8, 2024

@kristinapathak is taking over this work from me. (I thought that I had already stated this!)
Sorry for the delay, and looking forward to progress!


## Explanation

This document proposes two metric instrument semantics threefour
Suggested change:
- This document proposes two metric instrument semantics threefour
+ This document proposes two metric instrument semantics three

An arrangement of pipeline components acting as a single unit, such as
implemented by the OpenTelemetry Collector, is called a segment. Each
segment consists of a receiver, zero or more processors, and an
exporter. The terms "following" and "preceding" apply to pipeline
@0x006EA1E5 commented May 15, 2024:

If an OTel Collector pipeline is configured with more than one receiver / exporter, is this then considered to be multiple logical segments?

How about when the routingconnector is used? Will this be multiple segments contained within a single Collector instance?

Reply:

@0x006EA1E5, I would be happy to continue this discussion on this new PR, but my short response is:

> If an OTel Collector pipeline is configured with more than one receiver / exporter, is this then considered to be multiple logical segments?

Yes! A single Collector pipeline can have multiple segments.

> How about when the routingconnector is used? Will this be multiple segments contained within a single Collector instance?

My new PR includes an example with the spanmetrics connector, but the short answer is also yes. 🙂 A connector is both the end of one segment and the start of the following one. I'm not as familiar with the routing connector, so I will look into it more to get a better understanding. It looks like it would be a good example to include.
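
As a rough sketch of the idea, a connector appears as the exporter of one pipeline and the receiver of another, so it marks a segment boundary (the component names here are just illustrative):

```
connectors:
  spanmetrics:

pipelines:
  traces:
    receivers: [otlp]
    exporters: [spanmetrics]   # the connector ends the first segment
  metrics:
    receivers: [spanmetrics]   # ...and starts the following segment
    exporters: [otlp]
```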

pipeline. The preceding component ("preceder") produces data that is
consumed by the following component ("follower").

An arrangement of pipeline components acting as a single unit, such as
@0x006EA1E5 commented May 15, 2024:

Is the intention that there will be similar `otelcol_*_items` metrics for the segments as well as the components? It's not clear to me how these two concepts apply here.

When it comes to "data loss", I am often more interested in the network boundary between "segments", e.g., when using the loadbalancingexporter to route to a following Collector instance. Currently, I compare the component-level loadbalancingexporter and following otlpreceiver metrics to try to understand data loss, but really what I care about is the segment-level view.

Reply:

@0x006EA1E5, I'm working on writing out more details on data loss between segments. Here is my current scribble that looks at how a resource-exhausted response would look.

@jmacd (Contributor Author) commented May 29, 2024

Closing in favor of #259.

@jmacd closed this May 29, 2024