WIP: Pipeline monitoring metrics #249

Closed
wants to merge 21 commits

Conversation

@jmacd (Contributor) commented Feb 7, 2024

Derived from #238. Ready for interested reviewers.
Still needs examples.
Diagram needs to be updated.

The alternative, which uses one metric instrument per producer outcome
and one metric instrument per consumer outcome, has known
difficulties. To define a ratio between any one outcome and the total
requires a metric formula defined by all the outcomes. On other hand,
Member:

Suggested change:
- requires a metric formula defined by all the outcomes. On other hand,
+ requires a metric formula defined by all the outcomes. On the other hand,


#### Producer and Consumer instruments

We choose to specify two metric instruments for use counting outcomes,
Member:

Not sure if this is what was intended but this doesn't read right to me.

Suggested change:
- We choose to specify two metric instruments for use counting outcomes,
+ We choose to specify two metric instruments for use in counting outcomes,

Comment on lines 12 to 13
- `otelcol_consumed_items`: Received and inserted data items (Collector)
- `otelcol_produced_items`: Exported, dropped, and discarded items (Collector)
Member:

The producer/consumer terminology makes these definitions a bit confusing for me. Intuitively I would expect inserted items to be a producer behavior, and dropped/discarded items to be a consumer behavior.

Contributor Author:

Yeah -- I had this same realization; the terms feel ambiguous.

How would you feel about `otelcol_input_items` and `otelcol_output_items`?

Member:

Much clearer

Comment on lines 54 to 57
consumer outcomes. In an ideal pipeline, a conservation rule exists
between what goes in (i.e., is consumed) and what goes out (i.e., is
produced). The use of producer and consumer metric instruments is
designed to enable this form of consistency check. When the pipeline
Member:

From an accounting perspective, I see why we would want to group received + inserted items (so that this total matches exported + dropped + discarded). But the language here is difficult to reconcile with the external vs internal nature of the operations.

Taking a step back, I agree with the categories you've identified (received, exported, inserted, discarded, dropped), but there are several ways to organize them. This proposal organizes the categories in terms of incremental (received, inserted) vs decremental (discarded, dropped, exported) because it gives us the desirable property that the two instruments should be equal. However, I wonder if these same categories can be modeled in a different way while still giving us the ability to check consistency.

Would it be enough that all categories should sum to 0 by subtracting the decremental operations from the incremental ones? Organized according to real data flow, it would be received - discarded + inserted - (dropped + exported) = 0. I think that by separating the incremental from the decremental, it allows this to work for backends, but alternately, could we require that the decremental categories are reported as negative numbers within the same instrument? To me this seems more intuitive but I'm not sure all backends can handle this.
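To make this concrete, here is a minimal sketch of the proposed consistency check in Go (the struct, function, and totals are purely illustrative, not part of the proposal):

```go
package main

import "fmt"

// Hypothetical per-segment totals for the five categories discussed above.
type segmentTotals struct {
	received, discarded, inserted, dropped, exported int64
}

// balanced applies the proposed check: incremental categories (received,
// inserted) minus decremental categories (discarded, dropped, exported)
// should sum to zero.
func balanced(s segmentTotals) bool {
	return s.received-s.discarded+s.inserted-(s.dropped+s.exported) == 0
}

func main() {
	s := segmentTotals{received: 30, discarded: 25, inserted: 0, dropped: 0, exported: 5}
	fmt.Println(balanced(s)) // true: 30 - 25 + 0 - (0 + 5) == 0
}
```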

Contributor Author:

I'm happy with the equation received - discarded + inserted - (dropped + exported) = 0.

I don't think I see a difference between received and inserted. If the telemetry has the component name, it'll be clear whether it was a processor or a receiver, so this could be just a semantic question. If we added another attribute to identify the kind of component, or required it to be included in the otel.component attribute, is that enough to distinguish received and inserted?

My thinking, in creating the discarded and dropped designations specifically, was to have enough decomposition in the data that you could perform the equation as you wrote it: count received (receivers), subtract discarded, add inserted (processors), subtract dropped, leaving exported, which is the thing you'll compare with the next segment, potentially.

Contributor Author:

Continuing --

Your suggestion about negative-values, as opposed to the positive-only expression I've used, brings to mind several related topics. I think this is the "best" way to do it from a metrics data model perspective, but I want to point out other ways we can map these metric events.

Consider that each item of telemetry entering the pipeline has an associated trace context. There is:

a. The UpDownCounter formulation -- for every item arriving, add 1; for every item departing, subtract 1. This can tell us the number of items for attribute sets that are symmetric. If we add one for every item that is input/consumed, then subtract one for every item that is output/produced, the resulting tally is the number of in-flight items, but this mapping has to ignore the outcome/success labels for the +1/-1 to balance out.
b. The Span formulation -- when the receiver starts a new request (or the processor inserts some new data), there is an effective span start event (or a log about the arrival of some telemetry) for some items of telemetry. When the outcome is known for those points (having called the follower), there is a span finish event which can be annotated with the subtotal for each outcome/success matching the number of items consumed.
c. The LogRecord formulation -- same as the span formulation, but one log record per event, vs. span start/end events.
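
As a concrete sketch of formulation (a), assuming the OpenTelemetry Go metrics API (the instrument name and attribute below are hypothetical, not part of this proposal):

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

func main() {
	ctx := context.Background()
	meter := otel.Meter("pipeline-monitoring-example")

	// Hypothetical instrument: +1 per item consumed, -1 per item produced.
	inflight, err := meter.Int64UpDownCounter("otelcol_inflight_items")
	if err != nil {
		panic(err)
	}

	// The attribute set must be symmetric between the add and the subtract,
	// so outcome/success labels are deliberately omitted.
	attrs := metric.WithAttributes(attribute.String("otel.component", "batch"))

	inflight.Add(ctx, 10, attrs)  // 10 items arrive (consumed)
	inflight.Add(ctx, -10, attrs) // 10 items depart (produced); tally returns to zero
}
```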

I'm afraid to keep adding text to the document, but I would go further with the above suggestions. If we are using metrics to monitor the health of all the SDKs, then we will be missing a signal when the metrics SDK itself is failing. I want the metrics SDK to have a span encapsulating each export operation.

Member:

> I don't think I see a difference between received and inserted. If the telemetry has the component name, it'll be clear whether it was a processor or a receiver, so this could be just a semantic question. If we added another attribute to identify the kind of component, or required it to be included in the otel.component attribute, is that enough to distinguish received and inserted?

Looks like I missed an important part of the design: processors are responsible for counting items only when the number changes while passing through a processor.

I was thinking that we should report "received" and "exported" for processors in order to account for situations where data streams are merged. For example, a collector pipeline with two receivers will combine streams into the first processor, so from that processor's perspective it seems important to report the total "received". Likewise, similar problems could arise from receivers or exporters used in multiple pipelines.

To use a concrete example:

pipelines:
  logs/1:
    receivers: [R1, R2]
    processors: [P1]
    exporters: [E1, E2]
  logs/2:
    receivers: [R1]
    processors: [P2]
    exporters: [E1]
| component | received | discarded | inserted | dropped | exported |
|-----------|----------|-----------|----------|---------|----------|
| R1        | 10       | -         | -        | -       | -        |
| R2        | 20       | -         | -        | -       | -        |
| P1        | 30       | 25        | 0        | -       | 5        |
| P2        | 10       | 10        | 2        | -       | 2        |
| E1        | -        | -         | -        | 0       | 7        |
| E2        | -        | -         | -        | 0       | 5        |

In this example, it seems much easier to understand what's going on with P1 when it reports receiving 30.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My earlier design for this proposal included what you're suggesting -- the idea that every processor in the pipeline will independently report complete totals. I think this is excessive, since there is a lot of redundancy, but the problem can be framed this way. In fact, the current design can be applied the way you describe by a simple redefinition rule -- if you consider a pipeline segment to be an individual receiver, an individual processor, or an individual exporter, you'll get the metrics you're expecting. I think this might even be appropriate in complex pipelines.

The defect I'm aware of, when each processor counts independent totals, is that it becomes easy to aggregate adjacent pipeline segments together, which results in overcounting from a pipeline perspective. This is not a problem unique to processor metrics -- the problem arises when a metric query aggregates more than one collector belonging to the same pipeline, or more than one exporter, or more than one processor. My goal is to make it easy to write queries that encompass whole pipeline segments.

In my current proposal, if you aggregate the total for otelcol_consumed_items grouping by all attributes to a single total, the result will be the number of collector pipeline segments times the number of items. If you restrict your query to one segment (meaning one pipeline and one collector), then the aggregate equals the number of items. This property holds because each segment has one exporter and one receiver.

Since there are multiple processors in a pipeline segment, if each processor counts a total, then the aggregate for that segment will equal the number of processors times the number of items, which is not a useful measure to compare against adjacent pipeline segments. When each processor reports a total, you have to aggregate down to an individual processor to understand its behavior. But then, the logic to check whether the receiver and exporter are consistent, given processor behavior, becomes complicated at best--the aggregation would have to filter the dropped and discarded categories from the processor metrics, and then we'd be able to recover the pipeline equations in this proposal.

This is why I ended up proposing that processors count changes in item count, because the changes in item count aggregate correctly despite multiple processors.
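
A minimal sketch of what that could look like inside a processor, again assuming the Go metrics API (the `otel.outcome` attribute name is an assumption for illustration; the instrument names follow this proposal):

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

func main() {
	ctx := context.Background()
	meter := otel.Meter("pipeline-monitoring-example")

	// Processors report only the change in item count, so aggregating a
	// segment's metrics does not multiply totals by the processor count.
	consumed, _ := meter.Int64Counter("otelcol_consumed_items")
	produced, _ := meter.Int64Counter("otelcol_produced_items")

	inCount, outCount := 30, 5 // e.g., a filter processor discarding 25 of 30 items

	if delta := outCount - inCount; delta > 0 {
		// Items inserted by the processor count toward the consumer instrument.
		consumed.Add(ctx, int64(delta), metric.WithAttributes(attribute.String("otel.outcome", "inserted")))
	} else if delta < 0 {
		// Items discarded by the processor count toward the producer instrument.
		produced.Add(ctx, int64(-delta), metric.WithAttributes(attribute.String("otel.outcome", "discarded")))
	}
}
```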

Member:

Thanks for explaining further. The tradeoffs are tough here, but if we're defining a segment as having only one receiver and one exporter, it excludes a large percentage (maybe a substantial majority?) of collector configurations. Even in a simple pipeline like the one below, change counts for P1 have little meaning.

    receivers: [R1, R2]
    processors: [P1]
    exporters: [E1]

Contributor Author:

Question about the example, specifically.
Why are there two paths between R1 and E1? This fact will make it difficult to monitor the pipeline, because it appears to double the input on purpose. The pipeline equations will show this happening, but it will be up to interpretation to say whether it's on purpose or not.

The way I would monitor the setup in your example is to compute all the paths for which I expect the conservation rule to hold. They are:

(R1 + R2) -> P1 -> E1
(R1 + R2) -> P1 -> E2
R1 -> P2 -> E1

Since two paths lead to E1, the pipeline equations have to be combined. For E1, the equation will include a factor of 2 for R1.

2*Received(R1) + Received(R2) = Dropped(P1) + Dropped(P2) + Exported(E1)

This kind of calculation can be automated and derived from the metrics I'm proposing, if you have the graph. I mean, if you want to know that P1 received 30 items of telemetry, just add R1 and R2's consumed item totals; that should be easy.

Contributor Author:

> we're defining a segment as having only one receiver and one exporter

This is an interesting statement -- I've definitely not been clear on this topic. I didn't mean to say one receiver and one exporter. I meant all receivers and one exporter, because that's where the conservation principle holds. The sum of all receivers reaches every exporter, and that is a pipeline segment, so your second example,

    receivers: [R1, R2]
    processors: [P1]
    exporters: [E1]

is exactly the kind of simple pipeline segment that will be easy to monitor, and it will be easy to monitor even if it has a bunch of processors too.

Member:

> Why are there two paths between R1 and E1?

I agree it's likely not useful. It's a contrived example, but I wanted to include the full set of possible stream merges and fan-outs:

- single pipeline concerns
  - merge before first processor
  - fan out after last processor
- inter-pipeline concerns
  - fan out after receiver shared by multiple pipelines
  - merge before exporter shared by multiple pipelines

> (R1 + R2) -> P1 -> E1
> (R1 + R2) -> P1 -> E2

I think this is perhaps where I'm getting tripped up. Could we define a segment as being able to have more than one receiver? This still aggregates correctly. I see why we cannot include multiple exporters, because data is fanned out within the segment, but the fanout that occurs when a receiver is shared between pipelines does not affect the counts for an individual pipeline.

Member:

> I didn't mean to say one receiver and one exporter. I meant all receivers and one exporter, because that's where the conservation principle holds.

I commented before seeing this but I see we arrived at the same conclusion. 👍

@codeboten (Contributor) left a comment:

Thanks for re-writing this @jmacd -- just a few comments. Will the diagram included in this PR be updated to represent the concepts of producers/consumers?


## Explanation

This document proposes two metric instrument semantics threefour
Contributor:

is it three or four?

Contributor Author:

Trying to say I've only defined two semantics, consumed and produced. Then, I prefix the SDK or Collector part to make 4 logical metric instruments, but then I exclude one (reasons stated in "SDK-specific considerations"), leaving three.

Member:

I believe @codeboten is referring to the threefour on line 16

Pipeline components included in this specification are:

- OpenTelemetry SDKs: As telemetry producers, these components are the
start of a pipeline. These components also
Contributor:

missing end of the sentence

The first equation:

```
Consumed(Segment) == Recieved(Segment) + Inserted(Segment)
```
Contributor:

Suggested change:
- Consumed(Segment) == Recieved(Segment) + Inserted(Segment)
+ Consumed(Segment) == Received(Segment) + Inserted(Segment)


The producer categories, leading to the second pipeline segment equation:

- **Exported**: An attempt was made to export the telemetry to a following pipeline segment
Contributor:

Is this an attempt, or rather that the data was successfully exported to the follower?

Contributor Author:

An attempt. When the attempt is made, there is at least some expectation that the next pipeline segment has seen the data. Exported includes both successful and failed cases, and I'm not sure how I can change the words to improve this understanding. I mean to count cases where an RPC was made, essentially, whether it fails or not, because it sets up our expectation for the next segment.

Comment:

So, just to be clear, which metric do we use for an exporter that failed to even establish a connection to a downstream receiver?
For example, if I configure the collector with an OTLP exporter with a bad endpoint, and the HTTP/GRPC connection cannot be made, the export will "fail" but there is no expectation that any following receiver will ever see the data (so won't count it).
It seems Exported doesn't fit here by your definition. Would it be Dropped?

Reply:

> It seems Exported doesn't fit here by your definition. Would it be Dropped?

Yes

possible to verify this and warn about improper accounting during
shutdown.

These equations allow are useful in the abstract, because , without ordering
Contributor:

Suggested change:
- These equations allow are useful in the abstract, because , without ordering
+ These equations allow are useful in the abstract, without ordering

@TylerHelmuth (Member) left a comment:

@jmacd thanks for working on this. After yesterday's collector SIG meeting, it is important to move this work forward so we can get to a stable semantic convention the collector can rely on, and sort out its metric names once and for all. Let me know how I can help.



Comment on lines +288 to +290
- `otelcol_consumed_items`: The number of items received or inserted into a pipeline.
- `otelcol_produced_items`: The number of items discarded, dropped, or exported by a Collector pipeline segment.
- `otelsdk_produced_items`: The number of items discarded, dropped, or exported by a SDK pipeline segment.
Member:

If otelcol and otelsdk are namespacing these metrics, should the names be:

- `otelcol.consumed_items`
- `otelcol.produced_items`
- `otelsdk.produced_items`


### Recommended conventional attributes

- `otel.success` (boolean): This is true or false depending on whether the
Member:

Is `otel` being used to namespace these attributes so they wouldn't conflict with other attribute names? I think we should add some more clarity in the name to make it clear these are attributes of an otel pipeline: how do you feel about the `otel.pipeline.` prefix?

@jmacd (Contributor Author) commented May 8, 2024

@kristinapathak is taking over this work from me. (I thought that I had already stated this!)
Sorry for the delay, and looking forward to progress!


## Explanation

This document proposes two metric instrument semantics threefour
Suggested change:
- This document proposes two metric instrument semantics threefour
+ This document proposes two metric instrument semantics three

An arrangement of pipeline components acting as a single unit, such as
implemented by the OpenTelemetry Collector, is called a segment. Each
segment consists of a receiver, zero or more processors, and an
exporter. The terms "following" and "preceding" apply to pipeline
@0x006EA1E5 commented May 15, 2024:

If an OTel Collector pipeline is configured with more than one receiver / exporter, is this then considered to be multiple logical segments?

How about when the routingconnector is used? Will this be multiple segments contained within a single Collector instance?

Reply:

@0x006EA1E5, I would be happy to continue this discussion on this new PR, but my short response is:

> If an OTel Collector pipeline is configured with more than one receiver / exporter, is this then considered to be multiple logical segments?

Yes! A single Collector pipeline can have multiple segments.

> How about when the routingconnector is used? Will this be multiple segments contained within a single Collector instance?

My new PR includes an example with the spanmetrics connector, but the short answer is also yes. 🙂 A connector is both the end of one segment and the start of the following one. I'm not as familiar with the routing connector, so I will look into it more to get a better understanding. It looks like it would be a good example to include.
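
As a rough sketch of the idea, a connector appears as the exporter of one pipeline and the receiver of another, so it marks a segment boundary (the component names here are just illustrative):

```
connectors:
  spanmetrics:

pipelines:
  traces:
    receivers: [otlp]
    exporters: [spanmetrics]   # the connector ends the first segment
  metrics:
    receivers: [spanmetrics]   # ...and starts the following segment
    exporters: [otlp]
```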

pipeline. The preceding component ("preceder") produces data that is
consumed by the following component ("follower").

An arrangement of pipeline components acting as a single unit, such as
@0x006EA1E5 commented May 15, 2024:

Is the intention that there will be similar `otelcol_*_items` metrics for the segments as well as the components? It's not clear to me how these two concepts apply here.

When it comes to "data loss", I am often more interested in the network boundary between "segments", e.g., when using the loadbalancingexporter to route to a following Collector instance. Currently, I compare the component-level loadbalancingexporter and following otlpreceiver metrics to try to understand data loss, but really what I care about is the segment-level view.

Reply:

@0x006EA1E5, I'm working on writing out more details on data loss between segments. Here is my current scribble that looks at how a resource-exhausted response would look.

@jmacd (Contributor Author) commented May 29, 2024

Closing in favor of #259.

@jmacd closed this May 29, 2024