WIP: [processor/interval] Implement the main logic #30827

Closed
wants to merge 11 commits into from

Conversation

RichieSams (Contributor)

Description:

This PR implements the main logic of this new processor.

Link to tracking Issue:
#29461

Testing:
I added a test for the main aggregation / export behavior, but many more need to be added to test the various edge cases.

Documentation:

I updated the README with the updated configs, but I should add some examples as well.

@RichieSams (Contributor Author)

@djaglowski This isn't ready to merge yet, but I wanted to get what I had posted so you all could give initial feedback.

cc @sh0rez and @0x006EA1E5. I'd love your feedback as well, if you have time.

RichieSams marked this pull request as draft January 29, 2024 14:12
@djaglowski (Member) left a comment

@RichieSams, I apologize, but I'm finding this difficult to review while we have another processor under development that has to solve many of the same concerns: caching rules, identity, emit interval. Since the other processor appears to have a more detailed design proposal, I'd like to wait on this and see if we can follow it by default, diverging only where necessary.

@RichieSams (Contributor Author)

> @RichieSams, I apologize, but I'm finding this difficult to review while we have another processor under development that has to solve many of the same concerns: caching rules, identity, emit interval. Since the other processor appears to have a more detailed design proposal, I'd like to wait on this and see if we can follow it by default, diverging only where necessary.

That's fine. I understand.

jpkrohling pushed a commit that referenced this pull request Feb 19, 2024
Adds a new internal, _experimental_ package `metrics/identity` which
implements identity types for resource, scope, metric and stream.

This is closely related to work being done in #30707 and #30827.

The package is explicitly experimental: it is to be treated as an internal
component of the above processors and may change at any moment while those
are under active initial development.

/cc @jpkrohling @djaglowski @RichieSams
djaglowski pushed a commit to djaglowski/opentelemetry-collector-contrib that referenced this pull request Feb 19, 2024 (same `metrics/identity` commit as above)

sum := m.Sum()                          // the Sum metric being built for export
numDP := sum.DataPoints().AppendEmpty() // append a fresh data point
dp.DataPoint.CopyTo(numDP)              // copy the stored data point into it
@RichieSams (Contributor Author)

We need to update the timestamp to now(). now() should be calculated once at the top of the export pass, and the same value used for every data point.
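
For illustration, a minimal sketch of what that could look like, assuming the pdata `NumberDataPointSlice` API; the package and helper names are hypothetical, not the actual processor code:

package intervalprocessor

import (
	"time"

	"go.opentelemetry.io/collector/pdata/pcommon"
	"go.opentelemetry.io/collector/pdata/pmetric"
)

// stampExportTimestamps is a hypothetical helper: now() is computed a single
// time at the top of an export pass and every exported data point is stamped
// with that same value.
func stampExportTimestamps(dps pmetric.NumberDataPointSlice) {
	now := pcommon.NewTimestampFromTime(time.Now()) // calculated once
	for i := 0; i < dps.Len(); i++ {
		dps.At(i).SetTimestamp(now) // same value used for everything in this pass
	}
}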

Member

I don't think we should set any timestamps to "now". We are aggregating data points, so we should use the timestamps they carry in a meaningful way. The appropriate behavior may depend on the type of metric, but for example if aggregating a sum, I would expect the start time to be the earliest from all data points being aggregated, and the end time to be the latest from those data points.
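
A rough sketch of that alternative (hypothetical helper name, assuming the pdata API): the aggregated point keeps the earliest start time and the latest end time of the points being merged:

package intervalprocessor

import "go.opentelemetry.io/collector/pdata/pmetric"

// mergeSumTimestamps is a sketch of the behaviour described above: the
// aggregated point keeps the earliest start time and the latest end time
// seen across the data points being aggregated.
func mergeSumTimestamps(agg pmetric.NumberDataPoint, in pmetric.NumberDataPointSlice) {
	for i := 0; i < in.Len(); i++ {
		dp := in.At(i)
		if agg.StartTimestamp() == 0 || dp.StartTimestamp() < agg.StartTimestamp() {
			agg.SetStartTimestamp(dp.StartTimestamp()) // earliest start wins
		}
		if dp.Timestamp() > agg.Timestamp() {
			agg.SetTimestamp(dp.Timestamp()) // latest end wins
		}
	}
}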

@RichieSams (Contributor Author)

If we don't update the timestamp, then why are we continuing to re-export them? Depending on the final destination, we're either creating duplicate data, or requiring the destination to implement data deduplication.

Member

It's not clear to me why we would continue to re-export data points when there is no new data to report. That said, I haven't had time to focus on the nuances of this processor and likely won't for the foreseeable future so I'll leave it to others to decide. It just seems to me that we should be reporting a direct representation of the data points we actually receive.

@RichieSams (Contributor Author)

Right. What you're describing is then similar to the "second" approach I describe here: #29461 (comment)

I think both approaches are "valid"; we just need to decide which one this processor is doing.

Member

Thanks for pointing that out. The second approach sounds better to me.

Comment:

For us, the principal use case of this set of features is to be able to send count connector metrics to a Prometheus remote write endpoint.

The count connector sends a stream of increment data points (deltas with value one).

Prometheus expects a steady stream of cumulative data points, published at the "scrape interval".

For a Prometheus scrape, consecutive reads of a counter which hasn't incremented will read the same value each time, so we see the behavior of a steady stream of repeating, unchanging values, every interval.

It is my understanding that downstream metrics systems based on Prometheus expect to see this steady stream of unchanging cumulatives in this case, and if there is a gap they will see it as a break in the series and may not behave as expected.
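
To make the data shapes concrete, here is a minimal sketch of the accumulation this scenario needs (not necessarily what this processor ends up doing); the type, constructor, and string stream key are assumptions for illustration:

package intervalprocessor

// cumulativeState is a hypothetical per-stream accumulator for the scenario
// described above: the count connector emits delta increments (value one), and
// Prometheus wants a cumulative value republished at a fixed interval.
type cumulativeState struct {
	totals map[string]int64 // keyed by stream identity (assumed to be a string here)
}

func newCumulativeState() *cumulativeState {
	return &cumulativeState{totals: make(map[string]int64)}
}

// addDelta folds one incoming delta increment into the running cumulative
// total for its stream and returns the value that would be published next.
func (s *cumulativeState) addDelta(streamID string, delta int64) int64 {
	s.totals[streamID] += delta
	return s.totals[streamID]
}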

Comment:

Looking at the behaviour of the collector's own metrics, when the service's telemetry is configured to export metrics via a periodic reader as follows:

  telemetry:
    metrics:
      level: detailed
      readers:
        - periodic:
            interval: 15000
            exporter:
              otlp:
                endpoint: localhost:4321
                protocol: grpc/protobuf

we see Sum metrics exported at every interval, even when there is no change in value. For example, I see a series of 0 values for metric otelcol_exporter/send_failed_metric_points.

Therefore, I think we somehow need to be able to get this behaviour in the count connector scenario described above (or any case where source metric datapoints are received only on every change in value).

To be clear, I mean here to transform a stream of datapoints received at a variable frequency into a stream of datapoints at a fixed frequency, continuing to publish new datapoints even when there is no change in value, as we see in both the Prometheus scrape case, and the collector's own metrics case.
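
A minimal sketch of that fixed-frequency republishing, assuming a hypothetical cache of the latest payload per stream; the type, field names, and export callback are illustrative only:

package intervalprocessor

import (
	"context"
	"sync"
	"time"

	"go.opentelemetry.io/collector/pdata/pmetric"
)

// intervalExporter republishes whatever state it currently holds on every
// tick, whether or not any new data arrived, so unchanged series keep
// producing data points at the configured interval.
type intervalExporter struct {
	mu     sync.Mutex
	latest map[string]pmetric.Metrics // latest known state per stream identity (assumed key)
}

func (e *intervalExporter) run(ctx context.Context, interval time.Duration, export func(pmetric.Metrics)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			e.mu.Lock()
			for _, md := range e.latest {
				export(md) // re-export even if the value has not changed since last tick
			}
			e.mu.Unlock()
		}
	}
}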

I guess we could rescope this to only publish datapoints on change, but if we do that, we would then need another processor to meet the need described here, and it's not quite clear to me how that other processor could work if it wasn't also doing what this processor would do.

I'm not clear what the concrete use case is for not publishing new datapoints when there has not been a change in value, but if there is a use case for both modes of operation, then maybe this could be selected by a config option?

Member

I think the general idea with not publishing when there is no new information is that, in theory at least, it's not useful.

If we are sure that backends expect the no-new-info points, then we should probably include them in this processor's output. However, I am surprised to hear this, because I have not seen it in my experience (though I have barely touched Prometheus) and because @sh0rez's comment here indicates that Prometheus does not need these data points.

Comment:

Prometheus will consider a timeseries "stale" if it doesn't see any data points for some time (default 5 minutes). See here: https://prometheus.io/docs/prometheus/latest/querying/basics/#staleness

> If no sample is found (by default) 5 minutes before a sampling timestamp, no value is returned for that time series at this point in time. This effectively means that time series "disappear" from graphs at times where their latest collected sample is older than 5 minutes or after they are marked stale.

Imagine you have an alert configured in Prometheus Alert Manager based on the ratio of the collector's metrics otelcol_exporter_send_failed_metric_points and otelcol_exporter_sent_metric_points, say to trigger an alert when the ratio of failed is greater than 10 percent.

Prometheus can only calculate the ratio where it has values for both series, so following the staleness rule above, only up to five minutes since the last datapoint recorded for both series.

We can see how we could easily get in trouble if we didn't publish new data points for a series that didn't change in this case.

If otelcol_exporter_sent_metric_points stopped incrementing and otelcol_exporter_send_failed_metric_points started to increment, then, depending on the time window used to calculate the ratio, it could take more than five minutes for the ratio to exceed 10 percent. So, if we don't continue to publish new datapoints for the unchanging otelcol_exporter_sent_metric_points series, we wouldn't actually trigger the alert in this case.

Of course this is a synthetic example, but I think it illustrates how the convention of Prometheus datasources periodically producing datapoints, even when a series' value hasn't changed, is sometimes relied upon.

XinRanZhAWS pushed a commit to XinRanZhAWS/opentelemetry-collector-contrib that referenced this pull request Mar 13, 2024 (same `metrics/identity` commit as above)
github-actions bot
This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions bot added the Stale label Mar 16, 2024

github-actions bot

Closed as inactive. Feel free to reopen if this PR is still being worked on.

github-actions bot closed this Mar 31, 2024
djaglowski pushed a commit that referenced this pull request Apr 30, 2024
Description:

This PR implements the main logic of this new processor.

Link to tracking Issue:

#29461
This is a re-opening of #30827

Testing:
I added a test for the main aggregation / export behavior, but I need to
add more to test state expiry

Documentation:

I updated the README with the updated configs, but I should add some
examples as well.
rimitchell pushed a commit to rimitchell/opentelemetry-collector-contrib that referenced this pull request May 8, 2024 (same commit as above)