
Aggregate metric datapoints over time period #29461

Closed · 0x006EA1E5 opened this issue Nov 23, 2023 · 30 comments
Labels: Accepted Component (New component has been sponsored), Stale

@0x006EA1E5 commented Nov 23, 2023

The purpose and use-cases of the new component

This processor would receive a stream of datapoints for a metric timeseries and emit a single "aggregate" datapoint at a set interval.

We can achieve something similar already by exporting metrics to the Prometheus exporter, then periodically scraping the prom endpoint with a Prometheus receiver. However, this is clunky and somewhat less efficient than using a dedicated processor.

One concrete use case is where we want to send high-frequency metric datapoints to the Prometheus remote write exporter, for example datapoints produced by the count connector. When counting spans, the count connector produces a single delta datapoint (increment value 1) for each counted span, which could of course be many times per second. However, we would typically only want to remote-write to Prometheus periodically, as we would if we were scraping, perhaps once every 30 seconds. This is especially true for downstream metric sinks that charge based on datapoints per minute.

This proposed processor would be stateful, tracking metrics by identity and maintaining a single aggregate value per timeseries. This aggregate would be emitted every interval.

For example, if the processor received the following delta datapoints: 1, 3, 5, then at the next "tick" of the interval clock, a single delta datapoint of 9 would be emitted.

Similarly, if the processor received the following cumulative datapoints: 7, 9, 11, then at the next "tick" of the interval clock, a single cumulative datapoint of 11 would be emitted.
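
A minimal Go sketch of the intended semantics (types and names are purely illustrative, not the actual implementation):

```go
package main

import "fmt"

type temporality int

const (
	deltaTemp temporality = iota
	cumulativeTemp
)

// aggregate keeps one running value per timeseries between interval ticks.
type aggregate struct {
	temporality temporality
	value       float64
}

// update folds a new datapoint into the running aggregate.
func (a *aggregate) update(v float64) {
	switch a.temporality {
	case deltaTemp:
		a.value += v // deltas are summed: 1, 3, 5 -> 9
	case cumulativeTemp:
		a.value = v // cumulatives keep the latest reported value: 7, 9, 11 -> 11
	}
}

func main() {
	d := &aggregate{temporality: deltaTemp}
	for _, v := range []float64{1, 3, 5} {
		d.update(v)
	}
	c := &aggregate{temporality: cumulativeTemp}
	for _, v := range []float64{7, 9, 11} {
		c.update(v)
	}
	// At the next interval tick, a single datapoint per series is emitted.
	fmt.Println(d.value, c.value) // 9 11
}
```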

There would be a "max_staleness" config option so that we can stop tracking metrics which don't receive any data for a given time.

Example configuration for the component

max_staleness
Include/exclude MatchMetrics
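
Purely as an illustration, the options might be modeled along these lines in the component's Go config (field names and types are assumptions, not a final design):

```go
// Hypothetical config sketch for the proposed processor.
package intervalprocessor

import "time"

// MatchMetrics stands in for whatever include/exclude matching rules the
// component ends up supporting (e.g. metric name patterns).
type MatchMetrics struct {
	MetricNames []string `mapstructure:"metric_names"`
}

// Config sketches the proposed options.
type Config struct {
	// Interval is how often the aggregated state is emitted.
	Interval time.Duration `mapstructure:"interval"`
	// MaxStaleness stops tracking a timeseries that has received no
	// datapoints for this long.
	MaxStaleness time.Duration `mapstructure:"max_staleness"`
	// Include/Exclude select which metrics are aggregated; unmatched
	// metrics could be passed through untouched.
	Include MatchMetrics `mapstructure:"include"`
	Exclude MatchMetrics `mapstructure:"exclude"`
}
```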

Telemetry data types supported

Metrics

Is this a vendor-specific component?

  • This is a vendor-specific component
  • If this is a vendor-specific component, I am proposing to contribute and support it as a representative of the vendor.

Code Owner(s)

No response

Sponsor (optional)

@djaglowski

Additional context

No response

@0x006EA1E5 added the "needs triage (New item requiring triage)" and "Sponsor Needed (New component seeking sponsor)" labels on Nov 23, 2023
@crobert-1 added the "Accepted Component (New component has been sponsored)" label and removed the "Sponsor Needed" and "needs triage" labels on Nov 23, 2023
@atoulme (Contributor) commented Dec 13, 2023

@crobert-1 who is the sponsor for this issue?

@crobert-1 (Member) commented Dec 13, 2023

@crobert-1 who is the sponsor for this issue?

@djaglowski volunteered to be the sponsor here.

@verejoel commented:

This is something that is on our radar, and we would like to support it as much as possible. Our use case is to enable remote writing of metrics from the count connector to Thanos.

@RichieSams (Contributor) commented:

Some clarifications of the spec for this one, given that #30479 will exist:

  • All non-cumulative metrics will be dropped. If you have delta metrics, use the new deltatocumulative processor (#30479) to convert them to cumulative
  • The aggregator will only store the "last" value received per unique combo of metric name + labels

@djaglowski Do you have any thoughts for a name? timeaggregationprocessor?

@djaglowski (Member) commented:

@RichieSams, this all looks good to me.

@sh0rez (Contributor) commented Jan 23, 2024

other name idea: intervalprocessor (because it emits at fixed intervals)

for implementing this, you may want to take a look at streams.Ident from #30707, which can be used as a map[streams.Ident]T for storing last values for each distinct sample stream
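
for illustration, the last-value bookkeeping could be as small as an identity-keyed map (streamID below is just a stand-in for streams.Ident, not its actual API):

```go
// Sketch of last-value storage keyed by stream identity. In practice the
// key would be derived from the resource, scope, metric name, and attributes.
package intervalstate

import "time"

type streamID string

type lastValue struct {
	value     float64
	timestamp time.Time
}

// store keeps only the newest datapoint seen for each distinct stream.
type store map[streamID]lastValue

func (s store) update(id streamID, v float64, ts time.Time) {
	if prev, ok := s[id]; ok && !ts.After(prev.timestamp) {
		return // ignore stale or out-of-order points
	}
	s[id] = lastValue{value: v, timestamp: ts}
}
```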

@RichieSams (Contributor) commented:

Thanks for the pointer! I like that name as well; it's less of a mouthful.

@verejoel commented:

Just so I understand the current situation:

Is this correct?

@RichieSams (Contributor) commented:

Not quite, #30479 will only convert delta to cumulative. I.e., in ideal settings, there will be a 1-to-1 mapping of input metrics to output metrics.

#29461 (this processor) will only do aggregation.

@RichieSams (Contributor) commented:

As an added comment: I have started working on this issue, PRs to follow in the coming days

@djaglowski (Member) commented:

Thanks @RichieSams!

@0x006EA1E5 (Author) commented:

There is no reason why we can't aggregate deltas over time too, though, right?

@RichieSams (Contributor) commented:

There is no reason why we can't aggregate deltas over time too, though, right?

It just duplicates the code of the new deltatocumulativeprocessor. Unless you mean something else like:

  1. Accumulate delta metrics
  2. At each interval export the current sum for each delta
  3. Reset all the deltas back to zero

This could work, but I'd be curious about the use-cases for that, vs. just converting to cumulative and aggregating those.

@0x006EA1E5 (Author) commented:

There is no reason why we can't aggregate deltas over time too, though, right?

It just duplicates the code of the new deltatocumulativeprocessor. Unless you mean something else like:

  1. Accumulate delta metrics
  2. At each interval export the current sum for each delta
  3. Reset all the deltas back to zero

This could work, but I'd be curious about the use-cases for that, vs. just converting to cumulative and aggregating those.

Yes exactly, many deltas in, one delta out. I think the batch processor does something similar, although the output trigger is batch size, not a clock period.

I guess one use case would be where deltas are preferred/required downstream, but you want to reduce the data rate, e.g., sending a single delta of 10,000 over the wire after 15s is much more efficient than sending 10,000 deltas of 1.

I'm not saying this has to be in scope for the initial implementation if there is no immediate need, but just trying to think about how all this might work.

@0x006EA1E5 (Author) commented:

  • All non-cumulative metrics will be dropped.

I'm assuming that the config across all these processors will be somewhat consistent.

The Cumulative to Delta Processor can be configured with metric include and exclude rules.

Isn't it more appropriate to simply pass through metric data that isn't matched?

If a user wants to drop a certain metric, they should configure a filter processor, no?

This gives maximum flexibility, and the pipeline is nice and explicit.

Well, I guess it's best to align with what other processors do in comparable situations...

@RichieSams (Contributor) commented:

I think the batch processor does something similar, although the output trigger is batch size, not a clock period.

It looks like the batchprocessor doesn't do any aggregation. It collects groups of metrics, and then sends them all at once in a single go. https://github.com/open-telemetry/opentelemetry-collector/blob/main/processor/batchprocessor/batch_processor.go#L425

I.e., it would collect 10,000 deltas and send them all at once to the next component in the pipeline, whereas without it the next component would get 10,000 individual requests, each with a single metric. So it's optimizing for things like TCP connection / request overhead.

I'm not saying this has to be in scope for the initial implementation if there is no immediate need, but just trying to think about how all this might work.

For sure. I think it could be added in a future scope if the need arises. I don't think that would impact the immediate design.

@RichieSams (Contributor) commented:

Isn't it more appropriate to simply pass through metric data that isn't matched?

I don't have a strong opinion either way. I'd go with whatever the convention is for other processors.

So for delta metrics, we'd "consume" all the metrics and then periodically export on the interval. For cumulative metrics, we'd pass them to the next component in the chain untouched. Yes?

@0x006EA1E5 (Author) commented:

Isn't it more appropriate to simply pass through metric data that isn't matched?

I don't have a strong opinion either way. I'd go with whatever the convention is for other processors.

So for delta metrics, we'd "consume" all the metrics and then periodically export on the interval. For cumulative metrics, we'd pass them to the next component in the chain untouched. Yes?

I think it's the other way round😅.

The principal use-case is for cumulative metrics, as produced by the new delta to cumulative processor.

But otherwise, yes, exactly 👍

@RichieSams (Contributor) commented:

Right right, my brain is fried today. lol

@RichieSams (Contributor) commented Feb 21, 2024

@djaglowski @sh0rez @0x006EA1E5 How should this processor export the aggregated metrics?

The current implementation will export the metrics every X seconds, indefinitely. max_staleness affects if/when metrics will no longer be exported. I.e.:

  1. As metrics come in, they are aggregated
  2. New metrics are added to the state
  3. Existing metrics replace whatever was already in the state (IFF their timestamp is newer)

.

  1. In parallel to this, an exporter runs every X seconds
  2. It exports all metrics currently stored in the state (Should it update all their timestamps to "now()"?)

However, I realized there is another potential approach. The processor could aggregate metrics over time, and then on an interval, export the aggregate once, flushing the state to empty.

  1. As metrics come in, they are aggregated
  2. New metrics are added to the state
  3. Existing metrics replace whatever was already in the state (IFF their timestamp is newer)

.

  1. In parallel to this, an exporter runs every X seconds
  2. It exports all metrics currently stored in the state (Should it update all their timestamps to "now()"?)
  3. The state is cleared to empty

In both cases, a mutex is used to ensure aggregation and export are serialized with each other.
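
A rough sketch of option 1 (names are illustrative only; option 2 would additionally clear the state after each export):

```go
// Illustrative only: shows ingestion and a ticker-driven export serialized
// by a mutex. Not the actual processor code.
package intervalsketch

import (
	"sync"
	"time"
)

type processor struct {
	mu    sync.Mutex
	state map[string]float64 // stream identity -> latest aggregated value
	emit  func(map[string]float64)
	done  chan struct{}
}

// consume is called for every incoming datapoint.
func (p *processor) consume(id string, value float64) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.state[id] = value // cumulative: keep the newest value per stream
}

// start runs the export loop every interval until done is closed.
func (p *processor) start(interval time.Duration) {
	t := time.NewTicker(interval)
	go func() {
		defer t.Stop()
		for {
			select {
			case <-t.C:
				p.mu.Lock()
				snapshot := make(map[string]float64, len(p.state))
				for k, v := range p.state {
					snapshot[k] = v
				}
				// Option 2 would do: p.state = map[string]float64{}
				p.mu.Unlock()
				p.emit(snapshot)
			case <-p.done:
				return
			}
		}
	}()
}
```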

@0x006EA1E5 (Author) commented Feb 23, 2024

However, I realized there is another potential approach. The processor could aggregate metrics over time, and then on an interval, export the aggregate once, flushing the state to empty.

I think the first approach is correct: the exporter publishes the latest / current value every interval.

It exports all metrics currently stored in the state (Should it update all their timestamps to "now()"?)

Yes, the timestamp should be updated to now(). Or, to be precise, the exported datapoint should have the timestamp of the instant it was exported. Looking at the spec, cumulative datapoints can also have a start_time timestamp, which in our case would be the timestamp of the previous publish (or, if this is the first publish for the timeseries, I think the spec describes what should happen in the various edge cases). The output is then a series of contiguous, non-overlapping intervals with no gaps, starting from the first datapoint received.

This is then similar to how things work if we were to publish a metric to a prometheus exporter, and then scrape that back in with a prometheus receiver. We can export datapoints to a prometheus exporter at any rate, but as the prometheus receiver scrapes at a set, steady interval, it just produces one datapoint per timeseries every scrape interval. This includes the case where no new datapoints were sent to the prometheus exporter; the next scrape will just output a new datapoint with the now() timestamp (as I understand it).

Doing it this way meets a key use-case of working with native OTLP metrics in the whole pipeline, until eventually sending to a prometheusremotewrite exporter.
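
As a sketch of that timestamp handling using the pdata API (the helper and its arguments are illustrative, not the actual implementation):

```go
// Sketch of the timestamp behavior described above: the export instant
// becomes the point timestamp, and the stream's remembered start time is
// preserved. Assumes the processor tracks a start time per stream.
package timestampsketch

import (
	"time"

	"go.opentelemetry.io/collector/pdata/pcommon"
	"go.opentelemetry.io/collector/pdata/pmetric"
)

// stampForExport writes the aggregated value onto an outgoing datapoint.
func stampForExport(dp pmetric.NumberDataPoint, value float64, seriesStart, now time.Time) {
	dp.SetDoubleValue(value)
	dp.SetStartTimestamp(pcommon.NewTimestampFromTime(seriesStart))
	dp.SetTimestamp(pcommon.NewTimestampFromTime(now))
}
```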

@0x006EA1E5 (Author) commented:

The processor could aggregate metrics over time, and then on an interval, export the aggregate once, flushing the state to empty.

I think if/when we have the ability to work with delta datapoints, we would do something similar, or at least set the value to zero (we would still keep state for the timeseries).

But as I understand it, we are just looking at cumulative datapoints for now, and will pass-through deltas, right?

@RichieSams (Contributor) commented:

cumulative datapoints can also have a start_time timestamp, which in our case would be the timestamp of the previous publish

Wouldn't start_timestamp just be inherited? I.e., if the incoming metrics have it, we leave it as-is.

@0x006EA1E5 (Author) commented Feb 28, 2024

Looking at the spec, for cumulative datapoints the start_time_unix_nano should be the start of the series, i.e. the same value for each subsequent datapoint.

Contrast with cumulative aggregation temporality where we expect to report the full sum since 'start' (where usually start means a process/application start):

(I will edit my comment above to correct it)

So I suppose in our case, that is going to mean the cumulative datapoints' start_time_unix_nano will be the start_time_unix_nano of the first datapoint received, as you suggest :)

@0x006EA1E5 (Author) commented:

It is not clear to me, but it looks like a Sum consists of NumberDataPoints, and start_time_unix_nano is optional for those: https://github.com/open-telemetry/opentelemetry-proto/blob/v0.9.0/opentelemetry/proto/metrics/v1/metrics.proto#L395

@RichieSams (Contributor) commented:

@sh0rez Can I get your opinion on the two export options I presented above? We have one vote for each atm :P

@sh0rez (Contributor) commented Mar 1, 2024

imo the reason for this processor to exist is to limit the flowrate of datapoints (datapoints per minute, etc).
my gut feeling for how it should work:

  1. Let $I$ be the aggregation interval.
  2. All incoming samples during $\Delta I$ are collected.
  3. If delta:
    1. accumulate the values into one big delta
    2. use min(start_timestamp). this is important because the start->time interval defines what the delta accounts for.
    3. use max(timestamp)
  4. If cumulative:
    1. export the latest value unaltered
    2. start will not change per spec. if it does, it signals a restart and that needs to be retained.
    3. time is important to keep, as we shall not "lie values" into the future.

In the case where no new datapoints are received within $I$, nothing shall be exported either.
For delta this is correct because zero delta carries no information over no delta at all.
For cumulative this is correct because there is nothing new to report. This matches the behavior of Prometheus: If a scrape fails / the target goes away, no new points are written. If it comes back, there's a gap in the series that signals what happened.

We can leave the delta case out for now.
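
a compact sketch of those merge rules, with hypothetical types standing in for pdata (the delta path is included for completeness even though it's out of scope for now):

```go
// Sketch of the merge rules above. Callers export nothing at all when the
// interval saw no points, per the behavior described in this comment.
package mergesketch

import "time"

type point struct {
	start, ts time.Time
	value     float64
}

// mergeDelta folds all points seen during the interval into one delta:
// values are summed, start is min(start), ts is max(ts).
func mergeDelta(points []point) point {
	out := points[0]
	for _, p := range points[1:] {
		out.value += p.value
		if p.start.Before(out.start) {
			out.start = p.start
		}
		if p.ts.After(out.ts) {
			out.ts = p.ts
		}
	}
	return out
}

// mergeCumulative exports the latest point unaltered; start and ts are kept
// so restarts stay visible and values are not "lied" into the future.
func mergeCumulative(points []point) point {
	latest := points[0]
	for _, p := range points[1:] {
		if p.ts.After(latest.ts) {
			latest = p
		}
	}
	return latest
}
```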

@0x006EA1E5 (Author) commented:

time is important to keep, as we shall not "lie values" into the future.

Yes this makes sense 👍

@sh0rez

imo the reason for this processor to exist is to limit the flowrate of datapoints (datapoints per minute, etc).

I would say the reason is to fix the flowrate, rather than just limit it.

I think we have an open question as to whether we should publish a new cumulative datapoint at every interval in the case where the cumulative value has not changed. E.g., would we publish a stream of 0s every interval, or just publish the first 0 and then only publish another datapoint when there is a new / different value?

I described my use case in more detail here: #30827 (comment)

RichieSams added a commit to RichieSams/opentelemetry-collector-contrib that referenced this issue Apr 1, 2024
RichieSams added a commit to RichieSams/opentelemetry-collector-contrib that referenced this issue Apr 17, 2024
RichieSams added a commit to RichieSams/opentelemetry-collector-contrib that referenced this issue Apr 17, 2024
RichieSams added a commit to RichieSams/opentelemetry-collector-contrib that referenced this issue Apr 18, 2024
RichieSams added a commit to RichieSams/opentelemetry-collector-contrib that referenced this issue Apr 23, 2024
RichieSams added a commit to RichieSams/opentelemetry-collector-contrib that referenced this issue Apr 23, 2024
RichieSams added a commit to RichieSams/opentelemetry-collector-contrib that referenced this issue Apr 23, 2024
RichieSams added a commit to RichieSams/opentelemetry-collector-contrib that referenced this issue Apr 23, 2024
RichieSams added a commit to RichieSams/opentelemetry-collector-contrib that referenced this issue Apr 24, 2024
djaglowski pushed a commit that referenced this issue Apr 30, 2024
Description:

This PR implements the main logic of this new processor.

Link to tracking Issue:

#29461
This is a re-opening of #30827.

Testing:
I added a test for the main aggregation / export behavior, but I need to
add more to test state expiry

Documentation:

I updated the README with the updated configs, but I should add some
examples as well.

github-actions bot commented May 1, 2024

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

@github-actions github-actions bot added the Stale label May 1, 2024
@RichieSams (Contributor) commented:

Closed by: #32054

rimitchell pushed a commit to rimitchell/opentelemetry-collector-contrib that referenced this issue May 8, 2024
jlg-io pushed a commit to jlg-io/opentelemetry-collector-contrib that referenced this issue May 14, 2024
jpkrohling added a commit that referenced this issue May 21, 2024
…code owners (#33019)

**Description:**
@jpkrohling and @djaglowski volunteered to be sponsors of the delta to
cumulative processor, and @djaglowski also volunteered to be sponsor of
the interval processor in relation to this. They should also be code
owners.

From
[CONTRIBUTING.md](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/CONTRIBUTING.md#adding-new-components):
```
A sponsor is an approver who will be in charge of being the official reviewer of the code and become a code owner for the component.
```

**Link to tracking Issue:**

#30479
- Delta to cumulative processor

#29461
- Interval processor

---------

Co-authored-by: Juraci Paixão Kröhling <juraci@kroehling.de>
jpkrohling added a commit to t00mas/opentelemetry-collector-contrib that referenced this issue May 21, 2024