WIP: [processor/interval] Implement the main logic #30827

Closed
wants to merge 11 commits into from

Conversation

RichieSams (Contributor)

Description:

This PR implements the main logic of this new processor.

Link to tracking Issue:
#29461

Testing:
I added a test for the main aggregation / export behavior, but many more need to be added to test the various edge cases.

Documentation:

I updated the README with the updated configs, but I should add some examples as well.

@RichieSams (Contributor Author)

@djaglowski This isn't ready to merge yet, but I wanted to get what I had posted so you all could give initial feedback.

cc @sh0rez and @0x006EA1E5. I'd love your feedback as well, if you have time.

RichieSams marked this pull request as draft January 29, 2024 14:12
@djaglowski (Member) left a comment

@RichieSams, I apologize, but I'm finding this difficult to review while we have another processor under development that has to solve many of the same concerns: caching rules, identity, emit interval. Since the other processor appears to have a more detailed design proposal, I'd like to wait on this and see if we can follow it by default, diverging only where necessary.

@RichieSams (Contributor Author)

> @RichieSams, I apologize, but I'm finding this difficult to review while we have another processor under development that has to solve many of the same concerns: caching rules, identity, emit interval. Since the other processor appears to have a more detailed design proposal, I'd like to wait on this and see if we can follow it by default, diverging only where necessary.

That's fine. I understand.

jpkrohling pushed a commit that referenced this pull request Feb 19, 2024
Adds a new internal, _experimental_ package `metrics/identity` which
implements identity types for resource, scope, metric and stream.

This is closely related to work being done in #30707 and #30827.

The package is explicitly experimental: it is to be treated as an internal
component of the above processors and may change at any moment while those
are under active initial development.

/cc @jpkrohling @djaglowski @RichieSams
djaglowski pushed a commit to djaglowski/opentelemetry-collector-contrib that referenced this pull request Feb 19, 2024 (same `metrics/identity` commit as above)

sum := m.Sum()                          // the Sum metric being built for export
numDP := sum.DataPoints().AppendEmpty() // append a fresh data point
dp.DataPoint.CopyTo(numDP)              // copy the stored data point into it
@RichieSams (Contributor Author)

We need to update the timestamp to now(). now() should be calculated once at the top of the export pass, and the same value used for every data point.
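
For illustration, a minimal sketch of what that could look like, assuming the pdata `NumberDataPointSlice` API; the package and helper names are hypothetical, not the actual processor code:

package intervalprocessor

import (
	"time"

	"go.opentelemetry.io/collector/pdata/pcommon"
	"go.opentelemetry.io/collector/pdata/pmetric"
)

// stampExportTimestamps is a hypothetical helper: now() is computed a single
// time at the top of an export pass and every exported data point is stamped
// with that same value.
func stampExportTimestamps(dps pmetric.NumberDataPointSlice) {
	now := pcommon.NewTimestampFromTime(time.Now()) // calculated once
	for i := 0; i < dps.Len(); i++ {
		dps.At(i).SetTimestamp(now) // same value used for everything in this pass
	}
}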

Member

I don't think we should set any timestamps to "now". We are aggregating data points, so we should use the timestamps they carry in a meaningful way. The appropriate behavior may depend on the type of metric, but for example if aggregating a sum, I would expect the start time to be the earliest from all data points being aggregated, and the end time to be the latest from those data points.
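
A rough sketch of that alternative (hypothetical helper name, assuming the pdata API): the aggregated point keeps the earliest start time and the latest end time of the points being merged:

package intervalprocessor

import "go.opentelemetry.io/collector/pdata/pmetric"

// mergeSumTimestamps is a sketch of the behaviour described above: the
// aggregated point keeps the earliest start time and the latest end time
// seen across the data points being aggregated.
func mergeSumTimestamps(agg pmetric.NumberDataPoint, in pmetric.NumberDataPointSlice) {
	for i := 0; i < in.Len(); i++ {
		dp := in.At(i)
		if agg.StartTimestamp() == 0 || dp.StartTimestamp() < agg.StartTimestamp() {
			agg.SetStartTimestamp(dp.StartTimestamp()) // earliest start wins
		}
		if dp.Timestamp() > agg.Timestamp() {
			agg.SetTimestamp(dp.Timestamp()) // latest end wins
		}
	}
}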

@RichieSams (Contributor Author)

If we don't update the timestamp, then why are we continuing to re-export them? Depending on the final destination, we're either creating duplicate data, or requiring the destination to implement data deduplication.

Member

It's not clear to me why we would continue to re-export data points when there is no new data to report. That said, I haven't had time to focus on the nuances of this processor and likely won't for the foreseeable future so I'll leave it to others to decide. It just seems to me that we should be reporting a direct representation of the data points we actually receive.

@RichieSams (Contributor Author)

Right. What you're describing is then similar to the "second" approach I describe here: #29461 (comment)

I think both approaches are "valid"; we just need to decide which one this processor is doing.

Member

Thanks for pointing that out. The second approach sounds better to me.

Comment:

For us, the principal use case of this set of features is to be able to send count connector metrics to a Prometheus remote write endpoint.

The count connector sends a stream of increment data points (deltas with value one).

Prometheus expects a steady stream of cumulative data points, published at the "scrape interval".

For a Prometheus scrape, consecutive reads of a counter which hasn't incremented will read the same value each time, so we see the behavior of a steady stream of repeating, unchanging values, every interval.

It is my understanding that downstream metrics systems based on Prometheus expect to see this steady stream of unchanging cumulatives in this case, and if there is a gap they will see it as a break in the series and may not behave as expected.
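
To make the data shapes concrete, here is a minimal sketch of the accumulation this scenario needs (not necessarily what this processor ends up doing); the type, constructor, and string stream key are assumptions for illustration:

package intervalprocessor

// cumulativeState is a hypothetical per-stream accumulator for the scenario
// described above: the count connector emits delta increments (value one), and
// Prometheus wants a cumulative value republished at a fixed interval.
type cumulativeState struct {
	totals map[string]int64 // keyed by stream identity (assumed to be a string here)
}

func newCumulativeState() *cumulativeState {
	return &cumulativeState{totals: make(map[string]int64)}
}

// addDelta folds one incoming delta increment into the running cumulative
// total for its stream and returns the value that would be published next.
func (s *cumulativeState) addDelta(streamID string, delta int64) int64 {
	s.totals[streamID] += delta
	return s.totals[streamID]
}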

Comment:

Looking at the behaviour of the collector's own metrics, when the service's telemetry is configured to export metrics via a periodic reader as follows:

  telemetry:
    metrics:
      level: detailed
      readers:
        - periodic:
            interval: 15000
            exporter:
              otlp:
                endpoint: localhost:4321
                protocol: grpc/protobuf

we see Sum metrics exported at every interval, even when there is no change in value. For example, I see a series of 0 values for metric otelcol_exporter/send_failed_metric_points.

Therefore, I think we somehow need to be able to get this behaviour in the count connector scenario described above (or any case where source metric datapoints are received only on every change in value).

To be clear, I mean here to transform a stream of datapoints received at a variable frequency into a stream of datapoints at a fixed frequency, continuing to publish new datapoints even when there is no change in value, as we see in both the Prometheus scrape case, and the collector's own metrics case.
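
A minimal sketch of that fixed-frequency republishing, assuming a hypothetical cache of the latest payload per stream; the type, field names, and export callback are illustrative only:

package intervalprocessor

import (
	"context"
	"sync"
	"time"

	"go.opentelemetry.io/collector/pdata/pmetric"
)

// intervalExporter republishes whatever state it currently holds on every
// tick, whether or not any new data arrived, so unchanged series keep
// producing data points at the configured interval.
type intervalExporter struct {
	mu     sync.Mutex
	latest map[string]pmetric.Metrics // latest known state per stream identity (assumed key)
}

func (e *intervalExporter) run(ctx context.Context, interval time.Duration, export func(pmetric.Metrics)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			e.mu.Lock()
			for _, md := range e.latest {
				export(md) // re-export even if the value has not changed since last tick
			}
			e.mu.Unlock()
		}
	}
}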

I guess we could rescope this to only publish datapoints on change, but if we do that, we would then need another processor to meet the need described here, and it's not quite clear to me how that other processor could work if it wasn't also doing what this processor would do.

I'm not clear what the concrete use case is for not publishing new datapoints when there has not been a change in value, but if there is a use case for both modes of operation, then maybe this could be selected by a config option?

Member

I think the general idea with not publishing when there is no new information is that, in theory at least, it's not useful.

If we are sure that backends expect the no-new-info points, then we should probably include them in this processor's output. However, I am surprised to hear this, because I have not seen it in my experience (though I have barely touched Prometheus) and because @sh0rez's comment here indicates that Prometheus does not need these data points.

Comment:

Prometheus will consider a timeseries "stale" if it doesn't see any data points for some time (default 5 minutes). See here: https://prometheus.io/docs/prometheus/latest/querying/basics/#staleness

> If no sample is found (by default) 5 minutes before a sampling timestamp, no value is returned for that time series at this point in time. This effectively means that time series "disappear" from graphs at times where their latest collected sample is older than 5 minutes or after they are marked stale.

Imagine you have an alert configured in Prometheus Alert Manager based on the ratio of the collector's metrics otelcol_exporter_send_failed_metric_points and otelcol_exporter_sent_metric_points, say to trigger an alert when the ratio of failed is greater than 10 percent.

Prometheus can only calculate the ratio where it has values for both series, so following the staleness rule above, only up to five minutes since the last datapoint recorded for both series.

We can see how we could easily get in trouble if we didn't publish new data points for a series that didn't change in this case.

If otelcol_exporter_sent_metric_points stopped incrementing and otelcol_exporter_send_failed_metric_points started to increment, then, depending on the time window used to calculate the ratio, it could take more than five minutes for the ratio to exceed 10 percent. So, if we don't continue to publish new datapoints for the unchanging otelcol_exporter_sent_metric_points series, we wouldn't actually trigger the alert in this case.

Of course this is a synthetic example, but I think it illustrates how the convention of Prometheus datasources periodically producing datapoints, even when a series' value hasn't changed, is sometimes relied upon.

XinRanZhAWS pushed a commit to XinRanZhAWS/opentelemetry-collector-contrib that referenced this pull request Mar 13, 2024 (same `metrics/identity` commit as above)
github-actions bot
This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions bot added the Stale label Mar 16, 2024

github-actions bot

Closed as inactive. Feel free to reopen if this PR is still being worked on.

github-actions bot closed this Mar 31, 2024
djaglowski pushed a commit that referenced this pull request Apr 30, 2024
Description:

This PR implements the main logic of this new processor.

Link to tracking Issue:

#29461
This is a re-opening of #30827

Testing:
I added a test for the main aggregation / export behavior, but I need to
add more to test state expiry

Documentation:

I updated the README with the updated configs, but I should add some
examples as well.
rimitchell pushed a commit to rimitchell/opentelemetry-collector-contrib that referenced this pull request May 8, 2024 (same commit as above)