
[receiver/prometheus] Histograms without buckets are dropped #22070

Closed
swiatekm-sumo opened this issue May 18, 2023 · 16 comments
Labels
bug Something isn't working receiver/prometheus Prometheus receiver

Comments

@swiatekm-sumo
Contributor

swiatekm-sumo commented May 18, 2023

Component(s)

receiver/prometheus

What happened?

I wanted to ingest just the sum and count timeseries for a histogram using the prometheus receiver, and used an appropriate relabel config, but the receiver dropped the histogram.

Description

If we ingest a histogram without buckets - only the count and sum parts - prometheusreceiver drops it.

Steps to Reproduce

Simplest way is probably to modify data for one of the unit tests. I verified it with .
Expected Result

I'd have expected to get a histogram without buckets. The spec says this is valid.

Actual Result

The histogram was dropped.

Collector version

0.77.0

Additional context

What I'm trying to do is to ingest only the sum and count parts of the histogram. This works on Prometheus itself.

The alternative would be to ingest the buckets, and then drop them in a processor. There currently aren't any functions to do this to histograms, but we do have ones for summaries, so it wouldn't be a stretch to add them.

I believe this happens as a result of the following check:

```go
if !mg.hasCount || len(mg.complexValue) == 0 {
```
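For illustration only, here is a minimal sketch of that guard and a relaxed variant that would keep bucketless histograms. The `metricGroup` struct and its field names are simplified stand-ins for the receiver's internal types, not the actual implementation:

```go
package main

import "fmt"

// metricGroup is a hypothetical, simplified stand-in for the receiver's
// internal metric-group type; the real struct has more fields.
type metricGroup struct {
	hasCount     bool
	hasSum       bool
	complexValue []float64 // one entry per bucket
}

// strictCheck mirrors the guard quoted above: a histogram passes only if
// it has a count AND at least one bucket.
func strictCheck(mg metricGroup) bool {
	return !(!mg.hasCount || len(mg.complexValue) == 0)
}

// relaxedCheck only requires the count, so sum+count-only
// histograms would be kept.
func relaxedCheck(mg metricGroup) bool {
	return mg.hasCount
}

func main() {
	bucketless := metricGroup{hasCount: true, hasSum: true}
	fmt.Println(strictCheck(bucketless), relaxedCheck(bucketless)) // false true
}
```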

I haven't verified if this affects simpleprometheusreceiver, but I assume it does.

@swiatekm-sumo swiatekm-sumo added bug Something isn't working needs triage New item requiring triage labels May 18, 2023
@github-actions github-actions bot added the receiver/prometheus Prometheus receiver label May 18, 2023
@github-actions
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@dashpole
Contributor

Requiring prometheus histograms to have at least one bucket is in-line with the OpenMetrics spec for histograms:

A Histogram MetricPoint MUST contain at least one bucket, and SHOULD contain Sum, and Created values. Every bucket MUST have a threshold and a value. ... Histogram MetricPoints MUST have one bucket with an +Inf threshold.

So it MUST have an +Inf bucket, whose value is the same as the count of the series.

An OpenTelemetry histogram without any buckets MUST, per the spec, have an Inf bucket if it is exported by a Prometheus exporter:

A series of {name}_bucket metric points that contain all attributes of the histogram point recorded as labels. Additionally, a label, denoted as le is added denoting the bucket boundary. The label’s value is the stringified floating point value of bucket boundaries, ordered from lowest to highest. The value of each point is the sum of the count of all histogram buckets up to the boundary reported in the le label. These points will include a single exemplar that falls within le label and no other le labelled point. The final bucket metric MUST have an +Inf threshold.

How did you encounter a prometheus histogram without any buckets?

@swiatekm-sumo
Contributor Author

How did you encounter a prometheus histogram without any buckets?

I dropped the buckets at ingest using a relabeling rule. To give a more specific example, the Kubernetes Apiserver exposes an apiserver_request_duration_seconds histogram metric. This metric has fairly high cardinality, and a lot of the time you can make do with just the sums and counts, so we dropped the buckets using roughly the following:

```yaml
- action: keep
  regex: apiserver_request_duration_seconds_(?:count|sum)
  sourceLabels:
  - __name__
```

If I do this using plain Prometheus, I get apiserver_request_duration_seconds_count and apiserver_request_duration_seconds_sum as Counters. I'd be fine if the receiver did this as well, but given that Otel allows empty histograms, I thought this would be a simpler solution.

@swiatekm-sumo
Contributor Author

There's also #22853 as a solution to the same problem, but it's not as efficient to have to scrape and process the whole Histogram just to drop all the buckets later.

@swiatekm-sumo
Contributor Author

@dashpole in a similar vein, the receiver rejects a Summary without quantiles, while Prometheus happily accepts it.

I'm reporting this in the context of attempting to migrate a large K8s environment from Prometheus to Otel with prometheus receiver. And so I need to deal with targets producing metrics that aren't up to spec, with scrape configs I don't own or control, and other fun bits like that.

I suppose the fundamental problem here is that Prometheus itself is very permissive about what it accepts and doesn't actually check any of the spec requirements, whereas the receiver is stricter, on both the OpenMetrics and Otel sides.

At a high level, what can we do about this? Some kind of permissive mode where if we run into a problem, we just shrug and ingest the datapoint as Unknown? The Histogram problem which this issue describes can be worked around later in the Otel pipeline, but the Summary problem causes the whole scrape to fail, so the only solution seems to be to fix all the consumers. It would at the very least be nice if we could ingest all the correct data from a given scrape even if there are errors along the way.

@dashpole
Contributor

Thanks, and I'm sorry for making it harder on you. I think we should revisit some of the validation, especially for summaries. Off the top of my head, I don't remember why we require quantiles. The only thing that would make sense would be if quantiles were required in OTLP... if they aren't, we should definitely remove that requirement. OTel only has summaries to support these sorts of "legacy" cases, so extra validation doesn't make any sense for summaries.

My recollection of the +Inf bucket requirement for histograms is that it protected against client-side bugs where some histogram buckets were missing from the exposition. It might be worth making it configurable, or dropping the requirement altogether.

@dashpole
Contributor

I'll raise this at the Prometheus WG next Wednesday.

@dashpole
Contributor

dashpole commented Jul 5, 2023

Summary quantiles should be optional: https://github.com/open-telemetry/opentelemetry-proto/blob/c4dfbc51f3cd4089778555a2ac5d9bc093ed2956/opentelemetry/proto/metrics/v1/metrics.proto#L634. We should definitely remove the requirement to have a quantile.

For histograms, the +Inf bucket is always the same as the _count series. We can either:

  1. Always ignore the +Inf bucket
  2. Validate that the +Inf bucket exists, and/or validate that it is the same as the count.

@swiatekm-sumo
Contributor Author

Summary quantiles should be optional: https://github.com/open-telemetry/opentelemetry-proto/blob/c4dfbc51f3cd4089778555a2ac5d9bc093ed2956/opentelemetry/proto/metrics/v1/metrics.proto#L634. We should definitely remove the requirement to have a quantile.

Good to hear. I'll create a new issue for this, and can work on the change afterwards, if that's ok.

For histograms, the +Inf bucket is always the same as the _count series. We can either:

1. Always ignore the +Inf bucket

2. Validate that the +Inf bucket exists, and/or validate that it is the same as the count.

Maybe just validate that the value of this bucket is correct, if it exists? If it doesn't exist, we can just ignore it.

@dashpole
Contributor

dashpole commented Jul 5, 2023

I was the only one at the wg meeting today, so I wasn't able to discuss it. @Aneurysm9, unless you have concerns, I think we should start to ignore the +Inf bucket. The only potential negative consequence would be that the +Inf bucket would "reappear" if you exported it using a prometheus exporter, but that seems unlikely to cause issues.

@swiatekm-sumo
Contributor Author

@dashpole is there a reason not to merge #23448? That change is relatively benign and non-breaking, whereas dropping +Inf buckets is technically breaking, so it'd be better to do it separately imo.

@dashpole
Contributor

dashpole commented Jul 12, 2023

Ignoring the Inf bucket shouldn't be breaking, right? It is always the same as the count.

@swiatekm-sumo
Contributor Author

swiatekm-sumo commented Jul 12, 2023

I think I was misunderstanding what you wanted to do there. So, just to be clear:

  • The current state is that we calculate our final OTLP bucket (the one without a right bound) from the +Inf Prometheus bucket.
  • We want the output to be identical, but instead calculate it from the total count, and just ignore the value of the +Inf Prometheus bucket completely.

Is that right?

@dashpole
Contributor

Yes. I think the end result would be something that produces a single OTel bucket even if there are no prometheus buckets.
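The conversion described above can be sketched as follows: per-bucket OTLP counts are derived by de-cumulating the explicit Prometheus buckets, and the final OTLP bucket (the one without an upper bound) is computed from the total count, ignoring any explicit +Inf bucket. The function name and types are illustrative, not the receiver's actual internals:

```go
package main

import "fmt"

// toOTLPBucketCounts converts cumulative Prometheus bucket values into
// OTLP-style per-bucket counts. The final OTLP bucket is derived from
// totalCount rather than from an explicit +Inf bucket, so a histogram
// scraped with no buckets at all still yields a single OTLP bucket
// equal to its count.
func toOTLPBucketCounts(cumulative []uint64, totalCount uint64) []uint64 {
	counts := make([]uint64, 0, len(cumulative)+1)
	var prev uint64
	for _, c := range cumulative {
		counts = append(counts, c-prev) // de-cumulate
		prev = c
	}
	// Everything above the highest explicit bound.
	counts = append(counts, totalCount-prev)
	return counts
}

func main() {
	// Cumulative buckets le="0.5"=3, le="1"=7; total count 10.
	fmt.Println(toOTLPBucketCounts([]uint64{3, 7}, 10)) // [3 4 3]
	// No explicit buckets: single OTLP bucket equal to the count.
	fmt.Println(toOTLPBucketCounts(nil, 10)) // [10]
}
```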

@swiatekm-sumo
Contributor Author

Alright, that's what I've done in #23448.

mx-psi pushed a commit that referenced this issue Aug 1, 2023
**Description:**
Prometheus receiver currently drops Histograms without any buckets.
These are, however, explicitly allowed by the Otel spec, and can be
quite useful. This change allows ingesting them. When we do so, we add
an additional bucket at +Inf equal to the `count` attribute of the
Histogram.

**Link to tracking Issue:** #22070 

**Testing:**
Modified existing tests.
@swiatekm-sumo
Contributor Author

Fixed in #23448
