PromQL: Correctly detect counter resets in `_sum` of Histograms and Summaries

## Bug Report
**What did you do?**

I did exactly what is described in the [documentation](https://prometheus.io/docs/practices/histograms/#count-and-sum-of-observations), just that the Summary or Histogram involved is tracking the SpamAssassin of mails on my mail server. This is the query:

```PromQL
  rate(spamassassin_score_sum[1h])
/
  rate(spamassassin_score_count[1h])
```

**What did you expect to see?**

The average SpamAssassin score of the mails scanned over the last 1 hour.

**What did you see instead? Under which circumstances?**

Really weird numbers. The reason is that SpamAssassin scores can be negative, leading to the `spamassassin_score_sum` going down quite frequently, which the `rate` function sees as a counter reset, hugely blowing up the resulting rate.

**Environment**

This bug has been around forever.

**Context**

This is a bug known essentially forever. It's kind of “by design”, which is probably the reason why it was never filed as a formal issue. Another reason might be that it is rarely of relevance. The overwhelming majority of use cases for Summaries and Histograms are request latencies, response sizes, and similar observations that never become negative. However, there are niche use cases with negative observations (anybody remembers the “scientific experimentation” from the original README? ;-). And a bug that strikes rarely comes with the “advantage” of surprise when it strikes eventually. This bug has always been in my thoughts, and I have even talked about it whenever the opportunity came up. Yet another excuse for not filing a formal issue is that I used to think this bug would solve itself once the internal Prometheus data model can represent structured metrics properly. For as long as Prometheus exists, my hope was that the next major release would give us this ability, but it hasn't happened yet, and realistically we have no idea when it will happen. Thus, all those excuses are moot, and we should finally have a real issue about it, even if it is only to document that it is a known issue.

Another reason to do it now is that OpenMetrics apparently considers to remove the `..._sum` component from a Histogram if it has negative buckets. That would be really a pity as it would _remove_ the ability to calculate averages instead of _fixing_ it. (Also, what about Summaries, which don't have buckets?)

It is quite problematic to let a known deficiency of Prometheus leak into an open standard like OpenMetrics, effectively preventing all systems using OpenMetrics from calculating averages from Histograms and Summaries with negative observations. So having an issue to point to might be helpful. 

**Suggested solutions**

The fundamental problem is that counter resets in `spamassassin_score_sum` cannot be detected by looking for the value going down. However, counter reset detection works just fine in `spamassassin_score_count`. The fundamental solution is thus to detect the counter resets in `spamassassin_score_count` only and then apply the same timing of counter resets to `spamassassin_score_sum`.

The direct approach would be to bake that logic into the `rate` function directly, i.e. if `rate` is applied to a metric named `<x>_sum`, it would look for a metric called `<x>_count` and take the timing of the counter resets from there. The problem here is that we rely on naming conventions to find out about metrics that belong together. In most cases, it should work, but who knows what metric names people have come up with out there. This approach would also be a breaking change (although it would only affect expressions that were nonsensical anyway, unless of course that name collision strikes, in which case it would be breaking in a really really surprising way).

A more explicit approach would be to provide a new function called `rate_with_base_counter` (or similar), which takes a second parameter to deduce the counter resets from. Above expression would become
```PromQL
  rate_with_base_counter(spamassassin_score_sum[1h], spamassassin_score_count)
/
  rate(spamassassin_score_count[1h])
```

If this problem only ever occurs in the calculation of an average, we could also go for a more convenient and easier to understand function `rate_ratio`, which takes two range selectors, calculates the rate of each but taking counter resets only from the 2nd parameter, and then calculates the ratio. Expression would then look like:
```PromQL
rate_ratio(spamassassin_score_sum[1h], spamassassin_score_count[1h])
```


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PromQL: Correctly detect counter resets in `_sum` of Histograms and Summaries #6669

Bug Report

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

PromQL: Correctly detect counter resets in _sum of Histograms and Summaries #6669

Description

Bug Report

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

PromQL: Correctly detect counter resets in `_sum` of Histograms and Summaries #6669