Bug Report
What did you do?
I did exactly what is described in the documentation, just that the Summary or Histogram involved is tracking the SpamAssassin of mails on my mail server. This is the query:
rate(spamassassin_score_sum[1h])
/
rate(spamassassin_score_count[1h])
What did you expect to see?
The average SpamAssassin score of the mails scanned over the last 1 hour.
What did you see instead? Under which circumstances?
Really weird numbers. The reason is that SpamAssassin scores can be negative, leading to the spamassassin_score_sum going down quite frequently, which the rate function sees as a counter reset, hugely blowing up the resulting rate.
Environment
This bug has been around forever.
Context
This is a bug known essentially forever. It's kind of “by design”, which is probably the reason why it was never filed as a formal issue. Another reason might be that it is rarely of relevance. The overwhelming majority of use cases for Summaries and Histograms are request latencies, response sizes, and similar observations that never become negative. However, there are niche use cases with negative observations (anybody remembers the “scientific experimentation” from the original README? ;-). And a bug that strikes rarely comes with the “advantage” of surprise when it strikes eventually. This bug has always been in my thoughts, and I have even talked about it whenever the opportunity came up. Yet another excuse for not filing a formal issue is that I used to think this bug would solve itself once the internal Prometheus data model can represent structured metrics properly. For as long as Prometheus exists, my hope was that the next major release would give us this ability, but it hasn't happened yet, and realistically we have no idea when it will happen. Thus, all those excuses are moot, and we should finally have a real issue about it, even if it is only to document that it is a known issue.
Another reason to do it now is that OpenMetrics apparently considers to remove the ..._sum component from a Histogram if it has negative buckets. That would be really a pity as it would remove the ability to calculate averages instead of fixing it. (Also, what about Summaries, which don't have buckets?)
It is quite problematic to let a known deficiency of Prometheus leak into an open standard like OpenMetrics, effectively preventing all systems using OpenMetrics from calculating averages from Histograms and Summaries with negative observations. So having an issue to point to might be helpful.
Suggested solutions
The fundamental problem is that counter resets in spamassassin_score_sum cannot be detected by looking for the value going down. However, counter reset detection works just fine in spamassassin_score_count. The fundamental solution is thus to detect the counter resets in spamassassin_score_count only and then apply the same timing of counter resets to spamassassin_score_sum.
The direct approach would be to bake that logic into the rate function directly, i.e. if rate is applied to a metric named <x>_sum, it would look for a metric called <x>_count and take the timing of the counter resets from there. The problem here is that we rely on naming conventions to find out about metrics that belong together. In most cases, it should work, but who knows what metric names people have come up with out there. This approach would also be a breaking change (although it would only affect expressions that were nonsensical anyway, unless of course that name collision strikes, in which case it would be breaking in a really really surprising way).
A more explicit approach would be to provide a new function called rate_with_base_counter (or similar), which takes a second parameter to deduce the counter resets from. Above expression would become
rate_with_base_counter(spamassassin_score_sum[1h], spamassassin_score_count)
/
rate(spamassassin_score_count[1h])
If this problem only ever occurs in the calculation of an average, we could also go for a more convenient and easier to understand function rate_ratio, which takes two range selectors, calculates the rate of each but taking counter resets only from the 2nd parameter, and then calculates the ratio. Expression would then look like:
rate_ratio(spamassassin_score_sum[1h], spamassassin_score_count[1h])
Bug Report
What did you do?
I did exactly what is described in the documentation, just that the Summary or Histogram involved is tracking the SpamAssassin of mails on my mail server. This is the query:
What did you expect to see?
The average SpamAssassin score of the mails scanned over the last 1 hour.
What did you see instead? Under which circumstances?
Really weird numbers. The reason is that SpamAssassin scores can be negative, leading to the
spamassassin_score_sumgoing down quite frequently, which theratefunction sees as a counter reset, hugely blowing up the resulting rate.Environment
This bug has been around forever.
Context
This is a bug known essentially forever. It's kind of “by design”, which is probably the reason why it was never filed as a formal issue. Another reason might be that it is rarely of relevance. The overwhelming majority of use cases for Summaries and Histograms are request latencies, response sizes, and similar observations that never become negative. However, there are niche use cases with negative observations (anybody remembers the “scientific experimentation” from the original README? ;-). And a bug that strikes rarely comes with the “advantage” of surprise when it strikes eventually. This bug has always been in my thoughts, and I have even talked about it whenever the opportunity came up. Yet another excuse for not filing a formal issue is that I used to think this bug would solve itself once the internal Prometheus data model can represent structured metrics properly. For as long as Prometheus exists, my hope was that the next major release would give us this ability, but it hasn't happened yet, and realistically we have no idea when it will happen. Thus, all those excuses are moot, and we should finally have a real issue about it, even if it is only to document that it is a known issue.
Another reason to do it now is that OpenMetrics apparently considers to remove the
..._sumcomponent from a Histogram if it has negative buckets. That would be really a pity as it would remove the ability to calculate averages instead of fixing it. (Also, what about Summaries, which don't have buckets?)It is quite problematic to let a known deficiency of Prometheus leak into an open standard like OpenMetrics, effectively preventing all systems using OpenMetrics from calculating averages from Histograms and Summaries with negative observations. So having an issue to point to might be helpful.
Suggested solutions
The fundamental problem is that counter resets in
spamassassin_score_sumcannot be detected by looking for the value going down. However, counter reset detection works just fine inspamassassin_score_count. The fundamental solution is thus to detect the counter resets inspamassassin_score_countonly and then apply the same timing of counter resets tospamassassin_score_sum.The direct approach would be to bake that logic into the
ratefunction directly, i.e. ifrateis applied to a metric named<x>_sum, it would look for a metric called<x>_countand take the timing of the counter resets from there. The problem here is that we rely on naming conventions to find out about metrics that belong together. In most cases, it should work, but who knows what metric names people have come up with out there. This approach would also be a breaking change (although it would only affect expressions that were nonsensical anyway, unless of course that name collision strikes, in which case it would be breaking in a really really surprising way).A more explicit approach would be to provide a new function called
rate_with_base_counter(or similar), which takes a second parameter to deduce the counter resets from. Above expression would becomeIf this problem only ever occurs in the calculation of an average, we could also go for a more convenient and easier to understand function
rate_ratio, which takes two range selectors, calculates the rate of each but taking counter resets only from the 2nd parameter, and then calculates the ratio. Expression would then look like: