
Inverted histogram #2018

Closed
ractive opened this Issue Sep 22, 2016 · 5 comments

ractive commented Sep 22, 2016

The current histogram metric type is a good tool to calculate SLAs or to generally report how much was "good".

But common use cases, e.g. tracking slow or large requests, require reporting how much was "not good", and this is quite cumbersome with a histogram. To get, e.g., the rate of requests taking more than 1s, you'd need to do:

sum(rate( response_time_seconds_count[5m]))
    - ignoring(le) sum(rate( response_time_seconds_bucket{le="1"} [5m] ))

If you could define a histogram with buckets that contain data points that are greater than ("gt") a certain threshold, this query would be much simpler.
E.g. having these buckets:

response_time_seconds_bucket{gt="0"}
response_time_seconds_bucket{gt="0.1"}
response_time_seconds_bucket{gt="0.2"}
response_time_seconds_bucket{gt="0.5"}
response_time_seconds_bucket{gt="1"}

You could track requests > 1s with just:

sum(rate( response_time_seconds_bucket{gt="1"}[5m]))

The {gt="0"} bucket would be similar to {le="+Inf"}, acting as a lower boundary matching all values not already caught by another bucket. The client libraries would need to make sure that a bucket with gt="0" exists.
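The arithmetic behind this proposal can be sketched in Python (the observations and bucket boundaries below are made up for illustration; this is not Prometheus client library code):

```python
# Sketch: derive the proposed "greater-than" counts from the
# cumulative "le" counts a Prometheus histogram already exposes.
observations = [0.05, 0.3, 0.7, 1.2, 2.5]
bounds = [0.1, 0.2, 0.5, 1.0]

# Cumulative "le" buckets, as exposed today.
le = {b: sum(1 for o in observations if o <= b) for b in bounds}
total = len(observations)  # corresponds to _count (and le="+Inf")

# The proposed "gt" buckets are just the complement:
gt = {b: total - le[b] for b in bounds}
gt[0.0] = total  # gt="0" matches everything, like le="+Inf"

print(le)  # {0.1: 1, 0.2: 1, 0.5: 2, 1.0: 3}
print(gt)  # {0.1: 4, 0.2: 4, 0.5: 3, 1.0: 2, 0.0: 5}
```

So a "gt" bucket carries no information that the existing "le" buckets don't already have; the proposal is purely about query convenience.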

I think this would be a nice addition to Prometheus, making some common tasks easier.

brian-brazil commented Sep 22, 2016

Absolute values aren't of much use; it's the ratios that you care about. In that respect there's no particular advantage to one over the other, and having two slightly different ways to get the same data would not be a net benefit to users.

beorn7 commented Sep 22, 2016

On 22 September 2016 at 09:43, Jean-Pierre Bergamin <notifications@github.com> wrote:

sum(rate( response_time_seconds_count[5m]))
    - ignoring(le) sum(rate( response_time_seconds_bucket{le="1"} [5m] ))

That's not really a very complicated expression. Check out the expression for the Apdex score: https://prometheus.io/docs/practices/histograms/#apdex-score . Once you are at the "usual" complexity level of PromQL expressions, the simplification introduced by the proposed inverted histogram doesn't really matter much in relative terms. In contrast, introducing a slightly different metric type is quite a heavy complication, which would leave a trail through the whole stack, starting at code instrumentation.

Björn Rabenstein, Engineer
http://soundcloud.com/brabenstein

SoundCloud Ltd. | Rheinsberger Str. 76/77, 10115 Berlin, Germany
Managing Director: Alexander Ljung | Incorporated in England & Wales
with Company No. 6343600 | Local Branch Office | AG Charlottenburg |
HRB 110657B

ractive commented Sep 22, 2016

We graph the ratio of slow requests, which results in "very" complicated queries like:

(
  sum(rate( backend_http_response_time_seconds_count {instance=~"$instance"} [30s])) by (instance)
  - ignoring(le)
  sum(rate( backend_http_response_time_seconds_bucket {instance=~"$instance", le="1"} [30s])) by (instance)
)
/
sum(rate( backend_http_response_time_seconds_count {instance=~"$instance"} [30s])) by (instance)

The "gt" buckets could even be mapped to the "le" buckets by the server so that the clients do not need to care about any histogram change.

response_time_seconds_bucket{gt="0"} -> response_time_seconds_bucket{le="+Inf"}
response_time_seconds_bucket{gt="0.1"} -> response_time_seconds_count - response_time_seconds_bucket{le="0.1"}
response_time_seconds_bucket{gt="0.2"} -> response_time_seconds_count - response_time_seconds_bucket{le="0.2"}
etc.
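A small Python sketch of this server-side mapping, using made-up per-bucket rates (not an actual Prometheus implementation), also shows that the slow-request ratio comes out the same either way:

```python
# Hypothetical per-second rates at one instant, keyed by "le" boundary.
count_rate = 10.0  # rate(response_time_seconds_count[5m])
le_rate = {0.1: 2.0, 0.2: 4.0, 0.5: 7.0, 1.0: 9.0}

# "gt" buckets derived exactly as the mapping above describes:
gt_rate = {b: count_rate - r for b, r in le_rate.items()}
gt_rate[0.0] = count_rate  # gt="0" maps to le="+Inf"

# Ratio of requests slower than 1s; both formulations agree.
via_le = (count_rate - le_rate[1.0]) / count_rate
via_gt = gt_rate[1.0] / count_rate
print(via_le, via_gt)  # 0.1 0.1
```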

having two slightly different ways to get the same data would not be a net benefit to users.

It would just make histograms easier to use in most of our use cases: tracking slow and large stuff.

beorn7 commented Sep 22, 2016

On 22 September 2016 at 13:08, Jean-Pierre Bergamin <notifications@github.com> wrote:

The "gt" buckets could even be mapped to the "le" buckets by the server so that the clients do not need to care about any histogram change.

response_time_seconds_bucket{gt="0"} -> response_time_seconds_bucket{le="+Inf"}
response_time_seconds_bucket{gt="0.1"} -> response_time_seconds_count - response_time_seconds_bucket{le="0.1"}
response_time_seconds_bucket{gt="0.2"} -> response_time_seconds_count - response_time_seconds_bucket{le="0.2"}
etc.

If you need this mapping, you can create recording rules to generate those time series. (Note, however, that it is strongly recommended to let metrics with the same name always have the same label dimensions, so your inverted histogram should be named differently from response_time_seconds_bucket.) (Note also that this doubles the number of time series, which is usually the resource bottleneck on a Prometheus server already.)
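In the current YAML rule-file format, such a recording rule might look roughly like this (the group name and the recorded metric name `response_time_seconds:gt_bucket` are made up for illustration):

```yaml
groups:
  - name: inverted-histogram
    rules:
      # Hypothetical recording rule deriving a "greater-than 1s" series.
      # A new metric name avoids mixing label dimensions with
      # response_time_seconds_bucket, as noted above.
      - record: response_time_seconds:gt_bucket
        expr: >
          sum(rate(response_time_seconds_count[5m]))
          - ignoring(le)
          sum(rate(response_time_seconds_bucket{le="1"}[5m]))
        labels:
          gt: "1"
```

One such rule would be needed per threshold, which is where the doubling of time series mentioned above comes from.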


lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 24, 2019
