histogram_quantile returns NaN if 2 or more buckets are zeroes #4264

Closed
dimitar-petrov opened this Issue Jun 13, 2018 · 8 comments

dimitar-petrov commented Jun 13, 2018

Bug Report

What did you do?
Here is the data that I am working with:
rate(callback_latency_seconds_bucket{app='foo'}[5m])

Element Value
{app="foo",host="host",instance="host:8111",job="job",le="+Inf",type="trade"} 0
{app="foo",host="host",instance="host:8111",job="job",le="0.1",type="trade"} 0
{app="foo",host="host",instance="host:8111",job="job",le="0.2",type="trade"} 45.21835104314429
{app="foo",host="host",instance="host:8111",job="job",le="0.3",type="trade"} 9.303669245669461
{app="foo",host="host",instance="host:8111",job="job",le="0.5",type="trade"} 17.35919496594457
{app="foo",host="host",instance="host:8111",job="job",le="0.8",type="trade"} 3.2629508779597116
{app="foo",host="host",instance="host:8111",job="job",le="1.0",type="trade"} 0.07777748971300107
{app="foo",host="host",instance="host:8111",job="job",le="1.5",type="trade"} 0.15555497942600213
{app="foo",host="host",instance="host:8111",job="job",le="3.0",type="trade"} 0
{app="foo",host="host",instance="host:8111",job="job",le="5.0",type="trade"} 0

I would like to be able to display the xth quantile and I am using the following function:

histogram_quantile(0.99, sum(rate(callback_latency_seconds_bucket{app="foo"}[5m])) by (le))

But I am not getting any values:

Element Value
{} NaN

If I instead increase the rate window so that all buckets are populated, I get the following result:

rate(callback_latency_seconds_bucket{app='foo'}[13h])

Element Value
{app="foo",host="host",instance="host:8111",job="job",le="+Inf",type="trade"} 0.017100552439744232
{app="foo",host="host",instance="host:8111",job="job",le="0.1",type="trade"} 0
{app="foo",host="host",instance="host:8111",job="job",le="0.2",type="trade"} 30.89769082745349
{app="foo",host="host",instance="host:8111",job="job",le="0.3",type="trade"} 5.448407098567457
{app="foo",host="host",instance="host:8111",job="job",le="0.5",type="trade"} 4.3670515287577505
{app="foo",host="host",instance="host:8111",job="job",le="0.8",type="trade"} 0.8159503955527045
{app="foo",host="host",instance="host:8111",job="job",le="1.0",type="trade"} 0.10720547359418431
{app="foo",host="host",instance="host:8111",job="job",le="1.5",type="trade"} 0.08841137481291425
{app="foo",host="host",instance="host:8111",job="job",le="3.0",type="trade"} 0.04321146033782339
{app="foo",host="host",instance="host:8111",job="job",le="5.0",type="trade"} 0.022493051101133203

histogram_quantile(0.99, sum(rate(callback_latency_seconds_bucket{app="foo"}[13h])) by (le))

Element Value
{} 0.10005487285638165
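For readers following along, here is a rough sketch of how a bucket-based quantile estimate behaves on the two result sets above. It is a simplified Python approximation written for illustration, not Prometheus's actual code: `bucket_quantile` and the variable names are hypothetical, and it assumes the estimate takes the `+Inf` bucket as the total count and interpolates linearly inside the first bucket whose count reaches the target rank.

```python
import math

def bucket_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, count) pairs.

    Simplified sketch (assumption, not the real implementation): sort by
    upper bound, treat the +Inf bucket's count as the total, then
    interpolate linearly inside the first bucket that reaches the rank.
    """
    if q < 0:
        return -math.inf
    if q > 1:
        return math.inf
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        return math.nan
    rank = q * buckets[-1][1]
    # Linear scan for the first finite bucket whose count is at least the rank.
    b = next((i for i, (_, c) in enumerate(buckets[:-1]) if c >= rank),
             len(buckets) - 1)
    if b == len(buckets) - 1:
        return buckets[-2][0]
    if b == 0 and buckets[0][0] <= 0:
        return buckets[0][0]
    start = 0.0
    end, count = buckets[b]
    if b > 0:
        start = buckets[b - 1][0]
        count -= buckets[b - 1][1]
        rank -= buckets[b - 1][1]
    if count == 0:
        # In IEEE-754 arithmetic 0/0 is NaN; Python raises instead,
        # so the sketch returns NaN explicitly.
        return math.nan
    return start + (end - start) * (rank / count)

# Values copied from the two rate() results in this report.
rate_5m = [(math.inf, 0), (0.1, 0), (0.2, 45.21835104314429),
           (0.3, 9.303669245669461), (0.5, 17.35919496594457),
           (0.8, 3.2629508779597116), (1.0, 0.07777748971300107),
           (1.5, 0.15555497942600213), (3.0, 0), (5.0, 0)]
rate_13h = [(math.inf, 0.017100552439744232), (0.1, 0),
            (0.2, 30.89769082745349), (0.3, 5.448407098567457),
            (0.5, 4.3670515287577505), (0.8, 0.8159503955527045),
            (1.0, 0.10720547359418431), (1.5, 0.08841137481291425),
            (3.0, 0.04321146033782339), (5.0, 0.022493051101133203)]

q_5m = bucket_quantile(0.99, rate_5m)    # NaN: the +Inf bucket is 0
q_13h = bucket_quantile(0.99, rate_13h)  # roughly 0.10005
print(q_5m, q_13h)
```

Under these assumptions the 5m case yields NaN because the `+Inf` bucket, taken as the total, is 0, so the rank is 0 and the interpolation divides zero by zero. The 13h case lands just above 0.1 because the tiny `+Inf` value makes the rank fall almost at the start of the 0.1 to 0.2 bucket.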

What did you expect to see?
My expectation is that quantile should be calculated even if I do not have request delays in two buckets

What did you see instead? Under which circumstances?
explained above

Environment

  • System information:
% uname -srm
Linux 4.16.5-1-ARCH x86_64
  • Prometheus version:
% docker exec -ti prometheus prometheus --version
prometheus, version 2.2.1 (branch: HEAD, revision: bc6058c81272a8d938c05e75607371284236aadc)
  build user:       root@149e5b3f0829
  build date:       20180314-14:15:45
  go version:       go1.10
  • Alertmanager version:

    Not relevant

  • Prometheus configuration file:

  - job_name: job
    scrape_interval: 30s
    static_configs:
      - targets: ['host:8111']
  • Alertmanager configuration file:
    Not relevant

  • Logs:
    Not relevant

Member

brian-brazil commented Jun 13, 2018

> My expectation is that quantile should be calculated even if I do not have request delays in two buckets

There's something else going on here. This looks like your application is exposing a non-cumulative histogram. Can you share the values of it on /metrics?

Author

dimitar-petrov commented Jun 13, 2018

# HELP callback_latency_seconds Callback latency in seconds
# TYPE callback_latency_seconds histogram
callback_latency_seconds_bucket{app="foo",host="host",le="+Inf",type="trade"} 0
callback_latency_seconds_bucket{app="foo",host="host",le="0.1",type="trade"} 0
callback_latency_seconds_bucket{app="foo",host="host",le="0.2",type="trade"} 157
callback_latency_seconds_bucket{app="foo",host="host",le="0.3",type="trade"} 206
callback_latency_seconds_bucket{app="foo",host="host",le="0.5",type="trade"} 415
callback_latency_seconds_bucket{app="foo",host="host",le="0.8",type="trade"} 659
callback_latency_seconds_bucket{app="foo",host="host",le="1.0",type="trade"} 394
callback_latency_seconds_bucket{app="foo",host="host",le="1.5",type="trade"} 597
callback_latency_seconds_bucket{app="foo",host="host",le="3.0",type="trade"} 64
callback_latency_seconds_bucket{app="foo",host="host",le="5.0",type="trade"} 1
callback_latency_seconds_count{app="foo",host="host",type="trade"} 2493
callback_latency_seconds_sum{app="foo",host="host",type="trade"} 1834.2480816841125
# HELP callbacks Number of callbacks.
# TYPE callbacks counter
callbacks{app="foo",host="host",type="order"} 57673
callbacks{app="foo",host="host",type="trade"} 2493

Member

brian-brazil commented Jun 13, 2018

That's incorrectly implemented: Prometheus histograms are meant to be cumulative. In addition, the format specifies that buckets should be ordered, so +Inf should be last, and the +Inf bucket should match _count.

Which client library is this from?
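To make the cumulative requirement concrete, the following is a minimal sketch that turns the values posted on /metrics above into cumulative buckets and checks the two properties mentioned: counts never decrease as `le` grows, and the `+Inf` bucket equals `_count`. It assumes the exposed values are per-bucket (non-cumulative) counts, which is consistent with them summing to the reported `_count` of 2493. The helper names `to_cumulative` and `is_valid` are hypothetical.

```python
import math

# Bucket values as exposed on /metrics in this thread (le -> value).
# Assumption: these are per-bucket counts, not cumulative ones.
exposed = {
    "+Inf": 0, "0.1": 0, "0.2": 157, "0.3": 206, "0.5": 415,
    "0.8": 659, "1.0": 394, "1.5": 597, "3.0": 64, "5.0": 1,
}
exposed_count = 2493  # callback_latency_seconds_count

def upper_bound(le):
    """Parse an `le` label value into a float, mapping +Inf to infinity."""
    return math.inf if le == "+Inf" else float(le)

def ordered(buckets):
    """Return (le, value) pairs sorted by upper bound, +Inf last."""
    return sorted(buckets.items(), key=lambda kv: upper_bound(kv[0]))

def to_cumulative(per_bucket):
    """Convert per-bucket counts into cumulative counts."""
    running = 0
    result = []
    for le, n in ordered(per_bucket):
        running += n
        result.append((le, running))
    return result

def is_valid(pairs, total):
    """Check ordering, monotonic counts, and +Inf == _count."""
    counts = [c for _, c in pairs]
    monotonic = all(a <= b for a, b in zip(counts, counts[1:]))
    return pairs[-1][0] == "+Inf" and monotonic and counts[-1] == total

raw = ordered(exposed)
cumulative = to_cumulative(exposed)

print(is_valid(raw, exposed_count))         # the exposition as posted
print(is_valid(cumulative, exposed_count))  # after conversion
for le, count in cumulative:
    print(f'le="{le}" {count}')
```

With this conversion the `+Inf` bucket becomes 2493, matching `_count`, whereas the values as posted fail both the monotonicity check and the `+Inf` check.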

Author

dimitar-petrov commented Jun 13, 2018

Member

brian-brazil commented Jun 13, 2018

I'd suggest filing a bug with them, and also consider using the official client at https://github.com/prometheus/client_python.

Author

dimitar-petrov commented Jun 13, 2018

Thanks Brian,

I will open a bug over there.
I cannot use the official client since it is synchronous.

Member

brian-brazil commented Jun 13, 2018

The official client should still work fine for async use cases, though there may be extra utilities you'd like on top of it.

This issue is elsewhere, so closing.
