NaN instead of proper values in metrics #860

Closed
wojtek-t opened this Issue Jun 26, 2015 · 14 comments

9 participants

wojtek-t commented Jun 26, 2015

We're using Prometheus in the Kubernetes project. However, we quite often observe NaN instead of a proper value in our metrics, e.g.:

....
rest_client_request_latency_microseconds{url="http://127.0.0.1:8080/api/v1/nodes?fieldSelector=%7Bvalue%7D",verb="GET",quantile="0.5"} NaN
rest_client_request_latency_microseconds{url="http://127.0.0.1:8080/api/v1/nodes?fieldSelector=%7Bvalue%7D",verb="GET",quantile="0.9"} NaN
rest_client_request_latency_microseconds{url="http://127.0.0.1:8080/api/v1/nodes?fieldSelector=%7Bvalue%7D",verb="GET",quantile="0.99"} NaN
rest_client_request_latency_microseconds_sum{url="http://127.0.0.1:8080/api/v1/nodes?fieldSelector=%7Bvalue%7D",verb="GET"} 3395
rest_client_request_latency_microseconds_count{url="http://127.0.0.1:8080/api/v1/nodes?fieldSelector=%7Bvalue%7D",verb="GET"} 2
....

Interestingly, the sum and count are always OK - the problem is only with the percentiles.

This particular metric is defined as follows:

RequestLatency = prometheus.NewSummaryVec(
  prometheus.SummaryOpts{
    Subsystem: restClientSubsystem,
    Name:      "request_latency_microseconds",
    Help:      "Request latency in microseconds. Broken down by verb and URL",
  },
  []string{"verb", "url"},
)

Do you know why this is happening?

Member

brian-brazil commented Jun 26, 2015

If there are no requests in a period, it's not possible to calculate the percentile, so NaN is reported instead.

Does that answer your question?

Author

wojtek-t commented Jun 26, 2015

Sorry - I'm not sure I understand which period you mean.
In the example above there were 2 requests with a sum of 3395, yet all percentiles are reported as NaN. Is that expected?

Member

brian-brazil commented Jun 26, 2015

If it has been a while since the requests, then that is expected. If you graph the percentile values, you'll see they had a non-NaN value around the time of the requests.

Author

wojtek-t commented Jun 26, 2015

So those percentiles are computed only over some time period (not over all the requests observed so far, i.e. since the component started)? If so, how long is that period, and can it be configured?

Author

wojtek-t commented Jun 26, 2015

Member

brian-brazil commented Jun 26, 2015

See prometheus/client_golang#85 and https://github.com/prometheus/client_golang/blob/fcd2986466589bcf7a411ec3b52d85a8df9dcc8b/prometheus/summary.go#L118

If you want to control things at that level, I'd suggest a Histogram rather than a Summary.

Author

wojtek-t commented Jun 26, 2015

I see - thanks for the quick response!


quinton-hoole-zz commented Jun 30, 2015

Hello @brian-brazil. Small world :-)


ashoksahoo commented May 9, 2017

How do I filter out the NaN? I am doing a group by (metrics_type), and for some types it's giving NaN. I am using value > 0.


r4j4h commented May 10, 2017

@ashoksahoo As far as I know, that is the proper way. If you need to keep negative values as well, you can combine the filter with "or" and include value <= 0.


macnibblet commented May 25, 2017

I'm having the same problem as @ashoksahoo

My query
sum(rate(http_endpoint_duration_seconds[5m])) by (endpoint, quantile)

The problem is that I have 60+ servers, and some of them don't get hit every now and then, which means I almost never get any graphs displayed because the result of the query above is NaN.


bencoughlan commented Jun 19, 2017

Is there anything we can pass in to have it default to 0 if NaN or Infinity shows up?


Keith-Ball commented Jan 23, 2018

Per grafana/grafana#8860 (comment), it looks like there is a workaround of simply checking that the rate values are >= 0.
e.g.: avg without (quantile)(rate(api_request_duration[5m]) >= 0)


MikeSpreitzer commented Jan 19, 2019

Yes, Histograms are better than Summaries, particularly because they aggregate better.
