histogram_quantile returns NaN when computed from 0 values #6645

Closed
llacroix opened this issue Jan 16, 2020 · 5 comments

llacroix commented Jan 16, 2020

Bug Report

I tried to display the histogram_quantile of this query:

histogram_quantile(0.95, sum(rate(build_seconds_bucket{job="promsd"}[1m])) by (le, instance))

It returns:

labels value
{instance="10.0.1.62:9090"} NaN

But:

sum(rate(build_seconds_bucket{job="promsd"}[1m])) by (le, instance)

Returns this:

labels value
{instance="10.0.1.62:9090",le="0.05"} 0
{instance="10.0.1.62:9090",le="0.25"} 0
{instance="10.0.1.62:9090",le="1.0"} 0
{instance="10.0.1.62:9090",le="0.025"} 0
{instance="10.0.1.62:9090",le="0.075"} 0
{instance="10.0.1.62:9090",le="0.5"} 0
{instance="10.0.1.62:9090",le="7.5"} 0
{instance="10.0.1.62:9090",le="+Inf"} 0
{instance="10.0.1.62:9090",le="10.0"} 0
{instance="10.0.1.62:9090",le="2.5"} 0
{instance="10.0.1.62:9090",le="5.0"} 0
{instance="10.0.1.62:9090",le="0.005"} 0
{instance="10.0.1.62:9090",le="0.01"} 0
{instance="10.0.1.62:9090",le="0.1"} 0
{instance="10.0.1.62:9090",le="0.75"} 0

It seems the issue could be related to this one:
#4264

Here's a screenshot of what it looks like:

[screenshot omitted]

What did you expect to see?

I expected the histogram quantile values to be null instead of NaN, if that's what they really are. Since the values didn't change over that time period, a null result would make sense: a quantile can't be measured when nothing was observed. It could arguably be 0, but since I'm not observing any value, it shouldn't return 0 either. In my case I'm measuring build times, and builds can happen once per day, twice per hour, or maybe 30 times per minute. If no builds happen during a time frame, the value doesn't magically drop to 0. That said, Grafana does partially interpret NaN as null: when I inspect the returned values I see Null, but the mouse hover displays NaN.

Data being scraped

# HELP build_seconds Time spent building config
# TYPE build_seconds histogram
build_seconds_bucket{le="0.005"} 0.0
build_seconds_bucket{le="0.01"} 0.0
build_seconds_bucket{le="0.025"} 0.0
build_seconds_bucket{le="0.05"} 1.0
build_seconds_bucket{le="0.075"} 2.0
build_seconds_bucket{le="0.1"} 2.0
build_seconds_bucket{le="0.25"} 2.0
build_seconds_bucket{le="0.5"} 2.0
build_seconds_bucket{le="0.75"} 2.0
build_seconds_bucket{le="1.0"} 2.0
build_seconds_bucket{le="2.5"} 2.0
build_seconds_bucket{le="5.0"} 2.0
build_seconds_bucket{le="7.5"} 2.0
build_seconds_bucket{le="10.0"} 2.0
build_seconds_bucket{le="+Inf"} 2.0
build_seconds_count 2.0
build_seconds_sum 0.0961240604519844
# TYPE build_seconds_created gauge
build_seconds_created 1.5791957863170412e+09

From what I could see, the query works as intended when the value changes. But since builds aren't generated periodically, the data can be scraped multiple times with the same value, as the buckets don't change between scrapes.

But since the sum(rate()) call does return values of 0 (not NaN), it seems like histogram_quantile is doing something odd: its inputs don't contain any NaN values, and as far as I know, Prometheus didn't scrape any NaN value either.

The client being used is prometheus_client, and the relevant piece of code is here:

https://github.com/llacroix/prometheus-swarm-discovery/blob/master/prometheus_sd/service.py#L412

And the metric is defined here:

https://github.com/llacroix/prometheus-swarm-discovery/blob/master/prometheus_sd/metrics.py#L31

The service runs on asyncio, but I believe that's irrelevant. The timing is computed correctly, and the metrics output uses the prometheus_client API.
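
For context, a minimal sketch of how a build-duration histogram can be recorded with prometheus_client (the default buckets happen to match the le values above). This is illustrative only, not the linked code; the function name and port are hypothetical.

# Minimal sketch of recording a build duration with prometheus_client.
# Illustrative only: the real code is in the links above; the function
# name and port are hypothetical. Default buckets match the le values shown.
import time

from prometheus_client import Histogram, start_http_server

BUILD_SECONDS = Histogram("build_seconds", "Time spent building config")

def build_config():
    start = time.monotonic()
    # ... the actual config-building work would happen here ...
    BUILD_SECONDS.observe(time.monotonic() - start)

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for Prometheus to scrape
    build_config()
    time.sleep(60)           # keep the process alive so the endpoint can be scraped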

Environment

  • System information:
name        value
Version     2.15.2
Revision    d9613e5
Branch      HEAD
BuildUser   root@688433cf4ff7
BuildDate   20200106-14:50:51
GoVersion   go1.13.5
  • Prometheus configuration file:
global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 15s
alerting:
  alertmanagers:
  - static_configs:
    - targets: []
    scheme: http
    timeout: 10s
    api_version: v1
scrape_configs:
- job_name: prometheus
  honor_timestamps: true
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  static_configs:
  - targets:
    - localhost:9090
- job_name: file_sd_http
  honor_timestamps: true
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  file_sd_configs:
  - files:
    - /configs/http/*.json
    refresh_interval: 5m

@llacroix (Author)

Not sure though if it's a big issue; I obviously can't interpret the data between builds in any way other than as a null value.

I guess I could fix my issue by changing the query to fill the holes with the last value, if that's possible. Currently, if I set up a gauge it doesn't display anything relevant, even when I tell it to display the last non-null value. NaN is technically not null, so having a null would at least help with that.

@brian-brazil (Contributor)

This is the expected behaviour, as you'd get a NaN also if you tried to calculate the average as that'd be dividing by zero.
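
For illustration only, here is a rough Python sketch of the bucket interpolation a histogram quantile performs; it is an approximation of the idea, not the actual Prometheus implementation. With zero observations in the range there is nothing to rank into a bucket, so the result is undefined and comes out as NaN:

import math

# Rough approximation of histogram-quantile bucket interpolation, only to show
# why all-zero bucket rates yield NaN. Not the actual Prometheus implementation.
def quantile(q, buckets):
    # buckets: sorted list of (upper_bound, cumulative_count), last bound is +Inf.
    observations = buckets[-1][1]   # total = count in the +Inf bucket
    rank = q * observations         # 0.0 when nothing was observed in the window
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            lower = buckets[i - 1][0] if i > 0 else 0.0
            prev = buckets[i - 1][1] if i > 0 else 0.0
            width = count - prev
            if width == 0:
                return math.nan     # nothing fell into any bucket: undefined
            return lower + (upper - lower) * (rank - prev) / width
    return math.nan

# All-zero bucket rates, as in the query above -> nan, not 0 and not null.
print(quantile(0.95, [(0.05, 0.0), (0.5, 0.0), (math.inf, 0.0)]))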

@llacroix (Author)

Well, okay then. I guessed it might be related to a division by 0. Then maybe it should be filed as a bug against Grafana, to allow a different interpretation of NaN vs null.

NaN is technically not null, but it's not something worth showing in some cases.

That said, the quickest workaround I found was to use this query instead, to compute an average:

build_seconds_sum / build_seconds_count

It's not ideal but does look like something I can live with for now.

@brian-brazil (Contributor)

That's the average since process start, not the last run. If you have more questions, the -users list is the best place.

@llacroix (Author)

Yes, not ideal. I did ask the question on the users list. As long as the data is scraped correctly, I can live with interpreting it correctly later.

This issue was locked as resolved and the conversation limited to collaborators on Dec 1, 2021.