histogram_quantile returns NaN when computed from 0 values #6645

Closed
llacroix opened this issue Jan 16, 2020 · 5 comments

llacroix commented Jan 16, 2020

Bug Report

I tried to display the histogram_quantile of this query:

histogram_quantile(0.95, sum(rate(build_seconds_bucket{job="promsd"}[1m])) by (le, instance))

It returns:

labels value
{instance="10.0.1.62:9090"} NaN

But:

sum(rate(build_seconds_bucket{job="promsd"}[1m])) by (le, instance)

Returns this:

labels value
{instance="10.0.1.62:9090",le="0.05"} 0
{instance="10.0.1.62:9090",le="0.25"} 0
{instance="10.0.1.62:9090",le="1.0"} 0
{instance="10.0.1.62:9090",le="0.025"} 0
{instance="10.0.1.62:9090",le="0.075"} 0
{instance="10.0.1.62:9090",le="0.5"} 0
{instance="10.0.1.62:9090",le="7.5"} 0
{instance="10.0.1.62:9090",le="+Inf"} 0
{instance="10.0.1.62:9090",le="10.0"} 0
{instance="10.0.1.62:9090",le="2.5"} 0
{instance="10.0.1.62:9090",le="5.0"} 0
{instance="10.0.1.62:9090",le="0.005"} 0
{instance="10.0.1.62:9090",le="0.01"} 0
{instance="10.0.1.62:9090",le="0.1"} 0
{instance="10.0.1.62:9090",le="0.75"} 0

It seems the issue could be related to this one:
#4264

Here's a screenshot of what it looks like:

[screenshot omitted]

What did you expect to see?

I expected the histogram quantile values to be null instead of NaN, if that's what they really are. Since the values didn't change over that time period, a null result would make sense: a quantile can't be measured when nothing was observed. It could arguably be 0, but since I'm not observing any value, it shouldn't return 0 either. In my case I'm measuring build times, and builds can happen once per day, twice per hour, or maybe 30 times per minute. If no builds happen during a time frame, the value doesn't magically drop to 0. That said, Grafana does partially interpret NaN as null: when I inspect the returned values I see Null, but the mouse hover displays NaN.

Data being scraped

# HELP build_seconds Time spent building config
# TYPE build_seconds histogram
build_seconds_bucket{le="0.005"} 0.0
build_seconds_bucket{le="0.01"} 0.0
build_seconds_bucket{le="0.025"} 0.0
build_seconds_bucket{le="0.05"} 1.0
build_seconds_bucket{le="0.075"} 2.0
build_seconds_bucket{le="0.1"} 2.0
build_seconds_bucket{le="0.25"} 2.0
build_seconds_bucket{le="0.5"} 2.0
build_seconds_bucket{le="0.75"} 2.0
build_seconds_bucket{le="1.0"} 2.0
build_seconds_bucket{le="2.5"} 2.0
build_seconds_bucket{le="5.0"} 2.0
build_seconds_bucket{le="7.5"} 2.0
build_seconds_bucket{le="10.0"} 2.0
build_seconds_bucket{le="+Inf"} 2.0
build_seconds_count 2.0
build_seconds_sum 0.0961240604519844
# TYPE build_seconds_created gauge
build_seconds_created 1.5791957863170412e+09

From what I could see, the query works as intended when the value changes. But since builds aren't generated periodically, the data can be scraped multiple times with the same value, as the buckets don't change between scrapes.

But since the sum(rate()) call does return values of 0 (not NaN), it seems like histogram_quantile is doing something odd: its inputs don't contain any NaN values, and as far as I know, Prometheus didn't scrape any NaN value either.

The client being used is prometheus_client, and the relevant piece of code is here:

https://github.com/llacroix/prometheus-swarm-discovery/blob/master/prometheus_sd/service.py#L412

And the metric is defined here:

https://github.com/llacroix/prometheus-swarm-discovery/blob/master/prometheus_sd/metrics.py#L31

The service runs on asyncio, but I believe that's irrelevant. The timing is computed correctly, and the metrics output uses the prometheus_client API.
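
For context, a minimal sketch of how a build-duration histogram can be recorded with prometheus_client (the default buckets happen to match the le values above). This is illustrative only, not the linked code; the function name and port are hypothetical.

# Minimal sketch of recording a build duration with prometheus_client.
# Illustrative only: the real code is in the links above; the function
# name and port are hypothetical. Default buckets match the le values shown.
import time

from prometheus_client import Histogram, start_http_server

BUILD_SECONDS = Histogram("build_seconds", "Time spent building config")

def build_config():
    start = time.monotonic()
    # ... the actual config-building work would happen here ...
    BUILD_SECONDS.observe(time.monotonic() - start)

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for Prometheus to scrape
    build_config()
    time.sleep(60)           # keep the process alive so the endpoint can be scraped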

Environment

  • System information:
name        value
Version     2.15.2
Revision    d9613e5
Branch      HEAD
BuildUser   root@688433cf4ff7
BuildDate   20200106-14:50:51
GoVersion   go1.13.5
  • Prometheus configuration file:
global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 15s
alerting:
  alertmanagers:
  - static_configs:
    - targets: []
    scheme: http
    timeout: 10s
    api_version: v1
scrape_configs:
- job_name: prometheus
  honor_timestamps: true
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  static_configs:
  - targets:
    - localhost:9090
- job_name: file_sd_http
  honor_timestamps: true
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  file_sd_configs:
  - files:
    - /configs/http/*.json
    refresh_interval: 5m

@llacroix (Author)

Not sure though if it's a big issue; I obviously can't interpret the data between builds in any way other than as a null value.

I guess I could fix my issue by changing the query to fill the holes with the last value, if that's possible. Currently, if I set up a gauge it doesn't display anything relevant, even when I tell it to display the last non-null value. NaN is technically not null, so having a null would at least help with that.

@brian-brazil (Contributor)

This is the expected behaviour, as you'd get a NaN also if you tried to calculate the average as that'd be dividing by zero.
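
For illustration only, here is a rough Python sketch of the bucket interpolation a histogram quantile performs; it is an approximation of the idea, not the actual Prometheus implementation. With zero observations in the range there is nothing to rank into a bucket, so the result is undefined and comes out as NaN:

import math

# Rough approximation of histogram-quantile bucket interpolation, only to show
# why all-zero bucket rates yield NaN. Not the actual Prometheus implementation.
def quantile(q, buckets):
    # buckets: sorted list of (upper_bound, cumulative_count), last bound is +Inf.
    observations = buckets[-1][1]   # total = count in the +Inf bucket
    rank = q * observations         # 0.0 when nothing was observed in the window
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            lower = buckets[i - 1][0] if i > 0 else 0.0
            prev = buckets[i - 1][1] if i > 0 else 0.0
            width = count - prev
            if width == 0:
                return math.nan     # nothing fell into any bucket: undefined
            return lower + (upper - lower) * (rank - prev) / width
    return math.nan

# All-zero bucket rates, as in the query above -> nan, not 0 and not null.
print(quantile(0.95, [(0.05, 0.0), (0.5, 0.0), (math.inf, 0.0)]))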

@llacroix (Author)

Well, okay then. I guessed it might be related to a division by 0. Then maybe it should be filed as a bug against Grafana, to allow a different interpretation of NaN vs null.

NaN is technically not null, but it's not something worth showing in some cases.

That said, the quickest workaround I found was to use this query instead, to compute an average:

build_seconds_sum / build_seconds_count

It's not ideal but does look like something I can live with for now.

@brian-brazil (Contributor)

That's the average since process start, not the last run. If you have more questions, the -users list is the best place.

@llacroix (Author)

Yes, not ideal. I did ask the question on the users list. As long as the data is scraped correctly, I can live with interpreting it correctly later.

This issue was locked as resolved and the conversation limited to collaborators on Dec 1, 2021.