
Histogram bucket label value gets converted to float somehow #8367

Closed
SteffenDE opened this issue Jan 14, 2021 · 6 comments

Comments

@SteffenDE commented Jan 14, 2021

What did you do?
I'm tracking Caddy metrics (https://caddyserver.com/) with Prometheus (latest Docker image) and am experiencing a strange problem: I've got three instances reporting metrics, and for one instance the bucket label le="1" is stored as le="1.0":
[screenshot of the query result showing both le="1" and le="1.0"]
(query: sum(increase(caddy_http_request_duration_seconds_bucket{handler=~"reverse_proxy", server="srv0", le=~"1|1.0"}[5m])) by (le, instance))

I've checked the HTTP output of the affected Caddy instance, and every "le" label is correctly formatted as an integer. Only when querying Prometheus do I get a float value:

# curl -s http://10.100.16.9:8080/metrics|grep seconds_bucket|grep 200|grep GET|grep reverse_proxy|grep request
caddy_http_request_duration_seconds_bucket{code="200",handler="reverse_proxy",method="GET",server="srv0",le="0.005"} 364
caddy_http_request_duration_seconds_bucket{code="200",handler="reverse_proxy",method="GET",server="srv0",le="0.01"} 1101
caddy_http_request_duration_seconds_bucket{code="200",handler="reverse_proxy",method="GET",server="srv0",le="0.025"} 1773
caddy_http_request_duration_seconds_bucket{code="200",handler="reverse_proxy",method="GET",server="srv0",le="0.05"} 2253
caddy_http_request_duration_seconds_bucket{code="200",handler="reverse_proxy",method="GET",server="srv0",le="0.1"} 2757
caddy_http_request_duration_seconds_bucket{code="200",handler="reverse_proxy",method="GET",server="srv0",le="0.25"} 3147
caddy_http_request_duration_seconds_bucket{code="200",handler="reverse_proxy",method="GET",server="srv0",le="0.5"} 3847
caddy_http_request_duration_seconds_bucket{code="200",handler="reverse_proxy",method="GET",server="srv0",le="1"} 4390
caddy_http_request_duration_seconds_bucket{code="200",handler="reverse_proxy",method="GET",server="srv0",le="2.5"} 4490
caddy_http_request_duration_seconds_bucket{code="200",handler="reverse_proxy",method="GET",server="srv0",le="5"} 4506
caddy_http_request_duration_seconds_bucket{code="200",handler="reverse_proxy",method="GET",server="srv0",le="10"} 4511
caddy_http_request_duration_seconds_bucket{code="200",handler="reverse_proxy",method="GET",server="srv0",le="+Inf"} 4512
# curl -s http://10.100.16.9:8080/metrics|grep seconds_bucket|grep "le=\"1.0\""
# 

It also happens to the other buckets (le="5", le="10").

(This really messes up the Grafana heatmap when the affected instance is included:
[heatmap screenshot with the affected instance]
vs
[heatmap screenshot without it]
)

What did you expect to see?
The le label value should be the same (le="1") as for the other instances.

What did you see instead? Under which circumstances?
The le label value is formatted as a float instead of an integer.

Environment

  • System information:
Linux 4.15.0-115-generic x86_64
  • Prometheus version:
prometheus, version 2.24.0 (branch: HEAD, revision: 02e92236a8bad3503ff5eec3e04ac205a3b8e4fe)
  build user:       root@d9f90f0b1f76
  build date:       20210106-13:48:37
  go version:       go1.15.6
  platform:         linux/amd64
  • Prometheus configuration file:
# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']

  - job_name: 'caddy'

    hetzner_sd_configs:
     - bearer_token: "***"
       role: "hcloud"
    relabel_configs:
     - source_labels: [__meta_hetzner_hcloud_label_prometheus]
       regex: true.*
       action: keep
     - source_labels: [__meta_hetzner_hcloud_label_caddy]
       regex: true.*
       action: keep
     - source_labels: [__meta_hetzner_server_name]
       target_label: server_name
       replacement: '$1'
     - source_labels: [__meta_hetzner_hcloud_private_ipv4_net]
       target_label: __address__
       replacement: '$1:8080'
@brian-brazil (Contributor)

I don't see anything here that indicates a bug in Prometheus, as all the output you've provided is via third-party systems. If you have evidence that Prometheus ingestion is mangling data, please share it in the form of /query or /query_range output. I would presume that the data is like this because that is what the target is exposing, and so it should be taken up with them.

@SteffenDE (Author)

Here's the output of a query against /query.

$ curl 'http://127.0.0.1:9090/api/v1/query?query=sum%28increase%28caddy_http_request_duration_seconds_bucket%7Bhandler%3D%7E%22reverse_proxy%22%2C+server%3D%22srv0%22%2C+le%3D%7E%221%7C1.0%22%7D%5B5m%5D%29%29+by+%28le%2C+instance%29&time=1610648999.814'
{"status":"success","data":{"resultType":"vector","result":[{"metric":{"instance":"10.100.16.7:8080","le":"1"},"value":[1610648999.814,"671.5789473684209"]},{"metric":{"instance":"10.100.16.8:8080","le":"1"},"value":[1610648999.814,"0"]},{"metric":{"instance":"10.100.16.9:8080","le":"1.0"},"value":[1610648999.814,"541.0526315789474"]}]}}

The target is exposing the values as "1" and not as "1.0".

@brian-brazil (Contributor)

You're absolutely sure that 10.100.16.9:8080 is exposing 1 when Prometheus is scraping it? I'd guess that it's negotiating OpenMetrics, which you'd see if you used tcpdump, so there's no bug here (and any valid rendering of 1 is valid).

This sounds like something to take up with Grafana.
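
A minimal sketch of the tcpdump check suggested above (the interface and the capture filter are assumptions; adjust them to your setup). It prints the scrape traffic in plain text so you can see the Accept header Prometheus sends and the Content-Type the target answers with:

tcpdump -i any -A -s 0 'tcp port 8080 and host 10.100.16.9'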

@roidelapluie (Member) commented Jan 14, 2021

Can you try curl -H 'Accept: application/openmetrics-text; version=0.0.1' http://10.100.16.9:8080/metrics

@beorn7 (Member) commented Jan 14, 2021

@SteffenDE the "problem" is that Caddy activates OpenMetrics here. The client_golang documentation explains what's happening:

	// If true, the experimental OpenMetrics encoding is added to the
	// possible options during content negotiation. Note that Prometheus
	// 2.5.0+ will negotiate OpenMetrics as first priority. OpenMetrics is
	// the only way to transmit exemplars. However, the move to OpenMetrics
	// is not completely transparent. Most notably, the values of "quantile"
	// labels of Summaries and "le" labels of Histograms are formatted with
	// a trailing ".0" if they would otherwise look like integer numbers
	// (which changes the identity of the resulting series on the Prometheus
	// server).
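
For context, a minimal sketch of where the EnableOpenMetrics option documented above is set (an assumption for illustration: this is a generic client_golang exporter, not Caddy's actual code):

	package main

	import (
		"log"
		"net/http"

		"github.com/prometheus/client_golang/prometheus"
		"github.com/prometheus/client_golang/prometheus/promhttp"
	)

	func main() {
		// A registry with no collectors is enough to demonstrate the handler options.
		reg := prometheus.NewRegistry()
		// With EnableOpenMetrics set, a scraper that negotiates OpenMetrics gets
		// le="1.0" instead of le="1" on integer-valued bucket boundaries.
		http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{
			EnableOpenMetrics: true,
		}))
		log.Fatal(http.ListenAndServe(":8080", nil))
	}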

However, if you look at a /metrics endpoint without negotiating anything in particular (e.g. with curl or with your browser), then you get the classic Prometheus text format.

In your case, I assume some of your Caddy instances are configured differently and do not expose the OpenMetrics format; see caddyserver/caddy#3944 for how that might be possible.

This whole issue is a bit unfortunate and has a lot of historic context. Treating what's essentially a floating point number as a string creates this kind of shenanigans. You are not the first to run into it. My vision has always been that Prometheus will eventually become aware of bucket boundaries being floats rather than opaque strings, and at least there should be some sanitization within Prometheus. (It was there in Prometheus 1.x. It got dropped in Prometheus 2.x, arguably by accident. And my attempt to at least make it a requirement for OpenMetrics was not successful.) If you want to learn more in a somewhat entertaining way, you could watch me telling the story.

I'm not sure how Grafana's heatmap is dealing with it. I could imagine it's fine as long as the labels don't get mixed.

In any case, a PromQL sum by (le) (...) will certainly fail to sum le="1.0" and le="1" into the same resulting vector element. So you'll get them both, and that might throw Grafana off.
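
As a stopgap for this particular query, the le label could be normalized before aggregating; a sketch not suggested in this thread (the label_replace regex only rewrites values with a trailing ".0"):

	sum by (le, instance) (
	  label_replace(
	    increase(caddy_http_request_duration_seconds_bucket{handler=~"reverse_proxy", server="srv0"}[5m]),
	    "le", "$1", "le", "([0-9]+)\\.0"
	  )
	)

The cleaner fix remains making all instances expose the same format, as suggested below.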

I guess, for now, you just have to make sure that your Caddy instances are either all exposing OpenMetrics or all not exposing it.

You have my full sympathy that this is annoying and that it could have been avoided by a more thoughtful design.

@SteffenDE (Author)

Thank you all for clarifying this. It's a little strange, as all Caddy servers used the same configuration directory (mounted via NFS), but you're absolutely right that somehow only one instance reported in the OpenMetrics format.
After rebuilding my Docker image and restarting all Caddy servers, all of them now report in the OpenMetrics format, and Grafana is happy again:
[screenshot of the fixed Grafana heatmap]
