
Histogram bucket label value gets converted to float somehow #8367

Closed
SteffenDE opened this issue Jan 14, 2021 · 6 comments

Comments

@SteffenDE commented Jan 14, 2021

What did you do?
I'm tracking Caddy metrics (https://caddyserver.com/) with Prometheus (latest Docker image) and am experiencing a strange problem: I've got three instances reporting metrics, and for one instance the bucket label le="1" is stored as le="1.0":
[screenshot of the query result showing both le="1" and le="1.0"]
(query: sum(increase(caddy_http_request_duration_seconds_bucket{handler=~"reverse_proxy", server="srv0", le=~"1|1.0"}[5m])) by (le, instance))

I've checked the HTTP output of the affected Caddy instance, and every "le" label is correctly formatted as an integer. Only when querying Prometheus do I get a float value:

# curl -s http://10.100.16.9:8080/metrics|grep seconds_bucket|grep 200|grep GET|grep reverse_proxy|grep request
caddy_http_request_duration_seconds_bucket{code="200",handler="reverse_proxy",method="GET",server="srv0",le="0.005"} 364
caddy_http_request_duration_seconds_bucket{code="200",handler="reverse_proxy",method="GET",server="srv0",le="0.01"} 1101
caddy_http_request_duration_seconds_bucket{code="200",handler="reverse_proxy",method="GET",server="srv0",le="0.025"} 1773
caddy_http_request_duration_seconds_bucket{code="200",handler="reverse_proxy",method="GET",server="srv0",le="0.05"} 2253
caddy_http_request_duration_seconds_bucket{code="200",handler="reverse_proxy",method="GET",server="srv0",le="0.1"} 2757
caddy_http_request_duration_seconds_bucket{code="200",handler="reverse_proxy",method="GET",server="srv0",le="0.25"} 3147
caddy_http_request_duration_seconds_bucket{code="200",handler="reverse_proxy",method="GET",server="srv0",le="0.5"} 3847
caddy_http_request_duration_seconds_bucket{code="200",handler="reverse_proxy",method="GET",server="srv0",le="1"} 4390
caddy_http_request_duration_seconds_bucket{code="200",handler="reverse_proxy",method="GET",server="srv0",le="2.5"} 4490
caddy_http_request_duration_seconds_bucket{code="200",handler="reverse_proxy",method="GET",server="srv0",le="5"} 4506
caddy_http_request_duration_seconds_bucket{code="200",handler="reverse_proxy",method="GET",server="srv0",le="10"} 4511
caddy_http_request_duration_seconds_bucket{code="200",handler="reverse_proxy",method="GET",server="srv0",le="+Inf"} 4512
# curl -s http://10.100.16.9:8080/metrics|grep seconds_bucket|grep "le=\"1.0\""
# 

It also happens to the other buckets (le="5", le="10").

(This really messes up the Grafana heatmap when the affected instance is included:
[heatmap screenshot with the affected instance]
vs
[heatmap screenshot without it]
)

What did you expect to see?
The le label value should be the same (le="1") as for the other instances.

What did you see instead? Under which circumstances?
The le label value is formatted as a float instead of an integer.

Environment

  • System information:
Linux 4.15.0-115-generic x86_64
  • Prometheus version:
prometheus, version 2.24.0 (branch: HEAD, revision: 02e92236a8bad3503ff5eec3e04ac205a3b8e4fe)
  build user:       root@d9f90f0b1f76
  build date:       20210106-13:48:37
  go version:       go1.15.6
  platform:         linux/amd64
  • Prometheus configuration file:
# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']

  - job_name: 'caddy'

    hetzner_sd_configs:
     - bearer_token: "***"
       role: "hcloud"
    relabel_configs:
     - source_labels: [__meta_hetzner_hcloud_label_prometheus]
       regex: true.*
       action: keep
     - source_labels: [__meta_hetzner_hcloud_label_caddy]
       regex: true.*
       action: keep
     - source_labels: [__meta_hetzner_server_name]
       target_label: server_name
       replacement: '$1'
     - source_labels: [__meta_hetzner_hcloud_private_ipv4_net]
       target_label: __address__
       replacement: '$1:8080'
@brian-brazil (Contributor)

I don't see anything here that indicates a bug in Prometheus, as all the output you've provided is via third-party systems. If you have evidence that Prometheus ingestion is mangling data, please share it in the form of /query or /query_range output. I would presume that the data is like this because that is what the target is exposing, and so it should be taken up with them.

@SteffenDE (Author)

Here's the output of a query against /query.

$ curl 'http://127.0.0.1:9090/api/v1/query?query=sum%28increase%28caddy_http_request_duration_seconds_bucket%7Bhandler%3D%7E%22reverse_proxy%22%2C+server%3D%22srv0%22%2C+le%3D%7E%221%7C1.0%22%7D%5B5m%5D%29%29+by+%28le%2C+instance%29&time=1610648999.814'
{"status":"success","data":{"resultType":"vector","result":[{"metric":{"instance":"10.100.16.7:8080","le":"1"},"value":[1610648999.814,"671.5789473684209"]},{"metric":{"instance":"10.100.16.8:8080","le":"1"},"value":[1610648999.814,"0"]},{"metric":{"instance":"10.100.16.9:8080","le":"1.0"},"value":[1610648999.814,"541.0526315789474"]}]}}

The target is exposing the values as "1" and not as "1.0".

@brian-brazil (Contributor)

You're absolutely sure that 10.100.16.9:8080 is exposing 1 when Prometheus is scraping it? I'd guess that it's negotiating OpenMetrics, which you'd see if you used tcpdump, so there's no bug here (and any valid rendering of 1 is valid).

This sounds like something to take up with Grafana.
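
A minimal sketch of the tcpdump check suggested above (the interface and the capture filter are assumptions; adjust them to your setup). It prints the scrape traffic in plain text so you can see the Accept header Prometheus sends and the Content-Type the target answers with:

tcpdump -i any -A -s 0 'tcp port 8080 and host 10.100.16.9'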

@roidelapluie (Member) commented Jan 14, 2021

Can you try curl -H 'Accept: application/openmetrics-text; version=0.0.1' http://10.100.16.9:8080/metrics

@beorn7 (Member) commented Jan 14, 2021

@SteffenDE the "problem" is that Caddy activates OpenMetrics here. The client_golang documentation explains what's happening:

	// If true, the experimental OpenMetrics encoding is added to the
	// possible options during content negotiation. Note that Prometheus
	// 2.5.0+ will negotiate OpenMetrics as first priority. OpenMetrics is
	// the only way to transmit exemplars. However, the move to OpenMetrics
	// is not completely transparent. Most notably, the values of "quantile"
	// labels of Summaries and "le" labels of Histograms are formatted with
	// a trailing ".0" if they would otherwise look like integer numbers
	// (which changes the identity of the resulting series on the Prometheus
	// server).
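
For context, a minimal sketch of where the EnableOpenMetrics option documented above is set (an assumption for illustration: this is a generic client_golang exporter, not Caddy's actual code):

	package main

	import (
		"log"
		"net/http"

		"github.com/prometheus/client_golang/prometheus"
		"github.com/prometheus/client_golang/prometheus/promhttp"
	)

	func main() {
		// A registry with no collectors is enough to demonstrate the handler options.
		reg := prometheus.NewRegistry()
		// With EnableOpenMetrics set, a scraper that negotiates OpenMetrics gets
		// le="1.0" instead of le="1" on integer-valued bucket boundaries.
		http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{
			EnableOpenMetrics: true,
		}))
		log.Fatal(http.ListenAndServe(":8080", nil))
	}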

However, if you look at a /metrics endpoint without negotiating anything in particular (e.g. with curl or with your browser), then you get the classic Prometheus text format.

In your case, I assume some of your Caddy instances are configured differently and do not expose the OpenMetrics format; see caddyserver/caddy#3944 for how that might be possible.

This whole issue is a bit unfortunate and has a lot of historic context. Treating what's essentially a floating point number as a string creates this kind of shenanigans. You are not the first to run into it. My vision has always been that Prometheus will eventually become aware of bucket boundaries being floats rather than opaque strings, and at least there should be some sanitization within Prometheus. (It was there in Prometheus 1.x. It got dropped in Prometheus 2.x, arguably by accident. And my attempt to at least make it a requirement for OpenMetrics was not successful.) If you want to learn more in a somewhat entertaining way, you could watch me telling the story.

I'm not sure how Grafana's heatmap is dealing with it. I could imagine it's fine as long as the labels don't get mixed.

In any case, a PromQL sum by (le) (...) will certainly fail to sum le="1.0" and le="1" into the same resulting vector element. So you'll get them both, and that might throw Grafana off.
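
As a stopgap for this particular query, the le label could be normalized before aggregating; a sketch not suggested in this thread (the label_replace regex only rewrites values with a trailing ".0"):

	sum by (le, instance) (
	  label_replace(
	    increase(caddy_http_request_duration_seconds_bucket{handler=~"reverse_proxy", server="srv0"}[5m]),
	    "le", "$1", "le", "([0-9]+)\\.0"
	  )
	)

The cleaner fix remains making all instances expose the same format, as suggested below.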

I guess, for now, you just have to make sure that your Caddy instances are either all exposing OpenMetrics or all not exposing it.

You have my full sympathy that this is annoying and that it could have been avoided by a more thoughtful design.

@SteffenDE (Author)

Thank you all for clarifying this. It's a little strange, as all Caddy servers used the same configuration directory (mounted via NFS), but you're absolutely right that somehow only one instance reported in the OpenMetrics format.
After rebuilding my Docker image and restarting all Caddy servers, all of them now report in the OpenMetrics format, and Grafana is happy again:
[screenshot of the fixed Grafana heatmap]
