Prometheus not updating service discovery from Consul #3314

Closed
zemek opened this Issue Oct 18, 2017 · 5 comments

Contributor

zemek commented Oct 18, 2017

What did you do?
We configured two Prometheus servers in HA (both running the same config) to discover targets using Consul.

What did you expect to see?
Both Prometheus servers scraping all of the targets that are registered in Consul.

What did you see instead? Under which circumstances?
At first it was fine, but when we had more instances spin up, only one Prometheus server started scraping the new targets. Here is a graph:
[Image: screen shot 2017-10-18 at 2 10 34 pm]
rate(prometheus_target_interval_length_seconds_count{interval='15s'}[5m])*15
prometheus_sd_consul_rpc_duration_seconds{quantile='0.5', call="service"}
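
To see exactly which targets one server is missing, the target lists of the two HA servers can be diffed. This is a minimal Python sketch, not part of the original report. It assumes both servers expose `/api/v1/targets` with the `activeTargets` and `scrapeUrl` fields that current Prometheus documents (not verified against 2.0.0-rc.1). The function names are hypothetical.

```python
import json
import urllib.request


def fetch_targets(base_url: str, timeout: float = 10.0) -> dict:
    """Fetch the target list from one Prometheus server (assumed endpoint)."""
    url = base_url.rstrip("/") + "/api/v1/targets"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def scrape_urls(targets_response: dict) -> set:
    """Collect the scrape URLs of all active targets in a response."""
    return {t["scrapeUrl"] for t in targets_response["data"]["activeTargets"]}


def diff_targets(a: dict, b: dict) -> tuple:
    """Return (only_in_a, only_in_b) as sorted lists of scrape URLs."""
    urls_a, urls_b = scrape_urls(a), scrape_urls(b)
    return sorted(urls_a - urls_b), sorted(urls_b - urls_a)
```

Calling `diff_targets(fetch_targets(server_a), fetch_targets(server_b))` should return two empty lists when both servers have discovered the same targets.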

Sending a SIGHUP to the bad Prometheus server ended up resolving the issue. I don't think this would be caused by config since there weren't any changes, and it did initially get the full list of hosts.
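
For reference, the reload workaround can be scripted. This is a minimal Python sketch. Prometheus reloads its configuration on SIGHUP, and also on an HTTP POST to `/-/reload`, which in 2.x is only available when the server is started with `--web.enable-lifecycle`. Only SIGHUP was reported to help here. That the HTTP reload has the same effect on a stuck discovery is an assumption, and the function names and base URL are illustrative.

```python
import os
import signal
import urllib.request


def reload_via_signal(pid: int) -> None:
    """Ask a running Prometheus process to reload by sending it SIGHUP."""
    os.kill(pid, signal.SIGHUP)


def reload_request(base_url: str) -> urllib.request.Request:
    """Build the POST request for the /-/reload lifecycle endpoint."""
    url = base_url.rstrip("/") + "/-/reload"
    return urllib.request.Request(url, data=b"", method="POST")


def reload_via_http(base_url: str, timeout: float = 10.0) -> int:
    """Send the reload request and return the HTTP status code."""
    with urllib.request.urlopen(reload_request(base_url), timeout=timeout) as resp:
        return resp.status
```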

Environment

  • System information:
	Linux 4.4.0-97-generic x86_64
  • Prometheus version:
prometheus, version 2.0.0-rc.1 (branch: HEAD, revision: 5ab8834befbd92241a88976c790ace7543edcd59)
  build user:       root@1f56dd8b6f7b
  build date:       20171017-12:34:15
  go version:       go1.9.1
  • Prometheus configuration file:
global:
  scrape_interval:     15s # Scrape targets every 15 seconds.
  evaluation_interval: 15s # Evaluate rules every 15 seconds.

rule_files:
  - /etc/prometheus/recording_rules/*.yml
  - /etc/prometheus/alerting_rules/*.yml

scrape_configs:
  - job_name: 'node_exporter'
    consul_sd_configs:
      - server: '127.0.0.1:8500'
        services: ['node_exporter']
    relabel_configs: &DEFAULT_RELABEL
      - source_labels: [__meta_consul_tags]
        regex: '.*,host=([^,]+),.*'
        replacement: '${1}'
        target_label: 'host'
      - source_labels: [__meta_consul_tags]
        regex: '.*,role=([^,]+),.*'
        replacement: '${1}'
        target_label: 'role'
  - job_name: 'statsd_exporter'
    consul_sd_configs:
      - server: '127.0.0.1:8500'
        services: ['statsd_exporter']
    relabel_configs: *DEFAULT_RELABEL
  - job_name: 'self'
    consul_sd_configs:
      - server: '127.0.0.1:8500'
        services: ['prometheus']
    relabel_configs: *DEFAULT_RELABEL
  - job_name: 'alertmanager'
    consul_sd_configs:
      - server: '127.0.0.1:8500'
        services: ['alertmanager']
    relabel_configs: *DEFAULT_RELABEL

alerting:
  alertmanagers:
    - consul_sd_configs:
      - server: '127.0.0.1:8500'
        services: ['alertmanager']
  • Logs:
level=info ts=2017-10-17T22:13:21.658938785Z caller=main.go:216 msg="Starting prometheus" version="(version=2.0.0-rc.1, branch=HEAD, revision=5ab8834befbd92241a88976c790ace7543edcd59)"
level=info ts=2017-10-17T22:13:21.658993398Z caller=main.go:217 build_context="(go=go1.9.1, user=root@1f56dd8b6f7b, date=20171017-12:34:15)"
level=info ts=2017-10-17T22:13:21.659010138Z caller=main.go:218 host_details="(Linux 4.4.0-97-generic #120~14.04.1-Ubuntu SMP Wed Sep 20 15:53:13 UTC 2017 x86_64 ip-10-1-30-199 (none))"
level=info ts=2017-10-17T22:13:21.661576149Z caller=main.go:315 msg="Starting TSDB"
level=info ts=2017-10-17T22:13:21.661590711Z caller=targetmanager.go:68 component="target manager" msg="Starting target manager..."
level=info ts=2017-10-17T22:13:21.661548832Z caller=web.go:380 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2017-10-17T22:13:21.850824129Z caller=main.go:327 msg="TSDB started"
level=info ts=2017-10-17T22:13:21.850877412Z caller=main.go:394 msg="Loading configuration file" filename=/etc/prometheus/config.yml
level=info ts=2017-10-17T22:13:21.852601767Z caller=main.go:371 msg="Server is ready to receive requests."
level=info ts=2017-10-18T20:13:18.182409504Z caller=main.go:394 msg="Loading configuration file" filename=/etc/prometheus/config.yml

or4cle commented Nov 14, 2017

FWIW, we have encountered the same issue, on Prometheus 1.7.1. The workaround was also the same (send SIGHUP).

Member

krasi-georgiev commented Dec 1, 2017

I did a big refactoring of service discovery (SD). If you want, give it a try and report whether the bug is still there.

Here is a link to download an executable for 64-bit Linux:
https://github.com/krasi-georgiev/prometheus/releases/download/v2.0.0-beta.x/prometheus

Member

brian-brazil commented Apr 3, 2018

As we haven't heard back from you in some time, I'm going to presume that this was resolved in 2.x when SD was changed there. If not, please reopen.

vishksaj commented Sep 24, 2018

Facing the same issue in 2.2.1:
level=error ts=2018-09-24T15:16:31.126666572Z caller=consul.go:326 component="discovery manager scrape" discovery=consul msg="Error refreshing service" err="Get http://consul:8500/v1/catalog/service/######?index=55173847&wait=30000ms: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
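
The URL in that error is a Consul blocking query: `index` is the last seen catalog index and `wait` is how long the server may hold the request open. The Python sketch below shows the general long-poll pattern and is not a diagnosis of this error. It assumes the documented Consul behaviour that the response carries an `X-Consul-Index` header and that the server may add up to `wait / 16` of jitter, so the client timeout should exceed the wait. Function names are hypothetical and error handling is omitted.

```python
import json
import urllib.parse
import urllib.request


def catalog_url(server: str, service: str, index: int, wait_s: int) -> str:
    """Build the catalog URL for a blocking query on one service."""
    query = urllib.parse.urlencode({"index": index, "wait": f"{wait_s}s"})
    path = "/v1/catalog/service/" + urllib.parse.quote(service)
    return f"http://{server}{path}?{query}"


def client_timeout(wait_s: int, margin_s: float = 5.0) -> float:
    """Timeout that outlasts the server-side wait plus jitter of wait/16."""
    return wait_s + wait_s / 16 + margin_s


def next_index(previous: int, header_value: str) -> int:
    """Index for the next request; restart from 0 if it went backwards."""
    current = int(header_value)
    return current if current >= previous else 0


def watch(server: str, service: str, wait_s: int = 30):
    """Yield the service's catalog entries each time a blocking query returns."""
    index = 0
    while True:
        url = catalog_url(server, service, index, wait_s)
        with urllib.request.urlopen(url, timeout=client_timeout(wait_s)) as resp:
            index = next_index(index, resp.headers["X-Consul-Index"])
            yield json.load(resp)
```

A query that returns with an unchanged index simply means the wait elapsed without a catalog change, so the loop issues the next request with the same index.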

lock bot commented Mar 23, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Mar 23, 2019
