targets from consul never show up after upgrade to 1.4 #2220

Closed
onorua opened this Issue Nov 25, 2016 · 5 comments

Comments

onorua commented Nov 25, 2016

What did you do?
We have a server which was upgraded a couple of hours ago. It loaded all of its data, but apparently has not done anything useful since then.

When I open the Targets page in the GUI, nothing shows up; downgrading to 1.3.1 did the trick. I've checked, and it seems no major changes to the Consul auto-discovery happened.

What did you expect to see?
A smooth upgrade: restart, load the data files, and continue scraping.

Environment

  • System information:
Linux 4.4.0-45-generic x86_64
  • Prometheus version:
prometheus, version 1.4.0 (branch: master, revision: ecad074e46ef60536722c5b01e6c5277f2b50a3d)
  build user:       root@5fb8bc2a8e57
  build date:       20161125-12:45:05
  go version:       go1.7.3
  • Prometheus configuration file:
# my global config
global:
  scrape_interval:     30s # Scrape targets every 30 seconds.
  evaluation_interval: 30s # Evaluate rules every 30 seconds.
  # scrape_timeout is set to the global default (10s).

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
      monitor: 'cloud'

# Load and evaluate rules in this file every 'evaluation_interval' seconds.
rule_files:
  - "alert.rules"
  # - "first.rules"
  # - "second.rules"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  # Auto-discovery via consul tags by Registrator
  - job_name: 'consul-services'

    consul_sd_configs:
    - server: 'consul:8500'
      scheme: 'http'

    relabel_configs:
    - source_labels: [ '__meta_consul_tags' ]
      action: keep
      regex: .*exporter.*
    - source_labels: [ '__meta_consul_address', '__meta_consul_service_port' ]
      action: replace
      regex: (.+)(?::\d+);(\d+)
      replacement: $1:$2
      target_label: __address__
    - source_labels: [ '__meta_consul_service' ]
      action: keep
      regex: (.+)
      replacement: $1
      target_label: __name__
    - source_labels: [ '__meta_consul_service_id' ]
      action: replace
      regex: (.*):(.*):(.*)
      replacement: $2
      target_label: container_name
    - source_labels: [ '__meta_consul_service_id' ]
      action: replace
      regex: (.*):(.*):(.*)
      replacement: $1
      target_label: host_name
    - source_labels: [ '__meta_consul_service_id' ]
      action: replace
      regex: .*-(\w*)-(\w*)-(\d*):(.*):(.*)
      replacement: $1
      target_label: colo
  • Logs:
time="2016-11-25T13:53:49Z" level=info msg="All requests for rebuilding the label indexes queued. (Actual processing may lag behind.)" source="crashrecovery.go:529"
time="2016-11-25T13:53:49Z" level=warning msg="Crash recovery complete." source="crashrecovery.go:152"
time="2016-11-25T13:53:49Z" level=info msg="1657859 series loaded." source="storage.go:359"
time="2016-11-25T13:53:49Z" level=info msg="Starting target manager..." source="targetmanager.go:63"
time="2016-11-25T13:53:49Z" level=info msg="Listening on :9090" source="web.go:248"
time="2016-11-25T13:54:25Z" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:549"
time="2016-11-25T13:55:27Z" level=info msg="Done checkpointing in-memory metrics and chunks in 1m1.956668514s." source="persistence.go:573"

and it kept checkpointing like that, roughly once per minute.
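
For context, here is a rough sketch in Python (not Prometheus code) of what the __meta_consul_service_id relabel rules above are meant to do. The sample service ID is made up; the real IDs depend on how Registrator names the services.

import re

# Minimal illustration of Prometheus-style "replace" relabeling, assuming a
# hypothetical "<host>:<container>:<port>" service-ID shape.
def relabel(value, regex, replacement):
    # Prometheus anchors relabel regexes at both ends (full-string match).
    m = re.fullmatch(regex, value)
    if m is None:
        return None  # no match: a "replace" rule leaves the target label untouched
    # Prometheus replacements use $1, $2, ...; Python's re module uses \1, \2, ...
    return m.expand(replacement.replace("$", "\\"))

service_id = "node-eu-west-1:cadvisor:8080"  # hypothetical value

print(relabel(service_id, r"(.*):(.*):(.*)", "$2"))                  # container_name -> cadvisor
print(relabel(service_id, r"(.*):(.*):(.*)", "$1"))                  # host_name -> node-eu-west-1
print(relabel(service_id, r".*-(\w*)-(\w*)-(\d*):(.*):(.*)", "$1"))  # colo -> eu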

fabxc commented Nov 25, 2016

Thanks for reporting, and sorry about the disruption. I'll do my best to investigate and find a fix over the weekend.

The checkpointing cannot really be related. @beorn7

beorn7 commented Nov 27, 2016

If your server checkpoints more often than configured via the -storage.local.checkpoint-interval flag, it does so because it creates a lot of "dirty" series during series maintenance. That's normal and is aimed at making crash recovery faster. If you are running on SSDs, you can increase -storage.local.checkpoint-dirty-series-limit to get less frequent checkpointing.
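
For reference, that would look something like the following on a Prometheus 1.x command line; the limit value is only an illustration, not a recommendation:

prometheus -config.file=prometheus.yml \
  -storage.local.checkpoint-interval=5m \
  -storage.local.checkpoint-dirty-series-limit=250000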

There is a PromCon talk that explains the storage layer, including all those mysterious flags; see https://www.youtube.com/watch?v=HbnGSNEjhUc&t=1815s for the part where the flags come into play.

And as @fabxc said, this has nothing to do with a problem in the Consul SD.

onorua commented Nov 27, 2016

Thank you for the valuable suggestions! After setting -storage.local.checkpoint-dirty-series-limit on 1.4.0, I could see Prometheus scraping the node_exporter metrics for all hosts, but it completely ignored cAdvisor and our application exporters. Those metrics never show up on 1.4, but they appeared immediately as soon as I rolled back to 1.3.1.
Am I missing anything here?

fabxc commented Nov 28, 2016

The issue with your consul discovery was indeed a bug that #2223 fixes.
We'll cut a 1.4.1 with the fix today.

fabxc closed this in #2223 on Nov 28, 2016

lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators on Mar 24, 2019
