Bug in ConsulSD when using multiple jobs in same config. #3007

Closed
civik opened this Issue Jul 31, 2017 · 28 comments

civik commented Jul 31, 2017

What did you do?

Added multiple jobs, both using ConsulSD.

For example:

   - job_name: 'myconsuljob1'
     scrape_interval: 1m
     scrape_timeout: 55s
     consul_sd_configs:
     - server: 'my1.foo.com:8500'
     ...

   - job_name: 'myconsuljob2'
     scrape_interval: 1m
     scrape_timeout: 55s
     consul_sd_configs:
     - server: 'my2.foo.com:8500'

What did you expect to see?

Both jobs would discover their targets properly.

What did you see instead? Under which circumstances?

Only the first ConsulSD job in the config seems to work; if I switch the order in the config, it will give me the other set of discovered hosts. If I change one of the target Consul servers, it seems to discover both until the Prometheus server gets restarted, then it reverts to only discovering the first job in the config.

Environment

  • System information:

Linux 3.10.0-514.21.1.el7.x86_64 x86_64

  • Prometheus version:

1.7.1

  • Prometheus configuration file:
    global:
      scrape_interval: 60s
      scrape_timeout: 55s
      evaluation_interval: 55s
      external_labels:
        k8s_datacenter: xxx
        k8s_cluster: xxx

    rule_files:
    - "/etc/config/*.rule"

    scrape_configs:
    - job_name: ingress-check
      metrics_path: /probe
      params:
        module:
        - http_2xx
      relabel_configs:
      - regex: (.*)(:80)?
        replacement: ${1}
        source_labels:
        - __address__
        target_label: __param_target
      - regex: (.*)
        replacement: ${1}
        source_labels:
        - __param_target
        target_label: instance
      - regex: .*
        replacement: blackbox-prod:9115
        source_labels: []
        target_label: __address__
      static_configs:
      - labels:
          sourceenv: xx1
          sourcesvc: ingress
        targets:
        - https://xxx/
      - labels:
          sourceenv: xx2
          sourcesvc: ingress
        targets:
        - https://xxx/
      - labels:
          sourceenv: xx1
          sourcesvc: consul
        targets:
        - http://xx11001.xxfooxx.com:8500/v1/status/leader
        - http://xx11002.xxfooxx.com:8500/v1/status/leader
        - http://xx11003.xxfooxx.com:8500/v1/status/leader
      - labels:
          sourceenv: xx2-prod
          sourcesvc: consul
        targets:
        - http://xx21001.xxfooxx.com:8500/v1/status/leader
        - http://xx21002.xxfooxx.com:8500/v1/status/leader
        - http://xx21003.xxfooxx.com:8500/v1/status/leader
      - labels:
          sourceenv: xx1-prod
          sourcesvc: etcd
        targets:
        - http://10.XX.XX.XXX:2379/health
        - http://10.XX.XX.XXY:2379/health
        - http://10.XX.XX.XXZ:2379/health
      - labels:
          sourceenv: xx2-prod
          sourcesvc: etcd
        targets:
        - http://10.XX.XX.XXX:2379/health
        - http://10.XX.XX.XXY:2379/health
        - http://10.XX.XX.XXZ:2379/health

    - job_name: 'xx2-federate'
      scheme: https
      tls_config:
        insecure_skip_verify: true
      honor_labels: true
      metrics_path: '/federate'
      params:
        'match[]':
        - '{job="kubernetes-apiservers"}'
        - '{job="kubernetes-service-endpoints"}'
        - '{job="kubernetes-pods"}'
        - '{job="kubernetes-nodes"}'
      static_configs:
        - targets:
          - 'prometheus.us-central-1xx2.core'

    - job_name: 'xx1-federate'
      scheme: https
      tls_config:
        insecure_skip_verify: true
      honor_labels: true
      metrics_path: '/federate'
      params:
        'match[]':
        - '{job="kubernetes-apiservers"}'
        - '{job="kubernetes-service-endpoints"}'
        - '{job="kubernetes-pods"}'
        - '{job="kubernetes-nodes"}'
      static_configs:
        - targets:
          - 'prometheus.us-central-1xx1.core'

    - job_name: 'k8s-metal-xx2'
      scrape_interval: 2m
      scrape_timeout: 115s
      consul_sd_configs:
      - server: 'xx21002.xxfooxx:8500'
        services: ['nomad-client']
        scheme: http
      relabel_configs:
      - source_labels: [__meta_sd_consul_tags]
        separator:     ','
        regex:         label:([^=]+)=([^,]+)
        target_label:  ${1}
        replacement:   ${2}
      - source_labels: ['__address__']
        separator:     ':'
        regex:         '(.*):(4646)'
        target_label:  '__address__'
        replacement:   '${1}:9101'

    - job_name: 'k8s-metal-xx1'
      scrape_interval: 2m
      scrape_timeout: 115s
      consul_sd_configs:
      - server: 'xx11003.xxfooxx:8500'
        services: ['nomad-client']
        scheme: http
      relabel_configs:
      - source_labels: [__meta_sd_consul_tags]
        separator:     ','
        regex:         label:([^=]+)=([^,]+)
        target_label:  ${1}
        replacement:   ${2}
      - source_labels: ['__address__']
        separator:     ':'
        regex:         '(.*):(4646)'
        target_label:  '__address__'
        replacement:   '${1}:9101'

krasi-georgiev (Member) commented Sep 8, 2017

DIBS. Any idea if this has already been fixed or has an open PR?

civik (Author) commented Sep 9, 2017

Let me know if you need any additional data. I'm currently still exhibiting this problem. The config above will only load the hosts in the first SD job.

I'm basically repurposing another Consul service that has the hosts I want, and relabeling to change the port to the node_exporter port. That's pretty much the only out-of-the-ordinary thing I'm doing.
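For reference, that port rewrite is just the relabel rule from the config above; stripped down it amounts to the following sketch (keeping the ports used elsewhere in this thread):

  relabel_configs:
  - source_labels: ['__address__']
    regex: '(.*):(4646)'
    target_label: '__address__'
    replacement: '${1}:9101'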

krasi-georgiev (Member) commented Sep 9, 2017

@civik have you checked it against the dev-2.0 branch?
I think this will be released soon, and if it works well there then it might be better to just wait.

Otherwise, if any of the maintainers confirms it is OK to fix in this branch, I can work on it.

grobie (Member) commented Sep 9, 2017

@krasi-georgiev I'm reasonably sure the bug is specific to the consul sd implementation and hasn't been fixed in dev-2.0. Please feel free to give it a shot here, we'll release Prometheus v1.8 before the big 2.0 release. I'd love to review a PR to fix this bug (mention me in the PR in case you're able to find and fix it).

krasi-georgiev (Member) commented Sep 10, 2017

@civik @grobie I am not able to replicate the issue; with my setup Prometheus finds both jobs every time. Tested against v1.7.1 and master.

It seems there is something different in your setup, so let me know what you think it might be and I will keep trying. Try stripping down your config; maybe something in the config file is confusing the parsing.

I used Docker and this is my setup:

#create a network so that the containers can talk to each other
docker network create x

#start 2 independent consul servers
docker run -d --name consul --net=x -p 8500:8500 consul agent -dev -client=0.0.0.0 -bind=0.0.0.0
docker run -d --name consul1 --net=x -p 8501:8500 consul agent -dev -client=0.0.0.0 -bind=0.0.0.0

#connect one exporter to each consul server
docker run -d --net=x -p 9107:9107 prom/consul-exporter -consul.server=consul:8500
docker run -d --net=x  -p 9108:9107 prom/consul-exporter -consul.server=consul1:8500

#start one registrator for each consul server to register the exporters, and
#then in the prometheus config set it to scrape only the `consul-exporter` containers
docker run -d --net=x --volume=/var/run/docker.sock:/tmp/docker.sock gliderlabs/registrator:latest -ip="127.0.0.1" consul://consul:8500
docker run -d --net=x --volume=/var/run/docker.sock:/tmp/docker.sock gliderlabs/registrator:latest -ip="127.0.0.1" consul://consul1:8500

config

global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 15s
  external_labels:
    monitor: codelab-monitor
scrape_configs:
- job_name: myconsuljob1
  scrape_interval: 5s
  scrape_timeout: 4s
  metrics_path: /metrics
  scheme: http
  consul_sd_configs:
  - server: my1.foo.com:8500
    tag_separator: ','
    scheme: http
    services:
    - consul-exporter
- job_name: myconsuljob2
  scrape_interval: 5s
  scrape_timeout: 4s
  metrics_path: /metrics
  scheme: http
  consul_sd_configs:
  - server: my2.foo.com:8501
    tag_separator: ','
    scheme: http
    services:
    - consul-exporter

my1.foo.com and my2.foo.com point to the local Consul Docker containers at 127.0.0.1.
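(In case it helps anyone reproducing this: the two hostnames are just local aliases; something like the following /etc/hosts entries is all that is assumed here.)

127.0.0.1 my1.foo.com
127.0.0.1 my2.foo.com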

[screenshot, 2017-09-10: Prometheus targets page showing both Consul jobs discovered]

civik (Author) commented Sep 12, 2017

I just stripped my config down to the bare minimum. This is my current running config. I did try it without the relabels, but that also didn't help.

scrape_configs:
- job_name: k8s-metal-1
  honor_labels: true
  scrape_interval: 1m
  scrape_timeout: 50s
  metrics_path: /metrics
  scheme: http
  consul_sd_configs:
  - server: 10.6.5.10:8500
    tag_separator: ','
    scheme: http
    services:
    - nomad-client
  relabel_configs:
  - source_labels: [__address__]
    separator: ':'
    regex: (.*):(4646)
    target_label: __address__
    replacement: ${1}:9101
    action: replace
- job_name: k8s-metal-2
  honor_labels: true
  scrape_interval: 1m
  scrape_timeout: 50s
  metrics_path: /metrics
  scheme: http
  consul_sd_configs:
  - server: 10.7.5.10:8500
    tag_separator: ','
    scheme: http
    services:
    - nomad-client
  relabel_configs:
  - source_labels: [__address__]
    separator: ':'
    regex: (.*):(4646)
    target_label: __address__
    replacement: ${1}:9101
    action: replace

Some possible idiosyncrasies:

  • Consul 7.x in a 3-node cluster
  • Deploying with Helm
  • Consul running on a remote host
  • Consul running on the same PORT (8500), but on different SERVERS

just spitballin' here...

civik (Author) commented Sep 12, 2017

So suddenly it's got both configured now. The series of events was: remove all other scrape configs except the consul_sd stuff, reload the config, put the other scrape configs back, reload the config. Bam, both working. This was a straight Ctrl-Z to put the other configs back, so I'm pretty sure it's not a typo/syntax thing. Just a thought: could it be something with the behavior on a config reload vs. running from scratch?

krasi-georgiev (Member) commented Sep 12, 2017

How did you reload the config, so I can try it as well?

civik (Author) commented Sep 13, 2017

The Helm chart deployment watches the ConfigMaps and hits the /-/reload handler when they change. I should have more time to look at this today.

civik (Author) commented Sep 17, 2017

I've been wrestling with this on and off for a couple of days, trying to figure out what could be going wrong on my end and trying to find a consistent reproduction. This is as close as I've gotten.

I deploy the following config in its own instance:

scrape_configs:
- job_name: prometheus
  scrape_interval: 30s
  static_configs:
  - targets: ['127.0.0.1:9090']

- job_name: disc1
  scrape_interval: 2m
  scrape_timeout: 115s
  consul_sd_configs:
  - server: xx.xx.xx.72:8500
    services: ['prom-worker-metal']
    scheme: http
  relabel_configs:
  - source_labels: [__meta_sd_consul_tags]
    separator:     ','
    regex:         label:([^=]+)=([^,]+)
    target_label:  ${1}
    replacement:   ${2}
  - source_labels: ['__address__']
    separator:     ':'
    regex:         '(.*):(9091)'
    target_label:  '__address__'
    replacement:   '${1}:9101'

- job_name: disc2
  scrape_interval: 2m
  scrape_timeout: 115s
  consul_sd_configs:
  - server: xx.xx.xx.72:8500
    services: ['prom-master-metal']
    scheme: http
  relabel_configs:
  - source_labels: [__meta_sd_consul_tags]
    separator:     ','
    regex:         label:([^=]+)=([^,]+)
    target_label:  ${1}
    replacement:   ${2}
  - source_labels: ['__address__']
    separator:     ':'
    regex:         '(.*):(9091)'
    target_label:  '__address__'
    replacement:   '${1}:9101'

- job_name: disc3
  scrape_interval: 2m
  scrape_timeout: 115s
  consul_sd_configs:
  - server: xx.xx.xx.71:8500
    services: ['prom-worker-metal']
    scheme: http
  relabel_configs:
  - source_labels: [__meta_sd_consul_tags]
    separator:     ','
    regex:         label:([^=]+)=([^,]+)
    target_label:  ${1}
    replacement:   ${2}
  - source_labels: ['__address__']
    separator:     ':'
    regex:         '(.*):(9091)'
    target_label:  '__address__'
    replacement:   '${1}:9101'

- job_name: disc4
  scrape_interval: 2m
  scrape_timeout: 115s
  consul_sd_configs:
  - server: xx.xx.xx.71:8500
    services: ['prom-master-metal']
    scheme: http
  relabel_configs:
  - source_labels: [__meta_sd_consul_tags]
    separator:     ','
    regex:         label:([^=]+)=([^,]+)
    target_label:  ${1}
    replacement:   ${2}
  - source_labels: ['__address__']
    separator:     ':'
    regex:         '(.*):(9091)'
    target_label:  '__address__'
    replacement:   '${1}:9101'

Deploying this config yields 2/4 discovered jobs:

[screenshot: targets page showing 2 of 4 jobs discovered]

Then I hit the reload API endpoint, changing nothing else: curl -X POST http://prom-tst-consul.xxx.com/-/reload

Now all 4 jobs are discovered:
[screenshot: targets page showing all 4 jobs discovered]

Hitting the reload endpoint a 3rd time yields the 'disc4' job not being discovered.
Hitting the reload endpoint a 4th and 5th time yields the 'disc3' and 'disc4' jobs not being discovered.
Hitting the endpoint a 6th time yields the 'disc3' job not being discovered.
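(A sketch of how this flapping can be observed from the command line, assuming a Prometheus version that exposes the /api/v1/targets endpoint and that jq is installed; the hostname is the one used above.)

# repeatedly reload, then count active targets per scrape job
for i in 1 2 3 4 5 6; do
  curl -s -X POST http://prom-tst-consul.xxx.com/-/reload
  sleep 30   # give service discovery time to refresh
  curl -s http://prom-tst-consul.xxx.com/api/v1/targets \
    | jq -r '.data.activeTargets[].labels.job' | sort | uniq -c
done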

civik (Author) commented Sep 20, 2017

@krasi-georgiev It's got to be some kind of race or timeout on reading from Consul. If I hit the reload endpoint (curl -X POST http://prom.foo.com/-/reload) several times, I can eventually get all my Consul services discovered with their associated jobs. Also, if I relabel a port number to something that intentionally breaks the target, it will ALWAYS populate the correct jobs.

krasi-georgiev (Member) commented Oct 13, 2017

@civik is this still an issue with the latest master branch?
I have some time now, so I can try to replicate it again.
Ping me on the Prometheus IRC (krasi) and I will try to get to the bottom of this.

civik (Author) commented Oct 30, 2017

Yes, it's still an issue. The title of this issue is inaccurate; it seems as if it's more scale related. As soon as you hit 20-ish or more targets, it manifests. Perhaps a too-aggressive timeout? I'll ping you when I have a sec, but it may be a couple of weeks as it is our busiest time of year right now.

krasi-georgiev (Member) commented Oct 31, 2017

FYI, I am working on a big refactoring of the discovery service: #3362

krasi-georgiev (Member) commented Dec 1, 2017

@civik the SD refactoring is ready, and I am looking for someone to test it in a more complex environment before it gets merged.
Would you have some time to give it a try? I can compile and send you a binary if that would help.

krasi-georgiev (Member) commented Dec 1, 2017

@civik here is a link to download an executable for 64-bit Linux:
https://github.com/krasi-georgiev/prometheus/releases/download/v2.0.0-beta.x/prometheus

Let me know if it still has the same bug.
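(A rough sketch of how to try the binary, assuming a 64-bit Linux host, an existing prometheus.yml, and the --config.file flag used by the 2.0 betas; adjust paths as needed.)

wget https://github.com/krasi-georgiev/prometheus/releases/download/v2.0.0-beta.x/prometheus
chmod +x prometheus
./prometheus --config.file=prometheus.yml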

civik (Author) commented Dec 4, 2017

@krasi-georgiev OK, I will ASAP. We don't have any 2.0 test pipelines set up yet, but this might be the carrot on a stick to do so.

krasi-georgiev (Member) commented Dec 8, 2017

@civik how big is your setup, by the way?
If you don't think you will have time to test it, let me know so I can try to find some more use cases and test it properly before the merge.

brian-brazil (Member) commented Jun 15, 2018

Are you still seeing this with 2.3.0? Consul SD got a big rewrite.

civik (Author) commented Jun 20, 2018

@brian-brazil @krasi-georgiev sigh... sorry, I have failed you.

I'm out on vacation for a week or so; I'll try to test this when I return. Our scale for Consul SD in our largest environment would be something around ~100 nodes. My primary use case right now is discovering node_exporter on metal to complement the K8s SD.

krasi-georgiev (Member) commented Jun 20, 2018

No problem.
The SD refactoring is now merged, so you can download the 2.3 version.

davtsur commented Jun 27, 2018

I'm using version 2.3.0 and consul_sd_configs to scrape ~500 nodes with a few Consul SD config rules.

I noticed that we have a very large number of open fds. I also see many TCP connections open to the same node (for each of the scraped nodes); to be accurate, 55 TCP connections, and the number is constant and doesn't change.

I added an additional static config to see if it would show a different pattern of behavior in terms of open fds and open TCP connections, and it behaves more like I would expect: most of the time 0 TCP connections, and shortly after they are opened they close.
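(For anyone wanting to quantify this, a sketch of how such numbers can be gathered on Linux; it assumes Prometheus runs as a single local process listening on the default port 9090.)

# open file descriptors held by the prometheus process
ls /proc/$(pidof prometheus)/fd | wc -l

# the same figure as reported by Prometheus itself
curl -s http://localhost:9090/metrics | grep process_open_fds

# established TCP connections grouped by remote address
ss -tnp | grep prometheus | awk '{print $5}' | sort | uniq -c | sort -rn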

krasi-georgiev (Member) commented Jun 27, 2018

@davtsur thanks for the report, but would you mind opening a new issue, as this is unrelated to what we are discussing here?

simonpasquier (Member) commented Aug 1, 2018

@civik did you have time to see if the issue could be reproduced with the latest Prometheus release?

civik (Author) commented Sep 4, 2018

I've changed positions, so I lost my testbed for this issue. However, I am planning a new deployment that will leverage Consul SD extensively, so I should know more very soon. Just curious: what do we currently know about scale around Consul services? How many clients have been tested?

simonpasquier (Member) commented Sep 4, 2018

Thanks for the heads-up. @iksaif might have some information regarding your questions.

iksaif (Contributor) commented Sep 7, 2018

I've seen Prometheus instances scraping thousands of Consul targets (the limit really is the memory necessary for metric names and points, not Consul). For the Consul cluster itself, I've seen 5+ datacenters together with tens of thousands of nodes. Most of the patches to achieve that are upstream; some are on https://github.com/criteo-forks/consul/tree/1.2.2-criteo

simonpasquier (Member) commented Oct 9, 2018

@civik I'm closing it for now. Feel free to reopen if you face the issue again.

lock bot locked and limited conversation to collaborators Apr 7, 2019
