Bug in ConsulSD when using multiple jobs in same config. #3007
Comments
gouthamve added the component/service discovery and kind/bug labels on Aug 1, 2017
brian-brazil added the priority/P2 label on Aug 21, 2017
DIBS
Let me know if you need any additional data. I'm still seeing this problem: the config above will only load the hosts in the first SD job. I'm basically repurposing another Consul service that has the hosts I want, and relabeling to change the port to the node_exporter port. That's pretty much the only out-of-the-ordinary thing I'm doing.
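A minimal sketch of the kind of relabeling being described, assuming a hypothetical Consul server address and service name (the original config snippet is not reproduced here):

```yaml
scrape_configs:
  - job_name: 'node-metal'
    consul_sd_configs:
      - server: 'consul.example.com:8500'  # hypothetical Consul server
        services: ['repurposed-service']   # the existing service that lists the hosts
    relabel_configs:
      # Rewrite the discovered host so we scrape node_exporter's default port instead
      - source_labels: [__meta_consul_address]
        regex: '(.+)'
        target_label: __address__
        replacement: '${1}:9100'
```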
@civik have you checked it against the dev-2.0 branch? Otherwise, if any of the maintainers confirms it's OK to fix in this branch, I can work on it.
@krasi-georgiev I'm reasonably sure the bug is specific to the Consul SD implementation and hasn't been fixed in dev-2.0. Please feel free to give it a shot here; we'll release Prometheus v1.8 before the big 2.0 release. I'd love to review a PR to fix this bug (mention me in the PR in case you're able to find and fix it).
@civik @grobie I am not able to replicate the issue; with my setup Prometheus finds both jobs every time. It seems there is something different in your setup, so let me know what you think is different and I will keep trying. I used Docker, and this is my setup:
config
my1.foo.com and my2.foo.com are pointing to the local Consul Docker containers at 127.0.0.1
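For reference, a sketch of what such a two-job test config might look like; the service names here are assumptions, not the exact setup used:

```yaml
scrape_configs:
  - job_name: 'consul-job-1'
    consul_sd_configs:
      - server: 'my1.foo.com:8500'  # resolves to a local Consul container
        services: ['service-a']     # hypothetical service name
  - job_name: 'consul-job-2'
    consul_sd_configs:
      - server: 'my2.foo.com:8500'  # resolves to a local Consul container
        services: ['service-b']     # hypothetical service name
```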
I just stripped my config down to the bare minimum. This is my current running config. I did try it without the relabels, but that didn't help either.
Some possible idiosyncrasies:
just spitballin' here...
So suddenly it's got both configured now. The series of events was: remove all other scrape configs except the consul_sd stuff, reload config, put back the other scrape configs, reload config. Bam, both working. This was a straight Ctrl-Z to put the other configs back, so I'm pretty sure it's not a typo/syntax thing. Just a thought: could it be something with behavior on a config reload vs. running from scratch?
How did you reload the config, so I can try as well?
The Helm chart deployment watches the ConfigMaps and will hit the /-/reload handler when they change. I should have more time to look at this today.
@krasi-georgiev It's got to be some kind of race or timeout on reading from Consul. If I hit the reload
@civik is this still an issue with the latest master branch?
Yes, it's still an issue. The title of this issue is inaccurate; it seems as if it's more scale-related. As soon as you hit 20-ish+ targets, it manifests. Perhaps too aggressive a timeout? I'll ping you when I have a sec, but it may be a couple of weeks, as it is our busiest time of year right now.
FYI, I am working on a big refactoring of the discovery service: #3362
@civik the SD refactoring is ready, and we're looking for someone to test it in a more complex environment before it gets merged.
@civik let me know if it still has the same bug.
@krasi-georgiev OK, I will ASAP. We don't have any 2.0 test pipelines set up yet, but this might be a carrot on a stick to do so.
krasi-georgiev referenced this issue on Dec 4, 2017: Decouple the discovery and refactor the retrieval package #3362 (merged)
@civik How big is your setup, btw?
Are you still seeing this with 2.3.0? Consul SD got a big rewrite.
@brian-brazil @krasi-georgiev sigh... sorry, I have failed you. I'm out on vacation for a week or so; I'll try to test this when I return. Our scale for Consul SD in our largest environment would be around ~100 nodes. My primary use case right now is discovering node_exporter on metal to complement the K8s SD.
No problem.
davtsur commented on Jun 27, 2018
I'm using version 2.3.0 and using consul_sd_configs to scrape ~500 nodes with a few Consul SD config rules. I noticed that we have a very large number of open fds. I also see many TCP connections open to the same node (for each of the scraped nodes): to be exact, 55 TCP connections, and the count is constant and doesn't change. I added an additional static config to see if it would show a different pattern of behavior in terms of open fds and open TCP connections, and it behaves more like I would expect, meaning most of the time 0 TCP connections, and shortly after they are opened they close.
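A sketch of the comparison being described, with hypothetical job, service, and target names; the static job acts as a control showing the expected connection behavior:

```yaml
scrape_configs:
  - job_name: 'via-consul'                   # exhibits the constant open TCP connections
    consul_sd_configs:
      - server: 'localhost:8500'
        services: ['node-exporter']          # hypothetical service name
  - job_name: 'via-static'                   # control: same kind of target, no service discovery
    static_configs:
      - targets: ['node1.example.com:9100']  # hypothetical target
```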
@davtsur thanks for the report, but would you mind opening a new issue, as this is unrelated to what we are discussing here?
@civik did you have time to see if the issue could be reproduced with the latest Prometheus release?
I've changed positions, so I lost my testbed for this issue. However, I am planning a new deployment that will leverage Consul SD extensively, so I should know more very soon. Just curious: what do we currently know about scale around Consul services? How many clients have been tested?
Thanks for the heads-up. @iksaif might have some information regarding your questions.
I've seen Prometheus instances scraping thousands of Consul targets (the limit really is the memory necessary for metric names and points, not Consul). For a Consul cluster itself, I've seen 5+ datacenters together with tens of thousands of nodes. Most of the patches to achieve that are upstream; some are on https://github.com/criteo-forks/consul/tree/1.2.2-criteo
simonpasquier added the kind/more-info-needed label on Sep 7, 2018
@civik I'm closing it for now. Feel free to reopen if you face the issue again.



civik commented on Jul 31, 2017 (edited)
What did you do?
Added multiple jobs, both using ConsulSD.
For example:
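A minimal sketch of a config with two ConsulSD jobs of the kind described, using illustrative server addresses and service names:

```yaml
scrape_configs:
  - job_name: 'consul-one'
    consul_sd_configs:
      - server: 'consul1.example.com:8500'  # illustrative address
        services: ['svc-one']               # illustrative service name
  - job_name: 'consul-two'
    consul_sd_configs:
      - server: 'consul2.example.com:8500'  # illustrative address
        services: ['svc-two']               # illustrative service name
```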
What did you expect to see?
Both jobs would discover properly
What did you see instead? Under which circumstances?
Only the first ConsulSD job in the config seems to work; if I switch the order in the config, it will give me the other set of discovered hosts. If I change one of the target Consul servers, it seems to discover both until the Prometheus server gets restarted; then it reverts back to only discovering the first job in the config.
Environment
System information: Linux 3.10.0-514.21.1.el7.x86_64 x86_64
Prometheus version: 1.7.1