DNS SD: If all targets disappear, Prometheus keeps them. #2799
Comments
beorn7 added the component/service discovery and kind/bug labels on Jun 2, 2017
brian-brazil added the priority/P3 label on Jul 14, 2017
I'm hitting this bug pretty consistently while renaming services, causing targets to get scraped twice with old and new labels.
The bug is pretty readable in the code: we simply interpret any empty result set as an error (prometheus/discovery/dns/dns.go, lines 193 to 198 in 76c9a0d).
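The problematic pattern can be sketched roughly as follows (a simplified illustration, not the actual dns.go source; `resolveBuggy` and the data shapes are assumptions for the sake of the example):

```go
package main

import (
	"errors"
	"fmt"
)

// resolveBuggy illustrates the bug: each inner slice stands in for the
// answer section returned by one DNS server. An empty answer list is
// treated the same as a failed lookup, so when a record has legitimately
// disappeared everywhere, the caller gets an error instead of an empty
// target list — and keeps the stale cached targets.
func resolveBuggy(answersPerServer [][]string) ([]string, error) {
	for _, answer := range answersPerServer {
		if len(answer) > 0 {
			return answer, nil // first non-empty answer wins
		}
		// Empty answer: fall through and try the next server.
	}
	// All servers returned empty: reported as an error, so a legitimate
	// "zero targets" result never reaches the caller.
	return nil, errors.New("could not resolve: no server responded")
}

func main() {
	// A record that has been removed everywhere yields an error,
	// not an empty target list.
	_, err := resolveBuggy([][]string{{}, {}})
	fmt.Println(err != nil) // prints "true"
}
```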
I understand the issue now. The behavior to continue as long as no results are returned was originally introduced in 8c08a50 in order to support search domains. I don't think we should use the length of the returned result as the indicator here; we should check the returned rcode instead.
@grobie That's not the right behavior: if you're using multiple search domains and the name you're looking for is available on the next server, you can't check for NXDOMAIN to determine whether to continue or not.
Coming at this from another side: should we be supporting search domains at all? I would have expected the user to have to provide the FQDN.
I don't have the energy for this discussion. I'm going to unsubscribe from the thread; ping me if you have questions about the issue itself.
In any case, a DNS SRV entry that has gone away entirely needs to remove the Prometheus target entirely. Thanks, @grobie, for spotting the issue here. :supergrover:
Yes, we need search domains, especially because we expect people to have "one Prometheus per cluster". Search domains are not without issues, but such is life when using DNS for service discovery. Even musl/Alpine eventually saw the light here and implemented them. So, if the current code does inscrutable things, let's spec out the behaviour we want and go from there?
@grobie If you are too busy during your trip, I can give it a try. Just let me know, because we really cannot afford to work on it in parallel.
beorn7 self-assigned this on Jul 20, 2017
I take the lock on this for now.
The discussion around search domains is irrelevant here; we have to change the implementation regardless. What we have to change is the handling of empty DNS responses. We currently assume that any DNS response with an empty resource-records answer list is invalid, and we return an error.

It's not correct to conclude from the number of returned resource records whether a query was successful or not. The DNS protocol provides the rcode field for that. First we need to decide which rcodes indicate an error for which we want to keep the current behavior (returning an error, ignoring the result, and keeping the cached targets list) and which rcodes indicate a successful query. There are at least 20 rcodes defined by IANA: https://www.iana.org/assignments/dns-parameters/dns-parameters.xhtml#dns-parameters-6 — the most common response codes in practice are …

Implementation-wise, we need to change the …
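The rcode-driven decision described above can be sketched like this (a sketch only, not the actual discovery code; the constants mirror IANA rcode assignments, and `keepCachedTargets` is a hypothetical name — how NXDOMAIN interacts with search-domain walking is the open question raised earlier in the thread):

```go
package main

import "fmt"

// DNS response codes (rcode) as assigned by IANA; only the ones most
// relevant to this discussion are listed here.
const (
	RcodeNoError  = 0 // NOERROR: query succeeded (answer may still be empty)
	RcodeServFail = 2 // SERVFAIL: server-side failure
	RcodeNXDomain = 3 // NXDOMAIN: the name does not exist
	RcodeRefused  = 5 // REFUSED: server refused the query
)

// keepCachedTargets judges the response by its rcode, not by the length
// of the answer section. NOERROR with an empty answer is a valid "zero
// targets" result and must clear the cached targets; transient server
// errors keep them.
func keepCachedTargets(rcode int, answers int) (keep bool, targets int) {
	switch rcode {
	case RcodeNoError, RcodeNXDomain:
		// Authoritative result: trust it, even if it means no targets.
		return false, answers
	default:
		// SERVFAIL, REFUSED, etc.: treat as transient, keep old targets.
		return true, 0
	}
}

func main() {
	keep, n := keepCachedTargets(RcodeNoError, 0)
	fmt.Println(keep, n) // prints "false 0": an empty NOERROR answer drops all targets
}
```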
Given the course of events (and my workflow, and the fact that I'm going on vacation on Friday), I'll pass the lock on to @grobie (who has quite some detailed ideas already, anyway).
beorn7 assigned grobie and unassigned beorn7 on Jul 24, 2017
This bug is going to bite me Real Soon Now (about to start using dns_sd_config in anger). Is this on your plate to fix in the next, say, few days or so, @grobie? Or would you be happy to accept a suitable PR based on your last comments (which I think are right on the button)?
I would be more than happy to review and accept a pull request! I'm not working on this issue, nor will I have time for it during the next couple of weeks.

Ideally we'd first finally write a test setup to ensure the correct behavior in all cases. Though I haven't found a DNS server mock implementation yet, so it might mean either writing one of our own or writing something more like integration tests against a real DNS server.
I'm not sure I'm up for large-scale refactorings... I can make myself understood in Go, but I'm far from literate. I'll give it a go, but don't expect too much.
I think the fix here is relatively easy. The more challenging part is to set up proper testing, as described by @grobie, to make sure similar things will not bite us again in the future.
mpalmer referenced this issue on Sep 5, 2017: Improve DNS response handling to prevent "stuck" records [Fixes #2799] #3138 (merged)
OK, PR created. Not quite "relatively easy" after all... Without a test suite already in place for the discovery module, I was lost trying to comprehensively test the changes. Instead, I commented the bejeebers out of all the code, and did a fair bit of refactoring (including making the function names more descriptive) to make it painfully obvious what exactly is going on. I also exercised the changes in a variety of ways to try and shake out any remaining corner cases.
mpalmer added a commit to mpalmer/prometheus that referenced this issue on Sep 7, 2017
grobie added a commit that referenced this issue on Sep 15, 2017
Fixed in #3138.
grobie closed this on Sep 15, 2017
unclegaara commented on Jul 26, 2018:
I have the same problem with Prometheus 2.1.0.
Well, given the problem reported here was fixed nearly a year ago, you're probably going to be better served by opening a new issue.
lock bot commented on Mar 22, 2019:
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
beorn7 commented on Jun 2, 2017 (edited)

What did you do?
Configured a job with dns_sd_configs. At some point, the DNS entry disappeared completely.
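A configuration along these lines reproduces the setup (a sketch only; the job name and SRV record are illustrative, not taken from the report):

```yaml
scrape_configs:
  - job_name: 'node'                     # illustrative job name
    dns_sd_configs:
      - names:
          - '_node._tcp.example.com'     # SRV record that later disappears
        type: 'SRV'
        refresh_interval: 30s
```

When the SRV record is deleted, the expectation below is that the discovered targets vanish with it.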
What did you expect to see?
The targets should disappear from Prometheus.

What did you see instead? Under which circumstances?
The targets were kept by Prometheus, appearing on the targets page as down, up == 0, etc., triggering corresponding alerts.

Environment
System information:
Linux 3.16.7- x86_64

Prometheus version:
prometheus, version 1.6.0 (branch: master, revision: 10f6453)
  build user: root@0816a56aa81c
  build date: 20170414-18:36:18
  go version: go1.8.1

Prometheus configuration file:
n/a

Alertmanager configuration file:
n/a

Logs:
n/a