
DNS SD: If all targets disappear, Prometheus keeps them. #2799

Closed
beorn7 opened this Issue Jun 2, 2017 · 21 comments


beorn7 commented Jun 2, 2017

What did you do?

Configured a job with dns_sd_configs. At some point, the DNS entry disappeared completely:

$ dig +short SRV all.telemetry.xxx.srv.s-cloud.net
$

What did you expect to see?

The targets should disappear from Prometheus.

What did you see instead? Under which circumstances?

The targets were kept by Prometheus, appearing on the targets page as down (up == 0, etc.) and triggering the corresponding alerts.

Environment

  • System information:

Linux 3.16.7- x86_64

  • Prometheus version:

prometheus, version 1.6.0 (branch: master, revision: 10f6453)
build user: root@0816a56aa81c
build date: 20170414-18:36:18
go version: go1.8.1

  • Alertmanager version:

n/a

  • Prometheus configuration file:
[...]
- job_name: xxx
  dns_sd_configs:
  - names:
    - all.telemetry.xxx.srv.s-cloud.net
    refresh_interval: 10s
[...]
  • Alertmanager configuration file:
    n/a

  • Logs:
    n/a

grobie commented Jul 19, 2017

I'm hitting this bug pretty consistently while renaming services, causing targets to get scraped twice with old and new labels.

grobie commented Jul 19, 2017

The bug is pretty readable in the code: we simply interpret any empty result set as an error:

if len(response.Answer) > 0 {
	return response, nil
}
[...]
return response, fmt.Errorf("could not resolve %s: no server responded", name)

grobie commented Jul 19, 2017

I understand the issue now. The behavior of continuing as long as no results are returned was originally introduced in 8c08a50 in order to support search domains. I don't think we should use the length of the returned result as an indicator here; we should check the returned rcode instead.

discordianfish commented Jul 20, 2017

@grobie That's not the right behavior: if you're using multiple search domains and the name you're looking for is only available via a later candidate, you can't check for NXDOMAIN to determine whether to continue or not.
It's also not clear to me how that code causes this. It looks like it returns the first non-empty response, or else the last one together with an error, in which case I would assume Prometheus removes the targets. But that apparently doesn't happen? So should we return err == nil plus the last response that didn't have SERVFAIL etc.?

brian-brazil commented Jul 20, 2017

Coming from another side, should we be supporting search domains? I would have expected the user to have to provide the FQDN.

discordianfish commented Jul 20, 2017

Not having energy for this discussion. Going to unsubscribe from the thread, ping me if you have questions about the issue itself.

beorn7 commented Jul 20, 2017

In any case, a DNS SRV entry that has gone away entirely needs to remove the Prometheus target entirely. Thanks, @grobie, for spotting the issue here. :supergrover:

matthiasr commented Jul 20, 2017

Yes, we need search domains, especially because we expect people to have "one Prometheus per cluster". Search domains are not without issues but such is life when using DNS for Service Discovery. Even musl/alpine eventually saw the light here and implemented them.

So, if the current code does inscrutable things … let's spec out the behaviour we want and go from there?

beorn7 commented Jul 20, 2017

@grobie if you are too busy during your trip, I can give it a try. Just let me know because we really cannot afford to work on it in parallel.

beorn7 self-assigned this Jul 20, 2017

beorn7 commented Jul 20, 2017

I take the lock on this for now.

grobie commented Jul 20, 2017

The discussion around search domains is irrelevant here, we have to change the implementation regardless.

What we have to change is the handling of empty DNS responses. We currently assume that any DNS response with an empty answer section is invalid and return a fmt.Errorf("could not resolve %s: no server responded", name) error. This is partially beneficial, as DNS discovery already degrades gracefully in case of DNS outages, as described in #2924.

It's not correct to conclude from the number of returned resource records whether a query was successful or not. The DNS protocol provides the rcode field for that.

First we need to decide which rcodes indicate an error for which we want to keep the current behavior (returning an error, ignoring the result and keeping the cached targets list) and which rcodes indicate a successful query. There are at least 20 rcodes defined by the IANA: https://www.iana.org/assignments/dns-parameters/dns-parameters.xhtml#dns-parameters-6

The most common response codes in practice are NoError, NXDomain and ServFail. The first two would be clear "success, use the returned result" and the last a clear "return an error and don't store the (empty) result". I'm unsure about all others, we could start with treating them as errors.

Implementation-wise, we need to change the len(response.Answer) shortcut in the loop to response.Rcode == dns.RcodeSuccess and continue in all other cases. Instead of always returning an error at the end, we need a switch on the rcode.

beorn7 commented Jul 24, 2017

Given the course of events (and my workflow, and the fact that I'm going on vacation on Friday), I'll pass the lock onto @grobie (who has quite some detailed ideas already, anyway).

beorn7 assigned grobie and unassigned beorn7 Jul 24, 2017

mpalmer commented Sep 4, 2017

This bug is going to bite me Real Soon Now (about to start using dns_sd_config in anger). Is this on your plate to fix in the next, say, few days or so, @grobie? Or would you be happy to accept a suitable PR based on your last comments (which I think are right on the button)?

grobie commented Sep 4, 2017

mpalmer commented Sep 4, 2017

I'm not sure I'm up for large-scale refactorings... I can make myself understood in Go, but I'm far from literate. I'll give it a go, but don't expect too much.

beorn7 commented Sep 4, 2017

I think the fix here is relatively easy. The more challenging part is to set up proper testing as described by @grobie to make sure similar things will not bite us again in the future.

mpalmer commented Sep 5, 2017

OK, PR created. Not quite "relatively easy" after all...

Without a test suite already in place for the discovery module, I was lost trying to comprehensively test the changes. Instead, I commented the bejeebers out of all the code, and did a fair bit of refactoring (including making the function names more descriptive) to make it painfully obvious what exactly is going on. I also exercised the changes in a variety of ways to try and shake out any remaining corner cases.

mpalmer added a commit to mpalmer/prometheus that referenced this issue Sep 7, 2017

Improve DNS response handling to prevent "stuck" records [Fixes prometheus#2799]

The problem reported in prometheus#2799 was that in the event that all records for a
name were removed, the target group was never updated to be the "empty" set.
Essentially, whatever Prometheus last saw as a non-empty list of targets
would stay that way forever (or at least until Prometheus restarted...).  This
came about because of a fairly naive interpretation of what a valid-looking
DNS response actually looked like -- essentially, the only valid DNS responses
were ones that had a non-empty record list.  That's fine as long as your
config always lists only target names which have non-empty record sets; if
your environment happens to legitimately have empty record sets sometimes,
all hell breaks loose (otherwise-cleanly shutdown systems trigger up==0 alerts,
for instance).

This patch is a refactoring of the DNS lookup behaviour that maintains
existing behaviour with regard to search paths, but correctly handles empty
and non-existent record sets.

RFC1034 s4.3.1 says there are three ways a recursive DNS server can respond:

1.  Here is your answer (possibly an empty answer, because of the way DNS
   considers all records for a name, regardless of type, when deciding
   whether the name exists).

2. There is no spoon (the name you asked for definitely does not exist).

3. I am a teapot (something has gone terribly wrong).

Situations 1 and 2 are fine and dandy; whatever the answer is (empty or
otherwise) is the list of targets.  If something has gone wrong, then we
shouldn't go updating the target list because we don't really *know* what
the target list should be.

Multiple DNS servers to query is a straightforward augmentation; if you get
an error, then try the next server in the list, until you get an answer or
run out of servers to ask.  Only if *all* the servers return errors should you
return an error to the calling code.

Where things get complicated is the search path.  In order to be able to
confidently say, "this name does not exist anywhere, you can remove all the
targets for this name because it's definitely GORN", at least one server for
*all* the possible names needs to return either successful-but-empty
responses, or NXDOMAIN.  If any name errors out, then -- since that one
might have been the one where the records came from -- you need to say
"maintain the status quo until we get a known-good response".

It is possible, though unlikely, that a poorly-configured DNS setup (say,
one which had a domain in its search path for which all configured recursive
resolvers respond with REFUSED) could result in the same "stuck" records
problem we're solving here, but the DNS configuration should be fixed in
that case, and there's nothing we can do in Prometheus itself to fix the
problem.

I've tested this patch on a local scratch instance in all the various ways I
can think of:

1. Adding records (targets get scraped)

2. Adding records of a different type

3. Remove records of the requested type, leaving other type records intact
   (targets don't get scraped)

4. Remove all records for the name (targets don't get scraped)

5. Shut down the resolver (targets still get scraped)

There's no automated test suite additions, because there isn't a test suite
for DNS discovery, and I was stretching my Go skills to the limit to make
this happen; mock objects are beyond me.

grobie added a commit that referenced this issue Sep 15, 2017

Improve DNS response handling to prevent "stuck" records [Fixes #2799] (#3138)

grobie commented Sep 15, 2017

Fixed in #3138.

@grobie grobie closed this Sep 15, 2017

unclegaara commented Jul 26, 2018

I have the same problem with Prometheus 2.1.0.


mpalmer commented Jul 27, 2018

Well, given the problem reported here was fixed nearly a year ago, you're probably going to be better served by opening a new issue.

lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 22, 2019
