External queries fail with Cloudflare domain in DNS search list #119

Closed
jpap opened this issue Jun 29, 2017 · 10 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@jpap

jpap commented Jun 29, 2017

My domain's DNS is hosted with Cloudflare. I am using a bare metal cluster that has the domain in the /etc/resolv.conf search list.

DNS queries for external domains (e.g. www.yahoo.com) fail under the above conditions on some containers (e.g. Alpine-based):

/ # ping www.yahoo.com
ping: bad address 'www.yahoo.com'

When I manually remove my domain from the /etc/resolv.conf search list in the container the query works as expected.
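For anyone wanting to reproduce the manual edit described above, here is a minimal Python sketch that reads resolv.conf-style text, reports the search list and `ndots` value, and writes the text back without one search domain. The sample content is an assumption for illustration: the nameserver address and the cluster suffixes are hypothetical, and `mydomain.com` stands in for the Cloudflare-hosted domain.

```python
def parse_resolv_conf(text):
    """Return (search_domains, ndots) from resolv.conf-style text."""
    search, ndots = [], 1  # resolv.conf(5) documents a default of ndots:1
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":
            continue  # skip blank lines and comments
        fields = line.split()
        if fields[0] == "search":
            search = fields[1:]  # a later search line replaces an earlier one
        elif fields[0] == "options":
            for opt in fields[1:]:
                if opt.startswith("ndots:"):
                    ndots = int(opt.split(":", 1)[1])
    return search, ndots


def drop_search_domain(text, domain):
    """Return the text with `domain` removed from any search line."""
    wanted = domain.rstrip(".").lower()
    out = []
    for raw in text.splitlines():
        fields = raw.split()
        if fields and fields[0] == "search":
            kept = [d for d in fields[1:] if d.rstrip(".").lower() != wanted]
            if kept:  # drop the line entirely if nothing is left
                out.append("search " + " ".join(kept))
        else:
            out.append(raw)
    return "\n".join(out) + "\n"


# Hypothetical container resolv.conf; values are illustrative only.
SAMPLE = """\
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local mydomain.com
options ndots:5
"""

if __name__ == "__main__":
    print(parse_resolv_conf(SAMPLE))
    print(drop_search_domain(SAMPLE, "mydomain.com"), end="")
```

The functions only transform text; writing the result back to /etc/resolv.conf inside a container is left to the reader, and the change will not survive a container restart.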

With Wireshark, I was able to determine that Cloudflare's NS returns RCODE = 0 with no RRs when queried with a nonexistent domain (e.g. www.yahoo.com.mydomain.com). Most other NSs I've tried return RCODE = 3 in this case. (This issue never came up until I moved to Cloudflare; my domain registry's nameservers return RCODE = 3 for nonexistent domains.)

Could the RCODE = 0 result on the search-list query be preventing Kubernetes DNS from performing an FQDN lookup (e.g. just www.yahoo.com) in this case, resulting in the ultimate failure of the query?
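The hypothesis in the question above can be illustrated with a toy model in Python. It is only a sketch under stated assumptions: the candidate ordering is a simplified version of search-list expansion, the `stop_on_nodata` flag models a resolver that ends the search when it sees RCODE = 0 with no records (the thread does not confirm which component behaves this way), and the lookup tables and the address 203.0.113.7 are made up for the example.

```python
NXDOMAIN = "NXDOMAIN"  # RCODE = 3
NODATA = "NODATA"      # RCODE = 0 with no resource records


def candidates(name, search, ndots):
    """Names a resolver might try, in order (simplified model)."""
    if name.endswith("."):
        return [name.rstrip(".")]  # already fully qualified
    suffixed = [f"{name}.{domain}" for domain in search]
    if name.count(".") >= ndots:
        return [name] + suffixed
    return suffixed + [name]


def resolve(name, search, ndots, lookup, stop_on_nodata):
    """Walk the candidates; return an address string or None."""
    for candidate in candidates(name, search, ndots):
        result = lookup(candidate)
        if result == NXDOMAIN:
            continue  # name does not exist, try the next candidate
        if result == NODATA:
            if stop_on_nodata:
                return None  # assumed behaviour: search ends here
            continue
        return result
    return None


def make_lookup(table):
    """Build a fake upstream: unknown names yield NXDOMAIN."""
    return lambda name: table.get(name, NXDOMAIN)


# Upstream that answers RCODE = 0 / no RRs for the search-list name.
nodata_upstream = make_lookup({
    "www.yahoo.com.mydomain.com": NODATA,
    "www.yahoo.com": "203.0.113.7",
})
# Upstream that answers RCODE = 3 for the search-list name.
nxdomain_upstream = make_lookup({"www.yahoo.com": "203.0.113.7"})

if __name__ == "__main__":
    args = ("www.yahoo.com", ["mydomain.com"], 5)
    print(resolve(*args, nodata_upstream, stop_on_nodata=True))
    print(resolve(*args, nxdomain_upstream, stop_on_nodata=True))
```

In this model the bare name is never queried when the search-list name comes back as NODATA, which would match the observed failure and the fact that removing the domain from the search list makes the query work.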

I've raised the RCODE issue with Cloudflare, and had a quick look at the SkyDNS and miekg/dns project source code, but it was not immediately clear to me what the code path is here.

@cmluciano

That seems likely. I wonder if this is something that requires an entry through dnsmasq. @bowei @MrHohn thoughts?

@bowei
Member

bowei commented Jun 29, 2017

Returning RCODE=0 means success. Some DNS implementations interpret this as "the binding exists, but there are no addresses for the queried type". Cloudflare should really return RCODE=3 (NXDOMAIN) in this case.
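To make the distinction concrete, here is a small Python sketch that classifies a raw DNS response from its 12-byte header alone, separating NXDOMAIN (RCODE=3) from the "success but no records" case (RCODE=0 with an empty answer section, often called NODATA). The `fake_header` helper and its message ID are made up for the example, and treating any non-empty answer section as a usable answer is a simplification.

```python
import struct


def classify(response: bytes) -> str:
    """Classify a DNS response using only its header fields."""
    if len(response) < 12:
        raise ValueError("truncated DNS header")
    # Header: ID, flags, QDCOUNT, ANCOUNT, NSCOUNT, ARCOUNT (16 bits each)
    _id, flags, _qd, ancount, _ns, _ar = struct.unpack("!6H", response[:12])
    rcode = flags & 0x000F  # RCODE is the low 4 bits of the flags word
    if rcode == 3:
        return "NXDOMAIN"
    if rcode == 0:
        # Simplification: any answer record counts as an answer.
        return "ANSWER" if ancount > 0 else "NODATA"
    return f"RCODE={rcode}"


def fake_header(rcode: int, ancount: int) -> bytes:
    """Build a header-only response for demonstration purposes."""
    flags = 0x8000 | 0x0400 | rcode  # QR=1 (response), AA=1 (authoritative)
    return struct.pack("!6H", 0x1234, flags, 1, ancount, 0, 0)


if __name__ == "__main__":
    print(classify(fake_header(rcode=0, ancount=0)))
    print(classify(fake_header(rcode=3, ancount=0)))
    print(classify(fake_header(rcode=0, ancount=1)))
```

The same two header fields are what show up in a Wireshark capture of the failing query, so this is mostly a way to state precisely which combination a client has to handle.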

@jpap
Author

jpap commented Jul 28, 2017

Cloudflare has responded to this issue after investigating it for almost a month. They're not going to fix their DNS anytime soon: it appears that they rely on this kind of functionality internally (!), and they've determined it too risky to change.

I've since put in place a workaround whereby I do not use Cloudflare-hosted domains in my DNS search list as circulated by DHCP. It's an inconvenience to specify FQDNs where appropriate, but one I can live with.

@bowei, do consider a workaround here in Kubernetes DNS, though I can appreciate why that might be unreasonable. I'd just hate others to hit this issue as it's not an easy problem to diagnose and isolate. I'll let you keep this issue open (for a workaround) or close it as appropriate.

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 1, 2018
@rmb938

rmb938 commented Feb 1, 2018

I just spent like an hour trying to debug this before I found this issue... time to find a new dns provider I guess...

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 3, 2018
@synhershko

Just to make sure: does this DNS issue apply only to Cloudflare, or can it occur with other DNS providers as well?

@daurnimator

> Returning RCODE=0 means success. Some DNS implementations interpret this as "the binding exists, but there are no addresses for the queried type". Cloudflare should really return RCODE=3 (NXDOMAIN) in this case.

Cloudflare seem to be following RFC2308 Section 2.2: https://tools.ietf.org/html/rfc2308#section-2.2

@jpap
Author

jpap commented Apr 2, 2018

I received an e-mail from Cloudflare on March 23, 2018 stating that they have fixed this issue on their DNS servers.

@jpap jpap closed this as completed Apr 2, 2018
@pavel-odintsov

Hello!

Yes, we just finished deploying a release that includes the NODATA/NXDOMAIN improvement. For all DNSSEC-disabled domains the behaviour should be correct now. For DNSSEC-enabled domains we are keeping the old behaviour because it is required by the way we generate DNSSEC signatures.

Thank you!
