External queries fail with Cloudflare domain in DNS search list #119

Closed
jpap opened this Issue Jun 29, 2017 · 10 comments

9 participants

jpap commented Jun 29, 2017

My domain's DNS is hosted with Cloudflare. I am using a bare metal cluster that has the domain in the /etc/resolv.conf search list.

DNS queries for external domains (e.g. www.yahoo.com) fail under the above conditions on some containers (e.g. Alpine-based):

/ # ping www.yahoo.com
ping: bad address 'www.yahoo.com'

When I manually remove my domain from the /etc/resolv.conf search list in the container the query works as expected.
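The manual edit described above can be sketched as a small text transformation. This is only an illustration: the function name `drop_search_domain` and the sample nameserver and cluster domains in the usage note are assumptions, not values taken from the affected cluster.

```python
def drop_search_domain(resolv_conf: str, domain: str) -> str:
    """Return resolv.conf text with `domain` removed from every search line."""
    target = domain.rstrip(".").lower()
    out = []
    for line in resolv_conf.splitlines():
        fields = line.split()
        if fields and fields[0] == "search":
            # Keep every search domain except the one being removed.
            kept = [d for d in fields[1:] if d.rstrip(".").lower() != target]
            if kept:
                out.append("search " + " ".join(kept))
            # A search line left with no domains is dropped entirely.
            continue
        out.append(line)
    return "\n".join(out) + "\n"
```

For example, given a hypothetical file containing `search default.svc.cluster.local mydomain.com`, calling `drop_search_domain(text, "mydomain.com")` leaves `search default.svc.cluster.local` and keeps the `nameserver` and `options` lines untouched.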

With Wireshark, I was able to determine that Cloudflare's NS returns RCODE = 0 with no RRs when queried with a nonexistent domain (e.g. www.yahoo.com.mydomain.com). Most other NSs I've tried return RCODE = 3 in this case. (This issue never came up until I moved to Cloudflare; my domain registry's nameservers return RCODE = 3 for nonexistent domains.)
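To make the distinction concrete, here is a minimal sketch that reads the 12-byte DNS header of a raw response (for instance, bytes copied out of a packet capture) and labels it. The function name `classify_response` is made up for this example, and the sketch deliberately ignores EDNS extended RCODEs and does not check whether the answer section holds only CNAMEs.

```python
import struct


def classify_response(message: bytes) -> str:
    """Label a raw DNS response using only its header fields."""
    if len(message) < 12:
        raise ValueError("DNS message shorter than the 12-byte header")
    # Header layout: ID, flags, QDCOUNT, ANCOUNT, NSCOUNT, ARCOUNT (16 bits each).
    _id, flags, _qd, ancount, _ns, _ar = struct.unpack("!6H", message[:12])
    rcode = flags & 0x000F  # RCODE is the low 4 bits of the flags word
    if rcode == 3:
        return "NXDOMAIN"  # the name does not exist
    if rcode == 0:
        # RCODE 0 with an empty answer section is the "no RRs" case above.
        return "ANSWER" if ancount > 0 else "NODATA"
    return f"RCODE {rcode}"
```

A response with RCODE = 0 and no answer RRs comes back as `"NODATA"`, while RCODE = 3 comes back as `"NXDOMAIN"`.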

Could the RCODE = 0 result code on the search list be preventing Kubernetes DNS from performing a FQDN lookup (e.g. just www.yahoo.com) in this case, resulting in the ultimate failure of the query?

I've raised the RCODE issue with Cloudflare, and had a quick look at the SkyDNS and miekg/dns project source code, but it wasn't immediately clear to me what the code path is here.

Member

cmluciano commented Jun 29, 2017

That seems likely. I wonder if this is something that requires an entry through dnsmasq. @bowei @MrHohn thoughts?

Member

bowei commented Jun 29, 2017

Returning RCODE=0 means success. Some DNS implementations interpret this as "the binding exists, but there are no addresses for the queried type". Cloudflare should really return RCODE=3 (NXDOMAIN) in this case.
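The following toy model shows how that interpretation could turn into a failed lookup when a search list is involved. It is a simplified sketch, not the code of any real resolver or of Kubernetes DNS: `stop_on_nodata` is a hypothetical switch standing in for "treats an empty RCODE=0 reply as final", `make_lookup` is a fake upstream, `ndots=5` is assumed as the pod default, and `192.0.2.10` is a documentation address.

```python
from typing import Callable, Dict, Iterable, List, Optional

NXDOMAIN, NODATA = "NXDOMAIN", "NODATA"


def candidates(name: str, search: List[str], ndots: int = 5) -> List[str]:
    """Names to try, in order, for a search-list resolver."""
    if name.endswith("."):
        return [name]  # already fully qualified
    expanded = [f"{name}.{d.rstrip('.')}." for d in search]
    absolute = name + "."
    # Names with at least `ndots` dots are tried as-is first.
    if name.count(".") >= ndots:
        return [absolute] + expanded
    return expanded + [absolute]


def make_lookup(records: Dict[str, str],
                nodata_zones: Iterable[str] = ()) -> Callable[[str], str]:
    """Fake upstream: zones in `nodata_zones` answer NODATA for unknown names."""
    zones = list(nodata_zones)

    def lookup(fqdn: str) -> str:
        if fqdn in records:
            return records[fqdn]
        if any(fqdn.endswith("." + z + ".") for z in zones):
            return NODATA
        return NXDOMAIN

    return lookup


def resolve(name: str, search: List[str], lookup: Callable[[str], str],
            stop_on_nodata: bool, ndots: int = 5) -> Optional[str]:
    for cand in candidates(name, search, ndots):
        result = lookup(cand)
        if result == NXDOMAIN:
            continue  # name does not exist, try the next candidate
        if result == NODATA:
            if stop_on_nodata:
                return None  # "name exists, no address": give up here
            continue
        return result
    return None
```

With `search = ["svc.cluster.local", "mydomain.com"]` and an upstream that answers NODATA under `mydomain.com`, `resolve("www.yahoo.com", ...)` returns `None` when `stop_on_nodata=True`, because the search ends at `www.yahoo.com.mydomain.com.` before the bare name is tried. With an upstream that answers NXDOMAIN instead, the same call reaches `www.yahoo.com.` and returns its address.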

Author

jpap commented Jul 28, 2017

Cloudflare has responded to this issue after investigating it for almost a month. They're not going to fix their DNS anytime soon: it appears that they rely on this kind of functionality internally (!), and they've determined it too risky to change.

I've since put in place a workaround whereby I do not use Cloudflare hosted domains in my DNS search list as circulated by DHCP. It's an inconvenience to specify FQDN where appropriate but one I can live with.

@bowei, do consider a workaround here in Kubernetes DNS, though I can appreciate why that might be unreasonable. I'd just hate others to hit this issue as it's not an easy problem to diagnose and isolate. I'll let you keep this issue open (for a workaround) or close it as appropriate.

fejta-bot commented Jan 1, 2018

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

rmb938 commented Feb 1, 2018

I just spent like an hour trying to debug this before I found this issue... time to find a new DNS provider, I guess...

fejta-bot commented Mar 3, 2018

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

synhershko commented Mar 17, 2018

Just to make sure: does this DNS issue apply only to Cloudflare, or can it occur with other DNS providers as well?

daurnimator commented Apr 2, 2018

> Returning RCODE=0 means success. Some DNS implementations interpret this as "the binding exists, but there are no addresses for the queried type". Cloudflare should really return RCODE=3 (NXDOMAIN) in this case.

Cloudflare seem to be following RFC 2308, Section 2.2: https://tools.ietf.org/html/rfc2308#section-2.2

Author

jpap commented Apr 2, 2018

I received an e-mail from Cloudflare on March 23, 2018 stating that they have fixed this issue on their DNS servers.

jpap closed this Apr 2, 2018

pavel-odintsov commented Apr 9, 2018

Hello!

Yes, we have just finished deploying a release that includes the NODATA/NXDOMAIN improvement. For all DNSSEC-disabled domains, the behaviour should now be correct. For DNSSEC-enabled domains we keep the old behaviour, because it is required by the way we generate DNSSEC signatures.

Thank you!
