Go-lang "netgo" DNS resolver bug with catch-all DNS server entries #10863

Closed
dmp42 opened this issue Feb 17, 2015 · 44 comments
Labels
area/networking kind/bug Bugs are bugs. The cause may or may not be known at triage time so debugging may be needed.

Comments

@dmp42
Copy link
Contributor

dmp42 commented Feb 17, 2015

This is following-up investigation from #10802

In the following scenario:

# /etc/resolv.conf
nameserver mydns
search mydomain

where "mydns" would return an ipv6 address for any subdomain of mydomain

What happens is that when docker 1.5 resolves index.docker.io, it favors the ipv6 address returned for index.docker.io.mydomain. over the A record for index.docker.io.

This does NOT happen (apparently) if index.docker.io.mydomain. returns an ipv4 record.

Obviously, this does not happen either with docker 1.4.

Finally, curl (for comparison) does NOT use the ipv6 address, but does correctly use the A record for index.docker.io..

To me, this sounds like a DNS resolution order preference bug - not sure if this is a docker bug, or lower.

cc @chmanie @icecrime @stevvoe @dmcgowan

@dmp42
Copy link
Contributor Author

dmp42 commented Feb 17, 2015

While this is triggered through the angle of distribution stuff (index and registry), it certainly affects any other network operation depending on the same code.

@icecrime
Copy link
Contributor

Ping @MalteJ

@tiborvass tiborvass added this to the 1.5.1 milestone Feb 17, 2015
@LK4D4
Copy link
Contributor

LK4D4 commented Feb 17, 2015

ping @estesp too :)
Ah, no, sorry, I thought it could be related to /etc/resolv.conf filtering.
I'm not sure that docker is resolving anything itself at all, but I might be wrong.

@MalteJ
Copy link
Contributor

MalteJ commented Feb 17, 2015

@dmp42 I think you shouldn't use a search-domain that resolves a wildcard subdomain.

By the way, curl always uses IPv4 unless you do curl -6 ...

@dmp42
Copy link
Contributor Author

dmp42 commented Feb 17, 2015

@MalteJ maybe. Now:

  • docker 1.4 uses the correct record
  • the correct record is used if the ip returned is ipv4
  • curl (and presumably other tools) use the correct ip as well

"Search" should be used as a fallback in case no valid record is found for the fqdn (IIRC).

@MalteJ
Copy link
Contributor

MalteJ commented Feb 17, 2015

hmm, it looks like the DNS will be queried for an AAAA record for index.docker.io. which cannot be found. Then the DNS is queried for AAAA index.docker.io.mydomain.. And then the A-record queries would be executed.

@dmp42
Copy link
Contributor Author

dmp42 commented Feb 17, 2015

From tcpdump, here is what is apparently resolved in order:

22:43:28.650876 IP XXX.41322 > YYY.domain: 25810+ A? registry-1.docker.io. (38)
22:43:28.650976 IP XXX.42402 > YYY.domain: 32691+ AAAA? registry-1.docker.io. (38)
22:43:28.651258 IP XXX.33356 > YYY.domain: 20838+ AAAA? registry-1.docker.io.ZZZ. (55)

@dmp42
Copy link
Contributor Author

dmp42 commented Feb 17, 2015

And answers:

22:43:28.651443 IP YYY.domain > XXX.33356: 20838 2/4/4 CNAME ZZZ., AAAA 2a01:4f8:151:84c5::21:1 (246)
22:43:28.660191 IP YYY.domain > XXX.41322: 25810 1/4/4 A 162.242.195.84 (204)

@MalteJ
Copy link
Contributor

MalteJ commented Feb 17, 2015

I think they describe a similar problem:
https://engineering.opendns.com/2014/06/04/dual-stack-search-domains-host-roulette/

@dmp42
Copy link
Contributor Author

dmp42 commented Feb 17, 2015

My DNS competence really stops right here - I'll leave it to the super savvy to figure out :)

@MalteJ
Copy link
Contributor

MalteJ commented Feb 17, 2015

I am not sure if we can do something about that. To me it sounds more like a Kernel or Golang issue.

@LK4D4
Copy link
Contributor

LK4D4 commented Feb 17, 2015

@dmp42 Could you provide exact steps to reproduce, so we can write a script for bisecting?

@MalteJ
Copy link
Contributor

MalteJ commented Feb 17, 2015

@LK4D4 I think you need a DNS that resolves *.mydomain. AAAA 2001:db8::1
and add the search domain mydomain to your /etc/resolv.conf:

echo search mydomain >> /etc/resolv.conf

@dmp42
Copy link
Contributor Author

dmp42 commented Feb 17, 2015

@LK4D4 not entirely trivial but:

  • set up a DNS resolver at address Y that always answers with a given ipv6 X on any *.something request
  • add nameserver Y and search something to the /etc/resolv.conf on the docker daemon host

Also @chmanie from the original bug report is very friendly and probably willing to help with testing since he has a "not working" setup.

@dmcgowan
Copy link
Member

Based on the description this change comes to mind fdd2abe? Although this change supposedly was made in 1.4.0.

@estesp
Copy link
Contributor

estesp commented Feb 17, 2015

@dmcgowan beat me to it, but yeah..also was in 1.4.0.. this has to be related to registry DNS lookup, not the fact that IPv6 was added to the container network model in 1.5.0

@dmp42
Copy link
Contributor Author

dmp42 commented Feb 18, 2015

Maybe both?

@estesp
Copy link
Contributor

estesp commented Feb 18, 2015

Given @dmp42's description of the repro scenario, @MalteJ's pointer to the OpenDNS blog (#10863 (comment)) seems useful, unless I misunderstand the problem: the blog notes that the dual-stack lookup issue that appears when search domains are added to /etc/resolv.conf depends on the OS.

@estesp
Copy link
Contributor

estesp commented Feb 18, 2015

Could the move from go1.3.3 -> go1.4.1 be a factor here? That did happen between the two Docker releases, and a quick look shows that the changes to goLookupIP() in dnsclient_unix.go are at least interesting w.r.t. how Go's "netgo" implementation returns/orders A vs AAAA results.

Note that it's not fun to dig for differences as git diff go1.3.3 go1.4.1 ... is useless given they moved the directories for pkg content around!

@MalteJ
Copy link
Contributor

MalteJ commented Feb 18, 2015

> Could the move from go1.3.3 -> go1.4.1 be a factor here?

That is my first guess.
Nevertheless, a catch-all DNS is an anti-pattern, especially in a dual-stack environment where there is no standardized order for IPv4/IPv6 queries, search domains, etc.

@estesp
Copy link
Contributor

estesp commented Jun 2, 2015

This is the set of changes that I found most interesting--the actual DNS "client" implementation that is called from the code you linked to (dnsclient_unix.go): https://gist.github.com/estesp/a6913f0d7923689180e3

Two things look interesting, though I haven't had time to dig through them thoroughly: the "rooted" concept (which may interact with the "match all" setup noted earlier in this issue), and the goroutine "racer" method of finding both ipv4 and ipv6 results, which may mean ordering changes are affecting the results. Again, I need (wish I had??) more time to decide whether there is something here or just a red herring, but I'm not convinced the Go changes aren't at least partially involved.

@estesp
Copy link
Contributor

estesp commented Jun 4, 2015

I'm actually confused about how the DNS server setup gets the client into the state reported. With a binary compiled by either go1.3.3 or go1.4.2 doing IP lookups, the following flow happens for index.docker.io:

go1.4.2

10:23:03.196659 IP 172.17.0.22.42393 > 172.17.0.13.domain: 35580+ A? index.docker.io. (33)
10:23:03.196707 IP 172.17.0.22.42393 > 172.17.0.13.domain: 35580+ A? index.docker.io. (33)
10:23:03.196767 IP 172.17.0.22.42393 > 172.17.0.13.domain: 16389+ AAAA? index.docker.io. (33)
10:23:03.196775 IP 172.17.0.22.42393 > 172.17.0.13.domain: 16389+ AAAA? index.docker.io. (33)
10:23:03.197064 IP 172.17.0.13.1889 > google-public-dns-a.google.com.domain: 45650+ [1au] A? index.docker.io. (44)
10:23:03.197252 IP 172.17.0.13.42898 > google-public-dns-a.google.com.domain: 8172+ [1au] AAAA? index.docker.io. (44)
10:23:03.197564 IP 172.17.0.13.52072 > google-public-dns-a.google.com.domain: 3967+ [1au] NS? . (28)
10:23:03.221600 IP google-public-dns-a.google.com.domain > 172.17.0.13.1889: 45650 1/0/1 A 162.242.195.84 (60)
10:23:03.222073 IP 172.17.0.13.domain > 172.17.0.22.42393: 35580 1/13/0 A 162.242.195.84 (260)
10:23:03.226545 IP google-public-dns-a.google.com.domain > 172.17.0.13.42898: 8172 0/1/1 (114)

go 1.3.3

10:25:33.922616 IP 172.17.0.22.57536 > 172.17.0.13.domain: 51477+ A? index.docker.io. (33)
10:25:33.922669 IP 172.17.0.22.57536 > 172.17.0.13.domain: 51477+ A? index.docker.io. (33)
10:25:33.922704 IP 172.17.0.22.57536 > 172.17.0.13.domain: 27449+ AAAA? index.docker.io. (33)
10:25:33.922715 IP 172.17.0.22.57536 > 172.17.0.13.domain: 27449+ AAAA? index.docker.io. (33)
10:25:33.923099 IP 172.17.0.13.62093 > google-public-dns-a.google.com.domain: 25323+ [1au] A? index.docker.io. (44)
10:25:33.923343 IP 172.17.0.13.29026 > google-public-dns-a.google.com.domain: 30220+ [1au] AAAA? index.docker.io. (44)
10:25:33.923918 IP 172.17.0.13.16392 > google-public-dns-a.google.com.domain: 17430+ [1au] NS? . (28)
10:25:33.948675 IP google-public-dns-a.google.com.domain > 172.17.0.13.62093: 25323 1/0/1 A 162.242.195.84 (60)
10:25:33.949118 IP 172.17.0.13.domain > 172.17.0.22.57536: 51477 1/13/0 A 162.242.195.84 (260)
10:25:33.954178 IP google-public-dns-a.google.com.domain > 172.17.0.13.29026: 30220 0/1/1 (114)

I would have to misspell index.docker.io to get the client to start adding search terms (or mess with options ndots so that "larger" strings of dot-separated names get the search term from resolv.conf appended automatically), like this:

looking for index.docker.iioo:

10:33:12.782678 IP 172.17.0.22.45887 > 172.17.0.13.domain: 36178+ A? index.docker.iioo. (35)
10:33:12.782716 IP 172.17.0.22.45887 > 172.17.0.13.domain: 36178+ A? index.docker.iioo. (35)
10:33:12.782826 IP 172.17.0.22.45887 > 172.17.0.13.domain: 6067+ AAAA? index.docker.iioo. (35)
10:33:12.782839 IP 172.17.0.22.45887 > 172.17.0.13.domain: 6067+ AAAA? index.docker.iioo. (35)
10:33:12.783461 IP 172.17.0.13.63868 > google-public-dns-a.google.com.domain: 56515+ [1au] A? index.docker.iioo. (46)
10:33:12.783662 IP 172.17.0.13.46812 > google-public-dns-a.google.com.domain: 58771+ [1au] AAAA? index.docker.iioo. (46)
10:33:12.783831 IP 172.17.0.13.23765 > google-public-dns-a.google.com.domain: 43741+ [1au] NS? . (28)
10:33:12.813554 IP google-public-dns-a.google.com.domain > 172.17.0.13.23765: 43741$ 14/0/1 NS b.root-servers.net., NS l.root-servers.net., NS m.root-servers.net., NS k.root-servers.net., NS c.root-servers.net., NS e.root-servers.net., NS j.root-servers.net., NS f.root-servers.net., NS i.root-servers.net., NS g.root-servers.net., NS h.root-servers.net., NS d.root-servers.net., NS a.root-servers.net., RRSIG (397)
10:33:12.844116 IP google-public-dns-a.google.com.domain > 172.17.0.13.63868: 56515 NXDomain$ 0/6/1 (648)
10:33:12.844488 IP 172.17.0.13.domain > 172.17.0.22.45887: 36178 NXDomain 0/1/0 (110)
10:33:12.856942 IP google-public-dns-a.google.com.domain > 172.17.0.13.46812: 58771 NXDomain$ 0/6/1 (648)
10:33:12.857267 IP 172.17.0.13.domain > 172.17.0.22.45887: 6067 NXDomain 0/1/0 (110)
10:33:12.857429 IP 172.17.0.22.52654 > 172.17.0.13.domain: 17732+ A? index.docker.iioo.testdocker.org. (50)
10:33:12.857463 IP 172.17.0.22.52654 > 172.17.0.13.domain: 17732+ A? index.docker.iioo.testdocker.org. (50)
10:33:12.857497 IP 172.17.0.22.52654 > 172.17.0.13.domain: 24036+ AAAA? index.docker.iioo.testdocker.org. (50)
10:33:12.857509 IP 172.17.0.22.52654 > 172.17.0.13.domain: 24036+ AAAA? index.docker.iioo.testdocker.org. (50)
10:33:12.857576 IP 172.17.0.13.domain > 172.17.0.22.52654: 17732* 0/1/0 (96)
10:33:12.857645 IP 172.17.0.13.domain > 172.17.0.22.52654: 17732* 0/1/0 (96)
10:33:12.857718 IP 172.17.0.13.domain > 172.17.0.22.52654: 24036* 1/2/2 AAAA 2002::2002:2002:1 (146)
10:33:12.857812 IP 172.17.0.13.domain > 172.17.0.22.52654: 24036* 1/2/2 AAAA 2002::2002:2002:1 (146)

This is with a bind server running in a container, acting as an authoritative master for "testdocker.org", and an /etc/resolv.conf with search testdocker.org inside. Bind is also set up to forward to Google public DNS, as you can see in all the traces above. I have a "splat" match that auto-responds with the AAAA (IPv6) record above, "2002::2002:2002:1", for *.testdocker.org.

Maybe the original reporter (@chmanie) can give some more detail on the DNS setup so I can better reproduce the exact scenario?

@chmanie
Copy link

chmanie commented Jun 4, 2015

I'm not exactly a dev-op, but happy to help! Could you help me on getting the information you need?

@estesp
Copy link
Contributor

estesp commented Jun 4, 2015

Sure @chmanie , thanks! What would be most helpful is the exact /etc/resolv.conf of the system where the failure occurs, and, if possible, the exact named.conf and referenced files within that for the BIND server configuration which handles the xxx.de domain originally discussed. Feel free to mask out anything you don't want me or others to know (e.g. the exact domain or IP addresses in the BIND config), but it would be very helpful to know the exact DNS server config and how it is handling forwarding versus authoritative responses for a specific set of domains to help understand how this is happening and/or get an exact reproducing scenario.

@sstarcher
Copy link

@estesp I believe we are currently experiencing the same issue. Our setup is as follows.
The DNS server is Consul 0.5.2 with default settings and a recursor set up for 8.8.8.8
We are running in an AWS VPC

Let me know if I can provide anything to help.

@discordianfish
Copy link
Contributor

In general, a (stub) resolver library, like the one in libc, appends the search domain, tries to resolve that, and only then falls back to the name without the search domain. BUT since this is a bit expensive, it usually exempts names that contain at least ndots dots (one by default); those are tried as absolute names first. See man resolv.conf under ndots:

             ndots:n
                     sets a threshold for the number of dots which must appear in a  name  given  to
                     res_query(3)  (see  resolver(3)) before an initial absolute query will be made.
                     The default for n is 1, meaning that if there are any dots in a name, the  name
                     will  be  tried  first  as an absolute name before any search list elements are
                     appended to it.  The value for this option is silently capped to 15.

The golang resolver seems to have the same defaults and supports reading ndots from /etc/resolv.conf as well: https://golang.org/src/net/dnsconfig_unix.go

@drieschel
Copy link

@estesp I am the hoster of @chmanie's server where the problem occurred.

The problem was the "search somedomain" directive in the /etc/resolv.conf file. You most likely have the same problem. If you delete this line, everything should work fine.

Edit:
Okay, @chmanie filled me in a bit more. If I remember correctly, the problem was the dual stack (ipv4 + ipv6) enabled in the newer docker version in conjunction with the following configuration:

@chmanie's server is dual stack ready and the resolv.conf contained the "search domain" line. But the docker server didn't (or still doesn't) have an AAAA record.

Now the docker tool on @chmanie's server tried to reach the docker server over ipv6 first. Because of the "search domain" line, the requested domain could be resolved, but sadly to the ip of "domain", not to the correct ip. That's why the docker tool didn't try to resolve the docker domain over ipv4.

Edit 2:
Forgot another important detail: the zone "domain" has a * CNAME record pointing to "domain". That was also part of the problem.

@discordianfish
Copy link
Contributor

Side note: It seems we're using netgo, not the default which would use libc.

@sstarcher
Copy link

Whenever I remove the "search ec2.internal"
For docker pull I get "FATA[0000] Get https://registry-1.docker.io/v1/repositories/library/busybox/tags: dial tcp: lookup registry-1.docker.io on 127.0.0.1:53: cannot unmarshal DNS message"

@estesp
Copy link
Contributor

estesp commented Jun 4, 2015

@discordianfish I will rename.. I thought I had a true netgo binary in my little stable of lookup binaries, but I forgot about the --installsuffix hack so wasn't getting it.

I now have go1.3.3, go1.4.2, go1.4.2-docker (built in a Docker dev container against the custom-built Go), and go1.4.2-netgo.

The go1.4.2-netgo-built binary exhibits the actual problem:

/testdial-1.4.2-netgo index.docker.io
addr: 2002::2002:2002:1
addr: 162.242.195.84

Note: my silly DNS setup is responding with the useless IPv6 address based on a match-all rule
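For anyone who wants to try this on their own setup, a lookup program along these lines should be enough. This is a hypothetical reconstruction, not the actual testdial source; `lookup` is a name made up here, and the only library call it relies on is `net.LookupIP` from the standard library. Whether the pure-Go resolver is used depends on how the binary is built (for example with the netgo build tag).

```go
package main

import (
	"fmt"
	"net"
	"os"
)

// lookup resolves host through Go's resolver and returns the addresses as strings.
// Hypothetical helper; the real testdial binary may work differently.
func lookup(host string) ([]string, error) {
	ips, err := net.LookupIP(host)
	if err != nil {
		return nil, err
	}
	addrs := make([]string, 0, len(ips))
	for _, ip := range ips {
		addrs = append(addrs, ip.String())
	}
	return addrs, nil
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: testdial <hostname>")
		return
	}
	addrs, err := lookup(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, "lookup failed:", err)
		return
	}
	// Print one address per line, in the order the resolver returned them.
	for _, a := range addrs {
		fmt.Println("addr:", a)
	}
}
```

Running the same source built with and without the pure-Go resolver against the same /etc/resolv.conf is what makes the difference visible.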

The following trace shows that the netgo lookup code is willing to append the search domain even while the worker (see dnsclient_unix.go in the Go 1.4.2 sources) for the A record is still working on returning the (correct) IPv4 response.

13:42:50.021716 IP 172.17.0.23.47708 > 172.17.0.13.domain: 54982+ A? index.docker.io. (33)
13:42:50.021791 IP 172.17.0.23.47708 > 172.17.0.13.domain: 54982+ A? index.docker.io. (33)
13:42:50.021925 IP 172.17.0.23.36219 > 172.17.0.13.domain: 64311+ AAAA? index.docker.io. (33)
13:42:50.021938 IP 172.17.0.23.36219 > 172.17.0.13.domain: 64311+ AAAA? index.docker.io. (33)
13:42:50.022036 IP 172.17.0.13.domain > 172.17.0.23.36219: 64311 0/1/0 (103)
13:42:50.022103 IP 172.17.0.13.domain > 172.17.0.23.36219: 64311 0/1/0 (103)
13:42:50.022337 IP 172.17.0.13.55684 > google-public-dns-b.google.com.domain: 22399+ [1au] A? index.docker.io. (44)
13:42:50.022583 IP 172.17.0.13.56478 > google-public-dns-b.google.com.domain: 7997+ [1au] NS? . (28)
13:42:50.022834 IP 172.17.0.23.55110 > 172.17.0.13.domain: 49279+ AAAA? index.docker.io.testdocker.org. (48)
13:42:50.022856 IP 172.17.0.23.55110 > 172.17.0.13.domain: 49279+ AAAA? index.docker.io.testdocker.org. (48)
13:42:50.023017 IP 172.17.0.13.domain > 172.17.0.23.55110: 49279* 1/2/2 AAAA 2002::2002:2002:1 (144)
13:42:50.023140 IP 172.17.0.13.domain > 172.17.0.23.55110: 49279* 1/2/2 AAAA 2002::2002:2002:1 (144)
13:42:50.053067 IP google-public-dns-b.google.com.domain > 172.17.0.13.55684: 22399 1/0/1 A 162.242.195.84 (60)
13:42:50.053494 IP 172.17.0.13.domain > 172.17.0.23.47708: 54982 1/13/0 A 162.242.195.84 (260)

@estesp estesp changed the title from "IPV6 regression" to 'Go-lang "netgo" DNS resolver bug with catch-all DNS server entries' Jun 4, 2015
@estesp estesp removed the exp/expert label Jun 4, 2015
@estesp
Copy link
Contributor

estesp commented Jun 4, 2015

Note that the core problem appears to be that the request for AAAA record gets a different response from the DNS server than the non-netgo runs, and that is what causes the client (which in this case is the netgo-path dnsclient_unix.go code) to append the search domain and then re-ask for AAAA, which matches the catch-all address. Not sure yet if it has to do with the fact that the A and AAAA requests come from the same connection pre-1.4.2-netgo or not? I will try to debug that particular problem further and hopefully have something useful to report upstream in Go.
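The effect described above can be modelled with a small simulation. This is only a sketch under stated assumptions: `zone`, `lookupPerType`, and `lookupPerName` are hypothetical names, the fake zone mirrors my test DNS setup, and neither function is taken from the Go sources. It does not model the ordering of results or the concurrency of the real resolver, and `lookupPerName` is just one possible way to avoid the problem, not necessarily what an upstream fix would do.

```go
package main

import "fmt"

// zone is a fake DNS server: the key is "TYPE name", the value is the answer list.
type zone map[string][]string

// lookupPerType models the problematic behaviour: each record type walks the
// candidate names on its own and stops at the first name that answers for that type.
func lookupPerType(z zone, names []string) []string {
	var addrs []string
	for _, qtype := range []string{"A", "AAAA"} {
		for _, name := range names {
			if ans := z[qtype+" "+name]; len(ans) > 0 {
				addrs = append(addrs, ans...)
				break
			}
		}
	}
	return addrs
}

// lookupPerName moves on to the next candidate name only when the current
// name returned no address of either type.
func lookupPerName(z zone, names []string) []string {
	for _, name := range names {
		var addrs []string
		for _, qtype := range []string{"A", "AAAA"} {
			addrs = append(addrs, z[qtype+" "+name]...)
		}
		if len(addrs) > 0 {
			return addrs
		}
	}
	return nil
}

func main() {
	z := zone{
		"A index.docker.io.":                   {"162.242.195.84"},
		"AAAA index.docker.io.testdocker.org.": {"2002::2002:2002:1"}, // catch-all record
	}
	names := []string{"index.docker.io.", "index.docker.io.testdocker.org."}
	fmt.Println(lookupPerType(z, names)) // [162.242.195.84 2002::2002:2002:1]
	fmt.Println(lookupPerName(z, names)) // [162.242.195.84]
}
```

In the per-type model the empty AAAA answer for the absolute name sends only the AAAA lookup on to the search domain, where the catch-all record matches, so the bogus IPv6 address ends up in the result next to the correct A record.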

@estesp
Copy link
Contributor

estesp commented Jun 4, 2015

@drieschel by the way, thanks for providing the extra details. At this point, it definitely seems like a bug in the Go 'netgo' (versus cgo -> libc) resolver code related to how A and AAAA record results are handled differently in more recent Go versions. I don't think there is necessarily any specific problem with your DNS setup and Docker, although we definitely seem to have exposed a weird bug to resolve with the Go community.

@discordianfish
Copy link
Contributor

@estesp / @drieschel Looks like you isolated the problem very well, can you open an issue upstream? I closed my golang/go#11070 since it's a different issue (the issue @sstarcher has as well), although the issue described here is pretty much confirmed already.

@estesp
Copy link
Contributor

estesp commented Jun 5, 2015

@discordianfish I just opened golang/go#11081

@discordianfish
Copy link
Contributor

@estesp Great, thanks!

@estesp
Copy link
Contributor

estesp commented Jun 25, 2015

Quick update--this will most likely be fixed in Go 1.5; this patchset fixes the problem and is under review: https://go-review.googlesource.com/#/c/10836/

I don't know how we want to handle the Docker side of this issue, as moving to Go 1.5 is gated on its release and then on a follow-on decision and timeframe for Docker to be built with Go 1.5 compilers.

@estesp
Copy link
Contributor

estesp commented Jul 28, 2015

The patch for the netgo DNS lookup bug is now merged in Go-lang and it appears as though it will make the Go 1.5 release. When Docker starts building with Go 1.5, this problem can be validated as resolved and this issue closed.

@thaJeztah
Copy link
Member

@estesp we're on Go 1.5+ now; is it safe to close this issue?

@estesp
Copy link
Contributor

estesp commented May 11, 2016

Yes, we can definitely close now.

