net: builtin DNS stub resolver fails to parse responses from consul with "cannot unmarshal DNS message" #11070

Closed
discordianfish opened this issue Jun 4, 2015 · 13 comments

@discordianfish

commented Jun 4, 2015

Hi,

If you have a CNAME such as registry-1.docker.io, use a DNS resolver such as consul, and try to resolve the record from a Go 1.4.2 application built with netgo, the resolution fails.

This can be reproduced by running a recursor such as consul, pointing /etc/resolv.conf at it, and compiling this: https://gist.github.com/discordianfish/467ea55ae86426815a21 with CGO_ENABLED=0 go build -installsuffix netgo.

I have only observed this behaviour with consul, so it may well be a bug there. But when the same program is compiled with default options (CGO enabled), resolution works fine. All other tools can resolve the record as well, and we couldn't reproduce it with Go 1.2 or Go 1.3.
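
A minimal stand-in for the reproduction program is sketched below. It is not the contents of the linked gist; it only performs the same kind of lookup. It is written against a current Go toolchain (context and net.DefaultResolver do not exist in Go 1.4, where net.LookupHost(host) is the equivalent call). The report helper and the 3-second timeout are assumptions made for illustration.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// report turns a lookup result into a single log-style line.
// It is a hypothetical helper, not part of the net package.
func report(host string, addrs []string, err error) string {
	if err != nil {
		return fmt.Sprintf("lookup %s failed: %v", host, err)
	}
	return fmt.Sprintf("lookup %s ok: %v", host, addrs)
}

func main() {
	// Host name taken from the report above; the timeout is an arbitrary choice
	// so that the program does not hang when no DNS server is reachable.
	const host = "registry-1.docker.io"
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	addrs, err := net.DefaultResolver.LookupHost(ctx, host)
	fmt.Println(report(host, addrs, err))
}
```

Building this with the CGO_ENABLED=0 command above and running it with /etc/resolv.conf pointing at the recursor should exercise the pure Go resolver path.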

Consul

Run consul and provide upstream -recursors

Without search domain

2015/06/04 18:35:42 Get https://registry-1.docker.io: dial tcp: lookup registry-1.docker.io on 127.0.0.1:53: cannot unmarshal DNS message

tcpdump shows:

18:37:15.401845 IP (tos 0x0, ttl 64, id 62857, offset 0, flags [DF], proto UDP (17), length 66)
    10.128.40.4.48468 > 10.128.0.2.53: [bad udp cksum 0x3d45 -> 0xe807!] 34126+ A? registry-1.docker.io. (38)
18:37:15.401968 IP (tos 0x0, ttl 64, id 62858, offset 0, flags [DF], proto UDP (17), length 66)
    10.128.40.4.33424 > 10.128.0.2.53: [bad udp cksum 0x3d45 -> 0x0346!] 42169+ AAAA? registry-1.docker.io. (38)
18:37:15.402745 IP (tos 0x0, ttl 64, id 10096, offset 0, flags [none], proto UDP (17), length 242)
    10.128.0.2.53 > 10.128.40.4.48468: [udp sum ok] 34126 q: A? registry-1.docker.io. 7/0/0 registry-1.docker.io. CNAME us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com., us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com. A 52.5.4.145, us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com. A 52.6.136.158, us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com. A 54.164.219.90, us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com. A 54.172.137.222, us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com. A 107.21.22.134, us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com. A 52.0.31.125 (214)
18:37:15.402949 IP (tos 0x0, ttl 64, id 10097, offset 0, flags [none], proto UDP (17), length 228)
    10.128.0.2.53 > 10.128.40.4.33424: [udp sum ok] 42169 q: AAAA? registry-1.docker.io. 1/1/0 registry-1.docker.io. CNAME us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com. ns: us-east-1.elb.amazonaws.com. SOA ns-1119.awsdns-11.org. awsdns-hostmaster.amazon.com. 1 7200 900 1209600 60 (200)
18:37:15.403562 IP (tos 0x0, ttl 64, id 62859, offset 0, flags [DF], proto UDP (17), length 66)
    10.128.40.4.57356 > 10.128.0.2.53: [bad udp cksum 0x3d45 -> 0xb92e!] 37231+ A? registry-1.docker.io. (38)
18:37:15.403735 IP (tos 0x0, ttl 64, id 10098, offset 0, flags [none], proto UDP (17), length 242)
    10.128.0.2.53 > 10.128.40.4.57356: [udp sum ok] 37231 q: A? registry-1.docker.io. 7/0/0 registry-1.docker.io. CNAME us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com., us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com. A 52.0.31.125, us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com. A 52.5.4.145, us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com. A 52.6.136.158, us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com. A 54.164.219.90, us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com. A 54.172.137.222, us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com. A 107.21.22.134 (214)

With search domain

If you use some search domain, the results are different:

2015/06/04 18:04:14 Get https://registry-1.docker.io: dial tcp: lookup registry-1.docker.io: no such host

Tcpdump:

18:38:37.157005 IP (tos 0x0, ttl 64, id 62874, offset 0, flags [DF], proto UDP (17), length 66)
    10.128.40.4.39343 > 10.128.0.2.53: [bad udp cksum 0x3d45 -> 0x11e3!] 32536+ A? registry-1.docker.io. (38)
18:38:37.157131 IP (tos 0x0, ttl 64, id 62875, offset 0, flags [DF], proto UDP (17), length 66)
    10.128.40.4.52552 > 10.128.0.2.53: [bad udp cksum 0x3d45 -> 0x7ced!] 57433+ AAAA? registry-1.docker.io. (38)
18:38:37.157924 IP (tos 0x0, ttl 64, id 10113, offset 0, flags [none], proto UDP (17), length 228)
    10.128.0.2.53 > 10.128.40.4.52552: [udp sum ok] 57433 q: AAAA? registry-1.docker.io. 1/1/0 registry-1.docker.io. CNAME us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com. ns: us-east-1.elb.amazonaws.com. SOA ns-1119.awsdns-11.org. awsdns-hostmaster.amazon.com. 1 7200 900 1209600 60 (200)
18:38:37.158132 IP (tos 0x0, ttl 64, id 10114, offset 0, flags [none], proto UDP (17), length 242)
    10.128.0.2.53 > 10.128.40.4.39343: [udp sum ok] 32536 q: A? registry-1.docker.io. 7/0/0 registry-1.docker.io. CNAME us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com., us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com. A 52.6.136.158, us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com. A 54.164.219.90, us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com. A 54.172.137.222, us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com. A 107.21.22.134, us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com. A 52.0.31.125, us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com. A 52.5.4.145 (214)
18:38:37.160648 IP (tos 0x0, ttl 64, id 62876, offset 0, flags [DF], proto UDP (17), length 78)
    10.128.40.4.46128 > 10.128.0.2.53: [bad udp cksum 0x3d51 -> 0xdce0!] 11418+ AAAA? registry-1.docker.io.example.com. (50)
18:38:37.160760 IP (tos 0x0, ttl 64, id 62877, offset 0, flags [DF], proto UDP (17), length 66)
    10.128.40.4.51488 > 10.128.0.2.53: [bad udp cksum 0x3d45 -> 0x418c!] 8190+ A? registry-1.docker.io. (38)
18:38:37.160919 IP (tos 0x0, ttl 64, id 10115, offset 0, flags [none], proto UDP (17), length 242)
    10.128.0.2.53 > 10.128.40.4.51488: [udp sum ok] 8190 q: A? registry-1.docker.io. 7/0/0 registry-1.docker.io. CNAME us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com., us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com. A 52.5.4.145, us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com. A 52.6.136.158, us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com. A 54.164.219.90, us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com. A 54.172.137.222, us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com. A 107.21.22.134, us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com. A 52.0.31.125 (214)
18:38:37.162057 IP (tos 0x0, ttl 64, id 62878, offset 0, flags [DF], proto UDP (17), length 78)
    10.128.40.4.34535 > 10.128.0.2.53: [bad udp cksum 0x3d51 -> 0x3f74!] 63338+ A? registry-1.docker.io.example.com. (50)
18:38:37.163365 IP (tos 0x0, ttl 64, id 10116, offset 0, flags [none], proto UDP (17), length 135)
    10.128.0.2.53 > 10.128.40.4.46128: [udp sum ok] 11418 NXDomain q: AAAA? registry-1.docker.io.example.com. 0/1/0 ns: example.com. SOA sns.dns.icann.org. noc.dns.icann.org. 2015060216 7200 3600 1209600 3600 (107)
18:38:37.164818 IP (tos 0x0, ttl 64, id 10117, offset 0, flags [none], proto UDP (17), length 135)
    10.128.0.2.53 > 10.128.40.4.34535: [udp sum ok] 63338 NXDomain q: A? registry-1.docker.io.example.com. 0/1/0 ns: example.com. SOA sns.dns.icann.org. noc.dns.icann.org. 2015060216 7200 3600 1209600 3600 (107)

This latter case is different, but might be related to moby/moby#10863

Other DNS servers

Other DNS servers seem to work, yet the DNS requests look very similar. I'm using the same nameservers I provided as upstream recursors for consul before to rule out it's somehow related to those.

Without search domain

2015/06/04 18:46:20 No error

Tcpdump:

18:46:20.642497 IP (tos 0x0, ttl 64, id 62917, offset 0, flags [DF], proto UDP (17), length 66)
    10.128.40.4.36033 > 10.128.0.2.53: [bad udp cksum 0x3d45 -> 0x81e9!] 7168+ A? registry-1.docker.io. (38)
18:46:20.642592 IP (tos 0x0, ttl 64, id 62918, offset 0, flags [DF], proto UDP (17), length 66)
    10.128.40.4.40689 > 10.128.0.2.53: [bad udp cksum 0x3d45 -> 0xbd21!] 52860+ AAAA? registry-1.docker.io. (38)
18:46:20.644876 IP (tos 0x0, ttl 64, id 10156, offset 0, flags [none], proto UDP (17), length 242)
    10.128.0.2.53 > 10.128.40.4.36033: [udp sum ok] 7168 q: A? registry-1.docker.io. 7/0/0 registry-1.docker.io. CNAME us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com., us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com. A 52.6.136.158, us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com. A 54.164.219.90, us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com. A 54.172.137.222, us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com. A 107.21.22.134, us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com. A 52.0.31.125, us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com. A 52.5.4.145 (214)
18:46:20.644895 IP (tos 0x0, ttl 64, id 10157, offset 0, flags [none], proto UDP (17), length 228)
    10.128.0.2.53 > 10.128.40.4.40689: [udp sum ok] 52860 q: AAAA? registry-1.docker.io. 1/1/0 registry-1.docker.io. CNAME us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com. ns: us-east-1.elb.amazonaws.com. SOA ns-1119.awsdns-11.org. awsdns-hostmaster.amazon.com. 1 7200 900 1209600 60 (200)

With search domain

Tcpdump:

tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
18:48:19.693146 IP (tos 0x0, ttl 64, id 62941, offset 0, flags [DF], proto UDP (17), length 66)
    10.128.40.4.44090 > 10.128.0.2.53: [bad udp cksum 0x3d45 -> 0x1c34!] 25148+ A? registry-1.docker.io. (38)
18:48:19.693243 IP (tos 0x0, ttl 64, id 62942, offset 0, flags [DF], proto UDP (17), length 66)
    10.128.40.4.47151 > 10.128.0.2.53: [bad udp cksum 0x3d45 -> 0xf1ff!] 32864+ AAAA? registry-1.docker.io. (38)
18:48:19.694158 IP (tos 0x0, ttl 64, id 10180, offset 0, flags [none], proto UDP (17), length 242)
    10.128.0.2.53 > 10.128.40.4.44090: [udp sum ok] 25148 q: A? registry-1.docker.io. 7/0/0 registry-1.docker.io. CNAME us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com., us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com. A 52.6.136.158, us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com. A 54.164.219.90, us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com. A 54.172.137.222, us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com. A 107.21.22.134, us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com. A 52.0.31.125, us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com. A 52.5.4.145 (214)
18:48:19.711674 IP (tos 0x0, ttl 64, id 10181, offset 0, flags [none], proto UDP (17), length 228)
    10.128.0.2.53 > 10.128.40.4.47151: [udp sum ok] 32864 q: AAAA? registry-1.docker.io. 1/1/0 registry-1.docker.io. CNAME us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com. ns: us-east-1.elb.amazonaws.com. SOA ns-1119.awsdns-11.org. awsdns-hostmaster.amazon.com. 1 7200 900 1209600 60 (200)
18:48:19.711821 IP (tos 0x0, ttl 64, id 62943, offset 0, flags [DF], proto UDP (17), length 94)
    10.128.40.4.37072 > 10.128.0.2.53: [bad udp cksum 0x3d61 -> 0x282a!] 30733+ AAAA? registry-1.docker.io.stage-us-east-1.aws.dckr.io. (66)
18:48:19.714437 IP (tos 0x0, ttl 64, id 10182, offset 0, flags [none], proto UDP (17), length 181)
    10.128.0.2.53 > 10.128.40.4.37072: [udp sum ok] 30733 NXDomain q: AAAA? registry-1.docker.io.stage-us-east-1.aws.dckr.io. 0/1/0 ns: aws.dckr.io. SOA ns-1870.awsdns-41.co.uk. awsdns-hostmaster.amazon.com. 1 7200 900 1209600 86400 (153)

@discordianfish changed the title from "Regression: netgo not resolving some CNAME when using some recursors" to "Regression: netgo not resolving some CNAME when using some? recursors" on Jun 4, 2015

@discordianfish

Author

commented Jun 4, 2015

I can continue investigating tomorrow. There must be some difference between the responses from consul and those from the other DNS server (which is an internal AWS recursor).

@discordianfish

Author

commented Jun 4, 2015

I just realized that there is hashicorp/consul#854, so it might be a consul bug after all. I'm still not 100% sure whether netgo should behave the way it does...

@adg changed the title from "Regression: netgo not resolving some CNAME when using some? recursors" to "net: netgo not resolving some CNAME when using some? recursors" on Jun 4, 2015

@bradfitz

Member

commented Jun 4, 2015

Is "netcgo" a typo?

@bradfitz

Member

commented Jun 4, 2015

Also, note that in Go 1.5, netgo will be the default most of the time, without a special build tag. It'll decide at runtime whether to use netgo or libc's resolver based on the system's config and the hostname.

Please try with Go tip.

@mikioh

Contributor

commented Jun 4, 2015

@discordianfish,

Please open a new issue for using the builtin DNS stub resolver with the search? domain? both? keywords. Please don't mix two issues together.

For the "cannot unmarshal DNS message" error, as described in hashicorp/consul#854, the builtin DNS stub resolver in go1.4 and above is more RFC 1035 compliant than in go1.3 and below. I'm not sure what we should do when the recursor (I guess it's a recursive server) replies with a long response message over UDP transport.

The "no such host" error, i.e. the builtin DNS stub resolver with search? domain? both? issue, looks pretty interesting. A few possibilities come to mind, but I can't be sure without concrete information. Please provide, in a new issue: a) the DNS RR sets for your target alias or canonical name, b) your stub resolver configuration (usually resolv.conf), c) your recursive DNS server configuration. It would be a great help if you could provide an environment for reproduction. Thanks.

@mikioh changed the title from "net: netgo not resolving some CNAME when using some? recursors" to "net: builtin DNS stub resolver fails to parse responses from consul with "cannot unmarshal DNS message"" on Jun 4, 2015

@ianlancetaylor added this to the Go1.5Maybe milestone on Jun 4, 2015

@discordianfish

Author

commented Jun 4, 2015

@mikioh Yes, it seems like those are two issues. "Something" doesn't handle responses >512 bytes, and the failure gets silently ignored because the resolver then tries appending the search domain, which fails as well. At least that's what I assume is going on.

The "fix" in consul seems to be to compress the response, bringing it below 512 bytes, but it is perfectly valid to have bigger responses and all major resolvers are supporting that by implementing EDNS or fall back to tcp and I think netgo is doing this as well, right?

It's possible that consul is returning a malformed response that most resolvers still manage to parse but that netgo is too strict to accept. In that case the error message could probably be improved.
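
The masking effect assumed above (the first failure is hidden because the search-expanded name fails afterwards) can be sketched as follows. This is a simplified model, not the net package's actual code: candidates, resolve, the fake query function and the ndots value of 1 are all hypothetical names and assumptions chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// candidates lists the fully qualified names a stub resolver might try, in order.
// Names with at least ndots dots are tried as-is first, then with each search suffix.
func candidates(name string, search []string, ndots int) []string {
	if strings.HasSuffix(name, ".") {
		return []string{name} // already rooted: no search expansion
	}
	var out []string
	asIsFirst := strings.Count(name, ".") >= ndots
	if asIsFirst {
		out = append(out, name+".")
	}
	for _, s := range search {
		out = append(out, name+"."+s+".")
	}
	if !asIsFirst {
		out = append(out, name+".")
	}
	return out
}

// resolve tries every candidate and reports only the LAST error,
// which is how an earlier, more specific failure can get lost.
func resolve(name string, search []string, query func(fqdn string) error) error {
	var last error
	for _, fqdn := range candidates(name, search, 1) {
		if last = query(fqdn); last == nil {
			return nil
		}
	}
	return last
}

func main() {
	// Hypothetical stand-in for the server: the real name yields an unparseable
	// response, the search-expanded name does not exist.
	fake := func(fqdn string) error {
		if fqdn == "registry-1.docker.io." {
			return errors.New("cannot unmarshal DNS message")
		}
		return errors.New("no such host")
	}
	fmt.Println(resolve("registry-1.docker.io", nil, fake))
	fmt.Println(resolve("registry-1.docker.io", []string{"example.com"}, fake))
}
```

With no search domain the sketch reports the unmarshal error; with a search domain configured the same underlying failure surfaces as "no such host", matching the two log lines in the original report.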

PS: @bradfitz Yes fixed, was a bit in a rush ;)

@mikioh

Contributor

commented Jun 5, 2015

by implementing EDNS or fall back to tcp and I think netgo is doing this as well, right?

Yup, see #6464 for the reason why we hesitate to support EDNS0. If the need for DNSSEC increases, we might implement DNSSEC+EDNS0 in the builtin DNS stub resolver; even in that case we might not allow a simple EDNS0-only conversation, not sure.
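
For readers unfamiliar with what advertising EDNS0 involves on the wire, the sketch below appends an OPT pseudo-record (RFC 6891) to a raw query and bumps ARCOUNT. It is only an illustration of the record layout, not something the builtin resolver discussed here does; appendOPT is a hypothetical helper, and it assumes the query carries no OPT record already.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// appendOPT returns a copy of query with an EDNS0 OPT pseudo-record appended
// and the header's ARCOUNT field (bytes 10-11) incremented by one.
func appendOPT(query []byte, udpSize uint16) ([]byte, error) {
	if len(query) < 12 {
		return nil, errors.New("query shorter than DNS header")
	}
	out := make([]byte, len(query), len(query)+11)
	copy(out, query)

	arcount := binary.BigEndian.Uint16(out[10:12])
	binary.BigEndian.PutUint16(out[10:12], arcount+1)

	opt := []byte{
		0x00,       // NAME: root
		0x00, 0x29, // TYPE: OPT (41)
		byte(udpSize >> 8), byte(udpSize), // CLASS: advertised UDP payload size
		0, 0, 0, 0, // TTL: extended RCODE, version 0, flags
		0, 0, // RDLEN: no options
	}
	return append(out, opt...), nil
}

func main() {
	// Header-only placeholder message; a real query would also carry a question section.
	q := make([]byte, 12)
	out, err := appendOPT(q, 4096)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("% x\n", out[12:]) // the 11 bytes of the OPT record
}
```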

@mikioh

Contributor

commented Jun 5, 2015

@discordianfish,

Please let us know if you have any updates on this "cannot unmarshal DNS message" issue at any time. We will keep this issue open for a while. Also, if you still have "no such host" errors, please open a new issue and provide your environment information.

@sstarcher

commented Jun 5, 2015

@mikioh I can reproduce the "no such host" error consistently. Much of what's below was worked out with help from @discordianfish.

  1. Go1.4
  2. Consul DNS Server
  3. CGO_ENABLED=0 go build -installsuffix netgo
  4. _, err := http.Get("https://registry-1.docker.io")

As for the "cannot unmarshal DNS message" error, I have not reproduced it directly with Go1.4. The situation in which I produced it was as follows.

  1. Go1.4
  2. Consul DNS Server
  3. removed "search ec2.internal" from /etc/resolv.conf
  4. docker pull busybox

@bradfitz

Member

commented Jun 5, 2015

Can we get a repro case that doesn't involve a massive step like installing a custom DNS server with unknown details on how to configure?

Give us a self-contained Go program, or maybe a network dump, or even a Docker environment where we can reproduce it. But not instructions with a massive-yet-undefined setup step.

@mdempsky

Member

commented Jun 5, 2015

@discordianfish If Consul is sending >512 byte DNS responses over UDP without the client indicating support for large DNS responses via EDNS0, then Consul is RFC non-compliant. It's supposed to instead truncate the packet (note: truncated packets should still be valid DNS responses, so don't just blindly chop off bytes past 512), set the TC flag, and support the same query over TCP (which has no size limits).
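
The rule described above can be checked on the client side roughly as follows. This is a sketch under stated assumptions, not the resolver's real logic: truncated and classify are hypothetical helpers that look only at the TC bit in the header (RFC 1035 section 4.1.1) and at the raw message length against the 512-byte UDP limit.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	headerLen   = 12  // fixed DNS header size
	maxUDPPlain = 512 // UDP message limit when EDNS0 was not advertised
)

var errShortHeader = errors.New("message shorter than DNS header")

// truncated reports whether the TC bit (0x02 in the third header byte) is set.
func truncated(msg []byte) (bool, error) {
	if len(msg) < headerLen {
		return false, errShortHeader
	}
	return msg[2]&0x02 != 0, nil
}

// classify describes what a client that did not send EDNS0 could conclude
// from a raw UDP response.
func classify(msg []byte) string {
	tc, err := truncated(msg)
	switch {
	case err != nil:
		return "malformed"
	case tc:
		return "truncated: retry over TCP"
	case len(msg) > maxUDPPlain:
		return "oversized: server ignored the 512-byte limit"
	default:
		return "ok"
	}
}

func main() {
	resp := make([]byte, headerLen)
	resp[2] = 0x80 | 0x02 // QR (response) and TC set
	fmt.Println(classify(resp))
	fmt.Println(classify(make([]byte, 600)))
}
```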

@mikioh

Contributor

commented Jun 5, 2015

@sstarcher,

Can you please open a new issue for investigating the "no such host" error with consul? It would be a long journey, and an issue related to DNS usually has multiple combined root causes.

@discordianfish

Author

commented Jun 5, 2015

Okay, the actual problem appears to be an API / usability issue with miekg/dns (miekg/dns#216) and how consul is using it. Will make sure someone opens a new issue for the 'no such host' error.
