Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consul panic after repeated DNS recursion failure #284

Closed
payoffstan opened this issue Aug 14, 2014 · 12 comments
Closed

Consul panic after repeated DNS recursion failure #284

payoffstan opened this issue Aug 14, 2014 · 12 comments
Labels
type/bug Feature does not function as expected

Comments

@payoffstan
Copy link

Consul hits a runtime.panic after repeated i/o timeouts on dns recursion. This is on 0.3.1. Here's what the log looks like:

2014/08/14 01:06:15 [ERR] dns: recurse failed: read udp 10.10.100.130:53: i/o timeout
2014/08/14 01:06:16 [ERR] dns: recurse failed: read udp 10.10.100.130:53: i/o timeout
2014/08/14 08:40:57 [ERR] dns: recurse failed: read udp 10.10.100.130:53: i/o timeout
2014/08/14 08:44:33 [ERR] dns: recurse failed: read udp 10.10.100.130:53: i/o timeout
2014/08/14 08:44:34 [ERR] dns: recurse failed: read udp 10.10.100.130:53: i/o timeout
2014/08/14 08:54:51 [ERR] dns: recurse failed: read udp 10.10.100.130:53: i/o timeout
2014/08/14 09:03:27 [ERR] dns: recurse failed: read udp 10.10.100.130:53: i/o timeout
2014/08/14 09:13:55 [ERR] dns: recurse failed: read udp 10.10.100.130:53: i/o timeout
2014/08/14 09:37:41 [ERR] dns: recurse failed: read udp 10.10.100.130:53: i/o timeout
2014/08/14 09:41:28 [ERR] dns: recurse failed: read udp 10.10.100.130:53: i/o timeout
2014/08/14 09:43:21 [ERR] dns: recurse failed: read udp 10.10.100.130:53: i/o timeout
2014/08/14 09:43:25 [ERR] dns: recurse failed: read udp 10.10.100.130:53: i/o timeout
2014/08/14 09:53:11 [ERR] dns: recurse failed: read udp 10.10.100.130:53: i/o timeout
2014/08/14 09:53:13 [ERR] dns: recurse failed: read udp 10.10.100.130:53: i/o timeout
2014/08/14 09:53:15 [ERR] dns: recurse failed: read udp 10.10.100.130:53: i/o timeout
2014/08/14 09:53:16 [ERR] dns: recurse failed: read udp 10.10.100.130:53: i/o timeout
2014/08/14 09:53:25 [ERR] dns: recurse failed: read udp 10.10.100.130:53: i/o timeout
2014/08/14 10:00:11 [ERR] dns: recurse failed: read udp 10.10.100.130:53: i/o timeout

panic: runtime error: slice bounds out of range

goroutine 488917 [running]:
runtime.panic(0x9c21a0, 0xf7d40f)
/opt/go/src/pkg/runtime/panic.c:279 +0xf5
github.com/miekg/dns.unpackStructValue(0x9d8360, 0xc2084f8440, 0x0, 0x196, 0xc208285000, 0x2e3, 0x1000, 0x1b, 0xc200000000, 0x0, ...)
/opt/gopath/src/github.com/miekg/dns/msg.go:1170 +0x5df4
github.com/miekg/dns.UnpackStruct(0xa1b5a0, 0xc2084f8440, 0xc208285000, 0x2e3, 0x1000, 0x11, 0x17, 0x0, 0x0)
/opt/gopath/src/github.com/miekg/dns/msg.go:1254 +0xc8
github.com/miekg/dns.UnpackRR(0xc208285000, 0x2e3, 0x1000, 0x17, 0x7f12c44e3770, 0xc2084f8440, 0x11, 0x0, 0x0)
/opt/gopath/src/github.com/miekg/dns/msg.go:1331 +0x214
github.com/miekg/dns.(_Msg).Unpack(0xc2080f03f0, 0xc208285000, 0x2e3, 0x1000, 0x0, 0x0)
/opt/gopath/src/github.com/miekg/dns/msg.go:1546 +0x504
github.com/miekg/dns.(_Conn).ReadMsg(0xc2080aee10, 0xc2080f02d0, 0x0, 0x0)
/opt/gopath/src/github.com/miekg/dns/client.go:176 +0x1b6
github.com/miekg/dns.(_Client).exchange(0xc2080aedc0, 0xc2080f02d0, 0xc2080ed000, 0x10, 0x0, 0x0, 0x0, 0x0)
/opt/gopath/src/github.com/miekg/dns/client.go:152 +0x388
github.com/miekg/dns.(_Client).Exchange(0xc2080aedc0, 0xc2080f02d0, 0xc2080ed000, 0x10, 0xa45c10, 0x6, 0x0, 0x0)
/opt/gopath/src/github.com/miekg/dns/client.go:91 +0x88
github.com/hashicorp/consul/command/agent.(_DNSServer).handleRecurse(0xc2081109b0, 0x7f12cb00a4a0, 0xc208ed1bc0, 0xc2080f02d0)
/opt/gopath/src/github.com/hashicorp/consul/command/agent/dns.go:571 +0x222
github.com/hashicorp/consul/command/agent._DNSServer.(github.com/hashicorp/consul/command/agent.handleRecurse)·fm(0x7f12cb00a4a0, 0xc208ed1bc0, 0xc2
080f02d0)
/opt/gopath/src/github.com/hashicorp/consul/command/agent/dns.go:79 +0x44
github.com/miekg/dns.HandlerFunc.ServeDNS(0xc2080ed010, 0x7f12cb00a4a0, 0xc208ed1bc0, 0xc2080f02d0)
/opt/gopath/src/github.com/miekg/dns/server.go:79 +0x40
github.com/miekg/dns.(_ServeMux).ServeDNS(0xc2080ecf70, 0x7f12cb00a4a0, 0xc208ed1bc0, 0xc2080f02d0)
/opt/gopath/src/github.com/miekg/dns/server.go:176 +0xc9
github.com/miekg/dns.(_Server).serve(0xc20813a9c0, 0x7f12cb008ca0, 0xc208b84f30, 0x7f12cb009320, 0xc2080ecf70, 0xc20822a000, 0x1c, 0xffff, 0xc2080b2018, 0xc208e9a620, ...)
/opt/gopath/src/github.com/miekg/dns/server.go:330 +0x510
created by github.com/miekg/dns.(*Server).serveUDP
/opt/gopath/src/github.com/miekg/dns/server.go:299 +0x1fb

@armon
Copy link
Member

armon commented Aug 14, 2014

Interesting. Looks like a DNS decoding issue. Is the upstream address you were resolving sensitive? It would be useful to build a test case around it. (Or if you know the type/format of records that should be resolving)

@payoffstan
Copy link
Author

I haven't looked at the network captures, but there may be some sensitive data in there. Rather than post it publicly, I can forward it to your email. It might be that there's something odd about the DNS records, but I feel Consul should log an error and not panic in that case.

Do you think it would make any sense to invert the recursion relationship? i.e. recurse .consul names to the Consul server? I feel this approach loses some of the efficiency & power of the local Consul client but I'm curious as to what others think.

@armon
Copy link
Member

armon commented Aug 14, 2014

Feel free to email that information to me if it contains anything sensitive. I agree, it should never panic. This is a bug in the DNS library we use. I will try to work upstream to get it fixed.

You really don't loose anything inverting the relationship with Consul. If anything, you get some perks of using a caching DNS server in front of Consul, and just forwarding the .consul TLD

@miekg
Copy link

miekg commented Aug 22, 2014

From that codepath the only thing I can see is that off would be larger then lenmsg. Not likely, but might be possible. Would be nice to have some binary data for a test case.

@miekg
Copy link

miekg commented Aug 22, 2014

Working under the assumption off is too large, does this patch help?

diff --git a/msg.go b/msg.go
index cab1842..61c3b39 100644
--- a/msg.go
+++ b/msg.go
@@ -1154,6 +1154,9 @@ func unpackStructValue(val reflect.Value, msg []byte, off int) (off1 int, err er
                        if off == lenmsg {
                                break
                        }
+                       if off > lenmsg {
+                               return lenmsg, &Error{err: "overflow unpacking string"}
+                       }
                        switch val.Type().Field(i).Tag {
                        default:
                                return lenmsg, &Error{"bad tag unpacking string: " + val.Type().Field(i).Tag.Get("dns")}

obviously against the latest git of go dns.

@armon
Copy link
Member

armon commented Aug 22, 2014

@payoffstan Can you confirm with your test case that this patch works? I don't have a test case to reproduce this.

@payoffstan
Copy link
Author

Thanks @miekg and @armon. We don't have a test case per se but I can patch a dev server and see how it holds up. We finally encountered another panic recently but I haven't gotten the scrubbed packet capture from my ops guy yet. I'll email that once I have it so maybe we can construct a test case for the future.

@miekg
Copy link

miekg commented Aug 23, 2014

[ Quoting notifications@github.com in "Re: [consul] Consul panic after rep..." ]

@payoffstan Can you confirm with your test case that this patch works? I don't have a test case to reproduce this.

Yes, some binary data would help here. I think this check should also be
placed in the other cases.

/Miek

Miek Gieben

@krolaw
Copy link

krolaw commented Aug 28, 2014

I had the same problem with dd90d7942ba6f768acc803b3fba190d2c8a3559d

panic: runtime error: index out of range

goroutine 976 [running]:
runtime.panic(0x80df40, 0xaca1bc)
/usr/local/go/src/pkg/runtime/panic.c:279 +0xf5
github.com/miekg/dns.unpackStructValue(0x7f21a0, 0xc2081b8cc0, 0x0, 0x196, 0xc208256200, 0x200, 0x200, 0x1fd, 0xc200000000, 0x0, ...)
/Users/krolaw/go/src/github.com/miekg/dns/msg.go:976 +0x11fc
github.com/miekg/dns.UnpackStruct(0x815b00, 0xc2081b8cc0, 0xc208256200, 0x200, 0x200, 0x1f1, 0x1fd, 0x0, 0x0)
/Users/krolaw/go/src/github.com/miekg/dns/msg.go:1254 +0xc8
github.com/miekg/dns.UnpackRR(0xc208256200, 0x200, 0x200, 0x1fd, 0x7f41365fc130, 0xc2081b8cc0, 0x1f1, 0x0, 0x0)
/Users/krolaw/go/src/github.com/miekg/dns/msg.go:1331 +0x214
github.com/miekg/dns.(_Msg).Unpack(0xc20815a090, 0xc208256200, 0x200, 0x200, 0x0, 0x0)
/Users/krolaw/go/src/github.com/miekg/dns/msg.go:1546 +0x504
github.com/miekg/dns.(_Conn).ReadMsg(0xc2082a6d20, 0xc20815a000, 0x0, 0x0)
/Users/krolaw/go/src/github.com/miekg/dns/client.go:176 +0x1b6
github.com/miekg/dns.(_Client).exchange(0xc2082a6cd0, 0xc20815a000, 0xc2081c4240, 0xf, 0x0, 0x0, 0x0, 0x0)
/Users/krolaw/go/src/github.com/miekg/dns/client.go:152 +0x388
github.com/miekg/dns.(_Client).Exchange(0xc2082a6cd0, 0xc20815a000, 0xc2081c4240, 0xf, 0xc2081c4240, 0xf, 0x0, 0x0)
/Users/krolaw/go/src/github.com/miekg/dns/client.go:91 +0x88
trafficmate.(_Dns).ServeDNS(0xc2080280d0, 0x7f41365fc098, 0xc2080cee40, 0xc20815a000)
/Users/krolaw/go/src/trafficmate/dns.go:288 +0x759
github.com/miekg/dns.(_Server).serve(0xc2081000c0, 0x7f41365fae18, 0xc20823a720, 0x7f41365fadc8, 0xc2080280d0, 0xc20814ee00, 0x2b, 0x200, 0xc208030098, 0x0)
/Users/krolaw/go/src/github.com/miekg/dns/server.go:325 +0x504
created by github.com/miekg/dns.(*Server).serveUDP
/Users/krolaw/go/src/github.com/miekg/dns/server.go:294 +0x1a8

I've updated to the latest, I'll report back if the problem persists.

@armon
Copy link
Member

armon commented Aug 28, 2014

@krolaw If you have a reproducible case, or a test input that would be great!

@miekg
Copy link

miekg commented Aug 28, 2014

Even a tcpdump would suffice.
On 28 Aug 2014 18:23, "Armon Dadgar" notifications@github.com wrote:

@krolaw https://github.com/krolaw If you have a reproducible case, or a
test input that would be great!


Reply to this email directly or view it on GitHub
#284 (comment).

@armon armon added the type/bug Feature does not function as expected label Oct 14, 2014
@ryanuber
Copy link
Member

Closing due to inactivity. Please ping us if this comes up again!

duckhan pushed a commit to duckhan/consul that referenced this issue Oct 24, 2021
This commit makes a couple of changes:

  * Uses master branch for the hashicorp/consul/sdk dependency.
  * Fix server-acl-init federation tests. Now that consul always enforces ACLs, we have to make sure that we have a valid token when calling the API. Prior to that we called secondary's agent/join API with a token that it had no way of verifying.
  * Standardize on retry.Run timeouts in lifecycle-sidecar tests. Sometimes tests were using too short of a timeout compared to other tests. I switched it to just use the default timer.
  * Fix lifecycle-sidecar tests that were leaving running consul agents behind.
  * Bumps Consul version in CI
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/bug Feature does not function as expected
Projects
None yet
Development

No branches or pull requests

5 participants