net: retry DNS lookups before failure? #16865
I've frequently noticed that our net DNS tests running on builders are often flaky.
Notice they all fail after 5 seconds (our default DNS timeout).
Did a UDP request get lost?
Did a UDP response get lost?
Does NAT make some builders worse?
Should we make builders re-try all DNS tests N times?
But this is also flaky (though to a much lesser degree) on my desktop on wired ethernet. With 500 runs, I still see occasional failures.
Maybe we should make our net package's DNS code automatically resend the UDP request after half the timeout? (i.e. after 2.5 seconds by default)
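To make the resend idea concrete, here is a rough sketch. It is not the net package's code: `queryWithResend`, `send`, and `replies` are hypothetical names, with `send` standing in for the UDP write and `replies` for whatever goroutine reads the socket. It sends once, sends again if nothing has arrived after half the timeout, and gives up when the full timeout has elapsed.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errTimeout = errors.New("i/o timeout")

// queryWithResend sends the query once, resends it if no reply has arrived
// after timeout/2, and returns errTimeout once the full timeout has elapsed.
// send and replies are hypothetical stand-ins for a UDP write and a reader.
func queryWithResend(timeout time.Duration, send func() error, replies <-chan []byte) ([]byte, error) {
	overall := time.NewTimer(timeout)
	defer overall.Stop()
	resend := time.NewTimer(timeout / 2)
	defer resend.Stop()

	if err := send(); err != nil {
		return nil, err
	}
	for {
		select {
		case r := <-replies:
			return r, nil
		case <-resend.C: // fires at most once
			if err := send(); err != nil {
				return nil, err
			}
		case <-overall.C:
			return nil, errTimeout
		}
	}
}

func main() {
	// Simulate a lost first request: only the second send produces a reply.
	replies := make(chan []byte, 1)
	sends := 0
	send := func() error {
		sends++
		if sends == 2 {
			replies <- []byte("answer")
		}
		return nil
	}
	r, err := queryWithResend(100*time.Millisecond, send, replies)
	fmt.Println(string(r), err, sends)
}
```

With the default 5 second timeout this would resend at 2.5 seconds, so a single lost packet costs half the budget instead of all of it.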
I'm okay with us changing the DNS resolver logic to more closely match other DNS client libraries if that helps the flakiness, but I'm hesitant to do things like change default timeouts / retry logic just to appease flaky tests.
A possible testing-side fix: we could run a simple local DNS server that just knows how to respond to certain fixed DNS queries. It doesn't even need to implement proper DNS packet decoding. It just needs to copy the 16-bit query ID at the start of the packet, and then do an exact byte-string lookup on the rest to decide on a response.
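A minimal sketch of what such a test server could look like, assuming a map from raw query bytes (minus the ID) to raw reply bytes (minus the ID). The names `fixedAnswers`, `respond`, and `serve` are made up for illustration, and the placeholder strings stand in for real DNS wire-format packets that a test would have to supply.

```go
package main

import (
	"fmt"
	"net"
)

// fixedAnswers maps everything after the 2-byte query ID to a canned reply
// body (also without the ID). Keys and values are raw packet bytes.
type fixedAnswers map[string][]byte

// respond returns the reply for one query packet, or nil if the query is
// too short or not in the table. No DNS decoding is performed.
func (fa fixedAnswers) respond(query []byte) []byte {
	if len(query) < 2 {
		return nil
	}
	body, ok := fa[string(query[2:])]
	if !ok {
		return nil
	}
	reply := make([]byte, 0, 2+len(body))
	reply = append(reply, query[:2]...) // copy the 16-bit query ID
	return append(reply, body...)
}

// serve answers queries arriving on conn until a read fails,
// for example because conn was closed.
func (fa fixedAnswers) serve(conn net.PacketConn) {
	buf := make([]byte, 512)
	for {
		n, addr, err := conn.ReadFrom(buf)
		if err != nil {
			return
		}
		if reply := fa.respond(buf[:n]); reply != nil {
			conn.WriteTo(reply, addr)
		}
	}
}

func main() {
	// Placeholder bytes only; a real test would use captured DNS packets.
	fa := fixedAnswers{"placeholder-question": []byte("placeholder-answer")}
	query := append([]byte{0xAB, 0xCD}, "placeholder-question"...)
	fmt.Printf("%q\n", fa.respond(query))
}
```

A test would presumably open a loopback socket with `net.ListenPacket("udp", "127.0.0.1:0")`, run `serve` in a goroutine, and point the resolver under test at that address. Unknown queries are silently dropped here, which also gives a cheap way to exercise the timeout path.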
Flaky tests are how I started down this path, but then I realized our DNS client just might need work too.
But looking at the cited test failures, I see one is pure Go, one is cgo, and one is
@adg and I started working on that once (can't find the bug) but never finished, apparently.
It seems our code already does try to do a certain number of attempts (
It looks like one deadline is set up before the loop, so once the first attempt fails due to timeout, all the remaining attempts will necessarily fail too, because the deadline has already passed.
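To illustrate the difference, here is a simplified model rather than the actual resolver code. `sharedDeadline`, `perAttemptDeadline`, and `lossyExchange` are hypothetical names; `lossyExchange` simulates a round trip whose first reply is lost.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errTimeout = errors.New("i/o timeout")

// exchangeFunc performs one query round trip that must finish by deadline.
type exchangeFunc func(deadline time.Time) error

// sharedDeadline computes one deadline before the loop, so every retry
// after a timed-out attempt starts with an expired deadline.
func sharedDeadline(attempts int, timeout time.Duration, ex exchangeFunc) error {
	deadline := time.Now().Add(timeout)
	var err error
	for i := 0; i < attempts; i++ {
		if err = ex(deadline); err == nil {
			return nil
		}
	}
	return err
}

// perAttemptDeadline gives each attempt its own fresh deadline.
func perAttemptDeadline(attempts int, timeout time.Duration, ex exchangeFunc) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = ex(time.Now().Add(timeout)); err == nil {
			return nil
		}
	}
	return err
}

// lossyExchange simulates a lost first reply: the first call waits out its
// deadline and times out; later calls succeed unless their deadline has
// already passed.
func lossyExchange() exchangeFunc {
	calls := 0
	return func(deadline time.Time) error {
		calls++
		if calls == 1 {
			time.Sleep(time.Until(deadline))
			return errTimeout
		}
		if !time.Now().Before(deadline) {
			return errTimeout
		}
		return nil
	}
}

func main() {
	const timeout = 50 * time.Millisecond
	fmt.Println("shared:", sharedDeadline(2, timeout, lossyExchange()))          // shared: i/o timeout
	fmt.Println("per-attempt:", perAttemptDeadline(2, timeout, lossyExchange())) // per-attempt: <nil>
}
```

The per-attempt version can take up to attempts × timeout in total, which is the trade-off to weigh against the existing behavior.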
What do other DNS implementations do?
Yeah, that appears to be part of the problem at least. libresolv in glibc uses cfg.timeout to compute individual UDP round-trip timeouts, not as a global timeout.
It has a kind of goofy timeout calculation logic though. For the first server in the nameserver list, it uses cfg.timeout directly. But for the rest, it uses
Checking glibc commit history, it looks like that logic came from BIND 8.2.3 in 2000 (see bminor/glibc@e685e07). Prior to that, there was a seemingly saner approach: for the first attempt to each server, use
I want to say this is just an accident because of how they split out a function similar to our
djbdns's client library doesn't respect the timeout/retry settings in resolv.conf. In stub resolver mode, it simply always uses 3 retries, and uses timeouts of 3s, 11s, and 45s per UDP query.
This might be related: One of our core components actually experienced many
We were actually wondering why the Go DNS resolver does not try TCP when UDP fails here:
Maybe I shouldn't be commenting on closed tickets but we have been experiencing DNS UDP dial timeouts in our production environment using Go 1.7.3.
dial tcp: lookup xxx.xxxxxx.com on 10.129.0.2:53: dial udp 10.129.0.2:53: i/o timeout"}
Go 1.7.3 using official Docker image on AWS EC2.