Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: retry DNS lookups before failure? #16865

Closed
bradfitz opened this issue Aug 24, 2016 · 20 comments

Comments

Projects
None yet
7 participants
@bradfitz
Copy link
Member

commented Aug 24, 2016

I've frequently noticed that our net DNS tests running on builders are often flaky.

For example:

https://build.golang.org/log/ce5a87135d1a5ed4f17bd998ace2e0060b9ad597
https://build.golang.org/log/b3e762fc83d463acba21987ff558c8018b33c7cb
https://build.golang.org/log/250fc567590d125f1c8fd27740105eb7288ab16c

--- FAIL: TestLookupDotsWithRemoteSource (5.05s)
    lookup_test.go:566: LookupSRV(xmpp-server, tcp, google.com): lookup _xmpp-server._tcp.google.com on 8.8.8.8:53: no such host (mode=go)

--- FAIL: TestLookupDotsWithRemoteSource (5.46s)
    lookup_test.go:540: LookupMX(google.com): lookup google.com on 8.8.8.8:53: no such host (mode=cgo)
FAIL
FAIL    net 7.838s

--- FAIL: TestLookupGmailNS (5.01s)
    lookup_test.go:142: lookup gmail.com. on 8.8.8.8:53: dial udp 8.8.8.8:53: i/o timeout
FAIL
FAIL    net 7.336s

etc.

Notice they're all after 5 seconds. (our default DNS timeout)

Did a UDP request get lost?

Did a UDP response get lost?

Does NAT make some builders worse?

Should we make builders re-try all DNS tests N times?

But this is also flaky (but to a much lesser degree) on my desktop on wired ethernet. With 500 runs, I still see occasional failures.

Maybe we should make our net package's DNS code automatically resend the UDP request after half the timeout? (i.e. after 2.5 seconds by default)

/cc @mdempsky @josharian @minux @ianlancetaylor @mikioh

@bradfitz bradfitz added the Testing label Aug 24, 2016

@bradfitz bradfitz added this to the Go1.8 milestone Aug 24, 2016

@bradfitz bradfitz self-assigned this Aug 24, 2016

@mdempsky

This comment has been minimized.

Copy link
Member

commented Aug 24, 2016

I'm okay with us changing the DNS resolver logic to more closely match other DNS client libraries if that helps the flakiness, but I'm hesitant to do things like change default timeouts / retry logic just to appease flaky tests.

A possible testing-side fix: we could run a simple local DNS server that just knows how to respond to certain fixed DNS queries. It doesn't even need to implement proper DNS packet decoding. It just needs to copy the 16-bit query ID at the start of the packet, and then do an exact byte-string lookup on the rest to decide on a response.

@bradfitz

This comment has been minimized.

Copy link
Member Author

commented Aug 24, 2016

I'm okay with us changing the DNS resolver logic to more closely match other DNS client libraries if that helps the flakiness, but I'm hesitant to do things like change default timeouts / retry logic just to appease flaky tests.

Flaky tests is how I started down this path, but then I realized our DNS client just might need work too.

But looking at the cited test failures, I see one is pure Go, one is cgo, and one is dial udp 8.8.8.8:53: i/o timeout ... how does a UDP dial time out!? Isn't it instant?

A possible testing-side fix: we could run a simple local DNS server that just knows how to respond to certain fixed DNS queries. It doesn't even need to implement proper DNS packet decoding. It just needs to copy the 16-bit query ID at the start of the packet, and then do an exact byte-string lookup on the rest to decide on a response.

@adg and I started working on that once (can't find the bug) but never finished, apparently.

@bradfitz

This comment has been minimized.

Copy link
Member Author

commented Aug 25, 2016

It seems our code already does try to do a certain number of attempts (for i := 0; i < cfg.attempts; i++ { in dnsclient_unix.go:func tryOneName), but I don't think it's necessarily working.

It looks like one deadline is set up before the loop, then the first one will fail due to timeout, and all the rest will all necessarily fail because the timeout is already dead.

What do other DNS implementations do?

@mdempsky

This comment has been minimized.

Copy link
Member

commented Aug 25, 2016

Yeah, that appears to be part of the problem at least. libresolv in glibc uses cfg.timeout to compute individual UDP round-trip timeouts, not as a global timeout.

It has a kind of goofy timeout calculation logic though. For the first server in the nameserver list, it uses cfg.timeout directly. But for the rest, it uses (cfg.timeout << ns) / len(cfg.servers), where ns is the server's (0-based) index in cfg.servers.

Checking glibc commit history, it looks like that logic came from BIND 8.2.3 in 2000 (see bminor/glibc@e685e07). Prior to that, there was a somewhat seemingly more sane approach: for the first attempt to each server, use cfg.timeout directly; but for retries use (cfg.timeout << try) / len(cfg.servers).

I want to say this is just an accident because of how they split out a function similar to our exchange, but renaming the variable from try to ns seems like it was intentional.

djbdns's client library doesn't respect the timeout/retry settings in resolv.conf. In stub resolver mode, it simply always uses 3 retries, and uses timeouts of 3s, 11s, and 45s per UDP query.

@bradfitz

This comment has been minimized.

Copy link
Member Author

commented Aug 26, 2016

More good comments in #16885 (comment)

@bradfitz

This comment has been minimized.

Copy link
Member Author

commented Aug 26, 2016

And #16885 (comment) suggests it might be a Go 1.7 regression worth fixing in a point release.

@mdempsky

This comment has been minimized.

Copy link
Member

commented Aug 26, 2016

Related to integrating context support?

@bradfitz

This comment has been minimized.

Copy link
Member Author

commented Aug 26, 2016

I think so.

@Nitecon

This comment has been minimized.

Copy link

commented Aug 26, 2016

This would definitely be good for fixing in a 1.7 point release as it does currently cause us production impact and does not follow the resolver implementations for max timeouts as mentioned by Alex in #16885

@mdempsky

This comment has been minimized.

Copy link
Member

commented Aug 26, 2016

Behavior changed in f60fcca.

@mdempsky

This comment has been minimized.

Copy link
Member

commented Aug 29, 2016

@bradfitz Have you already started on this? If not, I can prepare a CL.

@bradfitz

This comment has been minimized.

Copy link
Member Author

commented Aug 29, 2016

I haven't. Please do!

@mdempsky mdempsky assigned mdempsky and unassigned bradfitz Aug 29, 2016

@gopherbot

This comment has been minimized.

Copy link

commented Aug 29, 2016

CL https://golang.org/cl/28057 mentions this issue.

@gopherbot gopherbot closed this in 11e3955 Aug 29, 2016

@bradfitz

This comment has been minimized.

Copy link
Member Author

commented Aug 29, 2016

@mdempsky, think we should cherry-pick this back to Go 1.7.1?

/cc @ianlancetaylor @broady

@mdempsky

This comment has been minimized.

Copy link
Member

commented Aug 29, 2016

@bradfitz It should definitely be cherry-picked to Go 1.7.x. If 1.7.1 isn't final yet, I think this makes sense to include.

@bradfitz bradfitz modified the milestones: Go1.7.1, Go1.8 Aug 29, 2016

@mlafeldt

This comment has been minimized.

Copy link

commented Sep 1, 2016

This might be related: One of our core components actually experienced many dial udp 10.8.0.2:53: i/o timeout errors in production after updating from Go 1.6 to 1.7. We had to switch to the cgo DNS implementation to still be able to move forward with Go 1.7.

We were actually wondering why the Go DNS resolver does not try TCP when UDP fails here:

c, err := d.dialDNS(ctx, network, server)
if err != nil {
return nil, err
}

/cc @Luzifer @denderello

@mikioh

This comment has been minimized.

Copy link
Contributor

commented Sep 2, 2016

Drop testing label because this looks a functionality issue.

@mikioh mikioh removed the Testing label Sep 2, 2016

gopherbot pushed a commit that referenced this issue Sep 7, 2016

[release-branch.go1.7] net: restore per-query timeout logic
The handling of "options timeout:n" is supposed to be per individual
DNS server exchange, not per Lookup call.

Fixes #16865.

Change-Id: I2304579b9169c1515292f142cb372af9d37ff7c1
Reviewed-on: https://go-review.googlesource.com/28057
Run-TryBot: Matthew Dempsky <mdempsky@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-on: https://go-review.googlesource.com/28640
@mikioh

This comment has been minimized.

Copy link
Contributor

commented Oct 5, 2016

This is a regression of Go 1.7. Go 1.6.3 or below, or Go 1.7.1 or above works fine.

@jbrook

This comment has been minimized.

Copy link

commented Nov 17, 2016

Maybe I shouldn't be commenting on closed tickets but we have been experiencing DNS UDP dial timeouts in our production environment using Go 1.7.3.

dial tcp: lookup xxx.xxxxxx.com on 10.129.0.2:53: dial udp 10.129.0.2:53: i/o timeout"}

Go 1.7.3 using official Docker image on AWS EC2.
AWS EC2 instance has DNS resolving to a Route53 private hosted zone.

@bradfitz

This comment has been minimized.

Copy link
Member Author

commented Nov 17, 2016

You can comment on closed tickets if you'd like, but it's just not very effectual, as we don't track closed tickets.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
You can’t perform that action at this time.