New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: resolver should call res_init when resolv.conf changes #21083

Open
oconnor663 opened this Issue Jul 19, 2017 · 10 comments

Comments

Projects
None yet
7 participants
@oconnor663

oconnor663 commented Jul 19, 2017

If you force the cgo resolver on non-Debian-based Linux with glibc <=2.25 (that is, any stable glibc as of this writing) and Go 1.8.3 (probably any released version), it's very easy to get into a state where all network connections fail even though the machine's network is up:

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	i := 1
	for {
		ip, err := net.LookupIP("google.com")
		if err != nil {
			fmt.Printf("%3d error: %s\n", i, err)
		} else {
			fmt.Printf("%3d success: %s\n", i, ip)
		}
		time.Sleep(1 * time.Second)
		i++
	}
}
  1. Turn off / unplug the network connection on your machine.
  2. Observe that NetworkManager or whatever daemon on your machine updates /etc/resolv.conf and removes the nameservers. (This does depend on what Linux distro / local network configuration you have, so if /etc/resolv.conf never changes you might need to repro this on a different machine.)
  3. GODEBUG=netdns=cgo go run test.go (or whatever you named it) on the file above. The GODEBUG setting forces the net library to use the cgo resolver. Note that the pure-Go resolver does not exhibit this bug.
  4. Watch it for a few seconds and observe a couple lookup errors (no such host). This is expected, because the network is down.
  5. Without stopping the test program above, turn the network back on.
  6. Confirm again that /etc/resolv.conf changes, and that your nameservers are back.
  7. Keep watching the output of the test program.
    BUG: The test program's DNS requests keep failing. Sometimes they fix themselves eventually, I'm not sure under exactly what conditions, but sometimes I never see them recover. It seems like if the test program ran with the network off for a minute, recovery is less likely than if you turned the network back on after a couple seconds.

This seems to be related to a bug in glibc, where /etc/resolv.conf isn't checked for changes as often as it should be. It looks like glibc is going to ship a fix for this in their next release, though affected versions will still be around for years. Debian has patched this fix themselves for a long time (this patch?), which is why I don't think the bug will repro on Debian-based distros. I've recently had to work around this issue in application code, by calling libc::res_init manually. The Rust standard library has started calling res_init after DNS failures by default, and that issue links to similar workarounds in other languages and applications.

Seems like a lot of big programs have gone through the pain of re-discovering this issue. Here's Mozilla Firefox from 14 years ago. And more recently, Chef (and Ruby).

I think the Go standard library should consider adding a workaround similar to what Rust has done. It looks like this was considered in #10850, but it might not have been clear how common this bug is? The most painful case is a long-running Go process on a laptop where the WiFi comes and goes.

[Apologies if the "proposal" label was incorrect. This could be called a bugfix, depending on how you feel about it :) ]

@gopherbot gopherbot added this to the Proposal milestone Jul 19, 2017

@gopherbot gopherbot added the Proposal label Jul 19, 2017

@ianlancetaylor

This comment has been minimized.

Contributor

ianlancetaylor commented Jul 19, 2017

@rsc

This comment has been minimized.

Contributor

rsc commented Jul 31, 2017

Links:

Both patches simply call res_init and then return the error they already hold. They do not retry the call that failed. I assume calling code is expected to do that eventually (this is to break a loop that would let everything keep getting the bad result). I guess if it's been 14+ years the distros are never going to fix this themselves / glibc isn't?

At least in glibc, res_init has a lock so it's OK to call from multiple goroutines simultaneously. It would be pretty easy to call res_init on the error result from getaddrinfo, as in those two CLs.

Rust seems to do this only for lookup_host, not other network calls. It's unclear if there are other entry points we should be considering.

Overall this seems unfortunate but fairly low risk to try to fix on the Go side. Not for a point release but maybe for Go 1.10.

Update: Don't do this on OS X (see rust-lang/rust#43592), and maybe evaluate whether BSD libc more generally should do it.

ping @mdempsky @iangudger ?

@oconnor663

This comment has been minimized.

oconnor663 commented Jul 31, 2017

I guess if it's been 14+ years the distros are never going to fix this themselves / glibc isn't?

If their timeline is still right, it looks like glibc is actually going to ship a fix for this tomorrow. But I assume some supported systems (most current non-Debian distros?) will be running versions of glibc with this bug for many years.

Rust seems to do this only for lookup_host, not other network calls

I think the private function lookup_host ends up getting called by everything network related in the Rust standard library. For example, TcpStream::connect -> each_addr -> to_socket_addrs -> resolve_socket_addr -> lookup_host.

Overall this seems unfortunate but fairly low risk to try to fix on the Go side.

Two big benefits to fixing this on the Go side:

  1. If App A depends on Library B, and Library B makes network calls, I don't know of a way for A's authors to reliably fix B's calls without making code changes in B.
  2. Fixing this bug in application code requires cgo, which is a bummer if your build didn't need cgo before.
@iangudger

This comment has been minimized.

Contributor

iangudger commented Jul 31, 2017

I have no opinion on this.

@oconnor663

This comment has been minimized.

oconnor663 commented Aug 1, 2017

Possible risk: we'll have to be careful about calling res_init from multiple threads. See rust-lang/rust#43592

@rsc

This comment has been minimized.

Contributor

rsc commented Aug 14, 2017

Ping @mdempsky.

@mdempsky

This comment has been minimized.

Member

mdempsky commented Aug 15, 2017

Calling res_init after a lookup failure seems like a kludge to me. It wouldn't necessarily detect when we're just getting a different response (e.g., if search paths change). Also, I saw that this is what Mozilla and Rust both do, but I couldn't immediately track down how Ruby/Chef decide when to call res_init.

I would be inclined to just periodically call res_init (e.g., before every getaddrinfo call, unless it's been less than 5 seconds since the previous one?), so the behavior more closely matches the Go resolver behavior.

@mikioh mikioh changed the title from proposal: the net/cgo_unix DNS resolver should call res_init on lookup failure to proposal: net: cgo DNS stub resolver should call res_init on lookup failure Sep 20, 2017

@rsc

This comment has been minimized.

Contributor

rsc commented Sep 25, 2017

My concern with the "before every getaddrinfo call except not more than once in 5 seconds" is that every 5 seconds there will be a latency hiccup where calls back up on the res_init finishing, a res_init that 99% of the time is not necessary. Using the failure to trigger the res_init seems less costly. It seems to be working for Keybase at least (written in Go, linked above).

@rsc

This comment has been minimized.

Contributor

rsc commented Sep 25, 2017

Superficial reading of the glibc fix looks like they are stat'ing resolv.conf to see if it changes. If we did that instead of blindly calling res_init, then @mdempsky's suggestion seems a bit better. Also, we're already watching resolv.conf for changes, because we decide whether to use cgo at all based on the resolv.conf being "complicated" or not.

If we are already stat'ing the resolv.conf, maybe we should just call res_init each time we see it change. But that check is on the non-cgo path only and would have to be moved up earlier, to be common to all lookups. Still, that would make the "time to notice a change" the same in both Go and cgo modes.

If you move the tryUpdate calls into the common code paths instead of the go-only ones, then it might also have the effect that if resolv.conf changes from "complicated" to "simple" then we might start using the Go native resolver again, which might be a win.

If we're going to do this for Go 1.10, we should do it fairly soon to get it soaking.

Anyone interested in trying this? @mikioh? @mdempsky?

@rsc rsc changed the title from proposal: net: cgo DNS stub resolver should call res_init on lookup failure to net: resolver should call res_init when resolv.conf changes Sep 25, 2017

@rsc rsc modified the milestones: Proposal, Go1.10 Sep 25, 2017

@oconnor663

This comment has been minimized.

oconnor663 commented Oct 6, 2017

That bug linked above (rust-lang/rust#43592) is a real crasher. We think that res_init is threadsafe under glibc, but we know it's not on macOS at least. The Rust stdlib ended up working around this with a runtime call to gnu_get_libc_version, rust-lang/rust#44965.

@bradfitz bradfitz modified the milestones: Go1.10, Go1.11 Nov 17, 2017

@ianlancetaylor ianlancetaylor modified the milestones: Go1.11, Go1.12 Jun 27, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment