New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: dns lookup sometimes fails with "no such host" when there are too many open files #18588

Open
josselin-c opened this Issue Jan 10, 2017 · 11 comments

Comments

Projects
None yet
@josselin-c
Contributor

josselin-c commented Jan 10, 2017

What did you do?

My network-intensive code would sometimes fail resolving domain names, returning "no such host" using the default resolver (pure go).The domains that fail resolving are valid domains that are successfully resolved using dig and the same dns server address.

I noticed that increasing the number of open files limit fix the issue.

I managed to reproduce the failure with: https://play.golang.org/p/MEZUS8h-o5

go build test.go
ulimit -n 20 # Set open file limit to something low
$ ./test 
2017/01/10 11:01:11 lookup munic.io on 127.0.1.1:53: dial udp 127.0.1.1:53: socket: too many open files
$ ./test 
2017/01/10 11:01:11 lookup munic.io on 127.0.1.1:53: dial udp 127.0.1.1:53: socket: too many open files
$ ./test 
2017/01/10 12:01:12 lookup munic.io on 127.0.1.1:53: no such host

What did you expect to see?

Always "too many open files"

What did you see instead?

Sometimes Lookup returns "no such host" when it shouldn't.

Does this issue reproduce with the latest release (go1.7.4)?

I first noticed the problem with 1.7.3. I reproduced the problem on 1.8beta2.

System details

go version go1.8beta2 linux/amd64
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/costanzj/projects/go"
GORACE=""
GOROOT="/usr/local/go"
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build335463609=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"
PKG_CONFIG="pkg-config"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
GOROOT/bin/go version: go version go1.8beta2 linux/amd64
GOROOT/bin/go tool compile -V: compile version go1.8beta2 X:framepointer
uname -sr: Linux 4.4.0-38-generic
Distributor ID:	Ubuntu
Description:	Ubuntu 16.04.1 LTS
Release:	16.04
Codename:	xenial
/lib/x86_64-linux-gnu/libc.so.6: GNU C Library (Ubuntu GLIBC 2.23-0ubuntu5) stable release version 2.23, by Roland McGrath et al.
gdb --version: GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1

@josharian josharian changed the title from dns lookup sometimes fails with "no such host" to net: dns lookup sometimes fails with "no such host" when there are too many open files Jan 10, 2017

@josharian

This comment has been minimized.

Contributor

josharian commented Jan 10, 2017

@mdempsky

This comment has been minimized.

Member

mdempsky commented Jan 10, 2017

I think the issue is in goLookupIPCNAMEOrder. What's happening is the AAAA query fails with "no such host" (because munic.io has no AAAA records), but the A query sometimes fails with "too many open files". Currently if there's more than one error, which one we return to the caller depends on a race.

@mikioh mikioh added this to the Go1.9 milestone Jan 11, 2017

@mikioh

This comment has been minimized.

Contributor

mikioh commented Jan 11, 2017

There's a friend issue #18183. I'm inclined to return a list of effective errors when we cannot determine the cause of the Lookup API failure for surviving the dual IP stack era.

@shuting-yst

This comment has been minimized.

shuting-yst commented May 9, 2017

I think the issue is in goLookupIPCNAMEOrder. What's happening is the AAAA query fails with "no such host" (because munic.io has no AAAA records), but the A query sometimes fails with "too many open files". Currently if there's more than one error, which one we return to the caller depends on a race.

@mdempsky I didn't catch the point reason, could you explain the detail?

@bradfitz bradfitz added the NeedsFix label Jun 8, 2017

@bradfitz bradfitz modified the milestones: Go1.10, Go1.9 Jun 8, 2017

@bradfitz

This comment has been minimized.

Member

bradfitz commented Jun 8, 2017

We unfortunately didn't get to this, but if it's in 1.7 and 1.8, it's not a regression, so we'll look into addressing it for Go 1.10.

@bradfitz bradfitz added the help wanted label Jun 8, 2017

@agnivade

This comment has been minimized.

Member

agnivade commented Aug 30, 2017

I've looked into this a bit.

Yes, the issue is in goLookupIPCNAMEOrder. To be specific, its in the racer goroutine calls. I did some debugging and what is happening is it always returns the last error. Usually, there's only one error - "socket: too many open files", but sometimes the "no such host" comes up too. And when it does, "no such host" is always the last error.

Normal case - (I have added some debugging lines inside the racer lane handler which prints the error)

lookup munic.io. on 127.0.1.1:53: dial udp 127.0.1.1:53: socket: too many open files
lookup munic.io. on 127.0.1.1:53: dial udp 127.0.1.1:53: socket: too many open files
lookup munic.io. on 127.0.1.1:53: dial udp 127.0.1.1:53: socket: too many open files
lookup munic.io. on 127.0.1.1:53: dial udp 127.0.1.1:53: socket: too many open files
2017/08/30 07:00:45 lookup munic.io on 127.0.1.1:53: dial udp 127.0.1.1:53: socket: too many open files


lookup munic.io. on 127.0.1.1:53: dial udp 127.0.1.1:53: socket: too many open files
lookup munic.io. on 127.0.1.1:53: dial udp 127.0.1.1:53: socket: too many open files
lookup munic.io. on 127.0.1.1:53: dial udp 127.0.1.1:53: socket: too many open files
lookup munic.io. on 127.0.1.1:53: dial udp 127.0.1.1:53: socket: too many open files
2017/08/30 07:00:45 lookup munic.io on 127.0.1.1:53: dial udp 127.0.1.1:53: socket: too many open files

Other case -

lookup munic.io. on 127.0.1.1:53: dial udp 127.0.1.1:53: socket: too many open files
lookup munic.io. on 127.0.1.1:53: dial udp 127.0.1.1:53: socket: too many open files
lookup munic.io. on 127.0.1.1:53: dial udp 127.0.1.1:53: socket: too many open files
lookup munic.io. on 127.0.1.1:53: dial udp 127.0.1.1:53: socket: too many open files
lookup munic.io. on 127.0.1.1:53: no such host
2017/08/30 12:30:46 lookup munic.io on 127.0.1.1:53: no such host

lookup munic.io. on 127.0.1.1:53: dial udp 127.0.1.1:53: socket: too many open files
lookup munic.io. on 127.0.1.1:53: dial udp 127.0.1.1:53: socket: too many open files
lookup munic.io. on 127.0.1.1:53: dial udp 127.0.1.1:53: socket: too many open files
lookup munic.io. on 127.0.1.1:53: dial udp 127.0.1.1:53: socket: too many open files
lookup munic.io. on 127.0.1.1:53: no such host
lookup munic.io. on 127.0.1.1:53: no such host
lookup munic.io. on 127.0.1.1:53: no such host
lookup munic.io. on 127.0.1.1:53: no such host
lookup munic.io. on 127.0.1.1:53: no such host
lookup munic.io. on 127.0.1.1:53: no such host
2017/08/30 12:46:49 lookup munic.io on 127.0.1.1:53: no such host

I think if we change the logic to return the first seen error instead of the last error, its a more elegant solution. As mentioned by @pmarks-net on a slightly different but related thread #18183 (comment).

Combining the errors is also a possibility but as a user, I would prefer to see one error rather than a list of errors.

Overall, it seems to be a fairly simple enough fix. The real thing is to decide what it should be - returning all errors or the first encountered error. Or maybe something very different. Let me know what you guys think.

@pmarks-net

This comment has been minimized.

Contributor

pmarks-net commented Aug 30, 2017

According to goLookupIPCNAMEOrder, no ordering exists between the A and AAAA queries (until the actual addresses are sorted by sortByRFC6724).

So, "first" and "last" are meaningless when talking about A vs. AAAA; the only way I see to fix this case would be to define an ordering between errors, i.e. "too many open files" always replaces "no such host" when both are present.

Out of curiosity, does enabling StrictErrors make your example behave more predictably? That might only help if the two errors are generated by different elements of conf.nameList.

@agnivade

This comment has been minimized.

Member

agnivade commented Aug 31, 2017

So, "first" and "last" are meaningless when talking about A vs. AAAA; the only way I see to fix this case would be to define an ordering between errors, i.e. "too many open files" always replaces "no such host" when both are present.

"first" and "last" still hold meaning when you think of it as first error returned. As in whichever goroutine returned first with an error. Defining a precedence order among errors might be overdoing it IMO. There are lots of errors and lots of edge cases where one error might rank above another. Maybe a simpler solution might work here ?

Out of curiosity, does enabling StrictErrors make your example behave more predictably?

I re-ran the examples with StrictErrors. No such difference. I am not super familiar with the code base, but when I print the index inside the conf.nameList range loop, I see that it has only one element.

Here is the error case, with added logging -

21:44:09-agniva-~/play/gosource/go/src$./lookup 
[0]: fqdn- munic.io., error- lookup munic.io. on 127.0.1.1:53: dial udp 127.0.1.1:53: socket: too many open files
[0]: fqdn- munic.io., error- lookup munic.io. on 127.0.1.1:53: dial udp 127.0.1.1:53: socket: too many open files
[0]: fqdn- munic.io., error- lookup munic.io. on 127.0.1.1:53: dial udp 127.0.1.1:53: socket: too many open files
[0]: fqdn- munic.io., error- lookup munic.io. on 127.0.1.1:53: dial udp 127.0.1.1:53: socket: too many open files
[0]: fqdn- munic.io., error- lookup munic.io. on 127.0.1.1:53: no such host
2017/08/31 21:44:09 lookup munic.io on 127.0.1.1:53: no such host

mentioning the source code here for easy reference -

for i, fqdn := range conf.nameList(name) {
	for _, qtype := range qtypes {
		go func(qtype uint16) {
			cname, rrs, err := r.tryOneName(ctx, conf, fqdn, qtype)
			lane <- racer{cname, rrs, err}
		}(qtype)
	}
	hitStrictError := false
	for range qtypes {
		racer := <-lane
		if racer.error != nil {
			if nerr, ok := racer.error.(Error); ok && nerr.Temporary() && r.StrictErrors {
				fmt.Printf("[%d]: fqdn- %v, error- %v\n", i, fqdn, racer.error)
				// This error will abort the nameList loop.
				hitStrictError = true
				lastErr = racer.error
			} else if lastErr == nil || fqdn == name+"." {
				fmt.Printf("[%d]: fqdn- %v, error- %v\n", i, fqdn, racer.error)
				// Prefer error for original name.
				lastErr = racer.error
			}
			continue
		}
		addrs = append(addrs, addrRecordList(racer.rrs)...)
		if cname == "" {
			cname = racer.cname
		}
	}
	if hitStrictError {
		// If either family hit an error with StrictErrors enabled,
		// discard all addresses. This ensures that network flakiness
		// cannot turn a dualstack hostname IPv4/IPv6-only.
		addrs = nil
		break
	}
	if len(addrs) > 0 {
		break
	}
}
@pmarks-net

This comment has been minimized.

Contributor

pmarks-net commented Aug 31, 2017

"first" and "last" still hold meaning when you think of it as first error returned. As in whichever goroutine returned first with an error.

When two goroutines are literally racing each other, "first" and "last" both yield nondeterministic behavior, so I think it would be difficult to argue that either option is actually better. At least, I have no opinion in the matter.

@tjmerritt

This comment has been minimized.

tjmerritt commented Sep 14, 2017

A different question is why do we get the 127.0.1.1:53: dial udp 127.0.1.1:53: socket: too many open files error in the first case. I hit this very quickly with 100 threads making outbound http requests. It seems that open files for make in UDP DNS requests are not be reused. This leak results in the too many open files error trigger when the open file limit for the user is hit. In my case this limit is set at 1024. With a max of 100 concurrent http requests, I don't really see the need for more than 300 open files. I could up the open file limit, but it increases the time to failure and does not eliminate it.

@mikioh

This comment has been minimized.

Contributor

mikioh commented Feb 16, 2018

@tjmerritt,

I filed a new issue, #23866, for your concern. Let's focus on "error handling" here.

@gopherbot gopherbot modified the milestones: Go1.11, Unplanned May 23, 2018

@eur0pa eur0pa referenced this issue Aug 1, 2018

Closed

new features #5

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment