Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: DNS broken on darwin without cgo (1.13 regression) #31705

Closed
bradfitz opened this issue Apr 26, 2019 · 52 comments
Closed

net: DNS broken on darwin without cgo (1.13 regression) #31705

bradfitz opened this issue Apr 26, 2019 · 52 comments

Comments

@bradfitz
Copy link
Contributor

@bradfitz bradfitz commented Apr 26, 2019

I was testing some new code for the Go build system and found that a simple TCP dial doesn't work on Mac anymore, at least when the binary is cross-compiled.

Code is just:

var coordDialer = &net.Dialer{
        Timeout:   10 * time.Second,
        KeepAlive: 15 * time.Second,
}       

// dialCoordinatorTCP returns a TCP connection to the coordinator, making                                                                                                                                                                                          
// a CONNECT request to a proxy as a fallback.                                                                                                                                                                                                                     
func dialCoordinatorTCP(ctx context.Context, addr string) (net.Conn, error) {
        tcpConn, err := coordDialer.DialContext(ctx, "tcp", addr)

... with a context.Background() for ctx.

It always times out after 10 seconds.

But if I redeploy the same code but built with Go 1.12.x, it works fine.

@bradfitz

This comment has been minimized.

Copy link
Contributor Author

@bradfitz bradfitz commented Apr 26, 2019

My best guess is f6b42a5 ("net: use libSystem bindings for DNS resolution on macos if cgo is unavailable").

We might need some more test coverage. Or a no-cgo darwin builder.

/cc @ianlancetaylor @grantseltzer

@bradfitz bradfitz changed the title net: buildlet doesn't work on darwin-amd64 with Go master net: cross-compiled cgo-less buildlet doesn't work on darwin-amd64 with Go master Apr 26, 2019
@groob

This comment has been minimized.

Copy link
Contributor

@groob groob commented Apr 28, 2019

FWIW I can't seem to reproduce with a binary compiled on a 10.14.4 mac. I tested building with cgo disabled and setting GODEBUG=netdns=go.

@bradfitz

This comment has been minimized.

Copy link
Contributor Author

@bradfitz bradfitz commented Apr 29, 2019

I built on Linux and ran on a Mac, without setting any special environment variables.

@grantseltzer

This comment has been minimized.

Copy link
Contributor

@grantseltzer grantseltzer commented Apr 29, 2019

Could this be because when you cross compile on Linux the linker doesn't have access to libSystem?

Not sure how this done for every other binding when there's cross compilation

Also, this is with CGO enabled, netcgo not specified, cross compiled for darwin on linux?

@bradfitz

This comment has been minimized.

Copy link
Contributor Author

@bradfitz bradfitz commented Apr 29, 2019

@bradfitz

This comment has been minimized.

Copy link
Contributor Author

@bradfitz bradfitz commented Apr 30, 2019

We just disabled cgo support for darwin/386 (per #31751) so we now have a CGO_ENABLED=0 Mac builder, which now hits this issue. Which is good in that we can reproduce it.

Looks like it's stuck in DNS queries, so f6b42a5 looks implicated.

https://build.golang.org/log/289a154e730768cccbc64dd0ea2af16b4b48db88

ok  	mime	0.017s
ok  	mime/multipart	0.365s
ok  	mime/quotedprintable	0.112s
panic: test timed out after 3m0s

goroutine 343 [running]:
testing.(*M).startAlarm.func1()
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:1380 +0xc5
created by time.goFunc
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/time/sleep.go:169 +0x31

goroutine 1 [chan receive, 2 minutes]:
testing.(*T).Run(0x11720f00, 0x239388, 0xf, 0x2482c8, 0x1)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:964 +0x2c5
testing.runTests.func1(0x114da000)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:1205 +0x54
testing.tRunner(0x114da000, 0x11498f10)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:912 +0x90
testing.runTests(0x114a8020, 0x3eef00, 0xdf, 0xdf, 0x0)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:1203 +0x227
testing.(*M).Run(0x11474200, 0x0)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:1120 +0x111
net.TestMain(0x11474200)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/net/main_test.go:52 +0x25
main.main()
	_testmain.go:552 +0xfa

goroutine 612 [chan receive, 2 minutes]:
testing.(*T).Parallel(0x11720aa0)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:817 +0x18d
net.TestLookupGoogleSRV(0x11720aa0)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/net/lookup_test.go:70 +0x29
testing.tRunner(0x11720aa0, 0x2482f0)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:912 +0x90
created by testing.(*T).Run
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:963 +0x2a6

goroutine 613 [chan receive, 2 minutes]:
testing.(*T).Parallel(0x11720b40)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:817 +0x18d
net.TestLookupGmailMX(0x11720b40)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/net/lookup_test.go:119 +0x1f
testing.tRunner(0x11720b40, 0x2482d8)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:912 +0x90
created by testing.(*T).Run
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:963 +0x2a6

goroutine 614 [chan receive, 2 minutes]:
testing.(*T).Parallel(0x11720be0)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:817 +0x18d
net.TestLookupGmailNS(0x11720be0)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/net/lookup_test.go:165 +0x1f
testing.tRunner(0x11720be0, 0x2482dc)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:912 +0x90
created by testing.(*T).Run
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:963 +0x2a6

goroutine 615 [chan receive, 2 minutes]:
testing.(*T).Parallel(0x11720c80)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:817 +0x18d
net.TestLookupGmailTXT(0x11720c80)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/net/lookup_test.go:214 +0x29
testing.tRunner(0x11720c80, 0x2482e0)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:912 +0x90
created by testing.(*T).Run
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:963 +0x2a6

goroutine 619 [running]:
	goroutine running on other thread; stack unavailable
created by testing.(*T).Run
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:963 +0x2a6
FAIL	net	180.028s
@randall77

This comment has been minimized.

Copy link
Contributor

@randall77 randall77 commented Apr 30, 2019

Could this be because when you cross compile on Linux the linker doesn't have access to libSystem?

I don't think this should matter. We don't actually need access to libSystem to build a binary which dynamically links to it. Building on Linux and running on a Mac should work fine with regards to this feature.

@gopherbot

This comment has been minimized.

Copy link

@gopherbot gopherbot commented Apr 30, 2019

Change https://golang.org/cl/174637 mentions this issue: dashboard: add darwin-amd64-nocgo config, remove nacl-386 trybot

gopherbot pushed a commit to golang/build that referenced this issue Apr 30, 2019
Also remove dead nacl-arm. It hasn't run in ages.

And update netbsd comment about why 386 doesn't run. And correct its
VM image name.

Updates golang/go#31705
Updates golang/go#31726

Change-Id: I9de4605f34a052d0a84684fca098388d75602a82
Reviewed-on: https://go-review.googlesource.com/c/build/+/174637
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
@bcmills

This comment has been minimized.

Copy link
Member

@bcmills bcmills commented May 6, 2019

This seems to be the same failure on the darwin-386-10_14 buildlet; retitling accordingly.

https://build.golang.org/log/d3e5f74f9924af70e34b2de05f64cb9cdfb41310

@bcmills bcmills changed the title net: cross-compiled cgo-less buildlet doesn't work on darwin-amd64 with Go master net: cross-compiled cgo-less buildlet doesn't work on darwin with Go master May 6, 2019
@grantseltzer

This comment has been minimized.

Copy link
Contributor

@grantseltzer grantseltzer commented May 6, 2019

My theory is that this has to do with the build constraints here. Perhaps CgoEnabled isn't checked correctly? I know that cgo is disabled when cross compiling.

Is reproducing this just cross compiling from macos to linux?

@bradfitz bradfitz changed the title net: cross-compiled cgo-less buildlet doesn't work on darwin with Go master net: DNS broken on darwin without cgo (1.13 regression) May 6, 2019
@bradfitz

This comment has been minimized.

Copy link
Contributor Author

@bradfitz bradfitz commented May 6, 2019

Is reproducing this just cross compiling from macos to linux?

You can reproduce this on a Mac, without cross compiling. Just build with CGO_ENABLED=0.

@bradfitz

This comment has been minimized.

Copy link
Contributor Author

@bradfitz bradfitz commented May 15, 2019

Easy way to repro on a mac:

Works:

barloga5k:net $ CGO_ENABLED=1 go test -v -short -run=TestGoLookupIP net 
=== RUN   TestGoLookupIPWithResolverConfig
--- PASS: TestGoLookupIPWithResolverConfig (0.03s)
=== RUN   TestGoLookupIPOrderFallbackToFile
--- PASS: TestGoLookupIPOrderFallbackToFile (0.00s)
PASS
ok  	net

Hangs:

barloga5k:net $ CGO_ENABLED=0 go test -v -short -run=TestGoLookupIP net 
=== RUN   TestGoLookupIPWithResolverConfig
--- PASS: TestGoLookupIPWithResolverConfig (0.03s)
=== RUN   TestGoLookupIPOrderFallbackToFile
--- PASS: TestGoLookupIPOrderFallbackToFile (0.00s)
=== RUN   TestGoLookupIP

(hang)

^\SIGQUIT: quit
PC=0x7fff7b71c1b2 m=0 sigcode=0

goroutine 0 [idle]:
runtime.pthread_cond_wait(0x149b788, 0x149b748, 0x7ffe00000000)
	/Users/bradfitz/go/src/runtime/sys_darwin.go:368 +0x39
runtime.semasleep(0xffffffffffffffff, 0x104afac)
	/Users/bradfitz/go/src/runtime/os_darwin.go:63 +0x85
runtime.notesleep(0x149b548)
	/Users/bradfitz/go/src/runtime/lock_sema.go:173 +0xe0
runtime.stopm()
	/Users/bradfitz/go/src/runtime/proc.go:1928 +0xc0
runtime.findrunnable(0xc00001e000, 0x0)
	/Users/bradfitz/go/src/runtime/proc.go:2391 +0x53f
runtime.schedule()
	/Users/bradfitz/go/src/runtime/proc.go:2524 +0x2be
runtime.park_m(0xc000062600)
	/Users/bradfitz/go/src/runtime/proc.go:2610 +0x9d
runtime.mcall(0x105a796)
	/Users/bradfitz/go/src/runtime/asm_amd64.s:318 +0x5b

goroutine 1 [chan receive]:
testing.(*T).Run(0xc000116200, 0x12ad59e, 0xe, 0x12bcfc0, 0x10a9c01)
	/Users/bradfitz/go/src/testing/testing.go:964 +0x377
testing.runTests.func1(0xc000116000)
	/Users/bradfitz/go/src/testing/testing.go:1210 +0x78
testing.tRunner(0xc000116000, 0xc00007ae08)
	/Users/bradfitz/go/src/testing/testing.go:912 +0xbf
testing.runTests(0xc00000c040, 0x14972c0, 0xdf, 0xdf, 0x0)
	/Users/bradfitz/go/src/testing/testing.go:1208 +0x2a7
testing.(*M).Run(0xc0000de000, 0x0)
	/Users/bradfitz/go/src/testing/testing.go:1125 +0x160
net.TestMain(0xc0000de000)
	/Users/bradfitz/go/src/net/main_test.go:52 +0x39
main.main()
	_testmain.go:554 +0x135

goroutine 73 [running]:
	goroutine running on other thread; stack unavailable
created by testing.(*T).Run
	/Users/bradfitz/go/src/testing/testing.go:963 +0x350

rax    0x104
rbx    0x2
rcx    0x7ffeefbff548
rdx    0xb00
rdi    0x149b788
rsi    0x290100002a00
rbp    0x7ffeefbff5d0
rsp    0x7ffeefbff548
r8     0x0
r9     0xa0
r10    0x0
r11    0x202
r12    0x149b788
r13    0x16
r14    0x290100002a00
r15    0x882f5c0
rip    0x7fff7b71c1b2
rflags 0x203
cs     0x7
fs     0x0
gs     0x0
FAIL	net	13.181s
FAIL

The hang is somewhere inside res_search.

The path that hangs is:

netgo_unix_test.go => cgo_darwin_stub.go : func cgoLookupIP (misleading name, supposed to only conditionally use cgo) => func resolverGetResources => res_search (defined in runtime).

Note that the libSystem call to res_init does succeed, at least, so it's not some libcCall cgo_import_dynamic problem in general.

@randall77, any ideas?

@groob

This comment has been minimized.

Copy link
Contributor

@groob groob commented May 15, 2019

When I run the test above and look at logs I seem to run into error messages related to /var/db/DetachedSignatures, which is usually a code signing check. Similarly MacOS error: -67062.

I wonder if the issue occurs when the binary is code signed.

@bradfitz

This comment has been minimized.

Copy link
Contributor Author

@bradfitz bradfitz commented May 15, 2019

I didn't sign any binary. I'm not aware of any part of the Go build process that automatically signs binaries, either?

@groob

This comment has been minimized.

Copy link
Contributor

@groob groob commented May 15, 2019

I'm suggesting that a signed binary might succeed while an unsigned one will get blocked by the kernel. Sorry for the confusion.

I'll test my theory.

@groob

This comment has been minimized.

Copy link
Contributor

@groob groob commented May 15, 2019

Can some share a reproducible func main example? I can reproduce by running the test, but I've tried multiple combinations of CGO_ENABLED and GODEBUG=netdns and I can't get the issue to show up that way.

@groob

This comment has been minimized.

Copy link
Contributor

@groob groob commented May 15, 2019

(the code signing logs were not an issue)

I modified runtime/lookup_darwin.go to add a println and I can see that I'm actually getting to the res_search function, but the function returns successfully an my binary works fine.

@groob

This comment has been minimized.

Copy link
Contributor

@groob groob commented May 15, 2019

Looking at the build tags.
The failing test is set to have !cgo netgo and darwin,

// +build !cgo netgo
// +build darwin dragonfly freebsd linux netbsd openbsd solaris

but the stub which is getting invoked is the one with

// +build !netgo,!cgo
// +build darwin

which was added as part of f6b42a5#diff-612b16c746d269bd07f162f3fd6ea47eR6

The test is still expecting to reach cgo_stub.go which now has !darwin

This "fixes" the issue by making the tags match up:

diff --git a/src/net/netgo_unix_test.go b/src/net/netgo_unix_test.go
index c672d3e8eb..3cd85d2ccd 100644
--- a/src/net/netgo_unix_test.go
+++ b/src/net/netgo_unix_test.go
@@ -3,7 +3,7 @@
 // license that can be found in the LICENSE file.

 // +build !cgo netgo
-// +build darwin dragonfly freebsd linux netbsd openbsd solaris
+// +build !darwin dragonfly freebsd linux netbsd openbsd solaris

 package net

I'm not sure what the correct solution is, but it doesn't seem like this test should pass on darwin anymore, since it's expecting the placeholder, not an actual function call.

	_, err, ok := cgoLookupIP(ctx, "ip", host)
	if ok {
		t.Errorf("cgoLookupIP must be a placeholder")
@grantseltzer

This comment has been minimized.

Copy link
Contributor

@grantseltzer grantseltzer commented May 21, 2019

If we run the test on darwin with:

CGO_ENABLED=0 go test -v -short -run=TestGoLookupIP net

The cgoLookupIP in cgo_darwin_stub.go is being built which is correct.

@groob in addition to the build tag change, that test should also run if on darwin AND cgo is disabled AND netgo is enabled.

Regardless it doesn't explain why res_search is hanging instead of causing the test to fail. Finally have time to investigate this now, sorry for being MIA.

@grantseltzer

This comment has been minimized.

Copy link
Contributor

@grantseltzer grantseltzer commented May 22, 2019

It's strange that a regular go program with the same calls as the test doesn't hang. Is anyone familiar with how go test binaries link/load differently from normal executables (if at all)?

@grantseltzer

This comment has been minimized.

Copy link
Contributor

@grantseltzer grantseltzer commented May 22, 2019

Stepping through with delve shows that I can't get past PUSHQ BP in res_search_trampoline

Screen Shot 2019-05-22 at 3 23 14 PM

Only appears to be one thread in use as well.

dtruss shows a loop of __semwait_signal and then psynch_cvwait

@groob

This comment has been minimized.

Copy link
Contributor

@groob groob commented Jun 3, 2019

Looks like this crashes on 10.15

=== RUN   TestGoLookupIP
SIGILL: illegal instruction
PC=0x7fff716a9758 m=3 sigcode=1

goroutine 0 [idle]:
net.res_search(0xc00009f380, 0xff00000001, 0xc000322200, 0xc000000200, 0x0)
	/Users/vvrantchan/izba/golang/src/runtime/lookup_darwin.go:29 +0x91 fp=0xc00007c8f8 sp=0xc00007c8a8 pc=0x100ab51
net.resolverGetResources(0x1303980, 0xc000092000, 0x12a8049, 0x9, 0x1000000ff, 0x102c80f, 0x8, 0xc0000a0800, 0xc00003ef30, 0xc00003ef18)
	/Users/vvrantchan/izba/golang/src/net/cgo_darwin_stub.go:108 +0xf0 fp=0xc00007ceb0 sp=0xc00007c8f8 pc=0x11380f0
net.cgoLookupIP(0x1303980, 0xc000092000, 0x12a6997, 0x2, 0x12a8049, 0x9, 0xc00003ef70, 0x104ce08, 0x3d86b25ff86, 0x37954b38, ...)
	/Users/vvrantchan/izba/golang/src/net/cgo_darwin_stub.go:50 +0x1be fp=0xc00007cf10 sp=0xc00007ceb0 pc=0x1137b7e
net.TestGoLookupIP(0xc00034c000)
	/Users/vvrantchan/izba/golang/src/net/netgo_unix_test.go:19 +0xa7 fp=0xc00007cfa8 sp=0xc00007cf10 pc=0x11c6bf7
testing.tRunner(0xc00034c000, 0x12b9578)
	/Users/vvrantchan/izba/golang/src/testing/testing.go:909 +0xbf fp=0xc00007cfd0 sp=0xc00007cfa8 pc=0x10d6a7f
runtime.goexit()
	/Users/vvrantchan/izba/golang/src/runtime/asm_amd64.s:1357 +0x1 fp=0xc00007cfd8 sp=0xc00007cfd0 pc=0x105c861
created by testing.(*T).Run
	/Users/vvrantchan/izba/golang/src/testing/testing.go:960 +0x350
@randall77

This comment has been minimized.

Copy link
Contributor

@randall77 randall77 commented Jun 3, 2019

That was quick!

@groob

This comment has been minimized.

Copy link
Contributor

@groob groob commented Jun 3, 2019

Still only affects test binaries.

@grantseltzer

This comment has been minimized.

Copy link
Contributor

@grantseltzer grantseltzer commented Jun 4, 2019

Been sidetracked on this, but I just tried a test with a test and regular binary each just calling net.LookupIP (with proper build flags/cgo disabled). _res_search, the linked routine, isn't in the test binary but it is in the regular one. This pretty much confirms the test binaries are not linking properly.

The question for me right now is how can we get logs/debug the linker so we can figure out what the difference is between go test -c and go build.

Anyone have particularly intimate knowledge of go test ?

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Jun 4, 2019

You can see exactly what the go tool is doing by using the -x option.

@rsc

This comment has been minimized.

Copy link
Contributor

@rsc rsc commented Jun 5, 2019

@groob should the 10.15 crash be a separate bug?

@groob

This comment has been minimized.

Copy link
Contributor

@groob groob commented Jun 5, 2019

@rsc possibly, it's still the net.res_search call so I was making a note in here.

10.15 is still a very early beta, I'll keep this in mind and file bugs against Go as we get closer in the test cycle for macOS and file bugs then.

@groob

This comment has been minimized.

Copy link
Contributor

@groob groob commented Jun 5, 2019

When running a go build with -x I see that the go test output has the following output:

golang/pkg/tool/darwin_amd64/link -o $WORK/b001/net.test -importcfg $WORK/b001/importcfg.link -s -w -buildmode=exe

while go build is missing the -s -w arguments.

golang/pkg/tool/darwin_amd64/link -o $WORK/b001/exe/a.out -importcfg $WORK/b001/importcfg.link -buildmode=exe 
@gopherbot

This comment has been minimized.

Copy link

@gopherbot gopherbot commented Jun 5, 2019

Change https://golang.org/cl/180779 mentions this issue: runtime: use default system stack size, not 64 kB, on non-cgo macOS

@gopherbot

This comment has been minimized.

Copy link

@gopherbot gopherbot commented Jun 5, 2019

Change https://golang.org/cl/180781 mentions this issue: net: pass NUL-terminated host name to C res_search call on macOS

@rsc

This comment has been minimized.

Copy link
Contributor

@rsc rsc commented Jun 5, 2019

Using delve to debug C code is probably not a good strategy. (It's a Go debugger, not a C debugger.)

$ lldb net.test
(lldb) target create "net.test"
Current executable set to 'net.test' (x86_64).
(lldb) run -test.v -test.short -test.run=GoLookupIP
Process 71117 launched: '/Users/rsc/go/src/net/net.test' (x86_64)
Process 71117 stopped
* thread #2, stop reason = EXC_BAD_ACCESS (code=2, address=0x70000be5fd70)
    frame #0: 0x00007fff6eaae081 libsystem_info.dylib`mdns_item_call + 33
libsystem_info.dylib`mdns_item_call:
->  0x7fff6eaae081 <+33>: movq   %rdi, -0x10020(%rbp)
    0x7fff6eaae088 <+40>: movl   %esi, -0x10024(%rbp)
    0x7fff6eaae08e <+46>: movq   %rdx, -0x10030(%rbp)
    0x7fff6eaae095 <+53>: movq   %rcx, -0x10038(%rbp)
Target 0: (net.test) stopped.
(lldb) bt
* thread #2, stop reason = EXC_BAD_ACCESS (code=2, address=0x70000be5fd70)
  * frame #0: 0x00007fff6eaae081 libsystem_info.dylib`mdns_item_call + 33
    frame #1: 0x00007fff6eab8b79 libsystem_info.dylib`_mdns_query + 169
    frame #2: 0x00007fff6eab8d96 libsystem_info.dylib`res_search + 54
    frame #3: 0x000000000405ea1d net.test`runtime.res_search_trampoline + 29
    frame #4: 0x000000000405d3e0 net.test`runtime.asmcgocall + 112
    frame #5: 0x000000000400bd94 net.test`net.res_search + 180
    frame #6: 0x0000000004139418 net.test`net.resolverGetResources + 376
    frame #7: 0x0000000004138e21 net.test`net.cgoLookupIP + 481
    frame #8: 0x00000000041c7fdf net.test`net.TestGoLookupIP + 207
    ...
(lldb) disas
libsystem_info.dylib`mdns_item_call:
    0x7fff6eaae060 <+0>:   pushq  %rbp
    0x7fff6eaae061 <+1>:   movq   %rsp, %rbp
    0x7fff6eaae064 <+4>:   subq   $0x10180, %rsp            ; imm = 0x10180 
    0x7fff6eaae06b <+11>:  movq   0x18(%rbp), %rax
    0x7fff6eaae06f <+15>:  movl   0x10(%rbp), %r10d
    0x7fff6eaae073 <+19>:  movq   0x3687af9e(%rip), %r11    ; (void *)0x00007fffa531d070: __stack_chk_guard
    0x7fff6eaae07a <+26>:  movq   (%r11), %r11
    0x7fff6eaae07d <+29>:  movq   %r11, -0x8(%rbp)
->  0x7fff6eaae081 <+33>:  movq   %rdi, -0x10020(%rbp)
    0x7fff6eaae088 <+40>:  movl   %esi, -0x10024(%rbp)
    0x7fff6eaae08e <+46>:  movq   %rdx, -0x10030(%rbp)
    0x7fff6eaae095 <+53>:  movq   %rcx, -0x10038(%rbp)

This makes pretty clear that the function has pushed a 64 kB stack frame onto the stack. So how big is the stack anyway?

	// Set the stack size we want to use.  64KB for now.
	// TODO: just use OS default size?
	const stackSize = 1 << 16
	if pthread_attr_setstacksize(&attr, stackSize) != 0 {
		write(2, unsafe.Pointer(&failthreadcreate[0]), int32(len(failthreadcreate)))
		exit(1)
	}
	//mSysStatInc(&memstats.stacks_sys, stackSize) //TODO: do this?

That TODO is sounding pretty good. Sent CL 180779.

After that CL I get

--- FAIL: TestGoLookupIP (0.20s)
    netgo_unix_test.go:25: parsing/packing of this type isn't available yet
FAIL

That seems to be something different.

Also found that we were not passing a NUL-terminated string to res_search, but maybe it was accidentally NUL-terminated, because explicitly fixing that did not help. (CL 180781).

@groob

This comment has been minimized.

Copy link
Contributor

@groob groob commented Jun 5, 2019

I intermittently get netgo_unix_test.go:25: parsing/packing of this type isn't available yet while running this test even before the changes in the CL above.

@rsc

This comment has been minimized.

Copy link
Contributor

@rsc rsc commented Jun 5, 2019

@groob, the main thread was running with an 8 MB stack and all other threads were running with 64 kB. The intermittency was based on which stack you got. The 8 MB one would run to completion and return the error; the smaller ones would fault in C due to the stack overflow and then fault again in Go trying to understand the C fault.

@rsc

This comment has been minimized.

Copy link
Contributor

@rsc rsc commented Jun 5, 2019

I thought y'all just wanted me to make the test not hang. Actually passing is a higher bar. Will try. :-)

@zchee

This comment has been minimized.

Copy link
Contributor

@zchee zchee commented Jun 5, 2019

@rsc
FYI, RES_DEBUG='debug' CGO_ENABLED=0 go test -v ... will display the _mdns_search log.(but just a little)

@rsc

This comment has been minimized.

Copy link
Contributor

@rsc rsc commented Jun 5, 2019

OK, so at least part of the problem is that if you write a C program calling res_search, you actually get symbol res_9_search from /usr/lib/libresolv.dylib, while we are using res_search from /usr/lib/system/libsystem_info.dylib.

Does anyone know why we are using the libsystem_info version?

@rsc

This comment has been minimized.

Copy link
Contributor

@rsc rsc commented Jun 5, 2019

@grantseltzer, I notice CL 166297 has no tests. What did you use to test it?

Also, the CL claimed to be fixing #12524, which was a very old bug asking for support for the /etc/resolver directory. My Mac has no /etc/resolver directory. Also, the code that looks in /etc/resolver appears to be in libresolv, but as noted in my previous comment Go is using libsystem_info. I have not chased down whether libsystem_info ever ends up back in libresolv, but so far I've seen no evidence that it does. So it may be that the new code does not work with /etc/resolver. I can't tell.

@randall77

This comment has been minimized.

Copy link
Contributor

@randall77 randall77 commented Jun 5, 2019

I don't think the res_search in /usr/lib/system/libsystem_info.dylib ever makes it to res_9_search in /usr/lib/libresolv.dylib. The following files all just reference each other, there's no path to libresolv.dylib from our root (libSystem.B.dylib):

	/usr/lib/libobjc.A.dylib
	/usr/lib/libc++abi.dylib
	/usr/lib/libc++.1.dylib
	/usr/lib/libSystem.B.dylib
        /usr/lib/system/*.dylib
@randall77

This comment has been minimized.

Copy link
Contributor

@randall77 randall77 commented Jun 5, 2019

It wouldn't be hard to try adding libresolv.dylib to our imports and renaming our resolver from res_search to res_9_search.

@gopherbot

This comment has been minimized.

Copy link

@gopherbot gopherbot commented Jun 5, 2019

Change https://golang.org/cl/180838 mentions this issue: net: skip questions before parsing answers

@rsc

This comment has been minimized.

Copy link
Contributor

@rsc rsc commented Jun 5, 2019

I can't figure out what CL 166297 intended to fix or why it was important to improve anything in non-cgo-only mode. I have a bunch of fixes for it that make it resolve names successfully, which I will send out, but then I will send a CL deleting it entirely. If at some later point someone wants to bring it back, that's fine, provided they explain why.

@gopherbot

This comment has been minimized.

Copy link

@gopherbot gopherbot commented Jun 6, 2019

Change https://golang.org/cl/180843 mentions this issue: net: remove non-cgo macOS resolver code

@gopherbot

This comment has been minimized.

Copy link

@gopherbot gopherbot commented Jun 6, 2019

Change https://golang.org/cl/180842 mentions this issue: net: fix non-cgo macOS resolver code

@grantseltzer

This comment has been minimized.

Copy link
Contributor

@grantseltzer grantseltzer commented Jun 6, 2019

@rsc the idea is so that darwin dns logic can be used instead of the netgo library for when cgo is disabled. There's a lot of overlap but the /etc/resolver files is the particular example.

The problem I see is that when distributing binaries to darwin hosts it's likely most common to disable cgo. This means that before CL 166297 tools like the ones by hashicorp would not support /etc/resolver files. At my company and those of others I talked to in person and on slack this was an issue.

In terms of testing, shamefully I was just doing it manually but will gladly work to write up tests. I'm taking a look at your CLs now, thanks for the help and feedback!

@rsc

This comment has been minimized.

Copy link
Contributor

@rsc rsc commented Jun 6, 2019

@grantseltzer, I don't believe non-cgo builds of such tools are working at all today; literally all DNS lookups seem to fail, not just ones involving /etc/resolver. Even once Go-side bugs are fixed, libsystem_info's res_search fails at PTR and CNAME queries; it may also not be thread safe; and it appears not to pay any attention to /etc/resolver. Perhaps switching to libresolv will help; perhaps not. If you'd like to reintroduce a revised version of the code for Go 1.14, that's fine. For Go 1.13, though, we'll revert things to the way they were in Go 1.12. Thanks.

gopherbot pushed a commit that referenced this issue Jun 6, 2019
At least one libc call we make
(res_search, which calls _mdns_query and then mdns_item_call)
pushes a 64 kB stack frame onto the stack.
Then it faults on the guard page.

Use the default system stack size, under the assumption
that the C code being called is compatible with that stack size.

For #31705.

Change-Id: I1b0bfc2e54043c49f0709255988ef920ce30ee82
Reviewed-on: https://go-review.googlesource.com/c/go/+/180779
Run-TryBot: Russ Cox <rsc@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
@grantseltzer

This comment has been minimized.

Copy link
Contributor

@grantseltzer grantseltzer commented Jun 6, 2019

@rsc although your changes in 180842 make a lot of sense to me, i'm quite sure that they at least work enough to honor the /etc/resolver files. The thing I don't understand is that they're only working in non-test files. That behavior is worth exploring.

With that in mind, it is a lot of added complexity for a corner case. I'll investigate to see how else I can get this to work, even if you are removing the code for now. Thanks to you as well.

gopherbot pushed a commit that referenced this issue Jun 6, 2019
This code was added in April in CL 166297, for #12524.
This CL fixes the following problems in the code:

 - The test for failure in the assembly stubs checked for
   64-bit -1 instead of 32-bit -1 to decide to fetch errno.

 - These C routines (res_init and res_search) don't set errno anyway,
   so the Go code using errno to decide success is incorrect.
   (The routines set h_errno, which is a racy global variable
   that can't safely be consulted, storing values in a different
   error space.)

 - The Go call passed res_search a non-NUL-terminated name.

 - The C res_search rejects calls asking for TypeALL as opposed to
   more specific answers like TypeA/TypeAAAA/TypeCNAME,
   breaking cgoLookupHost in all cases and cgoLookupIP
   except with IP-version-specific networks.

 - The DNS response packet was parsed twice, once with msg.Unpack
   (discarded), and once with the lower-level dnsmessage.Parser.
   The Parser loop was missing a call to p.SkipAllQuestions, with the
   result that no DNS response packet would ever parse successfully.

 - The parsing of the DNS response answers, if reached, behaved as if
   that the AResource and AAAAResource record contained textual
   IP addresses, while in fact they contain binary ones. The calls to
   parseIPv4 and parseIPv6 therefore would always returns nil,
   so that no useful result would be returned from the resolver.

With these fixes, cgoLookupIP can correctly resolve google.com
and return both the A and AAAA addresses.

Even after fixing all these things, TestGoLookupIP still fails,
because it is testing that in non-cgo builds the cgo stubs
correctly report "I can't handle the lookup", and as written the
code intentionally violates that expectation.

This CL adds new direct tests of the pseudo-cgo routines.
The direct IP address lookups succeed, but the CNAME query
causes res_search to hang, and the PTR query fails unconditionally
(a trivial C program confirms these behaviors are due to res_search itself).

Traditionally, res_search is only intended for single-threaded use.
It is unclear whether this one is safe for use from multiple goroutines.
If you run net.test under lldb, that causes syslog messages to be
printed to standard error suggesting double-free bugs:

	2019-06-05 19:52:43.505246-0400 net.test[6256:6831076] dnssd_clientstub DNSServiceRefDeallocate called with invalid DNSServiceRef 0x5c000f0 FFFFFFFF DDDDDDDD
	2019-06-05 19:52:43.505274-0400 net.test[6256:6831076] dnssd_clientstub DNSServiceRefDeallocate called with invalid DNSServiceRef 0x5c000f0 FFFFFFFF DDDDDDDD
	2019-06-05 19:52:43.505303-0400 net.test[6256:6831076] dnssd_clientstub DNSServiceRefDeallocate called with invalid DNSServiceRef 0x5c000f0 FFFFFFFF DDDDDDDD
	2019-06-05 19:52:43.505329-0400 net.test[6256:6831076] dnssd_clientstub DNSServiceRefDeallocate called with invalid DNSServiceRef 0x5c000f0 FFFFFFFF DDDDDDDD

This res_search is from libsystem_info; a normal C program would
get res_search (#defined to res_9_search) from libresolv instead.
It is unclear what the relation between the two is.
Issue #12524 was about supporting the /etc/resolver directory tree,
but only libresolv contains code for that; libsystem_info does not.
So this code probably does not enable use of /etc/resolver.

In short:

 - Before this CL, the code clearly had never run successfully.
 - The code appears not to improve upon the usual non-cgo fallback.
 - The code carries with it no tests of improved behavior.
 - The code breaks existing tests.
 - Calling res_search does not work for PTR/CNAME queries,
   so the code breaks existing behavior, even after this CL.
 - It's unclear whether res_search is safe to call from multiple threads.
 - It's unclear whether res_search is used by any other macOS programs.

Given this, it probably makes sense to delete this code rather
than rejigger the test. This CL fixes the code first, so that there
is a working copy to bring back later if we find out that it really
is necessary.

For #31705.

Change-Id: Id2e11e8ade43098b0f90dd4d16a62ca86a7a244a
Reviewed-on: https://go-review.googlesource.com/c/go/+/180842
Run-TryBot: Russ Cox <rsc@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
@gopherbot gopherbot closed this in d32ec38 Jun 6, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
10 participants
You can’t perform that action at this time.