Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: intermittent os/exec.Command.Start() Hang on Darwin in Presence of "plugin" Package #38824

Open
jcburley opened this issue May 3, 2020 · 4 comments

Comments

@jcburley
Copy link
Contributor

@jcburley jcburley commented May 3, 2020

What version of Go are you using (go version)?

$ go version
go version go1.14.2 darwin/amd64
$

Does this issue reproduce with the latest release?

Yes. Also with go version devel +be08e10b3b Fri May 1 21:57:29 2020 +0000 darwin/amd6.

It does not reproduce with Go 1.13.10.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN="/Users/craig/go/bin"
GOCACHE="/Users/craig/Library/Caches/go-build"
GOENV="/Users/craig/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/craig/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/craig/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/Users/craig/github/golang/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/Users/craig/github/golang/go/pkg/tool/darwin_amd64"
GCCGO="gccgo"
GOAMD64="alignedjumps"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/m_/0sxkm11s5wd625ypnp0g0ffm0000gn/T/go-build624939018=/tmp/go-build -gno-record-gcc-switches -fno-common"
$

Note that, in a different shell, GOROOT=/usr/local/go, corresponding to Go 1.14.2, where the same problem occurs (though, anecdotally, less frequently) compared to the recent commit, identified above, on master.

What did you do?

Built a simple Go program, ran it repeatedly via:

$ while true; do ./hangme 100; date; done

What did you expect to see?

Repeated runs ad infinitum, no hang.

What did you see instead?

Occasionally (intermittently), the program hangs. SIGQUIT stack dump shows it usually hangs in the Command.Start() receiver in os/exec, which is not supposed to hang at all. (I've seen somewhat-different stack traces across different programs and built with different versions of Golang, but they all hang at that call or soon after.)

Such hangs have been observed (by me) only in programs that import plugin (even though they don't use it at all; only its initialization code should run). Comment-out the _ "plugin" import line in the above program, rebuild (via go build), and rerun the ever-looping command, and it runs until manually stopped.

@jcburley
Copy link
Contributor Author

@jcburley jcburley commented May 3, 2020

Background: I maintain a fork/branch of Joker (a Clojure interpreter/clone written in Go) that "autowraps" much of the Go standard library for easy access by Joker code.

A couple of weeks ago, I first noticed that the (small) test suite occasionally hung. Digging further, I posted this to get some feedback:

https://stackoverflow.com/questions/61342000/why-might-exec-command-start-hang-on-darwin

As suggested there, I ran git bisect run on the Go source code to find "the" commit that introduced or triggered the problem. The culprit simply turned on "new" timer code (turning off old code), which was quite suggestive.

Later, I boiled the pertinent Joker code down to a small test program and was able to reproduce that. The first such version pulled in all the Go standard library that my fork of Joker did. I then performed manual bisections of that set and narrowed them down to just the plugin package: when included (despite not being referenced), the sample program would occasionally hang; without it, it wouldn't.

This issue is being tracked (on my end) via jcburley/joker#19. It's Closed due to a workaround being implemented such that plugin isn't included in the list of imported (and wrapped) packages when Joker is built for OS X (Darwin).

A slightly "fuller" version of the sample program, along with instructions on another approach to run it, is here:

https://github.com/jcburley/hangme/blob/master/README.md

@ianlancetaylor ianlancetaylor changed the title Intermittent os/exec.Command.Start() Hang on Darwin in Presence of "plugin" Package runtime: intermittent os/exec.Command.Start() Hang on Darwin in Presence of "plugin" Package May 4, 2020
@ianlancetaylor ianlancetaylor added this to the Backlog milestone May 4, 2020
@networkimprov
Copy link

@networkimprov networkimprov commented May 4, 2020

cc @odeke-em re MacOS

@jirfag
Copy link

@jirfag jirfag commented May 6, 2020

Hi, the same thing in golangci-lint: can reproduce only on Mac OS (Linux is ok, Windows - didn't test properly), only on go1.14 (1.13 works ok). Also, reproduces not every time: I ran in loop to reproduce.
When reproduced I see in ps two identical processes win my binary, one is the child of another.
And I also import plugin and can't reproduce the problem when comment out this importing.

backtrace from dlv
(dlv) bt                                                                                                                                            
 0  0x00007fff6bcef062 in ???
    at ?:-1
 1  0x00007fff6bdab937 in ???
    at ?:-1
 2  0x00007fff6bbaa786 in ???
    at ?:-1
 3  0x000000000b163395 in ???
    at ?:-1
 4  0x000000000b143acb in ???
    at ?:-1
 5  0x00007fff6bbaa692 in ???
    at ?:-1
 6  0x00000000053c4000 in ???
    at ?:-1
 7  0x0000000004067da0 in runtime.asmcgocall
    at /usr/local/go/src/runtime/asm_amd64.s:655
 8  0x0000000000000000 in ???
    at ?:-1
 9  0x0000000004055a8c in syscall.rawSyscall
    at /usr/local/go/src/runtime/sys_darwin.go:107
10  0x00000000040b1d1f in syscall.forkAndExecInChild
    at /usr/local/go/src/syscall/exec_darwin.go:157
11  0x00000000040b284b in syscall.forkExec
    at /usr/local/go/src/syscall/exec_unix.go:201
12  0x00000000040d5ef0 in syscall.StartProcess
    at /usr/local/go/src/syscall/exec_unix.go:248
13  0x00000000040d5ef0 in os.startProcess
    at /usr/local/go/src/os/exec_posix.go:53
14  0x00000000040d591c in os.StartProcess
    at /usr/local/go/src/os/exec.go:102
15  0x000000000429b5f6 in os/exec.(*Cmd).Start
    at /usr/local/go/src/os/exec/exec.go:422
16  0x0000000004512610 in golang.org/x/tools/internal/gocommand.runCmdContext
    at /Users/denis/go/pkg/mod/golang.org/x/tools@v0.0.0-20200502202811-ed308ab3e770/internal/gocommand/invoke.go:153
17  0x0000000004511d29 in golang.org/x/tools/internal/gocommand.(*Invocation).RunPiped
    at /Users/denis/go/pkg/mod/golang.org/x/tools@v0.0.0-20200502202811-ed308ab3e770/internal/gocommand/invoke.go:147
18  0x000000000451178e in golang.org/x/tools/internal/gocommand.(*Invocation).runRaw
go
$ go version
go version go1.14.2 darwin/amd64
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/denis/Library/Caches/go-build"
GOENV="/Users/denis/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/denis/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/Users/denis/golangci-lint/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/nl/54f5x38s4m53mkzzj92zsj340000gn/T/go-build180675235=/tmp/go-build -gno-record-gcc-switches -fno-common"
@jahio
Copy link

@jahio jahio commented May 20, 2020

I took a quick look at this today and can at the very least confirm that the bug still exists with 1.14.3:

Screen Shot 2020-05-20 at 1 06 01 PM

(Here's my very slightly modified code, where all I did was add some more STDOUT output to figure out what's going on.)

As we can see here, I ran it the same way as James and indeed it locked after only 3 attempts. At the point this screenshot was taken, the same program had been hanging for 7 minutes, 35 seconds.

So I looked at Activity Monitor.app to see if I can get some more information:

Screen Shot 2020-05-20 at 1 01 28 PM

Here's the text of that stack trace if you can't view the image for some reason:

[...snip...]
Call graph:
    973 Thread_4709590   DispatchQueue_1: com.apple.main-thread  (serial)
      973 runtime.asmcgocall  (in hangme) + 112  [0x405f8c0]
        973 ???  (in hangme)  load address 0x4000000 + 0x1de000  [0x41de000]
          973 dyld_stub_binder  (in libdyld.dylib) + 282  [0x7fff6e279692]
            973 dyld::fastBindLazySymbol(ImageLoader**, unsigned long)  (in dyld) + 86  [0x10656acb]
              973 ImageLoaderMachOCompressed::doBindFastLazySymbol(unsigned int, ImageLoader::LinkContext const&, void (*)(), void (*)())  (in dyld) + 53  [0x10676395]
                973 dyldGlobalLockAcquire()  (in libdyld.dylib) + 16  [0x7fff6e279786]
                  973 _pthread_mutex_firstfit_lock_slow  (in libsystem_pthread.dylib) + 222  [0x7fff6e47a937]
                    973 _pthread_mutex_firstfit_lock_wait  (in libsystem_pthread.dylib) + 83  [0x7fff6e47c917]
                      973 __psynch_mutexwait  (in libsystem_kernel.dylib) + 10  [0x7fff6e3be062]
[...snip...]

That last line caught my eye. Stuck in a mutex wait in the kernel? I went for a dtruss (like strace but for BSD/MacOS) on the pid for that process:

$ sudo dtruss -p 46759
Password:
dtrace: system integrity protection is on, some features will not be available

SYSCALL(args) 		 = return

It's just flat hung on the last line, no progress beyond that point at all.

My theory at this point is that the process isn't returning because one of the executions of ./hangme 0 is being held up by the kernel, which is waiting on a mutex to be released on some resource. This, to me, smells of a bug in MacOS, not necessarily in Go. I'm not sure why it worked with previous versions and not this version, but clearly something that plugin is doing is causing a hang at the kernel level (or at least appears to be) by having some resource tied up.

This is about the depth of my ability to debug at this point (I'm always learning, in fact that's why I'm looking at this issue!). I don't know how much help it'll be, but I certainly hope this aids in tracking things down.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
5 participants
You can’t perform that action at this time.