Description
go version
go version go1.18.1 linux/arm64
Does this issue reproduce with the latest release?
Yes, I tried Go 1.18 and Go 1.19.
What operating system and processor architecture are you using (go env)?
go env
GO111MODULE=""
GOARCH="arm64"
GOBIN=""
GOCACHE="/home/admin/.cache/go-build"
GOENV="/home/admin/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="arm64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/admin/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/admin/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/lib/go-1.18"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/lib/go-1.18/pkg/tool/linux_arm64"
GOVCS=""
GOVERSION="go1.18.1"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/admin/__mego/go.mod"
GOWORK=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build1438079104=/tmp/go-build -gno-record-gcc-switches"
What did you do?
I'm working with Go and Go-compiled binaries on an NXP S32G development board and observed a deadlock that makes the
programs hang forever. The problem seems to be related to how Go applications synchronize
their threads and use mutex locks. As a simple reproducer I wrote the following code:
main.go:

package main

func main() {
	mprint()
}

mprint.go:

package main

import "fmt"

func mprint() {
	list := []int{1, 2, 3}
	list = append(list, 10)
	for _, v := range list {
		fmt.Println(v)
	}
}
The actual program code is immaterial, as the issue appears when compiling with
go build mprint.go main.go
The compilation should finish within a second, but instead it is stuck forever. The process list shows:
ps ax -T
32240 32240 pts/1 Sl+ 0:06 go build mprint.go main.go
32240 32241 pts/1 Sl+ 0:00 go build mprint.go main.go
32240 32242 pts/1 Sl+ 0:00 go build mprint.go main.go
32240 32243 pts/1 Sl+ 0:00 go build mprint.go main.go
32240 32244 pts/1 Sl+ 0:00 go build mprint.go main.go
32240 32245 pts/1 Sl+ 0:00 go build mprint.go main.go
32240 32246 pts/1 Rl+ 0:06 go build mprint.go main.go
Checking with strace what is going on reveals the following:
sudo strace -p 32240
strace: Process 32240 attached
futex(0x95efc8, FUTEX_WAIT_PRIVATE, 0, NULL^Cstrace: Process 32240 detached
<detached ...>
The same applies to 32242, 32243, 32244, and 32245.
sudo strace -p 32241
strace: Process 32241 attached
restart_syscall(<... resuming interrupted io_setup ...>) = 0
epoll_pwait(4, [], 128, 0, NULL, 512266007) = 0
getpid() = 32240
tgkill(32240, 32246, SIGURG) = 0
nanosleep({tv_sec=0, tv_nsec=10000000}, NULL) = 0
epoll_pwait(4, [], 128, 0, NULL, 523616458) = 0
getpid() = 32240
tgkill(32240, 32246, SIGURG) = 0
nanosleep({tv_sec=0, tv_nsec=10000000}, NULL) = 0
epoll_pwait(4, [], 128, 0, NULL, 534857704) = 0
getpid() = 32240
tgkill(32240, 32246, SIGURG) = 0
nanosleep({tv_sec=0, tv_nsec=10000000}, NULL) = 0
epoll_pwait(4, [], 128, 0, NULL, 545530928) = 0
getpid()
... this never ends and causes the process to consume 100% CPU.
sudo strace -p 32246
rt_sigreturn({mask=[]}) = 274878138880
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=32240, si_uid=1000} ---
rt_sigreturn({mask=[]}) = 274878138880
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=32240, si_uid=1000} ---
rt_sigreturn({mask=[]}) = 274878138880
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=32240, si_uid=1000} ---
rt_sigreturn({mask=[]}) = 274878138880
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=32240, si_uid=1000} ---
rt_sigreturn({mask=[]}) = 274878138880
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=32240, si_uid=1000} ---
rt_sigreturn({mask=[]}) = 274878138880
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=32240, si_uid=1000} ---
rt_sigreturn({mask=[]}) = 274878138880
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=32240, si_uid=1000} ---
rt_sigreturn({mask=[]}) = 274881856320
futex(0x9605f0, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x9604f8, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x4000104d48, FUTEX_WAIT_PRIVATE, 0, NULL^Cstrace: Process 32246 detached
<detached ...>
So it is stuck in epoll_pwait() forever. Neither the descriptor nor the associated SIGURG signals seem
to get it out of this polling loop. I could not find a clue, and after some time I decided to write
this report here.
Unfortunately this seems to be non-deterministic behavior, because I cannot reproduce it to this
extent on other ARM boards (e.g. a Raspberry Pi) or on other architectures (e.g. x86_64). However, I was able
to trigger the deadlock on x86_64 by running Go-compiled code under ltrace, e.g. ltrace podman --version.
So I think something is causing a race condition, and it might also be related to the board
hardware, such that I see it on this board every time but only occasionally elsewhere. I know this makes
it an awkward issue for you, but I really hope you can give me some ideas to try, patches for testing,
or simply use me as a tester of ideas on the board.
I did several tests with other programming languages, e.g. C (simple mutex testing) and Rust (which also has
a thread-safe programming model), and they did not expose this sort of issue. Thus I believe it's not
generally broken board hardware but some sort of unfortunate circumstances.
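For reference, a Go analogue of the simple mutex test I ran in C would look roughly like the sketch below (worker and iteration counts are arbitrary, just enough to keep some goroutines contending on a lock); I'm happy to run variations of this on the board if that helps narrow things down:

package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	var (
		mu      sync.Mutex
		counter int
		wg      sync.WaitGroup
	)
	// Spawn more goroutines than CPUs so the scheduler has to preempt.
	workers := runtime.NumCPU() * 4
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000000; j++ {
				mu.Lock()
				counter++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	fmt.Println("counter:", counter)
}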
Another fun fact: if the processes are stuck, they can be unstuck by ssh-ing into the machine.
That sounds strange, yes, but my assumption is that this behavior is related
to ssh, which avoids orphaned processes by using TCP out-of-band data, which also generates a SIGURG.
In Go, SIGURG also seems to be used for (goroutine) preemption. It feels like there is a race condition that
is triggered practically every time on my ARM board but only occasionally on other hardware.
So if I ssh in, a stuck Go program can be made to continue until the next deadlock ;)
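If it helps to test that hypothesis without logging in, an extra SIGURG can also be sent to the stuck process by hand, e.g. with kill -URG <pid>, or with a tiny Go helper like the sketch below (an untested assumption on my side; I don't know yet whether a single SIGURG is really what unsticks it):

package main

import (
	"fmt"
	"os"
	"strconv"
	"syscall"
)

// Send one SIGURG to the given process, imitating what the ssh login
// appears to do as a side effect. Usage: sigurg <pid>
func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: sigurg <pid>")
		os.Exit(1)
	}
	pid, err := strconv.Atoi(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, "invalid pid:", err)
		os.Exit(1)
	}
	if err := syscall.Kill(pid, syscall.SIGURG); err != nil {
		fmt.Fprintln(os.Stderr, "kill failed:", err)
		os.Exit(1)
	}
	fmt.Println("sent SIGURG to", pid)
}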
What did you expect to see?
I expected the Go compiler not to hang in a deadlock. I also did not expect compiled Go programs
to run into deadlocks with the same behavior as soon as threads are used.
What did you see instead?
Go itself and Go-compiled binaries get stuck in a deadlock as described.
I'm running out of ideas about what else I could do and kindly ask the experts for help.
If desired, I can provide temporary ssh access to the ARM board I'm using, which allows
reproducing the problem as described.
Thanks a ton
Cheers,
Marcus