
runtime: go stuck in a dead lock between threads epoll_pwait/futex(..., FUTEX_WAKE_PRIVATE, 1) on arm64 #55120

Closed as not planned

Description

@schaefi
go version
go version go1.18.1 linux/arm64

Does this issue reproduce with the latest release?

Yes, I tried Go 1.18 and Go 1.19.

What operating system and processor architecture are you using (go env)?

go env

GO111MODULE=""
GOARCH="arm64"
GOBIN=""
GOCACHE="/home/admin/.cache/go-build"
GOENV="/home/admin/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="arm64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/admin/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/admin/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/lib/go-1.18"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/lib/go-1.18/pkg/tool/linux_arm64"
GOVCS=""
GOVERSION="go1.18.1"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/admin/__mego/go.mod"
GOWORK=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build1438079104=/tmp/go-build -gno-record-gcc-switches"

What did you do?

I'm working with Go and Go-compiled binaries on an NXP S32G devel board and observed a deadlock that
makes the programs hang forever. The problem seems to be related to how Go applications synchronize
across their threads and the mutex locks. For a simple reproducer I wrote the following code:

  • main.go

    package main
    
    func main() {
            mprint()
    }
    
  • mprint.go

    package main
    
    import "fmt"
    
    func mprint() {
            list := []int{1, 2, 3}
            list = append(list, 10)
            for _, v := range list {
                    fmt.Println(v)
            }
    }
    

The actual program code is immaterial; the issue appears when compiling it with

go build mprint.go main.go

The build should actually finish in about a second, but the compilation is stuck forever. The process list shows:

ps ax -T

32240   32240 pts/1    Sl+    0:06 go build mprint.go main.go
32240   32241 pts/1    Sl+    0:00 go build mprint.go main.go
32240   32242 pts/1    Sl+    0:00 go build mprint.go main.go
32240   32243 pts/1    Sl+    0:00 go build mprint.go main.go
32240   32244 pts/1    Sl+    0:00 go build mprint.go main.go
32240   32245 pts/1    Sl+    0:00 go build mprint.go main.go
32240   32246 pts/1    Rl+    0:06 go build mprint.go main.go

Checking what's going on with strace exposes the following:

sudo strace -p 32240

strace: Process 32240 attached
futex(0x95efc8, FUTEX_WAIT_PRIVATE, 0, NULL^Cstrace: Process 32240 detached
 <detached ...>

The same holds for 32242, 32243, 32244, and 32245.

sudo strace -p 32241

strace: Process 32241 attached
restart_syscall(<... resuming interrupted io_setup ...>) = 0
epoll_pwait(4, [], 128, 0, NULL, 512266007) = 0
getpid()                                = 32240
tgkill(32240, 32246, SIGURG)            = 0
nanosleep({tv_sec=0, tv_nsec=10000000}, NULL) = 0
epoll_pwait(4, [], 128, 0, NULL, 523616458) = 0
getpid()                                = 32240
tgkill(32240, 32246, SIGURG)            = 0
nanosleep({tv_sec=0, tv_nsec=10000000}, NULL) = 0
epoll_pwait(4, [], 128, 0, NULL, 534857704) = 0
getpid()                                = 32240
tgkill(32240, 32246, SIGURG)            = 0
nanosleep({tv_sec=0, tv_nsec=10000000}, NULL) = 0
epoll_pwait(4, [], 128, 0, NULL, 545530928) = 0
getpid()

... this never ends and keeps the process at 100% CPU usage.

sudo strace -p 32246

rt_sigreturn({mask=[]})                 = 274878138880
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=32240, si_uid=1000} ---
rt_sigreturn({mask=[]})                 = 274878138880
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=32240, si_uid=1000} ---
rt_sigreturn({mask=[]})                 = 274878138880
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=32240, si_uid=1000} ---
rt_sigreturn({mask=[]})                 = 274878138880
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=32240, si_uid=1000} ---
rt_sigreturn({mask=[]})                 = 274878138880
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=32240, si_uid=1000} ---
rt_sigreturn({mask=[]})                 = 274878138880
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=32240, si_uid=1000} ---
rt_sigreturn({mask=[]})                 = 274878138880
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=32240, si_uid=1000} ---
rt_sigreturn({mask=[]})                 = 274881856320
futex(0x9605f0, FUTEX_WAKE_PRIVATE, 1)  = 0
futex(0x9604f8, FUTEX_WAKE_PRIVATE, 1)  = 1
futex(0x4000104d48, FUTEX_WAIT_PRIVATE, 0, NULL^Cstrace: Process 32246 detached
 <detached ...>

So it sits in epoll_pwait() forever. Neither the descriptor nor the associated SIGURG signals seem
to get it out of this polling loop. I couldn't get a clue on this, and after some time I decided to
write this report here.
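
One idea I have not fully explored yet (so take it as an untested assumption on my side): since the go tool is itself a Go program, sending SIGQUIT to the stuck process should make the runtime dump all goroutine stacks, and running the build with GOTRACEBACK=crash should additionally include the runtime's own frames, which might show where it is really blocked:

GOTRACEBACK=crash go build mprint.go main.go

and then, from a second shell once it hangs:

kill -QUIT 32240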

Unfortunately this seems to be non-deterministic behavior, because I cannot reproduce it to this
extent on other ARM boards (e.g. Raspberry Pi) or on other architectures (e.g. x86_64). However, I was
able to get into the deadlock on x86_64 when running Go-compiled code under ltrace, e.g. ltrace podman --version

So I think something is causing a race condition, and it might also be related to the board
hardware, such that I can trigger it on this board every time but only occasionally elsewhere. I know this
makes it a bad issue for you, but I really hope you can give me some ideas to try, patches for testing,
or just use me as a tester of ideas on the board.

I did several tests with other programming languages, e.g. C (simple mutex testing) and Rust (which also
has a thread-safe programming model), and they did not expose this sort of issue. Thus I believe it's not
generally broken board hardware but some unfortunate set of circumstances.

Another fun fact: if the processes are stuck, they can be made unstuck by ssh'ing into the machine.
Sounds strange, yes, but my assumption is that this behavior could be related
to ssh, which avoids orphaned processes by using TCP out-of-band data, which also generates a SIGURG.
In Go, SIGURG also seems to be used for goroutine preemption. It feels like there is a race condition
that is triggered every time on my ARM board but only occasionally on other hardware.
So if I ssh in, a stuck Go program can be made to continue until the next deadlock condition ;)
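
To test this hypothesis further (these are just ideas I intend to try, not verified fixes): the ssh effect should be reproducible by sending a SIGURG to the stuck process by hand, and the documented GODEBUG setting asyncpreemptoff=1 should disable the SIGURG-based asynchronous preemption entirely:

kill -URG 32240

GODEBUG=asyncpreemptoff=1 go build mprint.go main.go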

What did you expect to see?

I expected the Go compiler not to hang in a deadlock. I was also surprised to see compiled Go programs
run into deadlocks with the same behavior as soon as threads are used.
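
For completeness, here is a minimal sketch of the kind of multi-threaded program I mean (illustrative only, not the exact code I ran): a few CPU-bound goroutines are enough to make the runtime spin up several OS threads and asynchronously preempt them via SIGURG, which is the path I suspect.

    package main
    
    // Illustrative stress test (not the exact program from this report):
    // several CPU-bound goroutines force the runtime onto multiple OS
    // threads and make it preempt them asynchronously via SIGURG.
    
    import (
            "fmt"
            "runtime"
            "sync"
    )
    
    // burn keeps a goroutine busy without any function calls the
    // runtime could use as cooperative preemption points.
    func burn(n int) int {
            s := 0
            for i := 0; i < n; i++ {
                    s += i % 7
            }
            return s
    }
    
    func main() {
            var wg sync.WaitGroup
            for g := 0; g < runtime.NumCPU()*2; g++ {
                    wg.Add(1)
                    go func(id int) {
                            defer wg.Done()
                            fmt.Println(id, burn(1_000_000_000))
                    }(g)
            }
            wg.Wait()
    }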

What did you see instead?

Go itself and Go-compiled binaries are stuck in a deadlock as described.
I'm running out of ideas about what else I could do and kindly ask for help from the experts.

If wanted, I can provide temporary ssh access to the ARM board I'm using, which allows reproducing
the problem as described.

Thanks a ton

Cheers,
Marcus

Metadata

    Labels

    FrozenDueToAge, NeedsInvestigation (someone must examine and confirm this is a valid issue and not a duplicate of an existing one), compiler/runtime (issues related to the Go compiler and/or runtime)
