runtime: forEachP not done and stopTheWorld failing on aix/ppc64 #30189

Open
Helflym opened this Issue Feb 12, 2019 · 2 comments

Helflym commented Feb 12, 2019

What version of Go are you using (go version)?

$ go version
1.12rc1 

What operating system and processor architecture are you using (go env)?

$ go env
GOARCH="ppc64"
GOBIN=""
GOCACHE="/var/go/.cache/go-build/"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="ppc64"
GOHOSTOS="aix"
GOOS="aix"
GOPATH="/opt/freeware/src/packages/BUILD/go-path"
GOPROXY=""
GORACE=""
GOROOT="/opt/freeware/src/packages/BUILD/go-root"
GOTMPDIR=""
GOTOOLDIR="/opt/freeware/src/packages/BUILD/go-root/pkg/tool/aix_ppc64"
GCCGO="/opt/freeware/bin/gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="0"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -maix64 -pthread -mcmodel=large -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build147889564=/tmp/go-build -gno-record-gcc-switches"

What did you do?

I'm trying to fix some issues within the aix/ppc64 runtime. forEachP and stopTheWorldWithSema seem to crash randomly on aix/ppc64, as you can see in the following logs:
https://build.golang.org/log/4ca5efa18918e49cd52403582986875f1fc0bde3
https://build.golang.org/log/4680c735ea77419991751ae77791e5fafff706a8
https://build.golang.org/log/0ce7939d9567eca343cb680bb447b6a4ebae5131
...

I haven't managed to reproduce these crashes with a simple test. Therefore, I have to run the full ./all.bash every time, hoping it will crash. Locally, it happens about once every 100 to 200 runs of ./all.bash, but the builder seems to crash more often.

I've added some traces locally to understand what is going wrong, but the output is quite surprising.

First, for forEachP, I've added traces before throw("forEachP: not done") and get this kind of output:

##### GOMAXPROCS=2 runtime -cpu=1,2,4 -quick
sched.safePointWait =  0
p.id =  0 ;p.status =  1 ; p.runSafePointFn =  0
p.m.id =  14 p.m.libcallsp =  0 p.m.blocked =  false
p.id =  1 ;p.status =  1 ; p.runSafePointFn =  0
p.m.id =  6 p.m.libcallsp =  4570326256 p.m.blocked =  false
SCHED 0ms: gomaxprocs=2 idleprocs=1 threads=13 spinningthreads=0 idlethreads=5 runqueue=0 gcwaiting=0 nmidlelocked=1 stopwait=0 sysmonwait=0
  P0: status=1 schedtick=436780 syscalltick=8455 m=14 runqsize=0 gfreecnt=13
  P1: status=0 schedtick=84378 syscalltick=4003 m=-1 runqsize=0 gfreecnt=6

As you can see, everything seems fine: sched.safePointWait is even 0 in the trace, although the throw is only reached when it is nonzero.
To get these traces, I take sched.lock right after the if sched.safePointWait != 0 check.
Is it possible that, without this lock, sched.safePointWait holds a stale value or is still being updated by another thread?
Is it even possible to access sched values safely without a lock?

The same seems to occur with stopTheWorld: if I add schedtrace(true) (which takes a lock internally) before throw(bad), I get:

        SCHED 0ms: gomaxprocs=16 idleprocs=0 threads=8 spinningthreads=0 idlethreads=6 runqueue=0 gcwaiting=1 nmidlelocked=0 stopwait=0 sysmonwait=0
         ...
         fatal error: stopTheWorld: not stopped (stopwait != 0)

Once again, stopwait is 0 in the trace, yet the throw fires because the check saw a nonzero value.

I still don't know why these bugs only occur on aix/ppc64.
Another guess is that the 100us wait is too short on AIX, and that the time needed to print the traces is enough for all remaining Ps to be updated. But I don't think AIX is that slow.

These bugs might also be related to another bug with acquirep() (see https://build.golang.org/log/60b0cd90bf7560bc4924bfa70e679be9ace58bbd). I haven't found anything relevant on this bug, except that _g_.m.nextp.ptr() in stopm() isn't nil when it crashes.

I'm still trying to get more traces.
But if anyone has ideas about what's causing these bugs, they are welcome!

@aclements

@bradfitz bradfitz added this to the Go1.13 milestone Feb 12, 2019


Helflym commented Feb 25, 2019

I've added an atomic.Load on sched.safePointWait and sched.stopwait. It seems to have fixed both bugs. However, it is still strange that they occur only on aix/ppc64.
@laboger did you ever come across these bugs or similar ones on linux/ppc64 or linux/ppc64le?
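
For readers outside the runtime, the shape of this change can be sketched in ordinary Go. This is only an illustration under assumptions: the countdown type and its done and remaining methods are hypothetical, and sync/atomic is used in place of the runtime's internal atomic package. The counter is decremented atomically by workers and read back with an atomic load rather than a plain read.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// countdown mimics a scheduler-style counter such as sched.safePointWait:
// it starts at the number of pending workers and each worker decrements it.
type countdown struct {
	wait uint32 // accessed only through sync/atomic
}

// done decrements the counter atomically and reports whether this call
// brought it to zero. Adding ^uint32(0) subtracts one.
func (c *countdown) done() bool {
	return atomic.AddUint32(&c.wait, ^uint32(0)) == 0
}

// remaining reads the counter with an atomic load instead of a plain read.
func (c *countdown) remaining() uint32 {
	return atomic.LoadUint32(&c.wait)
}

// runWorkers starts n workers, waits for all of them to finish, and
// returns the counter value observed afterwards.
func runWorkers(n int) uint32 {
	c := &countdown{wait: uint32(n)}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.done()
		}()
	}
	wg.Wait()
	return c.remaining()
}

func main() {
	fmt.Println("remaining:", runWorkers(8)) // prints "remaining: 0"
}
```

A plain read of a variable that other threads write concurrently is a data race under the Go memory model, so the atomic load is what makes the final check well defined in this sketch.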


gopherbot commented Feb 25, 2019

Change https://golang.org/cl/163624 mentions this issue: runtime: perform atomic.Load on sched values
