Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: forEachP not done and stopTheWorld failing on aix/ppc64 #30189

Closed
Helflym opened this issue Feb 12, 2019 · 6 comments

Comments

@Helflym
Copy link
Contributor

commented Feb 12, 2019

What version of Go are you using (go version)?

$ go version
1.12rc1 

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
root@castor4:/opt/freeware/src/packages/BUILD/go-root/own_test/goprogs(cgo)$  go env
GOARCH="ppc64"
GOBIN=""
GOCACHE="/var/go/.cache/go-build/"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="ppc64"
GOHOSTOS="aix"
GOOS="aix"
GOPATH="/opt/freeware/src/packages/BUILD/go-path"
GOPROXY=""
GORACE=""
GOROOT="/opt/freeware/src/packages/BUILD/go-root"
GOTMPDIR=""
GOTOOLDIR="/opt/freeware/src/packages/BUILD/go-root/pkg/tool/aix_ppc64"
GCCGO="/opt/freeware/bin/gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="0"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -maix64 -pthread -mcmodel=large -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build147889564=/tmp/go-build -gno-record-gcc-switches"

What did you do?

I'm trying to fix some issues within the aix/ppc64 runtime. forEachP and stopTheWorldWithSema seem to crash randomly on aix/ppc64, as you can see in the following logs:
https://build.golang.org/log/4ca5efa18918e49cd52403582986875f1fc0bde3
https://build.golang.org/log/4680c735ea77419991751ae77791e5fafff706a8
https://build.golang.org/log/0ce7939d9567eca343cb680bb447b6a4ebae5131
...

I didn't managed to reproduce these crashes with a simple test... Therefore, I must launch the full ./all.bash everytime hopping it will crash. Locally, it happens 1 every 100/200 ./all.bash but the builder seems to crash more often.

I've added some traces locally in order to understand what can be wrong. But the output are quite suppressing.

First, for forEachP, I've added traces before throw("forEachP: not done") and can get these kind of traces:

##### GOMAXPROCS=2 runtime -cpu=1,2,4 -quick
sched.safePointWait =  0
p.id =  0 ;p.status =  1 ; p.runSafePointFn =  0
p.m.id =  14 p.m.libcallsp =  0 p.m.blocked =  false
p.id =  1 ;p.status =  1 ; p.runSafePointFn =  0
p.m.id =  6 p.m.libcallsp =  4570326256 p.m.blocked =  false
SCHED 0ms: gomaxprocs=2 idleprocs=1 threads=13 spinningthreads=0 idlethreads=5 runqueue=0 gcwaiting=0 nmidlelocked=1 stopwait=0 sysmonwait=0
  P0: status=1 schedtick=436780 syscalltick=8455 m=14 runqsize=0 gfreecnt=13
  P1: status=0 schedtick=84378 syscalltick=4003 m=-1 runqsize=0 gfreecnt=6

As you can see, everything seems alright, even sched.safePointWait isn't nil.
In order to get these traces, I've added a lock to sched.lock right after the if sched.safePointWait != 0.
Is that possible that without this lock, sched.safePointWait has an old value or is still being updated by another routine ?
Is it even possible to access safely sched values without a lock ?

The same seems to occur with stopTheWorld, because if I add schedtrace(true) (which has a lock inside) before throw(bad), I'm getting:

        SCHED 0ms: gomaxprocs=16 idleprocs=0 threads=8 spinningthreads=0 idlethreads=6 runqueue=0 gcwaiting=1 nmidlelocked=0 stopwait=0 sysmonwait=0
         ...
         fatal error: stopTheWorld: not stopped (stopwait != 0)

Once again, stopwait is nil in the traces but it throws because it's not.

I still don't know why these bugs only occur on aix/ppc64.
Another guess is that the 100us is too short on AIX and the time needed to print the traces is enough to update all remaining Ps. But I don't think AIX is that slow.

These bugs might also be related to another bug with acquirep() (cf https://build.golang.org/log/60b0cd90bf7560bc4924bfa70e679be9ace58bbd). I haven't found anything relevant on this bug, except that _g_.m.nextp.ptr() in stopm() isn't nil when it crashes..

I'm still trying to get more traces.
But if anyone has any ideas about what's wrong with these bugs, you're welcome !

@aclements

@bradfitz bradfitz added this to the Go1.13 milestone Feb 12, 2019

@Helflym

This comment has been minimized.

Copy link
Contributor Author

commented Feb 25, 2019

I've added an atomic.Load on sched.safePointWait and sched.stopwait. It seems to have fixed each bugs. However, it still strange that they occur only on aix/ppc64.
@laboger did you ever come across these bugs or similar ones on linux/ppc64(le?) ?

@gopherbot

This comment has been minimized.

Copy link

commented Feb 25, 2019

Change https://golang.org/cl/163624 mentions this issue: runtime: perform atomic.Load on sched values

@laboger

This comment has been minimized.

Copy link
Contributor

commented Apr 16, 2019

For some reason I don't always see these notifications so I'm sorry I missed this. We have not seen these particular errors on Linux. Carlos told me the memory model should be the same between what's used in AIX and Linux. However, there could be differences in the kernels that might possibly affect this. Carlos will ask around and see what he can find out.

@laboger

This comment has been minimized.

Copy link
Contributor

commented Apr 16, 2019

This error reminds me of some of the errors I've seen when the value in r2 is wrong. In particular, the fix found here, which I see was later #ifdef'ed out on AIX https://go-review.googlesource.com/c/go/+/117515/. In this case the addressing for sched depends on the value of r2, so that would lead to accessing the wrong storage if r2 was bad. I don't know why adding an atomic.Load would affect the results, unless it affects how goroutine switches happen. Not sure how likely but just thought I would mention it. The way r2 is managed on AIX is different than Linux.

@bcmills

This comment has been minimized.

Copy link
Member

commented May 14, 2019

I'm still trying to get more traces.

Here's one in https://build.golang.org/log/9db739f8bcaac0d30280c8a9c438850ec1915c46.

fatal error: forEachP: not done

runtime stack:
runtime.throw(0x1004cedd4, 0x12)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/runtime/panic.go:714 +0x68
runtime.forEachP(0x11010ec88)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/runtime/proc.go:1399 +0x4cc
runtime.gcMarkDone.func1()
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/runtime/mgc.go:1424 +0x64
runtime.systemstack(0x0)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/runtime/asm_ppc64x.s:269 +0x94
runtime.mstart()
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/runtime/proc.go:1146

goroutine 20 [GC worker (idle)]:
runtime.systemstack_switch()
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/runtime/asm_ppc64x.s:216 +0x10 fp=0xa00010000208eb0 sp=0xa00010000208e90 pc=0x1000603f0
runtime.gcMarkDone()
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/runtime/mgc.go:1417 +0xd8 fp=0xa00010000208f18 sp=0xa00010000208eb0 pc=0x10001d2f8
runtime.gcBgMarkWorker(0xa00010000038500)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/runtime/mgc.go:1961 +0x338 fp=0xa00010000208fb8 sp=0xa00010000208f18 pc=0x10001e3a8
runtime.goexit()
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/runtime/asm_ppc64x.s:884 +0x4 fp=0xa00010000208fb8 sp=0xa00010000208fb8 pc=0x100062c54
created by runtime.gcBgMarkStartWorkers
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/runtime/mgc.go:1782 +0xa4

goroutine 1 [runnable]:
sync.(*Mutex).Unlock(0xa00010000484bc8)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/sync/mutex.go:179 +0x14
go/token.(*File).unpack(0xa00010000484ba0, 0x100, 0x10001000049d500, 0xa0001000049d500, 0x51, 0xa, 0x1)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/go/token/position.go:328 +0x140
go/token.(*File).position(0xa00010000484ba0, 0x5eb, 0x100000000000001, 0x0, 0x0, 0x100, 0x0, 0x0)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/go/token/position.go:334 +0x6c
go/token.(*File).PositionFor(0xa00010000484ba0, 0x5eb, 0x100000000000000, 0x0, 0x0, 0x0, 0x0, 0x0)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/go/token/position.go:348 +0x84
go/token.(*File).Position(...)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/go/token/position.go:357
go/token.(*File).Line(...)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/go/token/position.go:287
go/parser.(*parser).consumeCommentGroup(0xa00010000352f00, 0x1, 0x100010000472508, 0xa0001000049d500)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/go/parser/parser.go:289 +0x324
go/parser.(*parser).next(0xa00010000352f00)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/go/parser/parser.go:349 +0x1e0
go/parser.(*parser).expectSemi(0xa00010000352f00)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/go/parser/parser.go:429 +0x5c
go/parser.(*parser).parseImportSpec(0xa00010000352f00, 0x0, 0x4b, 0x0, 0x0, 0x0)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/go/parser/parser.go:2318 +0xc8
go/parser.(*parser).parseGenDecl(0xa00010000352f00, 0x4b, 0xa00010000203748, 0x0)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/go/parser/parser.go:2421 +0x2f8
go/parser.(*parser).parseFile(0xa00010000352f00, 0x0)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/go/parser/parser.go:2547 +0x1b8
go/parser.ParseFile(0xa0001000035dfc0, 0xa0001000049d500, 0x51, 0x11004b970, 0xa000100004cac00, 0x6, 0x0, 0x0, 0x0)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/go/parser/interface.go:117 +0x12c
go/build.(*Context).Import(0x11048c5a0, 0x1004c568d, 0x1, 0xa00010000476370, 0x41, 0x0, 0xa00010000341e60, 0x0, 0xc)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/go/build/build.go:847 +0x1058
go/build.(*Context).ImportDir(...)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/go/build/build.go:449
cmd/go/internal/search.MatchPackages.func3(0xa00010000476370, 0x41, 0x11012baa0, 0xa0001000047a000, 0x0, 0x0, 0x0, 0xa00010000476370)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/cmd/go/internal/search/search.go:100 +0x290
path/filepath.walk(0xa00010000476370, 0x41, 0x11012baa0, 0xa0001000047a000, 0xa00010000204440, 0x0, 0x0)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/path/filepath/path.go:362 +0xc8
path/filepath.walk(0xa0001000011c380, 0x39, 0x11012baa0, 0xa00010000434ff0, 0xa00010000204440, 0x0, 0x0)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/path/filepath/path.go:382 +0x2ac
path/filepath.walk(0xa0001000011c100, 0x31, 0x11012baa0, 0xa000100000ee1e0, 0xa00010000204440, 0x0, 0x2)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/path/filepath/path.go:382 +0x2ac
path/filepath.Walk(0xa0001000011c100, 0x31, 0xa00010000204440, 0x1004c568e, 0x1)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/path/filepath/path.go:404 +0x110
cmd/go/internal/search.MatchPackages(0xffffffffffff98c, 0x3, 0x1004c5a8d)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/cmd/go/internal/search/search.go:59 +0x318
cmd/go/internal/modload.ImportPathsQuiet.func1(0xa00010000188090, 0x1, 0x1, 0x10000011048d720)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/cmd/go/internal/modload/load.go:164 +0xbf8
cmd/go/internal/modload.loadPatterns.func1(0x11008ec50, 0xa00010000197d70, 0x0)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/cmd/go/internal/modload/load.go:264 +0x68
cmd/go/internal/modload.(*loader).load(0xa000100001823c0, 0xa00010000204e50)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/cmd/go/internal/modload/load.go:637 +0xff4
cmd/go/internal/modload.loadPatterns(0xa000100000fa170, 0x1, 0x1, 0x1000000000000d0, 0xa000100002050d8, 0xffffffffffff94f, 0x2, 0x1)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/cmd/go/internal/modload/load.go:262 +0x1fc
cmd/go/internal/modload.ImportPathsQuiet(0xa000100000fa170, 0x1, 0x1, 0x113269acef16fa12, 0x100014b50, 0x10)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/cmd/go/internal/modload/load.go:173 +0x78
cmd/go/internal/modload.ImportPaths(0xa000100000fa170, 0x1, 0x1, 0xd0, 0xd0, 0xa000100001d2270)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/cmd/go/internal/modload/load.go:57 +0x3c
cmd/go/internal/load.ImportPaths(0xa000100000fa170, 0x1, 0x1, 0x4, 0xb7e3c59c1823b9c9, 0x10016ac48)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/cmd/go/internal/load/pkg.go:2000 +0x94
cmd/go/internal/load.PackagesAndErrors(0xa000100000fa170, 0x1, 0x1, 0x0, 0x0, 0x0)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/cmd/go/internal/load/pkg.go:1947 +0xac
cmd/go/internal/load.Packages(0xa000100000fa170, 0x1, 0x1, 0xa000100001ca6c0, 0x0, 0x0)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/cmd/go/internal/load/pkg.go:1924 +0x40
cmd/go/internal/list.runList(0x1104664c0, 0xa000100000fa170, 0x1, 0x1)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/cmd/go/internal/list/list.go:425 +0x2278
main.main()
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/cmd/go/main.go:188 +0x768

goroutine 3 [syscall]:
os/signal.signal_recv(0x0)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/runtime/sigqueue.go:139 +0x100
os/signal.loop()
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/os/signal/signal_unix.go:23 +0x24
created by os/signal.init.0
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/os/signal/signal_unix.go:29 +0x3c

go tool dist: FAILED: go list -gcflags=all= -ldflags=all= -f={{if .Stale}}	STALE {{.ImportPath}}: {{.StaleReason}}{{end}} std: exit status 2

@Helflym

This comment has been minimized.

Copy link
Contributor Author

commented May 14, 2019

Thanks @bcmills. I haven't updated this issue for a while as a lot of the conversation occurs in the https://golang.org/cl/163624.

The main problem is that notesleep/notewakeup don't always form an happens-before edge on AIX. This means that even if a value was updated in thread A and notewakeup called afterwards, this value might not be up to date in thread B... There are some possible workarounds, mainly by performing atomic operations on thread B: it will synchronize its memory.
However, we don't know the main reason behind this bug and which specific sequences of notesleep/notewakeup calls are triggering it.

Note: acquirep failures (https://build.golang.org/log/b823c8f306e3544b16e2aa231912b71d5cbfb966) are also linked with this issue.

@gopherbot gopherbot closed this in 1956b28 May 14, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.