Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: forEachP not done and stopTheWorld failing on aix/ppc64 #30189

Helflym opened this issue Feb 12, 2019 · 6 comments

runtime: forEachP not done and stopTheWorld failing on aix/ppc64 #30189

Helflym opened this issue Feb 12, 2019 · 6 comments


Copy link

@Helflym Helflym commented Feb 12, 2019

What version of Go are you using (go version)?

$ go version

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
root@castor4:/opt/freeware/src/packages/BUILD/go-root/own_test/goprogs(cgo)$  go env
GOGCCFLAGS="-fPIC -maix64 -pthread -mcmodel=large -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build147889564=/tmp/go-build -gno-record-gcc-switches"

What did you do?

I'm trying to fix some issues within the aix/ppc64 runtime. forEachP and stopTheWorldWithSema seem to crash randomly on aix/ppc64, as you can see in the following logs:

I didn't managed to reproduce these crashes with a simple test... Therefore, I must launch the full ./all.bash everytime hopping it will crash. Locally, it happens 1 every 100/200 ./all.bash but the builder seems to crash more often.

I've added some traces locally in order to understand what can be wrong. But the output are quite suppressing.

First, for forEachP, I've added traces before throw("forEachP: not done") and can get these kind of traces:

##### GOMAXPROCS=2 runtime -cpu=1,2,4 -quick
sched.safePointWait =  0 =  0 ;p.status =  1 ; p.runSafePointFn =  0 =  14 p.m.libcallsp =  0 p.m.blocked =  false =  1 ;p.status =  1 ; p.runSafePointFn =  0 =  6 p.m.libcallsp =  4570326256 p.m.blocked =  false
SCHED 0ms: gomaxprocs=2 idleprocs=1 threads=13 spinningthreads=0 idlethreads=5 runqueue=0 gcwaiting=0 nmidlelocked=1 stopwait=0 sysmonwait=0
  P0: status=1 schedtick=436780 syscalltick=8455 m=14 runqsize=0 gfreecnt=13
  P1: status=0 schedtick=84378 syscalltick=4003 m=-1 runqsize=0 gfreecnt=6

As you can see, everything seems alright, even sched.safePointWait isn't nil.
In order to get these traces, I've added a lock to sched.lock right after the if sched.safePointWait != 0.
Is that possible that without this lock, sched.safePointWait has an old value or is still being updated by another routine ?
Is it even possible to access safely sched values without a lock ?

The same seems to occur with stopTheWorld, because if I add schedtrace(true) (which has a lock inside) before throw(bad), I'm getting:

        SCHED 0ms: gomaxprocs=16 idleprocs=0 threads=8 spinningthreads=0 idlethreads=6 runqueue=0 gcwaiting=1 nmidlelocked=0 stopwait=0 sysmonwait=0
         fatal error: stopTheWorld: not stopped (stopwait != 0)

Once again, stopwait is nil in the traces but it throws because it's not.

I still don't know why these bugs only occur on aix/ppc64.
Another guess is that the 100us is too short on AIX and the time needed to print the traces is enough to update all remaining Ps. But I don't think AIX is that slow.

These bugs might also be related to another bug with acquirep() (cf I haven't found anything relevant on this bug, except that _g_.m.nextp.ptr() in stopm() isn't nil when it crashes..

I'm still trying to get more traces.
But if anyone has any ideas about what's wrong with these bugs, you're welcome !


@bradfitz bradfitz added this to the Go1.13 milestone Feb 12, 2019
Copy link
Contributor Author

@Helflym Helflym commented Feb 25, 2019

I've added an atomic.Load on sched.safePointWait and sched.stopwait. It seems to have fixed each bugs. However, it still strange that they occur only on aix/ppc64.
@laboger did you ever come across these bugs or similar ones on linux/ppc64(le?) ?

Copy link

@gopherbot gopherbot commented Feb 25, 2019

Change mentions this issue: runtime: perform atomic.Load on sched values

Copy link

@laboger laboger commented Apr 16, 2019

For some reason I don't always see these notifications so I'm sorry I missed this. We have not seen these particular errors on Linux. Carlos told me the memory model should be the same between what's used in AIX and Linux. However, there could be differences in the kernels that might possibly affect this. Carlos will ask around and see what he can find out.

Copy link

@laboger laboger commented Apr 16, 2019

This error reminds me of some of the errors I've seen when the value in r2 is wrong. In particular, the fix found here, which I see was later #ifdef'ed out on AIX In this case the addressing for sched depends on the value of r2, so that would lead to accessing the wrong storage if r2 was bad. I don't know why adding an atomic.Load would affect the results, unless it affects how goroutine switches happen. Not sure how likely but just thought I would mention it. The way r2 is managed on AIX is different than Linux.

Copy link

@bcmills bcmills commented May 14, 2019

I'm still trying to get more traces.

Here's one in

fatal error: forEachP: not done

runtime stack:
runtime.throw(0x1004cedd4, 0x12)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/runtime/panic.go:714 +0x68
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/runtime/proc.go:1399 +0x4cc
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/runtime/mgc.go:1424 +0x64
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/runtime/asm_ppc64x.s:269 +0x94

goroutine 20 [GC worker (idle)]:
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/runtime/asm_ppc64x.s:216 +0x10 fp=0xa00010000208eb0 sp=0xa00010000208e90 pc=0x1000603f0
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/runtime/mgc.go:1417 +0xd8 fp=0xa00010000208f18 sp=0xa00010000208eb0 pc=0x10001d2f8
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/runtime/mgc.go:1961 +0x338 fp=0xa00010000208fb8 sp=0xa00010000208f18 pc=0x10001e3a8
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/runtime/asm_ppc64x.s:884 +0x4 fp=0xa00010000208fb8 sp=0xa00010000208fb8 pc=0x100062c54
created by runtime.gcBgMarkStartWorkers
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/runtime/mgc.go:1782 +0xa4

goroutine 1 [runnable]:
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/sync/mutex.go:179 +0x14
go/token.(*File).unpack(0xa00010000484ba0, 0x100, 0x10001000049d500, 0xa0001000049d500, 0x51, 0xa, 0x1)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/go/token/position.go:328 +0x140
go/token.(*File).position(0xa00010000484ba0, 0x5eb, 0x100000000000001, 0x0, 0x0, 0x100, 0x0, 0x0)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/go/token/position.go:334 +0x6c
go/token.(*File).PositionFor(0xa00010000484ba0, 0x5eb, 0x100000000000000, 0x0, 0x0, 0x0, 0x0, 0x0)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/go/token/position.go:348 +0x84
go/parser.(*parser).consumeCommentGroup(0xa00010000352f00, 0x1, 0x100010000472508, 0xa0001000049d500)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/go/parser/parser.go:289 +0x324
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/go/parser/parser.go:349 +0x1e0
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/go/parser/parser.go:429 +0x5c
go/parser.(*parser).parseImportSpec(0xa00010000352f00, 0x0, 0x4b, 0x0, 0x0, 0x0)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/go/parser/parser.go:2318 +0xc8
go/parser.(*parser).parseGenDecl(0xa00010000352f00, 0x4b, 0xa00010000203748, 0x0)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/go/parser/parser.go:2421 +0x2f8
go/parser.(*parser).parseFile(0xa00010000352f00, 0x0)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/go/parser/parser.go:2547 +0x1b8
go/parser.ParseFile(0xa0001000035dfc0, 0xa0001000049d500, 0x51, 0x11004b970, 0xa000100004cac00, 0x6, 0x0, 0x0, 0x0)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/go/parser/interface.go:117 +0x12c
go/build.(*Context).Import(0x11048c5a0, 0x1004c568d, 0x1, 0xa00010000476370, 0x41, 0x0, 0xa00010000341e60, 0x0, 0xc)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/go/build/build.go:847 +0x1058
cmd/go/internal/search.MatchPackages.func3(0xa00010000476370, 0x41, 0x11012baa0, 0xa0001000047a000, 0x0, 0x0, 0x0, 0xa00010000476370)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/cmd/go/internal/search/search.go:100 +0x290
path/filepath.walk(0xa00010000476370, 0x41, 0x11012baa0, 0xa0001000047a000, 0xa00010000204440, 0x0, 0x0)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/path/filepath/path.go:362 +0xc8
path/filepath.walk(0xa0001000011c380, 0x39, 0x11012baa0, 0xa00010000434ff0, 0xa00010000204440, 0x0, 0x0)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/path/filepath/path.go:382 +0x2ac
path/filepath.walk(0xa0001000011c100, 0x31, 0x11012baa0, 0xa000100000ee1e0, 0xa00010000204440, 0x0, 0x2)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/path/filepath/path.go:382 +0x2ac
path/filepath.Walk(0xa0001000011c100, 0x31, 0xa00010000204440, 0x1004c568e, 0x1)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/path/filepath/path.go:404 +0x110
cmd/go/internal/search.MatchPackages(0xffffffffffff98c, 0x3, 0x1004c5a8d)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/cmd/go/internal/search/search.go:59 +0x318
cmd/go/internal/modload.ImportPathsQuiet.func1(0xa00010000188090, 0x1, 0x1, 0x10000011048d720)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/cmd/go/internal/modload/load.go:164 +0xbf8
cmd/go/internal/modload.loadPatterns.func1(0x11008ec50, 0xa00010000197d70, 0x0)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/cmd/go/internal/modload/load.go:264 +0x68
cmd/go/internal/modload.(*loader).load(0xa000100001823c0, 0xa00010000204e50)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/cmd/go/internal/modload/load.go:637 +0xff4
cmd/go/internal/modload.loadPatterns(0xa000100000fa170, 0x1, 0x1, 0x1000000000000d0, 0xa000100002050d8, 0xffffffffffff94f, 0x2, 0x1)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/cmd/go/internal/modload/load.go:262 +0x1fc
cmd/go/internal/modload.ImportPathsQuiet(0xa000100000fa170, 0x1, 0x1, 0x113269acef16fa12, 0x100014b50, 0x10)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/cmd/go/internal/modload/load.go:173 +0x78
cmd/go/internal/modload.ImportPaths(0xa000100000fa170, 0x1, 0x1, 0xd0, 0xd0, 0xa000100001d2270)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/cmd/go/internal/modload/load.go:57 +0x3c
cmd/go/internal/load.ImportPaths(0xa000100000fa170, 0x1, 0x1, 0x4, 0xb7e3c59c1823b9c9, 0x10016ac48)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/cmd/go/internal/load/pkg.go:2000 +0x94
cmd/go/internal/load.PackagesAndErrors(0xa000100000fa170, 0x1, 0x1, 0x0, 0x0, 0x0)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/cmd/go/internal/load/pkg.go:1947 +0xac
cmd/go/internal/load.Packages(0xa000100000fa170, 0x1, 0x1, 0xa000100001ca6c0, 0x0, 0x0)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/cmd/go/internal/load/pkg.go:1924 +0x40
cmd/go/internal/list.runList(0x1104664c0, 0xa000100000fa170, 0x1, 0x1)
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/cmd/go/internal/list/list.go:425 +0x2278
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/cmd/go/main.go:188 +0x768

goroutine 3 [syscall]:
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/runtime/sigqueue.go:139 +0x100
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/os/signal/signal_unix.go:23 +0x24
created by os/signal.init.0
	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/os/signal/signal_unix.go:29 +0x3c

go tool dist: FAILED: go list -gcflags=all= -ldflags=all= -f={{if .Stale}}	STALE {{.ImportPath}}: {{.StaleReason}}{{end}} std: exit status 2

Copy link
Contributor Author

@Helflym Helflym commented May 14, 2019

Thanks @bcmills. I haven't updated this issue for a while as a lot of the conversation occurs in the

The main problem is that notesleep/notewakeup don't always form an happens-before edge on AIX. This means that even if a value was updated in thread A and notewakeup called afterwards, this value might not be up to date in thread B... There are some possible workarounds, mainly by performing atomic operations on thread B: it will synchronize its memory.
However, we don't know the main reason behind this bug and which specific sequences of notesleep/notewakeup calls are triggering it.

Note: acquirep failures ( are also linked with this issue.

@gopherbot gopherbot closed this in 1956b28 May 14, 2019
@golang golang locked and limited conversation to collaborators May 13, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
5 participants