runtime: forEachP not done and stopTheWorld failing on aix/ppc64 #30189
Comments
I've added an atomic.Load on |
Change https://golang.org/cl/163624 mentions this issue: |
For some reason I don't always see these notifications so I'm sorry I missed this. We have not seen these particular errors on Linux. Carlos told me the memory model should be the same between what's used in AIX and Linux. However, there could be differences in the kernels that might possibly affect this. Carlos will ask around and see what he can find out. |
This error reminds me of some of the errors I've seen when the value in r2 is wrong. In particular, the fix found here, which I see was later #ifdef'ed out on AIX https://go-review.googlesource.com/c/go/+/117515/. In this case the addressing for sched depends on the value of r2, so that would lead to accessing the wrong storage if r2 was bad. I don't know why adding an atomic.Load would affect the results, unless it affects how goroutine switches happen. Not sure how likely but just thought I would mention it. The way r2 is managed on AIX is different than Linux. |
Here's one in https://build.golang.org/log/9db739f8bcaac0d30280c8a9c438850ec1915c46.
|
Thanks @bcmills. I haven't updated this issue for a while as a lot of the conversation occurs in the https://golang.org/cl/163624. The main problem is that notesleep/notewakeup don't always form an happens-before edge on AIX. This means that even if a value was updated in thread A and notewakeup called afterwards, this value might not be up to date in thread B... There are some possible workarounds, mainly by performing atomic operations on thread B: it will synchronize its memory. Note: acquirep failures (https://build.golang.org/log/b823c8f306e3544b16e2aa231912b71d5cbfb966) are also linked with this issue. |
What version of Go are you using (
go version
)?What operating system and processor architecture are you using (
go env
)?go env
OutputWhat did you do?
I'm trying to fix some issues within the aix/ppc64 runtime. forEachP and stopTheWorldWithSema seem to crash randomly on aix/ppc64, as you can see in the following logs:
https://build.golang.org/log/4ca5efa18918e49cd52403582986875f1fc0bde3
https://build.golang.org/log/4680c735ea77419991751ae77791e5fafff706a8
https://build.golang.org/log/0ce7939d9567eca343cb680bb447b6a4ebae5131
...
I didn't managed to reproduce these crashes with a simple test... Therefore, I must launch the full
./all.bash
everytime hopping it will crash. Locally, it happens 1 every 100/200./all.bash
but the builder seems to crash more often.I've added some traces locally in order to understand what can be wrong. But the output are quite suppressing.
First, for forEachP, I've added traces before throw("forEachP: not done") and can get these kind of traces:
As you can see, everything seems alright, even
sched.safePointWait
isn't nil.In order to get these traces, I've added a lock to
sched.lock
right after theif sched.safePointWait != 0
.Is that possible that without this lock,
sched.safePointWait
has an old value or is still being updated by another routine ?Is it even possible to access safely
sched
values without a lock ?The same seems to occur with stopTheWorld, because if I add
schedtrace(true)
(which has a lock inside) beforethrow(bad)
, I'm getting:Once again,
stopwait
is nil in the traces but it throws because it's not.I still don't know why these bugs only occur on aix/ppc64.
Another guess is that the 100us is too short on AIX and the time needed to print the traces is enough to update all remaining Ps. But I don't think AIX is that slow.
These bugs might also be related to another bug with
acquirep()
(cf https://build.golang.org/log/60b0cd90bf7560bc4924bfa70e679be9ace58bbd). I haven't found anything relevant on this bug, except that_g_.m.nextp.ptr()
instopm()
isn't nil when it crashes..I'm still trying to get more traces.
But if anyone has any ideas about what's wrong with these bugs, you're welcome !
@aclements
The text was updated successfully, but these errors were encountered: