Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: forEachP not done and stopTheWorld not stopped on openbsd/ppc64 #63384

Closed
4a6f656c opened this issue Oct 5, 2023 · 19 comments
Closed
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@4a6f656c
Copy link
Contributor

4a6f656c commented Oct 5, 2023

The openbsd/ppc64 port occasionally fails with errors like the following:

fatal error: forEachP: not done

https://build.golang.org/log/20e882c562bc987948d25c42db4c8d1437a455b2

Or:

fatal error: stopTheWorld: not stopped (stopwait != 0)

https://build.golang.org/log/186564e55652471a794fd506989d98781bec30eb

This appears to be identical to the issue #30189, which related to aix/ppc64.

This was resolved via https://go-review.googlesource.com/c/go/+/163624 - presumably the same change is needed.

@4a6f656c 4a6f656c self-assigned this Oct 5, 2023
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/532855 mentions this issue: runtime: atomically clear notes on openbsd/ppc64

@ianlancetaylor
Copy link
Contributor

I wonder if the problem is that Casuintptr doesn't do any synchronization if the compare part of the compare-and-swap fails.

stopTheWorldWithSema and forEachP call notetsleep to wait for the goroutines to have stopped. Right now if we are sleeping waiting for other goroutines, the other goroutine will acquire sched.lock, observe that stopwait or safePointWait is 0, and call notewakeup. Let's suppose that the notewakeup gets in before the notetsleep. In that case notetsleep will see a failed call to atomic.Casuintptr.

In the ppc64 implementation that we use, if Casuintptr fails, we don't do any synchronization. That is, we don't execute an lwsync or isync instruction. To be clear, we execute lwsync at the start of Casuintptr. But then we load the value from memory, and see whether it is the expected value. So we can have one thread call lwsync, then the other thread jumps in and stores the wakeup value, then the first thread tries the atomic compare-and-swap and fails, and Casuintptr then simply returns without calling lwsync again.

There is no synchronization between the call to notetsleep and the check of sched.stopwait or sched.safePointWait. That suggests the possibility that if the notewakeup beats the notetsleep, then the code is seeing an old value of the counter, and incorrectly reporting that the wait failed.

There is no corresponding issue with the futex implementation, because that doesn't use Casuintptr. It uses atomic.Load which always calls isync after loading the value.

I checked GCC and it executes an isync instruction after the compare-and-swap, whether that operation succeeds or fails. Our implementation executes lwsync if the compare-and-swap succeeds, but nothing if it fails. I don't understand why we can get away with using lwsync rather than isync.

CC @laboger @pmur

@ianlancetaylor
Copy link
Contributor

I forgot to say: using an atomic store in noteclear fixes the problem because that happens in the path between notetsleep and the check of stopwait or safePointWait. The atomic store executes a sync instruction.

@pmur
Copy link
Contributor

pmur commented Oct 6, 2023

I agree that something seems strange here. It is odd that both AIX or openbsd fails, surely the memory holding the pointer meets the ISA requirements for lowering isync to lwsync?

Note, the ISA claims isync can be replaced with lwsync for memory that is "neither write through, nor cache inhibited" when performing a CAS like operation. That should always be true for Go on linux. GCC can't make that assumption. The ISA is unclear if this only applies to CAS operations, or more generally. My understanding is the POWER ISA designers verified this substitution is OK in general.

@ianlancetaylor
Copy link
Contributor

@4a6f656c Is there a way for you to see whether https://go.dev/cl/533118 fixes the problem? Thanks.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/533118 mentions this issue: cmd/compile, runtime/internal/atomic: add lwsync for false ppc64 cas

@laboger
Copy link
Contributor

laboger commented Oct 9, 2023

I checked GCC and it executes an isync instruction after the compare-and-swap, whether that operation succeeds or fails. Our implementation executes lwsync if the compare-and-swap succeeds, but nothing if it fails. I don't understand why we can get away with using lwsync rather than isync.

After further investigation it's possible that the use of LWSYNC instead of ISYNC is the problem. I see that the code was changed from ISYNC to LWSYNC in https://go-review.googlesource.com/c/go/+/95175 which was before the AIX issue was written. Perhaps the ISA description this change was based on was not clear and misunderstood.

@dmitshur dmitshur added the NeedsFix The path to resolution is known, but the work has not been done. label Oct 9, 2023
@dmitshur dmitshur added this to the Go1.22 milestone Oct 9, 2023
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Oct 9, 2023
@ianlancetaylor
Copy link
Contributor

As far as I can tell, we need some kind of memory synchronization on the failure path of Cas. If we don't, then let's say we are running on core C1. If core C2 does an atomic store to the value after the Cas lwsync instruction but before the Cas lwar instruction, then the Cas can take the failure path. In that case there is no synchronization between the store in C2 and the load in C1. Without any such synchronization on Cas failure, the Casuintptr in notesleep and notetsleep_internal in lock_sema.go can return, telling the caller that the note has been woken up. However, the caller won't necessarily see the memory stores done on C2 before C2 changed the value. I believe that is the cause of the problem described on this issue.

I could certainly be mistaken, but where?

@pmur
Copy link
Contributor

pmur commented Oct 11, 2023

It seems odd that sched.stopwait and sched.safePointWait are read without any barriers after everything has stopped. sched.lock seems to synchronize them otherwise.

I don't think notetsleep synchronizes memory on all paths. The CAS operation does not enforce sequential consistency. Specifically, lwsync does not synchronize stores prior to the lwsync with loads after.

@laboger
Copy link
Contributor

laboger commented Oct 11, 2023

The question is, is this the correct usage for CAS? I was always under the impression that CAS is used when all threads were accessing the data with CAS. But I don't see any documentation that officially states that.

@ianlancetaylor
Copy link
Contributor

@pmur It seems clear that the code reading sched.stopwait assumes that that is OK without an atomic load, and without holding a lock, because the world is stopped. The only way that works is if notetsleep is a memory barrier of some sort. Perhaps that is wrong, but it seems like a reasonable assumption. A note is a synchronization primitive in the runtime. It seems to me that notesleep should only return with a happens-before relationship from notewakeup to notesleep. And similarly if notetsleep returns true, then there should be happens-before relationship from notewakeup.

If that is not the case—and I don't think it's my call, it's up to @golang/runtime—then we need to document that clearly in the docs for note and we probably need to audit the uses of note.

@laboger To be clear, all the threads are accessing the data in the note field with atomic operations. The question is whether we are creating a happens-before relationship from notewakeup to notetsleep returning true. Today, on ppc64, I think that we are not. I've suggested changing CAS, but you're quite right that we could change notetsleep instead.

@laboger
Copy link
Contributor

laboger commented Oct 11, 2023

To be clear, all the threads are accessing the data in the note field with atomic operations. The question is whether we are creating a happens-before relationship from notewakeup to notetsleep returning true. Today, on ppc64, I think that we are not. I've suggested changing CAS, but you're quite right that we could change notetsleep instead.

I was actually referring to noteclear since that appears to be where it is not doing an atomic store, and that is where they did the fix for AIX.

@ianlancetaylor
Copy link
Contributor

@laboger I see. I think that noteclear is a red herring here. It's true that there has to be a happens-before from noteclear to either notewakeup or notesleep. But the noteclear doesn't really have anything to do with the problem I'm trying to describe: that in some cases there is no happens-before relationship between notewakeup and notetsleep. Changing the noteclear to use an atomic store happens to fix the observed problem, because using an atomic store inserts a memory barrier. But it fixes the problem by accident. There is no reason the noteclear has to be where it is. It could move later in the function, in which case the problem would recur.

@ianlancetaylor
Copy link
Contributor

After discussion with the runtime team we think that it's clearer if compare-and-swap requires memory consistency even in the failure case. So I'm going to submit my CL. I'm going to optimistically assume that it fixes this problem.

@ianlancetaylor
Copy link
Contributor

By the way, I'll continue to use LWSYNC for now. If y'all think this should change to ISYNC, by all means go ahead. Thanks.

@ianlancetaylor
Copy link
Contributor

Opened #63506 for the similar issue on arm and mips.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/534517 mentions this issue: runtime: don't use atomic store in noteclear on AIX

@4a6f656c
Copy link
Contributor Author

@4a6f656c Is there a way for you to see whether https://go.dev/cl/533118 fixes the problem? Thanks.

The issue is pretty difficult to trigger, however I've managed to get 50+ runs of ./run.bash to pass with the CL, so it certainly seems like it fixes it. Thanks.

@ianlancetaylor
Copy link
Contributor

Thanks for trying it.

gopherbot pushed a commit that referenced this issue Oct 12, 2023
In CL 163624 we added an atomic store in noteclear on AIX only.
In the discussion on issue #63384 we think we figured out that the
real problem was in the implementation of compare-and-swap on ppc64.
That is fixed by CL 533118, so the atomic store is no longer required.

For #30189
For #63384

Change-Id: I60f4f2fac75106f2bee51a8d9663259dcde2029c
Reviewed-on: https://go-review.googlesource.com/c/go/+/534517
Auto-Submit: Ian Lance Taylor <iant@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Joel Sing <joel@sing.id.au>
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@google.com>
yunginnanet pushed a commit to yunginnanet/go that referenced this issue Oct 20, 2023
This CL changes ppc64 atomic compare-and-swap (cas). Before this CL,
if the cas failed--if the value in memory was not the value expected
by the cas call--the atomic function would not synchronize memory.

In the note code in runtime/lock_sema.go, used on BSD systems,
notesleep and notetsleep first try a cas on the key. If that cas fails,
something has already called notewakeup, and the sleep completes.
However, because the cas did not synchronize memory on failure,
this meant that notesleep/notetsleep could return to a core that was
unable to see the memory changes that the notewakeup was reporting.

Fixes golang#30189
Fixes golang#63384

Change-Id: I9b921de5c1c09b10a37df6b3206b9003c3f32986
Reviewed-on: https://go-review.googlesource.com/c/go/+/533118
Run-TryBot: Ian Lance Taylor <iant@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Reviewed-by: Paul Murphy <murp@ibm.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
Auto-Submit: Ian Lance Taylor <iant@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
yunginnanet pushed a commit to yunginnanet/go that referenced this issue Oct 20, 2023
In CL 163624 we added an atomic store in noteclear on AIX only.
In the discussion on issue golang#63384 we think we figured out that the
real problem was in the implementation of compare-and-swap on ppc64.
That is fixed by CL 533118, so the atomic store is no longer required.

For golang#30189
For golang#63384

Change-Id: I60f4f2fac75106f2bee51a8d9663259dcde2029c
Reviewed-on: https://go-review.googlesource.com/c/go/+/534517
Auto-Submit: Ian Lance Taylor <iant@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Joel Sing <joel@sing.id.au>
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsFix The path to resolution is known, but the work has not been done.
Projects
None yet
Development

No branches or pull requests

6 participants