
[Armv7-a]: Control-dependency between atomic accesses removed by -O1. #65106

Closed
lukeg101 opened this issue Aug 30, 2023 · 51 comments

@lukeg101
Member

Consider the following litmus test that captures bad behaviour:

C test

{ *x = 0; *y = 0; } // fixed initial state where x and y are 0

// Concurrent threads
P0 (atomic_int* y,int* x) {
  int r0 = *x;
  if (r0 == 1) {
    atomic_store_explicit(y,1,memory_order_relaxed);
  }
}

P1 (atomic_int* y,int* x) {
  int r0 = atomic_load_explicit(y,memory_order_acquire);
  *x = 1;
}

// predicate over final state - does this outcome occur?
exists (P0:r0=1 /\ P1:r0=1)

where 'P0:r0=0' means that thread P0's local variable r0 has value 0.

When simulating this test under the C/C++ model from its initial state, the outcome in the exists clause is forbidden by the source model. The allowed outcomes are:

{P0:r0=0; P1:r0=0;}
{P0:r0=1; P1:r0=0;}   

When compiling to target armv7-a using either GCC or LLVM trunk (https://godbolt.org/z/nxWTbEfK1), the compiled program has the following outcomes when simulated under the armv7 model:

{P0:r0=0; P1:r0=0;}                                           
{P0:r0=1; P1:r0=0;}
{P0:r0=1; P1:r0=1;} <--- forbidden by source model, bug!

The CMP; MOVEQ R1,#1; STREQ sequence is predicated on the result of the CMP, but the STREQ can be reordered before the LDR of x on thread P0, allowing the outcome {P0:r0=1; P1:r0=1;}.

The issue is that the control dependency between the load and store on P0 has been removed by the compiler. Both GCC and LLVM produce the following (with symbolic registers, e.g. %P0_x, standing for concrete registers that contain the address of x).

ARM test

{ [P0_r0]=0;[P1_r0]=0;[x]=0;[y]=0;
  %P0_P0_r0=P0_r0;%P0_x=x;%P0_y=y;
  %P1_P1_r0=P1_r0;%P1_x=x;%P1_y=y }
(*****************************************************************)
(*                 the Telechat toolsuite                        *)
(*                                                               *)
(* Luke Geeson, University College London, UK.                   *)
(* gcc -O1 -march=armv7-a -pthread --std=c11 -marm -fno-section-anchors*)
(*                                                               *)
(*****************************************************************)
  P0                   |  P1                   ;
   LDR R2,[%P0_x]      |   LDR R2,[%P1_y]      ;
   CMP R2,#1           |   DMB ISH             ;
   MOVEQ R1,#1         |   MOV R1,#1           ;
   STREQ R1,[%P0_y]    |   STR R1,[%P1_x]      ;
   STR R2,[%P0_P0_r0]  |   STR R2,[%P1_P1_r0]  ;
   BX LR               |   BX LR               ;


exists (P0_r0=1 /\ P1_r0=1) <--- yes, allowed by arm.cat, bug!

The forbidden {P0:r0=1; P1:r0=1;} behaviour disappears in GCC when optimising at -O2 or above, since R2 is reused in the STREQ when the EQ condition holds: a data dependency masks the loss of the control dependency. This buggy behaviour remains in clang at -O2 and above, however (https://godbolt.org/z/hGv7P3vGW).

To fix this, mark relaxed atomic loads/stores as non-predicated, and ideally use an explicit branch to enforce the control dependency so that it is not lost:

ARM test-fixed

{ [P0_r0]=0;[P1_r0]=0;[x]=0;[y]=0;
  %P0_P0_r0=P0_r0;%P0_x=x;%P0_y=y;
  %P1_P1_r0=P1_r0;%P1_x=x;%P1_y=y }

  P0                   |  P1                   ;
   LDR R0,[%P0_x]      |   MOV R2,#1           ;
   CMP R0,#1           |   LDR R0,[%P1_y]      ;
   BNE L0x2c           |   DMB ISH             ;
   MOV R2,#1           |   STR R2,[%P1_x]      ;
   STR R2,[%P0_y]      |   STR R0,[%P1_P1_r0]  ;
  L0x2c:               |                       ;
   STR R0,[%P0_P0_r0]  |                       ;


exists (P0_r0=1 /\ P1_r0=1)

This prevents the buggy behaviour.

We will follow up by reporting the bug for GCC.

@llvmbot
Collaborator

llvmbot commented Aug 30, 2023

@llvm/issue-subscribers-backend-arm

@lukeg101
Member Author

@tmatheson-arm

@efriedma-quic
Collaborator

Your "C test" has undefined behavior, I think? There's a race on "x".

@lukeg101
Member Author

Apologies - this test was automatically generated. If we make x atomic and make its accesses relaxed, the behaviour is the same and the test is well-defined according to the C/C++ model:

artefact> herd7 -model rc11.cat -I herdtools7/herd/libdir/ test2.litmus
Test test-relaxed Allowed
States 2
0:r0=0; 1:r0=0;
0:r0=1; 1:r0=0;
No
Witnesses
Positive: 0 Negative: 2
Condition exists (0:r0=1 /\ 1:r0=1)
Observation test-relaxed Never 0 2
Time test-relaxed 0.00
Hash=dbd2f72e7f871eb16823ba5458b4b5b7

artefact> cat test2.litmus
C test-relaxed

{ }

// Concurrent threads
P0 (atomic_int* y,atomic_int* x) {
  int r0 = atomic_load_explicit(x,memory_order_relaxed);
  if (r0 == 1) {
    atomic_store_explicit(y,1,memory_order_relaxed);
  }
}

P1 (atomic_int* y,atomic_int* x) {
  int r0 = atomic_load_explicit(y,memory_order_acquire);
  atomic_store_explicit(x,1,memory_order_relaxed);
}

exists (P0:r0=1 /\ P1:r0=1)

Further, the assembly is still predicated: https://godbolt.org/z/jzGcfT97P

@efriedma-quic
Collaborator

If I'm following correctly, the model is assuming the CPU will speculatively execute predicated ops. Are there any actual ARM CPUs which work that way, though? My impression based on optimization guides for ARM CPUs is that CPUs treat a predicated op as having a data dependency on the flags register. Given a CPU which works that way, I think the bad case would be impossible: the computation of the flags register depends on the load.

Is disabling predication of atomic accesses sufficient to solve this issue? That's cheap enough that I wouldn't mind doing it in any case, but I'm not sure that's sufficient.

@lukeg101
Member Author

The problem has two parts:

  • The control dependency has been removed.
  • The data dependency of predication has masked the behaviour at higher opt levels.

If removing predication enforces the use of an explicit branch instruction in the backend, then this should fix both issues. I advise removing predication and adding an explicit branch - the test-fixed example above forbids the bug from occurring on arm.

The historic Arm model allows the reordering, although I could not find any hardware that exhibits this - we cannot rule it out, however. What we do know is that the newer AArch32 model forbids the reordering, so the Armv8 backend should be OK.

We should conservatively guard against reordering on old hardware we do not have access to. More generally, we should restore the control dependency to enforce the intended program constraints.

@efriedma-quic
Collaborator

If removing predication enforces the use of an explicit branch instruction in the backend

Without predication, we branch in this particular case. But not in general, I think... for example, consider the following:

#include <stdatomic.h>
typedef _Atomic int atomic_int;
void f(int v, atomic_int* y,atomic_int* x, atomic_int* z) {
  int r0 = atomic_load_explicit(x,memory_order_relaxed);
  if (r0 == 1) {
    atomic_store_explicit(y,v,memory_order_relaxed);
  } else {
    atomic_store_explicit(z,v,memory_order_relaxed);
  }
}

This lowers to:

f(int, int _Atomic*, int _Atomic*, int _Atomic*):
        ldr     r2, [r2]
        cmp     r2, #1
        movne   r1, r3
        str     r0, [r1]
        bx      lr

According to your model, I think this is a miscompile? I have no idea how we could change LLVM to avoid this.

@lukeg101
Member Author

lukeg101 commented Sep 8, 2023

I had a chat with Wilco. I think in general we cannot prevent all conditional execution, but in this specific case, if we turn it off for atomics, we should be OK (and in your case above, use a bne;mov sequence to restore the control dependency).

The aim of the game is to prevent outcomes of concurrent programs that are forbidden by the source model. For sequential programs conditional execution is fine, but we should be careful when atomics are used, as this is when quirky reordering can be observed.

@efriedma-quic
Collaborator

The problem for my example is that there isn't any way to tell there's a control dependency we need to preserve. The load, the conditional, and the store could all be in different functions. The only solution I can think of that reliably works is to insert a dummy dependency after every relaxed load. (Unless there's some additional constraint I'm not seeing...?)

@Wilco1

Wilco1 commented Sep 8, 2023

If atomics are marked as non-hoistable, non-sinkable, non-mergeable, and non-predicable, then it should block all of these examples.
I can't imagine the standard requires checking control dependencies across function calls; that would be impossible.

@efriedma-quic
Collaborator

If atomics are marked as non-hoistable, non-sinkable

I don't see how this helps. If you rewrite my example to atomic_store_explicit(r0 == 1 ? y : z, v, memory_order_relaxed), you still end up with the same issue, I think.
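Spelled out as a full function, my rendering of that ternary-select rewrite (a hypothetical sketch for concreteness, not code from the original report):

#include <stdatomic.h>

// The store is no longer lexically inside a branch, so rules against
// hoisting/sinking/merging atomics don't apply: the loaded value feeds only
// the *address* of the store (an address dependency rather than a control
// dependency).
void f_select(int v, atomic_int* y, atomic_int* x, atomic_int* z) {
  int r0 = atomic_load_explicit(x, memory_order_relaxed);
  atomic_store_explicit(r0 == 1 ? y : z, v, memory_order_relaxed);
}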

I can't imagine the standard requires checking control dependencies across function calls

Not sure where you're going with this... the C++ standard just requires there exists some sequence that corresponds to a relaxed load. If that sequence is ldr r0, [r0]; cbz r0, #0, that just means armv7 code is less efficient than we'd like. There are other targets where a relaxed atomic load isn't a plain load.

@efriedma-quic
Collaborator

That said, I'm reluctant to change the default lowering of relaxed atomic loads without any evidence this is actually a realistic issue.

@Wilco1

Wilco1 commented Sep 8, 2023

That makes the store data-dependent rather than control-dependent, which is like conditional execution.

A relaxed load is simply a load. That's not the issue here. However, certain optimizations should be blocked to avoid creating incorrect orderings. It appears the issue is more widespread; I think this will fail on any target:

#include <stdatomic.h>
typedef _Atomic int atomic_int;

void f4(int v, atomic_int* y,atomic_int* x, atomic_int* z) {
  int r0 = atomic_load_explicit(x,memory_order_relaxed);
  if (r0 == 1)
    atomic_store_explicit(y,v,memory_order_relaxed);
  else
    atomic_store_explicit(y,v,memory_order_relaxed);
}

So if the C memory model requires the control dependency to be preserved for ordering, we can't optimize away the branch and merge the stores (but both GCC and LLVM do).
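To make the effect concrete, here is a hypothetical source-level rendering of f4 after the branch is folded and the identical stores merged (a sketch, not actual compiler output):

#include <stdatomic.h>

// The control dependency on r0 has vanished entirely, leaving an
// unconditional relaxed store after a relaxed load.
void f4_merged(int v, atomic_int* y, atomic_int* x, atomic_int* z) {
  int r0 = atomic_load_explicit(x, memory_order_relaxed);
  (void)r0; // the loaded value no longer guards the store
  atomic_store_explicit(y, v, memory_order_relaxed);
}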

@efriedma-quic
Collaborator

Most CPUs naturally preserve the dependency because of causality: either the input values to the store aren't available until after the load is complete, or speculative stores aren't actually committed until the CPU is sure it won't need to roll back its branch prediction.

According to this bug report, ARMv7 (but not ARMv8) cores can speculate a store in a way that's visible to other cores before the core is actually sure the store will execute. Which... you could theoretically design a CPU that works this way, but I'm skeptical any ARM CPUs exist that actually do this.

@lukeg101
Member Author

lukeg101 commented Sep 8, 2023

If atomics are marked as non-hoistable, non-sinkable, non-mergeable, and non-predicable, then it should block all of these examples.

I support doing this.

It appears the issue is more widespread; I think this will fail on any target

I've observed load buffering under models of AArch64 (official), ARM, RISC-V (official), and PPC. The specific conditionalised flavour is allowed by the older ARM model but not by the AArch32 model, which rightly fixed it.

There are other targets where a relaxed atomic load isn't a plain load.
and
you could theoretically design a CPU that works this way

This is exactly why we need to ensure the control dependency is preserved. The ARM model allows this reordering, and there are so many machines, programs, and compilers out there that finding a chip and compiled program that exhibit this behaviour is prohibitively hard. It is harder to prove it doesn't happen than it is to guard against it.

Besides, we need to ensure the generated code remains ABI-compatible with current programs and with future hardware that doesn't exist yet.

The work I have done - using models to test compiler implementations - aims to avoid the problems of interpreting the standards (C/C++ and architecture) above, and to find tricky bugs like these, where the programmer's intent (control dependencies) is violated.

I advise marking atomics as Wilco suggests to guard against the current bug, and if the performance hit is too great, we can reintroduce the old behaviour behind a flag (the opposite of what we did with the LDAPR/LDAR swap).

@efriedma-quic
Collaborator

Say you have the following:

#include <stdatomic.h>
typedef _Atomic int atomic_int;
// Pretend each function is in a separate translation unit
int do_load(atomic_int* x) {
  return atomic_load_explicit(x,memory_order_relaxed);
}
void do_store(atomic_int* y, int z) {
  atomic_store_explicit(y,z,memory_order_relaxed);
}
void f(int v, atomic_int* y,atomic_int* x, atomic_int* z) {
  int r0 = do_load(x);
  if (r0 == 1) {
    do_store(y,v);
  } else {
    do_store(z,v);
  }
}

This is, as far as I can tell, essentially equivalent to the other testcases we've been discussing. So our choices for generic lowering are:

  • Introduce a dependency-preserving branch after all relaxed atomic loads
  • Introduce a dependency-preserving branch before all relaxed atomic stores
  • Introduce a memory barrier for relaxed atomic ops

Any suggestion for which one we want?

(All the discussion about "non-hoistable, non-sinkable, non-mergeable, non-predicable" is irrelevant, as far as I can tell; my example doesn't require hoisting, sinking, merging, or predicating an atomic op.)

@lukeg101
Member Author

lukeg101 commented Sep 8, 2023

I’ll prepare some tests to evaluate those options for you on Monday, thanks for this good discussion!

@lukeg101
Member Author

lukeg101 commented Sep 9, 2023

In the meantime, could you confirm what you mean by a dependency-preserving branch?
I have a hunch but want to be sure.

@efriedma-quic
Collaborator

Something like the "ldr r0, [r0]; cbz r0, #0" pattern I mentioned, which to my understanding should preserve the dependency even though the destination doesn't actually vary.
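For concreteness, a source-level approximation of that pattern, sketched with GCC/Clang inline assembly and assuming an Armv7-A target (an actual fix would be emitted by the backend; the helper name is made up for illustration):

#include <stdatomic.h>

static inline int relaxed_load_with_branch(atomic_int* p) {
  int v = atomic_load_explicit(p, memory_order_relaxed);
  // Branch on the loaded value to the very next instruction: taken or not,
  // control flow is unchanged, but the branch consumes v, so later stores
  // stay ordered after the load under the architecture's dependency rules.
  __asm__ volatile("cmp %0, #0\n\t"
                   "beq 1f\n"
                   "1:"
                   : /* no outputs */
                   : "r"(v)
                   : "cc", "memory"); // "memory" also stops the compiler
                                      // itself moving stores across this point
  return v;
}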

@lukeg101
Member Author

That makes sense - I've done some thinking about the cases and I'll send something tomorrow. Still testing some other architectures.

@lukeg101
Member Author

Righto, I've done some analysis on these cases by generating several hundred tests. This is the simplest case and is telling.

Right away we should reject "Introduce a memory barrier for relaxed atomic ops". This is too expensive, and for the cases where reordering matters a branch after a load works best.
Suppose we emit a branch after the load on AArch64; it would look like this:

AArch64 LB004_examples_int_C_tests

{ [P0_r0]=0;[P1_r0]=0;[x]=0;[y]=0;uint64_t %P0_P0_r0=P0_r0;uint64_t %P0_x=x;uint64_t %P0_y=y;uint64_t %P1_P1_r0=P1_r0;uint64_t %P1_x=x;uint64_t %P1_y=y }

  P0                    |  P1                    ;
   MOV W10,#1           |   MOV W10,#1           ;
   LDR W8,[X%P0_x]      |   LDR W8,[X%P1_y]      ;
   CBZ W8, label1       |   CBZ W8, label2       ;
label1:                 | label2:                ;
   STR W10,[X%P0_y]     |   STR W10,[X%P1_x]     ;
   STR W8,[X%P0_P0_r0]  |   STR W8,[X%P1_P1_r0]  ;
   RET                  |   RET                  ;


exists (P0_r0=1 /\ P1_r0=1)

Under the AArch64 model we observe the outcomes:

{ P0:r0=0; P1:r0=0; }
{ P0:r0=0; P1:r0=1; }
{ P0:r0=1; P1:r0=0; }

So this would prevent the outcome { P0:r0=1; P1:r0=1 }.

This is the same for RISC-V, ARM, and PPC tests. It does not make a difference for MIPS or X86_64.
I can provide example litmus tests showing what this looks like.

As far as a branch before the store is concerned, the argument is symmetric. I think we should use a branch after relaxed atomic loads, however, as there are more frequent cases (e.g. in the Linux kernel) where a store depends on the value of a load (the converse doesn't make sense without an additional read of the newly stored value).

In the case of the original bug report, if we put branches after loads, then whether a control dependency is emitted or masked is irrelevant, as the reordering cannot occur. I think in the spirit of the original issue we should also mark atomics as non-predicable and preserve the source control dependency.

We should make this behaviour the default, and put the old behaviour behind a command-line flag for people who want it. I doubt there are enough people using relaxed atomics for this to make a performance difference, but the flag will give them the option of the old behaviour if they need it.

@efriedma-quic
Collaborator

It's probably worth posting a summary of this to Discourse to get more eyes on this.

@efriedma-quic
Collaborator

Also, do you have some reference for your definition of a control dependency? The ARM reference manual says:

A control dependency exists when the data value returned by a read access determines the condition flags, and the values of the flags are used in the condition code checking that determines the address of a subsequent read access. This address determination might be through conditional execution, or through the evaluation of a branch.

@Wilco1

Wilco1 commented Sep 12, 2023

There was an issue with C++ consume semantics which looks very similar to this (how to track and avoid optimizing away the dependency). So perhaps this should be reported as a C++ defect. Compilers have been emitting relaxed accesses as normal loads/stores for decades, so changing this now would be quite some work. Blocking a few compiler optimizations still looks like a simpler approach (all examples shown so far can be avoided that way).

@lukeg101
Member Author

So I think we are discussing two things here. I think we should go ahead and fix the above bug here, as is being done in GCC. I will raise a separate proposal on Discourse regarding putting a dependency after relaxed loads.

do you have some reference for your definition of a control dependency?

By control dependency I mean one in the source code that is removed by the compiler.

It is buried in the standard unfortunately, but I picked out:

§5.1.2.3, 14 of ISO C11 (with similar wording in newer versions):

sequenced before is an asymmetric, transitive, pair-wise relation between evaluations executed by a single thread, which induces a partial order among those evaluations. Given any two evaluations A and B, if A is sequenced before B, then the execution of A shall precede the execution of B. [...] The presence of a sequence point between the evaluation of expressions A and B implies that every value computation and side effect associated with A is sequenced before every value computation and side effect associated with B.
and
The following are the sequence points [...] Between the evaluation of a full expression and the next full expression to be evaluated. [...] the controlling expression of a selection statement (if or switch)

We use a simulator that encodes such dependencies (based on Fig. 22 of this paper). In short, if a dependency is removed by the compiler, we may observe bugs under the architecture model.

@efriedma-quic
Copy link
Collaborator

So I think we are discussing two things here. I think we should go ahead and fix the above bug here, as is being done in GCC

You mean, we should disable predication on atomic load/store? I mean, it's not a big deal if we do that, I guess, but I thought we concluded it doesn't solve the general issue? And on the other hand, if we solve the general issue, it's fine to continue predicating stores.

@lukeg101
Member Author

lukeg101 commented Oct 3, 2023

Apologies for my delay - so the general issue is related to the question of what relaxed atomics should do, whereas this issue focuses on the loss of a control dependency.

I am preparing a proposal to put on discourse - I will share it with you when it's ready. Apologies again for the delay.

@lukeg101
Member Author

lukeg101 commented Oct 23, 2023

Hi @efriedma-quic
Apologies for my delay, researching fixes for relaxed atomics took a long time. Please could you take a look at a proposal I have written up before I share it more widely with the LLVM community this coming Friday.

link

I would appreciate your thoughts on this proposal, so as to account for questions I may have missed.

I still believe that this bug should be fixed by preventing certain optimisations, as Wilco suggests, but in the above proposal I address the general question of the semantics of relaxed atomics in practice.

@efriedma-quic
Collaborator

So a general issue exists on targets with a relaxed memory model where a load can be delayed past a store. If a relaxed load is followed by a relaxed store, there can be a "thin-air" value: the load can see the result of code executing in a different thread which depends on the stored value. (And this isn't just a formal model issue; you've proven such hardware actually exists.)

If there's a branch that depends on the loaded value between the load and the store, this can't happen according to ARM hardware models: branch prediction machinery buffers the store until the load resolves. The proposal is to take advantage of this, and ensure there's always such a branch between an atomic load and a subsequent atomic store. (Another possible way to handle this, which your paper doesn't mention, is to ensure the address of any subsequent atomic store is somehow based on the value of the load. For example, if every relaxed load modifies the stack pointer, and the address of every relaxed store is based on the stack pointer. But that's probably slower.)

The issue hits more cases on armv7 formal models, compared to armv8 formal models: predication doesn't force buffering in the same way as a branch. If we disable predication of atomic ops on armv7, that's resolved, and we can treat armv7 exactly the same way as other similar targets... which may be worth doing just to make it easier to reason about armv7.

Thinking about it a bit more, for a dependency-preserving branch, maybe better to use cmp r0, #0; bls L1; L1: rather than cbz, to make branch mispredictions less likely.

Are there any cases where we can elide the branch? I mean, obviously we can if there's an existing branch, but can we take advantage of other memory barriers? ldar? stlr?

Can this show up with atomic ops that aren't just loads, like cas?

@lukeg101
Member Author

lukeg101 commented Oct 23, 2023

Hi Eli,
Thanks for the quick response!

ensure there's always such a branch between an atomic load and a subsequent atomic store.

Indeed

Another possible way to handle this, which your paper doesn't mention, is to ensure the address of any subsequent atomic store is somehow based on the value of the load.

Yes there are data, address, or control dependencies that can handle this, in addition to the normal fences/acquire-release instructions. I wanted to keep the proposal simple for non-concurrency engineers 🙂

we can treat armv7 exactly the same way as other similar targets

Ideally this would fix the current issue, leaving only the general issue in the link above, yes.

to make branch mispredictions less likely

Yes whatever seems better - I can test any variant of the 'branch-after-load' proposal.

Are there any cases where we can elide the branch? I mean, obviously we can if there's an existing branch

A valid case, as long as that branch isn't then later removed.

but can we take advantage of other memory barriers?

This works, but has worse perf than a control branch.

ldar? stlr?

One of the other proposals is 'strong release-acquire', which makes either all stores release or all loads acquire. It has the same effect as the branch-after-load proposal but performs slightly worse on PPC machines. Also, we can't use load-acquire on machines that don't have those instructions (e.g. Armv6-M), whereas almost every machine supported by LLVM has branch instructions. See the questions section of the paper.
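Roughly, in code, assuming stores are the side being strengthened (a sketch; the function name is made up):

#include <stdatomic.h>

// Every source-level relaxed store is compiled as if it were a release
// store, which forbids the load->store reordering above without a full
// barrier. (Symmetrically, all relaxed loads could become acquire instead.)
void strong_ra_store(atomic_int* y, int v) {
  atomic_store_explicit(y, v, memory_order_release); // was: memory_order_relaxed
}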

Can this show up with atomic ops that aren't just loads, like cas?

Yes - a relaxed atomic exchange for armv8.2-a is implemented using a SWP instruction, which has relaxed ordering. If we replaced the 'load' in the 'load buffering' test with a compare-exchange, we should observe reordering just as well. Also, compare-exchange-strong uses CAS, which suffers the same fate.

@efriedma-quic
Collaborator

Are there any cases where we can elide the branch? I mean, obviously we can if there's an existing branch

A valid case, as long as that branch isn't then later removed.

Right, we'd have to do this pretty late in compilation to avoid issues with branch optimizations.

but can we take advantage of other memory barriers?

This works, but has worse perf than a control branch.

I meant, if the compiler sees there's an existing acquire load/release store/fence between the relaxed load and the relaxed store, can the compiler elide the branch?

Can this show up with atomic ops that aren't just loads, like cas?

Yes - a relaxed atomic exchange for armv8.2-a is implemented using a SWP instruction, which has relaxed ordering. If we replaced the 'load' in the 'load buffering' test with a compare-exchange, we should observe reordering just as well. Also, compare-exchange-strong uses CAS, which suffers the same fate.

On a related note, if we lower an atomic operation to a loop (cas loop, or ldxr/stxr), does that naturally protect us from issues like this?

@efriedma-quic
Collaborator

One more thing I just thought of: this probably has some implications for target-independent compiler optimizations. I don't think we currently have a rule that forbids sinking a relaxed load past a relaxed store, or hoisting a relaxed store past a relaxed load. I don't expect this to have a significant performance impact, because we rarely do that sort of transform anyway, but we probably need to add some IR modeling for the "global atomic state". (Currently, IR-level analysis models relaxed loads as just a read+write to the address.)

@lukeg101
Member Author

I meant, if the compiler sees there's an existing acquire load/release store/fence between the relaxed load and the relaxed store, can the compiler elide the branch?

In theory yes - we would need to be careful with funky instructions like LDAPR where re-ordering of unrelated accesses is allowed.

One way is to view the branch as a barrier internally in LLVM, then the current optimisations wouldn't touch it - it would be compatible with the current practice without too much additional work.

if we lower an atomic operation to a loop (cas loop, or ldxr/stxr), does that naturally protect us from issues like this?

Not by default. Consider the following litmus test, which has a CAS loop on P0:

AArch64 LB+CAS

{
0:X0=x; 0:X3=y;
1:X0=y; 1:X3=x;
}
 P0                  | P1          ;
 label: LDXR W1,[X0] | LDR W1,[X0] ;
 STXR W4, W1, [X0]   | NOP         ;
 CBNZ W4, label      | NOP         ;
 MOV W2,#1           | MOV W2,#1   ;
 STR W2,[X3]         | STR W2,[X3] ;
exists (0:X1=1 /\ 1:X1=1 /\ 0:X4=0)

When run under the AArch64 model, we get the outcomes:

0:X1=0; 0:X4=0; 1:X1=0;
0:X1=0; 0:X4=0; 1:X1=1;
0:X1=1; 0:X4=0; 1:X1=0;
0:X1=1; 0:X4=0; 1:X1=1; <-- problem!

To enforce the order, we need the branch to depend on the value loaded, as you mentioned earlier, and not on the exclusivity check:

AArch64 LB005

{
0:X0=x; 0:X3=y;
1:X0=y; 1:X3=x;
}
 P0          | P1          ;
 LDR W1,[X0] | LDR W1,[X0] ;
 CBZ W1, L0  | CBZ W1, L1  ;
L0: NOP      | L1: NOP     ;
 MOV W2,#1   | MOV W2,#1   ;
 STR W2,[X3] | STR W2,[X3] ;
exists (0:X1=1 /\ 1:X1=1)

which has outcomes:

0:X1=0; 1:X1=0;
0:X1=0; 1:X1=1;
0:X1=1; 1:X1=0;

I don't think we currently have a rule that forbids sinking a relaxed load past a relaxed store, or hoisting a relaxed store past a relaxed load. I don't expect this to have a significant performance impact, because we rarely do that sort of transform anyway, but we probably need to add some IR modeling for the "global atomic state". (Currently, IR-level analysis models relaxed loads as just a read+write to the address.)

Yes, I suspect this will be a valuable addition.

@lukeg101
Member Author

Also, I will add some answers to these questions in the questions section of the proposal as they are good!

@Wilco1

Wilco1 commented Oct 24, 2023

I meant, if the compiler sees there's an existing acquire load/release store/fence between the relaxed load and the relaxed store, can the compiler elide the branch?

In theory yes - we would need to be careful with funky instructions like LDAPR where re-ordering of unrelated accesses is allowed

A one-way acquire/release doesn't work for this case. If there were a two-way barrier, then it would block reordering and load buffering.

if we lower an atomic operation to a loop (cas loop, or ldxr/stxr), does that naturally protect us from issues like this?

Not by default. Consider the following litmus test, which has a CAS loop on P0:

This example looks incorrect - the right-hand side must block reordering too. And then I think it will work.

@lukeg101
Member Author

A one-way acquire/release doesn't work for this case. If there were a two-way barrier, then it would block reordering and load buffering.

Indeed - we need to be careful not to mix LDAPR and relaxed semantics here; perhaps the compiler could carry this metadata to prevent it.

This example looks incorrect - the right-hand side must block reordering too. And then I think it will work.

We still observe the bad outcome even if we put a DMB ISH on P1:

AArch64 LB+CAS

{
0:X0=x; 0:X3=y;
1:X0=y; 1:X3=x;
}
 P0                  | P1          ;
 label: LDXR W1,[X0] | LDR W1,[X0] ;
 STXR W4, W1, [X0]   | NOP         ;
 CBNZ W4, label      | DMB ISH     ;
 MOV W2,#1           | MOV W2,#1   ;
 STR W2,[X3]         | STR W2,[X3] ;
exists (0:X1=1 /\ 1:X1=1 /\ 0:X4=0)

When run under the AArch64 model, we get the outcomes:

0:X1=0; 0:X4=0; 1:X1=0;
0:X1=0; 0:X4=0; 1:X1=1;
0:X1=1; 0:X4=0; 1:X1=0;
0:X1=1; 0:X4=0; 1:X1=1; !!

Although Wilco and I are discussing whether we can observe the !! outcome in reality... so far no luck on an Apple M1.

@efriedma-quic
Collaborator

Your model for stxr seems weird; naively, I'd expect that the cbnz depends on the stxr, and the stxr depends on the ldxr, therefore the cbnz depends on the ldxr. And it's hard for me to imagine a design where that doesn't hold; the core would have to somehow decide the stxr is going to succeed before it actually loads the value for the ldxr. Maybe the people writing the model wanted to leave the possibility open, though...

@Wilco1

Wilco1 commented Oct 24, 2023

Agreed, there is a direct dependency between LDXR and STXR, but even without it the STXR can only succeed if the LDXR finishes. So I don't see any potential for reordering either. There is a related case where the model allows a later LDXR to reorder and execute in the middle of a LDAXR/STXR loop - however this is impossible since this must clear the exclusive monitor (and it does not happen on any hardware we tried).

Indeed - we need to be careful not to mix in LDAPR and relaxed semantics here - perhaps the compiler could carry this metadata to prevent it.

Actually LDAR/STLR are not sufficient barriers, since they allow relaxed loads/stores to pass in one direction and then reorder with other relaxed loads/stores. So anything but a full barrier would not work.

@lukeg101
Member Author

So anything but a full barrier would not work.

Agreed

Your model for stxr seems weird

I agree - I have raised this with the model experts, and hopefully they will fix it or explain why this is the case.

@lukeg101
Member Author

I have updated the blog post with your questions. You may need to refresh your cache to see them.

@lukeg101
Member Author

Also, Wilco has now fixed the original issue (the control dependency being masked at higher opt levels in GCC), which I have validated with tests. I propose we fix armv7 in clang in the same way for conformance.

@efriedma-quic
Collaborator

So, to summarize:

  • There's a general issue for all targets with a weak memory model that load buffering makes relaxed loads give weird results, which can be solved with a dummy branch.
  • There's an armv7-specific issue that predication makes load buffering issues show up in more cases than other targets with a weak memory model, at least according to the formal model. (We don't have any evidence actual hardware does this.)
  • A dummy branch solves both the general issue and the armv7-specific issue.
  • As a stopgap, disabling predication of atomic ops makes the armv7-specific issue much less likely. (But not impossible; see #65106 (comment).)

I'm not against disabling predication of atomic ops... the runtime cost is probably unmeasurable in practice. But probably better to hold off on it if you're planning to start a wider discussion of general codegen for relaxed loads, just to avoid any potential confusion.

@lukeg101
Member Author

But probably better to hold off on it if you're planning to start a wider discussion of general codegen for relaxed loads, just to avoid any potential confusion.

Wilco has already fixed predication in GCC; it is separate, but happens to mask an effect here that has a more general cause. We should fix predication regardless; it's just unfortunate that the general issue is surfacing at the same time: here

@jyknight
Member

Can you explain why the example (as modified with x as a relaxed atomic) is disallowed by the standard? I would have thought it allowed, but I'm sure I could be mistaken.

@lukeg101
Member Author

Hi James, Sure I can do that - please bear with me.

@lukeg101
Member Author

lukeg101 commented Oct 30, 2023

Hi James,

So, the test (where x has relaxed-order ops) is allowed by the standard model, as load-store reordering is allowed even if there is a control-flow branch in the way. My proposal is based on the stronger RC11 model, which forbids load-store reordering. In that context the above reasoning applies, but otherwise consider what is permitted by the standard below.

Consider the stronger test, which puts branches on both threads, makes x atomic, and applies relaxed-ordered operations:

{ *x = 0; *y = 0; }

P0 (atomic_int* y,atomic_int* x) {
  int r0 = atomic_load_explicit(x, memory_order_relaxed);
  if (r0 == 1) {
    atomic_store_explicit(y,1,memory_order_relaxed);
  }
}

P1 (atomic_int* y,atomic_int* x) {
  int r0 = atomic_load_explicit(y,memory_order_acquire);
  if (r0 == 1) {
    atomic_store_explicit(x, 1, memory_order_relaxed);
  }
}

exists (P0:r0=1 /\ P1:r0=1)

When run under C/C++ models (C23), we get the following outcomes:

{ P0:r0=0; P1:r0=0; }
{ P0:r0=1; P1:r0=1; }

The outcome { P0:r0=0; P1:r0=0; } arises if P0 reads x, P1 reads y and then the rest of each thread continues.

Notice the outcome { P0:r0=0; P1:r0=1; } is forbidden. It could only occur if the store to y on P0 happened before the load of x upon which it is predicated, which would violate the control flow in the source program. Conversely, { P0:r0=1; P1:r0=0; } is forbidden for similar reasons.

The outcome { P0:r0=1; P1:r0=1; } is, however, allowed. It arises if the store to x on P1 happens before the load of y on P1 (and vice versa for P0). This is allowed by the standard, as load-store reordering is permitted (yes, regardless of the branch...).

So the case where both loads are reordered is allowed, but the case where just one reorders is forbidden. This doesn't make much sense, does it? I have tested this on three separate encodings of the C11 memory model (from peer-reviewed papers - happy to link as needed), and all give the same results. My comment below identifies where in the standard this is explicitly stated.

Unfortunately, when atomics are compiled to predicated instructions for Armv7-a we do observe the outcome { P0:r0=1; P1:r0=1; } under the arm model:

ARM test

{ [P0_r0]=0;[P1_r0]=0;[x]=0;[y]=0;
  %P0_x=x;%P0_y=y;
  %P1_x=x;%P1_y=y }

  P0                   |  P1                   ;
   LDR R2,[%P0_x]      |   LDR R2,[%P1_y]      ;
   CMP R2,#1           |   CMP R2, #1          ;
   MOVEQ R1,#1         |   MOVEQ R1,#1         ;
   STREQ R1,[%P0_y]    |   STREQ R1,[%P1_x]    ;


exists (0:R2=1 /\ 1:R2=1)

which has outcomes under the arm.cat model:

0:R2=0; 1:R2=0;
0:R2=1; 1:R2=1; 

So:

  • The arm.cat model may be wrong, or the C23 model may be wrong. Neither may be wrong, but:
  • We cannot be sure that there isn't a machine out there that exhibits this behaviour - I have tried to find one. Nevertheless:
  • The standard permits this behaviour; does this accurately capture the semantics of if-statements?

In any case, for this bug report, using CMP R0,#1; BNE L0x2c instead of predicated instructions avoids the question of whether we can trust these models, prevents the 1-1 outcome, and better represents the source program's intent as a branch past the if-body:

ARM test-fixed

{ [P0_r0]=0;[P1_r0]=0;[x]=0;[y]=0;
  %P0_P0_r0=P0_r0;%P0_x=x;%P0_y=y;
  %P1_P1_r0=P1_r0;%P1_x=x;%P1_y=y }

  P0                   |  P1                   ;
   LDR R0,[%P0_x]      |   LDR R0,[%P1_y]      ;
   CMP R0,#1           |   DMB ISH             ;
                       |   CMP R0,#1           ;
   BNE L0x2c           |   BNE L0x3c           ;
   MOV R2,#1           |   MOV R2,#1           ;
   STR R2,[%P0_y]      |   STR R2,[%P1_x]      ;
  L0x2c:               |  L0x3c:               ;
   STR R0,[%P0_P0_r0]  |   STR R0,[%P1_P1_r0]  ;


exists (P0_r0=1 /\ P1_r0=1)

EDIT: under the arm.cat, aarch32.cat, PPC, RISC-V, and AArch64 models we observe only the 0-0 outcome, i.e. we can prevent this in the implementation by using a branch with a label.

Do you oppose this change?

Separately, there is the proposal to outlaw load-store reordering, which is trying to strengthen our implementation under a flag regardless of modelling issues. I researched that issue following this bug report and chose to write it up separately; it is unfortunate that they are linked.

@lukeg101
Member Author

lukeg101 commented Oct 30, 2023

As for the standards doc, the above 'double-branch' example comes up in Recommended practice, section 7.17.3 of C23:

The requirements do not forbid r1 == 42 && r2 == 42 in the following example, with x and y initially zero:

However, this is not useful behavior, and implementations should not allow it.
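For reference, the example that passage refers to is the classic out-of-thin-air shape, roughly as follows (paraphrased from the standard's note; x and y are relaxed atomics, initially zero):

#include <stdatomic.h>

atomic_int x, y;

void thread1(void) {
  int r1 = atomic_load_explicit(&x, memory_order_relaxed);
  if (r1 == 42)
    atomic_store_explicit(&y, r1, memory_order_relaxed);
}

void thread2(void) {
  int r2 = atomic_load_explicit(&y, memory_order_relaxed);
  if (r2 == 42)
    atomic_store_explicit(&x, 42, memory_order_relaxed);
}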

@jyknight
Member

In my opinion, as long as we're within-spec to do what we're doing here, we should not forbid conversion to predicated instructions as a "stand-alone" change, since doing so doesn't fully address the issues raised (per additional examples by @efriedma-quic).

Additionally, since the particular issue with predicated instructions is only observed on the model, and not on actual hardware, there doesn't seem to be a forcing function to make this particular change ahead of the discussion on whether and how to strengthen relaxed atomics in general (e.g. as your paper proposed).

@Wilco1

Wilco1 commented Oct 31, 2023

Almost all of the examples shown are generic, i.e. they do not rely on v7 conditional execution and can equally happen on other targets using conditional moves. The only v7-specific examples are conditionally executed atomics.

Avoiding conditional execution of atomics on v7 means it is now equivalent to the v8 model. That by itself removes a layer of complexity and unnecessary differences. The cost of this is negligible, and in GCC it was a good code cleanup, with the resulting code being smaller and easier to maintain.

@anthonywilliams

The original bug report above is incorrect. Relaxed atomics do not impose any ordering semantics between threads. It is therefore possible to get the scenario where different threads disagree on the order of operations on distinct variables. This can lead to "out of thin air" consequences, where a value is stored on a branch, and the branch is only taken if that store is seen by another thread. Though the C++ standard discourages implementations from actually creating such scenarios, it is permitted by the memory model.

Note that in this case, the acquire load has no effect on the consequences, because there is not a release store to the same memory location.

I do not see a need to change this aspect of relaxed atomics. The point of relaxed atomics is to maximize the opportunity for performance by minimizing constraints. If you want ordering constraints, then you should put them in your source code: use acquire and release operations, or even SC operations.

@lukeg101
Member Author

Thank you to everyone who has provided feedback on this. As a PhD student I appreciate every opportunity to learn! In response to the feedback I have received, I am closing this bug (and all other open bugs of this kind) as invalid. I am checking other bug reports to ensure there are no other invalid ones.
