
Document that atomic::fetch_add with unused result must generate ldadd with non-zero output register on Arm with LSE #69503

Open
gonzalobg opened this issue Oct 18, 2023 · 4 comments

Comments

@gonzalobg
Contributor

gonzalobg commented Oct 18, 2023

The following code:

#include <atomic>
#include <cstdint>

void inc(std::atomic<uint32_t>* p) {
    p->fetch_add(1, std::memory_order_relaxed);
}

discards the result of the atomic rmw add operation.

On Arm, it generates an ldadd instruction, but I'd expect it to generate an stadd instead, since the result is not used.
See https://gcc.godbolt.org/z/7K5YYEjT9
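
For contrast, a minimal sketch of the case where the result is used (the register choices below are illustrative, assuming an LSE-enabled target such as -march=armv8.1-a from the Godbolt link):

#include <atomic>
#include <cstdint>

// The result is consumed, so this must remain an ldadd with a real
// destination register (e.g. "ldadd w8, w0, [x0]"); only the
// discarded-result case above could even be considered for stadd.
uint32_t inc_and_use(std::atomic<uint32_t>* p) {
    return p->fetch_add(1, std::memory_order_relaxed);
}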

@gonzalobg changed the title from "Fails to optimize ldadd to stadd" to "atomic::fetch_add with unused result generates ldadd instead of stadd on Arm" on Oct 18, 2023
@llvmbot
Collaborator

llvmbot commented Oct 18, 2023

@llvm/issue-subscribers-backend-aarch64

@sun-jacobi
Member

sun-jacobi commented Nov 7, 2023

@EugeneZelenko @ostannard @gonzalobg Hi, can I work on this issue?

@gonzalobg
Contributor Author

gonzalobg commented Dec 5, 2023

cc @ktkachov-arm

There is a litmus test from Will here (https://gcc.gnu.org/pipermail/gcc-patches/2018-October/509632.html) showing that, for this optimization to be sound, the implementation of the acquire fence would need to be strengthened to also order stores, because dmb ld does not order stadd. The current implementation of acquire fences therefore does not suffice, and strengthening it would be an ABI-breaking change (due to the atomics ABI).

EDIT: reproducing Will's test for completeness

#include <stdatomic.h>

void P0 (atomic_int* y, atomic_int* x) {
  atomic_store_explicit(x,1,memory_order_relaxed);
  atomic_thread_fence(memory_order_release);
  atomic_store_explicit(y,1,memory_order_relaxed);
}

void P1 (atomic_int* y, atomic_int* x) {
  atomic_fetch_add_explicit(y,1,memory_order_relaxed);	// STADD
  atomic_thread_fence(memory_order_acquire);
  int r0 = atomic_load_explicit(x,memory_order_relaxed);
}

void P2 (atomic_int* y) {
  int r1 = atomic_load_explicit(y,memory_order_relaxed);
}

@gonzalobg
Contributor Author

gonzalobg commented Dec 6, 2023

Re-opening this issue to document somewhere in the source code why this optimization is unsound for Arm LSE atomics (it is not unsound for other hardware architectures with similar ldadd vs. stadd vs. ldadd-wzr variants).

@ktkachov-arm would it be possible for someone from Arm to prepare the PR to document this and review it?

I believe the following herd tests show that fetch_add must be compiled to an ldadd with a non-zero destination register, since otherwise the current implementation of acquire fences (dmb ishld, which is part of the atomics ABI) does not suffice (assuming the Arm memory consistency model in herd covers these cases). I believe this shows that it is not sound to optimize a fetch_add with an unused result on Arm LSE to either:

  • ldadd w0, wzr, [addr] (an ldadd with the zero register as the output operand), or
  • stadd

The relevant part of the manual is the Atomic instructions section (C3.2.12 on the copy I have at hand), which states this precise exception:

The ST<OP> instructions, and LD<OP> instructions where the destination register is WZR or XZR, are not regarded as doing a read for the purpose of a DMB LD barrier.
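
Concretely, the exception covers these forms (the register choices below are illustrative):

  stadd w1, [x0]        // ST<OP>: not regarded as a read for DMB LD
  ldadd w1, wzr, [x0]   // LD<OP> with WZR destination: not regarded as a read for DMB LD
  ldadd w1, w2, [x0]    // LD<OP> with a real destination: its read is ordered by DMB LD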

The herd example for Will's litmus test is the following (test it, e.g., here: https://developer.arm.com/herd7):

AArch64 MP
{
0:X1=x; 0:X2=y;
1:X1=x; 1:X2=y;
2:X1=x; 2:X2=y;
}
 P0          | P1               | P2          ;
 MOV W0,#1   | MOV W0,#1        | LDR W0,[X2] ;
 STR W0,[X1] | LDADD W0,W0,[X2] |             ;
 dmb ish     | dmb ishld        |             ;
 STR W0,[X2] | LDR W0,[X1]      |             ;
exists
(1:X0=2 /\ 2:X0=1)

which returns 5 states:

1:X0=0; 2:X0=0;
1:X0=0; 2:X0=1;
1:X0=1; 2:X0=0;
1:X0=1; 2:X0=1;
1:X0=1; 2:X0=2;  // If 2:X0==2, then 1:X0==1

Changing the LDADD W0,W0,[X2] to an STADD W0,[X2] or to an LDADD W0,WZR,[X2] grows it to 6 states, with the new state being:

1:X0=0; 2:X0=2;

which means that observing 2:X0 == 2 does not suffice to guarantee 1:X0 == 1, since it can also be 1:X0 == 0.

I believe that if, hypothetically speaking, one wanted to allow these optimizations, then either:

  • the ISA guarantees for dmb ishld would need to be strengthened to also cover the loads performed by STADD and LDADD W0,WZR,[addr] (e.g. by removing the exception for ST<OP> and LD<OP> cited above), or
  • the atomic ABI would need to change from dmb ishld to dmb ish for acquire fences (see the sketch below).
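
As a sketch of the second option, I'd expect the following variant of the test above (STADD in P1 plus a full dmb ish as the acquire fence; untested here, but easy to check at the herd7 link above) to again forbid the problematic 1:X0=0; 2:X0=2 outcome:

AArch64 MP+stadd+dmbish
{
0:X1=x; 0:X2=y;
1:X1=x; 1:X2=y;
2:X1=x; 2:X2=y;
}
 P0          | P1             | P2          ;
 MOV W0,#1   | MOV W0,#1      | LDR W0,[X2] ;
 STR W0,[X1] | STADD W0,[X2]  |             ;
 dmb ish     | dmb ish        |             ;
 STR W0,[X2] | LDR W0,[X1]    |             ;
exists
(1:X0=0 /\ 2:X0=2)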

@gonzalobg reopened this on Dec 6, 2023
@gonzalobg changed the title from "atomic::fetch_add with unused result generates ldadd instead of stadd on Arm" to "Document that atomic::fetch_add with unused result must generate ldadd with non-zero output register on Arm with LSE" on Dec 6, 2023