Handling "atomic_compare_exchange_strong_explicit" in C11 #1

XuankangLin · 2017-02-28T19:04:03Z

Hi,

I was using herd7 (7.43, installed via OPAM) to run against the C11 litmus from https://github.com/nchong/c11popl15/blob/master/tests/a2.litmus:

C a2
{ [x] = 0; [y] = 0; [z] = 0; }

P0 (atomic_int* x, atomic_int* y, atomic_int *z) {
  int r0 = atomic_load_explicit(y, memory_order_relaxed);
  int r1 = atomic_compare_exchange_strong_explicit(x, z, 1, memory_order_release, memory_order_relaxed);
}

P1 (atomic_int* x, volatile int* y) {
  int r2 = atomic_load_explicit(x, memory_order_acquire);
  if (r2) {
    *y = 1;
  }
}

But herd7 complains:

line 6, characters 11-50: unexpected 'atomic_compare_exchange_strong_explicit' (in prog)

It seems "atomic_load_explicit" (and also "atomic_store_explicit" according to other tests) are recognized, but not "atomic_compare_exchange_strong_explicit".

I tried with latest version in master branch up to now, it complains the message as well. I also tried with old version herd in https://github.com/herd/herdtools, and it can successfully finish the run there.

So is herd7 indeed missing the part for "atomic_compare_exchange_strong_explicit" at the moment?
Thanks.

The text was updated successfully, but these errors were encountered:

maranget · 2017-03-01T08:56:47Z

Hi, Thank you for your report. This is a isfeature, a silly one, as herd7 implements atomic_compare_exchange_strong and atomic_compare_exchange_weak. Correction is a matter of minutes. That said, semantics is untested. Thanks again.

…

--Luc

Hi, I was using herd7 (7.43, installed via OPAM) to run against the C11 litmus from <https://github.com/nchong/c11popl15/blob/master/tests/a2.litmus>: ``` C a2 { [x] = 0; [y] = 0; [z] = 0; } P0 (atomic_int* x, atomic_int* y, atomic_int *z) { int r0 = atomic_load_explicit(y, memory_order_relaxed); int r1 = atomic_compare_exchange_strong_explicit(x, z, 1, memory_order_release, memory_order_relaxed); } P1 (atomic_int* x, volatile int* y) { int r2 = atomic_load_explicit(x, memory_order_acquire); if (r2) { *y = 1; } } ``` But herd7 complains: ``` line 6, characters 11-50: unexpected 'atomic_compare_exchange_strong_explicit' (in prog) ``` It seems "atomic_load_explicit" (and also "atomic_store_explicit" according to other tests) are recognized, but not "atomic_compare_exchange_strong_explicit". I also tried with old version herd in <https://github.com/herd/herdtools>, and it can successfully finish the run there. So is herd7 indeed missing the part for "atomic_compare_exchange_strong_explicit" at the moment? Thanks. -- You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub: #1

maranget · 2017-03-01T09:30:15Z

The issue should be solved in 7.44, now available on github as source. <https://github.com/herd/herdtools7/tree/7.44> That said, I was too eager to bump version and publish on opam, as I see you have sent another PR..

…

--Luc

Hi, I was using herd7 (7.43, installed via OPAM) to run against the C11 litmus from <https://github.com/nchong/c11popl15/blob/master/tests/a2.litmus>: ``` C a2 { [x] = 0; [y] = 0; [z] = 0; } P0 (atomic_int* x, atomic_int* y, atomic_int *z) { int r0 = atomic_load_explicit(y, memory_order_relaxed); int r1 = atomic_compare_exchange_strong_explicit(x, z, 1, memory_order_release, memory_order_relaxed); } P1 (atomic_int* x, volatile int* y) { int r2 = atomic_load_explicit(x, memory_order_acquire); if (r2) { *y = 1; } } ``` But herd7 complains: ``` line 6, characters 11-50: unexpected 'atomic_compare_exchange_strong_explicit' (in prog) ``` It seems "atomic_load_explicit" (and also "atomic_store_explicit" according to other tests) are recognized, but not "atomic_compare_exchange_strong_explicit". I also tried with old version herd in <https://github.com/herd/herdtools>, and it can successfully finish the run there. So is herd7 indeed missing the part for "atomic_compare_exchange_strong_explicit" at the moment? Thanks. -- You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub: #1

Consider the following: ``` ADD W1, W0, W0, lsl #1; ``` when `0:X0=1` the final result `0:X1=3` should always occur, however `0:X1=4` occurs since a single variable `v` is used when the registers are the same, when two variables are required. This patch fixes the above for `I_OP3` and adds a `MOVZ` test missed before.

Consider the following AArch64 instruction: ``` CMP WZR, w1, asr #1; CMP XZR, x1, asr #1; ``` which arithmetic shifts right the value in w1/x1 by 1 before comparing with xzr/wzr. This is equivalent to: ``` SUBS XZR, WZR, w1, asr #1; SUBS XZR, XZR, x1, asr #1; ``` This patch adds: - barrel shifters to the CMP command - tests for CMP with barrel shifters - tests for the equivalent SUBS instructions tested with: - pre-commit tests - all pass - make test - all pass - make version - all pass/no warnings

Added parsing support for the const type

[gen,herd] Add memtag() support for checking MTE tag locations in condition statements This change addressing the issues highlighted by issue #68 by using the `[tag(x)]` syntax for accessing the physical tag of a given location x. The herd support for this is built upon the already existing LV (LocationValue) and LL (LocationLocation) comparison operators present in herdtools, with `[tag(x)]` just being syntactical sugar for `x.atag` in generator, along with logic added to effectively returning a global location for x with the tag updated to the latest physical value The generator changes to support this include tracking of tag writes for a given cycle, and outputting write values into the condition statement, mirroring exactly what happens for Ord writes in non-MTE cycles. Notice that, by contrast, herd features a specific abstract syntax for tag identifiers (`System (TAG,x)` for concrete `tag(x)` or old style `x.atag`). Example tests: ``` $ diyone7 -arch AArch64 Rfe PodRW CoeTT PodWW -variant memtag AArch64 A "Rfe PodRWPT CoeTT PodWWTP" Generator=diyone7 (version 7.56+02~dev) Prefetch=0:x=F,0:y=W,1:y=F,1:x=W Com=Co Rf Orig=Rfe PodRWPT CoeTT PodWWTP { 0:X0=x:green; 0:X2=y:red; 0:X3=y:green; 1:X0=y:blue; 1:X1=y:red; 1:X3=x:green; } P0 | P1 ; LDR W1,[X0] | STG X0,[X1] ; STG X2,[X3] | MOV W2,#1 ; | STR W2,[X3] ; exists ([tag(y)]=:blue /\ 0:X1=1) ``` Herd: ``` Test A Allowed States 4 0:X1=0; [tag(y)]=:blue; 0:X1=0; [tag(y)]=:red; 0:X1=1; [tag(y)]=:blue; 0:X1=1; [tag(y)]=:red; Ok Witnesses Positive: 1 Negative: 3 Condition exists ([tag(y)]=:blue /\ 0:X1=1) Observation A Sometimes 1 3 Time A 0.01 Hash=61e2be468f765474eba55b7e54797e61 ``` Equivalent non-MTE cycle ``` AArch64 A "Rfe PodRW Coe PodWW" Generator=diyone7 (version 7.56+02~dev) Prefetch=0:x=F,0:y=W,1:y=F,1:x=W Com=Co Rf Orig=Rfe PodRW Coe PodWW { 0:X0=x; 0:X3=y; 1:X1=y; 1:X3=x; } P0 | P1 ; LDR W1,[X0] | MOV W0,#2 ; MOV W2,#1 | STR W0,[X1] ; STR W2,[X3] | MOV W2,#1 ; | STR W2,[X3] ; exists ([y]=2 /\ 0:X1=1) ``` Case where new syntax helps generation of a useful forbidden test: ``` $ diyone7 -arch AArch64 -variant memtag "DMB.STdWWTT CoeTL PodWWLL PodWRLA Amo.CasAP CoePT" AArch64 A "DMB.STdWWTT CoeTL PodWWLL PodWRLA Amo.CasAP CoePT" Generator=diyone7 (version 7.56+02~dev) Prefetch=0:x=F,0:y=W,1:y=F,1:x=W Com=Co Co Orig=DMB.STdWWTT CoeTL PodWWLL PodWRLA Amo.CasAP CoePT { 0:X0=x:red; 0:X1=x:green; 0:X2=y:red; 0:X3=y:green; 1:X1=y:red; 1:X3=z:green; 1:X4=x:green; } P0 | P1 ; STG X0,[X1] | MOV W0,#1 ; DMB ST | STLR W0,[X1] ; STG X2,[X3] | MOV W2,#1 ; | STLR W2,[X3] ; | MOV W5,#0 ; | MOV W6,#1 ; | CASA W5,W6,[X4] ; exists ([tag(x)]=:red /\ [y]=1 /\ 1:X5=0) /\ ~(fault (P1,x) \/ fault (P1,y)) ``` Herd: ``` Test A Allowed States 2 1:X5=0; [y]=1; [tag(x)]=:red; Fault(P1,x:green); ~Fault(P1,y); 1:X5=0; [y]=1; [tag(x)]=:red; Fault(P1,x:green); Fault(P1,y:red); No Witnesses Positive: 0 Negative: 4 Condition exists ([tag(x)]=:red /\ [y]=1 /\ 1:X5=0 /\ not (fault(P1,x) \/ fault(P1,y))) Observation A Never 0 4 Time A 0.02 Hash=319b4e72a07bcb4c8b66028ae669c06c ``` Compared to old generation: ``` Test A Allowed States 4 1:X5=0; [y]=1; ~Fault(P1,y); ~Fault(P1,x); 1:X5=0; [y]=1; Fault(P1,x:green); ~Fault(P1,y); 1:X5=0; [y]=1; Fault(P1,x:green); Fault(P1,y:red); 1:X5=0; [y]=1; Fault(P1,y:red); ~Fault(P1,x); Ok Witnesses Positive: 1 Negative: 3 Condition exists ([y]=1 /\ 1:X5=0 /\ not (fault(P1,x) \/ fault(P1,y))) Observation A Sometimes 1 3 Time A 0.06 Hash=cbea9795a448570de810e82d4f6ed75d ```

…ll barrier in that they prevent a read or write program-order-before the atomic from being reordered with a read or write program-order-after the atomic. However, a read or write program-order-after the atomic is allowed to be reordered with the write part of the atomic and a read or write program-order-before the atomic is allowed to be reordered with the read part of the atomic. Arm deemed this behaviour undesirable and the following change strengthen the memory model to forbid it. 1. Motivation ------------- The following behaviour: AArch64 SB+CASAL+LDR { 0:X1=x; 0:X3=y; 1:X1=y; 1:X3=x; 0:X4=0; 1:X4=0; x=0; y=0; } P0 | P1 ; MOV W0,#1 | MOV W0,#1 ; CASAL W4,W0,[X1] | CASAL W4, W0,[X1] ; LDR W2,[X3] | LDR W2,[X3] ; exists (0:X2=0 /\ 1:X2=0 /\ x=1 /\ y=1) is currently allowed. More precisely, the current definition of CASAL does not provide all of the ordering guarantees that are provided by a similar instruction in another architecture. The semantics of this other instruction have been used as the basis for API routines in a mainstream operating system, and this has resulted in a significant amount of existing code expecting stronger semantics than currently provided by CASAL. This incompatibility requires extra barriers to be inserted in the Arm version of these API routines, which results in more ordering operations than is strictly necessary during these routines. Arm proposes the following changes to provide semantics that match the expectation of the API routines. The proposed strengthening applies to all atomics (not just CAS) which bear the annotation -AL, and does not apply to Load/Store exclusives. This strengthening is aligned in principle with the fact that the following behaviour is already forbidden by the architecture: AArch64 SB+STR+CASAL+LDR2 { 0:X5=x; 0:X3=y; 1:X5=y; 1:X3=x; 0:X1=z; 1:X1=z; 0:X4=0; 1:X4=0; x=0; y=0; z=0; } P0 | P1 ; MOV W6,#1 | MOV W6,#1 ; STR W6,[X5] | STR W6,[X5] ; MOV W0,#1 | MOV W0,#1 ; CASAL W4,W0,[X1] | CASAL W4, W0,[X1] ; LDR W2,[X3] | LDR W2,[X3] ; exists (0:X2=0 /\ 1:X2=0 /\ x=1 /\ y=1 /\ z=1) which is already forbidden by the following clause in the definition of Barrier-ordered-before: - RW1 appears in program order before an atomic instruction with both Acquire and Release semantics that appears in program order before RW2. However it is worth noting that this strengthening would not forbid this behaviour (note that in the following test the CAS instructions fail and therefore do not store to memory): AArch64 SB+STR+CASAL+LDR { 0:X1=x; 0:X3=y; 1:X1=y; 1:X3=x; 0:X4=0; 1:X4=0; x=0; y=0; } P0 | P1 ; MOV W0,#1 | MOV W0,#1 ; STR W0,[X1] | STR W0,[X1] ; CASAL W4,W0,[X1] | CASAL W4, W0,[X1] ; LDR W2,[X3] | LDR W2,[X3] ; exists (0:X2=0 /\ 1:X2=0 /\ x=1 /\ y=1) Arm documentation change ------------------------ In Barrier-ordered-before, the clause which reads: - RW1 appears in program order before an atomic instruction with both Acquire and Release semantics that appears in program order before RW2. is strengthened to read: - RW1 is a write W1 and is generated by an atomic instruction with both Acquire and Release semantics Arm cat file change ------------------- In bob, the clause which reads: | po; ([A];amo;[L]); po is strengthened to read: | [range([A];amo;[L])]; po

…ll barrier (#322) in that they prevent a read or write program-order-before the atomic from being reordered with a read or write program-order-after the atomic. However, a read or write program-order-after the atomic is allowed to be reordered with the write part of the atomic and a read or write program-order-before the atomic is allowed to be reordered with the read part of the atomic. Arm deemed this behaviour undesirable and the following change strengthen the memory model to forbid it. 1. Motivation ------------- The following behaviour: AArch64 SB+CASAL+LDR { 0:X1=x; 0:X3=y; 1:X1=y; 1:X3=x; 0:X4=0; 1:X4=0; x=0; y=0; } P0 | P1 ; MOV W0,#1 | MOV W0,#1 ; CASAL W4,W0,[X1] | CASAL W4, W0,[X1] ; LDR W2,[X3] | LDR W2,[X3] ; exists (0:X2=0 /\ 1:X2=0 /\ x=1 /\ y=1) is currently allowed. More precisely, the current definition of CASAL does not provide all of the ordering guarantees that are provided by a similar instruction in another architecture. The semantics of this other instruction have been used as the basis for API routines in a mainstream operating system, and this has resulted in a significant amount of existing code expecting stronger semantics than currently provided by CASAL. This incompatibility requires extra barriers to be inserted in the Arm version of these API routines, which results in more ordering operations than is strictly necessary during these routines. Arm proposes the following changes to provide semantics that match the expectation of the API routines. The proposed strengthening applies to all atomics (not just CAS) which bear the annotation -AL, and does not apply to Load/Store exclusives. This strengthening is aligned in principle with the fact that the following behaviour is already forbidden by the architecture: AArch64 SB+STR+CASAL+LDR2 { 0:X5=x; 0:X3=y; 1:X5=y; 1:X3=x; 0:X1=z; 1:X1=z; 0:X4=0; 1:X4=0; x=0; y=0; z=0; } P0 | P1 ; MOV W6,#1 | MOV W6,#1 ; STR W6,[X5] | STR W6,[X5] ; MOV W0,#1 | MOV W0,#1 ; CASAL W4,W0,[X1] | CASAL W4, W0,[X1] ; LDR W2,[X3] | LDR W2,[X3] ; exists (0:X2=0 /\ 1:X2=0 /\ x=1 /\ y=1 /\ z=1) which is already forbidden by the following clause in the definition of Barrier-ordered-before: - RW1 appears in program order before an atomic instruction with both Acquire and Release semantics that appears in program order before RW2. However it is worth noting that this strengthening would not forbid this behaviour (note that in the following test the CAS instructions fail and therefore do not store to memory): AArch64 SB+STR+CASAL+LDR { 0:X1=x; 0:X3=y; 1:X1=y; 1:X3=x; 0:X4=0; 1:X4=0; x=0; y=0; } P0 | P1 ; MOV W0,#1 | MOV W0,#1 ; STR W0,[X1] | STR W0,[X1] ; CASAL W4,W0,[X1] | CASAL W4, W0,[X1] ; LDR W2,[X3] | LDR W2,[X3] ; exists (0:X2=0 /\ 1:X2=0 /\ x=1 /\ y=1) Arm documentation change ------------------------ In Barrier-ordered-before, the clause which reads: - RW1 appears in program order before an atomic instruction with both Acquire and Release semantics that appears in program order before RW2. is strengthened to read: - RW1 is a write W1 and is generated by an atomic instruction with both Acquire and Release semantics Arm cat file change ------------------- In bob, the clause which reads: | po; ([A];amo;[L]); po is strengthened to read: | [range([A];amo;[L])]; po

This change is a relaxation. The second clause of Atomic-ordered-before was put in place to accommodate for C++ ordering. However this clause need not be that strong to meet C++ and Arm relaxes it as a consequence. Motivation ------------- The following behaviour: AArch64 MP+rel+ctrl-rfi-rmw-rfiXA-R { 0:X1=x; 0:X3=y; 1:X1=y; 1:X11=x; 1:X5=z; } P0 | P1 ; MOV X0,#1 | LDR X0,[X1] ; STR X0,[X1] | CMP X0,#1 ; MOV X2,#1 | B.NE L0 ; | L0: ; STLR X2,[X3] | STR X2,[X5] ; | LDXR X6, [X5] ; | STXR W7, X8, [X5]; | LDAR X9,[X5] ; | LDR X10,[X11] ; exists (1:X0=1 /\ 1:X10=0 /\ 1:X7=0) is currently forbidden. This stems from the second clause of Atomic-ordered-before, which was put in place to accommodate for C++ release sequences. However, this second clause need not be that strong to meet C++, and therefore Arm relaxes it so that MP+rel+ctrl-rfi-rmw-rfiXA-R is allowed. 2. Changes to the Arm spec ------------------------------- In Atomic-ordered-before, the clause which reads: - RW1 is a write W1 generated by an atomic instruction or a successful Store-Exclusive instruction and RW2 is a read R2 generated by an instruction with Acquire or AcquirePC semantics such that R2 is a Local read successor of W1. is relaxed to read: - RW1 is a read (R1) generated by an atomic instruction (resp. a Load-Exclusive instruction) and RW2 is a read (R2) generated by an instruction with Acquire or AcquirePC semantics such that R2 is a Local read successor of the write W3 generated by the same atomic instruction as R1 (resp. the successful Store-Exclusive instruction paired with the Load-Exclusive instruction which generated R1). 3. Changes to the AArch64 cat file ----------------------------------- In aob, the clause which reads: | [range(rmw)]; lrs; [A | Q] is relaxed to read: | rmw; lrs; [A | Q]

Arm has investigated the weakening of its dependency definitions, in particular the notion of control dependency. In light of this investigation, Arm has decided to keep its notion of control dependency strong, but proposes to introduce a new notion of dependency, called "Pick dependency". Pick dependencies allow for very targetted ordering, and it is hoped that this will enable optimisations. See also: https://www.spinics.net/lists/kernel/msg4028862.html and the attached tech report entitled "The Ballad of Dependencies". 1. Motivation ------------- At the moment the behaviour MP+rel+CAS-addr: AArch64 MP+rel+CAS-addr { z=1; 0:X1=x; 0:X3=y; 1:X1=x; 1:X3=y; 1:X8=z; } P0 | P1 ; MOV W0, #1 | LDR W0, [X1] ; STR W0, [X3] | MOV W5, W0 ; STLR W0, [X1] | CAS W0, W6, [X8] ; | LDR W0, [X8] ; | EOR X0, X0, X0 ; | ADD X3, X3, X0 ; | LDR W4, [X3] ; exists (1:X5=1 /\ 1:X4=0) is Forbidden by the architecture. Arm would like to weaken this to allow the behaviour. The reason for MP+rel-CAS-addr being currently Forbidden is as follows: 1. there is a Data dependency on P1 from the Memory read Effect of LDR W0,[X1] (event c) to the Memory write Effect of CAS W0,W6,[X8] (event e) 2. there is a read-from on P1 from the Memory write Effect of CAS W0,W6,[X8] (event e) to the Memory read Effect of LDR W0,[X8] (event f) - in other words the Memory read Effect of LDR W0,[X8] (event f) is a Local read successor of the Memory write Effect of CAS W0,W6,[X8] (event e) 3. there is an Address dependency on P1 from the Memory read Effect of LDR W0,[X8] (event f) to the Memory read Effect of LDR W4,[X3] (event g). Points 1 and 2 combined impose order between the Memory read Effect of LDR W0,[X1] (event c) to the Memory read Effect of LDR W0,[X8] (event f), because data; rfi is in dob. In the text of the Arm ARM this corresponds to the clause: --(*)-- RW2 is a Local read successor R2 of a write W3 and there is an Address dependency or a Data dependency from R1 to W3. ------- A way forward to allow this behaviour would be to decide that there is no ordering altogether between the Memory read Effect of LDR W0,[X1] (event c) to the Memory write Effect of CAS W0,W6,[X8] (event e). However, this would lead to allowing LB+rel+CAS, which must be Forbidden: AArch64 LB+rel+CAS { 0:X1=x; 0:X3=y; 1:X1=x; 1:X3=y; } P0 | P1 ; MOV W9, #1 | MOV W4, #1 ; LDR W0, [X3] | LDR W5, [X1] ; STLR W9, [X1] | AND W10, W5, #2 ; | CAS W10, W4, [X3] ; exists (1:X5=1 /\ 0:X0=1) To accommodate for MP+rel+CAS-addr being Allowed as desired, and LB+rel+CAS still being Forbidden, Arm proposes to create a new type of dependency (called "Pick dependency"), such that: - Pick dependencies are not subjected to the clause (*) above, hence will not impose order in the case of MP+rel+CAS-addr - Pick dependencies provide order in the case of LB+rel+CAS Interestingly, this should not prescribe ordering in the litmus test MP+rel+CSEL for example. Therefore CSEL instructions should not provide similar ordering to branch instructions. However this highlights again the need for a new dependency notion, as the litmus test MP+rel+CSEL-data must be ordered. As in the CAS case, the nature of that new dependency notion is interesting. Indeed the test MP+rel+CSEL-addr should be allowed, which in turn entails that this new dependency cannot be a Data dependency, as otherwise the LDR W1,[X2] and the LDR W9,[X10,X8,SXTW] would be ordered. Arm proposes that CAS and CSEL create Pick dependencies. 2. Tests -------- Tests ----- AArch64 MP+CAS-rfi-ctrl+acq { 0:X0=x; 0:X4=y; 1:X1=y; 1:X3=x; } P0 | P1 ; MOV W1,#0 | LDAR W0,[X1] ; MOV W2,#1 | LDR W2,[X3] ; CAS W1,W2,[X0] | ; CBNZ W1,out | ; MOV W3,#1 | ; STR W3,[X4] | ; out: | ; exists (x=1 /\ 1:X0=1 /\ 1:X2=0) AArch64 R+CAS-rfi-ctrl+DMBST "Amo.Cas Rfi DpCtrldW Coe DMB.STdWW Rfe" { 0:X0=x; 0:X5=y; 1:X1=y; 1:X3=x; } P0 | P1 ; MOV W1,#1 | MOV W0,#2 ; MOV W2,#2 | STR W0,[X1] ; CAS W1,W2,[X0] | DMB ST ; LDR W3,[X0] | MOV W2,#1 ; CBNZ W3,LC00 | STR W2,[X3] ; LC00: | ; MOV W4,#1 | ; STR W4,[X5] | ; exists (x=2 /\ y=2 /\ 0:X1=1 /\ 0:X3=2) AArch64 LB+CAS-rfi-ctrl+DMBSY { 0:X0=x; 0:X5=y; 1:X1=y; 1:X3=x; } P0 | P1 ; MOV W1,#1 | ; MOV W2,#2 | LDR W0,[X1] ; CAS W1,W2,[X0] | DMB SY ; LDR W3,[X0] | MOV W2,#1 ; CBNZ W3,LC00 | STR W2,[X3] ; LC00: | ; MOV W4,#1 | ; STR W4,[X5] | ; exists (x=2 /\ 0:X1=1 /\ 0:X3=2 /\ 1:X0=1) AArch64 R+CAS+DMBLD "PodWR Amo.Cas Coe DMB.SYdWR Fre" { 0:X1=x; 0:X2=y; 1:X1=y; 1:X3=x; } P0 | P1 ; MOV W0,#1 | MOV W0,#2 ; STR W0,[X1] | STR W0,[X1] ; MOV W3,#0 | DMB SY ; MOV W4,#1 | LDR W2,[X3] ; CAS W3,W4,[X2] | ; exists (y=2 /\ 0:X3=0 /\ 1:X2=0) AArch64 SB+CAS-rfi-addr+DMBSY "PodWR Amo.Cas Rfi DpAddrdR Fre DMB.SYdWR Fre" { 0:X1=x; 0:X2=y; 0:X8=z; 1:X1=z; 1:X3=x; } P0 | P1 ; MOV W0,#1 | MOV W0,#1 ; STR W0,[X1] | STR W0,[X1] ; MOV W3,#0 | DMB SY ; MOV W4,#1 | LDR W2,[X3] ; CAS W3,W4,[X2] | ; LDR W5,[X2] | ; EOR W6,W5,W5 | ; LDR W7,[X8,W6,SXTW] | ; exists (0:X3=0 /\ 0:X5=1 /\ 0:X7=0 /\ 1:X2=0) AArch64 SB+SWP-rfi-addr+DMBSY "PodWR Amo.Swp Rfi DpAddrdR Fre DMB.SYdWR Fre" { 0:X1=x; 0:X2=y; 0:X8=z; 1:X1=z; 1:X3=x; } P0 | P1 ; MOV W0,#1 | MOV W0,#1 ; STR W0,[X1] | STR W0,[X1] ; MOV W4,#1 | DMB SY ; SWP W4,W3,[X2] | LDR W2,[X3] ; LDR W5,[X2] | ; EOR W6,W5,W5 | ; LDR W7,[X8,W6,SXTW] | ; exists (0:X3=0 /\ 0:X5=1 /\ 0:X7=0 /\ 1:X2=0) AArch64 MP+rel+CAS-addr { z=1; 0:X1=x; 0:X3=y; 1:X1=x; 1:X3=y; 1:X8=z; } P0 | P1 ; MOV W0, #1 | LDR W0, [X1] ; STR W0, [X3] | MOV W5, W0 ; STLR W0, [X1] | CAS W0, W6, [X8] ; | LDR W0, [X8] ; | EOR X0, X0, X0 ; | ADD X3, X3, X0 ; | LDR W4, [X3] ; exists (1:X5=1 /\ 1:X4=0) AArch64 LB+rel+CAS { 0:X1=x; 0:X3=y; 1:X1=x; 1:X3=y; } P0 | P1 ; MOV W9, #1 | MOV W4, #1 ; LDR W0, [X3] | LDR W5, [X1] ; STLR W9, [X1] | AND W10, W5, #2 ; | CAS W10, W4, [X3] ; exists (1:X5=1 /\ 0:X0=1)

This change is a relaxation. Arm has investigated the weakening of its dependency definitions, in particular the notion of control dependency. In light of this investigation, Arm has decided to keep its notion of control dependency strong, but proposes to introduce a new notion of dependency, called "Pick dependency". Pick dependencies allow for very targetted ordering, and it is hoped that this will enable optimisations. See also: https://www.spinics.net/lists/kernel/msg4028862.html and the attached tech report entitled "The Ballad of Dependencies". Motivation ------------- At the moment the behaviour MP+rel+CAS-addr: ``` AArch64 MP+rel+CAS-addr { z=1; 0:X1=x; 0:X3=y; 1:X1=x; 1:X3=y; 1:X8=z; } P0 | P1 ; MOV W0,#1 | LDR W0, [X1] ; STR W0,[X3] | MOV W5, W0 ; STLR W0,[X1] | CAS W0, W6, [X8] ; | LDR W0, [X8] ; | EOR X0, X0, X0 ; | ADD X3, X3, X0 ; | LDR W4, [X3] ; exists (1:X5=1 /\ 1:X4=0) ``` is Forbidden by the architecture. Arm would like to weaken this to allow the behaviour. The reason for MP+rel-CAS-addr being currently Forbidden is as follows: 1. there is a Data dependency on P1 from the Memory read Effect of LDR W0,[X1] (event c) to the Memory write Effect of CAS W0,W6,[X8] (event e) 2. there is a read-from on P1 from the Memory write Effect of CAS W0,W6,[X8] (event e) to the Memory read Effect of LDR W0,[X8] (event f) - in other words the Memory read Effect of LDR W0,[X8] (event f) is a Local read successor of the Memory write Effect of CAS W0,W6,[X8] (event e) 3. there is an Address dependency on P1 from the Memory read Effect of LDR W0,[X8] (event f) to the Memory read Effect of LDR W4,[X3] (event g). Points 1 and 2 combined impose order between the Memory read Effect of LDR W0,[X1] (event c) to the Memory read Effect of LDR W0,[X8] (event f), because data; rfi is in dob. In the text of the Arm ARM this corresponds to the clause: (*) RW2 is a Local read successor R2 of a write W3 and there is an Address dependency or a Data dependency from R1 to W3. A way forward to allow this behaviour would be to decide that there is no ordering altogether between the Memory read Effect of LDR W0,[X1] (event c) to the Memory write Effect of CAS W0,W6,[X8] (event e). However, this would lead to allowing LB+rel+CAS, which must be Forbidden: ``` AArch64 LB+rel+CAS { 0:X1=x; 0:X3=y; 1:X1=x; 1:X3=y; } P0 | P1 ; MOV W9, #1 | MOV W4, #1 ; LDR W0, [X3] | LDR W5, [X1] ; STLR W9, [X1]| AND W10, W5, #2 ; | CAS W10, W4, [X3] ; exists (1:X5=1 /\ 0:X0=1) ``` To accommodate for MP+rel+CAS-addr being Allowed as desired, and LB+rel+CAS still being Forbidden, Arm proposes to create a new type of dependency (called "Pick dependency"), such that: - Pick dependencies are not subjected to the clause (*) above, hence will not impose order in the case of MP+rel+CAS-addr - Pick dependencies provide order in the case of LB+rel+CAS Interestingly, this should not prescribe ordering in the litmus test MP+rel+CSEL for example. Therefore CSEL instructions should not provide similar ordering to branch instructions. However this highlights again the need for a new dependency notion, as the litmus test MP+rel+CSEL-data must be ordered. As in the CAS case, the nature of that new dependency notion is interesting. Indeed the test MP+rel+CSEL-addr should be allowed, which in turn entails that this new dependency cannot be a Data dependency, as otherwise the LDR W1,[X2] and the LDR W9,[X10,X8,SXTW] would be ordered. Arm proposes that CAS and CSEL create Pick dependencies. Tests -------- AArch64 MP+CAS-rfi-ctrl+acq { 0:X0=x; 0:X4=y; 1:X1=y; 1:X3=x; } P0 | P1 ; MOV W1,#0 | LDAR W0,[X1] ; MOV W2,#1 | LDR W2,[X3] ; CAS W1,W2,[X0] | ; CBNZ W1,out | ; MOV W3,#1 | ; STR W3,[X4] | ; out: | ; exists (x=1 /\ 1:X0=1 /\ 1:X2=0) AArch64 R+CAS-rfi-ctrl+DMBST "Amo.Cas Rfi DpCtrldW Coe DMB.STdWW Rfe" { 0:X0=x; 0:X5=y; 1:X1=y; 1:X3=x; } P0 | P1 ; MOV W1,#1 | MOV W0,#2 ; MOV W2,#2 | STR W0,[X1] ; CAS W1,W2,[X0] | DMB ST ; LDR W3,[X0] | MOV W2,#1 ; CBNZ W3,LC00 | STR W2,[X3] ; LC00: | ; MOV W4,#1 | ; STR W4,[X5] | ; exists (x=2 /\ y=2 /\ 0:X1=1 /\ 0:X3=2) AArch64 LB+CAS-rfi-ctrl+DMBSY { 0:X0=x; 0:X5=y; 1:X1=y; 1:X3=x; } P0 | P1 ; MOV W1,#1 | ; MOV W2,#2 | LDR W0,[X1] ; CAS W1,W2,[X0] | DMB SY ; LDR W3,[X0] | MOV W2,#1 ; CBNZ W3,LC00 | STR W2,[X3] ; LC00: | ; MOV W4,#1 | ; STR W4,[X5] | ; exists (x=2 /\ 0:X1=1 /\ 0:X3=2 /\ 1:X0=1) AArch64 R+CAS+DMBLD "PodWR Amo.Cas Coe DMB.SYdWR Fre" { 0:X1=x; 0:X2=y; 1:X1=y; 1:X3=x; } P0 | P1 ; MOV W0,#1 | MOV W0,#2 ; STR W0,[X1] | STR W0,[X1] ; MOV W3,#0 | DMB SY ; MOV W4,#1 | LDR W2,[X3] ; CAS W3,W4,[X2] | ; exists (y=2 /\ 0:X3=0 /\ 1:X2=0) AArch64 SB+CAS-rfi-addr+DMBSY "PodWR Amo.Cas Rfi DpAddrdR Fre DMB.SYdWR Fre" { 0:X1=x; 0:X2=y; 0:X8=z; 1:X1=z; 1:X3=x; } P0 | P1 ; MOV W0,#1 | MOV W0,#1 ; STR W0,[X1] | STR W0,[X1] ; MOV W3,#0 | DMB SY ; MOV W4,#1 | LDR W2,[X3] ; CAS W3,W4,[X2] | ; LDR W5,[X2] | ; EOR W6,W5,W5 | ; LDR W7,[X8,W6,SXTW] | ; exists (0:X3=0 /\ 0:X5=1 /\ 0:X7=0 /\ 1:X2=0) AArch64 SB+SWP-rfi-addr+DMBSY "PodWR Amo.Swp Rfi DpAddrdR Fre DMB.SYdWR Fre" { 0:X1=x; 0:X2=y; 0:X8=z; 1:X1=z; 1:X3=x; } P0 | P1 ; MOV W0,#1 | MOV W0,#1 ; STR W0,[X1] | STR W0,[X1] ; MOV W4,#1 | DMB SY ; SWP W4,W3,[X2] | LDR W2,[X3] ; LDR W5,[X2] | ; EOR W6,W5,W5 | ; LDR W7,[X8,W6,SXTW] | ; exists (0:X3=0 /\ 0:X5=1 /\ 0:X7=0 /\ 1:X2=0) AArch64 MP+rel+CAS-addr { z=1; 0:X1=x; 0:X3=y; 1:X1=x; 1:X3=y; 1:X8=z; } P0 | P1 ; MOV W0, #1 | LDR W0, [X1] ; STR W0, [X3] | MOV W5, W0 ; STLR W0, [X1] | CAS W0, W6, [X8] ; | LDR W0, [X8] ; | EOR X0, X0, X0 ; | ADD X3, X3, X0 ; | LDR W4, [X3] ; exists (1:X5=1 /\ 1:X4=0) AArch64 LB+rel+CAS { 0:X1=x; 0:X3=y; 1:X1=x; 1:X3=y; } P0 | P1 ; MOV W9, #1 | MOV W4, #1 ; LDR W0, [X3] | LDR W5, [X1] ; STLR W9, [X1] | AND W10, W5, #2 ; | CAS W10, W4, [X3] ; exists (1:X5=1 /\ 0:X0=1)

The `Store` pseudo-edge produces a store to location common to the previous and following edges. This store appears at thread code start. For instance: ``` % diyone7 -arch AArch64 FencedWW Rfe DpAddrdR PosRR Store Fre AArch64 A "DMB.SYdWW Rfe DpAddrdR PosRR Store Fre" Generator=diyone7 (version 7.56+02~dev) Prefetch=0:x=F,0:y=W,1:y=F,1:x=T Com=Rf Fr Orig=DMB.SYdWW Rfe DpAddrdR PosRR Store Fre { 0:X1=x; 0:X3=y; 1:X0=y; 1:X4=x; } P0 | P1 ; MOV W0,#2 | MOV W6,#1 ; STR W0,[X1] | STR W6,[X4] ; DMB SY | LDR W1,[X0] ; MOV W2,#1 | EOR W2,W1,W1 ; STR W2,[X3] | LDR W3,[X4,W2,SXTW] ; | LDR W5,[X4] ; exists ([x]=2 /\ 1:X1=1 /\ 1:X3=0 /\ 1:X5=1) ``` Generates an interesting test where the load by P1 from [x] to W3 is delayed; while the next load by P1 from [x] to W5 is not.

For a given dependency Dp_XXX_{s,d}{R,W} the csel variation syntax is by suffixing `Csel` after _XXX_. Csel variations introduce a CSEL instruction in the dependency computation chain. For instance, here is the effect of DpDataCseldW: ``` AArch64 A "DpDatadW Rfe DpDataCseldW Rfe" Generator=diyone7 (version 7.56+02~dev) Prefetch=0:x=F,0:y=W,1:y=F,1:x=W Com=Rf Rf Orig=DpDatadW Rfe DpDataCseldW Rfe { 0:X0=x; 0:X3=y; 1:X0=y; 1:X5=x; } P0 | P1 ; LDR W1,[X0] | LDR W1,[X0] ; EOR W2,W1,W1 | CMP W1,#0 ; ADD W2,W2,#1 | CSEL W2,W3,W4,EQ ; STR W2,[X3] | ADD W2,W2,#1 ; | STR W2,[X5] ; exists (0:X1=1 /\ 1:X1=1) ```

This change modifies the semantics of the translation (-variant vmsa) for checked memory instructions when MTE is enabled (-variant memtag). In short, for a checked memory instruction that performs a tag check prior to the explicit memory event, it allows the tag check operation to have its own translation (read of the PTE followed by a branching decision), separate from the translation that happens and leads to the explicit memory event. Also it adds support for a new type of memory, TaggedNormal featured in FEAT_MTE2. TaggedNormal is the only type for which we peform a tag check. To illustrate this with an example, in the following test, the STR is a checked access and will result in a TagCheck fault. AArch64 STR-TaggedNormal-red { [PTE(x)]=(oa:PA(x),attrs:(TaggedNormal)); 0:X0=x:red; } P0 ; MOV W1,#1 ; L0: ; STR W1,[X0] ; locations[fault(P0:L0,x,TagCheck);] exists(x=1) While the following test, the STR will not raise a TagCheck fault: AArch64 STR-Normal-red { [PTE(x)]=(oa:PA(x),attrs:(Normal)); 0:X0=x:red; } P0 ; MOV W1,#1 ; L0: ; STR W1,[X0] ; locations[fault(P0:L0,x,TagCheck);] exists(x=1) Signed-off-by: Nikos Nikoleris <nikos.nikoleris@arm.com>

We now have `\newisrelation{rel}{Text}` that defines two commands: * \newcommand{\<rel>name}{Text} * \newcommand{\<rel>}[2]{#1 is Text #2}

Currently some tests generated with diy will generate an incorrect post-condition. One noteable example is coRR: AArch64 coRR "Rfe PosRR Fre" Generator=diyone7 (version 7.56+03) Com=Rf Fr Orig=Rfe PosRR Fre { 0:X1=x; 1:X0=x; } P0 | P1 ; MOV W0,herd#1 | LDR W1,[X0] ; STR W0,[X1] | LDR W2,[X0] ; exists (1:X1=1 /\ 1:X2=1) This patch aims to fix this by ensuring that oneloc tests start generating the post condition from the start of a chain of a communication edges, ensuring they are all taken into account in the post condition.

We now have `\newisrelation{rel}{Text}` that defines two commands: * \newcommand{\<rel>name}{Text} * \newcommand{\<rel>}[2]{#1 is Text #2}

In effect, the commands of "successor like relation" (such as lrs, "is the Local Read Successor") use their arguments as #2 first, #1 second. By contrast the more numerous "before" relations use their arguments the other way round.

This change modifies the semantics of the translation (-variant vmsa) for checked memory instructions when MTE is enabled (-variant memtag). In short, for a checked memory instruction that performs a tag check prior to the explicit memory event, it allows the tag check operation to have its own translation (read of the PTE followed by a branching decision), separate from the translation that happens and leads to the explicit memory event. Also it adds support for a new type of memory, TaggedNormal featured in FEAT_MTE2. TaggedNormal is the only type for which we peform a tag check. To illustrate this with an example, in the following test, the STR is a checked access and will result in a TagCheck fault. AArch64 STR-TaggedNormal-red { [PTE(x)]=(oa:PA(x),attrs:(TaggedNormal)); 0:X0=x:red; } P0 ; MOV W1,#1 ; L0: ; STR W1,[X0] ; locations[fault(P0:L0,x,TagCheck);] exists(x=1) While the following test, the STR will not raise a TagCheck fault: AArch64 STR-Normal-red { [PTE(x)]=(oa:PA(x),attrs:(Normal)); 0:X0=x:red; } P0 ; MOV W1,#1 ; L0: ; STR W1,[X0] ; locations[fault(P0:L0,x,TagCheck);] exists(x=1) Signed-off-by: Nikos Nikoleris <nikos.nikoleris@arm.com>

The tool transliterate cat definitions into English Simple usage: `miaou7 -show <name> <catfile>` will translate the definition of `<name>` from model `<catfile>` into LaTeX. More advanced usage: the output will use LaTeX commands to pretty-print event set and relation definitions. For instance the **po** relation corresponds to the the command `\newcommand{\po}[2]{#1 is program-order before #2}`. Such commands can be defined in a specific file, say `catdefs.tex`. Then, the command-line option `-tex catdefs,tex` yields the following benefits: 1. Commands outputed by `miaou7` and not defined are flagged; and 2. The number of parameters of commands tells if the command correspond to an event set (one argument) or to a relation (two arguments), thereby guaranteeing appropriate formatting of definitions. Be aware that relation names and command differ: non-alphabetical characters are deleted from Cat names to produce legal LaTeX command names; a few Cat names undergo special treatment (_e.g._ `dmb.full` -> `\DMBFULL`, `L` -> `\REL`,...).

[tools] miaou! A cat that writes in English. Simple usage: `miaou7 -show <name> <catfile>` will translate the definition of `<name>` from model `<catfile>` into LaTeX. More advanced usage: the output will use LaTeX commands to pretty-print event set and relation definitions. For instance the **po** relation corresponds to the the command `\newcommand{\po}[2]{#1 is program-order before #2}`. Such commands can be defined in a specific file, say `catdefs.tex`. Then, the command-line option `-tex catdefs,tex` yields the following benefits: 1. Commands outputed by `miaou7` and not defined are flagged; and 2. The number of parameters of commands tells if the command correspond to an event set (one argument) or to a relation (two arguments), thereby guaranteeing appropriate formatting of definitions. Be aware that relation names and command differ: non-alphabetical characters are deleted from Cat names to produce legal LaTeX command names; a few Cat names undergo special treatment (_e.g._ `dmb.full` -> `\DMBFULL`, `L` -> `\REL`,...).

Unfortunately, there are several issues with litmus test generation when it comes to broad use Neon annotations. For example, litmus test for data dependency between Neon load and a store comes with several issues $ diyone7 -arch AArch64 -variant neon "Rfe PodRW Rfe DpDatadWN2N2" ---cut--- P0 | P1 ; LDR W1,[X0] | LD2 {V0.4S, V1.4S},[X0] ; (* A *) MOV W2,herd#1 | EOR V2.4S,V0.4S,V0.4S ; (* B *) STR W2,[X3] | ... ; ---cut--- First, result of load into V1 at A is not consumed. Second, instruction at B is not valid: Assembler messages: Error: operand mismatch -- `eor V2.4S,V0.4S,V0.4S' Info: did you mean this? Info: eor v1.8b, v0.8b, v0.8b Info: other valid variant(s): Info: eor v1.16b, v0.16b, v0.16b Another issue can be seen in litmus test for address dependency between a load and Neon load $ diyone7 -arch AArch64 -variant neon "PodWW Rfe DpAddrdRPN2 Fre" ---cut--- P0 | P1 ; MOV W0,herd#1 | ... ; STR W0,[X1] | ... ; MOV W2,herd#1 | ... ; STR W2,[X3] | LD2 {V0.4S, V1.4S},[X4],X3 ; (* A *) ---cut-- Neon load at A uses post-index form where X3 would be consumed after access is made to update X4, in other words, X3 doesn't contribute to the address generation for LD2 access at A. Also, there are numerous cases where litmus test generation would fail with Fatal error: exception File "lib/AArch64Base.ml", line 317, characters 48-54: Assertion failed To make things work we need to rework LDN. For things like address dependency and control dependency we need to reduce vector registers horizontally (in case more than one vector register is used) follow up with reduce horizontally single vector register into scalar, so it can be transferred to general purpose register and become available either for address generation logic or control flow logic. In term of instruction that reduction looks like LD2 {V0.4S, V1.4S},[X0] ; ADD V0.4S,V0.4S,V1.4S ; (* reduce (and consume) V0 and V1 into V0 *) ADDV S2,V0.4S ; (* reduce V0 inot S2 *) FMOV W1,S2 ; (* transfer S2 into W1 *) The only downside is that for data dependency between, say, Neon load and Neon store data have to trip via general purpose register which is sub-optimal LD2 {V0.4S, V1.4S},[X0] ; ADD V0.4S,V0.4S,V1.4S ; ADDV S2,V0.4S ; FMOV W1,S2 ; EOR W2,W1,W1 ; DUP V3.4S,W2 ; MOVI V4.4S,herd#1 ; MOVI V5.4S,herd#2 ; ADD V4.4S,V4.4S,V3.4S ; ADD V5.4S,V5.4S,V3.4S ; ST2 {V4.4S, V5.4S},[X3] ; However, when it comes to data dependency, say, between Neon load and non-Neon store things look more natural LD2 {V0.4S, V1.4S},[X0] ; ADD V0.4S,V0.4S,V1.4S ; ADDV S2,V0.4S ; FMOV W1,S2 ; EOR W2,W1,W1 ; ADD W2,W2,herd#1 ; STR W2,[X3] ; Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com>

Unfortunately, there are several issues with litmus test generation when it comes to broad use Neon annotations. For example, litmus test for data dependency between a load and Neon store comes with use of invalid instruction $ diyone7 -arch AArch64 -variant neon "Rfe PodRW Rfe DpDatadWN2N2" ---cut--- P0 | P1 ; LDR W1,[X0] | ... ; MOV W2,herd#1 | ... ; STR W2,[X3] | MOVI V3.4S,herd#1 ; | ADD V2.4S,V2.4S,V3.4S ; | ST2 {V2.4S},[X2] ; (* A *) ---cut--- Neon's ST2 at A is invalid: Error: invalid number of registers in the list; 2 registers are expected at operand 1 -- `st2 {V2.4S},[X2]' Another issues seen in litmus test for address dependency between a load and Neon store $ diyone7 -arch AArch64 -variant neon "Rfe PodRW Rfe DpAddrdWPN2 PodWW" ---cut--- P0 | P1 ; LDR W1,[X0] | ... ; MOV W2,herd#1 | ... ; STR W2,[X3] | EOR W2,W1,W1 ; (* A *) | EOR X3,X2,X2 ; | MOVI V0.4S,herd#1 ; | MOVI V1.4S,herd#2 ; | ST2 {V0.4S, V1.4S},[X4],X3 ; (* B *) | MOV W5,herd#1 ; | STR W5,[X6] ; ---cut--- First of all unnecessary exclusive or at A, yet it doesn't hurt. Second, Neon store at B uses post-index form where X3 would be consumed after access is made to update X4, in other words, X3 doesn't contribute to the address generation for ST2 access at B. Also, there are numerous cases where litmus test generation would fail with Fatal error: exception File "lib/AArch64Base.ml", line 317, characters 48-54: Assertion failed So, rework STN such that * For data dependency between a load and Neon store we produce code sequence like DUP V3.4S,W2 ; (* expand W2 into V3 *) MOVI V4.4S,#0 ; MOVI V5.4S,herd#1 ; ADD V4.4S,V4.4S,V3.4S ; (* V4 depends on V3 *) ADD V5.4S,V5.4S,V3.4S ; (* V5 depends on V3 *) ST2 {V4.4S, V5.4S},[X3] ; (* Store both V4 and V3 *) * For address dependency between a load and Neon store we produce code sequence like ADD X4,X3,W1,SXTW ; (* X4 depends on X3 (generated by previous load) *) MOVI V0.4S,herd#1 ; MOVI V1.4S,herd#2 ; ST2 {V0.4S, V1.4S},[X4] ; (* ST2 consumes X4 *) Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com>