-
Notifications
You must be signed in to change notification settings - Fork 62
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handling "atomic_compare_exchange_strong_explicit" in C11 #1
Comments
Hi,
Thank you for your report.
This is a isfeature, a silly one, as herd7 implements
atomic_compare_exchange_strong and atomic_compare_exchange_weak.
Correction is a matter of minutes. That said, semantics is untested.
Thanks again.
…--Luc
Hi,
I was using herd7 (7.43, installed via OPAM) to run against the C11 litmus from <https://github.com/nchong/c11popl15/blob/master/tests/a2.litmus>:
```
C a2
{ [x] = 0; [y] = 0; [z] = 0; }
P0 (atomic_int* x, atomic_int* y, atomic_int *z) {
int r0 = atomic_load_explicit(y, memory_order_relaxed);
int r1 = atomic_compare_exchange_strong_explicit(x, z, 1, memory_order_release, memory_order_relaxed);
}
P1 (atomic_int* x, volatile int* y) {
int r2 = atomic_load_explicit(x, memory_order_acquire);
if (r2) {
*y = 1;
}
}
```
But herd7 complains:
```
line 6, characters 11-50: unexpected 'atomic_compare_exchange_strong_explicit' (in prog)
```
It seems "atomic_load_explicit" (and also "atomic_store_explicit" according to other tests) are recognized, but not "atomic_compare_exchange_strong_explicit".
I also tried with old version herd in <https://github.com/herd/herdtools>, and it can successfully finish the run there.
So is herd7 indeed missing the part for "atomic_compare_exchange_strong_explicit" at the moment?
Thanks.
--
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
#1
|
The issue should be solved in 7.44, now available on github
as source.
<https://github.com/herd/herdtools7/tree/7.44>
That said, I was too eager to bump version and publish on opam, as I see
you have sent another PR..
…--Luc
Hi,
I was using herd7 (7.43, installed via OPAM) to run against the C11 litmus from <https://github.com/nchong/c11popl15/blob/master/tests/a2.litmus>:
```
C a2
{ [x] = 0; [y] = 0; [z] = 0; }
P0 (atomic_int* x, atomic_int* y, atomic_int *z) {
int r0 = atomic_load_explicit(y, memory_order_relaxed);
int r1 = atomic_compare_exchange_strong_explicit(x, z, 1, memory_order_release, memory_order_relaxed);
}
P1 (atomic_int* x, volatile int* y) {
int r2 = atomic_load_explicit(x, memory_order_acquire);
if (r2) {
*y = 1;
}
}
```
But herd7 complains:
```
line 6, characters 11-50: unexpected 'atomic_compare_exchange_strong_explicit' (in prog)
```
It seems "atomic_load_explicit" (and also "atomic_store_explicit" according to other tests) are recognized, but not "atomic_compare_exchange_strong_explicit".
I also tried with old version herd in <https://github.com/herd/herdtools>, and it can successfully finish the run there.
So is herd7 indeed missing the part for "atomic_compare_exchange_strong_explicit" at the moment?
Thanks.
--
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
#1
|
maranget
added a commit
that referenced
this issue
Oct 1, 2020
maranget
added a commit
that referenced
this issue
Oct 3, 2020
lukeg101
referenced
this issue
in lukeg101/herdtools7
Nov 3, 2020
Consider the following: ``` ADD W1, W0, W0, lsl #1; ``` when `0:X0=1` the final result `0:X1=3` should always occur, however `0:X1=4` occurs since a single variable `v` is used when the registers are the same, when two variables are required. This patch fixes the above for `I_OP3` and adds a `MOVZ` test missed before.
lukeg101
referenced
this issue
in lukeg101/herdtools7
Nov 3, 2020
Consider the following: ``` ADD W1, W0, W0, lsl #1; ``` when `0:X0=1` the final result `0:X1=3` should always occur, however `0:X1=4` occurs since a single variable `v` is used when the registers are the same, when two variables are required. This patch fixes the above for `I_OP3` and adds a `MOVZ` test missed before.
lukeg101
referenced
this issue
in lukeg101/herdtools7
Nov 3, 2020
Consider the following: ``` ADD W1, W0, W0, lsl #1; ``` when `0:X0=1` the final result `0:X1=3` should always occur, however `0:X1=4` occurs since a single variable `v` is used when the registers are the same, when two variables are required. This patch fixes the above for `I_OP3` and adds a `MOVZ` test missed before.
maranget
pushed a commit
that referenced
this issue
Nov 10, 2020
Consider the following: ``` ADD W1, W0, W0, lsl #1; ``` when `0:X0=1` the final result `0:X1=3` should always occur, however `0:X1=4` occurs since a single variable `v` is used when the registers are the same, when two variables are required. This patch fixes the above for `I_OP3` and adds a `MOVZ` test missed before.
lukeg101
referenced
this issue
in lukeg101/herdtools7
Nov 14, 2020
Consider the following AArch64 instruction: ``` CMP WZR, w1, asr #1; CMP XZR, x1, asr #1; ``` which arithmetic shifts right the value in w1/x1 by 1 before comparing with xzr/wzr. This is equivalent to: ``` SUBS XZR, WZR, w1, asr #1; SUBS XZR, XZR, x1, asr #1; ``` This patch adds: - barrel shifters to the CMP command - tests for CMP with barrel shifters - tests for the equivalent SUBS instructions tested with: - pre-commit tests - all pass - make test - all pass - make version - all pass/no warnings
lukeg101
referenced
this issue
in lukeg101/herdtools7
Dec 10, 2020
Consider the following AArch64 instruction: ``` CMP WZR, w1, asr #1; CMP XZR, x1, asr #1; ``` which arithmetic shifts right the value in w1/x1 by 1 before comparing with xzr/wzr. This is equivalent to: ``` SUBS XZR, WZR, w1, asr #1; SUBS XZR, XZR, x1, asr #1; ``` This patch adds: - barrel shifters to the CMP command - tests for CMP with barrel shifters - tests for the equivalent SUBS instructions tested with: - pre-commit tests - all pass - make test - all pass - make version - all pass/no warnings
lukeg101
referenced
this issue
in lukeg101/herdtools7
Jun 3, 2021
Consider the following AArch64 instruction: ``` CMP WZR, w1, asr #1; CMP XZR, x1, asr #1; ``` which arithmetic shifts right the value in w1/x1 by 1 before comparing with xzr/wzr. This is equivalent to: ``` SUBS XZR, WZR, w1, asr #1; SUBS XZR, XZR, x1, asr #1; ``` This patch adds: - barrel shifters to the CMP command - tests for CMP with barrel shifters - tests for the equivalent SUBS instructions tested with: - pre-commit tests - all pass - make test - all pass - make version - all pass/no warnings
maranget
pushed a commit
that referenced
this issue
Jun 15, 2021
Consider the following AArch64 instruction: ``` CMP WZR, w1, asr #1; CMP XZR, x1, asr #1; ``` which arithmetic shifts right the value in w1/x1 by 1 before comparing with xzr/wzr. This is equivalent to: ``` SUBS XZR, WZR, w1, asr #1; SUBS XZR, XZR, x1, asr #1; ``` This patch adds: - barrel shifters to the CMP command - tests for CMP with barrel shifters - tests for the equivalent SUBS instructions tested with: - pre-commit tests - all pass - make test - all pass - make version - all pass/no warnings
maranget
pushed a commit
that referenced
this issue
Jul 27, 2021
Added parsing support for the const type
maranget
added a commit
that referenced
this issue
Sep 17, 2021
[gen,herd] Add memtag() support for checking MTE tag locations in condition statements This change addressing the issues highlighted by issue #68 by using the `[tag(x)]` syntax for accessing the physical tag of a given location x. The herd support for this is built upon the already existing LV (LocationValue) and LL (LocationLocation) comparison operators present in herdtools, with `[tag(x)]` just being syntactical sugar for `x.atag` in generator, along with logic added to effectively returning a global location for x with the tag updated to the latest physical value The generator changes to support this include tracking of tag writes for a given cycle, and outputting write values into the condition statement, mirroring exactly what happens for Ord writes in non-MTE cycles. Notice that, by contrast, herd features a specific abstract syntax for tag identifiers (`System (TAG,x)` for concrete `tag(x)` or old style `x.atag`). Example tests: ``` $ diyone7 -arch AArch64 Rfe PodRW CoeTT PodWW -variant memtag AArch64 A "Rfe PodRWPT CoeTT PodWWTP" Generator=diyone7 (version 7.56+02~dev) Prefetch=0:x=F,0:y=W,1:y=F,1:x=W Com=Co Rf Orig=Rfe PodRWPT CoeTT PodWWTP { 0:X0=x:green; 0:X2=y:red; 0:X3=y:green; 1:X0=y:blue; 1:X1=y:red; 1:X3=x:green; } P0 | P1 ; LDR W1,[X0] | STG X0,[X1] ; STG X2,[X3] | MOV W2,#1 ; | STR W2,[X3] ; exists ([tag(y)]=:blue /\ 0:X1=1) ``` Herd: ``` Test A Allowed States 4 0:X1=0; [tag(y)]=:blue; 0:X1=0; [tag(y)]=:red; 0:X1=1; [tag(y)]=:blue; 0:X1=1; [tag(y)]=:red; Ok Witnesses Positive: 1 Negative: 3 Condition exists ([tag(y)]=:blue /\ 0:X1=1) Observation A Sometimes 1 3 Time A 0.01 Hash=61e2be468f765474eba55b7e54797e61 ``` Equivalent non-MTE cycle ``` AArch64 A "Rfe PodRW Coe PodWW" Generator=diyone7 (version 7.56+02~dev) Prefetch=0:x=F,0:y=W,1:y=F,1:x=W Com=Co Rf Orig=Rfe PodRW Coe PodWW { 0:X0=x; 0:X3=y; 1:X1=y; 1:X3=x; } P0 | P1 ; LDR W1,[X0] | MOV W0,#2 ; MOV W2,#1 | STR W0,[X1] ; STR W2,[X3] | MOV W2,#1 ; | STR W2,[X3] ; exists ([y]=2 /\ 0:X1=1) ``` Case where new syntax helps generation of a useful forbidden test: ``` $ diyone7 -arch AArch64 -variant memtag "DMB.STdWWTT CoeTL PodWWLL PodWRLA Amo.CasAP CoePT" AArch64 A "DMB.STdWWTT CoeTL PodWWLL PodWRLA Amo.CasAP CoePT" Generator=diyone7 (version 7.56+02~dev) Prefetch=0:x=F,0:y=W,1:y=F,1:x=W Com=Co Co Orig=DMB.STdWWTT CoeTL PodWWLL PodWRLA Amo.CasAP CoePT { 0:X0=x:red; 0:X1=x:green; 0:X2=y:red; 0:X3=y:green; 1:X1=y:red; 1:X3=z:green; 1:X4=x:green; } P0 | P1 ; STG X0,[X1] | MOV W0,#1 ; DMB ST | STLR W0,[X1] ; STG X2,[X3] | MOV W2,#1 ; | STLR W2,[X3] ; | MOV W5,#0 ; | MOV W6,#1 ; | CASA W5,W6,[X4] ; exists ([tag(x)]=:red /\ [y]=1 /\ 1:X5=0) /\ ~(fault (P1,x) \/ fault (P1,y)) ``` Herd: ``` Test A Allowed States 2 1:X5=0; [y]=1; [tag(x)]=:red; Fault(P1,x:green); ~Fault(P1,y); 1:X5=0; [y]=1; [tag(x)]=:red; Fault(P1,x:green); Fault(P1,y:red); No Witnesses Positive: 0 Negative: 4 Condition exists ([tag(x)]=:red /\ [y]=1 /\ 1:X5=0 /\ not (fault(P1,x) \/ fault(P1,y))) Observation A Never 0 4 Time A 0.02 Hash=319b4e72a07bcb4c8b66028ae669c06c ``` Compared to old generation: ``` Test A Allowed States 4 1:X5=0; [y]=1; ~Fault(P1,y); ~Fault(P1,x); 1:X5=0; [y]=1; Fault(P1,x:green); ~Fault(P1,y); 1:X5=0; [y]=1; Fault(P1,x:green); Fault(P1,y:red); 1:X5=0; [y]=1; Fault(P1,y:red); ~Fault(P1,x); Ok Witnesses Positive: 1 Negative: 3 Condition exists ([y]=1 /\ 1:X5=0 /\ not (fault(P1,x) \/ fault(P1,y))) Observation A Sometimes 1 3 Time A 0.06 Hash=cbea9795a448570de810e82d4f6ed75d ```
jalglave
added a commit
that referenced
this issue
Feb 10, 2022
…ll barrier in that they prevent a read or write program-order-before the atomic from being reordered with a read or write program-order-after the atomic. However, a read or write program-order-after the atomic is allowed to be reordered with the write part of the atomic and a read or write program-order-before the atomic is allowed to be reordered with the read part of the atomic. Arm deemed this behaviour undesirable and the following change strengthen the memory model to forbid it. 1. Motivation ------------- The following behaviour: AArch64 SB+CASAL+LDR { 0:X1=x; 0:X3=y; 1:X1=y; 1:X3=x; 0:X4=0; 1:X4=0; x=0; y=0; } P0 | P1 ; MOV W0,#1 | MOV W0,#1 ; CASAL W4,W0,[X1] | CASAL W4, W0,[X1] ; LDR W2,[X3] | LDR W2,[X3] ; exists (0:X2=0 /\ 1:X2=0 /\ x=1 /\ y=1) is currently allowed. More precisely, the current definition of CASAL does not provide all of the ordering guarantees that are provided by a similar instruction in another architecture. The semantics of this other instruction have been used as the basis for API routines in a mainstream operating system, and this has resulted in a significant amount of existing code expecting stronger semantics than currently provided by CASAL. This incompatibility requires extra barriers to be inserted in the Arm version of these API routines, which results in more ordering operations than is strictly necessary during these routines. Arm proposes the following changes to provide semantics that match the expectation of the API routines. The proposed strengthening applies to all atomics (not just CAS) which bear the annotation -AL, and does not apply to Load/Store exclusives. This strengthening is aligned in principle with the fact that the following behaviour is already forbidden by the architecture: AArch64 SB+STR+CASAL+LDR2 { 0:X5=x; 0:X3=y; 1:X5=y; 1:X3=x; 0:X1=z; 1:X1=z; 0:X4=0; 1:X4=0; x=0; y=0; z=0; } P0 | P1 ; MOV W6,#1 | MOV W6,#1 ; STR W6,[X5] | STR W6,[X5] ; MOV W0,#1 | MOV W0,#1 ; CASAL W4,W0,[X1] | CASAL W4, W0,[X1] ; LDR W2,[X3] | LDR W2,[X3] ; exists (0:X2=0 /\ 1:X2=0 /\ x=1 /\ y=1 /\ z=1) which is already forbidden by the following clause in the definition of Barrier-ordered-before: - RW1 appears in program order before an atomic instruction with both Acquire and Release semantics that appears in program order before RW2. However it is worth noting that this strengthening would not forbid this behaviour (note that in the following test the CAS instructions fail and therefore do not store to memory): AArch64 SB+STR+CASAL+LDR { 0:X1=x; 0:X3=y; 1:X1=y; 1:X3=x; 0:X4=0; 1:X4=0; x=0; y=0; } P0 | P1 ; MOV W0,#1 | MOV W0,#1 ; STR W0,[X1] | STR W0,[X1] ; CASAL W4,W0,[X1] | CASAL W4, W0,[X1] ; LDR W2,[X3] | LDR W2,[X3] ; exists (0:X2=0 /\ 1:X2=0 /\ x=1 /\ y=1) Arm documentation change ------------------------ In Barrier-ordered-before, the clause which reads: - RW1 appears in program order before an atomic instruction with both Acquire and Release semantics that appears in program order before RW2. is strengthened to read: - RW1 is a write W1 and is generated by an atomic instruction with both Acquire and Release semantics Arm cat file change ------------------- In bob, the clause which reads: | po; ([A];amo;[L]); po is strengthened to read: | [range([A];amo;[L])]; po
jalglave
added a commit
that referenced
this issue
Feb 11, 2022
…ll barrier (#322) in that they prevent a read or write program-order-before the atomic from being reordered with a read or write program-order-after the atomic. However, a read or write program-order-after the atomic is allowed to be reordered with the write part of the atomic and a read or write program-order-before the atomic is allowed to be reordered with the read part of the atomic. Arm deemed this behaviour undesirable and the following change strengthen the memory model to forbid it. 1. Motivation ------------- The following behaviour: AArch64 SB+CASAL+LDR { 0:X1=x; 0:X3=y; 1:X1=y; 1:X3=x; 0:X4=0; 1:X4=0; x=0; y=0; } P0 | P1 ; MOV W0,#1 | MOV W0,#1 ; CASAL W4,W0,[X1] | CASAL W4, W0,[X1] ; LDR W2,[X3] | LDR W2,[X3] ; exists (0:X2=0 /\ 1:X2=0 /\ x=1 /\ y=1) is currently allowed. More precisely, the current definition of CASAL does not provide all of the ordering guarantees that are provided by a similar instruction in another architecture. The semantics of this other instruction have been used as the basis for API routines in a mainstream operating system, and this has resulted in a significant amount of existing code expecting stronger semantics than currently provided by CASAL. This incompatibility requires extra barriers to be inserted in the Arm version of these API routines, which results in more ordering operations than is strictly necessary during these routines. Arm proposes the following changes to provide semantics that match the expectation of the API routines. The proposed strengthening applies to all atomics (not just CAS) which bear the annotation -AL, and does not apply to Load/Store exclusives. This strengthening is aligned in principle with the fact that the following behaviour is already forbidden by the architecture: AArch64 SB+STR+CASAL+LDR2 { 0:X5=x; 0:X3=y; 1:X5=y; 1:X3=x; 0:X1=z; 1:X1=z; 0:X4=0; 1:X4=0; x=0; y=0; z=0; } P0 | P1 ; MOV W6,#1 | MOV W6,#1 ; STR W6,[X5] | STR W6,[X5] ; MOV W0,#1 | MOV W0,#1 ; CASAL W4,W0,[X1] | CASAL W4, W0,[X1] ; LDR W2,[X3] | LDR W2,[X3] ; exists (0:X2=0 /\ 1:X2=0 /\ x=1 /\ y=1 /\ z=1) which is already forbidden by the following clause in the definition of Barrier-ordered-before: - RW1 appears in program order before an atomic instruction with both Acquire and Release semantics that appears in program order before RW2. However it is worth noting that this strengthening would not forbid this behaviour (note that in the following test the CAS instructions fail and therefore do not store to memory): AArch64 SB+STR+CASAL+LDR { 0:X1=x; 0:X3=y; 1:X1=y; 1:X3=x; 0:X4=0; 1:X4=0; x=0; y=0; } P0 | P1 ; MOV W0,#1 | MOV W0,#1 ; STR W0,[X1] | STR W0,[X1] ; CASAL W4,W0,[X1] | CASAL W4, W0,[X1] ; LDR W2,[X3] | LDR W2,[X3] ; exists (0:X2=0 /\ 1:X2=0 /\ x=1 /\ y=1) Arm documentation change ------------------------ In Barrier-ordered-before, the clause which reads: - RW1 appears in program order before an atomic instruction with both Acquire and Release semantics that appears in program order before RW2. is strengthened to read: - RW1 is a write W1 and is generated by an atomic instruction with both Acquire and Release semantics Arm cat file change ------------------- In bob, the clause which reads: | po; ([A];amo;[L]); po is strengthened to read: | [range([A];amo;[L])]; po
jalglave
added a commit
that referenced
this issue
Feb 11, 2022
This change is a relaxation. The second clause of Atomic-ordered-before was put in place to accommodate for C++ ordering. However this clause need not be that strong to meet C++ and Arm relaxes it as a consequence. Motivation ------------- The following behaviour: AArch64 MP+rel+ctrl-rfi-rmw-rfiXA-R { 0:X1=x; 0:X3=y; 1:X1=y; 1:X11=x; 1:X5=z; } P0 | P1 ; MOV X0,#1 | LDR X0,[X1] ; STR X0,[X1] | CMP X0,#1 ; MOV X2,#1 | B.NE L0 ; | L0: ; STLR X2,[X3] | STR X2,[X5] ; | LDXR X6, [X5] ; | STXR W7, X8, [X5]; | LDAR X9,[X5] ; | LDR X10,[X11] ; exists (1:X0=1 /\ 1:X10=0 /\ 1:X7=0) is currently forbidden. This stems from the second clause of Atomic-ordered-before, which was put in place to accommodate for C++ release sequences. However, this second clause need not be that strong to meet C++, and therefore Arm relaxes it so that MP+rel+ctrl-rfi-rmw-rfiXA-R is allowed. 2. Changes to the Arm spec ------------------------------- In Atomic-ordered-before, the clause which reads: - RW1 is a write W1 generated by an atomic instruction or a successful Store-Exclusive instruction and RW2 is a read R2 generated by an instruction with Acquire or AcquirePC semantics such that R2 is a Local read successor of W1. is relaxed to read: - RW1 is a read (R1) generated by an atomic instruction (resp. a Load-Exclusive instruction) and RW2 is a read (R2) generated by an instruction with Acquire or AcquirePC semantics such that R2 is a Local read successor of the write W3 generated by the same atomic instruction as R1 (resp. the successful Store-Exclusive instruction paired with the Load-Exclusive instruction which generated R1). 3. Changes to the AArch64 cat file ----------------------------------- In aob, the clause which reads: | [range(rmw)]; lrs; [A | Q] is relaxed to read: | rmw; lrs; [A | Q]
jalglave
added a commit
that referenced
this issue
Feb 11, 2022
This change is a relaxation. The second clause of Atomic-ordered-before was put in place to accommodate for C++ ordering. However this clause need not be that strong to meet C++ and Arm relaxes it as a consequence. Motivation ------------- The following behaviour: AArch64 MP+rel+ctrl-rfi-rmw-rfiXA-R { 0:X1=x; 0:X3=y; 1:X1=y; 1:X11=x; 1:X5=z; } P0 | P1 ; MOV X0,#1 | LDR X0,[X1] ; STR X0,[X1] | CMP X0,#1 ; MOV X2,#1 | B.NE L0 ; | L0: ; STLR X2,[X3] | STR X2,[X5] ; | LDXR X6, [X5] ; | STXR W7, X8, [X5]; | LDAR X9,[X5] ; | LDR X10,[X11] ; exists (1:X0=1 /\ 1:X10=0 /\ 1:X7=0) is currently forbidden. This stems from the second clause of Atomic-ordered-before, which was put in place to accommodate for C++ release sequences. However, this second clause need not be that strong to meet C++, and therefore Arm relaxes it so that MP+rel+ctrl-rfi-rmw-rfiXA-R is allowed. 2. Changes to the Arm spec ------------------------------- In Atomic-ordered-before, the clause which reads: - RW1 is a write W1 generated by an atomic instruction or a successful Store-Exclusive instruction and RW2 is a read R2 generated by an instruction with Acquire or AcquirePC semantics such that R2 is a Local read successor of W1. is relaxed to read: - RW1 is a read (R1) generated by an atomic instruction (resp. a Load-Exclusive instruction) and RW2 is a read (R2) generated by an instruction with Acquire or AcquirePC semantics such that R2 is a Local read successor of the write W3 generated by the same atomic instruction as R1 (resp. the successful Store-Exclusive instruction paired with the Load-Exclusive instruction which generated R1). 3. Changes to the AArch64 cat file ----------------------------------- In aob, the clause which reads: | [range(rmw)]; lrs; [A | Q] is relaxed to read: | rmw; lrs; [A | Q]
jalglave
added a commit
that referenced
this issue
Feb 23, 2022
Arm has investigated the weakening of its dependency definitions, in particular the notion of control dependency. In light of this investigation, Arm has decided to keep its notion of control dependency strong, but proposes to introduce a new notion of dependency, called "Pick dependency". Pick dependencies allow for very targetted ordering, and it is hoped that this will enable optimisations. See also: https://www.spinics.net/lists/kernel/msg4028862.html and the attached tech report entitled "The Ballad of Dependencies". 1. Motivation ------------- At the moment the behaviour MP+rel+CAS-addr: AArch64 MP+rel+CAS-addr { z=1; 0:X1=x; 0:X3=y; 1:X1=x; 1:X3=y; 1:X8=z; } P0 | P1 ; MOV W0, #1 | LDR W0, [X1] ; STR W0, [X3] | MOV W5, W0 ; STLR W0, [X1] | CAS W0, W6, [X8] ; | LDR W0, [X8] ; | EOR X0, X0, X0 ; | ADD X3, X3, X0 ; | LDR W4, [X3] ; exists (1:X5=1 /\ 1:X4=0) is Forbidden by the architecture. Arm would like to weaken this to allow the behaviour. The reason for MP+rel-CAS-addr being currently Forbidden is as follows: 1. there is a Data dependency on P1 from the Memory read Effect of LDR W0,[X1] (event c) to the Memory write Effect of CAS W0,W6,[X8] (event e) 2. there is a read-from on P1 from the Memory write Effect of CAS W0,W6,[X8] (event e) to the Memory read Effect of LDR W0,[X8] (event f) - in other words the Memory read Effect of LDR W0,[X8] (event f) is a Local read successor of the Memory write Effect of CAS W0,W6,[X8] (event e) 3. there is an Address dependency on P1 from the Memory read Effect of LDR W0,[X8] (event f) to the Memory read Effect of LDR W4,[X3] (event g). Points 1 and 2 combined impose order between the Memory read Effect of LDR W0,[X1] (event c) to the Memory read Effect of LDR W0,[X8] (event f), because data; rfi is in dob. In the text of the Arm ARM this corresponds to the clause: --(*)-- RW2 is a Local read successor R2 of a write W3 and there is an Address dependency or a Data dependency from R1 to W3. ------- A way forward to allow this behaviour would be to decide that there is no ordering altogether between the Memory read Effect of LDR W0,[X1] (event c) to the Memory write Effect of CAS W0,W6,[X8] (event e). However, this would lead to allowing LB+rel+CAS, which must be Forbidden: AArch64 LB+rel+CAS { 0:X1=x; 0:X3=y; 1:X1=x; 1:X3=y; } P0 | P1 ; MOV W9, #1 | MOV W4, #1 ; LDR W0, [X3] | LDR W5, [X1] ; STLR W9, [X1] | AND W10, W5, #2 ; | CAS W10, W4, [X3] ; exists (1:X5=1 /\ 0:X0=1) To accommodate for MP+rel+CAS-addr being Allowed as desired, and LB+rel+CAS still being Forbidden, Arm proposes to create a new type of dependency (called "Pick dependency"), such that: - Pick dependencies are not subjected to the clause (*) above, hence will not impose order in the case of MP+rel+CAS-addr - Pick dependencies provide order in the case of LB+rel+CAS Interestingly, this should not prescribe ordering in the litmus test MP+rel+CSEL for example. Therefore CSEL instructions should not provide similar ordering to branch instructions. However this highlights again the need for a new dependency notion, as the litmus test MP+rel+CSEL-data must be ordered. As in the CAS case, the nature of that new dependency notion is interesting. Indeed the test MP+rel+CSEL-addr should be allowed, which in turn entails that this new dependency cannot be a Data dependency, as otherwise the LDR W1,[X2] and the LDR W9,[X10,X8,SXTW] would be ordered. Arm proposes that CAS and CSEL create Pick dependencies. 2. Tests -------- Tests ----- AArch64 MP+CAS-rfi-ctrl+acq { 0:X0=x; 0:X4=y; 1:X1=y; 1:X3=x; } P0 | P1 ; MOV W1,#0 | LDAR W0,[X1] ; MOV W2,#1 | LDR W2,[X3] ; CAS W1,W2,[X0] | ; CBNZ W1,out | ; MOV W3,#1 | ; STR W3,[X4] | ; out: | ; exists (x=1 /\ 1:X0=1 /\ 1:X2=0) AArch64 R+CAS-rfi-ctrl+DMBST "Amo.Cas Rfi DpCtrldW Coe DMB.STdWW Rfe" { 0:X0=x; 0:X5=y; 1:X1=y; 1:X3=x; } P0 | P1 ; MOV W1,#1 | MOV W0,#2 ; MOV W2,#2 | STR W0,[X1] ; CAS W1,W2,[X0] | DMB ST ; LDR W3,[X0] | MOV W2,#1 ; CBNZ W3,LC00 | STR W2,[X3] ; LC00: | ; MOV W4,#1 | ; STR W4,[X5] | ; exists (x=2 /\ y=2 /\ 0:X1=1 /\ 0:X3=2) AArch64 LB+CAS-rfi-ctrl+DMBSY { 0:X0=x; 0:X5=y; 1:X1=y; 1:X3=x; } P0 | P1 ; MOV W1,#1 | ; MOV W2,#2 | LDR W0,[X1] ; CAS W1,W2,[X0] | DMB SY ; LDR W3,[X0] | MOV W2,#1 ; CBNZ W3,LC00 | STR W2,[X3] ; LC00: | ; MOV W4,#1 | ; STR W4,[X5] | ; exists (x=2 /\ 0:X1=1 /\ 0:X3=2 /\ 1:X0=1) AArch64 R+CAS+DMBLD "PodWR Amo.Cas Coe DMB.SYdWR Fre" { 0:X1=x; 0:X2=y; 1:X1=y; 1:X3=x; } P0 | P1 ; MOV W0,#1 | MOV W0,#2 ; STR W0,[X1] | STR W0,[X1] ; MOV W3,#0 | DMB SY ; MOV W4,#1 | LDR W2,[X3] ; CAS W3,W4,[X2] | ; exists (y=2 /\ 0:X3=0 /\ 1:X2=0) AArch64 SB+CAS-rfi-addr+DMBSY "PodWR Amo.Cas Rfi DpAddrdR Fre DMB.SYdWR Fre" { 0:X1=x; 0:X2=y; 0:X8=z; 1:X1=z; 1:X3=x; } P0 | P1 ; MOV W0,#1 | MOV W0,#1 ; STR W0,[X1] | STR W0,[X1] ; MOV W3,#0 | DMB SY ; MOV W4,#1 | LDR W2,[X3] ; CAS W3,W4,[X2] | ; LDR W5,[X2] | ; EOR W6,W5,W5 | ; LDR W7,[X8,W6,SXTW] | ; exists (0:X3=0 /\ 0:X5=1 /\ 0:X7=0 /\ 1:X2=0) AArch64 SB+SWP-rfi-addr+DMBSY "PodWR Amo.Swp Rfi DpAddrdR Fre DMB.SYdWR Fre" { 0:X1=x; 0:X2=y; 0:X8=z; 1:X1=z; 1:X3=x; } P0 | P1 ; MOV W0,#1 | MOV W0,#1 ; STR W0,[X1] | STR W0,[X1] ; MOV W4,#1 | DMB SY ; SWP W4,W3,[X2] | LDR W2,[X3] ; LDR W5,[X2] | ; EOR W6,W5,W5 | ; LDR W7,[X8,W6,SXTW] | ; exists (0:X3=0 /\ 0:X5=1 /\ 0:X7=0 /\ 1:X2=0) AArch64 MP+rel+CAS-addr { z=1; 0:X1=x; 0:X3=y; 1:X1=x; 1:X3=y; 1:X8=z; } P0 | P1 ; MOV W0, #1 | LDR W0, [X1] ; STR W0, [X3] | MOV W5, W0 ; STLR W0, [X1] | CAS W0, W6, [X8] ; | LDR W0, [X8] ; | EOR X0, X0, X0 ; | ADD X3, X3, X0 ; | LDR W4, [X3] ; exists (1:X5=1 /\ 1:X4=0) AArch64 LB+rel+CAS { 0:X1=x; 0:X3=y; 1:X1=x; 1:X3=y; } P0 | P1 ; MOV W9, #1 | MOV W4, #1 ; LDR W0, [X3] | LDR W5, [X1] ; STLR W9, [X1] | AND W10, W5, #2 ; | CAS W10, W4, [X3] ; exists (1:X5=1 /\ 0:X0=1)
jalglave
added a commit
that referenced
this issue
Feb 24, 2022
Arm has investigated the weakening of its dependency definitions, in particular the notion of control dependency. In light of this investigation, Arm has decided to keep its notion of control dependency strong, but proposes to introduce a new notion of dependency, called "Pick dependency". Pick dependencies allow for very targetted ordering, and it is hoped that this will enable optimisations. See also: https://www.spinics.net/lists/kernel/msg4028862.html and the attached tech report entitled "The Ballad of Dependencies". 1. Motivation ------------- At the moment the behaviour MP+rel+CAS-addr: AArch64 MP+rel+CAS-addr { z=1; 0:X1=x; 0:X3=y; 1:X1=x; 1:X3=y; 1:X8=z; } P0 | P1 ; MOV W0, #1 | LDR W0, [X1] ; STR W0, [X3] | MOV W5, W0 ; STLR W0, [X1] | CAS W0, W6, [X8] ; | LDR W0, [X8] ; | EOR X0, X0, X0 ; | ADD X3, X3, X0 ; | LDR W4, [X3] ; exists (1:X5=1 /\ 1:X4=0) is Forbidden by the architecture. Arm would like to weaken this to allow the behaviour. The reason for MP+rel-CAS-addr being currently Forbidden is as follows: 1. there is a Data dependency on P1 from the Memory read Effect of LDR W0,[X1] (event c) to the Memory write Effect of CAS W0,W6,[X8] (event e) 2. there is a read-from on P1 from the Memory write Effect of CAS W0,W6,[X8] (event e) to the Memory read Effect of LDR W0,[X8] (event f) - in other words the Memory read Effect of LDR W0,[X8] (event f) is a Local read successor of the Memory write Effect of CAS W0,W6,[X8] (event e) 3. there is an Address dependency on P1 from the Memory read Effect of LDR W0,[X8] (event f) to the Memory read Effect of LDR W4,[X3] (event g). Points 1 and 2 combined impose order between the Memory read Effect of LDR W0,[X1] (event c) to the Memory read Effect of LDR W0,[X8] (event f), because data; rfi is in dob. In the text of the Arm ARM this corresponds to the clause: --(*)-- RW2 is a Local read successor R2 of a write W3 and there is an Address dependency or a Data dependency from R1 to W3. ------- A way forward to allow this behaviour would be to decide that there is no ordering altogether between the Memory read Effect of LDR W0,[X1] (event c) to the Memory write Effect of CAS W0,W6,[X8] (event e). However, this would lead to allowing LB+rel+CAS, which must be Forbidden: AArch64 LB+rel+CAS { 0:X1=x; 0:X3=y; 1:X1=x; 1:X3=y; } P0 | P1 ; MOV W9, #1 | MOV W4, #1 ; LDR W0, [X3] | LDR W5, [X1] ; STLR W9, [X1] | AND W10, W5, #2 ; | CAS W10, W4, [X3] ; exists (1:X5=1 /\ 0:X0=1) To accommodate for MP+rel+CAS-addr being Allowed as desired, and LB+rel+CAS still being Forbidden, Arm proposes to create a new type of dependency (called "Pick dependency"), such that: - Pick dependencies are not subjected to the clause (*) above, hence will not impose order in the case of MP+rel+CAS-addr - Pick dependencies provide order in the case of LB+rel+CAS Interestingly, this should not prescribe ordering in the litmus test MP+rel+CSEL for example. Therefore CSEL instructions should not provide similar ordering to branch instructions. However this highlights again the need for a new dependency notion, as the litmus test MP+rel+CSEL-data must be ordered. As in the CAS case, the nature of that new dependency notion is interesting. Indeed the test MP+rel+CSEL-addr should be allowed, which in turn entails that this new dependency cannot be a Data dependency, as otherwise the LDR W1,[X2] and the LDR W9,[X10,X8,SXTW] would be ordered. Arm proposes that CAS and CSEL create Pick dependencies. 2. Tests -------- Tests ----- AArch64 MP+CAS-rfi-ctrl+acq { 0:X0=x; 0:X4=y; 1:X1=y; 1:X3=x; } P0 | P1 ; MOV W1,#0 | LDAR W0,[X1] ; MOV W2,#1 | LDR W2,[X3] ; CAS W1,W2,[X0] | ; CBNZ W1,out | ; MOV W3,#1 | ; STR W3,[X4] | ; out: | ; exists (x=1 /\ 1:X0=1 /\ 1:X2=0) AArch64 R+CAS-rfi-ctrl+DMBST "Amo.Cas Rfi DpCtrldW Coe DMB.STdWW Rfe" { 0:X0=x; 0:X5=y; 1:X1=y; 1:X3=x; } P0 | P1 ; MOV W1,#1 | MOV W0,#2 ; MOV W2,#2 | STR W0,[X1] ; CAS W1,W2,[X0] | DMB ST ; LDR W3,[X0] | MOV W2,#1 ; CBNZ W3,LC00 | STR W2,[X3] ; LC00: | ; MOV W4,#1 | ; STR W4,[X5] | ; exists (x=2 /\ y=2 /\ 0:X1=1 /\ 0:X3=2) AArch64 LB+CAS-rfi-ctrl+DMBSY { 0:X0=x; 0:X5=y; 1:X1=y; 1:X3=x; } P0 | P1 ; MOV W1,#1 | ; MOV W2,#2 | LDR W0,[X1] ; CAS W1,W2,[X0] | DMB SY ; LDR W3,[X0] | MOV W2,#1 ; CBNZ W3,LC00 | STR W2,[X3] ; LC00: | ; MOV W4,#1 | ; STR W4,[X5] | ; exists (x=2 /\ 0:X1=1 /\ 0:X3=2 /\ 1:X0=1) AArch64 R+CAS+DMBLD "PodWR Amo.Cas Coe DMB.SYdWR Fre" { 0:X1=x; 0:X2=y; 1:X1=y; 1:X3=x; } P0 | P1 ; MOV W0,#1 | MOV W0,#2 ; STR W0,[X1] | STR W0,[X1] ; MOV W3,#0 | DMB SY ; MOV W4,#1 | LDR W2,[X3] ; CAS W3,W4,[X2] | ; exists (y=2 /\ 0:X3=0 /\ 1:X2=0) AArch64 SB+CAS-rfi-addr+DMBSY "PodWR Amo.Cas Rfi DpAddrdR Fre DMB.SYdWR Fre" { 0:X1=x; 0:X2=y; 0:X8=z; 1:X1=z; 1:X3=x; } P0 | P1 ; MOV W0,#1 | MOV W0,#1 ; STR W0,[X1] | STR W0,[X1] ; MOV W3,#0 | DMB SY ; MOV W4,#1 | LDR W2,[X3] ; CAS W3,W4,[X2] | ; LDR W5,[X2] | ; EOR W6,W5,W5 | ; LDR W7,[X8,W6,SXTW] | ; exists (0:X3=0 /\ 0:X5=1 /\ 0:X7=0 /\ 1:X2=0) AArch64 SB+SWP-rfi-addr+DMBSY "PodWR Amo.Swp Rfi DpAddrdR Fre DMB.SYdWR Fre" { 0:X1=x; 0:X2=y; 0:X8=z; 1:X1=z; 1:X3=x; } P0 | P1 ; MOV W0,#1 | MOV W0,#1 ; STR W0,[X1] | STR W0,[X1] ; MOV W4,#1 | DMB SY ; SWP W4,W3,[X2] | LDR W2,[X3] ; LDR W5,[X2] | ; EOR W6,W5,W5 | ; LDR W7,[X8,W6,SXTW] | ; exists (0:X3=0 /\ 0:X5=1 /\ 0:X7=0 /\ 1:X2=0) AArch64 MP+rel+CAS-addr { z=1; 0:X1=x; 0:X3=y; 1:X1=x; 1:X3=y; 1:X8=z; } P0 | P1 ; MOV W0, #1 | LDR W0, [X1] ; STR W0, [X3] | MOV W5, W0 ; STLR W0, [X1] | CAS W0, W6, [X8] ; | LDR W0, [X8] ; | EOR X0, X0, X0 ; | ADD X3, X3, X0 ; | LDR W4, [X3] ; exists (1:X5=1 /\ 1:X4=0) AArch64 LB+rel+CAS { 0:X1=x; 0:X3=y; 1:X1=x; 1:X3=y; } P0 | P1 ; MOV W9, #1 | MOV W4, #1 ; LDR W0, [X3] | LDR W5, [X1] ; STLR W9, [X1] | AND W10, W5, #2 ; | CAS W10, W4, [X3] ; exists (1:X5=1 /\ 0:X0=1)
maranget
pushed a commit
that referenced
this issue
Feb 28, 2022
Arm has investigated the weakening of its dependency definitions, in particular the notion of control dependency. In light of this investigation, Arm has decided to keep its notion of control dependency strong, but proposes to introduce a new notion of dependency, called "Pick dependency". Pick dependencies allow for very targetted ordering, and it is hoped that this will enable optimisations. See also: https://www.spinics.net/lists/kernel/msg4028862.html and the attached tech report entitled "The Ballad of Dependencies". 1. Motivation ------------- At the moment the behaviour MP+rel+CAS-addr: AArch64 MP+rel+CAS-addr { z=1; 0:X1=x; 0:X3=y; 1:X1=x; 1:X3=y; 1:X8=z; } P0 | P1 ; MOV W0, #1 | LDR W0, [X1] ; STR W0, [X3] | MOV W5, W0 ; STLR W0, [X1] | CAS W0, W6, [X8] ; | LDR W0, [X8] ; | EOR X0, X0, X0 ; | ADD X3, X3, X0 ; | LDR W4, [X3] ; exists (1:X5=1 /\ 1:X4=0) is Forbidden by the architecture. Arm would like to weaken this to allow the behaviour. The reason for MP+rel-CAS-addr being currently Forbidden is as follows: 1. there is a Data dependency on P1 from the Memory read Effect of LDR W0,[X1] (event c) to the Memory write Effect of CAS W0,W6,[X8] (event e) 2. there is a read-from on P1 from the Memory write Effect of CAS W0,W6,[X8] (event e) to the Memory read Effect of LDR W0,[X8] (event f) - in other words the Memory read Effect of LDR W0,[X8] (event f) is a Local read successor of the Memory write Effect of CAS W0,W6,[X8] (event e) 3. there is an Address dependency on P1 from the Memory read Effect of LDR W0,[X8] (event f) to the Memory read Effect of LDR W4,[X3] (event g). Points 1 and 2 combined impose order between the Memory read Effect of LDR W0,[X1] (event c) to the Memory read Effect of LDR W0,[X8] (event f), because data; rfi is in dob. In the text of the Arm ARM this corresponds to the clause: --(*)-- RW2 is a Local read successor R2 of a write W3 and there is an Address dependency or a Data dependency from R1 to W3. ------- A way forward to allow this behaviour would be to decide that there is no ordering altogether between the Memory read Effect of LDR W0,[X1] (event c) to the Memory write Effect of CAS W0,W6,[X8] (event e). However, this would lead to allowing LB+rel+CAS, which must be Forbidden: AArch64 LB+rel+CAS { 0:X1=x; 0:X3=y; 1:X1=x; 1:X3=y; } P0 | P1 ; MOV W9, #1 | MOV W4, #1 ; LDR W0, [X3] | LDR W5, [X1] ; STLR W9, [X1] | AND W10, W5, #2 ; | CAS W10, W4, [X3] ; exists (1:X5=1 /\ 0:X0=1) To accommodate for MP+rel+CAS-addr being Allowed as desired, and LB+rel+CAS still being Forbidden, Arm proposes to create a new type of dependency (called "Pick dependency"), such that: - Pick dependencies are not subjected to the clause (*) above, hence will not impose order in the case of MP+rel+CAS-addr - Pick dependencies provide order in the case of LB+rel+CAS Interestingly, this should not prescribe ordering in the litmus test MP+rel+CSEL for example. Therefore CSEL instructions should not provide similar ordering to branch instructions. However this highlights again the need for a new dependency notion, as the litmus test MP+rel+CSEL-data must be ordered. As in the CAS case, the nature of that new dependency notion is interesting. Indeed the test MP+rel+CSEL-addr should be allowed, which in turn entails that this new dependency cannot be a Data dependency, as otherwise the LDR W1,[X2] and the LDR W9,[X10,X8,SXTW] would be ordered. Arm proposes that CAS and CSEL create Pick dependencies. 2. Tests -------- Tests ----- AArch64 MP+CAS-rfi-ctrl+acq { 0:X0=x; 0:X4=y; 1:X1=y; 1:X3=x; } P0 | P1 ; MOV W1,#0 | LDAR W0,[X1] ; MOV W2,#1 | LDR W2,[X3] ; CAS W1,W2,[X0] | ; CBNZ W1,out | ; MOV W3,#1 | ; STR W3,[X4] | ; out: | ; exists (x=1 /\ 1:X0=1 /\ 1:X2=0) AArch64 R+CAS-rfi-ctrl+DMBST "Amo.Cas Rfi DpCtrldW Coe DMB.STdWW Rfe" { 0:X0=x; 0:X5=y; 1:X1=y; 1:X3=x; } P0 | P1 ; MOV W1,#1 | MOV W0,#2 ; MOV W2,#2 | STR W0,[X1] ; CAS W1,W2,[X0] | DMB ST ; LDR W3,[X0] | MOV W2,#1 ; CBNZ W3,LC00 | STR W2,[X3] ; LC00: | ; MOV W4,#1 | ; STR W4,[X5] | ; exists (x=2 /\ y=2 /\ 0:X1=1 /\ 0:X3=2) AArch64 LB+CAS-rfi-ctrl+DMBSY { 0:X0=x; 0:X5=y; 1:X1=y; 1:X3=x; } P0 | P1 ; MOV W1,#1 | ; MOV W2,#2 | LDR W0,[X1] ; CAS W1,W2,[X0] | DMB SY ; LDR W3,[X0] | MOV W2,#1 ; CBNZ W3,LC00 | STR W2,[X3] ; LC00: | ; MOV W4,#1 | ; STR W4,[X5] | ; exists (x=2 /\ 0:X1=1 /\ 0:X3=2 /\ 1:X0=1) AArch64 R+CAS+DMBLD "PodWR Amo.Cas Coe DMB.SYdWR Fre" { 0:X1=x; 0:X2=y; 1:X1=y; 1:X3=x; } P0 | P1 ; MOV W0,#1 | MOV W0,#2 ; STR W0,[X1] | STR W0,[X1] ; MOV W3,#0 | DMB SY ; MOV W4,#1 | LDR W2,[X3] ; CAS W3,W4,[X2] | ; exists (y=2 /\ 0:X3=0 /\ 1:X2=0) AArch64 SB+CAS-rfi-addr+DMBSY "PodWR Amo.Cas Rfi DpAddrdR Fre DMB.SYdWR Fre" { 0:X1=x; 0:X2=y; 0:X8=z; 1:X1=z; 1:X3=x; } P0 | P1 ; MOV W0,#1 | MOV W0,#1 ; STR W0,[X1] | STR W0,[X1] ; MOV W3,#0 | DMB SY ; MOV W4,#1 | LDR W2,[X3] ; CAS W3,W4,[X2] | ; LDR W5,[X2] | ; EOR W6,W5,W5 | ; LDR W7,[X8,W6,SXTW] | ; exists (0:X3=0 /\ 0:X5=1 /\ 0:X7=0 /\ 1:X2=0) AArch64 SB+SWP-rfi-addr+DMBSY "PodWR Amo.Swp Rfi DpAddrdR Fre DMB.SYdWR Fre" { 0:X1=x; 0:X2=y; 0:X8=z; 1:X1=z; 1:X3=x; } P0 | P1 ; MOV W0,#1 | MOV W0,#1 ; STR W0,[X1] | STR W0,[X1] ; MOV W4,#1 | DMB SY ; SWP W4,W3,[X2] | LDR W2,[X3] ; LDR W5,[X2] | ; EOR W6,W5,W5 | ; LDR W7,[X8,W6,SXTW] | ; exists (0:X3=0 /\ 0:X5=1 /\ 0:X7=0 /\ 1:X2=0) AArch64 MP+rel+CAS-addr { z=1; 0:X1=x; 0:X3=y; 1:X1=x; 1:X3=y; 1:X8=z; } P0 | P1 ; MOV W0, #1 | LDR W0, [X1] ; STR W0, [X3] | MOV W5, W0 ; STLR W0, [X1] | CAS W0, W6, [X8] ; | LDR W0, [X8] ; | EOR X0, X0, X0 ; | ADD X3, X3, X0 ; | LDR W4, [X3] ; exists (1:X5=1 /\ 1:X4=0) AArch64 LB+rel+CAS { 0:X1=x; 0:X3=y; 1:X1=x; 1:X3=y; } P0 | P1 ; MOV W9, #1 | MOV W4, #1 ; LDR W0, [X3] | LDR W5, [X1] ; STLR W9, [X1] | AND W10, W5, #2 ; | CAS W10, W4, [X3] ; exists (1:X5=1 /\ 0:X0=1)
jalglave
added a commit
that referenced
this issue
Mar 1, 2022
Arm has investigated the weakening of its dependency definitions, in particular the notion of control dependency. In light of this investigation, Arm has decided to keep its notion of control dependency strong, but proposes to introduce a new notion of dependency, called "Pick dependency". Pick dependencies allow for very targetted ordering, and it is hoped that this will enable optimisations. See also: https://www.spinics.net/lists/kernel/msg4028862.html and the attached tech report entitled "The Ballad of Dependencies". 1. Motivation ------------- At the moment the behaviour MP+rel+CAS-addr: AArch64 MP+rel+CAS-addr { z=1; 0:X1=x; 0:X3=y; 1:X1=x; 1:X3=y; 1:X8=z; } P0 | P1 ; MOV W0, #1 | LDR W0, [X1] ; STR W0, [X3] | MOV W5, W0 ; STLR W0, [X1] | CAS W0, W6, [X8] ; | LDR W0, [X8] ; | EOR X0, X0, X0 ; | ADD X3, X3, X0 ; | LDR W4, [X3] ; exists (1:X5=1 /\ 1:X4=0) is Forbidden by the architecture. Arm would like to weaken this to allow the behaviour. The reason for MP+rel-CAS-addr being currently Forbidden is as follows: 1. there is a Data dependency on P1 from the Memory read Effect of LDR W0,[X1] (event c) to the Memory write Effect of CAS W0,W6,[X8] (event e) 2. there is a read-from on P1 from the Memory write Effect of CAS W0,W6,[X8] (event e) to the Memory read Effect of LDR W0,[X8] (event f) - in other words the Memory read Effect of LDR W0,[X8] (event f) is a Local read successor of the Memory write Effect of CAS W0,W6,[X8] (event e) 3. there is an Address dependency on P1 from the Memory read Effect of LDR W0,[X8] (event f) to the Memory read Effect of LDR W4,[X3] (event g). Points 1 and 2 combined impose order between the Memory read Effect of LDR W0,[X1] (event c) to the Memory read Effect of LDR W0,[X8] (event f), because data; rfi is in dob. In the text of the Arm ARM this corresponds to the clause: --(*)-- RW2 is a Local read successor R2 of a write W3 and there is an Address dependency or a Data dependency from R1 to W3. ------- A way forward to allow this behaviour would be to decide that there is no ordering altogether between the Memory read Effect of LDR W0,[X1] (event c) to the Memory write Effect of CAS W0,W6,[X8] (event e). However, this would lead to allowing LB+rel+CAS, which must be Forbidden: AArch64 LB+rel+CAS { 0:X1=x; 0:X3=y; 1:X1=x; 1:X3=y; } P0 | P1 ; MOV W9, #1 | MOV W4, #1 ; LDR W0, [X3] | LDR W5, [X1] ; STLR W9, [X1] | AND W10, W5, #2 ; | CAS W10, W4, [X3] ; exists (1:X5=1 /\ 0:X0=1) To accommodate for MP+rel+CAS-addr being Allowed as desired, and LB+rel+CAS still being Forbidden, Arm proposes to create a new type of dependency (called "Pick dependency"), such that: - Pick dependencies are not subjected to the clause (*) above, hence will not impose order in the case of MP+rel+CAS-addr - Pick dependencies provide order in the case of LB+rel+CAS Interestingly, this should not prescribe ordering in the litmus test MP+rel+CSEL for example. Therefore CSEL instructions should not provide similar ordering to branch instructions. However this highlights again the need for a new dependency notion, as the litmus test MP+rel+CSEL-data must be ordered. As in the CAS case, the nature of that new dependency notion is interesting. Indeed the test MP+rel+CSEL-addr should be allowed, which in turn entails that this new dependency cannot be a Data dependency, as otherwise the LDR W1,[X2] and the LDR W9,[X10,X8,SXTW] would be ordered. Arm proposes that CAS and CSEL create Pick dependencies. 2. Tests -------- Tests ----- AArch64 MP+CAS-rfi-ctrl+acq { 0:X0=x; 0:X4=y; 1:X1=y; 1:X3=x; } P0 | P1 ; MOV W1,#0 | LDAR W0,[X1] ; MOV W2,#1 | LDR W2,[X3] ; CAS W1,W2,[X0] | ; CBNZ W1,out | ; MOV W3,#1 | ; STR W3,[X4] | ; out: | ; exists (x=1 /\ 1:X0=1 /\ 1:X2=0) AArch64 R+CAS-rfi-ctrl+DMBST "Amo.Cas Rfi DpCtrldW Coe DMB.STdWW Rfe" { 0:X0=x; 0:X5=y; 1:X1=y; 1:X3=x; } P0 | P1 ; MOV W1,#1 | MOV W0,#2 ; MOV W2,#2 | STR W0,[X1] ; CAS W1,W2,[X0] | DMB ST ; LDR W3,[X0] | MOV W2,#1 ; CBNZ W3,LC00 | STR W2,[X3] ; LC00: | ; MOV W4,#1 | ; STR W4,[X5] | ; exists (x=2 /\ y=2 /\ 0:X1=1 /\ 0:X3=2) AArch64 LB+CAS-rfi-ctrl+DMBSY { 0:X0=x; 0:X5=y; 1:X1=y; 1:X3=x; } P0 | P1 ; MOV W1,#1 | ; MOV W2,#2 | LDR W0,[X1] ; CAS W1,W2,[X0] | DMB SY ; LDR W3,[X0] | MOV W2,#1 ; CBNZ W3,LC00 | STR W2,[X3] ; LC00: | ; MOV W4,#1 | ; STR W4,[X5] | ; exists (x=2 /\ 0:X1=1 /\ 0:X3=2 /\ 1:X0=1) AArch64 R+CAS+DMBLD "PodWR Amo.Cas Coe DMB.SYdWR Fre" { 0:X1=x; 0:X2=y; 1:X1=y; 1:X3=x; } P0 | P1 ; MOV W0,#1 | MOV W0,#2 ; STR W0,[X1] | STR W0,[X1] ; MOV W3,#0 | DMB SY ; MOV W4,#1 | LDR W2,[X3] ; CAS W3,W4,[X2] | ; exists (y=2 /\ 0:X3=0 /\ 1:X2=0) AArch64 SB+CAS-rfi-addr+DMBSY "PodWR Amo.Cas Rfi DpAddrdR Fre DMB.SYdWR Fre" { 0:X1=x; 0:X2=y; 0:X8=z; 1:X1=z; 1:X3=x; } P0 | P1 ; MOV W0,#1 | MOV W0,#1 ; STR W0,[X1] | STR W0,[X1] ; MOV W3,#0 | DMB SY ; MOV W4,#1 | LDR W2,[X3] ; CAS W3,W4,[X2] | ; LDR W5,[X2] | ; EOR W6,W5,W5 | ; LDR W7,[X8,W6,SXTW] | ; exists (0:X3=0 /\ 0:X5=1 /\ 0:X7=0 /\ 1:X2=0) AArch64 SB+SWP-rfi-addr+DMBSY "PodWR Amo.Swp Rfi DpAddrdR Fre DMB.SYdWR Fre" { 0:X1=x; 0:X2=y; 0:X8=z; 1:X1=z; 1:X3=x; } P0 | P1 ; MOV W0,#1 | MOV W0,#1 ; STR W0,[X1] | STR W0,[X1] ; MOV W4,#1 | DMB SY ; SWP W4,W3,[X2] | LDR W2,[X3] ; LDR W5,[X2] | ; EOR W6,W5,W5 | ; LDR W7,[X8,W6,SXTW] | ; exists (0:X3=0 /\ 0:X5=1 /\ 0:X7=0 /\ 1:X2=0) AArch64 MP+rel+CAS-addr { z=1; 0:X1=x; 0:X3=y; 1:X1=x; 1:X3=y; 1:X8=z; } P0 | P1 ; MOV W0, #1 | LDR W0, [X1] ; STR W0, [X3] | MOV W5, W0 ; STLR W0, [X1] | CAS W0, W6, [X8] ; | LDR W0, [X8] ; | EOR X0, X0, X0 ; | ADD X3, X3, X0 ; | LDR W4, [X3] ; exists (1:X5=1 /\ 1:X4=0) AArch64 LB+rel+CAS { 0:X1=x; 0:X3=y; 1:X1=x; 1:X3=y; } P0 | P1 ; MOV W9, #1 | MOV W4, #1 ; LDR W0, [X3] | LDR W5, [X1] ; STLR W9, [X1] | AND W10, W5, #2 ; | CAS W10, W4, [X3] ; exists (1:X5=1 /\ 0:X0=1)
maranget
pushed a commit
that referenced
this issue
Mar 2, 2022
This change is a relaxation. Arm has investigated the weakening of its dependency definitions, in particular the notion of control dependency. In light of this investigation, Arm has decided to keep its notion of control dependency strong, but proposes to introduce a new notion of dependency, called "Pick dependency". Pick dependencies allow for very targetted ordering, and it is hoped that this will enable optimisations. See also: https://www.spinics.net/lists/kernel/msg4028862.html and the attached tech report entitled "The Ballad of Dependencies". Motivation ------------- At the moment the behaviour MP+rel+CAS-addr: ``` AArch64 MP+rel+CAS-addr { z=1; 0:X1=x; 0:X3=y; 1:X1=x; 1:X3=y; 1:X8=z; } P0 | P1 ; MOV W0,#1 | LDR W0, [X1] ; STR W0,[X3] | MOV W5, W0 ; STLR W0,[X1] | CAS W0, W6, [X8] ; | LDR W0, [X8] ; | EOR X0, X0, X0 ; | ADD X3, X3, X0 ; | LDR W4, [X3] ; exists (1:X5=1 /\ 1:X4=0) ``` is Forbidden by the architecture. Arm would like to weaken this to allow the behaviour. The reason for MP+rel-CAS-addr being currently Forbidden is as follows: 1. there is a Data dependency on P1 from the Memory read Effect of LDR W0,[X1] (event c) to the Memory write Effect of CAS W0,W6,[X8] (event e) 2. there is a read-from on P1 from the Memory write Effect of CAS W0,W6,[X8] (event e) to the Memory read Effect of LDR W0,[X8] (event f) - in other words the Memory read Effect of LDR W0,[X8] (event f) is a Local read successor of the Memory write Effect of CAS W0,W6,[X8] (event e) 3. there is an Address dependency on P1 from the Memory read Effect of LDR W0,[X8] (event f) to the Memory read Effect of LDR W4,[X3] (event g). Points 1 and 2 combined impose order between the Memory read Effect of LDR W0,[X1] (event c) to the Memory read Effect of LDR W0,[X8] (event f), because data; rfi is in dob. In the text of the Arm ARM this corresponds to the clause: (*) RW2 is a Local read successor R2 of a write W3 and there is an Address dependency or a Data dependency from R1 to W3. A way forward to allow this behaviour would be to decide that there is no ordering altogether between the Memory read Effect of LDR W0,[X1] (event c) to the Memory write Effect of CAS W0,W6,[X8] (event e). However, this would lead to allowing LB+rel+CAS, which must be Forbidden: ``` AArch64 LB+rel+CAS { 0:X1=x; 0:X3=y; 1:X1=x; 1:X3=y; } P0 | P1 ; MOV W9, #1 | MOV W4, #1 ; LDR W0, [X3] | LDR W5, [X1] ; STLR W9, [X1]| AND W10, W5, #2 ; | CAS W10, W4, [X3] ; exists (1:X5=1 /\ 0:X0=1) ``` To accommodate for MP+rel+CAS-addr being Allowed as desired, and LB+rel+CAS still being Forbidden, Arm proposes to create a new type of dependency (called "Pick dependency"), such that: - Pick dependencies are not subjected to the clause (*) above, hence will not impose order in the case of MP+rel+CAS-addr - Pick dependencies provide order in the case of LB+rel+CAS Interestingly, this should not prescribe ordering in the litmus test MP+rel+CSEL for example. Therefore CSEL instructions should not provide similar ordering to branch instructions. However this highlights again the need for a new dependency notion, as the litmus test MP+rel+CSEL-data must be ordered. As in the CAS case, the nature of that new dependency notion is interesting. Indeed the test MP+rel+CSEL-addr should be allowed, which in turn entails that this new dependency cannot be a Data dependency, as otherwise the LDR W1,[X2] and the LDR W9,[X10,X8,SXTW] would be ordered. Arm proposes that CAS and CSEL create Pick dependencies. Tests -------- AArch64 MP+CAS-rfi-ctrl+acq { 0:X0=x; 0:X4=y; 1:X1=y; 1:X3=x; } P0 | P1 ; MOV W1,#0 | LDAR W0,[X1] ; MOV W2,#1 | LDR W2,[X3] ; CAS W1,W2,[X0] | ; CBNZ W1,out | ; MOV W3,#1 | ; STR W3,[X4] | ; out: | ; exists (x=1 /\ 1:X0=1 /\ 1:X2=0) AArch64 R+CAS-rfi-ctrl+DMBST "Amo.Cas Rfi DpCtrldW Coe DMB.STdWW Rfe" { 0:X0=x; 0:X5=y; 1:X1=y; 1:X3=x; } P0 | P1 ; MOV W1,#1 | MOV W0,#2 ; MOV W2,#2 | STR W0,[X1] ; CAS W1,W2,[X0] | DMB ST ; LDR W3,[X0] | MOV W2,#1 ; CBNZ W3,LC00 | STR W2,[X3] ; LC00: | ; MOV W4,#1 | ; STR W4,[X5] | ; exists (x=2 /\ y=2 /\ 0:X1=1 /\ 0:X3=2) AArch64 LB+CAS-rfi-ctrl+DMBSY { 0:X0=x; 0:X5=y; 1:X1=y; 1:X3=x; } P0 | P1 ; MOV W1,#1 | ; MOV W2,#2 | LDR W0,[X1] ; CAS W1,W2,[X0] | DMB SY ; LDR W3,[X0] | MOV W2,#1 ; CBNZ W3,LC00 | STR W2,[X3] ; LC00: | ; MOV W4,#1 | ; STR W4,[X5] | ; exists (x=2 /\ 0:X1=1 /\ 0:X3=2 /\ 1:X0=1) AArch64 R+CAS+DMBLD "PodWR Amo.Cas Coe DMB.SYdWR Fre" { 0:X1=x; 0:X2=y; 1:X1=y; 1:X3=x; } P0 | P1 ; MOV W0,#1 | MOV W0,#2 ; STR W0,[X1] | STR W0,[X1] ; MOV W3,#0 | DMB SY ; MOV W4,#1 | LDR W2,[X3] ; CAS W3,W4,[X2] | ; exists (y=2 /\ 0:X3=0 /\ 1:X2=0) AArch64 SB+CAS-rfi-addr+DMBSY "PodWR Amo.Cas Rfi DpAddrdR Fre DMB.SYdWR Fre" { 0:X1=x; 0:X2=y; 0:X8=z; 1:X1=z; 1:X3=x; } P0 | P1 ; MOV W0,#1 | MOV W0,#1 ; STR W0,[X1] | STR W0,[X1] ; MOV W3,#0 | DMB SY ; MOV W4,#1 | LDR W2,[X3] ; CAS W3,W4,[X2] | ; LDR W5,[X2] | ; EOR W6,W5,W5 | ; LDR W7,[X8,W6,SXTW] | ; exists (0:X3=0 /\ 0:X5=1 /\ 0:X7=0 /\ 1:X2=0) AArch64 SB+SWP-rfi-addr+DMBSY "PodWR Amo.Swp Rfi DpAddrdR Fre DMB.SYdWR Fre" { 0:X1=x; 0:X2=y; 0:X8=z; 1:X1=z; 1:X3=x; } P0 | P1 ; MOV W0,#1 | MOV W0,#1 ; STR W0,[X1] | STR W0,[X1] ; MOV W4,#1 | DMB SY ; SWP W4,W3,[X2] | LDR W2,[X3] ; LDR W5,[X2] | ; EOR W6,W5,W5 | ; LDR W7,[X8,W6,SXTW] | ; exists (0:X3=0 /\ 0:X5=1 /\ 0:X7=0 /\ 1:X2=0) AArch64 MP+rel+CAS-addr { z=1; 0:X1=x; 0:X3=y; 1:X1=x; 1:X3=y; 1:X8=z; } P0 | P1 ; MOV W0, #1 | LDR W0, [X1] ; STR W0, [X3] | MOV W5, W0 ; STLR W0, [X1] | CAS W0, W6, [X8] ; | LDR W0, [X8] ; | EOR X0, X0, X0 ; | ADD X3, X3, X0 ; | LDR W4, [X3] ; exists (1:X5=1 /\ 1:X4=0) AArch64 LB+rel+CAS { 0:X1=x; 0:X3=y; 1:X1=x; 1:X3=y; } P0 | P1 ; MOV W9, #1 | MOV W4, #1 ; LDR W0, [X3] | LDR W5, [X1] ; STLR W9, [X1] | AND W10, W5, #2 ; | CAS W10, W4, [X3] ; exists (1:X5=1 /\ 0:X0=1)
maranget
added a commit
that referenced
this issue
Mar 2, 2022
The `Store` pseudo-edge produces a store to location common to the previous and following edges. This store appears at thread code start. For instance: ``` % diyone7 -arch AArch64 FencedWW Rfe DpAddrdR PosRR Store Fre AArch64 A "DMB.SYdWW Rfe DpAddrdR PosRR Store Fre" Generator=diyone7 (version 7.56+02~dev) Prefetch=0:x=F,0:y=W,1:y=F,1:x=T Com=Rf Fr Orig=DMB.SYdWW Rfe DpAddrdR PosRR Store Fre { 0:X1=x; 0:X3=y; 1:X0=y; 1:X4=x; } P0 | P1 ; MOV W0,#2 | MOV W6,#1 ; STR W0,[X1] | STR W6,[X4] ; DMB SY | LDR W1,[X0] ; MOV W2,#1 | EOR W2,W1,W1 ; STR W2,[X3] | LDR W3,[X4,W2,SXTW] ; | LDR W5,[X4] ; exists ([x]=2 /\ 1:X1=1 /\ 1:X3=0 /\ 1:X5=1) ``` Generates an interesting test where the load by P1 from [x] to W3 is delayed; while the next load by P1 from [x] to W5 is not.
maranget
added a commit
that referenced
this issue
Mar 8, 2022
For a given dependency Dp_XXX_{s,d}{R,W} the csel variation syntax is by suffixing `Csel` after _XXX_. Csel variations introduce a CSEL instruction in the dependency computation chain. For instance, here is the effect of DpDataCseldW: ``` AArch64 A "DpDatadW Rfe DpDataCseldW Rfe" Generator=diyone7 (version 7.56+02~dev) Prefetch=0:x=F,0:y=W,1:y=F,1:x=W Com=Rf Rf Orig=DpDatadW Rfe DpDataCseldW Rfe { 0:X0=x; 0:X3=y; 1:X0=y; 1:X5=x; } P0 | P1 ; LDR W1,[X0] | LDR W1,[X0] ; EOR W2,W1,W1 | CMP W1,#0 ; ADD W2,W2,#1 | CSEL W2,W3,W4,EQ ; STR W2,[X3] | ADD W2,W2,#1 ; | STR W2,[X5] ; exists (0:X1=1 /\ 1:X1=1) ```
maranget
added a commit
that referenced
this issue
Mar 8, 2022
For a given dependency Dp_XXX_{s,d}{R,W} the csel variation syntax is by suffixing `Csel` after _XXX_. Csel variations introduce a CSEL instruction in the dependency computation chain. For instance, here is the effect of DpDataCseldW: ``` AArch64 A "DpDatadW Rfe DpDataCseldW Rfe" Generator=diyone7 (version 7.56+02~dev) Prefetch=0:x=F,0:y=W,1:y=F,1:x=W Com=Rf Rf Orig=DpDatadW Rfe DpDataCseldW Rfe { 0:X0=x; 0:X3=y; 1:X0=y; 1:X5=x; } P0 | P1 ; LDR W1,[X0] | LDR W1,[X0] ; EOR W2,W1,W1 | CMP W1,#0 ; ADD W2,W2,#1 | CSEL W2,W3,W4,EQ ; STR W2,[X3] | ADD W2,W2,#1 ; | STR W2,[X5] ; exists (0:X1=1 /\ 1:X1=1) ```
relokin
referenced
this issue
in relokin/herdtools7
Nov 30, 2023
This change modifies the semantics of the translation (-variant vmsa) for checked memory instructions when MTE is enabled (-variant memtag). In short, for a checked memory instruction that performs a tag check prior to the explicit memory event, it allows the tag check operation to have its own translation (read of the PTE followed by a branching decision), separate from the translation that happens and leads to the explicit memory event. Also it adds support for a new type of memory, TaggedNormal featured in FEAT_MTE2. TaggedNormal is the only type for which we peform a tag check. To illustrate this with an example, in the following test, the STR is a checked access and will result in a TagCheck fault. AArch64 STR-TaggedNormal-red { [PTE(x)]=(oa:PA(x),attrs:(TaggedNormal)); 0:X0=x:red; } P0 ; MOV W1,#1 ; L0: ; STR W1,[X0] ; locations[fault(P0:L0,x,TagCheck);] exists(x=1) While the following test, the STR will not raise a TagCheck fault: AArch64 STR-Normal-red { [PTE(x)]=(oa:PA(x),attrs:(Normal)); 0:X0=x:red; } P0 ; MOV W1,#1 ; L0: ; STR W1,[X0] ; locations[fault(P0:L0,x,TagCheck);] exists(x=1) Signed-off-by: Nikos Nikoleris <nikos.nikoleris@arm.com>
relokin
pushed a commit
to hugookeeffe/herdtools7
that referenced
this issue
Dec 1, 2023
Currently some tests generated with diy will generate an incorrect post-condition. One noteable example is coRR: AArch64 coRR "Rfe PosRR Fre" Generator=diyone7 (version 7.56+03) Com=Rf Fr Orig=Rfe PosRR Fre { 0:X1=x; 1:X0=x; } P0 | P1 ; MOV W0,herd#1 | LDR W1,[X0] ; STR W0,[X1] | LDR W2,[X0] ; exists (1:X1=1 /\ 1:X2=1) This patch aims to fix this by ensuring that oneloc tests start generating the post condition from the start of a chain of a communication edges, ensuring they are all taken into account in the post condition.
relokin
referenced
this issue
in relokin/herdtools7
Dec 8, 2023
This change modifies the semantics of the translation (-variant vmsa) for checked memory instructions when MTE is enabled (-variant memtag). In short, for a checked memory instruction that performs a tag check prior to the explicit memory event, it allows the tag check operation to have its own translation (read of the PTE followed by a branching decision), separate from the translation that happens and leads to the explicit memory event. Also it adds support for a new type of memory, TaggedNormal featured in FEAT_MTE2. TaggedNormal is the only type for which we peform a tag check. To illustrate this with an example, in the following test, the STR is a checked access and will result in a TagCheck fault. AArch64 STR-TaggedNormal-red { [PTE(x)]=(oa:PA(x),attrs:(TaggedNormal)); 0:X0=x:red; } P0 ; MOV W1,#1 ; L0: ; STR W1,[X0] ; locations[fault(P0:L0,x,TagCheck);] exists(x=1) While the following test, the STR will not raise a TagCheck fault: AArch64 STR-Normal-red { [PTE(x)]=(oa:PA(x),attrs:(Normal)); 0:X0=x:red; } P0 ; MOV W1,#1 ; L0: ; STR W1,[X0] ; locations[fault(P0:L0,x,TagCheck);] exists(x=1) Signed-off-by: Nikos Nikoleris <nikos.nikoleris@arm.com>
maranget
added a commit
that referenced
this issue
Dec 19, 2023
The tool transliterate cat definitions into English Simple usage: `miaou7 -show <name> <catfile>` will translate the definition of `<name>` from model `<catfile>` into LaTeX. More advanced usage: the output will use LaTeX commands to pretty-print event set and relation definitions. For instance the **po** relation corresponds to the the command `\newcommand{\po}[2]{#1 is program-order before #2}`. Such commands can be defined in a specific file, say `catdefs.tex`. Then, the command-line option `-tex catdefs,tex` yields the following benefits: 1. Commands outputed by `miaou7` and not defined are flagged; and 2. The number of parameters of commands tells if the command correspond to an event set (one argument) or to a relation (two arguments), thereby guaranteeing appropriate formatting of definitions. Be aware that relation names and command differ: non-alphabetical characters are deleted from Cat names to produce legal LaTeX command names; a few Cat names undergo special treatment (_e.g._ `dmb.full` -> `\DMBFULL`, `L` -> `\REL`,...).
jalglave
pushed a commit
that referenced
this issue
Dec 19, 2023
The tool transliterate cat definitions into English Simple usage: `miaou7 -show <name> <catfile>` will translate the definition of `<name>` from model `<catfile>` into LaTeX. More advanced usage: the output will use LaTeX commands to pretty-print event set and relation definitions. For instance the **po** relation corresponds to the the command `\newcommand{\po}[2]{#1 is program-order before #2}`. Such commands can be defined in a specific file, say `catdefs.tex`. Then, the command-line option `-tex catdefs,tex` yields the following benefits: 1. Commands outputed by `miaou7` and not defined are flagged; and 2. The number of parameters of commands tells if the command correspond to an event set (one argument) or to a relation (two arguments), thereby guaranteeing appropriate formatting of definitions. Be aware that relation names and command differ: non-alphabetical characters are deleted from Cat names to produce legal LaTeX command names; a few Cat names undergo special treatment (_e.g._ `dmb.full` -> `\DMBFULL`, `L` -> `\REL`,...).
maranget
added a commit
that referenced
this issue
Dec 21, 2023
[tools] miaou! A cat that writes in English. Simple usage: `miaou7 -show <name> <catfile>` will translate the definition of `<name>` from model `<catfile>` into LaTeX. More advanced usage: the output will use LaTeX commands to pretty-print event set and relation definitions. For instance the **po** relation corresponds to the the command `\newcommand{\po}[2]{#1 is program-order before #2}`. Such commands can be defined in a specific file, say `catdefs.tex`. Then, the command-line option `-tex catdefs,tex` yields the following benefits: 1. Commands outputed by `miaou7` and not defined are flagged; and 2. The number of parameters of commands tells if the command correspond to an event set (one argument) or to a relation (two arguments), thereby guaranteeing appropriate formatting of definitions. Be aware that relation names and command differ: non-alphabetical characters are deleted from Cat names to produce legal LaTeX command names; a few Cat names undergo special treatment (_e.g._ `dmb.full` -> `\DMBFULL`, `L` -> `\REL`,...).
murzinv
pushed a commit
to murzinv/herdtools7
that referenced
this issue
Dec 22, 2023
Unfortunately, there are several issues with litmus test generation when it comes to broad use Neon annotations. For example, litmus test for data dependency between Neon load and a store comes with several issues $ diyone7 -arch AArch64 -variant neon "Rfe PodRW Rfe DpDatadWN2N2" ---cut--- P0 | P1 ; LDR W1,[X0] | LD2 {V0.4S, V1.4S},[X0] ; (* A *) MOV W2,herd#1 | EOR V2.4S,V0.4S,V0.4S ; (* B *) STR W2,[X3] | ... ; ---cut--- First, result of load into V1 at A is not consumed. Second, instruction at B is not valid: Assembler messages: Error: operand mismatch -- `eor V2.4S,V0.4S,V0.4S' Info: did you mean this? Info: eor v1.8b, v0.8b, v0.8b Info: other valid variant(s): Info: eor v1.16b, v0.16b, v0.16b Another issue can be seen in litmus test for address dependency between a load and Neon load $ diyone7 -arch AArch64 -variant neon "PodWW Rfe DpAddrdRPN2 Fre" ---cut--- P0 | P1 ; MOV W0,herd#1 | ... ; STR W0,[X1] | ... ; MOV W2,herd#1 | ... ; STR W2,[X3] | LD2 {V0.4S, V1.4S},[X4],X3 ; (* A *) ---cut-- Neon load at A uses post-index form where X3 would be consumed after access is made to update X4, in other words, X3 doesn't contribute to the address generation for LD2 access at A. Also, there are numerous cases where litmus test generation would fail with Fatal error: exception File "lib/AArch64Base.ml", line 317, characters 48-54: Assertion failed To make things work we need to rework LDN. For things like address dependency and control dependency we need to reduce vector registers horizontally (in case more than one vector register is used) follow up with reduce horizontally single vector register into scalar, so it can be transferred to general purpose register and become available either for address generation logic or control flow logic. In term of instruction that reduction looks like LD2 {V0.4S, V1.4S},[X0] ; ADD V0.4S,V0.4S,V1.4S ; (* reduce (and consume) V0 and V1 into V0 *) ADDV S2,V0.4S ; (* reduce V0 inot S2 *) FMOV W1,S2 ; (* transfer S2 into W1 *) The only downside is that for data dependency between, say, Neon load and Neon store data have to trip via general purpose register which is sub-optimal LD2 {V0.4S, V1.4S},[X0] ; ADD V0.4S,V0.4S,V1.4S ; ADDV S2,V0.4S ; FMOV W1,S2 ; EOR W2,W1,W1 ; DUP V3.4S,W2 ; MOVI V4.4S,herd#1 ; MOVI V5.4S,herd#2 ; ADD V4.4S,V4.4S,V3.4S ; ADD V5.4S,V5.4S,V3.4S ; ST2 {V4.4S, V5.4S},[X3] ; However, when it comes to data dependency, say, between Neon load and non-Neon store things look more natural LD2 {V0.4S, V1.4S},[X0] ; ADD V0.4S,V0.4S,V1.4S ; ADDV S2,V0.4S ; FMOV W1,S2 ; EOR W2,W1,W1 ; ADD W2,W2,herd#1 ; STR W2,[X3] ; Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com>
murzinv
pushed a commit
to murzinv/herdtools7
that referenced
this issue
Dec 22, 2023
Unfortunately, there are several issues with litmus test generation when it comes to broad use Neon annotations. For example, litmus test for data dependency between a load and Neon store comes with use of invalid instruction $ diyone7 -arch AArch64 -variant neon "Rfe PodRW Rfe DpDatadWN2N2" ---cut--- P0 | P1 ; LDR W1,[X0] | ... ; MOV W2,herd#1 | ... ; STR W2,[X3] | MOVI V3.4S,herd#1 ; | ADD V2.4S,V2.4S,V3.4S ; | ST2 {V2.4S},[X2] ; (* A *) ---cut--- Neon's ST2 at A is invalid: Error: invalid number of registers in the list; 2 registers are expected at operand 1 -- `st2 {V2.4S},[X2]' Another issues seen in litmus test for address dependency between a load and Neon store $ diyone7 -arch AArch64 -variant neon "Rfe PodRW Rfe DpAddrdWPN2 PodWW" ---cut--- P0 | P1 ; LDR W1,[X0] | ... ; MOV W2,herd#1 | ... ; STR W2,[X3] | EOR W2,W1,W1 ; (* A *) | EOR X3,X2,X2 ; | MOVI V0.4S,herd#1 ; | MOVI V1.4S,herd#2 ; | ST2 {V0.4S, V1.4S},[X4],X3 ; (* B *) | MOV W5,herd#1 ; | STR W5,[X6] ; ---cut--- First of all unnecessary exclusive or at A, yet it doesn't hurt. Second, Neon store at B uses post-index form where X3 would be consumed after access is made to update X4, in other words, X3 doesn't contribute to the address generation for ST2 access at B. Also, there are numerous cases where litmus test generation would fail with Fatal error: exception File "lib/AArch64Base.ml", line 317, characters 48-54: Assertion failed So, rework STN such that * For data dependency between a load and Neon store we produce code sequence like DUP V3.4S,W2 ; (* expand W2 into V3 *) MOVI V4.4S,#0 ; MOVI V5.4S,herd#1 ; ADD V4.4S,V4.4S,V3.4S ; (* V4 depends on V3 *) ADD V5.4S,V5.4S,V3.4S ; (* V5 depends on V3 *) ST2 {V4.4S, V5.4S},[X3] ; (* Store both V4 and V3 *) * For address dependency between a load and Neon store we produce code sequence like ADD X4,X3,W1,SXTW ; (* X4 depends on X3 (generated by previous load) *) MOVI V0.4S,herd#1 ; MOVI V1.4S,herd#2 ; ST2 {V0.4S, V1.4S},[X4] ; (* ST2 consumes X4 *) Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com>
murzinv
pushed a commit
to murzinv/herdtools7
that referenced
this issue
Dec 22, 2023
Unfortunately, there are several issues with litmus test generation when it comes to broad use Neon annotations. For example, litmus test for data dependency between a load and Neon store comes with use of invalid instruction $ diyone7 -arch AArch64 -variant neon "Rfe PodRW Rfe DpDatadWN2N2" ---cut--- P0 | P1 ; LDR W1,[X0] | ... ; MOV W2,herd#1 | ... ; STR W2,[X3] | MOVI V3.4S,herd#1 ; | ADD V2.4S,V2.4S,V3.4S ; | ST2 {V2.4S},[X2] ; (* A *) ---cut--- Neon's ST2 at A is invalid: Error: invalid number of registers in the list; 2 registers are expected at operand 1 -- `st2 {V2.4S},[X2]' Another issues seen in litmus test for address dependency between a load and Neon store $ diyone7 -arch AArch64 -variant neon "Rfe PodRW Rfe DpAddrdWPN2 PodWW" ---cut--- P0 | P1 ; LDR W1,[X0] | ... ; MOV W2,herd#1 | ... ; STR W2,[X3] | EOR W2,W1,W1 ; (* A *) | EOR X3,X2,X2 ; | MOVI V0.4S,herd#1 ; | MOVI V1.4S,herd#2 ; | ST2 {V0.4S, V1.4S},[X4],X3 ; (* B *) | MOV W5,herd#1 ; | STR W5,[X6] ; ---cut--- First of all unnecessary exclusive or at A, yet it doesn't hurt. Second, Neon store at B uses post-index form where X3 would be consumed after access is made to update X4, in other words, X3 doesn't contribute to the address generation for ST2 access at B. Also, there are numerous cases where litmus test generation would fail with Fatal error: exception File "lib/AArch64Base.ml", line 317, characters 48-54: Assertion failed So, rework STN such that * For data dependency between a load and Neon store we produce code sequence like DUP V3.4S,W2 ; (* expand W2 into V3 *) MOVI V4.4S,#0 ; MOVI V5.4S,herd#1 ; ADD V4.4S,V4.4S,V3.4S ; (* V4 depends on V3 *) ADD V5.4S,V5.4S,V3.4S ; (* V5 depends on V3 *) ST2 {V4.4S, V5.4S},[X3] ; (* Store both V4 and V3 *) * For address dependency between a load and Neon store we produce code sequence like ADD X4,X3,W1,SXTW ; (* X4 depends on X3 (generated by previous load) *) MOVI V0.4S,herd#1 ; MOVI V1.4S,herd#2 ; ST2 {V0.4S, V1.4S},[X4] ; (* ST2 consumes X4 *) Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com>
murzinv
pushed a commit
to murzinv/herdtools7
that referenced
this issue
Jan 3, 2024
Unfortunately, there are several issues with litmus test generation when it comes to broad use Neon annotations. For example, litmus test for data dependency between Neon load and a store comes with several issues $ diyone7 -arch AArch64 -variant neon "Rfe PodRW Rfe DpDatadWN2N2" ---cut--- P0 | P1 ; LDR W1,[X0] | LD2 {V0.4S, V1.4S},[X0] ; (* A *) MOV W2,herd#1 | EOR V2.4S,V0.4S,V0.4S ; (* B *) STR W2,[X3] | ... ; ---cut--- First, result of load into V1 at A is not consumed. Second, instruction at B is not valid: Assembler messages: Error: operand mismatch -- `eor V2.4S,V0.4S,V0.4S' Info: did you mean this? Info: eor v1.8b, v0.8b, v0.8b Info: other valid variant(s): Info: eor v1.16b, v0.16b, v0.16b Another issue can be seen in litmus test for address dependency between a load and Neon load $ diyone7 -arch AArch64 -variant neon "PodWW Rfe DpAddrdRPN2 Fre" ---cut--- P0 | P1 ; MOV W0,herd#1 | ... ; STR W0,[X1] | ... ; MOV W2,herd#1 | ... ; STR W2,[X3] | LD2 {V0.4S, V1.4S},[X4],X3 ; (* A *) ---cut-- Neon load at A uses post-index form where X3 would be consumed after access is made to update X4, in other words, X3 doesn't contribute to the address generation for LD2 access at A. Also, there are numerous cases where litmus test generation would fail with Fatal error: exception File "lib/AArch64Base.ml", line 317, characters 48-54: Assertion failed To make things work we need to rework LDN. For things like address dependency and control dependency we need to reduce vector registers horizontally (in case more than one vector register is used) follow up with reduce horizontally single vector register into scalar, so it can be transferred to general purpose register and become available either for address generation logic or control flow logic. In term of instruction that reduction looks like LD2 {V0.4S, V1.4S},[X0] ; ADD V0.4S,V0.4S,V1.4S ; (* reduce (and consume) V0 and V1 into V0 *) ADDV S2,V0.4S ; (* reduce V0 inot S2 *) FMOV W1,S2 ; (* transfer S2 into W1 *) The only downside is that for data dependency between, say, Neon load and Neon store data have to trip via general purpose register which is sub-optimal LD2 {V0.4S, V1.4S},[X0] ; ADD V0.4S,V0.4S,V1.4S ; ADDV S2,V0.4S ; FMOV W1,S2 ; EOR W2,W1,W1 ; DUP V3.4S,W2 ; MOVI V4.4S,herd#1 ; MOVI V5.4S,herd#2 ; ADD V4.4S,V4.4S,V3.4S ; ADD V5.4S,V5.4S,V3.4S ; ST2 {V4.4S, V5.4S},[X3] ; However, when it comes to data dependency, say, between Neon load and non-Neon store things look more natural LD2 {V0.4S, V1.4S},[X0] ; ADD V0.4S,V0.4S,V1.4S ; ADDV S2,V0.4S ; FMOV W1,S2 ; EOR W2,W1,W1 ; ADD W2,W2,herd#1 ; STR W2,[X3] ; Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com>
murzinv
pushed a commit
to murzinv/herdtools7
that referenced
this issue
Jan 3, 2024
Unfortunately, there are several issues with litmus test generation when it comes to broad use Neon annotations. For example, litmus test for data dependency between a load and Neon store comes with use of invalid instruction $ diyone7 -arch AArch64 -variant neon "Rfe PodRW Rfe DpDatadWN2N2" ---cut--- P0 | P1 ; LDR W1,[X0] | ... ; MOV W2,herd#1 | ... ; STR W2,[X3] | MOVI V3.4S,herd#1 ; | ADD V2.4S,V2.4S,V3.4S ; | ST2 {V2.4S},[X2] ; (* A *) ---cut--- Neon's ST2 at A is invalid: Error: invalid number of registers in the list; 2 registers are expected at operand 1 -- `st2 {V2.4S},[X2]' Another issues seen in litmus test for address dependency between a load and Neon store $ diyone7 -arch AArch64 -variant neon "Rfe PodRW Rfe DpAddrdWPN2 PodWW" ---cut--- P0 | P1 ; LDR W1,[X0] | ... ; MOV W2,herd#1 | ... ; STR W2,[X3] | EOR W2,W1,W1 ; (* A *) | EOR X3,X2,X2 ; | MOVI V0.4S,herd#1 ; | MOVI V1.4S,herd#2 ; | ST2 {V0.4S, V1.4S},[X4],X3 ; (* B *) | MOV W5,herd#1 ; | STR W5,[X6] ; ---cut--- First of all unnecessary exclusive or at A, yet it doesn't hurt. Second, Neon store at B uses post-index form where X3 would be consumed after access is made to update X4, in other words, X3 doesn't contribute to the address generation for ST2 access at B. Also, there are numerous cases where litmus test generation would fail with Fatal error: exception File "lib/AArch64Base.ml", line 317, characters 48-54: Assertion failed So, rework STN such that * For data dependency between a load and Neon store we produce code sequence like DUP V3.4S,W2 ; (* expand W2 into V3 *) MOVI V4.4S,#0 ; MOVI V5.4S,herd#1 ; ADD V4.4S,V4.4S,V3.4S ; (* V4 depends on V3 *) ADD V5.4S,V5.4S,V3.4S ; (* V5 depends on V3 *) ST2 {V4.4S, V5.4S},[X3] ; (* Store both V4 and V3 *) * For address dependency between a load and Neon store we produce code sequence like ADD X4,X3,W1,SXTW ; (* X4 depends on X3 (generated by previous load) *) MOVI V0.4S,herd#1 ; MOVI V1.4S,herd#2 ; ST2 {V0.4S, V1.4S},[X4] ; (* ST2 consumes X4 *) Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com>
maranget
pushed a commit
to murzinv/herdtools7
that referenced
this issue
Jan 3, 2024
Unfortunately, there are several issues with litmus test generation when it comes to broad use Neon annotations. For example, litmus test for data dependency between Neon load and a store comes with several issues $ diyone7 -arch AArch64 -variant neon "Rfe PodRW Rfe DpDatadWN2N2" ---cut--- P0 | P1 ; LDR W1,[X0] | LD2 {V0.4S, V1.4S},[X0] ; (* A *) MOV W2,herd#1 | EOR V2.4S,V0.4S,V0.4S ; (* B *) STR W2,[X3] | ... ; ---cut--- First, result of load into V1 at A is not consumed. Second, instruction at B is not valid: Assembler messages: Error: operand mismatch -- `eor V2.4S,V0.4S,V0.4S' Info: did you mean this? Info: eor v1.8b, v0.8b, v0.8b Info: other valid variant(s): Info: eor v1.16b, v0.16b, v0.16b Another issue can be seen in litmus test for address dependency between a load and Neon load $ diyone7 -arch AArch64 -variant neon "PodWW Rfe DpAddrdRPN2 Fre" ---cut--- P0 | P1 ; MOV W0,herd#1 | ... ; STR W0,[X1] | ... ; MOV W2,herd#1 | ... ; STR W2,[X3] | LD2 {V0.4S, V1.4S},[X4],X3 ; (* A *) ---cut-- Neon load at A uses post-index form where X3 would be consumed after access is made to update X4, in other words, X3 doesn't contribute to the address generation for LD2 access at A. Also, there are numerous cases where litmus test generation would fail with Fatal error: exception File "lib/AArch64Base.ml", line 317, characters 48-54: Assertion failed To make things work we need to rework LDN. For things like address dependency and control dependency we need to reduce vector registers horizontally (in case more than one vector register is used) follow up with reduce horizontally single vector register into scalar, so it can be transferred to general purpose register and become available either for address generation logic or control flow logic. In term of instruction that reduction looks like LD2 {V0.4S, V1.4S},[X0] ; ADD V0.4S,V0.4S,V1.4S ; (* reduce (and consume) V0 and V1 into V0 *) ADDV S2,V0.4S ; (* reduce V0 inot S2 *) FMOV W1,S2 ; (* transfer S2 into W1 *) The only downside is that for data dependency between, say, Neon load and Neon store data have to trip via general purpose register which is sub-optimal LD2 {V0.4S, V1.4S},[X0] ; ADD V0.4S,V0.4S,V1.4S ; ADDV S2,V0.4S ; FMOV W1,S2 ; EOR W2,W1,W1 ; DUP V3.4S,W2 ; MOVI V4.4S,herd#1 ; MOVI V5.4S,herd#2 ; ADD V4.4S,V4.4S,V3.4S ; ADD V5.4S,V5.4S,V3.4S ; ST2 {V4.4S, V5.4S},[X3] ; However, when it comes to data dependency, say, between Neon load and non-Neon store things look more natural LD2 {V0.4S, V1.4S},[X0] ; ADD V0.4S,V0.4S,V1.4S ; ADDV S2,V0.4S ; FMOV W1,S2 ; EOR W2,W1,W1 ; ADD W2,W2,herd#1 ; STR W2,[X3] ; Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com>
maranget
pushed a commit
to murzinv/herdtools7
that referenced
this issue
Jan 3, 2024
Unfortunately, there are several issues with litmus test generation when it comes to broad use Neon annotations. For example, litmus test for data dependency between a load and Neon store comes with use of invalid instruction $ diyone7 -arch AArch64 -variant neon "Rfe PodRW Rfe DpDatadWN2N2" ---cut--- P0 | P1 ; LDR W1,[X0] | ... ; MOV W2,herd#1 | ... ; STR W2,[X3] | MOVI V3.4S,herd#1 ; | ADD V2.4S,V2.4S,V3.4S ; | ST2 {V2.4S},[X2] ; (* A *) ---cut--- Neon's ST2 at A is invalid: Error: invalid number of registers in the list; 2 registers are expected at operand 1 -- `st2 {V2.4S},[X2]' Another issues seen in litmus test for address dependency between a load and Neon store $ diyone7 -arch AArch64 -variant neon "Rfe PodRW Rfe DpAddrdWPN2 PodWW" ---cut--- P0 | P1 ; LDR W1,[X0] | ... ; MOV W2,herd#1 | ... ; STR W2,[X3] | EOR W2,W1,W1 ; (* A *) | EOR X3,X2,X2 ; | MOVI V0.4S,herd#1 ; | MOVI V1.4S,herd#2 ; | ST2 {V0.4S, V1.4S},[X4],X3 ; (* B *) | MOV W5,herd#1 ; | STR W5,[X6] ; ---cut--- First of all unnecessary exclusive or at A, yet it doesn't hurt. Second, Neon store at B uses post-index form where X3 would be consumed after access is made to update X4, in other words, X3 doesn't contribute to the address generation for ST2 access at B. Also, there are numerous cases where litmus test generation would fail with Fatal error: exception File "lib/AArch64Base.ml", line 317, characters 48-54: Assertion failed So, rework STN such that * For data dependency between a load and Neon store we produce code sequence like DUP V3.4S,W2 ; (* expand W2 into V3 *) MOVI V4.4S,#0 ; MOVI V5.4S,herd#1 ; ADD V4.4S,V4.4S,V3.4S ; (* V4 depends on V3 *) ADD V5.4S,V5.4S,V3.4S ; (* V5 depends on V3 *) ST2 {V4.4S, V5.4S},[X3] ; (* Store both V4 and V3 *) * For address dependency between a load and Neon store we produce code sequence like ADD X4,X3,W1,SXTW ; (* X4 depends on X3 (generated by previous load) *) MOVI V0.4S,herd#1 ; MOVI V1.4S,herd#2 ; ST2 {V0.4S, V1.4S},[X4] ; (* ST2 consumes X4 *) Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com>
murzinv
pushed a commit
to murzinv/herdtools7
that referenced
this issue
Jan 3, 2024
Unfortunately, there are several issues with litmus test generation when it comes to broad use Neon annotations. For example, litmus test for data dependency between Neon load and a store comes with several issues $ diyone7 -arch AArch64 -variant neon "Rfe PodRW Rfe DpDatadWN2N2" ---cut--- P0 | P1 ; LDR W1,[X0] | LD2 {V0.4S, V1.4S},[X0] ; (* A *) MOV W2,herd#1 | EOR V2.4S,V0.4S,V0.4S ; (* B *) STR W2,[X3] | ... ; ---cut--- First, result of load into V1 at A is not consumed. Second, instruction at B is not valid: Assembler messages: Error: operand mismatch -- `eor V2.4S,V0.4S,V0.4S' Info: did you mean this? Info: eor v1.8b, v0.8b, v0.8b Info: other valid variant(s): Info: eor v1.16b, v0.16b, v0.16b Another issue can be seen in litmus test for address dependency between a load and Neon load $ diyone7 -arch AArch64 -variant neon "PodWW Rfe DpAddrdRPN2 Fre" ---cut--- P0 | P1 ; MOV W0,herd#1 | ... ; STR W0,[X1] | ... ; MOV W2,herd#1 | ... ; STR W2,[X3] | LD2 {V0.4S, V1.4S},[X4],X3 ; (* A *) ---cut-- Neon load at A uses post-index form where X3 would be consumed after access is made to update X4, in other words, X3 doesn't contribute to the address generation for LD2 access at A. Also, there are numerous cases where litmus test generation would fail with Fatal error: exception File "lib/AArch64Base.ml", line 317, characters 48-54: Assertion failed To make things work we need to rework LDN. For things like address dependency and control dependency we need to reduce vector registers horizontally (in case more than one vector register is used) follow up with reduce horizontally single vector register into scalar, so it can be transferred to general purpose register and become available either for address generation logic or control flow logic. In term of instruction that reduction looks like LD2 {V0.4S, V1.4S},[X0] ; ADD V0.4S,V0.4S,V1.4S ; (* reduce (and consume) V0 and V1 into V0 *) ADDV S2,V0.4S ; (* reduce V0 inot S2 *) FMOV W1,S2 ; (* transfer S2 into W1 *) The only downside is that for data dependency between, say, Neon load and Neon store data have to trip via general purpose register which is sub-optimal LD2 {V0.4S, V1.4S},[X0] ; ADD V0.4S,V0.4S,V1.4S ; ADDV S2,V0.4S ; FMOV W1,S2 ; EOR W2,W1,W1 ; DUP V3.4S,W2 ; MOVI V4.4S,herd#1 ; MOVI V5.4S,herd#2 ; ADD V4.4S,V4.4S,V3.4S ; ADD V5.4S,V5.4S,V3.4S ; ST2 {V4.4S, V5.4S},[X3] ; However, when it comes to data dependency, say, between Neon load and non-Neon store things look more natural LD2 {V0.4S, V1.4S},[X0] ; ADD V0.4S,V0.4S,V1.4S ; ADDV S2,V0.4S ; FMOV W1,S2 ; EOR W2,W1,W1 ; ADD W2,W2,herd#1 ; STR W2,[X3] ; Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com>
murzinv
pushed a commit
to murzinv/herdtools7
that referenced
this issue
Jan 3, 2024
Unfortunately, there are several issues with litmus test generation when it comes to broad use Neon annotations. For example, litmus test for data dependency between a load and Neon store comes with use of invalid instruction $ diyone7 -arch AArch64 -variant neon "Rfe PodRW Rfe DpDatadWN2N2" ---cut--- P0 | P1 ; LDR W1,[X0] | ... ; MOV W2,herd#1 | ... ; STR W2,[X3] | MOVI V3.4S,herd#1 ; | ADD V2.4S,V2.4S,V3.4S ; | ST2 {V2.4S},[X2] ; (* A *) ---cut--- Neon's ST2 at A is invalid: Error: invalid number of registers in the list; 2 registers are expected at operand 1 -- `st2 {V2.4S},[X2]' Another issues seen in litmus test for address dependency between a load and Neon store $ diyone7 -arch AArch64 -variant neon "Rfe PodRW Rfe DpAddrdWPN2 PodWW" ---cut--- P0 | P1 ; LDR W1,[X0] | ... ; MOV W2,herd#1 | ... ; STR W2,[X3] | EOR W2,W1,W1 ; (* A *) | EOR X3,X2,X2 ; | MOVI V0.4S,herd#1 ; | MOVI V1.4S,herd#2 ; | ST2 {V0.4S, V1.4S},[X4],X3 ; (* B *) | MOV W5,herd#1 ; | STR W5,[X6] ; ---cut--- First of all unnecessary exclusive or at A, yet it doesn't hurt. Second, Neon store at B uses post-index form where X3 would be consumed after access is made to update X4, in other words, X3 doesn't contribute to the address generation for ST2 access at B. Also, there are numerous cases where litmus test generation would fail with Fatal error: exception File "lib/AArch64Base.ml", line 317, characters 48-54: Assertion failed So, rework STN such that * For data dependency between a load and Neon store we produce code sequence like DUP V3.4S,W2 ; (* expand W2 into V3 *) MOVI V4.4S,#0 ; MOVI V5.4S,herd#1 ; ADD V4.4S,V4.4S,V3.4S ; (* V4 depends on V3 *) ADD V5.4S,V5.4S,V3.4S ; (* V5 depends on V3 *) ST2 {V4.4S, V5.4S},[X3] ; (* Store both V4 and V3 *) * For address dependency between a load and Neon store we produce code sequence like ADD X4,X3,W1,SXTW ; (* X4 depends on X3 (generated by previous load) *) MOVI V0.4S,herd#1 ; MOVI V1.4S,herd#2 ; ST2 {V0.4S, V1.4S},[X4] ; (* ST2 consumes X4 *) Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com>
murzinv
pushed a commit
to murzinv/herdtools7
that referenced
this issue
Jan 4, 2024
Unfortunately, there are several issues with litmus test generation when it comes to broad use Neon annotations. For example, litmus test for data dependency between Neon load and a store comes with several issues $ diyone7 -arch AArch64 -variant neon "Rfe PodRW Rfe DpDatadWN2N2" ---cut--- P0 | P1 ; LDR W1,[X0] | LD2 {V0.4S, V1.4S},[X0] ; (* A *) MOV W2,herd#1 | EOR V2.4S,V0.4S,V0.4S ; (* B *) STR W2,[X3] | ... ; ---cut--- First, result of load into V1 at A is not consumed. Second, instruction at B is not valid: Assembler messages: Error: operand mismatch -- `eor V2.4S,V0.4S,V0.4S' Info: did you mean this? Info: eor v1.8b, v0.8b, v0.8b Info: other valid variant(s): Info: eor v1.16b, v0.16b, v0.16b Another issue can be seen in litmus test for address dependency between a load and Neon load $ diyone7 -arch AArch64 -variant neon "PodWW Rfe DpAddrdRPN2 Fre" ---cut--- P0 | P1 ; MOV W0,herd#1 | ... ; STR W0,[X1] | ... ; MOV W2,herd#1 | ... ; STR W2,[X3] | LD2 {V0.4S, V1.4S},[X4],X3 ; (* A *) ---cut-- Neon load at A uses post-index form where X3 would be consumed after access is made to update X4, in other words, X3 doesn't contribute to the address generation for LD2 access at A. Also, there are numerous cases where litmus test generation would fail with Fatal error: exception File "lib/AArch64Base.ml", line 317, characters 48-54: Assertion failed To make things work we need to rework LDN. For things like address dependency and control dependency we need to reduce vector registers horizontally (in case more than one vector register is used) follow up with reduce horizontally single vector register into scalar, so it can be transferred to general purpose register and become available either for address generation logic or control flow logic. In term of instruction that reduction looks like LD2 {V0.4S, V1.4S},[X0] ; ADD V0.4S,V0.4S,V1.4S ; (* reduce (and consume) V0 and V1 into V0 *) ADDV S2,V0.4S ; (* reduce V0 inot S2 *) FMOV W1,S2 ; (* transfer S2 into W1 *) The only downside is that for data dependency between, say, Neon load and Neon store data have to trip via general purpose register which is sub-optimal LD2 {V0.4S, V1.4S},[X0] ; ADD V0.4S,V0.4S,V1.4S ; ADDV S2,V0.4S ; FMOV W1,S2 ; EOR W2,W1,W1 ; DUP V3.4S,W2 ; MOVI V4.4S,herd#1 ; MOVI V5.4S,herd#2 ; ADD V4.4S,V4.4S,V3.4S ; ADD V5.4S,V5.4S,V3.4S ; ST2 {V4.4S, V5.4S},[X3] ; However, when it comes to data dependency, say, between Neon load and non-Neon store things look more natural LD2 {V0.4S, V1.4S},[X0] ; ADD V0.4S,V0.4S,V1.4S ; ADDV S2,V0.4S ; FMOV W1,S2 ; EOR W2,W1,W1 ; ADD W2,W2,herd#1 ; STR W2,[X3] ; Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com>
murzinv
pushed a commit
to murzinv/herdtools7
that referenced
this issue
Jan 4, 2024
Unfortunately, there are several issues with litmus test generation when it comes to broad use Neon annotations. For example, litmus test for data dependency between a load and Neon store comes with use of invalid instruction $ diyone7 -arch AArch64 -variant neon "Rfe PodRW Rfe DpDatadWN2N2" ---cut--- P0 | P1 ; LDR W1,[X0] | ... ; MOV W2,herd#1 | ... ; STR W2,[X3] | MOVI V3.4S,herd#1 ; | ADD V2.4S,V2.4S,V3.4S ; | ST2 {V2.4S},[X2] ; (* A *) ---cut--- Neon's ST2 at A is invalid: Error: invalid number of registers in the list; 2 registers are expected at operand 1 -- `st2 {V2.4S},[X2]' Another issues seen in litmus test for address dependency between a load and Neon store $ diyone7 -arch AArch64 -variant neon "Rfe PodRW Rfe DpAddrdWPN2 PodWW" ---cut--- P0 | P1 ; LDR W1,[X0] | ... ; MOV W2,herd#1 | ... ; STR W2,[X3] | EOR W2,W1,W1 ; (* A *) | EOR X3,X2,X2 ; | MOVI V0.4S,herd#1 ; | MOVI V1.4S,herd#2 ; | ST2 {V0.4S, V1.4S},[X4],X3 ; (* B *) | MOV W5,herd#1 ; | STR W5,[X6] ; ---cut--- First of all unnecessary exclusive or at A, yet it doesn't hurt. Second, Neon store at B uses post-index form where X3 would be consumed after access is made to update X4, in other words, X3 doesn't contribute to the address generation for ST2 access at B. Also, there are numerous cases where litmus test generation would fail with Fatal error: exception File "lib/AArch64Base.ml", line 317, characters 48-54: Assertion failed So, rework STN such that * For data dependency between a load and Neon store we produce code sequence like DUP V3.4S,W2 ; (* expand W2 into V3 *) MOVI V4.4S,#0 ; MOVI V5.4S,herd#1 ; ADD V4.4S,V4.4S,V3.4S ; (* V4 depends on V3 *) ADD V5.4S,V5.4S,V3.4S ; (* V5 depends on V3 *) ST2 {V4.4S, V5.4S},[X3] ; (* Store both V4 and V3 *) * For address dependency between a load and Neon store we produce code sequence like ADD X4,X3,W1,SXTW ; (* X4 depends on X3 (generated by previous load) *) MOVI V0.4S,herd#1 ; MOVI V1.4S,herd#2 ; ST2 {V0.4S, V1.4S},[X4] ; (* ST2 consumes X4 *) Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com>
murzinv
pushed a commit
to murzinv/herdtools7
that referenced
this issue
Jan 5, 2024
This is a combination of 16 commits. [tests]: Fix Neon tests V44 and V45 Encoding of immediate for 64-bit variant of MOVI instruction is different to 32-bit MOVI <Dd>, #<imm> where <imm> Is a 64-bit immediate 'aaaaaaaabbbbbbbbccccccccddddddddeeeeeeeeffffffffgggggggghhhhhhhh', encoded in "a:b:c:d:e:f:g:h". Unfortunately, that was overlooked for tests V44 and V45, which are failing to build: /tmp/ccJAwX0B.s:126: Error: invalid value for immediate at operand 2 -- `movi d1,herd#254' make: *** [Makefile:39: V44.o] Error 1 /tmp/ccX7x0v3.s: Assembler messages: /tmp/ccX7x0v3.s:487: Error: invalid value for immediate at operand 2 -- `movi d2,herd#254' Fix that by using acceptable immediates [litmus]: Fix build neon tests for -mode presi `-mode presi` doen't know about typdef for register output so build failing with V34.c:197:3: error: unknown type name ‘int_4_t’ 197 | int_4_t out_0_v0; | ^~~~~~~ V34.c: In function ‘pp_log’: Let's move `dump_array_typedefs` into common place and use in both `std` and `presi` modes [litmus]: Add comments for handling initialization of Neon registers It might look confusing why for some instructions cleraly output registers are going into 'inputs'. The reason is that such instructions can insert data into individual elements of Neon register without clearing the remaining bits to zero, thus to guarantee that unaffected bits are zero the whole Neon register need to be zero-initialized prior the use. Several options have been evaluated [1] and it seems the only robust way to achieve that is using the constraint "+" combined with the explicit initialization of the 'stable_*' variables. So, lets add comments to relevant parts to save time digging up history next time somebody get confused. [1] herd#131 [litmus]: add MOV (from general) to stable regs This instruction can insert data into individual elements within a SIMD&FP register without clearing the remaining bits to zero [litmus]: add MOV (element) to stable regs This instruction can insert data into individual elements within a SIMD&FP register without clearing the remaining bits to zero [all]: Add support for DUP [all]: Add partial support for FMOV (general) [all]: Add support for ADDV [gen]: Rework LDN Unfortunately, there are several issues with litmus test generation when it comes to broad use Neon annotations. For example, litmus test for data dependency between Neon load and a store comes with several issues $ diyone7 -arch AArch64 -variant neon "Rfe PodRW Rfe DpDatadWN2N2" ---cut--- P0 | P1 ; LDR W1,[X0] | LD2 {V0.4S, V1.4S},[X0] ; (* A *) MOV W2,herd#1 | EOR V2.4S,V0.4S,V0.4S ; (* B *) STR W2,[X3] | ... ; ---cut--- First, result of load into V1 at A is not consumed. Second, instruction at B is not valid: Assembler messages: Error: operand mismatch -- `eor V2.4S,V0.4S,V0.4S' Info: did you mean this? Info: eor v1.8b, v0.8b, v0.8b Info: other valid variant(s): Info: eor v1.16b, v0.16b, v0.16b Another issue can be seen in litmus test for address dependency between a load and Neon load $ diyone7 -arch AArch64 -variant neon "PodWW Rfe DpAddrdRPN2 Fre" ---cut--- P0 | P1 ; MOV W0,herd#1 | ... ; STR W0,[X1] | ... ; MOV W2,herd#1 | ... ; STR W2,[X3] | LD2 {V0.4S, V1.4S},[X4],X3 ; (* A *) ---cut-- Neon load at A uses post-index form where X3 would be consumed after access is made to update X4, in other words, X3 doesn't contribute to the address generation for LD2 access at A. Also, there are numerous cases where litmus test generation would fail with Fatal error: exception File "lib/AArch64Base.ml", line 317, characters 48-54: Assertion failed To make things work we need to rework LDN. For things like address dependency and control dependency we need to reduce vector registers horizontally (in case more than one vector register is used) follow up with reduce horizontally single vector register into scalar, so it can be transferred to general purpose register and become available either for address generation logic or control flow logic. In term of instruction that reduction looks like LD2 {V0.4S, V1.4S},[X0] ; ADD V0.4S,V0.4S,V1.4S ; (* reduce (and consume) V0 and V1 into V0 *) ADDV S2,V0.4S ; (* reduce V0 inot S2 *) FMOV W1,S2 ; (* transfer S2 into W1 *) The only downside is that for data dependency between, say, Neon load and Neon store data have to trip via general purpose register which is sub-optimal LD2 {V0.4S, V1.4S},[X0] ; ADD V0.4S,V0.4S,V1.4S ; ADDV S2,V0.4S ; FMOV W1,S2 ; EOR W2,W1,W1 ; DUP V3.4S,W2 ; MOVI V4.4S,herd#1 ; MOVI V5.4S,herd#2 ; ADD V4.4S,V4.4S,V3.4S ; ADD V5.4S,V5.4S,V3.4S ; ST2 {V4.4S, V5.4S},[X3] ; However, when it comes to data dependency, say, between Neon load and non-Neon store things look more natural LD2 {V0.4S, V1.4S},[X0] ; ADD V0.4S,V0.4S,V1.4S ; ADDV S2,V0.4S ; FMOV W1,S2 ; EOR W2,W1,W1 ; ADD W2,W2,herd#1 ; STR W2,[X3] ; [gen]: Rework STN Unfortunately, there are several issues with litmus test generation when it comes to broad use Neon annotations. For example, litmus test for data dependency between a load and Neon store comes with use of invalid instruction $ diyone7 -arch AArch64 -variant neon "Rfe PodRW Rfe DpDatadWN2N2" ---cut--- P0 | P1 ; LDR W1,[X0] | ... ; MOV W2,herd#1 | ... ; STR W2,[X3] | MOVI V3.4S,herd#1 ; | ADD V2.4S,V2.4S,V3.4S ; | ST2 {V2.4S},[X2] ; (* A *) ---cut--- Neon's ST2 at A is invalid: Error: invalid number of registers in the list; 2 registers are expected at operand 1 -- `st2 {V2.4S},[X2]' Another issues seen in litmus test for address dependency between a load and Neon store $ diyone7 -arch AArch64 -variant neon "Rfe PodRW Rfe DpAddrdWPN2 PodWW" ---cut--- P0 | P1 ; LDR W1,[X0] | ... ; MOV W2,herd#1 | ... ; STR W2,[X3] | EOR W2,W1,W1 ; (* A *) | EOR X3,X2,X2 ; | MOVI V0.4S,herd#1 ; | MOVI V1.4S,herd#2 ; | ST2 {V0.4S, V1.4S},[X4],X3 ; (* B *) | MOV W5,herd#1 ; | STR W5,[X6] ; ---cut--- First of all unnecessary exclusive or at A, yet it doesn't hurt. Second, Neon store at B uses post-index form where X3 would be consumed after access is made to update X4, in other words, X3 doesn't contribute to the address generation for ST2 access at B. Also, there are numerous cases where litmus test generation would fail with Fatal error: exception File "lib/AArch64Base.ml", line 317, characters 48-54: Assertion failed So, rework STN such that * For data dependency between a load and Neon store we produce code sequence like DUP V3.4S,W2 ; (* expand W2 into V3 *) MOVI V4.4S,#0 ; MOVI V5.4S,herd#1 ; ADD V4.4S,V4.4S,V3.4S ; (* V4 depends on V3 *) ADD V5.4S,V5.4S,V3.4S ; (* V5 depends on V3 *) ST2 {V4.4S, V5.4S},[X3] ; (* Store both V4 and V3 *) * For address dependency between a load and Neon store we produce code sequence like ADD X4,X3,W1,SXTW ; (* X4 depends on X3 (generated by previous load) *) MOVI V0.4S,herd#1 ; MOVI V1.4S,herd#2 ; ST2 {V0.4S, V1.4S},[X4] ; (* ST2 consumes X4 *) [gen]: Rename Neon annotations Currently, Neon annotations are mapped to instruction as * N1 -> LD1/ST1 (1 register) * N2 -> LD2/ST2 (2 register) * N3 -> LD3/ST3 (3 register) * N4 -> LD4/ST4 (4 register) Basically reflecting how many registers consumed by family of load/store multiple structure instructions. However, that naming is not flexible enough when it comes to extending Neon support with other variants or even non-interleaving LD1/ST1 which can accept up to 4 registers. So, let's make some (name)space for Neon annotations: * Ne1 -> LD1/ST1 (1 register) * Ne2i -> LD2/ST2 (2 register) * Ne3i -> LD3/ST3 (3 register) * Ne4i -> LD4/ST4 (4 register) Where we preserve an idea of amount of registers required by instructions, yet additionally highlight that some annotations would result in interleaving (i) Neon load/store instructions. [herd]: Complete support for LD1/ST1 LD1/ST1 are non-interleave (contiguous) load/store and can use up to 4 registers. However, current implementation tend to correlate number of registers used by instructions with interleave pattern and unnecessary put limitation on LD1 having just single register. Let's complete support for LD1/ST1 and allow more than one register. [gen]: Complete support for LD1/ST1 Provide annotation to emmit non-interleave LD1/ST1 instructions with more than one register: Ne1 -> LD1 {reg} Ne2 -> LD1 {reg,reg} Ne3 -> LD1 {reg,reg,reg} Ne4 -> LD1 {reg,reg,reg,reg} [litmus]: Utilize Arm Neon intrinsics for Neon tests It was noticed that for some litmus test generated code could not build. For instance AArch64 A { 0:X0=1; } P0 ; B END ; MOV V0.S[0], W0 ; END: ; exists (0:V0.4S={1,0,0,0}) results into T.c: In function ‘P0’: T.c:195:10: error: cast specifies array type 195 | :"[v0]" ((int_4_t)(0)),[x0] "r" ((int)(0)) | ^ where int_4_t is an alias to int[4] Let's use Arm Neon intrinsics and types [1] to handle register initialization. For that we (ab)use `internal_init` and make it return register initialization string and type string. Register initialization usually assumes some initial value so `internal_init` is extended to take optional string representing the value. Rest of the code is adopted to benefit from the new functionality. [1] https://arm-software.github.io/acle/neon_intrinsics/advsimd.html Update tests for litmus to accept them For making litmus happy, the following changes apply: 1. V10 and V45 change type of `y` to `int64_t`. Namely as the test stores a 64bit integer to y, sizes must match. Otherwise the produced executable may crash or produce wrong results. 2. V43, V46 and V47 restore address register to an exact symbolic value, thereby cancelling the effect of post-index addressing. Namely, litmus7 cannot print symbolic values of the forme x+offset. 3. V47 specify type `int64_t` for the register 0:X1 that is used as an `X` register in test code and initialised. This mundane changes avoids a warning. Also add test V64 (which identified a problem) and update reference files accordingly. Co-authored-by: Luc Maranget <Luc.Maranget@inria.fr> Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com>
maranget
added a commit
that referenced
this issue
Jan 5, 2024
This is a combination of 16 commits. [tests]: Fix Neon tests V44 and V45 Encoding of immediate for 64-bit variant of MOVI instruction is different to 32-bit MOVI <Dd>, #<imm> where <imm> Is a 64-bit immediate 'aaaaaaaabbbbbbbbccccccccddddddddeeeeeeeeffffffffgggggggghhhhhhhh', encoded in "a:b:c:d:e:f:g:h". Unfortunately, that was overlooked for tests V44 and V45, which are failing to build: /tmp/ccJAwX0B.s:126: Error: invalid value for immediate at operand 2 -- `movi d1,#254' make: *** [Makefile:39: V44.o] Error 1 /tmp/ccX7x0v3.s: Assembler messages: /tmp/ccX7x0v3.s:487: Error: invalid value for immediate at operand 2 -- `movi d2,#254' Fix that by using acceptable immediates [litmus]: Fix build neon tests for -mode presi `-mode presi` doen't know about typdef for register output so build failing with V34.c:197:3: error: unknown type name ‘int_4_t’ 197 | int_4_t out_0_v0; | ^~~~~~~ V34.c: In function ‘pp_log’: Let's move `dump_array_typedefs` into common place and use in both `std` and `presi` modes [litmus]: Add comments for handling initialization of Neon registers It might look confusing why for some instructions cleraly output registers are going into 'inputs'. The reason is that such instructions can insert data into individual elements of Neon register without clearing the remaining bits to zero, thus to guarantee that unaffected bits are zero the whole Neon register need to be zero-initialized prior the use. Several options have been evaluated [1] and it seems the only robust way to achieve that is using the constraint "+" combined with the explicit initialization of the 'stable_*' variables. So, lets add comments to relevant parts to save time digging up history next time somebody get confused. [1] #131 [litmus]: add MOV (from general) to stable regs This instruction can insert data into individual elements within a SIMD&FP register without clearing the remaining bits to zero [litmus]: add MOV (element) to stable regs This instruction can insert data into individual elements within a SIMD&FP register without clearing the remaining bits to zero [all]: Add support for DUP [all]: Add partial support for FMOV (general) [all]: Add support for ADDV [gen]: Rework LDN Unfortunately, there are several issues with litmus test generation when it comes to broad use Neon annotations. For example, litmus test for data dependency between Neon load and a store comes with several issues $ diyone7 -arch AArch64 -variant neon "Rfe PodRW Rfe DpDatadWN2N2" ---cut--- P0 | P1 ; LDR W1,[X0] | LD2 {V0.4S, V1.4S},[X0] ; (* A *) MOV W2,#1 | EOR V2.4S,V0.4S,V0.4S ; (* B *) STR W2,[X3] | ... ; ---cut--- First, result of load into V1 at A is not consumed. Second, instruction at B is not valid: Assembler messages: Error: operand mismatch -- `eor V2.4S,V0.4S,V0.4S' Info: did you mean this? Info: eor v1.8b, v0.8b, v0.8b Info: other valid variant(s): Info: eor v1.16b, v0.16b, v0.16b Another issue can be seen in litmus test for address dependency between a load and Neon load $ diyone7 -arch AArch64 -variant neon "PodWW Rfe DpAddrdRPN2 Fre" ---cut--- P0 | P1 ; MOV W0,#1 | ... ; STR W0,[X1] | ... ; MOV W2,#1 | ... ; STR W2,[X3] | LD2 {V0.4S, V1.4S},[X4],X3 ; (* A *) ---cut-- Neon load at A uses post-index form where X3 would be consumed after access is made to update X4, in other words, X3 doesn't contribute to the address generation for LD2 access at A. Also, there are numerous cases where litmus test generation would fail with Fatal error: exception File "lib/AArch64Base.ml", line 317, characters 48-54: Assertion failed To make things work we need to rework LDN. For things like address dependency and control dependency we need to reduce vector registers horizontally (in case more than one vector register is used) follow up with reduce horizontally single vector register into scalar, so it can be transferred to general purpose register and become available either for address generation logic or control flow logic. In term of instruction that reduction looks like LD2 {V0.4S, V1.4S},[X0] ; ADD V0.4S,V0.4S,V1.4S ; (* reduce (and consume) V0 and V1 into V0 *) ADDV S2,V0.4S ; (* reduce V0 inot S2 *) FMOV W1,S2 ; (* transfer S2 into W1 *) The only downside is that for data dependency between, say, Neon load and Neon store data have to trip via general purpose register which is sub-optimal LD2 {V0.4S, V1.4S},[X0] ; ADD V0.4S,V0.4S,V1.4S ; ADDV S2,V0.4S ; FMOV W1,S2 ; EOR W2,W1,W1 ; DUP V3.4S,W2 ; MOVI V4.4S,#1 ; MOVI V5.4S,#2 ; ADD V4.4S,V4.4S,V3.4S ; ADD V5.4S,V5.4S,V3.4S ; ST2 {V4.4S, V5.4S},[X3] ; However, when it comes to data dependency, say, between Neon load and non-Neon store things look more natural LD2 {V0.4S, V1.4S},[X0] ; ADD V0.4S,V0.4S,V1.4S ; ADDV S2,V0.4S ; FMOV W1,S2 ; EOR W2,W1,W1 ; ADD W2,W2,#1 ; STR W2,[X3] ; [gen]: Rework STN Unfortunately, there are several issues with litmus test generation when it comes to broad use Neon annotations. For example, litmus test for data dependency between a load and Neon store comes with use of invalid instruction $ diyone7 -arch AArch64 -variant neon "Rfe PodRW Rfe DpDatadWN2N2" ---cut--- P0 | P1 ; LDR W1,[X0] | ... ; MOV W2,#1 | ... ; STR W2,[X3] | MOVI V3.4S,#1 ; | ADD V2.4S,V2.4S,V3.4S ; | ST2 {V2.4S},[X2] ; (* A *) ---cut--- Neon's ST2 at A is invalid: Error: invalid number of registers in the list; 2 registers are expected at operand 1 -- `st2 {V2.4S},[X2]' Another issues seen in litmus test for address dependency between a load and Neon store $ diyone7 -arch AArch64 -variant neon "Rfe PodRW Rfe DpAddrdWPN2 PodWW" ---cut--- P0 | P1 ; LDR W1,[X0] | ... ; MOV W2,#1 | ... ; STR W2,[X3] | EOR W2,W1,W1 ; (* A *) | EOR X3,X2,X2 ; | MOVI V0.4S,#1 ; | MOVI V1.4S,#2 ; | ST2 {V0.4S, V1.4S},[X4],X3 ; (* B *) | MOV W5,#1 ; | STR W5,[X6] ; ---cut--- First of all unnecessary exclusive or at A, yet it doesn't hurt. Second, Neon store at B uses post-index form where X3 would be consumed after access is made to update X4, in other words, X3 doesn't contribute to the address generation for ST2 access at B. Also, there are numerous cases where litmus test generation would fail with Fatal error: exception File "lib/AArch64Base.ml", line 317, characters 48-54: Assertion failed So, rework STN such that * For data dependency between a load and Neon store we produce code sequence like DUP V3.4S,W2 ; (* expand W2 into V3 *) MOVI V4.4S,#0 ; MOVI V5.4S,#1 ; ADD V4.4S,V4.4S,V3.4S ; (* V4 depends on V3 *) ADD V5.4S,V5.4S,V3.4S ; (* V5 depends on V3 *) ST2 {V4.4S, V5.4S},[X3] ; (* Store both V4 and V3 *) * For address dependency between a load and Neon store we produce code sequence like ADD X4,X3,W1,SXTW ; (* X4 depends on X3 (generated by previous load) *) MOVI V0.4S,#1 ; MOVI V1.4S,#2 ; ST2 {V0.4S, V1.4S},[X4] ; (* ST2 consumes X4 *) [gen]: Rename Neon annotations Currently, Neon annotations are mapped to instruction as * N1 -> LD1/ST1 (1 register) * N2 -> LD2/ST2 (2 register) * N3 -> LD3/ST3 (3 register) * N4 -> LD4/ST4 (4 register) Basically reflecting how many registers consumed by family of load/store multiple structure instructions. However, that naming is not flexible enough when it comes to extending Neon support with other variants or even non-interleaving LD1/ST1 which can accept up to 4 registers. So, let's make some (name)space for Neon annotations: * Ne1 -> LD1/ST1 (1 register) * Ne2i -> LD2/ST2 (2 register) * Ne3i -> LD3/ST3 (3 register) * Ne4i -> LD4/ST4 (4 register) Where we preserve an idea of amount of registers required by instructions, yet additionally highlight that some annotations would result in interleaving (i) Neon load/store instructions. [herd]: Complete support for LD1/ST1 LD1/ST1 are non-interleave (contiguous) load/store and can use up to 4 registers. However, current implementation tend to correlate number of registers used by instructions with interleave pattern and unnecessary put limitation on LD1 having just single register. Let's complete support for LD1/ST1 and allow more than one register. [gen]: Complete support for LD1/ST1 Provide annotation to emmit non-interleave LD1/ST1 instructions with more than one register: Ne1 -> LD1 {reg} Ne2 -> LD1 {reg,reg} Ne3 -> LD1 {reg,reg,reg} Ne4 -> LD1 {reg,reg,reg,reg} [litmus]: Utilize Arm Neon intrinsics for Neon tests It was noticed that for some litmus test generated code could not build. For instance AArch64 A { 0:X0=1; } P0 ; B END ; MOV V0.S[0], W0 ; END: ; exists (0:V0.4S={1,0,0,0}) results into T.c: In function ‘P0’: T.c:195:10: error: cast specifies array type 195 | :"[v0]" ((int_4_t)(0)),[x0] "r" ((int)(0)) | ^ where int_4_t is an alias to int[4] Let's use Arm Neon intrinsics and types [1] to handle register initialization. For that we (ab)use `internal_init` and make it return register initialization string and type string. Register initialization usually assumes some initial value so `internal_init` is extended to take optional string representing the value. Rest of the code is adopted to benefit from the new functionality. [1] https://arm-software.github.io/acle/neon_intrinsics/advsimd.html Update tests for litmus to accept them For making litmus happy, the following changes apply: 1. V10 and V45 change type of `y` to `int64_t`. Namely as the test stores a 64bit integer to y, sizes must match. Otherwise the produced executable may crash or produce wrong results. 2. V43, V46 and V47 restore address register to an exact symbolic value, thereby cancelling the effect of post-index addressing. Namely, litmus7 cannot print symbolic values of the forme x+offset. 3. V47 specify type `int64_t` for the register 0:X1 that is used as an `X` register in test code and initialised. This mundane changes avoids a warning. Also add test V64 (which identified a problem) and update reference files accordingly. Co-authored-by: Luc Maranget <Luc.Maranget@inria.fr> Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com>
maranget
added a commit
that referenced
this issue
Jan 22, 2024
[litmus] Fix two small bugs in litmus 1. Remove stable registers from the list of clobbers in inline asssembly. Otherwise gcc protests. 2. Silently switch the `ascall` setting to true when computing code prelude size at runtime is needed for any reason. Example of a test that triggered the two errors, while compiled as `litmus7 -mach vougeot -o /tmp/A.tar BLR.litmus`: ``` AArch64 BLR-BL Stable=X30 { } P0 ; MOV W0,#1 ; ADR X4,first ; BLR X4 ; BL first ; B end ; first: ; MOV X29,X30 ; BL snd ; RET X29 ; snd: ; ADD W0,W0,W0 ; RET ; end: ; forall 0:X0=4 ```
relokin
referenced
this issue
in relokin/herdtools7
Feb 1, 2024
This change implements the ability to generate DC CVAU and IC IVAU instructions as a fence instruction. This allows the user to implement these instructions with more fine-grain control into litmus tests. The relaxations for DC CVAU and IC IVAU are: DC.CVAU<n/p> IC.IVAU<n/p> where n and p allow the user to choose whether the DC/IC instruction will use the previous location or the next location. Examples that couldn't be generated previously: $> diyone7 -moreedges true -arch AArch64 -variant self "PodWW DC.CVAUp DMB.ISH Rfe PodRR DSB.ISH IC.IVAUn DSB.ISH ISB FrePI" AArch64 A "PodWWIP DC.CVAUp DMB.ISH Rfe PodRR DSB.ISH IC.IVAUn DSB.ISH ISB FrePI" Generator=diyone7 (version 7.56+03) Com=Rf Fr Orig=PodWWIP DC.CVAUp DMB.ISH Rfe PodRR DSB.ISH IC.IVAUn DSB.ISH ISB FrePI { 0:X0=instr:"NOP"; 0:X1=P1:Lself00; 0:X3=x; 1:X0=x; 1:X2=P1:Lself00; } P0 | P1 ; STR W0,[X1] | LDR W1,[X0] ; DC CVAU,X1 | DSB ISH ; DMB ISH | IC IVAU,X2 ; MOV W2,#1 | DSB ISH ; STR W2,[X3] | ISB ; | Lself00: ; | B .+12 ; | MOV W3,herd#2 ; | B .+8 ; | MOV W3,#1 ; exists (1:X1=1 /\ 1:X3=1) $> diyone7 -moreedges true -arch AArch64 -variant self "DMB.ISHdWW Rfe DpCtrlIsbdR DC.CVAUn DSB.ISH IC.IVAUn DSB.ISH ISB FrePI" AArch64 A "DMB.ISHdWWIP Rfe DpCtrlIsbdR DC.CVAUn DSB.ISH IC.IVAUn DSB.ISH ISB FrePI" Generator=diyone7 (version 7.56+03) Com=Rf Fr Orig=DMB.ISHdWWIP Rfe DpCtrlIsbdR DC.CVAUn DSB.ISH IC.IVAUn DSB.ISH ISB FrePI { 0:X0=instr:"NOP"; 0:X1=P1:Lself00; 0:X3=x; 1:X0=x; 1:X2=P1:Lself00; } P0 | P1 ; STR W0,[X1] | LDR W1,[X0] ; DMB ISH | CBNZ W1,LC00 ; MOV W2,#1 | LC00: ; STR W2,[X3] | ISB ; | DC CVAU,X2 ; | DSB ISH ; | IC IVAU,X2 ; | DSB ISH ; | ISB ; | Lself00: ; | B .+12 ; | MOV W3,herd#2 ; | B .+8 ; | MOV W3,#1 ; exists (1:X1=1 /\ 1:X3=1)
maranget
added a commit
that referenced
this issue
Mar 27, 2024
[litmus] Permit initializing shared-memory locations with labels This PR aims to let `litmus7` execute tests that initialise memory locations with labels. Firstly, this following is now possible to run in the standard mode. ``` AArch64 test1 { int64_t 1:X2; 0:X1=x; 1:X1=x; 0:X2=P1:l1; 0:X3=z; 1:X3=z; 1:X9=P1:l0; z=P1:l0; } P0 |P1 ; MOV W0,#1 | LDR X2,[X3] ; STR W0,[X1] | BR X2 ; DMB ISH |l1: ; STR X2,[X3] | ISB ; | LDR W5,[X1] ; |l0: ; | CMP X2,X9 ; | CSET W4,NE ; locations [1:X2;1:X4;1:X5] exists 1:X4=1 /\ 1:X5=0 ``` ``` litmus7 -mode std test.litmus -o <output_dir> ``` Secondly, this following is now possible to run in the standard mode with `-variant self`. To this end, the implementation computes labels from the execution context each time upon the use. (Note that the first commit in this PR used to save them into arrays like global variables, but that does not seem necessary.) To enable this straightforward functionality, some definitions are moved out of the litmus test template dumping routine. ``` AArch64 test2 { 0:X0=instr:"NOP"; 0:X1=P1:m0; 0:X3=z; 1:X5=P1:m0; 1:X3=z; z=P1:l0; int64_t 1:X2; } P0 | P1 ; STR W0,[X1] | LDR X1,[X3] ; DMB ST | DC CVAU,X1 ; STR X1,[X3] | DSB ISH ; | IC IVAU,X1 ; | DSB ISH ; | ISB ; | m0: ; | B l0 ; | MOV W9,#1 ; | l0: ; | CMP X1,X5 ; | CSET X2,EQ ; exists 1:X2=1 /\ 1:X9=0 ``` ``` litmus7 -mode std test2.litmus -o <output_dir> -variant self ```
murzinv
pushed a commit
to murzinv/herdtools7
that referenced
this issue
May 17, 2024
Commits 93ee43a ("[all] Update the LDRS[BH] instructions.") and b4aefa6 ("[all,aarch64] Exhaustive implementation of LD<OP>") added special case for `MachSize.Quad` in `uxt_op` and `sxt_op`. Unfortunately, that breaks some operations for vector instructions. Consider the test ``` AArch64 T { uint8_t x[16]; 0 : X0 = x; } P0; MOVI V0.16B, herd#1 ; (* V0 = {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1} *) MOVI V1.16B, herd#2 ; (* V1 = {2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2} *) EOR V2.8B, V1.8B, V0.8B; (* V2 = {3,3,3,3,3,3,3,3,0,0,0,0,0,0,0,0} *) ST1{V2.16B}, [X0]; locations[x;] ``` When run on hardware it produces ``` 10000 :>x={3,3,3,3,3,3,3,3,0,0,0,0,0,0,0,0}; ``` howevere when tun with `herd` it prooduces ``` States 1 x={3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3}; ``` The reason is that `EOR` instruction is modeled by `simd_op` as ``` ... | AArch64.EOR -> fun (v1,v2) -> M.op Op.Xor v1 v2 ... end >>= fun v -> write_reg_neon_sz sz r1 v ii ``` In other words it exclusive-or 128 bit input values and offload destination how much should be written to the output register to `write_reg_neon_sz`, which in turn would perform zero extension as ``` uxt_op sz v >>= fun v -> ``` The problem is that `8B` register size is `MachSize.Quad` which means that it falls under special case which would not apply mask so full 128-bit value gets written to the output register. Let's restore intended behavior by removing special case on `Machsize.Quad` Suggested-by: Nikos Nikoleris <nikos.nikoleris@arm.com> Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com>
murzinv
pushed a commit
to murzinv/herdtools7
that referenced
this issue
May 29, 2024
Commits 93ee43a ("[all] Update the LDRS[BH] instructions.") and b4aefa6 ("[all,aarch64] Exhaustive implementation of LD<OP>") added special case for `MachSize.Quad` in `uxt_op` and `sxt_op`. Unfortunately, that breaks some operations for vector instructions. Consider the test ``` AArch64 T { uint8_t x[16]; 0 : X0 = x; } P0; MOVI V0.16B, herd#1 ; (* V0 = {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1} *) MOVI V1.16B, herd#2 ; (* V1 = {2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2} *) EOR V2.8B, V1.8B, V0.8B; (* V2 = {3,3,3,3,3,3,3,3,0,0,0,0,0,0,0,0} *) ST1{V2.16B}, [X0]; locations[x;] ``` When run on hardware it produces ``` 10000 :>x={3,3,3,3,3,3,3,3,0,0,0,0,0,0,0,0}; ``` howevere when tun with `herd` it prooduces ``` States 1 x={3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3}; ``` The reason is that `EOR` instruction is modeled by `simd_op` as ``` ... | AArch64.EOR -> fun (v1,v2) -> M.op Op.Xor v1 v2 ... end >>= fun v -> write_reg_neon_sz sz r1 v ii ``` In other words it exclusive-or 128 bit input values and offload destination how much should be written to the output register to `write_reg_neon_sz`, which in turn would perform zero extension as ``` uxt_op sz v >>= fun v -> ``` The problem is that `8B` register size is `MachSize.Quad` which means that it falls under special case which would not apply mask so full 128-bit value gets written to the output register. Let's restore intended behavior by removing special case on `Machsize.Quad` Suggested-by: Nikos Nikoleris <nikos.nikoleris@arm.com> Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com>
murzinv
pushed a commit
to murzinv/herdtools7
that referenced
this issue
Jun 3, 2024
Commits 93ee43a ("[all] Update the LDRS[BH] instructions.") and b4aefa6 ("[all,aarch64] Exhaustive implementation of LD<OP>") added special case for `MachSize.Quad` in `uxt_op` and `sxt_op`. Unfortunately, that breaks some operations for vector instructions. Consider the test ``` AArch64 T { uint8_t x[16]; 0 : X0 = x; } P0; MOVI V0.16B, herd#1 ; (* V0 = {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1} *) MOVI V1.16B, herd#2 ; (* V1 = {2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2} *) EOR V2.8B, V1.8B, V0.8B; (* V2 = {3,3,3,3,3,3,3,3,0,0,0,0,0,0,0,0} *) ST1{V2.16B}, [X0]; locations[x;] ``` When run on hardware it produces ``` 10000 :>x={3,3,3,3,3,3,3,3,0,0,0,0,0,0,0,0}; ``` howevere when tun with `herd` it prooduces ``` States 1 x={3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3}; ``` The reason is that `EOR` instruction is modeled by `simd_op` as ``` ... | AArch64.EOR -> fun (v1,v2) -> M.op Op.Xor v1 v2 ... end >>= fun v -> write_reg_neon_sz sz r1 v ii ``` In other words it exclusive-or 128 bit input values and offload destination how much should be written to the output register to `write_reg_neon_sz`, which in turn would perform zero extension as ``` uxt_op sz v >>= fun v -> ``` The problem is that `8B` register size is `MachSize.Quad` which means that it falls under special case which would not apply mask so full 128-bit value gets written to the output register. Let's restore intended behavior by removing special case on `Machsize.Quad` Suggested-by: Nikos Nikoleris <nikos.nikoleris@arm.com> Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com>
murzinv
pushed a commit
to murzinv/herdtools7
that referenced
this issue
Jun 3, 2024
Commits 93ee43a ("[all] Update the LDRS[BH] instructions.") and b4aefa6 ("[all,aarch64] Exhaustive implementation of LD<OP>") added special case for `MachSize.Quad` in `uxt_op` and `sxt_op`. Unfortunately, that breaks some operations for vector instructions. Consider the test ``` AArch64 T { uint8_t x[16]; 0 : X0 = x; } P0; MOVI V0.16B, herd#1 ; (* V0 = {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1} *) MOVI V1.16B, herd#2 ; (* V1 = {2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2} *) EOR V2.8B, V1.8B, V0.8B; (* V2 = {3,3,3,3,3,3,3,3,0,0,0,0,0,0,0,0} *) ST1{V2.16B}, [X0]; locations[x;] ``` When run on hardware it produces ``` 10000 :>x={3,3,3,3,3,3,3,3,0,0,0,0,0,0,0,0}; ``` howevere when tun with `herd` it prooduces ``` States 1 x={3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3}; ``` The reason is that `EOR` instruction is modeled by `simd_op` as ``` ... | AArch64.EOR -> fun (v1,v2) -> M.op Op.Xor v1 v2 ... end >>= fun v -> write_reg_neon_sz sz r1 v ii ``` In other words it exclusive-or 128 bit input values and offload destination how much should be written to the output register to `write_reg_neon_sz`, which in turn would perform zero extension as ``` uxt_op sz v >>= fun v -> ``` The problem is that `8B` register size is `MachSize.Quad` which means that it falls under special case which would not apply mask so full 128-bit value gets written to the output register. Let's restore intended behavior by removing special case on `Machsize.Quad` Suggested-by: Nikos Nikoleris <nikos.nikoleris@arm.com> Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com>
maranget
added a commit
that referenced
this issue
Aug 3, 2024
[litmus] Fix wrong register class for initialised stable registers When given as output parameters to gcc inline assembly templates, _initialised_ stable registers should have constraint "+" (and not "=") for their initialisation not to be overlooked by gcc. Here is an example of the difference between initialised and non-initialised stable registers: ``` AArch64 L Stable=X2 {} P0 | P1 ; MOV W0,W2 | MOV W2,#1 ; | MOV W0,W2 ; forall 0:X0=0 /\ 1:X0=1 ``` The `X2` register of `P0` is implicitly initialised to zero, while the content of `P1:X2` is overwritten by `P1` first instruction. + For `P0` the output constraint `=` is not adequate. Such a constraint would allow **gcc** _not_ to emit initialisation code. And **gcc** does behave that way. + For `P1` he output constraint `=` is adequate, as the register `P1:X2` is overwritten before being read.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Hi,
I was using herd7 (7.43, installed via OPAM) to run against the C11 litmus from https://github.com/nchong/c11popl15/blob/master/tests/a2.litmus:
But herd7 complains:
It seems "atomic_load_explicit" (and also "atomic_store_explicit" according to other tests) are recognized, but not "atomic_compare_exchange_strong_explicit".
I tried with latest version in master branch up to now, it complains the message as well. I also tried with old version herd in https://github.com/herd/herdtools, and it can successfully finish the run there.
So is herd7 indeed missing the part for "atomic_compare_exchange_strong_explicit" at the moment?
Thanks.
The text was updated successfully, but these errors were encountered: