Description
When compiling the following C source using MVE vector intrinsics, the ARMLowOverheadLoops pass changes the semantics of the code by performing tail-predication.
// clang --target=arm-none-eabi -mcpu=cortex-m52 -mfloat-abi=hard -fno-inline-functions -O1 -S -o - 162644.c
#include <arm_mve.h>
float32x4_t inactive = {0.0, 0.0, 0.0, 0.0};
float32x4_t test_func(float32_t *array, int32_t len) {
    float32x4_t acc = vdupq_n_f32(0.1f);
    do {
        mve_pred16_t tailpred = vctp32q(len);
        float32x4_t vecSrc = vldrwq_z_f32(array, tailpred);
        acc = vaddq_m_f32(inactive, acc, vecSrc, tailpred);
        array += 4;
        len -= 4;
    } while (len > 0);
    return acc;
}
The source code loads four floats at a time from the input array and adds them elementwise to the vector acc. If len is not a multiple of 4, an explicit MVE predicate is constructed using vctp32q so that the final loop iteration loads fewer than 4 floats.
Usually in this kind of code the vaddq_m_f32 intrinsic would pass acc as its first operand as well as its second, so that any vector lanes disabled by the predicate would be left unchanged from their value in the previous iteration. However, in this code, vaddq_m_f32 takes its inactive lanes from the constant all-zero vector inactive.
So the semantics of this code as written is that any vector lane not used by the last loop iteration will be zero in the returned vector, rather than containing the sum of array elements from previous iterations.
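To make the intended result concrete, here is a minimal scalar sketch of the source semantics in plain C, with no MVE required. The function name model_source_semantics and the LANES macro are hypothetical names introduced for illustration. The model assumes that vctp32q(len) enables lane i exactly when i < len and that vldrwq_z_f32 zeroes disabled lanes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 4

/* Hypothetical scalar model of test_func as written in the source:
 * lanes disabled by the predicate are taken from the all-zero
 * 'inactive' vector, not from the previous value of acc. */
void model_source_semantics(const float *array, int32_t len, float out[LANES]) {
    assert(array != NULL && out != NULL);
    const float inactive[LANES] = {0.0f, 0.0f, 0.0f, 0.0f};
    float acc[LANES] = {0.1f, 0.1f, 0.1f, 0.1f};      /* vdupq_n_f32(0.1f) */
    size_t base = 0;                                  /* stands in for 'array += 4' */
    do {
        for (int lane = 0; lane < LANES; lane++) {
            int active = lane < len;                  /* vctp32q(len) */
            /* vldrwq_z_f32: disabled lanes are not loaded and read as zero */
            float src = active ? array[base + lane] : 0.0f;
            /* vaddq_m_f32(inactive, acc, vecSrc, tailpred) */
            acc[lane] = active ? acc[lane] + src : inactive[lane];
        }
        base += LANES;
        len -= LANES;
    } while (len > 0);
    for (int lane = 0; lane < LANES; lane++) {
        out[lane] = acc[lane];
    }
}
```

For a six-element input the second iteration has only lanes 0 and 1 active, so in this model lanes 2 and 3 of the result are exactly zero.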
Compiling this code with the extra option -mllvm -arm-loloops-disable-tailpred, the ARMLowOverheadLoops pass generates a low-overhead loop using dls and le, but leaves the tail-predication alone. The vmov q0,q1 in the middle of the loop is unpredicated, and copies all of the inactive vector into q0, including the lanes disabled by the current loop iteration's predicate. Then the predicated vaddt after that overwrites only the active lanes with the sum of the previous acc and the loaded values, just as the source code says.
        dls      lr, r2
.LBB0_1:
        vctp.32  r1
        vmov     q2, q0
        vpst
        vldrwt.u32 q3, [r0], #16   // tail-predicated: load from input array
        vmov     q0, q1            // unpredicated: copy 'inactive' into q0
        vpst
        vaddt.f32 q0, q2, q3       // tail-pred: overwrite some of q0 with q2+q3
        subs     r1, #4
        le       lr, .LBB0_1
But removing -mllvm -arm-loloops-disable-tailpred causes ARMLowOverheadLoops to perform a transformation that changes the semantics (as of commit b256d0a):
        dlstp.32 lr, r1
.LBB0_1:
        vmov     q2, q0
        vldrw.u32 q3, [r0], #16    // tail-predication now done by LTPSIZE
        vmov     q0, q1            // ALSO TAIL-PREDICATED but shouldn't be
        vadd.f32 q0, q2, q3        // tail-predicated as before
        letp     lr, .LBB0_1
Now the tail-predication in the last loop iteration is done by the dlstp and letp instructions setting the LTPSIZE field in FPSCR, instead of by constructing a predicate in VPR. This means that all the vector instructions in the loop are affected by the tail-predication. In particular, the vmov q0,q1 now copies only the active lanes into q0. So the inactive lanes in the final iteration will not be zeroed: they will keep whatever value was left in q0 after the previous iteration.
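The effect of the dlstp/letp loop can be illustrated with a small scalar sketch in plain C that applies the implicit predicate to every vector instruction in the loop body. The function name model_tail_predicated_loop and the LANES macro are hypothetical. The model assumes len > 0 and that lane i is active when i is less than the remaining element count. It also initialises q2 and q3 to zero purely so that the model is deterministic, which is an assumption and not something the hardware guarantees.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 4

/* Hypothetical scalar model of the dlstp/letp loop: every vector
 * instruction in the body writes only the lanes that are active under
 * the implicit tail predicate, including 'vmov q0, q1'. */
void model_tail_predicated_loop(const float *array, int32_t len, float out[LANES]) {
    assert(array != NULL && out != NULL);
    float q0[LANES] = {0.1f, 0.1f, 0.1f, 0.1f};       /* acc */
    const float q1[LANES] = {0.0f, 0.0f, 0.0f, 0.0f}; /* 'inactive' */
    float q2[LANES] = {0.0f, 0.0f, 0.0f, 0.0f};       /* assumed initial value */
    float q3[LANES] = {0.0f, 0.0f, 0.0f, 0.0f};       /* assumed initial value */
    size_t base = 0;                                  /* stands in for the post-increment of r0 */
    int32_t remaining = len;                          /* element count tracked by dlstp/letp */
    do {
        for (int lane = 0; lane < LANES; lane++) {
            if (lane >= remaining) {
                continue;                             /* inactive lane: no register is written */
            }
            q2[lane] = q0[lane];                      /* vmov q2, q0 */
            q3[lane] = array[base + lane];            /* vldrw.u32 q3, [r0], #16 */
            q0[lane] = q1[lane];                      /* vmov q0, q1 (now predicated too) */
            q0[lane] = q2[lane] + q3[lane];           /* vadd.f32 q0, q2, q3 */
        }
        base += LANES;
        remaining -= LANES;
    } while (remaining > 0);
    for (int lane = 0; lane < LANES; lane++) {
        out[lane] = q0[lane];
    }
}
```

With the six-element input 1, 2, 3, 4, 5, 6, lanes 2 and 3 of this model's result are the stale first-iteration sums (about 3.1 and 4.1) rather than the zeros that the source code asks for.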
In this situation, ARMLowOverheadLoops should recognize that tail-predicating the loop via LTPSIZE is an invalid transformation: the inactive lanes written by that vmov are needed, so the write to them cannot be discarded.