Invalid MVE tail-predication in ARMLowOverheadLoops #162644

@statham-arm

When compiling the following C source using MVE vector intrinsics, the ARMLowOverheadLoops pass changes the semantics of the code by performing tail-predication.

// clang --target=arm-none-eabi -mcpu=cortex-m52 -mfloat-abi=hard -fno-inline-functions -O1 -S -o - 162644.c

#include <arm_mve.h>

float32x4_t inactive = {0.0, 0.0, 0.0, 0.0};

float32x4_t test_func(float32_t *array, int32_t len) {
    float32x4_t acc = vdupq_n_f32(0.1f);

    do {
        mve_pred16_t tailpred = vctp32q(len);
        float32x4_t vecSrc = vldrwq_z_f32(array, tailpred);
        acc = vaddq_m_f32(inactive, acc, vecSrc, tailpred);
        array += 4;
        len -= 4;
    } while (len > 0);

    return acc;
}

The source code loads four floats at a time from the input array, and adds them elementwise to the vector acc. In case len is not a multiple of 4, an explicit MVE predicate is constructed using vctp32q so that the final loop iteration will load fewer than 4 floats.
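As a rough illustration of what that predicate looks like, here is a scalar sketch of the value vctp32q produces. The function name model_vctp32 is hypothetical, and the sketch assumes the usual MVE predicate layout of one bit per byte, so each 32-bit lane owns four bits of the 16-bit predicate.

```c
#include <stdint.h>

/* Hypothetical scalar model of vctp32q(n): lane i is active iff i < n.
 * Assumes one predicate bit per vector byte, so each of the four
 * 32-bit lanes contributes a group of 4 bits. */
uint16_t model_vctp32(uint32_t n) {
    uint16_t pred = 0;
    for (uint32_t lane = 0; lane < 4; lane++) {
        if (lane < n) {
            pred |= (uint16_t)(0xFu << (4 * lane));
        }
    }
    return pred;
}
```

With len = 6, for example, the first iteration sees n = 6 and gets an all-true predicate, and the second sees n = 2 and gets a predicate covering only lanes 0 and 1.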

Usually in this kind of code the vaddq_m_f32 intrinsic would pass acc as its first operand as well as its second, so that any vector lanes disabled by the predicate would be left unchanged from their value in the previous iteration. However, in this code, vaddq_m_f32 takes its inactive lanes from the constant all-zero vector inactive.

So, as written, the code returns a vector in which any lane not used by the last loop iteration is zero, rather than containing the sum of array elements from previous iterations.
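To make the intended result concrete, here is a portable scalar sketch of the source semantics. It does not use arm_mve.h. The name model_expected and the out parameter are hypothetical, and the model assumes len > 0 so that the signed comparison matches the predicate.

```c
#include <stdint.h>

#define LANES 4

/* Hypothetical scalar model of test_func as written in the source.
 * Lanes outside the predicate take their value from the all-zero
 * 'inactive' vector. Assumes len > 0. */
void model_expected(const float *array, int32_t len, float out[LANES]) {
    float acc[LANES] = {0.1f, 0.1f, 0.1f, 0.1f};  /* vdupq_n_f32(0.1f) */

    do {
        for (int lane = 0; lane < LANES; lane++) {
            int active = lane < len;              /* vctp32q(len) */
            if (active) {
                /* vldrwq_z_f32 + vaddq_m_f32 on an active lane */
                acc[lane] = acc[lane] + array[lane];
            } else {
                /* inactive lane: copied from 'inactive', i.e. zero */
                acc[lane] = 0.0f;
            }
        }
        array += LANES;
        len -= LANES;
    } while (len > 0);

    for (int lane = 0; lane < LANES; lane++) {
        out[lane] = acc[lane];
    }
}
```

For a 6-element input, the second iteration has only lanes 0 and 1 active, so lanes 2 and 3 of the result are exactly zero.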

When this code is compiled with the extra option -mllvm -arm-loloops-disable-tailpred, the ARMLowOverheadLoops pass generates a low-overhead loop using dls and le, but leaves the tail-predication alone. The vmov q0,q1 in the middle of the loop is unpredicated, and copies all of the inactive vector into q0, including the lanes disabled by the current loop iteration's predicate. Then the predicated vaddt after that overwrites only the active lanes with the sum of the previous acc and the loaded values, just as the source code says.

        dls         lr, r2
.LBB0_1:
        vctp.32     r1
        vmov        q2, q0
        vpst
        vldrwt.u32  q3, [r0], #16 // tail-predicated: load from input array
        vmov        q0, q1        // unpredicated: copy 'inactive' into q0
        vpst
        vaddt.f32   q0, q2, q3    // tail-pred: overwrite some of q0 with q2+q3
        subs        r1, #4
        le          lr, .LBB0_1

But removing -mllvm -arm-loloops-disable-tailpred causes ARMLowOverheadLoops to perform a transformation that changes the semantics (as of commit b256d0a):

        dlstp.32    lr, r1
.LBB0_1:
        vmov        q2, q0
        vldrw.u32   q3, [r0], #16 // tail-predication now done by LTPSIZE
        vmov        q0, q1        // ALSO TAIL-PREDICATED but shouldn't be
        vadd.f32    q0, q2, q3    // tail-predicated as before
        letp        lr, .LBB0_1

Now the tail-predication in the last loop iteration is done by the dlstp and letp instructions setting the LTPSIZE field in FPSCR, instead of by constructing a predicate in VPR. This means that all the instructions in the loop are affected by the tail-predication. In particular, the vmov q0,q1 is now copying only the active lanes into q0. So the inactive lanes in the final iteration will not be zeroed: they will take whatever value was left in q0 after the previous iteration.
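The effect can be sketched with a scalar model of the dlstp/letp loop, in which every vector instruction in the body, including the vmov, is masked by the remaining element count. The name model_tailpred and the out parameter are hypothetical; the arrays q0 to q3 mirror the registers in the listing above, and the model assumes len > 0.

```c
#include <stdint.h>

#define LANES 4

/* Hypothetical scalar model of the tail-predicated loop: LTPSIZE masks
 * every vector instruction, so inactive lanes of q0 are never written.
 * Assumes len > 0. */
void model_tailpred(const float *array, int32_t len, float out[LANES]) {
    float q0[LANES] = {0.1f, 0.1f, 0.1f, 0.1f};  /* acc */
    float q1[LANES] = {0.0f, 0.0f, 0.0f, 0.0f};  /* 'inactive' */
    float q2[LANES] = {0.0f, 0.0f, 0.0f, 0.0f};
    float q3[LANES] = {0.0f, 0.0f, 0.0f, 0.0f};

    do {
        for (int lane = 0; lane < LANES; lane++) {
            if (lane >= len) {
                continue;                    /* lane masked for ALL instructions */
            }
            q2[lane] = q0[lane];             /* vmov q2, q0 */
            q3[lane] = array[lane];          /* vldrw.u32 q3, [r0], #16 */
            q0[lane] = q1[lane];             /* vmov q0, q1 (masked too) */
            q0[lane] = q2[lane] + q3[lane];  /* vadd.f32 q0, q2, q3 */
        }
        array += LANES;
        len -= LANES;
    } while (len > 0);

    for (int lane = 0; lane < LANES; lane++) {
        out[lane] = q0[lane];
    }
}
```

For a 6-element input, lanes 2 and 3 keep the sums from the first iteration (roughly 3.1 and 4.1) instead of the zeros that the source code asks for.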

In this situation, ARMLowOverheadLoops should recognize that tail-predicating the loop via LTPSIZE is an invalid transformation: the inactive lanes written by that vmov are needed, so the write to them cannot be discarded.
