Description
Since LLVM 21, using ACLE intrinsics for SVE non-temporal loads/stores with an all-true predicate fails to generate the expected non-temporal instructions.
Code to reproduce:
#include <arm_sve.h>
void f(double* a) {
  svbool_t allone = svptrue_b64();
  svstnt1(allone, a + 1,
          svldnt1(allone, a));
}
https://godbolt.org/z/za6Mb1Een
In LLVM 21, an all-true predicate is now represented as a constant splat_vector in the SelectionDAG. This enables an optimization in DAGCombiner that converts a masked_load/masked_store node with an all-true mask into a regular load/store node. However, the instruction selection patterns for the SVE non-temporal instructions are defined only for the masked nodes, so the resulting regular load/store is no longer selected as a non-temporal instruction.
LLVM 20:
Initial selection DAG: %bb.0 'f:entry'
SelectionDAG has 13 nodes:
t0: ch,glue = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %0
*** t5: nxv2i1 = llvm.aarch64.sve.ptrue TargetConstant:i64<1553>, TargetConstant:i32<31>
t9: nxv2f64,ch = llvm.aarch64.sve.ldnt1<(non-temporal load (<vscale x 1 x s128>) from %ir.a, align 8, !tbaa !6)> t0, TargetConstant:i64<1481>, t5, t2
t7: i64 = add nuw t2, Constant:i64<8>
t11: ch = llvm.aarch64.sve.stnt1<(non-temporal store (<vscale x 1 x s128>) into %ir.add.ptr, align 8, !tbaa !6)> t9:1, TargetConstant:i64<1792>, t9, t5, t7
t12: ch = AArch64ISD::RET_GLUE t11
LLVM 21:
Initial selection DAG: %bb.0 'f:entry'
SelectionDAG has 15 nodes:
t0: ch,glue = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t9: nxv2i1 = insert_vector_elt poison:nxv2i1, Constant:i1<-1>, Constant:i64<0>
***t10: nxv2i1 = splat_vector Constant:i1<-1>***
t11: nxv2f64,ch = llvm.aarch64.sve.ldnt1<(non-temporal load (<vscale x 1 x s128>) from %ir.a, align 8, !tbaa !10)> t0, TargetConstant:i64<1598>, t10, t2
t4: i64 = add nuw t2, Constant:i64<8>
t13: ch = llvm.aarch64.sve.stnt1<(non-temporal store (<vscale x 1 x s128>) into %ir.add.ptr, align 8, !tbaa !10)> t11:1, TargetConstant:i64<1909>, t11, t10, t4
t14: ch = AArch64ISD::RET_GLUE t13
Combining: t13: ch = llvm.aarch64.sve.stnt1<(non-temporal store (<vscale x 1 x s128>) into %ir.add.ptr, align 8, !tbaa !10)> t11:1, TargetConstant:i64<1909>, t11, t10, t4
... into: t17: ch = masked_store<(non-temporal store (<vscale x 1 x s128>) into %ir.add.ptr, align 8, !tbaa !10)> t11:1, t15, t4, undef:i64, t10
Combining: t17: ch = masked_store<(non-temporal store (<vscale x 1 x s128>) into %ir.add.ptr, align 8, !tbaa !10)> t11:1, t15, t4, undef:i64, t10
... into: t18: ch = store<(non-temporal store (<vscale x 1 x s128>) into %ir.add.ptr, align 8, !tbaa !10)> t11:1, t15, t4, undef:i64
The transformation from masked_load/store to load/store is performed here:
https://github.com/llvm/llvm-project/blob/622f72f4bef8b177e1e4f318465260fbdb7711ef/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp#L12782
The existing patterns are defined here:
defm : pred_load<nxv16i8, nxv16i1, non_temporal_load, LDNT1B_ZRR, LDNT1B_ZRI, am_sve_regreg_lsl0>;
A similar issue already existed with __builtin_nontemporal_load/store: these builtins also fail to generate non-temporal instructions. This appears to have the same root cause.
https://godbolt.org/z/rhzYaxjj5
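For readers who cannot open the link, here is a rough sketch of the kind of code that uses these builtins. It is not a copy of the linked reproducer: the function name copy_nt and the scalar loop are assumptions made for illustration, and the __has_builtin guard is only there so the snippet still compiles with compilers that lack the Clang builtins.

```c
#include <stddef.h>

/* Fallback so the guard below also works on compilers without __has_builtin. */
#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

/* Hypothetical helper (not from the linked reproducer): copies n doubles,
 * requesting non-temporal accesses where the Clang builtins are available. */
static void copy_nt(double *dst, const double *src, size_t n) {
    for (size_t i = 0; i < n; ++i) {
#if __has_builtin(__builtin_nontemporal_load) && \
    __has_builtin(__builtin_nontemporal_store)
        /* Hint to the compiler that this data will not be reused soon. */
        double v = __builtin_nontemporal_load(&src[i]);
        __builtin_nontemporal_store(v, &dst[i]);
#else
        /* Plain copy when the builtins are not available. */
        dst[i] = src[i];
#endif
    }
}
```

The builtins are only a hint, so the copy is functionally identical either way. What this issue is about is whether the hint survives to instruction selection when targeting SVE.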
To resolve both of these issues, should we add isel patterns for non-temporal instructions that match regular load and store nodes?