[AArch64][SVE] Non-temporal load/store instructions fail to be generated from intrinsics and builtins #169034

@ytmukai

Since LLVM 21, using ACLE intrinsics for SVE non-temporal loads/stores with an all-true predicate fails to generate the expected non-temporal instructions.

Code to reproduce:

#include <arm_sve.h>

void f(double* a) {
    svbool_t allone = svptrue_b64();
    svstnt1(allone, a + 1,
            svldnt1(allone, a));
}

https://godbolt.org/z/za6Mb1Een

In LLVM 21, an all-true predicate is now represented as a constant splat_vector in the SelectionDAG. This enables an optimization in DAGCombiner that converts a masked_load/masked_store node into a regular load/store node. However, the instruction selection patterns for SVE non-temporal instructions are only defined for masked ones.

LLVM 20:

Initial selection DAG: %bb.0 'f:entry'
SelectionDAG has 13 nodes:
  t0: ch,glue = EntryToken
  t2: i64,ch = CopyFromReg t0, Register:i64 %0
  *** t5: nxv2i1 = llvm.aarch64.sve.ptrue TargetConstant:i64<1553>, TargetConstant:i32<31>
  t9: nxv2f64,ch = llvm.aarch64.sve.ldnt1<(non-temporal load (<vscale x 1 x s128>) from %ir.a, align 8, !tbaa !6)> t0, TargetConstant:i64<1481>, t5, t2
      t7: i64 = add nuw t2, Constant:i64<8>
    t11: ch = llvm.aarch64.sve.stnt1<(non-temporal store (<vscale x 1 x s128>) into %ir.add.ptr, align 8, !tbaa !6)> t9:1, TargetConstant:i64<1792>, t9, t5, t7
  t12: ch = AArch64ISD::RET_GLUE t11

LLVM 21:

Initial selection DAG: %bb.0 'f:entry'
SelectionDAG has 15 nodes:
  t0: ch,glue = EntryToken
  t2: i64,ch = CopyFromReg t0, Register:i64 %0
  t9: nxv2i1 = insert_vector_elt poison:nxv2i1, Constant:i1<-1>, Constant:i64<0>
  ***t10: nxv2i1 = splat_vector Constant:i1<-1>***
  t11: nxv2f64,ch = llvm.aarch64.sve.ldnt1<(non-temporal load (<vscale x 1 x s128>) from %ir.a, align 8, !tbaa !10)> t0, TargetConstant:i64<1598>, t10, t2
      t4: i64 = add nuw t2, Constant:i64<8>
    t13: ch = llvm.aarch64.sve.stnt1<(non-temporal store (<vscale x 1 x s128>) into %ir.add.ptr, align 8, !tbaa !10)> t11:1, TargetConstant:i64<1909>, t11, t10, t4
  t14: ch = AArch64ISD::RET_GLUE t13

Combining: t13: ch = llvm.aarch64.sve.stnt1<(non-temporal store (<vscale x 1 x s128>) into %ir.add.ptr, align 8, !tbaa !10)> t11:1, TargetConstant:i64<1909>, t11, t10, t4
 ... into: t17: ch = masked_store<(non-temporal store (<vscale x 1 x s128>) into %ir.add.ptr, align 8, !tbaa !10)> t11:1, t15, t4, undef:i64, t10

Combining: t17: ch = masked_store<(non-temporal store (<vscale x 1 x s128>) into %ir.add.ptr, align 8, !tbaa !10)> t11:1, t15, t4, undef:i64, t10
 ... into: t18: ch = store<(non-temporal store (<vscale x 1 x s128>) into %ir.add.ptr, align 8, !tbaa !10)> t11:1, t15, t4, undef:i64

The transformation from masked_load/store to load/store is performed here:
https://github.com/llvm/llvm-project/blob/622f72f4bef8b177e1e4f318465260fbdb7711ef/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp#L12782

The existing patterns are defined here:

defm : pred_load<nxv16i8, nxv16i1, non_temporal_load, LDNT1B_ZRR, LDNT1B_ZRI, am_sve_regreg_lsl0>;

A similar issue already existed with __builtin_nontemporal_load and __builtin_nontemporal_store: these builtins also fail to generate non-temporal instructions, apparently for the same root cause.

https://godbolt.org/z/rhzYaxjj5
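
For readers who cannot open the link, here is a minimal sketch of the kind of code affected. It is an assumption about the general shape of such a reproducer, not a copy of the linked one: the function name `copy_nt` is hypothetical, and the `__has_builtin` guard is only there so the snippet also compiles with compilers that lack these Clang builtins. When Clang vectorizes such a loop for SVE, the non-temporal hint reaches the SelectionDAG on regular load/store nodes, which is the case the existing masked patterns do not cover.

```c
#include <stddef.h>

/* Fallback so the guard below also parses on compilers without __has_builtin. */
#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

/* Hypothetical helper: copy n doubles, requesting non-temporal accesses
 * where the compiler provides the Clang builtins. */
void copy_nt(double *dst, const double *src, size_t n) {
    for (size_t i = 0; i < n; ++i) {
#if __has_builtin(__builtin_nontemporal_load) && \
    __has_builtin(__builtin_nontemporal_store)
        /* Load and store carry a non-temporal hint (Clang only). */
        double v = __builtin_nontemporal_load(&src[i]);
        __builtin_nontemporal_store(v, &dst[i]);
#else
        /* Plain copy on compilers without the builtins. */
        dst[i] = src[i];
#endif
    }
}
```

Compiling something like this with Clang for an SVE target at -O2 or higher and checking the assembly for ldnt1d/stnt1d should show whether the hint survives instruction selection.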

To resolve both of these issues, should we add isel patterns for non-temporal instructions that match regular load and store nodes?
