
[AArch64][SVE] Handle some cases of uzp1/reinterpret from svbool in isZeroingInactiveLanes #78623

Closed
wants to merge 2 commits

Conversation

UsmanNadeem
Contributor

@UsmanNadeem UsmanNadeem commented Jan 18, 2024

Currently we generate unnecessary `and` instructions to zero inactive lanes when dealing with patterns of reinterpret and uzp1 that combine multiple vector predicates.

This patch handles those cases in isZeroingInactiveLanes.

@UsmanNadeem UsmanNadeem changed the title svepred [AArch64][SVE] Handle some cases of uzp1/reinterpret from svbool in isZeroingInactiveLanes Jan 18, 2024
@llvmbot
Collaborator

llvmbot commented Jan 18, 2024

@llvm/pr-subscribers-backend-aarch64

Author: Usman Nadeem (UsmanNadeem)

Changes
  • Precommit tests
  • [AArch64][SVE] Handle some cases of uzp1/reinterpret from svbool in isZeroingInactiveLanes

Full diff: https://github.com/llvm/llvm-project/pull/78623.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64ISelLowering.cpp (+18)
  • (modified) llvm/test/CodeGen/AArch64/sve-intrinsics-reinterpret.ll (+103)
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index 8a6f1dc7487bae8..11498b6622486fd 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -283,6 +283,24 @@ static bool isZeroingInactiveLanes(SDValue Op) {
     switch (Op.getConstantOperandVal(0)) {
     default:
       return false;
+
+    case Intrinsic::aarch64_sve_uzp1:
+      return isZeroingInactiveLanes(Op.getOperand(1)) &&
+             isZeroingInactiveLanes(Op.getOperand(2));
+    case Intrinsic::aarch64_sve_convert_from_svbool: {
+      // Inactive lanes are zero if the svbool was constructed in a way that the
+      // initial type had the same or fewer lanes than the current op and the
+      // inactive lanes were cleared on construction.
+      SDValue Op1 = Op.getOperand(1);
+      if (Op1.getOpcode() == ISD::INTRINSIC_WO_CHAIN &&
+          Op1.getConstantOperandVal(0) ==
+              Intrinsic::aarch64_sve_convert_to_svbool) {
+        SDValue ConstrOp = Op1.getOperand(1);
+        return ConstrOp.getValueType().bitsLE(Op.getValueType()) &&
+               isZeroingInactiveLanes(ConstrOp);
+      }
+      return false;
+    }
     case Intrinsic::aarch64_sve_ptrue:
     case Intrinsic::aarch64_sve_pnext:
     case Intrinsic::aarch64_sve_cmpeq:
diff --git a/llvm/test/CodeGen/AArch64/sve-intrinsics-reinterpret.ll b/llvm/test/CodeGen/AArch64/sve-intrinsics-reinterpret.ll
index 82bf756f8228984..f390ad381705f92 100644
--- a/llvm/test/CodeGen/AArch64/sve-intrinsics-reinterpret.ll
+++ b/llvm/test/CodeGen/AArch64/sve-intrinsics-reinterpret.ll
@@ -150,10 +150,113 @@ define <vscale x 16 x i1> @chained_reinterpret() {
   ret <vscale x 16 x i1> %out
 }
 
+; Convert two nxv4i1 to nxv8i1 and combine them using uzp1 and then cast to svbool.
+; Icmp zeros the lanes so the uzp'd result also has zeroed lanes. There should
+; be no need to manually zero the lanes.
+define <vscale x 16 x i1> @uzp1_to_svbool(<vscale x 4 x i32> %v0, <vscale x 4 x i32> %v1, <vscale x 4 x i32> %x) {
+; CHECK-LABEL: uzp1_to_svbool:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ptrue p0.s
+; CHECK-NEXT:    cmphi p1.s, p0/z, z2.s, z0.s
+; CHECK-NEXT:    cmphi p0.s, p0/z, z2.s, z1.s
+; CHECK-NEXT:    uzp1 p0.h, p1.h, p0.h
+; CHECK-NEXT:    ret
+  %cmp0 = icmp ult <vscale x 4 x i32> %v0, %x
+  %cmp1 = icmp ult <vscale x 4 x i32> %v1, %x
+  %1 = tail call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv4i1(<vscale x 4 x i1> %cmp0)
+  %2 = tail call <vscale x 8 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv8i1(<vscale x 16 x i1> %1)
+  %3 = tail call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv4i1(<vscale x 4 x i1> %cmp1)
+  %4 = tail call <vscale x 8 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv8i1(<vscale x 16 x i1> %3)
+  %uz1 = tail call <vscale x 8 x i1> @llvm.aarch64.sve.uzp1.nxv8i1(<vscale x 8 x i1> %2, <vscale x 8 x i1> %4)
+  %uz1_svbool = tail call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv8i1(<vscale x 8 x i1> %uz1)
+  ret <vscale x 16 x i1> %uz1_svbool
+}
+
+; Negative test. cmp0 has 8 lanes so we need to zero.
+define <vscale x 16 x i1> @uzp1_to_svbool_neg(<vscale x 8 x i16> %v0, <vscale x 4 x i32> %v1, <vscale x 8 x i16> %x, <vscale x 4 x i32> %y) {
+; CHECK-LABEL: uzp1_to_svbool_neg:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ptrue p0.h
+; CHECK-NEXT:    ptrue p1.s
+; CHECK-NEXT:    cmphi p0.h, p0/z, z2.h, z0.h
+; CHECK-NEXT:    cmphi p2.s, p1/z, z3.s, z1.s
+; CHECK-NEXT:    uzp1 p0.s, p0.s, p2.s
+; CHECK-NEXT:    and p0.b, p0/z, p0.b, p1.b
+; CHECK-NEXT:    ret
+  %cmp0 = icmp ult <vscale x 8 x i16> %v0, %x
+  %cmp1 = icmp ult <vscale x 4 x i32> %v1, %y
+  %1 = tail call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv8i1(<vscale x 8 x i1> %cmp0)
+  %2 = tail call <vscale x 4 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv4i1(<vscale x 16 x i1> %1)
+  %uz1 = tail call <vscale x 4 x i1> @llvm.aarch64.sve.uzp1.nxv4i1(<vscale x 4 x i1> %2, <vscale x 4 x i1> %cmp1)
+  %uz1_svbool = tail call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv4i1(<vscale x 4 x i1> %uz1)
+  ret <vscale x 16 x i1> %uz1_svbool
+}
+
+; Test with chained reinterprets.
+define <vscale x 16 x i1> @chainedReinterpted_uzp1_to_svbool(<vscale x 2 x i64> %v0, <vscale x 2 x i64> %v1, <vscale x 2 x i64> %x) {
+; CHECK-LABEL: chainedReinterpted_uzp1_to_svbool:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ptrue p0.d
+; CHECK-NEXT:    cmphi p1.d, p0/z, z2.d, z0.d
+; CHECK-NEXT:    cmphi p0.d, p0/z, z2.d, z1.d
+; CHECK-NEXT:    uzp1 p0.h, p1.h, p0.h
+; CHECK-NEXT:    ret
+  %cmp0 = icmp ult <vscale x 2 x i64> %v0, %x
+  %cmp1 = icmp ult <vscale x 2 x i64> %v1, %x
+  %1 = tail call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv2i1(<vscale x 2 x i1> %cmp0)
+  %2 = tail call <vscale x 4 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv4i1(<vscale x 16 x i1> %1)
+  %3 = tail call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv2i1(<vscale x 2 x i1> %cmp1)
+  %4 = tail call <vscale x 4 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv4i1(<vscale x 16 x i1> %3)
+  %5 = tail call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv4i1(<vscale x 4 x i1> %2)
+  %6 = tail call <vscale x 8 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv8i1(<vscale x 16 x i1> %5)
+  %7 = tail call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv4i1(<vscale x 4 x i1> %4)
+  %8 = tail call <vscale x 8 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv8i1(<vscale x 16 x i1> %7)
+  %uz1 = tail call <vscale x 8 x i1> @llvm.aarch64.sve.uzp1.nxv8i1(<vscale x 8 x i1> %6, <vscale x 8 x i1> %8)
+  %uz1_svbool = tail call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv8i1(<vscale x 8 x i1> %uz1)
+  ret <vscale x 16 x i1> %uz1_svbool
+}
+
+; Test with chained uzp1s.
+define <vscale x 16 x i1> @chainedUzp1_to_svbool(<vscale x 4 x i32> %v0, <vscale x 4 x i32> %v1, <vscale x 4 x i32> %v2, <vscale x 4 x i32> %v3, <vscale x 4 x i32> %x) {
+; CHECK-LABEL: chainedUzp1_to_svbool:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ptrue p0.s
+; CHECK-NEXT:    cmphi p1.s, p0/z, z4.s, z0.s
+; CHECK-NEXT:    cmphi p2.s, p0/z, z4.s, z1.s
+; CHECK-NEXT:    cmphi p3.s, p0/z, z4.s, z2.s
+; CHECK-NEXT:    cmphi p0.s, p0/z, z4.s, z3.s
+; CHECK-NEXT:    uzp1 p1.h, p1.h, p2.h
+; CHECK-NEXT:    uzp1 p0.h, p3.h, p0.h
+; CHECK-NEXT:    uzp1 p0.b, p1.b, p0.b
+; CHECK-NEXT:    ret
+  %cmp0 = icmp ult <vscale x 4 x i32> %v0, %x
+  %cmp1 = icmp ult <vscale x 4 x i32> %v1, %x
+  %cmp2 = icmp ult <vscale x 4 x i32> %v2, %x
+  %cmp3 = icmp ult <vscale x 4 x i32> %v3, %x
+  %1 = tail call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv4i1(<vscale x 4 x i1> %cmp0)
+  %2 = tail call <vscale x 8 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv8i1(<vscale x 16 x i1> %1)
+  %3 = tail call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv4i1(<vscale x 4 x i1> %cmp1)
+  %4 = tail call <vscale x 8 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv8i1(<vscale x 16 x i1> %3)
+  %uz1_1 = tail call <vscale x 8 x i1> @llvm.aarch64.sve.uzp1.nxv8i1(<vscale x 8 x i1> %2, <vscale x 8 x i1> %4)
+  %5 = tail call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv4i1(<vscale x 4 x i1> %cmp2)
+  %6 = tail call <vscale x 8 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv8i1(<vscale x 16 x i1> %5)
+  %7 = tail call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv4i1(<vscale x 4 x i1> %cmp3)
+  %8 = tail call <vscale x 8 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv8i1(<vscale x 16 x i1> %7)
+  %uz1_2 = tail call <vscale x 8 x i1> @llvm.aarch64.sve.uzp1.nxv8i1(<vscale x 8 x i1> %6, <vscale x 8 x i1> %8)
+  %9 = tail call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv8i1(<vscale x 8 x i1> %uz1_1)
+  %10 = tail call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv8i1(<vscale x 8 x i1> %uz1_2)
+  %uz3 = tail call <vscale x 16 x i1> @llvm.aarch64.sve.uzp1.nxv16i1(<vscale x 16 x i1> %9, <vscale x 16 x i1> %10)
+  ret <vscale x 16 x i1> %uz3
+}
+
 declare <vscale x 8 x i1> @llvm.aarch64.sve.ptrue.nxv8i1(i32 immarg)
 declare <vscale x 16 x i1> @llvm.aarch64.sve.ptrue.nxv16i1(i32 immarg)
 declare <vscale x 8 x i1> @llvm.aarch64.sve.cmpgt.nxv8i16(<vscale x 8 x i1>, <vscale x 8 x i16>, <vscale x 8 x i16>)
 
+declare <vscale x 4 x i1> @llvm.aarch64.sve.uzp1.nxv4i1(<vscale x 4 x i1>, <vscale x 4 x i1>)
+declare <vscale x 8 x i1> @llvm.aarch64.sve.uzp1.nxv8i1(<vscale x 8 x i1>, <vscale x 8 x i1>)
+declare <vscale x 16 x i1> @llvm.aarch64.sve.uzp1.nxv16i1(<vscale x 16 x i1>, <vscale x 16 x i1>)
+
 declare <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv16i1(<vscale x 16 x i1>)
 declare <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv8i1(<vscale x 8 x i1>)
 declare <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv4i1(<vscale x 4 x i1>)

Change-Id: Ieb9146ed979c4c04d876106786628a54971cfb15
…isZeroingInactiveLanes`

See the precommitted test diff for the codegen changes.

Change-Id: If82e00e5fdfcb268c7768d9b51e74315d8c193c4
@UsmanNadeem
Contributor Author

Ping


    case Intrinsic::aarch64_sve_uzp1:
      return isZeroingInactiveLanes(Op.getOperand(1)) &&
             isZeroingInactiveLanes(Op.getOperand(2));
Collaborator

Looking at the reference manual for uzp1, it looks like the instruction inherently zeros the unused lanes: the iteration changes depending on the specified type. So unless I'm missing something, looking at the operands doesn't seem useful.

The way this check works seems inherently fragile to me: we're assuming the SVE intrinsics in question are getting lowered to precisely the corresponding instructions, and that we don't do any optimizations that might affect the bits that aren't part of the type. But that's not new with this patch.

Contributor Author

https://reviews.llvm.org/D141469#4057420 has some comments by @paulwalker-arm about how the actual "block" for element type d/s/h/b is copied unmodified, so there may be some bits in there that don't belong to the predicate type.

I'll look into the manual as well.

Collaborator

Oh, I see, yes, you're right.

Collaborator

I'm not so sure this is the correct place for such code, because the function relates to zeroing through construction rather than propagation. It is a bit of a kludge, but I find comfort in knowing it only handles a specific set of intrinsics that we know have no special handling. Once you start recursing down the DAG I start to feel less comfortable.

That said, aarch64_sve_uzp1 is also a weird intrinsic because for predicates only the nxv16i1 case is tied by design to a specific instruction (i.e. it's not one of the intrinsics where, at least as of today, we've attached any meaning to the invisible lanes). I only maintained the support for other predicate types when creating https://reviews.llvm.org/D142065 because it seemed useful to maintain it as a way to logically interleave two vectors. But we have dedicated target neutral intrinsics for that now.

So I guess my question is: in what circumstance do you need to support to_svbool(sve.uzp1(val))? I wasn't expecting to see this from ACLE code, for example. I'm not against the patch, but I would like to understand the rationale more.

Contributor Author

We have some code, implemented using intrinsics, that is similar to the minimal C example below. The IR equivalent would be chainedUzp1_to_svbool in the tests: the result of each icmp is reinterpreted and combined using uzp1 and to/from svbool until we get a vscale x 16 x i1.

If this function is not an appropriate place for this code, would it make sense to do it in instcombine instead, by simplifying this kind of pattern to a concat using the llvm.vector.insert intrinsic?

#include <stdint.h>

void test(uint8_t *x, uint8_t *y, uint32_t v, int n) {
  for (int i = 0; i < n; i++) {
    x[i] = y[i] == v;
  }
}
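
For illustration, a rough sketch of what such a combine could produce for the uzp1_to_svbool test above (the function name and use of poison are just for the example): llvm.vector.insert concatenates the two nxv4i1 halves directly, so the round trip through svbool and the nxv8i1 uzp1 go away and no lane zeroing is needed.

; Hypothetical simplified form of @uzp1_to_svbool after such a combine (sketch).
define <vscale x 16 x i1> @uzp1_to_svbool_concat(<vscale x 4 x i32> %v0, <vscale x 4 x i32> %v1, <vscale x 4 x i32> %x) {
  %cmp0 = icmp ult <vscale x 4 x i32> %v0, %x
  %cmp1 = icmp ult <vscale x 4 x i32> %v1, %x
  ; %cmp0 goes into the low half and %cmp1 into the high half of an nxv8i1.
  %lo  = call <vscale x 8 x i1> @llvm.vector.insert.nxv8i1.nxv4i1(<vscale x 8 x i1> poison, <vscale x 4 x i1> %cmp0, i64 0)
  %cat = call <vscale x 8 x i1> @llvm.vector.insert.nxv8i1.nxv4i1(<vscale x 8 x i1> %lo, <vscale x 4 x i1> %cmp1, i64 4)
  ; Widening the concatenated predicate to svbool leaves the extra lanes false.
  %res = call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv8i1(<vscale x 8 x i1> %cat)
  ret <vscale x 16 x i1> %res
}

declare <vscale x 8 x i1> @llvm.vector.insert.nxv8i1.nxv4i1(<vscale x 8 x i1>, <vscale x 4 x i1>, i64)
declare <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv8i1(<vscale x 8 x i1>)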

Contributor Author

Posted an instcombine version: #81069

Collaborator

I see, thanks for the info. I certainly prefer the idea of having this as an instcombine, so I'll take a look at the other PR. Looking at the IR, I do wonder though whether you're just emitting the wrong intrinsics: rather than using sve.uzp1 for everything, you could use the larger-element variants (e.g. sve.uzp1.b16) that operate on vscale x 16 x i1 and thus remove the need for the sve.convert.from.svbool intrinsics.
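
For concreteness, a sketch of what that alternative might look like for the uzp1_to_svbool case, assuming the llvm.aarch64.sve.uzp1.b16 intrinsic from https://reviews.llvm.org/D142065 is used (the function name is hypothetical); both operands stay <vscale x 16 x i1>, so the sve.convert.from.svbool round trip disappears.

; Hypothetical b16 form of @uzp1_to_svbool (sketch): uzp1 at 16-bit granularity
; directly on the svbool values, so no convert.from.svbool is required.
define <vscale x 16 x i1> @uzp1_to_svbool_b16(<vscale x 4 x i32> %v0, <vscale x 4 x i32> %v1, <vscale x 4 x i32> %x) {
  %cmp0 = icmp ult <vscale x 4 x i32> %v0, %x
  %cmp1 = icmp ult <vscale x 4 x i32> %v1, %x
  %p0 = call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv4i1(<vscale x 4 x i1> %cmp0)
  %p1 = call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv4i1(<vscale x 4 x i1> %cmp1)
  %uz = call <vscale x 16 x i1> @llvm.aarch64.sve.uzp1.b16(<vscale x 16 x i1> %p0, <vscale x 16 x i1> %p1)
  ret <vscale x 16 x i1> %uz
}

declare <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv4i1(<vscale x 4 x i1>)
declare <vscale x 16 x i1> @llvm.aarch64.sve.uzp1.b16(<vscale x 16 x i1>, <vscale x 16 x i1>)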

UsmanNadeem added a commit to UsmanNadeem/llvm-project that referenced this pull request Feb 8, 2024
Concatenating two predicates using uzp1 after converting to double length
using sve.convert.to/from.svbool is optimized poorly in the backend,
resulting in additional `and` instructions to zero the lanes. See
llvm#78623

Combine this pattern to use `llvm.vector.insert` to concatenate and get
rid of convert to/from svbools.

Change-Id: Ieb38055f1f2304ead1ab2bead03ebd046b3bd36f
UsmanNadeem added a commit that referenced this pull request Feb 15, 2024
…rt (#81069)

Concatenating two predicates using uzp1 after converting to double length
using sve.convert.to/from.svbool is optimized poorly in the backend,
resulting in additional `and` instructions to zero the lanes. See
#78623

Combine this pattern to use `llvm.vector.insert` to concatenate and get
rid of convert to/from svbools.
@UsmanNadeem
Contributor Author

Abandoning this PR in favor of #81069.
