
[AArch64][SVE] Handle some cases of uzp1/reinterpret from svbool in isZeroingInactiveLanes #78623

Closed
wants to merge 2 commits

Conversation

UsmanNadeem
Contributor

@UsmanNadeem UsmanNadeem commented Jan 18, 2024

Currently we generate unnecessary `and` instructions to zero inactive lanes when dealing with patterns of reinterpret and uzp1 that combine multiple vector predicates.

This patch handles those cases in isZeroingInactiveLanes.

@UsmanNadeem UsmanNadeem changed the title svepred [AArch64][SVE] Handle some cases of uzp1/reinterpret from svbool in isZeroingInactiveLanes Jan 18, 2024
@llvmbot
Collaborator

llvmbot commented Jan 18, 2024

@llvm/pr-subscribers-backend-aarch64

Author: Usman Nadeem (UsmanNadeem)

Changes
  • Precommit tests
  • [AArch64][SVE] Handle some cases of uzp1/reinterpret from svbool in isZeroingInactiveLanes

Full diff: https://github.com/llvm/llvm-project/pull/78623.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64ISelLowering.cpp (+18)
  • (modified) llvm/test/CodeGen/AArch64/sve-intrinsics-reinterpret.ll (+103)
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index 8a6f1dc7487bae8..11498b6622486fd 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -283,6 +283,24 @@ static bool isZeroingInactiveLanes(SDValue Op) {
     switch (Op.getConstantOperandVal(0)) {
     default:
       return false;
+
+    case Intrinsic::aarch64_sve_uzp1:
+      return isZeroingInactiveLanes(Op.getOperand(1)) &&
+             isZeroingInactiveLanes(Op.getOperand(2));
+    case Intrinsic::aarch64_sve_convert_from_svbool: {
+      // Inactive lanes are zero if the svbool was constructed in a way that the
+      // initial type had the same or fewer lanes than the current op and the
+      // inactive lanes were cleared on construction.
+      SDValue Op1 = Op.getOperand(1);
+      if (Op1.getOpcode() == ISD::INTRINSIC_WO_CHAIN &&
+          Op1.getConstantOperandVal(0) ==
+              Intrinsic::aarch64_sve_convert_to_svbool) {
+        SDValue ConstrOp = Op1.getOperand(1);
+        return ConstrOp.getValueType().bitsLE(Op.getValueType()) &&
+               isZeroingInactiveLanes(ConstrOp);
+      }
+      return false;
+    }
     case Intrinsic::aarch64_sve_ptrue:
     case Intrinsic::aarch64_sve_pnext:
     case Intrinsic::aarch64_sve_cmpeq:
diff --git a/llvm/test/CodeGen/AArch64/sve-intrinsics-reinterpret.ll b/llvm/test/CodeGen/AArch64/sve-intrinsics-reinterpret.ll
index 82bf756f8228984..f390ad381705f92 100644
--- a/llvm/test/CodeGen/AArch64/sve-intrinsics-reinterpret.ll
+++ b/llvm/test/CodeGen/AArch64/sve-intrinsics-reinterpret.ll
@@ -150,10 +150,113 @@ define <vscale x 16 x i1> @chained_reinterpret() {
   ret <vscale x 16 x i1> %out
 }
 
+; Convert two nxv4i1 to nxv8i1 and combine them using uzp1 and then cast to svbool.
+; Icmp zeros the lanes so the uzp'd result also has zeroed lanes. There should
+; be no need to manually zero the lanes.
+define <vscale x 16 x i1> @uzp1_to_svbool(<vscale x 4 x i32> %v0, <vscale x 4 x i32> %v1, <vscale x 4 x i32> %x) {
+; CHECK-LABEL: uzp1_to_svbool:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ptrue p0.s
+; CHECK-NEXT:    cmphi p1.s, p0/z, z2.s, z0.s
+; CHECK-NEXT:    cmphi p0.s, p0/z, z2.s, z1.s
+; CHECK-NEXT:    uzp1 p0.h, p1.h, p0.h
+; CHECK-NEXT:    ret
+  %cmp0 = icmp ult <vscale x 4 x i32> %v0, %x
+  %cmp1 = icmp ult <vscale x 4 x i32> %v1, %x
+  %1 = tail call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv4i1(<vscale x 4 x i1> %cmp0)
+  %2 = tail call <vscale x 8 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv8i1(<vscale x 16 x i1> %1)
+  %3 = tail call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv4i1(<vscale x 4 x i1> %cmp1)
+  %4 = tail call <vscale x 8 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv8i1(<vscale x 16 x i1> %3)
+  %uz1 = tail call <vscale x 8 x i1> @llvm.aarch64.sve.uzp1.nxv8i1(<vscale x 8 x i1> %2, <vscale x 8 x i1> %4)
+  %uz1_svbool = tail call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv8i1(<vscale x 8 x i1> %uz1)
+  ret <vscale x 16 x i1> %uz1_svbool
+}
+
+; Negative test. cmp0 has 8 lanes so we need to zero.
+define <vscale x 16 x i1> @uzp1_to_svbool_neg(<vscale x 8 x i16> %v0, <vscale x 4 x i32> %v1, <vscale x 8 x i16> %x, <vscale x 4 x i32> %y) {
+; CHECK-LABEL: uzp1_to_svbool_neg:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ptrue p0.h
+; CHECK-NEXT:    ptrue p1.s
+; CHECK-NEXT:    cmphi p0.h, p0/z, z2.h, z0.h
+; CHECK-NEXT:    cmphi p2.s, p1/z, z3.s, z1.s
+; CHECK-NEXT:    uzp1 p0.s, p0.s, p2.s
+; CHECK-NEXT:    and p0.b, p0/z, p0.b, p1.b
+; CHECK-NEXT:    ret
+  %cmp0 = icmp ult <vscale x 8 x i16> %v0, %x
+  %cmp1 = icmp ult <vscale x 4 x i32> %v1, %y
+  %1 = tail call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv8i1(<vscale x 8 x i1> %cmp0)
+  %2 = tail call <vscale x 4 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv4i1(<vscale x 16 x i1> %1)
+  %uz1 = tail call <vscale x 4 x i1> @llvm.aarch64.sve.uzp1.nxv4i1(<vscale x 4 x i1> %2, <vscale x 4 x i1> %cmp1)
+  %uz1_svbool = tail call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv4i1(<vscale x 4 x i1> %uz1)
+  ret <vscale x 16 x i1> %uz1_svbool
+}
+
+; Test with chained reinterprets.
+define <vscale x 16 x i1> @chainedReinterpted_uzp1_to_svbool(<vscale x 2 x i64> %v0, <vscale x 2 x i64> %v1, <vscale x 2 x i64> %x) {
+; CHECK-LABEL: chainedReinterpted_uzp1_to_svbool:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ptrue p0.d
+; CHECK-NEXT:    cmphi p1.d, p0/z, z2.d, z0.d
+; CHECK-NEXT:    cmphi p0.d, p0/z, z2.d, z1.d
+; CHECK-NEXT:    uzp1 p0.h, p1.h, p0.h
+; CHECK-NEXT:    ret
+  %cmp0 = icmp ult <vscale x 2 x i64> %v0, %x
+  %cmp1 = icmp ult <vscale x 2 x i64> %v1, %x
+  %1 = tail call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv2i1(<vscale x 2 x i1> %cmp0)
+  %2 = tail call <vscale x 4 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv4i1(<vscale x 16 x i1> %1)
+  %3 = tail call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv2i1(<vscale x 2 x i1> %cmp1)
+  %4 = tail call <vscale x 4 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv4i1(<vscale x 16 x i1> %3)
+  %5 = tail call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv4i1(<vscale x 4 x i1> %2)
+  %6 = tail call <vscale x 8 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv8i1(<vscale x 16 x i1> %5)
+  %7 = tail call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv4i1(<vscale x 4 x i1> %4)
+  %8 = tail call <vscale x 8 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv8i1(<vscale x 16 x i1> %7)
+  %uz1 = tail call <vscale x 8 x i1> @llvm.aarch64.sve.uzp1.nxv8i1(<vscale x 8 x i1> %6, <vscale x 8 x i1> %8)
+  %uz1_svbool = tail call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv8i1(<vscale x 8 x i1> %uz1)
+  ret <vscale x 16 x i1> %uz1_svbool
+}
+
+; Test with chained uzp1s.
+define <vscale x 16 x i1> @chainedUzp1_to_svbool(<vscale x 4 x i32> %v0, <vscale x 4 x i32> %v1, <vscale x 4 x i32> %v2, <vscale x 4 x i32> %v3, <vscale x 4 x i32> %x) {
+; CHECK-LABEL: chainedUzp1_to_svbool:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ptrue p0.s
+; CHECK-NEXT:    cmphi p1.s, p0/z, z4.s, z0.s
+; CHECK-NEXT:    cmphi p2.s, p0/z, z4.s, z1.s
+; CHECK-NEXT:    cmphi p3.s, p0/z, z4.s, z2.s
+; CHECK-NEXT:    cmphi p0.s, p0/z, z4.s, z3.s
+; CHECK-NEXT:    uzp1 p1.h, p1.h, p2.h
+; CHECK-NEXT:    uzp1 p0.h, p3.h, p0.h
+; CHECK-NEXT:    uzp1 p0.b, p1.b, p0.b
+; CHECK-NEXT:    ret
+  %cmp0 = icmp ult <vscale x 4 x i32> %v0, %x
+  %cmp1 = icmp ult <vscale x 4 x i32> %v1, %x
+  %cmp2 = icmp ult <vscale x 4 x i32> %v2, %x
+  %cmp3 = icmp ult <vscale x 4 x i32> %v3, %x
+  %1 = tail call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv4i1(<vscale x 4 x i1> %cmp0)
+  %2 = tail call <vscale x 8 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv8i1(<vscale x 16 x i1> %1)
+  %3 = tail call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv4i1(<vscale x 4 x i1> %cmp1)
+  %4 = tail call <vscale x 8 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv8i1(<vscale x 16 x i1> %3)
+  %uz1_1 = tail call <vscale x 8 x i1> @llvm.aarch64.sve.uzp1.nxv8i1(<vscale x 8 x i1> %2, <vscale x 8 x i1> %4)
+  %5 = tail call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv4i1(<vscale x 4 x i1> %cmp2)
+  %6 = tail call <vscale x 8 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv8i1(<vscale x 16 x i1> %5)
+  %7 = tail call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv4i1(<vscale x 4 x i1> %cmp3)
+  %8 = tail call <vscale x 8 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv8i1(<vscale x 16 x i1> %7)
+  %uz1_2 = tail call <vscale x 8 x i1> @llvm.aarch64.sve.uzp1.nxv8i1(<vscale x 8 x i1> %6, <vscale x 8 x i1> %8)
+  %9 = tail call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv8i1(<vscale x 8 x i1> %uz1_1)
+  %10 = tail call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv8i1(<vscale x 8 x i1> %uz1_2)
+  %uz3 = tail call <vscale x 16 x i1> @llvm.aarch64.sve.uzp1.nxv16i1(<vscale x 16 x i1> %9, <vscale x 16 x i1> %10)
+  ret <vscale x 16 x i1> %uz3
+}
+
 declare <vscale x 8 x i1> @llvm.aarch64.sve.ptrue.nxv8i1(i32 immarg)
 declare <vscale x 16 x i1> @llvm.aarch64.sve.ptrue.nxv16i1(i32 immarg)
 declare <vscale x 8 x i1> @llvm.aarch64.sve.cmpgt.nxv8i16(<vscale x 8 x i1>, <vscale x 8 x i16>, <vscale x 8 x i16>)
 
+declare <vscale x 4 x i1> @llvm.aarch64.sve.uzp1.nxv4i1(<vscale x 4 x i1>, <vscale x 4 x i1>)
+declare <vscale x 8 x i1> @llvm.aarch64.sve.uzp1.nxv8i1(<vscale x 8 x i1>, <vscale x 8 x i1>)
+declare <vscale x 16 x i1> @llvm.aarch64.sve.uzp1.nxv16i1(<vscale x 16 x i1>, <vscale x 16 x i1>)
+
 declare <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv16i1(<vscale x 16 x i1>)
 declare <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv8i1(<vscale x 8 x i1>)
 declare <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv4i1(<vscale x 4 x i1>)

Change-Id: Ieb9146ed979c4c04d876106786628a54971cfb15
…isZeroingInactiveLanes`

See the precommitted test diff for the codegen changes.

Change-Id: If82e00e5fdfcb268c7768d9b51e74315d8c193c4
@UsmanNadeem
Contributor Author

Ping


    case Intrinsic::aarch64_sve_uzp1:
      return isZeroingInactiveLanes(Op.getOperand(1)) &&
             isZeroingInactiveLanes(Op.getOperand(2));
Collaborator

Looking at the reference manual for uzp1, it looks like the instruction inherently zeros the unused lanes: the iteration changes depending on the specified type. So unless I'm missing something, looking at the operands doesn't seem useful.

The way this check works seems inherently fragile to me: we're assuming the SVE intrinsics in question are getting lowered to precisely the corresponding instructions, and that we don't do any optimizations that might affect the bits that aren't part of the type. But that's not new with this patch.

Contributor Author

https://reviews.llvm.org/D141469#4057420 has some comments by @paulwalker-arm about how the actual "block" for element type d/s/h/b is copied unmodified, so there may be some bits in there that don't belong to the predicate type.

I'll look into the manual as well.

Collaborator

Oh, I see, yes, you're right.

Collaborator

I'm not so sure this is the correct place for such code, because the function relates to zeroing through construction rather than propagation. It is a bit of a kludge, but I find comfort in knowing it only handles a specific set of intrinsics that we know have no special handling. Once you start recursing down the DAG I start to feel less comfortable.

That said, aarch64_sve_uzp1 is also a weird intrinsic because for predicates only the nxv16i1 case is tied by design to a specific instruction (i.e. it's not one of the intrinsics where, at least as of today, we've attached any meaning to the invisible lanes). I only maintained the support for other predicate types when creating https://reviews.llvm.org/D142065 because it seemed useful to maintain it as a way to logically interleave two vectors. But we have dedicated target neutral intrinsics for that now.

So I guess my question is: in what circumstance do you need to support to_svbool(sve.uzp1(val))? I wasn't expecting to see this from ACLE code, for example. I'm not against the patch, but I would like to understand the rationale more.

Contributor Author

We have some code, implemented using intrinsics, that is similar to the minimal C example below. The IR equivalent would be chainedUzp1_to_svbool in the tests: the result of each icmp is reinterpreted and combined using uzp1 and to/from svbool until we get a vscale x 16 x i1.

If this function is not an appropriate place for this code, would it make sense to do it in instcombine instead, by simplifying this kind of pattern to a concat using the llvm.vector.insert intrinsic?

#include <stdint.h>

void test(uint8_t *x, uint8_t *y, uint32_t v, int n) {
  for (int i = 0; i < n; i++) {
    x[i] = y[i] == v;
  }
}
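
For illustration, a rough sketch of what such a combine could produce for the uzp1_to_svbool test above (the function name and use of poison are just for the example): llvm.vector.insert concatenates the two nxv4i1 halves directly, so the round trip through svbool and the nxv8i1 uzp1 go away and no lane zeroing is needed.

; Hypothetical simplified form of @uzp1_to_svbool after such a combine (sketch).
define <vscale x 16 x i1> @uzp1_to_svbool_concat(<vscale x 4 x i32> %v0, <vscale x 4 x i32> %v1, <vscale x 4 x i32> %x) {
  %cmp0 = icmp ult <vscale x 4 x i32> %v0, %x
  %cmp1 = icmp ult <vscale x 4 x i32> %v1, %x
  ; %cmp0 goes into the low half and %cmp1 into the high half of an nxv8i1.
  %lo  = call <vscale x 8 x i1> @llvm.vector.insert.nxv8i1.nxv4i1(<vscale x 8 x i1> poison, <vscale x 4 x i1> %cmp0, i64 0)
  %cat = call <vscale x 8 x i1> @llvm.vector.insert.nxv8i1.nxv4i1(<vscale x 8 x i1> %lo, <vscale x 4 x i1> %cmp1, i64 4)
  ; Widening the concatenated predicate to svbool leaves the extra lanes false.
  %res = call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv8i1(<vscale x 8 x i1> %cat)
  ret <vscale x 16 x i1> %res
}

declare <vscale x 8 x i1> @llvm.vector.insert.nxv8i1.nxv4i1(<vscale x 8 x i1>, <vscale x 4 x i1>, i64)
declare <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv8i1(<vscale x 8 x i1>)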

Contributor Author

Posted an instcombine version: #81069

Collaborator

I see, thanks for the info. I certainly prefer the idea of having this as an instcombine, so I'll take a look at the other PR. Looking at the IR, I do wonder though whether you're just emitting the wrong intrinsics: rather than using sve.uzp1 for everything, you could use the larger-element variants (e.g. sve.uzp1.b16) that operate on vscale x 16 x i1 and thus remove the need for the sve.convert.from.svbool intrinsics.
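
For concreteness, a sketch of what that alternative might look like for the uzp1_to_svbool case, assuming the llvm.aarch64.sve.uzp1.b16 intrinsic from https://reviews.llvm.org/D142065 is used (the function name is hypothetical); both operands stay <vscale x 16 x i1>, so the sve.convert.from.svbool round trip disappears.

; Hypothetical b16 form of @uzp1_to_svbool (sketch): uzp1 at 16-bit granularity
; directly on the svbool values, so no convert.from.svbool is required.
define <vscale x 16 x i1> @uzp1_to_svbool_b16(<vscale x 4 x i32> %v0, <vscale x 4 x i32> %v1, <vscale x 4 x i32> %x) {
  %cmp0 = icmp ult <vscale x 4 x i32> %v0, %x
  %cmp1 = icmp ult <vscale x 4 x i32> %v1, %x
  %p0 = call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv4i1(<vscale x 4 x i1> %cmp0)
  %p1 = call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv4i1(<vscale x 4 x i1> %cmp1)
  %uz = call <vscale x 16 x i1> @llvm.aarch64.sve.uzp1.b16(<vscale x 16 x i1> %p0, <vscale x 16 x i1> %p1)
  ret <vscale x 16 x i1> %uz
}

declare <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv4i1(<vscale x 4 x i1>)
declare <vscale x 16 x i1> @llvm.aarch64.sve.uzp1.b16(<vscale x 16 x i1>, <vscale x 16 x i1>)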

UsmanNadeem added a commit to UsmanNadeem/llvm-project that referenced this pull request Feb 8, 2024
Concatenating two predicates using uzp1 after converting to double length
using sve.convert.to/from.svbool is optimized poorly in the backend,
resulting in additional `and` instructions to zero the lanes. See
llvm#78623

Combine this pattern to use `llvm.vector.insert` to concatenate and get
rid of convert to/from svbools.

Change-Id: Ieb38055f1f2304ead1ab2bead03ebd046b3bd36f
UsmanNadeem added a commit that referenced this pull request Feb 15, 2024
…rt (#81069)

Concatenating two predicates using uzp1 after converting to double length
using sve.convert.to/from.svbool is optimized poorly in the backend,
resulting in additional `and` instructions to zero the lanes. See
#78623

Combine this pattern to use `llvm.vector.insert` to concatenate and get
rid of convert to/from svbools.
@UsmanNadeem
Contributor Author

Abandoning this PR in favor of #81069.
