
[ValueTracking][X86] Compute KnownBits for phadd/phsub #92429

Merged: 2 commits merged into llvm:main on Jul 16, 2024

Conversation

mskamp (Contributor) commented May 16, 2024

Add KnownBits computations to ValueTracking and X86 DAG lowering.

These instructions add/subtract adjacent vector elements in their
operands. Example: phadd [X1, X2] [Y1, Y2] = [X1 + X2, Y1 + Y2].
This means that, in this example, we can compute the KnownBits of the
operation by computing the KnownBits of [X1, X2] + [X1, X2] and
[Y1, Y2] + [Y1, Y2] and intersecting the results. This approach
also generalizes to all x86 vector types.
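
As a concrete illustration (a hypothetical example, not taken from the patch itself): if every element of both operands is known to have only its low two bits set, i.e. values in [0, 3], then every element of the phadd result lies in [0, 6], so all bits above bit 2 are known zero, and masking the result with -8 folds to a constant zero vector. The tests below exercise exactly this pattern.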

There are also the operations phadd.sw and phsub.sw, which perform
saturating addition/subtraction. Use sadd_sat and ssub_sat to compute
the KnownBits of these operations.

Also adjust the existing test case pr53247.ll because it can be
transformed to a constant using the new KnownBits computation.

Fixes #82516.

mskamp requested a review from nikic as a code owner on May 16, 2024 at 16:53

llvmbot (Collaborator) commented May 16, 2024

@llvm/pr-subscribers-backend-x86
@llvm/pr-subscribers-llvm-analysis

Author: mskamp

Changes: as in the PR description above.

Patch is 26.99 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/92429.diff

4 Files Affected:

  • (modified) llvm/lib/Analysis/ValueTracking.cpp (+48)
  • (modified) llvm/lib/Target/X86/X86ISelLowering.cpp (+73)
  • (added) llvm/test/Analysis/ValueTracking/knownbits-hadd-hsub.ll (+192)
  • (added) llvm/test/CodeGen/X86/knownbits-hadd-hsub.ll (+201)
diff --git a/llvm/lib/Analysis/ValueTracking.cpp b/llvm/lib/Analysis/ValueTracking.cpp
index 2fdbb6e3ef840..e33cbef61e8f7 100644
--- a/llvm/lib/Analysis/ValueTracking.cpp
+++ b/llvm/lib/Analysis/ValueTracking.cpp
@@ -1725,6 +1725,54 @@ static void computeKnownBitsFromOperator(const Operator *I,
       case Intrinsic::x86_sse42_crc32_64_64:
         Known.Zero.setBitsFrom(32);
         break;
+      case Intrinsic::x86_ssse3_phadd_d:
+      case Intrinsic::x86_ssse3_phadd_w:
+      case Intrinsic::x86_ssse3_phadd_d_128:
+      case Intrinsic::x86_ssse3_phadd_w_128:
+      case Intrinsic::x86_avx2_phadd_d:
+      case Intrinsic::x86_avx2_phadd_w: {
+        computeKnownBits(I->getOperand(0), DemandedElts, Known, Depth + 1, Q);
+        computeKnownBits(I->getOperand(1), DemandedElts, Known2, Depth + 1, Q);
+
+        Known = KnownBits::computeForAddSub(true, false, false, Known, Known)
+                    .intersectWith(KnownBits::computeForAddSub(
+                        true, false, false, Known2, Known2));
+        break;
+      }
+      case Intrinsic::x86_ssse3_phadd_sw:
+      case Intrinsic::x86_ssse3_phadd_sw_128:
+      case Intrinsic::x86_avx2_phadd_sw: {
+        computeKnownBits(I->getOperand(0), DemandedElts, Known, Depth + 1, Q);
+        computeKnownBits(I->getOperand(1), DemandedElts, Known2, Depth + 1, Q);
+
+        Known = KnownBits::sadd_sat(Known, Known)
+                    .intersectWith(KnownBits::sadd_sat(Known2, Known2));
+        break;
+      }
+      case Intrinsic::x86_ssse3_phsub_d:
+      case Intrinsic::x86_ssse3_phsub_w:
+      case Intrinsic::x86_ssse3_phsub_d_128:
+      case Intrinsic::x86_ssse3_phsub_w_128:
+      case Intrinsic::x86_avx2_phsub_d:
+      case Intrinsic::x86_avx2_phsub_w: {
+        computeKnownBits(I->getOperand(0), DemandedElts, Known, Depth + 1, Q);
+        computeKnownBits(I->getOperand(1), DemandedElts, Known2, Depth + 1, Q);
+
+        Known = KnownBits::computeForAddSub(false, false, false, Known, Known)
+                    .intersectWith(KnownBits::computeForAddSub(
+                        false, false, false, Known2, Known2));
+        break;
+      }
+      case Intrinsic::x86_ssse3_phsub_sw:
+      case Intrinsic::x86_ssse3_phsub_sw_128:
+      case Intrinsic::x86_avx2_phsub_sw: {
+        computeKnownBits(I->getOperand(0), DemandedElts, Known, Depth + 1, Q);
+        computeKnownBits(I->getOperand(1), DemandedElts, Known2, Depth + 1, Q);
+
+        Known = KnownBits::ssub_sat(Known, Known)
+                    .intersectWith(KnownBits::ssub_sat(Known2, Known2));
+        break;
+      }
       case Intrinsic::riscv_vsetvli:
       case Intrinsic::riscv_vsetvlimax: {
         bool HasAVL = II->getIntrinsicID() == Intrinsic::riscv_vsetvli;
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index ecc5b3b3bf840..c23df2c91f385 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -37262,6 +37262,27 @@ void X86TargetLowering::computeKnownBitsForTargetNode(const SDValue Op,
     }
     break;
   }
+  case X86ISD::HADD: {
+    Known = DAG.computeKnownBits(Op.getOperand(0), DemandedElts, Depth + 1);
+    KnownBits Known2 =
+        DAG.computeKnownBits(Op.getOperand(1), DemandedElts, Depth + 1);
+
+    Known = KnownBits::computeForAddSub(true, false, false, Known, Known)
+                .intersectWith(KnownBits::computeForAddSub(true, false, false,
+                                                           Known2, Known2));
+    break;
+  }
+  case X86ISD::HSUB: {
+    Known =
+        DAG.computeKnownBits(Op.getOperand(0), DemandedElts, Depth + 1);
+    KnownBits Known2 =
+        DAG.computeKnownBits(Op.getOperand(1), DemandedElts, Depth + 1);
+
+    Known = KnownBits::computeForAddSub(false, false, false, Known, Known)
+                .intersectWith(KnownBits::computeForAddSub(false, false, false,
+                                                           Known2, Known2));
+    break;
+  }
   case ISD::INTRINSIC_WO_CHAIN: {
     switch (Op->getConstantOperandVal(0)) {
     case Intrinsic::x86_sse2_psad_bw:
@@ -37276,6 +37297,58 @@ void X86TargetLowering::computeKnownBitsForTargetNode(const SDValue Op,
       computeKnownBitsForPSADBW(LHS, RHS, Known, DemandedElts, DAG, Depth);
       break;
     }
+    case Intrinsic::x86_ssse3_phadd_d:
+    case Intrinsic::x86_ssse3_phadd_w:
+    case Intrinsic::x86_ssse3_phadd_d_128:
+    case Intrinsic::x86_ssse3_phadd_w_128:
+    case Intrinsic::x86_avx2_phadd_d:
+    case Intrinsic::x86_avx2_phadd_w: {
+      Known = DAG.computeKnownBits(Op.getOperand(1), DemandedElts, Depth + 1);
+      KnownBits Known2 =
+          DAG.computeKnownBits(Op.getOperand(2), DemandedElts, Depth + 1);
+
+      Known = KnownBits::computeForAddSub(true, false, false, Known, Known)
+                  .intersectWith(KnownBits::computeForAddSub(true, false, false,
+                                                             Known2, Known2));
+      break;
+    }
+    case Intrinsic::x86_ssse3_phadd_sw:
+    case Intrinsic::x86_ssse3_phadd_sw_128:
+    case Intrinsic::x86_avx2_phadd_sw: {
+      Known = DAG.computeKnownBits(Op.getOperand(1), DemandedElts, Depth + 1);
+      KnownBits Known2 =
+          DAG.computeKnownBits(Op.getOperand(2), DemandedElts, Depth + 1);
+
+      Known = KnownBits::sadd_sat(Known, Known)
+                  .intersectWith(KnownBits::sadd_sat(Known2, Known2));
+      break;
+    }
+    case Intrinsic::x86_ssse3_phsub_d:
+    case Intrinsic::x86_ssse3_phsub_w:
+    case Intrinsic::x86_ssse3_phsub_d_128:
+    case Intrinsic::x86_ssse3_phsub_w_128:
+    case Intrinsic::x86_avx2_phsub_d:
+    case Intrinsic::x86_avx2_phsub_w: {
+      Known = DAG.computeKnownBits(Op.getOperand(1), DemandedElts, Depth + 1);
+      KnownBits Known2 =
+          DAG.computeKnownBits(Op.getOperand(2), DemandedElts, Depth + 1);
+
+      Known = KnownBits::computeForAddSub(false, false, false, Known, Known)
+                  .intersectWith(KnownBits::computeForAddSub(
+                      false, false, false, Known2, Known2));
+      break;
+    }
+    case Intrinsic::x86_ssse3_phsub_sw:
+    case Intrinsic::x86_ssse3_phsub_sw_128:
+    case Intrinsic::x86_avx2_phsub_sw: {
+      Known = DAG.computeKnownBits(Op.getOperand(1), DemandedElts, Depth + 1);
+      KnownBits Known2 =
+          DAG.computeKnownBits(Op.getOperand(2), DemandedElts, Depth + 1);
+
+      Known = KnownBits::ssub_sat(Known, Known)
+                  .intersectWith(KnownBits::ssub_sat(Known2, Known2));
+      break;
+    }
     }
     break;
   }
diff --git a/llvm/test/Analysis/ValueTracking/knownbits-hadd-hsub.ll b/llvm/test/Analysis/ValueTracking/knownbits-hadd-hsub.ll
new file mode 100644
index 0000000000000..443ab72ee54cb
--- /dev/null
+++ b/llvm/test/Analysis/ValueTracking/knownbits-hadd-hsub.ll
@@ -0,0 +1,192 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 4
+; RUN: opt -S -passes=instcombine < %s | FileCheck %s
+
+define <4 x i1> @hadd_and_eq_v4i32(<4 x i32> %x, <4 x i32> %y) {
+; CHECK-LABEL: define <4 x i1> @hadd_and_eq_v4i32(
+; CHECK-SAME: <4 x i32> [[X:%.*]], <4 x i32> [[Y:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    ret <4 x i1> zeroinitializer
+;
+entry:
+  %0 = and <4 x i32> %x, <i32 3, i32 3, i32 3, i32 3>
+  %1 = and <4 x i32> %y, <i32 3, i32 3, i32 3, i32 3>
+  %2 = tail call <4 x i32> @llvm.x86.ssse3.phadd.d.128(<4 x i32> %0, <4 x i32> %1)
+  %3 = and <4 x i32> %2, <i32 -8, i32 -8, i32 -8, i32 -8>
+  %ret = icmp eq <4 x i32> %3, <i32 3, i32 4, i32 5, i32 6>
+  ret <4 x i1> %ret
+}
+
+define <8 x i1> @hadd_and_eq_v8i16(<8 x i16> %x, <8 x i16> %y) {
+; CHECK-LABEL: define <8 x i1> @hadd_and_eq_v8i16(
+; CHECK-SAME: <8 x i16> [[X:%.*]], <8 x i16> [[Y:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    ret <8 x i1> <i1 true, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 true>
+;
+entry:
+  %0 = and <8 x i16> %x, <i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3>
+  %1 = and <8 x i16> %y, <i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3>
+  %2 = tail call <8 x i16> @llvm.x86.ssse3.phadd.w.128(<8 x i16> %0, <8 x i16> %1)
+  %3 = and <8 x i16> %2, <i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8>
+  %ret = icmp eq <8 x i16> %3, <i16 0, i16 1, i16 2, i16 3, i16 4, i16 5, i16 6, i16 0>
+  ret <8 x i1> %ret
+}
+
+define <8 x i1> @hadd_and_eq_v8i16_sat(<8 x i16> %x, <8 x i16> %y) {
+; CHECK-LABEL: define <8 x i1> @hadd_and_eq_v8i16_sat(
+; CHECK-SAME: <8 x i16> [[X:%.*]], <8 x i16> [[Y:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    ret <8 x i1> <i1 true, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 true>
+;
+entry:
+  %0 = and <8 x i16> %x, <i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3>
+  %1 = and <8 x i16> %y, <i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3>
+  %2 = tail call <8 x i16> @llvm.x86.ssse3.phadd.sw.128(<8 x i16> %0, <8 x i16> %1)
+  %3 = and <8 x i16> %2, <i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8>
+  %ret = icmp eq <8 x i16> %3, <i16 0, i16 1, i16 2, i16 3, i16 4, i16 5, i16 6, i16 0>
+  ret <8 x i1> %ret
+}
+
+define <8 x i1> @hadd_and_eq_v8i32(<8 x i32> %x, <8 x i32> %y) {
+; CHECK-LABEL: define <8 x i1> @hadd_and_eq_v8i32(
+; CHECK-SAME: <8 x i32> [[X:%.*]], <8 x i32> [[Y:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    ret <8 x i1> zeroinitializer
+;
+entry:
+  %0 = and <8 x i32> %x, <i32 3, i32 3, i32 3, i32 3, i32 3, i32 3, i32 3, i32 3>
+  %1 = and <8 x i32> %y, <i32 3, i32 3, i32 3, i32 3, i32 3, i32 3, i32 3, i32 3>
+  %2 = tail call <8 x i32> @llvm.x86.avx2.phadd.d(<8 x i32> %0, <8 x i32> %1)
+  %3 = and <8 x i32> %2, <i32 -8, i32 -8, i32 -8, i32 -8, i32 -8, i32 -8, i32 -8, i32 -8>
+  %ret = icmp eq <8 x i32> %3, <i32 3, i32 4, i32 5, i32 6, i32 3, i32 4, i32 5, i32 6>
+  ret <8 x i1> %ret
+}
+
+define <16 x i1> @hadd_and_eq_v16i16(<16 x i16> %x, <16 x i16> %y) {
+; CHECK-LABEL: define <16 x i1> @hadd_and_eq_v16i16(
+; CHECK-SAME: <16 x i16> [[X:%.*]], <16 x i16> [[Y:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    ret <16 x i1> <i1 true, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 true, i1 true, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 true>
+;
+entry:
+  %0 = and <16 x i16> %x, <i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3>
+  %1 = and <16 x i16> %y, <i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3>
+  %2 = tail call <16 x i16> @llvm.x86.avx2.phadd.w(<16 x i16> %0, <16 x i16> %1)
+  %3 = and <16 x i16> %2, <i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8>
+  %ret = icmp eq <16 x i16> %3, <i16 0, i16 1, i16 2, i16 3, i16 4, i16 5, i16 6, i16 0, i16 0, i16 1, i16 2, i16 3, i16 4, i16 5, i16 6, i16 0>
+  ret <16 x i1> %ret
+}
+
+define <16 x i1> @hadd_and_eq_v16i16_sat(<16 x i16> %x, <16 x i16> %y) {
+; CHECK-LABEL: define <16 x i1> @hadd_and_eq_v16i16_sat(
+; CHECK-SAME: <16 x i16> [[X:%.*]], <16 x i16> [[Y:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    ret <16 x i1> <i1 true, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 true, i1 true, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 true>
+;
+entry:
+  %0 = and <16 x i16> %x, <i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3>
+  %1 = and <16 x i16> %y, <i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3>
+  %2 = tail call <16 x i16> @llvm.x86.avx2.phadd.sw(<16 x i16> %0, <16 x i16> %1)
+  %3 = and <16 x i16> %2, <i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8>
+  %ret = icmp eq <16 x i16> %3, <i16 0, i16 1, i16 2, i16 3, i16 4, i16 5, i16 6, i16 0, i16 0, i16 1, i16 2, i16 3, i16 4, i16 5, i16 6, i16 0>
+  ret <16 x i1> %ret
+}
+
+define <4 x i1> @hsub_trunc_eq_v4i32(<4 x i32> %x, <4 x i32> %y) {
+; CHECK-LABEL: define <4 x i1> @hsub_trunc_eq_v4i32(
+; CHECK-SAME: <4 x i32> [[X:%.*]], <4 x i32> [[Y:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    ret <4 x i1> zeroinitializer
+;
+entry:
+  %0 = or <4 x i32> %x, <i32 65535, i32 65535, i32 65535, i32 65535>
+  %1 = or <4 x i32> %y, <i32 65535, i32 65535, i32 65535, i32 65535>
+  %2 = tail call <4 x i32> @llvm.x86.ssse3.phsub.d.128(<4 x i32> %0, <4 x i32> %1)
+  %conv = trunc <4 x i32> %2 to <4 x i16>
+  %ret = icmp eq <4 x i16> %conv, <i16 3, i16 4, i16 5, i16 6>
+  ret <4 x i1> %ret
+}
+
+define <8 x i1> @hsub_trunc_eq_v8i16(<8 x i16> %x, <8 x i16> %y) {
+; CHECK-LABEL: define <8 x i1> @hsub_trunc_eq_v8i16(
+; CHECK-SAME: <8 x i16> [[X:%.*]], <8 x i16> [[Y:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    ret <8 x i1> <i1 true, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 true>
+;
+entry:
+  %0 = or <8 x i16> %x, <i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255>
+  %1 = or <8 x i16> %y, <i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255>
+  %2 = tail call <8 x i16> @llvm.x86.ssse3.phsub.w.128(<8 x i16> %0, <8 x i16> %1)
+  %conv = trunc <8 x i16> %2 to <8 x i8>
+  %ret = icmp eq <8 x i8> %conv, <i8 0, i8 1, i8 2, i8 3, i8 4, i8 5, i8 6, i8 0>
+  ret <8 x i1> %ret
+}
+
+define <8 x i1> @hsub_and_eq_v8i16_sat(<8 x i16> %x, <8 x i16> %y) {
+; CHECK-LABEL: define <8 x i1> @hsub_and_eq_v8i16_sat(
+; CHECK-SAME: <8 x i16> [[X:%.*]], <8 x i16> [[Y:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[TMP0:%.*]] = or <8 x i16> [[X]], <i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7>
+; CHECK-NEXT:    [[TMP1:%.*]] = or <8 x i16> [[Y]], <i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7>
+; CHECK-NEXT:    [[TMP2:%.*]] = tail call <8 x i16> @llvm.x86.ssse3.phsub.sw.128(<8 x i16> [[TMP0]], <8 x i16> [[TMP1]])
+; CHECK-NEXT:    [[TMP3:%.*]] = and <8 x i16> [[TMP2]], <i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7>
+; CHECK-NEXT:    [[TMP4:%.*]] = icmp eq <8 x i16> [[TMP3]], zeroinitializer
+; CHECK-NEXT:    ret <8 x i1> [[TMP4]]
+;
+entry:
+  %0 = or <8 x i16> %x, <i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7>
+  %1 = or <8 x i16> %y, <i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7>
+  %2 = tail call <8 x i16> @llvm.x86.ssse3.phsub.sw.128(<8 x i16> %0, <8 x i16> %1)
+  %3 = and <8 x i16> %2, <i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7>
+  %4 = icmp eq <8 x i16> %3, zeroinitializer
+  ret <8 x i1> %4
+}
+
+define <8 x i1> @hsub_trunc_eq_v8i32(<8 x i32> %x, <8 x i32> %y) {
+; CHECK-LABEL: define <8 x i1> @hsub_trunc_eq_v8i32(
+; CHECK-SAME: <8 x i32> [[X:%.*]], <8 x i32> [[Y:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    ret <8 x i1> zeroinitializer
+;
+entry:
+  %0 = or <8 x i32> %x, <i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535>
+  %1 = or <8 x i32> %y, <i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535>
+  %2 = tail call <8 x i32> @llvm.x86.avx2.phsub.d(<8 x i32> %0, <8 x i32> %1)
+  %conv = trunc <8 x i32> %2 to <8 x i16>
+  %ret = icmp eq <8 x i16> %conv, <i16 3, i16 4, i16 5, i16 6, i16 3, i16 4, i16 5, i16 6>
+  ret <8 x i1> %ret
+}
+
+define <16 x i1> @hsub_trunc_eq_v16i16(<16 x i16> %x, <16 x i16> %y) {
+; CHECK-LABEL: define <16 x i1> @hsub_trunc_eq_v16i16(
+; CHECK-SAME: <16 x i16> [[X:%.*]], <16 x i16> [[Y:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    ret <16 x i1> <i1 true, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 true, i1 true, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 true>
+;
+entry:
+  %0 = or <16 x i16> %x, <i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255>
+  %1 = or <16 x i16> %y, <i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255>
+  %2 = tail call <16 x i16> @llvm.x86.avx2.phsub.w(<16 x i16> %0, <16 x i16> %1)
+  %conv = trunc <16 x i16> %2 to <16 x i8>
+  %ret = icmp eq <16 x i8> %conv, <i8 0, i8 1, i8 2, i8 3, i8 4, i8 5, i8 6, i8 0, i8 0, i8 1, i8 2, i8 3, i8 4, i8 5, i8 6, i8 0>
+  ret <16 x i1> %ret
+}
+
+define <16 x i1> @hsub_and_eq_v16i16_sat(<16 x i16> %x, <16 x i16> %y) {
+; CHECK-LABEL: define <16 x i1> @hsub_and_eq_v16i16_sat(
+; CHECK-SAME: <16 x i16> [[X:%.*]], <16 x i16> [[Y:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[TMP0:%.*]] = or <16 x i16> [[X]], <i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7>
+; CHECK-NEXT:    [[TMP1:%.*]] = or <16 x i16> [[Y]], <i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7>
+; CHECK-NEXT:    [[TMP2:%.*]] = tail call <16 x i16> @llvm.x86.avx2.phsub.sw(<16 x i16> [[TMP0]], <16 x i16> [[TMP1]])
+; CHECK-NEXT:    [[TMP3:%.*]] = and <16 x i16> [[TMP2]], <i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7>
+; CHECK-NEXT:    [[TMP4:%.*]] = icmp eq <16 x i16> [[TMP3]], zeroinitializer
+; CHECK-NEXT:    ret <16 x i1> [[TMP4]]
+;
+entry:
+  %0 = or <16 x i16> %x, <i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7>
+  %1 = or <16 x i16> %y, <i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7>
+  %2 = tail call <16 x i16> @llvm.x86.avx2.phsub.sw(<16 x i16> %0, <16 x i16> %1)
+  %3 = and <16 x i16> %2, <i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7>
+  %4 = icmp eq <16 x i16> %3, zeroinitializer
+  ret <16 x i1> %4
+}
diff --git a/llvm/test/CodeGen/X86/knownbits-hadd-hsub.ll b/llvm/test/CodeGen/X86/knownbits-hadd-hsub.ll
new file mode 100644
index 0000000000000..eba7b9843d991
--- /dev/null
+++ b/llvm/test/CodeGen/X86/knownbits-hadd-hsub.ll
@@ -0,0 +1,201 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 4
+; RUN: llc < %s -mtriple=x86_64-unknown -mattr=+avx2 | FileCheck %s
+
+define <4 x i16> @hadd_trunc_v4i32(<4 x i32> %x, <4 x i32> %y) {
+; CHECK-LABEL: hadd_trunc_v4i32:
+; CHECK:       # %bb.0: # %entry
+; CHECK-NEXT:    vpbroadcastd {{.*#+}} xmm2 = [3,3,3,3]
+; CHECK-NEXT:    vpand %xmm2, %xmm0, %xmm0
+; CHECK-NEXT:    vpand %xmm2, %xmm1, %xmm1
+; CHECK-NEXT:    vphaddd %xmm1, %xmm0, %xmm0
+; CHECK-NEXT:    vpackusdw %xmm0, %xmm0, %xmm0
+; CHECK-NEXT:    retq
+entry:
+  %0 = and <4 x i32> %x, <i32 3, i32 3, i32 3, i32 3>
+  %1 = and <4 x i32> %y, <i32 3, i32 3, i32 3, i32 3>
+  %2 = tail call <4 x i32> @llvm.x86.ssse3.phadd.d.128(<4 x i32> %0, <4 x i32> %1)
+  %conv = trunc <4 x i32> %2 to <4 x i16>
+  ret <4 x i16> %conv
+}
+
+define <8 x i8> @hadd_trunc_v8i16(<8 x i16> %x, <8 x i16> %y) {
+; CHECK-LABEL: hadd_trunc_v8i16:
+; CHECK:       # %bb.0: # %entry
+; CHECK-NEXT:    vpbroadcastw {{.*#+}} xmm2 = [3,3,3,3,3,3,3,3]
+; CHECK-NEXT:    vpand %xmm2, %xmm0, %xmm0
+; CHECK-NEXT:    vpand %xmm2, %xmm1, %xmm1
+; CHECK-NEXT:    vphaddw %xmm1, %xmm0, %xmm0
+; CHECK-NEXT:    vpackuswb %xmm0, %xmm0, %xmm0
+; CHECK-NEXT:    retq
+entry:
+  %0 = and <8 x i16> %x, <i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3>
+  %1 = and <8 x i16> %y, <i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3>
+  %2 = tail call <8 x i16> @llvm.x86.ssse3.phadd.w.128(<8 x i16> %0, <8 x i16> %1)
+  %conv = trunc <8 x i16> %2 to <8 x i8>
+  ret <8 x i8> %conv
+}
+
+define <8 x i8> @hadd_trunc_v8i16_sat(<8 x i16> %x, <8 x i16> %y) {
+; CHECK-LABEL: hadd_trunc_v8i16_sat:
+; CHECK:       # %bb.0: # %entry
+; CHECK-NEXT:    vpbroadcastw {{.*#+}} xmm2 = [3,3,3,3,3,3,3,3]
+; CHECK-NEXT:    vpand %xmm2, %xmm0, %xmm0
+; CHECK-NEXT:    vpand %xmm2, %xmm1, %xmm1
+; CHECK-NEXT:    vphaddsw %xmm1, %xmm0, %xmm0
+; CHECK-NEXT:    vpackuswb %xmm0, %xmm0, %xmm0
+; CHECK-NEXT:    retq
+entry:
+  %0 = and <8 x i16> %x, <i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3>
+  %1 = and <8 x i16> %y, <i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3>
+  %2 = tail call <8 x i16> @llvm.x86.ssse3.phadd.sw.128(<8 x i16> %0, <8 x i...
[truncated]

case Intrinsic::x86_avx2_phadd_d:
case Intrinsic::x86_avx2_phadd_w: {
computeKnownBits(I->getOperand(0), DemandedElts, Known, Depth + 1, Q);
computeKnownBits(I->getOperand(1), DemandedElts, Known2, Depth + 1, Q);
Contributor:

I'm not sure DemandedElts is right to propagate here. I.e., in your example of
[X1, X2], [Y1, Y2] -> [X1 + X2, Y1 + Y2]: if you only demand the first element, you still need both X1 and X2 from operand 0; you just don't need Y1 or Y2.

The rest of the implementations appear to need the same change.

Can you add some tests where the result is then shuffled/extracted from, to test with a DemandedElts mask that is not all ones?
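
For intuition, a minimal sketch of that mapping (a hypothetical standalone helper for a single 128-bit lane, not the exact LLVM API; it ignores the per-128-bit-lane splitting of 256-bit AVX2 ops):

    #include "llvm/ADT/APInt.h"
    using llvm::APInt;

    // For a horizontal op such as <4 x i32> phadd, result elements [0, Half)
    // pair up adjacent elements of the LHS and [Half, NumElts) pair up
    // adjacent elements of the RHS. Demanding result element I therefore
    // demands source elements 2*(I % Half) and 2*(I % Half) + 1 of one side.
    static void demandedEltsForHorizOp(unsigned NumElts,
                                       const APInt &DemandedOut,
                                       APInt &DemandedLHS, APInt &DemandedRHS) {
      DemandedLHS = DemandedRHS = APInt::getZero(NumElts);
      unsigned Half = NumElts / 2;
      for (unsigned I = 0; I != NumElts; ++I) {
        if (!DemandedOut[I])
          continue;
        APInt &Src = I < Half ? DemandedLHS : DemandedRHS;
        unsigned Base = (I % Half) * 2; // first of the two adjacent elements
        Src.setBit(Base);
        Src.setBit(Base + 1);
      }
    }

With this mapping, demanding only the first element of a <4 x i32> phadd result demands elements 0 and 1 of the LHS and nothing of the RHS.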

Contributor (Author):

Apparently, I didn't quite understand the purpose of DemandedElts. I've adjusted the code and added some tests.

RKSimon (Collaborator) left a comment:

It might be better to focus on the SelectionDAG variant first to get this right, but it's up to you.

@@ -37262,6 +37262,27 @@ void X86TargetLowering::computeKnownBitsForTargetNode(const SDValue Op,
}
break;
}
case X86ISD::HADD: {
Known = DAG.computeKnownBits(Op.getOperand(0), DemandedElts, Depth + 1);
Collaborator:

You can't use the result DemandedElts for the source operands; you'll need something like:

    APInt DemandedLHS, DemandedRHS;
    getHorizDemandedElts(VT, DemandedElts, DemandedLHS, DemandedRHS);

You can avoid the computeKnownBits calls for cases where DemandedLHS or DemandedRHS is zero.
Plus, you can probably get more refined known bits by getting the known bits of the odd/even elements separately, although that could mean 4 computeKnownBits calls.
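
A rough sketch of that odd/even refinement for the LHS half of an HADD (the RHS half is analogous; this is an illustrative simplification that demands all element pairs rather than only the demanded ones, and the variable names are hypothetical):

    // Query even and odd source elements with separate demanded masks so the
    // two addends of each horizontal sum get their own, tighter known bits
    // instead of one mask that mixes both.
    APInt DemandedEven = APInt::getZero(NumElts);
    APInt DemandedOdd = APInt::getZero(NumElts);
    for (unsigned I = 0; I + 1 < NumElts; I += 2) {
      DemandedEven.setBit(I);
      DemandedOdd.setBit(I + 1);
    }
    KnownBits KnownEven =
        DAG.computeKnownBits(Op.getOperand(0), DemandedEven, Depth + 1);
    KnownBits KnownOdd =
        DAG.computeKnownBits(Op.getOperand(0), DemandedOdd, Depth + 1);
    // Each LHS-derived result element is (even element) + (odd element).
    Known = KnownBits::computeForAddSub(/*Add=*/true, /*NSW=*/false,
                                        /*NUW=*/false, KnownEven, KnownOdd);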

Contributor (Author):

I've added the implementation with 4 computeKnownBits calls in the hope that it does not slow down the analysis too much.

llvm/lib/Target/X86/X86ISelLowering.cpp (outdated review thread, resolved)
case Intrinsic::x86_ssse3_phadd_sw_128:
case Intrinsic::x86_avx2_phadd_sw: {
computeKnownBits(I->getOperand(0), DemandedElts, Known, Depth + 1, Q);
computeKnownBits(I->getOperand(1), DemandedElts, Known2, Depth + 1, Q);
Collaborator:

wow, saturated hadd, haven't seen that for a while :)

mskamp force-pushed the feature_hadd_hsub_knownbits branch from 8aa2cec to 58d73b1 on May 19, 2024
/// \param DemandedEltsOp the demanded elements mask for the operation
/// \param DemandedEltsLHS the demanded elements mask for the left operand
/// \param DemandedEltsRHS the demanded elements mask for the right operand
void getHorizontalDemandedElts(const APInt &DemandedEltsOp,
Contributor:

I think this function name should make it clear you are only getting half the elements.

Contributor (Author):

Good point. Done. In the meantime, I've noticed that there is already another function that does almost the same. Therefore, I've consolidated them.


std::array<KnownBits, 2> KnownLHS;
for (unsigned Index = 0; Index < KnownLHS.size(); ++Index) {
if (!DemandedEltsLHS.isZero()) {
Contributor:

I think you must never be hitting this case; otherwise, I think we will run into issues with uninitialized use of KnownLHS. Likewise below for the RHS.

Contributor (Author):

I'm not sure I understand. In both cases, KnownLHS[Index] is initialized. Shouldn't both cases occur in the hadd_extract test cases? Shouldn't they fail if anything were uninitialized? At least, valgrind --tool=memcheck doesn't report any error when executing the tests.

Contributor:

Yeah, you're right; I misread the code.

mskamp force-pushed the feature_hadd_hsub_knownbits branch from 58d73b1 to c29661e on May 20, 2024
mskamp force-pushed the feature_hadd_hsub_knownbits branch from c29661e to ff64740 on June 2, 2024
mskamp force-pushed the feature_hadd_hsub_knownbits branch from ff64740 to 072a9c0 on June 4, 2024
mskamp force-pushed the feature_hadd_hsub_knownbits branch from 072a9c0 to d680eab on June 7, 2024
mskamp force-pushed the feature_hadd_hsub_knownbits branch from d680eab to 8c7dc00 on June 8, 2024
goldsteinn (Contributor):

This basically LGTM. Please wait for some additional approvals before pushing.

llvm/lib/Analysis/ValueTracking.cpp (outdated review thread, resolved)
llvm/lib/Analysis/ValueTracking.cpp (outdated review thread, resolved)
llvm/lib/Analysis/VectorUtils.cpp (outdated review thread, resolved)
[](const KnownBits &KnownLHS, const KnownBits &KnownRHS) {
return KnownBits::ssub_sat(KnownLHS, KnownRHS);
});
break;
Collaborator:

I'd prefer it if, in the DAG, we just handled the X86ISD::HADD/HSUB nodes and not the intrinsics: we try to do as little as possible with MMX types in the DAG, and the saturation instructions are very rare. It's much more likely that we just need to determine known bits for X86ISD::HADD/HSUB nodes we've created in the DAG.

Contributor (Author):

The implementation handled the intrinsics because otherwise some test cases would not fold. For example, the test case that truncates <4 x i32> to <4 x i16> does not fold when only the X86ISD::HADD/HSUB nodes are handled. In contrast, tests that truncate <8 x i32> to <8 x i16> work fine this way.

After looking at this problem again, I believe that the code that replaces the shuffle with a pack instruction might be too strict. This is probably also the case in this example: https://godbolt.org/z/KW5b6r7xW

Anyway, I've removed the handling of the intrinsics and adapted the test cases such that they still fold with only the X86ISD::HADD/HSUB nodes.

llvm/test/Analysis/ValueTracking/knownbits-hadd-hsub.ll (outdated review thread, resolved)
mskamp force-pushed the feature_hadd_hsub_knownbits branch from 8c7dc00 to e8ce125 on June 15, 2024
RKSimon (Collaborator) left a comment:

Sorry for this falling off my radar. A few minor comments, but otherwise almost ready to go.

llvm/lib/Target/X86/X86ISelLowering.cpp (outdated review thread, resolved)
llvm/lib/Target/X86/X86ISelLowering.cpp (outdated review thread, resolved)
llvm/lib/Target/X86/X86ISelLowering.cpp (outdated review thread, resolved)
mskamp force-pushed the feature_hadd_hsub_knownbits branch from e8ce125 to b64f2ea on July 12, 2024

RKSimon commented Jul 12, 2024

@mskamp Please can you rebase to fix the merge conflicts?

RKSimon (Collaborator) left a comment:

LGTM

Commit: Add KnownBits computations to ValueTracking and X86 DAG lowering (the full commit message restates the PR description above; Fixes llvm#82516)
mskamp force-pushed the feature_hadd_hsub_knownbits branch from b64f2ea to 35f153c on July 15, 2024
RKSimon merged commit b22fa90 into llvm:main on Jul 16, 2024 (7 checks passed)

sayhaan pushed a commit to sayhaan/llvm-project that referenced this pull request on Jul 16, 2024 (the commit summary restates the PR description; Differential Revision: https://phabricator.intern.facebook.com/D59822457)
yuxuanchen1997 pushed a commit that referenced this pull request on Jul 25, 2024 (the commit summary restates the PR description; Differential Revision: https://phabricator.intern.facebook.com/D60251514)
Successfully merging this pull request may close: [X86] Improve KnownBits for X86ISD::HADD/HSUB nodes (#82516)

4 participants