[ValueTracking][X86] Compute KnownBits for phadd/phsub #92429
Conversation
Thank you for submitting a Pull Request (PR) to the LLVM Project! This PR will be automatically labeled and the relevant teams will be notified. If you wish to, you can add reviewers by using the "Reviewers" section on this page. If this is not working for you, it is probably because you do not have write permissions for the repository; in which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username. If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment "Ping". If you have further questions, they may be answered by the LLVM GitHub User Guide. You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.
@llvm/pr-subscribers-backend-x86 @llvm/pr-subscribers-llvm-analysis

Author: None (mskamp)

Changes

Add KnownBits computations to ValueTracking and X86 DAG lowering. These instructions add/subtract adjacent vector elements in their operands. Example: phadd [X1, X2] [Y1, Y2] = [X1 + X2, Y1 + Y2]. This means that, in this example, we can compute the KnownBits of the operation by computing the KnownBits of [X1, X2] + [X1, X2] and [Y1, Y2] + [Y1, Y2] and intersecting the results. This approach also generalizes to all x86 vector types.

There are also the operations phadd.sw and phsub.sw, which perform saturating addition/subtraction. Use sadd_sat and ssub_sat to compute the KnownBits of these operations.

Fixes #82516.

Patch is 26.99 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/92429.diff

4 Files Affected:
diff --git a/llvm/lib/Analysis/ValueTracking.cpp b/llvm/lib/Analysis/ValueTracking.cpp
index 2fdbb6e3ef840..e33cbef61e8f7 100644
--- a/llvm/lib/Analysis/ValueTracking.cpp
+++ b/llvm/lib/Analysis/ValueTracking.cpp
@@ -1725,6 +1725,54 @@ static void computeKnownBitsFromOperator(const Operator *I,
case Intrinsic::x86_sse42_crc32_64_64:
Known.Zero.setBitsFrom(32);
break;
+ case Intrinsic::x86_ssse3_phadd_d:
+ case Intrinsic::x86_ssse3_phadd_w:
+ case Intrinsic::x86_ssse3_phadd_d_128:
+ case Intrinsic::x86_ssse3_phadd_w_128:
+ case Intrinsic::x86_avx2_phadd_d:
+ case Intrinsic::x86_avx2_phadd_w: {
+ computeKnownBits(I->getOperand(0), DemandedElts, Known, Depth + 1, Q);
+ computeKnownBits(I->getOperand(1), DemandedElts, Known2, Depth + 1, Q);
+
+ Known = KnownBits::computeForAddSub(true, false, false, Known, Known)
+ .intersectWith(KnownBits::computeForAddSub(
+ true, false, false, Known2, Known2));
+ break;
+ }
+ case Intrinsic::x86_ssse3_phadd_sw:
+ case Intrinsic::x86_ssse3_phadd_sw_128:
+ case Intrinsic::x86_avx2_phadd_sw: {
+ computeKnownBits(I->getOperand(0), DemandedElts, Known, Depth + 1, Q);
+ computeKnownBits(I->getOperand(1), DemandedElts, Known2, Depth + 1, Q);
+
+ Known = KnownBits::sadd_sat(Known, Known)
+ .intersectWith(KnownBits::sadd_sat(Known2, Known2));
+ break;
+ }
+ case Intrinsic::x86_ssse3_phsub_d:
+ case Intrinsic::x86_ssse3_phsub_w:
+ case Intrinsic::x86_ssse3_phsub_d_128:
+ case Intrinsic::x86_ssse3_phsub_w_128:
+ case Intrinsic::x86_avx2_phsub_d:
+ case Intrinsic::x86_avx2_phsub_w: {
+ computeKnownBits(I->getOperand(0), DemandedElts, Known, Depth + 1, Q);
+ computeKnownBits(I->getOperand(1), DemandedElts, Known2, Depth + 1, Q);
+
+ Known = KnownBits::computeForAddSub(false, false, false, Known, Known)
+ .intersectWith(KnownBits::computeForAddSub(
+ false, false, false, Known2, Known2));
+ break;
+ }
+ case Intrinsic::x86_ssse3_phsub_sw:
+ case Intrinsic::x86_ssse3_phsub_sw_128:
+ case Intrinsic::x86_avx2_phsub_sw: {
+ computeKnownBits(I->getOperand(0), DemandedElts, Known, Depth + 1, Q);
+ computeKnownBits(I->getOperand(1), DemandedElts, Known2, Depth + 1, Q);
+
+ Known = KnownBits::ssub_sat(Known, Known)
+ .intersectWith(KnownBits::ssub_sat(Known2, Known2));
+ break;
+ }
case Intrinsic::riscv_vsetvli:
case Intrinsic::riscv_vsetvlimax: {
bool HasAVL = II->getIntrinsicID() == Intrinsic::riscv_vsetvli;
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index ecc5b3b3bf840..c23df2c91f385 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -37262,6 +37262,27 @@ void X86TargetLowering::computeKnownBitsForTargetNode(const SDValue Op,
}
break;
}
+ case X86ISD::HADD: {
+ Known = DAG.computeKnownBits(Op.getOperand(0), DemandedElts, Depth + 1);
+ KnownBits Known2 =
+ DAG.computeKnownBits(Op.getOperand(1), DemandedElts, Depth + 1);
+
+ Known = KnownBits::computeForAddSub(true, false, false, Known, Known)
+ .intersectWith(KnownBits::computeForAddSub(true, false, false,
+ Known2, Known2));
+ break;
+ }
+ case X86ISD::HSUB: {
+ Known =
+ DAG.computeKnownBits(Op.getOperand(0), DemandedElts, Depth + 1);
+ KnownBits Known2 =
+ DAG.computeKnownBits(Op.getOperand(1), DemandedElts, Depth + 1);
+
+ Known = KnownBits::computeForAddSub(false, false, false, Known, Known)
+ .intersectWith(KnownBits::computeForAddSub(false, false, false,
+ Known2, Known2));
+ break;
+ }
case ISD::INTRINSIC_WO_CHAIN: {
switch (Op->getConstantOperandVal(0)) {
case Intrinsic::x86_sse2_psad_bw:
@@ -37276,6 +37297,58 @@ void X86TargetLowering::computeKnownBitsForTargetNode(const SDValue Op,
computeKnownBitsForPSADBW(LHS, RHS, Known, DemandedElts, DAG, Depth);
break;
}
+ case Intrinsic::x86_ssse3_phadd_d:
+ case Intrinsic::x86_ssse3_phadd_w:
+ case Intrinsic::x86_ssse3_phadd_d_128:
+ case Intrinsic::x86_ssse3_phadd_w_128:
+ case Intrinsic::x86_avx2_phadd_d:
+ case Intrinsic::x86_avx2_phadd_w: {
+ Known = DAG.computeKnownBits(Op.getOperand(1), DemandedElts, Depth + 1);
+ KnownBits Known2 =
+ DAG.computeKnownBits(Op.getOperand(2), DemandedElts, Depth + 1);
+
+ Known = KnownBits::computeForAddSub(true, false, false, Known, Known)
+ .intersectWith(KnownBits::computeForAddSub(true, false, false,
+ Known2, Known2));
+ break;
+ }
+ case Intrinsic::x86_ssse3_phadd_sw:
+ case Intrinsic::x86_ssse3_phadd_sw_128:
+ case Intrinsic::x86_avx2_phadd_sw: {
+ Known = DAG.computeKnownBits(Op.getOperand(1), DemandedElts, Depth + 1);
+ KnownBits Known2 =
+ DAG.computeKnownBits(Op.getOperand(2), DemandedElts, Depth + 1);
+
+ Known = KnownBits::sadd_sat(Known, Known)
+ .intersectWith(KnownBits::sadd_sat(Known2, Known2));
+ break;
+ }
+ case Intrinsic::x86_ssse3_phsub_d:
+ case Intrinsic::x86_ssse3_phsub_w:
+ case Intrinsic::x86_ssse3_phsub_d_128:
+ case Intrinsic::x86_ssse3_phsub_w_128:
+ case Intrinsic::x86_avx2_phsub_d:
+ case Intrinsic::x86_avx2_phsub_w: {
+ Known = DAG.computeKnownBits(Op.getOperand(1), DemandedElts, Depth + 1);
+ KnownBits Known2 =
+ DAG.computeKnownBits(Op.getOperand(2), DemandedElts, Depth + 1);
+
+ Known = KnownBits::computeForAddSub(false, false, false, Known, Known)
+ .intersectWith(KnownBits::computeForAddSub(
+ false, false, false, Known2, Known2));
+ break;
+ }
+ case Intrinsic::x86_ssse3_phsub_sw:
+ case Intrinsic::x86_ssse3_phsub_sw_128:
+ case Intrinsic::x86_avx2_phsub_sw: {
+ Known = DAG.computeKnownBits(Op.getOperand(1), DemandedElts, Depth + 1);
+ KnownBits Known2 =
+ DAG.computeKnownBits(Op.getOperand(2), DemandedElts, Depth + 1);
+
+ Known = KnownBits::ssub_sat(Known, Known)
+ .intersectWith(KnownBits::ssub_sat(Known2, Known2));
+ break;
+ }
}
break;
}
diff --git a/llvm/test/Analysis/ValueTracking/knownbits-hadd-hsub.ll b/llvm/test/Analysis/ValueTracking/knownbits-hadd-hsub.ll
new file mode 100644
index 0000000000000..443ab72ee54cb
--- /dev/null
+++ b/llvm/test/Analysis/ValueTracking/knownbits-hadd-hsub.ll
@@ -0,0 +1,192 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 4
+; RUN: opt -S -passes=instcombine < %s | FileCheck %s
+
+define <4 x i1> @hadd_and_eq_v4i32(<4 x i32> %x, <4 x i32> %y) {
+; CHECK-LABEL: define <4 x i1> @hadd_and_eq_v4i32(
+; CHECK-SAME: <4 x i32> [[X:%.*]], <4 x i32> [[Y:%.*]]) {
+; CHECK-NEXT: entry:
+; CHECK-NEXT: ret <4 x i1> zeroinitializer
+;
+entry:
+ %0 = and <4 x i32> %x, <i32 3, i32 3, i32 3, i32 3>
+ %1 = and <4 x i32> %y, <i32 3, i32 3, i32 3, i32 3>
+ %2 = tail call <4 x i32> @llvm.x86.ssse3.phadd.d.128(<4 x i32> %0, <4 x i32> %1)
+ %3 = and <4 x i32> %2, <i32 -8, i32 -8, i32 -8, i32 -8>
+ %ret = icmp eq <4 x i32> %3, <i32 3, i32 4, i32 5, i32 6>
+ ret <4 x i1> %ret
+}
+
+define <8 x i1> @hadd_and_eq_v8i16(<8 x i16> %x, <8 x i16> %y) {
+; CHECK-LABEL: define <8 x i1> @hadd_and_eq_v8i16(
+; CHECK-SAME: <8 x i16> [[X:%.*]], <8 x i16> [[Y:%.*]]) {
+; CHECK-NEXT: entry:
+; CHECK-NEXT: ret <8 x i1> <i1 true, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 true>
+;
+entry:
+ %0 = and <8 x i16> %x, <i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3>
+ %1 = and <8 x i16> %y, <i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3>
+ %2 = tail call <8 x i16> @llvm.x86.ssse3.phadd.w.128(<8 x i16> %0, <8 x i16> %1)
+ %3 = and <8 x i16> %2, <i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8>
+ %ret = icmp eq <8 x i16> %3, <i16 0, i16 1, i16 2, i16 3, i16 4, i16 5, i16 6, i16 0>
+ ret <8 x i1> %ret
+}
+
+define <8 x i1> @hadd_and_eq_v8i16_sat(<8 x i16> %x, <8 x i16> %y) {
+; CHECK-LABEL: define <8 x i1> @hadd_and_eq_v8i16_sat(
+; CHECK-SAME: <8 x i16> [[X:%.*]], <8 x i16> [[Y:%.*]]) {
+; CHECK-NEXT: entry:
+; CHECK-NEXT: ret <8 x i1> <i1 true, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 true>
+;
+entry:
+ %0 = and <8 x i16> %x, <i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3>
+ %1 = and <8 x i16> %y, <i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3>
+ %2 = tail call <8 x i16> @llvm.x86.ssse3.phadd.sw.128(<8 x i16> %0, <8 x i16> %1)
+ %3 = and <8 x i16> %2, <i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8>
+ %ret = icmp eq <8 x i16> %3, <i16 0, i16 1, i16 2, i16 3, i16 4, i16 5, i16 6, i16 0>
+ ret <8 x i1> %ret
+}
+
+define <8 x i1> @hadd_and_eq_v8i32(<8 x i32> %x, <8 x i32> %y) {
+; CHECK-LABEL: define <8 x i1> @hadd_and_eq_v8i32(
+; CHECK-SAME: <8 x i32> [[X:%.*]], <8 x i32> [[Y:%.*]]) {
+; CHECK-NEXT: entry:
+; CHECK-NEXT: ret <8 x i1> zeroinitializer
+;
+entry:
+ %0 = and <8 x i32> %x, <i32 3, i32 3, i32 3, i32 3, i32 3, i32 3, i32 3, i32 3>
+ %1 = and <8 x i32> %y, <i32 3, i32 3, i32 3, i32 3, i32 3, i32 3, i32 3, i32 3>
+ %2 = tail call <8 x i32> @llvm.x86.avx2.phadd.d(<8 x i32> %0, <8 x i32> %1)
+ %3 = and <8 x i32> %2, <i32 -8, i32 -8, i32 -8, i32 -8, i32 -8, i32 -8, i32 -8, i32 -8>
+ %ret = icmp eq <8 x i32> %3, <i32 3, i32 4, i32 5, i32 6, i32 3, i32 4, i32 5, i32 6>
+ ret <8 x i1> %ret
+}
+
+define <16 x i1> @hadd_and_eq_v16i16(<16 x i16> %x, <16 x i16> %y) {
+; CHECK-LABEL: define <16 x i1> @hadd_and_eq_v16i16(
+; CHECK-SAME: <16 x i16> [[X:%.*]], <16 x i16> [[Y:%.*]]) {
+; CHECK-NEXT: entry:
+; CHECK-NEXT: ret <16 x i1> <i1 true, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 true, i1 true, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 true>
+;
+entry:
+ %0 = and <16 x i16> %x, <i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3>
+ %1 = and <16 x i16> %y, <i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3>
+ %2 = tail call <16 x i16> @llvm.x86.avx2.phadd.w(<16 x i16> %0, <16 x i16> %1)
+ %3 = and <16 x i16> %2, <i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8>
+ %ret = icmp eq <16 x i16> %3, <i16 0, i16 1, i16 2, i16 3, i16 4, i16 5, i16 6, i16 0, i16 0, i16 1, i16 2, i16 3, i16 4, i16 5, i16 6, i16 0>
+ ret <16 x i1> %ret
+}
+
+define <16 x i1> @hadd_and_eq_v16i16_sat(<16 x i16> %x, <16 x i16> %y) {
+; CHECK-LABEL: define <16 x i1> @hadd_and_eq_v16i16_sat(
+; CHECK-SAME: <16 x i16> [[X:%.*]], <16 x i16> [[Y:%.*]]) {
+; CHECK-NEXT: entry:
+; CHECK-NEXT: ret <16 x i1> <i1 true, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 true, i1 true, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 true>
+;
+entry:
+ %0 = and <16 x i16> %x, <i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3>
+ %1 = and <16 x i16> %y, <i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3>
+ %2 = tail call <16 x i16> @llvm.x86.avx2.phadd.sw(<16 x i16> %0, <16 x i16> %1)
+ %3 = and <16 x i16> %2, <i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8, i16 -8>
+ %ret = icmp eq <16 x i16> %3, <i16 0, i16 1, i16 2, i16 3, i16 4, i16 5, i16 6, i16 0, i16 0, i16 1, i16 2, i16 3, i16 4, i16 5, i16 6, i16 0>
+ ret <16 x i1> %ret
+}
+
+define <4 x i1> @hsub_trunc_eq_v4i32(<4 x i32> %x, <4 x i32> %y) {
+; CHECK-LABEL: define <4 x i1> @hsub_trunc_eq_v4i32(
+; CHECK-SAME: <4 x i32> [[X:%.*]], <4 x i32> [[Y:%.*]]) {
+; CHECK-NEXT: entry:
+; CHECK-NEXT: ret <4 x i1> zeroinitializer
+;
+entry:
+ %0 = or <4 x i32> %x, <i32 65535, i32 65535, i32 65535, i32 65535>
+ %1 = or <4 x i32> %y, <i32 65535, i32 65535, i32 65535, i32 65535>
+ %2 = tail call <4 x i32> @llvm.x86.ssse3.phsub.d.128(<4 x i32> %0, <4 x i32> %1)
+ %conv = trunc <4 x i32> %2 to <4 x i16>
+ %ret = icmp eq <4 x i16> %conv, <i16 3, i16 4, i16 5, i16 6>
+ ret <4 x i1> %ret
+}
+
+define <8 x i1> @hsub_trunc_eq_v8i16(<8 x i16> %x, <8 x i16> %y) {
+; CHECK-LABEL: define <8 x i1> @hsub_trunc_eq_v8i16(
+; CHECK-SAME: <8 x i16> [[X:%.*]], <8 x i16> [[Y:%.*]]) {
+; CHECK-NEXT: entry:
+; CHECK-NEXT: ret <8 x i1> <i1 true, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 true>
+;
+entry:
+ %0 = or <8 x i16> %x, <i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255>
+ %1 = or <8 x i16> %y, <i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255>
+ %2 = tail call <8 x i16> @llvm.x86.ssse3.phsub.w.128(<8 x i16> %0, <8 x i16> %1)
+ %conv = trunc <8 x i16> %2 to <8 x i8>
+ %ret = icmp eq <8 x i8> %conv, <i8 0, i8 1, i8 2, i8 3, i8 4, i8 5, i8 6, i8 0>
+ ret <8 x i1> %ret
+}
+
+define <8 x i1> @hsub_and_eq_v8i16_sat(<8 x i16> %x, <8 x i16> %y) {
+; CHECK-LABEL: define <8 x i1> @hsub_and_eq_v8i16_sat(
+; CHECK-SAME: <8 x i16> [[X:%.*]], <8 x i16> [[Y:%.*]]) {
+; CHECK-NEXT: entry:
+; CHECK-NEXT: [[TMP0:%.*]] = or <8 x i16> [[X]], <i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7>
+; CHECK-NEXT: [[TMP1:%.*]] = or <8 x i16> [[Y]], <i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7>
+; CHECK-NEXT: [[TMP2:%.*]] = tail call <8 x i16> @llvm.x86.ssse3.phsub.sw.128(<8 x i16> [[TMP0]], <8 x i16> [[TMP1]])
+; CHECK-NEXT: [[TMP3:%.*]] = and <8 x i16> [[TMP2]], <i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7>
+; CHECK-NEXT: [[TMP4:%.*]] = icmp eq <8 x i16> [[TMP3]], zeroinitializer
+; CHECK-NEXT: ret <8 x i1> [[TMP4]]
+;
+entry:
+ %0 = or <8 x i16> %x, <i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7>
+ %1 = or <8 x i16> %y, <i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7>
+ %2 = tail call <8 x i16> @llvm.x86.ssse3.phsub.sw.128(<8 x i16> %0, <8 x i16> %1)
+ %3 = and <8 x i16> %2, <i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7>
+ %4 = icmp eq <8 x i16> %3, zeroinitializer
+ ret <8 x i1> %4
+}
+
+define <8 x i1> @hsub_trunc_eq_v8i32(<8 x i32> %x, <8 x i32> %y) {
+; CHECK-LABEL: define <8 x i1> @hsub_trunc_eq_v8i32(
+; CHECK-SAME: <8 x i32> [[X:%.*]], <8 x i32> [[Y:%.*]]) {
+; CHECK-NEXT: entry:
+; CHECK-NEXT: ret <8 x i1> zeroinitializer
+;
+entry:
+ %0 = or <8 x i32> %x, <i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535>
+ %1 = or <8 x i32> %y, <i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535, i32 65535>
+ %2 = tail call <8 x i32> @llvm.x86.avx2.phsub.d(<8 x i32> %0, <8 x i32> %1)
+ %conv = trunc <8 x i32> %2 to <8 x i16>
+ %ret = icmp eq <8 x i16> %conv, <i16 3, i16 4, i16 5, i16 6, i16 3, i16 4, i16 5, i16 6>
+ ret <8 x i1> %ret
+}
+
+define <16 x i1> @hsub_trunc_eq_v16i16(<16 x i16> %x, <16 x i16> %y) {
+; CHECK-LABEL: define <16 x i1> @hsub_trunc_eq_v16i16(
+; CHECK-SAME: <16 x i16> [[X:%.*]], <16 x i16> [[Y:%.*]]) {
+; CHECK-NEXT: entry:
+; CHECK-NEXT: ret <16 x i1> <i1 true, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 true, i1 true, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 true>
+;
+entry:
+ %0 = or <16 x i16> %x, <i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255>
+ %1 = or <16 x i16> %y, <i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255>
+ %2 = tail call <16 x i16> @llvm.x86.avx2.phsub.w(<16 x i16> %0, <16 x i16> %1)
+ %conv = trunc <16 x i16> %2 to <16 x i8>
+ %ret = icmp eq <16 x i8> %conv, <i8 0, i8 1, i8 2, i8 3, i8 4, i8 5, i8 6, i8 0, i8 0, i8 1, i8 2, i8 3, i8 4, i8 5, i8 6, i8 0>
+ ret <16 x i1> %ret
+}
+
+define <16 x i1> @hsub_and_eq_v16i16_sat(<16 x i16> %x, <16 x i16> %y) {
+; CHECK-LABEL: define <16 x i1> @hsub_and_eq_v16i16_sat(
+; CHECK-SAME: <16 x i16> [[X:%.*]], <16 x i16> [[Y:%.*]]) {
+; CHECK-NEXT: entry:
+; CHECK-NEXT: [[TMP0:%.*]] = or <16 x i16> [[X]], <i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7>
+; CHECK-NEXT: [[TMP1:%.*]] = or <16 x i16> [[Y]], <i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7>
+; CHECK-NEXT: [[TMP2:%.*]] = tail call <16 x i16> @llvm.x86.avx2.phsub.sw(<16 x i16> [[TMP0]], <16 x i16> [[TMP1]])
+; CHECK-NEXT: [[TMP3:%.*]] = and <16 x i16> [[TMP2]], <i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7>
+; CHECK-NEXT: [[TMP4:%.*]] = icmp eq <16 x i16> [[TMP3]], zeroinitializer
+; CHECK-NEXT: ret <16 x i1> [[TMP4]]
+;
+entry:
+ %0 = or <16 x i16> %x, <i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7>
+ %1 = or <16 x i16> %y, <i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7>
+ %2 = tail call <16 x i16> @llvm.x86.avx2.phsub.sw(<16 x i16> %0, <16 x i16> %1)
+ %3 = and <16 x i16> %2, <i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7>
+ %4 = icmp eq <16 x i16> %3, zeroinitializer
+ ret <16 x i1> %4
+}
diff --git a/llvm/test/CodeGen/X86/knownbits-hadd-hsub.ll b/llvm/test/CodeGen/X86/knownbits-hadd-hsub.ll
new file mode 100644
index 0000000000000..eba7b9843d991
--- /dev/null
+++ b/llvm/test/CodeGen/X86/knownbits-hadd-hsub.ll
@@ -0,0 +1,201 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 4
+; RUN: llc < %s -mtriple=x86_64-unknown -mattr=+avx2 | FileCheck %s
+
+define <4 x i16> @hadd_trunc_v4i32(<4 x i32> %x, <4 x i32> %y) {
+; CHECK-LABEL: hadd_trunc_v4i32:
+; CHECK: # %bb.0: # %entry
+; CHECK-NEXT: vpbroadcastd {{.*#+}} xmm2 = [3,3,3,3]
+; CHECK-NEXT: vpand %xmm2, %xmm0, %xmm0
+; CHECK-NEXT: vpand %xmm2, %xmm1, %xmm1
+; CHECK-NEXT: vphaddd %xmm1, %xmm0, %xmm0
+; CHECK-NEXT: vpackusdw %xmm0, %xmm0, %xmm0
+; CHECK-NEXT: retq
+entry:
+ %0 = and <4 x i32> %x, <i32 3, i32 3, i32 3, i32 3>
+ %1 = and <4 x i32> %y, <i32 3, i32 3, i32 3, i32 3>
+ %2 = tail call <4 x i32> @llvm.x86.ssse3.phadd.d.128(<4 x i32> %0, <4 x i32> %1)
+ %conv = trunc <4 x i32> %2 to <4 x i16>
+ ret <4 x i16> %conv
+}
+
+define <8 x i8> @hadd_trunc_v8i16(<8 x i16> %x, <8 x i16> %y) {
+; CHECK-LABEL: hadd_trunc_v8i16:
+; CHECK: # %bb.0: # %entry
+; CHECK-NEXT: vpbroadcastw {{.*#+}} xmm2 = [3,3,3,3,3,3,3,3]
+; CHECK-NEXT: vpand %xmm2, %xmm0, %xmm0
+; CHECK-NEXT: vpand %xmm2, %xmm1, %xmm1
+; CHECK-NEXT: vphaddw %xmm1, %xmm0, %xmm0
+; CHECK-NEXT: vpackuswb %xmm0, %xmm0, %xmm0
+; CHECK-NEXT: retq
+entry:
+ %0 = and <8 x i16> %x, <i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3>
+ %1 = and <8 x i16> %y, <i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3>
+ %2 = tail call <8 x i16> @llvm.x86.ssse3.phadd.w.128(<8 x i16> %0, <8 x i16> %1)
+ %conv = trunc <8 x i16> %2 to <8 x i8>
+ ret <8 x i8> %conv
+}
+
+define <8 x i8> @hadd_trunc_v8i16_sat(<8 x i16> %x, <8 x i16> %y) {
+; CHECK-LABEL: hadd_trunc_v8i16_sat:
+; CHECK: # %bb.0: # %entry
+; CHECK-NEXT: vpbroadcastw {{.*#+}} xmm2 = [3,3,3,3,3,3,3,3]
+; CHECK-NEXT: vpand %xmm2, %xmm0, %xmm0
+; CHECK-NEXT: vpand %xmm2, %xmm1, %xmm1
+; CHECK-NEXT: vphaddsw %xmm1, %xmm0, %xmm0
+; CHECK-NEXT: vpackuswb %xmm0, %xmm0, %xmm0
+; CHECK-NEXT: retq
+entry:
+ %0 = and <8 x i16> %x, <i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3>
+ %1 = and <8 x i16> %y, <i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3>
+ %2 = tail call <8 x i16> @llvm.x86.ssse3.phadd.sw.128(<8 x i16> %0, <8 x i...
[truncated]
llvm/lib/Analysis/ValueTracking.cpp
Outdated
case Intrinsic::x86_avx2_phadd_d:
case Intrinsic::x86_avx2_phadd_w: {
  computeKnownBits(I->getOperand(0), DemandedElts, Known, Depth + 1, Q);
  computeKnownBits(I->getOperand(1), DemandedElts, Known2, Depth + 1, Q);
I'm not sure DemandedElts is right to propagate here. I.e. in your example of [X1, X2], [Y1, Y2] -> [X1 + X2, Y1 + Y2], if you only demand the first element, you still need both X1 and X2 from operand 0; you just don't need either Y1 or Y2.
This also appears to need to be changed for the rest of the impls.
Can you add some tests where the result is then shuffled/extracted from, to test w/ DemandedElts as not all ones?
Apparently, I didn't quite understand the purpose of DemandedElts. I've adjusted the code and added some tests.
It might be better to focus on the SelectionDAG variant first to get this right, but it's up to you.
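To illustrate the point about demanded elements, here is a minimal, self-contained sketch (hypothetical names, not the LLVM API): for a 4-element horizontal add, the result layout is [A0+A1, A2+A3, B0+B1, B2+B3], so demanding one result element demands *both* elements of one pair in *one* operand. The result mask therefore cannot simply be reused for the operands.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch: split the demanded-elements mask of a 4-element
// phadd result into per-operand masks. Result element I (I < 2) is
// A[2I] + A[2I+1]; result element I (I >= 2) is B[2(I-2)] + B[2(I-2)+1].
void splitHorizDemandedElts(uint8_t DemandedResult, uint8_t &DemandedA,
                            uint8_t &DemandedB) {
  DemandedA = DemandedB = 0;
  for (int I = 0; I < 4; ++I) {
    if (!(DemandedResult & (1u << I)))
      continue;
    if (I < 2)
      DemandedA |= 0b11u << (2 * I);       // both halves of the A pair
    else
      DemandedB |= 0b11u << (2 * (I - 2)); // both halves of the B pair
  }
}
```

For example, demanding only result element 0 demands A0 and A1 but nothing from B, matching the reviewer's [X1, X2], [Y1, Y2] example.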
@@ -37262,6 +37262,27 @@ void X86TargetLowering::computeKnownBitsForTargetNode(const SDValue Op,
      }
      break;
    }
    case X86ISD::HADD: {
      Known = DAG.computeKnownBits(Op.getOperand(0), DemandedElts, Depth + 1);
You can't use the result DemandedElts for the source operands you'll need something like:
APInt DemandedLHS, DemandedRHS;
getHorizDemandedElts(VT, DemandedElts, DemandedLHS, DemandedRHS);
You can avoid computeKnownBits calls for cases where either DemandedLHS/RHS are zero.
Plus you can probably get more refined knownbits by getting the known bits of the odd/even elements separately, although that could mean 4 computeKnownBits calls.
I've added the implementation with 4 computeKnownBits calls in the hope that it does not slow down the analysis too much.
llvm/lib/Analysis/ValueTracking.cpp
Outdated
case Intrinsic::x86_ssse3_phadd_sw_128:
case Intrinsic::x86_avx2_phadd_sw: {
  computeKnownBits(I->getOperand(0), DemandedElts, Known, Depth + 1, Q);
  computeKnownBits(I->getOperand(1), DemandedElts, Known2, Depth + 1, Q);
wow, saturated hadd, haven't seen that for a while :)
/// \param DemandedEltsOp the demanded elements mask for the operation
/// \param DemandedEltsLHS the demanded elements mask for the left operand
/// \param DemandedEltsRHS the demanded elements mask for the right operand
void getHorizontalDemandedElts(const APInt &DemandedEltsOp,
I think this function name should make it clear you are only getting half the elements.
Good point. Done. In the meantime, I've noticed that there is already another function that does almost the same. Therefore, I've consolidated them.
llvm/lib/Analysis/ValueTracking.cpp
Outdated
std::array<KnownBits, 2> KnownLHS;
for (unsigned Index = 0; Index < KnownLHS.size(); ++Index) {
  if (!DemandedEltsLHS.isZero()) {
I think you must never be hitting this case, otherwise I think we will run into issues with uninitialized use of KnownLHS. Likewise below for RHS.
I'm not sure I understand. In both cases, KnownLHS[Index] is initialized. Shouldn't both cases occur in the hadd_extract test cases? Shouldn't they fail if anything were uninitialized? At least, valgrind --tool=memcheck doesn't report any error when executing the tests.
Yeah you're right, I misread code.
This basically LGTM. Please wait for some additional approvals to push.
[](const KnownBits &KnownLHS, const KnownBits &KnownRHS) {
  return KnownBits::ssub_sat(KnownLHS, KnownRHS);
});
break;
I'd prefer it in the DAG if we just handled the X86ISD::HADD/SUB nodes and not the intrinsics - we try to do as little as possible with MMX types in DAG, and the saturation instructions are very rare - its much more likely that we just need to determine knownbits for X86ISD::HADD/SUB nodes we've created in the DAG.
The implementation handled the intrinsics because otherwise some test cases would not fold. For example, the test case that truncates <4 x i32> to <4 x i16> does not fold when handling only X86ISD::HADD/HSUB. In contrast, tests that truncate <8 x i32> to <8 x i16> work fine this way.
After looking at this problem again, I believe that the code that replaces the shuffle with a pack instruction might be too strict. This is probably also the case in this example: https://godbolt.org/z/KW5b6r7xW
Anyway, I've removed the handling of the intrinsics and adapted the test cases such that they still fold with only the X86ISD::HADD/HSUB nodes.
Sorry for this falling off my radar - a few minors but otherwise almost ready to go
@mskamp Please can you rebase to fix the merge conflicts?
LGTM
@mskamp Congratulations on having your first Pull Request (PR) merged into the LLVM Project!
Your changes will be combined with recent changes from other authors, then tested by our build bots. If there is a problem with a build, you may receive a report in an email or a comment on this PR.
Please check whether problems have been caused by your change specifically, as the builds can include changes from many authors. It is not uncommon for your change to be included in a build that fails due to someone else's changes, or infrastructure issues.
How to do this, and the rest of the post-merge process, is covered in detail here.
If your change does cause a problem, it may be reverted, or you can revert it yourself. If you don't get any reports, no action is required from you. Your changes are working as expected, well done!
Add KnownBits computations to ValueTracking and X86 DAG lowering.
These instructions add/subtract adjacent vector elements in their
operands. Example: phadd [X1, X2] [Y1, Y2] = [X1 + X2, Y1 + Y2].
This means that, in this example, we can compute the KnownBits of the
operation by computing the KnownBits of [X1, X2] + [X1, X2] and
[Y1, Y2] + [Y1, Y2] and intersecting the results. This approach
also generalizes to all x86 vector types.
There are also the operations phadd.sw and phsub.sw, which perform
saturating addition/subtraction. Use sadd_sat and ssub_sat to compute
the KnownBits of these operations.
Also adjust the existing test case pr53247.ll because it can be
transformed to a constant using the new KnownBits computation.
Fixes #82516.
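The intersection approach described above can be sketched with a small 8-bit model. The helper below follows, as I understand it, the classic carry-propagation rule behind the add case of KnownBits::computeForAddSub; the struct and function names are hypothetical, not LLVM's API. With both inputs masked to & 3 (as in the hadd_and_eq_v4i32 test), the model proves every horizontal sum fits in 3 bits, which is why result & -8 folds to zero.

```cpp
#include <cassert>
#include <cstdint>

// Toy 8-bit model of llvm::KnownBits: Zero/One are bits known to be 0/1.
struct Known {
  uint8_t Zero, One;
};

// Known bits of X + Y via carry propagation (a sketch of the add case of
// KnownBits::computeForAddSub).
Known addKnown(Known L, Known R) {
  uint8_t MaxSum = uint8_t(~L.Zero + ~R.Zero); // largest possible sum
  uint8_t MinSum = uint8_t(L.One + R.One);     // smallest possible sum
  // The carry into bit I is known if even the largest sum produces no
  // carry there, or even the smallest sum produces one.
  uint8_t CarryZero = uint8_t(~(MaxSum ^ L.Zero ^ R.Zero));
  uint8_t CarryOne = uint8_t(MinSum ^ L.One ^ R.One);
  // A result bit is known iff both operand bits and the incoming carry are.
  uint8_t AllKnown =
      (L.Zero | L.One) & (R.Zero | R.One) & (CarryZero | CarryOne);
  return {uint8_t(~MaxSum & AllKnown), uint8_t(MinSum & AllKnown)};
}

// Model of intersectWith: keep only facts that hold for both operands' pairs.
Known intersect(Known A, Known B) {
  return {uint8_t(A.Zero & B.Zero), uint8_t(A.One & B.One)};
}
```

For phadd [X1, X2] [Y1, Y2], the result's known bits would be modeled as intersect(addKnown(KX, KX), addKnown(KY, KY)), mirroring the computeForAddSub(...).intersectWith(...) pattern in the patch.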