[X86] Try Folding icmp of v8i32 -> fcmp of v8f32 on AVX #82290

Open: wants to merge 3 commits into main from goldsteinn/icmp-to-fcmp

Conversation

@goldsteinn (Contributor) commented Feb 19, 2024

Fixes: #82242

The idea is that AVX doesn't support comparisons for v8i32 so it
splits the comparison into 2x v4i32 comparisons + reconstruction of
the v8i32.

By converting to a float, we can handle the comparison with 1/2
instructions (1 if we can bitcast, 2 if we need to cast with
sitofp).

The proofs: https://alive2.llvm.org/ce/z/AJDdQ8
They time out online, but they can be reproduced locally.
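
For intuition, a minimal standalone sketch (editor-added, not part of the patch) of why the sitofp path is sound: any i32 whose magnitude is at most 1 << 24 (the float significand precision) converts to float exactly, so the ordered float compare agrees with the integer compare. The Alive2 proofs above cover the weaker condition where only one operand needs to be bounded.

// Standalone sketch: i32 values with |v| <= 1 << 24 round-trip exactly
// through float, so fcmp and icmp agree on them.
#include <cassert>
#include <cstdint>

int main() {
  const int64_t Bound = 1LL << 24; // float significand precision
  for (int64_t X = -Bound; X <= Bound; X += 9973)     // sampled, for speed
    for (int64_t C = -Bound; C <= Bound; C += 131071) {
      bool ICmp = X < C;
      bool FCmp = static_cast<float>(X) < static_cast<float>(C);
      assert(ICmp == FCmp); // exact conversions => identical results
    }
}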

@llvmbot (Collaborator) commented Feb 19, 2024

@llvm/pr-subscribers-llvm-selectiondag

@llvm/pr-subscribers-backend-x86

Author: None (goldsteinn)

Changes
  • [X86] Add tests for folding icmp of v8i32 -> fcmp of v8f32 on AVX; NFC
  • [X86] Try Folding icmp of v8i32 -> fcmp of v8f32 on AVX

Patch is 282.65 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/82290.diff

30 Files Affected:

  • (modified) llvm/lib/Target/X86/X86ISelLowering.cpp (+104)
  • (modified) llvm/test/CodeGen/X86/bitcast-int-to-vector-bool-sext.ll (+3-12)
  • (modified) llvm/test/CodeGen/X86/bitcast-int-to-vector-bool-zext.ll (+8-11)
  • (added) llvm/test/CodeGen/X86/cmpf-avx.ll (+242)
  • (modified) llvm/test/CodeGen/X86/combine-testps.ll (+18-7)
  • (modified) llvm/test/CodeGen/X86/masked_compressstore.ll (+17-17)
  • (modified) llvm/test/CodeGen/X86/masked_expandload.ll (+24-24)
  • (modified) llvm/test/CodeGen/X86/masked_gather.ll (+64-70)
  • (modified) llvm/test/CodeGen/X86/masked_load.ll (+3-5)
  • (modified) llvm/test/CodeGen/X86/masked_store.ll (+36-33)
  • (modified) llvm/test/CodeGen/X86/masked_store_trunc.ll (+29-48)
  • (modified) llvm/test/CodeGen/X86/masked_store_trunc_ssat.ll (+29-48)
  • (modified) llvm/test/CodeGen/X86/masked_store_trunc_usat.ll (+29-48)
  • (modified) llvm/test/CodeGen/X86/nontemporal-loads.ll (+7-11)
  • (modified) llvm/test/CodeGen/X86/pr48215.ll (+9-10)
  • (modified) llvm/test/CodeGen/X86/sadd_sat_vec.ll (+18-10)
  • (modified) llvm/test/CodeGen/X86/setcc-lowering.ll (+3-3)
  • (modified) llvm/test/CodeGen/X86/ssub_sat_vec.ll (+33-36)
  • (modified) llvm/test/CodeGen/X86/v8i1-masks.ll (+6-10)
  • (modified) llvm/test/CodeGen/X86/vec_saddo.ll (+45-45)
  • (modified) llvm/test/CodeGen/X86/vec_ssubo.ll (+45-45)
  • (modified) llvm/test/CodeGen/X86/vec_umulo.ll (+33-38)
  • (modified) llvm/test/CodeGen/X86/vector-constrained-fp-intrinsics.ll (+7-11)
  • (modified) llvm/test/CodeGen/X86/vector-pcmp.ll (+9-13)
  • (modified) llvm/test/CodeGen/X86/vector-popcnt-256-ult-ugt.ll (+1493-1584)
  • (modified) llvm/test/CodeGen/X86/vector-reduce-fmaximum.ll (-58)
  • (modified) llvm/test/CodeGen/X86/vector-reduce-or-bool.ll (+11-13)
  • (modified) llvm/test/CodeGen/X86/vector-reduce-xor-bool.ll (+10-12)
  • (modified) llvm/test/CodeGen/X86/vector-sext.ll (+1-4)
  • (modified) llvm/test/CodeGen/X86/vector-unsigned-cmp.ll (+8-8)
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index d9657e1aeb8026..b438700e69d8ca 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -23299,6 +23299,110 @@ static SDValue LowerVSETCC(SDValue Op, const X86Subtarget &Subtarget,
     }
   }
 
+  // We get bad codegen for v8i32 compares on AVX targets (without AVX2), so
+  // if possible convert to a v8f32 compare.
+  if (VTOp0.getVectorElementType() == MVT::i32 && VTOp0 == MVT::v8i32 &&
+      Subtarget.hasAVX() && !Subtarget.hasAVX2()) {
+    std::optional<KnownBits> KnownOps[2];
+    // Check if an op is known to be in a certain range.
+    auto OpInRange = [&DAG, Op, &KnownOps](unsigned OpNo, bool CmpLT,
+                                           const APInt Bound) {
+      if (!KnownOps[OpNo].has_value())
+        KnownOps[OpNo] = DAG.computeKnownBits(Op.getOperand(OpNo));
+
+      if (KnownOps[OpNo]->isUnknown())
+        return false;
+
+      std::optional<bool> Res;
+      if (CmpLT)
+        Res = KnownBits::ult(*KnownOps[OpNo], KnownBits::makeConstant(Bound));
+      else
+        Res = KnownBits::ugt(*KnownOps[OpNo], KnownBits::makeConstant(Bound));
+      return Res.has_value() && *Res;
+    };
+
+    bool OkayCvt = false;
+    bool OkayBitcast = false;
+
+    // For cvt up to 1 << (Significand Precision)
+    const APInt MaxConvertableCvt = APInt(32, (1U << 24));
+    // For bitcast up to (and including) first inf representation (0x7f800000)
+    const APInt MaxConvertableBitcast = APInt(32, 0x7f800001);
+
+    // For bitcast we need both lhs/rhs (op0/op1) u< MaxConvertableBitcast.
+    // NB: It might be worth it to enable the bitcast version for unsigned
+    // AVX2 comparisons as they typically require multiple instructions to
+    // lower (they don't fit `vpcmpeq`/`vpcmpgt` well).
+    if (OpInRange(1, /*CmpLT*/ true, MaxConvertableBitcast) &&
+        OpInRange(0, /*CmpLT*/ true, MaxConvertableBitcast)) {
+      OkayBitcast = true;
+    }
+    // We want to convert icmp -> fcmp using `sitofp` iff one of the converts
+    // will be constant folded.
+    else if ((DAG.isConstantValueOfAnyType(peekThroughBitcasts(Op1)) ||
+              DAG.isConstantValueOfAnyType(peekThroughBitcasts(Op0)))) {
+      if (isUnsignedIntSetCC(Cond)) {
+        // For cvt + unsigned compare we need both lhs/rhs >= 0 and either lhs
+        // or rhs < MaxConvertableCvt
+
+        if (OpInRange(1, /*CmpLT*/ true, APInt::getSignedMinValue(32)) &&
+            OpInRange(0, /*CmpLT*/ true, APInt::getSignedMinValue(32)) &&
+            (OpInRange(1, /*CmpLT*/ true, MaxConvertableCvt) ||
+             OpInRange(0, /*CmpLT*/ true, MaxConvertableCvt)))
+          OkayCvt = true;
+      } else {
+        // For cvt + signed compare we need abs(lhs) or abs(rhs) <
+        // MaxConvertableCvt
+        if (OpInRange(1, /*CmpLT*/ true, MaxConvertableCvt) ||
+            OpInRange(1, /*CmpLT*/ false, -MaxConvertableCvt) ||
+            OpInRange(0, /*CmpLT*/ true, MaxConvertableCvt) ||
+            OpInRange(0, /*CmpLT*/ false, -MaxConvertableCvt))
+          OkayCvt = true;
+      }
+    }
+
+    if (OkayBitcast || OkayCvt) {
+      switch (Cond) {
+      default:
+        llvm_unreachable("Unexpected SETCC condition");
+        // Get the new FP condition. Note for the unsigned conditions we have
+        // verified it's okay to convert to the signed version.
+      case ISD::SETULT:
+      case ISD::SETLT:
+        Cond = ISD::SETOLT;
+        break;
+      case ISD::SETUGT:
+      case ISD::SETGT:
+        Cond = ISD::SETOGT;
+        break;
+      case ISD::SETULE:
+      case ISD::SETLE:
+        Cond = ISD::SETOLE;
+        break;
+      case ISD::SETUGE:
+      case ISD::SETGE:
+        Cond = ISD::SETOGE;
+        break;
+      case ISD::SETEQ:
+        Cond = ISD::SETOEQ;
+        break;
+      case ISD::SETNE:
+        Cond = ISD::SETONE;
+        break;
+      }
+
+      MVT FpVT = MVT::getVectorVT(MVT::f32, VT.getVectorElementCount());
+      if (OkayBitcast) {
+        Op0 = DAG.getBitcast(FpVT, Op0);
+        Op1 = DAG.getBitcast(FpVT, Op1);
+      } else {
+        Op0 = DAG.getNode(ISD::SINT_TO_FP, dl, FpVT, Op0);
+        Op1 = DAG.getNode(ISD::SINT_TO_FP, dl, FpVT, Op1);
+      }
+      return DAG.getSetCC(dl, VT, Op0, Op1, Cond);
+    }
+  }
+
   // Break 256-bit integer vector compare into smaller ones.
   if (VT.is256BitVector() && !Subtarget.hasInt256())
     return splitIntVSETCC(VT, Op0, Op1, Cond, DAG, dl);
diff --git a/llvm/test/CodeGen/X86/bitcast-int-to-vector-bool-sext.ll b/llvm/test/CodeGen/X86/bitcast-int-to-vector-bool-sext.ll
index 6255621d870e12..d680135be56418 100644
--- a/llvm/test/CodeGen/X86/bitcast-int-to-vector-bool-sext.ll
+++ b/llvm/test/CodeGen/X86/bitcast-int-to-vector-bool-sext.ll
@@ -258,10 +258,7 @@ define <8 x i32> @ext_i8_8i32(i8 %a0) {
 ; AVX1-NEXT:    vinsertf128 $1, %xmm0, %ymm0, %ymm0
 ; AVX1-NEXT:    vmovaps {{.*#+}} ymm1 = [1,2,4,8,16,32,64,128]
 ; AVX1-NEXT:    vandps %ymm1, %ymm0, %ymm0
-; AVX1-NEXT:    vpcmpeqd %xmm1, %xmm0, %xmm1
-; AVX1-NEXT:    vextractf128 $1, %ymm0, %xmm0
-; AVX1-NEXT:    vpcmpeqd {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
-; AVX1-NEXT:    vinsertf128 $1, %xmm0, %ymm1, %ymm0
+; AVX1-NEXT:    vcmpeqps %ymm1, %ymm0, %ymm0
 ; AVX1-NEXT:    retq
 ;
 ; AVX2-LABEL: ext_i8_8i32:
@@ -489,16 +486,10 @@ define <16 x i32> @ext_i16_16i32(i16 %a0) {
 ; AVX1-NEXT:    vinsertf128 $1, %xmm0, %ymm0, %ymm1
 ; AVX1-NEXT:    vmovaps {{.*#+}} ymm0 = [1,2,4,8,16,32,64,128]
 ; AVX1-NEXT:    vandps %ymm0, %ymm1, %ymm2
-; AVX1-NEXT:    vpcmpeqd %xmm0, %xmm2, %xmm0
-; AVX1-NEXT:    vextractf128 $1, %ymm2, %xmm2
-; AVX1-NEXT:    vpcmpeqd {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm2, %xmm2
-; AVX1-NEXT:    vinsertf128 $1, %xmm2, %ymm0, %ymm0
+; AVX1-NEXT:    vcmpeqps %ymm0, %ymm2, %ymm0
 ; AVX1-NEXT:    vmovaps {{.*#+}} ymm2 = [256,512,1024,2048,4096,8192,16384,32768]
 ; AVX1-NEXT:    vandps %ymm2, %ymm1, %ymm1
-; AVX1-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm2
-; AVX1-NEXT:    vextractf128 $1, %ymm1, %xmm1
-; AVX1-NEXT:    vpcmpeqd {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
-; AVX1-NEXT:    vinsertf128 $1, %xmm1, %ymm2, %ymm1
+; AVX1-NEXT:    vcmpeqps %ymm2, %ymm1, %ymm1
 ; AVX1-NEXT:    retq
 ;
 ; AVX2-LABEL: ext_i16_16i32:
diff --git a/llvm/test/CodeGen/X86/bitcast-int-to-vector-bool-zext.ll b/llvm/test/CodeGen/X86/bitcast-int-to-vector-bool-zext.ll
index ee39b1333fff31..c7785c56972d8c 100644
--- a/llvm/test/CodeGen/X86/bitcast-int-to-vector-bool-zext.ll
+++ b/llvm/test/CodeGen/X86/bitcast-int-to-vector-bool-zext.ll
@@ -327,10 +327,9 @@ define <8 x i32> @ext_i8_8i32(i8 %a0) {
 ; AVX1-NEXT:    vinsertf128 $1, %xmm0, %ymm0, %ymm0
 ; AVX1-NEXT:    vmovaps {{.*#+}} ymm1 = [1,2,4,8,16,32,64,128]
 ; AVX1-NEXT:    vandps %ymm1, %ymm0, %ymm0
-; AVX1-NEXT:    vpcmpeqd %xmm1, %xmm0, %xmm1
-; AVX1-NEXT:    vpsrld $31, %xmm1, %xmm1
+; AVX1-NEXT:    vcmpeqps %ymm1, %ymm0, %ymm0
+; AVX1-NEXT:    vpsrld $31, %xmm0, %xmm1
 ; AVX1-NEXT:    vextractf128 $1, %ymm0, %xmm0
-; AVX1-NEXT:    vpcmpeqd {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
 ; AVX1-NEXT:    vpsrld $31, %xmm0, %xmm0
 ; AVX1-NEXT:    vinsertf128 $1, %xmm0, %ymm1, %ymm0
 ; AVX1-NEXT:    retq
@@ -630,18 +629,16 @@ define <16 x i32> @ext_i16_16i32(i16 %a0) {
 ; AVX1-NEXT:    vinsertf128 $1, %xmm0, %ymm0, %ymm1
 ; AVX1-NEXT:    vmovaps {{.*#+}} ymm0 = [1,2,4,8,16,32,64,128]
 ; AVX1-NEXT:    vandps %ymm0, %ymm1, %ymm2
-; AVX1-NEXT:    vpcmpeqd %xmm0, %xmm2, %xmm0
+; AVX1-NEXT:    vcmpeqps %ymm0, %ymm2, %ymm0
+; AVX1-NEXT:    vpsrld $31, %xmm0, %xmm2
+; AVX1-NEXT:    vextractf128 $1, %ymm0, %xmm0
 ; AVX1-NEXT:    vpsrld $31, %xmm0, %xmm0
-; AVX1-NEXT:    vextractf128 $1, %ymm2, %xmm2
-; AVX1-NEXT:    vpcmpeqd {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm2, %xmm2
-; AVX1-NEXT:    vpsrld $31, %xmm2, %xmm2
-; AVX1-NEXT:    vinsertf128 $1, %xmm2, %ymm0, %ymm0
+; AVX1-NEXT:    vinsertf128 $1, %xmm0, %ymm2, %ymm0
 ; AVX1-NEXT:    vmovaps {{.*#+}} ymm2 = [256,512,1024,2048,4096,8192,16384,32768]
 ; AVX1-NEXT:    vandps %ymm2, %ymm1, %ymm1
-; AVX1-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm2
-; AVX1-NEXT:    vpsrld $31, %xmm2, %xmm2
+; AVX1-NEXT:    vcmpeqps %ymm2, %ymm1, %ymm1
+; AVX1-NEXT:    vpsrld $31, %xmm1, %xmm2
 ; AVX1-NEXT:    vextractf128 $1, %ymm1, %xmm1
-; AVX1-NEXT:    vpcmpeqd {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
 ; AVX1-NEXT:    vpsrld $31, %xmm1, %xmm1
 ; AVX1-NEXT:    vinsertf128 $1, %xmm1, %ymm2, %ymm1
 ; AVX1-NEXT:    retq
diff --git a/llvm/test/CodeGen/X86/cmpf-avx.ll b/llvm/test/CodeGen/X86/cmpf-avx.ll
new file mode 100644
index 00000000000000..f343e53b7fab83
--- /dev/null
+++ b/llvm/test/CodeGen/X86/cmpf-avx.ll
@@ -0,0 +1,242 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc < %s -mtriple=i686-unknown-unknown -mattr=+avx | FileCheck %s --check-prefixes=CHECK,X86
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx | FileCheck %s --check-prefixes=CHECK,X64
+
+define <8 x i32> @cmp_eq_bitcast(<8 x i32> %x) {
+; X86-LABEL: cmp_eq_bitcast:
+; X86:       # %bb.0:
+; X86-NEXT:    vandps {{\.?LCPI[0-9]+_[0-9]+}}, %ymm0, %ymm0
+; X86-NEXT:    vcmpeqps {{\.?LCPI[0-9]+_[0-9]+}}, %ymm0, %ymm0
+; X86-NEXT:    retl
+;
+; X64-LABEL: cmp_eq_bitcast:
+; X64:       # %bb.0:
+; X64-NEXT:    vandps {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm0, %ymm0
+; X64-NEXT:    vcmpeqps {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm0, %ymm0
+; X64-NEXT:    retq
+  %and = and <8 x i32> %x, <i32 7, i32 3, i32 3, i32 3, i32 3, i32 3, i32 3, i32 3>
+  %cmp = icmp eq <8 x i32> %and, <i32 3, i32 3, i32 3, i32 3, i32 3, i32 3, i32 3, i32 3>
+  %sext = sext <8 x i1> %cmp to <8 x i32>
+  ret <8 x i32> %sext
+}
+
+define <8 x i32> @cmp_ne_sitofp(<8 x i32> %x) {
+; X86-LABEL: cmp_ne_sitofp:
+; X86:       # %bb.0:
+; X86-NEXT:    vcvtdq2ps %ymm0, %ymm0
+; X86-NEXT:    vcmpneq_oqps {{\.?LCPI[0-9]+_[0-9]+}}, %ymm0, %ymm0
+; X86-NEXT:    retl
+;
+; X64-LABEL: cmp_ne_sitofp:
+; X64:       # %bb.0:
+; X64-NEXT:    vcvtdq2ps %ymm0, %ymm0
+; X64-NEXT:    vcmpneq_oqps {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm0, %ymm0
+; X64-NEXT:    retq
+  %cmp = icmp ne <8 x i32> %x, <i32 3, i32 3, i32 3, i32 3, i32 3, i32 3, i32 3, i32 3>
+  %sext = sext <8 x i1> %cmp to <8 x i32>
+  ret <8 x i32> %sext
+}
+
+define <8 x i32> @cmp_slt_fail_no_const(<8 x i32> %x, <8 x i32> %y) {
+; X86-LABEL: cmp_slt_fail_no_const:
+; X86:       # %bb.0:
+; X86-NEXT:    vandps {{\.?LCPI[0-9]+_[0-9]+}}, %ymm0, %ymm0
+; X86-NEXT:    vextractf128 $1, %ymm1, %xmm2
+; X86-NEXT:    vextractf128 $1, %ymm0, %xmm3
+; X86-NEXT:    vpcmpgtd %xmm3, %xmm2, %xmm2
+; X86-NEXT:    vpcmpgtd %xmm0, %xmm1, %xmm0
+; X86-NEXT:    vinsertf128 $1, %xmm2, %ymm0, %ymm0
+; X86-NEXT:    retl
+;
+; X64-LABEL: cmp_slt_fail_no_const:
+; X64:       # %bb.0:
+; X64-NEXT:    vandps {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm0, %ymm0
+; X64-NEXT:    vextractf128 $1, %ymm1, %xmm2
+; X64-NEXT:    vextractf128 $1, %ymm0, %xmm3
+; X64-NEXT:    vpcmpgtd %xmm3, %xmm2, %xmm2
+; X64-NEXT:    vpcmpgtd %xmm0, %xmm1, %xmm0
+; X64-NEXT:    vinsertf128 $1, %xmm2, %ymm0, %ymm0
+; X64-NEXT:    retq
+  %and = and <8 x i32> %x, <i32 3, i32 3, i32 3, i32 3, i32 3, i32 3, i32 3, i32 3>
+  %cmp = icmp slt <8 x i32> %and, %y
+  %sext = sext <8 x i1> %cmp to <8 x i32>
+  ret <8 x i32> %sext
+}
+
+define <8 x i32> @cmp_eq_sitofp(<8 x i32> %x) {
+; X86-LABEL: cmp_eq_sitofp:
+; X86:       # %bb.0:
+; X86-NEXT:    vcvtdq2ps %ymm0, %ymm0
+; X86-NEXT:    vcmpeqps {{\.?LCPI[0-9]+_[0-9]+}}, %ymm0, %ymm0
+; X86-NEXT:    retl
+;
+; X64-LABEL: cmp_eq_sitofp:
+; X64:       # %bb.0:
+; X64-NEXT:    vcvtdq2ps %ymm0, %ymm0
+; X64-NEXT:    vcmpeqps {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm0, %ymm0
+; X64-NEXT:    retq
+  %cmp = icmp eq <8 x i32> %x, <i32 -3, i32 -3, i32 -3, i32 -3, i32 -3, i32 -3, i32 -3, i32 -3>
+  %sext = sext <8 x i1> %cmp to <8 x i32>
+  ret <8 x i32> %sext
+}
+
+define <8 x i32> @cmp_sgt_fail_no_bounds(<8 x i32> %x, <8 x i32> %y) {
+; CHECK-LABEL: cmp_sgt_fail_no_bounds:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vextractf128 $1, %ymm0, %xmm2
+; CHECK-NEXT:    vextractf128 $1, %ymm1, %xmm3
+; CHECK-NEXT:    vpcmpgtd %xmm2, %xmm3, %xmm2
+; CHECK-NEXT:    vpcmpgtd %xmm0, %xmm1, %xmm0
+; CHECK-NEXT:    vinsertf128 $1, %xmm2, %ymm0, %ymm0
+; CHECK-NEXT:    ret{{[l|q]}}
+  %cmp = icmp slt <8 x i32> %x, %y
+  %sext = sext <8 x i1> %cmp to <8 x i32>
+  ret <8 x i32> %sext
+}
+
+define <8 x i32> @cmp_sgt_bitcast(<8 x i32> %xx, <8 x i32> %yy) {
+; CHECK-LABEL: cmp_sgt_bitcast:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vbroadcastss {{.*#+}} ymm2 = [2139095040,2139095040,2139095040,2139095040,2139095040,2139095040,2139095040,2139095040]
+; CHECK-NEXT:    vandps %ymm2, %ymm0, %ymm0
+; CHECK-NEXT:    vandps %ymm2, %ymm1, %ymm1
+; CHECK-NEXT:    vcmpltps %ymm0, %ymm1, %ymm0
+; CHECK-NEXT:    ret{{[l|q]}}
+  %x = and <8 x i32> %xx, <i32 2139095040, i32 2139095040, i32 2139095040, i32 2139095040, i32 2139095040, i32 2139095040, i32 2139095040, i32 2139095040>
+  %y = and <8 x i32> %yy, <i32 2139095040, i32 2139095040, i32 2139095040, i32 2139095040, i32 2139095040, i32 2139095040, i32 2139095040, i32 2139095040>
+
+  %cmp = icmp sgt <8 x i32> %x, %y
+  %sext = sext <8 x i1> %cmp to <8 x i32>
+  ret <8 x i32> %sext
+}
+
+define <8 x i32> @cmp_sle_fail_out_of_bounds(<8 x i32> %xx) {
+; X86-LABEL: cmp_sle_fail_out_of_bounds:
+; X86:       # %bb.0:
+; X86-NEXT:    vandps {{\.?LCPI[0-9]+_[0-9]+}}, %ymm0, %ymm0
+; X86-NEXT:    vextractf128 $1, %ymm0, %xmm1
+; X86-NEXT:    vbroadcastss {{.*#+}} xmm2 = [2139095041,2139095041,2139095041,2139095041]
+; X86-NEXT:    vpcmpgtd %xmm1, %xmm2, %xmm1
+; X86-NEXT:    vpcmpgtd %xmm0, %xmm2, %xmm0
+; X86-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; X86-NEXT:    retl
+;
+; X64-LABEL: cmp_sle_fail_out_of_bounds:
+; X64:       # %bb.0:
+; X64-NEXT:    vandps {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm0, %ymm0
+; X64-NEXT:    vextractf128 $1, %ymm0, %xmm1
+; X64-NEXT:    vbroadcastss {{.*#+}} xmm2 = [2139095041,2139095041,2139095041,2139095041]
+; X64-NEXT:    vpcmpgtd %xmm1, %xmm2, %xmm1
+; X64-NEXT:    vpcmpgtd %xmm0, %xmm2, %xmm0
+; X64-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; X64-NEXT:    retq
+  %x = and <8 x i32> %xx, <i32 2139095041, i32 2139095040, i32 2139095040, i32 2139095040, i32 2139095040, i32 2139095040, i32 2139095040, i32 2139095040>
+  %cmp = icmp sle <8 x i32> %x, <i32 2139095040, i32 2139095040, i32 2139095040, i32 2139095040, i32 2139095040, i32 2139095040, i32 2139095040, i32 2139095040>
+  %sext = sext <8 x i1> %cmp to <8 x i32>
+  ret <8 x i32> %sext
+}
+
+define <8 x i32> @cmp_eq_fail_out_of_bounds(<8 x i32> %x) {
+; CHECK-LABEL: cmp_eq_fail_out_of_bounds:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vextractf128 $1, %ymm0, %xmm1
+; CHECK-NEXT:    vbroadcastss {{.*#+}} xmm2 = [16777216,16777216,16777216,16777216]
+; CHECK-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
+; CHECK-NEXT:    vpcmpeqd %xmm2, %xmm0, %xmm0
+; CHECK-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; CHECK-NEXT:    ret{{[l|q]}}
+  %cmp = icmp eq <8 x i32> %x, <i32 16777216, i32 16777216, i32 16777216, i32 16777216, i32 16777216, i32 16777216, i32 16777216, i32 16777216>
+  %sext = sext <8 x i1> %cmp to <8 x i32>
+  ret <8 x i32> %sext
+}
+
+define <8 x i32> @cmp_eq_fail_out_of_bounds2(<8 x i32> %x) {
+; CHECK-LABEL: cmp_eq_fail_out_of_bounds2:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vextractf128 $1, %ymm0, %xmm1
+; CHECK-NEXT:    vbroadcastss {{.*#+}} xmm2 = [4278190080,4278190080,4278190080,4278190080]
+; CHECK-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
+; CHECK-NEXT:    vpcmpeqd %xmm2, %xmm0, %xmm0
+; CHECK-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; CHECK-NEXT:    ret{{[l|q]}}
+  %cmp = icmp eq <8 x i32> %x, <i32 -16777216, i32 -16777216, i32 -16777216, i32 -16777216, i32 -16777216, i32 -16777216, i32 -16777216, i32 -16777216>
+  %sext = sext <8 x i1> %cmp to <8 x i32>
+  ret <8 x i32> %sext
+}
+
+define <8 x i32> @cmp_eq_todo(<8 x i32> %x) {
+; X86-LABEL: cmp_eq_todo:
+; X86:       # %bb.0:
+; X86-NEXT:    vpcmpeqd {{\.?LCPI[0-9]+_[0-9]+}}, %xmm0, %xmm1
+; X86-NEXT:    vextractf128 $1, %ymm0, %xmm0
+; X86-NEXT:    vpcmpeqd {{\.?LCPI[0-9]+_[0-9]+}}, %xmm0, %xmm0
+; X86-NEXT:    vinsertf128 $1, %xmm0, %ymm1, %ymm0
+; X86-NEXT:    retl
+;
+; X64-LABEL: cmp_eq_todo:
+; X64:       # %bb.0:
+; X64-NEXT:    vpcmpeqd {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm1
+; X64-NEXT:    vextractf128 $1, %ymm0, %xmm0
+; X64-NEXT:    vpcmpeqd {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; X64-NEXT:    vinsertf128 $1, %xmm0, %ymm1, %ymm0
+; X64-NEXT:    retq
+  %cmp = icmp eq <8 x i32> %x, <i32 -16777215, i32 16777215, i32 16777215, i32 -16777215, i32 16777215, i32 -16777215, i32 16777215, i32 -16777215>
+  %sext = sext <8 x i1> %cmp to <8 x i32>
+  ret <8 x i32> %sext
+}
+
+define <8 x i32> @cmp_ult_fail_maybe_negative(<8 x i32> %x) {
+; CHECK-LABEL: cmp_ult_fail_maybe_negative:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vextractf128 $1, %ymm0, %xmm1
+; CHECK-NEXT:    vbroadcastss {{.*#+}} xmm2 = [2,2,2,2]
+; CHECK-NEXT:    vpminud %xmm2, %xmm1, %xmm3
+; CHECK-NEXT:    vpcmpeqd %xmm3, %xmm1, %xmm1
+; CHECK-NEXT:    vpminud %xmm2, %xmm0, %xmm2
+; CHECK-NEXT:    vpcmpeqd %xmm2, %xmm0, %xmm0
+; CHECK-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; CHECK-NEXT:    ret{{[l|q]}}
+  %cmp = icmp ult <8 x i32> %x, <i32 3, i32 3, i32 3, i32 3, i32 3, i32 3, i32 3, i32 3>
+  %sext = sext <8 x i1> %cmp to <8 x i32>
+  ret <8 x i32> %sext
+}
+
+define <8 x i32> @cmp_ule_bitcast(<8 x i32> %xx) {
+; X86-LABEL: cmp_ule_bitcast:
+; X86:       # %bb.0:
+; X86-NEXT:    vandps {{\.?LCPI[0-9]+_[0-9]+}}, %ymm0, %ymm0
+; X86-NEXT:    vcmpltps {{\.?LCPI[0-9]+_[0-9]+}}, %ymm0, %ymm0
+; X86-NEXT:    retl
+;
+; X64-LABEL: cmp_ule_bitcast:
+; X64:       # %bb.0:
+; X64-NEXT:    vandps {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm0, %ymm0
+; X64-NEXT:    vcmpltps {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm0, %ymm0
+; X64-NEXT:    retq
+  %x = and <8 x i32> %xx, <i32 2139095040, i32 2139095040, i32 2139095040, i32 2139095040, i32 2139095040, i32 2139095040, i32 2139095040, i32 2139095040>
+  %cmp = icmp ule <8 x i32> %x, <i32 3, i32 3, i32 3, i32 3, i32 3, i32 3, i32 3, i32 3>
+  %sext = sext <8 x i1> %cmp to <8 x i32>
+  ret <8 x i32> %sext
+}
+
+define <8 x i32> @cmp_ugt_sitofp(<8 x i32> %xx) {
+; X86-LABEL: cmp_ugt_sitofp:
+; X86:       # %bb.0:
+; X86-NEXT:    vandps {{\.?LCPI[0-9]+_[0-9]+}}, %ymm0, %ymm0
+; X86-NEXT:    vcvtdq2ps %ymm0, %ymm0
+; X86-NEXT:    vbroadcastss {{.*#+}} ymm1 = [3.0E+0,3.0E+0,3.0E+0,3.0E+0,3.0E+0,3.0E+0,3.0E+0,3.0E+0]
+; X86-NEXT:    vcmpltps %ymm0, %ymm1, %ymm0
+; X86-NEXT:    retl
+;
+; X64-LABEL: cmp_ugt_sitofp:
+; X64:       # %bb.0:
+; X64-NEXT:    vandps {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm0, %ymm0
+; X64-NEXT:    vcvtdq2ps %ymm0, %ymm0
+; X64-NEXT:    vbroadcastss {{.*#+}} ymm1 = [3.0E+0,3.0E+0,3.0E+0,3.0E+0,3.0E+0,3.0E+0,3.0E+0,3.0E+0]
+; X64-NEXT:    vcmpltps %ymm0, %ymm1, %ymm0
+; X64-NEXT:    retq
+  %x = and <8 x i32> %xx, <i32 2147483647, i32 2147483647, i32 2147483647, i32 2147483647, i32 2147483647, i32 2147483647, i32 2147483647, i32 2147483647>
+  %cmp = icmp ugt <8 x i32> %x, <i32 3, i32 3, i32 3, i32 3, i32 3, i32 3, i32 3, i32 3>
+  %sext = sext <8 x i1> %cmp to <8 x i32>
+  ret <8 x i32> %sext
+}
diff --git a/llvm/test/CodeGen/X86/combine-testps.ll b/llvm/test/CodeGen/X86/combine-testps.ll
index 43dddbecf51a7d..66165ce2aa53a5 100644
--- a/llvm/test/CodeGen/X86/combine-testps.ll
+++ b/llvm/test/CodeGen/X86/combine-testps.ll
@@ -171,13 +171,24 @@ define i32 @testpsz_128_signbit(<4 x float> %c, <4 x float> %d, i32 %a, i32 %b)
 }
 
 define i32 @testpsnzc_256_signbit(<8 x float> %c, <8 x float> %d, i32 %a, i32 %b) {
-; CHECK-LABEL: testpsnzc_256_signbit:
-; CHECK:       # %bb.0:
-; CHECK-NEXT:    movl %edi, %eax
-; CHECK-NEXT:    vtestps %ymm1, %ymm0
-; CHECK-NEXT:    cmovnel %esi, %eax
-; CHECK-NEXT:    vzeroupper
-; CHECK-NEXT:    retq
+; AVX-LABEL: testpsnzc_256_signbit:
+; AVX:       # %bb.0:
+; AVX-NEXT:    movl %edi, %eax
+; AVX-NEXT:    vcvtdq2ps %ymm0, %ymm0
+; AVX-NEXT:    vxorps %xmm2, %xmm2, %xmm2
+; AVX-NEXT:    vcmpltps %ymm2, %ymm0, %ymm0
+; AVX-NEXT:    vtestps %ymm1, %ymm0
+; AVX-NEXT:    cmovnel %esi, %eax
+; AVX-NEXT:    vzeroupper
+; AVX-NEXT:    retq
+;
+; AVX2-LABEL: testpsnzc_256_signbit:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    movl %edi, %eax
+; AVX2-NEXT:    vtestps %ymm1, %ymm0
+; AVX2-NEXT:    cmovnel %esi, %eax
+; AVX2-NEXT:    vzeroupper
+; AVX2-NEXT:    retq
   %t0 = b...
[truncated]

@goldsteinn (Contributor Author)

NB: Proofs of the transform: https://alive2.llvm.org/ce/z/AJDdQ8

They time out online, so they need to be run locally. Here is the output:

; $> /home/noah/programs/opensource/llvm-dev/src/alive2/build/alive-tv (-smt-to=200000000)

----------------------------------------
define i1 @src_eq(i32 %x, i32 %C) {
#0:
  %C_abs = abs i32 %C, 0
  %X_abs = abs i32 %x, 0
  %C_lemma = icmp ult i32 %C_abs, 16777216
  %X_lemma = icmp ult i32 %X_abs, 16777216
  %lemma = or i1 %C_lemma, %X_lemma
  assume i1 %lemma
  %cmp = icmp eq i32 %x, %C
  ret i1 %cmp
}
=>
define i1 @tgt_eq(i32 %x, i32 %C) {
#0:
  %C_abs = abs i32 %C, 0
  %X_abs = abs i32 %x, 0
  %C_lemma = icmp ult i32 %C_abs, 16777216
  %X_lemma = icmp ult i32 %X_abs, 16777216
  %lemma = or i1 %C_lemma, %X_lemma
  assume i1 %lemma
  %CFp = sitofp i32 %C to float
  %conv = sitofp i32 %x to float
  %cmp = fcmp oeq float %conv, %CFp
  ret i1 %cmp
}
Transformation seems to be correct!


----------------------------------------
define i1 @src_ne(i32 %x, i32 %C) {
#0:
  %C_abs = abs i32 %C, 0
  %X_abs = abs i32 %x, 0
  %C_lemma = icmp ult i32 %C_abs, 16777216
  %X_lemma = icmp ult i32 %X_abs, 16777216
  %lemma = or i1 %C_lemma, %X_lemma
  assume i1 %lemma
  %cmp = icmp ne i32 %x, %C
  ret i1 %cmp
}
=>
define i1 @tgt_ne(i32 %x, i32 %C) {
#0:
  %C_abs = abs i32 %C, 0
  %X_abs = abs i32 %x, 0
  %C_lemma = icmp ult i32 %C_abs, 16777216
  %X_lemma = icmp ult i32 %X_abs, 16777216
  %lemma = or i1 %C_lemma, %X_lemma
  assume i1 %lemma
  %CFp = sitofp i32 %C to float
  %conv = sitofp i32 %x to float
  %cmp = fcmp one float %conv, %CFp
  ret i1 %cmp
}
Transformation seems to be correct!


----------------------------------------
define i1 @src_slt(i32 %x, i32 %C) {
#0:
  %C_abs = abs i32 %C, 0
  %X_abs = abs i32 %x, 0
  %C_lemma = icmp ult i32 %C_abs, 16777216
  %X_lemma = icmp ult i32 %X_abs, 16777216
  %lemma = or i1 %C_lemma, %X_lemma
  assume i1 %lemma
  %cmp = icmp slt i32 %x, %C
  ret i1 %cmp
}
=>
define i1 @tgt_slt(i32 %x, i32 %C) {
#0:
  %C_abs = abs i32 %C, 0
  %X_abs = abs i32 %x, 0
  %C_lemma = icmp ult i32 %C_abs, 16777216
  %X_lemma = icmp ult i32 %X_abs, 16777216
  %lemma = or i1 %C_lemma, %X_lemma
  assume i1 %lemma
  %CFp = sitofp i32 %C to float
  %conv = sitofp i32 %x to float
  %cmp = fcmp olt float %conv, %CFp
  ret i1 %cmp
}
Transformation seems to be correct!


----------------------------------------
define i1 @src_sgt(i32 %x, i32 %C) {
#0:
  %C_abs = abs i32 %C, 0
  %X_abs = abs i32 %x, 0
  %C_lemma = icmp ult i32 %C_abs, 16777216
  %X_lemma = icmp ult i32 %X_abs, 16777216
  %lemma = or i1 %C_lemma, %X_lemma
  assume i1 %lemma
  %cmp = icmp sgt i32 %x, %C
  ret i1 %cmp
}
=>
define i1 @tgt_sgt(i32 %x, i32 %C) {
#0:
  %C_abs = abs i32 %C, 0
  %X_abs = abs i32 %x, 0
  %C_lemma = icmp ult i32 %C_abs, 16777216
  %X_lemma = icmp ult i32 %X_abs, 16777216
  %lemma = or i1 %C_lemma, %X_lemma
  assume i1 %lemma
  %CFp = sitofp i32 %C to float
  %conv = sitofp i32 %x to float
  %cmp = fcmp ogt float %conv, %CFp
  ret i1 %cmp
}
Transformation seems to be correct!


----------------------------------------
define i1 @src_sle(i32 %x, i32 %C) {
#0:
  %C_abs = abs i32 %C, 0
  %X_abs = abs i32 %x, 0
  %C_lemma = icmp ult i32 %C_abs, 16777216
  %X_lemma = icmp ult i32 %X_abs, 16777216
  %lemma = or i1 %C_lemma, %X_lemma
  assume i1 %lemma
  %cmp = icmp sle i32 %x, %C
  ret i1 %cmp
}
=>
define i1 @tgt_sle(i32 %x, i32 %C) {
#0:
  %C_abs = abs i32 %C, 0
  %X_abs = abs i32 %x, 0
  %C_lemma = icmp ult i32 %C_abs, 16777216
  %X_lemma = icmp ult i32 %X_abs, 16777216
  %lemma = or i1 %C_lemma, %X_lemma
  assume i1 %lemma
  %CFp = sitofp i32 %C to float
  %conv = sitofp i32 %x to float
  %cmp = fcmp ole float %conv, %CFp
  ret i1 %cmp
}
Transformation seems to be correct!


----------------------------------------
define i1 @src_sge(i32 %x, i32 %C) {
#0:
  %C_abs = abs i32 %C, 0
  %X_abs = abs i32 %x, 0
  %C_lemma = icmp ult i32 %C_abs, 16777216
  %X_lemma = icmp ult i32 %X_abs, 16777216
  %lemma = or i1 %C_lemma, %X_lemma
  assume i1 %lemma
  %cmp = icmp sge i32 %x, %C
  ret i1 %cmp
}
=>
define i1 @tgt_sge(i32 %x, i32 %C) {
#0:
  %C_abs = abs i32 %C, 0
  %X_abs = abs i32 %x, 0
  %C_lemma = icmp ult i32 %C_abs, 16777216
  %X_lemma = icmp ult i32 %X_abs, 16777216
  %lemma = or i1 %C_lemma, %X_lemma
  assume i1 %lemma
  %CFp = sitofp i32 %C to float
  %conv = sitofp i32 %x to float
  %cmp = fcmp oge float %conv, %CFp
  ret i1 %cmp
}
Transformation seems to be correct!


----------------------------------------
define i1 @src_ult(i32 %x, i32 %C) {
#0:
  %C_lemma0 = icmp ult i32 %C, 16777216
  %X_lemma0 = icmp ult i32 %x, 16777216
  %lemma0 = or i1 %C_lemma0, %X_lemma0
  assume i1 %lemma0
  %C_lemma1 = icmp sge i32 %C, 0
  %X_lemma1 = icmp sge i32 %x, 0
  %lemma1 = and i1 %C_lemma1, %X_lemma1
  assume i1 %lemma1
  %cmp = icmp ult i32 %x, %C
  ret i1 %cmp
}
=>
define i1 @tgt_ult(i32 %x, i32 %C) {
#0:
  %C_lemma0 = icmp ult i32 %C, 16777216
  %X_lemma0 = icmp ult i32 %x, 16777216
  %lemma0 = or i1 %C_lemma0, %X_lemma0
  assume i1 %lemma0
  %C_lemma1 = icmp sge i32 %C, 0
  %X_lemma1 = icmp sge i32 %x, 0
  %lemma1 = and i1 %C_lemma1, %X_lemma1
  assume i1 %lemma1
  %CFp = sitofp i32 %C to float
  %conv = sitofp i32 %x to float
  %cmp = fcmp olt float %conv, %CFp
  ret i1 %cmp
}
Transformation seems to be correct!


----------------------------------------
define i1 @src_ule(i32 %x, i32 %C) {
#0:
  %C_lemma0 = icmp ult i32 %C, 16777216
  %X_lemma0 = icmp ult i32 %x, 16777216
  %lemma0 = or i1 %C_lemma0, %X_lemma0
  assume i1 %lemma0
  %C_lemma1 = icmp sge i32 %C, 0
  %X_lemma1 = icmp sge i32 %x, 0
  %lemma1 = and i1 %C_lemma1, %X_lemma1
  assume i1 %lemma1
  %cmp = icmp ule i32 %x, %C
  ret i1 %cmp
}
=>
define i1 @tgt_ule(i32 %x, i32 %C) {
#0:
  %C_lemma0 = icmp ult i32 %C, 16777216
  %X_lemma0 = icmp ult i32 %x, 16777216
  %lemma0 = or i1 %C_lemma0, %X_lemma0
  assume i1 %lemma0
  %C_lemma1 = icmp sge i32 %C, 0
  %X_lemma1 = icmp sge i32 %x, 0
  %lemma1 = and i1 %C_lemma1, %X_lemma1
  assume i1 %lemma1
  %CFp = sitofp i32 %C to float
  %conv = sitofp i32 %x to float
  %cmp = fcmp ole float %conv, %CFp
  ret i1 %cmp
}
Transformation seems to be correct!


----------------------------------------
define i1 @src_ugt(i32 %x, i32 %C) {
#0:
  %C_lemma0 = icmp ult i32 %C, 16777216
  %X_lemma0 = icmp ult i32 %x, 16777216
  %lemma0 = or i1 %C_lemma0, %X_lemma0
  assume i1 %lemma0
  %C_lemma1 = icmp sge i32 %C, 0
  %X_lemma1 = icmp sge i32 %x, 0
  %lemma1 = and i1 %C_lemma1, %X_lemma1
  assume i1 %lemma1
  %cmp = icmp ugt i32 %x, %C
  ret i1 %cmp
}
=>
define i1 @tgt_ugt(i32 %x, i32 %C) {
#0:
  %C_lemma0 = icmp ult i32 %C, 16777216
  %X_lemma0 = icmp ult i32 %x, 16777216
  %lemma0 = or i1 %C_lemma0, %X_lemma0
  assume i1 %lemma0
  %C_lemma1 = icmp sge i32 %C, 0
  %X_lemma1 = icmp sge i32 %x, 0
  %lemma1 = and i1 %C_lemma1, %X_lemma1
  assume i1 %lemma1
  %CFp = sitofp i32 %C to float
  %conv = sitofp i32 %x to float
  %cmp = fcmp ogt float %conv, %CFp
  ret i1 %cmp
}
Transformation seems to be correct!


----------------------------------------
define i1 @src_uge(i32 %x, i32 %C) {
#0:
  %C_lemma0 = icmp ult i32 %C, 16777216
  %X_lemma0 = icmp ult i32 %x, 16777216
  %lemma0 = or i1 %C_lemma0, %X_lemma0
  assume i1 %lemma0
  %C_lemma1 = icmp sge i32 %C, 0
  %X_lemma1 = icmp sge i32 %x, 0
  %lemma1 = and i1 %C_lemma1, %X_lemma1
  assume i1 %lemma1
  %cmp = icmp uge i32 %x, %C
  ret i1 %cmp
}
=>
define i1 @tgt_uge(i32 %x, i32 %C) {
#0:
  %C_lemma0 = icmp ult i32 %C, 16777216
  %X_lemma0 = icmp ult i32 %x, 16777216
  %lemma0 = or i1 %C_lemma0, %X_lemma0
  assume i1 %lemma0
  %C_lemma1 = icmp sge i32 %C, 0
  %X_lemma1 = icmp sge i32 %x, 0
  %lemma1 = and i1 %C_lemma1, %X_lemma1
  assume i1 %lemma1
  %CFp = sitofp i32 %C to float
  %conv = sitofp i32 %x to float
  %cmp = fcmp oge float %conv, %CFp
  ret i1 %cmp
}
Transformation seems to be correct!


----------------------------------------
define i1 @src_eq_bitcast(i32 %x, i32 %C) {
#0:
  %C_lemma = icmp ult i32 %C, 2139095041
  %X_lemma = icmp ult i32 %x, 2139095041
  %lemma = and i1 %C_lemma, %X_lemma
  assume i1 %lemma
  %cmp = icmp eq i32 %x, %C
  ret i1 %cmp
}
=>
define i1 @tgt_eq_bitcast(i32 %x, i32 %C) {
#0:
  %C_lemma = icmp ult i32 %C, 2139095041
  %X_lemma = icmp ult i32 %x, 2139095041
  %lemma = and i1 %C_lemma, %X_lemma
  assume i1 %lemma
  %CFp = bitcast i32 %C to float
  %conv = bitcast i32 %x to float
  %cmp = fcmp oeq float %conv, %CFp
  ret i1 %cmp
}
Transformation seems to be correct!


----------------------------------------
define i1 @src_ne_bitcast(i32 %x, i32 %C) {
#0:
  %C_lemma = icmp ult i32 %C, 2139095041
  %X_lemma = icmp ult i32 %x, 2139095041
  %lemma = and i1 %C_lemma, %X_lemma
  assume i1 %lemma
  %cmp = icmp ne i32 %x, %C
  ret i1 %cmp
}
=>
define i1 @tgt_ne_bitcast(i32 %x, i32 %C) {
#0:
  %C_lemma = icmp ult i32 %C, 2139095041
  %X_lemma = icmp ult i32 %x, 2139095041
  %lemma = and i1 %C_lemma, %X_lemma
  assume i1 %lemma
  %CFp = bitcast i32 %C to float
  %conv = bitcast i32 %x to float
  %cmp = fcmp one float %conv, %CFp
  ret i1 %cmp
}
Transformation seems to be correct!


----------------------------------------
define i1 @src_slt_bitcast(i32 %x, i32 %C) {
#0:
  %C_lemma = icmp ult i32 %C, 2139095041
  %X_lemma = icmp ult i32 %x, 2139095041
  %lemma = and i1 %C_lemma, %X_lemma
  assume i1 %lemma
  %cmp = icmp slt i32 %x, %C
  ret i1 %cmp
}
=>
define i1 @tgt_slt_bitcast(i32 %x, i32 %C) {
#0:
  %C_lemma = icmp ult i32 %C, 2139095041
  %X_lemma = icmp ult i32 %x, 2139095041
  %lemma = and i1 %C_lemma, %X_lemma
  assume i1 %lemma
  %CFp = bitcast i32 %C to float
  %conv = bitcast i32 %x to float
  %cmp = fcmp olt float %conv, %CFp
  ret i1 %cmp
}
Transformation seems to be correct!


----------------------------------------
define i1 @src_sgt_bitcast(i32 %x, i32 %C) {
#0:
  %C_lemma = icmp ult i32 %C, 2139095041
  %X_lemma = icmp ult i32 %x, 2139095041
  %lemma = and i1 %C_lemma, %X_lemma
  assume i1 %lemma
  %cmp = icmp sgt i32 %x, %C
  ret i1 %cmp
}
=>
define i1 @tgt_sgt_bitcast(i32 %x, i32 %C) {
#0:
  %C_lemma = icmp ult i32 %C, 2139095041
  %X_lemma = icmp ult i32 %x, 2139095041
  %lemma = and i1 %C_lemma, %X_lemma
  assume i1 %lemma
  %CFp = bitcast i32 %C to float
  %conv = bitcast i32 %x to float
  %cmp = fcmp ogt float %conv, %CFp
  ret i1 %cmp
}
Transformation seems to be correct!


----------------------------------------
define i1 @src_sle_bitcast(i32 %x, i32 %C) {
#0:
  %C_lemma = icmp ult i32 %C, 2139095041
  %X_lemma = icmp ult i32 %x, 2139095041
  %lemma = and i1 %C_lemma, %X_lemma
  assume i1 %lemma
  %cmp = icmp sle i32 %x, %C
  ret i1 %cmp
}
=>
define i1 @tgt_sle_bitcast(i32 %x, i32 %C) {
#0:
  %C_lemma = icmp ult i32 %C, 2139095041
  %X_lemma = icmp ult i32 %x, 2139095041
  %lemma = and i1 %C_lemma, %X_lemma
  assume i1 %lemma
  %CFp = bitcast i32 %C to float
  %conv = bitcast i32 %x to float
  %cmp = fcmp ole float %conv, %CFp
  ret i1 %cmp
}
Transformation seems to be correct!


----------------------------------------
define i1 @src_sge_bitcast(i32 %x, i32 %C) {
#0:
  %C_lemma = icmp ult i32 %C, 2139095041
  %X_lemma = icmp ult i32 %x, 2139095041
  %lemma = and i1 %C_lemma, %X_lemma
  assume i1 %lemma
  %cmp = icmp sge i32 %x, %C
  ret i1 %cmp
}
=>
define i1 @tgt_sge_bitcast(i32 %x, i32 %C) {
#0:
  %C_lemma = icmp ult i32 %C, 2139095041
  %X_lemma = icmp ult i32 %x, 2139095041
  %lemma = and i1 %C_lemma, %X_lemma
  assume i1 %lemma
  %CFp = bitcast i32 %C to float
  %conv = bitcast i32 %x to float
  %cmp = fcmp oge float %conv, %CFp
  ret i1 %cmp
}
Transformation seems to be correct!


----------------------------------------
define i1 @src_ult_bitcast(i32 %x, i32 %C) {
#0:
  %C_lemma = icmp ult i32 %C, 2139095041
  %X_lemma = icmp ult i32 %x, 2139095041
  %lemma = and i1 %C_lemma, %X_lemma
  assume i1 %lemma
  %cmp = icmp ult i32 %x, %C
  ret i1 %cmp
}
=>
define i1 @tgt_ult_bitcast(i32 %x, i32 %C) {
#0:
  %C_lemma = icmp ult i32 %C, 2139095041
  %X_lemma = icmp ult i32 %x, 2139095041
  %lemma = and i1 %C_lemma, %X_lemma
  assume i1 %lemma
  %CFp = bitcast i32 %C to float
  %conv = bitcast i32 %x to float
  %cmp = fcmp olt float %conv, %CFp
  ret i1 %cmp
}
Transformation seems to be correct!


----------------------------------------
define i1 @src_ule_bitcast(i32 %x, i32 %C) {
#0:
  %C_lemma = icmp ult i32 %C, 2139095041
  %X_lemma = icmp ult i32 %x, 2139095041
  %lemma = and i1 %C_lemma, %X_lemma
  assume i1 %lemma
  %cmp = icmp ule i32 %x, %C
  ret i1 %cmp
}
=>
define i1 @tgt_ule_bitcast(i32 %x, i32 %C) {
#0:
  %C_lemma = icmp ult i32 %C, 2139095041
  %X_lemma = icmp ult i32 %x, 2139095041
  %lemma = and i1 %C_lemma, %X_lemma
  assume i1 %lemma
  %CFp = bitcast i32 %C to float
  %conv = bitcast i32 %x to float
  %cmp = fcmp ole float %conv, %CFp
  ret i1 %cmp
}
Transformation seems to be correct!


----------------------------------------
define i1 @src_ugt_bitcast(i32 %x, i32 %C) {
#0:
  %C_lemma = icmp ult i32 %C, 2139095041
  %X_lemma = icmp ult i32 %x, 2139095041
  %lemma = and i1 %C_lemma, %X_lemma
  assume i1 %lemma
  %cmp = icmp ugt i32 %x, %C
  ret i1 %cmp
}
=>
define i1 @tgt_ugt_bitcast(i32 %x, i32 %C) {
#0:
  %C_lemma = icmp ult i32 %C, 2139095041
  %X_lemma = icmp ult i32 %x, 2139095041
  %lemma = and i1 %C_lemma, %X_lemma
  assume i1 %lemma
  %CFp = bitcast i32 %C to float
  %conv = bitcast i32 %x to float
  %cmp = fcmp ogt float %conv, %CFp
  ret i1 %cmp
}
Transformation seems to be correct!


----------------------------------------
define i1 @src_uge_bitcast(i32 %x, i32 %C) {
#0:
  %C_lemma = icmp ult i32 %C, 2139095041
  %X_lemma = icmp ult i32 %x, 2139095041
  %lemma = and i1 %C_lemma, %X_lemma
  assume i1 %lemma
  %cmp = icmp uge i32 %x, %C
  ret i1 %cmp
}
=>
define i1 @tgt_uge_bitcast(i32 %x, i32 %C) {
#0:
  %C_lemma = icmp ult i32 %C, 2139095041
  %X_lemma = icmp ult i32 %x, 2139095041
  %lemma = and i1 %C_lemma, %X_lemma
  assume i1 %lemma
  %CFp = bitcast i32 %C to float
  %conv = bitcast i32 %x to float
  %cmp = fcmp oge float %conv, %CFp
  ret i1 %cmp
}
Transformation seems to be correct!

Summary:
  20 correct transformations
  0 incorrect transformations
  0 failed-to-prove transformations
  0 Alive2 errors

@@ -861,6 +861,9 @@ define <8 x i32> @v8i32(<8 x i32> %x, <8 x i32> %y) nounwind {
; AVX1-NEXT: vpcmpgtd %xmm2, %xmm3, %xmm3
; AVX1-NEXT: vpcmpgtd %xmm4, %xmm0, %xmm0
; AVX1-NEXT: vinsertf128 $1, %xmm3, %ymm0, %ymm0
; AVX1-NEXT: vcvtdq2ps %ymm1, %ymm1
; AVX1-NEXT: vpxor %xmm3, %xmm3, %xmm3
; AVX1-NEXT: vcmpltps %ymm3, %ymm1, %ymm1
Contributor Author:

Not sure why this is happening. It seems we were determining the compare is dead before this change, but not after.
Not sure what the best way to fix this is.

Collaborator:

My guess is the PCMPGT handling in computeKnownBitsForTargetNode was managing to constant fold - but for some reason it never succeeded earlier?

Collaborator:

Found it, it's actually this fold in SimplifyDemandedBitsForTargetNode:

  case X86ISD::PCMPGT:
    // icmp sgt(0, R) == ashr(R, BitWidth-1).
    // iff we only need the sign bit then we can use R directly.
    if (OriginalDemandedBits.isSignMask() &&
        ISD::isBuildVectorAllZeros(Op.getOperand(0).getNode()))
      return TLO.CombineTo(Op, Op.getOperand(1));
    break;
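
(A standalone check, editor-added, of the identity behind that fold: when only the sign bit is demanded, pcmpgt(0, R), i.e. lanewise (0 > R) ? -1 : 0, and R itself agree, since both have the sign bit set exactly when R < 0.)

#include <cassert>
#include <cstdint>

int main() {
  for (int64_t I = INT32_MIN; I <= INT32_MAX; I += 65521) { // sampled sweep
    int32_t R = static_cast<int32_t>(I);
    int32_t Cmp = (0 > R) ? -1 : 0; // lanewise pcmpgt(0, R)
    assert((Cmp < 0) == (R < 0));   // sign bits match
  }
}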

Contributor Author:

I would think it would be folded earlier? Is there a place I could move this change to, s.t. it would be after all simplifications but still have reasonable access to KnownBits?

Contributor Author:

I think it's fixed. Added support for cmpp in SimplifyDemandedBits.

@@ -1059,6 +1062,9 @@ define <16 x i32> @v16i32(<16 x i32> %x, <16 x i32> %y) nounwind {
; AVX1-NEXT: vpcmpgtd %xmm4, %xmm5, %xmm5
; AVX1-NEXT: vpcmpgtd %xmm6, %xmm0, %xmm0
; AVX1-NEXT: vinsertf128 $1, %xmm5, %ymm0, %ymm0
; AVX1-NEXT: vcvtdq2ps %ymm2, %ymm2
; AVX1-NEXT: vpxor %xmm5, %xmm5, %xmm5
; AVX1-NEXT: vcmpltps %ymm5, %ymm2, %ymm2
Contributor Author:

Same as above

; AVX1-NEXT: vpcmpgtd %xmm7, %xmm1, %xmm1
; AVX1-NEXT: vinsertf128 $1, %xmm6, %ymm1, %ymm1
; AVX1-NEXT: vcvtdq2ps %ymm3, %ymm3
; AVX1-NEXT: vcmpltps %ymm5, %ymm3, %ymm3
Contributor Author:

Same as above.

@goldsteinn changed the title from "goldsteinn/icmp to fcmp" to "[X86] Try Folding icmp of v8i32 -> fcmp of v8f32 on AVX" Feb 20, 2024
} else {
// For cvt + signed compare we need abs(lhs) or abs(rhs) <
// MaxConvertableCvt
if (OpInRange(1, /*CmpLT*/ true, MaxConvertableCvt) ||
Contributor:

&&?

Contributor Author:

It's ||.
The proof is something like:

define i1 @src_eq(i32 %x, i32 %C) {
  %C_abs = call i32 @llvm.abs.i32(i32 %C, i1 false)
  %X_abs = call i32 @llvm.abs.i32(i32 %x, i1 false)
  %C_lemma = icmp ult i32 %C_abs, 16777216
  %X_lemma = icmp ult i32 %X_abs, 16777216
  %lemma = or i1 %C_lemma, %X_lemma
  call void @llvm.assume(i1 %lemma)

  %cmp = icmp eq i32 %x, %C
  ret i1 %cmp
}

define i1 @tgt_eq(i32 %x, i32 %C) {
  %C_abs = call i32 @llvm.abs.i32(i32 %C, i1 false)
  %X_abs = call i32 @llvm.abs.i32(i32 %x, i1 false)
  %C_lemma = icmp ult i32 %C_abs, 16777216
  %X_lemma = icmp ult i32 %X_abs, 16777216
  %lemma = or i1 %C_lemma, %X_lemma
  call void @llvm.assume(i1 %lemma)

  %CFp = sitofp i32 %C to float
  %conv = sitofp i32 %x to float
  %cmp = fcmp oeq float %conv, %CFp
  ret i1 %cmp
}

The condition:

  %C_abs = call i32 @llvm.abs.i32(i32 %C, i1 false)
  %X_abs = call i32 @llvm.abs.i32(i32 %x, i1 false)
  %C_lemma = icmp ult i32 %C_abs, 16777216
  %X_lemma = icmp ult i32 %X_abs, 16777216
  %lemma = or i1 %C_lemma, %X_lemma
  call void @llvm.assume(i1 %lemma)

is all ||.

Contributor:

Isn't abs(x) < c the same as c > x > -c? Otherwise x can be any possible value for x < c || x > -c.

Contributor Author:

Note that the x > -c is x u> -c, which is a pretty tight bound.
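
(A standalone check, editor-added, of the identity the || encodes, with C = 1 << 24: abs(x) u< C is exactly (x u< C) || (x u> -C), two unsigned range checks ORed together.)

#include <cassert>
#include <cstdint>

int main() {
  const uint32_t C = 1u << 24;
  for (uint64_t I = 0; I < (1ull << 32); I += 99991) { // sampled sweep
    uint32_t X = static_cast<uint32_t>(I);
    uint32_t AbsX = (X >> 31) ? 0u - X : X; // two's-complement abs
    bool ViaAbs = AbsX < C;                 // abs(x) u< C
    bool ViaOr = (X < C) || (X > 0u - C);   // (x u< C) || (x u> -C)
    assert(ViaAbs == ViaOr);
  }
}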

Contributor:

Oh, I see. Comparing a signed value with an unsigned operator. It's a bit difficult for me to notice that :)

@RKSimon (Collaborator) left a comment:

I'm not sure about the OkayBitcast codepath - I was just expecting some simple knownbits min/max value tests and SINT_TO_FP calls that would also handle constant folding.

@@ -23299,6 +23299,110 @@ static SDValue LowerVSETCC(SDValue Op, const X86Subtarget &Subtarget,
}
}

// We get bad codegen for v8i32 compares on avx targets (without avx2) so if
// possible convert to a v8f32 compare.
if (VTOp0.getVectorElementType() == MVT::i32 && VTOp0 == MVT::v8i32 &&
Collaborator:

Just test VTOp0 == MVT::v8i32? The VTOp0.getVectorElementType() == MVT::i32 check looks unnecessary.

bool OkayCvt = false;
bool OkayBitcast = false;

// For cvt up to 1 << (Significand Precision)
Collaborator:

Would it make sense to use the APFloat semantics instead of hard coded numbers here?

Contributor Author:

Not 100% sure; I reached the constants in the proofs via trial and error, then realized these numbers are the most logical bounds. I'm not certain they are indeed generalizable, though.

auto OpInRange = [&DAG, Op, &KnownOps](unsigned OpNo, bool CmpLT,
const APInt Bound) {
if (!KnownOps[OpNo].has_value())
KnownOps[OpNo] = DAG.computeKnownBits(Op.getOperand(OpNo));
Collaborator:

Why do you want to cache KnownOps[] - shouldn't we just call both at the start and be done with it?

Contributor Author:

We can get away with only computing Op1 if it's a small constant and there's no unsigned cond.

@goldsteinn (Contributor Author)

I'm not sure about the OkayBitcast codepath - I was just expecting some simple knownbits min/max value tests and SINT_TO_FP calls that would also handle constant folding.

What's wrong with the bitcast codepath? It's just the same idea as the SINT_TO_FP one but saves an instruction.
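
(For reference, a standalone sketch, editor-added, of the ordering property the bitcast path relies on: for bit patterns in [0, 0x7f800000], zero up through +inf, i.e. u< MaxConvertableBitcast, IEEE-754 float order matches the unsigned order of the raw bits, so the compare survives a plain bitcast. Denormals are in this range, which is where the DAZ concern discussed below comes in.)

#include <cassert>
#include <cstdint>
#include <cstring>

static float AsFloat(uint32_t Bits) {
  float F;
  std::memcpy(&F, &Bits, sizeof(F));
  return F;
}

int main() {
  const uint32_t Step = 7919; // sampled sweep over [0, 0x7f800000]
  for (uint32_t A = 0; A + Step <= 0x7f800000u; A += Step)
    assert(AsFloat(A) < AsFloat(A + Step)); // strictly monotone, no NaNs
}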

@RKSimon (Collaborator) commented Feb 20, 2024

Sorry I meant I wasn't expecting it - and haven't gone through it properly yet.

; AVX1-NEXT: vpunpckhwd {{.*#+}} xmm1 = xmm1[4,4,5,5,6,6,7,7]
; AVX1-NEXT: vpslld $31, %xmm1, %xmm1
; AVX1-NEXT: vinsertf128 $1, %xmm1, %ymm2, %ymm1
; AVX1-NEXT: vmaskmovps %ymm0, %ymm1, 32(%rdx)
Collaborator:

Anything we can do to prevent this regression?


@llvmbot added the llvm:SelectionDAG label Feb 25, 2024
github-actions bot commented Feb 25, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@@ -2471,6 +2474,8 @@ bool TargetLowering::SimplifyDemandedBits(
if (SimplifyDemandedBits(Src, InDemandedBits, InDemandedElts, Known, TLO,
Depth + 1))
return true;

Contributor:

Looks like nothing changed in this file; should the whole file be reverted?

Contributor Author:

Yeah, thank you!

@phoebewang (Contributor)

I recall reading a comment with the concern that when DAZ is set, fcmp will always return equal for denormal values, but I cannot find the comment now. Is this a real concern? Did we fix it?

@goldsteinn (Contributor Author)

I recall reading a comment with the concern that when DAZ is set, fcmp will always return equal for denormal values, but I cannot find the comment now. Is this a real concern? Did we fix it?

Looks like that would be the result of DAZ, but I didn't test:
https://stackoverflow.com/questions/54047415/do-denormal-flags-like-denormals-are-zero-daz-affect-comparisons-for-equality

I'll update with the sub flag.
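
(A minimal standalone reproduction of that behavior, editor-added; assumes x86 with SSE and the _mm_getcsr/_mm_setcsr intrinsics.)

// With DAZ set, two distinct denormal bit patterns are flushed to zero on
// input to the compare, so they compare equal even though icmp eq on the
// raw bits says they differ.
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <immintrin.h>

int main() {
  _mm_setcsr(_mm_getcsr() | 0x0040); // MXCSR bit 6: denormals-are-zero
  volatile uint32_t A = 1, B = 2;    // two distinct denormal patterns
  uint32_t AV = A, BV = B;           // volatile reads defeat constant folding
  float FA, FB;
  std::memcpy(&FA, &AV, sizeof(FA));
  std::memcpy(&FB, &BV, sizeof(FB));
  std::printf("FA == FB: %d\n", FA == FB); // typically prints 1 under DAZ
}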

@goldsteinn (Contributor Author)

I recall reading a comment with the concern that when DAZ is set, fcmp will always return equal for denormal values, but I cannot find the comment now. Is this a real concern? Did we fix it?

Looks like that would be the result of DAZ, but I didn't test: https://stackoverflow.com/questions/54047415/do-denormal-flags-like-denormals-are-zero-daz-affect-comparisons-for-equality

I'll update with the sub flag.

Actually, I don't think this is an issue since we don't handle eq/ne

@phoebewang (Contributor)

I recall reading a comment with the concern that when DAZ is set, fcmp will always return equal for denormal values, but I cannot find the comment now. Is this a real concern? Did we fix it?

Looks like that would be the result of DAZ, but I didn't test: https://stackoverflow.com/questions/54047415/do-denormal-flags-like-denormals-are-zero-daz-affect-comparisons-for-equality
I'll update with the sub flag.

Actually, I don't think this is an issue since we don't handle eq/ne

It affects all predicates, not only eq/ne, see https://godbolt.org/z/jxs5s5qde

@goldsteinn (Contributor Author)

I recall reading a comment with the concern that when DAZ is set, fcmp will always return equal for denormal values, but I cannot find the comment now. Is this a real concern? Did we fix it?

Looks like that would be the result of DAZ, but I didn't test: https://stackoverflow.com/questions/54047415/do-denormal-flags-like-denormals-are-zero-daz-affect-comparisons-for-equality
I'll update with the sub flag.

Actually, I don't think this is an issue since we don't handle eq/ne

It affects all predicates, not only eq/ne, see https://godbolt.org/z/jxs5s5qde

ah.

We don't naturally have a subnormal fpclass in DAG, so I think we just need to disable the simplifications involving Op1C == 0 if -mdaz-ftz? Or would we need to disable them regardless, unless we can prove Op0 is normal?

@anematode (Contributor) commented Feb 26, 2024

We should probably disable it regardless of the compiler settings because it'd be weird for FP flags to invalidate what looks like integer comparisons.... In my comment where I brought this up I posted a way to compare i32 for equality/inequality regardless of range, but it fails to generalize. I think your converting-to-float method is totally fine though.

@goldsteinn (Contributor Author)

We should probably disable it regardless of the compiler settings because it'd be weird for FP flags to invalidate what looks like integer comparisons.... In my comment where I brought this up I posted a way to compare i32 for equality/inequality regardless of range, but it fails to generalize. I think your converting-to-float method is totally fine though.

What do you mean by invalidate? The issue is simplifying a float signbit comparison to just return the float. DAZ can make the simplification incorrect.

@anematode (Contributor)

Maybe we're talking about different things then, sorry. I thought this was in relation to when we can or cannot bitcast integers to floats and do comparison as floats. For example, I think the restriction on C in src_ule_bitcast -> tgt_ule_bitcast is that it must be between 1 << 23 and the existing bound, not just 0 and the bound.

@goldsteinn (Contributor Author) commented Feb 26, 2024

Maybe we're talking about different things then, sorry. I thought this was in relation to when we can or cannot bitcast integers to floats and do comparison as floats. For example, I think the restriction on C in src_ule_bitcast -> tgt_ule_bitcast is that it must be between 1 << 23 and the existing bound, not just 0 and the bound.

Ah, that is another issue...

So there are two issues with DAZ:

  1. The bitcast path only checks ops < inf_bits.
  2. simplifySExtOfDecomposedSetCC will simplify op lt/gt 0.0 (what I thought you were referencing).

I think both probably need to be fixed.

Still, re:

We should probably disable it regardless of the compiler settings because it'd be weird for FP flags to invalidate what looks like integer comparisons....

Do you mean drop the optimizations (if we can't prove normal)? I don't really see what's the issue with checking whether we are compiling with DAZ.

RKSimon added a commit that referenced this pull request Mar 2, 2024
…83596)

If we're logical shifting an all-signbits value, then we can just mask out the shifted bits.

This helps removes some unnecessary bitcasted vXi16 shifts used for vXi8 shifts (which SimplifyDemandedBits will struggle to remove through the bitcast), and allows some AVX1 shifts of 256-bit values to stay as a YMM instruction.

Noticed in codegen from #82290
@RKSimon (Collaborator) commented Mar 4, 2024

rebase?

@goldsteinn (Contributor Author)

rebase?

Done.

@@ -30,6 +30,7 @@
#include "llvm/Analysis/ObjCARCUtil.h"
#include "llvm/Analysis/ProfileSummaryInfo.h"
#include "llvm/Analysis/VectorUtils.h"
#include "llvm/CodeGen/ISDOpcodes.h"
Collaborator:

This shouldn't be necessary

} else if (auto *C = dyn_cast<ConstantSDNode>(Op1)) {
Op1C = C->getAPIntValue();
} else if (ISD::isConstantSplatVector(Op1.getNode(), Op1C)) {
// Pass
Collaborator:

Pass? Make it clear that isConstantSplatVector sets Op1C

DAG, DL, ISD::SETGT, Op.getOperand(0), Op.getOperand(1),
DemandedBits, /*AllowNOT*/ false, Depth))
return R;
break;
Collaborator:

Why is this case necessary? I thought it was just to convert a XMM X86ISD::CMPP back to an integer?

Contributor Author:

No, it's a common helper to do the simplifications for float/int.

I could add float->int, however.

@goldsteinn goldsteinn force-pushed the goldsteinn/icmp-to-fcmp branch 2 times, most recently from be07876 to f8b168b Compare March 5, 2024 20:09
Res = KnownBits::ult(*KnownOps[OpNo], KnownBits::makeConstant(Bound));
else
Res = KnownBits::ugt(*KnownOps[OpNo], KnownBits::makeConstant(Bound));
return Res.has_value() && *Res;
Collaborator:

Maybe return Res.value_or(false)?

@@ -41125,6 +41255,154 @@ static SDValue combineShuffle(SDNode *N, SelectionDAG &DAG,
return SDValue();
}

// Simplify a decomposed (sext (setcc)). Assumes prior check that
// bitwidth(sext)==bitwidth(setcc operands).
static SDValue simplifySExtOfDecomposedSetCCImpl(
Collaborator:

Any hope of pulling this out into its own patch?

Contributor Author:

Will do

Contributor Author:

@RKSimon separate commit in this PR, or a new PR as a whole?

Collaborator:

new PR if possible - patches that do too much at once make me nervous :)

Contributor Author:

Can do; the only issue is that its functionality basically depends on this PR. Otherwise we never really get the necessary floating-point flags to assist with the simplification.

Contributor Author:

So this PR will have some regressions in it; I'll post the other PR rebased on top of this one.

Contributor Author:

Okay, dropped the simplifications from this patch. See: #84360 (based on this one).

Contributor Author:

Note, there are some regressions in this patch now, but they should be cleared up in #84360.

Also note, the reason #84360 is second is so that we can properly test the cmpp code. Without this patch we never get the necessary FMF flags to actually do the cmpp simplifications. If you think that is unnecessary I can swap the order (#84360 does stand on its own, mostly for its simplifications on pcmpgt).

@RKSimon (Collaborator) left a comment:

Any luck with the remaining regressions?

@@ -816,64 +816,6 @@ define float @test_v16f32(<16 x float> %a0) {
; SSE41-NEXT: orps %xmm4, %xmm0
; SSE41-NEXT: retq
;
; AVX-LABEL: test_v16f32:
Collaborator:

still not fixed

@goldsteinn (Contributor Author)

Any luck with the remaining regressions?

Most of the regressions are fixed with #84360; the only outstanding one
is when we later simplify to xmm w/ demandedelts, in which case we
have converted from int -> float without any benefit. For example:
https://github.com/llvm/llvm-project/pull/84360/files#diff-ec93308ebc948e632bea97e65a5aeedc4342aab5fa49d2c4a43be983131fba47R17

That's tough to get around, as float -> int is much more difficult to
pull off.

@goldsteinn (Contributor Author)

Any luck with the remaining regressions?

Most of the regressions are fixed with #84360; the only outstanding one is when we later simplify to xmm w/ demandedelts, in which case we have converted from int -> float without any benefit. For example: https://github.com/llvm/llvm-project/pull/84360/files#diff-ec93308ebc948e632bea97e65a5aeedc4342aab5fa49d2c4a43be983131fba47R17

That's tough to get around, as float -> int is much more difficult to pull off.

I'm trying to refactor so the code works generically to convert int <-> fp compares. Then we can undo as needed.

@goldsteinn (Contributor Author)

Any luck with the remaining regressions?

Most of the regressions are fixed with #84360; the only outstanding one is when we later simplify to xmm w/ demandedelts, in which case we have converted from int -> float without any benefit. For example: https://github.com/llvm/llvm-project/pull/84360/files#diff-ec93308ebc948e632bea97e65a5aeedc4342aab5fa49d2c4a43be983131fba47R17
That's tough to get around, as float -> int is much more difficult to pull off.

I'm trying to refactor so the code works generically to convert int <-> fp compares. Then we can undo as needed.

That doesn't really help. Maybe this isn't worth doing after all?

Fixes: llvm#82242

The idea is that AVX doesn't support comparisons for `v8i32` so it
splits the comparison into 2x `v4i32` comparisons + reconstruction of
the `v8i32`.

By converting to a float, we can handle the comparison with 1/2
instructions (1 if we can `bitcast`, 2 if we need to cast with
`sitofp`).

The proofs: https://alive2.llvm.org/ce/z/AJDdQ8
They time out online, but they can be reproduced locally.
We currently only handle a single case for `pcmpgt`. This patch
extends that to work for `cmpp` and handles comparators more
generically.
@goldsteinn (Contributor Author)

@RKSimon, thoughts on dropping this? Do you think it's worth the complexity?

@RKSimon (Collaborator) commented Mar 26, 2024

@RKSimon, thoughts on dropping this? Do you think it's worth the complexity?

Do you know which part of the fold is causing the most regressions? #82242 was for the limited case of when we've masked the lower bits for comparison - if we just handled that, what does it look like then?

@goldsteinn (Contributor Author)

@RKSimon, thoughts on dropping this? Do you think it's worth the complexity?

Do you know which part of the fold is causing the most regressions? #82242 was for the limited case of when we've masked the lower bits for comparison - if we just handled that, what does it look like then?

Most of the regressions are caused by not simplifying CMPP, which #84360 cleans up.
The remainder are from when we simplify ymm -> xmm after the transform, which is harder
to undo.

// or rhs < MaxConvertableCvt

if (OpInRange(1, /*CmpLT*/ true, APInt::getSignedMinValue(32)) &&
OpInRange(0, /*CmpLT*/ true, APInt::getSignedMinValue(32)) &&
Collaborator:

@goldsteinn The comment says both lhs/rhs >= 0 but the comparisons are with MIN_INT ?

Contributor Author:

It's an unsigned compare, so foo u< MIN_INT -> foo s>= 0 (an unsigned value below 0x80000000 has the sign bit clear).

OpInRange(1, /*CmpLT*/ false, -MaxConvertableCvt) ||
OpInRange(0, /*CmpLT*/ true, MaxConvertableCvt) ||
OpInRange(0, /*CmpLT*/ false, -MaxConvertableCvt))
OkayCvt = true;
Collaborator:

Don't we need signed comparisons here - OpInRange just does unsigned comparisons? Ideally we'd use NumSignBits checks as an alternative to also check for being in signed bounds.
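
(To make the suggested bound concrete, an editor-added standalone sketch, not reviewer code; in-tree the query would be SelectionDAG::ComputeNumSignBits: an i32 with at least 8 known sign bits lies in [-(1 << 24), (1 << 24) - 1], exactly the range where sitofp to float is lossless.)

#include <cassert>
#include <cstdint>

// Count how many leading bits equal the sign bit (always >= 1).
static int NumSignBits(int32_t X) {
  uint32_t U = static_cast<uint32_t>(X);
  uint32_t Sign = U >> 31;
  int N = 1;
  for (int I = 30; I >= 0 && ((U >> I) & 1) == Sign; --I)
    ++N;
  return N;
}

int main() {
  assert(NumSignBits((1 << 24) - 1) == 8);  // in range for the cvt path
  assert(NumSignBits(1 << 24) == 7);        // first positive value out
  assert(NumSignBits(-(1 << 24)) == 8);     // in range
  assert(NumSignBits(-(1 << 24) - 1) == 7); // first negative value out
}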

Labels: backend:X86, llvm:SelectionDAG

Successfully merging this pull request may close these issues:

[X86] Attempt to perform AVX1 v8i32 integer comparisons as v8f32