
[X86] Fold concat(PCMP*(),PCMP*()) -> CMPPS(concat,concat) on AVX1 targets #95915

Merged
RKSimon merged 2 commits into llvm:main from RKSimon:avx1_icmp256 on Jun 28, 2024

Conversation

RKSimon (Collaborator) commented Jun 18, 2024

This is a more restricted solution to #82242 (vs the more general #82290 + #84360): if we're concatenating PCMPEQ/GT nodes to 256 bits on an AVX1 target, determine whether the integer values are in bounds that allow them to be converted to FP for a (legal) float comparison.

By performing this inside combineConcatVectorOps and working on PCMPEQ/GT nodes rather than ICMP, we delay the fold until after more lowering has occurred. This avoids many of the issues where we were getting 'stuck' with CMPPS or unnecessary 256-bit nodes, and makes it easier to determine whether either of the new concats will be free.

Additionally, this patch requires BOTH comparison operands to be in range; while not technically required, this helps avoid the remaining regressions. It doesn't require that one of the operands is constant, as that constraint didn't seem necessary.

I've reused some of the code from #82290, and we may be able to add additional functionality (more CondCode patterns, v4i64/v4f64 handling, 'bitcastable' integers, etc.) in future patches.

Fixes #82242
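
As an illustration (not part of the patch, and purely to explain the in-bounds requirement): f32 has a 24-bit significand, so every integer of magnitude below 2^24 converts to float exactly and SINT_TO_FP preserves SETEQ/SETGT; above that, distinct integers can collide after conversion. A minimal standalone C++ sketch of the collision (variable names are my own):

  // Why the fold keeps both operands strictly inside (-2^24, 2^24) for f32.
  #include <cstdio>
  int main() {
    float SmallA = (float)((1 << 24) - 1); // 16777215 - exactly representable
    float SmallB = (float)((1 << 24) - 2); // 16777214 - exactly representable
    float BigA = (float)((1 << 24) + 1);   // 16777217 - rounds to 16777216.0f
    float BigB = (float)(1 << 24);         // 16777216 - exactly representable
    std::printf("%d %d\n", SmallA == SmallB, BigA == BigB); // prints "0 1"
    return 0;
  }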

llvmbot (Collaborator) commented Jun 18, 2024

@llvm/pr-subscribers-backend-x86

Author: Simon Pilgrim (RKSimon)


Patch is 193.21 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/95915.diff

7 Files Affected:

  • (modified) llvm/lib/Target/X86/X86ISelLowering.cpp (+77-4)
  • (modified) llvm/test/CodeGen/X86/bitcast-int-to-vector-bool-sext.ll (+9-18)
  • (modified) llvm/test/CodeGen/X86/bitcast-int-to-vector-bool-zext.ll (+9-18)
  • (modified) llvm/test/CodeGen/X86/cmpf-avx.ll (+5-10)
  • (modified) llvm/test/CodeGen/X86/vector-popcnt-256-ult-ugt.ll (+1541-1570)
  • (modified) llvm/test/CodeGen/X86/vector-sext.ll (+3-6)
  • (modified) llvm/test/CodeGen/X86/vector-unsigned-cmp.ll (+9-8)
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index f27c935812f51..0ad30490772c8 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -55716,6 +55716,38 @@ static SDValue combineVectorCompare(SDNode *N, SelectionDAG &DAG,
   return SDValue();
 }
 
+// Helper to determine if we can convert an integer comparison to a float
+// comparison by casting the operands.
+static std::optional<unsigned> CastIntSETCCtoFP(MVT VT, ISD::CondCode CC,
+                                                const KnownBits &LHS,
+                                                const KnownBits &RHS) {
+  MVT SVT = VT.getScalarType();
+  assert(SVT == MVT::f32 && "Only tested for float so far");
+  const fltSemantics &Sem = SelectionDAG::EVTToAPFloatSemantics(SVT);
+  assert((CC == ISD::SETEQ || CC == ISD::SETGT) &&
+         "Only PCMPEQ/PCMPGT currently supported");
+
+  // TODO: Handle bitcastable integers.
+
+  // For cvt + signed compare we need:
+  // abs(lhs) < MaxConvertableCvt and abs(rhs) < MaxConvertableCvt
+  auto OpInAbsRange = [](const KnownBits &Known, const APInt &Bound) {
+    if (Known.isUnknown() ||
+        !KnownBits::slt(Known, KnownBits::makeConstant(Bound)) ||
+        !KnownBits::sgt(Known, KnownBits::makeConstant(-Bound)))
+      return false;
+    return true;
+  };
+  APInt MaxConvertableCvt = APInt::getOneBitSet(
+      SVT.getScalarSizeInBits(), APFloat::semanticsPrecision(Sem));
+
+  if (OpInAbsRange(RHS, MaxConvertableCvt) &&
+      OpInAbsRange(LHS, MaxConvertableCvt))
+    return ISD::SINT_TO_FP;
+
+  return std::nullopt;
+}
+
 /// Helper that combines an array of subvector ops as if they were the operands
 /// of a ISD::CONCAT_VECTORS node, but may have come from another source (e.g.
 /// ISD::INSERT_SUBVECTOR). The ops are assumed to be of the same type.
@@ -56106,11 +56138,52 @@ static SDValue combineConcatVectorOps(const SDLoc &DL, MVT VT,
       break;
     case X86ISD::PCMPEQ:
     case X86ISD::PCMPGT:
-      if (!IsSplat && VT.is256BitVector() && Subtarget.hasInt256() &&
+      if (!IsSplat && VT.is256BitVector() &&
+          (Subtarget.hasInt256() || VT == MVT::v8i32) &&
           (IsConcatFree(VT, Ops, 0) || IsConcatFree(VT, Ops, 1))) {
-        return DAG.getNode(Op0.getOpcode(), DL, VT,
-                           ConcatSubOperand(VT, Ops, 0),
-                           ConcatSubOperand(VT, Ops, 1));
+        if (Subtarget.hasInt256())
+          return DAG.getNode(Op0.getOpcode(), DL, VT,
+                             ConcatSubOperand(VT, Ops, 0),
+                             ConcatSubOperand(VT, Ops, 1));
+
+        // Without AVX2, see if we can cast the values to v8f32 and use fcmp.
+        // TODO: Handle v4f64 as well?
+        KnownBits KnownLHS(EltSizeInBits), KnownRHS(EltSizeInBits);
+        KnownLHS.One.setAllBits();
+        KnownRHS.One.setAllBits();
+        KnownLHS.Zero.setAllBits();
+        KnownRHS.Zero.setAllBits();
+        for (unsigned I = 0; I != NumOps; ++I) {
+          KnownBits LHS = DAG.computeKnownBits(Ops[I].getOperand(0));
+          KnownBits RHS = DAG.computeKnownBits(Ops[I].getOperand(1));
+          KnownLHS = KnownLHS.intersectWith(LHS);
+          KnownRHS = KnownRHS.intersectWith(RHS);
+          if (KnownLHS.isUnknown() && KnownRHS.isUnknown())
+            break;
+        }
+
+        ISD::CondCode ICC =
+            Op0.getOpcode() == X86ISD::PCMPEQ ? ISD::SETEQ : ISD::SETGT;
+        ISD::CondCode FCC =
+            Op0.getOpcode() == X86ISD::PCMPEQ ? ISD::SETOEQ : ISD::SETOGT;
+
+        MVT FpSVT = MVT::getFloatingPointVT(EltSizeInBits);
+        MVT FpVT = VT.changeVectorElementType(FpSVT);
+
+        if (std::optional<unsigned> CastOpc =
+                CastIntSETCCtoFP(FpVT, ICC, KnownLHS, KnownRHS)) {
+          SDValue LHS = ConcatSubOperand(VT, Ops, 0);
+          SDValue RHS = ConcatSubOperand(VT, Ops, 1);
+          LHS = DAG.getNode(*CastOpc, DL, FpVT, LHS);
+          RHS = DAG.getNode(*CastOpc, DL, FpVT, RHS);
+
+          bool IsAlwaysSignaling;
+          unsigned FSETCC =
+              translateX86FSETCC(FCC, LHS, RHS, IsAlwaysSignaling);
+          return DAG.getBitcast(
+              VT, DAG.getNode(X86ISD::CMPP, DL, FpVT, LHS, RHS,
+                              DAG.getTargetConstant(FSETCC, DL, MVT::i8)));
+        }
       }
       break;
     case ISD::CTPOP:
diff --git a/llvm/test/CodeGen/X86/bitcast-int-to-vector-bool-sext.ll b/llvm/test/CodeGen/X86/bitcast-int-to-vector-bool-sext.ll
index 6255621d870e1..eef2b3db5d694 100644
--- a/llvm/test/CodeGen/X86/bitcast-int-to-vector-bool-sext.ll
+++ b/llvm/test/CodeGen/X86/bitcast-int-to-vector-bool-sext.ll
@@ -256,12 +256,9 @@ define <8 x i32> @ext_i8_8i32(i8 %a0) {
 ; AVX1-NEXT:    vmovd %edi, %xmm0
 ; AVX1-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,0,0,0]
 ; AVX1-NEXT:    vinsertf128 $1, %xmm0, %ymm0, %ymm0
-; AVX1-NEXT:    vmovaps {{.*#+}} ymm1 = [1,2,4,8,16,32,64,128]
-; AVX1-NEXT:    vandps %ymm1, %ymm0, %ymm0
-; AVX1-NEXT:    vpcmpeqd %xmm1, %xmm0, %xmm1
-; AVX1-NEXT:    vextractf128 $1, %ymm0, %xmm0
-; AVX1-NEXT:    vpcmpeqd {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
-; AVX1-NEXT:    vinsertf128 $1, %xmm0, %ymm1, %ymm0
+; AVX1-NEXT:    vandps {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm0, %ymm0
+; AVX1-NEXT:    vcvtdq2ps %ymm0, %ymm0
+; AVX1-NEXT:    vcmpeqps {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm0, %ymm0
 ; AVX1-NEXT:    retq
 ;
 ; AVX2-LABEL: ext_i8_8i32:
@@ -487,18 +484,12 @@ define <16 x i32> @ext_i16_16i32(i16 %a0) {
 ; AVX1-NEXT:    vmovd %edi, %xmm0
 ; AVX1-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,0,0,0]
 ; AVX1-NEXT:    vinsertf128 $1, %xmm0, %ymm0, %ymm1
-; AVX1-NEXT:    vmovaps {{.*#+}} ymm0 = [1,2,4,8,16,32,64,128]
-; AVX1-NEXT:    vandps %ymm0, %ymm1, %ymm2
-; AVX1-NEXT:    vpcmpeqd %xmm0, %xmm2, %xmm0
-; AVX1-NEXT:    vextractf128 $1, %ymm2, %xmm2
-; AVX1-NEXT:    vpcmpeqd {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm2, %xmm2
-; AVX1-NEXT:    vinsertf128 $1, %xmm2, %ymm0, %ymm0
-; AVX1-NEXT:    vmovaps {{.*#+}} ymm2 = [256,512,1024,2048,4096,8192,16384,32768]
-; AVX1-NEXT:    vandps %ymm2, %ymm1, %ymm1
-; AVX1-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm2
-; AVX1-NEXT:    vextractf128 $1, %ymm1, %xmm1
-; AVX1-NEXT:    vpcmpeqd {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
-; AVX1-NEXT:    vinsertf128 $1, %xmm1, %ymm2, %ymm1
+; AVX1-NEXT:    vandps {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm1, %ymm0
+; AVX1-NEXT:    vcvtdq2ps %ymm0, %ymm0
+; AVX1-NEXT:    vcmpeqps {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm0, %ymm0
+; AVX1-NEXT:    vandps {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm1, %ymm1
+; AVX1-NEXT:    vcvtdq2ps %ymm1, %ymm1
+; AVX1-NEXT:    vcmpeqps {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm1, %ymm1
 ; AVX1-NEXT:    retq
 ;
 ; AVX2-LABEL: ext_i16_16i32:
diff --git a/llvm/test/CodeGen/X86/bitcast-int-to-vector-bool-zext.ll b/llvm/test/CodeGen/X86/bitcast-int-to-vector-bool-zext.ll
index d2794df731b65..5c810797bd2b7 100644
--- a/llvm/test/CodeGen/X86/bitcast-int-to-vector-bool-zext.ll
+++ b/llvm/test/CodeGen/X86/bitcast-int-to-vector-bool-zext.ll
@@ -320,12 +320,9 @@ define <8 x i32> @ext_i8_8i32(i8 %a0) {
 ; AVX1-NEXT:    vmovd %edi, %xmm0
 ; AVX1-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,0,0,0]
 ; AVX1-NEXT:    vinsertf128 $1, %xmm0, %ymm0, %ymm0
-; AVX1-NEXT:    vmovaps {{.*#+}} ymm1 = [1,2,4,8,16,32,64,128]
-; AVX1-NEXT:    vandps %ymm1, %ymm0, %ymm0
-; AVX1-NEXT:    vpcmpeqd %xmm1, %xmm0, %xmm1
-; AVX1-NEXT:    vextractf128 $1, %ymm0, %xmm0
-; AVX1-NEXT:    vpcmpeqd {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
-; AVX1-NEXT:    vinsertf128 $1, %xmm0, %ymm1, %ymm0
+; AVX1-NEXT:    vandps {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm0, %ymm0
+; AVX1-NEXT:    vcvtdq2ps %ymm0, %ymm0
+; AVX1-NEXT:    vcmpeqps {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm0, %ymm0
 ; AVX1-NEXT:    vandps {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm0, %ymm0
 ; AVX1-NEXT:    retq
 ;
@@ -613,20 +610,14 @@ define <16 x i32> @ext_i16_16i32(i16 %a0) {
 ; AVX1-NEXT:    vmovd %edi, %xmm0
 ; AVX1-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,0,0,0]
 ; AVX1-NEXT:    vinsertf128 $1, %xmm0, %ymm0, %ymm1
-; AVX1-NEXT:    vmovaps {{.*#+}} ymm0 = [1,2,4,8,16,32,64,128]
-; AVX1-NEXT:    vandps %ymm0, %ymm1, %ymm2
-; AVX1-NEXT:    vpcmpeqd %xmm0, %xmm2, %xmm0
-; AVX1-NEXT:    vextractf128 $1, %ymm2, %xmm2
-; AVX1-NEXT:    vpcmpeqd {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm2, %xmm2
-; AVX1-NEXT:    vinsertf128 $1, %xmm2, %ymm0, %ymm0
+; AVX1-NEXT:    vandps {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm1, %ymm0
+; AVX1-NEXT:    vcvtdq2ps %ymm0, %ymm0
+; AVX1-NEXT:    vcmpeqps {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm0, %ymm0
 ; AVX1-NEXT:    vbroadcastss {{.*#+}} ymm2 = [1,1,1,1,1,1,1,1]
 ; AVX1-NEXT:    vandps %ymm2, %ymm0, %ymm0
-; AVX1-NEXT:    vmovaps {{.*#+}} ymm3 = [256,512,1024,2048,4096,8192,16384,32768]
-; AVX1-NEXT:    vandps %ymm3, %ymm1, %ymm1
-; AVX1-NEXT:    vpcmpeqd %xmm3, %xmm1, %xmm3
-; AVX1-NEXT:    vextractf128 $1, %ymm1, %xmm1
-; AVX1-NEXT:    vpcmpeqd {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
-; AVX1-NEXT:    vinsertf128 $1, %xmm1, %ymm3, %ymm1
+; AVX1-NEXT:    vandps {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm1, %ymm1
+; AVX1-NEXT:    vcvtdq2ps %ymm1, %ymm1
+; AVX1-NEXT:    vcmpeqps {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm1, %ymm1
 ; AVX1-NEXT:    vandps %ymm2, %ymm1, %ymm1
 ; AVX1-NEXT:    retq
 ;
diff --git a/llvm/test/CodeGen/X86/cmpf-avx.ll b/llvm/test/CodeGen/X86/cmpf-avx.ll
index 39dce7b989509..e58295fff9855 100644
--- a/llvm/test/CodeGen/X86/cmpf-avx.ll
+++ b/llvm/test/CodeGen/X86/cmpf-avx.ll
@@ -2,25 +2,20 @@
 ; RUN: llc < %s -mtriple=i686-unknown-unknown -mattr=+avx | FileCheck %s --check-prefixes=CHECK,X86
 ; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx | FileCheck %s --check-prefixes=CHECK,X64
 
+; PR82242
 define <8 x i32> @cmp_eq_bitcast(<8 x i32> %x) {
 ; X86-LABEL: cmp_eq_bitcast:
 ; X86:       # %bb.0:
 ; X86-NEXT:    vandps {{\.?LCPI[0-9]+_[0-9]+}}, %ymm0, %ymm0
-; X86-NEXT:    vextractf128 $1, %ymm0, %xmm1
-; X86-NEXT:    vbroadcastss {{.*#+}} xmm2 = [3,3,3,3]
-; X86-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
-; X86-NEXT:    vpcmpeqd %xmm2, %xmm0, %xmm0
-; X86-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; X86-NEXT:    vcvtdq2ps %ymm0, %ymm0
+; X86-NEXT:    vcmpeqps {{\.?LCPI[0-9]+_[0-9]+}}, %ymm0, %ymm0
 ; X86-NEXT:    retl
 ;
 ; X64-LABEL: cmp_eq_bitcast:
 ; X64:       # %bb.0:
 ; X64-NEXT:    vandps {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm0, %ymm0
-; X64-NEXT:    vextractf128 $1, %ymm0, %xmm1
-; X64-NEXT:    vbroadcastss {{.*#+}} xmm2 = [3,3,3,3]
-; X64-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
-; X64-NEXT:    vpcmpeqd %xmm2, %xmm0, %xmm0
-; X64-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; X64-NEXT:    vcvtdq2ps %ymm0, %ymm0
+; X64-NEXT:    vcmpeqps {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm0, %ymm0
 ; X64-NEXT:    retq
   %and = and <8 x i32> %x, <i32 7, i32 3, i32 3, i32 3, i32 3, i32 3, i32 3, i32 3>
   %cmp = icmp eq <8 x i32> %and, <i32 3, i32 3, i32 3, i32 3, i32 3, i32 3, i32 3, i32 3>
diff --git a/llvm/test/CodeGen/X86/vector-popcnt-256-ult-ugt.ll b/llvm/test/CodeGen/X86/vector-popcnt-256-ult-ugt.ll
index 05854ff728a07..ab1922e3ad9a2 100644
--- a/llvm/test/CodeGen/X86/vector-popcnt-256-ult-ugt.ll
+++ b/llvm/test/CodeGen/X86/vector-popcnt-256-ult-ugt.ll
@@ -3432,36 +3432,36 @@ define <8 x i32> @ult_2_v8i32(<8 x i32> %0) {
 define <8 x i32> @ugt_2_v8i32(<8 x i32> %0) {
 ; AVX1-LABEL: ugt_2_v8i32:
 ; AVX1:       # %bb.0:
-; AVX1-NEXT:    vbroadcastss {{.*#+}} xmm1 = [15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15]
-; AVX1-NEXT:    vpand %xmm1, %xmm0, %xmm2
-; AVX1-NEXT:    vmovdqa {{.*#+}} xmm3 = [0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4]
-; AVX1-NEXT:    vpshufb %xmm2, %xmm3, %xmm2
-; AVX1-NEXT:    vpsrlw $4, %xmm0, %xmm4
-; AVX1-NEXT:    vpand %xmm1, %xmm4, %xmm4
-; AVX1-NEXT:    vpshufb %xmm4, %xmm3, %xmm4
-; AVX1-NEXT:    vpaddb %xmm2, %xmm4, %xmm2
-; AVX1-NEXT:    vpxor %xmm4, %xmm4, %xmm4
-; AVX1-NEXT:    vpunpckhdq {{.*#+}} xmm5 = xmm2[2],xmm4[2],xmm2[3],xmm4[3]
-; AVX1-NEXT:    vpsadbw %xmm4, %xmm5, %xmm5
-; AVX1-NEXT:    vpmovzxdq {{.*#+}} xmm2 = xmm2[0],zero,xmm2[1],zero
-; AVX1-NEXT:    vpsadbw %xmm4, %xmm2, %xmm2
-; AVX1-NEXT:    vpackuswb %xmm5, %xmm2, %xmm2
-; AVX1-NEXT:    vextractf128 $1, %ymm0, %xmm0
-; AVX1-NEXT:    vpand %xmm1, %xmm0, %xmm5
-; AVX1-NEXT:    vpshufb %xmm5, %xmm3, %xmm5
+; AVX1-NEXT:    vextractf128 $1, %ymm0, %xmm1
+; AVX1-NEXT:    vbroadcastss {{.*#+}} xmm2 = [15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15]
+; AVX1-NEXT:    vpand %xmm2, %xmm1, %xmm3
+; AVX1-NEXT:    vmovdqa {{.*#+}} xmm4 = [0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4]
+; AVX1-NEXT:    vpshufb %xmm3, %xmm4, %xmm3
+; AVX1-NEXT:    vpsrlw $4, %xmm1, %xmm1
+; AVX1-NEXT:    vpand %xmm2, %xmm1, %xmm1
+; AVX1-NEXT:    vpshufb %xmm1, %xmm4, %xmm1
+; AVX1-NEXT:    vpaddb %xmm3, %xmm1, %xmm1
+; AVX1-NEXT:    vpxor %xmm3, %xmm3, %xmm3
+; AVX1-NEXT:    vpunpckhdq {{.*#+}} xmm5 = xmm1[2],xmm3[2],xmm1[3],xmm3[3]
+; AVX1-NEXT:    vpsadbw %xmm3, %xmm5, %xmm5
+; AVX1-NEXT:    vpmovzxdq {{.*#+}} xmm1 = xmm1[0],zero,xmm1[1],zero
+; AVX1-NEXT:    vpsadbw %xmm3, %xmm1, %xmm1
+; AVX1-NEXT:    vpackuswb %xmm5, %xmm1, %xmm1
+; AVX1-NEXT:    vpand %xmm2, %xmm0, %xmm5
+; AVX1-NEXT:    vpshufb %xmm5, %xmm4, %xmm5
 ; AVX1-NEXT:    vpsrlw $4, %xmm0, %xmm0
-; AVX1-NEXT:    vpand %xmm1, %xmm0, %xmm0
-; AVX1-NEXT:    vpshufb %xmm0, %xmm3, %xmm0
+; AVX1-NEXT:    vpand %xmm2, %xmm0, %xmm0
+; AVX1-NEXT:    vpshufb %xmm0, %xmm4, %xmm0
 ; AVX1-NEXT:    vpaddb %xmm5, %xmm0, %xmm0
-; AVX1-NEXT:    vpunpckhdq {{.*#+}} xmm1 = xmm0[2],xmm4[2],xmm0[3],xmm4[3]
-; AVX1-NEXT:    vpsadbw %xmm4, %xmm1, %xmm1
+; AVX1-NEXT:    vpunpckhdq {{.*#+}} xmm2 = xmm0[2],xmm3[2],xmm0[3],xmm3[3]
+; AVX1-NEXT:    vpsadbw %xmm3, %xmm2, %xmm2
 ; AVX1-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero
-; AVX1-NEXT:    vpsadbw %xmm4, %xmm0, %xmm0
-; AVX1-NEXT:    vpackuswb %xmm1, %xmm0, %xmm0
-; AVX1-NEXT:    vbroadcastss {{.*#+}} xmm1 = [2,2,2,2]
-; AVX1-NEXT:    vpcmpgtd %xmm1, %xmm0, %xmm0
-; AVX1-NEXT:    vpcmpgtd %xmm1, %xmm2, %xmm1
-; AVX1-NEXT:    vinsertf128 $1, %xmm0, %ymm1, %ymm0
+; AVX1-NEXT:    vpsadbw %xmm3, %xmm0, %xmm0
+; AVX1-NEXT:    vpackuswb %xmm2, %xmm0, %xmm0
+; AVX1-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX1-NEXT:    vcvtdq2ps %ymm0, %ymm0
+; AVX1-NEXT:    vbroadcastss {{.*#+}} ymm1 = [2.0E+0,2.0E+0,2.0E+0,2.0E+0,2.0E+0,2.0E+0,2.0E+0,2.0E+0]
+; AVX1-NEXT:    vcmpltps %ymm0, %ymm1, %ymm0
 ; AVX1-NEXT:    retq
 ;
 ; AVX2-LABEL: ugt_2_v8i32:
@@ -3535,36 +3535,35 @@ define <8 x i32> @ugt_2_v8i32(<8 x i32> %0) {
 define <8 x i32> @ult_3_v8i32(<8 x i32> %0) {
 ; AVX1-LABEL: ult_3_v8i32:
 ; AVX1:       # %bb.0:
-; AVX1-NEXT:    vbroadcastss {{.*#+}} xmm1 = [15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15]
-; AVX1-NEXT:    vpand %xmm1, %xmm0, %xmm2
-; AVX1-NEXT:    vmovdqa {{.*#+}} xmm3 = [0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4]
-; AVX1-NEXT:    vpshufb %xmm2, %xmm3, %xmm2
-; AVX1-NEXT:    vpsrlw $4, %xmm0, %xmm4
-; AVX1-NEXT:    vpand %xmm1, %xmm4, %xmm4
-; AVX1-NEXT:    vpshufb %xmm4, %xmm3, %xmm4
-; AVX1-NEXT:    vpaddb %xmm2, %xmm4, %xmm2
-; AVX1-NEXT:    vpxor %xmm4, %xmm4, %xmm4
-; AVX1-NEXT:    vpunpckhdq {{.*#+}} xmm5 = xmm2[2],xmm4[2],xmm2[3],xmm4[3]
-; AVX1-NEXT:    vpsadbw %xmm4, %xmm5, %xmm5
-; AVX1-NEXT:    vpmovzxdq {{.*#+}} xmm2 = xmm2[0],zero,xmm2[1],zero
-; AVX1-NEXT:    vpsadbw %xmm4, %xmm2, %xmm2
-; AVX1-NEXT:    vpackuswb %xmm5, %xmm2, %xmm2
-; AVX1-NEXT:    vextractf128 $1, %ymm0, %xmm0
-; AVX1-NEXT:    vpand %xmm1, %xmm0, %xmm5
-; AVX1-NEXT:    vpshufb %xmm5, %xmm3, %xmm5
+; AVX1-NEXT:    vextractf128 $1, %ymm0, %xmm1
+; AVX1-NEXT:    vbroadcastss {{.*#+}} xmm2 = [15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15]
+; AVX1-NEXT:    vpand %xmm2, %xmm1, %xmm3
+; AVX1-NEXT:    vmovdqa {{.*#+}} xmm4 = [0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4]
+; AVX1-NEXT:    vpshufb %xmm3, %xmm4, %xmm3
+; AVX1-NEXT:    vpsrlw $4, %xmm1, %xmm1
+; AVX1-NEXT:    vpand %xmm2, %xmm1, %xmm1
+; AVX1-NEXT:    vpshufb %xmm1, %xmm4, %xmm1
+; AVX1-NEXT:    vpaddb %xmm3, %xmm1, %xmm1
+; AVX1-NEXT:    vpxor %xmm3, %xmm3, %xmm3
+; AVX1-NEXT:    vpunpckhdq {{.*#+}} xmm5 = xmm1[2],xmm3[2],xmm1[3],xmm3[3]
+; AVX1-NEXT:    vpsadbw %xmm3, %xmm5, %xmm5
+; AVX1-NEXT:    vpmovzxdq {{.*#+}} xmm1 = xmm1[0],zero,xmm1[1],zero
+; AVX1-NEXT:    vpsadbw %xmm3, %xmm1, %xmm1
+; AVX1-NEXT:    vpackuswb %xmm5, %xmm1, %xmm1
+; AVX1-NEXT:    vpand %xmm2, %xmm0, %xmm5
+; AVX1-NEXT:    vpshufb %xmm5, %xmm4, %xmm5
 ; AVX1-NEXT:    vpsrlw $4, %xmm0, %xmm0
-; AVX1-NEXT:    vpand %xmm1, %xmm0, %xmm0
-; AVX1-NEXT:    vpshufb %xmm0, %xmm3, %xmm0
+; AVX1-NEXT:    vpand %xmm2, %xmm0, %xmm0
+; AVX1-NEXT:    vpshufb %xmm0, %xmm4, %xmm0
 ; AVX1-NEXT:    vpaddb %xmm5, %xmm0, %xmm0
-; AVX1-NEXT:    vpunpckhdq {{.*#+}} xmm1 = xmm0[2],xmm4[2],xmm0[3],xmm4[3]
-; AVX1-NEXT:    vpsadbw %xmm4, %xmm1, %xmm1
+; AVX1-NEXT:    vpunpckhdq {{.*#+}} xmm2 = xmm0[2],xmm3[2],xmm0[3],xmm3[3]
+; AVX1-NEXT:    vpsadbw %xmm3, %xmm2, %xmm2
 ; AVX1-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero
-; AVX1-NEXT:    vpsadbw %xmm4, %xmm0, %xmm0
-; AVX1-NEXT:    vpackuswb %xmm1, %xmm0, %xmm0
-; AVX1-NEXT:    vbroadcastss {{.*#+}} xmm1 = [3,3,3,3]
-; AVX1-NEXT:    vpcmpgtd %xmm0, %xmm1, %xmm0
-; AVX1-NEXT:    vpcmpgtd %xmm2, %xmm1, %xmm1
-; AVX1-NEXT:    vinsertf128 $1, %xmm0, %ymm1, %ymm0
+; AVX1-NEXT:    vpsadbw %xmm3, %xmm0, %xmm0
+; AVX1-NEXT:    vpackuswb %xmm2, %xmm0, %xmm0
+; AVX1-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX1-NEXT:    vcvtdq2ps %ymm0, %ymm0
+; AVX1-NEXT:    vcmpltps {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm0, %ymm0
 ; AVX1-NEXT:    retq
 ;
 ; AVX2-LABEL: ult_3_v8i32:
@@ -3638,36 +3637,36 @@ define <8 x i32> @ult_3_v8i32(<8 x i32> %0) {
 define <8 x i32> @ugt_3_v8i32(<8 x i32> %0) {
 ; AVX1-LABEL: ugt_3_v8i32:
 ; AVX1:       # %bb.0:
-; AVX1-NEXT:    vbroadcastss {{.*#+}} xmm1 = [15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15]
-; AVX1-NEXT:    vpand %xmm1, %xmm0, %xmm2
-; AVX1-NEXT:    vmovdqa {{.*#+}} xmm3 = [0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4]
-; AVX1-NEXT:    vpshufb %xmm2, %xmm3, %xmm2
-; AVX1-NEXT:    vpsrlw $4, %xmm0, %xmm4
-; AVX1-NEXT:    vpand %xmm1, %xmm4, %xmm4
-; AVX1-NEXT:    vpshufb %xmm4, %xmm3, %xmm4
-; AVX1-NEXT:    vpaddb %xmm2, %xmm4, %xmm2
-; AVX1-NEXT:    vpxor %xmm4, %xmm4, %xmm4
-; AVX1-NEXT:    vpunpckhdq {{.*#+}} xmm5 = xmm2[2],xmm4[2],xmm2[3],xmm4[3]
-; AVX1-NEXT:    vpsadbw %xmm4, %xmm5, %xmm5
-; AVX1-NEXT:    vpmovzxdq {{.*#+}} xmm2 = xmm2[0],zero,xmm2[1],zero
-; AVX1-NEXT:    vpsadbw %xmm4, %xmm2, %xmm2
-; AVX1-NEXT:    vpackuswb %xmm5, %xmm2, %xmm2
-; AVX1-NEXT:    vextractf128 $1, %ymm0, %xmm0
-; AVX1-NEXT:    vpand %xmm1, %xmm0, %xmm5
-; AVX1-NEXT:    vpshufb %xmm5, %xmm3, %xmm5
+; AVX1-NEXT:    vextractf128 $1, %ymm0, %xmm1
+; AVX1-NEXT:    vbroadcastss {{.*#+}} xmm2 = [15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15]
+; AVX1-NEXT:    vpand %xmm2, %xmm1, %xmm3
+; AVX1-NEXT:    vmovdqa {{.*#+}} xmm4 = [0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4]
+; AVX1-NEXT:    vpshufb %xmm3, %xmm4, %xmm3
+; AVX1-NEXT:    vpsrlw $4, %xmm1, %xmm1
+; AVX1-NEXT:    vpand %xmm2, %xmm1, %xmm1
+; AVX1-NEXT:    vpshufb %xmm1, %xmm4, %xmm1
+; AVX1-NEXT:    vpaddb %xmm3, %xmm1, %xmm1
+; AVX1-NEXT:    vpxor %xmm3, %xmm3, %xmm3
+; AVX1-NEXT:    vpunpckhdq {{.*#+}} xmm5 = xmm1[2],xmm3[2],xmm1[3],xmm3[3]
+; AVX1-NEXT:    vpsadbw %xmm3, %xmm5, %xmm5
+; AVX1-NEXT:    vpmovzxdq {{.*#+}} xmm1 = xmm1[0],zero,xmm1[1],zero
+; AVX1-NEXT:    vpsadbw %xmm3, %xmm1, %xmm1
+; AVX1-NEXT:    vpackuswb %xmm5, %xmm1, %xmm1
+; AVX1-NEXT:    vpand %xmm2, %xmm0, %xmm5
+; AVX1-NEXT:    vpshufb %xmm5, %xmm4, %xmm5
 ; AVX1-NEXT:    vpsrlw $4, %xmm0, %xmm0
-; AVX1-NEXT:    vpand %xmm1, %xmm0, %xmm0
-; AVX1-NEXT:    vpshufb %xmm0, %xmm3, %xmm0
+; AVX1-NEXT:    vpand %xmm2, %xmm0, %xmm0
+; AVX1-NEXT:    vpshufb %xmm0, %xmm4, %xmm0
 ; AVX1-NEXT:    vpaddb %xmm5, %xmm0, %xmm0
-; AVX1-NEXT:    vpunpckhdq {{.*#+}} xmm1 = xmm0[2],xmm4[2],xmm0[3],xmm4[3]
-; AVX1-NEXT:    vpsadbw %xmm4, %xmm1, %xmm1
+; AVX1-NEXT:    vpunpckhdq {{.*#+}} xmm2 = xmm0[2],xmm3[2],xmm0[3],xmm3[3]
+; AVX1-NEXT:    vpsadbw %xmm3, %xmm2, %xmm2
 ; AVX1-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero
-; AVX1-NEXT:    vpsadbw %xmm4, %xmm0, %xmm0
-; AVX1-NEXT:    vpackuswb %xmm1, %xmm0, %xmm0
-; AVX1-NEXT:    vbroadcastss {{.*#+}} xmm1 = [3,3,3,3]
-; AVX1-NEXT:    vpcmpgtd %xmm1, %xmm0, %xmm0
-; AVX1-NEXT:    vpcmpgtd %xmm1, %xmm2, %xmm1
-; AVX1-NEXT:    vinsertf128 $1, %xmm0, %ymm1, %ymm0
+; AVX1-NEXT:    vpsadbw %xmm3, %xmm0, %xmm0
+; AVX1-NEXT:    vpackuswb %xmm2, %xmm0, %xmm0
+; AVX1-NEXT:    vinsertf...
[truncated]

  auto OpInAbsRange = [](const KnownBits &Known, const APInt &Bound) {
    if (Known.isUnknown() ||
        !KnownBits::slt(Known, KnownBits::makeConstant(Bound)) ||
        !KnownBits::sgt(Known, KnownBits::makeConstant(-Bound)))
Contributor

Think the unsigned version is clearer

RKSimon (Collaborator, Author)

Done - I went for using SignificantBits instead as the comparison simplifies even further
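
For context, a rough sketch of how the range test can simplify once significant-bit counts are used directly. This assumes KnownBits::countMaxSignificantBits() and APFloat::semanticsPrecision() as declared in llvm/Support/KnownBits.h and llvm/ADT/APFloat.h; the exact form that landed may differ:

  // Sketch only. A signed value needing at most semanticsPrecision(Sem) bits
  // (24 for f32) lies in [-2^23, 2^23-1], so it is exactly representable after
  // SINT_TO_FP and SETEQ/SETGT survive the cast. This is slightly tighter than
  // the original +/-2^24 bound, but still sufficient, and the two-sided
  // slt/sgt check collapses to one comparison per operand.
  unsigned FPPrec = APFloat::semanticsPrecision(Sem); // 24 for IEEE single
  if (LHS.countMaxSignificantBits() <= FPPrec &&
      RHS.countMaxSignificantBits() <= FPPrec)
    return ISD::SINT_TO_FP;
  return std::nullopt;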

goldsteinn (Contributor)

Wonder if this code would fit in simplifySExtOfDecomposedSetCCImpl of #82290.

I'm about to board a flight so I don't have time to check now, but will likely do so tomorrow.

RKSimon (Collaborator, Author) commented Jun 25, 2024

ping?

goldsteinn (Contributor) commented Jun 26, 2024

> ping?

Bah sorry.

It does fit, but it doesn't really change anything of note.

The only change I would like to try to get into this is the equivalent bitcast logic from #82290.

Edit: Okay, I tested it out; the bitcast case doesn't hit anywhere in the test suite, so I think it's fine to leave as a TODO for now.
If you change KnownBits -> ComputeNumSignBits this looks good.

[X86] Fold concat(PCMP*(),PCMP*()) -> CMPPS(concat,concat) on AVX1 targets

; AVX1-NEXT: .LBB18_2: # %if.end
; AVX1-NEXT: vzeroupper
Contributor

Probably a regression. Not sure if we can do anything about it though.

RKSimon (Collaborator, Author)

Agreed - if the store had been split first then we wouldn't have folded to the 256-bit fcmp at all; yet another case of poor topological sorting in DAG combining :(

goldsteinn (Contributor)

LGTM. Wait 24 hours to push, please.

RKSimon merged commit 5b89aaa into llvm:main on Jun 28, 2024
7 checks passed
RKSimon deleted the avx1_icmp256 branch on June 28, 2024 at 14:24
lravenclaw pushed a commit to lravenclaw/llvm-project that referenced this pull request Jul 3, 2024
[X86] Fold concat(PCMP*(),PCMP*()) -> CMPPS(concat,concat) on AVX1 targets (llvm#95915)


Successfully merging this pull request may close these issues.

[X86] Attempt to perform AVX1 v8i32 integer comparisons as v8f32
3 participants