
Improve selection of conditional branch on amdgcn.ballot!=0 condition in SelectionDAG. #68714

Merged (9 commits, Nov 6, 2023)

Conversation

vpykhtin
Contributor

I'm going to introduce changes to the address sanitizer instrumentation checks for the AMDGPU target, and they will make use of the following pattern:

bool cnd = ...
if (amdgcn.ballot(cnd) != 0) {
  ...
}

which means "execute the then-block if any lane satisfies the cnd condition". Note that this is a uniform condition, since amdgcn.ballot is a uniform intrinsic.

This patch improves code generation for this pattern in SelectionDAG; the GlobalISel change should be added later. It introduces a special AMDGPUISD::BRCONDZ node that is created in the DAG combiner and maps directly to either S_CBRANCH_VCC(Z|NZ) or, if cnd is itself uniform, S_CBRANCH_SCC(0|1).

P.S. This pull request contains two commits; the first one just adds tests, so I recommend looking at the second commit, 13ebb87, to see the difference it makes in the test output.

@llvmbot
Collaborator

llvmbot commented Oct 10, 2023

@llvm/pr-subscribers-llvm-globalisel

@llvm/pr-subscribers-backend-amdgpu

Author: Valery (vpykhtin)

Changes


Patch is 49.48 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/68714.diff

11 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp (+31)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.h (+1)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp (+1)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h (+14)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td (+6)
  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+54-1)
  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.h (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i32.ll (+310)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i64.ll (+310)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ballot.i32.ll (+290)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ballot.i64.ll (+290)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
index b5ceaaa14b4fd5e..90addb12a81abcf 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
@@ -666,6 +666,9 @@ void AMDGPUDAGToDAGISel::Select(SDNode *N) {
   case ISD::FP_EXTEND:
     SelectFP_EXTEND(N);
     return;
+  case AMDGPUISD::BRCONDZ:
+    SelectBRCONDZ(N);
+    return;
   case AMDGPUISD::CVT_PKRTZ_F16_F32:
   case AMDGPUISD::CVT_PKNORM_I16_F32:
   case AMDGPUISD::CVT_PKNORM_U16_F32:
@@ -2306,6 +2309,34 @@ void AMDGPUDAGToDAGISel::SelectBRCOND(SDNode *N) {
                        VCC.getValue(0));
 }
 
+void AMDGPUDAGToDAGISel::SelectBRCONDZ(SDNode *N) {
+  const SIRegisterInfo *TRI =
+      static_cast<const SIRegisterInfo *>(Subtarget->getRegisterInfo());
+
+  SDValue Cond = N->getOperand(1);
+
+  // BRCONDZ condition is either AMDGPUISD::SETCC or i1 value that comes from
+  // ISD::SETCC node or logical combination of ISD::SETCCs therefore we don't
+  // need to AND the condition with execmask.
+
+  // TODO: AMDGPUISD::SETCC is always selected as V_CMP so use VCC condition.
+  // This might change later.
+  bool UseSCCBr = Cond->getOpcode() != AMDGPUISD::SETCC && !Cond->isDivergent();
+
+  auto CondCode = cast<CondCodeSDNode>(N->getOperand(3))->get();
+  assert(CondCode == ISD::SETEQ || CondCode == ISD::SETNE);
+
+  bool EqZero = CondCode == ISD::SETEQ;
+  unsigned BrOp =
+      UseSCCBr ? (EqZero ? AMDGPU::S_CBRANCH_SCC0 : AMDGPU::S_CBRANCH_SCC1)
+               : (EqZero ? AMDGPU::S_CBRANCH_VCCZ : AMDGPU::S_CBRANCH_VCCNZ);
+
+  SDValue CondCopy = CurDAG->getCopyToReg(
+      N->getOperand(0), SDLoc(N), UseSCCBr ? AMDGPU::SCC : TRI->getVCC(),
+      N->getOperand(1));
+  CurDAG->SelectNodeTo(N, BrOp, MVT::Other, N->getOperand(2), CondCopy);
+}
+
 void AMDGPUDAGToDAGISel::SelectFP_EXTEND(SDNode *N) {
   if (Subtarget->hasSALUFloatInsts() && N->getValueType(0) == MVT::f32 &&
       !N->isDivergent()) {
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.h b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.h
index a8a606f60a3faee..255ca62cb7a9100 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.h
@@ -272,6 +272,7 @@ class AMDGPUDAGToDAGISel : public SelectionDAGISel {
   void SelectS_BFE(SDNode *N);
   bool isCBranchSCC(const SDNode *N) const;
   void SelectBRCOND(SDNode *N);
+  void SelectBRCONDZ(SDNode *N);
   void SelectFMAD_FMA(SDNode *N);
   void SelectFP_EXTEND(SDNode *N);
   void SelectDSAppendConsume(SDNode *N, unsigned IntrID);
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
index 607d59db7bcf709..a268a807679f473 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
@@ -5317,6 +5317,7 @@ const char* AMDGPUTargetLowering::getTargetNodeName(unsigned Opcode) const {
   NODE_NAME_CASE(BUFFER_ATOMIC_FADD)
   NODE_NAME_CASE(BUFFER_ATOMIC_FMIN)
   NODE_NAME_CASE(BUFFER_ATOMIC_FMAX)
+  NODE_NAME_CASE(BRCONDZ)
 
   case AMDGPUISD::LAST_AMDGPU_ISD_NUMBER: break;
   }
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h
index e971c85ee3f6e39..43e572dcc56423d 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h
@@ -422,6 +422,20 @@ enum NodeType : unsigned {
   // This is SETCC with the full mask result which is used for a compare with a
   // result bit per item in the wavefront.
   SETCC,
+
+  // Conditional branch on comparison of CondWaveMask operand to zero.
+  //   BRCONDZ CondWaveMask, BB, CondCode
+  // where:
+  //   - CondWaveMask - is either:
+  //     * the i32/i64 result of AMDGPUISD::SETCC node,
+  //     * i1 value that comes from ISD::SETCC node or logical combination of
+  //       ISD::SETCCs. For a divergent node this becomes a i32/i64 value after
+  //       selection.
+  //   - BB is the target basic block,
+  //   - CondCode is either SETEQ or SETNE meaning that the branch should happen
+  //     if the CondWaveMask is either equal to zero or not.
+  BRCONDZ,
+
   SETREG,
 
   DENORM_MODE,
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td b/llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td
index 2492a7be651f6d6..a6b4fa4937dff30 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td
@@ -58,6 +58,11 @@ def AMDGPUIfBreakOp : SDTypeProfile<1, 2,
   [SDTCisVT<0, i1>, SDTCisVT<1, i1>, SDTCisVT<2, i1>]
 >;
 
+def AMDGPUBrcondzOp : SDTypeProfile<0, 3, [
+  // cond, bb, cc
+  SDTCisInt<0>, SDTCisVT<1, OtherVT>, SDTCisVT<2, OtherVT>
+]>;
+
 //===----------------------------------------------------------------------===//
 // AMDGPU DAG Nodes
 //
@@ -65,6 +70,7 @@ def AMDGPUIfBreakOp : SDTypeProfile<1, 2,
 def AMDGPUif : SDNode<"AMDGPUISD::IF", AMDGPUIfOp, [SDNPHasChain]>;
 def AMDGPUelse : SDNode<"AMDGPUISD::ELSE", AMDGPUElseOp, [SDNPHasChain]>;
 def AMDGPUloop : SDNode<"AMDGPUISD::LOOP", AMDGPULoopOp, [SDNPHasChain]>;
+def AMDGPUbrcondz: SDNode<"AMDGPUISD::BRCONDZ", AMDGPUBrcondzOp, [SDNPHasChain]>;
 
 def callseq_start : SDNode<"ISD::CALLSEQ_START",
   SDCallSeqStart<[ SDTCisVT<0, i32>, SDTCisVT<1, i32> ]>,
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 0a32844bdb01a09..5edf1446ca34dac 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -798,7 +798,8 @@ SITargetLowering::SITargetLowering(const TargetMachine &TM,
                        ISD::SIGN_EXTEND_INREG,
                        ISD::EXTRACT_VECTOR_ELT,
                        ISD::INSERT_VECTOR_ELT,
-                       ISD::FCOPYSIGN});
+                       ISD::FCOPYSIGN,
+                       ISD::BRCOND});
 
   if (Subtarget->has16BitInsts() && !Subtarget->hasMed3_16())
     setTargetDAGCombine(ISD::FP_ROUND);
@@ -13584,6 +13585,56 @@ SDValue SITargetLowering::performClampCombine(SDNode *N,
   return SDValue(CSrc, 0);
 }
 
+SDValue SITargetLowering::performBRCondCombine(SDNode *N,
+                                               DAGCombinerInfo &DCI) const {
+  if (!DCI.isAfterLegalizeDAG())
+    return SDValue(N, 0);
+
+  SDValue Cond = N->getOperand(1);
+  if (Cond.getOpcode() == ISD::SETCC &&
+      Cond->getOperand(0)->getOpcode() == AMDGPUISD::SETCC) {
+
+    // %VCMP = i32/i64 AMDGPUISD::SETCC ...
+    // %C = ISD::SETCC %VCMP, 0, setne/seteq
+    // BRCOND %BB, %C
+    // =>
+    // %VCMP = i32/i64 AMDGPUISD::SETCC ...
+    // BRCONDZ %BB, %VCMP, setne/seteq
+
+    auto CC = cast<CondCodeSDNode>(Cond->getOperand(2))->get();
+    auto *CRHS = dyn_cast<ConstantSDNode>(Cond->getOperand(1));
+    if ((CC == ISD::SETEQ || CC == ISD::SETNE) && CRHS && CRHS->isZero()) {
+
+      auto VCMP = Cond->getOperand(0);
+      auto VCMP_CC = cast<CondCodeSDNode>(VCMP.getOperand(2))->get();
+      auto *VCMP_CRHS = dyn_cast<ConstantSDNode>(VCMP.getOperand(1));
+      auto Src = VCMP;
+      if (VCMP_CC == ISD::SETNE && VCMP_CRHS && VCMP_CRHS->isZero()) {
+
+        // Special case for amdgcn.ballot:
+        // %VCMPSrc = ISD::SETCC or a logical combination of ISD::SETCCs
+        // %VCMP = i32/i64 AMDGPUISD::SETCC (ext %VCMPSrc), 0, setne
+        // %C = ISD::SETCC %VCMP, 0, setne/seteq
+        // BRCOND %BB, %C
+        // =>
+        // BRCONDZ %BB, %VCMPSrc, setne/seteq
+
+        auto VCMPSrc = VCMP.getOperand(0);
+        if (ISD::isExtOpcode(VCMPSrc->getOpcode())) // Skip extension.
+          VCMPSrc = VCMPSrc.getOperand(0);
+
+        if (isBoolSGPR(VCMPSrc))
+          Src = VCMPSrc;
+      }
+      return DCI.DAG.getNode(AMDGPUISD::BRCONDZ, SDLoc(N), N->getVTList(),
+                             N->getOperand(0), // Chain
+                             Src,
+                             N->getOperand(2),         // BB
+                             DCI.DAG.getCondCode(CC)); // SETEQ|SETNE
+    }
+  }
+  return SDValue(N, 0);
+}
 
 SDValue SITargetLowering::PerformDAGCombine(SDNode *N,
                                             DAGCombinerInfo &DCI) const {
@@ -13694,6 +13745,8 @@ SDValue SITargetLowering::PerformDAGCombine(SDNode *N,
     return performInsertVectorEltCombine(N, DCI);
   case ISD::FP_ROUND:
     return performFPRoundCombine(N, DCI);
+  case ISD::BRCOND:
+    return performBRCondCombine(N, DCI);
   case ISD::LOAD: {
     if (SDValue Widended = widenLoad(cast<LoadSDNode>(N), DCI))
       return Widended;
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.h b/llvm/lib/Target/AMDGPU/SIISelLowering.h
index d717e12d29a514a..f03b83705d14083 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.h
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.h
@@ -220,6 +220,7 @@ class SITargetLowering final : public AMDGPUTargetLowering {
   SDValue performCvtF32UByteNCombine(SDNode *N, DAGCombinerInfo &DCI) const;
   SDValue performClampCombine(SDNode *N, DAGCombinerInfo &DCI) const;
   SDValue performRcpCombine(SDNode *N, DAGCombinerInfo &DCI) const;
+  SDValue performBRCondCombine(SDNode *N, DAGCombinerInfo &DCI) const;
 
   bool isLegalFlatAddressingMode(const AddrMode &AM) const;
   bool isLegalMUBUFAddressingMode(const AddrMode &AM) const;
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i32.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i32.ll
index 8bd1be04650e005..6c12329930b8a2c 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i32.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i32.ll
@@ -83,3 +83,313 @@ define amdgpu_cs i32 @ctpop_of_ballot(float %x, float %y) {
   %bcnt = call i32 @llvm.ctpop.i32(i32 %ballot)
   ret i32 %bcnt
 }
+
+define amdgpu_cs i32 @branch_divergent_ballot_ne_zero_non_compare(i32 %v) {
+; CHECK-LABEL: branch_divergent_ballot_ne_zero_non_compare:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    v_and_b32_e32 v0, 1, v0
+; CHECK-NEXT:    v_cmp_ne_u32_e32 vcc_lo, 0, v0
+; CHECK-NEXT:    s_cmp_eq_u32 vcc_lo, 0
+; CHECK-NEXT:    s_cbranch_scc1 .LBB7_2
+; CHECK-NEXT:  ; %bb.1: ; %true
+; CHECK-NEXT:    s_mov_b32 s0, 42
+; CHECK-NEXT:    s_branch .LBB7_3
+; CHECK-NEXT:  .LBB7_2: ; %false
+; CHECK-NEXT:    s_mov_b32 s0, 33
+; CHECK-NEXT:    s_branch .LBB7_3
+; CHECK-NEXT:  .LBB7_3:
+  %c = trunc i32 %v to i1
+  %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %c)
+  %ballot_ne_zero = icmp ne i32 %ballot, 0
+  br i1 %ballot_ne_zero, label %true, label %false
+true:
+  ret i32 42
+false:
+  ret i32 33
+}
+
+define amdgpu_cs i32 @branch_uniform_ballot_ne_zero_non_compare(i32 inreg %v) {
+; CHECK-LABEL: branch_uniform_ballot_ne_zero_non_compare:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_and_b32 s0, 1, s0
+; CHECK-NEXT:    v_cmp_ne_u32_e64 s0, 0, s0
+; CHECK-NEXT:    s_cmp_eq_u32 s0, 0
+; CHECK-NEXT:    s_cbranch_scc1 .LBB8_2
+; CHECK-NEXT:  ; %bb.1: ; %true
+; CHECK-NEXT:    s_mov_b32 s0, 42
+; CHECK-NEXT:    s_branch .LBB8_3
+; CHECK-NEXT:  .LBB8_2: ; %false
+; CHECK-NEXT:    s_mov_b32 s0, 33
+; CHECK-NEXT:    s_branch .LBB8_3
+; CHECK-NEXT:  .LBB8_3:
+  %c = trunc i32 %v to i1
+  %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %c)
+  %ballot_ne_zero = icmp ne i32 %ballot, 0
+  br i1 %ballot_ne_zero, label %true, label %false
+true:
+  ret i32 42
+false:
+  ret i32 33
+}
+
+define amdgpu_cs i32 @branch_divergent_ballot_eq_zero_non_compare(i32 %v) {
+; CHECK-LABEL: branch_divergent_ballot_eq_zero_non_compare:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    v_and_b32_e32 v0, 1, v0
+; CHECK-NEXT:    v_cmp_ne_u32_e32 vcc_lo, 0, v0
+; CHECK-NEXT:    s_cmp_lg_u32 vcc_lo, 0
+; CHECK-NEXT:    s_cbranch_scc0 .LBB9_2
+; CHECK-NEXT:  ; %bb.1: ; %false
+; CHECK-NEXT:    s_mov_b32 s0, 33
+; CHECK-NEXT:    s_branch .LBB9_3
+; CHECK-NEXT:  .LBB9_2: ; %true
+; CHECK-NEXT:    s_mov_b32 s0, 42
+; CHECK-NEXT:    s_branch .LBB9_3
+; CHECK-NEXT:  .LBB9_3:
+  %c = trunc i32 %v to i1
+  %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %c)
+  %ballot_eq_zero = icmp eq i32 %ballot, 0
+  br i1 %ballot_eq_zero, label %true, label %false
+true:
+  ret i32 42
+false:
+  ret i32 33
+}
+
+define amdgpu_cs i32 @branch_uniform_ballot_eq_zero_non_compare(i32 inreg %v) {
+; CHECK-LABEL: branch_uniform_ballot_eq_zero_non_compare:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_and_b32 s0, 1, s0
+; CHECK-NEXT:    v_cmp_ne_u32_e64 s0, 0, s0
+; CHECK-NEXT:    s_cmp_lg_u32 s0, 0
+; CHECK-NEXT:    s_cbranch_scc0 .LBB10_2
+; CHECK-NEXT:  ; %bb.1: ; %false
+; CHECK-NEXT:    s_mov_b32 s0, 33
+; CHECK-NEXT:    s_branch .LBB10_3
+; CHECK-NEXT:  .LBB10_2: ; %true
+; CHECK-NEXT:    s_mov_b32 s0, 42
+; CHECK-NEXT:    s_branch .LBB10_3
+; CHECK-NEXT:  .LBB10_3:
+  %c = trunc i32 %v to i1
+  %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %c)
+  %ballot_eq_zero = icmp eq i32 %ballot, 0
+  br i1 %ballot_eq_zero, label %true, label %false
+true:
+  ret i32 42
+false:
+  ret i32 33
+}
+
+define amdgpu_cs i32 @branch_divergent_ballot_ne_zero_compare(i32 %v) {
+; CHECK-LABEL: branch_divergent_ballot_ne_zero_compare:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    v_cmp_gt_u32_e32 vcc_lo, 12, v0
+; CHECK-NEXT:    s_cmp_eq_u32 vcc_lo, 0
+; CHECK-NEXT:    s_cbranch_scc1 .LBB11_2
+; CHECK-NEXT:  ; %bb.1: ; %true
+; CHECK-NEXT:    s_mov_b32 s0, 42
+; CHECK-NEXT:    s_branch .LBB11_3
+; CHECK-NEXT:  .LBB11_2: ; %false
+; CHECK-NEXT:    s_mov_b32 s0, 33
+; CHECK-NEXT:    s_branch .LBB11_3
+; CHECK-NEXT:  .LBB11_3:
+  %c = icmp ult i32 %v, 12
+  %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %c)
+  %ballot_ne_zero = icmp ne i32 %ballot, 0
+  br i1 %ballot_ne_zero, label %true, label %false
+true:
+  ret i32 42
+false:
+  ret i32 33
+}
+
+define amdgpu_cs i32 @branch_uniform_ballot_ne_zero_compare(i32 inreg %v) {
+; CHECK-LABEL: branch_uniform_ballot_ne_zero_compare:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_cmp_lt_u32 s0, 12
+; CHECK-NEXT:    s_cselect_b32 s0, 1, 0
+; CHECK-NEXT:    s_and_b32 s0, 1, s0
+; CHECK-NEXT:    v_cmp_ne_u32_e64 s0, 0, s0
+; CHECK-NEXT:    s_cmp_eq_u32 s0, 0
+; CHECK-NEXT:    s_cbranch_scc1 .LBB12_2
+; CHECK-NEXT:  ; %bb.1: ; %true
+; CHECK-NEXT:    s_mov_b32 s0, 42
+; CHECK-NEXT:    s_branch .LBB12_3
+; CHECK-NEXT:  .LBB12_2: ; %false
+; CHECK-NEXT:    s_mov_b32 s0, 33
+; CHECK-NEXT:    s_branch .LBB12_3
+; CHECK-NEXT:  .LBB12_3:
+  %c = icmp ult i32 %v, 12
+  %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %c)
+  %ballot_ne_zero = icmp ne i32 %ballot, 0
+  br i1 %ballot_ne_zero, label %true, label %false
+true:
+  ret i32 42
+false:
+  ret i32 33
+}
+
+define amdgpu_cs i32 @branch_divergent_ballot_eq_zero_compare(i32 %v) {
+; CHECK-LABEL: branch_divergent_ballot_eq_zero_compare:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    v_cmp_gt_u32_e32 vcc_lo, 12, v0
+; CHECK-NEXT:    s_cmp_lg_u32 vcc_lo, 0
+; CHECK-NEXT:    s_cbranch_scc0 .LBB13_2
+; CHECK-NEXT:  ; %bb.1: ; %false
+; CHECK-NEXT:    s_mov_b32 s0, 33
+; CHECK-NEXT:    s_branch .LBB13_3
+; CHECK-NEXT:  .LBB13_2: ; %true
+; CHECK-NEXT:    s_mov_b32 s0, 42
+; CHECK-NEXT:    s_branch .LBB13_3
+; CHECK-NEXT:  .LBB13_3:
+  %c = icmp ult i32 %v, 12
+  %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %c)
+  %ballot_eq_zero = icmp eq i32 %ballot, 0
+  br i1 %ballot_eq_zero, label %true, label %false
+true:
+  ret i32 42
+false:
+  ret i32 33
+}
+
+define amdgpu_cs i32 @branch_uniform_ballot_eq_zero_compare(i32 inreg %v) {
+; CHECK-LABEL: branch_uniform_ballot_eq_zero_compare:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_cmp_lt_u32 s0, 12
+; CHECK-NEXT:    s_cselect_b32 s0, 1, 0
+; CHECK-NEXT:    s_and_b32 s0, 1, s0
+; CHECK-NEXT:    v_cmp_ne_u32_e64 s0, 0, s0
+; CHECK-NEXT:    s_cmp_lg_u32 s0, 0
+; CHECK-NEXT:    s_cbranch_scc0 .LBB14_2
+; CHECK-NEXT:  ; %bb.1: ; %false
+; CHECK-NEXT:    s_mov_b32 s0, 33
+; CHECK-NEXT:    s_branch .LBB14_3
+; CHECK-NEXT:  .LBB14_2: ; %true
+; CHECK-NEXT:    s_mov_b32 s0, 42
+; CHECK-NEXT:    s_branch .LBB14_3
+; CHECK-NEXT:  .LBB14_3:
+  %c = icmp ult i32 %v, 12
+  %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %c)
+  %ballot_eq_zero = icmp eq i32 %ballot, 0
+  br i1 %ballot_eq_zero, label %true, label %false
+true:
+  ret i32 42
+false:
+  ret i32 33
+}
+
+define amdgpu_cs i32 @branch_divergent_ballot_ne_zero_and(i32 %v1, i32 %v2) {
+; CHECK-LABEL: branch_divergent_ballot_ne_zero_and:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    v_cmp_gt_u32_e32 vcc_lo, 12, v0
+; CHECK-NEXT:    v_cmp_lt_u32_e64 s0, 34, v1
+; CHECK-NEXT:    s_and_b32 s0, vcc_lo, s0
+; CHECK-NEXT:    s_cmp_eq_u32 s0, 0
+; CHECK-NEXT:    s_cbranch_scc1 .LBB15_2
+; CHECK-NEXT:  ; %bb.1: ; %true
+; CHECK-NEXT:    s_mov_b32 s0, 42
+; CHECK-NEXT:    s_branch .LBB15_3
+; CHECK-NEXT:  .LBB15_2: ; %false
+; CHECK-NEXT:    s_mov_b32 s0, 33
+; CHECK-NEXT:    s_branch .LBB15_3
+; CHECK-NEXT:  .LBB15_3:
+  %v1c = icmp ult i32 %v1, 12
+  %v2c = icmp ugt i32 %v2, 34
+  %c = and i1 %v1c, %v2c
+  %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %c)
+  %ballot_ne_zero = icmp ne i32 %ballot, 0
+  br i1 %ballot_ne_zero, label %true, label %false
+true:
+  ret i32 42
+false:
+  ret i32 33
+}
+
+define amdgpu_cs i32 @branch_uniform_ballot_ne_zero_and(i32 inreg %v1, i32 inreg %v2) {
+; CHECK-LABEL: branch_uniform_ballot_ne_zero_and:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_cmp_lt_u32 s0, 12
+; CHECK-NEXT:    s_cselect_b32 s0, 1, 0
+; CHECK-NEXT:    s_cmp_gt_u32 s1, 34
+; CHECK-NEXT:    s_cselect_b32 s1, 1, 0
+; CHECK-NEXT:    s_and_b32 s0, s0, s1
+; CHECK-NEXT:    s_and_b32 s0, 1, s0
+; CHECK-NEXT:    v_cmp_ne_u32_e64 s0, 0, s0
+; CHECK-NEXT:    s_cmp_eq_u32 s0, 0
+; CHECK-NEXT:    s_cbranch_scc1 .LBB16_2
+; CHECK-NEXT:  ; %bb.1: ; %true
+; CHECK-NEXT:    s_mov_b32 s0, 42
+; CHECK-NEXT:    s_branch .LBB16_3
+; CHECK-NEXT:  .LBB16_2: ; %false
+; CHECK-NEXT:    s_mov_b32 s0, 33
+; CHECK-NEXT:    s_branch .LBB16_3
+; CHECK-NEXT:  .LBB16_3:
+  %v1c = icmp ult i32 %v1, 12
+  %v2c = icmp ugt i32 %v2, 34
+  %c = and i1 %v1c, %v2c
+  %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %c)
+  %ballot_ne_zero = icmp ne i32 %ballot, 0
+  br i1 %ballot_ne_zero, label %true, label %false
+true:
+  ret i32 42
+false:
+  ret i32 33
+}
+
+define amdgpu_cs i32 @branch_divergent_ballot_eq_zero_and(i32 %v1, i32 %v2) {
+; CHECK-LABEL: branch_divergent_ballot_eq_zero_and:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    v_cmp_gt_u32_e32 vcc_lo, 12, v0
+; CHECK-NEXT:    v_cmp_lt_u32_e64 s0, 34, v1
+; CHECK-NEXT:    s_and_b32 s0, vcc_lo, s0
+; CHECK-NEXT:    s_cmp_lg_u32 s0, 0
+; CHECK-NEXT:    s_cbranch_scc0 .LBB17_2
+; CHECK-NEXT:  ; %bb.1: ; %false
+; CHECK-NEXT:    s_mov_b32 s0, 33
+; CHECK-NEXT:    s_branch .LBB17_3
+; CHECK-NEXT:  .LBB17_2: ; %true
+; CHECK-NEXT:    s_mov_b32 s0, 42
+; CHECK-NEXT:    s_branch .LBB17_3
+; CHECK-NEXT:  .LBB17_3:
+  %v1c = icmp ult i32 %v1, 12
+  %v2c = icmp ugt i32 %v2, 34
+  %c = and i1 %v1c, %v2c
+  %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %c)
+  %ballot_eq_zero = icmp eq i32 %ballot, 0
+  br i1 %ballot_eq_zero, label %true, label %false
+true:
+  ret i32 42
+false:
+  ret i32 33
+}
+
+define amdgpu_cs i32 @branch_uniform_ballot_eq_zero_and(i32 inreg %v1, i32 inreg %v2) {
+; CHECK-LABEL: branch_uniform_ballot_eq_zero_and:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_cmp_lt_u32 s0, 12
+; CHECK-NEXT:    s_cselect_b32 s0, 1, 0
+; CHECK-NEXT:    s_cmp_gt_u32 s1, 34
+; CHECK-NEXT:    s_cselect_b32 s1, 1, 0
+; CHECK-NEXT:    s_and_b32 s0, s0, s1
+; CHECK-NEXT:    s_and_b32 s0, 1, s0
+; CHECK-NEXT:    v_cmp_ne_u32_e64 s0, 0, s0
+; CHECK-NEXT:    s_cmp_lg_u32 s0, 0
+; CHECK-NEXT:    s_cbranch_scc0 .LBB18_2
+; CHECK-NEXT:  ; %bb.1: ; %false
+; CHECK-NEXT:    s_mov_b32 s0, 33
+; CHECK-NEXT:    s_branch .LBB18_3
+; CHECK-NEXT:  .LBB18_2: ; %true
+; CHECK-NEXT:    s_mov_b32 s0, 42
+; CHECK-NEXT:    s_branch .LBB18_3
+; CHECK-NEXT:  .LBB18_3:
+  %v1c = icmp ult i32 %v1, 12
+  %v2c = icmp ugt i32 %v2, 34
+  %c = and i1 %v1c, %v2c
+  %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %c)
+  %ballot_eq_zero = icmp eq i32 %ballot, 0
+  br i1 %ballot_eq_zero, label %true, label %false
+true:
+  ret i32 42
+false:
+  ret i32 33
+}
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i64.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i64.ll
index 9f83012f5457509..ebb96ddc0603d68 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i64.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i64.ll
@@ -86,3 +86,313 @@ define amdgpu_cs i64 @ctpop_of_ballot(float %x, float %y) {
   %bcnt = call i64 @llvm.ctpop.i64(i64 %ballot)
   ret i64 %bcnt
 }
+
+d...
[truncated]

@vpykhtin vpykhtin changed the title Improve selection of conditional branch on amdgcb.ballot!=0 condition in SelectionDAG. Improve selection of conditional branch on amdgcn.ballot!=0 condition in SelectionDAG. Oct 10, 2023
Comment on lines 13597 to 13602
// %VCMP = i32/i64 AMDGPUISD::SETCC ...
// %C = ISD::SETCC %VCMP, 0, setne/seteq
// BRCOND %BB, %C
// =>
// %VCMP = i32/i64 AMDGPUISD::SETCC ...
// BRCONDZ %BB, %VCMP, setne/seteq
Contributor

Why does VCMP have to be AMDGPUISD::SETCC here? Can you do a simpler combine instead:

    // %C = ISD::SETCC %VCMP, 0, setne/seteq
    // BRCOND %BB, %C
    // =>
    // BRCONDZ %BB, %VCMP, setne/seteq

Comment on lines 13615 to 13620
// %VCMPSrc = ISD::SETCC or a logical combination of ISD::SETCCs
// %VCMP = i32/i64 AMDGPUISD::SETCC (ext %VCMPSrc), 0, setne
// %C = ISD::SETCC %VCMP, 0, setne/seteq
// BRCOND %BB, %C
// =>
// BRCONDZ %BB, %VCMPSrc, setne/seteq
Contributor

Can you split this out into a separate simpler combine:

        // %VCMPSrc = ISD::SETCC or a logical combination of ISD::SETCCs
        // %VCMP = i32/i64 AMDGPUISD::SETCC (ext %VCMPSrc), 0, setne
        // %C = ISD::SETCC %VCMP, 0, setne/seteq
        // =>
        // %C = ISD::SETCC or a logical combination of ISD::SETCCs

Contributor Author

@vpykhtin vpykhtin Oct 11, 2023

Thank you Jay, these are good suggestions, and they showed me that BRCONDZ has vague semantics.

Here I use lowercase setcc to refer to the ordinary i1 ISD::SETCC opcode, and AMDGPUISD::SETCC for the special node returning a boolean mask across lanes.

The problem is that when we get rid of AMDGPUISD::SETCC (as in your second pattern) we're losing the semantics of a boolean mask across all lanes.

This confusion comes from the description of the BRCONDZ node, which says it can accept a setcc value or a logical combination of setccs. Instead it should be defined so that BRCONDZ only accepts the result of an AMDGPUISD::SETCC node, meaning that BRCONDZ compares the uniform boolean mask produced by the AMDGPUISD::SETCC operation to zero.

For example:

  %Mask = i64 AMDGPUISD::SETCC i1 %c, 42, setle
  BRCONDZ i64 %Mask, BB2, setne
=>
  v_cmp_le vcc, v0, 42
  s_cbranch_vccnz bb2

Now the confusing case with ballot:

  %c = i1 setcc ...
  %Mask = i64 AMDGPUISD::SETCC (zext i1 %c to i32), 0, setne  ; <- this is the lowered ballot
  BRCONDZ i64 %Mask, BB2, setne
=>
  v_cmp_ vcc, ... ; <- setcc
  s_cmp_ne vcc, 0 ; <- ballot
  s_cbranch_scc1 bb2

Here we want to remove the second comparison, because vcc is exactly the boolean mask across all lanes produced by the first v_cmp instruction, and it can be compared to zero by the branch itself. In SelectionDAG we know that a setcc on a divergent value will be selected as a v_cmp instruction, so we can use its result directly; that is why the setcc input to ballot is delivered to BRCONDZ. This is probably wrong.
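To see why the second comparison is redundant, here is a small C++ sketch (an assumed wave32 model with hypothetical names, for illustration only) of the lowered ballot applied to a mask that already came from a compare: zero-extending each lane's bit and testing it with setne against zero just reproduces the input mask bit-for-bit.

```cpp
#include <cassert>
#include <cstdint>

// Assumed wave32 model; names are hypothetical (illustration only).
// VCmpMask holds one bit per lane, as produced by a v_cmp instruction.
// The lowered ballot is "AMDGPUISD::SETCC (zext %c), 0, setne", applied
// per lane: zero-extend each lane's i1 and compare it against zero.
inline uint32_t loweredBallot(uint32_t VCmpMask) {
  uint32_t Out = 0;
  for (int Lane = 0; Lane < 32; ++Lane) {
    uint32_t Bit = (VCmpMask >> Lane) & 1; // zext i1 -> i32
    if (Bit != 0)                          // setne 0
      Out |= 1u << Lane;
  }
  return Out;
}
```

Since loweredBallot(M) == M for any mask M that came from a compare, testing the ballot result against zero is the same as testing the v_cmp mask against zero, so the extra scalar compare can be folded away and the branch can consume the mask directly.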

I'm not really sure how to resolve this yet.

Contributor Author

On second thought, these two patterns have a node in common:

%C = ISD::SETCC %VCMP, 0, setne/seteq
BRCOND %BB, %C
=>
BRCONDZ %BB, %VCMP, setne/seteq

%VCMPSrc = ISD::SETCC or a logical combination of ISD::SETCCs
%VCMP = i32/i64 AMDGPUISD::SETCC (ext %VCMPSrc), 0, setne
%C = ISD::SETCC %VCMP, 0, setne/seteq
=>
%C = ISD::SETCC or a logical combination of ISD::SETCCs

SelectionDAG starts with the BRCOND of the first pattern and erases the ISD::SETCC that is the root of the second pattern. I tried to run the DAG combine on the %C value of the first rule so that the second rule could perform its transformation and return the result to the first rule, but this requires creating an artificial %C = ISD::SETCC %VCMPSrc, 0, setne result so that the first rule's pattern could match. It looks like this should work as a single pattern.

Contributor Author

I've reverted the previous implementation and submitted a simplified version that doesn't use an additional DAG node.

@github-actions

github-actions bot commented Oct 15, 2023

✅ With the latest revision this PR passed the C/C++ code formatter.

@vpykhtin vpykhtin force-pushed the improve_ballot_ne_zero_selection branch from 4429535 to abb2ca2 Compare October 15, 2023 11:17
@vpykhtin vpykhtin force-pushed the improve_ballot_ne_zero_selection branch from abb2ca2 to 8b836e2 Compare October 16, 2023 11:52
@vpykhtin
Contributor Author

Fixed SCC not being used when
// %C = i1 ISD::SETCC %VCMP, 0, setne/seteq
hasn't matched. Added tests.

@vpykhtin vpykhtin self-assigned this Oct 18, 2023
@@ -2259,6 +2259,30 @@ bool AMDGPUDAGToDAGISel::isCBranchSCC(const SDNode *N) const {
return false;
}

bool isBoolSGPR(SDValue V);
Contributor

Can just move the body up here to avoid the forward declaration

Contributor Author

The definition is in another source file (SIISelLowering.cpp); I haven't found a common header to place this declaration in yet.

assert(VCMP->getOpcode() == AMDGPUISD::SETCC);
// Special case for amdgcn.ballot:
// %Cond = i1 (and/or combination of i1 ISD::SETCCs)
// %VCMP = i(WaveSize) AMDGPUISD::SETCC (ext %Cond), 0, setne ; lowered ballot
Contributor

You might need to handle the negated form as well

Contributor Author

Added, but I had to simulate it using the llvm.amdgcn.icmp(i1 %Cond, i1 0, ICMP_EQ) intrinsic.

 * supported negated ballot
 * improved comments
Contributor Author

@vpykhtin vpykhtin left a comment

I've added support for the negated ballot, but I had to simulate it using the llvm.amdgcn.icmp intrinsic.

Tests for GlobalISel aren't added because they fail with "cannot select llvm.amdgcn.icmp(i1 ...)".

Collaborator

@nhaehnle nhaehnle left a comment

Seems pretty reasonable to me

Comment on lines +430 to +435
; CHECK: ; %bb.0:
; CHECK-NEXT: s_cmp_lt_u32 s0, 12
; CHECK-NEXT: s_cselect_b32 s0, -1, 0
; CHECK-NEXT: s_cmp_gt_u32 s1, 34
; CHECK-NEXT: s_cselect_b32 s1, -1, 0
; CHECK-NEXT: s_and_b32 s0, s0, s1
Collaborator

This sequence could be improved to:

  s_cmp_lt_u32 s0, 12
  s_cselect_b32 s0, -1, 0
  s_cmp_gt_u32 s1, 34
  s_cselect_b32 s0, s0, 0

By selecting into vcc(_lo) instead, we could even avoid the AND-with-exec that follows.

Not something for this patch, just an observation. Perhaps you could add a TODO.

(In general, LLVM knows to do boolean logic -> select folds. The fact that they don't fire here is probably a pass ordering issue.)

Contributor Author

Perhaps you could add a TODO.

Thanks Nicolai, added the TODO.

@vpykhtin
Contributor Author

With the latest commit I added a guard to skip ballot.i64 in wave32 mode, because it's lowered with an i64 AMDGPUISD::SETCC; this should be fixed to produce an i32 result, similar to the ICMP/FCMP intrinsics. In a follow-up fix to this PR I'm going to turn the guard into an assert; a TODO has been added for this.

@nhaehnle
Collaborator

nhaehnle commented Nov 3, 2023

Though, on second thought, shouldn't there be some wave64 tests?

@vpykhtin
Contributor Author

vpykhtin commented Nov 3, 2023

Though, on second thought, shouldn't there be some wave64 tests?

I added a couple of tests for ballot.i64 in wave32 mode to AMDGPU/GlobalISel/llvm.amdgcn.ballot.i32.ll and AMDGPU/llvm.amdgcn.ballot.i32.ll with commit 6e865d1.

Or do you mean ballot.i32 in wave64 mode?

@nhaehnle
Collaborator

nhaehnle commented Nov 3, 2023

I meant ballot.i64 in wave64. For some reason I didn't see the changes you made in those tests (perhaps I was looking at GitHub's display of your last changes to the PR instead of the full PR). But since you did make those changes, it's all good as far as I'm concerned.

@vpykhtin
Contributor Author

vpykhtin commented Nov 6, 2023

Thanks Nicolai!

@vpykhtin vpykhtin merged commit fe6893b into llvm:main Nov 6, 2023
3 checks passed
@vpykhtin vpykhtin deleted the improve_ballot_ne_zero_selection branch November 6, 2023 14:17