
Improve selection of conditional branch on amdgcn.ballot!=0 condition in SelectionDAG. #68714

Merged (9 commits, Nov 6, 2023)

Conversation

vpykhtin
Contributor

I'm going to introduce changes to the address sanitizer instrumentation checks for the AMDGPU target, and they will make use of the following pattern:

bool cnd = ...
if (amdgcn.ballot(cnd) != 0) {
  ...
}

which means "execute the then-block if any lane satisfies the cnd condition". Note that this is a uniform condition, since amdgcn.ballot is a uniform intrinsic.

This patch improves code generation for this pattern in SelectionDAG; the GlobalISel change should be added later. It introduces a special AMDGPUISD::BRCONDZ node that is created in the DAG combiner and maps directly to either S_CBRANCH_VCC(Z|NZ) or, if cnd is itself uniform, S_CBRANCH_SCC(0|1).

P.S. This pull request contains two commits; the first one just adds tests, so I recommend looking at the second commit, 13ebb87, to see the difference it makes in the test output.

@llvmbot
Collaborator

llvmbot commented Oct 10, 2023

@llvm/pr-subscribers-llvm-globalisel

@llvm/pr-subscribers-backend-amdgpu

Author: Valery (vpykhtin)

Changes


Patch is 49.48 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/68714.diff

11 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp (+31)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.h (+1)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp (+1)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h (+14)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td (+6)
  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+54-1)
  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.h (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i32.ll (+310)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i64.ll (+310)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ballot.i32.ll (+290)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ballot.i64.ll (+290)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
index b5ceaaa14b4fd5e..90addb12a81abcf 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
@@ -666,6 +666,9 @@ void AMDGPUDAGToDAGISel::Select(SDNode *N) {
   case ISD::FP_EXTEND:
     SelectFP_EXTEND(N);
     return;
+  case AMDGPUISD::BRCONDZ:
+    SelectBRCONDZ(N);
+    return;
   case AMDGPUISD::CVT_PKRTZ_F16_F32:
   case AMDGPUISD::CVT_PKNORM_I16_F32:
   case AMDGPUISD::CVT_PKNORM_U16_F32:
@@ -2306,6 +2309,34 @@ void AMDGPUDAGToDAGISel::SelectBRCOND(SDNode *N) {
                        VCC.getValue(0));
 }
 
+void AMDGPUDAGToDAGISel::SelectBRCONDZ(SDNode *N) {
+  const SIRegisterInfo *TRI =
+      static_cast<const SIRegisterInfo *>(Subtarget->getRegisterInfo());
+
+  SDValue Cond = N->getOperand(1);
+
+  // BRCONDZ condition is either AMDGPUISD::SETCC or i1 value that comes from
+  // ISD::SETCC node or logical combination of ISD::SETCCs therefore we don't
+  // need to AND the condition with execmask.
+
+  // TODO: AMDGPUISD::SETCC is always selected as V_CMP so use VCC condition.
+  // This might change later.
+  bool UseSCCBr = Cond->getOpcode() != AMDGPUISD::SETCC && !Cond->isDivergent();
+
+  auto CondCode = cast<CondCodeSDNode>(N->getOperand(3))->get();
+  assert(CondCode == ISD::SETEQ || CondCode == ISD::SETNE);
+
+  bool EqZero = CondCode == ISD::SETEQ;
+  unsigned BrOp =
+      UseSCCBr ? (EqZero ? AMDGPU::S_CBRANCH_SCC0 : AMDGPU::S_CBRANCH_SCC1)
+               : (EqZero ? AMDGPU::S_CBRANCH_VCCZ : AMDGPU::S_CBRANCH_VCCNZ);
+
+  SDValue CondCopy = CurDAG->getCopyToReg(
+      N->getOperand(0), SDLoc(N), UseSCCBr ? AMDGPU::SCC : TRI->getVCC(),
+      N->getOperand(1));
+  CurDAG->SelectNodeTo(N, BrOp, MVT::Other, N->getOperand(2), CondCopy);
+}
+
 void AMDGPUDAGToDAGISel::SelectFP_EXTEND(SDNode *N) {
   if (Subtarget->hasSALUFloatInsts() && N->getValueType(0) == MVT::f32 &&
       !N->isDivergent()) {
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.h b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.h
index a8a606f60a3faee..255ca62cb7a9100 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.h
@@ -272,6 +272,7 @@ class AMDGPUDAGToDAGISel : public SelectionDAGISel {
   void SelectS_BFE(SDNode *N);
   bool isCBranchSCC(const SDNode *N) const;
   void SelectBRCOND(SDNode *N);
+  void SelectBRCONDZ(SDNode *N);
   void SelectFMAD_FMA(SDNode *N);
   void SelectFP_EXTEND(SDNode *N);
   void SelectDSAppendConsume(SDNode *N, unsigned IntrID);
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
index 607d59db7bcf709..a268a807679f473 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
@@ -5317,6 +5317,7 @@ const char* AMDGPUTargetLowering::getTargetNodeName(unsigned Opcode) const {
   NODE_NAME_CASE(BUFFER_ATOMIC_FADD)
   NODE_NAME_CASE(BUFFER_ATOMIC_FMIN)
   NODE_NAME_CASE(BUFFER_ATOMIC_FMAX)
+  NODE_NAME_CASE(BRCONDZ)
 
   case AMDGPUISD::LAST_AMDGPU_ISD_NUMBER: break;
   }
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h
index e971c85ee3f6e39..43e572dcc56423d 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h
@@ -422,6 +422,20 @@ enum NodeType : unsigned {
   // This is SETCC with the full mask result which is used for a compare with a
   // result bit per item in the wavefront.
   SETCC,
+
+  // Conditional branch on comparison of CondWaveMask operand to zero.
+  //   BRCONDZ CondWaveMask, BB, CondCode
+  // where:
+  //   - CondWaveMask - is either:
+  //     * the i32/i64 result of AMDGPUISD::SETCC node,
+  //     * i1 value that comes from ISD::SETCC node or logical combination of
+  //       ISD::SETCCs. For a divergent node this becomes a i32/i64 value after
+  //       selection.
+  //   - BB is the target basic block,
+  //   - CondCode is either SETEQ or SETNE meaning that the branch should happen
+  //     if the CondWaveMask is either equal to zero or not.
+  BRCONDZ,
+
   SETREG,
 
   DENORM_MODE,
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td b/llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td
index 2492a7be651f6d6..a6b4fa4937dff30 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td
@@ -58,6 +58,11 @@ def AMDGPUIfBreakOp : SDTypeProfile<1, 2,
   [SDTCisVT<0, i1>, SDTCisVT<1, i1>, SDTCisVT<2, i1>]
 >;
 
+def AMDGPUBrcondzOp : SDTypeProfile<0, 3, [
+  // cond, bb, cc
+  SDTCisInt<0>, SDTCisVT<1, OtherVT>, SDTCisVT<2, OtherVT>
+]>;
+
 //===----------------------------------------------------------------------===//
 // AMDGPU DAG Nodes
 //
@@ -65,6 +70,7 @@ def AMDGPUIfBreakOp : SDTypeProfile<1, 2,
 def AMDGPUif : SDNode<"AMDGPUISD::IF", AMDGPUIfOp, [SDNPHasChain]>;
 def AMDGPUelse : SDNode<"AMDGPUISD::ELSE", AMDGPUElseOp, [SDNPHasChain]>;
 def AMDGPUloop : SDNode<"AMDGPUISD::LOOP", AMDGPULoopOp, [SDNPHasChain]>;
+def AMDGPUbrcondz: SDNode<"AMDGPUISD::BRCONDZ", AMDGPUBrcondzOp, [SDNPHasChain]>;
 
 def callseq_start : SDNode<"ISD::CALLSEQ_START",
   SDCallSeqStart<[ SDTCisVT<0, i32>, SDTCisVT<1, i32> ]>,
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 0a32844bdb01a09..5edf1446ca34dac 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -798,7 +798,8 @@ SITargetLowering::SITargetLowering(const TargetMachine &TM,
                        ISD::SIGN_EXTEND_INREG,
                        ISD::EXTRACT_VECTOR_ELT,
                        ISD::INSERT_VECTOR_ELT,
-                       ISD::FCOPYSIGN});
+                       ISD::FCOPYSIGN,
+                       ISD::BRCOND});
 
   if (Subtarget->has16BitInsts() && !Subtarget->hasMed3_16())
     setTargetDAGCombine(ISD::FP_ROUND);
@@ -13584,6 +13585,56 @@ SDValue SITargetLowering::performClampCombine(SDNode *N,
   return SDValue(CSrc, 0);
 }
 
+SDValue SITargetLowering::performBRCondCombine(SDNode *N,
+                                               DAGCombinerInfo &DCI) const {
+  if (!DCI.isAfterLegalizeDAG())
+    return SDValue(N, 0);
+
+  SDValue Cond = N->getOperand(1);
+  if (Cond.getOpcode() == ISD::SETCC &&
+      Cond->getOperand(0)->getOpcode() == AMDGPUISD::SETCC) {
+
+    // %VCMP = i32/i64 AMDGPUISD::SETCC ...
+    // %C = ISD::SETCC %VCMP, 0, setne/seteq
+    // BRCOND %BB, %C
+    // =>
+    // %VCMP = i32/i64 AMDGPUISD::SETCC ...
+    // BRCONDZ %BB, %VCMP, setne/seteq
+
+    auto CC = cast<CondCodeSDNode>(Cond->getOperand(2))->get();
+    auto *CRHS = dyn_cast<ConstantSDNode>(Cond->getOperand(1));
+    if ((CC == ISD::SETEQ || CC == ISD::SETNE) && CRHS && CRHS->isZero()) {
+
+      auto VCMP = Cond->getOperand(0);
+      auto VCMP_CC = cast<CondCodeSDNode>(VCMP.getOperand(2))->get();
+      auto *VCMP_CRHS = dyn_cast<ConstantSDNode>(VCMP.getOperand(1));
+      auto Src = VCMP;
+      if (VCMP_CC == ISD::SETNE && VCMP_CRHS && VCMP_CRHS->isZero()) {
+
+        // Special case for amdgcn.ballot:
+        // %VCMPSrc = ISD::SETCC or a logical combination of ISD::SETCCs
+        // %VCMP = i32/i64 AMDGPUISD::SETCC (ext %VCMPSrc), 0, setne
+        // %C = ISD::SETCC %VCMP, 0, setne/seteq
+        // BRCOND %BB, %C
+        // =>
+        // BRCONDZ %BB, %VCMPSrc, setne/seteq
+
+        auto VCMPSrc = VCMP.getOperand(0);
+        if (ISD::isExtOpcode(VCMPSrc->getOpcode())) // Skip extension.
+          VCMPSrc = VCMPSrc.getOperand(0);
+
+        if (isBoolSGPR(VCMPSrc))
+          Src = VCMPSrc;
+      }
+      return DCI.DAG.getNode(AMDGPUISD::BRCONDZ, SDLoc(N), N->getVTList(),
+                             N->getOperand(0), // Chain
+                             Src,
+                             N->getOperand(2),         // BB
+                             DCI.DAG.getCondCode(CC)); // SETEQ|SETNE
+    }
+  }
+  return SDValue(N, 0);
+}
 
 SDValue SITargetLowering::PerformDAGCombine(SDNode *N,
                                             DAGCombinerInfo &DCI) const {
@@ -13694,6 +13745,8 @@ SDValue SITargetLowering::PerformDAGCombine(SDNode *N,
     return performInsertVectorEltCombine(N, DCI);
   case ISD::FP_ROUND:
     return performFPRoundCombine(N, DCI);
+  case ISD::BRCOND:
+    return performBRCondCombine(N, DCI);
   case ISD::LOAD: {
     if (SDValue Widended = widenLoad(cast<LoadSDNode>(N), DCI))
       return Widended;
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.h b/llvm/lib/Target/AMDGPU/SIISelLowering.h
index d717e12d29a514a..f03b83705d14083 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.h
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.h
@@ -220,6 +220,7 @@ class SITargetLowering final : public AMDGPUTargetLowering {
   SDValue performCvtF32UByteNCombine(SDNode *N, DAGCombinerInfo &DCI) const;
   SDValue performClampCombine(SDNode *N, DAGCombinerInfo &DCI) const;
   SDValue performRcpCombine(SDNode *N, DAGCombinerInfo &DCI) const;
+  SDValue performBRCondCombine(SDNode *N, DAGCombinerInfo &DCI) const;
 
   bool isLegalFlatAddressingMode(const AddrMode &AM) const;
   bool isLegalMUBUFAddressingMode(const AddrMode &AM) const;
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i32.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i32.ll
index 8bd1be04650e005..6c12329930b8a2c 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i32.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i32.ll
@@ -83,3 +83,313 @@ define amdgpu_cs i32 @ctpop_of_ballot(float %x, float %y) {
   %bcnt = call i32 @llvm.ctpop.i32(i32 %ballot)
   ret i32 %bcnt
 }
+
+define amdgpu_cs i32 @branch_divergent_ballot_ne_zero_non_compare(i32 %v) {
+; CHECK-LABEL: branch_divergent_ballot_ne_zero_non_compare:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    v_and_b32_e32 v0, 1, v0
+; CHECK-NEXT:    v_cmp_ne_u32_e32 vcc_lo, 0, v0
+; CHECK-NEXT:    s_cmp_eq_u32 vcc_lo, 0
+; CHECK-NEXT:    s_cbranch_scc1 .LBB7_2
+; CHECK-NEXT:  ; %bb.1: ; %true
+; CHECK-NEXT:    s_mov_b32 s0, 42
+; CHECK-NEXT:    s_branch .LBB7_3
+; CHECK-NEXT:  .LBB7_2: ; %false
+; CHECK-NEXT:    s_mov_b32 s0, 33
+; CHECK-NEXT:    s_branch .LBB7_3
+; CHECK-NEXT:  .LBB7_3:
+  %c = trunc i32 %v to i1
+  %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %c)
+  %ballot_ne_zero = icmp ne i32 %ballot, 0
+  br i1 %ballot_ne_zero, label %true, label %false
+true:
+  ret i32 42
+false:
+  ret i32 33
+}
+
+define amdgpu_cs i32 @branch_uniform_ballot_ne_zero_non_compare(i32 inreg %v) {
+; CHECK-LABEL: branch_uniform_ballot_ne_zero_non_compare:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_and_b32 s0, 1, s0
+; CHECK-NEXT:    v_cmp_ne_u32_e64 s0, 0, s0
+; CHECK-NEXT:    s_cmp_eq_u32 s0, 0
+; CHECK-NEXT:    s_cbranch_scc1 .LBB8_2
+; CHECK-NEXT:  ; %bb.1: ; %true
+; CHECK-NEXT:    s_mov_b32 s0, 42
+; CHECK-NEXT:    s_branch .LBB8_3
+; CHECK-NEXT:  .LBB8_2: ; %false
+; CHECK-NEXT:    s_mov_b32 s0, 33
+; CHECK-NEXT:    s_branch .LBB8_3
+; CHECK-NEXT:  .LBB8_3:
+  %c = trunc i32 %v to i1
+  %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %c)
+  %ballot_ne_zero = icmp ne i32 %ballot, 0
+  br i1 %ballot_ne_zero, label %true, label %false
+true:
+  ret i32 42
+false:
+  ret i32 33
+}
+
+define amdgpu_cs i32 @branch_divergent_ballot_eq_zero_non_compare(i32 %v) {
+; CHECK-LABEL: branch_divergent_ballot_eq_zero_non_compare:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    v_and_b32_e32 v0, 1, v0
+; CHECK-NEXT:    v_cmp_ne_u32_e32 vcc_lo, 0, v0
+; CHECK-NEXT:    s_cmp_lg_u32 vcc_lo, 0
+; CHECK-NEXT:    s_cbranch_scc0 .LBB9_2
+; CHECK-NEXT:  ; %bb.1: ; %false
+; CHECK-NEXT:    s_mov_b32 s0, 33
+; CHECK-NEXT:    s_branch .LBB9_3
+; CHECK-NEXT:  .LBB9_2: ; %true
+; CHECK-NEXT:    s_mov_b32 s0, 42
+; CHECK-NEXT:    s_branch .LBB9_3
+; CHECK-NEXT:  .LBB9_3:
+  %c = trunc i32 %v to i1
+  %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %c)
+  %ballot_eq_zero = icmp eq i32 %ballot, 0
+  br i1 %ballot_eq_zero, label %true, label %false
+true:
+  ret i32 42
+false:
+  ret i32 33
+}
+
+define amdgpu_cs i32 @branch_uniform_ballot_eq_zero_non_compare(i32 inreg %v) {
+; CHECK-LABEL: branch_uniform_ballot_eq_zero_non_compare:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_and_b32 s0, 1, s0
+; CHECK-NEXT:    v_cmp_ne_u32_e64 s0, 0, s0
+; CHECK-NEXT:    s_cmp_lg_u32 s0, 0
+; CHECK-NEXT:    s_cbranch_scc0 .LBB10_2
+; CHECK-NEXT:  ; %bb.1: ; %false
+; CHECK-NEXT:    s_mov_b32 s0, 33
+; CHECK-NEXT:    s_branch .LBB10_3
+; CHECK-NEXT:  .LBB10_2: ; %true
+; CHECK-NEXT:    s_mov_b32 s0, 42
+; CHECK-NEXT:    s_branch .LBB10_3
+; CHECK-NEXT:  .LBB10_3:
+  %c = trunc i32 %v to i1
+  %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %c)
+  %ballot_eq_zero = icmp eq i32 %ballot, 0
+  br i1 %ballot_eq_zero, label %true, label %false
+true:
+  ret i32 42
+false:
+  ret i32 33
+}
+
+define amdgpu_cs i32 @branch_divergent_ballot_ne_zero_compare(i32 %v) {
+; CHECK-LABEL: branch_divergent_ballot_ne_zero_compare:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    v_cmp_gt_u32_e32 vcc_lo, 12, v0
+; CHECK-NEXT:    s_cmp_eq_u32 vcc_lo, 0
+; CHECK-NEXT:    s_cbranch_scc1 .LBB11_2
+; CHECK-NEXT:  ; %bb.1: ; %true
+; CHECK-NEXT:    s_mov_b32 s0, 42
+; CHECK-NEXT:    s_branch .LBB11_3
+; CHECK-NEXT:  .LBB11_2: ; %false
+; CHECK-NEXT:    s_mov_b32 s0, 33
+; CHECK-NEXT:    s_branch .LBB11_3
+; CHECK-NEXT:  .LBB11_3:
+  %c = icmp ult i32 %v, 12
+  %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %c)
+  %ballot_ne_zero = icmp ne i32 %ballot, 0
+  br i1 %ballot_ne_zero, label %true, label %false
+true:
+  ret i32 42
+false:
+  ret i32 33
+}
+
+define amdgpu_cs i32 @branch_uniform_ballot_ne_zero_compare(i32 inreg %v) {
+; CHECK-LABEL: branch_uniform_ballot_ne_zero_compare:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_cmp_lt_u32 s0, 12
+; CHECK-NEXT:    s_cselect_b32 s0, 1, 0
+; CHECK-NEXT:    s_and_b32 s0, 1, s0
+; CHECK-NEXT:    v_cmp_ne_u32_e64 s0, 0, s0
+; CHECK-NEXT:    s_cmp_eq_u32 s0, 0
+; CHECK-NEXT:    s_cbranch_scc1 .LBB12_2
+; CHECK-NEXT:  ; %bb.1: ; %true
+; CHECK-NEXT:    s_mov_b32 s0, 42
+; CHECK-NEXT:    s_branch .LBB12_3
+; CHECK-NEXT:  .LBB12_2: ; %false
+; CHECK-NEXT:    s_mov_b32 s0, 33
+; CHECK-NEXT:    s_branch .LBB12_3
+; CHECK-NEXT:  .LBB12_3:
+  %c = icmp ult i32 %v, 12
+  %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %c)
+  %ballot_ne_zero = icmp ne i32 %ballot, 0
+  br i1 %ballot_ne_zero, label %true, label %false
+true:
+  ret i32 42
+false:
+  ret i32 33
+}
+
+define amdgpu_cs i32 @branch_divergent_ballot_eq_zero_compare(i32 %v) {
+; CHECK-LABEL: branch_divergent_ballot_eq_zero_compare:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    v_cmp_gt_u32_e32 vcc_lo, 12, v0
+; CHECK-NEXT:    s_cmp_lg_u32 vcc_lo, 0
+; CHECK-NEXT:    s_cbranch_scc0 .LBB13_2
+; CHECK-NEXT:  ; %bb.1: ; %false
+; CHECK-NEXT:    s_mov_b32 s0, 33
+; CHECK-NEXT:    s_branch .LBB13_3
+; CHECK-NEXT:  .LBB13_2: ; %true
+; CHECK-NEXT:    s_mov_b32 s0, 42
+; CHECK-NEXT:    s_branch .LBB13_3
+; CHECK-NEXT:  .LBB13_3:
+  %c = icmp ult i32 %v, 12
+  %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %c)
+  %ballot_eq_zero = icmp eq i32 %ballot, 0
+  br i1 %ballot_eq_zero, label %true, label %false
+true:
+  ret i32 42
+false:
+  ret i32 33
+}
+
+define amdgpu_cs i32 @branch_uniform_ballot_eq_zero_compare(i32 inreg %v) {
+; CHECK-LABEL: branch_uniform_ballot_eq_zero_compare:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_cmp_lt_u32 s0, 12
+; CHECK-NEXT:    s_cselect_b32 s0, 1, 0
+; CHECK-NEXT:    s_and_b32 s0, 1, s0
+; CHECK-NEXT:    v_cmp_ne_u32_e64 s0, 0, s0
+; CHECK-NEXT:    s_cmp_lg_u32 s0, 0
+; CHECK-NEXT:    s_cbranch_scc0 .LBB14_2
+; CHECK-NEXT:  ; %bb.1: ; %false
+; CHECK-NEXT:    s_mov_b32 s0, 33
+; CHECK-NEXT:    s_branch .LBB14_3
+; CHECK-NEXT:  .LBB14_2: ; %true
+; CHECK-NEXT:    s_mov_b32 s0, 42
+; CHECK-NEXT:    s_branch .LBB14_3
+; CHECK-NEXT:  .LBB14_3:
+  %c = icmp ult i32 %v, 12
+  %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %c)
+  %ballot_eq_zero = icmp eq i32 %ballot, 0
+  br i1 %ballot_eq_zero, label %true, label %false
+true:
+  ret i32 42
+false:
+  ret i32 33
+}
+
+define amdgpu_cs i32 @branch_divergent_ballot_ne_zero_and(i32 %v1, i32 %v2) {
+; CHECK-LABEL: branch_divergent_ballot_ne_zero_and:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    v_cmp_gt_u32_e32 vcc_lo, 12, v0
+; CHECK-NEXT:    v_cmp_lt_u32_e64 s0, 34, v1
+; CHECK-NEXT:    s_and_b32 s0, vcc_lo, s0
+; CHECK-NEXT:    s_cmp_eq_u32 s0, 0
+; CHECK-NEXT:    s_cbranch_scc1 .LBB15_2
+; CHECK-NEXT:  ; %bb.1: ; %true
+; CHECK-NEXT:    s_mov_b32 s0, 42
+; CHECK-NEXT:    s_branch .LBB15_3
+; CHECK-NEXT:  .LBB15_2: ; %false
+; CHECK-NEXT:    s_mov_b32 s0, 33
+; CHECK-NEXT:    s_branch .LBB15_3
+; CHECK-NEXT:  .LBB15_3:
+  %v1c = icmp ult i32 %v1, 12
+  %v2c = icmp ugt i32 %v2, 34
+  %c = and i1 %v1c, %v2c
+  %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %c)
+  %ballot_ne_zero = icmp ne i32 %ballot, 0
+  br i1 %ballot_ne_zero, label %true, label %false
+true:
+  ret i32 42
+false:
+  ret i32 33
+}
+
+define amdgpu_cs i32 @branch_uniform_ballot_ne_zero_and(i32 inreg %v1, i32 inreg %v2) {
+; CHECK-LABEL: branch_uniform_ballot_ne_zero_and:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_cmp_lt_u32 s0, 12
+; CHECK-NEXT:    s_cselect_b32 s0, 1, 0
+; CHECK-NEXT:    s_cmp_gt_u32 s1, 34
+; CHECK-NEXT:    s_cselect_b32 s1, 1, 0
+; CHECK-NEXT:    s_and_b32 s0, s0, s1
+; CHECK-NEXT:    s_and_b32 s0, 1, s0
+; CHECK-NEXT:    v_cmp_ne_u32_e64 s0, 0, s0
+; CHECK-NEXT:    s_cmp_eq_u32 s0, 0
+; CHECK-NEXT:    s_cbranch_scc1 .LBB16_2
+; CHECK-NEXT:  ; %bb.1: ; %true
+; CHECK-NEXT:    s_mov_b32 s0, 42
+; CHECK-NEXT:    s_branch .LBB16_3
+; CHECK-NEXT:  .LBB16_2: ; %false
+; CHECK-NEXT:    s_mov_b32 s0, 33
+; CHECK-NEXT:    s_branch .LBB16_3
+; CHECK-NEXT:  .LBB16_3:
+  %v1c = icmp ult i32 %v1, 12
+  %v2c = icmp ugt i32 %v2, 34
+  %c = and i1 %v1c, %v2c
+  %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %c)
+  %ballot_ne_zero = icmp ne i32 %ballot, 0
+  br i1 %ballot_ne_zero, label %true, label %false
+true:
+  ret i32 42
+false:
+  ret i32 33
+}
+
+define amdgpu_cs i32 @branch_divergent_ballot_eq_zero_and(i32 %v1, i32 %v2) {
+; CHECK-LABEL: branch_divergent_ballot_eq_zero_and:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    v_cmp_gt_u32_e32 vcc_lo, 12, v0
+; CHECK-NEXT:    v_cmp_lt_u32_e64 s0, 34, v1
+; CHECK-NEXT:    s_and_b32 s0, vcc_lo, s0
+; CHECK-NEXT:    s_cmp_lg_u32 s0, 0
+; CHECK-NEXT:    s_cbranch_scc0 .LBB17_2
+; CHECK-NEXT:  ; %bb.1: ; %false
+; CHECK-NEXT:    s_mov_b32 s0, 33
+; CHECK-NEXT:    s_branch .LBB17_3
+; CHECK-NEXT:  .LBB17_2: ; %true
+; CHECK-NEXT:    s_mov_b32 s0, 42
+; CHECK-NEXT:    s_branch .LBB17_3
+; CHECK-NEXT:  .LBB17_3:
+  %v1c = icmp ult i32 %v1, 12
+  %v2c = icmp ugt i32 %v2, 34
+  %c = and i1 %v1c, %v2c
+  %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %c)
+  %ballot_eq_zero = icmp eq i32 %ballot, 0
+  br i1 %ballot_eq_zero, label %true, label %false
+true:
+  ret i32 42
+false:
+  ret i32 33
+}
+
+define amdgpu_cs i32 @branch_uniform_ballot_eq_zero_and(i32 inreg %v1, i32 inreg %v2) {
+; CHECK-LABEL: branch_uniform_ballot_eq_zero_and:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_cmp_lt_u32 s0, 12
+; CHECK-NEXT:    s_cselect_b32 s0, 1, 0
+; CHECK-NEXT:    s_cmp_gt_u32 s1, 34
+; CHECK-NEXT:    s_cselect_b32 s1, 1, 0
+; CHECK-NEXT:    s_and_b32 s0, s0, s1
+; CHECK-NEXT:    s_and_b32 s0, 1, s0
+; CHECK-NEXT:    v_cmp_ne_u32_e64 s0, 0, s0
+; CHECK-NEXT:    s_cmp_lg_u32 s0, 0
+; CHECK-NEXT:    s_cbranch_scc0 .LBB18_2
+; CHECK-NEXT:  ; %bb.1: ; %false
+; CHECK-NEXT:    s_mov_b32 s0, 33
+; CHECK-NEXT:    s_branch .LBB18_3
+; CHECK-NEXT:  .LBB18_2: ; %true
+; CHECK-NEXT:    s_mov_b32 s0, 42
+; CHECK-NEXT:    s_branch .LBB18_3
+; CHECK-NEXT:  .LBB18_3:
+  %v1c = icmp ult i32 %v1, 12
+  %v2c = icmp ugt i32 %v2, 34
+  %c = and i1 %v1c, %v2c
+  %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %c)
+  %ballot_eq_zero = icmp eq i32 %ballot, 0
+  br i1 %ballot_eq_zero, label %true, label %false
+true:
+  ret i32 42
+false:
+  ret i32 33
+}
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i64.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i64.ll
index 9f83012f5457509..ebb96ddc0603d68 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i64.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i64.ll
@@ -86,3 +86,313 @@ define amdgpu_cs i64 @ctpop_of_ballot(float %x, float %y) {
   %bcnt = call i64 @llvm.ctpop.i64(i64 %ballot)
   ret i64 %bcnt
 }
+
+d...
[truncated]

@vpykhtin vpykhtin changed the title Improve selection of conditional branch on amdgcb.ballot!=0 condition in SelectionDAG. Improve selection of conditional branch on amdgcn.ballot!=0 condition in SelectionDAG. Oct 10, 2023
Comment on lines 13597 to 13602
// %VCMP = i32/i64 AMDGPUISD::SETCC ...
// %C = ISD::SETCC %VCMP, 0, setne/seteq
// BRCOND %BB, %C
// =>
// %VCMP = i32/i64 AMDGPUISD::SETCC ...
// BRCONDZ %BB, %VCMP, setne/seteq
Contributor

Why does VCMP have to be AMDGPUISD::SETCC here? Can you do a simpler combine instead:

    // %C = ISD::SETCC %VCMP, 0, setne/seteq
    // BRCOND %BB, %C
    // =>
    // BRCONDZ %BB, %VCMP, setne/seteq

Comment on lines 13615 to 13620
// %VCMPSrc = ISD::SETCC or a logical combination of ISD::SETCCs
// %VCMP = i32/i64 AMDGPUISD::SETCC (ext %VCMPSrc), 0, setne
// %C = ISD::SETCC %VCMP, 0, setne/seteq
// BRCOND %BB, %C
// =>
// BRCONDZ %BB, %VCMPSrc, setne/seteq
Contributor

Can you split this out into a separate simpler combine:

        // %VCMPSrc = ISD::SETCC or a logical combination of ISD::SETCCs
        // %VCMP = i32/i64 AMDGPUISD::SETCC (ext %VCMPSrc), 0, setne
        // %C = ISD::SETCC %VCMP, 0, setne/seteq
        // =>
        // %C = ISD::SETCC or a logical combination of ISD::SETCCs

Contributor Author

@vpykhtin vpykhtin Oct 11, 2023

Thank you Jay, these are good suggestions, and they showed me that BRCONDZ has vague semantics.

Here I use lowercase setcc to refer to the ordinary i1 ISD::SETCC opcode, and AMDGPUISD::SETCC for the special node returning a boolean mask across lanes.

The problem is that when we get rid of AMDGPUISD::SETCC (as in your second pattern) we're losing the semantics of a boolean mask across all lanes.

This confusion comes from the description of the BRCONDZ node, which says it can accept a setcc value or a logical combination of setccs. Instead it should be defined so that BRCONDZ only accepts the result of an AMDGPUISD::SETCC node, meaning that BRCONDZ compares the uniform boolean mask produced by the AMDGPUISD::SETCC operation to zero.

For example:

  %Mask = i64 AMDGPUISD::SETCC i1 %c, 42, setle
  BRCONDZ i64 %Mask, BB2, setne
=>
  v_cmp_le vcc, v0, 42
  s_cbranch_vccnz bb2

Now the confusing case with ballot:

  %c = i1 setcc ...
  %Mask = i64 AMDGPUISD::SETCC (zext i1 %c to i32), 0, setne  ; <- this is the lowered ballot
  BRCONDZ i64 %Mask, BB2, setne
=>
  v_cmp_ vcc, ... ; <- setcc
  s_cmp_ne vcc, 0 ; <- ballot
  s_cbranch_scc1 bb2

Here we want to remove the second comparison, because vcc is exactly the boolean mask across all lanes produced by the first v_cmp instruction, and it can be compared to zero by the branch itself. In SelectionDAG we know that a setcc on a divergent value will be selected as a v_cmp instruction, so we can use its result directly; that is why the setcc input to ballot is delivered to BRCONDZ. This is probably wrong.
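To see why the second comparison is redundant, here is a small C++ sketch (an assumed wave32 model with hypothetical names, for illustration only) of the lowered ballot applied to a mask that already came from a compare: zero-extending each lane's bit and testing it with setne against zero just reproduces the input mask bit-for-bit.

```cpp
#include <cassert>
#include <cstdint>

// Assumed wave32 model; names are hypothetical (illustration only).
// VCmpMask holds one bit per lane, as produced by a v_cmp instruction.
// The lowered ballot is "AMDGPUISD::SETCC (zext %c), 0, setne", applied
// per lane: zero-extend each lane's i1 and compare it against zero.
inline uint32_t loweredBallot(uint32_t VCmpMask) {
  uint32_t Out = 0;
  for (int Lane = 0; Lane < 32; ++Lane) {
    uint32_t Bit = (VCmpMask >> Lane) & 1; // zext i1 -> i32
    if (Bit != 0)                          // setne 0
      Out |= 1u << Lane;
  }
  return Out;
}
```

Since loweredBallot(M) == M for any mask M that came from a compare, testing the ballot result against zero is the same as testing the v_cmp mask against zero, so the extra scalar compare can be folded away and the branch can consume the mask directly.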

I'm not really sure how to resolve this yet.

Contributor Author

On second thought, these two patterns have a node in common:

%C = ISD::SETCC %VCMP, 0, setne/seteq
BRCOND %BB, %C
=>
BRCONDZ %BB, %VCMP, setne/seteq

%VCMPSrc = ISD::SETCC or a logical combination of ISD::SETCCs
%VCMP = i32/i64 AMDGPUISD::SETCC (ext %VCMPSrc), 0, setne
%C = ISD::SETCC %VCMP, 0, setne/seteq
=>
%C = ISD::SETCC or a logical combination of ISD::SETCCs

SelectionDAG starts with the BRCOND of the first pattern and erases the ISD::SETCC that is the root of the second pattern. I tried to run the DAG combine on the %C value of the first rule so that the second rule could perform its transformation and return the result to the first rule, but this requires creating an artificial %C = ISD::SETCC %VCMPSrc, 0, setne result so that the first rule's pattern could match. It looks like this should work as a single pattern.

Contributor Author

I've reverted the previous implementation and submitted a simplified version that doesn't use an additional DAG node.

@github-actions

github-actions bot commented Oct 15, 2023

✅ With the latest revision this PR passed the C/C++ code formatter.

@vpykhtin vpykhtin force-pushed the improve_ballot_ne_zero_selection branch from 4429535 to abb2ca2 Compare October 15, 2023 11:17
@vpykhtin vpykhtin force-pushed the improve_ballot_ne_zero_selection branch from abb2ca2 to 8b836e2 Compare October 16, 2023 11:52
@vpykhtin
Contributor Author

Fixed SCC not being used when
// %C = i1 ISD::SETCC %VCMP, 0, setne/seteq
hasn't matched. Added tests.

@vpykhtin vpykhtin self-assigned this Oct 18, 2023
@@ -2259,6 +2259,30 @@ bool AMDGPUDAGToDAGISel::isCBranchSCC(const SDNode *N) const {
return false;
}

bool isBoolSGPR(SDValue V);
Contributor

Can just move the body up here to avoid the forward declaration

Contributor Author

The definition is in another source file (SIISelLowering.cpp); I haven't found a common header to place this declaration in yet.

assert(VCMP->getOpcode() == AMDGPUISD::SETCC);
// Special case for amdgcn.ballot:
// %Cond = i1 (and/or combination of i1 ISD::SETCCs)
// %VCMP = i(WaveSize) AMDGPUISD::SETCC (ext %Cond), 0, setne ; lowered ballot
Contributor

You might need to handle the negated form as well

Contributor Author

Added, but I had to simulate it using the llvm.amdgcn.icmp(i1 %Cond, i1 0, ICMP_EQ) intrinsic.

 * supported negated ballot
 * improved comments
Contributor Author

@vpykhtin vpykhtin left a comment

I've added support for the negated ballot, but I had to simulate it using the llvm.amdgcn.icmp intrinsic.

Tests for GlobalISel aren't added because they fail with "cannot select llvm.amdgcn.icmp(i1 ...)".

Collaborator

@nhaehnle nhaehnle left a comment

Seems pretty reasonable to me

Comment on lines +430 to +435
; CHECK: ; %bb.0:
; CHECK-NEXT: s_cmp_lt_u32 s0, 12
; CHECK-NEXT: s_cselect_b32 s0, -1, 0
; CHECK-NEXT: s_cmp_gt_u32 s1, 34
; CHECK-NEXT: s_cselect_b32 s1, -1, 0
; CHECK-NEXT: s_and_b32 s0, s0, s1
Collaborator

This sequence could be improved to:

  s_cmp_lt_u32 s0, 12
  s_cselect_b32 s0, -1, 0
  s_cmp_gt_u32 s1, 34
  s_cselect_b32 s0, s0, 0

By selecting into vcc(_lo) instead, we could even avoid the AND-with-exec that follows.

Not something for this patch, just an observation. Perhaps you could add a TODO.

(In general, LLVM knows to do boolean logic -> select folds. The fact that they don't fire here is probably a pass ordering issue.)

Contributor Author

Perhaps you could add a TODO.

Thanks Nicolai, added the TODO.

@vpykhtin
Contributor Author

With the latest commit I added a guard to skip ballot.i64 in wave32 mode, because it's lowered with an i64 AMDGPUISD::SETCC; this should be fixed to produce an i32 result, similar to the ICMP/FCMP intrinsics. In a follow-up fix to this PR I'm going to turn the guard into an assert; a TODO has been added for this.

@nhaehnle
Collaborator

nhaehnle commented Nov 3, 2023

Though, on second thought, shouldn't there be some wave64 tests?

@vpykhtin
Contributor Author

vpykhtin commented Nov 3, 2023

Though, on second thought, shouldn't there be some wave64 tests?

I added a couple of tests for ballot.i64 in wave32 mode to AMDGPU/GlobalISel/llvm.amdgcn.ballot.i32.ll and AMDGPU/llvm.amdgcn.ballot.i32.ll with commit 6e865d1.

Or do you mean ballot.i32 in wave64 mode?

@nhaehnle
Collaborator

nhaehnle commented Nov 3, 2023

I meant ballot.i64 in wave64. For some reason I didn't see the changes you made in those tests (perhaps I was looking at GitHub's display of your last changes to the PR instead of the full PR). But since you did make those changes, it's all good as far as I'm concerned.

@vpykhtin
Contributor Author

vpykhtin commented Nov 6, 2023

Thanks Nicolai!

@vpykhtin vpykhtin merged commit fe6893b into llvm:main Nov 6, 2023
3 checks passed
@vpykhtin vpykhtin deleted the improve_ballot_ne_zero_selection branch November 6, 2023 14:17