Improve selection of conditional branch on amdgcn.ballot!=0 condition in SelectionDAG. #68714
Conversation
@llvm/pr-subscribers-llvm-globalisel @llvm/pr-subscribers-backend-amdgpu Author: Valery (vpykhtin)
Changes: I'm going to introduce changes in the address sanitizer instrumentation checks for the AMDGPU target, and they will make use of the following pattern:
which means "execute `then` if any lane satisfies the `cnd` condition". Note that this is a uniform condition, since amdgcn.ballot is a uniform intrinsic. This patch improves code generation for this pattern in SelectionDAG; a GlobalISel change should be added later. It introduces a special AMDGPUISD::BRCONDZ node that is created in the DAG combiner and maps directly to either S_CBRANCH_VCC(Z|NZ) or, if cnd is itself uniform, S_CBRANCH_SCC(0|1). P.S. This pull request contains two commits; the first one just adds tests, so I recommend looking at the second commit 13ebb87 to see the code difference it makes in the tests. Patch is 49.48 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/68714.diff 11 Files Affected:
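The code snippet for "the following pattern" was elided in this excerpt. Judging from the tests added by this PR (e.g. `branch_divergent_ballot_ne_zero_compare`), it is presumably IR of this shape (a reconstruction, not the author's original snippet; names are illustrative):

```llvm
; sketch reconstructed from the PR's tests
%ballot = call i64 @llvm.amdgcn.ballot.i64(i1 %cnd)
%any = icmp ne i64 %ballot, 0
br i1 %any, label %then, label %else   ; taken if any lane satisfied %cnd
```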
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
index b5ceaaa14b4fd5e..90addb12a81abcf 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
@@ -666,6 +666,9 @@ void AMDGPUDAGToDAGISel::Select(SDNode *N) {
case ISD::FP_EXTEND:
SelectFP_EXTEND(N);
return;
+ case AMDGPUISD::BRCONDZ:
+ SelectBRCONDZ(N);
+ return;
case AMDGPUISD::CVT_PKRTZ_F16_F32:
case AMDGPUISD::CVT_PKNORM_I16_F32:
case AMDGPUISD::CVT_PKNORM_U16_F32:
@@ -2306,6 +2309,34 @@ void AMDGPUDAGToDAGISel::SelectBRCOND(SDNode *N) {
VCC.getValue(0));
}
+void AMDGPUDAGToDAGISel::SelectBRCONDZ(SDNode *N) {
+ const SIRegisterInfo *TRI =
+ static_cast<const SIRegisterInfo *>(Subtarget->getRegisterInfo());
+
+ SDValue Cond = N->getOperand(1);
+
+ // BRCONDZ condition is either AMDGPUISD::SETCC or i1 value that comes from
+ // ISD::SETCC node or logical combination of ISD::SETCCs therefore we don't
+ // need to AND the condition with execmask.
+
+ // TODO: AMDGPUISD::SETCC is always selected as V_CMP so use VCC condition.
+ // This might change later.
+ bool UseSCCBr = Cond->getOpcode() != AMDGPUISD::SETCC && !Cond->isDivergent();
+
+ auto CondCode = cast<CondCodeSDNode>(N->getOperand(3))->get();
+ assert(CondCode == ISD::SETEQ || CondCode == ISD::SETNE);
+
+ bool EqZero = CondCode == ISD::SETEQ;
+ unsigned BrOp =
+ UseSCCBr ? (EqZero ? AMDGPU::S_CBRANCH_SCC0 : AMDGPU::S_CBRANCH_SCC1)
+ : (EqZero ? AMDGPU::S_CBRANCH_VCCZ : AMDGPU::S_CBRANCH_VCCNZ);
+
+ SDValue CondCopy = CurDAG->getCopyToReg(
+ N->getOperand(0), SDLoc(N), UseSCCBr ? AMDGPU::SCC : TRI->getVCC(),
+ N->getOperand(1));
+ CurDAG->SelectNodeTo(N, BrOp, MVT::Other, N->getOperand(2), CondCopy);
+}
+
void AMDGPUDAGToDAGISel::SelectFP_EXTEND(SDNode *N) {
if (Subtarget->hasSALUFloatInsts() && N->getValueType(0) == MVT::f32 &&
!N->isDivergent()) {
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.h b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.h
index a8a606f60a3faee..255ca62cb7a9100 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.h
@@ -272,6 +272,7 @@ class AMDGPUDAGToDAGISel : public SelectionDAGISel {
void SelectS_BFE(SDNode *N);
bool isCBranchSCC(const SDNode *N) const;
void SelectBRCOND(SDNode *N);
+ void SelectBRCONDZ(SDNode *N);
void SelectFMAD_FMA(SDNode *N);
void SelectFP_EXTEND(SDNode *N);
void SelectDSAppendConsume(SDNode *N, unsigned IntrID);
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
index 607d59db7bcf709..a268a807679f473 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
@@ -5317,6 +5317,7 @@ const char* AMDGPUTargetLowering::getTargetNodeName(unsigned Opcode) const {
NODE_NAME_CASE(BUFFER_ATOMIC_FADD)
NODE_NAME_CASE(BUFFER_ATOMIC_FMIN)
NODE_NAME_CASE(BUFFER_ATOMIC_FMAX)
+ NODE_NAME_CASE(BRCONDZ)
case AMDGPUISD::LAST_AMDGPU_ISD_NUMBER: break;
}
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h
index e971c85ee3f6e39..43e572dcc56423d 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h
@@ -422,6 +422,20 @@ enum NodeType : unsigned {
// This is SETCC with the full mask result which is used for a compare with a
// result bit per item in the wavefront.
SETCC,
+
+ // Conditional branch on comparison of CondWaveMask operand to zero.
+ // BRCONDZ CondWaveMask, BB, CondCode
+ // where:
+ // - CondWaveMask - is either:
+ // * the i32/i64 result of AMDGPUISD::SETCC node,
+ // * i1 value that comes from ISD::SETCC node or logical combination of
+ // ISD::SETCCs. For a divergent node this becomes a i32/i64 value after
+ // selection.
+ // - BB is the target basic block,
+ // - CondCode is either SETEQ or SETNE meaning that the branch should happen
+ // if the CondWaveMask is either equal to zero or not.
+ BRCONDZ,
+
SETREG,
DENORM_MODE,
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td b/llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td
index 2492a7be651f6d6..a6b4fa4937dff30 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td
@@ -58,6 +58,11 @@ def AMDGPUIfBreakOp : SDTypeProfile<1, 2,
[SDTCisVT<0, i1>, SDTCisVT<1, i1>, SDTCisVT<2, i1>]
>;
+def AMDGPUBrcondzOp : SDTypeProfile<0, 3, [
+ // cond, bb, cc
+ SDTCisInt<0>, SDTCisVT<1, OtherVT>, SDTCisVT<2, OtherVT>
+]>;
+
//===----------------------------------------------------------------------===//
// AMDGPU DAG Nodes
//
@@ -65,6 +70,7 @@ def AMDGPUIfBreakOp : SDTypeProfile<1, 2,
def AMDGPUif : SDNode<"AMDGPUISD::IF", AMDGPUIfOp, [SDNPHasChain]>;
def AMDGPUelse : SDNode<"AMDGPUISD::ELSE", AMDGPUElseOp, [SDNPHasChain]>;
def AMDGPUloop : SDNode<"AMDGPUISD::LOOP", AMDGPULoopOp, [SDNPHasChain]>;
+def AMDGPUbrcondz: SDNode<"AMDGPUISD::BRCONDZ", AMDGPUBrcondzOp, [SDNPHasChain]>;
def callseq_start : SDNode<"ISD::CALLSEQ_START",
SDCallSeqStart<[ SDTCisVT<0, i32>, SDTCisVT<1, i32> ]>,
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 0a32844bdb01a09..5edf1446ca34dac 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -798,7 +798,8 @@ SITargetLowering::SITargetLowering(const TargetMachine &TM,
ISD::SIGN_EXTEND_INREG,
ISD::EXTRACT_VECTOR_ELT,
ISD::INSERT_VECTOR_ELT,
- ISD::FCOPYSIGN});
+ ISD::FCOPYSIGN,
+ ISD::BRCOND});
if (Subtarget->has16BitInsts() && !Subtarget->hasMed3_16())
setTargetDAGCombine(ISD::FP_ROUND);
@@ -13584,6 +13585,56 @@ SDValue SITargetLowering::performClampCombine(SDNode *N,
return SDValue(CSrc, 0);
}
+SDValue SITargetLowering::performBRCondCombine(SDNode *N,
+ DAGCombinerInfo &DCI) const {
+ if (!DCI.isAfterLegalizeDAG())
+ return SDValue(N, 0);
+
+ SDValue Cond = N->getOperand(1);
+ if (Cond.getOpcode() == ISD::SETCC &&
+ Cond->getOperand(0)->getOpcode() == AMDGPUISD::SETCC) {
+
+ // %VCMP = i32/i64 AMDGPUISD::SETCC ...
+ // %C = ISD::SETCC %VCMP, 0, setne/seteq
+ // BRCOND %BB, %C
+ // =>
+ // %VCMP = i32/i64 AMDGPUISD::SETCC ...
+ // BRCONDZ %BB, %VCMP, setne/seteq
+
+ auto CC = cast<CondCodeSDNode>(Cond->getOperand(2))->get();
+ auto *CRHS = dyn_cast<ConstantSDNode>(Cond->getOperand(1));
+ if ((CC == ISD::SETEQ || CC == ISD::SETNE) && CRHS && CRHS->isZero()) {
+
+ auto VCMP = Cond->getOperand(0);
+ auto VCMP_CC = cast<CondCodeSDNode>(VCMP.getOperand(2))->get();
+ auto *VCMP_CRHS = dyn_cast<ConstantSDNode>(VCMP.getOperand(1));
+ auto Src = VCMP;
+ if (VCMP_CC == ISD::SETNE && VCMP_CRHS && VCMP_CRHS->isZero()) {
+
+ // Special case for amdgcn.ballot:
+ // %VCMPSrc = ISD::SETCC or a logical combination of ISD::SETCCs
+ // %VCMP = i32/i64 AMDGPUISD::SETCC (ext %VCMPSrc), 0, setne
+ // %C = ISD::SETCC %VCMP, 0, setne/seteq
+ // BRCOND %BB, %C
+ // =>
+ // BRCONDZ %BB, %VCMPSrc, setne/seteq
+
+ auto VCMPSrc = VCMP.getOperand(0);
+ if (ISD::isExtOpcode(VCMPSrc->getOpcode())) // Skip extension.
+ VCMPSrc = VCMPSrc.getOperand(0);
+
+ if (isBoolSGPR(VCMPSrc))
+ Src = VCMPSrc;
+ }
+ return DCI.DAG.getNode(AMDGPUISD::BRCONDZ, SDLoc(N), N->getVTList(),
+ N->getOperand(0), // Chain
+ Src,
+ N->getOperand(2), // BB
+ DCI.DAG.getCondCode(CC)); // SETEQ|SETNE
+ }
+ }
+ return SDValue(N, 0);
+}
SDValue SITargetLowering::PerformDAGCombine(SDNode *N,
DAGCombinerInfo &DCI) const {
@@ -13694,6 +13745,8 @@ SDValue SITargetLowering::PerformDAGCombine(SDNode *N,
return performInsertVectorEltCombine(N, DCI);
case ISD::FP_ROUND:
return performFPRoundCombine(N, DCI);
+ case ISD::BRCOND:
+ return performBRCondCombine(N, DCI);
case ISD::LOAD: {
if (SDValue Widended = widenLoad(cast<LoadSDNode>(N), DCI))
return Widended;
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.h b/llvm/lib/Target/AMDGPU/SIISelLowering.h
index d717e12d29a514a..f03b83705d14083 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.h
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.h
@@ -220,6 +220,7 @@ class SITargetLowering final : public AMDGPUTargetLowering {
SDValue performCvtF32UByteNCombine(SDNode *N, DAGCombinerInfo &DCI) const;
SDValue performClampCombine(SDNode *N, DAGCombinerInfo &DCI) const;
SDValue performRcpCombine(SDNode *N, DAGCombinerInfo &DCI) const;
+ SDValue performBRCondCombine(SDNode *N, DAGCombinerInfo &DCI) const;
bool isLegalFlatAddressingMode(const AddrMode &AM) const;
bool isLegalMUBUFAddressingMode(const AddrMode &AM) const;
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i32.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i32.ll
index 8bd1be04650e005..6c12329930b8a2c 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i32.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i32.ll
@@ -83,3 +83,313 @@ define amdgpu_cs i32 @ctpop_of_ballot(float %x, float %y) {
%bcnt = call i32 @llvm.ctpop.i32(i32 %ballot)
ret i32 %bcnt
}
+
+define amdgpu_cs i32 @branch_divergent_ballot_ne_zero_non_compare(i32 %v) {
+; CHECK-LABEL: branch_divergent_ballot_ne_zero_non_compare:
+; CHECK: ; %bb.0:
+; CHECK-NEXT: v_and_b32_e32 v0, 1, v0
+; CHECK-NEXT: v_cmp_ne_u32_e32 vcc_lo, 0, v0
+; CHECK-NEXT: s_cmp_eq_u32 vcc_lo, 0
+; CHECK-NEXT: s_cbranch_scc1 .LBB7_2
+; CHECK-NEXT: ; %bb.1: ; %true
+; CHECK-NEXT: s_mov_b32 s0, 42
+; CHECK-NEXT: s_branch .LBB7_3
+; CHECK-NEXT: .LBB7_2: ; %false
+; CHECK-NEXT: s_mov_b32 s0, 33
+; CHECK-NEXT: s_branch .LBB7_3
+; CHECK-NEXT: .LBB7_3:
+ %c = trunc i32 %v to i1
+ %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %c)
+ %ballot_ne_zero = icmp ne i32 %ballot, 0
+ br i1 %ballot_ne_zero, label %true, label %false
+true:
+ ret i32 42
+false:
+ ret i32 33
+}
+
+define amdgpu_cs i32 @branch_uniform_ballot_ne_zero_non_compare(i32 inreg %v) {
+; CHECK-LABEL: branch_uniform_ballot_ne_zero_non_compare:
+; CHECK: ; %bb.0:
+; CHECK-NEXT: s_and_b32 s0, 1, s0
+; CHECK-NEXT: v_cmp_ne_u32_e64 s0, 0, s0
+; CHECK-NEXT: s_cmp_eq_u32 s0, 0
+; CHECK-NEXT: s_cbranch_scc1 .LBB8_2
+; CHECK-NEXT: ; %bb.1: ; %true
+; CHECK-NEXT: s_mov_b32 s0, 42
+; CHECK-NEXT: s_branch .LBB8_3
+; CHECK-NEXT: .LBB8_2: ; %false
+; CHECK-NEXT: s_mov_b32 s0, 33
+; CHECK-NEXT: s_branch .LBB8_3
+; CHECK-NEXT: .LBB8_3:
+ %c = trunc i32 %v to i1
+ %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %c)
+ %ballot_ne_zero = icmp ne i32 %ballot, 0
+ br i1 %ballot_ne_zero, label %true, label %false
+true:
+ ret i32 42
+false:
+ ret i32 33
+}
+
+define amdgpu_cs i32 @branch_divergent_ballot_eq_zero_non_compare(i32 %v) {
+; CHECK-LABEL: branch_divergent_ballot_eq_zero_non_compare:
+; CHECK: ; %bb.0:
+; CHECK-NEXT: v_and_b32_e32 v0, 1, v0
+; CHECK-NEXT: v_cmp_ne_u32_e32 vcc_lo, 0, v0
+; CHECK-NEXT: s_cmp_lg_u32 vcc_lo, 0
+; CHECK-NEXT: s_cbranch_scc0 .LBB9_2
+; CHECK-NEXT: ; %bb.1: ; %false
+; CHECK-NEXT: s_mov_b32 s0, 33
+; CHECK-NEXT: s_branch .LBB9_3
+; CHECK-NEXT: .LBB9_2: ; %true
+; CHECK-NEXT: s_mov_b32 s0, 42
+; CHECK-NEXT: s_branch .LBB9_3
+; CHECK-NEXT: .LBB9_3:
+ %c = trunc i32 %v to i1
+ %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %c)
+ %ballot_eq_zero = icmp eq i32 %ballot, 0
+ br i1 %ballot_eq_zero, label %true, label %false
+true:
+ ret i32 42
+false:
+ ret i32 33
+}
+
+define amdgpu_cs i32 @branch_uniform_ballot_eq_zero_non_compare(i32 inreg %v) {
+; CHECK-LABEL: branch_uniform_ballot_eq_zero_non_compare:
+; CHECK: ; %bb.0:
+; CHECK-NEXT: s_and_b32 s0, 1, s0
+; CHECK-NEXT: v_cmp_ne_u32_e64 s0, 0, s0
+; CHECK-NEXT: s_cmp_lg_u32 s0, 0
+; CHECK-NEXT: s_cbranch_scc0 .LBB10_2
+; CHECK-NEXT: ; %bb.1: ; %false
+; CHECK-NEXT: s_mov_b32 s0, 33
+; CHECK-NEXT: s_branch .LBB10_3
+; CHECK-NEXT: .LBB10_2: ; %true
+; CHECK-NEXT: s_mov_b32 s0, 42
+; CHECK-NEXT: s_branch .LBB10_3
+; CHECK-NEXT: .LBB10_3:
+ %c = trunc i32 %v to i1
+ %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %c)
+ %ballot_eq_zero = icmp eq i32 %ballot, 0
+ br i1 %ballot_eq_zero, label %true, label %false
+true:
+ ret i32 42
+false:
+ ret i32 33
+}
+
+define amdgpu_cs i32 @branch_divergent_ballot_ne_zero_compare(i32 %v) {
+; CHECK-LABEL: branch_divergent_ballot_ne_zero_compare:
+; CHECK: ; %bb.0:
+; CHECK-NEXT: v_cmp_gt_u32_e32 vcc_lo, 12, v0
+; CHECK-NEXT: s_cmp_eq_u32 vcc_lo, 0
+; CHECK-NEXT: s_cbranch_scc1 .LBB11_2
+; CHECK-NEXT: ; %bb.1: ; %true
+; CHECK-NEXT: s_mov_b32 s0, 42
+; CHECK-NEXT: s_branch .LBB11_3
+; CHECK-NEXT: .LBB11_2: ; %false
+; CHECK-NEXT: s_mov_b32 s0, 33
+; CHECK-NEXT: s_branch .LBB11_3
+; CHECK-NEXT: .LBB11_3:
+ %c = icmp ult i32 %v, 12
+ %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %c)
+ %ballot_ne_zero = icmp ne i32 %ballot, 0
+ br i1 %ballot_ne_zero, label %true, label %false
+true:
+ ret i32 42
+false:
+ ret i32 33
+}
+
+define amdgpu_cs i32 @branch_uniform_ballot_ne_zero_compare(i32 inreg %v) {
+; CHECK-LABEL: branch_uniform_ballot_ne_zero_compare:
+; CHECK: ; %bb.0:
+; CHECK-NEXT: s_cmp_lt_u32 s0, 12
+; CHECK-NEXT: s_cselect_b32 s0, 1, 0
+; CHECK-NEXT: s_and_b32 s0, 1, s0
+; CHECK-NEXT: v_cmp_ne_u32_e64 s0, 0, s0
+; CHECK-NEXT: s_cmp_eq_u32 s0, 0
+; CHECK-NEXT: s_cbranch_scc1 .LBB12_2
+; CHECK-NEXT: ; %bb.1: ; %true
+; CHECK-NEXT: s_mov_b32 s0, 42
+; CHECK-NEXT: s_branch .LBB12_3
+; CHECK-NEXT: .LBB12_2: ; %false
+; CHECK-NEXT: s_mov_b32 s0, 33
+; CHECK-NEXT: s_branch .LBB12_3
+; CHECK-NEXT: .LBB12_3:
+ %c = icmp ult i32 %v, 12
+ %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %c)
+ %ballot_ne_zero = icmp ne i32 %ballot, 0
+ br i1 %ballot_ne_zero, label %true, label %false
+true:
+ ret i32 42
+false:
+ ret i32 33
+}
+
+define amdgpu_cs i32 @branch_divergent_ballot_eq_zero_compare(i32 %v) {
+; CHECK-LABEL: branch_divergent_ballot_eq_zero_compare:
+; CHECK: ; %bb.0:
+; CHECK-NEXT: v_cmp_gt_u32_e32 vcc_lo, 12, v0
+; CHECK-NEXT: s_cmp_lg_u32 vcc_lo, 0
+; CHECK-NEXT: s_cbranch_scc0 .LBB13_2
+; CHECK-NEXT: ; %bb.1: ; %false
+; CHECK-NEXT: s_mov_b32 s0, 33
+; CHECK-NEXT: s_branch .LBB13_3
+; CHECK-NEXT: .LBB13_2: ; %true
+; CHECK-NEXT: s_mov_b32 s0, 42
+; CHECK-NEXT: s_branch .LBB13_3
+; CHECK-NEXT: .LBB13_3:
+ %c = icmp ult i32 %v, 12
+ %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %c)
+ %ballot_eq_zero = icmp eq i32 %ballot, 0
+ br i1 %ballot_eq_zero, label %true, label %false
+true:
+ ret i32 42
+false:
+ ret i32 33
+}
+
+define amdgpu_cs i32 @branch_uniform_ballot_eq_zero_compare(i32 inreg %v) {
+; CHECK-LABEL: branch_uniform_ballot_eq_zero_compare:
+; CHECK: ; %bb.0:
+; CHECK-NEXT: s_cmp_lt_u32 s0, 12
+; CHECK-NEXT: s_cselect_b32 s0, 1, 0
+; CHECK-NEXT: s_and_b32 s0, 1, s0
+; CHECK-NEXT: v_cmp_ne_u32_e64 s0, 0, s0
+; CHECK-NEXT: s_cmp_lg_u32 s0, 0
+; CHECK-NEXT: s_cbranch_scc0 .LBB14_2
+; CHECK-NEXT: ; %bb.1: ; %false
+; CHECK-NEXT: s_mov_b32 s0, 33
+; CHECK-NEXT: s_branch .LBB14_3
+; CHECK-NEXT: .LBB14_2: ; %true
+; CHECK-NEXT: s_mov_b32 s0, 42
+; CHECK-NEXT: s_branch .LBB14_3
+; CHECK-NEXT: .LBB14_3:
+ %c = icmp ult i32 %v, 12
+ %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %c)
+ %ballot_eq_zero = icmp eq i32 %ballot, 0
+ br i1 %ballot_eq_zero, label %true, label %false
+true:
+ ret i32 42
+false:
+ ret i32 33
+}
+
+define amdgpu_cs i32 @branch_divergent_ballot_ne_zero_and(i32 %v1, i32 %v2) {
+; CHECK-LABEL: branch_divergent_ballot_ne_zero_and:
+; CHECK: ; %bb.0:
+; CHECK-NEXT: v_cmp_gt_u32_e32 vcc_lo, 12, v0
+; CHECK-NEXT: v_cmp_lt_u32_e64 s0, 34, v1
+; CHECK-NEXT: s_and_b32 s0, vcc_lo, s0
+; CHECK-NEXT: s_cmp_eq_u32 s0, 0
+; CHECK-NEXT: s_cbranch_scc1 .LBB15_2
+; CHECK-NEXT: ; %bb.1: ; %true
+; CHECK-NEXT: s_mov_b32 s0, 42
+; CHECK-NEXT: s_branch .LBB15_3
+; CHECK-NEXT: .LBB15_2: ; %false
+; CHECK-NEXT: s_mov_b32 s0, 33
+; CHECK-NEXT: s_branch .LBB15_3
+; CHECK-NEXT: .LBB15_3:
+ %v1c = icmp ult i32 %v1, 12
+ %v2c = icmp ugt i32 %v2, 34
+ %c = and i1 %v1c, %v2c
+ %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %c)
+ %ballot_ne_zero = icmp ne i32 %ballot, 0
+ br i1 %ballot_ne_zero, label %true, label %false
+true:
+ ret i32 42
+false:
+ ret i32 33
+}
+
+define amdgpu_cs i32 @branch_uniform_ballot_ne_zero_and(i32 inreg %v1, i32 inreg %v2) {
+; CHECK-LABEL: branch_uniform_ballot_ne_zero_and:
+; CHECK: ; %bb.0:
+; CHECK-NEXT: s_cmp_lt_u32 s0, 12
+; CHECK-NEXT: s_cselect_b32 s0, 1, 0
+; CHECK-NEXT: s_cmp_gt_u32 s1, 34
+; CHECK-NEXT: s_cselect_b32 s1, 1, 0
+; CHECK-NEXT: s_and_b32 s0, s0, s1
+; CHECK-NEXT: s_and_b32 s0, 1, s0
+; CHECK-NEXT: v_cmp_ne_u32_e64 s0, 0, s0
+; CHECK-NEXT: s_cmp_eq_u32 s0, 0
+; CHECK-NEXT: s_cbranch_scc1 .LBB16_2
+; CHECK-NEXT: ; %bb.1: ; %true
+; CHECK-NEXT: s_mov_b32 s0, 42
+; CHECK-NEXT: s_branch .LBB16_3
+; CHECK-NEXT: .LBB16_2: ; %false
+; CHECK-NEXT: s_mov_b32 s0, 33
+; CHECK-NEXT: s_branch .LBB16_3
+; CHECK-NEXT: .LBB16_3:
+ %v1c = icmp ult i32 %v1, 12
+ %v2c = icmp ugt i32 %v2, 34
+ %c = and i1 %v1c, %v2c
+ %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %c)
+ %ballot_ne_zero = icmp ne i32 %ballot, 0
+ br i1 %ballot_ne_zero, label %true, label %false
+true:
+ ret i32 42
+false:
+ ret i32 33
+}
+
+define amdgpu_cs i32 @branch_divergent_ballot_eq_zero_and(i32 %v1, i32 %v2) {
+; CHECK-LABEL: branch_divergent_ballot_eq_zero_and:
+; CHECK: ; %bb.0:
+; CHECK-NEXT: v_cmp_gt_u32_e32 vcc_lo, 12, v0
+; CHECK-NEXT: v_cmp_lt_u32_e64 s0, 34, v1
+; CHECK-NEXT: s_and_b32 s0, vcc_lo, s0
+; CHECK-NEXT: s_cmp_lg_u32 s0, 0
+; CHECK-NEXT: s_cbranch_scc0 .LBB17_2
+; CHECK-NEXT: ; %bb.1: ; %false
+; CHECK-NEXT: s_mov_b32 s0, 33
+; CHECK-NEXT: s_branch .LBB17_3
+; CHECK-NEXT: .LBB17_2: ; %true
+; CHECK-NEXT: s_mov_b32 s0, 42
+; CHECK-NEXT: s_branch .LBB17_3
+; CHECK-NEXT: .LBB17_3:
+ %v1c = icmp ult i32 %v1, 12
+ %v2c = icmp ugt i32 %v2, 34
+ %c = and i1 %v1c, %v2c
+ %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %c)
+ %ballot_eq_zero = icmp eq i32 %ballot, 0
+ br i1 %ballot_eq_zero, label %true, label %false
+true:
+ ret i32 42
+false:
+ ret i32 33
+}
+
+define amdgpu_cs i32 @branch_uniform_ballot_eq_zero_and(i32 inreg %v1, i32 inreg %v2) {
+; CHECK-LABEL: branch_uniform_ballot_eq_zero_and:
+; CHECK: ; %bb.0:
+; CHECK-NEXT: s_cmp_lt_u32 s0, 12
+; CHECK-NEXT: s_cselect_b32 s0, 1, 0
+; CHECK-NEXT: s_cmp_gt_u32 s1, 34
+; CHECK-NEXT: s_cselect_b32 s1, 1, 0
+; CHECK-NEXT: s_and_b32 s0, s0, s1
+; CHECK-NEXT: s_and_b32 s0, 1, s0
+; CHECK-NEXT: v_cmp_ne_u32_e64 s0, 0, s0
+; CHECK-NEXT: s_cmp_lg_u32 s0, 0
+; CHECK-NEXT: s_cbranch_scc0 .LBB18_2
+; CHECK-NEXT: ; %bb.1: ; %false
+; CHECK-NEXT: s_mov_b32 s0, 33
+; CHECK-NEXT: s_branch .LBB18_3
+; CHECK-NEXT: .LBB18_2: ; %true
+; CHECK-NEXT: s_mov_b32 s0, 42
+; CHECK-NEXT: s_branch .LBB18_3
+; CHECK-NEXT: .LBB18_3:
+ %v1c = icmp ult i32 %v1, 12
+ %v2c = icmp ugt i32 %v2, 34
+ %c = and i1 %v1c, %v2c
+ %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %c)
+ %ballot_eq_zero = icmp eq i32 %ballot, 0
+ br i1 %ballot_eq_zero, label %true, label %false
+true:
+ ret i32 42
+false:
+ ret i32 33
+}
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i64.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i64.ll
index 9f83012f5457509..ebb96ddc0603d68 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i64.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i64.ll
@@ -86,3 +86,313 @@ define amdgpu_cs i64 @ctpop_of_ballot(float %x, float %y) {
%bcnt = call i64 @llvm.ctpop.i64(i64 %ballot)
ret i64 %bcnt
}
+
+d...
[truncated]
// %VCMP = i32/i64 AMDGPUISD::SETCC ...
// %C = ISD::SETCC %VCMP, 0, setne/seteq
// BRCOND %BB, %C
// =>
// %VCMP = i32/i64 AMDGPUISD::SETCC ...
// BRCONDZ %BB, %VCMP, setne/seteq
Why does VCMP have to be AMDGPUISD::SETCC here? Can you do a simpler combine instead:
// %C = ISD::SETCC %VCMP, 0, setne/seteq
// BRCOND %BB, %C
// =>
// BRCONDZ %BB, %VCMP, setne/seteq
// %VCMPSrc = ISD::SETCC or a logical combination of ISD::SETCCs
// %VCMP = i32/i64 AMDGPUISD::SETCC (ext %VCMPSrc), 0, setne
// %C = ISD::SETCC %VCMP, 0, setne/seteq
// BRCOND %BB, %C
// =>
// BRCONDZ %BB, %VCMPSrc, setne/seteq
Can you split this out into a separate simpler combine:
// %VCMPSrc = ISD::SETCC or a logical combination of ISD::SETCCs
// %VCMP = i32/i64 AMDGPUISD::SETCC (ext %VCMPSrc), 0, setne
// %C = ISD::SETCC %VCMP, 0, setne/seteq
// =>
// %C = ISD::SETCC or a logical combination of ISD::SETCCs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you Jay, these are good suggestions, and they showed me that BRCONDZ has vague semantics.
Here I use lowercase setcc to refer to the ordinary i1 ISD::SETCC opcode, and AMDGPU::SETCC for the special instruction returning a boolean mask across lanes.
The problem is that when we get rid of AMDGPU::SETCC (as in your second pattern), we lose the semantics of a boolean mask across all lanes.
This confusion comes from the description of the BRCONDZ node, which says it can accept a setcc value or a logical combination of setccs. Instead it should be defined so that BRCONDZ only accepts the result of an AMDGPU::SETCC node, meaning that BRCONDZ compares the uniform boolean mask resulting from the AMDGPU::SETCC operation against zero.
For example:
%Mask = i64 AMDGPU::SETCC i1 %c, 42, setle
BRCONDZ i64 %Mask, BB2, setne
=>
v_cmp_le vcc, v0, 42
s_cbranch_vccnz bb2
Now the confusing case with ballot:
%c = i1 setcc ...
%Mask = i64 AMDGPU::SETCC i32 (zext i32 %c), 0, setne ; <- this is lowered ballot
BRCONDZ i64 %Mask, BB2, setne
=>
v_cmp_ vcc, ... ; <- setcc
s_cmp_ne vcc, 0 ; <- ballot
s_cbranch_scc1 bb2
Here we want to remove the second comparison: vcc is the required boolean mask across all lanes resulting from the first v_cmp_ instruction, and it can be compared against zero by the branch itself. In SelectionDAG we know that a setcc on a divergent value will be selected as a v_cmp instruction, so we can use its result directly; that is why the setcc input to ballot is delivered to BRCONDZ. This is probably wrong.
I'm not really sure how to resolve this yet.
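For reference, the branch-opcode choice that `SelectBRCONDZ` makes in this patch is a simple two-by-two mapping over (uniform vs. divergent condition) and (compare-to-zero direction). A toy Python model, not LLVM code (the function name is illustrative):

```python
# Toy model of the branch opcode selection in SelectBRCONDZ:
# a uniform condition is branched via SCC, a divergent wave mask via VCC;
# eq_zero picks the branch-if-zero form, otherwise branch-if-nonzero.
def select_branch_opcode(use_scc_branch: bool, eq_zero: bool) -> str:
    if use_scc_branch:
        return "S_CBRANCH_SCC0" if eq_zero else "S_CBRANCH_SCC1"
    return "S_CBRANCH_VCCZ" if eq_zero else "S_CBRANCH_VCCNZ"
```

This mirrors the `UseSCCBr ? (EqZero ? ... : ...) : (EqZero ? ... : ...)` expression in the patch's `SelectBRCONDZ`.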
On second thought, these two patterns have a node in common:
%C = ISD::SETCC %VCMP, 0, setne/seteq
BRCOND %BB, %C
=>
BRCONDZ %BB, %VCMP, setne/seteq
%VCMPSrc = ISD::SETCC or a logical combination of ISD::SETCCs
%VCMP = i32/i64 AMDGPUISD::SETCC (ext %VCMPSrc), 0, setne
%C = ISD::SETCC %VCMP, 0, setne/seteq
=>
%C = ISD::SETCC or a logical combination of ISD::SETCCs
SelectionDAG starts with the BRCOND of the first pattern and erases the ISD::SETCC that is the root of the second pattern. I tried to perform the DAG combine on the %C value of the first rule so that the second rule could perform its transformation and return the result to the first rule, but this requires creating an artificial result %C = ISD::SETCC %VCMPSrc, 0, setne so that the first rule's pattern can match. It looks like this should work as a single pattern.
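The fused combine described above can be modeled as a small rewrite over expression trees. A hypothetical Python sketch, not the actual SelectionDAG API (node encoding, names, and the omitted ext/isBoolSGPR checks are all simplifications):

```python
# Toy expression-tree rewrite modeling the fused BRCOND combine:
#   BRCOND(BB, SETCC(VCMP, 0, cc))  ->  BRCONDZ(BB, VCMP, cc)
# and, when VCMP is a lowered ballot AMDGPUISD.SETCC(src, 0, setne),
# branch on the ballot's source directly instead.
def combine_brcond(node):
    kind, bb, cond = node
    if kind != "BRCOND" or cond[0] != "SETCC":
        return node  # not our pattern; leave unchanged
    _, vcmp, rhs, cc = cond
    if rhs != 0 or cc not in ("seteq", "setne"):
        return node
    src = vcmp
    # Ballot special case: AMDGPUISD.SETCC(src, 0, setne).
    if (isinstance(vcmp, tuple) and vcmp[0] == "AMDGPUISD.SETCC"
            and vcmp[2] == 0 and vcmp[3] == "setne"):
        src = vcmp[1]
    return ("BRCONDZ", bb, src, cc)
```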
I've reverted the previous implementation and submitted a simplified version that does not use an additional DAG node.
Revert "Improve selection of conditional branch on amdgcn.ballot!=0 condition in SelectionDAG." This reverts commit 13ebb87.
✅ With the latest revision this PR passed the C/C++ code formatter.
Force-pushed from 4429535 to abb2ca2.
Force-pushed from abb2ca2 to 8b836e2.
Fixed SCC not being used if
@@ -2259,6 +2259,30 @@ bool AMDGPUDAGToDAGISel::isCBranchSCC(const SDNode *N) const {
  return false;
}

bool isBoolSGPR(SDValue V);
Can just move the body up here to avoid the forward declaration
The definition is in another source file (SIISelLowering.cpp); I haven't found a common header to place this declaration in yet.
assert(VCMP->getOpcode() == AMDGPUISD::SETCC);
// Special case for amdgcn.ballot:
// %Cond = i1 (and/or combination of i1 ISD::SETCCs)
// %VCMP = i(WaveSize) AMDGPUISD::SETCC (ext %Cond), 0, setne ; lowered ballot
You might need to handle the negated form as well
Added, but I had to simulate it using the llvm.amdgcn.icmp(i1 %Cond, i1 0, ICMP_EQ) intrinsic.
* supported negated ballot
* improved comments
I've added support for negated ballot, but I had to simulate it using the llvm.amdgcn.icmp intrinsic.
Tests for GlobalISel aren't added because they fail with "cannot select llvm.amdgcn.icmp(i1 ...)".
Seems pretty reasonable to me
; CHECK: ; %bb.0:
; CHECK-NEXT: s_cmp_lt_u32 s0, 12
; CHECK-NEXT: s_cselect_b32 s0, -1, 0
; CHECK-NEXT: s_cmp_gt_u32 s1, 34
; CHECK-NEXT: s_cselect_b32 s1, -1, 0
; CHECK-NEXT: s_and_b32 s0, s0, s1
This sequence could be improved to:
s_cmp_lt_u32 s0, 12
s_cselect_b32 s0, -1, 0
s_cmp_gt_u32 s1, 34
s_cselect_b32 s0, s0, 0
By selecting into vcc(_lo) instead, we could even avoid the AND-with-exec that follows.
Not something for this patch, just an observation. Perhaps you could add a TODO.
(In general, LLVM knows to do boolean logic -> select folds. The fact that they don't fire here is probably a pass ordering issue.)
Perhaps you could add a TODO.
Thanks Nicolai, added the TODO.
* added the TODO per Nicolai's suggestion
With the latest commit I added a guard to skip ballot.i64 in wave32 mode, because it's lowered with an i64 AMDGPUISD::SETCC; this should be fixed to use an i32 result, similar to the ICMP/FCMP intrinsics. After a fix that follows this PR, I'm going to turn the guard into an assert; a TODO is added for this.
Though, on second thought, shouldn't there be some wave64 tests?
I added a couple of ballot.i64-in-wave32-mode tests to AMDGPU/GlobalISel/llvm.amdgcn.ballot.i32.ll and AMDGPU/llvm.amdgcn.ballot.i32.ll with 6e865d1. Or do you mean ballot.i32 in wave64 mode?
I meant ballot.i64 in wave64. For some reason I didn't see the changes you made in those tests (perhaps I was looking at GitHub's display of your last changes to the PR instead of the full PR). But since you did make those changes, it's all good as far as I'm concerned.
Thanks Nicolai!