
[AMDGPU] Change control flow intrinsic lowering making the wave to re… #86805

Closed
wants to merge 6 commits

Conversation


@alex-t alex-t commented Mar 27, 2024

…converge at the end of the predecessor block.

We currently lower SI_IF/ELSE, SI_LOOP, and SI_END_CF to reconverge the wave at the beginning of the control-flow join basic block or at the loop exit block. This leads to numerous issues related to the spill/split insertion points. LLVM's spill and split machinery considers the start of a block the best point at which to reload spilled registers; as a result, the vector reloads are issued before the exec mask is restored and are incorrectly masked out. A similar issue arose when the split kit split a live interval on the CF joining block: the spills were inserted before the exec mask was restored.

The current code is an early draft; the PR was created to serve as a design review and discussion space.
Many LIT tests are XFAILed because they no longer make sense with the new lowering.
This was done to let the change go through PSDB and ensure that the generated code runs correctly.
The remaining LIT tests are under construction and will be addressed properly as soon as the general approach is approved by the community.
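
To make the failure mode concrete, here is a minimal, self-contained sketch in plain C++ (not the LLVM API; all names are illustrative) of why a reload placed at the start of the join block goes wrong: a vector-memory load only writes lanes whose exec bit is set, so reloading before the exec mask is restored leaves the previously disabled lanes with stale contents.

#include <cstdint>
#include <cstdio>

constexpr int NumLanes = 8;
struct VGPR { int Lane[NumLanes]; };

// Model of an exec-masked vector load: only enabled lanes are written.
void reload(VGPR &Dst, const int *Slot, uint8_t Exec) {
  for (int I = 0; I != NumLanes; ++I)
    if (Exec & (1u << I))
      Dst.Lane[I] = Slot[I];
}

int main() {
  int Slot[NumLanes] = {1, 2, 3, 4, 5, 6, 7, 8}; // spilled while exec was full
  uint8_t ThenExec = 0x0F; // only lanes 0..3 executed the 'then' block
  uint8_t FullExec = 0xFF; // exec after the join-point mask restore

  VGPR Early = {};
  reload(Early, Slot, ThenExec); // reload at block start, before the restore
  printf("early reload, lane 7 = %d\n", Early.Lane[7]); // 0: value lost

  VGPR Late = {};
  reload(Late, Slot, FullExec);  // reload after the exec mask is restored
  printf("late reload,  lane 7 = %d\n", Late.Lane[7]);  // 8: correct
}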


llvmbot commented Mar 27, 2024

@llvm/pr-subscribers-llvm-ir
@llvm/pr-subscribers-llvm-analysis
@llvm/pr-subscribers-llvm-globalisel

@llvm/pr-subscribers-backend-amdgpu

Author: None (alex-t)

Changes


Patch is 3.27 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/86805.diff

196 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp (+15-27)
  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+85)
  • (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.cpp (+1)
  • (modified) llvm/lib/Target/AMDGPU/SIInstructions.td (+2-2)
  • (modified) llvm/lib/Target/AMDGPU/SILowerControlFlow.cpp (+131-384)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/atomic_optimizations_mul_one.ll (+3-1)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-phis-no-lane-mask-merging.ll (+16-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-used-outside-loop.ll (+95-53)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll (+97-49)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-i1.ll (+20-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-reg.ll (+5-3)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergent-control-flow.ll (+29-18)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/fp64-atomics-gfx90a.ll (+60-28)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/global-atomic-fadd.f32-no-rtn.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/global-atomic-fadd.f32-rtn.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-atomicrmw.ll (-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-function-args.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.end.cf.i32.ll (+5-5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.end.cf.i64.ll (+4-5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll (+190-126)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memmove.ll (+9-5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/localizer.ll (+23-22)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/mul-known-bits.i64.ll (+36-25)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/non-entry-alloca.ll (+28-17)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/sdiv.i64.ll (+262-257)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/srem.i64.ll (+225-218)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/udiv.i64.ll (+80-73)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/urem.i64.ll (+76-69)
  • (modified) llvm/test/CodeGen/AMDGPU/amdpal-callable.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic-optimizer-strict-wqm.ll (+25-13)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_buffer.ll (+514-342)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll (+691-421)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll (+1086-686)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_pixelshader.ll (+191-119)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_raw_buffer.ll (+466-310)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_struct_buffer.ll (+442-286)
  • (modified) llvm/test/CodeGen/AMDGPU/atomicrmw-expand.ll (+123-91)
  • (modified) llvm/test/CodeGen/AMDGPU/atomicrmw-nand.ll (+15-9)
  • (modified) llvm/test/CodeGen/AMDGPU/atomics-cas-remarks-gfx90a.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/bb-prolog-spill-during-regalloc.ll (+19-18)
  • (modified) llvm/test/CodeGen/AMDGPU/block-should-not-be-in-alive-blocks.mir (+13-13)
  • (modified) llvm/test/CodeGen/AMDGPU/branch-condition-and.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/branch-folding-implicit-def-subreg.ll (+428-376)
  • (modified) llvm/test/CodeGen/AMDGPU/branch-relaxation-gfx10-branch-offset-bug.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/branch-relaxation.ll (+30-19)
  • (modified) llvm/test/CodeGen/AMDGPU/bug-sdag-emitcopyfromreg.ll (+7-65)
  • (modified) llvm/test/CodeGen/AMDGPU/bypass-div.ll (+60-36)
  • (modified) llvm/test/CodeGen/AMDGPU/byval-frame-setup.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/call-skip.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/cgp-addressing-modes-flat.ll (+164-92)
  • (modified) llvm/test/CodeGen/AMDGPU/cgp-addressing-modes-gfx1030.ll (+2-1)
  • (modified) llvm/test/CodeGen/AMDGPU/cgp-addressing-modes-gfx908.ll (+7-4)
  • (modified) llvm/test/CodeGen/AMDGPU/cgp-addressing-modes.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/codegen-prepare-addrspacecast-non-null.ll (+36-26)
  • (modified) llvm/test/CodeGen/AMDGPU/collapse-endcf.ll (+490-443)
  • (modified) llvm/test/CodeGen/AMDGPU/collapse-endcf.mir (+297-195)
  • (modified) llvm/test/CodeGen/AMDGPU/control-flow-fastregalloc.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/control-flow-optnone.ll (+2)
  • (modified) llvm/test/CodeGen/AMDGPU/convergent-inlineasm.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/cse-convergent.ll (+10-6)
  • (modified) llvm/test/CodeGen/AMDGPU/cse-phi-incoming-val.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/dag-divergence-atomic.ll (+15-9)
  • (modified) llvm/test/CodeGen/AMDGPU/dagcombine-lshr-and-cmp.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/dagcombine-v1i8-extractvecelt-crash.ll (+6-3)
  • (modified) llvm/test/CodeGen/AMDGPU/div_i128.ll (+169-165)
  • (modified) llvm/test/CodeGen/AMDGPU/divergent-branch-uniform-condition.ll (+19-8)
  • (modified) llvm/test/CodeGen/AMDGPU/elf-header-flags-mach.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/else.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/endcf-loop-header.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/fix-frame-ptr-reg-copy-livein.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/flat_atomics_i32_system.ll (+765-444)
  • (modified) llvm/test/CodeGen/AMDGPU/flat_atomics_i64_system.ll (+765-444)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg-combines.f16.ll (+26-16)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg-combines.new.ll (+14-8)
  • (modified) llvm/test/CodeGen/AMDGPU/fold-fabs.ll (+36-16)
  • (modified) llvm/test/CodeGen/AMDGPU/fp64-atomics-gfx90a.ll (+60-28)
  • (modified) llvm/test/CodeGen/AMDGPU/frame-index-elimination.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/function-args.ll (+15-9)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomic-fadd.f32-no-rtn.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomic-fadd.f32-rtn.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomics-fp-wrong-subtarget.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomics-fp.ll (+473-252)
  • (modified) llvm/test/CodeGen/AMDGPU/global-saddr-atomics-min-max-system.ll (+480-240)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_i32_system.ll (+813-492)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_i64_system.ll (+765-444)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fadd.ll (+1156-590)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmax.ll (+753-348)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmin.ll (+753-348)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fsub.ll (+1234-622)
  • (modified) llvm/test/CodeGen/AMDGPU/hoist-cond.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/i1-copy-from-loop.ll (+19-12)
  • (modified) llvm/test/CodeGen/AMDGPU/i1-copy-phi.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/i1_copy_phi_with_phi_incoming_value.mir (+4-13)
  • (modified) llvm/test/CodeGen/AMDGPU/identical-subrange-spill-infloop.ll (+100-89)
  • (modified) llvm/test/CodeGen/AMDGPU/indirect-addressing-si.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/indirect-call.ll (+13-7)
  • (modified) llvm/test/CodeGen/AMDGPU/infinite-loop.ll (+2)
  • (modified) llvm/test/CodeGen/AMDGPU/inline-asm.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/insert-delay-alu-bug.ll (+24-11)
  • (modified) llvm/test/CodeGen/AMDGPU/kill-infinite-loop.ll (+32-21)
  • (modified) llvm/test/CodeGen/AMDGPU/lds-global-non-entry-func.ll (+118-79)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.div.fmas.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ds.ordered.swap.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.exp.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.inverse.ballot.i32.ll (+14-6)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.inverse.ballot.i64.ll (+14-6)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ps.live.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.umax.ll (+160-108)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.umin.ll (+160-108)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.sendmsg.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.softwqm.ll (+20-12)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wqm.demote.ll (+198-134)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomics-fp.ll (+410-254)
  • (modified) llvm/test/CodeGen/AMDGPU/long-branch-reserve-register.ll (+11-8)
  • (modified) llvm/test/CodeGen/AMDGPU/loop-live-out-copy-undef-subrange.ll (+5-3)
  • (modified) llvm/test/CodeGen/AMDGPU/loop-on-function-argument.ll (+7-3)
  • (modified) llvm/test/CodeGen/AMDGPU/loop_break.ll (+32-17)
  • (modified) llvm/test/CodeGen/AMDGPU/loop_exit_with_xor.ll (+34-23)
  • (modified) llvm/test/CodeGen/AMDGPU/lower-control-flow-live-intervals.mir (+73-58)
  • (modified) llvm/test/CodeGen/AMDGPU/lower-control-flow-live-variables-update.mir (+65-81)
  • (modified) llvm/test/CodeGen/AMDGPU/lower-control-flow-live-variables-update.xfail.mir (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/lower-control-flow-other-terminators.mir (+39-39)
  • (modified) llvm/test/CodeGen/AMDGPU/lower-i1-copies-clear-kills.mir (+4-8)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-sink-lane-mask.mir (+2-6)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-sink-loop-var-out-of-divergent-loop-swdev407790.ll (+26-16)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-sink-loop-var-out-of-divergent-loop-swdev407790.mir (-4)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.ll (+252-184)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.mir (-4)
  • (modified) llvm/test/CodeGen/AMDGPU/memcpy-crash-issue63986.ll (+37-22)
  • (modified) llvm/test/CodeGen/AMDGPU/mixed-wave32-wave64.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/move-to-valu-atomicrmw-system.ll (+20-9)
  • (modified) llvm/test/CodeGen/AMDGPU/move-to-valu-atomicrmw.ll (+10-4)
  • (modified) llvm/test/CodeGen/AMDGPU/mubuf-legalize-operands-non-ptr-intrinsics.ll (+50-34)
  • (modified) llvm/test/CodeGen/AMDGPU/mubuf-legalize-operands.ll (+50-34)
  • (modified) llvm/test/CodeGen/AMDGPU/mul24-pass-ordering.ll (+18-9)
  • (modified) llvm/test/CodeGen/AMDGPU/multi-divergent-exit-region.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/multilevel-break.ll (+3-4)
  • (modified) llvm/test/CodeGen/AMDGPU/nested-loop-conditions.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/no-dup-inst-prefetch.ll (+26-13)
  • (modified) llvm/test/CodeGen/AMDGPU/non-entry-alloca.ll (+62-36)
  • (modified) llvm/test/CodeGen/AMDGPU/phi-elimination-end-cf.mir (+1-2)
  • (modified) llvm/test/CodeGen/AMDGPU/rem_i128.ll (+170-166)
  • (modified) llvm/test/CodeGen/AMDGPU/ret_jump.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/scheduler-rp-calc-one-successor-two-predecessors-bug.ll (+11-9)
  • (modified) llvm/test/CodeGen/AMDGPU/sdiv64.ll (+107-79)
  • (modified) llvm/test/CodeGen/AMDGPU/set-inactive-wwm-overwrite.ll (+32-19)
  • (modified) llvm/test/CodeGen/AMDGPU/setcc-sext.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/sgpr-control-flow.ll (+26-19)
  • (modified) llvm/test/CodeGen/AMDGPU/should-not-hoist-set-inactive.ll (+28-15)
  • (modified) llvm/test/CodeGen/AMDGPU/si-annotate-cf-kill.ll (+26-12)
  • (modified) llvm/test/CodeGen/AMDGPU/si-annotate-cf-noloop.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/si-annotate-cf-unreachable.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/si-annotate-cf.ll (+34-20)
  • (modified) llvm/test/CodeGen/AMDGPU/si-annotate-dbg-info.ll (+6-7)
  • (modified) llvm/test/CodeGen/AMDGPU/si-annotate-nested-control-flows.ll (+2)
  • (modified) llvm/test/CodeGen/AMDGPU/si-annotatecfg-multiple-backedges.ll (+5-5)
  • (modified) llvm/test/CodeGen/AMDGPU/si-fix-sgpr-copies.mir (-1)
  • (modified) llvm/test/CodeGen/AMDGPU/si-lower-control-flow-kill.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/si-lower-control-flow-unreachable-block.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/si-lower-control-flow.mir (+61-84)
  • (modified) llvm/test/CodeGen/AMDGPU/si-lower-i1-copies-order-of-phi-incomings.mir (+8-4)
  • (modified) llvm/test/CodeGen/AMDGPU/si-opt-vgpr-liverange-bug-deadlanes.mir (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/si-optimize-vgpr-live-range-dbg-instr.ll (+14-15)
  • (modified) llvm/test/CodeGen/AMDGPU/si-optimize-vgpr-live-range-dbg-instr.mir (+2-1)
  • (modified) llvm/test/CodeGen/AMDGPU/si-unify-exit-multiple-unreachables.ll (+29-51)
  • (modified) llvm/test/CodeGen/AMDGPU/si-unify-exit-return-unreachable.ll (+7-6)
  • (modified) llvm/test/CodeGen/AMDGPU/skip-branch-trap.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/skip-if-dead.ll (+210-130)
  • (modified) llvm/test/CodeGen/AMDGPU/spill-cfg-position.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/spill-scavenge-offset.ll (+60-75)
  • (modified) llvm/test/CodeGen/AMDGPU/srem64.ll (+107-79)
  • (modified) llvm/test/CodeGen/AMDGPU/stack-pointer-offset-relative-frameindex.ll (+22-10)
  • (modified) llvm/test/CodeGen/AMDGPU/stacksave_stackrestore.ll (+65-47)
  • (modified) llvm/test/CodeGen/AMDGPU/stale-livevar-in-twoaddr-pass.mir (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/stop-tail-duplicate-cfg-intrinsic.mir (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/subreg-coalescer-undef-use.ll (+7-3)
  • (modified) llvm/test/CodeGen/AMDGPU/transform-block-with-return-to-epilog.ll (+38-42)
  • (modified) llvm/test/CodeGen/AMDGPU/tuple-allocation-failure.ll (+114-96)
  • (modified) llvm/test/CodeGen/AMDGPU/udiv64.ll (+112-84)
  • (modified) llvm/test/CodeGen/AMDGPU/uniform-cfg.ll (+54-30)
  • (modified) llvm/test/CodeGen/AMDGPU/uniform-loop-inside-nonuniform.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/uniform-phi-with-undef.ll (+9-7)
  • (modified) llvm/test/CodeGen/AMDGPU/unstructured-cfg-def-use-issue.ll (+63-35)
  • (modified) llvm/test/CodeGen/AMDGPU/urem64.ll (+83-62)
  • (modified) llvm/test/CodeGen/AMDGPU/valu-i1.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/vcmp-saveexec-to-vcmpx.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/vgpr-liverange-ir.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/vgpr-liverange.ll (+66-48)
  • (modified) llvm/test/CodeGen/AMDGPU/vgpr-mark-last-scratch-load.ll (+12-8)
  • (modified) llvm/test/CodeGen/AMDGPU/vgpr-spill-placement-issue61083.ll (+16-14)
  • (modified) llvm/test/CodeGen/AMDGPU/vni8-across-blocks.ll (+756-735)
  • (modified) llvm/test/CodeGen/AMDGPU/wave32.ll (+185-93)
  • (modified) llvm/test/CodeGen/AMDGPU/while-break.ll (+68-46)
  • (modified) llvm/test/CodeGen/AMDGPU/wqm.ll (+313-197)
  • (modified) llvm/test/CodeGen/AMDGPU/wwm-reserved-spill.ll (+53-42)
  • (modified) llvm/test/CodeGen/AMDGPU/wwm-reserved.ll (+82-60)
diff --git a/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp b/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp
index 58214f30bb8d67..31dcfb959e54c0 100644
--- a/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp
+++ b/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp
@@ -306,38 +306,26 @@ bool SIAnnotateControlFlow::handleLoop(BranchInst *Term) {
 
 /// Close the last opened control flow
 bool SIAnnotateControlFlow::closeControlFlow(BasicBlock *BB) {
-  llvm::Loop *L = LI->getLoopFor(BB);
 
   assert(Stack.back().first == BB);
 
-  if (L && L->getHeader() == BB) {
-    // We can't insert an EndCF call into a loop header, because it will
-    // get executed on every iteration of the loop, when it should be
-    // executed only once before the loop.
-    SmallVector <BasicBlock *, 8> Latches;
-    L->getLoopLatches(Latches);
-
-    SmallVector<BasicBlock *, 2> Preds;
-    for (BasicBlock *Pred : predecessors(BB)) {
-      if (!is_contained(Latches, Pred))
-        Preds.push_back(Pred);
-    }
-
-    BB = SplitBlockPredecessors(BB, Preds, "endcf.split", DT, LI, nullptr,
-                                false);
-  }
-
   Value *Exec = popSaved();
-  BasicBlock::iterator FirstInsertionPt = BB->getFirstInsertionPt();
-  if (!isa<UndefValue>(Exec) && !isa<UnreachableInst>(FirstInsertionPt)) {
-    Instruction *ExecDef = cast<Instruction>(Exec);
-    BasicBlock *DefBB = ExecDef->getParent();
-    if (!DT->dominates(DefBB, BB)) {
-      // Split edge to make Def dominate Use
-      FirstInsertionPt = SplitEdge(DefBB, BB, DT, LI)->getFirstInsertionPt();
+  Instruction *ExecDef = dyn_cast<Instruction>(Exec);
+  BasicBlock *DefBB = ExecDef->getParent();
+  for (auto Pred : predecessors(BB)) {
+    llvm::Loop *L = LI->getLoopFor(Pred);
+    bool IsLoopLatch = false;
+    if (L) {
+      SmallVector<BasicBlock *, 4> LL;
+      L->getLoopLatches(LL);
+      IsLoopLatch = std::find_if(LL.begin(), LL.end(), [Pred](BasicBlock *B) {
+                      return B == Pred;
+                    }) != LL.end();
+    }
+    if (Pred != DefBB && DT->dominates(DefBB, Pred) && !IsLoopLatch) {
+      BasicBlock::iterator InsPt(Pred->getTerminator());
+      IRBuilder<>(Pred, InsPt).CreateCall(EndCf, {Exec});
     }
-    IRBuilder<>(FirstInsertionPt->getParent(), FirstInsertionPt)
-        .CreateCall(EndCf, {Exec});
   }
 
   return true;
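
As a rough illustration of the placement rule the hunk above implements, here is a toy model in plain C++ (hypothetical CFG and dominance helpers, not the LLVM API): the end-of-CF intrinsic is emitted into every predecessor of the join block that is dominated by the block defining the saved exec value and is not a loop latch, instead of into the join block itself.

#include <cstdio>
#include <string>
#include <vector>

struct Block {
  std::string Name;
  bool IsLoopLatch = false;
  Block *IDom = nullptr; // immediate dominator, for a toy dominance test
};

// Toy dominance query: walk the idom chain (real code asks DominatorTree).
static bool dominates(const Block *A, const Block *B) {
  for (const Block *X = B; X; X = X->IDom)
    if (X == A)
      return true;
  return false;
}

int main() {
  Block Entry{"entry"}, DefBB{"if.then"}, Other{"if.then.end"}, Latch{"loop.latch"};
  DefBB.IDom = &Entry;
  Other.IDom = &DefBB;
  Latch.IDom = &Entry;
  Latch.IsLoopLatch = true;

  // Predecessors of the control-flow join block.
  std::vector<Block *> JoinPreds = {&DefBB, &Other, &Latch};
  for (Block *Pred : JoinPreds)
    if (Pred != &DefBB && dominates(&DefBB, Pred) && !Pred->IsLoopLatch)
      printf("insert end_cf before the terminator of %s\n", Pred->Name.c_str());
  // Only "if.then.end" qualifies: the def block itself and the latch are skipped.
}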
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 9bc1b8eb598f3a..6f94cb16175786 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -15619,6 +15619,91 @@ void SITargetLowering::finalizeLowering(MachineFunction &MF) const {
     }
   }
 
+  // ISel inserts copies to registers for the successor PHIs at the BB end.
+  // We need to move the SI_END_CF right before the branch. Even if we don't
+  // have to move the SI_END_CF, we need to take care of S_CBRANCH_SCC0/1,
+  // as SI_END_CF overwrites SCC.
+  for (auto &MBB : MF) {
+    for (auto &MI : MBB) {
+      if (MI.getOpcode() == AMDGPU::SI_END_CF) {
+        MachineBasicBlock::iterator I(MI);
+        MachineBasicBlock::iterator Next = std::next(I);
+        bool NeedToMove = false;
+        while (Next != MBB.end() && !Next->isBranch()) {
+          NeedToMove = true;
+          Next++;
+        }
+
+        // Let's take care of SCC users, as SI_END_CF defines SCC.
+        bool NeedPreserveSCC =
+            Next != MBB.end() && Next->readsRegister(AMDGPU::SCC);
+        MachineBasicBlock::iterator SCCDefUse(Next);
+        // This loop will never be taken, as we always have S_CBRANCH_SCC1/0
+        // at the end of the block.
+        while (!NeedPreserveSCC && SCCDefUse != MBB.end()) {
+          if (SCCDefUse->definesRegister(AMDGPU::SCC))
+            // This should never happen - SCC def after the branch reading SCC
+            break;
+          if (SCCDefUse->readsRegister(AMDGPU::SCC)) {
+            NeedPreserveSCC = true;
+            break;
+          }
+          SCCDefUse++;
+        }
+        if (NeedPreserveSCC) {
+          MachineBasicBlock::reverse_iterator BackSeeker(Next);
+          while (BackSeeker != MBB.rend()) {
+            if (BackSeeker != MI && BackSeeker->definesRegister(AMDGPU::SCC))
+              break;
+            BackSeeker++;
+          }
+          // We need this to make some artificial MIR tests happy.
+          bool NeedSetSCCUndef = false;
+          if (BackSeeker == MBB.rend()) {
+            // We have reached the beginning of the block but haven't seen the
+            // SCC def. Given that the MIR is correct, either SCC is live in or
+            // the SCC user's SCC operand is undef. In fact, we don't need to
+            // emit the instructions that preserve the SCC if the use is undef;
+            // we do this just because the MIR looks weird otherwise.
+            MachineOperand *SCCUseOp =
+                SCCDefUse->findRegisterUseOperand(AMDGPU::SCC, false, TRI);
+            assert(SCCUseOp);
+            bool IsSCCLiveIn = MBB.isLiveIn(AMDGPU::SCC);
+            bool IsUseUndef = SCCUseOp->isUndef();
+            NeedSetSCCUndef = (!IsSCCLiveIn && IsUseUndef);
+          }
+          MachineBasicBlock::iterator InsPt(BackSeeker);
+          Register SavedSCC =
+              MRI.createVirtualRegister(&AMDGPU::SReg_32RegClass);
+          MachineInstr *SaveSCC =
+              BuildMI(MBB, InsPt, InsPt->getDebugLoc(),
+                      TII->get(AMDGPU::S_CSELECT_B32), SavedSCC)
+                  .addImm(1)
+                  .addImm(0);
+          if (NeedSetSCCUndef) {
+
+            MachineOperand *SCCOp =
+                SaveSCC->findRegisterUseOperand(AMDGPU::SCC, false, TRI);
+            if (SCCOp)
+              SCCOp->setIsUndef();
+          }
+          Register Tmp =
+              MRI.createVirtualRegister(&AMDGPU::SReg_32RegClass);
+          Next = BuildMI(MBB, Next, Next->getDebugLoc(),
+                         TII->get(AMDGPU::S_AND_B32_term), Tmp)
+                     .addReg(SavedSCC)
+                     .addImm(1);
+        }
+
+        if (NeedToMove) {
+          MBB.splice(Next, &MBB, &MI);
+        }
+
+        break;
+      }
+    }
+  }
+
   // FIXME: This is a hack to fixup AGPR classes to use the properly aligned
   // classes if required. Ideally the register class constraints would differ
   // per-subtarget, but there's no easy way to achieve that right now. This is
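
For intuition, the following plain-C++ model (a hedged sketch; names are illustrative, not the LLVM API) shows the SCC save/restore idiom the hunk above builds around SI_END_CF: S_CSELECT_B32 materializes the current SCC as 1/0 before the end-cf mask operations clobber it, and a terminator S_AND_B32 re-derives SCC for the following S_CBRANCH_SCC0/1.

#include <cassert>
#include <cstdint>

struct Machine {
  bool SCC = false;
  // S_CSELECT_B32 dst, a, b: dst = SCC ? a : b (reads SCC, does not set it).
  uint32_t s_cselect_b32(uint32_t A, uint32_t B) { return SCC ? A : B; }
  // S_AND_B32 dst, a, b: dst = a & b, and sets SCC = (dst != 0).
  uint32_t s_and_b32(uint32_t A, uint32_t B) {
    uint32_t D = A & B;
    SCC = D != 0;
    return D;
  }
};

int main() {
  Machine M;
  M.SCC = true; // some earlier compare set SCC for the conditional branch

  uint32_t SavedSCC = M.s_cselect_b32(1, 0); // save SCC before SI_END_CF
  M.SCC = false;                             // SI_END_CF's mask ops clobber SCC
  (void)M.s_and_b32(SavedSCC, 1);            // restore: SCC = (SavedSCC & 1) != 0

  assert(M.SCC); // S_CBRANCH_SCC1 now sees the value saved above
  return 0;
}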
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index c19c3c6017a7c8..acfa6e88725a94 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -3051,6 +3051,7 @@ bool SIInstrInfo::analyzeBranch(MachineBasicBlock &MBB, MachineBasicBlock *&TBB,
       break;
     case AMDGPU::SI_IF:
     case AMDGPU::SI_ELSE:
+    case AMDGPU::SI_END_CF:
     case AMDGPU::SI_KILL_I1_TERMINATOR:
     case AMDGPU::SI_KILL_F32_COND_IMM_TERMINATOR:
       // FIXME: It's messy that these need to be considered here at all.
diff --git a/llvm/lib/Target/AMDGPU/SIInstructions.td b/llvm/lib/Target/AMDGPU/SIInstructions.td
index 33c93cdf20c43b..54c0359b5272b3 100644
--- a/llvm/lib/Target/AMDGPU/SIInstructions.td
+++ b/llvm/lib/Target/AMDGPU/SIInstructions.td
@@ -469,8 +469,6 @@ def SI_LOOP : CFPseudoInstSI <
   let IsNeverUniform = 1;
 }
 
-} // End isTerminator = 1
-
 def SI_END_CF : CFPseudoInstSI <
   (outs), (ins SReg_1:$saved), [], 1, 1> {
   let Size = 4;
@@ -482,6 +480,8 @@ def SI_END_CF : CFPseudoInstSI <
   let mayStore = 1;
 }
 
+} // End isTerminator = 1
+
 def SI_IF_BREAK : CFPseudoInstSI <
   (outs SReg_1:$dst), (ins SReg_1:$vcc, SReg_1:$src), []> {
   let Size = 4;
diff --git a/llvm/lib/Target/AMDGPU/SILowerControlFlow.cpp b/llvm/lib/Target/AMDGPU/SILowerControlFlow.cpp
index f178324dbbe246..b5bd2bf02dfab7 100644
--- a/llvm/lib/Target/AMDGPU/SILowerControlFlow.cpp
+++ b/llvm/lib/Target/AMDGPU/SILowerControlFlow.cpp
@@ -82,6 +82,9 @@ class SILowerControlFlow : public MachineFunctionPass {
   SmallSet<Register, 8> RecomputeRegs;
 
   const TargetRegisterClass *BoolRC = nullptr;
+  long unsigned TestMask;
+  unsigned Select;
+  unsigned CmovOpc;
   unsigned AndOpc;
   unsigned OrOpc;
   unsigned XorOpc;
@@ -92,16 +95,14 @@ class SILowerControlFlow : public MachineFunctionPass {
   unsigned OrSaveExecOpc;
   unsigned Exec;
 
-  bool EnableOptimizeEndCf = false;
-
-  bool hasKill(const MachineBasicBlock *Begin, const MachineBasicBlock *End);
-
   void emitIf(MachineInstr &MI);
   void emitElse(MachineInstr &MI);
   void emitIfBreak(MachineInstr &MI);
   void emitLoop(MachineInstr &MI);
+  void emitWaveDiverge(MachineInstr &MI, Register EnabledLanesMask,
+                       Register DisableLanesMask);
 
-  MachineBasicBlock *emitEndCf(MachineInstr &MI);
+  void emitEndCf(MachineInstr &MI);
 
   void lowerInitExec(MachineBasicBlock *MBB, MachineInstr &MI);
 
@@ -110,8 +111,6 @@ class SILowerControlFlow : public MachineFunctionPass {
 
   void combineMasks(MachineInstr &MI);
 
-  bool removeMBBifRedundant(MachineBasicBlock &MBB);
-
   MachineBasicBlock *process(MachineInstr &MI);
 
   // Skip to the next instruction, ignoring debug instructions, and trivial
@@ -134,9 +133,6 @@ class SILowerControlFlow : public MachineFunctionPass {
     return I;
   }
 
-  // Remove redundant SI_END_CF instructions.
-  void optimizeEndCf();
-
 public:
   static char ID;
 
@@ -166,205 +162,39 @@ char SILowerControlFlow::ID = 0;
 INITIALIZE_PASS(SILowerControlFlow, DEBUG_TYPE,
                "SI lower control flow", false, false)
 
-static void setImpSCCDefDead(MachineInstr &MI, bool IsDead) {
-  MachineOperand &ImpDefSCC = MI.getOperand(3);
-  assert(ImpDefSCC.getReg() == AMDGPU::SCC && ImpDefSCC.isDef());
-
-  ImpDefSCC.setIsDead(IsDead);
-}
-
 char &llvm::SILowerControlFlowID = SILowerControlFlow::ID;
 
-bool SILowerControlFlow::hasKill(const MachineBasicBlock *Begin,
-                                 const MachineBasicBlock *End) {
-  DenseSet<const MachineBasicBlock*> Visited;
-  SmallVector<MachineBasicBlock *, 4> Worklist(Begin->successors());
-
-  while (!Worklist.empty()) {
-    MachineBasicBlock *MBB = Worklist.pop_back_val();
-
-    if (MBB == End || !Visited.insert(MBB).second)
-      continue;
-    if (KillBlocks.contains(MBB))
-      return true;
-
-    Worklist.append(MBB->succ_begin(), MBB->succ_end());
-  }
-
-  return false;
-}
-
-static bool isSimpleIf(const MachineInstr &MI, const MachineRegisterInfo *MRI) {
-  Register SaveExecReg = MI.getOperand(0).getReg();
-  auto U = MRI->use_instr_nodbg_begin(SaveExecReg);
-
-  if (U == MRI->use_instr_nodbg_end() ||
-      std::next(U) != MRI->use_instr_nodbg_end() ||
-      U->getOpcode() != AMDGPU::SI_END_CF)
-    return false;
-
-  return true;
-}
-
 void SILowerControlFlow::emitIf(MachineInstr &MI) {
   MachineBasicBlock &MBB = *MI.getParent();
   const DebugLoc &DL = MI.getDebugLoc();
   MachineBasicBlock::iterator I(&MI);
-  Register SaveExecReg = MI.getOperand(0).getReg();
-  MachineOperand& Cond = MI.getOperand(1);
+  Register MaskElse = MI.getOperand(0).getReg();
+  MachineOperand &Cond = MI.getOperand(1);
   assert(Cond.getSubReg() == AMDGPU::NoSubRegister);
-
-  MachineOperand &ImpDefSCC = MI.getOperand(4);
-  assert(ImpDefSCC.getReg() == AMDGPU::SCC && ImpDefSCC.isDef());
-
-  // If there is only one use of save exec register and that use is SI_END_CF,
-  // we can optimize SI_IF by returning the full saved exec mask instead of
-  // just cleared bits.
-  bool SimpleIf = isSimpleIf(MI, MRI);
-
-  if (SimpleIf) {
-    // Check for SI_KILL_*_TERMINATOR on path from if to endif.
-    // if there is any such terminator simplifications are not safe.
-    auto UseMI = MRI->use_instr_nodbg_begin(SaveExecReg);
-    SimpleIf = !hasKill(MI.getParent(), UseMI->getParent());
-  }
-
-  // Add an implicit def of exec to discourage scheduling VALU after this which
-  // will interfere with trying to form s_and_saveexec_b64 later.
-  Register CopyReg = SimpleIf ? SaveExecReg
-                       : MRI->createVirtualRegister(BoolRC);
-  MachineInstr *CopyExec =
-    BuildMI(MBB, I, DL, TII->get(AMDGPU::COPY), CopyReg)
-    .addReg(Exec)
-    .addReg(Exec, RegState::ImplicitDefine);
-  LoweredIf.insert(CopyReg);
-
-  Register Tmp = MRI->createVirtualRegister(BoolRC);
-
-  MachineInstr *And =
-    BuildMI(MBB, I, DL, TII->get(AndOpc), Tmp)
-    .addReg(CopyReg)
-    .add(Cond);
-  if (LV)
-    LV->replaceKillInstruction(Cond.getReg(), MI, *And);
-
-  setImpSCCDefDead(*And, true);
-
-  MachineInstr *Xor = nullptr;
-  if (!SimpleIf) {
-    Xor =
-      BuildMI(MBB, I, DL, TII->get(XorOpc), SaveExecReg)
-      .addReg(Tmp)
-      .addReg(CopyReg);
-    setImpSCCDefDead(*Xor, ImpDefSCC.isDead());
-  }
-
-  // Use a copy that is a terminator to get correct spill code placement it with
-  // fast regalloc.
-  MachineInstr *SetExec =
-    BuildMI(MBB, I, DL, TII->get(MovTermOpc), Exec)
-    .addReg(Tmp, RegState::Kill);
+  Register CondReg = Cond.getReg();
+
+  Register MaskThen = MRI->createVirtualRegister(BoolRC);
+  // Get rid of the garbage bits in the Cond register which might be coming from
+  // the bitwise arithmetic when one of the expression operands is coming from
+  // the outer scope and hence having extra bits set.
+  MachineInstr *CondFiltered = BuildMI(MBB, I, DL, TII->get(AndOpc), MaskThen)
+                                   .add(Cond)
+                                   .addReg(Exec);
   if (LV)
-    LV->getVarInfo(Tmp).Kills.push_back(SetExec);
-
-  // Skip ahead to the unconditional branch in case there are other terminators
-  // present.
-  I = skipToUncondBrOrEnd(MBB, I);
+    LV->replaceKillInstruction(CondReg, MI, *CondFiltered);
 
-  // Insert the S_CBRANCH_EXECZ instruction which will be optimized later
-  // during SIRemoveShortExecBranches.
-  MachineInstr *NewBr = BuildMI(MBB, I, DL, TII->get(AMDGPU::S_CBRANCH_EXECZ))
-                            .add(MI.getOperand(2));
+  emitWaveDiverge(MI, MaskThen, MaskElse);
 
-  if (!LIS) {
-    MI.eraseFromParent();
-    return;
+  if (LIS) {
+    LIS->InsertMachineInstrInMaps(*CondFiltered);
+    LIS->createAndComputeVirtRegInterval(MaskThen);
   }
-
-  LIS->InsertMachineInstrInMaps(*CopyExec);
-
-  // Replace with and so we don't need to fix the live interval for condition
-  // register.
-  LIS->ReplaceMachineInstrInMaps(MI, *And);
-
-  if (!SimpleIf)
-    LIS->InsertMachineInstrInMaps(*Xor);
-  LIS->InsertMachineInstrInMaps(*SetExec);
-  LIS->InsertMachineInstrInMaps(*NewBr);
-
-  LIS->removeAllRegUnitsForPhysReg(AMDGPU::EXEC);
-  MI.eraseFromParent();
-
-  // FIXME: Is there a better way of adjusting the liveness? It shouldn't be
-  // hard to add another def here but I'm not sure how to correctly update the
-  // valno.
-  RecomputeRegs.insert(SaveExecReg);
-  LIS->createAndComputeVirtRegInterval(Tmp);
-  if (!SimpleIf)
-    LIS->createAndComputeVirtRegInterval(CopyReg);
 }
 
 void SILowerControlFlow::emitElse(MachineInstr &MI) {
-  MachineBasicBlock &MBB = *MI.getParent();
-  const DebugLoc &DL = MI.getDebugLoc();
-
-  Register DstReg = MI.getOperand(0).getReg();
-  Register SrcReg = MI.getOperand(1).getReg();
-
-  MachineBasicBlock::iterator Start = MBB.begin();
-
-  // This must be inserted before phis and any spill code inserted before the
-  // else.
-  Register SaveReg = MRI->createVirtualRegister(BoolRC);
-  MachineInstr *OrSaveExec =
-    BuildMI(MBB, Start, DL, TII->get(OrSaveExecOpc), SaveReg)
-    .add(MI.getOperand(1)); // Saved EXEC
-  if (LV)
-    LV->replaceKillInstruction(SrcReg, MI, *OrSaveExec);
-
-  MachineBasicBlock *DestBB = MI.getOperand(2).getMBB();
-
-  MachineBasicBlock::iterator ElsePt(MI);
-
-  // This accounts for any modification of the EXEC mask within the block and
-  // can be optimized out pre-RA when not required.
-  MachineInstr *And = BuildMI(MBB, ElsePt, DL, TII->get(AndOpc), DstReg)
-                          .addReg(Exec)
-                          .addReg(SaveReg);
-
-  MachineInstr *Xor =
-    BuildMI(MBB, ElsePt, DL, TII->get(XorTermrOpc), Exec)
-    .addReg(Exec)
-    .addReg(DstReg);
-
-  // Skip ahead to the unconditional branch in case there are other terminators
-  // present.
-  ElsePt = skipToUncondBrOrEnd(MBB, ElsePt);
-
-  MachineInstr *Branch =
-      BuildMI(MBB, ElsePt, DL, TII->get(AMDGPU::S_CBRANCH_EXECZ))
-          .addMBB(DestBB);
-
-  if (!LIS) {
-    MI.eraseFromParent();
-    return;
-  }
-
-  LIS->RemoveMachineInstrFromMaps(MI);
-  MI.eraseFromParent();
-
-  LIS->InsertMachineInstrInMaps(*OrSaveExec);
-  LIS->InsertMachineInstrInMaps(*And);
-
-  LIS->InsertMachineInstrInMaps(*Xor);
-  LIS->InsertMachineInstrInMaps(*Branch);
-
-  RecomputeRegs.insert(SrcReg);
-  RecomputeRegs.insert(DstReg);
-  LIS->createAndComputeVirtRegInterval(SaveReg);
-
-  // Let this be recomputed.
-  LIS->removeAllRegUnitsForPhysReg(AMDGPU::EXEC);
+  Register InvCondReg = MI.getOperand(0).getReg();
+  Register CondReg = MI.getOperand(1).getReg();
+  emitWaveDiverge(MI, CondReg, InvCondReg);
 }
 
 void SILowerControlFlow::emitIfBreak(MachineInstr &MI) {
@@ -425,141 +255,137 @@ void SILowerControlFlow::emitLoop(MachineInstr &MI) {
   MachineBasicBlock &MBB = *MI.getParent();
   const DebugLoc &DL = MI.getDebugLoc();
 
-  MachineInstr *AndN2 =
-      BuildMI(MBB, &MI, DL, TII->get(Andn2TermOpc), Exec)
-          .addReg(Exec)
-          .add(MI.getOperand(0));
+  Register Cond = MI.getOperand(0).getReg();
+  Register MaskLoop = MRI->createVirtualRegister(BoolRC);
+  Register MaskExit = MRI->createVirtualRegister(BoolRC);
+  Register AndZero = MRI->createVirtualRegister(BoolRC);
+  MachineInstr *CondLoop = BuildMI(MBB, &MI, DL, TII->get(XorOpc), MaskLoop)
+                                   .addReg(Cond)
+                                   .addReg(Exec);
+
+  MachineInstr *ExitExec = BuildMI(MBB, &MI, DL, TII->get(OrOpc), MaskExit)
+                               .addReg(Cond)
+                               .addReg(Exec);
+
+  MachineInstr *IfZeroMask = BuildMI(MBB, &MI, DL, TII->get(AndOpc), AndZero)
+                                 .addReg(MaskLoop)
+                                 .addImm(TestMask);
+
+  MachineInstr *SetExec= BuildMI(MBB, &MI, DL, TII->get(Select), Exec)
+                                     .addReg(MaskLoop)
+                                     .addReg(MaskExit);
+
   if (LV)
-    LV->replaceKillInstruction(MI.getOperand(0).getReg(), MI, *AndN2);
+    LV->replaceKillInstruction(MI.getOperand(0).getReg(), MI, *SetExec);
 
   auto BranchPt = skipToUncondBrOrEnd(MBB, MI.getIterator());
   MachineInstr *Branch =
-      BuildMI(MBB, BranchPt, DL, TII->get(AMDGPU::S_CBRANCH_EXECNZ))
+      BuildMI(MBB, BranchPt, DL, TII->get(AMDGPU::S_CBRANCH_SCC1))
           .add(MI.getOperand(1));
 
   if (LIS) {
     RecomputeRegs.insert(MI.getOperand(0).getReg());
-    LIS->ReplaceMachineInstrInMaps(MI, *AndN2);
+    LIS->ReplaceMachineInstrInMaps(MI, *SetExec);
+    LIS->InsertMachineInstrInMaps(*CondLoop);
+    LIS->InsertMachineInstrInMaps(*IfZeroMask);
+    LIS->InsertMachineInstrInMaps(*ExitExec);
     LIS->InsertMachineInstrInMaps(*Branch);
+    LIS->createAndComputeVirtRegInterval(MaskLoop);
+    LIS->createAndComputeVirtRegInterval(MaskExit);
+    LIS->createAndComputeVirtRegInterval(AndZero);
   }
 
   MI.eraseFromParent();
 }
 
-MachineBasicBlock::iterator
-SILowerControlFlow::skipIgnoreExecInstsTrivialSucc(
-  MachineBasicBlock &MBB, MachineBasicBlock::iterator It) const {
+void SILowerControlFlow::emitWaveDiverge(MachineInstr &MI,
+                                         Register EnabledLanesMask,
+                                         Register DisableLanesMask) {
+  MachineBasicBlock &MBB = *MI.getParent();
+  const DebugLoc &DL = MI.getDebugLoc();
+  MachineBasicBlock::iterator I(MI);
 
-  SmallSet<const MachineBasicBlock *, 4> Visited;
-  MachineBasicBlock *B = &MBB;
-  do {
-    if (!Visited.insert(B).second)
-      return MBB.end();
+  MachineInstr *CondInverted =
+      BuildMI(MBB, I, DL, TII->get(XorOpc), DisableLanesMask)
+          .addReg(EnabledLanesMask)
+          .addReg(Exec);
 
-    auto E = B->end();
-    for ( ; It != E; ++It) {
-      if (TII->mayReadEXEC(*MRI, *It))
+  if (LV) {
+    LV->replaceKillInstruction(DisableLanesMask, MI, *CondInverted);
+  }
+
+  Register TestResultReg = MRI->createVirtualRegister(BoolRC);
+  MachineInstr *IfZeroMask =
+      BuildMI(MBB, I, DL, TII->get(AndOpc), TestResultReg)
+          .addReg(EnabledLanesMask)
+          .addImm(TestMask);
+
+  Ma...
[truncated]
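
Although the dump is cut off here, the visible parts of emitIf, emitLoop, and emitWaveDiverge are plain mask arithmetic, sketched below in self-contained C++ with the wave modeled as a 64-bit exec mask. TestMask is assumed to be the all-ones mask for the wave size, and the final exec select in emitWaveDiverge is inferred from the IfZeroMask/Select pattern above; treat this as a hedged reading of the patch, not the definitive lowering.

#include <cassert>
#include <cstdint>

int main() {
  uint64_t Exec = 0xFF; // 8 live lanes entering the region

  // emitIf: AND filters garbage bits out of Cond; XOR derives the else mask.
  uint64_t Cond = 0x1F3;               // may carry bits outside Exec
  uint64_t MaskThen = Cond & Exec;     // lanes taking the 'then' side
  uint64_t MaskElse = MaskThen ^ Exec; // remaining lanes, for the 'else' side
  assert((MaskThen | MaskElse) == Exec && (MaskThen & MaskElse) == 0);

  // emitLoop: the SI_LOOP operand accumulates lanes that want to break out.
  uint64_t Break = 0x0F;
  uint64_t MaskLoop = Break ^ Exec;   // lanes that keep iterating
  uint64_t MaskExit = Break | Exec;   // lanes re-enabled once the loop exits
  bool SCC = (MaskLoop & ~0ull) != 0; // IfZeroMask: is any lane still looping?
  Exec = SCC ? MaskLoop : MaskExit;   // Select, then S_CBRANCH_SCC1 %header
  assert(SCC && Exec == 0xF0);
  return 0;
}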

@alex-t alex-t requested a review from vpykhtin March 27, 2024 19:26

@arsenm arsenm left a comment


Can you elaborate more in the high level description about what this does and how this solves the underlying issue? Does it simply move towards only using exec modifications glued to the terminator?

Do we still need to place more temporary blocks for exec manipulations in some cases?

@@ -306,38 +306,26 @@ bool SIAnnotateControlFlow::handleLoop(BranchInst *Term) {

/// Close the last opened control flow
bool SIAnnotateControlFlow::closeControlFlow(BasicBlock *BB) {

Annotate part should be done independently

@@ -82,6 +82,9 @@ class SILowerControlFlow : public MachineFunctionPass {
SmallSet<Register, 8> RecomputeRegs;

const TargetRegisterClass *BoolRC = nullptr;
long unsigned TestMask;

uint64_t

llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp (outdated, resolved)
Comment on lines +1 to +3
; XFAIL: *
; XFAIL: *
; XFAIL: *

Can't just break this test

llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp (outdated, resolved)
}
}
bool SIAnnotateControlFlow::insertWaveReconverge(BasicBlock *BB) {
assert(succ_empty(BB) || succ_size(BB) == 1);

Remove the empty assert and sink the == 2 case down below the if (empty)

llvm/lib/Target/AMDGPU/SIISelLowering.cpp (outdated, resolved)
// ISel inserts copies to registers for the successor PHIs at the BB end.
// We need to move the SI_WAVE_RECONVERGE right before the branch. Even if we
// don't have to move the SI_WAVE_RECONVERGE, we need to take care of
// S_CBRANCH_SCC0/1, as SI_WAVE_RECONVERGE overwrites SCC

This should have been emitted as a copy pair to a temp vreg?

Contributor Author

Could you please clarify your question?


alex-t commented Apr 17, 2024

Thanks for the feedback and advice.
The code here is far out of date; I am currently fixing some correctness issues and will update the PR as soon as they are fixed.

…converge at the end of the predecessor block.
…converge at the end of the predecessor block.
…converge at the end of the predecessor block. Tests updated. Else exit mask fixed. AndTerm changed to usual And in emitLoop
…converge at the end of the predecessor block. si_end_cf intrinsic and opcode changed to si_wave_reconverge. Restoring the exec mask on Else fix reverted and changed to another approach: placing the si_wave_reconverge in any predecessors of the Stack top in SIAnnotateControlFlow.
…op inside the divergent CF in SIAnnotateControlFlow

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff 82c5d350d200ccc5365d40eac187b9ec967af727 a96acb5340f4c6d2ef1884eb2ce374b2a28081db -- llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp llvm/lib/Target/AMDGPU/SIISelLowering.cpp llvm/lib/Target/AMDGPU/SIInstrInfo.cpp llvm/lib/Target/AMDGPU/SILowerControlFlow.cpp llvm/lib/Target/AMDGPU/SIOptimizeExecMasking.cpp
View the diff from clang-format here.
diff --git a/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp b/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp
index 68d81a6ffa..8e909e5afb 100644
--- a/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp
+++ b/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp
@@ -13,9 +13,9 @@
 
 #include "AMDGPU.h"
 #include "GCNSubtarget.h"
+#include "llvm/Analysis/DomTreeUpdater.h"
 #include "llvm/Analysis/LoopInfo.h"
 #include "llvm/Analysis/UniformityAnalysis.h"
-#include "llvm/Analysis/DomTreeUpdater.h"
 #include "llvm/CodeGen/TargetPassConfig.h"
 #include "llvm/IR/BasicBlock.h"
 #include "llvm/IR/Constants.h"
@@ -142,7 +142,8 @@ void SIAnnotateControlFlow::initialize(Module &M, const GCNSubtarget &ST) {
   IfBreak = Intrinsic::getDeclaration(&M, Intrinsic::amdgcn_if_break,
                                       { IntMask });
   Loop = Intrinsic::getDeclaration(&M, Intrinsic::amdgcn_loop, { IntMask });
-  WaveReconverge = Intrinsic::getDeclaration(&M, Intrinsic::amdgcn_wave_reconverge, { IntMask });
+  WaveReconverge = Intrinsic::getDeclaration(
+      &M, Intrinsic::amdgcn_wave_reconverge, {IntMask});
 }
 
 /// Is the branch condition uniform or did the StructurizeCFG pass
@@ -331,14 +332,14 @@ bool SIAnnotateControlFlow::tryWaveReconverge(BasicBlock *BB) {
     for (auto Succ : Term->successors()) {
       if (isTopOfStack(Succ)) {
         // Just split to make a room for further WAVE_RECONVERGE insertion
-        SmallVector<BasicBlock*, 2> Preds;
+        SmallVector<BasicBlock *, 2> Preds;
         for (auto P : predecessors(Succ)) {
           if (DT->dominates(BB, P))
             Preds.push_back(P);
         }
         DomTreeUpdater DTU(DT, DomTreeUpdater::UpdateStrategy::Eager);
-        SplitBlockPredecessors(Succ, Preds, ".reconverge", &DTU, LI,
-                                            nullptr, false);
+        SplitBlockPredecessors(Succ, Preds, ".reconverge", &DTU, LI, nullptr,
+                               false);
       }
     }
   }
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index ea1e7c782e..b3984d4124 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -9941,8 +9941,9 @@ SDValue SITargetLowering::LowerINTRINSIC_VOID(SDValue Op,
     return SDValue(Load, 0);
   }
   case Intrinsic::amdgcn_wave_reconverge:
-    return SDValue(DAG.getMachineNode(AMDGPU::SI_WAVE_RECONVERGE, DL, MVT::Other,
-                                      Op->getOperand(2), Chain), 0);
+    return SDValue(DAG.getMachineNode(AMDGPU::SI_WAVE_RECONVERGE, DL,
+                                      MVT::Other, Op->getOperand(2), Chain),
+                   0);
   case Intrinsic::amdgcn_s_barrier_init:
   case Intrinsic::amdgcn_s_barrier_join:
   case Intrinsic::amdgcn_s_wakeup_barrier: {
diff --git a/llvm/lib/Target/AMDGPU/SILowerControlFlow.cpp b/llvm/lib/Target/AMDGPU/SILowerControlFlow.cpp
index 15f1c776cd..216cf963ec 100644
--- a/llvm/lib/Target/AMDGPU/SILowerControlFlow.cpp
+++ b/llvm/lib/Target/AMDGPU/SILowerControlFlow.cpp
@@ -214,9 +214,8 @@ void SILowerControlFlow::emitIf(MachineInstr &MI) {
   // Get rid of the garbage bits in the Cond register which might be coming from
   // the bitwise arithmetic when one of the expression operands is coming from
   // the outer scope and hence having extra bits set.
-  MachineInstr *CondFiltered = BuildMI(MBB, I, DL, TII->get(AndOpc), MaskThen)
-                                   .add(Cond)
-                                   .addReg(Exec);
+  MachineInstr *CondFiltered =
+      BuildMI(MBB, I, DL, TII->get(AndOpc), MaskThen).add(Cond).addReg(Exec);
   if (LV)
     LV->replaceKillInstruction(CondReg, MI, *CondFiltered);
 
@@ -304,9 +303,9 @@ void SILowerControlFlow::emitLoop(MachineInstr &MI) {
                                  .addReg(MaskLoop)
                                  .addImm(TestMask);
 
-  MachineInstr *SetExec= BuildMI(MBB, &MI, DL, TII->get(Select), Exec)
-                                     .addReg(MaskLoop)
-                                     .addReg(Cond);
+  MachineInstr *SetExec = BuildMI(MBB, &MI, DL, TII->get(Select), Exec)
+                              .addReg(MaskLoop)
+                              .addReg(Cond);
 
   if (LV)
     LV->replaceKillInstruction(MI.getOperand(0).getReg(), MI, *SetExec);
@@ -371,7 +370,7 @@ void SILowerControlFlow::emitWaveDiverge(MachineInstr &MI,
     MachineInstr *CopyExec =
         BuildMI(MBB, I, DL, TII->get(AMDGPU::COPY), DisableLanesMask)
             .addReg(Exec);
-    if(LIS)
+    if (LIS)
       LIS->InsertMachineInstrInMaps(*CopyExec);
   }
   Register TestResultReg = MRI->createVirtualRegister(BoolRC);
@@ -385,7 +384,7 @@ void SILowerControlFlow::emitWaveDiverge(MachineInstr &MI,
 
   MachineBasicBlock *FlowBB = MI.getOperand(2).getMBB();
   MachineBasicBlock *TargetBB = nullptr;
-    // determine target BBs
+  // determine target BBs
   I = skipToUncondBrOrEnd(MBB, I);
   if (I != MBB.end()) {
     // skipToUncondBrOrEnd returns either unconditional branch or end()
@@ -416,7 +415,7 @@ void SILowerControlFlow::emitWaveDiverge(MachineInstr &MI,
     return;
   }
 
-    LIS->InsertMachineInstrInMaps(*IfZeroMask);
+  LIS->InsertMachineInstrInMaps(*IfZeroMask);
   LIS->ReplaceMachineInstrInMaps(MI, *SetExecForSucc);
 
   RecomputeRegs.insert(MI.getOperand(0).getReg());
@@ -429,7 +428,7 @@ void SILowerControlFlow::emitWaveDiverge(MachineInstr &MI,
   LIS->removeAllRegUnitsForPhysReg(Exec);
 }
 
-void SILowerControlFlow::emitWaveReconverge(MachineInstr &MI) { 
+void SILowerControlFlow::emitWaveReconverge(MachineInstr &MI) {
 
   MachineBasicBlock &BB = *MI.getParent();
   Register Mask = MI.getOperand(0).getReg();


alex-t commented May 20, 2024

This is going to be closed, as there have been too many changes.
The new PR is #92809.

@alex-t alex-t closed this May 20, 2024