[AMDGPU]: Rewrite mbcnt_lo/mbcnt_hi to work item ID where applicable #160496

TejaX-Alaghari · 2025-09-24T10:48:02Z

This PR rewrites mbcnt_hi(-1, mbcnt_lo(-1, 0)) lane id computation pattern to llvm.amdgcn.workitem.idx() when reqd_work_group_size proves X == wave

This adds a conservative AMDGPUCodeGenPrepare optimization for converting pattern mbcnt.hi(~0, mbcnt.lo(~0, 0)) to workitem.id.x(). The transformation is applied when required work group size equals the target wavefront size.

Tests:

llvm/test/Transforms/AMDGPU/mbcnt-to-bitmask-neg.ll
llvm/test/Transforms/AMDGPU/mbcnt-to-bitmask-posit.ll
llvm/test/Transforms/AMDGPU/mbcnt-to-workitem-neg.ll
llvm/test/Transforms/AMDGPU/mbcnt-to-workitem-posit.ll
llvm/test/Transforms/AMDGPU/mbcnt-to-workitem-wave32-neg.ll
llvm/test/Transforms/AMDGPU/mbcnt-to-workitem-wave32.ll

github-actions · 2025-09-24T10:48:18Z

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

llvmbot · 2025-09-24T10:48:51Z

@llvm/pr-subscribers-backend-amdgpu

@llvm/pr-subscribers-llvm-transforms

Author: Teja Alaghari (TejaX-Alaghari)

Changes

This PR rewrites mbcnt_hi(-1, mbcnt_lo(-1, 0)) lane id computation pattern to llvm.amdgcn.workitem.idx() when reqd_work_group_size proves X==wave

This adds a conservative InstCombine peephole handling the exact pattern mbcnt.hi(~0, mbcnt.lo(~0, 0)). The transformation is applied when required work group size equals the target wavefront size.

Tests:

llvm/test/Transforms/InstCombine/AMDGPU/mbcnt-to-workitem-posit.ll
llvm/test/Transforms/InstCombine/AMDGPU/mbcnt-to-bitmask-posit.ll
llvm/test/Transforms/InstCombine/AMDGPU/mbcnt-to-workitem-neg.ll
llvm/test/Transforms/InstCombine/AMDGPU/mbcnt-to-bitmask-neg.ll

Full diff: https://github.com/llvm/llvm-project/pull/160496.diff

5 Files Affected:

(modified) llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp (+122-3)
(added) llvm/test/Transforms/InstCombine/AMDGPU/mbcnt-to-bitmask-neg.ll (+18)
(added) llvm/test/Transforms/InstCombine/AMDGPU/mbcnt-to-bitmask-posit.ll (+20)
(added) llvm/test/Transforms/InstCombine/AMDGPU/mbcnt-to-workitem-neg.ll (+16)
(added) llvm/test/Transforms/InstCombine/AMDGPU/mbcnt-to-workitem-posit.ll (+18)

diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp b/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
index 4fe5d00679436..509e2b019224f 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
@@ -15,6 +15,7 @@
 //===----------------------------------------------------------------------===//
 
 #include "AMDGPUInstrInfo.h"
+#include "AMDGPUSubtarget.h"
 #include "AMDGPUTargetTransformInfo.h"
 #include "GCNSubtarget.h"
 #include "llvm/ADT/FloatingPointMode.h"
@@ -28,6 +29,10 @@ using namespace llvm::PatternMatch;
 
 #define DEBUG_TYPE "AMDGPUtti"
 
+// Common wavefront sizes used in several conservative checks below.
+static constexpr unsigned WavefrontSize32 = 32u;
+static constexpr unsigned WavefrontSize64 = 64u;
+
 namespace {
 
 struct AMDGPUImageDMaskIntrinsic {
@@ -1312,9 +1317,122 @@ GCNTTIImpl::instCombineIntrinsic(InstCombiner &IC, IntrinsicInst &II) const {
     break;
   }
   case Intrinsic::amdgcn_mbcnt_hi: {
-    // exec_hi is all 0, so this is just a copy.
-    if (ST->isWave32())
+    // exec_hi is all 0, so this is just a copy on wave32.
+    if (ST && ST->isWave32())
       return IC.replaceInstUsesWith(II, II.getArgOperand(1));
+
+    // Pattern: mbcnt.hi(~0, mbcnt.lo(~0, 0))
+    if (auto *HiArg1 = dyn_cast<CallInst>(II.getArgOperand(1))) {
+      Function *CalledF = HiArg1->getCalledFunction();
+      bool IsMbcntLo = false;
+      if (CalledF) {
+        // Fast-path: if this is a declared intrinsic, check the intrinsic ID.
+        if (CalledF->getIntrinsicID() == Intrinsic::amdgcn_mbcnt_lo) {
+          IsMbcntLo = true;
+        } else {
+          // Fallback: accept a declared function with the canonical name, but
+          // verify its signature to be safe: i32(i32,i32). Use the name
+          // comparison only when there's no intrinsic ID match.
+          if (CalledF->getName() == "llvm.amdgcn.mbcnt.lo") {
+            if (FunctionType *FT = CalledF->getFunctionType()) {
+              if (FT->getNumParams() == 2 &&
+                  FT->getReturnType()->isIntegerTy(32) &&
+                  FT->getParamType(0)->isIntegerTy(32) &&
+                  FT->getParamType(1)->isIntegerTy(32))
+                IsMbcntLo = true;
+            }
+          }
+        }
+      }
+
+      if (!IsMbcntLo)
+        break;
+
+      // hi arg0 must be all-ones
+      if (auto *HiArg0C = dyn_cast<ConstantInt>(II.getArgOperand(0))) {
+        if (!HiArg0C->isAllOnesValue())
+          break;
+      } else
+        break;
+
+      // lo args: arg0 == ~0, arg1 == 0
+      Value *Lo0 = HiArg1->getArgOperand(0);
+      Value *Lo1 = HiArg1->getArgOperand(1);
+      auto *Lo0C = dyn_cast<ConstantInt>(Lo0);
+      auto *Lo1C = dyn_cast<ConstantInt>(Lo1);
+      if (!Lo0C || !Lo1C)
+        break;
+      if (!Lo0C->isAllOnesValue() || !Lo1C->isZero())
+        break;
+
+      // Query reqd_work_group_size via subtarget helper and compare X to wave
+      // size conservatively.
+      if (Function *F = II.getFunction()) {
+        unsigned Wave = 0;
+        if (ST && ST->isWaveSizeKnown())
+          Wave = ST->getWavefrontSize();
+
+        if (ST) {
+          if (auto MaybeX = ST->getReqdWorkGroupSize(*F, 0)) {
+            unsigned XLen = *MaybeX;
+            if (Wave == 0 && (XLen == WavefrontSize32 ||
+                              XLen == WavefrontSize64))
+              Wave = XLen; // allow common sizes under test harness
+
+            if (Wave != 0 && XLen == Wave) {
+              SmallVector<Type *, 0> OverloadTys;
+              CallInst *NewCall = IC.Builder.CreateIntrinsic(
+                  Intrinsic::amdgcn_workitem_id_x, OverloadTys, {});
+              NewCall->takeName(&II);
+              // Attach range metadata when available.
+              ST->makeLIDRangeMetadata(NewCall);
+              return IC.replaceInstUsesWith(II, NewCall);
+            }
+            // Optional: if X dimension evenly splits into wavefronts we can
+            // replace lane-id computation with a bitmask when the wave is a
+            // power-of-two. Use the Subtarget helper to conservatively decide
+            // when per-wave tiling is preserved.
+        if (ST->hasWavefrontsEvenlySplittingXDim(
+          *F, /*RequiresUniformYZ=*/true)) {
+        if (Wave != 0 && isPowerOf2_32(Wave)) {
+                // Construct: tid = workitem.id.x(); mask = Wave-1; res = tid &
+                // mask
+                SmallVector<Type *, 0> OverloadTys;
+                CallInst *Tid = IC.Builder.CreateIntrinsic(
+                    Intrinsic::amdgcn_workitem_id_x, OverloadTys, {});
+                Tid->takeName(&II);
+                IntegerType *ITy = cast<IntegerType>(Tid->getType());
+                Constant *Mask = ConstantInt::get(ITy, Wave - 1);
+                Instruction *AndInst =
+                    cast<Instruction>(IC.Builder.CreateAnd(Tid, Mask));
+                AndInst->takeName(&II);
+                // Attach range metadata for the result if possible.
+                ST->makeLIDRangeMetadata(AndInst);
+                return IC.replaceInstUsesWith(II, AndInst);
+              }
+            }
+          }
+        } else {
+          // No ST: be conservative and only handle the common test harness
+          // cases where reqd_work_group_size metadata exists and equals
+          // 32/64.
+          if (auto *Node = F->getMetadata("reqd_work_group_size")) {
+            if (Node->getNumOperands() == 3) {
+              unsigned XLen = mdconst::extract<ConstantInt>(Node->getOperand(0))
+                                  ->getZExtValue();
+              if (XLen == WavefrontSize32 || XLen == WavefrontSize64) {
+                SmallVector<Type *, 0> OverloadTys;
+                CallInst *NewCall = IC.Builder.CreateIntrinsic(
+                    Intrinsic::amdgcn_workitem_id_x, OverloadTys, {});
+                NewCall->takeName(&II);
+                return IC.replaceInstUsesWith(II, NewCall);
+              }
+            }
+          }
+        }
+      }
+    }
+
     break;
   }
   case Intrinsic::amdgcn_ballot: {
@@ -1328,7 +1446,8 @@ GCNTTIImpl::instCombineIntrinsic(InstCombiner &IC, IntrinsicInst &II) const {
         return IC.replaceInstUsesWith(II, Constant::getNullValue(II.getType()));
       }
     }
-    if (ST->isWave32() && II.getType()->getIntegerBitWidth() == 64) {
+    if (ST->isWave32() &&
+        II.getType()->getIntegerBitWidth() == WavefrontSize64) {
       // %b64 = call i64 ballot.i64(...)
       // =>
       // %b32 = call i32 ballot.i32(...)
diff --git a/llvm/test/Transforms/InstCombine/AMDGPU/mbcnt-to-bitmask-neg.ll b/llvm/test/Transforms/InstCombine/AMDGPU/mbcnt-to-bitmask-neg.ll
new file mode 100644
index 0000000000000..0313f284e5775
--- /dev/null
+++ b/llvm/test/Transforms/InstCombine/AMDGPU/mbcnt-to-bitmask-neg.ll
@@ -0,0 +1,18 @@
+; RUN: opt -S -mtriple=amdgcn-unknown-amdhsa -passes=instcombine < %s | FileCheck %s
+; CHECK-NOT: and i32
+; CHECK-NOT: @llvm.amdgcn.workitem.id.x()
+
+; ModuleID = 'mbcnt_to_bitmask_neg'
+
+define i32 @kernel() !reqd_work_group_size !1 {
+entry:
+  %a = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
+  %b = call i32 @llvm.amdgcn.mbcnt.hi(i32 -1, i32 %a)
+  ret i32 %b
+}
+
+!1 = !{i32 48, i32 1, i32 1}
+
+; Declarations
+declare i32 @llvm.amdgcn.mbcnt.lo(i32, i32)
+declare i32 @llvm.amdgcn.mbcnt.hi(i32, i32)
diff --git a/llvm/test/Transforms/InstCombine/AMDGPU/mbcnt-to-bitmask-posit.ll b/llvm/test/Transforms/InstCombine/AMDGPU/mbcnt-to-bitmask-posit.ll
new file mode 100644
index 0000000000000..b87913edc8805
--- /dev/null
+++ b/llvm/test/Transforms/InstCombine/AMDGPU/mbcnt-to-bitmask-posit.ll
@@ -0,0 +1,20 @@
+; RUN: opt -S -mtriple=amdgcn-unknown-amdhsa -passes=instcombine < %s | FileCheck %s
+; CHECK: @llvm.amdgcn.workitem.id.x()
+; CHECK-NOT: call i32 @llvm.amdgcn.mbcnt.hi
+; CHECK-NOT: call i32 @llvm.amdgcn.mbcnt.lo
+
+; ModuleID = 'mbcnt_to_bitmask_posit'
+
+define i32 @kernel() !reqd_work_group_size !1 {
+entry:
+  %a = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
+  %b = call i32 @llvm.amdgcn.mbcnt.hi(i32 -1, i32 %a)
+  ret i32 %b
+}
+
+!1 = !{i32 64, i32 1, i32 1}
+
+; Declarations
+declare i32 @llvm.amdgcn.mbcnt.lo(i32, i32)
+declare i32 @llvm.amdgcn.mbcnt.hi(i32, i32)
+declare i32 @llvm.amdgcn.workitem.id.x()
diff --git a/llvm/test/Transforms/InstCombine/AMDGPU/mbcnt-to-workitem-neg.ll b/llvm/test/Transforms/InstCombine/AMDGPU/mbcnt-to-workitem-neg.ll
new file mode 100644
index 0000000000000..1779b631be9f6
--- /dev/null
+++ b/llvm/test/Transforms/InstCombine/AMDGPU/mbcnt-to-workitem-neg.ll
@@ -0,0 +1,16 @@
+; RUN: opt -S -mtriple amdgcn-unknown-amdhsa -passes=instcombine < %s | FileCheck %s
+; CHECK: llvm.amdgcn.mbcnt.lo
+; CHECK: llvm.amdgcn.mbcnt.hi
+; CHECK-NOT: call i32 @llvm.amdgcn.workitem.id.x()
+
+define i32 @kernel() {
+entry:
+  %a = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
+  %b = call i32 @llvm.amdgcn.mbcnt.hi(i32 -1, i32 %a)
+  ret i32 %b
+}
+
+; Declarations
+declare i32 @llvm.amdgcn.mbcnt.lo(i32, i32)
+declare i32 @llvm.amdgcn.mbcnt.hi(i32, i32)
+declare i32 @llvm.amdgcn.workitem.id.x()
diff --git a/llvm/test/Transforms/InstCombine/AMDGPU/mbcnt-to-workitem-posit.ll b/llvm/test/Transforms/InstCombine/AMDGPU/mbcnt-to-workitem-posit.ll
new file mode 100644
index 0000000000000..d3d8d40b8359d
--- /dev/null
+++ b/llvm/test/Transforms/InstCombine/AMDGPU/mbcnt-to-workitem-posit.ll
@@ -0,0 +1,18 @@
+; RUN: opt -S -mtriple amdgcn-unknown-amdhsa -passes=instcombine < %s | FileCheck %s
+; CHECK-NOT: amdgcn.mbcnt_lo
+; CHECK-NOT: amdgcn.mbcnt_hi
+; CHECK: @llvm.amdgcn.workitem.id.x()
+
+define i32 @kernel() !reqd_work_group_size !0 {
+entry:
+  %a = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
+  %b = call i32 @llvm.amdgcn.mbcnt.hi(i32 -1, i32 %a)
+  ret i32 %b
+}
+
+!0 = !{i32 64, i32 1, i32 1}
+
+; Declarations
+declare i32 @llvm.amdgcn.mbcnt.lo(i32, i32)
+declare i32 @llvm.amdgcn.mbcnt.hi(i32, i32)
+declare i32 @llvm.amdgcn.workitem.id.x()

TejaX-Alaghari · 2025-09-24T10:59:32Z

@krzysz00, can you please take a look at this PR and let me know your thoughts on further improvements, thanks :)

llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp

github-actions · 2025-09-24T18:37:45Z

✅ With the latest revision this PR passed the C/C++ code formatter.

llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp

arsenm

A few issues that I think will prevent this from practically triggering:

I'm not sure instcombine is the right place for this. Is there an instcombine run after AMDGPU attributor?
Do we actually propagate reqd_work_group_size in attributor? It's an odd annotation since it's metadata like every other attribute, and I don't think we do (although we could)

llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp

llvm/test/Transforms/InstCombine/AMDGPU/mbcnt-to-bitmask-neg.ll

llvm/test/Transforms/InstCombine/AMDGPU/mbcnt-to-workitem-posit.ll

TejaX-Alaghari · 2025-09-29T11:16:42Z

Hey @arsenm,

I'm not sure instcombine is the right place for this. Is there an instcombine run after AMDGPU attributor?

You are absolutely correct! This was the fundamental issue with the original implementation. After analyzing the LLVM optimization pipeline, I discovered that:

InstCombine runs early and repeatedly throughout the O2 pipeline
AMDGPU Attributor runs at the very end of the pipeline
No InstCombine passes occur after AMDGPU Attributor

This pipeline timing meant the optimization could never practically trigger in real-world code, which is why I moved it to AMDGPUCodeGenPrepare. This pass:

Runs after AMDGPU Attributor in the pipeline
Has proper access to reqd_work_group_size metadata when it's needed
Is specifically designed for target-specific IR optimizations that depend on metadata

"Do we actually propagate reqd_work_group_size in attributor?"

Yes, the AMDGPU Attributor does handle reqd_work_group_size metadata, though you're right to note it's more complex than typical attributes:

How it works:

Clang CodeGen sets the metadata initially from source attributes:

// From CodeGenFunction.cpp  
Fn->setMetadata("reqd_work_group_size", llvm::MDNode::get(Context, AttrMDArgs));

AMDGPU Attributor processes and potentially refines this metadata during optimization
AMDGPUCodeGenPrepare accesses it via the subtarget helper:

if (auto MaybeX = ST.getReqdWorkGroupSize(*F, 0)) {
  // Optimization can now safely proceed
}

The key insight is that the metadata needs to be stable and available when the optimization runs, which is guaranteed in CodeGenPrepare but not in InstCombine.

Please let me know if you have further concerns or suggestions on the current implementation, thanks!

TejaX-Alaghari · 2025-09-29T11:19:24Z

@krzysz00 & @jayfoad, I've re-implemented the original mechanism in AMDGPUCodeGenPrepare now.

Can you please take a look again now and let me know your suggestions, thanks!

krzysz00

One note from me

krzysz00 · 2025-09-29T21:25:56Z

llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp

+  if (ST.isWaveSizeKnown())
+    Wave = ST.getWavefrontSize();
+
+  if (auto MaybeX = ST.getReqdWorkGroupSize(*F, 0)) {


I think we can still do the masking thing here? An x wavesize of 64 implies that the mbcnt is just thread_id_x & 0x1f?

Done

// Handle bitmask case: when X dimension evenly splits into waves // mbcnt.lo(~0, 0) = workitem.id.x() & (wave_size - 1) if (ST.hasWavefrontsEvenlySplittingXDim(*F, /*RequiresUniformYZ=*/true)) { if (Wave != 0 && isPowerOf2_32(Wave)) { // Creates: workitem.id.x() & (wave_size - 1) } }

cdevadas · 2025-09-30T14:55:00Z

The AMDGPUCodegenPrepare pass is scheduled only for the higher optimization levels and it isn't included in the -O0 pipeline. Isn't this patch relevant for -O0 compilation?

Nitpick:
Try to avoid the auto keyword.
Terminate all the comments with a period(.).

krzysz00 · 2025-09-30T15:08:38Z

The AMDGPUCodegenPrepare pass is scheduled only for the higher optimization levels and it isn't included in the -O0 pipeline. Isn't this patch relevant for -O0 compilation?

I'd argue that is the sort of "optimization" that -O0 can not do

krzysz00 · 2025-09-30T15:10:43Z

And I think all the autos in the PR are legitimate - they're generally appearing next to dyn_cast<> that already include the type

arsenm · 2025-10-03T06:17:17Z

And I think all the autos in the PR are legitimate - they're generally appearing next to dyn_cast<> that already include the type

There are ones that are not

arsenm · 2025-10-03T06:17:51Z

llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp

+  if (!F)
+    return false;


This cannot happen. You can also directly use the F member in AMDGPUCodeGenPrepareImpl

Replaced all getFunction() calls with the class member F directly.

// Before: if (auto MaybeX = ST.getReqdWorkGroupSize(*I.getFunction(), 0)) { // After: if (auto MaybeX = ST.getReqdWorkGroupSize(*F, 0)) {

arsenm · 2025-10-03T06:19:03Z

llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp

 INITIALIZE_PASS_END(AMDGPUCodeGenPrepare, DEBUG_TYPE, "AMDGPU IR optimizations",
                    false, false)

+bool AMDGPUCodeGenPrepareImpl::visitMbcntLo(IntrinsicInst &I) {


Suggested change

bool AMDGPUCodeGenPrepareImpl::visitMbcntLo(IntrinsicInst &I) {

bool AMDGPUCodeGenPrepareImpl::visitMbcntLo(IntrinsicInst &I) const {

@arsenm, I believe visitMbcntLo should not be const because they are transformation functions that directly modify the IR:

They call I.replaceAllUsesWith to remove instructions from the IR

They call I.eraseFromParent() to remove instructions from the IR

They create new instructions and modify the instruction stream

This follows the same pattern as other visitor functions in this class:

visitLoadInst, visitSelectInst, visitIntrinsicInst are all non-const transformation functions.

Only analysis/helper functions like visitFDivElement, replaceMulWithMul24, optimizeWithRcp are const (these create/return new values but don't modify existing IR)

arsenm · 2025-10-03T06:19:39Z

llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp

+    return false;
+
+  unsigned Wave = 0;
+  if (ST.isWaveSizeKnown())


This kind of shouldn't be possible, but should also just abort the transform if it's not known

Aborted if wave size is not known

arsenm · 2025-10-03T06:19:54Z

llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp

+  Function *F = I.getFunction();
+  if (!F)
+    return false;


arsenm · 2025-10-03T06:21:27Z

llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp

+        Instruction *AndInst = cast<Instruction>(B.CreateAnd(Tid, Mask));
+        AndInst->takeName(&I);
+        // Attach range metadata for the result if possible.
+        ST.makeLIDRangeMetadata(AndInst);


Range attribute is preferable these days, we probably should stop using the range metadata. Especially on an and instruction, I don't think it will do anything

Removed the range metadata attachment on the AND instruction.

// Before: BinaryOperator *And = BinaryOperator::CreateAnd(NewCall, MaskVal, "", &I); ST.makeLIDRangeMetadata(And); // Incorrect - don't set range metadata on AND // After: BinaryOperator *And = BinaryOperator::CreateAnd(NewCall, MaskVal, "", &I); // Removed the makeLIDRangeMetadata call - AND result doesn't need range metadata

arsenm · 2025-10-03T06:21:40Z

llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp

+    // No reqd_work_group_size metadata: be conservative and only handle the
+    // common test harness cases where reqd_work_group_size metadata exists
+    // and equals 32/64.
+    if (auto *Node = F->getMetadata("reqd_work_group_size")) {


const and no auto

Done

// Before: auto *Node = F->getMetadata("reqd_work_group_size") // After: const MDNode *Node = F.getMetadata("reqd_work_group_size");

…pare This addresses reviewer concerns about pipeline timing by moving the mbcnt optimization from InstCombine to AMDGPUCodeGenPrepare. The InstCombine pass runs before AMDGPU Attributor, which means reqd_work_group_size metadata may not be available. AMDGPUCodeGenPrepare runs later in the pipeline after the attributor pass, ensuring proper metadata availability. Changes: - Move visitMbcntLo and visitMbcntHi methods to AMDGPUCodeGenPrepare - Remove complex mbcnt optimization from AMDGPUInstCombineIntrinsic - Keep simple wave32 mbcnt_hi -> copy optimization in InstCombine - Move test files from InstCombine/AMDGPU to Transforms/AMDGPU - Update test RUN lines to use amdgpu-codegenprepare pass This fixes the pipeline ordering issue where InstCombine runs before AMDGPU Attributor, preventing the optimization from triggering when reqd_work_group_size metadata is set by the attributor.

Address reviewer feedback by extending visitMbcntLo to handle cases where X dimension doesn't exactly match wave size but allows even wave distribution. Key improvements: - mbcnt.lo(~0, 0) -> workitem.id.x() & (wave_size-1) when X dimension allows wavefronts to evenly split using ST.hasWavefrontsEvenlySplittingXDim() - Handles cases like X=48, X=64, X=128 on wave32 (previously unoptimized) - Added comprehensive test coverage for bitmask optimization cases - Updated existing test to reflect new correct behavior This addresses krzysz00's comment about supporting masking optimizations for non-exact wave size matches.

…data The wave32 optimization for mbcnt.hi was unconditionally optimizing the instruction on wave32 targets, which caused CodeGen test failures because the tests expected to see the mbcnt.hi instruction in the output. This change makes the wave32 mbcnt.hi optimization conditional on the same metadata requirements as the mbcnt.lo optimization, ensuring consistency and preserving existing test behavior for functions without the appropriate metadata.

@cdevadas

As requested by @cdevadas, all comments in the mbcnt optimization functions now end with a period for consistency with LLVM style. Note: The use of 'auto' keyword was defended by @krzysz00 as these appear next to dyn_cast<> which already includes the type information, following common LLVM patterns.

- Make visitMbcnt* functions const - Use class member F instead of getFunction() calls - Use auto* for pointer types and remove unnecessary null checks - Remove range metadata from AND instruction (only apply to intrinsic calls) - Simplify wave size logic by checking isWaveSizeKnown() upfront - Use direct metadata access with F.getMetadata() instead of F->getMetadata() - Add -mcpu=gfx906 to both test RUN lines to enable wave64 detection - Update CHECK expectations to account for function attributes added when specifying a CPU target All changes maintain the same optimization functionality while improving code quality and following LLVM coding conventions.

llvmbot added backend:AMDGPU llvm:instcombine Covers the InstCombine, InstSimplify and AggressiveInstCombine passes llvm:transforms labels Sep 24, 2025

chrisjbris reviewed Sep 24, 2025

View reviewed changes

llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp Outdated Show resolved Hide resolved

llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp Outdated Show resolved Hide resolved

jayfoad reviewed Sep 24, 2025

View reviewed changes

llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp Outdated Show resolved Hide resolved

jayfoad reviewed Sep 24, 2025

View reviewed changes

llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp Outdated Show resolved Hide resolved

krzysz00 reviewed Sep 24, 2025

View reviewed changes

llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp Outdated Show resolved Hide resolved

llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp Outdated Show resolved Hide resolved

arsenm reviewed Sep 24, 2025

View reviewed changes

TejaX-Alaghari force-pushed the mbcnt-to-workitem branch from ad90ea0 to c6a9ce3 Compare September 29, 2025 08:00

TejaX-Alaghari requested review from nikic and a team as code owners September 29, 2025 08:00

TejaX-Alaghari force-pushed the mbcnt-to-workitem branch 2 times, most recently from 1b6c6a8 to e89e1d7 Compare September 29, 2025 09:22

nikic removed request for a team and nikic September 29, 2025 09:54

TejaX-Alaghari force-pushed the mbcnt-to-workitem branch from e89e1d7 to 74731cd Compare September 29, 2025 10:48

TejaX-Alaghari requested review from krzysz00 and jayfoad September 29, 2025 11:49

krzysz00 reviewed Sep 29, 2025

View reviewed changes

TejaX-Alaghari force-pushed the mbcnt-to-workitem branch from 80c13e9 to c14e9ca Compare September 30, 2025 07:33

TejaX-Alaghari requested a review from krzysz00 September 30, 2025 07:42

TejaX-Alaghari force-pushed the mbcnt-to-workitem branch from c14e9ca to 28ebe9e Compare September 30, 2025 09:54

TejaX-Alaghari requested a review from chrisjbris September 30, 2025 10:31

TejaX-Alaghari force-pushed the mbcnt-to-workitem branch from 28ebe9e to ae399f7 Compare October 3, 2025 05:43

arsenm reviewed Oct 3, 2025

View reviewed changes

TejaX-Alaghari force-pushed the mbcnt-to-workitem branch from ae399f7 to bc39ac7 Compare October 3, 2025 07:17

TejaX-Alaghari added 5 commits October 3, 2025 13:18

TejaX-Alaghari force-pushed the mbcnt-to-workitem branch from bc39ac7 to 8fc6c7e Compare October 3, 2025 07:50

TejaX-Alaghari requested a review from arsenm October 3, 2025 08:03

	bool AMDGPUCodeGenPrepareImpl::visitMbcntLo(IntrinsicInst &I) {
	bool AMDGPUCodeGenPrepareImpl::visitMbcntLo(IntrinsicInst &I) const {

[AMDGPU]: Rewrite mbcnt_lo/mbcnt_hi to work item ID where applicable #160496

Are you sure you want to change the base?

[AMDGPU]: Rewrite mbcnt_lo/mbcnt_hi to work item ID where applicable #160496

Conversation

TejaX-Alaghari commented Sep 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Sep 24, 2025

Uh oh!

llvmbot commented Sep 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

TejaX-Alaghari commented Sep 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions bot commented Sep 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

arsenm left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

TejaX-Alaghari commented Sep 29, 2025

Uh oh!

TejaX-Alaghari commented Sep 29, 2025

Uh oh!

krzysz00 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

TejaX-Alaghari Sep 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cdevadas commented Sep 30, 2025

Uh oh!

krzysz00 commented Sep 30, 2025

Uh oh!

krzysz00 commented Sep 30, 2025

Uh oh!

arsenm commented Oct 3, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

TejaX-Alaghari Oct 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

TejaX-Alaghari commented Sep 24, 2025 •

edited

Loading

llvmbot commented Sep 24, 2025 •

edited

Loading

TejaX-Alaghari commented Sep 24, 2025 •

edited

Loading

github-actions bot commented Sep 24, 2025 •

edited

Loading

TejaX-Alaghari Sep 30, 2025 •

edited

Loading

TejaX-Alaghari Oct 3, 2025 •

edited

Loading