
Conversation

@linuxrocks123
Contributor

This PR improves the lowering of abs(i16) and -abs(i16) on the AMDGPU target. It is written as an early Machine IR-level pass because the transformation is only profitable for SGPR values: there is no dedicated abs instruction for VGPRs, and whether a value lives in an SGPR or a VGPR is only known after ISel.

An earlier failed (correct but pessimizing) attempt that overrode expandABS at the DAG level is preserved in the Git history.
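
Concretely, the pattern being fixed looks roughly like this after ISel on the scalar path (illustrative register numbers, inferred from the instructions the pass matches and from the CHECK lines in the test below):

  s_sub_i32      s1, 0, s0        ; 0 - x
  s_sext_i32_i16 s2, s0           ; sext(x)
  s_sext_i32_i16 s1, s1           ; sext(0 - x)
  s_max_i32      s0, s2, s1       ; abs as max(sext(x), sext(0 - x))

and the pass rewrites it to:

  s_sext_i32_i16 s0, s0
  s_abs_i32      s0, s0

with a trailing s_sub_i32 from 0 kept for the -abs(i16) case.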

@llvmbot
Member

llvmbot commented Oct 29, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Patrick Simmons (linuxrocks123)

Changes

This PR improves the lowering of abs(i16) and -abs(i16) on the AMDGPU target. It is written as an early Machine IR-level pass because the transformation is only profitable for SGPR values: there is no dedicated abs instruction for VGPRs, and whether a value lives in an SGPR or a VGPR is only known after ISel.

An earlier failed (correct but pessimizing) attempt that overrode expandABS at the DAG level is preserved in the Git history.


Full diff: https://github.com/llvm/llvm-project/pull/165626.diff

4 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPU.h (+11)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp (+3)
  • (modified) llvm/lib/Target/AMDGPU/CMakeLists.txt (+1)
  • (added) llvm/test/CodeGen/AMDGPU/s_abs_i16.ll (+22)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.h b/llvm/lib/Target/AMDGPU/AMDGPU.h
index ce2b4a5f6f2e9..43a052b687109 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.h
@@ -39,6 +39,7 @@ FunctionPass *createSIAnnotateControlFlowLegacyPass();
 FunctionPass *createSIFoldOperandsLegacyPass();
 FunctionPass *createSIPeepholeSDWALegacyPass();
 FunctionPass *createSILowerI1CopiesLegacyPass();
+FunctionPass *createSISAbs16FixupLegacyPass();
 FunctionPass *createSIShrinkInstructionsLegacyPass();
 FunctionPass *createSILoadStoreOptimizerLegacyPass();
 FunctionPass *createSIWholeQuadModeLegacyPass();
@@ -93,6 +94,13 @@ class SILowerI1CopiesPass : public PassInfoMixin<SILowerI1CopiesPass> {
                         MachineFunctionAnalysisManager &MFAM);
 };
 
+class SISAbs16FixupPass : public PassInfoMixin<SISAbs16FixupPass> {
+public:
+  SISAbs16FixupPass() = default;
+  PreservedAnalyses run(MachineFunction &MF,
+                        MachineFunctionAnalysisManager &MFAM);
+};
+
 void initializeAMDGPUDAGToDAGISelLegacyPass(PassRegistry &);
 
 void initializeAMDGPUAlwaysInlinePass(PassRegistry&);
@@ -197,6 +205,9 @@ extern char &SILowerWWMCopiesLegacyID;
 void initializeSILowerI1CopiesLegacyPass(PassRegistry &);
 extern char &SILowerI1CopiesLegacyID;
 
+void initializeSISAbs16FixupLegacyPass(PassRegistry &);
+extern char &SISAbs16FixupLegacyID;
+
 void initializeAMDGPUGlobalISelDivergenceLoweringPass(PassRegistry &);
 extern char &AMDGPUGlobalISelDivergenceLoweringID;
 
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
index 996b55f42fd0b..90405fed8efdd 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
@@ -551,6 +551,7 @@ extern "C" LLVM_ABI LLVM_EXTERNAL_VISIBILITY void LLVMInitializeAMDGPUTarget() {
   initializeAMDGPUPrepareAGPRAllocLegacyPass(*PR);
   initializeGCNDPPCombineLegacyPass(*PR);
   initializeSILowerI1CopiesLegacyPass(*PR);
+  initializeSISAbs16FixupLegacyPass(*PR);
   initializeAMDGPUGlobalISelDivergenceLoweringPass(*PR);
   initializeAMDGPURegBankSelectPass(*PR);
   initializeAMDGPURegBankLegalizePass(*PR);
@@ -1517,6 +1518,7 @@ bool GCNPassConfig::addInstSelector() {
   AMDGPUPassConfig::addInstSelector();
   addPass(&SIFixSGPRCopiesLegacyID);
   addPass(createSILowerI1CopiesLegacyPass());
+  addPass(createSISAbs16FixupLegacyPass());
   return false;
 }
 
@@ -2209,6 +2211,7 @@ Error AMDGPUCodeGenPassBuilder::addInstSelector(AddMachinePass &addPass) const {
   addPass(AMDGPUISelDAGToDAGPass(TM));
   addPass(SIFixSGPRCopiesPass());
   addPass(SILowerI1CopiesPass());
+  addPass(SISAbs16FixupPass());
   return Error::success();
 }
 
diff --git a/llvm/lib/Target/AMDGPU/CMakeLists.txt b/llvm/lib/Target/AMDGPU/CMakeLists.txt
index a1e0e5293c706..cd9225acdb002 100644
--- a/llvm/lib/Target/AMDGPU/CMakeLists.txt
+++ b/llvm/lib/Target/AMDGPU/CMakeLists.txt
@@ -185,6 +185,7 @@ add_llvm_target(AMDGPUCodeGen
   SIPreEmitPeephole.cpp
   SIProgramInfo.cpp
   SIRegisterInfo.cpp
+  SISAbs16Fixup.cpp
   SIShrinkInstructions.cpp
   SIWholeQuadMode.cpp
 
diff --git a/llvm/test/CodeGen/AMDGPU/s_abs_i16.ll b/llvm/test/CodeGen/AMDGPU/s_abs_i16.ll
new file mode 100644
index 0000000000000..e61abb7173d78
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/s_abs_i16.ll
@@ -0,0 +1,22 @@
+; RUN: llc -mtriple=amdgcn-- -mcpu=gfx900 < %s | FileCheck %s
+
+define amdgpu_ps i16 @abs_i16(i16 inreg %arg) {
+; CHECK-LABEL: abs_i16:
+; CHECK: %bb.0:
+; CHECK-NEXT: s_sext_i32_i16 s0, s0
+; CHECK-NEXT: s_abs_i32 s0, s0
+
+  %res = call i16 @llvm.abs.i16(i16 %arg, i1 false)
+  ret i16 %res
+}
+
+define amdgpu_ps i16 @abs_i16_neg(i16 inreg %arg) {
+; CHECK-LABEL: abs_i16_neg:
+; CHECK: ; %bb.0:
+; CHECK-NEXT: s_sext_i32_i16 s0, s0
+; CHECK-NEXT: s_abs_i32 s0, s0
+; CHECK-NEXT: s_sub_i32 s0, 0, s0
+  %res1 = call i16 @llvm.abs.i16(i16 %arg, i1 false)
+  %res2 = sub i16 0, %res1
+  ret i16 %res2
+}
\ No newline at end of file

@jayfoad
Contributor

jayfoad commented Oct 30, 2025

The patch is missing the new file. Also, I would really hope we can find a way to do this without adding a dedicated pass.

@github-actions

github-actions bot commented Oct 30, 2025

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff origin/main HEAD --extensions cpp,h -- llvm/lib/Target/AMDGPU/SISAbs16Fixup.cpp llvm/lib/Target/AMDGPU/AMDGPU.h llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp --diff_from_common_commit

⚠️ The reproduction instructions above might return results for more than one PR in a stack if you are using a stacked PR workflow. You can limit the results by changing origin/main to the base branch/commit you want to compare against. ⚠️

View the diff from clang-format here.
diff --git a/llvm/lib/Target/AMDGPU/SISAbs16Fixup.cpp b/llvm/lib/Target/AMDGPU/SISAbs16Fixup.cpp
index fd305b6ff..ca9d314c0 100644
--- a/llvm/lib/Target/AMDGPU/SISAbs16Fixup.cpp
+++ b/llvm/lib/Target/AMDGPU/SISAbs16Fixup.cpp
@@ -24,7 +24,7 @@
 
 using namespace llvm;
 
-static Register pierceCopies(Register R, MachineRegisterInfo& MRI) {
+static Register pierceCopies(Register R, MachineRegisterInfo &MRI) {
   MachineInstr *CopyMI = MRI.getVRegDef(R);
   while (CopyMI && CopyMI->getOpcode() == AMDGPU::COPY) {
     Register T = CopyMI->getOperand(1).getReg();
@@ -41,25 +41,24 @@ static Register pierceCopies(Register R, MachineRegisterInfo& MRI) {
 static MachineInstr *matchExpandAbsPattern(MachineInstr &MI,
                                            MachineRegisterInfo &MRI) {
   std::array<MachineInstr *, 2> SextInstructions;
-  for (unsigned I = 0; I < SextInstructions.size(); I++)
-  {
+  for (unsigned I = 0; I < SextInstructions.size(); I++) {
     SextInstructions[I] = MRI.getVRegDef(MI.getOperand(I + 1).getReg());
     if (SextInstructions[I]->getOpcode() != AMDGPU::S_SEXT_I32_I16)
       return nullptr;
   }
 
   Register AbsSource;
-  MachineInstr* SubIns = nullptr;
+  MachineInstr *SubIns = nullptr;
   for (MachineInstr *SextMI : SextInstructions) {
     Register SextReg = SextMI->getOperand(1).getReg();
-    MachineInstr* OperandMI = MRI.getVRegDef(SextReg);
+    MachineInstr *OperandMI = MRI.getVRegDef(SextReg);
     if (OperandMI->getOpcode() == AMDGPU::S_SUB_I32)
-      if(!SubIns)
+      if (!SubIns)
         SubIns = OperandMI;
       else
         return nullptr;
     else
-      AbsSource = pierceCopies(SextReg,MRI);
+      AbsSource = pierceCopies(SextReg, MRI);
   }
 
   if (!SubIns)
@@ -85,12 +84,12 @@ static bool runSAbs16Fixup(MachineFunction &MF) {
   const SIInstrInfo &TII = *MF.getSubtarget<GCNSubtarget>().getInstrInfo();
 
   bool Changed = false;
-  
+
   for (MachineBasicBlock &MBB : MF)
     for (MachineInstr &MI : make_early_inc_range(MBB)) {
       bool IsPositive = MI.getOpcode() == AMDGPU::S_MAX_I32;
       bool IsNegative = MI.getOpcode() == AMDGPU::S_MIN_I32;
-      MachineInstr* AbsSourceMI;
+      MachineInstr *AbsSourceMI;
       if ((!IsPositive && !IsNegative) ||
           !(AbsSourceMI = matchExpandAbsPattern(MI, MRI)))
         continue;
@@ -107,7 +106,7 @@ static bool runSAbs16Fixup(MachineFunction &MF) {
       BuildMI(MBB, MI, MI.getDebugLoc(), TII.get(AMDGPU::S_ABS_I32), AbsDestReg)
           .addReg(SextDestReg);
 
-      if(IsNegative)
+      if (IsNegative)
         BuildMI(MBB, MI, MI.getDebugLoc(), TII.get(AMDGPU::S_SUB_I32),
                 MI.getOperand(0).getReg())
             .addImm(0)
@@ -154,10 +153,10 @@ bool SISAbs16FixupLegacy::runOnMachineFunction(MachineFunction &MF) {
   return runSAbs16Fixup(MF);
 }
 
-INITIALIZE_PASS_BEGIN(SISAbs16FixupLegacy, DEBUG_TYPE, "SI SAbs16 Fixup",
-                      false, false)
-INITIALIZE_PASS_END(SISAbs16FixupLegacy, DEBUG_TYPE, "SI SAbs16 Fixup",
-                    false, false)
+INITIALIZE_PASS_BEGIN(SISAbs16FixupLegacy, DEBUG_TYPE, "SI SAbs16 Fixup", false,
+                      false)
+INITIALIZE_PASS_END(SISAbs16FixupLegacy, DEBUG_TYPE, "SI SAbs16 Fixup", false,
+                    false)
 
 char SISAbs16FixupLegacy::ID = 0;
 

@linuxrocks123
Contributor Author

@jayfoad thanks, I added the new file.

To avoid adding a new pass, maybe we could combine this and SILowerI1Copies into a pass containing a collection of short optimizations, similar to AMDGPUCodeGenPrepare. I thought about doing that, but went with the default option of touching as little existing, unrelated code as possible.

By the way, this isn't done at the DAG level because the transformation pessimizes 16-bit abs on VGPRs if it runs on them, and we don't know the register classes in the DAG yet.

Contributor

@jmmartinez left a comment

To better see the improvement motivating this patch, could you squash the commits and split them in two: one that pre-commits the tests, and one that solves the issue?

@@ -0,0 +1,22 @@
; RUN: llc -mtriple=amdgcn-- -mcpu=gfx900 < %s | FileCheck %s
Contributor

Can you run update_llc_test_checks.py over this file? Please add a GISel test line too.
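
For reference, one possible shape for both requests (the GISEL prefix name is illustrative, and the script may need --llc-binary if the built llc is not found automatically):

  llvm/utils/update_llc_test_checks.py llvm/test/CodeGen/AMDGPU/s_abs_i16.ll

  ; RUN: llc -global-isel -mtriple=amdgcn-- -mcpu=gfx900 < %s | FileCheck -check-prefix=GISEL %s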

Contributor Author

Done

@LU-JOHN changed the title from "Improved AMDGPU Lowering of abs(i16) and -abs(i16)" to "[AMDGPU] Improved Lowering of abs(i16) and -abs(i16)" Oct 30, 2025
@linuxrocks123
Contributor Author

Hi @jmmartinez, thanks for reviewing this PR. Normally I would not have a problem squashing the commits as you suggested, but, in this case, the commit history contains two failed implementation attempts that I would like to preserve during the review process in case a method of fixing one of them is discovered. Would it be acceptable if I posted the master/branch assemblies of the s_cmp_0.ll testcase instead to facilitate seeing the improvement?

@jmmartinez
Contributor

Hi @jmmartinez, thanks for reviewing this PR. Normally I would not have a problem squashing the commits as you suggested, but, in this case, the commit history contains two failed implementation attempts that I would like to preserve during the review process in case a method of fixing one of them is discovered. Would it be acceptable if I posted the master/branch assemblies of the s_cmp_0.ll testcase instead to facilitate seeing the improvement?

No problem keeping the previous attempts, but at least seeing the difference in s_abs_i16 would be nice (and knowing what happens with GlobalISel).

@linuxrocks123
Contributor Author

@jmmartinez interestingly, the problem doesn't exist with GlobalISel enabled. The main branch generates the optimal instruction selection without this PR.

If we are planning to switch to GlobalISel soon, we may not need this.

@linuxrocks123
Contributor Author

linuxrocks123 commented Oct 31, 2025

@jmmartinez, for non-global-isel, here are the differing testcase files:

master.s.txt
branch.s.txt

@linuxrocks123
Contributor Author

@jmmartinez @jayfoad I think this is everything. If we're ready to merge this as a new pass, I'll update the llc-pipeline tests so they pass. If we'd rather combine this with an existing early MIR pass, such as SILowerI1Copies, please let me know, and I will do that instead.

Contributor

@arsenm left a comment

This doesn't really fit the usual strategy for dealing with this sort of issue (it's sort of backwards). The usual strategy is closer to treating the operation as legal and dealing with it later if it isn't.

There are a few options. What I suggest is to make ISD::ABS custom for i16. In the custom lowering, check whether the input is divergent and, if it is, return to use the default expansion.

That should get the right result most of the time. There's a possible edge case where SIFixVGPRCopies will need to deal with s_abs_i16 with SGPR inputs.
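
A minimal sketch of that check (an assumption, not code from this PR; the helper name lowerAbsI16 is hypothetical, and SDNode::isDivergent() is used as the divergence query):

  // Hypothetical helper, called from SITargetLowering::LowerOperation for
  // an i16 ISD::ABS node marked Custom.
  static SDValue lowerAbsI16(SDValue Op) {
    if (Op->isDivergent())
      return SDValue(); // divergent (VGPR) input: fall back to the default expansion
    return Op;          // uniform (SGPR) input: keep the node for scalar selection
  }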

@@ -0,0 +1,26 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
; RUN: llc -mtriple=amdgcn-- -mcpu=gfx900 < %s | FileCheck %s
Contributor

I would hope we wouldn't need a new test for this. It looks like the abs tests could use some gardening. test/CodeGen/AMDGPU/GlobalISel/llvm.abs.ll looks reasonably comprehensive; can you start by precommitting a move of that file up to the main directory and adding DAG run lines? It should then gain the negated abs cases.

(I also noticed the test is minorly broken, fixed by #165965)
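
For the moved file, the added SelectionDAG run lines might look something like this (check-prefix names are illustrative):

  ; RUN: llc -global-isel=0 -mtriple=amdgcn -mcpu=gfx900 < %s | FileCheck -check-prefixes=GFX9,GFX9-SDAG %s
  ; RUN: llc -global-isel=1 -mtriple=amdgcn -mcpu=gfx900 < %s | FileCheck -check-prefixes=GFX9,GFX9-GISEL %s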

Comment on lines +9 to +11
// This pass matches the pattern for 16-bit ABS instructions after they have
// been lowered for execution on the Scalar Unit.
//
Contributor

This is too specific for a dedicated pass

Contributor Author

Yeah I can merge it with the SII1Copies one (but first let's see if you can help me fix the earlier DAG-level attempt).

@@ -0,0 +1,168 @@
//===-- SISAbs16Fixup.cpp - Lower I1 Copies -----------------------------===//
Contributor

Copy paste error

@linuxrocks123
Contributor Author

This doesn't really fit the usual strategy for dealing with this sort of issue (it's sort of backwards). The usual strategy is closer to treating the operation as legal and dealing with it later if it isn't.

There are a few options. What I suggest is to make ISD::ABS custom for i16. In the custom lowering, check whether the input is divergent and, if it is, return to use the default expansion.

That should get the right result most of the time. There's a possible edge case where SIFixVGPRCopies will need to deal with s_abs_i16 with SGPR inputs.

@arsenm have you looked at b83ae56? I think that earlier failed attempt is close to what you were suggesting. Perhaps that earlier method could work better if I checked whether the input is divergent as you suggested, so how would I go about checking whether the input is divergent?

@linuxrocks123
Contributor Author

@arsenm also am I making the lowering "custom for i16" properly in that commit, or is there another way? It bugged me that I had to make that function virtual to override it since I thought, if this was the right way to go, some other target probably would have made it virtual by now.

@arsenm
Contributor

arsenm commented Nov 1, 2025

so how would I go about checking whether the input is divergent?

Like this

@arsenm also am I making the lowering "custom for i16" properly in that commit, or is there another way? It bugged me that I had to make that function virtual to override it since I thought, if this was the right way to go, some other target probably would have made it virtual by now.

No, you shouldn't need to make anything virtual. You should call setOperationAction(ISD::ABS, MVT::i16, Legal) under Subtarget->has16BitInsts() in the SITargetLowering constructor. In the handling in LowerOperation, you can defer to the default expansion (by returning either SDValue() or the original SDValue). One of them means "take it as directly legal" and the other means "use the default expansion", but I can never remember which is which.
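
A rough sketch of where those pieces would go (an assumption, not code from this PR; it uses Custom rather than Legal so that LowerOperation is actually invoked, per the earlier "make ISD::ABS custom for i16" suggestion, and reuses the hypothetical lowerAbsI16 helper sketched above):

  // SITargetLowering constructor:
  if (Subtarget->has16BitInsts())
    setOperationAction(ISD::ABS, MVT::i16, Custom);

  // SITargetLowering::LowerOperation:
  case ISD::ABS:
    // Intended: returning SDValue() falls back to the default expansion,
    // while returning the original Op treats the node as legal.
    return lowerAbsI16(Op);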
