
Conversation

linuxrocks123
Contributor

Addresses SWDEV-549227

@llvmbot
Member

llvmbot commented Aug 26, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Patrick Simmons (linuxrocks123)

Changes

Addresses SWDEV-549227


Full diff: https://github.com/llvm/llvm-project/pull/155491.diff

1 file affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp (+10)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp b/llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp
index dbe74b1b08f8c..8c514714bd7dd 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp
@@ -2391,6 +2391,16 @@ bool SchedGroup::canAddMI(const MachineInstr &MI) const {
   if (MI.isMetaInstruction())
     Result = false;
 
+  else if (MI.isInlineAsm()) {
+    std::string Text = MI.getOperand(0).getSymbolName();
+    if (Text.find("SGMASK:") != std::string::npos) {
+      Text = Text.substr(Text.find("SGMASK:") + strlen("SGMASK:"));
+      Text = Text.substr(0, Text.find_first_of(" \t\r\n"));
+      unsigned long InlineAsmMask = std::stoul(Text, nullptr, 0);
+      Result = ((unsigned long)SGMask & InlineAsmMask) != 0;
+    }
+  }
+
   else if (((SGMask & SchedGroupMask::ALU) != SchedGroupMask::NONE) &&
            (TII->isVALU(MI) || TII->isMFMAorWMMA(MI) || TII->isSALU(MI) ||
             TII->isTRANS(MI)))

@linuxrocks123
Contributor Author

This PR allows the user to specify sched group barrier information for inline asm instructions. The mechanism for doing so is adding a comment containing the text "SGMASK:" inside the inline asm string. The token following "SGMASK:" should be the mask of the sched barrier groups you want to group the inline asm with. For example, if your inline asm is a VALU instruction, you would write an inline asm statement that looks like this:

asm ("v_add_f32_e32 %[result], %[lhs], %[rhs] ; SGMASK:0x2"
     : [result] "=v"(result)
     : [lhs] "v"(lhs), [rhs] "v"(rhs));

Because of the "SGMASK:0x2", sched group barrier instructions will now be able to recognize that the inline asm instruction is a VALU instruction and schedule it accordingly.


arsenm (Contributor) left a comment:


Needs tests. Also, don't use C string functions; StringRef has everything on it directly.

I also don't love the idea of parsing something special out of inline assembly. The general way to handle inline asm is to answer yes for every question, so just if (isInlineAsm()) return true. If you want to be more refined than that, you can try to guess at the contents from the register constraints before dropping to scraping the asm contents.

@linuxrocks123
Contributor Author

linuxrocks123 commented Aug 28, 2025

@arsenm Thanks, I switched to using StringRef.

Regarding the parsing of the inline assembly: the goal of this PR is to let the user optionally specify the schedule group mask for a specific inline assembly instruction so that schedule group barriers work correctly with that instruction. Parsing a special token out of the inline assembly is the only reasonable mechanism I currently see for accomplishing that, but I am open to suggestions.


jplehr (Contributor) left a comment:


Same question for the other attributes and metadata.

Comment on lines +12 to +13
@llvm.compiler.used = appending addrspace(1) global [1 x ptr] [ptr addrspacecast (ptr addrspace(1) @__hip_cuid_bffb86447932ec40 to ptr)], section "llvm.metadata"
@__hip_cuid_bffb86447932ec40 = addrspace(1) global i8 0
Contributor

Is this needed for the test to work?

@arsenm
Contributor

arsenm commented Sep 4, 2025

> @arsenm Thanks, I switched to using StringRef.
>
> Regarding the parsing of the inline assembly: the goal of this PR is to let the user optionally specify the schedule group mask for a specific inline assembly instruction so that schedule group barriers work correctly with that instruction. Parsing a special token out of the inline assembly is the only reasonable mechanism I currently see for accomplishing that, but I am open to suggestions.

I am going to outright reject the concept of trying to parse the content of inline assembly. I am firmly in the camp of won't-fixing any performance issue involving inline assembly.

@arsenm
Contributor

arsenm commented Sep 5, 2025

#157080 also proposes codifying not doing things like this.

@linuxrocks123
Contributor Author

@arsenm, I don't think we give users enough alternative ways to control code generation that they could be expected never to reach for inline assembly as a tool to improve performance.

Instead of this, we could add __builtin_sched_group_barrier_override_sgmask() with the semantics that a nonzero argument will override the sgmask for any instructions found prior to it. A call with 0 could clear the custom mask. That way, we're both giving users more control and not directly adding any code to support inline assembly.

@arsenm
Contributor

arsenm commented Sep 6, 2025

> Instead of this, we could add __builtin_sched_group_barrier_override_sgmask() with the semantics that a nonzero argument will override the sgmask for any instructions found prior to it. A call with 0 could clear the custom mask. That way, we're both giving users more control and not directly adding any code to support inline assembly.

All of these built-ins are a hack, and that's adding a hack on top of the hack. All of this stuff is a development resource sink we have to maintain forever, and it's not worth it.

Did you try just interpreting inline asm as matching all classes of instructions, or filtering based on whether there are VGPR or SGPR operands?

@linuxrocks123
Contributor Author

>> Instead of this, we could add __builtin_sched_group_barrier_override_sgmask() with the semantics that a nonzero argument will override the sgmask for any instructions found prior to it. A call with 0 could clear the custom mask. That way, we're both giving users more control and not directly adding any code to support inline assembly.
>
> All of these built-ins are a hack, and that's adding a hack on top of the hack. All of this stuff is a development resource sink we have to maintain forever, and it's not worth it.
>
> Did you try just interpreting inline asm as matching all classes of instructions, or filtering based on whether there are VGPR or SGPR operands?

I haven't tried testing the inline assembly operands. I would be hesitant to do that because, although you know the architecture better than I do, I can't imagine every assembly instruction that uses a VGPR is a VALU instruction, and "there's wrong behavior" seems worse than "there's no way to express what I want."

The original user request was to have inline asm match only its own new sgmask bit, which is similar in expressiveness to having it match all classes of instructions. That approach also has the advantage of not changing the scheduling behavior of any existing code that uses sched group barriers, which matching all classes could do. I feel the current PR provides maximum expressiveness, but since the user was the one who suggested just adding a new sgmask bit for inline assembly, doing that would almost certainly be expressive enough for his needs and would avoid parsing the inline assembly string.

An implementation of the user's original request is in this PR's history at 6f45055. I can revert this PR back to that commit and then add the tests back if we decide to go that route.

@arsenm
Contributor

arsenm commented Sep 15, 2025

> I haven't tried testing the inline assembly operands. I would be hesitant to do that because, although you know the architecture better than I do, I can't imagine every assembly instruction that uses a VGPR is a VALU instruction,

This is the case. There are 0 SALU instructions that can read or write a VGPR. The converse is almost true too; VALU instructions cannot write SGPRs, with the special cases of boolean flags and readlane/writelane. But you don't need a perfect answer; it's still a scheduling heuristic decision.

> and "there's wrong behavior" seems worse than "there's no way to express what I want."

There cannot be wrong behavior in scheduling. The scheduling intrinsics can at most be interpreted as an optimization hint; optimizations cannot be semantics.

> I can revert this PR back to that commit and then add the tests back if we decide to go that route.

Inline asm really is not a tool for improving performance and should not be treated as such. Inline assembly is a hard break of the compiler's optimizations. If anything, we should be de-optimizing it harder (i.e., we could conservatively resolve more hazards inside inline assembly than we do today).

@arsenm
Contributor

arsenm commented Sep 15, 2025

To make my position perfectly explicit, I think this should be abandoned. The asm constraints suggested in #157080 should also not be pursued.

@linuxrocks123
Contributor Author

linuxrocks123 commented Sep 16, 2025

@arsenm,

> There cannot be wrong behavior in scheduling. The scheduling intrinsics can at most be interpreted as an optimization hint; optimizations cannot be semantics.

By "wrong", I meant wrong from an optimal-scheduling point of view, not correctness.

> This is the case. There are 0 SALU instructions that can read or write a VGPR. The converse is almost true too; VALU instructions cannot write SGPRs, with the special cases of boolean flags and readlane/writelane. But you don't need a perfect answer; it's still a scheduling heuristic decision.

That sounds like a promising approach, in that case.

> To make my position perfectly explicit, I think this should be abandoned. The asm constraints suggested in #157080 should also not be pursued.

Would you be willing to remove your block if I pursued the approach of guessing the instruction type based on the operand constraints?

Edit: Po Yen is explaining on the internal issue tracker why guessing the instruction type based on operand constraints will not work.

@linuxrocks123
Contributor Author

Hi @arsenm, thanks for your help reviewing this. I made some changes to the code to try to implement the constraint inference approach. Would you please let me know what you think?

Also, the testcase I added generates a0 instead of a[0-1], which I think is what it should be. Do you know how to fix that?

@linuxrocks123
Contributor Author

@arsenm I intend to add one testcase per sgmask category, but I wanted to get the code solid first.
