[AMDGPU] Add module flag to control TBUFFER combining #156454
Conversation
This is not a robust feature design; it is hacking out a singular transformation. You need to define a semantic meaning and have the default be conservatively correct.
Hi Matt, I don't know how to define a semantic meaning. Could you please give me some suggestions?
Is this the same as relaxed-buffer-oob-mode?
I think it is not the same as relaxed-buffer-oob-mode.
I mean that what you proposed sounds like the opposite of relaxed-buffer-oob-mode, i.e. you don't need something new; the pass is just bugged and needs to check whether relaxed-buffer-oob-mode is enabled before performing the merge.
Agreed that we should be correct by default, so tbuffer merges should not happen unless a certain function attribute is set that allows it. The use case is slightly different from relaxed-buffer-oob-mode, though. In this case it is due to a recent Vulkan spec hardening extension, https://registry.khronos.org/vulkan/specs/latest/man/html/VK_KHR_maintenance9.html, which says: "Add a property to indicate the implementation will return (0,0,0,0) or (0,0,0,1) to vertex shaders that read unassigned attributes." The driver property that drives this is different from the property that drives relaxed-buffer-oob-mode, so there should be two different mechanisms to avoid code pessimization. Also note, the middle-end merging never happens for tbuffers, because they are intrinsics all the way - at no point are they regular LLVM loads. Having said that, I agree si-load-store-optimizer should also be updated to check for relaxed-buffer-oob-mode to disallow buffer merging, but this is a separate change.
We should fix that. We should not have any of this MI-level merging.
@llvm/pr-subscribers-backend-amdgpu

Author: Harrison Hao (harrisonGPU)

Changes

Some graphics features do not allow TBUFFER combining.

When vertex attribute robustness is enabled, unbound vertex attributes must return (0,0,0,0) or (0,0,0,1). For typed buffers, if any component of a thread is out of bounds, the entire thread is considered out of bounds and must return zero.

Therefore, TBUFFER combining must be disabled in this case.

Reference: https://registry.khronos.org/vulkan/specs/latest/man/html/VK_EXT_vertex_attribute_robustness.html

Full diff: https://github.com/llvm/llvm-project/pull/156454.diff

4 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp b/llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
index 6f2ea8ad1ff01..740e574bf70c9 100644
--- a/llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
+++ b/llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
@@ -62,6 +62,7 @@
#include "GCNSubtarget.h"
#include "MCTargetDesc/AMDGPUMCTargetDesc.h"
#include "SIDefines.h"
+#include "SIMachineFunctionInfo.h"
#include "llvm/Analysis/AliasAnalysis.h"
#include "llvm/CodeGen/MachineFunctionPass.h"
#include "llvm/InitializePasses.h"
@@ -2456,6 +2457,14 @@ SILoadStoreOptimizer::collectMergeableInsts(
LLVM_DEBUG(dbgs() << "Skip tbuffer with unknown format: " << MI);
continue;
}
+
+ const MachineFunction *MF = MI.getParent()->getParent();
+ const auto *MFI = MF->getInfo<SIMachineFunctionInfo>();
+ if (MFI->isTBufferCombineDisabled()) {
+ LLVM_DEBUG(
+ dbgs() << "Skip TBUFFER combine: disabled by function attribute\n");
+ continue;
+ }
}
CombineInfo CI;
diff --git a/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp b/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
index 8a1120321af9f..717517e83adfd 100644
--- a/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
@@ -195,6 +195,9 @@ SIMachineFunctionInfo::SIMachineFunctionInfo(const Function &F,
VGPRForAGPRCopy =
AMDGPU::VGPR_32RegClass.getRegister(ST.getMaxNumVGPRs(F) - 1);
}
+
+ if (F.hasFnAttribute("amdgpu-disable-tbuffer-combine"))
+ setDisableTBufferCombine(true);
}
MachineFunctionInfo *SIMachineFunctionInfo::clone(
diff --git a/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h b/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
index ca8f8033a2d54..0e8a0b75c0491 100644
--- a/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
+++ b/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
@@ -293,6 +293,8 @@ struct SIMachineFunctionInfo final : public yaml::MachineFunctionInfo {
unsigned PSInputEnable = 0;
unsigned MaxMemoryClusterDWords = DefaultMemoryClusterDWordsLimit;
+ bool DisableTBufferCombine = false;
+
SIMode Mode;
std::optional<FrameIndex> ScavengeFI;
StringValue VGPRForAGPRCopy;
@@ -525,6 +527,9 @@ class SIMachineFunctionInfo final : public AMDGPUMachineFunction,
// scheduler stage.
unsigned MaxMemoryClusterDWords = DefaultMemoryClusterDWordsLimit;
+ // Disable combining of TBUFFER instructions.
+ bool DisableTBufferCombine = false;
+
MCPhysReg getNextUserSGPR() const;
MCPhysReg getNextSystemSGPR() const;
@@ -1207,6 +1212,11 @@ class SIMachineFunctionInfo final : public AMDGPUMachineFunction,
unsigned getMaxNumWorkGroupsX() const { return MaxNumWorkGroups[0]; }
unsigned getMaxNumWorkGroupsY() const { return MaxNumWorkGroups[1]; }
unsigned getMaxNumWorkGroupsZ() const { return MaxNumWorkGroups[2]; }
+
+ bool isTBufferCombineDisabled() const { return DisableTBufferCombine; }
+ void setDisableTBufferCombine(bool IsDisabled) {
+ DisableTBufferCombine = IsDisabled;
+ }
};
} // end namespace llvm
diff --git a/llvm/test/CodeGen/AMDGPU/tbuffer-combine-disable-attr.mir b/llvm/test/CodeGen/AMDGPU/tbuffer-combine-disable-attr.mir
new file mode 100644
index 0000000000000..cd119aacf7496
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/tbuffer-combine-disable-attr.mir
@@ -0,0 +1,65 @@
+# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 5
+# RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1100 -run-pass=si-load-store-opt -verify-machineinstrs %s -o - | FileCheck %s
+
+--- |
+ target triple = "amdgcn"
+
+ define float @disable-tbuffer-combine(<4 x i32> %vec, i32 %index) #0 {
+ %1 = call i32 @llvm.amdgcn.struct.tbuffer.load.i32(<4 x i32> %vec, i32 %index, i32 0, i32 0, i32 22, i32 0)
+ %2 = call i32 @llvm.amdgcn.struct.tbuffer.load.i32(<4 x i32> %vec, i32 %index, i32 4, i32 0, i32 22, i32 0)
+ %3 = call i32 @llvm.amdgcn.struct.tbuffer.load.i32(<4 x i32> %vec, i32 %index, i32 8, i32 0, i32 22, i32 0)
+ %4 = call i32 @llvm.amdgcn.struct.tbuffer.load.i32(<4 x i32> %vec, i32 %index, i32 12, i32 0, i32 22, i32 0)
+ %5 = bitcast i32 %1 to float
+ %6 = bitcast i32 %2 to float
+ %7 = bitcast i32 %3 to float
+ %8 = bitcast i32 %4 to float
+ %add = fadd float %5, %6
+ %mul = fmul float %7, %8
+ %res = fadd float %add, %mul
+ ret float %res
+ }
+
+ attributes #0 = {"amdgpu-disable-tbuffer-combine"}
+...
+---
+name: disable-tbuffer-combine
+body: |
+ bb.0 (%ir-block.0):
+ liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3, $vgpr4
+
+ ; CHECK-LABEL: name: disable-tbuffer-combine
+ ; CHECK: liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3, $vgpr4
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr4
+ ; CHECK-NEXT: [[COPY1:%[0-9]+]]:vgpr_32 = COPY $vgpr3
+ ; CHECK-NEXT: [[COPY2:%[0-9]+]]:vgpr_32 = COPY $vgpr2
+ ; CHECK-NEXT: [[COPY3:%[0-9]+]]:vgpr_32 = COPY $vgpr1
+ ; CHECK-NEXT: [[COPY4:%[0-9]+]]:vgpr_32 = COPY $vgpr0
+ ; CHECK-NEXT: [[REG_SEQUENCE:%[0-9]+]]:sgpr_128 = REG_SEQUENCE [[COPY4]], %subreg.sub0, [[COPY3]], %subreg.sub1, [[COPY2]], %subreg.sub2, [[COPY1]], %subreg.sub3
+ ; CHECK-NEXT: [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 0
+ ; CHECK-NEXT: [[TBUFFER_LOAD_FORMAT_X_IDXEN:%[0-9]+]]:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY]], [[REG_SEQUENCE]], [[S_MOV_B32_]], 0, 22, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+ ; CHECK-NEXT: [[TBUFFER_LOAD_FORMAT_X_IDXEN1:%[0-9]+]]:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY]], [[REG_SEQUENCE]], [[S_MOV_B32_]], 4, 22, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+ ; CHECK-NEXT: [[TBUFFER_LOAD_FORMAT_X_IDXEN2:%[0-9]+]]:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY]], [[REG_SEQUENCE]], [[S_MOV_B32_]], 8, 22, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+ ; CHECK-NEXT: [[TBUFFER_LOAD_FORMAT_X_IDXEN3:%[0-9]+]]:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY]], [[REG_SEQUENCE]], [[S_MOV_B32_]], 12, 22, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+ ; CHECK-NEXT: [[V_ADD_F32_e64_:%[0-9]+]]:vgpr_32 = nofpexcept V_ADD_F32_e64 0, killed [[TBUFFER_LOAD_FORMAT_X_IDXEN]], 0, killed [[TBUFFER_LOAD_FORMAT_X_IDXEN1]], 0, 0, implicit $mode, implicit $exec
+ ; CHECK-NEXT: [[V_MUL_F32_e64_:%[0-9]+]]:vgpr_32 = nofpexcept V_MUL_F32_e64 0, killed [[TBUFFER_LOAD_FORMAT_X_IDXEN2]], 0, killed [[TBUFFER_LOAD_FORMAT_X_IDXEN3]], 0, 0, implicit $mode, implicit $exec
+ ; CHECK-NEXT: [[V_ADD_F32_e64_1:%[0-9]+]]:vgpr_32 = nofpexcept V_ADD_F32_e64 0, killed [[V_ADD_F32_e64_]], 0, killed [[V_MUL_F32_e64_]], 0, 0, implicit $mode, implicit $exec
+ ; CHECK-NEXT: $vgpr0 = COPY [[V_ADD_F32_e64_1]]
+ ; CHECK-NEXT: SI_RETURN implicit $vgpr0
+ %12:vgpr_32 = COPY $vgpr4
+ %11:vgpr_32 = COPY $vgpr3
+ %10:vgpr_32 = COPY $vgpr2
+ %9:vgpr_32 = COPY $vgpr1
+ %8:vgpr_32 = COPY $vgpr0
+ %13:sgpr_128 = REG_SEQUENCE %8, %subreg.sub0, %9, %subreg.sub1, %10, %subreg.sub2, %11, %subreg.sub3
+ %14:sreg_32 = S_MOV_B32 0
+ %15:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN %12, %13, %14, 0, 22, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+ %16:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN %12, %13, %14, 4, 22, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+ %17:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN %12, %13, %14, 8, 22, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+ %18:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN %12, %13, %14, 12, 22, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+ %19:vgpr_32 = nofpexcept V_ADD_F32_e64 0, killed %15, 0, killed %16, 0, 0, implicit $mode, implicit $exec
+ %20:vgpr_32 = nofpexcept V_MUL_F32_e64 0, killed %17, 0, killed %18, 0, 0, implicit $mode, implicit $exec
+ %21:vgpr_32 = nofpexcept V_ADD_F32_e64 0, killed %19, 0, killed %20, 0, 0, implicit $mode, implicit $exec
+ $vgpr0 = COPY %21
+ SI_RETURN implicit $vgpr0
+...
I'd also like to take the opportunity to flag that relaxed-buffer-oob-mode should either be a module-level or a function-level property ... and should actually have its default flipped (to strict-oob-mode on a module, for example - or maybe that'd be an HSA-specific thing).
Also, should this be a function-level or module-level property? It feels module-level, in that your entire program either does or doesn't want these semantics?
Is the suggestion to add an extra middle-end pass that merges tbuffers? The load-store-vectorizer works on regular loads/stores, so it is not easily possible to get it working with the tbuffer representation, which carries extra information (the format) on the instruction.
Yes, this is more like a module-level property. But a function attribute is a standard way to pass this kind of information from the front end to the backend. Is there a better option?
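For context, a minimal sketch of that front-end-to-backend channel using LLVM's C++ API; the attribute name is the one from this patch, while the helper function itself is hypothetical:

```cpp
#include "llvm/IR/Function.h"
using namespace llvm;

// Hypothetical front-end helper: attach a string attribute that the
// backend later queries with F.hasFnAttribute(...), as in the patch's
// SIMachineFunctionInfo change quoted below.
static void markNoTBufferCombine(Function &F) {
  F.addFnAttr("amdgpu-disable-tbuffer-combine");
}
```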
if (F.hasFnAttribute("amdgpu-disable-tbuffer-combine"))
  setDisableTBufferCombine(true);
I think Matt's suggestion was that the conservative behaviour should be the default, so the tbuffer merging should be disabled by default, and only enabled when an attribute is set.
I think we should enable tbuffer combine by default, because its logic is generally correct, and only certain features disallow it. What do you think?
I see where you're coming from on this, and I had similar thoughts in #115479. But being always correct by default has some clear advantages. The price to pay for this is to always set a certain flag to enable relaxed mode.
And I have another question: if we disable tbuffer combine by default, will many lit tests need to be updated? Does that mean all the 8-bit, 16-bit, and 32-bit tests would need to change?
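If it helps, the opt-in variant the reviewers describe would be a small inversion of the patch's hunk; a sketch, with a hypothetical attribute name:

```cpp
// Sketch of the conservative default: merging stays disabled unless the
// front end explicitly opts in. "amdgpu-allow-tbuffer-combine" is a
// hypothetical name, not the one used in this patch.
if (!F.hasFnAttribute("amdgpu-allow-tbuffer-combine"))
  setDisableTBufferCombine(true);
```

Under this form, the lit tests that currently expect merged loads would indeed need either the opt-in attribute added or their checks regenerated.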
Module flags? (Which, well, relaxed-oob-mode - or its inverse - should also be)
Yeah - and you might want the same for buffer_load/buffer_store more generally, in theory?
If this is a module-level thing, please just use a module flag. We already have something "bad" that is supposed to be a module-level config but ends up being a function-level attribute, which causes a lot of trouble.
Making it module level turns it into ABI, which is bad. You should be able to perform one-way linking from strict to relaxed use contexts.
That is, I think it's possible - and correct - to make the strict behavior infectious when you do a module-to-module link at the LLVM IR level. At the ...
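For what it's worth, LLVM's module-flag merge behaviors can already express this kind of one-way propagation at IR link time; a sketch, assuming strict is encoded as 0 and relaxed as 1 (the flag name and encoding are assumptions, not part of this patch):

```cpp
#include "llvm/IR/Module.h"
using namespace llvm;

// With the Min merge behavior, linking a strict module (0) with a relaxed
// one (1) resolves to strict (0), so strictness is infectious across
// IR-level links.
static void setOOBMode(Module &M, bool Relaxed) {
  M.addModuleFlag(Module::Min, "amdgpu.oob.mode", Relaxed ? 1 : 0);
}
```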
It appears the only remaining point of contention is whether to use a module flag or a function attribute. I agree with Matt that making it module level could introduce new problems. I prefer using a function attribute, which is also more flexible if we decide to set it on a per-function basis in the future.
I've switched to a function attribute to control TBUFFER combining and renamed it to ...
I am not completely up to speed on this, but I thought combining dword loads into multi-dword loads was always OK, because bounds checking checks each dword individually. Hopefully @piotrAMD will correct me if I am wrong.
Can I get more clarity on what the problem with a module-level flag is?
I think the only issue with a module flag would be if you care about having different entry points in the same module with different behavior. If this is an attribute, it needs more work to properly handle inlining with mixed modes. Also, I'm still not clear how this differs from the relaxed-oob-mode feature. Can these be merged into one? In any case, I'd like to see those use identical mechanisms before introducing another knob.
Typed buffer loads are not range-checked per dword; they are range-checked "all or nothing".
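To make the difference concrete, here is a purely illustrative C++ model of the two bounds-checking behaviors described above (not hardware code; the buffer and sizes are hypothetical):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Four separate x-format loads: each dword is bounds-checked on its own,
// so only the out-of-bounds components read back as zero.
std::array<uint32_t, 4> loadFourX(const uint32_t *Buf, size_t NumDwords,
                                  size_t First) {
  std::array<uint32_t, 4> R{};
  for (size_t I = 0; I < 4; ++I)
    R[I] = (First + I < NumDwords) ? Buf[First + I] : 0;
  return R;
}

// One merged xyzw typed load: checked "all or nothing" - if any component
// is out of bounds, every component reads back as zero.
std::array<uint32_t, 4> loadMergedXYZW(const uint32_t *Buf, size_t NumDwords,
                                       size_t First) {
  std::array<uint32_t, 4> R{};
  if (First + 4 <= NumDwords)
    for (size_t I = 0; I < 4; ++I)
      R[I] = Buf[First + I];
  return R;
}
```

With NumDwords = 3 and First = 0, the first version returns {Buf[0], Buf[1], Buf[2], 0} while the merged version returns {0, 0, 0, 0} - which is exactly why the merge is unsafe under these semantics.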
This is similar to the existing relaxed-oob-mode feature in that it controls OOB merging behaviour. The behavior for untyped buffer merging (in LLVM IR) and this tbuffer merging in MIR are driven by different language features, so we need to be able to select them individually. I guess having a single mechanism like a multistate function attribute/module flag would be OK, but there is this problem that the function that uses relaxed-oob-mode currently (...
@piotrAMD More fundamental question: is the strictness/relaxedness something that's enabled per compilation (that is, per module) or, at the SPIR-V level, can it differ per function? If you can't have functions with different strictness requirements in one program ... both of these should be module flags, and ...
Hi, I've switched to using a module flag to control TBUFFER combining.
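A sketch of how a backend could query such a flag; the value encoding here (nonzero meaning relaxed) is an assumption, not taken from the patch:

```cpp
#include "llvm/IR/Constants.h"
#include "llvm/IR/Metadata.h"
#include "llvm/IR/Module.h"
using namespace llvm;

// Assumed encoding: nonzero means relaxed OOB mode; an absent flag or a
// zero value means strict, keeping the conservative behavior the default.
static bool isRelaxedOOBMode(const Module &M) {
  if (auto *V = mdconst::extract_or_null<ConstantInt>(
          M.getModuleFlag("amdgpu.oob.mode")))
    return V->getZExtValue() != 0;
  return false;
}
```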
Proposing a change to the existing OOB mode (relaxed-buffer-oob-mode) to use a module flag instead of a subtarget feature - just a draft to canvass opinions: #160922.
Some graphics APIs require stricter out-of-bounds (OOB) semantics that are
incompatible with TBUFFER instruction combining.
When vertex attribute robustness is enabled, unbound vertex attributes must
return (0,0,0,0) or (0,0,0,1). For typed buffers, if any component of a thread
is out of bounds, the entire thread is considered out of bounds and must
return zero. If multiple TBUFFER loads are merged into a wider load, a single
OOB component would incorrectly force all components to return zero.
This patch introduces a module flag, amdgpu.oob.mode, to control this.

Reference: https://registry.khronos.org/vulkan/specs/latest/man/html/VK_EXT_vertex_attribute_robustness.html