Skip to content

Conversation

harrisonGPU
Copy link
Contributor

@harrisonGPU harrisonGPU commented Sep 2, 2025

Some graphics APIs require stricter out-of-bounds (OOB) semantics that are
incompatible with TBUFFER instruction combining.

When vertex attribute robustness is enabled, unbound vertex attributes must
return (0,0,0,0) or (0,0,0,1). For typed buffers, if any component of a thread
is out of bounds, the entire thread is considered out of bounds and must
return zero. If multiple TBUFFER loads are merged into a wider load, a single
OOB component would incorrectly force all components to return zero.

This patch introduces a module flag, amdgpu.oob.mode, to control this:

  • bit 0x2 → RelaxedTBufferOOBMode (Enable TBUFFER combining)

Reference:
https://registry.khronos.org/vulkan/specs/latest/man/html/VK_EXT_vertex_attribute_robustness.html

Copy link
Contributor

@arsenm arsenm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a robust feature design, it is hacking out a singular transformation. You need to define a semantic meaning, have the default be conservatively correct

@harrisonGPU
Copy link
Contributor Author

This is not a robust feature design, it is hacking out a singular transformation. You need to define a semantic meaning, have the default be conservatively correct

Hi, Matt, I don't know how to define a semantic meaning, could you please give me some suggestions?

@arsenm arsenm requested a review from krzysz00 September 3, 2025 02:56
@arsenm
Copy link
Contributor

arsenm commented Sep 3, 2025

This is not a robust feature design, it is hacking out a singular transformation. You need to define a semantic meaning, have the default be conservatively correct

Hi, Matt, I don't know how to define a semantic meaning, could you please give me some suggestions?

Is this the same as relaxed-buffer-oob-mod?

@harrisonGPU
Copy link
Contributor Author

harrisonGPU commented Sep 3, 2025

This is not a robust feature design, it is hacking out a singular transformation. You need to define a semantic meaning, have the default be conservatively correct

Hi, Matt, I don't know how to define a semantic meaning, could you please give me some suggestions?

Is this the same as relaxed-buffer-oob-mod?

I think it is not same as relaxed-buffer-oob-mod, this attribute is aims to the Vulkan semantics for typed-buffers and unbound vertex attributes, where if any component is out of bounds, the entire result must return zero.So the attribute is not enabling relaxed OOB handling, but instead enforcing the stricter API-level semantics which prohibit TBUFFER combine.

@arsenm
Copy link
Contributor

arsenm commented Sep 3, 2025

I think it is not same as relaxed-buffer-oob-mod, this attribute is aims to the Vulkan semantics for typed-buffers and unbound vertex attributes, where if any component is out of bounds, the entire result must return zero.So the attribute is not enabling relaxed OOB handling, but instead enforcing the stricter API-level semantics which prohibit TBUFFER combine.

I mean what you proposed sounds like the opposite of relaxed-buffer-oob-mod. i.e. you don't need something new, the pass is just bugged and needs to check if relaxed-buffer-oob-mod is enabled to perform the merge

@piotrAMD
Copy link
Collaborator

piotrAMD commented Sep 3, 2025

Agreed that we should be correct by default, so tbuffer merges should not happen unless a certain function attribute is set that allows it.

The use case is slightly different than relaxed-buffer-oob-mode though. In this case it is due to a recent Vulkan spec hardening extension https://registry.khronos.org/vulkan/specs/latest/man/html/VK_KHR_maintenance9.html that says: "Add a property to indicate the implementation will return (0,0,0,0) or (0,0,0,1) to vertex shaders that read unassigned attributes.".
We can't merge tbuffers in this case, because tbuffers are not checked by component, so we can't selectively return the oob value.

The driver property that drives this is different than the property that drives relaxed-buffer-oob-mode, so there should be two different mechanisms to avoid code pessimization.

Also note, the middle-end merging never happens for tbuffers, because they are intrinsincs all the way - at no point are they regular llvm loads.

Having said that, I agree si-load-store-optimizer should also be updated to check for relaxed-buffer-oob-mode to disallow buffer merging, but this is a separate change.

@arsenm
Copy link
Contributor

arsenm commented Sep 3, 2025

Also note, the middle-end merging never happens for tbuffers, because they are intrinsincs all the way - at no point are they regular llvm loads.

We should fix that. We should not have any of this MI level merging

@llvmbot
Copy link
Member

llvmbot commented Sep 3, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Harrison Hao (harrisonGPU)

Changes

Some graphics features do not allow TBUFFER combine.

When vertex attribute robustness is enabled, unbound vertex attributes must
return (0,0,0,0) or (0,0,0,1).

For typed-buffers, if any component of a thread is out of bounds, the entire
thread is considered out of bounds and must return zero.
If we combine multiple TBUFFER instructions into a wider load, a single out-of-bounds
component would incorrectly cause all components to return zero.

Therefore, TBUFFER combine must be disabled in this case.

Reference:
https://registry.khronos.org/vulkan/specs/latest/man/html/VK_EXT_vertex_attribute_robustness.html


Full diff: https://github.com/llvm/llvm-project/pull/156454.diff

4 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp (+9)
  • (modified) llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp (+3)
  • (modified) llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h (+10)
  • (added) llvm/test/CodeGen/AMDGPU/tbuffer-combine-disable-attr.mir (+65)
diff --git a/llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp b/llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
index 6f2ea8ad1ff01..740e574bf70c9 100644
--- a/llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
+++ b/llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
@@ -62,6 +62,7 @@
 #include "GCNSubtarget.h"
 #include "MCTargetDesc/AMDGPUMCTargetDesc.h"
 #include "SIDefines.h"
+#include "SIMachineFunctionInfo.h"
 #include "llvm/Analysis/AliasAnalysis.h"
 #include "llvm/CodeGen/MachineFunctionPass.h"
 #include "llvm/InitializePasses.h"
@@ -2456,6 +2457,14 @@ SILoadStoreOptimizer::collectMergeableInsts(
         LLVM_DEBUG(dbgs() << "Skip tbuffer with unknown format: " << MI);
         continue;
       }
+
+      const MachineFunction *MF = MI.getParent()->getParent();
+      const auto *MFI = MF->getInfo<SIMachineFunctionInfo>();
+      if (MFI->isTBufferCombineDisabled()) {
+        LLVM_DEBUG(
+            dbgs() << "Skip TBUFFER combine: disabled by function attribute\n");
+        continue;
+      }
     }
 
     CombineInfo CI;
diff --git a/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp b/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
index 8a1120321af9f..717517e83adfd 100644
--- a/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
@@ -195,6 +195,9 @@ SIMachineFunctionInfo::SIMachineFunctionInfo(const Function &F,
     VGPRForAGPRCopy =
         AMDGPU::VGPR_32RegClass.getRegister(ST.getMaxNumVGPRs(F) - 1);
   }
+
+  if (F.hasFnAttribute("amdgpu-disable-tbuffer-combine"))
+    setDisableTBufferCombine(true);
 }
 
 MachineFunctionInfo *SIMachineFunctionInfo::clone(
diff --git a/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h b/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
index ca8f8033a2d54..0e8a0b75c0491 100644
--- a/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
+++ b/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
@@ -293,6 +293,8 @@ struct SIMachineFunctionInfo final : public yaml::MachineFunctionInfo {
   unsigned PSInputEnable = 0;
   unsigned MaxMemoryClusterDWords = DefaultMemoryClusterDWordsLimit;
 
+  bool DisableTBufferCombine = false;
+
   SIMode Mode;
   std::optional<FrameIndex> ScavengeFI;
   StringValue VGPRForAGPRCopy;
@@ -525,6 +527,9 @@ class SIMachineFunctionInfo final : public AMDGPUMachineFunction,
   // scheduler stage.
   unsigned MaxMemoryClusterDWords = DefaultMemoryClusterDWordsLimit;
 
+  // Disable combining of TBUFFER instructions.
+  bool DisableTBufferCombine = false;
+
   MCPhysReg getNextUserSGPR() const;
 
   MCPhysReg getNextSystemSGPR() const;
@@ -1207,6 +1212,11 @@ class SIMachineFunctionInfo final : public AMDGPUMachineFunction,
   unsigned getMaxNumWorkGroupsX() const { return MaxNumWorkGroups[0]; }
   unsigned getMaxNumWorkGroupsY() const { return MaxNumWorkGroups[1]; }
   unsigned getMaxNumWorkGroupsZ() const { return MaxNumWorkGroups[2]; }
+
+  bool isTBufferCombineDisabled() const { return DisableTBufferCombine; }
+  void setDisableTBufferCombine(bool IsDisabled) {
+    DisableTBufferCombine = IsDisabled;
+  }
 };
 
 } // end namespace llvm
diff --git a/llvm/test/CodeGen/AMDGPU/tbuffer-combine-disable-attr.mir b/llvm/test/CodeGen/AMDGPU/tbuffer-combine-disable-attr.mir
new file mode 100644
index 0000000000000..cd119aacf7496
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/tbuffer-combine-disable-attr.mir
@@ -0,0 +1,65 @@
+# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 5
+# RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1100 -run-pass=si-load-store-opt -verify-machineinstrs %s -o - | FileCheck %s
+
+--- |
+  target triple = "amdgcn"
+
+  define float @disable-tbuffer-combine(<4 x i32> %vec, i32 %index) #0 {
+    %1 = call i32 @llvm.amdgcn.struct.tbuffer.load.i32(<4 x i32> %vec, i32 %index, i32 0, i32 0, i32 22, i32 0)
+    %2 = call i32 @llvm.amdgcn.struct.tbuffer.load.i32(<4 x i32> %vec, i32 %index, i32 4, i32 0, i32 22, i32 0)
+    %3 = call i32 @llvm.amdgcn.struct.tbuffer.load.i32(<4 x i32> %vec, i32 %index, i32 8, i32 0, i32 22, i32 0)
+    %4 = call i32 @llvm.amdgcn.struct.tbuffer.load.i32(<4 x i32> %vec, i32 %index, i32 12, i32 0, i32 22, i32 0)
+    %5 = bitcast i32 %1 to float
+    %6 = bitcast i32 %2 to float
+    %7 = bitcast i32 %3 to float
+    %8 = bitcast i32 %4 to float
+    %add = fadd float %5, %6
+    %mul = fmul float %7, %8
+    %res = fadd float %add, %mul
+    ret float %res
+  }
+
+  attributes #0 = {"amdgpu-disable-tbuffer-combine"}
+...
+---
+name:            disable-tbuffer-combine
+body:             |
+  bb.0 (%ir-block.0):
+    liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3, $vgpr4
+
+    ; CHECK-LABEL: name: disable-tbuffer-combine
+    ; CHECK: liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3, $vgpr4
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr4
+    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:vgpr_32 = COPY $vgpr3
+    ; CHECK-NEXT: [[COPY2:%[0-9]+]]:vgpr_32 = COPY $vgpr2
+    ; CHECK-NEXT: [[COPY3:%[0-9]+]]:vgpr_32 = COPY $vgpr1
+    ; CHECK-NEXT: [[COPY4:%[0-9]+]]:vgpr_32 = COPY $vgpr0
+    ; CHECK-NEXT: [[REG_SEQUENCE:%[0-9]+]]:sgpr_128 = REG_SEQUENCE [[COPY4]], %subreg.sub0, [[COPY3]], %subreg.sub1, [[COPY2]], %subreg.sub2, [[COPY1]], %subreg.sub3
+    ; CHECK-NEXT: [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 0
+    ; CHECK-NEXT: [[TBUFFER_LOAD_FORMAT_X_IDXEN:%[0-9]+]]:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY]], [[REG_SEQUENCE]], [[S_MOV_B32_]], 0, 22, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+    ; CHECK-NEXT: [[TBUFFER_LOAD_FORMAT_X_IDXEN1:%[0-9]+]]:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY]], [[REG_SEQUENCE]], [[S_MOV_B32_]], 4, 22, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+    ; CHECK-NEXT: [[TBUFFER_LOAD_FORMAT_X_IDXEN2:%[0-9]+]]:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY]], [[REG_SEQUENCE]], [[S_MOV_B32_]], 8, 22, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+    ; CHECK-NEXT: [[TBUFFER_LOAD_FORMAT_X_IDXEN3:%[0-9]+]]:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY]], [[REG_SEQUENCE]], [[S_MOV_B32_]], 12, 22, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+    ; CHECK-NEXT: [[V_ADD_F32_e64_:%[0-9]+]]:vgpr_32 = nofpexcept V_ADD_F32_e64 0, killed [[TBUFFER_LOAD_FORMAT_X_IDXEN]], 0, killed [[TBUFFER_LOAD_FORMAT_X_IDXEN1]], 0, 0, implicit $mode, implicit $exec
+    ; CHECK-NEXT: [[V_MUL_F32_e64_:%[0-9]+]]:vgpr_32 = nofpexcept V_MUL_F32_e64 0, killed [[TBUFFER_LOAD_FORMAT_X_IDXEN2]], 0, killed [[TBUFFER_LOAD_FORMAT_X_IDXEN3]], 0, 0, implicit $mode, implicit $exec
+    ; CHECK-NEXT: [[V_ADD_F32_e64_1:%[0-9]+]]:vgpr_32 = nofpexcept V_ADD_F32_e64 0, killed [[V_ADD_F32_e64_]], 0, killed [[V_MUL_F32_e64_]], 0, 0, implicit $mode, implicit $exec
+    ; CHECK-NEXT: $vgpr0 = COPY [[V_ADD_F32_e64_1]]
+    ; CHECK-NEXT: SI_RETURN implicit $vgpr0
+    %12:vgpr_32 = COPY $vgpr4
+    %11:vgpr_32 = COPY $vgpr3
+    %10:vgpr_32 = COPY $vgpr2
+    %9:vgpr_32 = COPY $vgpr1
+    %8:vgpr_32 = COPY $vgpr0
+    %13:sgpr_128 = REG_SEQUENCE %8, %subreg.sub0, %9, %subreg.sub1, %10, %subreg.sub2, %11, %subreg.sub3
+    %14:sreg_32 = S_MOV_B32 0
+    %15:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN %12, %13, %14, 0, 22, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+    %16:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN %12, %13, %14, 4, 22, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+    %17:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN %12, %13, %14, 8, 22, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+    %18:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN %12, %13, %14, 12, 22, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+    %19:vgpr_32 = nofpexcept V_ADD_F32_e64 0, killed %15, 0, killed %16, 0, 0, implicit $mode, implicit $exec
+    %20:vgpr_32 = nofpexcept V_MUL_F32_e64 0, killed %17, 0, killed %18, 0, 0, implicit $mode, implicit $exec
+    %21:vgpr_32 = nofpexcept V_ADD_F32_e64 0, killed %19, 0, killed %20, 0, 0, implicit $mode, implicit $exec
+    $vgpr0 = COPY %21
+    SI_RETURN implicit $vgpr0
+...

@krzysz00
Copy link
Contributor

krzysz00 commented Sep 3, 2025

I'd also like to take the opportunity to flag that relaxed-buffer-oob-mode should either be a module-level or a function-level property ... and should actually have its default flipped (to strict-oob-mode on a module, for example - or maybe that'd be an HSA-specific thing)

@krzysz00
Copy link
Contributor

krzysz00 commented Sep 3, 2025

Also, should this be a function-level or module-level property? It feels module-level, in that your entire program either does or doesn't want these semantics?

@piotrAMD
Copy link
Collaborator

piotrAMD commented Sep 4, 2025

Also note, the middle-end merging never happens for tbuffers, because they are intrinsincs all the way - at no point are they regular llvm loads.

We should fix that. We should not have any of this MI level merging

Is the suggestion to add an extra ME pass that merges tbuffers? load-store-vectorizer works on the regular load/stores, so it is not easily possible to get that working with tbuffer representation which has extra information (format) on the instruction.

@piotrAMD
Copy link
Collaborator

piotrAMD commented Sep 4, 2025

Also, should this be a function-level or module-level property? It feels module-level, in that your entire program either does or doesn't want these semantics?

Yes, this is more like module-level property. But function attribute is a standard way to pass this kind of information from the front-end to the backend. Is there a better option?

Comment on lines 199 to 200
if (F.hasFnAttribute("amdgpu-disable-tbuffer-combine"))
setDisableTBufferCombine(true);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Matt's suggestion was that the conservative behaviour should be the default, so the tbuffer merging should be disabled by default, and only enabled when an attribute is set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should enable tbuffer combine by default, because its logic is generally correct, and only certain features disallow it. What do you think?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see where you're coming from on this, and I had similar thoughts in #115479. But being always correct by default has some clear advantages. The price to pay for this is to always set a certain flag to enable relaxed mode.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I have another question: if we disable tbuffer combine by default, will many lit tests need to be updated? Does that mean all the 8-bit, 16-bit, and 32-bit tests would need to change?

@krzysz00
Copy link
Contributor

krzysz00 commented Sep 4, 2025

Yes, this is more like module-level property. But function attribute is a standard way to pass this kind of information from the front-end to the backend. Is there a better option?

Module flags?

(Which, well, relaxed-oob-mode - or its inverse - should also be)

@krzysz00
Copy link
Contributor

krzysz00 commented Sep 4, 2025

Is the suggestion to add an extra ME pass that merges tbuffers? load-store-vectorizer works on the regular load/stores, so it is not easily possible to get that working with tbuffer representation which has extra information (format) on the instruction.

Yeah - and you might want the same for buffer_load/buffer_store more generally, in theory?

@shiltian
Copy link
Contributor

shiltian commented Sep 4, 2025

Yes, this is more like module-level property. But function attribute is a standard way to pass this kind of information from the front-end to the backend. Is there a better option?

If this is a module level thingy, please just use module flag. We already have something "bad" that is supposed to be a module-level config but ends up being a function level attribute, which causes many trouble.

@arsenm
Copy link
Contributor

arsenm commented Sep 5, 2025

Yes, this is more like module-level property. But function attribute is a standard way to pass this kind of information from the front-end to the backend. Is there a better option?

Making it module level turns it into ABI, which is bad. You should be able to perform 1 way linking from strict to relaxed use contexts

@krzysz00
Copy link
Contributor

krzysz00 commented Sep 5, 2025

Making it module level turns it into ABI, which is bad. You should be able to perform 1 way linking from strict to relaxed use contexts

ModFlagBehavior is a thing and would like you max-combine these things? That is, if you linked relaxed mode into strict mode you get strict mode.

@krzysz00
Copy link
Contributor

krzysz00 commented Sep 8, 2025

That is, I think it's possible - and correct - to make the strict behavior infectious when you do a module-to-module link at the LLVM IR level

At the ld level ... none of this matters, each .o will generate instructions according to the module flags it was generated with.

@piotrAMD
Copy link
Collaborator

It appears the only remaining point of contention is whether to use a module flag or a function attribute. I agree with Matt that making it module level could introduce new problems. I prefer using a function attribute, which is also more flexible if we decide to set it on a per-function basis in the future.

@harrisonGPU
Copy link
Contributor Author

I’ve switched to a function attribute to control TBUFFER combining and renamed it to amdgpu-relaxed-tbuffer-oob-mod for clearer semantics. This attribute is false by default, so TBUFFER combining is disabled for all bit-widths unless explicitly enabled. What do you think?

@harrisonGPU harrisonGPU changed the title [AMDGPU] Add function attribute to disable TBUFFER combine [AMDGPU] Add function attribute to control TBUFFER combining Sep 12, 2025
@jayfoad
Copy link
Contributor

jayfoad commented Sep 12, 2025

This attribute is false by default, so TBUFFER combining is disabled for all bit-widths unless explicitly enabled.

I am not completely up to speed on this, but I thought combining dword loads into multi-dword loads was always OK, because bounds checking checks each dword individually. Hopefully @piotrAMD will correct me if I am wrong.

@krzysz00
Copy link
Contributor

Can I get more clarity on what the problem with a module-level flag is?

@arsenm
Copy link
Contributor

arsenm commented Sep 15, 2025

I think the only issue with a module flag would be if you care about having different entry points in the same module with different behavior. If this is an attribute this needs more work to properly handling inlining with mixed modes

Also I'm still not clear how this differs from the the relaxed-oob-mode feature. Can these be merged into one? In any case I'd like to see those use identical mechanisms before introducing another knob

@piotrAMD
Copy link
Collaborator

This attribute is false by default, so TBUFFER combining is disabled for all bit-widths unless explicitly enabled.

I am not completely up to speed on this, but I thought combining dword loads into multi-dword loads was always OK, because bounds checking checks each dword individually. Hopefully @piotrAMD will correct me if I am wrong.

Typed buffer loads are not range-checked per dword, they are range-checked "all or nothing".

@piotrAMD
Copy link
Collaborator

piotrAMD commented Sep 15, 2025

Also I'm still not clear how this differs from the the relaxed-oob-mode feature. Can these be merged into one? In any case I'd like to see those use identical mechanisms before introducing another knob

This is similar to the existing relaxed-oob-mode feature in that it controls oob merging behaviour. The behavior for untyped buffer merging (in llvm-ir) and this tbuffer merging in mir are driven by different language features, so we need to be able to select them individually.
When I added relaxed-oob-mode I was not aware about the tbuffer problem that Harrison found.

I guess having a single mechanism like multistate function attribute/module flag would be ok, but there is this problem that the function that uses relaxed-oob-mode currently (AllowsMisalignedMemoryAccesses()) has no access to "Function", and I am not sure what the best solution for that one is.

@krzysz00
Copy link
Contributor

@piotrAMD More fundamental question: is the strictness/relaxed-ness something that's enabled per-compilation (that is, per-module) or, at the SPIR-V level, can in differ per-function ?

If you can't have functions with different strictness requirements in ore program ... both of these should be module flags, and AllowsMisalignedMemoryAccesses() should be changed to take a Module* (or a Function*) as appropriate.

@harrisonGPU
Copy link
Contributor Author

harrisonGPU commented Sep 20, 2025

Hi, I’ve switched to using a module flag to control RelaxedTBufferOOBMode. When the mask has bit 0x2 set, RelaxedTBufferOOBMode is enabled.

@harrisonGPU harrisonGPU changed the title [AMDGPU] Add function attribute to control TBUFFER combining [AMDGPU] Add module flag to control TBUFFER combining Sep 20, 2025
@piotrAMD
Copy link
Collaborator

Proposing a change to the existing oob mode (relaxed-buffer-oob-mode) to use module flag instead of subtarget feature - just a draft to canvas opinions: #160922.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants