Conversation

shiltian
Contributor

No description provided.

@shiltian shiltian marked this pull request as ready for review August 14, 2024 00:41
@shiltian shiltian requested review from arsenm and changpeng and removed request for arsenm August 14, 2024 00:42

@shiltian shiltian requested a review from arsenm August 14, 2024 00:42
@shiltian shiltian requested a review from kerbowa August 14, 2024 00:42
@llvmbot
Member

llvmbot commented Aug 14, 2024

@llvm/pr-subscribers-llvm-transforms
@llvm/pr-subscribers-llvm-globalisel

@llvm/pr-subscribers-backend-amdgpu

Author: Shilei Tian (shiltian)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/103490.diff

4 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+6-1)
  • (modified) llvm/test/CodeGen/AMDGPU/abi-attribute-hints-undefined-behavior.ll (+72-32)
  • (modified) llvm/test/CodeGen/AMDGPU/callee-special-input-sgprs-fixed-abi.ll (+2-1)
  • (modified) llvm/test/CodeGen/AMDGPU/implicit-kernarg-backend-usage.ll (+14-8)
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 86fc100f1c2da0..52f93472eac206 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -8414,8 +8414,13 @@ SDValue SITargetLowering::LowerINTRINSIC_WO_CHAIN(SDValue Op,
     return getPreloadedValue(DAG, *MFI, VT,
                              AMDGPUFunctionArgInfo::IMPLICIT_BUFFER_PTR);
   }
-  case Intrinsic::amdgcn_dispatch_ptr:
   case Intrinsic::amdgcn_queue_ptr: {
+    const Module *M = DAG.getMachineFunction().getFunction().getParent();
+    if (AMDGPU::getAMDHSACodeObjectVersion(*M) >= AMDGPU::AMDHSA_COV5)
+      return loadImplicitKernelArgument(DAG, MVT::i64, DL, Align(8), QUEUE_PTR);
+    [[fallthrough]];
+  }
+  case Intrinsic::amdgcn_dispatch_ptr: {
     if (!Subtarget->isAmdHsaOrMesa(MF.getFunction())) {
       DiagnosticInfoUnsupported BadIntrin(
           MF.getFunction(), "unsupported hsa intrinsic without hsa target",
diff --git a/llvm/test/CodeGen/AMDGPU/abi-attribute-hints-undefined-behavior.ll b/llvm/test/CodeGen/AMDGPU/abi-attribute-hints-undefined-behavior.ll
index e53653408feb40..dcbe3363f5874a 100644
--- a/llvm/test/CodeGen/AMDGPU/abi-attribute-hints-undefined-behavior.ll
+++ b/llvm/test/CodeGen/AMDGPU/abi-attribute-hints-undefined-behavior.ll
@@ -204,26 +204,50 @@ define amdgpu_kernel void @marked_kernel_use_workgroup_id(ptr addrspace(1) %ptr)
 }
 
 define void @marked_func_use_other_sgpr(ptr addrspace(1) %ptr) #0 {
-; FIXEDABI-LABEL: marked_func_use_other_sgpr:
-; FIXEDABI:       ; %bb.0:
-; FIXEDABI-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; FIXEDABI-NEXT:    v_mov_b32_e32 v2, s6
-; FIXEDABI-NEXT:    v_mov_b32_e32 v3, s7
-; FIXEDABI-NEXT:    flat_load_ubyte v2, v[2:3] glc
-; FIXEDABI-NEXT:    s_waitcnt vmcnt(0)
-; FIXEDABI-NEXT:    v_mov_b32_e32 v2, s8
-; FIXEDABI-NEXT:    v_mov_b32_e32 v3, s9
-; FIXEDABI-NEXT:    flat_load_ubyte v2, v[2:3] glc
-; FIXEDABI-NEXT:    s_waitcnt vmcnt(0)
-; FIXEDABI-NEXT:    v_mov_b32_e32 v2, s4
-; FIXEDABI-NEXT:    v_mov_b32_e32 v3, s5
-; FIXEDABI-NEXT:    flat_load_ubyte v2, v[2:3] glc
-; FIXEDABI-NEXT:    s_waitcnt vmcnt(0)
-; FIXEDABI-NEXT:    v_mov_b32_e32 v2, s10
-; FIXEDABI-NEXT:    v_mov_b32_e32 v3, s11
-; FIXEDABI-NEXT:    flat_store_dwordx2 v[0:1], v[2:3]
-; FIXEDABI-NEXT:    s_waitcnt vmcnt(0)
-; FIXEDABI-NEXT:    s_setpc_b64 s[30:31]
+; FIXEDABI-SDAG-LABEL: marked_func_use_other_sgpr:
+; FIXEDABI-SDAG:       ; %bb.0:
+; FIXEDABI-SDAG-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; FIXEDABI-SDAG-NEXT:    s_mov_b64 s[6:7], 0xc8
+; FIXEDABI-SDAG-NEXT:    s_load_dwordx2 s[6:7], s[6:7], 0x0
+; FIXEDABI-SDAG-NEXT:    s_waitcnt lgkmcnt(0)
+; FIXEDABI-SDAG-NEXT:    v_mov_b32_e32 v2, s6
+; FIXEDABI-SDAG-NEXT:    v_mov_b32_e32 v3, s7
+; FIXEDABI-SDAG-NEXT:    flat_load_ubyte v2, v[2:3] glc
+; FIXEDABI-SDAG-NEXT:    s_waitcnt vmcnt(0)
+; FIXEDABI-SDAG-NEXT:    v_mov_b32_e32 v2, s8
+; FIXEDABI-SDAG-NEXT:    v_mov_b32_e32 v3, s9
+; FIXEDABI-SDAG-NEXT:    flat_load_ubyte v2, v[2:3] glc
+; FIXEDABI-SDAG-NEXT:    s_waitcnt vmcnt(0)
+; FIXEDABI-SDAG-NEXT:    v_mov_b32_e32 v2, s4
+; FIXEDABI-SDAG-NEXT:    v_mov_b32_e32 v3, s5
+; FIXEDABI-SDAG-NEXT:    flat_load_ubyte v2, v[2:3] glc
+; FIXEDABI-SDAG-NEXT:    s_waitcnt vmcnt(0)
+; FIXEDABI-SDAG-NEXT:    v_mov_b32_e32 v2, s10
+; FIXEDABI-SDAG-NEXT:    v_mov_b32_e32 v3, s11
+; FIXEDABI-SDAG-NEXT:    flat_store_dwordx2 v[0:1], v[2:3]
+; FIXEDABI-SDAG-NEXT:    s_waitcnt vmcnt(0)
+; FIXEDABI-SDAG-NEXT:    s_setpc_b64 s[30:31]
+;
+; FIXEDABI-GISEL-LABEL: marked_func_use_other_sgpr:
+; FIXEDABI-GISEL:       ; %bb.0:
+; FIXEDABI-GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; FIXEDABI-GISEL-NEXT:    v_mov_b32_e32 v2, s6
+; FIXEDABI-GISEL-NEXT:    v_mov_b32_e32 v3, s7
+; FIXEDABI-GISEL-NEXT:    flat_load_ubyte v2, v[2:3] glc
+; FIXEDABI-GISEL-NEXT:    s_waitcnt vmcnt(0)
+; FIXEDABI-GISEL-NEXT:    v_mov_b32_e32 v2, s8
+; FIXEDABI-GISEL-NEXT:    v_mov_b32_e32 v3, s9
+; FIXEDABI-GISEL-NEXT:    flat_load_ubyte v2, v[2:3] glc
+; FIXEDABI-GISEL-NEXT:    s_waitcnt vmcnt(0)
+; FIXEDABI-GISEL-NEXT:    v_mov_b32_e32 v2, s4
+; FIXEDABI-GISEL-NEXT:    v_mov_b32_e32 v3, s5
+; FIXEDABI-GISEL-NEXT:    flat_load_ubyte v2, v[2:3] glc
+; FIXEDABI-GISEL-NEXT:    s_waitcnt vmcnt(0)
+; FIXEDABI-GISEL-NEXT:    v_mov_b32_e32 v2, s10
+; FIXEDABI-GISEL-NEXT:    v_mov_b32_e32 v3, s11
+; FIXEDABI-GISEL-NEXT:    flat_store_dwordx2 v[0:1], v[2:3]
+; FIXEDABI-GISEL-NEXT:    s_waitcnt vmcnt(0)
+; FIXEDABI-GISEL-NEXT:    s_setpc_b64 s[30:31]
   %queue.ptr = call ptr addrspace(4) @llvm.amdgcn.queue.ptr()
   %implicitarg.ptr = call ptr addrspace(4) @llvm.amdgcn.implicitarg.ptr()
   %dispatch.ptr = call ptr addrspace(4) @llvm.amdgcn.dispatch.ptr()
@@ -236,18 +260,34 @@ define void @marked_func_use_other_sgpr(ptr addrspace(1) %ptr) #0 {
 }
 
 define amdgpu_kernel void @marked_kernel_use_other_sgpr(ptr addrspace(1) %ptr) #0 {
-; FIXEDABI-LABEL: marked_kernel_use_other_sgpr:
-; FIXEDABI:       ; %bb.0:
-; FIXEDABI-NEXT:    s_add_u32 s0, s4, 8
-; FIXEDABI-NEXT:    flat_load_ubyte v0, v[0:1] glc
-; FIXEDABI-NEXT:    s_addc_u32 s1, s5, 0
-; FIXEDABI-NEXT:    s_waitcnt vmcnt(0)
-; FIXEDABI-NEXT:    v_mov_b32_e32 v0, s0
-; FIXEDABI-NEXT:    v_mov_b32_e32 v1, s1
-; FIXEDABI-NEXT:    flat_load_ubyte v0, v[0:1] glc
-; FIXEDABI-NEXT:    s_waitcnt vmcnt(0)
-; FIXEDABI-NEXT:    flat_load_ubyte v0, v[0:1] glc
-; FIXEDABI-NEXT:    s_endpgm
+; FIXEDABI-SDAG-LABEL: marked_kernel_use_other_sgpr:
+; FIXEDABI-SDAG:       ; %bb.0:
+; FIXEDABI-SDAG-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0xd0
+; FIXEDABI-SDAG-NEXT:    s_add_u32 s2, s4, 8
+; FIXEDABI-SDAG-NEXT:    s_addc_u32 s3, s5, 0
+; FIXEDABI-SDAG-NEXT:    v_mov_b32_e32 v0, s2
+; FIXEDABI-SDAG-NEXT:    v_mov_b32_e32 v1, s3
+; FIXEDABI-SDAG-NEXT:    s_waitcnt lgkmcnt(0)
+; FIXEDABI-SDAG-NEXT:    v_mov_b32_e32 v3, s1
+; FIXEDABI-SDAG-NEXT:    v_mov_b32_e32 v2, s0
+; FIXEDABI-SDAG-NEXT:    flat_load_ubyte v2, v[2:3] glc
+; FIXEDABI-SDAG-NEXT:    flat_load_ubyte v0, v[0:1] glc
+; FIXEDABI-SDAG-NEXT:    s_waitcnt vmcnt(0)
+; FIXEDABI-SDAG-NEXT:    flat_load_ubyte v0, v[0:1] glc
+; FIXEDABI-SDAG-NEXT:    s_endpgm
+;
+; FIXEDABI-GISEL-LABEL: marked_kernel_use_other_sgpr:
+; FIXEDABI-GISEL:       ; %bb.0:
+; FIXEDABI-GISEL-NEXT:    s_add_u32 s0, s4, 8
+; FIXEDABI-GISEL-NEXT:    flat_load_ubyte v0, v[0:1] glc
+; FIXEDABI-GISEL-NEXT:    s_addc_u32 s1, s5, 0
+; FIXEDABI-GISEL-NEXT:    s_waitcnt vmcnt(0)
+; FIXEDABI-GISEL-NEXT:    v_mov_b32_e32 v0, s0
+; FIXEDABI-GISEL-NEXT:    v_mov_b32_e32 v1, s1
+; FIXEDABI-GISEL-NEXT:    flat_load_ubyte v0, v[0:1] glc
+; FIXEDABI-GISEL-NEXT:    s_waitcnt vmcnt(0)
+; FIXEDABI-GISEL-NEXT:    flat_load_ubyte v0, v[0:1] glc
+; FIXEDABI-GISEL-NEXT:    s_endpgm
   %queue.ptr = call ptr addrspace(4) @llvm.amdgcn.queue.ptr()
   %implicitarg.ptr = call ptr addrspace(4) @llvm.amdgcn.implicitarg.ptr()
   %dispatch.ptr = call ptr addrspace(4) @llvm.amdgcn.dispatch.ptr()
diff --git a/llvm/test/CodeGen/AMDGPU/callee-special-input-sgprs-fixed-abi.ll b/llvm/test/CodeGen/AMDGPU/callee-special-input-sgprs-fixed-abi.ll
index 032ec65fa85133..c2c6670c3ecacd 100644
--- a/llvm/test/CodeGen/AMDGPU/callee-special-input-sgprs-fixed-abi.ll
+++ b/llvm/test/CodeGen/AMDGPU/callee-special-input-sgprs-fixed-abi.ll
@@ -22,7 +22,8 @@ define amdgpu_kernel void @kern_indirect_use_dispatch_ptr(i32) #1 {
 }
 
 ; GCN-LABEL: {{^}}use_queue_ptr:
-; GCN: s_load_dword s{{[0-9]+}}, s[6:7]
+; GCN: s_mov_b64 s[4:5], 0xc8
+; GCN-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x0
 define hidden void @use_queue_ptr() #1 {
   %queue_ptr = call noalias ptr addrspace(4) @llvm.amdgcn.queue.ptr() #0
   %value = load volatile i32, ptr addrspace(4) %queue_ptr
diff --git a/llvm/test/CodeGen/AMDGPU/implicit-kernarg-backend-usage.ll b/llvm/test/CodeGen/AMDGPU/implicit-kernarg-backend-usage.ll
index b89dbd42e0466f..8999b8f0d07678 100644
--- a/llvm/test/CodeGen/AMDGPU/implicit-kernarg-backend-usage.ll
+++ b/llvm/test/CodeGen/AMDGPU/implicit-kernarg-backend-usage.ll
@@ -287,20 +287,24 @@ define amdgpu_kernel void @llvm_amdgcn_queue_ptr(ptr addrspace(1) %ptr)  {
 ;
 ; GFX8V5-LABEL: llvm_amdgcn_queue_ptr:
 ; GFX8V5:       ; %bb.0:
-; GFX8V5-NEXT:    s_add_u32 s0, s6, 8
-; GFX8V5-NEXT:    flat_load_ubyte v0, v[0:1] glc
-; GFX8V5-NEXT:    s_addc_u32 s1, s7, 0
-; GFX8V5-NEXT:    s_waitcnt vmcnt(0)
+; GFX8V5-NEXT:    s_load_dwordx2 s[0:1], s[6:7], 0xd0
+; GFX8V5-NEXT:    s_add_u32 s2, s6, 8
+; GFX8V5-NEXT:    s_addc_u32 s3, s7, 0
+; GFX8V5-NEXT:    v_mov_b32_e32 v2, s8
+; GFX8V5-NEXT:    v_mov_b32_e32 v3, s9
+; GFX8V5-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX8V5-NEXT:    v_mov_b32_e32 v0, s0
 ; GFX8V5-NEXT:    v_mov_b32_e32 v1, s1
 ; GFX8V5-NEXT:    flat_load_ubyte v0, v[0:1] glc
 ; GFX8V5-NEXT:    s_waitcnt vmcnt(0)
+; GFX8V5-NEXT:    v_mov_b32_e32 v0, s2
+; GFX8V5-NEXT:    v_mov_b32_e32 v1, s3
+; GFX8V5-NEXT:    flat_load_ubyte v0, v[0:1] glc
+; GFX8V5-NEXT:    s_waitcnt vmcnt(0)
 ; GFX8V5-NEXT:    v_mov_b32_e32 v0, s4
 ; GFX8V5-NEXT:    v_mov_b32_e32 v1, s5
 ; GFX8V5-NEXT:    flat_load_ubyte v0, v[0:1] glc
 ; GFX8V5-NEXT:    s_load_dwordx2 s[0:1], s[6:7], 0x0
-; GFX8V5-NEXT:    v_mov_b32_e32 v2, s8
-; GFX8V5-NEXT:    v_mov_b32_e32 v3, s9
 ; GFX8V5-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; GFX8V5-NEXT:    v_mov_b32_e32 v0, s0
 ; GFX8V5-NEXT:    v_mov_b32_e32 v1, s1
@@ -327,16 +331,18 @@ define amdgpu_kernel void @llvm_amdgcn_queue_ptr(ptr addrspace(1) %ptr)  {
 ;
 ; GFX9V5-LABEL: llvm_amdgcn_queue_ptr:
 ; GFX9V5:       ; %bb.0:
+; GFX9V5-NEXT:    s_load_dwordx2 s[0:1], s[6:7], 0xd0
 ; GFX9V5-NEXT:    v_mov_b32_e32 v2, 0
+; GFX9V5-NEXT:    ; kill: killed $sgpr0_sgpr1
+; GFX9V5-NEXT:    ; kill: killed $sgpr4_sgpr5
+; GFX9V5-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX9V5-NEXT:    global_load_ubyte v0, v2, s[0:1] glc
 ; GFX9V5-NEXT:    global_load_ubyte v0, v2, s[6:7] offset:8 glc
 ; GFX9V5-NEXT:    global_load_ubyte v0, v2, s[4:5] glc
-; GFX9V5-NEXT:    ; kill: killed $sgpr0_sgpr1
 ; GFX9V5-NEXT:    s_load_dwordx2 s[0:1], s[6:7], 0x0
 ; GFX9V5-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9V5-NEXT:    v_mov_b32_e32 v0, s8
 ; GFX9V5-NEXT:    v_mov_b32_e32 v1, s9
-; GFX9V5-NEXT:    ; kill: killed $sgpr4_sgpr5
 ; GFX9V5-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX9V5-NEXT:    global_store_dwordx2 v2, v[0:1], s[0:1]
 ; GFX9V5-NEXT:    s_waitcnt vmcnt(0)

@shiltian shiltian force-pushed the users/shiltian/lower-queue-ptr-to-implicit-kernel-arg branch from be06d03 to f3ca3ef Compare August 14, 2024 03:45
Contributor

@arsenm arsenm left a comment

I think the intrinsic should be interpreted as the user SGPR value. There is already a representation for load from implicit argument pointer, and we don't need 2 ways of doing that.

@shiltian
Contributor Author

There is already a representation for load from implicit argument pointer, and we don't need 2 ways of doing that.

Which is the representation for loading queue ptr from implicit kernel argument?

@arsenm
Contributor

arsenm commented Aug 14, 2024

There is already a representation for load from implicit argument pointer, and we don't need 2 ways of doing that.

Which is the representation for loading queue ptr from implicit kernel argument?

Call the implicitarg ptr intrinsic and load from that offset
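[Editor's note] The pattern described here can be written directly in IR. The following is a sketch, not code from this PR; the 200-byte (0xc8) offset is taken from the addresses appearing in the updated test checks below and is an assumption about where hidden_queue_ptr lives among the COV5 implicit arguments:

```llvm
; Sketch: obtain the queue pointer via the implicit kernel arguments
; instead of the preloaded user SGPR pair. The 200-byte offset is
; illustrative, matching the 0xc8 seen in the updated tests.
define ptr addrspace(4) @load_queue_ptr_from_implicitarg() {
  %implicitarg = call ptr addrspace(4) @llvm.amdgcn.implicitarg.ptr()
  %gep = getelementptr inbounds i8, ptr addrspace(4) %implicitarg, i64 200
  %queue.ptr = load ptr addrspace(4), ptr addrspace(4) %gep, align 8
  ret ptr addrspace(4) %queue.ptr
}

declare ptr addrspace(4) @llvm.amdgcn.implicitarg.ptr()
```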

@shiltian
Contributor Author

That is an indirect way of doing that.

What this patch does allows us not to set .hidden_queue_ptr unnecessarily. Currently we set it whenever the function attribute amdgpu-no-queue-ptr is absent, even for GFX9+ with COV5+. Since we already handle the uses of the queue pointer for the aperture bases and trap handling correctly based on the COV, it should be safe not to set it. However, we still do, probably based on the assumption that a function can call the intrinsic, since we lower it to read the SGPRs in any case. If we can lower it to an implicit kernel argument load for COV5+, we can safely drop .hidden_queue_ptr for COV5+.
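[Editor's note] .hidden_queue_ptr refers to an entry in the kernel's .args list in the COV5 code-object metadata. A hypothetical excerpt (field names follow the AMDHSA metadata scheme; the offset is illustrative, as it depends on the other hidden arguments emitted before it):

```yaml
# Hypothetical excerpt of a kernel's .args metadata under code object v5.
# .value_kind: hidden_queue_ptr is the field being discussed here.
- .address_space: global
  .offset: 200
  .size: 8
  .value_kind: hidden_queue_ptr
```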

@kerbowa
Member

kerbowa commented Aug 15, 2024

That is an indirect way of doing that.

What this patch does allows us not to set .hidden_queue_ptr unnecessarily. Currently we set it whenever the function attribute amdgpu-no-queue-ptr is absent, even for GFX9+ with COV5+. Since we already handle the uses of the queue pointer for the aperture bases and trap handling correctly based on the COV, it should be safe not to set it. However, we still do, probably based on the assumption that a function can call the intrinsic, since we lower it to read the SGPRs in any case. If we can lower it to an implicit kernel argument load for COV5+, we can safely drop .hidden_queue_ptr for COV5+.

What do you mean by .hidden_queue_ptr, the preloaded SGPR pair? That should be set with the KD field ENABLE_SGPR_QUEUE_PTR.

Currently the lowering for this is totally broken even without this patch. There is an explicit check for COV5 for whether to allocate SGPRs for the queue_ptr and we are not allocating anything. So we load from some random incorrect SGPRs as far as I can tell.

Member

@kerbowa kerbowa left a comment

It looks like the intention of @changpeng's COV5 patches was that we should never allocate or access the user SGPR queue_ptr.

AMDGPU::getAMDHSACodeObjectVersion(*M) < AMDGPU::AMDHSA_COV5)

Otherwise if we go with Matt's suggestion to make the intrinsic mean the preloaded SGPR then the allocation needs to be fixed.

@changpeng
Contributor

It looks like the intention of @changpeng's COV5 patches was that we should never allocate or access the user SGPR queue_ptr.

AMDGPU::getAMDHSACodeObjectVersion(*M) < AMDGPU::AMDHSA_COV5)

Otherwise if we go with Matt's suggestion to make the intrinsic mean the preloaded SGPR then the allocation needs to be fixed.

Right. We should only use the implicitarg field for queue_ptr for COV5. I remember I had to update many LIT tests due to reduced SGPR usage.

@arsenm
Contributor

arsenm commented Aug 16, 2024

That is an indirect way of doing that.

Yes, that's my point, it's redundant.

What this patch does allows us not to set .hidden_queue_ptr unnecessarily. Currently we set it whenever the function attribute amdgpu-no-queue-ptr is absent, even for GFX9+ with COV5+.

Right. We should only use the implicitarg field for queue_ptr for COV5. I remember I had to update many LIT tests due to reduced SGPR usage.

But you still can use the queue pointer; it's not gone. The logic should be able to differentiate between "should not use the queue pointer" and "the queue pointer does not exist".

@shiltian
Contributor Author

shiltian commented Aug 16, 2024

But you still can use the queue pointer; it's not gone. The logic should be able to differentiate between "should not use the queue pointer" and "the queue pointer does not exist".

But the thing is, the two SGPRs are never set in COV5. If you still want to use them, you basically have to revert the patches from @changpeng that completely disabled allocation of the two SGPRs in COV5.

@changpeng
Contributor

changpeng commented Aug 16, 2024

But the thing is, the two SGPRs are never set in COV5. If you still want to use them, you basically have to revert the patches from @changpeng that completely disabled allocation of the two SGPRs in COV5.

COV5 introduced a queue_ptr field in the implicit kernargs, so there is no reason to allocate SGPRs. For gfx9+, the doc implies that the queue_ptr field does not exist. However, the intrinsic has to return the queue_ptr. (My first impression is that the intrinsic is no longer needed and should never be invoked.) Assuming we have to keep this intrinsic, I would suggest we update the doc to say that the queue_ptr field is still needed when a call to the intrinsic exists.

@arsenm
Contributor

arsenm commented Aug 16, 2024

But the thing is, the two SGPRs are never set in COV5. If you still want to use them, you basically have to revert the patches from @changpeng that completely disabled allocation of the two SGPRs in COV5.

Yes. It is still possible to use the queue pointer. You could still want to use the queue pointer as a preloaded argument.

@shiltian
Contributor Author

Let me try to make things clear here. The current situation (without this patch) is:

  • The intrinsic is lowered to read s[6:7], regardless of GFX target and COV. This assumes s[6:7] holds the value, which means the kernel descriptor needs to be set accordingly, regardless of GFX target and COV.
  • @changpeng's previous patches removed the setup of the kernel descriptor for COV5+. This means that if the builtin/intrinsic is used, it is completely broken right now.

Given this, there are two approaches to fix this.

  1. Change the lowering of the intrinsic to align with @changpeng's patches, i.e. lower the intrinsic to a read from the implicit kernel arguments. This is what this patch is trying to do.
  2. Set up s[6:7] regardless of the COV, as long as the builtin/intrinsic is used (explicitly or implicitly). This is what @arsenm is proposing.

Let's compare the two approaches.

For 1, we can safely assume s[6:7] never needs to be set up for the queue pointer for COV5+, regardless of whether the function attribute amdgpu-no-queue-ptr is present. This means that no matter whether the optimization is on or off, and no matter whether the optimization can add the function attribute (e.g. for an indirect function call that can't be resolved at compile time), we always have the two SGPRs available for other uses.

For 2, we need to revert some of @changpeng's changes so that the kernel descriptor is set up if the intrinsic is used explicitly or implicitly. By "implicitly" I mean the case where we can't resolve an indirect function call at compile time and therefore have to assume the intrinsic might be used. This means that even for COV5+, if the optimization is off or can't work as expected, we lose the two SGPRs, whether or not they are really used.

IMHO, I don't see the win of 2. It might follow the "convention" that an intrinsic reads from registers, but we have many intrinsics that don't read from registers. The reason it read from registers in the past is that there was no other way.

Did I misunderstand anything here? @arsenm @kerbowa @changpeng

@arsenm
Contributor

arsenm commented Aug 19, 2024

My expectation is that for COV5, llvm.amdgcn.queue.ptr would not be used in the first place. If it were used, I would expect it to enable the same user SGPR input as it always did. Lowering it to a load from the implicit arguments is a strange midpoint that should not occur in real code.

IMHO, I don't see the win of 2

Multiple representations of the same thing are bad. When we detect uses of implicit argument pointers, now we have an extra special case to consider.

So I would just revert the changes that broke the queue pointer intrinsic.

@shiltian shiltian force-pushed the users/shiltian/lower-queue-ptr-to-implicit-kernel-arg branch from f3ca3ef to 4f5ef5f Compare September 3, 2024 03:50
@shiltian shiltian changed the title [AMDGPU] Lower llvm.amdgcn.queue.ptr intrinsic to using implicit kernel argument if feasible [AMDGPU] Still set up s[6:7] for COV5 if queue ptr is needed Sep 3, 2024
@shiltian
Contributor Author

shiltian commented Sep 3, 2024

I'm still working on updating test cases. This is WIP.

@shiltian shiltian changed the title [AMDGPU] Still set up s[6:7] for COV5 if queue ptr is needed [AMDGPU] Still set up the two SGPRs for queue ptr even if it is COV5 Sep 3, 2024
Contributor

@arsenm arsenm left a comment

lgtm with the tests all updated

@shiltian
Contributor Author

Closing this one. Will do it in #112403.

@shiltian shiltian closed this Oct 15, 2024
@shiltian shiltian deleted the users/shiltian/lower-queue-ptr-to-implicit-kernel-arg branch October 15, 2024 17:01