[AMDGPU] Still set up the two SGPRs for queue ptr even it is COV5 #103490
Conversation
This stack of pull requests is managed by Graphite.
@llvm/pr-subscribers-llvm-transforms @llvm/pr-subscribers-backend-amdgpu

Author: Shilei Tian (shiltian)

Full diff: https://github.com/llvm/llvm-project/pull/103490.diff

4 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 86fc100f1c2da0..52f93472eac206 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -8414,8 +8414,13 @@ SDValue SITargetLowering::LowerINTRINSIC_WO_CHAIN(SDValue Op,
return getPreloadedValue(DAG, *MFI, VT,
AMDGPUFunctionArgInfo::IMPLICIT_BUFFER_PTR);
}
- case Intrinsic::amdgcn_dispatch_ptr:
case Intrinsic::amdgcn_queue_ptr: {
+ const Module *M = DAG.getMachineFunction().getFunction().getParent();
+ if (AMDGPU::getAMDHSACodeObjectVersion(*M) >= AMDGPU::AMDHSA_COV5)
+ return loadImplicitKernelArgument(DAG, MVT::i64, DL, Align(8), QUEUE_PTR);
+ [[fallthrough]];
+ }
+ case Intrinsic::amdgcn_dispatch_ptr: {
if (!Subtarget->isAmdHsaOrMesa(MF.getFunction())) {
DiagnosticInfoUnsupported BadIntrin(
MF.getFunction(), "unsupported hsa intrinsic without hsa target",
diff --git a/llvm/test/CodeGen/AMDGPU/abi-attribute-hints-undefined-behavior.ll b/llvm/test/CodeGen/AMDGPU/abi-attribute-hints-undefined-behavior.ll
index e53653408feb40..dcbe3363f5874a 100644
--- a/llvm/test/CodeGen/AMDGPU/abi-attribute-hints-undefined-behavior.ll
+++ b/llvm/test/CodeGen/AMDGPU/abi-attribute-hints-undefined-behavior.ll
@@ -204,26 +204,50 @@ define amdgpu_kernel void @marked_kernel_use_workgroup_id(ptr addrspace(1) %ptr)
}
define void @marked_func_use_other_sgpr(ptr addrspace(1) %ptr) #0 {
-; FIXEDABI-LABEL: marked_func_use_other_sgpr:
-; FIXEDABI: ; %bb.0:
-; FIXEDABI-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; FIXEDABI-NEXT: v_mov_b32_e32 v2, s6
-; FIXEDABI-NEXT: v_mov_b32_e32 v3, s7
-; FIXEDABI-NEXT: flat_load_ubyte v2, v[2:3] glc
-; FIXEDABI-NEXT: s_waitcnt vmcnt(0)
-; FIXEDABI-NEXT: v_mov_b32_e32 v2, s8
-; FIXEDABI-NEXT: v_mov_b32_e32 v3, s9
-; FIXEDABI-NEXT: flat_load_ubyte v2, v[2:3] glc
-; FIXEDABI-NEXT: s_waitcnt vmcnt(0)
-; FIXEDABI-NEXT: v_mov_b32_e32 v2, s4
-; FIXEDABI-NEXT: v_mov_b32_e32 v3, s5
-; FIXEDABI-NEXT: flat_load_ubyte v2, v[2:3] glc
-; FIXEDABI-NEXT: s_waitcnt vmcnt(0)
-; FIXEDABI-NEXT: v_mov_b32_e32 v2, s10
-; FIXEDABI-NEXT: v_mov_b32_e32 v3, s11
-; FIXEDABI-NEXT: flat_store_dwordx2 v[0:1], v[2:3]
-; FIXEDABI-NEXT: s_waitcnt vmcnt(0)
-; FIXEDABI-NEXT: s_setpc_b64 s[30:31]
+; FIXEDABI-SDAG-LABEL: marked_func_use_other_sgpr:
+; FIXEDABI-SDAG: ; %bb.0:
+; FIXEDABI-SDAG-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; FIXEDABI-SDAG-NEXT: s_mov_b64 s[6:7], 0xc8
+; FIXEDABI-SDAG-NEXT: s_load_dwordx2 s[6:7], s[6:7], 0x0
+; FIXEDABI-SDAG-NEXT: s_waitcnt lgkmcnt(0)
+; FIXEDABI-SDAG-NEXT: v_mov_b32_e32 v2, s6
+; FIXEDABI-SDAG-NEXT: v_mov_b32_e32 v3, s7
+; FIXEDABI-SDAG-NEXT: flat_load_ubyte v2, v[2:3] glc
+; FIXEDABI-SDAG-NEXT: s_waitcnt vmcnt(0)
+; FIXEDABI-SDAG-NEXT: v_mov_b32_e32 v2, s8
+; FIXEDABI-SDAG-NEXT: v_mov_b32_e32 v3, s9
+; FIXEDABI-SDAG-NEXT: flat_load_ubyte v2, v[2:3] glc
+; FIXEDABI-SDAG-NEXT: s_waitcnt vmcnt(0)
+; FIXEDABI-SDAG-NEXT: v_mov_b32_e32 v2, s4
+; FIXEDABI-SDAG-NEXT: v_mov_b32_e32 v3, s5
+; FIXEDABI-SDAG-NEXT: flat_load_ubyte v2, v[2:3] glc
+; FIXEDABI-SDAG-NEXT: s_waitcnt vmcnt(0)
+; FIXEDABI-SDAG-NEXT: v_mov_b32_e32 v2, s10
+; FIXEDABI-SDAG-NEXT: v_mov_b32_e32 v3, s11
+; FIXEDABI-SDAG-NEXT: flat_store_dwordx2 v[0:1], v[2:3]
+; FIXEDABI-SDAG-NEXT: s_waitcnt vmcnt(0)
+; FIXEDABI-SDAG-NEXT: s_setpc_b64 s[30:31]
+;
+; FIXEDABI-GISEL-LABEL: marked_func_use_other_sgpr:
+; FIXEDABI-GISEL: ; %bb.0:
+; FIXEDABI-GISEL-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; FIXEDABI-GISEL-NEXT: v_mov_b32_e32 v2, s6
+; FIXEDABI-GISEL-NEXT: v_mov_b32_e32 v3, s7
+; FIXEDABI-GISEL-NEXT: flat_load_ubyte v2, v[2:3] glc
+; FIXEDABI-GISEL-NEXT: s_waitcnt vmcnt(0)
+; FIXEDABI-GISEL-NEXT: v_mov_b32_e32 v2, s8
+; FIXEDABI-GISEL-NEXT: v_mov_b32_e32 v3, s9
+; FIXEDABI-GISEL-NEXT: flat_load_ubyte v2, v[2:3] glc
+; FIXEDABI-GISEL-NEXT: s_waitcnt vmcnt(0)
+; FIXEDABI-GISEL-NEXT: v_mov_b32_e32 v2, s4
+; FIXEDABI-GISEL-NEXT: v_mov_b32_e32 v3, s5
+; FIXEDABI-GISEL-NEXT: flat_load_ubyte v2, v[2:3] glc
+; FIXEDABI-GISEL-NEXT: s_waitcnt vmcnt(0)
+; FIXEDABI-GISEL-NEXT: v_mov_b32_e32 v2, s10
+; FIXEDABI-GISEL-NEXT: v_mov_b32_e32 v3, s11
+; FIXEDABI-GISEL-NEXT: flat_store_dwordx2 v[0:1], v[2:3]
+; FIXEDABI-GISEL-NEXT: s_waitcnt vmcnt(0)
+; FIXEDABI-GISEL-NEXT: s_setpc_b64 s[30:31]
%queue.ptr = call ptr addrspace(4) @llvm.amdgcn.queue.ptr()
%implicitarg.ptr = call ptr addrspace(4) @llvm.amdgcn.implicitarg.ptr()
%dispatch.ptr = call ptr addrspace(4) @llvm.amdgcn.dispatch.ptr()
@@ -236,18 +260,34 @@ define void @marked_func_use_other_sgpr(ptr addrspace(1) %ptr) #0 {
}
define amdgpu_kernel void @marked_kernel_use_other_sgpr(ptr addrspace(1) %ptr) #0 {
-; FIXEDABI-LABEL: marked_kernel_use_other_sgpr:
-; FIXEDABI: ; %bb.0:
-; FIXEDABI-NEXT: s_add_u32 s0, s4, 8
-; FIXEDABI-NEXT: flat_load_ubyte v0, v[0:1] glc
-; FIXEDABI-NEXT: s_addc_u32 s1, s5, 0
-; FIXEDABI-NEXT: s_waitcnt vmcnt(0)
-; FIXEDABI-NEXT: v_mov_b32_e32 v0, s0
-; FIXEDABI-NEXT: v_mov_b32_e32 v1, s1
-; FIXEDABI-NEXT: flat_load_ubyte v0, v[0:1] glc
-; FIXEDABI-NEXT: s_waitcnt vmcnt(0)
-; FIXEDABI-NEXT: flat_load_ubyte v0, v[0:1] glc
-; FIXEDABI-NEXT: s_endpgm
+; FIXEDABI-SDAG-LABEL: marked_kernel_use_other_sgpr:
+; FIXEDABI-SDAG: ; %bb.0:
+; FIXEDABI-SDAG-NEXT: s_load_dwordx2 s[0:1], s[4:5], 0xd0
+; FIXEDABI-SDAG-NEXT: s_add_u32 s2, s4, 8
+; FIXEDABI-SDAG-NEXT: s_addc_u32 s3, s5, 0
+; FIXEDABI-SDAG-NEXT: v_mov_b32_e32 v0, s2
+; FIXEDABI-SDAG-NEXT: v_mov_b32_e32 v1, s3
+; FIXEDABI-SDAG-NEXT: s_waitcnt lgkmcnt(0)
+; FIXEDABI-SDAG-NEXT: v_mov_b32_e32 v3, s1
+; FIXEDABI-SDAG-NEXT: v_mov_b32_e32 v2, s0
+; FIXEDABI-SDAG-NEXT: flat_load_ubyte v2, v[2:3] glc
+; FIXEDABI-SDAG-NEXT: flat_load_ubyte v0, v[0:1] glc
+; FIXEDABI-SDAG-NEXT: s_waitcnt vmcnt(0)
+; FIXEDABI-SDAG-NEXT: flat_load_ubyte v0, v[0:1] glc
+; FIXEDABI-SDAG-NEXT: s_endpgm
+;
+; FIXEDABI-GISEL-LABEL: marked_kernel_use_other_sgpr:
+; FIXEDABI-GISEL: ; %bb.0:
+; FIXEDABI-GISEL-NEXT: s_add_u32 s0, s4, 8
+; FIXEDABI-GISEL-NEXT: flat_load_ubyte v0, v[0:1] glc
+; FIXEDABI-GISEL-NEXT: s_addc_u32 s1, s5, 0
+; FIXEDABI-GISEL-NEXT: s_waitcnt vmcnt(0)
+; FIXEDABI-GISEL-NEXT: v_mov_b32_e32 v0, s0
+; FIXEDABI-GISEL-NEXT: v_mov_b32_e32 v1, s1
+; FIXEDABI-GISEL-NEXT: flat_load_ubyte v0, v[0:1] glc
+; FIXEDABI-GISEL-NEXT: s_waitcnt vmcnt(0)
+; FIXEDABI-GISEL-NEXT: flat_load_ubyte v0, v[0:1] glc
+; FIXEDABI-GISEL-NEXT: s_endpgm
%queue.ptr = call ptr addrspace(4) @llvm.amdgcn.queue.ptr()
%implicitarg.ptr = call ptr addrspace(4) @llvm.amdgcn.implicitarg.ptr()
%dispatch.ptr = call ptr addrspace(4) @llvm.amdgcn.dispatch.ptr()
diff --git a/llvm/test/CodeGen/AMDGPU/callee-special-input-sgprs-fixed-abi.ll b/llvm/test/CodeGen/AMDGPU/callee-special-input-sgprs-fixed-abi.ll
index 032ec65fa85133..c2c6670c3ecacd 100644
--- a/llvm/test/CodeGen/AMDGPU/callee-special-input-sgprs-fixed-abi.ll
+++ b/llvm/test/CodeGen/AMDGPU/callee-special-input-sgprs-fixed-abi.ll
@@ -22,7 +22,8 @@ define amdgpu_kernel void @kern_indirect_use_dispatch_ptr(i32) #1 {
}
; GCN-LABEL: {{^}}use_queue_ptr:
-; GCN: s_load_dword s{{[0-9]+}}, s[6:7]
+; GCN: s_mov_b64 s[4:5], 0xc8
+; GCN-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x0
define hidden void @use_queue_ptr() #1 {
%queue_ptr = call noalias ptr addrspace(4) @llvm.amdgcn.queue.ptr() #0
%value = load volatile i32, ptr addrspace(4) %queue_ptr
diff --git a/llvm/test/CodeGen/AMDGPU/implicit-kernarg-backend-usage.ll b/llvm/test/CodeGen/AMDGPU/implicit-kernarg-backend-usage.ll
index b89dbd42e0466f..8999b8f0d07678 100644
--- a/llvm/test/CodeGen/AMDGPU/implicit-kernarg-backend-usage.ll
+++ b/llvm/test/CodeGen/AMDGPU/implicit-kernarg-backend-usage.ll
@@ -287,20 +287,24 @@ define amdgpu_kernel void @llvm_amdgcn_queue_ptr(ptr addrspace(1) %ptr) {
;
; GFX8V5-LABEL: llvm_amdgcn_queue_ptr:
; GFX8V5: ; %bb.0:
-; GFX8V5-NEXT: s_add_u32 s0, s6, 8
-; GFX8V5-NEXT: flat_load_ubyte v0, v[0:1] glc
-; GFX8V5-NEXT: s_addc_u32 s1, s7, 0
-; GFX8V5-NEXT: s_waitcnt vmcnt(0)
+; GFX8V5-NEXT: s_load_dwordx2 s[0:1], s[6:7], 0xd0
+; GFX8V5-NEXT: s_add_u32 s2, s6, 8
+; GFX8V5-NEXT: s_addc_u32 s3, s7, 0
+; GFX8V5-NEXT: v_mov_b32_e32 v2, s8
+; GFX8V5-NEXT: v_mov_b32_e32 v3, s9
+; GFX8V5-NEXT: s_waitcnt lgkmcnt(0)
; GFX8V5-NEXT: v_mov_b32_e32 v0, s0
; GFX8V5-NEXT: v_mov_b32_e32 v1, s1
; GFX8V5-NEXT: flat_load_ubyte v0, v[0:1] glc
; GFX8V5-NEXT: s_waitcnt vmcnt(0)
+; GFX8V5-NEXT: v_mov_b32_e32 v0, s2
+; GFX8V5-NEXT: v_mov_b32_e32 v1, s3
+; GFX8V5-NEXT: flat_load_ubyte v0, v[0:1] glc
+; GFX8V5-NEXT: s_waitcnt vmcnt(0)
; GFX8V5-NEXT: v_mov_b32_e32 v0, s4
; GFX8V5-NEXT: v_mov_b32_e32 v1, s5
; GFX8V5-NEXT: flat_load_ubyte v0, v[0:1] glc
; GFX8V5-NEXT: s_load_dwordx2 s[0:1], s[6:7], 0x0
-; GFX8V5-NEXT: v_mov_b32_e32 v2, s8
-; GFX8V5-NEXT: v_mov_b32_e32 v3, s9
; GFX8V5-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX8V5-NEXT: v_mov_b32_e32 v0, s0
; GFX8V5-NEXT: v_mov_b32_e32 v1, s1
@@ -327,16 +331,18 @@ define amdgpu_kernel void @llvm_amdgcn_queue_ptr(ptr addrspace(1) %ptr) {
;
; GFX9V5-LABEL: llvm_amdgcn_queue_ptr:
; GFX9V5: ; %bb.0:
+; GFX9V5-NEXT: s_load_dwordx2 s[0:1], s[6:7], 0xd0
; GFX9V5-NEXT: v_mov_b32_e32 v2, 0
+; GFX9V5-NEXT: ; kill: killed $sgpr0_sgpr1
+; GFX9V5-NEXT: ; kill: killed $sgpr4_sgpr5
+; GFX9V5-NEXT: s_waitcnt lgkmcnt(0)
; GFX9V5-NEXT: global_load_ubyte v0, v2, s[0:1] glc
; GFX9V5-NEXT: global_load_ubyte v0, v2, s[6:7] offset:8 glc
; GFX9V5-NEXT: global_load_ubyte v0, v2, s[4:5] glc
-; GFX9V5-NEXT: ; kill: killed $sgpr0_sgpr1
; GFX9V5-NEXT: s_load_dwordx2 s[0:1], s[6:7], 0x0
; GFX9V5-NEXT: s_waitcnt vmcnt(0)
; GFX9V5-NEXT: v_mov_b32_e32 v0, s8
; GFX9V5-NEXT: v_mov_b32_e32 v1, s9
-; GFX9V5-NEXT: ; kill: killed $sgpr4_sgpr5
; GFX9V5-NEXT: s_waitcnt lgkmcnt(0)
; GFX9V5-NEXT: global_store_dwordx2 v2, v[0:1], s[0:1]
; GFX9V5-NEXT: s_waitcnt vmcnt(0)
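At the IR level, the new COV5 branch in LowerINTRINSIC_WO_CHAIN makes @llvm.amdgcn.queue.ptr behave like a load from the implicit kernel arguments. Below is a minimal sketch of that equivalence, not code from the patch; it assumes the hidden_queue_ptr field sits at byte offset 200 of the COV5 implicit arguments, which is consistent with the 0xc8 constant and the 0xd0 offset (8 bytes of explicit kernarg + 200) in the test updates above.

```llvm
; Sketch only: what the COV5 lowering of @llvm.amdgcn.queue.ptr is
; equivalent to. The byte offset 200 (0xc8) for hidden_queue_ptr is an
; assumption based on the COV5 implicit-argument layout.
define amdgpu_kernel void @queue_ptr_cov5_sketch(ptr addrspace(1) %out) {
  ; Base of the implicit (hidden) kernel arguments.
  %implicitarg = call ptr addrspace(4) @llvm.amdgcn.implicitarg.ptr()
  ; hidden_queue_ptr is assumed to live 200 bytes into the implicit args.
  %qp.addr = getelementptr i8, ptr addrspace(4) %implicitarg, i64 200
  %qp = load ptr addrspace(4), ptr addrspace(4) %qp.addr, align 8
  store ptr addrspace(4) %qp, ptr addrspace(1) %out
  ret void
}

declare ptr addrspace(4) @llvm.amdgcn.implicitarg.ptr()
```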
Force-pushed from be06d03 to f3ca3ef.
I think the intrinsic should be interpreted as the user SGPR value. There is already a representation for a load from the implicit argument pointer, and we don't need two ways of doing that.
Which is the representation for loading the queue ptr from the implicit kernel argument?
Call the implicitarg ptr intrinsic and load from that offset.
That is an indirect way of doing that. What this patch is doing can allow us not to set .hidden_queue_ptr.
What do you mean by .hidden_queue_ptr, the preloaded SGPR pair? That should be set with the KD field ENABLE_SGPR_QUEUE_PTR. Currently the lowering for this is totally broken even without this patch. There is an explicit check for COV5 for whether to allocate SGPRs for the queue_ptr, and we are not allocating anything. So we load from some random incorrect SGPRs as far as I can tell.
It looks like the intention of @changpeng's COV5 patches was that we should never allocate or access the user SGPR queue_ptr.
AMDGPU::getAMDHSACodeObjectVersion(*M) < AMDGPU::AMDHSA_COV5)
Otherwise, if we go with Matt's suggestion to make the intrinsic mean the preloaded SGPR, then the allocation needs to be fixed.
Right. We should only use the implicitarg field for queue_ptr for COV5. I do remember I had to update many LIT tests due to reduced SGPR usage.
Yes, that's my point, it's redundant.
But you still can use the queue pointer; it's not gone. The logic should be able to differentiate between "should not use the queue pointer" and "the queue pointer does not exist".
But the thing is, the two SGPRs are never set in COV5. If you still want to use them, you basically have to revert the patches from @changpeng that disabled allocation of the two SGPRs completely in COV5.
COV5 introduced a queue_ptr field in the implicit kernarg, and there is no reason to allocate SGPRs. For gfx9+, the doc implies that the queue_ptr field does not exist. However, the intrinsic has to return the queue_ptr. (My first impression is that the intrinsic is no longer needed and should never be invoked.) Supposing we have to keep this intrinsic, I would suggest we update the doc to say that the queue_ptr field is still needed when an intrinsic call exists.
Yes. It is still possible to use the queue pointer. You could still want to use the queue pointer as a preloaded argument.
Let me try to make things clear here. The current situation (without this patch) is that for COV5 the two user SGPRs for the queue ptr are never allocated, yet the intrinsic lowering still reads them.

Given this, there are two approaches to fix this: (1) lower the intrinsic to a load from the hidden_queue_ptr implicit kernel argument, or (2) allocate the two SGPRs again and keep the intrinsic reading them.

Let's compare the two approaches. For 1, we can safely assume the implicit kernel argument is always there for COV5. For 2, we need to revert some changes by @changpeng that set the kernel descriptor if the intrinsic is used explicitly or implicitly. By "implicitly" I mean when at compile time we can't resolve an indirect function call, such that we need to assume the intrinsic might be used. This means, even for COV5+, if the optimization is off, or the optimization can't work as expected, we lose the two SGPRs, no matter whether they are really used or not. IMHO, I don't see the win of 2. It might follow the "convention" that an intrinsic needs to read from registers, but we have so many intrinsics that don't read from registers. The reason it read from a register in the past is that there was no other way. Did I misunderstand anything here? @arsenm @kerbowa @changpeng
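To make the cost of approach 2 concrete, here is a hypothetical sketch of the indirect-call case described above (kernel and parameter names are illustrative, not from the PR):

```llvm
; A COV5 kernel with an indirect call. The callee is unknown at compile
; time, so the backend cannot prove @llvm.amdgcn.queue.ptr is never used.
define amdgpu_kernel void @kern_indirect_sketch(ptr %callee) {
  call void %callee()
  ret void
}
```

Under approach 2, every such kernel must conservatively enable the queue-ptr user SGPRs (the ENABLE_SGPR_QUEUE_PTR kernel-descriptor field mentioned earlier), spending two SGPRs whether or not the callee ever reads them; under approach 1, the intrinsic becomes an implicit-kernarg load and no user SGPR pair needs to be reserved.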
My expectation is that for COV5, llvm.amdgcn.queue.ptr would not be used in the first place. If it were used, I would expect it to enable the same user SGPR input as it always did. Lowering it to a load from the implicit argument is a strange midpoint that should not occur in real code.
Multiple representations of the same thing are bad. When we detect uses of implicit argument pointers, now we have an extra special case to consider. So I would just revert the changes that broke the queue pointer intrinsic.
Force-pushed from f3ca3ef to 4f5ef5f.
Changed the title: "…llvm.amdgcn.queue.ptr intrinsic to using implicit kernel argument if feasible" → "…s[6:7] for COV5 if queue ptr is needed"
I'm still working on updating test cases. This is WIP.
lgtm with the tests all updated
Close this one. Will do it in #112403.