[AMDGPU][LSV] Restrict large vectors #92540

Open · piotrAMD wants to merge 5 commits into base: main
Conversation

piotrAMD (Collaborator) commented May 17, 2024

The patch restricts forming large (> 128b) vectors in the load-store vectorizer.

In graphics, loads in CONSTANT_ADDRESS space are primarily descriptor loads, and they already have the size required for the consuming instructions (128b, or 256b).

For buffer loads the natural size is at most 128b, as that corresponds to the vec4 type frequently used in buffer declarations in various graphics APIs (colors, positions, normals).

Using larger sizes can be problematic for later passes, as they may cause code motion issues, register fragmentation, and inefficient spills (most of these are really deficiencies in handling subregisters).

However, for cases where adjacent loads end up close together, there is a late pass, the load-store optimizer, that will often merge them.
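As an illustration of why the cap matters, here is a minimal sketch (not the actual LoadStoreVectorizer code; the helper name and structure are made up) of how the pass consults this TTI hook to limit the vectors it is allowed to form, which is why lowering the value to 128b for these address spaces prevents the >128b vectors:

#include "llvm/Analysis/TargetTransformInfo.h"

using namespace llvm;

// Hypothetical helper: the widest vector, in elements, the vectorizer may
// build for a chain of ElementBits-wide accesses in AddrSpace.
static unsigned maxVectorElements(const TargetTransformInfo &TTI,
                                  unsigned AddrSpace, unsigned ElementBits) {
  // With this patch the cap for CONSTANT_ADDRESS and the buffer address
  // spaces drops from 512 to 128 bits, so 32-bit elements vectorize to at
  // most <4 x i32> (one dwordx4 load) instead of <16 x i32>.
  unsigned VecRegBits = TTI.getLoadStoreVecRegBitWidth(AddrSpace);
  return VecRegBits / ElementBits;
}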

llvmbot commented May 17, 2024

@llvm/pr-subscribers-llvm-transforms
@llvm/pr-subscribers-backend-amdgpu

@llvm/pr-subscribers-llvm-globalisel

Author: Piotr Sobczak (piotrAMD)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/92540.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp (+6-5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-phis-no-lane-mask-merging.ll (+14-12)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
index 84320d296a037..9785e7d71bec2 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
@@ -364,14 +364,15 @@ unsigned GCNTTIImpl::getStoreVectorFactor(unsigned VF, unsigned StoreSize,
 }
 
 unsigned GCNTTIImpl::getLoadStoreVecRegBitWidth(unsigned AddrSpace) const {
-  if (AddrSpace == AMDGPUAS::GLOBAL_ADDRESS ||
-      AddrSpace == AMDGPUAS::CONSTANT_ADDRESS ||
+  if (AddrSpace == AMDGPUAS::GLOBAL_ADDRESS)
+    return 512;
+
+  if (AddrSpace == AMDGPUAS::CONSTANT_ADDRESS ||
       AddrSpace == AMDGPUAS::CONSTANT_ADDRESS_32BIT ||
       AddrSpace == AMDGPUAS::BUFFER_FAT_POINTER ||
       AddrSpace == AMDGPUAS::BUFFER_RESOURCE ||
-      AddrSpace == AMDGPUAS::BUFFER_STRIDED_POINTER) {
-    return 512;
-  }
+      AddrSpace == AMDGPUAS::BUFFER_STRIDED_POINTER)
+    return ST->isAmdPalOS() ? 128 : 512;
 
   if (AddrSpace == AMDGPUAS::PRIVATE_ADDRESS)
     return 8 * ST->getMaxPrivateElementSize();
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-phis-no-lane-mask-merging.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-phis-no-lane-mask-merging.ll
index d4d5cb18bbd30..6930e0d809177 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-phis-no-lane-mask-merging.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-phis-no-lane-mask-merging.ll
@@ -227,14 +227,14 @@ define amdgpu_cs void @single_lane_execution_attribute(i32 inreg %.userdata0, <3
 ; GFX10-LABEL: single_lane_execution_attribute:
 ; GFX10:       ; %bb.0: ; %.entry
 ; GFX10-NEXT:    s_getpc_b64 s[4:5]
-; GFX10-NEXT:    s_mov_b32 s12, 0
-; GFX10-NEXT:    s_mov_b32 s13, -1
+; GFX10-NEXT:    s_mov_b32 s8, 0
+; GFX10-NEXT:    s_mov_b32 s9, -1
 ; GFX10-NEXT:    s_mov_b32 s2, s0
-; GFX10-NEXT:    s_and_b64 s[4:5], s[4:5], s[12:13]
-; GFX10-NEXT:    s_mov_b32 s3, s12
+; GFX10-NEXT:    s_and_b64 s[4:5], s[4:5], s[8:9]
+; GFX10-NEXT:    s_mov_b32 s3, s8
 ; GFX10-NEXT:    v_mbcnt_lo_u32_b32 v1, -1, 0
 ; GFX10-NEXT:    s_or_b64 s[2:3], s[4:5], s[2:3]
-; GFX10-NEXT:    s_load_dwordx8 s[4:11], s[2:3], 0x0
+; GFX10-NEXT:    s_load_dwordx4 s[4:7], s[2:3], 0x0
 ; GFX10-NEXT:    v_mbcnt_hi_u32_b32 v1, -1, v1
 ; GFX10-NEXT:    v_lshlrev_b32_e32 v2, 2, v1
 ; GFX10-NEXT:    v_and_b32_e32 v3, 1, v1
@@ -248,8 +248,8 @@ define amdgpu_cs void @single_lane_execution_attribute(i32 inreg %.userdata0, <3
 ; GFX10-NEXT:    v_cmp_eq_u32_e64 s0, 0, v2
 ; GFX10-NEXT:    s_cbranch_vccnz .LBB4_4
 ; GFX10-NEXT:  ; %bb.1: ; %.preheader.preheader
-; GFX10-NEXT:    v_mov_b32_e32 v3, s12
-; GFX10-NEXT:    v_mov_b32_e32 v4, s12
+; GFX10-NEXT:    v_mov_b32_e32 v3, s8
+; GFX10-NEXT:    v_mov_b32_e32 v4, s8
 ; GFX10-NEXT:  .LBB4_2: ; %.preheader
 ; GFX10-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; GFX10-NEXT:    buffer_load_dword v5, v3, s[4:7], 0 offen
@@ -261,18 +261,20 @@ define amdgpu_cs void @single_lane_execution_attribute(i32 inreg %.userdata0, <3
 ; GFX10-NEXT:    s_cbranch_vccnz .LBB4_2
 ; GFX10-NEXT:  ; %bb.3: ; %.preheader._crit_edge
 ; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, v4, v2
-; GFX10-NEXT:    s_mov_b32 s13, 0
-; GFX10-NEXT:    s_or_b32 s2, s0, vcc_lo
-; GFX10-NEXT:    v_cndmask_b32_e64 v3, 0, 1, s2
+; GFX10-NEXT:    s_mov_b32 s9, 0
+; GFX10-NEXT:    s_or_b32 s4, s0, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e64 v3, 0, 1, s4
 ; GFX10-NEXT:  .LBB4_4: ; %Flow
-; GFX10-NEXT:    s_and_b32 vcc_lo, exec_lo, s13
+; GFX10-NEXT:    s_load_dwordx4 s[4:7], s[2:3], 0x10
+; GFX10-NEXT:    s_and_b32 vcc_lo, exec_lo, s9
 ; GFX10-NEXT:    s_cbranch_vccz .LBB4_6
 ; GFX10-NEXT:  ; %bb.5: ; %.19
 ; GFX10-NEXT:    v_cndmask_b32_e64 v1, 0, 1, s0
 ; GFX10-NEXT:    v_or_b32_e32 v3, 2, v1
 ; GFX10-NEXT:  .LBB4_6: ; %.22
 ; GFX10-NEXT:    v_add_lshl_u32 v0, v0, s1, 2
-; GFX10-NEXT:    buffer_store_dword v3, v0, s[8:11], 0 offen
+; GFX10-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX10-NEXT:    buffer_store_dword v3, v0, s[4:7], 0 offen
 ; GFX10-NEXT:    s_endpgm
 .entry:
   %.0 = call i64 @llvm.amdgcn.s.getpc()

arsenm (Contributor) left a comment

We definitely should not be using the OS as a hint on usage or to change any properties like this

-  if (AddrSpace == AMDGPUAS::GLOBAL_ADDRESS ||
-      AddrSpace == AMDGPUAS::CONSTANT_ADDRESS ||
+  if (AddrSpace == AMDGPUAS::GLOBAL_ADDRESS)
+    return 512;
arsenm (Contributor):

128 would be more correct on average for global than constant

piotrAMD (Collaborator, Author):

This patch does not have any effect for GLOBAL_ADDRESS. Is the suggestion to also modify the value returned for this case to 128? (I haven't investigated that).

piotrAMD (Collaborator, Author):

I also note that the values returned for R600 (R600TTIImpl::getLoadStoreVecRegBitWidth) are all no larger than 128.

arsenm (Contributor):

R600 didn't have the really wide scalar loads

piotrAMD (Collaborator, Author) commented:

> We definitely should not be using the OS as a hint on usage or to change any properties like this

Fine - would adding a flag be better?

arsenm (Contributor) commented May 20, 2024

> We definitely should not be using the OS as a hint on usage or to change any properties like this
>
> Fine - would adding a flag be better?

Not really. It would be better to just unconditionally change this to be 128 for any of the global-like address spaces. Ultimately this hook is supposed to be a "give me this property", and the constant address spaces are a weak heuristic for using scalar instructions. The problem would be in the use context for this

arsenm (Contributor) commented May 20, 2024

Plus I think we should have better support for splitting and partially rematerializing the large SMRD loads.

piotrAMD (Collaborator, Author) commented:

> Not really. It would be better to just unconditionally change this to be 128 for any of the global-like address spaces. Ultimately this hook is supposed to be a "give me this property", and the constant address spaces are a weak heuristic for using scalar instructions. The problem would be in the use context for this

Happy to change this to 128 for any of the global-like address spaces, although I'll probably split GLOBAL_ADDRESS into a separate patch.

  if (AddrSpace == AMDGPUAS::GLOBAL_ADDRESS)
    return 512;

  if (AddrSpace == AMDGPUAS::CONSTANT_ADDRESS ||
arsenm (Contributor):

This should probably be using isExtendedGlobalAddrSpace (which should also cover the buffers, but it doesn't...)
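For reference, a sketch of the shape that suggestion implies (illustrative only: it assumes AMDGPU::isExtendedGlobalAddrSpace from the AMDGPU backend headers is usable here, and it keeps the buffer address spaces spelled out since, as the reviewer notes, the helper does not cover them):

  if (AMDGPU::isExtendedGlobalAddrSpace(AddrSpace) ||
      AddrSpace == AMDGPUAS::BUFFER_FAT_POINTER ||
      AddrSpace == AMDGPUAS::BUFFER_RESOURCE ||
      AddrSpace == AMDGPUAS::BUFFER_STRIDED_POINTER)
    return 128; // 128 is the value the later discussion converges on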

piotrAMD (Collaborator, Author) commented:

Removed the condition on ST, included GLOBAL_ADDRESS, updated tests, and rebased.

piotrAMD changed the title from "[AMDGPU][LSV] Restrict large vectors in graphics" to "[AMDGPU][LSV] Restrict large vectors" on May 23, 2024
  if (AddrSpace == AMDGPUAS::PRIVATE_ADDRESS)
    return 8 * ST->getMaxPrivateElementSize();

  // Common to flat, global, local and region. Assume for unknown addrspace.
arsenm (Contributor):

Probably should leave some form of comment

      AddrSpace == AMDGPUAS::BUFFER_STRIDED_POINTER) {
    return 512;
  }

  if (AddrSpace == AMDGPUAS::PRIVATE_ADDRESS)
    return 8 * ST->getMaxPrivateElementSize();
arsenm (Contributor):

Separate patch, but this is wrong when using scratch instructions. It should be the same as any other flat / global

piotrAMD (Collaborator, Author):

getMaxPrivateElementSize() returns 16 (bytes) if enableFlatScratch(), so the hook returns 8 * 16 = 128 bits in that case. So there is no issue?

arsenm (Contributor):

OK, that's not really what getMaxPrivateElementSize means, but I guess that works.

piotrAMD (Collaborator, Author) commented:

Restored/expanded comment.

// Assume for unknown addrspace. For constant, we also return 128 here despite
// support for wide scalar loads, because very large vectors can cause
// problems in the backend: high register pressure or increased
// fragmentation.
arsenm (Contributor):
Probably should add a FIXME here: we should try to form 512 if we were better at rematerializing partial vectors.
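Putting the thread together, the tail of GCNTTIImpl::getLoadStoreVecRegBitWidth could then read roughly as follows (a sketch only; the FIXME wording is illustrative and not taken from the final commit):

  if (AddrSpace == AMDGPUAS::PRIVATE_ADDRESS)
    return 8 * ST->getMaxPrivateElementSize();

  // Common to flat, global, local and region. Assume for unknown addrspace.
  // For constant, we also return 128 here despite support for wide scalar
  // loads, because very large vectors can cause problems in the backend:
  // high register pressure or increased fragmentation.
  // FIXME: Consider forming 512-bit vectors again once splitting and partial
  // rematerialization of large vector loads are handled better.
  return 128;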
