[AMDGPU][LSV] Restrict large vectors #92540

Open · piotrAMD wants to merge 5 commits into base: main
Conversation

piotrAMD (Collaborator) commented May 17, 2024

The patch restricts forming large (> 128b) vectors in the load-store vectorizer.

In graphics, loads in CONSTANT_ADDRESS space are primarily descriptor loads, and they already have the size required for the consuming instructions (128b, or 256b).

For buffer loads the natural size is at most 128b, as that corresponds to the vec4 type frequently used in buffer declarations in various graphics APIs (colors, positions, normals).

Using larger sizes can be problematic for later passes, as they may cause code motion issues, register fragmentation, and inefficient spills (most of these are really deficiencies in handling subregisters).

However, for cases where adjacent loads end up close together, there is a late pass, the load-store optimizer, that will often merge them.
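As an illustration of why the cap matters, here is a minimal sketch (not the actual LoadStoreVectorizer code; the helper name and structure are made up) of how the pass consults this TTI hook to limit the vectors it is allowed to form, which is why lowering the value to 128b for these address spaces prevents the >128b vectors:

#include "llvm/Analysis/TargetTransformInfo.h"

using namespace llvm;

// Hypothetical helper: the widest vector, in elements, the vectorizer may
// build for a chain of ElementBits-wide accesses in AddrSpace.
static unsigned maxVectorElements(const TargetTransformInfo &TTI,
                                  unsigned AddrSpace, unsigned ElementBits) {
  // With this patch the cap for CONSTANT_ADDRESS and the buffer address
  // spaces drops from 512 to 128 bits, so 32-bit elements vectorize to at
  // most <4 x i32> (one dwordx4 load) instead of <16 x i32>.
  unsigned VecRegBits = TTI.getLoadStoreVecRegBitWidth(AddrSpace);
  return VecRegBits / ElementBits;
}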

llvmbot commented May 17, 2024

@llvm/pr-subscribers-llvm-transforms
@llvm/pr-subscribers-backend-amdgpu

@llvm/pr-subscribers-llvm-globalisel

Author: Piotr Sobczak (piotrAMD)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/92540.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp (+6-5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-phis-no-lane-mask-merging.ll (+14-12)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
index 84320d296a037..9785e7d71bec2 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
@@ -364,14 +364,15 @@ unsigned GCNTTIImpl::getStoreVectorFactor(unsigned VF, unsigned StoreSize,
 }
 
 unsigned GCNTTIImpl::getLoadStoreVecRegBitWidth(unsigned AddrSpace) const {
-  if (AddrSpace == AMDGPUAS::GLOBAL_ADDRESS ||
-      AddrSpace == AMDGPUAS::CONSTANT_ADDRESS ||
+  if (AddrSpace == AMDGPUAS::GLOBAL_ADDRESS)
+    return 512;
+
+  if (AddrSpace == AMDGPUAS::CONSTANT_ADDRESS ||
       AddrSpace == AMDGPUAS::CONSTANT_ADDRESS_32BIT ||
       AddrSpace == AMDGPUAS::BUFFER_FAT_POINTER ||
       AddrSpace == AMDGPUAS::BUFFER_RESOURCE ||
-      AddrSpace == AMDGPUAS::BUFFER_STRIDED_POINTER) {
-    return 512;
-  }
+      AddrSpace == AMDGPUAS::BUFFER_STRIDED_POINTER)
+    return ST->isAmdPalOS() ? 128 : 512;
 
   if (AddrSpace == AMDGPUAS::PRIVATE_ADDRESS)
     return 8 * ST->getMaxPrivateElementSize();
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-phis-no-lane-mask-merging.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-phis-no-lane-mask-merging.ll
index d4d5cb18bbd30..6930e0d809177 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-phis-no-lane-mask-merging.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-phis-no-lane-mask-merging.ll
@@ -227,14 +227,14 @@ define amdgpu_cs void @single_lane_execution_attribute(i32 inreg %.userdata0, <3
 ; GFX10-LABEL: single_lane_execution_attribute:
 ; GFX10:       ; %bb.0: ; %.entry
 ; GFX10-NEXT:    s_getpc_b64 s[4:5]
-; GFX10-NEXT:    s_mov_b32 s12, 0
-; GFX10-NEXT:    s_mov_b32 s13, -1
+; GFX10-NEXT:    s_mov_b32 s8, 0
+; GFX10-NEXT:    s_mov_b32 s9, -1
 ; GFX10-NEXT:    s_mov_b32 s2, s0
-; GFX10-NEXT:    s_and_b64 s[4:5], s[4:5], s[12:13]
-; GFX10-NEXT:    s_mov_b32 s3, s12
+; GFX10-NEXT:    s_and_b64 s[4:5], s[4:5], s[8:9]
+; GFX10-NEXT:    s_mov_b32 s3, s8
 ; GFX10-NEXT:    v_mbcnt_lo_u32_b32 v1, -1, 0
 ; GFX10-NEXT:    s_or_b64 s[2:3], s[4:5], s[2:3]
-; GFX10-NEXT:    s_load_dwordx8 s[4:11], s[2:3], 0x0
+; GFX10-NEXT:    s_load_dwordx4 s[4:7], s[2:3], 0x0
 ; GFX10-NEXT:    v_mbcnt_hi_u32_b32 v1, -1, v1
 ; GFX10-NEXT:    v_lshlrev_b32_e32 v2, 2, v1
 ; GFX10-NEXT:    v_and_b32_e32 v3, 1, v1
@@ -248,8 +248,8 @@ define amdgpu_cs void @single_lane_execution_attribute(i32 inreg %.userdata0, <3
 ; GFX10-NEXT:    v_cmp_eq_u32_e64 s0, 0, v2
 ; GFX10-NEXT:    s_cbranch_vccnz .LBB4_4
 ; GFX10-NEXT:  ; %bb.1: ; %.preheader.preheader
-; GFX10-NEXT:    v_mov_b32_e32 v3, s12
-; GFX10-NEXT:    v_mov_b32_e32 v4, s12
+; GFX10-NEXT:    v_mov_b32_e32 v3, s8
+; GFX10-NEXT:    v_mov_b32_e32 v4, s8
 ; GFX10-NEXT:  .LBB4_2: ; %.preheader
 ; GFX10-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; GFX10-NEXT:    buffer_load_dword v5, v3, s[4:7], 0 offen
@@ -261,18 +261,20 @@ define amdgpu_cs void @single_lane_execution_attribute(i32 inreg %.userdata0, <3
 ; GFX10-NEXT:    s_cbranch_vccnz .LBB4_2
 ; GFX10-NEXT:  ; %bb.3: ; %.preheader._crit_edge
 ; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, v4, v2
-; GFX10-NEXT:    s_mov_b32 s13, 0
-; GFX10-NEXT:    s_or_b32 s2, s0, vcc_lo
-; GFX10-NEXT:    v_cndmask_b32_e64 v3, 0, 1, s2
+; GFX10-NEXT:    s_mov_b32 s9, 0
+; GFX10-NEXT:    s_or_b32 s4, s0, vcc_lo
+; GFX10-NEXT:    v_cndmask_b32_e64 v3, 0, 1, s4
 ; GFX10-NEXT:  .LBB4_4: ; %Flow
-; GFX10-NEXT:    s_and_b32 vcc_lo, exec_lo, s13
+; GFX10-NEXT:    s_load_dwordx4 s[4:7], s[2:3], 0x10
+; GFX10-NEXT:    s_and_b32 vcc_lo, exec_lo, s9
 ; GFX10-NEXT:    s_cbranch_vccz .LBB4_6
 ; GFX10-NEXT:  ; %bb.5: ; %.19
 ; GFX10-NEXT:    v_cndmask_b32_e64 v1, 0, 1, s0
 ; GFX10-NEXT:    v_or_b32_e32 v3, 2, v1
 ; GFX10-NEXT:  .LBB4_6: ; %.22
 ; GFX10-NEXT:    v_add_lshl_u32 v0, v0, s1, 2
-; GFX10-NEXT:    buffer_store_dword v3, v0, s[8:11], 0 offen
+; GFX10-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX10-NEXT:    buffer_store_dword v3, v0, s[4:7], 0 offen
 ; GFX10-NEXT:    s_endpgm
 .entry:
   %.0 = call i64 @llvm.amdgcn.s.getpc()

arsenm (Contributor) left a comment

We definitely should not be using the OS as a hint on usage or to change any properties like this

-  if (AddrSpace == AMDGPUAS::GLOBAL_ADDRESS ||
-      AddrSpace == AMDGPUAS::CONSTANT_ADDRESS ||
+  if (AddrSpace == AMDGPUAS::GLOBAL_ADDRESS)
+    return 512;
arsenm (Contributor):

128 would be more correct on average for global than constant

piotrAMD (Collaborator, Author):

This patch does not have any effect for GLOBAL_ADDRESS. Is the suggestion to also modify the value returned for this case to 128? (I haven't investigated that).

piotrAMD (Collaborator, Author):

I also note that the values returned for R600 (R600TTIImpl::getLoadStoreVecRegBitWidth) are all no larger than 128.

arsenm (Contributor):

R600 didn't have the really wide scalar loads

piotrAMD (Collaborator, Author) commented:

> We definitely should not be using the OS as a hint on usage or to change any properties like this

Fine - would adding a flag be better?

arsenm (Contributor) commented May 20, 2024

> We definitely should not be using the OS as a hint on usage or to change any properties like this
>
> Fine - would adding a flag be better?

Not really. It would be better to just unconditionally change this to be 128 for any of the global-like address spaces. Ultimately this hook is supposed to be a "give me this property", and the constant address spaces are a weak heuristic for using scalar instructions. The problem would be in the use context for this

arsenm (Contributor) commented May 20, 2024

Plus I think we should have better support for splitting and partially rematerializing the large SMRD loads.

piotrAMD (Collaborator, Author) commented:

> Not really. It would be better to just unconditionally change this to be 128 for any of the global-like address spaces. Ultimately this hook is supposed to be a "give me this property", and the constant address spaces are a weak heuristic for using scalar instructions. The problem would be in the use context for this

Happy to change this to 128 for any of the global-like address spaces, although I'll probably split GLOBAL_ADDRESS into a separate patch.

  if (AddrSpace == AMDGPUAS::GLOBAL_ADDRESS)
    return 512;

  if (AddrSpace == AMDGPUAS::CONSTANT_ADDRESS ||
arsenm (Contributor):

This should probably be using isExtendedGlobalAddrSpace (which should also cover the buffers, but it doesn't...)
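For reference, a sketch of the shape that suggestion implies (illustrative only: it assumes AMDGPU::isExtendedGlobalAddrSpace from the AMDGPU backend headers is usable here, and it keeps the buffer address spaces spelled out since, as the reviewer notes, the helper does not cover them):

  if (AMDGPU::isExtendedGlobalAddrSpace(AddrSpace) ||
      AddrSpace == AMDGPUAS::BUFFER_FAT_POINTER ||
      AddrSpace == AMDGPUAS::BUFFER_RESOURCE ||
      AddrSpace == AMDGPUAS::BUFFER_STRIDED_POINTER)
    return 128; // 128 is the value the later discussion converges on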

piotrAMD (Collaborator, Author) commented:

Removed the condition on ST, included GLOBAL_ADDRESS, updated tests, and rebased.

piotrAMD changed the title from "[AMDGPU][LSV] Restrict large vectors in graphics" to "[AMDGPU][LSV] Restrict large vectors" on May 23, 2024
  if (AddrSpace == AMDGPUAS::PRIVATE_ADDRESS)
    return 8 * ST->getMaxPrivateElementSize();

  // Common to flat, global, local and region. Assume for unknown addrspace.
arsenm (Contributor):

Probably should leave some form of comment

      AddrSpace == AMDGPUAS::BUFFER_STRIDED_POINTER) {
    return 512;
  }

  if (AddrSpace == AMDGPUAS::PRIVATE_ADDRESS)
    return 8 * ST->getMaxPrivateElementSize();
arsenm (Contributor):

Separate patch, but this is wrong when using scratch instructions. It should be the same as any other flat / global

piotrAMD (Collaborator, Author):

getMaxPrivateElementSize() returns 16 (bytes) if enableFlatScratch(), so the hook returns 8 * 16 = 128 bits in that case. So there is no issue?

arsenm (Contributor):

OK, that's not really what getMaxPrivateElementSize means, but I guess that works.

piotrAMD (Collaborator, Author) commented:

Restored/expanded comment.

// Assume for unknown addrspace. For constant, we also return 128 here despite
// support for wide scalar loads, because very large vectors can cause
// problems in the backend: high register pressure or increased
// fragmentation.
arsenm (Contributor):
Probably should add a FIXME here: we should try to form 512 if we were better at rematerializing partial vectors.
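Putting the thread together, the tail of GCNTTIImpl::getLoadStoreVecRegBitWidth could then read roughly as follows (a sketch only; the FIXME wording is illustrative and not taken from the final commit):

  if (AddrSpace == AMDGPUAS::PRIVATE_ADDRESS)
    return 8 * ST->getMaxPrivateElementSize();

  // Common to flat, global, local and region. Assume for unknown addrspace.
  // For constant, we also return 128 here despite support for wide scalar
  // loads, because very large vectors can cause problems in the backend:
  // high register pressure or increased fragmentation.
  // FIXME: Consider forming 512-bit vectors again once splitting and partial
  // rematerialization of large vector loads are handled better.
  return 128;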
