[AMDGPU] Identify vector idiom to unlock SROA #156791
Conversation
The underlying issue was that I forgot to clean the cache directory before running the test, so the test sometimes ran on a dirty cache and produced spurious failures. Since the code runs only a single comgr action that converts spirv->bc, the cache should contain exactly two files: the bitcode and the cache timestamp.
In this patch, we add a new action, AMD_COMGR_ACTION_COMPILE_SPIRV_TO_RELOCATABLE, which accepts a set of .spv files, translates them to .bc files, extracts any embedded @llvm.cmdline flags, and then compiles them to a set of relocatable .o files.
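For reference, a minimal sketch of driving this action through the comgr C API. This is a sketch only: error handling is elided, and the include path, ISA name, and AMD_COMGR_DATA_KIND_SPIRV data kind are assumptions, not taken from this patch.

#include "amd_comgr/amd_comgr.h"

// Sketch: feed one SPIR-V module through the new action and collect the
// resulting relocatable objects in OutSet. All error checking is elided.
amd_comgr_status_t compileSpirvToRelocatable(const char *SpvBytes,
                                             size_t SpvSize,
                                             amd_comgr_data_set_t OutSet) {
  amd_comgr_data_t Spv;
  amd_comgr_create_data(AMD_COMGR_DATA_KIND_SPIRV, &Spv); // assumed data kind
  amd_comgr_set_data(Spv, SpvSize, SpvBytes);
  amd_comgr_set_data_name(Spv, "kernel.spv");

  amd_comgr_data_set_t In;
  amd_comgr_create_data_set(&In);
  amd_comgr_data_set_add(In, Spv);

  amd_comgr_action_info_t Info;
  amd_comgr_create_action_info(&Info);
  amd_comgr_action_info_set_isa_name(Info, "amdgcn-amd-amdhsa--gfx942");

  // One action: .spv -> .bc, apply embedded @llvm.cmdline flags, emit .o.
  return amd_comgr_do_action(AMD_COMGR_ACTION_COMPILE_SPIRV_TO_RELOCATABLE,
                             Info, In, OutSet);
}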
…lvm#1365) from row llvm#19 in "Mainline for 6.5 Cherry-pick List" amd-staging commits: [Comgr][Cache] Fix broken test: spirv-translator-cached.cl · ROCm/llvm-project@51fa25b [Cache][SPIRV] Fix flacky test... again · ROCm/llvm-project@56cf45a
Note that this is not an NFC change because the test case `llvm/test/CodeGen/AMDGPU/amdgpu-spill-cfi-saved-regs.ll` has been updated due to the recent SGPR layout change. The 32 CSR SGPRs in `callee_need_to_spill_fp_exec_to_memory` have been adjusted to reflect this update. Change-Id: I332a721e7e8feaa5491c63228ecb42759e4d979d
This PR updates the SGPR layout to a striped caller/callee-saved design, similar to the VGPR layout. To ensure that s30-s31 (return address), s32 (stack pointer), s33 (frame pointer), and s34 (base pointer) remain callee-saved, the striped layout starts from s40, with a stripe width of 8. The last stripe is 10 wide instead of 8 to avoid ending with a 2-wide stripe. Fixes llvm#113782. Change-Id: I6fe8fca8b70985a8775ec04d93b460333533d2bb
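As an illustration only, the stripe arithmetic could be classified as in the sketch below; the phase (whether the first stripe at s40 is callee-saved) and the s105 upper bound are assumptions mirroring the VGPR layout, not facts stated in this patch.

#include <algorithm>

// Hypothetical classifier for the striped SGPR CSR layout described above.
// Assumes: s30-s34 are always callee-saved; 8-wide stripes alternate from
// s40 with the first stripe callee-saved; the range ends at s105, so the
// final stripe (s96-s105) is 10 wide rather than leaving a 2-wide tail.
bool isCalleeSavedSGPR(unsigned N) {            // N such that the register is sN
  if (N >= 30 && N <= 34)                       // s30-s31 ra, s32 sp, s33 fp, s34 bp
    return true;
  if (N < 40 || N > 105)
    return false;
  unsigned Stripe = std::min((N - 40) / 8, 7u); // fold s96-s105 into stripe 7
  return Stripe % 2 == 0;                       // assumed: even stripes are CSRs
}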
…#128170) Fixes SWDEV-515029
For the hipBinNVPtr_ and hipBinAMDPtr_ members: the destructor of the base class was not marked as virtual, but the destructors of the derived classes are. When we delete such an object through a pointer to the base class, only the base class destructor runs, not the derived one. This results in strange memory behaviour detected by ASAN. Solves SWDEV-516418
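A minimal reproduction of the pattern (illustrative names, not the actual hipBin classes):

#include <cstdio>

struct Base {
  ~Base() { std::puts("~Base"); }     // bug: not virtual
};

struct Derived : Base {
  char *Buf = new char[64];
  ~Derived() { delete[] Buf; std::puts("~Derived"); }
};

int main() {
  Base *P = new Derived();
  delete P;  // undefined behavior: only ~Base runs, ~Derived is skipped,
             // so Buf leaks -- the kind of issue ASAN reports.
  // Fix: declare `virtual ~Base() = default;` in the base class.
}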
Also archive the Comgr V3 Release notes, and start a new document for Comgr V4 changes. Change-Id: I25137c174bd70caafe9b3c26d3a956331e0e9dfc
Thanks, yes, it's similar in spirit to D140493. Both aim to unblock SROA by reshaping memory idioms. But D140493 already showed that profitability and legality here are highly target-specific, and that trying to fit this into generic SROA/InstCombine causes difficulty and delay. On AMDGPU, costs and safety depend on address spaces, proven alignment, how byte and small vectors legalize, and when it's safe to speculate loads (but not stores). This also needs a precise spot in our pipeline (after inlining, before infer-AS/SROA). Keeping it as an AMDGPU pass lets us tune behavior per subtarget and avoid cross-target regressions.
Force-pushed from c4a9bf2 to 1e18937.
Can you please fix your broken rebase?
Currently, MC prints the source location of instructions in assembly comments when debug info is available; however, it does not include inlined-at locations when a function is inlined. For example, suppose function foo is defined in header file a.h and called multiple times in b.cpp. If foo is inlined, the current assembly only shows its instructions with their line numbers in a.h. With inlined-at locations, the assembly also shows where foo is called in b.cpp. This patch adds inlined-at locations to the comments by using DebugLoc::print, making the printed source location info consistent with that printed by machine passes.
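A minimal sketch of emitting such a comment with DebugLoc::print (the helper and comment prefix are illustrative; the actual change wires this into the existing comment emission):

#include "llvm/ADT/StringRef.h"
#include "llvm/IR/DebugLoc.h"
#include "llvm/Support/raw_ostream.h"

// Illustrative helper: print one location as an assembly comment.
// DebugLoc::print writes "file:line:col" and, when the location carries
// an inlined-at chain, appends it as " @[ caller-file:line:col ... ]",
// which is what makes the inlining context visible in the assembly.
static void emitLocationComment(llvm::raw_ostream &OS,
                                const llvm::DebugLoc &DL,
                                llvm::StringRef CommentPrefix) {
  if (!DL)
    return;                 // instruction has no debug location
  OS << CommentPrefix << ' ';
  DL.print(OS);             // includes inlined-at locations
  OS << '\n';
}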
Force-pushed from 2d524e1 to 03ac654.
Change-Id: I8fb28509ab5f17dc451464c8bdb34cfb5240fc39
Closing this because the history is completely messed up -- now in a different way from before. When you open a new PR for this, please verify that the history is correct and does not contain an excessive number of unrelated commits.
Sorry, I mixed up this branch with another one. I will open a new PR.
Opened new PR #161200.
HIP vector types often lower to aggregates and get copied with memcpy. When the source or destination is chosen via a pointer select, SROA cannot split the aggregate. This keeps data in stack slots and increases scratch traffic. By rewriting these memcpy idioms, we enable SROA to promote values, reducing stack usage and improving occupancy and bandwidth on AMD GPUs.
For example:
%p = select i1 %cond, ptr %A, ptr %B
call void @llvm.memcpy.p0.p0.i32(ptr %dst, ptr %p, i32 16, i1 false)
When the source is a pointer select and conditions allow, the pass replaces the memcpy with two aligned loads, a value-level select of the loaded vector, and one aligned store. If it is not safe to speculate both loads, it splits control flow and emits a memcpy in each arm. When the destination is a select, it always splits control flow to avoid speculative stores. Vector element types are chosen based on size and minimum proven alignment to minimize the number of operations.
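To make the rewrite concrete, here is a C-level picture of the speculatable case (illustrative types and names, not the pass's implementation):

#include <cstdint>
#include <cstring>

struct V4 { uint32_t E[4]; };   // stands in for a 16-byte HIP vector

// Before: the source is a pointer select, so the 16-byte copy stays
// opaque and SROA cannot split the aggregate it touches.
void before(bool Cond, const V4 *A, const V4 *B, V4 *Dst) {
  const V4 *P = Cond ? A : B;            // pointer select
  std::memcpy(Dst, P, sizeof(V4));       // opaque aggregate copy
}

// After: when both loads may be speculated, the effect of the rewrite is
// two aligned loads, a value-level select, and a single store.
void after(bool Cond, const V4 *A, const V4 *B, V4 *Dst) {
  V4 AV = *A;                            // speculated load from A
  V4 BV = *B;                            // speculated load from B
  *Dst = Cond ? AV : BV;                 // select on values, one store
}

With the pointer select gone, the temporary behind Dst becomes a plain vector value that SROA can promote out of scratch.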
The pass handles non-volatile, constant-length memcpy up to a small size cap. Source and destination must be in the same address space. It runs early, after inlining and before InferAddressSpaces and SROA. Volatile and cross-address-space memcpys are skipped.
The size cap is controlled by -amdgpu-vector-idiom-max-bytes (default 32), allowing tuning for different workloads.
Fixes: SWDEV-550134