Skip to content

Conversation

yxsamliu
Copy link
Collaborator

@yxsamliu yxsamliu commented Sep 4, 2025

HIP vector types often lower to aggregates and get copied with memcpy. When the source or destination is chosen via a pointer select, SROA cannot split the aggregate. This keeps data in stack slots and increases scratch traffic. By rewriting these memcpy idioms, we enable SROA to promote values, reducing stack usage and improving occupancy and bandwidth on AMD GPUs.

For example:

%p = select i1 %cond, ptr %A, ptr %B

call void @llvm.memcpy.p0.p0.i32(ptr %dst, ptr %p, i32 16, i1 false)

When the source is a pointer select and conditions allow, the pass replaces the memcpy with two aligned loads, a value-level select of the loaded vector, and one aligned store. If it is not safe to speculate both loads, it splits control flow and emits a memcpy in each arm. When the destination is a select, it always splits control flow to avoid speculative stores. Vector element types are chosen based on size and minimum proven alignment to minimize the number of operations.

The pass handles non-volatile, constant-length memcpy up to a small size cap. Source and destination must be in the same address space. It runs early, after inlining and before InferAddressSpaces and SROA. Volatile and cross-address-space memcpys are skipped.

The size cap is controlled by -amdgpu-vector-idiom-max-bytes (default 32), allowing tuning for different workloads.

Fixes: SWDEV-550134

Salinas, David and others added 30 commits March 20, 2025 12:37
The underlying issue was that I forgot to clean the cache directory
before running the test. So the test ended up running sometimes on a
dirty cache yielding bad fails.

Since the code is only running a single comgr action that only converts
spirv->bc, the contents of the cache should be 2 files:
* the bitcode
* the cache timestamp
In this patch, we add a new action:

AMD_COMGR_ACTION_COMPILE_SPIRV_TO_RELOCATABLE

That accepts a set of .spv files, translates them to .bc files,
extracts any embedded @llvm.cmdline flags, and then compiles
to a set of relocatable .o files.
The underlying issue was that I forgot to clean the cache directory
before running the test. So the test ended up running sometimes on a
dirty cache yielding bad fails.

Since the code is only running a single comgr action that only converts
spirv->bc, the contents of the cache should be 2 files:
* the bitcode
* the cache timestamp
…lvm#1365)

from row llvm#19 in "Mainline for 6.5 Cherry-pick List"
amd-staging commits:
[Comgr][Cache] Fix broken test: spirv-translator-cached.cl · ROCm/llvm-project@51fa25b
[Cache][SPIRV] Fix flacky test... again · ROCm/llvm-project@56cf45a
Note that this is not an NFC change because the test case
`llvm/test/CodeGen/AMDGPU/amdgpu-spill-cfi-saved-regs.ll` has been
updated due
to the recent SGPR layout change. The 32 CSR SGPRs in
`callee_need_to_spill_fp_exec_to_memory` have been adjusted to reflect
this
update.

Change-Id: I332a721e7e8feaa5491c63228ecb42759e4d979d
This PR updates the SGPR layout to a striped caller/callee-saved design,
similar
to the VGPR layout.

To ensure that s30-s31 (return address), s32 (stack pointer), s33 (frame
pointer), and s34 (base pointer) remain callee-saved, the striped layout
starts
from s40, with a stripe width of 8. The last stripe is 10 wide instead
of 8 to
avoid ending with a 2-wide stripe.

Fixes llvm#113782.

Change-Id: I6fe8fca8b70985a8775ec04d93b460333533d2bb
For hipBinNVPtr_ and hipBinAMDPtr_ members: the destructor of the base
class was not marked as virtual, but the destructor of the derived
classes are. When we delete the object we do it through a
pointer to the base class. So the base class destructor is called but
not the one of the derived classes. This results in strange memory
behaviour detected by ASAN.

Solves SWDEV-516418
Also archive the Comgr V3 Release notes, and start a new document
for Comgr V4 changes.

Change-Id: I25137c174bd70caafe9b3c26d3a956331e0e9dfc
@yxsamliu
Copy link
Collaborator Author

yxsamliu commented Sep 9, 2025

https://reviews.llvm.org/D140493 related?

Thanks, yes, it's similar in spirit to D140493. Both aim to unblock SROA by reshaping memory idioms. But D140493 already showed that profitability and legality here are highly target-specific, and that trying to fit this into generic SROA/InstCombine causes difficulty and delay.

On AMDGPU, costs and safety depend on address spaces, proven alignment, how byte and small vectors legalize, and when it's safe to speculate loads (but not stores). This also needs a precise spot in our pipeline (after inlining, before infer-AS/SROA). Keeping it as an AMDGPU pass lets us tune behavior per subtarget and avoid cross-target regressions.

HIP vector types often lower to aggregates and get copied with memcpy.
When the source or destination is chosen via a pointer select, SROA
cannot split the aggregate. This keeps data in stack slots and increases
scratch traffic. By rewriting these memcpy idioms, we enable SROA to
promote values, reducing stack usage and improving occupancy and
bandwidth on AMD GPUs.

For example:

%p = select i1 %cond, ptr %A, ptr %B

call void @llvm.memcpy.p0.p0.i32(ptr %dst, ptr %p, i32 16, i1 false)

When the source is a pointer select and conditions allow, the pass
replaces the memcpy with two aligned loads, a value-level select of the
loaded vector, and one aligned store. If it is not safe to speculate
both loads, it splits control flow and emits a memcpy in each arm. When
the destination is a select, it always splits control flow to avoid
speculative stores. Vector element types are chosen based on size and
minimum proven alignment to minimize the number of operations.

The pass handles non-volatile, constant-length memcpy up to a small size
cap. Source and destination must be in the same address space. It runs
early, after inlining and before InferAddressSpaces and SROA. Volatile
and cross-address-space memcpys are skipped.

The size cap is controlled by -amdgpu-vector-idiom-max-bytes (default
32), allowing tuning for different workloads.

Fixes: SWDEV-550134
@yxsamliu yxsamliu force-pushed the vector-idiom branch 2 times, most recently from c4a9bf2 to 1e18937 Compare September 22, 2025 02:02
@nikic
Copy link
Contributor

nikic commented Sep 23, 2025

Can you please fix your broken rebase?

@nikic nikic removed the request for review from a team September 23, 2025 09:44
Currently MC print source location of instructions in comments in
assembly when debug info is available, however, it does not include
inlined-at locations when a function is inlined.

For example, function foo is defined in header file a.h and is called
multiple times in b.cpp. If foo is inlined, current assembly will only
show its instructions with their line numbers in a.h. With inlined-at
locations, the assembly will also show where foo is called in b.cpp.

This patch adds inlined-at locations to the comments by using
DebugLoc::print. It makes the printed source location info consistent
with those printed by machine passes.
Change-Id: I8fb28509ab5f17dc451464c8bdb34cfb5240fc39
@nikic
Copy link
Contributor

nikic commented Sep 29, 2025

Closing this because the history is completely messed up -- now in a different way from before. When you open a new PR for this, please verify that the history is correct and does not contain an excessive number of unrelated commits.

@nikic nikic closed this Sep 29, 2025
@yxsamliu
Copy link
Collaborator Author

sorry I mixed up this branch with another one. I will open a new PR.

@yxsamliu
Copy link
Collaborator Author

opened new PR #161200

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.