[AMDGPU] Identify vector idiom to unlock SROA #156791
Conversation
The underlying issue was that I forgot to clean the cache directory before running the test, so the test sometimes ran on a dirty cache and produced spurious failures. Since the code runs only a single comgr action that converts spirv->bc, the cache should contain exactly two files: the bitcode and the cache timestamp.
In this patch, we add a new action, AMD_COMGR_ACTION_COMPILE_SPIRV_TO_RELOCATABLE, which accepts a set of .spv files, translates them to .bc files, extracts any embedded @llvm.cmdline flags, and then compiles them to a set of relocatable .o files.
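For reference, a minimal sketch of driving this action through the comgr C API. This is a sketch only: error handling is elided, and the include path, ISA name, and AMD_COMGR_DATA_KIND_SPIRV data kind are assumptions, not taken from this patch.

#include "amd_comgr/amd_comgr.h"

// Sketch: feed one SPIR-V module through the new action and collect the
// resulting relocatable objects in OutSet. All error checking is elided.
amd_comgr_status_t compileSpirvToRelocatable(const char *SpvBytes,
                                             size_t SpvSize,
                                             amd_comgr_data_set_t OutSet) {
  amd_comgr_data_t Spv;
  amd_comgr_create_data(AMD_COMGR_DATA_KIND_SPIRV, &Spv); // assumed data kind
  amd_comgr_set_data(Spv, SpvSize, SpvBytes);
  amd_comgr_set_data_name(Spv, "kernel.spv");

  amd_comgr_data_set_t In;
  amd_comgr_create_data_set(&In);
  amd_comgr_data_set_add(In, Spv);

  amd_comgr_action_info_t Info;
  amd_comgr_create_action_info(&Info);
  amd_comgr_action_info_set_isa_name(Info, "amdgcn-amd-amdhsa--gfx942");

  // One action: .spv -> .bc, apply embedded @llvm.cmdline flags, emit .o.
  return amd_comgr_do_action(AMD_COMGR_ACTION_COMPILE_SPIRV_TO_RELOCATABLE,
                             Info, In, OutSet);
}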
…lvm#1365) from row llvm#19 in "Mainline for 6.5 Cherry-pick List" amd-staging commits: [Comgr][Cache] Fix broken test: spirv-translator-cached.cl · ROCm/llvm-project@51fa25b [Cache][SPIRV] Fix flacky test... again · ROCm/llvm-project@56cf45a
Note that this is not an NFC change because the test case `llvm/test/CodeGen/AMDGPU/amdgpu-spill-cfi-saved-regs.ll` has been updated due to the recent SGPR layout change. The 32 CSR SGPRs in `callee_need_to_spill_fp_exec_to_memory` have been adjusted to reflect this update. Change-Id: I332a721e7e8feaa5491c63228ecb42759e4d979d
This PR updates the SGPR layout to a striped caller/callee-saved design, similar to the VGPR layout. To ensure that s30-s31 (return address), s32 (stack pointer), s33 (frame pointer), and s34 (base pointer) remain callee-saved, the striped layout starts from s40, with a stripe width of 8. The last stripe is 10 wide instead of 8 to avoid ending with a 2-wide stripe. Fixes llvm#113782. Change-Id: I6fe8fca8b70985a8775ec04d93b460333533d2bb
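As an illustration only, the stripe arithmetic could be classified as in the sketch below; the phase (whether the first stripe at s40 is callee-saved) and the s105 upper bound are assumptions mirroring the VGPR layout, not facts stated in this patch.

#include <algorithm>

// Hypothetical classifier for the striped SGPR CSR layout described above.
// Assumes: s30-s34 are always callee-saved; 8-wide stripes alternate from
// s40 with the first stripe callee-saved; the range ends at s105, so the
// final stripe (s96-s105) is 10 wide rather than leaving a 2-wide tail.
bool isCalleeSavedSGPR(unsigned N) {            // N such that the register is sN
  if (N >= 30 && N <= 34)                       // s30-s31 ra, s32 sp, s33 fp, s34 bp
    return true;
  if (N < 40 || N > 105)
    return false;
  unsigned Stripe = std::min((N - 40) / 8, 7u); // fold s96-s105 into stripe 7
  return Stripe % 2 == 0;                       // assumed: even stripes are CSRs
}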
…#128170) Fixes SWDEV-515029
For the hipBinNVPtr_ and hipBinAMDPtr_ members: the destructor of the base class was not marked as virtual, but the destructors of the derived classes are. When we delete such an object through a pointer to the base class, only the base class destructor runs, not the derived one. This results in strange memory behaviour detected by ASAN. Solves SWDEV-516418
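A minimal reproduction of the pattern (illustrative names, not the actual hipBin classes):

#include <cstdio>

struct Base {
  ~Base() { std::puts("~Base"); }     // bug: not virtual
};

struct Derived : Base {
  char *Buf = new char[64];
  ~Derived() { delete[] Buf; std::puts("~Derived"); }
};

int main() {
  Base *P = new Derived();
  delete P;  // undefined behavior: only ~Base runs, ~Derived is skipped,
             // so Buf leaks -- the kind of issue ASAN reports.
  // Fix: declare `virtual ~Base() = default;` in the base class.
}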
Also archive the Comgr V3 Release notes, and start a new document for Comgr V4 changes. Change-Id: I25137c174bd70caafe9b3c26d3a956331e0e9dfc
Thanks, yes, it's similar in spirit to D140493. Both aim to unblock SROA by reshaping memory idioms. But D140493 already showed that profitability and legality here are highly target-specific, and that trying to fit this into generic SROA/InstCombine causes difficulty and delay. On AMDGPU, costs and safety depend on address spaces, proven alignment, how byte and small vectors legalize, and when it's safe to speculate loads (but not stores). This also needs a precise spot in our pipeline (after inlining, before infer-AS/SROA). Keeping it as an AMDGPU pass lets us tune behavior per subtarget and avoid cross-target regressions.
Force-pushed from c4a9bf2 to 1e18937.
Can you please fix your broken rebase?
Currently, MC prints the source location of instructions in assembly comments when debug info is available; however, it does not include inlined-at locations when a function is inlined. For example, suppose function foo is defined in header file a.h and called multiple times in b.cpp. If foo is inlined, the current assembly only shows its instructions with their line numbers in a.h. With inlined-at locations, the assembly also shows where foo is called in b.cpp. This patch adds inlined-at locations to the comments by using DebugLoc::print, making the printed source location info consistent with that printed by machine passes.
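A minimal sketch of emitting such a comment with DebugLoc::print (the helper and comment prefix are illustrative; the actual change wires this into the existing comment emission):

#include "llvm/ADT/StringRef.h"
#include "llvm/IR/DebugLoc.h"
#include "llvm/Support/raw_ostream.h"

// Illustrative helper: print one location as an assembly comment.
// DebugLoc::print writes "file:line:col" and, when the location carries
// an inlined-at chain, appends it as " @[ caller-file:line:col ... ]",
// which is what makes the inlining context visible in the assembly.
static void emitLocationComment(llvm::raw_ostream &OS,
                                const llvm::DebugLoc &DL,
                                llvm::StringRef CommentPrefix) {
  if (!DL)
    return;                 // instruction has no debug location
  OS << CommentPrefix << ' ';
  DL.print(OS);             // includes inlined-at locations
  OS << '\n';
}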
Force-pushed from 2d524e1 to 03ac654.
Change-Id: I8fb28509ab5f17dc451464c8bdb34cfb5240fc39
Closing this because the history is completely messed up -- now in a different way from before. When you open a new PR for this, please verify that the history is correct and does not contain an excessive number of unrelated commits.
Sorry, I mixed up this branch with another one. I will open a new PR.
Opened new PR #161200.
HIP vector types often lower to aggregates and get copied with memcpy. When the source or destination is chosen via a pointer select, SROA cannot split the aggregate. This keeps data in stack slots and increases scratch traffic. By rewriting these memcpy idioms, we enable SROA to promote values, reducing stack usage and improving occupancy and bandwidth on AMD GPUs.
For example:
%p = select i1 %cond, ptr %A, ptr %B
call void @llvm.memcpy.p0.p0.i32(ptr %dst, ptr %p, i32 16, i1 false)
When the source is a pointer select and conditions allow, the pass replaces the memcpy with two aligned loads, a value-level select of the loaded vector, and one aligned store. If it is not safe to speculate both loads, it splits control flow and emits a memcpy in each arm. When the destination is a select, it always splits control flow to avoid speculative stores. Vector element types are chosen based on size and minimum proven alignment to minimize the number of operations.
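To make the rewrite concrete, here is a C-level picture of the speculatable case (illustrative types and names, not the pass's implementation):

#include <cstdint>
#include <cstring>

struct V4 { uint32_t E[4]; };   // stands in for a 16-byte HIP vector

// Before: the source is a pointer select, so the 16-byte copy stays
// opaque and SROA cannot split the aggregate it touches.
void before(bool Cond, const V4 *A, const V4 *B, V4 *Dst) {
  const V4 *P = Cond ? A : B;            // pointer select
  std::memcpy(Dst, P, sizeof(V4));       // opaque aggregate copy
}

// After: when both loads may be speculated, the effect of the rewrite is
// two aligned loads, a value-level select, and a single store.
void after(bool Cond, const V4 *A, const V4 *B, V4 *Dst) {
  V4 AV = *A;                            // speculated load from A
  V4 BV = *B;                            // speculated load from B
  *Dst = Cond ? AV : BV;                 // select on values, one store
}

With the pointer select gone, the temporary behind Dst becomes a plain vector value that SROA can promote out of scratch.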
The pass handles non-volatile, constant-length memcpy up to a small size cap. Source and destination must be in the same address space. It runs early, after inlining and before InferAddressSpaces and SROA. Volatile and cross-address-space memcpys are skipped.
The size cap is controlled by -amdgpu-vector-idiom-max-bytes (default 32), allowing tuning for different workloads.
Fixes: SWDEV-550134