[AMDGPU][CodeGen] Improve handling of memcpy for -Os/-Oz compilations #87632
@llvm/pr-subscribers-backend-amdgpu
Author: Shilei Tian (shiltian)
Changes
We had some instances when LLVM would not inline fixed-count memcpy and ended up attempting to lower it as a libcall, which would not work on AMDGPU as the address space doesn't meet the requirement, causing a compiler crash.
The patch relaxes the threshold used for -Os compilation so we're always allowed to inline memory copy functions.
This patch basically does the same thing as https://reviews.llvm.org/D158226 for AMDGPU.
Patch is 34.24 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/87632.diff
2 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
index f283af6fa07d3e..db69d50799e70b 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
@@ -59,6 +59,12 @@ unsigned AMDGPUTargetLowering::numBitsSigned(SDValue Op, SelectionDAG &DAG) {
AMDGPUTargetLowering::AMDGPUTargetLowering(const TargetMachine &TM,
const AMDGPUSubtarget &STI)
: TargetLowering(TM), Subtarget(&STI) {
+ // Always lower memset, memcpy, and memmove intrinsics to load/store
+ // instructions, rather than generating calls to memset, memcpy or memmove.
+ MaxStoresPerMemset = MaxStoresPerMemsetOptSize = ~0U;
+ MaxStoresPerMemcpy = MaxStoresPerMemcpyOptSize = ~0U;
+ MaxStoresPerMemmove = MaxStoresPerMemmoveOptSize = ~0U;
+
// Lower floating point store/load to integer store/load to reduce the number
// of patterns in tablegen.
setOperationAction(ISD::LOAD, MVT::f32, Promote);
diff --git a/llvm/test/CodeGen/AMDGPU/memcpy-libcall.ll b/llvm/test/CodeGen/AMDGPU/memcpy-libcall.ll
new file mode 100644
index 00000000000000..2c1c2b4656f0ba
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/memcpy-libcall.ll
@@ -0,0 +1,642 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc -march=amdgcn -mcpu=gfx908 %s -o - | FileCheck %s
+
+%struct.S = type { [32 x i32] }
+
+@shared = local_unnamed_addr addrspace(3) global %struct.S undef, align 4
+
+define dso_local void @_Z12copy_genericPvPKv(ptr nocapture noundef writeonly %dest, ptr nocapture noundef readonly %src) local_unnamed_addr #0 {
+; CHECK-LABEL: _Z12copy_genericPvPKv:
+; CHECK: _Z12copy_genericPvPKv$local:
+; CHECK-NEXT: .type _Z12copy_genericPvPKv$local,@function
+; CHECK-NEXT: ; %bb.0: ; %entry
+; CHECK-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; CHECK-NEXT: flat_load_ubyte v4, v[2:3] offset:46
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: flat_store_byte v[0:1], v4 offset:46
+; CHECK-NEXT: flat_load_ubyte v4, v[2:3] offset:45
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: flat_store_byte v[0:1], v4 offset:45
+; CHECK-NEXT: flat_load_ubyte v4, v[2:3] offset:44
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: flat_store_byte v[0:1], v4 offset:44
+; CHECK-NEXT: flat_load_ubyte v4, v[2:3] offset:43
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: flat_store_byte v[0:1], v4 offset:43
+; CHECK-NEXT: flat_load_ubyte v4, v[2:3] offset:42
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: flat_store_byte v[0:1], v4 offset:42
+; CHECK-NEXT: flat_load_ubyte v4, v[2:3] offset:41
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: flat_store_byte v[0:1], v4 offset:41
+; CHECK-NEXT: flat_load_ubyte v4, v[2:3] offset:40
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: flat_store_byte v[0:1], v4 offset:40
+; CHECK-NEXT: flat_load_ubyte v4, v[2:3] offset:39
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: flat_store_byte v[0:1], v4 offset:39
+; CHECK-NEXT: flat_load_ubyte v4, v[2:3] offset:38
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: flat_store_byte v[0:1], v4 offset:38
+; CHECK-NEXT: flat_load_ubyte v4, v[2:3] offset:37
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: flat_store_byte v[0:1], v4 offset:37
+; CHECK-NEXT: flat_load_ubyte v4, v[2:3] offset:36
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: flat_store_byte v[0:1], v4 offset:36
+; CHECK-NEXT: flat_load_ubyte v4, v[2:3] offset:35
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: flat_store_byte v[0:1], v4 offset:35
+; CHECK-NEXT: flat_load_ubyte v4, v[2:3] offset:34
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: flat_store_byte v[0:1], v4 offset:34
+; CHECK-NEXT: flat_load_ubyte v4, v[2:3] offset:33
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: flat_store_byte v[0:1], v4 offset:33
+; CHECK-NEXT: flat_load_ubyte v4, v[2:3] offset:32
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: flat_store_byte v[0:1], v4 offset:32
+; CHECK-NEXT: flat_load_ubyte v4, v[2:3] offset:31
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: flat_store_byte v[0:1], v4 offset:31
+; CHECK-NEXT: flat_load_ubyte v4, v[2:3] offset:30
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: flat_store_byte v[0:1], v4 offset:30
+; CHECK-NEXT: flat_load_ubyte v4, v[2:3] offset:29
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: flat_store_byte v[0:1], v4 offset:29
+; CHECK-NEXT: flat_load_ubyte v4, v[2:3] offset:28
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: flat_store_byte v[0:1], v4 offset:28
+; CHECK-NEXT: flat_load_ubyte v4, v[2:3] offset:27
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: flat_store_byte v[0:1], v4 offset:27
+; CHECK-NEXT: flat_load_ubyte v4, v[2:3] offset:26
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: flat_store_byte v[0:1], v4 offset:26
+; CHECK-NEXT: flat_load_ubyte v4, v[2:3] offset:25
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: flat_store_byte v[0:1], v4 offset:25
+; CHECK-NEXT: flat_load_ubyte v4, v[2:3] offset:24
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: flat_store_byte v[0:1], v4 offset:24
+; CHECK-NEXT: flat_load_ubyte v4, v[2:3] offset:23
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: flat_store_byte v[0:1], v4 offset:23
+; CHECK-NEXT: flat_load_ubyte v4, v[2:3] offset:22
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: flat_store_byte v[0:1], v4 offset:22
+; CHECK-NEXT: flat_load_ubyte v4, v[2:3] offset:21
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: flat_store_byte v[0:1], v4 offset:21
+; CHECK-NEXT: flat_load_ubyte v4, v[2:3] offset:20
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: flat_store_byte v[0:1], v4 offset:20
+; CHECK-NEXT: flat_load_ubyte v4, v[2:3] offset:19
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: flat_store_byte v[0:1], v4 offset:19
+; CHECK-NEXT: flat_load_ubyte v4, v[2:3] offset:18
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: flat_store_byte v[0:1], v4 offset:18
+; CHECK-NEXT: flat_load_ubyte v4, v[2:3] offset:17
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: flat_store_byte v[0:1], v4 offset:17
+; CHECK-NEXT: flat_load_ubyte v4, v[2:3] offset:16
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: flat_store_byte v[0:1], v4 offset:16
+; CHECK-NEXT: flat_load_ubyte v4, v[2:3] offset:15
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: flat_store_byte v[0:1], v4 offset:15
+; CHECK-NEXT: flat_load_ubyte v4, v[2:3] offset:14
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: flat_store_byte v[0:1], v4 offset:14
+; CHECK-NEXT: flat_load_ubyte v4, v[2:3] offset:13
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: flat_store_byte v[0:1], v4 offset:13
+; CHECK-NEXT: flat_load_ubyte v4, v[2:3] offset:12
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: flat_store_byte v[0:1], v4 offset:12
+; CHECK-NEXT: flat_load_ubyte v4, v[2:3] offset:11
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: flat_store_byte v[0:1], v4 offset:11
+; CHECK-NEXT: flat_load_ubyte v4, v[2:3] offset:10
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: flat_store_byte v[0:1], v4 offset:10
+; CHECK-NEXT: flat_load_ubyte v4, v[2:3] offset:9
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: flat_store_byte v[0:1], v4 offset:9
+; CHECK-NEXT: flat_load_ubyte v4, v[2:3] offset:8
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: flat_store_byte v[0:1], v4 offset:8
+; CHECK-NEXT: flat_load_ubyte v4, v[2:3] offset:7
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: flat_store_byte v[0:1], v4 offset:7
+; CHECK-NEXT: flat_load_ubyte v4, v[2:3] offset:6
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: flat_store_byte v[0:1], v4 offset:6
+; CHECK-NEXT: flat_load_ubyte v4, v[2:3] offset:5
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: flat_store_byte v[0:1], v4 offset:5
+; CHECK-NEXT: flat_load_ubyte v4, v[2:3] offset:4
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: flat_store_byte v[0:1], v4 offset:4
+; CHECK-NEXT: flat_load_ubyte v4, v[2:3] offset:3
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: flat_store_byte v[0:1], v4 offset:3
+; CHECK-NEXT: flat_load_ubyte v4, v[2:3] offset:2
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: flat_store_byte v[0:1], v4 offset:2
+; CHECK-NEXT: flat_load_ubyte v4, v[2:3] offset:1
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: flat_store_byte v[0:1], v4 offset:1
+; CHECK-NEXT: flat_load_ubyte v2, v[2:3]
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: flat_store_byte v[0:1], v2
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: s_setpc_b64 s[30:31]
+entry:
+ tail call void @llvm.memcpy.p0.p0.i64(ptr noundef nonnull align 1 dereferenceable(47) %dest, ptr noundef nonnull align 1 dereferenceable(47) %src, i64 47, i1 false)
+ ret void
+}
+
+declare void @llvm.memcpy.p0.p0.i64(ptr noalias nocapture writeonly, ptr noalias nocapture readonly, i64, i1 immarg) #0
+
+define dso_local amdgpu_kernel void @_Z11copy_globalPvS_(ptr addrspace(1) nocapture noundef writeonly %dest.coerce, ptr addrspace(1) nocapture noundef readonly %src.coerce) local_unnamed_addr #0 {
+; CHECK-LABEL: _Z11copy_globalPvS_:
+; CHECK: _Z11copy_globalPvS_$local:
+; CHECK-NEXT: .type _Z11copy_globalPvS_$local,@function
+; CHECK-NEXT: ; %bb.0: ; %entry
+; CHECK-NEXT: s_load_dwordx4 s[0:3], s[0:1], 0x24
+; CHECK-NEXT: v_mov_b32_e32 v0, 0
+; CHECK-NEXT: s_waitcnt lgkmcnt(0)
+; CHECK-NEXT: global_load_ubyte v1, v0, s[2:3]
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_byte v0, v1, s[0:1]
+; CHECK-NEXT: global_load_ubyte v1, v0, s[2:3] offset:1
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_byte v0, v1, s[0:1] offset:1
+; CHECK-NEXT: global_load_ubyte v1, v0, s[2:3] offset:2
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_byte v0, v1, s[0:1] offset:2
+; CHECK-NEXT: global_load_ubyte v1, v0, s[2:3] offset:3
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_byte v0, v1, s[0:1] offset:3
+; CHECK-NEXT: global_load_ubyte v1, v0, s[2:3] offset:4
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_byte v0, v1, s[0:1] offset:4
+; CHECK-NEXT: global_load_ubyte v1, v0, s[2:3] offset:5
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_byte v0, v1, s[0:1] offset:5
+; CHECK-NEXT: global_load_ubyte v1, v0, s[2:3] offset:6
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_byte v0, v1, s[0:1] offset:6
+; CHECK-NEXT: global_load_ubyte v1, v0, s[2:3] offset:7
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_byte v0, v1, s[0:1] offset:7
+; CHECK-NEXT: global_load_ubyte v1, v0, s[2:3] offset:8
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_byte v0, v1, s[0:1] offset:8
+; CHECK-NEXT: global_load_ubyte v1, v0, s[2:3] offset:9
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_byte v0, v1, s[0:1] offset:9
+; CHECK-NEXT: global_load_ubyte v1, v0, s[2:3] offset:10
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_byte v0, v1, s[0:1] offset:10
+; CHECK-NEXT: global_load_ubyte v1, v0, s[2:3] offset:11
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_byte v0, v1, s[0:1] offset:11
+; CHECK-NEXT: global_load_ubyte v1, v0, s[2:3] offset:12
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_byte v0, v1, s[0:1] offset:12
+; CHECK-NEXT: global_load_ubyte v1, v0, s[2:3] offset:13
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_byte v0, v1, s[0:1] offset:13
+; CHECK-NEXT: global_load_ubyte v1, v0, s[2:3] offset:14
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_byte v0, v1, s[0:1] offset:14
+; CHECK-NEXT: global_load_ubyte v1, v0, s[2:3] offset:15
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_byte v0, v1, s[0:1] offset:15
+; CHECK-NEXT: global_load_ubyte v1, v0, s[2:3] offset:16
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_byte v0, v1, s[0:1] offset:16
+; CHECK-NEXT: global_load_ubyte v1, v0, s[2:3] offset:17
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_byte v0, v1, s[0:1] offset:17
+; CHECK-NEXT: global_load_ubyte v1, v0, s[2:3] offset:18
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_byte v0, v1, s[0:1] offset:18
+; CHECK-NEXT: global_load_ubyte v1, v0, s[2:3] offset:19
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_byte v0, v1, s[0:1] offset:19
+; CHECK-NEXT: global_load_ubyte v1, v0, s[2:3] offset:20
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_byte v0, v1, s[0:1] offset:20
+; CHECK-NEXT: global_load_ubyte v1, v0, s[2:3] offset:21
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_byte v0, v1, s[0:1] offset:21
+; CHECK-NEXT: global_load_ubyte v1, v0, s[2:3] offset:22
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_byte v0, v1, s[0:1] offset:22
+; CHECK-NEXT: global_load_ubyte v1, v0, s[2:3] offset:23
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_byte v0, v1, s[0:1] offset:23
+; CHECK-NEXT: global_load_ubyte v1, v0, s[2:3] offset:24
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_byte v0, v1, s[0:1] offset:24
+; CHECK-NEXT: global_load_ubyte v1, v0, s[2:3] offset:25
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_byte v0, v1, s[0:1] offset:25
+; CHECK-NEXT: global_load_ubyte v1, v0, s[2:3] offset:26
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_byte v0, v1, s[0:1] offset:26
+; CHECK-NEXT: global_load_ubyte v1, v0, s[2:3] offset:27
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_byte v0, v1, s[0:1] offset:27
+; CHECK-NEXT: global_load_ubyte v1, v0, s[2:3] offset:28
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_byte v0, v1, s[0:1] offset:28
+; CHECK-NEXT: global_load_ubyte v1, v0, s[2:3] offset:29
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_byte v0, v1, s[0:1] offset:29
+; CHECK-NEXT: global_load_ubyte v1, v0, s[2:3] offset:30
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_byte v0, v1, s[0:1] offset:30
+; CHECK-NEXT: global_load_ubyte v1, v0, s[2:3] offset:31
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_byte v0, v1, s[0:1] offset:31
+; CHECK-NEXT: global_load_ubyte v1, v0, s[2:3] offset:32
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_byte v0, v1, s[0:1] offset:32
+; CHECK-NEXT: global_load_ubyte v1, v0, s[2:3] offset:33
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_byte v0, v1, s[0:1] offset:33
+; CHECK-NEXT: global_load_ubyte v1, v0, s[2:3] offset:34
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_byte v0, v1, s[0:1] offset:34
+; CHECK-NEXT: global_load_ubyte v1, v0, s[2:3] offset:35
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_byte v0, v1, s[0:1] offset:35
+; CHECK-NEXT: global_load_ubyte v1, v0, s[2:3] offset:36
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_byte v0, v1, s[0:1] offset:36
+; CHECK-NEXT: global_load_ubyte v1, v0, s[2:3] offset:37
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_byte v0, v1, s[0:1] offset:37
+; CHECK-NEXT: global_load_ubyte v1, v0, s[2:3] offset:38
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_byte v0, v1, s[0:1] offset:38
+; CHECK-NEXT: global_load_ubyte v1, v0, s[2:3] offset:39
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_byte v0, v1, s[0:1] offset:39
+; CHECK-NEXT: global_load_ubyte v1, v0, s[2:3] offset:40
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_byte v0, v1, s[0:1] offset:40
+; CHECK-NEXT: global_load_ubyte v1, v0, s[2:3] offset:41
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_byte v0, v1, s[0:1] offset:41
+; CHECK-NEXT: global_load_ubyte v1, v0, s[2:3] offset:42
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_byte v0, v1, s[0:1] offset:42
+; CHECK-NEXT: global_load_ubyte v1, v0, s[2:3] offset:43
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_byte v0, v1, s[0:1] offset:43
+; CHECK-NEXT: global_load_ubyte v1, v0, s[2:3] offset:44
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_byte v0, v1, s[0:1] offset:44
+; CHECK-NEXT: global_load_ubyte v1, v0, s[2:3] offset:45
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_byte v0, v1, s[0:1] offset:45
+; CHECK-NEXT: global_load_ubyte v1, v0, s[2:3] offset:46
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_byte v0, v1, s[0:1] offset:46
+; CHECK-NEXT: s_endpgm
+entry:
+ tail call void @llvm.memcpy.p1.p1.i64(ptr addrspace(1) noundef align 1 dereferenceable(47) %dest.coerce, ptr addrspace(1) noundef align 1 dereferenceable(47) %src.coerce, i64 47, i1 false)
+ ret void
+}
+
+define dso_local amdgpu_kernel void @_Z20copy_param_to_globalP1SS_(ptr addrspace(1) nocapture noundef writeonly %global.coerce, ptr addrspace(4) nocapture noundef readonly byref(%struct.S) align 4 %0) local_unnamed_addr #0 {
+; CHECK-LABEL: _Z20copy_param_to_globalP1SS_:
+; CHECK: _Z20copy_param_to_globalP1SS_$local:
+; CHECK-NEXT: .type _Z20copy_param_to_globalP1SS_$local,@function
+; CHECK-NEXT: ; %bb.0: ; %entry
+; CHECK-NEXT: s_load_dwordx4 s[20:23], s[0:1], 0x9c
+; CHECK-NEXT: s_load_dwordx2 s[28:29], s[0:1], 0x24
+; CHECK-NEXT: s_load_dwordx8 s[4:11], s[0:1], 0x2c
+; CHECK-NEXT: s_load_dwordx8 s[12:19], s[0:1], 0x4c
+; CHECK-NEXT: s_load_dwordx4 s[24:27], s[0:1], 0x8c
+; CHECK-NEXT: s_waitcnt lgkmcnt(0)
+; CHECK-NEXT: v_mov_b32_e32 v0, s20
+; CHECK-NEXT: v_mov_b32_e32 v1, s21
+; CHECK-NEXT: v_mov_b32_e32 v2, s22
+; CHECK-NEXT: v_mov_b32_e32 v3, s23
+; CHECK-NEXT: s_load_dwordx4 s[20:23], s[0:1], 0x7c
+; CHECK-NEXT: v_mov_b32_e32 v4, 0
+; CHECK-NEXT: s_load_dwordx4 s[0:3], s[0:1], 0x6c
+; CHECK-NEXT: global_store_dwordx4 v4, v[0:3], s[28:29] offset:112
+; CHECK-NEXT: s_nop 0
+; CHECK-NEXT: v_mov_b32_e32 v0, s24
+; CHECK-NEXT: v_mov_b32_e32 v1, s25
+; CHECK-NEXT: v_mov_b32_e32 v2, s26
+; CHECK-NEXT: v_mov_b32_e32 v3, s27
+; CHECK-NEXT: global_store_dwordx4 v4, v[0:3], s[28:29] offset:96
+; CHECK-NEXT: s_waitcnt lgkmcnt(0)
+; CHECK-NEXT: v_mov_b32_e32 v0, s20
+; CHECK-NEXT: v_mov_b32_e32 v1, s21
+; CHECK-NEXT: v_mov_b32_e32 v2, s22
+; CHECK-NEXT: v_mov_b32_e32 v3, s23
+; CHECK-NEXT: global_store_dwordx4 v4, v[0:3], s[28:29] offset:80
+; CHECK-NEXT: s_nop 0
+; CHECK-NEXT: v_mov_b32_e32 v0, s0
+; CHECK-NEXT: v_mov_b32_e32 v1, s1
+; CHECK-NEXT: v_mov_b32_e32 v2, s2
+; CHECK-NEXT: v_mov_b32_e32 v3, s3
+; CHECK-NEXT: global_store_dwordx4 v4, v[0:3], s[28:29] offset:64
+; CHECK-NEXT: s_nop 0
+; CHECK-NEXT: v_mov_b32_e32 v0, s16
+; CHECK-NEXT: v_mov_b32_e32 v1, s17
+; CHECK-NEXT: v_mov_b32_e32 v2, s18
+; CHECK-NEXT: v_mov_b32_e32 v3, s19
+; CHECK-NEXT: global_store_dwordx4 v4, v[0:3], s[28:29] offset:48
+; CHECK-NEXT: s_nop 0
+; CHECK-NEXT: v_mov_b32_e32 v0, s12
+; CHECK-NEXT: v_mov_b32_e32 v1, s13
+; CHECK-NEXT: v_mov_b32_e32 v2, s14
+; CHECK-NEXT: v_mov_b32_e32 v3, s15
+; CHECK-NEXT: global_store_dwordx4 v4, v[0:3], s[28:29] offset:32
+; CHECK-NEXT: s_nop 0
+; CHECK-NEXT: v_mov_b32_e32 v0, s8
+; CHECK-NEXT: v_mov_b32_e32 v1, s9
+; CHECK-NEXT: v_mov_b32_e32 v2, s10
+; CHECK-NEXT: v_mov_b32_e32 v3, s11...
[truncated]
Should we change the expansion threshold in the PreISelIntrinsicLowering?
Not really in this patch. With this patch those intrinsics will be lowered before instruction selection.
gentle ping
gentle ping + 1
Is this PR related to #88497?
Yes, it fixes my issue, see #88497 (comment)
ret void
}

; Function Attrs: nocallback nofree nounwind willreturn memory(argmem: readwrite)
Drop attribute comments
Can you add the fixed issue number to the description?
We had some instances when LLVM would not inline fixed-count memcpy and ended up attempting to lower it as a libcall, which would not work on AMDGPU as the address space doesn't meet the requirement, causing a compiler crash. The patch relaxes the threshold used for -Os compilation so we're always allowed to inline memory copy functions. This patch basically does the same thing as https://reviews.llvm.org/D158226 for AMDGPU.
…llvm#87632) We had some instances when LLVM would not inline fixed-count memcpy and ended up attempting to lower it as a libcall, which would not work on AMDGPU as the address space doesn't meet the requirement, causing a compiler crash. The patch relaxes the threshold used for -Os/-Oz compilation so we're always allowed to inline memory copy functions. This patch basically does the same thing as https://reviews.llvm.org/D158226 for AMDGPU. Fix llvm#88497. Change-Id: I5723e72172e1fc0b38265d864164a2408a493c28
We had some instances when LLVM would not inline fixed-count memcpy and ended up
attempting to lower it as a libcall, which would not work on AMDGPU as the
address space doesn't meet the requirement, causing a compiler crash.
The patch relaxes the threshold used for -Os/-Oz compilation so we're always allowed
to inline memory copy functions.
This patch basically does the same thing as https://reviews.llvm.org/D158226 for
AMDGPU.
Fix #88497.
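For readers unfamiliar with how the store limit interacts with inline expansion, here is a rough, standalone sketch of the idea. It is a toy model, not LLVM code: `countCopyOps` and `expandsInline` are hypothetical names, the greedy width selection is an assumption made for illustration, and the real lowering considers alignment, address space, and legal types as well.

```cpp
#include <cassert>
#include <cstdint>

// Toy model only: these helpers are hypothetical and are not LLVM APIs.
// Counts the load/store pairs needed to copy `sizeBytes` when the widest
// usable access is `maxOpBytes` (a power of two), picking widths greedily.
uint64_t countCopyOps(uint64_t sizeBytes, unsigned maxOpBytes) {
  assert(maxOpBytes != 0 && (maxOpBytes & (maxOpBytes - 1)) == 0 &&
         "widest access must be a power of two");
  uint64_t ops = 0;
  for (unsigned width = maxOpBytes; width != 0; width /= 2) {
    ops += sizeBytes / width;  // as many accesses of this width as fit
    sizeBytes %= width;        // the remainder is handled by narrower ones
  }
  return ops;
}

// True when the expansion fits under the store limit and can be emitted
// inline; false models the fallback to a libcall.
bool expandsInline(uint64_t sizeBytes, unsigned maxOpBytes,
                   unsigned maxStores) {
  return countCopyOps(sizeBytes, maxOpBytes) <= maxStores;
}
```

In this model the 47-byte, align-1 copy from the test needs 47 byte-wide pairs, so `expandsInline(47, 1, 4)` is false for an illustrative limit of 4, while `expandsInline(47, 1, ~0U)` is true. Setting the limits to `~0U`, as the patch does, means the count can never exceed the limit.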