
[AMDGPU][CodeGen] Improve handling of memcpy for -Os/-Oz compilations #87632

Merged
merged 1 commit into llvm:main on Apr 16, 2024

Conversation

shiltian
Contributor

@shiltian shiltian commented Apr 4, 2024

We had some instances where LLVM would not inline a fixed-count memcpy and instead
attempted to lower it as a libcall, which does not work on AMDGPU because the
address spaces don't meet the libcall's requirements, causing a compiler crash.

The patch relaxes the thresholds used for -Os/-Oz compilations so that we are always
allowed to inline memory-copy functions.

This patch basically does the same thing as https://reviews.llvm.org/D158226 for
AMDGPU.

Fixes #88497.
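
For a concrete picture of the failure mode, a minimal HIP-style reproducer might look like the sketch below. The function name, struct layout, and 47-byte copy size mirror the new lit test in this PR; the source itself is an illustrative assumption, not code taken from the bug report.

// Hypothetical device-side reproducer (HIP-style C++). Compiled at -Os/-Oz,
// the fixed-count memcpy below could previously be left as a call to the
// memcpy libcall, which AMDGPU cannot lower, crashing the compiler. With this
// patch the copy is always expanded into loads and stores instead.
#include <cstring>

struct S {
  int data[32]; // matches %struct.S (32 x i32) in the lit test
};

__device__ void copy_generic(void *dest, const void *src) {
  // 47 is a fixed, compile-time byte count, matching the
  // dereferenceable(47) memcpy in the new test.
  memcpy(dest, src, 47);
}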

@llvmbot
Collaborator

llvmbot commented Apr 4, 2024

@llvm/pr-subscribers-backend-amdgpu

Author: Shilei Tian (shiltian)

Changes

We had some instances where LLVM would not inline a fixed-count memcpy and instead
attempted to lower it as a libcall, which does not work on AMDGPU because the
address spaces don't meet the libcall's requirements, causing a compiler crash.

The patch relaxes the thresholds used for -Os compilations so that we are always
allowed to inline memory-copy functions.

This patch basically does the same thing as https://reviews.llvm.org/D158226 for
AMDGPU.


Patch is 34.24 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/87632.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp (+6)
  • (added) llvm/test/CodeGen/AMDGPU/memcpy-libcall.ll (+642)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
index f283af6fa07d3e..db69d50799e70b 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
@@ -59,6 +59,12 @@ unsigned AMDGPUTargetLowering::numBitsSigned(SDValue Op, SelectionDAG &DAG) {
 AMDGPUTargetLowering::AMDGPUTargetLowering(const TargetMachine &TM,
                                            const AMDGPUSubtarget &STI)
     : TargetLowering(TM), Subtarget(&STI) {
+  // Always lower memset, memcpy, and memmove intrinsics to load/store
+  // instructions, rather than generating calls to memset, memcpy, or memmove.
+  MaxStoresPerMemset = MaxStoresPerMemsetOptSize = ~0U;
+  MaxStoresPerMemcpy = MaxStoresPerMemcpyOptSize = ~0U;
+  MaxStoresPerMemmove = MaxStoresPerMemmoveOptSize = ~0U;
+
   // Lower floating point store/load to integer store/load to reduce the number
   // of patterns in tablegen.
   setOperationAction(ISD::LOAD, MVT::f32, Promote);
diff --git a/llvm/test/CodeGen/AMDGPU/memcpy-libcall.ll b/llvm/test/CodeGen/AMDGPU/memcpy-libcall.ll
new file mode 100644
index 00000000000000..2c1c2b4656f0ba
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/memcpy-libcall.ll
@@ -0,0 +1,642 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc -march=amdgcn -mcpu=gfx908 %s -o - | FileCheck %s
+
+%struct.S = type { [32 x i32] }
+
+@shared = local_unnamed_addr addrspace(3) global %struct.S undef, align 4
+
+define dso_local void @_Z12copy_genericPvPKv(ptr nocapture noundef writeonly %dest, ptr nocapture noundef readonly %src) local_unnamed_addr #0 {
+; CHECK-LABEL: _Z12copy_genericPvPKv:
+; CHECK:       _Z12copy_genericPvPKv$local:
+; CHECK-NEXT:    .type _Z12copy_genericPvPKv$local,@function
+; CHECK-NEXT:  ; %bb.0: ; %entry
+; CHECK-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    flat_load_ubyte v4, v[2:3] offset:46
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    flat_store_byte v[0:1], v4 offset:46
+; CHECK-NEXT:    flat_load_ubyte v4, v[2:3] offset:45
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    flat_store_byte v[0:1], v4 offset:45
+; CHECK-NEXT:    flat_load_ubyte v4, v[2:3] offset:44
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    flat_store_byte v[0:1], v4 offset:44
+; CHECK-NEXT:    flat_load_ubyte v4, v[2:3] offset:43
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    flat_store_byte v[0:1], v4 offset:43
+; CHECK-NEXT:    flat_load_ubyte v4, v[2:3] offset:42
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    flat_store_byte v[0:1], v4 offset:42
+; CHECK-NEXT:    flat_load_ubyte v4, v[2:3] offset:41
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    flat_store_byte v[0:1], v4 offset:41
+; CHECK-NEXT:    flat_load_ubyte v4, v[2:3] offset:40
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    flat_store_byte v[0:1], v4 offset:40
+; CHECK-NEXT:    flat_load_ubyte v4, v[2:3] offset:39
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    flat_store_byte v[0:1], v4 offset:39
+; CHECK-NEXT:    flat_load_ubyte v4, v[2:3] offset:38
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    flat_store_byte v[0:1], v4 offset:38
+; CHECK-NEXT:    flat_load_ubyte v4, v[2:3] offset:37
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    flat_store_byte v[0:1], v4 offset:37
+; CHECK-NEXT:    flat_load_ubyte v4, v[2:3] offset:36
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    flat_store_byte v[0:1], v4 offset:36
+; CHECK-NEXT:    flat_load_ubyte v4, v[2:3] offset:35
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    flat_store_byte v[0:1], v4 offset:35
+; CHECK-NEXT:    flat_load_ubyte v4, v[2:3] offset:34
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    flat_store_byte v[0:1], v4 offset:34
+; CHECK-NEXT:    flat_load_ubyte v4, v[2:3] offset:33
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    flat_store_byte v[0:1], v4 offset:33
+; CHECK-NEXT:    flat_load_ubyte v4, v[2:3] offset:32
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    flat_store_byte v[0:1], v4 offset:32
+; CHECK-NEXT:    flat_load_ubyte v4, v[2:3] offset:31
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    flat_store_byte v[0:1], v4 offset:31
+; CHECK-NEXT:    flat_load_ubyte v4, v[2:3] offset:30
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    flat_store_byte v[0:1], v4 offset:30
+; CHECK-NEXT:    flat_load_ubyte v4, v[2:3] offset:29
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    flat_store_byte v[0:1], v4 offset:29
+; CHECK-NEXT:    flat_load_ubyte v4, v[2:3] offset:28
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    flat_store_byte v[0:1], v4 offset:28
+; CHECK-NEXT:    flat_load_ubyte v4, v[2:3] offset:27
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    flat_store_byte v[0:1], v4 offset:27
+; CHECK-NEXT:    flat_load_ubyte v4, v[2:3] offset:26
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    flat_store_byte v[0:1], v4 offset:26
+; CHECK-NEXT:    flat_load_ubyte v4, v[2:3] offset:25
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    flat_store_byte v[0:1], v4 offset:25
+; CHECK-NEXT:    flat_load_ubyte v4, v[2:3] offset:24
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    flat_store_byte v[0:1], v4 offset:24
+; CHECK-NEXT:    flat_load_ubyte v4, v[2:3] offset:23
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    flat_store_byte v[0:1], v4 offset:23
+; CHECK-NEXT:    flat_load_ubyte v4, v[2:3] offset:22
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    flat_store_byte v[0:1], v4 offset:22
+; CHECK-NEXT:    flat_load_ubyte v4, v[2:3] offset:21
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    flat_store_byte v[0:1], v4 offset:21
+; CHECK-NEXT:    flat_load_ubyte v4, v[2:3] offset:20
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    flat_store_byte v[0:1], v4 offset:20
+; CHECK-NEXT:    flat_load_ubyte v4, v[2:3] offset:19
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    flat_store_byte v[0:1], v4 offset:19
+; CHECK-NEXT:    flat_load_ubyte v4, v[2:3] offset:18
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    flat_store_byte v[0:1], v4 offset:18
+; CHECK-NEXT:    flat_load_ubyte v4, v[2:3] offset:17
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    flat_store_byte v[0:1], v4 offset:17
+; CHECK-NEXT:    flat_load_ubyte v4, v[2:3] offset:16
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    flat_store_byte v[0:1], v4 offset:16
+; CHECK-NEXT:    flat_load_ubyte v4, v[2:3] offset:15
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    flat_store_byte v[0:1], v4 offset:15
+; CHECK-NEXT:    flat_load_ubyte v4, v[2:3] offset:14
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    flat_store_byte v[0:1], v4 offset:14
+; CHECK-NEXT:    flat_load_ubyte v4, v[2:3] offset:13
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    flat_store_byte v[0:1], v4 offset:13
+; CHECK-NEXT:    flat_load_ubyte v4, v[2:3] offset:12
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    flat_store_byte v[0:1], v4 offset:12
+; CHECK-NEXT:    flat_load_ubyte v4, v[2:3] offset:11
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    flat_store_byte v[0:1], v4 offset:11
+; CHECK-NEXT:    flat_load_ubyte v4, v[2:3] offset:10
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    flat_store_byte v[0:1], v4 offset:10
+; CHECK-NEXT:    flat_load_ubyte v4, v[2:3] offset:9
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    flat_store_byte v[0:1], v4 offset:9
+; CHECK-NEXT:    flat_load_ubyte v4, v[2:3] offset:8
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    flat_store_byte v[0:1], v4 offset:8
+; CHECK-NEXT:    flat_load_ubyte v4, v[2:3] offset:7
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    flat_store_byte v[0:1], v4 offset:7
+; CHECK-NEXT:    flat_load_ubyte v4, v[2:3] offset:6
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    flat_store_byte v[0:1], v4 offset:6
+; CHECK-NEXT:    flat_load_ubyte v4, v[2:3] offset:5
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    flat_store_byte v[0:1], v4 offset:5
+; CHECK-NEXT:    flat_load_ubyte v4, v[2:3] offset:4
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    flat_store_byte v[0:1], v4 offset:4
+; CHECK-NEXT:    flat_load_ubyte v4, v[2:3] offset:3
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    flat_store_byte v[0:1], v4 offset:3
+; CHECK-NEXT:    flat_load_ubyte v4, v[2:3] offset:2
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    flat_store_byte v[0:1], v4 offset:2
+; CHECK-NEXT:    flat_load_ubyte v4, v[2:3] offset:1
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    flat_store_byte v[0:1], v4 offset:1
+; CHECK-NEXT:    flat_load_ubyte v2, v[2:3]
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    flat_store_byte v[0:1], v2
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    s_setpc_b64 s[30:31]
+entry:
+  tail call void @llvm.memcpy.p0.p0.i64(ptr noundef nonnull align 1 dereferenceable(47) %dest, ptr noundef nonnull align 1 dereferenceable(47) %src, i64 47, i1 false)
+  ret void
+}
+
+declare void @llvm.memcpy.p0.p0.i64(ptr noalias nocapture writeonly, ptr noalias nocapture readonly, i64, i1 immarg) #0
+
+define dso_local amdgpu_kernel void @_Z11copy_globalPvS_(ptr addrspace(1) nocapture noundef writeonly %dest.coerce, ptr addrspace(1) nocapture noundef readonly %src.coerce) local_unnamed_addr #0 {
+; CHECK-LABEL: _Z11copy_globalPvS_:
+; CHECK:       _Z11copy_globalPvS_$local:
+; CHECK-NEXT:    .type _Z11copy_globalPvS_$local,@function
+; CHECK-NEXT:  ; %bb.0: ; %entry
+; CHECK-NEXT:    s_load_dwordx4 s[0:3], s[0:1], 0x24
+; CHECK-NEXT:    v_mov_b32_e32 v0, 0
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    global_load_ubyte v1, v0, s[2:3]
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_byte v0, v1, s[0:1]
+; CHECK-NEXT:    global_load_ubyte v1, v0, s[2:3] offset:1
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_byte v0, v1, s[0:1] offset:1
+; CHECK-NEXT:    global_load_ubyte v1, v0, s[2:3] offset:2
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_byte v0, v1, s[0:1] offset:2
+; CHECK-NEXT:    global_load_ubyte v1, v0, s[2:3] offset:3
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_byte v0, v1, s[0:1] offset:3
+; CHECK-NEXT:    global_load_ubyte v1, v0, s[2:3] offset:4
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_byte v0, v1, s[0:1] offset:4
+; CHECK-NEXT:    global_load_ubyte v1, v0, s[2:3] offset:5
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_byte v0, v1, s[0:1] offset:5
+; CHECK-NEXT:    global_load_ubyte v1, v0, s[2:3] offset:6
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_byte v0, v1, s[0:1] offset:6
+; CHECK-NEXT:    global_load_ubyte v1, v0, s[2:3] offset:7
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_byte v0, v1, s[0:1] offset:7
+; CHECK-NEXT:    global_load_ubyte v1, v0, s[2:3] offset:8
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_byte v0, v1, s[0:1] offset:8
+; CHECK-NEXT:    global_load_ubyte v1, v0, s[2:3] offset:9
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_byte v0, v1, s[0:1] offset:9
+; CHECK-NEXT:    global_load_ubyte v1, v0, s[2:3] offset:10
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_byte v0, v1, s[0:1] offset:10
+; CHECK-NEXT:    global_load_ubyte v1, v0, s[2:3] offset:11
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_byte v0, v1, s[0:1] offset:11
+; CHECK-NEXT:    global_load_ubyte v1, v0, s[2:3] offset:12
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_byte v0, v1, s[0:1] offset:12
+; CHECK-NEXT:    global_load_ubyte v1, v0, s[2:3] offset:13
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_byte v0, v1, s[0:1] offset:13
+; CHECK-NEXT:    global_load_ubyte v1, v0, s[2:3] offset:14
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_byte v0, v1, s[0:1] offset:14
+; CHECK-NEXT:    global_load_ubyte v1, v0, s[2:3] offset:15
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_byte v0, v1, s[0:1] offset:15
+; CHECK-NEXT:    global_load_ubyte v1, v0, s[2:3] offset:16
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_byte v0, v1, s[0:1] offset:16
+; CHECK-NEXT:    global_load_ubyte v1, v0, s[2:3] offset:17
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_byte v0, v1, s[0:1] offset:17
+; CHECK-NEXT:    global_load_ubyte v1, v0, s[2:3] offset:18
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_byte v0, v1, s[0:1] offset:18
+; CHECK-NEXT:    global_load_ubyte v1, v0, s[2:3] offset:19
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_byte v0, v1, s[0:1] offset:19
+; CHECK-NEXT:    global_load_ubyte v1, v0, s[2:3] offset:20
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_byte v0, v1, s[0:1] offset:20
+; CHECK-NEXT:    global_load_ubyte v1, v0, s[2:3] offset:21
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_byte v0, v1, s[0:1] offset:21
+; CHECK-NEXT:    global_load_ubyte v1, v0, s[2:3] offset:22
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_byte v0, v1, s[0:1] offset:22
+; CHECK-NEXT:    global_load_ubyte v1, v0, s[2:3] offset:23
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_byte v0, v1, s[0:1] offset:23
+; CHECK-NEXT:    global_load_ubyte v1, v0, s[2:3] offset:24
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_byte v0, v1, s[0:1] offset:24
+; CHECK-NEXT:    global_load_ubyte v1, v0, s[2:3] offset:25
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_byte v0, v1, s[0:1] offset:25
+; CHECK-NEXT:    global_load_ubyte v1, v0, s[2:3] offset:26
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_byte v0, v1, s[0:1] offset:26
+; CHECK-NEXT:    global_load_ubyte v1, v0, s[2:3] offset:27
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_byte v0, v1, s[0:1] offset:27
+; CHECK-NEXT:    global_load_ubyte v1, v0, s[2:3] offset:28
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_byte v0, v1, s[0:1] offset:28
+; CHECK-NEXT:    global_load_ubyte v1, v0, s[2:3] offset:29
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_byte v0, v1, s[0:1] offset:29
+; CHECK-NEXT:    global_load_ubyte v1, v0, s[2:3] offset:30
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_byte v0, v1, s[0:1] offset:30
+; CHECK-NEXT:    global_load_ubyte v1, v0, s[2:3] offset:31
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_byte v0, v1, s[0:1] offset:31
+; CHECK-NEXT:    global_load_ubyte v1, v0, s[2:3] offset:32
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_byte v0, v1, s[0:1] offset:32
+; CHECK-NEXT:    global_load_ubyte v1, v0, s[2:3] offset:33
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_byte v0, v1, s[0:1] offset:33
+; CHECK-NEXT:    global_load_ubyte v1, v0, s[2:3] offset:34
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_byte v0, v1, s[0:1] offset:34
+; CHECK-NEXT:    global_load_ubyte v1, v0, s[2:3] offset:35
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_byte v0, v1, s[0:1] offset:35
+; CHECK-NEXT:    global_load_ubyte v1, v0, s[2:3] offset:36
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_byte v0, v1, s[0:1] offset:36
+; CHECK-NEXT:    global_load_ubyte v1, v0, s[2:3] offset:37
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_byte v0, v1, s[0:1] offset:37
+; CHECK-NEXT:    global_load_ubyte v1, v0, s[2:3] offset:38
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_byte v0, v1, s[0:1] offset:38
+; CHECK-NEXT:    global_load_ubyte v1, v0, s[2:3] offset:39
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_byte v0, v1, s[0:1] offset:39
+; CHECK-NEXT:    global_load_ubyte v1, v0, s[2:3] offset:40
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_byte v0, v1, s[0:1] offset:40
+; CHECK-NEXT:    global_load_ubyte v1, v0, s[2:3] offset:41
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_byte v0, v1, s[0:1] offset:41
+; CHECK-NEXT:    global_load_ubyte v1, v0, s[2:3] offset:42
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_byte v0, v1, s[0:1] offset:42
+; CHECK-NEXT:    global_load_ubyte v1, v0, s[2:3] offset:43
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_byte v0, v1, s[0:1] offset:43
+; CHECK-NEXT:    global_load_ubyte v1, v0, s[2:3] offset:44
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_byte v0, v1, s[0:1] offset:44
+; CHECK-NEXT:    global_load_ubyte v1, v0, s[2:3] offset:45
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_byte v0, v1, s[0:1] offset:45
+; CHECK-NEXT:    global_load_ubyte v1, v0, s[2:3] offset:46
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_byte v0, v1, s[0:1] offset:46
+; CHECK-NEXT:    s_endpgm
+entry:
+  tail call void @llvm.memcpy.p1.p1.i64(ptr addrspace(1) noundef align 1 dereferenceable(47) %dest.coerce, ptr addrspace(1) noundef align 1 dereferenceable(47) %src.coerce, i64 47, i1 false)
+  ret void
+}
+
+define dso_local amdgpu_kernel void @_Z20copy_param_to_globalP1SS_(ptr addrspace(1) nocapture noundef writeonly %global.coerce, ptr addrspace(4) nocapture noundef readonly byref(%struct.S) align 4 %0) local_unnamed_addr #0 {
+; CHECK-LABEL: _Z20copy_param_to_globalP1SS_:
+; CHECK:       _Z20copy_param_to_globalP1SS_$local:
+; CHECK-NEXT:    .type _Z20copy_param_to_globalP1SS_$local,@function
+; CHECK-NEXT:  ; %bb.0: ; %entry
+; CHECK-NEXT:    s_load_dwordx4 s[20:23], s[0:1], 0x9c
+; CHECK-NEXT:    s_load_dwordx2 s[28:29], s[0:1], 0x24
+; CHECK-NEXT:    s_load_dwordx8 s[4:11], s[0:1], 0x2c
+; CHECK-NEXT:    s_load_dwordx8 s[12:19], s[0:1], 0x4c
+; CHECK-NEXT:    s_load_dwordx4 s[24:27], s[0:1], 0x8c
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    v_mov_b32_e32 v0, s20
+; CHECK-NEXT:    v_mov_b32_e32 v1, s21
+; CHECK-NEXT:    v_mov_b32_e32 v2, s22
+; CHECK-NEXT:    v_mov_b32_e32 v3, s23
+; CHECK-NEXT:    s_load_dwordx4 s[20:23], s[0:1], 0x7c
+; CHECK-NEXT:    v_mov_b32_e32 v4, 0
+; CHECK-NEXT:    s_load_dwordx4 s[0:3], s[0:1], 0x6c
+; CHECK-NEXT:    global_store_dwordx4 v4, v[0:3], s[28:29] offset:112
+; CHECK-NEXT:    s_nop 0
+; CHECK-NEXT:    v_mov_b32_e32 v0, s24
+; CHECK-NEXT:    v_mov_b32_e32 v1, s25
+; CHECK-NEXT:    v_mov_b32_e32 v2, s26
+; CHECK-NEXT:    v_mov_b32_e32 v3, s27
+; CHECK-NEXT:    global_store_dwordx4 v4, v[0:3], s[28:29] offset:96
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    v_mov_b32_e32 v0, s20
+; CHECK-NEXT:    v_mov_b32_e32 v1, s21
+; CHECK-NEXT:    v_mov_b32_e32 v2, s22
+; CHECK-NEXT:    v_mov_b32_e32 v3, s23
+; CHECK-NEXT:    global_store_dwordx4 v4, v[0:3], s[28:29] offset:80
+; CHECK-NEXT:    s_nop 0
+; CHECK-NEXT:    v_mov_b32_e32 v0, s0
+; CHECK-NEXT:    v_mov_b32_e32 v1, s1
+; CHECK-NEXT:    v_mov_b32_e32 v2, s2
+; CHECK-NEXT:    v_mov_b32_e32 v3, s3
+; CHECK-NEXT:    global_store_dwordx4 v4, v[0:3], s[28:29] offset:64
+; CHECK-NEXT:    s_nop 0
+; CHECK-NEXT:    v_mov_b32_e32 v0, s16
+; CHECK-NEXT:    v_mov_b32_e32 v1, s17
+; CHECK-NEXT:    v_mov_b32_e32 v2, s18
+; CHECK-NEXT:    v_mov_b32_e32 v3, s19
+; CHECK-NEXT:    global_store_dwordx4 v4, v[0:3], s[28:29] offset:48
+; CHECK-NEXT:    s_nop 0
+; CHECK-NEXT:    v_mov_b32_e32 v0, s12
+; CHECK-NEXT:    v_mov_b32_e32 v1, s13
+; CHECK-NEXT:    v_mov_b32_e32 v2, s14
+; CHECK-NEXT:    v_mov_b32_e32 v3, s15
+; CHECK-NEXT:    global_store_dwordx4 v4, v[0:3], s[28:29] offset:32
+; CHECK-NEXT:    s_nop 0
+; CHECK-NEXT:    v_mov_b32_e32 v0, s8
+; CHECK-NEXT:    v_mov_b32_e32 v1, s9
+; CHECK-NEXT:    v_mov_b32_e32 v2, s10
+; CHECK-NEXT:    v_mov_b32_e32 v3, s11...
[truncated]
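
To make the one-line C++ change above concrete: the generic SelectionDAG lowering only expands a fixed-count memcpy into loads and stores when the number of stores it would need stays under a per-target limit, and under -Os/-Oz it consults the smaller *OptSize limit; exceeding the limit falls back to a memcpy libcall, which is the path that crashed on AMDGPU. The sketch below is a standalone paraphrase of that decision, not the actual LLVM code; the field names mirror the MaxStoresPerMemcpy / MaxStoresPerMemcpyOptSize members set in the constructor above, and the default values are made up for illustration. Setting both fields to ~0U, as this patch does, makes the inline expansion unconditional.

// Standalone paraphrase (illustration only) of how the per-target store
// limits gate inline memcpy expansion.
struct MemcpyLimits {
  unsigned MaxStoresPerMemcpy = 8;        // illustrative limit for normal builds
  unsigned MaxStoresPerMemcpyOptSize = 4; // illustrative smaller limit for -Os/-Oz
};

// Returns true when the copy should be expanded inline into loads/stores.
bool shouldExpandInline(const MemcpyLimits &L, unsigned NumStoresNeeded,
                        bool OptForSize) {
  unsigned Limit = OptForSize ? L.MaxStoresPerMemcpyOptSize
                              : L.MaxStoresPerMemcpy;
  // Exceeding the limit makes the generic code emit a call to memcpy instead,
  // which AMDGPU cannot lower for its address spaces.
  return NumStoresNeeded <= Limit;
}

// With this patch AMDGPU sets both limits to ~0U, so the comparison above can
// never push a fixed-count memcpy down the libcall path.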

@shiltian shiltian requested a review from Pierre-vh April 4, 2024 12:50
Contributor

@arsenm arsenm left a comment


Should we change the expansion threshold in the PreISelIntrinsicLowering?

llvm/test/CodeGen/AMDGPU/memcpy-libcall.ll (inline comment, outdated; resolved)
llvm/test/CodeGen/AMDGPU/memcpy-libcall.ll (inline comment; resolved)
@shiltian
Contributor Author

shiltian commented Apr 7, 2024

Should we change the expansion threshold in the PreISelIntrinsicLowering?

Not really in this patch. With this patch those intrinsics will be lowered before instruction selection.

@shiltian
Contributor Author

shiltian commented Apr 9, 2024

gentle ping

@shiltian shiltian requested a review from rampitec April 9, 2024 12:39
@shiltian
Contributor Author

gentle ping + 1

@ex-rzr

ex-rzr commented Apr 16, 2024

Is this PR related to #88497?

@ex-rzr

ex-rzr commented Apr 16, 2024

Yes, it fixes my issue, see #88497 (comment)

ret void
}

; Function Attrs: nocallback nofree nounwind willreturn memory(argmem: readwrite)
Contributor


Drop attribute comments

@arsenm
Contributor

arsenm commented Apr 16, 2024

Can you add the number of the issue this fixes to the description?

We had some instances where LLVM would not inline a fixed-count memcpy and instead
attempted to lower it as a libcall, which does not work on AMDGPU because the
address spaces don't meet the libcall's requirements, causing a compiler crash.

The patch relaxes the thresholds used for -Os compilations so that we are always
allowed to inline memory-copy functions.

This patch basically does the same thing as https://reviews.llvm.org/D158226 for
AMDGPU.
@shiltian shiltian merged commit 9ce74d6 into llvm:main Apr 16, 2024
3 of 4 checks passed
@shiltian shiltian deleted the limit branch April 16, 2024 13:34
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Apr 17, 2024
[AMDGPU][CodeGen] Improve handling of memcpy for -Os/-Oz compilations (llvm#87632)

We had some instances where LLVM would not inline a fixed-count memcpy and
instead attempted to lower it as a libcall, which does not work on AMDGPU
because the address spaces don't meet the libcall's requirements, causing a
compiler crash.

The patch relaxes the thresholds used for -Os/-Oz compilations so that we are
always allowed to inline memory-copy functions.

This patch basically does the same thing as https://reviews.llvm.org/D158226
for AMDGPU.

Fixes llvm#88497.

Change-Id: I5723e72172e1fc0b38265d864164a2408a493c28