avoid libcall to memcpy harder #54535

Open
nickdesaulniers opened this issue Mar 24, 2022 · 7 comments

Comments

@nickdesaulniers
Member

nickdesaulniers commented Mar 24, 2022

Via this thread:

Consider the following example:

struct foo {
    unsigned long x0;
    unsigned long x1;
    unsigned long x2;
    unsigned long x3;
    unsigned long x4;
    unsigned long x5;
    unsigned long x6;
    unsigned long x7;
    unsigned long x8;
    unsigned long x9;
    unsigned long x10;
    unsigned long x11;
    unsigned long x12;
    unsigned long x13;
    unsigned long x14;
    unsigned long x15;
    // Comment out the members below.
    unsigned long x16;
    unsigned long x17;
    unsigned long x18;
    unsigned long x19;
} *x, *y;

struct foo* get_x(void);

struct foo* cpy(struct foo *y) {
    struct foo *x = get_x();
    if (y != x)
        *x = *y;
    return x;
}

When compiled with -O2 -mno-sse (as the Linux kernel does), we get:

cpy:
  ...
        movl    $160, %edx
        movq    %rbx, %rdi
        movq    %r14, %rsi
        callq   memcpy@PLT
...

but if we reduce the number of members in struct foo, we can get:

cpy:
  ...
        movl    $16, %ecx
        movq    %rax, %rdi
        movq    %rbx, %rsi
        rep;movsq (%rsi), %es:(%rdi)
...

which is going to be way faster. FWICT, it looks like isel is choosing between lowering @llvm.memcpy.p0i8.p0i8.i64() to a libcall to memcpy and inlining a simple memcpy.

I assume there's some limit on how many bytes rep;movsq can copy, but surely it's much larger than 16x8B?
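For reference, this is presumably what the reduced variant looks like (hypothetical names foo_small/get_x_small/cpy_small, assuming the first 16 members are kept): a 128-byte struct, matching the 16 quadwords loaded into %ecx in the rep;movsq version above.

struct foo_small {
    /* 16 * 8 = 128 bytes total */
    unsigned long x0, x1, x2,  x3,  x4,  x5,  x6,  x7;
    unsigned long x8, x9, x10, x11, x12, x13, x14, x15;
};

struct foo_small *get_x_small(void);

struct foo_small *cpy_small(struct foo_small *y) {
    struct foo_small *x = get_x_small();
    if (y != x)
        *x = *y;    /* struct assignment, lowered via @llvm.memcpy */
    return x;
}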

cc @phoebewang

@llvmbot
Collaborator

llvmbot commented Mar 24, 2022

@llvm/issue-subscribers-backend-x86

@efriedma-quic
Collaborator

/// Max. memset / memcpy size that is turned into rep/movs, rep/stos ops.

The value "128" was chosen 16 years ago in 03c1e6f . Maybe the correct default has changed since then. :)

@phoebewang
Contributor

It seems only icelake and later targets have fast rep: https://reviews.llvm.org/D85989
We already have patches for the replacement: https://reviews.llvm.org/D86883
https://godbolt.org/z/ovEf9Kxfz

@LebedevRI
Member

I've actually thought about a similar problem.
Let me at least come up with a benchmark (-mllvm -x86-use-fsrm-for-memcpy will simplify that :))
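For illustration, roughly the kind of comparison such a benchmark might run. This is not the attached benchmark_memcpy.cc.txt; the rep movsb inline asm, buffer size, and iteration count are made up for the sketch, assuming x86-64 with a GCC/Clang-style compiler and POSIX clock_gettime.

#include <stdio.h>
#include <string.h>
#include <time.h>

#define COPY_SIZE 160u      /* matches the 160-byte struct foo copy above */
#define ITERATIONS 10000000u

static unsigned char dst_buf[COPY_SIZE], src_buf[COPY_SIZE];

/* Force a rep movsb copy via inline asm (x86-64 only). */
static void rep_movsb_copy(void *d, const void *s, size_t n) {
    __asm__ volatile("rep movsb"
                     : "+D"(d), "+S"(s), "+c"(n)
                     :
                     : "memory");
}

/* Plain library memcpy; called through a function pointer below so the
   compiler keeps it as an actual call. */
static void libc_copy(void *d, const void *s, size_t n) {
    memcpy(d, s, n);
}

static double time_loop(void (*copy)(void *, const void *, size_t)) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (unsigned i = 0; i < ITERATIONS; i++)
        copy(dst_buf, src_buf, COPY_SIZE);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double)(t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int main(void) {
    printf("memcpy:    %f s\n", time_loop(libc_copy));
    printf("rep movsb: %f s\n", time_loop(rep_movsb_copy));
    return 0;
}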

@LebedevRI
Member

LebedevRI commented Mar 25, 2022

Ok, got the benchmark-ish:
benchmark_memcpy.cc.txt

On Zen3:

For fully unaligned pointers, memcpy always wins: res-align1.txt. This concludes my interest.

align 2 likewise: res-align2.txt

@RKSimon
Collaborator

RKSimon commented Mar 25, 2022

CC @RKSimon

@bdaase

bdaase commented Jan 3, 2023

> llvm-project/llvm/lib/Target/X86/X86Subtarget.h (Line 81 in a9b70a8):
>
> /// Max. memset / memcpy size that is turned into rep/movs, rep/stos ops.
>
> The value "128" was chosen 16 years ago in 03c1e6f. Maybe the correct default has changed since then. :)

I am a bit surprised that MaxInlineSizeThreshold is actually 128, because my experiments indicate that it stops inlining the memcpy at 256 bytes: https://godbolt.org/z/j4qaTvjb7

In the linked example, the memcpy is inlined even though one field is 16 * 16 = 256 bytes.
Uncommenting either line 21 or changing the 16 to a 17 in line 39 makes it call memcpy.
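A minimal standalone sketch in the same spirit (the struct layout and names here are illustrative, not the exact ones from the godbolt link): per the observation above, with clang -O2 -mno-sse the 256-byte copy should be expanded inline, and growing the array to 33 elements should produce a memcpy libcall instead.

struct blob {
    unsigned long words[32];  /* 32 * 8 = 256 bytes; try 33 to get the libcall */
};

struct blob *get_blob(void);

struct blob *copy_blob(struct blob *src) {
    struct blob *dst = get_blob();
    if (src != dst)
        *dst = *src;   /* struct assignment, lowered via @llvm.memcpy */
    return dst;
}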
