
Better memcpy on Windows #201

Closed
degski opened this issue Feb 3, 2020 · 18 comments

Comments

@degski

degski commented Feb 3, 2020

The following code:

#include <xmmintrin.h> // __m128, _mm_load_ps, _mm_stream_ps
#include <stdint.h>
#include <stddef.h>

// dst and src must be 16-byte aligned
// size must be a multiple of 16*2 = 32 bytes
inline void memcpy_sse ( void * dst, void const * src, size_t size ) {
// https://hero.handmade.network/forums/code-discussion/t/157-memory_bandwidth_+_implementing_memcpy
    size_t stride = 2 * sizeof ( __m128 );
    while ( size ) {
        __m128 a = _mm_load_ps ( ( float * ) ( ( ( uint8_t const * ) src ) + 0 * sizeof ( __m128 ) ) );
        __m128 b = _mm_load_ps ( ( float * ) ( ( ( uint8_t const * ) src ) + 1 * sizeof ( __m128 ) ) );
        _mm_stream_ps ( ( float * ) ( ( ( uint8_t * ) dst ) + 0 * sizeof ( __m128 ) ), a );
        _mm_stream_ps ( ( float * ) ( ( ( uint8_t * ) dst ) + 1 * sizeof ( __m128 ) ), b );
        size -= stride;
        src = ( ( uint8_t const * ) src ) + stride;
        dst = ( ( uint8_t * ) dst ) + stride;
    }
}

is an SSE2-style streaming memcpy; it's about 25% faster than memcpy on Windows. The AVX streaming version is not faster than the SSE2 one, so SSE2 is the better choice: a smaller minimum copy size, easier alignment constraints, and support on virtually any PC processor still in service. A speed-up for grabs.

PS: mimalloc uses memcpy 15 times in dev (I grepped it so you don't have to; I didn't check whether the replacement would be applicable in every single case).
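
For what it's worth, a minimal usage sketch of the above (the buffer names and sizes are made up for illustration; it assumes memcpy_sse from the snippet above is in scope and C11 for _Alignas):

#include <string.h>

// 16-byte aligned buffers; 1024 is a multiple of 32, as memcpy_sse requires
_Alignas ( 16 ) static unsigned char src_buf [ 1024 ];
_Alignas ( 16 ) static unsigned char dst_buf [ 1024 ];

int main ( void ) {
    memset ( src_buf, 0xAB, sizeof ( src_buf ) );
    memcpy_sse ( dst_buf, src_buf, sizeof ( src_buf ) ); // both 16-byte aligned, size % 32 == 0
    return 0;
}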

@daanx
Collaborator

daanx commented Mar 7, 2020

Thanks! @degski. I am hesitant to add it as it is so architecture specific and the only important memcpy where this could matter is in mi_realloc... but that could be an important one :-) Need to think a bit more on this.

@degski
Author

degski commented Mar 7, 2020

There are virtually no CPUs in use that don't support SSE2. (One can write the same function with AVX, but it gives no benefit and the alignment requirements are less general; I guess we are already hitting max bus speed on my machine with SSE2. Maybe Optane RAM (my birthday is in a few months :-) ) would benefit from the AVX instructions?) The above code compiles with gcc/msvc/clang. realloc is the most obvious candidate, and a similar situation made me try to go even faster. mi_malloc beats malloc by some 50%, that is, when using https://github.com/degski/pector/blob/master/include/pector/mimalloc_allocator.h (a realloc-ing allocator) with pt::pector [unfortunately, this cannot work with std::vector, as the STL has no API to call into any realloc functionality]. A faster memcpy will make it even faster. Within the context of mimalloc, the requirements are fulfilled by default. Thanks for considering it.

PS: in the meantime I have generalized/relaxed things somewhat here: https://github.com/degski/Sax/blob/ed2289055690baba71be31437154e72af504a55d/include/sax/stl.hpp#L263

@mpoeter

mpoeter commented Mar 7, 2020

@degski this code uses non-temporal streaming instructions to write the data, but it is lacking the sfence or mfence instruction to ensure that these stores are globally visible, which can be crucial in a multithreaded context. Just a sidenote: it is unclear to me why this implementation uses the packed single version _mm_stream_ps instead of the generic _mm_stream_si128.

But more importantly, the non-temporal stores bypass the cache. This only makes sense if the memory written is not immediately re-accessed, as otherwise we will have to re-load the just written data from memory to the cache, so IMHO a realloc operation is probably not a good candidate!
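
For illustration, a sketch of the same loop with the missing fence added at the end (same constraints as above: 16-byte aligned pointers, size a multiple of 32 bytes):

#include <xmmintrin.h> // __m128, _mm_load_ps, _mm_stream_ps, _mm_sfence
#include <stdint.h>
#include <stddef.h>

static inline void memcpy_sse_fenced ( void * dst, void const * src, size_t size ) {
    size_t const stride = 2 * sizeof ( __m128 );
    while ( size ) {
        __m128 a = _mm_load_ps ( ( float const * ) ( ( uint8_t const * ) src + 0 * sizeof ( __m128 ) ) );
        __m128 b = _mm_load_ps ( ( float const * ) ( ( uint8_t const * ) src + 1 * sizeof ( __m128 ) ) );
        _mm_stream_ps ( ( float * ) ( ( uint8_t * ) dst + 0 * sizeof ( __m128 ) ), a );
        _mm_stream_ps ( ( float * ) ( ( uint8_t * ) dst + 1 * sizeof ( __m128 ) ), b );
        size -= stride;
        src = ( uint8_t const * ) src + stride;
        dst = ( uint8_t * ) dst + stride;
    }
    _mm_sfence ( ); // order the non-temporal stores before any subsequent stores become visible
}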

@nxrighthere

On my machine with Ryzen 5, memcpy is the absolute winner:

memcpy = 9.73 GiB/s
CopyWithSSE = 3.69 GiB/s
CopyWithSSESmall = 2.93 GiB/s
CopyWithSSENoCache = 3.02 GiB/s (the code from this issue)
CopyWithAVX = 3.94 GiB/s
CopyWithAVXSmall = 4.15 GiB/s
CopyWithAVXNoCache = 4.47 GiB/s
CopyWithRepMovsb = 5.77 GiB/s
CopyWithRepMovsd = 5.76 GiB/s
CopyWithRepMovsq = 5.76 GiB/s
CopyWithRepMovsbUnaligned = 5.77 GiB/s
CopyWithThreads = 8.14 GiB/s

@nxrighthere

rte_memcpy from DPDK: rte_memcpy = 2.23 GiB/s

@degski
Author

degski commented May 7, 2020

Just a sidenote: it is unclear to me why this implementation uses the packed single version _mm_stream_ps instead of the generic _mm_stream_si128.

Three reasons: it's faster, it's more widely available, and it's easier on alignment requirements. It helps to read everything that's written, including the linked article (and the updated code in the blob linked above).

@degski
Author

degski commented May 7, 2020

On my machine with Ryzen 5, memcpy is the absolute winner:

What OS? I am comparing against std::memcpy on latest Windows 64 bit. This idea pertains to W10-X64 only!

@nxrighthere

I'm on Windows 10 x64 1909.

@degski
Author

degski commented May 7, 2020

@nxrighthere Interesting, need to add some 'fat' code then. On an Intel Broadwell I see improvements of 20% or so, which (on something as basic as this) is nothing to be sniffed at. Soon (I hope) I can start using my Coffee Lake 6-core (cannot move and pick it up because of covid) and run this test again.

@mpoeter

mpoeter commented May 7, 2020

It helps to read everything that's written, including the linked article (in the updated code (see blobl)).

True, but apparently you did not read the whole thing yourself, since the author himself wrote:

No idea about differences between integer vs float loads. [...]

Apart from that, _mm_stream_ps and _mm_stream_si128 have the same alignment requirements, as well as the same throughput and latency. The real difference is that they are usually handled by different execution units. As far as availability goes, the first one was introduced with SSE while the second one was added with SSE2, but I doubt that you will find many CPUs in the wild today that predate SSE2.

And why would this idea pertain to W10-X64 only? This part of the code is completely OS independent.

In general I would not try to draw any conclusions from the results in the referenced forum discussion. It is not clear how these results were obtained, how many samples were taken, etc.
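
For comparison, a sketch of the same copy loop written with the integer intrinsics (an illustration only; alignment and size constraints are identical to the float version):

#include <emmintrin.h> // __m128i, _mm_load_si128, _mm_stream_si128 (SSE2)
#include <stdint.h>
#include <stddef.h>

// dst and src must be 16-byte aligned, size a multiple of 32 bytes
static inline void memcpy_sse2_si128 ( void * dst, void const * src, size_t size ) {
    size_t const stride = 2 * sizeof ( __m128i );
    while ( size ) {
        __m128i a = _mm_load_si128 ( ( __m128i const * ) src + 0 );
        __m128i b = _mm_load_si128 ( ( __m128i const * ) src + 1 );
        _mm_stream_si128 ( ( __m128i * ) dst + 0, a );
        _mm_stream_si128 ( ( __m128i * ) dst + 1, b );
        size -= stride;
        src = ( uint8_t const * ) src + stride;
        dst = ( uint8_t * ) dst + stride;
    }
    _mm_sfence ( ); // the stores are still non-temporal, so the fence is still needed
}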

@haneefmubarak
Contributor

I mean, if you're going to argue for a memcpy replacement on Windows, basically any modern x64 processor implements Enhanced REP MOVSB or equivalent / improved functionality, which doesn't even have any alignment requirements. Conveniently enough, this is available as an intrinsic on Windows platforms, so this implementation should be sufficient:

#include <intrin.h>  // __movsb
#include <stddef.h>  // size_t

static inline void memcpy_movsb (void *d, const void *s, size_t n) {
	__movsb ((unsigned char *) d, (const unsigned char *) s, n);
	return;
}

This has the significant advantage of being roughly the size of a function call when inlined (arguably, when inlined into an actual function, registers will already be in use, so it will be smaller than a function call by virtue of not having to push registers to the stack),

// variable to show work before / after and prevent jmp optimization
int incr = 0;

void mcpy_memcpy (void *d, const void *s, size_t n) {
    incr++;
    memcpy (d, s, n);
    incr++;
}

void mcpy_movsb (void *d, const void *s, size_t n) {
    incr++;
    memcpy_movsb (d, s, n);
    incr++;
}
_mcpy_memcpy PROC
        push    DWORD PTR _n$[esp-4]
        inc     DWORD PTR _incr
        push    DWORD PTR _s$[esp]
        push    DWORD PTR _d$[esp+4]
        call    _memcpy
        inc     DWORD PTR _incr
        add     esp, 12
        ret     0
_mcpy_memcpy ENDP


_mcpy_movsb PROC
        inc     DWORD PTR _incr
        mov     ecx, DWORD PTR _n$[esp-4]
        push    esi
        mov     esi, DWORD PTR _s$[esp]
        push    edi
        mov     edi, DWORD PTR _d$[esp+4]
        rep movsb
        inc     DWORD PTR _incr
        pop     edi
        pop     esi
        ret     0
_mcpy_movsb ENDP

not requiring alignment, being universally supported on literally any x86/x64 system, and being reasonably if not remarkably fast on any remotely modern x64 system.

@degski
Author

degski commented May 25, 2020

@haneefmubarak Do you have benchmarks against the SSE2 version above?

I suspect that memcpy (d, s, 1); generates the same code as memcpy_movsb (d, s, 1) (or do you think memcpy is not special-cased for copying 1 char?).
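
For anyone who wants to reproduce the comparison, a rough timing harness along these lines would do (a sketch only: clock() gives coarse timing, the buffer size and iteration count are arbitrary, and memcpy_sse / memcpy_movsb refer to the snippets posted earlier in this thread):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

enum { BUF_SIZE = 64 * 1024 * 1024, ITERATIONS = 32 }; // 64 MiB, a multiple of 32 bytes

typedef void ( * copy_fn ) ( void *, void const *, size_t );

// thin wrappers so all candidates share one signature
static void copy_memcpy ( void * d, void const * s, size_t n ) { memcpy ( d, s, n ); }
static void copy_sse    ( void * d, void const * s, size_t n ) { memcpy_sse ( d, s, n ); }
static void copy_movsb  ( void * d, void const * s, size_t n ) { memcpy_movsb ( d, s, n ); }

static double gib_per_sec ( copy_fn copy, void * dst, void const * src ) {
    clock_t start = clock ( );
    for ( int i = 0; i < ITERATIONS; ++i )
        copy ( dst, src, BUF_SIZE );
    double secs = ( double ) ( clock ( ) - start ) / CLOCKS_PER_SEC;
    return ( ( double ) BUF_SIZE * ITERATIONS ) / secs / ( 1024.0 * 1024.0 * 1024.0 );
}

int main ( void ) {
    // malloc returns 16-byte aligned blocks on x64, which memcpy_sse relies on
    void * src = malloc ( BUF_SIZE );
    void * dst = malloc ( BUF_SIZE );
    memset ( src, 0xAB, BUF_SIZE );
    printf ( "memcpy       : %.2f GiB/s\n", gib_per_sec ( copy_memcpy, dst, src ) );
    printf ( "memcpy_sse   : %.2f GiB/s\n", gib_per_sec ( copy_sse, dst, src ) );
    printf ( "memcpy_movsb : %.2f GiB/s\n", gib_per_sec ( copy_movsb, dst, src ) );
    free ( src );
    free ( dst );
    return 0;
}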

@haneefmubarak
Contributor

Let me do some testing and experimenting as best I can and see if I can also find existing numbers for older platforms and get back to you in about a day or two :)

@degski
Author

degski commented May 26, 2020

And why would this idea pertains to W10-X64 only? This part of the code is completely OS independent.

It doesn't, but since I don't know how well this would do on a linux/ios system (because I don't have those memcpy's hangin' around), I don't propose that for linux.

In general I would not try to draw any conclusion from the results in the referenced forum discussion. It is not clear how these results where obtained, how many samples where taken, etc.

You're right; I'm basing this on my own benchmarking, and the code I used is in the relevant repo, so I'm just re-reporting the facts. I can in fact only speak for a Broadwell Core i3 (5005U), because that's what I have for the moment.

@haneefmubarak
Contributor

haneefmubarak commented May 26, 2020

Here are some benchmarks from StackOverflow (edited for clarity):

Skylake:
MOVSB copy                                           :  10197.7 MB/s
SSE2 copy                                            :   8973.3 MB/s

Haswell:
MOVSB copy                                           :   9393.9 MB/s
SSE2 copy                                            :   6780.5 MB/s

I didn't include the non-temporal versions, because those require fences to ensure consistency across cores, nor the prefetched versions, because they require knowledge of cache line sizes (which are likely to change in the future).

Additionally, these numbers should be reasonable to compare against, even though they may not come from similar operating systems, since they concern raw processor and memory performance.

As discussed in this answer on the aforementioned SO post, REP MOVSB is fast on basically every remotely modern x86 µarch and is preferred unless you want to write and maintain a full-blown optimized memcpy() implementation. Here are bytes-per-cycle metrics across a swathe of architectures from that answer (edited for clarity):

Yonah (2006-2008):
    REP MOVSB 10.91 B/c
    REP MOVSW 10.85 B/c
    REP MOVSD 11.05 B/c

Nehalem (2009-2010):
    REP MOVSB 25.32 B/c
    REP MOVSW 19.72 B/c
    REP MOVSD 27.56 B/c
    REP MOVSQ 27.54 B/c

Westmere (2010-2011):
    REP MOVSB 21.14 B/c
    REP MOVSW 19.11 B/c
    REP MOVSD 24.27 B/c

Ivy Bridge (2012-2013) - with Enhanced REP MOVSB:
    REP MOVSB 28.72 B/c
    REP MOVSW 19.40 B/c
    REP MOVSD 27.96 B/c
    REP MOVSQ 27.89 B/c

Skylake (2015-2016) - with Enhanced REP MOVSB:
    REP MOVSB 57.59 B/c
    REP MOVSW 58.20 B/c
    REP MOVSD 58.10 B/c
    REP MOVSQ 57.59 B/c

Kaby Lake (2016-2017) - with Enhanced REP MOVSB:
    REP MOVSB 58.00 B/c
    REP MOVSW 57.69 B/c
    REP MOVSD 58.00 B/c
    REP MOVSQ 57.89 B/c

As you can see, from Nehalem onwards, MOVSB seems to be optimized to nearly saturate the write capacity of the core.

@degski for your specific Broadwell core, the Haswell results should be the most applicable — REP MOVSB will be faster than the SSE2 version.


And why would this idea pertains to W10-X64 only? This part of the code is completely OS independent.

It doesn't, but since I don't know how well this would do on a linux/ios system (because I don't have those memcpy's hangin' around), I don't propose that for linux.

As for MacOS and Linux in particular, let me be the first to say that both platforms automatically provide a µarch-optimized memcpy() implementation at runtime, along with smart inlining by the compilers generally available on those platforms (Clang, GCC, ICC, etc). As a result, I don't think this optimization is useful or meaningful on those platforms without a specific request: we should only do this optimization for Windows, and if people later need it applied to other platforms, we can do the same elsewhere.


I'll go ahead and prep a PR to implement this in the most minimal and cleanest way possible and tag @daanx when it's ready.
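
For reference, the rough shape such a platform-selective wrapper could take (a sketch only; the name _mi_fast_memcpy is made up for illustration and the actual PR may look different):

#include <string.h>
#include <stddef.h>

#if defined(_WIN32) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h> // __movsb
#endif

// hypothetical wrapper name, for illustration only
static inline void _mi_fast_memcpy ( void * dst, const void * src, size_t n ) {
#if defined(_WIN32) && (defined(_M_X64) || defined(_M_IX86))
    // REP MOVSB via the MSVC intrinsic: no alignment requirements, and fast
    // on any ERMS-capable (Ivy Bridge and later) core
    __movsb ( ( unsigned char * ) dst, ( const unsigned char * ) src, n );
#else
    // elsewhere the platform memcpy is already micro-architecture tuned
    memcpy ( dst, src, n );
#endif
}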

@degski
Author

degski commented May 26, 2020

@haneefmubarak Thanks for the info, the benchmarking, and your plan for the future; looking forward to seeing the results.

daanx added a commit that referenced this issue Jan 30, 2021
resolve #201 with a platform-selective REP MOVSB implementation
@daanx
Collaborator

daanx commented Jan 30, 2021

Thanks everyone for the interesting discussion, I learned a lot about the intricacies of memcpy! And apologies for the long delay for the merge but it is in now; hope it works well :-)
