
Better memcpy on Windows #201

Closed
degski opened this issue Feb 3, 2020 · 18 comments

Comments

@degski

degski commented Feb 3, 2020

The following code:

#include <xmmintrin.h> // __m128, _mm_load_ps, _mm_stream_ps
#include <stdint.h>
#include <stddef.h>

// dst and src must be 16-byte aligned
// size must be a multiple of 16*2 = 32 bytes
inline void memcpy_sse ( void * dst, void const * src, size_t size ) {
// https://hero.handmade.network/forums/code-discussion/t/157-memory_bandwidth_+_implementing_memcpy
    size_t stride = 2 * sizeof ( __m128 );
    while ( size ) {
        __m128 a = _mm_load_ps ( ( float * ) ( ( ( uint8_t const * ) src ) + 0 * sizeof ( __m128 ) ) );
        __m128 b = _mm_load_ps ( ( float * ) ( ( ( uint8_t const * ) src ) + 1 * sizeof ( __m128 ) ) );
        _mm_stream_ps ( ( float * ) ( ( ( uint8_t * ) dst ) + 0 * sizeof ( __m128 ) ), a );
        _mm_stream_ps ( ( float * ) ( ( ( uint8_t * ) dst ) + 1 * sizeof ( __m128 ) ), b );
        size -= stride;
        src = ( ( uint8_t const * ) src ) + stride;
        dst = ( ( uint8_t * ) dst ) + stride;
    }
}

is an SSE2-style streaming memcpy; it's about 25% faster than memcpy on Windows. The AVX streaming version is not faster than the SSE2 one, so SSE2 is the better choice: a smaller minimum copy size, easier alignment constraints, and support on virtually any PC processor still in service. A speed-up for grabs.

PS: mimalloc uses memcpy 15 times in dev (I grepped it so you don't have to; I didn't check whether the replacement would be applicable in every single case).
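
For what it's worth, a minimal usage sketch of the above (the buffer names and sizes are made up for illustration; it assumes memcpy_sse from the snippet above is in scope and C11 for _Alignas):

#include <string.h>

// 16-byte aligned buffers; 1024 is a multiple of 32, as memcpy_sse requires
_Alignas ( 16 ) static unsigned char src_buf [ 1024 ];
_Alignas ( 16 ) static unsigned char dst_buf [ 1024 ];

int main ( void ) {
    memset ( src_buf, 0xAB, sizeof ( src_buf ) );
    memcpy_sse ( dst_buf, src_buf, sizeof ( src_buf ) ); // both 16-byte aligned, size % 32 == 0
    return 0;
}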

@daanx
Collaborator

daanx commented Mar 7, 2020

Thanks! @degski. I am hesitant to add it as it is so architecture specific and the only important memcpy where this could matter is in mi_realloc... but that could be an important one :-) Need to think a bit more on this.

@degski
Author

degski commented Mar 7, 2020

There are virtually no CPUs in use that don't support SSE2. (One can write the same function with AVX, but it gives no benefit and the alignment requirements are less general; I guess we are already hitting max bus speed on my machine with SSE2. Maybe Optane RAM (my birthday is in a few months :-) ) would benefit from the AVX instructions?) The above code compiles with gcc/msvc/clang. realloc is the most obvious candidate, and a similar situation made me try to go even faster. mi_malloc beats malloc by some 50%, that is, when using https://github.com/degski/pector/blob/master/include/pector/mimalloc_allocator.h (a realloc-ing allocator) with pt::pector [unfortunately, this cannot work with std::vector, as the STL has no API to call into any realloc functionality]. A faster memcpy will make it even faster. Within the context of mimalloc, the requirements are fulfilled by default. Thanks for considering it.

PS: in the meantime I have generalized/relaxed things somewhat here: https://github.com/degski/Sax/blob/ed2289055690baba71be31437154e72af504a55d/include/sax/stl.hpp#L263

@mpoeter

mpoeter commented Mar 7, 2020

@degski this code uses non-temporal streaming instructions to write the data, but it is lacking the sfence or mfence instruction to ensure that these stores are globally visible, which can be crucial in a multithreaded context. Just a sidenote: it is unclear to me why this implementation uses the packed single version _mm_stream_ps instead of the generic _mm_stream_si128.

But more importantly, the non-temporal stores bypass the cache. This only makes sense if the memory written is not immediately re-accessed, as otherwise we will have to re-load the just written data from memory to the cache, so IMHO a realloc operation is probably not a good candidate!
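
For illustration, a sketch of the same loop with the missing fence added at the end (same constraints as above: 16-byte aligned pointers, size a multiple of 32 bytes):

#include <xmmintrin.h> // __m128, _mm_load_ps, _mm_stream_ps, _mm_sfence
#include <stdint.h>
#include <stddef.h>

static inline void memcpy_sse_fenced ( void * dst, void const * src, size_t size ) {
    size_t const stride = 2 * sizeof ( __m128 );
    while ( size ) {
        __m128 a = _mm_load_ps ( ( float const * ) ( ( uint8_t const * ) src + 0 * sizeof ( __m128 ) ) );
        __m128 b = _mm_load_ps ( ( float const * ) ( ( uint8_t const * ) src + 1 * sizeof ( __m128 ) ) );
        _mm_stream_ps ( ( float * ) ( ( uint8_t * ) dst + 0 * sizeof ( __m128 ) ), a );
        _mm_stream_ps ( ( float * ) ( ( uint8_t * ) dst + 1 * sizeof ( __m128 ) ), b );
        size -= stride;
        src = ( uint8_t const * ) src + stride;
        dst = ( uint8_t * ) dst + stride;
    }
    _mm_sfence ( ); // order the non-temporal stores before any subsequent stores become visible
}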

@nxrighthere

On my machine with Ryzen 5, memcpy is the absolute winner:

memcpy = 9.73 GiB/s
CopyWithSSE = 3.69 GiB/s
CopyWithSSESmall = 2.93 GiB/s
CopyWithSSENoCache = 3.02 GiB/s (the code from this issue)
CopyWithAVX = 3.94 GiB/s
CopyWithAVXSmall = 4.15 GiB/s
CopyWithAVXNoCache = 4.47 GiB/s
CopyWithRepMovsb = 5.77 GiB/s
CopyWithRepMovsd = 5.76 GiB/s
CopyWithRepMovsq = 5.76 GiB/s
CopyWithRepMovsbUnaligned = 5.77 GiB/s
CopyWithThreads = 8.14 GiB/s

@nxrighthere

rte_memcpy from DPDK: rte_memcpy = 2.23 GiB/s

@degski
Author

degski commented May 7, 2020

Just a sidenote: it is unclear to me why this implementation uses the packed single version _mm_stream_ps instead of the generic _mm_stream_si128.

Three reasons: it's faster, it's more widely available, and it's easier on alignment requirements. It helps to read everything that's written, including the linked article (and the updated code in the blob linked above).

@degski
Author

degski commented May 7, 2020

On my machine with Ryzen 5, memcpy is the absolute winner:

What OS? I am comparing against std::memcpy on latest Windows 64 bit. This idea pertains to W10-X64 only!

@nxrighthere

I'm on Windows 10 x64 1909.

@degski
Author

degski commented May 7, 2020

@nxrighthere Interesting, need to add some 'fat' code then. On an Intel Broadwell I see improvements of 20% or so, which (on something as basic as this) is nothing to be sniffed at. Soon (I hope) I can start using my Coffee Lake 6-core (cannot move and pick it up because of covid) and run this test again.

@mpoeter

mpoeter commented May 7, 2020

It helps to read everything that's written, including the linked article (in the updated code (see blobl)).

True, but apparently you did not read the whole thing yourself, since the author himself wrote:

No idea about differences between integer vs float loads. [...]

Apart from that, _mm_stream_ps and _mm_stream_si128 have the same alignment requirements, as well as the same throughput and latency. The real difference is that they are usually handled by different execution units. As far as availability goes, the first one was introduced with SSE while the second one was added with SSE2, but I doubt that you will find many CPUs in the wild today that predate SSE2.

And why would this idea pertain to W10-X64 only? This part of the code is completely OS independent.

In general I would not try to draw any conclusions from the results in the referenced forum discussion. It is not clear how these results were obtained, how many samples were taken, etc.
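
For comparison, a sketch of the same copy loop written with the integer intrinsics (an illustration only; alignment and size constraints are identical to the float version):

#include <emmintrin.h> // __m128i, _mm_load_si128, _mm_stream_si128 (SSE2)
#include <stdint.h>
#include <stddef.h>

// dst and src must be 16-byte aligned, size a multiple of 32 bytes
static inline void memcpy_sse2_si128 ( void * dst, void const * src, size_t size ) {
    size_t const stride = 2 * sizeof ( __m128i );
    while ( size ) {
        __m128i a = _mm_load_si128 ( ( __m128i const * ) src + 0 );
        __m128i b = _mm_load_si128 ( ( __m128i const * ) src + 1 );
        _mm_stream_si128 ( ( __m128i * ) dst + 0, a );
        _mm_stream_si128 ( ( __m128i * ) dst + 1, b );
        size -= stride;
        src = ( uint8_t const * ) src + stride;
        dst = ( uint8_t * ) dst + stride;
    }
    _mm_sfence ( ); // the stores are still non-temporal, so the fence is still needed
}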

@haneefmubarak
Contributor

I mean, if you're going to argue for a memcpy replacement on Windows, basically any modern x64 processor implements Enhanced REP MOVSB or equivalent / improved functionality, which doesn't even have any alignment requirements. Conveniently enough, this is available as an intrinsic on Windows platforms, so this implementation should be sufficient:

#include <intrin.h>  // __movsb
#include <stddef.h>  // size_t

static inline void memcpy_movsb (void *d, const void *s, size_t n) {
	__movsb ((unsigned char *) d, (const unsigned char *) s, n);
	return;
}

This has the significant advantage of being roughly the size of a function call when inlined (arguably, when inlined into an actual function, registers will already be in use, so it will be smaller than a function call by virtue of not having to push registers to the stack),

// variable to show work before / after and prevent jmp optimization
int incr = 0;

void mcpy_memcpy (void *d, const void *s, size_t n) {
    incr++;
    memcpy (d, s, n);
    incr++;
}

void mcpy_movsb (void *d, const void *s, size_t n) {
    incr++;
    memcpy_movsb (d, s, n);
    incr++;
}
_mcpy_memcpy PROC
        push    DWORD PTR _n$[esp-4]
        inc     DWORD PTR _incr
        push    DWORD PTR _s$[esp]
        push    DWORD PTR _d$[esp+4]
        call    _memcpy
        inc     DWORD PTR _incr
        add     esp, 12
        ret     0
_mcpy_memcpy ENDP


_mcpy_movsb PROC
        inc     DWORD PTR _incr
        mov     ecx, DWORD PTR _n$[esp-4]
        push    esi
        mov     esi, DWORD PTR _s$[esp]
        push    edi
        mov     edi, DWORD PTR _d$[esp+4]
        rep movsb
        inc     DWORD PTR _incr
        pop     edi
        pop     esi
        ret     0
_mcpy_movsb ENDP

not requiring alignment, being universally supported on literally any x86/x64 system, and being reasonably if not remarkably fast on any remotely modern x64 system.

@degski
Author

degski commented May 25, 2020

@haneefmubarak Do you have benchmarks against the SSE2 version above?

I suspect that memcpy (d, s, 1); generates the same code as memcpy_movsb (d, s, 1) (or do you think memcpy is not special-cased for copying 1 char?).
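
For anyone who wants to reproduce the comparison, a rough timing harness along these lines would do (a sketch only: clock() gives coarse timing, the buffer size and iteration count are arbitrary, and memcpy_sse / memcpy_movsb refer to the snippets posted earlier in this thread):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

enum { BUF_SIZE = 64 * 1024 * 1024, ITERATIONS = 32 }; // 64 MiB, a multiple of 32 bytes

typedef void ( * copy_fn ) ( void *, void const *, size_t );

// thin wrappers so all candidates share one signature
static void copy_memcpy ( void * d, void const * s, size_t n ) { memcpy ( d, s, n ); }
static void copy_sse    ( void * d, void const * s, size_t n ) { memcpy_sse ( d, s, n ); }
static void copy_movsb  ( void * d, void const * s, size_t n ) { memcpy_movsb ( d, s, n ); }

static double gib_per_sec ( copy_fn copy, void * dst, void const * src ) {
    clock_t start = clock ( );
    for ( int i = 0; i < ITERATIONS; ++i )
        copy ( dst, src, BUF_SIZE );
    double secs = ( double ) ( clock ( ) - start ) / CLOCKS_PER_SEC;
    return ( ( double ) BUF_SIZE * ITERATIONS ) / secs / ( 1024.0 * 1024.0 * 1024.0 );
}

int main ( void ) {
    // malloc returns 16-byte aligned blocks on x64, which memcpy_sse relies on
    void * src = malloc ( BUF_SIZE );
    void * dst = malloc ( BUF_SIZE );
    memset ( src, 0xAB, BUF_SIZE );
    printf ( "memcpy       : %.2f GiB/s\n", gib_per_sec ( copy_memcpy, dst, src ) );
    printf ( "memcpy_sse   : %.2f GiB/s\n", gib_per_sec ( copy_sse, dst, src ) );
    printf ( "memcpy_movsb : %.2f GiB/s\n", gib_per_sec ( copy_movsb, dst, src ) );
    free ( src );
    free ( dst );
    return 0;
}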

@haneefmubarak
Contributor

Let me do some testing and experimenting as best I can and see if I can also find existing numbers for older platforms and get back to you in about a day or two :)

@degski
Author

degski commented May 26, 2020

And why would this idea pertains to W10-X64 only? This part of the code is completely OS independent.

It doesn't, but since I don't know how well this would do on a linux/ios system (because I don't have those memcpy's hangin' around), I don't propose that for linux.

In general I would not try to draw any conclusion from the results in the referenced forum discussion. It is not clear how these results where obtained, how many samples where taken, etc.

You're right; I'm basing this on my own benchmarking, and the code I used is in the relevant repo, so I'm just re-reporting the facts. I can in fact only speak for a Broadwell Core i3 (5005U), because that's what I have for the moment.

@haneefmubarak
Contributor

haneefmubarak commented May 26, 2020

Here are some benchmarks from StackOverflow (edited for clarity):

Skylake:
MOVSB copy                                           :  10197.7 MB/s
SSE2 copy                                            :   8973.3 MB/s

Haswell:
MOVSB copy                                           :   9393.9 MB/s
SSE2 copy                                            :   6780.5 MB/s

I didn't include the non-temporal versions, because those require fences to ensure consistency across cores, nor the prefetched versions, because they require knowledge of cache line sizes (which are likely to change in the future).

Additionally, these numbers should be reasonable to compare against, even though they may not come from similar operating systems, since they concern raw processor and memory performance.

As discussed in this answer on the aforementioned SO post, REP MOVSB is fast on basically every remotely modern x86 µarch and is preferred unless you want to write and maintain a full-blown optimized memcpy() implementation. Here are bytes-per-cycle metrics across a swathe of architectures from that answer (edited for clarity):

Yonah (2006-2008):
    REP MOVSB 10.91 B/c
    REP MOVSW 10.85 B/c
    REP MOVSD 11.05 B/c

Nehalem (2009-2010):
    REP MOVSB 25.32 B/c
    REP MOVSW 19.72 B/c
    REP MOVSD 27.56 B/c
    REP MOVSQ 27.54 B/c

Westmere (2010-2011):
    REP MOVSB 21.14 B/c
    REP MOVSW 19.11 B/c
    REP MOVSD 24.27 B/c

Ivy Bridge (2012-2013) - with Enhanced REP MOVSB:
    REP MOVSB 28.72 B/c
    REP MOVSW 19.40 B/c
    REP MOVSD 27.96 B/c
    REP MOVSQ 27.89 B/c

Skylake (2015-2016) - with Enhanced REP MOVSB:
    REP MOVSB 57.59 B/c
    REP MOVSW 58.20 B/c
    REP MOVSD 58.10 B/c
    REP MOVSQ 57.59 B/c

Kaby Lake (2016-2017) - with Enhanced REP MOVSB:
    REP MOVSB 58.00 B/c
    REP MOVSW 57.69 B/c
    REP MOVSD 58.00 B/c
    REP MOVSQ 57.89 B/c

As you can see, from Nehalem onwards, MOVSB seems to be optimized to nearly saturate the write capacity of the core.

@degski for your specific Broadwell core, the Haswell results should be the most applicable — REP MOVSB will be faster than the SSE2 version.


And why would this idea pertains to W10-X64 only? This part of the code is completely OS independent.

It doesn't, but since I don't know how well this would do on a linux/ios system (because I don't have those memcpy's hangin' around), I don't propose that for linux.

As for MacOS and Linux in particular, let me be the first to say that both platforms automatically provide a µarch-optimized memcpy() implementation at runtime, along with smart inlining by the compilers generally available on those platforms (Clang, GCC, ICC, etc). As a result, I don't think this optimization is useful or meaningful on those platforms without a specific request: we should only do this optimization for Windows, and if people later need it applied to other platforms, we can do the same elsewhere.


I'll go ahead and prep a PR to implement this in the most minimal and cleanest way possible and tag @daanx when it's ready.
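
For reference, the rough shape such a platform-selective wrapper could take (a sketch only; the name _mi_fast_memcpy is made up for illustration and the actual PR may look different):

#include <string.h>
#include <stddef.h>

#if defined(_WIN32) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h> // __movsb
#endif

// hypothetical wrapper name, for illustration only
static inline void _mi_fast_memcpy ( void * dst, const void * src, size_t n ) {
#if defined(_WIN32) && (defined(_M_X64) || defined(_M_IX86))
    // REP MOVSB via the MSVC intrinsic: no alignment requirements, and fast
    // on any ERMS-capable (Ivy Bridge and later) core
    __movsb ( ( unsigned char * ) dst, ( const unsigned char * ) src, n );
#else
    // elsewhere the platform memcpy is already micro-architecture tuned
    memcpy ( dst, src, n );
#endif
}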

@degski
Author

degski commented May 26, 2020

@haneefmubarak Thanks for the info, the benchmarking, and your plan for the future; looking forward to seeing the results.

daanx added a commit that referenced this issue Jan 30, 2021
resolve #201 with a platform-selective REP MOVSB implementation
@daanx
Collaborator

daanx commented Jan 30, 2021

Thanks everyone for the interesting discussion, I learned a lot about the intricacies of memcpy! And apologies for the long delay for the merge but it is in now; hope it works well :-)
