Better memcpy on Windows #201
Thanks @degski! I am hesitant to add it, as it is so architecture specific and the only important
There are virtually no CPUs in use that don't support SSE2. (One can write the same function with AVX, but this gives no benefit and the alignment requirements are less general; I guess we are hitting max bus speed already with SSE2, on my machine at least. Maybe Optane RAM benefits from the AVX instructions? My birthday is in some months :-).) The above code compiles with gcc/msvc/clang. PS: in the meanwhile I have generalized/relaxed things some here: https://github.com/degski/Sax/blob/ed2289055690baba71be31437154e72af504a55d/include/sax/stl.hpp#L263
@degski this code uses non-temporal streaming instructions to write the data, but it is lacking the sfence or mfence instruction to ensure that these stores are globally visible, which can be crucial in a multithreaded context.

Just a sidenote: it is unclear to me why this implementation uses the packed single version.

But more importantly, the non-temporal stores bypass the cache. This only makes sense if the memory written is not immediately re-accessed; otherwise we will have to re-load the just-written data from memory into the cache, so IMHO a realloc operation is probably not a good candidate!
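As a minimal sketch of the fix being described (hypothetical function name; assumes both pointers are 16-byte aligned and the size is a multiple of 16), the non-temporal copy loop would end with `_mm_sfence` so the streaming stores become globally visible before the function returns:

```c
#include <emmintrin.h>  // SSE2 intrinsics
#include <stddef.h>

// Sketch only: a non-temporal copy that ends with a store fence.
// Assumes d and s are 16-byte aligned and n is a multiple of 16.
static void stream_copy_16 (void *d, const void *s, size_t n) {
    __m128i *dst = (__m128i *) d;
    const __m128i *src = (const __m128i *) s;
    for (size_t i = 0; i < n / 16; ++i) {
        _mm_stream_si128 (dst + i, _mm_load_si128 (src + i));
    }
    _mm_sfence ();  // make the non-temporal stores globally visible
}
```

Without the final fence, another thread that observes a later (ordinary) store could still see stale data in the streamed region.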
Check out rte_memcpy in DPDK.
On my machine with Ryzen 5,
Three reasons: it's faster, it's more widely available, and it is easier on alignment requirements. It helps to read everything that's written, including the linked article (in the updated code, see the blob).
What OS? I am comparing against
I'm on Windows 10 x64 1909. |
@nxrighthere Interesting, need to add some 'fat' code then. On an Intel Broadwell I see improvements of 20% or so, which (on something as basic as this) is nothing to be sniffed at. Soon (I hope) I can start using my Coffee Lake 6-core (cannot move and pick it up because of covid) and do this test again.
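The 'fat' code mentioned here would amount to runtime dispatch on CPU features. A rough sketch using GCC/Clang's `__builtin_cpu_supports` (the function bodies here are placeholders, not the actual SSE2/AVX copy loops):

```c
#include <string.h>
#include <stddef.h>

typedef void (*copy_fn) (void *, const void *, size_t);

// Placeholder body: a real version would plug in the SSE2/AVX loops here.
static void copy_plain (void *d, const void *s, size_t n) { memcpy (d, s, n); }

// Pick an implementation once, at startup, based on what the CPU supports.
static copy_fn select_copy (void) {
    __builtin_cpu_init ();                                   // initialize the CPU model data
    if (__builtin_cpu_supports ("avx"))  return copy_plain;  // would be the AVX body
    if (__builtin_cpu_supports ("sse2")) return copy_plain;  // would be the SSE2 body
    return copy_plain;                                       // scalar fallback
}
```

The selected pointer would typically be cached in a global so the feature check runs only once.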
True, but apparently you did not read the whole thing yourself, since the author himself wrote:
Apart from that, _mm_stream_ps and _mm_stream_si128 have the same alignment requirements as well as the same throughput and latency. The real difference is that they are usually handled by different execution units. As far as availability goes, the first one was introduced with SSE while the second one was added with SSE2, but I doubt that you will find many CPUs in the wild today that predate SSE2.

And why would this idea pertain to W10-x64 only? This part of the code is completely OS independent. In general I would not try to draw any conclusions from the results in the referenced forum discussion: it is not clear how these results were obtained, how many samples were taken, etc.
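To illustrate the point about the two intrinsics (a sketch with a made-up function name): both store 16 bytes non-temporally to a 16-byte-aligned address, and since the streaming moves just copy bits, the stored bytes are identical regardless of whether the operand is typed as packed floats or packed integers.

```c
#include <emmintrin.h>  // SSE2; also pulls in the SSE header for _mm_stream_ps

// Store the same 16 source bytes twice: once via the SSE "packed single"
// streaming store, once via the SSE2 integer streaming store.
static void stream_both (float *f_dst, void *i_dst, const void *src16) {
    __m128  f = _mm_load_ps ((const float *) src16);       // 16 bytes as 4 floats
    __m128i i = _mm_load_si128 ((const __m128i *) src16);  // same 16 bytes as ints
    _mm_stream_ps (f_dst, f);                  // SSE
    _mm_stream_si128 ((__m128i *) i_dst, i);   // SSE2
    _mm_sfence ();  // make both non-temporal stores globally visible
}
```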
I mean, if you're going to argue for a

```c
static inline void memcpy_movsb (void *d, const void *s, size_t n) {
    __movsb (d, s, n);
    return;
}
```

This has the significant advantage of being roughly the size of a function call when inlined (arguably, when inlined into an actual function, registers will already be in use, so it will be smaller than a function call by virtue of not having to push registers to the stack):

```c
// variable to show work before / after and prevent jmp optimization
int incr = 0;

void mcpy_memcpy (void *d, const void *s, size_t n) {
    incr++;
    memcpy (d, s, n);
    incr++;
}

void mcpy_movsb (void *d, const void *s, size_t n) {
    incr++;
    memcpy_movsb (d, s, n);
    incr++;
}
```

```asm
_mcpy_memcpy PROC
        push    DWORD PTR _n$[esp-4]
        inc     DWORD PTR _incr
        push    DWORD PTR _s$[esp]
        push    DWORD PTR _d$[esp+4]
        call    _memcpy
        inc     DWORD PTR _incr
        add     esp, 12
        ret     0
_mcpy_memcpy ENDP

_mcpy_movsb PROC
        inc     DWORD PTR _incr
        mov     ecx, DWORD PTR _n$[esp-4]
        push    esi
        mov     esi, DWORD PTR _s$[esp]
        push    edi
        mov     edi, DWORD PTR _d$[esp+4]
        rep movsb
        inc     DWORD PTR _incr
        pop     edi
        pop     esi
        ret     0
_mcpy_movsb ENDP
```

Further advantages: not requiring alignment, being universally supported on literally any x86/x64 system, and being reasonably if not remarkably fast on any remotely modern x64 system.
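Note that `__movsb` is an MSVC intrinsic; a rough GCC/Clang equivalent of the same one-instruction copy can be sketched with inline assembly (x86/x86-64 only, my approximation rather than anything from the thread):

```c
#include <stddef.h>

// GCC/Clang analogue of MSVC's __movsb intrinsic: a bare `rep movsb`.
// The "+D"/"+S"/"+c" constraints bind d, s, n to rdi/rsi/rcx, which the
// instruction consumes; "memory" tells the compiler the copy has side effects.
static inline void memcpy_movsb (void *d, const void *s, size_t n) {
    __asm__ volatile ("rep movsb"
                      : "+D" (d), "+S" (s), "+c" (n)
                      :
                      : "memory");
}
```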
@haneefmubarak Do you have benchmarks against the SSE2 version above? I suspect that
Let me do some testing and experimenting as best I can and see if I can also find existing numbers for older platforms and get back to you in about a day or two :) |
It doesn't, but since I don't know how well this would do on a linux/ios system (because I don't have those memcpy's hangin' around), I don't propose that for linux.
You're right; I'm basing this on my own benchmarking, and the code I used is in the relevant repo, I just re-report the facts. I can in fact only talk about a Broadwell Ci3 (5005U), because that's what I have for the moment.
Here are some benchmarks from StackOverflow (edited for clarity):
I didn't include non-temporal versions, because those require fences to ensure consistency across cores, nor prefetched versions, because they require knowledge of cache line sizes (which are likely to change in the future). Additionally, these numbers should be reasonable to compare with even if they were not obtained on similar operating systems, since they concern raw processor and memory performance. As discussed in this answer on the aforementioned SO post,
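On the cache-line point: instead of hard-coding 64 bytes into a prefetching loop, the line size can be queried at runtime. A Linux/glibc sketch (the fallback constant is my assumption, not something from the thread):

```c
#include <unistd.h>

// Query the L1 data cache line size at runtime (Linux/glibc extension)
// rather than baking an assumed 64 bytes into a prefetching copy loop.
static long cache_line_size (void) {
    long sz = sysconf (_SC_LEVEL1_DCACHE_LINESIZE);
    return sz > 0 ? sz : 64;  // fall back to the common value if unknown
}
```

On Windows, `GetLogicalProcessorInformation` exposes the same data; there is no fully portable C API for it.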
As you can see, from Nehalem onwards, @degski for your specific Broadwell core, the Haswell results should be the most applicable —
As for macOS and Linux in particular, let me be the first to say that both platforms automatically provide a µarch-optimized

I'll go ahead and prep a PR to implement this in the most minimal and cleanest way, and tag @daanx when it's ready.
@haneefmubarak Thanks for the info, the benchmarking, and your plan for the future; looking forward to seeing the results.
resolve #201 with a platform-selective REP MOVSB implementation
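The merged approach could be sketched roughly like this (the guard condition and function name are my guesses at what "platform-selective" means here, not mimalloc's actual code):

```c
#include <string.h>
#include <stddef.h>
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>  // __movsb
#endif

// Platform-selective copy: `rep movsb` via the MSVC intrinsic on x86/x64
// Windows builds, plain memcpy everywhere else.
static inline void select_memcpy (void *d, const void *s, size_t n) {
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
    __movsb ((unsigned char *) d, (const unsigned char *) s, n);
#else
    memcpy (d, s, n);
#endif
}
```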
Thanks everyone for the interesting discussion, I learned a lot about the intricacies of memcpy! And apologies for the long delay for the merge but it is in now; hope it works well :-) |
The following code:

is an SSE2-streaming-coded memcpy; it's about 25% faster than memcpy on Windows. The AVX-streaming version is not faster than the SSE2 version, so better to use SSE2 for lower size and easier alignment constraints, and almost any (PC) processor still functioning supports it. Speed-up for grabs.

PS: mimalloc uses memcpy 15 times in dev (I grepped it, so you don't have to grep it; I didn't look whether it would be applicable in every single case).