
Improve SIMD memcpy on windows? #154

Closed
SchrodingerZhu opened this issue Mar 22, 2020 · 11 comments

@SchrodingerZhu
Contributor

SchrodingerZhu commented Mar 22, 2020

I saw an interesting issue at mimalloc: microsoft/mimalloc#201.

It seems that on Windows, memcpy does not use SIMD instructions by default. I remember that musl's memcpy does not use SIMD directly either, though it is written in a way that the compiler is happy to auto-vectorize.

snmalloc uses memcpy in three places, so writing memcpy with SIMD instructions directly on some platforms might yield better performance.
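For context, a minimal, hypothetical sketch (not musl's actual code) of the kind of plain copy loop that optimizing compilers readily auto-vectorize into SIMD loads and stores:

#include <cstddef>

// Hypothetical example: a plain byte-wise copy loop. At -O2/-O3, clang
// and gcc typically turn this into SIMD loads and stores even though the
// source never names a vector instruction.
void* plain_memcpy(void* dst, const void* src, std::size_t n)
{
  auto* d = static_cast<unsigned char*>(dst);
  auto* s = static_cast<const unsigned char*>(src);
  for (std::size_t i = 0; i < n; i++)
    d[i] = s[i];
  return dst;
}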

@davidchisnall
Collaborator

One of the problems with this kind of optimisation is that it's very, very microarchitecture-specific. John Baldwin (FreeBSD) did some analysis of different memcpy implementations on x86 using SSE and AVX instructions (two SSE versions and one AVX, if I remember correctly). There was no consistent winner across all Intel and AMD platforms. On the most recent Intel chips, the `rep movsb` sequence is supposed to outperform SSE-based memcpy.

I'd much rather make this a problem for the host platform. On Linux / *BSD, it's quite common to use `ifunc`s to select the optimal memcpy for the current microarchitecture. I believe Windows now does something similar.
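For illustration, a minimal sketch of the `ifunc` mechanism as it looks with GCC/Clang on ELF platforms; every name below is hypothetical, and the two variants are stand-ins rather than real tuned implementations:

#include <cstddef>
#include <cstring>

// Stand-in variants; a real build would supply hand-tuned SSE/AVX
// bodies instead of deferring to std::memcpy.
extern "C" void* memcpy_sse(void* d, const void* s, std::size_t n)
{
  return std::memcpy(d, s, n);
}
extern "C" void* memcpy_avx(void* d, const void* s, std::size_t n)
{
  return std::memcpy(d, s, n);
}

// Resolver: runs once at load time; the dynamic linker binds fast_memcpy
// to whichever implementation it returns.
extern "C" auto resolve_memcpy() -> void* (*)(void*, const void*, std::size_t)
{
  __builtin_cpu_init(); // must precede __builtin_cpu_supports in a resolver
  return __builtin_cpu_supports("avx2") ? memcpy_avx : memcpy_sse;
}

// GNU indirect function: calls dispatch to the resolver's choice, so the
// microarchitecture check happens once at startup rather than per call.
extern "C" void* fast_memcpy(void* dst, const void* src, std::size_t n)
  __attribute__((ifunc("resolve_memcpy")));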

If a memcpy size is known at compile time, then it can make sense to inline it, but otherwise shipping our own version would mean that we can't take advantage of future performance / security improvements. This is similar to the OpenSSL problem of providing its own malloc: it was faster than the malloc that shipped with most *NIX systems in the '90s, but 15 years later it was still in use, slower than most system mallocs, and less secure. The end result was Heartbleed.

@SchrodingerZhu
Contributor Author

Yep. I also saw that @daanx was hesitant about this because it is so architecture-dependent. Similarly, mimalloc also uses memcpy for realloc; I think that is the only place where memcpy becomes a concern in either project. However, if memcpy's influence on overall performance is small, I agree that keeping the platform default is the wiser choice.

@davidchisnall
Collaborator

My objection is not so much that memcpy is never a bottleneck; it might be, depending on the workload (though my personal bias is that any software using realloc is almost certainly wrong, and if it depends on realloc being faster than malloc followed by memcpy then it's definitely wrong). It's more that I don't believe we can ship a memcpy that stays faster than the system memcpy over the long term.

If someone deploying software with snmalloc finds that memcpy is a bottleneck, then I'd hope they'd profile several of the available high-performance memcpy implementations, pick the one that runs best on their specific deployment platform, and then re-evaluate when they move to new hardware.

That said, I would support a patch that allowed the user to provide an alternative implementation of the memcpy used in realloc. I'm much happier believing that snmalloc consumers can pick a good memcpy for their workload than that we can pick a good one for everyone's workload.

@mjp41
Member

mjp41 commented Mar 22, 2020

So the only thing I would add to this discussion is that we know several alignment and size constraints that might let us build a more optimal memcpy/memset. In particular, we statically know the minimum size, and our allocations are always strongly aligned, so we can use aligned SIMD instructions.

I believe we can implement a faster memcpy/memset because it is a constrained problem: we can probably remove the two or three conditional branches that handle unaligned src/dst and sizes that are not a multiple of MIN_ALLOC_SIZE (see the sketch below). However, I'm not sure it is worth it.
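As an illustration, a minimal sketch under those assumptions (16-byte-aligned src/dst and a size that is a non-zero multiple of the minimum; the names and the constant are hypothetical, not snmalloc's actual API):

#include <cstddef>
#include <immintrin.h>

// Hypothetical constant; snmalloc's real value may differ.
constexpr std::size_t MIN_ALLOC_SIZE = 16;

// Copy assuming dst/src are 16-byte aligned and n is a non-zero multiple
// of MIN_ALLOC_SIZE: no branches for unaligned heads or ragged tails,
// just aligned vector loads and stores.
void aligned_memcpy(void* dst, const void* src, std::size_t n)
{
  auto* d = static_cast<__m128i*>(dst);
  auto* s = static_cast<const __m128i*>(src);
  for (std::size_t i = 0; i < n / sizeof(__m128i); i++)
    _mm_store_si128(d + i, _mm_load_si128(s + i));
}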

@darach and @Licenser had some traces showing a reasonable amount of time in memcpy called from rust_realloc, so if we want to push on this issue, reproducing their benchmark would be a good starting point.

@Licenser

Licenser commented Mar 22, 2020

The benchmark should be easy to reproduce; it is in https://github.com/wayfair-tremor/tremor-runtime

If you want to give it a go, here is how to run it. Feel free to ping me if you run into any issues with it.

cargo build --release -p tremor-server # after jumping through all the hoops for dependencies :P
./bench/run real-workflow-throughput-json

@mjp41
Member

mjp41 commented Mar 25, 2020

@Licenser, is there an easy way to patch in a different commit of snmalloc? I want to try some tests to see whether I am improving things.

@Licenser

Yes! If you clone the snmalloc crate (the Rust wrapper), you can patch the dependency in.

Basically, at the end of the Cargo.toml you add:

[patch.crates-io]
snmalloc = { path = '/path/to/snmalloc' }

@darach

darach commented Mar 25, 2020

It's also worth doing a full clean and rebuild - I've hit situations where a quick rebuild doesn't quite work as intended, especially when you're using multiple build targets...

@mjp41
Member

mjp41 commented Mar 16, 2022

Closing, as @davidchisnall has now added an optimised memcpy to snmalloc.

mjp41 closed this as completed Mar 16, 2022
@davidchisnall
Collaborator

It's there, but we're not using it in realloc. I wonder if we should?

@mjp41
Member

mjp41 commented Mar 16, 2022

I think the redis benchmark in mimalloc-bench does quite a bit of realloc. I can run some benchmarks, if someone else puts together a PR ;-)

davidchisnall added a commit that referenced this issue Mar 16, 2022
This wraps our memcpy with some assumptions that let the optimiser know
that we're copying chunks that are strongly aligned.  With clang 13 on
x86, this generates three variants:

 - A special case for 16 bytes that's a single vector load + store.
 - A vector-copy loop for sizes <512 bytes.
 - rep movsb for larger sizes.

This is almost certainly faster than the platform memcpy (if for no
other reason than that it doesn't have to care about handling unaligned
copies).

I don't know if it will show up in benchmarks, but if it does then it
Fixes #154
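For reference, a minimal sketch of the wrapping technique the commit describes; the function name and alignment constant below are hypothetical, not the code that actually landed:

#include <cstddef>
#include <cstring>

// Hypothetical wrapper: tell the optimiser that both pointers are strongly
// aligned and that the size is a multiple of that alignment, then let it
// specialise the memcpy expansion (a single vector move for 16 bytes, a
// vector loop for mid-range sizes, rep movsb for large ones).
void* chunk_memcpy(void* dst, const void* src, std::size_t n)
{
  constexpr std::size_t Align = 16; // assumed minimum alignment
  dst = __builtin_assume_aligned(dst, Align);
  src = __builtin_assume_aligned(src, Align);
  if (n % Align != 0)
    __builtin_unreachable(); // callers only pass multiples of Align
  return std::memcpy(dst, src, n);
}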