-
Notifications
You must be signed in to change notification settings - Fork 106
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve SIMD memcpy on windows? #154
Comments
One of the problems with this kind of optimisation is that it's very, very microarchitecture specific. John Baldwin (FreeBSD) did some analysis of different memcpy implementations on x86 using SSE and AVX instructions (two SSE versions, one AVX, if I remember correctly). There was no consistent winner across all Intel and AMD platforms. On the most recent Intel chips, the `rep movsb` sequence is supposed to outperform SSE-based memcpy. I'd much rather make this a problem for the host platform. On Linux / *BSD, it's quite common to use `ifunc`s to select the optimal memcpy for the current microarchitecture. I believe Windows now does something similar. If a memcpy size is known at compile time, then it can make sense to inline it, but otherwise shipping our own version would mean that we can't take advantage of future performance / security improvements. This is similar to the OpenSSL problem of providing its own malloc: it was faster than the malloc that shipped with most *NIX systems in the '90s, but 15 years later it was still in use, slower than most system mallocs, and less secure. The end result was Heartbleed. |
Yep. I also saw that @daanx was hesitant about this as it is too dependent to architectures. Similarly, mimalloc is also using copy for realloc. I think this is the only case that when memcpy can become a concern in both projects. However, if the influence of memcpy is not that much in the view of overall performance, I also agree that keep it to the platforms' default is a wiser choice. |
My objection is not so much arguing that If someone is deploying software using snmalloc and That said, I would support a patch that allowed the user to provide an alternative implementations of the |
So I think the only thing I would add to this discussion is we know several alignment and size constraints that might allow us to build a more optimal I believe we can implement a faster @darach, @Licenser had some traces where they had a reasonable amount of time in |
Benchmark should be easy to reproduce it is in https://github.com/wayfair-tremor/tremor-runtime If you want to give it a go here it is how to run it. Feel free to ping me if you run into any issues with it. cargo build --release -p tremor-server # after jumping through all the hoops for dependencies :P
./bench/run real-workflow-throughput-json |
@Licenser, is there an easy way to patch a different commit for snmalloc in? I want to try some tests, to see if I am improving things. |
yes! if you clone the snmalloc crate (rust code) you can't patch dependencies basically, at the end of the Cargo.toml you add:
|
It's also worth doing a full clean and rebuild - I've hit situations where a quick rebuild doesn't quite work as intended. Especially when you're using multiple build targets... |
Closing as @davidchisnall has now added a optimised memcpy to snmalloc. |
It's there, but we're not using it in realloc. I wonder if we should? |
I think the redis benchmark in mimalloc-bench does quite a bit of realloc. I can run some benchmarks, if someone else puts together a PR ;-) |
This wraps our memcpy with some assumptions that let the optimiser know that we're copying chunks that are strongly aligned. With clang 13 on x86, this generates three variants: - A special case for 16 bytes that's a single vector load + store. - A vector-copy loop for sizes <512 bytes. - rep movsb for larger sizes. This is almost certainly faster than the platform memcpy (if for no other reason than that it doesn't have to care about handling unaligned copies). I don't know if it will show up in benchmarks, but if it does then it Fixes #154
I saw an interesting issue at mimalloc: microsoft/mimalloc#201.
It seems that on Windows memcpy does not invoke SIMD instructions by default. I remember that musl memcpy does not use SIMD directly either, though it is written in a way that makes the compiler feel happy to optimize.
There are three cases of memcpy in snmalloc. Therefore, perhaps writing memcpy using SIMD instructions directly on some platforms may result in a better performance.
The text was updated successfully, but these errors were encountered: