
Improve SIMD memcpy on windows? #154

Closed
SchrodingerZhu opened this issue Mar 22, 2020 · 11 comments

@SchrodingerZhu
Contributor

SchrodingerZhu commented Mar 22, 2020

I saw an interesting issue at mimalloc: microsoft/mimalloc#201.

It seems that on Windows, memcpy does not use SIMD instructions by default. I remember that musl's memcpy does not use SIMD directly either, though it is written in a way that the compiler is happy to auto-vectorize.

snmalloc uses memcpy in three places, so writing memcpy with SIMD instructions directly on some platforms might yield better performance.
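For context, a minimal, hypothetical sketch (not musl's actual code) of the kind of plain copy loop that optimizing compilers readily auto-vectorize into SIMD loads and stores:

#include <cstddef>

// Hypothetical example: a plain byte-wise copy loop. At -O2/-O3, clang
// and gcc typically turn this into SIMD loads and stores even though the
// source never names a vector instruction.
void* plain_memcpy(void* dst, const void* src, std::size_t n)
{
  auto* d = static_cast<unsigned char*>(dst);
  auto* s = static_cast<const unsigned char*>(src);
  for (std::size_t i = 0; i < n; i++)
    d[i] = s[i];
  return dst;
}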

@davidchisnall
Collaborator

One of the problems with this kind of optimisation is that it's very, very microarchitecture-specific. John Baldwin (FreeBSD) did some analysis of different memcpy implementations on x86 using SSE and AVX instructions (two SSE versions and one AVX, if I remember correctly). There was no consistent winner across all Intel and AMD platforms. On the most recent Intel chips, the `rep movsb` sequence is supposed to outperform SSE-based memcpy.

I'd much rather make this a problem for the host platform. On Linux / *BSD, it's quite common to use `ifunc`s to select the optimal memcpy for the current microarchitecture. I believe Windows now does something similar.
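For illustration, a minimal sketch of the `ifunc` mechanism as it looks with GCC/Clang on ELF platforms; every name below is hypothetical, and the two variants are stand-ins rather than real tuned implementations:

#include <cstddef>
#include <cstring>

// Stand-in variants; a real build would supply hand-tuned SSE/AVX
// bodies instead of deferring to std::memcpy.
extern "C" void* memcpy_sse(void* d, const void* s, std::size_t n)
{
  return std::memcpy(d, s, n);
}
extern "C" void* memcpy_avx(void* d, const void* s, std::size_t n)
{
  return std::memcpy(d, s, n);
}

// Resolver: runs once at load time; the dynamic linker binds fast_memcpy
// to whichever implementation it returns.
extern "C" auto resolve_memcpy() -> void* (*)(void*, const void*, std::size_t)
{
  __builtin_cpu_init(); // must precede __builtin_cpu_supports in a resolver
  return __builtin_cpu_supports("avx2") ? memcpy_avx : memcpy_sse;
}

// GNU indirect function: calls dispatch to the resolver's choice, so the
// microarchitecture check happens once at startup rather than per call.
extern "C" void* fast_memcpy(void* dst, const void* src, std::size_t n)
  __attribute__((ifunc("resolve_memcpy")));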

If a memcpy size is known at compile time, then it can make sense to inline it, but otherwise shipping our own version would mean that we can't take advantage of future performance / security improvements. This is similar to the OpenSSL problem of providing its own malloc: it was faster than the malloc that shipped with most *NIX systems in the '90s, but 15 years later it was still in use, slower than most system mallocs, and less secure. The end result was Heartbleed.

@SchrodingerZhu
Contributor Author

Yep. I also saw that @daanx was hesitant about this because it is so architecture-dependent. Similarly, mimalloc also uses memcpy for realloc; I think that is the only place where memcpy becomes a concern in either project. However, if memcpy's influence on overall performance is small, I agree that keeping the platform default is the wiser choice.

@davidchisnall
Collaborator

My objection is not so much that memcpy is never a bottleneck; it might be, depending on the workload (though my personal bias is that any software using realloc is almost certainly wrong, and if it depends on realloc being faster than malloc followed by memcpy then it's definitely wrong). It's more that I don't believe we can ship a memcpy that stays faster than the system memcpy over the long term.

If someone deploying software with snmalloc finds that memcpy is a bottleneck, then I'd hope they'd profile several of the available high-performance memcpy implementations, pick the one that runs best on their specific deployment platform, and then re-evaluate when they move to new hardware.

That said, I would support a patch that allowed the user to provide an alternative implementation of the memcpy used in realloc. I'm much happier believing that snmalloc consumers can pick a good memcpy for their workload than that we can pick a good one for everyone's workload.

@mjp41
Member

mjp41 commented Mar 22, 2020

So the only thing I would add to this discussion is that we know several alignment and size constraints that might let us build a more optimal memcpy/memset. In particular, we statically know the minimum size, and our allocations are always strongly aligned, so we can use aligned SIMD instructions.

I believe we can implement a faster memcpy/memset because it is a constrained problem: we can probably remove the two or three conditional branches that handle unaligned src/dst and sizes that are not a multiple of MIN_ALLOC_SIZE (see the sketch below). However, I'm not sure it is worth it.
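As an illustration, a minimal sketch under those assumptions (16-byte-aligned src/dst and a size that is a non-zero multiple of the minimum; the names and the constant are hypothetical, not snmalloc's actual API):

#include <cstddef>
#include <immintrin.h>

// Hypothetical constant; snmalloc's real value may differ.
constexpr std::size_t MIN_ALLOC_SIZE = 16;

// Copy assuming dst/src are 16-byte aligned and n is a non-zero multiple
// of MIN_ALLOC_SIZE: no branches for unaligned heads or ragged tails,
// just aligned vector loads and stores.
void aligned_memcpy(void* dst, const void* src, std::size_t n)
{
  auto* d = static_cast<__m128i*>(dst);
  auto* s = static_cast<const __m128i*>(src);
  for (std::size_t i = 0; i < n / sizeof(__m128i); i++)
    _mm_store_si128(d + i, _mm_load_si128(s + i));
}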

@darach and @Licenser had some traces showing a reasonable amount of time in memcpy called from rust_realloc, so if we want to push on this issue, reproducing their benchmark would be a good starting point.

@Licenser

Licenser commented Mar 22, 2020

The benchmark should be easy to reproduce; it is in https://github.com/wayfair-tremor/tremor-runtime

If you want to give it a go, here is how to run it. Feel free to ping me if you run into any issues with it.

cargo build --release -p tremor-server # after jumping through all the hoops for dependencies :P
./bench/run real-workflow-throughput-json

@mjp41
Member

mjp41 commented Mar 25, 2020

@Licenser, is there an easy way to patch in a different commit of snmalloc? I want to try some tests to see whether I am improving things.

@Licenser

Yes! If you clone the snmalloc crate (the Rust wrapper), you can patch the dependency in.

Basically, at the end of the Cargo.toml you add:

[patch.crates-io]
snmalloc = { path = '/path/to/snmalloc' }

@darach

darach commented Mar 25, 2020

It's also worth doing a full clean and rebuild - I've hit situations where a quick rebuild doesn't quite work as intended, especially when you're using multiple build targets...

@mjp41
Member

mjp41 commented Mar 16, 2022

Closing, as @davidchisnall has now added an optimised memcpy to snmalloc.

mjp41 closed this as completed Mar 16, 2022
@davidchisnall
Collaborator

It's there, but we're not using it in realloc. I wonder if we should?

@mjp41
Member

mjp41 commented Mar 16, 2022

I think the redis benchmark in mimalloc-bench does quite a bit of realloc. I can run some benchmarks, if someone else puts together a PR ;-)

davidchisnall added a commit that referenced this issue Mar 16, 2022
This wraps our memcpy with some assumptions that let the optimiser know
that we're copying chunks that are strongly aligned.  With clang 13 on
x86, this generates three variants:

 - A special case for 16 bytes that's a single vector load + store.
 - A vector-copy loop for sizes <512 bytes.
 - rep movsb for larger sizes.

This is almost certainly faster than the platform memcpy (if for no
other reason than that it doesn't have to care about handling unaligned
copies).

I don't know if it will show up in benchmarks, but if it does then it
Fixes #154
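For reference, a minimal sketch of the wrapping technique the commit describes; the function name and alignment constant below are hypothetical, not the code that actually landed:

#include <cstddef>
#include <cstring>

// Hypothetical wrapper: tell the optimiser that both pointers are strongly
// aligned and that the size is a multiple of that alignment, then let it
// specialise the memcpy expansion (a single vector move for 16 bytes, a
// vector loop for mid-range sizes, rep movsb for large ones).
void* chunk_memcpy(void* dst, const void* src, std::size_t n)
{
  constexpr std::size_t Align = 16; // assumed minimum alignment
  dst = __builtin_assume_aligned(dst, Align);
  src = __builtin_assume_aligned(src, Align);
  if (n % Align != 0)
    __builtin_unreachable(); // callers only pass multiples of Align
  return std::memcpy(dst, src, n);
}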