Implement SIMD versions of vector operations #311
Conversation
|
Hi @aytekinar, thanks for the PR! It's a neat idea, and the code looks really clean. However, it adds a good amount of complexity. I think a minimal version would be:
For benchmarking, it'd be best to use ann-benchmarks to see the impact on the total time. Overall, I'm not sure it's something I'd like to move forward with right now, but will think on it for a bit (and welcome other comments). |
That's refreshing to hear. At @discourse we had problems with dependencies shipping AVX improvements, only for a random user running the application on a 13-year-old CPU to be frustrated when it broke.
For us, the CPU cost of the "search" (inner_product) is what is holding us back from enabling this for more of our hosted forums, as most of the 500M+ monthly pageviews we serve will do one search to show related topics when we enable our embeddings feature for a site. We do cache the result of the query for a whole week, but over a million topics are created each month, and with anon visits coming from Google Search, the cache isn't always there. That is all to say that a (17% ~ 50%) speed-up would be very welcome for us. I do appreciate the concern over complexity tho. |
|
I want to work on benchmarking over the coming days. I do think there's merit to being explicit about SIMD, but I'm not convinced on adding the complexity based on the above data. That said, I do generally think this is a good idea and am, at least on paper, supportive of doing this. The main issue I've seen as of late with the latest version of pgvector is calculations getting CPU bound at high levels of concurrency with higher levels of …

That said, the benchmarks above don't convince me yet around adding them -- these are very short tests on very small data sets. I agree with @ankane on running these against ANN Benchmarks (I'm happy to do so), but I also want to test something similar to the proposed …

On the types, I would still suggest having AVX-512 available ("in for a penny, in for a pound") unless it truly adds that much more complexity. For my own benefit, is the work for AVX/AVX2 similar? From what I read, for floating-point calculations they're pretty similar.

On the code, I'd recommend keeping the SIMD implementation in its own file (upstream PostgreSQL does this in a …).

Thanks! |
|
Hello everyone,

A colleague recently highlighted the potential benefits of integrating SIMD similarity measures into pgvector. To my delight, I discovered this ongoing implementation, which aligns perfectly with a project of mine.

I am the creator of a library named SimSIMD. It is tailored to implement various similarity measures, including L2, Cosine, Inner Product, Jaccard, Hamming, and Jensen Shannon distances. It uses AVX2 and AVX-512 techniques, including VNNI for int8 and FP16 extensions for half-precision floating points on x86 platforms, including the most recent Sapphire Rapids. Furthermore, it supports Arm NEON and SVE, making it compatible with Graviton 3 and other recent chipsets.

One of the major deployments of SimSIMD is in USearch. This tool has gained traction and is now integrated into platforms like ClickHouse and various data lakes. I've chronicled some benchmarks in the following articles. While they provide insight into the performance of SimSIMD, I'm keen to determine their relevance to this context:
@ankane and @jkatz, please let me know if the integration makes sense. It should be as easy as adding a git submodule and invoking |
|
Hi @ashvardanian, thanks for sharing. Looks really neat, but don't want to take on any external dependencies or different licenses (however, if someone decides to fork and benchmark, I'd be curious to see the results). |
|
@ankane, the license is not an issue, I can dual-license to be compatible, and have previously done that for StringZilla, my string-processing library. As for benchmarking, it shouldn't be hard, just let me know how you generally do that - same way as in the first message of this thread? 🤗 |
|
Hello all! First, apologies for responding late --- I have been going through your comments and trying to understand the compilation failure in the CI, and tested the code in …

Let me first answer the specific question:
Correct, AVX brought the …

One final comment is that when we use FMA, we get different results, which is the expected behavior because of the lack of one rounding operation (a fused multiply-add rounds once, whereas a separate multiply followed by an add rounds twice).

New Changeset

When the … During my trials, I have observed the following:
The portable way is to have different function names (as was the case in the original version of my PR), but then we cannot compile for LLVM-enabled PostgreSQL installations (due to bitcode generation and LTO). If, on the other hand, we choose to go for Option 2 above, we cannot support GCC. As a result, I have opted for this version of the PR. Namely, I am using the … (see the sketch below). You can see how the code is generated for, e.g., …

Benchmarks

I have repeated the same C and …

C Benchmark

I have used the following commit when repeating the C benchmarks: aytekinar/simd-playground@cbe8ac27. Below are the results from my laptop:

- clang-release (no attribute; non-native build)
- clang-release (target_clones attribute; non-native build)
- gcc-release (no attribute; non-native build)
- gcc-release (target_clones attribute; non-native build)

We still get nice/aggressive speed-ups. |
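For readers following along, here is a minimal sketch of the attribute-based approach described above, assuming the "default", "avx" and "fma" target strings mentioned later in this thread; it is illustrative only and not the exact code in this PR. The compiler emits one clone of the function per listed target, plus a resolver that picks the right clone when the shared library is loaded.

/* Illustrative sketch of the target_clones approach (not the PR's exact
 * code). The body stays a plain scalar loop; the compiler produces a
 * separate auto-vectorized clone for each listed target and an ifunc
 * resolver that selects one at load time. */
__attribute__((target_clones("default", "avx", "fma")))
static float
l2_squared_distance_impl(int dim, const float *ax, const float *bx)
{
    float distance = 0.0f;

    for (int i = 0; i < dim; i++)
    {
        float diff = ax[i] - bx[i];

        distance += diff * diff;
    }

    return distance;
}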
|
|
I think I will need help for the …

The define macros should be fine: https://godbolt.org/z/WbKfPrYoG

What's the output of the following?

$ clang -dM -E -x c /dev/null | grep -iE '(__gnuc__|__clang__|__clang_m..or__)'
#define __GNUC__ 4
#define __clang__ 1
#define __clang_major__ 15
#define __clang_minor__ 0 |
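For context on why that output matters: Clang reports __GNUC__ as 4 for GCC compatibility (as the output above shows), so a version check based on __GNUC__ alone would treat Clang 15 as an ancient GCC. Below is a hedged sketch of one way to keep the two compilers separate; the macro name USE_TARGET_CLONES and the version thresholds are purely illustrative, not the checks used in this PR.

/* Illustrative only: test __clang__ before __GNUC__, since Clang defines
 * both. The thresholds below are placeholders, not the PR's actual values. */
#if defined(__clang__)
#  if __clang_major__ >= 16
#    define USE_TARGET_CLONES 1
#  endif
#elif defined(__GNUC__)
#  if __GNUC__ >= 8
#    define USE_TARGET_CLONES 1
#  endif
#endif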
|
@ashvardanian The best way to benchmark is ann-benchmarks.

@aytekinar Also, there may be issues with musl. I'm seeing a similar performance difference with ann-benchmarks, which I don't think justifies added complexity. |
#if defined _MSC_VER /* MSVC only */
#define __attribute__(x)
#elif defined __APPLE__ /* Apple/OSX only */
#define target_clones(...)
#elif !defined __has_attribute || !__has_attribute(target_clones) || !defined __gnu_linux__ /* target_clones not supported */
#define target_clones(...)
#endif
These, now, handle the following situations:

- for the MS Visual C compiler, __attribute__ is defined to be a no-op,
- for Apple targets, target_clones is defined to be a no-op, and,
- whenever target_clones is not found as an attribute or __gnu_linux__ is not defined (i.e., for Alpine), target_clones is defined to be a no-op.
You can verify the implementation for l2_distance_impl on Godbolt: https://godbolt.org/z/jzMjKbhKe
I have also tested the compilation in alpine:latest images. GCC builds without the SIMD support whereas Clang 16 builds with it. However, when I try to run the compiled code, I receive an error as mentioned in llvm/llvm-project#64631, which is closed by llvm/llvm-project-release-prs#615. In fact, when I test the code in alpine:edge, instead, GCC still builds the code without the SIMD support whereas Clang 17 builds with it, and I am able to see the benefits (speed improvements of up to 70%, again in pure C benchmarks).
|
Thanks @aytekinar, looks like CI is now passing. However, I don't think it makes sense to spend more time on this unless benchmarks show significant improvement in overall performance. |
|
A few other places to look for performance gains would be:
Based on past SIMD optimizations (#180), index build times might benefit the most. |
|
From some recent testing, I do think there may be some headroom on HNSW scans with higher …

Additionally, it'd be good to stress test this against some larger vectors (e.g. 1536-dim) where the CPU is used more. I've found that a 10,000,000 1536-dim test data set can be very revealing in terms of where one can tweak performance. (1,000,000 is good too, if you have space/resource constraints). |
|
Hi all, I'm considering including a version of this in 0.7.0 (and removing …). I did some initial benchmarking with this branch on HNSW build time (single process) with 128 dimensions and found:
If anyone is able to test or has thoughts on the above, please share. |
|
Great news! I can try it. In the meantime, do you want anything from my end regarding this PR? The diff looks neat -- I see that you're going to apply runtime dispatching to only the L2 norm computations for the time being. Would you be interested in doing the same for L1, cosine and dot, as well? If so, I can modify my PR to include your changeset (i.e., the …). |
|
How were those targets "default", "avx", "fma" chosen for x86-64? I see some concerns around AVX not implying FMA upthread, but AFAIK, in practice they show up together more often than not. In fact, nowadays it's common to not care all that much about the specific instruction sets, but to target the coarser microarchitecture levels. So I'd have gone with something like this instead: …

In my testing, it generates variations of code with increasing sizes of registers used, as well as different instructions. Even after that, it's possible that with the AVX512 instruction set explosion, we might have to think about adding an extra variation for AVX512FP16 (specifically for …). |
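For illustration, here is a sketch of what the microarchitecture-level variant might look like, assuming a GCC or Clang version recent enough to accept the "arch=x86-64-v*" strings in target_clones (the function shown is hypothetical, not code from this PR):

/* Hypothetical sketch: same mechanism as before, but keyed on the coarser
 * x86-64 microarchitecture levels instead of individual ISA extensions. */
__attribute__((target_clones("default", "arch=x86-64-v2", "arch=x86-64-v3", "arch=x86-64-v4")))
static float
inner_product_impl(int dim, const float *ax, const float *bx)
{
    float result = 0.0f;

    for (int i = 0; i < dim; i++)
        result += ax[i] * bx[i];

    return result;
}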
|
I see the macro is: … It only applies to GNU, which is very sane, and I see it's also limited to x86-64, which is a very safe bet. IIRC, ARM64 and PowerPC64 might also have …

I imagine we can worry about adding other architectures in the future, but I do wonder what architectures we'd like to keep an eye out for, since surely not all extensions have to cover the entirety of the postgres platforms. I think ARM64 will likely be the next ask, given its current popularity. Does anyone have thoughts on other architectures? |
|
@aytekinar No need to do anything with the PR. Will add the other distance functions, and include you as a co-author when merging (thanks for the idea and all of the work so far).

@tureba Thanks, will try those out. If you have ideas for what to use for …

Here's what I'm thinking overall:
Also, if anyone has ideas for optimizing distance functions in the halfvec and bitvector branches, please share those as well (a new issue would probably be best). |
Co-authored-by: Arda Aytekin <arda.aytekin@microsoft.com>
|
Added CPU dispatching for …

@aytekinar, added you as a co-author since your work in this PR was very helpful for this. The reason for intrinsics / dispatching in this case was a significant difference in performance on x86-64. With the SIFT 1M dataset (128 dimensions):
Still looking at dispatching for |
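For readers unfamiliar with the intrinsics-based path mentioned above, here is a minimal sketch of a squared L2 distance kernel written with AVX and FMA intrinsics; it only illustrates the technique and is not pgvector's actual kernel.

#include <immintrin.h>

/* Minimal sketch of a squared L2 distance kernel using AVX + FMA intrinsics
 * (illustrative only). The target attribute lets it compile in a translation
 * unit that is otherwise built without -mavx/-mfma. */
__attribute__((target("avx,fma")))
static float
l2_squared_distance_avx_fma(int dim, const float *ax, const float *bx)
{
    __m256 sum = _mm256_setzero_ps();
    int    i = 0;

    for (; i + 8 <= dim; i += 8)
    {
        __m256 va = _mm256_loadu_ps(ax + i);
        __m256 vb = _mm256_loadu_ps(bx + i);
        __m256 diff = _mm256_sub_ps(va, vb);

        sum = _mm256_fmadd_ps(diff, diff, sum);   /* sum += diff * diff */
    }

    /* Horizontal reduction of the eight partial sums */
    __m128 lo = _mm256_castps256_ps128(sum);
    __m128 hi = _mm256_extractf128_ps(sum, 1);
    __m128 s = _mm_add_ps(lo, hi);

    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);

    float result = _mm_cvtss_f32(s);

    /* Scalar tail for dimensions that are not a multiple of 8 */
    for (; i < dim; i++)
    {
        float diff = ax[i] - bx[i];

        result += diff * diff;
    }

    return result;
}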
|
Added CPU dispatching for key vector distance functions in the commit above. A few findings:
Also, some benchmarks with GIST 1M (960-dimensions):
Happy to consider other functions and targets if there are benchmarks to support it. Thanks again @aytekinar. |
|
Thank you, @ankane, for integrating all these changes and responding positively to our request. This was needed for us and our customers. As for @tureba's comment
I had tested the changeset in this PR on |
…h will be removed in 0.7.0 per: pgvector/pgvector#311

Rationale

Binary distributions of pgvector often need to disable the -march=native flag (e.g., Debian patch) because, during build time, one cannot know in advance where the binary will be running. As a result, loops inside the vector operations cannot be auto-vectorized (contrary to the assumption in the code), which results in mediocre performance in places where the binary distributions are used (e.g., cloud providers).

To alleviate the above, I have implemented the SSE, AVX and AVX512F versions of the vector operations, and added a CPU dispatching mechanism (during extension load time) to pick the most recent version the underlying CPU supports (following the best practices in Agner Fog's Optimizing Software in C++ manual, Chapters 12 and 13).
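A minimal sketch of the load-time dispatching idea described above, using GCC/Clang's __builtin_cpu_supports; the symbol names here are hypothetical, and the real changeset selects among its own scalar/SSE/AVX/AVX512F implementations:

/* Minimal sketch of load-time CPU dispatching in the spirit described above
 * (not the PR's actual code; the symbol names are hypothetical). Each
 * variant would be the corresponding hand-written SSE/AVX/AVX-512F kernel. */
typedef float (*distance_fn)(int dim, const float *ax, const float *bx);

extern float l2_distance_scalar(int dim, const float *ax, const float *bx);
extern float l2_distance_sse(int dim, const float *ax, const float *bx);
extern float l2_distance_avx(int dim, const float *ax, const float *bx);
extern float l2_distance_avx512f(int dim, const float *ax, const float *bx);

static distance_fn l2_distance_ptr = l2_distance_scalar;

/* Called once when the extension is loaded (e.g., from _PG_init), so every
 * later call goes straight through the function pointer. */
void
init_distance_dispatch(void)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    __builtin_cpu_init();

    if (__builtin_cpu_supports("avx512f"))
        l2_distance_ptr = l2_distance_avx512f;
    else if (__builtin_cpu_supports("avx"))
        l2_distance_ptr = l2_distance_avx;
    else if (__builtin_cpu_supports("sse"))
        l2_distance_ptr = l2_distance_sse;
#endif
}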
Benchmark

I have created a separate simd-playground repository where I host the different SIMD implementations of the relevant vector operations, together with their unit tests and benchmarks. When doing the benchmarks, I have used the following flags:

- GCC/Clang (non-native): -Wall -Wpedantic -O2 -DNDEBUG -ftree-vectorize -fassociative-math -fno-trapping-math -fno-math-errno -fno-signed-zeros -funroll-loops
- GCC/Clang (native): -Wall -Wpedantic -O2 -DNDEBUG -ftree-vectorize -fassociative-math -fno-trapping-math -fno-math-errno -fno-signed-zeros -funroll-loops -march=native
- MSVC (non-native): /Wall /O2 /fp:fast
- MSVC (native): /Wall /O2 /fp:fast /arch:AVX512

C Benchmark
Below, I provide the benchmark results on my machine (with an 11th Gen Intel(R) Core(TM) i9-11900H @ 2.50GHz). Note that the benchmark name is in the form BM_<vector_op>/{1,2,3,4}/<vector_len>, where 1: scalar, 2: sse, 3: avx and 4: avx512f. The scalar version is simply a for-loop, which is vectorized in native builds, and the rest are manually implemented vectorized versions for the respective targets:

Non-native builds
Native builds
In short, manual SIMD implementations are more performant than those of GCC in both non-native (up to ~70%) and native (~50%) builds. They are still more performant than those of Clang and MSVC in non-native builds (up to ~70%), while staying within the 5-10% range in native builds (here, Clang is a different beast; it performs much better in native builds especially when/if the vector lengths get bigger).
pgbench

To see if the above results from pure (in-memory) benchmarks are also applicable to SQL operations, I have created the following benchmark data:
and the vector operations:
Below are the results (v0.5.1; non-native build):

$ pgbench --host localhost --username postgres --no-vacuum --client 10 --jobs 4 --time 30 --file 01-vector-ops.sql postgres
pgbench (16.0)
transaction type: 01-vector-ops.sql
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 4
maximum number of tries: 1
duration: 30 s
number of transactions actually processed: 11298
number of failed transactions: 0 (0.000%)
latency average = 26.567 ms
initial connection time = 5.215 ms
tps = 376.402088 (without initial connection time)

and the patch in this PR (again, non-native build):
$ pgbench --host localhost --username postgres --no-vacuum --client 10 --jobs 4 --time 30 --file 01-vector-ops.sql postgres
pgbench (16.0)
transaction type: 01-vector-ops.sql
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 4
maximum number of tries: 1
duration: 30 s
number of transactions actually processed: 13196
number of failed transactions: 0 (0.000%)
latency average = 22.739 ms
initial connection time = 6.215 ms
tps = 439.765535 (without initial connection time)

As can be seen, there is a 17% relative speed-up compared to the non-native build (which is the case when binary distributions of this extension are / need to be used). Obviously, this is far from the in-memory benchmark results I have reported above, which results from the density of vector operations in a typical PostgreSQL session (see, e.g., the flame graph below for a sample session that runs the benchmark script manually, in which the portion of the vector operations is around 12%).