Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add AVX-512 support to Hamming and Jaccard distance functions. #519

Conversation

nathan-bossart
Copy link
Contributor

These distance functions are a natural fit for AVX-512 instructions. I'm seeing a decent speedup on top of the ongoing work to process these 64-bits at a time.

On v17 (which uses AVX-512 for pg_popcount() when possible), with ~100k randomly generated 2000-bit vectors, maintenance_work_mem = '8GB', and max_parallel_maintenance_workers = 1, I am seeing the following results:

At commit d3c49f1b7d:

postgres=# CREATE INDEX ON items USING hnsw (embedding bit_jaccard_ops);
CREATE INDEX
Time: 14401.630 ms (00:14.402)

postgres=# CREATE INDEX ON items USING hnsw (embedding bit_hamming_ops);
CREATE INDEX
Time: 20535.519 ms (00:20.536)

With this patch:

postgres=# CREATE INDEX ON items USING hnsw (embedding bit_jaccard_ops);
CREATE INDEX
Time: 12465.539 ms (00:12.466)

postgres=# CREATE INDEX ON items USING hnsw (embedding bit_hamming_ops);
CREATE INDEX
Time: 16865.540 ms (00:16.866)

I am quite skeptical that I've set up the attributes correctly, but this seems to be enough to get it working on my machine for benchmark purposes. If we want to proceed with these changes, I can spend more time on that.

@jkatz
Copy link
Contributor

jkatz commented Apr 16, 2024

Nice! Here is some benchmarking on this patch using a r6i.16xlarge, gcc 11, using the jaccard ops on the dbpedia-openai-1000k-angular dataset, binary quantized:

No AVX-512 AVX-512 Speedup
8 build workers 598.3 (s) 448.9 (s) 25.0%
64 build workers 155.4 (s) 121.9 (s) 21.6%
ef_search=40 629 QPS 875 QPS 39.1%
ef_search=200 209 QPS 293 QPS 40.1%

@jkatz
Copy link
Contributor

jkatz commented Apr 16, 2024

Some further tests, using an r6i.16xlarge, gcc 11, using hamming ops on the dbpedia-openai-1000k-angular dataset, binary quantized, m=16 and ef_construction=512 and the https://github.com/pgvector/pgvector/tree/hamming-performance-test branch:

No AVX-512 AVX-512 Speedup (%) Speedup (x)
8 build workers 547.6 (s) 279.9 (s) 48.9% 2.0
64 build workers 137.7 (s) 96.6 (s) 29.8% 1.4x
ef_search=40 680 QPS 926 QPS 36.1% 1.4x
ef_search=200 296 QPS 326 QPS 10.1% 1.1x

@ankane
Copy link
Member

ankane commented Apr 17, 2024

Awesome, thanks @nathan-bossart and @jkatz! Added CPU dispatching for this in the bit-dispatch branch (thanks to your recent work on Postgres). Let me know what you think.

@nathan-bossart
Copy link
Contributor Author

You bit-dispatch branch looks pretty solid to me. A couple small notes:

#if defined(__x86_64__) || defined(_M_AMD64)
#define BIT_DISPATCH
#endif

We use this to assume the presence of immintrin.h, __attribute__((target("avx512vpopcntdq"))), and one of the CPUID intrinsics. If you're only interested in newer compilers, I suspect this is fine, but it may not work on older ones.

/* Use built-ins when possible for inlining */
#if defined(HAVE__BUILTIN_POPCOUNT) && defined(HAVE_LONG_INT_64)
#define popcount64(x) __builtin_popcountl(x)
#elif defined(HAVE__BUILTIN_POPCOUNT) && defined(HAVE_LONG_LONG_INT_64)
#define popcount64(x) __builtin_popcountll(x)
#elif defined(_MSC_VER)
#define popcount64(x) __popcnt64(x)
#else
#define popcount64(x) pg_popcount64(x)
#endif

Commit postgres/postgres@02a6a54 suggests that __builtin_popcount might not always compile to a POPCNT instruction. It does on my machine, so maybe this is fine for pgvector.

	/* TODO Fix defined checks */
#if defined(HAVE__GET_CPUID)
	__get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
#elif defined(HAVE__CPUID)
	__cpuidex(exx, 7, 0);
#endif

Again, if you're only interested in newer compilers, I'd bet that assuming the presence of these intrinsics with those macros is sufficient, but it might not be on older systems. Presumably this is what the TODO is referring to.

You might also want to do an _xgetbv check (see postgres/postgres@792752a). That being said, I'm not certain there's any practical risk there, so it'd probably be reasonable to omit it for now and wait until there is a report from the field.

@ankane
Copy link
Member

ankane commented Apr 18, 2024

Thanks @nathan-bossart, this is really helpful. For the AVX check, should it use OSXSAVE (bit 27) instead of XSAVE (bit 26)?

@nathan-bossart
Copy link
Contributor Author

I think you are right about that. That bit seems to indicate that both the OS and the processor supports XGETBV, not just the processor.

@ankane
Copy link
Member

ankane commented Apr 18, 2024

I think the most recent version of the bit-dispatch branch should address all of the issues you mentioned above.

Edit: Besides the __builtin_popcount one, which I think is okay.

@nathan-bossart
Copy link
Contributor Author

LGTM

ankane added a commit that referenced this pull request Apr 18, 2024
Co-authored-by: Nathan Bossart <nathan@postgresql.org>
Co-authored-by: "Jonathan S. Katz" <jkatz@users.noreply.github.com>
@ankane
Copy link
Member

ankane commented Apr 18, 2024

Great, thanks for driving this! Merged in the commit above.

@ankane ankane closed this Apr 18, 2024
@nathan-bossart
Copy link
Contributor Author

Thanks for merging!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants