
Can I help get tinyint or half branches released? #326

Closed
nathanwilk7 opened this issue Oct 29, 2023 · 19 comments
@nathanwilk7

Is there more to do on the tinyint or half branches to get them released, or are they ready to be put into 0.5.1? If there is more to do for them, let me know and I'll see if it's something I could take care of (e.g.: code, docs, tests, etc.).

@nathanwilk7 changed the title from "Can I Help Get tinyint or half branches released?" to "Can I help get tinyint or half branches released?" on Oct 29, 2023
@ankane
Member

ankane commented Oct 29, 2023

Hi @nathanwilk7, the most helpful thing would be sharing specific use cases for either data type (how the vectors are generated and what specifically they're used for). These may be included in 0.6.0, but I'd like to understand how common they will be.

@nathanwilk7
Author

nathanwilk7 commented Oct 31, 2023

Sure, I'll add my use cases here. For context, we're doing chem/bio ML type work. Thanks for your thoughts on the below.

Our smaller datasets have about 300 million (300M) molecules in them. For those molecules there are a few different types of vectors we'd like to generate. Some of these vectors are generated via cheminformatics methods (basically a molecular hash function with certain similarity properties) and others are generated via embeddings from various ML models.

  1. 300M sparse "count" vectors w/ 1024-2048 dimensions (Morgan Count Fingerprints). Most of the dimensions will be 0, but the non-zero ones will be integers between 0 and some small N (e.g.: 32 would likely be a reasonable upper bound, 256 would be extremely high). For these vectors, we'd like to be able to do ANN searches across them using the standard distance measures (inner product, cosine, L2). The goal here is to be able to run searches to find the most similar molecules to a given molecule or molecule fragment. The tinyint type seems like it could be a good fit here, though supporting sparse vectors would potentially be an even bigger win; dense tinyint might be good enough for now.
  2. 300M embedding vectors w/ 128-1024 dimensions where each dimension is a non-zero decimal number (these are essentially the same as the standard embeddings everyone uses for various ML tasks). We would likely be OK giving up some precision by using half-size floats, product quantization, or any other similar technique.
  3. I'll also add that we'd like 300M sparse bit vectors w/ 1024-2048 dimensions (Morgan Bit Fingerprints). For these vectors, we'd like to be able to do ANN searches across them using Tanimoto/Jaccard distance. I recognize these are probably not going to be supported as true ANN-searchable sparse bit vectors in pgvector anytime soon, so we can probably ignore this unless you think it's on the roadmap.

For some higher-level context, I'm currently running Postgres via GCP Cloud SQL to store our other molecular data, and it would be nice to integrate the molecular fingerprints/counts and embeddings into Postgres as well instead of needing to bring in another ANN lib/service (e.g.: faiss, pinecone, etc). I ran some rough numbers on storage cost and found that, using the current pgvector, I estimate a (very hand-wavy, back-of-the-envelope) storage cost of about $2k-$3k/yr for each set of 300M molecule fingerprint count vectors I store. Cutting that storage cost down by 50% or 75% would make it a lot easier to move forward, especially since I'll want to store multiple embedding vectors for different models.

The general user stories we have are (a rough SQL sketch of a couple of these follows the list):

  1. Given a molecule which is a known "hit" (e.g.: binds to a protein of interest), find the most similar molecules to the hit so we can evaluate them as well.
  2. Given embeddings which are classified as hits by a certain model, find the orderable molecules (from vendors like Enamine) which are most similar to these hits.
  3. Count the number of hits/misses among the nearest N molecules to a given molecule.
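
Roughly, in pgvector terms, stories 1 and 3 might look something like the following (purely hypothetical table/column names for illustration, using cosine distance; our real schema differs):

-- hypothetical schema just to make the queries concrete
CREATE TABLE molecules (id bigserial PRIMARY KEY, is_hit boolean, fingerprint vector(1024));

-- 1. given a known hit, find the most similar molecules to evaluate next
SELECT id FROM molecules
ORDER BY fingerprint <=> (SELECT fingerprint FROM molecules WHERE id = 42)
LIMIT 50;

-- 3. count hits/misses among the 100 nearest neighbors of a given molecule
SELECT is_hit, count(*) FROM (
    SELECT is_hit FROM molecules
    ORDER BY fingerprint <=> (SELECT fingerprint FROM molecules WHERE id = 42)
    LIMIT 100
) nearest GROUP BY is_hit;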

Lastly, in a perfect world I'd like to load up these representations for 36 billion molecules (Enamine REAL) into a vector database and run the same types of searches as above.

Happy to discuss these use cases more and provide more context of course. Please let me know if there's anything I can do to push efforts forward on smaller data sizes, product quantization, or support for sparse/bit vectors.

@ankane
Member

ankane commented Nov 1, 2023

Thanks @nathanwilk7, this is really great context (and a great explanation)! I really appreciate it.

I agree that sparse vectors are probably a better option for 1.

I'd like to see what other use cases people have before deciding to add them.

@andreassteffen

andreassteffen commented Nov 2, 2023

Essentially, we have the very same use case as @nathanwilk7 and would see great benefit in having these types available in pgvector. Very excited to see this!

@cile98

cile98 commented Nov 15, 2023

I agree that adding sparse vector support would be an amazing feature. I currently store 300k vectors with 5k dimensions, but most of the entries are 0, so storing them as sparse vectors would save a bunch of space.

@lrotim

lrotim commented Jan 11, 2024

In many other embedding databases, trading off precision against accuracy is often a valid option.

My example: our database comes out to around 1TB when we build it with Postgres. Compared to, for instance, Faiss, that's about 3x bloat, since the Postgres implementation has overhead in how data is persisted (probably for good reason). However, for some customers, halving the disk space would make it a go.

We confirmed that using fp16 doesn't affect our accuracy much, and it would be a great option to have.

@tureba

tureba commented Feb 23, 2024

I see some acceptance of the potential accuracy loss given the benefits. So maybe if we reduce that loss even further, we can draw even more attention to this feature soon.

I don't see any mention of brain floats or other representations below single precision. Has any investigation been done on those yet?

I mention that because much of the damage comes not from reducing the mantissa but from reducing the exponent: IEEE half floats drop the exponent from 8 bits to 5, capping the representable magnitude at roughly 6.5e4, so going from single precision to standard IEEE halfs is kinda painful. A representation like brain float (bfloat16) keeps the exponent at the same 8 bits as the original single precision (range up to roughly 3.4e38) and only shrinks the mantissa to 7 bits. It's also available in most current compilers, and should be vectorizable from SSE through AVX512BF16.

@ankane
Member

ankane commented Mar 27, 2024

There's now a new halfvec branch for this for anyone who wants to try it (in a non-production environment).

CREATE TABLE items (id bigserial PRIMARY KEY, embedding halfvec(3));
INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');
CREATE INDEX ON items USING hnsw (embedding halfvec_l2_ops);
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
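
To get a rough sense of the per-value storage savings, something along these lines should work (a quick sketch; exact per-datum overhead may differ slightly):

-- compare on-disk datum sizes for the same values
SELECT pg_column_size('[1,2,3]'::vector(3)) AS vector_bytes,
       pg_column_size('[1,2,3]'::halfvec(3)) AS halfvec_bytes;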

@nathanwilk7 There's also a bitvector branch that could be good for bit fingerprints (if you have the bandwidth to try it out, it'd be super helpful).

CREATE TABLE items (id bigserial PRIMARY KEY, fingerprint bit(3));
INSERT INTO items (fingerprint) VALUES (B'000'), (B'111');
CREATE INDEX ON items USING hnsw (fingerprint bit_jaccard_ops);
SELECT * FROM items ORDER BY fingerprint <%> B'101' LIMIT 5;

@tureba From what I've seen, bfloat16 isn't common for nearest neighbor search.

@tureba

tureba commented Mar 28, 2024

Very nice, @ankane. Thanks for sharing. I'll invest some time over the next couple of days checking the code out.

But from what I see so far, there is a new halfvec data type, and hnsw has been taught to index tables that use it. So you can index 32-bit vector columns with indexes holding 32-bit values, and 16-bit columns with indexes holding 16-bit values.

Have you experimented with indexing regular 32-bit vector columns using 16-bit halfvec only inside the index? Maybe an index amoption could be an easy way to control that at index creation time. I ask because in my testing, doing that was enough to bring the performance gains of smaller indexes and more index items per page, leading to faster convergence with acceptable accuracy changes, without the end user having to rewrite their tables entirely.

@ankane
Member

ankane commented Mar 28, 2024

Yeah, I think that could be a common use case. I pushed an update for casting between vector and halfvec (meant to include it earlier). This allows you to do:

CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));
INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');
CREATE INDEX ON items USING hnsw ((embedding::halfvec(3)) halfvec_l2_ops);

-- no re-ranking
SELECT id FROM items ORDER BY embedding::halfvec(3) <-> '[1,2,3]' LIMIT 5;

-- re-ranking
SELECT id FROM (
    SELECT * FROM items ORDER BY embedding::halfvec(3) <-> '[1,2,3]' LIMIT 20
) nearest ORDER BY embedding <-> '[1,2,3]' LIMIT 5;

May also add quantization="half" and quantization="binary" options to index creation at some point, but would like to start here.

@jkatz
Contributor

jkatz commented Mar 30, 2024

Per #326 (comment) and #326 (comment) -- I've been testing this exact case using ANN Benchmark with some modifications to support the halfvec type. I've been running two tests:

  1. Storing a vector in the table, and building the index using a vector (fp32 => fp32); in the table I call this vector/vector
  2. Storing a vector in the table, and building the index using a halfvec (fp32 => fp16); in the table I call this vector/halfvec

A handful of my halfvec tests did not run, but thus far, the results are very promising.

For my test machine, I used an r7gd.16xlarge (64 vCPU, 512GiB RAM) and stored my data on the local disk to eliminate network latency. PostgreSQL settings of note:

| Parameter | Value |
| --- | --- |
| shared_buffers | 128GB |
| max_parallel_maintenance_workers | 63 |
| maintenance_work_mem | 64GB |
| effective_cache_size | 256GB |

Below is a sampling of test results. Note that the ANN Benchmark harness runs only a single query at a time, so this is not a test of concurrency. For each configuration I report index size, build time, recall, and query throughput (QPS).

Dataset: sift-128-euclidean

m=16; ef_construction=32

| Metric | vector/vector | vector/halfvec |
| --- | --- | --- |
| Index size (MB) | 793 | 544 |
| Index build time (s) | 33 | 31 |
| Recall @ ef_search=10 | 0.711 | 0.710 |
| QPS @ ef_search=10 | 2130 | 2349 |
| Recall @ ef_search=40 | 0.908 | 0.907 |
| QPS @ ef_search=40 | 1090 | 1190 |
| Recall @ ef_search=200 | 0.989 | 0.989 |
| QPS @ ef_search=200 | 297 | 322 |

m=16; ef_construction=64

| Metric | vector/vector | vector/halfvec |
| --- | --- | --- |
| Index size (MB) | 782 | 538 |
| Index build time (s) | 37 | 34 |
| Recall @ ef_search=10 | 0.751 | 0.751 |
| QPS @ ef_search=10 | 2180 | 2178 |
| Recall @ ef_search=40 | 0.937 | 0.937 |
| QPS @ ef_search=40 | 1058 | 1119 |
| Recall @ ef_search=200 | 0.995 | 0.995 |
| QPS @ ef_search=200 | 281 | 315 |

m=16; ef_construction=128

| Metric | vector/vector | vector/halfvec |
| --- | --- | --- |
| Index size (MB) | 782 | 538 |
| Index build time (s) | 44 | 40 |
| Recall @ ef_search=10 | 0.770 | 0.770 |
| QPS @ ef_search=10 | 2096 | 2241 |
| Recall @ ef_search=40 | 0.949 | 0.949 |
| QPS @ ef_search=40 | 1049 | 1110 |
| Recall @ ef_search=200 | 0.997 | 0.997 |
| QPS @ ef_search=200 | 279 | 297 |

m=16; ef_construction=256

| Metric | vector/vector | vector/halfvec |
| --- | --- | --- |
| Index size (MB) | 782 | 538 |
| Index build time (s) | 58 | 51 |
| Recall @ ef_search=10 | 0.776 | 0.776 |
| QPS @ ef_search=10 | 2099 | 2284 |
| Recall @ ef_search=40 | 0.954 | 0.954 |
| QPS @ ef_search=40 | 1020 | 1140 |
| Recall @ ef_search=200 | 0.998 | 0.998 |
| QPS @ ef_search=200 | 268 | 302 |

m=16; ef_construction=512

| Metric | vector/vector | vector/halfvec |
| --- | --- | --- |
| Index size (MB) | 7811 | 538 |
| Index build time (s) | 88 | 75 |
| Recall @ ef_search=10 | 0.776 | 0.775 |
| QPS @ ef_search=10 | 1988 | 2118 |
| Recall @ ef_search=40 | 0.956 | 0.956 |
| QPS @ ef_search=40 | 983 | 1053 |
| Recall @ ef_search=200 | 0.998 | 0.998 |
| QPS @ ef_search=200 | 261 | 279 |

Dataset: gist-960-euclidean

m=16; ef_construction=32

| Metric | vector/vector | vector/halfvec |
| --- | --- | --- |
| Index size (MB) | 7811 | 2603 |
| Index build time (s) | 145 | 68 |
| Recall @ ef_search=10 | 0.430 | 0.427 |
| QPS @ ef_search=10 | 1160 | 1184 |
| Recall @ ef_search=40 | 0.687 | 0.684 |
| QPS @ ef_search=40 | 532 | 581 |
| Recall @ ef_search=200 | 0.905 | 0.903 |
| QPS @ ef_search=200 | 141 | 163 |

m=16; ef_construction=64

| Metric | vector/vector | vector/halfvec |
| --- | --- | --- |
| Index size (MB) | 7684 | 2561 |
| Index build time (s) | 161 | 77 |
| Recall @ ef_search=10 | 0.476 | 0.476 |
| QPS @ ef_search=10 | 1178 | 1223 |
| Recall @ ef_search=40 | 0.740 | 0.742 |
| QPS @ ef_search=40 | 541 | 592 |
| Recall @ ef_search=200 | 0.939 | 0.933 |
| QPS @ ef_search=200 | 143 | 160 |

m=16; ef_construction=128

| Metric | vector/vector | vector/halfvec |
| --- | --- | --- |
| Index size (MB) | 7680 | 2560 |
| Index build time (s) | 190 | 101 |
| Recall @ ef_search=10 | 0.497 | 0.502 |
| QPS @ ef_search=10 | 1177 | 1177 |
| Recall @ ef_search=40 | 0.771 | 0.770 |
| QPS @ ef_search=40 | 535 | 568 |
| Recall @ ef_search=200 | 0.952 | 0.953 |
| QPS @ ef_search=200 | 141 | 154 |

m=16; ef_construction=256

| Metric | vector/vector | vector/halfvec |
| --- | --- | --- |
| Index size (MB) | 7678 | 2559 |
| Index build time (s) | 247 | 147 |
| Recall @ ef_search=10 | 0.505 | 0.499 |
| QPS @ ef_search=10 | 1114 | 1187 |
| Recall @ ef_search=40 | 0.780 | 0.784 |
| QPS @ ef_search=40 | 513 | 570 |
| Recall @ ef_search=200 | 0.960 | 0.960 |
| QPS @ ef_search=200 | 135 | 152 |

m=16; ef_construction=512

| Metric | vector/vector | vector/halfvec |
| --- | --- | --- |
| Index size (MB) | 7678 | 2559 |
| Index build time (s) | 349 | 229 |
| Recall @ ef_search=10 | 0.508 | 0.508 |
| QPS @ ef_search=10 | 1105 | 1149 |
| Recall @ ef_search=40 | 0.789 | 0.785 |
| QPS @ ef_search=40 | 503 | 551 |
| Recall @ ef_search=200 | 0.968 | 0.967 |
| QPS @ ef_search=200 | 134 | 145 |

Analysis

Overall, if vector/halfvec takes less time to build and shrinks the index size up to ~3x while maintaining comparable recall and delivering a ~10% throughput improvement, that seems to be a winner. I'd like to see the full comparison with dbpedia-openai-1000k-angular (should have that tomorrow), but generally these results are positive for fp32 => fp16 scalar quantization.

@jkatz
Contributor

jkatz commented Mar 30, 2024

Using the methodology in #326 (comment), below are the results for dbpedia-openai-1000k-angular.

Top-posting the analysis: the patterns are similar to the above -- roughly comparable recall, but vector/halfvec gives ~3x faster index builds and half the space utilization. This time QPS is roughly comparable, but that's likely because each index entry takes up an entire page, even in fp16 quantized form.

Dataset: dbpedia-openai-1000k-angular

m=16; ef_construction=32

| Metric | vector/vector | vector/halfvec |
| --- | --- | --- |
| Index size (MB) | 7734 | 3867 |
| Index build time (s) | 244 | 77 |
| Recall @ ef_search=10 | 0.762 | 0.761 |
| QPS @ ef_search=10 | 1258 | 1245 |
| Recall @ ef_search=40 | 0.913 | 0.911 |
| QPS @ ef_search=40 | 649 | 655 |
| Recall @ ef_search=200 | 0.973 | 0.973 |
| QPS @ ef_search=200 | 207 | 214 |

m=16; ef_construction=64

| Metric | vector/vector | vector/halfvec |
| --- | --- | --- |
| Index size (MB) | 7734 | 3867 |
| Index build time (s) | 264 | 90 |
| Recall @ ef_search=10 | 0.819 | 0.809 |
| QPS @ ef_search=10 | 1231 | 1219 |
| Recall @ ef_search=40 | 0.945 | 0.945 |
| QPS @ ef_search=40 | 627 | 642 |
| Recall @ ef_search=200 | 0.987 | 0.987 |
| QPS @ ef_search=200 | 191 | 190 |

m=16; ef_construction=128

| Metric | vector/vector | vector/halfvec |
| --- | --- | --- |
| Index size (MB) | 7734 | 3867 |
| Index build time (s) | 301 | 115 |
| Recall @ ef_search=10 | 0.843 | 0.844 |
| QPS @ ef_search=10 | 1199 | 1152 |
| Recall @ ef_search=40 | 0.962 | 0.962 |
| QPS @ ef_search=40 | 604 | 591 |
| Recall @ ef_search=200 | 0.993 | 0.993 |
| QPS @ ef_search=200 | 165 | 171 |

m=16; ef_construction=256

| Metric | vector/vector | vector/halfvec |
| --- | --- | --- |
| Index size (MB) | 7734 | 3867 |
| Index build time (s) | 377 | 163 |
| Recall @ ef_search=10 | 0.851 | 0.852 |
| QPS @ ef_search=10 | 1162 | 1162 |
| Recall @ ef_search=40 | 0.968 | 0.968 |
| QPS @ ef_search=40 | 567 | 578 |
| Recall @ ef_search=200 | 0.996 | 0.996 |
| QPS @ ef_search=200 | 156 | 163 |

m=16; ef_construction=512

| Metric | vector/vector | vector/halfvec |
| --- | --- | --- |
| Index size (MB) | 7734 | 3867 |
| Index build time (s) | 508 | 254 |
| Recall @ ef_search=10 | 0.853 | 0.851 |
| QPS @ ef_search=10 | 1118 | 1079 |
| Recall @ ef_search=40 | 0.971 | 0.971 |
| QPS @ ef_search=40 | 530 | 540 |
| Recall @ ef_search=200 | 0.997 | 0.997 |
| QPS @ ef_search=200 | 144 | 149 |

@tureba

tureba commented Apr 1, 2024

Nice results. Those numbers are roughly what I expected to see, based on past experiments, though I admit I'm having trouble running ann-benchmarks as fluidly as you are.

@ankane ankane mentioned this issue Apr 1, 2024
@ankane
Member

ankane commented Apr 2, 2024

Support for half vectors is now merged (32a502c) and will be included in 0.7.0 (#508) 🎉
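
For anyone with an existing vector column, something along these lines should work to convert it in place (it relies on the vector-to-halfvec cast and rewrites the table, so test on a copy first; table/column names are illustrative, and any hnsw index on the column will likely need to be recreated with a halfvec opclass afterwards):

-- convert an existing fp32 column to fp16 in place (table rewrite)
ALTER TABLE items ALTER COLUMN embedding TYPE halfvec(3) USING embedding::halfvec(3);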

@ankane ankane closed this as completed Apr 2, 2024
@ankane
Member

ankane commented Apr 2, 2024

Support for sparse vectors (abac7a3) and bit vectors (94a444f) is merged as well
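
For the sparse count-fingerprint use case above, usage should look roughly like this (a sketch; the literal format is index:value pairs followed by the number of dimensions, but double-check the type and opclass names against the README):

CREATE TABLE items (id bigserial PRIMARY KEY, embedding sparsevec(5));
INSERT INTO items (embedding) VALUES ('{1:1,3:2,5:3}/5'), ('{1:4,3:5,5:6}/5');
CREATE INDEX ON items USING hnsw (embedding sparsevec_l2_ops);
SELECT * FROM items ORDER BY embedding <-> '{1:3,3:1,5:2}/5' LIMIT 5;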

@ankane
Member

ankane commented Apr 8, 2024

Just fyi for anyone testing on x86-64: some key halfvec functions now use intrinsics, so you should see much better performance 🚀

@pashkinelfe
Contributor

pashkinelfe commented Apr 22, 2024

Hi @ankane and @jkatz. I've tested HNSW index build time on ARM (Graviton 3) using the same workflow as #409 (comment).

Unfortunately, I see only a marginal performance difference for halfvec compared to vector for a serial build, though the difference is still there when building with many cores.
[chart: HNSW index build time, vector vs halfvec, serial and parallel builds on Graviton 3]

I've looked at the halfvec code and saw that the product/distance calculation functions for halfvec don't perform better than those for vector on ARM. It seems there is potential for a speedup, but as of now the intrinsics are used only on x86. Did I possibly miss something?

@ankane
Member

ankane commented Apr 23, 2024

Hey @pashkinelfe, thanks for testing/sharing. Which commit hash are you using?

@pashkinelfe
Contributor

pashkinelfe commented Apr 23, 2024

Hi @ankane,
Commit 3df5655, PG 16.2.
The results in the previous comment are for GCC 11; I've now added measurements for GCC 13.
[chart: HNSW index build time measurements with GCC 13]

gcc -v
Target: aarch64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 13.1.0-8ubuntu1~22.04' --with-bugurl=file:///usr/share/doc/gcc-13/README.Bugs --enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++,m2,rust --prefix=/usr --with-gcc-major-version-only --program-suffix=-13 --program-prefix=aarch64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/libexec --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-libquadmath --disable-libquadmath-support --enable-plugin --enable-default-pie --with-system-zlib --enable-libphobos-checking=release --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --enable-fix-cortex-a53-843419 --disable-werror --enable-checking=release --build=aarch64-linux-gnu --host=aarch64-linux-gnu --target=aarch64-linux-gnu --with-build-config=bootstrap-lto-lean --enable-link-serialization=2
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 13.1.0 (Ubuntu 13.1.0-8ubuntu1~22.04)

Build string:
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type -Wshadow=compatible-local -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -g -O3 -g3 -march=native -lstdc++ -march=native -ftree-vectorize -fassociative-math -fno-signed-zeros -fno-trapping-math -fPIC -fvisibility=hidden -I. -I./ -I/usr/local/pgsql/include/server -I/usr/local/pgsql/include/internal -D_GNU_SOURCE -c -o src/halfutils.o src/halfutils.c -MMD -MP -MF .deps/halfutils.Po
