
Question: Why is MAX_PQ_TRAINING_SET_SIZE a fixed constant in the ProductQuantization class? #157

Closed
karlney opened this issue Nov 23, 2023 · 2 comments

karlney commented Nov 23, 2023

Hi, in the ProductQuantization class there is a hard-coded value for MAX_PQ_TRAINING_SET_SIZE:

https://github.com/jbellis/jvector/blob/5b92a134212c8ed7b2fa0d6739233c6b9b0cb9a3/jvector-base/src/main/java/io/github/jbellis/jvector/pq/ProductQuantization.java#L48C6-L48C6

I would like to understand the rationale for capping the quantization training set at that (relatively small) number.

  1. Have you considered making this value configurable?
    AND/OR
  2. Have you run benchmarks showing that you never need more than that number of vectors to train the PQ model?
    2.1 If so, what are the inherent assumptions about the training data set? For example, inserting 128k near-identical vectors is of course not a great idea, but are there other things to watch out for?
    What I am after is whether you have any literature, tests, etc. in this area that you can refer to.

The main reason I am asking is that we are currently using this PQ class in isolation for a semantic search project, and we are unsure whether training on just 128k vectors will be good enough, since we are looking to index >1B vectors in the end.
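
For concreteness, here is a minimal sketch of the subsampling step we have in mind: uniformly sampling a large corpus down to the 128k cap before training. The class and method names are illustrative, not the actual jvector API; only the 128 * 1024 value mirrors the constant.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Sketch: cap the PQ training set by uniform random subsampling.
// trainPq() is left out; this only shows how the training set would be chosen.
public class PqTrainingSample {
    static final int MAX_PQ_TRAINING_SET_SIZE = 128 * 1024; // mirrors the jvector constant

    static List<float[]> sampleTrainingSet(List<float[]> corpus, long seed) {
        if (corpus.size() <= MAX_PQ_TRAINING_SET_SIZE) {
            return corpus;
        }
        // Shuffle a copy of the references and take the first 128k.
        // Uniform sampling means near-duplicate vectors are represented
        // only in proportion to their share of the corpus.
        List<float[]> copy = new ArrayList<>(corpus);
        Collections.shuffle(copy, new Random(seed));
        return copy.subList(0, MAX_PQ_TRAINING_SET_SIZE);
    }
}
```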

We might also start evaluating the full functionality of this library, but we are hoping for this Lucene feature, apache/lucene#12615, to become available soon so we don't have to wire it up ourselves.

Thanks in advance.

jbellis (Owner) commented Nov 24, 2023

Haven't tested it super exhaustively, but:

  1. This is the setting that Microsoft's DiskANN code uses.
  2. Across all the Bench datasets, we did not observe lower recall when training on this subset versus training on the entire set of vectors.

karlney (Author) commented Nov 24, 2023

Thanks for the quick answer; very useful information.

karlney closed this as completed Nov 24, 2023