Hi, in the ProductQuantization class there is a hard-coded value for MAX_PQ_TRAINING_SET_SIZE:
https://github.com/jbellis/jvector/blob/5b92a134212c8ed7b2fa0d6739233c6b9b0cb9a3/jvector-base/src/main/java/io/github/jbellis/jvector/pq/ProductQuantization.java#L48C6-L48C6
I would like to know the rationale for "only" using that (relatively small) number of vectors for the quantization training.
1. Have you considered making this value configurable?
AND/OR
2. Have you done benchmarks that show you never need more than that number of vectors to train the PQ model?
2.1 If so, what are the inherent assumptions about the training data set? For example, inserting 128k identical vectors is of course not a great idea, but are there more things to think about?
What I am after is whether you have any literature, tests, etc. in this area that you can refer to.
The main reason I am asking is that we are currently using that PQ class in isolation for a semantic search project, and we are unsure whether training on just 128k vectors will be good enough, as we are looking to index >1B vectors in the end.
We might also start to evaluate using the full functionality of this library, but we are hoping that this Lucene feature, apache/lucene#12615, will be available soon so we don't have to wire it up ourselves.
Thanks in advance.