
Question: Why is MAX_PQ_TRAINING_SET_SIZE a fixed constant in the ProductQuantization class? #157

Closed
karlney opened this issue Nov 23, 2023 · 2 comments

karlney commented Nov 23, 2023

Hi, in the ProductQuantization class there is a hard-coded value for MAX_PQ_TRAINING_SET_SIZE:

https://github.com/jbellis/jvector/blob/5b92a134212c8ed7b2fa0d6739233c6b9b0cb9a3/jvector-base/src/main/java/io/github/jbellis/jvector/pq/ProductQuantization.java#L48C6-L48C6

I would like to understand the rationale for capping the quantization training set at that (relatively small) number.

  1. Have you considered making this value configurable?
    AND/OR
  2. Have you run benchmarks showing that you never need more than that number of vectors to train the PQ model?
    2.1 If so, what are the inherent assumptions about the training data set? For example, inserting 128k near-identical vectors is of course not a great idea, but are there other things to watch out for?
    What I am after is whether you have any literature, tests, etc. in this area that you can refer to.

The main reason I am asking is that we are currently using this PQ class in isolation for a semantic search project, and we are unsure whether training on just 128k vectors will be good enough, since we are looking to index >1B vectors in the end.
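
For concreteness, here is a minimal sketch of the subsampling step we have in mind: uniformly sampling a large corpus down to the 128k cap before training. The class and method names are illustrative, not the actual jvector API; only the 128 * 1024 value mirrors the constant.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Sketch: cap the PQ training set by uniform random subsampling.
// trainPq() is left out; this only shows how the training set would be chosen.
public class PqTrainingSample {
    static final int MAX_PQ_TRAINING_SET_SIZE = 128 * 1024; // mirrors the jvector constant

    static List<float[]> sampleTrainingSet(List<float[]> corpus, long seed) {
        if (corpus.size() <= MAX_PQ_TRAINING_SET_SIZE) {
            return corpus;
        }
        // Shuffle a copy of the references and take the first 128k.
        // Uniform sampling means near-duplicate vectors are represented
        // only in proportion to their share of the corpus.
        List<float[]> copy = new ArrayList<>(corpus);
        Collections.shuffle(copy, new Random(seed));
        return copy.subList(0, MAX_PQ_TRAINING_SET_SIZE);
    }
}
```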

We might also start evaluating the full functionality of this library, but we are hoping for this Lucene feature, apache/lucene#12615, to become available soon so we don't have to wire it up ourselves.

Thanks in advance.

jbellis (Owner) commented Nov 24, 2023

Haven't tested it super exhaustively, but:

  1. This is the setting that Microsoft's DiskANN code uses.
  2. Across all the Bench datasets, we did not observe lower recall when training on this subset versus training on the entire set of vectors.

karlney (Author) commented Nov 24, 2023

Thanks for the quick answer; very useful information.

karlney closed this as completed Nov 24, 2023