Question: best way to store a vector ID? #22
Comments
Hi! There is unfortunately no built-in way of storing metadata or different IDs inside granne. Both of your suggested approaches seem feasible. For the second one, if you leave the exponent part of the float untouched (or at least small enough) when encoding IDs in an extra column, I don't think it should affect the results too much. This will work as long as you can encode the IDs using few enough bits. If you are worried about inaccuracies, you could recompute the distances without the extra column after you get an initial result set back.
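To make the "few enough bits" constraint concrete, here is a minimal sketch of packing an ID into the mantissa of an f32 while pinning the exponent to a small value. The names `encode_id` and `decode_id` and the choice of 2^-30 as the fixed exponent are illustrative assumptions, not part of granne. The sketch also assumes the stored component comes back bit-for-bit, which would not hold if the vector is normalized or quantized on insertion.

```rust
/// Number of explicit mantissa bits in an IEEE 754 single-precision float.
const MANTISSA_BITS: u32 = 23;
/// Mask selecting the mantissa bits of an f32 bit pattern.
const MANTISSA_MASK: u32 = (1 << MANTISSA_BITS) - 1;
/// Biased exponent field for 2^-30 (the f32 bias is 127). Hypothetical choice:
/// it keeps the extra component tiny compared to typical embedding values.
const EXPONENT_FIELD: u32 = (127 - 30) << MANTISSA_BITS;

/// Packs `id` into the mantissa of a small positive float.
/// Returns `None` if the ID needs more than 23 bits.
pub fn encode_id(id: u32) -> Option<f32> {
    if id > MANTISSA_MASK {
        return None;
    }
    Some(f32::from_bits(EXPONENT_FIELD | id))
}

/// Recovers the ID from a float produced by `encode_id`.
pub fn decode_id(x: f32) -> u32 {
    x.to_bits() & MANTISSA_MASK
}
```

Every encoded value lies in [2^-30, 2^-29), so the extra column contributes almost nothing to a dot product. The cost is that only 2^23 (about 8.4 million) distinct IDs fit in a single f32 column.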
Interesting idea about using the exponents. I'm expecting to have ~billions of rows, so keeping a separate list of index -> ID sounds like the right way to go. I also noticed that the function that re-sorts the index (for more efficient memory-mapped files on disk) returns a list that can be used to re-sort a separate index list in the same way, so that approach seems slightly easier as well.
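A separate index -> ID list could look roughly like the sketch below. `IdMap` and its methods are hypothetical names, not granne API. The sketch assumes search results are insertion-order positions, and that the reordering step yields a permutation where `order[new_index]` is the old index. That direction is an assumption and should be checked against the library before relying on it.

```rust
/// Side table mapping positional vector indices to external IDs.
/// Hypothetical helper, kept alongside the index rather than inside it.
pub struct IdMap {
    ids: Vec<u64>,
}

impl IdMap {
    pub fn new() -> Self {
        IdMap { ids: Vec::new() }
    }

    /// Records the external ID of the next inserted vector and returns the
    /// positional index that vector is assumed to receive.
    pub fn push(&mut self, id: u64) -> usize {
        self.ids.push(id);
        self.ids.len() - 1
    }

    /// Translates a positional index from a search result into an external ID.
    pub fn get(&self, index: usize) -> Option<u64> {
        self.ids.get(index).copied()
    }

    /// Applies a permutation after the index has been re-sorted.
    /// Assumption: `order[new_index] == old_index`.
    pub fn reorder(&mut self, order: &[usize]) {
        assert_eq!(order.len(), self.ids.len(), "permutation length mismatch");
        let reordered: Vec<u64> = order.iter().map(|&old| self.ids[old]).collect();
        self.ids = reordered;
    }
}
```

At 8 bytes per entry, a billion IDs take about 8 GB, so at that scale the table itself may need to live in a memory-mapped file rather than an in-memory `Vec`.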
Update on this. I have 768 dimension vectors, like this:
and I tried appending the ID (an int cast to a float) as a 769th dimension:
where 33931 is the ID of this vector. This did not work well: the results were quite wrong. For those following along at home, keeping a separate list seems like the better approach.
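A small sketch of why this goes wrong under angular distance: a raw ID such as 33931 is far larger than typical embedding components, so the appended column dominates both the dot product and the norms. The helper names below are made up for illustration, and plain cosine similarity stands in for whatever the library computes internally.

```rust
/// Cosine similarity between two equal-length vectors.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

/// Returns a copy of `v` with the ID cast to a float and appended as an
/// extra dimension (the approach that gave wrong results above).
pub fn with_id_column(v: &[f32], id: u32) -> Vec<f32> {
    let mut out = v.to_vec();
    out.push(id as f32);
    out
}
```

With this, two orthogonal vectors that receive neighbouring IDs (say 33931 and 33932) come out almost perfectly similar, while two identical vectors with very different IDs come out noticeably less similar. In both cases the ranking reflects the IDs rather than the embeddings.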
I think hnsw_rs (https://crates.io/crates/hnsw_rs) is a better option for IDs, and it provides many other distances such as Hamming, Jaccard, and Euclidean, among others. We have only angular distance here, and it is not trivial to add other distance metrics. Jianshu
Is there a recommended best practice for storing an ID along with a vector?
My use case is sentence embedding. I have many passages of text, and I'd like to be able to search that embedding.
However, I'm trying to think of an optimal strategy to link a vector back to the original text. As near as I can tell, the IDs returned when running a search() are related to the order of insertion (i.e. the first inserted vector has index 0, the second has index 1, etc.). Some ideas I thought of:
Any recommendations?