Question: best way to store a vector ID? #22

Open
mmisiewicz opened this issue Oct 11, 2022 · 4 comments
Comments

@mmisiewicz

Is there a recommended best practice for storing an ID along with a vector?

My use case is sentence embedding. I have many passages of text, and I'd like to be able to search that embedding.

The problem is finding a good strategy for linking a vector back to its original text. As near as I can tell, the IDs returned by search() reflect the order of insertion (i.e. the first inserted vector has index 0, the second has index 1, etc.).

Some ideas I thought of:

  • Storing the vector ID in another database to provide the mapping. This would add a lot of boilerplate and complexity, and I expect to need to remove items from the index from time to time, which complicates things further.
  • Adding a column to my embedding vector to store a numerical ID (but the stray value might affect accuracy).

Any recommendations?

@plenkl
Collaborator

plenkl commented Oct 16, 2022

Hi! There is unfortunately no built-in way of storing metadata or different IDs inside granne. Both of your suggested approaches seem feasible.

For the second one, if you leave the exponent part of the float untouched (or at least small enough) when encoding IDs in an extra column, I don't think it should affect the results too much. This will work as long as you can encode the IDs using few enough bits. If you are worried about inaccuracies, you could recompute the distances without the extra column after you get an initial result set back.
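A minimal sketch of the idea above, assuming the vectors are stored as 32-bit floats and that the index keeps the extra column bit-for-bit (neither is confirmed in this thread; if stored vectors are normalized or quantized, the ID bits would not survive). The names `id_to_float`, `float_to_id`, `rerank`, and the choice of `BIASED_EXPONENT` are illustrative, not part of granne. The ID goes into the 23 mantissa bits while the exponent is pinned to a tiny value, so the extra column contributes almost nothing to the distance.

```python
import math
import struct

MANTISSA_BITS = 23      # a float32 has a 23-bit mantissa
BIASED_EXPONENT = 64    # assumed choice: values land in [2**-63, 2**-62)

def id_to_float(vector_id):
    """Pack an ID into the mantissa of a tiny positive float32."""
    if not 0 <= vector_id < (1 << MANTISSA_BITS):
        raise ValueError("ID does not fit in 23 bits")
    bits = (BIASED_EXPONENT << MANTISSA_BITS) | vector_id
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def float_to_id(value):
    """Recover the ID from the mantissa bits of the float32 value."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    return bits & ((1 << MANTISSA_BITS) - 1)

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def rerank(query, hits):
    """Recompute distances without the ID column (last element of each hit).

    Returns (id, distance) pairs sorted by distance over the original dimensions.
    """
    scored = [(float_to_id(v[-1]), cosine_distance(query, v[:-1])) for v in hits]
    return sorted(scored, key=lambda pair: pair[1])
```

With 23 mantissa bits this covers roughly 8.4 million distinct IDs, which is the "few enough bits" limit in practice for a single float32 column.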

@mmisiewicz
Author

Interesting idea about using the exponents. I'm expecting to have ~billions of rows, so it sounds like keeping a separate list of index -> ID is the right way to go.
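For those who want something concrete, here is a rough sketch of such a separate list, assuming search results come back as insertion-order positions as described earlier in the thread. `IdTable` and its methods are hypothetical names, and the tombstone set is only one assumed way to handle the removals mentioned above; how the index itself deals with deleted items is not covered here.

```python
class IdTable:
    """Maps positional indices (0, 1, 2, ...) to external IDs."""

    def __init__(self):
        self._ids = []         # position -> external ID, in insertion order
        self._removed = set()  # positions treated as deleted (tombstones)

    def add(self, external_id):
        """Record the ID for the next inserted vector and return its position."""
        self._ids.append(external_id)
        return len(self._ids) - 1

    def remove(self, position):
        """Mark a position as deleted without shifting later positions."""
        self._removed.add(position)

    def resolve(self, positions):
        """Translate search-result positions to external IDs, skipping deleted ones."""
        return [self._ids[p] for p in positions if p not in self._removed]
```

At billions of rows a plain Python list would be too heavy; a flat array of 64-bit integers (for example a memory-mapped file) would play the same role.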

I did notice that the function that re-sorts the index (for more efficient memory-mapped files on disk) returns a list that can be used to re-sort a separate index list in the same way, so that makes this approach slightly easier as well.
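A small sketch of applying such a returned list to the separate ID list. It assumes the convention `permutation[new_position] == old_position`; that convention is an assumption here and should be checked against the library documentation before relying on it. `reorder_ids` is a hypothetical helper name.

```python
def reorder_ids(ids, permutation):
    """Re-sort the ID list so it lines up with the reordered index.

    Assumes permutation[new_position] == old_position.
    """
    if sorted(permutation) != list(range(len(ids))):
        raise ValueError("not a permutation of the ID list")
    return [ids[old] for old in permutation]
```

If the convention turns out to be the inverse (`permutation[old_position] == new_position`), the list comprehension has to be replaced by writing each ID into its new slot instead.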

@mmisiewicz
Author

Update on this. I have 768 dimension vectors, like this:

[-0.018988762,0.050045867,0.027589366,-0.020255722, ... ]

and I tried appending the ID (int cast to float) as a 769th dimension:

[... 0.008151617,-0.037780758,-0.0036997749,-0.041787587,-0.011534975, 33931]

where 33931 is the ID of this vector. Doing this did not work well; the results were quite wrong.
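A likely explanation, sketched with toy numbers: with an angular (cosine) distance, a raw ID in the tens of thousands dwarfs components of magnitude around 0.05, so the extra dimension decides almost the whole result. The 2-dimensional vectors and the neighbouring ID 33932 below are made up for illustration; only 33931 comes from the example above.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Two vectors that are orthogonal, i.e. completely dissimilar.
a = [1.0, 0.0]
b = [0.0, 1.0]
before = cosine_similarity(a, b)

# Appending raw IDs makes them look nearly identical.
after = cosine_similarity(a + [33931.0], b + [33932.0])

# Conversely, the same vector stored under two distant IDs looks dissimilar.
same_vector = cosine_similarity(a + [1.0], a + [33931.0])

print(before, after, same_vector)
```

The similarity of the unrelated pair jumps from 0 to almost 1, while the identical pair drops well below 1, which matches the "quite wrong" results.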

Keeping a separate list seems like a better approach, for those following along at home.

@jianshu93

I think hnsw_rs (https://crates.io/crates/hnsw_rs) is a better option for IDs, and it provides many other distances such as Hamming, Jaccard, Euclidean, etc. We have only angular distance here, and it is not trivial to add other distance metrics.

Jianshu

3 participants