Question: best way to store a vector ID? #22

mmisiewicz · 2022-10-11T22:20:36Z

Is there a recommended best practice for storing an ID along with a vector?

My use case is sentence embedding. I have many passages of text, and I'd like to be able to search that embedding.

What I'm struggling with is how best to link a vector back to its original text. As far as I can tell, the IDs returned by search() reflect the order of insertion (i.e. the first inserted vector has index 0, the second has index 1, etc.).

Some ideas I thought of:

Storing the vector ID in another database to provide the mapping, but this would be a lot of boilerplate and complexity. I also expect to need to remove items from the index from time to time, which complicates things further.
Adding a column to my embedding vector to store a numerical ID (but this might affect accuracy, since the extra value is stray data as far as the distance computation is concerned).

Any recommendations?

plenkl · 2022-10-16T16:51:12Z

Hi! There is unfortunately no built-in way of storing metadata or different IDs inside granne. Both of your suggested approaches seem feasible.

For the second one, if you leave the exponent part of the float untouched (or at least small enough) when encoding IDs in an extra column, I don't think it should affect the results too much. This will work as long as you can encode the IDs using few enough bits. If you are worried about inaccuracies, you could recompute the distances without the extra column after you get an initial result set back.
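One way to read the suggestion above, as a minimal sketch rather than anything granne provides: keep the float32 exponent fixed at a small value and store the ID in the 23 mantissa bits. The helper names `encode_id` and `decode_id` and the choice of 2**-20 as the magnitude are assumptions made for illustration. The sketch also assumes the stored float32 comes back bit-for-bit, which would not hold if vectors are normalized or quantized on insertion.

```python
import struct

MANTISSA_BITS = 23          # width of the float32 mantissa
FIXED_EXPONENT = 127 - 20   # biased exponent for 2**-20 (assumed "small enough")


def encode_id(vector_id: int) -> float:
    """Pack an ID into the mantissa of a tiny float32 value."""
    if not 0 <= vector_id < (1 << MANTISSA_BITS):
        raise ValueError("ID does not fit in 23 mantissa bits")
    bits = (FIXED_EXPONENT << MANTISSA_BITS) | vector_id
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def decode_id(value: float) -> int:
    """Recover the ID from the mantissa bits of a float32 value."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    return bits & ((1 << MANTISSA_BITS) - 1)
```

Every encoded value lies between 2**-20 and 2**-19, so the extra column stays far smaller than typical embedding components. A single column holds about 8.4 million distinct IDs, so larger ID spaces would need several such columns.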

mmisiewicz · 2022-10-16T23:22:05Z

Interesting idea about using the exponents. I'm expecting to have ~billions of rows, so it sounds like keeping a separate list of index -> ID is the right way to go.
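A separate index -> ID list could look roughly like the sketch below. The class `IdMap` and its methods are hypothetical names, not part of granne, and removal here only hides an entry from resolved results; it does not remove anything from the index itself.

```python
class IdMap:
    """Maps insertion-order positions in the index to external IDs."""

    def __init__(self):
        self._ids = []         # position -> external ID (None once removed)
        self._positions = {}   # external ID -> position

    def add(self, external_id):
        """Record an ID for the next inserted vector and return its position."""
        position = len(self._ids)
        self._ids.append(external_id)
        self._positions[external_id] = position
        return position

    def remove(self, external_id):
        """Mark an ID as removed so it is filtered out of later results."""
        position = self._positions.pop(external_id)
        self._ids[position] = None

    def resolve(self, positions):
        """Translate positions returned by a search into live external IDs."""
        return [self._ids[p] for p in positions if self._ids[p] is not None]
```

At billions of rows, a plain Python list would likely be too heavy; a typed array or a memory-mapped file of fixed-width IDs would serve the same purpose with less overhead.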

I did notice that the function that re-sorts the index (for more efficient memory-mapped files on disk) returns a list that can be used to re-sort a separate index list in the same way, so that does seem slightly easier as well.
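Applying such a returned list to the ID list might look like this. It is a sketch under an assumption about direction: `order[i]` is taken to be the old position of the element that now sits at position `i`. If the library's list means the opposite, the inverse permutation is needed instead. `apply_reorder` is a hypothetical helper name.

```python
def apply_reorder(ids, order):
    """Return ids rearranged so that new position i holds ids[order[i]]."""
    if sorted(order) != list(range(len(ids))):
        raise ValueError("order must be a permutation of the positions")
    return [ids[old] for old in order]
```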

mmisiewicz · 2022-12-22T04:32:20Z

Update on this. I have 768 dimension vectors, like this:

[-0.018988762,0.050045867,0.027589366,-0.020255722, ... ]

and I tried appending the ID (int cast to float) as a 769th dimension:

[... 0.008151617,-0.037780758,-0.0036997749,-0.041787587,-0.011534975, 33931]

Where 33931 is the ID of this vector. Doing this did not work well, results were quite wrong.
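A likely explanation is that a raw ID of this size dwarfs the other components, so under an angular (cosine) distance nearly every pair of vectors looks almost identical. The toy check below uses made-up 3-dimensional vectors rather than the real 768-dimensional data, and `cosine_similarity` is a hand-written helper, not a granne function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Two made-up embeddings that point in exactly opposite directions.
a = [0.02, -0.05, 0.03]
b = [-0.02, 0.05, -0.03]

before = cosine_similarity(a, b)                         # close to -1
after = cosine_similarity(a + [33931.0], b + [33932.0])  # close to +1
```

With the raw IDs appended, two opposite vectors come out as near-duplicates, which is consistent with the poor results reported above.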

Keeping a separate list seems like a better approach, for those following along at home.

jianshu93 · 2024-09-13T11:44:10Z

I think hnsw_rs (https://crates.io/crates/hnsw_rs) is a better option for IDs, and it provides many other distances such as hamming, jaccard, euclidean, etc. We have only angular here and it is not trivial to add other distance metrics.

Jianshu
