Conversation
Users can now create IVFFlat indexes with a fixed number of samples:

```sql
CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops)
WITH (lists = 100, samples = 20000);
```

This allows maintaining 20,000 samples for k-means clustering regardless of the number of lists, ensuring consistent cluster-center quality and recall. The implementation has been tested, with all existing regression tests passing.
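As a rough sketch of what the option changes: without `samples`, the number of training vectors fed to k-means scales with `lists`; with it, the sample size is pinned. The `50 * lists` default below is an illustrative stand-in, not pgvector's exact heuristic, and `training_sample` is a hypothetical helper:

```python
import random

def training_sample(vectors, lists, samples=None):
    """Choose the vectors fed to k-means when building an IVFFlat index.

    Illustrative sketch only: `50 * lists` stands in for a
    list-dependent default, while `samples` models the proposed
    option that fixes the sample size regardless of lists.
    """
    target = samples if samples is not None else 50 * lists
    if len(vectors) <= target:
        return list(vectors)
    return random.sample(vectors, target)

rows = list(range(1_000_000))
print(len(training_sample(rows, lists=100)))                 # list-dependent: 50 * 100
print(len(training_sample(rows, lists=100, samples=20000)))  # fixed at 20,000
```

With a fixed `samples`, raising `lists` no longer shrinks the per-list training budget, which is the consistency the PR description is after.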
|
Hi @digoal, thanks for the PR. However, I'm not sure it's common enough to add. What's the motivation for making it configurable? Are you seeing significantly higher recall with more samples? |
|
Thanks for the feedback. I understand the concern about commonality. The motivation comes from large datasets: when lists is too small, the centroid calculation becomes less precise, leading to poor partitioning. You can reproduce this with the following test:

1. Generate 1,000 random base vectors.
2. For each base vector, create 999 variants by Gaussian perturbation (1M vectors in total).
3. Build an IVFFlat index and compare lists = 100 vs. lists = 1000.

In my tests, at probes = 1, recall with lists = 1000 is significantly higher than with lists = 100. This flexibility lets users optimize the Voronoi cells for specific data distributions. |
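The data generator described above can be sketched as follows (scaled down to keep it quick; the dimensionality, noise scale, and reduced counts are illustrative assumptions, not values from the PR):

```python
import random

DIM = 8         # vector dimensionality (illustrative assumption)
SIGMA = 0.05    # Gaussian perturbation scale (illustrative assumption)
N_BASE = 100    # scaled down from 1,000 base vectors
N_VARIANTS = 9  # scaled down from 999 variants per base vector

random.seed(0)
base = [[random.random() for _ in range(DIM)] for _ in range(N_BASE)]

dataset = []
for b in base:
    dataset.append(b)  # the base vector itself
    for _ in range(N_VARIANTS):
        # each variant is the base vector plus small Gaussian noise,
        # so the data forms tight clusters around the base vectors
        dataset.append([x + random.gauss(0.0, SIGMA) for x in b])

# at full scale this yields 1,000 * (1 + 999) = 1,000,000 vectors
print(len(dataset))
```

On data like this, lists = 1000 lets each natural cluster get roughly its own cell, while lists = 100 forces about ten clusters into each cell, which would explain the recall gap at probes = 1.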
|
Thanks. It seems like the situation above can be solved by increasing the number of lists rather than the number of samples. At this point, I don't think there's a strong enough case to make this configurable. |
|
Just for completeness, I had tested sampling several years ago to try to improve the performance/recall ratio, particularly on larger datasets, but I wasn't able to get much of a difference. |