add sample support. Usage #946

Closed
digoal wants to merge 1 commit into pgvector:master from digoal:master

digoal commented Jan 11, 2026

Users can now create IVFFlat indexes with a fixed number of samples:

CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops)
WITH (lists = 100, samples = 20000);

This keeps the k-means clustering sample at 20,000 vectors regardless of the number of lists, which is intended to keep cluster center quality and recall consistent.

The implementation has been tested and verified to work correctly, with all existing regression tests passing.

ankane (Member) commented Jan 12, 2026

Hi @digoal, thanks for the PR. However, I'm not sure it's common enough to add. What's the motivation for making it configurable? Are you seeing significantly higher recall with more samples?

digoal (Author) commented Jan 13, 2026

Thanks for the feedback. I understand the concern about how common this need is. The motivation comes from scenarios with large datasets: when lists is too small, the centroids are computed less precisely, which leads to poor partitioning.

You can easily reproduce this with the following test:

1. Generate 1,000 random base vectors.
2. For each base vector, create 999 variants using Gaussian perturbation (total 1M vectors).
3. Build an IVFFlat index and compare lists=100 vs. lists=1000.
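
For reference, the data-generation steps above could be sketched roughly as follows. This is not the script used for the tests in this thread: the dimension (8), the noise scale `sigma`, the seed, and the function names `make_clustered_vectors` and `to_vector_literal` are all assumptions made for illustration. Each base vector is emitted together with its perturbed variants, so 1,000 bases with 999 variants each give 1M rows.

```python
import random


def make_clustered_vectors(n_base=1000, variants_per_base=999, dim=8,
                           sigma=0.05, seed=0):
    """Yield each random base vector followed by its Gaussian-perturbed variants.

    dim, sigma and seed are illustrative assumptions; the thread does not
    state which values were used.
    """
    rng = random.Random(seed)
    for _ in range(n_base):
        base = [rng.random() for _ in range(dim)]
        yield base
        for _ in range(variants_per_base):
            # Add independent N(0, sigma^2) noise to every coordinate.
            yield [x + rng.gauss(0.0, sigma) for x in base]


def to_vector_literal(vec):
    """Format a vector in pgvector's text form, e.g. [0.100000,0.200000]."""
    return "[" + ",".join(f"{x:.6f}" for x in vec) + "]"


if __name__ == "__main__":
    # Small demo run; the defaults produce the full 1M rows.
    for v in make_clustered_vectors(n_base=2, variants_per_base=3, dim=4):
        print(to_vector_literal(v))
```

The printed literals can be written to a file and loaded into the `embedding` column (for example with `COPY`) before building the two indexes.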

In my tests, with ivfflat.probes = 1, the recall with lists=1000 is significantly higher than with lists=100. This flexibility allows users to optimize the Voronoi cells for specific data distributions.
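
To make the comparison concrete, recall can be computed as the fraction of the true nearest neighbors that the index actually returns. The helper below is a minimal sketch, not code from the PR; the name `recall_at_k` is hypothetical, and it assumes the exact IDs come from a query run without the index (a sequential scan) while the approximate IDs come from the same query run through the IVFFlat index.

```python
def recall_at_k(approx_ids, exact_ids):
    """Return the share of exact nearest-neighbor IDs found in the approximate result."""
    exact = set(exact_ids)
    if not exact:
        raise ValueError("exact_ids must not be empty")
    return len(exact & set(approx_ids)) / len(exact)


if __name__ == "__main__":
    # Three of the four true neighbors were returned by the index.
    print(recall_at_k([1, 2, 3, 9], [1, 2, 3, 4]))
```

Averaging this value over a set of query vectors for each index gives one recall figure per `lists` setting.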

ankane (Member) commented Jan 13, 2026

Thanks. It seems like the situation above can be solved by increasing the number of lists rather than the number of samples.

At this point, I don't think there's a strong enough case to make this configurable.

ankane closed this Jan 13, 2026

jkatz (Contributor) commented Jan 13, 2026

Just for completeness, I had tested sampling several years ago to try to improve the performance/recall ratio, particularly on larger datasets, but I wasn't able to get much of a difference.
