Conversation
Users can now create IVFFlat indexes with a fixed number of samples:

```sql
CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops)
WITH (lists = 100, samples = 20000);
```

This allows maintaining 20,000 samples for k-means clustering regardless of the number of lists, ensuring consistent cluster-center quality and recall. The implementation has been tested, with all existing regression tests passing.
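As a rough sketch of what the option changes: without `samples`, the number of training vectors fed to k-means scales with `lists`; with it, the sample size is pinned. The `50 * lists` default below is an illustrative stand-in, not pgvector's exact heuristic, and `training_sample` is a hypothetical helper:

```python
import random

def training_sample(vectors, lists, samples=None):
    """Choose the vectors fed to k-means when building an IVFFlat index.

    Illustrative sketch only: `50 * lists` stands in for a
    list-dependent default, while `samples` models the proposed
    option that fixes the sample size regardless of lists.
    """
    target = samples if samples is not None else 50 * lists
    if len(vectors) <= target:
        return list(vectors)
    return random.sample(vectors, target)

rows = list(range(1_000_000))
print(len(training_sample(rows, lists=100)))                 # list-dependent: 50 * 100
print(len(training_sample(rows, lists=100, samples=20000)))  # fixed at 20,000
```

With a fixed `samples`, raising `lists` no longer shrinks the per-list training budget, which is the consistency the PR description is after.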
|
Hi @digoal, thanks for the PR. However, I'm not sure it's common enough to add. What's the motivation for making it configurable? Are you seeing significantly higher recall with more samples? |
|
Thanks for the feedback. I understand the concern about commonality. The motivation comes from large datasets: when lists is too small, the centroid calculation becomes less precise, leading to poor partitioning. You can reproduce this with the following test:

1. Generate 1,000 random base vectors.
2. For each base vector, create 999 variants by Gaussian perturbation (1M vectors in total).
3. Build an IVFFlat index and compare lists = 100 vs. lists = 1000.

In my tests, at probes = 1, recall with lists = 1000 is significantly higher than with lists = 100. This flexibility lets users optimize the Voronoi cells for specific data distributions. |
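The data generator described above can be sketched as follows (scaled down to keep it quick; the dimensionality, noise scale, and reduced counts are illustrative assumptions, not values from the PR):

```python
import random

DIM = 8         # vector dimensionality (illustrative assumption)
SIGMA = 0.05    # Gaussian perturbation scale (illustrative assumption)
N_BASE = 100    # scaled down from 1,000 base vectors
N_VARIANTS = 9  # scaled down from 999 variants per base vector

random.seed(0)
base = [[random.random() for _ in range(DIM)] for _ in range(N_BASE)]

dataset = []
for b in base:
    dataset.append(b)  # the base vector itself
    for _ in range(N_VARIANTS):
        # each variant is the base vector plus small Gaussian noise,
        # so the data forms tight clusters around the base vectors
        dataset.append([x + random.gauss(0.0, SIGMA) for x in b])

# at full scale this yields 1,000 * (1 + 999) = 1,000,000 vectors
print(len(dataset))
```

On data like this, lists = 1000 lets each natural cluster get roughly its own cell, while lists = 100 forces about ten clusters into each cell, which would explain the recall gap at probes = 1.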
|
Thanks. It seems like the situation above can be solved by increasing the number of lists rather than the number of samples. At this point, I don't think there's a strong enough case to make this configurable. |
|
Just for completeness, I had tested sampling several years ago to try to improve the performance/recall ratio, particularly on larger datasets, but I wasn't able to get much of a difference. |