Skip to content

fix(mem_wal): exact PK dedup for LSM vector search and point lookup#6856

Closed
jackye1995 wants to merge 5 commits into
lance-format:mainfrom
jackye1995:jack/fix-lsm-pk-dedupe
Closed

fix(mem_wal): exact PK dedup for LSM vector search and point lookup#6856
jackye1995 wants to merge 5 commits into
lance-format:mainfrom
jackye1995:jack/fix-lsm-pk-dedupe

Conversation

@jackye1995

@jackye1995 jackye1995 commented May 20, 2026

Copy link
Copy Markdown
Contributor

Duplicate primary keys written into one memtable or across generations leaked through as distinct rows in vector search and point lookup.

Exact PK dedup — replaces the bloom-filter-based FilterStaleExec with an exact pipeline:

  • LsmSourceTagExec tags each row with (_memtable_gen, _freshness)
  • LsmGlobalPkDedupExec does single-pass cross-source PK dedup keeping the freshest row per PK
  • Point lookup adds WithinSourceDedupExec(KeepMaxFreshness) on the active arm

Removes FilterStaleExec, GenerationBloomFilter, and the bloom-filter building from transaction commit.

Post-rerank TakeExec — the planner now accepts a base Dataset reference. After global PK dedup + sort + top-k, a TakeExec materializes any user-projected columns not in the per-source KNN output by fetching from the base dataset via _rowid.

Refine factor plumbingplan_search() now accepts refine_factor so callers can enable base-table refine (over-fetch k * factor candidates, re-rank with exact distances). Memtable arms use exact HNSW and are unaffected. Exposed in both Python and Java bindings.

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@github-actions github-actions Bot added the bug Something isn't working label May 20, 2026
…point lookup

Same primary key written multiple times into one memtable, or into both a
memtable and an older generation, used to leak through to the user as
distinct rows. Two paths were affected:

- Vector search: HNSW indexes every insert as its own graph node, so KNN
  could return both V1 and V2 of the same PK from a single source.
- Point lookup (active arm): `FilterExec + LIMIT 1` over an insert-ordered
  scan returned the oldest match among duplicates.

Vector search now runs each per-source KNN through `LsmSourceTagExec`,
which appends `(_memtable_gen, _freshness)`. A single `LsmGlobalPkDedupExec`
over the union picks the row with the largest tuple per PK — newer
generations win, ties fall to the normalized within-source order (larger
`_rowid` for the active arm; flipped via `u64::MAX - _rowid` for flushed
arms to compensate for the reverse-write convention). This replaces the
older two-step `WithinSourceDedupExec` + bloom-based `FilterStaleExec`
design and is exact (no false-positive recall loss, no top-k under-fill,
no missing-bloom footgun).

Point lookup keeps a `WithinSourceDedupExec(KeepMaxFreshness)` on the
active arm only; `CoalesceFirstExec` already short-circuits cross-source
selection so global dedup would conflict. Flushed and base arms still
rely on `LIMIT 1` under the reverse-write / forward-write conventions
respectively.

Removed: `FilterStaleExec`, `GenerationBloomFilter`, and the
`LsmVectorSearchPlanner::with_bloom_filter[s]` API — no longer needed
now that dedup is exact.

New tests pin: duplicate PK within one active memtable (both planners),
duplicate PK across generations (vector search), and the partition
coalesce ahead of `LsmGlobalPkDedupExec` that keeps active-memtable
rows from being silently dropped.
@jackye1995 jackye1995 force-pushed the jack/fix-lsm-pk-dedupe branch from 33cf21c to a6e7d82 Compare May 20, 2026 05:35
…gh LSM vector search

The LSM vector search planner now accepts a base dataset reference and
appends a TakeExec after the global PK dedup + sort. This allows the
final top-k rows to materialize any user-projected columns that were
not part of the per-source KNN output, fetching from the base dataset
by _rowid.

Also plumbs refine_factor as a parameter on plan_search() so callers
can enable base-table refine (over-fetch k*factor candidates, re-rank
with exact distances). Memtable arms use exact HNSW and are unaffected.

Both Python and Java bindings are updated with the new parameter.
@github-actions github-actions Bot added A-python Python bindings A-java Java bindings + JNI labels May 20, 2026
@jackye1995 jackye1995 changed the title fix(mem_wal): dedupe duplicate primary keys in LSM vector search and point lookup fix(mem_wal): exact PK dedup for LSM vector search and point lookup May 20, 2026
@codecov

codecov Bot commented May 20, 2026

Copy link
Copy Markdown

…point-lookup benchmarks

Remove the duplicated criterion-based mem_wal_read.rs and mem_wal_vector.rs
benchmarks. Replace with two standalone CLI benchmarks that produce JSON
output for panel-style trend analysis:

- mem_wal_vector_bench: KNN search across LSM levels using real 384-dim
  embeddings from lance-format/fineweb-edu, IVF-RQ base table index,
  recall verification against brute-force ground truth.

- mem_wal_point_lookup_bench: PK-based point lookups across base table,
  flushed generations, and active memtable.

Both accept --flushed-generations (0/1/2) and --max-memtable-rows
(100k/500k/1M) for sweeping the full matrix. Results are written as
individual JSON files for aggregation.
The hf:// URI scheme for accessing lance-format/fineweb-edu requires
network access that may not be available or reliable on all environments.
Switch to deterministic synthetic 384-dim embeddings using the same
cluster+noise scheme as mem_wal_hnsw_bench.rs. This makes the bench
self-contained with no external dependencies.

@hamersaw hamersaw left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC this algorithm is basically parallelize top-K from each source and then dedup based on freshness. When exploring this elsewhere we noted the bug where a high ranking row is bumped out of the top-K by an un-fresh row and ends up with incorrect results. For a concrete example:

active memtable
    PK=0, _distance=10
    PK=1, _distance=11
    PK=2, _distance=12
flushed memtable (l0 cache) gen 0
    PK=0, _distance=1
    PK=3, _distance=2
    PK=4, _distance=3

If we we a top-K of 2 and dedup on this we have active memtable returning PK 0, 1 and flushed memtable returning PK 0, 3. the dedup results in keys 0, 3 returned. When really, it should be 3, 4 -- both from flushed memtable.

The various mitigations we discussed were:
(1) overfetching each source: we can have some bound that overfetches - don't love it.
(2) incremental refill: if a lower tier source has keys that are update we re-query it to ensure top-K non-updated keys - don't love it.

TBH I thought the bloom filter approach was reasonable to stop duplicates between sources. My thought was that within a source can we simply add a deletion vector to the flushed memtable (l0 cache) on write so that the read tooling automatically removes duplicates without having to rebuild indexes on write.

@jackye1995

Copy link
Copy Markdown
Contributor Author

Good catch — this is a real limitation of per-source KNN with post-merge dedup. I looked at how other systems handle this:

Existing approaches:

System Architecture Solution
Elasticsearch/Lucene Per-segment HNSW Live docs bitmap passed into KNN search (stale docs skipped during traversal) + 2*k overfetch per segment
Milvus Per-segment KNN Deletion bitmap applied before per-segment KNN
Qdrant Per-segment HNSW Deletion bitmap during search + version-based dedup
Weaviate, Vespa Single unified HNSW Problem doesn't exist — graph mutated on every write

All production systems either use a unified index (heavy write cost) or push deletion state to old segments at write time (light write cost, but breaks pure LSM). None do pure read-time resolution.

Two mitigations we should support:

(1) Overfetch (simple, good for latency-sensitive cases)

Same approach as Elasticsearch: each source fetches k * overfetch_factor candidates instead of k. After cross-source dedup, take global top-k. The refine_factor plumbing already exists on plan_search() and could serve double duty here, or we add a separate overfetch_factor parameter. A default of 2x (matching ES) handles most cases. This is a heuristic — not a guarantee — but it's cheap and parallel.

(2) Pre-search bloom filter exclusion node (most accurate)

This is a new idea that's strictly better than the old FilterStaleExec. The old node was a post-search filter — it sat after the union and filtered stale rows from the already-merged per-source top-k results. By that point, displacement had already happened (the stale row had already knocked a valid candidate out of a source's top-k).

The new approach flips the bloom filter to pre-search: before each source runs its KNN, the bloom filters from all newer generations are combined into an exclusion set and passed into the KNN search. During graph traversal (HNSW or IVF), any candidate whose PK matches the exclusion bloom filter is skipped — it never occupies a top-k slot, so no displacement occurs.

Key properties:

  • Parallel: bloom filters are precomputed per-generation metadata (built at flush time, stored in manifest), so all sources can be searched concurrently — no sequential dependency
  • No write-path mutation: pure read-time, preserves our LSM property of never touching old generations
  • More accurate than old FilterStaleExec: old node filtered post-merge (displacement already happened), new node filters pre-search (displacement prevented)
  • Same bloom filter tradeoff: ~0.1% false positive rate may skip a small number of valid candidates, same as before

The implementation would pass a bloom-filter-backed predicate into each source's fast_search() as a row-level exclusion filter during index traversal.

@hamersaw

Copy link
Copy Markdown
Contributor

Lots of great information here, thanks for doing the dive. My thoughts:

  • write path mutation is obviously a non-solution.
  • using a bloom filter for pre-search would need to modify all of our search algorithms ... which is something we don't want.
  • if we go with overfetching what happens if we still shadow some valid PKs? Do we just continue to fetch until it works?

@jackye1995

Copy link
Copy Markdown
Contributor Author

MemWAL LSM Vector Search & Point Lookup Benchmark Results

Benchmarked the new exact PK dedup pipeline across LSM levels with two storage backends (S3 and local NVMe) and varying memtable configurations.

Setup: 1M-row base table with IVF-RQ index (256 partitions, 8-bit), 384-dim vectors. Each run has 1 active memtable at 50% capacity plus 0/1/2 flushed generations on disk.

Vector Search (k=10, nprobes=20)

Local NVMe (m6id.8xlarge)

Memtable Size Flushed Layers p50 (ms) p95 (ms) QPS
baseline - 7.6 9.1 129
100k 0 8.1 9.8 121
100k 1 8.1 9.9 122
100k 2 7.9 9.9 124
500k 0 8.1 9.8 123
500k 1 7.9 9.4 124
500k 2 7.6 9.3 129
1M 0 7.9 9.5 124
1M 1 8.2 12.4 115
1M 2 8.6 11.2 116

S3 (m7i.12xlarge)

Memtable Size Flushed Layers p50 (ms) p95 (ms) QPS
baseline - 79 211 9.5
100k 0 149 330 6.1
100k 1 101 255 8.7
100k 2 125 272 7.6
500k 0 107 236 8.5
500k 1 166 592 4.6
500k 2 142 482 5.1
1M 0 149 348 6.1
1M 1 122 367 6.8
1M 2 122 371 6.7

Observations: On NVMe, latency is stable (~8ms p50) regardless of memtable count or flushed layers — the LSM merge adds negligible overhead on local disk. On S3, latency is 10-20x higher (80-166ms p50) dominated by network round-trips, with tail latency up to 592ms when flushed layers are added.

Point Lookup

Local NVMe

MT Size Flushed Base p50 (ms) Flushed p50 (ms) Active p50 (ms)
100k 0 2.6 - 2.5
100k 2 3.7 1.4 1.5
500k 0 3.4 - 0.9
500k 2 5.1 2.8 2.9
1M 0 4.2 - 1.6
1M 2 7.2 4.7 4.6

S3

MT Size Flushed Base p50 (ms) Flushed p50 (ms) Active p50 (ms)
100k 0 66 - 38
100k 2 264 257 250
500k 0 65 - 0.7
500k 2 274 290 274
1M 0 69 - 1.2
1M 2 297 319 311

Observations: Active memtable lookups are fastest when data is warm in memory (sub-1ms on NVMe, sub-2ms on S3 with large memtables). On NVMe, flushed layer reads are fast (1.4-4.7ms) since flushed datasets are small. On S3, adding flushed layers causes a dramatic slowdown — from ~66ms to ~264-319ms (4-5x) — because each flushed generation is probed over the network.

Follow-ups

1. Recall verification with real embeddings

These benchmarks used synthetic clustered embeddings to isolate performance characteristics. To get meaningful recall@k measurements, we need to run against a real embedding dataset. The key question is whether the exact PK dedup pipeline preserves recall compared to a single-table baseline when memtable HNSW results are merged with IVF-RQ base results across generations.

2. Parallel per-generation probing

The current LSM scan probes flushed generations sequentially. This is the primary reason S3 point lookup latency jumps 4-5x when flushed layers are added — each generation requires its own network round-trip, and the cost is additive. When the key lives in the base table (the common case), every flushed generation must be probed and miss before the base is reached. Parallelizing probes (fire all generations concurrently, cancel on first hit for point lookup, merge for vector search) would reduce latency from sum(per_gen_latency) to max(per_gen_latency). The vector search UnionExec already supports parallel partitions, so the main work is in the point lookup path.

touch-of-grey added a commit to touch-of-grey/lance that referenced this pull request May 21, 2026
A primary key written multiple times into one active memtable used to
leak through to the user as distinct rows: FilterExec + LIMIT 1 over an
insert-ordered scan returned the oldest match among duplicates.

The active arm now runs a WithinSourceDedupExec(KeepMaxFreshness) that
collapses by PK, keeping the freshest row. Flushed and base arms still
rely on LIMIT 1 under the reverse-write / forward-write conventions.

Split from lance-format#6856 (lance-format#6856).

Co-Authored-By: Jack Ye <yezhaoqin@gmail.com>
touch-of-grey added a commit to touch-of-grey/lance that referenced this pull request May 21, 2026
A primary key written multiple times into one memtable, or into both a
memtable and an older generation, used to leak through as distinct rows:
HNSW indexes every insert as its own graph node, so KNN could return both
V1 and V2 of the same PK from a single source.

Each per-source KNN now runs through LsmSourceTagExec, which appends
(_memtable_gen, _freshness). A single LsmGlobalPkDedupExec over the union
keeps the row with the largest tuple per PK — newer generations win, ties
fall to the normalized within-source order. This replaces the bloom-based
FilterStaleExec design and is exact (no false-positive recall loss, no
top-k under-fill).

After global dedup + sort + top-k, a TakeExec materializes any
user-projected columns not in the per-source KNN output by fetching from
the base dataset via _rowid. plan_search() also accepts refine_factor so
callers can enable base-table refine. Exposed in Python and Java bindings.

Removes FilterStaleExec and GenerationBloomFilter.

Split from lance-format#6856 (lance-format#6856).

Co-Authored-By: Jack Ye <yezhaoqin@gmail.com>
@touch-of-grey

Copy link
Copy Markdown
Contributor

As discussed offline, I'm splitting this up into focused PRs so we can push the progress forward quickly. I need the benchmark template from this PR to reuse for FTS benchmarking, so I'd like to land the pieces incrementally:

Each split PR carries the relevant commits with Co-Authored-By: @jackye1995 and links back here.

jackye1995 added a commit that referenced this pull request May 21, 2026
Split from #6856 — vector-search portion.

A primary key written multiple times into one memtable, or into both a
memtable and an older generation, used to leak through as distinct rows:
HNSW indexes every insert as its own graph node, so KNN could return
both V1 and V2 of the same PK from a single source.

Each per-source KNN now runs through `LsmSourceTagExec`, which appends
`(_memtable_gen, _freshness)`. A single `LsmGlobalPkDedupExec` over the
union keeps the row with the largest tuple per PK — newer generations
win, ties fall to the normalized within-source order. This replaces the
bloom-based `FilterStaleExec` design and is exact (no false-positive
recall loss, no top-k under-fill).

After global dedup + sort + top-k, a `TakeExec` materializes any
user-projected columns not in the per-source KNN output by fetching from
the base dataset via `_rowid`. `plan_search()` also accepts
`refine_factor` so callers can enable base-table refine. Exposed in the
Python and Java bindings.

Removes `FilterStaleExec` and `GenerationBloomFilter`.

Part of splitting #6856 into focused PRs. Co-authored with @jackye1995.

Co-authored-by: Jack Ye <yezhaoqin@gmail.com>
touch-of-grey added a commit to touch-of-grey/lance that referenced this pull request May 21, 2026
A primary key written multiple times into one active memtable used to
leak through to the user as distinct rows: FilterExec + LIMIT 1 over an
insert-ordered scan returned the oldest match among duplicates.

The active arm now runs a WithinSourceDedupExec(KeepMaxRowAddr) that
collapses by PK, keeping the row with the newest row address. Flushed and
base arms still rely on LIMIT 1 under the reverse-write / forward-write
conventions.

PK resolution and row hashing are shared via a new exec::pk module so
WithinSourceDedupExec and LsmGlobalPkDedupExec resolve and hash a key
identically.

Split from lance-format#6856 (lance-format#6856).

Co-Authored-By: Jack Ye <yezhaoqin@gmail.com>
touch-of-grey added a commit to touch-of-grey/lance that referenced this pull request May 21, 2026
A primary key written multiple times into one active memtable used to
leak through to the user as distinct rows: FilterExec + LIMIT 1 over an
insert-ordered scan returned the oldest match among duplicates.

The active arm now runs a WithinSourceDedupExec(KeepMaxRowAddr) that
collapses by PK, keeping the row with the newest row address. Flushed and
base arms still rely on LIMIT 1 under the reverse-write / forward-write
conventions.

PK resolution and row hashing are shared via a new exec::pk module so
WithinSourceDedupExec and LsmGlobalPkDedupExec resolve and hash a key
identically.

Split from lance-format#6856 (lance-format#6856).

Co-Authored-By: Jack Ye <yezhaoqin@gmail.com>
jackye1995 added a commit that referenced this pull request May 21, 2026
Split from #6856 — point-lookup portion.

A primary key written multiple times into one active memtable used to
leak through to the user as distinct rows: `FilterExec + LIMIT 1` over
an insert-ordered scan returned the *oldest* match among duplicates.

The active arm now runs `WithinSourceDedupExec(KeepMaxFreshness)`, which
collapses by PK and keeps the freshest row. Flushed and base arms still
rely on `LIMIT 1` under the reverse-write / forward-write conventions.

Part of splitting #6856 into focused PRs. Co-authored with @jackye1995.

Co-authored-by: Jack Ye <yezhaoqin@gmail.com>
@hamersaw

Copy link
Copy Markdown
Contributor

Thinking about this the distributed top-K could have a stale reads problem as well. Where we update a row that doesn't fit the filter, return the old row, and it's not deduped because the new one isn't returned in an upstream set. This is a tough problem.

jackye1995 added a commit that referenced this pull request May 22, 2026
…rks (#6882)

Final piece of the #6856 split. Replaces the duplicated criterion-based
`mem_wal_read.rs` and `mem_wal_vector.rs` benchmarks with two standalone
CLI benchmarks that emit JSON output for panel-style trend analysis:

- `mem_wal_vector_bench`: KNN search across LSM levels with
deterministic synthetic 384-dim embeddings, IVF-RQ base table index, and
recall verification against brute-force ground truth.
- `mem_wal_point_lookup_bench`: PK-based point lookups across the base
table, flushed generations, and active memtable.

Both accept `--flushed-generations` and `--max-memtable-rows` for
sweeping the full matrix; results are written as individual JSON files.

This is the reusable bench template I want to extend for FTS
benchmarking. Depends on the LSM vector-search API (`with_dataset` /
`refine_factor`) landed in #6881, so it's the last of the three split
PRs.

Part of splitting #6856 into focused PRs. Co-authored with @jackye1995.

---------

Co-authored-by: Jack Ye <yezhaoqin@gmail.com>
@jackye1995 jackye1995 closed this May 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-java Java bindings + JNI A-python Python bindings bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants