
fix: prevent concurrent merge_insert duplicate rows without PK metadata#6018

Open
ozzieba wants to merge 4 commits into lance-format:main from purpleplatform:fix/concurrent-merge-insert-duplicates

@ozzieba ozzieba commented Feb 25, 2026

Summary

Fixes the concurrent merge_insert duplicate row bug (#4585) by always including a bloom filter for inserted rows in the transaction, regardless of whether the schema has unenforced-primary-key metadata, and making conflict detection symmetric for asymmetric bloom filter pairs.

Problem

When multiple concurrent workers call merge_insert("id") with overlapping key ranges, the second writer's commit succeeds silently even when it inserts keys that the first writer already inserted, producing duplicate rows. This happens because:

  1. The bloom filter for inserted rows was only included in the transaction when the schema had lance-schema:unenforced-primary-key metadata on the merge column. Most users call merge_insert("id") without this metadata.

  2. The conflict resolver only detected (Some, None) bloom filter asymmetry as a conflict. The (None, Some) case, where the committed transaction has a filter but the current one does not, fell through silently.
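
The asymmetry in point 2 can be illustrated with a small model. This is a sketch only, not the code in `conflict_resolver.rs`: a plain Python `set` stands in for the bloom filter, `None` stands in for a missing filter, and the function names are hypothetical. It also assumes that two present filters conflict when they share a key and that two missing filters do not conflict, neither of which is spelled out above.

```python
from typing import Optional, Set

# Stand-in for a bloom filter: an exact set of inserted keys (assumption).
KeyFilter = Set[int]


def conflicts_before_fix(current: Optional[KeyFilter],
                         committed: Optional[KeyFilter]) -> bool:
    """Model of the old behavior: only the (Some, None) asymmetry is flagged."""
    if current is not None and committed is not None:
        return bool(current & committed)  # both present: compare keys
    if current is not None and committed is None:
        return True  # (Some, None): cannot prove the inserts are disjoint
    return False  # (None, Some) and (None, None) fall through silently


def conflicts_after_fix(current: Optional[KeyFilter],
                        committed: Optional[KeyFilter]) -> bool:
    """Model of the symmetric behavior: any asymmetric pair is a conflict."""
    if current is not None and committed is not None:
        return bool(current & committed)
    if current is None and committed is None:
        return False  # nothing to compare (assumed non-conflicting here)
    return True  # (Some, None) or (None, Some)
```

In this model `conflicts_before_fix(None, {1})` returns `False` while `conflicts_after_fix(None, {1})` returns `True`, which is the case that previously let the second writer commit.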

Bug reproduction

A self-contained Python script reproduces the bug against stock PyPI lancedb:

```bash
uv run --with lancedb --no-project test_concurrent_merge_insert_bug.py
```

Against lancedb 0.29.2:

| Test | Expected | Actual | Duplicates |
| --- | --- | --- | --- |
| 5 workers, overlapping keys (0..19) | 21 rows | 101 rows | 80 |
| 8 workers, identical data (100 rows) | 101 rows | 801 rows | 700 |
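
The counts above are consistent with every worker inserting its full key set on top of a single pre-existing row. The simulation below is a rough sketch of that arithmetic, not the reproducer script: it assumes one seed row with a key outside the merged range, assumes every worker plans against the same initial snapshot, and uses exact sets instead of bloom filters. The `simulate` function and its parameters are hypothetical.

```python
def simulate(workers: int, keys_per_worker: int, detect_conflicts: bool) -> int:
    """Return the final row count after all workers commit in sequence."""
    table = [-1]           # assumed seed row, key outside the merged range
    snapshot = set(table)  # the stale view every worker planned against
    for _ in range(workers):
        keys = range(keys_per_worker)
        # Each worker plans inserts for keys missing from its stale snapshot.
        planned = [k for k in keys if k not in snapshot]
        committed = set(table)
        if detect_conflicts and any(k in committed for k in planned):
            # Retryable conflict: re-plan against the latest committed state.
            planned = [k for k in keys if k not in committed]
        table.extend(planned)
    return len(table)


print(simulate(5, 20, detect_conflicts=False))   # 101
print(simulate(5, 20, detect_conflicts=True))    # 21
print(simulate(8, 100, detect_conflicts=False))  # 801
print(simulate(8, 100, detect_conflicts=True))   # 101
```

Without conflict detection each worker adds all of its keys, giving 1 + 5 × 20 = 101 and 1 + 8 × 100 = 801 rows. With detection and retry, only the first commit inserts anything.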

Fix (2 changes)

1. `merge_insert/exec/write.rs`

Always include the bloom filter for inserted rows in the transaction, even without PK metadata.

2. `conflict_resolver.rs`

Treat `(None, Some)` bloom filter pair as a retryable conflict, symmetric with the existing `(Some, None)` handling.

Tests

  • Rust regression test: 5 concurrent workers with overlapping keys, no PK metadata
  • Rust unit tests: conflict detection for all 4 bloom filter pair combinations
  • Rust unit tests: concurrent insert same/different keys with/without PK metadata
  • Rust proptests: parameterized over dataset size, key values, PK metadata presence
  • Python reproducer script: demonstrates bug against stock PyPI lancedb

@github-actions github-actions bot added the bug Something isn't working label Feb 25, 2026
@ozzieba ozzieba changed the title fix: concurrent merge_insert duplicate row prevention via bloom filter WIP, AI-gen code; fix: concurrent merge_insert duplicate row prevention via bloom filter Feb 25, 2026
@github-actions
Contributor

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@ozzieba ozzieba force-pushed the fix/concurrent-merge-insert-duplicates branch from 5fe1666 to b8a6088 Compare February 25, 2026 22:23
@ozzieba ozzieba changed the title WIP, AI-gen code; fix: concurrent merge_insert duplicate row prevention via bloom filter AI-gen code; fix: concurrent merge_insert duplicate row prevention via bloom filter Feb 25, 2026
@ozzieba ozzieba changed the title AI-gen code; fix: concurrent merge_insert duplicate row prevention via bloom filter fix: prevent concurrent merge_insert duplicates without PK metadata Feb 25, 2026
@ozzieba
Author

ozzieba commented Feb 25, 2026

@wjones127 @jackye1995 @yanghua
Disclaimer: AI generated code

does this make sense?

…ta (lance-format#4585)

Always include the bloom filter for inserted rows in the transaction,
regardless of whether the schema has unenforced-primary-key metadata.
Make conflict detection symmetric for asymmetric bloom filter pairs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ozzieba ozzieba changed the title fix: prevent concurrent merge_insert duplicates without PK metadata fix: prevent concurrent merge_insert duplicate rows without PK metadata Feb 25, 2026
@ozzieba ozzieba force-pushed the fix/concurrent-merge-insert-duplicates branch from b8a6088 to 062e8dc Compare February 25, 2026 23:12
@jackye1995
Contributor

jackye1995 commented Feb 27, 2026

Thanks for doing this! To enforce uniqueness of inserted rows, we need to make sure that if a user runs two merge-inserts that use different keys, they are treated as incompatible, because their bloom filters are incompatible. That means we need to track which merge key is used in the Transaction model: https://github.com/lance-format/lance/blob/main/protos/transaction.proto#L228

This is a change of the Lance specification, so I would suggest creating a dedicated PR for that change first, and it would require a quick community vote. See example vote here: #5485

ozzieba added a commit to purpleplatform/lance that referenced this pull request Feb 27, 2026
…ict detection

Add `merge_key_field_ids` to the Update operation in the transaction proto
so conflict resolution can detect incompatible concurrent merge inserts.

- Always include bloom filter for inserted rows regardless of PK metadata
- Different merge keys (ON columns) are treated as conflicts
- Asymmetric bloom filter pairs (Some, None) are treated as conflicts
- Backward compatible: empty merge_key_field_ids for non-merge updates
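
The rules in this list could be modeled roughly as follows. This is an illustrative sketch, not the Lance implementation: `UpdateOp` and `is_conflict` are hypothetical names, a `frozenset` stands in for the bloom filter, and comparing the field id lists for exact equality (including order, and including empty versus non-empty) is an assumption.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Optional


@dataclass
class UpdateOp:
    """Hypothetical stand-in for the Update operation in a transaction."""
    # Field ids of the ON columns; empty for non-merge updates and old writers.
    merge_key_field_ids: List[int] = field(default_factory=list)
    # Stand-in for the inserted-rows bloom filter; None when absent.
    inserted_keys: Optional[FrozenSet[int]] = None


def is_conflict(current: UpdateOp, committed: UpdateOp) -> bool:
    # Different ON columns: the filters describe different columns,
    # so their contents cannot be compared meaningfully.
    if current.merge_key_field_ids != committed.merge_key_field_ids:
        return True
    a, b = current.inserted_keys, committed.inserted_keys
    if a is None and b is None:
        return False  # no inserted-row information on either side
    if a is None or b is None:
        return True   # asymmetric pair: treat as a conflict
    return bool(a & b)  # same key columns: conflict on overlapping keys


by_id = UpdateOp([0], frozenset({1, 2}))
by_other_column = UpdateOp([1], frozenset({9}))
print(is_conflict(by_id, by_other_column))  # True: different merge keys
```

Two merge inserts on the same key columns with disjoint keys still commit without conflict in this model; only mismatched key columns, a missing filter on one side, or overlapping keys force a retry.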

Refs: lancedb/lancedb#2463, lance-format#4585, lance-format#6018

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ozzieba added a commit to purpleplatform/lance that referenced this pull request Feb 27, 2026
Add `repeated int32 merge_key_field_ids = 9` to the Update message in
transaction.proto. This field records which columns were used as the
merge key (the ON columns) in a merge insert operation, enabling
conflict resolution to detect incompatible concurrent merge inserts
that use different merge keys.

Backward compatible: empty for non-merge-insert updates and older
writers.

Refs: lancedb/lancedb#2463, lance-format#4585, lance-format#6018

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>