fix: prevent concurrent merge_insert duplicate rows without PK metadata #6018
ozzieba wants to merge 4 commits into lance-format:main
Conversation
ACTION NEEDED: The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error, please inspect the "PR Title Check" action.
5fe1666 to b8a6088
@wjones127 @jackye1995 @yanghua does this make sense?
…ta (lance-format#4585)
Always include the bloom filter for inserted rows in the transaction, regardless of whether the schema has unenforced-primary-key metadata. Make conflict detection symmetric for asymmetric bloom filter pairs.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
b8a6088 to 062e8dc
Thanks for doing this! To enforce uniqueness of inserted rows, we need to make sure that if a user runs two merge-inserts with different keys, they are treated as incompatible, because their bloom filters are incompatible. That means we need to track which merge key is in use in the Transaction model: https://github.com/lance-format/lance/blob/main/protos/transaction.proto#L228
This is a change to the Lance specification, so I would suggest creating a dedicated PR for that change first, and it would require a quick community vote. See an example vote here: #5485
…ict detection
Add `merge_key_field_ids` to the Update operation in the transaction proto so conflict resolution can detect incompatible concurrent merge inserts.
- Always include bloom filter for inserted rows regardless of PK metadata
- Different merge keys (ON columns) are treated as conflicts
- Asymmetric bloom filter pairs (Some, None) are treated as conflicts
- Backward compatible: empty merge_key_field_ids for non-merge updates
Refs: lancedb/lancedb#2463, lance-format#4585, lance-format#6018
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add `repeated int32 merge_key_field_ids = 9` to the Update message in transaction.proto.
This field records which columns were used as the merge key (the ON columns) in a merge insert operation, enabling conflict resolution to detect incompatible concurrent merge inserts that use different merge keys.
Backward compatible: empty for non-merge-insert updates and older writers.
Refs: lancedb/lancedb#2463, lance-format#4585, lance-format#6018
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
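As a rough illustration of how the new field could feed into conflict resolution, here is a minimal Python sketch. It is not the Lance implementation, which is written in Rust. The function name `merge_keys_conflict` is hypothetical, and two behaviours are assumptions not stated in the commit message: an empty list is treated as "no merge key recorded" and is not flagged by this check, and the order of the ON columns is ignored.

```python
from typing import Sequence


def merge_keys_conflict(current_ids: Sequence[int],
                        committed_ids: Sequence[int]) -> bool:
    """Return True when two merge inserts used different ON columns.

    Hypothetical helper for illustration only.
    """
    # Assumption: an empty list means a non-merge update or an older writer,
    # so this particular check has nothing to compare and stays silent.
    if not current_ids or not committed_ids:
        return False
    # Assumption: column order in the ON clause is irrelevant, so compare
    # the field IDs as sets.
    return set(current_ids) != set(committed_ids)
```

For example, `merge_keys_conflict([1], [2])` reports a conflict, while `merge_keys_conflict([1, 2], [2, 1])` does not.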
Summary
Fixes the concurrent `merge_insert` duplicate row bug (#4585) by always including a bloom filter for inserted rows in the transaction, regardless of whether the schema has `unenforced-primary-key` metadata, and by making conflict detection symmetric for asymmetric bloom filter pairs.
Problem
When multiple concurrent workers call `merge_insert("id")` with overlapping key ranges, the second writer's commit succeeds silently even when it inserts keys that the first writer already inserted, producing duplicate rows. This happens because:
1. The bloom filter for inserted rows was only included in the transaction when the schema had `lance-schema:unenforced-primary-key` metadata on the merge column. Most users call `merge_insert("id")` without this metadata.
2. The conflict resolver only detected `(Some, None)` bloom filter asymmetry as a conflict. The `(None, Some)` case, where the committed transaction has a filter but the current one doesn't, fell through silently.
Bug reproduction
A self-contained Python script reproduces the bug against stock PyPI lancedb:
```bash
uv run --with lancedb --no-project test_concurrent_merge_insert_bug.py
```
Against lancedb 0.29.2:
Fix (2 changes)
1. `merge_insert/exec/write.rs`
Always include the bloom filter for inserted rows in the transaction, even without PK metadata.
2. `conflict_resolver.rs`
Treat `(None, Some)` bloom filter pair as a retryable conflict, symmetric with the existing `(Some, None)` handling.
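The symmetric check in change 2 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code in `conflict_resolver.rs`: `KeyFilter` and `filters_conflict` are hypothetical names, an exact `frozenset` stands in for the probabilistic bloom filter, and the handling of the `(None, None)` and `(Some, Some)` cases is assumed rather than taken from the PR. Pairs are written as `(current, committed)`.

```python
from typing import FrozenSet, Optional

# Hypothetical stand-in: a real bloom filter is probabilistic; an exact set
# of inserted keys keeps the sketch short and deterministic.
KeyFilter = FrozenSet[int]


def filters_conflict(current: Optional[KeyFilter],
                     committed: Optional[KeyFilter]) -> bool:
    """Return True when the current transaction should be retried."""
    if current is None and committed is None:
        # Assumption: neither side recorded inserted keys, so this check
        # has nothing to compare.
        return False
    if current is None or committed is None:
        # Covers both (Some, None) and (None, Some): one side's inserted
        # keys are unknown, so an overlap cannot be ruled out.
        return True
    # Assumption: with both filters present, conflict only if they may
    # share at least one inserted key.
    return not current.isdisjoint(committed)
```

Before the fix described above, only the `(Some, None)` pair reached the retry branch; the sketch treats both asymmetric pairs the same way.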
Tests