feat: add lance_dataset_update for predicate + expression-based row updates#33
Merged
jja725 merged 3 commits intolance-format:mainfrom Apr 30, 2026
Merged
Conversation
…pdates
Mirrors the lance_dataset_delete plumbing: with_mut + block_on, SAFETY-
commented FFI, in-place mutation of the same handle. Wraps Lance's
UpdateBuilder so callers can pass a SQL filter (or NULL for all rows)
plus parallel arrays of column names and SQL scalar expressions.
C++ wrapper Dataset::update takes a vector<pair<string,string>>; empty
predicate maps to NULL on the C side ("update all rows").
Closes lance-format#32.
Surfaced by a 10-round multi-persona review of de34092: * test_update_unknown_predicate_column_rejected pins the upstream classification for predicate-side unknown columns. Without it, only the set-side path (UpdateBuilder::set) was tested; predicate-side goes through update_where, a different upstream path. Both wrap parser errors as InvalidInput, but that wasn't empirically verified. * test_update_null_values_rejected mirrors test_update_null_columns_rejected for the symmetric NULL check. No behavior change.
- Drop the "(upstream UpdateBuilder wraps parser errors as invalid-input)" parenthetical from the user-facing C and Rust docs. It explained an internal routing detail that callers don't need to act on; the error table on its own is enough. - Replace "callers can't slip a no-op past validation" — empty SQL would parse-fail upstream, not become a no-op, so the framing was wrong. New comment just says callers who mean "all rows" go through NULL. - Trim the snapshot-clone comment from 5 lines to 3 — same intent, less prose. No behavior change.
Contributor
Author
|
I'll be on leave for the next few days. |
jja725
approved these changes
Apr 30, 2026
3 tasks
jja725
pushed a commit
that referenced
this pull request
May 8, 2026
## Summary Adds `lance_dataset_merge_insert`, the next predicate-driven mutation primitive after `_delete` (#31) and `_update` (#33). It exposes upstream's `MergeInsertBuilder` to C/C++ callers, covering the four common shapes of a SQL-style MERGE in one entry point: find-or-create, upsert, replace-region-of-data, and bulk delete by key. Behavior is controlled via a single `LanceMergeInsertParams` struct that maps to upstream's three orthogonal mode enums (`when_matched`, `when_not_matched`, `when_not_matched_by_source`). `params=NULL` selects the find-or-create defaults (`DoNothing` / `InsertAll` / `Keep`), and a zero-initialized struct picks the same defaults thanks to discriminant pinning. The two `*_IF` modes consume an SQL filter; expressions on the wrong mode (or empty strings) are rejected at the FFI boundary so the contract is unambiguous. The validation order matches `lance_dataset_write` for stream lifetime: the source stream is the only thing checked before `from_raw` consumes it, which keeps the documented "stream is consumed on every return path" guarantee. Dataset / key / params errors fire afterwards, followed by upstream's own schema-compatibility, parser, and commit-conflict diagnostics. The mutation itself follows the same snapshot-and-republish pattern that `_update` already uses, so existing scanners holding a clone of the inner `Arc` keep their pre-merge view. ## C++ surface `Dataset::merge_insert(on_columns, source, params=nullptr)` is the full surface. A `Dataset::upsert(on_columns, source)` convenience overload covers the most common case (`UpdateAll` + `InsertAll` + `Keep`) without needing to fill out a params struct. ## Test plan - [x] Rust integration tests cover all 4 SQL-MERGE shapes plus boundary rejections (NULL/empty args, `num_on_columns=0`, unknown key column, out-of-range mode discriminants for all three enums, missing/empty/extraneous expression strings, no-op configuration, schema mismatch, version bump, optional `out_result`, untouched-on-error semantics) — 28 tests. - [x] C and C++ smoke tests (`compile_and_run_test`) self-merge the dataset under defaults to validate the FFI plumbing without hand-building Arrow batches in C, plus the `num_on_columns=0` rejection path. - [x] `cargo fmt`, `cargo clippy --all-targets -- -D warnings`, and `cargo test` all clean locally. ## Out of scope The `MergeInsertBuilder` knobs not exposed yet (`conflict_retries`, `retry_timeout`, `skip_auto_cleanup`, `use_index`, `source_dedupe_behavior`, `commit_retries`, `mark_generations_as_merged`) and the `execute_uncommitted` / `explain_plan` / `analyze_plan` paths. The params struct can grow without breaking the ABI when there's a concrete need.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds
lance_dataset_update— the next predicate-driven mutation primitive after_delete. Callers pass an optional SQL filter (NULL = all rows) plus parallel arrays of column names and SQL value expressions. Samewith_mut+block_onshape as delete, same lock semantics for in-flight scanners.C++ wrapper:
Dataset::update(predicate, vector<pair<string,string>>) -> uint64_t. Emptypredicateroutes to NULL on the C side. Updated row count comes back via the optional out param (NULL discards it).Closes #32.
Error semantics
LANCE_ERR_INVALID_ARGUMENTnum_updates == 0, NULL/empty entries incolumns/values, or an explicit emptypredicate. Also malformed SQL filters/expressions and unknown column names — bothupdate_whereandsetupstream wrap parser errors asError::InvalidInput, so they surface asINVALID_ARGUMENT(notINTERNALlike delete). Tests pin both the predicate-side and set-side classifications.LANCE_ERR_COMMIT_CONFLICTDesign notes
out_num_updatedis left untouched on error — every error path returns before the write.test_update_out_param_untouched_on_errorlocks this in with a0xDEAD_BEEFsentinel against a boundary path (empty predicate) and an upstream-surfaced path (unknown column).predicate == NULLupdates every row; explicit empty string is rejected so callers go through NULL rather than a parse error.(column, value)pairs up front, before taking the write lock — bad input gets a precisecolumns[i]/values[i]error message instead of tying up the writer.lance_dataset_delete:unsafe fn,ffi_try!,with_mut(|d| block_on(...)), SAFETY comments.Arc::try_unwrap(...).unwrap_or_else(|arc| (*arc).clone())matches what upstreamDataset::deletedoes internally for publishing the new dataset back into*d.Test plan
cargo test— 132 passed, 0 failed. 17 new update tests:test_update_basic_predicate—WHERE id < 50 SET value = 99.0on 100 rows → 50 updated; reads back the matched and unmatched rows to confirm only the matched ones changed.test_update_null_predicate_updates_all—predicate == NULL,SET label = 'frozen'on 20 rows; every row reads back as'frozen'.test_update_multiple_columns—id = 7 SET value = value * 2, label = 'updated'; per-row expression evaluation.test_update_no_match_returns_zero— predicate matches nothing → count = 0, dataset unchanged.test_update_out_param_optional— NULLout_num_updatedsucceeds.test_update_out_param_untouched_on_error— sentinel preserved on every error path.test_update_bumps_version— version strictly increases on success.LANCE_ERR_INVALID_ARGUMENT:test_update_null_dataset_rejected,test_update_zero_num_updates_rejected,test_update_null_columns_rejected,test_update_null_values_rejected,test_update_empty_predicate_rejected,test_update_empty_column_entry_rejected,test_update_null_entry_in_columns_rejected.LANCE_ERR_INVALID_ARGUMENT:test_update_invalid_predicate_rejected,test_update_unknown_column_rejected(set-side viaUpdateBuilder::set),test_update_unknown_predicate_column_rejected(predicate-side viaUpdateBuilder::update_where— pinned separately because it's a different upstream path).cargo clippy --all-targets -- -D warningsclean.cargo fmt --checkclean.RUSTDOCFLAGS="-D warnings" cargo doc --no-depsclean.cargo test --test compile_and_run_test -- --ignored— C and C++ integration tests pass; both exercise the new wrapper end-to-end (SET name = 'frozen'against the write-roundtrip dataset, plus anum_updates == 0rejection).Out of scope
lance_dataset_merge_insert(upsert) is the natural follow-on — richer builder surface (when_matched/when_not_matched_insert/ etc.), so it deserves its own design pass, but it'll reuse the samewith_mut+block_onplumbing.