Skip to content

feat: add lance_dataset_update for predicate + expression-based row updates#33

Merged
jja725 merged 3 commits intolance-format:mainfrom
LuciferYang:feat/dataset-update
Apr 30, 2026
Merged

feat: add lance_dataset_update for predicate + expression-based row updates#33
jja725 merged 3 commits intolance-format:mainfrom
LuciferYang:feat/dataset-update

Conversation

@LuciferYang
Copy link
Copy Markdown
Contributor

@LuciferYang LuciferYang commented Apr 30, 2026

Summary

Adds lance_dataset_update — the next predicate-driven mutation primitive after _delete. Callers pass an optional SQL filter (NULL = all rows) plus parallel arrays of column names and SQL value expressions. Same with_mut + block_on shape as delete, same lock semantics for in-flight scanners.

C++ wrapper: Dataset::update(predicate, vector<pair<string,string>>) -> uint64_t. Empty predicate routes to NULL on the C side. Updated row count comes back via the optional out param (NULL discards it).

Closes #32.

Error semantics

Code When
LANCE_ERR_INVALID_ARGUMENT NULL dataset, num_updates == 0, NULL/empty entries in columns/values, or an explicit empty predicate. Also malformed SQL filters/expressions and unknown column names — both update_where and set upstream wrap parser errors as Error::InvalidInput, so they surface as INVALID_ARGUMENT (not INTERNAL like delete). Tests pin both the predicate-side and set-side classifications.
LANCE_ERR_COMMIT_CONFLICT Concurrent writer landed a manifest first.

Design notes

  • out_num_updated is left untouched on error — every error path returns before the write. test_update_out_param_untouched_on_error locks this in with a 0xDEAD_BEEF sentinel against a boundary path (empty predicate) and an upstream-surfaced path (unknown column).
  • predicate == NULL updates every row; explicit empty string is rejected so callers go through NULL rather than a parse error.
  • The validation loop materializes owned (column, value) pairs up front, before taking the write lock — bad input gets a precise columns[i]/values[i] error message instead of tying up the writer.
  • Mirrors lance_dataset_delete: unsafe fn, ffi_try!, with_mut(|d| block_on(...)), SAFETY comments. Arc::try_unwrap(...).unwrap_or_else(|arc| (*arc).clone()) matches what upstream Dataset::delete does internally for publishing the new dataset back into *d.

Test plan

  • cargo test132 passed, 0 failed. 17 new update tests:
    • test_update_basic_predicateWHERE id < 50 SET value = 99.0 on 100 rows → 50 updated; reads back the matched and unmatched rows to confirm only the matched ones changed.
    • test_update_null_predicate_updates_allpredicate == NULL, SET label = 'frozen' on 20 rows; every row reads back as 'frozen'.
    • test_update_multiple_columnsid = 7 SET value = value * 2, label = 'updated'; per-row expression evaluation.
    • test_update_no_match_returns_zero — predicate matches nothing → count = 0, dataset unchanged.
    • test_update_out_param_optional — NULL out_num_updated succeeds.
    • test_update_out_param_untouched_on_error — sentinel preserved on every error path.
    • test_update_bumps_version — version strictly increases on success.
    • Boundary validation, LANCE_ERR_INVALID_ARGUMENT: test_update_null_dataset_rejected, test_update_zero_num_updates_rejected, test_update_null_columns_rejected, test_update_null_values_rejected, test_update_empty_predicate_rejected, test_update_empty_column_entry_rejected, test_update_null_entry_in_columns_rejected.
    • Upstream-surfaced, LANCE_ERR_INVALID_ARGUMENT: test_update_invalid_predicate_rejected, test_update_unknown_column_rejected (set-side via UpdateBuilder::set), test_update_unknown_predicate_column_rejected (predicate-side via UpdateBuilder::update_where — pinned separately because it's a different upstream path).
  • cargo clippy --all-targets -- -D warnings clean.
  • cargo fmt --check clean.
  • RUSTDOCFLAGS="-D warnings" cargo doc --no-deps clean.
  • cargo test --test compile_and_run_test -- --ignored — C and C++ integration tests pass; both exercise the new wrapper end-to-end (SET name = 'frozen' against the write-roundtrip dataset, plus a num_updates == 0 rejection).

Out of scope

lance_dataset_merge_insert (upsert) is the natural follow-on — richer builder surface (when_matched / when_not_matched_insert / etc.), so it deserves its own design pass, but it'll reuse the same with_mut + block_on plumbing.

…pdates

Mirrors the lance_dataset_delete plumbing: with_mut + block_on, SAFETY-
commented FFI, in-place mutation of the same handle. Wraps Lance's
UpdateBuilder so callers can pass a SQL filter (or NULL for all rows)
plus parallel arrays of column names and SQL scalar expressions.

C++ wrapper Dataset::update takes a vector<pair<string,string>>; empty
predicate maps to NULL on the C side ("update all rows").

Closes lance-format#32.
@LuciferYang LuciferYang marked this pull request as draft April 30, 2026 04:05
@jja725 jja725 self-requested a review April 30, 2026 04:42
Surfaced by a 10-round multi-persona review of de34092:

* test_update_unknown_predicate_column_rejected pins the upstream
  classification for predicate-side unknown columns. Without it, only
  the set-side path (UpdateBuilder::set) was tested; predicate-side
  goes through update_where, a different upstream path. Both wrap
  parser errors as InvalidInput, but that wasn't empirically verified.
* test_update_null_values_rejected mirrors test_update_null_columns_rejected
  for the symmetric NULL check.

No behavior change.
- Drop the "(upstream UpdateBuilder wraps parser errors as invalid-input)"
  parenthetical from the user-facing C and Rust docs. It explained an
  internal routing detail that callers don't need to act on; the error
  table on its own is enough.
- Replace "callers can't slip a no-op past validation" — empty SQL would
  parse-fail upstream, not become a no-op, so the framing was wrong. New
  comment just says callers who mean "all rows" go through NULL.
- Trim the snapshot-clone comment from 5 lines to 3 — same intent, less
  prose.

No behavior change.
@LuciferYang LuciferYang marked this pull request as ready for review April 30, 2026 05:28
@LuciferYang
Copy link
Copy Markdown
Contributor Author

LuciferYang commented Apr 30, 2026

I'll be on leave for the next few days.

@jja725 jja725 merged commit 3e8b448 into lance-format:main Apr 30, 2026
9 checks passed
@LuciferYang LuciferYang deleted the feat/dataset-update branch May 7, 2026 03:27
jja725 pushed a commit that referenced this pull request May 8, 2026
## Summary

Adds `lance_dataset_merge_insert`, the next predicate-driven mutation
primitive after `_delete` (#31) and `_update` (#33). It exposes
upstream's `MergeInsertBuilder` to C/C++ callers, covering the four
common shapes of a SQL-style MERGE in one entry point: find-or-create,
upsert, replace-region-of-data, and bulk delete by key.

Behavior is controlled via a single `LanceMergeInsertParams` struct that
maps to upstream's three orthogonal mode enums (`when_matched`,
`when_not_matched`, `when_not_matched_by_source`). `params=NULL` selects
the find-or-create defaults (`DoNothing` / `InsertAll` / `Keep`), and a
zero-initialized struct picks the same defaults thanks to discriminant
pinning. The two `*_IF` modes consume an SQL filter; expressions on the
wrong mode (or empty strings) are rejected at the FFI boundary so the
contract is unambiguous.

The validation order matches `lance_dataset_write` for stream lifetime:
the source stream is the only thing checked before `from_raw` consumes
it, which keeps the documented "stream is consumed on every return path"
guarantee. Dataset / key / params errors fire afterwards, followed by
upstream's own schema-compatibility, parser, and commit-conflict
diagnostics. The mutation itself follows the same snapshot-and-republish
pattern that `_update` already uses, so existing scanners holding a
clone of the inner `Arc` keep their pre-merge view.

## C++ surface

`Dataset::merge_insert(on_columns, source, params=nullptr)` is the full
surface. A `Dataset::upsert(on_columns, source)` convenience overload
covers the most common case (`UpdateAll` + `InsertAll` + `Keep`) without
needing to fill out a params struct.

## Test plan

- [x] Rust integration tests cover all 4 SQL-MERGE shapes plus boundary
rejections (NULL/empty args, `num_on_columns=0`, unknown key column,
out-of-range mode discriminants for all three enums,
missing/empty/extraneous expression strings, no-op configuration, schema
mismatch, version bump, optional `out_result`, untouched-on-error
semantics) — 28 tests.
- [x] C and C++ smoke tests (`compile_and_run_test`) self-merge the
dataset under defaults to validate the FFI plumbing without
hand-building Arrow batches in C, plus the `num_on_columns=0` rejection
path.
- [x] `cargo fmt`, `cargo clippy --all-targets -- -D warnings`, and
`cargo test` all clean locally.

## Out of scope

The `MergeInsertBuilder` knobs not exposed yet (`conflict_retries`,
`retry_timeout`, `skip_auto_cleanup`, `use_index`,
`source_dedupe_behavior`, `commit_retries`,
`mark_generations_as_merged`) and the `execute_uncommitted` /
`explain_plan` / `analyze_plan` paths. The params struct can grow
without breaking the ABI when there's a concrete need.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add lance_dataset_update() for predicate + expression-based row updates

2 participants