RFC: BatchCommitTables for DirectoryNamespace - Multi-Dataset Atomic Metadata Commits
#6775
@jackye1995 The way I understand it now: MemWAL is intra-table (LSM tree, memtable → L0 flushed generations, streaming writes into one Lance dataset) and this RFC is cross-table namespace-layer atomicity, so the two don't directly overlap. But the broader substrate around Phase 3 has moved in ways that matter — two precedents that didn't exist when I posted, and one blocker that's no longer a blocker. On Phase 3 substrate (§5.2 / §8 Q2): #6658 is closed, landed as #6781 ( Concretely the gap is that per-table Two precedents I'd fold into a revision:
§5.2 feels like it's narrowing rather than stalled, so I'm ready to start sketching Phase 1. @jackye1995 two questions where your read would help:
RFC: BatchCommitTables for DirectoryNamespace - Multi-Dataset Atomic Metadata Commits
TL;DR
The Lance namespace spec defines BatchCommitTables for atomic multi-table metadata commits. The OpenAPI pipeline ships generated wire-format clients in Rust, Python, and Java. No server-side implementation exists in any language. The Rust LanceNamespace trait does not declare the method; the Java core throws UnsupportedOperationException.
#6668 asks for exactly this. This RFC proposes how to land it in DirectoryNamespace using primitives that already exist (__manifest Lance dataset + single-dataset OCC + copy_if_not_exists). Of the three additional requirements raised in #6668: Phase 1 delivers namespace-view atomicity for namespace-managed readers (covering the failure-semantics ask, §5.3); Phase 2 adds branch-awareness (§5.1); Phase 3 closes the per-table HEAD visibility leak for direct Lance readers (§5.2), which depends on a new two-phase Dataset publish primitive. That is Lance-core work, not just namespace wiring.
The contribution is at the namespace layer (cross-table). It is orthogonal to milestone #11 (table-layer composable transactions, UserOperation/Action): sibling primitives, not stacked. Either can ship independently; both work better together; neither depends on the other for correctness.
1. Motivation
[Restating from #6668 for context.]
CommitBuilder::execute(transaction) is per-dataset. There is no public Rust primitive for atomic multi-dataset metadata commits, even though the namespace specification and the generated REST client define one. Consumers maintaining invariants across multiple Lance datasets currently use a coordinator-dataset pattern, which leaves a partial-success window between per-dataset commits and the coordinator commit.
Consumers affected (from #6668, annotated):
The Omnigraph case concretely: a typed property graph where each node/edge type lives in its own Lance dataset. A graph mutation touching N tables currently runs:
1. commit_staged per table: N sequential Lance dataset advances, each individually visible to readers
2. __manifest row-level CAS to publish the multi-table state via ManifestBatchPublisher::publish
Step (1) leaks per-table HEAD visibility before step (2) lands. The "Lance HEAD ahead of __manifest" drift class, for which we ship an open-time recovery sweep, would not exist if the substrate owned multi-table atomicity. The recovery sweep is a consumer-side workaround for the absence of this primitive.
2. Where the substrate already is
The atomic primitive BatchCommitTables needs is in place:
- __manifest Lance dataset. DirectoryNamespace maintains this at {namespace_root}/__manifest/. Table-version visibility is derived from table_version rows in __manifest. Single-dataset OCC arbitrates concurrent writes via copy_if_not_exists on the _versions/<n>.manifest file. Any change that updates K rows in __manifest in one dataset commit is atomic across those K rows.
- Generated wire types. BatchCommitTablesRequest, CommitTableOperation, and the four sub-requests (DeclareTableRequest, CreateTableVersionRequest, BatchDeleteTableVersionsRequest, DeregisterTableRequest) have shipped as auto-generated Rust/Python/Java models since lance-namespace v0.6.0. Shape unchanged through v0.7.6.
- MergeInsertBuilder with WhenMatched::Fail on object_id: already used by insert_into_manifest_with_metadata for multi-row inserts into __manifest.
- copy_if_not_exists across object stores (local FS, S3, GCS, Azure Blob via the ObjectStore abstraction): the same primitive create_table_version already uses.
For namespace-managed readers, the gap is wiring, not new mechanisms. Full closure of the per-table HEAD visibility window for direct Lance readers needs additional Lance-core work; see §5.2.
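The __manifest point above is the load-bearing one. As a minimal in-memory sketch of why K row changes become visible together, assume a store where writing version n succeeds only if n does not exist yet. VersionStore, commit_rows, and rows_at are hypothetical names for illustration, not Lance APIs, and a map stands in for the _versions/ directory:

```rust
use std::collections::BTreeMap;

/// One manifest row: (table id, table version). Illustrative only.
type Row = (String, u64);

/// In-memory stand-in for the `_versions/<n>.manifest` files of one dataset.
#[derive(Default)]
pub struct VersionStore {
    versions: BTreeMap<u64, Vec<Row>>,
}

impl VersionStore {
    /// Latest committed version number (0 when nothing is committed).
    pub fn latest(&self) -> u64 {
        self.versions.keys().next_back().copied().unwrap_or(0)
    }

    /// Put-if-absent: creating version `n` succeeds only if no writer
    /// has created it already. This mimics copy_if_not_exists.
    fn put_if_absent(&mut self, n: u64, rows: Vec<Row>) -> bool {
        if self.versions.contains_key(&n) {
            return false;
        }
        self.versions.insert(n, rows);
        true
    }

    /// Commit K row changes on top of the version the writer read.
    /// Either all K rows appear in version `read_version + 1`,
    /// or the writer lost the race and none of them appear.
    pub fn commit_rows(
        &mut self,
        read_version: u64,
        changes: Vec<Row>,
    ) -> Result<u64, &'static str> {
        let mut rows = self
            .versions
            .get(&read_version)
            .cloned()
            .unwrap_or_default();
        rows.extend(changes);
        let next = read_version + 1;
        if self.put_if_absent(next, rows) {
            Ok(next)
        } else {
            Err("conflict: version already exists")
        }
    }

    /// Rows visible at version `n` (empty if `n` was never committed).
    pub fn rows_at(&self, n: u64) -> &[Row] {
        self.versions.get(&n).map(|r| r.as_slice()).unwrap_or(&[])
    }
}
```

A writer that read version 0 and loses the race to create version 1 gets an error and none of its rows appear; it would re-read the latest version and try again.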
3. What's missing
- rust/lance-namespace/src/namespace.rs declares 40+ methods; batch_commit_tables is not one of them. Only batch_delete_table_versions (single-table) is present.
- DirectoryNamespace impl composing the four operation variants into a single __manifest mutation.
- RestNamespace adapter routing the call to the existing REST endpoint surface.
Non-trivial constraint: BatchCommitTables must not be implemented as N parallel create_table_version calls. The existing single-table path records the __manifest row best-effort: it logs a warning if the row insertion fails. For BatchCommitTables the row mutation must be mandatory and atomic, or the "__manifest as source of truth" claim in §5.3 doesn't hold. The implementation needs to share create_table_version's lower-level physical-copy helpers but route through a different commit boundary.
4. Proposed design
4.1 Trait method
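A minimal sketch of the shape such a method could take, assuming simplified synchronous stand-ins. Namespace, NamespaceError, BatchCommitTablesResponse, and the one-field BatchCommitTablesRequest below are placeholders for illustration, not the real LanceNamespace trait or the generated lance-namespace models:

```rust
/// Placeholder error type; the real crate defines its own errors.
#[derive(Debug, PartialEq)]
pub enum NamespaceError {
    NotSupported(String),
}

/// Placeholder request; the generated model carries the real fields.
#[derive(Debug, Default)]
pub struct BatchCommitTablesRequest {
    pub operations: Vec<String>,
}

/// Placeholder response.
#[derive(Debug, Default, PartialEq)]
pub struct BatchCommitTablesResponse;

/// Hypothetical, synchronous stand-in for the namespace trait.
pub trait Namespace {
    /// Optional operation: implementations that cannot commit a batch
    /// atomically inherit this default and report "not supported".
    fn batch_commit_tables(
        &self,
        _request: BatchCommitTablesRequest,
    ) -> Result<BatchCommitTablesResponse, NamespaceError> {
        Err(NamespaceError::NotSupported(
            "batch_commit_tables".to_string(),
        ))
    }
}

/// A namespace that does not override the method.
pub struct ReadOnlyNamespace;

impl Namespace for ReadOnlyNamespace {}
```

An implementation such as DirectoryNamespace would override the method; every other implementation keeps compiling unchanged.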
Default impl returns not_supported, matching the convention for other optional ops (alter_transaction, update_table).
4.2 DirectoryNamespace::batch_commit_tables
Invariant: no file appears at a canonical _versions/<N>.manifest path until its corresponding __manifest row commits. Per-table version manifests are written to a per-batch staging directory first, then promoted via rename_if_not_exists only after __manifest row insertion succeeds.
Three atomic boundaries, ordered:
1. _staging/<batch_uuid>/. Failure here is benign; nothing visible.
2. __manifest: a single Lance dataset commit inserts/updates/deletes the table_version rows. This is the atomicity point: either all row changes flip visible together, or none do. Mixed insert + delete in one batch is supported by Lance's existing CommitBuilder::execute_batch(Vec<Transaction>) (per #3734): multiple per-dataset transactions are atomically published in one manifest version bump, so a BatchCommitTables request with CreateTableVersion + DeregisterTable ops needs no new __manifest mutator.
3. _versions/ paths via rename_if_not_exists. After step 2 succeeds, the __manifest row is the durable source of truth; promotion is recoverable on the next operation if it fails mid-step.
A reader through the namespace sees nothing until step 2 commits. A reader bypassing the namespace and opening per-table datasets directly sees per-table HEADs advance during step 3, one rename at a time, in whatever order the runtime promotes them. This means:
- _versions/<N>.manifest files at all (step 1 wrote only to _staging/). The §5.3 "no observable state change to readers" claim holds at this boundary.
- Dataset primitive discussed in §5.2, which is Phase 3 of this RFC.
4.3 Conflict model: snapshot isolation per batch
Each batch reads __manifest at a single dataset version, validates all operations against that snapshot, and commits as a single dataset advance. Concurrent batches race for the same advance; a loser retries from a fresh snapshot. This matches Lance's existing single-dataset OCC, applied to __manifest.
Properties:
__manifest. __manifest).
4.4 Cross-backend portability
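The flow from §4.2 and §4.3 can be written against nothing more than a conditional commit on __manifest and a conditional rename, which is the basis for the portability claim. A minimal in-memory sketch under those assumptions follows. BatchNamespace and its methods are hypothetical names, sets and maps stand in for object storage, and promote refuses to publish a file whose row is not yet committed so that the §4.2 invariant holds:

```rust
use std::collections::{BTreeMap, BTreeSet};

/// (table name, version), standing in for one `table_version` row.
type TableVersion = (String, u64);

#[derive(Default)]
pub struct BatchNamespace {
    /// Files under `_staging/<batch>/`; no reader looks here.
    staging: BTreeSet<(String, TableVersion)>,
    /// Committed `__manifest` rows, keyed by manifest dataset version.
    manifest: BTreeMap<u64, BTreeSet<TableVersion>>,
    /// Files at canonical `_versions/<N>.manifest` paths.
    canonical: BTreeSet<TableVersion>,
}

impl BatchNamespace {
    /// Current `__manifest` dataset version (0 when empty).
    pub fn manifest_version(&self) -> u64 {
        self.manifest.keys().next_back().copied().unwrap_or(0)
    }

    /// What a namespace-managed reader sees.
    pub fn visible(&self) -> BTreeSet<TableVersion> {
        self.manifest
            .get(&self.manifest_version())
            .cloned()
            .unwrap_or_default()
    }

    /// What a reader opening tables directly would find on disk.
    pub fn canonical_files(&self) -> &BTreeSet<TableVersion> {
        &self.canonical
    }

    /// Step 1: write per-table manifests to staging only.
    pub fn stage(&mut self, batch: &str, ops: &[TableVersion]) {
        for op in ops {
            self.staging.insert((batch.to_string(), op.clone()));
        }
    }

    /// Step 2: one commit on `__manifest`, validated against the snapshot
    /// the batch read. A stale snapshot means another batch won the race.
    pub fn commit_manifest(
        &mut self,
        snapshot: u64,
        ops: &[TableVersion],
    ) -> Result<u64, &'static str> {
        if self.manifest_version() != snapshot {
            return Err("conflict: snapshot is stale, retry from a fresh one");
        }
        let mut rows = self.visible();
        rows.extend(ops.iter().cloned());
        self.manifest.insert(snapshot + 1, rows);
        Ok(snapshot + 1)
    }

    /// Step 3: promote staged files whose rows are already committed.
    /// Safe to call again after a partial failure.
    pub fn promote(&mut self, batch: &str) {
        let visible = self.visible();
        let ready: Vec<(String, TableVersion)> = self
            .staging
            .iter()
            .filter(|(b, tv)| b.as_str() == batch && visible.contains(tv))
            .cloned()
            .collect();
        for entry in ready {
            self.staging.remove(&entry);
            self.canonical.insert(entry.1); // rename_if_not_exists analogue
        }
    }
}
```

In this model a batch that loses step 2 has only touched staging, so the retry starts from a fresh manifest_version() with nothing to undo.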
The implementation depends on three primitives Lance already provides:
- Dataset::commit on __manifest: works on all supported backends today.
- copy_if_not_exists / rename_if_not_exists for staging and promotion: works on all four major object stores via the ObjectStore abstraction (with the existing fallback path for backends that lack native conditional-write support; the BatchCommitTables impl inherits whatever fallback shape Lance already uses for create_table_version).
No new backend-specific code. Behavior on each backend is determined by the existing ObjectStore trait impls.
5. Addressing #6668's three additional requirements
#6668 raises three requirements beyond "implement what the spec says."
5.1 Branch-awareness
Non-main branch publication without flattening branch state. Phase 1 ships main-only; Phase 2 adds a first-class branch parameter on BatchCommitTablesRequest, contributed as a spec PR alongside the implementation. __manifest CAS becomes scoped per (table, branch). Lance has native branch primitives at the dataset layer; for BatchCommitTables to interact with them portably across namespace implementations, branch must be in the spec, not in context, which is documented as implementation-custom. A lance-branch: context-key encoding is a transitional stopgap if the spec PR lags the implementation, but is not the long-term design.
5.2 Publish must accept staged manifests without per-table HEAD visibility
The most architecturally significant of the three. Per-table commit_staged must not advance Lance HEAD until the multi-table __manifest commit succeeds. Today's Dataset::commit does not separate stage from publish, so closing the gap requires Lance-core work: either alongside #6658 (two-phase delete) and #6666 (two-phase vector index), or via a new Dataset::stage_commit primitive.
Without one, Phase 1 leaks per-table HEAD between staging and the __manifest flip. Reads via the namespace see consistent multi-table state; reads that bypass the namespace and open per-table datasets directly see HEADs advance one-by-one. This is what Omnigraph runs today.
5.3 Explicit failure semantics; __manifest as source of truth
A failed batch leaves orphan per-table version files invisible to namespace-managed readers (no table_version row points at them). Cleanup tiers:
At every phase, __manifest is authoritative: files on disk not referenced in it don't exist for namespace reads.
6. Relationship to milestone #11 (UserOperation/Action)
BatchCommitTables is at the namespace layer (cross-table); milestone #11 (#5960's UserOperation { repeated Action }) is at the table layer (intra-table). They are sibling primitives at different architectural layers, not stacked.
- AddFragments + AddIndex on one table). It does not address atomicity across multiple datasets.
- BatchCommitTables lets the namespace-managed view of multiple tables advance atomically: table_version rows in __manifest publish together via row-level CAS on a single Lance dataset. The per-table underlying Lance HEADs still advance one-by-one during the staging phase; closing that visibility window requires the two-phase Dataset publish primitive discussed in §5.2. It does not address intra-table composability.
A consumer might want both: "atomically (a) append data + create index on table A, AND (b) append data on table B." BatchCommitTables provides the cross-table namespace-view grouping (b); milestone #11 provides the intra-table composability (a). Both can ship independently; both work better together.
This proposal does not block on milestone #11 and does not propose any change to it. The two land on independent timelines.
7. Phasing
- Phase 1: batch_commit_tables and batch_create_table_versions (matching the full #6668 ask) + DirectoryNamespace impl for CreateTableVersion and DeclareTable (main branch only; eager cleanup) + RestNamespace adapter + Python core + Java core + JNI bindings, matching the #6678 "backfill and refresh" cross-language PR shape; __manifest row-level CAS atomic across N tables
- Phase 2: DeleteTableVersions, DeregisterTable; add branch-awareness per §5.1
- Phase 3: Dataset::stage_commit primitive or coordinate per-table staging through __manifest
Phase 1 ships cross-language bindings together with the Rust impl (per the #6678 template); Phases 2–3 are the substantive remainder responding to #6668. Phase 4 is polish.
8. Open questions
1. Branch encoding (§5.1). The proposal commits to a first-class branch parameter via spec PR (not context-key). Open question: is the spec PR straightforward to add to lance-namespace, or are there structural blockers that would push us toward keeping the transitional context-key encoding longer?
2. Two-phase publish on Dataset (§5.2). Does this work fit alongside #6658 (two-phase delete) and #6666 (two-phase vector index), or does it want a new Dataset::stage_commit primitive? Without one, Phase 3 cannot close the per-table HEAD visibility window.
3. Surface alternative: MultiCommitBuilder on Dataset. #6668 explicitly leaves the API shape open: "If a different surface is preferred — e.g., a MultiCommitBuilder on Dataset rather than methods on LanceNamespace — happy to discuss. The architectural ask is 'make the substrate own multi-table atomicity' rather than the specific API shape." This proposal pursues the trait-method shape because it matches the already-shipped namespace spec, the REST surface, and the auto-generated Rust/Python/Java client code through v0.7.6: it wires an existing surface rather than designing a new one. A Dataset-level MultiCommitBuilder would either require an additional spec at the Dataset layer or a routing shim from the namespace API to a Dataset primitive, which is strictly more spec work than the trait shape. Open question: do maintainers see structural problems with the trait shape that would justify the additional Dataset-level work?
Settled in this RFC (no longer open):
- __manifest commit. Resolved in §4.2: CommitBuilder::execute_batch(Vec<Transaction>) already supports this; no new mutator needed in DirectoryNamespace.
- expected_current_version on CreateTableVersionRequest. Omnigraph runs without it via application-layer expected_table_versions CAS; the gap isn't blocking for #6668. A spec extension would be cleaner but is a separable follow-up RFC; not blocking this contribution.
- BatchCreateTableVersions and BatchCommitTables coexistence. Both ship on the trait in Phase 1 per #6668's ask. Deprecation of the older type can be a follow-up RFC once BatchCommitTables is established as the canonical successor.
9. References
- BatchCreateTableVersions
- BatchCommitTables: lance-namespace#315 (in v0.6.0)
- batch_commit_tables_request.rs; commit_table_operation.rs
- DirectoryNamespace::create_table_version at dir.rs (v6.0.0)
10. Request for Comments
The three feedback points that most shape what this contribution looks like:
- Dataset::stage_commit? Without one, Phase 3 cannot close the visibility window for direct Lance readers.
- LanceNamespace (current proposal, matches existing namespace spec + REST surface), or MultiCommitBuilder on Dataset (the alternative #6668 raises)? The architectural ask is the same; the API shape is open.
The remaining §8 open question, branch encoding (Q1), is a scoped spec-PR feasibility check; happy to take maintainer guidance during the spec/impl PRs rather than blocking design alignment here.
I'm interested in contributing toward the implementation and happy to coordinate with @ragnorc on scope and ownership; what makes sense from the maintainer side is also welcome input.