
feat(dir): prototype sharded catalog branches #7007

Draft
jackye1995 wants to merge 11 commits into lance-format:main from jackye1995:jack/sharded-catalog-branch-mtt-prototype


@jackye1995
Contributor

Prototype sharded directory-catalog manifests and branch-based catalog promotion on top of #6794.

Summary:

  • add a sharded manifest backend using __super_manifest and _manifest_shard* tables
  • add Rust-only catalog branch create/promote helpers with lazy table branch materialization
  • route table-version records by table shard and cover branch isolation with focused tests
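One way to picture routing table-version records by table shard is a stable hash of the table identifier taken modulo the shard count. The sketch below is an assumption for illustration, not the PR's routing rule: the FNV-1a hash, the `shard_index` and `shard_table_name` helpers, and the numeric suffix appended to `_manifest_shard` are all hypothetical.

```rust
/// 64-bit FNV-1a. Chosen for this sketch because its output is fixed by the
/// algorithm, whereas std's DefaultHasher may change between Rust releases.
fn fnv1a_64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    hash
}

/// Hypothetical routing rule: stable hash of the table id, modulo shard count.
fn shard_index(table_id: &str, num_shards: u32) -> u32 {
    assert!(num_shards > 0, "need at least one shard");
    (fnv1a_64(table_id.as_bytes()) % u64::from(num_shards)) as u32
}

/// Name of the shard table holding records for `table_id`.
/// The `_manifest_shard` prefix comes from the PR summary; the numeric
/// suffix format is an assumption.
fn shard_table_name(table_id: &str, num_shards: u32) -> String {
    format!("_manifest_shard{}", shard_index(table_id, num_shards))
}
```

Because the hash is deterministic, every writer and reader that agrees on the shard count resolves the same table to the same shard without consulting shared state. Changing the shard count would remap tables, so a real implementation would need to record the count somewhere durable.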

Validated with cargo fmt, focused sharded tests, cargo check --all-targets, rest-adapter check, clippy, and diff whitespace check.

jackye1995 added 11 commits May 15, 2026 00:43
Replace merge-insert/delete manifest mutations with always copy-on-write
full rewrites. Each mutation scans the latest __manifest dataset, streams
transformed rows into a replacement data file, and commits a new version
with replacement scalar indices built inline.

- Migrate metadata column from Utf8 to Lance JSON (LargeBinary)
- Remove base_objects column and LabelList index
- Build BTree (object_id), Bitmap (object_type), and FTS (metadata)
  indices during each streaming rewrite
- Add overwrite-with-replacement-indices commit support in Lance
- Handle concurrency via strict overwrite with full-rewrite retry
- Backward compatible: old schema datasets (Utf8 metadata, base_objects)
  are read correctly and migrated on first write
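The copy-on-write flow with strict overwrite and full-rewrite retry can be sketched with an in-memory stand-in for the dataset. Everything here is hypothetical (`Store`, `Snapshot`, `commit_overwrite`, `rewrite_with_retry`); it does not use the Lance API, it materializes rows in memory rather than streaming them into a data file, and it omits index building.

```rust
use std::sync::{Arc, Mutex};

#[derive(Clone, Debug, PartialEq)]
struct Row {
    object_id: String,
    object_type: String,
}

/// One immutable manifest version; mutations never edit it in place.
#[derive(Clone)]
struct Snapshot {
    version: u64,
    rows: Arc<Vec<Row>>,
}

/// Stand-in for the dataset: tracks only the latest committed snapshot.
struct Store {
    latest: Mutex<Snapshot>,
}

impl Store {
    fn new() -> Self {
        let empty = Snapshot { version: 1, rows: Arc::new(Vec::new()) };
        Store { latest: Mutex::new(empty) }
    }

    fn latest(&self) -> Snapshot {
        self.latest.lock().unwrap().clone()
    }

    /// Strict overwrite: succeeds only if `read_version` is still the latest.
    /// On conflict, returns the version that won.
    fn commit_overwrite(&self, read_version: u64, rows: Vec<Row>) -> Result<u64, u64> {
        let mut cur = self.latest.lock().unwrap();
        if cur.version != read_version {
            return Err(cur.version);
        }
        *cur = Snapshot { version: read_version + 1, rows: Arc::new(rows) };
        Ok(cur.version)
    }
}

/// Scan the latest snapshot, build a full replacement with `transform`,
/// and commit it. A conflict discards the replacement and redoes the rewrite.
fn rewrite_with_retry<F>(store: &Store, max_attempts: usize, transform: F) -> Result<u64, String>
where
    F: Fn(&[Row]) -> Vec<Row>,
{
    for _ in 0..max_attempts {
        let snap = store.latest();
        let replacement = transform(snap.rows.as_slice());
        if let Ok(version) = store.commit_overwrite(snap.version, replacement) {
            return Ok(version);
        }
    }
    Err(format!("gave up after {max_attempts} attempts"))
}
```

Inserts and deletes both reduce to a `transform` that returns the complete new row set, which is why a conflicting writer can simply rerun the same closure against the newer snapshot.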
Binary measures read (list_namespaces, list_tables, describe_table) and
write (create_namespace, create_table) operations at configurable
concurrency levels. Supports --variant and --inline-optimization flags
to compare baseline merge-insert vs copy-on-write implementations.

Multi-process coordinator/worker architecture: coordinator spawns N
child processes each with independent namespace instance (no shared
cache). Supports S3 root paths, cold-read (fresh namespace per op),
warm-read (cached), and write operations. Separate seed mode for
populating manifests with configurable entry count.

…nline cleanup

- Remove FTS background channel asymmetry: accumulate metadata in
  ManifestIndexAccumulator during streaming, build all 3 indices
  (BTree, Bitmap, FTS) uniformly after the stream completes.
- Replace CommitBuilder with direct manifest commit: expose
  write_manifest_file and ManifestWriteConfig as public API, add
  Dataset::commit_handler() accessor, construct Manifest via
  new_from_previous and commit directly.
- Remove inline cleanup on retry: drop cleanup_uncommitted_overwrite_files
  and cleanup_uncommitted_index_uuids, rely on offline GC for orphaned files.
- Add index verification tests: test_manifest_indices_are_complete_and_versioned
  checks all 3 indices are present, versioned, and have fragment bitmaps;
  test_manifest_reads_use_indexed_scans verifies explain plans show
  ScalarIndexQuery for BTree/Bitmap filters and MatchQuery for FTS.
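The accumulate-then-build pattern described for ManifestIndexAccumulator can be illustrated with plain standard-library collections. This is a sketch under assumptions: `IndexAccumulator`, `BuiltIndices`, and the field names are hypothetical, and the three maps only stand in for Lance's BTree, Bitmap, and FTS indices rather than reproducing them.

```rust
use std::collections::{BTreeMap, BTreeSet, HashMap};

struct ManifestRow {
    object_id: String,
    object_type: String,
    metadata: String,
}

/// Simplified stand-ins for the three index kinds named above.
struct BuiltIndices {
    by_object_id: BTreeMap<String, usize>,            // BTree-like: id -> row offset
    by_object_type: HashMap<String, Vec<usize>>,      // Bitmap-like: type -> row offsets
    metadata_terms: HashMap<String, BTreeSet<usize>>, // FTS-like: token -> row offsets
}

/// Collects the indexed columns while rows stream past; builds nothing yet.
#[derive(Default)]
struct IndexAccumulator {
    ids: Vec<String>,
    types: Vec<String>,
    metadata: Vec<String>,
}

impl IndexAccumulator {
    /// Called once per row, in the order rows are written.
    fn observe(&mut self, row: &ManifestRow) {
        self.ids.push(row.object_id.clone());
        self.types.push(row.object_type.clone());
        self.metadata.push(row.metadata.clone());
    }

    /// Called after the stream completes: all three indices are built the same way.
    fn finish(self) -> BuiltIndices {
        let mut by_object_id = BTreeMap::new();
        let mut by_object_type: HashMap<String, Vec<usize>> = HashMap::new();
        let mut metadata_terms: HashMap<String, BTreeSet<usize>> = HashMap::new();
        for (offset, id) in self.ids.into_iter().enumerate() {
            by_object_id.insert(id, offset);
        }
        for (offset, ty) in self.types.into_iter().enumerate() {
            by_object_type.entry(ty).or_default().push(offset);
        }
        for (offset, text) in self.metadata.iter().enumerate() {
            // Naive tokenizer: split on anything that is not alphanumeric.
            for token in text.split(|c: char| !c.is_alphanumeric()).filter(|t| !t.is_empty()) {
                metadata_terms.entry(token.to_lowercase()).or_default().insert(offset);
            }
        }
        BuiltIndices { by_object_id, by_object_type, metadata_terms }
    }
}
```

Deferring all three builds to `finish` removes the need for a separate background channel for one index type, at the cost of holding the indexed columns in memory until the stream ends.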
seed-large writes a __manifest Lance table directly with N rows,
bypassing the namespace API. Triggers one CoW rewrite to build
indices. Adds --initial-entries flag to run mode for result tracking.

The mutation lock already serializes local writes. Use get_cached()
on first attempt (no I/O) and get_refreshed() only on retry after
conflict. Make checkout_version on success non-fatal so the return
path doesn't block on I/O if the cache promotion fails.
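The cached-first retry policy can be sketched as follows. The method names `get_cached` and `get_refreshed` come from the commit message, but the `ManifestSource` trait, the signatures, and `FakeSource` are assumptions made for illustration; the real methods presumably return dataset handles rather than bare version numbers.

```rust
/// Hypothetical view of the manifest handle cache.
trait ManifestSource {
    /// No I/O: the last version this process saw.
    fn get_cached(&mut self) -> u64;
    /// I/O: re-read the latest version from storage.
    fn get_refreshed(&mut self) -> u64;
    /// Commit on top of `read_version`; fails if storage has moved on.
    fn try_commit(&mut self, read_version: u64) -> Result<u64, ()>;
}

/// First attempt trusts the cache; only a conflict pays for a refresh.
fn mutate<S: ManifestSource>(src: &mut S, max_attempts: usize) -> Option<u64> {
    for attempt in 0..max_attempts {
        let read_version = if attempt == 0 {
            src.get_cached()
        } else {
            src.get_refreshed()
        };
        if let Ok(committed) = src.try_commit(read_version) {
            return Some(committed);
        }
    }
    None
}

/// In-memory fake that counts refreshes, to make the I/O cost visible.
struct FakeSource {
    cached: u64,
    remote: u64,
    refreshes: u32,
}

impl ManifestSource for FakeSource {
    fn get_cached(&mut self) -> u64 {
        self.cached
    }

    fn get_refreshed(&mut self) -> u64 {
        self.refreshes += 1;
        self.cached = self.remote;
        self.cached
    }

    fn try_commit(&mut self, read_version: u64) -> Result<u64, ()> {
        if read_version != self.remote {
            return Err(()); // another writer committed first
        }
        self.remote += 1;
        self.cached = self.remote;
        Ok(self.remote)
    }
}
```

When the local mutation lock is the only source of writes, the cache is always current and the refresh path never runs; a writer in another process shows up as a single failed attempt followed by one refresh.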
S3X no-index: 6.8/s create-ns, 6.2/s declare-table at 1K entries.
S3X with-index: 5.7/s create-ns, 5.1/s declare-table at 1K entries.
Indexed point lookup flat from 100K to 1M (9ms warm on S3X).
CoW full rewrite + 3 indices at 1M: 2.0s S3X, 3.2s S3.
github-actions bot added the enhancement and java labels May 29, 2026