feat: route partial-schema merge_insert through the v2 write path #6472
Conversation
wjones127 left a comment
Thanks for working on this. I'd like to see some tests to make sure this doesn't break in cases where there are camel case columns. Otherwise I think it looks good.
// physical plan is deterministic and easy to inspect in tests.
for field in dataset_schema.fields() {
    if !source_field_names.contains(field.name()) {
        df = df.with_column(
issue(non-blocking): scanning these columns for all rows isn't ideal performance-wise. Consider the case where you are updating 3 rows with just part of the schema. This means you need to read all rows for missing columns in the table just to do this update. Ideally, we'd instead pull these columns in using TakeExec after the join.
I'm okay with keeping this code path for now, but if we do we should make a follow up ticket for optimizing to a Take later.
Agreed — materializing every missing column across the full target scan is wasteful when the update set is small. The follow-up is already tracked in #4193 ("Optimize upsert with partial schema and no index"), which is linked from this PR's description. I'll add a note on #4193 pointing at the post-join TakeExec approach you described so the optimization direction is captured concretely rather than as a generic "revisit later".
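For illustration, the scan-versus-take trade-off discussed in this thread can be sketched with a std-only toy. These helpers are hypothetical stand-ins, not Lance's actual `TakeExec`; the point is only the number of rows touched when few rows matched the join.

```rust
/// Scan-based fill-in: materialize the entire missing column, then keep
/// only the matched rows. Returns the filled values plus rows read.
fn fill_by_scan(column: &[i64], matched: &[usize]) -> (Vec<i64>, usize) {
    let all: Vec<i64> = column.to_vec(); // reads every row of the column
    let vals = matched.iter().map(|&i| all[i]).collect();
    (vals, column.len())
}

/// Take-based fill-in: read only the matched row addresses after the join.
fn fill_by_take(column: &[i64], matched: &[usize]) -> (Vec<i64>, usize) {
    let vals = matched.iter().map(|&i| column[i]).collect();
    (vals, matched.len()) // reads only matched.len() rows
}
```

With an 8-row column and 2 matched rows, both return the same values, but the scan variant reads 8 rows where the take variant reads 2; that gap is what the follow-up ticket targets.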
Reviewer on lance-format#6472 flagged that DataFusion's `col()` lowercases unquoted identifiers, so the partial-schema v2 fill-in and the `on_cols` join quoting were untested against camelCase names. Pin the behavior down with a focused test that uses a camelCase join key and a camelCase column omitted from the source. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
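Since DataFusion-style identifier parsing lowercases unquoted names, case-sensitive column names have to be wrapped in double quotes before being handed to `col()`. A minimal sketch of such a quoting helper (hypothetical, not the actual Lance code) looks like this, with embedded double quotes escaped by doubling per SQL rules:

```rust
/// Wrap a column name in double quotes so an SQL-style identifier parser
/// preserves its case. Embedded `"` characters are doubled.
fn quote_ident(name: &str) -> String {
    format!("\"{}\"", name.replace('"', "\"\""))
}
```

With this, a camelCase join key such as `camelKey` survives as `"camelKey"` instead of being normalized to `camelkey`.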
Force-pushed from 5bc7d21 to 8664b2c
Force-pushed from 6d19c2f to 81c674b
Partial-schema upserts (source column subset of the dataset schema) used to fall back to the legacy v1 Merger path, which wrote only the changed columns via Operation::Update with RewriteColumns. They now run through the same create_plan / FullSchemaMergeInsertExec pipeline as full-schema upserts. Missing source columns are filled from the target side of the join via a post-join projection, so the v2 writer sees a complete set of data columns and writes full rows as new fragments.

Consequences:
- explain_plan() now works for partial-schema upserts.
- Bloom-filter conflict detection (inserted_rows_filter) is populated for partial-schema operations on unenforced primary keys; v1 always returned None here.
- when_not_matched_by_source=Delete/DeleteIf is accepted for partial schema (previously rejected outright).
- Partial-schema upserts with insert_not_matched=InsertAll now reject non-nullable missing columns at the API boundary with a descriptive error naming the offending columns, instead of producing a confusing downstream writer error.
- prepare_stream_schema now emits data columns in dataset-schema order keyed by name. This turns an accidental positional invariant into an explicit name-based one and is required for the partial-schema path (synthetic filled columns land at the end of the logical projection).

The v1 RewriteColumns optimization is still used when the join key has a scalar index (falls through to the old path). Reviving a similar optimization on v2 is tracked separately as lance-format#4193.

Closes lance-format#6442

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
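The name-keyed ordering this commit describes for `prepare_stream_schema` can be sketched with a std-only toy (hypothetical helper, not the actual Lance code): emit data columns in dataset-schema order by looking each field up by name, rather than trusting the incoming positional order, so synthetic filled columns may sit anywhere in the input.

```rust
/// Reorder incoming (name, data) columns to match the dataset schema's
/// field order, keyed by name. Returns None if any schema column is
/// missing from the stream. `i64` stands in for real column data.
fn order_by_dataset_schema<'a>(
    dataset_fields: &[&'a str],
    incoming: &[(&'a str, i64)],
) -> Option<Vec<(&'a str, i64)>> {
    dataset_fields
        .iter()
        .map(|f| incoming.iter().find(|(n, _)| n == f).copied())
        .collect()
}
```

Under this scheme a filled column appended at the end of the logical projection still lands in its schema position in the output.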
After rebasing onto main, our test added in "feat: route partial-schema merge_insert through the v2 write path" no longer compiled because lance-format#6647 on main turned the public `Dataset::object_store()` getter into an async, base-aware method returning `Result<Arc<ObjectStore>>`. The other call site of `read_transaction_file` in the same module already uses the `pub(crate) object_store` field directly via `.as_ref()`; mirror that pattern here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Force-pushed from cb3fdc2 to b80f89b
wjones127 left a comment
This is looking good. Apologies for the delay in review. Thanks for working on this!
Updated the branch. Will merge once CI is passing.
Summary
Closes #6442.
Partial-schema upserts (source column subset of the dataset schema) used to fall back to the legacy v1 `Merger` path, which wrote only the changed columns via `Operation::Update` with `RewriteColumns`. They now run through the same `create_plan`/`FullSchemaMergeInsertExec` pipeline as full-schema upserts. Missing source columns are filled from the target side of the join via a post-join projection, so the v2 writer sees a complete set of data columns and writes full rows as new fragments.

Per the issue, this intentionally does not replicate v1's `RewriteColumns` optimization; that is tracked separately in #4193. The v1 path is still used when the join key has a scalar index (falls through the old gate), so the legacy optimization remains available for the scalar-index-on-join-key case.

How it works
- `can_use_create_plan` (gate): accepts a source schema that is a type-compatible subset of the dataset schema. If `insert_not_matched=InsertAll` and any missing target column is non-nullable, the call short-circuits with an `InvalidInput` error naming the offending columns.
- `create_plan` (plan construction): after the join and the `__action` column, every dataset field missing from the source is added as an unqualified column populated from `col("target.\"<name>\"")`. For matched rows this carries the existing target value; for unmatched source rows the outer join leaves it NULL.
- `necessary_children_exprs` (projection pushdown): the extension node now keeps unqualified columns whose name matches a dataset field, so the synthetic filled columns flow through to the write exec alongside `source.*`.
- `prepare_stream_schema`: data columns are now emitted in dataset schema order keyed by name. This turns an accidental positional invariant into an explicit name-based one and is required for the partial-schema path (synthetic filled columns land at the end of the logical projection).

Consequences
- `explain_plan()` now works for partial-schema upserts.
- Bloom-filter conflict detection (`inserted_rows_filter`) is populated for partial-schema operations on unenforced primary keys; v1 always returned `None`.
- `when_not_matched_by_source=Delete/DeleteIf` is accepted for partial-schema sources (previously rejected outright with a `NotSupported` error).
- Partial-schema upserts with `insert_not_matched=InsertAll` now reject non-nullable missing columns at the API boundary with a descriptive error naming the offending columns, instead of producing a confusing downstream writer error. User-visible behavior change.

Tests
- Extended `test_merge_insert_subcols` to branch on `scalar_index`: v1 structural assertions (tombstoned fields, preserved fragment ids) stay on the `scalar_index=true` branch; the `scalar_index=false` branch asserts v2 structural behavior. A shared key-lookup-based check verifies that columns not in source retain original values across both paths.
- Converted `test_delete_not_supported` into a positive `test_delete_not_matched_by_source_on_v2_subcols` test.
- Updated `test_sub_schema_upsert_fragment_bitmap` to reflect v2 semantics (3 fragments after upsert, both indexes preserved; v2 relies on unindexed-fragment fallback rather than eagerly invalidating the vector index).
- `test_merge_insert_subcols_v2_explain_plan` confirms the v2 physical plan is used and that the filled `other` column appears in the projection.
- `test_merge_insert_subcols_v2_rejects_non_nullable_insert` covers the new validation error.
- `test_merge_insert_subcols_v2_bloom_filter` covers the bloom-filter conflict-detection acceptance criterion.
- Updated `test_merge_insert_subcols` to drop v1-specific file-layout assertions while preserving the semantic data checks (column `c` unchanged for updated rows, `NULL` for newly inserted rows).

Test plan
- `cargo check -p lance --lib --tests`
- `cargo clippy -p lance --lib --tests -- -D warnings`
- `cargo fmt --all`
- `cargo test -p lance --lib dataset::write::merge_insert` (127 passed)
- `cargo test -p lance --lib dataset::write` (190 passed)
- `cargo test -p lance --lib dataset::` (1058 passed)
- `cargo test -p lance --lib index::tests` (63 passed)
- `pytest python/python/tests/test_dataset.py::test_merge_insert_subcols` (run in CI)

🤖 Generated with Claude Code
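As an illustration of the `InsertAll` gate described in the summary, here is a minimal std-only sketch (assumed shape, hypothetical names, not Lance's actual `can_use_create_plan` signature): when inserts are possible, any target column missing from the source must be nullable, since newly inserted rows would otherwise have no value for it.

```rust
/// Reject an insert-capable partial-schema merge when any non-nullable
/// target column is absent from the source. Fields are (name, nullable).
fn check_insertable(
    dataset_fields: &[(&str, bool)],
    source_names: &[&str],
) -> Result<(), String> {
    let offending: Vec<&str> = dataset_fields
        .iter()
        .copied()
        .filter(|&(name, nullable)| !nullable && !source_names.contains(&name))
        .map(|(name, _)| name)
        .collect();
    if offending.is_empty() {
        Ok(())
    } else {
        // Mirrors the descriptive error naming the offending columns.
        Err(format!(
            "insert would leave non-nullable columns unfilled: {}",
            offending.join(", ")
        ))
    }
}
```

Failing at this boundary surfaces the offending column names up front instead of a confusing downstream writer error.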