
feat: route partial-schema merge_insert through the v2 write path#6472

Merged
wjones127 merged 5 commits into
lance-format:main from wombatu-kun:feat/partial-schema-v2-merge-insert-6442
May 12, 2026
Conversation

@wombatu-kun
Contributor

Summary

Closes #6442.

Partial-schema upserts (where the source columns are a subset of the dataset schema) used to fall back to the legacy v1 Merger path, which wrote only the changed columns via Operation::Update with RewriteColumns. They now run through the same create_plan / FullSchemaMergeInsertExec pipeline as full-schema upserts. Missing source columns are filled from the target side of the join via a post-join projection, so the v2 writer sees a complete set of data columns and writes full rows as new fragments.

Per the issue, this intentionally does not replicate v1's RewriteColumns optimization — that is tracked separately in #4193. The v1 path is still used when the join key has a scalar index (falls through the old gate), so the legacy optimization remains available for the scalar-index-on-join-key case.
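The fill-from-target behavior can be illustrated with a minimal sketch — `JoinedRow` and `fill_missing` here are hypothetical stand-ins for the post-join rows, not actual Lance or DataFusion types:

```rust
// After the outer join, each output row may or may not have a target side.
// A column absent from the source is taken from the target; unmatched
// source rows (no target side) get NULL, modeled here as None.
struct JoinedRow {
    source_id: i64,
    target_other: Option<i32>, // column absent from the source
}

fn fill_missing(rows: &[JoinedRow]) -> Vec<(i64, Option<i32>)> {
    rows.iter().map(|r| (r.source_id, r.target_other)).collect()
}

fn main() {
    let rows = vec![
        JoinedRow { source_id: 1, target_other: Some(10) }, // matched: keep target value
        JoinedRow { source_id: 2, target_other: None },     // unmatched: NULL
    ];
    assert_eq!(fill_missing(&rows), vec![(1, Some(10)), (2, None)]);
}
```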

How it works

  1. can_use_create_plan (gate) — accepts a source schema that is a type-compatible subset of the dataset schema. If insert_not_matched=InsertAll and any missing target column is non-nullable, the call short-circuits with an InvalidInput error naming the offending columns.
  2. create_plan (plan construction) — after the join and the __action column, every dataset field missing from the source is added as an unqualified column populated from col("target.\"<name>\""). For matched rows this carries the existing target value; for unmatched source rows the outer join leaves it NULL.
  3. necessary_children_exprs (projection pushdown) — the extension node now keeps unqualified columns whose name matches a dataset field, so the synthetic filled columns flow through to the write exec alongside source.*.
  4. prepare_stream_schema — data columns are now emitted in dataset schema order keyed by name. This turns an accidental positional invariant into an explicit name-based one and is required for the partial-schema path (synthetic filled columns land at the end of the logical projection).
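The gate logic from step 1 can be sketched as follows — `FieldDef` and `check_partial_schema` are simplified illustrative stand-ins, not the actual Lance/Arrow types or the real `can_use_create_plan` signature:

```rust
// Simplified stand-in for an Arrow/Lance schema field.
struct FieldDef {
    name: String,
    nullable: bool,
}

/// Mirrors the gate idea: the source must be a name-wise subset of the
/// dataset schema, and with InsertAll every dataset column missing from
/// the source must be nullable (otherwise NULL fill is impossible).
fn check_partial_schema(
    dataset: &[FieldDef],
    source_names: &[&str],
    insert_all: bool,
) -> Result<(), String> {
    // Every source column must exist in the dataset schema.
    for name in source_names {
        if !dataset.iter().any(|f| f.name == *name) {
            return Err(format!("source column `{name}` not in dataset schema"));
        }
    }
    // With InsertAll, short-circuit naming the non-nullable missing columns.
    let offending: Vec<&str> = dataset
        .iter()
        .filter(|f| !source_names.contains(&f.name.as_str()) && !f.nullable)
        .map(|f| f.name.as_str())
        .collect();
    if insert_all && !offending.is_empty() {
        return Err(format!(
            "non-nullable columns missing from source: {}",
            offending.join(", ")
        ));
    }
    Ok(())
}

fn main() {
    let dataset = vec![
        FieldDef { name: "id".into(), nullable: false },
        FieldDef { name: "a".into(), nullable: true },
        FieldDef { name: "b".into(), nullable: false },
    ];
    // Missing non-nullable `b` with InsertAll -> rejected.
    assert!(check_partial_schema(&dataset, &["id", "a"], true).is_err());
    // Same source without InsertAll -> accepted.
    assert!(check_partial_schema(&dataset, &["id", "a"], false).is_ok());
}
```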

Consequences

  • explain_plan() now works for partial-schema upserts.
  • Bloom-filter conflict detection (inserted_rows_filter) is now populated for partial-schema operations on unenforced primary keys; the v1 path always returned None.
  • when_not_matched_by_source=Delete/DeleteIf is accepted for partial-schema sources (previously rejected outright with a NotSupported error).
  • Partial-schema upserts with insert_not_matched=InsertAll now reject non-nullable missing columns at the API boundary with a descriptive error naming the offending columns, instead of producing a confusing downstream writer error. User-visible behavior change.
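The name-keyed ordering introduced in prepare_stream_schema (step 4 above) can be sketched with a hypothetical helper — `order_by_dataset_schema` is illustrative only, using (name, payload) pairs in place of real Arrow columns:

```rust
// Reorder a batch's columns into dataset-schema order, keyed by name.
// `columns` arrives in arbitrary projection order, e.g. with synthetic
// filled columns appended at the end of the logical projection.
// Returns None if any dataset column is missing.
fn order_by_dataset_schema<'a>(
    dataset_order: &[&str],
    columns: &'a [(&'a str, i32)],
) -> Option<Vec<(&'a str, i32)>> {
    dataset_order
        .iter()
        .map(|name| columns.iter().find(|(n, _)| n == name).copied())
        .collect()
}

fn main() {
    // Logical projection: source columns first, filled column `c` last.
    let projected = [("a", 1), ("b", 2), ("c", 3)];
    // Dataset schema order puts `c` between `a` and `b`.
    let ordered = order_by_dataset_schema(&["a", "c", "b"], &projected).unwrap();
    assert_eq!(ordered, vec![("a", 1), ("c", 3), ("b", 2)]);
}
```

A positional invariant would have mis-assigned the trailing filled columns; keying by name makes the ordering independent of where the projection placed them.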

Tests

  • Refactored test_merge_insert_subcols to branch on scalar_index: v1 structural assertions (tombstoned fields, preserved fragment ids) stay on the scalar_index=true branch; the scalar_index=false branch asserts v2 structural behavior. A shared key-lookup-based check verifies that columns not in source retain original values across both paths.
  • Converted the old negative test_delete_not_supported into a positive test_delete_not_matched_by_source_on_v2_subcols test.
  • Updated test_sub_schema_upsert_fragment_bitmap to reflect v2 semantics (3 fragments after upsert, both indexes preserved — v2 relies on unindexed-fragment fallback rather than eagerly invalidating the vector index).
  • New test_merge_insert_subcols_v2_explain_plan confirms the v2 physical plan is used and that the filled other column appears in the projection.
  • New test_merge_insert_subcols_v2_rejects_non_nullable_insert for the new validation error.
  • New test_merge_insert_subcols_v2_bloom_filter for the bloom-filter conflict-detection acceptance criterion.
  • Updated the Python test_merge_insert_subcols to drop v1-specific file-layout assertions while preserving the semantic data checks (column c unchanged for updated rows, NULL for newly inserted rows).

Test plan

  • cargo check -p lance --lib --tests
  • cargo clippy -p lance --lib --tests -- -D warnings
  • cargo fmt --all
  • cargo test -p lance --lib dataset::write::merge_insert — 127 passed
  • cargo test -p lance --lib dataset::write — 190 passed
  • cargo test -p lance --lib dataset:: — 1058 passed
  • cargo test -p lance --lib index::tests — 63 passed
  • Python tests (pytest python/python/tests/test_dataset.py::test_merge_insert_subcols) — run in CI

🤖 Generated with Claude Code

github-actions bot added the `enhancement` (New feature or request) and `python` labels Apr 10, 2026
@codecov

codecov Bot commented Apr 10, 2026

Codecov Report

❌ Patch coverage is 96.17647% with 13 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...lance/src/dataset/write/merge_insert/exec/write.rs 77.41% 6 Missing and 1 partial ⚠️
rust/lance/src/dataset/write/merge_insert.rs 98.05% 3 Missing and 3 partials ⚠️


Contributor

@wjones127 wjones127 left a comment


Thanks for working on this. I'd like to see some tests to make sure this doesn't break in cases where there are camel case columns. Otherwise I think it looks good.

Comment thread on rust/lance/src/dataset/write/merge_insert.rs:
// physical plan is deterministic and easy to inspect in tests.
for field in dataset_schema.fields() {
if !source_field_names.contains(field.name()) {
df = df.with_column(
Contributor


issue(non-blocking): scanning these columns for all rows isn't ideal performance-wise. Consider the case where you are updating 3 rows with just part of the schema. This means you need to read all rows for missing columns in the table just to do this update. Ideally, we'd instead pull these columns in using TakeExec after the join.

I'm okay with keeping this code path for now, but if we do we should make a follow up ticket for optimizing to a Take later.

Contributor Author


Agreed — materializing every missing column across the full target scan is wasteful when the update set is small. The follow-up is already tracked in #4193 ("Optimize upsert with partial schema and no index"), which is linked from this PR's description. I'll add a note on #4193 pointing at the post-join TakeExec approach you described so the optimization direction is captured concretely rather than as a generic "revisit later".

wombatu-kun pushed a commit to wombatu-kun/lance that referenced this pull request Apr 11, 2026
Reviewer on lance-format#6472 flagged that DataFusion's `col()` lowercases
unquoted identifiers, so the partial-schema v2 fill-in and the
`on_cols` join quoting were untested against camelCase names.
Pin the behavior down with a focused test that uses a camelCase
join key and a camelCase column omitted from the source.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@wombatu-kun wombatu-kun force-pushed the feat/partial-schema-v2-merge-insert-6442 branch from 5bc7d21 to 8664b2c Compare April 11, 2026 01:05
@wombatu-kun wombatu-kun requested a review from wjones127 April 11, 2026 01:17
wombatu-kun pushed a commit to wombatu-kun/lance that referenced this pull request Apr 14, 2026
@wombatu-kun wombatu-kun force-pushed the feat/partial-schema-v2-merge-insert-6442 branch from 6d19c2f to 81c674b Compare April 14, 2026 02:17
Vova Kolmakov and others added 3 commits May 4, 2026 15:11
Partial-schema upserts (source column subset of the dataset schema) used
to fall back to the legacy v1 Merger path, which wrote only the changed
columns via Operation::Update with RewriteColumns. They now run through
the same create_plan / FullSchemaMergeInsertExec pipeline as full-schema
upserts. Missing source columns are filled from the target side of the
join via a post-join projection, so the v2 writer sees a complete set of
data columns and writes full rows as new fragments.

Consequences:

- explain_plan() now works for partial-schema upserts.
- Bloom-filter conflict detection (inserted_rows_filter) is populated
  for partial-schema operations on unenforced primary keys; v1 always
  returned None here.
- when_not_matched_by_source=Delete/DeleteIf is accepted for partial
  schema (previously rejected outright).
- Partial-schema upserts with insert_not_matched=InsertAll now reject
  non-nullable missing columns at the API boundary with a descriptive
  error naming the offending columns, instead of producing a confusing
  downstream writer error.
- prepare_stream_schema now emits data columns in dataset-schema order
  keyed by name. This turns an accidental positional invariant into an
  explicit name-based one and is required for the partial-schema path
  (synthetic filled columns land at the end of the logical projection).

The v1 RewriteColumns optimization is still used when the join key has
a scalar index (falls through to the old path). Reviving a similar
optimization on v2 is tracked separately as lance-format#4193.

Closes lance-format#6442

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
After rebasing onto main, our test added in
"feat: route partial-schema merge_insert through the v2 write path"
no longer compiled because main lance-format#6647 turned the public
`Dataset::object_store()` getter into an async, base-aware method
returning `Result<Arc<ObjectStore>>`. The other call site of
`read_transaction_file` in the same module already uses the
`pub(crate) object_store` field directly via `.as_ref()`; mirror that
pattern here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@wombatu-kun wombatu-kun force-pushed the feat/partial-schema-v2-merge-insert-6442 branch from cb3fdc2 to b80f89b Compare May 4, 2026 08:30
Contributor

@wjones127 wjones127 left a comment


This is looking good. Apologies for the delay in review. Thanks for working on this!

@wjones127
Contributor

Updated the branch. Will merge once CI is passing.

@wjones127 wjones127 merged commit 3cd8763 into lance-format:main May 12, 2026
28 checks passed

Labels

enhancement (New feature or request), python


Development

Successfully merging this pull request may close these issues.

Support partial schema upsert on v2 merge_insert path

2 participants