feat: route partial-schema merge_insert through the v2 write path #6472
Conversation
wjones127 left a comment
Thanks for working on this. I'd like to see some tests to make sure this doesn't break in cases where there are camel case columns. Otherwise I think it looks good.
// physical plan is deterministic and easy to inspect in tests.
for field in dataset_schema.fields() {
    if !source_field_names.contains(field.name()) {
        df = df.with_column(
issue(non-blocking): scanning these columns for all rows isn't ideal performance-wise. Consider the case where you are updating 3 rows with just part of the schema. This means you need to read all rows for missing columns in the table just to do this update. Ideally, we'd instead pull these columns in using TakeExec after the join.
I'm okay with keeping this code path for now, but if we do we should make a follow up ticket for optimizing to a Take later.
Agreed — materializing every missing column across the full target scan is wasteful when the update set is small. The follow-up is already tracked in #4193 ("Optimize upsert with partial schema and no index"), which is linked from this PR's description. I'll add a note on #4193 pointing at the post-join TakeExec approach you described so the optimization direction is captured concretely rather than as a generic "revisit later".
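For illustration, the scan-versus-take trade-off discussed in this thread can be sketched with a std-only toy. These helpers are hypothetical stand-ins, not Lance's actual `TakeExec`; the point is only the number of rows touched when few rows matched the join.

```rust
/// Scan-based fill-in: materialize the entire missing column, then keep
/// only the matched rows. Returns the filled values plus rows read.
fn fill_by_scan(column: &[i64], matched: &[usize]) -> (Vec<i64>, usize) {
    let all: Vec<i64> = column.to_vec(); // reads every row of the column
    let vals = matched.iter().map(|&i| all[i]).collect();
    (vals, column.len())
}

/// Take-based fill-in: read only the matched row addresses after the join.
fn fill_by_take(column: &[i64], matched: &[usize]) -> (Vec<i64>, usize) {
    let vals = matched.iter().map(|&i| column[i]).collect();
    (vals, matched.len()) // reads only matched.len() rows
}
```

With an 8-row column and 2 matched rows, both return the same values, but the scan variant reads 8 rows where the take variant reads 2; that gap is what the follow-up ticket targets.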
Reviewer on lance-format#6472 flagged that DataFusion's `col()` lowercases unquoted identifiers, so the partial-schema v2 fill-in and the `on_cols` join quoting were untested against camelCase names. Pin the behavior down with a focused test that uses a camelCase join key and a camelCase column omitted from the source. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
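Since DataFusion-style identifier parsing lowercases unquoted names, case-sensitive column names have to be wrapped in double quotes before being handed to `col()`. A minimal sketch of such a quoting helper (hypothetical, not the actual Lance code) looks like this, with embedded double quotes escaped by doubling per SQL rules:

```rust
/// Wrap a column name in double quotes so an SQL-style identifier parser
/// preserves its case. Embedded `"` characters are doubled.
fn quote_ident(name: &str) -> String {
    format!("\"{}\"", name.replace('"', "\"\""))
}
```

With this, a camelCase join key such as `camelKey` survives as `"camelKey"` instead of being normalized to `camelkey`.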
Force-pushed from 5bc7d21 to 8664b2c
Force-pushed from 6d19c2f to 81c674b
Partial-schema upserts (source column subset of the dataset schema) used to fall back to the legacy v1 Merger path, which wrote only the changed columns via Operation::Update with RewriteColumns. They now run through the same create_plan / FullSchemaMergeInsertExec pipeline as full-schema upserts. Missing source columns are filled from the target side of the join via a post-join projection, so the v2 writer sees a complete set of data columns and writes full rows as new fragments.

Consequences:
- explain_plan() now works for partial-schema upserts.
- Bloom-filter conflict detection (inserted_rows_filter) is populated for partial-schema operations on unenforced primary keys; v1 always returned None here.
- when_not_matched_by_source=Delete/DeleteIf is accepted for partial schema (previously rejected outright).
- Partial-schema upserts with insert_not_matched=InsertAll now reject non-nullable missing columns at the API boundary with a descriptive error naming the offending columns, instead of producing a confusing downstream writer error.
- prepare_stream_schema now emits data columns in dataset-schema order keyed by name. This turns an accidental positional invariant into an explicit name-based one and is required for the partial-schema path (synthetic filled columns land at the end of the logical projection).

The v1 RewriteColumns optimization is still used when the join key has a scalar index (falls through to the old path). Reviving a similar optimization on v2 is tracked separately as lance-format#4193.

Closes lance-format#6442

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
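The name-keyed ordering this commit describes for `prepare_stream_schema` can be sketched with a std-only toy (hypothetical helper, not the actual Lance code): emit data columns in dataset-schema order by looking each field up by name, rather than trusting the incoming positional order, so synthetic filled columns may sit anywhere in the input.

```rust
/// Reorder incoming (name, data) columns to match the dataset schema's
/// field order, keyed by name. Returns None if any schema column is
/// missing from the stream. `i64` stands in for real column data.
fn order_by_dataset_schema<'a>(
    dataset_fields: &[&'a str],
    incoming: &[(&'a str, i64)],
) -> Option<Vec<(&'a str, i64)>> {
    dataset_fields
        .iter()
        .map(|f| incoming.iter().find(|(n, _)| n == f).copied())
        .collect()
}
```

Under this scheme a filled column appended at the end of the logical projection still lands in its schema position in the output.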
After rebasing onto main, our test added in "feat: route partial-schema merge_insert through the v2 write path" no longer compiled because lance-format#6647 on main turned the public `Dataset::object_store()` getter into an async, base-aware method returning `Result<Arc<ObjectStore>>`. The other call site of `read_transaction_file` in the same module already uses the `pub(crate) object_store` field directly via `.as_ref()`; mirror that pattern here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Force-pushed from cb3fdc2 to b80f89b
wjones127 left a comment
This is looking good. Apologies for the delay in review. Thanks for working on this!
Updated the branch. Will merge once CI is passing.
Summary
Closes #6442.
Partial-schema upserts (source column subset of the dataset schema) used to fall back to the legacy v1 `Merger` path, which wrote only the changed columns via `Operation::Update` with `RewriteColumns`. They now run through the same `create_plan`/`FullSchemaMergeInsertExec` pipeline as full-schema upserts. Missing source columns are filled from the target side of the join via a post-join projection, so the v2 writer sees a complete set of data columns and writes full rows as new fragments.

Per the issue, this intentionally does not replicate v1's `RewriteColumns` optimization; that is tracked separately in #4193. The v1 path is still used when the join key has a scalar index (falls through the old gate), so the legacy optimization remains available for the scalar-index-on-join-key case.

How it works
- `can_use_create_plan` (gate): accepts a source schema that is a type-compatible subset of the dataset schema. If `insert_not_matched=InsertAll` and any missing target column is non-nullable, the call short-circuits with an `InvalidInput` error naming the offending columns.
- `create_plan` (plan construction): after the join and the `__action` column, every dataset field missing from the source is added as an unqualified column populated from `col("target.\"<name>\"")`. For matched rows this carries the existing target value; for unmatched source rows the outer join leaves it NULL.
- `necessary_children_exprs` (projection pushdown): the extension node now keeps unqualified columns whose name matches a dataset field, so the synthetic filled columns flow through to the write exec alongside `source.*`.
- `prepare_stream_schema`: data columns are now emitted in dataset schema order keyed by name. This turns an accidental positional invariant into an explicit name-based one and is required for the partial-schema path (synthetic filled columns land at the end of the logical projection).

Consequences
- `explain_plan()` now works for partial-schema upserts.
- Bloom-filter conflict detection (`inserted_rows_filter`) is populated for partial-schema operations on unenforced primary keys; v1 always returned `None`.
- `when_not_matched_by_source=Delete/DeleteIf` is accepted for partial-schema sources (previously rejected outright with a `NotSupported` error).
- Partial-schema upserts with `insert_not_matched=InsertAll` now reject non-nullable missing columns at the API boundary with a descriptive error naming the offending columns, instead of producing a confusing downstream writer error. User-visible behavior change.

Tests
- Extended `test_merge_insert_subcols` to branch on `scalar_index`: v1 structural assertions (tombstoned fields, preserved fragment ids) stay on the `scalar_index=true` branch; the `scalar_index=false` branch asserts v2 structural behavior. A shared key-lookup-based check verifies that columns not in source retain original values across both paths.
- Converted `test_delete_not_supported` into a positive `test_delete_not_matched_by_source_on_v2_subcols` test.
- Updated `test_sub_schema_upsert_fragment_bitmap` to reflect v2 semantics (3 fragments after upsert, both indexes preserved; v2 relies on unindexed-fragment fallback rather than eagerly invalidating the vector index).
- `test_merge_insert_subcols_v2_explain_plan` confirms the v2 physical plan is used and that the filled `other` column appears in the projection.
- `test_merge_insert_subcols_v2_rejects_non_nullable_insert` covers the new validation error.
- `test_merge_insert_subcols_v2_bloom_filter` covers the bloom-filter conflict-detection acceptance criterion.
- Updated `test_merge_insert_subcols` to drop v1-specific file-layout assertions while preserving the semantic data checks (column `c` unchanged for updated rows, `NULL` for newly inserted rows).

Test plan
- `cargo check -p lance --lib --tests`
- `cargo clippy -p lance --lib --tests -- -D warnings`
- `cargo fmt --all`
- `cargo test -p lance --lib dataset::write::merge_insert` (127 passed)
- `cargo test -p lance --lib dataset::write` (190 passed)
- `cargo test -p lance --lib dataset::` (1058 passed)
- `cargo test -p lance --lib index::tests` (63 passed)
- `pytest python/python/tests/test_dataset.py::test_merge_insert_subcols` (run in CI)

🤖 Generated with Claude Code
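As an illustration of the `InsertAll` gate described in the summary, here is a minimal std-only sketch (assumed shape, hypothetical names, not Lance's actual `can_use_create_plan` signature): when inserts are possible, any target column missing from the source must be nullable, since newly inserted rows would otherwise have no value for it.

```rust
/// Reject an insert-capable partial-schema merge when any non-nullable
/// target column is absent from the source. Fields are (name, nullable).
fn check_insertable(
    dataset_fields: &[(&str, bool)],
    source_names: &[&str],
) -> Result<(), String> {
    let offending: Vec<&str> = dataset_fields
        .iter()
        .copied()
        .filter(|&(name, nullable)| !nullable && !source_names.contains(&name))
        .map(|(name, _)| name)
        .collect();
    if offending.is_empty() {
        Ok(())
    } else {
        // Mirrors the descriptive error naming the offending columns.
        Err(format!(
            "insert would leave non-nullable columns unfilled: {}",
            offending.join(", ")
        ))
    }
}
```

Failing at this boundary surfaces the offending column names up front instead of a confusing downstream writer error.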