Skip to content

fix(merge_insert): use sentinel column for NULL-safe source row detection#6439

Merged
wjones127 merged 4 commits into
lance-format:mainfrom
pratik0316:personal/pratikdey/merge_insert_on_null
Apr 8, 2026
Merged

fix(merge_insert): use sentinel column for NULL-safe source row detection#6439
wjones127 merged 4 commits into
lance-format:mainfrom
pratik0316:personal/pratikdey/merge_insert_on_null

Conversation

@pratik0316
Copy link
Copy Markdown
Contributor

@pratik0316 pratik0316 commented Apr 8, 2026

Source rows with NULL ON key columns were silently dropped because the action assignment logic used ON_col IS NOT NULL as a proxy for "source row is present in the join output". This conflates a legitimate NULL key with a NULL introduced by the outer join on the target side.

Fix by injecting a lit(true) sentinel column into the source DataFrame before the join. After the join the sentinel is non-null for every source row and null only for target-only rows, making source row detection independent of ON column values.

Strip the sentinel in prepare_stream_schema before writing and propagate it through projection pushdown in necessary_children_exprs.

Before the join, inject a constant lit(true) column (__merge_source_sentinel) into every source row. After the join:

  • Source rows (whether matched or unmatched) → sentinel = true
  • Target-only rows (no source match) → sentinel = NULL (outer join NULL-fill)
    assign_action now uses sentinel IS NOT NULL to detect source row presence, making it correct regardless of what values the ON columns hold.

The sentinel is a pure logical column — it never touches disk. It's stripped in prepare_stream_schema before any data is written, and necessary_children_exprs is updated to propagate it through DataFusion's projection pushdown.

Example that was broken before:

Target: (id=1, record_type="A") and (id=0, record_type=NULL)
Source: (id=2, record_type=NULL) — new row, should be inserted
ON: ["id", "record_type"]
Old behavior: source row silently dropped (Action::Nothing)
New behavior: source row correctly inserted (Action::Insert)

Fixes: #4644

…tion

Source rows with NULL ON key columns were silently dropped because the
action assignment logic used `ON_col IS NOT NULL` as a proxy for "source
row is present in the join output". This conflates a legitimate NULL key
with a NULL introduced by the outer join on the target side.

Fix by injecting a `lit(true)` sentinel column into the source DataFrame
before the join. After the join the sentinel is non-null for every source
row and null only for target-only rows, making source row detection
independent of ON column values.

Strip the sentinel in `prepare_stream_schema` before writing and propagate
it through projection pushdown in `necessary_children_exprs`.



Signed-off-by: Pratik <pratikrocks.dey11@gmail.com>
@github-actions github-actions Bot added the bug Something isn't working label Apr 8, 2026
@pratik0316
Copy link
Copy Markdown
Contributor Author

Hi @wjones127 can you pls provide a review

@wjones127 wjones127 self-requested a review April 8, 2026 15:32
Copy link
Copy Markdown
Contributor

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a nice fix. There's some optional simplifications you can do with the tests. I've commented on one tests, but similar changes can be made to the others.

I will merge tomorrow to give you time to address those if you want.

Comment thread rust/lance/src/dataset/write/merge_insert.rs Outdated
Comment thread rust/lance/src/dataset/write/merge_insert.rs Outdated
@pratik0316
Copy link
Copy Markdown
Contributor Author

Thanks for the super fast review @wjones127 🙌

trying to adress the suggestions shortly

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 8, 2026

Codecov Report

❌ Patch coverage is 99.28058% with 1 line in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
rust/lance/src/dataset/write/merge_insert.rs 99.24% 0 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

Signed-off-by: Pratik <pratikrocks.dey11@gmail.com>
@pratik0316 pratik0316 requested a review from wjones127 April 8, 2026 16:44
Signed-off-by: Pratik <pratikrocks.dey11@gmail.com>
@github-actions github-actions Bot added the A-python Python bindings label Apr 8, 2026
…mn index sensitivity

The sentinel column added in the Rust fix changed the column indices in
the ProjectionExec expressions (e.g. _rowid@1 -> _rowid@0), breaking
the doctest pattern matches. Replace the specific column expressions with
[...] so the tests don't break when internal indices shift.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@wjones127 wjones127 merged commit 46650e6 into lance-format:main Apr 8, 2026
28 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-python Python bindings bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

merge_insert skips rows with NULL in on column

2 participants