fix(merge_insert): use sentinel column for NULL-safe source row detection by pratik0316 · Pull Request #6439 · lance-format/lance

pratik0316 · 2026-04-08T15:06:01Z

Source rows with NULL ON key columns were silently dropped because the action assignment logic used ON_col IS NOT NULL as a proxy for "source row is present in the join output". This conflates a legitimate NULL key with a NULL introduced by the outer join on the target side.

Fix by injecting a lit(true) sentinel column into the source DataFrame before the join. After the join the sentinel is non-null for every source row and null only for target-only rows, making source row detection independent of ON column values.

Strip the sentinel in prepare_stream_schema before writing and propagate it through projection pushdown in necessary_children_exprs.

Before the join, inject a constant lit(true) column (__merge_source_sentinel) into every source row. After the join:

Source rows (whether matched or unmatched) → sentinel = true
Target-only rows (no source match) → sentinel = NULL (outer join NULL-fill)

assign_action now uses sentinel IS NOT NULL to detect source row presence, making it correct regardless of what values the ON columns hold.
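The sentinel mechanics described above can be sketched in plain Rust without DataFusion. This is only an illustration under stated assumptions: `SourceRow`, `JoinedRow`, `with_sentinel`, `full_outer_join`, and `is_source_row` are hypothetical names, the ON key is a single nullable integer, and the join uses SQL-style equality where NULL never matches. It is not the code from this PR.

```rust
/// A source row after sentinel injection (hypothetical type for illustration).
#[derive(Debug, Clone, PartialEq)]
struct SourceRow {
    key: Option<i64>,
    sentinel: bool, // always true, mirroring the lit(true) column
}

/// One row of the full outer join output.
#[derive(Debug, PartialEq)]
struct JoinedRow {
    source_key: Option<i64>,
    target_key: Option<i64>,
    sentinel: Option<bool>, // None only where the join NULL-filled the source side
}

/// Attach the constant sentinel to every source row before the join.
fn with_sentinel(keys: &[Option<i64>]) -> Vec<SourceRow> {
    keys.iter().map(|&key| SourceRow { key, sentinel: true }).collect()
}

/// Naive full outer join on the key column.
fn full_outer_join(source: &[SourceRow], target: &[Option<i64>]) -> Vec<JoinedRow> {
    let mut out = Vec::new();
    let mut target_matched = vec![false; target.len()];
    for s in source {
        let mut matched = false;
        for (i, t) in target.iter().enumerate() {
            // Assumed SQL-style equality: a NULL key never equals anything.
            if s.key.is_some() && s.key == *t {
                matched = true;
                target_matched[i] = true;
                out.push(JoinedRow { source_key: s.key, target_key: *t, sentinel: Some(s.sentinel) });
            }
        }
        if !matched {
            // Source-only row: target side is NULL-filled, sentinel survives.
            out.push(JoinedRow { source_key: s.key, target_key: None, sentinel: Some(s.sentinel) });
        }
    }
    for (i, t) in target.iter().enumerate() {
        if !target_matched[i] {
            // Target-only row: source side, including the sentinel, is NULL-filled.
            out.push(JoinedRow { source_key: None, target_key: *t, sentinel: None });
        }
    }
    out
}

/// Source row detection: `sentinel IS NOT NULL`.
fn is_source_row(row: &JoinedRow) -> bool {
    row.sentinel.is_some()
}
```

Joining source keys `[Some(1), None]` against target keys `[Some(1), Some(3)]` yields one matched row, one source-only row whose key is NULL, and one target-only row. Only the last has a NULL sentinel, so the NULL-keyed source row is still recognized as coming from the source.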

The sentinel is a pure logical column — it never touches disk. It's stripped in prepare_stream_schema before any data is written, and necessary_children_exprs is updated to propagate it through DataFusion's projection pushdown.
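As a rough illustration of the stripping step, the sketch below drops the sentinel from a list of field names before anything is written. It assumes a schema can be modeled as a slice of column names; `strip_sentinel` is a hypothetical helper, not the signature of `prepare_stream_schema`, which operates on the real stream schema.

```rust
/// Sentinel column name as given in the PR description.
const SOURCE_SENTINEL: &str = "__merge_source_sentinel";

/// Hypothetical helper: return the field names that should reach the writer,
/// preserving their order and leaving out the logical-only sentinel column.
fn strip_sentinel(fields: &[&str]) -> Vec<String> {
    fields
        .iter()
        .filter(|name| **name != SOURCE_SENTINEL)
        .map(|name| name.to_string())
        .collect()
}
```

Calling `strip_sentinel(&["id", "record_type", "__merge_source_sentinel"])` returns only `id` and `record_type`, and a schema without the sentinel passes through unchanged.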

Example that was broken before:

Target: (id=1, record_type="A") and (id=0, record_type=NULL)
Source: (id=2, record_type=NULL) — new row, should be inserted
ON: ["id", "record_type"]
Old behavior: source row silently dropped (Action::Nothing)
New behavior: source row correctly inserted (Action::Insert)
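The example above can be replayed with hand-built join output rows. This sketch makes several assumptions: the old check is modeled as "every source ON column is non-null", `Action::Update` is a hypothetical stand-in for whatever happens to matched rows, and `JoinedRow`, `decide`, `old_action`, `new_action`, and `example_rows` are illustrative names rather than Lance APIs.

```rust
/// `Insert` and `Nothing` are named in the PR; `Update` is a hypothetical
/// placeholder for the matched-row action.
#[derive(Debug, PartialEq)]
enum Action {
    Insert,
    Update,
    Nothing,
}

/// Join output row for ON = ["id", "record_type"].
struct JoinedRow {
    source_id: Option<i64>,
    source_record_type: Option<&'static str>,
    sentinel: Option<bool>,
    target_present: bool,
}

fn decide(source_present: bool, target_present: bool) -> Action {
    match (source_present, target_present) {
        (true, true) => Action::Update,
        (true, false) => Action::Insert,
        (false, _) => Action::Nothing,
    }
}

/// Old proxy (assumed form): the source row counts as present only if
/// every source ON column is non-null.
fn old_action(row: &JoinedRow) -> Action {
    let source_present = row.source_id.is_some() && row.source_record_type.is_some();
    decide(source_present, row.target_present)
}

/// New check: `sentinel IS NOT NULL`.
fn new_action(row: &JoinedRow) -> Action {
    decide(row.sentinel.is_some(), row.target_present)
}

/// The three rows the outer join produces for the example data.
fn example_rows() -> Vec<JoinedRow> {
    vec![
        // Source (id=2, record_type=NULL): no target match.
        JoinedRow { source_id: Some(2), source_record_type: None, sentinel: Some(true), target_present: false },
        // Target (id=1, record_type="A"): no source match, source side NULL-filled.
        JoinedRow { source_id: None, source_record_type: None, sentinel: None, target_present: true },
        // Target (id=0, record_type=NULL): no source match, source side NULL-filled.
        JoinedRow { source_id: None, source_record_type: None, sentinel: None, target_present: true },
    ]
}
```

For the first row, `old_action` yields `Action::Nothing` because `record_type` is NULL, while `new_action` yields `Action::Insert`. Both checks agree on `Action::Nothing` for the two target-only rows.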

Fixes: #4644

pratik0316 · 2026-04-08T15:07:44Z

Hi @wjones127, could you please review this?

wjones127

This looks like a nice fix. There are some optional simplifications you can make to the tests. I've commented on one test, but similar changes can be made to the others.

I will merge tomorrow to give you time to address those if you want.

pratik0316 · 2026-04-08T15:48:47Z

Thanks for the super fast review @wjones127 🙌

I'll try to address the suggestions shortly.

codecov · 2026-04-08T16:07:58Z

Codecov Report

❌ Patch coverage is 99.28058% with 1 line in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
rust/lance/src/dataset/write/merge_insert.rs	99.24%	0 Missing and 1 partial ⚠️

Signed-off-by: Pratik <pratikrocks.dey11@gmail.com>

…mn index sensitivity

The sentinel column added in the Rust fix changed the column indices in the ProjectionExec expressions (e.g. _rowid@1 -> _rowid@0), breaking the doctest pattern matches. Replace the specific column expressions with [...] so the tests don't break when internal indices shift.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions Bot added the bug Something isn't working label Apr 8, 2026

wjones127 self-requested a review April 8, 2026 15:32

wjones127 approved these changes Apr 8, 2026

add unit tests

fc23fa4

Signed-off-by: Pratik <pratikrocks.dey11@gmail.com>

pratik0316 requested a review from wjones127 April 8, 2026 16:44

fix doctest

4b3640c

Signed-off-by: Pratik <pratikrocks.dey11@gmail.com>

github-actions Bot added the A-python Python bindings label Apr 8, 2026

wjones127 merged commit 46650e6 into lance-format:main Apr 8, 2026
28 checks passed

wjones127 merged 4 commits into lance-format:main from pratik0316:personal/pratikdey/merge_insert_on_null