Background
We are migrating from an implementation of merge insert that manually manipulates streams to one that constructs a DataFusion plan and runs optimizers on that plan for optimize it.
The old code path is in execute_uncommitted_impl:
https://github.com/lancedb/lance/blob/94dcaf5e735480f9f61216b4e4f90358dbeac2ba/rust/lance/src/dataset/write/merge_insert.rs#L1337-L1340
While the new codepath is under execute_uncommitted_v2:
https://github.com/lancedb/lance/blob/94dcaf5e735480f9f61216b4e4f90358dbeac2ba/rust/lance/src/dataset/write/merge_insert.rs#L1352-L1371
Task
Goal: upsert queries that provide a subset of columns use the new code path.
Lance implements updates as delete plus insert. Therefore, if the user supplies a subset of the columns, we will need to read the existing columns values for updated rows before writing out the new rows. This should be handled by TakeExec.
Caveats
The current implementation has special behavior for how it updates existing rows, which is tested here:
https://github.com/lancedb/lance/blob/94dcaf5e735480f9f61216b4e4f90358dbeac2ba/rust/lance/src/dataset/write/merge_insert.rs#L2376-L2381
Instead of marked the old rows as deleted and then moving the new versions of those rows to a new file, it marks the updated columns as dropped, and adds new files in the existing fragments. This avoids rewriting large columns if they are not part of the update.
We should find a way to make an optimizer rule that is smart about choosing which write code path to choose.
TODO
Background
We are migrating from an implementation of merge insert that manually manipulates streams to one that constructs a DataFusion plan and runs optimizers on that plan for optimize it.
The old code path is in
execute_uncommitted_impl:https://github.com/lancedb/lance/blob/94dcaf5e735480f9f61216b4e4f90358dbeac2ba/rust/lance/src/dataset/write/merge_insert.rs#L1337-L1340
While the new codepath is under
execute_uncommitted_v2:https://github.com/lancedb/lance/blob/94dcaf5e735480f9f61216b4e4f90358dbeac2ba/rust/lance/src/dataset/write/merge_insert.rs#L1352-L1371
Task
Goal: upsert queries that provide a subset of columns use the new code path.
Lance implements updates as delete plus insert. Therefore, if the user supplies a subset of the columns, we will need to read the existing columns values for updated rows before writing out the new rows. This should be handled by
TakeExec.Caveats
The current implementation has special behavior for how it updates existing rows, which is tested here:
https://github.com/lancedb/lance/blob/94dcaf5e735480f9f61216b4e4f90358dbeac2ba/rust/lance/src/dataset/write/merge_insert.rs#L2376-L2381
Instead of marked the old rows as deleted and then moving the new versions of those rows to a new file, it marks the updated columns as dropped, and adds new files in the existing fragments. This avoids rewriting large columns if they are not part of the update.
We should find a way to make an optimizer rule that is smart about choosing which write code path to choose.
TODO
TakeNodethat implementsUserDefinedLocalNodeCoreand gets turned into aTakeExecwhen applied with a planner.create_planimplementation to add any missing columns from source usingTakeNode.TakeNodethat don't get any columns.