Optimize upsert with partial schema and no index

## Background

We are migrating from an implementation of merge insert that manually manipulates streams to one that constructs a DataFusion plan and runs optimizers on that plan for optimize it.

The old code path is in `execute_uncommitted_impl`:

https://github.com/lancedb/lance/blob/94dcaf5e735480f9f61216b4e4f90358dbeac2ba/rust/lance/src/dataset/write/merge_insert.rs#L1337-L1340

While the new codepath is under `execute_uncommitted_v2`:

https://github.com/lancedb/lance/blob/94dcaf5e735480f9f61216b4e4f90358dbeac2ba/rust/lance/src/dataset/write/merge_insert.rs#L1352-L1371

## Task

**Goal**: upsert queries that provide a subset of columns use the new code path.

Lance implements updates as delete plus insert. Therefore, if the user supplies a subset of the columns, we will need to read the existing columns values for updated rows before writing out the new rows. This should be handled by `TakeExec`.

## Caveats

The current implementation has special behavior for how it updates existing rows, which is tested here:

https://github.com/lancedb/lance/blob/94dcaf5e735480f9f61216b4e4f90358dbeac2ba/rust/lance/src/dataset/write/merge_insert.rs#L2376-L2381

Instead of marked the old rows as deleted and then moving the new versions of those rows to a new file, it marks the updated columns as dropped, and adds new files in the existing fragments. This avoids rewriting large columns if they are not part of the update.

We should find a way to make an optimizer rule that is smart about choosing which write code path to choose.

## TODO

* [ ] Create a `TakeNode` that implements `UserDefinedLocalNodeCore` and gets turned into a `TakeExec` when applied with a planner.
* [ ] Add a parameter to merge_insert to choose to use the in-place update write pattern, that will route to the current write behavior. This can default to the new write pattern.
* [ ] Alter the `create_plan` implementation to add any missing columns from source using `TakeNode`.
* [ ] Add an optimize rule that eliminates `TakeNode` that don't get any columns.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize upsert with partial schema and no index #4193

Background

Task

Caveats

TODO

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Optimize upsert with partial schema and no index #4193

Description

Background

Task

Caveats

TODO

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions