Skip to content

Optimize upsert with partial schema and no index #4193

@wjones127

Description

@wjones127

Background

We are migrating from an implementation of merge insert that manually manipulates streams to one that constructs a DataFusion plan and runs optimizers on that plan for optimize it.

The old code path is in execute_uncommitted_impl:

https://github.com/lancedb/lance/blob/94dcaf5e735480f9f61216b4e4f90358dbeac2ba/rust/lance/src/dataset/write/merge_insert.rs#L1337-L1340

While the new codepath is under execute_uncommitted_v2:

https://github.com/lancedb/lance/blob/94dcaf5e735480f9f61216b4e4f90358dbeac2ba/rust/lance/src/dataset/write/merge_insert.rs#L1352-L1371

Task

Goal: upsert queries that provide a subset of columns use the new code path.

Lance implements updates as delete plus insert. Therefore, if the user supplies a subset of the columns, we will need to read the existing columns values for updated rows before writing out the new rows. This should be handled by TakeExec.

Caveats

The current implementation has special behavior for how it updates existing rows, which is tested here:

https://github.com/lancedb/lance/blob/94dcaf5e735480f9f61216b4e4f90358dbeac2ba/rust/lance/src/dataset/write/merge_insert.rs#L2376-L2381

Instead of marked the old rows as deleted and then moving the new versions of those rows to a new file, it marks the updated columns as dropped, and adds new files in the existing fragments. This avoids rewriting large columns if they are not part of the update.

We should find a way to make an optimizer rule that is smart about choosing which write code path to choose.

TODO

  • Create a TakeNode that implements UserDefinedLocalNodeCore and gets turned into a TakeExec when applied with a planner.
  • Add a parameter to merge_insert to choose to use the in-place update write pattern, that will route to the current write behavior. This can default to the new write pattern.
  • Alter the create_plan implementation to add any missing columns from source using TakeNode.
  • Add an optimize rule that eliminates TakeNode that don't get any columns.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions