Skip to content

Differentiate streaming and materialized inputs for add and merge insert #4583

@wjones127

Description

@wjones127

Most of our APIs assume inputs are a stream. This is nice in that it supports larger-than-memory writes. However, if data is fully materialized, we can often do things more optimally. To give a few examples:

  1. If we have 2 million rows in memory we want to insert, we can write two data files in parallel. Currently we write them sequentially.
  2. To support retries for write operations, we buffer data on disk. This could by bypassed if the data is in memory.
  3. For merge_insert, we can compute basic statistics like num_rows and num_bytes, which can be used by DataFusion to optimize the join order. Currently we always use the table id column as the build side, but for large tables that is suboptimal.

Having an API would also support other downstream use cases: lancedb/lancedb#2602

API

In Rust, define an enum and conversion traits to take common input using generic APIs:

struct InputData {
    Stream(SendableRecordBatchStream)
    Materialized {
        batches: Vec<RecordBatch>,
        schema: SchemaRef,
    }
}

pub fn insert(data: impl Into<InputData>) -> { ... }

impl From<RecordBatch> for InputData { ... }
impl From<Vec<RecordBatch>> for InputData { ... }
impl From<Box<dyn RecordBatchReader>> for InputData { ... }

In Python, we want to make sure various inputs gets converted to the correct type.

Materialized:

  • pa.Table
  • pd.DataFrame
  • pa.RecordBatch

Stream:

  • pa.RecordBatchReader
  • pa.Dataset
  • pa.Scanner

TODO

  • Define InputData and conversion traits
  • Change write APIs to take impl Into<InputData>
  • Make sure merge_insert converts InputData::Materialized into MemTable instead of OneShotPartitionStream.
    • This should solve use case 3
  • Change new_source_iter to not spill when using InputData::Materialized
    • This should solve use case 2

(Note: we'll leave use case 1 for a follow up)

Metadata

Metadata

Assignees

Labels

enhancementNew feature or request

Type

No type

Projects

No projects

Relationships

None yet

Development

No branches or pull requests

Issue actions