Differentiate streaming and materialized inputs for add and merge insert #4583

Most of our APIs assume the input is a stream. This is nice because it supports larger-than-memory writes. However, when the data is already fully materialized in memory, we can often do things more efficiently. To give a few examples:

1. If we have 2 million rows in memory we want to insert, we can write two data files in parallel. Currently we write them sequentially.
2. To support retries for write operations, we buffer data on disk. This could be bypassed if the data is already in memory.
3. For merge_insert, we can compute basic statistics like `num_rows` and `num_bytes`, which can be used by DataFusion to optimize the join order. Currently we always use the table id column as the build side, but for large tables that is suboptimal.
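To make the first example concrete, here is a minimal sketch of writing chunks of in-memory data in parallel. It is only an illustration: `write_file` and `write_parallel` are hypothetical names, and a plain slice of `u64` values stands in for record batches. Nothing here reflects the actual Lance writer.

```rust
use std::thread;

/// Hypothetical stand-in for writing one data file.
/// Returns the number of rows written.
fn write_file(file_index: usize, rows: &[u64]) -> usize {
    // A real writer would encode and persist `rows`; this sketch only counts them.
    let _ = file_index;
    rows.len()
}

/// Split fully materialized rows into chunks of at most `max_rows_per_file`
/// and write each chunk on its own scoped thread.
/// Returns the row count of each file, in file order.
fn write_parallel(rows: &[u64], max_rows_per_file: usize) -> Vec<usize> {
    assert!(max_rows_per_file > 0, "max_rows_per_file must be positive");
    thread::scope(|scope| {
        // Spawn one writer per chunk. This is only possible because the
        // total size is known up front, which is not the case for a stream.
        let handles: Vec<_> = rows
            .chunks(max_rows_per_file)
            .enumerate()
            .map(|(i, chunk)| scope.spawn(move || write_file(i, chunk)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("writer thread panicked"))
            .collect()
    })
}
```

With 2 million rows and a limit of 1 million rows per file, `write_parallel` starts two writers at once instead of writing the files one after the other.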

Having this API would also support other downstream use cases: https://github.com/lancedb/lancedb/issues/2602

## API

In Rust, define an enum and conversion traits so that generic APIs can accept common input types:

```rust
pub enum InputData {
    /// Data of unknown size that arrives incrementally.
    Stream(SendableRecordBatchStream),
    /// Data that is already fully in memory.
    Materialized {
        batches: Vec<RecordBatch>,
        schema: SchemaRef,
    },
}

pub fn insert(data: impl Into<InputData>) -> { ... }

impl From<RecordBatch> for InputData { ... }
impl From<Vec<RecordBatch>> for InputData { ... }
impl From<Box<dyn RecordBatchReader>> for InputData { ... }
```
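The following self-contained sketch shows how the conversion traits could fit together. To keep it runnable without external crates, `RecordBatch` and `BatchStream` are simplified stand-ins (assumptions) for the Arrow and DataFusion types, the schema field is omitted, and `num_rows` and the return type of `insert` are hypothetical.

```rust
/// Stand-in for Arrow's `RecordBatch`; only the row count is modeled.
#[derive(Debug, Clone, PartialEq)]
pub struct RecordBatch {
    pub num_rows: usize,
}

/// Stand-in for a record batch stream: any sendable iterator of batches.
pub type BatchStream = Box<dyn Iterator<Item = RecordBatch> + Send>;

pub enum InputData {
    Stream(BatchStream),
    Materialized { batches: Vec<RecordBatch> },
}

impl From<RecordBatch> for InputData {
    fn from(batch: RecordBatch) -> Self {
        InputData::Materialized { batches: vec![batch] }
    }
}

impl From<Vec<RecordBatch>> for InputData {
    fn from(batches: Vec<RecordBatch>) -> Self {
        InputData::Materialized { batches }
    }
}

impl From<BatchStream> for InputData {
    fn from(stream: BatchStream) -> Self {
        InputData::Stream(stream)
    }
}

impl InputData {
    /// The row count is known up front only for materialized data.
    pub fn num_rows(&self) -> Option<usize> {
        match self {
            InputData::Stream(_) => None,
            InputData::Materialized { batches } => {
                Some(batches.iter().map(|b| b.num_rows).sum())
            }
        }
    }
}

/// Hypothetical write entry point; it only reports what it knows about the input.
pub fn insert(data: impl Into<InputData>) -> Option<usize> {
    data.into().num_rows()
}
```

Callers can then pass a single batch, a vector of batches, or a boxed stream to the same function, and the callee can branch on the variant to pick the cheaper path.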

In Python, we want to make sure the various inputs get converted to the correct type.

Materialized:
* `pa.Table`
* `pd.DataFrame`
* `pa.RecordBatch`

Stream:
* `pa.RecordBatchReader`
* `pa.dataset.Dataset`
* `pa.dataset.Scanner`

## TODO

* [ ] Define `InputData` and conversion traits
* [ ] Change write APIs to take `impl Into<InputData>`
* [ ] Make sure `merge_insert` converts `InputData::Materialized` into [MemTable](https://docs.rs/datafusion/latest/datafusion/datasource/struct.MemTable.html) instead of `OneShotPartitionStream`.
     * This should solve use case 3
* [ ] Change `new_source_iter` to not spill when using `InputData::Materialized`
     * This should solve use case 2

(Note: we'll leave use case 1 for a follow up)
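
As a rough sketch of the decisions behind use cases 2 and 3, the function below picks a plan from what is known about the source. All names (`BuildSide`, `WritePlan`, `plan_merge_insert`) are hypothetical, and the rule "build the hash join on the smaller side" is a simplifying assumption about what the DataFusion optimizer would do once statistics are available.

```rust
#[derive(Debug, PartialEq)]
enum BuildSide {
    /// The new data passed to merge_insert.
    Source,
    /// The existing table.
    Target,
}

#[derive(Debug, PartialEq)]
struct WritePlan {
    /// Whether the input must be buffered on disk so a retry can replay it.
    spill_to_disk: bool,
    /// Which side of the join is used as the build side.
    build_side: BuildSide,
}

/// `source_rows` is `Some(n)` for materialized input with `n` rows
/// and `None` for a stream of unknown size.
fn plan_merge_insert(source_rows: Option<usize>, target_rows: usize) -> WritePlan {
    match source_rows {
        // Stream: size unknown, so keep a fixed build side (assumed here to be
        // the table) and spill so that retries can re-read the data.
        None => WritePlan {
            spill_to_disk: true,
            build_side: BuildSide::Target,
        },
        // Materialized: the data can be replayed from memory, and the known
        // row count lets the smaller side be chosen as the build side.
        Some(n) => WritePlan {
            spill_to_disk: false,
            build_side: if n <= target_rows {
                BuildSide::Source
            } else {
                BuildSide::Target
            },
        },
    }
}
```

For example, merging 100 in-memory rows into a table of 1 million rows would skip the spill and build on the source, while a stream falls back to the fixed plan.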

