Most of our APIs assume inputs are a stream. This is nice in that it supports larger-than-memory writes. However, if data is fully materialized, we can often do things more optimally. To give a few examples:
- If we have 2 million rows in memory that we want to insert, we could write two data files in parallel. Currently we write them sequentially.
- To support retries for write operations, we buffer data on disk. This could be bypassed if the data is in memory.
- For merge_insert, we can compute basic statistics like `num_rows` and `num_bytes`, which can be used by DataFusion to optimize the join order (see the sketch after this list). Currently we always use the table id column as the build side, but for large tables that is suboptimal.
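As a rough illustration of the third example (the helper name is made up; this is a sketch of the kind of statistics that become available, not proposed API):

```rust
use arrow_array::RecordBatch;

/// Hypothetical helper: cheap statistics that are only available when the
/// input is fully materialized. A planner such as DataFusion could use them
/// to pick the smaller relation as the build side of the merge_insert join.
fn materialized_stats(batches: &[RecordBatch]) -> (usize, usize) {
    let num_rows: usize = batches.iter().map(|b| b.num_rows()).sum();
    let num_bytes: usize = batches.iter().map(|b| b.get_array_memory_size()).sum();
    (num_rows, num_bytes)
}
```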
Having such an API would also support other downstream use cases: lancedb/lancedb#2602
API
In Rust, define an enum and conversion traits so that generic APIs can accept common input types:
```rust
pub enum InputData {
    Stream(SendableRecordBatchStream),
    Materialized {
        batches: Vec<RecordBatch>,
        schema: SchemaRef,
    },
}

pub fn insert(data: impl Into<InputData>) -> Result<()> { ... }

impl From<RecordBatch> for InputData { ... }
impl From<Vec<RecordBatch>> for InputData { ... }
impl From<Box<dyn RecordBatchReader>> for InputData { ... }
```

In Python, we want to make sure the various input types get converted to the correct variant.
Materialized:
- `pa.Table`
- `pd.DataFrame`
- `pa.RecordBatch`

Stream:
- `pa.RecordBatchReader`
- `pa.Dataset`
- `pa.Scanner`
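As a rough sketch of one of the conversion traits (the empty-input case and error handling are glossed over; this is illustrative, not final API):

```rust
use arrow_array::RecordBatch;

impl From<Vec<RecordBatch>> for InputData {
    fn from(batches: Vec<RecordBatch>) -> Self {
        // Assumes at least one batch; a real implementation would need some
        // way to supply the schema for empty input.
        let schema = batches[0].schema();
        InputData::Materialized { batches, schema }
    }
}
```

Capturing the schema at conversion time is what lets downstream code build an in-memory table or compute statistics without consuming the data.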
TODO
- Define `InputData` and conversion traits
- Change write APIs to take `impl Into<InputData>`
- Make sure `merge_insert` converts `InputData::Materialized` into a MemTable instead of a `OneShotPartitionStream` (rough sketch below). This should solve use case 3.
- Change `new_source_iter` to not spill when using `InputData::Materialized`. This should solve use case 2.
(Note: we'll leave use case 1 for a follow-up.)
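For the merge_insert item, the conversion could look roughly like this (a sketch assuming DataFusion's `MemTable`; the function name and the streaming arm are illustrative, not the actual code paths):

```rust
use std::sync::Arc;
use datafusion::datasource::{MemTable, TableProvider};
use datafusion::error::Result;

// Hypothetical glue: materialized input becomes a MemTable, whose row count
// and byte size are visible to the planner, instead of an opaque one-shot stream.
fn source_table(input: InputData) -> Result<Arc<dyn TableProvider>> {
    match input {
        InputData::Materialized { batches, schema } => {
            Ok(Arc::new(MemTable::try_new(schema, vec![batches])?))
        }
        InputData::Stream(_stream) => {
            // Streaming input keeps the existing path (OneShotPartitionStream
            // in today's code); omitted here.
            unimplemented!("wrap the stream as before")
        }
    }
}
```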