Based on discussion #3842
We have already added the basic metadata structure that allows an internal implementation of LSM writes. The newly created tasks cover contributing that implementation to open source Lance.
In short, we will create a new job type, `LogStructuredMergeJob`. From the user's perspective, this is a long-running job: the user calls `start`, and the job then continuously accepts writes. Writes are batched and written to the WAL asynchronously and, at the same time, to a MemTable kept in memory (the MemTable is itself a Lance table).
A few internal details:
- Each job works against a specific region of the table; the region definition is up to the job creator (for more details, see point 5).
- On creation, the MemTable has the same set of indexes as the source table, and each write updates the MemTable indexes accordingly.
- When the MemTable reaches a configurable size threshold, it triggers a flush of the MemTable to disk.
- We expect an asynchronous process (e.g. a table maintenance process) to continuously merge the flushed MemTables into the source Lance table; after this merge, the MemTable can be dropped from the index.
- This job is intended to be integrated with distributed engines such as Ray and Spark that can launch distributed writers; each "writer" can create a job for its specific region and continuously accept writes.
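The write path described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyLsmJob`, its fields, the row-count threshold, and the `id` primary key are all hypothetical and are not the Lance API. The WAL write is synchronous here for brevity, and the MemTable is a plain dict rather than an indexed Lance table.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLsmJob:
    """Toy stand-in for the proposed job; hypothetical, not the Lance API."""
    region: str                          # region this job is responsible for
    flush_threshold: int = 4             # assumed: flush once this many rows are held
    wal: list = field(default_factory=list)       # append-only log of batches
    memtable: dict = field(default_factory=dict)  # primary key -> latest row
    flushed: list = field(default_factory=list)   # flushed MemTables, oldest first
    started: bool = False

    def start(self):
        self.started = True

    def write(self, batch):
        """Accept a batch of rows (dicts carrying an 'id' primary key)."""
        if not self.started:
            raise RuntimeError("call start() before writing")
        self.wal.append(list(batch))     # a real job would write the WAL asynchronously
        for row in batch:
            self.memtable[row["id"]] = row   # a later write for the same key wins
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Move the current MemTable to the flushed list and start a new one."""
        if self.memtable:
            self.flushed.append(self.memtable)
            self.memtable = {}


job = ToyLsmJob(region="region-0", flush_threshold=3)
job.start()
job.write([{"id": 1, "v": "a"}, {"id": 2, "v": "b"}])
job.write([{"id": 2, "v": "b2"}, {"id": 3, "v": "c"}])  # reaches the threshold
job.write([{"id": 4, "v": "d"}])                        # lands in a fresh MemTable
```

In this toy model the flushed MemTables simply accumulate in a list; in the plan above, a separate maintenance process would merge them into the source table and then drop them.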
When reading data, the scanner exposes two options:
- use memwal index - lets the scanner look into all flushed but not yet merged MemTables and create a merged scan plan.
- use job - lets the user supply a running job to the scanner, so that the scanner also gains access to the in-memory MemTable for the merged scan plan.
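To make the difference between the two read options concrete, here is a minimal sketch of a merged read. It assumes rows are keyed by a hypothetical `id` primary key and that newer sources override older ones; `merged_scan` and its parameters are illustrative names, not the scanner's actual interface.

```python
def merged_scan(base_rows, flushed_memtables, live_memtable=None):
    """Merge rows by primary key; newer sources override older ones.

    base_rows:          rows already in the source table
    flushed_memtables:  flushed but unmerged MemTables, oldest first
    live_memtable:      in-memory MemTable of a running job, if one is supplied
    """
    merged = {row["id"]: row for row in base_rows}
    for memtable in flushed_memtables:
        merged.update(memtable)
    if live_memtable is not None:
        merged.update(live_memtable)
    return [merged[key] for key in sorted(merged)]


base = [{"id": 1, "v": "base"}, {"id": 2, "v": "base"}]
flushed = [{2: {"id": 2, "v": "flushed"}}]
live = {3: {"id": 3, "v": "live"}}

index_only = merged_scan(base, flushed)       # corresponds to "use memwal index"
with_job = merged_scan(base, flushed, live)   # corresponds to "use job"
```

Without a job, the read sees only what has been flushed; supplying the job additionally exposes rows that exist only in memory.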
Some future work is also listed but is not in the immediate plan:
- support delete markers in the job (the plan above only allows users to write record batches, similar to merge-insert)
Some identified bug fixes:
- primary key field IDs should be ordered