Asakiny edited this page Jan 18, 2022 · 1 revision

LakeSoul is a unified streaming and batch table storage solution built by the DMetaSoul team on top of the Apache Spark engine. It supports scalable metadata management, ACID transactions, efficient and flexible upsert operations, schema evolution, and streaming and batch unification.

LakeSoul specializes in row and column level incremental upserts, highly concurrent writes, and bulk scans of data on cloud storage. Its cloud native architecture separates computing from storage, which makes deployment very simple while supporting huge amounts of data at lower cost.

To be specific, LakeSoul has the following characteristics:

  • Elastic framework: Computing and storage are completely separated. With no fixed nodes or disks, each scales elastically on its own, and LakeSoul includes many optimizations for cloud storage, such as concurrency consistency on object storage and incremental updates. There is no need to maintain fixed storage nodes, and the cost of object storage in the cloud is only 1/10 of that of local disk, which greatly reduces storage and operating costs.
  • Efficient and scalable metadata management: LakeSoul uses Cassandra to manage metadata, which handles metadata modifications efficiently and supports multiple concurrent writers. This addresses a problem in data lake systems such as Delta Lake, which maintain metadata in files: metadata parsing becomes slow after long-running use, and metadata can only be written at a single point.
  • ACID transactions: An undo and redo mechanism ensures that commits are transactional, so users never see inconsistent data.
  • Multi-level partitioning and efficient upsert: LakeSoul supports range and hash partitioning, and a flexible upsert operation at the row and column level. Upserted data is stored as delta files, which greatly improves the efficiency and concurrency of writes, and an optimized merge scan provides efficient MergeOnRead performance.
  • Streaming and batch unification: LakeSoul supports a streaming sink, so it can handle streaming data ingestion, batch backfilling of historical data, interactive queries, and other scenarios simultaneously.
  • Schema evolution: Users can add new fields at any time and quickly populate the new fields with historical data.
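
To make the upsert and MergeOnRead idea above more concrete, here is a minimal conceptual sketch in plain Python. It is not LakeSoul's API or implementation: the function `merge_on_read`, the `id` key column, and the sample rows are all hypothetical and chosen only for illustration. It assumes that delta files are applied oldest first, and that a change carrying only some columns overwrites just those columns.

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def merge_on_read(base: Iterable[Row], deltas: Iterable[Iterable[Row]], key: str) -> List[Row]:
    """Apply delta files, oldest first, on top of the base rows at read time.

    Hypothetical helper for illustration only; not part of LakeSoul.
    """
    # Copy the base rows so the stored base data is never modified.
    merged: Dict[Any, Row] = {row[key]: dict(row) for row in base}
    for delta in deltas:
        for change in delta:
            # Row-level upsert: an unknown key is inserted as a new row.
            # Column-level upsert: only the columns present in the change
            # are overwritten; the other columns keep their old values.
            merged.setdefault(change[key], {}).update(change)
    return [merged[k] for k in sorted(merged)]


base = [
    {"id": 1, "name": "a", "score": 10},
    {"id": 2, "name": "b", "score": 20},
]
delta_1 = [{"id": 2, "score": 25}]                # updates one column of row 2
delta_2 = [
    {"id": 3, "name": "c", "score": 30},          # inserts a new row
    {"id": 1, "name": "a2"},                      # updates one column of row 1
]

result = merge_on_read(base, [delta_1, delta_2], key="id")
for row in result:
    print(row)
```

In this sketch a write only appends a small delta, and the cost of combining deltas with the base data is paid when the table is read, which is the general trade-off behind merge-on-read designs.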

Application scenarios of LakeSoul:

  • Incremental data needs to be written efficiently in large batches in real time, along with concurrent updates at the row or column level.
  • Detailed queries and updates span a large time range with a huge amount of historical data, and costs need to stay low.
  • Queries are not fixed and resource consumption varies widely, so computing resources are expected to scale flexibly and independently.
  • Highly concurrent writes are required, and the metadata is too large for Delta Lake to meet performance requirements.
  • Data is updated by primary key, and Hudi's MergeOnRead does not meet update performance requirements.