A simplified data lake, more of a data shack, optimized for retrieving filtered records from Parquet files.

mhendrey/lakeshack

Lakeshack

A small rustic shack on the shores of a big lake


A simplified data lakehouse, more of a data lakeshack, optimized for retrieving filtered records from Parquet files. Like the various lakehouse solutions (Iceberg, Hudi, Delta Lake), Lakeshack gathers the min/max values for specified columns from each Parquet file and stores them in a database (the Metastore). When you query for a set of records, it first checks the Metastore for the list of Parquet files that might contain the desired records, and then queries only those files. The files may be stored locally or in S3. You may query using either native pyarrow or S3 Select.
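
The min/max pruning idea described above can be illustrated with a minimal sketch. This is not Lakeshack's API: the names `FILES`, `build_metastore`, `candidate_files`, and `query` are hypothetical, in-memory lists stand in for Parquet files, and an in-memory SQLite table stands in for the Metastore.

```python
import sqlite3

# Hypothetical stand-ins for Parquet files: file name -> list of records.
FILES = {
    "part-0.parquet": [{"id": 1, "value": "a"}, {"id": 4, "value": "b"}],
    "part-1.parquet": [{"id": 5, "value": "c"}, {"id": 9, "value": "d"}],
    "part-2.parquet": [{"id": 10, "value": "e"}, {"id": 15, "value": "f"}],
}


def build_metastore(files, column):
    """Store the min/max of `column` for each file in a SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE metastore ("
        "filename TEXT PRIMARY KEY, min_val INTEGER, max_val INTEGER)"
    )
    for name, records in files.items():
        values = [record[column] for record in records]
        conn.execute(
            "INSERT INTO metastore VALUES (?, ?, ?)",
            (name, min(values), max(values)),
        )
    conn.commit()
    return conn


def candidate_files(conn, value):
    """Return the files whose [min, max] range could contain `value`."""
    rows = conn.execute(
        "SELECT filename FROM metastore "
        "WHERE min_val <= ? AND max_val >= ? ORDER BY filename",
        (value, value),
    )
    return [row[0] for row in rows]


def query(files, conn, column, value):
    """Scan only the candidate files and return the matching records."""
    results = []
    for name in candidate_files(conn, value):
        results.extend(r for r in files[name] if r[column] == value)
    return results


conn = build_metastore(FILES, "id")
print(candidate_files(conn, 9))
print(query(FILES, conn, "id", 9))
```

Note that a file can be a candidate without holding a match: an `id` of 7 falls inside the range of `part-1.parquet`, so that file is still scanned even though it returns no records. The min/max check only rules files out.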

To achieve optimal performance, the partitioning and clustering strategy (which specifies how records are written to the Parquet files) should align with the main query pattern expected for the data. See the documentation for more information.
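
As a rough illustration of why clustering matters, the sketch below compares how many files a point lookup must scan when records are written sorted by the query key versus in arbitrary order. It is a toy model under stated assumptions, not Lakeshack code: the helper names `file_ranges` and `files_to_scan` are hypothetical, and plain integer lists stand in for Parquet files.

```python
import random


def file_ranges(keys, rows_per_file):
    """Split keys into fixed-size 'files' and return each file's (min, max)."""
    ranges = []
    for start in range(0, len(keys), rows_per_file):
        chunk = keys[start:start + rows_per_file]
        ranges.append((min(chunk), max(chunk)))
    return ranges


def files_to_scan(ranges, value):
    """Count the files whose min/max range could contain `value`."""
    return sum(1 for lo, hi in ranges if lo <= value <= hi)


# 1000 distinct keys arriving in arbitrary order (fixed seed for repeatability).
rng = random.Random(0)
shuffled = list(range(1000))
rng.shuffle(shuffled)

# Clustered: sort by the query key before writing, so ranges do not overlap.
clustered = file_ranges(sorted(shuffled), rows_per_file=100)
# Unclustered: write in arrival order, so most ranges span nearly all keys.
unclustered = file_ranges(shuffled, rows_per_file=100)

print("clustered files to scan:  ", files_to_scan(clustered, 500))
print("unclustered files to scan:", files_to_scan(unclustered, 500))
```

With sorted input each key falls in exactly one file's range, so a lookup touches a single file. With unsorted input the ranges overlap heavily and the min/max check can eliminate few, if any, files.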

Installation

Lakeshack may be installed using pip:

pip install lakeshack

Documentation

Documentation can be found at https://mhendrey.github.io/lakeshack
