JinsYin/awesome-datalake

Awesome DataLake

This repository contains a curated list of awesome data lake frameworks.

Lakehouse

  • Apache Amoro - Apache Amoro (incubating) is a Lakehouse management system built on open data lake formats.
  • LakeSoul - LakeSoul is an end-to-end, realtime and cloud native Lakehouse framework with fast data ingestion, concurrent update and incremental data analytics on cloud storages for both BI and AI applications.
  • OpenHouse - Open Control Plane for Tables in Data Lakehouse.
  • Geolake - Universal solution for geospatial data tailored to data lakehouse systems for the first time in the industry.
  • LHBench - Lakehouse storage system benchmark.

Open Table Formats

  • Apache Hive - The Apache Hive (TM) data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL.
  • Apache Hudi - Upserts, Deletes And Incremental Processing on Big Data.
  • Apache Iceberg - Iceberg is a high-performance format for huge analytic tables. Iceberg brings the reliability and simplicity of SQL tables to big data, while making it possible for engines like Spark, Trino, Flink, Presto, Hive and Impala to safely work with the same tables, at the same time.
  • Apache Paimon - Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
  • Apache XTable - Apache XTable (incubating) is a cross-table converter for lakehouse table formats that facilitates interoperability across data processing systems and query engines.
  • Delta Lake - An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive, and APIs.

File Formats

  • Apache Avro - Apache Avro is a data serialization system.
  • Apache ORC - ORC is a self-describing type-aware columnar file format designed for Hadoop workloads.
  • Apache Parquet - Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides high-performance compression and encoding schemes to handle complex data in bulk and is supported in many programming languages and analytics tools.
  • CSV
  • JSON
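
The row-oriented versus column-oriented distinction behind the formats above can be illustrated with a minimal sketch using only the Python standard library. The sample records and the helper names `to_csv` and `to_columns` are hypothetical, invented for this illustration; the in-memory dictionary of lists only mimics the idea of a columnar layout and is not how Parquet or ORC actually encode data on disk.

```python
import csv
import io
import json

# Hypothetical sample records, invented for illustration only.
rows = [
    {"id": 1, "city": "Oslo", "temp_c": 4.5},
    {"id": 2, "city": "Lima", "temp_c": 19.0},
    {"id": 3, "city": "Pune", "temp_c": 31.2},
]


def to_csv(records):
    """Row-oriented text: a header line, then one line per record."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(records[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


def to_columns(records):
    """Column-oriented layout: one list of values per field."""
    return {name: [r[name] for r in records] for name in records[0]}


csv_text = to_csv(rows)
columns = to_columns(rows)

# Reading one field from the row layout means parsing every record...
temps_from_rows = [float(r["temp_c"]) for r in csv.DictReader(io.StringIO(csv_text))]
# ...while the columnar layout exposes the field as a single list.
temps_from_columns = columns["temp_c"]

print(csv_text)
print(json.dumps(columns))
```

Analytic queries that scan a few fields across many records tend to favor the second layout, which is one reason columnar formats are common in data lakes.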

Data Lake Storages

HCFS (Hadoop Compatible File System):

  • HDFS - The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.
  • MinIO - MinIO is a high-performance object storage system released under the GNU Affero General Public License v3.0. It is API-compatible with the Amazon S3 cloud storage service.

DVC (Data Version Control):

  • lakeFS - lakeFS is an open-source tool that transforms your object storage into a Git-like repository. It enables you to manage your data lake the way you manage your code.
  • DVC - ML Experiments and Data Management with Git.
  • Nessie - Project Nessie is a Transactional Catalog for Data Lakes with Git-like semantics.

Cache:

  • Alluxio - Data orchestration for analytics and machine learning in the cloud.

Data Lake Engines

Compute & Query:

  • Apache Flink - Apache Flink is an open source stream processing framework with powerful stream- and batch-processing capabilities.
  • Apache Hive - The Apache Hive (TM) data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL.
  • Apache Sedona - A cluster computing framework for processing large-scale geospatial data.
  • Apache Spark - Spark is a unified analytics engine for large-scale data processing.
  • Dremio - Dremio is a next-generation data lake engine that liberates your data with live, interactive queries directly on cloud data lake storage, including S3 and lakeFS.
  • Presto - Presto is a distributed SQL query engine for big data.
  • Trino - Trino is a fast distributed SQL query engine for big data analytics.

MPP:

  • StarRocks - StarRocks is the next-generation data platform designed to make data-intensive real-time analytics fast and easy. It can work as the compute engine to analyze data stored in data lakes such as Apache Hudi, Apache Iceberg, and Delta Lake.
  • Doris - Apache Doris is an easy-to-use, high-performance, unified analytics database. It can access databases and data lakes including Apache Hive, Apache Iceberg, Apache Hudi, Apache Paimon, LakeSoul, Elasticsearch, MySQL, Oracle, and SQL Server.
  • DuckDB - DuckDB is an analytical in-process SQL database management system. DuckDB has a flexible extension mechanism that allows for dynamically loading extensions.

Data Catalog

  • Apache Gravitino - Apache Gravitino is a high-performance, geo-distributed, and federated metadata lake. It manages the metadata directly in different sources, types, and regions. It also provides users with unified metadata access for data and AI assets.
  • Metacat - Metacat is a unified metadata exploration API service. You can explore Hive, RDS, Teradata, Redshift, S3 and Cassandra.
  • Polaris Catalog - Polaris Catalog is an open source catalog for Apache Iceberg. Polaris Catalog implements Iceberg’s open REST API for multi-engine interoperability with Apache Doris, Apache Flink, Apache Spark, PyIceberg, StarRocks and Trino.
  • Unity Catalog - Open, Multi-modal Catalog for Data & AI.

Security

  • Apache Ranger - To enable, monitor and manage comprehensive data security across the Hadoop platform and beyond.
  • Kerberos - The Network Authentication Protocol.

AI

  • Horovod - Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
  • Petastorm - The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
  • Databricks’ Dolly - Databricks’ Dolly, a large language model trained on the Databricks Machine Learning Platform.
  • DeepLake - Database for AI. Store vectors, images, texts, videos, etc.

Tools

  • Smart Data Lake - Smart Automation Tool for building modern Data Lakes and Data Pipelines.
  • Kylo - Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.
  • Cuelake - Use SQL to build ELT pipelines on a data lakehouse.