An Awesome List of Open-Source Data Engineering Projects

Awesome Open-Source Data Engineering

This Awesome List aims to provide an overview of open-source projects related to data engineering. This is a community effort: please contribute and send pull requests to grow this list! For a list including non-OSS tools, see this amazing Awesome List.

Analytics

  • Apache Spark - A unified analytics engine for large-scale data processing.

Business Intelligence

  • Apache Superset - A modern, enterprise-ready business intelligence web application.

  • Metabase - An easy way for everyone in your company to ask questions and learn from data.

  • Redash - All the tools to unlock your data.

Change Data Capture

  • Debezium - Change data capture for MySQL, Postgres, MongoDB, SQL Server and others.

  • Maxwell - Maxwell’s daemon, a MySQL-to-JSON Kafka producer.

Datastores

  • Apache Calcite - SQL parser, building blocks for datastores.

  • Apache Cassandra - Open Source distributed wide column store, NoSQL database.

  • Apache Druid - A high performance real-time analytics database.

  • Apache HBase - Open Source non-relational distributed database.

  • Apache Pinot - A realtime distributed OLAP datastore.

  • ClickHouse - Open Source distributed column-oriented DBMS.

  • InfluxDB - Purpose-Built Open Source Time Series Database.

  • Postgres - The World’s Most Advanced Open Source Relational Database.

  • MinIO - A high-performance, distributed object storage system compatible with AWS S3.

Data Governance and Registries

Data Virtualization

  • Teiid - A relational abstraction of different information sources.

  • Presto - Distributed SQL Query Engine for Big Data.

  • Apache Drill - Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage.

Formats

  • Apache Avro - A data serialization system.

  • Apache Parquet - A columnar storage format.

  • Apache ORC - Another columnar storage format.

  • Apache Thrift - Data type and service interface definitions and code generator.

  • Cap’n Proto - A data interchange format and capability-based RPC system.

  • FlatBuffers - An efficient cross platform serialization library for C++, C#, C, Go, Java, JavaScript, Lobster, Lua, TypeScript, PHP, Python, and Rust.

  • Protocol Buffers - Google’s language-neutral, platform-neutral, extensible mechanism for serializing structured data.

  • MessagePack - An efficient binary serialization format. Like JSON, it lets you exchange data among multiple languages.

Integration

  • Apache Camel - Easily integrate various systems consuming or producing data.

  • Kafka Connect - Reusable framework to handle data in and out of Apache Kafka.

  • Logstash - Open Source server-side data processing pipeline.

Messaging Infrastructure

  • Apache ActiveMQ - Flexible & Powerful Multi-Protocol Messaging.

  • Apache Kafka - A distributed commit log with messaging capabilities.

  • Apache Pulsar - A distributed pub-sub messaging system.

  • Liiklus - An event gateway that provides reactive gRPC/RSocket access to Kafka-like systems.

  • Nakadi - A distributed event bus that implements a RESTful API abstraction on top of Kafka-like queues.

  • NATS - A simple, secure and high performance messaging system.

  • RabbitMQ - A message broker.

  • Waltz - A quorum-based distributed write-ahead log for replicating transactions.

Specifications and Standards

  • CloudEvents - A specification for describing event data in a common way.

Stream Processing

  • Apache Beam - Implement batch and streaming data processing jobs that run on any execution engine.

  • Apache Flink - Stateful computations over data streams.

  • Apache Kafka Streams - A client library for building applications and microservices, where the input and output data are stored in Kafka.

  • Apache Samza - A distributed stream processing framework.

  • Apache Spark Structured Streaming - A scalable and fault-tolerant stream processing engine built on the Spark SQL engine.

  • Apache Storm - A distributed realtime computation system.

Testing

Workflow Management

  • Awesome Workflow Engines - A curated list of awesome open source workflow engines.

  • Apache Airflow - A platform created by the community to programmatically author, schedule and monitor workflows.

  • Prefect - A workflow management system designed for modern infrastructure.

  • Apache NiFi - Supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic.

only overview contents, no specific tools

Slide Decks, Recordings and Podcasts

Blog Posts and Articles

Collections

Tbd.

Not quite sure yet where to put these

License

The contents of this repository are licensed under the "Creative Commons Attribution-ShareAlike 4.0 International License".
