# Big Data Ecosystem
*Author: Jacob Park*

## NoSQL/NewSQL Databases

### CockroachDB

An open-source, distributed NewSQL OLTP database inspired by Google's Spanner to provide CP guarantees and ACID transactions.

#### Use Cases

- PostgreSQL Replacement (Wire-Protocol Compatible).
- Distributed SQL.
- ACID Transactions.

#### See Also

- [Home Page](https://www.cockroachlabs.com/)
- [GitHub](https://github.com/cockroachdb/cockroach)

### Cassandra

An open-source, distributed wide-column store based on Amazon's Dynamo and Google's Bigtable to provide AP guarantees.

#### Use Cases

- Globally Distributed Replication.
- Automatic TTL.
- Event Data.
- Machine Learning Data.
- Time-Series Data.
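
As a rough illustration of what automatic TTL means for event and time-series data, the sketch below keeps an expiry time next to each value and drops expired entries on read. It is a toy in-memory model, not Cassandra's storage engine or driver API; `TTLStore`, its methods, and the injectable `clock` are hypothetical names chosen for this example.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLStore:
    """Toy key-value store whose entries expire after a per-write TTL."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        # The clock is injectable so expiry can be exercised without sleeping.
        self._clock = clock
        self._data: Dict[str, Tuple[Any, float]] = {}

    def put(self, key: str, value: Any, ttl_seconds: float) -> None:
        # Store the value together with its absolute expiry time.
        self._data[key] = (value, self._clock() + ttl_seconds)

    def get(self, key: str) -> Optional[Any]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Expired entries are removed lazily, at read time.
            del self._data[key]
            return None
        return value
```

The caller never deletes old readings explicitly; each write carries its own lifetime, which is what makes TTL attractive for data that loses value with age.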

#### See Also

- [Home Page](http://cassandra.apache.org/)
- [GitHub](https://github.com/apache/cassandra)

### Druid

An open-source, distributed, column-oriented OLAP/analytics database.

#### Use Cases

- Historical and Real-Time Analytics.

#### See Also

- [Home Page](https://druid.apache.org/)
- [GitHub](https://github.com/apache/druid)

### Elasticsearch

An open-source, sharded full-text search engine to provide CP guarantees.

#### Use Cases

- Full-Text Search.
- Geospatial Intelligence.
- Log Ingestion, Analysis, and Visualization.
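
Full-text search engines are commonly built around an inverted index, which maps each term to the documents that contain it. The sketch below shows that idea in miniature, with AND semantics across query terms. It assumes a naive lowercase tokenizer and omits scoring, sharding, and analyzers; `InvertedIndex` and `tokenize` are hypothetical names, not Elasticsearch APIs.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Split text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    """Toy inverted index: term -> set of document ids containing it."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[int]] = defaultdict(set)

    def add(self, doc_id: int, text: str) -> None:
        for term in tokenize(text):
            self._postings[term].add(doc_id)

    def search(self, query: str) -> Set[int]:
        """Return ids of documents that contain every term in the query."""
        terms = tokenize(query)
        if not terms:
            return set()
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return result
```

Looking up a term is a dictionary access rather than a scan over every document, which is why this structure suits search over large log volumes.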

#### See Also

- [Home Page](https://www.elastic.co/)
- [GitHub](https://github.com/elastic/elasticsearch)

### Kafka

An open-source, distributed publish/subscribe messaging system built on a partitioned, replicated commit log.

#### Use Cases

- Activity Tracking.
- Real-Time Metrics.
- Log Aggregation.
- Stream Processing.
- Event Sourcing.
- Commit Log.
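
Several of these use cases follow from one idea: records are appended to a log and each consumer group tracks its own read position, so independent readers can replay the same data. The sketch below models a single partition in memory under that assumption. `CommitLog`, `append`, and `poll` are hypothetical names for illustration and do not mirror the Kafka client API.

```python
from typing import Dict, List


class CommitLog:
    """Toy single-partition, append-only log with per-group read offsets."""

    def __init__(self) -> None:
        self._records: List[str] = []
        self._offsets: Dict[str, int] = {}

    def append(self, record: str) -> int:
        """Append a record and return its offset in the log."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, group: str, max_records: int = 10) -> List[str]:
        """Return the next unread records for a group and advance its offset."""
        start = self._offsets.get(group, 0)
        batch = self._records[start:start + max_records]
        self._offsets[group] = start + len(batch)
        return batch
```

Because reading never removes records, a metrics consumer and an audit consumer can process the same events at different speeds without interfering with each other.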

#### See Also

- [Home Page](http://kafka.apache.org/)
- [GitHub](https://github.com/apache/kafka)

### JanusGraph

An open-source, distributed graph database forked from Titan.

#### Use Cases

- Fraud Detection.
- Infrastructure Monitoring.
- Recommendation Engines.
- Social Network Graphs.

#### See Also

- [Home Page](http://janusgraph.org/)
- [GitHub](https://github.com/JanusGraph/janusgraph)

### MongoDB

An open-source, sharded BSON-document store to provide CP guarantees.

#### Use Cases

- Flexible Schemas.
- Complex Hierarchical Data.

#### See Also

- [Home Page](https://www.mongodb.com/)
- [GitHub](https://github.com/mongodb/mongo)

### Redis

An open-source, in-memory data structure store.

#### Use Cases

- LRU Cache.
- Complex Data Structures.
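
To make the LRU cache use case concrete, the sketch below implements exact least-recently-used eviction with the standard library. It is only a model of the policy: Redis applies an approximated LRU when a memory limit is configured, and `LRUCache` here is a hypothetical class, not a Redis client.

```python
from collections import OrderedDict
from typing import Any, Hashable, Optional


class LRUCache:
    """Toy cache that evicts the least recently used key when full."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._capacity = capacity
        self._items: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        if key not in self._items:
            return None
        # A read counts as a use, so move the key to the most-recent end.
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key: Hashable, value: Any) -> None:
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self._capacity:
            # Evict from the least-recent end.
            self._items.popitem(last=False)
```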

#### See Also

- [Home Page](https://redis.io/)
- [GitHub](https://github.com/redis/redis)

### RocksDB

An open-source, embeddable persistent key-value store based on Google's LevelDB.

#### Use Cases

- Localized State.
- Low-Latency Embeddable Cache.

#### See Also

- [Home Page](https://rocksdb.org/)
- [GitHub](https://github.com/facebook/rocksdb)

### TiDB

An open-source, distributed NewSQL OLTP/OLAP database inspired by Google's Spanner and F1, with a transaction model based on Google's Percolator, to provide CP guarantees and ACID transactions.

#### Use Cases

- MySQL Replacement (Wire-Protocol Compatible).
- Distributed SQL.
- ACID Transactions.

#### See Also

- [Home Page](https://pingcap.com/en/)
- [GitHub](https://github.com/pingcap/tidb)

### ZooKeeper

An open-source, distributed hierarchical key-value store to provide CP guarantees.

#### Use Cases

- Distributed Configurations.
- Distributed Coordination.
- Naming Service.
- Leader Election.
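
A common election recipe on top of a coordination service gives each candidate an increasing sequence number and treats the lowest surviving number as the leader. The sketch below models only that rule in memory. It assumes no networking, sessions, or watches, and `ElectionRegistry` with its methods is a hypothetical stand-in rather than a ZooKeeper client API.

```python
import itertools
from typing import Dict, Optional


class ElectionRegistry:
    """Toy election: the live candidate with the lowest sequence number leads."""

    def __init__(self) -> None:
        self._next_seq = itertools.count()
        self._candidates: Dict[int, str] = {}

    def join(self, name: str) -> int:
        """Register a candidate and return its sequence number."""
        seq = next(self._next_seq)
        self._candidates[seq] = name
        return seq

    def leave(self, seq: int) -> None:
        """Remove a candidate, for example when it crashes or disconnects."""
        self._candidates.pop(seq, None)

    def leader(self) -> Optional[str]:
        if not self._candidates:
            return None
        return self._candidates[min(self._candidates)]
```

When the current leader leaves, leadership passes to the next-lowest sequence number without any further voting, which keeps failover simple.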

#### See Also

- [Home Page](https://zookeeper.apache.org/)
- [GitHub](https://github.com/apache/zookeeper)

## Processing

### Flink

An open-source, distributed streaming data-flow engine.

#### Use Cases

- Streaming ETL.
- Streaming SQL.
- Event-Driven Applications.
- Stateful Applications.

#### See Also

- [Home Page](https://flink.apache.org/)
- [GitHub](https://github.com/apache/flink)

### Spark

An open-source, distributed general-purpose cluster-computing framework.

#### Use Cases

- Batch ETL.
- Batch SQL.
- Data Mining.

#### See Also

- [Home Page](https://spark.apache.org/)
- [GitHub](https://github.com/apache/spark)

## Scheduling

### Airflow

An open-source platform to programmatically author, schedule and monitor workflows.

#### Use Cases

- Scheduling ETL Jobs.
- Scheduling Machine Learning Jobs.
- Coordinating Data Pipelines.
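
Coordinating a pipeline largely comes down to running tasks in an order that respects their dependencies. The sketch below derives such an order with Python's standard `graphlib` module (Python 3.9+). It does not use Airflow itself, and the task names and the `execution_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def execution_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return a task order in which every task follows its dependencies.

    `dependencies` maps each task to the set of tasks it depends on.
    A cyclic graph raises graphlib.CycleError.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical pipeline: two branches fan out from "transform" and join at "report".
pipeline = {
    "extract": set(),
    "transform": {"extract"},
    "train_model": {"transform"},
    "load_warehouse": {"transform"},
    "report": {"train_model", "load_warehouse"},
}
order = execution_order(pipeline)
```

The two middle tasks have no dependency on each other, so their relative order is not fixed; a scheduler is free to run them in parallel.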

#### See Also

- [Home Page](https://airflow.apache.org/)
- [GitHub](https://github.com/apache/airflow)

## Serialization

### Arrow

An open-source, language-independent columnar memory format for flat and hierarchical data.

#### Use Cases

- In-Memory Analytics.
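
The benefit of a columnar layout for analytics can be seen by pivoting row records into one list per field, so that an aggregate touches only the column it needs. The sketch below uses plain Python lists and assumes every row has the same fields; it is not the Arrow memory format or the `pyarrow` API, and `to_columns` is a hypothetical helper.

```python
from typing import Any, Dict, List


def to_columns(rows: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Pivot a list of row dicts into a dict of column lists."""
    columns: Dict[str, List[Any]] = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


# Hypothetical sample data.
rows = [
    {"host": "a", "latency_ms": 12},
    {"host": "b", "latency_ms": 30},
    {"host": "a", "latency_ms": 18},
]
columns = to_columns(rows)

# The aggregate reads a single contiguous column and ignores "host" entirely.
average_latency = sum(columns["latency_ms"]) / len(columns["latency_ms"])
```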

#### See Also

- [Home Page](https://arrow.apache.org/)
- [GitHub](https://github.com/apache/arrow)

### Avro

An open-source, remote procedure call and data serialization framework.

#### Use Cases

- Streaming Analytics.
- Schema Evolution.
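
Schema evolution means a reader with a newer schema can still consume records written with an older one: fields the reader does not know are ignored, and fields missing from the record take the reader's default. The sketch below models that resolution rule on plain dictionaries. It is not the Avro library or its binary encoding, and `REQUIRED` and `resolve` are hypothetical names.

```python
from typing import Any, Dict

# Sentinel marking reader fields that have no default value.
REQUIRED = object()


def resolve(record: Dict[str, Any], reader_fields: Dict[str, Any]) -> Dict[str, Any]:
    """Project a written record onto the reader's fields.

    `reader_fields` maps each field name to its default, or REQUIRED.
    """
    resolved: Dict[str, Any] = {}
    for name, default in reader_fields.items():
        if name in record:
            resolved[name] = record[name]
        elif default is not REQUIRED:
            # Field was added after the record was written: use the default.
            resolved[name] = default
        else:
            raise ValueError(f"record is missing required field {name!r}")
    # Fields present only in the record are dropped.
    return resolved
```

For example, a reader that added a `plan` field with default `"free"` can still read older records that lack it, while a record missing a field with no default is rejected.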

#### See Also

- [Home Page](https://avro.apache.org/)
- [GitHub](https://github.com/apache/avro)

### Parquet

An open-source, columnar storage format.

#### Use Cases

- Batched Analytics.

#### See Also

- [Home Page](https://parquet.apache.org/)
- [GitHub](https://github.com/apache/parquet-format)

## Storage

### Alluxio

An open-source, memory-centric virtual distributed file system.

#### Use Cases

- Storage Abstraction.
- Remote Data Access Acceleration.

#### See Also

- [Home Page](https://www.alluxio.org/)
- [GitHub](https://github.com/Alluxio/alluxio)

### Hadoop Distributed File System

An open-source, distributed file system designed to run on commodity hardware.

#### Use Cases

- Bare-Metal Data Center.

#### See Also

- [Home Page](http://hadoop.apache.org/)
- [GitHub](https://github.com/apache/hadoop)

### S3

A proprietary, distributed object store designed for four nines (99.99%) of availability.

#### Use Cases

- AWS.

#### See Also

- [Home Page](https://aws.amazon.com/s3/)