===========================================

Gebil Jibul

Description: This program demonstrates the creation of an 
Awesome Big Data stack, and using Jinja2 to format a markdown file with it.

=========================================== 

# Creating a Big Data Stack

Big data is a rapidly evolving, ever-changing field. Keeping track of the latest big data stacks, programming libraries, software, and other tools requires constant vigilance. Any book on big data will be out of date by the time it is published. We need a resource that is updated on a more frequent basis. 

This project will help create that resource by researching the latest big data tools and technologies. We will use this research to create an *Awesome Big Data* list. Below is a list of similar *awesome* lists that may be useful when creating our *Awesome Big Data* list. 

*[Awesome Python](https://awesome-python.com/)* is a curated list of awesome Python frameworks, libraries, software and resources. It was inspired by [awesome-php](https://github.com/ziadoz/awesome-php). 

*[Awesome Jupyter](https://github.com/markusschanta/awesome-jupyter)* is a curated list of awesome Jupyter projects, libraries and resources. Jupyter is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text.

*[Awesome Dash](https://github.com/ucg8j/awesome-dash)* is a curated list of awesome Dash (plotly) resources. Dash is a productive Python framework for building web applications. Written on top of Flask, Plotly.js, and React.js, Dash is ideal for building data visualization apps with highly custom user interfaces in pure Python. It's particularly suited for anyone who works with data in Python.

*[Awesome JavaScript](https://github.com/sorrycc/awesome-javascript)* is a collection of awesome browser-side JavaScript libraries, resources and shiny things. The [data visualization section](https://github.com/sorrycc/awesome-javascript#data-visualization) may be of use. 

*[Awesome Deep Learning](https://github.com/ChristosChristofidis/awesome-deep-learning)* is a curated list of awesome Deep Learning tutorials, projects and communities.

*[Awesome Machine Learning](https://github.com/josephmisiti/awesome-machine-learning)* is a curated list of awesome machine learning frameworks, libraries and software (by language).

*[Awesome Data Engineering](https://github.com/igorbarinov/awesome-data-engineering)* is a curated list of data engineering tools for software developers. 

*[Awesome Public Datasets](https://github.com/awesomedata/awesome-public-datasets)* is a list of a topic-centric public data sources in high quality. They are collected and tidied from blogs, answers, and user responses. 

*[Awesome](https://github.com/sindresorhus/awesome)* is a list of awesome lists about all kinds of interesting topics.

Pokémon or Big Data Quiz
Included below is code that fetches the answers to the questions and provides the results in a Pandas dataframe. 

In [1]:
import pandas as pd

quiz_answers_json = 'https://raw.githubusercontent.com/pixelastic/pokemonorbigdata/master/app/questions.json'
df_all = pd.read_json(quiz_answers_json)
# Pokémon answers
df_all[df_all['type'] == 0]

Unnamed: 0,name,url,type,text,img
0,Gorebyss,http://bulbapedia.bulbagarden.net/wiki/Gorebys...,0,His body is durable enough to withstand high-p...,gorebyss.png
3,Feebas,http://bulbapedia.bulbagarden.net/wiki/Feebas_...,0,Feebas requires Swift Swim to get marvel scale.,feebas.png
4,Azurill,http://bulbapedia.bulbagarden.net/wiki/Azurill...,0,Azurill's tail is large and bouncy. Azurill ca...,azurill.png
6,Vulpix,http://bulbapedia.bulbagarden.net/wiki/Vulpix_...,0,"Munchlax is a great counter to Vulpix, being a...",vulpix.png
8,Delibird,http://bulbapedia.bulbagarden.net/wiki/Delibir...,0,The 25th of December is his favorite day.,delibird.png
11,Arbok,http://bulbapedia.bulbagarden.net/wiki/Arbok_(...,0,"Despite reports to the contrary, Arbok does no...",arbok.png
18,Horsea,http://bulbapedia.bulbagarden.net/wiki/Horsea_...,0,"Horsea is a small, blue, seahorse-like Pokémon...",horsea.png
21,Jirachi,http://bulbapedia.bulbagarden.net/wiki/Jirachi...,0,Jirachi is a steel and psychic-type Pokémon wi...,jirachi.png
22,Spoink,http://bulbapedia.bulbagarden.net/wiki/Spoink_...,0,Spoink is a bouncy psychic-type Pokémon that s...,spoink.png
25,Geodude,http://bulbapedia.bulbagarden.net/wiki/Geodude...,0,Geodude is a hard rock pokemon immune to one h...,geodude.jpg


In [2]:
# Big data answers
df_all[df_all['type'] == 1]

Unnamed: 0,name,url,type,text,img
1,Tokutek,https://www.percona.com/,1,Tokutek claims to improve MongoDB performance ...,tokutek.png
2,Adabas,https://en.wikipedia.org/wiki/ADABAS,1,ADABAS was NoSQL from a time when there was no...,adabas.gif
5,Hadoop,https://hadoop.apache.org/,1,Hadoop is a distributed system for counting wo...,hadoop.png
7,Hekaton,https://en.wikipedia.org/wiki/Hekaton_(database),1,Refer to lock-free architecture for SQL Server...,hekaton.jpg
9,Summingbird,https://github.com/twitter/summingbird,1,"This Twitter technology combines streaming, Ma...",summingbird.png
10,Akiban,http://www.akiban.com/,1,Touted as SQL database with object structured ...,akiban.jpg
12,Flink,https://flink.apache.org/,1,Apache Flink will help us with batch and strea...,flink.png
13,Impala,http://impala.io/,1,Querying big data is too slow? Impala has a so...,impala.png
14,Pangool,http://datasalt.github.io/pangool/,1,"It's like Sandlash, but it will help us to dev...",pangool.png
15,Azkaban,http://azkaban.github.io/,1,"It's not the evolution from Abra, Kadabra, or ...",azkaban.png


In the next part, we will populate the items with categories for our list. A book that I have been reading 'Big Data Science & Analytics', provides list of categories and subcategories for a big data stack. We will use these categories as a starting point, but will not constrain ourselves to them. 

When creating categories, I avoided deeply nested layers of categories and subcategories. I started with the following high-level categories and subcategories. 

***Categories***

We will use the disutils trove classification convention defined in [PEP 301](https://www.python.org/dev/peps/pep-0301/) when defining a category with a subcategory.

- Batch Analysis :: DAG
- Batch Analysis :: Machine Learning
- Batch Analysis :: MapReduce
- Batch Analysis :: Script
- Batch Analysis :: Search
- Batch Analysis :: Workflow Scheduling
- Data Access Connector :: Custom Connectors
- Data Access Connector :: Publish-Subscribe
- Data Access Connector :: Queues
- Data Access Connector :: SQL
- Data Access Connector :: Source-Sink
- Data Storage :: Distributed File System
- Data Storage :: NoSQL
- Deployment :: NoSQL
- Deployment :: SQL
- Deployment :: Visualization Frameworks
- Deployment :: Web Frameworks
- Interactive Querying :: Analytic SQL
- Real-Time Analysis :: In-Memory
- Real-Time Analysis :: Stream Processing

Below is a list containing categories and suggested starting points for research. 

* AI and Machine Learning
    * Apache Spark's MLlib
    * H2O
    * Tensorflow
* Batch Processing
    * Apache
    * Apache Spark
    * Dask
    * MapReduce
* Cloud and Data Platforms
    * Amazon Web Services
    * Cloudera Data Platform
    * Google Cloud Platform
    * Microsoft Azure
* Container Engines and Orchestration
    * Docker
    * Docker Swarm
    * Kubernetes
    * Podman
* Data Storage :: Block Storage
    * Amazon EBS
    * OpenEBS
* Data Storage :: Cluster Storage
    * Ceph
    * HDFS
* Data Storage :: Object Storage
    * Amazon S3
    * Minio
* Data Transfer Tools
    * Apache Sqoop
* Full-Text Search
    * Apache Solr
    * Elasticsearch
* Interactive Query
    * Apache Hive
    * Google Big Query
    * Spark SQL
* Message Queues
    * Apache Kafka
    * RabbitMQ
* NoSQL :: Document Databases
    * CouchDB
    * Google Firestore
    * MongoDB
* NoSQL :: Graph Databases
    * DGraph
    * Neo4j
* NoSQL :: Key-Value Databases
    * Amazon DynamoDB
* NoSQL :: Time-Series Databases
    * TSDB
* Serverless Functions
    * AWS Lambda
    * OpenFaaS
* Stream Processing
    * Apache Spark's Structured Streaming
    * Apache Storm
    * Google Dataflow
* Visualization Frameworks
    * Apache Superset
    * Redash
* Workflow Engine
    * Apache Airflow
    * Google Cloud Composer
    * Oozie
    
We populate the list items using the `ListItem` class, defined below. The following is a description of the `ListItem` fields. 

**name**

The proper name of the list item

**website**

Link to the item's website.  Include `http://` or `https://` in the link. 

**category**

Category and optional subcategory for the item. 

**short_description**

Provide a short, one to two-sentence description of the item. 

In [3]:
from dataclasses import dataclass

@dataclass(frozen=True)
class ListItem:
    name: str
    website: str
    category: str
    short_description: str
    
all_items = set()

The following is an example of creating the entry for AWS as a seperate variable and then adding it to the `all_items` set. 

In [4]:
aws = ListItem(
    'Amazon Web Services',
    'https://aws.amazon.com/',
    'Cloud and Data Platforms',
    """Provides on-demand cloud computing platforms and APIs to individuals, 
    companies, and governments, on a metered pay-as-you-go basis."""
)

all_items.add(aws)

You can also add an item to the list directly. 

In [5]:
all_items.remove(aws)
all_items.add(ListItem(
    'Amazon Web Services',
    'https://aws.amazon.com/',
    'Cloud and Data Platforms',
    """Provides on-demand cloud computing platforms and APIs to individuals, 
    companies, and governments, on a metered pay-as-you-go basis."""
))

In [6]:
# List allows for future updates to all_items, using next cell
list_entries = [
    # AI and Machine Learning
    ListItem(
        "Apache Spark's MLlib",
        'https://spark.apache.org/mllib/',
        'AI and Machine Learning',
        """MLlib is Apache Spark's scalable machine learning library. Ease of use. Usable in Java, Scala, Python, and R."""
    ),

    ListItem(
        'H2O',
        'https://www.h2o.ai/',
        'AI and Machine Learning',
        """H2O.ai is an advanced AI Cloud Platform designed to simplify and accelerate making, operating and innovating with AI in any environment."""
    ),

    ListItem(
        'Tensorflow',
        'https://www.tensorflow.org/',
        'AI and Machine Learning',
        """TensorFlow is a free and open-source software library for machine learning and artificial intelligence."""
    ),

    # Batch Processing
    ListItem(
        'Apache Beam',
        'https://beam.apache.org/',
        'Batch Processing',
        """Apache Beam is an open source unified programming model to define and execute data processing pipelines, including ETL, batch and stream processing"""
    ),

    ListItem(
        'Apache Spark',
        'https://spark.apache.org/',
        'Batch Processing',
        """Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance."""
    ),

    ListItem(
        'Dask',
        'https://dask.org/',
        'Batch Processing',
        """Dask is an open-source flexible parallel computing library written in Python for analytics"""
    ),

    # Cloud and Data Platforms
    ListItem(
        'Amazon Web Services',
        'https://aws.amazon.com/',
        'Cloud and Data Platforms',
        """Amazon Web Services, Inc. is a subsidiary of Amazon providing on-demand cloud computing platforms and APIs to individuals, companies, and governments, on a metered pay-as-you-go basis."""
    ),

    ListItem(
        'Cloudera Data Platform',
        'https://www.cloudera.com/products/cloudera-data-platform.html',
        'Cloud and Data Platforms',
        """Cloudera’s open-source data platform uses analytics and machine learning to yield insights from data through a secure connection."""
    ),

    ListItem(
        'Google Cloud Platform',
        'https://cloud.google.com/',
        'Cloud and Data Platforms',
        """Google Cloud Platform, offered by Google, is a suite of cloud computing services that runs on the same infrastructure that Google uses internally for its end-user products, such as Google Search, Gmail, Google Drive, and YouTube."""
    ),

    ListItem(
        'Microsoft Azure',
        'https://azure.microsoft.com/',
        'Cloud and Data Platforms',
        """Microsoft Azure, often referred to as Azure, is a cloud computing service operated by Microsoft for application management via Microsoft-managed data centers."""
    ),

    # Container Engines and Orchestration
    ListItem(
        'Docker',
        'https://www.docker.com/',
        'Container Engines and Orchestration',
        """Docker is a set of platform as a service products that use OS-level virtualization to deliver software in packages called containers."""
    ),

    ListItem(
        'Kubernetes',
        'https://kubernetes.io/',
        'Container Engines and Orchestration',
        """Kubernetes is an open-source container-orchestration system for automating computer application deployment, scaling, and management."""
    ),

    ListItem(
        'Podman',
        'https://podman.io/',
        'Container Engines and Orchestration',
        """Podman is a daemonless, open source, Linux native tool designed to make it easy to find, run, build, share and deploy applications using Open Containers Initiative (OCI) Containers and Container Images."""
    ),

    # Data Storage :: Block Storage
    ListItem(
        'Amazon EBS',
        'https://aws.amazon.com/ebs/',
        'Data Storage :: Block Storage',
        """Amazon Elastic Block Store (Amazon EBS) is an easy-to-use, scalable, high-performance block-storage service designed for Amazon Elastic Compute Cloud (Amazon EC2)."""
    ),

    ListItem(
        'OpenEBS',
        'https://openebs.io/',
        'Data Storage :: Block Storage',
        """OpenESB is a Java-based open-source enterprise service bus. It allows you to integrate legacy systems, external and internal partners and new development in your Business Process."""
    ),

    # Data Storage :: Cluster Storage
    ListItem(
        'Ceph',
        'https://ceph.io/en/',
        'Data Storage :: Cluster Storage',
        """Ceph is an open-source software storage platform, implements object storage on a single distributed computer cluster, and provides 3-in-1 interfaces for object-, block- and file-level storage."""
    ),

    ListItem(
        'Hadoop Distributed File System',
        'https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html',
        'Data Storage :: Cluster Storage',
        """The Hadoop Distributed File System ( HDFS ) is a distributed file system designed to run on commodity hardware."""
    ),

    # Data Storage :: Object Storage
    ListItem(
        'Amazon S3',
        'https://aws.amazon.com/s3/',
        'Data Storage :: Object Storage',
        """Amazon S3 or Amazon Simple Storage Service is a service offered by Amazon Web Services that provides scalable object storage through a web service interface."""
    ),

    ListItem(
        'Minio',
        'https://min.io/',
        'Data Storage :: Object Storage',
        """MinIO is a High Performance Object Storage that is API compatible with Amazon S3 cloud storage service. It can handle unstructured data such as photos, videos, log files, backups, and container images with the maximum supported object size of 5TB."""
    ),

    # Data Transfer Tools
    ListItem(
        'Apache Sqoop',
        'https://sqoop.apache.org/',
        'Data Transfer Tools',
        """Sqoop is a command-line interface application for transferring data between relational databases and Hadoop. The Apache Sqoop project was retired in June 2021 and moved to the Apache Attic."""
    ),

    # Full-Text Search
    ListItem(
        'Apache Solr',
        'https://solr.apache.org/',
        'Full-Text Search',
        """Solr is an open-source enterprise-search platform, written in Java. Its major features include full-text search, hit highlighting, faceted search, real-time indexing, dynamic clustering, database integration, NoSQL features and rich document handling."""
    ),

    ListItem(
        'Elasticsearch',
        'https://www.elastic.co/elasticsearch/',
        'Full-Text Search',
        """Elasticsearch is a search engine based on the Lucene library. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents."""
    ),

    # Interactive Query
    ListItem(
        'Apache Hive',
        'https://hive.apache.org/',
        'Interactive Query',
        """Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop."""
    ),

    ListItem(
        'Google Big Query',
        'https://cloud.google.com/bigquery',
        'Interactive Query',
        """BigQuery is a fully-managed, serverless data warehouse that enables scalable analysis over petabytes of data. It is a Platform as a Service that supports querying using ANSI SQL. It also has built-in machine learning capabilities."""
    ),

    ListItem(
        'Spark SQL',
        'https://spark.apache.org/sql/',
        'Interactive Query',
        """Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine."""
    ),

    # Message Queues
    ListItem(
        'Apache Kafka',
        'https://kafka.apache.org/',
        'Message Queues',
        """Apache Kafka is an open-source framework implementation of a software bus using stream-processing. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds."""
    ),

    ListItem(
        'RabbitMQ',
        'https://www.rabbitmq.com/',
        'Message Queues',
        """RabbitMQ is an open-source message-broker software that originally implemented the Advanced Message Queuing Protocol and has since been extended with a plug-in architecture to support Streaming Text Oriented Messaging Protocol, MQ Telemetry Transport, and other protocols."""
    ),

    # NoSQL :: Document Databases
    ListItem(
        'CouchDB',
        'https://couchdb.apache.org/',
        'NoSQL :: Document Databases',
        """Apache CouchDB is an open-source document-oriented NoSQL database, implemented in Erlang. CouchDB uses multiple formats and protocols to store, transfer, and process its data. It uses JSON to store data, JavaScript as its query language using MapReduce, and HTTP for an API."""
    ),

    ListItem(
        'Google Firestore',
        'https://cloud.google.com/firestore',
        'NoSQL :: Document Databases',
        """Firebase is a platform developed by Google for creating mobile and web applications. It allows you to run sophisticated ACID transactions against your document data."""
    ),

    ListItem(
        'MongoDB',
        'https://www.mongodb.com/atlas',
        'NoSQL :: Document Databases',
        """MongoDB is a source-available cross-platform document-oriented database program. Classified as a NoSQL database program, MongoDB uses JSON-like documents with optional schemas."""
    ),

    # NoSQL :: Graph Databases
    ListItem(
        'DGraph',
        'https://dgraph.io/',
        'NoSQL :: Graph Databases',
        """Dgraph is a open-source graph database management system. Dgraph uses Raft for shard replication and a custom transactional protocol for snapshot-isolated cross-shard transactions."""
    ),

    ListItem(
        'Neo4j',
        'https://neo4j.com/product/neo4j-graph-database/',
        'NoSQL :: Graph Databases',
        """Neo4j is a graph database management system developed by Neo4j, Inc. Described by its developers as an ACID-compliant transactional database with native graph storage and processing,"""
    ),

    # NoSQL :: Key-Value Databases
    ListItem(
        'Amazon DynamoDB',
        'https://aws.amazon.com/dynamodb/',
        'NoSQL :: Key-Value Databases',
        """Amazon DynamoDB is a fully managed proprietary NoSQL database service that supports key–value and document data structures and is offered by Amazon.com as part of the Amazon Web Services portfolio."""
    ),

    # NoSQL :: Time-Series Databases
    ListItem(
        'OpenTSDB',
        'http://opentsdb.net/',
        'NoSQL :: Time-Series Databases',
        """OpenTSDB is a distributed, scalable Time Series Database (TSDB) written on top of HBase. OpenTSDB was written to address a common need: store, index and serve metrics collected from computer systems at a large scale, and make this data easily accessible and graphable."""
    ),

    # Serverless Functions
    ListItem(
        'AWS Lambda',
        'https://aws.amazon.com/lambda/',
        'Serverless Functions',
        """AWS Lambda is an event-driven, serverless computing platform provided by Amazon as a part of Amazon Web Services. It is a computing service that runs code in response to events and automatically manages the computing resources required by that code."""
    ),

    ListItem(
        'OpenFaaS',
        'https://www.openfaas.com/',
        'Serverless Functions',
        """OpenFaaS is an open source serverless function engine where users can publish, run, and manage functions on Kubernetes clusters."""
    ),

    # Stream Processing
    ListItem(
        'Apache Storm',
        'https://storm.apache.org/',
        'Stream Processing',
        """Apache Storm is an open-source distributed stream processing computation framework written predominantly in the Clojure programming language."""
    ),

    ListItem(
        'Google Dataflow',
        'https://cloud.google.com/dataflow',
        'Stream Processing',
        """Google Cloud Dataflow is a fully managed service for executing Apache Beam pipelines within the Google Cloud Platform ecosystem."""
    ),

    # Version Control Systems
    ListItem(
        'Data Version Control',
        'https://dvc.org/',
        'Version Control Systems',
        """DVC is an open-source version control system for machine learning projects that lets you define your pipeline regarless of the language used."""
    ),

    ListItem(
        'Git LFS',
        'https://git-lfs.github.com/',
        'Version Control Systems',
        """Git Large File Storage (LFS) is an open-source project that allows you to version large files with Git."""
    ),

    # Visualization Frameworks
    ListItem(
        'Apache Superset',
        'https://superset.apache.org/',
        'Visualization Frameworks',
        """Apache Superset is an open-source software cloud-native application for data exploration and data visualization able to handle data at petabyte scale."""
    ),

    ListItem(
        'Redash',
        'https://redash.io/',
        'Visualization Frameworks',
        """Redash is an open-source tool for teams to query, visualize and collaborate."""
    ),

    # Workflow Engine
    ListItem(
        'Apache Airflow',
        'https://airflow.apache.org/',
        'Workflow Engine',
        """Apache Airflow is an open-source workflow management platform for data engineering pipelines."""
    ),

    ListItem(
        'Google Cloud Composer',
        'https://cloud.google.com/composer',
        'Workflow Engine',
        """Cloud Composer is a managed workflow automation tool that is built on Apache Airflow. It's used to author, schedule, and monitor software development pipelines across data centers."""
    ),

    ListItem(
        'Oozie',
        'https://oozie.apache.org/',
        'Workflow Engine',
        """Apache Oozie is a server-based workflow scheduling system to manage Hadoop jobs. Workflows in Oozie are defined as a collection of control flow and action nodes in a directed acyclic graph."""
    )
]


In [7]:
items = [item.name for item in all_items]

# Adds each element to all_items
for entry in list_entries:
    if entry.name in items:
        print(f'{entry.name} already in all_items')
        
        # If element name already exists, offers to update in all_items
        if input('Update? (Y/N)').upper() == 'Y':
            outdated = next((item for item in all_items if item.name == entry.name), None)
            all_items.remove(outdated)
            all_items.add(entry)

    else:
        all_items.add(entry)

Amazon Web Services already in all_items
Update? (Y/N)Y


Use the `all_items` data to create Markdown output 

In [8]:
import jinja2

template = jinja2.Template("""
# Awesome Big Data

{% for category, items in category_dict.items() %}
## {{category}}

{% for item in items: %}
* [{{item.name}}]({{item.website}}) - {{item.short_description}}

{% endfor %} 
{% endfor %}
"""
)

In [9]:
from collections import OrderedDict
from collections import defaultdict

categories = set(item.category for item in all_items)

# Creates a k,v pair for categories and each item in category
category_dict = defaultdict(list)
for item in all_items:
    for category in categories:
        if item.category == category:
            category_dict[category].append(item)
            
# Orders the dictionary because it's better that way
category_dict = OrderedDict(sorted(category_dict.items(), key=lambda t: t[0]))

# context just nests the dict; 
# jinja2.Template cant see highest level of *arg for some reason
context = {'category_dict': category_dict}

markdown_result = template.render(context)

In [10]:
from IPython.display import display

display({'text/plain': markdown_result,
         'text/markdown': markdown_result},
        raw=True)


# Awesome Big Data


## AI and Machine Learning


* [Apache Spark's MLlib](https://spark.apache.org/mllib/) - MLlib is Apache Spark's scalable machine learning library. Ease of use. Usable in Java, Scala, Python, and R.


* [Tensorflow](https://www.tensorflow.org/) - TensorFlow is a free and open-source software library for machine learning and artificial intelligence.


* [H2O](https://www.h2o.ai/) - H2O.ai is an advanced AI Cloud Platform designed to simplify and accelerate making, operating and innovating with AI in any environment.

 

## Batch Processing


* [Apache Spark](https://spark.apache.org/) - Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.


* [Apache Beam](https://beam.apache.org/) - Apache Beam is an open source unified programming model to define and execute data processing pipelines, including ETL, batch and stream processing


* [Dask](https://dask.org/) - Dask is an open-source flexible parallel computing library written in Python for analytics

 

## Cloud and Data Platforms


* [Amazon Web Services](https://aws.amazon.com/) - Amazon Web Services, Inc. is a subsidiary of Amazon providing on-demand cloud computing platforms and APIs to individuals, companies, and governments, on a metered pay-as-you-go basis.


* [Google Cloud Platform](https://cloud.google.com/) - Google Cloud Platform, offered by Google, is a suite of cloud computing services that runs on the same infrastructure that Google uses internally for its end-user products, such as Google Search, Gmail, Google Drive, and YouTube.


* [Microsoft Azure](https://azure.microsoft.com/) - Microsoft Azure, often referred to as Azure, is a cloud computing service operated by Microsoft for application management via Microsoft-managed data centers.


* [Cloudera Data Platform](https://www.cloudera.com/products/cloudera-data-platform.html) - Cloudera’s open-source data platform uses analytics and machine learning to yield insights from data through a secure connection.

 

## Container Engines and Orchestration


* [Kubernetes](https://kubernetes.io/) - Kubernetes is an open-source container-orchestration system for automating computer application deployment, scaling, and management.


* [Docker](https://www.docker.com/) - Docker is a set of platform as a service products that use OS-level virtualization to deliver software in packages called containers.


* [Podman](https://podman.io/) - Podman is a daemonless, open source, Linux native tool designed to make it easy to find, run, build, share and deploy applications using Open Containers Initiative (OCI) Containers and Container Images.

 

## Data Storage :: Block Storage


* [Amazon EBS](https://aws.amazon.com/ebs/) - Amazon Elastic Block Store (Amazon EBS) is an easy-to-use, scalable, high-performance block-storage service designed for Amazon Elastic Compute Cloud (Amazon EC2).


* [OpenEBS](https://openebs.io/) - OpenESB is a Java-based open-source enterprise service bus. It allows you to integrate legacy systems, external and internal partners and new development in your Business Process.

 

## Data Storage :: Cluster Storage


* [Hadoop Distributed File System](https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html) - The Hadoop Distributed File System ( HDFS ) is a distributed file system designed to run on commodity hardware.


* [Ceph](https://ceph.io/en/) - Ceph is an open-source software storage platform, implements object storage on a single distributed computer cluster, and provides 3-in-1 interfaces for object-, block- and file-level storage.

 

## Data Storage :: Object Storage


* [Minio](https://min.io/) - MinIO is a High Performance Object Storage that is API compatible with Amazon S3 cloud storage service. It can handle unstructured data such as photos, videos, log files, backups, and container images with the maximum supported object size of 5TB.


* [Amazon S3](https://aws.amazon.com/s3/) - Amazon S3 or Amazon Simple Storage Service is a service offered by Amazon Web Services that provides scalable object storage through a web service interface.

 

## Data Transfer Tools


* [Apache Sqoop](https://sqoop.apache.org/) - Sqoop is a command-line interface application for transferring data between relational databases and Hadoop. The Apache Sqoop project was retired in June 2021 and moved to the Apache Attic.

 

## Full-Text Search


* [Elasticsearch](https://www.elastic.co/elasticsearch/) - Elasticsearch is a search engine based on the Lucene library. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents.


* [Apache Solr](https://solr.apache.org/) - Solr is an open-source enterprise-search platform, written in Java. Its major features include full-text search, hit highlighting, faceted search, real-time indexing, dynamic clustering, database integration, NoSQL features and rich document handling.

 

## Interactive Query


* [Spark SQL](https://spark.apache.org/sql/) - Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.


* [Google Big Query](https://cloud.google.com/bigquery) - BigQuery is a fully-managed, serverless data warehouse that enables scalable analysis over petabytes of data. It is a Platform as a Service that supports querying using ANSI SQL. It also has built-in machine learning capabilities.


* [Apache Hive](https://hive.apache.org/) - Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.

 

## Message Queues


* [RabbitMQ](https://www.rabbitmq.com/) - RabbitMQ is an open-source message-broker software that originally implemented the Advanced Message Queuing Protocol and has since been extended with a plug-in architecture to support Streaming Text Oriented Messaging Protocol, MQ Telemetry Transport, and other protocols.


* [Apache Kafka](https://kafka.apache.org/) - Apache Kafka is an open-source framework implementation of a software bus using stream-processing. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.

 

## NoSQL :: Document Databases


* [MongoDB](https://www.mongodb.com/atlas) - MongoDB is a source-available cross-platform document-oriented database program. Classified as a NoSQL database program, MongoDB uses JSON-like documents with optional schemas.


* [CouchDB](https://couchdb.apache.org/) - Apache CouchDB is an open-source document-oriented NoSQL database, implemented in Erlang. CouchDB uses multiple formats and protocols to store, transfer, and process its data. It uses JSON to store data, JavaScript as its query language using MapReduce, and HTTP for an API.


* [Google Firestore](https://cloud.google.com/firestore) - Firebase is a platform developed by Google for creating mobile and web applications. It allows you to run sophisticated ACID transactions against your document data.

 

## NoSQL :: Graph Databases


* [DGraph](https://dgraph.io/) - Dgraph is a open-source graph database management system. Dgraph uses Raft for shard replication and a custom transactional protocol for snapshot-isolated cross-shard transactions.


* [Neo4j](https://neo4j.com/product/neo4j-graph-database/) - Neo4j is a graph database management system developed by Neo4j, Inc. Described by its developers as an ACID-compliant transactional database with native graph storage and processing,

 

## NoSQL :: Key-Value Databases


* [Amazon DynamoDB](https://aws.amazon.com/dynamodb/) - Amazon DynamoDB is a fully managed proprietary NoSQL database service that supports key–value and document data structures and is offered by Amazon.com as part of the Amazon Web Services portfolio.

 

## NoSQL :: Time-Series Databases


* [OpenTSDB](http://opentsdb.net/) - OpenTSDB is a distributed, scalable Time Series Database (TSDB) written on top of HBase. OpenTSDB was written to address a common need: store, index and serve metrics collected from computer systems at a large scale, and make this data easily accessible and graphable.

 

## Serverless Functions


* [AWS Lambda](https://aws.amazon.com/lambda/) - AWS Lambda is an event-driven, serverless computing platform provided by Amazon as a part of Amazon Web Services. It is a computing service that runs code in response to events and automatically manages the computing resources required by that code.


* [OpenFaaS](https://www.openfaas.com/) - OpenFaaS is an open source serverless function engine where users can publish, run, and manage functions on Kubernetes clusters.

 

## Stream Processing


* [Google Dataflow](https://cloud.google.com/dataflow) - Google Cloud Dataflow is a fully managed service for executing Apache Beam pipelines within the Google Cloud Platform ecosystem.


* [Apache Storm](https://storm.apache.org/) - Apache Storm is an open-source distributed stream processing computation framework written predominantly in the Clojure programming language.

 

## Version Control Systems


* [Git LFS](https://git-lfs.github.com/) - Git Large File Storage (LFS) is an open-source project that allows you to version large files with Git.


* [Data Version Control](https://dvc.org/) - DVC is an open-source version control system for machine learning projects that lets you define your pipeline regarless of the language used.

 

## Visualization Frameworks


* [Apache Superset](https://superset.apache.org/) - Apache Superset is an open-source software cloud-native application for data exploration and data visualization able to handle data at petabyte scale.


* [Redash](https://redash.io/) - Redash is an open-source tool for teams to query, visualize and collaborate.

 

## Workflow Engine


* [Apache Airflow](https://airflow.apache.org/) - Apache Airflow is an open-source workflow management platform for data engineering pipelines.


* [Google Cloud Composer](https://cloud.google.com/composer) - Cloud Composer is a managed workflow automation tool that is built on Apache Airflow. It's used to author, schedule, and monitor software development pipelines across data centers.


* [Oozie](https://oozie.apache.org/) - Apache Oozie is a server-based workflow scheduling system to manage Hadoop jobs. Workflows in Oozie are defined as a collection of control flow and action nodes in a directed acyclic graph.

 
