This repository contains a data lakehouse platform designed to process and analyze GitHub Archive data. The platform is composed of multiple services that work together to ingest, store, process, and visualize data. Below is a brief explanation of each service, how they are set up, and how they interact with each other.
- Description: A FastAPI-based service that provides an endpoint to download GitHub Archive data for a specified time range. It interacts with a PostgreSQL database to fetch and serve the data.
- Setup:
- Defined in
api/docker-compose.yml
. - Includes a
web-server
for the API and adb-bootstrap
service to initialize the PostgreSQL database. - Environment variables are configured in
.env
files.
- Defined in
- Interaction:
- The API fetches data from the PostgreSQL database and serves it as
.json.gz
files for downstream processing.
- The API fetches data from the PostgreSQL database and serves it as
- Description: Orchestrates the data ingestion pipeline. It downloads data from the API, stores it in MinIO, and loads it into ClickHouse for analytics.
- Setup:
- Defined in
airflow/docker-compose.yml
. - Uses custom DAGs located in the
airflow/dags
directory. - Dependencies are specified in
airflow/requirements.txt
.
- Defined in
- Interaction:
- Pulls data from the API, uploads it to MinIO, and triggers data loading into ClickHouse.
- Description: An S3-compatible object storage service used to store raw and processed data.
- Setup:
- Defined in
minio/docker-compose.yml
. - Configured with default credentials (
minioadmin
).
- Defined in
- Interaction:
- Acts as a storage layer for data ingested by Airflow and consumed by other services like ClickHouse & Spark app.
- Description: A columnar data warehouse optimized for analytics. Stores processed GitHub Archive data for querying and analysis.
- Setup:
- Defined in
clickhouse/docker-compose.yml
. - Includes an
initdb
directory for schema initialization.
- Defined in
- Interaction:
- Airflow loads data in parquet format into ClickHouse from MinIO.
- Metabase connects to ClickHouse for visualization.
- Description: A business intelligence tool used to create dashboards and visualize data stored in ClickHouse.
- Setup:
- Defined in
metabase/docker-compose.yml
. - Includes a custom Dockerfile to add the ClickHouse JDBC driver.
- Defined in
- Interaction:
- Queries data from ClickHouse to generate insights and visualizations.
- Description: A distributed data processing framework used for advanced analytics and transformations.
- Setup:
- Defined in
spark/docker-compose.yml
. - Includes a Spark master, worker nodes, and a Jupyter Notebook for interactive development.
- Defined in
- Interaction:
- Processes data stored in MinIO or ClickHouse for complex transformations and machine learning tasks.
- Data Ingestion:
- Airflow triggers the API to fetch GitHub Archive data and stores it in MinIO.
- Data Storage:
- MinIO acts as the raw data storage layer.
- Airflow loads processed data into ClickHouse for analytics.
- Data Processing:
- Spark can be used to process data from MinIO or ClickHouse for advanced use cases.
- Data Visualization:
- Metabase connects to ClickHouse to create dashboards and visualizations.
- Use the
scripts/up-all-services.sh
script to start all services. - Use the
scripts/down-all-services.sh
script to stop all services. - Ensure the
gharchive_network
Docker network is created properly before starting the services.
- Docker and Docker Compose installed on your system.
.env
files configured for each service.
api/
: API service and PostgreSQL database.airflow/
: Airflow DAGs and configuration.minio/
: MinIO object storage.clickhouse/
: ClickHouse database and schema initialization.metabase/
: Metabase BI tool.spark/
: Spark cluster and Jupyter Notebook for adhoc queries or machine learning analysis.scripts/
: Helper scripts to manage services.