DataSpec is a local‑first, fully open‑source analytics platform that installs a complete data + orchestration stack on a single machine with one command. It is designed to be self‑hosted, cloud‑independent, and repeatable for small teams.
- Local‑only, no cloud assumptions
- Fully open source
- Infrastructure as code
- Opinionated defaults
- Minimal configuration surface
- One‑command install
- Idempotent and repeatable
- Designed for small teams running on a single node
```mermaid
flowchart TD
    U["Users"] --> AF["Airflow (Web/Scheduler)"]
    U --> SS["Superset"]
    U --> PGA["pgAdmin"]
    U --> GF["Grafana (Logs)"]
    AF -->|triggers| DLT["dlt ingestion"]
    DLT --> PG["Postgres (Warehouse)"]
    AF --> PG
    SS --> PG
    PGA --> PG
    AF --> L["Container logs"]
    SS --> L
    PGA --> L
    PG --> L
    subgraph Logs
        L --> PT["Promtail"] --> LK["Loki"] --> GF
    end
```
- Postgres: analytics warehouse and metadata
- Airflow: orchestration (webserver, scheduler)
- dlt: ingestion pipelines
- Superset: BI and dashboarding
- Grafana + Loki + Promtail: log capture and dashboard
- pgAdmin: Postgres UI
- Airflow: `http://<host>:8080/`
- Superset: `http://<host>:8088/`
- Logs (Grafana): `http://<host>:3001/`
- pgAdmin: `http://<host>:5050/`
- Postgres: `localhost:5432`
- Ensure Docker is running.
- Start the stack: `docker compose up -d`
- Open Airflow, Superset, and pgAdmin (and Grafana for logs).

Requirements:

- Linux, macOS, or Windows
- Docker and Docker Compose
- 4 vCPU / 16 GB RAM minimum
The Compose stack includes an internal setup service that:
- Creates persistent folders under `./data`
- Prepares pgAdmin auto-registration files
- Applies safe permissions
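For orientation, here is a minimal Python sketch of the kind of work that setup step performs; the folder names, the `servers.json` location, and the permission modes are illustrative assumptions, not the shipped implementation:

```python
# Illustrative sketch of the setup service's provisioning steps (not the actual code).
import json
import os
from pathlib import Path

DATA_DIR = Path("./data")  # assumed layout


def provision() -> None:
    # Create persistent folders for each stateful service (folder names assumed).
    for sub in ("postgres", "superset", "grafana", "pgadmin"):
        (DATA_DIR / sub).mkdir(parents=True, exist_ok=True)

    # Write a pgAdmin servers.json so the warehouse is pre-registered on first login.
    servers = {
        "Servers": {
            "1": {
                "Name": "warehouse",
                "Group": "DataSpec",
                "Host": "postgres",
                "Port": 5432,
                "MaintenanceDB": os.environ.get("POSTGRES_DB", "datafoundry"),
                "Username": os.environ.get("POSTGRES_USER", "datafoundry"),
                "SSLMode": "prefer",
            }
        }
    }
    (DATA_DIR / "pgadmin" / "servers.json").write_text(json.dumps(servers, indent=2))

    # Apply conservative permissions so containers can write their own state.
    for path in DATA_DIR.rglob("*"):
        path.chmod(0o770 if path.is_dir() else 0o660)


if __name__ == "__main__":
    provision()
```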
All user configuration is provided via `.env`. Defaults are opinionated and safe to run locally.

Key variables:

- `DF_HOSTNAME`
- `POSTGRES_DB`, `POSTGRES_USER`, `POSTGRES_PASSWORD`
- `AIRFLOW_DB`
- `SUPERSET_DB`
- `AIRFLOW_ADMIN_USERNAME`, `AIRFLOW_ADMIN_PASSWORD`, `AIRFLOW_ADMIN_EMAIL`
- `AIRFLOW__CORE__FERNET_KEY`, `AIRFLOW__WEBSERVER__BASE_URL`
- `AIRFLOW__WEBSERVER__WEB_SERVER_HOST`, `AIRFLOW__WEBSERVER__WEB_SERVER_PORT`
- `AIRFLOW_UID`
- `SUPERSET_ADMIN_USERNAME`, `SUPERSET_ADMIN_PASSWORD`, `SUPERSET_ADMIN_EMAIL`, `SUPERSET_SECRET_KEY`
- `GRAFANA_ADMIN_USER`, `GRAFANA_ADMIN_PASSWORD`
- `PGADMIN_EMAIL`, `PGADMIN_PASSWORD`
- `NYC_TAXI_URL`
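An illustrative `.env` fragment showing the shape of these settings (the values below are placeholders, not the generated defaults):

```
DF_HOSTNAME=localhost
POSTGRES_DB=datafoundry
POSTGRES_USER=datafoundry
POSTGRES_PASSWORD=change-me
AIRFLOW_ADMIN_USERNAME=admin
AIRFLOW_ADMIN_PASSWORD=change-me
SUPERSET_ADMIN_USERNAME=admin
SUPERSET_ADMIN_PASSWORD=change-me
GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=change-me
PGADMIN_EMAIL=admin@example.com
PGADMIN_PASSWORD=change-me
```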
Credentials are written to `data/credentials.txt`.
Generate/refresh credentials: `make creds`

On first boot, Airflow triggers a full‑refresh ingestion of NYC Taxi data via dlt.

- DAG: `airflow/dags/nyc_taxi_full_refresh.py`
- dlt script: `airflow/dags/nyc_taxi_dlt.py`
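As a rough idea of what such a dlt load looks like, here is a minimal sketch of a full‑refresh parquet load into Postgres; the resource name, URL handling, and schema name are illustrative assumptions, not the actual `nyc_taxi_dlt.py`:

```python
# Illustrative full-refresh load of a parquet file into Postgres with dlt (not the shipped script).
import dlt
import pandas as pd


@dlt.resource(name="yellow_trips", write_disposition="replace")
def yellow_trips(url: str):
    # Read the source file and yield rows for dlt to normalize and load.
    df = pd.read_parquet(url)
    yield df.to_dict(orient="records")


def run(url: str) -> None:
    pipeline = dlt.pipeline(
        pipeline_name="nyc_taxi_full_refresh",
        destination="postgres",   # connection details come from dlt config / env vars
        dataset_name="nyc_taxi",  # becomes the target Postgres schema
    )
    print(pipeline.run(yellow_trips(url)))


if __name__ == "__main__":
    # Placeholder URL; the real stack reads it from NYC_TAXI_URL.
    run("https://example.com/yellow_tripdata.parquet")
```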
Use Airflow to define new ingestion DAGs or extend the dlt scripts.
- Add a new DAG under `airflow/dags/`
- Use `dlt` to load data into Postgres schemas (see the sketch below)
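A minimal skeleton for such a DAG might look like this; the file name, schedule, and the imported `run()` helper are assumptions, not files in this repo:

```python
# airflow/dags/my_source_ingest.py -- illustrative skeleton, not part of the repo.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def load_my_source() -> None:
    # Call a dlt script that loads data into a Postgres schema
    # (the module and run() helper are assumed names).
    from my_source_dlt import run
    run()


with DAG(
    dag_id="my_source_ingest",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
):
    PythonOperator(
        task_id="load_my_source",
        python_callable=load_my_source,
    )
```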
Superset uses its own metadata DB automatically. To query your warehouse, add a Database connection in Superset:
Connection string: `postgresql+psycopg2://datafoundry:<POSTGRES_PASSWORD>@postgres:5432/datafoundry`

Notes:

- Replace `<POSTGRES_PASSWORD>` with the value in `data/credentials.txt`
- Database name is `datafoundry` by default (or `POSTGRES_DB` if you changed it)
- Host must be `postgres` (the Docker service name), not `localhost`
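To sanity-check the URI outside Superset, you can run a quick query from any container on the Compose network that has SQLAlchemy and psycopg2 available (for example the Superset container); this is just an illustrative check:

```python
# Quick connectivity check for the warehouse URI (run inside the Compose network).
from sqlalchemy import create_engine, text

# Substitute the real password from data/credentials.txt before running.
uri = "postgresql+psycopg2://datafoundry:<POSTGRES_PASSWORD>@postgres:5432/datafoundry"

engine = create_engine(uri)
with engine.connect() as conn:
    print(conn.execute(text("SELECT version()")).scalar())
```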
Start: `docker compose up -d --build`

Stop: `docker compose down`

Reset data: `./reset.sh`

- Check container status: `docker compose ps`
- Check logs: `docker compose logs --tail=200 <service>`
- Rebuild a service: `docker compose up -d --build <service>`
Common issues:
- Port conflicts: change the port mapping in `docker-compose.yml` (see the example below)
- Permissions: rerun `docker compose up -d` (the setup service fixes ownership and modes)
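For instance, if Superset's default port is already taken on the host, the relevant mapping in `docker-compose.yml` would be edited along these lines (the service name and port values are assumptions; match them to your file):

```yaml
services:
  superset:
    ports:
      - "18088:8088"   # host:container -- only the left (host) side needs to change
```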
This repo is containerized, so your editor needs a local Python environment for LSP/type hints.
One‑time setup: `./scripts/dev/setup_venv.sh`

This creates `.venv/` and installs a lean dev dependency set from `./requirements-dev.txt`.

For VS Code, a workspace config is included at `./.vscode/settings.json`.

If you use another editor, point it at `./.venv/bin/python`.
This is a single‑node architecture. For production‑grade deployments:
- Use fast disks for `./data/postgres`
- Add backups for `./data/postgres` (see the example below)
- Set strong passwords in `.env`
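A simple starting point for backups is a scheduled logical dump in addition to the on-disk volume, for example `docker compose exec -T postgres pg_dump -U datafoundry -d datafoundry > backups/warehouse.sql` (the service, user, and database names here are the defaults; adjust them to your `.env`).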
- Linux: native Docker
- macOS: Docker Desktop
- Windows: Docker Desktop
- `docker-compose.yml`: runtime services
- `docker/`: images and Dockerfiles
- `scripts/`: init and provisioning
- `airflow/dags/`: ingestion workflows
Open source, local‑first, self‑hosted analytics.