A lightweight one-command environment to learn and practice big data pipelines with Kafka, Spark, Airflow, and MinIO, without the pain of setting everything up manually.
The Big Data Sandbox is an open-source project that provides a ready-to-run environment for experimenting with big data tools. It's perfect for:
- Students learning data engineering
- Developers prototyping pipelines
- Educators preparing workshops or demos
Included tools (MVP):
- Kafka - Streaming events
- Spark - Batch & stream processing
- Airflow - Workflow orchestration
- MinIO - S3-compatible object storage
- Jupyter - Interactive notebooks for experiments
The Problem: Setting up a big data environment from scratch means days of configuration, version conflicts, and debugging.
Our Solution: Pre-configured, version-tested components that work together out of the box. Focus on learning concepts, not fighting configs.
What makes this different:
- ✅ All services pre-integrated (no manual wiring)
- ✅ Realistic sample data included
- ✅ Production-like patterns (not toy examples)
- ✅ Actively maintained with latest stable versions
┌───────────┐      ┌──────────┐      ┌─────────┐
│  Airflow  │─────▶│  Kafka   │─────▶│  Spark  │
└───────────┘      └──────────┘      └─────────┘
      │                 │                 │
      ▼                 ▼                 ▼
┌───────────┐      ┌───────────────────────────┐
│  Jupyter  │      │        MinIO (S3)         │
└───────────┘      └───────────────────────────┘
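To make the diagram concrete, here is a minimal sketch of the Kafka → Spark → MinIO leg as a PySpark Structured Streaming job. The in-network hostnames (`kafka`, `minio`), the bucket names, and the assumption that the Spark image bundles the Kafka and S3A connectors are illustrative only; check `compose.yml` and `spark/jobs/` for the real wiring.

```python
# Sketch: consume the "events" topic and land it in MinIO as Parquet.
# Hostnames and buckets are assumptions; requires the spark-sql-kafka and
# hadoop-aws connectors on the Spark classpath.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("events-to-minio")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "events")
    .load()
    .selectExpr("CAST(value AS STRING) AS json")   # raw event payloads
)

(
    events.writeStream.format("parquet")
    .option("path", "s3a://processed/events/")
    .option("checkpointLocation", "s3a://processed/_checkpoints/events/")
    .start()
    .awaitTermination()
)
```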
- Docker & Docker Compose installed
- 8GB+ RAM recommended
- 10GB free disk space
- Ports 8080, 8888, 9000, 9092, 4040, 8081 available
git clone https://github.com/yourname/big-data-sandbox.git
cd big-data-sandbox
docker compose up -d
docker compose ps
# All services should show as "Up"
- Airflow → http://localhost:8080 (admin/admin)
- Jupyter → http://localhost:8888 (token: bigdata)
- Spark UI → http://localhost:4040
- MinIO → http://localhost:9000 (minioadmin/minioadmin)
- Kafka Manager → http://localhost:9001
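Prefer a script to a browser? The sketch below checks that MinIO is reachable using the default credentials above; it assumes `boto3` is installed on your host (`pip install boto3`).

```python
# Quick connectivity check against the sandbox MinIO instance.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",   # MinIO's S3 API port
    aws_access_key_id="minioadmin",          # default sandbox credentials
    aws_secret_access_key="minioadmin",
)

# Listing buckets is enough to prove the object store is up and reachable.
for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"])
```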
New to Big Data? We have two learning paths for you:
- Interactive Tutorials: Start with `jupyter/notebooks/01_getting_started.ipynb` for hands-on learning
- Production Examples: Use the `examples/` directory for real-world workflows
Read the complete Learning Guide for detailed explanations of when to use each approach.
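For a taste of the tutorial material, this is roughly the kind of PySpark cell the getting-started notebook builds up to: read the bundled sales dataset from MinIO and run a small aggregation. The in-network MinIO hostname and the column names are assumptions; the notebook itself is the authoritative version.

```python
# Sketch of a notebook cell: batch-read sales_data.csv from the raw-data
# bucket (uploaded in the quick start below) and aggregate it.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("getting-started")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")   # assumed in-network hostname
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

sales = spark.read.csv("s3a://raw-data/sales_data.csv", header=True, inferSchema=True)
sales.groupBy("product").sum("amount").show()   # column names are illustrative
```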
Try this working example in under 5 minutes:
# 1. Upload sample data to MinIO
docker exec -it sandbox-minio mc mb local/raw-data 2>/dev/null || true
docker exec -it sandbox-minio mc cp /data/sales_data.csv local/raw-data/
# 2. Produce events to Kafka
docker exec -i sandbox-kafka kafka-console-producer \
--bootstrap-server localhost:9092 \
--topic events < data/sample_events.json
# 3. Trigger the ETL pipeline in Airflow
curl -X POST http://localhost:8080/api/v1/dags/sample_etl/dagRuns \
-H "Content-Type: application/json" \
-H "Authorization: Basic YWRtaW46YWRtaW4=" \
-d '{"conf":{}}'
# 4. Check processed results
docker exec -it sandbox-minio mc ls local/processed/
See `examples/quickstart/` for the full walkthrough.
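Wondering what `sample_etl` actually runs? The real DAG lives in `airflow/dags/`; the sketch below only illustrates the typical extract → transform → load shape such a pipeline takes, and the task names and bodies here are hypothetical.

```python
# Hypothetical shape of an extract -> transform -> load DAG; the bundled
# sample_etl DAG in airflow/dags/ is the authoritative version.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract(**context):
    ...  # e.g. pull raw files from the MinIO raw-data bucket


def transform(**context):
    ...  # e.g. submit a Spark job that cleans and aggregates the data


def load(**context):
    ...  # e.g. write results to the processed bucket


with DAG(
    dag_id="sample_etl",
    start_date=datetime(2024, 1, 1),
    schedule=None,   # triggered manually, as in the curl call above (Airflow 2.4+ keyword)
    catchup=False,
) as dag:
    (
        PythonOperator(task_id="extract", python_callable=extract)
        >> PythonOperator(task_id="transform", python_callable=transform)
        >> PythonOperator(task_id="load", python_callable=load)
    )
```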
big-data-sandbox/
├── compose.yml            # One-command environment
├── .env.example           # Environment variables template
├── airflow/
│   ├── dags/              # Airflow DAG definitions
│   ├── plugins/           # Custom operators
│   └── config/            # Airflow configuration
├── spark/
│   ├── jobs/              # Spark applications
│   └── config/            # Spark configuration
├── kafka/
│   ├── config/            # Kafka broker config
│   └── producers/         # Sample data producers
├── minio/
│   └── data/              # Initial buckets & data
├── jupyter/
│   └── notebooks/         # Interactive learning tutorials
├── data/
│   ├── sales_data.csv     # Sample sales dataset
│   ├── user_events.json   # Sample event stream
│   └── iot_sensors.csv    # IoT sensor readings
├── examples/
│   ├── quickstart/        # Complete workflow demos
│   ├── streaming/         # Production streaming apps
│   └── batch/             # Enterprise ETL & analytics
└── README.md              # This file
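For example, a sample producer in `kafka/producers/` might look roughly like this sketch, which replays `data/user_events.json` into the `events` topic. It assumes the `kafka-python` package, a line-delimited JSON file, and that you run it from the repo root against the broker exposed on localhost:9092.

```python
# Sketch: replay the bundled sample events into Kafka (pip install kafka-python).
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Send each line of the sample file as one event on the "events" topic.
with open("data/user_events.json") as f:
    for line in f:
        producer.send("events", json.loads(line))

producer.flush()   # make sure everything is delivered before exiting
```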
Services not starting?
- Check Docker memory allocation: `docker system info | grep Memory`
- Increase Docker memory to at least 6GB in Docker Desktop settings
- View logs: `docker compose logs [service-name]`
Port conflicts?
- Check for running services: `lsof -i :8080` (replace with the conflicting port); the port-check sketch at the end of this section tests every required port at once
- Modify ports in the `.env` file (copy it from `.env.example`)
Kafka connection issues?
- Ensure Kafka is fully started: `docker compose logs kafka | grep "started (kafka.server.KafkaServer)"`
- Wait 30 seconds after startup for all services to initialize
Need help? Open an issue with your `docker compose logs` output.
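Before filing a port-conflict issue, a quick standard-library check like this sketch shows which of the sandbox's ports are already taken:

```python
# Report which of the sandbox's default ports are already in use on this machine.
import socket

PORTS = [8080, 8081, 8888, 9000, 9001, 9092, 4040]

for port in PORTS:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # connect_ex returns 0 when something is already listening on the port.
        in_use = s.connect_ex(("127.0.0.1", port)) == 0
    print(f"port {port}: {'IN USE' if in_use else 'free'}")
```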
- Core services (Kafka, Spark, Airflow, MinIO)
- Docker Compose setup
- Basic documentation
- Jupyter integration
- Sample datasets & generators
- Interactive tutorials in Jupyter
- Data generators (IoT sensors, web logs, transactions)
- Video tutorials series
- Performance monitoring dashboard
- Additional connectors (PostgreSQL, MongoDB)
- Kubernetes deployment (Helm charts)
- Delta Lake / Iceberg integration
- Security configurations (Kerberos, SSL)
- Multi-node Spark cluster option
- CI/CD pipeline examples
- Machine Learning pipelines (MLflow)
- Stream processing with Flink
- Data quality checks (Great Expectations)
- Cost optimization guides
- Cloud deployment scripts (AWS/GCP/Azure)
- Discord: Join our server - Get help and share projects
- Blog: Read tutorials at blog.bigdatasandbox.dev
- Twitter: Follow @bigdatasandbox for updates
- YouTube: Video tutorials
"Cut my workshop prep time from 2 days to 30 minutes!" - University Professor
"Finally, a way to test Spark jobs without AWS bills!" - Startup Developer
"Perfect for our internal data engineering bootcamp" - Fortune 500 Tech Lead
Contributions are welcome! We're looking for:
- Bug reports and fixes
- Documentation improvements
- New example pipelines
- Performance optimizations
- Translations
See CONTRIBUTING.md for guidelines.
MIT License - Free to use, modify, and share. See LICENSE file for details.
Built with amazing open-source projects:
- Apache Kafka, Spark, and Airflow
- MinIO Object Storage
- Project Jupyter
- Docker & Docker Compose
Special thanks to the data engineering community for feedback and contributions!
Ready to dive in? Star ⭐ this repo and start exploring big data in minutes!