A lightweight one-command environment to learn and practice big data pipelines with Kafka, Spark, Airflow, and MinIO, without the pain of setting everything up manually.
The Big Data Sandbox is an open-source project that provides a ready-to-run environment for experimenting with big data tools. It's perfect for:
- Students learning data engineering
- Developers prototyping pipelines
- Educators preparing workshops or demos
Included tools (MVP):
- Kafka - Streaming events
- Spark - Batch & stream processing
- Airflow - Workflow orchestration
- MinIO - S3-compatible object storage
- Jupyter - Interactive notebooks for experiments
The Problem: Setting up a big data environment from scratch means days of configuration, version conflicts, and debugging.
Our Solution: Pre-configured, version-tested components that work together out of the box. Focus on learning concepts, not fighting configs.
What makes this different:
- ✅ All services pre-integrated (no manual wiring)
- ✅ Realistic sample data included
- ✅ Production-like patterns (not toy examples)
- ✅ Actively maintained with latest stable versions
┌───────────┐      ┌──────────┐      ┌─────────┐
│  Airflow  │─────▶│  Kafka   │─────▶│  Spark  │
└───────────┘      └──────────┘      └─────────┘
      │                 │                 │
      ▼                 ▼                 ▼
┌───────────┐      ┌───────────────────────────┐
│  Jupyter  │      │        MinIO (S3)         │
└───────────┘      └───────────────────────────┘
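To make the diagram concrete, here is a minimal sketch of the Kafka → Spark → MinIO leg as a PySpark Structured Streaming job. The in-network hostnames (`kafka`, `minio`), the bucket names, and the assumption that the Spark image bundles the Kafka and S3A connectors are illustrative only; check `compose.yml` and `spark/jobs/` for the real wiring.

```python
# Sketch: consume the "events" topic and land it in MinIO as Parquet.
# Hostnames and buckets are assumptions; requires the spark-sql-kafka and
# hadoop-aws connectors on the Spark classpath.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("events-to-minio")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "events")
    .load()
    .selectExpr("CAST(value AS STRING) AS json")   # raw event payloads
)

(
    events.writeStream.format("parquet")
    .option("path", "s3a://processed/events/")
    .option("checkpointLocation", "s3a://processed/_checkpoints/events/")
    .start()
    .awaitTermination()
)
```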
- Docker & Docker Compose installed
- 8GB+ RAM recommended
- 10GB free disk space
- Ports 8080, 8888, 9000, 9092, 4040, 8081 available
git clone https://github.com/yourname/big-data-sandbox.git
cd big-data-sandbox
docker compose up -d
docker compose ps
# All services should show as "Up"
- Airflow → http://localhost:8080 (admin/admin)
- Jupyter → http://localhost:8888 (token: bigdata)
- Spark UI → http://localhost:4040
- MinIO → http://localhost:9000 (minioadmin/minioadmin)
- Kafka Manager → http://localhost:9001
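Prefer a script to a browser? The sketch below checks that MinIO is reachable using the default credentials above; it assumes `boto3` is installed on your host (`pip install boto3`).

```python
# Quick connectivity check against the sandbox MinIO instance.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",   # MinIO's S3 API port
    aws_access_key_id="minioadmin",          # default sandbox credentials
    aws_secret_access_key="minioadmin",
)

# Listing buckets is enough to prove the object store is up and reachable.
for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"])
```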
New to Big Data? We have two learning paths for you:
- Interactive Tutorials: Start with `jupyter/notebooks/01_getting_started.ipynb` for hands-on learning
- Production Examples: Use the `examples/` directory for real-world workflows
Read the complete Learning Guide for detailed explanations of when to use each approach.
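For a taste of the tutorial material, this is roughly the kind of PySpark cell the getting-started notebook builds up to: read the bundled sales dataset from MinIO and run a small aggregation. The in-network MinIO hostname and the column names are assumptions; the notebook itself is the authoritative version.

```python
# Sketch of a notebook cell: batch-read sales_data.csv from the raw-data
# bucket (uploaded in the quick start below) and aggregate it.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("getting-started")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")   # assumed in-network hostname
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

sales = spark.read.csv("s3a://raw-data/sales_data.csv", header=True, inferSchema=True)
sales.groupBy("product").sum("amount").show()   # column names are illustrative
```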
Try this working example in under 5 minutes:
# 1. Upload sample data to MinIO
docker exec -it sandbox-minio mc mb local/raw-data 2>/dev/null || true
docker exec -it sandbox-minio mc cp /data/sales_data.csv local/raw-data/
# 2. Produce events to Kafka
docker exec -i sandbox-kafka kafka-console-producer \
--bootstrap-server localhost:9092 \
--topic events < data/sample_events.json
# 3. Trigger the ETL pipeline in Airflow
curl -X POST http://localhost:8080/api/v1/dags/sample_etl/dagRuns \
-H "Content-Type: application/json" \
-H "Authorization: Basic YWRtaW46YWRtaW4=" \
-d '{"conf":{}}'
# 4. Check processed results
docker exec -it sandbox-minio mc ls local/processed/
See `examples/quickstart/` for the full walkthrough.
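Wondering what `sample_etl` actually runs? The real DAG lives in `airflow/dags/`; the sketch below only illustrates the typical extract → transform → load shape such a pipeline takes, and the task names and bodies here are hypothetical.

```python
# Hypothetical shape of an extract -> transform -> load DAG; the bundled
# sample_etl DAG in airflow/dags/ is the authoritative version.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract(**context):
    ...  # e.g. pull raw files from the MinIO raw-data bucket


def transform(**context):
    ...  # e.g. submit a Spark job that cleans and aggregates the data


def load(**context):
    ...  # e.g. write results to the processed bucket


with DAG(
    dag_id="sample_etl",
    start_date=datetime(2024, 1, 1),
    schedule=None,   # triggered manually, as in the curl call above (Airflow 2.4+ keyword)
    catchup=False,
) as dag:
    (
        PythonOperator(task_id="extract", python_callable=extract)
        >> PythonOperator(task_id="transform", python_callable=transform)
        >> PythonOperator(task_id="load", python_callable=load)
    )
```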
big-data-sandbox/
├── compose.yml            # One-command environment
├── .env.example           # Environment variables template
├── airflow/
│   ├── dags/              # Airflow DAG definitions
│   ├── plugins/           # Custom operators
│   └── config/            # Airflow configuration
├── spark/
│   ├── jobs/              # Spark applications
│   └── config/            # Spark configuration
├── kafka/
│   ├── config/            # Kafka broker config
│   └── producers/         # Sample data producers
├── minio/
│   └── data/              # Initial buckets & data
├── jupyter/
│   └── notebooks/         # Interactive learning tutorials
├── data/
│   ├── sales_data.csv     # Sample sales dataset
│   ├── user_events.json   # Sample event stream
│   └── iot_sensors.csv    # IoT sensor readings
├── examples/
│   ├── quickstart/        # Complete workflow demos
│   ├── streaming/         # Production streaming apps
│   └── batch/             # Enterprise ETL & analytics
└── README.md              # This file
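For example, a sample producer in `kafka/producers/` might look roughly like this sketch, which replays `data/user_events.json` into the `events` topic. It assumes the `kafka-python` package, a line-delimited JSON file, and that you run it from the repo root against the broker exposed on localhost:9092.

```python
# Sketch: replay the bundled sample events into Kafka (pip install kafka-python).
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Send each line of the sample file as one event on the "events" topic.
with open("data/user_events.json") as f:
    for line in f:
        producer.send("events", json.loads(line))

producer.flush()   # make sure everything is delivered before exiting
```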
Services not starting?
- Check Docker memory allocation: `docker system info | grep Memory`
- Increase Docker memory to at least 6GB in Docker Desktop settings
- View logs: `docker compose logs [service-name]`
Port conflicts?
- Check for running services: `lsof -i :8080` (replace with the conflicting port); the port-check sketch at the end of this section tests every required port at once
- Modify ports in the `.env` file (copy it from `.env.example`)
Kafka connection issues?
- Ensure Kafka is fully started: `docker compose logs kafka | grep "started (kafka.server.KafkaServer)"`
- Wait 30 seconds after startup for all services to initialize
Need help? Open an issue with your `docker compose logs` output.
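Before filing a port-conflict issue, a quick standard-library check like this sketch shows which of the sandbox's ports are already taken:

```python
# Report which of the sandbox's default ports are already in use on this machine.
import socket

PORTS = [8080, 8081, 8888, 9000, 9001, 9092, 4040]

for port in PORTS:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # connect_ex returns 0 when something is already listening on the port.
        in_use = s.connect_ex(("127.0.0.1", port)) == 0
    print(f"port {port}: {'IN USE' if in_use else 'free'}")
```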
- Core services (Kafka, Spark, Airflow, MinIO)
- Docker Compose setup
- Basic documentation
- Jupyter integration
- Sample datasets & generators
- Interactive tutorials in Jupyter
- Data generators (IoT sensors, web logs, transactions)
- Video tutorials series
- Performance monitoring dashboard
- Additional connectors (PostgreSQL, MongoDB)
- Kubernetes deployment (Helm charts)
- Delta Lake / Iceberg integration
- Security configurations (Kerberos, SSL)
- Multi-node Spark cluster option
- CI/CD pipeline examples
- Machine Learning pipelines (MLflow)
- Stream processing with Flink
- Data quality checks (Great Expectations)
- Cost optimization guides
- Cloud deployment scripts (AWS/GCP/Azure)
- Discord: Join our server - Get help and share projects
- Blog: Read tutorials at blog.bigdatasandbox.dev
- Twitter: Follow @bigdatasandbox for updates
- YouTube: Video tutorials
"Cut my workshop prep time from 2 days to 30 minutes!" - University Professor
"Finally, a way to test Spark jobs without AWS bills!" - Startup Developer
"Perfect for our internal data engineering bootcamp" - Fortune 500 Tech Lead
Contributions are welcome! We're looking for:
- Bug reports and fixes
- Documentation improvements
- New example pipelines
- Performance optimizations
- Translations
See CONTRIBUTING.md for guidelines.
MIT License - Free to use, modify, and share. See LICENSE file for details.
Built with amazing open-source projects:
- Apache Kafka, Spark, and Airflow
- MinIO Object Storage
- Project Jupyter
- Docker & Docker Compose
Special thanks to the data engineering community for feedback and contributions!
Ready to dive in? Star ⭐ this repo and start exploring big data in minutes!