
🧰 Big Data Sandbox


A lightweight, one-command environment to learn and practice big data pipelines with Kafka, Spark, Airflow, and MinIO, without the pain of setting everything up manually.


🚀 What is this?

The Big Data Sandbox is an open-source project that provides a ready-to-run environment for experimenting with big data tools. It's perfect for:

  • Students learning data engineering
  • Developers prototyping pipelines
  • Educators preparing workshops or demos

Included tools (MVP):

  • Kafka – Streaming events
  • Spark – Batch & stream processing
  • Airflow – Workflow orchestration
  • MinIO – S3-compatible object storage
  • Jupyter – Interactive notebooks for experiments

💡 Why Big Data Sandbox?

The Problem: Setting up a big data environment takes days of configuration, version conflicts, and debugging.

Our Solution: Pre-configured, version-tested components that work together out of the box. Focus on learning concepts, not fighting configs.

What makes this different:

  • ✅ All services pre-integrated (no manual wiring)
  • ✅ Realistic sample data included
  • ✅ Production-like patterns (not toy examples)
  • ✅ Actively maintained with latest stable versions

πŸ— Architecture

┌──────────┐     ┌─────────┐     ┌────────┐
│  Airflow │────▶│  Kafka  │────▶│ Spark  │
└──────────┘     └─────────┘     └────────┘
      │                │               │
      ▼                ▼               ▼
┌──────────┐     ┌─────────────────────┐
│  Jupyter │     │      MinIO (S3)     │
└──────────┘     └─────────────────────┘
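
To make the flow above concrete, here is a minimal PySpark sketch of the Kafka → Spark → MinIO path. The hostnames (kafka, minio), credentials (minioadmin/minioadmin), topic, and bucket names are assumptions based on common defaults, not values confirmed by this repo; check compose.yml and .env.example for the real ones.

# Minimal Kafka -> Spark -> MinIO sketch (assumed hostnames and credentials).
# Needs the Kafka connector and S3A support on the classpath, e.g.:
#   spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0,org.apache.hadoop:hadoop-aws:3.3.4 job.py
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("kafka-to-minio")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")  # assumed MinIO endpoint
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin")       # assumed credentials
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Read the 'events' topic (the one the quickstart below produces to).
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")  # assumed broker address
    .option("subscribe", "events")
    .load()
    .selectExpr("CAST(value AS STRING) AS raw_event")
)

# Write each micro-batch to MinIO as JSON files.
query = (
    events.writeStream.format("json")
    .option("path", "s3a://processed/events/")                    # assumed bucket
    .option("checkpointLocation", "s3a://processed/_chk/events/")
    .start()
)
query.awaitTermination()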

📋 Prerequisites

  • Docker & Docker Compose installed
  • 8GB+ RAM recommended
  • 10GB free disk space
  • Ports 8080, 8888, 9000, 9092, 4040, 8081 available
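
You can verify the ports up front with a quick loop (a sketch for macOS/Linux, assuming lsof is available; any "in use" line means you'll need to free or remap that port):

for port in 8080 8888 9000 9092 4040 8081; do
  lsof -i :"$port" >/dev/null 2>&1 && echo "port $port is already in use"
done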

⚡ Quick Start

1. Clone the repository

git clone https://github.com/gridatek/big-data-sandbox.git
cd big-data-sandbox

2. Launch the sandbox

docker compose up -d

3. Verify all services are running

docker compose ps
# All services should show as "Up"

4. Explore the services

With the default setup, the web UIs are typically reachable at the following addresses (ports taken from the prerequisites above; check compose.yml for the authoritative mappings):

  • Airflow – http://localhost:8080 (login admin/admin, matching the API example below)
  • Jupyter – http://localhost:8888
  • MinIO console – http://localhost:9000
  • Spark UI – http://localhost:4040 (while a job is running) and http://localhost:8081
  • Kafka broker – localhost:9092 (no web UI; used by clients)

📚 Learning Resources

New to Big Data? We have two learning paths for you:

  • 📓 Interactive Tutorials: Start with jupyter/notebooks/01_getting_started.ipynb for hands-on learning
  • 🛠️ Production Examples: Use the examples/ directory for real-world workflows

👉 Read the complete Learning Guide for detailed explanations of when to use each approach.


📖 First Pipeline - Real Example

Try this working example in under 5 minutes:

# 1. Upload sample data to MinIO
docker exec -it sandbox-minio mc mb local/raw-data 2>/dev/null || true
docker exec -it sandbox-minio mc cp /data/sales_data.csv local/raw-data/

# 2. Produce events to Kafka (use -i rather than -it when piping stdin)
docker exec -i sandbox-kafka kafka-console-producer \
  --bootstrap-server localhost:9092 \
  --topic events < data/sample_events.json

# 3. Trigger the ETL pipeline in Airflow
curl -X POST http://localhost:8080/api/v1/dags/sample_etl/dagRuns \
  -H "Content-Type: application/json" \
  -H "Authorization: Basic YWRtaW46YWRtaW4=" \
  -d '{"conf":{}}'

# 4. Check processed results
docker exec -it sandbox-minio mc ls local/processed/

See examples/quickstart/ for the full walkthrough.
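
To confirm the DAG run actually succeeded, you can poll the same Airflow REST API used in step 3 (a sketch, assuming the same admin/admin credentials):

# List recent sample_etl runs and their state
curl -s http://localhost:8080/api/v1/dags/sample_etl/dagRuns \
  -H "Authorization: Basic YWRtaW46YWRtaW4=" | python -m json.tool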


🗂 Project Structure

big-data-sandbox/
├── compose.yml             # One-command environment
├── .env.example            # Environment variables template
├── airflow/
│   ├── dags/               # Airflow DAG definitions
│   ├── plugins/            # Custom operators
│   └── config/             # Airflow configuration
├── spark/
│   ├── jobs/               # Spark applications
│   └── config/             # Spark configuration
├── kafka/
│   ├── config/             # Kafka broker config
│   └── producers/          # Sample data producers
├── minio/
│   └── data/               # Initial buckets & data
├── jupyter/
│   └── notebooks/          # 📓 Interactive learning tutorials
├── data/
│   ├── sales_data.csv      # Sample sales dataset
│   ├── user_events.json    # Sample event stream
│   └── iot_sensors.csv     # IoT sensor readings
├── examples/
│   ├── quickstart/         # 🚀 Complete workflow demos
│   ├── streaming/          # 🌊 Production streaming apps
│   └── batch/              # 📊 Enterprise ETL & analytics
└── README.md               # This file
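
For orientation, a DAG dropped into airflow/dags/ looks roughly like the sketch below. This is illustrative only; the dag_id matches the sample_etl run triggered in the quickstart, but the shipped DAG's actual tasks may differ.

# Minimal Airflow DAG sketch (illustrative; not the repo's actual sample_etl).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="sample_etl",
    start_date=datetime(2024, 1, 1),
    schedule=None,   # Airflow 2.4+; older versions use schedule_interval
    catchup=False,
) as dag:
    extract = BashOperator(
        task_id="extract",
        bash_command="echo 'pull raw data from MinIO'",
    )
    transform = BashOperator(
        task_id="transform",
        bash_command="echo 'run the Spark job'",
    )
    extract >> transform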

🔧 Troubleshooting

Services not starting?

  • Check Docker memory allocation: docker system info | grep Memory
  • Increase Docker memory to at least 6GB in Docker Desktop settings
  • View logs: docker compose logs [service-name]

Port conflicts?

  • Check for running services: lsof -i :8080 (replace with conflicting port)
  • Modify ports in .env file (copy from .env.example)
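
As an illustration, a port override in .env might look like the lines below; the variable names here are hypothetical, so use the keys actually defined in .env.example:

# .env (hypothetical keys; copy .env.example and edit the real ones)
AIRFLOW_PORT=18080
JUPYTER_PORT=18888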

Kafka connection issues?

  • Ensure Kafka is fully started: docker compose logs kafka | grep "started (kafka.server.KafkaServer)"
  • Wait 30 seconds after startup for all services to initialize
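
As a quick sanity check, list topics against the broker (using the sandbox-kafka container name from the quickstart):

docker exec -it sandbox-kafka kafka-topics \
  --bootstrap-server localhost:9092 --list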

Need help? Open an issue with your docker compose logs output.


🌱 Roadmap

Phase 1 - MVP (Current)

  • Core services (Kafka, Spark, Airflow, MinIO)
  • Docker Compose setup
  • Basic documentation
  • Jupyter integration
  • Sample datasets & generators

Phase 2 - Enhanced Learning (Q1 2025)

  • Interactive tutorials in Jupyter
  • Data generators (IoT sensors, web logs, transactions)
  • Video tutorials series
  • Performance monitoring dashboard
  • Additional connectors (PostgreSQL, MongoDB)

Phase 3 - Production Ready (Q2 2025)

  • Kubernetes deployment (Helm charts)
  • Delta Lake / Iceberg integration
  • Security configurations (Kerberos, SSL)
  • Multi-node Spark cluster option
  • CI/CD pipeline examples

Phase 4 - Advanced Features (Q3 2025)

  • Machine Learning pipelines (MLflow)
  • Stream processing with Flink
  • Data quality checks (Great Expectations)
  • Cost optimization guides
  • Cloud deployment scripts (AWS/GCP/Azure)

👥 Community

Success Stories

"Cut my workshop prep time from 2 days to 30 minutes!" - University Professor

"Finally, a way to test Spark jobs without AWS bills!" - Startup Developer

"Perfect for our internal data engineering bootcamp" - Fortune 500 Tech Lead


🤝 Contributing

Contributions are welcome! We're looking for:

  • πŸ› Bug reports and fixes
  • πŸ“š Documentation improvements
  • 🎯 New example pipelines
  • πŸ”§ Performance optimizations
  • 🌍 Translations

See CONTRIBUTING.md for guidelines.


📜 License

MIT License - Free to use, modify, and share. See LICENSE file for details.


πŸ™ Acknowledgments

Built with amazing open-source projects:

  • Apache Kafka, Spark, and Airflow
  • MinIO Object Storage
  • Project Jupyter
  • Docker & Docker Compose

Special thanks to the data engineering community for feedback and contributions!


Ready to dive in? Star ⭐ this repo and start exploring big data in minutes!
