Seven hands-on labs for mastering Apache Spark through progressive, practical exercises, from fundamentals to production deployment.
| # | Lab | Difficulty | Time | Topics |
|---|---|---|---|---|
| 01 | Spark Fundamentals | Beginner | 60-90 min | Architecture, DataFrames, basic operations |
| 02 | Data Loading & Transformation | Beginner | 60-90 min | Data ingestion, cleaning, validation |
| 03 | Advanced DataFrame Operations | Intermediate | 90-120 min | Joins, window functions, UDFs, complex types |
| 04 | Spark SQL & Optimization | Intermediate | 90-120 min | SQL operations, Catalyst optimizer, performance tuning |
| 05 | Spark Streaming Fundamentals | Intermediate | 90-120 min | Structured streaming, windowed operations, watermarking |
| 06 | Machine Learning with MLlib | Advanced | 90-120 min | ML pipelines, classification, regression, clustering |
| 07 | Production Deployment | Advanced | 90-120 min | Docker, Kubernetes, monitoring, CI/CD |
Total: ~9.5-13 hours
| Domain | Coverage | Labs |
|---|---|---|
| Spark Architecture | 15% | Lab 1, Lab 7 |
| DataFrame API | 30% | Lab 1, Lab 2, Lab 3 |
| Spark SQL | 20% | Lab 4 |
| Data Loading | 10% | Lab 2 |
| Streaming | 10% | Lab 5 |
| MLlib | 10% | Lab 6 |
| Production | 5% | Lab 7 |
Review prerequisites.md and ensure you have:
- Docker or Podman installed
- Python 3.8+
- 8GB RAM minimum (16GB recommended)
- 10GB disk space
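Before running setup, you can sanity-check these prerequisites from a terminal (a quick sketch; command names are the usual ones, adjust for your system):

```shell
# Check Python is 3.8 or newer
python3 -c 'import sys; assert sys.version_info >= (3, 8), sys.version'

# Check that a container runtime is on PATH (Docker or Podman)
command -v docker || command -v podman

# Check free disk space in the current directory
df -h .
```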
```bash
# Clone repository (if not already done)
git clone https://github.com/nellaivijay/spark-code-practice.git
cd spark-code-practice

# Run setup script
./scripts/setup.sh

# Start services
./scripts/start.sh
```

Open your browser and go to: http://localhost:8888
Start with Lab 1: Spark Fundamentals
Labs are designed to be completed in order:
- Labs 1-2 (Foundation): Architecture, data I/O, basic operations
- Labs 3-5 (Core Skills): Advanced DataFrame operations, SQL optimization, streaming
- Labs 6-7 (Advanced): Machine learning with MLlib, production deployment
Quick reference notebooks for key concepts:
- Prerequisites - Setup requirements and installation guide
- Installation Guide - Detailed setup instructions
- Quick Start - Get started in 5 minutes
- Troubleshooting - Common issues and solutions
- Jupyter Notebook: Interactive learning environment
- PySpark: Python API for Spark
- Spark SQL: SQL interface for structured queries
- Spark MLlib: Machine learning library
- Spark Streaming: Real-time data processing
- MinIO: S3-compatible object storage
- Docker Compose: Container orchestration
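The stack above is typically wired together with a `docker-compose.yml` along these lines. This is a hypothetical sketch, not the repository's actual file; the service names, images, ports, and credentials are assumptions:

```yaml
services:
  jupyter:
    image: jupyter/pyspark-notebook   # Jupyter + PySpark in one image
    ports:
      - "8888:8888"                   # Jupyter UI, as used in Quick Start
  minio:
    image: minio/minio
    command: server /data --console-address ":9001"
    ports:
      - "9000:9000"                   # S3-compatible API
      - "9001:9001"                   # MinIO web console
    environment:
      MINIO_ROOT_USER: minioadmin     # default demo credentials; change for anything shared
      MINIO_ROOT_PASSWORD: minioadmin
```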
Developers, data engineers, and data scientists who want to:
- Learn Apache Spark from fundamentals to advanced concepts
- Gain hands-on experience with real-world data processing
- Understand Spark architecture and optimization techniques
- Build production-ready Spark applications
- Docker or Podman
- 8GB RAM minimum (16GB recommended for MLlib and streaming)
- 10GB disk space
- Python 3.8+ (for PySpark and scripts)
- Java 11+ (optional, for Scala Spark)
When done, clean up the environment:
```bash
# Stop services
./scripts/stop.sh

# Full cleanup (removes data, logs, volumes)
./scripts/cleanup.sh
```

Apache License 2.0
Continue your learning journey:
- DSPy Code Practice - Declarative LLM programming
- LLM Fine-Tuning Practice - Model fine-tuning techniques
- DuckDB Code Practice - Analytics & SQL optimization
- Apache Iceberg Code Practice - Lakehouse architecture
- Apache Beam Code Practice - Data pipelines
- Scala Data Analysis Practice - Functional programming
- Awesome My Notes - Comprehensive technical notes