Seven hands-on labs for mastering Apache Spark through progressive, practical exercises, from fundamentals to production deployment.
| # | Lab | Difficulty | Time | Topics |
|---|---|---|---|---|
| 01 | Spark Fundamentals | Beginner | 60-90 min | Architecture, DataFrames, basic operations |
| 02 | Data Loading & Transformation | Beginner | 60-90 min | Data ingestion, cleaning, validation |
| 03 | Advanced DataFrame Operations | Intermediate | 90-120 min | Joins, window functions, UDFs, complex types |
| 04 | Spark SQL & Optimization | Intermediate | 90-120 min | SQL operations, Catalyst optimizer, performance tuning |
| 05 | Spark Streaming Fundamentals | Intermediate | 90-120 min | Structured streaming, windowed operations, watermarking |
| 06 | Machine Learning with MLlib | Advanced | 90-120 min | ML pipelines, classification, regression, clustering |
| 07 | Production Deployment | Advanced | 90-120 min | Docker, Kubernetes, monitoring, CI/CD |
Total: ~9.5-13 hours
| Domain | Coverage | Labs |
|---|---|---|
| Spark Architecture | 15% | Lab 1, Lab 7 |
| DataFrame API | 30% | Lab 1, Lab 2, Lab 3 |
| Spark SQL | 20% | Lab 4 |
| Data Loading | 10% | Lab 2 |
| Streaming | 10% | Lab 5 |
| MLlib | 10% | Lab 6 |
| Production | 5% | Lab 7 |
Review prerequisites.md and ensure you have:
- Docker or Podman installed
- Python 3.8+
- 8GB RAM minimum (16GB recommended)
- 10GB disk space
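Before running setup, you can sanity-check these prerequisites from a terminal (a quick sketch; command names are the usual ones, adjust for your system):

```shell
# Check Python is 3.8 or newer
python3 -c 'import sys; assert sys.version_info >= (3, 8), sys.version'

# Check that a container runtime is on PATH (Docker or Podman)
command -v docker || command -v podman

# Check free disk space in the current directory
df -h .
```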
```bash
# Clone repository (if not already done)
git clone https://github.com/nellaivijay/spark-code-practice.git
cd spark-code-practice

# Run setup script
./scripts/setup.sh

# Start services
./scripts/start.sh
```

Open your browser and go to: http://localhost:8888
Start with Lab 1: Spark Fundamentals
Labs are designed to be completed in order:
- Labs 1-2 (Foundation): Architecture, data I/O, basic operations
- Labs 3-5 (Core Skills): Advanced DataFrame operations, SQL optimization, streaming
- Labs 6-7 (Advanced): Machine learning with MLlib, production deployment
Quick reference notebooks for key concepts:
- Prerequisites - Setup requirements and installation guide
- Installation Guide - Detailed setup instructions
- Quick Start - Get started in 5 minutes
- Troubleshooting - Common issues and solutions
- Jupyter Notebook: Interactive learning environment
- PySpark: Python API for Spark
- Spark SQL: SQL interface for structured queries
- Spark MLlib: Machine learning library
- Spark Streaming: Real-time data processing
- MinIO: S3-compatible object storage
- Docker Compose: Container orchestration
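The stack above is typically wired together with a `docker-compose.yml` along these lines. This is a hypothetical sketch, not the repository's actual file; the service names, images, ports, and credentials are assumptions:

```yaml
services:
  jupyter:
    image: jupyter/pyspark-notebook   # Jupyter + PySpark in one image
    ports:
      - "8888:8888"                   # Jupyter UI, as used in Quick Start
  minio:
    image: minio/minio
    command: server /data --console-address ":9001"
    ports:
      - "9000:9000"                   # S3-compatible API
      - "9001:9001"                   # MinIO web console
    environment:
      MINIO_ROOT_USER: minioadmin     # default demo credentials; change for anything shared
      MINIO_ROOT_PASSWORD: minioadmin
```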
Developers, data engineers, and data scientists who want to:
- Learn Apache Spark from fundamentals to advanced concepts
- Gain hands-on experience with real-world data processing
- Understand Spark architecture and optimization techniques
- Build production-ready Spark applications
- Docker or Podman
- 8GB RAM minimum (16GB recommended for MLlib and streaming)
- 10GB disk space
- Python 3.8+ (for PySpark and scripts)
- Java 11+ (optional, for Scala Spark)
When done, clean up the environment:
```bash
# Stop services
./scripts/stop.sh

# Full cleanup (removes data, logs, volumes)
./scripts/cleanup.sh
```

Apache License 2.0
Continue your learning journey:
- DSPy Code Practice - Declarative LLM programming
- LLM Fine-Tuning Practice - Model fine-tuning techniques
- DuckDB Code Practice - Analytics & SQL optimization
- Apache Iceberg Code Practice - Lakehouse architecture
- Apache Beam Code Practice - Data pipelines
- Scala Data Analysis Practice - Functional programming
- Awesome My Notes - Comprehensive technical notes