Apache Spark Code Practice

Seven hands-on labs for mastering Apache Spark through progressive, practical exercises, from fundamentals to production deployment.

Labs

| # | Lab | Difficulty | Time | Topics |
|----|-----|------------|------|--------|
| 01 | Spark Fundamentals | Beginner | 60-90 min | Architecture, DataFrames, basic operations |
| 02 | Data Loading & Transformation | Beginner | 60-90 min | Data ingestion, cleaning, validation |
| 03 | Advanced DataFrame Operations | Intermediate | 90-120 min | Joins, window functions, UDFs, complex types |
| 04 | Spark SQL & Optimization | Intermediate | 90-120 min | SQL operations, Catalyst optimizer, performance tuning |
| 05 | Spark Streaming Fundamentals | Intermediate | 90-120 min | Structured streaming, windowed operations, watermarking |
| 06 | Machine Learning with MLlib | Advanced | 90-120 min | ML pipelines, classification, regression, clustering |
| 07 | Production Deployment | Advanced | 90-120 min | Docker, Kubernetes, monitoring, CI/CD |

Total: ~9.5-13 hours

Domain Coverage

| Domain | Coverage | Labs |
|--------|----------|------|
| Spark Architecture | 15% | Lab 1, Lab 7 |
| DataFrame API | 30% | Lab 1, Lab 2, Lab 3 |
| Spark SQL | 20% | Lab 4 |
| Data Loading | 10% | Lab 2 |
| Streaming | 10% | Lab 5 |
| MLlib | 10% | Lab 6 |
| Production | 5% | Lab 7 |

Getting Started

1. Check Prerequisites

Review prerequisites.md and ensure you have:

  • Docker or Podman installed
  • Python 3.8+
  • 8GB RAM minimum (16GB recommended)
  • 10GB disk space

2. Setup Environment

```bash
# Clone repository (if not already done)
git clone https://github.com/nellaivijay/spark-code-practice.git
cd spark-code-practice

# Run setup script
./scripts/setup.sh
```

3. Start Services

```bash
./scripts/start.sh
```

4. Access Jupyter Notebook

Open your browser and go to: http://localhost:8888

5. Begin Learning

Start with Lab 1: Spark Fundamentals

Lab Progression

Labs are designed to be completed in order:

  • Labs 1-2 (Foundation): Architecture, data I/O, basic operations
  • Labs 3-5 (Core Skills): Advanced DataFrame operations, SQL optimization, streaming
  • Labs 6-7 (Advanced): Machine learning with MLlib, production deployment

Supporting Materials

Cheatsheets

Quick reference notebooks for key concepts:

Documentation

Architecture


Components

  • Jupyter Notebook: Interactive learning environment
  • PySpark: Python API for Spark
  • Spark SQL: SQL interface for structured queries
  • Spark MLlib: Machine learning library
  • Spark Streaming: Real-time data processing
  • MinIO: S3-compatible object storage
  • Docker Compose: Container orchestration
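For context on how MinIO fits in: Spark reaches S3-compatible storage through the Hadoop s3a connector. A configuration sketch only — the endpoint, credentials, and bucket below are placeholders, so check the repo's setup scripts and compose files for the actual values (the s3a jars must also be on Spark's classpath):

```python
from pyspark.sql import SparkSession

# Placeholder endpoint/credentials — use the values from the repo's setup scripts
spark = (
    SparkSession.builder.appName("minio-demo")
    .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin")   # placeholder
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")   # placeholder
    .config("spark.hadoop.fs.s3a.path.style.access", "true")  # MinIO needs path-style URLs
    .getOrCreate()
)

# Buckets are then addressable via the s3a:// scheme (hypothetical bucket/path)
df = spark.read.csv("s3a://my-bucket/data.csv", header=True)
```

Path-style access is the key MinIO-specific setting; without it the client tries virtual-hosted bucket URLs, which a local MinIO endpoint does not serve.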

Target Audience

Developers, data engineers, and data scientists who want to:

  • Learn Apache Spark from fundamentals to advanced concepts
  • Gain hands-on experience with real-world data processing
  • Understand Spark architecture and optimization techniques
  • Build production-ready Spark applications

Requirements

  • Docker or Podman
  • 8GB RAM minimum (16GB recommended for MLlib and streaming)
  • 10GB disk space
  • Python 3.8+ (for PySpark and scripts)
  • Java 11+ (optional, for Scala Spark)

Cleanup

When done, clean up the environment:

```bash
# Stop services
./scripts/stop.sh

# Full cleanup (removes data, logs, volumes)
./scripts/cleanup.sh
```

License

Apache License 2.0

Related Practice Repositories

Continue your learning journey:

AI/ML Practice

Data Engineering Practice

Programming Practice

Resource Hub
