This repository contains the materials for a short Apache Spark tutorial session.
- Notebooks: see `pyspark_short_tutorial.ipynb` in the `notebooks/` folder
This tutorial teaches you how to use Apache Spark for distributed data processing. By the end of this session, you will be able to:
- Understand Apache Spark - What it is, when to use it, and why it's better than Pandas for large datasets
- Start a SparkSession - Configure and launch Spark in local mode on your laptop
- Work with DataFrames - Create, inspect, and transform data using Spark DataFrames
- Perform aggregations - Use `groupBy` and aggregate functions to summarize data
- Query with Spark SQL - Write SQL queries on DataFrames using familiar SQL syntax
- Join datasets - Combine multiple tables using join operations
- Build ETL pipelines - Extract, Transform, and Load data in a scalable way
- Read and write files - Work with CSV and Parquet file formats
The session also highlights these key concepts:
- When to use Spark vs Pandas: Understanding the tradeoffs between single-machine and distributed computing
- Lazy evaluation: How Spark optimizes query execution by planning before executing
- Working with large datasets: Best practices for `.show()`, `.limit()`, and avoiding `.collect()` on big data
- Spark SQL: Leveraging your SQL knowledge to query distributed datasets
- Real-world ETL patterns: Building data transformation pipelines that can scale from your laptop to a cluster
Before starting, please fork this repository and create a fresh Python virtual environment.
All required libraries are listed in requirements.txt.
⚠️ If you encounter errors during `pip install`, try removing the version pinning for the failing package(s) in `requirements.txt`.
On Apple Silicon (M1/M2) systems you may also need to install additional system packages.
**macOS / Linux**

```shell
# Select Python version (if using pyenv)
pyenv local 3.11.3

# Create and activate virtual environment
python -m venv .venv
source .venv/bin/activate

# Upgrade pip and install dependencies
pip install --upgrade pip
pip install -r requirements.txt
```

**Windows (PowerShell)**

```shell
# Select Python version (if using pyenv)
pyenv local 3.11.3

# Create and activate virtual environment
python -m venv .venv
.venv\Scripts\Activate.ps1

# Upgrade pip and install dependencies
python -m pip install --upgrade pip
pip install -r requirements.txt
```

**Windows (Git Bash)**

```shell
# Select Python version (if using pyenv)
pyenv local 3.11.3

# Create and activate virtual environment
python -m venv .venv
source .venv/Scripts/activate

# Upgrade pip and install dependencies
python -m pip install --upgrade pip
pip install -r requirements.txt
```

You’re now ready to run the session notebooks!
Deactivate the environment when you’re done:
deactivate