Interactive Jupyter notebook tutorial for learning Apache Spark with Python. This hands-on guide covers fundamental to intermediate PySpark concepts with real-world data processing examples.
Comprehensive tutorial demonstrating distributed data processing using PySpark, including DataFrames, SQL queries, RDDs, and various output formats. Perfect for beginners and intermediate learners.
- Apache Spark Architecture - Understanding Driver, Executors, and Cluster Manager
- PySpark Basics - Python interface to Spark (Py4J bridge)
- Lazy Evaluation - How Spark optimizes query execution
- Spark Web UI - Monitoring jobs at `http://localhost:4040`
- Setting Up Spark
  - Creating a SparkSession
  - Local execution with `master("local[*]")`
  - Configuration options
- Loading Data
  - Reading CSV files with schema inference
  - Inspecting DataFrames (`.printSchema()`, `.show()`)
  - Understanding DataFrame structure
- Data Manipulation
  - Selecting columns (`.select()`)
  - Filtering data (`.filter()`, `.where()`)
  - Sorting (`.orderBy()`)
  - Creating computed columns (`.withColumn()`)
- Data Cleaning
  - Normalizing text (`.upper()`, `.trim()`)
  - Handling NULL values (`.na.fill()`, `.when().otherwise()`)
  - Fixing inconsistent data (department names, missing salaries)
  - Creating categorical columns (salary bands)
- Aggregations
  - Grouping data (`.groupBy()`)
  - Aggregate functions (`count`, `avg`, `sum`, `min`, `max`)
  - Multiple aggregations in a single query
- Spark SQL
  - Creating temporary views (`.createOrReplaceTempView()`)
  - Writing SQL queries on DataFrames
  - `GROUP BY` and aggregations in SQL
- Output Formats
  - Writing CSV files (`.write.csv()`)
  - Writing Parquet files (columnar format)
  - Overwrite vs append modes
- RDD Operations
  - Low-level Resilient Distributed Datasets
  - Transformations (`.map()`, `.filter()`)
  - Actions (`.collect()`, `.count()`)
  - When to use RDDs vs DataFrames (see the sketch after this list)
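Since RDDs only appear at the very end of the notebook, here is a minimal, hedged sketch of the transformation/action distinction; the data and lambdas are illustrative and not taken from the notebook, and `spark` is assumed to be an existing SparkSession.

```python
# Minimal RDD sketch: transformations are lazy, actions trigger computation.
rdd = spark.sparkContext.parallelize(range(1, 11))

squares_of_evens = rdd.filter(lambda x: x % 2 == 0) \
                      .map(lambda x: x * x)   # transformations: nothing runs yet

print(squares_of_evens.count())    # action: triggers the job
print(squares_of_evens.collect())  # action: returns [4, 16, 36, 64, 100]
```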
File: `data/employees.csv`

Structure:

```
employee_id,name,age,department,salary,experience_years
1,John Doe,30,IT,65000,5
2,Jane Smith,28,HR,55000,3
...
```
Characteristics:
- 10,000+ employee records
- Intentional data quality issues for cleaning practice:
  - Missing salary values (NULL)
  - Inconsistent department capitalization
  - Extra whitespace in fields
  - Mixed case formatting

Departments: IT, HR, Sales, Marketing, Finance, R&D
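To see what loading this dataset looks like in practice, here is a hedged sketch using schema inference; the options shown are standard PySpark, but the exact options used in the notebook may differ, and `spark` is assumed to be an existing SparkSession.

```python
# Hedged sketch: read the tutorial CSV with a header row and inferred schema.
df = spark.read.csv(
    "data/employees.csv",
    header=True,       # first line holds the column names
    inferSchema=True,  # let Spark guess ints/doubles/strings
)

df.printSchema()  # check the inferred types (e.g. age as int, salary as double)
df.show(5)        # preview the first rows, including the messy values
```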
- Python 3.8+
- Java 8 or 11 (required by Spark)
- 4GB+ RAM recommended
- Clone the repository

  ```bash
  git clone https://github.com/lougail/pyspark-data-processing-tutorial.git
  cd pyspark-data-processing-tutorial
  ```

- Create a virtual environment

  ```bash
  python -m venv venv
  # Windows
  venv\Scripts\activate
  # Linux/Mac
  source venv/bin/activate
  ```

- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```

- Start Jupyter Notebook

  ```bash
  jupyter notebook intro_pyspark.ipynb
  ```
- Run cells sequentially - Execute each cell in order to follow the tutorial progression
- Access Spark UI - While Spark is running, visit `http://localhost:4040` to monitor jobs
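If you want to reproduce the setup outside the notebook, a local-mode SparkSession can be created roughly as follows; the `appName` is illustrative and the notebook's exact configuration may differ.

```python
from pyspark.sql import SparkSession

# Hedged sketch: local-mode session using all available cores.
spark = SparkSession.builder \
    .appName("pyspark-tutorial") \
    .master("local[*]") \
    .getOrCreate()

print(spark.version)  # while this session is alive, the UI is at http://localhost:4040
```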
The notebook is organized into 8 main sections with 69 interactive cells:
- Spark vs Hadoop comparison
- Architecture overview
- PySpark setup
- SparkSession creation
- Local mode configuration
- Web UI introduction
- Reading CSV with options
- Schema inspection
- Understanding lazy evaluation
- Selecting specific columns
- Filtering rows with conditions
- Sorting results
- Creating temporary views
- SQL queries on DataFrames
- Aggregations with SQL
- Text normalization
- NULL handling strategies
- Creating derived columns
- Salary band categorization
- GroupBy operations
- Multiple aggregate functions
- Ordering aggregated results
- Writing CSV and Parquet files
- Reading Parquet files
- RDD basics and operations
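For the output-format cells, writing and reading files typically looks like the hedged sketch below; the paths are illustrative, `df_clean` stands for the cleaned DataFrame, and the notebook's exact calls may differ.

```python
# Hedged sketch: persist the cleaned DataFrame and read it back.
# mode("overwrite") replaces any previous output; "append" would add to it.
df_clean.write.mode("overwrite").parquet("output/employees_parquet")
df_clean.write.mode("overwrite").option("header", True).csv("output/employees_csv")

# Parquet stores the schema, so no inference is needed when reading it back.
df_back = spark.read.parquet("output/employees_parquet")
df_back.show(5)
```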
- Apache Spark 3.5.3 - Distributed computing framework
- PySpark - Python API for Spark
- Jupyter Notebook - Interactive development environment
- NumPy 1.26.4 - Numerical computing library
- Py4J - Python-Java bridge
```python
# DataFrame API
from pyspark.sql import functions as F

df_clean.groupBy("department") \
    .agg(F.avg("salary").alias("avg_salary")) \
    .orderBy(F.desc("avg_salary")) \
    .show()

# SQL
spark.sql("""
    SELECT department, AVG(salary) as avg_salary
    FROM employees
    GROUP BY department
    ORDER BY avg_salary DESC
""").show()
```

```python
# Data cleaning: normalize department names, fill missing salaries and experience
df_clean = df \
    .withColumn("department", F.upper(F.trim(F.col("department")))) \
    .withColumn("salary",
                F.when(F.col("salary").isNull(), 50000)
                 .otherwise(F.col("salary"))) \
    .na.fill({"experience_years": 0})
```

```python
# Categorical column: bucket salaries into bands
df_with_bands = df_clean.withColumn(
    "salary_band",
    F.when(F.col("salary") < 40000, "Junior")
     .when(F.col("salary") < 70000, "Mid-Level")
     .otherwise("Senior")
)
```

- Beginners: Start from Section 1, follow sequentially
- Intermediate: Jump to Sections 5-6 for SQL and data cleaning
- Advanced: Focus on Section 8 for RDD operations and optimizations
- Lazy Evaluation: Understanding transformations vs actions
- Schema Inference: Automatic type detection from CSV
- Data Cleaning: Real-world data quality issues
- Performance: Using Parquet for efficient storage
- SQL Integration: Seamless SQL queries on DataFrames
- Monitoring: Using Spark UI for job inspection
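To make the lazy-evaluation point concrete, here is a hedged sketch: transformations only build a logical plan, and nothing executes until an action runs. The column names match the tutorial dataset, but the exact cells in the notebook may differ.

```python
from pyspark.sql import functions as F

# Transformations: Spark only records a plan here, no data is processed yet.
high_earners = df.filter(F.col("salary") > 60000) \
                 .select("name", "department", "salary")

high_earners.explain()  # inspect the optimized plan without triggering a job
high_earners.show(5)    # action: this is what actually runs the computation
```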
```bash
# Install Java 8 or 11
# Set JAVA_HOME environment variable
export JAVA_HOME=/path/to/java
```

- Ensure the PySpark version matches the Spark version
- Check Python environment variables
- Restart the Jupyter kernel

```python
# Increase driver memory
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.driver.memory", "4g") \
    .getOrCreate()
```

Included in the notebook:
- Official Apache Spark documentation links
- PySpark API reference
- Spark SQL guide
- RDD programming guide
This is an educational project. Suggestions for improvements:
- Additional exercises
- More complex transformations
- Real-world use cases
- Performance optimization examples
Feel free to fork and create pull requests!
MIT License - Free for educational and commercial use
After completing this tutorial:
- Explore Spark MLlib for machine learning
- Learn about Spark Streaming for real-time processing
- Study performance optimization and partitioning
- Deploy Spark on cloud platforms (Azure, AWS, GCP)
Questions or feedback? Open an issue on GitHub!
Happy Learning with PySpark!