
# Chapter 5: Performance Optimization, Best Practices & Real-World Projects (Deep Dive)

Welcome to the final and most advanced chapter! Here you'll practice PySpark performance tuning, best practices, and real-world project design through hands-on exercises modeled on interview questions.

## What You'll Do
- Deeply understand Spark's execution model and performance bottlenecks
- Apply best practices for scalable, maintainable data engineering
- Build and optimize real-world ETL pipelines
- Solve open-ended, interview-style challenges



## Important Instructions
- Sample syntax is for illustration only and uses generic DataFrame names (e.g., `df`, `input_df`).
- Always use the actual DataFrame names provided in the practice questions (e.g., `customers_df`, `orders_df`).
- Do not copy-paste the sample code for the practice question. Try to solve it yourself using the actual DataFrame.
- This is for your own practice, so type the commands even if the question is similar to the example.
- Don't execute the sample syntax as-is; it may modify the data.
- Avoid using AI for code completion.
- Experiment with a few variations of your own to deepen your understanding.



## Data Preparation (Run This First)
This section downloads, loads, and cleans all datasets so you can start performance and project work without running previous chapters.


In [None]:

# Download the data
!wget -O customers.csv https://raw.githubusercontent.com/icyanide9/de-practice/refs/heads/main/customers.csv
!wget -O products.csv https://raw.githubusercontent.com/icyanide9/de-practice/refs/heads/main/products.csv
!wget -O orders.csv https://raw.githubusercontent.com/icyanide9/de-practice/refs/heads/main/orders.csv
!wget -O order_items.csv https://raw.githubusercontent.com/icyanide9/de-practice/refs/heads/main/order_items.csv
!wget -O employees.csv https://raw.githubusercontent.com/icyanide9/de-practice/refs/heads/main/employees.csv
!wget -O transactions.csv https://raw.githubusercontent.com/icyanide9/de-practice/refs/heads/main/transactions.csv

from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

spark = SparkSession.builder.appName("PySpark Optimization Practice").getOrCreate()

# Load DataFrames
df_map = {}
for name in ["customers", "products", "orders", "order_items", "employees", "transactions"]:
    df_map[name] = spark.read.csv(f"{name}.csv", header=True, inferSchema=True)

customers_df = df_map["customers"]
products_df = df_map["products"]
orders_df = df_map["orders"]
order_items_df = df_map["order_items"]
employees_df = df_map["employees"]
transactions_df = df_map["transactions"]

# Data cleaning (minimal, for performance):
customers_df = customers_df.dropDuplicates().dropna(subset=["customer_id", "name", "email"])
products_df = products_df.dropDuplicates().dropna(subset=["product_id", "product_name", "price"])
orders_df = orders_df.dropDuplicates().dropna(subset=["order_id", "customer_id", "order_amount"])
order_items_df = order_items_df.dropDuplicates().dropna(subset=["order_item_id", "order_id", "product_id"])
employees_df = employees_df.dropDuplicates().dropna(subset=["employee_id", "name"])
transactions_df = transactions_df.dropDuplicates().dropna(subset=["transaction_id", "customer_id", "amount"])



## Table Overview
- **customers_df**: customer_id, name, email, phone, address, registration_date, status
- **products_df**: product_id, product_name, category, price, stock_quantity
- **orders_df**: order_id, customer_id, order_date, order_amount, order_status, payment_method
- **order_items_df**: order_item_id, order_id, product_id, quantity, item_total
- **employees_df**: employee_id, name, department, hire_date, salary, manager_id
- **transactions_df**: transaction_id, customer_id, transaction_date, amount, transaction_type, location, created_at



### 1. Spark Execution Model: Jobs, Stages, Tasks

**Concept:**
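- An action (e.g., `count()` or a write) triggers a job. Spark splits the job into stages at shuffle boundaries, which wide transformations such as `groupBy` and `join` introduce, and runs each stage as a set of tasks, one per partition.
- Understanding this helps you optimize shuffles and partitioning.
- Use `explain()` to see the physical plan; shuffles appear there as `Exchange` operators.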

**Sample Syntax (Generic):**
```python
df.explain(True)
```
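To make the shuffle check concrete, here is a minimal sketch that captures what `explain()` prints and counts shuffle operators. The helper names `plan_text` and `count_shuffles` are hypothetical. The sketch assumes that PySpark writes the plan through Python's standard output and that shuffles show up as `Exchange` lines, while `BroadcastExchange` marks a broadcast rather than a shuffle. Treat the count as a rough heuristic, not an exact stage count.

```python
import io
from contextlib import redirect_stdout


def plan_text(df):
    """Return the text that df.explain() prints (hypothetical helper)."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        df.explain()
    return buffer.getvalue()


def count_shuffles(plan):
    """Count plan lines that contain an Exchange operator, ignoring broadcasts."""
    # A plain loop is used on purpose: the notebook's
    # `from pyspark.sql.functions import *` shadows built-ins such as `sum`.
    shuffles = 0
    for line in plan.splitlines():
        # BroadcastExchange copies a small table to the executors; it is not a shuffle.
        if "Exchange" in line and "BroadcastExchange" not in line:
            shuffles += 1
    return shuffles
```

For example, `count_shuffles(plan_text(df.groupBy("col2").count()))` would typically report at least one shuffle, because a `groupBy` has to bring rows with the same key together.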

**Practice:**
- Use `explain()` on a DataFrame after a join or groupBy to see the execution plan.

**Expected Output:**
- The physical plan, with each shuffle shown as an `Exchange` operator (stage boundaries fall at these exchanges).

**Additional Challenge:**
- Identify a stage in the plan that causes a shuffle and suggest an optimization.


In [None]:
#practice here


### 2. Wide vs. Narrow Transformations

**Concept:**
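- Narrow: each partition of the parent is used by at most one partition of the child (e.g., `filter`, `select`, `withColumn`), so no data moves between executors.
- Wide: a child partition depends on multiple parent partitions (e.g., `groupBy`, `join`, `distinct`), which forces a shuffle.
- Minimize wide transformations for better performance, and filter before them so that less data is shuffled.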

**Sample Syntax (Generic):**
```python
# Narrow transformation
df2 = df1.filter(df1.col1 > 0)
# Wide transformation
df3 = df2.groupBy("col2").agg({"col3": "sum"})
```

**Practice:**
- Identify which transformations in your pipeline are wide and which are narrow.

**Expected Output:**
- List of transformations classified as wide or narrow.

**Additional Challenge:**
- Refactor a pipeline to reduce the number of wide transformations.


In [None]:
#practice here


### 3. Partitioning Strategies and Data Skew

**Concept:**
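- Use `repartition()` or `coalesce()` to control partition count and distribution. `repartition()` performs a full shuffle and can increase or decrease the partition count; `coalesce()` only reduces it and avoids a full shuffle by merging existing partitions.
- Data skew occurs when some partitions hold much more data than others, so a few slow tasks delay the whole stage.
- Mitigate skew by salting keys, that is, adding a random component so that one hot key spreads across several partitions. Increasing the partition count alone helps only when the skew does not come from a single hot key, because all rows with the same key still land in the same partition.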

**Sample Syntax (Generic):**
```python
# Repartition by key
df = df.repartition(10, "col1")
# Coalesce to reduce partitions
df = df.coalesce(2)
```
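Salting is easier to follow in code. The sketch below is one possible way to salt a join, not the only one. The function name `salted_join`, the column name `salt`, and the default of 8 salt values are assumptions made for illustration, and the sketch assumes neither DataFrame already has a column called `salt`. It uses SQL expressions via `selectExpr`, so it needs no extra imports.

```python
def salted_join(large_df, small_df, key, num_salts=8):
    """Join on (key, salt) so rows of a hot key spread over num_salts groups."""
    # Large, skewed side: tag every row with a random salt in [0, num_salts).
    large_salted = large_df.selectExpr(
        "*", f"CAST(FLOOR(RAND() * {num_salts}) AS INT) AS salt"
    )
    # Small side: repeat every row once per salt value,
    # so each (key, salt) pair on the large side still finds its match.
    small_salted = small_df.selectExpr(
        "*", f"EXPLODE(SEQUENCE(0, {num_salts - 1})) AS salt"
    )
    # Equi-join on both columns, then drop the helper column.
    return large_salted.join(small_salted, on=[key, "salt"]).drop("salt")
```

The trade-off is that the small side grows by a factor of `num_salts`, so the technique pays off only when the skew on the large side is severe.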

**Practice:**
- Repartition a large DataFrame by a high-cardinality key and observe performance.
- Simulate skew and fix it by salting the join key.

**Expected Output:**
- Improved partition distribution and faster job completion.

**Additional Challenge:**
- Write code to detect skewed partitions.


In [None]:
#practice here


### 4. Predicate Pushdown & Column Pruning

**Concept:**
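- Apply `filter()` and `select()` as early as possible to reduce the data read and shuffled.
- Predicate pushdown lets Spark hand filters to the data source so it can skip irrelevant data blocks. It is most effective with formats such as Parquet and ORC that store per-block statistics; for a plain CSV file Spark still has to scan the whole file.
- Column pruning reads only the columns the query needs, which pays off mainly with columnar formats.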

**Sample Syntax (Generic):**
```python
df = df.filter(df.col1 > 0).select("col1", "col2")
```
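One way to check whether a filter and a column selection actually reached the data source is to look for the `PushedFilters` and `ReadSchema` entries in the file scan of the physical plan. The sketch below is a rough helper for that. The name `scan_details` is hypothetical, the regular expressions assume the plan-text layout of a simple, non-nested schema, and it assumes `explain()` prints through Python's standard output.

```python
import io
import re
from contextlib import redirect_stdout


def scan_details(df):
    """Report the filters and columns that reached the file scan (hypothetical helper)."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        df.explain()  # PySpark prints the physical plan to standard output
    plan = buffer.getvalue()

    # Both entries appear on the FileScan line for sources such as Parquet.
    pushed = re.search(r"PushedFilters: \[(.*?)\]", plan)
    schema = re.search(r"ReadSchema: (struct<.*?>)", plan)
    return {
        "pushed_filters": pushed.group(1) if pushed else None,
        "read_schema": schema.group(1) if schema else None,
    }
```

A possible use, with `data.parquet` as a placeholder path: `scan_details(spark.read.parquet("data.parquet").filter("col1 > 0").select("col1", "col2"))`. If pushdown and pruning worked, the result should list a filter on `col1` and a read schema containing only `col1` and `col2`.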

**Practice:**
- Compare execution plans with and without early filtering using `explain()`.

**Expected Output:**
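- A physical plan in which the filter sits at or near the scan. Spark's optimizer often pushes filters down on its own, so the two plans may turn out identical.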

**Additional Challenge:**
- Measure and compare execution time for both approaches.


In [None]:
#practice here


### 5. Broadcast Joins

**Concept:**
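- Use `broadcast()` to join a small DataFrame to a large one. Spark sends a full copy of the small DataFrame to every executor, so the large DataFrame does not need to be shuffled.
- Only use it when the small DataFrame fits comfortably in the memory of the driver and of each executor.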

**Sample Syntax (Generic):**
```python
from pyspark.sql.functions import broadcast
joined_df = df1.join(broadcast(df2), df1.key == df2.key)
```

**Practice:**
- Broadcast join `products_df` to `order_items_df` and compare execution plans.

**Expected Output:**
- A faster join, with `BroadcastHashJoin` in the plan instead of `SortMergeJoin` and no shuffle of the large DataFrame.

**Additional Challenge:**
- Try to broadcast a large DataFrame and observe the warning/error.


In [None]:
#practice here


### 6. Caching and Persisting DataFrames

**Concept:**
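- Cache DataFrames only when they are reused multiple times; caching a DataFrame that is read once only costs memory.
- For DataFrames, `cache()` is shorthand for `persist()` with the default storage level, which uses memory and spills to disk. `persist()` also accepts an explicit `StorageLevel` such as `MEMORY_ONLY` or `DISK_ONLY`.
- Caching is lazy: the data is stored when the first action runs.
- Unpersist when done to free resources.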

**Sample Syntax (Generic):**
```python
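from pyspark import StorageLevel
df.cache()                          # same as persist() with the default storage level
df.unpersist()                      # release cached data when no longer needed
df.persist(StorageLevel.DISK_ONLY)  # persist() also accepts an explicit level
df.unpersist()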
```
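Because caching is only worthwhile when a DataFrame is reused, it helps to time the same action before and after the cache is filled. The following is a small sketch of that measurement. The helper names `timed` and `compare_cold_and_cached` are hypothetical, and wall-clock timings on a shared notebook runtime are noisy, so treat the numbers as indicative only.

```python
import time


def timed(action):
    """Run a zero-argument callable and return (result, seconds elapsed)."""
    start = time.perf_counter()
    result = action()
    return result, time.perf_counter() - start


def compare_cold_and_cached(df):
    """Time the same count() while the cache is filled and once it is warm."""
    df.cache()                    # lazy: nothing is stored yet
    _, first = timed(df.count)    # first action computes the data and fills the cache
    _, second = timed(df.count)   # second action can read from the cache
    df.unpersist()                # release the cached data
    return first, second
```

The `timed` helper works with any action, for instance `timed(lambda: df.filter("col1 > 0").count())`, so it can also be used to compare two versions of a pipeline.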

**Practice:**
- Cache a DataFrame, run actions, and observe performance improvement.

**Expected Output:**
- Faster repeated actions on the cached DataFrame.

**Additional Challenge:**
- Measure memory usage before and after caching.


In [None]:
#practice here


### 7. File Formats and Compression

**Concept:**
- Use columnar formats like Parquet or ORC for analytics; they support predicate pushdown and compression.
- CSV is row-based and less efficient for large-scale analytics.

**Sample Syntax (Generic):**
```python
df.write.parquet("output.parquet")
df.write.csv("output.csv")
```
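To compare formats and codecs side by side, one option is to write the same DataFrame several times and measure the size of each output directory. The sketch below assumes a local filesystem (as in a notebook runtime where files are fetched with `wget`); the helper names, the labels, and the chosen codecs are illustrative assumptions. It passes the codec through the writer's `compression` option.

```python
import os


def dir_size_bytes(path):
    """Total size in bytes of all files under a local directory."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def write_variants(df, base_dir):
    """Write df in several format/codec combinations and return their sizes."""
    targets = {
        "parquet_snappy": ("parquet", "snappy"),
        "parquet_gzip": ("parquet", "gzip"),
        "csv_plain": ("csv", "none"),
    }
    sizes = {}
    for label, (fmt, codec) in targets.items():
        path = os.path.join(base_dir, label)
        # Spark writes a directory of part files, not a single file.
        (df.write.mode("overwrite")
           .format(fmt)
           .option("compression", codec)
           .save(path))
        sizes[label] = dir_size_bytes(path)
    return sizes
```

Note that each output directory also holds small marker and checksum files, so the sizes are slightly larger than the data alone.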

**Practice:**
- Save a DataFrame as Parquet and as CSV, then compare file size and read performance.

**Expected Output:**
- Smaller file size and faster reads with Parquet.

**Additional Challenge:**
- Try different compression codecs (e.g., snappy, gzip) and compare results.


In [None]:
#practice here


### 8. Monitoring and Debugging with Spark UI

**Concept:**
- Use Spark UI to monitor jobs, stages, and tasks.
- Look for long-running or failed stages, skewed tasks, and memory issues.

**Sample Syntax (Generic):**
```python
# No code required: open the Spark UI from the cluster or notebook environment.
print(spark.sparkContext.uiWebUrl)  # URL of this application's Spark UI
```

**Practice:**
- Run a job, open Spark UI, and identify bottlenecks.

**Expected Output:**
- List of slow stages and possible causes.

**Additional Challenge:**
- Suggest optimizations based on Spark UI findings.


In [None]:
#practice here


### 9. Best Practices: Schema Management, Modular Code, Logging

**Concept:**
- Always define schemas for production jobs to avoid inference errors and the extra pass over the data that schema inference requires.
- Write modular, reusable code with functions and clear variable names.
- Use logging and error handling for production pipelines.

**Sample Syntax (Generic):**
```python
from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([StructField("col1", StringType(), True)])
df = spark.read.schema(schema).csv("file.csv")

def clean_data(df):
    return df.dropDuplicates().dropna()

import logging
logging.basicConfig(level=logging.INFO)
logging.info("Pipeline started")
```

**Practice:**
- Refactor a notebook cell into a function and add logging.

**Expected Output:**
- Cleaner, more maintainable code with logs.

**Additional Challenge:**
- Add error handling to your function.


In [None]:
#practice here


### 10. Data Quality Checks and Validation

**Concept:**
- Check for nulls, duplicates, and out-of-range values.
- Write assertions and validation functions to ensure data quality.

**Sample Syntax (Generic):**
```python
assert df.filter(df.col1.isNull()).count() == 0, "Nulls found in col1"
invalid_rows = df.filter(df.col2 < 0)
```
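Running one `count()` per check launches a separate Spark job each time. As a possible alternative, the sketch below gathers several counts in a single aggregation and raises an exception with a readable message instead of relying on `assert`. The function names `quality_report` and `check_quality` and the result keys are hypothetical; the column name is interpolated into SQL, so it is assumed to come from trusted code rather than user input.

```python
def quality_report(df, column, min_value=0):
    """Count total, null, and below-minimum rows for one column in a single pass."""
    row = df.selectExpr(
        "COUNT(*) AS total_rows",
        f"SUM(CASE WHEN `{column}` IS NULL THEN 1 ELSE 0 END) AS null_rows",
        f"SUM(CASE WHEN `{column}` < {min_value} THEN 1 ELSE 0 END) AS below_min_rows",
    ).first()
    # SUM over zero rows yields NULL (None in Python), so normalise it to 0.
    return {name: (value or 0) for name, value in row.asDict().items()}


def check_quality(report):
    """Raise ValueError with a readable message if any problem rows were counted."""
    problems = {
        name: count
        for name, count in report.items()
        if name != "total_rows" and count > 0
    }
    if problems:
        raise ValueError(f"Data quality check failed: {problems}")
    return report
```

A call such as `check_quality(quality_report(df, "col2"))` either returns the report or fails with the offending counts in the message, which is easier to log than a bare assertion error.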

**Practice:**
- Write a function to validate that `order_amount` in `orders_df` is always positive.

**Expected Output:**
- Assertion passes or fails with a clear message.

**Additional Challenge:**
- Add automated data quality checks to your ETL pipeline.


In [None]:
#practice here


### 11. Real-World ETL Pipeline: End-to-End Example

**Concept:**
- Build a pipeline: load, clean, transform, aggregate, and save data with all optimizations and best practices.

**Sample Syntax (Generic):**
```python
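from pyspark.sql import functions as F

raw_df = spark.read.csv("raw.csv", header=True, inferSchema=True)  # without a schema, every column is read as a string
clean_df = raw_df.dropDuplicates().dropna()
# Alias the aggregate: some Spark versions reject Parquet column names such as "sum(col2)".
agged_df = clean_df.groupBy("col1").agg(F.sum("col2").alias("total_col2"))
agged_df.write.mode("overwrite").parquet("output.parquet")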
```

**Practice:**
- Build an ETL pipeline: load raw orders, clean data, aggregate total sales per customer, and save as Parquet.

**Expected Output:**
- End-to-end pipeline code and results.

**Additional Challenge:**
- Add data quality checks and performance optimizations to your pipeline.


In [None]:
#practice here


### 12. Interview-Style Challenges & Capstone Exercises

**Concept:**
- Tackle open-ended, multi-step data engineering problems.
- Optimize, debug, and design scalable pipelines.

**Sample Syntax (Generic):**
```python
# No single syntax—combine all skills learned so far.
```

**Practice:**
- Given a slow-running job, identify and fix the bottleneck.
- Debug and fix a broken ETL pipeline (e.g., missing data, schema mismatch).
- Design a scalable pipeline for a business scenario (e.g., daily sales reporting for millions of records).

**Expected Output:**
- Optimized, working code and clear explanations.

**Additional Challenge:**
- Document your solution and explain your design choices.


In [None]:
#practice here