
# Chapter 5: Performance Optimization, Best Practices & Real-World Projects

Welcome to the final chapter! Here, you'll learn how to optimize PySpark jobs, follow best practices, and tackle real-world data engineering projects. This chapter is standalone: all data is freshly loaded and cleaned before you begin practicing.

## What You'll Do
- Master performance tuning and best practices in PySpark
- Apply your skills to real-world, end-to-end data engineering scenarios
- Prepare for interviews and production work with capstone challenges



## Important Instructions
- Sample syntax is for illustration only and uses generic DataFrame names (e.g., `df`, `input_df`).
- Always use the actual DataFrame names provided in the practice questions (e.g., `customers_df`, `orders_df`).
- Do not copy-paste the sample code for the practice question. Try to solve it yourself using the actual DataFrame.
- This is for your own practice, so type the commands even if the question is similar to the example.
- Don't run the sample syntax as-is; it uses placeholder names and may modify the data.
- Avoid using AI for code completion.
- Experiment with a few variations of your own to deepen your understanding.



## Data Preparation (Run This First)
This section downloads, loads, and cleans all datasets so you can start performance and project work without running previous chapters.


In [None]:

# Download the data
!wget -O customers.csv https://raw.githubusercontent.com/icyanide9/de-practice/refs/heads/main/customers.csv
!wget -O products.csv https://raw.githubusercontent.com/icyanide9/de-practice/refs/heads/main/products.csv
!wget -O orders.csv https://raw.githubusercontent.com/icyanide9/de-practice/refs/heads/main/orders.csv
!wget -O order_items.csv https://raw.githubusercontent.com/icyanide9/de-practice/refs/heads/main/order_items.csv
!wget -O employees.csv https://raw.githubusercontent.com/icyanide9/de-practice/refs/heads/main/employees.csv
!wget -O transactions.csv https://raw.githubusercontent.com/icyanide9/de-practice/refs/heads/main/transactions.csv

from pyspark.sql import SparkSession
# Note: this star import shadows Python built-ins such as sum, min, max, and round
from pyspark.sql.functions import *
from pyspark.sql.types import *

spark = SparkSession.builder.appName("PySpark Optimization Practice").getOrCreate()

# Load DataFrames
df_map = {}
for name in ["customers", "products", "orders", "order_items", "employees", "transactions"]:
    df_map[name] = spark.read.csv(f"{name}.csv", header=True, inferSchema=True)

customers_df = df_map["customers"]
products_df = df_map["products"]
orders_df = df_map["orders"]
order_items_df = df_map["order_items"]
employees_df = df_map["employees"]
transactions_df = df_map["transactions"]

# Minimal data cleaning: drop exact duplicate rows, then rows missing key columns
customers_df = customers_df.dropDuplicates().dropna(subset=["customer_id", "name", "email"])
products_df = products_df.dropDuplicates().dropna(subset=["product_id", "product_name", "price"])
orders_df = orders_df.dropDuplicates().dropna(subset=["order_id", "customer_id", "order_amount"])
order_items_df = order_items_df.dropDuplicates().dropna(subset=["order_item_id", "order_id", "product_id"])
employees_df = employees_df.dropDuplicates().dropna(subset=["employee_id", "name"])
transactions_df = transactions_df.dropDuplicates().dropna(subset=["transaction_id", "customer_id", "amount"])



## Table Overview
- **customers_df**: customer_id, name, email, phone, address, registration_date, status
- **products_df**: product_id, product_name, category, price, stock_quantity
- **orders_df**: order_id, customer_id, order_date, order_amount, order_status, payment_method
- **order_items_df**: order_item_id, order_id, product_id, quantity, item_total
- **employees_df**: employee_id, name, department, hire_date, salary, manager_id
- **transactions_df**: transaction_id, customer_id, transaction_date, amount, transaction_type, location, created_at



### 1. Performance Optimization in PySpark

**Concept:**
- Spark executes each job as a DAG of stages, which are split into tasks. Knowing where stage boundaries fall helps you see which operations are expensive.
- Minimize wide transformations (such as `groupBy` and `join`) where possible, because they shuffle data between executors.
- Use partitioning to distribute data evenly and avoid data skew.
- Filter rows and select only the columns you need as early as possible, so Spark can apply predicate pushdown and column pruning.
- Use `broadcast()` joins when one table is small enough to fit in each executor's memory; this avoids shuffling the larger table.
- Cache or persist a DataFrame only when it is reused by multiple actions, and release it with `unpersist()` when you are done.
- Monitor jobs using the Spark UI to identify bottlenecks.

**Sample Syntax (Generic):**
```python
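from pyspark.sql.functions import broadcast

# Broadcast join: ship the small table (df2) to every executor so the
# large table (df1) is joined without being shuffled
joined_df = df1.join(broadcast(df2), on="key")

# Filter rows and select columns early so less data flows downstream
df = df.filter(df.col1 > 0).select("col1", "col2")

# Cache only if df is reused by several actions; cache() is lazy and
# takes effect on the first action
df.cache()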
```

**Practice:**
- Identify a wide transformation in your pipeline and suggest an optimization.
- Join `order_items_df` with `products_df` using a broadcast join, assuming `products_df` is small enough to fit in executor memory.

**Expected Output:**
- Optimized code and explanation.

**Additional Challenge:**
- Use the Spark UI to find a slow stage and explain how you would optimize it.


In [None]:
#practice here


### 2. Best Practices for PySpark Data Engineering

**Concept:**
- Use columnar formats like Parquet or ORC for efficient storage and processing.
- Define schemas explicitly instead of relying on `inferSchema`, which needs an extra pass over the data and can guess types incorrectly.
- Handle skewed data by salting hot keys (adding a random suffix so that one key spreads across several partitions) or by repartitioning.
- Write modular, reusable code with functions and clear variable names.
- Use logging and error handling for production pipelines.

**Sample Syntax (Generic):**
```python
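from pyspark.sql.types import StructType, StructField, StringType

# Define the schema explicitly instead of inferring it
schema = StructType([StructField("col1", StringType(), True)])
# header=True assumes the file has a header row that should not be read as data
df = spark.read.schema(schema).csv("file.csv", header=True)

# Example of modular code: one small, reusable function per step
def clean_data(df):
    """Drop exact duplicate rows, then rows containing any null."""
    return df.dropDuplicates().dropna()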
```

**Practice:**
- Refactor a PySpark script to use explicit schema and modular functions.
- Save a DataFrame as Parquet and explain why it’s preferred over CSV.

**Expected Output:**
- Cleaner, more efficient code and explanation.

**Additional Challenge:**
- Handle a skewed join by salting the join key.


In [None]:
#practice here


### 3. Real-World Scenarios & Mini-Projects

**Concept:**
- Combine all your skills to build end-to-end ETL pipelines and analytics solutions.
- Use UDFs, SQL, joins, aggregations, and optimizations together.
- Monitor data quality and performance throughout the pipeline.

**Sample Syntax (Generic):**
```python
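from pyspark.sql import functions as F
# Extract: infer types so numeric columns are not read as strings
raw_df = spark.read.csv("raw.csv", header=True, inferSchema=True)
# Transform: drop duplicate rows and rows containing nulls
clean_df = raw_df.dropDuplicates().dropna()
# Alias the sum: the default name "sum(col2)" has parentheses, which
# some Spark versions reject when writing Parquet
agged_df = clean_df.groupBy("col1").agg(F.sum("col2").alias("col2_total"))
# Load: overwrite any previous output
agged_df.write.mode("overwrite").parquet("output.parquet")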
```

**Practice:**
- Build an ETL pipeline: load raw orders, clean data, aggregate total sales per customer, and save as Parquet.
- Build the tables behind a customer analytics dashboard: top customers by total spend, monthly sales, and churn candidates (for example, customers with no recent orders).

**Expected Output:**
- End-to-end pipeline code and results.

**Additional Challenge:**
- Add data quality checks and performance optimizations to your pipeline.


In [None]:
#practice here


### 4. Interview-Style Challenges & Capstone Exercises

**Concept:**
- Tackle open-ended, multi-step data engineering problems.
- Optimize, debug, and design scalable pipelines.

**Sample Syntax (Generic):**
```python
# No single syntax: combine all the skills learned so far.
```
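
When debugging a broken or slow pipeline, it helps to know which step failed and how long each step took. Here is a minimal sketch of a logging wrapper in plain Python. The `run_step` helper and the two step functions are hypothetical; in a Spark pipeline the steps would take and return DataFrames instead of lists.

```python
import logging
import math
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("pipeline")

def run_step(name, func, *args, **kwargs):
    """Run one pipeline step, logging its duration and any failure."""
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
    except Exception:
        # logger.exception records the full traceback before the error is re-raised
        logger.exception("step %s failed", name)
        raise
    logger.info("step %s finished in %.3fs", name, time.perf_counter() - start)
    return result

# Hypothetical steps operating on plain Python rows
def drop_missing(rows):
    return [r for r in rows if r.get("amount") is not None]

def total_amount(rows):
    # math.fsum is used because the notebook's star import shadows the built-in sum
    return math.fsum(r["amount"] for r in rows)

rows = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": None}, {"id": 3, "amount": 2.5}]
clean_rows = run_step("drop_missing", drop_missing, rows)
total = run_step("total_amount", total_amount, clean_rows)
print(total)
```

Because Spark evaluates lazily, a step that only builds transformations will appear to finish almost instantly; the measured time is meaningful only for steps that trigger an action such as `count()` or a write.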

**Practice:**
- Given a slow-running job, identify and fix the bottleneck.
- Debug and fix a broken ETL pipeline (e.g., missing data, schema mismatch).
- Design a scalable pipeline for a business scenario (e.g., daily sales reporting for millions of records).

**Expected Output:**
- Optimized, working code and clear explanations.

**Additional Challenge:**
- Document your solution and explain your design choices.


In [None]:
#practice here