**⭐ 1. What This Pattern Solves**

Reduces I/O by scanning only the relevant partitions of a partitioned table, which also shrinks the data passed to any downstream shuffle.

Critical for large datasets in Delta, Parquet, or Hive.

Improves query performance by avoiding full table scans.

Use cases:

Filtering large historical logs by date.

Querying customer data in a partitioned table by region or country.

Optimizing ETL pipelines by limiting input size.

**⭐ 2. SQL Equivalent**

In [0]:
%sql
-- Table partitioned by sale_date
SELECT *
FROM sales
WHERE sale_date = '2025-12-01';  -- Only relevant partition scanned


**⭐ 3. Core Idea**

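Partition pruning applies when a filter references a partition column directly, because partition values live in the directory layout (for example `date=2025-12-01/`) or in table metadata, not inside the data files.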

Spark can skip irrelevant partitions → less read → faster execution.

Combine with predicate pushdown on non-partition columns for maximum efficiency.
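To make the idea concrete, here is a minimal pure-Python sketch that mimics a Hive-style `date=<value>` directory layout and decides which directories to open from their names alone. It is an illustration of the principle, not how Spark is implemented; the helper names `write_partitioned` and `read_pruned` and the CSV file format are assumptions made for this example.

```python
import csv
import tempfile
from pathlib import Path


def write_partitioned(rows, root: Path) -> None:
    """Write each row into a Hive-style directory named date=<value>."""
    for sale_date, name, amount in rows:
        part_dir = root / f"date={sale_date}"
        part_dir.mkdir(parents=True, exist_ok=True)
        with open(part_dir / "part-0.csv", "a", newline="") as f:
            csv.writer(f).writerow([name, amount])


def read_pruned(root: Path, wanted_date: str):
    """Open only directories whose partition value matches; skip the rest unread."""
    rows, dirs_opened = [], 0
    for part_dir in sorted(root.iterdir()):
        key, _, value = part_dir.name.partition("=")
        if key != "date" or value != wanted_date:
            continue  # pruned: decided from the directory name alone
        dirs_opened += 1
        for file in sorted(part_dir.glob("*.csv")):
            with open(file, newline="") as f:
                rows.extend((value, name, int(amount)) for name, amount in csv.reader(f))
    return rows, dirs_opened


rows = [
    ("2025-12-01", "Alice", 100),
    ("2025-12-02", "Bob", 200),
    ("2025-12-01", "Charlie", 300),
]
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    write_partitioned(rows, root)
    total_dirs = len(list(root.iterdir()))
    matched, dirs_opened = read_pruned(root, "2025-12-01")
```

The filter never touches the files under `date=2025-12-02`, which is the saving that partition pruning provides at scale.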

**⭐ 4. Template Code (MEMORIZE THIS)**

In [0]:
from pyspark.sql.functions import col

# Read partitioned table with filter on partition column
df = spark.read.parquet("path/to/partitioned_data") \
       .filter("partition_col = 'value'")

# OR using .where() with a column expression
# (col() avoids referencing df before it is assigned)
df = spark.read.parquet("path/to/partitioned_data") \
       .where(col("partition_col") == "value")
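One way to check that the template above actually prunes is to look at the physical plan printed by `df.explain()` and find the `PartitionFilters: [...]` entry of the file scan. The sketch below is a small text helper for that check. The function name `partition_filters` is hypothetical, and the two plan strings are abbreviated, hand-written samples for illustration only; real plan text varies by Spark version and data source.

```python
import re
from typing import Optional


def partition_filters(plan_text: str) -> Optional[str]:
    """Return the contents of the first 'PartitionFilters: [...]' entry, or None if absent."""
    match = re.search(r"PartitionFilters: \[(.*?)\]", plan_text)
    return match.group(1) if match else None


# Hand-written, abbreviated plan fragments (assumptions, not captured output)
pruned_plan = (
    "FileScan parquet [...] "
    "PartitionFilters: [isnotnull(partition_col#2), (partition_col#2 = value)], "
    "PushedFilters: []"
)
unpruned_plan = (
    "FileScan parquet [...] "
    "PartitionFilters: [], "
    "PushedFilters: [IsNotNull(name), EqualTo(name,Alice)]"
)

pruned_filters = partition_filters(pruned_plan)
unpruned_filters = partition_filters(unpruned_plan)
```

A non-empty result suggests the filter was recognized as a partition filter; an empty string suggests the scan will list every partition.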

**⭐ 5. Detailed Example**

In [0]:
data = [
    ("2025-12-01", "Alice", 100),
    ("2025-12-02", "Bob", 200),
    ("2025-12-01", "Charlie", 300)
]
df = spark.createDataFrame(data, ["date", "name", "amount"])

# Write partitioned by date
df.write.partitionBy("date").mode("overwrite").parquet("s3://data/sales")

# Read only 2025-12-01 partition
df_filtered = spark.read.parquet("s3://data/sales").filter("date = '2025-12-01'")
df_filtered.show()


**⭐ 6. Mini Practice Problems**

Read a Delta table partitioned by region and filter only region='US'.

Why does filtering a non-partitioned column not benefit from partition pruning?

Write a DataFrame partitioned by year and month, then read only year=2025.

**⭐ 7. Full Data Engineering Problem**

Scenario: You have a 2TB Delta table partitioned by year/month/day. You need yesterday’s logs for aggregation.

In [0]:
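from datetime import datetime, timedelta

# Assumes year, month, and day are integer-typed partition columns
yesterday = (datetime.now() - timedelta(days=1)).strftime("%Y-%m-%d")
# Convert to int so the filter reads day=1 rather than the zero-padded day=01
year, month, day = (int(part) for part in yesterday.split("-"))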

df_yesterday = spark.read.format("delta") \
    .load("s3://bronze/logs/") \
    .filter(f"year={year} AND month={month} AND day={day}")

# Aggregate user clicks
agg_df = df_yesterday.groupBy("user_id").count()
agg_df.show()
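If the job ever needs more than one day (for example a short backfill that crosses a month boundary), the same predicate can be generated per day and joined with `OR`. This is a sketch under the assumption that `year`, `month`, and `day` are integer-typed partition columns; the helper name `day_partition_predicate` is hypothetical.

```python
from datetime import date, timedelta


def day_partition_predicate(start: date, end: date) -> str:
    """Build an OR of year/month/day equality groups for each day in [start, end]."""
    if end < start:
        raise ValueError("end must not be before start")
    clauses = []
    current = start
    while current <= end:
        clauses.append(
            f"(year = {current.year} AND month = {current.month} AND day = {current.day})"
        )
        current += timedelta(days=1)
    return " OR ".join(clauses)


predicate = day_partition_predicate(date(2025, 11, 30), date(2025, 12, 1))
# Usage with the table from the scenario (not executed here):
# spark.read.format("delta").load("s3://bronze/logs/").filter(predicate)
```

Because every clause references only partition columns, the resulting string can be passed to `.filter()` in place of the single-day predicate.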


**⭐ 8. Time & Space Complexity**

| Operation         | Time Complexity                                      | Space Complexity    |
| ----------------- | ---------------------------------------------------- | ------------------- |
| Partition pruning | O(n · k/p) (scan k of p equally sized partitions)    | O(filtered records) |
| Full table scan   | O(n)                                                 | O(n)                |
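As a rough worked example of the table above: if a scan touches k of p equally sized partitions, it reads about n · k / p of the n total bytes. The sketch below applies that to the 2 TB table from section 7. The partition count of 730 (two years of daily partitions) is a hypothetical figure chosen for illustration, and real partitions are rarely equal in size.

```python
def pruned_scan_bytes(total_bytes: int, partitions_total: int, partitions_scanned: int) -> float:
    """Estimate bytes read when scanning k of p equally sized partitions: n * k / p."""
    if partitions_total <= 0 or not 0 <= partitions_scanned <= partitions_total:
        raise ValueError("need 0 <= partitions_scanned <= partitions_total and partitions_total > 0")
    return total_bytes * partitions_scanned / partitions_total


TB = 10**12
n = 2 * TB   # 2 TB table from the scenario in section 7
p = 730      # hypothetical: two years of daily partitions
estimate = pruned_scan_bytes(n, p, 1)   # read a single day
estimate_gb = round(estimate / 10**9, 2)
```

Under these assumptions a one-day read touches a few gigabytes instead of the full 2 TB.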


**⭐ 9. Common Pitfalls**

Filtering only on non-partition columns → no partition pruning.

Using complex expressions on partition columns → Spark may not prune.

Forgetting to partition at write time → cannot prune later.

Over-partitioning → too many small files → overhead > benefit.
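For the pitfall about complex expressions, a conservative habit is to compare the bare partition column against literal bounds instead of wrapping it in a function. Whether a particular expression still prunes depends on the engine and version, so treat this as a defensive sketch. The helper name `year_range_predicate` is hypothetical, and the `sale_date` column is borrowed from the SQL example in section 2.

```python
from datetime import date


def year_range_predicate(column: str, year: int) -> str:
    """Return a half-open range predicate covering one calendar year."""
    start = date(year, 1, 1)
    end = date(year + 1, 1, 1)
    return f"{column} >= '{start.isoformat()}' AND {column} < '{end.isoformat()}'"


predicate = year_range_predicate("sale_date", 2025)
# Usage (not executed here):
# spark.table("sales").filter(predicate)
```

The half-open upper bound (`< '2026-01-01'`) avoids having to know the last day of the year.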