# PySpark: Partition Pruning & Predicate Pushdown Demo

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/YOUR_USERNAME/YOUR_REPO/blob/main/PySpark_Partition_Pruning_Demo.ipynb)

This notebook demonstrates two critical PySpark optimization techniques:
1. **Partition Pruning** - Skipping entire data partitions
2. **Predicate Pushdown** - Pushing filters to the file format level

---

## 📦 Setup: Install PySpark and Java

In [None]:
# Install Java (required for PySpark).
# Java 17 is used here because an unpinned `pip install pyspark` can pull in
# Spark 4.x, which no longer runs on Java 8.
!apt-get install openjdk-17-jdk-headless -qq > /dev/null

# Install PySpark
!pip install pyspark -q

print("✅ Installation complete!")

In [None]:
# Set up Java environment (path must match the JDK installed above)
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-17-openjdk-amd64"

# Import required libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date
from datetime import datetime
import pandas as pd

print("✅ Imports successful!")

## 🔧 Initialize Spark Session

In [None]:
spark = SparkSession.builder \
    .appName("PartitionPruning_PredicatePushdown") \
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic") \
    .config("spark.driver.memory", "2g") \
    .getOrCreate()

print("✅ Spark Session initialized!")
print(f"Spark Version: {spark.version}")

## 📊 Create Sample Data

In [None]:
# Create sample orders data
sample_data = {
    'OrderID': list(range(1, 101)),
    'OrderName': [f'Order_{chr(65 + i % 26)}' for i in range(100)],
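    # Change customer every 5 rows (not every row) so each customer appears on every date;
    # cycling both lists in lockstep would put 'John' only on 21-12-1999
    'Customer': [['John', 'Jane', 'Bob', 'Alice', 'Charlie'][(i // 5) % 5] for i in range(100)],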
    'Date': ['21-12-1999', '22-12-1999', '23-12-1999', '24-12-1999', '25-12-1999'] * 20
}

# Create pandas DataFrame and save as CSV
df_pandas = pd.DataFrame(sample_data)
df_pandas.to_csv('/content/orders_sample.csv', index=False)

print("✅ Sample orders data created successfully!")
print(f"\nTotal records: {len(df_pandas)}")
print("\nSample data:")
df_pandas.head(10)

## 📥 Step 1: Read Raw Data

In [None]:
raw_path = "/content/orders_sample.csv"

df_raw = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv(raw_path)

print("Raw Data Schema:")
df_raw.printSchema()

print("\nSample Records:")
df_raw.show(10)

## 💾 Step 2: Write Partitioned Data

We'll partition the data by date, creating separate folders for each date.

In [None]:
refined_path = "/content/refined/orders/"

# Convert date string to proper date format for partitioning
df_partitioned = df_raw.withColumn("date_partition", to_date(col("Date"), "dd-MM-yyyy"))

print("Data with partition column:")
df_partitioned.show(5)

# Write data partitioned by date
df_partitioned.write \
    .mode("overwrite") \
    .partitionBy("date_partition") \
    .parquet(refined_path)

print(f"\n✅ Data written to {refined_path} partitioned by date_partition")

In [None]:
# Check the directory structure created
print("Directory structure created:")
!ls -lh /content/refined/orders/
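
For orientation, the listing above should contain one `date_partition=<value>` directory per distinct date, since Spark names partition directories in the Hive-style `column=value` form. The sketch below derives the expected names from the demo's date strings in plain Python. The helper name `expected_partition_dirs` is hypothetical, and the code touches neither Spark nor the filesystem.

```python
from datetime import datetime

def expected_partition_dirs(date_strings, column="date_partition"):
    """Return the sorted Hive-style directory names expected for dd-MM-yyyy dates."""
    iso_dates = {
        # Python's %d-%m-%Y corresponds to the Spark pattern "dd-MM-yyyy"
        datetime.strptime(s, "%d-%m-%Y").date().isoformat()
        for s in date_strings
    }
    return sorted(f"{column}={d}" for d in iso_dates)

# Same date strings as the sample data in this notebook
dates = ['21-12-1999', '22-12-1999', '23-12-1999', '24-12-1999', '25-12-1999'] * 20
partition_dirs = expected_partition_dirs(dates)
for name in partition_dirs:
    print(name)
```

Comparing this list with the `ls` output is a quick way to confirm that the date conversion worked and no rows landed in an unexpected partition.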

## 📖 Step 3: Read Partitioned Data

In [None]:
df_refined = spark.read.parquet(refined_path)

print("📊 Refined Data Schema (with partition column):")
df_refined.printSchema()

print("\nSample Records:")
df_refined.show(5)

## 🚀 Demonstration 1: Partition Pruning

**Partition Pruning** occurs when we filter on the partition column. Spark will only read the specific partition(s) that match the filter, skipping all other partitions entirely.

In [None]:
print("="*70)
print("🚀 PARTITION PRUNING - Filter on partition column (date_partition)")
print("="*70)

# Filter on PARTITION COLUMN - Spark will ONLY read specific partitions
df_filtered_partition = df_refined.filter(col("date_partition") == "1999-12-23")

print("\n✅ Query with Partition Pruning (only reads 1 partition):")
print("Filter: date_partition == '1999-12-23'")
print("\nQuery Plans (logical, then physical):")
df_filtered_partition.explain(True)

print("\nResults:")
df_filtered_partition.show()

### 📝 Analysis

Notice in the physical plan above:
- **PartitionFilters**: Shows `[isnotnull(date_partition#...), (date_partition#... = 1999-12-23)]`
- Spark will only scan the `date_partition=1999-12-23` folder
- All other date partitions are completely skipped
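
Conceptually, pruning is a decision made on directory names before any data file is opened. The following is a simplified plain-Python model of that idea, not Spark's actual implementation. `parse_partition_dir`, `prune_partitions`, and the hard-coded directory list are illustrative assumptions.

```python
def parse_partition_dir(dirname):
    """Split a Hive-style 'column=value' directory name into (column, value)."""
    column, _, value = dirname.partition("=")
    return column, value

def prune_partitions(dirnames, column, wanted_value):
    """Keep only the directories whose partition value matches the filter."""
    kept = []
    for name in dirnames:
        col_name, value = parse_partition_dir(name)
        if col_name == column and value == wanted_value:
            kept.append(name)
    return kept

# The five partition directories this demo is expected to create
all_dirs = [f"date_partition=1999-12-{day}" for day in range(21, 26)]
to_scan = prune_partitions(all_dirs, "date_partition", "1999-12-23")
print(f"Scanning {len(to_scan)} of {len(all_dirs)} partitions: {to_scan}")
```

The point of the model is that the cost of the skipped partitions is zero: their files are never listed for reading, regardless of how large they are.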

## 🚀 Demonstration 2: Predicate Pushdown

**Predicate Pushdown** occurs when we filter on a data column (non-partition). The filter is pushed down to the Parquet reader, which applies it while reading the files, reducing the amount of data loaded into memory.

In [None]:
print("="*70)
print("🚀 PREDICATE PUSHDOWN - Filter on data column (Customer)")
print("="*70)

# Filter on DATA COLUMN (not partition column) - Predicate Pushdown applies
df_filtered_data = df_refined.filter(col("Customer") == "John")

print("\n✅ Query with Predicate Pushdown (filter pushed to file format):")
print("Filter: Customer == 'John'")
print("\nQuery Plans (logical, then physical):")
df_filtered_data.explain(True)

print("\nResults:")
df_filtered_data.show()

### 📝 Analysis

Notice in the physical plan above:
- **PushedFilters**: Shows `[IsNotNull(Customer), EqualTo(Customer,John)]`
- The filter is pushed to the Parquet reader
- Parquet stores min/max statistics per column for each row group, so the reader can skip row groups that cannot contain matching rows
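
To make the row-group skipping concrete, here is a toy model in plain Python. It assumes each row group records the minimum and maximum value of a column, which is the kind of statistic Parquet keeps. The `RowGroup` class and the sample groups are invented for illustration and do not reflect how this demo's small files are actually laid out.

```python
from dataclasses import dataclass

@dataclass
class RowGroup:
    """Toy stand-in for a Parquet row group with min/max stats for one column."""
    min_value: str
    max_value: str
    rows: list

def might_contain(group, value):
    """An equality filter can only match if value lies within [min, max]."""
    return group.min_value <= value <= group.max_value

def scan_equal(groups, value):
    """Skip row groups ruled out by statistics, then filter the remaining rows."""
    matches, groups_read = [], 0
    for group in groups:
        if not might_contain(group, value):
            continue  # skipped without reading any rows
        groups_read += 1
        matches.extend(r for r in group.rows if r == value)
    return matches, groups_read

# Invented sample row groups, each with values in a different range
groups = [
    RowGroup("Alice", "Bob", ["Alice", "Bob", "Alice"]),
    RowGroup("Charlie", "John", ["Charlie", "Jane", "John"]),
    RowGroup("Kim", "Zoe", ["Kim", "Zoe"]),
]
matches, groups_read = scan_equal(groups, "John")
print(f"Read {groups_read} of {len(groups)} row groups, found {len(matches)} match(es)")
```

Statistics can only rule a row group out, never confirm a match, so the rows that are read still need a row-level check. That is consistent with a `Filter` step typically still appearing above the scan in the plan.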

## 🚀 Demonstration 3: Combined Optimization

The most powerful optimization comes from combining **both techniques**: filter on the partition column AND a data column.

In [None]:
print("="*70)
print("🚀 COMBINED - Partition Pruning + Predicate Pushdown")
print("="*70)

# Filter on BOTH partition column AND data column
df_optimized = df_refined.filter(
    (col("date_partition") == "1999-12-23") &  # Partition Pruning
    (col("Customer") == "John")                 # Predicate Pushdown
)

print("\n✅ Optimized Query (both techniques applied):")
print("Filter: date_partition == '1999-12-23' AND Customer == 'John'")
print("\nQuery Plans (logical, then physical):")
df_optimized.explain(True)

print("\nResults:")
df_optimized.show()

### 📝 Analysis

This query benefits from **BOTH optimizations**:
1. **Partition Pruning**: Only reads `date_partition=1999-12-23` folder
2. **Predicate Pushdown**: Within that partition, filters `Customer=='John'` at the Parquet level

Result: Minimal data read from disk, minimal data loaded into memory!
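
When plans get long, it can help to pull out just the two relevant entries. The sketch below does this with a regular expression over plan text. `extract_filters` is a hypothetical helper, and `sample_plan` is a hand-written, abbreviated stand-in modeled on the entries described in the analyses above, not captured Spark output. How the plan text is obtained from Spark is left out here.

```python
import re

def extract_filters(plan_text):
    """Return the bracketed contents of PartitionFilters and PushedFilters, if present."""
    found = {}
    for key in ("PartitionFilters", "PushedFilters"):
        match = re.search(rf"{key}: \[(.*?)\]", plan_text)
        found[key] = match.group(1) if match else None
    return found

# Hand-written, abbreviated example text (not real output);
# '#...' stands in for the expression IDs Spark appends to column names
sample_plan = (
    "FileScan parquet [...] "
    "PartitionFilters: [isnotnull(date_partition#...), (date_partition#... = 1999-12-23)], "
    "PushedFilters: [IsNotNull(Customer), EqualTo(Customer,John)]"
)

filters = extract_filters(sample_plan)
for name, value in filters.items():
    print(f"{name}: {value}")
```

A query that benefits from both optimizations should show a non-empty entry for each key. An empty `PartitionFilters` list suggests the filter did not reference the partition column.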

## 📈 Performance Comparison

In [None]:
print("="*70)
print("📈 PERFORMANCE COMPARISON")
print("="*70)

# WITHOUT optimization (full table scan)
print("\n1️⃣  NO FILTER - Full table scan:")
count_all = df_refined.count()
print(f"   Total records: {count_all}")

# WITH Partition Pruning only
print("\n2️⃣  PARTITION PRUNING - Filter on partition column:")
count_partition = df_filtered_partition.count()
print(f"   Records with date_partition='1999-12-23': {count_partition}")
print(f"   Rows filtered out: {(1 - count_partition/count_all) * 100:.1f}%")

# WITH Predicate Pushdown only
print("\n3️⃣  PREDICATE PUSHDOWN - Filter on data column:")
count_data = df_filtered_data.count()
print(f"   Records with Customer='John': {count_data}")
print(f"   Rows filtered out: {(1 - count_data/count_all) * 100:.1f}%")

# WITH Both optimizations
print("\n4️⃣  BOTH OPTIMIZATIONS - Filter on both:")
count_optimized = df_optimized.count()
print(f"   Records with both filters: {count_optimized}")
print(f"   Rows filtered out: {(1 - count_optimized/count_all) * 100:.1f}%")
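
The percentages printed above come from row counts, so they describe how many rows each filter removes from the result rather than how many bytes Spark read from disk. As a small worked example of the arithmetic, assuming the counts stated in this demo (100 rows in total, 20 per date partition), a hypothetical `reduction_pct` helper could look like this:

```python
def reduction_pct(kept, total):
    """Percentage of rows removed by a filter; returns 0.0 for an empty table."""
    if total == 0:
        return 0.0  # avoid ZeroDivisionError on an empty dataset
    return (1 - kept / total) * 100

total_rows = 100      # all records in the demo
one_partition = 20    # records in a single date partition

partition_only = reduction_pct(one_partition, total_rows)
nothing_kept = reduction_pct(0, total_rows)
print(f"One partition kept: {partition_only:.1f}% of rows removed")
print(f"Nothing kept:       {nothing_kept:.1f}% of rows removed")
```

The zero-total guard matters in practice: the inline expressions in the cell above would raise `ZeroDivisionError` if `count_all` were 0.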

## 📊 Partition Statistics

In [None]:
print("="*70)
print("📊 PARTITION STATISTICS")
print("="*70)
print("\nRecords per partition:")

df_refined.groupBy("date_partition").count().orderBy("date_partition").show()

## 📚 Key Takeaways

### 1. **Partition Pruning**
- ✅ Applies when filtering on **PARTITION COLUMNS**
- ✅ Skips reading entire partitions/folders
- ✅ Reduces data scanned from storage
- 📌 Example: `date_partition == '1999-12-23'`

### 2. **Predicate Pushdown**
- ✅ Applies when filtering on **DATA COLUMNS** (non-partition)
- ✅ Pushes filter to file format reader (Parquet, ORC)
- ✅ Reduces data loaded into memory
- 📌 Example: `Customer == 'John'`

### 3. **Best Practices**
- ✅ Partition by frequently filtered columns (date, region, category)
- ✅ Use columnar formats (Parquet/ORC) for predicate pushdown
- ✅ Combine both techniques for maximum performance
- ⚠️ Avoid over-partitioning (too many small files)
- ⚠️ Ideal partition size: 128MB - 1GB per partition
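
As a rough way to sanity-check a partitioning choice before writing, the sketch below estimates the average partition size from a total data size and the number of distinct partition values, then compares it with the 128MB to 1GB guideline above. The function name, the even-distribution assumption, and the example sizes are all hypothetical.

```python
MB = 1024 ** 2
GB = 1024 ** 3

def assess_partitioning(total_bytes, distinct_values, low=128 * MB, high=1 * GB):
    """Classify the average partition size, assuming data is spread evenly."""
    if distinct_values <= 0:
        raise ValueError("distinct_values must be positive")
    avg = total_bytes / distinct_values
    if avg < low:
        return "too small (risk of many small files)"
    if avg > high:
        return "too large (consider a finer partition column)"
    return "within guideline"

# Hypothetical example: 200 GB of data covering one year
daily = assess_partitioning(200 * GB, 365)          # one partition per day
per_minute = assess_partitioning(200 * GB, 525600)  # one partition per minute
print(f"By day:    {daily}")
print(f"By minute: {per_minute}")
```

Real data is rarely spread evenly, so an estimate like this is only a first check. Skewed partition values can still produce a few oversized partitions alongside many tiny ones.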

### 4. **In This Demo**
- Created 5 partitions by date (21-25 Dec 1999)
- Each partition contains 20 records
- Total 100 records across all partitions
- Demonstrated up to a **96% reduction in rows returned** with combined filters

## 🧹 Cleanup

In [None]:
# Stop Spark session
spark.stop()
print("✅ Spark session stopped. Demo completed!")

---

## 📖 Additional Resources

- [PySpark Documentation](https://spark.apache.org/docs/latest/api/python/)
- [Spark SQL Performance Tuning](https://spark.apache.org/docs/latest/sql-performance-tuning.html)
- [Parquet File Format](https://parquet.apache.org/)

---

**📝 Note**: Replace `YOUR_USERNAME/YOUR_REPO` in the Colab badge at the top with your actual GitHub username and repository name.