# PySpark: Zero to Hero
## Module 31: Data Skipping and Z-Ordering

In the previous module, we learned about **Partitioning** to skip huge chunks of data (directories). However, partitioning has a limitation: it works best for low-cardinality columns (like Country, Date). 

**What if we need to filter by a High-Cardinality column (e.g., UserID, OrderID, Timestamp)?**
Creating one partition (folder) per OrderID would produce millions of tiny directories and files. This is the **Small File Problem**: file listing and metadata overhead dominate, and both reads and writes slow to a crawl.

**The Solution: Z-Ordering (Data Skipping)**
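Z-Ordering is a technique to co-locate related information in the same set of files. Delta Lake records per-file statistics (min/max values) in its transaction log whenever it writes data (by default for the first 32 columns, not only the Z-Ordered ones). Z-Ordering is what makes those statistics useful: once rows with similar values sit in the same files, each file covers a narrow value range, and the engine can skip individual files within a partition whose range cannot contain the requested value.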
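
To make the min/max idea concrete, here is a minimal pure-Python sketch (no Spark involved). It simulates 50 "files" of invoice IDs in two layouts and counts how many files a point lookup would have to open. The helper names `split_into_files` and `files_to_scan` are made up for this illustration and are not part of any Delta Lake API.

```python
import random

def split_into_files(values, num_files):
    """Split a list into equally sized chunks, one chunk per simulated file."""
    size = len(values) // num_files
    return [values[i * size:(i + 1) * size] for i in range(num_files)]

def files_to_scan(files, target):
    """Count files whose [min, max] range could contain the target value."""
    return sum(1 for f in files if min(f) <= target <= max(f))

random.seed(42)
invoice_ids = list(range(100_000))

# Layout A: rows land in files in random order (similar to a random repartition).
shuffled = invoice_ids[:]
random.shuffle(shuffled)
random_files = split_into_files(shuffled, 50)

# Layout B: rows are sorted before being written
# (roughly what Z-Ordering on a single column achieves).
sorted_files = split_into_files(sorted(invoice_ids), 50)

target = 50_000
scan_random = files_to_scan(random_files, target)
scan_sorted = files_to_scan(sorted_files, target)
print("random layout:", scan_random, "of 50 files")
print("sorted layout:", scan_sorted, "of 50 files")
```

In the sorted layout exactly one file has a range that contains the target. In the random layout nearly every file spans the whole ID range, so the statistics cannot exclude anything.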

### Agenda:
1.  **Partitioning vs. Z-Ordering:** When to use which.
2.  **Setup:** Generate synthetic sales data.
3.  **The Problem:** Scanning files for high-cardinality lookups.
4.  **The Solution:** Applying `OPTIMIZE ... ZORDER BY`.
5.  **Converting Parquet to Delta:** Using `convertToDelta`.

In [None]:
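import shutil

# configure_spark_with_delta_pip lives in the top-level delta package, not delta.tables
from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, rand, lit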

# Set up Spark with Delta Lake.
# configure_spark_with_delta_pip() sets spark.jars.packages itself, using the Delta
# artifact that matches the installed pip package, so we do not set it by hand here.
builder = (
    SparkSession.builder
    .appName("Delta_Optimization_Demo")
    .master("local[*]")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)

spark = configure_spark_with_delta_pip(builder).getOrCreate()

print("Spark Session Created with Delta Support")

In [None]:
# We need enough data to produce multiple files so that skipping is observable.
# We will generate 1 million rows.
from pyspark.sql.functions import when  # needed for the Country mapping below

print("Generating Data...")
# InvoiceNo is high cardinality (unique per row)
# Country is low cardinality (good for partitioning)

data = (
    spark.range(0, 1000000).withColumnRenamed("id", "InvoiceNo")
    # Draw a random integer from 0 to 4, then map it to a country name
    .withColumn("Country", (rand() * 5).cast("int"))
    .withColumn(
        "Country",
        # A when-chain must start with the when() function, not with a literal column
        when(col("Country") == 0, "USA")
        .when(col("Country") == 1, "India")
        .when(col("Country") == 2, "UK")
        .when(col("Country") == 3, "Canada")
        .otherwise("Australia"),
    )
    .withColumn("Amount", (rand() * 1000).cast("int"))
)

# Repartition to simulate a real scenario with many small files initially
df_sales = data.repartition(50)

print("Data Generation Complete.")
df_sales.show(5)

In [None]:
# We partition by Country because it has low cardinality (5 unique values)
# We DO NOT partition by InvoiceNo because it has 1M unique values.

delta_path = "data/delta_sales_zorder"

# Clean up if exists
shutil.rmtree(delta_path, ignore_errors=True)

print("Writing Partitioned Delta Table...")
df_sales.write.format("delta") \
    .mode("overwrite") \
    .partitionBy("Country") \
    .save(delta_path)

print("Write Complete.")
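
To see the physical layout that `partitionBy("Country")` produces, a small pure-Python sketch like the one below can count the data files in each partition folder. The function name `count_data_files` is hypothetical, and the path passed to it is the `delta_path` value used in this module.

```python
import os
from collections import Counter

def count_data_files(table_path):
    """Count .parquet data files per partition directory under a table path."""
    counts = Counter()
    for root, dirs, files in os.walk(table_path):
        # Skip the transaction log: it holds commit files (and possibly
        # checkpoint Parquet files), not table data.
        dirs[:] = [d for d in dirs if d != "_delta_log"]
        n = sum(1 for name in files if name.endswith(".parquet"))
        if n:
            counts[os.path.relpath(root, table_path)] += n
    return dict(counts)

# Returns an empty dict if the path does not exist yet.
print(count_data_files("data/delta_sales_zorder"))
```

Keep in mind that this counts files on disk. After an `OPTIMIZE`, the old files stay on disk until they are vacuumed, so the on-disk count can be higher than the number of files the table actually reads.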

In [None]:
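# Let's try to search for a specific InvoiceNo.
# Delta already stores min/max stats per file, but the rows were shuffled randomly across
# files, so almost every file in the Country partition spans the full InvoiceNo range.
# Spark therefore cannot rule out any of them for InvoiceNo=500000.
# Note: Country was assigned randomly, so this invoice may not be in 'India' and the
# result can be empty; the file-scanning behavior is the same either way.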

target_invoice = 500000

print("Querying without Z-Order:")
# We filter by Country (Partition Pruning happens here) AND InvoiceNo
df_read = spark.read.format("delta").load(delta_path)

# NOTE: In a real cluster UI, you would see the number of files scanned.
# Here we check the physical plan.
df_filtered = df_read.filter((col("Country") == "India") & (col("InvoiceNo") == target_invoice))
df_filtered.explain()
df_filtered.show()

In [None]:
# Z-Ordering co-locates data. With a single column it effectively sorts rows by InvoiceNo
# within each partition (Country) and rewrites them into new, compacted files.
# The new files are recorded in the Delta Log with narrow min/max InvoiceNo ranges.

deltaTable = DeltaTable.forPath(spark, delta_path)

print("Running OPTIMIZE with ZORDER BY InvoiceNo...")

# Python API for Z-Order
deltaTable.optimize().executeZOrderBy("InvoiceNo")

# SQL Equivalent: 
# spark.sql(f"OPTIMIZE '{delta_path}' ZORDER BY (InvoiceNo)")

print("Optimization Complete.")

In [None]:
# Now when we run the same query, Spark checks the Delta Log first.
# It looks at the min/max InvoiceNo for files inside 'Country=India'.
# It skips files where 500000 is not in the [min, max] range.

print("Querying WITH Z-Order:")
df_optimized = spark.read.format("delta").load(delta_path)
df_optimized_filter = df_optimized.filter((col("Country") == "India") & (col("InvoiceNo") == target_invoice))

df_optimized_filter.show()

# Check History to see the OPTIMIZE operation
deltaTable.history().select("version", "operation", "operationParameters").show(truncate=False)
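
The per-file statistics mentioned above can be inspected without Spark, because Delta commits are newline-delimited JSON files under `_delta_log`. The following is a rough sketch, assuming all JSON commits are still present (it ignores checkpoint files); `live_file_stats` is a name chosen for this example, not a Delta API.

```python
import glob
import json
import os

def live_file_stats(table_path):
    """Replay JSON commits and return {file path: parsed stats} for live files.

    Sketch only: checkpoint files are ignored, so this assumes that every
    JSON commit since version 0 is still in the log directory.
    """
    live = {}
    pattern = os.path.join(table_path, "_delta_log", "*.json")
    # Commit file names are zero-padded versions, so a lexical sort is version order.
    for log_file in sorted(glob.glob(pattern)):
        with open(log_file) as fh:
            for line in fh:
                if not line.strip():
                    continue
                action = json.loads(line)
                if "add" in action:
                    add = action["add"]
                    # 'stats' is itself a JSON-encoded string and may be absent.
                    stats = json.loads(add["stats"]) if add.get("stats") else {}
                    live[add["path"]] = stats
                elif "remove" in action:
                    live.pop(action["remove"]["path"], None)
    return live

for path, stats in sorted(live_file_stats("data/delta_sales_zorder").items()):
    lo = stats.get("minValues", {}).get("InvoiceNo")
    hi = stats.get("maxValues", {}).get("InvoiceNo")
    print(f"{path}: InvoiceNo in [{lo}, {hi}]")
```

After the Z-Order step, the ranges printed for files in the same `Country` folder should be narrow and mostly non-overlapping, which is what allows the engine to skip them.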

In [None]:
# Bonus: If you have an existing Parquet Data Lake, you don't need to rewrite everything.
# You can convert it in-place to Delta Lake.

# 1. Write dummy Parquet data
parquet_path = "data/legacy_parquet"
df_sales.write.mode("overwrite").parquet(parquet_path)

print("Parquet Table Created.")

# 2. Convert to Delta
# This command indexes the existing files and creates the _delta_log directory.
# (A partitioned Parquet table would also need its partition schema as a third argument.)
dt = DeltaTable.convertToDelta(spark, f"parquet.`{parquet_path}`")

print(f"Converted {parquet_path} to Delta Table.")
print("Is Delta Table?", DeltaTable.isDeltaTable(spark, parquet_path))

## Summary

1.  **Partitioning:** Best for low cardinality columns (Country, Date). Physical separation of folders.
2.  **Z-Ordering:** Best for high cardinality columns (ID, Timestamp) used frequently in filters. It clusters rows with similar values into the same files, so each file covers a narrow min/max range and **Data Skipping** is maximized.
3.  **Optimize:** The `OPTIMIZE` command compacts small files (Bin-packing) and performs Z-Ordering if specified.
4.  **Convert:** Use `DeltaTable.convertToDelta` to migrate an existing Parquet data lake in place, without rewriting the data files.
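
The name in point 2 comes from the Z-order (Morton) curve, which matters once you Z-Order by more than one column. As a conceptual sketch only (this is not Delta Lake's implementation, and all function names are invented for the example), the code below interleaves the bits of two small integer columns and compares how many files a filter on either column must touch under a plain sort versus a Z-order sort.

```python
def interleave_bits(x, y, bits=16):
    """Z-value: bit i of x goes to position 2i, bit i of y to position 2i + 1."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

def chunk(rows, num_files):
    """Split sorted rows into equally sized simulated files."""
    size = len(rows) // num_files
    return [rows[i * size:(i + 1) * size] for i in range(num_files)]

def files_to_scan(files, column, target):
    """Count files whose min/max range on one column could contain target."""
    return sum(
        1 for f in files
        if min(r[column] for r in f) <= target <= max(r[column] for r in f)
    )

# An 8 x 8 grid of (x, y) rows written into 4 files.
points = [(x, y) for x in range(8) for y in range(8)]
linear_files = chunk(sorted(points), 4)  # sorted by x, then y
z_files = chunk(sorted(points, key=lambda p: interleave_bits(*p)), 4)

for name, files in [("linear sort", linear_files), ("z-order", z_files)]:
    print(name,
          "| x == 6:", files_to_scan(files, 0, 6), "files",
          "| y == 6:", files_to_scan(files, 1, 6), "files")
```

The plain sort is ideal for filters on `x` (1 file) but useless for filters on `y` (all 4 files). The Z-order layout needs 2 files for either filter, trading a little efficiency on the first column for usable skipping on both.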