# PySpark: Zero to Hero
## Module 32: Deletion Vectors and Liquid Clustering

In this module, we explore two advanced optimization features in Delta Lake:

1.  **Deletion Vectors:** A storage optimization feature that speeds up `DELETE`, `UPDATE`, and `MERGE` operations by avoiding rewrites of entire Parquet files. Instead, rows are marked as deleted in a separate deletion vector file.
2.  **Liquid Clustering:** A data layout technique that replaces traditional partitioning and Z-ordering. Clustering keys can be changed later without rewriting existing data, and the layout is designed to cope with data skew and changing query patterns better than a fixed partition scheme.

### Agenda:
1.  **Setup:** Prepare the environment and dataset.
2.  **Deletion Vectors:** 
    *   Enable the feature on a Delta table.
    *   Perform DELETE/UPDATE operations.
    *   Observe how files are managed (rewrites vs. markers).
3.  **Liquid Clustering:**
    *   Create a table with Liquid Clustering enabled (`CLUSTER BY`).
    *   Perform operations and observe automatic layout optimization.

In [None]:
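# This notebook is best run on Databricks Runtime 13.3 LTS or higher,
# the minimum runtime for Liquid Clustering. Some APIs used below may
# need a newer runtime.

# If running locally, ensure you have Delta Lake 3.1.0+ configured
# (for example, `pip install delta-spark`, with the Delta JARs on the classpath).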

from pyspark.sql import SparkSession

# Create Spark Session (if not already available in Databricks)
spark = SparkSession.builder \
    .appName("Delta_Advanced_Features") \
    .master("local[*]") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

print("Spark Session Ready")

In [None]:
# We will use a sample sales dataset similar to previous modules
data = [
    (101, "Product A", 100, "2024-01-01"),
    (102, "Product B", 200, "2024-01-01"),
    (103, "Product A", 150, "2024-01-02"),
    (104, "Product C", 300, "2024-01-02"),
    (105, "Product B", 250, "2024-01-03")
]
columns = ["invoice_id", "product", "amount", "date"]

df_sales = spark.createDataFrame(data, columns)
df_sales.show()

## 1. Deletion Vectors

Traditionally, when you delete a single row from a Parquet file in a Delta table, Delta Lake has to rewrite the **entire** Parquet file without that row. This is expensive for large files.

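With **Deletion Vectors**, Delta writes a small "deletion vector" file that records which row positions are no longer valid. The original data file remains untouched, so deletes and updates write far less data. The trade-off is that readers must apply the vector to filter out those rows until the file is eventually rewritten.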
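
The difference between the two strategies can be sketched in plain Python. This is a toy model only: lists and sets stand in for Parquet files and deletion vectors, and the function names are made up for illustration, not part of any Delta API.

```python
# Toy model only: real Delta stores deletion vectors as compressed bitmaps
# next to the Parquet files; plain Python lists and sets stand in for both here.

def delete_copy_on_write(file_rows, predicate):
    """Without deletion vectors: build a brand-new file that omits matching rows."""
    new_file = [row for row in file_rows if not predicate(row)]
    rows_rewritten = len(new_file)  # every surviving row is written again
    return new_file, rows_rewritten


def delete_with_vector(file_rows, predicate, deletion_vector=None):
    """With deletion vectors: record positions of matching rows, leave the file alone."""
    vector = set(deletion_vector or ())
    vector.update(i for i, row in enumerate(file_rows) if predicate(row))
    return vector


def read_with_vector(file_rows, deletion_vector):
    """Readers skip any row position listed in the deletion vector."""
    return [row for i, row in enumerate(file_rows) if i not in deletion_vector]


def is_invoice_101(row):
    return row[0] == 101


data_file = [
    (101, "Product A", 100),
    (102, "Product B", 200),
    (103, "Product A", 150),
]

rewritten, rows_rewritten = delete_copy_on_write(data_file, is_invoice_101)
vector = delete_with_vector(data_file, is_invoice_101)
visible = read_with_vector(data_file, vector)

print(rows_rewritten)        # 2 surviving rows rewritten by the copy-on-write path
print(sorted(vector))        # [0]: only one row position is recorded
print(visible == rewritten)  # True: both paths expose the same rows to readers
print(len(data_file))        # 3: the original file is unchanged
```

The copy-on-write cost grows with the size of the file, while the deletion vector cost grows only with the number of deleted rows.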

In [None]:
# Create a standard Delta Table first
table_name = "sales_dv_demo"
df_sales.write.format("delta").mode("overwrite").saveAsTable(table_name)

# Enable Deletion Vectors property on the table
# Note: Enabling this upgrades the table protocol. Clients that do not support
# deletion vectors (for example, Delta Lake older than 2.3.0) can no longer read the table.
spark.sql(f"""
    ALTER TABLE {table_name} 
    SET TBLPROPERTIES ('delta.enableDeletionVectors' = true)
""")

print(f"Deletion Vectors enabled for {table_name}")
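
To confirm that the property took effect, one option is to read it back with `SHOW TBLPROPERTIES`. The helper below is a small sketch; `deletion_vectors_enabled` is a name chosen here for illustration, and it assumes the statement returns rows with `key` and `value` columns.

```python
def deletion_vectors_enabled(spark, table_name):
    """Return True if the table reports delta.enableDeletionVectors = true."""
    rows = spark.sql(f"SHOW TBLPROPERTIES {table_name}").collect()
    # Each row exposes the property name under "key" and its setting under "value".
    props = {row["key"]: row["value"] for row in rows}
    return str(props.get("delta.enableDeletionVectors", "false")).lower() == "true"
```

In this notebook it would be called as `deletion_vectors_enabled(spark, table_name)`.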

In [None]:
from delta.tables import DeltaTable

deltaTable = DeltaTable.forName(spark, table_name)

# Delete a specific invoice
print("Deleting invoice_id = 101...")
deltaTable.delete("invoice_id = 101")

# Check History
# In the operation metrics, look for 'numDeletedRows' and the file counts.
# With DV enabled, the DELETE should report little or no rewritten data
# compared to a full file rewrite.
history = deltaTable.history().select("version", "operation", "operationMetrics")
history.show(truncate=False)  # On Databricks, display(history) gives a richer view

In [None]:
# Verify the row is gone
spark.sql(f"SELECT * FROM {table_name}").show()

# Note: The physical Parquet file still contains the deleted row.
# The deletion vector tells Spark to skip that row during reads.
# The row is physically dropped only when the file is rewritten (for example by
# OPTIMIZE); VACUUM then removes the old, unreferenced files after the retention period.
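
One way to see the deletion vector itself is to look at the table's transaction log. The sketch below assumes the commit files under `_delta_log` are newline-delimited JSON and that an `add` action carrying a `deletionVector` includes a `cardinality` field (the number of rows it marks as deleted). The sample entries are hypothetical and abbreviated, not copied from a real table.

```python
import json


def deletion_vector_summary(commit_lines):
    """Map each data file added in a commit to the row count its deletion vector hides."""
    summary = {}
    for line in commit_lines:
        line = line.strip()
        if not line:
            continue
        action = json.loads(line)
        add = action.get("add")
        if add and add.get("deletionVector"):
            summary[add["path"]] = add["deletionVector"]["cardinality"]
    return summary


# Hypothetical, abbreviated commit entries (paths and values are made up).
sample_commit = [
    '{"commitInfo": {"operation": "DELETE"}}',
    '{"remove": {"path": "part-0000.parquet", "dataChange": true}}',
    '{"add": {"path": "part-0000.parquet", "dataChange": true, '
    '"deletionVector": {"storageType": "u", "pathOrInlineDv": "abc", '
    '"offset": 1, "sizeInBytes": 34, "cardinality": 1}}}',
]

print(deletion_vector_summary(sample_commit))  # {'part-0000.parquet': 1}
```

Note that the same Parquet path is removed and re-added in the sample: the data file is reused, and only the attached deletion vector changes.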

## 2. Liquid Clustering

Liquid Clustering simplifies data layout. Instead of deciding between `PARTITION BY` (physical folders) and `ZORDER BY` (co-locating related values within files), you simply use `CLUSTER BY`.

Delta Lake then manages the physical layout, clustering data by the columns you specify. This avoids the "small file problem" caused by over-partitioning, and `OPTIMIZE` no longer needs a `ZORDER BY` clause because it clusters by the table's own keys.
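
Why clustering helps reads can be illustrated with a toy model of file-level data skipping. The sketch assumes each file keeps min/max statistics for the `product` column and that a query can skip any file whose range excludes the requested value. A simple one-dimensional sort stands in for clustering here; the actual layout algorithm used by Liquid Clustering is different and handles multiple columns.

```python
def files_scanned(files, wanted_product):
    """Count files whose [min, max] product range could contain the wanted value."""
    return sum(1 for rows in files if min(rows) <= wanted_product <= max(rows))


def chunk(values, rows_per_file):
    """Split a list of values into consecutive 'files' of a fixed size."""
    return [values[i:i + rows_per_file] for i in range(0, len(values), rows_per_file)]


# 16 rows in arrival order: A, B, C, D, A, B, C, D, ...
products = ["A", "B", "C", "D"] * 4

unclustered = chunk(products, 4)        # every file spans products A..D
clustered = chunk(sorted(products), 4)  # each file holds a single product

print(files_scanned(unclustered, "C"))  # 4: no file can be skipped
print(files_scanned(clustered, "C"))    # 1: three of four files are skipped
```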

In [None]:
liquid_table_name = "sales_liquid_demo"

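# SQL syntax: CREATE TABLE ... CLUSTER BY (col1, col2, ...)
# DataFrame equivalent: .clusterBy("col1", "col2", ...). To our knowledge this
# writer method needs Databricks Runtime 14.2+ or Apache Spark 4.0+.
# We will cluster by 'product' and 'date'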
df_sales.write.format("delta") \
    .mode("overwrite") \
    .clusterBy("product", "date") \
    .saveAsTable(liquid_table_name)

print(f"Table '{liquid_table_name}' created with Liquid Clustering.")

# Describe table to confirm clustering
spark.sql(f"DESCRIBE EXTENDED {liquid_table_name}").show(truncate=False)
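
On runtimes where the DataFrame writer has no `clusterBy` method, the same table can be declared in SQL first and then appended to. The helper below is a sketch that only builds the statement; `create_clustered_table_sql` and the table name `sales_liquid_sql_demo` are hypothetical, and the column types assume the schema Spark infers for the sample data.

```python
def create_clustered_table_sql(table_name, columns, cluster_by):
    """Build a CREATE TABLE ... CLUSTER BY statement for a Delta table."""
    if not cluster_by:
        raise ValueError("cluster_by needs at least one column")
    column_defs = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    keys = ", ".join(cluster_by)
    return (
        f"CREATE TABLE IF NOT EXISTS {table_name} ({column_defs}) "
        f"USING DELTA CLUSTER BY ({keys})"
    )


stmt = create_clustered_table_sql(
    "sales_liquid_sql_demo",  # hypothetical table name
    [("invoice_id", "BIGINT"), ("product", "STRING"),
     ("amount", "BIGINT"), ("date", "STRING")],
    ["product", "date"],
)
print(stmt)
```

The statement would then be run with `spark.sql(stmt)`, followed by `df_sales.write.format("delta").mode("append").saveAsTable("sales_liquid_sql_demo")` to load the rows.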

In [None]:
# Liquid clustering pays off as a table grows; this small append
# only illustrates the workflow.
# Let's append more data.
new_data = [
    (106, "Product A", 120, "2024-01-04"),
    (107, "Product C", 310, "2024-01-04")
]
df_new = spark.createDataFrame(new_data, columns)

df_new.write.format("delta").mode("append").saveAsTable(liquid_table_name)

# Run OPTIMIZE
# With Liquid Clustering, OPTIMIZE will automatically cluster data based on the keys provided.
# You don't need to specify ZORDER BY.
print("Running Optimization...")
spark.sql(f"OPTIMIZE {liquid_table_name}")

print("Optimization Complete.")
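
To observe what `OPTIMIZE` actually did, its result can be inspected instead of discarded. The sketch below assumes the command returns rows with a `metrics` struct containing `numFilesAdded` and `numFilesRemoved`; `summarize_optimize` is a hypothetical helper name, and the sample row uses made-up numbers.

```python
def summarize_optimize(result_rows):
    """Total the files added and removed across the rows returned by OPTIMIZE."""
    added = sum(row["metrics"]["numFilesAdded"] for row in result_rows)
    removed = sum(row["metrics"]["numFilesRemoved"] for row in result_rows)
    return {"files_added": added, "files_removed": removed}


# Hypothetical result row with made-up numbers, shaped like the OPTIMIZE output.
sample_rows = [{"path": "/tmp/demo", "metrics": {"numFilesAdded": 1, "numFilesRemoved": 2}}]
print(summarize_optimize(sample_rows))  # {'files_added': 1, 'files_removed': 2}
```

In this notebook the call would be `summarize_optimize(spark.sql(f"OPTIMIZE {liquid_table_name}").collect())`. On a table this small, the counts may well be zero.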

## Summary

1.  **Deletion Vectors:**
    *   **Benefit:** Faster DELETE/UPDATE/MERGE operations.
    *   **Mechanism:** Writes markers instead of rewriting full files.
    *   **Usage:** Enable via `TBLPROPERTIES` (`'delta.enableDeletionVectors' = true`).

2.  **Liquid Clustering:**
    *   **Benefit:** Avoids over-partitioning on high-cardinality columns and removes the need for manual Z-Order maintenance.
    *   **Mechanism:** Flexible, dynamic data layout managed by Delta.
    *   **Usage:** Use `CLUSTER BY` during table creation. `OPTIMIZE` maintains the layout.

**Note:** Liquid Clustering is the recommended strategy for most new Delta tables in Databricks.