# Deletion Vectors and Liquid Clustering in Delta Lake

**Objective:** This notebook explores two Delta Lake optimization features that reduce write I/O and improve query performance.

1.  **Deletion Vectors:** An optimization that speeds up `DELETE` and `UPDATE` operations by avoiding full file rewrites.
2.  **Liquid Clustering:** A dynamic data layout technique that replaces Hive-style partitioning and Z-Ordering for better query performance.

**Prerequisites:**
*   For Deletion Vectors: Delta Lake 2.3.0+ (read support; `DELETE` writes deletion vectors from 2.4.0, `UPDATE` from 3.0.0) or Databricks Runtime (DBR) 12.2 LTS+.
*   For Liquid Clustering: Delta Lake 3.1.0+ or DBR 13.3 LTS+.

## 1. Setup Data
We will use the standard Databricks dataset `online_retail` for this demonstration.

In [None]:
# Setup: Define file path
file_path = "dbfs:/databricks-datasets/online_retail/data-001/data.csv"

# Verify we can read the data
df = spark.read.format("csv").option("header", "true").load(file_path)
display(df.limit(5))

## 2. Deletion Vectors

### What are Deletion Vectors?
Without deletion vectors, Delta Lake uses copy-on-write (CoW): Parquet files are immutable, so changing or deleting even a single row means rewriting the entire file that contains it. For large files this is I/O intensive.

**With Deletion Vectors (MoR - Merge on Read):**
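Instead of rewriting the file, Delta records the positions of the deleted rows in a deletion vector, a small bitmap stored in a separate file next to the data files. Readers apply the vector at query time to filter those rows out, which makes deletes much cheaper. The rows are physically removed later, when the affected files are rewritten (for example by `OPTIMIZE` or `REORG TABLE ... APPLY (PURGE)`) and the old files are cleaned up by `VACUUM`.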
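
To make the difference concrete, here is a toy model in plain Python (no Spark required). `DataFile`, `delete_copy_on_write`, and `delete_with_deletion_vector` are hypothetical names used only for illustration; real deletion vectors are compressed bitmaps managed by Delta, not Python sets.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    """Toy stand-in for one immutable Parquet file plus an optional deletion vector."""
    rows: list
    deleted_positions: set = field(default_factory=set)  # the "deletion vector"

    def live_rows(self):
        # Merge-on-read: drop the positions marked in the deletion vector.
        return [r for i, r in enumerate(self.rows) if i not in self.deleted_positions]


def delete_copy_on_write(data_file, predicate):
    """Rewrite the whole file without the matching rows; returns (new_file, rows_written)."""
    kept = [r for r in data_file.live_rows() if not predicate(r)]
    return DataFile(rows=kept), len(kept)


def delete_with_deletion_vector(data_file, predicate):
    """Leave the data untouched and mark matching positions; returns positions marked."""
    hits = {
        i for i, r in enumerate(data_file.rows)
        if i not in data_file.deleted_positions and predicate(r)
    }
    data_file.deleted_positions |= hits
    return len(hits)


# Synthetic rows: 10,000 rows spread over 1,000 invoice numbers (10 rows each).
rows = [{"InvoiceNo": str(536365 + i % 1000), "Quantity": i} for i in range(10_000)]


def is_target(row):
    return row["InvoiceNo"] == "536365"


cow_file, cow_rows_written = delete_copy_on_write(DataFile(list(rows)), is_target)
mor_file = DataFile(list(rows))
mor_positions_marked = delete_with_deletion_vector(mor_file, is_target)

print(cow_rows_written, mor_positions_marked)  # 9990 10
```

Both approaches expose the same live rows to a reader. Copy-on-write pays for writing 9,990 rows again, while the deletion vector only records 10 positions.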

### Demonstration
Let's create a table and explicitly **disable** Deletion Vectors first to see the default behavior (File Rewrite).

In [None]:
# Create a schema for our demo
spark.sql("CREATE SCHEMA IF NOT EXISTS dev_bronze")

# Drop table if exists to start fresh
spark.sql("DROP TABLE IF EXISTS dev_bronze.sales_no_dv")

# Create a Delta Table (CTAS)
# We explicitly set deletion vectors to false to demonstrate the "Old" way
spark.sql(f"""
    CREATE TABLE dev_bronze.sales_no_dv
    TBLPROPERTIES ('delta.enableDeletionVectors' = false)
    AS SELECT * FROM read_files('{file_path}', format => 'csv', header => true)
""")

### Scenario A: Deleting WITHOUT Deletion Vectors
We will delete specific records and check the transaction history.

In [None]:
# Delete specific Invoice Numbers
spark.sql("DELETE FROM dev_bronze.sales_no_dv WHERE InvoiceNo = '540644'")

# Check History
display(spark.sql("DESCRIBE HISTORY dev_bronze.sales_no_dv"))

**Observation:**
In the `operationMetrics` column of the history, the DELETE operation above reports:
*   `numAddedFiles`: 1 (or more)
*   `numRemovedFiles`: 1 (or more)
*   `numDeletionVectorsAdded`: 0

This shows that the affected file was **rewritten**: the old file was removed, and a new file (minus the deleted rows) was added.

### Scenario B: Deleting WITH Deletion Vectors
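Now, let's enable the feature on the existing table and perform another delete. Note that enabling deletion vectors upgrades the table protocol, so clients that do not support deletion vectors can no longer read the table.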

In [None]:
# Enable Deletion Vectors
spark.sql("ALTER TABLE dev_bronze.sales_no_dv SET TBLPROPERTIES ('delta.enableDeletionVectors' = true)")

# Check properties to confirm
display(spark.sql("DESCRIBE EXTENDED dev_bronze.sales_no_dv"))

In [None]:
# Perform another delete operation on a different Invoice
spark.sql("DELETE FROM dev_bronze.sales_no_dv WHERE InvoiceNo = '536365'")

# Check History again
display(spark.sql("DESCRIBE HISTORY dev_bronze.sales_no_dv"))

**Observation:**
Look at the latest DELETE operation in the history:
*   `numAddedFiles`: 0
*   `numRemovedFiles`: 0
*   `numDeletionVectorsAdded`: 1 (or more)

**Conclusion:** The data files were NOT rewritten. Only a small deletion vector file was added to mark the rows as deleted. On large files this is significantly cheaper than a rewrite.
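
Reading the metrics by eye gets tedious across many commits. The helper below is a minimal sketch that labels a DELETE commit from its `operationMetrics` map. The function name `classify_delete` is hypothetical, and the metric key spellings are an assumption: both `numAddedFiles` and `numFilesAdded` style keys are accepted, and missing keys count as zero.

```python
def classify_delete(operation_metrics):
    """Label a DELETE commit from its operationMetrics map (values arrive as strings)."""

    def metric(*names):
        # Return the first key that is present; treat missing or empty values as 0.
        for name in names:
            value = operation_metrics.get(name)
            if value not in (None, ""):
                return int(value)
        return 0

    added = metric("numAddedFiles", "numFilesAdded")
    removed = metric("numRemovedFiles", "numFilesRemoved")
    vectors = metric("numDeletionVectorsAdded")

    if vectors > 0:
        return "deletion vector"
    if added > 0:
        return "file rewrite"
    if removed > 0:
        return "whole files removed"
    return "no data files changed"


# Example maps shaped like the two scenarios in this notebook.
print(classify_delete({"numAddedFiles": "1", "numRemovedFiles": "1", "numDeletionVectorsAdded": "0"}))
print(classify_delete({"numAddedFiles": "0", "numRemovedFiles": "0", "numDeletionVectorsAdded": "1"}))

# In the notebook (requires a live Spark session), something like:
# latest = (spark.sql("DESCRIBE HISTORY dev_bronze.sales_no_dv")
#           .filter("operation = 'DELETE'")
#           .orderBy("version", ascending=False)
#           .first())
# print(classify_delete(latest["operationMetrics"]))
```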

### Compacting Deletion Vectors
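To fold the deletion vectors into the data files, we run `OPTIMIZE`, which rewrites the files it selects without the soft-deleted rows. `OPTIMIZE` may not rewrite every file that carries a deletion vector; `REORG TABLE ... APPLY (PURGE)` forces that rewrite. Either way, the old files stay in storage until `VACUUM` removes them after the retention period.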
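
Rewriting files does not by itself delete anything from storage. As a rough sketch, the follow-up clean-up can be expressed as an ordered list of SQL statements. The helper name `dv_cleanup_statements` is hypothetical, and the 168-hour default is an assumption that mirrors Delta's default seven-day retention for removed files.

```python
def dv_cleanup_statements(table, retain_hours=168):
    """Return SQL statements, in order, that purge soft-deleted rows and then drop stale files."""
    if retain_hours < 0:
        raise ValueError("retain_hours must be non-negative")
    return [
        # Rewrite every file that still carries soft-deleted rows.
        f"REORG TABLE {table} APPLY (PURGE)",
        # Physically delete files no longer referenced and older than the retention window.
        f"VACUUM {table} RETAIN {retain_hours} HOURS",
    ]


for statement in dv_cleanup_statements("dev_bronze.sales_no_dv"):
    print(statement)
    # In the notebook, run it instead of printing:
    # spark.sql(statement)
```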

In [None]:
spark.sql("OPTIMIZE dev_bronze.sales_no_dv")

---
## 3. Liquid Clustering

### What is Liquid Clustering?
Liquid clustering replaces traditional Hive-style partitioning and Z-Ordering. Instead of fixing a directory layout up front, it groups rows by the clustering columns inside ordinary data files, which avoids the "small files" problem often caused by over-partitioning.

**Benefits:**
*   Handles high cardinality columns (many unique values).
*   Handles skewed data.
*   Adapts to changing access patterns without rewriting the whole table.
*   Solves the "Too many partitions" or "Too few partitions" dilemma.
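
The small-files effect of over-partitioning is easy to see with a back-of-the-envelope count. The sketch below uses synthetic integer keys and an assumed target file size expressed in rows; sorting and chunking is only a stand-in for the layout algorithm that liquid clustering actually uses.

```python
import math
from collections import Counter

# Hypothetical workload: 50,000 rows over 5,000 distinct invoice numbers (10 rows each).
invoice_ids = [i % 5_000 for i in range(50_000)]
ROWS_PER_FILE = 5_000  # assumed target file size, in rows

# Hive-style partitioning on a high-cardinality column:
# every distinct value gets its own directory and at least one file.
rows_per_partition = Counter(invoice_ids)
partitioned_files = sum(math.ceil(n / ROWS_PER_FILE) for n in rows_per_partition.values())

# Clustered layout (simplified): order rows by key, then cut files at the target size.
clustered_keys = sorted(invoice_ids)
clustered_files = math.ceil(len(clustered_keys) / ROWS_PER_FILE)

print(partitioned_files, clustered_files)  # 5000 10
```

Partitioning yields 5,000 files of 10 rows each, while the clustered layout keeps 10 full-sized files and still stores equal keys next to each other.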

### How to use?
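Specify the clustering columns with the `CLUSTER BY` clause when creating the table.
*Note: Clustering columns must have statistics collected, which by default covers only the first 32 columns of the table. Databricks documents a limit of four clustering columns.*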
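
A quick pre-flight check can catch an unsuitable choice before the `CREATE TABLE` runs. This is a minimal sketch: `check_clustering_columns` is a hypothetical helper, the column list in the demo is made up, and the defaults (statistics on the first 32 columns, at most four clustering columns) are assumptions based on documented Databricks defaults that can differ per table.

```python
def check_clustering_columns(table_columns, cluster_by, stats_columns=32, max_keys=4):
    """Return a list of problems; an empty list means the choice passes these checks."""
    problems = []
    if len(cluster_by) > max_keys:
        problems.append(
            f"{len(cluster_by)} clustering columns given, at most {max_keys} allowed"
        )
    with_stats = set(table_columns[:stats_columns])
    for col in cluster_by:
        if col not in table_columns:
            problems.append(f"{col}: not a column of the table")
        elif col not in with_stats:
            problems.append(
                f"{col}: outside the first {stats_columns} columns, so no statistics by default"
            )
    return problems


# Demo with a made-up 40-column schema; in the notebook, pass df.columns instead.
demo_columns = [f"col_{i}" for i in range(40)]
print(check_clustering_columns(demo_columns, ["col_0"]))
print(check_clustering_columns(demo_columns, ["col_35"]))
```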

In [None]:
# Drop if exists
spark.sql("DROP TABLE IF EXISTS dev_bronze.sales_liquid")

# Create Table with Liquid Clustering using CLUSTER BY
spark.sql(f"""
    CREATE TABLE dev_bronze.sales_liquid
    CLUSTER BY (InvoiceNo)
    AS SELECT * FROM read_files('{file_path}', format => 'csv', header => true)
""")

In [None]:
# Verify the Clustering configuration
display(spark.sql("DESCRIBE EXTENDED dev_bronze.sales_liquid"))

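**Observation:**
In the `DESCRIBE EXTENDED` output, look for the row **Clustering Columns**; it should list `InvoiceNo`. `DESCRIBE DETAIL` also reports the keys in its `clusteringColumns` column.

When you filter this table by `InvoiceNo`, rows with the same value sit close together, so the per-file min/max statistics let the engine skip files that cannot contain a match. Data written later is clustered incrementally; run `OPTIMIZE` on the table to trigger clustering.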
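
Why the layout matters for skipping can be shown with per-file min/max statistics alone. The following toy model uses integer keys as a stand-in for `InvoiceNo`; `write_files` and `files_to_scan` are hypothetical helpers, and a plain sort approximates the clustered layout.

```python
import random


def write_files(keys, rows_per_file):
    """Split keys into files and record each file's (min, max) statistics."""
    chunks = [keys[i:i + rows_per_file] for i in range(0, len(keys), rows_per_file)]
    return [(min(chunk), max(chunk)) for chunk in chunks]


def files_to_scan(file_stats, key):
    """Count files whose [min, max] range could contain the key."""
    return sum(1 for low, high in file_stats if low <= key <= high)


random.seed(42)
keys = [536365 + i // 20 for i in range(20_000)]  # 1,000 invoices, 20 rows each
arrival_order = keys[:]
random.shuffle(arrival_order)

unclustered_stats = write_files(arrival_order, 1_000)  # each file spans nearly all keys
clustered_stats = write_files(sorted(keys), 1_000)     # each file covers a narrow key range

target = 536365 + 500
unclustered_scan = files_to_scan(unclustered_stats, target)
clustered_scan = files_to_scan(clustered_stats, target)
print(unclustered_scan, clustered_scan)
```

With the shuffled layout, the target falls inside the min/max range of essentially every file, so none can be skipped. With the clustered layout, only the one file covering that key range has to be read.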

In [None]:
# Example: a filter on the clustering column can take advantage of the clustered layout
result = spark.sql("SELECT * FROM dev_bronze.sales_liquid WHERE InvoiceNo = '536365'")
display(result)

### Modifying Clustering
To change the clustering columns, or to remove them and return to a standard unclustered table, use `ALTER TABLE ... CLUSTER BY`. Changing the keys does not rewrite existing data; the new keys apply to subsequent writes and `OPTIMIZE` runs.

In [None]:
# To remove clustering (optional command, for reference only)
# spark.sql("ALTER TABLE dev_bronze.sales_liquid CLUSTER BY NONE")

## Summary
1.  **Deletion Vectors** record deleted rows as soft-deletes, deferring expensive file rewrites from the DML operation to a later `OPTIMIZE` or `REORG`.
2.  **Liquid Clustering** simplifies data layout management by replacing rigid partitioning schemes with clustering keys that can be changed later.