In [None]:
Delta Lake Maintenance is divided into two parts.
1. Logical Rewrite 
2. Physical Rewrite 

In Databricks, managing a Delta table is essentially about optimizing how data is laid out so that the Spark engine can find it faster. The maintenance tasks fall into two categories: Logical (how we organize the structure) and Physical (how we actually move the bytes on disk).

In [None]:
1. Logical Rewrite is divided into two parts
   a) Partitioning 
   b) Liquid Clustering

In [None]:
2. Physical Rewrite is divided into three parts
   a) Optimize
   b) Z-order
   c) Vacuum

1. Logical Rewrite: The Structural Blueprint
A logical rewrite doesn't necessarily change the data itself immediately, but it changes the rules by which data is organized. It’s about defining the "containers" that data should live in.

a) Partitioning (The Traditional Way)
Partitioning involves physically separating data into folders based on a specific column (e.g., Year or Country).

* How it works: When you query WHERE Year = 2024, Spark skips every folder except the one labeled 2024.

* The Downside: It is rigid. If you partition by a column with too many unique values (high cardinality), you end up with "tiny file syndrome," which slows down performance.
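
As a rough illustration of partition pruning, the sketch below models a table as a mapping from partition folders to file names and shows that a filter on the partition column touches only one folder. The folder layout, file names, and the `prune_partitions` helper are made up for this example; real pruning is performed by Spark from the table's metadata, not by code like this.

```python
# Hypothetical layout: one folder per value of the partition column `Year`.
table_files = {
    "Year=2022": ["part-000.parquet", "part-001.parquet"],
    "Year=2023": ["part-000.parquet"],
    "Year=2024": ["part-000.parquet", "part-001.parquet", "part-002.parquet"],
}


def prune_partitions(files_by_folder, column, value):
    """Return only the folder (and its files) that matches column=value."""
    wanted = f"{column}={value}"
    return {
        folder: files
        for folder, files in files_by_folder.items()
        if folder == wanted
    }


# Equivalent of WHERE Year = 2024: every other folder is skipped entirely.
selected = prune_partitions(table_files, "Year", 2024)
files_read = sum(len(files) for files in selected.values())
total_files = sum(len(files) for files in table_files.values())
print(f"read {files_read} of {total_files} files")  # read 3 of 6 files
```

The same model also hints at the downside: with a high-cardinality column, the dictionary would hold thousands of folders, each containing only a handful of tiny files.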

b) Liquid Clustering (The Modern Way)
Liquid Clustering is the successor to partitioning. It is "flexible" rather than "fixed."

**How it works:** Instead of hard-coded folders, Databricks uses a clustering key to group similar data together within files.

**Why it's "Logical":** You can change the clustering keys without rewriting your entire table. It automatically adjusts as data grows, avoiding the pitfalls of over-partitioning.

2. Physical Rewrite: The Disk Cleanup
Physical rewrites involve changing the actual Parquet files stored in your cloud storage (S3/ADLS). Delta Lake creates new files and marks old ones for deletion to improve performance.

a) Optimize (Compaction)
Over time, many small Parquet data files accumulate. Optimize takes, say, 1,000 tiny files and rewrites them into a few large, "right-sized" files (usually ~1GB). This reduces the I/O overhead of opening and closing many files.
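
To make the idea of compaction concrete, here is a minimal sketch that greedily groups small files into output files of at most ~1 GB. The file sizes and the `plan_compaction` helper are assumptions for illustration; this is a simple bin-packing model, not the algorithm OPTIMIZE actually uses.

```python
TARGET_SIZE_MB = 1024  # ~1 GB target size, as described above


def plan_compaction(file_sizes_mb, target_mb=TARGET_SIZE_MB):
    """Greedily group files into bins whose total size stays within target_mb."""
    bins, current, current_size = [], [], 0
    for size in file_sizes_mb:
        # Close the current bin if adding this file would exceed the target.
        if current and current_size + size > target_mb:
            bins.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins


small_files = [4] * 1000  # 1,000 tiny files of 4 MB each (hypothetical sizes)
plan = plan_compaction(small_files)
print(f"{len(small_files)} files -> {len(plan)} files")  # 1000 files -> 4 files
```

Each bin in `plan` stands for one rewritten file; a query that previously opened 1,000 files would open 4 after compaction in this model.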

b) Z-Order (Data Skipping)
Often run alongside Optimize, Z-Ordering rearranges the data inside the files.

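**How it works:** It maps multi-dimensional data into one dimension so that rows with similar values in the Z-Order columns land in the same files. If you Z-Order by CustomerID, the records for "Customer A" are concentrated in a few files instead of being scattered across all of them. Because Delta records min/max statistics per file, Spark can then "skip" entire files whose value ranges cannot match your query.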
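
The mapping from several dimensions to one can be sketched with a Morton (bit-interleaving) code, which is the textbook form of a Z-order curve. The example below is a toy model under stated assumptions: two integer columns (`customer_id`, `day`) on a 4x4 grid, four rows per "file", and min/max statistics kept per file. It is not Delta's implementation.

```python
def z_value(x, y, bits=8):
    """Interleave the bits of x and y into a single Morton (Z-order) code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)      # bits of x go to even positions
        code |= ((y >> i) & 1) << (2 * i + 1)  # bits of y go to odd positions
    return code


# Hypothetical rows: every (customer_id, day) pair on a 4x4 grid.
rows = [(customer_id, day) for customer_id in range(4) for day in range(4)]
rows.sort(key=lambda row: z_value(*row))

# Split the sorted rows into 4 "files" and keep min/max of customer_id per file.
files = [rows[i:i + 4] for i in range(0, len(rows), 4)]
stats = [(min(r[0] for r in f), max(r[0] for r in f)) for f in files]

# A filter customer_id = 3 only needs files whose [min, max] range covers 3.
to_read = [i for i, (lo, hi) in enumerate(stats) if lo <= 3 <= hi]
print(to_read)  # [1, 3]
```

Only two of the four files have to be opened for this filter, and the same layout also lets a filter on `day` skip half of the files, which is the benefit of interleaving rather than sorting by a single column.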

c) Vacuum
When you perform an Optimize or an Update, Delta doesn't immediately delete the old files (to allow for Time Travel). Vacuum is the cleanup crew.

How it works: It permanently removes files that are no longer referenced by the current version of the table and are older than a retention period (default is 7 days). This saves money on storage costs.
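
The two conditions (unreferenced and older than the retention period) can be sketched as a simple filter. The file names, timestamps, and the `files_to_vacuum` helper below are hypothetical, and the single timestamp per file is a stand-in for whatever Delta actually uses for the retention check.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=7)  # default retention period mentioned above


def files_to_vacuum(all_files, referenced, now, retention=RETENTION):
    """Return files that are unreferenced AND older than the retention window.

    all_files:  dict mapping file name -> timestamp used for the retention check
    referenced: set of file names still used by the current table version
    """
    cutoff = now - retention
    return sorted(
        name
        for name, timestamp in all_files.items()
        if name not in referenced and timestamp < cutoff
    )


now = datetime(2024, 6, 15)
all_files = {
    "old-small-1.parquet": datetime(2024, 6, 1),   # replaced 14 days ago
    "old-small-2.parquet": datetime(2024, 6, 12),  # replaced 3 days ago
    "compacted.parquet": datetime(2024, 6, 12),    # current data file
}
referenced = {"compacted.parquet"}

print(files_to_vacuum(all_files, referenced, now))  # ['old-small-1.parquet']
```

Note that `old-small-2.parquet` survives even though it is unreferenced, because it is still inside the retention window and may be needed for Time Travel.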

Moving from traditional partitioning to Liquid Clustering is one of the best "quality of life" upgrades in Databricks because it removes the headache of managing **partition evolution**.

To convert a table, the process depends on whether you are creating a new table from existing data or altering an existing Delta table.

1. The SQL Approach (Most Common)
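If you have an existing Delta table and want to enable Liquid Clustering, you use the CLUSTER BY syntax. Note that the Databricks documentation describes ALTER TABLE ... CLUSTER BY for unpartitioned tables; a table that is already partitioned typically has to be recreated with CLUSTER BY (for example via CREATE OR REPLACE TABLE ... AS SELECT) rather than altered in place.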

To convert an existing table:

In [None]:
-- 1. Set the clustering keys (a metadata-only change; assumes the table is not partitioned)
ALTER TABLE my_catalog.my_schema.sensor_data
CLUSTER BY (device_id, sensor_type);

-- 2. Trigger the physical rewrite to reorganize the data
OPTIMIZE my_catalog.my_schema.sensor_data;

In [None]:
To create a fresh table with Liquid Clustering:

CREATE TABLE my_catalog.my_schema.new_table
(id INT, ts TIMESTAMP, val DOUBLE)
USING DELTA
CLUSTER BY (id); -- No 'PARTITIONED BY' needed!

**Important Transition Rules**

When you move to Liquid Clustering, there are a few "under the hood" changes to be aware of:
* Dropping Partitions: You cannot use PARTITIONED BY and CLUSTER BY on the same table. A partitioned table therefore has to give up its partition folders, typically by being recreated as a clustered table, before Liquid Clustering applies.
* The "Optimize" Trigger: Simply running the ALTER TABLE command is a Logical Rewrite. Existing data isn't actually moved until you run OPTIMIZE. That is the Physical Rewrite that aligns the files to your new keys.
* Key Selection: Unlike Z-Ordering (where order matters), the order of columns in CLUSTER BY does not matter.
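
The "logical first, physical second" rule can be captured in a small helper that only builds the two statements, which is handy as a dry run before executing anything. The `clustering_statements` function is a hypothetical name introduced for this sketch; it produces strings and does not talk to Databricks.

```python
def clustering_statements(table, keys):
    """Return the logical (ALTER) and physical (OPTIMIZE) statements, in order."""
    if not keys:
        raise ValueError("at least one clustering key is required")
    key_list = ", ".join(keys)
    return [
        f"ALTER TABLE {table} CLUSTER BY ({key_list})",  # logical rewrite
        f"OPTIMIZE {table}",                             # physical rewrite
    ]


statements = clustering_statements(
    "my_catalog.my_schema.sensor_data", ["device_id", "sensor_type"]
)
for statement in statements:
    print(statement)
```

In a notebook, each string could then be passed to `spark.sql(...)` once the output has been reviewed.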

If you are managing a large-scale Lakehouse, manually running DESCRIBE DETAIL on every table isn't feasible.

Using the Information Schema allows you to treat your metadata like a database, letting you audit your entire environment in seconds to see which tables have been modernized and which are still using legacy partitioning.

1. The SQL Audit Query
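The query below assumes that the tables view in system.information_schema exposes a table_features array column in which Liquid Clustering appears as the 'clustering' feature. Confirm that this column exists in your workspace before relying on it; DESCRIBE DETAIL reports the same information for a single table (tableFeatures, partitionColumns, clusteringColumns). Under that assumption, the following query finds every clustered table across the catalogs you can see: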

SQL
SELECT 
    table_catalog, 
    table_schema, 
    table_name, 
    table_type,
    created_by
FROM 
    system.information_schema.tables
WHERE 
    -- This filters for tables where clustering is enabled in the metadata
    array_contains(table_features, 'clustering') 
    AND table_schema != 'information_schema';

2. Finding Tables That AREN'T Clustered (Migration List)
Often, the more helpful audit is finding the tables that should be clustered but aren't. Large Delta tables that are still using traditional partitioning are prime candidates for a migration; the "largest unclustered tables" query in the dashboard section below produces that list.

Building a Clustering Coverage Dashboard is the "pro" way to manage a Databricks environment. Instead of running manual audits, you get a real-time visual of your technical debt and optimization progress.

To build this, we’ll use the Databricks SQL (DB SQL) workspace.

1. The "Clustering vs. Partitioning" Pie Chart     
First, create a visualization that shows the percentage of your tables that have adopted Liquid Clustering.

In [None]:
-- Assumes table_features and partition_columns are exposed as array columns in this view
SELECT 
    CASE 
        WHEN array_contains(table_features, 'clustering') THEN 'Liquid Clustered'
        WHEN cardinality(partition_columns) > 0 THEN 'Legacy Partitioned'
        ELSE 'Flat (No Partition/Cluster)'
    END AS Table_Strategy,
    COUNT(*) AS Table_Count
FROM system.information_schema.tables
WHERE table_schema NOT IN ('information_schema', 'sys')
GROUP BY 1;

Visualization: Choose a Pie Chart. This gives leadership a quick look at how the migration to Delta Lake 3.x features is going.

2. The "Migration Targets" Table
Next, list the largest tables that have not adopted Liquid Clustering; these are the first candidates for migration.

In [None]:
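-- Note: This requires access to the 'system.storage' schema to join size data.
-- Assumption: table_size_metrics and its table_id / total_size_mb columns exist in your workspace; confirm before use.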
SELECT 
    t.table_schema,
    t.table_name,
    format_number(s.total_size_mb / 1024, 2) as size_gb
FROM system.information_schema.tables t
JOIN system.storage.table_size_metrics s -- Joins to get physical size
  ON t.table_id = s.table_id
WHERE NOT array_contains(t.table_features, 'clustering')
ORDER BY s.total_size_mb DESC
LIMIT 10;

In [None]:
Visualization: Choose a Counter or a Table widget.

3. The "Optimization Health" Line Chart
To see if your Physical Rewrites are actually happening, you can track the frequency of OPTIMIZE commands across your catalog over time.

In [None]:
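-- Assumption: OPTIMIZE runs are logged under this service_name / action_name pair;
-- confirm the exact values in your audit data before relying on the chart.
SELECT 
    date_trunc('day', event_time) AS audit_date,
    COUNT(*) AS optimize_runs
FROM system.access.audit
WHERE service_name = 'unityCatalog' 
  AND action_name = 'optimizeTable'
GROUP BY 1
ORDER BY 1 DESC;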

Visualization: Choose a Line Chart. If this line is flat, it means your tables are logically clustered but you aren't running the physical rewrites necessary to gain performance.

|Widget|Insight|Benefit|
|-------|-------|--------|
|Pie Chart|Coverage %|High-level migration tracking.|
|Table List|Migration Targets|Identifies "heavy hitters" for ALTER TABLE.|
|Line Chart|Maintenance Activity|Ensures OPTIMIZE jobs are actually running.|

**How to Deploy**  
* Open SQL Warehouses and ensure one is running.
* Go to Dashboards > Create Dashboard.
* Add the queries above as Widgets.
* Set a Refresh Schedule (e.g., every Monday morning) to keep the data fresh.

The Automation Script: "The Migration Loop"   
This script uses standard PySpark (the spark session that Databricks notebooks provide) to query the Information Schema, identify the target tables, and then execute the ALTER and OPTIMIZE commands sequentially.

In [None]:
# Assumes a Databricks notebook, where the `spark` session is predefined.

# 1. Define your target catalog and the clustering column logic
# (In this example, we'll cluster by 'device_id' if it exists, or 'id')
TARGET_CATALOG = "main"
CANDIDATE_KEYS = ["device_id", "id"]

def migrate_to_liquid_clustering():
    # 2. Query Information Schema for the 'Migration List'
    # We look for Managed tables that are NOT yet clustered
    # (assumes the view exposes a `table_features` array column)
    migration_targets = spark.sql(f"""
        SELECT table_schema, table_name 
        FROM {TARGET_CATALOG}.information_schema.tables 
        WHERE table_type = 'MANAGED' 
        AND NOT array_contains(table_features, 'clustering')
        AND table_schema NOT IN ('information_schema', 'sys')
        LIMIT 10
    """).collect()

    if not migration_targets:
        print("No tables found for migration. Your Lakehouse is up to date!")
        return

    for row in migration_targets:
        full_table_name = f"{TARGET_CATALOG}.{row.table_schema}.{row.table_name}"
        print(f"Starting migration for: {full_table_name}")

        try:
            # 3. Pick the first candidate key that the table actually has
            columns = spark.table(full_table_name).columns
            key = next((c for c in CANDIDATE_KEYS if c in columns), None)
            if key is None:
                print(f"  - Skipped: none of {CANDIDATE_KEYS} found in table")
                continue

            # 4. Step 1: Logical Rewrite (Define Clustering Key)
            # A table that is still partitioned is expected to fail here and be reported below
            spark.sql(f"ALTER TABLE {full_table_name} CLUSTER BY ({key})")
            print(f"  - Logical Rewrite Complete: Clustering key set to '{key}'")

            # 5. Step 2: Physical Rewrite (Compaction and Clustering)
            spark.sql(f"OPTIMIZE {full_table_name}")
            print(f"  - Physical Rewrite Complete: Data reorganized on disk")

        except Exception as e:
            print(f"  - Error migrating {full_table_name}: {e}")

# Run the migration
migrate_to_liquid_clustering()

The audit queries above treat table_features as a column that stores an Array of strings. Because it is a list rather than a single value, you cannot use a simple equals sign (=) to query it. This is where array_contains comes in.

1. What is array_contains?
The array_contains(column, value) function is a boolean check. It looks inside a collection (the array) and returns True if the specific value exists anywhere in that list, and False if it does not. If the array itself is NULL, the result is NULL, so a filter such as NOT array_contains(...) silently drops those rows.

In the context of the audit query:

* The Column (table_features): A list of all advanced Delta Lake capabilities enabled for that specific table.
* The Value ('clustering'): The specific string that identifies Liquid Clustering.
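
The behaviour described above, including the NULL case, can be modelled in a few lines of plain Python. The table names and feature lists are sample data invented for this sketch, and the local `array_contains` function is a simplified stand-in for the Spark SQL function (it ignores NULL elements inside the array).

```python
def array_contains(array, value):
    """Simplified model of Spark SQL array_contains: a NULL array yields NULL."""
    if array is None:
        return None
    return value in array


# Sample metadata: table name -> list of enabled features (None models NULL).
tables = {
    "sensor_data": ["clustering", "deletionVectors"],
    "legacy_sales": ["deletionVectors"],
    "no_features": None,
}

# WHERE array_contains(table_features, 'clustering')
clustered = [
    name for name, features in tables.items()
    if array_contains(features, "clustering")
]

# WHERE NOT array_contains(table_features, 'clustering')
# A NULL result is neither True nor False, so that row is filtered out.
not_clustered = [
    name for name, features in tables.items()
    if array_contains(features, "clustering") is False
]

print(clustered)      # ['sensor_data']
print(not_clustered)  # ['legacy_sales']
```

The table `no_features` appears in neither list, which is why a migration list built with NOT array_contains can miss tables whose feature array is NULL; wrapping the check in coalesce(..., false) is one way to include them.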