# DLT: Incremental Processing & Schema Evolution

In the previous session, we built a basic DLT pipeline. In this session, we will explore:
1.  **Incremental Data Loading:** How Streaming Tables process only new data.
2.  **Schema Evolution:** How to modify the pipeline logic (add columns, rename tables) and how DLT handles these changes automatically.
3.  **DLT Internals:** Understanding where data is stored, checkpoints, and the hidden internal catalogs.
4.  **Data Lineage:** Utilizing Unity Catalog to track data flow.

---
### Workflow for this Notebook
1.  **Simulate Data:** We will insert new records into our source table.
2.  **Update Pipeline:** We will modify the DLT code to add a new aggregation and rename a table.
3.  **Analyze Behavior:** We will observe how DLT handles the new data and code changes.

## 1. Simulate Incremental Data (Run Interactively)

Before running the pipeline again, let's insert **10,000 new records** into the raw source table (`orders_raw`). This will allow us to demonstrate how the **Streaming Table** (`orders_bronze`) picks up *only* the new data, while **Materialized Views** recompute their state.

*Note: Run this cell on a standard All-Purpose Cluster, NOT within the DLT pipeline execution.*

In [None]:
# Run this on a standard cluster to simulate incoming data
# We are appending 10k records from the sample data to our source table.
# Note: LIMIT without ORDER BY returns an arbitrary set of rows, not a random sample.

spark.sql("""
    INSERT INTO dev.etl_source.orders_raw
    SELECT * FROM samples.tpch.orders
    LIMIT 10000
""")

print("Inserted 10,000 new records into 'dev.etl_source.orders_raw'.")

## 2. Updated DLT Code (Schema Evolution)

We are making the following changes to our previous pipeline logic:
1.  **Rename Table:** Changing `joined_silver` to `orders_silver`. DLT creates the new table on the next update. The old table drops out of the pipeline; whether it is retained or removed from the catalog depends on how the pipeline publishes its tables.
2.  **Schema Evolution (Gold Layer):**
    *   Renaming the count column from `total_orders` to `count_orders`.
    *   Adding a NEW aggregation column: `sum_totalprice`.

**Copy the code below into your DLT Pipeline Notebook (replacing the previous code).**

In [None]:
import dlt
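# Import only what the pipeline uses. Note that this `sum` shadows Python's built-in sum().
from pyspark.sql.functions import count, current_timestamp, sum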

# ---------------------------------------------------------
# BRONZE LAYER (No Changes)
# ---------------------------------------------------------

@dlt.table(
    name="orders_bronze",
    comment="Raw orders data ingested incrementally",
    table_properties={"quality": "bronze"}
)
def orders_bronze():
    # Streaming table: Tracks offsets via checkpoints
    # When triggered, this will only process the NEW 10k records inserted above
    return (
        spark.readStream
        .format("delta")
        .table("dev.etl_source.orders_raw")
    )

@dlt.table(
    name="customer_bronze",
    comment="Raw customer reference data",
    table_properties={"quality": "bronze"}
)
def customer_bronze():
    return (
        spark.read
        .format("delta")
        .table("dev.etl_source.customer_raw")
    )

# ---------------------------------------------------------
# SILVER LAYER (Renamed Table)
# ---------------------------------------------------------

@dlt.view(
    name="joined_view",
    comment="Intermediate logic to join orders with customers"
)
def joined_view():
    df_orders = spark.read.table("LIVE.orders_bronze")
    df_customers = spark.read.table("LIVE.customer_bronze")
    
    return df_orders.join(
        df_customers, 
        df_orders.o_custkey == df_customers.c_custkey, 
        "left"
    )

# CHANGE 1: Renamed table from 'joined_silver' to 'orders_silver'
@dlt.table(
    name="orders_silver",
    comment="Enriched orders with customer details",
    table_properties={"quality": "silver"}
)
def orders_silver():
    return (
        spark.read.table("LIVE.joined_view")
        .withColumn("processed_timestamp", current_timestamp())
    )

# ---------------------------------------------------------
# GOLD LAYER (Schema Evolution)
# ---------------------------------------------------------

@dlt.table(
    name="orders_by_segment_gold",
    comment="Aggregated order counts and sales by market segment",
    table_properties={"quality": "gold"}
)
def orders_by_segment_gold():
    # Reading from the NEW silver table name
    df = spark.read.table("LIVE.orders_silver")
    
    # CHANGE 2: Modified Aggregation Logic
    return (
        df.groupBy("c_mktsegment")
        .agg(
            count("o_orderkey").alias("count_orders"),      # Renamed Alias
            sum("o_totalprice").alias("sum_totalprice")     # New Column Added
        )
    )

## 3. Observations & Internals

### A. Incremental Processing (Streaming Tables)
When you run the pipeline after inserting data:
*   **`orders_bronze` (Streaming Table):** You will see it processes **10,000 records** (or however many you inserted). It uses the **checkpoint** mechanism to remember where it left off.
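*   **`orders_silver` & `orders_by_segment_gold` (Materialized Views):** Their results reflect the full dataset currently in the streaming table, not just the new rows. One consequence: when `orders_silver` is recomputed, `processed_timestamp` is re-evaluated for every row, so it records the latest refresh time rather than each row's original arrival time.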
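
The difference between the two behaviors can be sketched in plain Python. This is a toy model, not DLT itself: a list stands in for the source Delta table, an integer offset stands in for the checkpoint, and the names `ToyStreamingTable` and `toy_materialized_view` are hypothetical.

```python
# Toy model only: plain Python lists stand in for Delta tables,
# and an integer offset stands in for the streaming checkpoint.

class ToyStreamingTable:
    """Append-only target that remembers how far it has read the source."""

    def __init__(self):
        self.rows = []
        self.checkpoint = 0  # number of source rows already consumed

    def update(self, source):
        new_rows = source[self.checkpoint:]   # only rows past the checkpoint
        self.rows.extend(new_rows)
        self.checkpoint = len(source)         # advance the checkpoint
        return len(new_rows)                  # rows processed in this update


def toy_materialized_view(upstream_rows):
    """Recompute order counts per segment from the full upstream contents."""
    counts = {}
    for segment, _price in upstream_rows:
        counts[segment] = counts.get(segment, 0) + 1
    return counts


source = [("BUILDING", 10.0), ("MACHINERY", 20.0)]
bronze = ToyStreamingTable()

first_run = bronze.update(source)             # initial load reads everything
source += [("BUILDING", 5.0)]                 # simulate newly inserted records
second_run = bronze.update(source)            # second update reads only the new row

gold = toy_materialized_view(bronze.rows)     # computed over all three rows
print(first_run, second_run, gold)
# prints: 2 1 {'BUILDING': 2, 'MACHINERY': 1}
```

The second update touches one row, yet the aggregate covers all three. That is the same split you should see in the pipeline run: a small record count on `orders_bronze`, and downstream results that account for every row.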

### B. Declarative Nature
*   You simply renamed `joined_silver` to `orders_silver` in the code.
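*   **Result:** DLT automatically creates the new table `orders_silver` in the target schema. The old table `joined_silver` is no longer part of the pipeline graph and is no longer updated; whether and when it is dropped from the catalog depends on how the pipeline publishes its tables.
*   **Gold Table:** The new column `sum_totalprice` appears automatically, and `total_orders` is replaced by `count_orders`. You didn't run `ALTER TABLE`. Any downstream query that still references `total_orders` will need to be updated.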
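
One way to think about "declarative" is as a comparison between what already exists and what the code now declares. The sketch below is a simplified illustration of that idea, not DLT's actual planner. The function `plan_changes` is hypothetical, the silver column list is abbreviated, and the bronze tables are left out for brevity.

```python
def plan_changes(existing, declared):
    """Compare existing tables with the tables the code now declares.

    Both arguments map a table name to its list of column names.
    """
    plan = {"create": [], "no_longer_defined": [], "schema_changes": {}}
    for name in sorted(declared):
        if name not in existing:
            plan["create"].append(name)
        elif existing[name] != declared[name]:
            old, new = existing[name], declared[name]
            plan["schema_changes"][name] = {
                "added": [c for c in new if c not in old],
                "removed": [c for c in old if c not in new],
            }
    plan["no_longer_defined"] = sorted(set(existing) - set(declared))
    return plan


# Abbreviated, illustrative column list for the silver table.
silver_cols = ["o_orderkey", "c_mktsegment", "processed_timestamp"]

existing = {
    "joined_silver": silver_cols,
    "orders_by_segment_gold": ["c_mktsegment", "total_orders"],
}
declared = {
    "orders_silver": silver_cols,
    "orders_by_segment_gold": ["c_mktsegment", "count_orders", "sum_totalprice"],
}

plan = plan_changes(existing, declared)
print(plan["create"])             # ['orders_silver']
print(plan["no_longer_defined"])  # ['joined_silver']
print(plan["schema_changes"])
```

Note that in this model a renamed alias shows up as one column removed and one added. From the table's point of view, `count_orders` is simply a new column and `total_orders` is gone.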

### C. DLT Internals (Storage)
All DLT tables are Delta Tables backed by a storage location.
1.  **Checkpoints:** Streaming tables track their progress in checkpoints. When the pipeline is configured with an explicit storage location, these live in a `checkpoints` directory under that location; otherwise the storage is managed for you.
2.  **Internal Catalog:** If using Unity Catalog, you might see a catalog named `__databricks_internal`. This is an implementation detail where DLT manages the physical state of materialized views before publishing them to your target schema. Treat it as read-only and do not build queries against it.

### D. Data Lineage
Because we used Unity Catalog:
1.  Go to **Catalog Explorer**.
2.  Select the `orders_by_segment_gold` table.
3.  Click the **Lineage** tab.
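4.  You will see a visual graph connecting `orders_raw` -> `orders_bronze` -> `orders_silver` -> `orders_by_segment_gold`, with the `customer_raw` -> `customer_bronze` branch joining at `orders_silver`.
5.  **Column Lineage:** You can click on the `count_orders` column to trace which source columns contributed to this calculation (in this pipeline, `o_orderkey`).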
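
To make the lineage graph concrete, the dependencies declared in the pipeline code can be written down as a small dictionary and walked upstream. This is only an illustration of what the Lineage tab visualizes; it does not query Unity Catalog. The helper `upstream` is hypothetical, and `joined_view` is included because the code declares it, even though an intermediate view may not appear in the UI.

```python
# Dependencies as declared in the pipeline code: dataset -> datasets it reads from.
READS_FROM = {
    "orders_bronze": ["orders_raw"],
    "customer_bronze": ["customer_raw"],
    "joined_view": ["orders_bronze", "customer_bronze"],
    "orders_silver": ["joined_view"],
    "orders_by_segment_gold": ["orders_silver"],
}


def upstream(dataset, reads_from=READS_FROM):
    """Return every dataset that `dataset` depends on, directly or indirectly."""
    seen = []
    stack = list(reads_from.get(dataset, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(reads_from.get(current, []))
    return seen


ancestors = upstream("orders_by_segment_gold")

# Datasets that nothing in the pipeline produces are the external sources.
sources = sorted(d for d in ancestors if d not in READS_FROM)
print(sources)  # ['customer_raw', 'orders_raw']
```

Walking upstream from the gold table reaches both raw tables, which is why both branches appear in the lineage graph for `orders_by_segment_gold`.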