# DLT: Implementing SCD Type 1 & Type 2 with Apply Changes API

**Objective:** Learn how to handle Change Data Capture (CDC) efficiently in Delta Live Tables using the `apply_changes()` API. We will implement Slowly Changing Dimensions (SCD) Type 1 and Type 2.

**What we will cover:**
1.  **SCD Type 1:** Overwriting old records with new updates (Upsert).
2.  **SCD Type 2:** Maintaining a history of changes with valid-from and valid-to dates.
3.  **Handling Deletes & Truncates:** Propagating deletes from source to target.
4.  **Backfilling:** How DLT handles out-of-order data arrival automatically.

**Prerequisites:**
*   A DLT Pipeline created (from previous lessons).
*   **Product Edition:** Set the product edition to **"Pro"** or **"Advanced"** in your DLT Pipeline settings. The **"Core"** edition does not support CDC features such as `apply_changes()`.

## 1. Data Preparation (Interactive)
*This section simulates changes in your source system. Run this in a standard Databricks notebook cell attached to an interactive cluster, NOT inside the DLT pipeline.*

We need our source table to have specific columns to track changes:
*   `src_action`: Indicates whether the record is an Insert (I), Update (U), Delete (D), or Truncate (T).
*   `src_insert_dt`: A timestamp that determines the order of events.

In [None]:
# %sql
# -- 1. Alter the source table to add CDC tracking columns
# ALTER TABLE dev.bronze.customer_raw ADD COLUMNS (src_action STRING, src_insert_dt TIMESTAMP);

# -- 2. Update existing records with default values
# UPDATE dev.bronze.customer_raw
# SET src_action = 'I',
#     src_insert_dt = current_timestamp() - interval 3 days;

# -- 3. Verify the schema
# SELECT * FROM dev.bronze.customer_raw LIMIT 5;

## 2. DLT Pipeline Code
*The code below goes into your DLT pipeline notebook.*

### Step A: Create a Streaming View
The `apply_changes` API requires a streaming source (a streaming table or a view acting as a stream) to read data from.

In [None]:
import dlt
from pyspark.sql.functions import col, expr

# Create a streaming view to read from the source table
@dlt.view
def customer_bronze_view():
    # Read as a stream so apply_changes can pick up new change rows incrementally
    return spark.readStream.table("dev.bronze.customer_raw")

### Step B: Implementing SCD Type 1
**SCD Type 1** updates existing records. It does not maintain history. If a customer changes their address, the old address is lost, and the new one is saved.

**Key Parameters:**
*   `keys`: The primary key(s) used to match records.
*   `sequence_by`: The column used to order changes. DLT ensures the record with the latest `sequence_by` value wins.
*   `stored_as_scd_type`: Set to `1`.
*   `apply_as_deletes`: Condition to identify delete rows.
*   `apply_as_truncates`: Condition to identify truncate commands (supported for SCD Type 1 only).
*   `except_column_list`: Columns from source you don't want in the final table (like the CDC metadata columns).

In [None]:
# Define the target table structure
dlt.create_streaming_table("customer_scd1_bronze")

dlt.apply_changes(
    target = "customer_scd1_bronze",
    source = "customer_bronze_view",
    keys = ["c_custkey"],  # Primary Key
    sequence_by = col("src_insert_dt"), # Order events by this timestamp
    apply_as_deletes = expr("src_action = 'D'"), # Delete if action is 'D'
    apply_as_truncates = expr("src_action = 'T'"), # Truncate table if action is 'T'
    except_column_list = ["src_action", "src_insert_dt"], # Exclude meta columns
    stored_as_scd_type = 1
)
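
To make the parameters above concrete, here is a minimal pure-Python sketch of the outcome SCD Type 1 aims for. It is not how DLT is implemented: it assumes all change events are available at once and simply replays them in `sequence_by` order. The function `apply_changes_scd1`, the `c_address` column, and the sample values are hypothetical and exist only for illustration.

```python
from datetime import datetime

def apply_changes_scd1(events, key, sequence_by, is_delete, is_truncate, except_columns):
    """Simplified model of SCD Type 1: replay events in sequence order, last write wins."""
    target = {}
    for event in sorted(events, key=lambda e: e[sequence_by]):
        if is_truncate(event):
            target.clear()                   # 'T' empties the whole table
        elif is_delete(event):
            target.pop(event[key], None)     # 'D' removes the row for this key
        else:
            # Insert or update: overwrite the row, dropping the CDC metadata columns
            target[event[key]] = {c: v for c, v in event.items() if c not in except_columns}
    return target

# Hypothetical change events; c_address is an assumed column
events = [
    {"c_custkey": 1, "c_address": "12 Oak St", "src_action": "I", "src_insert_dt": datetime(2024, 1, 1)},
    {"c_custkey": 2, "c_address": "5 Elm Rd", "src_action": "I", "src_insert_dt": datetime(2024, 1, 1)},
    {"c_custkey": 1, "c_address": "99 Pine Ave", "src_action": "U", "src_insert_dt": datetime(2024, 1, 3)},
    # Late arrival: older than the update above, so it must not win
    {"c_custkey": 1, "c_address": "40 Birch Ln", "src_action": "U", "src_insert_dt": datetime(2024, 1, 2)},
    {"c_custkey": 2, "c_address": None, "src_action": "D", "src_insert_dt": datetime(2024, 1, 4)},
]

options = dict(
    key="c_custkey",
    sequence_by="src_insert_dt",
    is_delete=lambda e: e["src_action"] == "D",
    is_truncate=lambda e: e["src_action"] == "T",
    except_columns={"src_action", "src_insert_dt"},
)

result = apply_changes_scd1(events, **options)
print(result)  # {1: {'c_custkey': 1, 'c_address': '99 Pine Ave'}}
```

Customer 1 keeps only the address with the latest `src_insert_dt`, even though an older update arrived afterwards, and customer 2 is gone because of the delete event.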

### Step C: Implementing SCD Type 2
**SCD Type 2** maintains history. If a customer changes their address, the old record is marked as inactive (via `__END_AT`), and a new record is inserted as active (via `__START_AT`).

**Automatic Columns:**
DLT automatically adds two columns to your target table and populates them from the `sequence_by` values:
1.  `__START_AT`: When the record became active.
2.  `__END_AT`: When the record became inactive (NULL means currently active).

In [None]:
# Define the target table structure
dlt.create_streaming_table("customer_scd2_bronze")

dlt.apply_changes(
    target = "customer_scd2_bronze",
    source = "customer_bronze_view",
    keys = ["c_custkey"],
    sequence_by = col("src_insert_dt"),
    apply_as_deletes = expr("src_action = 'D'"), # In SCD2, deleting closes the validity period
    except_column_list = ["src_action", "src_insert_dt"],
    stored_as_scd_type = 2 # Type 2 for History
)
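
The following pure-Python sketch illustrates how validity periods could be derived from the `sequence_by` values. It is a simplified model under stated assumptions, not DLT's implementation: every change event is available at once, each event becomes its own version, and a delete only closes the previous version. The function `apply_changes_scd2`, the `c_address` column, and the sample values are hypothetical.

```python
from datetime import datetime

def apply_changes_scd2(events, key, sequence_by, is_delete, except_columns):
    """Simplified model of SCD Type 2: one row per version, with validity bounds."""
    by_key = {}
    for event in events:
        by_key.setdefault(event[key], []).append(event)

    history = []
    for key_events in by_key.values():
        # Ordering by the sequence column puts late arrivals in their proper place
        ordered = sorted(key_events, key=lambda e: e[sequence_by])
        for current, following in zip(ordered, ordered[1:] + [None]):
            if is_delete(current):
                continue  # a delete adds no row; it only closes the previous version
            row = {c: v for c, v in current.items() if c not in except_columns}
            row["__START_AT"] = current[sequence_by]
            row["__END_AT"] = following[sequence_by] if following else None
            history.append(row)
    return history

# Hypothetical change events; c_address is an assumed column
events = [
    {"c_custkey": 1, "c_address": "12 Oak St", "src_action": "I", "src_insert_dt": datetime(2024, 1, 1)},
    {"c_custkey": 2, "c_address": "5 Elm Rd", "src_action": "I", "src_insert_dt": datetime(2024, 1, 1)},
    {"c_custkey": 1, "c_address": "99 Pine Ave", "src_action": "U", "src_insert_dt": datetime(2024, 1, 3)},
    # Late arrival that belongs between the two versions above
    {"c_custkey": 1, "c_address": "40 Birch Ln", "src_action": "U", "src_insert_dt": datetime(2024, 1, 2)},
    {"c_custkey": 2, "c_address": None, "src_action": "D", "src_insert_dt": datetime(2024, 1, 4)},
]

history = apply_changes_scd2(
    events,
    key="c_custkey",
    sequence_by="src_insert_dt",
    is_delete=lambda e: e["src_action"] == "D",
    except_columns={"src_action", "src_insert_dt"},
)
for row in history:
    print(row["c_custkey"], row["c_address"], row["__START_AT"].date(), row["__END_AT"])
```

In this model customer 1 ends up with three versions, of which only "99 Pine Ave" has `__END_AT` set to `None`. Customer 2 has a single version that is closed at the timestamp of the delete event.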

## 3. Handling Downstream Dependencies (SCD 2)
When querying an SCD Type 2 table for current/active data, you must filter where `__END_AT` is `NULL`.

In [None]:
@dlt.table(
    name = "orders_silver",
    comment = "Silver table joining orders with active customer info"
)
def orders_silver():
    # Read the SCD2 table
    df_customers = dlt.read("customer_scd2_bronze")
    
    # Filter only active customers
    df_active_customers = df_customers.filter("__END_AT IS NULL")
    
    # Read orders
    df_orders = dlt.read("orders_bronze")
    
    # Join logic (example)
    return df_orders.join(df_active_customers, "c_custkey", "left")
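
To see why the `__END_AT IS NULL` filter matters, here is a small pure-Python sketch of the same left join with and without it. The rows, the `o_orderkey` and `c_address` columns, and the `left_join` helper are hypothetical; the point is only that joining against the full history yields one output row per customer version.

```python
# Hypothetical SCD Type 2 history: customer 1 has one closed and one active version
history = [
    {"c_custkey": 1, "c_address": "12 Oak St", "__END_AT": "2024-01-03"},
    {"c_custkey": 1, "c_address": "99 Pine Ave", "__END_AT": None},
]
orders = [{"o_orderkey": 100, "c_custkey": 1}]

def left_join(left_rows, right_rows, key):
    """Naive left join: keep every left row, pairing it with each matching right row."""
    joined = []
    for left in left_rows:
        matches = [r for r in right_rows if r[key] == left[key]]
        for match in matches or [{}]:  # no match: keep the left row on its own
            joined.append({**left, **match})
    return joined

# Without the filter, the single order is duplicated once per customer version
unfiltered = left_join(orders, history, "c_custkey")

# With the filter, each order matches at most one (active) customer row
active = [c for c in history if c["__END_AT"] is None]
filtered = left_join(orders, active, "c_custkey")

print(len(unfiltered), len(filtered))  # 2 1
```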

## 4. Scenarios & Testing
You can verify the logic by inserting specific records into `customer_raw` (Interactive Mode) and refreshing the DLT pipeline.

1.  **Update Scenario:** Insert a row with the same `c_custkey` but new data and a newer `src_insert_dt`.
    *   *SCD 1 Result:* Row is updated.
    *   *SCD 2 Result:* The old row's `__END_AT` is populated, and a new row is added with `__END_AT` set to NULL.
2.  **Backfill Scenario:** Insert a row with an OLDER `src_insert_dt` than the current record.
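    *   *Result:* For SCD 2, DLT places the late record at its correct position in the history and adjusts the `__START_AT` and `__END_AT` values of the neighboring versions. For SCD 1, the late record is ignored because a row with a newer `sequence_by` value already exists.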
3.  **Delete Scenario:** Insert a row with `src_action = 'D'`.
    *   *Result:* Record removed (SCD 1) or closed out (SCD 2).
4.  **Truncate Scenario:** Insert a row with `src_action = 'T'`.
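    *   *Result:* The SCD 1 target table is completely emptied. The SCD 2 table is unaffected, because `apply_as_truncates` is supported only for SCD Type 1 and is not set in Step C.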