# Databricks DLT: Advanced Features & Orchestration
## Handling Truncates, Full Refreshes, and File Arrival Triggers

**Objective:**
In this session, we will wrap up our deep dive into Delta Live Tables (DLT) by addressing complex real-world scenarios and moving our pipeline to production.

**Agenda:**
1.  **Handling Non-Append-Only Sources:** How to stream from tables that get truncated and reloaded (Truncate & Load).
2.  **Full Refresh Management:** How to trigger a full refresh and how to **prevent** specific tables (like SCD Type 2) from losing history during a full refresh.
3.  **Orchestration with Workflows:** Scheduling the DLT pipeline using Databricks Workflows.
4.  **Event-Driven Architecture:** Setting up **File Arrival Triggers** to run the pipeline automatically when new data lands in a Volume.

**Pre-requisites:**
*   You should have the DLT Pipeline from the previous sessions (`00_dlt_introduction`) set up.
*   A running Compute Cluster for executing the setup commands in this notebook.

## 1. Handling Truncate & Load Sources in Streaming

**The Problem:**
Streaming Tables (`readStream`) expect their source to be append-only. If the source Delta table is truncated (all data deleted) and then reloaded, the truncate is recorded as a commit that removes data, and by default the stream fails when it encounters such a data-changing commit.

**The Scenario:**
Imagine our `orders_raw` table isn't appended to. Instead, an upstream process truncates it daily and inserts the "new" incremental batch.

**The Solution:**
We can set the option `skipChangeCommits` to `true`. This tells DLT/Spark Structured Streaming to ignore commits that delete or update existing rows in the source Delta table and to process only the newly inserted rows.
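To make the failure mode concrete, here is a toy model in plain Python. It is not Spark code: the `Commit` class, the `read_stream` function, and the operation names are all hypothetical stand-ins that only mimic how a streaming reader reacts to a data-changing commit with and without the option.

```python
from dataclasses import dataclass, field


@dataclass
class Commit:
    """One entry in a simplified, hypothetical Delta transaction log."""
    version: int
    operation: str  # "WRITE" (append) or "TRUNCATE" (data-changing)
    rows: list = field(default_factory=list)


class NonAppendChangeError(Exception):
    """Raised when the toy stream meets a commit that removes or rewrites data."""


def read_stream(commits, skip_change_commits=False):
    """Return all appended rows in commit order.

    Without skip_change_commits, a data-changing commit aborts the read.
    With it, such commits are ignored and only appends are returned.
    """
    output = []
    for commit in commits:
        if commit.operation != "WRITE":
            if skip_change_commits:
                continue  # ignore the truncate, keep reading later appends
            raise NonAppendChangeError(
                f"version {commit.version}: {commit.operation} is not append-only"
            )
        output.extend(commit.rows)
    return output


log = [
    Commit(0, "WRITE", ["order-1", "order-2"]),
    Commit(1, "TRUNCATE"),
    Commit(2, "WRITE", ["order-999"]),
]

try:
    read_stream(log)
except NonAppendChangeError as err:
    print("default stream failed:", err)

print(read_stream(log, skip_change_commits=True))
```

Note what the second call returns: the truncate is skipped rather than applied, so rows ingested before the truncate are still part of the output. The same holds for the real option, which is why the downstream table keeps earlier records.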

### Step 1.1: Simulate a Truncate & Load
Let's modify our source data. We will truncate the `orders_raw` table and insert a small batch of new records.

*Note: Run this cell using your standard interactive cluster, not inside the DLT pipeline.*

```python
# 1. Truncate the source table
spark.sql("TRUNCATE TABLE dev.bronze.orders_raw")

# 2. Insert new incremental records
spark.sql("""
INSERT INTO dev.bronze.orders_raw
VALUES
(999, 227285, 'O', 162769.66, '1995-10-11', '1-URGENT', 'Clerk#000000412', 0, 'Incremental Record'),
(999, 227285, 'O', 100,       '1995-10-11', '1-URGENT', 'Clerk#000000412', 0, 'Incremental Record'),
(999, 227285, 'O', 999,       '1995-10-11', '1-URGENT', 'Clerk#000000412', 0, 'Incremental Record')
""")

# 3. Verify data
display(spark.sql("SELECT * FROM dev.bronze.orders_raw"))
```

### Step 1.2: Update DLT Code
Go to your DLT pipeline notebook (e.g., `00_dlt_introduction`) and update the `orders_bronze` definition.

**Code Change:**
Add `.option("skipChangeCommits", "true")` to the read stream.

```python
@dlt.table(
    comment="Order bronze table"
)
def orders_bronze():
    return (
        spark.readStream
        .format("delta")
        .option("skipChangeCommits", "true") # <--- ADD THIS OPTION
        .table("dev.bronze.orders_raw")
    )
```

**Action:**
1. Update the code in your DLT notebook.
2. Click Start on your DLT Pipeline.
3. Verify that the pipeline succeeds and processes the 3 new records, despite the source being truncated.

## 2. Managing Full Refreshes

**The Concept:**
A **Full Refresh** clears all state and data from the DLT tables and rebuilds them from scratch by re-reading all source data. This is useful if the schema changes drastically or the transformation logic is updated significantly.

**The Problem with SCD Type 2:**
If you have a Slowly Changing Dimension (SCD) Type 2 table (which tracks history), a Full Refresh is dangerous. It will wipe out your historical tracking and reset the table to the current state of the source, effectively deleting your historical insights.
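The loss of history can be illustrated with a toy SCD Type 2 table in plain Python. This is a sketch, not DLT code: `apply_scd2` is a hypothetical helper, and only the `__START_AT` / `__END_AT` column names mirror what DLT adds to SCD Type 2 targets. It assumes the source keeps only the latest state of each record.

```python
def apply_scd2(history, key, value, ts):
    """Close the open row for `key` if the value changed, then open a new one."""
    for row in history:
        if row["key"] == key and row["__END_AT"] is None:
            if row["value"] == value:
                return  # no change, nothing to record
            row["__END_AT"] = ts
    history.append(
        {"key": key, "value": value, "__START_AT": ts, "__END_AT": None}
    )


# Incremental processing: every change to customer 1 is captured over time.
incremental = []
apply_scd2(incremental, key=1, value="Berlin", ts=1)
apply_scd2(incremental, key=1, value="Paris", ts=2)
apply_scd2(incremental, key=1, value="Rome", ts=3)

# Full refresh: the table is rebuilt from what the source holds now.
# The source only has the latest state, so earlier versions cannot be replayed.
current_source = [(1, "Rome", 3)]
rebuilt = []
for key, value, ts in current_source:
    apply_scd2(rebuilt, key, value, ts)

print(len(incremental), "rows before full refresh")
print(len(rebuilt), "row after full refresh")
```

The incremental table holds three versions of the customer, while the rebuilt one holds only the current version.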

**The Solution:**
We can set the table property `pipelines.reset.allowed` to `false`. This blocks a full refresh of that specific table, even when "Full Refresh All" is clicked, while new data continues to flow into it incrementally.
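The effect of the property can be modelled with a small simulation in plain Python. This is only a sketch of the behaviour described above: `full_refresh_all` and the dictionary layout are hypothetical, and the only real name is the `pipelines.reset.allowed` property key.

```python
def full_refresh_all(tables):
    """Toy model of "Full Refresh All": reset every table that allows it.

    `tables` maps a table name to {"properties": {...}, "rows": [...]}.
    Returns the names of the tables that were actually reset.
    """
    reset = []
    for name, table in tables.items():
        allowed = table["properties"].get("pipelines.reset.allowed", "true")
        if allowed.lower() == "false":
            continue  # protected: existing rows are kept
        table["rows"].clear()
        reset.append(name)
    return reset


tables = {
    "orders_bronze": {"properties": {}, "rows": ["o1", "o2"]},
    "customer_scd2_bronze": {
        "properties": {"pipelines.reset.allowed": "false"},
        "rows": ["v1", "v2", "v3"],
    },
}

print(full_refresh_all(tables))
print(tables["customer_scd2_bronze"]["rows"])
```

Only `orders_bronze` is reset; the protected SCD table keeps its rows.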

### Step 2.1: Protect SCD Table
In your DLT pipeline notebook, locate your SCD Type 2 table definition (e.g., `customer_scd2_bronze`) and add the table property.

**Code Change:**

```python
dlt.create_streaming_table(
    name="customer_scd2_bronze",
    table_properties={
        "quality": "bronze",
        "pipelines.reset.allowed": "false" # <--- ADD THIS PROPERTY
    }
)
# ... rest of the SCD logic ...
```

**Action:**
1. Update the code in your DLT notebook.
2. If you now click the arrow next to "Start" and select "Full Refresh all", this table keeps its existing data and history instead of being reset, while the unprotected tables are rebuilt as usual.

## 3. Orchestration with Workflows & File Arrival Triggers

Running pipelines manually via the "Start" button is for development. For production, we use **Databricks Workflows**.

We will set up a specific type of trigger: **File Arrival**. This runs the pipeline only when a file lands in a specific Storage Location (or Volume).

### Step 3.1: Create the Workflow
1. Navigate to **Workflows** on the left sidebar.
2. Click **Create Job**.
3. **Name:** `DLT_Pipeline_File_Trigger`.
4. **Task Name:** `dlt_task`.
5. **Type:** Select **Delta Live Tables pipeline**.
6. **Pipeline:** Select your `00_dlt_introduction` pipeline.
7. **Create Task**.

### Step 3.2: Configure File Arrival Trigger
1. On the Job configuration page, look for the **Trigger** section on the right.
2. Click **Add Trigger**.
3. **Trigger Type:** Select **File Arrival**.
4. **Storage Location:** Enter the path to your Volume/Landing zone where the Auto Loader expects files.
   * *Example:* `/Volumes/dev/etl/landing/files/`
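5. **Configuration:**
   * **Minimum time between triggers:** (Optional) The minimum wait before a new run can start after the previous trigger fired, e.g., 60 seconds.
   * **Wait after last change:** (Optional) How long to wait after the most recent file arrives before starting the run. Useful when files land in batches.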
6. Click **Save**.
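The same job can also be described as a JSON specification instead of clicking through the UI. The sketch below builds such a specification as a Python dict, assuming the field names of the Jobs API 2.1 (`pipeline_task`, `trigger.file_arrival`); verify them against the API reference for your workspace. The helper name `build_file_arrival_job`, the 60-second values, and the pipeline ID placeholder are assumptions for illustration.

```python
import json


def build_file_arrival_job(pipeline_id, watch_url,
                           min_gap_seconds=60, settle_seconds=60):
    """Build a Jobs API 2.1 style job specification as a plain dict."""
    return {
        "name": "DLT_Pipeline_File_Trigger",
        "tasks": [
            {
                "task_key": "dlt_task",
                # Runs the DLT pipeline identified by pipeline_id
                "pipeline_task": {"pipeline_id": pipeline_id},
            }
        ],
        "trigger": {
            "pause_status": "UNPAUSED",
            "file_arrival": {
                "url": watch_url,
                "min_time_between_triggers_seconds": min_gap_seconds,
                "wait_after_last_change_seconds": settle_seconds,
            },
        },
    }


job_spec = build_file_arrival_job(
    pipeline_id="<your-pipeline-id>",  # placeholder, copy it from the pipeline settings
    watch_url="/Volumes/dev/etl/landing/files/",
)
print(json.dumps(job_spec, indent=2))
```

The printed JSON only describes the job; nothing is created in the workspace until the specification is submitted through the Jobs API, the CLI, or an SDK. Keeping it in version control makes the trigger configuration reviewable and repeatable across environments.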

*Note: Ensure the DLT Pipeline mode is set to **"Production"** in the DLT settings. This ensures the cluster terminates automatically after the job finishes, saving costs.*

### Step 3.3: Test the Trigger
To test this, we don't click "Run Now". Instead, we upload a file to the volume. The File Arrival trigger checks the monitored location for new files periodically, so the run starts shortly after the file lands rather than instantly.

Run the cell below to upload a dummy file to your landing zone.

```python
# Helper to write a dummy CSV file to the Volume being watched by the Workflow
import pandas as pd
import time

# Define path (update this to match the Volume path used in the Trigger configuration)
volume_path = "/Volumes/dev/etl/landing/files/"
filename = f"orders_trigger_test_{int(time.time())}.csv"
full_path = volume_path + filename

# Create dummy data
data = {
    "order_id": [1001],
    "customer_id": [227285],
    "order_status": ["O"],
    "total_price": [50.0],
    "order_date": ["2024-01-01"],
    "order_priority": ["1-URGENT"],
    "clerk": ["Clerk#001"],
    "ship_priority": [0],
    "comment": ["Triggered by File Arrival"]
}

# Write file
df = pd.DataFrame(data)
# Note: you could also use dbutils.fs.put or a standard Python file write here;
# writing to /Volumes works like a standard file system.
df.to_csv(full_path, index=False)

print(f"File uploaded to: {full_path}")
print("Go to the Workflows UI and check the 'Runs' tab. The job should start automatically within a minute.")
```
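The same test file can be written with the standard library alone, which also makes it easy to read the file back and confirm it landed. This is a minimal sketch, assuming the Volume path behaves like a regular directory; the helper name `write_trigger_file` is hypothetical, and a temporary directory stands in for the Volume so the snippet runs anywhere.

```python
import csv
import os
import tempfile
import time


def write_trigger_file(directory, row):
    """Write a one-row CSV into `directory` and return its full path."""
    filename = f"orders_trigger_test_{int(time.time())}.csv"
    full_path = os.path.join(directory, filename)
    with open(full_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(row))
        writer.writeheader()
        writer.writerow(row)
    return full_path


row = {
    "order_id": 1001,
    "customer_id": 227285,
    "order_status": "O",
    "total_price": 50.0,
    "order_date": "2024-01-01",
    "order_priority": "1-URGENT",
    "clerk": "Clerk#001",
    "ship_priority": 0,
    "comment": "Triggered by File Arrival",
}

# On Databricks, pass "/Volumes/dev/etl/landing/files/" instead of a temp dir.
with tempfile.TemporaryDirectory() as landing_dir:
    path = write_trigger_file(landing_dir, row)
    # Read the file back to confirm the write succeeded
    with open(path, newline="") as handle:
        written = list(csv.DictReader(handle))
    print(f"Wrote {len(written)} row to {os.path.basename(path)}")
```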

## 4. Summary & SQL Alternative

### DLT with SQL
While this series focused on **Python** for DLT (which offers flexibility for metaprogramming), Databricks fully supports **SQL** for DLT.
The concepts remain identical:
*   `CREATE OR REFRESH STREAMING TABLE` (written as `CREATE STREAMING LIVE TABLE` in the older syntax) instead of `@dlt.table`
*   `APPLY CHANGES INTO` for SCD Type 1/2 logic.

### What we accomplished:
1.  **Robust Ingestion:** Used `skipChangeCommits` to stream from non-append (truncate and load) sources.
2.  **Safety:** Protected historical tables from accidental Full Refreshes.
3.  **Automation:** Deployed the pipeline using Workflows.
4.  **Event-Driven:** Configured the pipeline to run automatically upon data arrival.

This concludes the Delta Live Tables module of the Zero to Hero series!