# Delta Live Tables (DLT) Introduction

**Delta Live Tables (DLT)** is a declarative framework for building reliable, maintainable, and testable data processing pipelines. Instead of defining your data pipelines using a series of separate Apache Spark tasks, you define the *state* of the data (target tables) and the transformations required to reach that state. DLT handles the orchestration, cluster management, error handling, and data quality checks automatically.

### Key Concepts
1.  **Declarative Framework**: You tell DLT *what* you want (e.g., "I want a table X derived from Y"), and DLT figures out *how* to build and update it.
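2.  **Continuous vs. Triggered**: A continuous pipeline keeps running and processes new data as it arrives (low latency). A triggered pipeline processes the data available when it starts and then stops; it is typically started manually or on a schedule.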
3.  **Development vs. Production**:
    *   **Development**: Reuses clusters to speed up iteration/debugging.
    *   **Production**: Restarts clusters for every triggered run to ensure isolation and cost efficiency, and retries automatically on failure.

### Dataset Types in DLT
| Type | Description | Use Case |
| :--- | :--- | :--- |
| **Streaming Table** | Incremental data processing. Only reads new data from source. | Ingestion from Kafka, Auto Loader, or append-only sources. |
| **Materialized View** | Recomputes the entire table from source based on query logic. | Aggregations, joins, complex transformations where state needs to be refreshed. |
| **View** | Intermediate logic. Not physically stored. | Reusable transformation logic used by other tables in the pipeline. |
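
To make the first two rows of the table concrete, here is a toy model in plain Python. It is only a conceptual sketch: lists stand in for tables, and the class names (`StreamingTableModel`, `MaterializedViewModel`) are made up for illustration and are not part of the DLT API.

```python
# Conceptual model only: plain Python lists stand in for tables.
# None of these names are DLT APIs.

class StreamingTableModel:
    """Appends only rows it has not seen yet (tracks an offset into the source)."""

    def __init__(self):
        self.rows = []
        self.offset = 0  # number of source rows already processed

    def refresh(self, source):
        new_rows = source[self.offset:]
        self.rows.extend(new_rows)
        self.offset = len(source)
        return len(new_rows)  # rows processed in this refresh


class MaterializedViewModel:
    """Rebuilds its full result from the source on every refresh."""

    def __init__(self, query):
        self.query = query
        self.rows = []

    def refresh(self, source):
        self.rows = self.query(source)
        return len(source)  # rows scanned in this refresh


source = [{"order": 1, "amount": 10}, {"order": 2, "amount": 20}]

bronze = StreamingTableModel()
totals = MaterializedViewModel(
    lambda rows: [{"total": sum(r["amount"] for r in rows)}]
)

first_stream = bronze.refresh(source)
first_mv = totals.refresh(source)

source.append({"order": 3, "amount": 5})

second_stream = bronze.refresh(source)  # processes only the one new row
second_mv = totals.refresh(source)      # rescans all three rows
```

The second refresh is where the two models differ: the streaming model touches one row, while the materialized view model reads the whole source again to rebuild its aggregate.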

---
**Pre-requisites:**
*   A Databricks Workspace with **Premium** Plan (DLT is a premium feature).
*   Unity Catalog enabled (preferred) or Hive Metastore.

## 1. Environment Setup (Run this interactively)
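Before writing the DLT pipeline code, let's prepare some source data. We will use the `samples` catalog provided by Databricks and deep clone the `orders` and `customer` tables from its `tpch` schema into a location we can use for our pipeline. The setup code assumes a catalog named `dev` already exists and that you are allowed to create schemas in it.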

*Note: Run this cell on a standard All-Purpose Cluster.*

In [None]:
# Setup: Create a schema and clone sample data for our DLT source
# This mimics having raw data landing in your bronze layer

# 1. Create a schema for our source data
spark.sql("CREATE SCHEMA IF NOT EXISTS dev.etl_source")

# 2. Deep clone sample tables to our dev environment so we can treat them as raw data
# We are using TPCH orders and customer tables
spark.sql("""
    CREATE TABLE IF NOT EXISTS dev.etl_source.orders_raw 
    DEEP CLONE samples.tpch.orders
""")

spark.sql("""
    CREATE TABLE IF NOT EXISTS dev.etl_source.customer_raw 
    DEEP CLONE samples.tpch.customer
""")

print("Source data setup complete in 'dev.etl_source'")
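
Before moving on, it can help to confirm that both source tables exist. The helper below is a minimal sketch: `missing_tables` is a hypothetical name, and it assumes a runtime where `spark.catalog.tableExists` accepts fully qualified three-part names.

```python
def missing_tables(spark, table_names):
    """Return the names in table_names that the catalog does not contain."""
    return [name for name in table_names if not spark.catalog.tableExists(name)]


SOURCE_TABLES = ["dev.etl_source.orders_raw", "dev.etl_source.customer_raw"]

# In a Databricks notebook `spark` is predefined, so the check would be:
#     print(missing_tables(spark, SOURCE_TABLES))
# An empty list means both clones are in place.
```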

## 2. The DLT Pipeline Code
The code below represents the logic for a DLT pipeline. 

**Important:** 
1.  Do not run the cells below interactively using "Shift+Enter" on a standard cluster. It will likely fail or do nothing useful because the `dlt` module is specific to the Pipeline runtime.
2.  In a real scenario, this code would exist in a notebook file that you select when creating a **Pipeline** in the Workflows tab.

### The Pipeline Logic:
1.  **`orders_bronze`**: A **Streaming Table** reading incrementally from the raw orders table.
2.  **`customer_bronze`**: A **Materialized View** reading the full customer snapshot (batch).
3.  **`joined_view`**: A temporary **View** that left-joins orders to customers on the customer key.
4.  **`joined_silver`**: A **Materialized View** storing the result of the join with added metadata.
5.  **`orders_by_segment_gold`**: A **Materialized View** calculating order counts per market segment (Gold Layer).
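
Because each dataset names the datasets it reads from, an execution order can be derived from the definitions alone. The sketch below illustrates that idea with the standard library's `graphlib`, using the dataset names from the pipeline code. It shows dependency ordering in general and is not a description of how DLT plans a run internally.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each dataset maps to the pipeline datasets it reads from.
dependencies = {
    "orders_bronze": set(),
    "customer_bronze": set(),
    "joined_view": {"orders_bronze", "customer_bronze"},
    "joined_silver": {"joined_view"},
    "orders_by_segment_gold": {"joined_silver"},
}

# static_order() yields every dataset after all of its dependencies.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)
```

The two bronze datasets have no dependency on each other, so either may come first; everything downstream follows in a fixed sequence.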

In [None]:
import dlt
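# Import only the functions used below; a wildcard import would shadow Python built-ins
from pyspark.sql.functions import count, current_timestamp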

# ---------------------------------------------------------
# BRONZE LAYER
# ---------------------------------------------------------

# 1. Streaming Table for Orders
# We use readStream to treat the source as a stream (incremental processing)
@dlt.table(
    name="orders_bronze",
    comment="Raw orders data ingested incrementally",
    table_properties={"quality": "bronze"}
)
def orders_bronze():
    return (
        spark.readStream
        .format("delta")
        .table("dev.etl_source.orders_raw")
    )

# 2. Materialized View for Customers
# We assume customer data is relatively small or updates slowly, so we treat it as batch
@dlt.table(
    name="customer_bronze",
    comment="Raw customer reference data",
    table_properties={"quality": "bronze"}
)
def customer_bronze():
    return (
        spark.read
        .format("delta")
        .table("dev.etl_source.customer_raw")
    )

# ---------------------------------------------------------
# SILVER LAYER (Transformations)
# ---------------------------------------------------------

# 3. Intermediate View (Join Logic)
# Views are not materialized to storage; they are computed at runtime
@dlt.view(
    name="joined_view",
    comment="Intermediate logic to join orders with customers"
)
def joined_view():
    # Read from the DLT tables defined above using the "LIVE" keyword
    df_orders = spark.read.table("LIVE.orders_bronze")
    df_customers = spark.read.table("LIVE.customer_bronze")
    
    # Perform Left Join
    # Note: 'o_' and 'c_' are prefixes in TPCH column names
    return df_orders.join(
        df_customers, 
        df_orders.o_custkey == df_customers.c_custkey, 
        "left"
    )

# 4. Materialized View (Silver Table)
# This persists the joined data
@dlt.table(
    name="joined_silver",
    comment="Enriched orders with customer details",
    table_properties={"quality": "silver"}
)
def joined_silver():
    return (
        spark.read.table("LIVE.joined_view")
        .withColumn("processed_timestamp", current_timestamp())
    )

# ---------------------------------------------------------
# GOLD LAYER (Aggregations)
# ---------------------------------------------------------

# 5. Aggregated Materialized View
# Business level aggregations
@dlt.table(
    name="orders_by_segment_gold",
    comment="Aggregated order counts by market segment",
    table_properties={"quality": "gold"}
)
def orders_by_segment_gold():
    df = spark.read.table("LIVE.joined_silver")
    
    return (
        df.groupBy("c_mktsegment")
        .agg(count("o_orderkey").alias("total_orders"))
    )

## 3. How to Execute this Pipeline?

Since this is DLT code, you must create a Pipeline in the Databricks UI:

1.  **Navigate**: Go to `Workflows` -> `Delta Live Tables` on the left sidebar.
2.  **Create Pipeline**: Click **Create Pipeline**.
3.  **Configure**:
    *   **Pipeline Name**: `01_DLT_Introduction`
    *   **Product Edition**: `Core` (or `Advanced` if you want Data Quality expectations later; `Pro` adds change data capture).
    *   **Pipeline Mode**: `Triggered` (Batch style) or `Continuous`. For this demo, use **Triggered**.
    *   **Source Code**: Browse and select **THIS Notebook**.
    *   **Destination**:
        *   **Storage options**: Unity Catalog (Recommended) or Hive Metastore.
        *   **Catalog**: `dev`
        *   **Target Schema**: `etl_dlt_demo` (This is where tables like `orders_bronze` will physically be created).
    *   **Compute**: Select a cluster policy or allow it to use the default node type (for example, `Standard_DS3_v2` on Azure).
4.  **Run**: Click **Start**.
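
The same choices can also be written down as pipeline settings in JSON. The snippet below is a sketch that builds such a document in Python. The field names reflect my understanding of the pipeline settings format and should be checked against the JSON view of your own pipeline; the notebook path is a placeholder, not a real location.

```python
import json

# Sketch of pipeline settings mirroring the UI choices above.
# Field names are assumptions to verify against your workspace's JSON view.
pipeline_settings = {
    "name": "01_DLT_Introduction",
    "edition": "CORE",
    "continuous": False,   # False corresponds to Triggered mode
    "development": True,   # Development mode, as described under Key Concepts
    "catalog": "dev",
    "target": "etl_dlt_demo",
    "libraries": [
        # Hypothetical placeholder path; point this at the pipeline notebook
        {"notebook": {"path": "/path/to/this/notebook"}}
    ],
}

print(json.dumps(pipeline_settings, indent=2))
```

Keeping the settings in a file like this makes the configuration reviewable and repeatable, which the click-through setup is not.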

### Debugging Tips
*   **Development Mode**: When starting the pipeline, select "Development". This keeps the cluster running after failure or success, allowing you to iterate quickly without waiting for cluster spin-up every time.
*   **Event Log**: If the pipeline fails (e.g., `NameError: name 'count' is not defined`), check the Event Log at the bottom of the DLT UI screen. It provides stack traces similar to those in standard notebooks.

## 4. Validating the Data
Once the pipeline run finishes successfully, you can query the created tables in a standard SQL editor or a separate notebook.

```sql
-- Switch to the target catalog and schema you defined in the pipeline settings
USE dev.etl_dlt_demo;

-- Check the Bronze Streaming Table
SELECT * FROM orders_bronze LIMIT 10;

-- Check the Silver Enriched Table
SELECT * FROM joined_silver LIMIT 10;

-- Check the Gold Aggregated Data
SELECT * FROM orders_by_segment_gold;
```