# Auto Loader, Append Flow & Dynamic Tables in DLT

**Objective:**
In this session, we will advance our Delta Live Tables (DLT) pipeline by:
1.  **Ingesting Files with Auto Loader:** Reading CSV files from a Unity Catalog managed volume using the `cloudFiles` source.
2.  **Implementing Append Flow:** Combining two streams (the Delta table stream and the Auto Loader stream) into a single target table incrementally, without reprocessing data that has already been written.
3.  **Dynamic Table Generation:** Using DLT Pipeline Configurations (Parameters) to dynamically generate Gold tables based on input variables (e.g., Order Status).

**Pre-requisites:**
*   A Unity Catalog managed volume created at `/Volumes/dev/etl/landing/` (covered in Setup).
*   CSV files uploaded to the `files` directory within that volume.

In [None]:
import dlt
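# Import only what the pipeline uses; note that `sum` here is Spark's aggregate and shadows Python's built-in sum
from pyspark.sql.functions import col, count, current_timestamp, sum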

# -------------------------------------------------------------------------
# 1. READ DATA FROM DELTA TABLES (Existing Logic)
# -------------------------------------------------------------------------

# Reading raw orders data from the Delta Table source
@dlt.table(
    table_properties={"quality": "bronze"},
    comment = "Order Bronze Table",
    name = "orders_bronze_raw"
)
def orders_bronze_raw():
    return spark.readStream.table("dev.bronze.orders_raw")

# -------------------------------------------------------------------------
# 2. READ DATA USING AUTOLOADER (New Logic)
# -------------------------------------------------------------------------

# Reading raw orders data from Files (CSV) using Auto Loader
@dlt.table(
    table_properties={"quality": "bronze"},
    comment = "Order Autoloader Table",
    name = "orders_autoloader_bronze"
)
def orders_autoloader_bronze():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
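        # Where Auto Loader persists the inferred schema
        # (DLT can manage this location itself if the option is omitted)
        .option("cloudFiles.schemaLocation", "/Volumes/dev/etl/landing/autoloader/schemas/1/")
        # "none" disables schema evolution: columns that appear later are ignored rather than added
        .option("cloudFiles.schemaEvolutionMode", "none")
        # Schema hints override the inferred type of the named columns.
        # They only apply if the files expose these column names, e.g. a header row read with .option("header", "true")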
        .option("cloudFiles.schemaHints", """
            order_id long, 
            customer_id long, 
            order_status string, 
            total_price decimal(10,2), 
            order_date date, 
            order_priority string, 
            clerk string, 
            shipping_priority integer, 
            comment string
        """)
        # The path where files are landed
        .load("/Volumes/dev/etl/landing/files/")
    )

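The hints string in the cell above is written inline across several lines. As a hedged alternative, it could be assembled from a mapping so the column list lives in one place and the option value is a compact single line. This is a minimal sketch; `ORDER_COLUMN_TYPES` and `build_schema_hints` are illustrative names that do not appear in the original pipeline.

```python
# Illustrative mapping of column name -> Spark SQL type, copied from the hints above
ORDER_COLUMN_TYPES = {
    "order_id": "long",
    "customer_id": "long",
    "order_status": "string",
    "total_price": "decimal(10,2)",
    "order_date": "date",
    "order_priority": "string",
    "clerk": "string",
    "shipping_priority": "integer",
    "comment": "string",
}


def build_schema_hints(column_types: dict) -> str:
    """Join 'name type' pairs into one comma-separated hints string."""
    return ", ".join(f"{name} {dtype}" for name, dtype in column_types.items())


schema_hints = build_schema_hints(ORDER_COLUMN_TYPES)
print(schema_hints)
```

The result would be passed as `.option("cloudFiles.schemaHints", schema_hints)` in place of the triple-quoted string.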
In [None]:
# -------------------------------------------------------------------------
# 3. UNION STREAMS USING APPEND FLOW
# -------------------------------------------------------------------------
# A union() inside one table definition ties both sources to a single query,
# so changing the set of sources later typically means a full refresh.
# Append flows instead let each source write incrementally into one target table.

# Step A: Create the Target Streaming Table
dlt.create_streaming_table("orders_union_bronze")

# Step B: Append data from the Delta Source
@dlt.append_flow(target = "orders_union_bronze")
def order_delta_append():
    return spark.readStream.table("LIVE.orders_bronze_raw")

# Step C: Append data from the Autoloader Source
@dlt.append_flow(target = "orders_union_bronze")
def order_autoloader_append():
    return spark.readStream.table("LIVE.orders_autoloader_bronze")

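To make the incremental behaviour concrete, here is a plain-Python toy model. It is not DLT's actual implementation: it assumes each flow tracks its own read position (the `offset` field stands in for a streaming checkpoint) and uses in-memory lists as hypothetical sources and target.

```python
from dataclasses import dataclass


@dataclass
class AppendFlow:
    """Toy stand-in for an append flow: remembers how far it has read its source."""
    name: str
    source: list
    offset: int = 0  # stand-in for a per-flow streaming checkpoint

    def run(self, target: list) -> int:
        """Append only rows not yet seen by this flow; return how many were added."""
        new_rows = self.source[self.offset:]
        target.extend(new_rows)
        self.offset = len(self.source)
        return len(new_rows)


# Hypothetical sources: one for the Delta stream, one for the file stream
delta_source = [{"order_id": 1}, {"order_id": 2}]
file_source = [{"order_id": 3}]
target = []

flows = [
    AppendFlow("order_delta_append", delta_source),
    AppendFlow("order_autoloader_append", file_source),
]

# First update: every flow appends everything it has
first_update = [flow.run(target) for flow in flows]

# A new file lands; only the file flow has work to do on the next update
file_source.append({"order_id": 4})
second_update = [flow.run(target) for flow in flows]

print(first_update, second_update, len(target))
```

In this model the second update touches only the newly arrived row, which is the property the append flows in the cell above are used for.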
In [None]:
# -------------------------------------------------------------------------
# 4. DOWNSTREAM TRANSFORMATIONS (Silver Layer)
# -------------------------------------------------------------------------

# Reading Customer Data
@dlt.table(
    table_properties={"quality": "bronze"},
    comment = "Customer Bronze Table",
    name = "customer_bronze"
)
def cust_bronze():
    return spark.read.table("dev.bronze.customer_raw")

# Creating Joined View (Silver Logic)
# We join the UNIONED table with the Customer table
@dlt.view(
    comment = "Joined View"
)
def joined_vw():
    df_orders = spark.readStream.table("LIVE.orders_union_bronze")
    df_cust = spark.read.table("LIVE.customer_bronze")
    
    # Performing Left Outer Join
    df_join = df_orders.join(
        df_cust, 
        df_orders.customer_id == df_cust.c_custkey, 
        "left_outer"
    ).select(
        df_orders["*"], 
        df_cust["c_name"], 
        df_cust["c_nationkey"]
    )
    return df_join

# Create Silver Table
@dlt.table(
    table_properties={"quality": "silver"},
    name = "orders_silver"
)
def orders_silver():
    return (
        # joined_vw is built on a streaming read, so it is consumed as a stream here
        dlt.read_stream("joined_vw")
        .withColumn("insert_date", current_timestamp())
    )

In [None]:
# -------------------------------------------------------------------------
# 5. DYNAMIC GOLD TABLES USING PARAMETERS
# -------------------------------------------------------------------------
# We will read a configuration parameter 'custom.orderStatus' passed 
# during the Pipeline setup (e.g., value = "O,F")

# Fetch configuration, default to "NA" if not present
_order_status_config = spark.conf.get("custom.orderStatus", "NA")

# Split the string into a clean list (e.g., "O, F" -> ['O', 'F']),
# trimming whitespace and dropping empty entries
_status_list = [s.strip() for s in _order_status_config.split(",") if s.strip()]

# Loop through each status and generate a table dynamically
for _status in _status_list:

    # Bind the current status as a default argument so each generated table
    # keeps its own value rather than the loop variable's final value
    def create_gold_table(status_val=_status):
        @dlt.table(
            table_properties={"quality": "gold"},
            comment = f"Orders Aggregated Table for status {status_val}",
            name = f"orders_agg_{status_val}_gold"
        )
        def orders_agg_gold():
            return (
                dlt.read("orders_silver")
                # Filter based on the dynamic status
                .filter(col("order_status") == status_val)
                .groupBy("c_name")
                .agg(
                    count("order_id").alias("total_orders"),
                    sum("total_price").alias("total_sales")
                )
            )
    
    # Call the function to register the DLT table
    create_gold_table()

### Configuration Steps for DLT Pipeline

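For the dynamic gold tables to be generated, add the following **Configuration** entry in your DLT Pipeline Settings. Without it, the code falls back to `NA` and registers a single `orders_agg_NA_gold` table.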

1.  Open your DLT Pipeline settings.
2.  Scroll to the **Advanced** section.
3.  Under **Configuration**, add:
    *   **Key:** `custom.orderStatus`
    *   **Value:** `O,F` (or any comma-separated status codes you wish to filter by)
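Before saving the setting, it can help to preview which gold tables a given value would produce. The sketch below mirrors the naming pattern `orders_agg_<status>_gold` from the loop. `preview_gold_tables` is a hypothetical helper, and the restriction to letters, digits, and underscores is an assumption made here to keep table names plain, not a rule stated by the pipeline. The helper trims spaces around each code; entering the value without spaces, as in `O,F`, avoids any difference from the pipeline's own parsing.

```python
import re


def preview_gold_tables(raw_value: str) -> list:
    """Return the gold table names a custom.orderStatus value would generate."""
    names = []
    for part in raw_value.split(","):
        status = part.strip()
        if not status:
            continue  # skip blanks such as those left by a trailing comma
        # Assumption: only allow characters that are safe in an unquoted table name
        if not re.fullmatch(r"[A-Za-z0-9_]+", status):
            raise ValueError(f"Status {status!r} would not make a plain table name")
        names.append(f"orders_agg_{status}_gold")
    return names


tables = preview_gold_tables("O,F")
print(tables)
```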

### Key Takeaways
*   **Auto Loader (`cloudFiles`)** ingests files incrementally: each pipeline update picks up only the files that arrived since the previous one.
*   **`@dlt.append_flow`** lets several streaming sources write into a single target table, with each source processed incrementally.
*   **Python Logic:** Because this pipeline is defined in Python, standard control flow (loops, variables) can generate tables programmatically from configuration values.
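One detail of generating definitions in a loop is worth isolating: the gold-table loop passes each status through a default argument. A minimal plain-Python sketch of why that matters follows, using hypothetical helper names (`make_filters_late`, `make_filters_bound`) and a dictionary in place of a DataFrame row.

```python
def make_filters_late(statuses):
    """Closures read `status` when called, so all of them see the last loop value."""
    filters = []
    for status in statuses:
        filters.append(lambda row: row["order_status"] == status)
    return filters


def make_filters_bound(statuses):
    """A default argument captures the value at definition time, once per iteration."""
    filters = []
    for status in statuses:
        filters.append(lambda row, status_val=status: row["order_status"] == status_val)
    return filters


row = {"order_status": "O"}

# Both late-bound filters end up comparing against "F"
late = [f(row) for f in make_filters_late(["O", "F"])]

# Each bound filter keeps its own status
bound = [f(row) for f in make_filters_bound(["O", "F"])]

print(late)
print(bound)
```

The same reasoning applies to any function that is defined inside a loop and evaluated after the loop has finished.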