### Kardiaflow - Bronze Claims Auto Loader

**Source:** Raw Parquet files in ADLS

**Target:** `kardia_bronze.bronze_claims` (CDF enabled)

**Trigger:** Incremental batch via Auto Loader; append to Bronze Claims table

In [0]:
import pyspark.sql.functions as F

from kflow.config import BRONZE_DB, bronze_paths
from kflow.etl_utils import add_audit_cols
from kflow.notebook_utils import init, show_history

**Step 1. Initialize environment and load paths**

We start by calling `init()` to set up authentication, Spark configs, and database context.  

Then we load the Bronze table metadata for the Claims dataset with `bronze_paths("claims")`, which gives us all storage locations (raw, schema, checkpoint, bad records) and the target Delta table name.

**Note:** We don’t manually define a schema here. Parquet is a self-describing format, so Auto Loader can infer and persist the schema reliably. We rely on this in Bronze and enforce stricter types when transforming to Silver.
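The `kflow.config` module is not shown in this notebook, so the following is only a minimal sketch of the kind of object `bronze_paths("claims")` could return. The attribute names (`table`, `raw`, `bronze`, `schema`, `checkpoint`, `bad`) match how `P` is used in the cells below. The class name `BronzePaths`, the function name `bronze_paths_sketch`, the `/example/lake` root, and the folder layout are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BronzePaths:
    """Hypothetical stand-in for the object returned by bronze_paths()."""
    table: str       # fully qualified Delta table name
    raw: str         # landing folder that Auto Loader watches
    bronze: str      # storage location of the Bronze Delta table
    schema: str      # Auto Loader schema-tracking directory
    checkpoint: str  # streaming checkpoint directory
    bad: str         # destination for records that fail to parse


def bronze_paths_sketch(dataset: str,
                        db: str = "kardia_bronze",
                        root: str = "/example/lake") -> BronzePaths:
    """Derive every location for one dataset from a single root (assumed layout)."""
    return BronzePaths(
        table=f"{db}.bronze_{dataset}",
        raw=f"{root}/raw/{dataset}",
        bronze=f"{root}/bronze/{dataset}",
        schema=f"{root}/_schemas/{dataset}",
        checkpoint=f"{root}/_checkpoints/bronze_{dataset}",
        bad=f"{root}/_quarantine/bronze_{dataset}",
    )


P_demo = bronze_paths_sketch("claims")
print(P_demo.table)  # kardia_bronze.bronze_claims
```

Deriving all paths from one function keeps the schema, checkpoint, and bad-records locations distinct per dataset, which matters because a checkpoint shared between two streams would corrupt their progress tracking.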

In [0]:
init()

P = bronze_paths("claims")
BRONZE_TABLE = P.table

In [0]:
# Ensure the Bronze database and Claims table exist (Delta, with CDF enabled)
spark.sql(f"CREATE DATABASE IF NOT EXISTS {BRONZE_DB}")

spark.sql(
    f"""
    CREATE TABLE IF NOT EXISTS {BRONZE_TABLE}
    USING DELTA
    COMMENT 'Bronze Parquet ingest of claim records.'
    LOCATION '{P.bronze}'
    TBLPROPERTIES (delta.enableChangeDataFeed = true)
    """
)

**Step 2. Incremental ingestion with Auto Loader**

Auto Loader picks up new Parquet claim files from ADLS and appends only files it has not already processed, tracking the inferred schema and its history at `P.schema`. Rows with a null `ClaimID` are dropped, and `add_audit_cols` adds audit columns (`_ingest_ts`, `_source_file`, `_batch_id`) for traceability. Records that cannot be parsed are written to `P.bad` so ingestion can continue uninterrupted.
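The row-level effect of the `ClaimID` filter and the audit columns can be illustrated without Spark. The sketch below is a plain-Python model, not the actual `add_audit_cols` implementation (which lives in `kflow.etl_utils` and is not shown here). The function name `filter_and_audit`, the `Amount` column, and the file name are assumptions made for the example.

```python
from datetime import datetime, timezone
from typing import Iterable, Iterator
import uuid


def filter_and_audit(rows: Iterable[dict], source_file: str, batch_id: str) -> Iterator[dict]:
    """Plain-Python model: drop rows with a null ClaimID, then stamp audit columns."""
    ingest_ts = datetime.now(timezone.utc)  # one timestamp shared by the whole batch
    for row in rows:
        if row.get("ClaimID") is None:      # mirrors F.col("ClaimID").isNotNull()
            continue
        yield {
            **row,
            "_ingest_ts": ingest_ts,
            "_source_file": source_file,
            "_batch_id": batch_id,
        }


sample = [
    {"ClaimID": "C-001", "Amount": 120.0},
    {"ClaimID": None, "Amount": 75.5},      # dropped: no primary key
    {"ClaimID": "C-002", "Amount": 310.25},
]
kept = list(filter_and_audit(sample, "claims_part_0.parquet", str(uuid.uuid4())))
print([r["ClaimID"] for r in kept])  # ['C-001', 'C-002']
```

Every surviving row carries the same `_batch_id`, so a downstream query can trace any record back to the file and run that produced it.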

Because Parquet embeds its own schema, Bronze needs no explicit definition. Setting `mergeSchema` to `true` on the write lets the Delta table add new columns as they appear in the source files, rather than rejecting the write on a schema mismatch.
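As a rough mental model of what `mergeSchema` does to the table schema, the sketch below performs an additive merge over simple name-to-type mappings. It is a simplified illustration, not Delta Lake's implementation: the function `merge_schema` and the columns `Amount` and `ClaimStatus` are hypothetical, and the sketch rejects every type change, whereas Delta permits a few safe upcasts that are not modeled here.

```python
from typing import Dict


def merge_schema(table_schema: Dict[str, str], incoming: Dict[str, str]) -> Dict[str, str]:
    """Additive merge: keep existing columns in order, append unseen ones."""
    merged = dict(table_schema)             # copy so the input is not mutated
    for name, dtype in incoming.items():
        if name not in merged:
            merged[name] = dtype            # new column is appended at the end
        elif merged[name] != dtype:
            # Simplification: any type change is treated as a conflict
            raise TypeError(f"column {name!r}: {merged[name]} vs {dtype}")
    return merged


table_schema = {"ClaimID": "string", "Amount": "double"}
incoming = {"ClaimID": "string", "Amount": "double", "ClaimStatus": "string"}
merged = merge_schema(table_schema, incoming)
print(list(merged))  # ['ClaimID', 'Amount', 'ClaimStatus']
```

In the real table, rows written before the new column existed read back as null for that column, so Silver transformations should not assume late-arriving fields are populated for historical data.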

In [0]:
# Define an incremental batch pipeline using Auto Loader
stream = (
    spark.readStream
         .format("cloudFiles")
         .option("cloudFiles.format", "parquet")
         .option("cloudFiles.includeExistingFiles", "true")
         .option("cloudFiles.schemaLocation", P.schema)
         .option("badRecordsPath", P.bad)
         .load(P.raw)

         # Drop records with a null ClaimID (the primary key)
         .filter(F.col("ClaimID").isNotNull())
         
         # Add ingest timestamp, source file, batch ID
         .transform(add_audit_cols)
         
         .writeStream
         .option("checkpointLocation", P.checkpoint)
         .option("mergeSchema", "true")
         .trigger(availableNow=True)
         .toTable(BRONZE_TABLE)
)
stream.awaitTermination()

**Step 3. Verify Bronze Claims table**

Check that the batch completed successfully by inspecting row counts, previewing sample records, and reviewing Delta load history.

In [0]:
df = spark.table(BRONZE_TABLE)

print(f"Bronze Claims row count: {df.count():,}")

display(df.limit(5))

show_history(P.bronze)