### Kardiaflow - Bronze Claims Auto Loader

**Source:** Raw Parquet files in ADLS

**Target:** `kardia_bronze.bronze_claims` (CDF enabled)

**Trigger:** Incremental batch via Auto Loader; append to Bronze Claims table

**Description:** The Bronze layer is our raw ingestion zone. Data lands here directly from source files via Auto Loader with minimal transformation. The original schema is preserved as much as possible, audit fields (`_ingest_ts`, `_batch_id`, `_source_file`) are added, and Change Data Feed is enabled for downstream use. The goal is durability and traceability. Bronze is an auditable record of what was received, not yet cleaned or standardized.
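
As a rough illustration of the downstream use that Change Data Feed enables, the sketch below builds the Delta reader options for a change feed read. The helper name `change_feed_options` is hypothetical and not part of `kflow`; the option keys (`readChangeFeed`, `startingVersion`, `endingVersion`) are standard Delta Lake reader options.

```python
from typing import Dict, Optional


def change_feed_options(starting_version: int = 0,
                        ending_version: Optional[int] = None) -> Dict[str, str]:
    """Build Delta reader options for a Change Data Feed read (hypothetical helper)."""
    opts = {"readChangeFeed": "true", "startingVersion": str(starting_version)}
    if ending_version is not None:
        opts["endingVersion"] = str(ending_version)
    return opts


print(change_feed_options(starting_version=1))

# Usage in a Databricks notebook, where `spark` is already defined:
# changes = (
#     spark.read.format("delta")
#          .options(**change_feed_options(starting_version=1))
#          .table("kardia_bronze.bronze_claims")
# )
# Change feed rows also carry _change_type, _commit_version, and _commit_timestamp.
```

A Silver job would typically remember the last version it processed and pass the next one as `starting_version`, so each run picks up only new changes.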

In [0]:
import pyspark.sql.functions as F

from kflow.config import BRONZE_DB, bronze_paths
from kflow.etl_utils import add_audit_cols
from kflow.notebook_utils import init, show_history

**Step 1. Initialize environment and load paths**

Start by calling `init()` to set up authentication, Spark configs, and database context.

Then load the Bronze table metadata for the Claims dataset with `bronze_paths("claims")`, which returns all storage
locations (raw, schema, checkpoint, bad records) and the target Delta table name.
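
The real `bronze_paths` lives in `kflow.config` and is not shown in this notebook. The sketch below is only a guess at its shape, assuming a single storage root with one subfolder per purpose. The attribute names match the ones used in this notebook (`table`, `raw`, `bronze`, `schema`, `checkpoint`, `bad`); the class name, function name, and folder layout are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BronzePaths:
    """Hypothetical shape of the object returned by bronze_paths()."""
    table: str       # fully qualified Delta table name
    raw: str         # landing folder that Auto Loader watches
    bronze: str      # storage location of the Bronze Delta table
    schema: str      # Auto Loader schema-tracking location
    checkpoint: str  # streaming checkpoint location
    bad: str         # destination for malformed records


def bronze_paths_sketch(dataset: str, root: str, db: str = "kardia_bronze") -> BronzePaths:
    """Derive every location for one dataset from a single root (assumed layout)."""
    root = root.rstrip("/")
    return BronzePaths(
        table=f"{db}.bronze_{dataset}",
        raw=f"{root}/raw/{dataset}",
        bronze=f"{root}/bronze/{dataset}",
        schema=f"{root}/_schemas/{dataset}",
        checkpoint=f"{root}/_checkpoints/{dataset}",
        bad=f"{root}/_quarantine/{dataset}",
    )


P_demo = bronze_paths_sketch("claims", "/tmp/kardia_demo/")
print(P_demo.table)  # kardia_bronze.bronze_claims
```

Keeping all locations in one immutable object means the schema, checkpoint, and bad-records paths cannot drift apart between notebooks that ingest the same dataset.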

In [0]:
init()

P = bronze_paths("claims")
BRONZE_TABLE = P.table

# Ensure audit fields like _ingest_ts use UTC
spark.sql("SET spark.sql.session.timeZone=UTC")

In [0]:
# Ensure the Bronze database and Claims table exist (Delta format, CDF enabled)
spark.sql(f"CREATE DATABASE IF NOT EXISTS {BRONZE_DB}")

spark.sql(
    f"""
    CREATE TABLE IF NOT EXISTS {BRONZE_TABLE}
    USING DELTA
    COMMENT 'Bronze Parquet ingest of claim records.'
    LOCATION '{P.bronze}'
    TBLPROPERTIES (delta.enableChangeDataFeed = true)
    """
)

**Step 2. Incremental ingestion with Auto Loader**

Now we use Auto Loader to ingest new Parquet claim files from ADLS, appending only fresh data and persisting schema history at `P.schema`. Audit columns (`_ingest_ts`, `_source_file`, `_batch_id`) are added for traceability. Malformed records are redirected to `P.bad` so ingestion can continue uninterrupted.

Rows missing a valid `ClaimID` (PK) are filtered out. `patient_id` and `provider_id` (FKs) may be null in Bronze. Late-arriving or missing dimension references are reconciled in Silver.
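
To make these row-level rules concrete, here is a plain-Python illustration on dictionaries. It is not Spark code and not the actual `add_audit_cols` implementation, which is not shown here; the function name `to_bronze_rows`, the sample values, and the file and batch names are made up for the example.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Iterable, List


def to_bronze_rows(records: Iterable[Dict[str, Any]],
                   source_file: str,
                   batch_id: str) -> List[Dict[str, Any]]:
    """Apply the Bronze row rules to plain dicts (illustration only, not Spark)."""
    ingest_ts = datetime.now(timezone.utc)  # one UTC timestamp shared by the batch
    kept = []
    for rec in records:
        if rec.get("ClaimID") is None:
            continue  # no primary key: the row is dropped
        kept.append({**rec,
                     "_ingest_ts": ingest_ts,
                     "_source_file": source_file,
                     "_batch_id": batch_id})
    return kept


sample = [
    {"ClaimID": "C1", "patient_id": "P1", "provider_id": "PR1"},
    {"ClaimID": "C2", "patient_id": None, "provider_id": None},   # kept: null FKs are allowed
    {"ClaimID": None, "patient_id": "P3", "provider_id": "PR3"},  # dropped: no PK
]
rows = to_bronze_rows(sample, "claims_demo.parquet", "demo-batch")
print([r["ClaimID"] for r in rows])  # ['C1', 'C2']
```

The second record survives even though both foreign keys are null, which is the behavior described above: referential checks are deferred to Silver.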

Parquet embeds its own schema, so no explicit definition is required in Bronze. With `mergeSchema = true`, the table can evolve as new fields appear while remaining consistent with the stored schema.
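
The additive evolution described here can be pictured with a deliberately simplified model: existing columns keep their position and type, new columns are appended, and a type mismatch is rejected. This is a sketch of the idea only, not Delta Lake's actual merge logic (which has more cases), and `claim_status` is a hypothetical new field.

```python
from typing import Dict


def merge_schema(existing: Dict[str, str], incoming: Dict[str, str]) -> Dict[str, str]:
    """Simplified model of additive schema evolution (not Delta's actual code)."""
    merged = dict(existing)  # existing columns keep their order and type
    for name, dtype in incoming.items():
        if name not in merged:
            merged[name] = dtype  # a new column is appended at the end
        elif merged[name] != dtype:
            raise ValueError(f"type conflict on {name}: {merged[name]} vs {dtype}")
    return merged


table_schema = {"ClaimID": "string", "patient_id": "string"}
new_file_schema = {"ClaimID": "string", "claim_status": "string"}  # hypothetical new field
merged = merge_schema(table_schema, new_file_schema)
print(merged)
# {'ClaimID': 'string', 'patient_id': 'string', 'claim_status': 'string'}
```

In this model, rows written before `claim_status` existed would simply read back as null for that column.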

In [0]:
# Define an incremental batch pipeline using Auto Loader
stream = (
    spark.readStream
         .format("cloudFiles")
         .option("cloudFiles.format", "parquet")
         .option("cloudFiles.includeExistingFiles", "true")
         .option("cloudFiles.schemaLocation", P.schema)
         .option("badRecordsPath", P.bad)
         .load(P.raw)

         # Drop any records without valid PK
         .filter(F.col("ClaimID").isNotNull())
         
         # Add ingest timestamp, source file, batch ID
         .transform(add_audit_cols)
         
         .writeStream
         .option("checkpointLocation", P.checkpoint)
         .option("mergeSchema", "true")
         .trigger(availableNow=True)
         .toTable(BRONZE_TABLE)
)
stream.awaitTermination()

**Step 3. Verify Bronze Claims table**

Check that the batch completed successfully by inspecting row counts, previewing sample records, and reviewing Delta load history.
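
Beyond the row count and preview, one extra query can confirm that the primary-key filter and audit columns behaved as expected. The helper below is a hypothetical addition (not part of `kflow`) that only builds the SQL text; the column names come from this notebook.

```python
def bronze_check_sql(table: str, pk: str = "ClaimID") -> str:
    """Build one query summarizing row count, PK nulls, and audit coverage."""
    return f"""
    SELECT
        COUNT(*)                                            AS row_count,
        SUM(CASE WHEN {pk} IS NULL THEN 1 ELSE 0 END)       AS null_pk_rows,
        SUM(CASE WHEN _ingest_ts IS NULL THEN 1 ELSE 0 END) AS missing_ingest_ts,
        COUNT(DISTINCT _batch_id)                           AS batch_count,
        COUNT(DISTINCT _source_file)                        AS source_file_count
    FROM {table}
    """.strip()


query = bronze_check_sql("kardia_bronze.bronze_claims")
print(query)

# In the notebook, where `spark` and BRONZE_TABLE are defined:
# display(spark.sql(bronze_check_sql(BRONZE_TABLE)))
```

On a populated table, `null_pk_rows` and `missing_ingest_ts` should both be zero; anything else suggests the filter or the audit step did not run as intended.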

In [0]:
df = spark.table(BRONZE_TABLE)

print(f"Bronze Claims row count: {df.count():,}")

display(df.limit(5))

show_history(P.bronze)