### Kardiaflow - Bronze Patients Auto Loader

**Source:** Raw CSV files in ADLS

**Target:** `kardia_bronze.bronze_patients` (CDF enabled)

**Trigger:** Incremental batch via Auto Loader; append to Bronze Patients table

**Description:** The Bronze layer is our raw ingestion zone. Data lands here directly from source files via Auto Loader with minimal transformation. The original schema is preserved as much as possible, audit fields (`_ingest_ts`, `_batch_id`, `_source_file`) are added, and Change Data Feed is enabled for downstream use. The goal is durability and traceability: Bronze is an auditable record of what was received, not yet cleaned or standardized.

In [0]:
from pyspark.sql.types import StructType, StructField, StringType

import pyspark.sql.functions as F

from kflow.config import BRONZE_DB, bronze_paths
from kflow.etl_utils import add_audit_cols
from kflow.notebook_utils import init, show_history

**Step 1. Initialize environment and load paths**

Start by calling `init()` to set up authentication, Spark configs, and database context.

Then load the Bronze table metadata for the Patients dataset with `bronze_paths("patients")`, which returns all storage
locations (raw, schema, checkpoint, bad records) and the target Delta table name.

*Note: The Spark session timezone is set to UTC so that audit fields such as `_ingest_ts` are consistent and comparable across regions.*
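To make the shape of `P` concrete, here is a minimal sketch of what such a paths object could look like. It is not the actual `kflow.config` implementation: the class name, the function name, the storage root, and the folder layout are all assumptions made for illustration. Only the attribute names `raw`, `schema`, `checkpoint`, `bronze`, and `table` come from how `P` is used in this notebook, and `bad` is a guess at the name of the bad-records location.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BronzePathsSketch:
    """Hypothetical stand-in for the object returned by bronze_paths()."""
    raw: str         # landing folder for source files
    schema: str      # Auto Loader schema log location
    checkpoint: str  # streaming checkpoint location
    bad: str         # bad-records location (attribute name assumed)
    bronze: str      # storage path of the Delta table
    table: str       # fully qualified table name


def bronze_paths_sketch(
    dataset: str,
    root: str = "abfss://lake@example.dfs.core.windows.net",  # placeholder root
) -> BronzePathsSketch:
    # The folder layout below is invented for illustration only.
    return BronzePathsSketch(
        raw=f"{root}/raw/{dataset}",
        schema=f"{root}/_schemas/{dataset}",
        checkpoint=f"{root}/_checkpoints/bronze_{dataset}",
        bad=f"{root}/_quarantine/bad_{dataset}",
        bronze=f"{root}/bronze/{dataset}",
        table=f"kardia_bronze.bronze_{dataset}",
    )


P_demo = bronze_paths_sketch("patients")
print(P_demo.table)  # → kardia_bronze.bronze_patients
```

Keeping every location in one immutable object means the rest of the notebook never builds path strings by hand, which is presumably why the real helper exists.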

In [0]:
init()

P = bronze_paths("patients")
BRONZE_TABLE = P.table

# Ensure audit fields like _ingest_ts use UTC
spark.sql("SET spark.sql.session.timeZone=UTC")

**Step 2. Define schema for CSV input**

Patient data arrives as CSV files without schema metadata, so we declare the schema explicitly (`ID`, `BIRTHDATE`, `DEATHDATE`, … `ADDRESS`). This provides consistent typing across ingestions.  

Audit columns (`_ingest_ts`, `_batch_id`, `_source_file`) are added later for traceability and data quality.

In [0]:
# Define schema explicitly for CSV input
patients_schema = StructType([
    StructField("ID",         StringType(),  False),
    StructField("BIRTHDATE",  StringType(),  True),
    StructField("DEATHDATE",  StringType(),  True),
    StructField("SSN",        StringType(),  True),
    StructField("DRIVERS",    StringType(),  True),
    StructField("PASSPORT",   StringType(),  True),
    StructField("PREFIX",     StringType(),  True),
    StructField("FIRST",      StringType(),  True),
    StructField("LAST",       StringType(),  True),
    StructField("SUFFIX",     StringType(),  True),
    StructField("MAIDEN",     StringType(),  True),
    StructField("MARITAL",    StringType(),  True),
    StructField("RACE",       StringType(),  True),
    StructField("ETHNICITY",  StringType(),  True),
    StructField("GENDER",     StringType(),  True),
    StructField("BIRTHPLACE", StringType(),  True),
    StructField("ADDRESS",    StringType(),  True)
])

In [0]:
# Ensure Bronze DB and Patients table exist
spark.sql(f"CREATE DATABASE IF NOT EXISTS {BRONZE_DB}")

spark.sql(
    f"""
    CREATE TABLE IF NOT EXISTS {BRONZE_TABLE}
    USING DELTA
    COMMENT 'Bronze CSV ingest of Patient records.'
    LOCATION '{P.bronze}'
    TBLPROPERTIES (delta.enableChangeDataFeed = true)
    """
)

**Step 3. Incremental ingestion with Auto Loader**

Now we use Auto Loader to monitor the Patients raw folder in ADLS and ingest new CSV files as they arrive. Each run processes only files it has not seen before: the checkpoint (`P.checkpoint`) tracks which files were already ingested so they are not reprocessed, and the schema log is kept at `P.schema`.

CSV has no embedded schema, so we apply the `patients_schema` defined earlier to keep columns and types consistent. With `mergeSchema = true` on the write, the Bronze table can accept new columns if the incoming DataFrame gains them, for example after `patients_schema` is extended to cover additional fields.

Records with a null `ID` (the primary key) are dropped, and audit fields (`_ingest_ts`, `_source_file`, `_batch_id`) are added for traceability.

In [0]:
# Define an incremental batch pipeline using Auto Loader
stream = (
  spark.readStream.format("cloudFiles")
       .option("cloudFiles.format", "csv")
       .option("cloudFiles.schemaLocation", P.schema)
       .option("cloudFiles.includeExistingFiles", "true")
       .option("header", "true")
       .option("ignoreEmptyLines","true")
       .schema(patients_schema)
       .load(P.raw)

       # Drop any records without a valid primary key
       .filter(F.col("ID").isNotNull())

       # Add ingest timestamp, source file, batch ID
       .transform(add_audit_cols)

       .writeStream
       .option("checkpointLocation", P.checkpoint)
       .option("mergeSchema", "true")
       .trigger(availableNow=True)
       .toTable(BRONZE_TABLE)
)
stream.awaitTermination()

**Step 4. Verify Bronze Patients table**

Confirm that ingestion completed successfully by printing the row count, previewing a sample of records, and reviewing the Delta transaction history for the batch.

In [0]:
df = spark.table(BRONZE_TABLE)

print(f"Bronze Patients row count: {df.count():,}")

display(df.limit(5))

show_history(P.bronze)
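
Beyond the row count and preview, one lightweight extra check is to confirm that the audit columns actually reached the table. The helper below is a sketch, not part of `kflow`; it assumes the audit column names used in this notebook and works on any list of column names, such as `df.columns`.

```python
# Audit column names as used in this notebook's prose.
AUDIT_COLS = ("_ingest_ts", "_batch_id", "_source_file")


def missing_audit_cols(columns):
    """Return the audit columns absent from `columns`, in AUDIT_COLS order."""
    present = set(columns)
    return [c for c in AUDIT_COLS if c not in present]


# Hypothetical usage with plain column lists; in the notebook this would be df.columns.
print(missing_audit_cols(["ID", "_ingest_ts", "_batch_id", "_source_file"]))  # → []
print(missing_audit_cols(["ID", "_ingest_ts"]))  # → ['_batch_id', '_source_file']
```

An empty list means all three audit fields are present. A non-empty list suggests the transform step was skipped or the helper's column names changed.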