### Kardiaflow - Bronze Encounters Auto Loader

**Source:** Raw Avro files in ADLS

**Target:** `kardia_bronze.bronze_encounters` (CDF enabled)

**Trigger:** (configurable via job param `mode`)
  - `batch` → one-time load of all files
  - `stream` → continuous 30s micro-batches

**Description:** The Bronze layer is our raw ingestion zone. Data lands here directly from source files via Auto Loader with minimal transformation. The original schema is preserved as much as possible, audit fields (`_ingest_ts`, `_batch_id`, `_source_file`) are added, and Change Data Feed is enabled for downstream use. The goal is durability and traceability. Bronze is an auditable record of what was received, not yet cleaned or standardized.

In [0]:
import pyspark.sql.functions as F

from kflow.config import BRONZE_DB, bronze_paths
from kflow.etl_utils import add_audit_cols
from kflow.notebook_utils import init

**Step 1. Initialize environment and load paths**

Start by calling `init()` to set up authentication, Spark configs, and database context.

The Encounters Bronze metadata is loaded with `bronze_paths("encounters")`, which provides storage locations (raw, schema, checkpoint, bad records) and the target Delta table.

*Note: The Spark session timezone is set to UTC so that audit fields such as `_ingest_ts` are consistent and comparable across regions.*

In [0]:
init()

P = bronze_paths("encounters")
BRONZE_TABLE = P.table

# Ensure audit fields like _ingest_ts use UTC
spark.sql("SET spark.sql.session.timeZone=UTC")
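
For orientation, the bundle returned by `bronze_paths` can be pictured as a small immutable record. The sketch below is hypothetical: the attribute names match those used in this notebook (`P.table`, `P.raw`, `P.bronze`, `P.schema`, `P.checkpoint`, `P.bad`), but the `BronzePaths` class, the `example_bronze_paths` helper, and the `/mnt/kardia` directory layout are assumptions made for illustration, not the actual `kflow.config` implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BronzePaths:
    """Hypothetical stand-in for the object returned by bronze_paths()."""
    table: str       # fully qualified target Delta table
    raw: str         # landing directory for raw source files
    bronze: str      # storage location of the Bronze Delta table
    schema: str      # Auto Loader schema-tracking directory
    checkpoint: str  # root directory for streaming checkpoints
    bad: str         # directory for records that fail to parse


def example_bronze_paths(dataset: str, root: str = "/mnt/kardia") -> BronzePaths:
    """Build an illustrative path bundle; `root` and the layout are placeholders."""
    return BronzePaths(
        table=f"kardia_bronze.bronze_{dataset}",
        raw=f"{root}/raw/{dataset}",
        bronze=f"{root}/bronze/{dataset}",
        schema=f"{root}/_schemas/{dataset}",
        checkpoint=f"{root}/_checkpoints/bronze_{dataset}",
        bad=f"{root}/_quarantine/bad_{dataset}",
    )
```

Keeping every location in one frozen object means the rest of the notebook only ever refers to `P.<name>`, so a storage move touches a single module.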

**Step 2. Retrieve runtime mode**

The job accepts a parameter `mode` that controls ingestion:  
- **batch**  - process all available files once and stop.  
- **stream** - run continuous 30-second micro-batches.  

Each mode writes to a separate checkpoint directory so their states remain isolated.

In [0]:
# Retrieve runtime mode from job widget ("batch" = default, or "stream")
try:
    dbutils.widgets.dropdown("mode", "batch", ["batch", "stream"])
except Exception:
    # dbutils (and its widgets) may be unavailable outside a Databricks notebook or job
    pass

MODE = dbutils.widgets.get("mode") if "dbutils" in globals() else "batch"
IS_BATCH = (MODE == "batch")

# Keep batch and stream checkpoints isolated
CHECKPOINT = f"{P.checkpoint}/{MODE}"
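
The mode handling above can be sketched as two small pure functions, which makes it easy to unit test outside Databricks. The names `resolve_mode` and `checkpoint_for` are hypothetical helpers and not part of `kflow`. The stricter validation (rejecting unknown values instead of passing them through) is an assumption about desired behavior.

```python
VALID_MODES = ("batch", "stream")


def resolve_mode(raw_value, default="batch"):
    """Normalize a job parameter to one of VALID_MODES, falling back to `default`."""
    value = (raw_value or default).strip().lower()
    if value not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {raw_value!r}")
    return value


def checkpoint_for(base, mode):
    """Return a mode-specific checkpoint directory under `base`."""
    return f"{base.rstrip('/')}/{mode}"
```

Because each mode resolves to its own directory, a batch backfill cannot advance or corrupt the offsets tracked by the long-running stream, and vice versa.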

**Step 3. Create Bronze Encounters table**

Now we create the Bronze Encounters table if it doesn’t exist and enable Change Data Feed for downstream incremental processing.

Avro is self-describing, so no explicit schema is defined in Bronze. Types are enforced later in Silver.

In [0]:
spark.sql(f"CREATE DATABASE IF NOT EXISTS {BRONZE_DB}")

spark.sql(
    f"""
    CREATE TABLE IF NOT EXISTS {BRONZE_TABLE}
    USING DELTA
    COMMENT 'Bronze Avro ingest of Encounter records.'
    LOCATION '{P.bronze}'
    TBLPROPERTIES (delta.enableChangeDataFeed = true)
    """
)
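
To show why Change Data Feed is enabled here, the following is a minimal sketch of how a downstream job might read row-level changes from this table. It assumes a Delta-enabled Spark session is passed in as `spark`. The helper name `read_bronze_changes` and the default starting version are illustrative choices, not part of this project.

```python
def read_bronze_changes(spark, table_name, starting_version=0):
    """Return a DataFrame of row-level changes from a CDF-enabled Delta table.

    `spark` is an existing SparkSession with Delta Lake available.
    `starting_version` is the first table version to include.
    """
    return (
        spark.read
             .option("readChangeFeed", "true")
             .option("startingVersion", starting_version)
             .table(table_name)
    )
```

The returned rows carry the table's columns plus `_change_type`, `_commit_version`, and `_commit_timestamp`, which a Silver job can use to process only what changed since its last run.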

**Step 4. Define Auto Loader pipeline**

Now we use Auto Loader to ingest Avro files from the raw path and track schema evolution at `P.schema`.

Audit fields (`_ingest_ts`, `_source_file`, `_batch_id`) are added for lineage and traceability. The stream is configured with `mergeSchema = true` to accommodate new fields introduced in future files.

Rows missing `ID` (PK) are filtered out. `PATIENT` (FK) may be null in Bronze. Late-arriving or missing dimension references are reconciled in Silver.

In [0]:
# Define a streaming pipeline using Auto Loader
reader = (
    spark.readStream.format("cloudFiles")
         .option("cloudFiles.format", "avro")
         .option("cloudFiles.schemaLocation", P.schema)
         .option("cloudFiles.includeExistingFiles", "true")
         .option("badRecordsPath", P.bad)
         .load(P.raw)

         # Drop any records missing PK
         .filter(
             F.col("ID").isNotNull()
         )

         # Add ingest timestamp, source file, batch ID
         .transform(add_audit_cols)
)

writer = (
    reader.writeStream
          .option("checkpointLocation", CHECKPOINT)
          .option("mergeSchema", "true")
)
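
The row-level rules in this step can be illustrated without Spark. The sketch below is a plain-Python analogy, not the implementation of `add_audit_cols`: the function name `to_bronze_rows` is hypothetical, and since this notebook does not show how `_batch_id` is derived, the sketch assumes a caller-supplied value or a random UUID.

```python
import uuid
from datetime import datetime, timezone


def to_bronze_rows(records, source_file, batch_id=None):
    """Apply the Bronze rules to a list of dicts.

    - rows with a null ID (primary key) are dropped
    - a null PATIENT (foreign key) is allowed through
    - audit fields are attached to every surviving row
    """
    batch_id = batch_id or str(uuid.uuid4())  # assumption: one ID per load
    ingest_ts = datetime.now(timezone.utc)    # UTC, matching the session timezone
    bronze = []
    for rec in records:
        if rec.get("ID") is None:
            continue  # no primary key, so the row is not written
        bronze.append({
            **rec,
            "_ingest_ts": ingest_ts,
            "_source_file": source_file,
            "_batch_id": batch_id,
        })
    return bronze
```

For example, passing one record with a null `PATIENT` and one with a null `ID` yields a single output row: the first record, with the three audit fields attached.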

**Step 5. Run batch or stream**

Depending on the job parameter, the pipeline executes in one of two modes:  
- **Batch mode**: triggers with `availableNow`, ingests all current files, then exits.  
- **Stream mode**: runs continuously, processing new data in 30-second intervals.  

This flexibility allows the same notebook to support both backfills and live ingestion.

In [0]:
if IS_BATCH:
    # Batch mode: process all available files once and exit
    query = writer.trigger(availableNow=True).toTable(BRONZE_TABLE)
    query.awaitTermination()
    print(f"[batch] Wrote to {BRONZE_TABLE} with checkpoint={CHECKPOINT} …")
else:
    # Streaming mode: run continuously every 30s
    query = writer.trigger(processingTime="30 seconds").toTable(BRONZE_TABLE)
    print(f"[live] Continuous 30s micro-batches to {BRONZE_TABLE} with checkpoint={CHECKPOINT}")
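
The two branches above differ only in their trigger, so the choice can be factored into a small lookup. This is a sketch under that assumption, and `trigger_kwargs` is a hypothetical helper name. It returns keyword arguments for `DataStreamWriter.trigger`, which accepts `availableNow` and `processingTime` as keywords.

```python
def trigger_kwargs(mode, interval="30 seconds"):
    """Map a run mode to keyword arguments for DataStreamWriter.trigger()."""
    if mode == "batch":
        # Process everything currently available, then stop
        return {"availableNow": True}
    if mode == "stream":
        # Keep running, starting a micro-batch every `interval`
        return {"processingTime": interval}
    raise ValueError(f"unknown mode: {mode!r}")
```

With this helper, both branches could start the query as `writer.trigger(**trigger_kwargs(MODE)).toTable(BRONZE_TABLE)`, leaving only the `awaitTermination()` call specific to batch mode.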