### Kardiaflow - Bronze Providers Auto Loader

**Source:** Raw TSV files in ADLS

**Target:** `kardia_bronze.bronze_providers` (CDF enabled)

**Trigger:** Incremental batch via Auto Loader; append to Bronze Providers table

**Description:** The Bronze layer is our raw ingestion zone. Data lands here directly from source files via Auto Loader with minimal transformation. The original schema is preserved as far as possible, audit fields (`_ingest_ts`, `_batch_id`, `_source_file`) are added, and Change Data Feed is enabled for downstream use. The goal is durability and traceability: Bronze is an auditable record of what was received, not yet cleaned or standardized.

In [0]:
import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, StringType

from kflow.config import BRONZE_DB, bronze_paths
from kflow.etl_utils import add_audit_cols
from kflow.notebook_utils import init, show_history

**Step 1. Initialize environment and load paths**

Start by calling `init()` to set up authentication, Spark configs, and database context.

Then load the Bronze table metadata for the Providers dataset with `bronze_paths("providers")`, which returns all
storage locations (raw, schema, checkpoint, bad records) and the target Delta table name.

*Note: The Spark session timezone is set to UTC so that audit fields such as `_ingest_ts` are consistent and comparable across regions.*

In [0]:
init()

P = bronze_paths("providers")
BRONZE_TABLE = P.table

# Ensure audit fields like _ingest_ts use UTC
spark.sql("SET spark.sql.session.timeZone=UTC")
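
To illustrate why a single session timezone matters, the plain-Python sketch below (no Spark involved) reads the same instant in two regional timezones. The date and the two fixed offsets are made up for the example; they only stand in for sessions configured differently.

```python
from datetime import datetime, timedelta, timezone

# One instant in time, e.g. the moment a batch was ingested.
instant = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)

# Fixed offsets stand in for two regional session timezones (illustration only).
region_a = timezone(timedelta(hours=-5))
region_b = timezone(timedelta(hours=1))

# Wall-clock readings of the same instant differ by region.
wall_a = instant.astimezone(region_a).replace(tzinfo=None)
wall_b = instant.astimezone(region_b).replace(tzinfo=None)

# Interpreted in their own zone and converted to UTC, both readings agree again.
utc_a = wall_a.replace(tzinfo=region_a).astimezone(timezone.utc)
utc_b = wall_b.replace(tzinfo=region_b).astimezone(timezone.utc)

print("wall clocks equal:", wall_a == wall_b)
print("UTC values equal: ", utc_a == utc_b)
```

Storing the UTC value means `_ingest_ts` can be compared across runs without knowing where each run executed.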

**Step 2. Define schema for TSV input**

Provider data arrives as TSV files without schema metadata, so we declare the schema explicitly (`ProviderID`, `ProviderSpecialty`, `ProviderLocation`). All three columns are read as strings, which keeps typing consistent from file to file.

The target Delta table is then created with the same three columns plus the audit columns (`_ingest_ts`, `_batch_id`, `_source_file`) and `_rescued_data`, which support traceability and data quality checks.

In [0]:
provider_schema = StructType([
    StructField("ProviderID",        StringType(), True),
    StructField("ProviderSpecialty", StringType(), True),
    StructField("ProviderLocation",  StringType(), True),
])

In [0]:
spark.sql(f"CREATE DATABASE IF NOT EXISTS {BRONZE_DB}")

spark.sql(
    f"""
    CREATE TABLE IF NOT EXISTS {BRONZE_TABLE} (
        ProviderID        STRING,
        ProviderSpecialty STRING,
        ProviderLocation  STRING,
        _ingest_ts        TIMESTAMP,
        _batch_id         STRING,
        _source_file      STRING,
        _rescued_data     STRING
    )
    USING DELTA
    COMMENT 'Bronze TSV ingest of Provider records.'
    LOCATION '{P.bronze}'
    TBLPROPERTIES (delta.enableChangeDataFeed = true)
    """
)
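
Because the table is created with `delta.enableChangeDataFeed = true`, a downstream job can read only the changes instead of rescanning the table. The following is a minimal sketch of such a read, not part of this notebook: `cdf_read_options` and `read_provider_changes` are hypothetical helpers (they are not in `kflow`), and the starting version is an assumption that a real consumer would track itself.

```python
def cdf_read_options(starting_version: int) -> dict:
    """Reader options for a batch read of the Delta Change Data Feed."""
    return {
        "readChangeFeed": "true",
        "startingVersion": str(starting_version),
    }


def read_provider_changes(spark, table_name: str, starting_version: int = 0):
    """Return rows changed since `starting_version`.

    The result carries Delta's change metadata columns
    (_change_type, _commit_version, _commit_timestamp).
    """
    return (
        spark.read.format("delta")
             .options(**cdf_read_options(starting_version))
             .table(table_name)
    )
```

For example, `read_provider_changes(spark, BRONZE_TABLE)` would return every change recorded since the table was created.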

**Step 3. Incremental ingestion with Auto Loader**

Now we configure Auto Loader to discover and ingest new TSV files from ADLS. Each run appends only new files.

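TSV input is read with the CSV reader (tab-delimited, with headers). Because TSV lacks an embedded schema, we define it explicitly and point Auto Loader's schema location at `P.schema`. The writer sets `mergeSchema = true`, so the Delta table can accept new columns if the declared schema is extended later. Until then, columns that are not in the declared schema are captured in `_rescued_data` rather than added to the table.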

Malformed records are captured in `P.bad` and `_rescued_data`, and audit columns (`_ingest_ts`, `_source_file`,
`_batch_id`) are added for traceability. This keeps the Bronze layer stable while leaving room for controlled schema
growth. Rows missing `ProviderID` are filtered out.

In [0]:
# Define an incremental batch pipeline using Auto Loader

# Collect all Auto Loader options
auto_loader_opts = {
    "cloudFiles.format": "csv",
    "cloudFiles.includeExistingFiles": "true",
    "cloudFiles.schemaLocation": P.schema,
    "delimiter": "\t",
    "header": "true",
    "ignoreEmptyLines": "true",
    "badRecordsPath": P.bad,
    "rescuedDataColumn": "_rescued_data"
}

stream = (
    spark.readStream
         .format("cloudFiles")
         .options(**auto_loader_opts)
         .schema(provider_schema)
         .load(P.raw)

         # Drop any records without a valid primary key
         .filter(F.col("ProviderID").isNotNull())
         
         # Add ingest timestamp, source file, batch ID
         .transform(add_audit_cols)

         .writeStream
         .option("checkpointLocation", P.checkpoint)
         .option("mergeSchema", "true")
         .trigger(availableNow=True)
         .toTable(BRONZE_TABLE)
)
stream.awaitTermination()
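
The per-row effect of this step can be modelled in plain Python. This is a simplified illustration under stated assumptions, not how Auto Loader or `add_audit_cols` is implemented: the sample rows, the extra `ProviderNPI` column, the batch ID, and the function name `model_bronze_rows` are all made up, and empty fields are treated as null.

```python
import csv
import io
import json
from datetime import datetime, timezone

EXPECTED_COLS = ["ProviderID", "ProviderSpecialty", "ProviderLocation"]


def model_bronze_rows(tsv_text, source_file, batch_id, ingest_ts):
    """Plain-Python model of the Bronze step for one file (not Auto Loader itself)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = []
    for raw in reader:
        # Declared columns; empty or missing fields become None (assumption).
        row = {col: (raw.get(col) or None) for col in EXPECTED_COLS}

        # Header columns outside the declared schema are kept as a JSON string.
        extras = {k: v for k, v in raw.items()
                  if k is not None and k not in EXPECTED_COLS and v}
        row["_rescued_data"] = json.dumps(extras) if extras else None

        # Drop rows without a primary key, as the filter on ProviderID does.
        if row["ProviderID"] is None:
            continue

        # Audit columns for traceability.
        row["_ingest_ts"] = ingest_ts
        row["_source_file"] = source_file
        row["_batch_id"] = batch_id
        rows.append(row)
    return rows


# Hypothetical file: one unexpected column and one row without a ProviderID.
SAMPLE = (
    "ProviderID\tProviderSpecialty\tProviderLocation\tProviderNPI\n"
    "P1\tCardiology\tBoston\t123\n"
    "\tOncology\tDenver\t456\n"
    "P3\tNeurology\tAustin\t\n"
)

rows = model_bronze_rows(
    SAMPLE,
    source_file="providers_sample.tsv",
    batch_id="batch-001",
    ingest_ts=datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc),
)
for r in rows:
    print(r["ProviderID"], r["_rescued_data"])
```

In this model the row without a `ProviderID` is dropped, and the value of the undeclared column is preserved in `_rescued_data` instead of being lost.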

**Step 4. Verify Bronze Providers table**

Confirm that ingestion succeeded by checking row counts, previewing sample records, and reviewing the Delta load history for this batch.

In [0]:
df = spark.table(BRONZE_TABLE)

print(f"Bronze Providers row count: {df.count():,}")

display(df.limit(5))

show_history(P.bronze)
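
The visual checks above could be backed by assertions. The sketch below is one possible shape for that, assuming the column names used in this notebook; `missing_audit_cols` and `check_bronze` are hypothetical helpers, not part of `kflow`.

```python
AUDIT_COLS = ["_ingest_ts", "_batch_id", "_source_file", "_rescued_data"]


def missing_audit_cols(columns):
    """Return the expected audit columns that are absent from `columns`."""
    present = set(columns)
    return [c for c in AUDIT_COLS if c not in present]


def check_bronze(df):
    """Raise if audit columns are missing or any row lacks a ProviderID."""
    missing = missing_audit_cols(df.columns)
    if missing:
        raise ValueError(f"Missing audit columns: {missing}")

    # DataFrame.filter accepts a SQL expression string.
    null_keys = df.filter("ProviderID IS NULL").count()
    if null_keys:
        raise ValueError(f"{null_keys} rows have a null ProviderID")
```

Calling `check_bronze(df)` after the preview would turn a silent data problem into a failed run.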