### Kardiaflow - Bronze Feedback COPY INTO

**Source:** Raw JSON-lines files in ADLS

**Target:** `kardia_bronze.bronze_feedback` (CDF enabled)

**Trigger:** Incremental batch via COPY INTO; append to Bronze Feedback table

In [0]:
from pyspark.sql.types import (StructType, StructField, StringType, IntegerType,
                               ArrayType, MapType)

from kflow.config import BRONZE_DB, bronze_paths, current_batch_id
from kflow.notebook_utils import init, show_history

**Step 1. Initialize environment and load paths**

Start by calling `init()` to set up authentication, Spark configs, and database context.

Then load the Bronze table metadata for the Feedback dataset with `bronze_paths("feedback")`, which returns all storage locations (raw, schema, checkpoint, bad records) and the target Delta table name.

*Note: Each run generates a unique `BATCH_ID` to tag otherwise stateless COPY INTO loads, letting us trace which files were ingested together and track lineage. The Spark session timezone is set to UTC so audit fields like `_ingest_ts` are consistent and comparable across regions.*
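
To make the batch-tagging idea concrete, here is a minimal sketch of how a per-run identifier could be built. The helper `make_batch_id` and its format (UTC timestamp plus a short random suffix) are assumptions for illustration only; the actual `current_batch_id()` in `kflow.config` may work differently.

```python
import uuid
from datetime import datetime, timezone


def make_batch_id(now=None):
    """Hypothetical stand-in for current_batch_id(): UTC timestamp plus a random suffix."""
    now = now or datetime.now(timezone.utc)
    # Convert to UTC so IDs sort consistently regardless of where the job runs
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    # Random suffix keeps two runs started in the same second distinct
    return f"{stamp}-{uuid.uuid4().hex[:8]}"


batch_id = make_batch_id(datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc))
print(batch_id)  # e.g. 20240102T030405Z-<8 hex chars>
```

Embedding the UTC timestamp keeps IDs roughly sortable by run time, which can make it easier to line up `_batch_id` values with `_ingest_ts` when tracing a load.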

In [0]:
init()

P            = bronze_paths("feedback")
BRONZE_TABLE = P.table
BATCH_ID     = current_batch_id()

# Ensure audit fields like _ingest_ts use UTC
spark.sql("SET spark.sql.session.timeZone=UTC")

**Step 2. Create Bronze Feedback table**

COPY INTO requires the target Delta table to exist ahead of ingestion. We define the Bronze Feedback table with an explicit schema, including both the raw fields from the JSONL source (`feedback_id`, `provider_id`, `timestamp`, etc.) and audit columns (`_ingest_ts`, `_source_file`, `_batch_id`).  

Change Data Feed is enabled to allow downstream incremental processing. The schema also normalizes nested structures: for example, `tags` is stored as an array of strings and `metadata` is serialized into `metadata_json`.
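
The normalization described above can be sketched in plain Python, outside Spark. The function `normalize_feedback` and the sample values are hypothetical; the sketch only mirrors the column layout of the table and roughly approximates what `CAST(tags AS ARRAY<STRING>)` and `to_json(metadata)` do.

```python
import json


def normalize_feedback(line):
    """Map one raw JSONL line onto the Bronze column layout (audit columns omitted)."""
    raw = json.loads(line)
    tags = raw.get("tags")
    metadata = raw.get("metadata")
    return {
        "feedback_id": raw.get("feedback_id"),
        "provider_id": raw.get("provider_id"),
        "timestamp": raw.get("timestamp"),
        "visit_id": raw.get("visit_id"),
        "satisfaction_score": raw.get("satisfaction_score"),
        "comments": raw.get("comments"),
        "source": raw.get("source"),
        # tags: every element is coerced to a string (None stays None)
        "tags": None if tags is None else [str(t) for t in tags],
        # metadata: nested object flattened to a compact JSON string
        "metadata_json": None if metadata is None
        else json.dumps(metadata, separators=(",", ":")),
    }


# Hypothetical sample record, for illustration only
sample = (
    '{"feedback_id": "f-1", "provider_id": "p-9", "visit_id": "v-3", '
    '"satisfaction_score": 4, "tags": ["wait_time", 7], '
    '"metadata": {"channel": "web"}}'
)
row = normalize_feedback(sample)
print(row["tags"], row["metadata_json"])
```

Storing `metadata` as a string means new keys inside it never force a table schema change; downstream layers can parse `metadata_json` when they need specific keys.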

In [0]:
spark.sql(f"CREATE DATABASE IF NOT EXISTS {BRONZE_DB}")

spark.sql(
    f"""
    CREATE TABLE IF NOT EXISTS {BRONZE_TABLE} (
      feedback_id        STRING NOT NULL,
      provider_id        STRING NOT NULL,
      timestamp          STRING,
      visit_id           STRING NOT NULL,
      satisfaction_score INT,
      comments           STRING,
      source             STRING,
      tags               ARRAY<STRING>,
      metadata_json      STRING,
      _ingest_ts         TIMESTAMP,
      _source_file       STRING,
      _batch_id          STRING
    )
    USING DELTA
    COMMENT 'Bronze JSONL ingest of Feedback records.'
    LOCATION '{P.bronze}'
    TBLPROPERTIES (delta.enableChangeDataFeed = true)
    """
)

**Step 3. Ingest with COPY INTO**

Now we will batch-load new Feedback JSONL files from ADLS using COPY INTO. Each run scans the raw path, skips files already recorded in the table's COPY INTO load history, and appends only new ones.
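
As a conceptual model of that skip-if-already-loaded behavior, the sketch below tracks loaded files in an in-memory set. This is only an illustration: the real load history is maintained by COPY INTO alongside the Delta table, and the helper `copy_new_files` and the file names are hypothetical.

```python
def copy_new_files(source_files, load_history):
    """Return files not yet loaded and record them, imitating COPY INTO's idempotent loads."""
    new_files = [f for f in sorted(source_files) if f not in load_history]
    # Remember what was loaded so a rerun over the same path becomes a no-op
    load_history.update(new_files)
    return new_files


history = set()
first = copy_new_files({"feedback_001.jsonl", "feedback_002.jsonl"}, history)
second = copy_new_files(
    {"feedback_001.jsonl", "feedback_002.jsonl", "feedback_003.jsonl"}, history
)
print(first)   # ['feedback_001.jsonl', 'feedback_002.jsonl']
print(second)  # ['feedback_003.jsonl']
```

The second call scans all three files but returns only the one that was not seen before, which is why rerunning the notebook does not duplicate rows.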

Although the JSON source may evolve, we disable schema evolution (`mergeSchema = false`) in Bronze to keep a stable contract. Fields are explicitly CAST to a known schema to prevent silent type drift, and the nested `metadata` object is serialized into `metadata_json`, so new keys inside it are captured without altering the table. Audit columns (`_ingest_ts`, `_source_file`, `_batch_id`) make each otherwise stateless run traceable. Malformed records are redirected to `P.bad` so ingestion can continue uninterrupted. Rows with a NULL `feedback_id` (PK), `provider_id`, or `visit_id` (FKs) are filtered out.
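
The three outcomes for an incoming line (kept, routed to bad records, or dropped for a NULL key) can be sketched in plain Python. The function `triage_lines` and the sample lines are hypothetical, and the sketch only approximates the Spark behavior: uncastable scores simply become `None` here.

```python
import json

REQUIRED_KEYS = ("feedback_id", "provider_id", "visit_id")


def triage_lines(lines):
    """Split raw JSONL lines into kept rows, malformed lines, and rows dropped for NULL keys."""
    kept, bad, dropped = [], [], []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            bad.append(line)  # analogous to a record routed to badRecordsPath
            continue
        if not isinstance(record, dict):
            bad.append(line)  # valid JSON, but not an object
            continue
        if any(record.get(k) is None for k in REQUIRED_KEYS):
            dropped.append(record)  # analogous to failing the IS NOT NULL filter
            continue
        score = record.get("satisfaction_score")
        try:
            # Explicit cast so a string "4" and an integer 4 land as the same type
            record["satisfaction_score"] = None if score is None else int(score)
        except (TypeError, ValueError):
            record["satisfaction_score"] = None  # uncastable values become None in this sketch
        kept.append(record)
    return kept, bad, dropped


# Hypothetical sample lines: one valid, one with a NULL key, one truncated
lines = [
    '{"feedback_id": "f-1", "provider_id": "p-9", "visit_id": "v-3", "satisfaction_score": "4"}',
    '{"feedback_id": "f-2", "provider_id": null, "visit_id": "v-4"}',
    '{"feedback_id": "f-3", "provider_id": ',
]
kept, bad, dropped = triage_lines(lines)
print(len(kept), len(bad), len(dropped))  # 1 1 1
```

Note the difference between the two rejection paths: malformed lines are preserved for inspection, whereas rows failing the key filter are discarded without a record.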

In [0]:
# Run batch operation
# COPY INTO scans the entire source path each run
spark.sql(
    f"""
    COPY INTO {BRONZE_TABLE}
    FROM (
      SELECT
        CAST(feedback_id        AS STRING)            AS feedback_id,
        CAST(provider_id        AS STRING)            AS provider_id,
        CAST(timestamp          AS STRING)            AS timestamp,
        CAST(visit_id           AS STRING)            AS visit_id,
        CAST(satisfaction_score AS INT)               AS satisfaction_score,
        CAST(comments           AS STRING)            AS comments,
        CAST(source             AS STRING)            AS source,
        CAST(tags               AS ARRAY<STRING>)     AS tags,
        to_json(metadata)                             AS metadata_json,
        current_timestamp()                           AS _ingest_ts,
        input_file_name()                             AS _source_file,
        '{BATCH_ID}'                                  AS _batch_id
      FROM '{P.raw}'
      WHERE feedback_id IS NOT NULL
        AND provider_id IS NOT NULL
        AND visit_id    IS NOT NULL
    )
    FILEFORMAT = JSON
    FORMAT_OPTIONS ('multiLine' = 'false', 'badRecordsPath' = '{P.bad}')
    COPY_OPTIONS ('mergeSchema' = 'false')
    """
)

**Step 4. Verify Bronze Feedback table**

Confirm that ingestion succeeded by checking row counts, previewing sample records, and reviewing the Delta load history for this batch.

In [0]:
df = spark.table(BRONZE_TABLE)

print(f"Bronze Feedback row count: {df.count():,}")

display(df.limit(5))

show_history(P.bronze)