## 1. Environment Configuration via Widgets

#### Logic:

The script uses dbutils.widgets to declare four text parameters and read their values: catalog name, schema name, source path, and checkpoint base path.

#### Why this code:

Using widgets allows the notebook to be generic and portable. The same code can run in Development, Staging, or Production environments simply by passing different values from the DAB variables.yml file.

In [0]:
from pyspark.sql.functions import current_timestamp, lit

# 1. Capture Environment Variables via Widgets
# Logic: Text widgets act as input parameters for the notebook
dbutils.widgets.text("catalog_name", "")
dbutils.widgets.text("schema_name", "")
dbutils.widgets.text("source_path", "")
dbutils.widgets.text("checkpoint_base", "")  # Passed from DAB variables.yml

# 2. Assign widget values to Python variables for downstream use
catalog_name = dbutils.widgets.get("catalog_name")
schema_name = dbutils.widgets.get("schema_name")
source_path = dbutils.widgets.get("source_path")
checkpoint_base = dbutils.widgets.get("checkpoint_base")

# Construct the full Unity Catalog table identifier
target_table = f"`{catalog_name}`.`{schema_name}`.`users_bronze`"
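
Because every widget defaults to an empty string, a missing job parameter would otherwise surface only later as an invalid table name or path. A minimal sketch of a fail-fast check follows. The helpers require_params and quote_identifier are hypothetical names (not part of dbutils or the original notebook), and the literal values are made-up stand-ins for dbutils.widgets.get(...).

```python
def require_params(params: dict[str, str]) -> dict[str, str]:
    """Return stripped parameter values, raising if any value is blank."""
    cleaned = {name: (value or "").strip() for name, value in params.items()}
    missing = sorted(name for name, value in cleaned.items() if not value)
    if missing:
        raise ValueError(f"Missing required notebook parameters: {', '.join(missing)}")
    return cleaned


def quote_identifier(*parts: str) -> str:
    """Backtick-quote each part of a multi-part name, doubling embedded backticks."""
    return ".".join("`" + part.replace("`", "``") + "`" for part in parts)


# Hypothetical values standing in for dbutils.widgets.get(...)
params = require_params({
    "catalog_name": "dev_catalog",
    "schema_name": "bronze",
    "source_path": "/Volumes/dev_catalog/bronze/landing/users",
    "checkpoint_base": "/Volumes/dev_catalog/bronze/checkpoints/users",
})
target_table = quote_identifier(params["catalog_name"], params["schema_name"], "users_bronze")
```

Validating up front means a misconfigured deployment fails in the first cell with a message that names the missing parameter, before any table is created or any stream is started.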


## 2. Streaming Ingestion with Auto Loader


This block performs the actual data movement using Spark Structured Streaming.

#### Logic:

The code uses spark.readStream with the cloudFiles format to pick up new JSON files in the source directory as they arrive. Column types are inferred from the data, and the inferred schema is stored under the schema location so it can evolve as new fields appear.

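#### Why this code:

* Auto Loader (cloudFiles): Discovers new files incrementally, scales to millions of files, and handles schema drift (new columns) by evolving the tracked schema. With the default settings, the run that first sees a new column stops after recording it, and the next run picks the column up.

* Checkpointing: Stores stream progress in checkpoint_base, so that after a failure the next run resumes where the previous one left off instead of reprocessing files that were already ingested.

* Trigger availableNow=True: Processes all data available when the run starts, in one or more micro-batches, and then stops. This batch-like behavior is cost-effective for scheduled jobs.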
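
A checkpoint directory belongs to exactly one streaming query, so if checkpoint_base were ever shared by several ingestion notebooks, each stream would need its own subdirectory. The sketch below shows one possible layout. The helper stream_paths and the example path are hypothetical; the notebook itself passes checkpoint_base directly.

```python
def stream_paths(checkpoint_base: str, table_name: str) -> dict[str, str]:
    """Derive per-table checkpoint and schema locations under a shared base path."""
    root = f"{checkpoint_base.rstrip('/')}/{table_name}"
    return {
        "checkpointLocation": root,
        "cloudFiles.schemaLocation": f"{root}/schema",
    }


# Made-up base path, used only for illustration
paths = stream_paths("/Volumes/dev_catalog/bronze/checkpoints/", "users_bronze")
```

The two values would then be passed to the checkpointLocation and cloudFiles.schemaLocation options of the stream.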

In [0]:

# 3. Ensure the target Bronze table exists before starting the stream
spark.sql(f"CREATE TABLE IF NOT EXISTS {target_table}")

# 4. Define and start the Auto Loader streaming query
query = (
    spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        # Logic: Auto-detect data types and store the inferred schema
        .option("cloudFiles.inferColumnTypes", "true") 
        .option("cloudFiles.schemaLocation", f"{checkpoint_base}/schema")
        .load(source_path)
        # Logic: Add audit columns to track load time and source origin
        .withColumn("load_dt", current_timestamp())
        .withColumn("source", lit("dab_json_ingestion"))
        .writeStream
        .format("delta")
        # Logic: Update the Delta table schema if new fields appear in the JSON
        .option("mergeSchema", "true") 
        .option("checkpointLocation", checkpoint_base)
        .outputMode("append")
        # Logic: Process only new data since the last run, then shut down
        .trigger(availableNow=True)
        .table(target_table)
)
# Logic: Block the cell until all data is processed
query.awaitTermination()
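
After awaitTermination() returns, it can be useful to log how many rows the run ingested. The sketch below sums the numInputRows field over progress entries. The helper total_input_rows is hypothetical, the sample data is made up, and the code assumes each entry can be indexed like a dict, as the entries of PySpark's query.recentProgress can.

```python
from typing import Iterable, Mapping


def total_input_rows(progress_events: Iterable[Mapping]) -> int:
    """Sum numInputRows across streaming progress entries."""
    return sum(int(event["numInputRows"]) for event in progress_events)


# Stand-in data shaped like query.recentProgress (values are made up)
sample_progress = [
    {"batchId": 0, "numInputRows": 120},
    {"batchId": 1, "numInputRows": 30},
]
rows = total_input_rows(sample_progress)  # 150
```

In the notebook this would be called as total_input_rows(query.recentProgress). Spark retains only a bounded number of recent progress updates, so for runs with many micro-batches the total may undercount.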