## 1. Dynamic Environment Setup


This section defines the notebook parameters so that the same notebook can be deployed unchanged across environments (Dev, Test, Prod).

#### Logic:

The notebook uses `dbutils.widgets` to capture environment-specific values: the catalog name, the schema name, and the source path.

#### Why this code:

Using widgets keeps environment-specific values out of the code. Because the table name is constructed from the widget values at run time, the same logic can target different Unity Catalog locations without manual code changes.

In [0]:
# 1. Declare the notebook widgets (values are supplied as job parameters, e.g. via DABs)
# Logic: Retrieve the target location and source data path from job parameters
dbutils.widgets.text("catalog_name","")
dbutils.widgets.text("schema_name","")



In [0]:
dbutils.widgets.text("source_path", "/Volumes/vstone-catalog/vstone_schema/chunked_data/chunk1/")

In [0]:
# 2. Get the values into Python variables
catalog_name = dbutils.widgets.get("catalog_name")
schema_name = dbutils.widgets.get("schema_name")
source_path = dbutils.widgets.get("source_path")

# Construct the full Delta table identifier for Unity Catalog
table_name = f"`{catalog_name}`.`{schema_name}`.`transactions_bronze`"

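# Create the target table if it doesn't exist
# Why: This prevents errors when running COPY INTO on a non-existent target
# Note: this creates an empty table with no columns. Once it exists, a later
# CREATE TABLE IF NOT EXISTS on the same name is skipped, so the schema is
# then filled in by COPY INTO with mergeSchema.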
spark.sql(f"CREATE TABLE IF NOT EXISTS {table_name}")
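
With empty widget defaults, a run that omits `catalog_name` or `schema_name` would build an identifier with empty parts and fail inside the SQL call. A minimal sketch of validating the values before building the identifier follows. `quote_identifier` and `build_table_name` are hypothetical helpers, not part of the original notebook or of any Databricks API, and the example values are illustrative.

```python
def quote_identifier(part: str) -> str:
    """Backtick-quote one identifier part, doubling any embedded backticks."""
    if not part or not part.strip():
        raise ValueError("identifier part must not be empty")
    return "`" + part.replace("`", "``") + "`"


def build_table_name(catalog: str, schema: str, table: str) -> str:
    """Return a three-level name of the form `catalog`.`schema`.`table`."""
    return ".".join(quote_identifier(p) for p in (catalog, schema, table))


# Illustrative values; in the notebook these would come from
# dbutils.widgets.get("catalog_name") and dbutils.widgets.get("schema_name").
example_name = build_table_name("vstone-catalog", "vstone_schema", "transactions_bronze")
print(example_name)
```

Failing early with a `ValueError` makes a missing job parameter easier to diagnose than a SQL parse error raised later.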

## 2. Schema Seeding and COPY INTO Execution

This block handles the core ingestion logic, ensuring that the target table matches the source file structure while adding audit tracking.

#### Logic:

We first "seed" the table by creating it with the correct schema but zero rows. We then use the `COPY INTO` command to incrementally load only new data.

#### Why this code:

* COPY INTO: An idempotent operation that automatically skips files that have already been loaded, making it ideal for scheduled batch ingestion.

* Metadata Tracking: By selecting _metadata.file_path, we provide a clear audit trail of which file contributed each row.
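
The idempotency described above can be pictured with a toy model in plain Python. This only illustrates the observable behavior (files already loaded are skipped on a re-run); it is not how `COPY INTO` is implemented, and `ToyTable` and `copy_into` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Stand-in for a target table that remembers which files it has loaded."""
    rows: list = field(default_factory=list)
    loaded_files: set = field(default_factory=set)


def copy_into(table: ToyTable, files: dict) -> int:
    """Load rows from files not seen before; return the number of rows added."""
    added = 0
    for path, file_rows in sorted(files.items()):
        if path in table.loaded_files:
            continue  # already ingested on an earlier run, so skip it
        for row in file_rows:
            # Attach the source path, mirroring the source_file audit column
            table.rows.append({**row, "source_file": path})
            added += 1
        table.loaded_files.add(path)
    return added


table = ToyTable()
batch = {"chunk1/a.csv": [{"id": 1}, {"id": 2}], "chunk1/b.csv": [{"id": 3}]}
first_run = copy_into(table, batch)    # loads all three rows
second_run = copy_into(table, batch)   # same files again, nothing is added
batch["chunk1/c.csv"] = [{"id": 4}]
third_run = copy_into(table, batch)    # only the new file is loaded
print(first_run, second_run, third_run)
```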

In [0]:
# 3. SEED THE SCHEMA
# Logic: Use read_files to infer the CSV schema and add placeholder audit columns
# 'WHERE 1=0' returns no rows, so only the structure is defined and no data is loaded here
spark.sql(f"""
CREATE TABLE IF NOT EXISTS {table_name} 
AS SELECT *, 
          cast(NULL as string) AS source_file, 
          current_timestamp() AS load_dt 
   FROM read_files('{source_path}', format => 'csv', header => true, inferSchema => true) 
   WHERE 1=0
""")

# 4. Execute Idempotent Ingestion
# Logic: Copy data from the volume path while enriching it with file metadata
spark.sql(f"""
COPY INTO {table_name}
FROM (
  SELECT 
    *, 
    _metadata.file_path AS source_file, 
    current_timestamp() AS load_dt
  FROM '{source_path}'
)
FILEFORMAT = CSV
FORMAT_OPTIONS (
    'header' = 'true',
    'inferSchema' = 'true',
    'mergeSchema' = 'true'
)
COPY_OPTIONS (
    'mergeSchema' = 'true'
)
""")

# 5. Display result
display(spark.read.table(table_name))
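
Because each row carries `source_file` and `load_dt`, a per-file row count is a quick way to confirm what a run loaded. The sketch below only builds the SQL text; `build_audit_query` is a hypothetical helper and the identifier passed to it is illustrative.

```python
def build_audit_query(table_name: str) -> str:
    """Return SQL that counts rows and reports the latest load time per source file."""
    return (
        "SELECT source_file, count(*) AS row_count, max(load_dt) AS last_load_dt\n"
        f"FROM {table_name}\n"
        "GROUP BY source_file\n"
        "ORDER BY source_file"
    )


# Illustrative identifier; the notebook would pass its own table_name variable.
audit_sql = build_audit_query("`my_catalog`.`my_schema`.`transactions_bronze`")
print(audit_sql)
```

In the notebook, the string could be run with `display(spark.sql(build_audit_query(table_name)))`. If the ingestion is re-run without new files, the counts would be expected to stay the same.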

## 3. Discovery & Audit Metadata

The final section focuses on data governance by adding descriptions and properties to the table in Unity Catalog.

#### Logic:

The code sets table-level properties (quality, source, description) and column-level comments to improve discoverability.

#### Why this code:

In an enterprise environment, metadata is as important as the data itself. These properties let other data engineers and analysts understand the data's lineage and purpose directly from the Unity Catalog UI.

In [0]:

# 6. Add Discovery Metadata (Audit Properties)
# Logic: Tag the table as 'bronze' and describe its origin
spark.sql(f"""
ALTER TABLE {table_name} SET TBLPROPERTIES (
  'quality' = 'bronze',
  'source' = 'transactions_csv',
  'description' = 'Enterprise transactions data ingested from volume chunk1'
)
""")


In [0]:
# Document audit columns for easy discovery in Unity Catalog
spark.sql(f"ALTER TABLE {table_name} ALTER COLUMN load_dt COMMENT 'Data load timestamp'")
spark.sql(f"ALTER TABLE {table_name} ALTER COLUMN source_file COMMENT 'Source path of the ingested file'")
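
If more columns need documenting, the comment statements can be generated from a dictionary instead of being written one by one. This is a sketch under assumptions: `build_column_comment_statements` is a hypothetical helper, the table identifier is illustrative, and the backslash escaping of single quotes assumes Spark SQL's default string-literal parsing.

```python
def build_column_comment_statements(table_name: str, comments: dict) -> list:
    """Return one ALTER TABLE ... ALTER COLUMN ... COMMENT statement per column."""
    statements = []
    for column, text in comments.items():
        # Escape backslashes first, then single quotes, so the literal stays valid
        escaped = text.replace("\\", "\\\\").replace("'", "\\'")
        statements.append(
            f"ALTER TABLE {table_name} ALTER COLUMN `{column}` COMMENT '{escaped}'"
        )
    return statements


audit_comments = {
    "load_dt": "Data load timestamp",
    "source_file": "Source path of the ingested file",
}
# Illustrative identifier; the notebook would pass its own table_name variable.
statements = build_column_comment_statements("`c`.`s`.`transactions_bronze`", audit_comments)
for stmt in statements:
    print(stmt)
```

In the notebook, each generated statement would be passed to `spark.sql`.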