This notebook establishes a Data Auditing and Integrity Framework for master data stored in XML format. It ensures that data loaded into the Bronze layer of the Medallion architecture remains consistent and accurate compared to the raw source files.

## 1. Audit Configuration & Hashing Logic

This section defines the mapping between raw XML files and their corresponding Delta tables, alongside a robust method for comparing data content.

#### Logic:

A master_tables dictionary stores metadata for menu_items, payment_methods, stores, and vouchers. Each entry names the source XML file, the row tag, the target Bronze table, and the "hash columns" used to generate an MD5 fingerprint for every row.

#### Why this code:

XML parsing can sometimes lead to subtle data type changes. By casting key columns to strings, concatenating them, and generating an MD5 hash, we create a "digital signature" for each record that allows for exact content comparison between the raw file and the database.
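
As a rough illustration of the fingerprint idea, the sketch below reproduces it in plain Python with `hashlib` rather than Spark. The function name `row_fingerprint` and the sample values are hypothetical; the `||` separator and the `NULL` placeholder mirror the notebook's approach. It also shows the flip side: because values are hashed as strings, a formatting difference such as `9.5` versus `9.50` changes the hash even when the numbers are equal.

```python
import hashlib

def row_fingerprint(values, sep="||", null_token="NULL"):
    """Join stringified values with a separator and return the MD5 hex digest."""
    parts = [null_token if v is None else str(v) for v in values]
    return hashlib.md5(sep.join(parts).encode("utf-8")).hexdigest()

# Hypothetical sample rows (not taken from the real XML files)
raw_row = ["M001", "Espresso", "Coffee", "9.5"]
bronze_same = ["M001", "Espresso", "Coffee", "9.5"]
bronze_drift = ["M001", "Espresso", "Coffee", "9.50"]

# Identical content gives identical hashes; a formatting drift does not.
same_content = row_fingerprint(raw_row) == row_fingerprint(bronze_same)
drifted = row_fingerprint(raw_row) != row_fingerprint(bronze_drift)
print(same_content, drifted)  # True True
```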

In [0]:
from pyspark.sql import functions as F

# 1. Configuration: Define the path for Chunk 4 XML files and the target Bronze schema
base_path = "/Volumes/vstone-catalog/vstone_schema/chunked_data/chunk4/"
catalog_schema = "`vstone-catalog`.bronze_schema"

In [0]:

# 2. Metadata Mapping: Link XML files to Bronze tables and define columns for integrity hashing
master_tables = {
    "menu_items": {
        "xml": "menu_items.xml",
        "tag": "item",
        "table": f"{catalog_schema}.bronze_menuitems",
        "hash_cols": ["item_id", "item_name", "category", "price"]
    },
    "payment_methods": {
        "xml": "payment_methods.xml",
        "tag": "item",
        "table": f"{catalog_schema}.bronze_paymentmethods",
        "hash_cols": ["method_id", "method_name", "category"]
    },
    "stores": {
        "xml": "stores.xml",
        "tag": "item",
        "table": f"{catalog_schema}.bronze_stores",
        "hash_cols": ["store_id", "store_name", "latitude", "longitude"]
    },
    "vouchers": {
        "xml": "vouchers.xml",
        "tag": "item",
        "table": f"{catalog_schema}.bronze_vouchers",
        "hash_cols": ["voucher_id", "voucher_code", "discount_value"]
    }
}

def add_row_hash(df, cols, prefix=""):
    """
    Logic: Casts each column to string, replaces nulls with the literal "NULL",
    joins the values with "||", and generates an MD5 hash.
    Why: Provides a row-level content fingerprint to verify data hasn't changed during ingestion.
    """
    hashed_input = F.concat_ws(
        "||",
        *[F.coalesce(F.col(c).cast("string"), F.lit("NULL")) for c in cols]
    )
    return df.withColumn(f"{prefix}row_hash", F.md5(hashed_input))
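
The separator and the `NULL` placeholder both matter here. Spark's `concat_ws` skips null inputs, so without `coalesce` a null in one column would simply vanish from the joined string. The plain-Python sketch below mimics that behavior to show the collisions each safeguard prevents; the `concat_ws` and `with_placeholder` helpers are hypothetical stand-ins, not Spark itself.

```python
def concat_ws(sep, values):
    """Hypothetical stand-in for Spark's concat_ws: joins values, skipping None."""
    return sep.join(str(v) for v in values if v is not None)

def with_placeholder(values, token="NULL"):
    """Mirror of the coalesce step: replace None with a literal token."""
    return [token if v is None else v for v in values]

# Without a separator, different column splits collapse into one string.
no_sep_collision = concat_ws("", ["ab", "c"]) == concat_ws("", ["a", "bc"])
sep_distinguishes = concat_ws("||", ["ab", "c"]) != concat_ws("||", ["a", "bc"])

# Without the placeholder, a null in either column yields the same string.
null_collision = concat_ws("||", ["x", None]) == concat_ws("||", [None, "x"])
placeholder_distinguishes = (
    concat_ws("||", with_placeholder(["x", None]))
    != concat_ws("||", with_placeholder([None, "x"]))
)
print(no_sep_collision, sep_distinguishes, null_collision, placeholder_distinguishes)
```

One caveat of this design: a genuine string value of `NULL` hashes the same as a missing value.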


## 2. Automated Execution Loop

This block automates the audit process by iterating through the defined tables and performing row-level reconciliation.

#### Logic:

For each table, the script loads the raw XML and the Delta table. It performs a Global Reconciliation (comparing row counts) and a Row-Level Content Check (left-joining raw to Bronze on the primary key, taken as the first hash column, and comparing MD5 hashes).

#### Why this code:

Manual auditing is not scalable. This loop provides an automated "Success" or "Alert" status for each dataset, ensuring that any discrepancies (missing rows or altered values) are immediately flagged for the engineering team.
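
The two checks can be sketched without Spark. The example below uses plain dictionaries mapping a primary key to a row hash; `reconcile`, the keys, and the hash strings are all hypothetical illustrations. The logic mirrors a left join from raw to Bronze rather than reproducing the notebook's actual code.

```python
def reconcile(raw_hashes, bronze_hashes):
    """Compare {primary_key: row_hash} maps, driving from the raw side."""
    # Global reconciliation: difference in row counts
    count_diff = len(raw_hashes) - len(bronze_hashes)
    # Row-level check: keys absent from Bronze, and keys whose hashes differ
    missing = sorted(k for k in raw_hashes if k not in bronze_hashes)
    altered = sorted(
        k for k, h in raw_hashes.items()
        if k in bronze_hashes and bronze_hashes[k] != h
    )
    status = "SUCCESS" if not missing and not altered else "ALERT"
    return {
        "count_diff": count_diff,
        "missing": missing,
        "altered": altered,
        "status": status,
    }

# Hypothetical data: S2 was altered and S3 never reached Bronze
raw = {"S1": "aaa", "S2": "bbb", "S3": "ccc"}
bronze = {"S1": "aaa", "S2": "xxx"}
report = reconcile(raw, bronze)
print(report)
```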

In [0]:


# 3. Execution Loop for Data Auditing
for key, info in master_tables.items():
    print(f"\n--- Auditing Table: {key} ---")
    
    # A. Load Raw XML using the spark-xml connector
    raw_df = spark.read.format("xml") \
        .option("rowTag", info["tag"]) \
        .load(base_path + info["xml"])
    
    # B. Load the corresponding Bronze table from Delta Lake
    bronze_df = spark.table(info["table"])
    
    # C. Global Reconciliation: Verify row counts match exactly
    raw_count = raw_df.count()
    bronze_count = bronze_df.count()
    
    print(f"File: {info['xml']} | Raw Count: {raw_count} | Bronze Count: {bronze_count} | Diff: {raw_count - bronze_count}")
    
    # D. Integrity Check: Compare MD5 signatures
    raw_final = add_row_hash(raw_df, info["hash_cols"], "raw_")
    bronze_final = add_row_hash(bronze_df, info["hash_cols"], "bronze_")
    
    # Use the first ID column as the join key
    pk = info["hash_cols"][0]
    
    # Join and filter for mismatches or missing records in Bronze
    mismatches = raw_final.join(
        bronze_final.select(pk, "bronze_row_hash"),
        on=pk,
        how="left"
    ).filter((F.col("raw_row_hash") != F.col("bronze_row_hash")) | (F.col("bronze_row_hash").isNull()))
    
    # Output results (count once so the count job is not run twice)
    mismatch_count = mismatches.count()
    if mismatch_count == 0:
        print(f"✅ SUCCESS: {key} data integrity verified.")
    else:
        print(f"❌ ALERT: Found {mismatch_count} discrepancies in {key}.")
        mismatches.show()
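
Because the join in the loop is a left join from raw to Bronze, rows that exist only in Bronze affect the count difference but are never listed individually. One possible complement, sketched here in plain Python with hypothetical key lists, is to diff the keys in both directions. In Spark, a `left_anti` join from Bronze to raw would be one way to get the Bronze-only rows.

```python
def key_diff(raw_keys, bronze_keys):
    """Return keys present on only one side (hypothetical helper)."""
    raw_set, bronze_set = set(raw_keys), set(bronze_keys)
    return {
        "only_in_raw": sorted(raw_set - bronze_set),
        "only_in_bronze": sorted(bronze_set - raw_set),
    }

# Both sides hold three rows, so the count check alone would pass,
# yet V1 is missing from Bronze and V4 is unexpected there.
diff = key_diff(["V1", "V2", "V3"], ["V2", "V3", "V4"])
print(diff)
```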

#### Summary of Integrity Checks

* Menu Items: Verified 8 records.

* Payment Methods: Verified 5 records.

* Stores: Verified 10 records.

* Vouchers: Verified 16 records.