
This notebook implements a High-Volume Data Reconciliation Framework for transaction item data.

 It ensures that the millions of records processed in the "Chunk 2" group are accurately reflected in the Bronze layer by performing both aggregate financial balancing and row-level cryptographic fingerprinting.

##1. Audit Configuration & Raw Data Loading

This section initializes the audit parameters and aggregates the raw CSV files for comparison.

####Logic: 

The script targets all CSV files within the chunk2 volume and defines a set of business-critical columns (IDs, quantity, and pricing) used to verify data consistency.

####Why this code: 
 
 Transaction items often represent the highest volume of data in the system. By loading all files in the folder as a single group, the script can perform a "Grand Total" reconciliation against the Bronze table.

In [0]:
from pyspark.sql import functions as F

# 1. Configuration: Define source paths and critical columns for hashing
raw_path = "/Volumes/vstone-catalog/vstone_schema/chunked_data/chunk2/transaction_items/*.csv"
bronze_table_name = "`vstone-catalog`.bronze_schema.transactions_items_bronze"
# Columns that must remain unchanged during ingestion
hash_cols = ["transaction_id", "item_id", "quantity", "unit_price", "subtotal", "created_at"]


In [0]:

# 2. Load Raw Data
# Logic: Treat all CSV files in the folder as one consolidated dataset for Chunk 2
raw_df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load(raw_path) \
    .withColumn("source_label", F.lit("chunk2_csv"))


In [0]:

# 3. Load processed Bronze Data
bronze_df = spark.table(bronze_table_name)



##2. Global Reconciliation (Step 1)

This phase verifies that the total number of records and the total financial value match between the source and the target.

####Logic: 

The script calculates the total row count and the sum of the subtotal column for both the raw files and the records in the Bronze table labeled as chunk2_csv. 

####Why this code:
 This serves as the primary financial audit. Any discrepancy in amt_diff (total subtotal) or count_diff would indicate missing records or altered pricing during the ingestion process.

In [0]:
# 4. Global Reconciliation (Grand Totals)
# Logic: Sum up all Raw files to compare against the consolidated Bronze table
raw_total = raw_df.agg(
    F.lit("chunk2_csv").alias("source"),
    F.count("*").alias("raw_count"),
    F.sum("subtotal").alias("raw_sum")
)

bronze_total = bronze_df.filter(F.col("source") == "chunk2_csv").agg(
    F.lit("chunk2_csv").alias("source"),
    F.count("*").alias("bronze_count"),
    F.sum("subtotal").alias("bronze_sum")
)
# Join and calculate differences
comparison_df = raw_total.join(bronze_total, "source") \
    .select(
        "source",
        "raw_count",
        "bronze_count",
        (F.col("raw_count") - F.col("bronze_count")).alias("count_diff"),
        F.round(F.col("raw_sum") - F.col("bronze_sum"), 2).alias("amt_diff")
    )

print("--- Step 1: Global Reconciliation (Grand Totals for Chunk 2) ---")
comparison_df.show()


##3. Row-Level Integrity Audit (Step 2)

The final validation ensures that every individual field in every row is an exact 1:1 replica of the source data.

####Logic: 

An MD5 hash is generated for every row based on the hash_cols. The script then performs a join on transaction_id and item_id to ensure that every raw record has a matching "twin" in the Bronze table with an identical hash. 

####Why this code: 

Aggregate sums can sometimes hide errors (e.g., one row being $1 higher and another $1 lower). Cryptographic hashing ensures that not a single character of data has changed.

In [0]:

# 5. Row-Level Integrity Audit (MD5 Fingerprint)
def add_row_hash(df, cols, prefix=""):
    """
    Logic: Concatenates columns and generates an MD5 hash fingerprint.
    Why: Guarantees 100% data fidelity at the individual record level.
    """
    return df.withColumn(f"{prefix}row_hash", F.md5(F.concat_ws("||", 
        *[F.coalesce(F.col(c).cast("string"), F.lit("NULL")) for c in hash_cols]
    )))

raw_final = add_row_hash(raw_df, hash_cols, "raw_")
bronze_final = add_row_hash(bronze_df, hash_cols, "bronze_")

# Audit: Verify if every item in Raw has an identical hash in Bronze
mismatches = raw_final.join(
    bronze_final.select("transaction_id", "item_id", "bronze_row_hash"),
    (raw_final.transaction_id == bronze_final.transaction_id) & 
    (raw_final.item_id == bronze_final.item_id) &
    (raw_final.raw_row_hash == bronze_final.bronze_row_hash),
    "left"
).filter(F.col("bronze_row_hash").isNull())

In [0]:


# 6. Final Verdict
mismatch_count = mismatches.count()
if mismatch_count == 0:
    print(f"✅ FINAL VERDICT: 100% Data Integrity Verified for all items in Chunk 2.")
else:
    print(f"❌ ALERT: Found {mismatch_count} discrepancies.")

####Audit Summary

* Total Records Processed: 29,246,323 items.

* Financial Reconciliation: 0.0 difference in total subtotal.

* Integrity Status: 100% verified via row-level fingerprinting.