
This notebook sets up a data validation framework for large-scale transaction data.

It checks that ingestion from raw CSV files into the Bronze layer preserves the data exactly, using two techniques: aggregate reconciliation per source file and row-level fingerprinting.


## 1. Data Loading and Indexing

The first phase involves loading the raw CSV transactions and simulating a row position to enable granular auditing.

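#### Logic:

The script reads the raw CSV files from the chunk1 volume, captures each record's source file path, and numbers the rows within each file by applying row_number() over a window partitioned by source_file and ordered by monotonically_increasing_id(). The resulting file_row_number is gapless and restarts at 1 for every file, where row 1 is the first data row after the header. Because monotonically_increasing_id() reflects the order in which Spark reads the records, the number approximates the physical line position rather than guaranteeing it.

#### Why this code:

Standard CSV ingestion does not track row positions. Adding file_row_number lets us trace any discrepancy found later in the audit back to its location in the source file.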
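
To make the idea of a per-file row number concrete, here is a minimal plain-Python sketch with no Spark dependency. The helper `number_rows` and the sample file names are hypothetical and exist only for illustration; the notebook itself derives the same column with a Spark window.

```python
import csv
import io

def number_rows(files):
    """Attach source_file and a 1-based file_row_number to every CSV record.

    `files` maps a (hypothetical) file path to that file's CSV text.
    """
    numbered = []
    for path, text in files.items():
        reader = csv.DictReader(io.StringIO(text))  # the first line is the header
        # Numbering restarts at 1 for each file, like a window partitioned by file
        for n, row in enumerate(reader, start=1):
            numbered.append({**row, "source_file": path, "file_row_number": n})
    return numbered

# Illustrative data only, not taken from the real volume
sample_files = {
    "jan.csv": "transaction_id,final_amount\nt1,10.00\nt2,25.50\n",
    "feb.csv": "transaction_id,final_amount\nt3,7.25\n",
}
rows = number_rows(sample_files)
```

In this sample, `t2` receives `file_row_number` 2: it is the second data row of `jan.csv`, which is line 3 of the file once the header is counted.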

In [0]:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# 1. Configuration: Define source volume path and target Bronze table
raw_path = "/Volumes/vstone-catalog/vstone_schema/chunked_data/chunk1/*.csv"
bronze_table_name = "`vstone-catalog`.bronze_schema.transactions_bronze"



In [0]:
# 2. Load Raw Data with Metadata
# Logic: capture the file path to distinguish records between monthly transaction files
raw_df_initial = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load(raw_path) \
    .select(
        "*", 
        F.col("_metadata.file_path").alias("source_file")
    )


In [0]:

# 3. Simulate Row Index for Auditing
# Logic: Use a Window spec to number rows within each file, following Spark's read order
window_spec = Window.partitionBy("source_file").orderBy(F.monotonically_increasing_id())
raw_df = raw_df_initial.withColumn("file_row_number", F.row_number().over(window_spec))


In [0]:

# 4. Load Bronze Data
bronze_df = spark.table(bronze_table_name)


In [0]:

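# 5. Define MD5 Logic
# Columns that make up the row fingerprint
hash_cols = [
    "transaction_id", "store_id", "payment_method_id", "voucher_id", 
    "user_id", "original_amount", "discount_applied", "final_amount", "created_at"
]

def add_row_hash(df, prefix=""):
    """Add a `{prefix}row_hash` column: the MD5 of hash_cols joined with '||'."""
    # concat_ws skips nulls, so replace them with the literal "NULL" to keep column positions fixed
    return df.withColumn(f"{prefix}row_hash", F.md5(F.concat_ws("||", 
        *[F.coalesce(F.col(c).cast("string"), F.lit("NULL")) for c in hash_cols]
    )))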


## 2. Aggregate Validation (Step 1)

Before performing deep row-level checks, the script performs a high-level summary validation to ensure no records or financial values were lost.

#### Logic:

It groups both the raw and Bronze data by source_file and calculates the total record count and the sum of final_amount for each file.

#### Why this code:

This provides an immediate "smoke test". If count_diff or amt_diff is anything other than zero for a file, the ingestion pipeline has lost or altered data, and that failure needs attention before proceeding to the more expensive row-level checks.
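
The reconciliation can be sketched in plain Python, independent of Spark. The functions `aggregate` and `reconcile` and the sample rows are hypothetical names chosen for this illustration. The sketch also makes two choices the notebook does not: it sums with `Decimal` instead of rounding a difference to two places, and it treats a file that is absent on one side as having a count and sum of zero.

```python
from collections import defaultdict
from decimal import Decimal

def aggregate(rows):
    """Return {source_file: (record_count, total_final_amount)}."""
    totals = defaultdict(lambda: [0, Decimal("0")])
    for row in rows:
        entry = totals[row["source_file"]]
        entry[0] += 1
        entry[1] += Decimal(row["final_amount"])
    return {f: (count, total) for f, (count, total) in totals.items()}

def reconcile(raw_rows, bronze_rows):
    """Compare per-file aggregates; a file missing on one side counts as (0, 0)."""
    raw_agg, bronze_agg = aggregate(raw_rows), aggregate(bronze_rows)
    empty = (0, Decimal("0"))
    report = {}
    # The union of file names plays the role of the outer join
    for f in sorted(set(raw_agg) | set(bronze_agg)):
        raw_count, raw_sum = raw_agg.get(f, empty)
        bronze_count, bronze_sum = bronze_agg.get(f, empty)
        report[f] = {
            "count_diff": raw_count - bronze_count,
            "amt_diff": raw_sum - bronze_sum,
        }
    return report

# Illustrative data only
raw = [
    {"source_file": "jan.csv", "final_amount": "10.00"},
    {"source_file": "jan.csv", "final_amount": "25.50"},
    {"source_file": "feb.csv", "final_amount": "7.25"},
]
bronze = raw[:2]  # simulate feb.csv never reaching Bronze

report = reconcile(raw, bronze)
failed = [f for f, d in report.items() if d["count_diff"] != 0 or d["amt_diff"] != 0]
```

Here `failed` contains only `feb.csv`, the file whose single record is absent from the simulated Bronze data.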

In [0]:

# 6. High-Level Aggregate Validation
# Logic: Compare record counts and total transaction amounts per file
raw_agg = raw_df.groupBy("source_file").agg(
    F.count("*").alias("raw_count"),
    F.sum("final_amount").alias("raw_sum_amt")
)

bronze_agg = bronze_df.groupBy("source_file").agg(
    F.count("*").alias("bronze_count"),
    F.sum("final_amount").alias("bronze_sum_amt")
)
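# Join aggregates to calculate differences
# The outer join keeps files present on only one side; coalesce turns their
# missing aggregates into 0 so the diffs are never null
comparison_df = raw_agg.join(bronze_agg, "source_file", "outer") \
    .withColumn("count_diff", F.coalesce(F.col("raw_count"), F.lit(0)) - F.coalesce(F.col("bronze_count"), F.lit(0))) \
    .withColumn("amt_diff", F.round(F.coalesce(F.col("raw_sum_amt"), F.lit(0)) - F.coalesce(F.col("bronze_sum_amt"), F.lit(0)), 2))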

print("Step 1: Aggregate Validation Results:")
comparison_df.select("source_file", "raw_count", "bronze_count", "count_diff", "amt_diff").show(truncate=False)

## 3. Row-Level Integrity Audit (Step 2)

The final validation gate uses MD5 hashing to check that every business-critical field of every record was ingested correctly.

#### Logic:

It generates an MD5 hash (fingerprint) of all business-critical columns (IDs, amounts, and timestamps) on both sides. It then left-joins the raw data to the Bronze data on transaction_id and keeps any row whose hashes differ or that has no Bronze counterpart.

#### Why this code:

Financial data requires zero tolerance for errors. This step detects "silent" data corruption, such as a shifted decimal point or a truncated timestamp, which aggregate sums might miss.
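
As a rough illustration of why a row fingerprint catches what an aggregate cannot, the plain-Python sketch below hashes a subset of the columns with `hashlib`. The function `row_hash`, the shortened column list, and the sample rows are assumptions made for this example. Python's `str()` also formats values differently from a Spark cast to string, so the digests are illustrative and not meant to match the notebook's.

```python
import hashlib
from decimal import Decimal

# Subset of the notebook's hash columns, shortened for brevity
HASH_COLS = ["transaction_id", "original_amount", "discount_applied", "final_amount"]

def row_hash(row, cols=HASH_COLS):
    """MD5 over the selected columns joined with '||'; None becomes the literal 'NULL'."""
    parts = ["NULL" if row.get(c) is None else str(row[c]) for c in cols]
    return hashlib.md5("||".join(parts).encode("utf-8")).hexdigest()

# Illustrative data only
raw_rows = [
    {"transaction_id": "t1", "original_amount": "12.00", "discount_applied": "2.00", "final_amount": "10.00"},
    {"transaction_id": "t2", "original_amount": "25.50", "discount_applied": None, "final_amount": "25.50"},
]
# Corrupted copy: the two final_amount values are swapped, so the column total is unchanged
bronze_rows = [
    {**raw_rows[0], "final_amount": "25.50"},
    {**raw_rows[1], "final_amount": "10.00"},
]

def total(rows):
    return sum(Decimal(r["final_amount"]) for r in rows)

same_total = total(raw_rows) == total(bronze_rows)
changed = [r["transaction_id"] for r, b in zip(raw_rows, bronze_rows) if row_hash(r) != row_hash(b)]
```

The totals agree, so an aggregate check alone would pass, yet both transaction IDs end up in `changed` because each row's fingerprint differs.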

In [0]:


# 7. Row-Level Integrity Validation
raw_final = add_row_hash(raw_df, "raw_")
bronze_final = add_row_hash(bronze_df, "bronze_")

# Left join keeps every raw row; a null bronze_row_hash means the row is missing from Bronze.
# add_row_hash always names the column bronze_row_hash, which is the name the filter expects.
mismatches = raw_final.select("transaction_id", "raw_row_hash", "source_file", "file_row_number").join(
    bronze_final.select("transaction_id", "bronze_row_hash"),
    on="transaction_id",
    how="left"
).filter((F.col("raw_row_hash") != F.col("bronze_row_hash")) | (F.col("bronze_row_hash").isNull()))

# 8. Report
mismatch_count = mismatches.count()
if mismatch_count == 0:
    print("✅ SUCCESS: All records match the Bronze table exactly.")
else:
    print(f"❌ CRITICAL: Found {mismatch_count} discrepancies.")
    mismatches.select("file_row_number", "transaction_id", "source_file").show(20)
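
The join-and-filter step can also be emulated in plain Python to show which rows it reports. The function `find_mismatches`, the placeholder hash strings, and the two reason labels are hypothetical additions for this sketch; the notebook itself reports all discrepancies together without distinguishing the cause.

```python
def find_mismatches(raw_hashes, bronze_hashes):
    """Emulate the left join + filter on transaction_id.

    Both arguments map transaction_id -> row hash. Returns the raw rows that
    are missing from Bronze or whose hash differs.
    """
    mismatches = []
    for txn_id, raw_hash in raw_hashes.items():
        # None plays the role of the null produced by an unmatched left join
        bronze_hash = bronze_hashes.get(txn_id)
        if bronze_hash is None:
            mismatches.append((txn_id, "missing_in_bronze"))
        elif bronze_hash != raw_hash:
            mismatches.append((txn_id, "hash_mismatch"))
    return mismatches

# Placeholder hashes, for illustration only
raw_hashes = {"t1": "aaa", "t2": "bbb", "t3": "ccc"}
bronze_hashes = {"t1": "aaa", "t2": "xxx", "t4": "ddd"}
result = find_mismatches(raw_hashes, bronze_hashes)
```

Two details carry over to the Spark version. The explicit null check matters because comparing a value with null using `!=` yields null in Spark SQL, and a filter treats null as false. And `t4`, which exists only in Bronze, is not reported, since a left join from the raw side never surfaces Bronze-only rows; extras of that kind would more likely show up as a negative count_diff in Step 1.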