This notebook follows the Medallion Architecture to process raw transaction item data from the Bronze layer into a structured, validated, and deduplicated Silver table. It focuses on maintaining the relationship between transactions and the individual items sold, ensuring data integrity for financial reporting.

##Silver Layer: Transaction Items Transformation

**Notebook Objective:** This notebook automates the cleaning and validation of transaction-level line items. 

It ensures that every item is linked to a valid transaction, handles price calculations, and implements a "Quarantine" pattern for records that fail business logic.

##1. Initial Data Profiling (Bronze Layer)

We begin by investigating the raw data to identify missing identifiers or invalid financial figures that would corrupt downstream analytics.

In [0]:
%sql
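-- 1. Check that subtotal matches quantity * unit_price
-- Rounded to 2 decimals so floating-point noise is not reported as a mismatch
SELECT * FROM `vstone-catalog`.bronze_schema.transactions_items_bronze 
WHERE round(cast(quantity as double) * cast(unit_price as double), 2) != round(cast(subtotal as double), 2);

-- 2. Check for NULLs in critical identifiers (foreign keys)
SELECT count(*) FROM `vstone-catalog`.bronze_schema.transactions_items_bronze 
WHERE transaction_id IS NULL OR item_id IS NULL;

-- 3. Validate price and quantity
-- Industry standard: neither price nor quantity should be zero or negative
SELECT count(*) FROM `vstone-catalog`.bronze_schema.transactions_items_bronze 
WHERE cast(quantity as double) <= 0 OR cast(unit_price as double) <= 0;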

##2. Configuration & Schema Initialization
We define the Unity Catalog paths and make sure the destination schema exists. The catalog name vstone-catalog contains a hyphen, so it must be quoted with backticks wherever it appears in SQL.

In [0]:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# --- 2. CONFIGURATION ---
# Backticks are required because of the hyphen in the catalog name
CATALOG = "`vstone-catalog`"
SILVER_SCHEMA = "silver_schema"
BRONZE_TABLE = f"{CATALOG}.bronze_schema.transactions_items_bronze"
SILVER_TABLE = f"{CATALOG}.{SILVER_SCHEMA}.silver_transaction_items"
QUARANTINE_TABLE = f"{CATALOG}.{SILVER_SCHEMA}.quarantine_transaction_items"
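
The quoting rule can be sketched as a small helper that wraps each part of a three-level name in backticks. The names quote_ident and full_table_name are hypothetical and not part of this notebook, and doubling an embedded backtick is an assumption about Spark SQL's escaping of delimited identifiers that is worth confirming on your runtime.

```python
def quote_ident(name: str) -> str:
    """Wrap an identifier in backticks (hypothetical helper).

    Assumption: an embedded backtick is escaped by doubling it.
    """
    return "`" + name.replace("`", "``") + "`"


def full_table_name(catalog: str, schema: str, table: str) -> str:
    """Build a fully quoted catalog.schema.table name."""
    return ".".join(quote_ident(part) for part in (catalog, schema, table))


silver_name = full_table_name("vstone-catalog", "silver_schema", "silver_transaction_items")
print(silver_name)
```

Quoting every part is harmless for plain names and avoids having to remember which parts contain special characters.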



In [0]:
# Initialize schema
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {CATALOG}.{SILVER_SCHEMA}")

##3. Data Ingestion & Header Standardization
To maintain a clean data lake, we standardize all column headers to *snake_case* by trimming whitespace, lowercasing, and replacing spaces with underscores. This keeps our Spark transformations and future SQL queries consistent.

In [0]:


# --- 3. LOAD & STANDARDIZE ---
df_bronze = spark.read.table(BRONZE_TABLE)

# Trim first, then lowercase and replace spaces with underscores.
# Stripping after the replace would leave leading/trailing underscores behind.
standardized_cols = [col.strip().lower().replace(" ", "_") for col in df_bronze.columns]
df_standardized = df_bronze.toDF(*standardized_cols)
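
The header rule can be checked without a Spark session. The sketch below applies the same trim, lowercase, and replace steps to a few made-up column names; standardize_column and the sample headers are illustrative assumptions, not names from the Bronze table.

```python
def standardize_column(name: str) -> str:
    """Trim surrounding whitespace, lowercase, and turn spaces into underscores."""
    return name.strip().lower().replace(" ", "_")


# Hypothetical raw headers with stray whitespace and mixed case
raw_columns = [" Transaction ID", "Unit Price ", "subtotal"]
standardized = [standardize_column(c) for c in raw_columns]
print(standardized)  # ['transaction_id', 'unit_price', 'subtotal']

# Order matters: replacing before stripping turns the leading space into "_"
wrong_order = " Transaction ID".lower().replace(" ", "_").strip()
print(wrong_order)  # _transaction_id
```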

##4. Quality Gates & Quarantine Logic
Transaction items are the most granular level of financial data. 

We apply strict "Quality Gates" to catch errors like missing IDs or invalid pricing before they reach the Silver table.

In [0]:


# --- 4. DATA TYPING & QUALITY GATES ---
df_casted = df_standardized.select(
    "transaction_id", 
    "item_id", 
    "source", 
    "load_dt",
    F.col("quantity").cast("double").alias("quantity"),
    F.col("unit_price").cast("double").alias("unit_price"),
    F.col("subtotal").cast("double").alias("subtotal"),
    F.to_timestamp(F.col("created_at")).alias("created_at")
)

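# --- 4b. QUALITY GATES ---
# Business Rules: 
# 1. Transaction and Item IDs must exist.
# 2. Financials: Price and Quantity must be present and positive.
# 3. Subtotal must equal quantity * unit_price (compared at 2 decimal places).
id_valid = F.col("transaction_id").isNotNull() & F.col("item_id").isNotNull()
amounts_present = (
    F.col("quantity").isNotNull()
    & F.col("unit_price").isNotNull()
    & F.col("subtotal").isNotNull()
)
qty_valid = F.col("quantity") > 0
price_valid = F.col("unit_price") > 0
math_valid = F.round(F.col("quantity") * F.col("unit_price"), 2) == F.round(F.col("subtotal"), 2)

# A comparison against NULL yields NULL, and both filter(mask) and filter(~mask)
# drop NULL rows. Coalescing to False routes those rows to quarantine instead.
valid_mask = F.coalesce(
    id_valid & amounts_present & qty_valid & price_valid & math_valid, F.lit(False)
)

# Divert failed records to a Quarantine table for auditing
df_quarantine = df_casted.filter(~valid_mask) \
    .withColumn("quarantine_reason", 
        F.when(~id_valid, "MISSING_MANDATORY_ID")
         .when(~amounts_present, "MISSING_OR_INVALID_AMOUNT")
         .when(~qty_valid, "ZERO_OR_NEGATIVE_QUANTITY")
         .when(~price_valid, "ZERO_OR_NEGATIVE_PRICE")
         .otherwise("SUBTOTAL_MISMATCH")) \
    .withColumn("quarantined_at", F.current_timestamp())

# Keep only the valid records
df_clean = df_casted.filter(valid_mask)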
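
The subtotal check rounds both sides to two decimals before comparing. A plain-Python sketch with made-up values shows why: doubles are binary floating-point numbers, so an exact equality test can reject a line item whose arithmetic is correct.

```python
# Hypothetical line item: 3 units at 0.10 each, subtotal 0.30
quantity, unit_price, subtotal = 3.0, 0.1, 0.3

# Exact comparison fails because 3.0 * 0.1 is not exactly 0.3 in binary floating point
exact_match = (quantity * unit_price == subtotal)

# Comparing at 2 decimal places, as math_valid does, accepts the row
rounded_match = (round(quantity * unit_price, 2) == round(subtotal, 2))

print(exact_match)    # False
print(rounded_match)  # True
```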

##5. Deduplication & Final Transformation
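We use a **Window function** to keep only the latest version of each line item, ranked by load_dt. The numeric columns were already cast to double in the previous step; because double is a binary floating-point type, a decimal type would be the safer choice wherever exact currency arithmetic is required.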

In [0]:


# --- 5. DEDUPLICATION & NORMALIZATION ---
# Logic: Partition by the unique combination of transaction and item
window_spec = Window.partitionBy("transaction_id", "item_id").orderBy(F.col("load_dt").desc())

df_silver_final = df_clean.withColumn("row_rank", F.row_number().over(window_spec)) \
    .filter("row_rank == 1").drop("row_rank")
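
The keep-latest rule can be sketched in plain Python: group rows by (transaction_id, item_id) and retain the one with the greatest load_dt. The sample rows and the keep_latest helper are illustrative assumptions, not data or code from the pipeline.

```python
from datetime import datetime

# Hypothetical rows: line item (T1, A) was loaded twice
rows = [
    {"transaction_id": "T1", "item_id": "A", "load_dt": datetime(2024, 1, 1), "quantity": 1.0},
    {"transaction_id": "T1", "item_id": "A", "load_dt": datetime(2024, 1, 2), "quantity": 2.0},
    {"transaction_id": "T1", "item_id": "B", "load_dt": datetime(2024, 1, 1), "quantity": 5.0},
]


def keep_latest(rows):
    """Keep one row per (transaction_id, item_id): the one with the newest load_dt."""
    latest = {}
    for row in rows:
        key = (row["transaction_id"], row["item_id"])
        if key not in latest or row["load_dt"] > latest[key]["load_dt"]:
            latest[key] = row
    return list(latest.values())


deduped = keep_latest(rows)
print(len(deduped))  # 2
```

If two versions of a line item share the same load_dt, row_number picks one of them arbitrarily, so adding a second ordering column is worth considering when ties are possible.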

##6. Atomic Writes & Table Constraints
The data is committed using the Delta Lake format, which supports ACID transactions. We apply hard constraints to the table to prevent future "dirty" data from being inserted.

In [0]:


# --- 6. ATOMIC WRITES ---
# Write Quarantine records
df_quarantine.write.format("delta").mode("append") \
    .option("mergeSchema", "true") \
    .saveAsTable(QUARANTINE_TABLE)

# Write Silver records
df_silver_final.write.format("delta").mode("overwrite") \
    .option("overwriteSchema", "true") \
    .saveAsTable(SILVER_TABLE)


In [0]:

# --- 6b. APPLY CONSTRAINTS ---
# Ensure primary keys are never null in the Silver layer

spark.sql(f"ALTER TABLE {SILVER_TABLE} ALTER COLUMN transaction_id SET NOT NULL")
spark.sql(f"ALTER TABLE {SILVER_TABLE} ALTER COLUMN item_id SET NOT NULL")

try:
    # Adding a check constraint for data integrity
    spark.sql(f"ALTER TABLE {SILVER_TABLE} ADD CONSTRAINT check_subtotal_pos CHECK (subtotal >= 0)")
except Exception as e:
    print(f"Note: Constraint might already exist or failed: {e}")

print(f"Process complete. Silver table {SILVER_TABLE} updated.")

In [0]:
%sql
-- View the audit trail
DESCRIBE HISTORY `vstone-catalog`.silver_schema.silver_transaction_items;



In [0]:
%sql
-- Time travel: query the table as it was at a given timestamp
-- (pick a timestamp that appears in the DESCRIBE HISTORY output)
SELECT * FROM `vstone-catalog`.silver_schema.silver_transaction_items 
TIMESTAMP AS OF '2026-01-12 15:34:38';

##Industry Logic & Standards
**1. Granular Integrity**
In the retail industry, a single transaction can contain multiple "line items". By deduplicating on the combination of transaction_id and item_id, we keep one row per line item, so total sales quantity is not inflated by repeated loads.

**2. The Quarantine Audit Trail**
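Rather than deleting records that fail a quality gate (for example, a non-positive quantity), we move them to quarantine_transaction_items with a quarantine_reason and a quarantined_at timestamp. This allows the finance team to investigate whether they were "test" transactions or actual system bugs.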

**3. Delta Lake Constraints**
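By using ALTER TABLE ... SET NOT NULL and ALTER TABLE ... ADD CONSTRAINT, we treat the Delta table like a traditional relational database. Delta enforces these constraints on every write, so a violating insert fails instead of silently landing in the table. This "Schema-on-Write" approach is a common safeguard against data corruption in enterprise environments.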

**4. Idempotency**
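The Silver write is designed to be idempotent. Because it uses .mode("overwrite"), re-running the pipeline replaces the table contents instead of creating duplicate records or inflating totals. The quarantine write uses .mode("append"), so each re-run adds the same rejected rows again; deduplicate or overwrite that table if the audit trail must not contain repeats.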
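
The effect of the two write modes used in section 6 can be sketched with in-memory lists standing in for tables. The run_pipeline function and the sample batch are illustrative assumptions; the only point is how "overwrite" and "append" behave when the same batch is processed twice.

```python
def run_pipeline(batch, silver, quarantine):
    """Simulate one notebook run against two in-memory 'tables'."""
    valid = [row for row in batch if row["quantity"] > 0]
    rejected = [row for row in batch if row["quantity"] <= 0]
    silver[:] = valid            # overwrite: replace the full contents
    quarantine.extend(rejected)  # append: add to whatever is already there


# Hypothetical batch with one valid and one invalid line item
batch = [
    {"transaction_id": "T1", "item_id": "A", "quantity": 2},
    {"transaction_id": "T1", "item_id": "B", "quantity": 0},
]
silver_rows, quarantine_rows = [], []

run_pipeline(batch, silver_rows, quarantine_rows)
run_pipeline(batch, silver_rows, quarantine_rows)  # re-run the same batch

print(len(silver_rows))      # 1
print(len(quarantine_rows))  # 2
```

The overwritten target holds the same single row after both runs, while the appended target grows with every re-run.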