This notebook follows the Medallion Architecture to transform raw voucher and discount data from the Bronze layer into a high-quality, validated Silver table. The process focuses on cleaning date formats, standardizing discount values, and implementing data quality gates to ensure that only valid, usable vouchers are available for the Gold layer.

##Silver Layer: Vouchers Transformation
**Notebook Objective:** 
This notebook refines promotional voucher data. It ensures that every voucher has a valid identifier and discount value, standardizes validity dates for time-sensitive analysis, and redirects malformed records to a quarantine table for auditing.

##1. Initial Data Profiling (Bronze Layer)
We start by performing a "health check" on the raw data to identify reversed date ranges, out-of-range discount values, and duplicate voucher codes that would disrupt financial reporting.

In [0]:
%sql
-- 1. Identify "Logical Date Errors" (Start date after End date)
SELECT count(*) 
FROM `vstone-catalog`.bronze_schema.bronze_vouchers 
WHERE valid_from > valid_to;

-- 2. Validate financial logic: count discounts outside the 0-100 range
SELECT count(*) 
FROM `vstone-catalog`.bronze_schema.bronze_vouchers 
WHERE discount_value > 100 OR discount_value < 0;

-- 3. Check for Voucher Code uniqueness
SELECT voucher_code, count(*) 
FROM `vstone-catalog`.bronze_schema.bronze_vouchers 
GROUP BY voucher_code HAVING count(*) > 1;

##2. Configuration & Environment Setup
We define the naming conventions for our Delta tables using Unity Catalog standards and initialize the destination schema to ensure a self-contained pipeline.

In [0]:
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

# --- 1. CONFIGURATION ---
# Using backticks to handle hyphens in the catalog name for Spark SQL compliance
CATALOG = "`vstone-catalog`"
SILVER_SCHEMA = "silver_schema"
BRONZE_TABLE = f"{CATALOG}.bronze_schema.bronze_vouchers"
SILVER_TABLE = f"{CATALOG}.{SILVER_SCHEMA}.silver_vouchers"
QUARANTINE_TABLE = f"{CATALOG}.{SILVER_SCHEMA}.quarantine_vouchers"

# Bootstrap: Ensure the Silver schema exists before processing
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {CATALOG}.{SILVER_SCHEMA}")

# --- 2. PANDAS UDF FOR STANDARDIZATION ---
@pandas_udf(StringType())
def clean_voucher_code_udf(code_series: pd.Series) -> pd.Series:
    """Removes whitespace and forces uppercase for voucher codes."""
    return code_series.str.strip().str.upper()


##3. Data Ingestion & Header Standardization
To maintain a clean and searchable Data Lakehouse, we standardize all column headers into snake_case.

In [0]:

# --- 3. LOAD & STANDARDIZE HEADERS ---
df_bronze = spark.read.table(BRONZE_TABLE)

# Trim each column name, convert it to lowercase, and replace spaces with underscores.
# Trimming first keeps leading/trailing spaces from turning into stray underscores.
standardized_cols = [col.strip().lower().replace(" ", "_") for col in df_bronze.columns]
df_standardized = df_bronze.toDF(*standardized_cols)
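
The header rule above can be tried outside Spark. The following is a minimal sketch in plain Python, not the notebook's own code: `to_snake_case` and the sample headers are hypothetical, and unlike the cell above it collapses runs of whitespace into a single underscore.

```python
import re


def to_snake_case(name: str) -> str:
    """Normalize one column header: trim, lowercase, whitespace runs -> '_'."""
    # Trim first so leading/trailing spaces are not converted to underscores.
    return re.sub(r"\s+", "_", name.strip().lower())


# Hypothetical raw headers, for illustration only.
raw_headers = [" Voucher ID", "Discount Value ", "valid_from"]
clean_headers = [to_snake_case(h) for h in raw_headers]
print(clean_headers)  # ['voucher_id', 'discount_value', 'valid_from']
```

A helper like this could be applied to `df_bronze.columns` before calling `toDF`, assuming no two headers normalize to the same name.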


##4. Quality Gates & Quarantine Logic
In professional data engineering, we never discard data. Instead, we use a **Quarantine Pattern** to divert records that fail business rules (like missing IDs or invalid discount amounts) for later review.

In [0]:

# --- 3. QUALITY GATES ---
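# Cast columns to their target types before applying the rules
df_prepared = df_standardized.withColumn("discount_value", F.col("discount_value").cast("double")) \
                             .withColumn("valid_from", F.to_date("valid_from")) \
                             .withColumn("valid_to", F.to_date("valid_to"))

# Business Rules:
# 1. voucher_id and voucher_code must exist.
# 2. valid_from must not be later than valid_to.
# 3. discount_value must not be negative.
# A comparison against NULL yields NULL, and a NULL mask would drop the row from
# both outputs, so each comparison is coalesced to False to route such rows to quarantine.
date_logic_valid = F.coalesce(F.col("valid_from") <= F.col("valid_to"), F.lit(False))
id_valid = (F.col("voucher_id").isNotNull()) & (F.col("voucher_code").isNotNull())
value_valid = F.coalesce(F.col("discount_value") >= 0, F.lit(False))

valid_mask = date_logic_valid & id_valid & value_valid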

# Redirect failed records to a Quarantine table with a reason code
df_quarantine = df_prepared.filter(~valid_mask) \
    .withColumn("quarantine_reason", 
        F.when(~id_valid, "MISSING_ID_OR_CODE")
         .when(~date_logic_valid, "DATE_LOGIC_ERROR")
         .otherwise("INVALID_DISCOUNT_VALUE")) \
    .withColumn("quarantined_at", F.current_timestamp())

# Proceed with clean data only
df_clean = df_prepared.filter(valid_mask)
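
The routing logic can be modelled without Spark to check that every record lands in exactly one output. This is a minimal sketch in plain Python rather than the notebook's implementation: `quarantine_reason` and the sample records are hypothetical, and treating a missing date or discount as a failure is an assumption. The reason codes and their precedence mirror the cell above.

```python
from datetime import date
from typing import Optional


def quarantine_reason(rec: dict) -> Optional[str]:
    """Return a reason code for a failing record, or None if it passes every gate."""
    # Rule 1: both identifiers must be present.
    if rec.get("voucher_id") is None or rec.get("voucher_code") is None:
        return "MISSING_ID_OR_CODE"
    # Rule 2: both dates must be present and ordered (missing counts as failure).
    start, end = rec.get("valid_from"), rec.get("valid_to")
    if start is None or end is None or start > end:
        return "DATE_LOGIC_ERROR"
    # Rule 3: discount must be present and non-negative.
    value = rec.get("discount_value")
    if value is None or value < 0:
        return "INVALID_DISCOUNT_VALUE"
    return None


# Hypothetical sample records, one per outcome.
ok_dates = {"valid_from": date(2024, 1, 1), "valid_to": date(2024, 12, 31)}
records = [
    {"voucher_id": 1, "voucher_code": "SAVE10", "discount_value": 10.0, **ok_dates},
    {"voucher_id": None, "voucher_code": "NOID", "discount_value": 5.0, **ok_dates},
    {"voucher_id": 3, "voucher_code": "BACKWARDS", "discount_value": 5.0,
     "valid_from": date(2024, 6, 1), "valid_to": date(2024, 1, 1)},
    {"voucher_id": 4, "voucher_code": "NODISC", "discount_value": None, **ok_dates},
]

reasons = [quarantine_reason(r) for r in records]
clean = [r for r, why in zip(records, reasons) if why is None]
quarantined = [
    {**r, "quarantine_reason": why} for r, why in zip(records, reasons) if why is not None
]

# Nothing is discarded: the two outputs together account for every input record.
assert len(clean) + len(quarantined) == len(records)
```

The final assertion is the property worth checking against the real tables as well: the Silver candidate count plus the quarantine count should equal the Bronze count for the same run.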


##5. Deduplication & Final Transformation
Vouchers may have multiple updates in the raw system. We use a Window function to ensure each voucher_id is unique in Silver, keeping only the most recent entry.

In [0]:

# --- 4. DEDUPLICATION & NORMALIZATION ---
# Logic: Partition by voucher_id and keep the latest record based on load_dt
window_spec = Window.partitionBy("voucher_id").orderBy(F.col("load_dt").desc())

df_silver_final = df_clean.withColumn("row_rank", F.row_number().over(window_spec)) \
    .filter("row_rank == 1") \
    .drop("row_rank") \
    .withColumn("voucher_code", clean_voucher_code_udf(F.col("voucher_code"))) \
    .withColumn("discount_value", F.round(F.col("discount_value"), 2)) \
    .withColumn("load_dt", F.to_timestamp(F.col("load_dt")))
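
The keep-latest rule behind the window function can be illustrated in plain Python. This is a sketch under stated assumptions: `latest_per_key` and the sample rows are hypothetical, and `load_dt` is assumed to be an ISO-style string so that string comparison matches chronological order.

```python
def latest_per_key(rows, key="voucher_id", order_by="load_dt"):
    """Keep, for each key, the row with the greatest order_by value."""
    latest = {}
    for row in rows:
        current = latest.get(row[key])
        # On a tie the first row seen is kept; only a strictly newer row replaces it.
        if current is None or row[order_by] > current[order_by]:
            latest[row[key]] = row
    return list(latest.values())


# Hypothetical rows: voucher 101 was loaded twice with different discounts.
rows = [
    {"voucher_id": 101, "discount_value": 10.0, "load_dt": "2024-01-01 08:00:00"},
    {"voucher_id": 101, "discount_value": 15.0, "load_dt": "2024-02-01 08:00:00"},
    {"voucher_id": 102, "discount_value": 5.0, "load_dt": "2024-01-15 08:00:00"},
]

deduped = latest_per_key(rows)
by_id = {r["voucher_id"]: r["discount_value"] for r in deduped}
print(by_id)  # {101: 15.0, 102: 5.0}
```

In Spark, `row_number()` picks an arbitrary row when two records share the same `load_dt`; adding a tie-breaker column to the `orderBy` would make the result deterministic.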


##6. Atomic Delta Writes & Constraints
The data is committed using the Delta Lake format. We apply storage-level constraints to act as a "firewall," ensuring that future data writes cannot violate our core integrity rules.

In [0]:

# --- 5. ATOMIC WRITES ---
# Write to Quarantine (Append) and Silver (Overwrite)
df_quarantine.write.format("delta").mode("append").option("mergeSchema", "true").saveAsTable(QUARANTINE_TABLE)

df_silver_final.write.format("delta").mode("overwrite").option("overwriteSchema", "true").saveAsTable(SILVER_TABLE)

# --- 6. APPLY DELTA CONSTRAINTS ---
# Ensure primary keys are NOT NULL and business logic is enforced at the storage layer
spark.sql(f"ALTER TABLE {SILVER_TABLE} ALTER COLUMN voucher_id SET NOT NULL")

try:
    spark.sql(f"ALTER TABLE {SILVER_TABLE} ADD CONSTRAINT valid_date_range CHECK (valid_from <= valid_to)")
except Exception as e:
    print(f"Constraint valid_date_range skipped: {e}")

print(f"Silver table {SILVER_TABLE} updated successfully.")

In [0]:
%sql
-- Identify the version before the last update
DESCRIBE HISTORY `vstone-catalog`.silver_schema.silver_vouchers;



In [0]:
%sql
-- Query an earlier version (3 here; pick one from DESCRIBE HISTORY) to see old voucher values
SELECT * FROM `vstone-catalog`.silver_schema.silver_vouchers VERSION AS OF 3
WHERE voucher_id = 101;

##Industry Logics & Standards 
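**Financial Accuracy:** By casting discount_value to a double and rounding to two decimal places, we give the Gold layer consistently scaled values for financial reporting. Because a double is a binary floating-point type, a DECIMAL column would be the safer choice if exact arithmetic is required.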

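**Idempotency:** The Silver table is written with .mode("overwrite"), so the notebook can be re-run multiple times without creating duplicate Silver records or inflating record counts. The Quarantine table is written with .mode("append"), so each re-run adds the rejected rows again; the quarantined_at timestamp distinguishes the runs.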

**Auditability:** The Quarantine table provides a full audit trail of "bad data." This allows data engineers to identify if a specific source system is consistently sending incorrect voucher codes.

**Temporal Precision:** Standardizing valid_from and valid_to to a date type and load_dt to a timestamp is critical for point-in-time analysis, such as "How many vouchers were valid on January 1st?"
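
The point-in-time question can be expressed as a simple range check once the validity columns are real dates. This is a minimal sketch in plain Python, not part of the notebook: `valid_on` and the sample vouchers are hypothetical, and treating both boundary dates as inclusive is an assumption.

```python
from datetime import date


def valid_on(voucher: dict, day: date) -> bool:
    """True if the voucher's validity window contains the given day (inclusive)."""
    return voucher["valid_from"] <= day <= voucher["valid_to"]


# Hypothetical vouchers, for illustration only.
vouchers = [
    {"voucher_id": 1, "valid_from": date(2023, 12, 1), "valid_to": date(2024, 1, 31)},
    {"voucher_id": 2, "valid_from": date(2024, 1, 2), "valid_to": date(2024, 3, 1)},
    {"voucher_id": 3, "valid_from": date(2023, 6, 1), "valid_to": date(2023, 12, 31)},
    {"voucher_id": 4, "valid_from": date(2024, 1, 1), "valid_to": date(2024, 1, 1)},
]

as_of = date(2024, 1, 1)
valid_ids = [v["voucher_id"] for v in vouchers if valid_on(v, as_of)]
print(len(valid_ids), valid_ids)  # 2 [1, 4]
```

The same predicate maps onto a SQL filter of the form `valid_from <= d AND valid_to >= d`, which only behaves correctly when both columns are dates rather than free-form strings.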