This notebook follows the Medallion Architecture to transform raw payment method data into a structured, high-quality Silver table. It focuses on data profiling, header standardization, deduplication, and the implementation of a "Quarantine" pattern for data governance.

##Silver Layer: Payment Methods Transformation

**Notebook Objective:** This notebook cleans and validates payment method data. It ensures that the final dataset is deduplicated, follows naming standards, and passes the business rules (Data Quality Gates) before it is used for financial reporting.

##1. Initial Data Profiling (Bronze Layer)

Before processing, we perform "smoke tests" using SQL to understand the state of the raw data. This helps identify null values, duplicates, and category distributions.

In [0]:
%sql
-- 1. Check for Nulls in critical columns to determine if a quarantine is needed
SELECT 
  count(*) - count(method_id) AS missing_ids,
  count(*) - count(method_name) AS missing_names
FROM `vstone-catalog`.bronze_schema.bronze_paymentmethods;

-- 2. Check for Duplicate method_ids; Silver requires a unique grain per ID
SELECT method_id, count(*) 
FROM `vstone-catalog`.bronze_schema.bronze_paymentmethods 
GROUP BY method_id HAVING count(*) > 1;

-- 3. Check for Category distribution to ensure data completeness
SELECT category, count(*) 
FROM `vstone-catalog`.bronze_schema.bronze_paymentmethods 
GROUP BY category;
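
The three checks above can also be run on a small in-memory sample, which is handy for testing the profiling logic without a cluster. The following is a minimal plain-Python sketch, not part of the original notebook; the function name `profile_rows` and the sample rows are hypothetical.

```python
from collections import Counter

def profile_rows(rows):
    """Mirror the three Bronze smoke tests on a list of dict rows."""
    # 1. Null counts, equivalent to count(*) - count(col)
    missing_ids = sum(1 for r in rows if r.get("method_id") is None)
    missing_names = sum(1 for r in rows if r.get("method_name") is None)

    # 2. IDs that appear more than once (NULLs form one group, as in GROUP BY)
    id_counts = Counter(r.get("method_id") for r in rows)
    duplicate_ids = {k: n for k, n in id_counts.items() if n > 1}

    # 3. Row count per category
    category_counts = dict(Counter(r.get("category") for r in rows))

    return {
        "missing_ids": missing_ids,
        "missing_names": missing_names,
        "duplicate_ids": duplicate_ids,
        "category_counts": category_counts,
    }

# Hypothetical sample data, not taken from the real Bronze table
sample_rows = [
    {"method_id": 1, "method_name": "visa", "category": "card"},
    {"method_id": 1, "method_name": "visa", "category": "card"},
    {"method_id": 2, "method_name": None, "category": "wallet"},
    {"method_id": None, "method_name": "cash", "category": "cash"},
]
profile = profile_rows(sample_rows)
```

A non-zero `missing_ids` suggests the quarantine step is needed, and a non-empty `duplicate_ids` suggests the deduplication step is needed.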

##2. Configuration & Standardization

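We define the table names used throughout the notebook and implement a Pandas UDF (a vectorized User Defined Function) for standardizing string values. A Pandas UDF processes data in Arrow batches, which is generally faster than a row-at-a-time Python UDF, although Spark's built-in string functions are usually faster still. Note that the UDF is defined here but is not applied in the cells shown in this notebook; column names are standardized with plain Python in section 3.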

In [0]:
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

# --- 1. CONFIGURATION ---
# Note: Backticks are required because of the hyphen in the catalog name
CATALOG = "`vstone-catalog`"
SILVER_SCHEMA = "silver_schema"
BRONZE_TABLE = f"{CATALOG}.bronze_schema.bronze_paymentmethods"
SILVER_TABLE = f"{CATALOG}.{SILVER_SCHEMA}.silver_paymentmethods"
QUARANTINE_TABLE = f"{CATALOG}.{SILVER_SCHEMA}.quarantine_paymentmethods"

# Initialize Environment
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {CATALOG}.{SILVER_SCHEMA}")

# --- 2. PANDAS UDF FOR COLUMN STANDARDIZATION ---
@pandas_udf(StringType())
def standardize_header_udf(col_series: pd.Series) -> pd.Series:
    """Standardizes string inputs to lowercase snake_case."""
    return col_series.str.lower().str.replace(r'[^a-zA-Z0-9]', '_', regex=True).str.strip('_')
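
To see what the regex in standardize_header_udf does to individual values, the same three steps can be run in plain Python. This is an illustrative sketch only: `standardize_value` and the sample strings are hypothetical, and returning `None` for missing input is an assumption made for the sketch.

```python
import re

def standardize_value(value):
    """Plain-Python equivalent of the UDF body for a single string."""
    if value is None:
        return None  # assumption: missing values pass through unchanged
    lowered = value.lower()
    # After lower(), only a-z and 0-9 count as "keep" characters
    replaced = re.sub(r"[^a-z0-9]", "_", lowered)
    # Trim underscores left over from leading/trailing separators
    return replaced.strip("_")

examples = {
    raw: standardize_value(raw)
    for raw in ["Credit Card", " Pay-Pal ", "Apple  Pay"]
}
```

Consecutive separators are not collapsed by this pattern: "Apple  Pay" (two spaces) becomes "apple__pay". Using `[^a-z0-9]+` instead would collapse each run into a single underscore, if that is the desired behavior.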


##3. Data Cleaning & Header Normalization

Consistent column names (no spaces or hyphens) avoid the need for quoting in SQL and reduce friction with BI tools. We rename headers to snake_case for that reason.

In [0]:
# --- 3. LOAD & STANDARDIZE HEADERS ---
df_bronze = spark.read.table(BRONZE_TABLE)

# Apply standardization to the column names themselves
# Column names are plain Python strings on the driver, so a list comprehension is enough;
# the Pandas UDF above is meant for values inside columns
standardized_col_names = [
    col.lower().strip().replace(" ", "_").replace("-", "_") 
    for col in df_bronze.columns
]
df_standardized = df_bronze.toDF(*standardized_col_names)
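
One risk with renaming headers is that two distinct raw names can collapse to the same standardized name, which could leave df_standardized with ambiguous column names. A small check before calling toDF can catch this. The sketch below is an assumption about how such a check might look; the helper names and the sample headers are hypothetical.

```python
from collections import defaultdict

def standardize_header(name):
    """Same rule as the list comprehension in the cell above."""
    return name.lower().strip().replace(" ", "_").replace("-", "_")

def find_header_collisions(columns):
    """Map each standardized name to the raw names that produce it,
    keeping only names produced by more than one raw header."""
    groups = defaultdict(list)
    for raw in columns:
        groups[standardize_header(raw)].append(raw)
    return {std: raws for std, raws in groups.items() if len(raws) > 1}

# Hypothetical raw headers, chosen to trigger a collision
raw_columns = ["Method ID", "method-id", "Method Name", "category"]
collisions = find_header_collisions(raw_columns)
```

In the notebook, this could be called as `find_header_collisions(df_bronze.columns)`, failing fast if the result is non-empty.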

##4. Quality Gates & Quarantine Pattern
To keep bad records from failing the pipeline or polluting the Silver table, we use a Quarantine pattern. Invalid records (those with a missing method_id) are redirected to a separate table for auditing, while valid records proceed to the Silver table.

In [0]:
# --- 4. QUALITY GATES & QUARANTINE ---
# Business Rule: method_id is our primary key and must NOT be null
is_valid = F.col("method_id").isNotNull()

# Isolate malformed records for the Data Quality team to review
df_quarantine = df_standardized.filter(~is_valid) \
    .withColumn("quarantine_reason", F.lit("MISSING_METHOD_ID")) \
    .withColumn("quarantined_at", F.current_timestamp())

# Filter only clean data for the Silver table
df_clean = df_standardized.filter(is_valid)
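
The split into clean and quarantined records can be sketched in plain Python to make the rule explicit and easy to test. This is a minimal sketch, not the notebook's Spark code; `split_by_quality_gate` and the sample rows are hypothetical.

```python
from datetime import datetime, timezone

QUARANTINE_REASON = "MISSING_METHOD_ID"

def split_by_quality_gate(rows):
    """Return (clean, quarantined); every input row lands in exactly one list."""
    now = datetime.now(timezone.utc)
    clean, quarantined = [], []
    for row in rows:
        if row.get("method_id") is not None:
            clean.append(row)
        else:
            # Tag the row so reviewers know why and when it was set aside
            quarantined.append(
                {**row, "quarantine_reason": QUARANTINE_REASON, "quarantined_at": now}
            )
    return clean, quarantined

# Hypothetical sample data
incoming = [
    {"method_id": 10, "method_name": "visa"},
    {"method_id": None, "method_name": "unknown"},
]
clean_rows, quarantined_rows = split_by_quality_gate(incoming)
```

Because the two filters in the Spark cell are exact complements (`is_valid` and `~is_valid`), no row is lost or duplicated by the split, which is the property the sketch makes visible.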

##5. Deduplication & Final Transformation
We use a Window Function to handle duplicates. By ranking records by their load timestamp, we ensure that only the "latest version" of a payment method is kept.

In [0]:
# --- 5. DEDUPLICATION & NORMALIZATION ---
# Logic: Partition by ID and keep the most recent record (row_rank == 1)
# Cast load_dt to a timestamp first so the ordering is chronological even if
# Bronze stores it as a string
df_typed = df_clean.withColumn("load_dt", F.to_timestamp(F.col("load_dt")))

window_spec = Window.partitionBy("method_id").orderBy(F.col("load_dt").desc())

df_silver_final = df_typed.withColumn("row_rank", F.row_number().over(window_spec)) \
    .filter("row_rank == 1") \
    .drop("row_rank") \
    .withColumn("method_name", F.initcap(F.col("method_name")))
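
Ordering on load_dt is only meaningful when the values are compared as real timestamps. The plain-Python sketch below shows the keep-latest rule and how string comparison can pick the wrong row. The timestamp format is an assumption (the source does not state how load_dt is stored), and `keep_latest` and the sample rows are hypothetical.

```python
from datetime import datetime

LOAD_DT_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed format, not stated in the source

def keep_latest(rows, key="method_id", ts="load_dt"):
    """Keep one row per key: the one with the greatest parsed timestamp."""
    latest = {}
    for row in rows:
        parsed = datetime.strptime(row[ts], LOAD_DT_FORMAT)
        best = latest.get(row[key])
        if best is None or parsed > best[0]:
            latest[row[key]] = (parsed, row)
    return [row for _, row in latest.values()]

# Hypothetical rows with non-zero-padded dates stored as strings
rows = [
    {"method_id": 1, "method_name": "old", "load_dt": "2026-1-9 08:00:00"},
    {"method_id": 1, "method_name": "new", "load_dt": "2026-1-10 08:00:00"},
]
latest_by_timestamp = keep_latest(rows)
# Comparing the raw strings instead ranks "2026-1-9" above "2026-1-10"
latest_by_string = max(rows, key=lambda r: r["load_dt"])
```

Here the timestamp comparison keeps the row from the 10th, while the string comparison would keep the older row from the 9th.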

##6. Atomic Delta Writes & Constraints
We commit the data using **Delta Lake**, where each write is an atomic transaction. We then apply a NOT NULL constraint with ALTER TABLE. This acts as a "firewall": any later write that contains a null method_id is rejected by the Silver table.

In [0]:

# --- 6. ATOMIC WRITES ---
# Each write below is its own atomic Delta transaction; the two are not committed together
# Append failed records to the audit log
df_quarantine.write.format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .saveAsTable(QUARANTINE_TABLE)

# Overwrite Silver table with clean, deduplicated data
df_silver_final.write.format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .saveAsTable(SILVER_TABLE)

# --- 7. APPLY DELTA CONSTRAINTS ---
# Enforce NOT NULL at the storage level for absolute data integrity
spark.sql(f"ALTER TABLE {SILVER_TABLE} CHANGE COLUMN method_id SET NOT NULL")

print(f"Success: {SILVER_TABLE} and {QUARANTINE_TABLE} have been updated.")
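
Because the quarantine write uses append mode and the notebook re-reads the full Bronze table, re-running it may append the same invalid rows again if they are still present in Bronze. One possible mitigation, sketched here in plain Python as an assumption rather than something the notebook does, is to fingerprint each quarantined row and skip fingerprints that were already recorded. The names `row_fingerprint` and `new_quarantine_rows` are hypothetical.

```python
import hashlib
import json

def row_fingerprint(row, ignore=("quarantined_at",)):
    """Stable hash of a row's content, ignoring run-specific columns."""
    payload = {k: v for k, v in row.items() if k not in ignore}
    encoded = json.dumps(payload, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()

def new_quarantine_rows(candidates, seen_fingerprints):
    """Return only the candidates whose fingerprint has not been seen yet."""
    seen = set(seen_fingerprints)
    fresh = []
    for row in candidates:
        fp = row_fingerprint(row)
        if fp not in seen:
            seen.add(fp)
            fresh.append(row)
    return fresh

# Hypothetical quarantined rows from two consecutive runs over the same Bronze data
run_1 = [{"method_id": None, "method_name": "unknown", "quarantined_at": "run-1"}]
run_2 = [{"method_id": None, "method_name": "unknown", "quarantined_at": "run-2"}]

first_append = new_quarantine_rows(run_1, seen_fingerprints=[])
already_seen = [row_fingerprint(r) for r in first_append]
second_append = new_quarantine_rows(run_2, seen_fingerprints=already_seen)
```

In this sample the second run appends nothing, because the only difference between the rows is the ignored quarantined_at value.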

In [0]:
%sql
DESCRIBE HISTORY `vstone-catalog`.silver_schema.silver_paymentmethods;

In [0]:
%sql
-- Query the table as of the earliest commit timestamp listed by DESCRIBE HISTORY
SELECT * FROM `vstone-catalog`.silver_schema.silver_paymentmethods 
TIMESTAMP AS OF '2026-01-12 16:14:50';

In [0]:
%sql
-- This should return 0 rows thanks to our filter and Delta constraints
SELECT * FROM `vstone-catalog`.silver_schema.silver_paymentmethods WHERE method_id IS NULL;

##Industry Logic & Standards
**1. The "Single Version of Truth"**
In the Bronze layer, data might be duplicated because of multiple system exports. In the Silver layer, the logic partitionBy("method_id").orderBy(desc("load_dt")) ensures that for every ID, only the most current record exists. This is critical for financial accuracy.
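
One subtlety: if two records for the same method_id share the same load_dt, row_number() gives no guarantee about which of them is ranked first, so repeated runs may disagree. A deterministic tie-breaker avoids that. The plain-Python sketch below assumes, hypothetically, that method_name is an acceptable secondary sort key; `pick_current` and the sample records are illustrative only.

```python
from datetime import datetime

def pick_current(records):
    """One record per method_id: latest load_dt, ties broken by method_name."""
    current = {}
    for rec in records:
        best = current.get(rec["method_id"])
        rec_key = (rec["load_dt"], rec["method_name"])
        if best is None or rec_key > (best["load_dt"], best["method_name"]):
            current[rec["method_id"]] = rec
    return current

# Hypothetical records with identical load timestamps
tied_a = {"method_id": 7, "method_name": "Visa", "load_dt": datetime(2026, 1, 12, 16, 0)}
tied_b = {"method_id": 7, "method_name": "Visa Debit", "load_dt": datetime(2026, 1, 12, 16, 0)}

# The result no longer depends on input order
result_ab = pick_current([tied_a, tied_b])
result_ba = pick_current([tied_b, tied_a])
```

In Spark, the equivalent would be adding a second expression to the window's orderBy so that ties on load_dt are resolved the same way on every run.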

**2. Schema Evolution vs. Enforcement**
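We use .option("overwriteSchema", "true") on the Silver write. Combined with overwrite mode, this replaces the table schema with the schema of the incoming DataFrame, so the Silver table picks up new columns added to the source system (and loses columns that disappear from it). The quarantine write uses .option("mergeSchema", "true") instead, which adds new columns on append without removing existing ones. The SET NOT NULL statement runs after every Silver write, so even as the schema grows, the primary key is checked for nulls on each run.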
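
The difference between the two schema options can be captured in a simplified model of the resulting column list. This is a sketch of the general behavior under stated assumptions (column types, nullability, and ordering details are ignored); `resulting_columns` is a hypothetical helper and the `provider` column is made up for illustration.

```python
def resulting_columns(existing, incoming, mode):
    """Simplified model of a Delta table's columns after a write."""
    if mode == "overwriteSchema":
        # Overwrite + overwriteSchema: the table takes the incoming schema as-is
        return list(incoming)
    if mode == "mergeSchema":
        # Append + mergeSchema: existing columns stay, new ones are added
        added = [c for c in incoming if c not in existing]
        return list(existing) + added
    raise ValueError(f"unknown mode: {mode}")

existing_cols = ["method_id", "method_name", "category", "load_dt"]
# Hypothetical source change: 'category' dropped, 'provider' added
incoming_cols = ["method_id", "method_name", "load_dt", "provider"]

after_overwrite = resulting_columns(existing_cols, incoming_cols, "overwriteSchema")
after_merge = resulting_columns(existing_cols, incoming_cols, "mergeSchema")
```

Under this model, the overwritten table no longer has `category`, while the merged table keeps it and gains `provider`.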

**3. Time Travel Capability**
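The notebook includes a query that uses TIMESTAMP AS OF. Because this is a Delta table, we can query the state of the payment methods as they existed at a specific point in time. This is commonly used for auditing and for recovering from accidental deletions. Time travel only reaches back as far as the table's retained history: older versions become unavailable once their data files are removed by VACUUM.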
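
Conceptually, TIMESTAMP AS OF resolves to the latest table version committed at or before the requested timestamp. The plain-Python sketch below models that lookup over rows shaped like DESCRIBE HISTORY output (version and timestamp). The history entries are sample data (the second commit time is made up), and `resolve_version` is a hypothetical helper, not a Delta API.

```python
from bisect import bisect_right
from datetime import datetime

def resolve_version(history, as_of):
    """Return the latest version whose commit time is <= as_of."""
    ordered = sorted(history, key=lambda h: h["timestamp"])
    timestamps = [h["timestamp"] for h in ordered]
    idx = bisect_right(timestamps, as_of)
    if idx == 0:
        # Nothing was committed yet at that point in time
        raise ValueError("timestamp is before the earliest available version")
    return ordered[idx - 1]["version"]

# Sample history; only the first timestamp appears in the notebook's query
history = [
    {"version": 0, "timestamp": datetime(2026, 1, 12, 16, 14, 50)},
    {"version": 1, "timestamp": datetime(2026, 1, 13, 9, 0, 0)},
]
version_at_first_commit = resolve_version(history, datetime(2026, 1, 12, 16, 14, 50))
version_between = resolve_version(history, datetime(2026, 1, 12, 20, 0, 0))
version_latest = resolve_version(history, datetime(2026, 1, 14, 0, 0, 0))
```

This also shows why a timestamp earlier than the first commit cannot be queried: there is no version to resolve to.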

**4. Data Governance (Quarantine)**
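By tagging records with a quarantine_reason, we transform a "data failure" into "actionable metadata." Instead of the pipeline crashing, the bad rows land in the quarantine_paymentmethods table, where a data engineer can review them and fix the source system. This notebook does not send alerts itself; any alerting would have to be built on top of the quarantine table.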
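
The quarantine_reason tag is what makes the table actionable, because it can be aggregated to decide whether anyone needs to be notified. A minimal plain-Python sketch follows; the threshold, the function name `summarize_quarantine`, and the sample rows are hypothetical, and the notification channel itself is deliberately left out.

```python
from collections import Counter

def summarize_quarantine(rows, alert_threshold=1):
    """Count rows per quarantine_reason and keep reasons at or above the threshold."""
    counts = Counter(r["quarantine_reason"] for r in rows)
    return {reason: n for reason, n in counts.items() if n >= alert_threshold}

# Hypothetical contents of the quarantine table after one run
quarantined = [
    {"method_id": None, "quarantine_reason": "MISSING_METHOD_ID"},
    {"method_id": None, "quarantine_reason": "MISSING_METHOD_ID"},
]
to_report = summarize_quarantine(quarantined, alert_threshold=2)
nothing_to_report = summarize_quarantine(quarantined, alert_threshold=5)
```

A non-empty result is the signal that a follow-up with the source system owner is warranted.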