This notebook implements the Silver step of a Medallion Architecture: it processes raw user data from the Bronze layer into a high-integrity, analytics-ready Silver table. The logic focuses on standardizing column headers and demographic fields, deduplicating user records, and enforcing data quality through a quarantine mechanism.

##Silver Layer: Users Transformation
**Notebook Objective:** This notebook cleans and validates user account data.

It ensures that every user has a valid identifier and a parseable registration timestamp, deduplicates account records to keep the latest "state" of each user, and enforces a storage-level NOT NULL constraint for data governance.

##1. Initial Data Profiling (Bronze Layer)
We begin by auditing the raw Bronze data to identify rows with rescued (schema-mismatched) data, inconsistent birthdate formats, and missing user identifiers.

In [0]:
%sql
-- 1. Count rows where data was rescued (values that did not fit the expected schema)
SELECT count(*) FROM `vstone-catalog`.bronze_schema.users_bronze WHERE _rescued_data IS NOT NULL;

-- 2. Check for date format inconsistencies
SELECT birthdate, count(*) 
FROM `vstone-catalog`.bronze_schema.users_bronze 
GROUP BY birthdate ORDER BY count(*) DESC LIMIT 5;

-- 3. Check for Null IDs
SELECT count(*) FROM `vstone-catalog`.bronze_schema.users_bronze WHERE user_id IS NULL;

##2. Configuration & Environment Setup
We define the table paths using Unity Catalog's three-level namespace (catalog.schema.table) and create the destination schema if it does not exist, so the notebook is self-contained.

In [0]:
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

# --- 1. CONFIGURATION ---
# Using backticks to escape hyphens in the catalog name
CATALOG = "`vstone-catalog`"
SILVER_SCHEMA = "silver_schema"
BRONZE_TABLE = f"{CATALOG}.bronze_schema.users_bronze"
SILVER_TABLE = f"{CATALOG}.{SILVER_SCHEMA}.silver_users"
QUARANTINE_TABLE = f"{CATALOG}.{SILVER_SCHEMA}.quarantine_users"

# Bootstrap: Ensure the Silver schema exists before processing
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {CATALOG}.{SILVER_SCHEMA}")


##3. Ingestion & Header Standardization
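To ensure cross-table compatibility, we normalize all column headers into snake_case. We also define a pandas UDF that standardizes gender values; it is applied later, during the transformation step.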

In [0]:

# --- PANDAS UDF FOR STANDARDIZATION (applied in step 4) ---
@pandas_udf(StringType())
def standardize_gender_udf(gender_series: pd.Series) -> pd.Series:
    """Standardizes gender strings (e.g., 'M', 'male', 'MALE' -> 'Male')."""
    mapping = {'m': 'Male', 'f': 'Female', 'n': 'Non-binary', 'o': 'Other'}
    # Trim and lowercase, then map on the first character.
    # Caveat: any value starting with m/f/n/o matches (e.g., 'none' -> 'Non-binary').
    clean_series = gender_series.str.strip().str.lower().str[0]
    # Nulls, empty strings, and unmapped values become 'Unknown'
    return clean_series.map(mapping).fillna('Unknown')
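
The mapping rule inside the UDF can be checked without a Spark session. The sketch below is a plain-Python restatement of the same rule; the function name `standardize_gender` is hypothetical and not part of the notebook. It is meant only for quick unit tests of edge cases such as nulls, padding, and unmapped values.

```python
from typing import Optional

# Same mapping as the pandas UDF
GENDER_MAPPING = {"m": "Male", "f": "Female", "n": "Non-binary", "o": "Other"}


def standardize_gender(value: Optional[str]) -> str:
    """Plain-Python restatement of the UDF rule (hypothetical helper)."""
    if value is None:
        return "Unknown"
    cleaned = value.strip().lower()
    if not cleaned:
        return "Unknown"
    # Map on the first character only, exactly as the UDF does
    return GENDER_MAPPING.get(cleaned[0], "Unknown")


samples = ["M", " female ", "NON-BINARY", "", None, "x", "none"]
results = [standardize_gender(s) for s in samples]
print(results)
```

The last sample illustrates a side effect of first-character matching: a value such as "none" is classified as Non-binary. If that is not the intended behavior, an explicit whitelist of accepted values may be a safer rule.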


In [0]:
# --- 2. LOAD & STANDARDIZE HEADERS ---
df_bronze = spark.read.table(BRONZE_TABLE)

# Standardize column names: trim whitespace first, then lowercase and replace spaces with underscores
standardized_cols = [c.strip().lower().replace(" ", "_") for c in df_bronze.columns]
df_standardized = df_bronze.toDF(*standardized_cols)
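
The header rule is easy to test in isolation. The following is a minimal sketch in plain Python, assuming headers differ only in case and whitespace; the helper name `to_snake_case_headers` and the sample headers are hypothetical. As a small variation on the notebook's rule, it collapses runs of whitespace into a single underscore.

```python
import re
from typing import List


def to_snake_case_headers(columns: List[str]) -> List[str]:
    """Trim, lowercase, and collapse whitespace runs into single underscores."""
    return [re.sub(r"\s+", "_", name.strip().lower()) for name in columns]


raw_headers = ["User ID", " Registered At ", "birthdate", "Load  DT"]
normalized = to_snake_case_headers(raw_headers)
print(normalized)

# Two distinct headers can collapse to the same name, which would leave the
# DataFrame with duplicate column names, so check for collisions up front.
assert len(set(normalized)) == len(normalized), "duplicate column names after normalization"
```

Stripping before replacing matters: if spaces are replaced first, a header with leading or trailing whitespace ends up with stray underscores that `strip()` no longer removes.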

##4. Quality Gates & Quarantine Pattern

We implement a quarantine pattern to separate malformed records from valid ones. In production pipelines, this prevents data loss while ensuring that only valid records reach the Silver table.

In [0]:

# --- 3. QUALITY GATES ---
# Business Rules:
# 1. user_id must be present.
# 2. registered_at must parse as a valid timestamp.

id_valid = F.col("user_id").isNotNull()
reg_valid = F.to_timestamp(F.col("registered_at")).isNotNull()

valid_mask = id_valid & reg_valid

# Redirect invalid records to Quarantine table with a reason code
df_quarantine = df_standardized.filter(~valid_mask) \
    .withColumn("quarantine_reason", 
        F.when(~id_valid, "MISSING_USER_ID")
         .otherwise("INVALID_REGISTRATION_DATE")) \
    .withColumn("quarantined_at", F.current_timestamp())
    
# Proceed with clean data
df_clean = df_standardized.filter(valid_mask)
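
The routing logic can be sketched in plain Python to make the precedence of reason codes explicit. This is an illustration under stated assumptions, not the notebook's implementation: the helper names are hypothetical, rows are modeled as dictionaries, and `datetime.fromisoformat` stands in for Spark's `to_timestamp`, which accepts a different set of formats.

```python
from datetime import datetime
from typing import Dict, List, Optional, Tuple

Row = Dict[str, Optional[str]]


def parses_as_timestamp(value: Optional[str]) -> bool:
    """Stand-in for F.to_timestamp(...).isNotNull(); accepts ISO 8601 strings only."""
    if value is None:
        return False
    try:
        datetime.fromisoformat(value)
        return True
    except ValueError:
        return False


def split_rows(rows: List[Row]) -> Tuple[List[Row], List[Row]]:
    """Return (clean, quarantined); quarantined rows carry a reason code."""
    clean, quarantined = [], []
    for row in rows:
        if row.get("user_id") is None:
            # Checked first, so this reason wins when both rules fail
            quarantined.append({**row, "quarantine_reason": "MISSING_USER_ID"})
        elif not parses_as_timestamp(row.get("registered_at")):
            quarantined.append({**row, "quarantine_reason": "INVALID_REGISTRATION_DATE"})
        else:
            clean.append(row)
    return clean, quarantined


rows = [
    {"user_id": "u1", "registered_at": "2024-01-15 10:30:00"},
    {"user_id": None, "registered_at": "2024-01-15 10:30:00"},
    {"user_id": "u2", "registered_at": "not-a-date"},
    {"user_id": None, "registered_at": None},
]
clean, quarantined = split_rows(rows)
print(len(clean), [r["quarantine_reason"] for r in quarantined])
```

Every input row lands in exactly one of the two outputs, which is the property that keeps the quarantine pattern lossless. A row that fails both rules is reported only as MISSING_USER_ID, mirroring the `when(...).otherwise(...)` ordering in the Spark code.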

##5. Deduplication & Transformation
Since user profiles can change over time (e.g., name changes or email updates), we use a Window function to identify the most recent record for each user.

In [0]:


# --- 4. DEDUPLICATION & NORMALIZATION ---
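# Logic: Partition by user_id and keep the record with the latest load_dt.
# load_dt is parsed to a timestamp for ordering so that string values are not compared lexicographically.
# Note: row_number() breaks ties arbitrarily when two rows share the same load_dt.
window_spec = Window.partitionBy("user_id").orderBy(F.to_timestamp(F.col("load_dt")).desc())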

# Check if _rescued_data exists before dropping to prevent errors
drop_cols = ["row_rank"]
if "_rescued_data" in df_clean.columns:
    drop_cols.append("_rescued_data")

df_silver_final = df_clean.withColumn("row_rank", F.row_number().over(window_spec)) \
    .filter("row_rank == 1") \
    .drop(*drop_cols) \
    .withColumn("gender", standardize_gender_udf(F.col("gender"))) \
    .withColumn("birthdate", F.to_date(F.col("birthdate"))) \
    .withColumn("registered_at", F.to_timestamp(F.col("registered_at"))) \
    .withColumn("load_dt", F.to_timestamp(F.col("load_dt")))
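
The "latest record per user" rule is equivalent to keeping, for each user_id, the row with the greatest load_dt. A minimal plain-Python sketch of that rule follows; the helper name `latest_per_user` and the sample rows are hypothetical, and load_dt is assumed to be an ISO 8601 string.

```python
from datetime import datetime
from typing import Dict, List


def latest_per_user(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep, for each user_id, the row with the greatest load_dt."""
    latest: Dict[str, Dict[str, str]] = {}
    for row in rows:
        current = latest.get(row["user_id"])
        # Compare parsed timestamps rather than raw strings
        if current is None or (
            datetime.fromisoformat(row["load_dt"])
            > datetime.fromisoformat(current["load_dt"])
        ):
            latest[row["user_id"]] = row
    return list(latest.values())


rows = [
    {"user_id": "u1", "email": "old@example.com", "load_dt": "2024-01-01 00:00:00"},
    {"user_id": "u1", "email": "new@example.com", "load_dt": "2024-03-01 00:00:00"},
    {"user_id": "u2", "email": "only@example.com", "load_dt": "2024-02-01 00:00:00"},
]
deduped = latest_per_user(rows)
print(deduped)
```

In this sketch a tie on load_dt keeps the first row seen. The Spark version has no such guarantee, so adding a secondary sort key to the window may be worth considering if ties can occur in the Bronze data.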


##6. Atomic Delta Writes & Table Constraints
We commit the data using the Delta Lake format and apply hard constraints to the table. This ensures that any future data ingestion that violates these rules will be blocked at the storage level.

In [0]:

# --- 5. ATOMIC WRITES ---
# Write to Quarantine (Append mode) and Silver (Overwrite mode).
# Each write is its own atomic Delta commit; the two writes are not atomic as a pair.
df_quarantine.write.format("delta").mode("append") \
    .option("mergeSchema", "true") \
    .saveAsTable(QUARANTINE_TABLE)

# Write Silver table (Overwrite for full refresh)
df_silver_final.write.format("delta").mode("overwrite") \
    .option("overwriteSchema", "true") \
    .saveAsTable(SILVER_TABLE)

In [0]:

# --- 6. APPLY DELTA CONSTRAINTS ---
# Enforce that user_id can never be null in the Silver layer
spark.sql(f"ALTER TABLE {SILVER_TABLE} ALTER COLUMN user_id SET NOT NULL")

print(f"Silver table {SILVER_TABLE} updated successfully.")

In [0]:
%sql
-- Review the table's commit history (version, timestamp, operation) to see when the Silver table was written
DESCRIBE HISTORY `vstone-catalog`.silver_schema.silver_users;



In [0]:
%sql
-- Query a specific version to see a user's previous 'gender' or 'birthdate' entry
SELECT * FROM `vstone-catalog`.silver_schema.silver_users VERSION AS OF 1 WHERE user_id = 'usr_123';

##Industry Logic & Standards
**Single Source of Truth:** By deduplicating on load_dt, the Silver table represents the "current state" of each user, which is essential for CRM and personalized marketing.

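**Idempotency:** Writing the Silver table with .mode("overwrite") means that re-running the notebook against unchanged Bronze data replaces the Silver table with the same cleaned results instead of duplicating rows. The quarantine write uses .mode("append"), so a re-run adds the same rejected rows to quarantine_users again unless they are deduplicated downstream.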

**Data Quality Governance:** The Quarantine Pattern provides an audit trail. Instead of data simply "disappearing" because it failed a check, it is stored in quarantine_users so data engineers can fix source-system issues.

**Storage-Level Firewalls:** Using ALTER TABLE ... SET NOT NULL is a proactive data-integrity safeguard. It ensures that no matter which pipeline or user writes to the Silver table in the future, user_id can never be null. NOT NULL does not enforce uniqueness, so the uniqueness of user_id still depends on the deduplication step.
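
A toy model of the two write modes makes the re-run behavior concrete. The sketch below uses an in-memory dictionary as a stand-in for Delta tables; the `write` and `run_pipeline` helpers and the sample rows are hypothetical and exist only for illustration.

```python
from typing import Dict, List

# A toy "catalog": table name -> list of rows (stand-in for Delta tables)
tables: Dict[str, List[dict]] = {}


def write(table: str, rows: List[dict], mode: str) -> None:
    """Mimic the two write modes used in the notebook."""
    if mode == "overwrite":
        tables[table] = list(rows)
    elif mode == "append":
        tables.setdefault(table, []).extend(rows)
    else:
        raise ValueError(f"unsupported mode: {mode}")


def run_pipeline() -> None:
    silver_rows = [{"user_id": "u1"}, {"user_id": "u2"}]
    rejected_rows = [{"user_id": None, "quarantine_reason": "MISSING_USER_ID"}]
    write("silver_users", silver_rows, mode="overwrite")
    write("quarantine_users", rejected_rows, mode="append")


run_pipeline()
run_pipeline()  # simulate a re-run on unchanged input

print(len(tables["silver_users"]))
print(len(tables["quarantine_users"]))
```

After two runs the overwritten table still holds two rows, while the appended table holds the single rejected row twice. If quarantine re-runs are expected, one option is to deduplicate quarantine rows on a stable key before or after the append.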