# Silver Dimension — Customer SCD2 Processing

This notebook applies Slowly Changing Dimension Type 2 logic to maintain historical accuracy for the customer dimension within the Lakehouse Expansion pillar. It detects new and changed records, generates row-level hashes for comparison, assigns effective dating, and appends updated records into the Silver dimension table.

The workflow runs as a single execution block so that it is easy to rerun and behaves the same way in every environment. Downstream analytics and reporting layers read historical customer state from the resulting Silver table.
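The detect, hash, date, and append steps described above can be illustrated without Spark. The following is a minimal plain-Python sketch on in-memory rows, not the notebook's implementation: the helper names (`row_hash`, `detect_changes`, `as_new_versions`) and the sample customers are hypothetical, and it assumes each stored version carries its `hash` and `effective_start`.

```python
import hashlib
from datetime import datetime, timezone

def row_hash(row, columns):
    """SHA-256 over the selected column values, joined with '||'."""
    joined = "||".join(str(row[c]) for c in columns)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def detect_changes(source_rows, dimension_rows, key, columns):
    """Return source rows that are new or differ from the latest stored version."""
    latest_hash = {}
    for dim_row in sorted(dimension_rows, key=lambda r: r["effective_start"]):
        # Later versions overwrite earlier ones, leaving the newest hash per key
        latest_hash[dim_row[key]] = dim_row["hash"]
    return [r for r in source_rows if latest_hash.get(r[key]) != row_hash(r, columns)]

def as_new_versions(changed_rows, columns, now):
    """Attach the row hash and SCD2 metadata to each changed row."""
    return [
        {**r, "hash": row_hash(r, columns), "effective_start": now,
         "effective_end": None, "is_current": True}
        for r in changed_rows
    ]

columns = ["customer_id", "name", "city"]

# Hypothetical starting dimension with one known customer
dimension = as_new_versions(
    [{"customer_id": 1, "name": "Ada", "city": "Oslo"}],
    columns, datetime(2024, 1, 1, tzinfo=timezone.utc),
)

source = [
    {"customer_id": 1, "name": "Ada", "city": "Bergen"},  # changed city
    {"customer_id": 2, "name": "Lin", "city": "Turku"},   # new customer
]

changed = detect_changes(source, dimension, "customer_id", columns)
dimension += as_new_versions(changed, columns, datetime(2024, 2, 1, tzinfo=timezone.utc))

# A second pass over the same source finds nothing left to append
rerun = detect_changes(source, dimension, "customer_id", columns)
```

In this sketch both source rows are appended as new versions, and the second pass returns an empty list because each customer's latest stored hash now matches the source.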

In [None]:
# Step 1 — Load Bronze source
bronze_df = spark.read.table("lakehouse.bronze_customers")

# Step 2 — Load existing Silver dimension (or initialize empty)
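from pyspark.sql.utils import AnalysisException

try:
    silver_df = spark.read.table("lakehouse.silver_dim_customer")
except AnalysisException:
    # Table does not exist yet (first run): start from an empty frame with the Bronze schema
    silver_df = spark.createDataFrame([], bronze_df.schema)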

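# Step 3 — Hash Bronze and Silver rows for change detection
from pyspark.sql import Window
from pyspark.sql.functions import (
    col, concat_ws, current_timestamp, lit, row_number, sha2
)

# Hash the same business columns on both sides. Silver also carries SCD2
# metadata (hash, effective dates, is_current) that must stay out of the hash.
# Caveat: concat_ws skips NULLs, so NULLs in different columns can hash alike.
hash_cols = bronze_df.columns
row_hash = sha2(concat_ws("||", *hash_cols), 256)

bronze_hashed = bronze_df.withColumn("hash", row_hash)

# Versions are appended, so compare against the latest Silver version only.
# An empty first-run Silver frame has no effective_start column; skip the step there.
silver_latest = silver_df
if "effective_start" in silver_df.columns:
    newest_first = Window.partitionBy("customer_id").orderBy(col("effective_start").desc())
    silver_latest = (
        silver_df
        .withColumn("_rn", row_number().over(newest_first))
        .filter("_rn = 1")
    )

# Keep only the key and a distinctly named hash so the join has no ambiguous columns
silver_hashed = silver_latest.select("customer_id", row_hash.alias("hash_right"))

# Step 4 — Identify new or changed records
joined = bronze_hashed.join(
    silver_hashed,
    on="customer_id",
    how="left"
)

changes = joined.filter("hash != hash_right OR hash_right IS NULL").drop("hash_right")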

# Step 5 — Apply SCD2 logic
scd2_ready = (
    changes
    .withColumn("effective_start", current_timestamp())
    .withColumn("effective_end", lit(None).cast("timestamp"))  # typed NULL marks an open-ended version
    .withColumn("is_current", lit(True))
)

# Step 6 — Write updated dimension
scd2_ready.write.mode("append").format("delta").saveAsTable("lakehouse.silver_dim_customer")

# Step 7 — Return preview
scd2_ready.limit(10).toPandas()
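One caveat with hashing via `concat_ws` is that it skips NULL inputs, so two rows whose NULLs sit in different columns can produce the same string and therefore the same hash. The sketch below imitates that behaviour in plain Python to show the collision; `concat_skipping_nulls`, `concat_with_null_token`, the `<NULL>` token, and the sample rows are all hypothetical and only illustrate the idea.

```python
import hashlib

def sha256_hex(text):
    """Hex SHA-256 digest of a string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def concat_skipping_nulls(sep, values):
    # Plain-Python imitation of how concat_ws treats NULL inputs: they are dropped
    return sep.join(str(v) for v in values if v is not None)

def concat_with_null_token(sep, values, null_token="<NULL>"):
    # Hypothetical alternative: every position is kept, NULL becomes a visible token
    return sep.join(null_token if v is None else str(v) for v in values)

# Two different rows, laid out as (customer_id, middle_name, city)
old_row = (42, None, "Oslo")
new_row = (42, "Oslo", None)

skipping_collides = (
    sha256_hex(concat_skipping_nulls("||", old_row))
    == sha256_hex(concat_skipping_nulls("||", new_row))
)
token_collides = (
    sha256_hex(concat_with_null_token("||", old_row))
    == sha256_hex(concat_with_null_token("||", new_row))
)

print(skipping_collides, token_collides)  # prints: True False
```

In PySpark, one way to get the second behaviour would be to wrap each hashed column as `coalesce(col(c).cast("string"), lit("<NULL>"))` before passing it to `concat_ws`. The token is an assumption and should be a value that cannot occur in the data.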