**⭐ 1. What This Pattern Solves**

SCD (Slowly Changing Dimension) Type 2 records each change to dimension data as a new row, so earlier versions stay available instead of being overwritten.
Use cases include:

Customer changes address → keep old and new records

Product price changes → track price history

Employee department changes → preserve historical assignments

Goal: maintain a full history for analytics while marking the current record.

**⭐ 2. SQL Equivalent**

In [0]:
%sql
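MERGE INTO dim_customer AS target
USING (
    -- Every staging row, keyed so it can match (and expire) the current row
    SELECT s.customer_id AS merge_key, s.* FROM staging_customer s
    UNION ALL
    -- Changed rows again with a NULL key, so they fall through to the INSERT
    SELECT NULL AS merge_key, s.* FROM staging_customer s
    JOIN dim_customer d ON s.customer_id = d.customer_id
    WHERE d.is_current = TRUE AND d.name <> s.name
) AS source
ON target.customer_id = source.merge_key AND target.is_current = TRUE
WHEN MATCHED AND target.name <> source.name THEN
    UPDATE SET target.end_date = CURRENT_DATE, target.is_current = FALSE
WHEN NOT MATCHED THEN
    INSERT (customer_id, name, start_date, end_date, is_current)
    VALUES (source.customer_id, source.name, CURRENT_DATE, NULL, TRUE)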


**⭐ 3. Core Idea**

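Compare incoming data with the current rows of the existing dimension

If a change is detected, “expire” the old row (set end_date and is_current=False)

Insert a new row with the updated data and is_current=True

Use merge or upsert operations (Delta Lake's MERGE is well suited to this); a changed key needs both the update and the insert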
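The expire-then-insert rule can be modelled in plain Python before involving Spark. This is a minimal sketch, not the Delta implementation: `apply_scd2` is a hypothetical helper, rows are plain dictionaries, and `name` stands in for whichever attribute is tracked.

```python
from datetime import date


def apply_scd2(dim: list[dict], updates: list[dict], today: date) -> list[dict]:
    """Return a new dimension after applying `updates` with SCD Type 2 rules."""
    result = [dict(row) for row in dim]  # copy so the input is not mutated
    current = {row["id"]: row for row in result if row["is_current"]}
    for upd in updates:
        old = current.get(upd["id"])
        if old is not None and old["name"] == upd["name"]:
            continue  # no change: leave the current row alone
        if old is not None:
            # Change detected: expire the old version
            old["end_date"] = today
            old["is_current"] = False
        # New key or changed key: insert a fresh current version
        new_row = {
            "id": upd["id"],
            "name": upd["name"],
            "start_date": today,
            "end_date": None,
            "is_current": True,
        }
        result.append(new_row)
        current[upd["id"]] = new_row
    return result


dim = [
    {"id": "C1", "name": "Alice", "start_date": date(2025, 1, 1),
     "end_date": None, "is_current": True},
]
updates = [{"id": "C1", "name": "Alice B"}, {"id": "C2", "name": "Bob"}]
history = apply_scd2(dim, updates, today=date(2025, 6, 1))
```

Here `history` holds three rows: the expired "Alice" version, the current "Alice B" version, and a first version for the new key C2.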

**⭐ 4. Template Code (MEMORIZE THIS)**

In [0]:
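from delta.tables import DeltaTable
from pyspark.sql.functions import current_date, lit, col

delta_table = DeltaTable.forPath(spark, "path/to/dim_table")

# Source rows whose attribute differs from the current dimension row
changed = (
    source_df.alias("s")
    .join(delta_table.toDF().alias("t"), col("s.id") == col("t.id"))
    .where("t.is_current = true AND t.attribute <> s.attribute")
    .select("s.*")
)

# Changed rows are staged twice: keyed (expires the old row) and NULL-keyed (inserts the new one)
id_type = source_df.schema["id"].dataType
staged = source_df.withColumn("merge_key", col("id")).unionByName(
    changed.withColumn("merge_key", lit(None).cast(id_type))
)

delta_table.alias("target").merge(
    staged.alias("source"),
    "target.id = source.merge_key AND target.is_current = true"
).whenMatchedUpdate(
    condition = "target.attribute <> source.attribute",
    set = {
        "end_date": current_date(),
        "is_current": lit(False)
    }
).whenNotMatchedInsert(
    values = {
        "id": col("source.id"),
        "attribute": col("source.attribute"),
        "start_date": current_date(),
        "end_date": lit(None).cast("date"),  # dict values must be Columns or SQL strings, not None
        "is_current": lit(True)
    }
).execute()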

**⭐ 5. Detailed Example**

In [0]:
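from datetime import date

from pyspark.sql import SparkSession
from delta.tables import DeltaTable
from pyspark.sql.functions import current_date, lit, col

spark = SparkSession.builder.getOrCreate()

# Existing dimension table (Delta); explicit schema because end_date is all NULL
schema = "id STRING, name STRING, start_date DATE, end_date DATE, is_current BOOLEAN"
data = [("C1", "Alice", date(2025, 1, 1), None, True)]
dim_df = spark.createDataFrame(data, schema)
dim_df.write.format("delta").mode("overwrite").save("/tmp/dim_customer")

# Incoming new data: C1 has a new name
new_data = [("C1", "Alice B")]
source_df = spark.createDataFrame(new_data, ["id", "name"])

delta_table = DeltaTable.forPath(spark, "/tmp/dim_customer")

# Source rows whose name differs from the current dimension row
changed = (
    source_df.alias("s")
    .join(delta_table.toDF().alias("t"), col("s.id") == col("t.id"))
    .where("t.is_current = true AND t.name <> s.name")
    .select("s.*")
)

# Changed rows are staged twice: keyed (expires the old row) and NULL-keyed (inserts the new one)
staged = source_df.withColumn("merge_key", col("id")).unionByName(
    changed.withColumn("merge_key", lit(None).cast("string"))
)

# Merge for SCD Type 2
delta_table.alias("target").merge(
    staged.alias("source"),
    "target.id = source.merge_key AND target.is_current = true"
).whenMatchedUpdate(
    condition = "target.name <> source.name",
    set = {
        "end_date": current_date(),
        "is_current": lit(False)
    }
).whenNotMatchedInsert(
    values = {
        "id": col("source.id"),
        "name": col("source.name"),
        "start_date": current_date(),
        "end_date": lit(None).cast("date"),
        "is_current": lit(True)
    }
).execute()

# Two rows for C1: the expired "Alice" version and the current "Alice B" version
delta_table.toDF().show()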

**Step-by-step:**

Compare existing rows with incoming updates

Expire old rows if there’s a change (end_date + is_current=False)

Insert new rows as current
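Once old rows are expired and new rows are inserted, the same table can answer both "what is the value now" and "what was the value on a given date". The sketch below illustrates the second question in plain Python. It assumes validity is the half-open interval from start_date (inclusive) to end_date (exclusive), which is a convention chosen for this example and not something the merge enforces; `version_as_of` and the sample dates are hypothetical.

```python
from datetime import date
from typing import Optional

# Sample history in the shape the merge produces (dates are illustrative)
history = [
    {"id": "C1", "name": "Alice", "start_date": date(2025, 1, 1),
     "end_date": date(2025, 6, 1), "is_current": False},
    {"id": "C1", "name": "Alice B", "start_date": date(2025, 6, 1),
     "end_date": None, "is_current": True},
]


def version_as_of(rows: list[dict], key: str, when: date) -> Optional[dict]:
    """Return the version of `key` valid on `when`, or None if there is none."""
    for row in rows:
        if row["id"] != key:
            continue
        started = row["start_date"] <= when
        # An open end_date (None) means the row is still valid
        not_ended = row["end_date"] is None or when < row["end_date"]
        if started and not_ended:
            return row
    return None


before_change = version_as_of(history, "C1", date(2025, 3, 15))
on_change_day = version_as_of(history, "C1", date(2025, 6, 1))
before_first = version_as_of(history, "C1", date(2024, 12, 31))
```

Under this convention the new version wins on the day of the change, because the expired row's end_date equals the new row's start_date.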

**⭐ 6. Mini Practice Problems**

Track changes in employee_title for SCD Type 2.

Handle price updates for products in a dimension table.

Implement SCD Type 2 on a small customer dataset with multiple attribute changes.

**⭐ 7. Full Data Engineering Problem**

Scenario: Daily ETL ingest of customer updates. Customers may change addresses, emails, or status. The analytics team requires full history.

Solution Approach:

Read staging customer updates

Load existing dimension from Delta

Use merge with SCD Type 2 logic

The merge writes the expired and current rows back to the Delta table in a single transaction

Run OPTIMIZE with ZORDER BY on frequently queried columns (e.g., customer_id)
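With several tracked attributes (address, email, status), the change condition is usually generated, not typed by hand. The following is a small sketch with hypothetical helper names (`change_condition`, `optimize_statement`); it only builds SQL strings and assumes the `target`/`source` aliases used in the merge. It uses Spark SQL's null-safe equality operator `<=>` so that a change to or from NULL also counts as a change.

```python
def change_condition(tracked_cols: list[str],
                     target: str = "target",
                     source: str = "source") -> str:
    """Build a null-safe 'any tracked column differs' SQL predicate."""
    if not tracked_cols:
        raise ValueError("at least one tracked column is required")
    parts = [f"NOT ({target}.{c} <=> {source}.{c})" for c in tracked_cols]
    return " OR ".join(parts)


def optimize_statement(table: str, zorder_cols: list[str]) -> str:
    """Build a Delta OPTIMIZE ... ZORDER BY statement for the given columns."""
    return f"OPTIMIZE {table} ZORDER BY ({', '.join(zorder_cols)})"


condition = change_condition(["address", "email", "status"])
optimize_sql = optimize_statement("dim_customer", ["customer_id"])
```

The `condition` string can be passed as the `condition` argument of `whenMatchedUpdate`, and `optimize_sql` can be run with `spark.sql(optimize_sql)` after the merge.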

**⭐ 8. Time & Space Complexity**

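Time: roughly O(N + M) per batch for a dimension of N rows and a batch of M rows, since the merge joins the batch against the target; this gets costly on large tables when the merge cannot skip files

Space: the table grows by one row per detected change or new key, plus Delta transaction-log metadata

Partitioning or clustering the table so the merge condition can prune files reduces the data scanned and shuffled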

**⭐ 9. Common Pitfalls**

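Forgetting to mark the old row as is_current=False → analytics will see duplicate current rows

Not updating end_date → validity ranges overlap and point-in-time queries return more than one row

Expiring the old row without inserting the new version in the same merge → the key ends up with no current row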

Missing partitioning → slow merge and high shuffle

Using SCD Type 1 logic by mistake → overwrites history
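Several of these pitfalls can be caught with a post-merge sanity check. The sketch below is a plain-Python illustration of the invariants, using a hypothetical `scd2_violations` helper over rows held as dictionaries; on a real table the same checks would be expressed as queries.

```python
from collections import Counter
from datetime import date


def scd2_violations(rows: list[dict]) -> list[str]:
    """Return human-readable descriptions of broken SCD Type 2 invariants."""
    problems = []
    current_per_key = Counter(r["id"] for r in rows if r["is_current"])
    # Every key should have exactly one current row
    for key in sorted({r["id"] for r in rows}):
        n = current_per_key.get(key, 0)
        if n != 1:
            problems.append(f"{key}: expected 1 current row, found {n}")
    for r in rows:
        # Current rows stay open-ended; expired rows must be closed
        if r["is_current"] and r["end_date"] is not None:
            problems.append(f"{r['id']}: current row has an end_date")
        if not r["is_current"] and r["end_date"] is None:
            problems.append(f"{r['id']}: expired row has no end_date")
    return problems


good = [
    {"id": "C1", "start_date": date(2025, 1, 1),
     "end_date": date(2025, 6, 1), "is_current": False},
    {"id": "C1", "start_date": date(2025, 6, 1),
     "end_date": None, "is_current": True},
]
# Old row was never expired: two current rows for the same key
bad = [
    {"id": "C1", "start_date": date(2025, 1, 1),
     "end_date": None, "is_current": True},
    {"id": "C1", "start_date": date(2025, 6, 1),
     "end_date": None, "is_current": True},
]

good_report = scd2_violations(good)
bad_report = scd2_violations(bad)
```

An empty list means the invariants hold; `bad_report` flags the key that has two current rows.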