# Customer Dimension - End-to-End Pipeline (SCD Type 2)

## Overview
In this notebook, we implement the pipeline for the **Customer Dimension**.
This dimension uses **SCD Type 2** logic: when a customer's details change (e.g., they move to a new address), we:
1.  **Expire** the old record (set `active_flg` to 'N' and `effective_end_dt` to the current timestamp).
2.  **Insert** a new record with the new details (set `active_flg` to 'Y' and `effective_start_dt` to the current timestamp).
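
To make the expire-then-insert idea concrete before any Spark code, here is a minimal plain-Python sketch on a list of dicts. The helper `apply_scd2` and the sample rows are hypothetical and exist only for illustration; the column names follow the dimension columns used later in this notebook.

```python
from datetime import datetime

HIGH_DATE = datetime(9999, 12, 31)  # open-ended end date for active rows

def apply_scd2(dimension, incoming, now):
    """Expire the active row for the incoming customer, then append a new active row."""
    for row in dimension:
        if row["customer_id"] == incoming["customer_id"] and row["active_flg"] == "Y":
            row["active_flg"] = "N"
            row["effective_end_dt"] = now
    dimension.append({
        **incoming,
        "effective_start_dt": now,
        "effective_end_dt": HIGH_DATE,
        "active_flg": "Y",
    })
    return dimension

# Hypothetical data: one active row, then the customer moves.
dimension = [{
    "customer_id": "C001",
    "address": "123 Main St",
    "effective_start_dt": datetime(2022, 1, 1),
    "effective_end_dt": HIGH_DATE,
    "active_flg": "Y",
}]
moved = {"customer_id": "C001", "address": "9 New Rd"}

history = apply_scd2(dimension, moved, now=datetime(2022, 2, 1))
for row in history:
    print(row["address"], row["active_flg"], row["effective_end_dt"].date())
```

After the call, the old address is kept with `active_flg = 'N'` and a closed end date, while the new address is the only active row for that customer.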

## Architecture Flow
1.  **Source:** `customer_{rundate}.csv` file.
2.  **Landing Layer:** Ingest raw CSV to Delta.
3.  **Staging Layer:** 
    *   Split `name` into `first_name` and `last_name`.
    *   Handle NULLs in `plan_type`.
    *   Standardize dates.
4.  **Dimension Layer (SCD Type 2):**
    *   **Full Load (first run):** Create the dimension table and load every staged row as active.
    *   **Incremental (SCD2):** 
        *   Identify changes between Staging and Dimension.
        *   **Merge (Update):** Close historical records.
        *   **Insert:** Append new active records.

In [None]:
# Import necessary libraries
import pyspark
import uuid
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from delta.tables import *

# Initialize Spark Session
spark = SparkSession.builder \
    .appName("Customer Dimension Load") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# Job Parameters
run_date = "20220101"
schema_name = "pyspark_warehouse"
landing_file_name = f"customer_{run_date}.csv"

# Paths
base_path = "s3a://warehouse/" # Update as needed
source_path = f"{base_path}landing/source/customer/{landing_file_name}"

print(f"Processing Run Date: {run_date}")

In [None]:
# --- SIMULATION: Create Dummy Source CSV ---
data = [
    ("C001", "Ramesh Kumar", "123 Main St", "Anytown", "KA", "12345", "9999900001", "ramesh@email.com", "1985-05-20", "Gold"),
    ("C002", "Sita Sharma", "456 Elm St", "Othertown", "MH", "67890", "9999900002", "sita@email.com", "1990-08-15", None), # Null plan
    ("C003", "John Doe", "789 Oak Ave", "BigCity", "TN", "11223", "9999900003", "john@email.com", "1982-12-10", "Silver")
]
columns = ["customer_id", "name", "address", "city", "state", "zip_code", "phone_number", "email", "date_of_birth", "plan_type"]

df_source = spark.createDataFrame(data, columns)

# Write to Source Path
# Spark writes a directory of part files, so target source_path itself;
# the Landing step can then read that same path back.
df_source.coalesce(1).write.mode("overwrite").option("header", "true").csv(source_path)
print("Source file created.")

## 1. Landing Load
Read the CSV, cast every column to String, add the audit columns (`insert_dt`, `rundate`), and append to the Landing Delta table.
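
As a rough illustration of what a landing row looks like, the sketch below does the same thing in plain Python with the standard `csv` module. The function `to_landing_rows` and the three-column sample are hypothetical, and mapping empty fields to `None` is an assumption made here to mimic a null arriving from the source.

```python
import csv
import io
from datetime import datetime, timezone

def to_landing_rows(csv_text, run_date):
    """Read CSV text, keep every value as a string, and add audit columns."""
    loaded_at = datetime.now(timezone.utc).isoformat()
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        # Assumption: an empty field stands for a missing value.
        row = {name: (value if value != "" else None) for name, value in record.items()}
        row["insert_dt"] = loaded_at   # when the row was loaded
        row["rundate"] = run_date      # which batch it belongs to
        rows.append(row)
    return rows

sample = "customer_id,name,plan_type\nC001,Ramesh Kumar,Gold\nC002,Sita Sharma,\n"
landing_rows = to_landing_rows(sample, run_date="20220101")
for row in landing_rows:
    print(row["customer_id"], row["plan_type"], row["rundate"])
```

No business typing or cleansing happens at this layer; the only additions are the two audit columns.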

In [None]:
# --- LANDING ---
df_raw = spark.read.option("header", "true").csv(source_path)

# Cast to String & Audit
df_landing = df_raw.select([col(c).cast("string") for c in df_raw.columns]) \
    .withColumn("insert_dt", current_timestamp()) \
    .withColumn("rundate", lit(run_date))

# Write
landing_table = f"{schema_name}.dim_customer_ld"
df_landing.write.format("delta").mode("append").saveAsTable(landing_table)
print(f"Loaded to {landing_table}")

## 2. Staging Load
We apply specific transformations here:
1.  **Name Split:** Convert `name` -> `first_name`, `last_name`.
2.  **Date Casting:** `date_of_birth` to Date type.
3.  **Null Handling:** Coalesce `plan_type` to "NA".
4.  **Deduplication:** Based on `customer_id`.
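
The four rules above can be sketched on plain Python dicts as follows. The helpers `to_staging_row` and `dedupe_latest` are hypothetical names, and the sample rows are made up. One deliberate difference from a fixed two-token split: `str.partition` keeps everything after the first space as the last name, which is an assumption about how multi-word names should be handled.

```python
from datetime import date

def to_staging_row(row):
    """Apply the staging rules to one landing row (a dict of strings)."""
    first_name, _, last_name = row["name"].strip().partition(" ")
    return {
        "customer_id": row["customer_id"],
        "first_name": first_name,
        "last_name": last_name or None,                              # single-word names have no last name
        "date_of_birth": date.fromisoformat(row["date_of_birth"]),   # expects yyyy-MM-dd
        "plan_type": row["plan_type"] if row["plan_type"] is not None else "NA",
    }

def dedupe_latest(rows):
    """Keep one row per customer_id: the one with the greatest insert_dt."""
    latest = {}
    for row in rows:
        current = latest.get(row["customer_id"])
        if current is None or row["insert_dt"] > current["insert_dt"]:
            latest[row["customer_id"]] = row
    return list(latest.values())

# Hypothetical landing rows: C002 was loaded twice, the later load has no plan.
landing = [
    {"customer_id": "C002", "name": "Sita Sharma", "date_of_birth": "1990-08-15",
     "plan_type": "Gold", "insert_dt": "2022-01-01T00:00:00"},
    {"customer_id": "C002", "name": "Sita Sharma", "date_of_birth": "1990-08-15",
     "plan_type": None, "insert_dt": "2022-01-02T00:00:00"},
    {"customer_id": "C003", "name": "John Doe", "date_of_birth": "1982-12-10",
     "plan_type": "Silver", "insert_dt": "2022-01-01T00:00:00"},
]

staged = [to_staging_row(r) for r in dedupe_latest(landing)]
for row in staged:
    print(row)
```

Deduplicating before transforming means only the most recent landing row per customer is cleaned and carried forward.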

In [None]:
# --- STAGING ---
from pyspark.sql.window import Window

df_ld = spark.read.table(landing_table)

# Dedupe
window_spec = Window.partitionBy("customer_id").orderBy(col("insert_dt").desc())
df_deduped = df_ld.withColumn("rn", row_number().over(window_spec)).filter("rn=1").drop("rn")

# Transformations (assumes `name` has the form "First Last"; any tokens after the second are dropped)
df_stg = df_deduped \
    .withColumn("first_name", split(col("name"), " ")[0]) \
    .withColumn("last_name", split(col("name"), " ")[1]) \
    .withColumn("date_of_birth", to_date(col("date_of_birth"), "yyyy-MM-dd")) \
    .withColumn("plan_type", coalesce(col("plan_type"), lit("NA"))) \
    .withColumn("update_dt", current_timestamp())

# Select Columns
stg_cols = ["customer_id", "first_name", "last_name", "address", "city", "state", 
            "zip_code", "phone_number", "email", "date_of_birth", "plan_type", 
            "insert_dt", "update_dt", "rundate"]

df_stg_final = df_stg.select(stg_cols)

# Write
staging_table = f"{schema_name}.dim_customer_stg"
df_stg_final.write.format("delta").mode("overwrite").saveAsTable(staging_table)
print(f"Loaded to {staging_table}")
df_stg_final.show(5)

## 3. Dimension Load (SCD Type 2)

### The SCD2 Logic
1.  **Surrogate Key:** Generate `row_wid`.
2.  **SCD Columns:**
    *   `effective_start_dt`: Current Timestamp (for new rows).
    *   `effective_end_dt`: High Date (9999-12-31) for active rows.
    *   `active_flg`: 'Y' for active rows.
3.  **Update (Merge):** Identify records in the Target table that match the incoming Staging data (by `customer_id`) AND are currently active (`active_flg='Y'`). For these records, update `active_flg='N'` and set `effective_end_dt` to now.
4.  **Insert:** Insert the new records from Staging as active.
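
The steps above expire and re-insert every customer that appears in Staging. A stricter variant would first check whether the tracked attributes actually changed. The sketch below shows one way to do that with a row hash in plain Python; `TRACKED_COLUMNS`, `row_hash`, and `rows_needing_new_version` are hypothetical names, and this is an optional refinement rather than what the notebook code does.

```python
import hashlib

# Assumption: these are the attributes whose change should create a new version.
TRACKED_COLUMNS = ["first_name", "last_name", "address", "city", "state",
                   "zip_code", "phone_number", "email", "plan_type"]

def row_hash(row, columns=TRACKED_COLUMNS):
    """Hash the tracked attributes; None is encoded distinctly from an empty string."""
    parts = ["\x00" if row.get(c) is None else str(row[c]) for c in columns]
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()

def rows_needing_new_version(staged, active_by_id):
    """Return staged rows that are new customers or differ from the active version."""
    changed = []
    for row in staged:
        active = active_by_id.get(row["customer_id"])
        if active is None or row_hash(active) != row_hash(row):
            changed.append(row)
    return changed

# Hypothetical data: C001 unchanged, C002 moved, C004 is a new customer.
staged = [
    {"customer_id": "C001", "address": "123 Main St", "plan_type": "Gold"},
    {"customer_id": "C002", "address": "77 Lake Rd", "plan_type": "NA"},
    {"customer_id": "C004", "address": "1 First Ave", "plan_type": "Silver"},
]
active_by_id = {
    "C001": {"customer_id": "C001", "address": "123 Main St", "plan_type": "Gold"},
    "C002": {"customer_id": "C002", "address": "456 Elm St", "plan_type": "NA"},
}

changed_ids = [r["customer_id"] for r in rows_needing_new_version(staged, active_by_id)]
print(changed_ids)  # ['C002', 'C004']
```

Only the rows returned by such a filter would then be expired and re-inserted, so unchanged customers do not accumulate duplicate history.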

In [None]:
# --- DIMENSION (SCD2) ---

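# UUID UDF, marked non-deterministic so Spark does not assume repeated calls return the same value
uuid_udf = udf(lambda: str(uuid.uuid4()), StringType()).asNondeterministic()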

# Read Staging
df_stage = spark.read.table(staging_table)

# Prepare Data for Insertion (New Records)
# Add SCD2 specific columns for new records
df_new_records = df_stage \
    .withColumn("row_wid", uuid_udf()) \
    .withColumn("effective_start_dt", current_timestamp()) \
    .withColumn("effective_end_dt", to_timestamp(lit("9999-12-31 00:00:00"))) \
    .withColumn("active_flg", lit("Y"))

dim_table = f"{schema_name}.dim_customer"

# --- SCD 2 IMPLEMENTATION ---

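# Check the catalog rather than a hard-coded warehouse path, which may not match where the table is stored
dim_exists = spark.sql(f"SHOW TABLES IN {schema_name} LIKE 'dim_customer'").count() > 0
if not dim_exists: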
    # FIRST RUN: Just write the data
    print("Table not found. Creating Initial Load.")
    df_new_records.write.format("delta").saveAsTable(dim_table)
else:
    print("Incremental Load: executing SCD2 Logic.")
    delta_target = DeltaTable.forName(spark, dim_table)
    
    # Step A: Expire the currently active row for every customer_id present in Staging.
    # Note: a strict SCD2 would compare hashes/columns to confirm the data actually changed.
    # Here, every incoming row is assumed to be a change or a new version.
    delta_target.alias("tgt").merge(
        df_stage.alias("src"),
        "tgt.customer_id = src.customer_id AND tgt.active_flg = 'Y'"
    ).whenMatchedUpdate(set={
        "active_flg": lit("N"),
        "effective_end_dt": current_timestamp(),
        "update_dt": current_timestamp()
    }).execute()
    
    # Step B: Insert the new records
    # We append the df_new_records prepared earlier
    df_new_records.write.format("delta").mode("append").saveAsTable(dim_table)

print("SCD2 Load Complete.")

# Generate Manifest
spark.sql(f"GENERATE symlink_format_manifest FOR TABLE {dim_table}")

In [None]:
# Validation
print("Final Dimension Data:")
# Sort by customer_id and effective_start_dt to see history
spark.sql(f"""
    SELECT customer_id, first_name, plan_type, effective_start_dt, effective_end_dt, active_flg 
    FROM {dim_table} 
    ORDER BY customer_id, effective_start_dt
""").show(truncate=False)