# Product Dimension - End-to-End Pipeline (SCD Type 2)

## Overview
In this notebook, we implement the pipeline for the **Product Dimension**.
Like the Customer dimension, this is an **SCD Type 2** table: when a product attribute (such as price, size, or expiration date) changes, the existing row is closed and a new version is added, so the full history is preserved.
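
To make the idea of versioned rows concrete, here is a minimal plain-Python sketch (no Spark required) of what the history of one product might look like and how a point-in-time lookup could work. The column names follow this notebook; the second price, the dates, and the helper `version_as_of` are hypothetical and exist only for illustration.

```python
from datetime import datetime

HIGH_DATE = datetime(9999, 12, 31)  # open-ended end date for the active version

# Hypothetical history for one product whose price changed once.
history = [
    {"product_id": "P001", "price": 20.00,
     "effective_start_dt": datetime(2022, 1, 1),
     "effective_end_dt": datetime(2022, 2, 1), "active_flg": "N"},
    {"product_id": "P001", "price": 22.50,
     "effective_start_dt": datetime(2022, 2, 1),
     "effective_end_dt": HIGH_DATE, "active_flg": "Y"},
]

def version_as_of(rows, product_id, as_of):
    """Return the version whose validity window contains `as_of`, or None."""
    for row in rows:
        if (row["product_id"] == product_id
                and row["effective_start_dt"] <= as_of < row["effective_end_dt"]):
            return row
    return None

old = version_as_of(history, "P001", datetime(2022, 1, 15))
new = version_as_of(history, "P001", datetime(2022, 3, 1))
print(old["price"], new["price"])
```

The half-open window (`start <= as_of < end`) is one possible convention; it keeps adjacent versions from overlapping at the boundary timestamp.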

## Architecture Flow
1.  **Source:** `product_{rundate}.csv` file.
2.  **Landing Layer:** Ingest raw CSV to Delta.
3.  **Staging Layer:** 
    *   Deduplication based on `product_id`.
    *   Type casting (Price to Float, Quantity to Int, etc.).
    *   Preparation of SCD2 columns.
4.  **Dimension Layer (SCD Type 2):**
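    *   **Merge (Update):** Expire the currently active record of each product present in the staging data. The merge matches on `product_id` only; it does not compare attribute values.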
    *   **Insert:** Append new versions of the products as active records.

In [None]:
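# Import necessary libraries
# Explicit imports avoid shadowing Python built-ins (sum, min, max, round, ...)
# that a wildcard import from pyspark.sql.functions would replace.
import uuid
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col, current_timestamp, lit, row_number, to_date, to_timestamp, udf,
)
from pyspark.sql.types import StringType
from delta.tables import DeltaTable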

# Initialize Spark Session
spark = SparkSession.builder \
    .appName("Product Dimension Load") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# Job Parameters
run_date = "20220101"
schema_name = "pyspark_warehouse"
landing_file_name = f"product_{run_date}.csv"

# Paths
base_path = "s3a://warehouse/" # Update as needed
source_path = f"{base_path}landing/source/product/{landing_file_name}"

print(f"Processing Run Date: {run_date}")

In [None]:
# --- SIMULATION: Create Dummy Source CSV ---
data = [
    ("P001", "Purina Pro Plan", "Purina", "Dry", "Chicken", "5 kgs", "20.00", "100", "2024-12-31", "https://img.url/p001"),
    ("P002", "Hill's Science Diet", "Hill's", "Wet", "Beef", "10 cans", "15.50", "50", "2023-06-30", "https://img.url/p002"),
    ("P003", "Blue Buffalo", "Blue Buffalo", "Dry", "Lamb", "2 kgs", "12.00", "200", "2025-01-01", "https://img.url/p003")
]
columns = ["product_id", "product_name", "brand", "type", "flavor", "size", "price", "quantity", "expiration_date", "image_url"]

df_source = spark.createDataFrame(data, columns)

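# Write to Source Path.
# Spark writes a directory of part files at the given path, so we write to
# source_path itself; the landing step reads the same path.
df_source.coalesce(1).write.mode("overwrite").option("header", "true").csv(source_path)
print("Source file created.")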

## 1. Landing Load
Read the CSV, cast every column to String, add the audit columns `insert_dt` and `rundate`, and append the result to the Landing Delta table.
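
The landing step can be illustrated without Spark. The following plain-Python sketch parses a small CSV, keeps every value as a string, and attaches the two audit columns. The inline CSV text (a subset of the source columns) and the helper `to_landing_rows` are assumptions made for this example, not part of the pipeline.

```python
import csv
import io
from datetime import datetime, timezone

# Hypothetical two-row extract using a subset of the source columns.
raw_csv = (
    "product_id,product_name,price,quantity\n"
    "P001,Purina Pro Plan,20.00,100\n"
    "P002,Hill's Science Diet,15.50,50\n"
)

def to_landing_rows(csv_text, run_date):
    """Parse CSV text, keep all values as strings, and add audit columns."""
    insert_dt = datetime.now(timezone.utc).isoformat()
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        row = {name: str(value) for name, value in record.items()}
        row["insert_dt"] = insert_dt   # load timestamp, same for the whole batch
        row["rundate"] = run_date      # logical run date of the job
        rows.append(row)
    return rows

landing_rows = to_landing_rows(raw_csv, "20220101")
print(landing_rows[0])
```

Keeping everything as a string at this layer means a malformed value cannot make the load fail; type problems surface later, in staging.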

In [None]:
# --- LANDING ---
df_raw = spark.read.option("header", "true").csv(source_path)

# Cast to String & Audit
df_landing = df_raw.select([col(c).cast("string") for c in df_raw.columns]) \
    .withColumn("insert_dt", current_timestamp()) \
    .withColumn("rundate", lit(run_date))

# Write
landing_table = f"{schema_name}.dim_product_ld"
df_landing.write.format("delta").mode("append").saveAsTable(landing_table)
print(f"Loaded to {landing_table}")

## 2. Staging Load
Transformations:
1.  **Deduplication:** Keep one row per `product_id`, the one with the latest `insert_dt`.
2.  **Type Casting:** Convert `price` to Float, `quantity` to Integer, `expiration_date` to Date.
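
As a rough plain-Python analogue of these two transformations (not the Spark code itself), the sketch below keeps the latest row per `product_id` and then casts the string columns. The sample rows and the helpers `dedupe_latest` and `cast_row` are hypothetical.

```python
from datetime import date, datetime

# Hypothetical landing rows: P001 was loaded twice, on consecutive days.
landing_rows = [
    {"product_id": "P001", "price": "20.00", "quantity": "100",
     "expiration_date": "2024-12-31", "insert_dt": "2022-01-01T00:00:00"},
    {"product_id": "P001", "price": "21.00", "quantity": "90",
     "expiration_date": "2024-12-31", "insert_dt": "2022-01-02T00:00:00"},
    {"product_id": "P002", "price": "15.50", "quantity": "50",
     "expiration_date": "2023-06-30", "insert_dt": "2022-01-01T00:00:00"},
]

def dedupe_latest(rows):
    """Keep the row with the greatest insert_dt for each product_id."""
    latest = {}
    for row in rows:
        current = latest.get(row["product_id"])
        # ISO-8601 strings in one fixed format sort chronologically.
        if current is None or row["insert_dt"] > current["insert_dt"]:
            latest[row["product_id"]] = row
    return list(latest.values())

def cast_row(row):
    """Convert the string columns to float, int, and date."""
    return {
        **row,
        "price": float(row["price"]),
        "quantity": int(row["quantity"]),
        "expiration_date": datetime.strptime(row["expiration_date"], "%Y-%m-%d").date(),
    }

staged = sorted((cast_row(r) for r in dedupe_latest(landing_rows)),
                key=lambda r: r["product_id"])
print(staged)
```

This mirrors the effect of `row_number()` over a window partitioned by `product_id` and ordered by `insert_dt` descending, followed by a filter on the first row.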

In [None]:
# --- STAGING ---
from pyspark.sql.window import Window

df_ld = spark.read.table(landing_table)

# Dedupe
window_spec = Window.partitionBy("product_id").orderBy(col("insert_dt").desc())
df_deduped = df_ld.withColumn("rn", row_number().over(window_spec)).filter("rn=1").drop("rn")

# Transformations
df_stg = df_deduped \
    .withColumn("price", col("price").cast("float")) \
    .withColumn("quantity", col("quantity").cast("integer")) \
    .withColumn("expiration_date", to_date(col("expiration_date"), "yyyy-MM-dd")) \
    .withColumn("update_dt", current_timestamp())

# Select Columns
stg_cols = ["product_id", "product_name", "brand", "type", "flavor", "size", 
            "price", "quantity", "expiration_date", "image_url", 
            "insert_dt", "update_dt", "rundate"]

df_stg_final = df_stg.select(stg_cols)

# Write
staging_table = f"{schema_name}.dim_product_stg"
df_stg_final.write.format("delta").mode("overwrite").saveAsTable(staging_table)
print(f"Loaded to {staging_table}")
df_stg_final.show(5)

## 3. Dimension Load (SCD Type 2)

### The SCD2 Logic
1.  **Surrogate Key:** Generate `row_wid` using UUID.
2.  **SCD Columns:**
    *   `effective_start_dt`: Current Timestamp.
    *   `effective_end_dt`: High Date (9999-12-31).
    *   `active_flg`: 'Y'.
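3.  **Update (Expire):** If `product_id` matches an active record in the target, set `active_flg='N'` and `effective_end_dt` to the current timestamp. The match does not compare attributes, so every product present in staging receives a new version on each run.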
4.  **Insert:** Append new records.
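
The expire-then-insert pattern can be sketched in plain Python as follows. Note one deliberate difference: the merge in this notebook matches on `product_id` and `active_flg` only, whereas this sketch also compares a set of tracked attributes and skips unchanged products. That comparison, the `TRACKED` tuple, and the function `apply_scd2` are assumptions added for illustration, not the notebook's implementation.

```python
import uuid
from datetime import datetime

HIGH_DATE = datetime(9999, 12, 31)
TRACKED = ("product_name", "price")  # hypothetical subset of tracked attributes

def apply_scd2(dimension, staged, now):
    """Expire changed active rows and append new versions; returns a new list."""
    active = {r["product_id"]: r for r in dimension if r["active_flg"] == "Y"}
    result = [dict(r) for r in dimension]          # work on copies
    by_wid = {r["row_wid"]: r for r in result}
    for src in staged:
        current = active.get(src["product_id"])
        if current is not None:
            if all(current[c] == src[c] for c in TRACKED):
                continue                           # unchanged: keep current version
            tgt = by_wid[current["row_wid"]]
            tgt["active_flg"] = "N"                # expire the old version
            tgt["effective_end_dt"] = now
        result.append({**src,
                       "row_wid": str(uuid.uuid4()),   # surrogate key
                       "effective_start_dt": now,
                       "effective_end_dt": HIGH_DATE,
                       "active_flg": "Y"})
    return result

day1, day2 = datetime(2022, 1, 1), datetime(2022, 1, 2)
dim = apply_scd2([], [
    {"product_id": "P001", "product_name": "Purina Pro Plan", "price": 20.0},
    {"product_id": "P002", "product_name": "Hill's Science Diet", "price": 15.5},
], day1)
dim = apply_scd2(dim, [
    {"product_id": "P001", "product_name": "Purina Pro Plan", "price": 22.0},
    {"product_id": "P002", "product_name": "Hill's Science Diet", "price": 15.5},
], day2)
for row in dim:
    print(row["product_id"], row["price"], row["active_flg"])
```

Whether to compare attributes before expiring is a design choice: without it, each run adds a version for every staged product, which is simpler but grows the dimension even when nothing changed.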

In [None]:
# --- DIMENSION (SCD2) ---

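# UUID UDF. Marked non-deterministic so the optimizer does not assume that
# repeated evaluations return the same value.
uuid_udf = udf(lambda: str(uuid.uuid4()), StringType()).asNondeterministic()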

# Read Staging
df_stage = spark.read.table(staging_table)

# Prepare Data for Insertion (New Records)
df_new_records = df_stage \
    .withColumn("row_wid", uuid_udf()) \
    .withColumn("effective_start_dt", current_timestamp()) \
    .withColumn("effective_end_dt", to_timestamp(lit("9999-12-31 00:00:00"))) \
    .withColumn("active_flg", lit("Y"))

dim_table = f"{schema_name}.dim_product"

# --- SCD 2 IMPLEMENTATION ---

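# Check the catalog instead of a hard-coded warehouse path, which does not
# resolve to the managed table's location.
# Note: qualified names in tableExists require PySpark 3.4+.
if not spark.catalog.tableExists(dim_table):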
    # FIRST RUN
    print("Table not found. Creating Initial Load.")
    df_new_records.write.format("delta").saveAsTable(dim_table)
else:
    print("Incremental Load: executing SCD2 Logic.")
    delta_target = DeltaTable.forName(spark, dim_table)
    
    # Step A: Update existing active records to expire them
    delta_target.alias("tgt").merge(
        df_stage.alias("src"),
        "tgt.product_id = src.product_id AND tgt.active_flg = 'Y'"
    ).whenMatchedUpdate(set={
        "active_flg": lit("N"),
        "effective_end_dt": current_timestamp(),
        "update_dt": current_timestamp()
    }).execute()
    
    # Step B: Insert the new records
    df_new_records.write.format("delta").mode("append").saveAsTable(dim_table)

print("SCD2 Load Complete.")

# Generate Manifest
spark.sql(f"GENERATE symlink_format_manifest FOR TABLE {dim_table}")

In [None]:
# Validation
print("Final Dimension Data:")
spark.sql(f"""
    SELECT product_id, product_name, price, effective_start_dt, effective_end_dt, active_flg 
    FROM {dim_table} 
    ORDER BY product_id, effective_start_dt
""").show(truncate=False)