# Date Dimension - Data Warehouse Load (SCD Type 1)

## Overview
In this notebook, we move data from the **Staging Layer** to the final **Data Warehouse (Dimension)** layer.

### The Dimension Load Pattern (SCD Type 1)
We implement a Slowly Changing Dimension (SCD) Type 1 strategy: when an attribute changes, the existing row is overwritten with the new values and no history is kept.
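
As a minimal, Spark-free sketch of what Type 1 means, the snippet below models the dimension as a plain dictionary keyed by the natural key. The function name `scd1_upsert` and the sample rows are hypothetical and exist only for illustration; the notebook itself performs this step with a Delta merge.

```python
from typing import Dict, Iterable

Row = Dict[str, object]


def scd1_upsert(target: Dict[str, Row], batch: Iterable[Row], key: str = "date") -> Dict[str, Row]:
    """Apply SCD Type 1 semantics: overwrite matching rows, insert new ones.

    `target` maps natural key -> row. No previous version of a row is kept.
    """
    result = dict(target)  # shallow copy so the caller's dict is left untouched
    for row in batch:
        result[row[key]] = dict(row)  # matched: overwrite; not matched: insert
    return result


# Hypothetical sample data (not taken from the notebook's tables)
dim = {"2022-01-01": {"date": "2022-01-01", "day_of_week": "Sat"}}
batch = [
    {"date": "2022-01-01", "day_of_week": "Saturday"},  # existing key: overwritten
    {"date": "2022-01-02", "day_of_week": "Sunday"},    # new key: inserted
]
dim_after = scd1_upsert(dim, batch)
```

After the call, `dim_after` holds two rows, and the old `"Sat"` value is gone from the result. That loss of the previous value is the defining trade-off of Type 1.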

### Steps involved:
1.  **Read Staging Data:** Read all data from the Staging table (as it acts as a queue/buffer for the latest batch).
2.  **Surrogate Key Generation:** Create a unique identifier (`row_wid`) for the dimension. For the Date dimension, we derive a "smart key" from the date in `yyyyMMdd` form (e.g., `20220101`).
3.  **Transformation:** Ensure data types align with the final schema.
4.  **Write Strategy (Upsert/Merge):** 
    *   **Check Load Type:** If this is the very first run (Full Load), we delete all rows from the target table and vacuum its files to ensure a clean slate.
    *   **Merge:** Use Delta Lake's `MERGE INTO` command to match records based on the Natural Key (`date`). 
        *   **Matched:** Update the record.
        *   **Not Matched:** Insert the record.
5.  **Job Control & Manifest:** Update logs and generate the manifest file.

In [None]:
# Import necessary libraries
# Explicit imports avoid shadowing Python built-ins (sum, max, ...) via wildcard imports
from pyspark.sql.functions import col, date_format
from delta.tables import DeltaTable

# Assuming utility modules are available
from lib.utils import get_spark_session, get_run_date
from lib.job_control import insert_log, get_max_timestamp

# Initialize Spark Session
spark = get_spark_session("Date Dimension Load")

# 1. Job Parameters
# For this tutorial we use a fixed run date; whether the run is a Full Load
# is decided later from the job control watermark
run_date = "20220101" 
schema_name = "pyspark_warehouse"

# Source (Staging) and Target (Dimension) configuration
src_table = f"{schema_name}.dim_date_stg"
tgt_table = "dim_date"
tgt_table_full = f"{schema_name}.{tgt_table}"

print(f"Reading from: {src_table}")
print(f"Writing to:   {tgt_table_full}")

## 1. Read Staging Data
We read the data prepared in the previous step. Because the staging table is truncated and reloaded on every run, it holds only the latest batch, so we read it in full.

In [None]:
# Read from Staging
df_stg = spark.read.table(src_table)

print(f"Staging Data Count: {df_stg.count()}")
df_stg.printSchema()

## 2. Generate Surrogate Keys & Transform
For a Date Dimension, a common practice for the Surrogate Key (`row_wid`) is to use the integer representation of the date (e.g., `20220101`).

We also ensure the column names and types match the target table definition.
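
To make the key format concrete, here is a small plain-Python sketch of the same mapping, independent of Spark. The helper names `smart_key` and `date_from_smart_key` are hypothetical and not part of the notebook's utility modules.

```python
from datetime import date, datetime


def smart_key(d: date) -> int:
    """Return the yyyyMMdd integer for a date, e.g. date(2022, 1, 1) -> 20220101."""
    return d.year * 10000 + d.month * 100 + d.day


def date_from_smart_key(key: int) -> date:
    """Inverse of smart_key; raises ValueError if the key is not a valid date."""
    return datetime.strptime(str(key), "%Y%m%d").date()
```

One reason this style of key is popular is that integer order matches calendar order, so range filters on `row_wid` behave like range filters on the date itself.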

In [None]:
# Generate Surrogate Key (row_wid)
# Format the date as a 'yyyyMMdd' string, then cast it to long (e.g., 2022-01-01 -> 20220101)
df_dim_temp = df_stg.withColumn("row_wid", date_format(col("date"), "yyyyMMdd").cast("long"))

# Select and Reorder columns to match Target Schema
final_cols = [
    "row_wid", 
    "date", 
    "day", 
    "month", 
    "year", 
    "day_of_week", 
    "rundate", 
    "insert_dt", 
    "update_dt"
]

df_final = df_dim_temp.select(final_cols)

print("Final Dimension Data Preview:")
df_final.show(5)

## 3. Handling Full Load vs Incremental
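We check the **Job Control Table** for the target's high watermark.
*   If `max_timestamp` equals the default low value (`1900-01-01 00:00:00.000000`), either the dimension has never been loaded or a Full Load is being forced.
*   **Action:** On a Full Load, we delete all rows from the target Delta table and run `VACUUM` with zero retention to remove the old data files immediately. A zero-retention vacuum also destroys the table's time-travel history, so it should only run when nothing else is reading or writing the table.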
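
The return type of `get_max_timestamp` is not shown in this notebook, so a plain string comparison against the sentinel may be fragile. The sketch below is one hedged way to make the check tolerant of both strings and `datetime` values. The helper `is_full_load` is hypothetical, it assumes naive (timezone-free) timestamps, and treating `None` as a Full Load is an assumption rather than documented behavior.

```python
from datetime import datetime
from typing import Union

# Sentinel from the notebook: the watermark of a table that has never been loaded.
LOW_WATERMARK = datetime(1900, 1, 1)


def is_full_load(max_timestamp: Union[str, datetime, None]) -> bool:
    """Return True when the watermark is missing or not later than the sentinel."""
    if max_timestamp is None:
        return True  # assumption: no watermark recorded means nothing was loaded
    if isinstance(max_timestamp, str):
        # Accepts 'YYYY-MM-DD HH:MM:SS' with or without fractional seconds
        max_timestamp = datetime.fromisoformat(max_timestamp)
    return max_timestamp <= LOW_WATERMARK
```

With a helper like this, the sentinel is defined in one place and differences in fractional-second formatting no longer affect the result.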

In [None]:
# Check High Watermark
max_timestamp = get_max_timestamp(spark, schema_name, tgt_table)
print(f"Max Timestamp: {max_timestamp}")

# Logic for Initial Load
if max_timestamp == "1900-01-01 00:00:00.000000":
    print("Full Load Detected. Truncating Target Table...")
    
    # Initialize DeltaTable object
    dt_target = DeltaTable.forName(spark, tgt_table_full)
    
    # Truncate (Delete all rows)
    dt_target.delete() 
    
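    # Vacuum to remove physical files. A retention of 0 hours requires disabling
    # Delta's retention safety check; re-enable it afterwards so the rest of
    # the session stays protected.
    spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
    try:
        dt_target.vacuum(0)
    finally:
        spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "true")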
    
    print("Target Table Truncated and Vacuumed.")
else:
    print("Incremental Load Detected.")

## 4. Upsert (Merge) Data
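We use the **Delta Merge** operation to implement SCD Type 1.
*   **Join Condition:** `target.date = source.date` (Natural Key). The source must contain at most one row per `date`; Delta fails the merge if several source rows match the same target row.
*   **When Matched:** Update all columns with the source values. This includes `insert_dt`, so the original insert timestamp is overwritten.
*   **When Not Matched:** Insert the new row.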
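
The same upsert can also be written as a Delta Lake SQL `MERGE INTO` statement. The sketch below only builds the statement text. `build_merge_sql` and the view name `dim_date_updates` are hypothetical, and running the statement would assume the source DataFrame has first been registered as a temporary view (for example with `createOrReplaceTempView`).

```python
def build_merge_sql(target: str, source: str, key: str = "date") -> str:
    """Build a Delta Lake MERGE statement that upserts `source` into `target`."""
    return (
        f"MERGE INTO {target} AS target\n"
        f"USING {source} AS source\n"
        f"ON target.{key} = source.{key}\n"
        "WHEN MATCHED THEN UPDATE SET *\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


# 'dim_date_updates' is a hypothetical temp view holding the staged rows
merge_sql = build_merge_sql("pyspark_warehouse.dim_date", "dim_date_updates")
print(merge_sql)
```

`UPDATE SET *` and `INSERT *` correspond to `whenMatchedUpdateAll()` and `whenNotMatchedInsertAll()` in the Python API, so either form expresses the same Type 1 behavior.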

In [None]:
# Initialize DeltaTable for Target
dt_target = DeltaTable.forName(spark, tgt_table_full)

# Define the Merge
dt_target.alias("target") \
    .merge(
        df_final.alias("source"),
        "target.date = source.date" # Natural Key Join
    ) \
    .whenMatchedUpdateAll() \
    .whenNotMatchedInsertAll() \
    .execute()

print("Merge (Upsert) completed successfully.")

## 5. Job Logging & Manifest
Record the execution statistics in the job control table and regenerate the symlink manifest that Athena reads.
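
The load logs the staging row count as an approximation of the rows written. If exact counts are needed, Delta records per-commit metrics in the table history. The sketch below assumes the `operationMetrics` map of the latest commit has already been collected (for example from `dt_target.history(1)`) and that it uses the key names shown, which is my understanding of what Delta records for a merge and may vary by version. `summarize_merge_metrics` and the sample values are hypothetical.

```python
from typing import Dict, Mapping


def summarize_merge_metrics(operation_metrics: Mapping[str, str]) -> Dict[str, int]:
    """Extract row counts from a MERGE commit's operationMetrics map.

    Values arrive as strings, so they are converted to int; missing keys count as 0.
    """
    wanted = {
        "inserted": "numTargetRowsInserted",
        "updated": "numTargetRowsUpdated",
        "deleted": "numTargetRowsDeleted",
    }
    return {name: int(operation_metrics.get(metric, 0)) for name, metric in wanted.items()}


# Hypothetical metrics, shaped like the string-to-string map in the table history
sample = {"numTargetRowsInserted": "365", "numTargetRowsUpdated": "0", "numSourceRows": "365"}
counts = summarize_merge_metrics(sample)
```

The resulting dictionary could then be passed to the job control logging instead of the source count, if that utility is extended to accept it.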

In [None]:
# 1. Update Job Control
# Note: The exact number of inserted/updated rows for a merge is only available
# from the merge's operation metrics. For simplicity, we log the source row count.
insert_log(spark, schema_name, tgt_table, df_final.count(), run_date)
print("Job Control updated.")

# 2. Generate Manifest
spark.sql(f"GENERATE symlink_format_manifest FOR TABLE {tgt_table_full}")
print("Manifest generated.")

# 3. Final Validation
print("Data in Final Dimension:")
spark.sql(f"SELECT * FROM {tgt_table_full} ORDER BY date LIMIT 5").show()

print("Job Control Status:")
spark.sql(f"SELECT * FROM {schema_name}.job_control WHERE table_name='{tgt_table}' ORDER BY insert_dt DESC LIMIT 1").show()