# Date Dimension - Staging Data Load

## Overview
In this notebook, we move data from the **Landing Layer** to the **Staging Layer**.

### The Staging Layer Pattern
The Staging Layer serves as an intermediate processing zone where data is cleaned, de-duplicated, and typed before entering the Data Warehouse (Dimension/Fact) tables.

### Steps involved:
1.  **Read Incremental Data:** We query the **Job Control Table** to find the `max_timestamp` of the previous load. We then read only the records from the Landing Layer that were inserted *after* this timestamp.
2.  **De-duplication:** Landing may receive the same record more than once. We partition by the **Natural Key** (in this case, `date`) and keep only the record with the latest `insert_dt`.
3.  **Type Casting:** In Landing, all columns were strings. Here, we cast them to their correct types (Date, Integer, etc.).
4.  **Audit Columns:** We add `insert_dt` and `update_dt` to track when records are processed in this layer, plus `rundate` to tag the batch.
5.  **Write Strategy:** We write to the Staging table in **Overwrite** mode.

In [None]:
# Import only the Spark functions this notebook uses
from pyspark.sql.functions import col, current_timestamp, lit, row_number, to_date
from pyspark.sql.window import Window

# Project utility modules (assumed to be available on the path)
from lib.utils import get_spark_session
from lib.job_control import insert_log, get_max_timestamp

# Initialize Spark Session
spark = get_spark_session("Date Staging Load")

# 1. Job Parameters
# Hard-coded for this walkthrough
run_date = "20220101"
schema_name = "pyspark_warehouse"

# Source (Landing) and Target (Staging) configuration
src_table = f"{schema_name}.dim_date_ld"
tgt_table = "dim_date_stg"
tgt_table_full = f"{schema_name}.{tgt_table}"

print(f"Reading from: {src_table}")
print(f"Writing to:   {tgt_table_full}")

## 1. Determine High Watermark
We check the `job_control` table to see when the `dim_date_stg` table was last loaded. That timestamp is the cut-off point: only Landing records inserted after it are read in this run.
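
The lookup can be illustrated without Spark. The following is a minimal plain-Python sketch, not the actual `get_max_timestamp` from `lib.job_control`: the function name `latest_watermark`, the row layout, and the sample table names are assumptions made for illustration. Only the 1900-01-01 fallback comes from this notebook.

```python
from datetime import datetime

# Fallback used when a table has never been loaded (per this notebook: 1900-01-01).
DEFAULT_WATERMARK = datetime(1900, 1, 1)

def latest_watermark(log_rows, table_name):
    """Return the newest max_timestamp logged for table_name, or the default."""
    stamps = [
        row["max_timestamp"]
        for row in log_rows
        if row["table_name"] == table_name
    ]
    return max(stamps, default=DEFAULT_WATERMARK)

# Hypothetical log entries; the real job_control schema may differ.
job_control = [
    {"table_name": "dim_date_stg", "max_timestamp": datetime(2022, 1, 1, 6, 0)},
    {"table_name": "dim_date_stg", "max_timestamp": datetime(2022, 1, 2, 6, 0)},
    {"table_name": "dim_customer_stg", "max_timestamp": datetime(2022, 1, 3, 6, 0)},
]

print(latest_watermark(job_control, "dim_date_stg"))  # newest entry for this table
print(latest_watermark(job_control, "dim_new_stg"))   # no entries: falls back to the default
```

Returning a very old default instead of `None` keeps the downstream filter unchanged on the first run: every Landing row is newer than 1900-01-01, so the first load is a full load.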

In [None]:
# Get the Max Timestamp (High Watermark)
# On the first run there is no previous load, so this returns 1900-01-01
# and every Landing row qualifies as incremental
max_timestamp = get_max_timestamp(spark, schema_name, tgt_table)

print(f"High Watermark (Max Timestamp): {max_timestamp}")

## 2. Read Incremental Data
We read data from the Landing Layer where the `insert_dt` is greater than our High Watermark.
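
The comparison is strict, so a row whose `insert_dt` equals the watermark is treated as already loaded. A small plain-Python sketch of that boundary behaviour follows; the helper `incremental_rows` and the sample rows are hypothetical and stand in for the Spark filter.

```python
from datetime import datetime

def incremental_rows(rows, watermark):
    """Keep only rows inserted strictly after the watermark."""
    return [r for r in rows if r["insert_dt"] > watermark]

watermark = datetime(2022, 1, 1, 6, 0)

# Hypothetical Landing rows.
landing = [
    {"date": "2021-12-31", "insert_dt": datetime(2022, 1, 1, 5, 59)},  # older: skipped
    {"date": "2022-01-01", "insert_dt": datetime(2022, 1, 1, 6, 0)},   # equal: skipped
    {"date": "2022-01-02", "insert_dt": datetime(2022, 1, 2, 6, 0)},   # newer: kept
]

new_rows = incremental_rows(landing, watermark)
print([r["date"] for r in new_rows])
```

Whether `>` or `>=` is the right choice depends on what the Job Control table records as the watermark; the strict form matches the filter used in this notebook.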

In [None]:
# Read from Landing Table
df_landing = spark.read.table(src_table)

# Filter for incremental data
df_incremental = df_landing.filter(col("insert_dt") > max_timestamp)

print(f"Records fetched from Landing: {df_incremental.count()}")

## 3. De-duplication Logic
To ensure data quality, we remove duplicates based on the Natural Key.
*   **Partition By:** `date` (Natural Key)
*   **Order By:** `insert_dt DESC` (Prioritize latest data)
*   **Logic:** Assign a `row_number` and keep only `rn == 1`.
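
The effect of the partition, order, and `rn == 1` filter can be sketched in plain Python as "keep the newest row per key". The helper `latest_per_key` and the sample rows below are hypothetical, and the timestamps are ISO-style strings so that string comparison matches chronological order.

```python
def latest_per_key(rows, key="date", order_by="insert_dt"):
    """Keep, for each key value, the row with the greatest order_by value."""
    latest = {}
    for row in rows:
        current = latest.get(row[key])
        # Strict comparison: on a tie, the first row seen is kept.
        if current is None or row[order_by] > current[order_by]:
            latest[row[key]] = row
    return list(latest.values())

# Hypothetical incremental rows: 2022-01-01 arrived twice.
rows = [
    {"date": "2022-01-01", "day_of_week": "Sat", "insert_dt": "2022-01-01 06:00:00"},
    {"date": "2022-01-01", "day_of_week": "Saturday", "insert_dt": "2022-01-02 06:00:00"},
    {"date": "2022-01-02", "day_of_week": "Sunday", "insert_dt": "2022-01-02 06:00:00"},
]

deduped = latest_per_key(rows)
for row in deduped:
    print(row["date"], row["day_of_week"], row["insert_dt"])
```

Unlike this sketch, Spark's `row_number` gives no guaranteed order between rows that share the same `insert_dt`. If ties are possible in Landing, adding a second ordering column to the window makes the result deterministic.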

In [None]:
# Define Window Spec
window_spec = Window.partitionBy("date").orderBy(col("insert_dt").desc())

# Calculate Row Number
df_deduped_prep = df_incremental.withColumn("rn", row_number().over(window_spec))

# Filter only the latest record per key
df_deduped = df_deduped_prep.filter(col("rn") == 1).drop("rn")

print(f"Count after De-duplication: {df_deduped.count()}")

## 4. Transformation & Type Casting
Convert the raw string columns to proper business data types and add audit information.
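
One thing to keep in mind when casting: with ANSI mode off, Spark returns NULL for a value that cannot be cast instead of raising an error, so malformed Landing values can pass through silently. The plain-Python sketch below mimics that behaviour; the helpers `to_int`, `to_date_or_none`, and `cast_row` are hypothetical, not part of this project.

```python
from datetime import datetime

def to_int(value):
    """Cast a string to int, returning None when the value cannot be parsed."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None

def to_date_or_none(value, fmt="%Y-%m-%d"):
    """Parse a date string, returning None when it does not match fmt."""
    try:
        return datetime.strptime(value, fmt).date()
    except (TypeError, ValueError):
        return None

def cast_row(raw):
    """Convert an all-string Landing row into typed values."""
    return {
        "date": to_date_or_none(raw.get("date")),
        "day": to_int(raw.get("day")),
        "month": to_int(raw.get("month")),
        "year": to_int(raw.get("year")),
        "day_of_week": raw.get("day_of_week"),
    }

good = cast_row({"date": "2022-01-01", "day": "1", "month": "1",
                 "year": "2022", "day_of_week": "Saturday"})
bad = cast_row({"date": "01/01/2022", "day": "one", "month": "1",
                "year": "2022", "day_of_week": "Saturday"})

print(good)
print(bad)  # the unparseable date and day come back as None
```

A check for unexpected nulls in the typed columns after casting is a cheap way to catch such rows before they reach the Dimension table.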

In [None]:
# Apply Transformations
df_stg = (
    df_deduped
    # Landing stores every column as a string; parse and cast to real types
    .withColumn("date", to_date(col("date"), "yyyy-MM-dd"))
    .withColumn("day", col("day").cast("integer"))
    .withColumn("month", col("month").cast("integer"))
    .withColumn("year", col("year").cast("integer"))
    # Already a string in Landing; the cast just makes the target type explicit
    .withColumn("day_of_week", col("day_of_week").cast("string"))
    # Audit columns: these replace the Landing insert_dt with the Staging load time
    .withColumn("insert_dt", current_timestamp())
    .withColumn("update_dt", current_timestamp())
    .withColumn("rundate", lit(run_date))
)

# Select specific columns to ensure schema order
final_cols = ["date", "day", "month", "year", "day_of_week", 
              "insert_dt", "update_dt", "rundate"]

df_final = df_stg.select(final_cols)

print("Staging Schema:")
df_final.printSchema()
df_final.show(5)

## 5. Write to Staging
We write to the staging table in **Overwrite** mode. Staging holds only the current batch (the incremental "delta"), so each run replaces the previous contents before the batch is merged into the final Dimension/Fact table.

In [None]:
# Write Data
df_final.write \
    .format("delta") \
    .mode("overwrite") \
    .saveAsTable(tgt_table_full)

print(f"Data written to {tgt_table_full} in OVERWRITE mode.")

## 6. Job Logging & Manifest
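Record the load in the Job Control table, generate the symlink manifest that lets Athena query the Delta table, and run a quick validation query against the result.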

In [None]:
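# 1. Update Job Control
# Count the rows actually written to Staging instead of re-running df_final's lineage
rows_written = spark.read.table(tgt_table_full).count()
insert_log(spark, schema_name, tgt_table, rows_written, run_date)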
print("Job Control updated.")

# 2. Generate Manifest
spark.sql(f"GENERATE symlink_format_manifest FOR TABLE {tgt_table_full}")
print("Manifest generated.")

# 3. Validation
spark.sql(f"SELECT * FROM {tgt_table_full} LIMIT 5").show()