# Date Dimension - Landing Data Load

## Overview
In this notebook, we perform the **Landing Data Load** for the Date Dimension. 

### The Landing Layer Pattern
The Landing Layer (or Raw Layer) serves as the entry point for data into the Data Lakehouse. The standard process involves:
1.  **Reading Source:** Reading incremental files or generating data.
2.  **Casting to String:** Converting all columns to String type. This prevents job failures due to data type mismatches from the source, allowing us to handle type casting in the Staging layer.
3.  **Audit Columns:** Adding `insert_dt` (timestamp) and `rundate` to track lineage.
4.  **Write Strategy:** Writing data in **Append Mode** (usually) to maintain history, or Overwrite for full reloads.
5.  **Job Control:** Updating the control table with the status.
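
Steps 2 and 3 above can be illustrated without Spark. The following is a minimal plain-Python sketch, not the notebook's actual implementation: the function name `to_landing_rows` and the sample rows are hypothetical, and only the column names `insert_dt` and `rundate` come from the pattern described above.

```python
from datetime import datetime, timezone


def to_landing_rows(rows, run_date):
    """Cast every value to a string and attach the audit columns.

    Plain-Python stand-in for the Spark logic; hypothetical helper.
    """
    # One load timestamp shared by every row of this run
    insert_dt = datetime.now(timezone.utc).isoformat(sep=" ", timespec="seconds")
    landed = []
    for row in rows:
        # Step 2: cast every source value to string; missing values stay None
        record = {
            key: (None if value is None else str(value))
            for key, value in row.items()
        }
        # Step 3: audit columns for lineage
        record["insert_dt"] = insert_dt
        record["rundate"] = run_date
        landed.append(record)
    return landed


# The source mixes types for the same column, which would break a typed load
source_rows = [
    {"id": 1, "amount": 10.5},
    {"id": "2", "amount": None},
]
landing_rows = to_landing_rows(source_rows, "20220101")
```

Because every business column ends up as a string, the mixed `id` types in the sample rows load without error, and type validation is deferred to the Staging layer.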

### Specifics for Date Dimension
Unlike other tables that read from files, the **Date Dimension** is generated programmatically using a utility function. 
*   **Target Table:** `dim_date_ld`
*   **Run Date:** `20220101` (Full Load scenario)

In [None]:
# Import only the Spark functions this notebook uses
# (a wildcard import from pyspark.sql.functions would shadow Python built-ins such as sum, min, and max)
from pyspark.sql.functions import col, current_timestamp, lit

# Project utility modules (assumed to be importable for this tutorial)
# In a real setup, these would live in separate .py files on the Python path
from lib.utils import get_spark_session, get_run_date, spark_create_date_data
from lib.job_control import insert_log, get_max_timestamp

# Initialize Spark Session
spark = get_spark_session("Date Landing Load")

# 1. Set Job Parameters
# For this tutorial, we simulate a Full Load with a specific run date
# In production, this might come from a config file or airflow argument
run_date = "20220101" 

print(f"Spark Version: {spark.version}")
print(f"Run Date: {run_date}")

## 1. Generate Data
We use the utility function `spark_create_date_data` to generate 2 years of date data based on the `run_date`.
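
The implementation of `spark_create_date_data` is not shown in this notebook. As a rough idea of what such a generator might produce, here is a plain-Python sketch under stated assumptions: the helper name `generate_date_rows` is hypothetical, the range is assumed to start at the run date and end just before the same day `years` later, and `day_of_week` is assumed to follow the ISO convention (1 = Monday, 7 = Sunday), which may differ from the real utility.

```python
from datetime import datetime, timedelta


def generate_date_rows(run_date, years):
    """Build one row per calendar day, starting at run_date (YYYYMMDD).

    Hypothetical stand-in for the project's date generator.
    """
    start = datetime.strptime(run_date, "%Y%m%d").date()
    try:
        end = start.replace(year=start.year + years)  # exclusive upper bound
    except ValueError:
        # start is Feb 29 and the target year is not a leap year
        end = start.replace(year=start.year + years, day=28)

    rows = []
    current = start
    while current < end:
        rows.append(
            {
                "date": current.isoformat(),
                "day": current.day,
                "month": current.month,
                "year": current.year,
                "day_of_week": current.isoweekday(),  # assumption: ISO numbering
            }
        )
        current += timedelta(days=1)
    return rows


date_rows = generate_date_rows("20220101", 2)
```

With a run date of `20220101` and two years requested, this sketch yields one row per day from 2022-01-01 through 2023-12-31.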

In [None]:
# Columns expected in the Date Landing table
# Note: this list is for reference only and is not applied below;
# in the Landing Layer every column is cast to String in the next step
date_schema_cols = ["date", "day", "month", "year", "day_of_week"]

# Generate Data
# The utility function generates date records. 
# We ask for 2 years of data starting from the run_date.
df_raw = spark_create_date_data(spark, run_date, 2)

print("Raw Data Schema:")
df_raw.printSchema()

print(f"Raw Data Count: {df_raw.count()}")
df_raw.show(5)

## 2. Transformation: Cast to String & Add Audit Columns
For the Landing Layer, we standardize the schema by casting all business columns to String. We then append the audit information.

In [None]:
# 1. Cast all columns to String
# This is a defensive programming practice for the raw layer
df_casted = df_raw.select([col(c).cast("string") for c in df_raw.columns])

# 2. Add Audit Columns
# insert_dt: Current timestamp of load
# rundate: The business date for which the job is running
df_final = df_casted.withColumn("insert_dt", current_timestamp()) \
                    .withColumn("rundate", lit(run_date))

print("Final Landing Data Schema:")
df_final.printSchema()
df_final.show(5, truncate=False)

## 3. Write Data to Landing Layer
We choose the write mode by checking the **Job Control Table**.
*   **First Run:** If the max timestamp is still the default low value (`1900-01-01`), no previous load has been logged for this table, so we use **Overwrite** mode.
*   **Subsequent Runs:** If a previous load is logged, we use **Append** mode.
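
The decision rule above can be isolated into a small function. This is a sketch only: the name `choose_write_mode` is hypothetical, and it assumes the job-control helper returns either a `datetime` or an ISO-style string. Parsing the value before comparing avoids depending on whether the string carries fractional seconds.

```python
from datetime import datetime

# Default low value that signals "no previous load logged"
LOW_WATERMARK = datetime(1900, 1, 1)


def choose_write_mode(max_timestamp):
    """Return 'overwrite' for a first run, 'append' otherwise (hypothetical helper)."""
    if isinstance(max_timestamp, str):
        # Accepts both "1900-01-01 00:00:00" and "1900-01-01 00:00:00.000000"
        max_timestamp = datetime.fromisoformat(max_timestamp)
    return "overwrite" if max_timestamp <= LOW_WATERMARK else "append"


first_run_mode = choose_write_mode("1900-01-01 00:00:00")
first_run_mode_micro = choose_write_mode("1900-01-01 00:00:00.000000")
later_run_mode = choose_write_mode(datetime(2022, 1, 1, 10, 0, 0))
```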

In [None]:
# Table Configuration
schema_name = "pyspark_warehouse" # Database name
table_name = "dim_date_ld"
table_full_name = f"{schema_name}.{table_name}"
landing_path = f"s3a://warehouse/landing/{table_name}" # Example path (not used below: saveAsTable without a path creates a managed table)

# Check Job Control to decide Write Mode
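# get_max_timestamp returns the default low value (1900-01-01 00:00:00) if no entry exists
max_timestamp = get_max_timestamp(spark, schema_name, table_name)

print(f"Max Timestamp in Job Control: {max_timestamp}")

# Compare on the date prefix so the check works whether or not the
# returned value carries fractional seconds (".000000")
if str(max_timestamp).startswith("1900-01-01"):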
    print("No previous load found. Mode: OVERWRITE")
    write_mode = "overwrite"
else:
    print("Previous load found. Mode: APPEND")
    write_mode = "append"

# Write Data
df_final.write \
    .format("delta") \
    .mode(write_mode) \
    .saveAsTable(table_full_name)

print(f"Data successfully written to {table_full_name} in {write_mode} mode.")

## 4. Update Job Control & Generate Manifest
Finally, we log the success of the job and generate the Symlink Manifest file, which is needed when an external engine such as Presto or AWS Athena reads the Delta Lake table through its manifest-based integration.
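
The helpers `insert_log` and `get_max_timestamp` live in the project library and are not shown here. To make the job-control idea concrete, the sketch below models it with the standard-library `sqlite3` module instead of Spark. Everything in it is an assumption for illustration: the `_sketch` function names, the column layout of `job_control`, and the string form of the default low value.

```python
import sqlite3
from datetime import datetime

DEFAULT_LOW = "1900-01-01 00:00:00"  # assumed default low value

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE job_control (
        schema_name  TEXT,
        table_name   TEXT,
        record_count INTEGER,
        rundate      TEXT,
        insert_dt    TEXT
    )
    """
)


def insert_log_sketch(conn, schema_name, table_name, record_count, run_date, insert_dt=None):
    """Record one successful load (hypothetical stand-in for insert_log)."""
    insert_dt = insert_dt or datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    conn.execute(
        "INSERT INTO job_control VALUES (?, ?, ?, ?, ?)",
        (schema_name, table_name, record_count, run_date, insert_dt),
    )
    conn.commit()


def get_max_timestamp_sketch(conn, schema_name, table_name):
    """Return the latest logged insert_dt, or the default low value if none exists."""
    row = conn.execute(
        "SELECT MAX(insert_dt) FROM job_control WHERE schema_name = ? AND table_name = ?",
        (schema_name, table_name),
    ).fetchone()
    return row[0] if row[0] is not None else DEFAULT_LOW


before = get_max_timestamp_sketch(conn, "pyspark_warehouse", "dim_date_ld")
insert_log_sketch(conn, "pyspark_warehouse", "dim_date_ld", 730, "20220101", "2022-01-01 08:00:00")
after = get_max_timestamp_sketch(conn, "pyspark_warehouse", "dim_date_ld")
```

Before any load is logged the lookup falls back to the default low value; after the first logged load it returns that load's timestamp, which is what switches later runs to Append mode.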

In [None]:
# 1. Update Job Control Table
# We pass the schema, table, record count, and run date to the logging helper
insert_log(spark, schema_name, table_name, df_final.count(), run_date)

print("Job Control Table updated.")

# 2. Generate Symlink Manifest
# This allows external engines like Athena/Presto to read the Delta table
spark.sql(f"GENERATE symlink_format_manifest FOR TABLE {table_full_name}")

print("Symlink Manifest generated.")

In [None]:
# Final Validation
# Check if data is readable
spark.sql(f"SELECT * FROM {table_full_name} LIMIT 5").show()

# Verify Job Control Entry
spark.sql(f"""
    SELECT * FROM {schema_name}.job_control 
    WHERE table_name = '{table_name}' 
    ORDER BY insert_dt DESC LIMIT 1
""").show(truncate=False)