# Sales Fact - Data Warehouse Load (Star Schema Integration)

## Overview
In this notebook, we perform the final load for the **Sales Fact Table**.
*   **Source:** `fact_sales_stg` (Flattened sales data).
*   **Target:** `fact_sales` (Final Fact table in the Data Warehouse).
*   **Strategy:** **Append-Only**. Sales transactions are historical facts and usually do not change once finalized. Returns and corrections are typically recorded as new negative transactions or handled by a separate process.

## Key Integration Step: Surrogate Key Lookups
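A Fact table should ideally contain **Foreign Keys (Surrogate Keys)** pointing to the Dimension tables, rather than Natural Keys. Surrogate keys are compact integers that join quickly, insulate the fact table from changes to source-system identifiers, and let each fact row point at a specific version of an SCD2 dimension row.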

We will join the Staging data with:
1.  **Dim Store** (to get `store_wid`)
2.  **Dim Customer** (to get `customer_wid`)
3.  **Dim Product** (to get `product_wid`)
4.  **Dim Date** (to get `date_wid`) - *Note: In many designs, date_wid is just the integer date YYYYMMDD*
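
The note on `date_wid` can be made concrete. The following is a minimal sketch, assuming the warehouse follows the convention where the key is simply the calendar date written as an integer. `to_date_wid` is a hypothetical helper and is not part of this notebook.

```python
from datetime import date

def to_date_wid(d: date) -> int:
    """Encode a calendar date as an integer in YYYYMMDD form."""
    return d.year * 10000 + d.month * 100 + d.day

print(to_date_wid(date(2022, 1, 1)))  # 20220101
```

Under that convention the key can be computed directly from the transaction date (in Spark, `date_format(col, "yyyyMMdd").cast("int")` yields the same value). This notebook still looks the key up in `dim_date`, which also works when `row_wid` is an ordinary sequence number.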

In [None]:
# Import necessary libraries
# Only SparkSession is used in this notebook. Wildcard imports from
# pyspark.sql.functions would shadow Python builtins such as sum, max, and round.
from pyspark.sql import SparkSession

# Initialize Spark Session
spark = SparkSession.builder \
    .appName("Sales Fact Load") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# Job Parameters
run_date = "20220101"
schema_name = "pyspark_warehouse"

# Table Names
stg_table = f"{schema_name}.fact_sales_stg"
dim_store = f"{schema_name}.dim_store"
dim_customer = f"{schema_name}.dim_customer"
dim_product = f"{schema_name}.dim_product"
dim_date = f"{schema_name}.dim_date"
fact_table = f"{schema_name}.fact_sales"

print(f"Processing Run Date: {run_date}")

## 1. Read Staging Data
Read the processed sales data from the staging layer.

In [None]:
# Read Staging
df_stg = spark.read.table(stg_table)

print(f"Staging Records: {df_stg.count()}")

## 2. Dimension Lookups (Surrogate Keys)
We join the staging data with dimensions to retrieve the Warehouse IDs (WIDs).
*   **Join Condition:** Natural Keys (e.g., `store_id`)
*   **Filter:** For SCD2 dimensions (Customer, Product), strictly speaking we should check that the transaction date falls between `effective_start_dt` and `effective_end_dt`.
    *   *Simplification for this tutorial:* We join on the Natural Key and keep only rows with `active_flg = 'Y'` (Current View). This assumes the current dimension version was also the one in effect on the transaction date, which is acceptable for the initial load but not for late-arriving or historical transactions.
    *   *Best Practice:* `stg.order_date BETWEEN dim.effective_start_dt AND dim.effective_end_dt`.
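
The point-in-time condition described above could be expressed as a Spark SQL lookup. This is a sketch under stated assumptions, not the notebook's actual logic: `scd2_lookup_sql` is a hypothetical helper, the dimension is assumed to carry the `effective_start_dt` and `effective_end_dt` columns named above, current rows are assumed to hold a far-future end date rather than NULL, and version ranges are assumed not to overlap (`BETWEEN` is inclusive at both ends).

```python
def scd2_lookup_sql(stg_table: str, dim_table: str, natural_key: str, wid_alias: str) -> str:
    """Build a lookup that picks the dimension version valid on order_date."""
    return f"""
        SELECT stg.*, dim.row_wid AS {wid_alias}
        FROM {stg_table} stg
        LEFT JOIN {dim_table} dim
          ON stg.{natural_key} = dim.{natural_key}
         AND stg.order_date BETWEEN dim.effective_start_dt AND dim.effective_end_dt
    """

# Table names follow the notebook's parameters; the helper itself is illustrative.
query = scd2_lookup_sql(
    "pyspark_warehouse.fact_sales_stg",
    "pyspark_warehouse.dim_customer",
    "customer_id",
    "customer_wid",
)
print(query)
```

The resulting string can be passed to `spark.sql(query)`. Unlike the `active_flg = 'Y'` filter, a transaction dated before a customer change resolves to the older `row_wid`, and a transaction with no valid version gets a NULL key from the left join.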

In [None]:
# 1. Read Dimensions
df_dim_store = spark.read.table(dim_store).select("store_id", "row_wid").withColumnRenamed("row_wid", "store_wid")
df_dim_cust = spark.read.table(dim_customer).where("active_flg='Y'").select("customer_id", "row_wid").withColumnRenamed("row_wid", "customer_wid")
df_dim_prod = spark.read.table(dim_product).where("active_flg='Y'").select("product_id", "row_wid").withColumnRenamed("row_wid", "product_wid")
# For Date, usually we join on the date column
df_dim_date = spark.read.table(dim_date).select("date", "row_wid").withColumnRenamed("row_wid", "date_wid")

# 2. Perform Joins
# Start with Staging
df_joined = df_stg \
    .join(df_dim_store, "store_id", "left") \
    .join(df_dim_cust, "customer_id", "left") \
    .join(df_dim_prod, "product_id", "left") \
    .join(df_dim_date, df_stg["order_date"] == df_dim_date["date"], "left")

# 3. Select Final Fact Columns
final_cols = [
    "integration_key", 
    "store_wid", 
    "customer_wid", 
    "product_wid", 
    "date_wid", 
    "order_id", 
    "invoice_num", 
    "qty", 
    "price", 
    "tax_amt", 
    "discount_amt", 
    "line_total", 
    "insert_dt", 
    "update_dt", 
    "rundate"
]

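# The left joins leave NULL WIDs where no dimension row matched; default those to -1 as an "unknown" placeholder.
# Note: fillna with an integer only fills numeric columns, so this assumes the WIDs are numeric.
df_fact_prep = df_joined.select(final_cols) \
    .fillna(-1, subset=["store_wid", "customer_wid", "product_wid", "date_wid"])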

print("Fact Table Data Preview:")
df_fact_prep.show(5)
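
Defaulting failed lookups to -1 makes them easy to overlook, so it can help to count them before writing. The sketch below is one possible check; `count_unmatched` is a hypothetical helper that relies only on the standard `DataFrame.where` and `DataFrame.count` methods.

```python
def count_unmatched(df, wid_cols):
    """Return {column: number of rows whose lookup fell back to -1}."""
    # Each count() runs as a separate Spark job, which is acceptable
    # for a tutorial-sized load.
    return {c: df.where(f"{c} = -1").count() for c in wid_cols}

# Possible usage in this notebook:
# unmatched = count_unmatched(
#     df_fact_prep, ["store_wid", "customer_wid", "product_wid", "date_wid"]
# )
# print(unmatched)
```

A non-zero count for a column usually means the dimension load is missing keys that appear in staging, or that the join columns differ in type or format.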

## 3. Write to Fact Table
We append the data to the Fact table.
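
Because the load is append-only, running the notebook twice for the same run date would insert the same rows twice. A guard along the following lines is one way to detect that. It is a sketch, assuming the `rundate` column holds the same value as the `run_date` parameter; `run_already_loaded` is a hypothetical helper, and `spark.catalog.tableExists` requires PySpark 3.3 or later.

```python
def run_already_loaded(spark, fact_table: str, run_date: str) -> bool:
    """Return True if fact_table already holds rows for run_date."""
    # On the first run the table does not exist yet, so nothing can be duplicated.
    if not spark.catalog.tableExists(fact_table):
        return False
    loaded = spark.table(fact_table).where(f"rundate = '{run_date}'")
    # limit(1) avoids counting every matching row; one row is enough to decide.
    return loaded.limit(1).count() > 0

# Possible usage before the append:
# if run_already_loaded(spark, fact_table, run_date):
#     raise RuntimeError(f"{fact_table} already contains rundate {run_date}")
```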

In [None]:
# Write to Fact Table (Append)
df_fact_prep.write.format("delta").mode("append").saveAsTable(fact_table)

print(f"Data successfully loaded to {fact_table}")

In [None]:
# --- POST LOAD ACTIVITIES ---

# 1. Update Job Control
# (Mock logic) Note: count() re-executes the lookup joins because df_fact_prep is not cached.
print(f"LOG: {fact_table} loaded with {df_fact_prep.count()} rows.")

# 2. Generate Manifest
spark.sql(f"GENERATE symlink_format_manifest FOR TABLE {fact_table}")
print("Manifest generated.")

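# 3. Final Analysis Query
# These are inner joins: facts whose WID defaulted to -1 are excluded unless
# the dimensions contain a matching -1 "unknown" row.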
spark.sql(f"""
    SELECT 
        d.year, d.month, 
        p.category, 
        SUM(f.line_total) as total_sales
    FROM {fact_table} f
    JOIN {dim_date} d ON f.date_wid = d.row_wid
    JOIN {dim_product} p ON f.product_wid = p.row_wid
    GROUP BY d.year, d.month, p.category
    ORDER BY total_sales DESC
    LIMIT 5
""").show() # Note: Adjust columns based on actual schema availability