# Sales Fact - Staging Data Load (JSON Parsing & Explosion)

## Overview
In this notebook, we transform the raw JSON strings from the Landing layer into a structured Staging table.
*   **Source:** `fact_sales_ld` (Contains raw JSON in `value` column).
*   **Goal:** Parse the JSON, explode nested arrays (`orders` and `order_lines`), and flatten the structure.
*   **Transformations:**
    1.  **Schema Inference:** Determine the schema of the JSON string.
    2.  **Parse:** Use `from_json` to convert the string to a Struct.
    3.  **Explode:** 
        *   Explode `orders` array to get individual orders.
        *   Explode `order_lines` array to get individual line items per order.
    4.  **Enrichment:** 
        *   Join with `dim_product` to get `price` (since it's missing in the source).
        *   Calculate `tax_amt`, `discount_amt`, and `line_total`.
        *   Generate `integration_key` (Composite Key).
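
A note on the composite key: when several columns are hashed into one key, joining them with a separator keeps different value pairs from producing the same input string. The sketch below is plain Python with `hashlib`, not the Spark code used later; the helper name `integration_key` and the `|` separator are assumptions made for illustration.

```python
import hashlib

def integration_key(*parts, sep="|"):
    """MD5 hex digest over the key parts joined with a separator."""
    # None is rendered as an empty string here; this is a choice of the sketch.
    joined = sep.join("" if p is None else str(p) for p in parts)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

# Without a separator, (1, 23) and (12, 3) would both hash the string "123".
key_a = integration_key(1, 23)
key_b = integration_key(12, 3)
print(key_a != key_b)
```

The same reasoning applies to a key built from `order_id` and `product_id`: a separator removes one class of collisions, but the key is still only unique if a product appears at most once per order.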

## Key Concept: Explosion
The source data is hierarchical: `Root -> Orders[] -> OrderLines[]`.
To store this in a relational Data Warehouse fact table, we need to flatten it so that **one row represents one line item**.
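
As a plain-Python illustration of what the two explode steps achieve (this is not the Spark code used below, and the sample payload is made up for the example), nested loops over `orders` and `order_lines` yield one flat row per line item:

```python
import json

# Hypothetical payload shaped like Root -> orders[] -> order_lines[].
raw = json.dumps({
    "orders": [
        {"order_id": 1, "order_lines": [{"product_id": "A", "qty": 2},
                                        {"product_id": "B", "qty": 1}]},
        {"order_id": 2, "order_lines": [{"product_id": "A", "qty": 5}]},
    ]
})

def flatten(raw_json):
    """Return one dict per line item, carrying the parent order_id along."""
    rows = []
    for order in json.loads(raw_json).get("orders") or []:
        for line in order.get("order_lines") or []:
            rows.append({"order_id": order["order_id"], **line})
    return rows

rows = flatten(raw)
for row in rows:
    print(row)
```

Two orders with three line items in total become three rows. An order with an empty `order_lines` array contributes no rows, which mirrors how Spark's `explode` drops null or empty arrays (`explode_outer` keeps them as a row with nulls).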

In [None]:
# Import necessary libraries
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from delta.tables import *

# Initialize Spark Session
spark = SparkSession.builder \
    .appName("Sales Fact Staging Load") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# Job Parameters
run_date = "20220101"
schema_name = "pyspark_warehouse"

# Paths & Tables
landing_table = f"{schema_name}.fact_sales_ld"
staging_table = f"{schema_name}.fact_sales_stg"
dim_product = f"{schema_name}.dim_product"

print(f"Processing Run Date: {run_date}")

## 1. Read Landing Data
Read the raw JSON strings. In a real incremental load, we would filter by `insert_dt`.
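
The incremental idea can be sketched in plain Python as a high-watermark filter. Everything here is an assumption for illustration: the helper `new_records`, the sample rows, and the watermark value (which in practice might be the latest `insert_dt` already loaded downstream). In Spark the same idea would be a `filter` on `insert_dt`.

```python
from datetime import datetime

def new_records(landing_rows, last_loaded):
    """Keep only rows inserted after the last successfully loaded timestamp."""
    if last_loaded is None:
        # First run: nothing has been loaded yet, so take everything.
        return list(landing_rows)
    return [r for r in landing_rows if r["insert_dt"] > last_loaded]

# Hypothetical landing rows; "value" stands in for the raw JSON string.
landing = [
    {"value": "{}", "insert_dt": datetime(2022, 1, 1, 8, 0)},
    {"value": "{}", "insert_dt": datetime(2022, 1, 2, 8, 0)},
]

# Hypothetical watermark from the previous run.
watermark = datetime(2022, 1, 1, 23, 59)
batch = new_records(landing, watermark)
print(len(batch))
```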

In [None]:
# Read Landing Table
df_raw = spark.read.table(landing_table)

# For this load, let's process everything. 
# In production: filter where insert_dt > max_timestamp
print(f"Raw Records: {df_raw.count()}")

## 2. Schema Inference & Parsing
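We take the first record as a sample to infer the JSON schema, then use `from_json` to parse the `value` column. The same cell also explodes the `orders` and `order_lines` arrays and flattens the result.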
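
Inferring from a single record only captures the fields that record happens to contain. Fields missing from the sample are not part of the inferred schema, and `from_json` ignores fields that the schema does not list. The plain-Python sketch below (hypothetical payloads and a hypothetical helper `field_paths`, not Spark code) shows one way to spot such a gap by comparing the field paths of two records.

```python
import json

def field_paths(obj, prefix=""):
    """Collect dotted field paths from a parsed JSON value (arrays use '[]')."""
    paths = set()
    if isinstance(obj, dict):
        for key, val in obj.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= field_paths(val, path)
    elif isinstance(obj, list):
        for item in obj:
            paths |= field_paths(item, prefix + "[]")
    return paths

# The sample record has no discount; a later record does.
sample = '{"orders": [{"order_id": 1, "order_lines": [{"product_id": "A", "qty": 2}]}]}'
later = '{"orders": [{"order_id": 2, "order_lines": [{"product_id": "B", "qty": 1, "discount": 5}]}]}'

# Fields that a schema inferred from `sample` alone would not contain.
missing = field_paths(json.loads(later)) - field_paths(json.loads(sample))
print(missing)
```

If optional fields are expected, a hand-written schema or a sample chosen to contain every field avoids silently losing them.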

In [None]:
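# 1. Infer Schema
# We take a sample record to determine the structure.
# Fields that are absent from this record will be missing from the schema.
sample_json = df_raw.select("value").first()[0]
# schema_of_json returns a Column expression, not the schema itself
json_schema = schema_of_json(lit(sample_json))

print("Inferred JSON Schema:")
# Evaluate the expression once to print the schema as a DDL string
print(spark.range(1).select(json_schema).first()[0])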

# 2. Parse JSON
df_parsed = df_raw.withColumn("json_data", from_json(col("value"), json_schema))

# 3. Explode Orders
# json_data.orders is an Array. We explode it to get one row per order.
df_orders = df_parsed.select(explode(col("json_data.orders")).alias("order"), "insert_dt", "rundate")

# 4. Explode Order Lines
# order.order_lines is also an Array. We explode it to get one row per line item.
# We keep the parent order details as well.
df_exploded = df_orders.select(
    col("order.order_id"),
    col("order.invoice_num"),
    col("order.order_date"),
    col("order.store_id"),
    col("order.customer_id"),
    explode(col("order.order_lines")).alias("line"),
    "insert_dt",
    "rundate"
)

# 5. Flatten Structure
df_flattened = df_exploded.select(
    col("order_id"),
    col("invoice_num"),
    col("order_date"),
    col("store_id"),
    col("customer_id"),
    col("line.product_id"),
    col("line.qty"),
    col("line.tax"),
    col("line.tax_type"), # e.g., Percentage
    col("line.discount"),
    col("line.discount_type"), # e.g., Percentage
    "insert_dt",
    "rundate"
)

print("Exploded & Flattened Data:")
df_flattened.show(5)

## 3. Enrichment & Calculations
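The source data contains `qty`, `tax`, and `discount`, but not the unit `price`, so we look up the price in the **Product Dimension**. Because this is a left join, a product with no active dimension row gets a null `price`, and the amounts derived from it are null as well.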

**Logic:**
1.  **Lookup Price:** Join with `dim_product` on `product_id` (and ensure `active_flg='Y'`).
2.  **Calculate Totals:**
    *   `sub_total` = qty * price
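    *   `tax_amt` = sub_total * tax / 100 when `tax_type` is `Percentage`; otherwise `tax` is used as an absolute amount
    *   `discount_amt` = sub_total * discount / 100 when `discount_type` is `Percentage`; otherwise `discount` is used as an absolute amount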
    *   `line_total` = sub_total + tax_amt - discount_amt
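
The logic above can be checked by hand with a small plain-Python sketch. The helper `line_amounts`, the `prices` lookup, and the `"Absolute"` label are assumptions for illustration; the only rule taken from the notebook is that `Percentage` triggers the percentage formula and anything else is treated as an absolute amount. Returning `None` for a missing price is a simplification of how nulls propagate in Spark.

```python
def line_amounts(qty, price, tax, tax_type, discount, discount_type):
    """Mirror the sub_total / tax_amt / discount_amt / line_total rules."""
    if price is None:
        # Product not found in the lookup: no amounts can be computed.
        return None
    sub_total = qty * price
    tax_amt = sub_total * tax / 100 if tax_type == "Percentage" else tax
    discount_amt = (sub_total * discount / 100
                    if discount_type == "Percentage" else discount)
    return {
        "sub_total": sub_total,
        "tax_amt": tax_amt,
        "discount_amt": discount_amt,
        "line_total": sub_total + tax_amt - discount_amt,
    }

# Hypothetical price lookup standing in for the active rows of dim_product.
prices = {"A": 10.0}

# 2 units at 10.0 with 10% tax and an absolute discount of 1.5.
result = line_amounts(qty=2, price=prices.get("A"), tax=10, tax_type="Percentage",
                      discount=1.5, discount_type="Absolute")
print(result)
```

Here `sub_total` is 20.0, the 10% tax adds 2.0, and the absolute discount removes 1.5, giving a `line_total` of 20.5.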

In [None]:
# 1. Read Product Dimension for Price Lookup
df_product = spark.read.table(dim_product).filter("active_flg = 'Y'") \
    .select("product_id", "price", "row_wid") \
    .withColumnRenamed("row_wid", "product_wid")

# 2. Join to get Price
df_joined = df_flattened.join(df_product, "product_id", "left")

# 3. Calculations
df_calc = df_joined \
    .withColumn("sub_total", col("qty") * col("price")) \
    .withColumn("tax_amt", when(col("tax_type") == "Percentage", (col("sub_total") * col("tax") / 100)).otherwise(col("tax"))) \
    .withColumn("discount_amt", when(col("discount_type") == "Percentage", (col("sub_total") * col("discount") / 100)).otherwise(col("discount"))) \
    .withColumn("line_total", col("sub_total") + col("tax_amt") - col("discount_amt"))

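# 4. Generate Integration Key (Unique ID for the Fact Row)
# Ideally a combination of OrderID + ProductID + Sequence.
# concat_ws adds a delimiter, so ("1", "23") and ("12", "3") hash differently,
# and unlike concat it does not return NULL when one input is NULL.
# Without a line sequence, a product repeated within an order gets the same key.
df_stg = df_calc.withColumn(
        "integration_key",
        md5(concat_ws("|", col("order_id").cast("string"), col("product_id").cast("string")))
    ) \
    .withColumn("update_dt", current_timestamp())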

# 5. Select Final Staging Columns
final_cols = [
    "integration_key", "order_id", "invoice_num", "order_date", 
    "store_id", "customer_id", "product_id", "product_wid",
    "qty", "price", "tax_amt", "discount_amt", "line_total",
    "insert_dt", "update_dt", "rundate"
]

df_stg_final = df_stg.select(final_cols)

print("Final Staging Data:")
df_stg_final.show(5)

## 4. Write to Staging
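Overwrite the staging table with the final DataFrame, then generate a symlink manifest so that engines which read Delta tables through manifests (for example Presto or Athena) can query it.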

In [None]:
# Write to Staging Table
df_stg_final.write.format("delta").mode("overwrite").saveAsTable(staging_table)
print(f"Data loaded to {staging_table}")

# Generate Manifest
spark.sql(f"GENERATE symlink_format_manifest FOR TABLE {staging_table}")