# Sales Fact - Landing Data Load (JSON Source)

## Overview
In this notebook, we begin loading the **Sales Fact** table.
*   **Source Data:** The stores drop sales data in **JSON** format.
*   **File Pattern:** `order_{rundate}_{seq}.json` (e.g., `order_20220101_1.json`). Multiple files can exist for a single run date.
*   **Data Structure:** Nested JSON containing an array of `orders`, where each order contains an array of `order_lines`.
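
The file pattern above can be checked with a few lines of plain Python. This is a minimal sketch, not part of the pipeline: the helper name `group_by_rundate` is hypothetical, and it assumes the run date is always eight digits and the sequence is a plain integer, which the example name suggests but the source does not state.

```python
import re
from collections import defaultdict

# Assumed shape: order_<8-digit rundate>_<integer seq>.json
ORDER_FILE_RE = re.compile(r"^order_(?P<rundate>\d{8})_(?P<seq>\d+)\.json$")

def group_by_rundate(filenames):
    """Return {rundate: [seq, ...]} for names that match the pattern; skip the rest."""
    groups = defaultdict(list)
    for name in filenames:
        match = ORDER_FILE_RE.match(name)
        if match:
            groups[match.group("rundate")].append(int(match.group("seq")))
    # Sort sequences so the result does not depend on listing order
    return {rundate: sorted(seqs) for rundate, seqs in groups.items()}

files = [
    "order_20220101_2.json",
    "order_20220101_1.json",
    "order_20220102_1.json",
    "readme.txt",  # ignored: does not match the pattern
]
grouped = group_by_rundate(files)
print(grouped)
```

Grouping by run date makes it easy to see that one run date can own several sequence files, which is why the load later uses a wildcard for the sequence part.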

## The Landing Layer Strategy
Instead of parsing the complex JSON structure immediately, we will:
1.  **Read as Text:** Read the raw JSON text into a single string column (named `value` by `spark.read.text`), without parsing it.
2.  **Add Audit:** Attach `insert_dt` and `rundate`.
3.  **Write to Delta:** Store this raw string representation in the Landing Delta table (`fact_sales_ld`).

This approach keeps ingestion cheap, since no JSON is parsed at this stage. It also moves parsing failures and schema changes downstream to the Staging layer, where they can be handled without stopping the ingestion pipeline.
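
The idea can be illustrated without Spark. The sketch below is a plain-Python analogy under stated assumptions: the function names `to_landing_row` and `parse_order_lines` are hypothetical, and the flattening shown is only one possible Staging-side parse. It shows that a truncated file still lands as a row, and that the failure only surfaces when the text is parsed later.

```python
import json
from datetime import datetime, timezone

def to_landing_row(raw_text, rundate):
    """Wrap raw file text with audit fields; no JSON parsing happens here."""
    return {
        "value": raw_text,
        "insert_dt": datetime.now(timezone.utc),
        "rundate": rundate,
    }

def parse_order_lines(row):
    """Staging-side parse: flatten orders -> order_lines, or return None on bad JSON."""
    try:
        doc = json.loads(row["value"])
    except json.JSONDecodeError:
        return None
    return [
        {"order_id": order["order_id"], **line}
        for order in doc["orders"]
        for line in order["order_lines"]
    ]

good = '{"orders": [{"order_id": "O1001", "order_lines": [{"product_id": "P001", "qty": 2}]}]}'
bad = '{"orders": [{"order_id": "O1002", '  # simulated truncated file

# Landing accepts both files, because it never inspects the content
landing = [to_landing_row(text, "20220101") for text in (good, bad)]

# Parsing is deferred; only the malformed row fails, and it fails downstream
parsed = [parse_order_lines(row) for row in landing]
```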

In [None]:
# Import necessary libraries
# Explicit imports avoid shadowing Python built-ins (e.g. sum, max) with pyspark.sql.functions
import json
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp, lit

# Initialize Spark Session
spark = SparkSession.builder \
    .appName("Sales Fact Landing Load") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# Job Parameters
run_date = "20220101"
schema_name = "pyspark_warehouse"
landing_file_pattern = f"order_{run_date}_*.json"

# Paths
base_path = "s3a://warehouse/" # Update as needed
source_path = f"{base_path}landing/source/orders/"
landing_table = f"{schema_name}.fact_sales_ld"

print(f"Processing Run Date: {run_date}")

In [None]:
# --- SIMULATION: Create Dummy JSON Source Files ---
# We will create 2 JSON files to simulate multiple sequence files dropping for the same day.

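# Helper to write one dict as a single-line JSON text file.
# It uses the DataFrame writer in overwrite mode so reruns do not fail on an existing path
# (RDD.saveAsTextFile raises an error if the path already exists).
def write_json_file(filename, data_dict):
    path = source_path + filename
    # Serialize the dict to a single-line JSON string
    json_str = json.dumps(data_dict)
    # Spark writes a directory of part files at `path`, not a single file
    spark.createDataFrame([(json_str,)], ["value"]).coalesce(1).write.mode("overwrite").text(path)
    print(f"Created simulation file: {path}")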

# Data for File 1
data_1 = {
    "orders": [
        {
            "order_id": "O1001",
            "invoice_num": "INV1001",
            "order_date": "2022-01-01",
            "store_id": 5001,
            "customer_id": "C001",
            "order_lines": [
                {"product_id": "P001", "qty": 2, "price": 20.00, "tax": 2.5, "discount": 0},
                {"product_id": "P002", "qty": 1, "price": 15.50, "tax": 1.5, "discount": 1.0}
            ]
        }
    ]
}

# Data for File 2
data_2 = {
    "orders": [
        {
            "order_id": "O1002",
            "invoice_num": "INV1002",
            "order_date": "2022-01-01",
            "store_id": 5002,
            "customer_id": "C002",
            "order_lines": [
                {"product_id": "P003", "qty": 5, "price": 12.00, "tax": 5.0, "discount": 2.0}
            ]
        }
    ]
}

# Write the two source files in overwrite mode so this cell can be rerun.
# Note: Spark writes each "file" as a directory of part files at the given path;
# spark.read.text reads the part files inside any directory the glob pattern matches.
spark.createDataFrame([(json.dumps(data_1),)], ["value"]).coalesce(1).write.mode("overwrite").text(source_path + f"order_{run_date}_1.json")
spark.createDataFrame([(json.dumps(data_2),)], ["value"]).coalesce(1).write.mode("overwrite").text(source_path + f"order_{run_date}_2.json")

print("Source JSON files generated.")

## 1. Landing Load (Read as Text)
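We read the JSON files using `spark.read.text`, which loads the raw text into a single string column named `value`. By default each line becomes one row; with the `wholetext` option set to `true`, each file becomes one row. Because nothing is parsed at this point, a complex or evolving schema cannot break the load.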
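
The difference between line-based and whole-file reading matters as soon as a store sends pretty-printed JSON. The following is a plain-Python analogy, not Spark code: `splitlines()` stands in for the default line-per-row behavior and the unsplit string stands in for `wholetext=true`. The helper `parses` is hypothetical.

```python
import json

# A pretty-printed (multi-line) JSON document
multi_line = json.dumps({"orders": [{"order_id": "O1001"}]}, indent=2)

def parses(text):
    """Return True if text is a complete JSON document."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

# One record per line, analogous to the default (wholetext=false)
per_line = multi_line.splitlines()
# One record per file, analogous to wholetext=true
whole = [multi_line]

line_ok = [parses(rec) for rec in per_line]
whole_ok = [parses(rec) for rec in whole]

print(f"{len(per_line)} line records, parseable: {sum(line_ok)}")
print(f"{len(whole)} whole-file record, parseable: {sum(whole_ok)}")
```

With single-line JSON files the two modes produce the same rows, so the choice only becomes visible when a document spans several lines.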

In [None]:
# --- READ DATA ---

# Read all files matching the pattern for the run date.
# wholetext is left at "false" (the default), so each line of a file becomes one row;
# that is sufficient here because the simulated files are single-line JSON.
# Set it to "true" to read each file as a single row when the JSON spans multiple lines.
df_raw = spark.read.option("wholetext", "false").text(source_path + landing_file_pattern)

# Add Audit Columns
df_landing = df_raw \
    .withColumn("insert_dt", current_timestamp()) \
    .withColumn("rundate", lit(run_date))

print("Raw Data Preview (Value column contains JSON string):")
df_landing.show(truncate=False)
df_landing.printSchema()

In [None]:
# --- WRITE TO LANDING ---

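# Append to the Delta table.
# Note: append mode is not idempotent; rerunning this cell for the same run_date adds the same rows again.
df_landing.write.format("delta").mode("append").saveAsTable(landing_table)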

print(f"Data successfully written to {landing_table}")

## 2. Post-Load Activities
1.  **Archive:** Move processed files to an archive folder (not implemented in this notebook).
2.  **Log:** Update Job Control.
3.  **Manifest:** Generate Symlink Manifest for Athena.
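
The archive step might look like the sketch below. It is an illustration under loud assumptions: it works on a local directory with the standard library rather than on S3, the function name `archive_files` is hypothetical, and so is the layout of one archive subfolder per run date. A real S3 version would need an object-store client or the Hadoop FileSystem API instead of `shutil`.

```python
import shutil
import tempfile
from pathlib import Path

def archive_files(source_dir, archive_dir, rundate):
    """Move order_{rundate}_*.json entries from source_dir to archive_dir/rundate."""
    target = Path(archive_dir) / rundate
    target.mkdir(parents=True, exist_ok=True)
    moved = []
    # sorted() makes the move order, and the returned list, deterministic
    for path in sorted(Path(source_dir).glob(f"order_{rundate}_*.json")):
        shutil.move(str(path), str(target / path.name))
        moved.append(path.name)
    return moved

# Demo on a throwaway local directory (stand-in for the real source location)
root = Path(tempfile.mkdtemp())
source, archive = root / "source", root / "archive"
source.mkdir()
for name in ("order_20220101_1.json", "order_20220101_2.json", "order_20220102_1.json"):
    (source / name).write_text("{}")

moved = archive_files(source, archive, "20220101")
remaining = sorted(p.name for p in source.iterdir())
print(f"Archived: {moved}")
print(f"Still in source: {remaining}")
```

Only files for the processed run date are moved, so files for other run dates stay in place for their own load.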

In [None]:
# --- JOB CONTROL & MANIFEST ---

# 1. Update Job Control (Mock Function)
def insert_log(spark, schema, table, count, rundate):
    print(f"LOG: {schema}.{table} loaded with {count} rows for {rundate}")

insert_log(spark, schema_name, "fact_sales_ld", df_landing.count(), run_date)

# 2. Generate Manifest
spark.sql(f"GENERATE symlink_format_manifest FOR TABLE {landing_table}")
print("Manifest generated.")

# 3. Validation
print("Data in Landing Table:")
spark.sql(f"SELECT * FROM {landing_table}").show()