%md
# Gold Modeling — Gulf to Bay Databricks

This notebook performs the Gold-layer dimensional modeling for the Gulf to Bay modernization pipeline. Gold tables are business-ready fact and dimension structures optimized for analytics, KPI generation, and semantic modeling.

The transformations in this notebook join the conformed Silver datasets, resolve a surrogate key for each sale, derive business metrics such as `sales_amount`, and produce deduplicated dimension tables. These Gold tables serve as the foundation for downstream BI tools, including Power BI and Fabric semantic models.

In [None]:
# ============================================================
#  GOLD CONFIG — Paths, Table Names, and Setup
#  - Defines catalog/schema locations
#  - Establishes Silver input tables
#  - Establishes Gold output tables
#  - Keeps all pathing centralized for maintainability
# ============================================================

from pyspark.sql import functions as F

CATALOG = "gulf_to_bay_databricks"
SCHEMA = "default"

# Silver source tables (input)
SILVER = {
    "sales":     f"{CATALOG}.{SCHEMA}.sales_silver",
    "customers": f"{CATALOG}.{SCHEMA}.customers_silver",
    "products":  f"{CATALOG}.{SCHEMA}.products_silver"
}

# Gold target tables (output)
GOLD = {
    "fact_sales":     f"{CATALOG}.{SCHEMA}.fact_sales",
    "dim_customers":  f"{CATALOG}.{SCHEMA}.dim_customers",
    "dim_products":   f"{CATALOG}.{SCHEMA}.dim_products"
}
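
The two dictionaries above repeat the same `catalog.schema.table` pattern. A minimal sketch of how the names could be built from one place is shown below. `build_table_map` is a hypothetical helper that is not part of this notebook, and the example reuses the catalog, schema, and table names from the config cell.

```python
def build_table_map(catalog: str, schema: str, tables: dict) -> dict:
    """Hypothetical helper: map short aliases to fully qualified table names."""
    return {alias: f"{catalog}.{schema}.{name}" for alias, name in tables.items()}


# Same names as the config cell, built through the helper.
silver_example = build_table_map(
    "gulf_to_bay_databricks",
    "default",
    {"sales": "sales_silver", "customers": "customers_silver", "products": "products_silver"},
)
print(silver_example["sales"])
```

Changing the catalog or schema then only touches the two arguments instead of every f-string.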

In [None]:
# ============================================================
#  LOAD SILVER TABLES
#  - Reads cleaned Silver Delta tables from Unity Catalog
#  - No transformations applied here
#  - Ensures all downstream logic uses consistent DataFrames
# ============================================================

df_sales_silver     = spark.table(SILVER["sales"])
df_customers_silver = spark.table(SILVER["customers"])
df_products_silver  = spark.table(SILVER["products"])

print("Silver tables loaded successfully.")

In [None]:
# ============================================================
#  DIMENSION MODELING
#  - Creates conformed dimension tables
#  - Ensures unique business keys
#  - Applies surrogate keys for analytics engines
# ============================================================

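# Surrogate keys are derived with F.xxhash64 on the business key instead of
# F.monotonically_increasing_id(). The latter is re-evaluated on every action,
# so the key joined into fact_sales can differ from the key written to the
# dimension table. A hash of the business key is stable across actions and
# runs (64-bit collisions are possible in principle but very unlikely).

# Customer Dimension
dim_customers = (
    df_customers_silver
    .dropDuplicates(["customer_id"])   # one row per business key (the kept row is arbitrary)
    .withColumn("customer_sk", F.xxhash64("customer_id"))  # deterministic surrogate key
)

# Product Dimension
dim_products = (
    df_products_silver
    .dropDuplicates(["product_id"])    # one row per business key
    .withColumn("product_sk", F.xxhash64("product_id"))    # deterministic surrogate key
)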
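
A surrogate key is only useful if the same business key maps to the same value wherever the dimension is evaluated, because the fact table and the dimension table must agree on it. The plain-Python sketch below illustrates one way to get that property by hashing the business key. It is an illustration of the idea, not Spark code: `surrogate_key` is a hypothetical helper, the customer IDs are made up, and SHA-256 truncated to 64 bits stands in for whatever hash function the pipeline would actually use.

```python
import hashlib


def surrogate_key(business_key: str) -> int:
    """Hypothetical helper: map a business key to a stable signed 64-bit integer."""
    digest = hashlib.sha256(business_key.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a signed big-endian integer.
    return int.from_bytes(digest[:8], byteorder="big", signed=True)


customer_ids = ["C001", "C002", "C001"]  # illustrative values, with one duplicate
keys = {cid: surrogate_key(cid) for cid in customer_ids}

# The duplicate business key collapses to a single, repeatable surrogate key.
print(len(keys))
```

Because the key depends only on the business key, it does not matter which duplicate row survives deduplication or how often the dimension is recomputed.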

In [None]:
# ============================================================
#  FACT MODELING
#  - Joins Silver sales with dimensions (left joins keep every sale)
#  - Resolves surrogate keys; sales with no matching dimension row get NULL keys
#  - Derives business metrics (e.g., sales_amount)
# ============================================================

fact_sales = (
    df_sales_silver.alias("s")
    .join(dim_customers.alias("c"), F.col("s.customer_id") == F.col("c.customer_id"), "left")
    .join(dim_products.alias("p"),  F.col("s.product_id")  == F.col("p.product_id"),  "left")
    .select(
        # Surrogate keys
        F.col("c.customer_sk"),
        F.col("p.product_sk"),

        # Natural keys
        F.col("s.order_id"),
        F.col("s.customer_id"),
        F.col("s.product_id"),

        # Measures
        F.col("s.quantity"),
        F.col("s.unit_price"),
        (F.col("s.quantity") * F.col("s.unit_price")).alias("sales_amount"),

        # Dates
        F.col("s.order_date"),

        # Metadata
        F.current_timestamp().alias("gold_load_utc")
    )
)
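
Because the joins above are left joins, a sale whose `customer_id` or `product_id` has no match in the dimension is kept with a NULL surrogate key rather than rejected. The plain-Python sketch below shows the check that would surface such orphan rows. `find_orphans` and the sample rows are hypothetical. In Spark, the same idea is usually expressed as a `left_anti` join between the sales DataFrame and the dimension.

```python
def find_orphans(fact_rows, dim_keys, key):
    """Return the fact rows whose `key` value is missing from the dimension keys."""
    known = set(dim_keys)
    return [row for row in fact_rows if row[key] not in known]


# Illustrative data: order 2 references a customer that is not in the dimension.
sales = [
    {"order_id": 1, "customer_id": "C001"},
    {"order_id": 2, "customer_id": "C999"},
]
customers = ["C001", "C002"]

orphans = find_orphans(sales, customers, "customer_id")
print([row["order_id"] for row in orphans])
```

Whether orphans should fail the run, be routed to a quarantine table, or be mapped to an "unknown" dimension member is a design decision this notebook does not state.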

In [None]:
# ============================================================
#  WRITE GOLD TABLES TO UNITY CATALOG
#  - Writes fact and dimension tables as Delta
#  - Overwrites existing Gold tables on every run (full refresh, so reruns are idempotent)
#  - Ensures BI tools consume stable, conformed structures
# ============================================================

(
    dim_customers
    .write
    .format("delta")
    .mode("overwrite")
    .saveAsTable(GOLD["dim_customers"])
)

(
    dim_products
    .write
    .format("delta")
    .mode("overwrite")
    .saveAsTable(GOLD["dim_products"])
)

(
    fact_sales
    .write
    .format("delta")
    .mode("overwrite")
    .saveAsTable(GOLD["fact_sales"])
)

print("Gold tables written successfully.")

In [None]:
# ============================================================
#  VALIDATION PREVIEW — SAMPLE FACT OUTPUT
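#  - Displays a small sample of the persisted fact table
#  - Confirms joins, surrogate keys, and measures
# ============================================================

# Read back from the written Gold table; previewing the in-memory DataFrame
# would recompute the joins instead of showing what was actually saved.
display(spark.table(GOLD["fact_sales"]).limit(20))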
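
A visual preview confirms the shape of the output but not its correctness. The plain-Python sketch below lists checks that could complement it: the fact row count should equal the Silver sales row count (a duplicated dimension key would fan rows out), `sales_amount` should equal `quantity * unit_price`, and surrogate keys should not be NULL. `validate_fact` and the sample rows are hypothetical, and the column names are taken from the fact select above.

```python
from decimal import Decimal


def validate_fact(source_rows, fact_rows):
    """Hypothetical check: return a list of human-readable issues (empty if clean)."""
    issues = []
    if len(fact_rows) != len(source_rows):
        issues.append(f"row count changed: {len(source_rows)} -> {len(fact_rows)}")
    for row in fact_rows:
        if row["sales_amount"] != row["quantity"] * row["unit_price"]:
            issues.append(f"order {row['order_id']}: sales_amount mismatch")
        if row["customer_sk"] is None or row["product_sk"] is None:
            issues.append(f"order {row['order_id']}: missing surrogate key")
    return issues


# Illustrative rows: order 2 has no matching customer, so its key is None.
source = [{"order_id": 1}, {"order_id": 2}]
fact = [
    {"order_id": 1, "quantity": 2, "unit_price": Decimal("9.50"),
     "sales_amount": Decimal("19.00"), "customer_sk": 11, "product_sk": 21},
    {"order_id": 2, "quantity": 1, "unit_price": Decimal("4.00"),
     "sales_amount": Decimal("4.00"), "customer_sk": None, "product_sk": 22},
]

issues = validate_fact(source, fact)
print(issues)
```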