# Gold Layer (Sprint 2) – Fact Development (fact_*)

## Purpose
This notebook builds the Gold **fact table** for Sprint 2’s star schema.
The fact table preserves the same business logic and metrics as Sprint 1 Gold, but in a star schema format for BI.

## Gold Layer Guardrails (Sprint 2)

- Gold notebooks must read **promoted Silver tables only** via:
  - `silver_shortcut.sl_*`
- Gold notebooks must **NOT** read from:
  - `sl_dev_*`
- Gold notebooks must **NOT** write back to Silver.

These guardrails ensure that Gold development and validation use only
stable, promoted Silver data, even though `sl_dev_*` tables may
still be visible via schema shortcuts during development.
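
The guardrails above could be checked mechanically before a cell is run. The following is a minimal sketch, not part of the project: `guardrail_violations` and both regular expressions are hypothetical, and the scan is purely textual, so it will also flag table names that appear inside SQL comments.

```python
import re

# Hypothetical patterns derived from the guardrails above.
# Any reference to a dev Silver table (sl_dev_*).
DEV_TABLE = re.compile(r"\bsl_dev_\w*", re.IGNORECASE)
# Any write statement whose target lives under silver_shortcut.
SILVER_WRITE = re.compile(
    r"\b(?:INSERT\s+(?:INTO|OVERWRITE)|MERGE\s+INTO"
    r"|CREATE\s+(?:OR\s+REPLACE\s+)?TABLE|UPDATE|DELETE\s+FROM)\s+"
    r"(?:TABLE\s+)?(silver_shortcut\.\w+)",
    re.IGNORECASE,
)


def guardrail_violations(sql: str) -> list:
    """Return human-readable guardrail violations found in a SQL string."""
    problems = []
    for match in DEV_TABLE.finditer(sql):
        problems.append(f"reads dev Silver table: {match.group(0)}")
    for match in SILVER_WRITE.finditer(sql):
        problems.append(f"writes to Silver: {match.group(1)}")
    return problems


# A statement that reads promoted Silver and writes to Gold passes.
print(guardrail_violations(
    "CREATE OR REPLACE TABLE fact_x AS SELECT * FROM silver_shortcut.sl_orders"
))
# A statement that reads a dev table is reported.
print(guardrail_violations("SELECT * FROM silver_shortcut.sl_dev_orders"))
```

A check like this only catches obvious slips; it does not replace review of the actual query plan or of table permissions.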

## Inputs (Read-only)
- Use **promoted Silver tables only** through the OneLake shortcut:
  - `silver_shortcut.sl_orders`
  - `silver_shortcut.sl_order_items`
  - `silver_shortcut.sl_order_reviews`
  - `silver_shortcut.sl_sellers`
  - plus any other required `sl_*` tables

## Output (Written to Gold)
- `fact_seller_daily_perf`

## Fact Grain (Must Not Change)
- 1 row = `seller_id` × `order_date`
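
One way to confirm this grain is to look for keys that occur more than once. The sketch below runs the check against an in-memory SQLite table as a local stand-in for the Gold table; the sample rows are invented for illustration, and only the `GROUP BY ... HAVING` query is meant to carry over to Spark SQL.

```python
import sqlite3

# Any row returned by this query is a grain violation.
GRAIN_CHECK_SQL = """
SELECT seller_id, order_date, COUNT(*) AS row_count
FROM fact_seller_daily_perf
GROUP BY seller_id, order_date
HAVING COUNT(*) > 1
ORDER BY seller_id, order_date
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_seller_daily_perf "
    "(seller_id TEXT, order_date TEXT, total_orders INTEGER)"
)
# Toy data (assumption): the last row deliberately duplicates a key.
conn.executemany(
    "INSERT INTO fact_seller_daily_perf VALUES (?, ?, ?)",
    [
        ("s1", "2018-01-01", 3),
        ("s1", "2018-01-02", 1),
        ("s2", "2018-01-01", 2),
        ("s2", "2018-01-01", 5),
    ],
)

duplicates = conn.execute(GRAIN_CHECK_SQL).fetchall()
print(duplicates)  # [('s2', '2018-01-01', 2)]
```

An empty result means every (`seller_id`, `order_date`) pair appears exactly once.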

## Expected Metrics (Sprint 1 parity)
The fact table should contain the same logical metrics as Sprint 1 Gold (even if column names are updated), for example:
- total_orders
- total_items
- total_revenue
- total_gmv_incl_freight
- avg_delivery_days
- has_late_delivery
- avg_review_score

## Rules / Guardrails
- Do not write dimensions here.
- Do not read from `sl_dev_*`.
- Protect the grain: avoid joins that multiply rows before aggregation.
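
To illustrate the last rule, the sketch below shows how joining two child tables of an order (items and reviews) before aggregating inflates revenue, and how pre-aggregating one side avoids it. It uses in-memory SQLite as a stand-in for Spark SQL. The table shapes are simplified, the rows are invented, and the `review_score` column name is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sl_order_items (order_id TEXT, seller_id TEXT, price REAL);
CREATE TABLE sl_order_reviews (order_id TEXT, review_score INTEGER);
-- Toy data: one order with two items and two reviews.
INSERT INTO sl_order_items VALUES ('o1', 's1', 10.0), ('o1', 's1', 20.0);
INSERT INTO sl_order_reviews VALUES ('o1', 4), ('o1', 5);
""")

# Naive: items x reviews produces 4 rows, so each price is counted twice.
naive = conn.execute("""
SELECT SUM(oi.price)
FROM sl_order_items oi
JOIN sl_order_reviews r ON oi.order_id = r.order_id
""").fetchone()[0]

# Safer: collapse reviews to one row per order before joining.
safe = conn.execute("""
WITH review_per_order AS (
    SELECT order_id, AVG(review_score) AS avg_review_score
    FROM sl_order_reviews
    GROUP BY order_id
)
SELECT SUM(oi.price), AVG(r.avg_review_score)
FROM sl_order_items oi
LEFT JOIN review_per_order r ON oi.order_id = r.order_id
""").fetchone()

print(naive)  # 60.0 (inflated; the true item revenue is 30.0)
print(safe)   # (30.0, 4.5)
```

The same pattern applies to any one-to-many input: reduce it to the join key's grain first, then join.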

## Completion Criteria
- `fact_seller_daily_perf` is created successfully in Gold.
- Grain is one row per `seller_id` × `order_date`, with no duplicate keys.
- Metrics match Sprint 1 logic.

## Handoff
Once completed, Janson will validate in `nb_03_gold_star_dev_validation`:
- uniqueness of (seller_id, order_date)
- join behavior fact → dims (no row explosion)
- metric parity vs Sprint 1 Gold
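
The join-behavior and parity checks could look roughly like the sketch below. It runs on in-memory SQLite as a stand-in for Spark SQL and is not the content of the validation notebook. The `sprint1_gold` table name, the `dim_seller` columns, the sample rows, and the 0.005 tolerance are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_seller_daily_perf (seller_id TEXT, order_date TEXT, total_revenue REAL);
CREATE TABLE dim_seller (seller_key INTEGER, seller_id TEXT);
CREATE TABLE sprint1_gold (seller_id TEXT, order_date TEXT, total_revenue REAL);  -- hypothetical name
INSERT INTO fact_seller_daily_perf VALUES ('s1', '2018-01-01', 30.0), ('s2', '2018-01-01', 12.5);
INSERT INTO dim_seller VALUES (1, 's1'), (2, 's2');
INSERT INTO sprint1_gold VALUES ('s1', '2018-01-01', 30.0), ('s2', '2018-01-01', 12.5);
""")

# Join behavior: a fact -> dim join must not change the fact row count.
fact_rows = conn.execute(
    "SELECT COUNT(*) FROM fact_seller_daily_perf"
).fetchone()[0]
joined_rows = conn.execute("""
SELECT COUNT(*)
FROM fact_seller_daily_perf f
LEFT JOIN dim_seller d ON f.seller_id = d.seller_id
""").fetchone()[0]

# Parity: fact rows that are missing from Sprint 1 or differ beyond a tolerance.
mismatches = conn.execute("""
SELECT f.seller_id, f.order_date, f.total_revenue, s.total_revenue
FROM fact_seller_daily_perf f
LEFT JOIN sprint1_gold s
    ON f.seller_id = s.seller_id AND f.order_date = s.order_date
WHERE s.seller_id IS NULL
   OR ABS(f.total_revenue - s.total_revenue) > 0.005
""").fetchall()

print(fact_rows, joined_rows)  # 2 2
print(mismatches)              # []
```

The parity query only covers rows present in the new fact table; rows that exist only in Sprint 1 Gold would need the same query run in the opposite direction.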


In [1]:
%%sql
-- =====================================================
-- GOLD LAYER - FACT TABLES
-- =====================================================

-- =====================================================
-- FACT_ORDER_ITEMS (Detailed Fact)
-- =====================================================

CREATE OR REPLACE TABLE fact_order_items
USING DELTA
AS
SELECT 
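    -- Surrogate key. The window has no PARTITION BY, so Spark sorts all rows
    -- in a single partition; this can be slow on large inputs.
    row_number() OVER (ORDER BY oi.order_id, oi.order_item_id) AS fact_order_item_key,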
    
    -- Foreign Keys
    do.order_key,
    dp.product_key,
    dc.customer_key,
    ds.seller_key,
    
    -- Date Keys (Role-Playing Dimension): yyyyMMdd integers; NULL when the source timestamp is NULL
    CAST(date_format(CAST(o.order_purchase_timestamp AS DATE), 'yyyyMMdd') AS INT) AS order_purchase_date_key,
    CAST(date_format(CAST(o.order_approved_at AS DATE), 'yyyyMMdd') AS INT) AS order_approved_date_key,
    CAST(date_format(CAST(o.order_delivered_carrier_date AS DATE), 'yyyyMMdd') AS INT) AS order_delivered_carrier_date_key,
    CAST(date_format(CAST(o.order_delivered_customer_date AS DATE), 'yyyyMMdd') AS INT) AS order_delivered_customer_date_key,
    CAST(date_format(CAST(o.order_estimated_delivery_date AS DATE), 'yyyyMMdd') AS INT) AS order_estimated_delivery_date_key,
    
    -- Degenerate Dimension
    oi.order_item_id,
    
    -- MEASURES (Item-Level Only)
    oi.price,
    oi.freight_value,
    
    -- Metadata
    current_timestamp() AS row_insert_timestamp,
    current_timestamp() AS row_update_timestamp

FROM silver_shortcut.sl_order_items oi
INNER JOIN silver_shortcut.sl_orders o 
    ON oi.order_id = o.order_id
INNER JOIN dim_order do 
    ON o.order_id = do.order_id
INNER JOIN dim_product dp 
    ON oi.product_id = dp.product_id
INNER JOIN dim_customer dc 
    ON o.customer_id = dc.customer_id
INNER JOIN dim_seller ds 
    ON oi.seller_id = ds.seller_id
WHERE oi.order_id IS NOT NULL
    AND oi.product_id IS NOT NULL
    AND oi.seller_id IS NOT NULL
    AND o.customer_id IS NOT NULL
    AND o.order_purchase_timestamp IS NOT NULL;


-- =====================================================
-- FACT_SELLER_DAILY_SALES (Aggregate Fact)
-- Seller×Day grain - Only seller performance metrics
-- =====================================================

CREATE OR REPLACE TABLE fact_seller_daily_sales
USING DELTA
AS
SELECT 
    f.seller_key,
    f.order_purchase_date_key AS date_key,
    
    -- Order & Item Counts
    COUNT(DISTINCT f.order_key) AS total_orders,
    COUNT(*) AS total_items_sold,
    COUNT(DISTINCT f.customer_key) AS unique_customers,
    COUNT(DISTINCT f.product_key) AS unique_products,
    
    -- Revenue Metrics
    SUM(f.price) AS gross_sales,
    SUM(COALESCE(f.freight_value, 0)) AS total_freight,
    SUM(f.price + COALESCE(f.freight_value, 0)) AS total_revenue,
    
    -- Average Metrics
    AVG(f.price) AS avg_item_price,
    SUM(f.price + COALESCE(f.freight_value, 0)) / NULLIF(COUNT(DISTINCT f.order_key), 0) AS avg_order_value,
    CAST(COUNT(*) AS DECIMAL(8,2)) / NULLIF(COUNT(DISTINCT f.order_key), 0) AS avg_items_per_order,
    
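    -- Delivery Performance (from dim_order)
    COUNT(DISTINCT CASE WHEN o.order_delivered_customer_date IS NOT NULL THEN f.order_key END) AS orders_delivered,
    -- Note: rows here are order items, so this average weights each order
    -- by the number of items this seller sold in it.
    AVG(o.days_to_deliver) AS avg_delivery_days,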
    COUNT(DISTINCT CASE WHEN o.delivery_vs_estimate_days <= 0 THEN f.order_key END) AS orders_delivered_on_time,
    ROUND(
        100.0 * COUNT(DISTINCT CASE WHEN o.delivery_vs_estimate_days <= 0 THEN f.order_key END) / 
        NULLIF(COUNT(DISTINCT CASE WHEN o.delivery_vs_estimate_days IS NOT NULL THEN f.order_key END), 0),
        2
    ) AS on_time_delivery_rate,
    
    -- Metadata
    current_timestamp() AS row_insert_timestamp,
    current_timestamp() AS row_update_timestamp

FROM fact_order_items f
LEFT JOIN dim_order o 
    ON f.order_key = o.order_key
WHERE f.order_purchase_date_key IS NOT NULL
GROUP BY f.seller_key, f.order_purchase_date_key;



