# Gold Layer (Sprint 2) – Star Schema Validation (QC before BI)

## Purpose
This is a **validation-only** notebook.
It performs quality checks on Gold star schema outputs before BI consumes them.

This notebook should **NOT** perform transformations or table writes (except optional small audit tables if agreed).

## Gold Layer Guardrails (Sprint 2)

- Gold notebooks must read **promoted Silver tables only** via:
  - `silver_shortcut.sl_*`
- Gold notebooks must **NOT** read from:
  - `sl_dev_*`
- Gold notebooks must **NOT** write back to Silver.

This guardrail ensures Gold development and validation are based on
stable, promoted Silver data only, even though `sl_dev_*` tables may
still be visible via schema shortcuts during development.
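
The read guardrail can be spot-checked by scanning a notebook's SQL text before it runs. The sketch below is only an illustration, not an agreed team tool: the helper `find_dev_table_refs` and the table names `sl_orders`, `sl_dev_orders`, and `sl_dev_items` are hypothetical. It is a plain text scan, so it also flags `sl_dev_*` names that appear inside comments or string literals.

```python
import re

# Matches a table reference whose name starts with sl_dev_,
# with or without a schema/lakehouse prefix in front of it.
DEV_TABLE = re.compile(r"\b(?:\w+\.)*sl_dev_\w+", re.IGNORECASE)

def find_dev_table_refs(sql_text):
    """Return the sorted, de-duplicated sl_dev_* references found in a SQL string."""
    return sorted({m.group(0).lower() for m in DEV_TABLE.finditer(sql_text)})

# Hypothetical queries: one that respects the guardrail and one that does not.
ok_sql = "SELECT * FROM silver_shortcut.sl_orders"
bad_sql = (
    "SELECT * FROM silver_shortcut.sl_dev_orders o "
    "JOIN SL_DEV_items i ON o.order_id = i.order_id"
)

ok_refs = find_dev_table_refs(ok_sql)
bad_refs = find_dev_table_refs(bad_sql)
print("ok_sql violations:", ok_refs)
print("bad_sql violations:", bad_refs)
```

An empty list means the text contains no `sl_dev_*` reference; anything else names the tables to replace with their promoted `silver_shortcut.sl_*` equivalents.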

## Inputs (Read-only)
Gold outputs produced by Hon Boon:
- `dim_date`
- `dim_seller`
- `fact_seller_daily_perf`

The validation cells in this notebook also query `dim_customer`, `dim_product`, `dim_order`, `dim_payment`, and `fact_order_items`. They reference the seller fact as `fact_seller_daily_sales` and check uniqueness on the `*_key` columns (for example `date_key` and `seller_key`).

Reference baseline (Sprint 1 parity comparison):
- Sprint 1 Gold table(s) in `lh_olist_shared` (read-only)
  - e.g. `lh_olist_shared.dbo.gold_seller_daily_perf` (or equivalent)

## Validation Scope
### A) Star schema structure
- Dimensions have unique keys:
  - `dim_date` unique by `order_date`
  - `dim_seller` unique by `seller_id`

### B) Fact grain and uniqueness
- `fact_seller_daily_perf` must be unique by (`seller_id`, `order_date`)

### C) Relationship checks (no exploding joins)
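- Join `fact → dim_seller` must return the same number of rows as the fact table
- Join `fact → dim_date` must return the same number of rows as the fact table
- A higher row count means a dimension key is not unique (the join fans out); with an inner join, a lower row count means some fact keys have no matching dimension row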
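
The row-count rule can be sketched as follows. This is a minimal illustration, not the notebook's actual check: it uses Python's built-in `sqlite3` as a stand-in for Spark SQL, tiny made-up tables, and a hypothetical helper `join_row_counts`. The `seller_city` and `revenue` columns are assumptions added only to make the demo tables non-trivial.

```python
import sqlite3

def join_row_counts(conn, fact, dim, key):
    """Return (fact_rows, joined_rows) for an inner join fact -> dim on `key`.

    Hypothetical helper: identifiers are interpolated into the SQL text,
    so pass trusted table and column names only.
    """
    fact_rows = conn.execute(f"SELECT COUNT(*) FROM {fact}").fetchone()[0]
    joined_rows = conn.execute(
        f"SELECT COUNT(*) FROM {fact} f JOIN {dim} d ON f.{key} = d.{key}"
    ).fetchone()[0]
    return fact_rows, joined_rows

# Tiny made-up tables standing in for the Gold outputs.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_seller (seller_id TEXT, seller_city TEXT);
    CREATE TABLE fact_seller_daily_perf (seller_id TEXT, order_date TEXT, revenue REAL);
    INSERT INTO dim_seller VALUES ('s1', 'sao paulo'), ('s2', 'curitiba');
    INSERT INTO fact_seller_daily_perf VALUES
        ('s1', '2018-01-01', 100.0),
        ('s1', '2018-01-02', 50.0),
        ('s2', '2018-01-01', 75.0);
""")
args = ("fact_seller_daily_perf", "dim_seller", "seller_id")

healthy = join_row_counts(conn, *args)

# Fan-out: a second dim row for s1 doubles every s1 fact row in the join.
conn.execute("INSERT INTO dim_seller VALUES ('s1', 'duplicate row')")
fanned_out = join_row_counts(conn, *args)

# Orphan: remove the duplicate, then add a fact row whose seller has no dim row.
conn.execute("DELETE FROM dim_seller WHERE seller_city = 'duplicate row'")
conn.execute("INSERT INTO fact_seller_daily_perf VALUES ('s9', '2018-01-03', 10.0)")
orphaned = join_row_counts(conn, *args)

for label, (fact_rows, joined_rows) in [
    ("healthy", healthy), ("fanned_out", fanned_out), ("orphaned", orphaned)
]:
    status = "PASS" if fact_rows == joined_rows else "FAIL"
    print(f"{label}: fact={fact_rows} joined={joined_rows} {status}")
```

Comparing against an inner join catches both failure modes in one number. A left join would hide orphaned fact keys, because it keeps every fact row regardless of whether a dimension row matches.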

### D) Metric parity vs Sprint 1 Gold
- Re-aggregate fact totals (orders, items, revenue, GMV, etc.)
- Compare totals against Sprint 1 Gold totals
- Differences indicate upstream duplication, filtering, or logic drift
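
A parity comparison along these lines might look like the sketch below. It is an assumption-laden illustration: `sqlite3` stands in for Spark SQL, the metric columns `orders` and `revenue` are hypothetical, the helpers `metric_totals` and `parity_diffs` are made up for this example, and the `0.01` tolerance is an arbitrary choice to absorb floating-point rounding. The baseline table name follows the `gold_seller_daily_perf` example given under Inputs.

```python
import sqlite3

def metric_totals(conn, table, metrics):
    """Sum each metric column over the whole table and return {metric: total}."""
    select_list = ", ".join(f"COALESCE(SUM({m}), 0)" for m in metrics)
    row = conn.execute(f"SELECT {select_list} FROM {table}").fetchone()
    return dict(zip(metrics, row))

def parity_diffs(current, baseline, tolerance=0.01):
    """Return {metric: current - baseline} for metrics that differ beyond the tolerance."""
    return {
        m: current[m] - baseline[m]
        for m in baseline
        if abs(current[m] - baseline[m]) > tolerance
    }

# Made-up data: the Sprint 2 fact contains one accidentally duplicated row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gold_seller_daily_perf
        (seller_id TEXT, order_date TEXT, orders INTEGER, revenue REAL);
    CREATE TABLE fact_seller_daily_perf
        (seller_id TEXT, order_date TEXT, orders INTEGER, revenue REAL);
    INSERT INTO gold_seller_daily_perf VALUES
        ('s1', '2018-01-01', 2, 100.0),
        ('s2', '2018-01-01', 1, 75.0);
    INSERT INTO fact_seller_daily_perf VALUES
        ('s1', '2018-01-01', 2, 100.0),
        ('s2', '2018-01-01', 1, 75.0),
        ('s2', '2018-01-01', 1, 75.0);
""")

metrics = ["orders", "revenue"]
baseline = metric_totals(conn, "gold_seller_daily_perf", metrics)
current = metric_totals(conn, "fact_seller_daily_perf", metrics)
diffs = parity_diffs(current, baseline)

print("baseline:", baseline)
print("current: ", current)
print("diffs:   ", diffs if diffs else "none (parity holds)")
```

An empty `diffs` dictionary means the totals agree within the tolerance. A non-empty one lists each metric with its signed difference, which helps tell duplication (totals too high) from over-filtering (totals too low).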

## Expected Outcome
- All checks pass
- Gold tables can then be promoted and connected to the semantic model and BI with confidence

## Ownership
- Janson owns this notebook.
- Others may open it for viewing, but should not edit it without coordinating first (auto-save risk).


In [2]:
-- ==============================================================
-- Duplicate check for dim tables: a table passes when
-- distinct_keys = total_rows (COUNT(DISTINCT ...) ignores NULLs)
-- ==============================================================

SELECT 'dim_date' AS table_name, COUNT(DISTINCT date_key) AS distinct_keys, COUNT(*) AS total_rows
FROM dim_date
UNION ALL
SELECT 'dim_customer', COUNT(DISTINCT customer_id), COUNT(*) FROM dim_customer
UNION ALL
SELECT 'dim_product', COUNT(DISTINCT product_id), COUNT(*) FROM dim_product
UNION ALL
SELECT 'dim_seller', COUNT(DISTINCT seller_id), COUNT(*) FROM dim_seller
UNION ALL
SELECT 'dim_order', COUNT(DISTINCT order_id), COUNT(*) FROM dim_order
UNION ALL
SELECT 'dim_payment', COUNT(DISTINCT payment_type), COUNT(*) FROM dim_payment;

<Spark SQL result set with 6 rows and 3 fields>

In [9]:
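-- =========================
-- Validation for Dim Tables
-- =========================
-- Each duplicate_key branch returns one row per duplicated key and no rows
-- when the key is unique; each null_key branch always returns exactly one row.
-- A run with no duplicates therefore shows only the six null_key rows.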

SELECT 'dim_customer - duplicate_key' AS table_name,
       COUNT(*) AS mismatch_count
FROM dim_customer
GROUP BY customer_key
HAVING COUNT(*) > 1

UNION ALL
SELECT 'dim_customer - null_key' AS table_name,
       COUNT(*) AS mismatch_count
FROM dim_customer
WHERE customer_key IS NULL

UNION ALL
SELECT 'dim_date - duplicate_key' AS table_name,
       COUNT(*) AS mismatch_count
FROM dim_date
GROUP BY date_key
HAVING COUNT(*) > 1

UNION ALL
SELECT 'dim_date - null_key' AS table_name,
       COUNT(*) AS mismatch_count
FROM dim_date
WHERE date_key IS NULL

UNION ALL
SELECT 'dim_order - duplicate_key' AS table_name,
       COUNT(*) AS mismatch_count
FROM dim_order
GROUP BY order_key
HAVING COUNT(*) > 1

UNION ALL
SELECT 'dim_order - null_key' AS table_name,
       COUNT(*) AS mismatch_count
FROM dim_order
WHERE order_key IS NULL

UNION ALL
SELECT 'dim_product - duplicate_key' AS table_name,
       COUNT(*) AS mismatch_count
FROM dim_product
GROUP BY product_key
HAVING COUNT(*) > 1

UNION ALL
SELECT 'dim_product - null_key' AS table_name,
       COUNT(*) AS mismatch_count
FROM dim_product
WHERE product_key IS NULL

UNION ALL
SELECT 'dim_seller - duplicate_key' AS table_name,
       COUNT(*) AS mismatch_count
FROM dim_seller
GROUP BY seller_key
HAVING COUNT(*) > 1

UNION ALL
SELECT 'dim_seller - null_key' AS table_name,
       COUNT(*) AS mismatch_count
FROM dim_seller
WHERE seller_key IS NULL

UNION ALL
SELECT 'dim_payment - duplicate_key' AS table_name,
       COUNT(*) AS mismatch_count
FROM dim_payment
GROUP BY payment_key
HAVING COUNT(*) > 1

UNION ALL
SELECT 'dim_payment - null_key' AS table_name,
       COUNT(*) AS mismatch_count
FROM dim_payment
WHERE payment_key IS NULL;



<Spark SQL result set with 6 rows and 2 fields>
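
The `duplicate_key` branches above return rows only when duplicates exist, so a passing check is indistinguishable from a check that never ran. If a fixed one-row-per-check report is preferred, the pattern used in the fact validation cells (sum the surplus rows and default to 0) applies to dimensions too. A minimal sketch, using Python's built-in `sqlite3` as a stand-in for Spark SQL, made-up rows, and a hypothetical helper `key_check_report`:

```python
import sqlite3

def key_check_report(conn, table, key):
    """Return [(check_name, mismatch_count), ...] with exactly one row per check.

    Hypothetical helper: identifiers are interpolated into the SQL text,
    so pass trusted table and column names only.
    """
    sql = f"""
        SELECT '{table} - duplicate_key', COALESCE(SUM(dup_count), 0)
        FROM (
            SELECT COUNT(*) - 1 AS dup_count   -- surplus rows per repeated key
            FROM {table}
            WHERE {key} IS NOT NULL            -- NULLs are reported separately
            GROUP BY {key}
            HAVING COUNT(*) > 1
        ) t
        UNION ALL
        SELECT '{table} - null_key', COUNT(*)
        FROM {table}
        WHERE {key} IS NULL
    """
    return conn.execute(sql).fetchall()

# Made-up rows: dim_seller has key 2 three times plus one NULL key; dim_date is clean.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_seller (seller_key INTEGER, seller_id TEXT);
    INSERT INTO dim_seller VALUES (1, 'a'), (2, 'b'), (2, 'b'), (2, 'b'), (NULL, 'c');
    CREATE TABLE dim_date (date_key INTEGER);
    INSERT INTO dim_date VALUES (20180101), (20180102);
""")

dirty = dict(key_check_report(conn, "dim_seller", "seller_key"))
clean = dict(key_check_report(conn, "dim_date", "date_key"))
print(dirty)
print(clean)
```

With this shape every check appears in the output, and a pass is an explicit `0` rather than a missing row.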

In [3]:
-- ===================================
-- FACT_SELLER_DAILY_SALES VALIDATION
-- ===================================

SELECT 'fact_seller_daily_sales - duplicate_check' AS table_name,
       COALESCE(SUM(dup_count),0) AS mismatch_count
FROM (
    SELECT seller_key, date_key, COUNT(*) - 1 AS dup_count
    FROM fact_seller_daily_sales
    GROUP BY seller_key, date_key
    HAVING COUNT(*) > 1
) t

UNION ALL

SELECT 'fact_seller_daily_sales - null_check' AS table_name,
       (SELECT COUNT(*) FROM fact_seller_daily_sales 
        WHERE seller_key IS NULL 
           OR date_key IS NULL) AS mismatch_count

UNION ALL

SELECT 'fact_seller_daily_sales - integrity_check_sellers' AS table_name,
       COUNT(*) AS mismatch_count
FROM fact_seller_daily_sales f
LEFT JOIN dim_seller s ON f.seller_key = s.seller_key
WHERE s.seller_key IS NULL

UNION ALL

SELECT 'fact_seller_daily_sales - integrity_check_dates' AS table_name,
       COUNT(*) AS mismatch_count
FROM fact_seller_daily_sales f
LEFT JOIN dim_date d ON f.date_key = d.date_key
WHERE d.date_key IS NULL;


<Spark SQL result set with 4 rows and 2 fields>

In [4]:
-- ===========================
-- FACT_ORDER_ITEMS VALIDATION
-- ===========================

SELECT 'fact_order_items - duplicate_check' AS table_name,
       COALESCE(SUM(dup_count),0) AS mismatch_count
FROM (
    SELECT order_key, order_item_id, COUNT(*) - 1 AS dup_count
    FROM fact_order_items
    GROUP BY order_key, order_item_id
    HAVING COUNT(*) > 1
) t

UNION ALL

SELECT 'fact_order_items - null_check' AS table_name,
       (SELECT COUNT(*) FROM fact_order_items 
        WHERE order_key IS NULL 
           OR product_key IS NULL 
           OR customer_key IS NULL 
           OR seller_key IS NULL 
           OR order_purchase_date_key IS NULL) AS mismatch_count

UNION ALL

SELECT 'fact_order_items - integrity_check_orders' AS table_name,
       COUNT(*) AS mismatch_count
FROM fact_order_items f
LEFT JOIN dim_order d ON f.order_key = d.order_key
WHERE d.order_key IS NULL

UNION ALL

SELECT 'fact_order_items - integrity_check_products' AS table_name,
       COUNT(*) AS mismatch_count
FROM fact_order_items f
LEFT JOIN dim_product p ON f.product_key = p.product_key
WHERE p.product_key IS NULL

UNION ALL

SELECT 'fact_order_items - integrity_check_customers' AS table_name,
       COUNT(*) AS mismatch_count
FROM fact_order_items f
LEFT JOIN dim_customer c ON f.customer_key = c.customer_key
WHERE c.customer_key IS NULL

UNION ALL

SELECT 'fact_order_items - integrity_check_sellers' AS table_name,
       COUNT(*) AS mismatch_count
FROM fact_order_items f
LEFT JOIN dim_seller s ON f.seller_key = s.seller_key
WHERE s.seller_key IS NULL;

<Spark SQL result set with 6 rows and 2 fields>

In [9]:
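-- Spot check for a single order: inspect its dim_order row, then compute the
-- delivery delay in days (positive = delivered after the estimated date).
SELECT * FROM dim_order WHERE order_id = '118045506e1c1dda060171af43fe11b4';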

SELECT 
    DATEDIFF(order_delivered_customer_date, order_estimated_delivery_date) AS days_diff
FROM dim_order
WHERE order_id = '118045506e1c1dda060171af43fe11b4';

<Spark SQL result set with 1 rows and 13 fields>

<Spark SQL result set with 1 rows and 1 fields>