# Gold Layer – Validation Notebook (Sprint 1)

## Purpose
Validate whether **Gold Dev** output `gold_dev_seller_daily_perf` is ready for promotion to shared Gold `gold_seller_daily_perf`.

This notebook is designed as a promotion gate:
- Silver source alignment (policy gate)
- BI contract checks
- Grain checks (1 row per seller_id x order_date)
- Executive PASS/FAIL baseline check (FULL OUTER JOIN)
- Delivery and review sanity checks

## Promotion rule
Gold Dev may be promoted **only if all validation checks pass** and Gold Dev has been rebuilt from **locked Silver (`sl_*`)**.

Promotion action:
- `gold_dev_seller_daily_perf` → `gold_seller_daily_perf`

## Section 0: Load Gold Dev table

Objective: Load Gold Dev output and inspect basic structure.

### Baseline key count check (row-level sanity)

This check compares the number of `(seller_id × order_date)` rows produced by Gold Dev
against the number of unique seller-days that *should* exist based on Silver input data.

- `base_keys` derives the expected grain directly from `sl_orders` + `sl_order_items`
- `gold_rows` counts rows in `gold_dev_seller_daily_perf`
- `baseline_key_rows` counts the expected seller-day keys

**Expected result:**  
`gold_rows = baseline_key_rows`

If these counts differ, Gold Dev is either:
- missing seller-days, or
- producing extra rows (wrong grain or fan-out)

This is a fast structural sanity check before deeper metric validation.
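
As a rough illustration of what the key count measures, the sketch below derives the expected seller-day keys from a few in-memory rows. The sample orders, items, and the `gold_rows` value are hypothetical stand-ins for `sl_orders`, `sl_order_items`, and the Gold Dev row count; they are not taken from the Lakehouse.

```python
from datetime import datetime

# Hypothetical stand-ins for sl_orders (order_id -> purchase timestamp)
# and sl_order_items (order_id, seller_id); not real data.
orders = {
    "o1": datetime(2017, 11, 24, 9, 30),
    "o2": datetime(2017, 11, 24, 18, 5),
    "o3": datetime(2017, 11, 25, 7, 45),
}
order_items = [
    ("o1", "seller_a"),
    ("o1", "seller_a"),  # second item, same seller and day: adds no new key
    ("o2", "seller_a"),
    ("o2", "seller_b"),
    ("o3", "seller_a"),
]


def baseline_keys(orders, order_items):
    """Return the set of (order_date, seller_id) keys implied by the inputs."""
    return {
        (orders[order_id].date(), seller_id)
        for order_id, seller_id in order_items
        if order_id in orders  # mirrors the inner join on order_id
    }


keys = baseline_keys(orders, order_items)
baseline_key_rows = len(keys)
gold_rows = 3  # hypothetical Gold Dev row count for this toy input

print("gold_rows == baseline_key_rows:", gold_rows == baseline_key_rows)
```

Equal counts do not prove that the keys themselves match, which is why the FULL OUTER JOIN comparison later in the notebook remains the actual gate.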

In [25]:
%%sql
WITH base_keys AS (
  SELECT
    CAST(o.order_purchase_timestamp AS DATE) AS order_date,
    oi.seller_id
  FROM sl_order_items oi
  JOIN sl_orders o ON oi.order_id = o.order_id
  GROUP BY CAST(o.order_purchase_timestamp AS DATE), oi.seller_id
)
SELECT
  (SELECT COUNT(*) FROM gold_dev_seller_daily_perf) AS gold_rows,
  (SELECT COUNT(*) FROM base_keys) AS baseline_key_rows;

<Spark SQL result set with 1 rows and 2 fields>

In [13]:
from pyspark.sql import functions as F

GOLD_DEV = "gold_dev_seller_daily_perf"
df = spark.table(GOLD_DEV)

print("Table:", GOLD_DEV)
print("Rows:", df.count())
df.printSchema()
df.show(5, truncate=False)

Table: gold_dev_seller_daily_perf
Rows: 69413
root
 |-- order_date: date (nullable = true)
 |-- seller_id: string (nullable = true)
 |-- seller_city: string (nullable = true)
 |-- seller_state: string (nullable = true)
 |-- total_orders: long (nullable = true)
 |-- total_items: long (nullable = true)
 |-- total_revenue: double (nullable = true)
 |-- total_gmv_incl_freight: double (nullable = true)
 |-- avg_delivery_days: double (nullable = true)
 |-- has_late_delivery: integer (nullable = true)
 |-- avg_review_score: double (nullable = true)
 |-- delivered_orders: long (nullable = true)
 |-- on_time_delivered_orders: long (nullable = true)
 |-- late_delivered_orders: long (nullable = true)
 |-- sum_delivery_duration_days: long (nullable = true)
 |-- delivered_orders_with_duration: long (nullable = true)

## Promotion gate: Silver source alignment (lineage policy)

Purpose: enforce that promotion only happens after Gold Dev has been rebuilt from **locked Silver (`sl_*`)**.

Rule:
- If any `sl_dev_*` tables exist, promotion is blocked unless the Gold Dev notebook has been updated to read from `sl_*` and Gold Dev has been rebuilt after the swap.

Interpretation:
- `sl_dev_*` tables may exist in the Lakehouse; this is expected.
- Promotion is allowed only after:
  1. The Gold Dev notebook SQL has been updated to read from `sl_*` (locked Silver), and
  2. `gold_dev_seller_daily_perf` has been rebuilt after that change.

The cells in this section act as a visibility and policy reminder, not an automated lineage detector: `SHOW TABLES` lists any remaining `sl_dev_*` tables, and `DESCRIBE HISTORY` shows when Gold Dev was last written.
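
The rule above can be written down as a small decision function. This is only a sketch of the policy as stated: the function name `lineage_gate`, its two confirmation flags, and the sample table names are hypothetical, and the flags still have to be confirmed by a person, since nothing here inspects the Gold Dev notebook.

```python
from fnmatch import fnmatchcase


def lineage_gate(table_names, reads_locked_silver, rebuilt_after_swap):
    """Apply the Silver source alignment rule.

    Returns (allowed, dev_tables), where dev_tables lists the sl_dev_*
    tables that are still present (for visibility only).
    """
    dev_tables = sorted(t for t in table_names if fnmatchcase(t, "sl_dev_*"))
    if not dev_tables:
        # No Silver dev tables exist, so the rule does not block promotion.
        return True, dev_tables
    # Dev tables exist: both manual confirmations are required.
    return bool(reads_locked_silver and rebuilt_after_swap), dev_tables


# Hypothetical table listing, e.g. copied from the SHOW TABLES output.
tables = ["sl_orders", "sl_order_items", "sl_dev_orders", "sl_dev_order_items"]
print(lineage_gate(tables, reads_locked_silver=True, rebuilt_after_swap=False))
```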


### Lineage confirmation

This notebook validates output correctness. Source alignment is confirmed separately in the Gold dev notebook.


In [14]:
%%sql
-- Promotion gate: detect presence of Silver dev tables
SHOW TABLES LIKE 'sl_dev_*';

<Spark SQL result set with 4 rows and 3 fields>

In [15]:
%%sql
-- Promotion gate: review the write history to confirm when Gold Dev was last rebuilt
DESCRIBE HISTORY gold_dev_seller_daily_perf;

<Spark SQL result set with 6 rows and 15 fields>

## Section 1: BI contract checks

Objective: Ensure Gold Dev contains the expected columns used by BI.

Edit this list only if the BI contract changes.


In [16]:
expected_cols = [
    "order_date",
    "seller_id",
    "seller_city",
    "seller_state",
    "total_orders",
    "total_items",
    "total_revenue",
    "total_gmv_incl_freight",
    "avg_delivery_days",
    "has_late_delivery",
    "avg_review_score",
]

missing = [c for c in expected_cols if c not in df.columns]
print("Missing columns:", missing)
assert len(missing) == 0, f"FAIL: Missing BI contract columns: {missing}"
print("PASS: All BI contract columns present.")

Missing columns: []
PASS: All BI contract columns present.


## Section 2: Grain check

Objective: Confirm Gold Dev grain is exactly **1 row per seller_id x order_date**.

In [26]:
dupes = (
    df.groupBy("seller_id", "order_date")
      .count()
      .filter(F.col("count") > 1)
)

d = dupes.count()
print("Duplicate grain rows:", d)

if d > 0:
    dupes.orderBy(F.desc("count")).show(20, truncate=False)

assert d == 0, "FAIL: Duplicate seller_id x order_date rows found."
print("PASS: Grain is correct (1 row per seller_id x order_date).")

Duplicate grain rows: 0
PASS: Grain is correct (1 row per seller_id x order_date).


### Comparison Tolerances

The executive baseline check applies small tolerances to floating-point metrics
to avoid false failures caused by rounding or representation differences.

Tolerances used:

- `total_revenue`: ±0.01  
- `total_gmv_incl_freight`: ±0.01  

Exact matching (tolerance = 0) is enforced for:

- `total_items`
- seller-day completeness (missing / extra rows)

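These tolerances are intentional: `total_revenue` and `total_gmv_incl_freight` are `double` columns,
so two sums over the same values can differ by a tiny amount when the values are added in a different order.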
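
A short sketch of why exact equality is too strict for floating-point sums. The helper `within_tolerance` and the sample values are hypothetical; only the 0.01 threshold comes from this notebook.

```python
import math

REVENUE_TOL = 0.01  # tolerance used for total_revenue and total_gmv_incl_freight


def within_tolerance(gold_value, base_value, tol=REVENUE_TOL):
    """True when the absolute difference does not exceed the tolerance."""
    return abs(gold_value - base_value) <= tol


prices = [0.1] * 10
naive_total = sum(prices)        # plain left-to-right float addition
exact_total = math.fsum(prices)  # error-compensated summation

print("exact match:     ", naive_total == exact_total)
print("within tolerance:", within_tolerance(naive_total, exact_total))
```

The two totals describe the same ten values, yet they are not bit-for-bit equal. A tolerance of 0.01 absorbs that noise while still flagging a difference of two cents or more.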

## Executive baseline check

This is the **promotion gate** for Gold Dev.

We rebuild a clean daily seller baseline from Silver (`sl_orders` + `sl_order_items`)
without reviews, then compare it to `gold_dev_seller_daily_perf` using a
**FULL OUTER JOIN**.

This check confirms:
- no seller-days are missing in Gold
- no extra seller-days exist in Gold
- item, revenue, and GMV totals match the baseline (within tolerance)

If **any** seller-day is missing, extra, or mismatched, the result is **FAIL**.
Only when this cell returns **PASS** is Gold Dev safe to promote.

## Section 3A: Executive PASS/FAIL — Gold vs baseline (FULL OUTER JOIN)

Objective: Compare Gold Dev metrics against a clean baseline built only from:
- `sl_orders`
- `sl_order_items`

This detects:
- missing seller-days in Gold
- extra seller-days in Gold
- any metric mismatch vs baseline (items, revenue, gmv)

Promotion rule for this gate:
- `pass_fail` must be `PASS`

**If `pass_fail` is `PASS` and the lineage gate is satisfied, Gold Dev clears this gate.** The remaining checks listed in Section 6 must also pass before promotion.

In [27]:
%%sql
-- Executive baseline check (PASS/FAIL)

WITH base AS (
  SELECT
    CAST(o.order_purchase_timestamp AS DATE) AS order_date,
    oi.seller_id,
    COUNT(DISTINCT oi.order_id) AS total_orders_base,
    COUNT(*) AS total_items_base,
    SUM(COALESCE(oi.price, 0)) AS total_revenue_base,
    SUM(COALESCE(oi.price, 0) + COALESCE(oi.freight_value, 0)) AS total_gmv_base
  FROM sl_order_items oi
  JOIN sl_orders o
    ON oi.order_id = o.order_id
  GROUP BY
    CAST(o.order_purchase_timestamp AS DATE),
    oi.seller_id
),
gold AS (
  SELECT
    order_date,
    seller_id,
    total_orders,
    total_items,
    total_revenue,
    total_gmv_incl_freight
  FROM gold_dev_seller_daily_perf
),
cmp AS (
  SELECT
    COALESCE(g.order_date, b.order_date) AS order_date,
    COALESCE(g.seller_id, b.seller_id) AS seller_id,

    g.total_orders,
    b.total_orders_base,

    g.total_items,
    b.total_items_base,

    g.total_revenue,
    b.total_revenue_base,

    g.total_gmv_incl_freight,
    b.total_gmv_base,

    CASE WHEN g.seller_id IS NULL THEN 1 ELSE 0 END AS missing_in_gold,
    CASE WHEN b.seller_id IS NULL THEN 1 ELSE 0 END AS extra_in_gold,

    ABS(COALESCE(g.total_items, 0) - COALESCE(b.total_items_base, 0)) AS abs_items_diff,
    ABS(COALESCE(g.total_revenue, 0.0) - COALESCE(b.total_revenue_base, 0.0)) AS abs_revenue_diff,
    ABS(COALESCE(g.total_gmv_incl_freight, 0.0) - COALESCE(b.total_gmv_base, 0.0)) AS abs_gmv_diff
  FROM gold g
  FULL OUTER JOIN base b
    ON g.order_date = b.order_date
   AND g.seller_id = b.seller_id
),
agg AS (
  SELECT
    SUM(missing_in_gold) AS missing_seller_days_in_gold,
    SUM(extra_in_gold) AS extra_seller_days_in_gold,
    SUM(
      CASE
        WHEN abs_items_diff > 0
          OR abs_revenue_diff > 0.01
          OR abs_gmv_diff > 0.01
        THEN 1 ELSE 0
      END
    ) AS mismatched_seller_days,
    SUM(abs_items_diff) AS abs_item_diff_sum,
    SUM(abs_revenue_diff) AS abs_revenue_diff_sum,
    SUM(abs_gmv_diff) AS abs_gmv_diff_sum
  FROM cmp
)

SELECT
  CASE
    WHEN missing_seller_days_in_gold = 0
     AND extra_seller_days_in_gold = 0
     AND mismatched_seller_days = 0
    THEN 'PASS'
    ELSE 'FAIL'
  END AS pass_fail,
  missing_seller_days_in_gold,
  extra_seller_days_in_gold,
  mismatched_seller_days,
  abs_item_diff_sum,
  abs_revenue_diff_sum,
  abs_gmv_diff_sum
FROM agg;

<Spark SQL result set with 1 rows and 7 fields>


--------------------------------------------------------------
Query flow:
Build a clean daily seller baseline directly from Silver input
tables (`sl_orders` + `sl_order_items`), excluding reviews.

1. Baseline construction:
   - Derive `order_date` from order purchase timestamp
   - Group by `seller_id × order_date`
   - Compute:
     - total_orders (distinct order_id)
     - total_items (row count)
     - total_revenue (sum of item price)
     - total_gmv (price + freight)

2. Gold snapshot:
   - Read Gold Dev output from `gold_dev_seller_daily_perf`
   - Select the same grain and metrics as the baseline

3. Comparison step:
   - FULL OUTER JOIN Gold and baseline on
     `seller_id × order_date`
   - Detect:
     - seller-days missing in Gold
     - extra seller-days in Gold
     - metric differences (items, revenue, GMV)

4. Aggregation:
   - Count how many seller-days are:
     - missing
     - extra
     - mismatched beyond tolerance
   - Sum absolute metric differences for audit visibility

5. Executive decision:
   - Return PASS only if:
     - no missing seller-days
     - no extra seller-days
     - no mismatched seller-days
--------------------------------------------------------------


Query goal and expected outcome:
Determine whether Gold Dev output is safe to promote.

Expected results:
1. Output must return exactly one row.
2. `pass_fail` must be 'PASS'.
3. `missing_seller_days_in_gold` = 0
4. `extra_seller_days_in_gold` = 0
5. `mismatched_seller_days` = 0
6. Metric differences should be:
   - exactly zero for items
   - zero or near-zero for revenue and GMV
     (floating-point tolerance applied)

Interpretation:
- PASS means Gold Dev matches Silver-derived truth at the
  seller-day grain and is promotion-ready.
- FAIL means Gold Dev contains structural or metric errors
  and must not be promoted.
--------------------------------------------------------------
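
The query flow described above can be mimicked on in-memory data to see how each counter behaves. This is a plain-Python sketch under stated assumptions, not the notebook's implementation: the function `compare_to_baseline` and the sample seller-days are hypothetical, and the union of dictionary keys stands in for the FULL OUTER JOIN.

```python
def compare_to_baseline(gold, base, tol=0.01):
    """Compare two dicts keyed by (order_date, seller_id).

    Each value is a tuple (total_items, total_revenue, total_gmv).
    """
    empty = (0, 0.0, 0.0)  # plays the role of COALESCE(..., 0)
    missing = extra = mismatched = 0
    for key in gold.keys() | base.keys():  # key union = FULL OUTER JOIN
        g, b = gold.get(key), base.get(key)
        if g is None:
            missing += 1
        if b is None:
            extra += 1
        g_items, g_rev, g_gmv = g if g is not None else empty
        b_items, b_rev, b_gmv = b if b is not None else empty
        if (g_items != b_items
                or abs(g_rev - b_rev) > tol
                or abs(g_gmv - b_gmv) > tol):
            mismatched += 1
    passed = missing == 0 and extra == 0 and mismatched == 0
    return {
        "pass_fail": "PASS" if passed else "FAIL",
        "missing_seller_days_in_gold": missing,
        "extra_seller_days_in_gold": extra,
        "mismatched_seller_days": mismatched,
    }


# Hypothetical seller-days; not real data.
base = {
    ("2017-11-24", "seller_a"): (2, 30.0, 35.0),
    ("2017-11-24", "seller_b"): (1, 10.0, 12.5),
}
gold_ok = {
    ("2017-11-24", "seller_a"): (2, 30.004, 35.0),  # within tolerance
    ("2017-11-24", "seller_b"): (1, 10.0, 12.5),
}
gold_bad = {
    ("2017-11-24", "seller_a"): (2, 30.0, 35.0),
    ("2017-11-24", "seller_c"): (1, 5.0, 6.0),  # extra; seller_b is missing
}

print(compare_to_baseline(gold_ok, base))
print(compare_to_baseline(gold_bad, base))
```

As in the SQL, a missing or extra seller-day also counts as mismatched, because the absent side is treated as zero.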


## Section 3B: Diagnostic drilldown (run only if Section 3A FAILs)

Objective: Show top seller-days that are missing, extra, or mismatched.

In [19]:
%%sql
-- Diagnostic drilldown: only run if executive cell FAILs

WITH base AS (
  SELECT
    CAST(o.order_purchase_timestamp AS DATE) AS order_date,
    oi.seller_id,
    COUNT(DISTINCT oi.order_id) AS total_orders_base,
    COUNT(*) AS total_items_base,
    SUM(COALESCE(oi.price, 0)) AS total_revenue_base,
    SUM(COALESCE(oi.price, 0) + COALESCE(oi.freight_value, 0)) AS total_gmv_base
  FROM sl_order_items oi
  JOIN sl_orders o
    ON oi.order_id = o.order_id
  GROUP BY
    CAST(o.order_purchase_timestamp AS DATE),
    oi.seller_id
),
gold AS (
  SELECT
    order_date,
    seller_id,
    total_orders,
    total_items,
    total_revenue,
    total_gmv_incl_freight
  FROM gold_dev_seller_daily_perf
),
cmp AS (
  SELECT
    COALESCE(g.order_date, b.order_date) AS order_date,
    COALESCE(g.seller_id, b.seller_id) AS seller_id,

    COALESCE(g.total_orders, 0) AS total_orders,
    COALESCE(b.total_orders_base, 0) AS total_orders_base,

    COALESCE(g.total_items, 0) AS total_items,
    COALESCE(b.total_items_base, 0) AS total_items_base,

    COALESCE(g.total_revenue, 0.0) AS total_revenue,
    COALESCE(b.total_revenue_base, 0.0) AS total_revenue_base,

    COALESCE(g.total_gmv_incl_freight, 0.0) AS total_gmv_incl_freight,
    COALESCE(b.total_gmv_base, 0.0) AS total_gmv_base,

    CASE WHEN g.seller_id IS NULL THEN 1 ELSE 0 END AS missing_in_gold,
    CASE WHEN b.seller_id IS NULL THEN 1 ELSE 0 END AS extra_in_gold
  FROM gold g
  FULL OUTER JOIN base b
    ON g.order_date = b.order_date
   AND g.seller_id = b.seller_id
)

SELECT *
FROM cmp
WHERE
  missing_in_gold = 1
  OR extra_in_gold = 1
  OR ABS(total_items - total_items_base) > 0
  OR ABS(total_revenue - total_revenue_base) > 0.01
  OR ABS(total_gmv_incl_freight - total_gmv_base) > 0.01
ORDER BY
  missing_in_gold DESC,
  extra_in_gold DESC,
  ABS(total_items - total_items_base) DESC,
  ABS(total_revenue - total_revenue_base) DESC
LIMIT 50;

<Spark SQL result set with 0 rows and 12 fields>

## Section 4: Delivery sanity checks

Objective:
- `avg_delivery_days` must not be negative
- `has_late_delivery` should contain only flag values (0/1 or true/false); the cell below prints the distinct values for manual review and does not assert on them

In [20]:
neg = (
    df.filter(F.col("avg_delivery_days").isNotNull())
      .filter(F.col("avg_delivery_days") < 0)
      .count()
)

print("Negative avg_delivery_days rows:", neg)
assert neg == 0, "FAIL: negative avg_delivery_days found"
print("PASS: avg_delivery_days is non-negative (or NULL).")

vals = [r[0] for r in df.select("has_late_delivery").distinct().collect()]
print("Distinct has_late_delivery values:", vals)
print("NOTE: Expected values are typically 0/1 or True/False depending on implementation.")

Negative avg_delivery_days rows: 0
PASS: avg_delivery_days is non-negative (or NULL).
Distinct has_late_delivery values: [1, 0]
NOTE: Expected values are typically 0/1 or True/False depending on implementation.
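
If the flag check should become a hard gate rather than a printed note, one possible shape is sketched below. The helper `unexpected_flag_values` is hypothetical, and treating NULL as unexpected is an assumption that may not match the intended contract.

```python
def unexpected_flag_values(values, allowed=(0, 1)):
    """Return the distinct values that fall outside the allowed set.

    Python treats True == 1 and False == 0, so boolean flags also pass.
    None (NULL) is reported as unexpected, which is an assumption.
    """
    return [v for v in dict.fromkeys(values) if v not in allowed]


# The list would come from the distinct has_late_delivery values collected above.
print(unexpected_flag_values([1, 0]))
print(unexpected_flag_values([1, 0, 2, None, 2]))
```

An assertion such as `assert not unexpected_flag_values(vals)` would then fail the cell whenever a value other than 0 or 1 appears.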


### Type sanity check: has_late_delivery

Confirm the output data type is stable for BI (expected `int` with values 0/1, or `boolean` with true/false).

In [21]:
spark.table("gold_dev_seller_daily_perf") \
  .selectExpr("typeof(has_late_delivery) as t") \
  .distinct() \
  .show()

+---+
|  t|
+---+
|int|
+---+



## Section 5: Review sanity checks

Objective:
- avg_review_score must be within [1, 5] when present
- NULL avg_review_score is expected for seller-days without reviews

In [22]:
bad_scores = (
    df.filter(F.col("avg_review_score").isNotNull())
      .filter((F.col("avg_review_score") < 1) | (F.col("avg_review_score") > 5))
      .count()
)

print("avg_review_score out of bounds rows:", bad_scores)
assert bad_scores == 0, "FAIL: avg_review_score outside [1,5]"
print("PASS: avg_review_score within bounds (or NULL).")

null_reviews = df.filter(F.col("avg_review_score").isNull()).count()
total_rows = df.count()
print(f"avg_review_score NULL rows: {null_reviews}/{total_rows} ({null_reviews/total_rows:.2%})")

avg_review_score out of bounds rows: 0
PASS: avg_review_score within bounds (or NULL).
avg_review_score NULL rows: 424/69413 (0.61%)


## Section 6: Promotion decision

Promote only when:
- Lineage gate is satisfied (Gold rebuilt from `sl_*`)
- BI contract passes
- Grain passes
- Section 3A executive baseline is PASS
- Delivery and review sanity checks pass
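
The checklist above amounts to a logical AND over the individual gates. A minimal sketch, assuming the gate results are recorded by hand in a dictionary; the function `promotion_decision` and the gate names are hypothetical.

```python
def promotion_decision(checks):
    """Return ("PROMOTE" or "BLOCK", list of failed gate names)."""
    failed = [name for name, passed in checks.items() if not passed]
    return ("PROMOTE" if not failed else "BLOCK", failed)


# Hypothetical gate results, filled in after running the sections above.
checks = {
    "lineage_gate": True,
    "bi_contract": True,
    "grain": True,
    "executive_baseline_3A": True,
    "delivery_sanity": True,
    "review_sanity": False,
}
print(promotion_decision(checks))
```

Listing the failed gates alongside the decision keeps the reason for a block visible in the notebook output.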

In [23]:
%%sql
-- Promotion SQL (run only after all checks PASS)
CREATE OR REPLACE TABLE gold_seller_daily_perf
AS
SELECT * FROM gold_dev_seller_daily_perf;

<Spark SQL result set with 0 rows and 0 fields>

In [2]:
%%sql
-- Post-promotion parity check (run after promotion)
SELECT
  (SELECT COUNT(*) FROM gold_dev_seller_daily_perf) AS gold_dev_rows,
  (SELECT COUNT(*) FROM gold_seller_daily_perf) AS gold_rows;

<Spark SQL result set with 1 rows and 2 fields>

## Peak Order Spike Validation – Gold Layer (Authoritative)

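To validate the spike visible in the dashboard, we confirmed the peak date
and order volume directly from the promoted Gold table.

Note: `SUM(total_orders)` adds per-seller daily counts. If `total_orders` is a distinct order count per seller-day, as in the baseline query (`COUNT(DISTINCT oi.order_id)`), an order with items from several sellers is counted once per seller, so the daily sum can exceed the number of distinct orders.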

In [1]:
%%sql
SELECT
  order_date,
  SUM(total_orders) AS total_orders
FROM gold_seller_daily_perf
WHERE order_date = '2017-11-24'
GROUP BY order_date;

<Spark SQL result set with 1 rows and 2 fields>

In [1]:
%%sql
SELECT
  order_date,
  SUM(total_orders) AS total_orders
FROM gold_seller_daily_perf
GROUP BY order_date
ORDER BY total_orders DESC
LIMIT 5;

<Spark SQL result set with 5 rows and 2 fields>