# Gold Mart – Seller × Day Performance (Sprint 1)

## Purpose
This notebook builds the **Gold-level fact table**  
`gold_seller_daily_perf` at **seller × day grain**.

This table is the **single source of truth for analytics** in Sprint 1 and
is directly consumed by:
- The semantic model
- Power BI dashboards

The table name, grain, and schema are **stable across sprints**.
Business logic inside the table may evolve, but downstream consumers
must not need to rewire.

---

## Grain Definition
**One row = one seller on one calendar day**


This grain supports:
- Seller performance comparison
- Daily trend analysis
- Delivery and customer satisfaction KPIs

---

## Source Tables (Sprint 1)
This Gold mart is built from **Stub Silver tables**:

- `sl_stub_orders`
- `sl_stub_order_items`
- `sl_stub_reviews`

These tables act as **schema contracts** for Sprint 1.
In later sprints, the Gold mart will be rebuilt from finalized Silver tables
(`sl_*`) without changing its schema or grain.

---

## Key Metrics Produced
- Total orders per seller per day
- Total items sold
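- Revenue (item price only) and GMV (item price plus freight)
- Average delivery days
- Late delivery indicator (seller-day level: 1 if any of the seller's orders purchased that day was delivered after its estimated date)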
- Average review score
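
To make the metric definitions concrete, here is a minimal worked example for a single seller-day. It is only a sketch: SQLite (via Python's standard `sqlite3` module) stands in for Spark SQL, and the table names, column names, and sample rows are made up for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (order_id TEXT, purchase_date TEXT, delivered_date TEXT, estimated_date TEXT);
CREATE TABLE items  (order_id TEXT, seller_id TEXT, price REAL, freight_value REAL);
INSERT INTO orders VALUES
  ('o1', '2024-01-01', '2024-01-04', '2024-01-05'),  -- delivered before the estimate
  ('o2', '2024-01-01', '2024-01-08', '2024-01-06');  -- delivered after the estimate
INSERT INTO items VALUES
  ('o1', 's1', 10.0, 2.0),
  ('o1', 's1', 20.0, 3.0),
  ('o2', 's1', 30.0, 5.0);
""")

# Same aggregation shape as the Gold build: one output row per seller per day.
row = con.execute("""
SELECT
  i.seller_id,
  o.purchase_date AS order_date,
  COUNT(DISTINCT o.order_id) AS total_orders,
  COUNT(*) AS total_items,
  SUM(i.price) AS total_revenue,
  SUM(i.price + i.freight_value) AS total_gmv_incl_freight,
  MAX(CASE WHEN o.delivered_date > o.estimated_date THEN 1 ELSE 0 END) AS has_late_delivery
FROM items i
JOIN orders o ON o.order_id = i.order_id
GROUP BY i.seller_id, o.purchase_date
""").fetchone()

print(row)  # ('s1', '2024-01-01', 2, 3, 60.0, 70.0, 1)
```

Two orders with three items produce revenue of 60.0, GMV of 70.0 once freight is added, and a late flag of 1 because one of the two orders missed its estimate.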

---

## Design Principles
- **Contract-first:** schema and grain stability over perfect logic
- **Idempotent:** safe to overwrite on each run
- **Analytics-ready:** Gold is the only layer exposed to BI
- **Minimal coupling:** downstream layers do not depend on Silver internals
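
The idempotency principle can be illustrated with a small sketch: rebuilding the table twice from the same inputs should leave identical contents. SQLite has no `CREATE OR REPLACE TABLE`, so drop-and-create is used here as a stand-in; `stub_items` and `rebuild_gold` are hypothetical names, not part of this project.

```python
import sqlite3

def rebuild_gold(con):
    # Full overwrite: the table is recreated from source on every run,
    # so repeated runs cannot accumulate duplicate rows.
    con.executescript("""
    DROP TABLE IF EXISTS gold_seller_daily_perf;
    CREATE TABLE gold_seller_daily_perf AS
    SELECT seller_id, order_date, COUNT(*) AS total_items
    FROM stub_items
    GROUP BY seller_id, order_date;
    """)
    return con.execute(
        "SELECT seller_id, order_date, total_items "
        "FROM gold_seller_daily_perf ORDER BY 1, 2"
    ).fetchall()

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE stub_items (seller_id TEXT, order_date TEXT);
INSERT INTO stub_items VALUES
  ('s1', '2024-01-01'), ('s1', '2024-01-01'), ('s2', '2024-01-02');
""")

first = rebuild_gold(con)
second = rebuild_gold(con)
print(first == second)  # True
print(second)           # [('s1', '2024-01-01', 2), ('s2', '2024-01-02', 1)]
```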

---

## Notes for Contributors
- Do **not** change the table name `gold_seller_daily_perf`
- Do **not** change grain or column names without team agreement
- Development or experimental logic should be built in separate
  dev tables (e.g. `gold_dev_seller_daily_perf`) and promoted deliberately
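
One way to make promotion deliberate is to compare the dev table's columns against the Gold contract before overwriting. The sketch below is an illustration only: it uses SQLite's `PRAGMA table_info` as a stand-in (in Spark the column list could come from `DESCRIBE TABLE`), and `schema_drift`, `column_names`, and the `late_orders` column are hypothetical.

```python
import sqlite3

def column_names(con, table):
    # PRAGMA table_info returns one row per column; index 1 holds the name.
    return [r[1] for r in con.execute(f"PRAGMA table_info({table})")]

def schema_drift(con, gold, dev):
    gold_cols, dev_cols = column_names(con, gold), column_names(con, dev)
    return {
        "missing_in_dev": [c for c in gold_cols if c not in dev_cols],
        "extra_in_dev": [c for c in dev_cols if c not in gold_cols],
    }

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE gold_seller_daily_perf
  (seller_id TEXT, order_date TEXT, total_orders INTEGER, total_revenue REAL);
CREATE TABLE gold_dev_seller_daily_perf
  (seller_id TEXT, order_date TEXT, total_orders INTEGER, late_orders INTEGER);
""")

drift = schema_drift(con, "gold_seller_daily_perf", "gold_dev_seller_daily_perf")
print(drift)
# {'missing_in_dev': ['total_revenue'], 'extra_in_dev': ['late_orders']}
```

A non-empty `missing_in_dev` list would mean promotion breaks the contract for downstream consumers; extra columns are additive and are a matter for team agreement.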

---

## Downstream Usage
This table is consumed by:
- Power BI semantic model
- Sprint 1 demo dashboards
- Future automated Fabric pipelines

Changes here have a direct impact on reporting.


In [1]:
CREATE OR REPLACE TABLE gold_seller_daily_perf AS
WITH base AS (
  SELECT
    i.seller_id,
    o.order_id,
    CAST(o.order_purchase_ts AS DATE) AS order_date,

    -- revenue proxy (items only; simple MVP)
    COALESCE(i.price, 0) AS item_price,
    COALESCE(i.freight_value, 0) AS freight_value,

    -- delivery metrics (may be null for undelivered orders)
    DATEDIFF(o.delivered_customer_ts, o.order_purchase_ts) AS delivery_days,

    CASE
      WHEN o.delivered_customer_ts IS NOT NULL
       AND o.estimated_delivery_ts IS NOT NULL
       AND o.delivered_customer_ts > o.estimated_delivery_ts
      THEN 1 ELSE 0
    END AS is_late,

    r.review_score
  FROM sl_stub_order_items i
  JOIN sl_stub_orders o
    ON o.order_id = i.order_id
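  -- assumes at most one review per order; extra reviews would repeat item rows
  -- and inflate total_items, total_revenue, and total_gmv_incl_freight
  LEFT JOIN sl_stub_reviews r
    ON r.order_id = o.order_id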
)
SELECT
  seller_id,
  order_date,

  COUNT(DISTINCT order_id) AS total_orders,
  COUNT(*) AS total_items,

  SUM(item_price) AS total_revenue,
  SUM(item_price + freight_value) AS total_gmv_incl_freight,

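  -- averaged over item rows, so multi-item orders weigh more;
  -- NULL delivery_days (undelivered orders) are skipped by AVG
  AVG(delivery_days) AS avg_delivery_days,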

  -- seller-day late flag: if any order late that day, flag = 1
  MAX(is_late) AS has_late_delivery,

  AVG(review_score) AS avg_review_score
FROM base
GROUP BY seller_id, order_date;


<Spark SQL result set with 0 rows and 0 fields>
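
The build above joins reviews directly onto item rows. If an order could ever carry more than one review (an assumption; the stub tables may not allow it), each item row would repeat once per review. The sketch below reproduces that effect on made-up data and shows one possible guard, pre-aggregating reviews to one row per order. SQLite stands in for Spark SQL, and the simplified table and column names are illustrative.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE items   (order_id TEXT, seller_id TEXT, price REAL);
CREATE TABLE reviews (order_id TEXT, review_score INTEGER);
INSERT INTO items VALUES ('o1', 's1', 10.0), ('o1', 's1', 20.0);
INSERT INTO reviews VALUES ('o1', 5), ('o1', 3);  -- two reviews for one order
""")

# Direct join: every item row is repeated once per matching review.
naive = con.execute("""
SELECT COUNT(*) AS total_items, SUM(i.price) AS total_revenue
FROM items i
LEFT JOIN reviews r ON r.order_id = i.order_id
""").fetchone()

# Guarded join: reviews are collapsed to one row per order first.
safe = con.execute("""
SELECT COUNT(*) AS total_items,
       SUM(i.price) AS total_revenue,
       AVG(r.avg_score) AS avg_review_score
FROM items i
LEFT JOIN (
  SELECT order_id, AVG(review_score) AS avg_score
  FROM reviews
  GROUP BY order_id
) AS r ON r.order_id = i.order_id
""").fetchone()

print(naive)  # (4, 60.0)
print(safe)   # (2, 30.0, 4.0)
```

With the direct join, two items are counted as four and revenue doubles; the pre-aggregated join keeps both at their true values.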

In [2]:
SELECT
  COUNT(*) AS seller_day_rows,
  COUNT(DISTINCT seller_id) AS sellers,
  MIN(order_date) AS min_day,
  MAX(order_date) AS max_day
FROM gold_seller_daily_perf;

<Spark SQL result set with 1 rows and 4 fields>

In [3]:
SELECT
  seller_id,
  SUM(total_revenue) AS revenue
FROM gold_seller_daily_perf
GROUP BY seller_id
ORDER BY revenue DESC
LIMIT 10;

<Spark SQL result set with 10 rows and 2 fields>

In [4]:
SELECT seller_id, order_date, COUNT(*) AS c
FROM gold_seller_daily_perf
GROUP BY seller_id, order_date
HAVING COUNT(*) > 1;

<Spark SQL result set with 0 rows and 3 fields>

In [5]:
SELECT COUNT(*) FROM gold_seller_daily_perf;

<Spark SQL result set with 1 rows and 1 fields>

In [1]:
-- Compare row counts
SELECT 'gold' AS tbl, COUNT(*) AS rows FROM gold_seller_daily_perf
UNION ALL
SELECT 'gold_dev' AS tbl, COUNT(*) AS rows FROM gold_dev_seller_daily_perf;

-- Compare key coverage
SELECT 'gold' AS tbl, COUNT(DISTINCT seller_id) AS sellers, COUNT(DISTINCT order_date) AS days
FROM gold_seller_daily_perf
UNION ALL
SELECT 'gold_dev' AS tbl, COUNT(DISTINCT seller_id) AS sellers, COUNT(DISTINCT order_date) AS days
FROM gold_dev_seller_daily_perf;

<Spark SQL result set with 2 rows and 2 fields>

<Spark SQL result set with 2 rows and 3 fields>

In [2]:
SELECT
  SUM(total_orders) AS sum_orders,
  SUM(total_items) AS sum_items,
  ROUND(SUM(total_revenue), 2) AS sum_revenue
FROM gold_seller_daily_perf;

SELECT
  SUM(total_orders) AS sum_orders,
  SUM(total_items) AS sum_items,
  ROUND(SUM(total_revenue), 2) AS sum_revenue
FROM gold_dev_seller_daily_perf;


<Spark SQL result set with 1 rows and 3 fields>

<Spark SQL result set with 1 rows and 3 fields>

#### Gold Promotion Validation using Delta Time Travel

This section validates whether the promoted Gold table (`gold_seller_daily_perf`) differs materially from the original Gold table that BI was already using before promotion.

Because promotion overwrites the same table name, we use Delta Lake time travel to compare:

* Version 0 → original Gold table (used by BI in parallel)
* Latest version → promoted Gold table

In [4]:
DESCRIBE HISTORY gold_seller_daily_perf;

<Spark SQL result set with 2 rows and 15 fields>

##### Row Count Validation (Structural Stability)

**Purpose**
Verify that promotion did not change the grain or row cardinality of the Gold table.

**Expectation**

* Row counts should match exactly
* A match shows no net change in row count; on its own it does not prove that the same seller-day keys are present in both versions
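
Matching row counts cannot distinguish "nothing changed" from "one seller-day dropped and another added". A key-level set difference closes that gap. The sketch below is illustrative only: SQLite stands in for Spark SQL, and `gold_v0` / `gold_v1` are hypothetical snapshot tables playing the roles of `gold_seller_daily_perf VERSION AS OF 0` and the current table.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE gold_v0 (seller_id TEXT, order_date TEXT);
CREATE TABLE gold_v1 (seller_id TEXT, order_date TEXT);
INSERT INTO gold_v0 VALUES ('s1', '2024-01-01'), ('s2', '2024-01-01');
INSERT INTO gold_v1 VALUES ('s1', '2024-01-01'), ('s3', '2024-01-01');  -- s2 dropped, s3 added
""")

def row_count(table):
    return con.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

def keys_only_in(left, right):
    # EXCEPT returns grain keys present in `left` but absent from `right`.
    return con.execute(
        f"SELECT seller_id, order_date FROM {left} "
        f"EXCEPT SELECT seller_id, order_date FROM {right}"
    ).fetchall()

print(row_count("gold_v0") == row_count("gold_v1"))  # True, yet the keys differ
print(keys_only_in("gold_v0", "gold_v1"))            # [('s2', '2024-01-01')]
print(keys_only_in("gold_v1", "gold_v0"))            # [('s3', '2024-01-01')]
```

Both difference queries returning no rows, together with equal counts, is the stronger evidence that the grain was preserved.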

In [6]:
-- Original (Version 0)
SELECT 'v0_original' AS ver, COUNT(*) AS rows
FROM gold_seller_daily_perf VERSION AS OF 0

UNION ALL

-- Current (Latest = Version 1)
SELECT 'v1_current' AS ver, COUNT(*) AS rows
FROM gold_seller_daily_perf;


<Spark SQL result set with 2 rows and 2 fields>

##### KPI Validation (Business Stability)

**Purpose**
Validate whether core business KPIs changed between the original Gold table and the promoted Gold table.

**Metrics Checked**

* `total_orders`
* `total_items`
* `total_revenue`

These represent the primary seller performance KPIs consumed by BI.

In [7]:
-- Original (Version 0)
SELECT
  'v0_original' AS ver,
  SUM(total_orders) AS sum_orders,
  SUM(total_items) AS sum_items,
  ROUND(SUM(total_revenue), 2) AS sum_revenue
FROM gold_seller_daily_perf VERSION AS OF 0;

-- Current (Latest)
SELECT
  'v1_current' AS ver,
  SUM(total_orders) AS sum_orders,
  SUM(total_items) AS sum_items,
  ROUND(SUM(total_revenue), 2) AS sum_revenue
FROM gold_seller_daily_perf;

<Spark SQL result set with 1 rows and 4 fields>

<Spark SQL result set with 1 rows and 4 fields>

**Interpretation**

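* Identical `sum_orders` indicates that order-level KPIs are unchanged in aggregate
* Minor differences in `sum_items` or `sum_revenue` are expected to come from refined Gold aggregation logic rather than from a BI issue; the affected seller-days should be inspected before treating them as benign
* Order-based dashboards are expected to remain unchanged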
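
Grand totals show that something moved but not where. A per-key comparison lists the seller-days whose items or revenue changed between versions. This is a sketch under assumptions: SQLite stands in for Spark SQL, `gold_v0` / `gold_v1` are hypothetical snapshots of the two table versions, and `TOLERANCE` is an arbitrary threshold chosen for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE gold_v0 (seller_id TEXT, order_date TEXT, total_items INTEGER, total_revenue REAL);
CREATE TABLE gold_v1 (seller_id TEXT, order_date TEXT, total_items INTEGER, total_revenue REAL);
INSERT INTO gold_v0 VALUES ('s1', '2024-01-01', 4, 60.0), ('s2', '2024-01-01', 1, 15.0);
INSERT INTO gold_v1 VALUES ('s1', '2024-01-01', 2, 30.0), ('s2', '2024-01-01', 1, 15.0);
""")

TOLERANCE = 0.01  # hypothetical revenue tolerance, not a project setting

# Join the two versions on the grain keys and keep only rows that moved.
changed = con.execute("""
SELECT a.seller_id, a.order_date,
       b.total_items - a.total_items     AS items_delta,
       b.total_revenue - a.total_revenue AS revenue_delta
FROM gold_v0 a
JOIN gold_v1 b
  ON b.seller_id = a.seller_id AND b.order_date = a.order_date
WHERE a.total_items <> b.total_items
   OR ABS(a.total_revenue - b.total_revenue) > ?
ORDER BY a.seller_id, a.order_date
""", (TOLERANCE,)).fetchall()

print(changed)  # [('s1', '2024-01-01', -2, -30.0)]
```

If every changed row has an explanation tied to the refined logic, the differences can reasonably be attributed to Gold refinement.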

In [9]:
-- Row counts: v0 vs current
SELECT 'v0_original' AS ver, COUNT(*) AS rows
FROM gold_seller_daily_perf VERSION AS OF 0
UNION ALL
SELECT 'v1_current' AS ver, COUNT(*) AS rows
FROM gold_seller_daily_perf;

-- Signature totals: v0 vs current
SELECT
  'v0_original' AS ver,
  SUM(total_orders) AS sum_orders,
  SUM(total_items) AS sum_items,
  ROUND(SUM(total_revenue), 2) AS sum_revenue
FROM gold_seller_daily_perf VERSION AS OF 0
UNION ALL
SELECT
  'v1_current' AS ver,
  SUM(total_orders) AS sum_orders,
  SUM(total_items) AS sum_items,
  ROUND(SUM(total_revenue), 2) AS sum_revenue
FROM gold_seller_daily_perf;

<Spark SQL result set with 2 rows and 2 fields>

<Spark SQL result set with 2 rows and 4 fields>

##### Delivery Logic Validation (Using v0-compatible Columns)

**Purpose**
Validate that delivery classification logic remains consistent across versions using only columns that existed in the original Gold table.

Since detailed delivery breakdown columns did not exist in Version 0, we use `has_late_delivery` as a proxy. The proxy is coarse: it counts every order on a flagged seller-day as late, so it checks consistency between versions rather than true late-order counts.

**Interpretation**

* Identical late vs on-time proxy counts indicate consistent delivery logic across versions
* BI delivery KPIs are expected to remain unchanged
* Safe for BI to switch to newer Gold delivery columns

In [10]:
SELECT
  'v0_original' AS ver,
  SUM(CASE WHEN has_late_delivery = 1 THEN total_orders ELSE 0 END) AS late_orders_proxy,
  SUM(CASE WHEN has_late_delivery = 0 THEN total_orders ELSE 0 END) AS on_time_orders_proxy
FROM gold_seller_daily_perf VERSION AS OF 0
UNION ALL
SELECT
  'v1_current' AS ver,
  SUM(CASE WHEN has_late_delivery = 1 THEN total_orders ELSE 0 END) AS late_orders_proxy,
  SUM(CASE WHEN has_late_delivery = 0 THEN total_orders ELSE 0 END) AS on_time_orders_proxy
FROM gold_seller_daily_perf;

<Spark SQL result set with 2 rows and 3 fields>
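
The proxy in the query above attributes every order on a flagged seller-day to the late bucket, so it can overstate late orders even when it is stable across versions. A small made-up example shows the gap; SQLite stands in for Spark SQL, and the order-level `is_late` table is hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (order_id TEXT, seller_id TEXT, order_date TEXT, is_late INTEGER);
INSERT INTO orders VALUES
  ('o1', 's1', '2024-01-01', 1),
  ('o2', 's1', '2024-01-01', 0),
  ('o3', 's1', '2024-01-01', 0),
  ('o4', 's2', '2024-01-01', 0);
""")

# Ground truth at order level.
true_late = con.execute("SELECT SUM(is_late) FROM orders").fetchone()[0]

# Proxy: roll up to seller-day first, then count all orders on flagged days.
proxy_late = con.execute("""
SELECT SUM(CASE WHEN has_late_delivery = 1 THEN total_orders ELSE 0 END)
FROM (
  SELECT seller_id, order_date,
         COUNT(DISTINCT order_id) AS total_orders,
         MAX(is_late) AS has_late_delivery
  FROM orders
  GROUP BY seller_id, order_date
) AS seller_day
""").fetchone()[0]

print(true_late, proxy_late)  # 1 3
```

One late order out of three on the same seller-day is reported as three by the proxy, which is why the comparison is best read as a version-to-version consistency check.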

##### Overall Validation Conclusion

**Summary**

* Gold promotion preserved row count and order-level KPIs
* Delivery classification logic is unchanged
* Minor item/revenue differences reflect Gold refinement, not errors
* Unchanged BI dashboards after promotion are expected and correct

**Outcome**

* Gold promotion is validated
* BI is safe to consume promoted Gold
* Sprint 1 dashboards remain stable
* Ready to proceed into Sprint 2 without rework

**Conclusion**  
Gold promotion preserved grain, order-level KPIs, and delivery classification.  
Sprint 1 BI dashboards are expected to remain unchanged.