## Silver Validation (Read-Only)

Purpose:
- Validate Silver Dev (`sl_dev_*`) output before promotion to Silver (`sl_*`), with a light sanity check after promotion
- Confirm Silver tables are structurally sound and ready for Gold consumption
- Verify schema, row counts, and key uniqueness

This notebook:
- Performs **read-only checks**
- Does **not** modify Silver tables
- Provides evidence that Silver is safe to use as a shared source of truth

This notebook does NOT:
- Enforce Gold logic correctness
- Validate business KPIs
- Perform lineage checks beyond Silver scope

### Validation rule

Silver tables may be consumed by Gold only if:
- All validation checks in this notebook pass
- No schema or key integrity issues are detected
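
The rule above can be read as a simple gate over individual check results. The sketch below is illustrative only: `CheckResult` and `silver_ready` are hypothetical names that do not exist elsewhere in this notebook, and the check names are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    name: str             # placeholder check name, e.g. "row_counts"
    passed: bool
    blocker: bool = True  # non-blocking checks are reported but do not gate Gold


def silver_ready(results: list[CheckResult]) -> tuple[bool, list[str]]:
    """Return (ready, names of failed blocking checks)."""
    failed = [r.name for r in results if r.blocker and not r.passed]
    return (len(failed) == 0, failed)


results = [
    CheckResult("row_counts", passed=True),
    CheckResult("key_uniqueness", passed=True),
    # Measured but not enforced, so it must not block consumption.
    CheckResult("reviews_uniqueness", passed=False, blocker=False),
]
ready, failed = silver_ready(results)
print(ready, failed)
```

Marking a check as non-blocking keeps it visible in the results without letting it veto Gold consumption.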

## Configuration and helper functions

This cell defines:
- The Bronze and Silver tables to validate
- The expected primary keys for each Silver table
- Helper functions used throughout the notebook

Notes:
- Table names are taken directly from the Lakehouse Explorer
- `sl_dev_order_reviews` is treated as **optional** for Sprint 1
- No assumptions are made about schemas beyond the Silver contract

In [2]:
from pyspark.sql import functions as F

# Update if your lakehouse uses dbo prefix or different names
TABLES = {
    "orders_bronze": "br_orders",
    "items_bronze": "br_order_items",
    "sellers_bronze": "br_sellers",
    "reviews_bronze": "br_reviews",  # Bronze source of the optional reviews table

    "orders_silver": "sl_dev_orders",
    "items_silver": "sl_dev_order_items",
    "sellers_silver": "sl_dev_sellers",
    "reviews_silver": "sl_dev_order_reviews",  # optional
}

KEYS = {
    "sl_dev_orders": ["order_id"],
    "sl_dev_order_items": ["order_id", "order_item_id"],
    "sl_dev_sellers": ["seller_id"],
    "sl_dev_order_reviews": ["order_id"],
}

OPTIONAL_TABLES = {"sl_dev_order_reviews"}  # ok if missing

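def table_exists(name: str) -> bool:
    """Return True if the table can be read from the attached Lakehouse."""
    try:
        # Reading a single row forces Spark to resolve and open the table.
        spark.table(name).limit(1).count()
        return True
    except Exception:
        # Any failure (missing table, wrong Lakehouse attached) counts as "does not exist".
        return False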



## Table existence check

Purpose:
- Verify that all expected Bronze and Silver tables exist in the Lakehouse
- Catch environment or attachment issues early (e.g. wrong Lakehouse)

Interpretation:
- `OK` → table exists and is queryable
- `OPTIONAL_MISSING` → acceptable for Sprint 1 (reviews)
- `MISSING` → blocker; validation cannot proceed reliably

In [3]:
print("=== Table existence ===")
for k, t in TABLES.items():
    exists = table_exists(t)
    flag = "OK" if exists else "MISSING"
    if (t in OPTIONAL_TABLES) and (not exists):
        flag = "OPTIONAL_MISSING"
    print(f"{k:>14}: {t:<30} {flag}")


=== Table existence ===
 orders_bronze: br_orders                      OK
  items_bronze: br_order_items                 OK
sellers_bronze: br_sellers                     OK
reviews_bronze: br_reviews                     OK
 orders_silver: sl_dev_orders                  OK
  items_silver: sl_dev_order_items             OK
sellers_silver: sl_dev_sellers                 OK
reviews_silver: sl_dev_order_reviews           OK
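
The flags printed above are informational only. As a minimal sketch, the same classification could be turned into a list of blocking tables; `existence_flag` and `blocking_missing` are hypothetical helpers, and the `existence` dictionary holds made-up results rather than output from this run.

```python
def existence_flag(table: str, exists: bool, optional_tables: set[str]) -> str:
    """Map an existence result to OK / OPTIONAL_MISSING / MISSING."""
    if exists:
        return "OK"
    return "OPTIONAL_MISSING" if table in optional_tables else "MISSING"


def blocking_missing(existence: dict[str, bool], optional_tables: set[str]) -> list[str]:
    """Return the tables whose absence should stop validation."""
    return sorted(
        t for t, ex in existence.items()
        if existence_flag(t, ex, optional_tables) == "MISSING"
    )


# Hypothetical results: sellers is missing, reviews is missing but optional.
existence = {"sl_dev_orders": True, "sl_dev_sellers": False, "sl_dev_order_reviews": False}
missing = blocking_missing(existence, {"sl_dev_order_reviews"})
print(missing)
```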


## Row count comparison: Bronze vs Silver

Purpose:
- Ensure Silver is a **lossless clone** of Bronze
- Detect accidental row drops caused by filters or joins

Rule:
- For Sprint 1, Silver row counts should match Bronze exactly

Interpretation:
- `diff = 0` → correct
- Any negative diff → rows dropped (blocker)
- Any positive diff → duplication introduced (blocker)
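
The interpretation above can be sketched as a small classifier over the two counts. `classify_row_diff` is a hypothetical helper, not part of the notebook; it only encodes the three cases listed.

```python
def classify_row_diff(bronze_count: int, silver_count: int) -> str:
    """Classify a Bronze vs Silver row count pair under the Sprint 1 rule."""
    diff = silver_count - bronze_count
    if diff == 0:
        return "OK"
    if diff < 0:
        return f"BLOCKER: {-diff:,} rows dropped"
    return f"BLOCKER: {diff:,} extra rows (possible duplication)"


print(classify_row_diff(100, 100))
print(classify_row_diff(100, 90))
print(classify_row_diff(100, 105))
```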

### Row count sanity check

Why this matters:
- Detects accidental truncation or duplication
- Ensures Silver promotion did not alter data volume unexpectedly

Expected outcome:
- For Sprint 1, Silver and Bronze row counts are identical
- Any difference requires investigation before promotion


### Expected grain

Each Silver table has a defined and intentional grain:

- Orders: one row per `order_id`
- Order items: one row per (`order_id`, `order_item_id`)
- Sellers: one row per `seller_id`
- Reviews: may have multiple rows per `order_id` (expected)

The checks below confirm Silver respects these grain guarantees
before being used as a shared source of truth by Gold.

These grain guarantees are relied upon by Gold aggregations and baseline checks.

In [4]:
def count_rows(name: str) -> int:
    return spark.table(name).count()

pairs = [
    ("orders_bronze", "orders_silver"),
    ("items_bronze", "items_silver"),
    ("sellers_bronze", "sellers_silver"),
    ("reviews_bronze", "reviews_silver"),
]

print("=== Row counts (Bronze vs Silver) ===")
for b_key, s_key in pairs:
    b = TABLES[b_key]
    s = TABLES[s_key]
    if not table_exists(s):
        print(f"Skip {s} (missing)")
        continue
    if not table_exists(b):
        print(f"Skip {b} (missing)")
        continue

    b_cnt = count_rows(b)
    s_cnt = count_rows(s)
    diff = s_cnt - b_cnt
    print(f"{b:<28} {b_cnt:>10,} | {s:<22} {s_cnt:>10,} | diff {diff:+,}")



=== Row counts (Bronze vs Silver) ===
br_orders                        99,441 | sl_dev_orders              99,441 | diff +0
br_order_items                  112,650 | sl_dev_order_items        112,650 | diff +0
br_sellers                        3,095 | sl_dev_sellers              3,095 | diff +0
br_reviews                       99,224 | sl_dev_order_reviews       99,224 | diff +0


## Schema inspection

Purpose:
- Confirm that Silver tables:
  - Contain the required columns
  - Use appropriate data types
  - Do not contain unexpected nested or complex structures

Notes:
- Extra clean columns are acceptable
- No business logic, KPIs, or aggregations should appear in Silver
- This is a visual/manual check rather than a strict assertion
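
If the manual inspection were ever tightened into an assertion, one possible shape is sketched below. It works on a list of `(column, type)` pairs in the form returned by `DataFrame.dtypes`. The `required` contract and the `tags` column are made up for illustration, and `check_schema` is a hypothetical helper.

```python
COMPLEX_PREFIXES = ("array<", "map<", "struct<")


def check_schema(dtypes: list[tuple[str, str]], required: dict[str, str]) -> list[str]:
    """Return a list of schema problems; an empty list means the schema passes."""
    actual = dict(dtypes)
    problems = []
    # Required columns must exist with the expected type.
    for col, expected_type in required.items():
        if col not in actual:
            problems.append(f"missing column: {col}")
        elif actual[col] != expected_type:
            problems.append(f"{col}: expected {expected_type}, found {actual[col]}")
    # Nested or complex structures are not expected in Silver.
    for col, dtype in actual.items():
        if dtype.startswith(COMPLEX_PREFIXES):
            problems.append(f"{col}: complex type {dtype}")
    return problems


# Illustrative input only: "tags" is an invented column used to trigger the complex-type rule.
items_dtypes = [
    ("order_id", "string"),
    ("order_item_id", "int"),
    ("price", "double"),
    ("tags", "array<string>"),
]
required = {"order_id": "string", "order_item_id": "int", "freight_value": "double"}
problems = check_schema(items_dtypes, required)
print(problems)
```

Extra scalar columns pass silently, which matches the note that extra clean columns are acceptable.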

In [5]:
print("=== Schemas ===")
for s in [TABLES["orders_silver"], TABLES["items_silver"], TABLES["sellers_silver"], TABLES["reviews_silver"]]:
    if table_exists(s):
        print(f"\nSchema: {s}")
        spark.table(s).printSchema()
    else:
        print(f"\nSchema: {s} (missing)")


=== Schemas ===

Schema: sl_dev_orders
root
 |-- order_id: string (nullable = true)
 |-- customer_id: string (nullable = true)
 |-- order_status: string (nullable = true)
 |-- order_purchase_timestamp: timestamp (nullable = true)
 |-- order_approved_at: timestamp (nullable = true)
 |-- order_delivered_carrier_date: timestamp (nullable = true)
 |-- order_delivered_customer_date: timestamp (nullable = true)
 |-- order_estimated_delivery_date: timestamp (nullable = true)


Schema: sl_dev_order_items
root
 |-- order_id: string (nullable = true)
 |-- order_item_id: integer (nullable = true)
 |-- product_id: string (nullable = true)
 |-- seller_id: string (nullable = true)
 |-- price: double (nullable = true)
 |-- freight_value: double (nullable = true)


Schema: sl_dev_sellers
root
 |-- seller_id: string (nullable = true)
 |-- seller_city: string (nullable = true)
 |-- seller_state: string (nullable = true)


Schema: sl_dev_order_reviews
root
 |-- order_id: string (nullable = true)
 |-- rev

## Key uniqueness validation

Purpose:
- Validate that Silver tables respect their expected grain

Expected uniqueness:
- `sl_dev_orders` → order_id
- `sl_dev_sellers` → seller_id
- `sl_dev_order_items` → (order_id, order_item_id)

Notes:
- `sl_dev_order_reviews` may legitimately have multiple rows per order_id
  and is therefore measured but not enforced

Interpretation:
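- `dup_rows = 0` → every key value is unique
- Any non-zero `dup_rows` for orders, sellers, or items → blocker
- `dup_rows` counts the key values that occur more than once, not the number of surplus rows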

### Key uniqueness check

Why this matters:
- Silver tables are expected to have stable business keys
- Duplicate keys can cause metric inflation in Gold


In [6]:
def dup_key_count(table: str, key_cols: list[str]) -> int:
    """Return the number of distinct key values that occur in more than one row."""
    df = spark.table(table)
    return df.groupBy(*key_cols).count().where(F.col("count") > 1).count()

print("=== Duplicate key rows (should be 0 for orders, sellers, items) ===")
for t, keys in KEYS.items():
    if not table_exists(t):
        print(f"{t}: missing")
        continue
    dups = dup_key_count(t, keys)
    print(f"{t:<22} keys={keys} dup_rows={dups:,}")


=== Duplicate key rows (should be 0 for orders, sellers, items) ===
sl_dev_orders          keys=['order_id'] dup_rows=0
sl_dev_order_items     keys=['order_id', 'order_item_id'] dup_rows=0
sl_dev_sellers         keys=['seller_id'] dup_rows=0
sl_dev_order_reviews   keys=['order_id'] dup_rows=547
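
The figure reported per table is the number of key values that repeat, which is not the same as the number of surplus rows. The plain-Python sketch below illustrates the difference on a made-up list of keys; `duplicate_key_stats` is a hypothetical helper.

```python
from collections import Counter


def duplicate_key_stats(keys) -> tuple[int, int]:
    """Return (key values seen more than once, rows beyond the first per key)."""
    counts = Counter(keys)
    dup_keys = sum(1 for n in counts.values() if n > 1)
    surplus_rows = sum(n - 1 for n in counts.values() if n > 1)
    return dup_keys, surplus_rows


# Made-up keys: "b" appears twice and "c" three times.
order_ids = ["a", "b", "b", "c", "c", "c"]
stats = duplicate_key_stats(order_ids)
print(stats)
```

The first number corresponds to what the check above reports; the second is the number of rows that would disappear if the table were deduplicated on the key.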


## NULL checks on key columns (data integrity)

Purpose:
- Validate that primary key columns in Silver tables do not contain NULL values
- Ensure Silver tables are safe to join and aggregate downstream

Keys checked:
- `sl_dev_orders` → order_id
- `sl_dev_sellers` → seller_id
- `sl_dev_order_items` → (order_id, order_item_id)
- `sl_dev_order_reviews` → order_id

Why this matters:
- NULL keys break joins and aggregations in Gold
- NULL keys usually indicate ingestion or transformation issues
- Enforcing non-null on keys is acceptable in Silver

Expected result:
- All NULL counts for key columns should be 0
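
A displayed DataFrame has to be read by eye. As a hedged sketch, the counts could instead be collected into a dictionary (for example via `.first().asDict()` on the aggregated DataFrame) and filtered for violations. `null_key_violations` is a hypothetical helper and the sample counts are invented.

```python
from typing import Optional


def null_key_violations(null_counts: dict[str, Optional[int]]) -> dict[str, int]:
    """Keep only the columns whose NULL count is non-zero.

    A count of None is treated as zero: a SUM over an empty table returns NULL.
    """
    return {col: n for col, n in null_counts.items() if n}


# Invented counts for illustration.
sample_counts = {"null_order_id": 0, "null_order_item_id": 3, "null_seller_id": None}
violations = null_key_violations(sample_counts)
print(violations)
```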

In [7]:
def null_counts(table: str, cols: list[str]):
    df = spark.table(table)
    exprs = [F.sum(F.col(c).isNull().cast("int")).alias(f"null_{c}") for c in cols]
    return df.agg(*exprs)

print("=== Null counts on key columns (should be 0) ===")
for t, keys in KEYS.items():
    if not table_exists(t):
        print(f"{t}: missing")
        continue
    display(null_counts(t, keys))


=== Null counts on key columns (should be 0) ===



## Referential integrity checks

Purpose:
- Ensure relationships between Silver tables are valid
- Detect orphan records early

Checks performed:
- Order items must reference an existing order
- Order items must reference an existing seller
- Reviews must reference an existing order

Interpretation:
- Anti-join count = 0 → referential integrity intact
- Any non-zero value → data quality issue that must be understood
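
The anti-join idea can be illustrated without Spark: collect the distinct child keys and keep those with no match among the parent keys. The sketch below uses made-up order IDs, and `orphan_keys` is a hypothetical helper.

```python
def orphan_keys(child_keys, parent_keys) -> set:
    """Return the distinct child keys that have no matching parent key.

    A missing (None) child key never matches, mirroring how NULL join keys
    behave in a SQL anti-join.
    """
    parents = {k for k in parent_keys if k is not None}
    return {k for k in child_keys if k not in parents}


# Made-up keys: "o9" is referenced by an item but is not an order.
item_order_ids = ["o1", "o2", "o2", "o9"]
order_ids = ["o1", "o2", "o3"]
orphans = orphan_keys(item_order_ids, order_ids)
print(orphans)
```

Parents without children ("o3" here) are not reported, because the check runs in one direction only: from the referencing table to the referenced one.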

In [7]:
def anti_join_count(left_table: str, right_table: str, join_cols: list[str]) -> int:
    l = spark.table(left_table).select(*join_cols).dropDuplicates()
    r = spark.table(right_table).select(*join_cols).dropDuplicates()
    return l.join(r, on=join_cols, how="left_anti").count()

print("=== Referential integrity (ideally 0) ===")

orders = TABLES["orders_silver"]
items = TABLES["items_silver"]
sellers = TABLES["sellers_silver"]
reviews = TABLES["reviews_silver"]

if table_exists(items) and table_exists(orders):
    print("items.order_id not in orders:", anti_join_count(items, orders, ["order_id"]))

if table_exists(items) and table_exists(sellers):
    print("items.seller_id not in sellers:", anti_join_count(items, sellers, ["seller_id"]))

if table_exists(reviews) and table_exists(orders):
    print("reviews.order_id not in orders:", anti_join_count(reviews, orders, ["order_id"]))


=== Referential integrity (ideally 0) ===
items.order_id not in orders: 0
items.seller_id not in sellers: 0
reviews.order_id not in orders: 0


## NULL preservation validation

Purpose:
- Confirm that Silver preserves **real-world missingness**
- Ensure NULLs are not force-filled or filtered out

What we check:
- Delivery timestamps for undelivered orders
- Review scores
- Other non-key attributes

Expected behaviour:
- NULLs exist where business reality requires them
- No empty strings remain after casting
- No fake default values (e.g. 0 or epoch dates)

This confirms Silver follows the rule:
"Keep NULLs as NULL"

Observed result:
- `order_delivered_customer_date` contains 2,965 NULL values.
- This matches the number of blank delivery dates in the raw CSV and corresponds to
  cancelled or undelivered orders.
- This confirms that Silver correctly preserves real-world missingness without
  forcing default values.
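
To make the "no fake defaults" expectation concrete, the sketch below profiles a small list of values for real NULLs versus placeholder values. The values are invented, `suspicious_defaults` is a hypothetical helper, and treating the Unix epoch as a placeholder is an assumption taken from the example in the list above.

```python
from datetime import datetime

EPOCH = datetime(1970, 1, 1)  # assumed placeholder date, per the "epoch dates" example


def suspicious_defaults(values) -> dict[str, int]:
    """Count real NULLs and values that look like stand-ins for missing data."""
    return {
        "nulls": sum(1 for v in values if v is None),
        "empty_strings": sum(1 for v in values if isinstance(v, str) and v.strip() == ""),
        "epoch_dates": sum(1 for v in values if isinstance(v, datetime) and v == EPOCH),
    }


# Invented sample: one real timestamp, one NULL, one epoch placeholder, one empty string.
sample_values = [datetime(2017, 11, 24, 10, 0), None, EPOCH, ""]
profile = suspicious_defaults(sample_values)
print(profile)
```

In a healthy Silver column only the `nulls` counter should be non-zero.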

In [8]:
from pyspark.sql import functions as F

def null_profile(table, cols):
    df = spark.table(table)
    return df.select([F.sum(F.col(c).isNull().cast("int")).alias(c) for c in cols])

display(null_profile("sl_dev_orders", ["order_purchase_timestamp", "order_delivered_customer_date", "order_estimated_delivery_date"]))
display(null_profile("sl_dev_order_items", ["price", "freight_value"]))
display(null_profile("sl_dev_sellers", ["seller_city", "seller_state"]))
display(null_profile("sl_dev_order_reviews", ["review_score"]))



## CSV blanks vs Silver NULL confirmation

Purpose:
- Validate that blank values in raw CSV files
  are correctly converted to NULL in Silver

Evidence:
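- Raw CSV shows 2,965 blank values in `order_delivered_customer_date`
- Silver shows 2,965 NULL values in the same column
- No empty strings remain: the column is typed `timestamp`, so it cannot store an empty string. The `empty_string_count` in the output evaluates to NULL rather than 0 because comparing a timestamp with `""` yields NULL for every row

This is consistent with: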
- Correct CSV parsing
- Correct type casting
- No silent data loss or coercion
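
The raw-side count can be reproduced with the standard `csv` module. The sketch below runs on a tiny inline sample rather than the real file, whose path is not given in this notebook; `count_blank` is a hypothetical helper.

```python
import csv
import io


def count_blank(csv_text: str, column: str) -> int:
    """Count rows whose value in `column` is blank or missing."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # DictReader yields None for fields missing from a short row, hence the `or ""`.
    return sum(1 for row in reader if (row[column] or "").strip() == "")


# Inline sample standing in for the raw orders CSV.
sample = (
    "order_id,order_delivered_customer_date\n"
    "o1,2017-11-30 12:00:00\n"
    "o2,\n"
    "o3,\n"
)
blank_count = count_blank(sample, "order_delivered_customer_date")
print(blank_count)
```

Comparing this number with the Silver NULL count is the check described in the evidence list.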

In [9]:
from pyspark.sql import functions as F

df = spark.table("sl_dev_orders")
df.select(
    F.sum((F.col("order_delivered_customer_date").isNull()).cast("int")).alias("null_count"),
    F.sum((F.col("order_delivered_customer_date") == "").cast("int")).alias("empty_string_count")
).show()


+----------+------------------+
|null_count|empty_string_count|
+----------+------------------+
|      2965|              NULL|
+----------+------------------+



## Promotion decision summary

Based on the validation results:

✔ Bronze and Silver row counts match exactly  
✔ Key uniqueness holds for orders, sellers, and items  
✔ Referential integrity is intact  
✔ NULLs are preserved correctly  
✔ CSV blanks are correctly converted to NULL  

Decision:
- `sl_dev_*` tables are **ready for promotion** to `sl_*` for Sprint 1

Notes:
- `sl_dev_order_reviews` contains multiple rows per order_id.
  This is acceptable in Silver; aggregation will be handled in Gold.

### Post-Promotion Sanity Check: sl_* vs sl_dev_*

This check is run **after Silver promotion** as a lightweight safety net.

**Purpose**
- Confirm that promoted `sl_*` tables are **structurally and logically identical**
  to their source `sl_dev_*` tables.
- Detect any unexpected issues during promotion (e.g. duplication, truncation,
  or incorrect overwrite).

**Scope (intentionally minimal)**
- Distinct key counts match between `sl_*` and `sl_dev_*`
- No deep row-by-row diff (already covered in pre-promotion validation)

**Reviews-specific notes**
- Reviews can have multiple rows per order (known behavior: 547 orders in the current data).
- The Silver reviews table is **not aggregated to order-level**.
- Therefore, validation for reviews checks that promoted and dev tables match in:
  - distinct `order_id` count
  - duplication pattern (number of orders with >1 review row)

If all checks pass, Silver is safe for Gold to consume as the source of truth.
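
The comparison described above generalises to any set of summary metrics computed once per table. The sketch below compares two such summaries and lists the mismatches. The metric names and numbers are invented, and `compare_summaries` is a hypothetical helper.

```python
def compare_summaries(promoted: dict[str, int], dev: dict[str, int]) -> list[str]:
    """Return one message per metric that differs between sl_* and sl_dev_*."""
    mismatches = []
    for metric in sorted(set(promoted) | set(dev)):
        p, d = promoted.get(metric), dev.get(metric)
        if p != d:
            mismatches.append(f"{metric}: sl={p}, sl_dev={d}")
    return mismatches


# Invented summaries: the duplication pattern differs by one order.
promoted_summary = {"distinct_order_id": 100, "orders_with_multiple_reviews": 5}
dev_summary = {"distinct_order_id": 100, "orders_with_multiple_reviews": 6}
mismatches = compare_summaries(promoted_summary, dev_summary)
print(mismatches)
```

Collecting all mismatches before failing gives a complete picture in one run, rather than stopping at the first difference.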

In [10]:
from pyspark.sql import functions as F

def dup_order_count(table_name):
    df = spark.table(table_name)
    return (
        df.groupBy("order_id")
          .count()
          .filter(F.col("count") > 1)
          .count()
    )

def distinct_orders(table_name):
    return spark.table(table_name).select("order_id").dropna().dropDuplicates().count()

# Reviews sanity (do NOT enforce 1 row per order because reviews can repeat)
sl_dist = distinct_orders("sl_order_reviews")
dev_dist = distinct_orders("sl_dev_order_reviews")

sl_dups = dup_order_count("sl_order_reviews")
dev_dups = dup_order_count("sl_dev_order_reviews")

print(f"sl_order_reviews distinct order_id: {sl_dist}")
print(f"sl_dev_order_reviews distinct order_id: {dev_dist}")
print(f"sl_order_reviews orders with >1 review row: {sl_dups}")
print(f"sl_dev_order_reviews orders with >1 review row: {dev_dups}")

if sl_dist != dev_dist:
    raise Exception(f"❌ Distinct order_id mismatch: sl={sl_dist}, sl_dev={dev_dist}")

if sl_dups != dev_dups:
    raise Exception(f"❌ Duplicate-order pattern mismatch: sl={sl_dups}, sl_dev={dev_dups}")

print("✅ Reviews post-promotion sanity PASSED (distinct + duplicate pattern match)")


sl_order_reviews distinct order_id: 98673
sl_dev_order_reviews distinct order_id: 98673
sl_order_reviews orders with >1 review row: 547
sl_dev_order_reviews orders with >1 review row: 547
✅ Reviews post-promotion sanity PASSED (distinct + duplicate pattern match)


## Cross-Check – Silver Layer (Order-Level)

As an additional verification, we cross-checked the same date (2017-11-24) at the
Silver layer using order-level data from `sl_orders`.

----

In [1]:
%%sql
SELECT
  CAST(order_purchase_timestamp AS DATE) AS order_date,
  COUNT(DISTINCT order_id) AS orders
FROM sl_orders
WHERE CAST(order_purchase_timestamp AS DATE) = '2017-11-24'
GROUP BY CAST(order_purchase_timestamp AS DATE);


<Spark SQL result set with 1 rows and 2 fields>

----
**Result:**

Orders: 1176

Note:
The Silver count is higher by 1 than the corresponding Gold figure, due to differences in aggregation grain.
Silver reflects raw order-level truth, while Gold aggregates data at the
seller-day level for performance analytics. The discrepancy is expected and
does not impact BI correctness.

---