# Silver Layer – Migration and Development

## Purpose
This notebook serves two purposes:
1. Perform a **physical migration** of existing Silver tables from `lh_olist_shared`
   into the new `lh_olist_silver` lakehouse using Spark (to preserve schema,
   duplicates, and values exactly).
2. Develop and validate the **remaining Silver tables** required for future
   Gold sub-problems.

Dataflow Gen2 was evaluated for migration but Spark is used here to ensure
row-level correctness.

---

## Scope
### In scope
- Copy existing `sl_*` and `sl_dev_*` tables into `lh_olist_silver`
- Build and validate remaining Silver tables from Bronze (`br_*`)
- Apply consistent Silver-layer standards (types, null handling, dedup rules)

### Out of scope
- Gold fact/dimension modelling
- BI semantic modelling

---
## 1. Physical Migration of Existing Silver Tables

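This section covers the one-time physical copy of existing Silver tables
from `lh_olist_shared` into `lh_olist_silver` using Spark. The cells below
verify the result: row counts, schemas, row-level fingerprints, and key
cardinalities.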

---

In [4]:
pairs = [
  ("sl_orders","lh_olist_shared.dbo.sl_orders"),
  ("sl_order_items","lh_olist_shared.dbo.sl_order_items"),
  ("sl_order_reviews","lh_olist_shared.dbo.sl_order_reviews"),
  ("sl_sellers","lh_olist_shared.dbo.sl_sellers"),
  ("sl_dev_orders","lh_olist_shared.dbo.sl_dev_orders"),
  ("sl_dev_order_items","lh_olist_shared.dbo.sl_dev_order_items"),
  ("sl_dev_order_reviews","lh_olist_shared.dbo.sl_dev_order_reviews"),
  ("sl_dev_sellers","lh_olist_shared.dbo.sl_dev_sellers"),
]

for new_t, old_t in pairs:
    new_cnt = spark.table(new_t).count()
    old_cnt = spark.table(old_t).count()
    print(new_t, "new:", new_cnt, "old:", old_cnt, "match:", new_cnt==old_cnt)


sl_orders new: 99441 old: 99441 match: True
sl_order_items new: 112650 old: 112650 match: True
sl_order_reviews new: 99224 old: 99224 match: True
sl_sellers new: 3095 old: 3095 match: True
sl_dev_orders new: 99441 old: 99441 match: True
sl_dev_order_items new: 112650 old: 112650 match: True
sl_dev_order_reviews new: 99224 old: 99224 match: True
sl_dev_sellers new: 3095 old: 3095 match: True


##### This cell validates that the migrated Silver tables in `lh_olist_silver` have the same schemas as the original tables in `lh_olist_shared`.

In [9]:
tables = [
    "sl_orders","sl_order_items","sl_order_reviews","sl_sellers",
    "sl_dev_orders","sl_dev_order_items","sl_dev_order_reviews","sl_dev_sellers",
]

for t in tables:
    new_schema = spark.table(t).schema
    old_schema = spark.table(f"lh_olist_shared.dbo.{t}").schema
    print(f"{t} schema_match: {new_schema == old_schema}")

print(f"\n✅ Schema checks completed for {len(tables)} tables.")


sl_orders schema_match: True
sl_order_items schema_match: True
sl_order_reviews schema_match: True
sl_sellers schema_match: True
sl_dev_orders schema_match: True
sl_dev_order_items schema_match: True
sl_dev_order_reviews schema_match: True
sl_dev_sellers schema_match: True

✅ Schema checks completed for 8 tables.


In [10]:
from pyspark.sql import functions as F

tables = [
    "sl_orders","sl_order_items","sl_order_reviews","sl_sellers",
    "sl_dev_orders","sl_dev_order_items","sl_dev_order_reviews","sl_dev_sellers",
]

def table_signature(table_name: str):
    df = spark.table(table_name)
    cols = df.columns  # stable order from metastore

    # Build a single string per row representing all columns (NULL-safe)
    row_str = F.concat_ws("||", *[F.coalesce(F.col(c).cast("string"), F.lit("∅")) for c in cols])

    # Strong row hash (string) + fast 64-bit hash (numeric) for aggregate checks
    h256 = F.sha2(row_str, 256).alias("h256")
    h64  = F.xxhash64(row_str).alias("h64")

    hashed = df.select(h256, h64)

    # Table-level signature: counts + two independent aggregates
    sig = (hashed.agg(
        F.count("*").alias("rows"),
        F.countDistinct("h256").alias("distinct_row_hashes"),
        F.sum("h64").alias("sum_h64"),
        F.expr("bit_xor(h64)").alias("xor_h64")
    ).collect()[0])

    return sig

def compare_table(t: str):
    new_sig = table_signature(t)
    old_sig = table_signature(f"lh_olist_shared.dbo.{t}")

    match = (
        new_sig["rows"] == old_sig["rows"]
        and new_sig["distinct_row_hashes"] == old_sig["distinct_row_hashes"]
        and new_sig["sum_h64"] == old_sig["sum_h64"]
        and new_sig["xor_h64"] == old_sig["xor_h64"]
    )

    print(f"\n=== Fingerprint: {t} ===")
    print("NEW:", dict(new_sig.asDict()))
    print("OLD:", dict(old_sig.asDict()))
    print("match:", match)

for t in tables:
    compare_table(t)

print(f"\n✅ Fingerprint checks completed for {len(tables)} tables.")



=== Fingerprint: sl_orders ===
NEW: {'rows': 99441, 'distinct_row_hashes': 99441, 'sum_h64': -5144854484098317878, 'xor_h64': -7756377639038886958}
OLD: {'rows': 99441, 'distinct_row_hashes': 99441, 'sum_h64': -5144854484098317878, 'xor_h64': -7756377639038886958}
match: True

=== Fingerprint: sl_order_items ===
NEW: {'rows': 112650, 'distinct_row_hashes': 112650, 'sum_h64': -894262625420336762, 'xor_h64': -5235251594014130510}
OLD: {'rows': 112650, 'distinct_row_hashes': 112650, 'sum_h64': -894262625420336762, 'xor_h64': -5235251594014130510}
match: True

=== Fingerprint: sl_order_reviews ===
NEW: {'rows': 99224, 'distinct_row_hashes': 98875, 'sum_h64': -6416433710245124699, 'xor_h64': 1859019713694051919}
OLD: {'rows': 99224, 'distinct_row_hashes': 98875, 'sum_h64': -6416433710245124699, 'xor_h64': 1859019713694051919}
match: True

=== Fingerprint: sl_sellers ===
NEW: {'rows': 3095, 'distinct_row_hashes': 3095, 'sum_h64': 8268492846195716669, 'xor_h64': 1575174820700069101}
OLD: {'r
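
The fingerprint idea behind `table_signature` can be illustrated without Spark. The following is a minimal plain-Python sketch, not the notebook's implementation: `row_key` and `fingerprint` are hypothetical helpers, and SHA-256 truncated to 64 bits stands in for `xxhash64`. It shows why summing and XOR-ing per-row hashes gives a signature that ignores row order but reacts to a changed value.

```python
import hashlib

NULL_TOKEN = "∅"   # sentinel for NULL, mirroring the notebook's coalesce
SEP = "||"         # column separator, mirroring concat_ws


def row_key(row):
    """Join all column values into one string, NULL-safe."""
    return SEP.join(NULL_TOKEN if v is None else str(v) for v in row)


def fingerprint(rows):
    """Order-independent signature: counts plus sum/XOR of 64-bit row hashes."""
    total, xor, count = 0, 0, 0
    seen = set()
    for row in rows:
        digest = hashlib.sha256(row_key(row).encode("utf-8")).digest()
        h64 = int.from_bytes(digest[:8], "big")   # assumption: stand-in for xxhash64
        total = (total + h64) % (1 << 64)         # keep the sum within 64 bits
        xor ^= h64
        seen.add(digest)
        count += 1
    return {
        "rows": count,
        "distinct_row_hashes": len(seen),
        "sum_h64": total,
        "xor_h64": xor,
    }


# Invented sample rows, including one exact duplicate and one NULL
original = [("o1", "delivered", None), ("o2", "shipped", 3), ("o2", "shipped", 3)]
shuffled = [original[2], original[0], original[1]]
altered = [("o1", "delivered", None), ("o2", "shipped", 4), ("o2", "shipped", 3)]

print(fingerprint(original) == fingerprint(shuffled))  # same rows, different order
print(fingerprint(original) == fingerprint(altered))   # one value changed
```

XOR alone cancels out pairs of identical rows, which is one reason to keep the row count and the sum alongside it.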

In [11]:
checks = [
    ("sl_orders", "order_id"),
    ("sl_order_items", "order_id"),
    ("sl_order_reviews", "order_id"),
    ("sl_sellers", "seller_id"),
]

for t, key in checks:
    new_df = spark.table(t)
    old_df = spark.table(f"lh_olist_shared.dbo.{t}")

    new_dist = new_df.select(key).distinct().count()
    old_dist = old_df.select(key).distinct().count()

    print(f"\n{t} distinct {key} new={new_dist} old={old_dist} match={new_dist==old_dist}")

# Known pattern: orders with more than one review row (about 547 expected)
t = "sl_order_reviews"
new_multi = spark.table(t).groupBy("order_id").count().filter("count > 1").count()
old_multi = spark.table(f"lh_olist_shared.dbo.{t}").groupBy("order_id").count().filter("count > 1").count()
print(f"\n{t} orders with >1 review row new={new_multi} old={old_multi} match={new_multi==old_multi}")



sl_orders distinct order_id new=99441 old=99441 match=True

sl_order_items distinct order_id new=98666 old=98666 match=True

sl_order_reviews distinct order_id new=98673 old=98673 match=True

sl_sellers distinct seller_id new=3095 old=3095 match=True

sl_order_reviews orders with >1 review row new=547 old=547 match=True
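
The "orders with more than one review row" check is a grouped count followed by a filter. A minimal plain-Python equivalent may make the logic easier to follow; `keys_with_multiple_rows` is a hypothetical helper and the sample rows are invented.

```python
from collections import Counter


def keys_with_multiple_rows(rows, key):
    """Return {key_value: row_count} for key values that occur more than once."""
    counts = Counter(row[key] for row in rows)
    return {k: n for k, n in counts.items() if n > 1}


# Illustrative sample only, not data from the notebook
reviews = [
    {"review_id": "r1", "order_id": "A"},
    {"review_id": "r2", "order_id": "A"},
    {"review_id": "r3", "order_id": "B"},
    {"review_id": "r4", "order_id": "C"},
    {"review_id": "r5", "order_id": "C"},
    {"review_id": "r6", "order_id": "C"},
]

multi = keys_with_multiple_rows(reviews, "order_id")
surplus_rows = len(reviews) - len({r["order_id"] for r in reviews})

print("keys with >1 row:", multi)
print("surplus rows:", surplus_rows, "| affected keys:", len(multi))
```

Surplus rows and affected keys differ whenever a key has more than two rows. In the output above, 99224 rows over 98673 distinct `order_id` values leaves 551 surplus rows across 547 affected orders, so a few orders must carry more than two review rows.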


---
## 2. Bronze → Silver Validation (Sprint 2)

This section validates that:
1) Existing Silver tables match their Bronze source (parity checks)
2) Remaining Bronze tables are readable and ready for Silver development


---
### 2.1 Existing Silver Tables – Bronze vs Silver Parity

The following Silver tables have already been migrated or promoted.
For these tables, we validate that Silver correctly reflects Bronze by
checking row-count parity.

Only tables that already exist in Silver are included in this check.

In [17]:
# Explicit mapping of Bronze → existing Silver tables
existing_tables = {
    "br_orders": "sl_orders",
    "br_order_items": "sl_order_items",
    "br_reviews": "sl_order_reviews",   # Bronze name differs from the Silver name
    "br_sellers": "sl_sellers"
}

results = []

for br_t, sl_t in existing_tables.items():
    bronze_cnt = spark.table(f"bronze_shortcut.{br_t}").count()
    silver_cnt = spark.table(f"dbo.{sl_t}").count()

    results.append((br_t, bronze_cnt, sl_t, silver_cnt, bronze_cnt == silver_cnt))

spark.createDataFrame(
    results,
    ["bronze_table", "bronze_row_count", "silver_table", "silver_row_count", "count_match"]
).show(truncate=False)


+--------------+----------------+----------------+----------------+-----------+
|bronze_table  |bronze_row_count|silver_table    |silver_row_count|count_match|
+--------------+----------------+----------------+----------------+-----------+
|br_orders     |99441           |sl_orders       |99441           |true       |
|br_order_items|112650          |sl_order_items  |112650          |true       |
|br_reviews    |99224           |sl_order_reviews|99224           |true       |
|br_sellers    |3095            |sl_sellers      |3095            |true       |
+--------------+----------------+----------------+----------------+-----------+



---
### 2.2 Remaining Bronze Tables – Readability Check

The following Bronze tables have not yet been developed into Silver.
For these tables, we only validate that they are readable via the
OneLake schema shortcut and ready for Silver development.

No parity comparison is performed at this stage.

In [18]:
remaining_bronze_tables = [
    "br_customers",
    "br_products",
    "br_payments",
    "br_geolocation",
    "br_product_category_translation"
]

results = []

for t in remaining_bronze_tables:
    cnt = spark.table(f"bronze_shortcut.{t}").count()
    results.append((t, cnt))

spark.createDataFrame(
    results,
    ["bronze_table", "bronze_row_count"]
).show(truncate=False)


+-------------------------------+----------------+
|bronze_table                   |bronze_row_count|
+-------------------------------+----------------+
|br_customers                   |99441           |
|br_products                    |32951           |
|br_payments                    |103886          |
|br_geolocation                 |1000163         |
|br_product_category_translation|71              |
+-------------------------------+----------------+



---
### Validation Summary

- Bronze → Silver schema shortcut is functioning correctly.
- Existing Silver tables match Bronze row counts.
- Remaining Bronze tables are readable and ready for Silver development.


This staged validation approach aligns with the Sprint 2 migration strategy
and avoids premature validation of tables not yet developed.

---

## 3. Development of Remaining Silver Tables (Sprint 2)

This section builds the remaining Silver **dev** tables from the Bronze shortcut (`bronze_shortcut.br_*`).

Scope:
- Create `sl_dev_*` tables for the remaining 5 entities:
  - customers, geolocation, payments, products, product_category_translation

Design principles (same spirit as Sprint 1):
- Light standardisation only (trim strings, basic type casts)
- No business logic or enrichment here
- Output tables are DEV only (`sl_dev_*`) and are **not** consumed by Gold directly
- Promotion to `sl_*` happens later after validation

Expected outcomes:
- `dbo.sl_dev_customers`
- `dbo.sl_dev_geolocation`
- `dbo.sl_dev_payments`
- `dbo.sl_dev_products`
- `dbo.sl_dev_product_category_translation`

In [1]:
# 3.1 Setup + guardrails
from pyspark.sql import functions as F

BRONZE_DB = "bronze_shortcut"   # shortcut database name in lh_olist_silver
SILVER_DB = "dbo"               # where sl_dev_* tables live in lh_olist_silver

def bronze_tbl(name: str) -> str:
    return f"{BRONZE_DB}.{name}"

def silver_tbl(name: str) -> str:
    return f"{SILVER_DB}.{name}"

def trim_all_strings(df):
    # trims all string columns (safe + demo-friendly)
    out = df
    for c, t in df.dtypes:
        if t == "string":
            out = out.withColumn(c, F.trim(F.col(c)))
    return out

def write_dev_table(df, table_name: str):
    (
        df.write
          .mode("overwrite")
          .format("delta")
          .saveAsTable(silver_tbl(table_name))
    )
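    # Count the persisted table rather than re-evaluating the source DataFrame
    written = spark.table(silver_tbl(table_name)).count()
    print(f"✅ Written: {silver_tbl(table_name)} | rows={written}")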


In [2]:
# 3.2 Customers → dbo.sl_dev_customers
src = spark.table(bronze_tbl("br_customers"))
df = trim_all_strings(src)

# (Light standardisation only)
# Keep ids as string, keep zip prefix as string (safe)
df = df.select(
    "customer_id",
    "customer_unique_id",
    "customer_zip_code_prefix",
    "customer_city",
    "customer_state"
)

write_dev_table(df, "sl_dev_customers")
display(df.limit(5))
df.printSchema()


✅ Written: dbo.sl_dev_customers | rows=99441


SynapseWidget(Synapse.DataFrame, 88942271-5ffc-42de-8989-955fa67a4599)

root
 |-- customer_id: string (nullable = true)
 |-- customer_unique_id: string (nullable = true)
 |-- customer_zip_code_prefix: string (nullable = true)
 |-- customer_city: string (nullable = true)
 |-- customer_state: string (nullable = true)



In [3]:
# 3.3 Geolocation → dbo.sl_dev_geolocation
src = spark.table(bronze_tbl("br_geolocation"))
df = trim_all_strings(src)

# Light casting (lat/lng should be numeric; some datasets store as string)
df = df.select(
    F.col("geolocation_zip_code_prefix").alias("geolocation_zip_code_prefix"),
    F.col("geolocation_lat").cast("double").alias("geolocation_lat"),
    F.col("geolocation_lng").cast("double").alias("geolocation_lng"),
    "geolocation_city",
    "geolocation_state"
)

write_dev_table(df, "sl_dev_geolocation")
display(df.limit(5))
df.printSchema()


✅ Written: dbo.sl_dev_geolocation | rows=1000163


SynapseWidget(Synapse.DataFrame, d01b6c2d-23c0-4bfd-b53d-0407416bf653)

root
 |-- geolocation_zip_code_prefix: string (nullable = true)
 |-- geolocation_lat: double (nullable = true)
 |-- geolocation_lng: double (nullable = true)
 |-- geolocation_city: string (nullable = true)
 |-- geolocation_state: string (nullable = true)



In [4]:
# 3.4 Payments → dbo.sl_dev_payments
src = spark.table(bronze_tbl("br_payments"))
df = trim_all_strings(src)

df = df.select(
    "order_id",
    F.col("payment_sequential").cast("int").alias("payment_sequential"),
    "payment_type",
    F.col("payment_installments").cast("int").alias("payment_installments"),
    F.col("payment_value").cast("double").alias("payment_value"),
)

write_dev_table(df, "sl_dev_payments")
display(df.limit(5))
df.printSchema()


✅ Written: dbo.sl_dev_payments | rows=103886


SynapseWidget(Synapse.DataFrame, c5620abe-25dd-47df-b665-155f02d5e4a0)

root
 |-- order_id: string (nullable = true)
 |-- payment_sequential: integer (nullable = true)
 |-- payment_type: string (nullable = true)
 |-- payment_installments: integer (nullable = true)
 |-- payment_value: double (nullable = true)



In [5]:
# 3.5 Products → dbo.sl_dev_products
src = spark.table(bronze_tbl("br_products"))
df = trim_all_strings(src)

df = df.select(
    "product_id",
    "product_category_name",
    F.col("product_name_lenght").cast("int").alias("product_name_lenght"),
    F.col("product_description_lenght").cast("int").alias("product_description_lenght"),
    F.col("product_photos_qty").cast("int").alias("product_photos_qty"),
    F.col("product_weight_g").cast("double").alias("product_weight_g"),
    F.col("product_length_cm").cast("double").alias("product_length_cm"),
    F.col("product_height_cm").cast("double").alias("product_height_cm"),
    F.col("product_width_cm").cast("double").alias("product_width_cm"),
)

write_dev_table(df, "sl_dev_products")
display(df.limit(5))
df.printSchema()


✅ Written: dbo.sl_dev_products | rows=32951


SynapseWidget(Synapse.DataFrame, 7514b0d1-7b6a-4603-890c-3daf3a82dbe6)

root
 |-- product_id: string (nullable = true)
 |-- product_category_name: string (nullable = true)
 |-- product_name_lenght: integer (nullable = true)
 |-- product_description_lenght: integer (nullable = true)
 |-- product_photos_qty: integer (nullable = true)
 |-- product_weight_g: double (nullable = true)
 |-- product_length_cm: double (nullable = true)
 |-- product_height_cm: double (nullable = true)
 |-- product_width_cm: double (nullable = true)



In [6]:
# 3.6 Product category translation → dbo.sl_dev_product_category_translation
src = spark.table(bronze_tbl("br_product_category_translation"))
df = trim_all_strings(src)

df = df.select(
    "product_category_name",
    "product_category_name_english"
)

write_dev_table(df, "sl_dev_product_category_translation")
display(df.limit(5))
df.printSchema()


✅ Written: dbo.sl_dev_product_category_translation | rows=71


SynapseWidget(Synapse.DataFrame, ee2491ac-fa1e-4127-8794-983b59f6c30c)

root
 |-- product_category_name: string (nullable = true)
 |-- product_category_name_english: string (nullable = true)



In [7]:
# Check whether these tables are created
dev_tables = [
    "sl_dev_customers",
    "sl_dev_geolocation",
    "sl_dev_payments",
    "sl_dev_products",
    "sl_dev_product_category_translation"
]

rows = []
for t in dev_tables:
    cnt = spark.table(silver_tbl(t)).count()
    rows.append((t, cnt))

display(spark.createDataFrame(rows, ["silver_dev_table", "row_count"]).orderBy("silver_dev_table"))


SynapseWidget(Synapse.DataFrame, 2ef67b64-872a-4c61-8966-89034cbfa1ad)

In [8]:
# Quick sanity check: Bronze vs Silver (row count parity)
# This is a lightweight validation only (not promotion)

table_mapping = {
    "br_products": "sl_dev_products",
    "br_product_category_translation": "sl_dev_product_category_translation",
    "br_customers": "sl_dev_customers",
    "br_geolocation": "sl_dev_geolocation",
    "br_payments": "sl_dev_payments"
}

results = []

for br_tbl, sl_tbl in table_mapping.items():
    bronze_cnt = spark.table(f"bronze_shortcut.{br_tbl}").count()
    silver_cnt = spark.table(f"dbo.{sl_tbl}").count()

    results.append((
        br_tbl,
        bronze_cnt,
        sl_tbl,
        silver_cnt,
        bronze_cnt == silver_cnt
    ))

spark.createDataFrame(
    results,
    [
        "bronze_table",
        "bronze_row_count",
        "silver_table",
        "silver_row_count",
        "count_match"
    ]
).show(truncate=False)


+-------------------------------+----------------+-----------------------------------+----------------+-----------+
|bronze_table                   |bronze_row_count|silver_table                       |silver_row_count|count_match|
+-------------------------------+----------------+-----------------------------------+----------------+-----------+
|br_products                    |32951           |sl_dev_products                    |32951           |true       |
|br_product_category_translation|71              |sl_dev_product_category_translation|71              |true       |
|br_customers                   |99441           |sl_dev_customers                   |99441           |true       |
|br_geolocation                 |1000163         |sl_dev_geolocation                 |1000163         |true       |
|br_payments                    |103886          |sl_dev_payments                    |103886          |true       |
+-------------------------------+----------------+-----------------------------------+----------------+-----------+

In [15]:
from pyspark.sql import functions as F

# Bronze -> Silver mapping (remaining 5)
CHECKS = [
    ("br_customers", "dbo.sl_dev_customers"),
    ("br_geolocation", "dbo.sl_dev_geolocation"),
    ("br_payments", "dbo.sl_dev_payments"),
    ("br_products", "dbo.sl_dev_products"),
    ("br_product_category_translation", "dbo.sl_dev_product_category_translation"),
]

def schema_map(df):
    # {col: "type|nullable"}
    return {f.name: f"{f.dataType.simpleString()}|{f.nullable}" for f in df.schema.fields}

rows = []
schema_diffs = []
col_diffs = []

for br_tbl, sl_tbl in CHECKS:
    br_name = f"bronze_shortcut.{br_tbl}"

    br = spark.table(br_name)
    sl = spark.table(sl_tbl)

    # 1) rowcount parity
    br_cnt = br.count()
    sl_cnt = sl.count()
    cnt_match = (br_cnt == sl_cnt)

    # 2) column coverage
    br_cols = set(br.columns)
    sl_cols = set(sl.columns)

    missing_in_silver = sorted(list(br_cols - sl_cols))
    extra_in_silver = sorted(list(sl_cols - br_cols))

    col_ok = (len(missing_in_silver) == 0 and len(extra_in_silver) == 0)

    col_diffs.append((br_tbl, sl_tbl, ",".join(missing_in_silver) if missing_in_silver else None,
                      ",".join(extra_in_silver) if extra_in_silver else None))

    # 3) schema diffs (only compare shared columns)
    br_map = schema_map(br)
    sl_map = schema_map(sl)

    common_cols = sorted(list(br_cols & sl_cols))
    for c in common_cols:
        if br_map.get(c) != sl_map.get(c):
            schema_diffs.append((br_tbl, sl_tbl, c, br_map.get(c), sl_map.get(c)))

    rows.append((br_tbl, sl_tbl, br_cnt, sl_cnt, cnt_match, col_ok))

summary_df = spark.createDataFrame(
    rows,
    ["bronze_table", "silver_table", "bronze_count", "silver_count", "count_match", "columns_ok"]
)

print("=== Summary (Counts + Column Coverage) ===")
display(summary_df.orderBy("bronze_table"))

# Spark cannot infer schema when columns are all None
# Normalize None -> empty string for safe DataFrame creation
col_diffs_clean = []
for br, sl, missing, extra in col_diffs:
    col_diffs_clean.append((
        br,
        sl,
        missing or "",
        extra or ""
    ))

col_diff_df = spark.createDataFrame(
    col_diffs_clean,
    ["bronze_table", "silver_table", "missing_cols_in_silver", "extra_cols_in_silver"]
)

print("=== Column Diffs (should be empty) ===")
display(col_diff_df.orderBy("bronze_table"))

# Schema diffs are expected where Silver casts column types.
# Display them to confirm the changes are intentional.
print("=== Schema Diffs (review if intentional casts) ===")
if len(schema_diffs) == 0:
    print("✅ No schema diffs detected (Silver types match Bronze exactly).")
else:
    schema_diff_df = spark.createDataFrame(
        schema_diffs,
        ["bronze_table", "silver_table", "column", "bronze(type|nullable)", "silver(type|nullable)"]
    )
    display(schema_diff_df.orderBy("bronze_table", "column"))


=== Summary (Counts + Column Coverage) ===


SynapseWidget(Synapse.DataFrame, ee50c521-b8e6-46cc-a6bb-db196a81e31a)

=== Column Diffs (should be empty) ===


SynapseWidget(Synapse.DataFrame, d47ef776-5adf-4af8-89e0-4f5912f43339)

=== Schema Diffs (review if intentional casts) ===


SynapseWidget(Synapse.DataFrame, 232f083a-f8dc-426c-96ca-b295201da565)
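
The three comparisons in the cell above (missing columns, extra columns, changed descriptors) reduce to set operations on `{column: "type|nullable"}` maps like the ones `schema_map` returns. A minimal sketch in plain Python follows; `schema_diff` is a hypothetical helper and both schemas are invented, since the real Bronze types are not shown in this notebook.

```python
def schema_diff(source, target):
    """Compare two {column: "type|nullable"} maps.

    Returns (missing_in_target, extra_in_target, changed), where `changed`
    maps each shared column to its (source, target) descriptors.
    """
    src_cols, tgt_cols = set(source), set(target)
    missing = sorted(src_cols - tgt_cols)
    extra = sorted(tgt_cols - src_cols)
    changed = {
        c: (source[c], target[c])
        for c in sorted(src_cols & tgt_cols)
        if source[c] != target[c]
    }
    return missing, extra, changed


# Invented schemas for illustration only
bronze = {
    "order_id": "string|True",
    "payment_type": "string|True",
    "payment_value": "string|True",
}
silver = {
    "order_id": "string|True",
    "payment_type": "string|True",
    "payment_value": "double|True",
}

missing, extra, changed = schema_diff(bronze, silver)
print("missing:", missing)
print("extra:", extra)
print("changed:", changed)
```

An empty `missing` and `extra` with a non-empty `changed` is the pattern to expect when Silver only casts types.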

In [13]:
# Sanity check after casts (no unexpected null explosion)
from pyspark.sql import functions as F

CAST_COLS_CHECK = {
    "dbo.sl_dev_geolocation": ["geolocation_lat", "geolocation_lng"],
    "dbo.sl_dev_payments": ["payment_sequential", "payment_installments", "payment_value"],
    "dbo.sl_dev_products": ["product_name_lenght", "product_description_lenght", "product_photos_qty",
                            "product_weight_g", "product_length_cm", "product_height_cm", "product_width_cm"],
}

for tbl, cols in CAST_COLS_CHECK.items():
    df = spark.table(tbl)
    existing = [c for c in cols if c in df.columns]
    if not existing:
        continue
    nulls = df.select([F.sum(F.col(c).isNull().cast("int")).alias(c) for c in existing])
    print(f"\n{tbl} null counts (cast cols):")
    display(nulls)



dbo.sl_dev_geolocation null counts (cast cols):


SynapseWidget(Synapse.DataFrame, 2d345e56-359d-42e1-80c2-dc837d708291)


dbo.sl_dev_payments null counts (cast cols):


SynapseWidget(Synapse.DataFrame, 283e4b9b-81a0-4344-b806-020f273c96dd)


dbo.sl_dev_products null counts (cast cols):


SynapseWidget(Synapse.DataFrame, c0cd25aa-bb9f-45c2-ad14-48ae92dc12cf)
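
A null count on the cast columns is easiest to interpret next to the null count before the cast, because only the difference is attributable to failed casts. The sketch below shows that comparison in plain Python. `try_cast` and `nulls_introduced` are hypothetical helpers, the sample values are invented, and returning `None` on a failed cast is an assumption meant to mimic a cast that yields NULL instead of raising.

```python
def try_cast(value, caster):
    """Return caster(value), or None when the value is NULL or cannot be cast."""
    if value is None:
        return None
    try:
        return caster(value)
    except (TypeError, ValueError):
        return None


def nulls_introduced(values, caster):
    """Count values that were non-NULL before the cast but NULL after it."""
    before = sum(v is None for v in values)
    after = sum(try_cast(v, caster) is None for v in values)
    return after - before


# Invented sample: one genuine NULL and one value that fails the cast
raw_weights = ["650.0", None, "1200", "n/a", " 30 "]
print("nulls introduced by cast:", nulls_introduced(raw_weights, float))
```

A result of zero means every null in the cast column was already null at the source.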

In [12]:
tables_to_show = [
    "dbo.sl_dev_customers",
    "dbo.sl_dev_geolocation",
    "dbo.sl_dev_payments",
    "dbo.sl_dev_products",
    "dbo.sl_dev_product_category_translation",
]

for t in tables_to_show:
    print("\n==============================")
    print(t)
    spark.table(t).printSchema()



dbo.sl_dev_customers
root
 |-- customer_id: string (nullable = true)
 |-- customer_unique_id: string (nullable = true)
 |-- customer_zip_code_prefix: string (nullable = true)
 |-- customer_city: string (nullable = true)
 |-- customer_state: string (nullable = true)


dbo.sl_dev_geolocation
root
 |-- geolocation_zip_code_prefix: string (nullable = true)
 |-- geolocation_lat: double (nullable = true)
 |-- geolocation_lng: double (nullable = true)
 |-- geolocation_city: string (nullable = true)
 |-- geolocation_state: string (nullable = true)


dbo.sl_dev_payments
root
 |-- order_id: string (nullable = true)
 |-- payment_sequential: integer (nullable = true)
 |-- payment_type: string (nullable = true)
 |-- payment_installments: integer (nullable = true)
 |-- payment_value: double (nullable = true)


dbo.sl_dev_products
root
 |-- product_id: string (nullable = true)
 |-- product_category_name: string (nullable = true)
 |-- product_name_lenght: integer (nullable = true)
 |-- product_descri

---
#### Validation & Promotion Note
##### The remaining Silver tables created in this section are not required for current Sprint 2 Gold metrics and BI.


##### Validation and promotion of these tables will be performed by another team member (Peisi) to maintain separation of duties and reflect production governance.
---

## 4. Promotion Readiness Checklist

Before promoting any `sl_dev_*` table to `sl_*`, the following checks
must pass:
- Row count expectations
- Key uniqueness (where applicable)
- Null profile for critical columns
- Schema stability
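
The four checks can be expressed as a single pass/fail report per table. The following is a minimal sketch over a list of dict rows, not the procedure actually used for promotion; `readiness_report` and its parameters are hypothetical, and the sample rows are invented.

```python
def readiness_report(rows, key_cols, critical_cols, expected_rows, expected_cols):
    """Evaluate the four promotion checks on a list of dict rows."""
    keys = [tuple(r.get(c) for c in key_cols) for r in rows]
    null_profile = {c: sum(r.get(c) is None for r in rows) for c in critical_cols}
    actual_cols = sorted({c for r in rows for c in r})
    return {
        "row_count_ok": len(rows) == expected_rows,
        "key_unique": len(keys) == len(set(keys)),
        "no_critical_nulls": all(n == 0 for n in null_profile.values()),
        "schema_stable": actual_cols == sorted(expected_cols),
    }


# Invented sample with a duplicated key to show a failing check
sample = [
    {"customer_id": "c1", "customer_state": "SP"},
    {"customer_id": "c2", "customer_state": None},
    {"customer_id": "c2", "customer_state": "RJ"},
]

report = readiness_report(
    sample,
    key_cols=["customer_id"],
    critical_cols=["customer_id"],
    expected_rows=3,
    expected_cols=["customer_id", "customer_state"],
)
print(report)
print("ready to promote:", all(report.values()))
```

Key uniqueness only applies to tables with a natural key, so `key_cols` would be chosen per table.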

In [1]:
%%sql
-- Promote customers
CREATE OR REPLACE TABLE dbo.sl_customers AS
SELECT * FROM dbo.sl_dev_customers;

-- Promote geolocation
CREATE OR REPLACE TABLE dbo.sl_geolocation AS
SELECT * FROM dbo.sl_dev_geolocation;

-- Promote payments
CREATE OR REPLACE TABLE dbo.sl_payments AS
SELECT * FROM dbo.sl_dev_payments;

-- Promote products
CREATE OR REPLACE TABLE dbo.sl_products AS
SELECT * FROM dbo.sl_dev_products;

-- Promote product category translation
CREATE OR REPLACE TABLE dbo.sl_product_category_translation AS
SELECT * FROM dbo.sl_dev_product_category_translation;


<Spark SQL result set with 0 rows and 0 fields>

<Spark SQL result set with 0 rows and 0 fields>

<Spark SQL result set with 0 rows and 0 fields>

<Spark SQL result set with 0 rows and 0 fields>

<Spark SQL result set with 0 rows and 0 fields>

In [2]:
%%sql
-- Row-count parity: sl_dev_* vs sl_*
SELECT 'customers' AS table_name,
       (SELECT COUNT(*) FROM dbo.sl_dev_customers) AS dev_count,
       (SELECT COUNT(*) FROM dbo.sl_customers)     AS prod_count
UNION ALL
SELECT 'geolocation',
       (SELECT COUNT(*) FROM dbo.sl_dev_geolocation),
       (SELECT COUNT(*) FROM dbo.sl_geolocation)
UNION ALL
SELECT 'payments',
       (SELECT COUNT(*) FROM dbo.sl_dev_payments),
       (SELECT COUNT(*) FROM dbo.sl_payments)
UNION ALL
SELECT 'products',
       (SELECT COUNT(*) FROM dbo.sl_dev_products),
       (SELECT COUNT(*) FROM dbo.sl_products)
UNION ALL
SELECT 'product_category_translation',
       (SELECT COUNT(*) FROM dbo.sl_dev_product_category_translation),
       (SELECT COUNT(*) FROM dbo.sl_product_category_translation);


<Spark SQL result set with 5 rows and 3 fields>

In [3]:
%%sql
-- Schema spot check (representative tables)
DESCRIBE TABLE dbo.sl_products;

DESCRIBE TABLE dbo.sl_payments;


<Spark SQL result set with 9 rows and 3 fields>

<Spark SQL result set with 5 rows and 3 fields>

In [4]:
%%sql
-- Null sanity checks for critical columns
SELECT
  'sl_products.product_id' AS column_checked,
  SUM(CASE WHEN product_id IS NULL THEN 1 ELSE 0 END) AS null_count
FROM dbo.sl_products

UNION ALL

SELECT
  'sl_customers.customer_id',
  SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END)
FROM dbo.sl_customers

UNION ALL

SELECT
  'sl_orders.order_id',
  SUM(CASE WHEN order_id IS NULL THEN 1 ELSE 0 END)
FROM dbo.sl_orders;


<Spark SQL result set with 3 rows and 2 fields>

---

## Promotion Summary (Silver)

The following Silver tables have been promoted from `sl_dev_*` to `sl_*` using
`CREATE OR REPLACE TABLE AS SELECT` (CTAS):

- sl_customers
- sl_geolocation
- sl_payments
- sl_products
- sl_product_category_translation

Validation performed:
- Row-count parity: dev vs promoted tables (PASS)
- Schema spot checks on representative tables (PASS)
- Null checks on critical business keys (PASS)

Validation reference:
- sl_dev_* tables validated earlier by Peisi
- Promotion executed by Alvin

These Silver tables are now ready for Gold consumption in Sprint 2.
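
Because the five promotions differ only in the entity name, the CTAS statements could also be generated from a list. This is a sketch of that idea rather than what was run; `promotion_sql` and `ENTITIES` are hypothetical names, and the `dbo` default is taken from the cells above.

```python
ENTITIES = [
    "customers",
    "geolocation",
    "payments",
    "products",
    "product_category_translation",
]


def promotion_sql(entity, schema="dbo"):
    """Build the CTAS statement that promotes one sl_dev_* table to sl_*."""
    return (
        f"CREATE OR REPLACE TABLE {schema}.sl_{entity} AS\n"
        f"SELECT * FROM {schema}.sl_dev_{entity}"
    )


statements = [promotion_sql(e) for e in ENTITIES]
for stmt in statements:
    print(stmt)
    print()
```

In a notebook, each string could then be passed to `spark.sql(...)`, so adding an entity becomes a one-line change.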

---


# Silver Validation: Payment Behaviour & Grain Checks

## Objective
Validate the **source-of-truth (Silver)** data to confirm:
1. Order-item grain is clean (no unintended duplicates), and  
2. Payment behaviour naturally supports **multi-row payments per order**.

These checks establish that downstream Gold issues are **modelling-related**, not source data quality problems.

---

## Summary of Findings

### 1. Order Items Are Unique at Order-Item Grain

- Silver `order_items` contains **exactly one row per order_id + order_item_id**.
- No duplicates were found at the order-item business grain.

**Evidence**
- Grouping by `(order_id, order_item_id)` returns a single row per combination.
- Confirms item-level data is clean and stable in Silver.

---

### 2. Payments Are Naturally Multi-row Per Order

- Silver `payments` contains **multiple rows per order_id** for certain orders.
- This is expected business behaviour (e.g. an order split across several payment methods, each recorded under its own `payment_sequential`).

**Evidence**
- Some orders have **more than one payment row**.
- Maximum payment rows per order are within expected bounds for split payments.

---

### 3. Silver Explains Both Gold Scenarios

**Example A: Single Payment Order**
- 1 order-item row
- 1 payment row
- No fan-out expected downstream

**Example B: Multi-payment Order**
- 1 order-item row
- Multiple payment rows
- Fan-out will occur if payments are joined at item grain

This confirms Silver data itself is **correct**, and any duplication in Gold arises only when payment rows are joined incorrectly.
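
The two scenarios can be reproduced on a toy dataset. The sketch below uses an in-memory SQLite database as a stand-in for the Silver tables; the table names, columns, and values are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE items (order_id TEXT, order_item_id INTEGER, price REAL);
CREATE TABLE payments (order_id TEXT, payment_sequential INTEGER, payment_value REAL);
-- Order A: 1 item, 1 payment. Order B: 1 item, 3 payments. (Invented values.)
INSERT INTO items VALUES ('A', 1, 50.0), ('B', 1, 90.0);
INSERT INTO payments VALUES ('A', 1, 50.0), ('B', 1, 30.0), ('B', 2, 30.0), ('B', 3, 30.0);
""")

# Join payments at item grain and see how many rows each order produces
rows = con.execute("""
SELECT i.order_id, COUNT(*) AS joined_rows, SUM(i.price) AS summed_price
FROM items i
JOIN payments p ON p.order_id = i.order_id
GROUP BY i.order_id
ORDER BY i.order_id
""").fetchall()

for order_id, joined_rows, summed_price in rows:
    print(order_id, joined_rows, summed_price)
```

Order B has a single 90.0 item, yet the join yields three rows and an item total of 270.0: item rows multiplied by payment rows.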

---

## Conclusions

- Silver tables correctly represent business truth:
  - Clean order-item grain
  - Valid multi-row payment behaviour
- There are **no data quality issues** in Silver for orders, items, or payments.
- Payment fan-out observed in Gold is **expected** given Silver payment behaviour.

---

## Design Implication for Gold

- Payments should **not** be joined into item-level or seller performance facts.
- Gold duplication issues are caused by **mixed-grain joins**, not by Silver data defects.
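
One way to respect this, sketched under the assumption that Gold needs payment totals at order grain (the source does not prescribe a specific model), is to roll each side up to one row per order before joining. The example uses an in-memory SQLite database with invented tables and values.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE items (order_id TEXT, order_item_id INTEGER, price REAL);
CREATE TABLE payments (order_id TEXT, payment_sequential INTEGER, payment_value REAL);
-- Invented sample: order B has 2 items and 3 payment rows
INSERT INTO items VALUES ('A', 1, 50.0), ('B', 1, 40.0), ('B', 2, 50.0);
INSERT INTO payments VALUES ('A', 1, 50.0), ('B', 1, 30.0), ('B', 2, 30.0), ('B', 3, 30.0);
""")

# Roll each side up to order grain first, then join one row to one row
order_level = con.execute("""
WITH item_totals AS (
  SELECT order_id, COUNT(*) AS item_rows, SUM(price) AS items_value
  FROM items
  GROUP BY order_id
),
payment_totals AS (
  SELECT order_id, COUNT(*) AS payment_rows, SUM(payment_value) AS paid_value
  FROM payments
  GROUP BY order_id
)
SELECT i.order_id, i.item_rows, p.payment_rows, i.items_value, p.paid_value
FROM item_totals i
JOIN payment_totals p ON p.order_id = i.order_id
ORDER BY i.order_id
""").fetchall()

for row in order_level:
    print(row)
```

Each order appears exactly once, so item and payment totals stay correct regardless of how many rows sit underneath.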

---

## Validation Outcome

Silver layer is validated as:
- **Clean**
- **Grain-correct**
- **Fit for downstream Gold modelling**

All corrective actions should be applied in **Gold**, not Silver.


In [4]:
%%sql
-- Silver: are there duplicates at the true order-item grain?
SELECT
  oi.order_id,
  oi.order_item_id,
  COUNT(*) AS cnt
FROM dbo.sl_order_items oi
GROUP BY oi.order_id, oi.order_item_id
HAVING COUNT(*) > 1
ORDER BY cnt DESC;


<Spark SQL result set with 0 rows and 3 fields>

In [5]:
%%sql
-- Silver: payments per order distribution
SELECT
  COUNT(*) AS total_payment_rows,
  COUNT(DISTINCT order_id) AS distinct_orders,
  ROUND(COUNT(*) * 1.0 / COUNT(DISTINCT order_id), 4) AS avg_payment_rows_per_order
FROM dbo.sl_dev_payments;

-- Silver: top orders with most payment rows
SELECT
  order_id,
  COUNT(*) AS payment_rows
FROM dbo.sl_dev_payments
GROUP BY order_id
ORDER BY payment_rows DESC
LIMIT 20;


<Spark SQL result set with 1 rows and 3 fields>

<Spark SQL result set with 20 rows and 2 fields>

In [8]:
%%sql
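-- For a specific order: count items vs payments, then expected joined rows
-- Note: order_id holds string ids (see the next cell), so the numeric literal 465
-- matches no rows, hence the empty result; the next cell retries with a real id.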
WITH
items AS (
  SELECT order_id, COUNT(*) AS item_rows
  FROM dbo.sl_order_items
  WHERE order_id = 465
  GROUP BY order_id
),
payments AS (
  SELECT order_id, COUNT(*) AS payment_rows
  FROM dbo.sl_dev_payments
  WHERE order_id = 465
  GROUP BY order_id
)
SELECT
  i.order_id,
  i.item_rows,
  p.payment_rows,
  (i.item_rows * p.payment_rows) AS expected_rows_if_joined
FROM items i
JOIN payments p
  ON i.order_id = p.order_id;


<Spark SQL result set with 0 rows and 4 fields>

In [11]:
%%sql
-- Silver items
SELECT order_id, order_item_id, COUNT(*) AS item_rows
FROM dbo.sl_order_items
WHERE order_id = '012a238ab54294a3b365812ccc82b135'
GROUP BY order_id, order_item_id;

-- Silver payments
SELECT order_id, COUNT(*) AS payment_rows
FROM dbo.sl_dev_payments
WHERE order_id = '012a238ab54294a3b365812ccc82b135'
GROUP BY order_id;


<Spark SQL result set with 3 rows and 3 fields>

<Spark SQL result set with 1 rows and 2 fields>