# Data Quality Rules — Silver Layer (Post-Transformation Validation)

This notebook validates **Silver layer tables** after transformations applied by the
`CricinfoCommentaryParser_SilverLayer` notebook.

**Bronze DQ vs Silver DQ:**
- Bronze DQ validates **raw ingested data** (nulls, format, schema drift from Auto Loader)
- Silver DQ validates **derived columns, enrichments, and business logic** after transformations

---

### Silver Tables Validated

| Silver Table | Merge Key | Transformations Validated |
|---|---|---|
| `silver.match_events` | matchid + match_ball_number + event | Runs calculation, dismissal extraction, name resolution, innings_score, wickets_lost, over/ball split |
| `silver.match_players` | matchid + player_name + team | Deduplication, batting/bowling flags |
| `silver.match_metadata` | matchid | Toss split, date parsing, timezone resolution, UTC conversion, team name resolution |

### DQ Rule Categories

| # | Category | Code Prefix | Purpose |
|---|---|---|---|
| 4.1 | Completeness | SC | Silver-derived columns must be populated |
| 4.2 | Validity | SV | Runs, extras, dismissals, overs conform to expected domains/ranges |
| 4.3 | Uniqueness | SU | Merge keys unique after dedup processing |
| 4.4 | Consistency | SI | Cross-table referential integrity + internal logic |
| 4.5 | Transformation Quality | ST | Name resolution, score accumulation, UTC conversion quality |
| 4.6 | Accuracy | SA | Cricket domain rules on derived data |
| 4.7 | Volume / Statistical | SS | Row counts, distributions, anomaly detection |
| 4.7b | Bronze→Silver Row Count | SR | Cross-layer row count validation, data loss detection |
| 4.8 | Schema | SD | Expected Silver columns present |
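
Rule names logged by this notebook follow the pattern `<prefix>-<optional table letter><number>: <slug>`, for example `SC-E01: bowler_not_null`. The sketch below shows one way to split such a name back into its parts when analysing the audit log. `parse_rule_name` is a hypothetical helper that is not part of the notebook, and the table letters (E, M, P) are inferred from the rule names used in the cells that follow.

```python
import re

# Prefix-to-category mapping copied from the table above.
PREFIX_TO_CATEGORY = {
    "SC": "Completeness",
    "SV": "Validity",
    "SU": "Uniqueness",
    "SI": "Consistency",
    "ST": "Transformation Quality",
    "SA": "Accuracy",
    "SS": "Volume / Statistical",
    "SR": "Bronze→Silver Row Count",
    "SD": "Schema",
}

# Table letters as they appear in rule names such as SC-E01, SV-M01, SU-P01 (inferred).
TABLE_LETTERS = {"E": "match_events", "M": "match_metadata", "P": "match_players"}

RULE_CODE = re.compile(
    r"^(?P<prefix>S[A-Z])-(?P<table>[EMP])?(?P<num>\d{2}):\s*(?P<slug>\w+)$"
)


def parse_rule_name(rule_name):
    """Hypothetical helper: split a rule_name like 'SC-E01: bowler_not_null'."""
    m = RULE_CODE.match(rule_name)
    if m is None or m["prefix"] not in PREFIX_TO_CATEGORY:
        raise ValueError(f"Unrecognised rule name: {rule_name!r}")
    return {
        "category": PREFIX_TO_CATEGORY[m["prefix"]],
        "table": TABLE_LETTERS.get(m["table"]),  # None for cross-table rules (e.g. SI-04)
        "number": int(m["num"]),
        "slug": m["slug"],
    }
```

Cross-table rules such as `SI-04` carry no table letter, so `table` comes back as `None` for them.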

### Tables Validated (Fully Qualified)

- `T20_catalog.silver.match_events`: ball-by-ball events with enriched runs, dismissals, name resolution, and running totals
- `T20_catalog.silver.match_metadata`: match info with parsed dates, UTC times, split toss, and resolved team names
- `T20_catalog.silver.match_players`: deduplicated player records per match

## 1. Imports & Configuration

In [0]:
# ── Job Parameters ────────────────────────────────────────────────────────────
# Default values are used during interactive runs.
# Databricks Job overrides these at runtime via the Parameters section.
# Key names here must match exactly what the Job defines.

dbutils.widgets.text(
    "catalog_name", "T20_catalog",
    "Catalog Name"
)

In [0]:
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import (
    col, count, when, isnull, lit, sum as _sum, avg, min as _min, max as _max,
    length, trim, regexp_extract, current_timestamp, countDistinct, expr,
    upper, lower, abs as _abs, lag, lead, datediff, year, coalesce
)
from pyspark.sql.types import IntegerType, LongType, DoubleType, StringType
from pyspark.sql.window import Window
from datetime import datetime, timezone

# Unity Catalog Configuration
CATALOG_NAME  = dbutils.widgets.get("catalog_name")
SCHEMA_NAME = "silver"
FULL_SCHEMA = f"{CATALOG_NAME}.{SCHEMA_NAME}"

# Silver DQ Audit table (separate from Bronze DQ)
DQ_AUDIT_TABLE = f"{FULL_SCHEMA}.dq_audit_log"

# Silver tables to validate
SILVER_EVENTS   = f"{FULL_SCHEMA}.match_events"
SILVER_METADATA = f"{FULL_SCHEMA}.match_metadata"
SILVER_PLAYERS  = f"{FULL_SCHEMA}.match_players"

# Run identifiers
run_timestamp = datetime.now(timezone.utc)
run_id = "SLV_" + run_timestamp.strftime("%Y%m%d_%H%M%S")

print(f"DQ Run ID:     {run_id}")
print(f"Run Timestamp: {run_timestamp}")

## 2. DQ Audit Table Setup & Helper Functions

In [0]:
# Create audit table if not exists
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS {DQ_AUDIT_TABLE} (
        run_id              STRING      COMMENT 'Unique identifier for this DQ run (SLV_ prefix for Silver)',
        run_timestamp       TIMESTAMP   COMMENT 'When the DQ check was executed',
        table_name          STRING      COMMENT 'Fully qualified table name being checked',
        rule_category       STRING      COMMENT 'Category: completeness, validity, uniqueness, integrity, consistency, transformation, accuracy, volume, schema',
        rule_name           STRING      COMMENT 'Unique rule identifier and descriptive name',
        rule_description    STRING      COMMENT 'Detailed description of what the rule checks',
        total_records       LONG        COMMENT 'Total records in scope',
        passed_records      LONG        COMMENT 'Records that passed the check',
        failed_records      LONG        COMMENT 'Records that failed the check',
        pass_percentage     DOUBLE      COMMENT 'Percentage of records that passed',
        status              STRING      COMMENT 'PASS, WARN, FAIL based on thresholds',
        threshold_pct       DOUBLE      COMMENT 'Minimum acceptable pass percentage',
        details             STRING      COMMENT 'Additional details or sample failures'
    )
    USING DELTA
    COMMENT 'Data quality audit log for Silver layer - IPL cricket pipeline'
""")

print(f"✓ DQ audit table ready: {DQ_AUDIT_TABLE}")

In [0]:
def log_dq_result(table_name, rule_category, rule_name, rule_description,
                  total_records, passed_records, threshold_pct=100.0, details=""):
    """Log a single DQ check result to the audit table."""
    
    failed_records = total_records - passed_records
    pass_pct = (passed_records / total_records * 100) if total_records > 0 else 0.0
    
    if pass_pct >= threshold_pct:
        status = "PASS"
    elif pass_pct >= (threshold_pct - 5):
        status = "WARN"
    else:
        status = "FAIL"
    
    icon = "✓" if status == "PASS" else ("⚠" if status == "WARN" else "✗")
    print(f"  {icon} [{status}] {rule_name}: {pass_pct:.2f}% passed ({failed_records} failures)")
    
    row = spark.createDataFrame([{
        "run_id": run_id,
        "run_timestamp": run_timestamp,
        "table_name": table_name,
        "rule_category": rule_category,
        "rule_name": rule_name,
        "rule_description": rule_description,
        "total_records": int(total_records),
        "passed_records": int(passed_records),
        "failed_records": int(failed_records),
        "pass_percentage": round(pass_pct, 2),
        "status": status,
        "threshold_pct": float(threshold_pct),  # stays DOUBLE even if an int is passed
        "details": (details or "")[:500]  # truncate long details; tolerate None
    }])
    
    row.write.mode("append").saveAsTable(DQ_AUDIT_TABLE)
    return status


def get_null_count(df, column):
    """Count nulls, blank strings, and null-like tokens ('null', 'none', 'n/a'; case-insensitive)."""
    return df.filter(
        col(column).isNull() | 
        (trim(col(column)) == "") | 
        (lower(trim(col(column))) == "null") |
        (lower(trim(col(column))) == "none") |
        (lower(trim(col(column))) == "n/a")
    ).count()
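
The status logic inside `log_dq_result` can be exercised without Spark. The sketch below restates it as a standalone function so the PASS / WARN / FAIL boundaries are easy to check. The name `classify` and the `warn_band` parameter are illustrative and not part of the notebook; the 5-point band matches the hard-coded value in `log_dq_result`.

```python
def classify(passed_records, total_records, threshold_pct=100.0, warn_band=5.0):
    """Mirror the PASS / WARN / FAIL decision made inside log_dq_result."""
    # An empty table yields 0.0 %, exactly as in log_dq_result.
    pass_pct = (passed_records / total_records * 100) if total_records > 0 else 0.0
    if pass_pct >= threshold_pct:
        return "PASS"
    if pass_pct >= threshold_pct - warn_band:
        return "WARN"
    return "FAIL"


# With a 99 % threshold the WARN band covers pass rates from 94 % up to (not including) 99 %.
print(classify(995, 1000, threshold_pct=99.0))
print(classify(960, 1000, threshold_pct=99.0))
print(classify(900, 1000, threshold_pct=99.0))
```

One consequence worth knowing: because an empty input is scored as 0 %, every rule run against an empty table is logged as FAIL for any threshold above 5 %.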

## 3. Load Silver Tables

In [0]:
df_events   = spark.table(SILVER_EVENTS)
df_metadata  = spark.table(SILVER_METADATA)
df_players   = spark.table(SILVER_PLAYERS)

events_count   = df_events.count()
metadata_count = df_metadata.count()
players_count  = df_players.count()

print("Silver Tables Loaded:")
print(f"  {SILVER_EVENTS}:   {events_count:,} rows")
print(f"  {SILVER_METADATA}: {metadata_count:,} rows")
print(f"  {SILVER_PLAYERS}:  {players_count:,} rows")

## 4. DATA QUALITY RULES

### 4.1 COMPLETENESS RULES
_Checks that Silver-layer derived columns are populated (not null, not empty).
These columns did not exist in Bronze — they were created by Silver transformations._
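
Every completeness rule in this section follows the same shape: count the missing values in one column, then report `total - missing` as the number of passed records. The Spark-free sketch below illustrates that shape on plain Python dictionaries. The sample rows are made up, `is_missing` and `completeness` are hypothetical helpers, and the null-like tokens are the ones checked by `get_null_count`.

```python
NULL_LIKE = {"", "null", "none", "n/a"}  # same tokens that get_null_count treats as missing


def is_missing(value):
    """True for None and for strings that are blank or null-like after trimming."""
    if value is None:
        return True
    return isinstance(value, str) and value.strip().lower() in NULL_LIKE


def completeness(rows, column):
    """Return (total, passed) for one column over a list of dict rows."""
    total = len(rows)
    missing = sum(1 for row in rows if is_missing(row.get(column)))
    return total, total - missing


# Made-up rows; only the column names come from the rules in this section.
sample_events = [
    {"bowler": "A Bowler", "runs": 4, "dismissal_method": "Not Out"},
    {"bowler": " N/A ", "runs": 0, "dismissal_method": "Bowled"},
    {"bowler": None, "runs": None, "dismissal_method": ""},
    {"bowler": "B Bowler", "runs": 1, "dismissal_method": "Not Out"},
]

for column in ("bowler", "runs", "dismissal_method"):
    total, passed = completeness(sample_events, column)
    print(f"{column}: {passed}/{total} populated")
```

Note that a numeric `0` counts as populated; only `None` and null-like strings are treated as missing.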

In [0]:
print("=" * 80)
print("CATEGORY: COMPLETENESS")
print("=" * 80)

# ──────────────────────────────────────────────
# SILVER MATCH_EVENTS - Completeness
# ──────────────────────────────────────────────
print(f"\n--- {SILVER_EVENTS} ---")

# SC-E01: bowler (resolved full name) must not be null
null_bowler = get_null_count(df_events, "bowler")
log_dq_result(SILVER_EVENTS, "completeness", "SC-E01: bowler_not_null",
    "Bowler full name must be populated after name resolution UDF",
    events_count, events_count - null_bowler, 99.0)

# SC-E02: batsman (resolved full name) must not be null
null_batsman = get_null_count(df_events, "batsman")
log_dq_result(SILVER_EVENTS, "completeness", "SC-E02: batsman_not_null",
    "Batsman full name must be populated after name resolution UDF",
    events_count, events_count - null_batsman, 99.0)

# SC-E03: team must not be null (resolved from innings abbreviation)
null_team = get_null_count(df_events, "team")
log_dq_result(SILVER_EVENTS, "completeness", "SC-E03: team_not_null",
    "Team name must be resolved for every delivery (including Super Over backfill)",
    events_count, events_count - null_team, 99.0)

# SC-E04: runs (calculated) must not be null
null_runs = df_events.filter(col("runs").isNull()).count()
log_dq_result(SILVER_EVENTS, "completeness", "SC-E04: runs_not_null",
    "Calculated runs (from runs_text + no-ball penalty) must be populated for every delivery",
    events_count, events_count - null_runs, 100.0)

# SC-E05: over (parsed from ball) must not be null
null_over = df_events.filter(col("over").isNull()).count()
log_dq_result(SILVER_EVENTS, "completeness", "SC-E05: over_not_null",
    "Over number (1-indexed, parsed from ball column) must be populated",
    events_count, events_count - null_over, 100.0)

# SC-E06: over_ball_number must not be null
null_ball_num = df_events.filter(col("over_ball_number").isNull()).count()
log_dq_result(SILVER_EVENTS, "completeness", "SC-E06: over_ball_number_not_null",
    "Ball number within over (parsed from ball column) must be populated",
    events_count, events_count - null_ball_num, 100.0)

# SC-E07: Extras classification must not be null
null_extras = get_null_count(df_events, "Extras")
log_dq_result(SILVER_EVENTS, "completeness", "SC-E07: extras_not_null",
    "Extras classification must be populated (normal, wide, noball, legbyes, byes)",
    events_count, events_count - null_extras, 100.0)

# SC-E08: dismissal_method must not be null
null_dismissal = get_null_count(df_events, "dismissal_method")
log_dq_result(SILVER_EVENTS, "completeness", "SC-E08: dismissal_method_not_null",
    "Dismissal method must be populated ('Not Out' when no wicket falls)",
    events_count, events_count - null_dismissal, 100.0)

# SC-E09: innings_score (running total) must not be null
null_innings_score = df_events.filter(col("innings_score").isNull()).count()
log_dq_result(SILVER_EVENTS, "completeness", "SC-E09: innings_score_not_null",
    "Running innings score (cumulative sum of runs) must be calculated for every delivery",
    events_count, events_count - null_innings_score, 100.0)

# SC-E10: wickets_lost (running total) must not be null
null_wickets = df_events.filter(col("wickets_lost").isNull()).count()
log_dq_result(SILVER_EVENTS, "completeness", "SC-E10: wickets_lost_not_null",
    "Running wickets count must be calculated for every delivery",
    events_count, events_count - null_wickets, 100.0)

# SC-E11: silver_load_timestamp must not be null
null_silver_ts = df_events.filter(col("silver_load_timestamp").isNull()).count()
log_dq_result(SILVER_EVENTS, "completeness", "SC-E11: silver_load_timestamp_not_null",
    "Silver processing timestamp must be present for lineage tracking",
    events_count, events_count - null_silver_ts, 100.0)

# ──────────────────────────────────────────────
# SILVER MATCH_METADATA - Completeness
# ──────────────────────────────────────────────
print(f"\n--- {SILVER_METADATA} ---")

# SC-M01: match_date (parsed from match_days) must not be null
null_match_date = df_metadata.filter(col("match_date").isNull()).count()
log_dq_result(SILVER_METADATA, "completeness", "SC-M01: match_date_not_null",
    "Parsed match_date (from 'd MMMM yyyy' in match_days) must be populated",
    metadata_count, metadata_count - null_match_date, 100.0)

# SC-M02: toss (split toss winner) must not be null
null_toss = get_null_count(df_metadata, "toss")
log_dq_result(SILVER_METADATA, "completeness", "SC-M02: toss_winner_not_null",
    "Toss winner team name must be populated after splitting toss column",
    metadata_count, metadata_count - null_toss, 100.0)

# SC-M03: decision (split toss decision) must not be null
null_decision = get_null_count(df_metadata, "decision")
log_dq_result(SILVER_METADATA, "completeness", "SC-M03: decision_not_null",
    "Toss decision ('elected to bat/field first') must be populated after split",
    metadata_count, metadata_count - null_decision, 95.0)

# SC-M04: match_start_utc must not be null
null_start_utc = df_metadata.filter(col("match_start_utc").isNull()).count()
log_dq_result(SILVER_METADATA, "completeness", "SC-M04: match_start_utc_not_null",
    "UTC match start time (geocoded timezone → to_utc_timestamp) must be populated",
    metadata_count, metadata_count - null_start_utc, 90.0)

# SC-M05: first_innings (resolved team name) must not be null
null_first = get_null_count(df_metadata, "first_innings")
log_dq_result(SILVER_METADATA, "completeness", "SC-M05: first_innings_not_null",
    "First innings team name must be resolved via get_full_name UDF",
    metadata_count, metadata_count - null_first, 95.0)

# SC-M06: second_innings (resolved team name) must not be null
null_second = get_null_count(df_metadata, "second_innings")
log_dq_result(SILVER_METADATA, "completeness", "SC-M06: second_innings_not_null",
    "Second innings team name must be resolved via get_full_name UDF",
    metadata_count, metadata_count - null_second, 95.0)

# SC-M07: silver_load_timestamp must not be null
null_m_silver_ts = df_metadata.filter(col("silver_load_timestamp").isNull()).count()
log_dq_result(SILVER_METADATA, "completeness", "SC-M07: silver_load_timestamp_not_null",
    "Silver processing timestamp must be present for lineage tracking",
    metadata_count, metadata_count - null_m_silver_ts, 100.0)

# ──────────────────────────────────────────────
# SILVER MATCH_PLAYERS - Completeness
# ──────────────────────────────────────────────
print(f"\n--- {SILVER_PLAYERS} ---")

# SC-P01: player_name must not be null
null_player = get_null_count(df_players, "player_name")
log_dq_result(SILVER_PLAYERS, "completeness", "SC-P01: player_name_not_null",
    "Player name must be populated in Silver after deduplication",
    players_count, players_count - null_player, 100.0)

# SC-P02: team must not be null
null_p_team = get_null_count(df_players, "team")
log_dq_result(SILVER_PLAYERS, "completeness", "SC-P02: team_not_null",
    "Team name must be populated for every player record",
    players_count, players_count - null_p_team, 100.0)

# SC-P03: innings must not be null
null_p_innings = get_null_count(df_players, "innings")
log_dq_result(SILVER_PLAYERS, "completeness", "SC-P03: innings_not_null",
    "Innings assignment must be populated for every player",
    players_count, players_count - null_p_innings, 100.0)

# SC-P04: silver_load_timestamp must not be null
null_p_silver_ts = df_players.filter(col("silver_load_timestamp").isNull()).count()
log_dq_result(SILVER_PLAYERS, "completeness", "SC-P04: silver_load_timestamp_not_null",
    "Silver processing timestamp must be present for lineage tracking",
    players_count, players_count - null_p_silver_ts, 100.0)

### 4.2 VALIDITY / FORMAT RULES
_Checks that Silver-derived values conform to expected ranges and domains._
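
The validity rules combine range checks, domain checks, and conditional scoping. The sketch below restates three of them (SV-E01, SV-E02, SV-E04) as plain Python predicates to make the null handling explicit. The function names are illustrative; the bounds and domain values are the ones used in this section.

```python
VALID_EXTRAS = {"normal", "wide", "noball", "legbyes", "byes"}


def runs_valid(runs):
    """SV-E01 analogue: a null or out-of-range value is invalid."""
    return runs is not None and 0 <= runs <= 7


def over_valid(over, is_super_over):
    """SV-E02 analogue: only non-null overs with is_super_over == False are range-checked.

    A null flag also skips the check, because a Spark comparison with null
    evaluates to null and the row then drops out of the 'invalid' filter.
    """
    if over is None or is_super_over is not False:
        return True
    return 1 <= over <= 20


def extras_valid(extras):
    """SV-E04 analogue: nulls are skipped here and left to the completeness rule SC-E07."""
    return extras is None or extras in VALID_EXTRAS


print(runs_valid(7), runs_valid(8))
print(over_valid(21, False), over_valid(21, True), over_valid(21, None))
print(extras_valid("wide"), extras_valid("Wide"))
```

The domain check is case-sensitive, as `isin` is in Spark, so a value such as `Wide` is reported as invalid.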

In [0]:
print("=" * 80)
print("CATEGORY: VALIDITY")
print("=" * 80)

# ──────────────────────────────────────────────
# SILVER MATCH_EVENTS - Validity
# ──────────────────────────────────────────────
print(f"\n--- {SILVER_EVENTS} ---")

# SV-E01: runs must be non-negative integer 0-7 (max: 6 + 1 no-ball penalty)
invalid_runs = df_events.filter(
    (col("runs").isNull()) | (col("runs") < 0) | (col("runs") > 7)
).count()
log_dq_result(SILVER_EVENTS, "validity", "SV-E01: runs_range_0_7",
    "Calculated runs per ball must be 0-7 (max: 6 boundary + 1 no-ball penalty)",
    events_count, events_count - invalid_runs, 100.0)

# SV-E02: over must be 1-20 (1-indexed) for non-Super Over deliveries
invalid_over = df_events.filter(
    (col("over").isNotNull()) &
    ((col("over") < 1) | (col("over") > 20)) &
    (col("is_super_over") == False)
).count()
log_dq_result(SILVER_EVENTS, "validity", "SV-E02: over_range_1_20",
    "Over number must be 1-20 for regular T20 innings (1-indexed from ball split)",
    events_count, events_count - invalid_over, 98.0)

# SV-E03: over_ball_number must be 1-6
invalid_ball_num = df_events.filter(
    (col("over_ball_number").isNotNull()) &
    ((col("over_ball_number") < 1) | (col("over_ball_number") > 6))
).count()
log_dq_result(SILVER_EVENTS, "validity", "SV-E03: ball_number_range_1_6",
    "Ball number within over must be 1-6 (parsed from decimal part of ball)",
    events_count, events_count - invalid_ball_num, 98.0)

# SV-E04: Extras must be a valid domain value
valid_extras = ["normal", "wide", "noball", "legbyes", "byes"]
invalid_extras = df_events.filter(
    col("Extras").isNotNull() & ~col("Extras").isin(valid_extras)
).count()
log_dq_result(SILVER_EVENTS, "validity", "SV-E04: extras_valid_domain",
    f"Extras must be one of: {', '.join(valid_extras)}",
    events_count, events_count - invalid_extras, 100.0)

# SV-E05: dismissal_method must be a valid domain value
valid_dismissals = ["Not Out", "Run Out", "Caught", "Caught & Bowled", "LBW", "Stumped", "Bowled"]
invalid_dismissal = df_events.filter(
    col("dismissal_method").isNotNull() & ~col("dismissal_method").isin(valid_dismissals)
).count()
log_dq_result(SILVER_EVENTS, "validity", "SV-E05: dismissal_method_valid_domain",
    f"Dismissal method must be one of: {', '.join(valid_dismissals)}",
    events_count, events_count - invalid_dismissal, 100.0)

# SV-E06: innings_score must be non-negative
negative_score = df_events.filter(col("innings_score") < 0).count()
log_dq_result(SILVER_EVENTS, "validity", "SV-E06: innings_score_non_negative",
    "Running innings score (cumulative sum) must be >= 0",
    events_count, events_count - negative_score, 100.0)

# SV-E07: wickets_lost must be 0-10
invalid_wickets = df_events.filter(
    (col("wickets_lost") < 0) | (col("wickets_lost") > 10)
).count()
log_dq_result(SILVER_EVENTS, "validity", "SV-E07: wickets_lost_range_0_10",
    "Wickets lost per innings must be between 0 and 10",
    events_count, events_count - invalid_wickets, 100.0)

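# SV-E08: super_over_team must be 'normal' or contain 'Super Over'
# Nulls are counted as invalid: a comparison with null evaluates to null,
# which would otherwise drop the row from this filter and let it pass silently.
invalid_so_team = df_events.filter(
    col("super_over_team").isNull() |
    ((col("super_over_team") != "normal") &
     (~col("super_over_team").contains("Super Over")))
).count()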
log_dq_result(SILVER_EVENTS, "validity", "SV-E08: super_over_team_valid",
    "super_over_team must be 'normal' or contain 'Super Over'",
    events_count, events_count - invalid_so_team, 100.0)

# SV-E09: bowler name should be alphabetic (resolved names)
invalid_bowler_fmt = df_events.filter(
    col("bowler").isNotNull() & ~col("bowler").rlike(r"^[A-Za-z\s\.\'\-]+$")
).count()
log_dq_result(SILVER_EVENTS, "validity", "SV-E09: bowler_name_format",
    "Resolved bowler name should contain only letters, spaces, dots, hyphens, apostrophes",
    events_count, events_count - invalid_bowler_fmt, 95.0)

# SV-E10: batsman name should be alphabetic (resolved names)
invalid_batsman_fmt = df_events.filter(
    col("batsman").isNotNull() & ~col("batsman").rlike(r"^[A-Za-z\s\.\'\-]+$")
).count()
log_dq_result(SILVER_EVENTS, "validity", "SV-E10: batsman_name_format",
    "Resolved batsman name should contain only letters, spaces, dots, hyphens, apostrophes",
    events_count, events_count - invalid_batsman_fmt, 95.0)

# ──────────────────────────────────────────────
# SILVER MATCH_METADATA - Validity
# ──────────────────────────────────────────────
print(f"\n--- {SILVER_METADATA} ---")

# SV-M01: match_date year must be 2008-2030
invalid_date = df_metadata.filter(
    col("match_date").isNotNull() &
    ((year(col("match_date")) < 2008) | (year(col("match_date")) > 2030))
).count()
log_dq_result(SILVER_METADATA, "validity", "SV-M01: match_date_valid_range",
    "Parsed match_date year must be between 2008 and 2030",
    metadata_count, metadata_count - invalid_date, 100.0)

# SV-M02: decision must contain 'bat' or 'field'
invalid_decision = df_metadata.filter(
    col("decision").isNotNull() &
    (~lower(col("decision")).contains("bat")) &
    (~lower(col("decision")).contains("field"))
).count()
log_dq_result(SILVER_METADATA, "validity", "SV-M02: decision_bat_or_field",
    "Toss decision (split from toss column) must contain 'bat' or 'field'",
    metadata_count, metadata_count - invalid_decision, 95.0)

# SV-M03: first_innings != second_innings (different teams)
same_teams = df_metadata.filter(
    col("first_innings").isNotNull() &
    col("second_innings").isNotNull() &
    (trim(col("first_innings")) == trim(col("second_innings")))
).count()
log_dq_result(SILVER_METADATA, "validity", "SV-M03: innings_teams_different",
    "First and second innings must have different resolved team names",
    metadata_count, metadata_count - same_teams, 100.0)

# SV-M04: match_start_utc should be within ±1 day of match_date (timezone offset)
invalid_utc = df_metadata.filter(
    col("match_start_utc").isNotNull() &
    col("match_date").isNotNull() &
    (_abs(datediff(col("match_start_utc").cast("date"), col("match_date"))) > 1)
).count()
log_dq_result(SILVER_METADATA, "validity", "SV-M04: utc_matches_match_date",
    "match_start_utc date should be within 1 day of match_date (timezone offset variance)",
    metadata_count, metadata_count - invalid_utc, 95.0)

# SV-M05: first_innings_start_utc < first_innings_end_utc (session order)
invalid_session_order = df_metadata.filter(
    col("first_innings_start_utc").isNotNull() &
    col("first_innings_end_utc").isNotNull() &
    (col("first_innings_start_utc") >= col("first_innings_end_utc"))
).count()
log_dq_result(SILVER_METADATA, "validity", "SV-M05: first_session_times_ordered",
    "First innings start UTC must be before first innings end UTC",
    metadata_count, metadata_count - invalid_session_order, 100.0)

# SV-M06: second_innings_start_utc < second_innings_end_utc
invalid_session2 = df_metadata.filter(
    col("second_innings_start_utc").isNotNull() &
    col("second_innings_end_utc").isNotNull() &
    (col("second_innings_start_utc") >= col("second_innings_end_utc"))
).count()
log_dq_result(SILVER_METADATA, "validity", "SV-M06: second_session_times_ordered",
    "Second innings start UTC must be before second innings end UTC",
    metadata_count, metadata_count - invalid_session2, 100.0)

# SV-M07: super_over_count must be non-negative
if "super_over_count" in df_metadata.columns:
    invalid_so_count = df_metadata.filter(
        col("super_over_count").isNotNull() & (col("super_over_count") < 0)
    ).count()
    log_dq_result(SILVER_METADATA, "validity", "SV-M07: super_over_count_non_negative",
        "super_over_count must be >= 0",
        metadata_count, metadata_count - invalid_so_count, 100.0)

### 4.3 UNIQUENESS RULES
_Checks that Silver merge keys are unique: no duplicates should remain after `dedup_dataframe` processing._
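
The uniqueness rules compare the total row count with the number of distinct merge keys and log the distinct count as the number of passed records. The Spark-free sketch below shows the same comparison and also lists the offending keys, which could be useful for the `details` field. `key_uniqueness` is a hypothetical helper and the sample rows are made up.

```python
from collections import Counter


def key_uniqueness(rows, key_columns):
    """Return (total_rows, distinct_keys, duplicated) for the given merge key.

    `duplicated` maps each key that occurs more than once to its row count.
    """
    counts = Counter(tuple(row[c] for c in key_columns) for row in rows)
    duplicated = {key: n for key, n in counts.items() if n > 1}
    return len(rows), len(counts), duplicated


# Made-up player rows; the key columns are the merge key of silver.match_players.
sample_players = [
    {"matchid": 1001, "player_name": "Player A", "team": "Team X"},
    {"matchid": 1001, "player_name": "Player B", "team": "Team X"},
    {"matchid": 1001, "player_name": "Player A", "team": "Team X"},  # duplicate key
    {"matchid": 1002, "player_name": "Player A", "team": "Team X"},
]

total, distinct, duplicated = key_uniqueness(
    sample_players, ("matchid", "player_name", "team")
)
print(f"total={total}, distinct={distinct}, surplus={total - distinct}")
print(duplicated)
```

With this scoring, `failed_records` (total minus distinct) counts surplus rows only: one duplicated key that appears twice contributes one failure, not two.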

In [0]:
print("=" * 80)
print("CATEGORY: UNIQUENESS")
print("=" * 80)

# ──────────────────────────────────────────────
# SILVER MATCH_EVENTS - Uniqueness
# ──────────────────────────────────────────────
print(f"\n--- {SILVER_EVENTS} ---")

# SU-E01: Merge key (matchid + match_ball_number + event) must be unique
distinct_event_keys = df_events.select("matchid", "match_ball_number", "event").distinct().count()
log_dq_result(SILVER_EVENTS, "uniqueness", "SU-E01: merge_key_unique",
    "Merge key (matchid + match_ball_number + event) must be unique in Silver after dedup",
    events_count, distinct_event_keys, 100.0,
    f"Total rows: {events_count}, Distinct keys: {distinct_event_keys}")

# SU-E02: No fully duplicate rows
full_distinct_e = df_events.distinct().count()
log_dq_result(SILVER_EVENTS, "uniqueness", "SU-E02: no_full_duplicates",
    "There should be no fully identical rows across all columns",
    events_count, full_distinct_e, 100.0)

# ──────────────────────────────────────────────
# SILVER MATCH_METADATA - Uniqueness
# ──────────────────────────────────────────────
print(f"\n--- {SILVER_METADATA} ---")

# SU-M01: matchid should be unique (one metadata record per match)
distinct_m_matchid = df_metadata.select("matchid").distinct().count()
log_dq_result(SILVER_METADATA, "uniqueness", "SU-M01: matchid_unique",
    "Each match should have exactly one Silver metadata record after dedup",
    metadata_count, distinct_m_matchid, 100.0,
    f"Total rows: {metadata_count}, Distinct matchids: {distinct_m_matchid}")

# SU-M02: No fully duplicate metadata rows
full_distinct_m = df_metadata.distinct().count()
log_dq_result(SILVER_METADATA, "uniqueness", "SU-M02: no_full_duplicates",
    "There should be no fully identical metadata rows",
    metadata_count, full_distinct_m, 100.0)

# ──────────────────────────────────────────────
# SILVER MATCH_PLAYERS - Uniqueness
# ──────────────────────────────────────────────
print(f"\n--- {SILVER_PLAYERS} ---")

# SU-P01: Merge key (matchid + player_name + team) must be unique
distinct_player_keys = df_players.select("matchid", "player_name", "team").distinct().count()
log_dq_result(SILVER_PLAYERS, "uniqueness", "SU-P01: merge_key_unique",
    "Merge key (matchid + player_name + team) must be unique in Silver after dedup",
    players_count, distinct_player_keys, 100.0,
    f"Total rows: {players_count}, Distinct keys: {distinct_player_keys}")

# SU-P02: No fully duplicate player rows
full_distinct_p = df_players.distinct().count()
log_dq_result(SILVER_PLAYERS, "uniqueness", "SU-P02: no_full_duplicates",
    "There should be no fully identical player rows after dedup",
    players_count, full_distinct_p, 100.0)

### 4.4 CONSISTENCY / CROSS-TABLE INTEGRITY RULES
_Checks referential integrity across Silver tables and internal logical consistency._
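
The cross-table rules rest on two set operations: a left anti join to find orphan keys (SI-01 to SI-03, SI-05, SI-06) and an intersection-over-union ratio to measure alignment (SI-04). The sketch below expresses both with Python sets. The matchid values are made up and `orphan_keys` is a hypothetical helper.

```python
def orphan_keys(child_keys, parent_keys):
    """Set equivalent of a left anti join: keys in child that are absent from parent."""
    return set(child_keys) - set(parent_keys)


# Made-up matchid sets for the three Silver tables.
events_ids = {1001, 1002, 1003}
metadata_ids = {1001, 1002, 1004}
players_ids = {1001, 1003, 1004}

# SI-01 analogue: event matchids with no metadata record.
events_without_metadata = orphan_keys(events_ids, metadata_ids)

# SI-04 analogue: matchids present in all three tables vs. matchids seen anywhere.
in_all_three = events_ids & metadata_ids & players_ids
seen_anywhere = events_ids | metadata_ids | players_ids
alignment_pct = len(in_all_three) / len(seen_anywhere) * 100

print(events_without_metadata)
print(f"{len(in_all_three)} of {len(seen_anywhere)} matchids aligned ({alignment_pct:.1f}%)")
```

Because SI-04 divides by the union, a single table with many extra matchids lowers the score even when the other two tables agree perfectly.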

In [0]:
print("=" * 80)
print("CATEGORY: CONSISTENCY / INTEGRITY")
print("=" * 80)

events_matchids  = df_events.select("matchid").distinct()
metadata_matchids = df_metadata.select("matchid").distinct()
players_matchids  = df_players.select("matchid").distinct()

# SI-01: Every event matchid must exist in metadata
orphan_events = events_matchids.join(metadata_matchids, "matchid", "left_anti").count()
total_event_matchids = events_matchids.count()
log_dq_result(SILVER_EVENTS, "integrity", "SI-01: events_matchid_in_metadata",
    "Every matchid in Silver events must have a corresponding Silver metadata record",
    total_event_matchids, total_event_matchids - orphan_events, 100.0,
    f"Orphan event matchids (no metadata): {orphan_events}")

# SI-02: Every event matchid must exist in players
orphan_events_p = events_matchids.join(players_matchids, "matchid", "left_anti").count()
log_dq_result(SILVER_EVENTS, "integrity", "SI-02: events_matchid_in_players",
    "Every matchid in Silver events must have Silver player records",
    total_event_matchids, total_event_matchids - orphan_events_p, 100.0,
    f"Orphan event matchids (no players): {orphan_events_p}")

# SI-03: Every metadata matchid should have events
total_meta_matchids = metadata_matchids.count()
metadata_no_events = metadata_matchids.join(events_matchids, "matchid", "left_anti").count()
log_dq_result(SILVER_METADATA, "integrity", "SI-03: metadata_has_events",
    "Every match in Silver metadata should have ball-by-ball events",
    total_meta_matchids, total_meta_matchids - metadata_no_events, 95.0,
    f"Matches with metadata but no events: {metadata_no_events}")

# SI-04: matchid sets aligned across all 3 Silver tables
all_three = events_matchids.join(metadata_matchids, "matchid", "inner") \
    .join(players_matchids, "matchid", "inner").count()
all_unique = events_matchids.union(metadata_matchids).union(players_matchids).distinct().count()
log_dq_result(SILVER_EVENTS, "integrity", "SI-04: all_tables_matchid_aligned",
    "All three Silver tables should have the same set of matchids",
    all_unique, all_three, 95.0,
    f"In all 3 tables: {all_three}, Total unique: {all_unique}")

# SI-05: Bowler names in events should exist in players table
event_bowlers = df_events.select("matchid", col("bowler").alias("player_name")).distinct()
player_names  = df_players.select("matchid", "player_name").distinct()
orphan_bowlers = event_bowlers.join(player_names, ["matchid", "player_name"], "left_anti").count()
total_bowler_entries = event_bowlers.count()
log_dq_result(SILVER_EVENTS, "integrity", "SI-05: bowlers_in_players_table",
    "Every bowler in Silver events should exist in Silver players for the same match",
    total_bowler_entries, total_bowler_entries - orphan_bowlers, 90.0,
    f"Bowler-match combos not found in players: {orphan_bowlers}")

# SI-06: Batsman names in events should exist in players table
event_batsmen = df_events.select("matchid", col("batsman").alias("player_name")).distinct()
orphan_batsmen = event_batsmen.join(player_names, ["matchid", "player_name"], "left_anti").count()
total_batsman_entries = event_batsmen.count()
log_dq_result(SILVER_EVENTS, "integrity", "SI-06: batsmen_in_players_table",
    "Every batsman in Silver events should exist in Silver players for the same match",
    total_batsman_entries, total_batsman_entries - orphan_batsmen, 90.0,
    f"Batsman-match combos not found in players: {orphan_batsmen}")

# SI-07: Dismissals must have dismissed_by populated
dismissals = df_events.filter(col("dismissal_method") != "Not Out")
dismissals_count = dismissals.count()
dismissal_no_fielder = dismissals.filter(
    col("dismissed_by").isNull() | (trim(col("dismissed_by")) == "")
).count()
if dismissals_count > 0:
    log_dq_result(SILVER_EVENTS, "consistency", "SI-07: dismissal_has_fielder",
        "Wickets (dismissal_method != Not Out) should have dismissed_by populated",
        dismissals_count, dismissals_count - dismissal_no_fielder, 90.0,
        f"Dismissals missing dismissed_by: {dismissal_no_fielder}")

# SI-08: Not Out rows should NOT have dismissed_by
not_outs = df_events.filter(col("dismissal_method") == "Not Out")
not_outs_count = not_outs.count()
fielder_on_not_out = not_outs.filter(
    col("dismissed_by").isNotNull() & (trim(col("dismissed_by")) != "")
).count()
if not_outs_count > 0:
    log_dq_result(SILVER_EVENTS, "consistency", "SI-08: not_out_no_fielder",
        "'Not Out' deliveries should not have dismissed_by populated",
        not_outs_count, not_outs_count - fielder_on_not_out, 100.0)

# SI-09: Player count per match should be 20-30
players_per_match = df_players.groupBy("matchid").agg(
    countDistinct("player_name").alias("player_count")
)
valid_player_count = players_per_match.filter(
    (col("player_count") >= 20) & (col("player_count") <= 30)
).count()
total_matches_p = players_per_match.count()
log_dq_result(SILVER_PLAYERS, "consistency", "SI-09: player_count_per_match",
    "Each match should have 20-30 unique players (11 per team + possible subs)",
    total_matches_p, valid_player_count, 90.0)

### 4.5 TRANSFORMATION QUALITY RULES
_Validates Silver-specific transformations: name resolution quality, score accumulation integrity, UTC conversion correctness._

These rules have **no Bronze equivalent** — they validate the quality of Silver transformations themselves.
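
A monotonicity check built on `lag` is only as reliable as the ordering inside each partition. The Spark-free sketch below mimics a lag comparison and shows how a tie in the ordering columns can produce a spurious decrease. It assumes, for illustration only, that a wide and its re-bowled delivery share the same `over` and `over_ball_number`, and that `match_ball_number` increases in delivery order. The rows are made up and `count_decreases` is a hypothetical helper.

```python
from itertools import groupby
from operator import itemgetter


def count_decreases(rows, partition_cols, order_cols, value_col):
    """Count rows whose value is lower than the previous row's within a partition."""
    sort_key = itemgetter(*partition_cols, *order_cols)
    part_key = itemgetter(*partition_cols)
    decreases = 0
    for _, group in groupby(sorted(rows, key=sort_key), key=part_key):
        prev = None  # plays the role of lag(value) for the first row
        for row in group:
            value = row[value_col]
            if prev is not None and value is not None and value < prev:
                decreases += 1
            prev = value
    return decreases


# Made-up deliveries, deliberately listed out of order for the two tied rows.
sample = [
    {"matchid": 1001, "team": "Team X", "over": 1, "over_ball_number": 1,
     "match_ball_number": 2, "innings_score": 5},  # re-bowled delivery
    {"matchid": 1001, "team": "Team X", "over": 1, "over_ball_number": 1,
     "match_ball_number": 1, "innings_score": 1},  # wide
    {"matchid": 1001, "team": "Team X", "over": 1, "over_ball_number": 2,
     "match_ball_number": 3, "innings_score": 5},
]

partition = ("matchid", "team")
tied = count_decreases(sample, partition, ("over", "over_ball_number"), "innings_score")
untied = count_decreases(
    sample, partition, ("over", "over_ball_number", "match_ball_number"), "innings_score"
)
print(f"decreases with ties: {tied}, with tiebreaker: {untied}")
```

Python's sort is stable, so the tied rows keep their input order here; Spark gives no such guarantee, which makes a unique tiebreaker column the safer choice for window ordering.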

In [0]:
print("=" * 80)
print("CATEGORY: TRANSFORMATION QUALITY")
print("=" * 80)

# ──────────────────────────────────────────────
# Name Resolution Quality
# ──────────────────────────────────────────────
print("\n--- Name Resolution ---")

# ST-01: Bowler names should be fully resolved (contain space → first + last name)
unresolved_bowler = df_events.filter(
    col("bowler").isNotNull() & (~col("bowler").contains(" "))
).count()
non_null_bowler = df_events.filter(col("bowler").isNotNull()).count()
log_dq_result(SILVER_EVENTS, "transformation", "ST-01: bowler_name_resolved",
    "Bowler names should be full names (contain at least one space = first + last)",
    non_null_bowler, non_null_bowler - unresolved_bowler, 95.0,
    f"Unresolved bowlers (single word, no space): {unresolved_bowler}")

# ST-02: Batsman names should be fully resolved
unresolved_batsman = df_events.filter(
    col("batsman").isNotNull() & (~col("batsman").contains(" "))
).count()
non_null_batsman = df_events.filter(col("batsman").isNotNull()).count()
log_dq_result(SILVER_EVENTS, "transformation", "ST-02: batsman_name_resolved",
    "Batsman names should be full names (contain at least one space)",
    non_null_batsman, non_null_batsman - unresolved_batsman, 95.0,
    f"Unresolved batsmen (single word, no space): {unresolved_batsman}")

# ST-03: Team names should be full names (not abbreviations ≤3 chars like SA, AFG)
short_team = df_events.filter(
    col("team").isNotNull() & (length(col("team")) <= 3)
).count()
non_null_team = df_events.filter(col("team").isNotNull()).count()
log_dq_result(SILVER_EVENTS, "transformation", "ST-03: team_name_resolved",
    "Team names should be full names (>3 chars, not abbreviations)",
    non_null_team, non_null_team - short_team, 95.0,
    f"Abbreviated teams still present (<=3 chars): {short_team}")

# ──────────────────────────────────────────────
# Score Accumulation Integrity
# ──────────────────────────────────────────────
print("\n--- Score Accumulation ---")

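# Shared window: one partition per innings, ordered by delivery position.
# Rows that tie on (over, over_ball_number) have no guaranteed order in Spark, so a
# re-bowled delivery that repeats a ball number could make lag() compare out of sequence.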
score_window = Window.partitionBy("matchid", "team", "super_over_team") \
    .orderBy("over", "over_ball_number")
df_score_check = df_events.withColumn("prev_score", lag("innings_score").over(score_window))

# ST-04: innings_score must never decrease (monotonically non-decreasing)
decreasing_score = df_score_check.filter(
    col("prev_score").isNotNull() & (col("innings_score") < col("prev_score"))
).count()
log_dq_result(SILVER_EVENTS, "transformation", "ST-04: innings_score_monotonic",
    "Running innings score must never decrease within an innings",
    events_count, events_count - decreasing_score, 100.0,
    f"Score decreases detected: {decreasing_score}")

# ST-05: wickets_lost must never decrease (monotonically non-decreasing)
df_wicket_check = df_events.withColumn("prev_wickets", lag("wickets_lost").over(score_window))
decreasing_wickets = df_wicket_check.filter(
    col("prev_wickets").isNotNull() & (col("wickets_lost") < col("prev_wickets"))
).count()
log_dq_result(SILVER_EVENTS, "transformation", "ST-05: wickets_lost_monotonic",
    "Running wickets count must never decrease within an innings",
    events_count, events_count - decreasing_wickets, 100.0,
    f"Wicket count decreases detected: {decreasing_wickets}")

# ST-06: Score increment between balls should equal runs column
df_score_diff = df_score_check.withColumn("score_diff",
    col("innings_score") - col("prev_score"))
mismatched_runs = df_score_diff.filter(
    col("prev_score").isNotNull() & (col("score_diff") != col("runs"))
).count()
has_prev = df_score_diff.filter(col("prev_score").isNotNull()).count()
log_dq_result(SILVER_EVENTS, "transformation", "ST-06: score_increment_equals_runs",
    "Score increment between consecutive balls must equal the calculated runs column",
    has_prev, has_prev - mismatched_runs, 100.0,
    f"Mismatched increments (score_diff != runs): {mismatched_runs}")

# ──────────────────────────────────────────────
# Metadata Transformation Quality
# ──────────────────────────────────────────────
print("\n--- Metadata Transformations ---")

# ST-07: Toss winner should match first_innings or second_innings team
toss_not_in_teams = df_metadata.filter(
    col("toss").isNotNull() &
    col("first_innings").isNotNull() &
    col("second_innings").isNotNull() &
    (col("toss") != col("first_innings")) &
    (col("toss") != col("second_innings"))
).count()
log_dq_result(SILVER_METADATA, "transformation", "ST-07: toss_winner_in_teams",
    "Toss winner must be either the first or second innings team",
    metadata_count, metadata_count - toss_not_in_teams, 95.0,
    f"Toss winner not matching either team: {toss_not_in_teams}")

# ST-08: Session chronological: first_innings_end_utc <= second_innings_start_utc
invalid_session_gap = df_metadata.filter(
    col("first_innings_end_utc").isNotNull() &
    col("second_innings_start_utc").isNotNull() &
    (col("first_innings_end_utc") > col("second_innings_start_utc"))
).count()
log_dq_result(SILVER_METADATA, "transformation", "ST-08: session_chronological_order",
    "First innings end UTC must be <= second innings start UTC",
    metadata_count, metadata_count - invalid_session_gap, 100.0)

### 4.6 ACCURACY / DOMAIN-SPECIFIC RULES
_Cricket-specific business logic validation on Silver-derived data._
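
Rules SA-03 to SA-06 can be read as one per-delivery predicate. The sketch below restates them in plain Python for a single ball; the function name `runs_are_plausible` is hypothetical, and the literal values (`"OUT"`, `"FOUR"`, `"SIX"`, `"noball"`) are taken from the filters used in this notebook.

```python
from typing import Optional

def runs_are_plausible(runs_text: str, extras: Optional[str], runs: int) -> bool:
    """Apply the SA-03 to SA-06 style checks to one delivery."""
    if extras == "noball" and runs < 1:
        return False               # SA-04: a no-ball carries a penalty run
    if runs_text == "OUT":
        return runs == 0           # SA-03: a dismissal ball scores nothing
    if "FOUR" in runs_text:
        return runs in (4, 5)      # SA-05: 4, or 5 with the no-ball penalty
    if "SIX" in runs_text:
        return runs in (6, 7)      # SA-06: 6, or 7 with the no-ball penalty
    return True                    # other deliveries are not covered by these rules

print(runs_are_plausible("FOUR runs", "noball", 5))
print(runs_are_plausible("SIX runs", None, 4))
```

Like the notebook rules, this accepts 5 for any `FOUR` delivery; a stricter variant could require 5 only when `extras == "noball"`.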

In [0]:
print("=" * 80)
print("CATEGORY: ACCURACY / DOMAIN-SPECIFIC")
print("=" * 80)

# SA-01: Max 10 wickets per innings
max_wickets = df_events.groupBy("matchid", "team", "super_over_team").agg(
    _max("wickets_lost").alias("max_wickets")
)
over_10_wickets = max_wickets.filter(col("max_wickets") > 10).count()
total_innings = max_wickets.count()
log_dq_result(SILVER_EVENTS, "accuracy", "SA-01: max_10_wickets_per_innings",
    "No innings should have more than 10 wickets",
    total_innings, total_innings - over_10_wickets, 100.0)

# SA-02: No bowler should bowl more than 4 overs in a T20 innings
bowler_overs = df_events.filter(
    (col("is_super_over") == False) & col("bowler").isNotNull()
).groupBy("matchid", "team", "bowler").agg(
    countDistinct("over").alias("overs_bowled")
)
over_limit = bowler_overs.filter(col("overs_bowled") > 4).count()
total_spells = bowler_overs.count()
log_dq_result(SILVER_EVENTS, "accuracy", "SA-02: bowler_max_4_overs",
    "No bowler should bowl more than 4 overs in a T20 innings",
    total_spells, total_spells - over_limit, 95.0,
    f"Bowler spells exceeding 4 overs: {over_limit}")

# SA-03: Balls with runs_text 'OUT' should have runs = 0
total_outs = df_events.filter(col("runs_text") == "OUT").count()
if total_outs > 0:
    out_with_runs = df_events.filter(
        (col("runs_text") == "OUT") & (col("runs") != 0)
    ).count()
    log_dq_result(SILVER_EVENTS, "accuracy", "SA-03: out_ball_zero_runs",
        "Balls with runs_text='OUT' should have calculated runs=0",
        total_outs, total_outs - out_with_runs, 100.0)

# SA-04: No-ball deliveries must have runs >= 1 (penalty run)
total_noballs = df_events.filter(col("Extras") == "noball").count()
if total_noballs > 0:
    noball_zero = df_events.filter(
        (col("Extras") == "noball") & (col("runs") < 1)
    ).count()
    log_dq_result(SILVER_EVENTS, "accuracy", "SA-04: noball_minimum_1_run",
        "No-ball deliveries must have at least 1 run (penalty run added in Silver)",
        total_noballs, total_noballs - noball_zero, 100.0)

# SA-05: 'FOUR runs' should calculate to 4 (or 5 with no-ball)
total_fours = df_events.filter(col("runs_text").contains("FOUR")).count()
if total_fours > 0:
    four_wrong = df_events.filter(
        col("runs_text").contains("FOUR") & ~col("runs").isin([4, 5])
    ).count()
    log_dq_result(SILVER_EVENTS, "accuracy", "SA-05: four_runs_value_check",
        "FOUR runs_text should calculate to 4 (or 5 with no-ball penalty)",
        total_fours, total_fours - four_wrong, 100.0)

# SA-06: 'SIX runs' should calculate to 6 (or 7 with no-ball)
total_sixes = df_events.filter(col("runs_text").contains("SIX")).count()
if total_sixes > 0:
    six_wrong = df_events.filter(
        col("runs_text").contains("SIX") & ~col("runs").isin([6, 7])
    ).count()
    log_dq_result(SILVER_EVENTS, "accuracy", "SA-06: six_runs_value_check",
        "SIX runs_text should calculate to 6 (or 7 with no-ball penalty)",
        total_sixes, total_sixes - six_wrong, 100.0)

# SA-07: Final innings score should be reasonable for T20 (30-300)
final_scores = df_events.filter(col("is_super_over") == False).groupBy(
    "matchid", "team"
).agg(_max("innings_score").alias("final_score"))
unreasonable = final_scores.filter(
    (col("final_score") < 30) | (col("final_score") > 300)
).count()
total_team_innings = final_scores.count()
log_dq_result(SILVER_EVENTS, "accuracy", "SA-07: final_score_reasonable",
    "Final innings score should be between 30 and 300 for T20 (excluding Super Overs)",
    total_team_innings, total_team_innings - unreasonable, 90.0)

# SA-08: Commentary text should have minimum length (not truncated)
short_commentary = df_events.filter(
    col("commentary").isNotNull() & (length(trim(col("commentary"))) < 5)
).count()
log_dq_result(SILVER_EVENTS, "accuracy", "SA-08: commentary_min_length",
    "Commentary text should be at least 5 characters (not truncated or garbage)",
    events_count, events_count - short_commentary, 75.0)

### 4.7 VOLUME / STATISTICAL RULES
_Checks row counts, distributions, and detects anomalies in Silver data._
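
Most rules in this category follow one pattern: count rows per group, then count how many groups fall inside an expected range. A minimal plain-Python sketch of that pattern for the events-per-match rule is shown below. The helper name `events_per_match_check` is an assumption, and the percentage it returns is only an illustration of how a pass rate could be derived from the two counts.

```python
from collections import Counter
from typing import Iterable, Tuple

def events_per_match_check(
    match_ids: Iterable[str], low: int = 100, high: int = 500
) -> Tuple[int, int, float]:
    """Return (total_matches, matches_in_range, pass_pct), given one id per event row."""
    counts = Counter(match_ids)      # same role as groupBy("matchid").count()
    total = len(counts)
    valid = sum(1 for n in counts.values() if low <= n <= high)
    pass_pct = valid * 100 / total if total else 0.0
    return total, valid, pass_pct

rows = ["m1"] * 240 + ["m2"] * 12    # m2 has far too few events for a T20 match
print(events_per_match_check(rows))
```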

In [0]:
print("=" * 80)
print("CATEGORY: VOLUME / STATISTICAL")
print("=" * 80)

# SS-01 to SS-03: Tables must not be empty
log_dq_result(SILVER_EVENTS, "volume", "SS-01: events_table_not_empty",
    "Silver match_events table must contain data",
    1, 1 if events_count > 0 else 0, 100.0)

log_dq_result(SILVER_METADATA, "volume", "SS-02: metadata_table_not_empty",
    "Silver match_metadata table must contain data",
    1, 1 if metadata_count > 0 else 0, 100.0)

log_dq_result(SILVER_PLAYERS, "volume", "SS-03: players_table_not_empty",
    "Silver match_players table must contain data",
    1, 1 if players_count > 0 else 0, 100.0)

# SS-04: Events per match should be 100-500 for T20
events_per_match = df_events.groupBy("matchid").count().withColumnRenamed("count", "n")
valid_vol = events_per_match.filter((col("n") >= 100) & (col("n") <= 500)).count()
total_matches = events_per_match.count()
log_dq_result(SILVER_EVENTS, "volume", "SS-04: events_per_match_reasonable",
    "Each T20 match should have 100-500 ball-by-ball events (including extras)",
    total_matches, valid_vol, 90.0)

# Stats for info
stats = events_per_match.select(
    _min("n").alias("min_rows"),
    _max("n").alias("max_rows"),
    avg("n").alias("avg_rows")
).first()
print(f"  ℹ Events per match: min={stats['min_rows']}, max={stats['max_rows']}, avg={stats['avg_rows']:.0f}")

# SS-05: Distinct bowlers per innings should be 3-10
bowlers_per_innings = df_events.filter(col("bowler").isNotNull()) \
    .groupBy("matchid", "team").agg(countDistinct("bowler").alias("bowler_count"))
valid_bowler_ct = bowlers_per_innings.filter(
    (col("bowler_count") >= 3) & (col("bowler_count") <= 10)
).count()
total_innings_b = bowlers_per_innings.count()
log_dq_result(SILVER_EVENTS, "volume", "SS-05: bowlers_per_innings_reasonable",
    "Each innings should have 3-10 distinct bowlers (T20)",
    total_innings_b, valid_bowler_ct, 90.0)

# SS-06: Distinct batsmen per innings should be 2-11
batsmen_per_innings = df_events.filter(col("batsman").isNotNull()) \
    .groupBy("matchid", "team").agg(countDistinct("batsman").alias("batsman_count"))
valid_batsman_ct = batsmen_per_innings.filter(
    (col("batsman_count") >= 2) & (col("batsman_count") <= 11)
).count()
total_innings_bt = batsmen_per_innings.count()
log_dq_result(SILVER_EVENTS, "volume", "SS-06: batsmen_per_innings_reasonable",
    "Each innings should have 2-11 distinct batsmen",
    total_innings_bt, valid_batsman_ct, 90.0)

### 4.7b BRONZE-TO-SILVER ROW COUNT VALIDATION
_Compares row counts between Bronze and Silver tables to detect data loss, unexpected row explosion,
or incomplete processing during the Bronze → Silver transformation._

**Expected behavior:**
- `match_events`: Silver ≤ Bronze (dedup removes duplicates, no new rows created)
- `match_metadata`: Silver ≤ Bronze (dedup removes duplicates)
- `match_players`: Silver ≤ Bronze (dedup removes duplicates)
- `matchid` coverage: Every Bronze matchid should be present in Silver (no data loss)
- Significant row drops (>5%) indicate possible transformation bugs or filter errors
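
The expectations above reduce to two numbers per table: the retention percentage and the drop percentage. The sketch below computes both in plain Python and flags the two failure modes (row explosion and a drop above 5%). The function name `retention_summary` and the returned dictionary keys are assumptions for illustration only.

```python
from typing import Dict, Optional, Union

def retention_summary(
    bronze_count: int, silver_count: int, max_drop_pct: float = 5.0
) -> Optional[Dict[str, Union[float, bool]]]:
    """Summarise Bronze-to-Silver retention; None when Bronze is empty."""
    if bronze_count == 0:
        return None                              # nothing to compare against
    retention_pct = silver_count * 100 / bronze_count
    drop_pct = 100 - retention_pct
    return {
        "retention_pct": retention_pct,
        "drop_pct": drop_pct,
        "row_explosion": silver_count > bronze_count,   # Silver must not exceed Bronze
        "significant_drop": drop_pct > max_drop_pct,    # more than 5% lost by default
    }

print(retention_summary(1000, 960))
print(retention_summary(1000, 900))
```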

In [0]:
print("=" * 80)
print("CATEGORY: BRONZE-TO-SILVER ROW COUNT VALIDATION")
print("=" * 80)

# ──────────────────────────────────────────────
# Load Bronze Tables for Comparison
# ──────────────────────────────────────────────
BRONZE_SCHEMA = f"{CATALOG_NAME}.bronze"
BRONZE_EVENTS   = f"{BRONZE_SCHEMA}.match_events"
BRONZE_METADATA = f"{BRONZE_SCHEMA}.match_metadata"
BRONZE_PLAYERS  = f"{BRONZE_SCHEMA}.match_players"

df_bronze_events   = spark.table(BRONZE_EVENTS)
df_bronze_metadata = spark.table(BRONZE_METADATA)
df_bronze_players  = spark.table(BRONZE_PLAYERS)

bronze_events_count   = df_bronze_events.count()
bronze_metadata_count = df_bronze_metadata.count()
bronze_players_count  = df_bronze_players.count()

print(f"\nBronze Tables Loaded for Comparison:")
print(f"  {BRONZE_EVENTS}:   {bronze_events_count:,} rows")
print(f"  {BRONZE_METADATA}: {bronze_metadata_count:,} rows")
print(f"  {BRONZE_PLAYERS}:  {bronze_players_count:,} rows")
print(f"\nSilver Tables (already loaded):")
print(f"  {SILVER_EVENTS}:   {events_count:,} rows")
print(f"  {SILVER_METADATA}: {metadata_count:,} rows")
print(f"  {SILVER_PLAYERS}:  {players_count:,} rows")

# ──────────────────────────────────────────────
# SR-01 to SR-03: Silver row count must not exceed Bronze
# (Silver applies dedup — rows should stay same or decrease)
# ──────────────────────────────────────────────
print(f"\n--- Row Count: Silver <= Bronze (no row explosion) ---")

# SR-01: match_events — Silver <= Bronze
events_no_explosion = 1 if events_count <= bronze_events_count else 0
log_dq_result(SILVER_EVENTS, "volume", "SR-01: events_silver_lte_bronze",
    "Silver match_events row count must not exceed Bronze (dedup should only remove rows)",
    1, events_no_explosion, 100.0,
    f"Bronze: {bronze_events_count:,}, Silver: {events_count:,}, "
    f"Diff (Bronze - Silver): {bronze_events_count - events_count:,} rows")

# SR-02: match_metadata — Silver <= Bronze
metadata_no_explosion = 1 if metadata_count <= bronze_metadata_count else 0
log_dq_result(SILVER_METADATA, "volume", "SR-02: metadata_silver_lte_bronze",
    "Silver match_metadata row count must not exceed Bronze (dedup should only remove rows)",
    1, metadata_no_explosion, 100.0,
    f"Bronze: {bronze_metadata_count:,}, Silver: {metadata_count:,}, "
    f"Diff (Bronze - Silver): {bronze_metadata_count - metadata_count:,} rows")

# SR-03: match_players — Silver <= Bronze
players_no_explosion = 1 if players_count <= bronze_players_count else 0
log_dq_result(SILVER_PLAYERS, "volume", "SR-03: players_silver_lte_bronze",
    "Silver match_players row count must not exceed Bronze (dedup should only remove rows)",
    1, players_no_explosion, 100.0,
    f"Bronze: {bronze_players_count:,}, Silver: {players_count:,}, "
    f"Diff (Bronze - Silver): {bronze_players_count - players_count:,} rows")

# ──────────────────────────────────────────────
# SR-04 to SR-06: No significant data loss (>5% drop = WARN)
# ──────────────────────────────────────────────
print(f"\n--- Row Count: No Significant Data Loss (>5% drop) ---")

def check_row_count_retention(silver_table, bronze_count, silver_count, rule_id, table_label):
    """Check that Silver retains at least 95% of Bronze rows (allowing for dedup)."""
    rule_name = f"{rule_id}: {table_label}_no_data_loss"
    rule_desc = f"Silver {table_label} should retain >=95% of Bronze rows (data loss check)"

    if bronze_count == 0:
        # Nothing to compare against: log an explicit failure instead of dividing by zero.
        log_dq_result(silver_table, "volume", rule_name, rule_desc,
            1, 0, 100.0, "Bronze table is empty - cannot validate retention")
        return

    retention_pct = (silver_count / bronze_count) * 100
    drop_pct = 100 - retention_pct

    # Cap the passed count at bronze_count so that a row explosion (reported separately
    # by SR-01 to SR-03) cannot push this rule's pass percentage above 100%.
    log_dq_result(silver_table, "volume", rule_name, rule_desc,
        bronze_count, min(silver_count, bronze_count), 95.0,
        f"Bronze: {bronze_count:,}, Silver: {silver_count:,}, "
        f"Retention: {retention_pct:.2f}%, Drop: {drop_pct:.2f}%")

# SR-04: match_events retention
check_row_count_retention(SILVER_EVENTS, bronze_events_count, events_count,
    "SR-04", "events")

# SR-05: match_metadata retention
check_row_count_retention(SILVER_METADATA, bronze_metadata_count, metadata_count,
    "SR-05", "metadata")

# SR-06: match_players retention
check_row_count_retention(SILVER_PLAYERS, bronze_players_count, players_count,
    "SR-06", "players")

# ──────────────────────────────────────────────
# SR-07 to SR-09: matchid Coverage (no matches lost)
# ──────────────────────────────────────────────
print(f"\n--- matchid Coverage: Bronze matchids present in Silver ---")

bronze_event_matchids    = df_bronze_events.select("matchid").distinct()
bronze_metadata_matchids = df_bronze_metadata.select("matchid").distinct()
bronze_player_matchids   = df_bronze_players.select("matchid").distinct()

silver_event_matchids    = df_events.select("matchid").distinct()
silver_metadata_matchids = df_metadata.select("matchid").distinct()
silver_player_matchids   = df_players.select("matchid").distinct()

# SR-07: Every Bronze events matchid should exist in Silver events
bronze_evt_total = bronze_event_matchids.count()
missing_evt_matchids = bronze_event_matchids.join(
    silver_event_matchids, "matchid", "left_anti"
).count()
log_dq_result(SILVER_EVENTS, "volume", "SR-07: events_matchid_coverage",
    "Every matchid in Bronze events must be present in Silver events (no matches dropped)",
    bronze_evt_total, bronze_evt_total - missing_evt_matchids, 100.0,
    f"Bronze matchids: {bronze_evt_total}, Missing in Silver: {missing_evt_matchids}")

# SR-08: Every Bronze metadata matchid should exist in Silver metadata
bronze_meta_total = bronze_metadata_matchids.count()
missing_meta_matchids = bronze_metadata_matchids.join(
    silver_metadata_matchids, "matchid", "left_anti"
).count()
log_dq_result(SILVER_METADATA, "volume", "SR-08: metadata_matchid_coverage",
    "Every matchid in Bronze metadata must be present in Silver metadata (no matches dropped)",
    bronze_meta_total, bronze_meta_total - missing_meta_matchids, 100.0,
    f"Bronze matchids: {bronze_meta_total}, Missing in Silver: {missing_meta_matchids}")

# SR-09: Every Bronze players matchid should exist in Silver players
bronze_plr_total = bronze_player_matchids.count()
missing_plr_matchids = bronze_player_matchids.join(
    silver_player_matchids, "matchid", "left_anti"
).count()
log_dq_result(SILVER_PLAYERS, "volume", "SR-09: players_matchid_coverage",
    "Every matchid in Bronze players must be present in Silver players (no matches dropped)",
    bronze_plr_total, bronze_plr_total - missing_plr_matchids, 100.0,
    f"Bronze matchids: {bronze_plr_total}, Missing in Silver: {missing_plr_matchids}")

# ──────────────────────────────────────────────
# SR-10 & SR-11: Per-match row count comparison (events)
# Detect individual matches with abnormal row loss or explosion
# ──────────────────────────────────────────────
print(f"\n--- Per-Match Row Count Comparison (Events) ---")

bronze_per_match = df_bronze_events.groupBy("matchid").count() \
    .withColumnRenamed("count", "bronze_rows")
silver_per_match = df_events.groupBy("matchid").count() \
    .withColumnRenamed("count", "silver_rows")

row_comparison = bronze_per_match.join(silver_per_match, "matchid", "left") \
    .withColumn("silver_rows", coalesce(col("silver_rows"), lit(0))) \
    .withColumn("row_diff", col("bronze_rows") - col("silver_rows")) \
    .withColumn("retention_pct",
        when(col("bronze_rows") > 0,
             (col("silver_rows") / col("bronze_rows") * 100))
        .otherwise(0.0))

# SR-10: Matches where Silver has >10% fewer rows than Bronze (abnormal loss)
abnormal_loss = row_comparison.filter(col("retention_pct") < 90).count()
total_compared = row_comparison.count()
log_dq_result(SILVER_EVENTS, "volume", "SR-10: per_match_row_retention",
    "Each match should retain >=90% of Bronze event rows in Silver (per-match check)",
    total_compared, total_compared - abnormal_loss, 95.0,
    f"Matches with >10% row loss: {abnormal_loss}")

# SR-11: Matches where Silver has MORE rows than Bronze (should never happen)
row_explosion_matches = row_comparison.filter(col("silver_rows") > col("bronze_rows")).count()
log_dq_result(SILVER_EVENTS, "volume", "SR-11: no_per_match_row_explosion",
    "No individual match should have more Silver rows than Bronze rows",
    total_compared, total_compared - row_explosion_matches, 100.0,
    f"Matches with row explosion (Silver > Bronze): {row_explosion_matches}")

# ──────────────────────────────────────────────
# Summary Statistics
# ──────────────────────────────────────────────
evt_ret = (events_count / bronze_events_count * 100) if bronze_events_count > 0 else 0
meta_ret = (metadata_count / bronze_metadata_count * 100) if bronze_metadata_count > 0 else 0
plr_ret = (players_count / bronze_players_count * 100) if bronze_players_count > 0 else 0

print(f"\n  ℹ Row Count Summary:")
print(f"    Events   — Bronze: {bronze_events_count:,}  → Silver: {events_count:,}  "
      f"(Δ {bronze_events_count - events_count:,}, {evt_ret:.2f}% retained)")
print(f"    Metadata — Bronze: {bronze_metadata_count:,}  → Silver: {metadata_count:,}  "
      f"(Δ {bronze_metadata_count - metadata_count:,}, {meta_ret:.2f}% retained)")
print(f"    Players  — Bronze: {bronze_players_count:,}  → Silver: {players_count:,}  "
      f"(Δ {bronze_players_count - players_count:,}, {plr_ret:.2f}% retained)")


### 4.8 SCHEMA VALIDATION
_Verifies that the expected Silver columns are present after transformations. Bronze schema drift
comes from Auto Loader ingestion; a Silver schema change instead points to a bug in the transformation code._
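
Each schema rule in this section is a pair of set differences: expected minus actual gives the missing columns, and actual minus expected gives the unexpected ones. A minimal sketch, using a hypothetical helper `compare_columns` and made-up column lists:

```python
from typing import Iterable, List, Tuple

def compare_columns(
    expected: Iterable[str], actual: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Return (missing, extra) column names, each sorted for stable output."""
    expected_set, actual_set = set(expected), set(actual)
    missing = sorted(expected_set - actual_set)   # required but absent
    extra = sorted(actual_set - expected_set)     # present but not expected
    return missing, extra

missing, extra = compare_columns(
    {"matchid", "runs", "innings_score"},
    ["matchid", "runs", "debug_flag"],
)
print(missing, extra)
```

Python set comparison is case-sensitive, whereas Spark resolves column names case-insensitively by default, so names such as `Batchid` and `Extras` must be spelled in the expected set exactly as the table stores them.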

In [0]:
print("=" * 80)
print("CATEGORY: SCHEMA")
print("=" * 80)

# SD-01: Silver events expected columns (derived columns from transformations)
expected_event_cols = {
    "Batchid", "matchid", "event", "ball", "over", "over_ball_number",
    "runs", "innings_score", "wickets_lost", "commentary",
    "bowler", "batsman", "team", "is_super_over", "super_over_team",
    "runs_text", "Extras", "dismissal_method", "dismissed_by",
    "source_file", "load_timestamp", "silver_load_timestamp"
}
actual_event_cols = set(df_events.columns)
missing_e = expected_event_cols - actual_event_cols
extra_e = actual_event_cols - expected_event_cols
log_dq_result(SILVER_EVENTS, "schema", "SD-01: events_expected_columns",
    "Silver events must contain all expected columns from transformation notebook",
    len(expected_event_cols), len(expected_event_cols) - len(missing_e), 100.0,
    f"Missing: {missing_e or 'None'}, Extra: {extra_e or 'None'}")

# SD-02: Silver metadata expected columns
expected_meta_cols = {
    "Batchid", "matchid", "ground", "toss", "decision", "series", "season",
    "player_of_the_match", "player_of_the_series", "t20_debut", "t20i_debut",
    "umpires", "tv_umpire", "reserve_umpire", "match_referee",
    "points", "player_replacements", "match_number",
    "match_date", "match_start_utc",
    "first_innings_start_utc", "first_innings_end_utc",
    "second_innings_start_utc", "second_innings_end_utc",
    "first_innings", "second_innings",
    "has_super_over", "super_over_count", "series_result",
    "source_file", "load_timestamp", "silver_load_timestamp"
}
actual_meta_cols = set(df_metadata.columns)
missing_m = expected_meta_cols - actual_meta_cols
extra_m = actual_meta_cols - expected_meta_cols
log_dq_result(SILVER_METADATA, "schema", "SD-02: metadata_expected_columns",
    "Silver metadata must contain all expected columns from transformation notebook",
    len(expected_meta_cols), len(expected_meta_cols) - len(missing_m), 100.0,
    f"Missing: {missing_m or 'None'}, Extra: {extra_m or 'None'}")

# SD-03: Silver players expected columns
expected_player_cols = {
    "Batchid", "matchid", "innings", "team", "player_name",
    "batted", "batting_position", "player_type", "retired",
    "not_out", "bowled", "source_file", "load_timestamp",
    "silver_load_timestamp"
}
actual_player_cols = set(df_players.columns)
missing_p = expected_player_cols - actual_player_cols
extra_p = actual_player_cols - expected_player_cols
log_dq_result(SILVER_PLAYERS, "schema", "SD-03: players_expected_columns",
    "Silver players must contain all expected columns from transformation notebook",
    len(expected_player_cols), len(expected_player_cols) - len(missing_p), 100.0,
    f"Missing: {missing_p or 'None'}, Extra: {extra_p or 'None'}")

## 5. DQ Summary Report

In [0]:
print("\n" + "=" * 80)
print(f"DATA QUALITY SUMMARY — Silver Layer — Run ID: {run_id}")
print("=" * 80)

# Overall status breakdown
summary_df = spark.sql(f"""
    SELECT
        status,
        COUNT(*) AS rule_count,
        ROUND(AVG(pass_percentage), 2) AS avg_pass_pct
    FROM {DQ_AUDIT_TABLE}
    WHERE run_id = '{run_id}'
    GROUP BY status
    ORDER BY status
""")
display(summary_df)

# Detailed failures
print("\n--- FAILED RULES ---")
failures_df = spark.sql(f"""
    SELECT
        table_name, rule_category, rule_name,
        pass_percentage, failed_records, details
    FROM {DQ_AUDIT_TABLE}
    WHERE run_id = '{run_id}' AND status = 'FAIL'
    ORDER BY pass_percentage ASC
""")
display(failures_df)

# Warnings
print("\n--- WARNINGS ---")
warnings_df = spark.sql(f"""
    SELECT
        table_name, rule_category, rule_name,
        pass_percentage, failed_records, details
    FROM {DQ_AUDIT_TABLE}
    WHERE run_id = '{run_id}' AND status = 'WARN'
    ORDER BY pass_percentage ASC
""")
display(warnings_df)

## 6. Historical DQ Trend (Optional Query)

In [0]:
# Compare Silver DQ pass rates over time
trend_df = spark.sql(f"""
    SELECT
        run_id,
        run_timestamp,
        COUNT(*) AS total_rules,
        SUM(CASE WHEN status = 'PASS' THEN 1 ELSE 0 END) AS passed,
        SUM(CASE WHEN status = 'WARN' THEN 1 ELSE 0 END) AS warnings,
        SUM(CASE WHEN status = 'FAIL' THEN 1 ELSE 0 END) AS failed,
        ROUND(AVG(pass_percentage), 2) AS avg_pass_pct
    FROM {DQ_AUDIT_TABLE}
    WHERE substring(run_id, 1, 4) = 'SLV_'  -- literal prefix match; in LIKE, '_' is a single-character wildcard
    GROUP BY run_id, run_timestamp
    ORDER BY run_timestamp DESC
    LIMIT 20
""")
display(trend_df)

## 7. Pipeline Gate (Optional — Fail pipeline if critical rules fail)

In [0]:
# Check for critical failures and optionally halt downstream processing
critical_failures = spark.sql(f"""
    SELECT COUNT(*) AS cnt
    FROM {DQ_AUDIT_TABLE}
    WHERE run_id = '{run_id}'
      AND status = 'FAIL'
      AND threshold_pct = 100.0
""").first()["cnt"]

if critical_failures > 0:
    msg = f"⛔ PIPELINE GATE: {critical_failures} critical Silver DQ rule(s) failed! Review before proceeding to Gold layer."
    print(msg)
    # Uncomment to actually halt the pipeline:
    # raise Exception(msg)
else:
    print("✅ PIPELINE GATE: All critical Silver DQ rules passed. Safe to proceed to Gold layer.")