# Data Quality Rules — Gold Layer (Post-Aggregation Validation)

This notebook validates the **Gold layer tables** produced by the aggregations in the
`CricketCommentaryParser_GoldLayer` notebook.

**Silver DQ vs Gold DQ:**
- Silver DQ validates **derived columns, enrichments, and business logic** after transformations
- Gold DQ validates **aggregation correctness, cross-table consistency, and analytics readiness**

---

### Gold Tables Validated

| Gold Table | Grain | Key Validations |
|---|---|---|
| `gold.batting_stats_per_match` | batsman × match | Runs, balls, strike rate, run distribution sums, legbye/bye exclusion |
| `gold.batting_stats_per_series` | batsman × series | Aggregation correctness vs match-level, milestone counts |
| `gold.batting_stats_overall` | batsman (career) | Aggregation correctness vs series-level, career totals |
| `gold.bowling_stats_per_match` | bowler × match | Overs, runs conceded (excl legbyes/byes), maiden logic, wickets |
| `gold.bowling_stats_per_series` | bowler × series | Aggregation correctness vs match-level, 5W haul counts |
| `gold.bowling_stats_overall` | bowler (career) | Aggregation correctness vs series-level, career totals |
| `gold.match_summary` | match | Winner extraction, team resolution, completeness |
| `gold.player_team_scd2` | player × team stint | SCD2 integrity — no overlaps, active flag correctness |

### DQ Rule Categories

| # | Category | Code Prefix | Purpose |
|---|---|---|---|
| 1 | Completeness | GC | Gold columns must be populated |
| 2 | Validity | GV | Ranges, domains, non-negative checks |
| 3 | Uniqueness | GU | Grain-level uniqueness (no duplicate aggregations) |
| 4 | Consistency | GI | Cross-table aggregation integrity (match→series→overall) |
| 5 | Silver-to-Gold Row Count | GR | Row count validation across layers |
| 6 | Aggregation Accuracy | GA | Run distribution sums, legbye/bye exclusion, maiden logic |
| 7 | SCD2 Integrity | GS | SCD Type 2 structural rules |
| 8 | Volume / Statistical | GV | Table not empty, reasonable row counts |
| 9 | Schema | GD | Expected Gold columns present |

## 1. Imports & Configuration

In [0]:
# ── Job Parameters ────────────────────────────────────────────────────────────
# Default values are used during interactive runs.
# Databricks Job overrides these at runtime via the Parameters section.
# Key names here must match exactly what the Job defines.

dbutils.widgets.text(
    "catalog_name", "T20_catalog",
    "Catalog Name"
)

In [0]:
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import (
    col, count, when, isnull, lit, sum as _sum, avg, min as _min, max as _max,
    length, trim, regexp_extract, current_timestamp, countDistinct, expr,
    upper, lower, abs as _abs, coalesce, floor, round as _round
)
from pyspark.sql.types import IntegerType, LongType, DoubleType, StringType
from pyspark.sql.window import Window
from datetime import datetime, timezone

# ─── Unity Catalog Configuration ───
CATALOG_NAME  = dbutils.widgets.get("catalog_name")

# Silver (source for cross-layer checks)
SILVER_SCHEMA    = f"{CATALOG_NAME}.silver"
SILVER_EVENTS    = f"{SILVER_SCHEMA}.match_events"
SILVER_METADATA  = f"{SILVER_SCHEMA}.match_metadata"
SILVER_PLAYERS   = f"{SILVER_SCHEMA}.match_players"

# Gold (tables to validate)
GOLD_SCHEMA = f"{CATALOG_NAME}.gold"
GOLD_BAT_MATCH      = f"{GOLD_SCHEMA}.batting_stats_per_match"
GOLD_BAT_SERIES     = f"{GOLD_SCHEMA}.batting_stats_per_series"
GOLD_BAT_OVERALL    = f"{GOLD_SCHEMA}.batting_stats_overall"
GOLD_BOWL_MATCH     = f"{GOLD_SCHEMA}.bowling_stats_per_match"
GOLD_BOWL_SERIES    = f"{GOLD_SCHEMA}.bowling_stats_per_series"
GOLD_BOWL_OVERALL   = f"{GOLD_SCHEMA}.bowling_stats_overall"
GOLD_MATCH_SUMMARY  = f"{GOLD_SCHEMA}.match_summary"
GOLD_PLAYER_TEAM    = f"{GOLD_SCHEMA}.player_team_scd2"

# DQ Audit table for Gold layer
DQ_AUDIT_TABLE = f"{GOLD_SCHEMA}.dq_audit_log"

# Run identifiers
run_timestamp = datetime.now(timezone.utc)
run_id = "GLD_" + run_timestamp.strftime("%Y%m%d_%H%M%S")

print(f"DQ Run ID:     {run_id}")
print(f"Run Timestamp: {run_timestamp}")

## 2. DQ Audit Table Setup & Helper Functions

In [0]:
# Create audit table if not exists
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS {DQ_AUDIT_TABLE} (
        run_id              STRING      COMMENT 'Unique identifier for this DQ run (GLD_ prefix for Gold)',
        run_timestamp       TIMESTAMP   COMMENT 'When the DQ check was executed',
        table_name          STRING      COMMENT 'Fully qualified table name being checked',
        rule_category       STRING      COMMENT 'Category: completeness, validity, uniqueness, consistency, row_count, aggregation, scd2, volume, schema',
        rule_name           STRING      COMMENT 'Unique rule identifier and descriptive name',
        rule_description    STRING      COMMENT 'Detailed description of what the rule checks',
        total_records       LONG        COMMENT 'Total records in scope',
        passed_records      LONG        COMMENT 'Records that passed the check',
        failed_records      LONG        COMMENT 'Records that failed the check',
        pass_percentage     DOUBLE      COMMENT 'Percentage of records that passed',
        status              STRING      COMMENT 'PASS, WARN, FAIL based on thresholds',
        threshold_pct       DOUBLE      COMMENT 'Minimum acceptable pass percentage',
        details             STRING      COMMENT 'Additional details or sample failures'
    )
    USING DELTA
    COMMENT 'Data quality audit log for Gold layer - IPL cricket pipeline'
""")

print(f"✓ DQ audit table ready: {DQ_AUDIT_TABLE}")

In [0]:
def log_dq_result(table_name, rule_category, rule_name, rule_description,
                  total_records, passed_records, threshold_pct=100.0, details=""):
    """Log a single DQ check result to the audit table."""

    failed_records = total_records - passed_records
    pass_pct = (passed_records / total_records * 100) if total_records > 0 else 0.0

    if pass_pct >= threshold_pct:
        status = "PASS"
    elif pass_pct >= (threshold_pct - 5):
        status = "WARN"
    else:
        status = "FAIL"

    icon = "✓" if status == "PASS" else ("⚠" if status == "WARN" else "✗")
    print(f"  {icon} [{status}] {rule_name}: {pass_pct:.2f}% passed ({failed_records} failures)")

    row = spark.createDataFrame([{
        "run_id": run_id,
        "run_timestamp": run_timestamp,
        "table_name": table_name,
        "rule_category": rule_category,
        "rule_name": rule_name,
        "rule_description": rule_description,
        "total_records": int(total_records),
        "passed_records": int(passed_records),
        "failed_records": int(failed_records),
        "pass_percentage": round(pass_pct, 2),
        "status": status,
        "threshold_pct": threshold_pct,
        "details": details[:500]
    }])

    row.write.mode("append").saveAsTable(DQ_AUDIT_TABLE)
    return status


def get_null_count(df, column):
    """Count nulls, empty strings, and placeholder strings ('null', 'none', 'n/a'), case-insensitively."""
    return df.filter(
        col(column).isNull() |
        (trim(col(column)) == "") |
        (lower(trim(col(column))) == "null") |
        (lower(trim(col(column))) == "none") |
        (lower(trim(col(column))) == "n/a")
    ).count()
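
The PASS/WARN/FAIL banding inside `log_dq_result` can be exercised without a Spark session. The sketch below isolates that logic in a hypothetical helper, `classify_status` (not part of this notebook), assuming the same fixed 5-point WARN band and the same 0% score for an empty scope.

```python
def classify_status(total_records: int, passed_records: int,
                    threshold_pct: float = 100.0) -> str:
    """Hypothetical helper mirroring the status banding in log_dq_result."""
    # An empty scope is scored as 0%, matching the notebook's behaviour.
    pass_pct = (passed_records / total_records * 100) if total_records > 0 else 0.0
    if pass_pct >= threshold_pct:
        return "PASS"
    if pass_pct >= threshold_pct - 5:
        return "WARN"
    return "FAIL"


print(classify_status(200, 200))   # every row passes
print(classify_status(200, 192))   # 96% against a 100% threshold
print(classify_status(200, 180))   # 90% against a 100% threshold
print(classify_status(0, 0))       # empty table
```

Under this banding an empty table fails any rule whose threshold is above 5%, because 0% falls below the WARN band.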

## 3. Load Gold & Silver Tables

In [0]:
# ─── Load Gold tables ───
df_bat_match     = spark.table(GOLD_BAT_MATCH)
df_bat_series    = spark.table(GOLD_BAT_SERIES)
df_bat_overall   = spark.table(GOLD_BAT_OVERALL)
df_bowl_match    = spark.table(GOLD_BOWL_MATCH)
df_bowl_series   = spark.table(GOLD_BOWL_SERIES)
df_bowl_overall  = spark.table(GOLD_BOWL_OVERALL)
df_match_summary = spark.table(GOLD_MATCH_SUMMARY)
df_scd2          = spark.table(GOLD_PLAYER_TEAM)

# Row counts
bat_match_ct     = df_bat_match.count()
bat_series_ct    = df_bat_series.count()
bat_overall_ct   = df_bat_overall.count()
bowl_match_ct    = df_bowl_match.count()
bowl_series_ct   = df_bowl_series.count()
bowl_overall_ct  = df_bowl_overall.count()
match_summ_ct    = df_match_summary.count()
scd2_ct          = df_scd2.count()

print("Gold Tables Loaded:")
print(f"  {GOLD_BAT_MATCH}:      {bat_match_ct:,} rows")
print(f"  {GOLD_BAT_SERIES}:     {bat_series_ct:,} rows")
print(f"  {GOLD_BAT_OVERALL}:    {bat_overall_ct:,} rows")
print(f"  {GOLD_BOWL_MATCH}:     {bowl_match_ct:,} rows")
print(f"  {GOLD_BOWL_SERIES}:    {bowl_series_ct:,} rows")
print(f"  {GOLD_BOWL_OVERALL}:   {bowl_overall_ct:,} rows")
print(f"  {GOLD_MATCH_SUMMARY}:  {match_summ_ct:,} rows")
print(f"  {GOLD_PLAYER_TEAM}:    {scd2_ct:,} rows")

# ─── Load Silver tables for cross-layer checks ───
df_silver_events   = spark.table(SILVER_EVENTS)
df_silver_metadata = spark.table(SILVER_METADATA)
df_silver_players  = spark.table(SILVER_PLAYERS)

silver_events_ct   = df_silver_events.count()
silver_metadata_ct = df_silver_metadata.count()
silver_players_ct  = df_silver_players.count()

print("\nSilver Tables (for cross-layer checks):")
print(f"  {SILVER_EVENTS}:   {silver_events_ct:,} rows")
print(f"  {SILVER_METADATA}: {silver_metadata_ct:,} rows")
print(f"  {SILVER_PLAYERS}:  {silver_players_ct:,} rows")

## 4. DATA QUALITY RULES

### 4.1 COMPLETENESS RULES
_Checks that key Gold columns are populated (not null, not empty)._
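
As a minimal sketch of what "not populated" means for string columns, the plain-Python predicate below mirrors the conditions in `get_null_count`. The name `is_missing` and the sample values are illustrative only. One assumption to note: `str.strip` also removes tabs and newlines, whereas Spark's `trim` removes spaces only.

```python
from typing import Optional

# Placeholder strings that get_null_count treats as missing (compared case-insensitively).
PLACEHOLDERS = {"", "null", "none", "n/a"}


def is_missing(value: Optional[str]) -> bool:
    """Hypothetical stand-in for the per-row test applied by get_null_count."""
    return value is None or value.strip().lower() in PLACEHOLDERS


# Made-up sample column: two real names and five values that should count as missing.
sample_values = ["Player A", None, "  ", "NULL", "n/a", "None", "Player B"]
missing = sum(is_missing(v) for v in sample_values)
print(f"{missing} of {len(sample_values)} values are missing")
```

Numeric and date columns skip the placeholder comparison and are tested with `isNull()` alone, which is why the loops in this section branch on the column name.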

In [0]:
print("=" * 80)
print("CATEGORY: COMPLETENESS")
print("=" * 80)

# ── Batting Stats Per Match ──
print(f"\n--- {GOLD_BAT_MATCH} ---")

for rule_id, col_name, desc in [
    ("GC-BM01", "batsman",     "Batsman name must be populated"),
    ("GC-BM02", "team",        "Team name must be populated"),
    ("GC-BM03", "matchid",     "Match ID must be populated"),
    ("GC-BM04", "series",      "Series name must be populated"),
]:
    nulls = get_null_count(df_bat_match, col_name)
    log_dq_result(GOLD_BAT_MATCH, "completeness", f"{rule_id}: {col_name}_not_null",
        desc, bat_match_ct, bat_match_ct - nulls, 100.0)

# Numeric columns should not be null
for rule_id, col_name, desc in [
    ("GC-BM05", "balls_faced",  "Balls faced must not be null"),
    ("GC-BM06", "total_runs",   "Total runs must not be null"),
    ("GC-BM07", "strike_rate",  "Strike rate must not be null"),
]:
    nulls = df_bat_match.filter(col(col_name).isNull()).count()
    log_dq_result(GOLD_BAT_MATCH, "completeness", f"{rule_id}: {col_name}_not_null",
        desc, bat_match_ct, bat_match_ct - nulls, 100.0)

# ── Bowling Stats Per Match ──
print(f"\n--- {GOLD_BOWL_MATCH} ---")

for rule_id, col_name, desc in [
    ("GC-WM01", "bowler",            "Bowler name must be populated"),
    ("GC-WM02", "matchid",           "Match ID must be populated"),
    ("GC-WM03", "runs_conceded",     "Runs conceded must not be null"),
    ("GC-WM04", "legal_balls_bowled", "Legal balls bowled must not be null"),
    ("GC-WM05", "wickets_taken",     "Wickets taken must not be null"),
]:
    if col_name in ["bowler", "matchid"]:
        nulls = get_null_count(df_bowl_match, col_name)
    else:
        nulls = df_bowl_match.filter(col(col_name).isNull()).count()
    log_dq_result(GOLD_BOWL_MATCH, "completeness", f"{rule_id}: {col_name}_not_null",
        desc, bowl_match_ct, bowl_match_ct - nulls, 100.0)

# ── Match Summary ──
print(f"\n--- {GOLD_MATCH_SUMMARY} ---")

for rule_id, col_name, desc in [
    ("GC-MS01", "matchid",       "Match ID must be populated"),
    ("GC-MS02", "series",        "Series must be populated"),
    ("GC-MS03", "team_1",        "Team 1 must be populated"),
    ("GC-MS04", "team_2",        "Team 2 must be populated"),
    ("GC-MS05", "venue",         "Venue must be populated"),
    ("GC-MS06", "match_date",    "Match date must be populated"),
    ("GC-MS07", "toss_winner",   "Toss winner must be populated"),
    ("GC-MS08", "toss_decision", "Toss decision must be populated"),
    ("GC-MS09", "result_text",   "Result text must be populated"),
]:
    if col_name == "match_date":
        nulls = df_match_summary.filter(col(col_name).isNull()).count()
    else:
        nulls = get_null_count(df_match_summary, col_name)
    log_dq_result(GOLD_MATCH_SUMMARY, "completeness", f"{rule_id}: {col_name}_not_null",
        desc, match_summ_ct, match_summ_ct - nulls, 95.0)

# ── SCD2 ──
print(f"\n--- {GOLD_PLAYER_TEAM} ---")

for rule_id, col_name, desc in [
    ("GC-SC01", "player_name", "Player name must be populated"),
    ("GC-SC02", "team",        "Team must be populated"),
    ("GC-SC03", "start_date",  "Start date must be populated"),
]:
    if col_name == "start_date":
        nulls = df_scd2.filter(col(col_name).isNull()).count()
    else:
        nulls = get_null_count(df_scd2, col_name)
    log_dq_result(GOLD_PLAYER_TEAM, "completeness", f"{rule_id}: {col_name}_not_null",
        desc, scd2_ct, scd2_ct - nulls, 100.0)

### 4.2 VALIDITY / RANGE RULES
_Checks that Gold values conform to expected ranges and domains._
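
The bowling rules in this category reduce to integer comparisons on a single row. The sketch below restates four of them in plain Python, reusing the rule IDs and column names from this notebook; the function `bowling_violations` and both sample rows are hypothetical.

```python
def bowling_violations(row: dict) -> list:
    """Return the IDs of the validity rules a bowling row breaks (illustrative subset)."""
    broken = []
    complete_overs = row["legal_balls_bowled"] // 6   # same as floor(balls / 6)
    if not 0 <= row["wickets_taken"] <= 10:
        broken.append("GV-WM02")
    if row["legal_balls_bowled"] > 24:                # more than 4 overs in a T20
        broken.append("GV-WM03")
    if row["maiden_overs"] > complete_overs:
        broken.append("GV-WM04")
    if row["dot_balls"] > row["legal_balls_bowled"]:
        broken.append("GV-WM05")
    return broken


# Made-up rows: the first is consistent, the second has too many maidens and dot balls.
good = {"legal_balls_bowled": 24, "wickets_taken": 3, "maiden_overs": 1, "dot_balls": 11}
bad = {"legal_balls_bowled": 20, "wickets_taken": 2, "maiden_overs": 4, "dot_balls": 21}
print(bowling_violations(good))
print(bowling_violations(bad))
```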

In [0]:
print("=" * 80)
print("CATEGORY: VALIDITY")
print("=" * 80)

# ── Batting Stats Per Match ──
print(f"\n--- {GOLD_BAT_MATCH} ---")

# GV-BM01: balls_faced must be non-negative
invalid = df_bat_match.filter(col("balls_faced") < 0).count()
log_dq_result(GOLD_BAT_MATCH, "validity", "GV-BM01: balls_faced_non_negative",
    "Balls faced must be >= 0", bat_match_ct, bat_match_ct - invalid, 100.0)

# GV-BM02: total_runs must be non-negative
invalid = df_bat_match.filter(col("total_runs") < 0).count()
log_dq_result(GOLD_BAT_MATCH, "validity", "GV-BM02: total_runs_non_negative",
    "Total runs scored must be >= 0", bat_match_ct, bat_match_ct - invalid, 100.0)

# GV-BM03: strike_rate must be non-negative
invalid = df_bat_match.filter(col("strike_rate") < 0).count()
log_dq_result(GOLD_BAT_MATCH, "validity", "GV-BM03: strike_rate_non_negative",
    "Strike rate must be >= 0", bat_match_ct, bat_match_ct - invalid, 100.0)

# GV-BM04: T20 individual score should be 0-250 (reasonable range)
invalid = df_bat_match.filter(
    (col("total_runs") < 0) | (col("total_runs") > 250)
).count()
log_dq_result(GOLD_BAT_MATCH, "validity", "GV-BM04: runs_reasonable_range",
    "Individual T20 score should be between 0 and 250",
    bat_match_ct, bat_match_ct - invalid, 100.0)

# GV-BM05: Run distribution values must be non-negative
run_dist_cols = ["dot_balls", "runs_1s", "runs_2s", "runs_3s", "runs_4s", "runs_5s", "runs_6s"]
# OR together one "< 0" condition per column so the list above is the single source of truth
any_negative = col(run_dist_cols[0]) < 0
for c in run_dist_cols[1:]:
    any_negative = any_negative | (col(c) < 0)
invalid = df_bat_match.filter(any_negative).count()
log_dq_result(GOLD_BAT_MATCH, "validity", "GV-BM05: run_distribution_non_negative",
    "All run distribution columns must be >= 0",
    bat_match_ct, bat_match_ct - invalid, 100.0)

# ── Bowling Stats Per Match ──
print(f"\n--- {GOLD_BOWL_MATCH} ---")

# GV-WM01: runs_conceded must be non-negative
invalid = df_bowl_match.filter(col("runs_conceded") < 0).count()
log_dq_result(GOLD_BOWL_MATCH, "validity", "GV-WM01: runs_conceded_non_negative",
    "Runs conceded must be >= 0", bowl_match_ct, bowl_match_ct - invalid, 100.0)

# GV-WM02: wickets_taken must be 0-10
invalid = df_bowl_match.filter(
    (col("wickets_taken") < 0) | (col("wickets_taken") > 10)
).count()
log_dq_result(GOLD_BOWL_MATCH, "validity", "GV-WM02: wickets_range_0_10",
    "Wickets taken per match must be 0-10",
    bowl_match_ct, bowl_match_ct - invalid, 100.0)

# GV-WM03: overs_bowled must be 0-4 for T20 (max 4 overs per bowler)
# Using legal_balls_bowled <= 24 (4 overs × 6 balls)
invalid = df_bowl_match.filter(col("legal_balls_bowled") > 24).count()
log_dq_result(GOLD_BOWL_MATCH, "validity", "GV-WM03: max_4_overs_bowled",
    "Bowler should bowl at most 4 overs (24 legal balls) in a T20 match",
    bowl_match_ct, bowl_match_ct - invalid, 95.0)

# GV-WM04: maiden_overs must be <= overs_bowled
invalid = df_bowl_match.filter(
    col("maiden_overs") > floor(col("legal_balls_bowled") / 6)
).count()
log_dq_result(GOLD_BOWL_MATCH, "validity", "GV-WM04: maidens_lte_overs",
    "Maiden overs cannot exceed total complete overs bowled",
    bowl_match_ct, bowl_match_ct - invalid, 100.0)

# GV-WM05: dot_balls must be <= legal_balls_bowled
invalid = df_bowl_match.filter(col("dot_balls") > col("legal_balls_bowled")).count()
log_dq_result(GOLD_BOWL_MATCH, "validity", "GV-WM05: dots_lte_legal_balls",
    "Dot balls cannot exceed legal balls bowled",
    bowl_match_ct, bowl_match_ct - invalid, 100.0)

# GV-WM06: economy_rate should be non-negative
invalid = df_bowl_match.filter(col("economy_rate") < 0).count()
log_dq_result(GOLD_BOWL_MATCH, "validity", "GV-WM06: economy_non_negative",
    "Economy rate must be >= 0",
    bowl_match_ct, bowl_match_ct - invalid, 100.0)

# ── Match Summary ──
print(f"\n--- {GOLD_MATCH_SUMMARY} ---")

# GV-MS01: team_1 != team_2
same_teams = df_match_summary.filter(
    col("team_1").isNotNull() & col("team_2").isNotNull() &
    (trim(col("team_1")) == trim(col("team_2")))
).count()
log_dq_result(GOLD_MATCH_SUMMARY, "validity", "GV-MS01: teams_different",
    "Team 1 and Team 2 must be different teams",
    match_summ_ct, match_summ_ct - same_teams, 100.0)

# GV-MS02: result_type must be a valid domain
valid_results = ["by runs", "by wickets", "super over", "no result", "tie", "other"]
invalid = df_match_summary.filter(
    col("result_type").isNotNull() & ~col("result_type").isin(valid_results)
).count()
log_dq_result(GOLD_MATCH_SUMMARY, "validity", "GV-MS02: result_type_valid",
    f"Result type must be one of: {', '.join(valid_results)}",
    match_summ_ct, match_summ_ct - invalid, 100.0)

# GV-MS03: toss_winner should be team_1 or team_2
invalid = df_match_summary.filter(
    col("toss_winner").isNotNull() &
    (col("toss_winner") != col("team_1")) &
    (col("toss_winner") != col("team_2"))
).count()
log_dq_result(GOLD_MATCH_SUMMARY, "validity", "GV-MS03: toss_winner_is_a_team",
    "Toss winner must be either Team 1 or Team 2",
    match_summ_ct, match_summ_ct - invalid, 95.0)

### 4.3 UNIQUENESS RULES
_Checks that Gold tables have no duplicate rows at the expected grain._
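
The uniqueness rules compare the total row count with the count of distinct grain keys, which reveals how many surplus rows exist but not which keys are duplicated. A plain-Python sketch of the same idea, on made-up `(matchid, batsman, team)` keys, also lists the offenders:

```python
from collections import Counter

# Made-up grain keys for batting_stats_per_match: (matchid, batsman, team).
rows = [
    ("M001", "Player A", "Team X"),
    ("M001", "Player B", "Team X"),
    ("M001", "Player A", "Team X"),   # duplicate grain key
    ("M002", "Player A", "Team X"),
]

key_counts = Counter(rows)
distinct_keys = len(key_counts)                      # what the GU rules log as passed_records
surplus_rows = len(rows) - distinct_keys             # what they log as failed_records
duplicates = {key: n for key, n in key_counts.items() if n > 1}

print(f"total={len(rows)}, distinct={distinct_keys}, surplus={surplus_rows}")
print(duplicates)
```

In Spark the equivalent follow-up would be a `groupBy` on the grain columns filtered to `count > 1`; that step is not part of the notebook and is mentioned here only as a possible diagnostic.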

In [0]:
print("=" * 80)
print("CATEGORY: UNIQUENESS")
print("=" * 80)

# GU-01: batting_stats_per_match — unique on (matchid, batsman, team)
distinct_bat_match = df_bat_match.select("matchid", "batsman", "team").distinct().count()
log_dq_result(GOLD_BAT_MATCH, "uniqueness", "GU-01: bat_match_grain_unique",
    "Batting per match must be unique on (matchid, batsman, team)",
    bat_match_ct, distinct_bat_match, 100.0,
    f"Total: {bat_match_ct}, Distinct keys: {distinct_bat_match}")

# GU-02: batting_stats_per_series — unique on (series, season, batsman)
distinct_bat_series = df_bat_series.select("series", "season", "batsman").distinct().count()
log_dq_result(GOLD_BAT_SERIES, "uniqueness", "GU-02: bat_series_grain_unique",
    "Batting per series must be unique on (series, season, batsman)",
    bat_series_ct, distinct_bat_series, 100.0)

# GU-03: batting_stats_overall — unique on (batsman)
distinct_bat_overall = df_bat_overall.select("batsman").distinct().count()
log_dq_result(GOLD_BAT_OVERALL, "uniqueness", "GU-03: bat_overall_grain_unique",
    "Batting overall must be unique on (batsman)",
    bat_overall_ct, distinct_bat_overall, 100.0)

# GU-04: bowling_stats_per_match — unique on (matchid, bowler)
distinct_bowl_match = df_bowl_match.select("matchid", "bowler").distinct().count()
log_dq_result(GOLD_BOWL_MATCH, "uniqueness", "GU-04: bowl_match_grain_unique",
    "Bowling per match must be unique on (matchid, bowler)",
    bowl_match_ct, distinct_bowl_match, 100.0)

# GU-05: bowling_stats_per_series — unique on (series, season, bowler)
distinct_bowl_series = df_bowl_series.select("series", "season", "bowler").distinct().count()
log_dq_result(GOLD_BOWL_SERIES, "uniqueness", "GU-05: bowl_series_grain_unique",
    "Bowling per series must be unique on (series, season, bowler)",
    bowl_series_ct, distinct_bowl_series, 100.0)

# GU-06: bowling_stats_overall — unique on (bowler)
distinct_bowl_overall = df_bowl_overall.select("bowler").distinct().count()
log_dq_result(GOLD_BOWL_OVERALL, "uniqueness", "GU-06: bowl_overall_grain_unique",
    "Bowling overall must be unique on (bowler)",
    bowl_overall_ct, distinct_bowl_overall, 100.0)

# GU-07: match_summary — unique on (matchid)
distinct_match = df_match_summary.select("matchid").distinct().count()
log_dq_result(GOLD_MATCH_SUMMARY, "uniqueness", "GU-07: match_summary_grain_unique",
    "Match summary must be unique on (matchid)",
    match_summ_ct, distinct_match, 100.0)

# GU-08: player_team_scd2 — unique on (player_name, team, start_date)
distinct_scd2 = df_scd2.select("player_name", "team", "start_date").distinct().count()
log_dq_result(GOLD_PLAYER_TEAM, "uniqueness", "GU-08: scd2_grain_unique",
    "SCD2 must be unique on (player_name, team, start_date)",
    scd2_ct, distinct_scd2, 100.0)

### 4.4 CROSS-TABLE AGGREGATION CONSISTENCY
_Validates that each series-level aggregate equals the sum of its match-level rows, and that each overall aggregate equals the sum of its series-level rows._
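
The re-aggregate-and-compare pattern used throughout this section can be sketched in plain Python. All rows below are made up, and the variable names are illustrative. The sketch also surfaces keys that exist on only one side, which the inner joins in these checks exclude from the comparison.

```python
from collections import defaultdict

# Made-up match-level rows: (series, season, batsman, total_runs).
match_rows = [
    ("Series 1", "2023", "Player A", 61),
    ("Series 1", "2023", "Player A", 21),
    ("Series 1", "2023", "Player B", 104),
]
# Made-up series-level totals keyed by (series, season, batsman).
series_rows = {
    ("Series 1", "2023", "Player A"): 82,
    ("Series 1", "2023", "Player B"): 100,   # deliberately wrong
    ("Series 1", "2023", "Player C"): 40,    # has no match-level rows
}

# Re-aggregate match level up to the series grain.
expected = defaultdict(int)
for series, season, batsman, runs in match_rows:
    expected[(series, season, batsman)] += runs

compared = expected.keys() & series_rows.keys()          # inner-join semantics
mismatched = sorted(k for k in compared if expected[k] != series_rows[k])
unmatched = sorted(expected.keys() ^ series_rows.keys()) # keys an inner join drops

print("mismatched:", mismatched)
print("unmatched: ", unmatched)
```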

In [0]:
print("=" * 80)
print("CATEGORY: CROSS-TABLE AGGREGATION CONSISTENCY")
print("=" * 80)

# ══════════════════════════════════════════════
# BATTING: match → series consistency
# ══════════════════════════════════════════════
print(f"\n--- Batting: match → series ---")

# Reaggregate match-level to series-level and compare
bat_match_reagg = df_bat_match.groupBy("series", "season", "batsman").agg(
    _sum("total_runs").alias("expected_runs"),
    _sum("balls_faced").alias("expected_balls")
)

bat_series_check = df_bat_series.select(
    "series", "season", "batsman",
    col("total_runs").alias("actual_runs"),
    col("total_balls_faced").alias("actual_balls")
)

bat_ms_compare = bat_match_reagg.join(
    bat_series_check, ["series", "season", "batsman"], "inner"
)

# GI-01: Batting runs — series total must equal sum of match totals
runs_mismatch = bat_ms_compare.filter(
    col("expected_runs") != col("actual_runs")
).count()
total_comparisons = bat_ms_compare.count()
log_dq_result(GOLD_BAT_SERIES, "consistency", "GI-01: bat_runs_match_to_series",
    "Series batting runs must equal sum of match-level batting runs per batsman",
    total_comparisons, total_comparisons - runs_mismatch, 100.0,
    f"Mismatched batsman-series combos: {runs_mismatch}")

# GI-02: Batting balls — series total must equal sum of match totals
balls_mismatch = bat_ms_compare.filter(
    col("expected_balls") != col("actual_balls")
).count()
log_dq_result(GOLD_BAT_SERIES, "consistency", "GI-02: bat_balls_match_to_series",
    "Series balls faced must equal sum of match-level balls faced per batsman",
    total_comparisons, total_comparisons - balls_mismatch, 100.0)

# ══════════════════════════════════════════════
# BATTING: series → overall consistency
# ══════════════════════════════════════════════
print(f"\n--- Batting: series → overall ---")

bat_series_reagg = df_bat_series.groupBy("batsman").agg(
    _sum("total_runs").alias("expected_runs"),
    _sum("total_balls_faced").alias("expected_balls"),
    _sum("scores_gte_50").alias("expected_50s")
)

bat_overall_check = df_bat_overall.select(
    "batsman",
    col("total_runs").alias("actual_runs"),
    col("total_balls_faced").alias("actual_balls"),
    col("scores_gte_50").alias("actual_50s")
)

bat_so_compare = bat_series_reagg.join(bat_overall_check, "batsman", "inner")
total_so = bat_so_compare.count()

# GI-03: Overall runs must equal sum of series runs
mismatch = bat_so_compare.filter(col("expected_runs") != col("actual_runs")).count()
log_dq_result(GOLD_BAT_OVERALL, "consistency", "GI-03: bat_runs_series_to_overall",
    "Overall batting runs must equal sum of series-level runs per batsman",
    total_so, total_so - mismatch, 100.0)

# GI-04: Overall 50s must equal sum of series 50s
mismatch = bat_so_compare.filter(col("expected_50s") != col("actual_50s")).count()
log_dq_result(GOLD_BAT_OVERALL, "consistency", "GI-04: bat_50s_series_to_overall",
    "Overall 50+ scores must equal sum of series-level 50+ counts",
    total_so, total_so - mismatch, 100.0)

# ══════════════════════════════════════════════
# BOWLING: match → series consistency
# ══════════════════════════════════════════════
print(f"\n--- Bowling: match → series ---")

bowl_match_reagg = df_bowl_match.groupBy("series", "season", "bowler").agg(
    _sum("runs_conceded").alias("expected_runs"),
    _sum("legal_balls_bowled").alias("expected_balls"),
    _sum("wickets_taken").alias("expected_wickets")
)

bowl_series_check = df_bowl_series.select(
    "series", "season", "bowler",
    col("total_runs_conceded").alias("actual_runs"),
    col("total_legal_balls").alias("actual_balls"),
    col("total_wickets").alias("actual_wickets")
)

bowl_ms_compare = bowl_match_reagg.join(
    bowl_series_check, ["series", "season", "bowler"], "inner"
)
total_bowl_ms = bowl_ms_compare.count()

# GI-05: Bowling runs — series must equal sum of match totals
mismatch = bowl_ms_compare.filter(col("expected_runs") != col("actual_runs")).count()
log_dq_result(GOLD_BOWL_SERIES, "consistency", "GI-05: bowl_runs_match_to_series",
    "Series bowling runs conceded must equal sum of match-level runs per bowler",
    total_bowl_ms, total_bowl_ms - mismatch, 100.0)

# GI-06: Bowling wickets — series must equal sum of match totals
mismatch = bowl_ms_compare.filter(col("expected_wickets") != col("actual_wickets")).count()
log_dq_result(GOLD_BOWL_SERIES, "consistency", "GI-06: bowl_wickets_match_to_series",
    "Series bowling wickets must equal sum of match-level wickets per bowler",
    total_bowl_ms, total_bowl_ms - mismatch, 100.0)

# ══════════════════════════════════════════════
# BOWLING: series → overall consistency
# ══════════════════════════════════════════════
print(f"\n--- Bowling: series → overall ---")

bowl_series_reagg = df_bowl_series.groupBy("bowler").agg(
    _sum("total_runs_conceded").alias("expected_runs"),
    _sum("total_wickets").alias("expected_wickets"),
    _sum("five_wicket_hauls").alias("expected_5w")
)

bowl_overall_check = df_bowl_overall.select(
    "bowler",
    col("total_runs_conceded").alias("actual_runs"),
    col("total_wickets").alias("actual_wickets"),
    col("five_wicket_hauls").alias("actual_5w")
)

bowl_so_compare = bowl_series_reagg.join(bowl_overall_check, "bowler", "inner")
total_bowl_so = bowl_so_compare.count()

# GI-07: Overall bowling runs must equal sum of series runs
mismatch = bowl_so_compare.filter(col("expected_runs") != col("actual_runs")).count()
log_dq_result(GOLD_BOWL_OVERALL, "consistency", "GI-07: bowl_runs_series_to_overall",
    "Overall bowling runs conceded must equal sum of series-level runs per bowler",
    total_bowl_so, total_bowl_so - mismatch, 100.0)

# GI-08: Overall 5W hauls must equal sum of series 5W hauls
mismatch = bowl_so_compare.filter(col("expected_5w") != col("actual_5w")).count()
log_dq_result(GOLD_BOWL_OVERALL, "consistency", "GI-08: bowl_5w_series_to_overall",
    "Overall 5-wicket hauls must equal sum of series-level 5W counts",
    total_bowl_so, total_bowl_so - mismatch, 100.0)

# ══════════════════════════════════════════════
# MATCH SUMMARY: matchid alignment across Gold tables
# ══════════════════════════════════════════════
print(f"\n--- matchid Alignment ---")

summary_matchids = df_match_summary.select("matchid").distinct()
bat_matchids     = df_bat_match.select("matchid").distinct()
bowl_matchids    = df_bowl_match.select("matchid").distinct()

# GI-09: Every batting match should be in match_summary
orphan_bat = bat_matchids.join(summary_matchids, "matchid", "left_anti").count()
total_bat_matches = bat_matchids.count()
log_dq_result(GOLD_BAT_MATCH, "consistency", "GI-09: bat_matchid_in_summary",
    "Every matchid in batting stats must exist in match_summary",
    total_bat_matches, total_bat_matches - orphan_bat, 100.0)

# GI-10: Every bowling match should be in match_summary
orphan_bowl = bowl_matchids.join(summary_matchids, "matchid", "left_anti").count()
total_bowl_matches = bowl_matchids.count()
log_dq_result(GOLD_BOWL_MATCH, "consistency", "GI-10: bowl_matchid_in_summary",
    "Every matchid in bowling stats must exist in match_summary",
    total_bowl_matches, total_bowl_matches - orphan_bowl, 100.0)

### 4.5 SILVER-TO-GOLD ROW COUNT VALIDATION
_Compares row counts between Silver and Gold to detect data loss or unexpected row changes._
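
The coverage rules in this category are anti-joins: they keep the Silver `matchid` values that have no partner in Gold. With made-up IDs, the same check in plain Python is a set difference:

```python
# Made-up match IDs; in the notebook these come from distinct matchid selections.
silver_matchids = {"M001", "M002", "M003", "M004"}
gold_batting_matchids = {"M001", "M002", "M004"}

# Equivalent of a left_anti join of Silver against Gold on matchid.
missing_in_gold = sorted(silver_matchids - gold_batting_matchids)
covered = len(silver_matchids) - len(missing_in_gold)
coverage_pct = covered / len(silver_matchids) * 100

print(f"missing in Gold: {missing_in_gold}, coverage: {coverage_pct:.2f}%")
```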

In [0]:
print("=" * 80)
print("CATEGORY: SILVER-TO-GOLD ROW COUNT VALIDATION")
print("=" * 80)

# ── GR-01: match_summary count must equal Silver metadata count ──
# (1:1 mapping — one summary row per match)
match_eq = 1 if match_summ_ct == silver_metadata_ct else 0
log_dq_result(GOLD_MATCH_SUMMARY, "row_count", "GR-01: summary_matches_metadata",
    "Gold match_summary row count should equal Silver metadata row count (1:1 mapping)",
    1, match_eq, 100.0,
    f"Silver metadata: {silver_metadata_ct:,}, Gold summary: {match_summ_ct:,}")

# ── GR-02: matchid coverage — every Silver matchid present in Gold batting ──
silver_event_matchids = df_silver_events.select("matchid").distinct()
gold_bat_matchids = df_bat_match.select("matchid").distinct()
missing_bat = silver_event_matchids.join(gold_bat_matchids, "matchid", "left_anti").count()
total_silver_matches = silver_event_matchids.count()
log_dq_result(GOLD_BAT_MATCH, "row_count", "GR-02: bat_matchid_coverage",
    "Every Silver events matchid should have batting stats in Gold",
    total_silver_matches, total_silver_matches - missing_bat, 100.0,
    f"Silver matchids: {total_silver_matches}, Missing in Gold batting: {missing_bat}")

# ── GR-03: matchid coverage — every Silver matchid present in Gold bowling ──
gold_bowl_matchids = df_bowl_match.select("matchid").distinct()
missing_bowl = silver_event_matchids.join(gold_bowl_matchids, "matchid", "left_anti").count()
log_dq_result(GOLD_BOWL_MATCH, "row_count", "GR-03: bowl_matchid_coverage",
    "Every Silver events matchid should have bowling stats in Gold",
    total_silver_matches, total_silver_matches - missing_bowl, 100.0,
    f"Silver matchids: {total_silver_matches}, Missing in Gold bowling: {missing_bowl}")

# ── GR-04: Distinct batsmen in Gold overall <= distinct batsmen in Silver events ──
silver_distinct_batsmen = df_silver_events.select("batsman").distinct().count()
gold_distinct_batsmen = df_bat_overall.select("batsman").distinct().count()
batsmen_ok = 1 if gold_distinct_batsmen <= silver_distinct_batsmen else 0
log_dq_result(GOLD_BAT_OVERALL, "row_count", "GR-04: batsmen_count_lte_silver",
    "Gold distinct batsmen should not exceed Silver distinct batsmen",
    1, batsmen_ok, 100.0,
    f"Silver batsmen: {silver_distinct_batsmen:,}, Gold batsmen: {gold_distinct_batsmen:,}")

# ── GR-05: Distinct bowlers in Gold overall <= distinct bowlers in Silver events ──
silver_distinct_bowlers = df_silver_events.select("bowler").distinct().count()
gold_distinct_bowlers = df_bowl_overall.select("bowler").distinct().count()
bowlers_ok = 1 if gold_distinct_bowlers <= silver_distinct_bowlers else 0
log_dq_result(GOLD_BOWL_OVERALL, "row_count", "GR-05: bowlers_count_lte_silver",
    "Gold distinct bowlers should not exceed Silver distinct bowlers",
    1, bowlers_ok, 100.0,
    f"Silver bowlers: {silver_distinct_bowlers:,}, Gold bowlers: {gold_distinct_bowlers:,}")

# ── GR-06: Batting aggregation roll-up: overall rows <= series rows <= match rows ──
fanout_bat = 1 if (bat_overall_ct <= bat_series_ct <= bat_match_ct) else 0
log_dq_result(GOLD_BAT_OVERALL, "row_count", "GR-06: bat_aggregation_fanout",
    "Batting row counts must not increase as the grain coarsens: match >= series >= overall",
    1, fanout_bat, 100.0,
    f"Match: {bat_match_ct:,} → Series: {bat_series_ct:,} → Overall: {bat_overall_ct:,}")

# ── GR-07: Bowling aggregation roll-up: overall rows <= series rows <= match rows ──
fanout_bowl = 1 if (bowl_overall_ct <= bowl_series_ct <= bowl_match_ct) else 0
log_dq_result(GOLD_BOWL_OVERALL, "row_count", "GR-07: bowl_aggregation_fanout",
    "Bowling row counts must not increase as the grain coarsens: match >= series >= overall",
    1, fanout_bowl, 100.0,
    f"Match: {bowl_match_ct:,} → Series: {bowl_series_ct:,} → Overall: {bowl_overall_ct:,}")

# Summary
print("\n  ℹ Row Count Summary:")
print(f"    Batting  — Match: {bat_match_ct:,} → Series: {bat_series_ct:,} → Overall: {bat_overall_ct:,}")
print(f"    Bowling  — Match: {bowl_match_ct:,} → Series: {bowl_series_ct:,} → Overall: {bowl_overall_ct:,}")
print(f"    Match Summary: {match_summ_ct:,} (Silver metadata: {silver_metadata_ct:,})")
print(f"    SCD2: {scd2_ct:,} records")

### 4.6 AGGREGATION ACCURACY RULES
_Validates cricket-specific aggregation logic: run distribution sums, legbye/bye exclusion, strike rate and economy rate formulas, and milestone hierarchy._
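
As a worked illustration of the batting identities checked in the cell below (GA-01 to GA-03), the following plain-Python sketch builds one batsman's run distribution from a list of per-ball runs and evaluates the same sums. The helper `summarize_innings` and the sample innings are hypothetical; the notebook itself runs these checks in Spark against the Gold batting table. The sketch assumes wides are already excluded and that any ball worth 6 or more lands in the 6s bucket.

```python
from collections import Counter

def summarize_innings(runs_per_ball):
    """Build a run-distribution row from the runs scored off each ball faced.

    Hypothetical helper for illustration only. Assumes wides are already
    excluded and that a ball worth 6 or more is counted in the 6s bucket.
    """
    buckets = Counter(min(r, 6) for r in runs_per_ball)
    balls_faced = len(runs_per_ball)
    total_runs = sum(runs_per_ball)
    strike_rate = round(total_runs / balls_faced * 100, 2) if balls_faced else None
    return {
        "dot_balls": buckets[0],
        **{f"runs_{n}s": buckets[n] for n in range(1, 7)},
        "balls_faced": balls_faced,
        "total_runs": total_runs,
        "strike_rate": strike_rate,
    }

# Sample innings of ten balls
row = summarize_innings([0, 1, 4, 0, 6, 2, 1, 0, 4, 1])

# GA-01: weighted sum of the distribution, compared with total_runs
dist_sum = sum(n * row[f"runs_{n}s"] for n in range(1, 7))

# GA-02: number of distribution events, compared with balls_faced
dist_ball_count = row["dot_balls"] + sum(row[f"runs_{n}s"] for n in range(1, 7))

print(row["total_runs"], dist_sum)          # 19 19
print(row["balls_faced"], dist_ball_count)  # 10 10
print(row["strike_rate"])                   # 190.0
```

If one ball had produced 7 runs, `dist_sum` would fall 1 short of `total_runs`, which is why GA-01 tolerates a gap of up to `runs_6s`.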

In [0]:
print("=" * 80)
print("CATEGORY: AGGREGATION ACCURACY")
print("=" * 80)

# ══════════════════════════════════════════════
# BATTING — Run distribution & Strike Rate
# ══════════════════════════════════════════════
print("\n--- Batting Aggregation Accuracy ---")

# GA-01: Run distribution must sum correctly
# total_runs should = (1s×1) + (2s×2) + (3s×3) + (4s×4) + (5s×5) + (6s×6)
# Note: dot_balls contribute 0 runs, so they don't factor into the sum
# However, 6s column includes runs >= 6 (could be 7 with noball), so this is approximate
df_bat_dist_check = df_bat_match.withColumn(
    "dist_sum",
    (col("runs_1s") * 1) + (col("runs_2s") * 2) + (col("runs_3s") * 3) +
    (col("runs_4s") * 4) + (col("runs_5s") * 5) + (col("runs_6s") * 6)
)

# Allow small discrepancy for no-ball + boundary (6s column captures >=6, could be 7)
dist_mismatch = df_bat_dist_check.filter(
    _abs(col("total_runs") - col("dist_sum")) > col("runs_6s")  # allow up to 1 extra per 6+ ball
).count()
log_dq_result(GOLD_BAT_MATCH, "aggregation", "GA-01: bat_run_distribution_sum",
    "Run distribution (1s-6s) weighted sum should approximately equal total_runs",
    bat_match_ct, bat_match_ct - dist_mismatch, 95.0,
    f"Mismatched: {dist_mismatch} (allowing tolerance for noball+boundary)")

# GA-02: balls_faced must equal the sum of run distribution counts (dot+1s+2s+3s+4s+5s+6s)
# Balls faced = all deliveries except wides, so count of all distribution events
df_bat_ball_check = df_bat_match.withColumn(
    "dist_ball_count",
    col("dot_balls") + col("runs_1s") + col("runs_2s") + col("runs_3s") +
    col("runs_4s") + col("runs_5s") + col("runs_6s")
)
ball_mismatch = df_bat_ball_check.filter(
    col("balls_faced") != col("dist_ball_count")
).count()
log_dq_result(GOLD_BAT_MATCH, "aggregation", "GA-02: bat_balls_equal_distribution_count",
    "Balls faced must equal sum of all run distribution counts (dot+1s+2s+3s+4s+5s+6s)",
    bat_match_ct, bat_match_ct - ball_mismatch, 100.0,
    f"Mismatched: {ball_mismatch}")

# GA-03: Strike rate formula: (total_runs / balls_faced) * 100
sr_mismatch = df_bat_match.filter(
    (col("balls_faced") > 0) &
    (_abs(col("strike_rate") - (col("total_runs") / col("balls_faced") * 100)) > 0.1)
).count()
has_balls = df_bat_match.filter(col("balls_faced") > 0).count()
log_dq_result(GOLD_BAT_MATCH, "aggregation", "GA-03: bat_strike_rate_formula",
    "Strike rate must equal (total_runs / balls_faced) × 100",
    has_balls, has_balls - sr_mismatch, 100.0)

# GA-04: Legbye/bye exclusion — verify by cross-checking with Silver
# Sum of Gold batting runs per match should be <= Silver total runs per match
# (Silver event runs include legbyes/byes; Gold batting runs exclude them)
print("\n--- Legbye/Bye Exclusion Verification ---")

gold_bat_per_match = df_bat_match.groupBy("matchid").agg(
    _sum("total_runs").alias("gold_bat_total")
)

# Silver total runs per match (includes legbyes/byes in team score)
silver_total_per_match = df_silver_events.groupBy("matchid").agg(
    _sum("runs").alias("silver_total")
)

legbye_check = gold_bat_per_match.join(silver_total_per_match, "matchid", "inner")

# Gold batting total should be <= Silver total (legbyes/byes excluded from batting)
bat_exceeds_total = legbye_check.filter(
    col("gold_bat_total") > col("silver_total")
).count()
total_matches_check = legbye_check.count()
log_dq_result(GOLD_BAT_MATCH, "aggregation", "GA-04: bat_runs_lte_total_score",
    "Gold batting runs (excl legbyes/byes) must be <= Silver total runs per match",
    total_matches_check, total_matches_check - bat_exceeds_total, 100.0,
    f"Matches where Gold batting > Silver total: {bat_exceeds_total}")

# GA-05: Bowler runs conceded per match should also be <= Silver total
gold_bowl_per_match = df_bowl_match.groupBy("matchid").agg(
    _sum("runs_conceded").alias("gold_bowl_total")
)

bowl_legbye_check = gold_bowl_per_match.join(silver_total_per_match, "matchid", "inner")
bowl_exceeds_total = bowl_legbye_check.filter(
    col("gold_bowl_total") > col("silver_total")
).count()
total_bowl_check = bowl_legbye_check.count()
log_dq_result(GOLD_BOWL_MATCH, "aggregation", "GA-05: bowl_runs_lte_total_score",
    "Gold bowling runs conceded (excl legbyes/byes) must be <= Silver total runs per match",
    total_bowl_check, total_bowl_check - bowl_exceeds_total, 100.0,
    f"Matches where Gold bowling > Silver total: {bowl_exceeds_total}")

# ══════════════════════════════════════════════
# BOWLING — Aggregation checks
# ══════════════════════════════════════════════
print("\n--- Bowling Aggregation Accuracy ---")

# GA-06: Economy rate formula: runs_conceded / (legal_balls / 6)
econ_mismatch = df_bowl_match.filter(
    (col("legal_balls_bowled") > 0) &
    (_abs(col("economy_rate") - (col("runs_conceded") / (col("legal_balls_bowled") / 6))) > 0.1)
).count()
has_legal = df_bowl_match.filter(col("legal_balls_bowled") > 0).count()
log_dq_result(GOLD_BOWL_MATCH, "aggregation", "GA-06: bowl_economy_formula",
    "Economy rate must equal runs_conceded / (legal_balls / 6)",
    has_legal, has_legal - econ_mismatch, 100.0)

# GA-07: Bowling run distribution counts (dot+1s+...+6s) should sum to legal_balls_bowled
df_bowl_dist = df_bowl_match.withColumn(
    "bowl_dist_count",
    col("dot_balls") + col("runs_1s_conceded") + col("runs_2s_conceded") +
    col("runs_3s_conceded") + col("runs_4s_conceded") +
    col("runs_5s_conceded") + col("runs_6s_conceded")
)
bowl_dist_mismatch = df_bowl_dist.filter(
    col("bowl_dist_count") != col("legal_balls_bowled")
).count()
log_dq_result(GOLD_BOWL_MATCH, "aggregation", "GA-07: bowl_dist_equals_legal_balls",
    "Bowling run distribution counts should sum to legal balls bowled",
    bowl_match_ct, bowl_match_ct - bowl_dist_mismatch, 95.0,
    f"Mismatched: {bowl_dist_mismatch}")

# GA-08: Batting milestone consistency: scores_gte_50 must be <= scores_gte_30
milestone_bad = df_bat_series.filter(
    col("scores_gte_50") > col("scores_gte_30")
).count()
log_dq_result(GOLD_BAT_SERIES, "aggregation", "GA-08: milestone_hierarchy",
    "scores_gte_50 must be <= scores_gte_30 (50 is a subset of 30)",
    bat_series_ct, bat_series_ct - milestone_bad, 100.0)

# GA-09: scores_gte_100 must be <= scores_gte_50
milestone_bad2 = df_bat_series.filter(
    col("scores_gte_100") > col("scores_gte_50")
).count()
log_dq_result(GOLD_BAT_SERIES, "aggregation", "GA-09: milestone_hierarchy_100",
    "scores_gte_100 must be <= scores_gte_50 (100 is a subset of 50)",
    bat_series_ct, bat_series_ct - milestone_bad2, 100.0)

### 4.7 SCD TYPE 2 INTEGRITY RULES
_Validates the structural integrity of the SCD Type 2 player-team relationship table._
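
The overlap rule in the cell below (GS-06) compares each stint's `end_date` with the next stint's `start_date` for the same player. A minimal plain-Python sketch of that comparison follows, assuming stints are dicts keyed by the column names of this table. `find_overlaps` and the sample rows are hypothetical and are not part of the notebook.

```python
from datetime import date
from itertools import groupby
from operator import itemgetter

def find_overlaps(stints):
    """Return (player, end_date, next_start) for each stint overlapping the next.

    Mirrors a lead() over a window partitioned by player_name and ordered by
    start_date: only consecutive stints are compared, and open-ended stints
    (end_date is None) are skipped.
    """
    overlaps = []
    ordered = sorted(stints, key=itemgetter("player_name", "start_date"))
    for player, rows in groupby(ordered, key=itemgetter("player_name")):
        rows = list(rows)
        for current, nxt in zip(rows, rows[1:]):
            if current["end_date"] is not None and current["end_date"] > nxt["start_date"]:
                overlaps.append((player, current["end_date"], nxt["start_date"]))
    return overlaps

# Hypothetical sample: Player A's first stint ends after the second one starts
stints = [
    {"player_name": "Player A", "team": "Team X",
     "start_date": date(2019, 3, 1), "end_date": date(2021, 5, 30)},
    {"player_name": "Player A", "team": "Team Y",
     "start_date": date(2021, 4, 1), "end_date": None},
    {"player_name": "Player B", "team": "Team X",
     "start_date": date(2020, 9, 1), "end_date": None},
]

print(find_overlaps(stints))
```

Here only Player A's first stint is reported. Because only neighbouring stints are compared, a stint that overlaps a non-adjacent later stint would need a separate self-join style check.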

In [0]:
print("=" * 80)
print("CATEGORY: SCD TYPE 2 INTEGRITY")
print("=" * 80)

# GS-01: No player may have more than one active record (players with none are caught by GS-02)
active_per_player = df_scd2.filter(col("is_active") == True).groupBy("player_name").agg(
    count("*").alias("active_count")
)
multi_active = active_per_player.filter(col("active_count") > 1).count()
total_players = active_per_player.count()
log_dq_result(GOLD_PLAYER_TEAM, "scd2", "GS-01: one_active_record_per_player",
    "Each player must have exactly one is_active=True record",
    total_players, total_players - multi_active, 100.0,
    f"Players with multiple active records: {multi_active}")

# GS-02: Players with no active record
all_players = df_scd2.select("player_name").distinct().count()
players_with_active = active_per_player.count()
no_active = all_players - players_with_active
log_dq_result(GOLD_PLAYER_TEAM, "scd2", "GS-02: every_player_has_active",
    "Every player must have at least one active record",
    all_players, all_players - no_active, 100.0,
    f"Players with no active record: {no_active}")

# GS-03: Active records must have end_date = NULL
active_with_end = df_scd2.filter(
    (col("is_active") == True) & col("end_date").isNotNull()
).count()
active_total = df_scd2.filter(col("is_active") == True).count()
log_dq_result(GOLD_PLAYER_TEAM, "scd2", "GS-03: active_has_null_end_date",
    "Active records (is_active=True) must have end_date = NULL",
    active_total, active_total - active_with_end, 100.0)

# GS-04: Inactive records must have end_date populated
inactive_no_end = df_scd2.filter(
    (col("is_active") == False) & col("end_date").isNull()
).count()
inactive_total = df_scd2.filter(col("is_active") == False).count()
if inactive_total > 0:
    log_dq_result(GOLD_PLAYER_TEAM, "scd2", "GS-04: inactive_has_end_date",
        "Inactive records (is_active=False) must have end_date populated",
        inactive_total, inactive_total - inactive_no_end, 95.0)

# GS-05: start_date must be <= end_date (when end_date exists)
invalid_dates = df_scd2.filter(
    col("end_date").isNotNull() &
    (col("start_date") > col("end_date"))
).count()
has_end = df_scd2.filter(col("end_date").isNotNull()).count()
if has_end > 0:
    log_dq_result(GOLD_PLAYER_TEAM, "scd2", "GS-05: start_before_end",
        "start_date must be <= end_date for closed records",
        has_end, has_end - invalid_dates, 100.0)

# GS-06: No overlapping date ranges for the same player
# Compare each stint's end_date with the next stint's start_date (same player, ordered by start_date)
from pyspark.sql.functions import lead
scd_window = Window.partitionBy("player_name").orderBy("start_date")
df_scd_overlap = df_scd2.withColumn(
    "next_start", lead("start_date").over(scd_window)
)
overlaps = df_scd_overlap.filter(
    col("next_start").isNotNull() &
    col("end_date").isNotNull() &
    (col("end_date") > col("next_start"))
).count()
log_dq_result(GOLD_PLAYER_TEAM, "scd2", "GS-06: no_date_overlaps",
    "No player should have overlapping date ranges across team stints",
    scd2_ct, scd2_ct - overlaps, 100.0,
    f"Overlapping records: {overlaps}")

# GS-07: matches_for_team must be > 0
zero_matches = df_scd2.filter(col("matches_for_team") <= 0).count()
log_dq_result(GOLD_PLAYER_TEAM, "scd2", "GS-07: matches_for_team_positive",
    "Every SCD2 record must have at least 1 match for that team",
    scd2_ct, scd2_ct - zero_matches, 100.0)

# GS-08: SCD2 players should cover all batsmen in Gold batting stats (bowlers are not checked here)
scd_players = df_scd2.select("player_name").distinct()
bat_players = df_bat_overall.select(col("batsman").alias("player_name")).distinct()
missing_bat_players = bat_players.join(scd_players, "player_name", "left_anti").count()
total_bat_players = bat_players.count()
log_dq_result(GOLD_PLAYER_TEAM, "scd2", "GS-08: scd2_covers_batsmen",
    "All batsmen in Gold batting stats should exist in SCD2 table",
    total_bat_players, total_bat_players - missing_bat_players, 95.0,
    f"Batsmen missing from SCD2: {missing_bat_players}")

### 4.8 VOLUME / STATISTICAL RULES
_Checks that every Gold table contains at least one row._

In [0]:
print("=" * 80)
print("CATEGORY: VOLUME / STATISTICAL")
print("=" * 80)

# Tables must not be empty
gold_tables_info = [
    ("GV-V01", GOLD_BAT_MATCH,     bat_match_ct,    "batting_stats_per_match"),
    ("GV-V02", GOLD_BAT_SERIES,    bat_series_ct,   "batting_stats_per_series"),
    ("GV-V03", GOLD_BAT_OVERALL,   bat_overall_ct,  "batting_stats_overall"),
    ("GV-V04", GOLD_BOWL_MATCH,    bowl_match_ct,   "bowling_stats_per_match"),
    ("GV-V05", GOLD_BOWL_SERIES,   bowl_series_ct,  "bowling_stats_per_series"),
    ("GV-V06", GOLD_BOWL_OVERALL,  bowl_overall_ct, "bowling_stats_overall"),
    ("GV-V07", GOLD_MATCH_SUMMARY, match_summ_ct,   "match_summary"),
    ("GV-V08", GOLD_PLAYER_TEAM,   scd2_ct,         "player_team_scd2"),
]

for rule_id, table, row_ct, label in gold_tables_info:
    log_dq_result(table, "volume", f"{rule_id}: {label}_not_empty",
        f"Gold {label} table must contain data",
        1, 1 if row_ct > 0 else 0, 100.0,
        f"Row count: {row_ct:,}")

### 4.9 SCHEMA VALIDATION
_Verifies expected Gold columns are present._
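
The check in the cell below reduces to two set differences per table: expected minus actual gives the missing columns, and actual minus expected gives the extras. A small sketch with hypothetical column lists (`check_schema` is an illustrative helper, not part of this notebook):

```python
def check_schema(expected_cols, actual_cols):
    """Compare expected and actual column names.

    Returns (missing, extra) as sorted lists so the output is deterministic.
    """
    expected, actual = set(expected_cols), set(actual_cols)
    return sorted(expected - actual), sorted(actual - expected)

# Hypothetical example: one expected column is absent, one unexpected column is present
expected = {"player_name", "team", "start_date", "end_date", "is_active"}
actual = ["player_name", "team", "start_date", "is_active", "debug_flag"]

missing, extra = check_schema(expected, actual)
print(missing)  # ['end_date']
print(extra)    # ['debug_flag']
```

In the notebook only the missing columns lower the pass percentage; extra columns are reported in the details string for information.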

In [0]:
print("=" * 80)
print("CATEGORY: SCHEMA")
print("=" * 80)

schema_checks = [
    ("GD-01", GOLD_BAT_MATCH, df_bat_match, {
        "matchid", "series", "season", "batsman", "team",
        "balls_faced", "total_runs", "strike_rate",
        "dot_balls", "runs_1s", "runs_2s", "runs_3s", "runs_4s", "runs_5s", "runs_6s",
        "not_out", "times_dismissed", "gold_load_timestamp"
    }),
    ("GD-02", GOLD_BAT_SERIES, df_bat_series, {
        "series", "season", "batsman", "matches_played",
        "total_balls_faced", "total_runs", "strike_rate", "batting_average",
        "dot_balls", "runs_1s", "runs_2s", "runs_3s", "runs_4s", "runs_5s", "runs_6s",
        "scores_gte_30", "scores_gte_50", "scores_gte_100",
        "not_out_count", "total_dismissals", "gold_load_timestamp"
    }),
    ("GD-03", GOLD_BAT_OVERALL, df_bat_overall, {
        "batsman", "series_played", "total_matches",
        "total_balls_faced", "total_runs", "strike_rate", "batting_average",
        "dot_balls", "runs_1s", "runs_2s", "runs_3s", "runs_4s", "runs_5s", "runs_6s",
        "scores_gte_30", "scores_gte_50", "scores_gte_100",
        "not_out_count", "total_dismissals", "gold_load_timestamp"
    }),
    ("GD-04", GOLD_BOWL_MATCH, df_bowl_match, {
        "matchid", "series", "season", "bowler",
        "total_deliveries", "legal_balls_bowled", "overs_bowled",
        "runs_conceded", "economy_rate",
        "maiden_overs", "dot_balls",
        "runs_1s_conceded", "runs_2s_conceded", "runs_3s_conceded",
        "runs_4s_conceded", "runs_5s_conceded", "runs_6s_conceded",
        "wickets_taken", "gold_load_timestamp"
    }),
    ("GD-05", GOLD_BOWL_SERIES, df_bowl_series, {
        "series", "season", "bowler", "matches_played",
        "total_deliveries", "total_legal_balls", "total_overs_bowled",
        "total_runs_conceded", "economy_rate", "bowling_average", "bowling_strike_rate",
        "total_maiden_overs", "total_dot_balls",
        "runs_1s_conceded", "runs_2s_conceded", "runs_3s_conceded",
        "runs_4s_conceded", "runs_5s_conceded", "runs_6s_conceded",
        "total_wickets", "five_wicket_hauls", "gold_load_timestamp"
    }),
    ("GD-06", GOLD_BOWL_OVERALL, df_bowl_overall, {
        "bowler", "series_played", "total_matches",
        "total_deliveries", "total_legal_balls", "total_overs_bowled",
        "total_runs_conceded", "economy_rate", "bowling_average", "bowling_strike_rate",
        "total_maiden_overs", "total_dot_balls",
        "runs_1s_conceded", "runs_2s_conceded", "runs_3s_conceded",
        "runs_4s_conceded", "runs_5s_conceded", "runs_6s_conceded",
        "total_wickets", "five_wicket_hauls", "gold_load_timestamp"
    }),
    ("GD-07", GOLD_MATCH_SUMMARY, df_match_summary, {
        "matchid", "match_number", "series", "season",
        "team_1", "team_2", "venue", "match_date", "match_start_utc",
        "toss_winner", "toss_decision",
        "result_text", "winner", "win_margin", "result_type",
        "has_super_over", "player_of_the_match",
        "gold_load_timestamp"
    }),
    ("GD-08", GOLD_PLAYER_TEAM, df_scd2, {
        "player_name", "team", "start_date", "end_date",
        "is_active", "matches_for_team", "gold_load_timestamp"
    }),
]

for rule_id, table_name, df, expected_cols in schema_checks:
    actual_cols = set(df.columns)
    missing = expected_cols - actual_cols
    extra = actual_cols - expected_cols
    log_dq_result(table_name, "schema", f"{rule_id}: expected_columns",
        "Gold table must contain all expected columns",
        len(expected_cols), len(expected_cols) - len(missing), 100.0,
        f"Missing: {missing or 'None'}, Extra: {extra or 'None'}")

## 5. DQ Summary Report

In [0]:
print("\n" + "=" * 80)
print(f"DATA QUALITY SUMMARY — Gold Layer — Run ID: {run_id}")
print("=" * 80)

# Overall status breakdown
summary_df = spark.sql(f"""
    SELECT
        status,
        COUNT(*) AS rule_count,
        ROUND(AVG(pass_percentage), 2) AS avg_pass_pct
    FROM {DQ_AUDIT_TABLE}
    WHERE run_id = '{run_id}'
    GROUP BY status
    ORDER BY status
""")
display(summary_df)

# Category breakdown
print("\n--- BY CATEGORY ---")
cat_df = spark.sql(f"""
    SELECT
        rule_category,
        COUNT(*) AS rules,
        SUM(CASE WHEN status = 'PASS' THEN 1 ELSE 0 END) AS passed,
        SUM(CASE WHEN status = 'FAIL' THEN 1 ELSE 0 END) AS failed,
        ROUND(AVG(pass_percentage), 2) AS avg_pass_pct
    FROM {DQ_AUDIT_TABLE}
    WHERE run_id = '{run_id}'
    GROUP BY rule_category
    ORDER BY rule_category
""")
display(cat_df)

# Detailed failures
print("\n--- FAILED RULES ---")
failures_df = spark.sql(f"""
    SELECT
        table_name, rule_category, rule_name,
        pass_percentage, failed_records, details
    FROM {DQ_AUDIT_TABLE}
    WHERE run_id = '{run_id}' AND status = 'FAIL'
    ORDER BY pass_percentage ASC
""")
display(failures_df)

# Warnings
print("\n--- WARNINGS ---")
warnings_df = spark.sql(f"""
    SELECT
        table_name, rule_category, rule_name,
        pass_percentage, failed_records, details
    FROM {DQ_AUDIT_TABLE}
    WHERE run_id = '{run_id}' AND status = 'WARN'
    ORDER BY pass_percentage ASC
""")
display(warnings_df)

## 6. Pipeline Gate

In [0]:
# Check for critical failures
critical_failures = spark.sql(f"""
    SELECT COUNT(*) AS cnt
    FROM {DQ_AUDIT_TABLE}
    WHERE run_id = '{run_id}'
      AND status = 'FAIL'
      AND threshold_pct = 100.0
""").first()["cnt"]

if critical_failures > 0:
    msg = f"⛔ PIPELINE GATE: {critical_failures} critical Gold DQ rule(s) failed! Review before serving to dashboards."
    print(msg)
    # Uncomment to actually halt:
    # raise Exception(msg)
else:
    print("✅ PIPELINE GATE: All critical Gold DQ rules passed. Tables ready for analytics & dashboards.")
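
The gate above only prints a message unless the `raise` line is uncommented. One way to turn halting into a switch is sketched below in plain Python. `GoldDQGateError`, `enforce_gate`, and the `halt_on_failure` flag are hypothetical names introduced for illustration, and the count passed in is assumed to come from the same audit-table query.

```python
class GoldDQGateError(Exception):
    """Raised when critical Gold DQ rules fail and halting is enabled (hypothetical)."""


def enforce_gate(critical_failures, halt_on_failure=False):
    """Return True if the gate passes.

    With halt_on_failure=True a failing gate raises instead of only warning,
    which would stop a scheduled job at this cell.
    """
    if critical_failures > 0:
        msg = f"PIPELINE GATE: {critical_failures} critical Gold DQ rule(s) failed."
        if halt_on_failure:
            raise GoldDQGateError(msg)
        print(msg)
        return False
    print("PIPELINE GATE: all critical Gold DQ rules passed.")
    return True


enforce_gate(0)   # gate passes
enforce_gate(2)   # warns and returns False, execution continues
```

Calling `enforce_gate(2, halt_on_failure=True)` would raise `GoldDQGateError`, so downstream cells or tasks do not run on data that failed a 100% threshold rule.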