# CricketCommentaryParser — Gold Layer

This notebook builds **Gold layer aggregation tables** from Silver layer data for the IPL cricket pipeline.

Gold tables are analytics-ready, pre-aggregated, and optimized for dashboards & reporting.

---

### Gold Tables Produced

| # | Gold Table | Source Silver Table(s) | Description |
|---|---|---|---|
| 1 | `gold.batting_stats_per_match` | `silver.match_events`, `silver.match_metadata` | Batsman stats per match — runs, balls, SR, run distribution |
| 2 | `gold.batting_stats_per_series` | `gold.batting_stats_per_match`, `silver.match_metadata` | Batsman stats aggregated per series — includes milestone counts |
| 3 | `gold.batting_stats_overall` | `gold.batting_stats_per_series` | Batsman career stats across all series — includes milestone counts |
| 4 | `gold.bowling_stats_per_match` | `silver.match_events`, `silver.match_metadata` | Bowler stats per match — overs, runs, wickets, maidens, dot balls |
| 5 | `gold.bowling_stats_per_series` | `gold.bowling_stats_per_match`, `silver.match_metadata` | Bowler stats per series — includes 5-wicket haul counts |
| 6 | `gold.bowling_stats_overall` | `gold.bowling_stats_per_series` | Bowler career stats across all series — includes 5-wicket haul counts |
| 7 | `gold.match_summary` | `silver.match_metadata` | Match-level summary — teams, venue, toss, winner, dates |
| 8 | `gold.player_team_scd2` | `silver.match_events`, `silver.match_players`, `silver.match_metadata` | Player ↔ Team relationship (SCD Type 2) with start/end dates & active flag |

### Architecture

```
Bronze (raw) → Silver (cleansed/enriched) → Gold (aggregated/analytics-ready)
```

### Key Design Decisions
- **Leg byes & byes excluded** from both batting AND bowling individual stats (credited to team score only)
  - Batsman runs: exclude legbyes, byes (not earned by batsman)
  - Bowler runs conceded: exclude legbyes, byes (not bowler's fault)
- **Legal balls only** for overs calculation (wides and no-balls don't count toward the over)
- **Maiden over** = an over of 6 legal deliveries in which the bowler conceded 0 bowler-charged runs (leg byes and byes do not break the maiden)
- **SCD Type 2** tracks player-team changes over time with effective dating
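
The rules above can be sketched in plain Python, independent of Spark. The extras labels (`legbyes`, `byes`, `wide`, `noball`) mirror the string values used in this notebook's filters; the function name `classify_delivery` is hypothetical and only illustrates how one delivery splits into batting-side and bowling-side figures.

```python
# Extras labels mirror the string values filtered on in this notebook.
NOT_CHARGED = {"legbyes", "byes"}   # team extras only
ILLEGAL = {"wide", "noball"}        # do not count toward the over


def classify_delivery(extras, runs):
    """Split one delivery into batting-side and bowling-side figures."""
    return {
        # Batsman gets no credit for leg byes, byes, or wides
        "batsman_runs": 0 if extras in (NOT_CHARGED | {"wide"}) else runs,
        # A wide is not a ball faced by the batsman
        "ball_faced": 0 if extras == "wide" else 1,
        # Bowler is charged for everything except leg byes and byes
        "bowler_runs": 0 if extras in NOT_CHARGED else runs,
        # Wides and no-balls do not count toward the over
        "is_legal_ball": 0 if extras in ILLEGAL else 1,
    }


print(classify_delivery("legbyes", 2))
print(classify_delivery("wide", 1))
print(classify_delivery(None, 4))
```

A leg-bye delivery is therefore a legal ball with zero runs on both the batsman's and the bowler's ledger, while a wide costs the bowler a run without counting toward the over.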

## 1. Imports & Configuration

In [0]:
# ── Job Parameters ────────────────────────────────────────────────────────────
# Default values are used during interactive runs.
# Databricks Job overrides these at runtime via the Parameters section.
# Key names here must match exactly what the Job defines.

dbutils.widgets.text(
    "catalog_name", "T20_catalog",
    "Catalog Name"
)


In [0]:
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import (
    col, count, when, isnull, lit, sum as _sum, avg, min as _min, max as _max,
    length, trim, regexp_extract, current_timestamp, countDistinct, expr,
    upper, lower, coalesce, floor, concat_ws, row_number, dense_rank,
    round as _round, first, collect_set, array_contains, size,
    to_date, date_format, lead
)
from pyspark.sql.types import IntegerType, LongType, DoubleType, StringType
from pyspark.sql.window import Window
from datetime import datetime, timezone

# ─── Unity Catalog Configuration ───
CATALOG_NAME  = dbutils.widgets.get("catalog_name")

# Silver (source)
SILVER_SCHEMA    = f"{CATALOG_NAME}.silver"
SILVER_EVENTS    = f"{SILVER_SCHEMA}.match_events"
SILVER_METADATA  = f"{SILVER_SCHEMA}.match_metadata"
SILVER_PLAYERS   = f"{SILVER_SCHEMA}.match_players"

# Gold (target)
GOLD_SCHEMA = f"{CATALOG_NAME}.gold"

# Gold table names
GOLD_BAT_MATCH      = f"{GOLD_SCHEMA}.batting_stats_per_match"
GOLD_BAT_SERIES     = f"{GOLD_SCHEMA}.batting_stats_per_series"
GOLD_BAT_OVERALL    = f"{GOLD_SCHEMA}.batting_stats_overall"
GOLD_BOWL_MATCH     = f"{GOLD_SCHEMA}.bowling_stats_per_match"
GOLD_BOWL_SERIES    = f"{GOLD_SCHEMA}.bowling_stats_per_series"
GOLD_BOWL_OVERALL   = f"{GOLD_SCHEMA}.bowling_stats_overall"
GOLD_MATCH_SUMMARY  = f"{GOLD_SCHEMA}.match_summary"
GOLD_PLAYER_TEAM    = f"{GOLD_SCHEMA}.player_team_scd2"

# Run metadata
run_timestamp = datetime.now(timezone.utc)
gold_load_ts  = current_timestamp()

print("Gold Layer Processing")
print(f"Run Timestamp: {run_timestamp}")
print(f"Target Schema: {GOLD_SCHEMA}")

## 2. Create Gold Schema & Load Silver Tables

In [0]:
# Create gold schema if not exists
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {GOLD_SCHEMA}")
print(f"✓ Schema ready: {GOLD_SCHEMA}")

# Load Silver tables
df_events   = spark.table(SILVER_EVENTS)
df_metadata = spark.table(SILVER_METADATA)
df_players  = spark.table(SILVER_PLAYERS)

print(f"\nSilver Tables Loaded:")
print(f"  {SILVER_EVENTS}:   {df_events.count():,} rows")
print(f"  {SILVER_METADATA}: {df_metadata.count():,} rows")
print(f"  {SILVER_PLAYERS}:  {df_players.count():,} rows")

## 3. Helper: Enrich Events with Series Info

Join events with metadata to attach `series` and `season` columns needed for series-level aggregations.
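
A plain-Python sketch of the left-join semantics, using made-up sample rows and a hypothetical `enrich` helper. It shows that an event whose `matchid` has no metadata row is kept, with `series` and `season` left empty; in Spark such rows would then fall into a null group in the series-level aggregations.

```python
events = [
    {"matchid": 1, "batsman": "A", "runs": 4},
    {"matchid": 2, "batsman": "B", "runs": 1},
]
# Sample metadata keyed by matchid; matchid 2 is deliberately missing
metadata = {1: {"series": "IPL", "season": "2023"}}

EMPTY = {"series": None, "season": None}


def enrich(events, metadata):
    """Left join: keep every event, fill unmatched metadata with None."""
    return [{**e, **metadata.get(e["matchid"], EMPTY)} for e in events]


enriched = enrich(events, metadata)
for row in enriched:
    print(row)
```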

In [0]:
# Enrich events with series/season from metadata
df_meta_slim = df_metadata.select(
    col("matchid"),
    col("series"),
    col("season"),
    col("match_date")
)

df_events_enriched = df_events.join(df_meta_slim, "matchid", "left")

print(f"✓ Events enriched with series/season: {df_events_enriched.count():,} rows")

---
## 4. BATTING STATISTICS

### 4.1 Batting Stats Per Match

One row per **batsman per match**. Columns:
- Total balls faced (wides excluded), total runs scored (excluding leg byes and byes)
- Run distribution: 1s, 2s, 3s, 4s, 5s, 6s
- Strike rate
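
As a rough illustration of these columns, the sketch below computes one batting line in plain Python. The `batting_line` helper and its tuple input are assumptions made for the example; unlike a raw count over all rows, this sketch counts the run distribution only over deliveries the batsman actually faced.

```python
from collections import Counter


def batting_line(deliveries):
    """deliveries: list of (batsman_runs, ball_faced) tuples for one batsman in one match."""
    balls = sum(faced for _, faced in deliveries)
    runs = sum(r for r, _ in deliveries)
    # Count the run distribution only over balls actually faced
    distribution = Counter(r for r, faced in deliveries if faced)
    # Guard against division by zero when no legal ball was faced
    strike_rate = round(runs / balls * 100, 2) if balls > 0 else 0.0
    return {
        "balls_faced": balls,
        "total_runs": runs,
        "strike_rate": strike_rate,
        "distribution": dict(distribution),
    }


# Four balls faced plus one wide, which has ball_faced = 0
line = batting_line([(4, 1), (0, 1), (1, 1), (0, 0), (6, 1)])
print(line)
```

Here 11 runs off 4 balls faced gives a strike rate of 275.0, and the wide contributes to neither the balls faced nor the dot-ball count.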

In [0]:
# ──────────────────────────────────────────────
# BATTING STATS PER MATCH
# ──────────────────────────────────────────────

# Filter: exclude leg byes from batsman's runs (leg byes are team extras, not batsman credit)
# Also exclude byes — only count runs the batsman actually scored
# Legal ball = counts toward balls faced (wides do NOT count as balls faced by batsman)

df_batting_base = df_events_enriched.filter(
    col("batsman").isNotNull()
)

# Runs credited to batsman: exclude legbyes and byes (those are extras, not batsman runs)
# Wides don't count as balls faced by the batsman
df_batting_base = df_batting_base.withColumn(
    "batsman_runs",
    when(col("Extras").isin("legbyes", "byes", "wide"), lit(0)).otherwise(col("runs"))
).withColumn(
    "ball_faced",
    when(col("Extras") == "wide", lit(0)).otherwise(lit(1))
)

# Determine if batsman was not out in this innings
# A batsman is not out if they never had a dismissal in that match-team innings
df_bat_dismissals = df_batting_base.groupBy("matchid", "batsman", "team").agg(
    _sum(when(col("dismissal_method") != "Not Out", lit(1)).otherwise(lit(0))).alias("times_dismissed")
)

# Aggregate batting stats per match
df_bat_match = df_batting_base.groupBy(
    "matchid", "batsman", "team", "series", "season"
).agg(
    # Total balls faced (wides excluded)
    _sum("ball_faced").alias("balls_faced"),

    # Total runs scored (excluding legbyes and byes)
    _sum("batsman_runs").alias("total_runs"),

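    # Run distribution, based on batsman_runs per delivery
    # Dot balls count only deliveries actually faced, so wides (ball_faced = 0) are excluded
    _sum(when((col("ball_faced") == 1) & (col("batsman_runs") == 0), lit(1)).otherwise(lit(0))).alias("dot_balls"),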
    _sum(when(col("batsman_runs") == 1, lit(1)).otherwise(lit(0))).alias("runs_1s"),
    _sum(when(col("batsman_runs") == 2, lit(1)).otherwise(lit(0))).alias("runs_2s"),
    _sum(when(col("batsman_runs") == 3, lit(1)).otherwise(lit(0))).alias("runs_3s"),
    _sum(when(col("batsman_runs") == 4, lit(1)).otherwise(lit(0))).alias("runs_4s"),
    _sum(when(col("batsman_runs") == 5, lit(1)).otherwise(lit(0))).alias("runs_5s"),
    _sum(when(col("batsman_runs") >= 6, lit(1)).otherwise(lit(0))).alias("runs_6s")
)

# Join dismissal info to determine not_out flag
df_bat_match = df_bat_match.join(
    df_bat_dismissals,
    ["matchid", "batsman", "team"],
    "left"
).withColumn(
    "not_out", when(col("times_dismissed") == 0, lit(True)).otherwise(lit(False))
)

# Calculate strike rate: (total_runs / balls_faced) * 100
df_bat_match = df_bat_match.withColumn(
    "strike_rate",
    when(col("balls_faced") > 0,
         _round((col("total_runs") / col("balls_faced")) * 100, 2))
    .otherwise(lit(0.0))
)

# Add gold layer timestamp
df_bat_match = df_bat_match.withColumn("gold_load_timestamp", gold_load_ts)

# Select final columns in order
df_bat_match = df_bat_match.select(
    "matchid", "series", "season", "batsman", "team",
    "balls_faced", "total_runs", "strike_rate",
    "dot_balls", "runs_1s", "runs_2s", "runs_3s", "runs_4s", "runs_5s", "runs_6s",
    "not_out", "times_dismissed",
    "gold_load_timestamp"
)

# Write to Gold
df_bat_match.write.mode("overwrite").option("overwriteSchema", "true").saveAsTable(GOLD_BAT_MATCH)

bat_match_count = spark.table(GOLD_BAT_MATCH).count()
print(f"✓ {GOLD_BAT_MATCH}: {bat_match_count:,} rows written")
display(spark.table(GOLD_BAT_MATCH).limit(5))

### 4.2 Batting Stats Per Series

One row per **batsman per series**. Aggregates match-level stats and adds:
- Number of innings with scores ≥30, ≥50, ≥100
- Number of times the batsman remained not out
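
A minimal sketch of the milestone and average logic in plain Python. The `series_summary` helper is hypothetical, and it assumes at most one dismissal per match. The thresholds are cumulative, so a century is counted under all three milestones, and the average falls back to total runs when the batsman was never dismissed, as this notebook does.

```python
def series_summary(innings):
    """innings: list of (runs, not_out) tuples, one per match."""
    total_runs = sum(r for r, _ in innings)
    dismissals = sum(1 for _, not_out in innings if not not_out)
    return {
        "scores_gte_30": sum(1 for r, _ in innings if r >= 30),
        "scores_gte_50": sum(1 for r, _ in innings if r >= 50),
        "scores_gte_100": sum(1 for r, _ in innings if r >= 100),
        "not_out_count": sum(1 for _, not_out in innings if not_out),
        # Never dismissed: fall back to total runs, mirroring the notebook's choice
        "batting_average": round(total_runs / dismissals, 2) if dismissals else float(total_runs),
    }


summary = series_summary([(34, False), (102, True), (7, False), (55, False)])
print(summary)
```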

In [0]:
# ──────────────────────────────────────────────
# BATTING STATS PER SERIES
# ──────────────────────────────────────────────

# Read from the Gold match-level table we just wrote
df_bat_match_src = spark.table(GOLD_BAT_MATCH)

df_bat_series = df_bat_match_src.groupBy(
    "series", "season", "batsman"
).agg(
    # Matches played in this series
    countDistinct("matchid").alias("matches_played"),

    # Core batting stats
    _sum("balls_faced").alias("total_balls_faced"),
    _sum("total_runs").alias("total_runs"),

    # Run distribution
    _sum("dot_balls").alias("dot_balls"),
    _sum("runs_1s").alias("runs_1s"),
    _sum("runs_2s").alias("runs_2s"),
    _sum("runs_3s").alias("runs_3s"),
    _sum("runs_4s").alias("runs_4s"),
    _sum("runs_5s").alias("runs_5s"),
    _sum("runs_6s").alias("runs_6s"),

    # Milestone counts — individual match scores
    _sum(when(col("total_runs") >= 30, lit(1)).otherwise(lit(0))).alias("scores_gte_30"),
    _sum(when(col("total_runs") >= 50, lit(1)).otherwise(lit(0))).alias("scores_gte_50"),
    _sum(when(col("total_runs") >= 100, lit(1)).otherwise(lit(0))).alias("scores_gte_100"),

    # Not out count
    _sum(when(col("not_out") == True, lit(1)).otherwise(lit(0))).alias("not_out_count"),

    # Total dismissals
    _sum("times_dismissed").alias("total_dismissals")
)

# Calculate overall strike rate for the series
df_bat_series = df_bat_series.withColumn(
    "strike_rate",
    when(col("total_balls_faced") > 0,
         _round((col("total_runs") / col("total_balls_faced")) * 100, 2))
    .otherwise(lit(0.0))
)

# Calculate batting average: total_runs / total_dismissals
df_bat_series = df_bat_series.withColumn(
    "batting_average",
    when(col("total_dismissals") > 0,
         _round(col("total_runs") / col("total_dismissals"), 2))
    .otherwise(col("total_runs").cast("double"))  # not out in all innings = runs itself
)

# Add gold layer timestamp
df_bat_series = df_bat_series.withColumn("gold_load_timestamp", gold_load_ts)

# Select final columns
df_bat_series = df_bat_series.select(
    "series", "season", "batsman", "matches_played",
    "total_balls_faced", "total_runs", "strike_rate", "batting_average",
    "dot_balls", "runs_1s", "runs_2s", "runs_3s", "runs_4s", "runs_5s", "runs_6s",
    "scores_gte_30", "scores_gte_50", "scores_gte_100",
    "not_out_count", "total_dismissals",
    "gold_load_timestamp"
)

# Write to Gold
df_bat_series.write.mode("overwrite").option("overwriteSchema", "true").saveAsTable(GOLD_BAT_SERIES)

bat_series_count = spark.table(GOLD_BAT_SERIES).count()
print(f"✓ {GOLD_BAT_SERIES}: {bat_series_count:,} rows written")
display(spark.table(GOLD_BAT_SERIES).limit(5))

### 4.3 Batting Stats Overall (Across All Series)

One row per **batsman** — career-level aggregation across all IPL seasons.
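
Career rates are recomputed from summed totals rather than averaged from the per-series rates. The sketch below, with made-up numbers and hypothetical helper names, shows why the two approaches differ when series have very different sample sizes.

```python
series_rows = [
    {"total_runs": 300, "total_balls_faced": 200},  # series strike rate 150.0
    {"total_runs": 10, "total_balls_faced": 20},    # series strike rate 50.0
]


def career_strike_rate(rows):
    """Ratio of sums: what the career table computes."""
    runs = sum(r["total_runs"] for r in rows)
    balls = sum(r["total_balls_faced"] for r in rows)
    return round(runs / balls * 100, 2) if balls > 0 else 0.0


def mean_of_series_rates(rows):
    """Mean of ratios: shown only for contrast, it over-weights short series."""
    rates = [r["total_runs"] / r["total_balls_faced"] * 100 for r in rows]
    return round(sum(rates) / len(rates), 2)


print(career_strike_rate(series_rows))
print(mean_of_series_rates(series_rows))
```

The ratio of sums weights every ball equally, which is why the 20-ball series barely moves the career figure.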

In [0]:
# ──────────────────────────────────────────────
# BATTING STATS OVERALL (ACROSS ALL SERIES)
# ──────────────────────────────────────────────

df_bat_series_src = spark.table(GOLD_BAT_SERIES)

df_bat_overall = df_bat_series_src.groupBy("batsman").agg(
    # Career span
    countDistinct("series").alias("series_played"),
    _sum("matches_played").alias("total_matches"),

    # Core batting stats
    _sum("total_balls_faced").alias("total_balls_faced"),
    _sum("total_runs").alias("total_runs"),

    # Run distribution
    _sum("dot_balls").alias("dot_balls"),
    _sum("runs_1s").alias("runs_1s"),
    _sum("runs_2s").alias("runs_2s"),
    _sum("runs_3s").alias("runs_3s"),
    _sum("runs_4s").alias("runs_4s"),
    _sum("runs_5s").alias("runs_5s"),
    _sum("runs_6s").alias("runs_6s"),

    # Milestones (sum across series)
    _sum("scores_gte_30").alias("scores_gte_30"),
    _sum("scores_gte_50").alias("scores_gte_50"),
    _sum("scores_gte_100").alias("scores_gte_100"),

    # Not out & dismissals
    _sum("not_out_count").alias("not_out_count"),
    _sum("total_dismissals").alias("total_dismissals")
)

# Calculate overall career strike rate
df_bat_overall = df_bat_overall.withColumn(
    "strike_rate",
    when(col("total_balls_faced") > 0,
         _round((col("total_runs") / col("total_balls_faced")) * 100, 2))
    .otherwise(lit(0.0))
)

# Calculate career batting average
df_bat_overall = df_bat_overall.withColumn(
    "batting_average",
    when(col("total_dismissals") > 0,
         _round(col("total_runs") / col("total_dismissals"), 2))
    .otherwise(col("total_runs").cast("double"))
)

# Add gold layer timestamp
df_bat_overall = df_bat_overall.withColumn("gold_load_timestamp", gold_load_ts)

# Select final columns
df_bat_overall = df_bat_overall.select(
    "batsman", "series_played", "total_matches",
    "total_balls_faced", "total_runs", "strike_rate", "batting_average",
    "dot_balls", "runs_1s", "runs_2s", "runs_3s", "runs_4s", "runs_5s", "runs_6s",
    "scores_gte_30", "scores_gte_50", "scores_gte_100",
    "not_out_count", "total_dismissals",
    "gold_load_timestamp"
)

# Write to Gold
df_bat_overall.write.mode("overwrite").option("overwriteSchema", "true").saveAsTable(GOLD_BAT_OVERALL)

bat_overall_count = spark.table(GOLD_BAT_OVERALL).count()
print(f"✓ {GOLD_BAT_OVERALL}: {bat_overall_count:,} rows written")
display(spark.table(GOLD_BAT_OVERALL).limit(5))

---
## 5. BOWLING STATISTICS

### 5.1 Bowling Stats Per Match

One row per **bowler per match**. Columns:
- Total runs conceded **(excluding leg byes and byes — those are team extras, not bowler's fault)**
- Total balls bowled, total overs (6 legal balls = 1 over)
- Maiden overs (0 bowler-runs off 6 legal balls — leg byes don't break the maiden)
- Dot balls (0 bowler-runs on a legal delivery)
- Run distribution conceded: 1s, 2s, 3s, 4s, 5s, 6s **(based on bowler-charged runs only)**
- Wickets taken (excluding run outs)
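
The overs, economy, and maiden rules can be sketched in plain Python. The helper names are hypothetical. Note that the `overs.balls` value is a display notation (4.1 means 4 overs and 1 ball), so the economy rate is derived from the legal-ball count rather than from that notation.

```python
def overs_notation(legal_balls):
    """Convert a legal-ball count to cricket 'overs.balls' notation, e.g. 25 -> 4.1."""
    return round(legal_balls // 6 + (legal_balls % 6) / 10, 1)


def count_maidens(overs):
    """overs: list of (legal_balls_in_over, bowler_runs_in_over) tuples."""
    # An over is a maiden only if it is complete AND no runs were charged to the bowler
    return sum(1 for legal, runs in overs if legal == 6 and runs == 0)


def economy_rate(runs_conceded, legal_balls):
    """Runs conceded per over, computed from legal balls."""
    return round(runs_conceded / (legal_balls / 6), 2) if legal_balls > 0 else 0.0


print(overs_notation(25))
print(count_maidens([(6, 0), (6, 4), (5, 0)]))
print(economy_rate(30, 24))
```

The incomplete five-ball over with zero runs is not counted as a maiden, matching the six-legal-balls condition.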

In [0]:
# ──────────────────────────────────────────────
# BOWLING STATS PER MATCH
# ──────────────────────────────────────────────

# Bowling perspective: the bowler's team is the OPPOSING team to the batting team
# We aggregate from the batsman's perspective but group by bowler
#
# KEY RULE: Leg byes and byes are NOT charged to the bowler.
#   - Leg byes: ball hits batsman's body, runs taken — credited to team extras only
#   - Byes: ball passes everyone, runs taken — credited to team extras only
#   - These are added to the overall team score but NOT to the bowler's figures
#   - Wides and no-balls ARE charged to the bowler (bowler's fault)

df_bowling_base = df_events_enriched.filter(
    col("bowler").isNotNull()
)

# Legal ball = not a wide and not a no-ball (these don't count toward the over)
df_bowling_base = df_bowling_base.withColumn(
    "is_legal_ball",
    when(col("Extras").isin("wide", "noball"), lit(0)).otherwise(lit(1))
)

# Runs charged to bowler: EXCLUDE legbyes and byes (not bowler's fault)
# Wides and no-balls ARE charged to the bowler
df_bowling_base = df_bowling_base.withColumn(
    "bowler_runs",
    when(col("Extras").isin("legbyes", "byes"), lit(0)).otherwise(col("runs"))
)

# Dot ball = legal delivery where 0 runs charged to bowler (using bowler_runs, not total runs)
# Note: a legbye delivery with runs taken is still a "dot ball" in the bowler's figures
df_bowling_base = df_bowling_base.withColumn(
    "is_dot_ball",
    when((col("is_legal_ball") == 1) & (col("bowler_runs") == 0), lit(1)).otherwise(lit(0))
)

# Wicket taken by bowler (exclude run outs — those aren't credited to the bowler)
df_bowling_base = df_bowling_base.withColumn(
    "is_wicket",
    when(
        (col("dismissal_method") != "Not Out") &
        (col("dismissal_method") != "Run Out"),
        lit(1)
    ).otherwise(lit(0))
)

# ── Step 1: Calculate maiden overs ──
# A maiden over = bowler bowled 6 legal balls and conceded 0 runs CHARGED TO BOWLER
# (legbyes/byes in an over do NOT break the maiden)
df_over_stats = df_bowling_base.groupBy(
    "matchid", "bowler", "over", "series", "season"
).agg(
    _sum("is_legal_ball").alias("legal_balls_in_over"),
    _sum("bowler_runs").alias("bowler_runs_in_over")
)

df_maiden_overs = df_over_stats.groupBy(
    "matchid", "bowler", "series", "season"
).agg(
    _sum(
        when((col("legal_balls_in_over") == 6) & (col("bowler_runs_in_over") == 0), lit(1))
        .otherwise(lit(0))
    ).alias("maiden_overs")
)

# ── Step 2: Main bowling aggregation per match ──
df_bowl_match = df_bowling_base.groupBy(
    "matchid", "bowler", "series", "season"
).agg(
    # Total runs conceded BY BOWLER (excluding legbyes and byes)
    _sum("bowler_runs").alias("runs_conceded"),

    # Balls bowled (all deliveries including extras)
    count("*").alias("total_deliveries"),

    # Legal balls bowled (for overs calculation)
    _sum("is_legal_ball").alias("legal_balls_bowled"),

    # Dot balls (based on bowler_runs, not total runs)
    _sum("is_dot_ball").alias("dot_balls"),

    # Run distribution conceded — based on bowler_runs (excludes legbyes/byes)
    _sum(when(col("bowler_runs") == 1, lit(1)).otherwise(lit(0))).alias("runs_1s_conceded"),
    _sum(when(col("bowler_runs") == 2, lit(1)).otherwise(lit(0))).alias("runs_2s_conceded"),
    _sum(when(col("bowler_runs") == 3, lit(1)).otherwise(lit(0))).alias("runs_3s_conceded"),
    _sum(when(col("bowler_runs") == 4, lit(1)).otherwise(lit(0))).alias("runs_4s_conceded"),
    _sum(when(col("bowler_runs") == 5, lit(1)).otherwise(lit(0))).alias("runs_5s_conceded"),
    _sum(when(col("bowler_runs") >= 6, lit(1)).otherwise(lit(0))).alias("runs_6s_conceded"),

    # Wickets taken (excluding run outs)
    _sum("is_wicket").alias("wickets_taken")
)

# Calculate overs bowled: legal_balls / 6 → displayed as "overs.balls" format
df_bowl_match = df_bowl_match.withColumn(
    "overs_bowled",
    _round(
        floor(col("legal_balls_bowled") / 6) +
        (col("legal_balls_bowled") % 6) / 10,
        1
    )
)

# Calculate economy rate: runs_conceded / (legal_balls_bowled / 6)
df_bowl_match = df_bowl_match.withColumn(
    "economy_rate",
    when(col("legal_balls_bowled") > 0,
         _round(col("runs_conceded") / (col("legal_balls_bowled") / 6), 2))
    .otherwise(lit(0.0))
)

# Join maiden overs
df_bowl_match = df_bowl_match.join(
    df_maiden_overs,
    ["matchid", "bowler", "series", "season"],
    "left"
).withColumn("maiden_overs", coalesce(col("maiden_overs"), lit(0)))

# Add gold layer timestamp
df_bowl_match = df_bowl_match.withColumn("gold_load_timestamp", gold_load_ts)

# Select final columns
df_bowl_match = df_bowl_match.select(
    "matchid", "series", "season", "bowler",
    "total_deliveries", "legal_balls_bowled", "overs_bowled",
    "runs_conceded", "economy_rate",
    "maiden_overs", "dot_balls",
    "runs_1s_conceded", "runs_2s_conceded", "runs_3s_conceded",
    "runs_4s_conceded", "runs_5s_conceded", "runs_6s_conceded",
    "wickets_taken",
    "gold_load_timestamp"
)

# Write to Gold
df_bowl_match.write.mode("overwrite").option("overwriteSchema", "true").saveAsTable(GOLD_BOWL_MATCH)

bowl_match_count = spark.table(GOLD_BOWL_MATCH).count()
print(f"✓ {GOLD_BOWL_MATCH}: {bowl_match_count:,} rows written")
display(spark.table(GOLD_BOWL_MATCH).limit(5))


### 5.2 Bowling Stats Per Series

One row per **bowler per series**. Includes 5-wicket haul counts.
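
A small plain-Python sketch of the series-level bowling ratios, assuming a hypothetical `bowling_series` helper and made-up figures. Bowling average and strike rate are left undefined (`None`) when no wicket was taken, which mirrors the NULL used in this notebook.

```python
def bowling_series(matches):
    """matches: list of (runs_conceded, legal_balls, wickets) tuples, one per match."""
    runs = sum(m[0] for m in matches)
    balls = sum(m[1] for m in matches)
    wickets = sum(m[2] for m in matches)
    return {
        # A haul is counted per match, not from the series total
        "five_wicket_hauls": sum(1 for m in matches if m[2] >= 5),
        # Runs conceded per wicket; undefined without a wicket
        "bowling_average": round(runs / wickets, 2) if wickets else None,
        # Legal balls bowled per wicket; undefined without a wicket
        "bowling_strike_rate": round(balls / wickets, 2) if wickets else None,
    }


figures = bowling_series([(24, 24, 5), (30, 24, 1), (18, 18, 0)])
print(figures)
```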

In [0]:
# ──────────────────────────────────────────────
# BOWLING STATS PER SERIES
# ──────────────────────────────────────────────

df_bowl_match_src = spark.table(GOLD_BOWL_MATCH)

df_bowl_series = df_bowl_match_src.groupBy(
    "series", "season", "bowler"
).agg(
    # Matches
    countDistinct("matchid").alias("matches_played"),

    # Core bowling stats
    _sum("total_deliveries").alias("total_deliveries"),
    _sum("legal_balls_bowled").alias("total_legal_balls"),
    _sum("runs_conceded").alias("total_runs_conceded"),
    _sum("maiden_overs").alias("total_maiden_overs"),
    _sum("dot_balls").alias("total_dot_balls"),
    _sum("wickets_taken").alias("total_wickets"),

    # Run distribution conceded
    _sum("runs_1s_conceded").alias("runs_1s_conceded"),
    _sum("runs_2s_conceded").alias("runs_2s_conceded"),
    _sum("runs_3s_conceded").alias("runs_3s_conceded"),
    _sum("runs_4s_conceded").alias("runs_4s_conceded"),
    _sum("runs_5s_conceded").alias("runs_5s_conceded"),
    _sum("runs_6s_conceded").alias("runs_6s_conceded"),

    # 5-wicket hauls: number of matches where bowler took >= 5 wickets
    _sum(when(col("wickets_taken") >= 5, lit(1)).otherwise(lit(0))).alias("five_wicket_hauls")
)

# Calculate total overs bowled
df_bowl_series = df_bowl_series.withColumn(
    "total_overs_bowled",
    _round(
        floor(col("total_legal_balls") / 6) +
        (col("total_legal_balls") % 6) / 10,
        1
    )
)

# Calculate series economy rate
df_bowl_series = df_bowl_series.withColumn(
    "economy_rate",
    when(col("total_legal_balls") > 0,
         _round(col("total_runs_conceded") / (col("total_legal_balls") / 6), 2))
    .otherwise(lit(0.0))
)

# Calculate bowling average: runs_conceded / wickets
df_bowl_series = df_bowl_series.withColumn(
    "bowling_average",
    when(col("total_wickets") > 0,
         _round(col("total_runs_conceded") / col("total_wickets"), 2))
    .otherwise(lit(None).cast("double"))
)

# Calculate bowling strike rate: balls per wicket
df_bowl_series = df_bowl_series.withColumn(
    "bowling_strike_rate",
    when(col("total_wickets") > 0,
         _round(col("total_legal_balls") / col("total_wickets"), 2))
    .otherwise(lit(None).cast("double"))
)

# Add gold layer timestamp
df_bowl_series = df_bowl_series.withColumn("gold_load_timestamp", gold_load_ts)

# Select final columns
df_bowl_series = df_bowl_series.select(
    "series", "season", "bowler", "matches_played",
    "total_deliveries", "total_legal_balls", "total_overs_bowled",
    "total_runs_conceded", "economy_rate", "bowling_average", "bowling_strike_rate",
    "total_maiden_overs", "total_dot_balls",
    "runs_1s_conceded", "runs_2s_conceded", "runs_3s_conceded",
    "runs_4s_conceded", "runs_5s_conceded", "runs_6s_conceded",
    "total_wickets", "five_wicket_hauls",
    "gold_load_timestamp"
)

# Write to Gold
df_bowl_series.write.mode("overwrite").option("overwriteSchema", "true").saveAsTable(GOLD_BOWL_SERIES)

bowl_series_count = spark.table(GOLD_BOWL_SERIES).count()
print(f"✓ {GOLD_BOWL_SERIES}: {bowl_series_count:,} rows written")
display(spark.table(GOLD_BOWL_SERIES).limit(5))

### 5.3 Bowling Stats Overall (Across All Series)

One row per **bowler** — career-level bowling aggregation.

In [0]:
# ──────────────────────────────────────────────
# BOWLING STATS OVERALL (ACROSS ALL SERIES)
# ──────────────────────────────────────────────

df_bowl_series_src = spark.table(GOLD_BOWL_SERIES)

df_bowl_overall = df_bowl_series_src.groupBy("bowler").agg(
    # Career span
    countDistinct("series").alias("series_played"),
    _sum("matches_played").alias("total_matches"),

    # Core bowling stats
    _sum("total_deliveries").alias("total_deliveries"),
    _sum("total_legal_balls").alias("total_legal_balls"),
    _sum("total_runs_conceded").alias("total_runs_conceded"),
    _sum("total_maiden_overs").alias("total_maiden_overs"),
    _sum("total_dot_balls").alias("total_dot_balls"),
    _sum("total_wickets").alias("total_wickets"),

    # Run distribution conceded
    _sum("runs_1s_conceded").alias("runs_1s_conceded"),
    _sum("runs_2s_conceded").alias("runs_2s_conceded"),
    _sum("runs_3s_conceded").alias("runs_3s_conceded"),
    _sum("runs_4s_conceded").alias("runs_4s_conceded"),
    _sum("runs_5s_conceded").alias("runs_5s_conceded"),
    _sum("runs_6s_conceded").alias("runs_6s_conceded"),

    # 5-wicket hauls
    _sum("five_wicket_hauls").alias("five_wicket_hauls")
)

# Calculate total overs bowled
df_bowl_overall = df_bowl_overall.withColumn(
    "total_overs_bowled",
    _round(
        floor(col("total_legal_balls") / 6) +
        (col("total_legal_balls") % 6) / 10,
        1
    )
)

# Calculate career economy rate
df_bowl_overall = df_bowl_overall.withColumn(
    "economy_rate",
    when(col("total_legal_balls") > 0,
         _round(col("total_runs_conceded") / (col("total_legal_balls") / 6), 2))
    .otherwise(lit(0.0))
)

# Calculate career bowling average
df_bowl_overall = df_bowl_overall.withColumn(
    "bowling_average",
    when(col("total_wickets") > 0,
         _round(col("total_runs_conceded") / col("total_wickets"), 2))
    .otherwise(lit(None).cast("double"))
)

# Calculate career bowling strike rate
df_bowl_overall = df_bowl_overall.withColumn(
    "bowling_strike_rate",
    when(col("total_wickets") > 0,
         _round(col("total_legal_balls") / col("total_wickets"), 2))
    .otherwise(lit(None).cast("double"))
)

# Add gold layer timestamp
df_bowl_overall = df_bowl_overall.withColumn("gold_load_timestamp", gold_load_ts)

# Select final columns
df_bowl_overall = df_bowl_overall.select(
    "bowler", "series_played", "total_matches",
    "total_deliveries", "total_legal_balls", "total_overs_bowled",
    "total_runs_conceded", "economy_rate", "bowling_average", "bowling_strike_rate",
    "total_maiden_overs", "total_dot_balls",
    "runs_1s_conceded", "runs_2s_conceded", "runs_3s_conceded",
    "runs_4s_conceded", "runs_5s_conceded", "runs_6s_conceded",
    "total_wickets", "five_wicket_hauls",
    "gold_load_timestamp"
)

# Write to Gold
df_bowl_overall.write.mode("overwrite").option("overwriteSchema", "true").saveAsTable(GOLD_BOWL_OVERALL)

bowl_overall_count = spark.table(GOLD_BOWL_OVERALL).count()
print(f"✓ {GOLD_BOWL_OVERALL}: {bowl_overall_count:,} rows written")
display(spark.table(GOLD_BOWL_OVERALL).limit(5))

---
## 6. MATCH SUMMARY

### 6.1 Match Summary Table

One row per **match** — extracted from Silver metadata. Includes winner, teams, venue, toss details, and match officials.
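
The winner and margin are pulled out of the result text with regular expressions. The sketch below applies the same two patterns with Python's `re` module; the `parse_result` helper and the sample strings are illustrative only. Like Spark's `regexp_extract`, it falls back to an empty string when the pattern does not match.

```python
import re


def parse_result(text):
    """Return (winner, win_margin) extracted from a result string."""
    winner = re.search(r"^(.+?)\s+won\s+by", text)
    margin = re.search(r"won\s+by\s+(.+)$", text)
    # Fall back to "" when nothing matches, as regexp_extract does
    return (
        winner.group(1) if winner else "",
        margin.group(1) if margin else "",
    )


print(parse_result("Team X won by 5 wickets"))  # ('Team X', '5 wickets')
print(parse_result("No result"))                # ('', '')
```

Downstream consumers should therefore treat an empty `winner` as "no winner parsed" rather than expecting NULL.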

In [0]:
# ──────────────────────────────────────────────
# MATCH SUMMARY TABLE
# ──────────────────────────────────────────────

# Extract winner from series_result column
# series_result typically contains text like "Team X won by Y runs" or "Team X won by Y wickets"
# Note: regexp_extract returns an empty string (not NULL) when the pattern does not match
df_match_summary = df_metadata.withColumn(
    "winner",
    regexp_extract(col("series_result"), r"^(.+?)\s+won\s+by", 1)
).withColumn(
    "win_margin",
    regexp_extract(col("series_result"), r"won\s+by\s+(.+)$", 1)
).withColumn(
    "result_type",
    when(lower(col("series_result")).contains("run"), lit("by runs"))
    .when(lower(col("series_result")).contains("wicket"), lit("by wickets"))
    .when(lower(col("series_result")).contains("super over"), lit("super over"))
    .when(lower(col("series_result")).contains("no result"), lit("no result"))
    .when(lower(col("series_result")).contains("tie"), lit("tie"))
    .otherwise(lit("other"))
)

# Select relevant match details
df_match_summary = df_match_summary.select(
    # Match identification
    col("matchid"),
    col("match_number"),
    col("series"),
    col("season"),

    # Teams
    col("first_innings").alias("team_1"),
    col("second_innings").alias("team_2"),

    # Venue & Date
    col("ground").alias("venue"),
    col("match_date"),
    col("match_start_utc"),

    # Toss
    col("toss").alias("toss_winner"),
    col("decision").alias("toss_decision"),

    # Result
    col("series_result").alias("result_text"),
    col("winner"),
    col("win_margin"),
    col("result_type"),

    # Super over
    col("has_super_over"),
    col("super_over_count"),

    # Awards
    col("player_of_the_match"),

    # Match officials
    col("umpires"),
    col("tv_umpire"),
    col("match_referee"),

    # Points
    col("points")
).withColumn("gold_load_timestamp", gold_load_ts)

# Write to Gold
df_match_summary.write.mode("overwrite").option("overwriteSchema", "true").saveAsTable(GOLD_MATCH_SUMMARY)

match_summary_count = spark.table(GOLD_MATCH_SUMMARY).count()
print(f"✓ {GOLD_MATCH_SUMMARY}: {match_summary_count:,} rows written")
display(spark.table(GOLD_MATCH_SUMMARY).limit(5))

---
## 7. PLAYER-TEAM RELATIONSHIP (SCD Type 2)

### 7.1 Player-Team SCD Type 2 Table

Tracks which team each player belonged to over time. When a player switches teams between
seasons, the old record is closed (end_date set, is_active = false) and a new record is opened.

**SCD Type 2 Logic:**
1. For each player, find all (player, team) combinations with their earliest and latest match dates
2. Order by start_date to detect team changes
3. Close previous team record when a new team is detected
4. Mark the most recent record as `is_active = true`

**Columns:** `player_name`, `team`, `start_date`, `end_date`, `is_active`
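
The four steps can be sketched in plain Python as follows. The `build_scd2` helper, the player name, and the dates are all made up for illustration; this is a simplified model of the logic, not the Spark implementation.

```python
from datetime import date


def build_scd2(appearances):
    """appearances: iterable of (player_name, team, match_date) tuples."""
    # Step 1: earliest and latest match date per (player, team)
    spans = {}
    for player, team, day in appearances:
        first, last = spans.get((player, team), (day, day))
        spans[(player, team)] = (min(first, day), max(last, day))

    # Step 2: collect each player's stints so they can be ordered by start date
    by_player = {}
    for (player, team), (first, _last) in spans.items():
        by_player.setdefault(player, []).append((first, team))

    records = []
    for player, stints in by_player.items():
        stints.sort()
        for i, (start, team) in enumerate(stints):
            is_last = i == len(stints) - 1
            records.append({
                "player_name": player,
                "team": team,
                "start_date": start,
                # Step 3: the next stint's start date closes this one
                "end_date": None if is_last else stints[i + 1][0],
                # Step 4: only the most recent stint is active
                "is_active": is_last,
            })
    return records


history = build_scd2([
    ("P. Example", "Team A", date(2020, 4, 1)),
    ("P. Example", "Team A", date(2021, 5, 1)),
    ("P. Example", "Team B", date(2022, 4, 2)),
])
for record in history:
    print(record)
```

One consequence of step 1 is that a player who later returns to a former team still gets a single span for that team, because spans are keyed by (player, team) rather than by consecutive stint.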

In [0]:
# ──────────────────────────────────────────────
# PLAYER-TEAM SCD TYPE 2
# ──────────────────────────────────────────────

# Step 1: Get all player-team-matchdate combinations from events + metadata
# Using events (has player names and matchid) joined with metadata (has match_date)

# Get batsmen appearances
df_bat_appearances = df_events.select(
    col("matchid"),
    col("batsman").alias("player_name"),
    col("team")
).distinct()

# Get squad appearances from the players table.
# A bowler plays for the side that is NOT batting, which the events table does not
# record directly, so bowlers (and all other players) are taken from match_players.
df_bowl_appearances = df_players.select(
    col("matchid"),
    col("player_name"),
    col("team")
).distinct()

# Combine all player appearances (union batsmen + all players from players table)
df_all_appearances = df_bat_appearances.unionByName(df_bowl_appearances).distinct()

# Join with metadata to get match_date
df_player_dates = df_all_appearances.join(
    df_metadata.select("matchid", "match_date", "series", "season"),
    "matchid",
    "left"
).filter(col("match_date").isNotNull())

# Step 2: For each player-team combo, get the earliest and latest match date
df_player_team_spans = df_player_dates.groupBy(
    "player_name", "team"
).agg(
    _min("match_date").alias("first_match_date"),
    _max("match_date").alias("last_match_date"),
    countDistinct("matchid").alias("matches_for_team")
)

# Step 3: Build SCD Type 2 using window functions
# Order each player's team stints by first_match_date
scd_window = Window.partitionBy("player_name").orderBy("first_match_date")
scd_window_desc = Window.partitionBy("player_name").orderBy(col("first_match_date").desc())

df_scd = df_player_team_spans.withColumn(
    "row_num", row_number().over(scd_window)
).withColumn(
    "latest_row", row_number().over(scd_window_desc)
)

# The next team's start date becomes this team's end date
df_scd = df_scd.withColumn(
    "next_team_start",
    lead("first_match_date").over(scd_window)
)

# Set start_date and end_date
df_scd = df_scd.withColumn(
    "start_date", col("first_match_date")
).withColumn(
    "end_date",
    when(col("next_team_start").isNotNull(), col("next_team_start"))
    .otherwise(lit(None).cast("date"))  # NULL = still active
).withColumn(
    "is_active",
    when(col("latest_row") == 1, lit(True)).otherwise(lit(False))
)

# Select final SCD Type 2 columns
df_scd_final = df_scd.select(
    "player_name",
    "team",
    "start_date",
    "end_date",
    "is_active",
    "matches_for_team"
).withColumn("gold_load_timestamp", gold_load_ts)

# Write to Gold
df_scd_final.write.mode("overwrite").option("overwriteSchema", "true").saveAsTable(GOLD_PLAYER_TEAM)

scd_count = spark.table(GOLD_PLAYER_TEAM).count()
active_count = spark.table(GOLD_PLAYER_TEAM).filter(col("is_active") == True).count()
print(f"✓ {GOLD_PLAYER_TEAM}: {scd_count:,} total records ({active_count:,} active)")
display(spark.table(GOLD_PLAYER_TEAM).orderBy("player_name", "start_date").limit(10))

### 7.2 SCD Type 2 — Validation Examples

Quick checks to verify the SCD2 logic is working correctly.
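
Beyond eyeballing the output, a couple of invariants one would expect from this SCD2 design can be checked mechanically. The sketch below is a plain-Python illustration over sample records; the `check_scd2` helper and the data are hypothetical.

```python
def check_scd2(records):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    active_per_player = {}
    for r in records:
        if r["is_active"]:
            active_per_player[r["player_name"]] = active_per_player.get(r["player_name"], 0) + 1
        # An active record should be open-ended, and a closed record should not be active
        if r["is_active"] != (r["end_date"] is None):
            problems.append(f"{r['player_name']}/{r['team']}: is_active and end_date disagree")
    for player in sorted({r["player_name"] for r in records}):
        if active_per_player.get(player, 0) != 1:
            problems.append(f"{player}: expected exactly one active record")
    return problems


good = [
    {"player_name": "P. Example", "team": "Team A", "end_date": "2022-04-02", "is_active": False},
    {"player_name": "P. Example", "team": "Team B", "end_date": None, "is_active": True},
]
bad = [
    {"player_name": "Q. Example", "team": "Team A", "end_date": None, "is_active": True},
    {"player_name": "Q. Example", "team": "Team B", "end_date": None, "is_active": True},
]
print(check_scd2(good))
print(check_scd2(bad))
```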

In [0]:
# ── Validation: Players who changed teams ──
print("Players with multiple team stints (team changes):")
multi_team = spark.table(GOLD_PLAYER_TEAM).groupBy("player_name").agg(
    count("*").alias("team_stints"),
    countDistinct("team").alias("distinct_teams")
).filter(col("distinct_teams") > 1).orderBy(col("team_stints").desc())

display(multi_team.limit(15))

# Show full history for a player who changed teams
if multi_team.count() > 0:
    sample_player = multi_team.first()["player_name"]
    print(f"\nFull SCD2 history for: {sample_player}")
    display(
        spark.table(GOLD_PLAYER_TEAM)
        .filter(col("player_name") == sample_player)
        .orderBy("start_date")
    )

---
## 8. Gold Layer Summary

In [0]:
print("\n" + "=" * 80)
print("GOLD LAYER PROCESSING COMPLETE")
print("=" * 80)

gold_tables = [
    ("Batting Stats Per Match",    GOLD_BAT_MATCH),
    ("Batting Stats Per Series",   GOLD_BAT_SERIES),
    ("Batting Stats Overall",      GOLD_BAT_OVERALL),
    ("Bowling Stats Per Match",    GOLD_BOWL_MATCH),
    ("Bowling Stats Per Series",   GOLD_BOWL_SERIES),
    ("Bowling Stats Overall",      GOLD_BOWL_OVERALL),
    ("Match Summary",              GOLD_MATCH_SUMMARY),
    ("Player-Team SCD2",           GOLD_PLAYER_TEAM),
]

print(f"\n{'Table':<35} {'Full Name':<50} {'Rows':>10}")
print("-" * 100)
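total_rows = 0
failed_tables = []
for label, table_name in gold_tables:
    try:
        row_count = spark.table(table_name).count()
        total_rows += row_count
        print(f"  ✓ {label:<33} {table_name:<50} {row_count:>10,}")
    except Exception as e:
        # Record the failure so the final status line reflects it
        failed_tables.append(table_name)
        print(f"  ✗ {label:<33} {table_name:<50} {'ERROR':>10}  ({type(e).__name__})")

print("-" * 100)
print(f"  {'TOTAL':<33} {'':<50} {total_rows:>10,}")
print(f"\nGold Layer Run Timestamp: {run_timestamp}")
if failed_tables:
    print(f"✗ {len(failed_tables)} Gold table(s) could not be read: {', '.join(failed_tables)}")
else:
    print("✅ All Gold tables written successfully.")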

In [0]:
# Run DQ checks after the Gold tables are written
dq_result = dbutils.notebook.run(
    "/Workspace/Shared/cricketCommentaryScraper/DataQualityRulesGoldLayer",
    timeout_seconds=600,  # 10-minute timeout
    arguments={"catalog_name": CATALOG_NAME},
)
print(f"DQ Result: {dq_result}")