🎯 Feature Engineering Overview

We have two problem statements:
	1.	First Innings Score Prediction (Regression)
    
	•	Dependent variable (target) = Final innings total runs.
    
	•	Independent variables (features) = match state during first innings (overs, wickets, teams, venue, run rate, etc.).
    
	2.	Second Innings Win Probability (Classification)
    
	•	Dependent variable (target) = Binary outcome (1 if chasing team wins, 0 if loses).
    
	•	Independent variables (features) = chase state (runs required, balls left, wickets in hand, run rate vs required run rate, venue, teams).

In [7]:
import sys
from pathlib import Path

# add src/ folder to Python path so we can import config
sys.path.append(str(Path.cwd().parents[0] / "src"))
import config as C

In [8]:
import pandas as pd

matches = pd.read_csv(C.MATCHES_CLEAN)
deliveries = pd.read_csv(C.DELIV_CLEAN)

Step 1: Load cleaned data
    
Step 2: Create “final score” for each innings
    
Step 3: Build independent features
    
Step 4: Encode categorical features
    
🏏 Part 2: Second Innings Win Probability

👉 Why? Predict whether chasing team will win (with probability %) based on current chase state.


Step 1: Get second innings deliveries

Step 2: Merge with match results
    
Step 3: Build chase state features

👉 Why each?
	•	runs_required = how many runs left.
	•	balls_left = how many balls left.
	•	wickets_left = chasing resource.
	•	current_run_rate (CRR) = batting form.
	•	required_run_rate (RRR) = pressure measure.
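
The five features above can be written as one small function. This is a minimal sketch for a single point in a 20-over chase. The function name, its argument names and the sample numbers are illustrative assumptions, not part of the dataset.

```python
def chase_state(target, runs_so_far, balls_bowled, wickets_fallen, total_balls=120):
    """Return chase-state features for one point in a T20 chase.

    All names are hypothetical; `target` is the score the chase is measured against.
    """
    runs_required = max(target - runs_so_far, 0)
    balls_left = max(total_balls - balls_bowled, 0)
    wickets_left = 10 - wickets_fallen
    # Rates are per over (6 balls); guard the divisions at the innings boundaries.
    crr = runs_so_far / balls_bowled * 6 if balls_bowled > 0 else 0.0
    rrr = runs_required / balls_left * 6 if balls_left > 0 else float("inf")
    return {
        "runs_required": runs_required,
        "balls_left": balls_left,
        "wickets_left": wickets_left,
        "crr": crr,
        "rrr": rrr,
    }


# Halfway through a chase of 180: 90 scored off 60 balls, 3 wickets down.
state = chase_state(target=180, runs_so_far=90, balls_bowled=60, wickets_fallen=3)
print(state)
```

In this made-up situation CRR and RRR are equal, so the chasing side is exactly on pace with 7 wickets in hand.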
    
Step 4: Target variable (win/loss)

👉 Binary outcome → Did chasing team = winner?

Step 5: Select final dataset
    
✅ Now you have:
	•	X_second = features (chase state)
	•	y_second = target (binary win/loss)

This dataset is ready for classification models like Logistic Regression, Random Forest, or XGBoost.
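
One caveat before fitting any of these models: every ball of a chase carries the same win/loss label, so a random row-level split can put deliveries from one match in both train and test. A possible way around this, sketched here on a hypothetical toy table (the data and the 25% hold-out are assumptions), is to split on `match_id` instead of on rows.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the ball-by-ball chase table: 4 matches, 3 rows each.
toy = pd.DataFrame({
    "match_id": np.repeat([1, 2, 3, 4], 3),
    "runs_required": [30, 25, 20, 50, 48, 45, 12, 8, 3, 60, 59, 55],
    "won": np.repeat([1, 0, 1, 0], 3),
})

# Shuffle the match ids (not the rows) and hold out 25% of the matches.
rng = np.random.default_rng(42)
match_ids = rng.permutation(toy["match_id"].unique())
n_test = max(1, int(len(match_ids) * 0.25))
test_ids = match_ids[:n_test]

is_test = toy["match_id"].isin(test_ids)
train, test = toy[~is_test], toy[is_test]

print("train matches:", sorted(train["match_id"].unique()))
print("test matches: ", sorted(test["match_id"].unique()))
```

With this split no match contributes rows to both sides, so the test score reflects matches the model has not seen.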

📌 Recap
	•	Dependent variable (y) = what we predict (score or win/loss).
	•	Independent variables (X) = features that describe the state of the game.
	•	Why: Each feature is chosen because it reflects cricket strategy (runs left, overs left, wickets left, etc.).

🏏 1. First Innings Score Prediction (Projected Score)

👉 Goal: After each delivery in the first innings, predict what the final score will be.

🔹 Key Factors Cricinfo (and we) consider
	1.	Runs so far
	•	The strongest baseline predictor. A team with 80/2 after 10 overs has a very different trajectory from 40/4.
	2.	Overs left & balls faced
	•	Context: A team at 80/2 after 10 overs vs 80/2 after 15 overs → very different outcomes.
	•	Run rate accelerates at different phases (powerplay vs death overs).
	3.	Wickets in hand
	•	This changes everything.
	•	90/1 in 10 overs → can aim for 180–200.
	•	90/5 in 10 overs → may struggle to reach 150.
	•	Models learn this “risk appetite” from history.
	4.	Current Run Rate (CRR)
	•	Indicator of how aggressively a team is scoring right now.
	5.	Run Rate by Phase
	•	Powerplay (overs 1–6): high scoring opportunities.
	•	Middle overs (7–15): often slower.
	•	Death overs (16–20): acceleration.
	•	Cricinfo models learn phase-based expected runs.
	6.	Venue factor (Ground size, pitch condition)
	•	Wankhede (Mumbai) → high scoring ground.
	•	Chepauk (Chennai) → slower pitch, lower scores.
	•	Historical average scores at venue feed into the model.
	7.	Opposition bowling strength
	•	If Bumrah + Rashid Khan are bowling, projections are lower.
	•	If part-timers bowl death overs, projections rise.
	•	Cricinfo has detailed player stats → we simplify by encoding bowling team.
	8.	Batting team strength
	•	RCB batting with AB de Villiers at the crease is different from a weaker lineup.
	•	At minimum, we capture batting team identity.
	9.	Match conditions (weather, dew, DLS)
	•	Dew at night → easier for batters.
	•	Rain interruptions → Duckworth–Lewis.
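
Factors 1 to 5 can all be derived from ball-by-ball data. The sketch below builds them on a hypothetical six-row table; the column names follow the notebook's `deliveries` frame, `is_wicket` is assumed to be a 0/1 flag, and the phase boundaries are the ones listed above.

```python
import pandas as pd

# Toy first-innings deliveries for one match (values are made up).
balls = pd.DataFrame({
    "match_id":   [1, 1, 1, 1, 1, 1],
    "over":       [1, 1, 1, 7, 7, 16],
    "ball":       [1, 2, 3, 1, 2, 1],
    "total_runs": [4, 0, 1, 6, 0, 2],
    "is_wicket":  [0, 1, 0, 0, 0, 1],
})

# Running state after each delivery.
balls["runs_so_far"] = balls.groupby("match_id")["total_runs"].cumsum()
balls["wickets_in_hand"] = 10 - balls.groupby("match_id")["is_wicket"].cumsum()
balls["balls_bowled"] = (balls["over"] - 1) * 6 + balls["ball"]
balls["balls_left"] = (120 - balls["balls_bowled"]).clip(lower=0)
balls["crr"] = balls["runs_so_far"] / balls["balls_bowled"] * 6

# Phase of the innings from the over number: 1-6, 7-15, 16-20.
balls["phase"] = pd.cut(balls["over"], bins=[0, 6, 15, 20],
                        labels=["powerplay", "middle", "death"])

# Regression target repeated on every row: the innings' final total.
balls["final_score"] = balls.groupby("match_id")["total_runs"].transform("sum")

print(balls[["runs_so_far", "wickets_in_hand", "balls_left", "crr", "phase", "final_score"]])
```

Each row is then one training example: the state after that delivery as features, the eventual total as target.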

🏏 2. Second Innings Win Probability (Chase Predictor)

👉 Goal: After each delivery in a chase, predict probability that chasing team will win.

🔹 Key Factors Cricinfo uses
	1.	Target vs Runs Required
	•	Most critical feature. How many runs left to chase.
	2.	Balls Left
	•	Directly interacts with runs required → Required Run Rate (RRR).
	3.	Wickets in Hand
	•	This acts like “lives left in a video game.”
	•	40 runs needed off 24 balls with 8 wickets in hand is a very different chase from the same equation with 2 wickets in hand.
	4.	Current Run Rate (CRR)
	•	If team is already scoring fast, probability increases.
	•	If behind required rate, probability drops.
	5.	Required Run Rate (RRR)
	•	Ratio of CRR vs RRR is crucial:
	•	CRR > RRR → team likely on track.
	•	CRR < RRR → pressure mounts.
	6.	Momentum (last 1–2 overs)
	•	Example: if 20 runs came in last over, model may increase probability.
	•	Captures short-term dynamics.
	7.	Venue factor (chasing bias)
	•	Some grounds (e.g., Eden Gardens) are better for chasing because of dew or short boundaries.
	8.	Historical chase difficulty
	•	Cricinfo uses historical win percentages for similar chase situations.
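	•	Example of the kind of statement this supports (the figure is illustrative, not a verified statistic): “In IPL history, teams chasing 50 off the last 30 balls with 6 wickets in hand have won 62% of the time.”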
	9.	Batting lineup strength
	•	If MS Dhoni is batting at death overs, win probability is higher.
	•	Cricinfo incorporates player-specific models.
	•	In our simpler model → just encode batting/bowling team.
	10.	Bowler quality / overs left
	•	If Bumrah still has overs left to bowl, the chasing team's win probability drops.
	•	Cricinfo uses ball-by-ball bowler matchup stats.
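
Of the factors above, momentum (factor 6) is the easiest to add to a simple model. One way to approximate it, sketched on hypothetical data, is a rolling sum of runs over the last few deliveries of each match. The window length and the `runs_last_6` column name are assumptions.

```python
import pandas as pd

# Toy chase deliveries for two matches (values are made up).
chase = pd.DataFrame({
    "match_id":   [1] * 8 + [2] * 4,
    "total_runs": [1, 0, 4, 6, 0, 1, 2, 4, 0, 0, 1, 6],
})

WINDOW = 6  # deliveries, roughly one over; use 12 for "last two overs"

# Rolling sum within each match; min_periods=1 keeps the first few balls defined.
chase["runs_last_6"] = (
    chase.groupby("match_id")["total_runs"]
         .transform(lambda s: s.rolling(WINDOW, min_periods=1).sum())
)

print(chase)
```

The window counts rows, so wides and no-balls are included, and it contains the current delivery. Grouping by `match_id` stops the window from reaching back into the previous match.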

In [9]:
matches.shape, deliveries.shape

((636, 17), (150460, 22))

2️⃣ First Innings Score Prediction Dataset

👉 Goal: Build dataset where each row = one first innings,
	•	Target (y) = final score.
	•	Features (X) = team, opponent, venue, season.

In [2]:
# ----- Cell 1: Imports and setup -----
import sys
from pathlib import Path
sys.path.append(str(Path.cwd().parents[0] / "src"))   # access config.py

import config as C
import pandas as pd
import numpy as np

# ----- Cell 2: Load cleaned datasets -----
matches = pd.read_csv(C.MATCHES_CLEAN)
deliveries = pd.read_csv(C.DELIV_CLEAN)

matches.shape, deliveries.shape

((636, 17), (150460, 22))

#### Step 2. Prepare First Innings Dataset (Score Prediction)

We want:
	•	Target (y): Final score of first innings
	•	Features (X): Batting team, bowling team, venue, season

In [4]:
# ----- Batting/bowling teams per (match_id, inning) -----
# Take the *first* ball of each (match, inning) to capture the batting/bowling teams for that inning
innings_meta = (deliveries
                .sort_values(["match_id","inning","over","ball"])
                .groupby(["match_id","inning"], as_index=False)
                .agg(batting_team=("batting_team","first"),
                     bowling_team=("bowling_team","first")))

# Total runs per (match_id, inning) as before
innings_runs = (deliveries
                .groupby(["match_id", "inning"])["total_runs"]
                .sum()
                .reset_index())

# Bring it all together: runs + inning meta + match context (venue, season)
innings_data = (innings_runs
                .merge(innings_meta, on=["match_id","inning"], how="left")
                .merge(matches[["id","venue","season"]], left_on="match_id", right_on="id", how="left"))

# Keep only first innings
first_innings = innings_data[innings_data["inning"] == 1].copy()

# Target variable = final total
first_innings["final_score"] = first_innings["total_runs"]

# Sanity check: do we have the columns now?
print(sorted(first_innings.columns))

# ----- Now select features safely -----
X_first = first_innings[["batting_team", "bowling_team", "venue", "season"]]
y_first = first_innings["final_score"]

X_first.head(), y_first.head()

['batting_team', 'bowling_team', 'final_score', 'id', 'inning', 'match_id', 'season', 'total_runs', 'venue']


(                  batting_team                 bowling_team  \
 0          Sunrisers Hyderabad  Royal Challengers Bangalore   
 2               Mumbai Indians       Rising Pune Supergiant   
 4                Gujarat Lions        Kolkata Knight Riders   
 6       Rising Pune Supergiant              Kings XI Punjab   
 8  Royal Challengers Bangalore             Delhi Daredevils   
 
                                        venue  season  
 0  Rajiv Gandhi International Stadium, Uppal    2017  
 2    Maharashtra Cricket Association Stadium    2017  
 4     Saurashtra Cricket Association Stadium    2017  
 6                     Holkar Cricket Stadium    2017  
 8                      M Chinnaswamy Stadium    2017  ,
 0    207
 2    184
 4    183
 6    163
 8    157
 Name: final_score, dtype: int64)
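
The feature set here carries only the venue's name, while the factor list earlier mentions historical average scores at a venue. A minimal sketch of that idea on hypothetical data follows. It computes averages from earlier seasons only, so the held-out season does not leak into its own feature. The `venue_avg_score` column and the season cut-off are assumptions.

```python
import pandas as pd

# Toy innings-level table (values are made up).
innings = pd.DataFrame({
    "venue":       ["A", "A", "B", "B", "B"],
    "season":      [2015, 2016, 2015, 2016, 2017],
    "final_score": [180, 200, 140, 150, 160],
})

# Use seasons before 2017 as "history"; 2017 plays the role of unseen data.
history = innings[innings["season"] < 2017]
venue_avg = history.groupby("venue")["final_score"].mean()
overall_avg = history["final_score"].mean()

# Venues with no history fall back to the overall average.
innings["venue_avg_score"] = innings["venue"].map(venue_avg).fillna(overall_avg)

print(innings)
```

This turns the venue into a single numeric column, which can be used alongside or instead of the one-hot venue columns.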

#### Step 3. Encode Categorical Features

In [5]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
X_encoded = encoder.fit_transform(X_first)

# Reuse X_first's index so the encoded rows stay aligned with y_first
X_first_encoded = pd.DataFrame(X_encoded,
                               columns=encoder.get_feature_names_out(X_first.columns),
                               index=X_first.index)

X_first_encoded.shape, y_first.shape

((636, 71), (636,))
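
The 71 columns are one column per distinct value across the four categorical features. The sketch below checks that relationship on a hypothetical four-row table, using `pd.get_dummies` as a stand-in for `OneHotEncoder` (an assumption made to keep the example to pandas only).

```python
import pandas as pd

# Toy feature table with the same four columns (values are made up).
X_toy = pd.DataFrame({
    "batting_team": ["MI", "CSK", "MI", "RCB"],
    "bowling_team": ["CSK", "MI", "RCB", "MI"],
    "venue":        ["Wankhede", "Chepauk", "Wankhede", "Chinnaswamy"],
    "season":       [2016, 2016, 2017, 2017],
})

# One output column per distinct value in each input column: 3 + 3 + 3 + 2.
expected_cols = int(X_toy.nunique().sum())

# Passing `columns=` forces the numeric season column to be encoded too.
dummies = pd.get_dummies(X_toy, columns=list(X_toy.columns))

print(expected_cols, dummies.shape)
```

The same check on the real data would be `X_first.nunique().sum()`, which should equal the second dimension of `X_first_encoded`.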

#### Step 4. Prepare Second Innings Dataset (Win Probability)

In [6]:
# Now we’ll build the ball-by-ball features that represent the current state of a chase.
# ----- Cell 5: Filter 2nd innings -----

second_innings = deliveries[deliveries["inning"] == 2].copy()

# Merge with match info to get winner/team data
second_innings = second_innings.merge(matches[["id","winner","team1","team2"]],
                                      left_on="match_id", right_on="id")

#### Step 5. Compute “Chase State” Features

These are the core features used by Cricinfo-like win predictors.

In [9]:
import numpy as np

# --- Make sure we’re only using the normal two innings (ignore super overs if any) ---
deliveries_clean = deliveries[deliveries["inning"].isin([1, 2])].copy()

# --- 2nd-innings deliveries ---
second_innings = deliveries_clean[deliveries_clean["inning"] == 2].copy()

# attach winner/team info
second_innings = second_innings.merge(
    matches[["id", "winner", "team1", "team2", "season"]],
    left_on="match_id", right_on="id", how="left"
)

# --- runs_so_far (cumulative for the chase) ---
second_innings["runs_so_far"] = (
    second_innings.groupby("match_id")["total_runs"].cumsum()
)

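# --- build targets from the FIRST innings totals ---
# NOTE: "target" here is the first-innings total itself; the chasing side needs
# total + 1 to win, so runs_required == 0 below means the scores are level.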
targets = (
    deliveries_clean[deliveries_clean["inning"] == 1]
    .groupby("match_id", as_index=False)["total_runs"].sum()
    .rename(columns={"total_runs": "target"})
)

# defensive: if a column named 'target' somehow exists already in second_innings, drop it
if "target" in second_innings.columns:
    second_innings = second_innings.drop(columns=["target"])

# merge targets (one target per match)
second_innings = second_innings.merge(
    targets, on="match_id", how="left", validate="many_to_one"
)

# quick sanity check: should be 0
n_missing_target = second_innings["target"].isna().sum()
print("Missing targets after merge:", n_missing_target)

# --- runs_required ---
second_innings["runs_required"] = second_innings["target"] - second_innings["runs_so_far"]

# --- balls_bowled and balls_left ---
# Ensure over & ball are integers (if they aren't already)
second_innings["over"] = second_innings["over"].astype(int)
second_innings["ball"] = second_innings["ball"].astype(int)

second_innings["balls_bowled"] = (second_innings["over"] - 1) * 6 + second_innings["ball"]
second_innings["balls_left"] = 120 - second_innings["balls_bowled"]

# clamp negative balls_left (rare data quirks)
second_innings["balls_left"] = second_innings["balls_left"].clip(lower=0)

# --- wickets_left ---
# If your cleaning didn’t create 'is_wicket', compute it now
if "is_wicket" not in second_innings.columns:
    # count as wicket if dismissal_kind is set and not 'retired hurt'
    second_innings["is_wicket"] = (
        second_innings["dismissal_kind"].notna() &
        (second_innings["dismissal_kind"].str.lower() != "retired hurt")
    ).astype(int)

second_innings["wickets_left"] = 10 - second_innings.groupby("match_id")["is_wicket"].cumsum()

# --- CRR & RRR (guard against divide-by-zero) ---
# CRR = runs_so_far / overs_bowled = (runs_so_far / balls_bowled) * 6  (we compute via balls to be precise)
second_innings["crr"] = np.where(
    second_innings["balls_bowled"] > 0,
    (second_innings["runs_so_far"] / second_innings["balls_bowled"]) * 6,
    0.0
)

# RRR = runs_required / overs_left = (runs_required / balls_left) * 6  (guard when balls_left == 0)
second_innings["rrr"] = np.where(
    second_innings["balls_left"] > 0,
    (second_innings["runs_required"] / second_innings["balls_left"]) * 6,
    np.inf  # if no balls left but still runs needed → effectively infinite RRR
)

# final inspect
second_innings[["match_id","runs_so_far","target","runs_required","balls_bowled","balls_left","wickets_left","crr","rrr"]].head()

Missing targets after merge: 0


Unnamed: 0,match_id,runs_so_far,target,runs_required,balls_bowled,balls_left,wickets_left,crr,rrr
0,1,1,207,206,1,119,10,6.0,10.386555
1,1,1,207,206,2,118,10,3.0,10.474576
2,1,1,207,206,3,117,10,2.0,10.564103
3,1,3,207,204,4,116,10,4.5,10.551724
4,1,7,207,200,5,115,10,8.4,10.434783
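
As a quick check, the first row of this table can be reproduced by hand. The inputs below are copied from that row; only the variable names are introduced here.

```python
# Inputs copied from row 0 of the table above.
target, runs_so_far, balls_bowled = 207, 1, 1

balls_left = 120 - balls_bowled            # 20 overs = 120 balls
runs_required = target - runs_so_far
crr = runs_so_far / balls_bowled * 6       # runs per over so far
rrr = runs_required / balls_left * 6       # runs per over still needed

print(runs_required, balls_left, round(crr, 1), round(rrr, 6))
```

The printed values match the `runs_required`, `balls_left`, `crr` and `rrr` columns of that row.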


#### Step 6. Define the Target Variable (Win/Loss)

In [10]:
# target: did the chasing team win?
second_innings["chasing_team"] = second_innings["batting_team"]
second_innings["won"] = (second_innings["chasing_team"] == second_innings["winner"]).astype(int)

X_second = second_innings[["runs_required", "balls_left", "wickets_left", "crr", "rrr"]]
y_second = second_innings["won"]

X_second.head(), y_second.head()

(   runs_required  balls_left  wickets_left  crr        rrr
 0            206         119            10  6.0  10.386555
 1            206         118            10  3.0  10.474576
 2            206         117            10  2.0  10.564103
 3            204         116            10  4.5  10.551724
 4            200         115            10  8.4  10.434783,
 0    0
 1    0
 2    0
 3    0
 4    0
 Name: won, dtype: int64)

#### Step 7. Save Processed Data

Save the feature sets so the modeling notebook can load them directly.

In [11]:
# Save processed feature sets
X_first_encoded.to_csv(C.PROC / "X_first.csv", index=False)
y_first.to_csv(C.PROC / "y_first.csv", index=False)
X_second.to_csv(C.PROC / "X_second.csv", index=False)
y_second.to_csv(C.PROC / "y_second.csv", index=False)

print("Feature datasets saved in:", C.PROC)

Feature datasets saved in: /Users/vijay/Documents/G2i/ML Project/cricket-winprob/data/processed


In [14]:
import numpy as np
import pandas as pd

# 1) Clamp impossible states
second_innings["runs_required"] = second_innings["runs_required"].clip(lower=0)
second_innings["balls_left"]     = second_innings["balls_left"].clip(lower=0)

# 2) Recompute RRR safely:
#    - If balls_left == 0 → RRR = 0 (innings ended)
#    - If runs_required == 0 → RRR = 0 (target achieved)
second_innings["rrr"] = np.where(
    (second_innings["balls_left"] > 0) & (second_innings["runs_required"] > 0),
    (second_innings["runs_required"] / second_innings["balls_left"]) * 6,
    0.0
)

# 3) CRR was already guarded; but ensure no inf/NaN remain
#    (assign the result back; chained inplace replace triggers a pandas FutureWarning)
for col in ["crr", "rrr"]:
    second_innings[col] = second_innings[col].replace([np.inf, -np.inf], np.nan)

# 4) Drop any rows that still have NaN in feature columns
feature_cols = ["runs_required", "balls_left", "wickets_left", "crr", "rrr"]
clean_mask = second_innings[feature_cols].notna().all(axis=1)
second_innings = second_innings.loc[clean_mask].copy()

# 5) Rebuild X_second / y_second from the cleaned frame
X_second = second_innings[feature_cols].copy()
y_second = second_innings["won"].astype(int).copy()

print("Clean shapes:", X_second.shape, y_second.shape)
print("Any inf left?", np.isinf(X_second.values).any(), "Any NaN left?", np.isnan(X_second.values).any())

Clean shapes: (72350, 5) (72350,)
Any inf left? False Any NaN left? False
