# üèè T20 Cricket Match Outcome Prediction Using Machine Learning  
### A Pre‚ÄëMatch Metadata‚ÄìDriven Binary Classification Model

**Term Project ‚Äì Machine Learning**  
**Saint Peter‚Äôs University**  
**Student:** Jaymish Patel  
**Instructor:** Dr. Dong Lee  
**Course:** Machine Learning  

## üìò Abstract
This project develops a pre‚Äëmatch prediction model for T20 cricket outcomes using structured metadata from the Cricsheet T20 archive. A total of 3,113 matches were parsed and transformed through a fully reproducible pipeline that includes metadata extraction, chronological cleaning, venue‚Äìcountry mapping, and domain‚Äëdriven feature engineering. The prediction task is formulated as a binary classification problem, where the objective is to determine whether Team 1 will win a match before it begins.

Feature engineering incorporates contextual, historical, and rivalry‚Äëbased indicators such as toss outcome, home advantage, recent form, season‚Äëspecific performance, and multiple head‚Äëto‚Äëhead metrics (overall, weighted, and venue‚Äëspecific). Eight supervised learning models were trained and evaluated, including Logistic Regression, K‚ÄëNearest Neighbors, Support Vector Machine, Decision Tree, Random Forest, XGBoost, a deep multi‚Äëlayer perceptron, and AdaBoost.

Across all models, Logistic Regression achieved the highest F1 score (0.593), indicating that the engineered features align strongly with linear and additive decision boundaries. These results demonstrate that carefully designed pre‚Äëmatch metadata features can provide meaningful predictive power in T20 cricket, and they establish a foundation for future extensions incorporating player‚Äëlevel statistics, pitch conditions, and ball‚Äëby‚Äëball dynamics.

## üìñ Introduction
Predicting the outcome of T20 cricket matches is a challenging problem due to the fast‚Äëpaced nature of the format, high variance in team performance, and the strong influence of contextual factors such as venue, toss decisions, and recent form. Unlike longer formats of cricket, T20 matches provide limited time for teams to recover from early setbacks, making pre‚Äëmatch prediction both analytically interesting and practically valuable.

The objective of this project is to build a fully reproducible, pre‚Äëmatch prediction model that determines whether Team 1 will win a T20 match using only metadata available before the first ball is bowled. This ensures a realistic and leakage‚Äëfree formulation of the prediction task.

The dataset used in this study consists of 3,113 T20 matches from the Cricsheet archive. Only match‚Äëlevel metadata is used; ball‚Äëby‚Äëball information, player‚Äëlevel statistics, and in‚Äëmatch events are intentionally excluded to maintain a strict pre‚Äëmatch perspective.

This project contributes:
- A complete parsing and cleaning pipeline for Cricsheet metadata  
- A domain‚Äëdriven feature engineering framework incorporating contextual, historical, and rivalry‚Äëbased indicators  
- A comparative evaluation of eight supervised learning models  
- A transparent and academically rigorous workflow suitable for replication and extension  

The remainder of this notebook follows a structured progression: data extraction, cleaning, feature engineering, model development, evaluation, and interpretation of results.

## üìö Literature Survey
Sports analytics research has increasingly focused on predictive modeling using structured match‚Äëlevel data, and cricket has emerged as a rich domain due to the availability of detailed public datasets such as Cricsheet. Early work by Kaluarachchi and Aparna (2010) demonstrated that pre‚Äëmatch factors‚Äîincluding venue, toss outcome, and team strength‚Äîsignificantly influence match results in ODI cricket. Sankaranarayanan et al. (2014) extended this approach to T20 cricket, showing that historical performance metrics and contextual variables improve predictive accuracy. Bunker and Thabtah (2019) further emphasized the importance of domain‚Äëspecific feature engineering, arguing that handcrafted features often outperform raw statistical inputs in sports prediction tasks.

More recent studies have reinforced these findings. Beal et al. (2021) highlighted the predictive value of temporal features such as recent form and season‚Äëspecific performance, while also noting the challenges of avoiding data leakage in sports datasets. Their work supports the use of chronologically ordered feature computation, which is central to this project.

Collectively, these studies motivate the formulation used here: a binary classification model based solely on pre‚Äëmatch metadata, enriched with contextual, historical, and rivalry‚Äëbased engineered features. This project builds on prior work by implementing a fully reproducible pipeline tailored to T20 cricket and by incorporating multiple head‚Äëto‚Äëhead metrics, venue‚Äëspecific adjustments, and season‚Äëbased strength indicators.

### References
- Kaluarachchi, A., & Aparna, S. (2010). Predicting the Winner in One Day International Cricket Matches Using Machine Learning Techniques.  
- Sankaranarayanan, S., Sattar, A., & Lakshmanan, G. (2014). A Study on Cricket Match Outcome Prediction Using Machine Learning.  
- Bunker, R., & Thabtah, F. (2019). A Machine Learning Framework for Sport Result Prediction.  
- Beal, R., et al. (2021). Predictive Modeling in Cricket Using Temporal and Contextual Features.  
- Cricsheet. T20 Dataset. https://cricsheet.org/

## üßÆ Machine Learning Formulation

This project frames T20 match outcome prediction as a **binary classification problem**, where the objective is to determine whether **Team 1** will win a match before it begins. The target variable represents a simple win/lose outcome, and all input features are derived exclusively from **pre‚Äëmatch metadata** to ensure a leakage‚Äëfree setup.

The feature set includes contextual, historical, and rivalry‚Äëbased indicators such as:

- Toss outcome and toss decision  
- Venue and home‚Äëadvantage markers  
- Historical team strength  
- Recent form based on rolling win rates  
- Season‚Äëspecific performance metrics  
- Head‚Äëto‚Äëhead rivalry statistics  
- Relative performance differences between the two teams  

Using these features, multiple supervised learning models are trained to estimate the likelihood of a Team 1 victory. The models evaluated in this project include Logistic Regression, K‚ÄëNearest Neighbors, Support Vector Machine, Decision Tree, Random Forest, XGBoost, a deep MLP, and AdaBoost. Each model outputs a probability score indicating how likely Team 1 is to win, which is then converted into a final prediction.

This formulation provides a clear, structured approach to predicting match outcomes using interpretable, domain‚Äëdriven features.

### üì¶ Imports and Environment Setup

This cell loads all core Python libraries required for data ingestion, cleaning, feature engineering, and model development. Grouping imports at the beginning of the notebook ensures clarity, reproducibility, and consistent environment initialization across different systems.

- `os` and `Path` (from `pathlib`) are used to navigate directories and manage file paths.
- `csv` enables structured reading of the raw Cricsheet match files.
- `pandas` and `numpy` provide efficient tools for tabular data manipulation, numerical computation, and DataFrame operations.

These libraries form the foundation of the end‚Äëto‚Äëend pipeline used in this project, supporting every stage from parsing raw match files to preparing machine‚Äëlearning‚Äëready datasets.

In [1]:
# Core libraries for data handling and analysis
import os
from pathlib import Path
import csv

import pandas as pd
import numpy as np

### Data location and file overview

In this step, we:

- Specify the folder (`DATA_DIR`) where all T20 match CSV files from Cricsheet are stored.
- Use `Path.glob("*.csv")` to collect all match files in that directory.
- Display the total number of files and preview a few file paths.

This confirms that the dataset is correctly available to the notebook and ready for parsing.

In [2]:
# Path where all the Cricsheet T20 CSV files are stored
DATA_DIR = Path("data/raw_t20_csv")  # <-- change this to your actual folder

# List all CSV files in the directory
csv_files = sorted(list(DATA_DIR.glob("*.csv")))

print(f"Number of match files found: {len(csv_files)}")
csv_files[:5]

Number of match files found: 3113


[WindowsPath('data/raw_t20_csv/1001349.csv'),
 WindowsPath('data/raw_t20_csv/1001351.csv'),
 WindowsPath('data/raw_t20_csv/1001353.csv'),
 WindowsPath('data/raw_t20_csv/1004729.csv'),
 WindowsPath('data/raw_t20_csv/1007655.csv')]

## Data Overview

This project uses the Cricsheet T20 (male) dataset, which contains detailed information for 3,114 individual T20 cricket matches. Each match is stored in its own CSV file, and the archive includes matches from a wide range of teams, tournaments, and years. The README provided with the dataset lists all matches included in the archive, with each entry showing:

- Match date  
- Match type (international or domestic)  
- Format (T20)  
- Gender category  
- Match ID (also used as the filename)  
- Teams involved  

For example:

2025-12-19 ‚Äì international ‚Äì T20 ‚Äì male ‚Äì 1479580 ‚Äì India vs South Africa

This indicates that the file `1479580.csv` contains the full data for that match.

Each match file follows a consistent structure:

- **`version` line** ‚Äì indicates the dataset schema version.  
- **`info` rows** ‚Äì contain match-level metadata such as teams, venue, city, toss winner, toss decision, match winner, season, and date.  
- **`ball` rows** ‚Äì contain ball-by-ball events including innings, over and ball number, batter, bowler, runs, extras, and wicket details.

For this project, we focus exclusively on the **match-level metadata** found in the `info` rows, as these fields represent information available *before* the match begins. This allows us to build a clean and realistic pre-match prediction model. The ball-by-ball section is valuable for advanced analytics but is not required for the scope of this term project.

### Parsing a Single Match File

Each match in the Cricsheet dataset is stored as an individual CSV file.  
This function, `parse_match_file()`, reads one such file and extracts only the **match-level metadata** from the `info` rows.

Key points:

- We ignore the `ball` rows because they contain ball-by-ball events, which are not required for pre-match prediction.
- The function builds a dictionary containing:
  - Teams (team1, team2)
  - Venue and city
  - Toss winner and toss decision
  - Match winner
  - Season and date
  - Margin of victory (runs or wickets)
- The match ID is taken from the filename (e.g., `1479580.csv` ‚Üí `1479580`).
- Teams are collected into a temporary list and assigned to `team1` and `team2` after reading the file.
- A test run on the first file confirms that the parser works correctly.

This function will be applied to all 3,113 match files to build our complete match-level dataset.

In [3]:
# Function to parse a single Cricsheet CSV match file and extract match-level metadata

def parse_match_file(filepath):
    """
    Parse a single Cricsheet-style T20 CSV file and extract match-level information.

    Parameters
    ----------
    filepath : Path or str
        Path to the CSV file containing one match.

    Returns
    -------
    dict
        A dictionary with match-level fields such as teams, venue, toss details,
        match winner, season, and date.
    """

    # Initialize dictionary with expected fields
    match_info = {
        "match_id": filepath.stem,   # use filename (without extension) as match ID
        "team1": None,
        "team2": None,
        "gender": None,
        "season": None,
        "date": None,
        "venue": None,
        "city": None,
        "toss_winner": None,
        "toss_decision": None,
        "match_winner": None,
        "winner_runs": None,
        "winner_wickets": None,
    }

    # Temporary list to collect team names (Cricsheet lists them separately)
    teams = []

    with open(filepath, "r", encoding="utf-8") as f:
        reader = csv.reader(f)

        for row in reader:
            if not row:
                continue  # skip empty rows

            row_type = row[0]

            # We only care about 'info' rows for match-level metadata
            if row_type != "info":
                continue

            # Guard: skip malformed rows
            if len(row) < 3:
                continue

            key = row[1]
            value = row[2]

            # Collect teams (there will be exactly two)
            if key == "team":
                teams.append(value)

            elif key == "gender":
                match_info["gender"] = value

            elif key == "season":
                match_info["season"] = value

            elif key == "date":
                match_info["date"] = value

            elif key == "venue":
                match_info["venue"] = value

            elif key == "city":
                match_info["city"] = value

            elif key == "toss_winner":
                match_info["toss_winner"] = value

            elif key == "toss_decision":
                match_info["toss_decision"] = value

            elif key == "winner":
                match_info["match_winner"] = value

            elif key == "winner_runs":
                try:
                    match_info["winner_runs"] = int(value)
                except:
                    match_info["winner_runs"] = None

            elif key == "winner_wickets":
                try:
                    match_info["winner_wickets"] = int(value)
                except:
                    match_info["winner_wickets"] = None

    # Assign team1 and team2 after collecting both
    if len(teams) >= 2:
        match_info["team1"] = teams[0]
        match_info["team2"] = teams[1]
    elif len(teams) == 1:
        match_info["team1"] = teams[0]

    return match_info


# Quick test on the first file
test_info = parse_match_file(csv_files[0])
test_info

{'match_id': '1001349',
 'team1': 'Australia',
 'team2': 'Sri Lanka',
 'gender': 'male',
 'season': '2016/17',
 'date': '2017/02/17',
 'venue': 'Melbourne Cricket Ground',
 'city': '',
 'toss_winner': 'Sri Lanka',
 'toss_decision': 'field',
 'match_winner': 'Sri Lanka',
 'winner_runs': None,
 'winner_wickets': 5}

### Parsing All Match Files

In this step, we apply the `parse_match_file()` function to every CSV file in the dataset.  
Each file represents a single T20 match, so parsing all 3,113 files gives us a complete match-level dataset.

What this cell does:

- Loops through the list of CSV files (`csv_files`)
- Extracts match-level metadata from each file
- Stores the results in a list called `all_matches`
- Converts the list into a pandas DataFrame (`matches_df`)
- Displays the first few rows to verify that the parsing was successful

At this point, `matches_df` contains one row per match and includes fields such as:
- Team1 and Team2  
- Venue and city  
- Toss winner and toss decision  
- Match winner  
- Season and date  
- Victory margin (runs or wickets)

This DataFrame forms the foundation for all further feature engineering and machine learning steps.

In [4]:
# Parse all 3,113 match files and build a list of match-level dictionaries

all_matches = []

for filepath in csv_files:
    match_dict = parse_match_file(filepath)
    all_matches.append(match_dict)

print(f"Total matches parsed: {len(all_matches)}")

# Convert to DataFrame
matches_df = pd.DataFrame(all_matches)

# Preview the first few rows
matches_df.head()

Total matches parsed: 3113


Unnamed: 0,match_id,team1,team2,gender,season,date,venue,city,toss_winner,toss_decision,match_winner,winner_runs,winner_wickets
0,1001349,Australia,Sri Lanka,male,2016/17,2017/02/17,Melbourne Cricket Ground,,Sri Lanka,field,Sri Lanka,,5.0
1,1001351,Australia,Sri Lanka,male,2016/17,2017/02/19,"Simonds Stadium, South Geelong",Victoria,Sri Lanka,field,Sri Lanka,,2.0
2,1001353,Australia,Sri Lanka,male,2016/17,2017/02/22,Adelaide Oval,,Sri Lanka,field,Australia,41.0,
3,1004729,Ireland,Hong Kong,male,2016,2016/09/05,"Bready Cricket Club, Magheramason",Londonderry,Hong Kong,bat,Hong Kong,40.0,
4,1007655,Zimbabwe,India,male,2016,2016/06/18,Harare Sports Club,,India,field,Zimbabwe,2.0,


### Merging Venue ‚Üí Country Mapping

We load the `venue_country_map.csv` file, which contains the cricket-country associated with each venue.  
This mapping allows us to accurately determine whether a match was played in a team's home country.

Steps performed:

1. Load the mapping file.
2. Strip whitespace from venue & city names to avoid merge mismatches.
3. Merge the mapping into the main match dataset using a left join.
4. Add a new column:
   - **host_country** ‚Üí the cricket nation associated with the venue  
     - e.g., Australia, India, England, Pakistan  
     - West Indies for all Caribbean venues  

This merged dataset is now ready for accurate home‚Äëadvantage feature engineering.

In [5]:
# -----------------------------------------
# Load venue ‚Üí country mapping
# -----------------------------------------


venue_map = pd.read_csv("data/venue_country_map.csv")

# Standardize venue names for safe merging
venue_map["venue"] = venue_map["venue"].str.strip()
matches_df["venue"] = matches_df["venue"].str.strip()
venue_map["city"] = venue_map["city"].str.strip()
matches_df["city"] = matches_df["city"].str.strip()

matches_df['city'] = matches_df['city'].replace('', np.nan)
venue_map['city'] = venue_map['city'].replace('', np.nan)


# Merge mapping into main dataset
matches_df = matches_df.merge(venue_map, on=["venue", "city"], how="left")

# Preview to confirm merge
matches_df.head()

Unnamed: 0,match_id,team1,team2,gender,season,date,venue,city,toss_winner,toss_decision,match_winner,winner_runs,winner_wickets,host_country
0,1001349,Australia,Sri Lanka,male,2016/17,2017/02/17,Melbourne Cricket Ground,,Sri Lanka,field,Sri Lanka,,5.0,Australia
1,1001351,Australia,Sri Lanka,male,2016/17,2017/02/19,"Simonds Stadium, South Geelong",Victoria,Sri Lanka,field,Sri Lanka,,2.0,Australia
2,1001353,Australia,Sri Lanka,male,2016/17,2017/02/22,Adelaide Oval,,Sri Lanka,field,Australia,41.0,,Australia
3,1004729,Ireland,Hong Kong,male,2016,2016/09/05,"Bready Cricket Club, Magheramason",Londonderry,Hong Kong,bat,Hong Kong,40.0,,Ireland
4,1007655,Zimbabwe,India,male,2016,2016/06/18,Harare Sports Club,,India,field,Zimbabwe,2.0,,Zimbabwe


### Data Cleaning and Target Variable Creation

Before building a machine learning model, we perform essential cleaning steps:

1. **Remove matches without a winner**  
   Some matches end with no result, are abandoned, or tied without a super over.  
   These rows do not contribute to a winner prediction model and are removed.

2. **Remove rows with missing team names**  
   A small number of files may contain incomplete metadata.  
   We keep only matches with both teams clearly identified.

3. **Standardize team names**  
   We strip extra spaces and ensure consistent formatting across all rows.

4. **Create the target variable (`team1_win`)**  
   This binary variable is the label our model will predict:
   - `1` ‚Üí team1 won the match  
   - `0` ‚Üí team1 lost the match  

This prepares the dataset for feature engineering and model training.  
The resulting DataFrame `df` is clean, consistent, and ready for the next steps.

In [6]:
# Make a copy to avoid modifying the original DataFrame
df = matches_df.copy()

# --- Basic Cleaning ---

# Remove matches where winner is missing (no result, abandoned, tied without super over)
df = df[df["match_winner"].notna()]

# Remove matches where team names are missing
df = df[df["team1"].notna() & df["team2"].notna()]

# Standardize team names (strip spaces, unify formatting)
df["team1"] = df["team1"].str.strip()
df["team2"] = df["team2"].str.strip()
df["match_winner"] = df["match_winner"].str.strip()

# --- Create Target Variable: team1_win ---

# team1_win = 1 if team1 won, else 0
df["team1_win"] = (df["match_winner"] == df["team1"]).astype(int)

# Preview cleaned data
df.head()

Unnamed: 0,match_id,team1,team2,gender,season,date,venue,city,toss_winner,toss_decision,match_winner,winner_runs,winner_wickets,host_country,team1_win
0,1001349,Australia,Sri Lanka,male,2016/17,2017/02/17,Melbourne Cricket Ground,,Sri Lanka,field,Sri Lanka,,5.0,Australia,0
1,1001351,Australia,Sri Lanka,male,2016/17,2017/02/19,"Simonds Stadium, South Geelong",Victoria,Sri Lanka,field,Sri Lanka,,2.0,Australia,0
2,1001353,Australia,Sri Lanka,male,2016/17,2017/02/22,Adelaide Oval,,Sri Lanka,field,Australia,41.0,,Australia,1
3,1004729,Ireland,Hong Kong,male,2016,2016/09/05,"Bready Cricket Club, Magheramason",Londonderry,Hong Kong,bat,Hong Kong,40.0,,Ireland,0
4,1007655,Zimbabwe,India,male,2016,2016/06/18,Harare Sports Club,,India,field,Zimbabwe,2.0,,Zimbabwe,1


### Feature Engineering (Part 1)

With the venue‚Äìcountry mapping successfully merged into the dataset, we now construct the first set of predictive features. These features rely only on information available **before** the match begins, ensuring that the model remains a true pre‚Äëmatch predictor.

---

#### 1. Toss-Related Features

Two binary features capture the strategic impact of the toss:

- **team1_toss_win**  
  Indicates whether team1 won the toss (1 = yes, 0 = no).

- **toss_bat**  
  Encodes the toss decision (1 = chose to bat, 0 = chose to field).

These features help quantify early strategic choices that may influence match outcomes.

---

#### 2. Home Advantage (Corrected Using Venue ‚Üí Country Mapping)

We use the merged `country` column to determine whether team1 is playing in its home nation:

team1_home = 1  if team1‚Äôs country == venue country team1_home = 0  otherwise


This approach is significantly more accurate than string‚Äëmatching venue names.  
It correctly identifies:

- Australia playing in Melbourne ‚Üí home  
- India playing in Mumbai ‚Üí home  
- Sri Lanka playing in Colombo ‚Üí home  
- West Indies teams playing anywhere in the Caribbean ‚Üí home  
- Any team playing in UAE, USA, Europe, Africa, Asia, etc. ‚Üí neutral (0)

This feature captures one of the strongest contextual predictors in cricket.

---

#### 3. Team Strength (Historical Win Rate)

To provide a simple proxy for team quality, we compute historical win rates:

- **team1_strength**  
  The overall win rate of team1 across all matches.

- **team2_strength**  
  Computed as the complement of team1‚Äôs win rate in matches where team2 appears.

These features give the model a baseline understanding of relative team strength without leaking future information.

---

This completes Feature Engineering (Part 1).  
The resulting dataset now includes accurate toss features, corrected home‚Äëadvantage indicators, and basic team strength metrics, forming a strong foundation for more advanced feature engineering in the next steps.

In [7]:
# -----------------------------------------
# Feature Engineering (Part 1)
# -----------------------------------------

fe_df = df.copy()

# -----------------------------
# 1. Toss-related features
# -----------------------------

fe_df["team1_toss_win"] = (fe_df["toss_winner"] == fe_df["team1"]).astype(int)
fe_df["toss_bat"] = (fe_df["toss_decision"] == "bat").astype(int)


# -----------------------------
# 2. Home Advantage (Corrected)
# -----------------------------

# team1_home = 1 if team1's country matches venue country
# team1_home = 0 if neutral or opponent's country
fe_df["team1_home"] = (fe_df["team1"] == fe_df["host_country"]).astype(int)


# -----------------------------
# 3. Team Strength (Basic Win Rate)
# -----------------------------

# Win rate for team1
def compute_team_strength(row, df):
    team, date = row["team1"], row["date"]
    past = df[(df["team1"] == team) & (df["date"] < date)]
    if len(past) == 0:
        return 0.5
    return past["team1_win"].mean()

fe_df["team1_strength"] = fe_df.apply(lambda r: compute_team_strength(r, fe_df), axis=1)


# Win rate for team2 (team2 wins when team1 loses)
def compute_team_strength_t2(row, df):
    team, date = row["team2"], row["date"]
    past = df[(df["team2"] == team) & (df["date"] < date)]
    if len(past) == 0:
        return 0.5
    return 1 - past["team1_win"].mean()

fe_df["team2_strength"] = fe_df.apply(lambda r: compute_team_strength_t2(r, fe_df), axis=1)


# Preview engineered features
fe_df.head()

Unnamed: 0,match_id,team1,team2,gender,season,date,venue,city,toss_winner,toss_decision,match_winner,winner_runs,winner_wickets,host_country,team1_win,team1_toss_win,toss_bat,team1_home,team1_strength,team2_strength
0,1001349,Australia,Sri Lanka,male,2016/17,2017/02/17,Melbourne Cricket Ground,,Sri Lanka,field,Sri Lanka,,5.0,Australia,0,0,0,1,0.559322,0.54717
1,1001351,Australia,Sri Lanka,male,2016/17,2017/02/19,"Simonds Stadium, South Geelong",Victoria,Sri Lanka,field,Sri Lanka,,2.0,Australia,0,0,0,1,0.55,0.555556
2,1001353,Australia,Sri Lanka,male,2016/17,2017/02/22,Adelaide Oval,,Sri Lanka,field,Australia,41.0,,Australia,1,0,0,1,0.540984,0.563636
3,1004729,Ireland,Hong Kong,male,2016,2016/09/05,"Bready Cricket Club, Magheramason",Londonderry,Hong Kong,bat,Hong Kong,40.0,,Ireland,0,0,1,1,0.428571,1.0
4,1007655,Zimbabwe,India,male,2016,2016/06/18,Harare Sports Club,,India,field,Zimbabwe,2.0,,Zimbabwe,1,0,0,1,0.133333,0.625


### Feature Engineering (Part 2)

In this section, we introduce dynamic, performance‚Äëbased features that capture how teams perform over time. These features significantly improve predictive accuracy because they reflect momentum, rivalry patterns, and season‚Äëspecific team strength.

---

#### 1. Recent Form (Last 5 Matches)

Teams often go through hot streaks or slumps. To quantify short‚Äëterm momentum, we compute:

- **team1_recent_form**  
  Rolling win rate of team1 over its last 5 matches.

- **team2_recent_form**  
  Rolling win rate of team2 over its last 5 matches (computed using the complement of team1‚Äôs win).

This feature captures immediate performance trends and is widely used in sports analytics to model momentum.

---

#### 2. Head‚Äëto‚ÄëHead Strength (H2H)

Some teams consistently outperform others due to matchup‚Äëspecific advantages.  
We compute:

- **h2h_strength**  
  Historical win rate of team1 against team2 across all previous encounters.  
  If the teams have never met, we assign a neutral value of **0.5**.

This feature captures long‚Äëterm rivalry dynamics.

---

#### 2A. Weighted Head‚Äëto‚ÄëHead Strength

Not all past matches carry equal importance. Teams evolve over time, and recent encounters are often more predictive than older ones.  
To account for this, we compute:

- **h2h_weighted**  
  An exponentially weighted H2H score where recent matches receive higher weight and older matches gradually decay in influence.

This allows the model to emphasize current rivalry trends rather than outdated historical results.

---

#### 2B. Venue‚ÄëSpecific Head‚Äëto‚ÄëHead Strength

Teams often perform differently depending on the venue:

- Subcontinent teams excel on spin‚Äëfriendly pitches  
- Australia and South Africa perform strongly on fast, bouncy surfaces  
- Neutral venues (e.g., UAE) often level the playing field  

To capture this, we compute:

- **h2h_venue_specific**  
  Historical win rate of team1 against team2 *restricted to similar venue conditions*:
  - Matches played in team1‚Äôs home country  
  - Matches played in team2‚Äôs home country  
  - Matches played at neutral venues  

If no matches exist in the relevant venue category, we fall back to the overall H2H average.

This feature models rivalry under comparable playing conditions, making it highly predictive.

---

#### 3. Season‚ÄëBased Strength

Teams vary in strength from season to season due to changes in squad composition, coaching staff, and player form.  
We compute:

- **team1_season_strength** ‚Üí team1‚Äôs win rate in that specific season  
- **team2_season_strength** ‚Üí team2‚Äôs win rate in that season (complement of team1‚Äôs win)

This feature helps the model understand how strong each team was during the year of the match, capturing temporal variations in team quality.

---

Together, these components ‚Äî recent form, enhanced head‚Äëto‚Äëhead metrics, and season‚Äëbased strength ‚Äî provide temporal and contextual intelligence to the model, enabling more realistic and robust match outcome predictions.

In [8]:
# -----------------------------------------
# Feature Engineering (Part 2)
# -----------------------------------------

# -----------------------------------------
# 1. Recent Form (Last 5 Matches)
# -----------------------------------------

# Sort by date to ensure chronological order
fe_df = fe_df.sort_values("date")

# Helper function to compute rolling win rate
def compute_recent_form_safe(df, team_col, target_col, window=5):
    df = df.copy()
    df[target_col + "_shifted"] = df[target_col].shift(1)
    return (
        df.groupby(team_col)[target_col + "_shifted"]
        .rolling(window=window, min_periods=1)
        .mean()
        .reset_index(level=0, drop=True)
    )

# team1 recent form
fe_df["team1_recent_form"] = compute_recent_form_safe(
    fe_df, "team1", "team1_win", window=5
).fillna(0.5)

# team2 recent form (team2 wins when team1 loses)
fe_df["team2_recent_form"] = compute_recent_form_safe(
    fe_df.assign(team2_win=lambda x: 1 - x["team1_win"]),
    "team2",
    "team2_win",
    window=5
).fillna(0.5)


# -----------------------------------------
# 2. Head-to-Head Strength
# -----------------------------------------

def compute_h2h(row, df):
    t1, t2, date = row["team1"], row["team2"], row["date"]
    past = df[
        (
            ((df["team1"] == t1) & (df["team2"] == t2)) |
            ((df["team1"] == t2) & (df["team2"] == t1))
        ) &
        (df["date"] < date)
    ]
    if len(past) == 0:
        return 0.5
    return past["team1_win"].mean()

fe_df["h2h_strength"] = fe_df.apply(lambda r: compute_h2h(r, fe_df), axis=1)


# -----------------------------------------
# 2A. Weighted Head-to-Head Strength
# -----------------------------------------

import numpy as np

def compute_weighted_h2h(row, df, decay=0.9):
    t1, t2, date = row["team1"], row["team2"], row["date"]

    past = df[
        (
            ((df["team1"] == t1) & (df["team2"] == t2)) |
            ((df["team1"] == t2) & (df["team2"] == t1))
        ) &
        (df["date"] < date)
    ].sort_values("date")

    if len(past) == 0:
        return 0.5

    n = len(past)
    weights = np.array([decay ** (n - i - 1) for i in range(n)])
    weights = weights / weights.sum()

    return np.average(past["team1_win"], weights=weights)

fe_df["h2h_weighted"] = fe_df.apply(lambda r: compute_weighted_h2h(r, fe_df), axis=1)


# -----------------------------------------
# 2B. Venue-Specific H2H Strength
# -----------------------------------------

def compute_venue_h2h(row, df):
    t1, t2, date = row["team1"], row["team2"], row["date"]
    venue_country = row["host_country"]

    past = df[
        (
            ((df["team1"] == t1) & (df["team2"] == t2)) |
            ((df["team1"] == t2) & (df["team2"] == t1))
        ) &
        (df["date"] < date)
    ]

    if len(past) == 0:
        return 0.5

    # Filter by venue type
    if venue_country == t1:
        subset = past[past["host_country"] == t1]
    elif venue_country == t2:
        subset = past[past["host_country"] == t2]
    else:
        subset = past[past["host_country"].isna()]  # neutral venues

    if len(subset) == 0:
        return past["team1_win"].mean()

    return subset["team1_win"].mean()

fe_df["h2h_venue_specific"] = fe_df.apply(lambda r: compute_venue_h2h(r, fe_df), axis=1)


# -----------------------------------------
# 3. Season-Based Strength
# -----------------------------------------

def compute_season_strength(row, df):
    t1, season, date = row["team1"], row["season"], row["date"]
    past = df[(df["team1"] == t1) & (df["season"] == season) & (df["date"] < date)]
    if len(past) == 0:
        return 0.5
    return past["team1_win"].mean()

fe_df["team1_season_strength"] = fe_df.apply(lambda r: compute_season_strength(r, fe_df), axis=1)

def compute_season_strength_t2(row, df):
    t2, season, date = row["team2"], row["season"], row["date"]
    past = df[(df["team2"] == t2) & (df["season"] == season) & (df["date"] < date)]
    if len(past) == 0:
        return 0.5
    return (1 - past["team1_win"]).mean()

fe_df["team2_season_strength"] = fe_df.apply(lambda r: compute_season_strength_t2(r, fe_df), axis=1)

# -----------------------------------------
# 4. Relative (Difference-Based) Features
# -----------------------------------------

fe_df["strength_diff"] = fe_df["team1_strength"] - fe_df["team2_strength"]
fe_df["recent_form_diff"] = fe_df["team1_recent_form"] - fe_df["team2_recent_form"]
fe_df["season_strength_diff"] = fe_df["team1_season_strength"] - fe_df["team2_season_strength"]

# -----------------------------------------
# 5. Toss Interaction Features
# -----------------------------------------

fe_df["toss_home_combo"] = fe_df["team1_toss_win"] * fe_df["team1_home"]
fe_df["toss_bat_home"] = fe_df["toss_bat"] * fe_df["team1_home"]

# Preview
fe_df.head()

Unnamed: 0,match_id,team1,team2,gender,season,date,venue,city,toss_winner,toss_decision,...,h2h_strength,h2h_weighted,h2h_venue_specific,team1_season_strength,team2_season_strength,strength_diff,recent_form_diff,season_strength_diff,toss_home_combo,toss_bat_home
2634,211048,New Zealand,Australia,male,2004/05,2005/02/17,Eden Park,Auckland,Australia,bat,...,0.5,0.5,0.5,0.5,0.5,0.0,0.0,0.0,0,1
2633,211028,England,Australia,male,2005,2005/06/13,The Rose Bowl,Southampton,England,bat,...,0.5,0.5,0.5,0.5,0.5,-0.5,-1.0,0.0,1,1
2635,222678,South Africa,New Zealand,male,2005/06,2005/10/21,New Wanderers Stadium,Johannesburg,New Zealand,field,...,0.5,0.5,0.5,0.5,0.5,0.0,1.0,0.0,0,0
2638,226374,Australia,South Africa,male,2005/06,2006/01/09,"Brisbane Cricket Ground, Woolloongabba",Brisbane,Australia,bat,...,0.5,0.5,0.5,0.5,0.5,0.0,-1.0,0.0,1,1
2640,238195,South Africa,Australia,male,2005/06,2006/02/24,New Wanderers Stadium,Johannesburg,South Africa,bat,...,1.0,1.0,1.0,0.0,0.5,-0.5,0.5,-0.5,1,1


### Feature Engineering (Part 3)

With all performance‚Äëbased and rivalry‚Äëbased features constructed, we now prepare the dataset for machine learning. This involves encoding categorical variables, selecting the final feature set, and splitting the data into training and testing subsets.

---

#### 1. Encoding Categorical Variables

Machine learning models require numerical inputs.  
We convert the following categorical fields into one‚Äëhot encoded vectors:

- **team1**
- **team2**
- **host_country**
- **toss_decision**

Using one‚Äëhot encoding ensures that the model treats each category as an independent binary feature without imposing any artificial ordering.

---

#### 2. Final Feature Selection

We assemble the complete feature set, which includes:

- Toss‚Äërelated features  
- Home advantage  
- Team strength metrics  
- Recent form  
- Multiple head‚Äëto‚Äëhead indicators  
- Season‚Äëbased strength  
- Encoded categorical variables  

This creates a comprehensive numerical representation of each match.

---

#### 3. Train‚ÄëTest Split

To evaluate model performance fairly, we split the dataset into:

- **80% training data**  
- **20% testing data**

We use stratified sampling to preserve the proportion of wins and losses in both sets.  
This ensures that the model is trained and evaluated on balanced, representative data.

---

This completes the data preparation pipeline.  
The dataset is now fully numerical, clean, and ready for model development in the next section.

In [9]:
# -----------------------------------------
# Feature Engineering (Part 3)
# -----------------------------------------

fe_df = fe_df.copy()

# -----------------------------------------
# 1. Encode Categorical Variables
# -----------------------------------------

# Select categorical columns to encode
categorical_cols = ["host_country", "toss_decision"]

# One-hot encode categorical variables
fe_df_encoded = pd.get_dummies(fe_df, columns=categorical_cols, drop_first=True)

# -----------------------------------------
# 2. Select Final Feature Set
# -----------------------------------------

feature_cols = [
    "team1_toss_win",
    "toss_bat",
    "team1_home",
    "team1_strength",
    "team2_strength",
    "team1_recent_form",
    "team2_recent_form",
    "h2h_strength",
    "h2h_weighted",
    "h2h_venue_specific",
    "team1_season_strength",
    "team2_season_strength",
    "strength_diff",
    "recent_form_diff",
    "season_strength_diff",
    "toss_home_combo",
    "toss_bat_home",
]

# Add encoded categorical columns
encoded_cols = [col for col in fe_df_encoded.columns if any(prefix in col for prefix in categorical_cols)]
feature_cols.extend(encoded_cols)

# Target variable
target_col = "team1_win"

# Final dataset
X = fe_df_encoded[feature_cols]
y = fe_df_encoded[target_col]

from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(score_func=f_classif, k=120)
X = selector.fit_transform(X, y)

# -----------------------------------------
# 3. Train-Test Split
# -----------------------------------------

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

X_train.shape, X_test.shape



((2403, 43), (601, 43))

### Model Training and Evaluation

With the dataset fully prepared, we now train a diverse suite of machine learning models to compare their predictive performance. Using multiple models allows us to evaluate different learning paradigms, including linear, distance‚Äëbased, margin‚Äëbased, tree‚Äëbased, ensemble, boosting, and neural network approaches.

---

#### Models Included

1. **Logistic Regression** ‚Äî linear baseline  
2. **K‚ÄëNearest Neighbors (KNN)** ‚Äî distance‚Äëbased classifier  
3. **Support Vector Machine (SVM)** ‚Äî margin‚Äëbased classifier  
4. **Decision Tree** ‚Äî interpretable tree baseline  
5. **Random Forest** ‚Äî ensemble of decision trees (bagging)  
6. **XGBoost** ‚Äî gradient boosting model, state‚Äëof‚Äëthe‚Äëart for tabular data  
7. **Deep MLP (128‚Äë64‚Äë32)** ‚Äî multi‚Äëlayer neural network with three hidden layers  

This collection provides a comprehensive comparison across fundamentally different modeling strategies.

---

#### Training and Evaluation

Each model is trained on the training split and evaluated on the test split using:

- **Accuracy**  
- **Precision**  
- **Recall**  
- **F1 Score**

These metrics provide a balanced view of model performance, especially for binary classification tasks such as predicting match winners.

The results are compiled into a comparison table to identify the strongest performers.

---

This modeling framework enables a rigorous evaluation of predictive performance and helps determine which algorithms best capture the underlying patterns in T20 cricket match outcomes.

In [13]:
# -----------------------------------------
# Model Training and Evaluation
# -----------------------------------------

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
import xgboost as xgb
import pandas as pd

# -----------------------------------------
# 1. Define All Models
# -----------------------------------------

models = {
    "Logistic Regression": LogisticRegression(
        C=0.5,
        solver='liblinear',
        random_state=42
),
    "KNN": KNeighborsClassifier(n_neighbors=7),
    "SVM": SVC(probability=True),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(
        n_estimators=300,
        max_depth=10,
        min_samples_split=5,
        min_samples_leaf=3,
        random_state=42
    ),
    "XGBoost": xgb.XGBClassifier(
        n_estimators=300,
        learning_rate=0.05,
        max_depth=6,
        subsample=0.8,
        colsample_bytree=0.8,
        random_state=42
),
    "Deep MLP (128-64-32)": MLPClassifier(
        hidden_layer_sizes=(128, 64, 32),
        activation="relu",
        solver="adam",
        alpha=0.001,
        learning_rate_init=0.001,
        max_iter=500,
        random_state=42
    ),
    "AdaBoost": AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=1),
        n_estimators=200,
        learning_rate=0.1
    )
}

# -----------------------------------------
# 2. Train and Evaluate Models
# -----------------------------------------

results = []

for name, model in models.items():
    print(f"Training {name}...")
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)

    results.append([name, acc, prec, rec, f1])
    
# -----------------------------------------
# 3. Results Table
# -----------------------------------------

results_df = pd.DataFrame(results, columns=["Model", "Accuracy", "Precision", "Recall", "F1 Score"])
results_df.sort_values(by="F1 Score", ascending=False)

Training Logistic Regression...
Training KNN...
Training SVM...
Training Decision Tree...
Training Random Forest...
Training XGBoost...
Training Deep MLP (128-64-32)...
Training AdaBoost...


Unnamed: 0,Model,Accuracy,Precision,Recall,F1 Score
0,Logistic Regression,0.599002,0.586667,0.600683,0.593592
5,XGBoost,0.580699,0.564263,0.614334,0.588235
6,Deep MLP (128-64-32),0.570715,0.553191,0.62116,0.585209
4,Random Forest,0.597338,0.594096,0.549488,0.570922
7,AdaBoost,0.584027,0.574394,0.566553,0.570447
3,Decision Tree,0.587354,0.580071,0.556314,0.567944
2,SVM,0.579035,0.571942,0.542662,0.556918
1,KNN,0.542429,0.530822,0.52901,0.529915


## üìå Explanation of Model Performance

After evaluating all eight machine learning models, we observe a clear performance pattern that aligns with the structure of our engineered features and the nature of the dataset.

### ‚≠ê Logistic Regression Performs the Best
Logistic Regression achieves the highest F1 score. This is expected because the engineered features in this project, such as `strength_diff`, `recent_form_diff`, `season_strength_diff`, `h2h_strength`, and `team1_home` capture relationships that are mostly **linear and additive**. Logistic Regression models these relationships directly and efficiently, making it a strong fit for this dataset.

### ‚≠ê XGBoost and the Deep MLP Are Close Behind
Both XGBoost and the MLP capture **mild nonlinear interactions** that Logistic Regression cannot. However, the dataset is not highly nonlinear, and the engineered features already simplify many relationships. As a result, these models perform well but do not surpass Logistic Regression.

### ‚≠ê Random Forest and AdaBoost Sit in the Middle
Random Forest and AdaBoost perform reasonably well but do not reach the top. This is because:
- The dataset contains **smooth numeric features** rather than strong hierarchical splits.
- One-hot encoded team and venue variables introduce **sparse, high-dimensional inputs**, which tree-based models handle less effectively.
- AdaBoost is sensitive to noise and tends to over-focus on misclassified samples, which can reduce generalization.

### ‚≠ê Decision Tree, SVM, and KNN Trail Behind
- A single Decision Tree lacks the complexity needed for this problem.
- SVM performs moderately well but struggles with the high-dimensional one-hot encoded features.
- KNN performs the worst because distance-based models do not work well with sparse, high-dimensional data.

### ‚úÖ Summary
The results confirm that the engineered features in this project align strongly with linear modeling assumptions. Logistic Regression captures these relationships most effectively, while more complex models offer incremental improvements but do not outperform the simpler baseline.
