## Feature Engineering / Target Creation
- Drop highly correlated features to avoid multicollinearity.
- Scale numeric features using StandardScaler to normalize ranges.
- Create binary target variable for modeling: each match is represented twice, once for the winner and once for the loser.
- Combine winner and loser rows into a single dataset suitable for supervised learning.

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.inspection import permutation_importance
import joblib

In [2]:
# Load cleaned tennis data
df = pd.read_csv("data/processed/tennis_clean_2016_2024.csv")

In [3]:
# --- Step 1: Drop highly correlated numeric features ---

# Select only numeric columns
numeric_df = df.select_dtypes(include=np.number)

# Compute absolute correlation matrix
corr_matrix = numeric_df.corr().abs()

# Select upper triangle to avoid duplicate correlations
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Identify columns with correlation > 0.85 (highly correlated features)
to_drop = [column for column in upper.columns if any(upper[column] > 0.85)]
print("Highly correlated columns to drop:", to_drop)

# Drop these columns from the original dataframe
df = df.drop(columns=to_drop)

Highly correlated columns to drop: ['w_svpt', 'w_1stIn', 'w_1stWon', 'w_SvGms', 'w_bpFaced', 'l_svpt', 'l_1stIn', 'l_1stWon', 'l_SvGms', 'l_bpFaced', 'w_sets_won', 'total_sets', 'w_games_won', 'l_games_won', 'total_games']


In [4]:
# --- Step 2: Scale numeric features ---

numeric_cols = df.select_dtypes(include=np.number).columns

# Initialize StandardScaler
scaler = StandardScaler()

# Fit and transform numeric columns
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

# Quick check of transformed numeric features
print(df[numeric_cols].head())

   draw_size  winner_id  winner_seed  winner_ht  winner_age  loser_id  \
0  -0.029454   0.044342          NaN   0.044358   -0.860780  0.087923   
1  -0.029454  -0.551583          NaN   0.749897    0.859583 -0.518161   
2  -0.029454  -0.377252          NaN  -1.366719   -0.750500 -0.563185   
3  -0.029454  -0.556453          NaN  -0.661180    1.036030  2.455657   
4  -0.029454   2.396316          NaN  -0.661180   -1.456290 -0.548079   

   loser_seed  loser_ht  loser_age   best_of  ...     l_ace      l_df  \
0         NaN -0.971857  -0.857269 -0.494583  ...  1.077760  0.271842   
1         NaN -0.539942  -0.089041 -0.494583  ... -0.680516 -1.306195   
2         NaN -0.251998   1.601060 -0.494583  ...  2.249944 -0.517177   
3         NaN -0.251998  -1.801092 -0.494583  ... -0.875880 -0.517177   
4         NaN -0.251998   1.052326 -0.494583  ...  0.296304 -0.517177   

   l_2ndWon  l_bpSaved  winner_rank  winner_rank_points  loser_rank  \
0 -0.842212   0.063575    -0.092951           -0.31

In [5]:
# --- Step 3: Create target variable for supervised learning ---
# We'll create a "winner vs loser" binary target
# 1 = player won, 0 = player lost

df['target'] = 1  # winner row placeholder

# Create winner rows
winner_df = df.copy()
winner_df['player_name'] = winner_df['winner_name']
winner_df['opponent_name'] = winner_df['loser_name']
winner_df['target'] = 1  # won

# Create loser rows
loser_df = df.copy()
loser_df['player_name'] = loser_df['loser_name']
loser_df['opponent_name'] = loser_df['winner_name']
loser_df['target'] = 0  # lost

# Combine winner and loser rows into a single dataframe for modeling
df_model = pd.concat([winner_df, loser_df], ignore_index=True)

# Drop original winner/loser name columns (optional)
#df_model = df_model.drop(columns=['winner_name', 'loser_name'])

## Feature Engineering & Modeling

This part combines feature engineering with baseline and enhanced models. 
We will:
1. Train a baseline model
2. Create player/opponent overall win percentages
3. Create surface-specific win percentages
4. Create head-to-head win percentages
5. Create player recent-form (Last 5 matches)
6. Create player rank
7. Train an enhanced model with new features

In [6]:
# Baseline Model
# Select only numeric columns (excluding target)
X = df_model.select_dtypes(include=np.number).drop(columns=['target'])
y = df_model['target']

# Split data into train/test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train HistGradientBoostingClassifier (handles NaN values)
model = HistGradientBoostingClassifier(random_state=42)
model.fit(X_train, y_train)

# Make predictions and evaluate
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.4259911894273128

Confusion Matrix:
 [[ 964 3607]
 [1605 2904]]

Classification Report:
               precision    recall  f1-score   support

           0       0.38      0.21      0.27      4571
           1       0.45      0.64      0.53      4509

    accuracy                           0.43      9080
   macro avg       0.41      0.43      0.40      9080
weighted avg       0.41      0.43      0.40      9080



In [7]:
# This feature gives the model context on player performance historically.

# Compute player overall win percentage
win_counts = df_model.groupby('player_name')['target'].sum()
match_counts = df_model.groupby('player_name')['target'].count()
player_win_pct = (win_counts / match_counts).to_dict()

# Map win percentage to players
df_model['player_win_pct'] = df_model['player_name'].map(player_win_pct)
df_model['opponent_win_pct'] = df_model['opponent_name'].map(player_win_pct)

# Preview
print(df_model[['player_name','opponent_name','player_win_pct','opponent_win_pct']].head())

          player_name    opponent_name  player_win_pct  opponent_win_pct
0      Frances Tiafoe    Soon Woo Kwon        0.533875          0.466102
1  Jan Lennard Struff  Thiago Monteiro        0.496000          0.407216
2         Sumit Nagal    Denis Istomin        0.225000          0.339450
3        John Millman  Lorenzo Musetti        0.460581          0.555066
4        Tomas Machac       Joao Sousa        0.531646          0.405694


In [8]:
# Adds context for player performance on specific surfaces. Default 0.5 if no historical data.

# Calculate wins & matches per player per surface
win_counts_surface = df_model.groupby(['player_name','surface'])['target'].sum()
match_counts_surface = df_model.groupby(['player_name','surface'])['target'].count()
player_surface_win_pct = (win_counts_surface / match_counts_surface).to_dict()

# Map to DataFrame
df_model['player_surface_win_pct'] = df_model.apply(
    lambda x: player_surface_win_pct.get((x['player_name'], x['surface']), 0.5), axis=1
)
df_model['opponent_surface_win_pct'] = df_model.apply(
    lambda x: player_surface_win_pct.get((x['opponent_name'], x['surface']), 0.5), axis=1
)

# Preview
print(df_model[['player_name','opponent_name','surface','player_surface_win_pct','opponent_surface_win_pct']].head())

          player_name    opponent_name surface  player_surface_win_pct  \
0      Frances Tiafoe    Soon Woo Kwon    Hard                0.540541   
1  Jan Lennard Struff  Thiago Monteiro    Hard                0.454106   
2         Sumit Nagal    Denis Istomin    Hard                0.150000   
3        John Millman  Lorenzo Musetti    Hard                0.484848   
4        Tomas Machac       Joao Sousa    Hard                0.535714   

   opponent_surface_win_pct  
0                  0.488636  
1                  0.327869  
2                  0.357143  
3                  0.471154  
4                  0.398551  


In [9]:
# We create features based on past matches between a player and their opponent:
# `h2h_wins`: total wins of the player vs this opponent
# `h2h_win_pct`: win percentage vs this opponent

# Count past wins of player vs opponent
h2h_wins = df_model.groupby(['player_name','opponent_name'])['target'].sum().to_dict()

# Map the H2H wins to create a new column
df_model['h2h_wins'] = df_model.apply(
    lambda x: h2h_wins.get((x['player_name'], x['opponent_name']), 0), axis=1
)

# Compute H2H win percentage
h2h_matches = df_model.groupby(['player_name','opponent_name'])['target'].count().to_dict()
df_model['h2h_win_pct'] = df_model.apply(
    lambda x: x['h2h_wins']/h2h_matches.get((x['player_name'], x['opponent_name']), 1), axis=1
)

# Preview new H2H columns
print(df_model[['player_name','opponent_name','h2h_wins','h2h_win_pct']].head())

          player_name    opponent_name  h2h_wins  h2h_win_pct
0      Frances Tiafoe    Soon Woo Kwon         2     1.000000
1  Jan Lennard Struff  Thiago Monteiro         2     0.666667
2         Sumit Nagal    Denis Istomin         1     1.000000
3        John Millman  Lorenzo Musetti         1     1.000000
4        Tomas Machac       Joao Sousa         1     1.000000


In [10]:
# We capture a player's recent performance over the last 5 matches:
# `player_recent_form`: average result of last 5 matches for the player
# `opponent_recent_form`: average result of last 5 matches for the opponent
# Shift by 1 to avoid using the current match

# Sort dataset by player and date to compute rolling stats correctly
df_model = df_model.sort_values(['player_name','tourney_date'])

# Player recent form using rolling average over past 5 matches
df_model['player_recent_form'] = (
    df_model.groupby('player_name')['target']
    .transform(lambda x: x.shift(1).rolling(5, min_periods=1).mean())
)

# Opponent recent form
df_model['opponent_recent_form'] = (
    df_model.groupby('opponent_name')['target']
    .transform(lambda x: x.shift(1).rolling(5, min_periods=1).mean())
)

# Fill remaining NaNs with neutral value (0.5)
df_model['player_recent_form'] = df_model['player_recent_form'].fillna(0.5)
df_model['opponent_recent_form'] = df_model['opponent_recent_form'].fillna(0.5)

# Preview recent form features
print(df_model[['player_name','opponent_name','player_recent_form','opponent_recent_form']].head(10))

              player_name      opponent_name  player_recent_form  \
43124  Abedallah Shelbayh      Soon Woo Kwon            0.500000   
20993  Abedallah Shelbayh         Elias Ymer            0.000000   
43682  Abedallah Shelbayh  Miomir Kecmanovic            0.500000   
43781  Abedallah Shelbayh       Pedro Cachin            0.333333   
44211  Abedallah Shelbayh    Roman Safiullin            0.250000   
44447  Abedallah Shelbayh     Rinky Hijikata            0.200000   
22645  Abedallah Shelbayh        Hugo Gaston            0.200000   
45335  Abedallah Shelbayh     Lorenzo Sonego            0.200000   
25759  Abedallah Shelbayh     Alexei Popyrin            0.200000   
25877  Abedallah Shelbayh  Tallon Griekspoor            0.200000   

       opponent_recent_form  
43124                   0.5  
20993                   0.5  
43682                   0.5  
43781                   0.5  
44211                   0.5  
44447                   0.5  
22645                   0.5  
45335      

In [11]:
# Player & Opponent Rank Features
# `player_rank` and `opponent_rank` adjusted depending on winner/loser
# Missing ranks are set to a high value (e.g., 1000)
# `rank_diff`: difference between player rank and opponent rank

# Create player_rank and opponent_rank columns based on match outcome
df_model['player_rank'] = df_model.apply(
    lambda x: x['winner_rank'] if x['target'] == 1 else x['loser_rank'], axis=1
)
df_model['opponent_rank'] = df_model.apply(
    lambda x: x['loser_rank'] if x['target'] == 1 else x['winner_rank'], axis=1
)

# Fill missing ranks with high number (unranked players)
df_model['player_rank'] = df_model['player_rank'].fillna(1000)
df_model['opponent_rank'] = df_model['opponent_rank'].fillna(1000)

# Compute rank difference feature
df_model['rank_diff'] = df_model['player_rank'] - df_model['opponent_rank']

# Preview rank features
print(df_model[['player_name','opponent_name','player_rank','opponent_rank','rank_diff']].head(10))

              player_name      opponent_name  player_rank  opponent_rank  \
43124  Abedallah Shelbayh      Soon Woo Kwon     1.721260       0.113196   
20993  Abedallah Shelbayh         Elias Ymer     2.848075       0.466335   
43682  Abedallah Shelbayh  Miomir Kecmanovic     1.638819      -0.367813   
43781  Abedallah Shelbayh       Pedro Cachin     1.418979       0.099453   
44211  Abedallah Shelbayh    Roman Safiullin     1.391499       0.470517   
44447  Abedallah Shelbayh     Rinky Hijikata     1.391499       0.772865   
22645  Abedallah Shelbayh        Hugo Gaston     2.133433      -0.046627   
45335  Abedallah Shelbayh     Lorenzo Sonego     1.162498      -0.147923   
25759  Abedallah Shelbayh     Alexei Popyrin     1.299898      -0.189153   
25877  Abedallah Shelbayh  Tallon Griekspoor     1.281578      -0.436529   

       rank_diff  
43124   1.608064  
20993   2.381740  
43682   2.006632  
43781   1.319526  
44211   0.920982  
44447   0.618633  
22645   2.180060  
45335   1.3

In [12]:
# Final Model with Engineered Features

# Prepare feature matrix (numeric columns + engineered features)
feature_cols = df_model.select_dtypes(include=np.number).columns.tolist()
feature_cols.remove('target')  # exclude target column

X = df_model[feature_cols]
y = df_model['target']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train model
model = HistGradientBoostingClassifier(random_state=42)
model.fit(X_train, y_train)

# Make predictions and evaluate
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.991079295154185

Confusion Matrix:
 [[4491   35]
 [  46 4508]]

Classification Report:
               precision    recall  f1-score   support

           0       0.99      0.99      0.99      4526
           1       0.99      0.99      0.99      4554

    accuracy                           0.99      9080
   macro avg       0.99      0.99      0.99      9080
weighted avg       0.99      0.99      0.99      9080



In [13]:
# Save the trained model
joblib.dump(model, 'models/tennis_match_predictor.pkl')
print("Model saved as 'tennis_match_predictor.pkl'")

# Later, to load the model for live predictions
# loaded_model = joblib.load('tennis_match_predictor.pkl')
# print("Model loaded successfully")

Model saved as 'tennis_match_predictor.pkl'


### Full Match Model (Post-Match Features)
This model achieves 99% accuracy by using all available match statistics, including features that are only known after a match is completed, such as:
- Aces, double faults, points won/lost
- Games and sets won
- Total tiebreaks
- Match duration

While this high accuracy demonstrates a successful ML pipeline and extensive feature engineering, it cannot be used for predicting upcoming matches because it relies on post-match data.

Purpose:
- Showcase data preprocessing and feature engineering skills
- Illustrate model training, evaluation, and performance metrics
- Serve as a reference for comparison with a pre-match predictive model

### Light-weight Prematch Model

In [14]:
# --- FEATURE ENGINEERING FOR LIGHTWEIGHT MODEL ---
# Load cleaned data
df = pd.read_csv("data/processed/tennis_clean_2016_2024.csv")

# --- CREATE TARGET AND PLAYER/OPPONENT COLUMNS FOR LIGHTWEIGHT MODEL ---
df['target'] = 1  # winner row placeholder
df['player_name'] = df['winner_name']
df['opponent_name'] = df['loser_name']

# Winner rows
winner_df = df.copy()
winner_df['player_name'] = winner_df['winner_name']
winner_df['opponent_name'] = winner_df['loser_name']
winner_df['target'] = 1

# Loser rows
loser_df = df.copy()
loser_df['player_name'] = loser_df['loser_name']
loser_df['opponent_name'] = loser_df['winner_name']
loser_df['target'] = 0

# Combine into one df
df_model = pd.concat([winner_df, loser_df], ignore_index=True)

df_model['player_win_pct'] = df_model['player_name'].map(player_win_pct)
df_model['opponent_win_pct'] = df_model['opponent_name'].map(player_win_pct)

# 2. Head-to-Head wins
h2h_wins = df_model.groupby(['player_name','opponent_name'])['target'].sum().to_dict()
h2h_matches = df_model.groupby(['player_name','opponent_name'])['target'].count().to_dict()
df_model['h2h_win_pct'] = df_model.apply(
    lambda x: h2h_wins.get((x['player_name'], x['opponent_name']), 0) /
              h2h_matches.get((x['player_name'], x['opponent_name']), 1), axis=1
)

# 3. Surface-specific win %
win_surface = df_model.groupby(['player_name','surface'])['target'].sum()
matches_surface = df_model.groupby(['player_name','surface'])['target'].count()

df_model['player_surface_win_pct'] = df_model.apply(
    lambda x: win_surface.get((x['player_name'], x['surface']), 0.5) /
              matches_surface.get((x['player_name'], x['surface']), 1), axis=1
)
df_model['opponent_surface_win_pct'] = df_model.apply(
    lambda x: win_surface.get((x['opponent_name'], x['surface']), 0.5) /
              matches_surface.get((x['opponent_name'], x['surface']), 1), axis=1
)

# 4. Player and opponent ranks
df_model['player_rank'] = df_model.apply(
    lambda x: x['winner_rank'] if x['target'] == 1 else x['loser_rank'], axis=1
)
df_model['opponent_rank'] = df_model.apply(
    lambda x: x['loser_rank'] if x['target'] == 1 else x['winner_rank'], axis=1
)
df_model['player_rank'] = df_model['player_rank'].fillna(1000)
df_model['opponent_rank'] = df_model['opponent_rank'].fillna(1000)
df_model['rank_diff'] = df_model['player_rank'] - df_model['opponent_rank']

In [15]:
# --- TRAIN LIGHTWEIGHT MODEL ---
feature_cols = [
    'player_win_pct', 'opponent_win_pct', 'h2h_win_pct',
    'player_surface_win_pct', 'opponent_surface_win_pct', 'rank_diff'
]
X = df_model[feature_cols]
y = df_model['target']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

light_model = HistGradientBoostingClassifier(random_state=42)
light_model.fit(X_train, y_train)

In [16]:
# --- TEST MODEL ---
y_pred = light_model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.8657488986784141

Confusion Matrix:
 [[3913  658]
 [ 561 3948]]

Classification Report:
               precision    recall  f1-score   support

           0       0.87      0.86      0.87      4571
           1       0.86      0.88      0.87      4509

    accuracy                           0.87      9080
   macro avg       0.87      0.87      0.87      9080
weighted avg       0.87      0.87      0.87      9080



In [17]:
def predict_match(player, opponent, surface, df_model=df_model, model=light_model):
    """
    Predicts the probability that `player` will beat `opponent` on the given `surface`.

    Args:
        player (str): Name of the player to evaluate.
        opponent (str): Name of the opposing player.
        surface (str): Match surface (e.g., "Clay", "Grass", "Hard").
        df_model (pd.DataFrame): Processed dataset used to look up player ranks.
        model (sklearn-like model): Trained classifier with `.predict_proba()`.

    Returns:
        float: Probability of `player` winning the match.
    """
    
    # Helper function to safely retrieve values from a dictionary
    # If key not found, returns the provided default (e.g., 0.5 or 1)
    def safe_lookup(mapping, key, default=0.5):
        return mapping.get(key, default)

    # Lookup general player win percentages
    player_win = safe_lookup(player_win_pct, player)
    opponent_win = safe_lookup(player_win_pct, opponent)

    # Lookup head-to-head (H2H) record:
    # h2h_wins = number of times `player` has beaten `opponent`
    # h2h_matches = total matches between `player` and `opponent`
    # If no matches exist, defaults avoid division by zero
    h2h = safe_lookup(h2h_wins, (player, opponent)) / safe_lookup(h2h_matches, (player, opponent), 1)

    # Lookup player and opponent win rates on the given surface
    # Uses matches_surface to normalize the number of wins
    # Defaults prevent zero-division errors
    player_surface = safe_lookup(win_surface, (player, surface)) / safe_lookup(matches_surface, (player, surface), 1)
    opponent_surface = safe_lookup(win_surface, (opponent, surface)) / safe_lookup(matches_surface, (opponent, surface), 1)
    
    # Compute rank difference:
    # Retrieve each player's rank from df_model
    # If rank is missing (NaN), assign a high default (1000 = "unranked")
    # Positive rank_diff_val means player is lower ranked than opponent
    player_rank_val = df_model.loc[df_model['player_name'] == player, 'winner_rank'].min()
    opponent_rank_val = df_model.loc[df_model['player_name'] == opponent, 'winner_rank'].min()
    if pd.isna(player_rank_val): player_rank_val = 1000
    if pd.isna(opponent_rank_val): opponent_rank_val = 1000
    rank_diff_val = player_rank_val - opponent_rank_val

    # Construct feature vector for prediction
    X_new = pd.DataFrame([[
        player_win, opponent_win, h2h, player_surface, opponent_surface, rank_diff_val
    ]], columns=feature_cols)

    # Use the trained model to predict win probability
    # predict_proba returns [P(loss), P(win)], so we take index [1]
    prob = model.predict_proba(X_new)[0][1]  # probability of player winning

    return prob # Final output = probability that `player` wins

In [18]:
# Saving model and metadata for use in seperate Python script
joblib.dump(light_model, "models/light_model.pkl")
df_model.to_csv("metadata/df_model_light.csv", index=False)

joblib.dump(player_win_pct, "metadata/player_win_pct.pkl")
joblib.dump(h2h_wins, "metadata/h2h_wins.pkl")
joblib.dump(h2h_matches, "metadata/h2h_matches.pkl")
joblib.dump(win_surface, "metadata/win_surface.pkl")
joblib.dump(matches_surface, "metadata/matches_surface.pkl")
joblib.dump(feature_cols, "metadata/feature_cols.pkl")

['metadata/feature_cols.pkl']

In [19]:
# Example matchups
examples = [
    ("Rafael Nadal", "Novak Djokovic", "Clay"),
    ("Rafael Nadal", "Novak Djokovic", "Grass"),
    ("Jenson Brooksby", "Holger Rune", "Hard"),
    ("Carlos Alcaraz", "Casper Ruud", "Hard"),
    ("Carlos Alcaraz", "Jannik Sinner", "Clay")
]

print(f"{'Player':<15} {'Opponent':<15} {'Surface':<10} {'Win Probability':<15}")
for player, opponent, surface in examples:
    prob = predict_match(player, opponent, surface)  # your lightweight function
    print(f"{player:<15} {opponent:<15} {surface:<10} {prob:.2f}")

Player          Opponent        Surface    Win Probability
Rafael Nadal    Novak Djokovic  Clay       0.63
Rafael Nadal    Novak Djokovic  Grass      0.37
Jenson Brooksby Holger Rune     Hard       0.53
Carlos Alcaraz  Casper Ruud     Hard       0.81
Carlos Alcaraz  Jannik Sinner   Clay       0.63


### Project Summary & Takeaways
In this project, we explored ATP tennis match data from 2016–2024 and built a complete machine learning pipeline to predict match outcomes. 

Key points:

- Data Cleaning & Preprocessing:
  - Removed walkovers, defaults, and special tournaments.
  - Standardized variable types, handled missing values, and parsed match scores.
  - Consolidated multi-year match data into a single clean dataset.
- Exploratory Data Analysis (EDA):
  - Investigated distributions of match durations, total sets/games, and tiebreaks.
  - Explored player statistics, tournament counts, and surface-specific performance.
  - Generated correlation heatmaps and summary statistics to inform feature engineering.
- Feature Engineering:
  - Created player/opponent win percentages, surface-specific stats, head-to-head metrics, recent form, and rank differences.
  - Developed a binary target (1 = player won, 0 = player lost) and transformed the dataset to a “player vs opponent” format.
- Modeling:
  - Built a full-featured HistGradientBoostingClassifier, achieving ~99% accuracy.
  - This model demonstrates the end-to-end ML workflow and advanced feature engineering but relies on post-match statistics, so it is not suitable for real-time pre-match predictions.
- Lightweight Predictive Model:
  - Created a smaller, pre-match features-only model suitable for predicting upcoming matches using player rankings, recent form, and surface-specific win rates.
  - Provides probabilistic outputs for each player’s chance of winning, demonstrating practical pre-match predictive capability.
- Next Steps / Extensions:
  - Incorporate 2025 data and a process to pull data weekly
  - Incorporate more nuanced features like tournament type weighting, player fatigue, or injury history.
  - Build an interactive interface or API to input upcoming matches and get predictions.
  - Explore ensemble methods or Bayesian approaches to improve probability calibration.

Conclusion: This project highlights the full ML pipeline—from data cleaning to feature engineering, modeling, and prediction—with emphasis on creating interpretable and reproducible insights for tennis match outcomes.