## Notebook 2: Feature Engineering

In this notebook, we transform our cleaned match data into a trainable dataset for machine learning.

Input: master_data_cleaned.csv (from Notebook 1)

Our goals are:
1. Restructure data: Convert "Winner vs. Loser" rows into "Player A vs. Player B" format (doubling the dataset) to ensure symmetry.
2. Prevent leakage: We cannot use match stats (e.g., Aces, Winners) from the current match to predict the winner. We must transform these into historical averages (e.g., "Player A's average aces over the last 50 matches").
3. Create useful features:
    - Head-to-Head: Past record against the specific opponent.
    - Surface performance: Win rates specifically on Hard, Clay, and Grass.
    - Rolling stats: Averages over the last 52 weeks (Serve %, Break Points Saved).
    - Recent form: Win rate in the last 10 matches.

In [188]:
import pandas as pd
import numpy as np

# Display settings
pd.set_option('display.max_columns', None)

# Load the clean dataset from Notebook 1
input_file = 'master_data_cleaned.csv'
df = pd.read_csv(input_file)

# Parse Date again (CSV loses datetime format)
df['tourney_date'] = pd.to_datetime(df['tourney_date'])
print(f"Loaded {len(df):,} matches from {df['tourney_date'].min().year} to {df['tourney_date'].max().year}.")

Loaded 198,055 matches from 1968 to 2026.


## 1. Feature 1: Head-to-Head Records (H2H)

We calculate the number of past wins Player 1 has against Player 2.

This feature must be calculated iteratively. We walk through time, match by match.
- h2h_wins: How many times P1 has beaten P2 in the past.
- h2h_losses: How many times P1 has lost to P2 in the past (which is just P2's wins).

### Why this matters
Tennis is often about matchups. A player might be lower-ranked but have a specific game style (e.g., a massive lefty serve) that troubles a specific top opponent.

The "mental block" factor: If Player A has beaten Player B five times in a row, Player A may hold a psychological edge that rankings alone don't capture.

Logic:
1. We need to count how many times P1 has defeated P2 before the current match takes place.
1. We initialize a dictionary to store win counts for every pair of players.
1. We iterate through the matches chronologically.
1. For each match (P1 vs P2), we look up their current record.
1. After we record the stat for this match, we update the dictionary so the next time they play, the record is up-to-date.

In [189]:
# 1. Sort raw dataframe chronologically (essential for history)
df = df.sort_values(by=['tourney_date', 'match_num']).reset_index(drop=True)

# 2. Function to calculate raw H2H wins
def calculate_h2h_raw(df):
    # Dictionary to store the history: key=(winner_id, loser_id), value=count
    # We use IDs to be unique (names can be duplicates).
    wins_map = {}
    
    w_h2h_wins = []   # How many times Winner has beaten Loser
    l_h2h_wins = []   # How many times Loser has beaten Winner
    
    # Iterate row by row
    # iterrows is slow for massive data, but acceptable for ~200k tennis matches
    for idx, row in df.iterrows():
        w_id = row['winner_id']
        l_id = row['loser_id']
        
        # Get current record BEFORE this match
        w_past = wins_map.get((w_id, l_id), 0)
        l_past = wins_map.get((l_id, w_id), 0)
        
        w_h2h_wins.append(w_past)
        l_h2h_wins.append(l_past)
        
        # Update record for FUTURE (next time they play)
        # Only update the Winner's count (since this row is "Winner won")
        wins_map[(w_id, l_id)] = wins_map.get((w_id, l_id), 0) + 1
        
    return w_h2h_wins, l_h2h_wins

print("Calculating Head-to-Head records on raw data...")
df['winner_h2h_wins'], df['loser_h2h_wins'] = calculate_h2h_raw(df)

print("H2H column added to the dataframe.")

Calculating Head-to-Head records on raw data...
H2H column added to the dataframe.
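
As a sanity check on the lookup-then-update logic, the sketch below runs the same idea on four made-up matches and compares the winner-side counts with a vectorized `groupby(...).cumcount()`. The toy IDs and the names `toy` and `h2h_loop` are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Four made-up matches in chronological order (IDs are hypothetical).
toy = pd.DataFrame({
    "winner_id": [1, 2, 1, 1],
    "loser_id":  [2, 1, 2, 3],
})

def h2h_loop(df):
    """Look up each pair's record BEFORE the match, then update it."""
    wins = {}
    w_out, l_out = [], []
    for w, l in zip(df["winner_id"], df["loser_id"]):
        w_out.append(wins.get((w, l), 0))   # winner's prior wins over loser
        l_out.append(wins.get((l, w), 0))   # loser's prior wins over winner
        wins[(w, l)] = wins.get((w, l), 0) + 1
    return w_out, l_out

w_prior, l_prior = h2h_loop(toy)

# cumcount numbers the rows inside each (winner, loser) group from 0,
# which equals the winner's prior wins over that loser.
vectorized = toy.groupby(["winner_id", "loser_id"]).cumcount().tolist()

print(w_prior, l_prior, vectorized)
```

Note that `cumcount` only recovers the winner's side. The loser's prior wins live under the reversed key, which is one reason the notebook walks the rows with a dictionary.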


## 2. Data Restructuring (Data Symmetry)

Machine Learning models need to see both "Player A wins" and "Player A loses" scenarios. Currently, our dataset is always Winner vs Loser. We will duplicate every row:
- Original: Winner (P1) vs. Loser (P2) $\rightarrow$ Target = 1
- Swapped: Loser (P1) vs. Winner (P2) $\rightarrow$ Target = 0

### Why data symmetry? Prevent position bias.

Imagine we are teaching a child to identify the winner of a match, but we always put the winner's photo on the left side and the loser's photo on the right side.
- Problem: The child (or our model) will stop looking at the players' skills. It will simply learn: "The person on the left always wins."
- Result: When we ask the model to predict a future match where we don't know the winner yet, we have to put someone on the left. The model will blindly predict them as the winner, regardless of their stats, simply because of where they are standing in the dataset.

By "doubling" the dataset, we force the model to look at the stats, not the column position.
- Scenario A: Djokovic (P1) vs. Nadal (P2) $\rightarrow$ Target: 1 (P1 Won)
- Scenario B: Nadal (P1) vs. Djokovic (P2) $\rightarrow$ Target: 0 (P1 Lost)

Now, the model sees Djokovic in the "P1" slot winning and losing (depending on the match). It must analyze his aces, rank, and serve percentage to figure out the outcome, rather than just relying on his position in the table.

In [190]:
# 1. Create two copies of the original dataframe
# One where we view the match from the Winner's perspective
# One where we view the match from the Loser's perspective
df_winner = df.copy()
df_loser = df.copy()

# 2. Rename columns for the "Winner Perspective" (Target = 1)
# Here, P1 is the Winner, P2 is the Loser.
df_winner = df_winner.rename(columns={
    'winner_id': 'p1_id', 'winner_name': 'p1_name', 'winner_rank': 'p1_rank', 
    'winner_age': 'p1_age', 'winner_ht': 'p1_ht', 'winner_hand': 'p1_hand', 'winner_seed': 'p1_seed',
    'loser_id': 'p2_id', 'loser_name': 'p2_name', 'loser_rank': 'p2_rank', 
    'loser_age': 'p2_age', 'loser_ht': 'p2_ht', 'loser_hand': 'p2_hand', 'loser_seed': 'p2_seed',

    # H2H Mapping
    'winner_h2h_wins': 'p1_h2h_wins',  # P1 (Winner) wins vs P2
    'loser_h2h_wins': 'p1_h2h_losses', # P2 (Loser) wins vs P1 = P1 losses
    
    # Standard match stats mapping
    'w_ace': 'p1_ace', 'w_df': 'p1_df', 'w_svpt': 'p1_svpt', 'w_1stIn': 'p1_1stIn', 
    'w_1stWon': 'p1_1stWon', 'w_2ndWon': 'p1_2ndWon', 'w_SvGms': 'p1_SvGms', 
    'w_bpSaved': 'p1_bpSaved', 'w_bpFaced': 'p1_bpFaced',
    'l_ace': 'p2_ace', 'l_df': 'p2_df', 'l_svpt': 'p2_svpt', 'l_1stIn': 'p2_1stIn', 
    'l_1stWon': 'p2_1stWon', 'l_2ndWon': 'p2_2ndWon', 'l_SvGms': 'p2_SvGms', 
    'l_bpSaved': 'p2_bpSaved', 'l_bpFaced': 'p2_bpFaced'
})
df_winner['target'] = 1

# Add post H2H for verification (do not use post stats for training)
# Here, P1 Won, so Post Wins = Pre Wins + 1
df_winner['p1_post_wins'] = df_winner['p1_h2h_wins'] + 1
df_winner['p1_post_losses'] = df_winner['p1_h2h_losses']

# 3. Rename columns for the "Loser Perspective" (Target = 0)
# Here, P1 is the Loser, P2 is the Winner. We swap the columns.
df_loser = df_loser.rename(columns={
    'loser_id': 'p1_id', 'loser_name': 'p1_name', 'loser_rank': 'p1_rank', 
    'loser_age': 'p1_age', 'loser_ht': 'p1_ht', 'loser_hand': 'p1_hand', 'loser_seed': 'p1_seed',
    'winner_id': 'p2_id', 'winner_name': 'p2_name', 'winner_rank': 'p2_rank', 
    'winner_age': 'p2_age', 'winner_ht': 'p2_ht', 'winner_hand': 'p2_hand', 'winner_seed': 'p2_seed',
    
    # H2H Mapping
    'loser_h2h_wins': 'p1_h2h_wins',     # P1 (Loser) wins vs P2
    'winner_h2h_wins': 'p1_h2h_losses',  # P2 (Winner) wins vs P1 = P1 losses

    # Standard match stats mapping
    # Notice how we map 'l_ace' (loser's aces) to 'p1_ace' here
    'l_ace': 'p1_ace', 'l_df': 'p1_df', 'l_svpt': 'p1_svpt', 'l_1stIn': 'p1_1stIn', 
    'l_1stWon': 'p1_1stWon', 'l_2ndWon': 'p1_2ndWon', 'l_SvGms': 'p1_SvGms', 
    'l_bpSaved': 'p1_bpSaved', 'l_bpFaced': 'p1_bpFaced',
    'w_ace': 'p2_ace', 'w_df': 'p2_df', 'w_svpt': 'p2_svpt', 'w_1stIn': 'p2_1stIn', 
    'w_1stWon': 'p2_1stWon', 'w_2ndWon': 'p2_2ndWon', 'w_SvGms': 'p2_SvGms', 
    'w_bpSaved': 'p2_bpSaved', 'w_bpFaced': 'p2_bpFaced'
})
df_loser['target'] = 0

# Add post H2H for verification (do not use post stats for training)
# Here, P1 Lost, so Post Wins = Pre Wins (Unchanged)
df_loser['p1_post_wins'] = df_loser['p1_h2h_wins'] 
df_loser['p1_post_losses'] = df_loser['p1_h2h_losses'] + 1

# 4. Concatenate and sort
# We now have 2x the rows. We sort by date to keep the timeline clean.
df_combined = pd.concat([df_winner, df_loser], ignore_index=True)
df_combined = df_combined.sort_values(by=['tourney_date', 'match_num']).reset_index(drop=True)

print(f"Data restructuring complete.")
print(f"Raw data rows: {len(df):,}")
print(f"Restructured data rows (symmetric): {len(df_combined):,}")

# Quick check: Djokovic vs Nadal rivalry
mask = ((df_combined['p1_name'] == 'Novak Djokovic') & (df_combined['p2_name'] == 'Rafael Nadal'))
cols = ['tourney_name','tourney_date', 'p1_name', 'p2_name', 'target', 'p1_h2h_wins', 'p1_h2h_losses', 'p1_post_wins', 'p1_post_losses']
print("\nDjokovic vs Nadal H2H history (last 10 matches):")
print(df_combined[mask][cols].tail(10))


Data restructuring complete.
Raw data rows: 198,055
Restructured data rows (symmetric): 396,110

Djokovic vs Nadal H2H history (last 10 matches):
           tourney_name tourney_date         p1_name       p2_name  target  \
354855     Rome Masters   2018-05-14  Novak Djokovic  Rafael Nadal       0   
355812        Wimbledon   2018-07-02  Novak Djokovic  Rafael Nadal       1   
358488  Australian Open   2019-01-14  Novak Djokovic  Rafael Nadal       1   
360371     Rome Masters   2019-05-13  Novak Djokovic  Rafael Nadal       0   
363738          ATP Cup   2020-01-03  Novak Djokovic  Rafael Nadal       1   
365913    Roland Garros   2020-09-28  Novak Djokovic  Rafael Nadal       0   
368655     Rome Masters   2021-05-08  Novak Djokovic  Rafael Nadal       0   
369120    Roland Garros   2021-05-31  Novak Djokovic  Rafael Nadal       1   
374837    Roland Garros   2022-05-23  Novak Djokovic  Rafael Nadal       0   
388001   Paris Olympics   2024-07-29  Novak Djokovic  Rafael Nadal       1
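
The doubling step can be checked on a tiny example. The sketch below is a simplified illustration: the hypothetical `mirror` helper swaps columns by prefix (`winner_`/`loser_`) instead of using the explicit rename dictionaries above, so it does not cover the `w_`/`l_` stat columns, and the two toy matches are made up.

```python
import pandas as pd

# Two made-up matches in "Winner vs. Loser" format.
toy = pd.DataFrame({
    "winner_name": ["A", "C"],
    "loser_name":  ["B", "A"],
    "winner_rank": [1, 7],
    "loser_rank":  [5, 1],
})

def mirror(df):
    """Return every match twice: once from each player's perspective."""
    as_winner = df.rename(
        columns=lambda c: c.replace("winner_", "p1_").replace("loser_", "p2_")
    )
    as_winner["target"] = 1
    as_loser = df.rename(
        columns=lambda c: c.replace("loser_", "p1_").replace("winner_", "p2_")
    )
    as_loser["target"] = 0
    # concat aligns on column names, so the swapped column order is harmless.
    return pd.concat([as_winner, as_loser], ignore_index=True)

sym = mirror(toy)
print(sym)
print("Share of rows where P1 wins:", sym["target"].mean())
```

After mirroring, exactly half of the rows have `target = 1`, so the P1 slot on its own carries no information about the outcome.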

## 3. Feature 2: Surface-Specific Performance

A player's general win rate can be misleading. A "clay court specialist" might be unbeatable on dirt but vulnerable on grass. We calculate the cumulative win rate for each player on the specific surface of the current match.

Techniques:
1. We group by Player AND Surface.
1. We use .expanding() to calculate the "lifetime win rate" up to that point.
1. We use .shift(1) to ensure we don't include the result of the current match.

### What does the p1_surface_win_pct column represent?

This column represents the stats "entering" the match, not leaving it.

For example, if Nadal's clay win rate on the day of his match against Djokovic at the Paris Olympics (Row 388003) already reflected his loss in that match, that would be data leakage. It would mean the model "knew" he lost the match before making the prediction. Follow the logic trace below.

Row 387877 (vs. Fucsovics)
- Entering the match: Nadal's clay win rate was 0.903525.
- Match Result: Nadal WON (target = 1).
- Effect: Because he won, his lifetime percentage improved.

Row 388003 (vs. Djokovic)
- Entering the match: The stat reflects his new, improved record (because of the win in the previous row).
- Value: That is why the number increased to 0.903704.
- Match Result: Nadal LOST (target = 0).
- Effect: This loss will cause the win rate to drop for the next match (which isn't in the list).
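
The "entering the match" behaviour can be reproduced on a toy series. This is a minimal sketch assuming one player's results on one surface, in chronological order; the four results are made up.

```python
import pandas as pd

# 1 = win, 0 = loss, oldest match first (made-up results).
results = pd.Series([1, 1, 0, 1])

# expanding() accumulates everything up to and including the current row;
# shift(1) pushes each value down one row so the current match is excluded.
prior_wins = results.expanding().sum().shift(1).fillna(0)
prior_matches = results.expanding().count().shift(1).fillna(0)

# Hide zero denominators so the first match becomes NaN, then use a neutral 0.5.
entering_pct = (prior_wins / prior_matches.where(prior_matches > 0)).fillna(0.5)

print(pd.DataFrame({
    "result": results,
    "prior_wins": prior_wins,
    "prior_matches": prior_matches,
    "entering_pct": entering_pct,
}))
```

The loss in the third row only shows up in the fourth row's `entering_pct`, which mirrors the Nadal trace: the value attached to a match never contains that match's own result.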

In [191]:
# 1. Create a dummy variable for "Win" (Target is already 1/0)
# We calculate the cumulative sum of wins and total matches played on THIS surface.
df_combined['p1_surface_wins'] = df_combined.groupby(['p1_id', 'surface'])['target'].transform(
    lambda x: x.expanding().sum().shift(1).fillna(0)
)

df_combined['p1_surface_matches'] = df_combined.groupby(['p1_id', 'surface'])['target'].transform(
    lambda x: x.expanding().count().shift(1).fillna(0)
)

# 2. Calculate win percentage on THIS surface
# We divide directly here: safe_div is only defined in a later cell, and it returns 0
# (not NaN) for zero matches, which would turn the 0.5 default below into a no-op.
prior_matches = df_combined['p1_surface_matches']
df_combined['p1_surface_win_pct'] = df_combined['p1_surface_wins'] / prior_matches.where(prior_matches > 0)

# 3. Fill Missing Values (First match on a surface)
# If a player has never played on this surface, we default to a neutral 0.5 (or use their overall win rate).
# For now, let's use 0.5 to assume "average" until proven otherwise.
# A better approach: Fill with their overall "rolling" win rate when their surface data is missing.
df_combined['p1_surface_win_pct'] = df_combined['p1_surface_win_pct'].fillna(0.5)

print("Surface performance calculated.")

# Quick check: Nadal on clay vs hard around his peak
# Nadal was "Clay God" around 2005-2010, nearly unbeatable.
# But Nadal was vulnerable on hard courts around the same period.
mask = (
    (df_combined['p1_name'] == 'Rafael Nadal') & 
    (df_combined['tourney_date'].dt.year >= 2005) & 
    (df_combined['tourney_date'].dt.year <= 2010)
)
cols = ['tourney_name','tourney_date', 'surface', 'target','p1_surface_win_pct']

for surface in ['Clay', 'Hard']:
    surface_mask = mask & (df_combined['surface'] == surface)
    print(f"Nadal's {surface.lower()} matches (2005-2010): {surface_mask.sum()}")
    print(df_combined[surface_mask][cols])

Surface performance calculated.
Nadal's clay matches (2005-2010): 185
           tourney_name tourney_date surface  target  p1_surface_win_pct
273572     Buenos Aires   2005-02-07    Clay       1            0.702703
273632     Buenos Aires   2005-02-07    Clay       1            0.710526
273665     Buenos Aires   2005-02-07    Clay       0            0.717949
273728  Costa do Sauipe   2005-02-14    Clay       1            0.700000
273800  Costa do Sauipe   2005-02-14    Clay       1            0.707317
...                 ...          ...     ...     ...                 ...
307636    Roland Garros   2010-05-24    Clay       1            0.921659
307652    Roland Garros   2010-05-24    Clay       1            0.922018
307660    Roland Garros   2010-05-24    Clay       1            0.922374
307664    Roland Garros   2010-05-24    Clay       1            0.922727
307666    Roland Garros   2010-05-24    Clay       1            0.923077

[185 rows x 5 columns]
Nadal's hard matches (2005-201

## 4. Feature 3: Historical Rolling Stats

We calculate a player's "form" by averaging their stats over the last 50 matches (approx. 52 weeks).

Key Metrics:
1. Serve Win %: The most important stat in tennis. (Points Won on Serve / Total Serve Points).
1. Return Win %: How good are they at breaking opponents?
1. Ace %: Aces normalized by the number of service points (fairer than "Aces per Match" which favors long 5-setters).
1. Break Point Save %: A measure of mental toughness under pressure.

Technical note: We use .shift(1) to ensure that the average for "Match X" is calculated using only matches before Match X.

### Why are we doing this?

Problem: We know how many aces Djokovic hit after the match. But the model needs to know how likely he is to hit aces before the match starts.

Solution: We calculate his "form". We look at his last 52 weeks (approx. 50 matches) and calculate his average performance.

Critical detail (Leakage Prevention): We must be extremely careful not to include the current match in the average. If we are predicting the Australian Open 2026 quarterfinal, we can only look at data up to the Round of 16. We use .shift(1) in the code so that each row only sees earlier matches.
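
To see what the rolling window plus shift does, here is a minimal sketch on made-up numbers. It uses a window of 2 and `min_periods=1` only to keep the table short (the notebook uses 50 and 5), and the column name `ace_pct` is a stand-in for any per-match percentage.

```python
import pandas as pd

# Made-up per-match ace percentages for two players, oldest match first.
toy = pd.DataFrame({
    "p1_id":   [1, 1, 1, 1, 2, 2],
    "ace_pct": [0.10, 0.20, 0.30, 0.40, 0.05, 0.15],
})

# Rolling mean per player, then shift(1) so each row only sees earlier matches.
toy["rolling_ace_pct"] = toy.groupby("p1_id")["ace_pct"].transform(
    lambda s: s.rolling(window=2, min_periods=1).mean().shift(1)
)

print(toy)
```

Each player's first row is NaN because there is no history yet, and player 1's fourth row averages only matches two and three (0.20 and 0.30), never the current 0.40.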

In [192]:
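# 1. Helper function: Safe division
# Avoids division by zero (e.g., if a player faced 0 break points).
# This function will be used in multiple percentage calculations.
# Caveat: it returns 0 whenever b is 0 or missing (NaN), so any match without
# recorded stats would enter the rolling averages below as 0%.
def safe_div(a, b):
    return np.where(b > 0, a / b, 0)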

# 2. Calculate raw percentages for each match
# These are the stats for the specific match played on that day.
# We will average these stats later.
df_combined['p1_serve_pts'] = df_combined['p1_svpt']
df_combined['p1_ace_pct'] = safe_div(df_combined['p1_ace'], df_combined['p1_svpt'])
df_combined['p1_df_pct'] = safe_div(df_combined['p1_df'], df_combined['p1_svpt'])
df_combined['p1_1st_in_pct'] = safe_div(df_combined['p1_1stIn'], df_combined['p1_svpt'])
df_combined['p1_1st_win_pct'] = safe_div(df_combined['p1_1stWon'], df_combined['p1_1stIn'])
df_combined['p1_2nd_win_pct'] = safe_div(df_combined['p1_2ndWon'], (df_combined['p1_svpt'] - df_combined['p1_1stIn']))
df_combined['p1_sv_win_pct'] = safe_div((df_combined['p1_1stWon'] + df_combined['p1_2ndWon']), df_combined['p1_svpt'])
df_combined['p1_bp_save_pct'] = safe_div(df_combined['p1_bpSaved'], df_combined['p1_bpFaced'])
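# Break points P1 converted on return: the break points P2 faced minus those P2 saved.
# (Using P1's own bpFaced/bpSaved here would just give 1 - p1_bp_save_pct.)
df_combined['p1_bp_convert_pct'] = safe_div(df_combined['p2_bpFaced'] - df_combined['p2_bpSaved'], df_combined['p2_bpFaced'])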

# 3. Rolling function
def calculate_rolling_stats(df, window_size=50):
    cols_to_roll = [
        'p1_ace_pct', 'p1_df_pct', 'p1_1st_in_pct', 
        'p1_1st_win_pct', 'p1_2nd_win_pct', 'p1_sv_win_pct', 
        'p1_bp_save_pct', 'p1_bp_convert_pct'
    ]
    
    # Sort: Player -> Date. Crucial for chronological rolling.
    df = df.sort_values(by=['p1_id', 'tourney_date'])
    
    # Group by Player and apply Rolling Mean
    # .shift(1) is essential. It moves the average down by one row.
    # This ensures that for Match N, we use the average of N-50 to N-1.
    # We do NOT see the stats of Match N in the average.
    rolling_stats = df.groupby('p1_id')[cols_to_roll].transform(
        lambda x: x.rolling(window=window_size, min_periods=5).mean().shift(1)
    )
    
    # Rename columns to indicate they are historical averages
    rolling_stats.columns = [f"rolling_{col}" for col in cols_to_roll]
    
    return rolling_stats

print("Calculating rolling stats ...")
df_rolling = calculate_rolling_stats(df_combined)

# 4. Merge back
df_combined = pd.concat([df_combined, df_rolling], axis=1)

# 5. Imputation for new players
# Players with <5 matches will have NaN rolling stats.
# We fill these with global averages (neutral assumption).
for col in df_rolling.columns:
    global_mean = df_combined[col].mean()
    df_combined[col] = df_combined[col].fillna(global_mean)

print("Rolling stats calculated and imputed.")

# Quick check: Look at Djokovic's serve win % to see if stats evolve
mask = (df_combined['p1_name'] == 'Novak Djokovic')
cols = ['tourney_name','tourney_date', 'rolling_p1_sv_win_pct']
print("\nDjokovic's serve win % evolution (first 10 matches):")
print(df_combined[mask][cols].head(10))

Calculating rolling stats ...
Rolling stats calculated and imputed.

Djokovic's serve win % evolution (first 10 matches):
                       tourney_name tourney_date  rolling_p1_sv_win_pct
268180  Davis Cup G2 R1: SCG vs LAT   2004-04-09               0.324854
270441                         Umag   2004-07-19               0.324854
271562                    Bucharest   2004-09-13               0.324854
271619                    Bucharest   2004-09-13               0.324854
271847                      Bangkok   2004-09-27               0.324854
273091              Australian Open   2005-01-17               0.425113
274066  Davis Cup G1 R1: SCG vs ZIM   2005-03-04               0.407292
274270  Davis Cup G1 R1: SCG vs ZIM   2005-03-04               0.349107
274763                     Valencia   2005-04-04               0.305469
275279  Davis Cup G1 R2: SCG vs BEL   2005-04-29               0.329499
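
Serve win percentages in the 0.3 to 0.4 range are low for professional tennis. That would be consistent with `safe_div` returning 0 for matches that have no recorded serve stats, so those matches count as 0% in the rolling mean. The sketch below shows one possible variant, under the assumption that such matches should be skipped rather than counted as zero. The name `safe_div_nan` and the three toy matches are hypothetical.

```python
import numpy as np
import pandas as pd

def safe_div_nan(a, b):
    """Divide a by b, returning NaN (not 0) where b is 0 or missing."""
    b = pd.Series(b, dtype="float64")
    return pd.Series(a, dtype="float64") / b.where(b > 0)

# Made-up matches: the middle one has no recorded serve stats.
won = pd.Series([36.0, 0.0, 48.0])
svpt = pd.Series([60.0, 0.0, 60.0])

pct_nan = safe_div_nan(won, svpt)                      # 0.6, NaN, 0.8
pct_zero = pd.Series(np.where(svpt > 0, won / svpt, 0))  # 0.6, 0.0, 0.8

# rolling().mean() skips NaN values, so the missing match is ignored.
mean_nan = pct_nan.rolling(window=3, min_periods=1).mean()
mean_zero = pct_zero.rolling(window=3, min_periods=1).mean()

print(pd.DataFrame({"skip_missing": mean_nan, "missing_as_zero": mean_zero}))
```

With the NaN variant the three-match average is 0.7. With zeros it falls to about 0.47, even though the player won 60% and 80% of serve points in the two matches that were actually tracked.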


## 5. Feature 4: Recent Form (Short-Term Win Rate)

Here we calculate the player's momentum. "Recent form" tells us whether a player is currently "hot" or "cold" in terms of actual wins and losses.

Metric: Percentage of matches won in the last 10 matches.

### Why only last 10 matches? 

A window of 50 (used earlier) captures "Class" (overall quality). A window of 10 captures "Form" (current streak).

Leakage Prevention: Again, we use .shift(1) to strictly look at the previous 10 matches before the current one.

In [193]:
# 1. Sort by Player and Date
df_combined = df_combined.sort_values(by=['p1_id', 'tourney_date'])

# 2. Calculate Recent Win Rate
# We take the 'target' (1 for win, 0 for loss) and average it over the last 10 matches.
df_combined['p1_form_win_pct'] = df_combined.groupby('p1_id')['target'].transform(
    lambda x: x.rolling(window=10, min_periods=3).mean().shift(1)
)

# 3. Fill Missing Values
# If a player has played fewer than 3 matches, we default to neutral 0.5 (or global mean).
# For now, let's use 0.5 to assume "average form".
df_combined['p1_form_win_pct'] = df_combined['p1_form_win_pct'].fillna(0.5)

print("Recent form calculated.")

# Quick check: Player on a winning streak
# e.g., Djokovic in late 2015/early 2016 was simply unbeatable.
mask = (df_combined['p1_name'] == 'Novak Djokovic') & (df_combined['tourney_date'].dt.year == 2016)
cols = ['tourney_name', 'tourney_date', 'p1_name', 'target', 'p1_form_win_pct']
print("\nDjokovic's form in early 2016:")
print(df_combined[mask][cols].head(10))

Recent form calculated.

Djokovic's form in early 2016:
           tourney_name tourney_date         p1_name  target  p1_form_win_pct
340385             Doha   2016-01-04  Novak Djokovic       1              0.9
340433             Doha   2016-01-04  Novak Djokovic       1              0.9
340456             Doha   2016-01-04  Novak Djokovic       1              0.9
340460             Doha   2016-01-04  Novak Djokovic       1              0.9
340462             Doha   2016-01-04  Novak Djokovic       1              0.9
340572  Australian Open   2016-01-18  Novak Djokovic       1              0.9
340700  Australian Open   2016-01-18  Novak Djokovic       1              0.9
340764  Australian Open   2016-01-18  Novak Djokovic       1              1.0
340796  Australian Open   2016-01-18  Novak Djokovic       1              1.0
340812  Australian Open   2016-01-18  Novak Djokovic       1              1.0


## 6. Feature Selection and Export

We filter the dataset to keep only the columns that our Machine Learning model is "allowed" to see.

- Keep: Pre-match stats (e.g., rank, age, H2H, rolling averages, surface win %).
- Drop: Post-match stats (e.g., aces in this match, minutes, p1_post_wins). Using these would cause data leakage (cheating).
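
One way to enforce the Keep/Drop rule is a small guard that flags any post-match column in a candidate feature list. This is a sketch, not part of the original pipeline: `leaked_columns` and `POST_MATCH_COLUMNS` are hypothetical names, and the banned set is assembled from the per-match columns created earlier in this notebook.

```python
# Per-match stat suffixes used by the p1_/p2_ columns in this notebook.
STAT_SUFFIXES = [
    "ace", "df", "svpt", "1stIn", "1stWon",
    "2ndWon", "SvGms", "bpSaved", "bpFaced",
]

# Columns that are only known AFTER the match has been played.
POST_MATCH_COLUMNS = {f"{p}_{s}" for p in ("p1", "p2") for s in STAT_SUFFIXES}
POST_MATCH_COLUMNS |= {"minutes", "p1_post_wins", "p1_post_losses"}

def leaked_columns(feature_names):
    """Return the post-match columns found in a feature list, sorted by name."""
    return sorted(set(feature_names) & POST_MATCH_COLUMNS)

# A candidate list with two leaky columns slipped in on purpose.
candidate = ["p1_rank", "p1_h2h_wins", "rolling_p1_ace_pct", "p1_ace", "p1_post_wins"]
leaks = leaked_columns(candidate)
print("Leaky columns:", leaks)
```

The banned set could be extended with the unshifted per-match percentages (for example `p1_ace_pct`), since those also describe the current match.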

In [194]:
# 1. Define the feature list
# These are the ONLY columns we keep: metadata for tracking, the target, and the training features.
features = [
    # Metadata (for tracking, not training)
    # Added 'tourney_id' for safe merging (tourney_name is not unique)
    'tourney_id', 'tourney_name', 'tourney_date', 'match_num', 'p1_name', 'p2_name', 
    
    # Target (what we want to predict)
    'target',
    
    # Core features (known BEFORE the match)
    'surface', 'draw_size', 'tourney_level',  # Tournament Context
    'p1_rank', 'p2_rank',                     # Rankings
    'p1_age', 'p2_age', 'p1_ht', 'p2_ht',     # Physicals
    'p1_h2h_wins', 'p1_h2h_losses',           # History
    
    # rolling stats (P1's recent form)
    'rolling_p1_ace_pct', 'rolling_p1_df_pct', 
    'rolling_p1_1st_in_pct', 'rolling_p1_1st_win_pct', 
    'rolling_p1_2nd_win_pct', 'rolling_p1_sv_win_pct', 
    'rolling_p1_bp_save_pct', 'rolling_p1_bp_convert_pct',
    
    # Surface and form specifics
    'p1_surface_win_pct', 'p1_form_win_pct'
]

# 2. Create featured dataframe
df_featured = df_combined[features].copy()

# 3. Handling missing values (precautionary)
# If any NaN remains, fill with 0 or neutral value
df_featured = df_featured.fillna(0)

# 4. Save the final featured dataset
output_file = 'master_data_featured.csv'
df_featured.to_csv(output_file, index=False)

dff = df_featured[df_featured['tourney_date'].dt.year >= 2000].reset_index(drop=True)
print(f"Data Loaded: {len(dff):,} matches (2000-Present)")

Data Loaded: 156,176 matches (2000-Present)
