## Notebook 3: Advanced Feature Engineering - Elo Ratings

In Notebook 2, we calculated standard tennis statistics (z.B., serve %, recent form). In this notebook, we introduce Elo Ratings, a system originally designed for Chess but highly predictive in tennis.

### What is Elo?
Originally designed for Chess, Elo is a system that calculates the relative skill levels of players in zero-sum games.
- ATP Rankings vs Elo:
    - ATP Rankings are based on how far you go in a tournament. 
    - Elo is based on who you beat. 
    - Hence, beating the World No. 1 yields more points than beating the World No. 100.
- The Zero-Sum Rule: 
    - If Player A gains 10 points, Player B loses 10 points. 
    - This keeps the system slightly inflation-proof.

### Our Approach
We maintain two different types of Elo ratings.
- Overall Elo: A single rating representing general strength.
- Surface Elo: Separate ratings for Hard, Clay, and Grass. This is crucial for tennis (z.B., Nadal's Clay rating is historically much higher than his Grass rating).

In this notebook, we load two datasets:
- df_clean: The raw history (1 row per match). We use this to calculate Elo to avoid double-counting.
- df_featured: The training data (2 rows per match). We will merge the calculated ratings into this file.


In [28]:
import pandas as pd
import numpy as np
import time

# Display settings to see all columns
pd.set_option('display.max_columns', None)

# 1. Load the CLEAN dataset (1 row per match)
# We calculate the running Elo history on this file because it represents the timeline perfectly.
# Using the "doubled" dataset would cause us to update the ratings twice per match (bad).
df_clean = pd.read_csv('master_data_cleaned.csv')
df_clean['tourney_date'] = pd.to_datetime(df_clean['tourney_date'])

# Sort chronologically
# We need to calculate ratings in the correct order of time.
# Added 'tourney_id' to the sort for perfect deterministic ordering
df_clean = df_clean.sort_values(by=['tourney_date', 'tourney_id', 'match_num']).reset_index(drop=True)

# 2. Load the FEATURED dataset (2 rows per match)
# This is our target. We will deposit the calculated ratings into this file at the end.
df_featured = pd.read_csv('master_data_featured.csv')
df_featured['tourney_date'] = pd.to_datetime(df_featured['tourney_date'])

print(f"Loaded Clean Data (Calculation Source): {len(df_clean):,} matches")
print(f"Loaded Featured Data (Merge Target): {len(df_featured):,} rows")

Loaded Clean Data (Calculation Source): 198,055 matches
Loaded Featured Data (Merge Target): 396,110 rows


## 2. The Elo Engine
Here we define the EloTracker class. This "engine" handles the math.

### The Math
The formula for updating a player's rating is:
$$R_{new} = R_{old} + K \times (Actual - Expected)$$
where:
- $R_{old}$: The player's current rating (starts at 1500).
- $K$ (K-factor): The "weight" of the match.
    - Grand Slams ($K=50$): Big updates. A win here matters significantly.
    - Futures ($K=20$): Small updates. A win here is less impactful.
- $Expected$: The probability of winning, calculated using the difference in ratings.
    - If Djokovic (Rating 2500) plays a Qualifier (Rating 1500), Djokovic's expected score is near 1.0 (99%).
    - If he wins, he gains very few points: $Actual (1) - Expected (0.99) = 0.01$.
    - If he loses, he Elo crashes: $Actual (0) - Expected (0.99) = -0.99$.

In [29]:
class EloTracker:
    def __init__(self, start_rating=1500):
        # A dictionary to store the current rating of every player ID
        # Format: {'player_id': 1560.5, 'another_id': 1490.2}
        self.ratings = {} 
        self.start_rating = start_rating
        
    def get_rating(self, player_id):
        # Return the player's rating. 
        # If they are new, return the starting 1500.
        return self.ratings.get(player_id, self.start_rating)
    
    def update_ratings(self, winner_id, loser_id, k_factor):
        # 1. Get current ratings (BEFORE the match)
        r_winner = self.get_rating(winner_id)
        r_loser = self.get_rating(loser_id)
        
        # 2. Calculate Expected Score (win probability)
        # Formula: 1 / (1 + 10^((OpponentRating - MyRating) / 400))
        # This is the standard logistic curve used in chess.
        expected_winner = 1 / (1 + 10 ** ((r_loser - r_winner) / 400))
        expected_loser = 1 / (1 + 10 ** ((r_winner - r_loser) / 400))
        
        # 3. Calculate New Ratings
        # Winner: Actual score is 1.0
        new_r_winner = r_winner + k_factor * (1 - expected_winner)
        
        # Loser: Actual score is 0.0
        new_r_loser = r_loser + k_factor * (0 - expected_loser)
        
        # 4. Save New Ratings to the dictionary
        self.ratings[winner_id] = new_r_winner
        self.ratings[loser_id] = new_r_loser
        
        # 5. Return the PRE-match ratings
        # Important: We return r_winner, NOT new_r_winner.
        # Reason: For prediction, we only know the rating BEFORE the match happens.
        return r_winner, r_loser

# Define K-Factors (wweight of the match)
# Grand Slams impact your rating more than Challengers.
K_MAP = {
    'G': 50,      # Grand Slam (Highest weight)
    'F': 45,      # Tour Finals
    'M': 40,      # Masters 1000
    'A': 35,      # ATP 500/250
    'D': 30,      # Davis Cup
    'C': 25,      # Challengers
    'S': 20,      # Futures (Lowest weight)
    'Unknown': 30 # Fallback
}

## 3. Calculating Overall Elo
Now we run the history simulation. We iterate through 150,000+ matches chronologically.

Loop Logic:
1. Look at Match $N$.
2. Retrieve the current ratings of the Winner and Loser.
3. Store them in our list (to add to the dataframe later).
4. Update the EloTracker so the ratings are correct for Match $N+1$.

In [30]:
elo_tracker = EloTracker(start_rating=1500)
winner_elos = []
loser_elos = []

print("Calculating Overall Elo history (1968-2026) ...")
start_time = time.time()

# Iterate through each match in chronological order
for idx, row in df_clean.iterrows():
    w_id = row['winner_id']
    l_id = row['loser_id']
    level = row['tourney_level']
    
    # 1. Determine the K-Factor for this specific match
    k = K_MAP.get(level, 30)  # Default to 30 if level is missing
    
    # 2. Update ratings and capture the values entering the match
    w_elo, l_elo = elo_tracker.update_ratings(w_id, l_id, k)
    
    winner_elos.append(w_elo)
    loser_elos.append(l_elo)

# Add the calculated lists back into the clean dataframe
df_clean['winner_elo'] = winner_elos
df_clean['loser_elo'] = loser_elos

print(f"Overall Elo calculated in {time.time() - start_time:.2f} seconds.")

# Quick Check:
print(f"\nOverall Elo ratings from the last 3 matches before AO 2026:")
print(df_clean[['tourney_name','tourney_date', 'winner_name', 'winner_elo', 'loser_name', 'loser_elo']].tail(3))

Calculating Overall Elo history (1968-2026) ...
Overall Elo calculated in 3.09 seconds.

Overall Elo ratings from the last 3 matches before AO 2026:
       tourney_name tourney_date   winner_name   winner_elo  \
198052     Adelaide   2026-01-16   Ugo Humbert  1773.060491   
198053     Auckland   2026-01-17  Jakub Mensik  1786.858666   
198054     Adelaide   2026-01-17  Tomas Machac  1726.534933   

                         loser_name    loser_elo  
198052  Alejandro Davidovich Fokina  1812.581783  
198053               Sebastian Baez  1676.588333  
198054                  Ugo Humbert  1789.759441  


## 4. Calculating Surface-Specific Elo

A generic rating doesn't capture surface specialists (z.B., Nadal on Clay vs. Nadal on Indoor Hard). We address this limitation by running three independent Elo simulations in parallel.

- Hard Court Tracker: Only updates when surface == 'Hard'.
- Clay Court Tracker: Only updates when surface == 'Clay'.
- Grass Court Tracker: Only updates when surface == 'Grass'.

Note on 'Carpet': This surface is obsolete. For simplicity, we track it separately, but it won't be heavily used in modern prediction.

In [31]:
# Initialize a separate dictionary for each surface
surface_trackers = {
    'Hard': EloTracker(),
    'Clay': EloTracker(),
    'Grass': EloTracker(),
    'Carpet': EloTracker(),
    'Unknown': EloTracker()
}

winner_surface_elos = []
loser_surface_elos = []

print("Calculating Surface-Specific Elo history (1968-2026) ...")

# Iterate through each match in chronological order
for idx, row in df_clean.iterrows():
    w_id = row['winner_id']
    l_id = row['loser_id']
    surface = row['surface']
    level = row['tourney_level']
    k = K_MAP.get(level, 35)
    
    # 1. Identify which tracker to use
    if surface not in surface_trackers:
        target_tracker = surface_trackers['Unknown']
    else:
        target_tracker = surface_trackers[surface]
    
    # 2. Update ONLY that specific tracker
    # The 'Clay' tracker doesn't care if you lost a match on 'Grass'.
    w_elo, l_elo = target_tracker.update_ratings(w_id, l_id, k)
    
    winner_surface_elos.append(w_elo)
    loser_surface_elos.append(l_elo)

# Add to the clean dataframe
df_clean['winner_surface_elo'] = winner_surface_elos
df_clean['loser_surface_elo'] = loser_surface_elos

print(f"Surface Elo calculated in {time.time() - start_time:.2f} seconds.")

Calculating Surface-Specific Elo history (1968-2026) ...
Surface Elo calculated in 6.53 seconds.


## 5. Merging Elo into the Training Data

We now have the ratings, but they are in the wrong format (df_clean has 1 row per match). We need to move them to df_featured (2 rows per match).

Merge Logic: 
1. Extract: We take the Elo columns from df_clean.
1. Join: We merge df_clean and df_featured using tourney_date and match_num as a unique identifier.
1. Distribute:
    - If target == 1 (P1 Won): We assign winner_elo to P1 and loser_elo to P2.
    - If target == 0 (P1 Lost): We assign loser_elo to P1 and winner_elo to P2.

In [35]:
# 1. Create a "lookup key" to join the datasets
# We use tourney_id+match_num as unique ID (torney_name is not unique)
# We can simply merge based on the index if we sort them identically, but a merge key is safer.
# Now, subset the Elo data to just what we need to merge
elo_features = df_clean[['tourney_id', 'match_num', 'winner_id', 'loser_id', 
                         'winner_elo', 'loser_elo', 
                         'winner_surface_elo', 'loser_surface_elo']].copy()

# Critical FIX: Ensure the Elo Source is Unique
# We drop duplicates based on the merge keys to prevent row explosion.
elo_features = elo_features.drop_duplicates(subset=['tourney_id', 'match_num'])

# 2. Merge onto the featured dataset
# Since df_featured has P1 and P2, we need to know who is who. 
# In df_featured, 'target=1' means P1 is Winner. 'target=0' means P1 is Loser.
# Strategy: Merge on (tourney_date, match_num), then assign p1_elo based on target.
df_final = pd.merge(df_featured, elo_features, on=['tourney_id', 'match_num'], how='left')

# 3. Assign Elo to P1 and P2
# If target == 1 (P1 Won): P1 is Winner, P2 is Loser
# If target == 0 (P1 Lost): P1 is Loser, P2 is Winner
conditions = [df_final['target'] == 1, df_final['target'] == 0]

# Overall Elo
# If P1 Won: P1 gets WinnerElo, P2 gets LoserElo
# If P1 Lost: P1 gets LoserElo, P2 gets WinnerElo
choices_p1 = [df_final['winner_elo'], df_final['loser_elo']]
choices_p2 = [df_final['loser_elo'], df_final['winner_elo']]
df_final['p1_elo'] = np.select(conditions, choices_p1)
df_final['p2_elo'] = np.select(conditions, choices_p2)

# Same for Surface Elo
choices_p1_surf = [df_final['winner_surface_elo'], df_final['loser_surface_elo']]
choices_p2_surf = [df_final['loser_surface_elo'], df_final['winner_surface_elo']]
df_final['p1_surface_elo'] = np.select(conditions, choices_p1_surf)
df_final['p2_surface_elo'] = np.select(conditions, choices_p2_surf)

# 4. Cleanup
# Remove the temporary "winner_elo" columns we just merged
df_final = df_final.drop(columns=['winner_id', 'loser_id', 'winner_elo', 'loser_elo', 
                                  'winner_surface_elo', 'loser_surface_elo'])

# 5. Save the final dataset
output_file = 'master_data_final.csv'
df_final.to_csv(output_file, index=False)


# Merge sanity check: Check row counts prior to and after merge
if len(df_final) == len(df_featured):
    print("SUCCESS: Merge is clean. Row counts match.")
else:
    print(f"WARNING: Row count mismatch! ({len(df_featured)} vs {len(df_final)})")

print("SUCCESS: Elo feature engineering complete.")
print(f"Final dataset shape: {df_final.shape}")
print(f"Saved to: {output_file}")

# Quick check: Djkovic's Elo in late 2023
print("\nDjokovic's Elo rating in late 2023:")
mask = (df_final['p1_name'] == 'Novak Djokovic') & (df_final['tourney_date'].dt.year == 2023)
cols = ['tourney_name', 'tourney_date', 'p1_name', 'p1_elo', 'p1_surface_elo']
print(df_final[mask][cols].tail())

SUCCESS: Merge is clean. Row counts match.
SUCCESS: Elo feature engineering complete.
Final dataset shape: (396110, 32)
Saved to: master_data_final.csv

Djokovic's Elo rating in late 2023:
                          tourney_name tourney_date         p1_name  \
90963                      Tour Finals   2023-11-13  Novak Djokovic   
90964                      Tour Finals   2023-11-13  Novak Djokovic   
90965                      Tour Finals   2023-11-13  Novak Djokovic   
90966  Davis Cup Finals QF: SRB vs GBR   2023-11-23  Novak Djokovic   
90967  Davis Cup Finals SF: ITA vs SRB   2023-11-25  Novak Djokovic   

            p1_elo  p1_surface_elo  
90963  2329.052087     2316.535254  
90964  2292.108507     2280.263135  
90965  2304.060299     2288.352485  
90966  2315.728410     2300.974494  
90967  2316.415782     2301.698320  
