# Predict

**INPUT**: Any model in "./models/" (for example best_xgb_model.json) and any new data the model was not trained on (for example "./data/all/aus_open_2025_new.csv"). Alternatively, you can just predict individual matches.

**OUTPUT**: Predictions of matches

In [None]:
import numpy as np
import pandas as pd
from tqdm import tqdm
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import plot_tree
from sklearn import tree
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from tensorflow import keras
from tensorflow.keras import layers
from utils.updateStats import getStats, updateStats, createStats
pd.set_option('display.max_columns', None)

## Re-Calculate all the stats

First, we need to recalculate all the stats. I could have exported them in 1.CreateDataset, but I thought it would be simpler to recompute them here (instead of exporting the statistics, which might take up a lot of space).

This is fine, since it only takes a minute on my machine. Obviously, if it took longer, I would export all the stats directly from 1.CreateDataset instead of doing this.
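
If the recomputation ever became slow, one option would be to cache the stats object on disk. The following is only a sketch: it assumes the object built by `createStats()` and `updateStats()` can be pickled (containers that use lambda default factories cannot), and `load_or_build`, `build_demo` and the cache file name are hypothetical names introduced for illustration.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_build(cache_path, build):
    """Return the cached object if the file exists, otherwise build and cache it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as f:
            return pickle.load(f)
    obj = build()
    with cache_path.open("wb") as f:
        pickle.dump(obj, f)
    return obj


# Demo with a placeholder object standing in for the real stats
calls = []

def build_demo():
    calls.append(1)                      # count how often the slow path runs
    return {"elo": {"player_a": 1500.0}}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "prev_stats.pkl"
    first = load_or_build(path, build_demo)   # builds and writes the cache
    second = load_or_build(path, build_demo)  # reads the cache, no rebuild
```

In the notebook, `build` would be a function that wraps the `createStats()` and `updateStats()` loop.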

In [None]:
clean_data = pd.read_csv("./data/0cleanDataset.csv")
prev_stats = createStats()

# Iterate through each row in clean_data
for index, row in tqdm(clean_data.iterrows(), total=len(clean_data)):
    ########## UPDATE STATS ##########
    # We only need to update the stats, since we don't need to create a dataset
    prev_stats = updateStats(row, prev_stats)

100%|██████████| 95375/95375 [00:05<00:00, 16548.56it/s]


## Predict Any Two Players

In [None]:
# Load the model from models
xgb_model = XGBClassifier()
xgb_model.load_model("./models/xgb_model.json")

# I define this here to make the results easier to interpret
mapper = np.vectorize(lambda x: "Player 2 Wins" if x == 0 else "Player 1 Wins")

Here, I'm going to predict a match between Sinner and Alcaraz, simulating a Grand Slam match on a hard court.

In [None]:
# Example match between Jannik Sinner (player 1) and Carlos Alcaraz (player 2)
player1 = {
    "Name": "Jannik Sinner",                    # Name is not needed, but I wrote it for clarity
    "ID": 206173,                               # You can search for the ID in "./data/atp_players.csv"
    "ATP_POINTS": 11000,                        # You can find this on the ATP website
    "ATP_RANK": 1,                              # You can find this on the ATP website
    "AGE": 23.6,                                # The age does not need decimal precision (but the more info the better)
    "HEIGHT": 191,                              # This can also be found in "./data/atp_players.csv"
}

player2 = {
    "Name": "Carlos Alcaraz",
    "ID": 207989,
    "ATP_POINTS": 5000,
    "ATP_RANK": 3,
    "AGE": 21.6,
    "HEIGHT": 183,
}

match = {
    "BEST_OF": 5,                               # Set this to 5 if grand slam, otherwise 3 normally
    "DRAW_SIZE": 128,                           # Depending on the tournament
    "SURFACE": "Hard",                          # Surface of the match. Options are ("Hard", "Clay", "Grass", "Carpet")
}

# Call getStats to build the features for this match
output = getStats(player1, player2, match, prev_stats)

# Sort the features by name so the column order is deterministic
match_data = pd.DataFrame([dict(sorted(output.items()))])
mapper(xgb_model.predict(np.array(match_data, dtype=object)))

array(['Player 1 Wins'], dtype='<U13')

In [None]:
match_data

Unnamed: 0,AGE_DIFF,ATP_POINTS_DIFF,ATP_RANK_DIFF,BEST_OF,DRAW_SIZE,ELO_DIFF,ELO_GRAD_LAST_100_DIFF,ELO_GRAD_LAST_10_DIFF,ELO_GRAD_LAST_200_DIFF,ELO_GRAD_LAST_25_DIFF,ELO_GRAD_LAST_3_DIFF,ELO_GRAD_LAST_50_DIFF,ELO_GRAD_LAST_5_DIFF,ELO_SURFACE_DIFF,H2H_DIFF,H2H_SURFACE_DIFF,HEIGHT_DIFF,N_GAMES_DIFF,P_1ST_IN_LAST_100_DIFF,P_1ST_IN_LAST_10_DIFF,P_1ST_IN_LAST_200_DIFF,P_1ST_IN_LAST_25_DIFF,P_1ST_IN_LAST_3_DIFF,P_1ST_IN_LAST_50_DIFF,P_1ST_IN_LAST_5_DIFF,P_1ST_WON_LAST_100_DIFF,P_1ST_WON_LAST_10_DIFF,P_1ST_WON_LAST_200_DIFF,P_1ST_WON_LAST_25_DIFF,P_1ST_WON_LAST_3_DIFF,P_1ST_WON_LAST_50_DIFF,P_1ST_WON_LAST_5_DIFF,P_2ND_WON_LAST_100_DIFF,P_2ND_WON_LAST_10_DIFF,P_2ND_WON_LAST_200_DIFF,P_2ND_WON_LAST_25_DIFF,P_2ND_WON_LAST_3_DIFF,P_2ND_WON_LAST_50_DIFF,P_2ND_WON_LAST_5_DIFF,P_ACE_LAST_100_DIFF,P_ACE_LAST_10_DIFF,P_ACE_LAST_200_DIFF,P_ACE_LAST_25_DIFF,P_ACE_LAST_3_DIFF,P_ACE_LAST_50_DIFF,P_ACE_LAST_5_DIFF,P_BP_SAVED_LAST_100_DIFF,P_BP_SAVED_LAST_10_DIFF,P_BP_SAVED_LAST_200_DIFF,P_BP_SAVED_LAST_25_DIFF,P_BP_SAVED_LAST_3_DIFF,P_BP_SAVED_LAST_50_DIFF,P_BP_SAVED_LAST_5_DIFF,P_DF_LAST_100_DIFF,P_DF_LAST_10_DIFF,P_DF_LAST_200_DIFF,P_DF_LAST_25_DIFF,P_DF_LAST_3_DIFF,P_DF_LAST_50_DIFF,P_DF_LAST_5_DIFF,WIN_LAST_100_DIFF,WIN_LAST_10_DIFF,WIN_LAST_200_DIFF,WIN_LAST_25_DIFF,WIN_LAST_3_DIFF,WIN_LAST_50_DIFF,WIN_LAST_5_DIFF
0,2.0,6000,-2,5,128,153.164374,0.00063,-0.030303,0.001775,0.000769,0.0,0.007827,-0.2,217.45029,-2,-3,8,77,-3.025244,3.164825,-5.890701,-1.182884,6.01561,-3.723458,1.3387,5.740334,0.600188,4.445571,2.220428,-10.438965,4.76851,-3.171083,0.882389,0.787472,0.186019,0.290907,-14.050224,0.30305,-4.582411,3.84029,4.023506,3.423329,4.112445,0.904653,4.829249,0.633495,9.673198,12.583333,3.225148,9.313709,22.222222,9.236202,15.833333,-0.611287,-1.304063,-0.134434,-0.702541,-0.268875,-0.038423,-1.614634,13,3,3,6,0,5,1


How cool! I simulated a match between Carlos Alcaraz and Jannik Sinner. As you can see, on a hard court the model predicts that Jannik Sinner wins. However, when I tried changing the surface to Grass or Clay, Carlos Alcaraz was predicted as the favorite.

This is super cool, because that's what I would have predicted myself. Carlos won Roland Garros and Wimbledon, and he's really good on both of those surfaces. Meanwhile, Sinner excels on hard courts and has won the last two Australian Open tournaments and the last US Open.

In [None]:
# Check how confident the model is in its prediction
probs = xgb_model.predict_proba(np.array(match_data, dtype=object))

# Extract probability of each class
prob_player1_wins = probs[0][1]
prob_player2_wins = probs[0][0]

print(f"Probability of {player1['Name']} winning: {prob_player1_wins:.2%}")
print(f"Probability of {player2['Name']} winning: {prob_player2_wins:.2%}")

Probability of Jannik Sinner winning: 71.22%
Probability of Carlos Alcaraz winning: 28.78%


As shown above, we can also check the probability the model assigns to each outcome. It is derived from the outputs of all the trees, which is a bit more complicated than a discrete vote as in random forests.
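
To make the comparison with voting concrete: for a binary XGBoost classifier trained with the default `binary:logistic` objective (an assumption about how this model was trained), each tree contributes a real-valued leaf score, the scores are summed into a raw margin, and the logistic function turns that margin into a probability. A minimal sketch with made-up leaf values that are not taken from the actual model:

```python
import numpy as np


def sigmoid(z):
    """Logistic function: maps a raw margin to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical leaf scores, one per tree, for a single match (illustration only)
leaf_values = np.array([0.40, 0.25, -0.10, 0.35])
base_margin = 0.0                        # assumed starting margin before any tree

margin = base_margin + leaf_values.sum() # trees add up, they do not vote
prob_p1 = sigmoid(margin)                # probability of class 1 (Player 1 wins)
prob_p2 = 1.0 - prob_p1                  # probability of class 0 (Player 2 wins)

print(f"margin={margin:.2f}  P(player 1)={prob_p1:.2%}  P(player 2)={prob_p2:.2%}")
```

With the real model, `xgb_model.predict(X, output_margin=True)` should return the raw margin, so applying `sigmoid` to it should reproduce the class-1 column of `predict_proba`.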

## Predict Australian Open

In [None]:
aus_open_data = pd.read_csv("./data/all/aus_open_2025_new.csv")
aus_open_data

Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,p1_id,p1_seed,p1_entry,p1_name,p1_hand,p1_ht,p1_ioc,p1_age,p2_id,p2_seed,p2_entry,p2_name,p2_hand,p2_ht,p2_ioc,p2_age,score,best_of,round,minutes,p1_ace,p1_df,p1_svpt,p1_1stIn,p1_1stWon,p1_2ndWon,p1_SvGms,p1_bpSaved,p1_bpFaced,p2_ace,p2_df,p2_svpt,p2_1stIn,p2_1stWon,p2_2ndWon,p2_SvGms,p2_bpSaved,p2_bpFaced,p1_rank,p1_rank_points,p2_rank,p2_rank_points,RESULT
0,AUSTRALIAN_OPEN_2025,0.0,Hard,128,0.0,0.0,0.0,117357,0.0,0.0,0.0,0.0,183.0,0.0,27.873374,200384,0.0,0.0,0.0,0.0,173.0,0.0,24.514716,0.0,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,179,323,81,703,0
1,AUSTRALIAN_OPEN_2025,0.0,Hard,128,0.0,0.0,0.0,202261,0.0,0.0,0.0,0.0,193.0,0.0,23.780287,209950,0.0,0.0,0.0,0.0,185.0,0.0,20.804928,0.0,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,93,627,21,2280,0
2,AUSTRALIAN_OPEN_2025,0.0,Hard,128,0.0,0.0,0.0,105453,0.0,0.0,0.0,0.0,178.0,0.0,35.257358,106329,0.0,0.0,0.0,0.0,183.0,0.0,30.837782,0.0,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,76,743,105,566,1
3,AUSTRALIAN_OPEN_2025,0.0,Hard,128,0.0,0.0,0.0,134770,0.0,0.0,0.0,0.0,183.0,0.0,26.276523,144719,0.0,0.0,0.0,0.0,183.0,0.0,27.908966,0.0,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6,4210,62,922,1
4,AUSTRALIAN_OPEN_2025,0.0,Hard,128,0.0,0.0,0.0,207352,0.0,0.0,0.0,0.0,185.0,0.0,23.199863,200273,0.0,0.0,0.0,0.0,188.0,0.0,26.613279,0.0,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,67,784,219,264,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
111,AUSTRALIAN_OPEN_2025,0.0,Hard,128,0.0,0.0,0.0,207989,0.0,0.0,0.0,0.0,183.0,0.0,21.908966,104925,0.0,0.0,0.0,0.0,188.0,0.0,37.862423,0.0,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3,7010,7,3900,0
112,AUSTRALIAN_OPEN_2025,0.0,Hard,128,0.0,0.0,0.0,210097,0.0,0.0,0.0,0.0,193.0,0.0,22.479124,132283,0.0,0.0,0.0,0.0,191.0,0.0,29.892539,0.0,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,20,2280,55,1026,1
113,AUSTRALIAN_OPEN_2025,0.0,Hard,128,0.0,0.0,0.0,200282,0.0,0.0,0.0,0.0,183.0,0.0,26.120465,206173,0.0,0.0,0.0,0.0,191.0,0.0,23.626968,0.0,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8,3535,1,11830,0
114,AUSTRALIAN_OPEN_2025,0.0,Hard,128,0.0,0.0,0.0,210097,0.0,0.0,0.0,0.0,193.0,0.0,22.479124,206173,0.0,0.0,0.0,0.0,191.0,0.0,23.626968,0.0,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,20,2280,1,11830,0


In [None]:
# Work on a copy so the loaded data is not modified when columns are added
aus_open_predict = aus_open_data.copy()
predictions = []
probs_p1 = []
probs_p2 = []

for index, row in tqdm(aus_open_predict.iterrows(), total=len(aus_open_predict)):
    player1 = {
        "ID": row["p1_id"],
        "ATP_POINTS": row["p1_rank_points"],
        "ATP_RANK": row["p1_rank"],
        "AGE": row["p1_age"],
        "HEIGHT": row["p1_ht"],
    }

    player2 = {
        "ID": row["p2_id"],
        "ATP_POINTS": row["p2_rank_points"],
        "ATP_RANK": row["p2_rank"],
        "AGE": row["p2_age"],
        "HEIGHT": row["p2_ht"],
    }

    match = {
        "BEST_OF": row["best_of"],
        "DRAW_SIZE": row["draw_size"],
        "SURFACE": row["surface"],
    }

    ########## GET STATS ##########
    # Call getStats to build the features for this match
    output = getStats(player1, player2, match, prev_stats)

    match_data = pd.DataFrame([dict(sorted(output.items()))])

    # Predict Match Outcome
    prediction = xgb_model.predict(np.array(match_data, dtype=object))
    predictions.append(prediction[0])

    # Predict to Get Probabilities
    probs = xgb_model.predict_proba(np.array(match_data, dtype=object))

    # Extract probability of each class
    prob_player1_wins = probs[0][1]
    prob_player2_wins = probs[0][0]

    probs_p1.append(prob_player1_wins)
    probs_p2.append(prob_player2_wins)


# Add the predictions and probabilities as new columns
aus_open_predict["PREDICTION"] = predictions
aus_open_predict["% Player 1 Wins"] = probs_p1
aus_open_predict["% Player 2 Wins"] = probs_p2

100%|██████████| 116/116 [00:00<00:00, 569.10it/s]


In [None]:
# Accuracy (true labels first, then predictions)
accuracy_score(aus_open_predict["RESULT"], aus_open_predict["PREDICTION"])

0.7327586206896551
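
An accuracy of 0.7328 on 116 matches corresponds to 85 correct predictions (85 / 116). To put such a number in context, it can help to compare it against a naive baseline that always picks the better-ranked player. The sketch below does this on the first five rows of the table shown above (ranks and results from the data preview, predictions from the result tables); `accuracy_report` is a hypothetical helper, and five rows say nothing about the full tournament.

```python
import pandas as pd


def accuracy_report(df):
    """Compare model accuracy with a 'better-ranked player wins' baseline."""
    correct = int((df["PREDICTION"] == df["RESULT"]).sum())
    # Baseline label: 1 if player 1 has the better (numerically lower) rank
    baseline = (df["p1_rank"] < df["p2_rank"]).astype(int)
    return {
        "n": len(df),
        "correct": correct,
        "accuracy": correct / len(df),
        "rank_baseline_accuracy": float((baseline == df["RESULT"]).mean()),
    }


# Rows 0 to 4 of the Australian Open table shown in this notebook
sample = pd.DataFrame({
    "p1_rank":    [179, 93, 76, 6, 67],
    "p2_rank":    [81, 21, 105, 62, 219],
    "RESULT":     [0, 0, 1, 1, 0],
    "PREDICTION": [0, 0, 1, 1, 1],
})

report = accuracy_report(sample)
print(report)
```

Running `accuracy_report(aus_open_predict)` on the full frame would show whether the model actually adds anything over the ranking alone.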

In [None]:
# Jannik Sinner's Run in the Australian Open
aus_open_predict[(aus_open_predict["p1_id"] == 206173) | (aus_open_predict["p2_id"] == 206173)]

Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,p1_id,p1_seed,p1_entry,p1_name,p1_hand,p1_ht,p1_ioc,p1_age,p2_id,p2_seed,p2_entry,p2_name,p2_hand,p2_ht,p2_ioc,p2_age,score,best_of,round,minutes,p1_ace,p1_df,p1_svpt,p1_1stIn,p1_1stWon,p1_2ndWon,p1_SvGms,p1_bpSaved,p1_bpFaced,p2_ace,p2_df,p2_svpt,p2_1stIn,p2_1stWon,p2_2ndWon,p2_SvGms,p2_bpSaved,p2_bpFaced,p1_rank,p1_rank_points,p2_rank,p2_rank_points,RESULT,PREDICTION,% Player 1 Wins,% Player 2 Wins
19,AUSTRALIAN_OPEN_2025,0.0,Hard,128,0.0,0.0,0.0,206173,0.0,0.0,0.0,0.0,191.0,0.0,23.626968,111797,0.0,0.0,0.0,0.0,201.0,0.0,29.473648,0.0,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,11830,36,1340,1,1,0.944918,0.055082
87,AUSTRALIAN_OPEN_2025,0.0,Hard,128,0.0,0.0,0.0,206173,0.0,0.0,0.0,0.0,191.0,0.0,23.626968,209262,0.0,0.0,0.0,0.0,183.0,0.0,24.095825,0.0,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,11830,173,336,1,1,0.953008,0.046992
102,AUSTRALIAN_OPEN_2025,0.0,Hard,128,0.0,0.0,0.0,206173,0.0,0.0,0.0,0.0,191.0,0.0,23.626968,106218,0.0,0.0,0.0,0.0,180.0,0.0,31.689938,0.0,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,11830,46,1175,1,1,0.939566,0.060434
107,AUSTRALIAN_OPEN_2025,0.0,Hard,128,0.0,0.0,0.0,206173,0.0,0.0,0.0,0.0,191.0,0.0,23.626968,208029,0.0,0.0,0.0,0.0,188.0,0.0,21.925394,0.0,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,11830,13,2910,1,1,0.914102,0.085898
113,AUSTRALIAN_OPEN_2025,0.0,Hard,128,0.0,0.0,0.0,200282,0.0,0.0,0.0,0.0,183.0,0.0,26.120465,206173,0.0,0.0,0.0,0.0,191.0,0.0,23.626968,0.0,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8,3535,1,11830,0,0,0.089048,0.910952
114,AUSTRALIAN_OPEN_2025,0.0,Hard,128,0.0,0.0,0.0,210097,0.0,0.0,0.0,0.0,193.0,0.0,22.479124,206173,0.0,0.0,0.0,0.0,191.0,0.0,23.626968,0.0,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,20,2280,1,11830,0,0,0.076903,0.923097
115,AUSTRALIAN_OPEN_2025,0.0,Hard,128,0.0,0.0,0.0,100644,0.0,0.0,0.0,0.0,198.0,0.0,27.950034,206173,0.0,0.0,0.0,0.0,191.0,0.0,23.626968,0.0,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2,7635,1,11830,0,0,0.204768,0.795232


In [None]:
# Carlos Alcaraz's Run in the Australian Open
aus_open_predict[(aus_open_predict["p1_id"] == 207989) | (aus_open_predict["p2_id"] == 207989)]

Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,p1_id,p1_seed,p1_entry,p1_name,p1_hand,p1_ht,p1_ioc,p1_age,p2_id,p2_seed,p2_entry,p2_name,p2_hand,p2_ht,p2_ioc,p2_age,score,best_of,round,minutes,p1_ace,p1_df,p1_svpt,p1_1stIn,p1_1stWon,p1_2ndWon,p1_SvGms,p1_bpSaved,p1_bpFaced,p2_ace,p2_df,p2_svpt,p2_1stIn,p2_1stWon,p2_2ndWon,p2_SvGms,p2_bpSaved,p2_bpFaced,p1_rank,p1_rank_points,p2_rank,p2_rank_points,RESULT,PREDICTION,% Player 1 Wins,% Player 2 Wins
30,AUSTRALIAN_OPEN_2025,0.0,Hard,128,0.0,0.0,0.0,207686,0.0,0.0,0.0,0.0,185.0,0.0,24.339493,207989,0.0,0.0,0.0,0.0,183.0,0.0,21.908966,0.0,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,77,743,3,7010,0,0,0.05285,0.94715
60,AUSTRALIAN_OPEN_2025,0.0,Hard,128,0.0,0.0,0.0,207989,0.0,0.0,0.0,0.0,183.0,0.0,21.908966,106415,0.0,0.0,0.0,0.0,170.0,0.0,29.511978,0.0,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3,7010,65,807,1,1,0.949098,0.050902
90,AUSTRALIAN_OPEN_2025,0.0,Hard,128,0.0,0.0,0.0,207989,0.0,0.0,0.0,0.0,183.0,0.0,21.908966,132686,0.0,0.0,0.0,0.0,185.0,0.0,28.11499,0.0,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3,7010,33,1445,1,1,0.948548,0.051452
111,AUSTRALIAN_OPEN_2025,0.0,Hard,128,0.0,0.0,0.0,207989,0.0,0.0,0.0,0.0,183.0,0.0,21.908966,104925,0.0,0.0,0.0,0.0,188.0,0.0,37.862423,0.0,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3,7010,7,3900,0,1,0.53872,0.46128


In [None]:
# Correct Results
aus_open_predict[((aus_open_predict["RESULT"] == 1) & (aus_open_predict["PREDICTION"] == 1))
                 | ((aus_open_predict["RESULT"] == 0) & (aus_open_predict["PREDICTION"] == 0))]

Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,p1_id,p1_seed,p1_entry,p1_name,p1_hand,p1_ht,p1_ioc,p1_age,p2_id,p2_seed,p2_entry,p2_name,p2_hand,p2_ht,p2_ioc,p2_age,score,best_of,round,minutes,p1_ace,p1_df,p1_svpt,p1_1stIn,p1_1stWon,p1_2ndWon,p1_SvGms,p1_bpSaved,p1_bpFaced,p2_ace,p2_df,p2_svpt,p2_1stIn,p2_1stWon,p2_2ndWon,p2_SvGms,p2_bpSaved,p2_bpFaced,p1_rank,p1_rank_points,p2_rank,p2_rank_points,RESULT,PREDICTION,% Player 1 Wins,% Player 2 Wins
0,AUSTRALIAN_OPEN_2025,0.0,Hard,128,0.0,0.0,0.0,117357,0.0,0.0,0.0,0.0,183.0,0.0,27.873374,200384,0.0,0.0,0.0,0.0,173.0,0.0,24.514716,0.0,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,179,323,81,703,0,0,0.366281,0.633719
1,AUSTRALIAN_OPEN_2025,0.0,Hard,128,0.0,0.0,0.0,202261,0.0,0.0,0.0,0.0,193.0,0.0,23.780287,209950,0.0,0.0,0.0,0.0,185.0,0.0,20.804928,0.0,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,93,627,21,2280,0,0,0.204786,0.795214
2,AUSTRALIAN_OPEN_2025,0.0,Hard,128,0.0,0.0,0.0,105453,0.0,0.0,0.0,0.0,178.0,0.0,35.257358,106329,0.0,0.0,0.0,0.0,183.0,0.0,30.837782,0.0,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,76,743,105,566,1,1,0.685114,0.314886
3,AUSTRALIAN_OPEN_2025,0.0,Hard,128,0.0,0.0,0.0,134770,0.0,0.0,0.0,0.0,183.0,0.0,26.276523,144719,0.0,0.0,0.0,0.0,183.0,0.0,27.908966,0.0,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6,4210,62,922,1,1,0.881256,0.118744
5,AUSTRALIAN_OPEN_2025,0.0,Hard,128,0.0,0.0,0.0,208103,0.0,0.0,0.0,0.0,185.0,0.0,23.396988,111454,0.0,0.0,0.0,0.0,183.0,0.0,28.848734,0.0,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,29,1660,168,342,1,1,0.844618,0.155382
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
110,AUSTRALIAN_OPEN_2025,0.0,Hard,128,0.0,0.0,0.0,100644,0.0,0.0,0.0,0.0,198.0,0.0,27.950034,126205,0.0,0.0,0.0,0.0,185.0,0.0,27.876112,0.0,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2,7635,11,3195,1,1,0.765330,0.234670
112,AUSTRALIAN_OPEN_2025,0.0,Hard,128,0.0,0.0,0.0,210097,0.0,0.0,0.0,0.0,193.0,0.0,22.479124,132283,0.0,0.0,0.0,0.0,191.0,0.0,29.892539,0.0,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,20,2280,55,1026,1,1,0.783410,0.216590
113,AUSTRALIAN_OPEN_2025,0.0,Hard,128,0.0,0.0,0.0,200282,0.0,0.0,0.0,0.0,183.0,0.0,26.120465,206173,0.0,0.0,0.0,0.0,191.0,0.0,23.626968,0.0,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8,3535,1,11830,0,0,0.089048,0.910952
114,AUSTRALIAN_OPEN_2025,0.0,Hard,128,0.0,0.0,0.0,210097,0.0,0.0,0.0,0.0,193.0,0.0,22.479124,206173,0.0,0.0,0.0,0.0,191.0,0.0,23.626968,0.0,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,20,2280,1,11830,0,0,0.076903,0.923097


In [None]:
# Wrong Results
aus_open_predict[((aus_open_predict["RESULT"] == 0) & (aus_open_predict["PREDICTION"] == 1))
                 | ((aus_open_predict["RESULT"] == 1) & (aus_open_predict["PREDICTION"] == 0))]

Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,p1_id,p1_seed,p1_entry,p1_name,p1_hand,p1_ht,p1_ioc,p1_age,p2_id,p2_seed,p2_entry,p2_name,p2_hand,p2_ht,p2_ioc,p2_age,score,best_of,round,minutes,p1_ace,p1_df,p1_svpt,p1_1stIn,p1_1stWon,p1_2ndWon,p1_SvGms,p1_bpSaved,p1_bpFaced,p2_ace,p2_df,p2_svpt,p2_1stIn,p2_1stWon,p2_2ndWon,p2_SvGms,p2_bpSaved,p2_bpFaced,p1_rank,p1_rank_points,p2_rank,p2_rank_points,RESULT,PREDICTION,% Player 1 Wins,% Player 2 Wins
4,AUSTRALIAN_OPEN_2025,0.0,Hard,128,0.0,0.0,0.0,207352,0.0,0.0,0.0,0.0,185.0,0.0,23.199863,200273,0.0,0.0,0.0,0.0,188.0,0.0,26.613279,0.0,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,67,784,219,264,0,1,0.687307,0.312693
12,AUSTRALIAN_OPEN_2025,0.0,Hard,128,0.0,0.0,0.0,106148,0.0,0.0,0.0,0.0,183.0,0.0,32.027379,126214,0.0,0.0,0.0,0.0,188.0,0.0,27.832307,0.0,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,57,981,26,1705,1,0,0.367447,0.632553
14,AUSTRALIAN_OPEN_2025,0.0,Hard,128,0.0,0.0,0.0,210506,0.0,0.0,0.0,0.0,193.0,0.0,20.602327,126774,0.0,0.0,0.0,0.0,193.0,0.0,26.637919,0.0,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,42,1270,12,3195,1,0,0.342174,0.657826
17,AUSTRALIAN_OPEN_2025,0.0,Hard,128,0.0,0.0,0.0,105902,0.0,0.0,0.0,0.0,183.0,0.0,33.194387,208502,0.0,0.0,0.0,0.0,183.0,0.0,22.626968,0.0,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,89,637,304,173,1,0,0.371813,0.628187
28,AUSTRALIAN_OPEN_2025,0.0,Hard,128,0.0,0.0,0.0,210262,0.0,0.0,0.0,0.0,188.0,0.0,21.654346,200303,0.0,0.0,0.0,0.0,191.0,0.0,26.36961,0.0,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,125,481,97,612,1,0,0.27317,0.72683
39,AUSTRALIAN_OPEN_2025,0.0,Hard,128,0.0,0.0,0.0,106432,0.0,0.0,0.0,0.0,188.0,0.0,28.380561,106426,0.0,0.0,0.0,0.0,185.0,0.0,28.84052,0.0,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,87,639,150,382,0,1,0.600351,0.399649
42,AUSTRALIAN_OPEN_2025,0.0,Hard,128,0.0,0.0,0.0,208659,0.0,0.0,0.0,0.0,203.0,0.0,21.733744,104792,0.0,0.0,0.0,0.0,193.0,0.0,38.583162,0.0,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,30,1651,41,1280,0,1,0.553709,0.446291
44,AUSTRALIAN_OPEN_2025,0.0,Hard,128,0.0,0.0,0.0,200267,0.0,0.0,0.0,0.0,185.0,0.0,25.829569,207680,0.0,0.0,0.0,0.0,183.0,0.0,24.295688,0.0,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,60,948,73,758,0,1,0.685964,0.314036
46,AUSTRALIAN_OPEN_2025,0.0,Hard,128,0.0,0.0,0.0,207681,0.0,0.0,0.0,0.0,178.0,0.0,24.487337,127157,0.0,0.0,0.0,0.0,188.0,0.0,26.553046,0.0,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,85,662,101,582,0,1,0.533641,0.466359
47,AUSTRALIAN_OPEN_2025,0.0,Hard,128,0.0,0.0,0.0,207518,0.0,0.0,0.0,0.0,185.0,0.0,23.082136,208286,0.0,0.0,0.0,0.0,185.0,0.0,24.106776,0.0,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,15,2600,39,1305,1,0,0.498523,0.501477


## Why you should not bet using my model
Firstly, I'm just a YouTuber and a CS student. Also, this was only a two- to three-week project.
Bookmakers are cracked, and they have the best models, which they keep a secret. I doubt that my model will ever be able to compete with them.
That being said, I think this is a fun project about how you can use Machine Learning to do some pretty cool things. Also, this model could be improved in a lot of ways, which I will briefly explain below.

I hope you enjoyed!

## To Improve
- Take into account when a player is injured or takes a break
    - Vary K ELO factor
    - Drop ELO points after an absence
- Do further analysis on the PCA and visualize the data better to observe patterns
- Calculate more stats
    - Calculate Average ELO Rating at last tournament
    - Average ELO opponent won vs Average ELO opponent lost
    - Check regularity of a player by checking upset percentage
    - Check regularity of a player by checking
    - Calculate the probability of win from ELO depending on best of 3 or best of 5
        - https://github.com/JeffSackmann/tennis_misc/blob/master/fiveSetProb.py
- Train on last ten/five years - Less data, but more recent
- See if the probabilities of XGBoost would be better than betting odds (highly unlikely)
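
As an illustration of the best-of-3 versus best-of-5 idea in the list above, here is a minimal sketch. It uses the standard Elo expected-score formula and then assumes that sets are independent and that a player wins each set with a constant probability. That is a simplification, and it is not necessarily what the linked `fiveSetProb.py` script does. The function names are hypothetical.

```python
from math import comb


def elo_expected(r_a, r_b):
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def match_win_prob(p_set, best_of):
    """Probability of winning a match, given a constant set-win probability.

    Assumes independent sets. The winner takes `need` sets, the last set
    played is always a win, and the loser takes k of the earlier sets.
    """
    need = best_of // 2 + 1
    q = 1.0 - p_set
    return sum(comb(need - 1 + k, k) * p_set**need * q**k for k in range(need))


p_set = 0.6                               # illustrative set-win probability
bo3 = match_win_prob(p_set, best_of=3)
bo5 = match_win_prob(p_set, best_of=5)
print(f"best of 3: {bo3:.4f}, best of 5: {bo5:.4f}")
print(f"Elo expected score, equal ratings: {elo_expected(1500, 1500):.2f}")
```

Under these assumptions the longer format favors the stronger player more. How to map an Elo difference onto a set-win probability is a separate modelling choice that this sketch leaves open.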

## Resources
- https://www.tennisabstract.com/blog/2019/12/03/an-introduction-to-tennis-elo/
- https://github.com/JeffSackmann/tennis_misc/blob/master/fiveSetProb.py
- https://github.com/JeffSackmann/tennis_atp