### Basic Model
The simplest model feeds the raw features directly into XGBoost. It serves as the baseline against which later models are compared. First, however, we must address look-ahead bias: the model may only use information that is available before a match starts.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn.model_selection import train_test_split, TimeSeriesSplit
from sklearn.metrics import accuracy_score, log_loss, roc_auc_score

In [3]:
df = pd.read_csv("./data/tennis_matches_cleaned.csv")
print(df.columns.tolist())

['surface', 'draw_size', 'tourney_level', 'tourney_date', 'id_a', 'name_a', 'hand_a', 'ht_a', 'age_a', 'id_b', 'name_b', 'hand_b', 'ht_b', 'age_b', 'score', 'best_of', 'round', 'minutes', 'ace_a', 'df_a', 'svpt_a', '1stIn_a', '1stWon_a', '2ndWon_a', 'SvGms_a', 'bpSaved_a', 'bpFaced_a', 'ace_b', 'df_b', 'svpt_b', '1stIn_b', '1stWon_b', '2ndWon_b', 'SvGms_b', 'bpSaved_b', 'bpFaced_b', 'rank_a', 'rank_points_a', 'rank_b', 'rank_points_b', 'result']


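We have to limit the training features to those we have access to before the match starts. In-match statistics such as aces, double faults, break points, score and minutes are only known once the match has been played, so they are excluded. Features like tournament level, date, player ID and name are known in advance, but we assume they hold little predictive power on their own and leave them out of the model inputs (`tourney_date` is kept only so that matches can be sorted chronologically). Based on this, we keep only the relevant features and the target.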
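As a quick guard against look-ahead bias, a small check along the following lines can flag post-match columns before they reach the model. This is only a sketch: `find_leaky_columns`, `POST_MATCH_PREFIXES` and `POST_MATCH_COLUMNS` are hypothetical names, and the prefixes are assumptions read off the column list printed above.

```python
# Hypothetical helper: flag columns that are only known after a match is played.
# The prefixes and names below are assumptions based on the printed column list.
POST_MATCH_PREFIXES = (
    "ace_", "df_", "svpt_", "1stIn_", "1stWon_", "2ndWon_",
    "SvGms_", "bpSaved_", "bpFaced_",
)
POST_MATCH_COLUMNS = {"score", "minutes"}


def find_leaky_columns(columns):
    """Return the columns that would leak post-match information."""
    return [
        c for c in columns
        if c in POST_MATCH_COLUMNS or c.startswith(POST_MATCH_PREFIXES)
    ]


candidate_features = ["surface", "rank_a", "rank_b", "ace_a", "minutes"]
leaky = find_leaky_columns(candidate_features)
print(leaky)  # ['ace_a', 'minutes']
```

An empty result for the final feature list is a cheap sanity check that no in-match statistic slipped through.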

Further, since surface and hand are categorical features, we can use one-hot encoding to convert them into numerical inputs.
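To make the effect of one-hot encoding concrete, here is a toy illustration on a made-up three-row frame (the `toy` data is invented for this example and is not taken from the dataset). Each category becomes its own 0/1 column.

```python
import pandas as pd

# Invented toy rows, only to show the shape of the encoded output.
toy = pd.DataFrame({
    "surface": ["Hard", "Clay", "Hard"],
    "hand_a": ["R", "L", "R"],
})

encoded = pd.get_dummies(toy, columns=["surface", "hand_a"], dtype=int)

print(encoded.columns.tolist())
# ['surface_Clay', 'surface_Hard', 'hand_a_L', 'hand_a_R']
print(encoded.iloc[0].tolist())  # first row (Hard, R) -> [0, 1, 0, 1]
```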

In [4]:
pre_match_columns = [
    'surface',
    'draw_size',
    'tourney_date',
    'hand_a',
    'ht_a',
    'age_a',
    'hand_b',
    'ht_b',
    'age_b',
    'best_of',
    'round',
    'rank_a',
    'rank_points_a',
    'rank_b',
    'rank_points_b',
    'result'
]

print(df[pre_match_columns].head())

# Nominal categories to one-hot encode
categorical_cols = ['surface', 'hand_a', 'hand_b']

# Sort chronologically so the time series split sees matches in date order
df = df.sort_values('tourney_date')

data = pd.get_dummies(
    df[pre_match_columns],
    columns=categorical_cols,
    drop_first=False,
    dtype=int
)

print('\n' + str(data.columns.tolist()))

  surface  draw_size  tourney_date hand_a   ht_a  age_a hand_b   ht_b  age_b  \
0    Hard         32      19910107      R  180.0   25.6      R  175.0   20.6   
1    Hard         32      19910107      R  188.0   31.8      R  180.0   21.5   
2    Hard         32      19910107      R  185.0   21.6      R  185.0   25.3   
3    Hard         32      19910107      L  173.0   23.8      R  180.0   25.8   
4    Hard         32      19910107      R  196.0   20.6      R  185.0   19.7   

   best_of round  rank_a  rank_points_a  rank_b  rank_points_b  result  
0        3   R32     9.0         1487.0    78.0          459.0       1  
1        3   R32   220.0          114.0    94.0          371.0       0  
2        3   R32   212.0          116.0    77.0          468.0       0  
3        3   R32    72.0          483.0    65.0          502.0       0  
4        3   R32    28.0          876.0   190.0          142.0       0  

['draw_size', 'tourney_date', 'ht_a', 'age_a', 'ht_b', 'age_b', 'best_of', 'roun

Now that the categorical features are encoded numerically, we can start training our baseline model. We use XGBoost, a gradient-boosted decision tree library that performs well on tabular data. To benchmark the model realistically, we use a time series split, which ensures that training data always precedes validation data in chronological order. This prevents data leakage from future matches and gives a more realistic estimate of how the model would perform when predicting upcoming matches.
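The expanding-window behaviour of `TimeSeriesSplit` is easiest to see on a toy example. The sketch below uses eight made-up, already date-sorted samples (`toy_X` is invented for illustration) and three splits instead of the twenty used in the actual run.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Eight invented samples, assumed to be sorted by date already.
toy_X = np.arange(8).reshape(-1, 1)

folds = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(toy_X):
    folds.append((train_idx.tolist(), test_idx.tolist()))
    print(train_idx.tolist(), "->", test_idx.tolist())
# [0, 1] -> [2, 3]
# [0, 1, 2, 3] -> [4, 5]
# [0, 1, 2, 3, 4, 5] -> [6, 7]
```

Every training window ends before its test window starts, and the training window grows with each fold. This is also why the rows must be sorted by date before splitting.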

In [5]:
feature_cols = [
    'draw_size', 
    'ht_a', 
    'age_a', 
    'ht_b', 
    'age_b', 
    'rank_a', 
    'rank_points_a', 
    'rank_b', 
    'rank_points_b', 
    'best_of', 
    'surface_Carpet', 
    'surface_Clay', 
    'surface_Grass', 
    'surface_Hard', 
    'hand_a_A', 
    'hand_a_L', 
    'hand_a_R', 
    'hand_a_U', 
    'hand_b_A', 
    'hand_b_L', 
    'hand_b_R', 
    'hand_b_U'
]

target_col = 'result'  # a single column name, so y is a 1-D Series

X = data[feature_cols]
y = data[target_col]

params = {
    'objective': 'binary:logistic',
    'eval_metric': ['logloss', 'auc'],
    'max_depth': 6,
    'eta': 0.05,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'seed': 42,
    'tree_method': 'hist'      # fast on CPU
}

loglosses = []
aucs = []
accs = []

tscv = TimeSeriesSplit(n_splits=20)

for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    # split train and test
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

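    dtrain = xgb.DMatrix(X_train, label=y_train)
    dtest = xgb.DMatrix(X_test, label=y_test)

    # Early stopping monitors the last entry in `evals` (the test fold) and
    # the last metric in `eval_metric` ('auc'), so the stopping point is
    # chosen using the same fold that is scored below.
    model = xgb.train(
        params,
        dtrain,
        num_boost_round=2000,
        evals=[(dtrain, 'train'), (dtest, 'test')],
        early_stopping_rounds=100,
        verbose_eval=False,
    )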

    # Predict
    pred_proba = model.predict(dtest)
    pred_label = (pred_proba >= 0.5).astype(int)

    # Metrics
    ll = log_loss(y_test, pred_proba)
    au = roc_auc_score(y_test, pred_proba)
    acc = accuracy_score(y_test, pred_label)
    
    loglosses.append(ll)
    aucs.append(au)
    accs.append(acc)
    
    print(f"{fold}) LogLoss: {ll:.4f} | AUC: {au:.4f} | Acc: {acc:.4f}")

print("\n" + "="*50)
print(f"Avg LogLoss: {np.mean(loglosses):.4f} ± {np.std(loglosses):.4f}")
print(f"Avg AUC:     {np.mean(aucs):.4f} ± {np.std(aucs):.4f}")
print(f"Avg Accuracy:     {np.mean(accs):.4f} ± {np.std(accs):.4f}")


0) LogLoss: 0.6429 | AUC: 0.6841 | Acc: 0.6322
1) LogLoss: 0.6354 | AUC: 0.6893 | Acc: 0.6322
2) LogLoss: 0.6348 | AUC: 0.6914 | Acc: 0.6408
3) LogLoss: 0.6463 | AUC: 0.6780 | Acc: 0.6267
4) LogLoss: 0.6553 | AUC: 0.6635 | Acc: 0.6180
5) LogLoss: 0.6441 | AUC: 0.6779 | Acc: 0.6293
6) LogLoss: 0.6392 | AUC: 0.6860 | Acc: 0.6327
7) LogLoss: 0.6335 | AUC: 0.6931 | Acc: 0.6405
8) LogLoss: 0.6166 | AUC: 0.7156 | Acc: 0.6536
9) LogLoss: 0.6192 | AUC: 0.7115 | Acc: 0.6550
10) LogLoss: 0.6111 | AUC: 0.7269 | Acc: 0.6605
11) LogLoss: 0.6024 | AUC: 0.7346 | Acc: 0.6686
12) LogLoss: 0.6033 | AUC: 0.7284 | Acc: 0.6640
13) LogLoss: 0.5979 | AUC: 0.7386 | Acc: 0.6711
14) LogLoss: 0.5965 | AUC: 0.7393 | Acc: 0.6730
15) LogLoss: 0.6140 | AUC: 0.7224 | Acc: 0.6562
16) LogLoss: 0.6417 | AUC: 0.6833 | Acc: 0.6332
17) LogLoss: 0.6310 | AUC: 0.6981 | Acc: 0.6456
18) LogLoss: 0.6295 | AUC: 0.7008 | Acc: 0.6417
19) LogLoss: 0.6330 | AUC: 0.6925 | Acc: 0.6293

Avg LogLoss: 0.6264 ± 0.0171
Avg AUC:     0.7028 
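
One caveat when reading these numbers: the training loop passes the test fold to `evals`, so early stopping is tuned on the same data that is then scored. A possible alternative, sketched below under that assumption, is to carve the most recent slice off each training fold and use it for early stopping instead. `split_off_validation` and `val_fraction` are hypothetical names introduced here, not part of the notebook.

```python
import numpy as np


def split_off_validation(train_idx, val_fraction=0.1):
    """Split chronologically ordered training indices into (fit, validation).

    The validation slice is the most recent `val_fraction` of the training
    fold, so it still precedes the test fold in time.
    """
    train_idx = np.asarray(train_idx)
    n_val = max(1, int(len(train_idx) * val_fraction))
    return train_idx[:-n_val], train_idx[-n_val:]


fit_idx, val_idx = split_off_validation(np.arange(100), val_fraction=0.1)
print(len(fit_idx), len(val_idx))     # 90 10
print(fit_idx.max() < val_idx.min())  # True
```

Inside the cross-validation loop, the booster would then be fitted on `X.iloc[fit_idx]`, early-stopped on a `DMatrix` built from `X.iloc[val_idx]`, and scored on the untouched test fold. The resulting metrics would differ from the ones reported above.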