# Predicting Match Winners with XGBoost Classifier

In this notebook, we predict the winner of a match with an XGBoost classifier. Correctly predicting upsets matters most to us; an upset is a match won by the player with the lower default Elo rating.

## Table of Contents
1. [Data Preparation](#data-preparation)
2. [Baseline Model](#baseline-model)
3. [Weighted Model](#weighted-model)
4. [Hyperparameter Optimization with Optuna](#hyperparameter-optimization)
5. [Conclusion](#conclusion)


In [1]:
# Standard library imports
import datetime
import os
from collections import deque
import time

# Third-party imports
import matplotlib.pyplot as plt
import numpy as np
import optuna
import pandas as pd
import seaborn as sns

import xgboost as xgb
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split

from tqdm import tqdm

if os.path.exists('/workspace/data_2'):
    # Load the dictionary of DataFrames from the pickle
    data_path = '/workspace/data_2/'
else:
    data_path = '../data/'
    
# if torch.cuda.is_available() == False:
#     RuntimeError("GPU detected: False")
#     print("GPU detected: False")
# else:
#     device = torch.device("cuda")
#     print("The GPU is detected.")



### Load Data

In [2]:
dataset_df = pd.read_pickle(data_path + 'full_dataset_df.pkl')


Identify columns for training.

In [3]:
for i, col in enumerate(dataset_df.columns):
    print(i, col)
    

0 key_x
1 game
2 tournament_key
3 winner_id
4 loser_id
5 p1_id
6 p2_id
7 p1_score
8 p2_score
9 valid_score
10 best_of
11 location_names
12 bracket_name
13 bracket_order
14 set_order
15 game_data
16 top_8
17 top_8_location_names
18 valid_top_8_bracket
19 top_8_bracket_location_names
20 major
21 key_y
22 start
23 end
24 start_week
25 p1_characters
26 p2_characters
27 p1_consistent
28 p2_consistent
29 matchup_strings
30 end_week
31 players_have_history
32 (p1/p2)_sorted
33 (p1/p2)_was_sorted
34 results_sorted
35 results
36 matchup_1
37 matchup_2
38 matchup_3
39 matchup_4
40 matchup_5
41 matchup_6
42 matchup_7
43 matchup_8
44 matchup_9
45 matchup_10
46 winner
47 p1_default_elo
48 p2_default_elo
49 p1_default_rd
50 p2_default_rd
51 p1_default_updates
52 p2_default_updates
53 start_index
54 start_date
55 p1_fox_count
56 p1_falco_count
57 p1_marth_count
58 p1_sheik_count
59 p1_captainfalcon_count
60 p1_jigglypuff_count
61 p1_peach_count
62 p1_luigi_count
63 p1_samus_count
64 p1_ganondorf_coun

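Select the feature columns (the `matchup_*` columns, the default ELO/RD/update columns, and the per-character count columns for both players) and the target column.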

In [4]:
# Define features and target
features = (
    list(dataset_df.columns[36:46]) +
    list(dataset_df.columns[47:53]) +
    list(dataset_df.columns[55:])
).copy()
target = 'winner'

print(features)



['matchup_1', 'matchup_2', 'matchup_3', 'matchup_4', 'matchup_5', 'matchup_6', 'matchup_7', 'matchup_8', 'matchup_9', 'matchup_10', 'p1_default_elo', 'p2_default_elo', 'p1_default_rd', 'p2_default_rd', 'p1_default_updates', 'p2_default_updates', 'p1_fox_count', 'p1_falco_count', 'p1_marth_count', 'p1_sheik_count', 'p1_captainfalcon_count', 'p1_jigglypuff_count', 'p1_peach_count', 'p1_luigi_count', 'p1_samus_count', 'p1_ganondorf_count', 'p1_iceclimbers_count', 'p1_drmario_count', 'p1_yoshi_count', 'p1_pikachu_count', 'p1_link_count', 'p1_mrgameandwatch_count', 'p1_donkeykong_count', 'p1_mario_count', 'p1_zelda_count', 'p1_roy_count', 'p1_younglink_count', 'p1_kirby_count', 'p1_ness_count', 'p1_bowser_count', 'p1_pichu_count', 'p1_random_count', 'p1_mewtwo_count', 'p2_fox_count', 'p2_falco_count', 'p2_marth_count', 'p2_sheik_count', 'p2_captainfalcon_count', 'p2_jigglypuff_count', 'p2_peach_count', 'p2_luigi_count', 'p2_samus_count', 'p2_ganondorf_count', 'p2_iceclimbers_count', 'p2_drm
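The positional slices above silently pick different columns if the DataFrame's column order ever changes. A possible alternative is to select features by name pattern. The sketch below is illustrative only: `select_features` and `toy_df` are hypothetical names, and it assumes every feature column from index 55 onward ends in `_count`, which the truncated output cannot confirm.

```python
import pandas as pd

# Toy frame holding a few of the column names printed above (no data needed).
toy_df = pd.DataFrame(
    columns=[
        "winner_id", "matchup_strings", "matchup_1", "matchup_2", "winner",
        "p1_default_elo", "p2_default_elo", "p1_default_rd", "p2_default_rd",
        "p1_default_updates", "p2_default_updates", "start_index",
        "p1_fox_count", "p2_fox_count",
    ]
)


def select_features(columns):
    """Pick feature columns by name pattern instead of by position."""
    prefix = "matchup_"
    # Keep matchup_1 ... matchup_10 but skip matchup_strings.
    matchup = [c for c in columns if c.startswith(prefix) and c[len(prefix):].isdigit()]
    # Default rating statistics (elo, rd, updates) for both players.
    rating = [c for c in columns if "_default_" in c]
    # Per-character usage counts for both players (assumed naming).
    counts = [c for c in columns if c.endswith("_count")]
    return matchup + rating + counts


toy_features = select_features(toy_df.columns)
print(toy_features)
```

Comparing the result of such a helper with the printed `features` list is a cheap way to confirm that both approaches agree before switching.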

## Data Preparation


In [5]:
# 1. Define the 'expected_winner' column
dataset_df['expected_winner'] = np.where(
    dataset_df['p1_default_elo'] > dataset_df['p2_default_elo'], 1,
    np.where(dataset_df['p1_default_elo'] < dataset_df['p2_default_elo'], 0, np.nan)
)

# 2. Define 'upset' only when 'expected_winner' is not NaN
dataset_df['upset'] = np.where(
    dataset_df['expected_winner'].notna() & (dataset_df['winner'] != dataset_df['expected_winner']), 1, 0
)

# 3. Remove matches where ELOs are equal
dataset_df = dataset_df[dataset_df['expected_winner'].notna()].reset_index(drop=True)

# 4. Split the data into training and test sets
train_data, test_data = train_test_split(dataset_df, test_size=0.2, random_state=42, stratify=dataset_df['upset'])

# 5. Reset both indices so positional and label-based indexing agree
train_data = train_data.reset_index(drop=True)
test_data = test_data.reset_index(drop=True)

# 6. Separate features and target
X_train_full = train_data[features].reset_index(drop=True)
y_train_full = train_data[target].reset_index(drop=True)

X_test = test_data[features]
y_test = test_data[target]

# Define the upset mask for the test set
upset_mask_test = test_data['upset'] == 1
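To make the labelling rules concrete, here is the same logic applied to a four-row toy table. The values are made up, and the sketch assumes that `winner == 1` means player one won, which is consistent with how `expected_winner` is encoded above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "p1_default_elo": [1600, 1500, 1500, 1400],
    "p2_default_elo": [1500, 1600, 1500, 1450],
    "winner":         [1,    1,    0,    0],
})

# 1 if player one is the favourite, 0 if player two is, NaN on a tie.
toy["expected_winner"] = np.where(
    toy["p1_default_elo"] > toy["p2_default_elo"], 1,
    np.where(toy["p1_default_elo"] < toy["p2_default_elo"], 0, np.nan),
)

# An upset is a match with a clear favourite who lost.
toy["upset"] = np.where(
    toy["expected_winner"].notna() & (toy["winner"] != toy["expected_winner"]), 1, 0
)

# Tied ratings have no favourite, so those rows are dropped.
toy = toy[toy["expected_winner"].notna()].reset_index(drop=True)
print(toy)
```

Only the second row is an upset (player two is rated higher but player one wins), and the tied third row is removed.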



## Training Base Models

We train six XGBoost models, one per upset weight from **1.0** to **3.5** in steps of **0.5**. The upset weight is the `sample_weight` given to training rows labelled as upsets; all other rows keep weight 1. For each upset weight:

- **Cross-Validation**: We perform K-fold cross-validation to generate out-of-fold predictions for the meta-model training.
- **Out-of-Fold Predictions**: Predictions on validation folds are stored for training the meta-model.
- **Test Predictions**: After training on the full training data, predictions on the test set are stored for final evaluation.
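As a rough intuition for what the upset weight does, the sketch below computes a weighted log loss in plain NumPy. It is not XGBoost's internal implementation; `weighted_log_loss` and the toy arrays are hypothetical. It only illustrates that a row with weight `w` contributes to the loss roughly as if it appeared `w` times.

```python
import numpy as np


def weighted_log_loss(y_true, p_pred, sample_weight):
    """Weighted average of the per-row binary log loss."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities to avoid log(0).
    p_pred = np.clip(np.asarray(p_pred, dtype=float), 1e-15, 1 - 1e-15)
    w = np.asarray(sample_weight, dtype=float)
    per_row = -(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
    return np.sum(w * per_row) / np.sum(w)


y = np.array([1, 0, 1, 0])
p = np.array([0.9, 0.2, 0.3, 0.1])  # the third row is predicted badly
is_upset = np.array([False, False, True, False])

losses = []
for upset_weight in (1.0, 2.0, 3.5):
    w = np.where(is_upset, upset_weight, 1.0)
    losses.append(weighted_log_loss(y, p, w))
    print(f"upset weight {upset_weight}: weighted log loss {losses[-1]:.4f}")
```

Because the badly predicted row is the upset, raising its weight raises the loss, which pushes a model trained on that loss to fit upset rows more closely.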


In [8]:
from sklearn.model_selection import KFold

# Upset weights to consider
upset_weights = np.arange(1.0, 3.6, .5)  # Adjust increments as needed
best_params = {'n_estimators': 332,
  'max_depth': 13,
  'learning_rate': 0.0329014414333458,
  'min_child_weight': 3,
  'gamma': 0.024318270664498532,
  'subsample': 0.8478652099231178,
  'colsample_bytree': 0.6737054254112979,
  'reg_alpha': 3.090492668111583e-05,
  'reg_lambda': 6.748516964647809e-06,
  'tree_method': 'hist'}

# Number of folds for cross-validation
n_folds = 5

# Prepare arrays to hold out-of-fold predictions and test predictions
oof_predictions = pd.DataFrame(np.zeros((len(X_train_full), len(upset_weights))), columns=[f'weight_{w}' for w in upset_weights])
test_predictions = pd.DataFrame(np.zeros((len(X_test), len(upset_weights))), columns=[f'weight_{w}' for w in upset_weights])

# KFold cross-validation
kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)

for idx, weight in enumerate(upset_weights):
    # Out-of-fold predictions for this upset weight
    oof_pred = np.zeros(len(X_train_full))
    
    print(f"Training models with upset weight: {weight}")
    
    for fold, (train_idx, val_idx) in enumerate(kf.split(X_train_full, y_train_full)):
        X_train = X_train_full.loc[train_idx]
        y_train = y_train_full.loc[train_idx]
        X_val = X_train_full.loc[val_idx]
        y_val = y_train_full.loc[val_idx]
        
        # Sample weights for training data
        sample_weight = np.ones(len(y_train))
        sample_weight[train_data.iloc[train_idx]['upset'] == 1] = weight
        
        # Train the model
        model = xgb.XGBClassifier(**best_params,  eval_metric='logloss')
        # model = xgb.XGBClassifier(tree_method='hist', eval_metric='logloss')
        model.fit(X_train, y_train, sample_weight=sample_weight)
        
        # Predict on validation fold
        oof_pred[val_idx] = model.predict_proba(X_val)[:, 1]
        
    # Store out-of-fold predictions
    oof_predictions.iloc[:, idx] = oof_pred
    
    # Retrain model on full training data
    sample_weight_full = np.ones(len(y_train_full))
    sample_weight_full[train_data['upset'] == 1] = weight
    model = xgb.XGBClassifier(**best_params, eval_metric='logloss')
    # model = xgb.XGBClassifier(tree_method='hist', eval_metric='logloss')
    model.fit(X_train_full, y_train_full, sample_weight=sample_weight_full)
    
    # Predict on test data
    test_predictions.iloc[:, idx] = model.predict_proba(X_test)[:, 1]


Training models with upset weight: 1.0
Training models with upset weight: 1.5
Training models with upset weight: 2.0
Training models with upset weight: 2.5
Training models with upset weight: 3.0
Training models with upset weight: 3.5
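The out-of-fold scheme relies on every training row landing in exactly one validation fold, so each entry of `oof_pred` comes from a model that never saw that row. A small check of this property, using a made-up row count, could look like this:

```python
import numpy as np
from sklearn.model_selection import KFold

n_rows = 10  # toy size; the real training set is far larger
kf = KFold(n_splits=5, shuffle=True, random_state=42)

times_validated = np.zeros(n_rows, dtype=int)
for train_idx, val_idx in kf.split(np.zeros((n_rows, 1))):
    # Training and validation indices of a fold never overlap.
    assert np.intersect1d(train_idx, val_idx).size == 0
    times_validated[val_idx] += 1

print(times_validated)
```

If any count differed from 1, some rows of the meta-model's training features would be missing or would be overwritten by a later fold.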


In [9]:
# Meta-model training data
X_meta_train = oof_predictions
y_meta_train = y_train_full

# Meta-model test data
X_meta_test = test_predictions

# Train logistic regression as meta-model
from sklearn.linear_model import LogisticRegression

meta_model = LogisticRegression(max_iter=10_000)
meta_model.fit(X_meta_train, y_meta_train)

# Predict on test data
y_pred_meta = meta_model.predict(X_meta_test)
y_pred_proba_meta = meta_model.predict_proba(X_meta_test)[:, 1]

# Overall accuracy
overall_accuracy_meta = accuracy_score(y_test, y_pred_meta)

# Compute accuracies for upsets and non-upsets
accuracy_upsets_meta = accuracy_score(
    y_test[upset_mask_test], y_pred_meta[upset_mask_test]
)
accuracy_non_upsets_meta = accuracy_score(
    y_test[~upset_mask_test], y_pred_meta[~upset_mask_test]
)

print(f"Ensemble Model Overall Accuracy: {overall_accuracy_meta:.4f}")
print(f"Ensemble Model Upset Accuracy: {accuracy_upsets_meta:.4f}")
print(f"Ensemble Model Non-Upset Accuracy: {accuracy_non_upsets_meta:.4f}")


Ensemble Model Overall Accuracy: 0.7784
Ensemble Model Upset Accuracy: 0.3614
Ensemble Model Non-Upset Accuracy: 0.9208


In [10]:
meta_model = xgb.XGBClassifier(tree_method='hist', eval_metric='error')
meta_model.fit(X_meta_train, y_meta_train)

# Predict on test data
y_pred_meta = meta_model.predict(X_meta_test)
y_pred_proba_meta = meta_model.predict_proba(X_meta_test)[:, 1]

# Overall accuracy
overall_accuracy_meta = accuracy_score(y_test, y_pred_meta)

# Compute accuracies for upsets and non-upsets
accuracy_upsets_meta = accuracy_score(
    y_test[upset_mask_test], y_pred_meta[upset_mask_test]
)
accuracy_non_upsets_meta = accuracy_score(
    y_test[~upset_mask_test], y_pred_meta[~upset_mask_test]
)

print(f"Ensemble Model Overall Accuracy: {overall_accuracy_meta:.4f}")
print(f"Ensemble Model Upset Accuracy: {accuracy_upsets_meta:.4f}")
print(f"Ensemble Model Non-Upset Accuracy: {accuracy_non_upsets_meta:.4f}")

Ensemble Model Overall Accuracy: 0.7788
Ensemble Model Upset Accuracy: 0.3179
Ensemble Model Non-Upset Accuracy: 0.9362


In [11]:
meta_model = xgb.XGBClassifier(eval_metric='error')
meta_model.fit(X_meta_train, y_meta_train)

# Predict on test data
y_pred_meta = meta_model.predict(X_meta_test)
y_pred_proba_meta = meta_model.predict_proba(X_meta_test)[:, 1]

# Overall accuracy
overall_accuracy_meta = accuracy_score(y_test, y_pred_meta)

# Compute accuracies for upsets and non-upsets
accuracy_upsets_meta = accuracy_score(
    y_test[upset_mask_test], y_pred_meta[upset_mask_test]
)
accuracy_non_upsets_meta = accuracy_score(
    y_test[~upset_mask_test], y_pred_meta[~upset_mask_test]
)

print(f"Ensemble Model Overall Accuracy: {overall_accuracy_meta:.4f}")
print(f"Ensemble Model Upset Accuracy: {accuracy_upsets_meta:.4f}")
print(f"Ensemble Model Non-Upset Accuracy: {accuracy_non_upsets_meta:.4f}")

Ensemble Model Overall Accuracy: 0.7788
Ensemble Model Upset Accuracy: 0.3179
Ensemble Model Non-Upset Accuracy: 0.9362
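The three evaluation cells repeat the same accuracy computation. One way to avoid the repetition is a small helper; `accuracy_report` is a hypothetical name, and the toy arrays stand in for `y_test`, `y_pred_meta`, and `upset_mask_test`.

```python
import numpy as np


def accuracy_report(y_true, y_pred, upset_mask):
    """Return overall, upset-only, and non-upset-only accuracy."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    upset_mask = np.asarray(upset_mask, dtype=bool)
    correct = y_true == y_pred
    return {
        "Overall": correct.mean(),
        "Upset": correct[upset_mask].mean(),
        "Non-Upset": correct[~upset_mask].mean(),
    }


y_true_toy = np.array([1, 0, 1, 1, 0, 0])
y_pred_toy = np.array([1, 0, 0, 1, 0, 1])
upset_toy = np.array([False, False, True, False, False, True])

report = accuracy_report(y_true_toy, y_pred_toy, upset_toy)
for name, value in report.items():
    print(f"Ensemble Model {name} Accuracy: {value:.4f}")
```

The helper assumes both the upset and non-upset groups are non-empty; an empty group would produce `nan`.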


## Training Meta-Model

We use the out-of-fold predictions from the base models as features to train a meta-model. Cell 9 fits a logistic regression; cells 10 and 11 swap in an XGBoost classifier for comparison.

- **Features**: Predicted probabilities from base models with different upset weights.
- **Target**: Actual match outcomes (`winner` column).
- **Meta-Model**: Logistic regression that learns to combine base model predictions.
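For the logistic regression case, combining the base models amounts to a weighted sum of their probabilities passed through a sigmoid. The sketch below uses hypothetical coefficients and probabilities, not the fitted values from this notebook, to show the arithmetic for a single match.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical coefficients, one per base model (upset weights 1.0 ... 3.5).
coef = np.array([1.2, 0.8, 0.5, 0.4, 0.3, 0.2])
intercept = -1.7

# Predicted P(winner == 1) for one match from each of the six base models.
base_probs = np.array([0.62, 0.60, 0.58, 0.55, 0.53, 0.50])

# Linear combination of the base predictions, squashed to a probability.
meta_prob = sigmoid(intercept + base_probs @ coef)
meta_label = int(meta_prob > 0.5)
print(f"meta probability: {meta_prob:.4f}, predicted label: {meta_label}")
```

On a fitted scikit-learn `LogisticRegression`, the corresponding values are available as `meta_model.coef_` and `meta_model.intercept_`, which makes it possible to see which upset weights the meta-model leans on.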


In [12]:
from sklearn.metrics import roc_auc_score, confusion_matrix

# Calculate ROC AUC
roc_auc = roc_auc_score(y_test, y_pred_proba_meta)

print(f"Ensemble Model ROC AUC: {roc_auc:.4f}")

# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred_meta)
print("Confusion Matrix:")
print(conf_matrix)


Ensemble Model ROC AUC: 0.8573
Confusion Matrix:
[[139074  39698]
 [ 39450 139550]]
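The printed matrix follows scikit-learn's layout: rows are true labels, columns are predicted labels, so for labels 0 and 1 it reads `[[TN, FP], [FN, TP]]`. The sketch below derives accuracy, precision, and recall for class 1 from those four counts; treating class 1 as "player one wins" is an assumption about the `winner` encoding.

```python
import numpy as np

# Counts copied from the output above.
conf_matrix = np.array([[139074, 39698],
                        [39450, 139550]])
tn, fp, fn, tp = conf_matrix.ravel()

accuracy = (tp + tn) / conf_matrix.sum()
precision = tp / (tp + fp)  # share of predicted class-1 rows that are correct
recall = tp / (tp + fn)     # share of true class-1 rows that are found

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

The accuracy recovered this way matches the 0.7788 reported for the XGBoost meta-model, and the near-symmetric off-diagonal counts indicate that errors are split almost evenly between the two classes.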


## Evaluating the Ensemble

We evaluate the ensemble model's performance:

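- **Overall Accuracy**: 0.7784 with the logistic regression meta-model and 0.7788 with the XGBoost meta-model.
- **Upset Accuracy**: 0.3614 (logistic regression) and 0.3179 (XGBoost).
- **Non-Upset Accuracy**: 0.9208 (logistic regression) and 0.9362 (XGBoost).
- **ROC AUC Score**: 0.8573, computed from the probabilities of the most recently fitted meta-model (the XGBoost classifier from cell 11). It measures the model's ability to distinguish between classes.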
- **Confusion Matrix**: Provides detailed insight into true positives, false positives, etc.
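ROC AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it that way on a toy example; `pairwise_auc` is a hypothetical helper, and because it builds every positive-negative pair it is only suitable for small inputs, unlike `roc_auc_score`.

```python
import numpy as np


def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Score difference for every positive/negative pair.
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (pos.size * neg.size)


y_toy = [0, 0, 1, 1]
s_toy = [0.1, 0.4, 0.35, 0.8]
auc_toy = pairwise_auc(y_toy, s_toy)
print(auc_toy)
```

Three of the four positive-negative pairs are ordered correctly here, so the toy AUC is 0.75.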

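At nearly identical overall accuracy (0.7784 vs. 0.7788), the logistic regression meta-model predicts upsets better than the XGBoost meta-model (0.3614 vs. 0.3179) at the cost of lower non-upset accuracy (0.9208 vs. 0.9362). This section does not evaluate an unweighted single-model baseline, so it cannot quantify how much the ensemble improves upset prediction over one.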
