# Tournament Predictor

## 2021 Results

In 2021, I combined the Kaggle input data with Kenpom data, then trained a Gradient Boosting Classifier to predict the results.

My resulting bracket was pretty decent. It had an overall log loss of 0.66587, which put me at #503 (out of 707) on the Kaggle March Madness prediction leaderboard.

![image.png](attachment:image.png)

## 2022 Predictions

Now we are going to predict the result of each game in the 2022 tournament and print the results.

**Steps**

(1) Import Train & Test Sets and combine with Kenpom data

(2) Prepare sets for model training

(3) Use Lazy Predict to test multiple model types

(4) Apply an ExtraTreesClassifier to predict the win probability of each potential game

(5) Analyze Results

## Import Packages

In [14]:
import numpy as np
import pandas as pd
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingRegressor, HistGradientBoostingClassifier, ExtraTreesClassifier
from sklearn.model_selection import KFold, GroupKFold
from sklearn.metrics import log_loss, accuracy_score
from tqdm.notebook import tqdm
import lazypredict
from lazypredict.Supervised import LazyClassifier
import xgboost as xgb
import glob
import os
import gc
from rpy2 import robjects
import PyPDF2
import cv2
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (40,40)

## Set Path, Import Datasets, & Merge with Kenpom

Make sure to update the PATH to wherever your files are located

In [15]:
# Set Path
PATH = ""
os.chdir(PATH)
print(os.getcwd())

# Loading Datasets
train = pd.read_csv("inputs/tourney.csv")
test = pd.read_csv('inputs/test.csv')
kenpom = pd.read_csv('inputs/kenpom.csv')
kenpom = kenpom.drop(['team'],axis = 1)

/Users/matthewdenko/Documents/March_Madness_Predictions/March_Madness/march_madness_2022


In [16]:
# Merging train with Kenpom
train = train.merge(kenpom, how='left', left_on=['Season','TeamID1'],right_on=['year','teamid_x'])
train = train.merge(kenpom, how='left', left_on=['Season','TeamID2'],right_on=['year','teamid_x'])
train = train.drop_duplicates()
train.head()

Unnamed: 0,Season,TeamID1,TeamID2,result,win_pct_A_1,win_pct_N_1,win_pct_H_1,win_pct_All_1,win_pct_A_2,win_pct_N_2,...,luck_rank_y,sos_pyth_y,sos_pyth_rank_y,sos_oppo_y,sos_oppo_rank_y,sos_oppd_y,sos_oppd_rank_y,ncsos_pyth_y,ncsos_pyth_rank_y,teamid_x_y
0,2003,1421,1411,1,0.12,1.0,0.83,0.45,0.33,0.75,...,75.0,-12.96,321.0,94.4,322.0,107.4,314.0,-8.09,308.0,1411.0
16,2003,1112,1436,1,0.92,0.0,0.93,0.89,0.5,0.6,...,119.0,-5.14,251.0,99.3,249.0,104.4,245.0,-2.29,213.0,1436.0
17,2003,1113,1272,1,0.4,0.5,0.8,0.62,0.64,1.0,...,151.0,4.83,84.0,103.4,133.0,98.6,54.0,3.33,81.0,1272.0
18,2003,1141,1166,1,0.64,1.0,0.92,0.79,0.64,1.0,...,99.0,-0.19,155.0,102.5,159.0,102.7,168.0,-0.51,164.0,1166.0
22,2003,1143,1301,1,0.64,0.25,0.93,0.72,0.3,0.5,...,223.0,8.48,34.0,106.9,22.0,98.4,49.0,-6.48,292.0,1301.0
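The `_x` and `_y` suffixes in the columns above come from merging the same ratings table twice. Here is a minimal sketch of that behavior on made-up data; the frame names, the `adj_em` column, and all values are hypothetical stand-ins, not the real Kenpom file.

```python
import pandas as pd

# Toy stand-ins for the game table and the ratings table (values are made up).
games = pd.DataFrame({"Season": [2022], "TeamID1": [1], "TeamID2": [2]})
ratings = pd.DataFrame(
    {"year": [2022, 2022], "teamid": [1, 2], "adj_em": [25.0, 10.0]}
)

# First merge: ratings for TeamID1. No column names collide yet, so no suffix is added.
merged = games.merge(
    ratings, how="left", left_on=["Season", "TeamID1"], right_on=["year", "teamid"]
)

# Second merge: ratings for TeamID2. The ratings columns now exist on both sides,
# so pandas appends _x (left side, TeamID1) and _y (right side, TeamID2).
merged = merged.merge(
    ratings, how="left", left_on=["Season", "TeamID2"], right_on=["year", "teamid"]
)

print(list(merged.columns))
```

In the real data this means every suffixed key column (for example both copies of the year and team id) stays in the frame unless it is dropped explicitly, so it is worth checking `X.columns` before training.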


In [17]:
# Merging test with Kenpom
test = test.merge(kenpom, how='left', left_on=['Season','TeamID1'],right_on=['year','teamid_x'])
test = test.merge(kenpom, how='left', left_on=['Season','TeamID2'],right_on=['year','teamid_x'])
test = test.drop_duplicates()
test.head()

Unnamed: 0,ID,Season,TeamID1,TeamID2,win_pct_A_1,win_pct_N_1,win_pct_H_1,win_pct_All_1,win_pct_A_2,win_pct_N_2,...,luck_rank_y,sos_pyth_y,sos_pyth_rank_y,sos_oppo_y,sos_oppo_rank_y,sos_oppd_y,sos_oppd_rank_y,ncsos_pyth_y,ncsos_pyth_rank_y,teamid_x_y
0,2022_1103_1104,2022,1103,1104,0.67,0.67,0.77,0.71,0.36,0.5,...,226,12.79,1,109.7,3,96.9,7,10.22,11,1104.0
1,2022_1103_1112,2022,1103,1112,0.67,0.67,0.77,0.71,0.75,1.0,...,77,6.34,67,106.0,75,99.7,57,-0.22,170,1112.0
2,2022_1103_1116,2022,1103,1116,0.67,0.67,0.77,0.71,0.56,0.6,...,117,7.93,46,106.8,56,98.9,35,-2.79,251,1116.0
3,2022_1103_1120,2022,1103,1120,0.67,0.67,0.77,0.71,0.7,0.67,...,271,8.77,34,107.2,44,98.5,26,4.16,58,1120.0
4,2022_1103_1124,2022,1103,1124,0.67,0.67,0.77,0.71,0.73,0.75,...,138,10.71,12,107.4,39,96.7,5,-1.65,215,1124.0


## Prep Data For Model Training

In [18]:
# Setting X & y for train, and X_test for test
y = train["result"]
s = train["Season"]
X = train.drop(['Season','TeamID1','TeamID2','result','teamid_x_y'], axis=1)
X_test = test.drop(['ID', 'Season','TeamID1','TeamID2','teamid_x_y'], axis=1)

In [19]:
# Allocate out-of-fold (OOF) predictions for train and a running average for test
train_oof = np.zeros((X.shape[0],))
test_preds = 0

# Cross-validation setup: folds are grouped by season,
# so a season never appears in both train and validation
NUM_FOLDS = 10
kf = GroupKFold(n_splits=NUM_FOLDS)
max_iter = 550

# Build the folds; only the split from the last fold survives the loop
# and is reused by the Lazy Predict cell below
for f, (train_ind, val_ind) in tqdm(enumerate(kf.split(X, y, s))):
    train_df, val_df = X.iloc[train_ind], X.iloc[val_ind]
    train_target, val_target = y.iloc[train_ind], y.iloc[val_ind]
    train_df = train_df.fillna(0)
    val_df = val_df.fillna(0)

0it [00:00, ?it/s]
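To make the grouped split above concrete, here is a small sketch on made-up data showing that `GroupKFold` keeps every season entirely on one side of each split. The toy arrays are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 12 games spread over 4 seasons (values are made up).
seasons = np.repeat([2018, 2019, 2020, 2021], 3)
X_toy = np.arange(12).reshape(-1, 1)
y_toy = np.tile([0, 1, 0], 4)

gkf = GroupKFold(n_splits=4)
for fold, (train_idx, val_idx) in enumerate(gkf.split(X_toy, y_toy, groups=seasons)):
    train_seasons = set(int(s) for s in seasons[train_idx])
    val_seasons = set(int(s) for s in seasons[val_idx])
    # A season never appears on both sides of the split.
    assert train_seasons.isdisjoint(val_seasons)
    print(fold, sorted(val_seasons))
```

Grouping by season matters here because games from the same season share team-level features, and a plain `KFold` would let those leak between train and validation.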

In [20]:
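# Running multiple models to quickly test which classifier is most accurate
# LazyClassifier.fit takes (X_train, X_test, y_train, y_test).
# Caution: the evaluation set here is the full X, which contains the rows in train_df,
# so the scores below are mostly in-sample and optimistic. Passing val_df and val_target
# instead would score on held-out seasons only.
clf = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=log_loss)
models, predictions = clf.fit(train_df, X, train_target, y)
models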

100%|██████████| 29/29 [00:17<00:00,  1.62it/s]


Model,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,log_loss,Time Taken
ExtraTreesClassifier,0.97,0.97,0.97,0.97,1.13,0.59
RandomForestClassifier,0.96,0.96,0.96,0.96,1.29,1.41
LGBMClassifier,0.96,0.96,0.96,0.96,1.45,1.94
DecisionTreeClassifier,0.96,0.96,0.96,0.96,1.49,0.24
XGBClassifier,0.96,0.96,0.96,0.96,1.52,1.79
ExtraTreeClassifier,0.95,0.95,0.95,0.95,1.75,0.04
LabelSpreading,0.94,0.94,0.94,0.94,1.9,0.39
LabelPropagation,0.94,0.94,0.94,0.94,1.9,0.29
BaggingClassifier,0.93,0.93,0.93,0.93,2.27,1.37
NuSVC,0.89,0.89,0.89,0.89,3.77,1.14
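A note on the `log_loss` column above: a log loss above 1 next to 97% accuracy looks odd. My understanding (an assumption, not something verified in this notebook) is that Lazy Predict passes hard class labels from `predict()` to the custom metric, not probabilities. The sketch below, on made-up labels, shows how that inflates log loss: each wrong hard prediction is clipped to a probability near zero and gets a very large penalty.

```python
import numpy as np
from sklearn.metrics import log_loss

# Made-up labels: 100 games, and the classifier gets 97 of them right.
y_true = np.array([1] * 50 + [0] * 50)
hard_preds = y_true.astype(float)
hard_preds[:3] = 0.0  # three wrong hard predictions (the true label is 1)

# Hard 0/1 predictions: each miss is clipped to a probability near 0 and costs a lot.
loss_hard = log_loss(y_true, hard_preds)

# Softer probabilities that make the same 97 correct calls at a 0.5 threshold.
soft_preds = np.where(hard_preds == 1.0, 0.9, 0.1)
loss_soft = log_loss(y_true, soft_preds)

print(f"hard labels: {loss_hard:.2f}, probabilities: {loss_soft:.2f}")
```

So the `log_loss` column is probably best read as a rescaled error rate, and it is not directly comparable to the probability-based log loss reported later.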


## Train and Execute Model

Looking at the results above, the Extra Trees Classifier performed the best: it had the highest accuracy and the lowest log loss. Keep in mind that these scores were computed on data that includes the training rows, so they are optimistic.

So, unlike 2021, where we used a Gradient Boosting Classifier, we are going to train an ExtraTreesClassifier from sklearn.

The Extra Trees Classifier implements a meta estimator that fits a number of randomized decision trees (a.k.a. extra-trees) on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
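As a rough illustration of the averaging described above, here is a minimal sketch on synthetic data. The dataset, `n_estimators=200`, and the variable names are arbitrary choices for the example, not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the matchup features (not the tournament data).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# n_estimators=200 is an arbitrary choice for this sketch.
et = ExtraTreesClassifier(n_estimators=200, random_state=0)
et.fit(X_tr, y_tr)

# Column 1 of predict_proba is the class-1 probability averaged over all trees.
win_prob = et.predict_proba(X_va)[:, 1]
print(win_prob[:5])
```

Because the probabilities are averages over many trees, they tend to sit away from exactly 0 or 1, which is helpful for a log loss metric.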

In [21]:
# Setting Model Parameters
NUM_FOLDS = 10
kf = GroupKFold(n_splits=NUM_FOLDS)
max_iter = 550

# Training and running model predictions
for f, (train_ind, val_ind) in tqdm(enumerate(kf.split(X, y, s))):
        #print(f'Fold {f}')
        train_df, val_df = X.iloc[train_ind], X.iloc[val_ind]
        train_target, val_target = y.iloc[train_ind], y.iloc[val_ind]
        train_df = train_df.fillna(0)
        val_df = val_df.fillna(0)
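        # 2021 setup kept for reference; these parameters belong to HistGradientBoostingClassifier, not ExtraTreesClassifier
        #model = HistGradientBoostingClassifier(max_iter=max_iter, validation_fraction=None, learning_rate=0.01, max_depth=2, min_samples_leaf=32)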
        model = ExtraTreesClassifier()

        model =  model.fit(train_df, train_target)
        temp_oof = model.predict_proba(val_df)[:,1]
        temp_test = model.predict_proba(X_test)[:,1]

        train_oof[val_ind] = temp_oof
        test_preds += temp_test/NUM_FOLDS
        
        print(log_loss(val_target, temp_oof))

0it [00:00, ?it/s]

0.5217488796413313
0.4675413532876759
0.5003285630074428
0.49721920641326584
0.49588374593913376
0.4922101186493693
0.49995605912992347
0.4967566020301191
0.5571887221369959
0.589934932182794


## Evaluate Model Performance Measures

In [22]:
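# Evaluating Log Loss and Accuracy
X = X.fillna(0)
# Note: `model` is the estimator from the last fold, so these predictions are
# in-sample for most of the rows in X
y_preds = model.predict(X)
y_probs = model.predict_proba(X)
# Log loss is computed from the out-of-fold probabilities
loss = log_loss(y, train_oof)
acc = accuracy_score(y,y_preds)
print("Log Loss:", f'{loss:,.2f}')
print("\nAccuracy Score:", f'{acc:,.2f}')

Log Loss: 0.51

Accuracy Score: 0.96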


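The model performance looks pretty good, with one caveat. The log loss of 0.51 is computed from out-of-fold predictions, so it is the more trustworthy number. The 96% accuracy comes from the last fold's model scored on all of X, most of which it was trained on, so it likely overstates how well the model generalizes. Next we are going to export the predictions for 2022 and see how the bracket looks!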
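For comparison, here is a small sketch of the gap between in-sample and out-of-fold accuracy, using synthetic data with some label noise. Everything in it (the dataset, `flip_y=0.2`, the variable names) is an assumption for illustration; it does not reproduce the numbers above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict

# Synthetic data with 20% label noise, so perfect generalization is impossible.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, flip_y=0.2, random_state=0
)
model_demo = ExtraTreesClassifier(random_state=0)

# In-sample: fit and score on the same rows.
in_sample_preds = model_demo.fit(X_demo, y_demo).predict(X_demo)
in_sample_acc = accuracy_score(y_demo, in_sample_preds)

# Out-of-fold: each row is predicted by a model that never saw it.
oof_probs = cross_val_predict(
    model_demo, X_demo, y_demo, cv=5, method="predict_proba"
)[:, 1]
oof_acc = accuracy_score(y_demo, (oof_probs >= 0.5).astype(int))

print(f"in-sample: {in_sample_acc:.2f}, out-of-fold: {oof_acc:.2f}")
```

In this notebook, the equivalent out-of-fold figure could be computed as `accuracy_score(y, (train_oof >= 0.5).astype(int))`, since `train_oof` already holds the held-out probabilities.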

In [23]:
# Saving Probabilities
np.save('train_oof', train_oof)
np.save('test_preds', test_preds)

## Export Kaggle Submission File

In [24]:
# Reading in 2022 data
test = pd.read_csv('inputs/test.csv')
MSampleSubmission = pd.read_csv('inputs/MSampleSubmissionStage2.csv')

In [25]:
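# Shaping predictions for the submission file
# The second half of the rows is flipped to 1 - p before averaging per ID below
# (this assumes those rows list the same matchups with the teams in swapped order)
idx = test_preds.shape[0] //2
test_preds[idx:] = 1 - test_preds[idx:]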

pred = pd.concat([test.ID, pd.Series(test_preds)], axis=1).groupby('ID')[0]\
        .mean().reset_index().rename(columns={0:'Pred'})
sub = MSampleSubmission.drop(['Pred'],axis=1).merge(pred, on='ID')
sub.to_csv('outputs/submission2.csv', index=False)
sub.head()


Unnamed: 0,ID,Pred
0,2022_1103_1104,0.23
1,2022_1103_1112,0.08
2,2022_1103_1116,0.17
3,2022_1103_1120,0.1
4,2022_1103_1124,0.08
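The flip-and-average step above is easier to follow on a tiny example. This sketch assumes the test file lists each matchup twice, with the team order swapped in the second half; that layout is my reading of the code, and the IDs and probabilities below are made up.

```python
import numpy as np
import pandas as pd

# Made-up example: two matchups, each listed twice
# (the second half is assumed to have the team order swapped).
toy_test = pd.DataFrame({"ID": ["2022_1_2", "2022_1_3", "2022_1_2", "2022_1_3"]})
toy_preds = np.array([0.2, 0.7, 0.6, 0.1])

# Flip the second half so every row is the win probability for the same team.
half = toy_preds.shape[0] // 2
toy_preds[half:] = 1 - toy_preds[half:]

# Average the two estimates for each matchup ID.
toy_sub = (
    pd.concat([toy_test["ID"], pd.Series(toy_preds, name="Pred")], axis=1)
    .groupby("ID")["Pred"]
    .mean()
    .reset_index()
)
print(toy_sub)
```

Averaging the two orderings makes the final prediction symmetric, so it does not depend on which team happened to be listed first.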


## Bracket

![image.png](attachment:image.png)

## Conclusion

The model generated strong results for past tournaments (strong accuracy score, low log loss). However, we can't measure overfitting effectively because we don't have the model's performance on a held-out test set. If I had more time, I would split the historical data into a train and test set to help reduce the possibility of overfitting; however, I wanted to submit my predictions before the tournament started.

In the end, the model's predictions closely followed the actual tournament seeds. This is not surprising, because many of the inputs that the tournament takes into consideration when seeding are similar to the inputs used in this model. This model is what they call in the bracket industry CHALK! There are only a few upsets, and it predicts an all-1-seed Final Four. However, Kansas is predicted to win it all even though they are ranked as the 4th overall number 1 seed. Please do not use this model as betting advice.