# 2024 Bracket Simulation
This notebook uses a previously optimized simulation to predict this year's March Madness

By: Jackson Isidor and Alex Sullivan

This optimized model included and `XGBoost` model with:
- `Features`: 'badj_em_diff', 'wab_diff', 'barthag_diff', 'talent_diff', 'elite_sos_diff', 'win_percent_diff', 'pppo_diff', 'k_off_diff'
- `Parameters`: 
    - n_estimators=300
    - max_depth=7
    - learning_rate=0.01
    - subsample=0.8
    - colsample_bytree=0.8
    - gamma=0

In [230]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.model_selection import GridSearchCV

from xgboost import XGBClassifier

import warnings
warnings.filterwarnings("ignore")

## Train the Model
Load in all of the historical data and train the model and all of it to prepare for the 2024 predictions:

In [231]:
matchups = pd.read_csv("/Users/jacksonisidor/Documents/March Madness Project/Data Processing And Exploration/matchups.csv")

In [232]:
matchups.head()

Unnamed: 0.1,Unnamed: 0,year,team_1,seed_1,round_1,current_round,team_2,seed_2,round_2,badj_em_1,...,badj_o_diff,badj_d_diff,wab_diff,barthag_diff,talent_diff,elite_sos_diff,win_percent_diff,pppo_diff,k_off_diff,avg_hgt_diff
0,0,2023,Alabama,1,16,64,Texas A&M Corpus Chris,16,64,27.1,...,-27.1,27.1,16.2,0.505,62.286,24.154,19.776876,0.008,0.998,3.043
1,1,2023,Maryland,8,32,64,West Virginia,9,64,16.5,...,-16.5,16.5,-0.5,-0.032,3.69,-6.315,6.060606,-0.002,0.056,0.467
2,2,2023,San Diego St.,5,2,64,College of Charleston,12,64,21.2,...,-21.2,21.2,4.1,0.137,37.036,16.74,-9.659091,-0.062,-5.79,-0.205
3,3,2023,Virginia,4,64,64,Furman,13,32,16.9,...,-16.9,16.9,6.1,0.174,50.41,14.426,0.705645,-0.072,-7.131,-0.022
4,4,2023,Creighton,6,8,64,North Carolina St.,11,64,21.0,...,-21.0,21.0,0.8,0.073,12.798,8.184,-6.060606,-0.02,-1.706,0.827


In [233]:
predictors = ['badj_em_diff', 'badj_o_diff', 'badj_d_diff', 'wab_diff', 'barthag_diff', 'elite_sos_diff', 
              'win_percent_diff', 'avg_hgt_diff']
target = "winner"

# Train model on rest of the years
xgb_pipeline = make_pipeline(StandardScaler(), 
                             XGBClassifier(n_estimators=300,
                             max_depth=7,
                             learning_rate=0.01,
                             subsample=0.8,
                             colsample_bytree=0.8,
                             gamma=0
                             ))

xgb_pipeline.fit(matchups[predictors], matchups[target])

**This fitted pipeline will be used throughout the 2024 simulation.**

## Data Processing
I did not have updated 2024 data when I ran the simulation first, so I will have to perform the same preprocessing on the new data. This will just involve copy and pasting over code from the previous preprocessing notebooks. 

In [234]:
raw_matchups = pd.read_csv("/Users/jacksonisidor/Documents/March Madness Project/MM Data Sets/Tournament Matchups 2024.csv")
raw_stats = pd.read_csv("/Users/jacksonisidor/Documents/March Madness Project/MM Data Sets/KenPom Barttorvik 2024.csv")

raw_stats = raw_stats[(raw_stats.YEAR == 2024)]

In [235]:
raw_matchups.head()

Unnamed: 0,YEAR,BY YEAR NO,BY ROUND NO,TEAM NO,TEAM,SEED,ROUND,CURRENT ROUND,SCORE
0,2024,2036,0,1067,Connecticut,1,1,64,
1,2024,2035,0,1026,Stetson,16,64,64,
2,2024,1972,0,1060,Florida Atlantic,8,1,64,
3,2024,1971,0,1036,Northwestern,9,64,64,
4,2024,2000,0,1029,San Diego St.,5,1,64,


In [236]:
raw_stats.head()

Unnamed: 0,YEAR,CONF,CONF ID,QUAD NO,QUAD ID,TEAM NO,TEAM ID,TEAM,SEED,ROUND,...,BADJT RANK,AVG HGT RANK,EFF HGT RANK,EXP RANK,TALENT RANK,FT% RANK,OP FT% RANK,PPPO RANK,PPPD RANK,ELITE SOS RANK
0,2024,MAC,17,61,1,1079,2,Akron,14,0,...,276,238,199,19,176,164,47,122,51,249
1,2024,SEC,28,63,3,1078,3,Alabama,4,0,...,13,33,8,156,106,10,314,2,263,7
2,2024,P12,24,63,3,1077,8,Arizona,2,0,...,16,50,37,196,7,195,134,8,14,47
3,2024,SEC,28,64,4,1076,12,Auburn,4,0,...,58,86,76,127,69,59,284,12,8,69
4,2024,B12,7,63,3,1075,14,Baylor,3,0,...,274,31,22,304,34,97,254,15,155,1


## Round of 68 Predictions

My simulation only does the round of 64 and beyond, but the round of 68 games have not been played yet.

I need to predict these and alter the data set based on those results before I continue with preprocessing.

In [237]:
r68matchups = raw_matchups[raw_matchups["CURRENT ROUND"] == 68]

r68_matchup_stats = pd.merge(r68matchups, raw_stats, on=["YEAR", "TEAM"], how="left").drop(columns=["SEED_y", "ROUND_y"])
r68_matchup_stats.rename(columns={"SEED_x":"SEED", "ROUND_x":"ROUND"}, inplace=True)

Since I will be using the next step matchup processing again, it will be a function:

In [238]:
def make_matchups(teams):
    matchups = pd.DataFrame(columns=['year', 'team_1', 'seed_1', 'round_1', 'current_round', 'score_1',
                                     'team_2', 'seed_2', 'round_2', 'score_2'])

    matchup_info_list = []
    # iterate through data frame and jump 2 each iteration
    for i in range(0, len(teams), 2):
        team1_info = teams.iloc[i]
        team2_info = teams.iloc[i+1]

        matchup_info = {
                'year': team1_info['YEAR'],
                'team_1': team1_info['TEAM'],
                'seed_1': team1_info['SEED'],
                'round_1': team1_info['ROUND'],
                'score_1' : team1_info['SCORE'],
                'score_2' : team2_info['SCORE'],
                'current_round': team1_info['CURRENT ROUND'],
                'team_2': team2_info['TEAM'],
                'seed_2': team2_info['SEED'],
                'round_2': team2_info['ROUND'],
                'badj_em_1': team1_info['BADJ EM'],
                'badj_o_1': team1_info['BADJ D'],
                'badj_d_1': team1_info['BADJ O'],
                'wab_1': team1_info['WAB'],
                'barthag_1': team1_info['BARTHAG'],
                'talent_1': team1_info['TALENT'],
                'elite_sos_1': team1_info['ELITE SOS'],
                'win_percent_1': team1_info['WIN%'],
                'pppo_1': team1_info['PPPO'],
                'k_off_1': team1_info['K OFF'],
                'avg_hgt_1': team1_info["AVG HGT"],
                'badj_em_2': team2_info['BADJ EM'],
                'badj_o_2': team1_info['BADJ O'],
                'badj_d_2': team1_info['BADJ D'],
                'wab_2': team2_info['WAB'],
                'barthag_2': team2_info['BARTHAG'],
                'talent_2': team2_info['TALENT'],
                'elite_sos_2': team2_info['ELITE SOS'],
                'win_percent_2': team2_info['WIN%'],
                'pppo_2': team2_info['PPPO'],
                'k_off_2': team2_info['K OFF'],
                'avg_hgt_2': team2_info["AVG HGT"]
            }
    
        matchup_info_list.append(matchup_info)

    matchups = pd.concat([matchups, pd.DataFrame(matchup_info_list)])
        
    # get the stat differences same as before
    stat_variables = ['badj_em', 'badj_o', 'badj_d', 'wab', 'barthag', 'talent', 'elite_sos', 'win_percent', 'pppo', 
                  'k_off', 'avg_hgt']
    for variable in stat_variables:
        matchups[f'{variable}_diff'] = matchups[f'{variable}_1'] - matchups[f'{variable}_2']
        
    return matchups

In [239]:
# Process the round of 68 matchups
r68_matchups = make_matchups(r68_matchup_stats)

In [240]:
# Make predictions for the round of 68
r68_predictions = xgb_pipeline.predict(r68_matchups[predictors])
r68_matchups["R68 Prediction"] = r68_predictions

In [241]:
r68_matchups

Unnamed: 0,year,team_1,seed_1,round_1,current_round,score_1,team_2,seed_2,round_2,score_2,...,badj_d_diff,wab_diff,barthag_diff,talent_diff,elite_sos_diff,win_percent_diff,pppo_diff,k_off_diff,avg_hgt_diff,R68 Prediction
0,2024,Howard,16,64,68,,Wagner,16,64,,...,-8.407,-0.4,0.051,31.116,0.34,1.724138,0.091,8.9186,0.207,1
1,2024,Boise St.,10,64,68,,Colorado,10,64,,...,16.224,-0.8,-0.032,-19.982,1.61,-3.921569,-0.029,-2.44,-0.386,1
2,2024,Montana St.,16,64,68,,Grambling St.,16,64,,...,-2.97,-4.6,0.138,-12.302,-5.726,-6.451613,0.08,7.4674,0.282,0
3,2024,Virginia,10,64,68,,Colorado St.,10,64,,...,12.652,-0.3,-0.031,39.424,-1.437,0.94697,-0.094,-9.322,1.257,0


I will now just manually drop Wagner, Colorado, Montana St. and Virginia from the original df because they were predicted to lose:

In [242]:
teams_to_drop = ["Wagner", "Boise St.", "Montana St.", "Virginia"]
raw_matchups = raw_matchups[~raw_matchups['TEAM'].isin(teams_to_drop)]
raw_matchups["CURRENT ROUND"] = 64

## Preprocessing Continued
Now, back to the planned preprocessing steps with only round of 64 teams.

In [243]:
## merge dfs
raw_matchup_stats = pd.merge(raw_matchups, raw_stats, on=["YEAR", "TEAM"], how="left").drop(columns=["SEED_y", "ROUND_y"])
raw_matchup_stats.rename(columns={"SEED_x":"SEED", "ROUND_x":"ROUND"}, inplace=True)

Upon inspection, I found some failed merges (NaNs that should be there), so I will go through a similar process to last time to resolve this. 

In [244]:
unique_teams_matchups = set(raw_matchups["TEAM"].unique())
unique_teams_stats = set(raw_stats["TEAM"].unique())

# Teams present in raw_matchups but not in stats_data
teams_only_in_matchups = unique_teams_matchups - unique_teams_stats

# Teams present in stats_data but not in raw_matchups
teams_only_in_stats = unique_teams_stats - unique_teams_matchups

print("Teams only in matchups:", teams_only_in_matchups)
print("Teams only in stats:", teams_only_in_stats)

Teams only in matchups: set()
Teams only in stats: {'Montana St.', 'Virginia', 'Wagner', 'Boise St.'}


The teams only in stats are fine because I dropped those when I predicted them to lose in the round of 16. However, the teams only in matchups need to be resolved. 

In [245]:
## merge rows
r64_2024_matchups = make_matchups(raw_matchup_stats)

In [246]:
r64_2024_matchups.head()

Unnamed: 0,year,team_1,seed_1,round_1,current_round,score_1,team_2,seed_2,round_2,score_2,...,badj_o_diff,badj_d_diff,wab_diff,barthag_diff,talent_diff,elite_sos_diff,win_percent_diff,pppo_diff,k_off_diff,avg_hgt_diff
0,2024,Connecticut,1,1,64,,Stetson,16,64,,...,-33.493,33.493,18.1,0.615,38.477,20.981,29.886148,0.118,11.298,0.696
1,2024,Florida Atlantic,8,1,64,,Northwestern,9,64,,...,-14.274,14.274,-0.3,-0.056,-13.964,-10.433,10.132576,0.051,5.209,-1.035
2,2024,San Diego St.,5,1,64,,UAB,12,64,,...,-17.67,17.67,5.3,0.216,21.68,12.628,2.083333,-0.025,-2.37,0.49
3,2024,Auburn,4,1,64,,Yale,13,64,,...,-28.605,28.605,8.0,0.213,33.957,11.838,10.446247,0.072,6.939,-0.509
4,2024,BYU,6,1,64,,Duquesne,11,64,,...,-21.656,21.656,4.7,0.146,19.283,11.228,2.049911,0.121,12.098,1.998


## Simulation Functions
I used a few functions in the last notebook that operated as the simulator, so I will bring those over here now (I realize all this copy and pasting isn't the best deployment strategy, but I will work on improving that after this project is due).

Explanation of each of these functions can be found in the mm_bracket_simulator.ipynb notebook.

In [247]:
def score_bracket(predicted, actual):
    
    score = 0
    for (pred_index, pred_matchup), (act_index, act_matchup) in zip(predicted.iterrows(), actual.iterrows()):
        
        if (pred_matchup["team_1"] == act_matchup["team_1"]) and (pred_matchup["prediction"] == act_matchup["winner"] == 1):
            score += 64 / pred_matchup["current_round"]
            
        elif (pred_matchup["team_2"] == act_matchup["team_2"]) and (pred_matchup["prediction"] == act_matchup["winner"] == 0): 
            score += 64 / pred_matchup["current_round"]
            
    return score

In [248]:
def get_winner_info(matchups):
    next_round_teams_list = []
    
    for index, matchup in matchups.iterrows():
        # if team_1 wins, get all info that ends in "_1"
        if matchup["prediction"] == 1:
            winning_team_info = matchup.filter(regex='_1$').rename(lambda x: x[:-2], axis=0)
        # if team_2 wins, get all info that ends in "_2"
        else:
            winning_team_info = matchup.filter(regex='_2$').rename(lambda x: x[:-2], axis=0)
        
        winning_team_info["year"] = matchup["year"]
        winning_team_info["current_round"] = matchup["current_round"] / 2
        
        next_round_teams_list.append(pd.DataFrame(winning_team_info).T)
    
    next_round_teams = pd.concat(next_round_teams_list, ignore_index=True)
        
    return next_round_teams

In [249]:
def next_sim_matchups(winning_teams):
    matchups = pd.DataFrame(columns=['year', 'team_1', 'seed_1', 'round_1', 'current_round', 'team_2', 'seed_2', 'round_2'])

    matchup_info_list = []
    # iterate through data frame and jump 2 each iteration
    for i in range(0, len(winning_teams)-1, 2):
        team1_info = winning_teams.iloc[i]
        team2_info = winning_teams.iloc[i+1]

        matchup_info = {
            'year': team1_info['year'],
            'team_1': team1_info['team'],
            'seed_1': team1_info['seed'],
            'round_1': team1_info['round'],
            'current_round': team1_info['current_round'],
            'team_2': team2_info['team'],
            'seed_2': team2_info['seed'],
            'round_2': team2_info['round'],
            'badj_em_1': team1_info['badj_em'],
            'badj_o_1': team1_info['badj_o'],
            'badj_d_1': team1_info['badj_d'],
            'wab_1': team1_info['wab'],  
            'barthag_1': team1_info['barthag'],
            'talent_1': team1_info['talent'],
            'elite_sos_1': team1_info['elite_sos'],
            'win_percent_1': team1_info['win_percent'],
            'pppo_1': team1_info['pppo'],
            'k_off_1': team1_info['k_off'],
            'avg_hgt_1': team1_info['avg_hgt'],
            'badj_em_2': team2_info['badj_em'],
            'badj_o_2': team2_info['badj_o'],
            'badj_d_2': team2_info['badj_d'],
            'wab_2': team2_info['wab'],  
            'barthag_2': team2_info['barthag'],
            'talent_2': team2_info['talent'],
            'elite_sos_2': team2_info['elite_sos'],
            'win_percent_2': team2_info['win_percent'],
            'pppo_2': team2_info['pppo'],
            'k_off_2': team2_info['k_off'],
            'avg_hgt_2': team2_info['avg_hgt']
        }
        matchup_info_list.append(matchup_info)

    matchups = pd.concat([matchups, pd.DataFrame(matchup_info_list)])
        
    # get the stat differences same as before
    stat_variables = ['badj_em', 'badj_o', 'badj_d', 'wab', 'barthag', 'talent', 'elite_sos', 'win_percent', 'pppo', 
                  'k_off', 'avg_hgt']
    for variable in stat_variables:
        matchups[f'{variable}_diff'] = matchups[f'{variable}_1'] - matchups[f'{variable}_2']
        
    return matchups

In [250]:
def sim_bracket(round_matchups, model):

    # get predictions for each game in the current round and add that column to the df
    preds = model.predict(round_matchups[predictors])
    round_matchups.loc[:, "prediction"] = preds
    
    # add in probabilities too in case I want to identify the most likely upsets
    probs = model.predict_proba(round_matchups[predictors])
    round_matchups.loc[:, "win probability"] = probs[:, 1]

    
    # base case for recursion (we are in the championship round)
    if round_matchups["current_round"].iloc[0] == 2:
        return round_matchups
    
    # pass teams on to the next round in a new df and combine them into new matchups
    next_round_teams = get_winner_info(round_matchups)
    next_round_matchups = next_sim_matchups(next_round_teams)

    # recurse through making a simulated df that mimics the structure of the actual df
    return pd.concat([round_matchups, sim_bracket(next_round_matchups, model)], ignore_index=True)

## Simulate 2024 March Madness
Pass in the 2024 round of 64 matchups and the previously trained model, then output the resulting bracket:

In [251]:
r64_2024_matchups

Unnamed: 0,year,team_1,seed_1,round_1,current_round,score_1,team_2,seed_2,round_2,score_2,...,badj_o_diff,badj_d_diff,wab_diff,barthag_diff,talent_diff,elite_sos_diff,win_percent_diff,pppo_diff,k_off_diff,avg_hgt_diff
0,2024,Connecticut,1,1,64,,Stetson,16,64,,...,-33.493,33.493,18.1,0.615,38.477,20.981,29.886148,0.118,11.298,0.696
1,2024,Florida Atlantic,8,1,64,,Northwestern,9,64,,...,-14.274,14.274,-0.3,-0.056,-13.964,-10.433,10.132576,0.051,5.209,-1.035
2,2024,San Diego St.,5,1,64,,UAB,12,64,,...,-17.67,17.67,5.3,0.216,21.68,12.628,2.083333,-0.025,-2.37,0.49
3,2024,Auburn,4,1,64,,Yale,13,64,,...,-28.605,28.605,8.0,0.213,33.957,11.838,10.446247,0.072,6.939,-0.509
4,2024,BYU,6,1,64,,Duquesne,11,64,,...,-21.656,21.656,4.7,0.146,19.283,11.228,2.049911,0.121,12.098,1.998
5,2024,Illinois,3,1,64,,Morehead St.,14,64,,...,-24.236,24.236,9.4,0.333,40.756,20.861,3.137255,0.086,8.275,0.691
6,2024,Washington St.,7,1,64,,Drake,10,64,,...,-16.17,16.17,0.2,0.028,14.742,8.666,-9.090909,-0.034,-3.587,2.936
7,2024,Iowa St.,2,1,64,,South Dakota St.,15,64,,...,-27.105,27.105,14.3,0.375,36.587,23.039,18.121442,0.009,0.305,0.506
8,2024,North Carolina,1,1,64,,Howard,16,64,,...,-23.59,23.59,18.7,0.64,57.091,22.285,29.411765,0.072,7.089,0.733
9,2024,Mississippi St.,8,1,64,,Michigan St.,9,64,,...,-17.693,17.693,0.59,-0.021,-24.13,-2.042,4.188948,-0.016,-1.421,-0.378


In [252]:
bracket2024 = sim_bracket(r64_2024_matchups, xgb_pipeline)

In [259]:
bracket2024[bracket2024.current_round == 2][["team_1", "team_2", "prediction", "win probability"]]

Unnamed: 0,team_1,team_2,prediction,win probability
62,Connecticut,Houston,0,0.293639


In [254]:
bracket2024.to_csv("BracketPredictions2024.csv")

I will now enter this bracket into various sites (ESPN CBS, etc) and see how it does when the tournament ends in a month.