# Model
In this notebook, we:
- Define the structure of our prediction model.
- Generate features that may be useful for predicting NCAA tournament outcomes.
- Try different models and assess their performance.
- Predict on the 2024 March Madness bracket.

## Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import sys

# display 100 rows and 100 columns
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)

## Load Data

In [2]:
# root dirs
root = 'data/'
mroot = 'data/mens/'
wroot = 'data/womens/'


# historical seeds
mseeds = pd.read_csv(mroot + 'MNCAATourneySeeds.csv')
wseeds = pd.read_csv(wroot + 'WNCAATourneySeeds.csv')

# team names
mteams = pd.read_csv(mroot + 'MTeams.csv')
wteams = pd.read_csv(wroot + 'WTeams.csv')

# historical tourney results
mresults = pd.read_csv(mroot + 'MNCAATourneyCompactResults.csv')
wresults = pd.read_csv(wroot + 'WNCAATourneyCompactResults.csv')

# tourney slots
mslots = pd.read_csv(mroot + 'MNCAATourneySlots.csv')
wslots = pd.read_csv(wroot + 'WNCAATourneySlots.csv')

# reg season results
mreg = pd.read_csv(mroot + 'MRegularSeasonCompactResults.csv')
wreg = pd.read_csv(wroot + 'WRegularSeasonCompactResults.csv')

In [None]:
# add a gender column, combine the team tables, drop the originals
mteams['Tournament'] = 'M'
wteams['Tournament'] = 'W'
teams = pd.concat([mteams.drop(['FirstD1Season', 'LastD1Season'], axis=1), wteams], ignore_index=True)
del mteams, wteams

# combine reg season results, drop old tables
regresults = pd.concat([mreg, wreg], ignore_index=True)
del mreg, wreg

# compute score differential
regresults['ScoreDiff'] = regresults['WScore'] - regresults['LScore']

# create a map for team names
team_map = teams.set_index('TeamID')['TeamName']

# map winning and losing team names
regresults['WTeamName'] = regresults['WTeamID'].map(team_map)
regresults['LTeamName'] = regresults['LTeamID'].map(team_map)

## ML Model
In effect, the label is the entire bracket: the predicted winner of each of the 63 games.

There are multiple ways that we could structure this problem:
- Predict each team's tournament performance (number of wins) from its regular-season performance. This could also be paired with details of the tourney matchup.
- Predict the winner of each individual game using features of both teams in the matchup (current-season and all-time data).
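
As a rough sketch of the second framing, each historical game can be turned into one training row with a binary label. The convention below (the lower `TeamID` is always `Team1`, and the label is whether `Team1` won) is an assumption made for illustration, as are the helper name `build_matchups` and the toy data. The `Season` column is also assumed to be present in the results tables.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a compact results table (values are made up).
toy_results = pd.DataFrame({
    "Season":  [2023, 2023, 2023],
    "WTeamID": [1101, 1250, 1101],
    "WScore":  [70, 65, 80],
    "LTeamID": [1300, 1120, 1250],
    "LScore":  [60, 64, 75],
})

def build_matchups(results: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one row per game, ordered by TeamID, with a binary label."""
    w, l = results["WTeamID"], results["LTeamID"]
    margin = results["WScore"] - results["LScore"]  # always positive
    out = pd.DataFrame({
        "Season": results["Season"],
        "Team1": np.minimum(w, l),  # lower TeamID
        "Team2": np.maximum(w, l),  # higher TeamID
    })
    # Label: 1 if the lower-ID team won, else 0.
    out["Team1Win"] = (out["Team1"] == w).astype(int)
    # Score differential from Team1's point of view.
    out["Team1ScoreDiff"] = margin.where(out["Team1Win"] == 1, -margin)
    return out

matchups = build_matchups(toy_results)
print(matchups)
```

Ordering the two teams by ID keeps the label from being trivially 1 for every row, which is what would happen if the winner were always listed first.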

## Feature Creation

A team's seed (and its opponent's seed) already tells us a lot. The seed bakes in the team's performance that season, including recent form and key injuries. Although seeding focuses primarily on team performance, geography is also considered (to minimize travel for teams and fans), so the seed may not be a perfect measure of team strength.
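
To use the seed as a feature, it first has to be turned into a number. The sketch below assumes the `Seed` column holds strings such as `W01` or `X16a` (a region letter, a two-digit seed, and an optional `a`/`b` suffix for play-in teams). The toy rows and the new column names `SeedNum`, `Region`, and `PlayIn` are illustrative, not taken from the data files.

```python
import pandas as pd

# Toy stand-in for a seeds table (values are made up).
toy_seeds = pd.DataFrame({
    "Season": [2023, 2023, 2023],
    "Seed":   ["W01", "X16a", "Y11b"],
    "TeamID": [1101, 1300, 1250],
})

# Numeric seed: the run of digits inside the seed string.
toy_seeds["SeedNum"] = (
    toy_seeds["Seed"].str.extract(r"(\d+)", expand=False).astype(int)
)
# Region: the leading letter.
toy_seeds["Region"] = toy_seeds["Seed"].str[0]
# Play-in flag: assumed to be marked by a trailing 'a' or 'b'.
toy_seeds["PlayIn"] = toy_seeds["Seed"].str.contains(r"[ab]$")

print(toy_seeds)
```

The same pattern could be applied to `mseeds` and `wseeds` if their `Seed` columns follow this format.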

---

Features that may help predict a team's tourney performance (__Pre-bracket__):

__Features__ = features of that team's season, __Labels__ = number of wins in the NCAA tourney
- Wins
- Losses
- Win %
- Home/Road Win %
- Recent wins/losses (say, in the month before the tourney)
- Performance in conference tourney
- Performance in tourney prior year
- Avg/std pts
- Avg/std of opponent pts
- Avg/std pt differential
- Detailed stats
- Num championships in the past
- Play-in team?
- Longest win-streak in season
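
Several of the features above (wins, losses, win %, and the average/standard deviation of points, opponent points, and point differential) can be computed from a compact results table by reshaping it to one row per team per game. This is a minimal sketch on toy data. The helper names `team_game_rows` and `season_features` are hypothetical, and a `Season` column is assumed alongside the `WTeamID`, `WScore`, `LTeamID`, and `LScore` columns.

```python
import pandas as pd

# Toy stand-in for a regular-season compact results table (values are made up).
toy_reg = pd.DataFrame({
    "Season":  [2023, 2023, 2023, 2023],
    "WTeamID": [1101, 1101, 1250, 1300],
    "WScore":  [70, 80, 65, 90],
    "LTeamID": [1250, 1300, 1300, 1101],
    "LScore":  [60, 70, 64, 85],
})

def team_game_rows(results: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: reshape to one row per (team, game)."""
    won = results.rename(columns={
        "WTeamID": "TeamID", "WScore": "Pts", "LScore": "OppPts",
    }).assign(Win=1)
    lost = results.rename(columns={
        "LTeamID": "TeamID", "LScore": "Pts", "WScore": "OppPts",
    }).assign(Win=0)
    cols = ["Season", "TeamID", "Pts", "OppPts", "Win"]
    long = pd.concat([won[cols], lost[cols]], ignore_index=True)
    long["PtDiff"] = long["Pts"] - long["OppPts"]
    return long

def season_features(results: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: per-team, per-season summary features."""
    long = team_game_rows(results)
    feats = long.groupby(["Season", "TeamID"]).agg(
        Games=("Win", "count"),
        Wins=("Win", "sum"),
        WinPct=("Win", "mean"),
        AvgPts=("Pts", "mean"),
        StdPts=("Pts", "std"),
        AvgOppPts=("OppPts", "mean"),
        StdOppPts=("OppPts", "std"),
        AvgPtDiff=("PtDiff", "mean"),
        StdPtDiff=("PtDiff", "std"),
    ).reset_index()
    feats["Losses"] = feats["Games"] - feats["Wins"]
    return feats

features = season_features(toy_reg)
print(features)
```

Filtering the results to a window of days before aggregating would give the "recent form" variant of the same features.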

---

Features after 1st round is complete (__Intra-bracket__):

__Features__ = features of that team's season + bracket stats so far, __Labels__ = Binary W/L or score diff
- Final output of pre-bracket model
- Avg/std pts
- Avg/std of opponent pts
- Avg/std pt differential
- Detailed stats
- Coming off upset?
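
One possible reading of the "coming off upset" feature is: did the team win its previous tourney game as the worse-seeded side? The sketch below follows that reading on toy data. The `WSeed` and `LSeed` columns are hypothetical (they would have to be merged in from the seeds tables), and `Season` and `DayNum` are assumed to be available for ordering games.

```python
import pandas as pd

# Toy tourney games with numeric seeds already attached (values are made up).
toy_games = pd.DataFrame({
    "Season":  [2023, 2023, 2023],
    "DayNum":  [136, 136, 138],
    "WTeamID": [1300, 1101, 1300],
    "LTeamID": [1250, 1400, 1101],
    "WSeed":   [12, 1, 12],
    "LSeed":   [5, 16, 1],
})

# An upset here means the winner had the numerically larger (worse) seed.
toy_games["Upset"] = (toy_games["WSeed"] > toy_games["LSeed"]).astype(int)

# One row per (team, game); only the winner can have "won by upset".
won = toy_games.rename(columns={"WTeamID": "TeamID", "Upset": "WonByUpset"})
lost = toy_games.rename(columns={"LTeamID": "TeamID"}).assign(WonByUpset=0)
cols = ["Season", "DayNum", "TeamID", "WonByUpset"]
long = pd.concat([won[cols], lost[cols]], ignore_index=True)
long = long.sort_values(["Season", "TeamID", "DayNum"]).reset_index(drop=True)

# Feature for each game: outcome flag of the team's previous tourney game.
long["ComingOffUpset"] = (
    long.groupby(["Season", "TeamID"])["WonByUpset"].shift(1, fill_value=0)
)

print(long)
```

Shifting within each team's games keeps the feature free of leakage, since it only uses results from games played before the one being predicted.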