# EDA for March Machine Learning Mania

### Goal:
We want to create a model that provides predictions in the format `YEAR_TEAM1ID_TEAM2ID,WINCHANCE`,

where TEAM1ID is the lower of the two team IDs and WINCHANCE is a value from 0 to 1 representing the probability that the team with the lower team ID will win.

"For example, '2025_1101_1102' indicates a hypothetical matchup between team 1101 and 1102 in the year 2025. You must predict the probability that the team with the lower TeamId beats the team with the higher TeamId."

Output should look like this:
```
ID,Pred
2025_1101_1102,0.5
2025_1101_1103,0.5
2025_1101_1104,0.5
...
```
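
To work with these rows, the `ID` string has to be split back into its parts. Here is a minimal sketch of that, assuming the layout shown above; the helper name `split_ids` and the column names `LowID` / `HighID` are my own labels, not something from the competition files.

```python
import pandas as pd

# Toy rows in the sample-submission layout shown above.
sample = pd.DataFrame({
    "ID": ["2025_1101_1102", "2025_1101_1103", "2025_1101_1104"],
    "Pred": [0.5, 0.5, 0.5],
})

def split_ids(df):
    """Return a copy of df with Season, LowID and HighID parsed from ID.

    Hypothetical helper: the column names are illustrative only.
    """
    parts = df["ID"].str.split("_", expand=True).astype(int)
    parts.columns = ["Season", "LowID", "HighID"]
    return pd.concat([df, parts], axis=1)

parsed = split_ids(sample)
print(parsed)
```

`Pred` is then read as the probability that `LowID` beats `HighID`.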

### Data files:
They give you a suffocating amount of data, including: 
- Basic Team & Season Data
- Detailed Game Statistics
- Geographical Data
- Ranking & Rating Systems
- Tournament Bracket Structure
- Supplementary Data
- Data for Submissions
- Miscellaneous

GPT summarized them in the blob below. The key point is that you should probably be able to build a basic model with the basic team and season data. 

Based on historical matches between teams, how will they perform in the future?

The first step will be to get the data into a basic form.

#### **Basic Team & Season Data**
- **MTeams.csv / WTeams.csv** – Lists all NCAA teams with unique Team IDs and historical Division-I participation details.
- **MSeasons.csv / WSeasons.csv** – Details past seasons, including season start dates and region assignments.
- **MNCAATourneySeeds.csv / WNCAATourneySeeds.csv** – Provides historical NCAA tournament seedings for teams since 1985 (men) and 1998 (women).
- **MRegularSeasonCompactResults.csv / WRegularSeasonCompactResults.csv** – Contains game results (winner, loser, score, location) for regular season games since 1985 (men) and 1998 (women).
- **MNCAATourneyCompactResults.csv / WNCAATourneyCompactResults.csv** – Similar to above but specific to NCAA tournament games.

#### **Detailed Game Statistics**
- **MRegularSeasonDetailedResults.csv / WRegularSeasonDetailedResults.csv** – Contains extended game stats like field goals, rebounds, assists, etc., since 2003 (men) and 2010 (women).
- **MNCAATourneyDetailedResults.csv / WNCAATourneyDetailedResults.csv** – Similar to above but specific to NCAA tournament games.

#### **Geographical Data**
- **Cities.csv** – Lists cities where games were played, including city IDs, names, and state abbreviations.
- **MGameCities.csv / WGameCities.csv** – Maps each game to its city, starting from 2010.

#### **Ranking & Rating Systems**
- **MMasseyOrdinals.csv** – Weekly rankings of men's teams across various rating systems since 2003.

#### **Tournament Bracket Structure**
- **MNCAATourneySlots.csv / WNCAATourneySlots.csv** – Defines how teams progress through the tournament based on seed matchups.
- **MNCAATourneySeedRoundSlots.csv** – Maps tournament seeds to their expected bracket slots and game rounds (men only).

#### **Supplementary Data**
- **MTeamCoaches.csv** – Lists head coaches for teams per season, including mid-season changes.
- **Conferences.csv** – Contains NCAA conference names and abbreviations.
- **MTeamConferences.csv / WTeamConferences.csv** – Tracks which teams belonged to which conferences each season.
- **MConferenceTourneyGames.csv / WConferenceTourneyGames.csv** – Identifies games from conference tournaments before the NCAA tournament.
- **MSecondaryTourneyTeams.csv / WSecondaryTourneyTeams.csv** – Lists teams that participated in secondary postseason tournaments (e.g., NIT).
- **MSecondaryTourneyCompactResults.csv / WSecondaryTourneyCompactResults.csv** – Contains results for games in secondary tournaments.

#### **Data for Submissions**
- **SampleSubmissionStage1.csv / SampleSubmissionStage2.csv** – Example submission files showing expected format.
- **SeedBenchmarkStage1.csv** – Baseline model predictions based on seed matchups.

#### **Miscellaneous**
- **MTeamSpellings.csv / WTeamSpellings.csv** – Helps map external team name variations to standardized Team IDs.


## Organize data

In [32]:
import pandas as pd
import yaml

with open('config.yaml', 'r') as file:
    config_file = yaml.safe_load(file)
data_dir = config_file.get("data_dir")

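def peek(file_name):
    """Return the first five rows of <data_dir>/<file_name>.csv."""
    # nrows=5 reads only the rows we display instead of loading the whole file.
    return pd.read_csv(f"{data_dir}/{file_name}.csv", nrows=5)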

In [None]:
# Straightforward, mostly a lookup table.
peek("MTeams")

Unnamed: 0,TeamID,TeamName,FirstD1Season,LastD1Season
0,1101,Abilene Chr,2014,2025
1,1102,Air Force,1985,2025
2,1103,Akron,1985,2025
3,1104,Alabama,1985,2025
4,1105,Alabama A&M,2000,2025


In [None]:
# A bit confusing. It will take some work to get the seeds lined up, but maybe they've already done it for us.
peek("MNCAATourneySeeds")

Unnamed: 0,Season,Seed,TeamID
0,1985,W01,1207
1,1985,W02,1210
2,1985,W03,1228
3,1985,W04,1260
4,1985,W05,1374
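
One way to line the seeds up is to split the `Seed` string into a region letter and a numeric seed. The sketch below assumes the format is one letter, two digits, and an optional lowercase suffix for play-in teams (e.g. `W16a`); the last toy row and the column names `Region`, `SeedNum`, and `PlayIn` are made up for illustration.

```python
import pandas as pd

# Two rows from the output above plus one made-up row showing a suffix.
seeds = pd.DataFrame({
    "Season": [1985, 1985, 1985],
    "Seed": ["W01", "W02", "W16a"],  # "W16a" is a hypothetical example row
    "TeamID": [1207, 1210, 1101],    # 1101 is paired with it for illustration only
})

# Named groups become the column names of the extracted frame.
pattern = r"^(?P<Region>[A-Z])(?P<SeedNum>\d{2})(?P<PlayIn>[a-z]?)$"
parsed = seeds["Seed"].str.extract(pattern)
parsed["SeedNum"] = parsed["SeedNum"].astype(int)

seeds = pd.concat([seeds, parsed], axis=1)
print(seeds)
```

With `SeedNum` as an integer, a seed difference between two teams becomes a simple subtraction.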


In [41]:
## So this has a lot of potential.
# Point differentials between teams could give a better estimate of win probability.
# The model should also see how "good" a team is in a given season, based on this differential.
peek("MRegularSeasonCompactResults")

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT
0,1985,20,1228,81,1328,64,N,0
1,1985,25,1106,77,1354,70,H,0
2,1985,25,1112,63,1223,56,H,0
3,1985,25,1165,70,1432,54,H,0
4,1985,25,1192,86,1447,74,H,0
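
A rough sketch of the point-differential idea: turn each game into two rows (one per team) and average the margin per team and season. This uses the five rows shown above as toy data; the function name `team_season_margin` and the `Margin` / `AvgMargin` columns are my own, and this is only one of several reasonable ways to measure how good a team is.

```python
import pandas as pd

# The five games from the output above (WLoc and NumOT left out).
games = pd.DataFrame({
    "Season": [1985] * 5,
    "DayNum": [20, 25, 25, 25, 25],
    "WTeamID": [1228, 1106, 1112, 1165, 1192],
    "WScore": [81, 77, 63, 70, 86],
    "LTeamID": [1328, 1354, 1223, 1432, 1447],
    "LScore": [64, 70, 56, 54, 74],
})

def team_season_margin(df):
    """Average point margin per (Season, TeamID). Hypothetical helper."""
    winners = pd.DataFrame({
        "Season": df["Season"],
        "TeamID": df["WTeamID"],
        "Margin": df["WScore"] - df["LScore"],  # positive for the winner
    })
    losers = pd.DataFrame({
        "Season": df["Season"],
        "TeamID": df["LTeamID"],
        "Margin": df["LScore"] - df["WScore"],  # negative for the loser
    })
    long = pd.concat([winners, losers], ignore_index=True)
    return (
        long.groupby(["Season", "TeamID"])["Margin"]
        .agg(Games="count", AvgMargin="mean")
        .reset_index()
    )

margins = team_season_margin(games)
print(margins)
```

On the full file each team would have many games per season, so `AvgMargin` becomes a usable per-season strength feature.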


In [None]:
# Exact same as above, but for the tournament.
# Still not sure how to use DayNum to work out which tournament round a game belongs to.
# I'd like to include some "heating up" factor, because it seems teams go on runs, but maybe that's just a function of media.
# Aspirational, if anything.
peek("MNCAATourneyCompactResults")

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT
0,1985,136,1116,63,1234,54,N,0
1,1985,136,1120,59,1345,58,N,0
2,1985,136,1207,68,1250,43,N,0
3,1985,136,1229,58,1425,55,N,0
4,1985,136,1242,49,1325,38,N,0
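
One cheap stand-in for the "heating up" idea is a team's win rate over its last few games, ordered by `DayNum`. The sketch below is just that, on made-up games with made-up team IDs (1, 2, 3); `recent_form`, `last_n`, and `RecentWinRate` are hypothetical names, and whether this feature actually helps is untested.

```python
import pandas as pd

# Made-up games: team IDs 1, 2, 3 are placeholders, not real TeamIDs.
games = pd.DataFrame({
    "Season": [2024] * 4,
    "DayNum": [10, 20, 30, 40],
    "WTeamID": [1, 1, 2, 1],
    "LTeamID": [2, 3, 1, 2],
})

def recent_form(df, last_n=3):
    """Win rate over each team's last `last_n` games in a season."""
    wins = (
        df[["Season", "DayNum", "WTeamID"]]
        .rename(columns={"WTeamID": "TeamID"})
        .assign(Won=1)
    )
    losses = (
        df[["Season", "DayNum", "LTeamID"]]
        .rename(columns={"LTeamID": "TeamID"})
        .assign(Won=0)
    )
    long = pd.concat([wins, losses], ignore_index=True)
    long = long.sort_values(["Season", "TeamID", "DayNum"])
    # Keep only the most recent games per team, then average the outcomes.
    last = long.groupby(["Season", "TeamID"]).tail(last_n)
    return (
        last.groupby(["Season", "TeamID"])["Won"]
        .mean()
        .rename("RecentWinRate")
        .reset_index()
    )

form = recent_form(games, last_n=3)
print(form)
```

Here team 1 played four games, so only its last three (win, loss, win) count.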


In [44]:
# More data than I know what to do with...
peek("MRegularSeasonDetailedResults")

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,WFGM,WFGA,...,LFGA3,LFTM,LFTA,LOR,LDR,LAst,LTO,LStl,LBlk,LPF
0,2003,10,1104,68,1328,62,N,0,27,58,...,10,16,22,10,22,8,18,9,2,20
1,2003,10,1272,70,1393,63,N,0,26,62,...,24,9,20,20,25,7,12,8,6,16
2,2003,11,1266,73,1437,61,N,0,24,58,...,26,14,23,31,22,9,12,2,5,23
3,2003,11,1296,56,1457,50,N,0,18,38,...,22,8,15,17,20,9,19,4,3,23
4,2003,11,1400,77,1208,71,N,0,30,61,...,16,17,27,21,15,12,10,7,1,14


## Model Development!

I liked this approach to splitting up the submission to get the exact teams you'll need: [notebook](https://www.kaggle.com/code/paultimothymooney/simple-starter-notebook-for-march-mania-2025)


In [None]:
import numpy as np
# Combine the men's and women's tournament seeds into one table.
w_seed = pd.read_csv(f'{data_dir}/WNCAATourneySeeds.csv')
m_seed = pd.read_csv(f'{data_dir}/MNCAATourneySeeds.csv')
seed_df = pd.concat([m_seed, w_seed], axis=0).fillna(0.05)
# Sample submission listing the matchup IDs to predict.
submission_df = pd.read_csv(f'{data_dir}/SampleSubmissionStage2.csv')
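
From here, a very simple baseline is to look up both teams' seeds and turn the seed difference into a probability. The sketch below uses toy frames with the same column names as `seed_df` and `submission_df`; the function `seed_baseline`, the linear rule `0.5 + slope * diff`, the `slope=0.03` value, and the 0.05 to 0.95 clipping are all arbitrary assumptions for illustration, not the linked notebook's method.

```python
import pandas as pd

# Toy stand-ins for seed_df and submission_df (values are made up).
seed_df = pd.DataFrame({
    "Season": [2025, 2025, 2025],
    "Seed": ["W01", "W16", "X08"],
    "TeamID": [1101, 1102, 1103],
})
submission_df = pd.DataFrame({
    "ID": ["2025_1101_1102", "2025_1101_1103", "2025_1102_1104"],
    "Pred": [0.5, 0.5, 0.5],
})

def seed_baseline(submission, seeds, slope=0.03):
    """Predict P(lower TeamID wins) from the seed difference (illustrative)."""
    # Characters 1-2 of the Seed string are the numeric seed, e.g. "W01" -> 1.
    seeds = seeds.assign(SeedNum=seeds["Seed"].str[1:3].astype(int))
    seeds = seeds[["Season", "TeamID", "SeedNum"]]

    ids = submission["ID"].str.split("_", expand=True).astype(int)
    ids.columns = ["Season", "LowID", "HighID"]
    out = pd.concat([submission[["ID"]], ids], axis=1)

    low = seeds.rename(columns={"TeamID": "LowID", "SeedNum": "LowSeed"})
    high = seeds.rename(columns={"TeamID": "HighID", "SeedNum": "HighSeed"})
    out = out.merge(low, on=["Season", "LowID"], how="left")
    out = out.merge(high, on=["Season", "HighID"], how="left")

    # A better (smaller) seed for the low-ID team pushes Pred above 0.5.
    diff = out["HighSeed"] - out["LowSeed"]
    pred = (0.5 + slope * diff).clip(0.05, 0.95)
    # Matchups with an unseeded team have no seed difference: fall back to 0.5.
    out["Pred"] = pred.fillna(0.5)
    return out[["ID", "Pred"]]

baseline = seed_baseline(submission_df, seed_df)
print(baseline)
```

Most rows in the real sample submission involve teams that are not seeded, so they would simply get 0.5 under this rule.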