# Creating the Perfect Bracket

There's nothing quite like the most riveting basketball event of the year: NCAA March Madness. The 64-team tournament consists of 4 regions, each with 16 teams ranked independently of the other regions according to their regular season performance. Each team attempts to win 6 successive games in order to emerge victorious as the NCAA national champion.

Perhaps what contributes most to the intrigue of March Madness is filling out a bracket. "The American Gaming Association estimated in 2019 that 40 million Americans filled out a combined 149 million brackets for a collective wager of \$4.6 billion." It's important to note that even a single bet can be quite lucrative, particularly when an upset occurs (when a lower-ranking underdog beats a higher-ranking favorite). For example, the first-ever upset of a #1 seed by a #16 seed occurred in the 2018 NCAA tournament. In that game "a \$100 bet paid out \$2,500", which translates to American betting odds of +2500!
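
To make the odds arithmetic concrete, here is a minimal sketch of how American odds map to profit. It assumes the quoted \$2,500 is profit on top of the returned \$100 stake; the function name `american_odds_profit` is my own and not part of any library.

```python
def american_odds_profit(stake: float, odds: int) -> float:
    """Profit (excluding the returned stake) for a winning bet at American odds."""
    if odds > 0:
        # Positive odds: profit earned per $100 staked
        return stake * odds / 100
    if odds < 0:
        # Negative odds: amount that must be staked to earn $100 of profit
        return stake * 100 / abs(odds)
    raise ValueError("American odds cannot be zero")


print(american_odds_profit(100, 2500))  # 2500.0
print(american_odds_profit(100, -200))  # 50.0
```

Under that assumption, a \$100 bet at +2500 earns \$2,500, which matches the quoted payout.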

<br>
*All quotations are taken from the following article: https://www.gobankingrates.com/money/business/money-behind-march-madness-ncaa-basketball-tournament/*

### Problem Structure

The purpose of this personal project is to perform supervised classification on March Madness data to more accurately predict the outcomes of NCAA tournament games--particularly the occurrence of upsets. Filling out more accurate brackets than other participants would improve the chances of earning the kinds of profits mentioned above.

# Data Fetching

### Perceived Predictors

Naturally, it will be vitally important to scrape available data that is pertinent to deciding the outcome of an NCAA March Madness game between any two given teams. To successfully do so, we must break down what are generally the most influential elements of a basketball team's success.

<br>Overall team performance during the regular season is generally a good indicator of how a team will perform in March Madness. This would be captured by statistics, both basic and advanced, such as the following:
**<br>Season Record (%)
<br>Conference Record (%); could be important given that the tournament is split into regions
<br>Regular Season Record vs. Tourney Opponent (%); set to theoretical discrete probability of 50% if no such matchups exist 
<br>Strength of Schedule (SOS); measures the difficulty of the teams played (higher number = greater difficulty)
<br>Top 25 Ranking (boolean); considered a consensus top-tier team
<br>Shots Made per Game (FG, 3P, FT)
<br>Point Differential per Game; measures how dominant/unsuccessful you are at outscoring your opponent on average
<br>Misc. Team Stats per Game (Rebounds, Assists, Blocks, etc.)
**
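
As a rough illustration of the per-game statistics listed above, the sketch below derives point differential per game from season totals. It assumes `Tm.` and `Opp.` hold total points scored and allowed and `G` holds games played; the two rows are copied from the 2019 table shown later in this notebook, and the `PtDiff/G` column name is my own.

```python
import pandas as pd

# Two rows copied from the 2019 basic school stats output (season totals)
stats = pd.DataFrame({
    "School": ["Abilene Christian", "Auburn"],
    "G": [34, 40],          # games played
    "Tm.": [2502, 3188],    # assumed: total points scored
    "Opp.": [2161, 2750],   # assumed: total points allowed
})

# Average margin per game: positive means the team outscores its opponents
stats["PtDiff/G"] = (stats["Tm."] - stats["Opp."]) / stats["G"]

print(stats[["School", "PtDiff/G"]].round(2))
```

The same division by `G` would turn the other season totals (rebounds, assists, blocks, and so on) into per-game figures.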

<br>However, March Madness is well known for its Cinderella stories--instances where average or underachieving regular season teams make big, unexpected runs in the tournament. Because of this, **it would likely be beneficial to also use team performance during the tournament as an indicator. The difficulty here will be transforming the data--which would fall into virtually the same categories as the data scraped for the regular season--in such a way that data leakage is avoided.**
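
One possible way to avoid leakage, sketched on made-up numbers, is to describe each tournament game only with statistics from that team's earlier rounds. The frame, its column names, and the scores below are all hypothetical; this is not necessarily the transformation this project will end up using.

```python
import pandas as pd

# Hypothetical per-game tournament log (values are illustrative only)
games = pd.DataFrame({
    "team": ["A", "A", "A", "B", "B"],
    "round": [1, 2, 3, 1, 2],
    "points": [70, 80, 60, 65, 75],
})

games = games.sort_values(["team", "round"])

# shift(1) drops the current game from its own feature, so each row only
# sees rounds that were already played; expanding().mean() then averages them
games["prior_avg_points"] = (
    games.groupby("team")["points"]
    .transform(lambda s: s.shift(1).expanding().mean())
)

print(games)
```

First-round rows come out as `NaN` because no earlier tournament game exists; those could fall back to regular season values.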

<br>It's important to note that in the NCAA, more so than the NBA, experienced coaches can have just as much of an impact on a game's outcome as the players themselves. Hence, it's reasonable to assume that the following statistics could also be solid indicators:
**<br>Coach March Madness Appearances
<br>Coach Sweet Sixteen Appearances
<br>Coach Final Four Appearances
<br>Coach Championships Won
**

<br>And last but certainly not least, we need the data for the structure of the tournaments themselves:
**<br>Favorite Seed
<br>Underdog Seed
<br>Round Number (1-6)
<br>Game Outcome (boolean); did the underdog upset the favorite?
**
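
The favorite seed, underdog seed, and upset flag can be derived from a game's two seeds and scores. The sketch below assumes columns named `Seed`, `Score`, `Seed.1`, and `Score.1`, as in the scraped bracket table shown later in this notebook; the three rows are copied from that output, and the new column names are my own.

```python
import pandas as pd

# Three 2019 games copied from the scraped bracket output
games = pd.DataFrame({
    "Seed": [1, 2, 1],
    "Score": [85, 51, 69],
    "Seed.1": [3, 3, 3],
    "Score.1": [77, 61, 75],
})

# Lower seed number = favorite. When seeds are equal there is no true
# favorite; this sketch arbitrarily treats the first-listed team as one.
first_is_fav = games["Seed"] <= games["Seed.1"]

games["Fav_Seed"] = games[["Seed", "Seed.1"]].min(axis=1)
games["Dog_Seed"] = games[["Seed", "Seed.1"]].max(axis=1)

fav_score = games["Score"].where(first_is_fav, games["Score.1"])
dog_score = games["Score.1"].where(first_is_fav, games["Score"])

# Target: did the underdog outscore the favorite?
games["Upset"] = dog_score > fav_score

print(games[["Fav_Seed", "Dog_Seed", "Upset"]])
```

Games between equal seeds (possible from the Final Four onward) probably deserve separate handling rather than the arbitrary tie-break used here.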

### Links

NCAA Upsets Breakdown - https://www.ncaa.com/news/basketball-men/bracketiq/2018-03-13/heres-how-pick-march-madness-upsets-according-data
<br>March Madness Bracket Data - https://apps.washingtonpost.com/sports/search/
<br>Regular Season, Coaches, & Ranks Data - https://www.sports-reference.com/cbb/

In [1]:
import pandas as pd
from data_fetch import get_team_data, get_rankings_data, get_coach_data

Team Regular Season

In [2]:
season_basic_df = get_team_data(url="https://www.sports-reference.com/cbb/seasons/2019-school-stats.html",
                                    attrs={'id': 'basic_school_stats'})
season_basic_df.head()

Unnamed: 0,Rk,School,G,W,L,W-L%,SRS,SOS,Unnamed: 8,W.1,...,FT,FTA,FT%,ORB,TRB,AST,STL,BLK,TOV,PF
0,1,Abilene Christian NCAA,34,27,7,0.794,-1.91,-7.34,,14,...,457,642,0.712,325,1110,525,297,93,407,635
1,2,Air Force,32,14,18,0.438,-4.28,0.24,,8,...,341,503,0.678,253,1077,434,154,57,423,543
2,3,Akron,33,17,16,0.515,4.86,1.09,,8,...,380,539,0.705,312,1204,399,185,106,388,569
3,4,Alabama A&M,32,5,27,0.156,-19.23,-8.38,,4,...,284,453,0.627,314,1032,385,234,50,487,587
4,5,Alabama-Birmingham,35,20,15,0.571,0.36,-1.52,,10,...,424,630,0.673,367,1279,401,218,82,399,578


In [3]:
season_adv_df = get_team_data(url="https://www.sports-reference.com/cbb/seasons/2019-advanced-school-stats.html",
                                    attrs={'id': 'adv_school_stats'})
season_adv_df.head()

Unnamed: 0,Rk,School,G,W,L,W-L%,SRS,SOS,Unnamed: 8,W.1,...,3PAr,TS%,TRB%,AST%,STL%,BLK%,eFG%,TOV%,ORB%,FT/FGA
0,1,Abilene Christian NCAA,34,27,7,0.794,-1.91,-7.34,,14,...,0.345,0.565,50.3,58.5,12.9,8.0,0.535,15.5,28.8,0.239
1,2,Air Force,32,14,18,0.438,-4.28,0.24,,8,...,0.4,0.541,50.1,54.1,7.0,5.8,0.517,17.4,23.7,0.192
2,3,Akron,33,17,16,0.515,4.86,1.09,,8,...,0.477,0.515,48.2,50.1,8.2,8.9,0.485,15.0,25.3,0.195
3,4,Alabama A&M,32,5,27,0.156,-19.23,-8.38,,4,...,0.32,0.479,47.1,52.3,10.7,4.7,0.457,19.4,27.6,0.157
4,5,Alabama-Birmingham,35,20,15,0.571,0.36,-1.52,,10,...,0.346,0.536,52.7,44.3,9.3,7.5,0.511,14.8,30.4,0.212


Tournament Game Data

In [4]:
mm_games_df = get_team_data(url=("https://apps.washingtonpost.com/sports/search/?pri_school_id=&pri_conference=&pri_coach"
                                 "=&pri_seed_from=1&pri_seed_to=16&pri_power_conference=&pri_bid_type=&opp_school_id"
                                 "=&opp_conference=&opp_coach=&opp_seed_from=1&opp_seed_to=16&opp_power_conference"
                                 "=&opp_bid_type=&game_type=7&from=2019&to=2020&submit="), 
                            attrs={'class': 'search-results'},
                            header=0)
mm_games_df.head()

Unnamed: 0,Year,Round,Seed,Team,Score,Seed.1,Team.1,Score.1
0,2019,National ChampionshipNational Championship,1,Virginia Virginia,85,3,Texas Tech Texas Tech,77
1,2019,Final FourFinal Four,1,Virginia Virginia,63,4,Auburn Auburn,62
2,2019,Final FourFinal Four,2,Michigan State Michigan State,51,3,Texas Tech Texas Tech,61
3,2019,Elite EightElite Eight,1,Gonzaga Gonzaga,69,3,Texas Tech Texas Tech,75
4,2019,Elite EightElite Eight,1,Virginia Virginia,80,3,Purdue Purdue,75
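
The scraped `Round` and `Team` values come back doubled ("Final FourFinal Four", "Virginia Virginia"). A minimal cleanup sketch follows; the helper name `undouble` is hypothetical, and it only handles the two patterns visible in the output above.

```python
import pandas as pd


def undouble(text: str) -> str:
    """Return one copy of `text` if it is the same string repeated twice."""
    text = text.strip()
    half = len(text) // 2
    # Pattern 1: "Final FourFinal Four" (no separator, even length)
    if len(text) % 2 == 0 and text[:half] == text[half:]:
        return text[:half]
    # Pattern 2: "Virginia Virginia" (single space separator, odd length)
    if len(text) % 2 == 1 and text[half] == " " and text[:half] == text[half + 1:]:
        return text[:half]
    return text


# Toy frame with values copied from the scraped output
sample = pd.DataFrame({
    "Round": ["National ChampionshipNational Championship", "Final FourFinal Four"],
    "Team": ["Virginia Virginia", "Michigan State Michigan State"],
    "Team.1": ["Texas Tech Texas Tech", "Texas Tech Texas Tech"],
})

text_cols = ["Round", "Team", "Team.1"]
sample[text_cols] = sample[text_cols].apply(lambda col: col.map(undouble))

print(sample)
```

One caveat: a legitimately repeated name would also be collapsed, so it is worth checking the cleaned school names against the regular season table.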


Team Rankings

In [5]:
rankings_df = get_rankings_data(url="https://www.sports-reference.com/cbb/seasons/2019-ratings.html")        
rankings_df.head()

Unnamed: 0,Team,Top_25
2,Gonzaga,1
3,Duke,1
4,Virginia,1
5,Michigan State,1
6,North Carolina,1


Coaches

In [6]:
coaches_df = get_coach_data(url="https://www.sports-reference.com/cbb/seasons/2019-coaches.html")        
coaches_df.head()

Unnamed: 0,Coach_Team,MM,S16,F4,Champs
2,Abilene Christian,1.0,,,
3,Air Force,,,,
4,Akron,3.0,1.0,,
5,Alabama,1.0,,,
6,Alabama A&M,,,,


# Data Cleaning

Merge Regular Season Data

In [7]:
season_basic_df.columns

Index(['Rk', 'School', 'G', 'W', 'L', 'W-L%', 'SRS', 'SOS', 'Unnamed: 8',
       'W.1', 'L.1', 'Unnamed: 11', 'W.2', 'L.2', 'Unnamed: 14', 'W.3', 'L.3',
       'Unnamed: 17', 'Tm.', 'Opp.', 'Unnamed: 20', 'MP', 'FG', 'FGA', 'FG%',
       '3P', '3PA', '3P%', 'FT', 'FTA', 'FT%', 'ORB', 'TRB', 'AST', 'STL',
       'BLK', 'TOV', 'PF'],
      dtype='object')

In [8]:
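# Spacer columns that pandas labels 'Unnamed: N' carry no data
useless_feats = [col for col in season_basic_df.columns if 'Unnamed' in col]
# 'W.1'/'L.1' through 'W.3'/'L.3' are the additional win/loss split columns
useless_feats.extend([col for col in season_basic_df.columns if ('W.' in col) or ('L.' in col)])
# Rank is just the row order and minutes played is not a performance stat
useless_feats.extend(['Rk', 'MP'])

# Columns treated as redundant with the ones that are kept
# (e.g. wins/losses vs. G and W-L%, attempts vs. makes and percentages)
lin_dep_feats = ['W', 'L', 'SRS', 'FGA', '3PA', 'FTA']

feat_drops = useless_feats + lin_dep_feats

season_basic_df.drop(feat_drops, axis=1, inplace=True)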
season_basic_df.head()

Unnamed: 0,School,G,W-L%,SOS,Tm.,Opp.,FG,FG%,3P,3P%,FT,FT%,ORB,TRB,AST,STL,BLK,TOV,PF
0,Abilene Christian NCAA,34,0.794,-7.34,2502,2161,897,0.469,251,0.38,457,0.712,325,1110,525,297,93,407,635
1,Air Force,32,0.438,0.24,2179,2294,802,0.452,234,0.329,341,0.678,253,1077,434,154,57,423,543
2,Akron,33,0.515,1.09,2271,2107,797,0.409,297,0.32,380,0.705,312,1204,399,185,106,388,569
3,Alabama A&M,32,0.156,-8.38,1938,2285,736,0.407,182,0.315,284,0.627,314,1032,385,234,50,487,587
4,Alabama-Birmingham,35,0.571,-1.52,2470,2370,906,0.452,234,0.337,424,0.673,367,1279,401,218,82,399,578


In [9]:
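# Drop rows with missing values (e.g. header rows repeated inside the scraped table)
season_basic_df.dropna(inplace=True)

# Tournament teams are flagged with an 'NCAA' suffix on the school name
ncaa_team_basic_df = season_basic_df[season_basic_df['School'].str.contains('NCAA')]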
ncaa_team_basic_df.head()

Unnamed: 0,School,G,W-L%,SOS,Tm.,Opp.,FG,FG%,3P,3P%,FT,FT%,ORB,TRB,AST,STL,BLK,TOV,PF
0,Abilene Christian NCAA,34,0.794,-7.34,2502,2161,897,0.469,251,0.38,457,0.712,325,1110,525,297,93,407,635
11,Arizona State NCAA,34,0.676,6.04,2638,2494,899,0.447,240,0.336,600,0.68,399,1351,459,213,109,466,675
18,Auburn NCAA,40,0.75,10.92,3188,2750,1097,0.45,454,0.377,540,0.711,457,1369,572,369,190,466,731
23,Baylor NCAA,34,0.588,9.26,2442,2302,869,0.442,274,0.341,430,0.677,450,1281,473,209,159,446,636
24,Belmont NCAA,33,0.818,-2.6,2868,2439,1042,0.498,343,0.372,441,0.737,286,1275,645,220,125,376,509


In [10]:
season_adv_df = pd.concat([season_adv_df['School'], season_adv_df.iloc[:, -13:]], axis=1)
season_adv_df.head()

Unnamed: 0,School,Pace,ORtg,FTr,3PAr,TS%,TRB%,AST%,STL%,BLK%,eFG%,TOV%,ORB%,FT/FGA
0,Abilene Christian NCAA,67.2,108.6,0.336,0.345,0.565,50.3,58.5,12.9,8.0,0.535,15.5,28.8,0.239
1,Air Force,67.4,99.5,0.283,0.4,0.541,50.1,54.1,7.0,5.8,0.517,17.4,23.7,0.192
2,Akron,68.5,100.1,0.277,0.477,0.515,48.2,50.1,8.2,8.9,0.485,15.0,25.3,0.195
3,Alabama A&M,67.5,88.7,0.25,0.32,0.479,47.1,52.3,10.7,4.7,0.457,19.4,27.6,0.157
4,Alabama-Birmingham,66.6,105.2,0.315,0.346,0.536,52.7,44.3,9.3,7.5,0.511,14.8,30.4,0.212


In [11]:
ncaa_all_stats_df = pd.merge(ncaa_team_basic_df, season_adv_df, on='School')
# Strip the trailing 5-character ' NCAA' marker so names match the other tables
ncaa_all_stats_df['School'] = ncaa_all_stats_df['School'].apply(lambda school: school[:-5])
ncaa_all_stats_df.head()

Unnamed: 0,School,G,W-L%,SOS,Tm.,Opp.,FG,FG%,3P,3P%,...,3PAr,TS%,TRB%,AST%,STL%,BLK%,eFG%,TOV%,ORB%,FT/FGA
0,Abilene Christian,34,0.794,-7.34,2502,2161,897,0.469,251,0.38,...,0.345,0.565,50.3,58.5,12.9,8.0,0.535,15.5,28.8,0.239
1,Arizona State,34,0.676,6.04,2638,2494,899,0.447,240,0.336,...,0.355,0.543,52.6,51.1,8.5,9.6,0.506,16.1,31.3,0.298
2,Auburn,40,0.75,10.92,3188,2750,1097,0.45,454,0.377,...,0.494,0.569,49.2,52.1,13.2,15.6,0.543,14.3,31.6,0.221
3,Baylor,34,0.588,9.26,2442,2302,869,0.442,274,0.341,...,0.408,0.538,54.1,54.4,9.2,13.8,0.512,16.4,37.6,0.219
4,Belmont,33,0.818,-2.6,2868,2439,1042,0.498,343,0.372,...,0.44,0.603,52.2,61.9,8.9,9.1,0.58,13.7,25.1,0.211
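
Since an inner merge silently drops schools whose names do not match, a quick check can be useful. The sketch below uses the `indicator` and `validate` options of `pd.merge` on a toy pair of frames; the Auburn row is deliberately left out of the advanced table to show what a mismatch looks like, and the regex-based suffix stripping is an alternative I am suggesting, not what the cell above does.

```python
import pandas as pd

# Toy frames; the values shown are copied from the 2019 outputs above
basic = pd.DataFrame({
    "School": ["Abilene Christian NCAA", "Auburn NCAA"],
    "G": [34, 40],
})
adv = pd.DataFrame({
    "School": ["Abilene Christian NCAA"],  # Auburn omitted on purpose
    "Pace": [67.2],
})

# how="left" keeps every tournament team; indicator=True adds a "_merge"
# column; validate raises if a school name is duplicated on either side
merged = pd.merge(basic, adv, on="School", how="left",
                  indicator=True, validate="one_to_one")

unmatched = merged.loc[merged["_merge"] == "left_only", "School"].tolist()
print("Schools missing advanced stats:", unmatched)

# Alternative to slicing off the last 5 characters: remove the marker by pattern
merged["School"] = merged["School"].str.replace(r"\s+NCAA$", "", regex=True)
print(merged[["School", "G", "Pace"]])
```

An empty `unmatched` list on the real data would confirm that the inner merge above lost no tournament teams.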


Merge Coach Data
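
This step is still to be written. As a placeholder, here is one way it might look, using toy frames with the column names from `coaches_df` above. Treating a missing coach count as zero appearances is my assumption about what the blanks in the scraped table mean.

```python
import pandas as pd

# Toy frames; values copied from the outputs shown earlier in the notebook
teams = pd.DataFrame({"School": ["Abilene Christian"], "G": [34]})
coaches = pd.DataFrame({
    "Coach_Team": ["Abilene Christian", "Air Force", "Akron"],
    "MM": [1.0, None, 3.0],
    "S16": [None, None, 1.0],
    "F4": [None, None, None],
    "Champs": [None, None, None],
})

coach_cols = ["MM", "S16", "F4", "Champs"]

# Assumption: a blank count means the coach has no such appearances
coaches = (
    coaches.fillna({col: 0 for col in coach_cols})
    .astype({col: int for col in coach_cols})
)

# Left merge keeps every tournament team even if no coach row matches
team_coach_df = teams.merge(
    coaches, left_on="School", right_on="Coach_Team", how="left"
).drop(columns="Coach_Team")

print(team_coach_df)
```

Any team left without a coach match would show up as `NaN` in the coach columns and is worth inspecting for a name mismatch.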



# Data Exploration (EDA)

### Questions of Interest

As any good data scientist should, I hope to answer a few questions in my EDA:

1) Does the data have any null values? Are these values missing at random?

2) What is a bracket's accuracy when always picking the majority class (base rate: favorite beats underdog)?

3) How often do upsets occur in a given year's March Madness? 

4) Which seeding combinations are the most likely to produce upsets?

5) What is the win percentage of each seed in the tournament?
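
Several of these questions reduce to simple aggregations once each game has a favorite seed, an underdog seed, and an upset flag. The sketch below runs them on five 2019 games taken from the scraped output earlier in the notebook; the column names `Fav_Seed`, `Dog_Seed`, and `Upset` are assumed, and five games are far too few to draw conclusions from.

```python
import pandas as pd

# Five 2019 games from the scraped output, already labeled by hand
games = pd.DataFrame({
    "Fav_Seed": [1, 1, 2, 1, 1],
    "Dog_Seed": [3, 4, 3, 3, 3],
    "Upset": [False, False, True, True, False],
})

# Question 1: null values per column
null_counts = games.isna().sum()

# Questions 2 and 3: upset rate, and per-game accuracy of always picking the favorite
upset_rate = games["Upset"].mean()
baseline_accuracy = 1 - upset_rate

# Question 4: upset frequency and game count for each seed pairing
upset_by_pairing = (
    games.groupby(["Fav_Seed", "Dog_Seed"])["Upset"].agg(["mean", "size"])
)

print(null_counts)
print(f"Upset rate: {upset_rate:.2f}, favorite-only accuracy: {baseline_accuracy:.2f}")
print(upset_by_pairing)
```

Note that this baseline is a per-game figure; whole-bracket accuracy would be lower because an early wrong pick carries into later rounds.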

### Visualizations

In [12]:
import matplotlib.pyplot as plt
import seaborn as sns

# Feature Engineering

# Feature Selection

# Model Selection

# Model Evaluation

# Conclusions