# Creating the Perfect Bracket

There's nothing quite like the most riveting basketball event of the year: NCAA March Madness. The 64-team tournament consists of 4 regions, each with 16 teams ranked independently of the other regions according to their regular season performance. Each team attempts to win 6 successive games in order to emerge victorious as the NCAA national champion.

Perhaps what contributes most to the intrigue of March Madness is filling out a bracket. "The American Gaming Association estimated in 2019 that 40 million Americans filled out a combined 149 million brackets for a collective wager of \$4.6 billion." Even a single bet can be quite lucrative, particularly when an upset occurs (when a lower-ranking underdog beats a higher-ranking favorite). For example, the first-ever upset of a #1 seed by a #16 seed occurred in the 2018 NCAA tournament, when UMBC beat Virginia. In that game "a \$100 bet paid out \$2,500", which translates to American betting odds of +2500!
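
To make the odds arithmetic concrete, here is a minimal sketch of how American odds map to profit and to an implied win probability. The function names are illustrative (they do not come from any library), and the sketch assumes the quoted \$2,500 is profit on top of the returned stake.

```python
def profit_from_american_odds(odds: int, stake: float) -> float:
    """Profit (excluding the returned stake) on a winning bet."""
    if odds == 0:
        raise ValueError("American odds cannot be zero")
    if odds > 0:
        # Positive odds: profit per $100 staked.
        return stake * odds / 100
    # Negative odds: amount that must be staked to win $100.
    return stake * 100 / abs(odds)


def implied_probability(odds: int) -> float:
    """Break-even win probability implied by the odds (ignores the bookmaker's margin)."""
    if odds == 0:
        raise ValueError("American odds cannot be zero")
    if odds > 0:
        return 100 / (odds + 100)
    return abs(odds) / (abs(odds) + 100)


print(profit_from_american_odds(2500, 100))  # 2500.0
print(round(implied_probability(2500), 4))   # 0.0385
```

Read this way, +2500 means the market gave the underdog a break-even win chance of a little under 4%.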

<br>
*All quotations were cited from the following article: https://www.gobankingrates.com/money/business/money-behind-march-madness-ncaa-basketball-tournament/*

### Problem Structure

The purpose of this personal project is to perform supervised classification on March Madness data to more accurately predict the outcome of NCAA tournament games, particularly the occurrence of upsets. More accurate brackets, relative to those of other participants, would increase the chance of earning the kinds of profits mentioned above.

# Data Fetching

### Perceived Predictors

Naturally, it will be vitally important to scrape available data that is pertinent to deciding the outcome of an NCAA March Madness game between any two given teams. To successfully do so, we must break down what are generally the most influential elements of a basketball team's success.

<br>Overall team performance during the regular season is generally a good indicator of how a team will perform in March Madness. This would be captured by statistics, both basic and advanced, such as the following:
**<br>Season Record (%)
<br>Conference Record (%); could be important given that the tournament is split into regions
<br>Regular Season Record vs. Tourney Opponent (%); set to a neutral 50% if the two teams did not meet during the regular season
<br>Strength of Schedule (SOS); measures the difficulty of the teams played (higher number = greater difficulty)
<br>Top 25 Ranking (boolean); considered a consensus top-tier team
<br>Shots Made per Game (FG, 3P, FT)
<br>Point Differential per Game; measures how dominant/unsuccessful you are at outscoring your opponent on average
<br>Misc. Team Stats per Game (Rebounds, Assists, Blocks, etc.)
**
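
A couple of the predictors above reduce to simple arithmetic. The sketch below shows one possible way to compute a record percentage with the 50% fallback and a per-game point differential. The helper names and the point totals in the example are hypothetical; only the 27-7 record comes from the scraped table.

```python
def win_pct(wins: int, games: int, default: float = 0.5) -> float:
    """Win fraction, falling back to `default` when no games were played."""
    if games < 0 or wins < 0 or wins > games:
        raise ValueError("expected 0 <= wins <= games")
    return wins / games if games > 0 else default


def point_diff_per_game(points_for: int, points_against: int, games: int) -> float:
    """Average scoring margin per game (positive means the team outscores opponents)."""
    if games <= 0:
        raise ValueError("games must be positive")
    return (points_for - points_against) / games


# Season record for a 27-7 team (Abilene Christian in the 2019 table).
print(round(win_pct(27, 34), 3))             # 0.794

# No regular-season meetings between two tournament opponents.
print(win_pct(0, 0))                         # 0.5

# Hypothetical season totals, for illustration only.
print(point_diff_per_game(2600, 2400, 32))   # 6.25
```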

<br>However, March Madness is well-known for its Cinderella stories--instances where average or underachieving regular season teams make big, unexpected runs in the tournament. Because of this, **it would likely be beneficial to also have team performance during the tournament as an indicator. The difficulty here will be transforming the data--which would be virtually the same categories as the data scraped for the regular season--in such a way that data leakage is avoided.**
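
One possible way to avoid leakage is to let each game see only the rounds that came before it. The sketch below is an assumption about how that could look, not the project's actual transformation: the `games` table, its column names, and its values are all made up for illustration.

```python
import pandas as pd

# Hypothetical per-team tournament results: one row per team per round played.
games = pd.DataFrame({
    "team": ["A", "A", "A", "B", "B"],
    "round": [1, 2, 3, 1, 2],
    "point_diff": [10, 4, -2, 7, -3],
})

# Order rounds chronologically within each team before looking backwards.
games = games.sort_values(["team", "round"])

# shift(1) drops the current game from its own feature, so the expanding mean
# covers strictly earlier rounds. Round 1 has no history and stays NaN.
games["prior_avg_point_diff"] = (
    games.groupby("team")["point_diff"]
         .transform(lambda s: s.shift(1).expanding().mean())
)

print(games)
```

For team A this yields NaN, 10.0, and 7.0 for rounds 1 to 3: the round-3 feature averages rounds 1 and 2 only. The first-round NaN would still need a fill strategy, for example the team's regular-season value.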

<br>It's important to note that in the NCAA, more so than the NBA, experienced coaches can have just as much of an impact on a game's outcome as the players themselves. Hence, it's reasonable to assume that the following statistics could also be solid indicators:
**<br>Coach March Madness Appearances
<br>Coach Sweet Sixteen Appearances
<br>Coach Final Four Appearances
<br>Coach Championships Won
**

<br>And last but certainly not least, we need the data for the structure of the tournaments themselves:
**<br>Favorite Seed
<br>Underdog Seed
<br>Round Number (1-6)
<br>Game Outcome (boolean); did the underdog upset the favorite?
**
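
As a rough sketch of how one game could be turned into these four fields, the code below labels the lower seed number as the favorite and flags an upset when the underdog wins. The `Matchup` class and `build_matchup` function are hypothetical names, and treating equal seeds (possible from the Final Four onward) as "no upset" is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Matchup:
    favorite_seed: int
    underdog_seed: int
    round_number: int
    upset: bool


def build_matchup(seed_a: int, score_a: int, seed_b: int, score_b: int,
                  round_number: int) -> Matchup:
    """Label one tournament game; a lower seed number marks the favorite."""
    if not 1 <= round_number <= 6:
        raise ValueError("round_number must be between 1 and 6")
    if score_a == score_b:
        raise ValueError("basketball games cannot end in a tie")

    favorite_seed = min(seed_a, seed_b)
    underdog_seed = max(seed_a, seed_b)
    winner_seed = seed_a if score_a > score_b else seed_b

    # Assumption: when both teams carry the same seed, no upset is recorded.
    upset = favorite_seed != underdog_seed and winner_seed == underdog_seed
    return Matchup(favorite_seed, underdog_seed, round_number, upset)


# A #16 seed beating a #1 seed in round 1 (illustrative scores).
print(build_matchup(1, 54, 16, 74, 1))
```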

<br>*Consider including odds? If so, which type (preseason, pre-tourney)?*

### Links

NCAA Upsets Breakdown - https://www.ncaa.com/news/basketball-men/bracketiq/2018-03-13/heres-how-pick-march-madness-upsets-according-data
<br>March Madness Bracket Data - https://apps.washingtonpost.com/sports/search/
<br>Regular Season, Coaches, & Ranks Data - https://www.sports-reference.com/cbb/

In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests
from bs4 import BeautifulSoup

Team Regular Season

In [28]:
reg_season_basic_url = "https://www.sports-reference.com/cbb/seasons/2019-school-stats.html"
reg_season_basic_stats = pd.read_html(reg_season_basic_url, attrs={'id': 'basic_school_stats'}, 
                                header=1, index_col=0)

reg_season_adv_url = "https://www.sports-reference.com/cbb/seasons/2019-advanced-school-stats.html"
reg_season_adv_stats = pd.read_html(reg_season_adv_url, attrs={'id': 'adv_school_stats'}, 
                                header=1, index_col=0)
    
reg_season_basic_df = reg_season_basic_stats[0]
reg_season_adv_df = reg_season_adv_stats[0]

In [29]:
reg_season_basic_df

Rk,School,G,W,L,W-L%,SRS,SOS,Unnamed: 8,W.1,L.1,...,FT,FTA,FT%,ORB,TRB,AST,STL,BLK,TOV,PF
1,Abilene Christian NCAA,34,27,7,.794,-1.91,-7.34,,14,4,...,457,642,.712,325,1110,525,297,93,407,635
2,Air Force,32,14,18,.438,-4.28,0.24,,8,10,...,341,503,.678,253,1077,434,154,57,423,543
3,Akron,33,17,16,.515,4.86,1.09,,8,10,...,380,539,.705,312,1204,399,185,106,388,569
4,Alabama A&M,32,5,27,.156,-19.23,-8.38,,4,14,...,284,453,.627,314,1032,385,234,50,487,587
5,Alabama-Birmingham,35,20,15,.571,0.36,-1.52,,10,8,...,424,630,.673,367,1279,401,218,82,399,578
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
349,Wright State,35,21,14,.600,3.29,-0.89,,13,5,...,510,692,.737,382,1229,484,214,72,402,545
350,Wyoming,32,8,24,.250,-9.75,0.19,,4,14,...,477,660,.723,167,983,331,176,88,450,588
351,Xavier,35,19,16,.543,9.61,8.06,,9,9,...,437,644,.679,371,1281,519,190,128,450,550
352,Yale NCAA,30,22,8,.733,5.52,-1.24,,10,4,...,411,557,.738,259,1157,503,177,131,392,510


In [30]:
reg_season_adv_df

Rk,School,G,W,L,W-L%,SRS,SOS,Unnamed: 8,W.1,L.1,...,3PAr,TS%,TRB%,AST%,STL%,BLK%,eFG%,TOV%,ORB%,FT/FGA
1,Abilene Christian NCAA,34,27,7,.794,-1.91,-7.34,,14,4,...,.345,.565,50.3,58.5,12.9,8.0,.535,15.5,28.8,.239
2,Air Force,32,14,18,.438,-4.28,0.24,,8,10,...,.400,.541,50.1,54.1,7.0,5.8,.517,17.4,23.7,.192
3,Akron,33,17,16,.515,4.86,1.09,,8,10,...,.477,.515,48.2,50.1,8.2,8.9,.485,15.0,25.3,.195
4,Alabama A&M,32,5,27,.156,-19.23,-8.38,,4,14,...,.320,.479,47.1,52.3,10.7,4.7,.457,19.4,27.6,.157
5,Alabama-Birmingham,35,20,15,.571,0.36,-1.52,,10,8,...,.346,.536,52.7,44.3,9.3,7.5,.511,14.8,30.4,.212
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
349,Wright State,35,21,14,.600,3.29,-0.89,,13,5,...,.403,.543,52.3,54.7,9.0,6.1,.506,14.6,31.3,.251
350,Wyoming,32,8,24,.250,-9.75,0.19,,4,14,...,.435,.534,44.7,48.0,7.8,7.9,.492,18.6,15.6,.288
351,Xavier,35,19,16,.543,9.61,8.06,,9,9,...,.374,.553,53.4,56.3,8.1,10.6,.528,16.5,32.2,.221
352,Yale NCAA,30,22,8,.733,5.52,-1.24,,10,4,...,.350,.584,52.9,56.3,8.0,11.2,.556,15.9,25.8,.227


Coaches

In [3]:
coaches_url = "https://www.sports-reference.com/cbb/seasons/2019-coaches.html"

coach_page = requests.get(coaches_url)
soup = BeautifulSoup(coach_page.text, "html.parser")
table = soup.find("table", attrs={"id": "coaches"})
rows = table.find_all("tr")

coaches_df = pd.DataFrame(columns=['Coach_Team', 'MM', 'S16', 'F4', 'Champs'])

# Walk the table rows; header and spacer rows contain no links and are skipped.
for i, row in enumerate(rows):
    links = row.find_all("a")
    if len(links) < 2:
        continue

    coach_team = links[1]  # second link in the row; its text is the school name
    mm_apps = row.find("td", attrs={"data-stat": "ncaa_car"})
    sw16_apps = row.find("td", attrs={"data-stat": "sw16_car"})
    f4_apps = row.find("td", attrs={"data-stat": "ff_car"})
    champ_wins = row.find("td", attrs={"data-stat": "champ_car"})

    # Keep the row position as the index so it can be traced back to the page.
    coaches_df.loc[i] = [coach_team.text, mm_apps.text, sw16_apps.text, f4_apps.text, champ_wins.text]

In [4]:
coaches_df

Unnamed: 0,Coach_Team,MM,S16,F4,Champs
2,Abilene Christian,1,,,
3,Air Force,,,,
4,Akron,3,1,,
5,Alabama,1,,,
6,Alabama A&M,,,,
...,...,...,...,...,...
385,Wright State,4,,,
386,Wyoming,,,,
387,Xavier,,,,
388,Yale,2,,,


# Data Exploration (EDA)

### Questions of Interest

As any good data scientist should, I have a few questions I hope to address in my EDA:

1) Does the data have any null values? If so, are these values missing at random?

2) What is a bracket's accuracy when always picking the majority class (base rate: favorite beats underdog)?

3) How often do upsets occur in a given year's March Madness? 

4) Which seeding combinations are the most likely to produce upsets?

5) What is the win percentage of each seed in the tournament?
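
Several of these questions reduce to short pandas aggregations. The sketch below uses small made-up tables, so the column names and every value are assumptions rather than real tournament results. It also assumes that blank cells in the coaches table should be counted as missing, although they may simply mean zero appearances.

```python
import pandas as pd

# Toy game-level data: one row per tournament game (values are invented).
games = pd.DataFrame({
    "favorite_seed": [1, 1, 5, 5, 8, 8],
    "underdog_seed": [16, 16, 12, 12, 9, 9],
    "upset": [False, False, True, False, True, False],
})

# Question 3: share of games that ended in an upset.
upset_rate = games["upset"].mean()

# Question 2: accuracy of always picking the favorite (the majority class).
baseline_accuracy = 1 - upset_rate

# Question 4: upset rate per seed pairing, most upset-prone first.
upset_by_pair = (
    games.groupby(["favorite_seed", "underdog_seed"])["upset"]
         .agg(games_played="count", upset_rate="mean")
         .sort_values("upset_rate", ascending=False)
)

# Question 1: scraped text cells that are blank become NaN once coerced to numbers.
coaches = pd.DataFrame({"Coach_Team": ["Akron", "Air Force"], "MM": ["3", ""]})
missing_mm = pd.to_numeric(coaches["MM"], errors="coerce").isna().sum()

print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(upset_by_pair)
print(f"missing MM values: {missing_mm}")
```

The per-seed win percentage in question 5 would need one row per team per game rather than one row per matchup, so it is left out of this sketch.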

### Visualizations

# Data Cleaning

# Feature Engineering

# Feature Selection

# Model Selection

# Model Evaluation

# Conclusions