# Agonizing over your March Madness bracket?
Who isn't?! Not familiar with this American obsession? Here are the basics from the NCAA in [this](https://www.ncaa.com/news/basketball-men/bracketiq/2018-10-10/what-march-madness-ncaa-tournament-explained) article:
>**What is March Madness?**
The NCAA Division I men’s basketball tournament is a single-elimination tournament of 68 teams that compete in seven rounds for the national championship. The penultimate round is known as the Final Four, when only (you guessed it) four teams are left.
<br><br>
>**When did March Madness start?**
>The first NCAA Division I men’s basketball tournament was in 1939, and it has been held every year since.
<br><br>
>**How has the tournament changed since 1939?**
>The inaugural tournament had just eight teams, and saw Oregon beat Ohio State 46-33 for the title. In 1951, the field doubled to 16, and kept expanding over the next few decades until 1985, when the modern format of a 64-team tournament began. In 2001, after the Mountain West Conference joined Division I and received an automatic bid, pushing the total teams to 65, a single game was added prior to the first round. In 2011, three more teams were added, and with them, three more games to round out the First Four.


From Smithsonian.com in [this](https://www.smithsonianmag.com/history/when-did-filling-out-march-madness-bracket-become-popular-180950162/) article:
>The first NCAA bracket pool—putting some money where your bracket is—is thought to have started in 1977 in a Staten Island bar. 88 people filled out brackets in the pool that year, and paid $10 in a winner-take-all format.

From the same article, written in 2014:
>The odds of it happening are one in 9.2 quintillion: you’re more likely to die an excruciating death by vending machine, become president, win the Mega Millions jackpot or die from incorrectly using products made for right-handed people (if you're a lefty) than fill out a perfect NCAA basketball bracket... Over 60 million Americans fill out a bracket each year, with 1 billion dollars potentially spent on off-book gambling... 
<br><br>
>"Some things seem so obvious, like the idea these higher seeds should beat lower seeds all the time, but that doesn’t necessarily happen, and that results in all sorts of chaos," explains Ken Pomeroy, creator of the college basketball website [kenpom.com](https://kenpom.com/). "There’s that desire to try to predict something that’s difficult to predict."

## Machine learning could help!
[Wikipedia](https://en.wikipedia.org/wiki/Machine_learning):
>Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to effectively perform a specific task without using explicit instructions, relying on models and inference instead. It is seen as a subset of artificial intelligence. Machine learning algorithms build a mathematical model of sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task... 
<br><br>
>Classification algorithms are used when the outputs are restricted to a limited set of values.

In this case, the limited set of output values is the possible outcome of a tournament game (win or loss), and the training data are statistics from previous seasons together with each season's tournament results. What follows is an attempt to predict the winner of each game (and of every possible matchup) in the NCAA Division I men’s basketball tournament (the 2018 tourney in this case, but easy to update for 2019):

<a id='top'></a>

#### BRACKET CRUNCHER OUTLINE

<a href='#1'>1. THE DATA</a>
* <a href='#1.1'>1.1 Data Cleaning</a>

<a href='#2'>2. BASKETBALL STATISTICS AS FEATURES</a>
* <a href='#2.1'>2.1 Calculate Statistics</a>
* <a href='#2.2'>2.2 Regular Season Averages</a>

<a href='#3'>3. EXPLORATORY ANALYSIS</a>
* <a href='#3.1'>3.1 The Madness in March</a>
* <a href='#3.2'>3.2 Teams, Conferences</a>
* <a href='#3.3'>3.3 Features to Model</a>

<a href='#4'>4. MACHINE LEARNING</a>
* <a href='#4.1'>4.1 Logistic Regression</a>
* <a href='#4.2'>4.2 Support Vector Machine</a>
* <a href='#4.3'>4.3 Decision Tree</a>
* <a href='#4.4'>4.4 Random Forest</a>
* <a href='#4.5'>4.5 XGBoost</a>
* <a href='#4.6'>4.6 Best Model</a>
* <a href='#4.7'>4.7 Model Explainability</a>
* <a href='#4.8'>4.8 Make Predictions and Build Bracket</a>


<a href='#5'>5. RESULTS (with printable bracket)</a>

<br>

In [None]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_columns', 200)
%matplotlib inline
from matplotlib import rcParams
rcParams['font.family'] = 'monospace'
from matplotlib.ticker import MaxNLocator

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve, log_loss

import os, eli5, shap
from eli5.sklearn import PermutationImportance
from pdpbox import pdp

print()
print('The odds of correctly predicting each possible outcome for all 63 games are one in {:,}.'.format(2**63))

<a href='#top' id='1'></a>

---
## 1. THE DATA

In [None]:
print(os.listdir('../input/mens-machine-learning-competition-2019/datafiles'))

`NCAATourneyCompactResults.csv`

>This file identifies the game-by-game NCAA® tournament results for all seasons of historical data. The data is formatted exactly like the RegularSeasonCompactResults data. Note that these games also include the play-in games (which always occurred on day 134/135) for those years that had play-in games. Thus each season you will see between 63 and 67 games listed, depending on how many play-in games there were.

>* DayNum=134 or 135 (Tue/Wed) - play-in games to get the tournament field down to the final 64 teams
>* DayNum=136 or 137 (Thu/Fri) - Round 1, to bring the tournament field from 64 teams to 32 teams
>* DayNum=138 or 139 (Sat/Sun) - Round 2, to bring the tournament field from 32 teams to 16 teams
>* DayNum=143 or 144 (Thu/Fri) - Round 3, otherwise known as "Sweet Sixteen", to bring the tournament field from 16 teams to 8 teams
>* DayNum=145 or 146 (Sat/Sun) - Round 4, otherwise known as "Elite Eight" or "regional finals", to bring the tournament field from 8 teams to 4 teams
>* DayNum=152 (Sat) - Round 5, otherwise known as "Final Four" or "national semifinals", to bring the tournament field from 4 teams to 2 teams
>* DayNum=154 (Mon) - Round 6, otherwise known as "national final" or "national championship", to bring the tournament field from 2 teams to 1 champion team

In [None]:
df_tourney_all = pd.read_csv('../input/mens-machine-learning-competition-2019/datafiles/NCAATourneyCompactResults.csv')
df_tourney_all.head()

`RegularSeasonDetailedResults.csv`

> This file provides team-level box scores for many regular seasons of historical data, starting with the 2003 season. 

>The column names should be self-explanatory to basketball fans (as above, "W" or "L" refers to the winning or losing team):

>* WFGM - field goals made (by the winning team)
>* WFGA - field goals attempted (by the winning team)
>* WFGM3 - three pointers made (by the winning team)
>* WFGA3 - three pointers attempted (by the winning team)
>* WFTM - free throws made (by the winning team)
>* WFTA - free throws attempted (by the winning team)
>* WOR - offensive rebounds (pulled by the winning team)
>* WDR - defensive rebounds (pulled by the winning team)
>* WAst - assists (by the winning team)
>* WTO - turnovers committed (by the winning team)
>* WStl - steals (accomplished by the winning team)
>* WBlk - blocks (accomplished by the winning team)
>* WPF - personal fouls committed (by the winning team)

In [None]:
df = pd.read_csv('../input/mens-machine-learning-competition-2019/datafiles/RegularSeasonDetailedResults.csv')
df.head()

<a href='#top' id='1.1'>return to menu</a>

## 1.1 Data Cleaning
Not much to clean up, but including team and conference names would be nice in graphs!

In [None]:
# Drop columns that are not needed:
df = df.drop(['DayNum', 'WLoc', 'NumOT'], axis=1)

# Save dataframes with Team and Conference names:
df_teams = pd.read_csv('../input/mens-machine-learning-competition-2019/datafiles/Teams.csv')
df_team_conferences = pd.read_csv('../input/mens-machine-learning-competition-2019/datafiles/TeamConferences.csv')
df_conferences = pd.read_csv('../input/mens-machine-learning-competition-2019/datafiles/Conferences.csv')

# Merge the conference dataframes to eventually use the full conference name:
df_conference_names = df_team_conferences.merge(df_conferences, on=['ConfAbbrev'])

# Pre-merge tidying to match with winner and loser IDs:
win_teams = df_teams.rename(columns={'TeamID':'WTeamID'})[['WTeamID', 'TeamName']]
win_confs = df_conference_names.rename(columns={'TeamID':'WTeamID'})[['Season', 'WTeamID', 'Description']]
lose_teams = df_teams.rename(columns={'TeamID':'LTeamID'})[['LTeamID', 'TeamName']]
lose_confs = df_conference_names.rename(columns={'TeamID':'LTeamID'})[['Season', 'LTeamID', 'Description']]

# Merge winning team name and conference, losing team name and conference with season results:
df = df.merge(win_teams, on='WTeamID').rename(columns={'TeamName': 'WTeamName'}) \
.merge(win_confs, on=['Season', 'WTeamID']).rename(columns={'Description': 'WConfName'}) \
.merge(lose_teams, on='LTeamID').rename(columns={'TeamName': 'LTeamName'}) \
.merge(lose_confs, on=['Season', 'LTeamID']).rename(columns={'Description': 'LConfName'})
df.head()

In [None]:
df['WFGM2'] = df.WFGM - df.WFGM3
df['WFGA2'] = df.WFGA - df.WFGA3
df['LFGM2'] = df.LFGM - df.LFGM3
df['LFGA2'] = df.LFGA - df.LFGA3

In [None]:
print('These are the {} conferences that have participated in NCAA Division I men\'s basketball with the number of wins in the dataset for each:'.format(len(df.WConfName.value_counts())))
df.WConfName.value_counts()

In [None]:
print('Season  #Games:')
df.Season.value_counts()

So, the machine learning model will analyze how 2003-2017 season statistics are related to the 2003-2017 tournament outcomes. It will then apply what it 'understands' about that relationship to make predictions for the 2018 tournament games based on 2018 season statistics.
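As a rough sketch of that season-based split (the column names `ie_diff` and `team_won` and all values below are made up for illustration, not taken from the dataset), earlier seasons serve as training data and the 2018 rows are held out for prediction:

```python
import pandas as pd

# Hypothetical game-level table: one made-up feature and a win/loss label.
games = pd.DataFrame({
    'Season':   [2016, 2016, 2017, 2017, 2018, 2018],
    'ie_diff':  [4.1, -2.3, 1.7, -0.8, 3.0, -1.2],
    'team_won': [1, 0, 1, 0, 1, 0],
})

# Train on seasons through 2017; hold out 2018 for prediction.
train = games[games['Season'] <= 2017]
holdout = games[games['Season'] == 2018]

X_train, y_train = train[['ie_diff']], train['team_won']
X_new = holdout[['ie_diff']]

print(len(train), 'training rows;', len(holdout), 'rows to predict')
```

Splitting by season rather than at random keeps every 2018 game out of training, which mirrors how the model would be used on a tournament that has not been played yet.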

<a href='#top' id='2'>return to menu</a>

---
## 2. BASKETBALL STATISTICS AS FEATURES
Box-score data are used to calculate basketball statistics. These are the features to analyze and possibly model.

### Possession
A team's possession ends when the team: 1) makes a field goal, 2) misses a shot and fails to get the rebound, 3) turns the ball over, or 4) makes its last free throw, or misses it and fails to get the rebound. (A possession also ends when the period ends.) The number of possessions can be estimated as:

$$ \text{poss} = \text{FGA} - \text{OR} + \text{TO} + 0.475\text{FTA} $$

In any game, the number of possessions is nearly equal for both teams, so **efficiency wins**!
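A quick worked example of the possession estimate, using a made-up box score (the function name `estimate_possessions` is only for illustration):

```python
def estimate_possessions(fga, orb, to, fta):
    """Estimate possessions as FGA - OR + TO + 0.475 * FTA."""
    return fga - orb + to + 0.475 * fta

# Hypothetical box score: 60 field goal attempts, 10 offensive rebounds,
# 12 turnovers, and 20 free throw attempts.
poss = estimate_possessions(fga=60, orb=10, to=12, fta=20)
print(round(poss, 1))
```

Offensive rebounds are subtracted because they extend a possession instead of ending it, and the 0.475 weight reflects that not every free throw attempt ends a possession.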

---
## EFFICIENCY -
### Shooting Efficiency
Number of points per shooting opportunity, estimated as:

$$ \text{shoot_eff} = \frac{\text{Score}}{\text{FGA} + 0.475\text{FTA}} $$

### Scoring Opportunity
Number of scoring attempts per possession, estimated as:

$$ \text{score_op} = \frac{\text{FGA} + 0.475\text{FTA}}{\text{poss}} $$

### Offensive Rating
Points scored per 100 possessions, estimated as:

$$ \text{off_rtg} = \frac{\text{Score}}{\text{poss}}  \times 100$$

### Defensive Rating
A team's defensive rating is their opponent's offensive rating:

$$ \text{def_rtg} = \text{opp_off_rtg} $$

### Net Efficiency
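Offensive rating minus defensive rating, also called net rating. This notebook stores it as `sos`, although strength of schedule more commonly describes the quality of a team's opponents: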

$$ \text{sos} = \text{off_rtg} - \text{opp_off_rtg} $$
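The efficiency formulas fit together: offensive rating is shooting efficiency times scoring opportunity (times 100), because the scoring-attempts term cancels. Here is a minimal sketch with two made-up box scores (the helper name `efficiency` is hypothetical):

```python
def efficiency(score, fga, fta, orb, to):
    """Return (shoot_eff, score_op, off_rtg) from box-score totals."""
    attempts = fga + 0.475 * fta      # scoring attempts
    poss = attempts - orb + to        # estimated possessions
    shoot_eff = score / attempts      # points per scoring attempt
    score_op = attempts / poss        # scoring attempts per possession
    off_rtg = score / poss * 100      # points per 100 possessions
    return shoot_eff, score_op, off_rtg

# Hypothetical game between team A and team B:
a_shoot, a_op, a_off = efficiency(score=75, fga=60, fta=20, orb=10, to=12)
b_shoot, b_op, b_off = efficiency(score=68, fga=58, fta=16, orb=8, to=14)

# Each team's defensive rating is the other team's offensive rating:
a_def, b_def = b_off, a_off
net_a = a_off - a_def
net_b = b_off - b_def

print(round(a_off, 1), round(a_shoot * a_op * 100, 1))  # the two values match
print(round(net_a, 1), round(net_b, 1))                  # equal and opposite
```

Within a single game the two net values always mirror each other, which is why the season averages are what carry information about a team.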

---
## MORE ON SHOOTING -
### True Shooting Percentage
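Shooting efficiency expressed as a percentage of the two points nominally available on a scoring attempt, so it credits two-pointers, three-pointers, and free throws alike. Estimated as: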

$$ \text{ts_pct} = \frac{\text{Score}}{2(\text{FGA} + 0.475\text{FTA})} \times 100$$


### Effective Field Goal Percentage
Adjusts for the fact that some field goals are worth more points than others, estimated as:

$$  \text{efg_pct} = \frac{\text{FGM2} + 1.5\text{FGM3}}{\text{FGA}} $$
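To see how the two shooting percentages behave, here is a small sketch with made-up numbers. It also checks that the formula above matches the more common way of writing effective field goal percentage, (FGM + 0.5 × FGM3) / FGA. The function names are illustrative only.

```python
def efg_pct(fgm2, fgm3, fga):
    """Effective field goal percentage: a made three counts 1.5 times a made two."""
    return (fgm2 + 1.5 * fgm3) / fga

def ts_pct(score, fga, fta):
    """True shooting percentage, scaled to 0-100."""
    return score / (2 * (fga + 0.475 * fta)) * 100

# Hypothetical team totals for one game:
fgm2, fgm3, ftm, fga, fta = 18, 8, 15, 60, 20
score = 2 * fgm2 + 3 * fgm3 + ftm

efg = efg_pct(fgm2, fgm3, fga)
efg_common = ((fgm2 + fgm3) + 0.5 * fgm3) / fga   # same quantity, written with total FGM
ts = ts_pct(score, fga, fta)

print(efg, efg_common, round(ts, 1))
```

Note the scale difference that the notebook's formulas imply: `efg_pct` is a fraction between 0 and 1, while `ts_pct` is multiplied by 100.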


---
## REBOUNDING -
### Offensive Rebound  Percentage
$$ \text{orb_pct} =  \frac{\text{OR}}{\text{OR} + \text{opp_DR}} $$

### Defensive Rebound  Percentage
$$ \text{drb_pct} =  \frac{\text{DR}}{\text{DR} + \text{opp_OR}} $$

### Rebound  Percentage
$$ \text{reb_pct} =  \frac{\text{orb_pct} + \text{drb_pct}}{2} $$


---
## OLIVER'S FOUR FACTORS -
Shoot, protect, recover, draw, frustrate! I know that's five; keep reading: 

*Basketball on Paper* author [Dean Oliver](http://www.basketballonpaper.com/author.html) outlines four factors that determine success in basketball:
1. Effective Field Goal Percentage
<br><br>
2. **Turnovers per Possession** 
$$ \text{to_poss} = \frac{\text{TO}}{\text{poss}} $$
<br><br>
3. Offensive Rebound Percentage
<br><br>
4. **Free Throw Rate**
$$ \text{ft_rate} = \frac{\text{FTM}}{\text{FGA}} $$

So, a team must shoot the ball well, take care of the ball (avoid turnovers), recover its missed shots, and get to the free throw line (and make the free throws). But Oliver also stresses that these factors matter on both offense and defense: a team should cover the four factors while frustrating its opponent's efforts to do the same.
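A minimal sketch of the four factors computed for both sides of one game, assuming box scores stored as plain dictionaries (the `four_factors` helper and all numbers are hypothetical):

```python
def four_factors(team, opp):
    """Return the four factors for `team`, given box-score dicts for both teams."""
    poss = team['FGA'] - team['OR'] + team['TO'] + 0.475 * team['FTA']
    return {
        'efg_pct': (team['FGM2'] + 1.5 * team['FGM3']) / team['FGA'],  # shoot
        'to_poss': team['TO'] / poss,                                   # protect
        'orb_pct': team['OR'] / (team['OR'] + opp['DR']),               # recover
        'ft_rate': team['FTM'] / team['FGA'],                           # draw
    }

# Made-up box scores for one game:
home = dict(FGA=60, FGM2=18, FGM3=8, FTA=20, FTM=15, OR=10, DR=25, TO=12)
away = dict(FGA=58, FGM2=20, FGM3=5, FTA=16, FTM=13, OR=8, DR=22, TO=14)

offense = four_factors(home, away)   # what the home team did
defense = four_factors(away, home)   # what it allowed the away team to do

for name in offense:
    print(name, round(offense[name], 3), round(defense[name], 3))
```

Calling the same function with the arguments swapped gives the "frustrate" side: the opponent's four factors are the team's defensive four factors.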


---
## OTHER FEATURES TO CONSIDER -

From the [NBA Advanced Stats](https://stats.nba.com/help/faq/) page:
> #### What is PACE? What does PACE tell fans besides the speed of the game?
Each team plays at a faster or slower pace, thus inflating or deflating player and team statistics. It is important to look at stats at a per possession level, rather than simply looking at points scored per game.
> #### What is PIE?
It is a simple metric that gives an excellent indication of performance at both the team and player level. It’s a major improvement to our EFF Rating. Notably 2 things changed: (1) We included Personal Fouls, (2) We added a denominator. We feel the key here is the denominator because it acts as an "automatic equalizer". Using the denominator, we find there is no need to consider the "PACE" of the statistics that are being analyzed. In its simplest terms, PIE shows what % of game events did that player or team achieve. The stats being analyzed are your traditional basketball statistics (PTS, REB, AST, TOV, etc..) A team that achieves more than 50% is likely to be a winning team. A player that achieves more than 10% is likely to be better than the average player. A high PIE % is highly correlated to winning. In fact, a team’s PIE rating and a team’s winning percentage correlate at an R square of .908 which indicates a "strong" correlation. We’ve introduced this statistic because we feel it incorporates a bit of defense into the equation. When a team misses a shot, all 5 players on the other team’s PIE rating goes up.

### Team Impact Estimate
$$ \text{IE_numerator} = \text{Score} + \text{FGM} + \text{FTM} - \text{FGA} - \text{FTA} + \text{DR} + 0.5\text{OR} + \text{Ast} + \text{Stl} + 0.5\text{Blk} - \text{PF} - \text{TO} $$

$$ \text{IE} = \frac{\text{IE_numerator}}{\text{IE_numerator} + \text{opp_IE_numerator}} \times 100 $$
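A small sketch of the impact estimate for one made-up game, scaled by 100 as the notebook's code does. The helper names and box-score values are hypothetical. Because both teams share the same denominator, their estimates sum to 100:

```python
def ie_numerator(s):
    """Overall statistical contribution of one team in one game."""
    return (s['Score'] + s['FGM'] + s['FTM'] - s['FGA'] - s['FTA']
            + s['DR'] + 0.5 * s['OR'] + s['Ast'] + s['Stl'] + 0.5 * s['Blk']
            - s['PF'] - s['TO'])

def impact_estimate(team, opp):
    """Team's share (in percent) of the combined contribution of both teams."""
    num, opp_num = ie_numerator(team), ie_numerator(opp)
    return num / (num + opp_num) * 100

# Made-up box scores:
home = dict(Score=75, FGM=26, FTM=15, FGA=60, FTA=20, DR=25, OR=10,
            Ast=14, Stl=7, Blk=4, PF=16, TO=12)
away = dict(Score=68, FGM=25, FTM=13, FGA=58, FTA=16, DR=22, OR=8,
            Ast=11, Stl=5, Blk=2, PF=19, TO=14)

home_ie = impact_estimate(home, away)
away_ie = impact_estimate(away, home)
print(round(home_ie, 1), round(away_ie, 1))
```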

### Assist Ratio
The percentage of a team's possessions that end in an assist:

$$ \text{ast_rtio} = \frac{\text{Ast}}{\text{FGA} + 0.475\text{FTA} + \text{TO} + \text{Ast}} \times 100 $$

### Block Percentage
Indicates that a team blocked $n\%$ of its opponents' two-point shot attempts:

$$ \text{blk_pct} = \frac{\text{Blk}}{\text{opp_FGA2}} \times 100 $$


### Steal Percentage
Indicates that the team stole the ball for $n\%$ of its opponents' possessions:

$$ \text{stl_pct} = \frac{\text{Stl}}{\text{opp_poss}} \times 100 $$



In [None]:
# Check winner boxscore data needed to calculate stats:
df[['WFGA', 'WFTA', 'WTO', 'WOR', 'WScore', 'WFGM2', 'WFGM3', 'WFGM', 'WFTM', 'WDR', 'WAst', 'WStl', 'WBlk', 'WPF']].describe()

In [None]:
# Check loser boxscore data needed to calculate stats:
df[['LFGA', 'LFTA', 'LTO', 'LOR', 'LScore', 'LFGM2', 'LFGM3', 'LFGM', 'LFTM', 'LDR', 'LAst', 'LStl', 'LBlk', 'LPF']].describe()

<a href='#top' id='2.1'>return to menu</a>

## 2.1 Calculate Statistics

In [None]:
# The stats below use vectorized column arithmetic, which yields the same values
# as row-by-row `apply` calls but runs much faster.

# Scoring attempts (field goal attempts plus estimated free throw trips):
Watt = df.WFGA + 0.475 * df.WFTA
Latt = df.LFGA + 0.475 * df.LFTA

# Winner stats related to offensive efficiency:
df['Wposs'] = Watt + df.WTO - df.WOR
df['Wshoot_eff'] = df.WScore / Watt
df['Wscore_op'] = Watt / df.Wposs
df['Woff_rtg'] = df.WScore / df.Wposs * 100

# Loser stats related to offensive efficiency:
df['Lposs'] = Latt + df.LTO - df.LOR
df['Lshoot_eff'] = df.LScore / Latt
df['Lscore_op'] = Latt / df.Lposs
df['Loff_rtg'] = df.LScore / df.Lposs * 100

# Defensive and net efficiency:
df['Wdef_rtg'] = df.Loff_rtg
df['Wsos'] = df.Woff_rtg - df.Loff_rtg
df['Ldef_rtg'] = df.Woff_rtg
df['Lsos'] = df.Loff_rtg - df.Woff_rtg

# Impact Estimate - 
# First calculate the teams' overall statistical contribution (the numerator):
Wie = df.WScore + df.WFGM + df.WFTM - df.WFGA - df.WFTA + df.WDR + 0.5 * df.WOR + df.WAst + df.WStl + 0.5 * df.WBlk - df.WPF - df.WTO
Lie = df.LScore + df.LFGM + df.LFTM - df.LFGA - df.LFTA + df.LDR + 0.5 * df.LOR + df.LAst + df.LStl + 0.5 * df.LBlk - df.LPF - df.LTO

# Then divide by the total game statistics (the denominator):
df['Wie'] = Wie / (Wie + Lie) * 100
df['Lie'] = Lie / (Lie + Wie) * 100

# Other winner stats:
df['Wts_pct'] = df.WScore / (2 * Watt) * 100
df['Wefg_pct'] = (df.WFGM2 + 1.5 * df.WFGM3) / df.WFGA
df['Worb_pct'] = df.WOR / (df.WOR + df.LDR)
df['Wdrb_pct'] = df.WDR / (df.WDR + df.LOR)
df['Wreb_pct'] = (df.Worb_pct + df.Wdrb_pct) / 2
df['Wto_poss'] = df.WTO / df.Wposs
df['Wft_rate'] = df.WFTM / df.WFGA
df['Wast_rtio'] = df.WAst / (Watt + df.WTO + df.WAst) * 100
df['Wblk_pct'] = df.WBlk / df.LFGA2 * 100
df['Wstl_pct'] = df.WStl / df.Lposs * 100

# Other loser stats:
df['Lts_pct'] = df.LScore / (2 * Latt) * 100
df['Lefg_pct'] = (df.LFGM2 + 1.5 * df.LFGM3) / df.LFGA
df['Lorb_pct'] = df.LOR / (df.LOR + df.WDR)
df['Ldrb_pct'] = df.LDR / (df.LDR + df.WOR)
df['Lreb_pct'] = (df.Lorb_pct + df.Ldrb_pct) / 2
df['Lto_poss'] = df.LTO / df.Lposs
df['Lft_rate'] = df.LFTM / df.LFGA
df['Last_rtio'] = df.LAst / (Latt + df.LTO + df.LAst) * 100
df['Lblk_pct'] = df.LBlk / df.WFGA2 * 100
df['Lstl_pct'] = df.LStl / df.Wposs * 100

In [None]:
df.head()

<a href='#top' id='2.2'>return to menu</a>

## 2.2 Regular Season Averages

In [None]:
# Initialize dataframe to hold season averages:
df_avgs = pd.DataFrame()

# Get and save number of wins and losses:
df_avgs['n_wins'] = df['WTeamID'].groupby([df.Season, df.WTeamID, df.WTeamName, df.WConfName]).count()
df_avgs['n_loss'] = df['LTeamID'].groupby([df.Season, df.LTeamID, df.LTeamName, df.LConfName]).count()

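# Teams with no losses have no loss count, so fill the missing values with 0.
# (Assigning the result back avoids relying on an in-place update of a column selection.)
df_avgs['n_loss'] = df_avgs['n_loss'].fillna(0)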

# Calculate win percentages:
df_avgs['win_pct'] = df_avgs['n_wins'] / (df_avgs['n_wins'] + df_avgs['n_loss'])

In [None]:
# Calculate averages for games won:
df_avgs['Wshoot_eff'] = df['Wshoot_eff'].groupby([df['Season'], df['WTeamID']]).mean()
df_avgs['Wscore_op'] = df['Wscore_op'].groupby([df['Season'], df['WTeamID']]).mean()
df_avgs['Woff_rtg'] = df['Woff_rtg'].groupby([df['Season'], df['WTeamID']]).mean()
df_avgs['Wdef_rtg'] = df['Wdef_rtg'].groupby([df['Season'], df['WTeamID']]).mean()
df_avgs['Wsos'] = df['Wsos'].groupby([df['Season'], df['WTeamID']]).mean()
df_avgs['Wts_pct'] = df['Wts_pct'].groupby([df['Season'], df['WTeamID']]).mean()
df_avgs['Wefg_pct'] = df['Wefg_pct'].groupby([df['Season'], df['WTeamID']]).mean()
df_avgs['Worb_pct'] = df['Worb_pct'].groupby([df['Season'], df['WTeamID']]).mean()
df_avgs['Wdrb_pct'] = df['Wdrb_pct'].groupby([df['Season'], df['WTeamID']]).mean()
df_avgs['Wreb_pct'] = df['Wreb_pct'].groupby([df['Season'], df['WTeamID']]).mean()
df_avgs['Wto_poss'] = df['Wto_poss'].groupby([df['Season'], df['WTeamID']]).mean()
df_avgs['Wft_rate'] = df['Wft_rate'].groupby([df['Season'], df['WTeamID']]).mean()
df_avgs['Wie'] = df['Wie'].groupby([df['Season'], df['WTeamID']]).mean()
df_avgs['Wast_rtio'] = df['Wast_rtio'].groupby([df['Season'], df['WTeamID']]).mean()
df_avgs['Wblk_pct'] = df['Wblk_pct'].groupby([df['Season'], df['WTeamID']]).mean()
df_avgs['Wstl_pct'] = df['Wstl_pct'].groupby([df['Season'], df['WTeamID']]).mean()

# Calculate averages for games lost. These are grouped by the losing team (LTeamID)
# so that each value describes a team's own play in its losses:
loss_stats = ['Lshoot_eff', 'Lscore_op', 'Loff_rtg', 'Ldef_rtg', 'Lsos', 'Lts_pct', 'Lefg_pct', 'Lorb_pct',
              'Ldrb_pct', 'Lreb_pct', 'Lto_poss', 'Lft_rate', 'Lie', 'Last_rtio', 'Lblk_pct', 'Lstl_pct']
for stat in loss_stats:
    df_avgs[stat] = df[stat].groupby([df['Season'], df['LTeamID']]).mean()

# Undefeated teams have no loss averages; fill them with 0, since they carry zero weight (1 - win_pct = 0):
df_avgs[loss_stats] = df_avgs[loss_stats].fillna(0)

In [None]:
# Calculate weighted average using win percentage:
df_avgs['shoot_eff'] = df_avgs['Wshoot_eff'] * df_avgs['win_pct'] + df_avgs['Lshoot_eff'] * (1 - df_avgs['win_pct'])
df_avgs['score_op'] = df_avgs['Wscore_op'] * df_avgs['win_pct'] + df_avgs['Lscore_op'] * (1 - df_avgs['win_pct'])
df_avgs['off_rtg'] = df_avgs['Woff_rtg'] * df_avgs['win_pct'] + df_avgs['Loff_rtg'] * (1 - df_avgs['win_pct'])
df_avgs['def_rtg'] = df_avgs['Wdef_rtg'] * df_avgs['win_pct'] + df_avgs['Ldef_rtg'] * (1 - df_avgs['win_pct'])
df_avgs['sos'] = df_avgs['Wsos'] * df_avgs['win_pct'] + df_avgs['Lsos'] * (1 - df_avgs['win_pct'])
df_avgs['ts_pct'] = df_avgs['Wts_pct'] * df_avgs['win_pct'] + df_avgs['Lts_pct'] * (1 - df_avgs['win_pct'])
df_avgs['efg_pct'] = df_avgs['Wefg_pct'] * df_avgs['win_pct'] + df_avgs['Lefg_pct'] * (1 - df_avgs['win_pct'])
df_avgs['orb_pct'] = df_avgs['Worb_pct'] * df_avgs['win_pct'] + df_avgs['Lorb_pct'] * (1 - df_avgs['win_pct'])
df_avgs['drb_pct'] = df_avgs['Wdrb_pct'] * df_avgs['win_pct'] + df_avgs['Ldrb_pct'] * (1 - df_avgs['win_pct'])
df_avgs['reb_pct'] = df_avgs['Wreb_pct'] * df_avgs['win_pct'] + df_avgs['Lreb_pct'] * (1 - df_avgs['win_pct'])
df_avgs['to_poss'] = df_avgs['Wto_poss'] * df_avgs['win_pct'] + df_avgs['Lto_poss'] * (1 - df_avgs['win_pct'])
df_avgs['ft_rate'] = df_avgs['Wft_rate'] * df_avgs['win_pct'] + df_avgs['Lft_rate'] * (1 - df_avgs['win_pct'])
df_avgs['ie'] = df_avgs['Wie'] * df_avgs['win_pct'] + df_avgs['Lie'] * (1 - df_avgs['win_pct'])
df_avgs['ast_rtio'] = df_avgs['Wast_rtio'] * df_avgs['win_pct'] + df_avgs['Last_rtio'] * (1 - df_avgs['win_pct'])
df_avgs['blk_pct'] = df_avgs['Wblk_pct'] * df_avgs['win_pct'] + df_avgs['Lblk_pct'] * (1 - df_avgs['win_pct'])
df_avgs['stl_pct'] = df_avgs['Wstl_pct'] * df_avgs['win_pct'] + df_avgs['Lstl_pct'] * (1 - df_avgs['win_pct'])

df_avgs.reset_index(inplace = True)
df_avgs = df_avgs.rename(columns={'WTeamID': 'TeamID', 'WTeamName': 'TeamName', 'WConfName': 'ConfName'})
df_avgs.head()

<a href='#top' id='3'>return to menu</a>

---
## 3. EXPLORATORY ANALYSIS
This is a look at what all of those statistics indicate about teams and conferences. First, an exploration of why the tournament is so damn exciting:  

<a href='#top' id='3.1'></a>

## 3.1 The Madness in March
Note that while some define upsets differently, in this notebook an upset is simply a game in which the winner's seed number was greater than the loser's (for example, a 12 seed beating a 5 seed).
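As a sketch of that definition on a few made-up games (the `upset` column is a hypothetical name; `Wseed` and `Lseed` follow the naming used for the winner's and loser's seeds in this notebook):

```python
import pandas as pd

# Made-up results: winner's seed and loser's seed for four games.
games = pd.DataFrame({
    'Wseed': [1, 12, 8, 11],
    'Lseed': [16, 5, 9, 3],
})

# An upset is a win by the team with the larger (worse) seed number:
games['upset'] = games['Wseed'] > games['Lseed']

print(games)
print('Share of upsets:', games['upset'].mean())
```

Under this definition an 8 seed beating a 9 seed is not an upset, while a 9 over an 8 is, even though the two teams are nearly even on paper.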

In [None]:
def tourn_round(DayNum):
    """
    Map a tournament DayNum to the number of teams still alive at the
    start of that round (68 for the play-in games).
    """
    if DayNum in (136, 137):
        return 64
    elif DayNum in (138, 139):
        return 32
    elif DayNum in (143, 144):
        return 16
    elif DayNum in (145, 146):
        return 8
    elif DayNum == 152:
        return 4
    elif DayNum == 154:
        return 2
    else:
        return 68
    
df_tourney_all['tourn_round'] = df_tourney_all.DayNum.apply(tourn_round)


In [None]:
df_seeds = pd.read_csv('../input/mens-machine-learning-competition-2019/datafiles/NCAATourneySeeds.csv')

# Get the seed number from the second and third characters of 'Seed' values (the two digits after the region letter):
df_seeds['seed'] = df_seeds['Seed'].apply(lambda x : int(x[1:3]))
df_seeds.head()

In [None]:
# Drop the old 'Seed' column:
df_seeds = df_seeds[['Season', 'TeamID', 'seed']]

# Merge seeds, team names, and conference names with tournament data:
df_tourney_all = df_tourney_all.merge(df_seeds, how='left', left_on=['Season', 'WTeamID'], right_on=['Season', 'TeamID']) \
.rename(columns={'seed': 'Wseed'}).drop(['TeamID'], axis=1) \
.merge(df_seeds, how='left', left_on=['Season', 'LTeamID'], right_on=['Season', 'TeamID']) \
.rename(columns={'seed': 'Lseed'}).drop(['TeamID'], axis=1) \
.merge(win_teams, on='WTeamID').rename(columns={'TeamName': 'WTeamName'}) \
.merge(win_confs, on=['Season', 'WTeamID']).rename(columns={'Description': 'WConfName'}) \
.merge(lose_teams, on='LTeamID').rename(columns={'TeamName': 'LTeamName'}) \
.merge(lose_confs, on=['Season', 'LTeamID']).rename(columns={'Description': 'LConfName'})

# Calculate the point differential:
df_tourney_all['point_diff'] = df_tourney_all.WScore - df_tourney_all.LScore

In [None]:
df_tourney_all = df_tourney_all[df_tourney_all.Season < 2018]
df_tourney_all.head()

In [None]:
def plot_seed_scale():
    """
    Plot a scale reference of how seeds are represented in this notebook.
    """
    # Save list of seed integers:
    seeds = np.arange(16)+1
    
    # Set scatter points to the first color if the winning seed >8, otherwise set it to the second color:
    colors = np.where(seeds > 8, '#3c7f99', '#c5b783')
    
    # Scale the point sizes to reflect winning seed:
    point_size = seeds*100
    
    # Plot the seed points in a single line (`y=np.zeros(len(seeds))`):
    fig, ax = plt.subplots(figsize=(12, 1))
    plt.scatter(x=seeds, y=np.zeros(len(seeds), dtype=int), color=colors, alpha=0.35, s=point_size)
    plt.scatter(x=seeds, y=np.zeros(len(seeds), dtype=int), color=colors)
    plt.box(False)

    # Show all of the xticks and a single, labeled ytick:
    plt.xticks(seeds)
    plt.yticks([0], ['  Seed Scale: '], fontsize=14)

    # Remove tick marks:
    plt.tick_params(axis='both', which='both',length=0);

In [None]:
# Save a dataframe of the last 5 seasons, limited to games between the final 64 tournament teams.
# Sorting returns a new dataframe, which avoids modifying a slice of df_tourney_all in place:
madness = df_tourney_all[(df_tourney_all.Season >= 2013) & (df_tourney_all.tourn_round <= 64)]
madness = madness.sort_values(by='tourn_round', ascending=False)

# Set scatter points to the first color if the winning seed >8, otherwise set it to the second color:
colors = np.where(madness.Wseed > 8, '#3c7f99', '#c5b783')

# Scale the point sizes to reflect winning seed:
point_size = madness.Wseed*100

rounds = ('Round One', 'Round Two', 'Sweet Sixteen', 'Elite Eight', 'Final Four', 'Championship')

# Plot point differential by round, using color and size to reveal winning seed characteristics:
fig, ax = plt.subplots(figsize=(12, 8))
plt.scatter(madness.point_diff, madness.tourn_round.astype(str), color=colors, alpha=0.35, s=point_size)
plt.scatter(madness.point_diff, madness.tourn_round.astype(str), color=colors, alpha=0.75)
plt.box(False) # get rid of border

# Titles and subtitles:
fig.text(x=-0.05, y=1.1, s='             This is the Madness!             ', fontsize=32, weight='bold', color='white', backgroundcolor='#c5b783')
fig.text(x=-0.05, y=1, s='  The first two rounds are the most intense. Though upsets are frequently tight, \n  obviously this is not always true. Still, the better seeds tend to advance.', fontsize=18)
plt.title('Seasons 2013-2017', fontsize=16)

# Reverse the y-axis to reflect tournament progression:
plt.ylim(plt.ylim()[::-1])

# Tick marks and labels, x-axis label -
# Get rid of tick marks:
plt.tick_params(axis='both', which='both', length=0)

# Set tick label font size:
plt.tick_params(axis='both', which='major', labelsize=16)

# Label y-ticks according to tourney round:
plt.yticks(np.arange(len(rounds)), rounds)

# Label x-axis and show grid lines:
ax.xaxis.grid(which='both', linewidth=0.75)
plt.xlabel('Point Differential', fontsize=18)

# Add an info bar at the bottom:
fig.text(x=-0.05, y=0, s=' outer point size = winner seed                         dark points: underdog win ', fontsize=18, weight='bold', color='white', backgroundcolor='#3c7f99')

# Show the full scale of winning seed representation:
plot_seed_scale();

In [None]:
# Championship game statistics for 2013 - 2017:
df_tourney_all[(df_tourney_all.Season >= 2013) & (df_tourney_all.tourn_round == 2)].sort_values(by='Season')

[2014 NCAA Division I Men's Basketball Tournament, Wikipedia](https://en.wikipedia.org/wiki/2014_NCAA_Division_I_Men%27s_Basketball_Tournament):
>With No. 7 seed Connecticut and No. 8 seed Kentucky reaching the championship game, this tournament's final was the first ever not to include at least one 1, 2, or 3 seed. It is also only the third final not to feature a 1 or 2 seed (1989 - #3 Michigan vs. #3 Seton Hall and 2011 - #3 Connecticut vs. #8 Butler). Connecticut was the first 7 seed ever to reach and win the championship game. 

But the big Final Four outlier is that Villanova (2) beat Oklahoma (2) 95-51 in 2016! In fact, without that game, there's an obvious trend toward tighter games as the tournament progresses.

---
SIDE NOTE: Round One (field of 64 teams, for a total of 32 games) is why work productivity during those first two days (always Thurs/Fri) is questioned every year. According to [CNBC](https://www.cnbc.com/2018/03/06/march-madness-takes-a-toll-on-productivity.html) (2018), "unproductive workers during March Madness amounted to an estimated $6.3 billion in corporate losses last year." I'd like to argue (based on my own experience) that it's great for workplace morale!

In [None]:
# Save a dataframe of all seasons, limited to games between the final 64 tournament teams.
# Sorting returns a new dataframe, which avoids modifying a slice of df_tourney_all in place:
madness = df_tourney_all[df_tourney_all.tourn_round <= 64]
madness = madness.sort_values(by='tourn_round', ascending=False)

# Set scatter points to the first color if the winning seed >8, otherwise set it to the second color:
colors = np.where(madness.Wseed > 8, '#3c7f99', '#c5b783')

# Scale the point sizes to reflect winning seed:
point_size = madness.Wseed*100

# Plot point differential by round, using color and size to reveal winning seed characteristics:
fig, ax = plt.subplots(figsize=(12, 8))
plt.scatter(madness.point_diff, madness.tourn_round.astype(str), color=colors, alpha=0.35, s=point_size)
plt.scatter(madness.point_diff, madness.tourn_round.astype(str), color=colors, alpha=0.75)
plt.box(False)

plt.title('All Tournament Wins, All Seasons   ', fontsize=32)

# Reverse the y-axis to reflect tournament progression:
plt.ylim(plt.ylim()[::-1])

# Tick marks and labels, x-axis label -
plt.tick_params(axis='both', which='both',length=0)
plt.tick_params(axis='both', which='major', labelsize=16)
plt.yticks(np.arange(len(rounds)), rounds)
ax.xaxis.grid(which='both', linewidth=0.75)
plt.xlabel('Point Differential', fontsize=18)

plot_seed_scale();

That Final Four Villanova blowout continues to exist as an outlier when all seasons are included. Without it, the highest point differential for each round is smaller than the highest for the previous round.
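
The round-by-round claim can be checked directly rather than read off the plot. This is a minimal sketch on hypothetical toy data (the `games` frame and its values are made up for illustration); it assumes, as in the cells above, that `tourn_round` holds the number of teams still alive and `point_diff` the margin of victory.

```python
import pandas as pd

# Hypothetical toy data standing in for df_tourney_all.
# tourn_round = number of teams still alive (64, 32, ..., 2).
games = pd.DataFrame({
    'tourn_round': [64, 64, 32, 32, 16, 16, 8, 8, 4, 4, 2],
    'point_diff':  [41, 12, 30, 7, 25, 3, 20, 9, 44, 5, 4],
})

# Largest margin of victory per round, ordered from Round One to the final:
max_diff = games.groupby('tourn_round').point_diff.max().sort_index(ascending=False)

# Drop the single largest Final Four margin and recompute:
final_four = games[games.tourn_round == 4]
trimmed = games.drop(final_four.point_diff.idxmax())
max_diff_trimmed = trimmed.groupby('tourn_round').point_diff.max().sort_index(ascending=False)

# True when every round's maximum is smaller than the previous round's:
shrinks = bool((max_diff_trimmed.diff().dropna() < 0).all())
print(max_diff_trimmed, shrinks)
```

Swapping `games` for `madness` would run the same check on the real tournament data.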

No Round One underdog (seed > 8) has ever won a Final Four game, and only a few have even made it that far!

In [None]:
fig, ax = plt.subplots(figsize=(12, 6))

# Plot a count of every winning seed in the championship round:
sns.countplot(x=df_tourney_all[df_tourney_all.tourn_round == 2].Wseed, color='#3c7f99')
plt.box(False)

fig.text(x=0, y=1, s='       Championships per Seed since 1985       ', fontsize=32, weight='bold', color='white', backgroundcolor='#c5b783')
plt.title('(When the tournament field expanded to 64 teams.)', fontsize=18)


ax.yaxis.set_major_locator(MaxNLocator(integer=True, steps=[1, 2, 5, 10]))
ax.yaxis.grid(which='both', linewidth=0.5, color='#3c7f99')

plt.tick_params(axis='both', which='both',length=0)
plt.tick_params(axis='both', which='major', labelsize=16)
plt.ylabel('Number of Championships', fontsize=14), plt.xlabel('Tournament Seed', fontsize=18);

In [None]:
championships = df_tourney_all[df_tourney_all.tourn_round == 2]

fig, ax = plt.subplots(figsize=(12, 8))
sns.countplot(y=championships.WTeamName, order=championships.WTeamName.value_counts().index, color='#3c7f99')
plt.box(False)

fig.text(x=-0.05, y=0.95, s='       Championships per Team since 1985       ', fontsize=32, weight='bold', color='white', backgroundcolor='#c5b783')
plt.title('(When the tournament field expanded to 64 teams.)', fontsize=18)

plt.tick_params(axis='both', which='both',length=0)
plt.tick_params(axis='both', which='major', labelsize=16)
ax.xaxis.grid(which='both', linewidth=0.5, color='#3c7f99')
plt.xlabel(''), plt.ylabel('');

From that Smithsonian [article](https://www.smithsonianmag.com/history/when-did-filling-out-march-madness-bracket-become-popular-180950162/) in the intro:
>Forty years ago, picking a winner in the NCAA tournament was easy (spell it with me: U-C-L-A)... It wasn't until the tournament expanded to 64 teams—and upsets became easier—that the NCAA bracket became a national phenomenon.



But enough about the championship. If a team can't get through that crazy first round...

In [None]:
fig, ax = plt.subplots(figsize=(12, 8))
sns.countplot(x=df_tourney_all[df_tourney_all.tourn_round == 64].Wseed, color='#3c7f99')
plt.box(False)

fig.text(x=0, y=0.95, s='       First Round Wins by Seed since 1985     ', fontsize=32, weight='bold', color='white', backgroundcolor='#c5b783')
plt.title('(When the tournament field expanded to 64 teams.)', fontsize=18)

plt.tick_params(axis='both', which='both',length=0)
plt.tick_params(axis='both', which='major', labelsize=16)
ax.yaxis.grid(which='both', linewidth=0.5, color='#3c7f99')
plt.xlabel(''), plt.ylabel('');

Fans of the tournament are familiar with 12-5 upsets, and this shows why. They certainly happen more than expected. While the trend of wins for the top 4 seeds is no surprise, there's a dramatic drop in wins for 5, 6, and 7 seeds. Count on excitement from some 10, 11, 12 seeds!

(Also, this obviously doesn't include the 2018 tournament when a number 16 beat a number 1 for the first time.)
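
To put a number on the 12-5 upset, the first-round games can be narrowed to that one matchup. A minimal sketch, assuming the `Wseed` and `Lseed` columns used above; the `first_round` frame here is hypothetical toy data, not real results.

```python
import pandas as pd

# Hypothetical toy sample of Round One games (tourn_round == 64).
first_round = pd.DataFrame({
    'Wseed': [5, 12, 5, 5, 12, 1, 4, 11],
    'Lseed': [12, 5, 12, 12, 5, 16, 13, 6],
})

# Keep only the 5-vs-12 matchups, whichever side won:
better_seed = first_round[['Wseed', 'Lseed']].min(axis=1)
worse_seed = first_round[['Wseed', 'Lseed']].max(axis=1)
five_twelve = first_round[(better_seed == 5) & (worse_seed == 12)]

# Share of those games won by the 12 seed:
upset_rate = (five_twelve.Wseed == 12).mean()
print('12 seeds won {:.0%} of {} games'.format(upset_rate, len(five_twelve)))
```

On the real data, `first_round` would be `df_tourney_all[df_tourney_all.tourn_round == 64]`.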

In [None]:
upsets = df_tourney_all[df_tourney_all.Wseed > df_tourney_all.Lseed]
upset_counts = upsets.groupby(['Season'], as_index=False).Wseed.count().rename(columns={'Wseed': 'upset_count'})

fig, ax = plt.subplots(figsize=(12, 8))
sns.lineplot(x=upset_counts.Season, y=upset_counts.upset_count, marker='d', color='#3c7f99')
plt.box(False)

fig.text(x=0, y=0.95, s='          Tournament Upsets by Season        ', fontsize=32, weight='bold', color='white', backgroundcolor='#c5b783')
plt.title('The history of upsets looks like my heart rhythm during an upset!', fontsize=18)

plt.tick_params(axis='both', which='both',length=0)
plt.tick_params(axis='both', which='major', labelsize=16)
ax.yaxis.grid(which='both', color='#c5b783')
plt.xlabel(''), plt.ylabel('Number of Upsets', fontsize=16)

# Add an info bar at the bottom:
fig.text(x=0, y=0, s='              upset: winning seed number > losing seed number                  ', fontsize=18, weight='bold', color='white', backgroundcolor='#3c7f99');

There are at least 12 upsets per tournament, and 5 tournaments experienced more than 20. The greatest number occurred in the 1999 tournament:

In [None]:
# Keep only the 1999 season and only games between the final 64 tournament teams,
# sorted from Round One (64) down to the championship (2):
madness = df_tourney_all[(df_tourney_all.tourn_round <= 64) & (df_tourney_all.Season == 1999)].sort_values(by='tourn_round', ascending=False)

colors = np.where(madness.Wseed > 8, '#3c7f99', '#c5b783')
point_size = madness.Wseed*100

fig, ax = plt.subplots(figsize=(12, 8))
plt.scatter(madness.point_diff, madness.tourn_round.astype(str), color=colors, alpha=0.35, s=point_size)
plt.scatter(madness.point_diff, madness.tourn_round.astype(str), color=colors, alpha=0.75)
plt.box(False)

plt.title('1999 Tournament Wins   ', fontsize=32)

# Reverse the y-axis:
plt.ylim(plt.ylim()[::-1])

# Tick marks and labels, x-axis label -
plt.tick_params(axis='both', which='both',length=0)
plt.tick_params(axis='both', which='major', labelsize=16)
plt.yticks(np.arange(len(rounds)), rounds)
ax.xaxis.grid(which='both', linewidth=0.75)
plt.xlabel('Point Differential', fontsize=18)

plot_seed_scale();

<a href='#top' id='3.2'>return to menu</a>

## 3.2 Teams, Conferences
As the graphs above demonstrate, and despite all of those upsets, if a team expects to advance, earning a better seed may matter. This means a Final Four hopeful **must** kick ass during the season.

In [None]:
sns.set_style({'xtick.bottom': False, 'ytick.left': False})

In [None]:
fig, (boxplot, histogram) = plt.subplots(2, sharex=True, figsize=(12, 8), gridspec_kw={"height_ratios": (.15, .85)})

# Pass the series as x explicitly so the box is drawn horizontally on the shared x-axis:
sns.boxplot(x=df_avgs.n_wins, ax=boxplot, color='#c5b783')
sns.despine(left=True, bottom=True)
boxplot.set(xlabel='')


sns.kdeplot(df_avgs.n_wins, shade=True, ax=histogram, legend=False, color='#c5b783')
plt.box(False)
plt.title('Distribution of Wins Since 2003', fontsize=24)
plt.xlabel('Number of wins per Team')
plt.tick_params(axis='both', which='both',length=0);

In [None]:
fig = plt.figure(figsize=(18, 18))

# Save the correlation matrix:
matrix = df_avgs[['win_pct', 'shoot_eff', 'score_op', 'off_rtg', 'def_rtg', 'sos', 'ie', 'ts_pct', 'efg_pct', 'orb_pct', 'drb_pct', 'reb_pct', 'to_poss', 'ft_rate', 'ast_rtio', 'blk_pct', 'stl_pct']].corr()

# Create mask for the upper triangle:
mask = np.zeros_like(matrix, dtype=bool)  # the np.bool alias was removed in NumPy 1.24
mask[np.triu_indices_from(mask)] = True

# Create a custom diverging colormap:
cmap = sns.diverging_palette(225, 45, as_cmap=True)

sns.heatmap(matrix, mask=mask, cmap=cmap, center=0, annot=True, square=True, linewidths=0.25, cbar_kws={'shrink': 0.25})
plt.tick_params(axis='both', which='both',length=0)
plt.tick_params(axis='both', which='major', labelsize=16)

from scipy.stats import pearsonr
# pearsonr returns (r, p-value); [0] is the correlation coefficient r, not R squared:
plt.title('Team Impact Estimate and Win Percentage correlate at r = {:0.3f}!'.format(pearsonr(df_avgs.ie, df_avgs.win_pct)[0]), fontsize=24, weight='bold')
fig.text(x=0.25, y=0.75, s='In fact, Win Percentage has a moderate or strong positive \nrelationship with most of the efficiency measures.', fontsize=20);

In [None]:
fig = plt.figure(figsize=(20, 20))
grid = plt.GridSpec(3, 2, wspace=0.25, hspace=0.5)

fig.text(x=0.30, y=0.95, s='-Winner Stats', fontsize=32, color='#c5b783')
fig.text(x=0.50, y=0.95, s='vs.', fontsize=32)
fig.text(x=0.55, y=0.95, s=' -Loser Stats', fontsize=32, color='#3c7f99')
fig.text(x=0.45, y=0.92, s='(Efficiency)', fontsize=24)
    
plt.subplot(grid[0, :1])
sns.kdeplot(df_avgs.Lshoot_eff, shade=True, legend=False, color='#3c7f99')
sns.kdeplot(df_avgs.Wshoot_eff, shade=True, legend=False, color='#c5b783')
plt.title('Shooting Efficiency', fontsize=24)
plt.box(False)
plt.tick_params(axis='both', which='both', length=0)
plt.tick_params(axis='both', which='major', labelsize=14)

plt.subplot(grid[0, 1:])
sns.kdeplot(df_avgs.Lscore_op, shade=True, legend=False, color='#3c7f99')
sns.kdeplot(df_avgs.Wscore_op, shade=True, legend=False, color='#c5b783')
plt.title('Scoring Opportunity', fontsize=24)
plt.box(False)
plt.tick_params(axis='both', which='both', length=0)
plt.tick_params(axis='both', which='major', labelsize=14)

plt.subplot(grid[1, :1])
sns.kdeplot(df_avgs.Loff_rtg, shade=True, legend=False, color='#3c7f99')
sns.kdeplot(df_avgs.Woff_rtg, shade=True, legend=False, color='#c5b783')
plt.title('Offensive Rating', fontsize=24)
plt.box(False)
plt.tick_params(axis='both', which='both', length=0)
plt.tick_params(axis='both', which='major', labelsize=14)

plt.subplot(grid[1, 1:])
sns.kdeplot(df_avgs.Ldef_rtg, shade=True, legend=False, color='#3c7f99')
sns.kdeplot(df_avgs.Wdef_rtg, shade=True, legend=False, color='#c5b783')
plt.title('Defensive Rating', fontsize=24)
plt.box(False)
plt.tick_params(axis='both', which='both', length=0)
plt.tick_params(axis='both', which='major', labelsize=14)

plt.subplot(grid[2, :1])
sns.kdeplot(df_avgs.Lsos, shade=True, legend=False, color='#3c7f99')
sns.kdeplot(df_avgs.Wsos, shade=True, legend=False, color='#c5b783')
plt.title('Net Efficiency (SOS)', fontsize=24)
plt.box(False)
plt.tick_params(axis='both', which='both', length=0)
plt.tick_params(axis='both', which='major', labelsize=14)

plt.subplot(grid[2, 1:])
sns.kdeplot(df_avgs.Lie, shade=True, legend=False, color='#3c7f99')
sns.kdeplot(df_avgs.Wie, shade=True, legend=False, color='#c5b783')
plt.title('Team Impact Estimate', fontsize=24)
plt.box(False)
plt.tick_params(axis='both', which='both', length=0)
plt.tick_params(axis='both', which='major', labelsize=14);

Most of the efficiency behavior is expected: winners have higher values than losers on every measure except defensive rating, where a lower number is the better one.
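
As a reminder of why defensive rating runs the other way, here is a minimal sketch of one common convention: points scored (offense) or allowed (defense) per 100 possessions. The possession estimate and its 0.475 free-throw weight are assumptions drawn from common college basketball usage, not necessarily the formulas used to build `df_avgs`, and the function names and box-score numbers are hypothetical.

```python
def possessions(fga, orb, to, fta, ft_weight=0.475):
    """Estimate possessions from box-score totals (assumed formula)."""
    return fga - orb + to + ft_weight * fta


def rating(points, poss):
    """Points per 100 possessions."""
    return 100 * points / poss


# Hypothetical single-game box score:
team_poss = possessions(fga=60, orb=10, to=12, fta=20)

off_rtg = rating(points=78, poss=team_poss)  # points scored per 100 possessions
def_rtg = rating(points=65, poss=team_poss)  # points allowed per 100 possessions

# A winning team scores more than it allows, so off_rtg > def_rtg:
print(round(off_rtg, 1), round(def_rtg, 1))
```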

In [None]:
fig, ax = plt.subplots(figsize=(10, 14))
grid = plt.GridSpec(1, 2, wspace=1)
fig.text(x=-0.15, y=0.95, s='    Efficiency by Past Champions, Seasons 2003-2017    ', fontsize=32, weight='bold', color='white', backgroundcolor='#c5b783')
fig.text(x=1, y=0.5, s='Championships\n\n' + '\n'.join('{}: {}'.format(k, v) for k, v in championships[championships.Season.isin(df_avgs.Season.unique())].WTeamName.value_counts().to_dict().items()), fontsize=18)

plt.subplot(grid[0, :1]).xaxis.grid(which='both', linewidth=0.5, color='#c5b783')
sns.boxplot(y='TeamName', x='ie', data=df_avgs[(df_avgs.Season < 2018) & (df_avgs.TeamName.isin(championships.WTeamName.unique()))], color='#3c7f99', 
            order=df_avgs.ie.groupby(df_avgs[(df_avgs.Season < 2018) & (df_avgs.TeamName.isin(championships.WTeamName.unique()))].TeamName).mean().sort_values(ascending=False).to_frame().index, 
            meanprops=dict(marker='o', markeredgecolor='#c5b783', markerfacecolor='#c5b783'), 
            showmeans=True)
plt.box(False)
plt.xlabel('Impact Estimate', fontsize=22); plt.ylabel('')
plt.xticks([50, 67.5, 85], fontsize=12)

plt.subplot(grid[0, 1:]).xaxis.grid(which='both', linewidth=0.5, color='#c5b783')
sns.boxplot(y='TeamName', x='off_rtg', data=df_avgs[(df_avgs.Season < 2018) & (df_avgs.TeamName.isin(championships.WTeamName.unique()))], color='#3c7f99', 
            order=df_avgs.off_rtg.groupby(df_avgs[(df_avgs.Season < 2018) & (df_avgs.TeamName.isin(championships.WTeamName.unique()))].TeamName).mean().sort_values(ascending=False).to_frame().index, 
            meanprops=dict(marker='o', markeredgecolor='#c5b783', markerfacecolor='#c5b783'), 
            showmeans=True)
plt.box(False)
plt.xlabel('Offensive Rating', fontsize=22); plt.ylabel('');

In [None]:
crrnt_seasn = df_avgs[(df_avgs.Season == df_avgs.Season.max()) & (df_avgs.TeamName.isin(championships.WTeamName.unique()))]
team_eff = crrnt_seasn['off_rtg'].groupby(crrnt_seasn.TeamName).mean().to_frame()
team_eff['def_rtg'] = crrnt_seasn['def_rtg'].groupby(crrnt_seasn.TeamName).mean()
team_eff.reset_index(inplace=True)

x_coords = team_eff.off_rtg
y_coords = team_eff.def_rtg
teams = team_eff.TeamName

fig, ax = plt.subplots(figsize=(10, 10))
fig.text(x=0, y=0.95, s='        Efficiency by Past Champions, 2018      ', fontsize=28, weight='bold', color='white', backgroundcolor='#c5b783')

for i, team in enumerate(teams):
    x = x_coords[i]
    y = y_coords[i]
    plt.plot(x, y, marker='o', markersize=10, markeredgecolor='#3c7f99', markerfacecolor='#3c7f99')
    plt.text(x+0.000025, y-0.000005, team, fontsize=12)
    
plt.autoscale(enable=True, axis='both')
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
plt.xlabel('Offensive Rating', fontsize=18); plt.ylabel('Defensive Rating', fontsize=18);

In [None]:
fig, ax = plt.subplots(figsize=(10, 14))
grid = plt.GridSpec(1, 2, wspace=1)
fig.text(x=-0.15, y=0.95, s='         Efficiency Top 25, Seasons 2003-2017       ', fontsize=32, weight='bold', color='white', backgroundcolor='#c5b783')
fig.text(x=1, y=0.5, s='Championships\n\n' + '\n'.join('{}: {}'.format(k, v) for k, v in championships[championships.Season.isin(df_avgs.Season.unique())].WTeamName.value_counts().to_dict().items()), fontsize=18)

plt.subplot(grid[0, :1]).xaxis.grid(which='both', linewidth=0.5, color='#c5b783')
sns.boxplot(y='TeamName', x='ie', data=df_avgs[df_avgs.Season < 2018], color='#3c7f99', 
            order=df_avgs.ie.groupby(df_avgs[df_avgs.Season < 2018].TeamName).mean().sort_values(ascending=False).to_frame().head(25).index, 
            meanprops=dict(marker='o', markeredgecolor='#c5b783', markerfacecolor='#c5b783'), 
            showmeans=True)
plt.box(False)
plt.xlabel('Impact Estimate', fontsize=22); plt.ylabel('')
plt.xticks([50, 67.5, 85], fontsize=12)

plt.subplot(grid[0, 1:]).xaxis.grid(which='both', linewidth=0.5, color='#c5b783')
sns.boxplot(y='TeamName', x='off_rtg', data=df_avgs[df_avgs.Season < 2018], color='#3c7f99', 
            order=df_avgs.off_rtg.groupby(df_avgs[df_avgs.Season < 2018].TeamName).mean().sort_values(ascending=False).to_frame().head(25).index, 
            meanprops=dict(marker='o', markeredgecolor='#c5b783', markerfacecolor='#c5b783'), 
            showmeans=True)
plt.box(False)
plt.xlabel('Offensive Rating', fontsize=22); plt.ylabel('');

In [None]:
crrnt_seasn = df_avgs[df_avgs.Season == df_avgs.Season.max()]
team_eff = crrnt_seasn['off_rtg'].groupby(crrnt_seasn.TeamName).mean().to_frame()
team_eff['def_rtg'] = crrnt_seasn['def_rtg'].groupby(crrnt_seasn.TeamName).mean()
team_eff = team_eff.sort_values(['off_rtg'], ascending=False).head(25)
team_eff.reset_index(inplace=True)

x_coords = team_eff.off_rtg
y_coords = team_eff.def_rtg
teams = team_eff.TeamName

fig, ax = plt.subplots(figsize=(10, 10))
fig.text(x=0, y=0.95, s='        Offensive Rating Top 25, 2018        ', fontsize=28, weight='bold', color='white', backgroundcolor='#c5b783')

for i, team in enumerate(teams):
    x = x_coords[i]
    y = y_coords[i]
    plt.plot(x, y, marker='o', markersize=10, markeredgecolor='#3c7f99', markerfacecolor='#3c7f99')
    plt.text(x+0.000025, y-0.000005, team, fontsize=12)
    
plt.autoscale(enable=True, axis='both')
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
plt.xlabel('Offensive Rating', fontsize=18); plt.ylabel('Defensive Rating', fontsize=18);

In [None]:
crrnt_seasn = df_avgs[df_avgs.Season == df_avgs.Season.max()]
team_eff = crrnt_seasn['off_rtg'].groupby(crrnt_seasn.TeamName).mean().to_frame()
team_eff['def_rtg'] = crrnt_seasn['def_rtg'].groupby(crrnt_seasn.TeamName).mean()
team_eff = team_eff.sort_values(['def_rtg']).head(25)
team_eff.reset_index(inplace=True)

x_coords = team_eff.off_rtg
y_coords = team_eff.def_rtg
teams = team_eff.TeamName

fig, ax = plt.subplots(figsize=(10, 10))
fig.text(x=0, y=0.95, s='        Defensive Rating Top 25, 2018       ', fontsize=28, weight='bold', color='white', backgroundcolor='#c5b783')

for i, team in enumerate(teams):
    x = x_coords[i]
    y = y_coords[i]
    plt.plot(x, y, marker='o', markersize=10, markeredgecolor='#3c7f99', markerfacecolor='#3c7f99')
    plt.text(x+0.000025, y-0.000005, team, fontsize=12)
    
plt.autoscale(enable=True, axis='both')
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
plt.xlabel('Offensive Rating', fontsize=18); plt.ylabel('Defensive Rating', fontsize=18);

Gonzaga and Michigan State are the only two teams in the top 25 for both Offensive and Defensive Rating in 2018.
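
The overlap between the two lists does not have to be spotted by eye. A minimal sketch, assuming lower `def_rtg` is better (as the ascending sort above implies); the team names and ratings are hypothetical, and `n` is 3 here only to keep the toy example small.

```python
import pandas as pd

# Hypothetical toy ratings for one season; the real values come from df_avgs.
team_eff = pd.DataFrame({
    'TeamName': ['A', 'B', 'C', 'D', 'E', 'F'],
    'off_rtg':  [120.1, 118.4, 115.0, 112.3, 110.9, 108.2],
    'def_rtg':  [101.5, 92.0, 99.8, 90.4, 97.1, 91.3],
})

n = 3  # the notebook uses 25

# Best offenses have the highest off_rtg, best defenses the lowest def_rtg:
top_offense = set(team_eff.nlargest(n, 'off_rtg').TeamName)
top_defense = set(team_eff.nsmallest(n, 'def_rtg').TeamName)

# Teams that appear in both lists:
both = sorted(top_offense & top_defense)
print(both)
```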

In [None]:
fig, ax = plt.subplots(figsize=(10, 14))
grid = plt.GridSpec(1, 2, wspace=1)
fig.text(x=-0.225, y=0.95, s='  Efficiency by Conference, Seasons 2003-2017  ', fontsize=32, weight='bold', color='white', backgroundcolor='#c5b783')
fig.text(x=-0.15, y=-0.1, s='-'*75 + '\nChampionships 2003-2017\n\n' + '\n'.join('{}: {}'.format(k, v) for k, v in championships[championships.Season.isin(df_avgs.Season.unique())].WConfName.value_counts().to_dict().items()), fontsize=18)

plt.subplot(grid[0, :1]).xaxis.grid(which='both', linewidth=0.5, color='#c5b783')
sns.boxplot(y='ConfName', x='ie', data=df_avgs[df_avgs.Season < 2018], color='#3c7f99', 
            order=df_avgs.ie.groupby(df_avgs[df_avgs.Season < 2018].ConfName).mean().sort_values(ascending=False).to_frame().index, 
            meanprops=dict(marker='o', markeredgecolor='#c5b783', markerfacecolor='#c5b783'), 
            showmeans=True)
plt.box(False)
plt.xlabel('Impact Estimate', fontsize=22); plt.ylabel('')
plt.xticks([25, 50, 75], fontsize=12)

plt.subplot(grid[0, 1:]).xaxis.grid(which='both', linewidth=0.5, color='#c5b783')
sns.boxplot(y='ConfName', x='off_rtg', data=df_avgs[df_avgs.Season < 2018], color='#3c7f99', 
            order=df_avgs.off_rtg.groupby(df_avgs[df_avgs.Season < 2018].ConfName).mean().sort_values(ascending=False).to_frame().index, 
            meanprops=dict(marker='o', markeredgecolor='#c5b783', markerfacecolor='#c5b783'), 
            showmeans=True)
plt.box(False)
plt.xlabel('Offensive Rating', fontsize=22); plt.ylabel('');

In [None]:
crrnt_seasn = df_avgs[df_avgs.Season == df_avgs.Season.max()]
conf_eff = crrnt_seasn['off_rtg'].groupby(crrnt_seasn.ConfName).mean().to_frame()
conf_eff['def_rtg'] = crrnt_seasn['def_rtg'].groupby(crrnt_seasn.ConfName).mean()
conf_eff.reset_index(inplace=True)

x_coords = conf_eff.off_rtg
y_coords = conf_eff.def_rtg
confs = conf_eff.ConfName

fig, ax = plt.subplots(figsize=(10, 10))
fig.text(x=0, y=0.95, s='           Efficiency by Conference, 2018          ', fontsize=28, weight='bold', color='white', backgroundcolor='#c5b783')

for i, conf in enumerate(confs):
    x = x_coords[i]
    y = y_coords[i]
    plt.plot(x, y, marker='o', markersize=10, markeredgecolor='#3c7f99', markerfacecolor='#3c7f99')
    plt.text(x, y, conf, fontsize=12)
    
plt.autoscale(enable=True, axis='both')
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
plt.xlabel('Offensive Rating', fontsize=18); plt.ylabel('Defensive Rating', fontsize=18);

In [None]:
fig = plt.figure(figsize=(20, 20))
grid = plt.GridSpec(2, 2, wspace=0.25, hspace=0.5)

fig.text(x=0.30, y=0.95, s='-Winner Stats', fontsize=32, color='#c5b783')
fig.text(x=0.50, y=0.95, s='vs.', fontsize=32)
fig.text(x=0.55, y=0.95, s=' -Loser Stats', fontsize=32, color='#3c7f99')
fig.text(x=0.45, y=0.92, s='(Four Factors)', fontsize=24)
    

plt.subplot(grid[0, :1])
sns.kdeplot(df_avgs.Lefg_pct, shade=True, legend=False, color='#3c7f99')
sns.kdeplot(df_avgs.Wefg_pct, shade=True, legend=False, color='#c5b783')
plt.title('Effective Field Goal Percentage', fontsize=24)
plt.box(False)
plt.tick_params(axis='both', which='both', length=0)
plt.tick_params(axis='both', which='major', labelsize=14)

plt.subplot(grid[0, 1:])
sns.kdeplot(df_avgs.Lto_poss, shade=True, legend=False, color='#3c7f99')
sns.kdeplot(df_avgs.Wto_poss, shade=True, legend=False, color='#c5b783')
plt.title('Turnovers per Possession', fontsize=24)
plt.box(False)
plt.tick_params(axis='both', which='both', length=0)
plt.tick_params(axis='both', which='major', labelsize=14)

plt.subplot(grid[1, :1])
sns.kdeplot(df_avgs.Lorb_pct, shade=True, legend=False, color='#3c7f99')
sns.kdeplot(df_avgs.Worb_pct, shade=True, legend=False, color='#c5b783')
plt.title('Offensive Rebound Percentage', fontsize=24)
plt.box(False)
plt.tick_params(axis='both', which='both', length=0)
plt.tick_params(axis='both', which='major', labelsize=14)

plt.subplot(grid[1, 1:])
sns.kdeplot(df_avgs.Lft_rate, shade=True, legend=False, color='#3c7f99')
sns.kdeplot(df_avgs.Wft_rate, shade=True, legend=False, color='#c5b783')
plt.title('Free Throw Rate', fontsize=24)
plt.box(False)
plt.tick_params(axis='both', which='both', length=0)
plt.tick_params(axis='both', which='major', labelsize=14);

In [None]:
fig, ax = plt.subplots(figsize=(10, 20))
grid = plt.GridSpec(2, 2, wspace=1)
fig.text(x=-0.2, y=0.91, s='Four Factors by Past Champions, Seasons 2003-2017', fontsize=32, weight='bold', color='white', backgroundcolor='#c5b783')
fig.text(x=-0.15, y=-0.09, s='-'*75 + '\nChampionships 2003-2017\n\n' + '\n'.join('{}: {}'.format(k, v) for k, v in championships[championships.Season.isin(df_avgs.Season.unique())].WTeamName.value_counts().to_dict().items()), fontsize=18)

plt.subplot(grid[0, :1]).xaxis.grid(which='both', linewidth=0.5, color='#c5b783')
sns.boxplot(y='TeamName', x='efg_pct', data=df_avgs[(df_avgs.Season < 2018) & (df_avgs.TeamName.isin(championships.WTeamName.unique()))], color='#3c7f99', 
            order=df_avgs.efg_pct.groupby(df_avgs[(df_avgs.Season < 2018) & (df_avgs.TeamName.isin(championships.WTeamName.unique()))].TeamName).mean().sort_values(ascending=False).to_frame().index, 
            meanprops=dict(marker='o', markeredgecolor='#c5b783', markerfacecolor='#c5b783'), 
            showmeans=True)
plt.box(False)
plt.xlabel('Effective Field Goal Percentage', fontsize=18); plt.ylabel('')

plt.subplot(grid[0, 1:]).xaxis.grid(which='both', linewidth=0.5, color='#c5b783')
sns.boxplot(y='TeamName', x='to_poss', data=df_avgs[(df_avgs.Season < 2018) & (df_avgs.TeamName.isin(championships.WTeamName.unique()))], color='#3c7f99', 
            order=df_avgs.to_poss.groupby(df_avgs[(df_avgs.Season < 2018) & (df_avgs.TeamName.isin(championships.WTeamName.unique()))].TeamName).mean().sort_values(ascending=True).to_frame().index, 
            meanprops=dict(marker='o', markeredgecolor='#c5b783', markerfacecolor='#c5b783'), 
            showmeans=True)
plt.box(False)
plt.xlabel('Turnovers per Possession', fontsize=18); plt.ylabel('')

plt.subplot(grid[1, :1]).xaxis.grid(which='both', linewidth=0.5, color='#c5b783')
sns.boxplot(y='TeamName', x='orb_pct', data=df_avgs[(df_avgs.Season < 2018) & (df_avgs.TeamName.isin(championships.WTeamName.unique()))], color='#3c7f99', 
            order=df_avgs.orb_pct.groupby(df_avgs[(df_avgs.Season < 2018) & (df_avgs.TeamName.isin(championships.WTeamName.unique()))].TeamName).mean().sort_values(ascending=False).to_frame().index, 
            meanprops=dict(marker='o', markeredgecolor='#c5b783', markerfacecolor='#c5b783'), 
            showmeans=True)
plt.box(False)
plt.xlabel('Offensive Rebound Percentage', fontsize=18); plt.ylabel('')

plt.subplot(grid[1, 1:]).xaxis.grid(which='both', linewidth=0.5, color='#c5b783')
sns.boxplot(y='TeamName', x='ft_rate', data=df_avgs[(df_avgs.Season < 2018) & (df_avgs.TeamName.isin(championships.WTeamName.unique()))], color='#3c7f99', 
            order=df_avgs.ft_rate.groupby(df_avgs[(df_avgs.Season < 2018) & (df_avgs.TeamName.isin(championships.WTeamName.unique()))].TeamName).mean().sort_values(ascending=False).to_frame().index, 
            meanprops=dict(marker='o', markeredgecolor='#c5b783', markerfacecolor='#c5b783'), 
            showmeans=True)
plt.box(False)
plt.xlabel('Free Throw Rate', fontsize=18); plt.ylabel('');

In [None]:
fig, ax = plt.subplots(figsize=(15, 20))
grid = plt.GridSpec(2, 2, wspace=1)
fig.text(x=-0.025, y=0.91, s=' Top 25 for each of the Four Factors, Seasons 2003-2017 ', fontsize=32, weight='bold', color='white', backgroundcolor='#c5b783')
fig.text(x=0.025, y=-0.09, s='-'*90 + '\nChampionships 2003-2017\n\n' + '\n'.join('{}: {}'.format(k, v) for k, v in championships[championships.Season.isin(df_avgs.Season.unique())].WTeamName.value_counts().to_dict().items()), fontsize=18)

plt.subplot(grid[0, :1]).xaxis.grid(which='both', linewidth=0.5, color='#c5b783')
sns.boxplot(y='TeamName', x='efg_pct', data=df_avgs[df_avgs.Season < 2018], color='#3c7f99', 
            order=df_avgs.efg_pct.groupby(df_avgs[df_avgs.Season < 2018].TeamName).mean().sort_values(ascending=False).to_frame().head(25).index, 
            meanprops=dict(marker='o', markeredgecolor='#c5b783', markerfacecolor='#c5b783'), 
            showmeans=True)
plt.box(False)
plt.xlabel('Effective Field Goal Percentage', fontsize=18); plt.ylabel('')

plt.subplot(grid[0, 1:]).xaxis.grid(which='both', linewidth=0.5, color='#c5b783')
sns.boxplot(y='TeamName', x='to_poss', data=df_avgs[df_avgs.Season < 2018], color='#3c7f99', 
            order=df_avgs.to_poss.groupby(df_avgs[df_avgs.Season < 2018].TeamName).mean().sort_values(ascending=True).to_frame().head(25).index, 
            meanprops=dict(marker='o', markeredgecolor='#c5b783', markerfacecolor='#c5b783'), 
            showmeans=True)
plt.box(False)
plt.xlabel('Turnovers per Possession', fontsize=18); plt.ylabel('')

plt.subplot(grid[1, :1]).xaxis.grid(which='both', linewidth=0.5, color='#c5b783')
sns.boxplot(y='TeamName', x='orb_pct', data=df_avgs[df_avgs.Season < 2018], color='#3c7f99', 
            order=df_avgs.orb_pct.groupby(df_avgs[df_avgs.Season < 2018].TeamName).mean().sort_values(ascending=False).to_frame().head(25).index, 
            meanprops=dict(marker='o', markeredgecolor='#c5b783', markerfacecolor='#c5b783'), 
            showmeans=True)
plt.box(False)
plt.xlabel('Offensive Rebound Percentage', fontsize=18); plt.ylabel('')

plt.subplot(grid[1, 1:]).xaxis.grid(which='both', linewidth=0.5, color='#c5b783')
sns.boxplot(y='TeamName', x='ft_rate', data=df_avgs[df_avgs.Season < 2018], color='#3c7f99', 
            order=df_avgs.ft_rate.groupby(df_avgs[df_avgs.Season < 2018].TeamName).mean().sort_values(ascending=False).to_frame().head(25).index, 
            meanprops=dict(marker='o', markeredgecolor='#c5b783', markerfacecolor='#c5b783'), 
            showmeans=True)
plt.box(False)
plt.xlabel('Free Throw Rate', fontsize=18); plt.ylabel('');

It may be helpful to analyze tournament teams exclusively.
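
One way to do that, sketched on hypothetical toy frames: an inner merge against the seed table keeps only team-seasons that actually received a tournament seed. The `Season` and `TeamID` keys match the ones used elsewhere in this notebook; the `seed` column name and all values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical season averages for every team (stand-in for df_avgs):
team_stats = pd.DataFrame({
    'Season': [2017, 2017, 2017, 2018],
    'TeamID': [1101, 1102, 1103, 1101],
    'efg_pct': [0.51, 0.48, 0.55, 0.50],
})

# Hypothetical seed table (stand-in for df_seeds): one row per tournament team.
seeds = pd.DataFrame({
    'Season': [2017, 2018],
    'TeamID': [1103, 1101],
    'seed': [1, 12],
})

# The inner merge drops every team-season without a seed:
tourney_stats = team_stats.merge(seeds, on=['Season', 'TeamID'], how='inner')
print(tourney_stats)
```

The resulting frame could then feed the same box plots in place of `df_avgs`.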

In [None]:
fig, ax = plt.subplots(figsize=(15, 20))
grid = plt.GridSpec(2, 2, wspace=1)
fig.text(x=-0.05, y=0.91, s='    Four Factors by Conference, Seasons 2003-2017    ', fontsize=32, weight='bold', color='white', backgroundcolor='#c5b783')
fig.text(x=0, y=-0.025, s='-'*85 + '\nChampionships 2003-2017\n\n' + '\n'.join('{}: {}'.format(k, v) for k, v in championships[championships.Season.isin(df_avgs.Season.unique())].WConfName.value_counts().to_dict().items()), fontsize=18)

plt.subplot(grid[0, :1]).xaxis.grid(which='both', linewidth=0.5, color='#c5b783')
sns.boxplot(y='ConfName', x='efg_pct', data=df_avgs[df_avgs.Season < 2018], color='#3c7f99', 
            order=df_avgs.efg_pct.groupby(df_avgs[df_avgs.Season < 2018].ConfName).mean().sort_values(ascending=False).to_frame().head(25).index, 
            meanprops=dict(marker='o', markeredgecolor='#c5b783', markerfacecolor='#c5b783'), 
            showmeans=True)
plt.box(False)
plt.xlabel('Effective Field Goal Percentage', fontsize=18); plt.ylabel('')

plt.subplot(grid[0, 1:]).xaxis.grid(which='both', linewidth=0.5, color='#c5b783')
sns.boxplot(y='ConfName', x='to_poss', data=df_avgs[df_avgs.Season < 2018], color='#3c7f99', 
            order=df_avgs.to_poss.groupby(df_avgs[df_avgs.Season < 2018].ConfName).mean().sort_values(ascending=True).to_frame().head(25).index, 
            meanprops=dict(marker='o', markeredgecolor='#c5b783', markerfacecolor='#c5b783'), 
            showmeans=True)
plt.box(False)
plt.xlabel('Turnovers per Possession', fontsize=18); plt.ylabel('')

plt.subplot(grid[1, :1]).xaxis.grid(which='both', linewidth=0.5, color='#c5b783')
sns.boxplot(y='ConfName', x='orb_pct', data=df_avgs[df_avgs.Season < 2018], color='#3c7f99', 
            order=df_avgs.orb_pct.groupby(df_avgs[df_avgs.Season < 2018].ConfName).mean().sort_values(ascending=False).to_frame().head(25).index, 
            meanprops=dict(marker='o', markeredgecolor='#c5b783', markerfacecolor='#c5b783'), 
            showmeans=True)
plt.box(False)
plt.xlabel('Offensive Rebound Percentage', fontsize=18); plt.ylabel('')

plt.subplot(grid[1, 1:]).xaxis.grid(which='both', linewidth=0.5, color='#c5b783')
sns.boxplot(y='ConfName', x='ft_rate', data=df_avgs[df_avgs.Season < 2018], color='#3c7f99', 
            order=df_avgs.ft_rate.groupby(df_avgs[df_avgs.Season < 2018].ConfName).mean().sort_values(ascending=False).to_frame().head(25).index, 
            meanprops=dict(marker='o', markeredgecolor='#c5b783', markerfacecolor='#c5b783'), 
            showmeans=True)
plt.box(False)
plt.xlabel('Free Throw Rate', fontsize=18); plt.ylabel('');

In [None]:
fig = plt.figure(figsize=(20, 20))
grid = plt.GridSpec(3, 2, wspace=0.25, hspace=0.5)

fig.text(x=0.30, y=0.95, s='-Winner Stats', fontsize=32, color='#c5b783')
fig.text(x=0.50, y=0.95, s='vs.', fontsize=32)
fig.text(x=0.55, y=0.95, s=' -Loser Stats', fontsize=32, color='#3c7f99')

plt.subplot(grid[0, :1])
sns.kdeplot(df_avgs.Lreb_pct, shade=True, legend=False, color='#3c7f99')
sns.kdeplot(df_avgs.Wreb_pct, shade=True, legend=False, color='#c5b783')
plt.title('Rebound Percentage', fontsize=24)
plt.box(False)
plt.tick_params(axis='both', which='both', length=0)
plt.tick_params(axis='both', which='major', labelsize=14)

plt.subplot(grid[0, 1:])
sns.kdeplot(df_avgs.Ldrb_pct, shade=True, legend=False, color='#3c7f99')
sns.kdeplot(df_avgs.Wdrb_pct, shade=True, legend=False, color='#c5b783')
plt.title('Defensive Rebound Percentage', fontsize=24)
plt.box(False)
plt.tick_params(axis='both', which='both', length=0)
plt.tick_params(axis='both', which='major', labelsize=14)

plt.subplot(grid[1, :1])
sns.kdeplot(df_avgs.Lts_pct, shade=True, legend=False, color='#3c7f99')
sns.kdeplot(df_avgs.Wts_pct, shade=True, legend=False, color='#c5b783')
plt.title('True Shooting Percentage', fontsize=24)
plt.box(False)
plt.tick_params(axis='both', which='both', length=0)
plt.tick_params(axis='both', which='major', labelsize=14)
    
plt.subplot(grid[1, 1:])
sns.kdeplot(df_avgs.Last_rtio, shade=True, legend=False, color='#3c7f99')
sns.kdeplot(df_avgs.Wast_rtio, shade=True, legend=False, color='#c5b783')
plt.title('Assist Ratio', fontsize=24)
plt.box(False)
plt.tick_params(axis='both', which='both', length=0)
plt.tick_params(axis='both', which='major', labelsize=14)

plt.subplot(grid[2, :1])
sns.kdeplot(df_avgs.Lblk_pct, shade=True, legend=False, color='#3c7f99')
sns.kdeplot(df_avgs.Wblk_pct, shade=True, legend=False, color='#c5b783')
plt.title('Block Percentage', fontsize=24)
plt.box(False)
plt.tick_params(axis='both', which='both', length=0)
plt.tick_params(axis='both', which='major', labelsize=14)

plt.subplot(grid[2, 1:])
sns.kdeplot(df_avgs.Lstl_pct, shade=True, legend=False, color='#3c7f99')
sns.kdeplot(df_avgs.Wstl_pct, shade=True, legend=False, color='#c5b783')
plt.title('Steal Percentage', fontsize=24)
plt.box(False)
plt.tick_params(axis='both', which='both', length=0)
plt.tick_params(axis='both', which='major', labelsize=14);

<a href='#top' id='3.3'>return to menu</a>

## 3.3 Features to Model
Team season averages of key calculated stats make up the initial set of basketball features to model:

In [None]:
df_features = df_avgs[['Season', 'TeamID', 'shoot_eff', 'score_op', 'off_rtg', 'def_rtg', 'sos', 'ie', 'efg_pct', 'to_poss', 'orb_pct', 'ft_rate', 'reb_pct', 'drb_pct', 'ts_pct', 'ast_rtio', 'blk_pct', 'stl_pct']]
df_features.head()

---
And, since EDA indicated the importance of seed placement for tournament progression and final success, that is included in the model too:

In [None]:
df_features = pd.merge(df_seeds, df_features, how='left', left_on=['Season', 'TeamID'], right_on=['Season', 'TeamID'])

df_tourney = df_tourney_all[(df_tourney_all.Season >= 2003) & (df_tourney_all.Season < 2018)]
# Reassign rather than reset in place, which avoids a SettingWithCopyWarning on the filtered slice:
df_tourney = df_tourney.reset_index(drop=True)
df_tourney.tail()

In [None]:
# Merge tourney games with tourney winners' season features:
df_winners = pd.merge(left=df_tourney[['Season', 'WTeamID', 'LTeamID']], right=df_features, how='left', left_on=['Season', 'WTeamID'], right_on=['Season', 'TeamID'])
df_winners.drop(['TeamID'], inplace=True, axis=1) 

# Merge tourney games with loser features:
df_losers = pd.merge(left=df_tourney[['Season', 'WTeamID', 'LTeamID']], right=df_features, how='left', left_on=['Season', 'LTeamID'], right_on=['Season', 'TeamID'])
df_losers.drop(['TeamID'], inplace=True, axis=1)

<a href='#top' id='4'>return to menu</a>

---
## 4. MACHINE LEARNING
In order to predict winners of games in the championship tournament, the classification algorithms require a target output (win or loss) to learn. This information is available, but not explicitly in a data column, so a result column is created by splitting winners from losers and assigning the appropriate outcome: 1 for winner game data, 0 for loser game data. 
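As a minimal sketch of that labeling scheme, the toy example below builds the two mirrored rows per game by hand. The numbers and the two column names (`seed`, `off_rtg`) are made up for illustration and are not taken from the notebook's data.

```python
import pandas as pd

# Toy season features for the winner and loser of two hypothetical games:
winners = pd.DataFrame({'seed': [1, 5], 'off_rtg': [118.0, 110.0]})
losers = pd.DataFrame({'seed': [16, 12], 'off_rtg': [101.0, 112.0]})

# Winner-minus-loser differences are labeled 1 ...
win_rows = (winners - losers).assign(result=1)
# ... and the mirrored loser-minus-winner differences are labeled 0:
loss_rows = (losers - winners).assign(result=0)

toy_model = pd.concat([win_rows, loss_rows], ignore_index=True)
print(toy_model)
```

Because every game appears once in each class, the two classes are balanced by construction.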

### Prepare Data for Modeling

In [None]:
# Create winner target by subtracting loser data from winner data,
# and assigning a value of 1:
df_winner_diff = (df_winners.iloc[:, 3:] - df_losers.iloc[:, 3:])
df_winner_diff['result'] = 1

# Create loser target by subtracting winner data from loser data,
# and assigning a value of 0:
df_loser_diff = (df_losers.iloc[:, 3:] - df_winners.iloc[:, 3:])
df_loser_diff['result'] = 0

# Concatenate winner data with loser data:
df_model = pd.concat((df_winner_diff, df_loser_diff), axis=0)
df_model.head()

In [None]:
X = df_model.iloc[:, :-1]
y = df_model.result

# Split the dataframe into 65% training and 35% testing:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35, random_state=32)

In [None]:
sns.countplot(x=y)
plt.xlabel(''), plt.ylabel('')
plt.xticks([0, 1], ('losses', 'wins'))
plt.title('The Target Classes are Balanced');

This certainly makes sense because each game has a winner and a loser.

In [None]:
from yellowbrick.classifier import ClassificationReport

def clfy_report(clf, X_train, X_test, y_train, y_test, param_grid, clf_label='clf', cv=10, scale=True):
    """
    Tune classifier hyperparameters and print metrics.
    """
    
    # Create pipeline steps for scaling and classifying:
    if scale:
        pipe = Pipeline([('scaler', StandardScaler()), (clf_label, clf)])
    else:
        pipe = Pipeline([(clf_label, clf)])
    
    # Instantiate grid search using k-fold cross validation (k = cv, 10 by default):
    search = GridSearchCV(pipe, param_grid, cv=cv)
    
    # Learn relationship between predictors (basketball/tourney features) and outcome,
    # and the best parameters for defining such:
    search.fit(X_train, y_train)
    
    # Predictions on the test set, new data that haven't been introduced to the model:
    predicted = search.predict(X_test)
    
    # Predictions as probabilities:
    probabilities = search.predict_proba(X_test)[:, 1]
    
    # Accuracy scores for the training and test sets:
    train_accuracy = search.score(X_train, y_train)
    test_accuracy = search.score(X_test, y_test)

    print('Best Parameters: {}\n'.format(search.best_params_))
    print('Training Accuracy: {:0.2}'.format(train_accuracy))
    print('Test Accuracy: {:0.2}\n'.format(test_accuracy))
    
    # Confusion matrix labels:
    labels = np.array([['true losses','false wins'], ['false losses','true wins']])
    
    # Model evaluation metrics:
    confusion_mtrx = confusion_matrix(y_test, predicted)
    auc = roc_auc_score(y_test, probabilities)
    fpr, tpr, thresholds = roc_curve(y_test, probabilities)
    logloss = log_loss(y_test, search.predict_proba(X_test))
    
    # Plot all metrics in a grid of subplots:
    fig = plt.figure(figsize=(12, 12))
    grid = plt.GridSpec(2, 4, wspace=0.75, hspace=0.5)
    
    # Top-left plot - confusion matrix:
    plt.subplot(grid[0, :2])
    sns.heatmap(confusion_mtrx, annot=labels, fmt='')
    plt.xlabel('Predicted Games')
    plt.ylabel('Actual Games');
    
    # Top-right plot - ROC curve:
    plt.subplot(grid[0, 2:])
    plt.plot([0, 1], [0, 1], linestyle='--')
    plt.plot(fpr, tpr, marker='.')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('AUROC: {:0.3}'.format(auc));
    
    # Bottom-left plot - number of predicted wins and losses:
    plt.subplot(grid[1, :2])
    sns.countplot(y=predicted, orient='h')
    plt.yticks([1, 0], ('wins', 'losses'))
    plt.ylabel(''), plt.xlabel('Number Predicted');
    
    # Bottom-right plot - classification report:
    plt.subplot(grid[1, 2:])
    visualizer = ClassificationReport(search, classes=['losses', 'wins'])
    visualizer.fit(X_train, y_train)
    visualizer.score(X_test, y_test)
    g = visualizer.poof();
    
    return train_accuracy, test_accuracy, auc, logloss

In [None]:
def summary_of_models(train_accuracy, test_accuracy, auc, logloss, model_name='model_name'):
    """
    Return a one-row dataframe of evaluation metrics for a single model.
    """
    summary = pd.DataFrame({
        'model_name': model_name, 
        'Train Accuracy': round(train_accuracy, 3), 
        'Test Accuracy': round(test_accuracy, 3),
        'AUROC': round(auc, 3),
        'Log Loss': round(logloss, 3)
    }, index=[0])
    summary = summary[['model_name', 'Train Accuracy', 'Test Accuracy', 'AUROC', 'Log Loss']]
    
    return summary

<a href='#top' id='4.1'>return to menu</a>

## 4.1 Logistic Regression
>In this model, the probabilities describing the possible outcomes of a single trial are modeled... The implementation of logistic regression in [scikit-learn](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression) can be accessed from class `LogisticRegression`. This implementation can fit binary, One-vs-Rest, or multinomial logistic regression with optional L2 or L1 regularization.

In [None]:
warnings.filterwarnings(action='ignore', category=DeprecationWarning)
warnings.filterwarnings(action='ignore', category=UserWarning)

from sklearn.linear_model import LogisticRegression

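# Tune Logistic Regression for optimal regularization strength
# and regularization method (penalty). The liblinear solver is set explicitly
# because it supports both penalties; the lbfgs default in newer scikit-learn rejects 'l1':
lr_clf = LogisticRegression(solver='liblinear', random_state=32)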
lr_param_grid = {
    'clf__C': np.logspace(start=-10, stop=10, num=21),
    'clf__penalty': ['l1', 'l2']
}

train_accuracy, test_accuracy, auc, logloss = clfy_report(lr_clf, X_train, X_test, y_train, y_test, lr_param_grid, cv=10, scale=False)
logisticRegression = summary_of_models(train_accuracy, test_accuracy, auc, logloss, model_name='logisticRegression')

For this and all other classifiers, the following is the ideal representation from each graph (clockwise from top left):
1. The confusion matrix would be very light beige (high frequency) on the accuracy diagonal (true values) and black on the false diagonal.

2. The green line on the ROC graph would pull more toward 1 on the y-axis (true positive rate), making the area under the ROC (AUROC) closer to 1.

3. The classification report would be a very deep red, indicating high precision and recall (the F1 score is the harmonic mean of the two) for win and loss predictions.

4. The predicted wins and losses would be fairly even given the underlying data, but some unevenness could be expected because of the random splitting of training and test data.
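To make the numbers behind those graphs concrete, here is a small sketch that computes accuracy, precision, recall and F1 by hand from a confusion matrix. The counts are hypothetical, chosen only for easy arithmetic, and the layout assumes the same orientation as the heatmap (rows are actual games, columns are predicted games).

```python
import numpy as np

# Hypothetical confusion matrix:
# rows are actual (losses, wins), columns are predicted (losses, wins).
cm = np.array([[70, 30],
               [20, 80]])

true_losses, false_wins = cm[0]
false_losses, true_wins = cm[1]

# Share of all games that were predicted correctly:
accuracy = (true_losses + true_wins) / cm.sum()

# Of the games predicted as wins, how many really were wins:
precision_wins = true_wins / (true_wins + false_wins)

# Of the games that really were wins, how many were predicted as wins:
recall_wins = true_wins / (true_wins + false_losses)

# Harmonic mean of precision and recall:
f1_wins = 2 * precision_wins * recall_wins / (precision_wins + recall_wins)

print('accuracy:  {:.3f}'.format(accuracy))
print('precision: {:.3f}'.format(precision_wins))
print('recall:    {:.3f}'.format(recall_wins))
print('f1:        {:.3f}'.format(f1_wins))
```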

<a href='#top' id='4.2'>return to menu</a>

## 4.2 Support Vector Machine
>A support vector machine constructs a hyper-plane or set of hyper-planes in a high or infinite dimensional space, which can be used for classification, regression or other tasks. Intuitively, a good separation is achieved by the hyper-plane that has the largest distance to the nearest training data points of any class (so-called functional margin), since in general the larger the margin the lower the generalization error of the classifier.

[scikit-learn's](https://scikit-learn.org/stable/modules/svm.html#svm-classification) `SVC` class is among those available for support vector machine classification.

In [None]:
from sklearn import svm

# Tune Support Vector classification for optimal regularization strength 
# and the kernel coefficient for the default kernel type implemented, rbf:
svm_clf = svm.SVC(probability=True, random_state=32)
svm_param_grid = {
    'clf__C': np.logspace(start=-3, stop=3, num=7), 
    'clf__gamma': np.logspace(start=-4, stop=-1, num=4)
}

train_accuracy, test_accuracy, auc, logloss = clfy_report(svm_clf, X_train, X_test, y_train, y_test, svm_param_grid, cv=10, scale=False)
svmSVC = summary_of_models(train_accuracy, test_accuracy, auc, logloss, model_name='svmSVC')

# Start a summary of all models:
all_models = pd.concat([logisticRegression, svmSVC])

<a href='#top' id='4.3'>return to menu</a>

## 4.3 Decision Tree
>Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

[scikit-learn's](https://scikit-learn.org/stable/modules/tree.html#classification) `DecisionTreeClassifier` is the class available for decision tree classification.

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Tune Decision Tree classifier for
# minimum fraction of samples required to split an internal node,
# minimum fraction of samples required to be at a leaf node
# and function to measure the quality of a split:
dt_clf = DecisionTreeClassifier(random_state=32)
dt_param_grid = {
    'clf__min_samples_split': np.linspace(0.1, 0.5, 5),
    'clf__min_samples_leaf': np.linspace(0.1, 0.5, 5), 
    'clf__criterion': ['gini', 'entropy']
}

train_accuracy, test_accuracy, auc, logloss = clfy_report(dt_clf, X_train, X_test, y_train, y_test, dt_param_grid, cv=10, scale=False)
decisionTreeClassifier = summary_of_models(train_accuracy, test_accuracy, auc, logloss, model_name='decisionTreeClassifier')
all_models = pd.concat([all_models, decisionTreeClassifier])

<a href='#top' id='4.4'>return to menu</a>

## 4.4 Random Forest
>The [sklearn.ensemble](https://scikit-learn.org/stable/modules/ensemble.html#random-forests) module includes two averaging algorithms based on randomized decision trees: the RandomForest algorithm and the Extra-Trees method.
<br><br>
>In random forests (see `RandomForestClassifier` and `RandomForestRegressor` classes), each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set. In addition, when splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all features. Instead, the split that is picked is the best split among a random subset of the features.

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(random_state=32, n_jobs=-1)
rf_param_grid = {
    'clf__n_estimators': [16, 32, 64, 128],
    'clf__min_samples_leaf': [2, 4, 8, 16], 
    'clf__criterion': ['entropy']
}

train_accuracy, test_accuracy, auc, logloss = clfy_report(rf_clf, X_train, X_test, y_train, y_test, rf_param_grid, cv=10, scale=False)
randomForestClassifier = summary_of_models(train_accuracy, test_accuracy, auc, logloss, model_name='randomForestClassifier')
all_models = pd.concat([all_models, randomForestClassifier])

<a href='#top' id='4.5'>return to menu</a>

## 4.5 XGBoost
XGBoost is a library of gradient boosting algorithms. 
>Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do... [Wikipedia](https://en.wikipedia.org/wiki/Gradient_boosting)

In [None]:
import xgboost as xgb

xgb_clf = xgb.XGBClassifier(objective='binary:logistic', random_state=32)
xgb_param_grid = {
    'clf__max_depth': [2, 4, 8, 12],
    'clf__min_child_weight': [2, 4, 8],
    'clf__colsample_bytree': [0.25, 0.5, 0.75]
}

train_accuracy, test_accuracy, auc, logloss = clfy_report(xgb_clf, X_train, X_test, y_train, y_test, xgb_param_grid, cv=10, scale=False)
xgbClassifier = summary_of_models(train_accuracy, test_accuracy, auc, logloss, model_name='xgbClassifier')
all_models = pd.concat([all_models, xgbClassifier])

<a href='#top' id='4.6'>return to menu</a>

## 4.6 Best Model


In [None]:
# Sort all models by Test Accuracy:
compare_models = all_models.copy()
compare_models.set_index('model_name', inplace=True)
compare_models.sort_values(by=['Test Accuracy', 'AUROC'], ascending=False)

In [None]:
# Sort all models by Log Loss:
compare_models.sort_values(by=['Log Loss'], ascending=True)

<a href='#top' id='4.7'>return to menu</a>

## 4.7 Model Explainability
Which features were the most important in making predictions (what contributes most to a team's success in the tournament)?

### LOGISTIC REGRESSION

In [None]:
# Create a pipeline with the classifier only (no scaling step, matching scale=False above):
pipe = Pipeline([('clf', lr_clf)])
    
lr_search = GridSearchCV(pipe, lr_param_grid, cv=10)
lr_search.fit(X_train, y_train)
print(lr_search.best_params_)

perm = PermutationImportance(lr_search, random_state=1).fit(X_test, y_test)
eli5.show_weights(perm, feature_names=X_test.columns.tolist())

This means that when the `seed` column is randomly shuffled, the classifier's accuracy decreases by 0.1499 on average. Because the shuffle is repeated several times, the 0.0086 that follows reports how much that decrease varied from one shuffle to the next. The same reading applies to the other variables, which are listed in order of importance.

The negative permutation importance values (highlighted in light red) mean that accuracy was slightly higher after shuffling. Shuffled values cannot carry real information, so this indicates that the feature contributes essentially nothing to the model and the small gain is due to random chance.
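The shuffling idea can be sketched by hand in a few lines. This is not how `eli5` is implemented internally, just an illustration of the concept under simple assumptions: the data are synthetic (one informative column, one noise column), the helper name `permutation_importance_by_hand` is hypothetical, and accuracy is measured on the training data for brevity, whereas the notebook uses the test set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(1)

# Synthetic data: column 0 drives the outcome, column 1 is pure noise.
X = rng.normal(size=(500, 2))
y = (X[:, 0] + 0.1 * rng.normal(size=500) > 0).astype(int)

model = LogisticRegression().fit(X, y)

def permutation_importance_by_hand(model, X, y, column, n_shuffles=5):
    """Mean and std of the accuracy drop when one column is shuffled."""
    baseline = model.score(X, y)
    drops = []
    for _ in range(n_shuffles):
        X_shuffled = X.copy()
        # Break the link between this column and the outcome:
        X_shuffled[:, column] = rng.permutation(X_shuffled[:, column])
        drops.append(baseline - model.score(X_shuffled, y))
    return np.mean(drops), np.std(drops)

signal_mean, signal_std = permutation_importance_by_hand(model, X, y, column=0)
noise_mean, noise_std = permutation_importance_by_hand(model, X, y, column=1)

print('informative column: {:.4f} +/- {:.4f}'.format(signal_mean, signal_std))
print('noise column:       {:.4f} +/- {:.4f}'.format(noise_mean, noise_std))
```

The noise column's importance hovers around zero and can land slightly below it, which is the same effect as the light red rows in the table above.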

<a href='#top' id='4.8'>return to menu</a>

## 4.8 Make Predictions and Build Bracket
Using the template below of all possible matches for the 2018 tournament (with a placeholder prediction of 0.5 for every game), season and seed data are retrieved for every team ID and used to make predictions:

In [None]:
df_predict = pd.read_csv('../input/mens-machine-learning-competition-2018/SampleSubmissionStage2.csv')
df_predict.head()

In [None]:
def get_year_team1_team2(ID):
    """Return a tuple with the year, team1 and team2
    for each ID in the sample submission file of possible matches."""
    return tuple(int(x) for x in ID.split('_'))

In [None]:
def predict_poss_matches(clf, df_predict=df_predict, df_features=df_features):
    """
    Predict the probability that the first team in each possible match wins.
    """
    data = []

    for i, row in df_predict.iterrows():

        year, team1, team2 = get_year_team1_team2(row.ID)

        # Save 2018 stats/features for the first ID:
        team1 = df_features[(df_features['Season'] == year) & (df_features['TeamID'] == team1)].values[0]

        # Save 2018 stats/features for the second ID:
        team2 = df_features[(df_features['Season'] == year) & (df_features['TeamID'] == team2)].values[0]

        # The model input is the difference between the two teams' features:
        data.append(team1 - team2)

    # Index.get_values() was removed in pandas 1.0; the Index itself works as column labels:
    final_predictions = pd.DataFrame(np.array(data), columns=df_features.columns)
    final_predictions.drop(['Season', 'TeamID'], inplace=True, axis=1)
    predictions = clf.predict_proba(final_predictions)[:, 1]

    # Clip probabilities to limit the log loss penalty of a confident wrong pick:
    df_predict['Pred'] = np.clip(predictions, 0.05, 0.95)
    
    return df_predict

predict_poss_matches(lr_search).to_csv('best_model_results.csv', index=False) # LogLoss = 0.5970823250661108 Thanks to this website: https://www.marksmath.org/visualization/kaggle_brackets/
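The clipping step in `predict_poss_matches` is easier to appreciate with a little arithmetic. The sketch below uses only the standard library. The helper names `game_log_loss` and `clip` are hypothetical, and the formula is the usual binary log loss for a single game, which is assumed here to be what the LogLoss figure quoted above measures.

```python
import math

def game_log_loss(pred, outcome):
    """Log loss for one game: pred is P(team1 wins), outcome is 1 or 0."""
    return -(outcome * math.log(pred) + (1 - outcome) * math.log(1 - pred))

def clip(pred, low=0.05, high=0.95):
    """Keep a probability inside [low, high], like np.clip for a scalar."""
    return min(max(pred, low), high)

# A very confident pick that turns out wrong, without and with clipping:
confident_miss = game_log_loss(0.999, outcome=0)
clipped_miss = game_log_loss(clip(0.999), outcome=0)

# The same pick when it turns out right, without and with clipping:
confident_hit = game_log_loss(0.999, outcome=1)
clipped_hit = game_log_loss(clip(0.999), outcome=1)

print('miss: {:.3f} unclipped vs {:.3f} clipped'.format(confident_miss, clipped_miss))
print('hit:  {:.3f} unclipped vs {:.3f} clipped'.format(confident_hit, clipped_hit))
```

Clipping gives up a little on correct confident picks in exchange for capping the penalty on wrong ones at `-ln(0.05)`, which is roughly 3 per game.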

[Here](https://www.kaggle.com/c/mens-machine-learning-competition-2018/discussion/50200) is where I found the next gem, a package to automatically fill in the bracket using the prediction file:

In [None]:
%%capture
!pip install bracketeer

In [None]:
from bracketeer import build_bracket

b = build_bracket(
        outputPath='best_bracket.png',
        submissionPath='best_model_results.csv',
        teamsPath='../input/mens-machine-learning-competition-2018/datafiles/Teams.csv',
        seedsPath='../input/mens-machine-learning-competition-2018/Stage2UpdatedDataFiles/NCAATourneySeeds.csv',
        slotsPath='../input/mens-machine-learning-competition-2018/Stage2UpdatedDataFiles/NCAATourneySlots.csv',
        year=2018
)

I can't figure out the problem with the `build_bracket` cell above, so I ran it in a local notebook using the same predictions file, with help from this post: https://www.kaggle.com/rtatman/download-a-csv-file-from-a-kernel.

In [None]:
from IPython.display import HTML
import base64

def create_download_link(df, title='Download predictions file!', filename='best_model_results_kaggle.csv'): 
    """
    Create a text link to download dataframe as a CSV file.
    """
    csv = df.to_csv(index=False)
    b64 = base64.b64encode(csv.encode())
    payload = b64.decode()
    html = '<a download="{filename}" href="data:text/csv;base64,{payload}" target="_blank">{title}</a>'
    html = html.format(payload=payload, title=title, filename=filename)
    return HTML(html)

create_download_link(predict_poss_matches(lr_search))

<a href='#top' id='5'>return to menu</a>

---
## 5. RESULTS

The best model predicted the bracket shown below for 2018:

![](https://imgur.com/fPNSmO9.png)

<br>

I was never crazy about including seed differential in the models, mainly because there is little difference between the count of 12-5 upsets and the count of 10-7 upsets, yet the differences in seed number are 7 and 3.

Though all metrics were decent, and even without the benefit of 2018 results, I would have rejected this model based on its first-round picks: only 3 predicted upsets (win seed > lose seed), and those were 9s over 8s. Of course, 2018 results are known, and there were 20 upsets in total, 9 of them in the first round. I know the model is not trying to predict which games are going to be upsets, but this bracket would have been a major outlier in tournament history.

It does seem as though earning a decent seed is important for staying in the tournament, but that crazy first round... Maybe that unique upset group needs to be weighted a bit differently?

---
In the meantime, I must see a printed bracket from the best no-seed-data model...

In [None]:
df_features = df_avgs[['Season', 'TeamID', 'shoot_eff', 'score_op', 'off_rtg', 'def_rtg', 'sos', 'ie', 'efg_pct', 'to_poss', 'orb_pct', 'ft_rate', 'reb_pct', 'drb_pct', 'ts_pct', 'ast_rtio', 'blk_pct', 'stl_pct']]


df_tourney = df_tourney_all[(df_tourney_all.Season >= 2003) & (df_tourney_all.Season < 2018)]
# Reassign rather than reset in place, which avoids a SettingWithCopyWarning on the filtered slice:
df_tourney = df_tourney.reset_index(drop=True)

# Merge tourney games with tourney winners' season features:
df_winners = pd.merge(left=df_tourney[['Season', 'WTeamID', 'LTeamID']], right=df_features, how='left', left_on=['Season', 'WTeamID'], right_on=['Season', 'TeamID'])
df_winners.drop(['TeamID'], inplace=True, axis=1) 

# Merge tourney games with loser features:
df_losers = pd.merge(left=df_tourney[['Season', 'WTeamID', 'LTeamID']], right=df_features, how='left', left_on=['Season', 'LTeamID'], right_on=['Season', 'TeamID'])
df_losers.drop(['TeamID'], inplace=True, axis=1)

# Create winner target by subtracting loser data from winner data,
# and assigning a value of 1:
df_winner_diff = (df_winners.iloc[:, 3:] - df_losers.iloc[:, 3:])
df_winner_diff['result'] = 1

# Create loser target by subtracting winner data from loser data,
# and assigning a value of 0:
df_loser_diff = (df_losers.iloc[:, 3:] - df_winners.iloc[:, 3:])
df_loser_diff['result'] = 0

# Concatenate winner data with loser data:
df_model = pd.concat((df_winner_diff, df_loser_diff), axis=0)

X = df_model.iloc[:, :-1]
y = df_model.result

# Split the dataframe into 65% training and 35% testing:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35, random_state=32)

In [None]:
# liblinear supports both the 'l1' and 'l2' penalties in lr_param_grid:
lr_clf = LogisticRegression(solver='liblinear', random_state=32)
train_accuracy, test_accuracy, auc, logloss = clfy_report(lr_clf, X_train, X_test, y_train, y_test, lr_param_grid, cv=10, scale=False)
logisticRegression = summary_of_models(train_accuracy, test_accuracy, auc, logloss, model_name='logisticRegression')

In [None]:
svm_clf = svm.SVC(probability=True, random_state=32)
train_accuracy, test_accuracy, auc, logloss = clfy_report(svm_clf, X_train, X_test, y_train, y_test, svm_param_grid, cv=10, scale=False)
svmSVC = summary_of_models(train_accuracy, test_accuracy, auc, logloss, model_name='svmSVC')

# Start a summary of all models:
all_models = pd.concat([logisticRegression, svmSVC])

In [None]:
dt_clf = DecisionTreeClassifier(random_state=32)
train_accuracy, test_accuracy, auc, logloss = clfy_report(dt_clf, X_train, X_test, y_train, y_test, dt_param_grid, cv=10, scale=False)
decisionTreeClassifier = summary_of_models(train_accuracy, test_accuracy, auc, logloss, model_name='decisionTreeClassifier')
all_models = pd.concat([all_models, decisionTreeClassifier])

In [None]:
rf_clf = RandomForestClassifier(random_state=32, n_jobs=-1)
train_accuracy, test_accuracy, auc, logloss = clfy_report(rf_clf, X_train, X_test, y_train, y_test, rf_param_grid, cv=10, scale=False)
randomForestClassifier = summary_of_models(train_accuracy, test_accuracy, auc, logloss, model_name='randomForestClassifier')
all_models = pd.concat([all_models, randomForestClassifier])

In [None]:
xgb_clf = xgb.XGBClassifier(objective='binary:logistic', random_state=32)
train_accuracy, test_accuracy, auc, logloss = clfy_report(xgb_clf, X_train, X_test, y_train, y_test, xgb_param_grid, cv=10, scale=False)
xgbClassifier = summary_of_models(train_accuracy, test_accuracy, auc, logloss, model_name='xgbClassifier')
all_models = pd.concat([all_models, xgbClassifier])

In [None]:
# Sort all models by Test Accuracy:
compare_models2 = all_models.copy()
compare_models2.set_index('model_name', inplace=True)
compare_models2.sort_values(by=['Test Accuracy', 'AUROC'], ascending=False)

In [None]:
# For comparison, the earlier results from the models that included seed data:
compare_models.sort_values(by=['Test Accuracy', 'AUROC'], ascending=False)

### LOGISTIC REGRESSION

In [None]:
# Create a pipeline with the classifier only (no scaling step, matching scale=False above):
pipe = Pipeline([('clf', lr_clf)])
    
lr_search_ns = GridSearchCV(pipe, lr_param_grid, cv=10)
lr_search_ns.fit(X_train, y_train)
print(lr_search_ns.best_params_)

perm = PermutationImportance(lr_search_ns, random_state=1).fit(X_test, y_test)
eli5.show_weights(perm, feature_names=X_test.columns.tolist())

So, after Net and Offensive Efficiency, the logistic regression model benefited most from the Four Factors!

In [None]:
df_predict = pd.read_csv('../input/mens-machine-learning-competition-2018/SampleSubmissionStage2.csv')

def predict_poss_matches(clf, df_predict=df_predict, df_features=df_features):
    """
    Predict the probability that the first team in each possible match wins.
    """
    data = []

    for i, row in df_predict.iterrows():

        year, team1, team2 = get_year_team1_team2(row.ID)

        # Save 2018 stats/features for the first ID:
        team1 = df_features[(df_features['Season'] == year) & (df_features['TeamID'] == team1)].values[0]

        # Save 2018 stats/features for the second ID:
        team2 = df_features[(df_features['Season'] == year) & (df_features['TeamID'] == team2)].values[0]

        # The model input is the difference between the two teams' features:
        data.append(team1 - team2)

    # Index.get_values() was removed in pandas 1.0; the Index itself works as column labels:
    final_predictions = pd.DataFrame(np.array(data), columns=df_features.columns)
    final_predictions.drop(['Season', 'TeamID'], inplace=True, axis=1)
    predictions = clf.predict_proba(final_predictions)[:, 1]

    # Clip probabilities to limit the log loss penalty of a confident wrong pick:
    df_predict['Pred'] = np.clip(predictions, 0.05, 0.95)
    
    return df_predict

predict_poss_matches(lr_search_ns).to_csv('best_model_results_noseed.csv', index=False) # LogLoss = 0.6613054887607995 Thanks to this website: https://www.marksmath.org/visualization/kaggle_brackets/

# b = build_bracket(
#         outputPath='best_bracket_noseeds.png',
#         submissionPath='../output/best_model_results_noseed.csv',
#         teamsPath='../input/mens-machine-learning-competition-2018/datafiles/Teams.csv',
#         seedsPath='../input/mens-machine-learning-competition-2018/Stage2UpdatedDataFiles/NCAATourneySeeds.csv',
#         slotsPath='../input/mens-machine-learning-competition-2018/Stage2UpdatedDataFiles/NCAATourneySlots.csv',
#         year=2018
# )

create_download_link(predict_poss_matches(lr_search_ns), filename='best_model_results_noseed_kaggle.csv')

![](https://imgur.com/Ra7ciqj.png)

Well, this bracket performed worse. But had it been true, ignoring that improbability, nothing stands out that would have defied what we know (and love) about the tournament. At least, nothing quite like a tournament of better seeds winning nearly every game. Blah!

### Machine learning helps, but not enough to take over!
The best model with seed data correctly predicted Villanova as the champion, and ultimately predicted games more accurately. The best model without seed data did a crummy job with later rounds, was less accurate, but resembled March Madness.

This leaves me planning to explore better ways to capture seed importance. Until then, I may make hybrid picks (machine learning AND me), using this next (and last) series of graphs as a guide.

Recall that the modeled data are differences between tournament winners and losers, so each graph shows how that difference in a particular feature value influences the probability of a win. The method used (partial dependence plots) makes multiple predictions using all features while incrementally changing the value of the given feature and calculating the probability of a win for each new value.
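The procedure described above can be sketched by hand. This is an illustration of the idea rather than the plotting library's implementation: the data are synthetic, the helper name `partial_dependence_by_hand` is hypothetical, and the curve shown is the plain average predicted probability at each grid value.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)

# Synthetic "difference" features: column 0 helps, column 1 hurts, column 2 is noise.
X = rng.normal(size=(400, 3))
y = (2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=400) > 0).astype(int)
model = LogisticRegression().fit(X, y)

def partial_dependence_by_hand(model, X, column, grid):
    """Average predicted win probability with one column forced to each grid value."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        # Set the feature to the same value for every row, leaving all others as observed:
        X_mod[:, column] = value
        averages.append(model.predict_proba(X_mod)[:, 1].mean())
    return np.array(averages)

grid = np.linspace(-2, 2, 9)
pd_curve = partial_dependence_by_hand(model, X, column=0, grid=grid)

for value, prob in zip(grid, pd_curve):
    print('feature difference {:+.1f} -> average P(win) {:.3f}'.format(value, prob))
```

For a feature that helps a team win, the averaged probability rises as the difference moves from negative to positive, which is the shape to look for in the graphs that follow.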

In [None]:
sos = pdp.pdp_isolate(model=lr_search_ns, dataset=X_test, model_features=X_test.columns.tolist(), feature='sos')
pdp.pdp_plot(sos, 'Net Efficiency');

The first key takeaway is reassurance that what the model learned follows intuition. Generally, as the difference between a team and opponent becomes more positive, so does the probability of a win.

So, in this case, if all else is equal between two teams, the probability of a win increases by about 0.25 for a team whose Net Efficiency is 10 higher than its opponent's.

In [None]:
off_rtg = pdp.pdp_isolate(model=lr_search_ns, dataset=X_test, model_features=X_test.columns.tolist(), feature='off_rtg')
pdp.pdp_plot(off_rtg, 'Offensive Rating');

In [None]:
orb_pct = pdp.pdp_isolate(model=lr_search_ns, dataset=X_test, model_features=X_test.columns.tolist(), feature='orb_pct')
pdp.pdp_plot(orb_pct, 'Offensive Rebound Percentage');

---
Besides getting a handle on seed importance, there are certainly many other basketball statistics to consider - and good info/research available to do so. And non-basketball features - distance from home court perhaps. Or it may help to eliminate some. But I want to win too, so this is where my spirit of friendly competition ends! :)

<a id='seed' href='#top'>back to menu</a>