## Creating the Training Data

In this notebook, we will create the training data used by the various models for predicting scores in the NCAA basketball tournament. We will start by generating features for each dataset. Then, we will use blocking to reduce the training data to only games between tournament-caliber teams. Finally, we will combine the datasets to create training data for a machine learning model.

In [1]:
# Import packages
import sys
sys.path.append('../../College_Basketball')

import pandas as pd
import collegebasketball as cbb
cbb.__version__

'0.3'

## Feature Generation

Now that we have our data, we need to create some features for the ML algorithms. For each statistical attribute, there are three features: the attribute for the favored team, the attribute for the underdog, and the difference between the two (favored minus underdog). In every dataset, the favored team is defined as the team with the higher AdjEM on Kenpom. Using this system, a label of '1' represents an upset and a label of '0' means that the favored team won the game.

We will create a dataset with these features for each set of statistics (Kenpom, T-Rank, basic) for each year that these stats are available. Additionally, we will create a dataset that combines all three sets of statistics.
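The feature layout described above can be illustrated with a small sketch. This is not the implementation of `cbb.gen_kenpom_features`; the function name `make_matchup_features`, the `Home_Won` column, and the toy values are assumptions made purely for illustration. It assumes each input row holds one game with `<stat>_Home` and `<stat>_Away` columns.

```python
import numpy as np
import pandas as pd

def make_matchup_features(games, stat_cols, rating_col="AdjEM", winner_col="Home_Won"):
    """Hypothetical sketch: build <stat>_Fav, <stat>, <stat>_Diff and Label columns."""
    # The favored team is the side with the higher rating (ties go to the home side here)
    home_fav = (games[rating_col + "_Home"] >= games[rating_col + "_Away"]).to_numpy()
    out = pd.DataFrame(index=games.index)
    for col in stat_cols:
        home = games[col + "_Home"].to_numpy()
        away = games[col + "_Away"].to_numpy()
        out[col + "_Fav"] = np.where(home_fav, home, away)  # favored team's value
        out[col] = np.where(home_fav, away, home)           # underdog's value
        out[col + "_Diff"] = out[col + "_Fav"] - out[col]   # favored minus underdog
    home_won = games[winner_col].astype(bool).to_numpy()
    fav_won = np.where(home_fav, home_won, ~home_won)
    out["Label"] = (~fav_won).astype(int)  # 1 = upset, 0 = favored team won
    return out

# Toy data (made up): the home team wins both games
games = pd.DataFrame({
    "AdjEM_Home": [20.0, 5.0],
    "AdjEM_Away": [10.0, 15.0],
    "Home_Won": [True, True],
})
feats = make_matchup_features(games, ["AdjEM"])
print(feats)
```

In the second toy game the away team has the higher AdjEM, so the home win is recorded as an upset.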

In [2]:
# Load the joined datasets
load_path = '../Data/Combined_Data/'
kenpom_season = pd.read_csv('{}Kenpom.csv'.format(load_path))
kenpom_march = pd.read_csv('{}Kenpom_march.csv'.format(load_path))
TRank_season = pd.read_csv('{}TRank.csv'.format(load_path))
TRank_march = pd.read_csv('{}TRank_march.csv'.format(load_path))
stats_season = pd.read_csv('{}Basic.csv'.format(load_path))
stats_march = pd.read_csv('{}Basic_march.csv'.format(load_path))

In [3]:
# Generate features for Kenpom data
kenpom_season_vecs = cbb.gen_kenpom_features(kenpom_season)
kenpom_march_vecs = cbb.gen_kenpom_features(kenpom_march)

# Take a look
print("There are {} games in the Kenpom dataset.".format(len(kenpom_season_vecs)))
print("There are {} games in the march Kenpom dataset.".format(len(kenpom_march_vecs)))
kenpom_season_vecs.head(3)

There are 87521 games in the Kenpom dataset.
There are 1152 games in the march Kenpom dataset.


Unnamed: 0,Favored,Underdog,Year,Win_Loss_Fav,Win_Loss,Win_Loss_Diff,Rank_Fav,Rank,Rank_Diff,Seed_Fav,...,OppD Rank_Fav,OppD Rank,OppD Rank_Diff,NCSOS AdjEM_Fav,NCSOS AdjEM,NCSOS AdjEM_Diff,NCSOS AdjEM Rank_Fav,NCSOS AdjEM Rank,NCSOS AdjEM Rank_Diff,Label
0,Maryland,Arizona,2002,0.888889,0.705882,0.183007,3,13,-10,1.0,...,33,3,30,1.62,17.56,-15.94,120,1,119,1
1,Florida,Arizona,2002,0.709677,0.705882,0.003795,7,13,-6,5.0,...,25,3,22,-0.56,17.56,-18.12,173,1,172,1
2,Arizona,Wyoming,2002,0.705882,0.709677,-0.003795,13,67,-54,3.0,...,3,109,-106,17.56,-5.47,23.03,1,282,-281,0


Note that the T-Rank column names do not match those of the other datasets: T-Rank labels its win and loss columns with the shorter 'W' and 'L'. To merge properly with the other datasets, we need to rename these columns to match the Kenpom data.

In [4]:
# Fix some feature names so the T-Rank columns match the Kenpom naming
trank_renames = {'W_Home': 'Wins_Home', 'W_Away': 'Wins_Away',
                 'L_Away': 'Losses_Away', 'L_Home': 'Losses_Home'}
TRank_season.rename(columns=trank_renames, inplace=True)
TRank_march.rename(columns=trank_renames, inplace=True)

# Generate features for T-Rank data
TRank_season_vecs = cbb.gen_TRank_features(TRank_season)
TRank_march_vecs = cbb.gen_TRank_features(TRank_march)

# Take a look
print("There are {} games in the T-Rank dataset.".format(len(TRank_season_vecs)))
print("There are {} games in the march T-Rank dataset.".format(len(TRank_march_vecs)))
TRank_season_vecs.head(3)

There are 58957 games in the T-Rank dataset.
There are 761 games in the march T-Rank dataset.


Unnamed: 0,Favored,Underdog,Year,Win_Loss_Fav,Win_Loss,Win_Loss_Diff,Rk_Fav,Rk,Rk_Diff,Seed_Fav,...,WAB_Fav,WAB,WAB_Diff,WAB Rank_Fav,WAB Rank,WAB Rank_Diff,AdjEM_Fav,AdjEM,AdjEM_Diff,Label
0,Memphis,UT-Martin,2008,0.95,0.515152,0.434848,2,252,-250,1.0,...,8.9,-11.0,19.9,5,220,-215,31.51,-8.1,39.61,0
1,Memphis,Richmond,2008,0.95,0.516129,0.433871,2,143,-141,1.0,...,8.9,-5.5,14.4,5,131,-126,31.51,1.48,30.03,0
2,Memphis,Siena,2008,0.95,0.676471,0.273529,2,97,-95,1.0,...,8.9,-3.8,12.7,5,97,-92,31.51,7.99,23.52,0


In [5]:
# Generate features for basic stats data
stats_season_vecs = cbb.gen_basic_features(stats_season)
stats_march_vecs = cbb.gen_basic_features(stats_march)

# Take a look
print("There are {} games in the basic stats dataset.".format(len(stats_season_vecs)))
print("There are {} games in the march basic stats dataset.".format(len(stats_march_vecs)))
stats_season_vecs.head(3)

There are 48179 games in the basic stats dataset.
There are 570 games in the march basic stats dataset.


Unnamed: 0,Favored,Underdog,Year,Win_Loss_Fav,Win_Loss,Win_Loss_Diff,Tm._Fav,Tm.,Tm._Diff,Seed_Fav,...,PF_Fav,PF,PF_Diff,PF_opp_Fav,PF_opp,PF_opp_Diff,AdjEM_Fav,AdjEM,AdjEM_Diff,Label
0,UNC,Florida International,2010,0.540541,0.21875,0.321791,74.540541,68.0,6.540541,,...,15.243243,19.46875,-4.225507,19.27027,18.3125,0.95777,13.39,-14.45,27.84,0
1,UNC,Albany (NY),2010,0.540541,0.21875,0.321791,74.540541,62.71875,11.821791,,...,15.243243,18.8125,-3.569257,19.27027,16.75,2.52027,13.39,-13.16,26.55,0
2,UNC,William & Mary,2010,0.540541,0.666667,-0.126126,74.540541,67.060606,7.479934,,...,15.243243,16.242424,-0.999181,19.27027,16.333333,2.936937,13.39,6.58,6.81,0


Now that the features for each dataset have been generated, we can join them all to form one larger set of training data that contains all of their features. Since the basic stats dataset only goes back to 2010, this larger dataset will be restricted to games from 2010 onward.

Unfortunately, I ran into an issue: the winning percentage features from the Kenpom and T-Rank datasets sometimes differ slightly. As a temporary fix, I decided to use the Kenpom winning percentage for this larger dataset.
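One way to see how often the two sources disagree is to join them on the game keys and compare the winning percentage columns directly. The sketch below is an assumption-laden illustration, not part of `collegebasketball`: the helper name `win_pct_disagreement`, the tolerance, and the toy values are all hypothetical.

```python
import pandas as pd

def win_pct_disagreement(left, right, keys, col="Win_Loss_Fav", tol=1e-6):
    """Hypothetical helper: return the games where `col` differs between two tables."""
    merged = left.merge(right, on=keys, suffixes=("_kp", "_tr"))
    gap = (merged[col + "_kp"] - merged[col + "_tr"]).abs()
    return merged.loc[gap > tol, keys + [col + "_kp", col + "_tr"]]

# Toy data (made up) standing in for the Kenpom and T-Rank feature tables
keys = ["Favored", "Underdog", "Year"]
kp = pd.DataFrame({"Favored": ["A", "B"], "Underdog": ["C", "D"],
                   "Year": [2010, 2010], "Win_Loss_Fav": [0.75, 0.60]})
tr = pd.DataFrame({"Favored": ["A", "B"], "Underdog": ["C", "D"],
                   "Year": [2010, 2010], "Win_Loss_Fav": [0.75, 0.625]})
mismatches = win_pct_disagreement(kp, tr, keys)
print(mismatches)
```

Running a check like this on the real tables would show whether the differences are rounding noise or something larger, which could inform a more permanent fix.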

In [6]:
# Columns to join on: the Kenpom/T-Rank join skips winning percentage, the basic stats join includes it
on_cols_kp_tr = ['Favored', 'Underdog', 'Year', 'Seed_Fav', 'Seed', 'Label', 'AdjEM_Fav', 'AdjEM', 'AdjEM_Diff']
on_cols_stats = on_cols_kp_tr + ['Win_Loss_Fav', 'Win_Loss', 'Win_Loss_Diff']

In [7]:
# Add an id column to the kenpom dataset (copy first so the original table is not modified through a slice)
all_season = kenpom_season_vecs[kenpom_season_vecs['Year'] > 2009].copy()
all_season.reset_index(level=0, inplace=True)

# Create a set of training data for years with all features
all_season = all_season.merge(TRank_season_vecs[TRank_season_vecs['Year'] > 2009], on=on_cols_kp_tr)
all_season = all_season.rename(columns={'Win_Loss_Fav_x': 'Win_Loss_Fav', 'Win_Loss_x': 'Win_Loss', 'Win_Loss_Diff_x': 'Win_Loss_Diff'})
all_season = all_season.drop(['Win_Loss_Fav_y', 'Win_Loss_y', 'Win_Loss_Diff_y'], axis=1)
all_season = all_season.merge(stats_season_vecs, on=on_cols_stats)
all_season = all_season.drop_duplicates('index').drop('index', axis=1)

all_march = kenpom_march_vecs[kenpom_march_vecs['Year'] > 2009].merge(TRank_march_vecs[TRank_march_vecs['Year'] > 2009], on=on_cols_kp_tr)
all_march = all_march.rename(columns={'Win_Loss_Fav_x': 'Win_Loss_Fav', 'Win_Loss_x': 'Win_Loss', 'Win_Loss_Diff_x': 'Win_Loss_Diff'})
all_march = all_march.drop(['Win_Loss_Fav_y', 'Win_Loss_y', 'Win_Loss_Diff_y'], axis=1)
all_march = all_march.merge(stats_march_vecs, on=on_cols_stats)

# Take a look
print("There are {} games in the dataset.".format(len(all_season)))
print("There are {} games in the march dataset.".format(len(all_march)))
all_season.head()

There are 48179 games in the dataset.
There are 570 games in the march dataset.


Unnamed: 0,Favored,Underdog,Year,Win_Loss_Fav,Win_Loss,Win_Loss_Diff,Rank_Fav,Rank,Rank_Diff,Seed_Fav,...,TOV_Diff,TOV_opp_Fav,TOV_opp,TOV_opp_Diff,PF_Fav,PF,PF_Diff,PF_opp_Fav,PF_opp,PF_opp_Diff
0,UNC,Florida International,2010,0.540541,0.21875,0.321791,61,306,-245,,...,1.595439,13.72973,14.71875,-0.98902,15.243243,19.46875,-4.225507,19.27027,18.3125,0.95777
1,UNC,Albany (NY),2010,0.540541,0.21875,0.321791,61,295,-234,,...,-0.967061,13.72973,12.25,1.47973,15.243243,18.8125,-3.569257,19.27027,16.75,2.52027
2,UNC,William & Mary,2010,0.540541,0.666667,-0.126126,61,107,-46,,...,4.371007,13.72973,10.545455,3.184275,15.243243,16.242424,-0.999181,19.27027,16.333333,2.936937
3,UNC,Valparaiso,2010,0.540541,0.46875,0.071791,61,183,-122,,...,1.470439,13.72973,13.46875,0.26098,15.243243,20.125,-4.881757,19.27027,17.9375,1.33277
4,Wake Forest,UNC,2010,0.645161,0.540541,0.104621,58,61,-3,9.0,...,0.004359,13.741935,13.72973,0.012206,20.129032,15.243243,4.885789,20.903226,19.27027,1.632956


## Blocking

We now have features for every Division I game from the 2002 season through the 2017 season. However, we can improve the accuracy of our models by removing results unrelated to our test set. Since the goal of this project is specifically to predict NCAA Tournament games, we will remove any games involving teams that are not tournament caliber.
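The internals of `cbb.block_table` are not shown in this notebook, so the following is only a sketch of what a blocking rule of this kind could look like. It assumes, hypothetically, that a game is kept when both teams' AdjEM values meet a minimum threshold; the function name `block_by_rating`, the threshold, and the toy values are illustrative and not taken from the package.

```python
import pandas as pd

def block_by_rating(vecs, min_adj_em=0.0):
    """Hypothetical blocking rule: keep games where both teams meet a rating floor."""
    keep = (vecs["AdjEM_Fav"] >= min_adj_em) & (vecs["AdjEM"] >= min_adj_em)
    return vecs[keep].reset_index(drop=True)

# Toy feature table (made up) with one game involving a weak underdog
vecs = pd.DataFrame({
    "Favored": ["X", "Y", "Y"],
    "Underdog": ["P", "Q", "R"],
    "AdjEM_Fav": [30.0, 12.0, 12.0],
    "AdjEM": [8.0, -14.0, 6.5],
    "Label": [0, 0, 1],
})
blocked = block_by_rating(vecs, min_adj_em=0.0)
print(blocked)
```

A rule like this depends on the AdjEM columns being present in every table, which is consistent with those columns being dropped from the T-Rank and basic stats tables only after blocking.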

In [8]:
print('We have Kenpom data for ' + str(len(kenpom_season_vecs) + len(kenpom_march_vecs)) + ' games.')
print(str(len(kenpom_season_vecs[kenpom_season_vecs['Label'] == 1]) 
          + len(kenpom_march_vecs[kenpom_march_vecs['Label'] == 1])) + ' of those games are upsets')
print('We have T-Rank data for ' + str(len(TRank_season_vecs) + len(TRank_march_vecs)) + ' games.')
print(str(len(TRank_season_vecs[TRank_season_vecs['Label'] == 1]) 
          + len(TRank_march_vecs[TRank_march_vecs['Label'] == 1])) + ' of those games are upsets')
print('We have Stats data for ' + str(len(stats_season_vecs) + len(stats_march_vecs)) + ' games.')
print(str(len(stats_season_vecs[stats_season_vecs['Label'] == 1]) 
          + len(stats_march_vecs[stats_march_vecs['Label'] == 1])) + ' of those games are upsets')

We have Kenpom data for 88673 games.
22066 of those games are upsets
We have T-Rank data for 59718 games.
14818 of those games are upsets
We have Stats data for 48749 games.
12102 of those games are upsets


In [9]:
# Block the feature vector tables for full season data
kenpom_season_vecs = cbb.block_table(kenpom_season_vecs)
TRank_season_vecs = cbb.block_table(TRank_season_vecs)
stats_season_vecs = cbb.block_table(stats_season_vecs)
all_season = cbb.block_table(all_season)

In [10]:
# Drop the kenpom columns from the TRank and basic stats data sets now that blocking is completed
drop_cols = ['AdjEM_Fav', 'AdjEM', 'AdjEM_Diff']
TRank_season_vecs = TRank_season_vecs.drop(drop_cols, axis=1)
TRank_march_vecs = TRank_march_vecs.drop(drop_cols, axis=1)
stats_season_vecs = stats_season_vecs.drop(drop_cols, axis=1)
stats_march_vecs = stats_march_vecs.drop(drop_cols, axis=1)

In [11]:
print('We have Kenpom data for ' + str(len(kenpom_season_vecs) + len(kenpom_march_vecs)) + ' games.')
print(str(len(kenpom_season_vecs[kenpom_season_vecs['Label'] == 1]) 
          + len(kenpom_march_vecs[kenpom_march_vecs['Label'] == 1])) + ' of those games are upsets')
print('We have T-Rank data for ' + str(len(TRank_season_vecs) + len(TRank_march_vecs)) + ' games.')
print(str(len(TRank_season_vecs[TRank_season_vecs['Label'] == 1]) 
          + len(TRank_march_vecs[TRank_march_vecs['Label'] == 1])) + ' of those games are upsets')
print('We have Stats data for ' + str(len(stats_season_vecs) + len(stats_march_vecs)) + ' games.')
print(str(len(stats_season_vecs[stats_season_vecs['Label'] == 1]) 
          + len(stats_march_vecs[stats_march_vecs['Label'] == 1])) + ' of those games are upsets')

We have Kenpom data for 33715 games.
8775 of those games are upsets
We have T-Rank data for 22433 games.
5837 of those games are upsets
We have Stats data for 18459 games.
4819 of those games are upsets
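The before and after counts above are produced by the same expressions repeated for each dataset. A small helper could compute them in one place and also report the share of upsets. This is a sketch only; the name `summarize` and the toy tables are hypothetical.

```python
import pandas as pd

def summarize(name, *tables):
    """Hypothetical helper: total games, upsets, and upset share across tables."""
    total = sum(len(t) for t in tables)
    upsets = sum(int((t["Label"] == 1).sum()) for t in tables)
    rate = upsets / total if total else float("nan")
    return {"dataset": name, "games": total, "upsets": upsets, "upset_rate": rate}

# Toy tables (made up) standing in for a season table and a march table
season = pd.DataFrame({"Label": [0, 1, 0, 0]})
march = pd.DataFrame({"Label": [1, 0]})
summary = summarize("toy", season, march)
print("We have {dataset} data for {games} games, {upsets} of them upsets "
      "({upset_rate:.1%}).".format(**summary))
```

With the real tables, the call would look like `summarize('Kenpom', kenpom_season_vecs, kenpom_march_vecs)`, which makes it easy to compare the upset share before and after blocking.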


In [12]:
# Now save all of the training datasets 
path = '../Data/Training/'
kenpom_season_vecs.to_csv('{}kenpom_season.csv'.format(path), index=False)
kenpom_march_vecs.to_csv('{}kenpom_march.csv'.format(path), index=False)
TRank_season_vecs.to_csv('{}TRank_season.csv'.format(path), index=False)
TRank_march_vecs.to_csv('{}TRank_march.csv'.format(path), index=False)
stats_season_vecs.to_csv('{}stats_season.csv'.format(path), index=False)
stats_march_vecs.to_csv('{}stats_march.csv'.format(path), index=False)
all_season.to_csv('{}all_season.csv'.format(path), index=False)
all_march.to_csv('{}all_march.csv'.format(path), index=False)