## Creating the Training Data

In this notebook, we will create the training data used by the various models for predicting scores in the NCAA basketball tournament. We will start by generating features for each dataset. Then, we will use blocking to reduce the training data to games between tournament-caliber teams. Finally, we will combine the datasets to create training data for a machine learning model.

In [1]:
# Import packages
import sys
sys.path.append('../../College_Basketball')

import pandas as pd
import collegebasketball as cbb
cbb.__version__

'0.3'

## Feature Generation

Now that we have our data, we need to create some features for the ML algorithms. For each statistical attribute, there are three features: the attribute for the favored team (suffix `_Fav`), the attribute for the underdog (no suffix), and the difference between the two (suffix `_Diff`). The favored team is defined as the team with the higher Kenpom AdjEM in each dataset. Under this scheme, a label of '1' represents an upset and a label of '0' means that the favored team won the game.
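
The naming and labeling scheme can be illustrated with a small sketch. This is not the `collegebasketball` implementation: the input layout (`Winner`, `Loser`, `AdjEM_Winner`, `AdjEM_Loser`), the helper `make_features`, and the rule that ties count as the winner being favored are all assumptions made for illustration, and the team names and ratings are made up.

```python
import numpy as np
import pandas as pd

# Made-up game results; the column layout is an assumption for this sketch.
games = pd.DataFrame({
    'Winner': ['Team A', 'Team D'],
    'Loser': ['Team B', 'Team C'],
    'AdjEM_Winner': [20.0, 5.0],
    'AdjEM_Loser': [10.0, 12.5],
})

def make_features(games):
    """Hypothetical helper: build favored/underdog/difference features and the upset label."""
    # The winner counts as favored when its AdjEM is at least the loser's (tie rule is assumed)
    winner_favored = games['AdjEM_Winner'] >= games['AdjEM_Loser']
    out = pd.DataFrame({
        'Favored': np.where(winner_favored, games['Winner'], games['Loser']),
        'Underdog': np.where(winner_favored, games['Loser'], games['Winner']),
        'AdjEM_Fav': np.where(winner_favored, games['AdjEM_Winner'], games['AdjEM_Loser']),
        'AdjEM': np.where(winner_favored, games['AdjEM_Loser'], games['AdjEM_Winner']),
    })
    # Difference is favored minus underdog
    out['AdjEM_Diff'] = out['AdjEM_Fav'] - out['AdjEM']
    # Label 1 marks an upset (the underdog won); 0 means the favored team won
    out['Label'] = (~winner_favored).astype(int)
    return out

toy_vecs = make_features(games)
```

In this toy data the second game is an upset, since the winner (Team D) has the lower AdjEM, so it gets a label of 1.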

We will create a dataset with these features for each set of statistics (Kenpom, T-Rank, and basic stats) for each year in which those statistics are available. Additionally, we will create a combined dataset that includes all three sets of statistics.

In [2]:
# Load the joined datasets
load_path = '../Data/Combined_Data/'
kenpom = pd.read_csv(f'{load_path}Kenpom.csv')
TRank = pd.read_csv(f'{load_path}TRank.csv')
stats = pd.read_csv(f'{load_path}Basic.csv')

In [3]:
# Generate features for Kenpom data
kenpom_vecs = cbb.gen_kenpom_features(kenpom)

# Take a look
print(f'There are {len(kenpom_vecs)} games in the Kenpom dataset.')
print(f'There are {len(cbb.filter_tournament(kenpom_vecs))} tournament games in the Kenpom dataset.')
kenpom_vecs.head(3)

There are 88789 games in the Kenpom dataset.
There are 1112 tournament games in the Kenpom dataset.


Unnamed: 0,Favored,Underdog,Year,Tournament,Win_Loss_Fav,Win_Loss,Win_Loss_Diff,Rank_Fav,Rank,Rank_Diff,...,OppD Rank_Fav,OppD Rank,OppD Rank_Diff,NCSOS AdjEM_Fav,NCSOS AdjEM,NCSOS AdjEM_Diff,NCSOS AdjEM Rank_Fav,NCSOS AdjEM Rank,NCSOS AdjEM Rank_Diff,Label
0,Maryland,Arizona,2002,,0.888889,0.705882,0.183007,3,13,-10,...,33,3,30,1.62,17.56,-15.94,120,1,119,1
1,Florida,Arizona,2002,,0.709677,0.705882,0.003795,7,13,-6,...,25,3,22,-0.56,17.56,-18.12,173,1,172,1
2,Arizona,Wyoming,2002,"NCAA, West - Second Round",0.705882,0.709677,-0.003795,13,67,-54,...,3,109,-106,17.56,-5.47,23.03,1,282,-281,0


In [4]:
# Generate features for T-Rank data
TRank_vecs = cbb.gen_TRank_features(TRank)

# Take a look
print(f'There are {len(TRank_vecs)} games in the T-Rank dataset.')
print(f'There are {len(cbb.filter_tournament(TRank_vecs))} games in the march T-Rank dataset.')
TRank_vecs.head(3)

There are 59836 games in the T-Rank dataset.
There are 728 games in the march T-Rank dataset.


Unnamed: 0,Favored,Underdog,Year,Tournament,Win_Loss_Fav,Win_Loss,Win_Loss_Diff,Rk_Fav,Rk,Rk_Diff,...,WAB_Fav,WAB,WAB_Diff,WAB Rank_Fav,WAB Rank,WAB Rank_Diff,AdjEM_Fav,AdjEM,AdjEM_Diff,Label
0,Memphis,UT-Martin,2008,,0.95,0.515152,0.434848,2,252,-250,...,8.9,-11.0,19.9,5,220,-215,31.51,-8.1,39.61,0
1,Memphis,Richmond,2008,,0.95,0.516129,0.433871,2,143,-141,...,8.9,-5.5,14.4,5,131,-126,31.51,1.48,30.03,0
2,Memphis,Siena,2008,,0.95,0.676471,0.273529,2,97,-95,...,8.9,-3.8,12.7,5,97,-92,31.51,7.99,23.52,0


In [5]:
# Generate features for basic stats data
stats_vecs = cbb.gen_basic_features(stats)

# Take a look
print(f'There are {len(stats_vecs)} games in the basic stats dataset.')
print(f'There are {len(cbb.filter_tournament(stats_vecs))} games in the march basic stats dataset.')
stats_vecs.head(3)

There are 48867 games in the basic stats dataset.
There are 600 games in the march basic stats dataset.


Unnamed: 0,Favored,Underdog,Year,Tournament,Win_Loss_Fav,Win_Loss,Win_Loss_Diff,Tm._Fav,Tm.,Tm._Diff,...,PF_Fav,PF,PF_Diff,PF_opp_Fav,PF_opp,PF_opp_Diff,AdjEM_Fav,AdjEM,AdjEM_Diff,Label
0,UNC,Florida International,2010,,0.540541,0.21875,0.321791,74.540541,68.0,6.540541,...,15.243243,19.46875,-4.225507,19.27027,18.3125,0.95777,13.39,-14.45,27.84,0
1,UNC,Albany (NY),2010,,0.540541,0.21875,0.321791,74.540541,62.71875,11.821791,...,15.243243,18.8125,-3.569257,19.27027,16.75,2.52027,13.39,-13.16,26.55,0
2,UNC,William & Mary,2010,NIT,0.540541,0.666667,-0.126126,74.540541,67.060606,7.479934,...,15.243243,16.242424,-0.999181,19.27027,16.333333,2.936937,13.39,6.58,6.81,0


Now that the features for each dataset have been generated, we can join them to form one larger set of training data that contains all of their features. Since the basic stats dataset only goes back to 2010, this larger dataset is restricted to games from 2010 onward.

Unfortunately, I ran into an issue: the winning percentage features from the Kenpom and T-Rank datasets sometimes differ slightly. As a temporary fix, I decided to use the Kenpom winning percentage for this larger dataset.
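
One way to gauge how often the two sources disagree is to join them on the game keys and compare the winning percentages side by side. The following is a sketch on made-up rows, not the notebook's data: the frames `kp` and `tr`, the reduced key list, and the `1e-6` tolerance are assumptions for illustration.

```python
import pandas as pd

# Made-up rows standing in for the Kenpom and T-Rank feature vectors
kp = pd.DataFrame({
    'Favored': ['Team A', 'Team C'],
    'Underdog': ['Team B', 'Team D'],
    'Year': [2010, 2010],
    'Win_Loss_Fav': [0.75, 0.600],
})
tr = pd.DataFrame({
    'Favored': ['Team A', 'Team C'],
    'Underdog': ['Team B', 'Team D'],
    'Year': [2010, 2010],
    'Win_Loss_Fav': [0.75, 0.625],
})

# Join on the game keys only, so differing winning percentages do not drop rows
keys = ['Favored', 'Underdog', 'Year']
both = kp.merge(tr, on=keys, suffixes=('_kp', '_tr'))

# Absolute gap between the two sources; the tolerance is an arbitrary choice
both['gap'] = (both['Win_Loss_Fav_kp'] - both['Win_Loss_Fav_tr']).abs()
mismatched = both[both['gap'] > 1e-6]
```

Leaving the winning percentage columns out of the join keys matters here: an inner join on a float column that differs between sources would silently discard the mismatched games.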

In [6]:
# Define the columns shared across datasets, used as join keys
on_cols_kp_tr = ['Favored', 'Underdog', 'Year', 'Tournament', 'Seed_Fav', 'Seed', 'Label', 'AdjEM_Fav', 'AdjEM', 'AdjEM_Diff']
on_cols_stats = on_cols_kp_tr + ['Win_Loss_Fav', 'Win_Loss', 'Win_Loss_Diff']

In [7]:
# Keep Kenpom games from 2010 onward and add an 'index' id column (used later to drop duplicates)
all_vecs = kenpom_vecs[kenpom_vecs['Year'] > 2009].reset_index()

# Create a set of training data for years with all features
all_vecs = all_vecs.merge(TRank_vecs[TRank_vecs['Year'] > 2009], on=on_cols_kp_tr)
all_vecs = all_vecs.rename(columns={'Win_Loss_Fav_x': 'Win_Loss_Fav', 'Win_Loss_x': 'Win_Loss', 'Win_Loss_Diff_x': 'Win_Loss_Diff'})
all_vecs = all_vecs.drop(['Win_Loss_Fav_y', 'Win_Loss_y', 'Win_Loss_Diff_y'], axis=1)
all_vecs = all_vecs.merge(stats_vecs, on=on_cols_stats)
all_vecs = all_vecs.drop_duplicates('index').drop('index', axis=1)

# Take a look
print(f'There are {len(all_vecs)} games in the dataset.')
print(f'There are {len(cbb.filter_tournament(all_vecs))} games in the march dataset.')
all_vecs.head()

There are 48867 games in the dataset.
There are 600 games in the march dataset.


Unnamed: 0,Favored,Underdog,Year,Tournament,Win_Loss_Fav,Win_Loss,Win_Loss_Diff,Rank_Fav,Rank,Rank_Diff,...,TOV_Diff,TOV_opp_Fav,TOV_opp,TOV_opp_Diff,PF_Fav,PF,PF_Diff,PF_opp_Fav,PF_opp,PF_opp_Diff
0,UNC,Florida International,2010,,0.540541,0.21875,0.321791,61,306,-245,...,1.595439,13.72973,14.71875,-0.98902,15.243243,19.46875,-4.225507,19.27027,18.3125,0.95777
1,UNC,Albany (NY),2010,,0.540541,0.21875,0.321791,61,295,-234,...,-0.967061,13.72973,12.25,1.47973,15.243243,18.8125,-3.569257,19.27027,16.75,2.52027
2,UNC,William & Mary,2010,NIT,0.540541,0.666667,-0.126126,61,107,-46,...,4.371007,13.72973,10.545455,3.184275,15.243243,16.242424,-0.999181,19.27027,16.333333,2.936937
3,UNC,Valparaiso,2010,,0.540541,0.46875,0.071791,61,183,-122,...,1.470439,13.72973,13.46875,0.26098,15.243243,20.125,-4.881757,19.27027,17.9375,1.33277
4,Wake Forest,UNC,2010,,0.645161,0.540541,0.104621,58,61,-3,...,0.004359,13.741935,13.72973,0.012206,20.129032,15.243243,4.885789,20.903226,19.27027,1.632956
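
The `drop_duplicates('index')` step in the cell above is presumably needed because the join keys are not unique: two teams can meet more than once in a season, and a merge on repeated keys returns every pairing of the matching rows. The sketch below shows this on made-up data; the frames and the `Stat_kp`/`Stat_tr` columns are hypothetical.

```python
import pandas as pd

# Two teams meet twice in the same season, so the join keys repeat in both frames
left = pd.DataFrame({
    'Favored': ['Team A', 'Team A'],
    'Underdog': ['Team B', 'Team B'],
    'Year': [2010, 2010],
    'Stat_kp': [1.0, 1.0],
})
right = pd.DataFrame({
    'Favored': ['Team A', 'Team A'],
    'Underdog': ['Team B', 'Team B'],
    'Year': [2010, 2010],
    'Stat_tr': [2.0, 2.0],
})

keys = ['Favored', 'Underdog', 'Year']

# reset_index() adds an 'index' column that identifies each original left-hand row
left_id = left.reset_index()

# Each left row matches both right rows: 2 x 2 = 4 rows after the merge
merged = left_id.merge(right, on=keys)

# Keep one merged row per original game, then discard the helper column
deduped = merged.drop_duplicates('index').drop('index', axis=1)
```

Because the features are season-level statistics, the duplicated rows carry the same values, so keeping the first match per original row loses nothing in this toy case.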


In [8]:
# Now save all of the feature vectors
path = '../Data/Feature_Vectors/'
kenpom_vecs.to_csv(f'{path}kenpom.csv', index=False)
TRank_vecs.to_csv(f'{path}TRank.csv', index=False)
stats_vecs.to_csv(f'{path}stats.csv', index=False)
all_vecs.to_csv(f'{path}training.csv', index=False)