## Creating the Training Data

In this notebook, we create the training data used by the various models for predicting scores in the NCAA basketball tournament. We start by generating features from the Kenpom, T-Rank, and basic statistics data. Then, we use blocking to reduce the training data to games between tournament-caliber teams. Finally, we combine data sets to create training data for a Kenpom model, a T-Rank model, a basic statistics model, and a model that uses all of the available data.

In [1]:
# Import packages
import sys
sys.path.append('/Users/phil/Documents/Documents/College_Basketball')

import pandas as pd
import collegebasketball as cbb
cbb.__version__

'0.2'

## Feature Generation

Now that we have our data, we need to create some features for the ML algorithms. For each statistical attribute, there are three features: the attribute for the favored team, the attribute for the underdog, and the difference between the two. In every data set, the favored team is defined as the team with the higher Kenpom AdjEM (adjusted efficiency margin). Under this system, a label of '1' represents an upset and a label of '0' means that the favored team won the game.

We will create a data set with these features for each set of statistics (Kenpom, T-Rank, basic) for each year that the statistics are available. Additionally, we will create a data set that combines all three sets of statistics.
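
The favored/underdog convention described above can be illustrated with a minimal sketch. This is not the `collegebasketball` implementation: the input layout (`Home`, `Away`, `Winner` columns), the helper name `gen_features_sketch`, and the sign of the difference (favored minus underdog) are all assumptions made for illustration. Only the output column names `AdjEM_Fav`, `AdjEM`, `AdjEM_Diff`, and `Label` follow the notebook.

```python
import pandas as pd

# Hypothetical input: one row per game with each team's AdjEM and the winner.
games = pd.DataFrame({
    'Home': ['A', 'C'],
    'Away': ['B', 'D'],
    'AdjEM_Home': [28.0, 12.5],
    'AdjEM_Away': [9.0, 25.0],
    'Winner': ['A', 'C'],
})

def gen_features_sketch(games):
    """Illustrative only; not the collegebasketball implementation."""
    # The favored team is the one with the higher AdjEM
    home_favored = games['AdjEM_Home'] >= games['AdjEM_Away']
    out = pd.DataFrame({
        'Favored': games['Home'].where(home_favored, games['Away']),
        'Underdog': games['Away'].where(home_favored, games['Home']),
        'AdjEM_Fav': games['AdjEM_Home'].where(home_favored, games['AdjEM_Away']),
        'AdjEM': games['AdjEM_Away'].where(home_favored, games['AdjEM_Home']),
    })
    # Assumed sign convention: favored minus underdog
    out['AdjEM_Diff'] = out['AdjEM_Fav'] - out['AdjEM']
    # Label 1 = upset (the underdog won), 0 = the favored team won
    out['Label'] = (games['Winner'] == out['Underdog']).astype(int)
    return out

features = gen_features_sketch(games)
print(features)
```

In the toy input, the second game is an upset: team D has the higher AdjEM but team C wins, so that row gets a label of 1.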

In [2]:
# Store a dataframe of features for each year in a list, one list per data source
kenpom_season_data = []
kenpom_march_data = []
TRank_season_data = []
TRank_march_data = []
stats_season_data = []
stats_march_data = []
all_season_data = []
all_march_data = []

# Generate features for each year of data
on_cols = ['Favored', 'Underdog', 'Year', 'Win_Loss_Fav', 'Win_Loss', 'Win_Loss_Diff', 'Label']
for year in range(2002, 2019):
    
    # Load combined data for this season
    kenpom_season = cbb.load_csv('../Data/Combined_Data/Kenpom/{}_regular_season.csv'.format(year))
    if year < 2019:
        kenpom_march = cbb.load_csv('../Data/Combined_Data/Kenpom/{}_march.csv'.format(year))
    if year > 2007:
        TRank_season = cbb.load_csv('../Data/Combined_Data/TRank/{}_regular_season.csv'.format(year))
        if year < 2019:
            TRank_march = cbb.load_csv('../Data/Combined_Data/TRank/{}_march.csv'.format(year))
    if year > 2009:
        stats_season = cbb.load_csv('../Data/Combined_Data/Basic/{}_regular_season.csv'.format(year))
        if year < 2019:
            stats_march = cbb.load_csv('../Data/Combined_Data/Basic/{}_march.csv'.format(year))
    
    # Generate features for each data set
    kenpom_season_data.append(cbb.gen_kenpom_features(kenpom_season))
    kenpom_march_data.append(cbb.gen_kenpom_features(kenpom_march))
    if year > 2007:
        TRank_season_data.append(cbb.gen_TRank_features(TRank_season, kenpom_season))
        if year < 2019:
            TRank_march_data.append(cbb.gen_TRank_features(TRank_march, kenpom_march))
    if year > 2009:
        stats_season_data.append(cbb.gen_basic_features(stats_season, kenpom_season))
        if year < 2019:
            stats_march_data.append(cbb.gen_basic_features(stats_march, kenpom_march))
        
    # Combine all features into a single data set
    if year > 2009:
        all_season = kenpom_season_data[-1].merge(TRank_season_data[-1], on=on_cols)
        all_season_data.append(all_season.merge(stats_season_data[-1], on=on_cols))
        if year < 2019:
            all_march = kenpom_march_data[-1].merge(TRank_march_data[-1], on=on_cols)
            all_march_data.append(all_march.merge(stats_march_data[-1], on=on_cols))
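
The year checks in the loop above encode which data sources exist for each season. As a reading aid, the same rule can be written as a small lookup. The names `FIRST_YEAR` and `sources_for` are hypothetical; the first-season values are read off the loop's start year and its `year > 2007` and `year > 2009` conditions.

```python
# First season each data source is loaded, as read off the loop above
FIRST_YEAR = {'Kenpom': 2002, 'TRank': 2008, 'Basic': 2010}

def sources_for(year):
    """Return the data sources the loop loads for the given season."""
    return [name for name, first in FIRST_YEAR.items() if year >= first]

for year in (2002, 2008, 2010):
    print(year, sources_for(year))
```

This also makes clear why the combined data set only starts in 2010: it is the first season for which all three sources are present.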

In [3]:
# Combine feature vectors into full training sets
kenpom_season = pd.concat(kenpom_season_data)
kenpom_march = pd.concat(kenpom_march_data)
TRank_season = pd.concat(TRank_season_data)
TRank_march = pd.concat(TRank_march_data)
stats_season = pd.concat(stats_season_data)
stats_march = pd.concat(stats_march_data)
all_season = pd.concat(all_season_data)
all_march = pd.concat(all_march_data)

## Blocking

We now have features for every Division I game played from the 2002 season through the 2018 season. However, we can improve the accuracy of our models if we remove results unrelated to our test set. Since the goal of this project is to predict NCAA Tournament games specifically, we will remove games involving teams that are not of tournament caliber.
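
The blocking rules themselves live inside `cbb.block_table` and are not shown in this notebook. As a rough sketch of the idea only, a blocking step could keep games in which both teams clear an AdjEM cutoff. The helper name `block_table_sketch` and the cutoff value are hypothetical and should not be read as the rules the package applies.

```python
import pandas as pd

def block_table_sketch(table, min_adjem=0.0):
    """Keep games where both teams clear a hypothetical AdjEM cutoff."""
    keep = (table['AdjEM_Fav'] >= min_adjem) & (table['AdjEM'] >= min_adjem)
    return table[keep].reset_index(drop=True)

# Toy feature table; team names and values are made up for illustration
toy = pd.DataFrame({
    'Favored': ['A', 'C', 'E'],
    'Underdog': ['B', 'D', 'F'],
    'AdjEM_Fav': [25.0, 18.0, 3.0],
    'AdjEM': [12.0, -5.0, -9.0],
})

blocked = block_table_sketch(toy, min_adjem=0.0)
print(len(toy), 'games before blocking,', len(blocked), 'after')
```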

In [4]:
print('We have data for ' + str(len(kenpom_season) + len(kenpom_march)) + ' games.')
print(str(len(kenpom_season[kenpom_season['Label'] == 1]) + len(kenpom_march[kenpom_march['Label'] == 1])) + ' of those games are upsets')

We have data for 87926 games.
21903 of those games are upsets


In [5]:
# Since the blocking rules rely on the Kenpom AdjEM stat, we need to join T-Rank and basic stats data with Kenpom
kenpom_AdjEM = kenpom_season.loc[:,['Favored', 'Underdog', 'Year', 'AdjEM_Fav', 'AdjEM', 'AdjEM_Diff']]
on_cols = ['Favored', 'Underdog', 'Year']
TRank_season = TRank_season.merge(kenpom_AdjEM, on=on_cols)
stats_season = stats_season.merge(kenpom_AdjEM, on=on_cols)
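
One thing worth checking with a join like this: if two teams meet more than once in a season with the same favored side, the key `['Favored', 'Underdog', 'Year']` is not unique, and a pandas merge pairs every matching left row with every matching right row. Whether that happens in this data is not verified here. The sketch below uses toy frames to show the effect and how the `validate` argument of `DataFrame.merge` can detect it.

```python
import pandas as pd

on_cols = ['Favored', 'Underdog', 'Year']

# Toy frames: the two meetings of A and B share the same merge key
left = pd.DataFrame({'Favored': ['A', 'A'], 'Underdog': ['B', 'B'],
                     'Year': [2015, 2015], 'x': [1, 2]})
right = pd.DataFrame({'Favored': ['A', 'A'], 'Underdog': ['B', 'B'],
                      'Year': [2015, 2015], 'y': [10, 20]})

# Without validation, the 2 x 2 key matches produce 4 rows
merged = left.merge(right, on=on_cols)
print(len(merged))

# With validation, pandas raises instead of silently multiplying rows
try:
    left.merge(right, on=on_cols, validate='one_to_one')
    keys_unique = True
except pd.errors.MergeError:
    keys_unique = False
print('Merge keys unique:', keys_unique)
```

If the check fails on the real tables, one option is to de-duplicate the lookup table on the key columns before merging, provided the AdjEM values are identical across the repeated rows.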

In [6]:
TRank_season.columns

Index(['Favored', 'Underdog', 'Year', 'Win_Loss_Fav', 'Win_Loss',
       'Win_Loss_Diff', 'Rk_Fav', 'Rk', 'Rk_Diff', 'AdjOE_Fav',
       ...
       'WAB_Fav', 'WAB', 'WAB_Diff', 'WAB Rank_Fav', 'WAB Rank',
       'WAB Rank_Diff', 'Label', 'AdjEM_Fav', 'AdjEM', 'AdjEM_Diff'],
      dtype='object', length=106)

In [7]:
# Block the feature vector tables for full season data
kenpom_season = cbb.block_table(kenpom_season)
TRank_season = cbb.block_table(TRank_season)
stats_season = cbb.block_table(stats_season)
all_season = cbb.block_table(all_season)

In [8]:
# Drop the kenpom columns from the TRank and basic stats data sets now that blocking is completed
drop_cols = ['AdjEM_Fav', 'AdjEM', 'AdjEM_Diff']
TRank_season = TRank_season.drop(drop_cols, axis=1)
stats_season = stats_season.drop(drop_cols, axis=1)

In [9]:
print('We have data for ' + str(len(kenpom_season) + len(kenpom_march)) + ' games.')
print(str(len(kenpom_season[kenpom_season['Label'] == 1]) + len(kenpom_march[kenpom_march['Label'] == 1])) + ' of those games are upsets')

We have data for 60252 games.
15928 of those games are upsets
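
To put the two sets of counts printed in this notebook side by side, the short sketch below computes the upset rate before and after blocking. The helper name `upset_rate` is hypothetical; the numbers are copied from the cell outputs.

```python
def upset_rate(n_games, n_upsets):
    """Share of games in which the underdog won."""
    return n_upsets / n_games

# Counts copied from the outputs printed before and after blocking
before = upset_rate(87926, 21903)
after = upset_rate(60252, 15928)

print('Upset rate before blocking: {:.1%}'.format(before))  # 24.9%
print('Upset rate after blocking: {:.1%}'.format(after))    # 26.4%
```

The slightly higher rate after blocking is what one would expect if the removed games were mostly lopsided matchups.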


In [10]:
# Now save all of the training datasets 
path = '../Data/Training/'
kenpom_season.to_csv('{}kenpom_season.csv'.format(path), index=False)
kenpom_march.to_csv('{}kenpom_march.csv'.format(path), index=False)
TRank_season.to_csv('{}TRank_season.csv'.format(path), index=False)
TRank_march.to_csv('{}TRank_march.csv'.format(path), index=False)
stats_season.to_csv('{}stats_season.csv'.format(path), index=False)
stats_march.to_csv('{}stats_march.csv'.format(path), index=False)
all_season.to_csv('{}all_season.csv'.format(path), index=False)
all_march.to_csv('{}all_march.csv'.format(path), index=False)