## Creating the Training Data

In this notebook, we will create the training data used by the various models for predicting scores in the NCAA basketball tournament. We will start by generating features from the Kenpom, T-Rank, and basic statistics data. Then, we will use blocking to reduce the training data to only games between tournament-caliber teams. Finally, we will combine data sets to create training data for a Kenpom model, a T-Rank model, a basic statistics model, and a model that uses all of the available data.

In [1]:
# Import packages
import sys
sys.path.append('/Users/phil/Documents/Documents/College_Basketball')

import pandas as pd
import collegebasketball as cbb
cbb.__version__

'0.2'

## Feature Generation

Now that we have our data, we need to create some features for the ML algorithms. For each statistical attribute, there are three features: the attribute for the favored team, the attribute for the underdog, and the difference between the two. In every dataset, the favored team is defined as the team with the higher AdjEM on Kenpom. Under this system, a label of '1' represents an upset and a label of '0' means that the favored team won the game.
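The favored/underdog scheme described above can be sketched with plain pandas. This is a minimal illustration and not the implementation inside `collegebasketball`: the helper name `gen_features`, the `AdjEM_Home`/`AdjEM_Away` input columns, the team names, and all values are assumptions made up for the example. The output column naming (`X_Fav`, `X`, `X_Diff`) follows the feature tables shown in this notebook.

```python
import pandas as pd

# Toy game table. Team names, scores and ratings are invented for illustration.
games = pd.DataFrame({
    'Home': ['Team A', 'Team C'],
    'Away': ['Team B', 'Team D'],
    'Home_Score': [60, 78],
    'Away_Score': [56, 82],
    'AdjEM_Home': [10.0, 5.0],
    'AdjEM_Away': [15.0, 25.0],
})


def gen_features(games, stats):
    """Hypothetical helper: build favored/underdog/difference features."""
    # The favored team is the one with the higher AdjEM
    home_fav = games['AdjEM_Home'] >= games['AdjEM_Away']

    out = pd.DataFrame({
        'Favored': games['Home'].where(home_fav, games['Away']),
        'Underdog': games['Away'].where(home_fav, games['Home']),
    })

    for stat in stats:
        fav = games[stat + '_Home'].where(home_fav, games[stat + '_Away'])
        dog = games[stat + '_Away'].where(home_fav, games[stat + '_Home'])
        out[stat + '_Fav'] = fav       # attribute for the favored team
        out[stat] = dog                # attribute for the underdog
        out[stat + '_Diff'] = fav - dog

    # Label is 1 for an upset (underdog won), 0 if the favored team won
    home_won = games['Home_Score'] > games['Away_Score']
    out['Label'] = (home_fav != home_won).astype(int)
    return out


features = gen_features(games, ['AdjEM'])
```

In the first toy game the home team wins despite the lower AdjEM, so the row is labeled as an upset; in the second the favored away team wins, so the label is 0.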

We will create a dataset with these features for each set of statistics (Kenpom, T-Rank, basic) for each year that these stats are available.

In [17]:
# Store a dataframe of features for each year in a list, one list per data set
kenpom_season_data = []
kenpom_march_data = []
TRank_season_data = []
TRank_march_data = []
stats_season_data = []
stats_march_data = []
all_season_data = []
all_march_data = []

# Generate features for each year of data
on_cols = ['Favored', 'Underdog', 'Year', 'Win_Loss_Fav', 'Win_Loss', 'Win_Loss_Diff', 'Label']
for year in range(2018, 2019):
    
    # Load combined data for this season
    kenpom_season = cbb.load_csv('../Data/Combined_Data/Kenpom/{}_regular_season.csv'.format(year))
    if year < 2019:
        kenpom_march = cbb.load_csv('../Data/Combined_Data/Kenpom/{}_march.csv'.format(year))
    if year > 2007:
        TRank_season = cbb.load_csv('../Data/Combined_Data/TRank/{}_regular_season.csv'.format(year))
        if year < 2019:
            TRank_march = cbb.load_csv('../Data/Combined_Data/TRank/{}_march.csv'.format(year))
    if year > 2009:
        stats_season = cbb.load_csv('../Data/Combined_Data/Basic/{}_regular_season.csv'.format(year))
        if year < 2019:
            stats_march = cbb.load_csv('../Data/Combined_Data/Basic/{}_march.csv'.format(year))
    
    # Generate features for each data set
    kenpom_season_data.append(cbb.gen_kenpom_features(kenpom_season))
    if year < 2019:
        kenpom_march_data.append(cbb.gen_kenpom_features(kenpom_march))
    if year > 2007:
        TRank_season_data.append(cbb.gen_TRank_features(TRank_season, kenpom_season))
        if year < 2019:
            TRank_march_data.append(cbb.gen_TRank_features(TRank_march, kenpom_march))
    if year > 2009:
        stats_season_data.append(cbb.gen_basic_features(stats_season, kenpom_season))
        if year < 2019:
            stats_march_data.append(cbb.gen_basic_features(stats_march, kenpom_march))
        
    # Combine all features into a single data set
    if year > 2009:
        all_season = kenpom_season_data[-1].merge(TRank_season_data[-1], on=on_cols)
        all_season_data.append(all_season.merge(stats_season_data[-1], on=on_cols))
        if year < 2019:
            all_march = kenpom_march_data[-1].merge(TRank_march_data[-1], on=on_cols)
            all_march_data.append(all_march.merge(stats_march_data[-1], on=on_cols))
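The merge step above joins the three feature sets on their shared identifying columns. The sketch below shows the same pattern on toy frames, with `validate='one_to_one'` added so that pandas raises an error if the key columns do not identify a unique row on either side. The frames, the shortened `on_cols` list, and the column names `TRank_Stat_Diff` and `Basic_Stat_Diff` are placeholders invented for this example. Whether the real data satisfies the one-to-one assumption (for instance, when two teams meet more than once in a season) is something to check rather than assume.

```python
import pandas as pd

# Shortened key list for the toy example
on_cols = ['Favored', 'Underdog', 'Year', 'Label']

kenpom = pd.DataFrame({
    'Favored': ['Team B', 'Team D'],
    'Underdog': ['Team A', 'Team C'],
    'Year': [2018, 2018],
    'Label': [1, 0],
    'AdjEM_Diff': [5.0, 20.0],
})

# Same games, listed in a different row order
trank = pd.DataFrame({
    'Favored': ['Team D', 'Team B'],
    'Underdog': ['Team C', 'Team A'],
    'Year': [2018, 2018],
    'Label': [0, 1],
    'TRank_Stat_Diff': [0.4, 0.1],
})

basic = pd.DataFrame({
    'Favored': ['Team B', 'Team D'],
    'Underdog': ['Team A', 'Team C'],
    'Year': [2018, 2018],
    'Label': [1, 0],
    'Basic_Stat_Diff': [3.5, -1.0],
})

# Inner merges on the shared keys; validate guards against duplicated keys
merged = (
    kenpom
    .merge(trank, on=on_cols, validate='one_to_one')
    .merge(basic, on=on_cols, validate='one_to_one')
)
```

Because the join is on the key columns rather than on row position, the differing row order in `trank` does not matter.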

In [8]:
# Combine feature vectors into full training sets
kenpom_season = pd.concat(kenpom_season_data)
kenpom_march = pd.concat(kenpom_march_data)
TRank_season = pd.concat(TRank_season_data)
TRank_march = pd.concat(TRank_march_data)
stats_season = pd.concat(stats_season_data)
stats_march = pd.concat(stats_march_data)
all_season = pd.concat(all_season_data)
all_march = pd.concat(all_march_data)

Unnamed: 0,Favored,Underdog,Year,Win_Loss_Fav,Win_Loss,Win_Loss_Diff,Rank_Fav,Rank,Rank_Diff,AdjEM_Fav,...,OppD Rank_Fav,OppD Rank,OppD Rank_Diff,NCSOS AdjEM_Fav,NCSOS AdjEM,NCSOS AdjEM_Diff,NCSOS AdjEM Rank_Fav,NCSOS AdjEM Rank,NCSOS AdjEM Rank_Diff,Label
0,Arizona State,Syracuse,2018,0.645161,0.606061,0.039101,47,56,-9,14.61,...,64,61,3,1.57,-0.22,1.79,123,172,-49,1
1,TCU,Syracuse,2018,0.65625,0.606061,0.050189,22,56,-34,18.97,...,47,61,-14,-0.42,-0.22,-0.2,177,172,5,1
2,Michigan State,Syracuse,2018,0.878788,0.606061,0.272727,6,56,-50,26.31,...,67,61,6,-4.77,-0.22,-4.55,302,172,130,1
3,Michigan State,Bucknell,2018,0.878788,0.735294,0.143494,6,101,-95,26.31,...,67,275,-208,-4.77,6.01,-10.78,302,31,271,0
4,Duke,Syracuse,2018,0.787879,0.606061,0.181818,3,56,-53,29.04,...,10,61,-51,4.63,-0.22,4.85,53,172,-119,0


In [3]:
vecs_stats.head()

Unnamed: 0,Favored,Underdog,Year,Win_Loss_Fav,Win_Loss,Win_Loss_Diff,Tm._Fav,Tm.,Tm._Diff,Opp._Fav,...,TOV_opp_Fav,TOV_opp,TOV_opp_Diff,PF_Fav,PF,PF_Diff,PF_opp_Fav,PF_opp,PF_opp_Diff,Label
0,Arizona State,Syracuse,2018,0.645161,0.606061,0.039101,82.65625,66.648649,16.007601,74.84375,...,14.8125,12.675676,2.136824,18.875,16.513514,2.361486,21.5,18.351351,3.148649,1
1,TCU,Syracuse,2018,0.65625,0.606061,0.050189,82.060606,66.648649,15.411957,75.363636,...,12.424242,12.675676,-0.251433,16.757576,16.513514,0.244062,18.545455,18.351351,0.194103,1
2,Michigan State,Syracuse,2018,0.878788,0.606061,0.272727,80.2,66.648649,13.551351,64.885714,...,9.971429,12.675676,-2.704247,18.285714,16.513514,1.772201,19.942857,18.351351,1.591506,1
3,Michigan State,Bucknell,2018,0.878788,0.735294,0.143494,80.2,81.057143,-0.857143,64.885714,...,9.971429,12.514286,-2.542857,18.285714,18.714286,-0.428571,19.942857,21.2,-1.257143,0
4,Duke,Syracuse,2018,0.787879,0.606061,0.181818,84.351351,66.648649,17.702703,69.621622,...,12.243243,12.675676,-0.432432,15.513514,16.513514,-1.0,18.135135,18.351351,-0.216216,0


In [5]:
# Create wins and losses columns for each game
data = data.assign(Wins_Home='', Losses_Home='', Wins_Away='', Losses_Away='')
locations = ['Home', 'Away']
for i, row in data.iterrows():

    # Split up W/L record into two columns
    for loc in locations:
        rec = row['Rec_{}'.format(loc)]
        wins, losses = rec.split('-', 1)
        data.loc[i, 'Wins_{}'.format(loc)] = int(wins)
        data.loc[i, 'Losses_{}'.format(loc)] = int(losses)

data.head()

Unnamed: 0,Year,Home,Away,Home_Score,Away_Score,Rk_Home,Team_Home,Conf_Home,G_Home,Rec_Home,...,3P%D_Away,3P%D Rank_Away,Adj T._Away,Adj T. Rank_Away,WAB_Away,WAB Rank_Away,Wins_Home,Losses_Home,Wins_Away,Losses_Away
0,2018,Syracuse,Arizona State,60,56,47,Syracuse,ACC,37,23-14,...,34.2,126,72.7,38,-1.2,66,23,14,20,12
1,2018,Syracuse,TCU,57,52,47,Syracuse,ACC,37,23-14,...,37.6,303,70.0,144,2.2,27,23,14,21,12
2,2018,Syracuse,Michigan State,55,53,47,Syracuse,ACC,37,23-14,...,33.7,102,68.0,238,8.0,5,23,14,30,5
3,2018,Bucknell,Michigan State,78,82,102,Bucknell,Pat,35,25-10,...,33.7,102,68.0,238,8.0,5,25,10,30,5
4,2018,Syracuse,Duke,65,69,47,Syracuse,ACC,37,23-14,...,32.0,22,70.9,91,6.6,9,23,14,29,8
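The row-by-row loop above could also be written with pandas string methods, which avoids `iterrows` and yields integer columns directly. This is a sketch on a two-row toy frame, assuming every record has the form `wins-losses` with a single dash; it is an alternative formulation, not the code that produced the output above.

```python
import pandas as pd

# Toy frame with the same record format as the Rec_Home / Rec_Away columns
data = pd.DataFrame({
    'Rec_Home': ['23-14', '25-10'],
    'Rec_Away': ['20-12', '30-5'],
})

for loc in ['Home', 'Away']:
    # Split "W-L" into two columns, then convert both to integers
    parts = data['Rec_{}'.format(loc)].str.split('-', expand=True).astype(int)
    data['Wins_{}'.format(loc)] = parts[0]
    data['Losses_{}'.format(loc)] = parts[1]
```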


In [4]:
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=['A', 'B', 'C'])
df

Unnamed: 0,A,B,C
0,1,2,3
1,4,5,6
2,7,8,9


In [6]:
df['A'] = df['A'] / df['B']

In [3]:
data.columns

Index(['Year', 'Home', 'Away', 'Home_Score', 'Away_Score', 'Team_Home',
       'G_Home', 'SRS_Home', 'SOS_Home', 'Tm._Home', 'Opp._Home', 'MP_Home',
       'FG_opp_Home', 'FGA_opp_Home', 'FG%_opp_Home', '3P_opp_Home',
       '3PA_opp_Home', '3P%_opp_Home', 'FT_opp_Home', 'FTA_opp_Home',
       'FT%_opp_Home', 'ORB_opp_Home', 'TRB_opp_Home', 'AST_opp_Home',
       'STL_opp_Home', 'BLK_opp_Home', 'TOV_opp_Home', 'PF_opp_Home',
       'FG_Home', 'FGA_Home', 'FG%_Home', '3P_Home', '3PA_Home', '3P%_Home',
       'FT_Home', 'FTA_Home', 'FT%_Home', 'ORB_Home', 'TRB_Home', 'AST_Home',
       'STL_Home', 'BLK_Home', 'TOV_Home', 'PF_Home', 'Team_Away', 'G_Away',
       'SRS_Away', 'SOS_Away', 'Tm._Away', 'Opp._Away', 'MP_Away',
       'FG_opp_Away', 'FGA_opp_Away', 'FG%_opp_Away', '3P_opp_Away',
       '3PA_opp_Away', '3P%_opp_Away', 'FT_opp_Away', 'FTA_opp_Away',
       'FT%_opp_Away', 'ORB_opp_Away', 'TRB_opp_Away', 'AST_opp_Away',
       'STL_opp_Away', 'BLK_opp_Away', 'TOV_opp_Away', 'PF_opp