# Data Preparation and Cleaning

In this notebook, I clean the datasets and join each team statistics source (Kenpom, T-Rank and basic stats) with the game scores, producing one combined CSV file per source that can be used later for feature generation.

In [1]:
# Import packages
import sys
sys.path.append('../../College_Basketball')

import pandas as pd
import collegebasketball as cbb
cbb.__version__

'2024'

## Load in the Game Scores Data

First, we will load the game scores data from the csv files we created earlier. Later, we'll join this data to the team stats datasets.

In [2]:
# Location of the data
scores_path = '../Data/Scores/'

# Season to process
year = 2024

# Load the scores datasets
scores_data = pd.read_csv(scores_path + str(year) + '_season.csv')

## Cleaning the Data

Next, we need to edit the school names in the kenpom, basic stats and T-Rank datasets to ensure that they match up with the school names from the scores dataset. We will verify that the names match using the `cbb.check_for_missing_names` command. It checks that each school name in the given team statistics dataset (kenpom, basic stats or T-Rank) is present in the game scores dataset.

In [3]:
# The location where the files will be saved
path = '../Data/'
    
# Load this year's data and clean up the school names to match up with scores data
kenpom_data = pd.read_csv('{0}Kenpom/{1}_kenpom.csv'.format(path, year))
kenpom_data = cbb.update_kenpom(kenpom_data)
assert len(cbb.check_for_missing_names(scores_data, kenpom_data, False)) == 0

# TRank data
TRank_data =  pd.read_csv('{0}TRank/{1}_TRank.csv'.format(path, year))
TRank_data = cbb.update_TRank(TRank_data)
assert len(cbb.check_for_missing_names(scores_data, TRank_data, False)) == 0

# Basic stats data
stats_data =  pd.read_csv('{0}SportsReference/{1}_stats.csv'.format(path, year))
stats_data = stats_data.rename(index=str, columns={'School': 'Team'})
stats_data = cbb.update_basic(stats_data)
assert len(cbb.check_for_missing_names(scores_data, stats_data, False)) == 0
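
The name check described above can be pictured as a set difference. The following is a minimal sketch, not the actual `cbb.check_for_missing_names` implementation: `missing_names` is a hypothetical stand-in, the toy data is made up, and it assumes the scores data has `Home` and `Away` columns and the stats data has a `Team` column.

```python
import pandas as pd

def missing_names(scores, stats):
    """Hypothetical stand-in: team names in stats that never appear in scores."""
    # Every school that appears in a game, as either the home or the away team
    known = set(scores['Home']) | set(scores['Away'])
    # Names in the stats data with no match among the scheduled teams
    return sorted(set(stats['Team']) - known)

# Toy data (made up): 'Connecticut' should have been renamed to 'UConn'
scores = pd.DataFrame({'Home': ['UConn', 'Purdue'], 'Away': ['Houston', 'UConn']})
stats = pd.DataFrame({'Team': ['UConn', 'Houston', 'Purdue', 'Connecticut']})

result = missing_names(scores, stats)
print(result)
```

An empty list means every name in the stats data has a counterpart in the scores data, which is the condition the `assert` statements above enforce.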

In [4]:
# Let's take a quick look at one of the datasets
kenpom_data.head()

Unnamed: 0,Rank,Team,Seed,Conf,Wins,Losses,AdjEM,AdjO,AdjO Rank,AdjD,...,Luck,Luck Rank,OppAdjEM,OppAdjEM Rank,OppO,OppO Rank,OppD,OppD Rank,NCSOS AdjEM,NCSOS AdjEM Rank
0,1,UConn,1.0,BE,31,3,32.21,126.6,1,94.4,...,0.047,70,10.46,34,111.6,41,101.2,30,-3.36,285
1,2,Houston,1.0,B12,30,4,31.72,118.9,17,87.1,...,0.053,61,11.75,14,111.9,36,100.1,6,-0.79,225
2,3,Purdue,1.0,B10,29,4,29.12,125.0,4,95.9,...,0.045,75,13.74,4,114.0,5,100.3,10,10.35,13
3,4,Auburn,4.0,SEC,27,7,28.9,120.6,10,91.7,...,-0.067,324,9.6,53,111.9,35,102.3,71,1.45,152
4,5,Iowa State,2.0,B12,27,7,26.72,113.9,55,87.1,...,0.012,155,10.38,37,111.0,58,100.6,15,-7.2,351


## Joining the Datasets

Now that the school names in each data set match up, we can join the Kenpom and scores data and save the result to a single csv file.

In [5]:
# Save the paths to the data 
save_path = '../Data/Combined_Data/Kenpom.csv'
    
# Join the dataframes to get kenpom for both home and away team
kenpom_df = pd.merge(scores_data, kenpom_data, left_on='Home', right_on='Team', sort=False)
kenpom_df = pd.merge(kenpom_df, kenpom_data, left_on='Away', right_on='Team', 
                     suffixes=('_Home', '_Away'), sort=False)

# Add a column to indicate the year
kenpom_df.insert(0, 'Year', year)
        
# Combine the data for every year and save to csv
all_kenpom = pd.read_csv(save_path)
kenpom_df = pd.concat([all_kenpom, kenpom_df])
kenpom_df.to_csv(save_path, index=False)
    
# Let's take a look at the data set
print("There are {} games in the Kenpom dataset.".format(len(kenpom_df)))
print("There are {} NCAA Tournament games in the Kenpom dataset.".format(len(cbb.filter_tournament(kenpom_df))))
kenpom_df.head()

There are 105574 games in the Kenpom dataset.
There are 1179 NCAA Tournament games in the Kenpom dataset.


Unnamed: 0,Year,Home,Away,Home_Score,Away_Score,Tournament,Rank_Home,Team_Home,Seed_Home,Conf_Home,...,Luck_Away,Luck Rank_Away,OppAdjEM_Away,OppAdjEM Rank_Away,OppO_Away,OppO Rank_Away,OppD_Away,OppD Rank_Away,NCSOS AdjEM_Away,NCSOS AdjEM Rank_Away
0,2002,Maryland,Arizona,67.0,71.0,,3,Maryland,1.0,ACC,...,0.079,15,14.22,1,111.3,1,97.1,3,17.56,1
1,2002,Florida,Arizona,71.0,75.0,,7,Florida,5.0,SEC,...,0.079,15,14.22,1,111.3,1,97.1,3,17.56,1
2,2002,Wyoming,Arizona,60.0,68.0,"NCAA, West - Second Round",67,Wyoming,11.0,MWC,...,0.079,15,14.22,1,111.3,1,97.1,3,17.56,1
3,2002,Oklahoma,Arizona,88.0,67.0,"NCAA, West - Regional Semifinal",5,Oklahoma,2.0,B12,...,0.079,15,14.22,1,111.3,1,97.1,3,17.56,1
4,2002,USC,Arizona,80.0,97.0,,12,USC,4.0,P10,...,0.079,15,14.22,1,111.3,1,97.1,3,17.56,1
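
The double merge used above can be tried on a toy example. This is a minimal sketch with made-up data, assuming the same `Home`, `Away` and `Team` column names. Note that `pd.merge` defaults to an inner join, so a game is dropped whenever either team is missing from the stats data. Whether that is intended here is not stated in this notebook. The second part uses `how='left'` with `indicator=True` to list the names that would be dropped.

```python
import pandas as pd

# Toy data (made up): team 'D' has no row in the stats data
scores = pd.DataFrame({
    'Home': ['A', 'B', 'C'],
    'Away': ['B', 'C', 'D'],
    'Home_Score': [70, 65, 80],
    'Away_Score': [60, 72, 75],
})
stats = pd.DataFrame({'Team': ['A', 'B', 'C'], 'AdjEM': [10.0, 5.0, -2.0]})

# First merge attaches the home team's stats, second attaches the away team's.
# The suffixes rename the overlapping columns from the two copies of stats.
home = pd.merge(scores, stats, left_on='Home', right_on='Team', sort=False)
both = pd.merge(home, stats, left_on='Away', right_on='Team',
                suffixes=('_Home', '_Away'), sort=False)

# The game C vs. D is gone because 'D' has no match (inner join)
print(len(scores), len(both))

# A left join with indicator=True shows which away teams failed to match
check = pd.merge(scores, stats, left_on='Away', right_on='Team',
                 how='left', indicator=True)
unmatched = check.loc[check['_merge'] == 'left_only', 'Away'].tolist()
print(unmatched)
```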


Now we will join the T-Rank data with the game scores data. Additionally, we need to join the result with the Kenpom team statistics. This join is necessary because we need the tournament seed attribute in order to restrict the March dataset to NCAA Tournament games. It will also be useful later, during feature generation, to have the Kenpom AdjEM and win/loss stats for each team as a way to judge which game outcomes count as upsets.

In [6]:
save_path = '../Data/Combined_Data/TRank.csv'

# Get only the columns we need from the kenpom data
kp = kenpom_data[['Team', 'AdjEM', 'Seed']]

# Join the dataframes to get TRank data and kenpom (seed, adj_em) for both home and away team
TRank_df = pd.merge(scores_data, TRank_data, left_on='Home', right_on='Team', sort=False)
TRank_df = pd.merge(TRank_df, TRank_data, left_on='Away', right_on='Team', 
                         suffixes=('_Home', '_Away'), sort=False)
TRank_df = pd.merge(TRank_df, kp, left_on='Home', right_on='Team', sort=False)
TRank_df = pd.merge(TRank_df, kp, left_on='Away', right_on='Team', 
                    suffixes=('_Home', '_Away'), sort=False)

# Add a column to indicate the year
TRank_df.insert(0, 'Year', year)

# T-Rank has introduced new columns (3PR and 3PRD) - for now we'll just drop them but should include them in future
drop_cols = ['3PR_Home', '3PR Rank_Home', '3PRD_Home', '3PRD Rank_Home', '3PR_Away', '3PR Rank_Away', 
             '3PRD_Away', '3PRD Rank_Away']
TRank_df = TRank_df.drop(drop_cols, axis=1)
    
# Combine the data for every year and save to csv
all_TRank = pd.read_csv(save_path)
all_TRank.rename(columns={'Team_Home.1': 'Team_Home', 'Team_Away.1': 'Team_Away'}, inplace=True)
TRank_df = pd.concat([all_TRank, TRank_df])
TRank_df.to_csv(save_path, index=False)
    
# Let's take a look at the data set
print("There are {} games in the T-Rank dataset.".format(len(TRank_df)))
print("There are {} NCAA Tournament games in the T-Rank dataset.".format(len(cbb.filter_tournament(TRank_df))))
TRank_df.head()

There are 76621 games in the T-Rank dataset.
There are 795 NCAA Tournament games in the T-Rank dataset.


Unnamed: 0,Year,Home,Away,Home_Score,Away_Score,Tournament,Rk_Home,Team_Home,Conf_Home,G_Home,...,Adj T._Away,Adj T. Rank_Away,WAB_Away,WAB Rank_Away,Team_Home.1,AdjEM_Home,Seed_Home,Team_Away,AdjEM_Away,Seed_Away
0,2008,UT-Martin,Memphis,71.0,102.0,,252,UT-Martin,OVC,31,...,70.2,72,8.9,5,UT-Martin,-8.1,,Memphis,31.51,1.0
1,2008,Richmond,Memphis,63.0,80.0,,143,Richmond,A10,31,...,70.2,72,8.9,5,Richmond,1.48,,Memphis,31.51,1.0
2,2008,Siena,Memphis,58.0,102.0,,97,Siena,MAAC,34,...,70.2,72,8.9,5,Siena,7.99,13.0,Memphis,31.51,1.0
3,2008,Pepperdine,Memphis,53.0,90.0,,241,Pepperdine,WCC,31,...,70.2,72,8.9,5,Pepperdine,-6.39,,Memphis,31.51,1.0
4,2008,Alabama-Birmingham,Memphis,56.0,94.0,,66,Alabama-Birmingham,CUSA,34,...,70.2,72,8.9,5,Alabama-Birmingham,12.07,,Memphis,31.51,1.0
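
To illustrate why the Kenpom AdjEM columns are carried along, here is a minimal sketch of one possible upset flag: a game is an upset when the team with the lower AdjEM wins. This definition is an assumption made for illustration, not necessarily the one used during feature generation, and the rows below are made up. Only the column names come from the joined data above.

```python
import pandas as pd

# Toy rows (made up) using the column names from the joined data
games = pd.DataFrame({
    'Home_Score': [67.0, 88.0, 60.0],
    'Away_Score': [71.0, 67.0, 68.0],
    'AdjEM_Home': [20.0, 25.0, 5.0],
    'AdjEM_Away': [15.0, 10.0, 18.0],
})

# Who won, and who was favored according to AdjEM
home_won = games['Home_Score'] > games['Away_Score']
home_favored = games['AdjEM_Home'] > games['AdjEM_Away']

# Assumed definition: an upset is any game where the winner was not the favorite
# (teams with equal AdjEM are treated as if the away team were favored)
games['Upset'] = home_won != home_favored
print(games['Upset'].tolist())
```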


Lastly, we will run the same process for the basic statistics as we did for the T-Rank data.

In [7]:
save_path = '../Data/Combined_Data/Basic.csv'
    
# Get only the columns we need from the kenpom data
kp = kenpom_data[['Team', 'AdjEM', 'Seed', 'Wins', 'Losses']]

# Join the dataframes to get basic statistics data and kenpom (seed, adj_em, wins, losses) for both home and away team
basic_df = pd.merge(scores_data, stats_data, left_on='Home', right_on='Team', sort=False)
basic_df = pd.merge(basic_df, stats_data, left_on='Away', right_on='Team', 
                    suffixes=('_Home', '_Away'), sort=False)
basic_df = pd.merge(basic_df, kp, left_on='Home', right_on='Team', sort=False)
basic_df = pd.merge(basic_df, kp, left_on='Away', right_on='Team', 
                    suffixes=('_Home', '_Away'), sort=False)

# Add a column to indicate the year
basic_df.insert(0, 'Year', year)
    
# Combine the data for every year and save to csv
all_basic = pd.read_csv(save_path)
all_basic.rename(columns={'Team_Home.1': 'Team_Home', 'Team_Away.1': 'Team_Away'}, inplace=True)
basic_df = pd.concat([all_basic, basic_df])
basic_df.to_csv(save_path, index=False)
    
# Let's take a look at the data set
print("There are {} games in the regular season basic statistics dataset.".format(len(basic_df)))
print("There are {} NCAA tournament games in the basic statistics dataset.".format(len(cbb.filter_tournament(basic_df))))
basic_df.head()

There are 65574 games in the regular season basic statistics dataset.
There are 667 NCAA tournament games in the basic statistics dataset.


Unnamed: 0,Year,Home,Away,Home_Score,Away_Score,Tournament,Team_Home,G_Home,SRS_Home,SOS_Home,...,Team_Home.1,AdjEM_Home,Seed_Home,Wins_Home,Losses_Home,Team_Away,AdjEM_Away,Seed_Away,Wins_Away,Losses_Away
0,2010,Florida International,UNC,72.0,88.0,,Florida International,32,-12.81,-2.74,...,Florida International,-14.45,,7,25,UNC,13.39,,20,17
1,2010,Albany (NY),UNC,70.0,87.0,,Albany (NY),32,-11.94,-5.53,...,Albany (NY),-13.16,,7,25,UNC,13.39,,20,17
2,2010,William & Mary,UNC,72.0,80.0,NIT,William & Mary,33,2.82,1.63,...,William & Mary,6.58,,22,11,UNC,13.39,,20,17
3,2010,Valparaiso,UNC,77.0,88.0,,Valparaiso,32,-2.9,0.34,...,Valparaiso,-0.92,,15,17,UNC,13.39,,20,17
4,2010,Wake Forest,UNC,82.0,69.0,,Wake Forest,31,11.45,8.85,...,Wake Forest,14.12,9.0,20,11,UNC,13.39,,20,17
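
Each of the three cells above reads the combined csv file, appends the current season and writes the file back to the same path, so running a cell twice would append the same season twice. A minimal sketch of a rerun-safe append follows. `append_season` is a hypothetical helper, not part of `collegebasketball`, and it assumes the combined file has a `Year` column as in the outputs above.

```python
import os
import tempfile
import pandas as pd

def append_season(combined_path, season_df, year):
    """Hypothetical helper: replace any existing rows for `year`, then save."""
    if os.path.exists(combined_path):
        combined = pd.read_csv(combined_path)
        # Drop rows from an earlier run for the same season before appending
        combined = combined[combined['Year'] != year]
        season_df = pd.concat([combined, season_df], ignore_index=True)
    season_df.to_csv(combined_path, index=False)
    return season_df

# Toy demonstration in a temporary directory (made-up data)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'Combined.csv')
    old = pd.DataFrame({'Year': [2023], 'Home': ['A'], 'Away': ['B']})
    new = pd.DataFrame({'Year': [2024, 2024], 'Home': ['A', 'C'], 'Away': ['C', 'B']})

    append_season(path, old, 2023)
    rows_after_first = len(append_season(path, new, 2024))
    rows_after_second = len(append_season(path, new, 2024))

# The row count is the same after the second run
print(rows_after_first, rows_after_second)
```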
