# Retrieving College Basketball Data

In this notebook, I retrieve scores, basic team stats, kenpom data, and T-Rank data for seasons between 2002 and 2019. The data is saved in CSV files so that other notebooks can access it later.

In [1]:
# Import packages
import sys
sys.path.append('../')

import datetime
import pandas as pd
import collegebasketball as cbb
cbb.__version__

'0.2'

## Getting the Scores Data

The scores are from https://www.sports-reference.com/cbb/. The code below downloads all college basketball scores from the 2002 season through the 2019 season.

For each season, I create two CSV files. One contains the regular season scores (November 1 through the day before the tournament start date) and the other contains the scores from the tournament start date through April 10. I will later use the tournament scores to evaluate the performance of the models, and separating the data at this stage will simplify things later on.
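
The date split can be sketched on its own with the standard library. This is a minimal sketch, not the package's code: `season_windows` is a hypothetical helper, and the start day of 14 in the usage line is only an example value. It uses `timedelta` to get the day before the tournament starts, so it would still work if a start date ever fell on the first of the month.

```python
import datetime


def season_windows(year, tournament_start_day):
    """Return ((regular_start, regular_end), (march_start, march_end)).

    `year` is the calendar year in which the season ends.
    `tournament_start_day` is the day in March on which the tournament starts.
    """
    march_start = datetime.date(year, 3, tournament_start_day)

    # Regular season: November 1 of the previous year through the day before the tournament
    regular_start = datetime.date(year - 1, 11, 1)
    regular_end = march_start - datetime.timedelta(days=1)

    # Tournament window: start date through April 10
    march_end = datetime.date(year, 4, 10)

    return (regular_start, regular_end), (march_start, march_end)


# Example with an illustrative start day
regular, march = season_windows(2017, 14)
print(regular, march)
```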

In [2]:
# This list contains the day in March on which march madness starts, one entry per year
march_start_dates = [12, 18, 16, 15, 14, 13, 18, 17, 16, 15, 13, 19, 18, 17, 15, 14, 19]

# The location where the files will be saved
path = '../Data/Scores/'
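
The download loop looks up each year's start day with `march_start_dates[year - 2003]`, which only lines up if the first entry belongs to 2003. The notebook doesn't say which year the list starts with, so that is an assumption here. A minimal sketch of a check that maps each entry to a year and reports which loop years have no entry of their own:

```python
march_start_dates = [12, 18, 16, 15, 14, 13, 18, 17, 16, 15, 13, 19, 18, 17, 15, 14, 19]

# Assumption: the first entry belongs to 2003, as implied by the `year - 2003` index
first_year = 2003
years = range(first_year, first_year + len(march_start_dates))
start_day_by_year = dict(zip(years, march_start_dates))

# Years the download loop visits that have no entry in the mapping
missing = [year for year in range(2002, 2020) if year not in start_day_by_year]
print(missing)
```

With 17 entries and a loop over 2002 to 2019, 2002 has no entry of its own. Python's negative indexing means `march_start_dates[2002 - 2003]` silently returns the last entry instead of raising an error, so an explicit year-to-day mapping makes this kind of mismatch easier to spot.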

In [3]:
# We will be creating a csv file for each regular season and tournament from 2002 to 2019
for year in range(2002, 2020):

    # Set up the starting and ending dates of the regular season and march madness
    start_regular = datetime.date(year - 1, 11, 1)
    end_regular = datetime.date(year, 3, march_start_dates[year - 2003] - 1)
    
    start_march = datetime.date(year, 3, march_start_dates[year - 2003])
    end_march = datetime.date(year, 4, 10)
    
    # Set up the paths for this year's scores
    path_regular = path + str(year) + '_regular_season.csv'
    path_march = path + str(year) + '_march.csv'

    # Create and save the csv files for the regular season and march madness data for the year
    cbb.load_scores_dataframe(start_date=start_regular, end_date=end_regular, csv_file_path=path_regular)
    cbb.load_scores_dataframe(start_date=start_march, end_date=end_march, csv_file_path=path_march)
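
Re-running the loop downloads every season again. One way to avoid that is to skip files that already exist. The sketch below is an assumption about how this could be done, not part of the `collegebasketball` package: `fetch_if_missing` is a hypothetical helper that takes the download step as a callable, and the demo uses a stand-in downloader that only writes a header line.

```python
import os
import tempfile


def fetch_if_missing(csv_file_path, fetch):
    """Call `fetch()` only when `csv_file_path` does not exist yet.

    Returns True if the download step was run, False if the file was already there.
    """
    if os.path.exists(csv_file_path):
        return False
    fetch()
    return True


# Demo with a stand-in downloader that just writes a small file
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, '2003_march.csv')

    def fake_fetch():
        with open(demo_path, 'w') as f:
            f.write('Home,Away,Home_Score,Away_Score\n')

    first = fetch_if_missing(demo_path, fake_fetch)   # file is missing, so fetch runs
    second = fetch_if_missing(demo_path, fake_fetch)  # file exists now, so fetch is skipped
```

In the loop, the callable could wrap the existing call, for example `lambda: cbb.load_scores_dataframe(start_date=start_regular, end_date=end_regular, csv_file_path=path_regular)`.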

In [4]:
# Load a dataset to take an initial look
file_path = path + '2003_march.csv'
data = cbb.load_csv(file_path)

data.head()

Unnamed: 0,Home,Away,Home_Score,Away_Score
0,St. John's (NY),Notre Dame,83,80
1,Alabama-Birmingham,Charlotte,85,61
2,Eastern Washington,Weber State,57,60
3,Georgetown,Villanova,46,41
4,Howard,Delaware State,68,65


In [5]:
# Let's take a look at all the games involving Marquette in the 2003 March file
pd.concat([data[data['Home'] == 'Marquette'], data[data['Away'] == 'Marquette']])

Unnamed: 0,Home,Away,Home_Score,Away_Score
144,Marquette,Holy Cross,72,68
214,Marquette,Kentucky,83,69
222,Marquette,Kansas,61,94
16,Alabama-Birmingham,Marquette,83,76
180,Missouri,Marquette,92,101
206,Pitt,Marquette,74,77
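
The same selection can be written with a single boolean mask instead of `pd.concat`, which keeps the rows in their original order. A minimal sketch on a few rows copied from the output above; `games_for_team` and the `Won` column are hypothetical names added for illustration.

```python
import pandas as pd

# A few rows copied from the 2003 March output
scores = pd.DataFrame({
    'Home': ['Marquette', 'Marquette', 'Missouri', 'Georgetown'],
    'Away': ['Holy Cross', 'Kansas', 'Marquette', 'Villanova'],
    'Home_Score': [72, 61, 92, 46],
    'Away_Score': [68, 94, 101, 41],
})


def games_for_team(scores, team):
    """Return all games involving `team`, with a flag for whether the team won."""
    is_home = scores['Home'] == team
    is_away = scores['Away'] == team
    games = scores[is_home | is_away].copy()

    # The team won if it is listed as Home and Home scored more,
    # or if it is listed as Away and Home did not score more
    home_won = games['Home_Score'] > games['Away_Score']
    games['Won'] = (games['Home'] == team) == home_won
    return games


marquette = games_for_team(scores, 'Marquette')
print(marquette)
```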


## Getting Basic Team Stats

The team stats data is also from https://www.sports-reference.com/cbb/. This data contains basic basketball statistics for each team at the end of each season. These stats will later be used to train the model and evaluate teams.

In [6]:
# The location where the files will be saved
path = '/Users/phil/Documents/Documents/College_Basketball/Data/SportsReference/'

# We will be creating a csv file of data for each season from 2010 to 2019
for year in range(2010, 2020):
    
    # Set the path for the current year data
    stats_path = path + str(year) + '_stats.csv'
    
    # Save the basic stats data into a csv file
    cbb.load_stats_dataframe(year=year, csv_file_path=stats_path)

In [7]:
# Load some data to take a look
stats_path = path + '2013_stats.csv'
data = cbb.load_csv(stats_path)

data.head()

Unnamed: 0,School,G,SRS,SOS,Tm.,Opp.,MP,FG_opp,FGA_opp,FG%_opp,...,FT,FTA,FT%,ORB,TRB,AST,STL,BLK,TOV,PF
0,Air Force,32,4.18,4.28,2241,2170,1285,764,1702,0.449,...,379,531,0.714,240,889,508,206,55,380,557
1,Akron NCAA,33,8.11,0.18,2369,2071,1340,723,1831,0.395,...,436,678,0.643,425,1145,485,231,189,458,577
2,Alabama A&M,31,-18.4,-9.82,1957,2160,1255,783,1753,0.447,...,410,651,0.63,377,992,367,182,102,444,550
3,Alabama-Birmingham,33,0.26,1.51,2357,2351,1325,816,1848,0.442,...,424,586,0.724,408,1078,529,269,85,510,621
4,Alabama State,32,-17.88,-9.01,1979,2231,1290,780,1808,0.431,...,343,575,0.597,398,1025,371,202,103,479,598
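
Some school names in this output carry a trailing ` NCAA` (for example `Akron NCAA`). Assuming that suffix marks teams that reached the NCAA Tournament, which the notebook itself doesn't state, a sketch like the following could split it into a separate flag so the school names match the other data sets. `Made_Tournament` is a hypothetical column name.

```python
import pandas as pd

# A few rows copied from the 2013 stats output
stats = pd.DataFrame({
    'School': ['Air Force', 'Akron NCAA', 'Alabama A&M'],
    'G': [32, 33, 31],
})

# Assumption: a trailing " NCAA" marks a tournament team
stats['Made_Tournament'] = stats['School'].str.endswith(' NCAA')

# Strip the suffix so that only the school name remains
stats['School'] = stats['School'].str.replace(r'\s+NCAA$', '', regex=True)

print(stats)
```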


## Getting the Kenpom Data

The kenpom data is from https://kenpom.com. This website displays advanced stats for each team in the NCAA. These stats will later be used to train the model and evaluate teams.

In [8]:
# The location where the files will be saved
path = '/Users/phil/Documents/Documents/College_Basketball/Data/Kenpom/'

# We will be creating a csv file of kenpom data for each season from 2002 to 2019
for year in range(2002, 2020):
    
    # Set the path for the current year data
    kp_path = path + str(year) + '_kenpom.csv'
    
    # Save the kenpom data into a csv file
    cbb.load_kenpom_dataframe(csv_file_path=kp_path, year=year, save_data=True)

In [9]:
# Load some data to take a look
kp_path = path + '2003_kenpom.csv'
data = cbb.load_csv(kp_path)

data.head()

Unnamed: 0,Rank,Team,Conf,Wins,Losses,AdjEM,AdjO,AdjO Rank,AdjD,AdjD Rank,...,Luck,Luck Rank,OppAdjEM,OppAdjEM Rank,OppO,OppO Rank,OppD,OppD Rank,NCSOS AdjEM,NCSOS AdjEM Rank
0,1,Kentucky,SEC,32,4,29.18,116.5,5,87.4,4,...,0.051,53,11.2,4,108.6,3,97.4,16,6.77,33
1,2,Kansas,B12,30,8,28.62,115.0,12,86.4,3,...,-0.017,208,11.84,2,108.7,2,96.8,9,6.08,35
2,3,Pittsburgh,BE,28,5,28.61,114.8,14,86.2,2,...,-0.023,224,7.07,56,105.5,66,98.4,49,-8.24,310
3,4,Arizona,P10,28,4,26.8,115.6,10,88.8,8,...,-0.007,181,8.69,31,107.2,14,98.5,53,8.19,26
4,5,Illinois,B10,25,7,24.47,113.2,22,88.7,7,...,-0.029,242,7.29,52,105.8,59,98.5,54,-4.18,257


In [10]:
# Let's take a look at Marquette's kenpom numbers for 2003
data[data['Team'] == 'Marquette']

Unnamed: 0,Rank,Team,Conf,Wins,Losses,AdjEM,AdjO,AdjO Rank,AdjD,AdjD Rank,...,Luck,Luck Rank,OppAdjEM,OppAdjEM Rank,OppO,OppO Rank,OppD,OppD Rank,NCSOS AdjEM,NCSOS AdjEM Rank
14,15,Marquette,CUSA,27,6,21.3,120.5,2,99.2,109,...,0.07,30,7.94,42,106.0,54,98.0,35,-1.46,188
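
A quick sanity check on the saved file: AdjEM is the adjusted efficiency margin, so it should equal AdjO minus AdjD up to rounding. The sketch below runs that check on rows copied from the output above. The 0.1 tolerance is an assumption, chosen because AdjO and AdjD are shown with one decimal place.

```python
import pandas as pd

# Rows copied from the 2003 kenpom output
kenpom = pd.DataFrame({
    'Team': ['Kentucky', 'Kansas', 'Pittsburgh', 'Marquette'],
    'AdjEM': [29.18, 28.62, 28.61, 21.3],
    'AdjO': [116.5, 115.0, 114.8, 120.5],
    'AdjD': [87.4, 86.4, 86.2, 99.2],
})

# Difference between the reported margin and offense minus defense
margin = kenpom['AdjO'] - kenpom['AdjD']
kenpom['Margin_Diff'] = (kenpom['AdjEM'] - margin).abs()

# Flag rows where the difference is within the assumed rounding tolerance
consistent = kenpom['Margin_Diff'] < 0.1
print(kenpom[['Team', 'Margin_Diff']])
```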


## Getting the T-Rank Data

The T-Rank data is from http://www.barttorvik.com/#. This website displays advanced stats for each team in the NCAA. These stats will later be used to train the model and evaluate teams.

In [11]:
# The location where the files will be saved
path = '/Users/phil/Documents/Documents/College_Basketball/Data/TRank/'

# We will be creating a csv file of data for each season from 2008 to 2019
for year in range(2008, 2020):
    
    # Set the path for the current year data
    TRank_path = path + str(year) + '_TRank.csv'
    
    # Save the T-Rank data into a csv file
    cbb.load_TRank_dataframe(year=year, csv_file_path=TRank_path)

In [12]:
# Load some data to take a look
TRank_path = path + '2008_TRank.csv'
data = cbb.load_csv(TRank_path)

data.head()

Unnamed: 0,Rk,Team,Conf,G,Rec,AdjOE,AdjOE Rank,AdjDE,AdjDE Rank,Barthag,...,2P%D,2P%D Rank,3P%,3P% Rank,3P%D,3P%D Rank,Adj T.,Adj T. Rank,WAB,WAB Rank
0,1,Kansas,B12,39,37-3,121.4,2,85.5,2,0.9825,...,41.2,4,39.7,12,32.8,54,68.6,117,9.9,3
1,2,Memphis,CUSA,40,38-2,117.0,6,86.1,3,0.9715,...,42.6,10,34.9,169,30.2,7,70.2,72,8.9,5
2,3,UCLA,P10,38,35-4,116.2,8,86.8,5,0.9664,...,45.4,64,34.4,193,32.9,57,66.1,215,10.8,2
3,4,North Carolina,ACC,39,36-3,122.9,1,91.9,26,0.9659,...,47.8,137,37.2,80,32.6,45,74.7,8,11.9,1
4,5,Wisconsin,B10,36,31-5,112.6,36,85.4,1,0.96,...,41.7,5,35.6,144,31.3,22,63.1,306,8.3,8


In [13]:
# Let's take a look at Marquette's T-Rank numbers for 2008
data[data['Team'] == 'Marquette']

Unnamed: 0,Rk,Team,Conf,G,Rec,AdjOE,AdjOE Rank,AdjDE,AdjDE Rank,Barthag,...,2P%D,2P%D Rank,3P%,3P% Rank,3P%D,3P%D Rank,Adj T.,Adj T. Rank,WAB,WAB Rank
11,12,Marquette,BE,34,25-10,114.1,23,91.1,17,0.9304,...,46.7,96,35.8,135,30.4,8,68.6,122,3.8,22
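
The `Rec` column stores each record as a string such as `25-10`, which can't be used directly as a numeric feature. A minimal sketch of splitting it into numeric columns, using rows copied from the output above; `Wins`, `Losses`, and `Win_Pct` are hypothetical column names.

```python
import pandas as pd

# Rows copied from the 2008 T-Rank output
trank = pd.DataFrame({
    'Team': ['Kansas', 'Memphis', 'Marquette'],
    'Rec': ['37-3', '38-2', '25-10'],
})

# Split "wins-losses" into two integer columns
record = trank['Rec'].str.split('-', expand=True).astype(int)
trank['Wins'] = record[0]
trank['Losses'] = record[1]

# Winning percentage based on the record string
trank['Win_Pct'] = trank['Wins'] / (trank['Wins'] + trank['Losses'])

print(trank)
```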
