# Retrieving College Basketball Data

In this notebook, I retrieve the game scores, basic team stats, kenpom data, and T-Rank data for all teams from 2002 to 2023 (the T-Rank data covers 2008 to 2022). The data is saved in csv files so that other notebooks can access it later.

In [1]:
# Import packages
import sys
sys.path.append('../')

import datetime
import pandas as pd
import collegebasketball as cbb
cbb.__version__

'2023'

## Getting the Scores Data

The scores are from https://www.sports-reference.com/cbb/. The code below downloads all college basketball scores from 2002 to 2023.

For each season, I create a csv file with all of the game scores for that season. Each record contains the home and away team names, both scores, and the tournament the game was played in (if applicable).

In [2]:
# The location where the files will be saved
path = '../Data/Scores/'

In [3]:
# Create one csv file per season, covering the regular season and the tournament, from 2002 to 2023 (you might want to skip 2020)
for year in range(2002, 2024):

    # Set up the starting and ending dates covering the regular season and March Madness
    start = datetime.date(year - 1, 11, 1)
    end = datetime.date(year, 4, 10)
    
    # Set up the path for this year's scores
    path_regular = path + str(year) + '_season.csv'

    # Create and save the csv file with the regular season and March Madness scores for the year
    cbb.load_scores_dataframe(start, end, csv_file_path=path_regular)
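
The loop above derives each season's date window from the year, and its comment notes that 2020 may be worth skipping. A minimal sketch of that logic in isolation, using only the standard library; `season_window` and `SKIP_YEARS` are hypothetical names introduced here, not part of `collegebasketball`:

```python
import datetime

# Hypothetical: seasons to leave out (the 2020 NCAA tournament was not played)
SKIP_YEARS = {2020}


def season_window(year):
    """Return the (start, end) dates used above for the season ending in `year`."""
    start = datetime.date(year - 1, 11, 1)
    end = datetime.date(year, 4, 10)
    return start, end


# Build the list of seasons to download, leaving out the skipped years
seasons = [year for year in range(2002, 2024) if year not in SKIP_YEARS]
windows = {year: season_window(year) for year in seasons}

print(len(seasons), windows[2003])
```

Each `(start, end)` pair could then be passed to `cbb.load_scores_dataframe` exactly as in the loop above.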

In [4]:
# Load a dataset to take an initial look
file_path = path + '2003_season.csv'
data = pd.read_csv(file_path)

data.head()

Unnamed: 0,Home,Away,Home_Score,Away_Score,Tournament
0,Alabama,Oklahoma,68,62,
1,Memphis,Syracuse,70,63,
2,Texas,Georgia,77,71,
3,Marquette,Villanova,73,61,
4,Eastern Washington,Wisconsin,55,81,


In [5]:
# Let's take a look at all the games involving Marquette during the 2003 Tournament
data = cbb.filter_tournament(data)
data[(data['Home'] == 'Marquette') | (data['Away'] == 'Marquette')]

Unnamed: 0,Home,Away,Home_Score,Away_Score,Tournament
4900,Marquette,Holy Cross,72,68,"NCAA, Midwest - First Round"
4936,Missouri,Marquette,92,101,"NCAA, Midwest - Second Round"
4962,Pitt,Marquette,74,77,"NCAA, Midwest - Regional Semifinal"
4970,Marquette,Kentucky,83,69,"NCAA, Midwest - Regional Final"
4978,Marquette,Kansas,61,94,"NCAA, National - National Semifinal"
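
Because a team can appear in either the `Home` or the `Away` column, its margin in each game depends on which side it was listed on. A small sketch of that calculation, assuming only pandas and the five rows printed above (the index column is dropped); the variable names are illustrative:

```python
import io
import pandas as pd

# The five tournament rows shown above, without the index column
csv_text = '''Home,Away,Home_Score,Away_Score,Tournament
Marquette,Holy Cross,72,68,"NCAA, Midwest - First Round"
Missouri,Marquette,92,101,"NCAA, Midwest - Second Round"
Pitt,Marquette,74,77,"NCAA, Midwest - Regional Semifinal"
Marquette,Kentucky,83,69,"NCAA, Midwest - Regional Final"
Marquette,Kansas,61,94,"NCAA, National - National Semifinal"
'''
games = pd.read_csv(io.StringIO(csv_text))

team = 'Marquette'
is_home = games['Home'] == team

# Margin from the team's point of view: home minus away when listed as home,
# away minus home otherwise
home_margin = games['Home_Score'] - games['Away_Score']
margin = home_margin.where(is_home, -home_margin)

wins = int((margin > 0).sum())
losses = int((margin < 0).sum())
print(wins, losses)
```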


## Getting Basic Team Stats

The team stats data also comes from https://www.sports-reference.com/cbb/. It contains basic basketball statistics for each team as of the end of each season. These stats will later be used to train the model and evaluate teams.

In [6]:
# The location where the files will be saved
path = '../Data/SportsReference/'

# We will be creating a csv file of data for each season
for year in range(2002, 2024):
    
    # Set the path for the current year data
    stats_path = path + str(year) + '_stats.csv'
    
    # Save the basic stats data into a csv file
    cbb.load_stats_dataframe(year=year, csv_file_path=stats_path)
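
Each run of the loop above requests every season again. If some files are already on disk, one option is to download only the seasons that are missing. A minimal sketch, assuming the `<year>_stats.csv` naming used above; `missing_seasons` is a hypothetical helper, not part of `collegebasketball`:

```python
from pathlib import Path
import tempfile


def missing_seasons(directory, suffix, years):
    """Return the years whose csv file is not yet present in `directory`."""
    directory = Path(directory)
    return [y for y in years if not (directory / f"{y}{suffix}").exists()]


# Demonstrate on a throwaway directory rather than the real data folder
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "2002_stats.csv").write_text("School,G\n")
    todo = missing_seasons(tmp, "_stats.csv", range(2002, 2006))

print(todo)  # [2003, 2004, 2005]
```

The returned years could then drive the download loop in place of `range(2002, 2024)`.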

In [7]:
# Load some data to take a look
stats_path = path + '2023_stats.csv'
data = pd.read_csv(stats_path)

data.head()

Unnamed: 0,School,G,SRS,SOS,Tm.,Opp.,MP,FG_opp,FGA_opp,FG%_opp,...,FT,FTA,FT%,ORB,TRB,AST,STL,BLK,TOV,PF
0,Abilene Christian,30,-2.54,1.15,2249,2133,1210,727,1526,0.476,...,414,573,0.723,307,972,471,264,71,379,617
1,Air Force,32,2.57,2.7,2142,2146,1295,764,1718,0.445,...,369,511,0.722,213,946,491,184,127,391,544
2,Akron,33,4.61,-1.23,2463,2209,1330,802,1873,0.428,...,475,643,0.739,335,1178,443,199,87,370,529
3,Alabama,34,23.82,10.14,2794,2329,1390,814,2189,0.372,...,564,777,0.726,434,1508,517,206,172,479,633
4,Alabama A&M,33,-10.39,-7.33,2297,2340,1336,781,1842,0.424,...,425,649,0.655,334,1079,419,264,133,498,624
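
A quick way to sanity-check a downloaded file is to recompute a derived column from its raw counts. The sketch below does this for `FT%`, which is free throws made divided by free throws attempted, using the `FT`, `FTA`, and `FT%` values copied from the five rows above:

```python
# (School, FT, FTA, FT%) copied from the rows shown above
rows = [
    ("Abilene Christian", 414, 573, 0.723),
    ("Air Force", 369, 511, 0.722),
    ("Akron", 475, 643, 0.739),
    ("Alabama", 564, 777, 0.726),
    ("Alabama A&M", 425, 649, 0.655),
]

# Collect any school whose recomputed percentage differs from the stored one
mismatches = [
    school for school, ft, fta, pct in rows
    if round(ft / fta, 3) != pct
]
print(mismatches)  # []
```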


## Getting the Kenpom Data

The kenpom data is from https://kenpom.com. This website displays advanced stats for each team in the NCAA. These stats will later be used to train the model and evaluate teams.

In [8]:
# The location where the files will be saved
path = '../Data/Kenpom/'

# We will be creating a csv file of kenpom data for each season
for year in range(2002, 2024):
    
    # Set the path for the current year data
    kp_path = path + str(year) + '_kenpom.csv'
    
    # Save the kenpom data into a csv file
    cbb.load_kenpom_dataframe(year=year, csv_file_path=kp_path)

In [9]:
# Load some data to take a look
kp_path = path + '2003_kenpom.csv'
data = pd.read_csv(kp_path)

data.head()

Unnamed: 0,Rank,Team,Seed,Conf,Wins,Losses,AdjEM,AdjO,AdjO Rank,AdjD,...,Luck,Luck Rank,OppAdjEM,OppAdjEM Rank,OppO,OppO Rank,OppD,OppD Rank,NCSOS AdjEM,NCSOS AdjEM Rank
0,1,Kentucky,1.0,SEC,32,4,29.18,116.5,5,87.4,...,0.051,53,11.2,4,108.6,3,97.4,16,6.77,33
1,2,Kansas,2.0,B12,30,8,28.62,115.0,12,86.4,...,-0.017,208,11.84,2,108.6,2,96.8,9,6.07,35
2,3,Pittsburgh,2.0,BE,28,5,28.61,114.8,14,86.2,...,-0.023,224,7.08,56,105.5,66,98.4,48,-8.24,310
3,4,Arizona,1.0,P10,28,4,26.8,115.6,10,88.8,...,-0.007,181,8.69,31,107.2,14,98.5,52,8.19,26
4,5,Illinois,4.0,B10,25,7,24.47,113.2,22,88.7,...,-0.029,242,7.29,52,105.8,60,98.5,53,-4.18,257


In [10]:
# Let's take a look at Marquette's kenpom numbers for 2003
data[data['Team'] == 'Marquette']

Unnamed: 0,Rank,Team,Seed,Conf,Wins,Losses,AdjEM,AdjO,AdjO Rank,AdjD,...,Luck,Luck Rank,OppAdjEM,OppAdjEM Rank,OppO,OppO Rank,OppD,OppD Rank,NCSOS AdjEM,NCSOS AdjEM Rank
14,15,Marquette,3.0,CUSA,27,6,21.3,120.5,2,99.2,...,0.07,30,7.93,42,106.1,50,98.2,40,-1.45,188
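
To make the columns more concrete: `AdjEM` is usually described as the adjusted efficiency margin, the difference between `AdjO` and `AdjD`. The sketch below checks that relationship on the rows printed above, under that assumption. The tolerance is an illustrative choice that allows for `AdjO` and `AdjD` being shown to one decimal place.

```python
# (Team, AdjEM, AdjO, AdjD) copied from the rows shown above
rows = [
    ("Kentucky", 29.18, 116.5, 87.4),
    ("Kansas", 28.62, 115.0, 86.4),
    ("Pittsburgh", 28.61, 114.8, 86.2),
    ("Arizona", 26.8, 115.6, 88.8),
    ("Illinois", 24.47, 113.2, 88.7),
    ("Marquette", 21.3, 120.5, 99.2),
]

# Assumed tolerance: rounding in the displayed AdjO and AdjD values
TOLERANCE = 0.11

# Gap between the reported margin and the one recomputed from AdjO and AdjD
gaps = {
    team: abs(adj_em - (adj_o - adj_d))
    for team, adj_em, adj_o, adj_d in rows
}
consistent = all(gap <= TOLERANCE for gap in gaps.values())
print(consistent)
```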


## Getting the T-Rank Data

The T-Rank data is from http://www.barttorvik.com/#. This website displays advanced stats for each team in the NCAA. These stats will later be used to train the model and evaluate teams.

In [11]:
# The location where the files will be saved
path = '../Data/TRank/'

# We will be creating a csv file of data for each season
for year in range(2008, 2023):
    
    # Set the path for the current year data
    TRank_path = path + str(year) + '_TRank.csv'
    
    # Save the T-Rank data into a csv file
    cbb.load_TRank_dataframe(year=year, csv_file_path=TRank_path)

In [12]:
# Load some data to take a look
TRank_path = path + '2008_TRank.csv'
data = pd.read_csv(TRank_path)

data.head()

Unnamed: 0,Rk,Team,Conf,G,Wins,Losses,AdjOE,AdjOE Rank,AdjDE,AdjDE Rank,...,2P%D,2P%D Rank,3P%,3P% Rank,3P%D,3P%D Rank,Adj T.,Adj T. Rank,WAB,WAB Rank
0,1,Kansas,B12,39,37,3,121.4,2,85.5,2,...,41.2,4,39.7,12,32.8,54,68.6,117,9.9,3
1,2,Memphis,CUSA,40,38,2,117.0,6,86.1,3,...,42.6,10,34.9,169,30.2,7,70.2,72,8.9,5
2,3,UCLA,P10,38,35,4,116.2,8,86.8,5,...,45.4,64,34.4,193,32.9,57,66.1,215,10.8,2
3,4,North Carolina,ACC,39,36,3,122.9,1,91.9,26,...,47.8,137,37.2,80,32.6,45,74.7,8,11.9,1
4,5,Wisconsin,B10,36,31,5,112.6,36,85.4,1,...,41.7,5,35.6,144,31.3,22,63.1,306,8.3,8


In [13]:
# Let's take a look at Marquette's T-Rank numbers for 2008
data[data['Team'] == 'Marquette']

Unnamed: 0,Rk,Team,Conf,G,Wins,Losses,AdjOE,AdjOE Rank,AdjDE,AdjDE Rank,...,2P%D,2P%D Rank,3P%,3P% Rank,3P%D,3P%D Rank,Adj T.,Adj T. Rank,WAB,WAB Rank
11,12,Marquette,BE,34,25,10,114.1,23,91.1,17,...,46.7,96,35.8,135,30.4,8,68.6,122,3.8,22
