# Retrieving College Basketball Data

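In this notebook, I retrieve the scores, basic team stats, kenpom and T-Rank data for college basketball teams. The project's data goes back to 2002 (2008 for T-Rank), and the cells below add the 2024 season. The data is saved in csv files so that other notebooks can access it later.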

In [1]:
# Import packages
import sys
sys.path.append('../')

import datetime
import pandas as pd
import collegebasketball as cbb
cbb.__version__

'2024'

## Getting the Scores Data

The scores are from https://www.sports-reference.com/cbb/. The cells below download the scores for the 2024 season and add them to the seasons, going back to 2002, that were previously retrieved in this project.

For each season, I create a csv file with all of the game scores for that season. Each record contains the home and away team names, both scores, and the tournament the game was played in (if applicable).

In [2]:
# The location where the files will be saved
path = '../Data/Scores/'
year = 2024

In [3]:
# Load the scores for the current season. The project's scores data goes back to 2002
# (though you might want to ignore 2020), and the earlier seasons were retrieved previously
start = datetime.date(year - 1, 11, 1)
end = datetime.date(year, 3, 18)

# Set up the path for this year's scores
path_regular = path + str(year) + '_season.csv'

# Create and save the csv file with the scores for the year (regular season and any March Madness games in the date range)
cbb.load_scores_dataframe(start, end, csv_file_path=path_regular)
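To retrieve several seasons at once, the date range and file path from the cell above can be wrapped in small helpers. This is a minimal sketch: `season_window` and `season_csv_path` are hypothetical names, not part of `collegebasketball`, and the March 18 cutoff is simply the one used for 2024 above, so other seasons may need a different end date.

```python
import datetime

def season_window(year, end_month=3, end_day=18):
    """Return (start, end) dates for the season that finishes in `year`.

    The default end date mirrors the 2024 cell above and is an assumption
    for any other season.
    """
    start = datetime.date(year - 1, 11, 1)
    end = datetime.date(year, end_month, end_day)
    return start, end

def season_csv_path(base_path, year):
    """Build the per-season file name used in this notebook."""
    return base_path + str(year) + '_season.csv'

for yr in range(2022, 2025):
    start, end = season_window(yr)
    print(yr, start, end, season_csv_path('../Data/Scores/', yr))
    # With the project package imported as cbb, the retrieval itself would be:
    # cbb.load_scores_dataframe(start, end, csv_file_path=season_csv_path('../Data/Scores/', yr))
```

The call to `cbb.load_scores_dataframe` is left commented out so the sketch runs without network access.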

In [4]:
# Load a dataset to take an initial look
file_path = path + '2024_season.csv'
data = pd.read_csv(file_path)

data.head()

Unnamed: 0,Home,Away,Home_Score,Away_Score,Tournament
0,North Carolina Central,Kansas,56.0,99.0,
1,Dartmouth,Duke,54.0,92.0,
2,Samford,Purdue,45.0,98.0,
3,James Madison,Michigan State,79.0,76.0,
4,Northern Illinois,Marquette,70.0,92.0,
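The `Unnamed: 0` column in the output above looks like the row index that was written out when the csv file was saved. I am assuming the file was written with pandas' default `index=True`. If so, passing `index_col=0` to `pd.read_csv` restores it as the index instead of a data column. The sketch below shows this with two rows from the output above, using an in-memory file so nothing is written to disk.

```python
import io
import pandas as pd

scores = pd.DataFrame({
    'Home': ['North Carolina Central', 'Dartmouth'],
    'Away': ['Kansas', 'Duke'],
    'Home_Score': [56.0, 54.0],
    'Away_Score': [99.0, 92.0],
})

# to_csv() writes the row index as a first column with an empty header
csv_text = scores.to_csv()

# Reading it back without index_col turns that column into 'Unnamed: 0'
plain = pd.read_csv(io.StringIO(csv_text))

# Reading it back with index_col=0 uses the first column as the index again
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(plain.columns))
print(list(indexed.columns))
```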


In [5]:
# Let's take a look at all the games involving Marquette during the 2003 Tournament
data = cbb.filter_tournament(pd.read_csv(path + '2003_season.csv'))
data[(data['Home'] == 'Marquette') | (data['Away'] == 'Marquette')]

Unnamed: 0,Home,Away,Home_Score,Away_Score,Tournament
4900,Marquette,Holy Cross,72,68,"NCAA, Midwest - First Round"
4936,Missouri,Marquette,92,101,"NCAA, Midwest - Second Round"
4962,Pitt,Marquette,74,77,"NCAA, Midwest - Regional Semifinal"
4970,Marquette,Kentucky,83,69,"NCAA, Midwest - Regional Final"
4978,Marquette,Kansas,61,94,"NCAA, National - National Semifinal"
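Because a team can appear in either the `Home` or the `Away` column, reading off its results takes one extra step. The sketch below rebuilds the five rows shown above and adds the team's own score, the opponent and a win flag. `team_results` is a hypothetical helper written for this illustration, not a function from `collegebasketball`.

```python
import pandas as pd

games = pd.DataFrame(
    [
        ('Marquette', 'Holy Cross', 72, 68, 'NCAA, Midwest - First Round'),
        ('Missouri', 'Marquette', 92, 101, 'NCAA, Midwest - Second Round'),
        ('Pitt', 'Marquette', 74, 77, 'NCAA, Midwest - Regional Semifinal'),
        ('Marquette', 'Kentucky', 83, 69, 'NCAA, Midwest - Regional Final'),
        ('Marquette', 'Kansas', 61, 94, 'NCAA, National - National Semifinal'),
    ],
    columns=['Home', 'Away', 'Home_Score', 'Away_Score', 'Tournament'],
)

def team_results(df, team):
    """Return one row per game for `team`, from that team's point of view."""
    mine = df[(df['Home'] == team) | (df['Away'] == team)].copy()
    is_home = mine['Home'] == team
    # Series.where keeps the value where the condition holds, else takes the other
    mine['Opponent'] = mine['Away'].where(is_home, mine['Home'])
    mine['Team_Score'] = mine['Home_Score'].where(is_home, mine['Away_Score'])
    mine['Opp_Score'] = mine['Away_Score'].where(is_home, mine['Home_Score'])
    mine['Win'] = mine['Team_Score'] > mine['Opp_Score']
    return mine[['Opponent', 'Team_Score', 'Opp_Score', 'Win', 'Tournament']]

results = team_results(games, 'Marquette')
wins = int(results['Win'].sum())
losses = int((~results['Win']).sum())
print(results)
print('Record:', wins, '-', losses)
```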


## Getting Basic Team Stats

The team stats data is also from https://www.sports-reference.com/cbb/. This data contains basic basketball statistics for each team at the end of each season. These stats will later be used to train the model and evaluate teams.

In [6]:
# The location where the files will be saved
path = '../Data/SportsReference/'

# As before, we add this season to the data, which goes back to 2002
stats_path = path + str(year) + '_stats.csv'
    
# Save the basic stats data into a csv file
cbb.load_stats_dataframe(year=year, csv_file_path=stats_path)

In [7]:
# Load some data to take a look
stats_path = path + '2024_stats.csv'
data = pd.read_csv(stats_path)

data.head()

Unnamed: 0,School,G,SRS,SOS,Tm.,Opp.,MP,FG_opp,FGA_opp,FG%_opp,...,FT,FTA,FT%,ORB,TRB,AST,STL,BLK,TOV,PF
0,Abilene Christian,32,-3.82,-1.18,2320,2351,1295,833,1792,0.465,...,533,729,0.731,314,1070,405,253,65,404,634
1,Air Force,31,-4.34,1.86,2051,2243,1250,774,1631,0.475,...,324,475,0.682,225,872,451,202,122,372,541
2,Akron,34,3.01,-2.52,2517,2239,1365,827,1939,0.427,...,463,636,0.728,354,1252,447,192,99,389,563
3,Alabama,32,21.1,11.42,2904,2594,1290,884,2008,0.44,...,572,730,0.784,407,1267,515,232,133,383,636
4,Alabama A&M,33,-14.66,-7.57,2267,2501,1325,801,1877,0.427,...,622,865,0.719,373,1170,344,251,127,534,695
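A quick way to check that the stats were parsed into the right columns is to recompute a derived value from its components. The sketch below uses the five rows shown above to recompute `FT%` from `FT` and `FTA`, and also derives a per-game scoring margin. Treating `Tm.` and `Opp.` as total points scored and allowed is my assumption about the column meanings.

```python
import pandas as pd

stats = pd.DataFrame({
    'School': ['Abilene Christian', 'Air Force', 'Akron', 'Alabama', 'Alabama A&M'],
    'G': [32, 31, 34, 32, 33],
    'Tm.': [2320, 2051, 2517, 2904, 2267],
    'Opp.': [2351, 2243, 2239, 2594, 2501],
    'FT': [533, 324, 463, 572, 622],
    'FTA': [729, 475, 636, 730, 865],
    'FT%': [0.731, 0.682, 0.728, 0.784, 0.719],
})

# Recompute free throw percentage and compare it with the stored column
stats['FT%_check'] = (stats['FT'] / stats['FTA']).round(3)
ft_matches = (stats['FT%_check'] - stats['FT%']).abs() < 1e-9

# Per-game scoring margin, assuming Tm. and Opp. are season point totals
stats['Margin_per_game'] = (stats['Tm.'] - stats['Opp.']) / stats['G']

print(stats[['School', 'FT%', 'FT%_check', 'Margin_per_game']])
print('All FT% values match:', bool(ft_matches.all()))
```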


## Getting the Kenpom Data

The kenpom data is from https://kenpom.com. This website displays advanced stats for each team in the NCAA. These stats will later be used to train the model and evaluate teams.

In [8]:
# The location where the files will be saved
path = '../Data/Kenpom/'

# Since the kenpom data has become harder to scrape, we'll manually save the html to a file
kp_input_path = '../Data/kenpom_html/kenpom_2024.html'

# This data also goes back to 2002, but we'll just retrieve the current season now
kp_path = path + str(year) + '_kenpom.csv'
    
# Save the kenpom data into a csv file
cbb.load_kenpom_dataframe(year=year, csv_file_path=kp_path, input_file_path=kp_input_path)

In [9]:
# Load some data to take a look
kp_path = path + '2024_kenpom.csv'
data = pd.read_csv(kp_path)

data.head()

Unnamed: 0,Rank,Team,Seed,Conf,Wins,Losses,AdjEM,AdjO,AdjO Rank,AdjD,...,Luck,Luck Rank,OppAdjEM,OppAdjEM Rank,OppO,OppO Rank,OppD,OppD Rank,NCSOS AdjEM,NCSOS AdjEM Rank
0,1,Connecticut,1.0,BE,31,3,32.21,126.6,1,94.4,...,0.047,70,10.46,34,111.6,41,101.2,30,-3.36,285
1,2,Houston,1.0,B12,30,4,31.72,118.9,17,87.1,...,0.053,61,11.75,14,111.9,36,100.1,6,-0.79,225
2,3,Purdue,1.0,B10,29,4,29.12,125.0,4,95.9,...,0.045,75,13.74,4,114.0,5,100.3,10,10.35,13
3,4,Auburn,4.0,SEC,27,7,28.9,120.6,10,91.7,...,-0.067,324,9.6,53,111.9,35,102.3,71,1.45,152
4,5,Iowa St.,2.0,B12,27,7,26.72,113.9,55,87.1,...,0.012,155,10.38,37,111.0,58,100.6,15,-7.2,351


In [10]:
# Let's take a look at Marquette's kenpom numbers for 2024
data[data['Team'] == 'Marquette']

Unnamed: 0,Rank,Team,Seed,Conf,Wins,Losses,AdjEM,AdjO,AdjO Rank,AdjD,...,Luck,Luck Rank,OppAdjEM,OppAdjEM Rank,OppO,OppO Rank,OppD,OppD Rank,NCSOS AdjEM,NCSOS AdjEM Rank
11,12,Marquette,2.0,BE,25,9,22.74,118.3,21,95.6,...,0.036,98,13.13,7,113.3,6,100.2,8,8.27,23
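Since the kenpom table is parsed from a manually saved html file, a simple consistency check helps confirm the columns line up. The adjusted efficiency margin (`AdjEM`) is the adjusted offense minus the adjusted defense, so `AdjO - AdjD` should match `AdjEM` up to rounding. The sketch below checks this for the rows shown above. The 0.15 tolerance is my own choice, made because `AdjO` and `AdjD` are displayed to one decimal place.

```python
import pandas as pd

kenpom = pd.DataFrame({
    'Team': ['Connecticut', 'Houston', 'Purdue', 'Auburn', 'Iowa St.', 'Marquette'],
    'AdjEM': [32.21, 31.72, 29.12, 28.9, 26.72, 22.74],
    'AdjO': [126.6, 118.9, 125.0, 120.6, 113.9, 118.3],
    'AdjD': [94.4, 87.1, 95.9, 91.7, 87.1, 95.6],
})

# Recompute the margin from the offense and defense columns
kenpom['Margin'] = kenpom['AdjO'] - kenpom['AdjD']
kenpom['Diff'] = (kenpom['Margin'] - kenpom['AdjEM']).abs()

# Allow for rounding in the displayed one-decimal values
consistent = kenpom['Diff'] < 0.15

print(kenpom[['Team', 'AdjEM', 'Margin', 'Diff']])
print('All rows consistent:', bool(consistent.all()))
```

A row that fails this check would suggest shifted columns in the parsed html rather than a real disagreement in the data.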


## Getting the T-Rank Data

The T-Rank data is from http://www.barttorvik.com/#. This website displays advanced stats for each team in the NCAA. These stats will later be used to train the model and evaluate teams.

In [11]:
# The location where the files will be saved
path = '../Data/TRank/'

# This data goes back to 2008, but we'll just get this season
TRank_path = path + str(year) + '_TRank.csv'
    
# Save the T-Rank data into a csv file
cbb.load_TRank_dataframe(year=year, csv_file_path=TRank_path)

In [12]:
# Load some data to take a look
TRank_path = path + '2024_TRank.csv'
data = pd.read_csv(TRank_path)

data.head()

Unnamed: 0,Rk,Team,Conf,G,Wins,Losses,AdjOE,AdjOE Rank,AdjDE,AdjDE Rank,...,3P%D,3P%D Rank,3PR,3PR Rank,3PRD,3PRD Rank,Adj T.,Adj T. Rank,WAB,WAB Rank
0,1,Houston,B12,34,30,4,119.2,15,85.5,1,...,30.0,13,36.6,201,40.9,293,63.3,349,10.6,3
1,2,Connecticut,BE,34,31,3,127.1,1,93.6,11,...,31.9,61,40.9,87,33.2,48,64.6,325,11.3,1
2,3,Purdue,B10,33,29,4,126.2,2,94.7,18,...,31.4,42,35.0,243,37.2,184,67.6,168,11.0,2
3,4,Iowa St.,B12,34,27,7,113.6,51,86.5,2,...,31.5,48,32.0,301,45.0,353,67.6,170,6.9,4
4,5,Auburn,SEC,34,27,7,120.7,10,92.1,6,...,29.8,9,37.5,177,33.3,51,69.8,58,5.5,9


In [13]:
# Let's take a look at Marquette's numbers for 2024
data[data['Team'] == 'Marquette']

Unnamed: 0,Rk,Team,Conf,G,Wins,Losses,AdjOE,AdjOE Rank,AdjDE,AdjDE Rank,...,3P%D,3P%D Rank,3PR,3PR Rank,3PRD,3PRD Rank,Adj T.,Adj T. Rank,WAB,WAB Rank
7,8,Marquette,BE,34,25,9,118.9,19,94.6,17,...,33.6,155,40.5,96,43.1,340,69.1,86,6.5,6
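The kenpom and T-Rank tables can be compared side by side by joining them on `Team`. The sketch below does this for the rows shown in this notebook. The team names happen to match in these rows, but I am not assuming that holds for every team, so the merge uses `indicator=True` to surface any name that appears in only one source.

```python
import pandas as pd

kenpom = pd.DataFrame({
    'Team': ['Connecticut', 'Houston', 'Purdue', 'Auburn', 'Iowa St.', 'Marquette'],
    'Rank': [1, 2, 3, 4, 5, 12],
    'AdjEM': [32.21, 31.72, 29.12, 28.9, 26.72, 22.74],
})

trank = pd.DataFrame({
    'Team': ['Houston', 'Connecticut', 'Purdue', 'Iowa St.', 'Auburn', 'Marquette'],
    'Rk': [1, 2, 3, 4, 5, 8],
    'AdjOE': [119.2, 127.1, 126.2, 113.6, 120.7, 118.9],
    'AdjDE': [85.5, 93.6, 94.7, 86.5, 92.1, 94.6],
})

# An outer merge keeps teams from both sources; _merge records where each row came from
merged = kenpom.merge(trank, on='Team', how='outer', indicator=True)
unmatched = merged[merged['_merge'] != 'both']

# Positive values mean T-Rank ranks the team higher than kenpom does
merged['Rank_Diff'] = merged['Rank'] - merged['Rk']

print(merged[['Team', 'Rank', 'Rk', 'Rank_Diff']])
print('Teams found in only one source:', unmatched['Team'].tolist())
```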
