## This code looks up basketball stats for all college basketball teams for a given year and combines them into a data file for modeling. Predicting sports outcomes has long intrigued me, as I am a sports fan and especially enjoy basketball. Much of the data structuring and modelling takes inspiration from both Magel and Unruh (2013) and Brown (2019). The specific goal of this notebook file is to scrape data from the college basketball section of sports-reference.com.

## The ultimate goal of this project is to create a machine learning model that can take two schools and give the probability of each school winning a hypothetical head-to-head match-up. This section focuses on obtaining the data.

### This section pulls out a data frame that contains the names of every college basketball team. This list of names will be needed later in order to pull out each individual team's stats.

In [1]:
#import packages
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
import time


In [2]:
#list year for data extraction
years = list(range(2023,2024))
print(years)

[2023]


In [3]:
#set starting URL for data frame that contains list of all schools
url_start= "https://www.sports-reference.com/cbb/seasons/men/{}-school-stats.html"

In [4]:
#Download and save URL file for overall stats for given year
# once this cell has run, the page has been saved to a file on the computer and an internet connection
# is no longer needed to manipulate it
for year in years:
    url = url_start.format(year)
    data = requests.get(url)
    
    with open('College Data/Overall/{}.html'.format(year), "w+") as f:
        f.write(data.text)
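Re-running the cell above requests every page again. A minimal sketch of a cached download follows, assuming a hypothetical helper `fetch_cached` that is not part of the original notebook. It uses only the standard library, and the `fetch` parameter is injectable so the caching logic can be exercised without a network connection.

```python
import time
import urllib.request
from pathlib import Path


def _http_get(url):
    """Download a URL and return its body as text (assumes UTF-8 pages)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def fetch_cached(url, path, fetch=_http_get, delay=0.0):
    """Return the page saved at `path`, downloading it only if it is missing.

    `fetch_cached`, `fetch` and `delay` are hypothetical names for this sketch.
    """
    path = Path(path)
    if path.exists():
        # The page was saved by an earlier run, so no request is made.
        return path.read_text(encoding="utf-8")
    time.sleep(delay)  # pause before each real request
    text = fetch(url)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return text
```

With the notebook's own names this could be called as `fetch_cached(url_start.format(year), 'College Data/Overall/{}.html'.format(year), delay=2)`.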

In [5]:
#open overall stats URL file
with open("College Data/Overall/2023.html") as f:
    page = f.read()
    

In [6]:
#Parse URL and pull out data as a pandas data frame called "all_schools"
soup = BeautifulSoup(page, "html.parser")
soup.find('tr', class_="over_header").decompose()
stats_table = soup.find(id="basic_school_stats")
all_schools = pd.read_html(str(stats_table))[0]


### The following cells work as follows. The school names obtained from "all_schools" are used in a loop to download the page for each individual school. These pages contain tables with information about every game played by a given school in a given season. In some instances the school name taken from "all_schools" does not match the school name in the URL, and an error page is returned instead of a page with data. As a result, several iterations of correcting text differences were needed to fix the names and obtain data for all of the schools.

In [7]:
#examining all of the school names that have been imported to look for issues with string names

pd.options.display.max_rows = 70
print(all_schools["School"].unique())

['Abilene Christian' 'Air Force' 'Akron' 'Alabama\xa0NCAA' 'Alabama A&M'
 'Alabama State' 'Albany (NY)' 'Alcorn State' 'American'
 'Appalachian State' 'Arizona\xa0NCAA' 'Arizona State\xa0NCAA'
 'Arkansas\xa0NCAA' 'Arkansas State' 'Arkansas-Pine Bluff' 'Army'
 'Auburn\xa0NCAA' 'Austin Peay' 'Ball State' 'Baylor\xa0NCAA' nan 'School'
 'Bellarmine' 'Belmont' 'Bethune-Cookman' 'Binghamton'
 'Boise State\xa0NCAA' 'Boston College' 'Boston University'
 'Bowling Green State' 'Bradley' 'Brigham Young' 'Brown' 'Bryant'
 'Bucknell' 'Buffalo' 'Butler' 'Cal Poly' 'Cal State Bakersfield'
 'Cal State Fullerton' 'Cal State Northridge' 'California'
 'California Baptist' 'Campbell' 'Canisius' 'Central Arkansas'
 'Central Connecticut State' 'Central Florida' 'Central Michigan'
 'Charleston Southern' 'Charlotte' 'Chattanooga' 'Chicago State'
 'Cincinnati' 'Clemson' 'Cleveland State' 'Coastal Carolina'
 'Colgate\xa0NCAA' 'College of Charleston\xa0NCAA' 'Colorado'
 'Colorado State' 'Columbia' 'Connecticut\x

In [8]:
#creating a function that removes symbols and corrects school names so they match with the names on
# basketball reference urls
def clean_school_names(variable):
    variable = variable.str.replace("(", "", regex = False)\
    .str.replace(")","", regex = False)\
    .str.replace("&","")\
    .str.replace(".","", regex = False)\
    .str.replace("'","", regex = False)\
    .str.replace("--", "-", regex = False)\
    .str.replace("  ", " ")\
    .str.replace("The Citadel" , "Citadel")\
    .str.replace("Houston Christian" , "Houston Baptist")\
    .str.replace("Kansas City" , "Missouri Kansas City")\
    .str.replace("Little Rock" , "Arkansas Little Rock")\
    .str.replace("Louisiana" , "Louisiana Lafayette")\
    .str.replace("NC State" , "North Carolina State")\
    .str.replace("Omaha" , "Nebraska Omaha")\
    .str.replace("Purdue Fort Wayne" , "IPFW")\
    .str.replace("SIU Edwardsville" , "Southern Illinois Edwardsville")\
    .str.replace("TCU" , "Texas Christian")\
    .str.replace("Texas-Rio Grande Valley" , "Texas Pan American")\
    .str.replace("UAB" , "Alabama Birmingham")\
    .str.replace("UC" , "California")\
    .str.replace("UT Arlington" , "Texas Arlington")\
    .str.replace("Utah Tech" , "Dixie State")\
    .str.replace("UTEP" , "Texas El Paso")\
    .str.replace("UTSA" , "Texas San Antonio")\
    .str.replace("VMI" , "Virginia Military Institute")\
    .str.replace("UNC", "North Carolina")\
    .str.replace("CaliforniaLA", "UCLA")\
    .str.replace("Louisiana Lafayette State", "Louisiana State")\
    .str.replace("Louisiana Lafayette Tech", "Louisiana Tech")\
    .str.replace("Southeastern Louisiana Lafayette", "Southeastern Louisiana")\
    .str.replace("Louisiana Lafayette-Monroe", "Louisiana Monroe")\
    .str.replace("St Thomas", "St Thomas MN")\
    .str.replace("Sam Houston", "Sam Houston State")\
    .str.replace("\xa0NCAA", "", regex = False)\
    .dropna()\
    .str.replace(' ', '-')\
    .str.lower()
    
    variable.drop(variable[variable == 'school'].index, inplace = True)

    return variable
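The chained substring replacements above need follow-up corrections (for example, "Louisiana" also rewrites "Louisiana State", which is then patched back). An alternative sketch, assuming a hypothetical exact-match override table `OVERRIDES` and helper `to_slug` (neither is in the original notebook), avoids that cascade by only renaming a school when the whole name matches:

```python
import re

# Hypothetical override table: cleaned display name -> name used in the site's URLs.
# Only a few entries from clean_school_names are shown for illustration.
OVERRIDES = {
    "Louisiana": "Louisiana Lafayette",
    "NC State": "North Carolina State",
    "TCU": "Texas Christian",
}


def to_slug(name):
    """Turn a display name into a lower-case, hyphen-separated slug."""
    name = name.replace("\xa0NCAA", "")        # strip the tournament marker
    name = re.sub(r"[()&.']", "", name)        # drop punctuation
    name = re.sub(r"\s+", " ", name).strip()   # collapse repeated spaces
    name = OVERRIDES.get(name, name)           # rename on exact match only
    return name.replace(" ", "-").lower()


print(to_slug("Louisiana"))        # louisiana-lafayette
print(to_slug("Louisiana State"))  # louisiana-state
```

The trade-off is that substring rules such as "UC" to "California" would need one table entry per affected school.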


In [10]:
#this code applies the name correction function to the set of school names from 'all_schools' and then looks at
# the corrected names
school_names = clean_school_names(all_schools['School'])
school_names.unique()

array(['abilene-christian', 'air-force', 'akron', 'alabama', 'alabama-am',
       'alabama-state', 'albany-ny', 'alcorn-state', 'american',
       'appalachian-state', 'arizona', 'arizona-state', 'arkansas',
       'arkansas-state', 'arkansas-pine-bluff', 'army', 'auburn',
       'austin-peay', 'ball-state', 'baylor', 'bellarmine', 'belmont',
       'bethune-cookman', 'binghamton', 'boise-state', 'boston-college',
       'boston-university', 'bowling-green-state', 'bradley',
       'brigham-young', 'brown', 'bryant', 'bucknell', 'buffalo',
       'butler', 'cal-poly', 'cal-state-bakersfield',
       'cal-state-fullerton', 'cal-state-northridge', 'california',
       'california-baptist', 'campbell', 'canisius', 'central-arkansas',
       'central-connecticut-state', 'central-florida', 'central-michigan',
       'charleston-southern', 'charlotte', 'chattanooga', 'chicago-state',
       'cincinnati', 'clemson', 'cleveland-state', 'coastal-carolina',
       'colgate', 'college-of-charlest

In [11]:
#set starting url for obtaining individual school data
# this is the base URL into which each school name is inserted in order to pull the page for each
# school. If the resulting URL does not match an actual sports-reference.com URL, an
# error page will be returned
school_url_start = "https://www.sports-reference.com/cbb/schools/{}/men/2023-gamelogs.html"

#download the season page for each individual school
for name in school_names:
    url = school_url_start.format(name)
    #needed to keep website from blocking access
    time.sleep(2)
    
    data = requests.get(url)
    
    with open('College Data/Schools/{}.html'.format(name), "w+") as f:
        f.write(data.text)

In [12]:
#Read in a saved copy of the HTML page that is returned when a school URL cannot be found
with open("Error.html") as f:
    pageE = f.read()
     

soupE = BeautifulSoup(pageE, "html.parser")


In [13]:
#This cell is a check. It takes the page that has been downloaded for each school and compares it to
# an error page. If the two pages match (meaning the school page errored out), the school name is printed
# for further inspection and correction in the "clean_school_names" function
for name in school_names:   
    
    with open("College Data/Schools/{}.html".format(name)) as f:
        page3 = f.read()
        
    soup3 = BeautifulSoup(page3, "html.parser")
    
    if soup3 == soupE:
        print(name)
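The check above depends on a saved copy of the error page. A lighter alternative sketch is shown here, assuming that a failed page simply lacks the game-log container whose id (`div_sgl-basic_NCAAM`) is used in the parsing step, and that the id appears in the raw HTML with double quotes. The helpers `has_gamelog` and `find_failed` are hypothetical names.

```python
# id of the game-log container that the parsing loop looks up;
# the exact attribute quoting is an assumption of this sketch
GAMELOG_ID = 'id="div_sgl-basic_NCAAM"'


def has_gamelog(html):
    """Return True if the saved page contains the game-log container."""
    return GAMELOG_ID in html


def find_failed(pages):
    """Given {school_name: html_text}, return the names whose page lacks the table."""
    return sorted(name for name, html in pages.items() if not has_gamelog(html))


# Toy pages standing in for the downloaded files
pages = {
    "akron": '<div id="div_sgl-basic_NCAAM"><table></table></div>',
    "bad-name": "<html><h1>Page Not Found</h1></html>",
}
print(find_failed(pages))  # ['bad-name']
```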

### The following cells take the saved pages for each school, extract the data frames, clean them lightly, and append them to an empty list

In [12]:
#setting empty list
full_data = []

In [14]:
#using a loop to go through school URLs to obtain school season game data and append
# it together into one data frame
for name in school_names:
    
    
    with open("College Data/Schools/{}.html".format(name)) as f:
        page2 = f.read()

    soup = BeautifulSoup(page2, "html.parser")
    soup.find('tr', class_="over_header").decompose()
    stats_table = soup.find(id="div_sgl-basic_NCAAM")
    data = pd.read_html(str(stats_table))[0]

    #removing unneeded rows
    data.drop(data[(data['FG%'] == "School") | (data['FG%'] == "FG%")].index, inplace=True)
    #creating a playing location variable 
    data['Unnamed: 2'] = data['Unnamed: 2'].replace("@","A").replace(np.nan,"H")
    data = data.drop(columns=['Unnamed: 23']).dropna()
    data['Location'] = data['Unnamed: 2']
    data = data.drop(columns=['Unnamed: 2'])
    
    data['Team'] = name
    
    print(name)
    full_data.append(data)



abilene-christian
air-force
akron
alabama
alabama-am
alabama-state
albany-ny
alcorn-state
american
appalachian-state
arizona
arizona-state
arkansas
arkansas-state
arkansas-pine-bluff
army
auburn
austin-peay
ball-state
baylor
bellarmine
belmont
bethune-cookman
binghamton
boise-state
boston-college
boston-university
bowling-green-state
bradley
brigham-young
brown
bryant
bucknell
buffalo
butler
cal-poly
cal-state-bakersfield
cal-state-fullerton
cal-state-northridge
california
california-baptist
campbell
canisius
central-arkansas
central-connecticut-state
central-florida
central-michigan
charleston-southern
charlotte
chattanooga
chicago-state
cincinnati
clemson
cleveland-state
coastal-carolina
colgate
college-of-charleston
colorado
colorado-state
columbia
connecticut
coppin-state
cornell
creighton
dartmouth
davidson
dayton
delaware
delaware-state
denver
depaul
detroit-mercy
drake
drexel
duke
duquesne
east-carolina
east-tennessee-state
eastern-illinois
eastern-kentucky
eastern-michigan
east
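The location recoding inside the loop can be seen in isolation on a toy column. This is a minimal sketch with made-up values, not data from the site: `@` marks an away game, `N` a neutral site, and a missing value a home game.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw location column ('Unnamed: 2' in the notebook)
raw = pd.Series(["@", np.nan, "N", np.nan, "@"])

# '@' becomes 'A' (away), missing values become 'H' (home), 'N' is left alone
location = raw.replace("@", "A").fillna("H")
print(location.tolist())  # ['A', 'H', 'N', 'H', 'A']
```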

In [17]:
#convert the list of data frames into one pandas data frame and clean the "Opp" variable, which
# is the opponent played in each row
full_data_complete = pd.concat(full_data)
full_data_complete['Opp'] = clean_school_names(full_data_complete['Opp'])
full_data_complete.head()

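| | G | Date | Opp | W/L | Tm | Opp.1 | FG | FGA | FG% | 3P | ... | FT%.1 | ORB.1 | TRB.1 | AST.1 | STL.1 | BLK.1 | TOV.1 | PF.1 | Location | Team |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2022-11-07 | jackson-state | W | 65 | 56 | 23 | 57 | 0.404 | 8 | ... | 0.714 | 7 | 40 | 9 | 6 | 1 | 21 | 21 | H | abilene-christian |
| 1 | 2 | 2022-11-11 | texas-am | L | 58 | 77 | 20 | 52 | 0.385 | 8 | ... | 0.714 | 10 | 33 | 7 | 11 | 3 | 19 | 17 | A | abilene-christian |
| 2 | 3 | 2022-11-15 | mcmurry | W | 104 | 46 | 41 | 68 | 0.603 | 5 | ... | 0.667 | 4 | 17 | 7 | 9 | 2 | 27 | 18 | H | abilene-christian |
| 3 | 4 | 2022-11-21 | wright-state | L | 61 | 77 | 25 | 58 | 0.431 | 7 | ... | 0.632 | 1 | 24 | 18 | 12 | 5 | 18 | 14 | N | abilene-christian |
| 4 | 5 | 2022-11-22 | weber-state | L | 67 | 77 | 26 | 53 | 0.491 | 9 | ... | 0.92 | 8 | 29 | 10 | 4 | 0 | 18 | 13 | N | abilene-christian |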


In [18]:
#inspecting data column names
full_data_complete.columns

Index(['G', 'Date', 'Opp', 'W/L', 'Tm', 'Opp.1', 'FG', 'FGA', 'FG%', '3P',
       '3PA', '3P%', 'FT', 'FTA', 'FT%', 'ORB', 'TRB', 'AST', 'STL', 'BLK',
       'TOV', 'PF', 'FG.1', 'FGA.1', 'FG%.1', '3P.1', '3PA.1', '3P%.1', 'FT.1',
       'FTA.1', 'FT%.1', 'ORB.1', 'TRB.1', 'AST.1', 'STL.1', 'BLK.1', 'TOV.1',
       'PF.1', 'Location', 'Team'],
      dtype='object')

In [19]:
#exporting data frame as a csv file
full_data_complete.to_csv("College Data/All Teams Data.csv", index = False)

## We now have a completed data set where each row is one game played and each column is a variable such as free throw percentage (FT%). This data has been cleaned very lightly and now requires further cleaning and modifying. The notebook file "College Data Cleaning" will take care of that process.

# References

### Brown, B. (2019). Predictive Analytics for College Basketball: Using Logistic Regression for Determining the Outcome of a Game (thesis). Honors Theses and Capstones. 475.

### Magel, R., & Unruh, S. (2013). Determining factors influencing the outcome of college basketball games. Open Journal of Statistics, 03(04), 225–230. https://doi.org/10.4236/ojs.2013.34026