ESPN API Scraper
Men's Basketball

- Goal of this file is to scrape all games within specified time frame, then scrape boxscore data for BOTH teams from each unique game id.
- Last update 02/02/26; having issues with too many requests
- API documentation found thanks to https://gist.github.com/akeaswaran/b48b02f1c94f873c6655e7129910fc3b

Specifically annotated to be shared with BYU Sports Analytics Association

In [1]:
import requests
import pandas as pd
from datetime import datetime 

In [2]:
## set start and end dates of games to scrape
start_date = "20251104"
end_date = "20260131"

## create a date range. This will create a date for each day in the range.
dates = pd.date_range(
    start=datetime.strptime(start_date, "%Y%m%d"),
    end=datetime.strptime(end_date, "%Y%m%d")
)

DatetimeIndex(['2025-11-04', '2025-11-05', '2025-11-06', '2025-11-07',
               '2025-11-08', '2025-11-09', '2025-11-10', '2025-11-11',
               '2025-11-12', '2025-11-13', '2025-11-14', '2025-11-15',
               '2025-11-16', '2025-11-17', '2025-11-18', '2025-11-19',
               '2025-11-20', '2025-11-21', '2025-11-22', '2025-11-23',
               '2025-11-24', '2025-11-25', '2025-11-26', '2025-11-27',
               '2025-11-28', '2025-11-29', '2025-11-30', '2025-12-01',
               '2025-12-02', '2025-12-03', '2025-12-04', '2025-12-05',
               '2025-12-06', '2025-12-07', '2025-12-08', '2025-12-09',
               '2025-12-10', '2025-12-11', '2025-12-12', '2025-12-13',
               '2025-12-14', '2025-12-15', '2025-12-16', '2025-12-17',
               '2025-12-18', '2025-12-19', '2025-12-20', '2025-12-21',
               '2025-12-22', '2025-12-23', '2025-12-24', '2025-12-25',
               '2025-12-26', '2025-12-27', '2025-12-28', '2025-12-29',
      

Quick explanation of APIs for those who are unfamiliar:

**The basics:** An API is a tool that lets one program ask another program for data or to perform an action. Instead of manually finding and collecting information, you send a request to the API and it sends back the data you asked for.

**For this ESPN example:**

The ESPN API lets your code request college basketball game data on demand instead of collecting it by hand. Here's what happens in this notebook:

1. **Request**: Your code sends one request per day to ESPN's API endpoint with parameters like:
   - `dates` (a single day in `YYYYMMDD` format, looping from November 4, 2025 to January 31, 2026)
   - `seasontype=2` (regular season games)
   - `groups=50` (all Division 1 teams)
   - `limit=1000` (get up to 1000 games per day)

2. **API Processing**: ESPN's servers receive your request and find all men's college basketball games matching those parameters

3. **Response**: ESPN sends back JSON data containing all the games for that date, including game IDs, teams, status, and more

4. **Your Code Extracts**: You parse the response to pull out what you need (the game ID, teams playing, date, and game status) and store it in your `games` list

So instead of manually looking up every college basketball game, the ESPN API does the heavy lifting and provides all the data in a structured format that your code can easily work with!
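
The request and extraction steps above can be sketched offline as follows. This is a minimal illustration rather than the notebook's exact code: `build_scoreboard_url` and `extract_games` are hypothetical helper names, and `sample_payload` is a hand-written stand-in trimmed to the fields this notebook reads, not a real ESPN response.

```python
from urllib.parse import urlencode

BASE_URL = (
    "https://site.api.espn.com/apis/site/v2/sports/basketball/"
    "mens-college-basketball/scoreboard"
)


def build_scoreboard_url(day):
    """Step 1: build the request URL for one day (a YYYYMMDD string)."""
    params = {
        "dates": day,
        "seasontype": 2,  # regular season
        "groups": 50,     # all Division 1 teams
        "limit": 1000,    # up to 1000 games per request
    }
    return f"{BASE_URL}?{urlencode(params)}"


def extract_games(payload):
    """Step 4: keep only the fields we need from the parsed JSON."""
    games = []
    for event in payload.get("events", []):
        competitors = event["competitions"][0]["competitors"]
        home = next(c for c in competitors if c["homeAway"] == "home")
        away = next(c for c in competitors if c["homeAway"] == "away")
        games.append({
            "game_id": event["id"],
            "date": event["date"],
            "home_team": home["team"]["displayName"],
            "away_team": away["team"]["displayName"],
            "status": event["status"]["type"]["description"],
        })
    return games


# Hand-written stand-in for step 3 (the JSON that comes back); not real API output
sample_payload = {
    "events": [{
        "id": "401827652",
        "date": "2026-01-31T19:00Z",
        "status": {"type": {"description": "Final"}},
        "competitions": [{
            "competitors": [
                {"homeAway": "home", "team": {"displayName": "Arizona State Sun Devils"}},
                {"homeAway": "away", "team": {"displayName": "Arizona Wildcats"}},
            ]
        }],
    }]
}

url = build_scoreboard_url("20260131")
games = extract_games(sample_payload)
print(url)
print(games)
```

In the notebook, the payload would come from `requests.get(url).json()` instead of the hand-written dictionary.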

You CAN run out of requests when using an API, so we are not going to pull boxscores for EVERY SINGLE GAME. That means the final dataset will be incomplete by design. Feel free to experiment with how you pull data from the API and how much you request at a time.
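
One common way to cope with request limits is to retry with increasing pauses when the server answers with HTTP 429 ("Too Many Requests"). The sketch below is an assumption-laden illustration: `fetch_with_backoff` is a hypothetical helper, and whether ESPN actually signals its limit with a 429 status has not been confirmed here. The `get` argument is any callable that returns an object with `.status_code` and `.json()`, such as `requests.get`; a fake is used so the example runs without network access.

```python
import time


def fetch_with_backoff(url, get, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Call get(url); on HTTP 429, wait and retry with doubling delays."""
    for attempt in range(max_retries + 1):
        response = get(url)
        if response.status_code != 429:
            return response.json()
        if attempt < max_retries:
            # wait 1s, 2s, 4s, ... before the next attempt
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"still rate limited after {max_retries} retries: {url}")


class FakeResponse:
    """Stand-in for a requests.Response, for demonstration only."""

    def __init__(self, status_code, payload):
        self.status_code = status_code
        self._payload = payload

    def json(self):
        return self._payload


# Two rate-limited answers followed by a successful one
fake_responses = iter([
    FakeResponse(429, {}),
    FakeResponse(429, {}),
    FakeResponse(200, {"events": []}),
])
waits = []  # record the pauses instead of actually sleeping

result = fetch_with_backoff(
    "https://example.com/scoreboard",
    get=lambda url: next(fake_responses),
    sleep=waits.append,
)
print(result, waits)
```

With real requests, leave `sleep` at its default and pass `get=requests.get`. A short fixed pause between every request is a simpler alternative worth trying first.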

In [26]:
## dataframe of all games Nov 4, 2025 to Jan 31, 2026

# set up an empty list to hold game data
games = []

# loop through each date in the range and scrape game data
for date in dates:

    # api endpoint url with date parameter
    url = (
        "https://site.api.espn.com/apis/site/v2/sports/basketball/mens-college-basketball/scoreboard"
        f"?dates={date.strftime('%Y%m%d')}"
        ## need to include season type to get regular season games
        "&seasontype=2"
        ## need to include groups to get all D1 games
        "&groups=50"
        ## include ranked teams only false to get all games
        "&rankedTeamsOnly=false"
        ## increase limit to get all games in one request
        "&limit=1000"
    )

    ## make the request and parse the json response
    ## (the timeout keeps one stalled request from hanging the whole loop)
    data = requests.get(url, timeout=30).json()
    ## the raw data returned is stored under "keys", which you can partly see in the following code chunk. To access games, we need to look at the "events" key.
    for event in data.get('events', []):

        ## pull out the home and away entries for this game
        competitors = event["competitions"][0]["competitors"]

        home = next(c for c in competitors if c["homeAway"] == "home")
        away = next(c for c in competitors if c["homeAway"] == "away")

        game_info = {
            "game_id": event['id'],
            "date": event['date'],
            "home_team": home['team']['displayName'],
            "away_team": away['team']['displayName'],
            "status": event['status']['type']['description']
        }
        games.append(game_info)

In [27]:
## explore the scraped data before parsing
print(data.keys())

## look at all the info stored within each key!
data['events'][0]


dict_keys(['leagues', 'groups', 'events', 'provider', 'eventsDate'])


{'id': '401827652',
 'uid': 's:40~l:41~e:401827652',
 'date': '2026-01-31T19:00Z',
 'name': 'Arizona Wildcats at Arizona State Sun Devils',
 'shortName': 'ARIZ @ ASU',
 'season': {'year': 2026, 'type': 2, 'slug': 'regular-season'},
 'competitions': [{'id': '401827652',
   'uid': 's:40~l:41~e:401827652~c:401827652',
   'date': '2026-01-31T19:00Z',
   'attendance': 13838,
   'type': {'id': '1', 'abbreviation': 'STD'},
   'timeValid': True,
   'neutralSite': False,
   'conferenceCompetition': True,
   'playByPlayAvailable': True,
   'recent': False,
   'venue': {'id': '833',
    'fullName': 'Desert Financial Arena',
    'address': {'city': 'Tempe', 'state': 'AZ'},
    'indoor': True},
   'competitors': [{'id': '9',
     'uid': 's:40~l:41~t:9',
     'type': 'team',
     'order': 0,
     'homeAway': 'home',
     'winner': False,
     'team': {'id': '9',
      'uid': 's:40~l:41~t:9',
      'location': 'Arizona State',
      'name': 'Sun Devils',
      'abbreviation': 'ASU',
      'displayNam

In [29]:
## create a dataframe from the list of games
games_df = pd.DataFrame(games)
games_df

Unnamed: 0,game_id,date,home_team,away_team,status
0,401828553,2025-11-04T23:30Z,Purdue Boilermakers,Evansville Purple Aces,Final
1,401817228,2025-11-05T01:45Z,Duke Blue Devils,Texas Longhorns,Final
2,401826746,2025-11-05T00:00Z,Kentucky Wildcats,Nicholls Colonels,Final
3,401819898,2025-11-05T01:00Z,Texas Tech Red Raiders,Lindenwood Lions,Final
4,401826052,2025-11-04T23:00Z,UMBC Retrievers,Penn State-York Nittany Lions,Final
...,...,...,...,...,...
4147,401829330,2026-02-01T03:00Z,San José State Spartans,New Mexico Lobos,Final
4148,401829223,2026-02-01T03:00Z,San Francisco Dons,Pacific Tigers,Final
4149,401808507,2026-02-01T03:00Z,Sacramento State Hornets,Montana Grizzlies,Final
4150,401804908,2026-02-01T03:05Z,Sam Houston Bearkats,Louisiana Tech Bulldogs,Final


Now we want to use this data frame of every game_id to build a large dataset of boxscores. First, let's make sure we only include completed games, and see if we can restrict the set to the conferences we are interested in for this particular analysis. We will be scraping again, so the smaller the dataset the better, because it is possible to run out of requests!
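
Besides shrinking the list of games, another way to spend fewer requests is to save each response to disk so that rerunning the notebook does not ask for the same game twice. The following is a minimal sketch under stated assumptions: `cached_fetch` is a hypothetical helper, and `fake_fetch` stands in for whatever function actually downloads one game's boxscore JSON.

```python
import json
import tempfile
from pathlib import Path


def cached_fetch(game_id, fetch, cache_dir):
    """Return the JSON for game_id, calling fetch(game_id) only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{game_id}.json"

    if path.exists():
        # already downloaded on an earlier run: no request needed
        return json.loads(path.read_text(encoding="utf-8"))

    payload = fetch(game_id)
    path.write_text(json.dumps(payload), encoding="utf-8")
    return payload


# Fake downloader that records how often it is called (demonstration only)
calls = []


def fake_fetch(game_id):
    calls.append(game_id)
    return {"game_id": game_id, "boxscore": "placeholder"}


with tempfile.TemporaryDirectory() as tmp:
    first = cached_fetch("401828553", fake_fetch, tmp)
    second = cached_fetch("401828553", fake_fetch, tmp)

print(len(calls))
```

The second call is served from the file written by the first, so only one download happens. In real use, point `cache_dir` at a permanent folder rather than a temporary one.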

In [9]:
## drop games whose status is postponed or canceled
## (.copy() so that adding columns later does not trigger a SettingWithCopyWarning)
games_df = games_df[~games_df['status'].isin(['Postponed', 'Canceled'])].copy()

print(games_df['status'].value_counts())

print(games_df['game_id'].nunique())

status
Final    4135
Name: count, dtype: int64
4135


In [37]:
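## create a lookup dictionary for team conferences
## NOTE: `data` only holds the response for the LAST date in the loop above,
## so this lookup only covers teams that played on that date
team_lookup = {}

for event in data.get("events", []):  # loop over the events list
    comp = event["competitions"][0]
    ## the conference is attached to the game, not the team, so this assumes a conference game
    conf = comp.get("groups", {}).get("shortName", None)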

    # get both team names
    home = next(c for c in comp["competitors"] if c["homeAway"] == "home")["team"]["displayName"]
    away = next(c for c in comp["competitors"] if c["homeAway"] == "away")["team"]["displayName"]

    # assign conference to both teams
    team_lookup[home] = conf
    team_lookup[away] = conf

team_lookup

{'Arizona State Sun Devils': 'Big 12',
 'Arizona Wildcats': 'Big 12',
 'Creighton Bluejays': 'Big East',
 'UConn Huskies': 'Big East',
 'Virginia Tech Hokies': 'ACC',
 'Duke Blue Devils': 'ACC',
 'Gonzaga Bulldogs': 'WCC',
 "Saint Mary's Gaels": 'WCC',
 'Houston Cougars': 'Big 12',
 'Cincinnati Bearcats': 'Big 12',
 'UCF Knights': 'Big 12',
 'Texas Tech Red Raiders': 'Big 12',
 'Kansas Jayhawks': 'Big 12',
 'BYU Cougars': 'Big 12',
 'Arkansas Razorbacks': 'SEC',
 'Kentucky Wildcats': 'SEC',
 'Georgia Tech Yellow Jackets': 'ACC',
 'North Carolina Tar Heels': 'ACC',
 'Boston College Eagles': 'ACC',
 'Virginia Cavaliers': 'ACC',
 'Vanderbilt Commodores': 'SEC',
 'Ole Miss Rebels': 'SEC',
 'Louisville Cardinals': 'ACC',
 'SMU Mustangs': 'ACC',
 'Clemson Tigers': 'ACC',
 'Pittsburgh Panthers': 'ACC',
 'Miami (OH) RedHawks': 'MAC',
 'Northern Illinois Huskies': 'MAC',
 'Winthrop Eagles': 'Big South',
 'UNC Asheville Bulldogs': 'Big South',
 'Seton Hall Pirates': 'Big East',
 'Marquette Golde
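
Because the lookup above is built from a single day's response, some teams end up without a conference. One possible remedy is to update the lookup from every day's payload and to trust only conference games. This is a sketch, not the notebook's method: `update_team_lookup` is a hypothetical helper, the two payloads are hand-written stand-ins, and it assumes `conferenceCompetition` and `groups.shortName` behave as they appear in the outputs above.

```python
def update_team_lookup(team_lookup, payload):
    """Add team -> conference pairs from one scoreboard payload.

    Only conference games are used, and an existing entry is never overwritten.
    """
    for event in payload.get("events", []):
        comp = event["competitions"][0]
        if not comp.get("conferenceCompetition"):
            continue  # non-conference game: the two teams may not share a conference
        conf = comp.get("groups", {}).get("shortName")
        if conf is None:
            continue
        for competitor in comp["competitors"]:
            team_lookup.setdefault(competitor["team"]["displayName"], conf)
    return team_lookup


def make_event(home, away, conference_game, conf=None):
    """Build a tiny hand-written event with only the fields used above."""
    comp = {
        "conferenceCompetition": conference_game,
        "competitors": [
            {"homeAway": "home", "team": {"displayName": home}},
            {"homeAway": "away", "team": {"displayName": away}},
        ],
    }
    if conf is not None:
        comp["groups"] = {"shortName": conf}
    return {"competitions": [comp]}


# Stand-ins for two days of responses (not real API output)
day_one = {"events": [make_event("Duke Blue Devils", "Texas Longhorns", False)]}
day_two = {"events": [
    make_event("Arizona State Sun Devils", "Arizona Wildcats", True, "Big 12"),
    make_event("Virginia Tech Hokies", "Duke Blue Devils", True, "ACC"),
]}

lookup = {}
for payload in (day_one, day_two):
    update_team_lookup(lookup, payload)

print(lookup)
```

In the notebook, the call would sit inside the date loop, right after each response is parsed, so that every day contributes to the lookup.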

In [36]:
## add conference info to the dataframe
games_df["home_conf"] = games_df["home_team"].map(team_lookup)
games_df["away_conf"] = games_df["away_team"].map(team_lookup)

games_df

Unnamed: 0,game_id,date,home_team,away_team,status,home_conf,away_conf
0,401828553,2025-11-04T23:30Z,Purdue Boilermakers,Evansville Purple Aces,Final,,MVC
1,401817228,2025-11-05T01:45Z,Duke Blue Devils,Texas Longhorns,Final,ACC,SEC
2,401826746,2025-11-05T00:00Z,Kentucky Wildcats,Nicholls Colonels,Final,SEC,Southland
3,401819898,2025-11-05T01:00Z,Texas Tech Red Raiders,Lindenwood Lions,Final,Big 12,OVC
4,401826052,2025-11-04T23:00Z,UMBC Retrievers,Penn State-York Nittany Lions,Final,Am. East,
...,...,...,...,...,...,...,...
4147,401829330,2026-02-01T03:00Z,San José State Spartans,New Mexico Lobos,Final,Mountain West,Mountain West
4148,401829223,2026-02-01T03:00Z,San Francisco Dons,Pacific Tigers,Final,WCC,WCC
4149,401808507,2026-02-01T03:00Z,Sacramento State Hornets,Montana Grizzlies,Final,Big Sky,Big Sky
4150,401804908,2026-02-01T03:05Z,Sam Houston Bearkats,Louisiana Tech Bulldogs,Final,CUSA,CUSA


In [34]:
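## filter games to only include those with teams from conferences of interest
## run this AFTER the cell that adds home_conf / away_conf, otherwise pandas raises KeyError: 'home_conf'
## labels must match the API's shortName values exactly (e.g. "Mountain West" and "CUSA" in the output above)
conferences_of_interest = ["ACC", "Big East", "Big Ten", "Big 12", "SEC", "Pac-12", "A-10", "WCC", "Horizon", "MVC", "Colonial", "MAC", "Mountain West", "CUSA"]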

filtered_games_df = games_df[
    (games_df["home_conf"].isin(conferences_of_interest)) |
    (games_df["away_conf"].isin(conferences_of_interest))
].reset_index(drop=True)

KeyError: 'home_conf'