ESPN API Scraper
Men's Basketball

- The goal of this file is to scrape all games within a specified time frame, then scrape box score data for BOTH teams from each unique game ID.
- Last update 02/02/26; having issues with too many requests
- API documentation found thanks to https://gist.github.com/akeaswaran/b48b02f1c94f873c6655e7129910fc3b

Specifically annotated to be shared with BYU Sports Analytics Association

In [29]:
import requests
import pandas as pd
from datetime import datetime 
import json
import time

In [23]:
## set start and end dates of games to scrape
start_date = "20251103"
end_date = "20260131"

## create a date range. This will create a date for each day in the range.
dates = pd.date_range(
    start=datetime.strptime(start_date, "%Y%m%d"),
    end=datetime.strptime(end_date, "%Y%m%d")
)

Quick explanation of APIs for those who are unfamiliar:

**The basics:** An API is a tool that lets one program ask another program for data or perform an action. Instead of you manually finding and collecting information, you send a request to the API and it sends you back exactly what you need.

**For this ESPN example:**

The ESPN API allows your code to request college basketball game data without storing it yourself. Here's exactly what happens in this notebook:

1. **Request**: Your code sends a request to ESPN's API endpoint with parameters like:
   - Date (one request per day, from November 3, 2025 to January 31, 2026)
   - `seasontype=2` (regular season games)
   - `groups=50` (all Division 1 teams)
   - `limit=1000` (get up to 1000 games per day)

2. **API Processing**: ESPN's servers receive your request and find all men's college basketball games matching those parameters

3. **Response**: ESPN sends back JSON data containing all the games for that date, including game IDs, teams, status, and more

4. **Your Code Extracts**: You parse through the response to pull out what you need—the game ID, teams playing, date, and game status—and store it in your `games` list

So instead of manually looking up every college basketball game, the ESPN API does the heavy lifting and provides all the data in a structured format that your code can easily work with!
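
The request and extraction steps above can be sketched without touching the network. This is a minimal sketch using only the standard library: `build_scoreboard_url` is a hypothetical helper name, the parameter values are the ones listed above, and `sample_response` is a heavily trimmed copy of one event from this notebook's own output, standing in for the JSON that ESPN returns.

```python
from urllib.parse import urlencode

BASE_URL = (
    "https://site.api.espn.com/apis/site/v2/sports/basketball/"
    "mens-college-basketball/scoreboard"
)

def build_scoreboard_url(date_str):
    """Build the scoreboard URL for one day (date_str formatted as YYYYMMDD)."""
    params = {
        "dates": date_str,   # one day per request
        "seasontype": 2,     # regular season games
        "groups": 50,        # all Division 1 teams
        "limit": 1000,       # up to 1000 games in one response
    }
    return BASE_URL + "?" + urlencode(params)

url = build_scoreboard_url("20251103")
print(url)

# Trimmed stand-in for the parsed JSON response (fields copied from the output in this notebook)
sample_response = {
    "events": [
        {
            "id": "401827652",
            "date": "2026-01-31T19:00Z",
            "name": "Arizona Wildcats at Arizona State Sun Devils",
        }
    ]
}

# Step 4: pull out only the fields we care about
game_ids = [event["id"] for event in sample_response.get("events", [])]
print(game_ids)
```

Building the query string from a dictionary keeps each parameter on its own line, which makes it easier to experiment with different values.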

You CAN run out of requests when using an API, so we are not going to look at EVERY SINGLE GAME. This limits what the dataset covers. Feel free to experiment with how to pull data using the API and how much to request at a time.
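
One common way to cope with "too many requests" errors is to wait and retry, doubling the wait each time. The sketch below assumes the server signals rate limiting with HTTP status 429; this notebook does not confirm that ESPN does so. `fetch_with_backoff` and `fake_fetch` are hypothetical names, and `fetch` can be any function that returns a `(status_code, payload)` pair.

```python
import time

def fetch_with_backoff(fetch, url, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying with exponential backoff while it returns 429."""
    for attempt in range(max_retries + 1):
        status, payload = fetch(url)
        if status != 429:
            return status, payload
        if attempt < max_retries:
            # wait 1s, 2s, 4s, ... before the next attempt
            sleep(base_delay * 2 ** attempt)
    # still rate limited after all retries; hand back the last result
    return status, payload

# Hypothetical adapter for the requests library used in this notebook:
# def requests_fetch(url):
#     resp = requests.get(url, timeout=10)
#     return resp.status_code, (resp.json() if resp.ok else None)

# Offline demo: a fake fetch that is rate limited twice, then succeeds
calls = []

def fake_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        return 429, None
    return 200, {"events": []}

delays = []  # record the waits instead of actually sleeping
status, payload = fetch_with_backoff(fake_fetch, "https://example.com", sleep=delays.append)
print(status, delays)
```

Passing `sleep` as a parameter is only there so the demo runs instantly; in real use the default `time.sleep` applies.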

In [24]:
## dataframe of all games Nov 3, 2025 to Jan 31, 2026

# set up an empty list to hold game data
games = []

# loop through each date in the range and scrape game data
for date in dates:

    # api endpoint url with date parameter
    url = (
        "https://site.api.espn.com/apis/site/v2/sports/basketball/mens-college-basketball/scoreboard"
        f"?dates={date.strftime('%Y%m%d')}"
        ## need to include season type to get regular season games
        "&seasontype=2"
        ## need to include groups to get all D1 games
        "&groups=50"
        ## set rankedTeamsOnly to false so games between unranked teams are included
        "&rankedTeamsOnly=false"
        ## increase limit to get all games in one request
        "&limit=1000"
    )

    ## make the request and parse the json response (the timeout keeps a stalled request from hanging the loop)
    data = requests.get(url, timeout=10).json()
    ## the raw data returned is stored under "keys", which you can partly see in the following code chunk. To access games, we need to look at the "events" key.
    for event in data.get('events', []):

        ## look at both teams
        competitors = event["competitions"][0]["competitors"]

        home = next(c for c in competitors if c["homeAway"] == "home")
        away = next(c for c in competitors if c["homeAway"] == "away")

        ## get first and second half scores for both teams
        home_scores = home.get("linescores", [])
        away_scores = away.get("linescores", [])

        first_half_home_score = home_scores[0]["value"] if len(home_scores) > 0 else None
        second_half_home_score = home_scores[1]["value"] if len(home_scores) > 1 else None
        first_half_away_score = away_scores[0]["value"] if len(away_scores) > 0 else None
        second_half_away_score = away_scores[1]["value"] if len(away_scores) > 1 else None

        game_info = {
            "game_id": event['id'],
            "date": event['date'],
            ## reuse the home and away competitors found above
            "home_team": home['team']['displayName'],
            "home_team_first_half_score": first_half_home_score,
            "home_team_second_half_score": second_half_home_score,
            "away_team": away['team']['displayName'],
            "away_team_first_half_score": first_half_away_score,
            "away_team_second_half_score": second_half_away_score,
            "status": event['status']['type']['description']
        }
        games.append(game_info)
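
The half-score extraction above reads only the first two `linescores` entries. Below is a minimal sketch of the same lookup as a reusable helper. `split_linescores` is a hypothetical name, and treating any entries past the second as overtime periods is an assumption that this notebook's output does not confirm. The Houston values come from the first row of the games table in this notebook; the overtime example is made up for illustration.

```python
def split_linescores(competitor):
    """Return (first_half, second_half, overtime_total) for one competitor dict."""
    values = [period.get("value") for period in competitor.get("linescores", [])]
    first = values[0] if len(values) > 0 else None
    second = values[1] if len(values) > 1 else None
    # ASSUMPTION: entries after the second half are overtime periods
    extra = [v for v in values[2:] if v is not None]
    overtime = sum(extra) if extra else None
    return first, second, overtime

# Values taken from the Houston row in this notebook's games table
houston = {"linescores": [{"value": 44.0}, {"value": 31.0}]}
print(split_linescores(houston))

# Made-up competitor with one extra period, for illustration only
made_up_ot = {"linescores": [{"value": 30}, {"value": 35}, {"value": 10}]}
print(split_linescores(made_up_ot))

# A game that has not started yet has no linescores at all
print(split_linescores({}))
```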

In [25]:
## explore the scraped data before parsing
print(data.keys())

## look at all the info stored within each key!
data['events'][0]

dict_keys(['leagues', 'groups', 'events', 'provider', 'eventsDate'])


{'id': '401827652',
 'uid': 's:40~l:41~e:401827652',
 'date': '2026-01-31T19:00Z',
 'name': 'Arizona Wildcats at Arizona State Sun Devils',
 'shortName': 'ARIZ @ ASU',
 'season': {'year': 2026, 'type': 2, 'slug': 'regular-season'},
 'competitions': [{'id': '401827652',
   'uid': 's:40~l:41~e:401827652~c:401827652',
   'date': '2026-01-31T19:00Z',
   'attendance': 13838,
   'type': {'id': '1', 'abbreviation': 'STD'},
   'timeValid': True,
   'neutralSite': False,
   'conferenceCompetition': True,
   'playByPlayAvailable': True,
   'recent': False,
   'venue': {'id': '833',
    'fullName': 'Desert Financial Arena',
    'address': {'city': 'Tempe', 'state': 'AZ'},
    'indoor': True},
   'competitors': [{'id': '9',
     'uid': 's:40~l:41~t:9',
     'type': 'team',
     'order': 0,
     'homeAway': 'home',
     'winner': False,
     'team': {'id': '9',
      'uid': 's:40~l:41~t:9',
      'location': 'Arizona State',
      'name': 'Sun Devils',
      'abbreviation': 'ASU',
      'displayNam

In [26]:
## create a dataframe from the list of games
games_df = pd.DataFrame(games)
games_df

Unnamed: 0,game_id,date,home_team,home_team_first_half_score,home_team_second_half_score,away_team,away_team_first_half_score,away_team_second_half_score,status
0,401824809,2025-11-04T01:00Z,Houston Cougars,44.0,31.0,Lehigh Mountain Hawks,23.0,34.0,Final
1,401826885,2025-11-04T00:00Z,Arizona Wildcats,50.0,43.0,Florida Gators,46.0,41.0,Final
2,401812785,2025-11-04T00:00Z,UConn Huskies,37.0,42.0,New Haven Chargers,24.0,31.0,Final
3,401820577,2025-11-03T23:30Z,St. John's Red Storm,54.0,54.0,Quinnipiac Bobcats,34.0,40.0,Final
4,401826083,2025-11-04T01:30Z,Michigan Wolverines,69.0,52.0,Oakland Golden Grizzlies,38.0,40.0,Final
...,...,...,...,...,...,...,...,...,...
4316,401829330,2026-02-01T03:00Z,San José State Spartans,35.0,45.0,New Mexico Lobos,41.0,49.0,Final
4317,401829223,2026-02-01T03:00Z,San Francisco Dons,35.0,52.0,Pacific Tigers,32.0,50.0,Final
4318,401808507,2026-02-01T03:00Z,Sacramento State Hornets,49.0,37.0,Montana Grizzlies,32.0,47.0,Final
4319,401804908,2026-02-01T03:05Z,Sam Houston Bearkats,34.0,49.0,Louisiana Tech Bulldogs,32.0,35.0,Final


Now we want to use this dataframe of every game_id to create a huge dataset of box scores. First, let's make sure we only include completed games. We will be scraping again, so the smaller the dataset the better, because it is possible to run out of requests!

P.S. I wanted to scrape the conferences for the teams, but for some reason just could not get it done. I am going to handmake a conference library at some point, but if anyone can help me just scrape that it would be much appreciated!

In [27]:
## drop if game is equal to postponed or canceled
games_df = games_df[~games_df['status'].isin(['Postponed', 'Canceled'])]

print(games_df['status'].value_counts())

print(games_df['game_id'].nunique())

status
Final    4304
Name: count, dtype: int64
4304


In [None]:
## scrape box score data for each game.
## for each game, we make a request to the summary endpoint using the game_id, then parse the response into one row per player per game.
## to limit requests, we pause for one second after every 10 requests to avoid hitting rate limits.
## note: nothing is saved to disk inside this loop, so stopping it early loses the rows collected so far.
all_rows = []

for i, game_id in enumerate(games_df['game_id']):

    box_url = f"https://site.api.espn.com/apis/site/v2/sports/basketball/mens-college-basketball/summary?event={game_id}"
    ## the timeout keeps a stalled request from hanging the loop
    data = requests.get(box_url, timeout=10).json()

    try:
        for team_block in data['boxscore']['players']:

            team_name = team_block['team']['displayName']
            stat_block = team_block['statistics'][0]

            labels = stat_block['labels']
            athletes = stat_block['athletes']

            for player in athletes:
                row = {
                    "game_id": game_id,
                    "team": team_name,
                    "player": player['athlete']['displayName'],
                    "position": player['athlete']
                        .get('position', {})
                        .get('abbreviation'),
                    "starter": player.get('starter'),
                    "didNotPlay": player.get('didNotPlay')
                }

                stats = player.get('stats', [])

                for col, val in zip(labels, stats):
                    row[col] = val

                all_rows.append(row)

    ## IndexError covers a team block whose 'statistics' list is empty
    except (KeyError, IndexError):
        print(f"Skipping bad game: {game_id}")
        continue

    # polite rate limit pause
    if (i+1) % 10 == 0:
        time.sleep(1)


Skipping bad game: 401826937
Skipping bad game: 401812286
Skipping bad game: 401829462
Skipping bad game: 401822181
Skipping bad game: 401829392
Skipping bad game: 401830009
Skipping bad game: 401823389
Skipping bad game: 401829051
Skipping bad game: 401826091
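
Since the scrape covers thousands of games, it can help to save progress in batches so a stopped run can be resumed. The following is a minimal sketch of that idea using only the standard library. The helper names (`batches`, `append_rows`, `finished_game_ids`), the checkpoint file name, and the placeholder rows are all hypothetical; in real use the rows would come from the box score loop above.

```python
import csv
import os
import tempfile

def batches(items, size):
    """Yield consecutive slices of items, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def append_rows(path, rows, fieldnames):
    """Append rows to a CSV file, writing the header only when the file is new."""
    is_new = not os.path.exists(path)
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
        if is_new:
            writer.writeheader()
        writer.writerows(rows)

def finished_game_ids(path):
    """Return the set of game_ids already saved in the checkpoint file."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="", encoding="utf-8") as f:
        return {row["game_id"] for row in csv.DictReader(f)}

# Demo with placeholder rows written to a temporary folder
checkpoint = os.path.join(tempfile.mkdtemp(), "player_stats_checkpoint.csv")
fields = ["game_id", "team", "player"]
game_ids = ["401824809", "401826885", "401812785"]

for batch in batches(game_ids, 2):
    done = finished_game_ids(checkpoint)
    todo = [g for g in batch if g not in done]  # skip games saved by an earlier run
    rows = [{"game_id": g, "team": "placeholder", "player": "placeholder"} for g in todo]
    append_rows(checkpoint, rows, fields)

print(sorted(finished_game_ids(checkpoint)))
```

With real box score rows, `fieldnames` would need to list every column, including the stat labels such as MIN and PTS, because `csv.DictWriter` raises an error on keys it was not told about.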


In [31]:
# Convert to dataframe
player_stats_df = pd.DataFrame(all_rows)
player_stats_df

Unnamed: 0,game_id,team,player,position,starter,didNotPlay,MIN,PTS,FG,3PT,FT,REB,AST,TO,STL,BLK,OREB,DREB,PF
0,401824809,Lehigh Mountain Hawks,Hank Alvey,F,True,False,29,10,4-7,0-1,2-4,8,0,1,0,1,5,3,1
1,401824809,Lehigh Mountain Hawks,Edouard Benoit,F,True,False,26,0,0-7,0-5,0-0,6,1,0,1,0,1,5,3
2,401824809,Lehigh Mountain Hawks,Caleb Thomas,G,True,False,29,10,3-10,1-3,3-4,2,3,1,0,0,0,2,2
3,401824809,Lehigh Mountain Hawks,Joshua Ingram,G,True,False,35,9,4-6,0-0,1-2,5,0,5,3,1,0,5,4
4,401824809,Lehigh Mountain Hawks,Nasir Whitlock,G,True,False,36,18,6-15,1-2,5-6,9,0,5,1,0,3,6,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
134795,401804908,Sam Houston Bearkats,Jaxson Ford,F,False,True,,,,,,,,,,,,,
134796,401804908,Sam Houston Bearkats,Isaiah Manning,F,False,True,,,,,,,,,,,,,
134797,401804908,Sam Houston Bearkats,Nathan Nguyen,G,False,True,,,,,,,,,,,,,
134798,401804908,Sam Houston Bearkats,Noah Benny,G,False,True,,,,,,,,,,,,,


In [36]:
## fill na values with 0
player_stats_df.fillna(0, inplace=True)

## count na values in each column to confirm none remain
print(player_stats_df.isna().sum())

game_id       0
team          0
player        0
position      0
starter       0
didNotPlay    0
MIN           0
PTS           0
FG            0
3PT           0
FT            0
REB           0
AST           0
TO            0
STL           0
BLK           0
OREB          0
DREB          0
PF            0
dtype: int64
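
The FG, 3PT, and FT columns hold text such as `4-7` (made-attempted), so after filling with 0 they mix text with the number 0. A minimal sketch of one way to turn those values into numbers follows; `split_made_attempted` is a hypothetical helper, and mapping anything that is not in made-attempted form to `(0, 0)` is a design choice for this sketch, not something the notebook does.

```python
def split_made_attempted(value):
    """Turn 'made-attempted' text such as '4-7' into (4, 7); anything else becomes (0, 0)."""
    if isinstance(value, str) and "-" in value:
        made, _, attempted = value.partition("-")
        if made.isdigit() and attempted.isdigit():
            return int(made), int(attempted)
    return 0, 0

print(split_made_attempted("4-7"))  # a value from the player stats table above
print(split_made_attempted(0))      # what fillna(0) leaves for players who did not play
print(split_made_attempted(""))
```

With pandas, a helper like this could be applied to one column at a time with `Series.apply`, and the resulting pairs split into separate made and attempted columns.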


In [37]:
## save both dataframes to csv files
games_df.to_csv("games_data.csv", index=False)
player_stats_df.to_csv("player_stats_data.csv", index=False)