# NHL 2024-25 Data Collection

This notebook retrieves game and player statistics from the National Hockey League's 2024-25 season using the official NHL web API (`api-web.nhle.com`).  The goal is to produce clean, structured data sets that can be used for predictive modelling.  The notebook performs the following steps:

* **Scrape schedule:** Generate a list of all regular-season game IDs and basic game information (date, teams, scores, home/away).  The NHL API exposes a schedule endpoint that can be queried by date.  We iterate over every date in the season to build a full schedule.  The API’s season endpoints return JSON describing the games, including home/away teams, game ID and final score【591982464243434†L0-L17】.
* **Download boxscores:** For each game ID we query the NHL "gamecenter" endpoint, which returns a detailed boxscore that includes player statistics for both teams.  This endpoint groups players by position (forwards, defence and goalies) and provides stats such as goals, assists, points, shots, hits, penalty minutes and time-on-ice【131195590886814†L280-L307】.
* **Build player-game table:**  We flatten the boxscore JSON into a table where each row corresponds to a single player's performance in a single game.  The table includes the player's ID, name, position, team, game ID, date and all available statistics.  Players not dressed for the game (scratches) are also recorded when the information is present in the boxscore.
* **Build game-team table:**  We derive a simpler table where each row corresponds to a team in a given game, capturing final score, home/away flag, starting goalie and counts of scratches.  This makes it easy to join player-level data with team outcomes later.
* **Save data:**  Both tables are saved to disk (CSV or Parquet) so that they can be reused without re-scraping.

The notebook relies on the open NHL API described in a recent technical article【131195590886814†L280-L307】 and on the TopDownHockey scraper documentation for context【430751149834465†L264-L306】.  You will need an active internet connection when running the code because it queries the NHL servers.


In [9]:
import requests
import pandas as pd
import datetime as dt
from tqdm import tqdm

# Season you want to scrape.  The NHL encodes seasons as the two year span (e.g., '20242025')
SEASON = '20242025'

# Date range for the 2024-25 regular season.  Adjust if you want to include pre-season or playoffs
season_start = dt.date(2024, 10, 1)  # first regular season games
season_end   = dt.date(2025, 6, 30)   # last possible playoff date (adjust as necessary)

# Base URL for the NHL Web API
BASE_URL = "https://api-web.nhle.com/v1"

# A session with a custom User-Agent helps avoid 403 responses
session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64) Python Data Collector',
                        'Accept': 'application/json'})


In [10]:
def fetch_schedule(date: dt.date) -> list:
    """Query the NHL schedule for a given date and return a list of games."""
    url = f"{BASE_URL}/schedule/{date.isoformat()}"
    resp = session.get(url)
    if resp.status_code != 200:
        raise RuntimeError(f"Failed to fetch schedule for {date}: {resp.status_code}")
    data = resp.json()
    games = []
    for week in data.get('gameWeek', []):
        for game in week.get('games', []):
            game_info = {
                'gameId': game['id'],
                'season': game['season'],
                'gameType': game['gameType'],
                'date': week['date'],
                'homeTeamId': game['homeTeam']['id'],
                'homeTeamAbbrev': game['homeTeam']['abbrev'],
                'homeTeamName': game['homeTeam']['commonName']['default'],
                'homeScore': game['homeTeam'].get('score'),
                'awayTeamId': game['awayTeam']['id'],
                'awayTeamAbbrev': game['awayTeam']['abbrev'],
                'awayTeamName': game['awayTeam']['commonName']['default'],
                'awayScore': game['awayTeam'].get('score'),
                'gameState': game.get('gameState'),
                'gameScheduleState': game.get('gameScheduleState')
            }
            games.append(game_info)
    return games


In [None]:
# Build a full list of dates between start and end
all_dates = [season_start + dt.timedelta(days=i) for i in range((season_end - season_start).days + 1)]

schedule_rows = []
print(f"Fetching schedule for {len(all_dates)} days...")
for date in tqdm(all_dates):
    try:
        daily_games = fetch_schedule(date)
        schedule_rows.extend(daily_games)
    except Exception as e:
        # Continue on errors; you may want to log these dates and retry later
        print(f"Warning: {e}")

schedule_df = pd.DataFrame(schedule_rows)

# Filter to regular-season games (gameType 2).  Set gameType values: 1 = pre-season, 2 = regular season, 3 = playoffs.
schedule_df = schedule_df[schedule_df['gameType'] == 2].reset_index(drop=True)
print(f"Collected {len(schedule_df)} regular season games.")

# Save schedule to CSV
schedule_df.to_csv(f'nhl_{SEASON}_schedule.csv', index=False)

schedule_df.head()


In [11]:
#load the schedule
schedule_df = pd.read_csv(f'nhl_{SEASON}_schedule.csv')
print(f"Loaded {len(schedule_df)} games from schedule.")


Loaded 9179 games from schedule.


In [12]:
def fetch_boxscore(game_id: int) -> dict:
    """Download the NHL gamecenter boxscore for a single game."""
    url = f"{BASE_URL}/gamecenter/{int(game_id)}/boxscore"
    resp = session.get(url)
    if resp.status_code != 200:
        raise RuntimeError(f"Failed to fetch boxscore {game_id}: {resp.status_code}")
    return resp.json()


In [13]:

def _extract_name(part):
    if isinstance(part, dict):
        for key in ('default', 'en', 'fr'):
            if key in part and part[key]:
                return part[key]
        for value in part.values():
            if value:
                return value
        return ''
    if isinstance(part, str):
        return part
    return ''


def _combine_player_stats(player_dict):
    stats = {}

    def merge(source):
        if not isinstance(source, dict):
            return
        for key, value in source.items():
            if isinstance(value, dict):
                merge(value)
            elif not isinstance(value, list):
                stats.setdefault(key, value)

    for key in (
        'playerStats',
        'stats',
        'skaterStats',
        'goalieStats',
        'gameStats',
        'player',
        'playerInfo',
        'playerBio',
        'bioDetails',
        'person'
    ):
        merge(player_dict.get(key))

    # Some stats live directly on the player object
    for key, value in player_dict.items():
        if not isinstance(value, (dict, list)):
            stats.setdefault(key, value)

    return stats


def _first_present(stats, player_dict, *keys):
    sentinel = (None, '', [])

    nested_sources = [player_dict]
    for nested_key in ('player', 'playerInfo', 'playerBio', 'bioDetails', 'person'):
        nested = player_dict.get(nested_key)
        if isinstance(nested, dict):
            nested_sources.append(nested)

    for key in keys:
        if key in stats and stats[key] not in sentinel:
            return stats[key]
        for source in nested_sources:
            if isinstance(source, dict) and key in source and source[key] not in sentinel:
                return source[key]
    return None


def parse_boxscore(box: dict) -> tuple:
    """Convert a boxscore JSON into two lists: player rows and team rows."""
    players = []
    teams = []

    game_id = box.get('id')
    game_date = box.get('gameDate') or box.get('venueDate')

    for side in ['home', 'away']:
        team_key = f"{side}Team"
        team_info = box.get(team_key) or box.get('teams', {}).get(team_key) or {}
        team_id = team_info.get('id')
        team_abbrev = team_info.get('abbrev')
        team_name = None
        common_name = team_info.get('commonName')
        if isinstance(common_name, dict):
            team_name = _extract_name(common_name)
        elif isinstance(common_name, str):
            team_name = common_name
        stats_by_type = box.get('playerByGameStats', {}).get(team_key, {})

        team_row = {
            'gameId': game_id,
            'date': game_date,
            'teamId': team_id,
            'teamAbbrev': team_abbrev,
            'teamName': team_name,
            'homeAway': 'home' if side == 'home' else 'away',
            'finalScore': team_info.get('score'),
            'numScratches': None
        }
        scratches_section = box.get('scratches')
        if isinstance(scratches_section, dict):
            scratches = scratches_section.get(team_key)
            if isinstance(scratches, list):
                team_row['numScratches'] = len(scratches)
        teams.append(team_row)

        for group_key in ['forwards', 'defense', 'goalies']:
            players_list = stats_by_type.get(group_key, [])
            for p in players_list:
                stats = _combine_player_stats(p)

                first_name_val = _first_present(stats, p, 'firstName', 'preferredFirstName', 'playerFirstName')
                last_name_val = _first_present(stats, p, 'lastName', 'preferredLastName', 'playerLastName')
                full_name_val = _first_present(stats, p, 'fullName', 'playerFullName', 'name')

                first_name = _extract_name(first_name_val)
                last_name = _extract_name(last_name_val)
                full_name = _extract_name(full_name_val)

                if not full_name:
                    full_name = ' '.join(x for x in (first_name, last_name) if x).strip()

                player_row = {
                    'gameId': game_id,
                    'date': game_date,
                    'teamId': team_id,
                    'teamAbbrev': team_abbrev,
                    'teamName': team_name,
                    'homeAway': 'home' if side == 'home' else 'away',
                    'playerId': p.get('playerId'),
                    'playerName': full_name,
                    'playerFirstName': first_name or None,
                    'playerLastName': last_name or None,
                    'playerSlug': _first_present(stats, p, 'playerSlug', 'slug'),
                    'sweaterNumber': _first_present(stats, p, 'sweaterNumber', 'sweater', 'jerseyNumber')
                }

                player_row.update({
                    'positionCode': _first_present(stats, p, 'positionCode', 'position'),
                    'goals': _first_present(stats, p, 'goals'),
                    'assists': _first_present(stats, p, 'assists'),
                    'points': _first_present(stats, p, 'points'),
                    'shotsOnGoal': _first_present(stats, p, 'shotsOnGoal', 'shots', 'sog'),
                    'hits': _first_present(stats, p, 'hits'),
                    'pim': _first_present(stats, p, 'penaltyMinutes', 'pim'),
                    'plusMinus': _first_present(stats, p, 'plusMinus', 'plusminus'),
                    'timeOnIce': _first_present(stats, p, 'timeOnIce', 'toi'),
                    'saves': _first_present(stats, p, 'saves'),
                    'shotsAgainst': _first_present(stats, p, 'shotsAgainst', 'shotsFaced', 'sa'),
                    'goalsAgainst': _first_present(stats, p, 'goalsAgainst', 'ga'),
                    'savePct': _first_present(stats, p, 'savePct', 'savePercentage', 'savePercent'),
                    'faceoffWins': _first_present(stats, p, 'faceoffWins', 'faceoffsWon', 'foWins')
                })

                players.append(player_row)
    return players, teams



In [14]:
_player_profile_cache = {}


def _extract_profile_text(value):
    if isinstance(value, dict):
        for key in ('default', 'en', 'fr'):
            if value.get(key):
                return value[key]
        for val in value.values():
            if val:
                return val
    elif isinstance(value, str):
        return value
    return ''


def fetch_player_profile(player_id: int) -> dict:
    player_id = int(player_id)
    if player_id in _player_profile_cache:
        return _player_profile_cache[player_id]
    url = f"{BASE_URL}/player/{player_id}/landing"
    resp = session.get(url)
    if resp.status_code != 200:
        raise RuntimeError(f"Failed to fetch player profile {player_id}: {resp.status_code}")
    data = resp.json()
    _player_profile_cache[player_id] = data
    return data


def resolve_player_identity(player_id: int) -> dict:
    try:
        profile = fetch_player_profile(player_id)
    except Exception as exc:
        print(f"Warning: failed to resolve player {player_id}: {exc}")
        return {}
    player_info = profile.get('player') or profile
    first_name = _extract_profile_text(player_info.get('firstName'))
    last_name = _extract_profile_text(player_info.get('lastName'))
    full_name = _extract_profile_text(player_info.get('fullName'))
    if not full_name:
        full_name = ' '.join(x for x in (first_name, last_name) if x).strip()
    slug = player_info.get('playerSlug') or player_info.get('slug')
    return {
        'playerId': int(player_id),
        'playerFirstName': first_name or None,
        'playerLastName': last_name or None,
        'playerFullName': full_name or None,
        'playerSlug': slug
    }


def enrich_player_names(df: pd.DataFrame) -> pd.DataFrame:
    if df.empty or 'playerId' not in df.columns:
        return df
    df = df.copy()
    df['_playerIdInt'] = pd.to_numeric(df['playerId'], errors='coerce').astype('Int64')
    valid_ids = df['_playerIdInt'].dropna()
    if valid_ids.empty:
        return df.drop(columns=['_playerIdInt'])
    unique_ids = sorted(valid_ids.unique())
    identities = []
    for pid in unique_ids:
        identity = resolve_player_identity(int(pid))
        if identity:
            identities.append(identity)
    if not identities:
        return df.drop(columns=['_playerIdInt'])
    identity_df = pd.DataFrame(identities).set_index('playerId')

    def assign_if_missing(column: str, key: str) -> None:
        if key not in identity_df.columns:
            return
        if column not in df.columns:
            df[column] = None
        mask = df[column].isna() | df[column].astype(str).str.strip().eq('')
        mask &= df['_playerIdInt'].notna()
        if not mask.any():
            return
        mapping = identity_df[key].to_dict()
        df.loc[mask, column] = df.loc[mask, '_playerIdInt'].astype(int).map(mapping)

    assign_if_missing('playerName', 'playerFullName')
    assign_if_missing('playerFirstName', 'playerFirstName')
    assign_if_missing('playerLastName', 'playerLastName')
    assign_if_missing('playerSlug', 'playerSlug')

    return df.drop(columns=['_playerIdInt'])


In [16]:
def debug_player_sample(limit=100):
    """Return a DataFrame with the first `limit` player rows for debugging."""
    collected = []
    for game_id in schedule_df['gameId'].tolist():
        try:
            box = fetch_boxscore(int(game_id))
        except Exception as exc:
            print(f"Warning: failed to fetch boxscore {game_id}: {exc}")
            continue

        try:
            players, _ = parse_boxscore(box)
        except Exception as exc:
            print(f"Warning: failed to parse boxscore {game_id}: {exc}")
            continue

        for row in players:
            collected.append(row)
            if len(collected) >= limit:
                return pd.DataFrame(collected)

    return pd.DataFrame(collected)


debug_players_df = debug_player_sample(limit=100)
if not debug_players_df.empty:
    debug_players_df = enrich_player_names(debug_players_df)
    identity_preview = (debug_players_df[['playerId', 'playerName', 'playerSlug']]
                        .drop_duplicates()
                        .head(20))
    print("Sample player identities:", identity_preview.to_string(index=False), sep='\n')
else:
    print("No player rows collected for debugging.")

debug_players_df.to_csv('debug_players.csv', index=False)
print(f"Collected {len(debug_players_df)} sample player rows for debugging.")

debug_players_df.head()


Collected 100 sample player rows for debugging.


In [None]:
player_rows = []
team_rows = []

print(f"Fetching boxscores for {len(schedule_df)} games...")
for game_id in tqdm(schedule_df['gameId'].tolist()):
    try:
        box = fetch_boxscore(int(game_id))
        players, teams = parse_boxscore(box)
        player_rows.extend(players)
        team_rows.extend(teams)
    except Exception as e:
        print(f"Warning: {e}")
        continue

players_df = pd.DataFrame(player_rows)
teams_df = pd.DataFrame(team_rows)

players_df.to_csv(f'nhl_{SEASON}_player_game_stats.csv', index=False)
teams_df.to_csv(f'nhl_{SEASON}_game_team_stats.csv', index=False)

players_df.head()


In [None]:
# Derive each player's most recent team in the season
latest_team_series = (players_df.sort_values(['playerId','date'])
                                    .groupby('playerId')
                                    .tail(1)
                                    .set_index('playerId')['teamId'])

players_df['latestTeamId'] = players_df['playerId'].map(latest_team_series)

players_df.to_csv(f'nhl_{SEASON}_player_game_stats_with_latest_team.csv', index=False)

players_df[['playerId','playerName','teamId','latestTeamId']].head()


## Injury and scratch data

The NHL API does not provide a dedicated injuries endpoint for public consumption.  However, the game-by-game boxscore includes a list of players who were scratched (did not dress) for each team.  These scratches may be due to injury, illness or coach’s decision.  In the team summary table the column `numScratches` captures how many players were listed as scratched for each team in each game.  If you need more detailed injury information you may need to cross-reference third-party injury reports or subscription services.  The TopDownHockey package documentation notes that EliteProspects and NHL play-by-play data can be scraped for personal use【430751149834465†L264-L306】, but such scrapes often require careful handling of rate limits and may not explicitly label injuries.  Use the `scratches` information as a proxy for unavailable players when building your model.
