# NHL 2024-25 Data Collection

This notebook retrieves game and player statistics from the National Hockey League's 2024-25 season using the official NHL web API (`api-web.nhle.com`).  The goal is to produce clean, structured data sets that can be used for predictive modelling.  The notebook performs the following steps:

* **Scrape schedule:** Generate a list of all regular-season game IDs and basic game information (date, teams, scores, home/away).  The NHL API exposes a schedule endpoint that can be queried by date.  We step through the season's date range to build a full schedule.  Each response is JSON describing the games, including the home and away teams, the game ID and the score.
* **Download boxscores:** For each game ID we query the NHL "gamecenter" endpoint, which returns a detailed boxscore that includes player statistics for both teams.  This endpoint groups players by position (forwards, defence and goalies) and provides stats such as goals, assists, points, shots, hits, penalty minutes and time-on-ice.
* **Build player-game table:**  We flatten the boxscore JSON into a table where each row corresponds to a single player's performance in a single game.  The table includes the player's ID, name, position, team, game ID, date and the statistics listed above, plus goalie fields such as saves and goals against.  Scratched players do not get rows of their own; only a per-team count is kept (see the next step).
* **Build game-team table:**  We build a simpler table where each row corresponds to a team in a given game, capturing the final score, the home/away flag and, when the boxscore reports it, the number of scratches.  This makes it easy to join player-level data with team outcomes later.
* **Save data:**  The schedule and both tables are saved to disk as CSV files so that they can be reused without re-scraping.
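
To make the join between the two tables concrete, here is a minimal plain-Python sketch.  The sample rows and the helper name `attach_team_outcome` are hypothetical; only the column names `gameId`, `teamId` and `finalScore` come from this notebook.  With pandas, the same idea would be a `merge` on `gameId` and `teamId`.

```python
# Hypothetical sample rows shaped like the two tables this notebook builds.
sample_team_rows = [
    {"gameId": 1, "teamId": 10, "homeAway": "home", "finalScore": 4},
    {"gameId": 1, "teamId": 20, "homeAway": "away", "finalScore": 2},
]
sample_player_rows = [
    {"gameId": 1, "teamId": 10, "playerId": 101, "goals": 2},
    {"gameId": 1, "teamId": 20, "playerId": 201, "goals": 0},
]


def attach_team_outcome(player_rows, team_rows):
    """Return copies of the player rows with own score, opponent score and a win flag."""
    # Group the team rows by game so both sides of a game can be looked up together.
    by_game = {}
    for t in team_rows:
        by_game.setdefault(t["gameId"], []).append(t)

    joined = []
    for p in player_rows:
        sides = by_game.get(p["gameId"], [])
        own = next((t for t in sides if t["teamId"] == p["teamId"]), None)
        opp = next((t for t in sides if t["teamId"] != p["teamId"]), None)
        row = dict(p)  # copy, so the input rows are left untouched
        row["teamScore"] = own["finalScore"] if own else None
        row["oppScore"] = opp["finalScore"] if opp else None
        if row["teamScore"] is None or row["oppScore"] is None:
            row["teamWon"] = None  # outcome unknown (missing row or unplayed game)
        else:
            row["teamWon"] = row["teamScore"] > row["oppScore"]
        joined.append(row)
    return joined


joined_rows = attach_team_outcome(sample_player_rows, sample_team_rows)
```

Keeping the outcome as `None` when either score is missing avoids labelling unplayed or partially scraped games as losses.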

The notebook relies on the open NHL API described in a recent technical article and on the TopDownHockey scraper documentation for context.  You will need an active internet connection when running the code because it queries the NHL servers.
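
Because every step depends on live network calls, transient failures are likely over a long scrape.  One possible way to handle them is a small retry helper with exponential backoff.  The helper `call_with_retries` and its parameters are hypothetical and not part of the notebook; the `flaky` function only simulates a failing request.

```python
import time


def call_with_retries(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn() up to `attempts` times, waiting base_delay * 2**i between tries."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for i in range(attempts):
        try:
            return fn()
        except Exception as exc:  # in real use, narrow this to network errors
            last_error = exc
            if i < attempts - 1:
                sleep(base_delay * 2 ** i)  # wait longer after each failure
    raise last_error


# Simulated request that fails twice and then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated network failure")
    return "ok"


delays = []  # record the waits instead of actually sleeping
result = call_with_retries(flaky, attempts=4, base_delay=0.5, sleep=delays.append)
```

In this notebook the helper could wrap a call such as `lambda: fetch_schedule(date)` or `lambda: fetch_boxscore(game_id)`.  Passing `sleep` as a parameter keeps the helper testable without real delays.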


In [None]:
import requests
import pandas as pd
import datetime as dt
from tqdm import tqdm

# Season you want to scrape.  The NHL encodes a season as its two-year span (e.g., '20242025')
SEASON = '20242025'

# Date range to scan.  The window is deliberately wide (it runs through the playoffs);
# pre-season and playoff games are filtered out by gameType further down.
season_start = dt.date(2024, 10, 1)  # on or before the first regular-season game
season_end   = dt.date(2025, 6, 30)  # last possible playoff date (adjust as necessary)

# Base URL for the NHL Web API
BASE_URL = "https://api-web.nhle.com/v1"

# A session with a custom User-Agent helps avoid 403 responses
session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64) Python Data Collector',
                        'Accept': 'application/json'})


In [None]:
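def fetch_schedule(date: dt.date) -> list:
    """Query the NHL schedule endpoint for `date` and return a list of game dicts.

    The response is grouped under `gameWeek` and can contain several days of
    games, so calls for adjacent dates may return the same game more than once.
    """
    url = f"{BASE_URL}/schedule/{date.isoformat()}"
    # A timeout keeps one stalled request from hanging the whole loop
    resp = session.get(url, timeout=30)
    if resp.status_code != 200:
        raise RuntimeError(f"Failed to fetch schedule for {date}: {resp.status_code}")
    data = resp.json()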
    games = []
    for week in data.get('gameWeek', []):
        for game in week.get('games', []):
            game_info = {
                'gameId': game['id'],
                'season': game['season'],
                'gameType': game['gameType'],
                'date': week['date'],
                'homeTeamId': game['homeTeam']['id'],
                'homeTeamAbbrev': game['homeTeam']['abbrev'],
                'homeTeamName': game['homeTeam']['commonName']['default'],
                'homeScore': game['homeTeam'].get('score'),
                'awayTeamId': game['awayTeam']['id'],
                'awayTeamAbbrev': game['awayTeam']['abbrev'],
                'awayTeamName': game['awayTeam']['commonName']['default'],
                'awayScore': game['awayTeam'].get('score'),
                'gameState': game.get('gameState'),
                'gameScheduleState': game.get('gameScheduleState')
            }
            games.append(game_info)
    return games


In [None]:
# Build a full list of dates between start and end
all_dates = [season_start + dt.timedelta(days=i) for i in range((season_end - season_start).days + 1)]

schedule_rows = []
print(f"Fetching schedule for {len(all_dates)} days...")
for date in tqdm(all_dates):
    try:
        daily_games = fetch_schedule(date)
        schedule_rows.extend(daily_games)
    except Exception as e:
        # Continue on errors; you may want to log these dates and retry later
        print(f"Warning: {e}")

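schedule_df = pd.DataFrame(schedule_rows)

# Each response is grouped by `gameWeek` and may span several days, so querying every
# date can return the same game more than once.  Keep one row per game ID (the most
# recently fetched copy) and restore chronological order.
schedule_df = (schedule_df.drop_duplicates(subset='gameId', keep='last')
                          .sort_values(['date', 'gameId']))

# Filter to regular-season games.  gameType values: 1 = pre-season, 2 = regular season, 3 = playoffs.
schedule_df = schedule_df[schedule_df['gameType'] == 2].reset_index(drop=True)
print(f"Collected {len(schedule_df)} regular season games.")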

# Save schedule to CSV
schedule_df.to_csv(f'nhl_{SEASON}_schedule.csv', index=False)

schedule_df.head()


In [None]:
def fetch_boxscore(game_id: int) -> dict:
    """Retrieve the boxscore for a specific game."""
    url = f"{BASE_URL}/gamecenter/{game_id}/boxscore"
    resp = session.get(url, timeout=30)  # timeout avoids hanging on a stalled request
    if resp.status_code != 200:
        raise RuntimeError(f"Failed to fetch boxscore for game {game_id}: {resp.status_code}")
    return resp.json()


def parse_boxscore(box: dict) -> tuple:
    """Convert a boxscore JSON into two lists: player rows and team rows."""
    players = []
    teams = []

    game_id = box.get('id')
    game_date = box.get('gameDate') or box.get('venueDate')

    for side in ['home', 'away']:
        team_key = f"{side}Team"
        team_info = box.get(team_key) or box.get('teams', {}).get(team_key) or {}
        team_id = team_info.get('id')
        team_abbrev = team_info.get('abbrev')
        team_name = None
        common_name = team_info.get('commonName')
        if isinstance(common_name, dict):
            team_name = common_name.get('default')
        stats_by_type = box.get('playerByGameStats', {}).get(f"{side}Team", {})

        team_row = {
            'gameId': game_id,
            'date': game_date,
            'teamId': team_id,
            'teamAbbrev': team_abbrev,
            'teamName': team_name,
            'homeAway': 'home' if side == 'home' else 'away',
            'finalScore': team_info.get('score'),
            'numScratches': None
        }
        scratches = None
        scratches_section = box.get('scratches')
        if isinstance(scratches_section, dict):
            scratches = scratches_section.get(f"{side}Team")
        if scratches:
            team_row['numScratches'] = len(scratches)
        teams.append(team_row)

        for group_key in ['forwards', 'defense', 'goalies']:
            players_list = stats_by_type.get(group_key, [])
            for p in players_list:
                # Field names are an assumption based on observed responses: the boxscore
                # appears to use short keys ('name', 'position', 'sog', 'pim', 'toi',
                # 'savePctg').  The longer spellings are kept as fallbacks.
                name = p.get('name')
                fn = p.get('firstName')
                ln = p.get('lastName')
                if isinstance(name, dict):
                    player_name = name.get('default', '')
                else:
                    first_name = fn.get('default', '') if isinstance(fn, dict) else ''
                    last_name = ln.get('default', '') if isinstance(ln, dict) else ''
                    player_name = (first_name + ' ' + last_name).strip()
                player_row = {
                    'gameId': game_id,
                    'date': game_date,
                    'teamId': team_id,
                    'teamAbbrev': team_abbrev,
                    'teamName': team_name,
                    'homeAway': side,
                    'playerId': p.get('playerId'),
                    'playerName': player_name,
                    'sweaterNumber': p.get('sweaterNumber'),
                    'positionCode': p.get('position', p.get('positionCode')),
                    'goals': p.get('goals'),
                    'assists': p.get('assists'),
                    'points': p.get('points'),
                    'shotsOnGoal': p.get('sog', p.get('shotsOnGoal')),
                    'hits': p.get('hits'),
                    'pim': p.get('pim', p.get('penaltyMinutes')),
                    'plusMinus': p.get('plusMinus'),
                    'timeOnIce': p.get('toi', p.get('timeOnIce')),
                    # Goalie-only fields; these stay None for skaters
                    'saves': p.get('saves'),
                    'shotsAgainst': p.get('shotsAgainst'),
                    'goalsAgainst': p.get('goalsAgainst'),
                    'savePct': p.get('savePctg', p.get('savePercentage')),
                    # Skater-only fields; 'faceoffWinningPctg' is an assumed key name
                    'faceoffWins': p.get('faceoffWins'),
                    'faceoffWinPct': p.get('faceoffWinningPctg')
                }
                players.append(player_row)
    return players, teams


In [None]:
player_rows = []
team_rows = []

print(f"Fetching boxscores for {len(schedule_df)} games...")
for game_id in tqdm(schedule_df['gameId'].tolist()):
    try:
        box = fetch_boxscore(int(game_id))
        players, teams = parse_boxscore(box)
        player_rows.extend(players)
        team_rows.extend(teams)
    except Exception as e:
        print(f"Warning: {e}")
        continue

players_df = pd.DataFrame(player_rows)
teams_df = pd.DataFrame(team_rows)

players_df.to_csv(f'nhl_{SEASON}_player_game_stats.csv', index=False)
teams_df.to_csv(f'nhl_{SEASON}_game_team_stats.csv', index=False)

players_df.head()


In [None]:
# Derive each player's most recent team in the season
latest_team_series = (players_df.sort_values(['playerId','date'])
                                    .groupby('playerId')
                                    .tail(1)
                                    .set_index('playerId')['teamId'])

players_df['latestTeamId'] = players_df['playerId'].map(latest_team_series)

players_df.to_csv(f'nhl_{SEASON}_player_game_stats_with_latest_team.csv', index=False)

players_df[['playerId','playerName','teamId','latestTeamId']].head()


## Injury and scratch data

The NHL API does not provide a dedicated injuries endpoint for public consumption.  However, the game-by-game boxscore may include a list of players who were scratched (did not dress) for each team.  These scratches may be due to injury, illness or coach’s decision, and the data does not say which.  In the team summary table the column `numScratches` records how many players were listed as scratched for each team in each game; it is left empty when the boxscore reports no scratches.  For more detailed injury information you will need to cross-reference third-party injury reports or subscription services.  The TopDownHockey package documentation notes that EliteProspects and NHL play-by-play data can be scraped for personal use, but such scrapes often require careful handling of rate limits and may not explicitly label injuries.  Treat the `scratches` information as a rough proxy for unavailable players when building your model.
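
As one illustration of using scratches as a proxy, the sketch below derives a per-game scratch differential (own scratches minus the opponent's) from rows shaped like the game-team table.  The function name `scratch_differential` and the sample rows are hypothetical, and whether this feature helps a given model is untested here.

```python
def scratch_differential(team_rows):
    """Map (gameId, teamId) to own numScratches minus the opponent's.

    The value is None when either side's scratch count is unknown.
    """
    # Group the two team rows of each game together.
    by_game = {}
    for row in team_rows:
        by_game.setdefault(row["gameId"], []).append(row)

    diff = {}
    for game_id, sides in by_game.items():
        if len(sides) != 2:
            continue  # skip games that do not have exactly two team rows
        a, b = sides
        for own, opp in ((a, b), (b, a)):
            key = (game_id, own["teamId"])
            if own["numScratches"] is None or opp["numScratches"] is None:
                diff[key] = None
            else:
                diff[key] = own["numScratches"] - opp["numScratches"]
    return diff


# Hypothetical rows: game 1 has counts for both teams, game 2 is missing one.
sample_rows = [
    {"gameId": 1, "teamId": 10, "numScratches": 3},
    {"gameId": 1, "teamId": 20, "numScratches": 1},
    {"gameId": 2, "teamId": 10, "numScratches": None},
    {"gameId": 2, "teamId": 30, "numScratches": 2},
]
features = scratch_differential(sample_rows)
```

Missing counts are propagated as `None` rather than treated as zero, because an absent scratch list is not evidence that a team dressed its full roster.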
