# Scrape Data
In this module we will scrape nhl.com for all of the stats we need to train our machine learning model. NHL.com provides an API that returns JSON, which makes it very easy to scrape. There is a great guide at https://github.com/dword4/nhlapi on how to navigate and use the NHL API.

You can see what the API looks like for a game at this link https://statsapi.web.nhl.com/api/v1/game/2021020797/feed/live. Buried in all that text is all the game information and stats we will need for our model.

Using a JSON viewer will also make analyzing the JSON code a lot easier. When writing this code I used this one: http://jsonviewer.stack.hu/.

We are eventually going to build 3 dataframes in order to train our model:
1. Team stats dataframe - Contains team stats. Each row represents 1 team that played in a game.
2. Goalie stats dataframe - Contains goalie stats. Each row represents 1 goalie that played in a game.
3. Training dataframe - Contains game information + team stats + goalie stats and will be used to train our model. Each row represents 1 game.

In [1]:
import datetime as dt
from datetime import timedelta
import json
from typing import List, Dict
from bs4 import BeautifulSoup
import requests
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter
import pickle
import pandas as pd

#show full columns on dataframes
pd.set_option('display.expand_frame_repr', False)
pd.set_option("display.max_rows", 25)

## Scraping Functions
Our first step will be to write a few functions that will scrape the NHL.com API. 

Each NHL game has a unique game id used as an identifier. This function will take a season as an integer and return a list containing all game ids for that season.

In [2]:
# get game ids from nhl.com API
def get_game_ids(season):
    '''
    Retrieves all of the game ids for the provided season
    Arguments:
        season (int): the season for which you want to retrieve game ids (ex: 20192020)
    Returns:
        List[int]: a list containing all regular season game ids for that season
    '''

    session = requests.Session()
    retry = Retry(connect=3, backoff_factor=0.5)
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)

    season_str: str = str(season)
    url: str = f"https://statsapi.web.nhl.com/api/v1/schedule?season={season_str}&gameType=R"
    resp = session.get(url)
    raw_schedule = json.loads(resp.text)
    schedule = raw_schedule['dates']
    # each entry in schedule is a day in the NHL. Each 'games' key contains games on that day
    # Therefore we need a nested loop to retrieve all games

    game_ids=[]

    for day in schedule:
        # Retrieve list that shows all games played on that day
        games = day['games']
        # Loop through games and retrieve ids
        for game in games:
            game_id = game['gamePk']
            game_ids.append(game_id)
    return game_ids
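
The nested loop in `get_game_ids` can be tried without touching the network. The sketch below runs the same traversal over a small hand-written dictionary; the payload and the helper name `flatten_game_ids` are assumptions made for illustration, and only the `dates`, `games` and `gamePk` keys come from the function above.

```python
# Hypothetical, hand-written payload that mimics the shape of the schedule
# response: a list of days, each holding a list of games with a 'gamePk'.
sample_schedule = {
    "dates": [
        {"games": [{"gamePk": 2012020001}, {"gamePk": 2012020002}]},
        {"games": [{"gamePk": 2012020003}]},
    ]
}


def flatten_game_ids(raw_schedule: dict) -> list:
    """Collect every 'gamePk' from a schedule-shaped dictionary."""
    return [game["gamePk"] for day in raw_schedule["dates"] for game in day["games"]]


print(flatten_game_ids(sample_schedule))
```

The list comprehension visits days first and games second, so the ids come out in the same order as the nested `for` loops would produce.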

This next function will scrape the following data for both the home and away team for the game id provided:
- Date game was played on
- Home team
- Away team
- Goals scored
- PIM (penalty minutes)
- Shots
- PP%
- PPG
- PP Opportunities
- FO%
- Blocked Shots
- Takeaways
- Giveaways
- Hits
- Starting Goalies
- Outcome

The function returns a list containing 2 dictionaries: the first holds all the information for the home team and the second holds all the information for the away team.


In [3]:
def scrape_team_stats(game_id: int) -> List[Dict]:
    """
        returns two entries in a List.
        The first entry is stats for the home team and the second is stats for the away team.
        Each entry represents 1 game played.
        Refer to: https://github.com/dword4/nhlapi on how to use the NHL API
        Arguments
            game_id (int): game id we are retrieving data for
        Returns
            List[dict]: list containing an entry for the home team and away team playing in the
                        same game
    """
    session = requests.Session()
    retry = Retry(connect=3, backoff_factor=0.5)
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)

    url = f'https://statsapi.web.nhl.com/api/v1/game/{str(game_id)}/feed/live'
    resp = session.get(url)
    json_data = json.loads(resp.text)

    # RETRIEVE STATS REQUIRED

    # retrieve date and convert to date time
    game_date: str = json_data['gameData']['datetime']['dateTime']
    game_date = dt.datetime.strptime(game_date, '%Y-%m-%dT%H:%M:%SZ')

    # Retrieve team names
    home_team: str = json_data["liveData"]['boxscore']['teams']['home']['team']['abbreviation']
    away_team: str = json_data["liveData"]['boxscore']['teams']['away']['team']['abbreviation']

    # collect list of teamSkaterStats we want to retrieve from json data
    team_skater_stats_home = json_data["liveData"]\
        ["boxscore"]['teams']['home']['teamStats']['teamSkaterStats']
    team_skater_stats_away = json_data["liveData"]\
        ["boxscore"]['teams']['away']['teamStats']['teamSkaterStats']

    # Starting goalies
    # spot checked a few APIs and it seems like the starting goalie will be listed last in the json
    # file if he was pulled. The goalie that finishes the game will be listed first (0).
    home_team_starting_goalie_id = json_data["liveData"]\
        ['boxscore']['teams']['home']['goalies'][-1]
    away_team_starting_goalie_id = json_data["liveData"]\
        ['boxscore']['teams']['away']['goalies'][-1]
    home_team_starting_goalie_name = json_data["liveData"]\
        ['boxscore']['teams']['home']['players']\
            ['ID'+str(home_team_starting_goalie_id)]['person']['fullName']
    away_team_starting_goalie_name = json_data["liveData"]\
        ['boxscore']['teams']['away']['players']\
            ['ID'+str(away_team_starting_goalie_id)]['person']['fullName']

    # retrieve outcome (same for both home team and away team)
    if not json_data['liveData']['linescore']['hasShootout']:
        if (json_data["liveData"]["boxscore"]\
            ['teams']['home']['teamStats']['teamSkaterStats']['goals'] >
            json_data["liveData"]["boxscore"]\
                ['teams']['away']['teamStats']['teamSkaterStats']['goals']):
            home_team_win = True
        if (json_data["liveData"]["boxscore"]\
            ['teams']['home']['teamStats']['teamSkaterStats']['goals'] <
            json_data["liveData"]["boxscore"]\
                ['teams']['away']['teamStats']['teamSkaterStats']['goals']):
            home_team_win = False
    if json_data['liveData']['linescore']['hasShootout']:
        if (json_data['liveData']['linescore']['shootoutInfo']['home']['scores'] >
            json_data['liveData']['linescore']['shootoutInfo']['away']['scores']):
            home_team_win = True
        if (json_data['liveData']['linescore']['shootoutInfo']['home']['scores'] <
            json_data['liveData']['linescore']['shootoutInfo']['away']['scores']):
            home_team_win = False

    # create dictionaries for the home and away team
    if game_id == 2020020215: # manually overriding incorrect data in the NHL API for this game
        home_team_stats = {'date':game_date, 'game_id':game_id,
                           'team':home_team, 'is_home_team':True,
                           'home_team_win':False,
                           'goalie_id':home_team_starting_goalie_id,
                           'goalie_name':home_team_starting_goalie_name}

        home_team_stats.update(team_skater_stats_home)

        away_team_stats = {'date':game_date, 'game_id':game_id,
                           'team':away_team, 'is_home_team':False,
                           'home_team_win':False,
                           'goalie_id':away_team_starting_goalie_id,
                           'goalie_name':away_team_starting_goalie_name}

        away_team_stats.update(team_skater_stats_away)

    else:
        home_team_stats = {'date':game_date, 'game_id':game_id,
                           'team':home_team, 'is_home_team':True,
                           'home_team_win':home_team_win,
                           'goalie_id':home_team_starting_goalie_id,
                           'goalie_name':home_team_starting_goalie_name}

        home_team_stats.update(team_skater_stats_home)

        away_team_stats = {'date':game_date, 'game_id':game_id,
                           'team':away_team, 'is_home_team':False,
                           'home_team_win':home_team_win,
                           'goalie_id':away_team_starting_goalie_id,
                           'goalie_name':away_team_starting_goalie_name}

        away_team_stats.update(team_skater_stats_away)

    teams = [home_team_stats, away_team_stats]

    return teams
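
One thing to watch in the outcome logic above: `home_team_win` is only assigned when one side is strictly ahead, so if the feed ever reported equal values the name would never be set and building the dictionaries would fail. The sketch below is one possible way to isolate that decision in a small helper; `home_team_won` and its parameters are hypothetical names, not part of the notebook.

```python
def home_team_won(home_goals: int, away_goals: int,
                  has_shootout: bool = False,
                  home_so_scores: int = 0, away_so_scores: int = 0):
    """Return True/False for a decided game, or None when the inputs are tied."""
    if has_shootout:
        # In a shootout game the shootout scores decide the winner.
        home, away = home_so_scores, away_so_scores
    else:
        home, away = home_goals, away_goals
    if home == away:
        return None  # inconsistent or incomplete data; let the caller decide
    return home > away


print(home_team_won(3, 2))
print(home_team_won(2, 2, has_shootout=True, home_so_scores=1, away_so_scores=2))
```

Returning `None` for a tie makes bad input visible to the caller instead of leaving a variable unassigned.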

Let's run this function for one game so we can see the output.

In [4]:
data = scrape_team_stats(2021020797)
print(data)

[{'date': datetime.datetime(2022, 2, 2, 0, 0), 'game_id': 2021020797, 'team': 'TBL', 'is_home_team': True, 'home_team_win': True, 'goalie_id': 8476883, 'goalie_name': 'Andrei Vasilevskiy', 'goals': 3, 'pim': 10, 'shots': 32, 'powerPlayPercentage': '25.0', 'powerPlayGoals': 1.0, 'powerPlayOpportunities': 4.0, 'faceOffWinPercentage': '46.7', 'blocked': 10, 'takeaways': 4, 'giveaways': 4, 'hits': 37}, {'date': datetime.datetime(2022, 2, 2, 0, 0), 'game_id': 2021020797, 'team': 'SJS', 'is_home_team': False, 'home_team_win': True, 'goalie_id': 8473503, 'goalie_name': 'James Reimer', 'goals': 2, 'pim': 12, 'shots': 21, 'powerPlayPercentage': '33.3', 'powerPlayGoals': 1.0, 'powerPlayOpportunities': 3.0, 'faceOffWinPercentage': '53.3', 'blocked': 11, 'takeaways': 6, 'giveaways': 1, 'hits': 43}]
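
In the output above, `powerPlayPercentage` and `faceOffWinPercentage` come back as strings while the other stats are numbers. A minimal sketch of converting them follows; the helper name `with_numeric_stats` is hypothetical, and the assumption that these two are the only string-valued stats is based on this one sample.

```python
# Record trimmed from the sample output above; only a few keys are kept.
record = {"team": "TBL", "goals": 3, "powerPlayPercentage": "25.0",
          "faceOffWinPercentage": "46.7"}

# Assumption: these are the only stat fields that arrive as strings.
STRING_STATS = ("powerPlayPercentage", "faceOffWinPercentage")


def with_numeric_stats(row: dict) -> dict:
    """Return a copy of row with the string-valued stats converted to float."""
    converted = dict(row)
    for key in STRING_STATS:
        if key in converted:
            converted[key] = float(converted[key])
    return converted


print(with_numeric_stats(record))
```

The function copies the dictionary first, so the scraped record itself is left untouched.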


Next we will scrape goalie stats. This function will take a game id and return a list containing a dictionary for each goalie that played in that game.

In [4]:
def scrape_goalie_stats(game_id: int) -> List[Dict]:
    """
        retrieves a list of dictionaries containing goalie stats for all
        goalies that played in the game specified by game_id.
        Each dictionary represents one goalie.
        Refer to: https://github.com/dword4/nhlapi on how to use the NHL API
        Arguments
            game_id (int): game id we are retrieving data for
        Returns
            List[Dict]: list containing one dictionary for each goalie that played in the
                        game.
        """

    # backoff strategy to avoid max retry errors
    session = requests.Session()
    retry = Retry(connect=3, backoff_factor=0.5)
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)

    url = f'https://statsapi.web.nhl.com/api/v1/game/{str(game_id)}/feed/live'
    resp = session.get(url)
    json_data = json.loads(resp.text)

    # RETRIEVE STATS REQUIRED

    # get date
    game_date = json_data['gameData']['datetime']['dateTime']
    game_date = dt.datetime.strptime(game_date, '%Y-%m-%dT%H:%M:%SZ')

    # Get goalie team
    home_goalie_team = json_data['gameData']['teams']['home']['abbreviation']
    away_goalie_team = json_data['gameData']['teams']['away']['abbreviation']

    # Get goalie IDs
    home_goalie_id = json_data['liveData']['boxscore']['teams']['home']['goalies']
    away_goalie_id = json_data['liveData']['boxscore']['teams']['away']['goalies']

    # Get goalie names
    home_goalie_names = []
    away_goalie_names = []

    # for loop to iterate through list of home goalies that played this game
    for i in home_goalie_id:
        j = json_data['liveData']['boxscore']['teams']['home']['players']['ID' + str(i)]\
            ['person']['fullName']
        home_goalie_names.append(j)
    # for loop to iterate through list of away goalies that played this game
    for i in away_goalie_id:
        j = json_data['liveData']['boxscore']['teams']['away']['players']['ID' + str(i)]\
            ['person']['fullName']
        away_goalie_names.append(j)

    # Get goalie stats
    home_goalie_stats = []
    away_goalie_stats = []
    # for loop to iterate through list of home goalies that played this game
    for i in home_goalie_id:
        j = json_data['liveData']['boxscore']['teams']['home']['players']['ID' + str(i)]\
            ['stats']['goalieStats']
        home_goalie_stats.append(j)
    # for loop to iterate through list of away goalies that played this game
    for i in away_goalie_id:
        j = json_data['liveData']['boxscore']['teams']['away']['players']['ID' + str(i)]\
            ['stats']['goalieStats']
        away_goalie_stats.append(j)

    # make home goalie list. for loop needed as there could be more than 2 goalies playing in 1 game
    home_goalies = []
    goalie_counter = list(range(len(home_goalie_stats))) # counter for number of goalies that played

    for goalie_count in goalie_counter:
        # create dictionary for goalie
        home_goalie = {'date':game_date, 'game_id':game_id, 'team':home_goalie_team,
                       'goalie_name':home_goalie_names[goalie_count],\
                           'goalie_id':home_goalie_id[goalie_count],
                       'is_home_team':True}
        home_goalie.update(home_goalie_stats[goalie_count])

        home_goalies.append(home_goalie)

    # make away goalie list. for loop needed as there could be more than 2 goalies playing in 1 game
    away_goalies = []
    goalie_counter = list(range(len(away_goalie_stats))) # counter for number of goalies that played

    for goalie_count in goalie_counter:
        # create dictionary for goalie
        away_goalie = {'date':game_date, 'game_id':game_id, 'team':away_goalie_team,
                       'goalie_name':away_goalie_names[goalie_count],\
                           'goalie_id':away_goalie_id[goalie_count],
                       'is_home_team':False}
        away_goalie.update(away_goalie_stats[goalie_count])

        away_goalies.append(away_goalie)


    # Merge the two lists
    goalie_stats = away_goalies + home_goalies

    return goalie_stats

Let's try this function for one game.

In [6]:
data = scrape_goalie_stats(2021020797)
print(data)

[{'date': datetime.datetime(2022, 2, 2, 0, 0), 'game_id': 2021020797, 'team': 'SJS', 'goalie_name': 'James Reimer', 'goalie_id': 8473503, 'is_home_team': False, 'timeOnIce': '62:45', 'assists': 0, 'goals': 0, 'pim': 0, 'shots': 32, 'saves': 29, 'powerPlaySaves': 3, 'shortHandedSaves': 0, 'evenSaves': 26, 'shortHandedShotsAgainst': 0, 'evenShotsAgainst': 28, 'powerPlayShotsAgainst': 4, 'decision': 'L', 'savePercentage': 90.625, 'powerPlaySavePercentage': 75.0, 'evenStrengthSavePercentage': 92.85714285714286}, {'date': datetime.datetime(2022, 2, 2, 0, 0), 'game_id': 2021020797, 'team': 'TBL', 'goalie_name': 'Andrei Vasilevskiy', 'goalie_id': 8476883, 'is_home_team': True, 'timeOnIce': '62:45', 'assists': 0, 'goals': 0, 'pim': 0, 'shots': 21, 'saves': 19, 'powerPlaySaves': 4, 'shortHandedSaves': 1, 'evenSaves': 14, 'shortHandedShotsAgainst': 1, 'evenShotsAgainst': 15, 'powerPlayShotsAgainst': 5, 'decision': 'W', 'savePercentage': 90.47619047619048, 'powerPlaySavePercentage': 80.0, 'shortH

If a goalie gets pulled, we should make sure that all goalies are scraped for that game. Let's try a game where a goalie was pulled.

In [7]:
data = scrape_goalie_stats(2021020798)
print(data)

[{'date': datetime.datetime(2022, 2, 2, 0, 0), 'game_id': 2021020798, 'team': 'TOR', 'goalie_name': 'Jack Campbell', 'goalie_id': 8475789, 'is_home_team': False, 'timeOnIce': '60:00', 'assists': 0, 'goals': 0, 'pim': 0, 'shots': 32, 'saves': 31, 'powerPlaySaves': 4, 'shortHandedSaves': 1, 'evenSaves': 26, 'shortHandedShotsAgainst': 1, 'evenShotsAgainst': 27, 'powerPlayShotsAgainst': 4, 'decision': 'W', 'savePercentage': 96.875, 'powerPlaySavePercentage': 100.0, 'shortHandedSavePercentage': 100.0, 'evenStrengthSavePercentage': 96.29629629629629}, {'date': datetime.datetime(2022, 2, 2, 0, 0), 'game_id': 2021020798, 'team': 'NJD', 'goalie_name': 'Jon Gillies', 'goalie_id': 8476903, 'is_home_team': True, 'timeOnIce': '40:00', 'assists': 0, 'goals': 0, 'pim': 0, 'shots': 28, 'saves': 22, 'powerPlaySaves': 2, 'shortHandedSaves': 1, 'evenSaves': 19, 'shortHandedShotsAgainst': 1, 'evenShotsAgainst': 25, 'powerPlayShotsAgainst': 2, 'decision': 'L', 'savePercentage': 78.57142857142857, 'powerPla

Perfect, we can see that the list for this game contains 3 dictionaries, one for each of the 3 goalies that played.
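
A quick way to confirm this kind of result without reading the raw printout is to group the goalie records by team. The sketch below uses hand-trimmed records; the three names are the ones that appear in this notebook's outputs for game 2021020798, while the helper name `goalies_by_team` is an assumption.

```python
from collections import defaultdict

# Records trimmed to the two keys used here.
goalies = [
    {"team": "TOR", "goalie_name": "Jack Campbell"},
    {"team": "NJD", "goalie_name": "Jon Gillies"},
    {"team": "NJD", "goalie_name": "Akira Schmid"},
]


def goalies_by_team(records: list) -> dict:
    """Map each team abbreviation to the names of its goalies in one game."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["team"]].append(rec["goalie_name"])
    return dict(grouped)


by_team = goalies_by_team(goalies)
# A team with more than one goalie made a goalie change during the game.
teams_with_change = [team for team, names in by_team.items() if len(names) > 1]
print(by_team)
print(teams_with_change)
```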

This next function will pull game information. It will take a game id and return a dictionary containing game information for that game.

In [5]:
def scrape_game_info(game_id:int) -> Dict:
    """
        returns a dictionary with game information for the game_id provided
        Refer to: https://github.com/dword4/nhlapi on how to use the NHL API
        Arguments
            game_id (int): game id we are retrieving data for
        Returns
            Dict: Dictionary with information for the game_id provided
        """
    session = requests.Session()
    retry = Retry(connect=3, backoff_factor=0.5)
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)

    url = f'https://statsapi.web.nhl.com/api/v1/game/{str(game_id)}/feed/live'
    resp = session.get(url)
    json_data = json.loads(resp.text)

    # RETRIEVE INFO REQUIRED

    # retrieve date and convert to date time
    game_date: str = json_data['gameData']['datetime']['dateTime']
    game_date = dt.datetime.strptime(game_date, '%Y-%m-%dT%H:%M:%SZ')

    # Retrieve team names
    home_team: str = json_data["liveData"]['boxscore']['teams']['home']['team']['abbreviation']
    away_team: str = json_data["liveData"]['boxscore']['teams']['away']['team']['abbreviation']

    # retrieve outcome 'hasShootout' is a boolean
    if not json_data['liveData']['linescore']['hasShootout']:
        if json_data["liveData"]["boxscore"]['teams']['home']['teamStats']['teamSkaterStats']\
            ['goals'] > json_data["liveData"]["boxscore"]['teams']['away']['teamStats']\
                ['teamSkaterStats']['goals']:
            home_team_win = True
        if json_data["liveData"]["boxscore"]['teams']['home']['teamStats']['teamSkaterStats']\
            ['goals'] < json_data["liveData"]["boxscore"]['teams']['away']['teamStats']\
                ['teamSkaterStats']['goals']:
            home_team_win = False
    if json_data['liveData']['linescore']['hasShootout']:
        if json_data['liveData']['linescore']['shootoutInfo']['home']['scores'] >\
            json_data['liveData']['linescore']['shootoutInfo']['away']['scores']:
            home_team_win = True
        if json_data['liveData']['linescore']['shootoutInfo']['home']['scores'] <\
            json_data['liveData']['linescore']['shootoutInfo']['away']['scores']:
            home_team_win = False

    # Starting goalies
    # spot checked a few APIs and it seems like the starting goalie will be listed last in the json
    # file if he was pulled. The goalie that finishes the game will be listed first (0).
    home_team_starting_goalie_id = json_data["liveData"]['boxscore']['teams']['home']['goalies'][-1]
    away_team_starting_goalie_id = json_data["liveData"]['boxscore']['teams']['away']['goalies'][-1]
    home_team_starting_goalie_name = \
    json_data["liveData"]['boxscore']['teams']['home']['players']['ID' +\
        str(home_team_starting_goalie_id)]['person']['fullName']
    away_team_starting_goalie_name = \
    json_data["liveData"]['boxscore']['teams']['away']['players']['ID' +\
        str(away_team_starting_goalie_id)]['person']['fullName']
    if game_id == 2020020215: # manually overriding incorrect data in the NHL API for this game
        game_info = {'date':game_date, 'game_id':game_id, 'home_team':home_team,\
            'away_team':away_team, 'home_team_win':False,\
                'home_goalie_id':home_team_starting_goalie_id,\
                    'away_goalie_id':away_team_starting_goalie_id,\
                        'home_goalie_name':home_team_starting_goalie_name,\
                            'away_goalie_name':away_team_starting_goalie_name}
    else:
        game_info = {'date':game_date, 'game_id':game_id, 'home_team':home_team,\
            'away_team':away_team, 'home_team_win':home_team_win,\
                'home_goalie_id':home_team_starting_goalie_id,\
                    'away_goalie_id':away_team_starting_goalie_id,\
                        'home_goalie_name':home_team_starting_goalie_name,\
                            'away_goalie_name':away_team_starting_goalie_name}
    return game_info

Let's run an example game id.

In [9]:
print(scrape_game_info(2021020798))

{'date': datetime.datetime(2022, 2, 2, 0, 0), 'game_id': 2021020798, 'home_team': 'NJD', 'away_team': 'TOR', 'home_team_win': False, 'home_goalie_id': 8481033, 'away_goalie_id': 8475789, 'home_goalie_name': 'Akira Schmid', 'away_goalie_name': 'Jack Campbell'}


We have now written our main scraping functions. We just need a few helper functions to help us along the way. This first function takes a game id and retrieves one of the teams playing in that game, depending on the boolean supplied as the second argument.

In [6]:
def retrieve_team(game_id: int, home: bool) -> str:
    """
    retrieves the team abbreviation playing in an NHL game
    Arguments
        game_id (int): game id we are retrieving data for
        home (bool): if True retrieves the home team, False retrieves away
    Returns
        team (str): team abbreviation
    """

    url = f'https://statsapi.web.nhl.com/api/v1/game/{str(game_id)}/feed/live'
    resp = requests.get(url)
    json_data = json.loads(resp.text)

    if home:
        team = json_data['gameData']['teams']['home']['abbreviation']
    else:
        team = json_data['gameData']['teams']['away']['abbreviation']

    return team

In [11]:
retrieve_team(2021020798, True)

'NJD'

In [12]:
retrieve_team(2021020798, False)

'TOR'

The next function will take a game id and retrieve the date that game was played on.

In [7]:
def retrieve_date(game_id: int) -> dt.datetime:
    """
    retrieves the date an NHL game was played
    ...
    Parameters
    ----------
    game_id: int
        game id we are retrieving data for
    Returns
    -------
    date: dt.datetime
        date that NHL game was played
    """
    url = f'https://statsapi.web.nhl.com/api/v1/game/{str(game_id)}/feed/live'
    resp = requests.get(url)
    json_data = json.loads(resp.text)

    date = json_data['gameData']['datetime']['dateTime']
    date = dt.datetime.strptime(date, '%Y-%m-%dT%H:%M:%SZ')
    # The API reports the start time in UTC. The venue's time zone offset is a signed
    # number of hours (negative in North America), so adding it gives the local time.

    offset = int(json_data['gameData']['teams']['home']['venue']['timeZone']['offset'])

    date = date + timedelta(hours=offset)

    return date

In [14]:
retrieve_date(2021020804)

datetime.datetime(2022, 2, 1, 19, 30)
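
The offset arithmetic in `retrieve_date` can be checked by hand. The sketch below uses assumed sample values (a UTC timestamp in the API's format and an offset of -5 hours) chosen so that the result has the same shape as the output above; they are not read from the API.

```python
import datetime as dt

# Assumed sample values: a UTC timestamp in the API's format and a venue
# offset of -5 hours.
utc_string = "2022-02-02T00:30:00Z"
venue_offset_hours = -5

utc_time = dt.datetime.strptime(utc_string, "%Y-%m-%dT%H:%M:%SZ")
# Adding a negative offset moves the clock back to the venue's local time.
local_time = utc_time + dt.timedelta(hours=venue_offset_hours)
print(local_time)
```

Note that the conversion also moves the calendar date back by one day here, which matters when games are later grouped by date.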

Starting goalies are very important in the NHL and in order to make accurate predictions we will need to predict the starting goalies for each game. The following function will go to dailyfaceoff.com and scrape the predicted starting goalies for a game.

In [8]:
def get_starting_goalies(home_abv: str, away_abv: str, date: str) -> tuple:
    """
    scrapes starting goaltenders from dailyfaceoff.com for the specified date and teams
    ...
    Parameters
    ----------
    home_abv: str
        abbreviation for home team
    away_abv: str
        abbreviation for away team
    date: str
        date for which we want to retrieve starting goalies (ex. '01-13-2021')
    Returns
    -------
    home_goalie: str
        home goalie name
    away_goalie: str
        away goalie name
    """

    # First define a dictionary to translate team abbreviations
    # in our df to the team names used on daily faceoff
    team_translations = {'MIN':'Minnesota Wild','TOR':'Toronto Maple Leafs',
                         'PIT':'Pittsburgh Penguins', 'COL':'Colorado Avalanche',
                         'EDM':'Edmonton Oilers', 'CAR':'Carolina Hurricanes',
                         'CBJ':'Columbus Blue Jackets', 'NJD':'New Jersey Devils',
                         'DET':'Detroit Red Wings', 'OTT':'Ottawa Senators',
                         'BOS':'Boston Bruins', 'SJS':'San Jose Sharks',
                         'BUF':'Buffalo Sabres','NYI':'New York Islanders',
                         'WSH':'Washington Capitals','TBL':'Tampa Bay Lightning',
                         'STL':'St Louis Blues', 'NSH':'Nashville Predators',
                         'CHI':'Chicago Blackhawks', 'VAN':'Vancouver Canucks',
                         'CGY':'Calgary Flames', 'PHI':'Philadelphia Flyers',
                         'LAK':'Los Angeles Kings', 'MTL':'Montreal Canadiens',
                         'ANA':'Anaheim Ducks', 'DAL':'Dallas Stars',
                         'NYR':'New York Rangers', 'FLA':'Florida Panthers',
                         'WPG':'Winnipeg Jets', 'ARI':'Arizona Coyotes',
                         'VGK':'Vegas Golden Knights'}

    home_team = team_translations[home_abv]
    away_team = team_translations[away_abv]

    url = f'https://www.dailyfaceoff.com/starting-goalies/{date}'

    # Need headers as daily faceoff will block the get request without one
    headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6)\
        AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.193 Safari/537.36'}
    result = requests.get(url, headers=headers)

    # Parse the data
    src = result.content
    soup = BeautifulSoup(src, 'lxml')

    goalie_boxes = soup.find_all('div', {'class':'starting-goalies-card stat-card'})

    # find the goalie box that contains the game we are looking for
    # (each team name needs its own `in` test against the box text)
    for box in goalie_boxes:
        if home_team in box.text and away_team in box.text:
            goalie_box = box
    # retrieve the h4 headings which contain the starting goalies

    h4_headings = goalie_box.find_all('h4')

    # Away goalie is at element 1 and home goalie is at element 2
    away_goalie = h4_headings[1].text
    home_goalie = h4_headings[2].text

    return home_goalie, away_goalie
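
When matching a card against both team names, each name needs its own `in` test. An expression of the form `a and b in text` only searches for `b` whenever `a` is a non-empty string. The sketch below shows the difference on a made-up piece of card text.

```python
card_text = "Ottawa Senators at Boston Bruins"  # made-up card text

# Python parses `a and b in text` as `a and (b in text)`, so a non-empty `a`
# is never actually searched for.
loose = bool("Toronto Maple Leafs" and "Boston Bruins" in card_text)

# Testing each name separately gives the intended result.
strict = "Toronto Maple Leafs" in card_text and "Boston Bruins" in card_text

print(loose, strict)
```

The loose form reports a match even though Toronto is not on the card, which could silently pick the wrong game.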

Each player in the NHL has a unique numeric ID. Once we have the starting goalie names from Daily Faceoff, we need to convert each name to the numeric ID that nhl.com uses to identify the player. The following function will accomplish this.

In [9]:
def convert_player_to_id(team_name: str, player_name: str):
    """
    converts a player name to id
    ...
    Parameters
    ----------
    team_name: str
        abbreviation for the players team
    player_name: str
        player name string. first and last name (ex. 'Olli Jokinen')
    Returns
    -------
    player_id: int
        player id
    """
    url = 'https://statsapi.web.nhl.com/api/v1/teams'
    resp = requests.get(url)
    json_data = json.loads(resp.text)

    for team in json_data['teams']:
        if team['abbreviation'] == team_name:
            team_id = team['id']
        else:
            continue
    # Use the team id to go to team page
    url = f'https://statsapi.web.nhl.com/api/v1/teams/{team_id}?expand=team.roster'
    resp = requests.get(url)
    json_data = json.loads(resp.text)

    team_roster = json_data['teams'][0]['roster']['roster']

    for player_info in team_roster:
        if player_info['person']['fullName'] == player_name:
            return player_info['person']['id']
        else:
            continue
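
`convert_player_to_id` returns `None` implicitly when no roster entry matches, for example when the two sites spell a name differently. The sketch below shows the same lookup on a hand-written roster with that fallback made explicit. The helper name `find_player_id` is hypothetical, and the case-insensitive comparison is an addition of this sketch, not something the function above does.

```python
from typing import Optional

# Hypothetical roster trimmed to the fields used above; the ids and names
# follow the outputs shown earlier in this notebook.
sample_roster = [
    {"person": {"id": 8476903, "fullName": "Jon Gillies"}},
    {"person": {"id": 8481033, "fullName": "Akira Schmid"}},
]


def find_player_id(roster: list, player_name: str) -> Optional[int]:
    """Return the id of the first roster entry whose fullName matches."""
    wanted = player_name.strip().lower()
    for entry in roster:
        if entry["person"]["fullName"].strip().lower() == wanted:
            return entry["person"]["id"]
    return None  # no match: the caller must handle a missing goalie


print(find_player_id(sample_roster, "akira schmid "))
print(find_player_id(sample_roster, "Olli Jokinen"))
```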

## Time to Scrape
Now that we have all our scraping functions, it's time to scrape the data we need.

### Scrape Game IDs
We are going to scrape all regular season games for the seasons starting in 2012 through 2020 (2012-13 to 2020-21). First we will get a list of game ids for all of these games. Since we don't want to scrape the same data over and over, we will store it in a pickle file in our data folder so we always have it.
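
One possible way to organise the "scrape once, then reuse the pickle file" idea is a small load-or-build helper. This is a sketch, not the notebook's approach: `load_or_build` and `fake_scrape` are hypothetical names, and a temporary folder stands in for the data folder.

```python
import os
import pickle
import tempfile


def load_or_build(path, build):
    """Load a pickled object from path, or build it, save it, and return it."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = build()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result


# Demonstration with a temporary folder and a stand-in for the real scraper.
calls = []


def fake_scrape():
    calls.append(1)  # record that the expensive step ran
    return [2012020001, 2012020002]


with tempfile.TemporaryDirectory() as folder:
    cache_path = os.path.join(folder, "game_ids.pkl")
    first = load_or_build(cache_path, fake_scrape)   # runs fake_scrape
    second = load_or_build(cache_path, fake_scrape)  # served from the file

print(first == second, len(calls))
```

The second call never reaches the scraper, which is the behaviour we want when re-running a notebook.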

In [13]:
def pull_game_ids(first_year: int=2012, last_year: int=2020) -> List[int]:
    """
    pulls all nhl game ids between the specified dates
    ...
    Parameters
    ----------
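    first_year: int
        starting year of the first season to retrieve game ids for
    last_year: int
        upper bound for season starting years (exclusive); the last season
        retrieved starts in last_year - 1
    Returns
    -------
    game_ids: List[int]
        list of game ids for all seasons retrieved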
    """
    #create a list of years for which we want data
    years = list(range(first_year, last_year))

    #create year for the get_game_ids() function argument in the format 20192020
    game_ids_url_years = []

    for i in years:
        j = str(i) + str(i+1)
        game_ids_url_years.append(j)

    #run for loop to retrieve game IDs for all seasons required
    ids = []
    for i in game_ids_url_years:

        if len(ids) % 500 == 0:  # Progress bar
            print(str(len(ids) / len(game_ids_url_years) * 100) +
                  ' percent done retrieving game ids.')

        try:
            ids = ids + get_game_ids(i)

        except KeyError:
            print(str('*************Not able to retrieve: ' +
                      str(i) +
                      ' games due to KeyError************'))
            continue

    return ids
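
Because `range(first_year, last_year)` stops before `last_year`, the call `pull_game_ids(first_year=2012, last_year=2021)` covers the seasons that start in 2012 through 2020. The sketch below reproduces only the season-string step; the helper name `season_codes` is hypothetical.

```python
def season_codes(first_year: int, last_year: int) -> list:
    """Season strings for seasons starting in first_year .. last_year - 1."""
    return [f"{year}{year + 1}" for year in range(first_year, last_year)]


codes = season_codes(2012, 2021)
print(len(codes), codes[0], codes[-1])
```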

In [14]:
#%% scrape game ids
game_ids = pull_game_ids(first_year=2012, last_year=2021)
with open('/Users/patrickpetanca/projects/nhl_analysis/data/game_ids.pkl', 'wb') as f:
    pickle.dump(game_ids, f)

print(game_ids[0:10])
print(game_ids[-10:])
print(str(len(game_ids)) + ' Game Ids were retrieved')

0.0 percent done retrieving game ids.
[2012020001, 2012020002, 2012020003, 2012020004, 2012020005, 2012020006, 2012020007, 2012020008, 2012020009, 2012020010]
[2020020688, 2020020790, 2020020803, 2020020647, 2020020704, 2020020741, 2020020673, 2020020567, 2020020864, 2020020634]
10144 Game Ids were retrieved


Looking at the first and last 10 elements of the list, everything looks good. The first 4 digits of the game id give the year the season started, the 02 that follows means it was a regular season game, and the last 4 digits give the game number within that season.
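
The structure of a game id described above can be made concrete with a few string slices. The helper name `split_game_id` is an assumption made for this sketch.

```python
def split_game_id(game_id: int) -> tuple:
    """Split an id such as 2012020001 into (season start year, type code, game number)."""
    text = str(game_id)
    # 4 digits for the year, 2 for the game type, the rest for the game number
    return int(text[:4]), text[4:6], int(text[6:])


print(split_game_id(2012020001))
print(split_game_id(2020020688))
```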

10144 ids were retrieved, and with some rough math that is about what we would expect for 9 seasons of hockey.

### Scrape Team Stats
Now let's write a function to pull our team stats for every game in our game_ids list.

In [21]:
#%% scrape team stats
def pull_team_stats(ids: List[int]) -> List[dict]:
    """
    pulls all team stats for the provided game ids
    ...
    Parameters
    ----------
    ids: List[int]
        list of game ids to pull team stats for
    Returns
    -------
    team_stats: List[dict]
        list of dictionaries, 2 per game (home team and away team)
    """

    # retrieve game by game stats for every game in the ids list
    stats = []

    for i in ids:
        stats_i = scrape_team_stats(i)
        stats += stats_i

        if len(stats) % 500 == 0:  # Progress bar
            print(str(0.5 * len(stats) /
                      len(ids) * 100) +
                  ' percent done retrieving game data/stats.')

    return stats

Pull team stats and store them in a pickle file. The following cell will take approximately an hour to run.

In [23]:
#%% scrape team stats
team_stats = pull_team_stats(game_ids)
with open('/Users/patrickpetanca/projects/nhl_analysis/data/team_stats.pkl', 'wb') as f:
    pickle.dump(team_stats, f)

2.4645110410094637 percent done retrieving game data/stats.
4.929022082018927 percent done retrieving game data/stats.
7.393533123028391 percent done retrieving game data/stats.
9.858044164037855 percent done retrieving game data/stats.
12.322555205047319 percent done retrieving game data/stats.
14.787066246056781 percent done retrieving game data/stats.
17.251577287066247 percent done retrieving game data/stats.
19.71608832807571 percent done retrieving game data/stats.
22.180599369085176 percent done retrieving game data/stats.
24.645110410094638 percent done retrieving game data/stats.
27.1096214511041 percent done retrieving game data/stats.
29.574132492113563 percent done retrieving game data/stats.
32.03864353312303 percent done retrieving game data/stats.
34.503154574132495 percent done retrieving game data/stats.
36.967665615141954 percent done retrieving game data/stats.
39.43217665615142 percent done retrieving game data/stats.
41.896687697160885 percent done retrieving game data/stats.
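Since the scrape takes about an hour, it can help to load the saved pickle on later runs instead of scraping again. The sketch below is one way to do that; `load_or_scrape` is a hypothetical helper, not part of the original notebook, and the demo uses a temporary directory with a fake scraper rather than the real path.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Callable, List


def load_or_scrape(path: Path, scrape: Callable[[], List[dict]]) -> List[dict]:
    """Return cached results from `path` if it exists, else scrape and cache them."""
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)
    data = scrape()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(data, f)
    return data


# Demo with a stand-in scraper so nothing touches the network
calls = []


def fake_scrape() -> List[dict]:
    calls.append(1)
    return [{"game_id": 2012020001}]


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "team_stats.pkl"
    first = load_or_scrape(cache, fake_scrape)   # scrapes and writes the file
    second = load_or_scrape(cache, fake_scrape)  # reads the file back

print(first == second, len(calls))
```

In the notebook this would be called as `load_or_scrape(Path(...), lambda: pull_team_stats(game_ids))` with whatever path the pickle is stored at.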

Let's see what one entry in this list looks like.

In [24]:
print(len(team_stats))
print(team_stats[0])

20288
{'date': datetime.datetime(2013, 1, 19, 20, 0), 'game_id': 2012020001, 'team': 'PHI', 'is_home_team': True, 'home_team_win': False, 'goalie_id': 8468524, 'goalie_name': 'Ilya Bryzgalov', 'goals': 1, 'pim': 6, 'shots': 27, 'powerPlayPercentage': '0.0', 'powerPlayGoals': 0.0, 'powerPlayOpportunities': 5.0, 'faceOffWinPercentage': '43.5', 'blocked': 12, 'takeaways': 8, 'giveaways': 12, 'hits': 40}


We have 20288 dictionaries in our list, which is exactly double the number of game_ids we pulled (2 x 10144). This makes sense, as there are 2 teams playing in each game.
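The total matching is a good sign, but it does not prove that every game has exactly two rows. A minimal sketch of a per-game check follows; the helper name is hypothetical and the sample rows are hand-made for illustration (the second game and its team label are made up).

```python
from collections import Counter
from typing import Dict, List


def games_without_two_teams(rows: List[dict]) -> Dict[int, int]:
    """Return {game_id: row_count} for every game that does not have exactly two rows."""
    counts = Counter(row["game_id"] for row in rows)
    return {game_id: n for game_id, n in counts.items() if n != 2}


# Illustrative sample shaped like the scraped entries
sample = [
    {"game_id": 2012020001, "team": "PHI", "is_home_team": True},
    {"game_id": 2012020001, "team": "PIT", "is_home_team": False},
    {"game_id": 2012020002, "team": "XXX", "is_home_team": True},  # made-up incomplete game
]

print(games_without_two_teams(sample))  # {2012020002: 1}
```

Running `games_without_two_teams(team_stats)` on the real list should return an empty dictionary if the scrape is complete.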

### Scrape Goalie Stats
The function below scrapes our goalie stats and stores them in a list of dictionaries. Each dictionary in the list represents the stats for 1 goalie in 1 game. We will again store the information in a pickle file.

In [10]:
#%% scrape goalie stats
def pull_goalie_stats(ids: List[int]) -> List[dict]:
    """
    pulls all goalie stats for the provided game ids
    ...
    Parameters
    ----------
    ids: List[int]
        list of game ids to pull goalie stats for
    Returns
    -------
    goalie_stats_list: List[dict]
        list of goalie stats dictionaries; each entry
        represents 1 game played by 1 goalie
    """

    goalie_stats_list=[]
    for i in ids:
        goalies_i = scrape_goalie_stats(i)
        goalie_stats_list += goalies_i

        # Progress bar. TODO: account for games with more goalie entries than teams.
        if len(goalie_stats_list) % 250 == 0:
            print(str(0.5 * len(goalie_stats_list) /
                      len(ids) * 100) +
                  ' percent done retrieving goalie data.')

    return goalie_stats_list
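The TODO in the function above exists because the progress message is derived from the number of rows collected, and a game does not always contribute exactly two goalie rows. One possible fix, sketched here with a hypothetical helper `pull_with_progress`, is to count games processed instead. The scraper is passed in as an argument, so the demo can use a stand-in rather than `scrape_goalie_stats`.

```python
from typing import Callable, List


def pull_with_progress(ids: List[int],
                       scrape_fn: Callable[[int], List[dict]],
                       every: int = 250) -> List[dict]:
    """Collect rows for every id, reporting progress by games processed."""
    rows: List[dict] = []
    for done, game_id in enumerate(ids, start=1):
        rows += scrape_fn(game_id)
        # Progress depends only on how many ids were handled, not on row counts.
        if done % every == 0 or done == len(ids):
            print(f"{done / len(ids) * 100:.1f} percent done ({done} of {len(ids)} games).")
    return rows


def fake_scrape(game_id: int) -> List[dict]:
    """Stand-in scraper: even ids return three goalie rows, odd ids return two."""
    n_rows = 3 if game_id % 2 == 0 else 2
    return [{"game_id": game_id, "goalie_id": k} for k in range(n_rows)]


demo_rows = pull_with_progress(list(range(1, 11)), fake_scrape, every=5)
print(len(demo_rows))  # 25
```

With this approach the messages arrive at regular intervals and the last one always reads 100 percent, regardless of how many rows each game returns.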

Scrape the stats. Again this will take a while to run.

In [15]:
#%% scrape goalie stats
goalie_stats = pull_goalie_stats(game_ids)
with open('/Users/patrickpetanca/projects/nhl_analysis/data/goalie_stats.pkl', 'wb') as f:
    pickle.dump(goalie_stats, f)

3.6967665615141954 percent done retrieving goalie data.
9.858044164037855 percent done retrieving goalie data.
12.322555205047319 percent done retrieving goalie data.
13.55481072555205 percent done retrieving goalie data.
16.019321766561514 percent done retrieving goalie data.
20.948343848580443 percent done retrieving goalie data.
23.412854889589905 percent done retrieving goalie data.
24.645110410094638 percent done retrieving goalie data.
27.1096214511041 percent done retrieving goalie data.
32.03864353312303 percent done retrieving goalie data.
36.967665615141954 percent done retrieving goalie data.
41.896687697160885 percent done retrieving goalie data.
43.128943217665615 percent done retrieving goalie data.
44.36119873817035 percent done retrieving goalie data.
46.82570977917981 percent done retrieving goalie data.
49.290220820189276 percent done retrieving goalie data.
52.986987381703464 percent done retrieving goalie data.
55.45149842271293 percent done retrieving goalie data.


Let's see what one entry in this data looks like.

In [17]:
print(goalie_stats[0])
print(len(goalie_stats))

{'date': datetime.datetime(2013, 1, 19, 20, 0), 'game_id': 2012020001, 'team': 'PIT', 'goalie_name': 'Marc-Andre Fleury', 'goalie_id': 8470594, 'is_home_team': False, 'timeOnIce': '60:00', 'assists': 0, 'goals': 0, 'pim': 0, 'shots': 27, 'saves': 26, 'powerPlaySaves': 11, 'shortHandedSaves': 0, 'evenSaves': 15, 'shortHandedShotsAgainst': 0, 'evenShotsAgainst': 16, 'powerPlayShotsAgainst': 11, 'decision': 'W', 'savePercentage': 96.29629629629629, 'powerPlaySavePercentage': 100.0, 'evenStrengthSavePercentage': 93.75}
21820
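There are 21820 goalie entries, more than the 20288 team entries. A likely explanation is that a team sometimes uses more than one goalie in a game. The sketch below counts how often that happens; the helper name is hypothetical and the sample rows are made up for illustration.

```python
from collections import Counter
from typing import List


def team_games_with_multiple_goalies(rows: List[dict]) -> int:
    """Count (game, team) pairs that have more than one goalie entry."""
    counts = Counter((row["game_id"], row["team"]) for row in rows)
    return sum(1 for n in counts.values() if n > 1)


# Illustrative sample: the home side of the made-up second game uses two goalies
sample_goalies = [
    {"game_id": 2012020001, "team": "PIT", "goalie_id": 8470594},
    {"game_id": 2012020001, "team": "PHI", "goalie_id": 8468524},
    {"game_id": 2012020002, "team": "XXX", "goalie_id": 1},
    {"game_id": 2012020002, "team": "XXX", "goalie_id": 2},
    {"game_id": 2012020002, "team": "YYY", "goalie_id": 3},
]

print(team_games_with_multiple_goalies(sample_goalies))  # 1
```

Calling `team_games_with_multiple_goalies(goalie_stats)` on the real list would show how much of the gap between 21820 and 20288 comes from in-game goalie changes.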


### Scrape Games Info
Finally, we will scrape game information. This is the starting point for the data we will use to train our machine learning model: each row represents 1 game, and we will later append team and goalie stats as extra columns and use the resulting dataframe for training.
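As a rough picture of what appending stats to game rows can look like, here is a minimal pandas sketch. The game info columns (`home_team`, `away_team`) are hypothetical, the away team's shot count is illustrative, and this is not necessarily how the later steps build the training dataframe.

```python
import pandas as pd

# Hypothetical game info rows; the real columns come from scrape_game_info
games = pd.DataFrame([
    {"game_id": 2012020001, "home_team": "PHI", "away_team": "PIT"},
])

# Team stats shaped like the scraped dictionaries (away shots value is illustrative)
team_stats_df = pd.DataFrame([
    {"game_id": 2012020001, "team": "PHI", "is_home_team": True, "shots": 27},
    {"game_id": 2012020001, "team": "PIT", "is_home_team": False, "shots": 30},
])


def side_stats(df: pd.DataFrame, home: bool) -> pd.DataFrame:
    """Select one side's rows and prefix its stat columns with home_ or away_."""
    prefix = "home_" if home else "away_"
    side = df[df["is_home_team"] == home].drop(columns=["team", "is_home_team"])
    return side.set_index("game_id").add_prefix(prefix).reset_index()


# One row per game, with home and away stats joined on game_id
training = (
    games
    .merge(side_stats(team_stats_df, home=True), on="game_id", how="left")
    .merge(side_stats(team_stats_df, home=False), on="game_id", how="left")
)
print(training)
```

Joining on `game_id` with `how="left"` keeps one row per game even if stats are missing for one side, which makes gaps easy to spot as NaN values.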

In [22]:
#%% scrape game info
def pull_game_info(ids: List[int]) -> List[dict]:
    """
    pulls all game_info for the provided game ids
    ...
    Parameters
    ----------
    ids: List[int]
        list of game ids to pull game info for
    Returns
    -------
    games_info: List[dict]
        list of game info dictionaries, one per game
    """

    # retrieve game by game info for every game in the game_ids list
    games_info = []

    for i in ids:
        game_i = scrape_game_info(i)
        games_info.append(game_i)

        if len(games_info) % 500 == 0:  # Progress bar
            print(str(len(games_info) /
                      len(ids) * 100) +
                  ' percent done retrieving game data/stats.')
    return games_info

In [23]:
#%% scrape game info
game_info = pull_game_info(game_ids)
with open('/Users/patrickpetanca/projects/nhl_analysis/data/games_info.pkl', 'wb') as f:
    pickle.dump(game_info, f)

4.929022082018927 percent done retrieving game data/stats.
9.858044164037855 percent done retrieving game data/stats.
14.787066246056781 percent done retrieving game data/stats.
19.71608832807571 percent done retrieving game data/stats.
24.645110410094638 percent done retrieving game data/stats.
29.574132492113563 percent done retrieving game data/stats.
34.503154574132495 percent done retrieving game data/stats.


KeyboardInterrupt: 