# Modeling NHL Game Outcomes

Building binary classification models to predict the winning team

## Overview

Topline goal: predict the winner of a given NHL game

The first purpose of this model is to see if the winner of an NHL game can be predicted with more accuracy than with a naive prediction - choosing the home team to win every time with no other informative input.
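
As a minimal sketch of that naive benchmark, the snippet below computes the accuracy of always picking the home team from a list of 0/1 outcomes. The function name and the toy outcomes are illustrative assumptions, not values taken from the NHL data used later.

```python
def home_team_baseline_accuracy(home_team_won):
    """Accuracy of always predicting a home win, given 0/1 outcomes (1 = home win)."""
    outcomes = list(home_team_won)
    if not outcomes:
        raise ValueError("need at least one game")
    return sum(outcomes) / len(outcomes)


# Toy outcomes for illustration only; not real NHL results.
toy_outcomes = [1, 0, 1, 1, 0, 1, 0, 1]
baseline = home_team_baseline_accuracy(toy_outcomes)
print(f"home-team baseline accuracy: {baseline:.3f}")
```

Any candidate model would need to beat this number on the same set of games for the comparison to be meaningful.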

In this notebook, several modeling techniques are trained and tested using team game log data from the prior three full NHL regular seasons (2019-20, 2020-21, 2021-22). Games played to date in the current season (2022-23) are then used to evaluate the model's ability to correctly predict the winning team.
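
One way to express that train/evaluate split is to hold out the most recent season by its id rather than splitting rows at random. The sketch below assumes a `season_id` column holding strings such as `20222023`; the helper name and the toy frame are hypothetical.

```python
import pandas as pd


def split_by_season(df, holdout_season, season_col="season_id"):
    """Return (train, holdout): holdout is one season, train is every earlier season."""
    seasons = df[season_col].astype(str)
    # Equal-length ids such as '20212022' compare correctly as strings.
    train = df[seasons < str(holdout_season)]
    holdout = df[seasons == str(holdout_season)]
    return train, holdout


# Toy rows for illustration only.
toy = pd.DataFrame({
    "season_id": ["20192020", "20202021", "20212022", "20222023", "20222023"],
    "Home_Team_Won": [1, 0, 1, 1, 0],
})
train_df, holdout_df = split_by_season(toy, "20222023")
print(len(train_df), len(holdout_df))
```

Splitting on the season keeps every evaluation game strictly later than every training game, which a random `train_test_split` would not guarantee.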

If I can develop a game prediction model that performs above that naive baseline, the next step will be to raise the benchmark to a semi-naive prediction: choosing the 'Vegas' favorite to win every time. The recent wave of mobile sports betting legalization has led to a massive increase in sports wagers placed both in the US and globally. With that increased popularity, sportsbook odds behave like an increasingly efficient market, i.e. the implied probabilities of sportsbook odds are close to an unbiased estimator of outcomes.
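
To make "implied probability" concrete, the sketch below converts American moneyline odds to probabilities and then rescales the two sides so they sum to one. The proportional rescaling is only one of several ways to strip the bookmaker margin and is an assumption here, as are the function names and the example prices.

```python
def implied_probability(american_odds):
    """Raw implied win probability from American moneyline odds (margin still included)."""
    if american_odds == 0:
        raise ValueError("American odds cannot be 0")
    if american_odds < 0:
        return -american_odds / (-american_odds + 100)
    return 100 / (american_odds + 100)


def remove_vig(prob_a, prob_b):
    """Rescale two raw implied probabilities proportionally so they sum to 1."""
    total = prob_a + prob_b
    return prob_a / total, prob_b / total


# Illustrative prices, not real market odds.
fav_raw = implied_probability(-150)
dog_raw = implied_probability(130)
fav_fair, dog_fair = remove_vig(fav_raw, dog_raw)
print(round(fav_raw + dog_raw, 4), round(fav_fair, 4), round(dog_fair, 4))
```

The raw probabilities sum to more than one; that excess is the sportsbook's margin.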

If I can develop a model that is competitive with the betting market, the next step will be to test betting strategies that leverage the prediction model to yield positive ROI. Given the 'vig' (i.e. commission) charged by sportsbooks to take wagers, simply beating the market will not be enough to produce a profitable strategy. Instead, the model will have to outperform the market by roughly 5% or more to approach profitability.
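
The effect of the vig can be illustrated with a break-even calculation. The sketch below assumes flat one-unit bets at American odds; the `-110` price is a commonly quoted example rather than a figure from this notebook, and the function names are hypothetical.

```python
def profit_per_unit(american_odds):
    """Profit on a winning one-unit stake at the given American odds."""
    if american_odds == 0:
        raise ValueError("American odds cannot be 0")
    return 100 / -american_odds if american_odds < 0 else american_odds / 100


def break_even_win_rate(american_odds):
    """Win rate at which a flat bet at these odds has zero expected profit."""
    return 1 / (1 + profit_per_unit(american_odds))


def expected_roi(win_prob, american_odds):
    """Expected profit per unit staked for a bettor who wins with probability win_prob."""
    return win_prob * profit_per_unit(american_odds) - (1 - win_prob)


print(f"break-even at -110: {break_even_win_rate(-110):.4f}")
print(f"ROI of a 50% bettor at -110: {expected_roi(0.5, -110):.4f}")
```

At these example odds a coin-flip bettor loses a few percent per wager, which is the gap a model has to close before any strategy can be profitable.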

The excess return required to be profitable in sports betting, together with the generally high log loss scores of betting models, necessitates applying betting strategies to the model's predictions. Rather than betting equal amounts on each event, I will test strategies that aim to identify and capitalize on over- or undervalued sportsbook odds.
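
One widely used staking rule that sizes bets by perceived edge is the Kelly criterion. The notebook does not commit to a specific strategy at this point, so the sketch below is only an example of the kind of rule that could be tested; the function name and inputs are assumptions.

```python
def kelly_fraction(win_prob, decimal_odds):
    """Fraction of bankroll to stake under the Kelly criterion; 0 when there is no edge."""
    net_odds = decimal_odds - 1  # profit per unit staked on a win
    if net_odds <= 0:
        raise ValueError("decimal odds must be greater than 1")
    fraction = (win_prob * net_odds - (1 - win_prob)) / net_odds
    return max(0.0, fraction)


# Model thinks 55% at even money (decimal odds 2.0): a positive edge.
print(kelly_fraction(0.55, 2.0))
# Model thinks 45% at even money: no edge, so no bet.
print(kelly_fraction(0.45, 2.0))
```

Because the stake scales with the gap between the model's probability and the price, a rule like this only bets where the model believes the odds are mispriced.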

In [1]:
# Standard Packages
import pandas as pd
# from pandas.testing import assert_frame_equal
import numpy as np
import requests
import re
import time
import os
import warnings

# Scraping packages (requests is already imported above)
import json
from bs4 import BeautifulSoup
import hockey_scraper

# Viz Packages
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline 

# Modeling Packages
## Modeling Prep
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate, KFold, \
GridSearchCV, RandomizedSearchCV

## SKLearn Data Prep Modules
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder, \
PolynomialFeatures, PowerTransformer, Normalizer, MaxAbsScaler

from sklearn.impute import SimpleImputer

## SKLearn Classification Models
from sklearn.linear_model import LogisticRegression, Ridge, Lasso, ElasticNet
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier,\
ExtraTreesClassifier, VotingClassifier, StackingRegressor, GradientBoostingClassifier

## SKLearn Pipeline Setup
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

## SKLearn Model Optimization
from sklearn.feature_selection import RFE, f_regression

## Boosting
# from xgboost import XGBClassifier

## SKLearn Metrics
### Classification Scoring/Evaluation
from sklearn.metrics import classification_report, accuracy_score, recall_score, precision_score, f1_score, \
ConfusionMatrixDisplay, log_loss, confusion_matrix, RocCurveDisplay, make_scorer, roc_auc_score

# ## TensorFlow
# #for the Neural Network
# from tensorflow.keras.models import Sequential
# from tensorflow.keras.layers import Dense, Dropout
# from tensorflow.keras.regularizers import l2
# from tensorflow.keras.optimizers import SGD
# from tensorflow.keras.wrappers import scikit_learn
# from tensorflow.keras.callbacks import EarlyStopping
# from keras.constraints import maxnorm

In [2]:
# Notebook Config
from pprintpp import pprint as pp
%reload_ext pprintpp
from tqdm import tqdm
from io import StringIO

## Suppress Python Warnings (Future, Deprecation)
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning)

## Suppress Pandas Warnings (SettingWithCopy)
pd.options.mode.chained_assignment = None

## Pandas Display Config
pd.options.display.max_columns = 80
# pd.options.display.width = None

## Display SKLearn estimators as diagrams
from sklearn import set_config
set_config(display= 'diagram')

## Progress bars
from tqdm.notebook import tqdm
pbar = tqdm(..., display=False) # disable by default
# display(pbar.container) ## display progress bar in desired cell

## EDA

### Data Retrieval

Data sources/notes:
- Game log team statistics are retrieved from Natural Stat Trick (NST)
- Game log results data is retrieved from the NHL's API
- Only games from the regular season are considered
    - Ensures all teams are sampled equally
    - Avoids potential issues stemming from structural differences in regular and postseason games
- Data is available/retrieved from the perspective of both the home and away teams
- Metrics describing offensive, defensive and goaltending performance are used to model outcomes
- For each team, data from the three common game strength states is used
    - 5v5 adjusted
    - 5v4 powerplay (PP)
    - 4v5 penalty kill (PK)
- Team stats are in 'rates' form rather than standard total counts

Rates definitions per NST glossary:
- 5v5 Score & Venue Adjusted 
    - Play where both teams have five skaters and a goalie on the ice, with the event counts adjusted for home ice advantage and leading or trailing score effects. This is done using the method created by Micah Blake McCurdy.
- 5 on 4 PP 
    - Play where the team has five skaters and a goalie on the ice versus four skaters and a goalie for their opponent due to penalties.
- 4 on 5 PK 
    - Play where the team has four skaters and a goalie on the ice versus five skaters and a goalie for their opponent due to penalties.
- Rates
    - TOI is presented as TOI/GP. For and against statistics are presented as the counts per 60 minutes of play.
  
Processing notes:
- Static data from prior seasons is used for initial development
- Once the model is built, I will aim to transition to a dynamic/automated ETL process

### Data Constants

In [3]:
# team abbreviation dict from hockey scraper package function
# "_descrition": "# All the corresponding tri-codes for team names",

team_dict = {
    'Anaheim Ducks': 'ANA',
    'Arizona Coyotes' : 'ARI',
    'Boston Bruins': 'BOS', 
    'Buffalo Sabres':'BUF',
    'Calgary Flames': 'CGY', 
    'Carolina Hurricanes': 'CAR', 
    'Chicago Blackhawks': 'CHI', 
    'Colorado Avalanche': 'COL',
    'Columbus Blue Jackets': 'CBJ',
    'Dallas Stars': 'DAL',
    'Detroit Red Wings': 'DET',
    'Edmonton Oilers': 'EDM',
    'Florida Panthers': 'FLA',
    'Los Angeles Kings': 'L.A',
    'Minnesota Wild': 'MIN',
    'Montreal Canadiens': 'MTL',
    'Nashville Predators': 'NSH',
    'New Jersey Devils': 'N.J',
    "New York Islanders": 'NYI',
    "New York Rangers": 'NYR',
    'Ottawa Senators': 'OTT',
    'Philadelphia Flyers': 'PHI',
    'Pittsburgh Penguins': 'PIT',
    'San Jose Sharks': 'S.J',
    'Seattle Kraken': 'SEA',
#     'St. Louis Blues': 'STL',
    'St Louis Blues': 'STL',
    'Tampa Bay Lightning': 'T.B',
    'Toronto Maple Leafs': 'TOR',
    'Vancouver Canucks': 'VAN',
    'Vegas Golden Knights':'VGK',
    'Washington Capitals': 'WSH',
    'Winnipeg Jets': 'WPG'
}

### Game States

Game strength state definitions per NST glossary:
- Even Strength (5v5) Score & Venue Adjusted 
    - Play where both teams have five skaters and a goalie on the ice, with the event counts adjusted for home ice advantage and leading or trailing score effects. This is done using the method created by Micah Blake McCurdy.
- 5 on 4 PP 
    - Play where the team has five skaters and a goalie on the ice versus four skaters and a goalie for their opponent due to penalties.
- 4 on 5 PK 
    - Play where the team has four skaters and a goalie on the ice versus five skaters and a goalie for their opponent due to penalties.

In [4]:
# game state vars (strength-state codes)
ev = 'sva'
pp = 'pp'  # NOTE: rebinds `pp`, the pprint alias imported in the config cell above
pk = 'pk'

### Team Statistics

Feature definitions per NST glossary:
- TOI:  
    - Total amount of time played
- GF/60:
    - Rate of Goals for that team per 60 minutes of play. GF*60/TOI
- xGF/60:
    - Expected version of above
- GA/60:
    - Rate of Goals against that team per 60 minutes of play. GA*60/TOI
- xGA/60:
    - Expected version of above
- FF%: 
    - Percentage of total Fenwick in games that team played that are for that team. FF*100/(FF+FA)
- GF%:
    - Percentage of total Goals in games that team played that are for that team. GF*100/(GF+GA)
- xGF%:
    - Expected version of above. A metric designed to measure the probability of a shot resulting in a goal. Modeled on shots league wide.
- HDCF%:
    - Percentage of total High Danger Scoring Chances in games that team played that are for that team.
- HDSH%:
    - Percentage of High Danger Shots for that team that were Goals. HDGF*100/HDSF
- HDSV%:
    - Percentage of High Danger Shots against that team that were not Goals. 100-(HDGA*100/HDSA)
- SH%:
    - Percentage of Shots for that team that were Goals. GF*100/SF
- SV%: 
    - Percentage of Shots against that team that were not Goals. 100-(GA*100/SA)
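
The formulas in the glossary above translate directly into code. The sketch below implements them as small helpers on made-up counts; the function names are hypothetical and the numbers are not taken from NST.

```python
def per_60(count, toi_minutes):
    """Rate per 60 minutes of play, e.g. GF/60 = GF*60/TOI."""
    return count * 60 / toi_minutes


def share_pct(events_for, events_against):
    """Share of events belonging to the team, e.g. FF% = FF*100/(FF+FA)."""
    return events_for * 100 / (events_for + events_against)


def shooting_pct(goals_for, shots_for):
    """SH% = GF*100/SF."""
    return goals_for * 100 / shots_for


def save_pct(goals_against, shots_against):
    """SV% = 100 - (GA*100/SA)."""
    return 100 - goals_against * 100 / shots_against


# Made-up single-game counts for illustration.
print(per_60(2, 48))        # goals per 60 from 2 goals in 48 minutes
print(share_pct(30, 20))    # share of unblocked shot attempts
print(shooting_pct(3, 30))
print(save_pct(2, 25))
```

These helpers raise `ZeroDivisionError` when a denominator is zero (for example, no shots against), which corresponds to the games where the source tables show `-` instead of a number.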

In [5]:
# Even strength 5v5 strength state field list
ev_features_list = [
    'Team',
    'FF%',
    'GF%',
    'xGF%',
    'HDCF%',
    'HDSH%',
    'HDSV%',
    'SH%',
    'SV%',
    'Date', # extracted from 'Game' and used to make 'Game_Key'
    'Game_Key',] # descriptor to be dropped after processing
#     'Game', # descriptor to be dropped after processing
#     'Team', # descriptor to be dropped after processing
#    'TOI', ## don't think i need this

In [6]:
# PP df features list
pp_features_list = [
    'GF/60', 
    'xGF/60', 
    'Game_Key']

In [7]:
# PK df features list
# same as pp df but swap GFs for GAs
pk_features_list = [
    'GA/60', 
    'xGA/60', 
    'Game_Key']
#     'TOI', 

In [8]:
# # Dynamic data loading from the NST game log page
# def get_season_stats_df(season_start_id, season_end_id, strength_state):
#     # season ids are the concatenated season start and end years, e.g. 20222023
#     # strength states defined above - ev, pp, pk
#     url = ('https://www.naturalstattrick.com/games.php?'
#            'fromseason={}&thruseason={}&stype=2&sit={}&loc=B&team=All&rate=y'
#            ).format(season_start_id, season_end_id, strength_state)

#     response = requests.get(url)
#     soup = BeautifulSoup(response.text, 'html.parser')

#     table = soup.find_all("table")[0]
#     df = pd.read_html(StringIO(str(table)))[0]
#     return df

In [9]:
# # get season stats for 18-19', 19-20', 20-21, 21-22', 22-23'
# evget_season_stats_df(20182019)
# # 5v5 stats
# ev_1819 = get_season_stats_df(20182019, )
# ev_1920 = pd.DataFrame(scrape_schedule('2019-10-02', '2020-03-11', preseason=False, not_over=False))
# ev_2021 = pd.DataFrame(scrape_schedule('2021-01-13', '2021-05-19', preseason=False, not_over=False))
# ev_2122 = pd.DataFrame(scrape_schedule('2021-10-12', '2022-05-01', preseason=False, not_over=False))
# ev_2223 = pd.DataFrame(scrape_schedule('2022-10-07', '2023-04-14', preseason=False, not_over=False))

# # pp stats



# # pk stats

In [10]:
# stats1920_2223 = get_season_stats_df(20192020, 20222023)

### Data Cleaning/Preprocessing

In [11]:
# 1st round cleaning of team stat sets
def process_stats(game_log_df):
    # extract date from game column
    game_log_df['Date'] = pd.to_datetime(game_log_df['Game'].str[:11])
    # sub full team names with abbreviated names
    game_log_df.replace(team_dict, inplace=True)
    # create game log index
    game_log_df['Game_Key'] = game_log_df['Team'].astype(str)+'_'+game_log_df['Date'].astype(str)
    game_log_df = game_log_df.replace('-', 0)
    return game_log_df

In [12]:
ev_df = pd.read_csv('data/19_23-5v5.csv')
pp_df = pd.read_csv('data/19_23-5v4.csv')
pk_df = pd.read_csv('data/19_23-4v5.csv')

print(ev_df.shape, pp_df.shape, pk_df.shape)

In [13]:
ev_df

In [14]:
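# process stats (keep the returned frames so the '-' replacement inside process_stats is not lost)
ev_df = process_stats(ev_df)
pp_df = process_stats(pp_df)
pk_df = process_stats(pk_df)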

In [15]:
ev_df = ev_df.replace('-', 0)
pp_df = pp_df.replace('-', 0)
pk_df = pk_df.replace('-', 0)

In [16]:
ev_df.info()

In [17]:
ev_df = ev_df[ev_features_list]
pp_df = pp_df[pp_features_list]
pk_df = pk_df[pk_features_list]

In [18]:
# merge team stats (5v5_adj, 5v4_pp, 4v5_pk game states) for each period with the feature column list defined above
def merge_strength_states(ev_df, pp_df, pk_df):
    # suffix the pp_df and pk_df columns (ev_df columns keep their names); 'Game_Key' is the join key
    pp_df = pp_df.rename(columns={col: col + '_pp' for col in pp_df.columns if col != 'Game_Key'})
    pk_df = pk_df.rename(columns={col: col + '_pk' for col in pk_df.columns if col != 'Game_Key'})

    # left merge the 5v5 and 5v4 dfs on 'Game_Key'
    even_pp_merged = pd.merge(ev_df, pp_df, on='Game_Key', how='left')

    # left merge that df with the 4v5 pk df on 'Game_Key'
    all_states_merged = pd.merge(even_pp_merged, pk_df, on='Game_Key', how='left')

    return all_states_merged
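
The merge above can be illustrated on a toy pair of frames. This is a sketch under assumptions: the values and the `add_state_suffix` helper are invented for illustration, and `validate="one_to_one"` is an extra safeguard that the original function does not use.

```python
import pandas as pd

# Toy frames keyed the same way as the notebook's 'Game_Key' (Team_Date); values are invented.
ev = pd.DataFrame({"Game_Key": ["BOS_2022-10-12", "TOR_2022-10-12"],
                   "xGF%": [55.1, 48.3]})
pp = pd.DataFrame({"Game_Key": ["BOS_2022-10-12"],
                   "GF/60": [7.5]})


def add_state_suffix(df, suffix, key="Game_Key"):
    """Append _<suffix> to every column except the join key."""
    return df.rename(columns={c: f"{c}_{suffix}" for c in df.columns if c != key})


# validate raises if a Game_Key appears more than once on either side.
merged = ev.merge(add_state_suffix(pp, "pp"), on="Game_Key",
                  how="left", validate="one_to_one")
print(merged)
```

With a left merge, a game that has no power-play row keeps its 5v5 stats and gets `NaN` in the `_pp` columns, so missing values after the merge are worth checking before modeling.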


In [19]:
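def convert_to_float(df):
    # descriptor columns stay as-is; every other column is parsed as numeric
    # errors='coerce' turns unparseable values into NaN (errors='ignore' is deprecated in recent pandas)
    descriptor_cols = ['Game', 'Team', 'Unnamed: 2', 'Game_Key', 'Date']
    return df.apply(lambda x: x if x.name in descriptor_cols else pd.to_numeric(x, errors='coerce'))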


In [20]:
all_stats_df = merge_strength_states(ev_df, pp_df, pk_df)
all_stats_df

In [21]:
all_stats_df.info()

In [22]:
all_stats_df = convert_to_float(all_stats_df)
print(all_stats_df.dtypes)

In [23]:
def rolling_features(df, rolling_games=10):
#     df['Date'] = df['Date']
#     df['season_id'] = df['season_id']
#     df['_Team_Won'] = df['_Team_Won']
#     df['rolling_avg_pts_pct'] = df.groupby([df['Date'].dt.year, 'Team'])['_pts_pct'].rolling(window=10, min_periods=10).mean().reset_index(0, drop=True)
#     df['rolling_avg_away_pts_pct'] = df.groupby([df['Date'].dt.year, 'Team'])['away_pts_pct'].rolling(window=10, min_periods=10).mean().reset_index(0, drop=True)
    # group keys: calendar year and team (NOTE: the calendar year splits an NHL season at January 1)
    group_keys = [df['Date'].dt.year, df['Team']]
    stat_cols = ['FF%', 'GF%', 'xGF%', 'HDCF%', 'HDSH%', 'HDSV%', 'SH%', 'SV%',
                 'GF/60_pp', 'xGF/60_pp', 'GA/60_pk', 'xGA/60_pk']
    for col in stat_cols:
        # mean over the previous `rolling_games` games; shift() drops the current game so its
        # own stats never leak into its features (assumes rows are in chronological order per team)
        df[f'rolling_avg_{col}'] = df.groupby(group_keys)[col].transform(
            lambda x: x.rolling(rolling_games, rolling_games).mean().shift())
#     df['rolling_avg_away_FF%'] = df.groupby([df['Date'].dt.year, 'Team'])['away_FF%'].rolling(window=10, min_periods=10).mean().reset_index(0, drop=True)
#     df['rolling_avg_away_GF%'] = df.groupby([df['Date'].dt.year, 'Team'])['away_GF%'].rolling(window=10, min_periods=10).mean().reset_index(0, drop=True)
#     df['rolling_avg_away_xGF%'] = df.groupby([df['Date'].dt.year, 'Team'])['away_xGF%'].rolling(window=10, min_periods=10).mean().reset_index(0, drop=True)
#     df['rolling_avg_away_HDCF%'] = df.groupby([df['Date'].dt.year, 'Team'])['away_HDCF%'].rolling(window=10, min_periods=10).mean().reset_index(0, drop=True)
#     df['rolling_avg_away_HDSH%'] = df.groupby([df['Date'].dt.year, 'Team'])['away_HDSH%'].rolling(window=10, min_periods=10).mean().reset_index(0, drop=True)
#     df['rolling_avg_away_HDSV%'] = df.groupby([df['Date'].dt.year, 'Team'])['away_HDSV%'].rolling(window=10, min_periods=10).mean().reset_index(0, drop=True)
#     df['rolling_avg_away_SH%'] = df.groupby([df['Date'].dt.year, 'Team'])['away_SH%'].rolling(window=10, min_periods=10).mean().reset_index(0, drop=True)
#     df['rolling_avg_away_SV%'] = df.groupby([df['Date'].dt.year, 'Team'])['away_SV%'].rolling(window=10, min_periods=10).mean().reset_index(0, drop=True)
#     df['rolling_avg_away_GF/60_pp'] = df.groupby([df['Date'].dt.year, 'Team'])['away_GF/60_pp'].rolling(window=10, min_periods=10).mean().reset_index(0, drop=True)
#     df['rolling_avg_away_xGF/60_pp'] = df.groupby([df['Date'].dt.year, 'Team'])['away_xGF/60_pp'].rolling(window=10, min_periods=10).mean().reset_index(0, drop=True)
#     df['rolling_avg_away_GA/60_pk'] = df.groupby([df['Date'].dt.year, 'Team'])['away_GA/60_pk'].rolling(window=10, min_periods=10).mean().reset_index(0, drop=True)
#     df['rolling_avg_away_xGA/60_pk'] = df.groupby([df['Date'].dt.year, 'Team'])['away_xGA/60_pk'].rolling(window=10, min_periods=10).mean().reset_index(0, drop=True)
    return df

In [24]:
all_stats_df = rolling_features(all_stats_df)
all_stats_df.head(20)
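
Because an NHL season spans two calendar years, a rolling window can also be grouped on a season key instead of `Date.dt.year`. The sketch below is one possible variant, not the notebook's method: it assumes seasons start in or after July, and the helper names and toy values are hypothetical.

```python
import pandas as pd


def season_start_year(dates):
    """Map game dates to the starting year of their season (assumes a July or later start)."""
    return dates.dt.year.where(dates.dt.month >= 7, dates.dt.year - 1)


def add_rolling_mean(df, col, window=10):
    """Mean of the previous `window` games per team and season, excluding the current game."""
    df = df.sort_values(["Team", "Date"]).copy()
    season = season_start_year(df["Date"]).rename("season_start")
    df[f"rolling_avg_{col}"] = (
        df.groupby(["Team", season])[col]
          .transform(lambda s: s.rolling(window, min_periods=window).mean().shift())
    )
    return df


# Toy games straddling New Year; values are invented.
toy = pd.DataFrame({
    "Team": ["BOS"] * 4,
    "Date": pd.to_datetime(["2022-12-30", "2023-01-02", "2023-01-04", "2023-01-06"]),
    "FF%": [50.0, 60.0, 40.0, 70.0],
})
result = add_rolling_mean(toy, "FF%", window=2)
print(result[["Date", "FF%", "rolling_avg_FF%"]])
```

In this toy example the December game still contributes to the January windows, which would not happen if the groups were split on the calendar year.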

In [25]:
print(all_stats_df.columns)

In [26]:
all_stats_features = ['Team','Date', 'Game_Key','rolling_avg_FF%', 'rolling_avg_GF%', 'rolling_avg_xGF%',
       'rolling_avg_HDCF%', 'rolling_avg_HDSH%', 'rolling_avg_HDSV%',
       'rolling_avg_SH%', 'rolling_avg_SV%', 'rolling_avg_GF/60_pp',
       'rolling_avg_xGF/60_pp', 'rolling_avg_GA/60_pk',
       'rolling_avg_xGA/60_pk']

In [27]:
all_stats_df.isna().sum()

In [28]:
all_stats_df = all_stats_df[all_stats_features]
all_stats_df = all_stats_df.dropna()
all_stats_df

In [29]:
# def column_dropper(ev_df, ev_features_list, pp_df, pp_features_list, pk_df, pk_features_list):
#     ev_df = ev_df[ev_features_list]
#     pp_df = pp_df[pp_features_list]
#     pk_df = pk_df[pk_features_list]
#     ev_df = ev_df.replace('-', 0)
#     pp_df = pp_df.replace('-', 0)
#     pk_df = pk_df.replace('-', 0)
#     return ev_df, pp_df, pk_df

### Add NHL.com schedule/result df


In [30]:
# functions from the Hockey Scraper API
# modified to retrieve additional info 
"""
This module contains functions to scrape the json schedule for any games or date range
"""

from datetime import datetime, timedelta
import json
import time
import hockey_scraper.utils.shared as shared


# TODO: Currently rescraping page each time since the status of some games may have changed
# (e.g. Scraped on 2020-01-20 and game on 2020-01-21 was not Final...when use old page again will still think not Final)
# Need to find a more elegant way of doing this (Metadata???)
def get_schedule(date_from, date_to):
    """
    Scrapes games in date range
    Ex: https://statsapi.web.nhl.com/api/v1/schedule?startDate=2010-10-03&endDate=2011-06-20
    
    :param date_from: scrape from this date
    :param date_to: scrape until this date
    
    :return: raw json of schedule of date range
    """
    page_info = {
        "url": 'https://statsapi.web.nhl.com/api/v1/schedule?startDate={a}&endDate={b}'.format(a=date_from, b=date_to),
        "name": date_from + "_" + date_to,
        "type": "json_schedule",
        "season": shared.get_season(date_from),
    }

    return json.loads(shared.get_file(page_info, force=True))


def chunk_schedule_calls(from_date, to_date):
    """
    The schedule endpoint sucks when handling a big date range. So instead I call in increments of n days.
    
    :param from_date: scrape from this date
    :param to_date: scrape until this date

    :return: raw json of schedule of date range
    """
    sched = []
    days_per_call = 30

    from_date = datetime.strptime(from_date, "%Y-%m-%d") 
    to_date = datetime.strptime(to_date, "%Y-%m-%d")
    num_days = (to_date - from_date).days + 1  # +1 since difference is looking for total number of days

    for offset in range(0, num_days, days_per_call):
        f_chunk = datetime.strftime(from_date + timedelta(days=offset), "%Y-%m-%d")

        # We need the min bec. if the chunks are evenly sized this prevents us from overshooting the max
        t_chunk = datetime.strftime(from_date + timedelta(days=min(num_days-1, offset+days_per_call-1)), "%Y-%m-%d")

        chunk_sched = get_schedule(f_chunk, t_chunk)
        sched.append(chunk_sched['dates'])

    return sched


def get_dates(games):
    """
    Given a list of game_ids, it returns the dates for each game.

    We sort all the games and retrieve the schedule from the beginning of the season from the earliest game
    until the end of most recent season.
    
    :param games: list with game_id's ex: 2016020001
    
    :return: list with game_id and corresponding date for all games
    """
    today = datetime.today()

    # Determine oldest and newest game
    games = list(map(str, games))
    games.sort()

    date_from = shared.season_start_bound(games[0][:4])
    year_to = int(games[-1][:4])

    # If the last game is part of the ongoing season then only request the schedule until Today
    # We get strange errors if we don't do it like this
    if year_to == shared.get_season(datetime.strftime(today, "%Y-%m-%d")):
        date_to = '-'.join([str(today.year), str(today.month), str(today.day)])
    else:
        date_to = datetime.strftime(shared.season_end_bound(year_to+1), "%Y-%m-%d")  # Newest game in sample

    # TODO: Assume true is live here -> Workaround
    schedule = scrape_schedule(date_from, date_to, preseason=True, not_over=True)

    # Only return games we want in range
    games_list = []
    for game in schedule:
        if str(game['game_id']) in games:
            games_list.extend([game])
    return games_list


def scrape_schedule(date_from, date_to, preseason=False, not_over=False):
    """
    Calls getSchedule and scrapes the raw schedule Json
    
    :param date_from: scrape from this date
    :param date_to: scrape until this date
    :param preseason: Boolean indicating whether to include preseason games (default is False)
    :param not_over: Boolean indicating whether we scrape games not finished. 
                     Means we relax the requirement of checking if the game is over. 
    
    :return: list with all the game id's
    """
    schedule = []
    schedule_json = chunk_schedule_calls(date_from, date_to)

    for chunk in schedule_json:
        for day in chunk:
            for game in day['games']:
                if game['status']['detailedState'] == 'Final' or not_over:
                    game_id = int(str(game['gamePk'])[5:])
                    # add game type logic to filter out non-regular-season games
                    if (game_id >= 20000 or preseason) and game_id < 40000:
                        schedule.append({
                                 "game_id": game['gamePk'],
                                "game_type": game['gameType'],
                                 "season_id": game['season'],
                                 "date": day['date'], 
                                 "home_score": game['teams']['home'].get("score"),
                                 "away_score": game['teams']['away'].get("score"),
                                #  "start_time": datetime.strptime(game['gameDate'][:-1], "%Y-%m-%dT%H:%M:%S"),
                                #  "venue": game['venue'].get('name'),
                                #  "home_team_id": game['teams']['home']['team']['id'],
                                 "home_team": shared.get_team(game['teams']['home']['team']['name']),
                                 "home_wins": game['teams']['home'].get("leagueRecord").get("wins"),
                                 "home_losses": game['teams']['home'].get("leagueRecord").get("losses"),
                                 "home_otl": game['teams']['home'].get("leagueRecord").get("ot"),
                                 "away_wins": game['teams']['away'].get("leagueRecord").get("wins"),
                                 "away_losses": game['teams']['away'].get("leagueRecord").get("losses"),
                                 "away_otl": game['teams']['away'].get("leagueRecord").get("ot"), 
                                #  "away_team_id": game['teams']['away']['team']['id'],
                                 "away_team": shared.get_team(game['teams']['away']['team']['name']),
                                #  "home_score": game['teams']['home'].get("score"),
                                #  "away_score": game['teams']['away'].get("score"),
                                 "status": game["status"]["abstractGameState"]
                        })


    return schedule

NHL regular season date boundaries by season:
- 2018-2019: October 3, 2018 to April 6, 2019
- 2019-2020: October 2, 2019 to March 11, 2020
  - Covid caused scheduling issues, so need to make sure only regular season games are factored in
- 2020-2021: January 13, 2021 to May 19, 2021
  - same potential issue noted above
- 2021-2022: October 12, 2021 to May 1, 2022
- 2022-2023: October 7, 2022 to April 14, 2023
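
Each of these ranges covers several months, which is why the scraper code requests the schedule in 30-day chunks rather than in one call. The chunking logic can be sketched on its own, without any network access; the generator below is a simplified stand-in with a hypothetical name, not the `hockey_scraper` implementation.

```python
from datetime import date, timedelta


def chunk_date_range(start, end, days_per_call=30):
    """Yield (chunk_start, chunk_end) pairs that cover start..end inclusive without overlap."""
    if end < start:
        raise ValueError("end must not be before start")
    chunk_start = start
    while chunk_start <= end:
        # Cap the chunk at `end` so the final chunk never overshoots the range.
        chunk_end = min(chunk_start + timedelta(days=days_per_call - 1), end)
        yield chunk_start, chunk_end
        chunk_start = chunk_end + timedelta(days=1)


chunks = list(chunk_date_range(date(2022, 10, 7), date(2023, 4, 14)))
for first, last in chunks:
    print(first.isoformat(), last.isoformat())
```

Each pair could then be formatted as `YYYY-MM-DD` strings and passed to a schedule request for that window.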

In [31]:
# run modified scrape_schedule for the 19-20 through 22-23 seasons individually
# will make for easier processing of imputed features
schedule_1920 = pd.DataFrame(scrape_schedule('2019-10-02', '2020-03-11', preseason=False, not_over=False))
schedule_2021 = pd.DataFrame(scrape_schedule('2021-01-13', '2021-05-19', preseason=False, not_over=False))
schedule_2122 = pd.DataFrame(scrape_schedule('2021-10-12', '2022-05-01', preseason=False, not_over=False))
schedule_2223 = pd.DataFrame(scrape_schedule('2022-10-07', '2023-04-14', preseason=False, not_over=False))
schedule_1920

Bulk processing to conduct upon load:

- Convert the following fields accordingly
  - game_id, season_id to string
  - home_otl, away_otl to int
- Filter out non regular season games
  - game_type = R
- Add column to denote winner from perspective of home team
  - Home_Team_Won = 1 denotes a home team victory, 0 a loss
- Add game keys so schedule dfs can be joined with stats dfs
- Add running total column for number of games played so far that season for all teams
- Add running standings points total column for all teams
    - Win = 2 points
    - (Regulation) Loss = 0 points
    - Overtime loss = 1 point
- Add column for points percentage
    - pts_pct = actual points accumulated / potential max points 
    - potential max points = games played * 2
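
The standings arithmetic in the list above can be checked with a small worked example. The helper names and the 10-5-3 record are illustrative assumptions.

```python
def standings_points(wins, ot_losses):
    """Win = 2 points, overtime loss = 1 point, regulation loss = 0 points."""
    return 2 * wins + ot_losses


def points_pct(wins, losses, ot_losses):
    """Points earned divided by the maximum possible (2 per game played)."""
    games_played = wins + losses + ot_losses
    if games_played == 0:
        return None  # undefined before a team has played a game
    return standings_points(wins, ot_losses) / (2 * games_played)


# A hypothetical 10-5-3 record: 23 points from 18 games.
print(standings_points(10, 3))
print(round(points_pct(10, 5, 3), 4))
```

Returning `None` for a team with no games played avoids a division by zero for records taken before a team's first game.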

In [32]:
# make sure all games are 'R' type for regular season
schedule_1920['game_type'].value_counts()
# assert (schedule_2021['game_type'] == 'R').all()
# assert (schedule_2122['game_type'] == 'R').all()
# assert (schedule_2223['game_type'] == 'R').all()

In [33]:
schedule_1920.loc[schedule_1920['game_type'] == 'WA']

In [34]:
def schedule_df_processing(schedule_df):
    # filter out non regular season games - should have 3262 for 19-20 through 21-22 seasons
    # .copy() so the column assignments below act on an independent frame, not a view
    schedule_df = schedule_df.loc[schedule_df['game_type'] == 'R'].copy()
    # convert game_id and season_id to str. make sure they aren't used numerically and in case needed for filtering
    schedule_df['game_id'] = schedule_df['game_id'].astype(str) # don't need
    schedule_df['season_id'] = schedule_df['season_id'].astype(str)
    # convert overtime loss columns to int from float as can't be decimal
    schedule_df['home_otl'] = schedule_df['home_otl'].astype(int)
    schedule_df['away_otl'] = schedule_df['away_otl'].astype(int)
    # add column for denoting game winner - denote home team victory in binary, win = 1
    schedule_df['Home_Team_Won'] = np.where(schedule_df['home_score'] > schedule_df['away_score'], 1, 0)
    # add keys for merging stats dfs
    schedule_df['Home_Team_Key'] = schedule_df['home_team'].astype(str)+'_'+schedule_df['date'].astype(str)
    schedule_df['Away_Team_Key'] = schedule_df['away_team'].astype(str)+'_'+schedule_df['date'].astype(str)
    # add column containing running total games played so far that season
    schedule_df['home_gp'] = (schedule_df['home_wins'] + schedule_df['home_losses'] + schedule_df['home_otl']).astype(int)
    schedule_df['away_gp'] = (schedule_df['away_wins'] + schedule_df['away_losses'] + schedule_df['away_otl']).astype(int)                              
    # add column containing running standing points total for teams. wins = 2, losses = 0, otl = 1
    schedule_df['home_points'] = ((schedule_df['home_wins'] * 2) + schedule_df['home_otl']).astype(int)
    schedule_df['away_points'] = ((schedule_df['away_wins'] * 2) + schedule_df['away_otl']).astype(int)
    # add column for points percentage column
    schedule_df['home_pts_pct'] = schedule_df['home_points'] / (schedule_df['home_gp'] * 2)
    schedule_df['away_pts_pct'] = schedule_df['away_points'] / (schedule_df['away_gp'] * 2)

    return schedule_df

In [35]:
# call processing function on all schedule dfs
schedule_1920 = schedule_df_processing(schedule_1920)
schedule_2021 = schedule_df_processing(schedule_2021)
schedule_2122 = schedule_df_processing(schedule_2122)
schedule_2223 = schedule_df_processing(schedule_2223)

In [36]:
# check lengths to make sure the correct number of games are returned
# 2542, 1082, 868, 1312, 1312
# print(len(schedule_1819))
print(len(schedule_1920))
print(len(schedule_2021))
print(len(schedule_2122))
print(len(schedule_2223))

In [37]:
schedule_1920_2123 = pd.concat([schedule_1920, schedule_2021, schedule_2122, schedule_2223], axis=0, ignore_index=True)
schedule_1920_2123

In [38]:
print(schedule_1920_2123.columns.to_list())

In [39]:
sched_cols = ['season_id', 'Home_Team_Won', 'Home_Team_Key', 'Away_Team_Key','home_pts_pct', 'away_pts_pct']
schedule_1920_2123 = schedule_1920_2123[sched_cols]
schedule_1920_2123

In [40]:
modeling_df = schedule_1920_2123.merge(all_stats_df.add_prefix('home_'), 
                                       left_on = 'Home_Team_Key', right_on = 'home_Game_Key', how = 'left')
modeling_df = modeling_df.merge(all_stats_df.add_prefix('away_'), 
                                       left_on = 'Away_Team_Key', right_on = 'away_Game_Key', how = 'left')
modeling_df['Date'] = modeling_df['home_Date']
modeling_df = modeling_df.drop(columns=['Home_Team_Key', 'Away_Team_Key', 'home_Date', 
                                        'home_Game_Key', 'away_Date', 'away_Game_Key'])
modeling_df

In [41]:
# def rolling_wins(df, rolling_games=10):
#     df['rolling_avg_home_pts_pct'] = df.groupby([df['Date'].dt.year, 'home_Team'])['home_pts_pct'].transform(lambda x: x.rolling(rolling_games).mean().shift())
#     df['rolling_avg_away_pts_pct'] = df.groupby([df['Date'].dt.year, 'away_Team'])['away_pts_pct'].transform(lambda x: x.rolling(rolling_games).mean().shift())
#     return df

In [42]:
# modeling_df = rolling_wins(modeling_df)
# modeling_df

In [43]:
# modeling_df['Date'] = modeling_df['home_Date']
# modeling_df = modeling_df.drop(columns=['Home_Team_Key', 'Away_Team_Key', 'home_Date', 
#                                         'home_Game_Key', 'away_Date', 'away_Game_Key'])
# modeling_df

In [44]:
# print(modeling_df.columns.to_list())

In [45]:
# modeling_df = schedule_1920_2122.merge(home_away_stats[cols_to_prefix].add_prefix('home_'), 
#                                        left_on = 'Home_Team_Key', right_on = 'home_Game_Key', how = 'left').drop(columns = 'home_Team_Key')

# modeling_df = modeling_df.merge(home_away_stats[cols_to_prefix].add_prefix('away_'), 
#                                        left_on = 'Away_Team_Key', right_on = 'away_Game_Key', how = 'left').drop(columns = 'away_Team_Key')
# modeling_df

In [53]:
modeling_df.info()

In [47]:
# def calculate_rolling_avg(df, window_size=10, min_periods=1, default_value=np.nan, shift=1):
#     # get column names to calculate rolling average for
#     cols = [col for col in df.columns if col not in ['season_id', 'Date', 'Home_Team_']]
    
#     # calculate rolling average for each column and drop original columns
#     for col in cols:
#         df[f'rolling_avg_{col}'] = df.groupby([df['Date'].dt.year, 'Home_Team_Won'])[col].rolling(window=window_size, min_periods=min_periods).mean().reset_index(0, drop=True)
#         df[f'rolling_avg_{col}'] = df[f'rolling_avg_{col}'].fillna(default_value)
#         if shift != 0:
#             df[f'rolling_avg_{col}'] = df[f'rolling_avg_{col}'].shift(shift)
    
#     return df

In [48]:
# modeling_final_df = calculate_rolling_avg(modeling_df)
# modeling_final_df

In [49]:
# modeling_final = rolling_features(modeling_df)
# modeling_final

In [50]:
# df['rolling_avg_goals'] = df.groupby(df['date'].dt.year)['goals'].
# rolling(window=rolling_window, min_periods=10).mean().reset_index(0, drop=True)

In [51]:
# df['rolling_avg_goals'] = df.groupby(df['date'].dt.year)['goals'].
# rolling(window=10, min_periods=10).mean().shift(1).reset_index(0, drop=True)

# df['rolling_avg_goals'] = df.groupby([df['date'].dt.year, 'team'])['goals'].
# rolling(window=10, min_periods=10).mean().reset_index(0, drop=True)
