# Modeling NHL Game Outcomes

Building binary classification models to predict the winning team

# Overview

Topline goal: predict the winner of a given NHL game

The first goal is to test whether a model can predict the winner of an NHL game more accurately than a naive baseline: picking the home team to win every time, with no other informative input.
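
As a rough illustration of that naive baseline, the sketch below computes the accuracy of always picking the home team. The function name and the toy outcomes are assumptions made up for this example; they are not results from this notebook.

```python
import numpy as np

def home_baseline_accuracy(home_team_won):
    """Accuracy of always predicting a home win, given 0/1 outcomes."""
    outcomes = np.asarray(home_team_won)
    # Predicting 1 for every game is correct exactly when the home team won
    return float(outcomes.mean())

# Toy outcomes (hypothetical, for illustration only)
toy_outcomes = [1, 0, 1, 1, 0, 1, 0, 1]
print(home_baseline_accuracy(toy_outcomes))  # 0.625
```

Any candidate model would need to beat this number on the same set of games to be worth keeping.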

In this notebook, several modeling techniques will be trained and tested using team game log data from the prior three full NHL regular seasons (2019-20, 2020-21, 2021-22). Games played to date in the current season (2022-23) will then be used to evaluate the model's ability to correctly predict the winning team.

If I can develop a game prediction model that performs above that naive baseline, the next step will be to raise the benchmark to a semi-naive prediction. The semi-naive baseline will be based on choosing the 'Vegas' favorite to win every time. The recent wave of mobile sports betting legalization has led to a massive increase in sports wagers placed both in the US and globally. With the increased popularity of sports betting, sportsbook odds are an increasingly efficient market, i.e. the implied probabilities of sportsbook odds are an unbiased estimator of outcomes.

If I can develop a model that is competitive with the betting market, the next step will be to test betting strategies that leverage the prediction model to yield a positive ROI. Given the 'vig' (i.e. commission) charged by sportsbooks to take wagers, simply beating the market will not be enough to produce a profitable strategy. Instead, the model will have to outperform the market by roughly 5% or more to approach profitability.
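
To make the implied probability and vig ideas concrete, here is a minimal sketch that converts American odds to implied probabilities and measures the book's margin. The helper names and the -110/-110 line are hypothetical, and proportional rescaling is only one of several ways to remove the vig.

```python
def implied_prob(american_odds):
    """Convert American odds to the sportsbook's implied win probability."""
    if american_odds < 0:
        return -american_odds / (-american_odds + 100)
    return 100 / (american_odds + 100)

def remove_vig(prob_a, prob_b):
    """Rescale two implied probabilities so they sum to 1 (proportional method)."""
    total = prob_a + prob_b
    return prob_a / total, prob_b / total

# Hypothetical two-way line: both sides priced at -110
p_home = implied_prob(-110)
p_away = implied_prob(-110)
overround = p_home + p_away - 1  # the book's built-in margin

print(round(p_home, 4))            # 0.5238
print(round(overround, 4))         # 0.0476
print(remove_vig(p_home, p_away))  # (0.5, 0.5)
```

At -110 a bettor has to win about 52.4% of wagers just to break even, which is why matching the market's accuracy is not sufficient on its own.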

The excess return required to be profitable in sports betting, together with the generally high log loss scores of betting models, necessitates applying betting strategies to the model's predictions. Rather than betting equal amounts on each event, I will test strategies that aim to identify and capitalize on over- or undervalued sportsbook odds.

In [15]:
# Standard Packages
import pandas as pd
from pandas.testing import assert_frame_equal
import numpy as np
import re
import time
import os
import warnings
import pickle

# Viz Packages
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline 

# Scraping packages
import requests
import json
from bs4 import BeautifulSoup
import hockey_scraper

# Modeling Packages
## Modeling Prep
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate, KFold, \
GridSearchCV, RandomizedSearchCV

## SKLearn Data Prep Modules
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder, \
PolynomialFeatures, PowerTransformer, Normalizer, MaxAbsScaler
from sklearn.impute import SimpleImputer

## SKLearn Classification Models
from sklearn.linear_model import LogisticRegression, Ridge, Lasso, ElasticNet
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier,\
ExtraTreesClassifier, VotingClassifier, AdaBoostClassifier, HistGradientBoostingClassifier
from sklearn.svm import SVC

## SKLearn Pipeline Setup
from sklearn.pipeline import Pipeline, make_pipeline, FeatureUnion
from sklearn.compose import ColumnTransformer

## SKLearn Model Optimization
from sklearn.feature_selection import RFE, f_regression, RFECV, SelectKBest

# ## Boosting
# from xgboost import XGBRegressor
from xgboost import XGBClassifier

## SKLearn Metrics
### Classification Scoring/Evaluation
from sklearn.metrics import classification_report, accuracy_score, recall_score, precision_score, f1_score, \
ConfusionMatrixDisplay, log_loss, confusion_matrix, RocCurveDisplay, make_scorer, roc_auc_score

In [5]:
# Notebook Config
from pprintpp import pprint as pp
%reload_ext pprintpp
from tqdm import tqdm
from io import StringIO

## Suppress Python Warnings (Future, Deprecation)
warnings.filterwarnings("ignore", category= FutureWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning)

## Suppress Pandas Warnings (SettingWithCopy)
pd.options.mode.chained_assignment = None

## Pandas Display Config
pd.options.display.max_columns = 20
# pd.options.display.width = None

## Display SKLearn estimators as diagrams
from sklearn import set_config
set_config(display= 'diagram')

## EDA

### Data Retrieval

Data notes:
- Includes offensive, defensive and goaltending stats for both the home and away teams
- For each team, data for three game-strength states is used
    - 5v5 even strength (adjusted for score states and venue)
    - 5v4 power play (man advantage)
    - 4v5 penalty kill (man down)
- Static data from the prior seasons is used for initial development
- Once the model is built, I will aim to transition to a dynamic/automated ETL process

In [40]:
# Load test/train data 
# Static game log data from prior 3 full seasons
# Team stats for both sides in even strength (adjusted), man up and man down situations

home_5v5_adj = pd.read_csv('data/filtered/filtered-19_22-home-5v5_adj.csv')
home_5v4_pp = pd.read_csv('data/filtered/filtered-19_22-home-pp_5v4.csv')
home_4v5_pk = pd.read_csv('data/filtered/filtered-19_22-home-pk_4v5.csv')

away_5v5_adj = pd.read_csv('data/filtered/filtered-19_22-away-5v5_adj.csv')
away_5v4_pp = pd.read_csv('data/filtered/filtered-19_22-away-pp_5v4.csv')
away_4v5_pk = pd.read_csv('data/filtered/filtered-19_22-away-pk_4v5.csv')

print(home_5v5_adj.shape, home_5v4_pp.shape, home_4v5_pk.shape)
print(away_5v5_adj.shape, away_5v4_pp.shape, away_4v5_pk.shape)

(3262, 33) (3205, 33) (3191, 33)
(3262, 33) (3191, 33) (3205, 33)


In [29]:
home_5v5_adj.head()

Unnamed: 0,Game,Team,Unnamed: 2,TOI,CF/60,CA/60,CF%,FF/60,FA/60,FF%,...,HDCA/60,HDCF%,HDGF/60,HDGA/60,HDGF%,HDSH%,HDSV%,SH%,SV%,PDO
0,"2019-10-02 - Senators 3, Maple Leafs 5",Toronto Maple Leafs,Limited ReportFull Report,44.133333,78.0,54.02,59.08,59.15,43.75,57.48,...,8.32,65.42,5.12,4.2,54.92,55.99,24.79,11.94,85.6,0.975
1,"2019-10-02 - Capitals 3, Blues 2",St Louis Blues,Limited ReportFull Report,50.866667,33.43,47.21,41.45,27.17,39.14,40.97,...,6.14,47.66,1.11,0.0,100.0,19.64,100.0,5.56,95.79,1.014
2,"2019-10-02 - Canucks 2, Oilers 3",Edmonton Oilers,Limited ReportFull Report,47.066667,44.76,67.92,39.72,30.38,48.03,38.75,...,9.11,39.87,2.4,1.35,64.02,39.29,65.85,14.87,91.33,1.062
3,"2019-10-02 - Sharks 1, Golden Knights 4",Vegas Golden Knights,Limited ReportFull Report,45.666667,62.73,40.18,60.96,51.94,30.07,63.34,...,5.27,75.25,1.3,0.0,100.0,12.06,100.0,8.07,93.28,1.013
4,"2019-10-03 - Panthers 2, Lightning 5",Tampa Bay Lightning,Limited ReportFull Report,45.5,46.81,57.6,44.83,32.45,46.56,41.07,...,9.26,55.98,1.31,1.33,49.49,19.81,83.02,9.94,96.45,1.064


In [30]:
all_cols = (home_5v5_adj.columns.to_list())
pp(all_cols)

[
    'Game',
    'Team',
    'Unnamed: 2',
    'TOI',
    'CF/60',
    'CA/60',
    'CF%',
    'FF/60',
    'FA/60',
    'FF%',
    'SF/60',
    'SA/60',
    'SF%',
    'GF/60',
    'GA/60',
    'GF%',
    'xGF/60',
    'xGA/60',
    'xGF%',
    'SCF/60',
    'SCA/60',
    'SCF%',
    'HDCF/60',
    'HDCA/60',
    'HDCF%',
    'HDGF/60',
    'HDGA/60',
    'HDGF%',
    'HDSH%',
    'HDSV%',
    'SH%',
    'SV%',
    'PDO',
]


In [31]:
# Even strength features list
# comment out features to disregard
ev_features_list = [
    'Game',
    'Team',
    'TOI',
#     'CF/60',
#     'CA/60',
#     'CF%',
    'FF/60',
    'FA/60',
    'FF%',
#     'SF/60',
#     'SA/60',
#     'SF%',
    'GF/60',
    'GA/60',
    'GF%',
    'xGF/60',
    'xGA/60',
    'xGF%',
#     'SCF/60',
#     'SCA/60',
#     'SCF%',
    'HDCF/60',
    'HDCA/60',
    'HDCF%',
#     'HDGF/60',
#     'HDGA/60',
#     'HDGF%',
    'HDSH%',
    'HDSV%',
    'SH%',
    'SV%',
    'PDO'
]

In [32]:
pp_cols = (home_5v4_pp.columns.to_list())
print(pp_cols)

['Game', 'Team', 'Unnamed: 2', 'TOI', 'CF/60', 'CA/60', 'CF%', 'FF/60', 'FA/60', 'FF%', 'SF/60', 'SA/60', 'SF%', 'GF/60', 'GA/60', 'GF%', 'xGF/60', 'xGA/60', 'xGF%', 'SCF/60', 'SCA/60', 'SCF%', 'HDCF/60', 'HDCA/60', 'HDCF%', 'HDGF/60', 'HDGA/60', 'HDGF%', 'HDSH%', 'HDSV%', 'SH%', 'SV%', 'PDO']


In [33]:
# PP df features list
# comment out features to disregard
pp_features_list = [
#     'Game',
#     'Team',
    'TOI',
#     'CF/60',
#     'CA/60',
#     'CF%',
#     'FF/60',
#     'FA/60',
#     'FF%',
#     'SF/60',
#     'SA/60',
#     'SF%',
    'GF/60',
#     'GA/60',
#     'GF%',
    'xGF/60',
#     'xGA/60',
#     'xGF%',
#     'SCF/60',
#     'SCA/60',
#     'SCF%',
#     'HDCF/60',
#     'HDCA/60',
#     'HDCF%',
#     'HDGF/60',
#     'HDGA/60',
#     'HDGF%',
#     'HDSH%',
#     'HDSV%',
#     'SH%',
#     'SV%',
#     'PDO'
]

In [34]:
# PK df features list
# same as pp df but swap GFs for GAs (goals and expected goals against)
pk_features_list = ['TOI', 'GA/60', 'xGA/60']

In [99]:
# Team abbreviation dict, based on the hockey_scraper package's team-name mapping
# (described there as "All the corresponding tri-codes for team names")

team_dict = {
    'Anaheim Ducks': 'ANA',
    'Arizona Coyotes' : 'ARI',
    'Boston Bruins': 'BOS', 
    'Buffalo Sabres':'BUF',
    'Calgary Flames': 'CGY', 
    'Carolina Hurricanes': 'CAR', 
    'Chicago Blackhawks': 'CHI', 
    'Colorado Avalanche': 'COL',
    'Columbus Blue Jackets': 'CBJ',
    'Dallas Stars': 'DAL',
    'Detroit Red Wings': 'DET',
    'Edmonton Oilers': 'EDM',
    'Florida Panthers': 'FLA',
    'Los Angeles Kings': 'L.A.',
    'Minnesota Wild': 'MIN',
    'Montreal Canadiens': 'MTL',
    'Nashville Predators': 'NSH',
    'New Jersey Devils': 'N.J.',
    "New York Islanders": 'NYI',
    "New York Rangers": 'NYR',
    'Ottawa Senators': 'OTT',
    'Philadelphia Flyers': 'PHI',
    'Pittsburgh Penguins': 'PIT',
    'San Jose Sharks': 'S.J.',
    'Seattle Kraken': 'SEA',
#     'St. Louis Blues': 'STL',
    'St Louis Blues': 'STL',
    'Tampa Bay Lightning': 'T.B.',
    'Toronto Maple Leafs': 'TOR',
    'Vancouver Canucks': 'VAN',
    'Vegas Golden Knights':'VGK',
    'Washington Capitals': 'WSH',
    'Winnipeg Jets': 'WPG'
}

In [100]:
len(team_dict)

32

In [112]:
# First-pass cleaning of a game log DataFrame (modifies it in place)
def process_datasets(game_log_df):
    # extract the date (first 10 characters, YYYY-MM-DD) from the Game column
    game_log_df['Date'] = pd.to_datetime(game_log_df['Game'].str[:10])
    # replace full team names with abbreviations, in the Team column only
    game_log_df['Team'] = game_log_df['Team'].replace(team_dict)
    # build a per-team, per-date game key, e.g. 'TOR_2019-10-02'
    game_log_df['Game_Key'] = game_log_df['Team'].astype(str) + '_' + game_log_df['Date'].astype(str)
    return game_log_df

In [113]:
process_datasets(home_5v5_adj)
home_5v5_adj.head()

Unnamed: 0,Game,Team,Unnamed: 2,TOI,CF/60,CA/60,CF%,FF/60,FA/60,FF%,...,HDGF/60,HDGA/60,HDGF%,HDSH%,HDSV%,SH%,SV%,PDO,Date,Game_Key
0,"2019-10-02 - Senators 3, Maple Leafs 5",TOR,Limited ReportFull Report,44.133333,78.0,54.02,59.08,59.15,43.75,57.48,...,5.12,4.2,54.92,55.99,24.79,11.94,85.6,0.975,2019-10-02,TOR_2019-10-02
1,"2019-10-02 - Capitals 3, Blues 2",STL,Limited ReportFull Report,50.866667,33.43,47.21,41.45,27.17,39.14,40.97,...,1.11,0.0,100.0,19.64,100.0,5.56,95.79,1.014,2019-10-02,STL_2019-10-02
2,"2019-10-02 - Canucks 2, Oilers 3",EDM,Limited ReportFull Report,47.066667,44.76,67.92,39.72,30.38,48.03,38.75,...,2.4,1.35,64.02,39.29,65.85,14.87,91.33,1.062,2019-10-02,EDM_2019-10-02
3,"2019-10-02 - Sharks 1, Golden Knights 4",VGK,Limited ReportFull Report,45.666667,62.73,40.18,60.96,51.94,30.07,63.34,...,1.3,0.0,100.0,12.06,100.0,8.07,93.28,1.013,2019-10-02,VGK_2019-10-02
4,"2019-10-03 - Panthers 2, Lightning 5",T.B.,Limited ReportFull Report,45.5,46.81,57.6,44.83,32.45,46.56,41.07,...,1.31,1.33,49.49,19.81,83.02,9.94,96.45,1.064,2019-10-03,T.B._2019-10-03


In [114]:
home_5v5_adj['Team'].value_counts()

WPG     106
DET     106
OTT     106
VGK     106
MTL     106
CBJ     105
ANA     105
S.J.    105
NYR     105
NSH     104
PIT     104
NYI     104
BOS     104
PHI     104
STL     104
BUF     104
MIN     104
FLA     104
VAN     104
CHI     103
L.A.    103
TOR     103
N.J.    103
DAL     103
T.B.    103
EDM     103
CGY     102
ARI     102
WSH     102
COL     102
CAR     102
SEA      41
Name: Team, dtype: int64

In [131]:
process_datasets(home_5v4_pp)
process_datasets(home_4v5_pk)
process_datasets(away_5v5_adj)
process_datasets(away_5v4_pp)
process_datasets(away_4v5_pk)

Unnamed: 0,Game,Team,Unnamed: 2,TOI,CF/60,CA/60,CF%,FF/60,FA/60,FF%,...,HDGF/60,HDGA/60,HDGF%,HDSH%,HDSV%,SH%,SV%,PDO,Date,Game_Key
0,"2019-10-02 - Senators 3, Maple Leafs 5",OTT,Limited ReportFull Report,8.166667,14.69,146.94,9.09,14.69,124.9,10.53,...,0,0,-,0.00,100.00,0.00,90.00,0.900,2019-10-02,OTT_2019-10-02
1,"2019-10-02 - Capitals 3, Blues 2",WSH,Limited ReportFull Report,3.250000,18.46,92.31,16.67,18.46,73.85,20.00,...,0,0,-,-,-,0.00,50.00,0.500,2019-10-02,WSH_2019-10-02
2,"2019-10-02 - Canucks 2, Oilers 3",VAN,Limited ReportFull Report,4.000000,0,120,0.00,0,45,0.00,...,0,0,-,-,100.00,-,100.00,-,2019-10-02,VAN_2019-10-02
3,"2019-10-02 - Sharks 1, Golden Knights 4",S.J.,Limited ReportFull Report,4.333333,0,152.31,0.00,0,124.62,0.00,...,0,0,-,-,-,-,83.33,-,2019-10-02,S.J._2019-10-02
4,"2019-10-03 - Panthers 2, Lightning 5",FLA,Limited ReportFull Report,4.233333,14.17,99.21,12.50,14.17,99.21,12.50,...,0,14.17,0.00,-,50.00,100.00,85.71,1.857,2019-10-03,FLA_2019-10-03
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3200,"2022-04-29 - Avalanche 1, Wild 4",COL,Limited ReportFull Report,9.983333,0,84.14,0.00,0,54.09,0.00,...,0,0,-,-,-,-,100.00,-,2022-04-29,COL_2022-04-29
3201,"2022-04-29 - Flames 1, Jets 3",CGY,Limited ReportFull Report,5.400000,11.11,100,10.00,11.11,77.78,12.50,...,0,0,-,-,100.00,0.00,100.00,1.000,2022-04-29,CGY_2022-04-29
3202,"2022-04-29 - Predators 4, Coyotes 5",NSH,Limited ReportFull Report,5.950000,10.08,110.92,8.33,0,80.67,0.00,...,0,0,-,-,100.00,-,100.00,-,2022-04-29,NSH_2022-04-29
3203,"2022-04-29 - Sharks 0, Kraken 3",S.J.,Limited ReportFull Report,2.000000,0,150,0.00,0,90,0.00,...,0,0,-,-,100.00,-,100.00,-,2022-04-29,S.J._2022-04-29


In [132]:
away_5v5_adj

Unnamed: 0,Game,Team,Unnamed: 2,TOI,CF/60,CA/60,CF%,FF/60,FA/60,FF%,...,HDGF/60,HDGA/60,HDGF%,HDSH%,HDSV%,SH%,SV%,PDO,Date,Game_Key
0,"2019-10-02 - Senators 3, Maple Leafs 5",OTT,Limited ReportFull Report,44.133333,54.02,78.00,40.92,43.75,59.15,42.52,...,4.20,5.12,45.08,75.21,44.01,14.40,88.06,1.025,2019-10-02,OTT_2019-10-02
1,"2019-10-02 - Capitals 3, Blues 2",WSH,Limited ReportFull Report,50.866667,47.21,33.43,58.55,39.14,27.17,59.03,...,0.00,1.11,0.00,0.00,80.36,4.21,94.44,0.986,2019-10-02,WSH_2019-10-02
2,"2019-10-02 - Canucks 2, Oilers 3",VAN,Limited ReportFull Report,47.066667,67.92,44.76,60.28,48.03,30.38,61.25,...,1.35,2.40,35.98,34.15,60.71,8.67,85.13,0.938,2019-10-02,VAN_2019-10-02
3,"2019-10-02 - Sharks 1, Golden Knights 4",S.J.,Limited ReportFull Report,45.666667,40.18,62.73,39.04,30.07,51.94,36.66,...,0.00,1.30,0.00,0.00,87.94,6.72,91.93,0.987,2019-10-02,S.J._2019-10-02
4,"2019-10-03 - Panthers 2, Lightning 5",FLA,Limited ReportFull Report,45.500000,57.60,46.81,55.17,46.56,32.45,58.93,...,1.33,1.31,50.51,16.98,80.19,3.55,90.06,0.936,2019-10-03,FLA_2019-10-03
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3257,"2022-04-29 - Avalanche 1, Wild 4",COL,Limited ReportFull Report,39.400000,57.96,37.86,60.49,46.61,30.65,60.33,...,0.00,2.98,0.00,0.00,52.17,4.50,80.21,0.847,2022-04-29,COL_2022-04-29
3258,"2022-04-29 - Flames 1, Jets 3",CGY,Limited ReportFull Report,47.916667,57.16,77.20,42.54,45.89,59.43,43.57,...,1.33,1.18,52.92,16.92,92.14,3.98,94.66,0.986,2022-04-29,CGY_2022-04-29
3259,"2022-04-29 - Predators 4, Coyotes 5",NSH,Limited ReportFull Report,48.750000,74.72,41.64,64.22,58.08,31.75,64.66,...,3.99,2.26,63.85,48.16,47.48,16.40,74.89,0.913,2022-04-29,NSH_2022-04-29
3260,"2022-04-29 - Sharks 0, Kraken 3",S.J.,Limited ReportFull Report,52.416667,32.76,64.92,33.54,26.93,48.21,35.84,...,0.00,0.00,-,0.00,100.00,0.00,93.18,0.932,2022-04-29,S.J._2022-04-29


In [137]:
home_5v4_pp['TOI']

0       8.166667
1       3.250000
2       4.000000
3       4.333333
4       4.233333
          ...   
3200    9.983333
3201    5.400000
3202    5.950000
3203    2.000000
3204    2.000000
Name: TOI, Length: 3205, dtype: float64

In [138]:
home_5v4_pp['xGF/60']

0        7.47
1        6.44
2         4.6
3        8.96
4       10.54
        ...  
3200      3.6
3201     9.52
3202    15.61
3203    12.69
3204     5.95
Name: xGF/60, Length: 3205, dtype: object

In [142]:
home_5v4_pp['GF/60']

0        7.35
1       18.46
2           0
3       13.85
4       14.17
        ...  
3200        0
3201        0
3202        0
3203        0
3204        0
Name: GF/60, Length: 3205, dtype: object
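
The `object` dtype on `xGF/60` and `GF/60` above suggests these columns were read as strings, presumably because some cells hold a placeholder such as the `-` visible in other outputs. A minimal sketch of one way to coerce them to floats follows; the helper name and the toy frame are assumptions for illustration, not part of the original pipeline.

```python
import pandas as pd

def coerce_stat_columns(df, columns):
    """Return a copy of df with the given columns converted to floats.

    Non-numeric placeholders such as '-' become NaN.
    """
    out = df.copy()
    for col in columns:
        out[col] = pd.to_numeric(out[col], errors='coerce')
    return out

# Toy frame mimicking the string-typed columns (values are illustrative)
toy_pp = pd.DataFrame({'GF/60': ['7.35', '0', '-'], 'xGF/60': ['7.47', '-', '4.6']})
clean_pp = coerce_stat_columns(toy_pp, ['GF/60', 'xGF/60'])
print(clean_pp.dtypes)
print(clean_pp.isna().sum())
```

The resulting NaN values would still need to be imputed or dropped before modeling; which choice is appropriate depends on why the stat is undefined for that game.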

### Add NHL.com schedule/result df


In [177]:
"""
This module contains functions to scrape the json schedule for any games or date range
"""

from datetime import datetime, timedelta
import json
import time
import hockey_scraper.utils.shared as shared


# TODO: Currently rescraping page each time since the status of some games may have changed
# (e.g. Scraped on 2020-01-20 and game on 2020-01-21 was not Final...when use old page again will still think not Final)
# Need to find a more elegant way of doing this (Metadata???)
def get_schedule(date_from, date_to):
    """
    Scrapes games in date range
    Ex: https://statsapi.web.nhl.com/api/v1/schedule?startDate=2010-10-03&endDate=2011-06-20
    
    :param date_from: scrape from this date
    :param date_to: scrape until this date
    
    :return: raw json of schedule of date range
    """
    page_info = {
        "url": 'https://statsapi.web.nhl.com/api/v1/schedule?startDate={a}&endDate={b}'.format(a=date_from, b=date_to),
        "name": date_from + "_" + date_to,
        "type": "json_schedule",
        "season": shared.get_season(date_from),
    }

    return json.loads(shared.get_file(page_info, force=True))


def chunk_schedule_calls(from_date, to_date):
    """
    The schedule endpoint handles large date ranges poorly, so it is called in increments of n days.
    
    :param from_date: scrape from this date (YYYY-MM-DD)
    :param to_date: scrape until this date (YYYY-MM-DD)

    :return: list of the 'dates' entries from each chunk's schedule json
    """
    sched = []
    days_per_call = 30

    from_date = datetime.strptime(from_date, "%Y-%m-%d") 
    to_date = datetime.strptime(to_date, "%Y-%m-%d")
    num_days = (to_date - from_date).days + 1  # +1 since difference is looking for total number of days

    for offset in range(0, num_days, days_per_call):
        f_chunk = datetime.strftime(from_date + timedelta(days=offset), "%Y-%m-%d")

        # We need the min bec. if the chunks are evenly sized this prevents us from overshooting the max
        t_chunk = datetime.strftime(from_date + timedelta(days=min(num_days-1, offset+days_per_call-1)), "%Y-%m-%d")

        chunk_sched = get_schedule(f_chunk, t_chunk)
        sched.append(chunk_sched['dates'])

    return sched


def get_dates(games):
    """
    Given a list of game_ids, returns the date for each game.

    We sort all the games and retrieve the schedule from the beginning of the season from the earliest game
    until the end of most recent season.
    
    :param games: list with game_id's ex: 2016020001
    
    :return: list with game_id and corresponding date for all games
    """
    today = datetime.today()

    # Determine oldest and newest game
    games = list(map(str, games))
    games.sort()

    date_from = shared.season_start_bound(games[0][:4])
    year_to = int(games[-1][:4])

    # If the last game is part of the ongoing season then only request the schedule until Today
    # We get strange errors if we don't do it like this
    if year_to == shared.get_season(datetime.strftime(today, "%Y-%m-%d")):
        date_to = '-'.join([str(today.year), str(today.month), str(today.day)])
    else:
        date_to = datetime.strftime(shared.season_end_bound(year_to+1), "%Y-%m-%d")  # Newest game in sample

    # TODO: Assume true is live here -> Workaround
    schedule = scrape_schedule(date_from, date_to, preseason=True, not_over=True)

    # Only return games we want in range
    games_list = []
    for game in schedule:
        if str(game['game_id']) in games:
            games_list.extend([game])

    return games_list


def scrape_schedule(date_from, date_to, preseason=False, not_over=False):
    """
    Calls chunk_schedule_calls and parses the raw schedule json
    
    :param date_from: scrape from this date
    :param date_to: scrape until this date
    :param preseason: Boolean indicating whether to include preseason games (default is False)
    :param not_over: Boolean indicating whether to include games that are not finished,
                     i.e. relax the requirement that the game status is 'Final'
    
    :return: list of dicts, one per game, with ids, date, scores, teams and records
    """
    schedule = []
    schedule_json = chunk_schedule_calls(date_from, date_to)

    for chunk in schedule_json:
        for day in chunk:
            for game in day['games']:
                if game['status']['detailedState'] == 'Final' or not_over:
                    game_id = int(str(game['gamePk'])[5:])

                    if (game_id >= 20000 or preseason) and game_id < 40000:
                        schedule.append({
                            "game_id": game['gamePk'],
                            "game_type": game['gameType'],
                            "season_id": game['season'],
                            "date": day['date'],
                            "home_score": game['teams']['home'].get("score"),
                            "away_score": game['teams']['away'].get("score"),
                            # "start_time": datetime.strptime(game['gameDate'][:-1], "%Y-%m-%dT%H:%M:%S"),
                            # "venue": game['venue'].get('name'),
                            "home_team_id": game['teams']['home']['team']['id'],
                            "home_team": shared.get_team(game['teams']['home']['team']['name']),
                            "home_wins": game['teams']['home'].get("leagueRecord").get("wins"),
                            "home_losses": game['teams']['home'].get("leagueRecord").get("losses"),
                            "home_otl": game['teams']['home'].get("leagueRecord").get("ot"),
                            "away_wins": game['teams']['away'].get("leagueRecord").get("wins"),
                            "away_losses": game['teams']['away'].get("leagueRecord").get("losses"),
                            "away_otl": game['teams']['away'].get("leagueRecord").get("ot"),
                            "away_team_id": game['teams']['away']['team']['id'],
                            "away_team": shared.get_team(game['teams']['away']['team']['name']),
                            "status": game["status"]["abstractGameState"]
                        })


    return schedule

In [178]:
schedule_json = scrape_schedule('2019-10-01', '2022-06-06', preseason=False, not_over=False)

In [179]:
schedule_df=pd.DataFrame(schedule_json)
schedule_df
# schedule_df.sort_values(by=['game_id'])

Unnamed: 0,game_id,game_type,season_id,date,home_score,away_score,home_team_id,home_team,home_wins,home_losses,home_otl,away_wins,away_losses,away_otl,away_team_id,away_team,status
0,2019020001,R,20192020,2019-10-02,5,3,10,TOR,1,0,0.0,0,1,0.0,9,OTT,Final
1,2019020002,R,20192020,2019-10-02,2,3,19,STL,0,0,1.0,1,0,0.0,15,WSH,Final
2,2019020003,R,20192020,2019-10-02,3,2,22,EDM,1,0,0.0,0,1,0.0,23,VAN,Final
3,2019020004,R,20192020,2019-10-02,4,1,54,VGK,1,0,0.0,0,1,0.0,28,S.J,Final
4,2019020005,R,20192020,2019-10-03,5,2,14,T.B,1,0,0.0,0,1,0.0,13,FLA,Final
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3552,2021030322,P,20212022,2022-06-02,4,0,21,COL,10,2,,8,6,,22,EDM,Final
3553,2021030312,P,20212022,2022-06-03,3,2,3,NYR,10,6,,8,5,,14,T.B,Final
3554,2021030323,P,20212022,2022-06-04,2,4,22,EDM,8,7,,11,2,,21,COL,Final
3555,2021030313,P,20212022,2022-06-05,3,2,14,T.B,9,5,,10,7,,3,NYR,Final


In [180]:
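# Keep regular-season games only; .copy() so later column assignments do not write to a view
reg_sched_df = schedule_df.loc[schedule_df['game_type'] == 'R'].copy()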
reg_sched_df

Unnamed: 0,game_id,game_type,season_id,date,home_score,away_score,home_team_id,home_team,home_wins,home_losses,home_otl,away_wins,away_losses,away_otl,away_team_id,away_team,status
0,2019020001,R,20192020,2019-10-02,5,3,10,TOR,1,0,0.0,0,1,0.0,9,OTT,Final
1,2019020002,R,20192020,2019-10-02,2,3,19,STL,0,0,1.0,1,0,0.0,15,WSH,Final
2,2019020003,R,20192020,2019-10-02,3,2,22,EDM,1,0,0.0,0,1,0.0,23,VAN,Final
3,2019020004,R,20192020,2019-10-02,4,1,54,VGK,1,0,0.0,0,1,0.0,28,S.J,Final
4,2019020005,R,20192020,2019-10-03,5,2,14,T.B,1,0,0.0,0,1,0.0,13,FLA,Final
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3472,2021021307,R,20212022,2022-04-29,4,2,25,DAL,46,30,6.0,31,37,14.0,24,ANA,Final
3473,2021021207,R,20212022,2022-04-29,3,2,22,EDM,49,27,6.0,40,30,12.0,23,VAN,Final
3474,2021021312,R,20212022,2022-04-29,3,0,55,SEA,27,48,6.0,32,37,13.0,28,S.J,Final
3475,2021021311,R,20212022,2022-04-29,5,4,53,ARI,25,50,7.0,45,30,7.0,18,NSH,Final


In [181]:
reg_sched_df['Home_Team_Won'] = np.where(reg_sched_df['home_score'] > reg_sched_df['away_score'], 1, 0)
reg_sched_df['Home_Team_Key'] = reg_sched_df['home_team'].astype(str)+'_'+reg_sched_df['date'].astype(str)
reg_sched_df['Away_Team_Key'] = reg_sched_df['away_team'].astype(str)+'_'+reg_sched_df['date'].astype(str)
reg_sched_df

Unnamed: 0,game_id,game_type,season_id,date,home_score,away_score,home_team_id,home_team,home_wins,home_losses,home_otl,away_wins,away_losses,away_otl,away_team_id,away_team,status,Home_Team_Won,Home_Team_Key,Away_Team_Key
0,2019020001,R,20192020,2019-10-02,5,3,10,TOR,1,0,0.0,0,1,0.0,9,OTT,Final,1,TOR_2019-10-02,OTT_2019-10-02
1,2019020002,R,20192020,2019-10-02,2,3,19,STL,0,0,1.0,1,0,0.0,15,WSH,Final,0,STL_2019-10-02,WSH_2019-10-02
2,2019020003,R,20192020,2019-10-02,3,2,22,EDM,1,0,0.0,0,1,0.0,23,VAN,Final,1,EDM_2019-10-02,VAN_2019-10-02
3,2019020004,R,20192020,2019-10-02,4,1,54,VGK,1,0,0.0,0,1,0.0,28,S.J,Final,1,VGK_2019-10-02,S.J_2019-10-02
4,2019020005,R,20192020,2019-10-03,5,2,14,T.B,1,0,0.0,0,1,0.0,13,FLA,Final,1,T.B_2019-10-03,FLA_2019-10-03
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3472,2021021307,R,20212022,2022-04-29,4,2,25,DAL,46,30,6.0,31,37,14.0,24,ANA,Final,1,DAL_2022-04-29,ANA_2022-04-29
3473,2021021207,R,20212022,2022-04-29,3,2,22,EDM,49,27,6.0,40,30,12.0,23,VAN,Final,1,EDM_2022-04-29,VAN_2022-04-29
3474,2021021312,R,20212022,2022-04-29,3,0,55,SEA,27,48,6.0,32,37,13.0,28,S.J,Final,1,SEA_2022-04-29,S.J_2022-04-29
3475,2021021311,R,20212022,2022-04-29,5,4,53,ARI,25,50,7.0,45,30,7.0,18,NSH,Final,1,ARI_2022-04-29,NSH_2022-04-29
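
Note that the schedule keys above use codes such as `S.J` and `T.B`, while the game log keys built from `team_dict` use `S.J.` and `T.B.` (for example `T.B_2019-10-03` versus `T.B._2019-10-03`), so a join on these keys would miss those teams. One possible fix, sketched below, is to strip the periods from team codes on both sides before building the keys. The helper names are hypothetical and this is not how the notebook currently builds its keys.

```python
import pandas as pd

def normalize_team_code(code):
    """Strip periods so 'T.B', 'T.B.' and 'TB' all map to 'TB'."""
    return str(code).replace('.', '')

def make_game_key(team_codes, dates):
    """Build '<TEAM>_<YYYY-MM-DD>' keys from normalized team codes."""
    teams = pd.Series(team_codes).map(normalize_team_code)
    return teams + '_' + pd.Series(dates).astype(str)

# Team codes and dates taken from the outputs above
log_keys = make_game_key(['T.B.', 'S.J.'], ['2019-10-03', '2019-10-02'])
sched_keys = make_game_key(['T.B', 'S.J'], ['2019-10-03', '2019-10-02'])
print(log_keys.tolist())            # ['TB_2019-10-03', 'SJ_2019-10-02']
print(log_keys.equals(sched_keys))  # True
```

An alternative would be to edit `team_dict` so its values match the schedule's codes exactly; either way, checking the number of unmatched keys after the join is a cheap sanity test.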


In [185]:
reg_sched_df['home_points'] = ((reg_sched_df['home_wins'] * 2) + reg_sched_df['home_otl']).astype(int)
reg_sched_df['away_points'] = ((reg_sched_df['away_wins'] * 2) + reg_sched_df['away_otl']).astype(int)
reg_sched_df

Unnamed: 0,game_id,game_type,season_id,date,home_score,away_score,home_team_id,home_team,home_wins,home_losses,...,away_losses,away_otl,away_team_id,away_team,status,Home_Team_Won,Home_Team_Key,Away_Team_Key,home_points,away_points
0,2019020001,R,20192020,2019-10-02,5,3,10,TOR,1,0,...,1,0.0,9,OTT,Final,1,TOR_2019-10-02,OTT_2019-10-02,2,0
1,2019020002,R,20192020,2019-10-02,2,3,19,STL,0,0,...,0,0.0,15,WSH,Final,0,STL_2019-10-02,WSH_2019-10-02,1,2
2,2019020003,R,20192020,2019-10-02,3,2,22,EDM,1,0,...,1,0.0,23,VAN,Final,1,EDM_2019-10-02,VAN_2019-10-02,2,0
3,2019020004,R,20192020,2019-10-02,4,1,54,VGK,1,0,...,1,0.0,28,S.J,Final,1,VGK_2019-10-02,S.J_2019-10-02,2,0
4,2019020005,R,20192020,2019-10-03,5,2,14,T.B,1,0,...,1,0.0,13,FLA,Final,1,T.B_2019-10-03,FLA_2019-10-03,2,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3472,2021021307,R,20212022,2022-04-29,4,2,25,DAL,46,30,...,37,14.0,24,ANA,Final,1,DAL_2022-04-29,ANA_2022-04-29,98,76
3473,2021021207,R,20212022,2022-04-29,3,2,22,EDM,49,27,...,30,12.0,23,VAN,Final,1,EDM_2022-04-29,VAN_2022-04-29,104,92
3474,2021021312,R,20212022,2022-04-29,3,0,55,SEA,27,48,...,37,13.0,28,S.J,Final,1,SEA_2022-04-29,S.J_2022-04-29,60,77
3475,2021021311,R,20212022,2022-04-29,5,4,53,ARI,25,50,...,30,7.0,18,NSH,Final,1,ARI_2022-04-29,NSH_2022-04-29,57,97


In [187]:
reg_sched_df['home_gp'] = (reg_sched_df['home_wins'] + reg_sched_df['home_losses'] + reg_sched_df['home_otl']).astype(int)
reg_sched_df['away_gp'] = (reg_sched_df['away_wins'] + reg_sched_df['away_losses'] + reg_sched_df['away_otl']).astype(int)
reg_sched_df

Unnamed: 0,game_id,game_type,season_id,date,home_score,away_score,home_team_id,home_team,home_wins,home_losses,...,away_team_id,away_team,status,Home_Team_Won,Home_Team_Key,Away_Team_Key,home_points,away_points,home_gp,away_gp
0,2019020001,R,20192020,2019-10-02,5,3,10,TOR,1,0,...,9,OTT,Final,1,TOR_2019-10-02,OTT_2019-10-02,2,0,1,1
1,2019020002,R,20192020,2019-10-02,2,3,19,STL,0,0,...,15,WSH,Final,0,STL_2019-10-02,WSH_2019-10-02,1,2,1,1
2,2019020003,R,20192020,2019-10-02,3,2,22,EDM,1,0,...,23,VAN,Final,1,EDM_2019-10-02,VAN_2019-10-02,2,0,1,1
3,2019020004,R,20192020,2019-10-02,4,1,54,VGK,1,0,...,28,S.J,Final,1,VGK_2019-10-02,S.J_2019-10-02,2,0,1,1
4,2019020005,R,20192020,2019-10-03,5,2,14,T.B,1,0,...,13,FLA,Final,1,T.B_2019-10-03,FLA_2019-10-03,2,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3472,2021021307,R,20212022,2022-04-29,4,2,25,DAL,46,30,...,24,ANA,Final,1,DAL_2022-04-29,ANA_2022-04-29,98,76,82,82
3473,2021021207,R,20212022,2022-04-29,3,2,22,EDM,49,27,...,23,VAN,Final,1,EDM_2022-04-29,VAN_2022-04-29,104,92,82,82
3474,2021021312,R,20212022,2022-04-29,3,0,55,SEA,27,48,...,28,S.J,Final,1,SEA_2022-04-29,S.J_2022-04-29,60,77,81,82
3475,2021021311,R,20212022,2022-04-29,5,4,53,ARI,25,50,...,18,NSH,Final,1,ARI_2022-04-29,NSH_2022-04-29,57,97,82,82


### Preliminary Feature Selection and Processing

In [133]:
# add xg per pp minute

### Top-line feature (column) deletion and addition

Deleting:
- 'Unnamed: 2', which held hyperlinks to related report pages
- 'C' type (Corsi) shot stats
    - we will be using 'F' type (Fenwick) shots, which count goals, shots on net and misses but exclude blocked shots
- 'SC' type (Scoring Chance) vars
    - we will be focusing on the 'HD' (high-danger) type
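
The deletions listed above could be sketched as a small helper that drops columns by name and prefix. The function name and the toy column list are assumptions for illustration; the notebook's own feature lists select columns explicitly instead.

```python
import pandas as pd

def drop_unused_columns(df):
    """Drop the hyperlink column plus Corsi ('C*') and scoring-chance ('SC*') stats."""
    prefixes = ('CF', 'CA', 'SCF', 'SCA')
    to_drop = [c for c in df.columns if c == 'Unnamed: 2' or c.startswith(prefixes)]
    return df.drop(columns=to_drop)

# Empty toy frame with a subset of the game log columns
toy = pd.DataFrame(columns=['Game', 'Team', 'Unnamed: 2', 'TOI', 'CF/60', 'CA/60', 'CF%',
                            'FF/60', 'SCF/60', 'SCA/60', 'SCF%', 'HDCF/60', 'xGF/60'])
print(drop_unused_columns(toy).columns.tolist())
# ['Game', 'Team', 'TOI', 'FF/60', 'HDCF/60', 'xGF/60']
```

Matching on the start of the name keeps the high-danger columns (`HDCF/60` and friends), which contain `CF` but do not begin with it.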

In [148]:
json_test = scrape_schedule('2023-04-12', '2023-04-13')

In [157]:
json_test_df = pd.DataFrame(json_test)
json_test_df

Unnamed: 0,game_id,season_id,date,home_score,away_score,home_team_id,home_team,home_wins,home_losses,home_otl,away_wins,away_losses,away_otl,away_team_id,away_team,status
0,2022021295,20222023,2023-04-12,4,2,2,NYI,42,31,9,31,44,6,8,MTL,Final
1,2022021296,20222023,2023-04-12,2,5,19,STL,37,37,7,46,21,14,25,DAL,Final
2,2022021297,20222023,2023-04-12,3,1,20,CGY,38,27,17,22,43,16,28,S.J,Final
3,2022021298,20222023,2023-04-13,4,5,8,MTL,31,45,6,65,12,5,6,BOS,Final
4,2022021299,20222023,2023-04-13,4,6,13,FLA,42,32,8,52,21,9,12,CAR,Final
5,2022021300,20222023,2023-04-13,5,0,14,T.B,46,30,6,35,37,10,17,DET,Final
6,2022021301,20222023,2023-04-13,4,5,15,WSH,35,37,10,52,22,8,1,N.J,Final
7,2022021302,20222023,2023-04-13,4,3,7,BUF,41,33,7,39,35,8,9,OTT,Final
8,2022021303,20222023,2023-04-13,3,2,29,CBJ,25,47,9,40,31,11,5,PIT,Final
9,2022021304,20222023,2023-04-13,2,3,3,NYR,47,22,13,50,21,11,10,TOR,Final
