# Modeling NHL Game Outcomes

Building binary classification models to predict the winning team

# Overview

Topline goal: predict the winner of a given NHL game

The first purpose of this model is to see whether the winner of an NHL game can be predicted more accurately than by a naive prediction: choosing the home team to win every time, with no other informative input.

In this notebook, several modeling techniques will be trained and tested using team game log data from the prior three full NHL regular seasons (2019-20, 2020-21, 2021-22). Games played to date in the current season (2022-23) will then be used to evaluate the model's ability to correctly predict the winning team.

If I can develop a game prediction model that performs above that naive baseline, the next step will be to raise the benchmark to a semi-naive prediction: choosing the 'Vegas' favorite to win every time. The recent wave of mobile sports betting legalization has led to a massive increase in sports wagers placed both in the US and globally. With the increased popularity of sports betting, sportsbook odds are an increasingly efficient market, i.e. the implied probabilities of sportsbook odds approach an unbiased estimator of outcomes.

If I can develop a model that is competitive with the betting market, the next step will be to test betting strategies that leverage the prediction model to yield a positive ROI. Given the 'vig' (i.e. commission) that sportsbooks charge to take wagers, simply beating the market will not be enough to produce a profitable strategy. Instead, the model will have to outperform the market by roughly 5% or more to approach profitability.
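To make the vig concrete, here is a minimal sketch that converts American moneyline odds to implied probabilities and measures the bookmaker's margin. The function names and the -110/-110 price are illustrative assumptions, not part of this notebook's data.

```python
def american_to_implied_prob(odds):
    """Implied win probability of an American moneyline price (vig included)."""
    if odds < 0:
        return -odds / (-odds + 100)
    return 100 / (odds + 100)


def market_summary(home_odds, away_odds):
    """Vig-free probabilities and the overround for a two-way market."""
    p_home = american_to_implied_prob(home_odds)
    p_away = american_to_implied_prob(away_odds)
    total = p_home + p_away  # sums to more than 1 because of the vig
    return {
        "home_fair": p_home / total,
        "away_fair": p_away / total,
        "overround": total - 1,
    }


# Hypothetical evenly matched game priced at -110 on both sides
summary = market_summary(-110, -110)
break_even = american_to_implied_prob(-110)
print(round(break_even, 4), round(summary["overround"], 4))
```

At -110 on both sides, a bettor needs to win about 52.4% of wagers to break even even though each side's fair probability is 50%, which is consistent with the rough margin quoted above.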

The excess return required to be profitable in sports betting, together with the generally high log loss scores of betting models, necessitates applying betting strategies to the model's predictions. Rather than betting equal amounts on each event, I will test strategies that aim to identify and capitalize on over- or undervalued sportsbook odds.
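One simple way to flag potentially undervalued odds is to compare the model's win probability with the payout on offer. The sketch below is only an illustration of that idea, not the strategy ultimately used here; the function names, the toy slate and the 0.01 edge threshold are all assumptions.

```python
def profit_per_unit(american_odds):
    """Profit on a winning 1-unit stake at an American moneyline price."""
    if american_odds < 0:
        return 100 / -american_odds
    return american_odds / 100


def expected_value(model_prob, american_odds):
    """Expected profit per unit staked, given the model's win probability."""
    return model_prob * profit_per_unit(american_odds) - (1 - model_prob)


# Hypothetical slate: (label, model win probability, offered odds)
slate = [("home_fav", 0.55, -110), ("road_dog", 0.40, 150), ("coin_flip", 0.50, -110)]

# Keep only bets whose expected value clears an assumed minimum edge
min_edge = 0.01
value_bets = [label for label, prob, odds in slate
              if expected_value(prob, odds) > min_edge]
```

Under these assumed numbers only the first bet clears the threshold: a 55% win probability at -110 is worth about 0.05 units per unit staked, while 40% at +150 is roughly break-even.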

In [1]:
# Standard Packages
import pandas as pd
from pandas.testing import assert_frame_equal
import numpy as np
import re
import time
import os
import warnings
import pickle

# Viz Packages
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline 

# Scraping packages
import requests
import json
from bs4 import BeautifulSoup
import hockey_scraper

# Modeling Packages
## Modeling Prep
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate, KFold, \
GridSearchCV, RandomizedSearchCV

## SKLearn Data Prep Modules
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder, \
PolynomialFeatures, PowerTransformer, Normalizer, MaxAbsScaler
from sklearn.impute import SimpleImputer

## SKLearn Classification Models
from sklearn.linear_model import LogisticRegression, Ridge, Lasso, ElasticNet
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier,\
ExtraTreesClassifier, VotingClassifier, AdaBoostClassifier, HistGradientBoostingClassifier
from sklearn.svm import SVC

## SKLearn Pipeline Setup
from sklearn.pipeline import Pipeline, make_pipeline, FeatureUnion
from sklearn.compose import ColumnTransformer

## SKLearn Model Optimization
from sklearn.feature_selection import RFE, f_regression, RFECV, SelectKBest

# ## Boosting
# from xgboost import XGBRegressor
from xgboost import XGBClassifier

## SKLearn Metrics
### Classification Scoring/Evaluation
from sklearn.metrics import classification_report, accuracy_score, recall_score, precision_score, f1_score, \
ConfusionMatrixDisplay, log_loss, confusion_matrix, RocCurveDisplay, make_scorer, roc_auc_score

In [2]:
# Notebook Config
from pprintpp import pprint as pp
%reload_ext pprintpp
from tqdm import tqdm
from io import StringIO

## Suppress Python Warnings (Future, Deprecation)
warnings.filterwarnings("ignore", category= FutureWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning)

## Suppress Pandas Warnings (SettingWithCopy)
pd.options.mode.chained_assignment = None

## Pandas Display Config
pd.options.display.max_columns = 80
# pd.options.display.width = None

## Display SKLearn estimators as diagrams
from sklearn import set_config
set_config(display= 'diagram')

## EDA

### Data Retrieval

Data notes:
- Includes offensive, defensive and goaltending stats for both the home and away teams
- For each team, data from three game strength states is used
    - 5v5 even strength (adjusted for score states and venue)
    - 5v4 power play (man advantage)
    - 4v5 penalty kill (man down)
- Static data from prior seasons is used for initial development
- Once the model is built, I will aim to transition to a dynamic/automated ETL process

In [3]:
# Load test/train data 
# Static game log data from prior 3 full seasons
# Team stats for both sides in even strength (adjusted), man up and man down situations

home_5v5_adj = pd.read_csv('data/filtered/filtered-19_22-home-5v5_adj.csv')
home_5v4_pp = pd.read_csv('data/filtered/filtered-19_22-home-pp_5v4.csv')
home_4v5_pk = pd.read_csv('data/filtered/filtered-19_22-home-pk_4v5.csv')

away_5v5_adj = pd.read_csv('data/filtered/filtered-19_22-away-5v5_adj.csv')
away_5v4_pp = pd.read_csv('data/filtered/filtered-19_22-away-pp_5v4.csv')
away_4v5_pk = pd.read_csv('data/filtered/filtered-19_22-away-pk_4v5.csv')

print(home_5v5_adj.shape, home_5v4_pp.shape, home_4v5_pk.shape)
print(away_5v5_adj.shape, away_5v4_pp.shape, away_4v5_pk.shape)

(3262, 33) (3205, 33) (3191, 33)
(3262, 33) (3191, 33) (3205, 33)


In [4]:
home_5v5_adj.head()

Unnamed: 0,Game,Team,Unnamed: 2,TOI,CF/60,CA/60,CF%,FF/60,FA/60,FF%,SF/60,SA/60,SF%,GF/60,GA/60,GF%,xGF/60,xGA/60,xGF%,SCF/60,SCA/60,SCF%,HDCF/60,HDCA/60,HDCF%,HDGF/60,HDGA/60,HDGF%,HDSH%,HDSV%,SH%,SV%,PDO
0,"2019-10-02 - Senators 3, Maple Leafs 5",Toronto Maple Leafs,Limited ReportFull Report,44.133333,78.0,54.02,59.08,59.15,43.75,57.48,43.23,29.01,59.84,5.16,4.18,55.27,3.44,2.22,60.73,34.03,19.18,63.96,15.74,8.32,65.42,5.12,4.2,54.92,55.99,24.79,11.94,85.6,0.975
1,"2019-10-02 - Capitals 3, Blues 2",St Louis Blues,Limited ReportFull Report,50.866667,33.43,47.21,41.45,27.17,39.14,40.97,20.01,28.43,41.31,1.11,1.2,48.19,1.35,2.26,37.48,13.76,23.09,37.34,5.59,6.14,47.66,1.11,0.0,100.0,19.64,100.0,5.56,95.79,1.014
2,"2019-10-02 - Canucks 2, Oilers 3",Edmonton Oilers,Limited ReportFull Report,47.066667,44.76,67.92,39.72,30.38,48.03,38.75,24.1,30.41,44.21,3.58,2.64,57.61,1.71,1.99,46.24,19.81,31.14,38.88,6.04,9.11,39.87,2.4,1.35,64.02,39.29,65.85,14.87,91.33,1.062
3,"2019-10-02 - Sharks 1, Golden Knights 4",Vegas Golden Knights,Limited ReportFull Report,45.666667,62.73,40.18,60.96,51.94,30.07,63.34,32.21,19.82,61.9,2.6,1.33,66.09,3.84,1.4,73.28,32.63,8.95,78.48,16.02,5.27,75.25,1.3,0.0,100.0,12.06,100.0,8.07,93.28,1.013
4,"2019-10-03 - Panthers 2, Lightning 5",Tampa Bay Lightning,Limited ReportFull Report,45.5,46.81,57.6,44.83,32.45,46.56,41.07,25.64,37.57,40.57,2.55,1.33,65.64,1.86,1.69,52.43,19.77,22.62,46.64,11.77,9.26,55.98,1.31,1.33,49.49,19.81,83.02,9.94,96.45,1.064
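The `Game` column also carries the final score, so the naive home-team baseline from the overview can be estimated directly from it. The sketch below assumes the string layout is `<date> - <away team> <away goals>, <home team> <home goals>`, which matches the home rows shown above but is inferred rather than documented; the column names it creates are hypothetical.

```python
import pandas as pd

# Toy rows copied from the `Game` column above
games = pd.Series([
    "2019-10-02 - Senators 3, Maple Leafs 5",
    "2019-10-02 - Capitals 3, Blues 2",
    "2019-10-02 - Canucks 2, Oilers 3",
    "2019-10-02 - Sharks 1, Golden Knights 4",
])

# Assumed layout: '<date> - <away> <away goals>, <home> <home goals>'
pattern = (r"^(?P<date>\d{4}-\d{2}-\d{2}) - (?P<away>.+?) (?P<away_goals>\d+), "
           r"(?P<home>.+?) (?P<home_goals>\d+)$")
parsed = games.str.extract(pattern)
parsed[["away_goals", "home_goals"]] = parsed[["away_goals", "home_goals"]].astype(int)

# 1 if the home team won, 0 otherwise
parsed["home_win"] = (parsed["home_goals"] > parsed["away_goals"]).astype(int)

# Accuracy of always picking the home team equals the home win rate
baseline_accuracy = parsed["home_win"].mean()
```

The same `home_win` column could serve as the binary target for the classifiers, provided the assumed team order holds for every row.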


In [5]:
# team abbreviation dict from hockey scraper package function
# "_descrition": "# All the corresponding tri-codes for team names",

team_dict = {
    'Anaheim Ducks': 'ANA',
    'Arizona Coyotes' : 'ARI',
    'Boston Bruins': 'BOS', 
    'Buffalo Sabres':'BUF',
    'Calgary Flames': 'CGY', 
    'Carolina Hurricanes': 'CAR', 
    'Chicago Blackhawks': 'CHI', 
    'Colorado Avalanche': 'COL',
    'Columbus Blue Jackets': 'CBJ',
    'Dallas Stars': 'DAL',
    'Detroit Red Wings': 'DET',
    'Edmonton Oilers': 'EDM',
    'Florida Panthers': 'FLA',
    'Los Angeles Kings': 'L.A',
    'Minnesota Wild': 'MIN',
    'Montreal Canadiens': 'MTL',
    'Nashville Predators': 'NSH',
    'New Jersey Devils': 'N.J',
    "New York Islanders": 'NYI',
    "New York Rangers": 'NYR',
    'Ottawa Senators': 'OTT',
    'Philadelphia Flyers': 'PHI',
    'Pittsburgh Penguins': 'PIT',
    'San Jose Sharks': 'S.J',
    'Seattle Kraken': 'SEA',
#     'St. Louis Blues': 'STL',
    'St Louis Blues': 'STL',
    'Tampa Bay Lightning': 'T.B',
    'Toronto Maple Leafs': 'TOR',
    'Vancouver Canucks': 'VAN',
    'Vegas Golden Knights':'VGK',
    'Washington Capitals': 'WSH',
    'Winnipeg Jets': 'WPG'
}

In [6]:
len(team_dict) # check to make sure all 32 teams accounted for

32
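Counting the keys confirms there are 32 entries, but not that every spelling in the data has a match (the commented-out `'St. Louis Blues'` entry shows how easily the two can drift). A minimal sketch of that extra check follows; `find_unmapped` and the sample values are hypothetical.

```python
def find_unmapped(names, mapping):
    """Return the sorted names that have no entry in the mapping."""
    return sorted(set(names) - set(mapping))


# Hypothetical subset of the tri-code mapping and of a raw Team column
sample_mapping = {"St Louis Blues": "STL", "Toronto Maple Leafs": "TOR"}
sample_names = ["Toronto Maple Leafs", "St. Louis Blues", "St Louis Blues"]

missing = find_unmapped(sample_names, sample_mapping)
print(missing)
```

On the real data this would be called as `find_unmapped(home_5v5_adj['Team'].unique(), team_dict)` before the names are replaced; an empty list means every team will be abbreviated.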

In [7]:
# 1st round cleaning of data sets
# note: modifies game_log_df in place and also returns it
def process_datasets(game_log_df):
    # extract the date from the game column (first 10 characters, 'YYYY-MM-DD')
    game_log_df['Date'] = pd.to_datetime(game_log_df['Game'].str[:10])
    # swap full team names for abbreviated names (Team column only)
    game_log_df['Team'] = game_log_df['Team'].replace(team_dict)
    # create game log index: one key per team per date
    game_log_df['Game_Key'] = game_log_df['Team'].astype(str)+'_'+game_log_df['Date'].astype(str)
    return game_log_df

In [8]:
process_datasets(home_5v5_adj)
home_5v5_adj

Unnamed: 0,Game,Team,Unnamed: 2,TOI,CF/60,CA/60,CF%,FF/60,FA/60,FF%,SF/60,SA/60,SF%,GF/60,GA/60,GF%,xGF/60,xGA/60,xGF%,SCF/60,SCA/60,SCF%,HDCF/60,HDCA/60,HDCF%,HDGF/60,HDGA/60,HDGF%,HDSH%,HDSV%,SH%,SV%,PDO,Date,Game_Key
0,"2019-10-02 - Senators 3, Maple Leafs 5",TOR,Limited ReportFull Report,44.133333,78.00,54.02,59.08,59.15,43.75,57.48,43.23,29.01,59.84,5.16,4.18,55.27,3.44,2.22,60.73,34.03,19.18,63.96,15.74,8.32,65.42,5.12,4.20,54.92,55.99,24.79,11.94,85.60,0.975,2019-10-02,TOR_2019-10-02
1,"2019-10-02 - Capitals 3, Blues 2",STL,Limited ReportFull Report,50.866667,33.43,47.21,41.45,27.17,39.14,40.97,20.01,28.43,41.31,1.11,1.20,48.19,1.35,2.26,37.48,13.76,23.09,37.34,5.59,6.14,47.66,1.11,0.00,100.00,19.64,100.00,5.56,95.79,1.014,2019-10-02,STL_2019-10-02
2,"2019-10-02 - Canucks 2, Oilers 3",EDM,Limited ReportFull Report,47.066667,44.76,67.92,39.72,30.38,48.03,38.75,24.10,30.41,44.21,3.58,2.64,57.61,1.71,1.99,46.24,19.81,31.14,38.88,6.04,9.11,39.87,2.40,1.35,64.02,39.29,65.85,14.87,91.33,1.062,2019-10-02,EDM_2019-10-02
3,"2019-10-02 - Sharks 1, Golden Knights 4",VGK,Limited ReportFull Report,45.666667,62.73,40.18,60.96,51.94,30.07,63.34,32.21,19.82,61.90,2.60,1.33,66.09,3.84,1.40,73.28,32.63,8.95,78.48,16.02,5.27,75.25,1.30,0.00,100.00,12.06,100.00,8.07,93.28,1.013,2019-10-02,VGK_2019-10-02
4,"2019-10-03 - Panthers 2, Lightning 5",T.B,Limited ReportFull Report,45.500000,46.81,57.60,44.83,32.45,46.56,41.07,25.64,37.57,40.57,2.55,1.33,65.64,1.86,1.69,52.43,19.77,22.62,46.64,11.77,9.26,55.98,1.31,1.33,49.49,19.81,83.02,9.94,96.45,1.064,2019-10-03,T.B_2019-10-03
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3257,"2022-04-29 - Avalanche 1, Wild 4",MIN,Limited ReportFull Report,39.400000,37.86,57.96,39.51,30.65,46.61,39.67,22.45,34.01,39.76,4.44,1.53,74.39,1.95,1.95,50.05,17.22,27.82,38.23,7.71,5.99,56.30,2.98,0.00,100.00,47.83,100.00,19.79,95.50,1.153,2022-04-29,MIN_2022-04-29
3258,"2022-04-29 - Flames 1, Jets 3",WPG,Limited ReportFull Report,47.916667,77.20,57.16,57.46,59.43,45.89,56.43,43.77,33.25,56.83,2.34,1.32,63.87,4.08,2.44,62.52,43.01,20.62,67.59,18.27,9.15,66.62,1.18,1.33,47.08,7.86,83.08,5.34,96.02,1.014,2022-04-29,WPG_2022-04-29
3259,"2022-04-29 - Predators 4, Coyotes 5",ARI,Limited ReportFull Report,48.750000,41.64,74.72,35.78,31.75,58.08,35.34,22.81,32.01,41.60,5.73,5.25,52.18,1.60,3.01,34.78,18.43,36.52,33.54,4.30,13.88,23.67,2.26,3.99,36.15,52.52,51.84,25.11,83.60,1.087,2022-04-29,ARI_2022-04-29
3260,"2022-04-29 - Sharks 0, Kraken 3",SEA,Limited ReportFull Report,52.416667,64.92,32.76,66.46,48.21,26.93,64.16,32.41,21.61,60.00,2.21,0.00,100.00,2.42,1.42,62.98,29.28,12.21,70.56,11.58,4.54,71.82,0.00,0.00,-,0.00,100.00,6.82,100.00,1.068,2022-04-29,SEA_2022-04-29


In [9]:
home_5v5_adj['Team'].value_counts() # all team names abbreviated successfully

WPG    106
DET    106
OTT    106
VGK    106
MTL    106
CBJ    105
ANA    105
S.J    105
NYR    105
NSH    104
PIT    104
NYI    104
BOS    104
PHI    104
STL    104
BUF    104
MIN    104
FLA    104
VAN    104
CHI    103
L.A    103
TOR    103
N.J    103
DAL    103
T.B    103
EDM    103
CGY    102
ARI    102
WSH    102
COL    102
CAR    102
SEA     41
Name: Team, dtype: int64

In [10]:
print(sum(home_5v5_adj['Team'].value_counts()))

3262


In [11]:
# apply same transformation to other datasets
process_datasets(home_5v4_pp)
process_datasets(home_4v5_pk)
process_datasets(away_5v5_adj)
process_datasets(away_5v4_pp)
process_datasets(away_4v5_pk)

Unnamed: 0,Game,Team,Unnamed: 2,TOI,CF/60,CA/60,CF%,FF/60,FA/60,FF%,SF/60,SA/60,SF%,GF/60,GA/60,GF%,xGF/60,xGA/60,xGF%,SCF/60,SCA/60,SCF%,HDCF/60,HDCA/60,HDCF%,HDGF/60,HDGA/60,HDGF%,HDSH%,HDSV%,SH%,SV%,PDO,Date,Game_Key
0,"2019-10-02 - Senators 3, Maple Leafs 5",OTT,Limited ReportFull Report,8.166667,14.69,146.94,9.09,14.69,124.9,10.53,14.69,73.47,16.67,0,7.35,0.00,1.86,7.47,19.93,7.35,66.12,10.00,7.35,7.35,50.00,0,0,-,0.00,100.00,0.00,90.00,0.900,2019-10-02,OTT_2019-10-02
1,"2019-10-02 - Capitals 3, Blues 2",WSH,Limited ReportFull Report,3.250000,18.46,92.31,16.67,18.46,73.85,20.00,18.46,36.92,33.33,0,18.46,0.00,0.67,6.44,9.38,0,18.46,0.00,0,18.46,0.00,0,0,-,-,-,0.00,50.00,0.500,2019-10-02,WSH_2019-10-02
2,"2019-10-02 - Canucks 2, Oilers 3",VAN,Limited ReportFull Report,4.000000,0,120,0.00,0,45,0.00,0,45,0.00,0,0,-,0,4.6,0.00,0,45,0.00,0,15,0.00,0,0,-,-,100.00,-,100.00,-,2019-10-02,VAN_2019-10-02
3,"2019-10-02 - Sharks 1, Golden Knights 4",S.J,Limited ReportFull Report,4.333333,0,152.31,0.00,0,124.62,0.00,0,83.08,0.00,0,13.85,0.00,0,8.96,0.00,0,69.23,0.00,0,13.85,0.00,0,0,-,-,-,-,83.33,-,2019-10-02,S.J_2019-10-02
4,"2019-10-03 - Panthers 2, Lightning 5",FLA,Limited ReportFull Report,4.233333,14.17,99.21,12.50,14.17,99.21,12.50,14.17,99.21,12.50,14.17,14.17,50.00,1.08,10.54,9.28,14.17,42.52,25.00,0,28.35,0.00,0,14.17,0.00,-,50.00,100.00,85.71,1.857,2019-10-03,FLA_2019-10-03
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3200,"2022-04-29 - Avalanche 1, Wild 4",COL,Limited ReportFull Report,9.983333,0,84.14,0.00,0,54.09,0.00,0,36.06,0.00,0,0,-,0,3.6,0.00,0,36.06,0.00,0,6.01,0.00,0,0,-,-,-,-,100.00,-,2022-04-29,COL_2022-04-29
3201,"2022-04-29 - Flames 1, Jets 3",CGY,Limited ReportFull Report,5.400000,11.11,100,10.00,11.11,77.78,12.50,11.11,55.56,16.67,0,0,-,0.11,9.52,1.13,0,66.67,0.00,0,33.33,0.00,0,0,-,-,100.00,0.00,100.00,1.000,2022-04-29,CGY_2022-04-29
3202,"2022-04-29 - Predators 4, Coyotes 5",NSH,Limited ReportFull Report,5.950000,10.08,110.92,8.33,0,80.67,0.00,0,60.5,0.00,0,0,-,0,15.61,0.00,0,100.84,0.00,0,50.42,0.00,0,0,-,-,100.00,-,100.00,-,2022-04-29,NSH_2022-04-29
3203,"2022-04-29 - Sharks 0, Kraken 3",S.J,Limited ReportFull Report,2.000000,0,150,0.00,0,90,0.00,0,90,0.00,0,0,-,0,12.69,0.00,0,120,0.00,0,60,0.00,0,0,-,-,100.00,-,100.00,-,2022-04-29,S.J_2022-04-29


In [13]:
# check nulls
away_5v5_adj.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3262 entries, 0 to 3261
Data columns (total 35 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   Game        3262 non-null   object        
 1   Team        3262 non-null   object        
 2   Unnamed: 2  3262 non-null   object        
 3   TOI         3262 non-null   float64       
 4   CF/60       3262 non-null   float64       
 5   CA/60       3262 non-null   float64       
 6   CF%         3262 non-null   float64       
 7   FF/60       3262 non-null   float64       
 8   FA/60       3262 non-null   float64       
 9   FF%         3262 non-null   float64       
 10  SF/60       3262 non-null   float64       
 11  SA/60       3262 non-null   float64       
 12  SF%         3262 non-null   float64       
 13  GF/60       3262 non-null   float64       
 14  GA/60       3262 non-null   float64       
 15  GF%         3262 non-null   object        
 16  xGF/60      3262 non-nul

In [14]:
all_cols = (home_5v5_adj.columns.to_list())
pp(all_cols)

[
    'Game',
    'Team',
    'Unnamed: 2',
    'TOI',
    'CF/60',
    'CA/60',
    'CF%',
    'FF/60',
    'FA/60',
    'FF%',
    'SF/60',
    'SA/60',
    'SF%',
    'GF/60',
    'GA/60',
    'GF%',
    'xGF/60',
    'xGA/60',
    'xGF%',
    'SCF/60',
    'SCA/60',
    'SCF%',
    'HDCF/60',
    'HDCA/60',
    'HDCF%',
    'HDGF/60',
    'HDGA/60',
    'HDGF%',
    'HDSH%',
    'HDSV%',
    'SH%',
    'SV%',
    'PDO',
    'Date',
    'Game_Key',
]


In [15]:
# Even strength features list
# comment out features to disregard
ev_features_list = [
    'Game',
    'Team',
    'TOI',
#     'CF/60',
#     'CA/60',
#     'CF%',
    'FF/60',
    'FA/60',
    'FF%',
#     'SF/60',
#     'SA/60',
#     'SF%',
    'GF/60',
    'GA/60',
    'GF%',
    'xGF/60',
    'xGA/60',
    'xGF%',
#     'SCF/60',
#     'SCA/60',
#     'SCF%',
    'HDCF/60',
    'HDCA/60',
    'HDCF%',
#     'HDGF/60',
#     'HDGA/60',
#     'HDGF%',
    'HDSH%',
    'HDSV%',
    'SH%',
    'SV%',
    'PDO',
    'Date',
    'Game_Key',
]

In [16]:
pp_cols = (home_5v4_pp.columns.to_list())
print(pp_cols)

['Game', 'Team', 'Unnamed: 2', 'TOI', 'CF/60', 'CA/60', 'CF%', 'FF/60', 'FA/60', 'FF%', 'SF/60', 'SA/60', 'SF%', 'GF/60', 'GA/60', 'GF%', 'xGF/60', 'xGA/60', 'xGF%', 'SCF/60', 'SCA/60', 'SCF%', 'HDCF/60', 'HDCA/60', 'HDCF%', 'HDGF/60', 'HDGA/60', 'HDGF%', 'HDSH%', 'HDSV%', 'SH%', 'SV%', 'PDO', 'Date', 'Game_Key']


In [17]:
# PP df features list
# comment out features to disregard
pp_features_list = [
#     'Game',
#     'Team',
    'TOI',
#     'CF/60',
#     'CA/60',
#     'CF%',
#     'FF/60',
#     'FA/60',
#     'FF%',
#     'SF/60',
#     'SA/60',
#     'SF%',
    'GF/60',
#     'GA/60',
#     'GF%',
    'xGF/60',
#     'xGA/60',
#     'xGF%',
#     'SCF/60',
#     'SCA/60',
#     'SCF%',
#     'HDCF/60',
#     'HDCA/60',
#     'HDCF%',
#     'HDGF/60',
#     'HDGA/60',
#     'HDGF%',
#     'HDSH%',
#     'HDSV%',
#     'SH%',
#     'SV%',
#     'PDO',
#     'Date',
    'Game_Key',
]

In [18]:
# PK df features list
# same as pp df but swap GFs for GAs
pk_features_list = ['TOI', 'GA/60', 'xGA/60', 'Game_Key']

In [19]:
home_5v4_pp['TOI']

0       8.166667
1       3.250000
2       4.000000
3       4.333333
4       4.233333
          ...   
3200    9.983333
3201    5.400000
3202    5.950000
3203    2.000000
3204    2.000000
Name: TOI, Length: 3205, dtype: float64

In [20]:
home_5v4_pp['xGF/60']

0        7.47
1        6.44
2         4.6
3        8.96
4       10.54
        ...  
3200      3.6
3201     9.52
3202    15.61
3203    12.69
3204     5.95
Name: xGF/60, Length: 3205, dtype: object

In [21]:
home_5v4_pp['GF/60']

0        7.35
1       18.46
2           0
3       13.85
4       14.17
        ...  
3200        0
3201        0
3202        0
3203        0
3204        0
Name: GF/60, Length: 3205, dtype: object

inputs:
- stat dfs
  - home_5v5_adj, home_5v4_pp, home_4v5_pk
  - away_5v5_adj, away_5v4_pp, away_4v5_pk
- feature lists
  - ev_features_list, pp_features_list, pk_features_list

In [22]:
# merge team stats (5v5_adj, 5v4_pp, 4v5_pk game states) for each game, using the feature column lists defined above
def merge_strength_states(ev_df, pp_df, pk_df, ev_features, pp_features, pk_features):
    # left merge the 5v5 and 5v4 dfs on 'Game_Key', w/ respective features, add suffixes for shared cols 
    even_pp_merged = pd.merge(ev_df[ev_features], pp_df[pp_features],
                              on = 'Game_Key', how = 'left', suffixes=('', '_pp'))

    # left merge that df with the 4v5_pk df on 'Game_Key', w/ selected columns, add suffixes for shared cols
    all_states_merged = pd.merge(even_pp_merged, pk_df[pk_features], 
                                  on = 'Game_Key', how = 'left', suffixes = ('', '_pk'))

    return all_states_merged
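Because the power play and penalty kill logs have fewer rows than the 5v5 logs (see the shapes printed earlier), a left merge can leave some games without special-teams values. The sketch below shows one way to surface that, using the `validate` and `indicator` options of `pd.merge`; the toy frames are hypothetical stand-ins for the real game logs.

```python
import pandas as pd

# Toy stand-ins for the 5v5 and power play logs
ev = pd.DataFrame({
    "Game_Key": ["TOR_2019-10-02", "STL_2019-10-02", "EDM_2019-10-02"],
    "TOI": [44.1, 50.9, 47.1],
})
pp = pd.DataFrame({
    "Game_Key": ["TOR_2019-10-02", "EDM_2019-10-02"],
    "TOI": [8.2, 4.0],
})

merged = pd.merge(
    ev, pp, on="Game_Key", how="left", suffixes=("", "_pp"),
    validate="one_to_one",  # raises MergeError if a Game_Key repeats on either side
    indicator=True,         # adds a '_merge' column marking where each row came from
)

# Games present in the 5v5 log but absent from the power play log
unmatched = merged.loc[merged["_merge"] == "left_only", "Game_Key"].tolist()
```

Rows listed in `unmatched` end up with `NaN` in the `_pp` columns, so they need an explicit decision (fill, drop, or impute) before modeling.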


In [23]:
# test function on the home dfs
home_merged = merge_strength_states(home_5v5_adj, home_5v4_pp, home_4v5_pk, 
                                              ev_features_list, pp_features_list, pk_features_list)
home_merged

Unnamed: 0,Game,Team,TOI,FF/60,FA/60,FF%,GF/60,GA/60,GF%,xGF/60,xGA/60,xGF%,HDCF/60,HDCA/60,HDCF%,HDSH%,HDSV%,SH%,SV%,PDO,Date,Game_Key,TOI_pp,GF/60_pp,xGF/60_pp,TOI_pk,GA/60_pk,xGA/60_pk
0,"2019-10-02 - Senators 3, Maple Leafs 5",TOR,44.133333,59.15,43.75,57.48,5.16,4.18,55.27,3.44,2.22,60.73,15.74,8.32,65.42,55.99,24.79,11.94,85.60,0.975,2019-10-02,TOR_2019-10-02,8.166667,7.35,7.47,6.000000,0,2.77
1,"2019-10-02 - Capitals 3, Blues 2",STL,50.866667,27.17,39.14,40.97,1.11,1.20,48.19,1.35,2.26,37.48,5.59,6.14,47.66,19.64,100.00,5.56,95.79,1.014,2019-10-02,STL_2019-10-02,3.250000,18.46,6.44,5.883333,10.2,7.62
2,"2019-10-02 - Canucks 2, Oilers 3",EDM,47.066667,30.38,48.03,38.75,3.58,2.64,57.61,1.71,1.99,46.24,6.04,9.11,39.87,39.29,65.85,14.87,91.33,1.062,2019-10-02,EDM_2019-10-02,4.000000,0,4.6,8.000000,0,9.09
3,"2019-10-02 - Sharks 1, Golden Knights 4",VGK,45.666667,51.94,30.07,63.34,2.60,1.33,66.09,3.84,1.40,73.28,16.02,5.27,75.25,12.06,100.00,8.07,93.28,1.013,2019-10-02,VGK_2019-10-02,4.333333,13.85,8.96,6.983333,0,5.33
4,"2019-10-03 - Panthers 2, Lightning 5",T.B,45.500000,32.45,46.56,41.07,2.55,1.33,65.64,1.86,1.69,52.43,11.77,9.26,55.98,19.81,83.02,9.94,96.45,1.064,2019-10-03,T.B_2019-10-03,4.233333,14.17,10.54,6.766667,0,4.25
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3257,"2022-04-29 - Avalanche 1, Wild 4",MIN,39.400000,30.65,46.61,39.67,4.44,1.53,74.39,1.95,1.95,50.05,7.71,5.99,56.30,47.83,100.00,19.79,95.50,1.153,2022-04-29,MIN_2022-04-29,9.983333,0,3.6,7.416667,0,3.99
3258,"2022-04-29 - Flames 1, Jets 3",WPG,47.916667,59.43,45.89,56.43,2.34,1.32,63.87,4.08,2.44,62.52,18.27,9.15,66.62,7.86,83.08,5.34,96.02,1.014,2022-04-29,WPG_2022-04-29,5.400000,0,9.52,4.000000,0,12.63
3259,"2022-04-29 - Predators 4, Coyotes 5",ARI,48.750000,31.75,58.08,35.34,5.73,5.25,52.18,1.60,3.01,34.78,4.30,13.88,23.67,52.52,51.84,25.11,83.60,1.087,2022-04-29,ARI_2022-04-29,5.950000,0,15.61,3.950000,0,9.76
3260,"2022-04-29 - Sharks 0, Kraken 3",SEA,52.416667,48.21,26.93,64.16,2.21,0.00,100.00,2.42,1.42,62.98,11.58,4.54,71.82,0.00,100.00,6.82,100.00,1.068,2022-04-29,SEA_2022-04-29,2.000000,0,12.69,1.600000,0,11.05


In [24]:
# merge away df
away_merged = merge_strength_states(away_5v5_adj, away_5v4_pp, away_4v5_pk, 
                                              ev_features_list, pp_features_list, pk_features_list)
away_merged

Unnamed: 0,Game,Team,TOI,FF/60,FA/60,FF%,GF/60,GA/60,GF%,xGF/60,xGA/60,xGF%,HDCF/60,HDCA/60,HDCF%,HDSH%,HDSV%,SH%,SV%,PDO,Date,Game_Key,TOI_pp,GF/60_pp,xGF/60_pp,TOI_pk,GA/60_pk,xGA/60_pk
0,"2019-10-02 - Senators 3, Maple Leafs 5",OTT,44.133333,43.75,59.15,42.52,4.18,5.16,44.73,2.22,3.44,39.27,8.32,15.74,34.58,75.21,44.01,14.40,88.06,1.025,2019-10-02,OTT_2019-10-02,6.000000,0,2.77,8.166667,7.35,7.47
1,"2019-10-02 - Capitals 3, Blues 2",WSH,50.866667,39.14,27.17,59.03,1.20,1.11,51.81,2.26,1.35,62.52,6.14,5.59,52.34,0.00,80.36,4.21,94.44,0.986,2019-10-02,WSH_2019-10-02,5.883333,10.2,7.62,3.250000,18.46,6.44
2,"2019-10-02 - Canucks 2, Oilers 3",VAN,47.066667,48.03,30.38,61.25,2.64,3.58,42.39,1.99,1.71,53.76,9.11,6.04,60.13,34.15,60.71,8.67,85.13,0.938,2019-10-02,VAN_2019-10-02,8.000000,0,9.09,4.000000,0,4.6
3,"2019-10-02 - Sharks 1, Golden Knights 4",S.J,45.666667,30.07,51.94,36.66,1.33,2.60,33.91,1.40,3.84,26.72,5.27,16.02,24.75,0.00,87.94,6.72,91.93,0.987,2019-10-02,S.J_2019-10-02,6.983333,0,5.33,4.333333,13.85,8.96
4,"2019-10-03 - Panthers 2, Lightning 5",FLA,45.500000,46.56,32.45,58.93,1.33,2.55,34.36,1.69,1.86,47.57,9.26,11.77,44.02,16.98,80.19,3.55,90.06,0.936,2019-10-03,FLA_2019-10-03,6.766667,0,4.25,4.233333,14.17,10.54
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3257,"2022-04-29 - Avalanche 1, Wild 4",COL,39.400000,46.61,30.65,60.33,1.53,4.44,25.61,1.95,1.95,49.95,5.99,7.71,43.70,0.00,52.17,4.50,80.21,0.847,2022-04-29,COL_2022-04-29,7.416667,0,3.99,9.983333,0,3.6
3258,"2022-04-29 - Flames 1, Jets 3",CGY,47.916667,45.89,59.43,43.57,1.32,2.34,36.13,2.44,4.08,37.48,9.15,18.27,33.38,16.92,92.14,3.98,94.66,0.986,2022-04-29,CGY_2022-04-29,4.000000,0,12.63,5.400000,0,9.52
3259,"2022-04-29 - Predators 4, Coyotes 5",NSH,48.750000,58.08,31.75,64.66,5.25,5.73,47.82,3.01,1.60,65.22,13.88,4.30,76.33,48.16,47.48,16.40,74.89,0.913,2022-04-29,NSH_2022-04-29,3.950000,0,9.76,5.950000,0,15.61
3260,"2022-04-29 - Sharks 0, Kraken 3",S.J,52.416667,26.93,48.21,35.84,0.00,2.21,0.00,1.42,2.42,37.02,4.54,11.58,28.18,0.00,100.00,0.00,93.18,0.932,2022-04-29,S.J_2022-04-29,1.600000,0,11.05,2.000000,0,12.69


In [25]:
print(ev_features_list)
print(pp_features_list)
print(pk_features_list)

['Game', 'Team', 'TOI', 'FF/60', 'FA/60', 'FF%', 'GF/60', 'GA/60', 'GF%', 'xGF/60', 'xGA/60', 'xGF%', 'HDCF/60', 'HDCA/60', 'HDCF%', 'HDSH%', 'HDSV%', 'SH%', 'SV%', 'PDO', 'Date', 'Game_Key']
['TOI', 'GF/60', 'xGF/60', 'Game_Key']
['TOI', 'GA/60', 'xGA/60', 'Game_Key']


In [26]:
# concat both dfs in an alternating pattern, so that each home team row is followed by its away team row
if len(home_merged) != len(away_merged):
    print("home_merged and away_merged have different lengths; cannot interleave rows.")
else:
    # Use a list comprehension to pair up corresponding rows from home_merged and away_merged
    concatenated_rows = [pd.concat([home_merged.iloc[[i]], 
                                    away_merged.iloc[[i]]], axis=0) for i in range(len(home_merged))]
    
    # Concatenate all the row pairs into a single DataFrame
    home_away_stats = pd.concat(concatenated_rows, axis=0, ignore_index=True)
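The row-by-row concatenation above builds thousands of one-row frames. A possible vectorized alternative, sketched here under the assumption that both frames are already aligned row for row, stacks the two frames and relies on a stable sort of the index to interleave them. `interleave_rows` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd


def interleave_rows(first_df, second_df):
    """Alternate rows: first_df row 0, second_df row 0, first_df row 1, ..."""
    if len(first_df) != len(second_df):
        raise ValueError("both frames must have the same number of rows")
    stacked = pd.concat([first_df.reset_index(drop=True),
                         second_df.reset_index(drop=True)])
    # mergesort is stable, so for equal index values the first_df row stays first
    return stacked.sort_index(kind="mergesort").reset_index(drop=True)


# Toy frames standing in for home_merged and away_merged
home = pd.DataFrame({"Team": ["TOR", "STL"], "GF/60": [5.16, 1.11]})
away = pd.DataFrame({"Team": ["OTT", "WSH"], "GF/60": [4.18, 1.20]})

combined = interleave_rows(home, away)
```

With the real frames this would be `interleave_rows(home_merged, away_merged)`, giving the same home-then-away ordering with a fresh 0..n-1 index.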

In [28]:
home_away_stats.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6524 entries, 0 to 6523
Data columns (total 28 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   Game       6524 non-null   object        
 1   Team       6524 non-null   object        
 2   TOI        6524 non-null   float64       
 3   FF/60      6524 non-null   float64       
 4   FA/60      6524 non-null   float64       
 5   FF%        6524 non-null   float64       
 6   GF/60      6524 non-null   float64       
 7   GA/60      6524 non-null   float64       
 8   GF%        6524 non-null   object        
 9   xGF/60     6524 non-null   float64       
 10  xGA/60     6524 non-null   float64       
 11  xGF%       6524 non-null   float64       
 12  HDCF/60    6524 non-null   float64       
 13  HDCA/60    6524 non-null   float64       
 14  HDCF%      6524 non-null   float64       
 15  HDSH%      6524 non-null   object        
 16  HDSV%      6524 non-null   object        


In [30]:
# why are some numeric columns stored as objects that cannot be converted to numeric?
print(home_away_stats[pd.to_numeric(home_away_stats['GF%'], errors='coerce').isnull()])


                                            Game Team        TOI  FF/60  \
224             2019-10-19 - Canucks 0, Devils 1  N.J  37.300000  22.55   
225             2019-10-19 - Canucks 0, Devils 1  VAN  37.300000  29.93   
234    2019-10-19 - Golden Knights 3, Penguins 0  PIT  46.800000  36.75   
235    2019-10-19 - Golden Knights 3, Penguins 0  VGK  46.800000  31.72   
252                2019-10-20 - Oilers 0, Jets 1  WPG  48.333333  24.03   
...                                          ...  ...        ...    ...   
6027          2022-03-29 - Avalanche 2, Flames 1  COL  45.066667  42.59   
6136             2022-04-05 - Oilers 2, Sharks 1  S.J  50.850000  47.50   
6137             2022-04-05 - Oilers 2, Sharks 1  EDM  50.850000  40.88   
6244          2022-04-12 - Sharks 0, Predators 1  NSH  54.000000  59.16   
6245          2022-04-12 - Sharks 0, Predators 1  S.J  54.000000  34.40   

      FA/60    FF%  GF/60  GA/60 GF%  xGF/60  xGA/60   xGF%  HDCF/60  HDCA/60  \
224   29.93  42.97

In [31]:
# some columns have a value of '-' when the denominator is 0
# convert all of these values to 0
home_away_stats = home_away_stats.replace('-', 0)
home_away_stats.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6524 entries, 0 to 6523
Data columns (total 28 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   Game       6524 non-null   object        
 1   Team       6524 non-null   object        
 2   TOI        6524 non-null   float64       
 3   FF/60      6524 non-null   float64       
 4   FA/60      6524 non-null   float64       
 5   FF%        6524 non-null   float64       
 6   GF/60      6524 non-null   float64       
 7   GA/60      6524 non-null   float64       
 8   GF%        6524 non-null   object        
 9   xGF/60     6524 non-null   float64       
 10  xGA/60     6524 non-null   float64       
 11  xGF%       6524 non-null   float64       
 12  HDCF/60    6524 non-null   float64       
 13  HDCA/60    6524 non-null   float64       
 14  HDCF%      6524 non-null   float64       
 15  HDSH%      6524 non-null   object        
 16  HDSV%      6524 non-null   object        


In [32]:
object_columns = home_away_stats.select_dtypes(include=['object']).columns.tolist()
object_columns

[
  'Game',
  'Team',
  'GF%',
  'HDSH%',
  'HDSV%',
  'Game_Key',
  'GF/60_pp',
  'xGF/60_pp',
  'GA/60_pk',
  'xGA/60_pk',
]

In [36]:
float_cols = ['GF%', 'HDSH%', 'HDSV%', 'GF/60_pp', 'xGF/60_pp', 'GA/60_pk', 'xGA/60_pk']
home_away_stats[float_cols] = home_away_stats[float_cols].astype(float)
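An alternative route to the same float columns is to let `pd.to_numeric` coerce the `'-'` placeholders to `NaN` first, which makes it possible to count how many values were undefined before deciding how to fill them. This is a sketch on a toy frame; filling with 0 at the end simply mirrors the choice made above.

```python
import pandas as pd

# Toy frame mimicking the object-typed stat columns, with '-' for undefined ratios
stats = pd.DataFrame({
    "GF%": ["55.27", "-", "100.00"],
    "HDSV%": ["24.79", "100.00", "-"],
})

# '-' cannot be parsed as a number, so it becomes NaN
numeric = stats.apply(pd.to_numeric, errors="coerce")

# Number of undefined values per column
n_placeholders = numeric.isna().sum()

# Same end state as replacing '-' with 0 and casting to float
filled = numeric.fillna(0.0)
```

Keeping the `NaN` stage visible also leaves room to try other treatments later, such as imputing a neutral value for percentage columns where 0 may be misleading.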


In [37]:
home_away_stats.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6524 entries, 0 to 6523
Data columns (total 28 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   Game       6524 non-null   object        
 1   Team       6524 non-null   object        
 2   TOI        6524 non-null   float64       
 3   FF/60      6524 non-null   float64       
 4   FA/60      6524 non-null   float64       
 5   FF%        6524 non-null   float64       
 6   GF/60      6524 non-null   float64       
 7   GA/60      6524 non-null   float64       
 8   GF%        6524 non-null   float64       
 9   xGF/60     6524 non-null   float64       
 10  xGA/60     6524 non-null   float64       
 11  xGF%       6524 non-null   float64       
 12  HDCF/60    6524 non-null   float64       
 13  HDCA/60    6524 non-null   float64       
 14  HDCF%      6524 non-null   float64       
 15  HDSH%      6524 non-null   float64       
 16  HDSV%      6524 non-null   float64       


### Add NHL.com schedule/result df


In [38]:
# functions from the Hockey Scraper API
# modified to retrieve additional info 
"""
This module contains functions to scrape the json schedule for any games or date range
"""

from datetime import datetime, timedelta
import json
import time
import hockey_scraper.utils.shared as shared


# TODO: Currently rescraping page each time since the status of some games may have changed
# (e.g. Scraped on 2020-01-20 and game on 2020-01-21 was not Final...when use old page again will still think not Final)
# Need to find a more elegant way of doing this (Metadata???)
def get_schedule(date_from, date_to):
    """
    Scrapes games in date range
    Ex: https://statsapi.web.nhl.com/api/v1/schedule?startDate=2010-10-03&endDate=2011-06-20
    
    :param date_from: scrape from this date
    :param date_to: scrape until this date
    
    :return: raw json of schedule of date range
    """
    page_info = {
        "url": 'https://statsapi.web.nhl.com/api/v1/schedule?startDate={a}&endDate={b}'.format(a=date_from, b=date_to),
        "name": date_from + "_" + date_to,
        "type": "json_schedule",
        "season": shared.get_season(date_from),
    }

    return json.loads(shared.get_file(page_info, force=True))


def chunk_schedule_calls(from_date, to_date):
    """
    The schedule endpoint struggles with big date ranges, so instead it is called in increments of n days.
    
    :param from_date: scrape from this date
    :param to_date: scrape until this date

    :return: list holding the raw json 'dates' entries for each chunk of the date range
    """
    sched = []
    days_per_call = 30

    from_date = datetime.strptime(from_date, "%Y-%m-%d") 
    to_date = datetime.strptime(to_date, "%Y-%m-%d")
    num_days = (to_date - from_date).days + 1  # +1 since difference is looking for total number of days

    for offset in range(0, num_days, days_per_call):
        f_chunk = datetime.strftime(from_date + timedelta(days=offset), "%Y-%m-%d")

        # Take the min so the final chunk does not overshoot to_date when the range is not a multiple of days_per_call
        t_chunk = datetime.strftime(from_date + timedelta(days=min(num_days-1, offset+days_per_call-1)), "%Y-%m-%d")

        chunk_sched = get_schedule(f_chunk, t_chunk)
        sched.append(chunk_sched['dates'])

    return sched


def get_dates(games):
    """
    Given a list of game_ids, return the date of each game.

    We sort the games and retrieve the schedule from the start of the earliest game's season
    until the end of the most recent game's season.
    
    :param games: list of game_ids, e.g. 2016020001
    
    :return: list of schedule entries (including game_id and date) for the requested games
    """
    today = datetime.today()

    # Determine oldest and newest game
    games = list(map(str, games))
    games.sort()

    date_from = shared.season_start_bound(games[0][:4])
    year_to = int(games[-1][:4])

    # If the last game is part of the ongoing season then only request the schedule until Today
    # We get strange errors if we don't do it like this
    if year_to == shared.get_season(datetime.strftime(today, "%Y-%m-%d")):
        date_to = '-'.join([str(today.year), str(today.month), str(today.day)])
    else:
        date_to = datetime.strftime(shared.season_end_bound(year_to+1), "%Y-%m-%d")  # Newest game in sample

    # TODO: Assume true is live here -> Workaround
    schedule = scrape_schedule(date_from, date_to, preseason=True, not_over=True)

    # Only return games we want in range
    games_list = []
    for game in schedule:
        if str(game['game_id']) in games:
            games_list.extend([game])
    return games_list


def scrape_schedule(date_from, date_to, preseason=False, not_over=False):
    """
    Calls chunk_schedule_calls and parses the raw schedule json
    
    :param date_from: scrape from this date
    :param date_to: scrape until this date
    :param preseason: Boolean indicating whether to include preseason games (default is False)
    :param not_over: Boolean indicating whether to include games that are not finished,
                     i.e. relax the requirement that the game status is Final (default is False)
    
    :return: list of dicts, one per game, with ids, date, scores, teams, and records
    """
    schedule = []
    schedule_json = chunk_schedule_calls(date_from, date_to)

    for chunk in schedule_json:
        for day in chunk:
            for game in day['games']:
                if game['status']['detailedState'] == 'Final' or not_over:
                    game_id = int(str(game['gamePk'])[5:])
                    # game_type is added below so non regular season games can be filtered out later
                    if (game_id >= 20000 or preseason) and game_id < 40000:
                        schedule.append({
                            "game_id": game['gamePk'],
                            "game_type": game['gameType'],
                            "season_id": game['season'],
                            "date": day['date'],
                            "home_score": game['teams']['home'].get("score"),
                            "away_score": game['teams']['away'].get("score"),
                            # "start_time": datetime.strptime(game['gameDate'][:-1], "%Y-%m-%dT%H:%M:%S"),
                            # "venue": game['venue'].get('name'),
                            "home_team_id": game['teams']['home']['team']['id'],
                            "home_team": shared.get_team(game['teams']['home']['team']['name']),
                            "home_wins": game['teams']['home'].get("leagueRecord").get("wins"),
                            "home_losses": game['teams']['home'].get("leagueRecord").get("losses"),
                            "home_otl": game['teams']['home'].get("leagueRecord").get("ot"),
                            "away_wins": game['teams']['away'].get("leagueRecord").get("wins"),
                            "away_losses": game['teams']['away'].get("leagueRecord").get("losses"),
                            "away_otl": game['teams']['away'].get("leagueRecord").get("ot"),
                            "away_team_id": game['teams']['away']['team']['id'],
                            "away_team": shared.get_team(game['teams']['away']['team']['name']),
                            "status": game["status"]["abstractGameState"]
                        })


    return schedule
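
The date-window arithmetic in `chunk_schedule_calls` can be checked offline, without calling the NHL endpoint. The sketch below reproduces only that arithmetic; `chunk_date_range` is a hypothetical helper name and is not part of Hockey Scraper.

```python
from datetime import datetime, timedelta


def chunk_date_range(from_date, to_date, days_per_call=30):
    """Split an inclusive date range into consecutive (start, end) string pairs.

    Hypothetical helper: mirrors the window arithmetic of chunk_schedule_calls
    but makes no network requests.
    """
    fmt = "%Y-%m-%d"
    start = datetime.strptime(from_date, fmt)
    end = datetime.strptime(to_date, fmt)
    num_days = (end - start).days + 1  # inclusive of both endpoints

    chunks = []
    for offset in range(0, num_days, days_per_call):
        # min() keeps the final window from running past to_date
        last_offset = min(num_days - 1, offset + days_per_call - 1)
        chunks.append((
            (start + timedelta(days=offset)).strftime(fmt),
            (start + timedelta(days=last_offset)).strftime(fmt),
        ))
    return chunks


chunks_1920 = chunk_date_range("2019-10-02", "2020-03-11")
for chunk_start, chunk_end in chunks_1920:
    print(chunk_start, chunk_end)
```

Printing the windows this way is a quick check that consecutive chunks neither overlap nor leave gaps before any requests are made.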

NHL regular season date boundaries by season:
- 2019-2020: October 2, 2019 - March 11, 2020
  - Covid disrupted the schedule, so make sure only regular season games are included
- 2020-2021: January 13, 2021 - May 19, 2021
  - same potential issue as noted above
- 2021-2022: October 12, 2021 - May 1, 2022
- 2022-2023: October 7, 2022 - April 14, 2023
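
One way to keep these boundaries in a single place is a small lookup table. This is only a sketch: `SEASON_BOUNDS` and `season_for_date` are illustrative names that do not appear elsewhere in the notebook, and the keys are assumed to follow the `season_id` format returned by the schedule endpoint.

```python
from datetime import date

# Regular season boundaries copied from the list above (inclusive on both ends).
# SEASON_BOUNDS and season_for_date are illustrative names, not part of the notebook.
SEASON_BOUNDS = {
    "20192020": (date(2019, 10, 2), date(2020, 3, 11)),
    "20202021": (date(2021, 1, 13), date(2021, 5, 19)),
    "20212022": (date(2021, 10, 12), date(2022, 5, 1)),
    "20222023": (date(2022, 10, 7), date(2023, 4, 14)),
}


def season_for_date(game_date):
    """Return the season_id whose regular season window contains game_date, else None."""
    day = date.fromisoformat(game_date)
    for season_id, (start, end) in SEASON_BOUNDS.items():
        if start <= day <= end:
            return season_id
    return None


print(season_for_date("2020-01-24"))  # inside the 2019-2020 window
print(season_for_date("2020-08-01"))  # outside every regular season window
```

A date inside a window does not by itself make a game a regular season game, so `game_type` still needs to be checked.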

In [39]:
# run modified scrape_schedule for the 19-20, 20-21, 21-22, and 22-23 seasons individually
# keeping the seasons separate makes the imputed features easier to process
schedule_1920 = pd.DataFrame(scrape_schedule('2019-10-02', '2020-03-11', preseason=False, not_over=False))
schedule_2021 = pd.DataFrame(scrape_schedule('2021-01-13', '2021-05-19', preseason=False, not_over=False))
schedule_2122 = pd.DataFrame(scrape_schedule('2021-10-12', '2022-05-01', preseason=False, not_over=False))
schedule_2223 = pd.DataFrame(scrape_schedule('2022-10-07', '2023-04-14', preseason=False, not_over=False))
schedule_1920

Unnamed: 0,game_id,game_type,season_id,date,home_score,away_score,home_team_id,home_team,home_wins,home_losses,home_otl,away_wins,away_losses,away_otl,away_team_id,away_team,status
0,2019020001,R,20192020,2019-10-02,5,3,10,TOR,1,0,0.0,0,1,0.0,9,OTT,Final
1,2019020002,R,20192020,2019-10-02,2,3,19,STL,0,0,1.0,1,0,0.0,15,WSH,Final
2,2019020003,R,20192020,2019-10-02,3,2,22,EDM,1,0,0.0,0,1,0.0,23,VAN,Final
3,2019020004,R,20192020,2019-10-02,4,1,54,VGK,1,0,0.0,0,1,0.0,28,S.J,Final
4,2019020005,R,20192020,2019-10-03,5,2,14,T.B,1,0,0.0,0,1,0.0,13,FLA,Final
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1078,2019021079,R,20192020,2020-03-11,6,2,16,CHI,32,30,8.0,29,36,5.0,28,S.J,Final
1079,2019021080,R,20192020,2020-03-11,2,4,22,EDM,37,25,9.0,37,28,6.0,52,WPG,Final
1080,2019020876,R,20192020,2020-03-11,2,4,24,ANA,29,33,9.0,42,19,10.0,19,STL,Final
1081,2019021081,R,20192020,2020-03-11,3,2,21,COL,42,20,8.0,37,28,5.0,3,NYR,Final


In [40]:
schedule_1920.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1083 entries, 0 to 1082
Data columns (total 17 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   game_id       1083 non-null   int64  
 1   game_type     1083 non-null   object 
 2   season_id     1083 non-null   object 
 3   date          1083 non-null   object 
 4   home_score    1083 non-null   int64  
 5   away_score    1083 non-null   int64  
 6   home_team_id  1083 non-null   int64  
 7   home_team     1083 non-null   object 
 8   home_wins     1083 non-null   int64  
 9   home_losses   1083 non-null   int64  
 10  home_otl      1082 non-null   float64
 11  away_wins     1083 non-null   int64  
 12  away_losses   1083 non-null   int64  
 13  away_otl      1082 non-null   float64
 14  away_team_id  1083 non-null   int64  
 15  away_team     1083 non-null   object 
 16  status        1083 non-null   object 
dtypes: float64(2), int64(9), object(6)
memory usage: 144.0+ KB


Bulk processing to conduct upon load:

- Convert the following fields accordingly
  - game_id, season_id to string
  - home_otl, away_otl to int
- Filter out non regular season games
  - keep only game_type = R
- Add column to denote winner from perspective of home team
  - Home_Team_Won = 1 denotes home team victory, 0 for loss
- Add game keys so schedule dfs can be joined with stats dfs
- Add running total column for number of games played so far that season for all teams
- Add running standings points total column for all teams
  - Win = 2 points
  - (Regulation) Loss = 0 points
  - Overtime loss = 1 point
- Add column for points percentage
  - pts_pct = actual points accumulated / potential max points
  - potential max points = games played * 2
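
The points percentage rule above can be worked through on a single record. In this minimal sketch, `points_pct` is a hypothetical helper, and returning `None` for a team with no games played is my assumption rather than something the notebook specifies.

```python
def points_pct(wins, losses, otl):
    """Standings points earned divided by the maximum available (2 per game)."""
    games_played = wins + losses + otl
    if games_played == 0:
        return None  # assumption: undefined before a team's first game
    points = 2 * wins + otl  # win = 2, overtime loss = 1, regulation loss = 0
    return points / (2 * games_played)


# Chicago's 32-30-8 record from the 2019-2020 schedule output: 72 points in 70 games
print(round(points_pct(32, 30, 8), 6))
```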

In [41]:
# check that all games are 'R' type (regular season)
schedule_1920['game_type'].value_counts()
# assert (schedule_2021['game_type'] == 'R').all()
# assert (schedule_2122['game_type'] == 'R').all()
# assert (schedule_2223['game_type'] == 'R').all()

R     1082
WA       1
Name: game_type, dtype: int64

In [42]:
schedule_1920.loc[schedule_1920['game_type'] == 'WA']

Unnamed: 0,game_id,game_type,season_id,date,home_score,away_score,home_team_id,home_team,home_wins,home_losses,home_otl,away_wins,away_losses,away_otl,away_team_id,away_team,status
767,2019120001,WA,20192020,2020-01-24,1,2,7461,AMERICAN ALL-STARS,0,0,,0,0,,7460,CANADIAN ALL-STARS,Final
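
The ID arithmetic alone shows why this game got past the numeric check in `scrape_schedule`. The sketch below replicates that condition; `numeric_filter_passes` is a hypothetical name used only for illustration.

```python
def numeric_filter_passes(game_pk, preseason=False):
    """Replicates the range check applied to gamePk in scrape_schedule."""
    game_id = int(str(game_pk)[5:])  # keeps everything after the first five characters
    return (game_id >= 20000 or preseason) and game_id < 40000


regular_season_opener = 2019020001  # TOR vs OTT, game_type 'R'
all_star_game = 2019120001          # game_type 'WA'

# Both IDs reduce to the same five trailing digits, so the range check cannot tell them apart
print(str(regular_season_opener)[5:], str(all_star_game)[5:])
print(numeric_filter_passes(regular_season_opener), numeric_filter_passes(all_star_game))
```

This is why the explicit `game_type == 'R'` filter is still needed after scraping.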


In [43]:
def schedule_df_processing(schedule_df):
    # filter out non regular season games - should have 3262 for 19-20 through 21-22 seasons
    # .copy() so the assignments below act on a new frame rather than a slice of the input
    schedule_df = schedule_df.loc[schedule_df['game_type'] == 'R'].copy()
    # convert game_id and season_id to str so they aren't used numerically and can be used for filtering
    schedule_df['game_id'] = schedule_df['game_id'].astype(str)
    schedule_df['season_id'] = schedule_df['season_id'].astype(str)
    # convert overtime loss columns from float to int, as they can't be fractional
    schedule_df['home_otl'] = schedule_df['home_otl'].astype(int)
    schedule_df['away_otl'] = schedule_df['away_otl'].astype(int)
    # add column denoting the game winner in binary: home team win = 1, loss = 0
    schedule_df['Home_Team_Won'] = np.where(schedule_df['home_score'] > schedule_df['away_score'], 1, 0)
    # add keys for merging stats dfs
    schedule_df['Home_Team_Key'] = schedule_df['home_team'].astype(str)+'_'+schedule_df['date'].astype(str)
    schedule_df['Away_Team_Key'] = schedule_df['away_team'].astype(str)+'_'+schedule_df['date'].astype(str)
    # add column containing running total of games played so far that season
    schedule_df['home_gp'] = (schedule_df['home_wins'] + schedule_df['home_losses'] + schedule_df['home_otl']).astype(int)
    schedule_df['away_gp'] = (schedule_df['away_wins'] + schedule_df['away_losses'] + schedule_df['away_otl']).astype(int)
    # add column containing running standings points total for teams. wins = 2, losses = 0, otl = 1
    schedule_df['home_points'] = ((schedule_df['home_wins'] * 2) + schedule_df['home_otl']).astype(int)
    schedule_df['away_points'] = ((schedule_df['away_wins'] * 2) + schedule_df['away_otl']).astype(int)
    # add points percentage column
    schedule_df['home_pts_pct'] = schedule_df['home_points'] / (schedule_df['home_gp'] * 2)
    schedule_df['away_pts_pct'] = schedule_df['away_points'] / (schedule_df['away_gp'] * 2)

    return schedule_df
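
The order of the first steps in this function matters. The all-star row has missing `home_otl` and `away_otl` values, and casting a column that contains NaN to `int` raises an error, so the `game_type` filter has to run before the cast. A toy illustration with made-up rows:

```python
import numpy as np
import pandas as pd

# Toy frame: one regular season row and one all-star row with a missing OT-loss count
toy = pd.DataFrame({
    "game_type": ["R", "WA"],
    "home_otl": [1.0, np.nan],
})

try:
    toy["home_otl"].astype(int)  # casting before filtering hits the NaN
    cast_before_filter_failed = False
except ValueError:
    cast_before_filter_failed = True

regular = toy.loc[toy["game_type"] == "R"].copy()
regular["home_otl"] = regular["home_otl"].astype(int)  # safe once the NaN row is gone

print(cast_before_filter_failed, regular["home_otl"].tolist())
```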

In [44]:
# test function
schedule_1920 = schedule_df_processing(schedule_1920)
schedule_1920

Unnamed: 0,game_id,game_type,season_id,date,home_score,away_score,home_team_id,home_team,home_wins,home_losses,home_otl,away_wins,away_losses,away_otl,away_team_id,away_team,status,Home_Team_Won,Home_Team_Key,Away_Team_Key,home_gp,away_gp,home_points,away_points,home_pts_pct,away_pts_pct
0,2019020001,R,20192020,2019-10-02,5,3,10,TOR,1,0,0,0,1,0,9,OTT,Final,1,TOR_2019-10-02,OTT_2019-10-02,1,1,2,0,1.000000,0.000000
1,2019020002,R,20192020,2019-10-02,2,3,19,STL,0,0,1,1,0,0,15,WSH,Final,0,STL_2019-10-02,WSH_2019-10-02,1,1,1,2,0.500000,1.000000
2,2019020003,R,20192020,2019-10-02,3,2,22,EDM,1,0,0,0,1,0,23,VAN,Final,1,EDM_2019-10-02,VAN_2019-10-02,1,1,2,0,1.000000,0.000000
3,2019020004,R,20192020,2019-10-02,4,1,54,VGK,1,0,0,0,1,0,28,S.J,Final,1,VGK_2019-10-02,S.J_2019-10-02,1,1,2,0,1.000000,0.000000
4,2019020005,R,20192020,2019-10-03,5,2,14,T.B,1,0,0,0,1,0,13,FLA,Final,1,T.B_2019-10-03,FLA_2019-10-03,1,1,2,0,1.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1078,2019021079,R,20192020,2020-03-11,6,2,16,CHI,32,30,8,29,36,5,28,S.J,Final,1,CHI_2020-03-11,S.J_2020-03-11,70,70,72,63,0.514286,0.450000
1079,2019021080,R,20192020,2020-03-11,2,4,22,EDM,37,25,9,37,28,6,52,WPG,Final,0,EDM_2020-03-11,WPG_2020-03-11,71,71,83,80,0.584507,0.563380
1080,2019020876,R,20192020,2020-03-11,2,4,24,ANA,29,33,9,42,19,10,19,STL,Final,0,ANA_2020-03-11,STL_2020-03-11,71,71,67,94,0.471831,0.661972
1081,2019021081,R,20192020,2020-03-11,3,2,21,COL,42,20,8,37,28,5,3,NYR,Final,1,COL_2020-03-11,NYR_2020-03-11,70,70,92,79,0.657143,0.564286


In [45]:
# call processing function on all schedule dfs
# schedule_1920 = schedule_df_processing(schedule_1920)
schedule_2021 = schedule_df_processing(schedule_2021)
schedule_2122 = schedule_df_processing(schedule_2122)
schedule_2223 = schedule_df_processing(schedule_2223)

In [46]:
# check lengths to make sure the correct number of games are returned
# 1082, 868, 1312, 1312
print(len(schedule_1920))
print(len(schedule_2021))
print(len(schedule_2122))
print(len(schedule_2223))

1082
868
1312
1312
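
These counts can be cross-checked with simple arithmetic, since every game involves two teams. The team counts and schedule lengths below are my assumptions from general knowledge of the league, not values taken from the notebook, and `expected_games` is a hypothetical helper.

```python
def expected_games(num_teams, games_per_team):
    """Each game involves two teams, so the league total is teams * games / 2."""
    return num_teams * games_per_team // 2


print(expected_games(31, 56))  # 2020-2021: assumed 31 teams, 56-game schedule
print(expected_games(32, 82))  # 2021-2022 and 2022-2023: assumed 32 teams, 82-game schedule
print(expected_games(31, 82))  # a full 2019-2020 schedule; the season stopped at 1082 games
```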


In [47]:
schedule_1920_2122 = pd.concat([schedule_1920, schedule_2021, schedule_2122], axis=0, ignore_index=True)

In [48]:
schedule_1920_2122

Unnamed: 0,game_id,game_type,season_id,date,home_score,away_score,home_team_id,home_team,home_wins,home_losses,home_otl,away_wins,away_losses,away_otl,away_team_id,away_team,status,Home_Team_Won,Home_Team_Key,Away_Team_Key,home_gp,away_gp,home_points,away_points,home_pts_pct,away_pts_pct
0,2019020001,R,20192020,2019-10-02,5,3,10,TOR,1,0,0,0,1,0,9,OTT,Final,1,TOR_2019-10-02,OTT_2019-10-02,1,1,2,0,1.000000,0.000000
1,2019020002,R,20192020,2019-10-02,2,3,19,STL,0,0,1,1,0,0,15,WSH,Final,0,STL_2019-10-02,WSH_2019-10-02,1,1,1,2,0.500000,1.000000
2,2019020003,R,20192020,2019-10-02,3,2,22,EDM,1,0,0,0,1,0,23,VAN,Final,1,EDM_2019-10-02,VAN_2019-10-02,1,1,2,0,1.000000,0.000000
3,2019020004,R,20192020,2019-10-02,4,1,54,VGK,1,0,0,0,1,0,28,S.J,Final,1,VGK_2019-10-02,S.J_2019-10-02,1,1,2,0,1.000000,0.000000
4,2019020005,R,20192020,2019-10-03,5,2,14,T.B,1,0,0,0,1,0,13,FLA,Final,1,T.B_2019-10-03,FLA_2019-10-03,1,1,2,0,1.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3257,2021021307,R,20212022,2022-04-29,4,2,25,DAL,46,30,6,31,37,14,24,ANA,Final,1,DAL_2022-04-29,ANA_2022-04-29,82,82,98,76,0.597561,0.463415
3258,2021021207,R,20212022,2022-04-29,3,2,22,EDM,49,27,6,40,30,12,23,VAN,Final,1,EDM_2022-04-29,VAN_2022-04-29,82,82,104,92,0.634146,0.560976
3259,2021021312,R,20212022,2022-04-29,3,0,55,SEA,27,48,6,32,37,13,28,S.J,Final,1,SEA_2022-04-29,S.J_2022-04-29,81,82,60,77,0.370370,0.469512
3260,2021021311,R,20212022,2022-04-29,5,4,53,ARI,25,50,7,45,30,7,18,NSH,Final,1,ARI_2022-04-29,NSH_2022-04-29,82,82,57,97,0.347561,0.591463
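
With `Home_Team_Won` in place, the naive "always pick the home team" baseline is simply the mean of that column. A tiny illustration using only the first five games shown above; five games say nothing about the true rate, which on the full frame would come from `schedule_1920_2122['Home_Team_Won'].mean()`.

```python
# (home_score, away_score) for the first five games shown above
first_five = [(5, 3), (2, 3), (3, 2), (4, 1), (5, 2)]

# Same rule as the processing function: 1 when the home team outscored the away team
home_team_won = [1 if home > away else 0 for home, away in first_five]
baseline_accuracy = sum(home_team_won) / len(home_team_won)

print(home_team_won, baseline_accuracy)
```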


In [49]:
home_away_stats.columns.to_list()

[
  'Game',
  'Team',
  'TOI',
  'FF/60',
  'FA/60',
  'FF%',
  'GF/60',
  'GA/60',
  'GF%',
  'xGF/60',
  'xGA/60',
  'xGF%',
  'HDCF/60',
  'HDCA/60',
  'HDCF%',
  'HDSH%',
  'HDSV%',
  'SH%',
  'SV%',
  'PDO',
  'Date',
  'Game_Key',
  'TOI_pp',
  'GF/60_pp',
  'xGF/60_pp',
  'TOI_pk',
  'GA/60_pk',
  'xGA/60_pk',
]

In [50]:
cols_to_prefix = [
#   'Game',
  'Team',
  'TOI',
  'FF/60',
  'FA/60',
  'FF%',
  'GF/60',
  'GA/60',
  'GF%',
  'xGF/60',
  'xGA/60',
  'xGF%',
  'HDCF/60',
  'HDCA/60',
  'HDCF%',
  'HDSH%',
  'HDSV%',
  'SH%',
  'SV%',
  'PDO',
  'Date',
  'Game_Key',
  'TOI_pp',
  'GF/60_pp',
  'xGF/60_pp',
  'TOI_pk',
  'GA/60_pk',
  'xGA/60_pk',
]

In [51]:
modeling_df = schedule_1920_2122.merge(home_away_stats[cols_to_prefix].add_prefix('home_'),
                                       left_on='Home_Team_Key', right_on='home_Game_Key', how='left')
modeling_df = modeling_df.merge(home_away_stats[cols_to_prefix].add_prefix('away_'),
                                left_on='Away_Team_Key', right_on='away_Game_Key', how='left')
modeling_df

Unnamed: 0,game_id,game_type,season_id,date,home_score,away_score,home_team_id,home_team,home_wins,home_losses,home_otl,away_wins,away_losses,away_otl,away_team_id,away_team,status,Home_Team_Won,Home_Team_Key,Away_Team_Key,home_gp,away_gp,home_points,away_points,home_pts_pct,away_pts_pct,home_Team,home_TOI,home_FF/60,home_FA/60,home_FF%,home_GF/60,home_GA/60,home_GF%,home_xGF/60,home_xGA/60,home_xGF%,home_HDCF/60,home_HDCA/60,home_HDCF%,home_HDSH%,home_HDSV%,home_SH%,home_SV%,home_PDO,home_Date,home_Game_Key,home_TOI_pp,home_GF/60_pp,home_xGF/60_pp,home_TOI_pk,home_GA/60_pk,home_xGA/60_pk,away_Team,away_TOI,away_FF/60,away_FA/60,away_FF%,away_GF/60,away_GA/60,away_GF%,away_xGF/60,away_xGA/60,away_xGF%,away_HDCF/60,away_HDCA/60,away_HDCF%,away_HDSH%,away_HDSV%,away_SH%,away_SV%,away_PDO,away_Date,away_Game_Key,away_TOI_pp,away_GF/60_pp,away_xGF/60_pp,away_TOI_pk,away_GA/60_pk,away_xGA/60_pk
0,2019020001,R,20192020,2019-10-02,5,3,10,TOR,1,0,0,0,1,0,9,OTT,Final,1,TOR_2019-10-02,OTT_2019-10-02,1,1,2,0,1.000000,0.000000,TOR,44.133333,59.15,43.75,57.48,5.16,4.18,55.27,3.44,2.22,60.73,15.74,8.32,65.42,55.99,24.79,11.94,85.60,0.975,2019-10-02,TOR_2019-10-02,8.166667,7.35,7.47,6.000000,0.00,2.77,OTT,44.133333,43.75,59.15,42.52,4.18,5.16,44.73,2.22,3.44,39.27,8.32,15.74,34.58,75.21,44.01,14.40,88.06,1.025,2019-10-02,OTT_2019-10-02,6.000000,0.00,2.77,8.166667,7.35,7.47
1,2019020002,R,20192020,2019-10-02,2,3,19,STL,0,0,1,1,0,0,15,WSH,Final,0,STL_2019-10-02,WSH_2019-10-02,1,1,1,2,0.500000,1.000000,STL,50.866667,27.17,39.14,40.97,1.11,1.20,48.19,1.35,2.26,37.48,5.59,6.14,47.66,19.64,100.00,5.56,95.79,1.014,2019-10-02,STL_2019-10-02,3.250000,18.46,6.44,5.883333,10.20,7.62,WSH,50.866667,39.14,27.17,59.03,1.20,1.11,51.81,2.26,1.35,62.52,6.14,5.59,52.34,0.00,80.36,4.21,94.44,0.986,2019-10-02,WSH_2019-10-02,5.883333,10.20,7.62,3.250000,18.46,6.44
2,2019020003,R,20192020,2019-10-02,3,2,22,EDM,1,0,0,0,1,0,23,VAN,Final,1,EDM_2019-10-02,VAN_2019-10-02,1,1,2,0,1.000000,0.000000,EDM,47.066667,30.38,48.03,38.75,3.58,2.64,57.61,1.71,1.99,46.24,6.04,9.11,39.87,39.29,65.85,14.87,91.33,1.062,2019-10-02,EDM_2019-10-02,4.000000,0.00,4.60,8.000000,0.00,9.09,VAN,47.066667,48.03,30.38,61.25,2.64,3.58,42.39,1.99,1.71,53.76,9.11,6.04,60.13,34.15,60.71,8.67,85.13,0.938,2019-10-02,VAN_2019-10-02,8.000000,0.00,9.09,4.000000,0.00,4.60
3,2019020004,R,20192020,2019-10-02,4,1,54,VGK,1,0,0,0,1,0,28,S.J,Final,1,VGK_2019-10-02,S.J_2019-10-02,1,1,2,0,1.000000,0.000000,VGK,45.666667,51.94,30.07,63.34,2.60,1.33,66.09,3.84,1.40,73.28,16.02,5.27,75.25,12.06,100.00,8.07,93.28,1.013,2019-10-02,VGK_2019-10-02,4.333333,13.85,8.96,6.983333,0.00,5.33,S.J,45.666667,30.07,51.94,36.66,1.33,2.60,33.91,1.40,3.84,26.72,5.27,16.02,24.75,0.00,87.94,6.72,91.93,0.987,2019-10-02,S.J_2019-10-02,6.983333,0.00,5.33,4.333333,13.85,8.96
4,2019020005,R,20192020,2019-10-03,5,2,14,T.B,1,0,0,0,1,0,13,FLA,Final,1,T.B_2019-10-03,FLA_2019-10-03,1,1,2,0,1.000000,0.000000,T.B,45.500000,32.45,46.56,41.07,2.55,1.33,65.64,1.86,1.69,52.43,11.77,9.26,55.98,19.81,83.02,9.94,96.45,1.064,2019-10-03,T.B_2019-10-03,4.233333,14.17,10.54,6.766667,0.00,4.25,FLA,45.500000,46.56,32.45,58.93,1.33,2.55,34.36,1.69,1.86,47.57,9.26,11.77,44.02,16.98,80.19,3.55,90.06,0.936,2019-10-03,FLA_2019-10-03,6.766667,0.00,4.25,4.233333,14.17,10.54
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3257,2021021307,R,20212022,2022-04-29,4,2,25,DAL,46,30,6,31,37,14,24,ANA,Final,1,DAL_2022-04-29,ANA_2022-04-29,82,82,98,76,0.597561,0.463415,DAL,54.450000,32.22,30.41,51.44,2.06,2.28,47.45,2.21,1.37,61.63,11.32,5.80,66.12,51.77,50.19,12.11,88.68,1.008,2022-04-29,DAL_2022-04-29,2.566667,23.38,1.87,2.000000,0.00,7.57,ANA,54.450000,30.41,32.22,48.56,2.28,2.06,52.55,1.37,2.21,38.37,5.80,11.32,33.88,49.81,48.23,11.32,87.89,0.992,2022-04-29,ANA_2022-04-29,2.000000,0.00,7.57,2.566667,23.38,1.87
3258,2021021207,R,20212022,2022-04-29,3,2,22,EDM,49,27,6,40,30,12,23,VAN,Final,1,EDM_2022-04-29,VAN_2022-04-29,82,82,104,92,0.634146,0.560976,EDM,53.783333,39.31,51.30,43.38,2.06,2.36,46.65,2.56,3.93,39.47,13.96,12.29,53.19,0.00,100.00,7.13,93.16,1.003,2022-04-29,EDM_2022-04-29,2.000000,0.00,12.50,4.000000,0.00,17.36,VAN,53.783333,51.30,39.31,56.62,2.36,2.06,53.35,3.93,2.56,60.53,12.29,13.96,46.81,0.00,100.00,6.84,92.87,0.997,2022-04-29,VAN_2022-04-29,4.000000,0.00,17.36,2.000000,0.00,12.50
3259,2021021312,R,20212022,2022-04-29,3,0,55,SEA,27,48,6,32,37,13,28,S.J,Final,1,SEA_2022-04-29,S.J_2022-04-29,81,82,60,77,0.370370,0.469512,SEA,52.416667,48.21,26.93,64.16,2.21,0.00,100.00,2.42,1.42,62.98,11.58,4.54,71.82,0.00,100.00,6.82,100.00,1.068,2022-04-29,SEA_2022-04-29,2.000000,0.00,12.69,1.600000,0.00,11.05,S.J,52.416667,26.93,48.21,35.84,0.00,2.21,0.00,1.42,2.42,37.02,4.54,11.58,28.18,0.00,100.00,0.00,93.18,0.932,2022-04-29,S.J_2022-04-29,1.600000,0.00,11.05,2.000000,0.00,12.69
3260,2021021311,R,20212022,2022-04-29,5,4,53,ARI,25,50,7,45,30,7,18,NSH,Final,1,ARI_2022-04-29,NSH_2022-04-29,82,82,57,97,0.347561,0.591463,ARI,48.750000,31.75,58.08,35.34,5.73,5.25,52.18,1.60,3.01,34.78,4.30,13.88,23.67,52.52,51.84,25.11,83.60,1.087,2022-04-29,ARI_2022-04-29,5.950000,0.00,15.61,3.950000,0.00,9.76,NSH,48.750000,58.08,31.75,64.66,5.25,5.73,47.82,3.01,1.60,65.22,13.88,4.30,76.33,48.16,47.48,16.40,74.89,0.913,2022-04-29,NSH_2022-04-29,3.950000,0.00,9.76,5.950000,0.00,15.61
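
A left merge silently keeps schedule rows whose key has no match in the stats frame, and duplicates rows if a key repeats. A hedged sketch of how the `validate` and `indicator` options of `DataFrame.merge` can surface both problems, using toy stand-ins for the real frames (the third key is deliberately unmatched):

```python
import pandas as pd

# Toy stand-ins for the schedule and per-team stats frames
schedule = pd.DataFrame({
    "Home_Team_Key": ["TOR_2019-10-02", "STL_2019-10-02", "XXX_2019-10-02"],
})
stats = pd.DataFrame({
    "Game_Key": ["TOR_2019-10-02", "STL_2019-10-02"],
    "xGF%": [60.73, 37.48],
})

merged = schedule.merge(
    stats.add_prefix("home_"),
    left_on="Home_Team_Key",
    right_on="home_Game_Key",
    how="left",
    validate="one_to_one",  # raises MergeError if either key is duplicated
    indicator=True,         # adds a _merge column: 'both' or 'left_only'
)

unmatched = merged.loc[merged["_merge"] == "left_only", "Home_Team_Key"].tolist()
print(len(merged), unmatched)
```

An empty `unmatched` list on the real frames would confirm that every scheduled game found its stats row.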


In [52]:
# modeling_df = modeling_df.merge(home_away_stats[cols_to_prefix].add_prefix('away_'), 
#                                        left_on = 'Away_Team_Key', right_on = 'away_Game_Key', how = 'left')
# modeling_df

In [53]:
print(modeling_df.columns.to_list())

['game_id', 'game_type', 'season_id', 'date', 'home_score', 'away_score', 'home_team_id', 'home_team', 'home_wins', 'home_losses', 'home_otl', 'away_wins', 'away_losses', 'away_otl', 'away_team_id', 'away_team', 'status', 'Home_Team_Won', 'Home_Team_Key', 'Away_Team_Key', 'home_gp', 'away_gp', 'home_points', 'away_points', 'home_pts_pct', 'away_pts_pct', 'home_Team', 'home_TOI', 'home_FF/60', 'home_FA/60', 'home_FF%', 'home_GF/60', 'home_GA/60', 'home_GF%', 'home_xGF/60', 'home_xGA/60', 'home_xGF%', 'home_HDCF/60', 'home_HDCA/60', 'home_HDCF%', 'home_HDSH%', 'home_HDSV%', 'home_SH%', 'home_SV%', 'home_PDO', 'home_Date', 'home_Game_Key', 'home_TOI_pp', 'home_GF/60_pp', 'home_xGF/60_pp', 'home_TOI_pk', 'home_GA/60_pk', 'home_xGA/60_pk', 'away_Team', 'away_TOI', 'away_FF/60', 'away_FA/60', 'away_FF%', 'away_GF/60', 'away_GA/60', 'away_GF%', 'away_xGF/60', 'away_xGA/60', 'away_xGF%', 'away_HDCF/60', 'away_HDCA/60', 'away_HDCF%', 'away_HDSH%', 'away_HDSV%', 'away_SH%', 'away_SV%', 'away_PDO

In [54]:
drop_list = ['game_id', 'game_type', 'date', 'home_score', 'away_score', 'home_team_id', 'home_wins', 
             'home_losses', 'home_otl', 'away_wins', 'away_losses', 'away_otl', 'away_team_id', 
             'status', 'Home_Team_Key', 'Away_Team_Key', 'home_gp', 'away_gp', 'home_points', 'away_points',
             'home_Team', 'home_Date', 'home_Game_Key', 'away_Team', 'away_Date', 'away_Game_Key']
modeling_df = modeling_df.drop(columns=drop_list)

In [55]:
modeling_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3262 entries, 0 to 3261
Data columns (total 54 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   season_id       3262 non-null   object 
 1   home_team       3262 non-null   object 
 2   away_team       3262 non-null   object 
 3   Home_Team_Won   3262 non-null   int64  
 4   home_pts_pct    3262 non-null   float64
 5   away_pts_pct    3262 non-null   float64
 6   home_TOI        3262 non-null   float64
 7   home_FF/60      3262 non-null   float64
 8   home_FA/60      3262 non-null   float64
 9   home_FF%        3262 non-null   float64
 10  home_GF/60      3262 non-null   float64
 11  home_GA/60      3262 non-null   float64
 12  home_GF%        3262 non-null   float64
 13  home_xGF/60     3262 non-null   float64
 14  home_xGA/60     3262 non-null   float64
 15  home_xGF%       3262 non-null   float64
 16  home_HDCF/60    3262 non-null   float64
 17  home_HDCA/60    3262 non-null   f

In [None]:
# print (modeling_df[pd.to_numeric(modeling_df[cols_to_convert], errors='coerce').isnull()])

In [57]:
# fill power play (pp) and penalty kill (pk) NaNs with 0: they come from games where a team had no pp or pk time
cols_to_fill = modeling_df.filter(like='_TOI_pp').columns.tolist() + \
               modeling_df.filter(like='_GF/60_pp').columns.tolist() + \
               modeling_df.filter(like='_xGF/60_pp').columns.tolist() + \
               modeling_df.filter(like='_TOI_pk').columns.tolist() + \
               modeling_df.filter(like='_GA/60_pk').columns.tolist() + \
               modeling_df.filter(like='_xGA/60_pk').columns.tolist()

modeling_df[cols_to_fill] = modeling_df[cols_to_fill].fillna(value=0)
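
The six `filter(like=...)` calls above all target columns ending in `_pp` or `_pk`, so a single suffix pattern selects the same set. The sketch below checks this on plain column names; the `columns` list is rebuilt here for illustration from a subset of the names in `cols_to_prefix`.

```python
import re

# Six special-teams stat names from cols_to_prefix, plus two that should not match
stat_names = ["TOI", "PDO", "TOI_pp", "GF/60_pp", "xGF/60_pp", "TOI_pk", "GA/60_pk", "xGA/60_pk"]
columns = [f"{side}_{name}" for side in ("home", "away") for name in stat_names]

# Match only names that end in the pp or pk suffix
special_teams = re.compile(r"_(pp|pk)$")
cols_to_fill_alt = [col for col in columns if special_teams.search(col)]

print(len(cols_to_fill_alt), cols_to_fill_alt[:3])
```

In pandas the same pattern could be passed as `modeling_df.filter(regex=r'_(pp|pk)$')`, which should return the same columns in frame order.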

In [58]:
# make sure no nans lurking about
na_mask = modeling_df.isna().any(axis=1) 
na_rows = modeling_df[na_mask] 

# display the selected rows
na_rows  # empty: no NaNs remain

Unnamed: 0,season_id,home_team,away_team,Home_Team_Won,home_pts_pct,away_pts_pct,home_TOI,home_FF/60,home_FA/60,home_FF%,home_GF/60,home_GA/60,home_GF%,home_xGF/60,home_xGA/60,home_xGF%,home_HDCF/60,home_HDCA/60,home_HDCF%,home_HDSH%,home_HDSV%,home_SH%,home_SV%,home_PDO,home_TOI_pp,home_GF/60_pp,home_xGF/60_pp,home_TOI_pk,home_GA/60_pk,home_xGA/60_pk,away_TOI,away_FF/60,away_FA/60,away_FF%,away_GF/60,away_GA/60,away_GF%,away_xGF/60,away_xGA/60,away_xGF%,away_HDCF/60,away_HDCA/60,away_HDCF%,away_HDSH%,away_HDSV%,away_SH%,away_SV%,away_PDO,away_TOI_pp,away_GF/60_pp,away_xGF/60_pp,away_TOI_pk,away_GA/60_pk,away_xGA/60_pk


In [59]:
# save as csv for modeling
modeling_df.to_csv('modeling_data.csv', index=False)

### Preliminary Feature Selection and Processing