# NHL Game Prediction Modeling
by Gary Schwaeber

## Overview

With sport betting becoming increasingly popular and mainstream I believe that data science can be used to make superior decisions over gut intuitions. In this notebook I will attempt to train logistic regression, ada boost, and gradient boosting models in an attempt to make the best possible game prediction model. I will train my models and tune model hyperparemetres using game results from seasons '2017-2018', '2018-2019', '2019-2020'. Then I will predict on held out games from the current 2021 season and evaluate my model. There are currently a handful of public models whose log loss on the current season's games is being [tracked](https://hockey-statistics.com/2021/05/03/game-projections-january-13th-2021/) on which I can compare the quality of my model to. The score I will look to optimize is log loss, however, I will also review accuracy scores due to their interpretability.

Log-loss is indicative of how close the prediction probability is to the corresponding actual/true value (0 or 1 in case of binary classification). The more the predicted probability diverges from the actual value, the higher is the log-loss value. [Source](https://towardsdatascience.com/intuition-behind-log-loss-score-4e0c9979680a)


In [43]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import numpy as np
import statsmodels.api as sm
import hockey_scraper
import pickle
import time
import random
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.preprocessing import normalize, FunctionTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, accuracy_score
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics import roc_auc_score, roc_curve, auc

from sklearn.metrics import confusion_matrix, plot_confusion_matrix,\
    precision_score, recall_score, accuracy_score, f1_score, log_loss,\
    roc_curve, roc_auc_score, classification_report
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, AdaBoostRegressor, GradientBoostingClassifier
from collections import Counter
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.compose import make_column_selector as selector
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.feature_selection import RFECV

#for the Neural Network
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.regularizers import l2
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.wrappers import scikit_learn
from tensorflow.keras.callbacks import EarlyStopping

import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)

In [2]:
df = pd.read_csv('data/all_games_multirolling_SVA_2.csv')

In [4]:
df.shape

(4447, 155)

In [29]:
# define feature columns for different rolling intervals

common = ['home_Goalie_FenwickSV%',
 'home_Goalie_GSAx/60',
 'home_Goalie_HDCSV%',
 'away_Goalie_FenwickSV%',
 'away_Goalie_GSAx/60',
 'away_Goalie_HDCSV%', 
 'home_Rating.A.Pre',
 'away_Rating.A.Pre',
 'B2B_Status']

r3 = ['home_last_3_FF%_5v5',
 'home_last_3_GF%_5v5',
 'home_last_3_xGF%_5v5',
 'home_last_3_SH%',
 'home_last3_xGF_per_min_pp',
 'home_last3_GF_per_min_pp',
 'home_last3_xGA_per_min_pk',
 'home_last3_GA_per_min_pk',
 'away_last_3_FF%_5v5',
 'away_last_3_GF%_5v5',
 'away_last_3_xGF%_5v5',
 'away_last_3_SH%',
 'away_last3_xGF_per_min_pp',
 'away_last3_GF_per_min_pp',
 'away_last3_xGA_per_min_pk',
 'away_last3_GA_per_min_pk'] + common

r5 =['home_last_5_FF%_5v5',
 'home_last_5_GF%_5v5',
 'home_last_5_xGF%_5v5',
 'home_last_5_SH%',

 'home_last5_xGF_per_min_pp',
 'home_last5_GF_per_min_pp',

 'home_last5_xGA_per_min_pk',
 'home_last5_GA_per_min_pk',
 'away_last_5_FF%_5v5',
 'away_last_5_GF%_5v5',
 'away_last_5_xGF%_5v5',
 'away_last_5_SH%',
 'away_last5_xGF_per_min_pp',
 'away_last5_GF_per_min_pp',
 'away_last5_xGA_per_min_pk',
 'away_last5_GA_per_min_pk'] + common

r10 =['home_last_10_FF%_5v5',
 'home_last_10_GF%_5v5',
 'home_last_10_xGF%_5v5',
 'home_last_10_SH%',
 'home_last10_xGF_per_min_pp',
 'home_last10_GF_per_min_pp',
 'home_last10_xGA_per_min_pk',
 'home_last10_GA_per_min_pk',
  'away_last_10_FF%_5v5',
 'away_last_10_GF%_5v5',
 'away_last_10_xGF%_5v5',
 'away_last_10_SH%',
 'away_last10_xGF_per_min_pp',
 'away_last10_GF_per_min_pp',
 'away_last10_xGA_per_min_pk',
 'away_last10_GA_per_min_pk',]


r20 = ['home_last_20_FF%_5v5',
 'home_last_20_GF%_5v5',
 'home_last_20_xGF%_5v5',
 'home_last_20_SH%',

 'home_last20_xGF_per_min_pp',
 'home_last20_GF_per_min_pp',

 'home_last20_xGA_per_min_pk',
 'home_last20_GA_per_min_pk',
 'away_last_20_FF%_5v5',
 'away_last_20_GF%_5v5',
 'away_last_20_xGF%_5v5',
 'away_last_20_SH%',

 'away_last20_xGF_per_min_pp',
 'away_last20_GF_per_min_pp',

 'away_last20_xGA_per_min_pk',
 'away_last20_GA_per_min_pk']

r30 = ['home_last_30_FF%_5v5',
 'home_last_30_GF%_5v5',
 'home_last_30_xGF%_5v5',
 'home_last_30_SH%',
 'home_last30_xGF_per_min_pp',
 'home_last30_GF_per_min_pp',
 'home_last30_xGA_per_min_pk',
 'home_last30_GA_per_min_pk',
 'away_last_30_FF%_5v5',
 'away_last_30_GF%_5v5',
 'away_last_30_xGF%_5v5',
 'away_last_30_SH%',
 'away_last30_xGF_per_min_pp',
 'away_last30_GF_per_min_pp',
 'away_last30_xGA_per_min_pk',
 'away_last30_GA_per_min_pk'] + common


r40 = ['home_last_40_FF%_5v5',
 'home_last_40_GF%_5v5',
 'home_last_40_xGF%_5v5',
 'home_last_40_SH%',
 'home_last40_xGF_per_min_pp',
 'home_last40_GF_per_min_pp',
 'home_last40_xGA_per_min_pk',
 'home_last40_GA_per_min_pk',
 'away_last_40_FF%_5v5',
 'away_last_40_GF%_5v5',
 'away_last_40_xGF%_5v5',
 'away_last_40_SH%',
 'away_last40_xGF_per_min_pp',
 'away_last40_GF_per_min_pp',
 'away_last40_xGA_per_min_pk',
 'away_last40_GA_per_min_pk'] + common


all_r = list(set(r3+r5+r10+r20+r30+r40))

r3_30 =list(set(r3+r30))
r5_30 = list(set(r5+r30))
r10_30 = list(set(r10+r30))
r_3_5_30 = list(set(r3+r5+r30))
r_5_20 = list(set(r5+r20))
r_5_40 = list(set(r5+r40))

## Baseline Model

The baseline model will predict that every home team wins their game and that the probability of that is the ratio of games the home team has won.

In [6]:
df['Home_Team_Won'].value_counts(normalize=True)

1    0.541714
0    0.458286
Name: Home_Team_Won, dtype: float64

In [7]:
baseline_preds = np.ones(df.shape[0])
accuracy_score(df['Home_Team_Won'],baseline_preds)

0.5417135147290308

In [8]:
baseline_probs = np.repeat(df['Home_Team_Won'].value_counts(normalize=True)[1], df.shape[0])

log_loss(df['Home_Team_Won'], baseline_probs)

0.6896630977766495

The models will need to beat an accuracy score of 54.17% and a log loss of .6897, otherwise they are no better than just predicting the home team will win. 

## Rolling 5 and 40 game features

For my first set of models I will attempt using 5 and 40 game rolling features. These seemed like a good set based on the feature selection notebook. 40 games is currently the longest rolling runway I have currently for the team statistics. The 40 games stats intuitively provide the most smoothing of team data over the course of the season, while the 5 game stats may provide some insight on any streakiness or may cover recent developments that would affect short term team performances such as player injuries, trades coaching changes etc.

In [9]:
X_train = df[df['Season'] != '2020-2021'].dropna().loc[:,r_5_40]
y_train = df[df['Season'] != '2020-2021'].dropna()['Home_Team_Won']
X_test = df[df['Season'] == '2020-2021'].loc[:,r_5_40]
y_test = df[df['Season'] == '2020-2021']['Home_Team_Won']

In [26]:
X_train.columns

Index(['home_last40_xGF_per_min_pp', 'away_last_5_xGF%_5v5',
       'home_last5_pp_TOI_per_game', 'home_last_40_GF%_5v5',
       'home_last40_xGA_per_min_pk', 'home_last5_xGA_per_min_pk',
       'home_last_40_SH%', 'home_last5_pk_TOI_per_game',
       'away_last40_pp_TOI_per_game', 'home_Goalie_GSAx/60',
       'away_last40_pk_TOI_per_game', 'away_Goalie_GSAx/60',
       'away_last_5_GF%_5v5', 'home_last40_pk_TOI_per_game', 'B2B_Status',
       'home_last_40_xGF%_5v5', 'away_last5_pp_TOI_per_game',
       'away_last5_pk_TOI_per_game', 'home_last5_GF_per_min_pp',
       'home_last_5_GF%_5v5', 'home_last_5_FF%_5v5',
       'away_last5_xGF_per_min_pp', 'away_last40_xGF_per_min_pp',
       'home_last40_GA_per_min_pk', 'home_Goalie_HDCSV%',
       'away_last5_GA_per_min_pk', 'away_last40_GF_per_min_pp',
       'away_Rating.A.Pre', 'home_last_5_xGF%_5v5', 'away_last_5_SH%',
       'home_Rating.A.Pre', 'home_last5_xGF_per_min_pp',
       'away_last_40_xGF%_5v5', 'home_last5_GA_per_min_pk',
  

In [10]:
X_train.shape

(3582, 49)

In [23]:
numeric_features = ['home_last40_xGF_per_min_pp', 'away_last_5_xGF%_5v5',
       'home_last_40_GF%_5v5',
       'home_last40_xGA_per_min_pk', 'home_last5_xGA_per_min_pk',
       'home_last_40_SH%', 
       'home_Goalie_GSAx/60',
        'away_Goalie_GSAx/60',
       'away_last_5_GF%_5v5', 
       'home_last_40_xGF%_5v5', 
     'home_last5_GF_per_min_pp',
       'home_last_5_GF%_5v5', 'home_last_5_FF%_5v5',
       'away_last5_xGF_per_min_pp', 'away_last40_xGF_per_min_pp',
       'home_last40_GA_per_min_pk', 'home_Goalie_HDCSV%',
       'away_last5_GA_per_min_pk', 'away_last40_GF_per_min_pp',
       'away_Rating.A.Pre', 'home_last_5_xGF%_5v5', 'away_last_5_SH%',
       'home_Rating.A.Pre', 'home_last5_xGF_per_min_pp',
       'away_last_40_xGF%_5v5', 'home_last5_GA_per_min_pk',
     'away_last5_GF_per_min_pp',
       'away_last_40_GF%_5v5', 'away_last_40_SH%', 'away_last_5_FF%_5v5',
       'home_Goalie_FenwickSV%', 'away_Goalie_HDCSV%',
       'away_last40_xGA_per_min_pk', 'home_last_5_SH%',
       'away_last5_xGA_per_min_pk', 'home_last_40_FF%_5v5',
       'away_Goalie_FenwickSV%', 'away_last_40_FF%_5v5',
       'home_last40_GF_per_min_pp', 'away_last40_GA_per_min_pk']

In [14]:
X_train[numeric_features].head()

Unnamed: 0,home_last40_xGF_per_min_pp,away_last_5_xGF%_5v5,home_last5_pp_TOI_per_game,home_last_40_GF%_5v5,home_last40_xGA_per_min_pk,home_last5_xGA_per_min_pk,home_last_40_SH%,home_last5_pk_TOI_per_game,away_last40_pp_TOI_per_game,home_Goalie_GSAx/60,away_last40_pk_TOI_per_game,away_Goalie_GSAx/60,away_last_5_GF%_5v5,home_last40_pk_TOI_per_game,home_last_40_xGF%_5v5,away_last5_pp_TOI_per_game,away_last5_pk_TOI_per_game,home_last5_GF_per_min_pp,home_last_5_GF%_5v5,home_last_5_FF%_5v5,away_last5_xGF_per_min_pp,away_last40_xGF_per_min_pp,home_last40_GA_per_min_pk,home_Goalie_HDCSV%,away_last5_GA_per_min_pk,away_last40_GF_per_min_pp,away_Rating.A.Pre,home_last_5_xGF%_5v5,away_last_5_SH%,home_Rating.A.Pre,home_last5_xGF_per_min_pp,away_last_40_xGF%_5v5,home_last5_GA_per_min_pk,home_last40_pp_TOI_per_game,away_last5_GF_per_min_pp,away_last_40_GF%_5v5,away_last_40_SH%,away_last_5_FF%_5v5,home_Goalie_FenwickSV%,away_Goalie_HDCSV%,away_last40_xGA_per_min_pk,home_last_5_SH%,away_last5_xGA_per_min_pk,home_last_40_FF%_5v5,away_Goalie_FenwickSV%,away_last_40_FF%_5v5,home_last40_GF_per_min_pp,away_last40_GA_per_min_pk
0,0.112699,48.770492,4.19,50.127801,0.104858,0.098556,9.025236,3.693333,4.646667,-0.202922,4.54,0.082345,45.9375,4.923333,48.992719,5.893333,3.07,0.095465,57.080799,52.399869,0.06991,0.1224,0.137102,0.858462,0.19544,0.139885,1500.66,51.663405,6.967375,1495.03,0.079714,49.339386,0.054152,5.328333,0.10181,51.399425,8.124451,52.562502,0.937294,0.873171,0.133976,9.426112,0.074267,48.803377,0.942516,49.991679,0.117297,0.121145
1,0.124909,51.204482,3.336667,56.868932,0.129028,0.153383,9.060588,3.546667,4.315417,0.169541,4.92875,-0.239655,49.927641,4.774167,51.954595,6.0,4.966667,0.2997,59.064609,42.564205,0.096,0.102018,0.10473,0.877358,0.040268,0.115864,1535.17,46.860987,11.358025,1577.1,0.143856,52.486645,0.225564,4.705417,0.1,58.184556,8.420932,46.882217,0.941904,0.864516,0.097844,12.093988,0.109128,50.828439,0.941294,50.633643,0.138139,0.086229
2,0.132248,40.305523,6.283333,56.575634,0.116445,0.131278,9.02546,4.54,4.921667,0.302087,5.185417,-0.097423,45.427286,4.23375,49.851785,4.816667,5.853333,0.190981,58.385392,60.511924,0.153218,0.120843,0.112194,0.897778,0.068337,0.11683,1496.85,60.180542,9.286882,1522.11,0.113316,49.136336,0.132159,4.6825,0.16609,50.499508,7.879167,43.520998,0.942492,0.878613,0.107127,8.478124,0.112415,50.407241,0.938246,50.595552,0.149493,0.106067
3,0.105738,49.941995,4.62,53.260259,0.120913,0.137299,7.970138,4.763333,5.57125,-0.164139,5.305,-0.080476,56.272661,4.379167,52.809227,5.173333,5.963333,0.04329,57.771883,54.316401,0.137242,0.143998,0.125595,0.869266,0.100615,0.103208,1496.86,52.571429,6.524847,1525.37,0.118615,50.855171,0.125962,4.778333,0.115979,45.246898,5.932286,51.909534,0.934447,0.848,0.093779,9.804628,0.086864,52.890654,0.938305,51.197815,0.099407,0.131951
4,0.129293,43.6373,2.69,48.882718,0.084868,0.067197,7.303942,5.446667,4.720833,-0.310233,4.475833,-0.346771,52.130045,5.193333,54.871795,6.066667,3.63,0.297398,48.959081,52.400715,0.142088,0.087855,0.101091,0.830721,0.0,0.121801,1545.81,50.929752,7.311321,1521.29,0.098885,50.381002,0.03672,4.482083,0.065934,52.122642,7.885816,47.102597,0.933383,0.839117,0.102718,5.518246,0.107438,55.762037,0.939698,51.309591,0.189644,0.128468


In [15]:
scoring = ['neg_log_loss', 'accuracy']

In [24]:
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

categorical_features = ['B2B_Status']

categorical_transformer = Pipeline(steps=[
    ('ohe', OneHotEncoder())])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

### Logistic Regression

In [30]:
log_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                      ('logisticregression', LogisticRegression(max_iter=10000))])

log_params = {'logisticregression__solver' : ['liblinear', 'lbfgs', 'newton-cg'],
                'logisticregression__penalty': ['l1', 'l2'],
                'logisticregression__C': [.00001, .0001, .001, .01, .05, 0.1],
                'logisticregression__class_weight': [None] }

log_cv = GridSearchCV(log_pipeline, param_grid=log_params, cv=5, scoring=scoring, refit = 'neg_log_loss',  verbose=1)

In [31]:
log_cv.fit(X_train, y_train)

Fitting 5 folds for each of 36 candidates, totalling 180 fits


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('num',
                                                                         Pipeline(steps=[('scaler',
                                                                                          StandardScaler())]),
                                                                         ['home_last40_xGF_per_min_pp',
                                                                          'away_last_5_xGF%_5v5',
                                                                          'home_last_40_GF%_5v5',
                                                                          'home_last40_xGA_per_min_pk',
                                                                          'home_last5_xGA_per_min_pk',
                                                                          'home_last_40_SH%',
                                 

In [32]:
log_cv.best_score_

-0.6754370089204439

In [33]:
log_results = pd.DataFrame(log_cv.cv_results_).sort_values('mean_test_neg_log_loss', ascending=False)
log_results.head(10)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_logisticregression__C,param_logisticregression__class_weight,param_logisticregression__penalty,param_logisticregression__solver,params,split0_test_neg_log_loss,split1_test_neg_log_loss,split2_test_neg_log_loss,split3_test_neg_log_loss,split4_test_neg_log_loss,mean_test_neg_log_loss,std_test_neg_log_loss,rank_test_neg_log_loss,split0_test_accuracy,split1_test_accuracy,split2_test_accuracy,split3_test_accuracy,split4_test_accuracy,mean_test_accuracy,std_test_accuracy,rank_test_accuracy
16,0.012287,0.000163,0.007713,8.4e-05,0.001,,l2,lbfgs,"{'logisticregression__C': 0.001, 'logisticregr...",-0.678374,-0.671673,-0.677392,-0.675825,-0.673921,-0.675437,0.002411,1,0.566248,0.592748,0.594972,0.571229,0.578212,0.580682,0.011433,2
17,0.019093,0.000621,0.007772,3.6e-05,0.001,,l2,newton-cg,"{'logisticregression__C': 0.001, 'logisticregr...",-0.678374,-0.671674,-0.677392,-0.675825,-0.673923,-0.675437,0.00241,2,0.566248,0.592748,0.594972,0.571229,0.578212,0.580682,0.011433,2
15,0.014589,0.000732,0.008238,0.000482,0.001,,l2,liblinear,"{'logisticregression__C': 0.001, 'logisticregr...",-0.678653,-0.672743,-0.678873,-0.676277,-0.675154,-0.67634,0.002285,3,0.567643,0.619247,0.594972,0.585196,0.569832,0.587378,0.018843,1
21,0.017617,0.000735,0.007767,2.3e-05,0.01,,l2,liblinear,"{'logisticregression__C': 0.01, 'logisticregre...",-0.677641,-0.668417,-0.680201,-0.678592,-0.676997,-0.676369,0.00412,4,0.584379,0.598326,0.587989,0.565642,0.565642,0.580396,0.012887,4
23,0.025454,0.001999,0.008297,0.000467,0.01,,l2,newton-cg,"{'logisticregression__C': 0.01, 'logisticregre...",-0.677863,-0.668443,-0.679819,-0.679093,-0.676975,-0.676439,0.004116,5,0.585774,0.594142,0.587989,0.557263,0.569832,0.579,0.01351,6
22,0.016674,0.000611,0.007799,0.000178,0.01,,l2,lbfgs,"{'logisticregression__C': 0.01, 'logisticregre...",-0.677864,-0.668444,-0.679819,-0.679094,-0.676973,-0.676439,0.004116,6,0.585774,0.594142,0.587989,0.557263,0.569832,0.579,0.01351,6
24,0.017753,0.000853,0.007911,6.1e-05,0.05,,l1,liblinear,"{'logisticregression__C': 0.05, 'logisticregre...",-0.678655,-0.671507,-0.6778,-0.679098,-0.675913,-0.676595,0.002768,7,0.567643,0.588563,0.596369,0.561453,0.575419,0.577889,0.012936,8
30,0.025024,0.004794,0.009411,0.000891,0.1,,l1,liblinear,"{'logisticregression__C': 0.1, 'logisticregres...",-0.677719,-0.669147,-0.678903,-0.679826,-0.677732,-0.676665,0.003841,8,0.570432,0.596932,0.585196,0.561453,0.554469,0.573696,0.015507,13
27,0.024262,0.001168,0.008714,0.000813,0.05,,l2,liblinear,"{'logisticregression__C': 0.05, 'logisticregre...",-0.677728,-0.669121,-0.681685,-0.681348,-0.679935,-0.677963,0.004635,9,0.58159,0.599721,0.579609,0.561453,0.555866,0.575648,0.015642,10
28,0.022607,0.000736,0.008461,0.000296,0.05,,l2,lbfgs,"{'logisticregression__C': 0.05, 'logisticregre...",-0.677767,-0.669153,-0.681584,-0.681528,-0.679915,-0.67799,0.004632,10,0.577406,0.598326,0.581006,0.562849,0.555866,0.575091,0.01483,11


### Ada Boost

In [34]:
ada_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                      ('ada', AdaBoostClassifier())])

ada_params = {'ada__n_estimators': [25, 50],
         'ada__learning_rate': [.1, 1, 10, 20],
         'ada__base_estimator': [svm.SVC(probability=True , kernel='linear'), LogisticRegression()],}

ada_cv = GridSearchCV(ada_pipeline, param_grid=ada_params, cv=5, scoring=scoring, refit='neg_log_loss', verbose=1)

In [35]:
ada_cv.fit(X_train, y_train)

Fitting 5 folds for each of 16 candidates, totalling 80 fits


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('num',
                                                                         Pipeline(steps=[('scaler',
                                                                                          StandardScaler())]),
                                                                         ['home_last40_xGF_per_min_pp',
                                                                          'away_last_5_xGF%_5v5',
                                                                          'home_last_40_GF%_5v5',
                                                                          'home_last40_xGA_per_min_pk',
                                                                          'home_last5_xGA_per_min_pk',
                                                                          'home_last_40_SH%',
                                 

In [41]:
ada_cv.best_score_

-0.6799359662807147

In [36]:
ada_results = pd.DataFrame(ada_cv.cv_results_).sort_values('mean_test_neg_log_loss', ascending=False)
ada_results.head(10)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_ada__base_estimator,param_ada__learning_rate,param_ada__n_estimators,params,split0_test_neg_log_loss,split1_test_neg_log_loss,split2_test_neg_log_loss,split3_test_neg_log_loss,split4_test_neg_log_loss,mean_test_neg_log_loss,std_test_neg_log_loss,rank_test_neg_log_loss,split0_test_accuracy,split1_test_accuracy,split2_test_accuracy,split3_test_accuracy,split4_test_accuracy,mean_test_accuracy,std_test_accuracy,rank_test_accuracy
4,42.529209,0.277327,2.543278,0.006013,"SVC(kernel='linear', probability=True)",10.0,25,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.683305,-0.674921,-0.681226,-0.681435,-0.678793,-0.679936,0.002889,1,0.553696,0.596932,0.583799,0.561453,0.569832,0.573142,0.015526,4
0,43.341996,0.235797,2.495557,0.040371,"SVC(kernel='linear', probability=True)",0.1,25,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.682869,-0.679884,-0.681938,-0.681413,-0.681377,-0.681496,0.000969,2,0.560669,0.569038,0.564246,0.551676,0.553073,0.55974,0.006589,8
5,83.132445,1.405995,5.013317,0.237507,"SVC(kernel='linear', probability=True)",10.0,50,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.684522,-0.676831,-0.68288,-0.68405,-0.680332,-0.681723,0.002844,3,0.55788,0.591353,0.585196,0.554469,0.567039,0.571187,0.014674,6
6,41.064554,1.219384,2.431893,0.203562,"SVC(kernel='linear', probability=True)",20.0,25,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.6848,-0.678553,-0.682937,-0.684372,-0.680497,-0.682232,0.002375,4,0.545328,0.594142,0.571229,0.546089,0.569832,0.565324,0.018196,7
8,0.1228,0.002372,0.015609,0.000248,LogisticRegression(),0.1,25,"{'ada__base_estimator': LogisticRegression(), ...",-0.684109,-0.681523,-0.684027,-0.68269,-0.682635,-0.682997,0.000969,5,0.564854,0.594142,0.597765,0.569832,0.581006,0.58152,0.012945,2
1,81.938708,0.095818,4.740888,0.009278,"SVC(kernel='linear', probability=True)",0.1,50,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.685502,-0.683732,-0.68468,-0.685002,-0.684744,-0.684732,0.000578,6,0.543933,0.542538,0.537709,0.540503,0.546089,0.542155,0.002873,12
7,73.91773,4.193198,3.965984,0.348138,"SVC(kernel='linear', probability=True)",20.0,50,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.686802,-0.683494,-0.685372,-0.686456,-0.683155,-0.685056,0.001494,7,0.543933,0.545328,0.540503,0.543296,0.561453,0.546902,0.007443,9
9,0.229651,0.005856,0.022538,0.000169,LogisticRegression(),0.1,50,"{'ada__base_estimator': LogisticRegression(), ...",-0.686993,-0.685253,-0.687152,-0.686154,-0.686353,-0.686381,0.000677,8,0.569038,0.60251,0.597765,0.572626,0.571229,0.582634,0.014416,1
2,35.319119,0.179226,2.073445,0.010988,"SVC(kernel='linear', probability=True)",1.0,25,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.688066,-0.688023,-0.68788,-0.688809,-0.687754,-0.688106,0.000368,9,0.543933,0.543933,0.543296,0.543296,0.544693,0.54383,0.000517,10
3,68.084216,0.296903,4.008086,0.018197,"SVC(kernel='linear', probability=True)",1.0,50,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.688713,-0.688734,-0.688831,-0.688501,-0.688628,-0.688681,0.000111,10,0.543933,0.543933,0.543296,0.543296,0.544693,0.54383,0.000517,10


### Gradient Boosting

In [37]:
gb_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                      ('gb', GradientBoostingClassifier())])

gb_params = {'gb__n_estimators': [200, 300, 400],
         'gb__learning_rate': [.001,.01, .1],
         'gb__max_depth' : [3,5]}

gb_cv = GridSearchCV(gb_pipeline, param_grid=gb_params, cv=5, scoring=scoring, refit='neg_log_loss', verbose=1)

In [38]:
gb_cv.fit(X_train, y_train)

Fitting 5 folds for each of 18 candidates, totalling 90 fits


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('num',
                                                                         Pipeline(steps=[('scaler',
                                                                                          StandardScaler())]),
                                                                         ['home_last40_xGF_per_min_pp',
                                                                          'away_last_5_xGF%_5v5',
                                                                          'home_last_40_GF%_5v5',
                                                                          'home_last40_xGA_per_min_pk',
                                                                          'home_last5_xGA_per_min_pk',
                                                                          'home_last_40_SH%',
                                 

In [39]:
gb_cv.best_score_

-0.6813351464598496

In [40]:
gb_results = pd.DataFrame(gb_cv.cv_results_).sort_values('mean_test_neg_log_loss', ascending=False)
gb_results.head(10)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_gb__learning_rate,param_gb__max_depth,param_gb__n_estimators,params,split0_test_neg_log_loss,split1_test_neg_log_loss,split2_test_neg_log_loss,split3_test_neg_log_loss,split4_test_neg_log_loss,mean_test_neg_log_loss,std_test_neg_log_loss,rank_test_neg_log_loss,split0_test_accuracy,split1_test_accuracy,split2_test_accuracy,split3_test_accuracy,split4_test_accuracy,mean_test_accuracy,std_test_accuracy,rank_test_accuracy
6,5.060808,0.005743,0.013046,0.00036,0.01,3,200,"{'gb__learning_rate': 0.01, 'gb__max_depth': 3...",-0.682658,-0.679488,-0.684309,-0.680543,-0.679678,-0.681335,0.001864,1,0.559275,0.570432,0.590782,0.568436,0.582402,0.574265,0.011067,1
7,7.613778,0.132845,0.014969,0.000935,0.01,3,300,"{'gb__learning_rate': 0.01, 'gb__max_depth': 3...",-0.682625,-0.679836,-0.68534,-0.68125,-0.680842,-0.681979,0.001904,2,0.550907,0.55788,0.586592,0.565642,0.579609,0.568126,0.01327,3
8,10.091404,0.022878,0.015978,0.00022,0.01,3,400,"{'gb__learning_rate': 0.01, 'gb__max_depth': 3...",-0.682338,-0.681422,-0.686645,-0.682562,-0.682085,-0.68301,0.001857,3,0.549512,0.559275,0.585196,0.572626,0.574022,0.568126,0.012419,3
9,8.066107,0.019059,0.016398,0.000273,0.01,5,200,"{'gb__learning_rate': 0.01, 'gb__max_depth': 5...",-0.683842,-0.681349,-0.688062,-0.682242,-0.684823,-0.684064,0.002336,4,0.559275,0.570432,0.571229,0.581006,0.562849,0.568958,0.007531,2
2,10.050758,0.085446,0.01859,0.000354,0.001,3,400,"{'gb__learning_rate': 0.001, 'gb__max_depth': ...",-0.685737,-0.683565,-0.686076,-0.685384,-0.684671,-0.685087,0.000892,5,0.538354,0.548117,0.539106,0.551676,0.541899,0.543831,0.005215,13
5,15.995703,0.058481,0.026158,0.001207,0.001,5,400,"{'gb__learning_rate': 0.001, 'gb__max_depth': ...",-0.685431,-0.682782,-0.689631,-0.685636,-0.684904,-0.685677,0.002221,6,0.53696,0.560669,0.526536,0.544693,0.547486,0.543269,0.011335,14
4,11.986876,0.043948,0.021257,0.000472,0.001,5,300,"{'gb__learning_rate': 0.001, 'gb__max_depth': ...",-0.686009,-0.683521,-0.688014,-0.686682,-0.685319,-0.685909,0.001489,7,0.531381,0.560669,0.53352,0.537709,0.547486,0.542153,0.010785,16
1,7.521795,0.034738,0.015967,0.000136,0.001,3,300,"{'gb__learning_rate': 0.001, 'gb__max_depth': ...",-0.686319,-0.684856,-0.686472,-0.686419,-0.685634,-0.68594,0.000621,8,0.542538,0.545328,0.539106,0.543296,0.543296,0.542713,0.002028,15
10,12.228661,0.02834,0.019323,0.000183,0.01,5,300,"{'gb__learning_rate': 0.01, 'gb__max_depth': 5...",-0.683065,-0.684688,-0.690831,-0.686171,-0.688825,-0.686716,0.002797,9,0.566248,0.560669,0.567039,0.575419,0.564246,0.566724,0.004873,6
3,7.987818,0.031277,0.016568,0.000518,0.001,5,200,"{'gb__learning_rate': 0.001, 'gb__max_depth': ...",-0.686886,-0.685194,-0.688012,-0.688228,-0.685266,-0.686717,0.001297,10,0.535565,0.546722,0.536313,0.540503,0.550279,0.541876,0.005775,17


It does not seem that gradient boosting is producing good results for this dataset

### Neural Network

In [47]:
def build_model():
    model = Sequential()
    model.add(Dense(12, activation='relu', input_dim=44))
    model.add(Dense(8, activation='relu'))
    model.add(Dense(4, activation='relu'))
    model.add(Dense(1, activation = 'sigmoid'))

    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model

In [53]:
keras_model = scikit_learn.KerasClassifier(build_model,
                                          epochs=50,
                                          batch_size=32,
                                          verbose=2)

numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

categorical_features = ['B2B_Status']

categorical_transformer = Pipeline(steps=[
    ('ohe', OneHotEncoder())])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

nn_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                      ('nn', keras_model)])

In [54]:
nn_pipeline.fit(X_train, y_train)

Epoch 1/50
112/112 - 0s - loss: 0.6968 - accuracy: 0.5475
Epoch 2/50
112/112 - 0s - loss: 0.6840 - accuracy: 0.5558
Epoch 3/50
112/112 - 0s - loss: 0.6778 - accuracy: 0.5617
Epoch 4/50
112/112 - 0s - loss: 0.6747 - accuracy: 0.5653
Epoch 5/50
112/112 - 0s - loss: 0.6721 - accuracy: 0.5676
Epoch 6/50
112/112 - 0s - loss: 0.6700 - accuracy: 0.5843
Epoch 7/50
112/112 - 0s - loss: 0.6674 - accuracy: 0.5918
Epoch 8/50
112/112 - 0s - loss: 0.6656 - accuracy: 0.5941
Epoch 9/50
112/112 - 0s - loss: 0.6630 - accuracy: 0.5988
Epoch 10/50
112/112 - 0s - loss: 0.6616 - accuracy: 0.6002
Epoch 11/50
112/112 - 0s - loss: 0.6590 - accuracy: 0.5960
Epoch 12/50
112/112 - 0s - loss: 0.6570 - accuracy: 0.6064
Epoch 13/50
112/112 - 0s - loss: 0.6545 - accuracy: 0.6072
Epoch 14/50
112/112 - 0s - loss: 0.6528 - accuracy: 0.6033
Epoch 15/50
112/112 - 0s - loss: 0.6509 - accuracy: 0.6080
Epoch 16/50
112/112 - 0s - loss: 0.6488 - accuracy: 0.6139
Epoch 17/50
112/112 - 0s - loss: 0.6466 - accuracy: 0.6175
Epoch 

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('scaler',
                                                                   StandardScaler())]),
                                                  ['home_last40_xGF_per_min_pp',
                                                   'away_last_5_xGF%_5v5',
                                                   'home_last_40_GF%_5v5',
                                                   'home_last40_xGA_per_min_pk',
                                                   'home_last5_xGA_per_min_pk',
                                                   'home_last_40_SH%',
                                                   'home_Goalie_GSAx/60',
                                                   'away_Goalie_GSAx/60',
                                                   'away_last_5_GF%_5v5',
                                                   'home_las

In [55]:
nn_pipeline.predict(X_test)

ValueError: Found unknown categories ['0'] in column 0 during transform

In [None]:
sigmoid_loss = n.history['loss']
sigmoid_accuracy = results.history['accuracy']

fig, (ax1, ax2) = plt.subplots(1,2, figsize=(10,5))
sns.lineplot(x=results.epoch, y=sigmoid_loss, ax=ax1, label='loss')
sns.lineplot(x=results.epoch, y=sigmoid_accuracy, ax=ax2, label='accuracy');

### Feature Importance Evaluation

Reviewing the logistic regression, coefficients, I can see which feature the algorithm deemed most impactful. I am
very surprised that away_last_40_xGF%_5v5 was cut by the l1 regularization, that seemed like it would be one of the more important features.

In [42]:
log_coef = pd.DataFrame(list(zip(X_train.columns, log_cv.best_estimator_[1].coef_[0])), columns = ['Feature', 'Coef'] )
log_coef['Coef_abs'] = abs(log_coef['Coef'])
log_coef.sort_values('Coef_abs', ascending = False)

Unnamed: 0,Feature,Coef,Coef_abs
19,B2B_Status,-0.057949,0.057949
22,home_last_5_xGF%_5v5,0.056853,0.056853
37,home_last40_GF_per_min_pp,-0.048104,0.048104
35,away_last40_GA_per_min_pk,0.047458,0.047458
9,home_last5_xGA_per_min_pk,0.045442,0.045442
24,home_Rating.A.Pre,-0.039534,0.039534
27,home_last_5_GF%_5v5,-0.03694,0.03694
15,home_Goalie_HDCSV%,-0.036372,0.036372
30,home_last_40_SH%,0.036092,0.036092
38,away_last_40_SH%,0.032656,0.032656


## 40 Game Rolling

I will run some models using only the rolling 40 game team stats

In [56]:
X_train = df[df['Season'] != '2020-2021'].dropna().loc[:,r40]
y_train = df[df['Season'] != '2020-2021'].dropna()['Home_Team_Won']
X_test = df[df['Season'] == '2020-2021'].loc[:,r40]
y_test = df[df['Season'] == '2020-2021']['Home_Team_Won']

In [58]:
X_train.columns

Index(['home_last_40_FF%_5v5', 'home_last_40_GF%_5v5', 'home_last_40_xGF%_5v5',
       'home_last_40_SH%', 'home_last40_xGF_per_min_pp',
       'home_last40_GF_per_min_pp', 'home_last40_xGA_per_min_pk',
       'home_last40_GA_per_min_pk', 'away_last_40_FF%_5v5',
       'away_last_40_GF%_5v5', 'away_last_40_xGF%_5v5', 'away_last_40_SH%',
       'away_last40_xGF_per_min_pp', 'away_last40_GF_per_min_pp',
       'away_last40_xGA_per_min_pk', 'away_last40_GA_per_min_pk',
       'home_Goalie_FenwickSV%', 'home_Goalie_GSAx/60', 'home_Goalie_HDCSV%',
       'away_Goalie_FenwickSV%', 'away_Goalie_GSAx/60', 'away_Goalie_HDCSV%',
       'home_Rating.A.Pre', 'away_Rating.A.Pre', 'B2B_Status'],
      dtype='object')

In [62]:
numeric_features =['home_last_40_FF%_5v5', 'home_last_40_GF%_5v5', 'home_last_40_xGF%_5v5',
       'home_last_40_SH%', 'home_last40_xGF_per_min_pp',
       'home_last40_GF_per_min_pp', 'home_last40_xGA_per_min_pk',
       'home_last40_GA_per_min_pk', 'away_last_40_FF%_5v5',
       'away_last_40_GF%_5v5', 'away_last_40_xGF%_5v5', 'away_last_40_SH%',
       'away_last40_xGF_per_min_pp', 'away_last40_GF_per_min_pp',
       'away_last40_xGA_per_min_pk', 'away_last40_GA_per_min_pk',
       'home_Goalie_FenwickSV%', 'home_Goalie_GSAx/60', 'home_Goalie_HDCSV%',
       'away_Goalie_FenwickSV%', 'away_Goalie_GSAx/60', 'away_Goalie_HDCSV%',
       'home_Rating.A.Pre', 'away_Rating.A.Pre']

### Logistic Regression

In [63]:
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

categorical_features = ['B2B_Status']

categorical_transformer = Pipeline(steps=[
    ('ohe', OneHotEncoder())])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

log_40_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                      ('logisticregression', LogisticRegression(max_iter=10000))])

In [64]:
log_params = {'logisticregression__solver' : ['liblinear', 'lbfgs', 'newton-cg'],
                'logisticregression__penalty': ['l1', 'l2'],
                'logisticregression__C': [.01, 0.1, 1, 10],
                'logisticregression__class_weight': [None] }

log_cv_40 = GridSearchCV(log_40_pipeline, param_grid=log_params, cv=5, scoring=scoring, refit = 'neg_log_loss',  verbose=1)

In [65]:
log_cv_40.fit(X_train, y_train)

Fitting 5 folds for each of 24 candidates, totalling 120 fits


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('num',
                                                                         Pipeline(steps=[('scaler',
                                                                                          StandardScaler())]),
                                                                         ['home_last_40_FF%_5v5',
                                                                          'home_last_40_GF%_5v5',
                                                                          'home_last_40_xGF%_5v5',
                                                                          'home_last_40_SH%',
                                                                          'home_last40_xGF_per_min_pp',
                                                                          'home_last40_GF_per_min_pp',
                                      

In [212]:
log_40_results = pd.DataFrame(log_cv_40.cv_results_).sort_values('mean_test_neg_log_loss', ascending=False)
log_40_results.head(10)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_logisticregression__C,param_logisticregression__class_weight,param_logisticregression__penalty,param_logisticregression__solver,params,split0_test_neg_log_loss,split1_test_neg_log_loss,split2_test_neg_log_loss,split3_test_neg_log_loss,split4_test_neg_log_loss,mean_test_neg_log_loss,std_test_neg_log_loss,rank_test_neg_log_loss,split0_test_accuracy,split1_test_accuracy,split2_test_accuracy,split3_test_accuracy,split4_test_accuracy,mean_test_accuracy,std_test_accuracy,rank_test_accuracy
4,0.014693,0.00113,0.007716,0.000719,0.01,,l2,lbfgs,"{'logisticregression__C': 0.01, 'logisticregre...",-0.675677,-0.672521,-0.678186,-0.674674,-0.669348,-0.674081,0.002986,1,0.570667,0.593333,0.592,0.569333,0.581333,0.581333,0.010154,3
5,0.017283,0.000292,0.006951,0.000118,0.01,,l2,newton-cg,"{'logisticregression__C': 0.01, 'logisticregre...",-0.675677,-0.67252,-0.678187,-0.674673,-0.669355,-0.674082,0.002984,2,0.570667,0.593333,0.592,0.569333,0.581333,0.581333,0.010154,3
3,0.014061,0.000746,0.008523,0.001029,0.01,,l2,liblinear,"{'logisticregression__C': 0.01, 'logisticregre...",-0.675375,-0.672548,-0.678078,-0.674946,-0.669716,-0.674133,0.002821,3,0.568,0.596,0.592,0.573333,0.584,0.582667,0.010667,2
9,0.014426,0.000387,0.00695,0.000158,0.1,,l2,liblinear,"{'logisticregression__C': 0.1, 'logisticregres...",-0.674011,-0.672891,-0.678995,-0.6746,-0.671304,-0.67436,0.002575,4,0.570667,0.584,0.593333,0.578667,0.576,0.580533,0.00771,5
11,0.020913,0.000814,0.007221,0.000453,0.1,,l2,newton-cg,"{'logisticregression__C': 0.1, 'logisticregres...",-0.674061,-0.672896,-0.679035,-0.674563,-0.671252,-0.674362,0.0026,5,0.568,0.582667,0.593333,0.581333,0.577333,0.580533,0.008202,5
10,0.015567,0.000349,0.0071,0.00019,0.1,,l2,lbfgs,"{'logisticregression__C': 0.1, 'logisticregres...",-0.674062,-0.6729,-0.679034,-0.674562,-0.67125,-0.674362,0.002599,6,0.568,0.582667,0.593333,0.581333,0.577333,0.580533,0.008202,5
12,0.022668,0.001238,0.007185,0.000171,1.0,,l1,liblinear,"{'logisticregression__C': 1, 'logisticregressi...",-0.673682,-0.672968,-0.679327,-0.674495,-0.671486,-0.674391,0.002659,7,0.572,0.58,0.590667,0.577333,0.576,0.5792,0.006288,9
15,0.015133,0.000416,0.007112,0.000265,1.0,,l2,liblinear,"{'logisticregression__C': 1, 'logisticregressi...",-0.673574,-0.673067,-0.679649,-0.6746,-0.67206,-0.67459,0.002659,8,0.573333,0.582667,0.585333,0.574667,0.572,0.5776,0.00536,10
17,0.021284,0.000212,0.007193,0.000235,1.0,,l2,newton-cg,"{'logisticregression__C': 1, 'logisticregressi...",-0.673579,-0.673068,-0.679654,-0.674596,-0.672055,-0.67459,0.002661,9,0.573333,0.582667,0.585333,0.574667,0.572,0.5776,0.00536,10
16,0.016655,0.000304,0.006913,0.00019,1.0,,l2,lbfgs,"{'logisticregression__C': 1, 'logisticregressi...",-0.673578,-0.673068,-0.679656,-0.674598,-0.672055,-0.674591,0.002662,10,0.573333,0.582667,0.585333,0.574667,0.572,0.5776,0.00536,10


#### Feature Importance Evaluation

In [214]:
log_40_coef = pd.DataFrame(list(zip(X_train.columns, log_cv_40.best_estimator_[1].coef_[0])), columns = ['Feature', 'Coef'] )
log_40_coef['Coef_abs'] = abs(log_40_coef['Coef'])
log_40_coef.sort_values('Coef_abs', ascending = False)

Unnamed: 0,Feature,Coef,Coef_abs
10,home_last_40_xGF%_5v5,0.167769,0.167769
16,away_last_40_FF%_5v5,0.128198,0.128198
22,away_last40_pk_TOI_per_game,-0.127853,0.127853
20,away_last40_pp_TOI_per_game,-0.120332,0.120332
9,home_last_40_GF%_5v5,0.092343,0.092343
5,away_Goalie_HDCSV%,-0.069468,0.069468
7,away_B2B,-0.069465,0.069465
23,away_last40_xGA_per_min_pk,0.068317,0.068317
4,away_Goalie_GSAx/60,-0.058731,0.058731
6,home_B2B,0.057827,0.057827


### Ada Boost

In [66]:
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

categorical_features = ['B2B_Status']

categorical_transformer = Pipeline(steps=[
    ('ohe', OneHotEncoder())])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])


ada_40_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                      ('ada', AdaBoostClassifier())])

ada_params = {'ada__n_estimators': [25],
         'ada__learning_rate': [.01, .1, 1, 10],
         'ada__base_estimator': [svm.SVC(probability=True , kernel='linear'), LogisticRegression()],}

ada_cv_40 = GridSearchCV(ada_40_pipeline, param_grid=ada_params, cv=5, scoring=scoring, refit='neg_log_loss', verbose=1)

In [67]:
ada_cv_40.fit(X_train, y_train)

Fitting 5 folds for each of 8 candidates, totalling 40 fits


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('num',
                                                                         Pipeline(steps=[('scaler',
                                                                                          StandardScaler())]),
                                                                         ['home_last_40_FF%_5v5',
                                                                          'home_last_40_GF%_5v5',
                                                                          'home_last_40_xGF%_5v5',
                                                                          'home_last_40_SH%',
                                                                          'home_last40_xGF_per_min_pp',
                                                                          'home_last40_GF_per_min_pp',
                                      

In [68]:
ada_40_results = pd.DataFrame(ada_cv_40.cv_results_).sort_values('mean_test_neg_log_loss', ascending=False)
ada_40_results.head(10)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_ada__base_estimator,param_ada__learning_rate,param_ada__n_estimators,params,split0_test_neg_log_loss,split1_test_neg_log_loss,split2_test_neg_log_loss,split3_test_neg_log_loss,split4_test_neg_log_loss,mean_test_neg_log_loss,std_test_neg_log_loss,rank_test_neg_log_loss,split0_test_accuracy,split1_test_accuracy,split2_test_accuracy,split3_test_accuracy,split4_test_accuracy,mean_test_accuracy,std_test_accuracy,rank_test_accuracy
0,37.925709,0.638357,2.14507,0.033783,"SVC(kernel='linear', probability=True)",0.01,25,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.678507,-0.672283,-0.677155,-0.678249,-0.672888,-0.675816,0.002684,1,0.550907,0.577406,0.599162,0.561453,0.579609,0.573707,0.016532,5
4,0.11237,0.003209,0.015125,0.000329,LogisticRegression(),0.01,25,"{'ada__base_estimator': LogisticRegression(), ...",-0.680541,-0.676259,-0.67967,-0.678631,-0.675985,-0.678217,0.001817,2,0.569038,0.588563,0.569832,0.567039,0.583799,0.575654,0.008774,3
3,37.55927,0.155653,2.273125,0.004889,"SVC(kernel='linear', probability=True)",10.0,25,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.681224,-0.674573,-0.677922,-0.680711,-0.676703,-0.678227,0.002487,3,0.567643,0.595537,0.608939,0.571229,0.572626,0.583195,0.016198,1
1,38.204704,0.113358,2.208953,0.011606,"SVC(kernel='linear', probability=True)",0.1,25,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.68244,-0.679579,-0.682155,-0.681328,-0.679784,-0.681057,0.001183,4,0.564854,0.574616,0.564246,0.568436,0.565642,0.567559,0.003809,6
5,0.114122,0.000884,0.014839,7e-05,LogisticRegression(),0.1,25,"{'ada__base_estimator': LogisticRegression(), ...",-0.684177,-0.681519,-0.684073,-0.683125,-0.681574,-0.682894,0.00116,5,0.560669,0.588563,0.597765,0.569832,0.582402,0.579847,0.013203,2
2,31.212874,0.111837,1.813163,0.012767,"SVC(kernel='linear', probability=True)",1.0,25,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.688093,-0.687867,-0.688317,-0.688659,-0.686559,-0.687899,0.000719,6,0.543933,0.543933,0.543296,0.543296,0.544693,0.54383,0.000517,7
6,0.122421,0.057093,0.015021,0.00021,LogisticRegression(),1.0,25,"{'ada__base_estimator': LogisticRegression(), ...",-0.691571,-0.691189,-0.691649,-0.691591,-0.691297,-0.691459,0.000182,7,0.559275,0.594142,0.589385,0.564246,0.569832,0.575376,0.013873,4
7,0.232039,0.006414,0.015102,0.000146,LogisticRegression(),10.0,25,"{'ada__base_estimator': LogisticRegression(), ...",-0.691897,-0.69111,-0.702624,-0.690811,-0.691471,-0.693583,0.004535,8,0.543933,0.543933,0.543296,0.543296,0.544693,0.54383,0.000517,7


## All Rolling Game Features With Recursive Feature Elimination

In [69]:
X_train = df[df['Season'] != '2020-2021'].dropna().loc[:,all_r]
y_train = df[df['Season'] != '2020-2021'].dropna()['Home_Team_Won']
X_test = df[df['Season'] == '2020-2021'].loc[:,all_r]
y_test = df[df['Season'] == '2020-2021']['Home_Team_Won']

In [383]:
X_train.shape

(3750, 104)

### Recursive Feature Elimination

In [70]:
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

categorical_features = ['B2B_Status']

categorical_transformer = Pipeline(steps=[
    ('ohe', OneHotEncoder())])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

rfecv = RFECV(estimator= LogisticRegression(max_iter =10000, penalty = 'l2', solver='liblinear', C=.1), step=1, scoring='accuracy')
rfecv_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                      ('rfecv', rfecv)])

In [71]:
rfecv_pipeline.fit(X_train, y_train)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('scaler',
                                                                   StandardScaler())]),
                                                  ['home_last_40_FF%_5v5',
                                                   'home_last_40_GF%_5v5',
                                                   'home_last_40_xGF%_5v5',
                                                   'home_last_40_SH%',
                                                   'home_last40_xGF_per_min_pp',
                                                   'home_last40_GF_per_min_pp',
                                                   'home_last40_xGA_per_min_pk',
                                                   'home_last40_GA_per_min_pk',
                                                   'away_last_40_FF%_5v5',
                                                   

In [72]:
rfecv_pipeline[1].n_features_

9

In [73]:
rfecv_pipeline[1].ranking_

array([ 1,  1, 13,  8, 18, 14,  9,  6,  2, 20, 17,  7, 10, 12, 16, 15,  1,
        1,  5, 19, 11,  4,  1,  1,  1,  1,  1,  3])

In [74]:
rfecv_results = pd.DataFrame(list(zip(X_train.columns, rfecv_pipeline[1].ranking_)), columns = ['Feature', 'Ranking']).sort_values('Ranking')
rfecv_results.head(rfecv_pipeline[1].n_features_)

Unnamed: 0,Feature,Ranking
0,home_last3_xGF_per_min_pp,1
1,home_Goalie_FenwickSV%,1
25,home_last_20_GF%_5v5,1
24,away_last_30_GF%_5v5,1
23,away_Rating.A.Pre,1
22,home_last_10_FF%_5v5,1
17,away_last10_xGF_per_min_pp,1
16,home_last_40_xGF%_5v5,1
26,away_last40_GF_per_min_pp,1


In [75]:
rfecv_columns = list(rfecv_results.iloc[:rfecv_pipeline[1].n_features_,0])
rfecv_columns 

['home_last3_xGF_per_min_pp',
 'home_Goalie_FenwickSV%',
 'home_last_20_GF%_5v5',
 'away_last_30_GF%_5v5',
 'away_Rating.A.Pre',
 'home_last_10_FF%_5v5',
 'away_last10_xGF_per_min_pp',
 'home_last_40_xGF%_5v5',
 'away_last40_GF_per_min_pp']

### Logistic Regression

In [76]:
X_train = df[df['Season'] != '2020-2021'].dropna().loc[:,rfecv_columns]
y_train = df[df['Season'] != '2020-2021'].dropna()['Home_Team_Won']
X_test = df[df['Season'] == '2020-2021'].loc[:,rfecv_columns]
y_test = df[df['Season'] == '2020-2021']['Home_Team_Won']

In [77]:
log_rfecv_pipeline = Pipeline(steps=[('ss', StandardScaler()),
                      ('logisticregression', LogisticRegression(max_iter=10000))])

log_params = {'logisticregression__solver' : ['liblinear', 'lbfgs', 'newton-cg'],
                'logisticregression__penalty': ['l1', 'l2'],
                'logisticregression__C': [.01, 0.1, 10, 20, 100],
                'logisticregression__class_weight': [None]}

log_cv_all = GridSearchCV(log_rfecv_pipeline, param_grid=log_params, cv=5, scoring=scoring, refit = 'neg_log_loss',  verbose=1)

In [78]:
log_cv_all.fit(X_train[rfecv_columns], y_train)

Fitting 5 folds for each of 30 candidates, totalling 150 fits


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('ss', StandardScaler()),
                                       ('logisticregression',
                                        LogisticRegression(max_iter=10000))]),
             param_grid={'logisticregression__C': [0.01, 0.1, 10, 20, 100],
                         'logisticregression__class_weight': [None],
                         'logisticregression__penalty': ['l1', 'l2'],
                         'logisticregression__solver': ['liblinear', 'lbfgs',
                                                        'newton-cg']},
             refit='neg_log_loss', scoring=['neg_log_loss', 'accuracy'],
             verbose=1)

In [79]:
log_all_results = pd.DataFrame(log_cv_all.cv_results_).sort_values('mean_test_neg_log_loss', ascending=False)
log_all_results.head(10)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_logisticregression__C,param_logisticregression__class_weight,param_logisticregression__penalty,param_logisticregression__solver,params,split0_test_neg_log_loss,split1_test_neg_log_loss,split2_test_neg_log_loss,split3_test_neg_log_loss,split4_test_neg_log_loss,mean_test_neg_log_loss,std_test_neg_log_loss,rank_test_neg_log_loss,split0_test_accuracy,split1_test_accuracy,split2_test_accuracy,split3_test_accuracy,split4_test_accuracy,mean_test_accuracy,std_test_accuracy,rank_test_accuracy
6,0.00622,0.00013,0.003461,7.5e-05,0.1,,l1,liblinear,"{'logisticregression__C': 0.1, 'logisticregres...",-0.683493,-0.676186,-0.67455,-0.672781,-0.673968,-0.676196,0.00381,1,0.550907,0.58159,0.594972,0.574022,0.576816,0.575661,0.014317,17
4,0.007108,7.6e-05,0.003425,1.4e-05,0.01,,l2,lbfgs,"{'logisticregression__C': 0.01, 'logisticregre...",-0.682996,-0.675241,-0.675597,-0.673949,-0.673375,-0.676232,0.003479,2,0.546722,0.585774,0.596369,0.565642,0.581006,0.575103,0.017297,18
5,0.013248,0.001218,0.004097,0.000478,0.01,,l2,newton-cg,"{'logisticregression__C': 0.01, 'logisticregre...",-0.682997,-0.675241,-0.675597,-0.673949,-0.673375,-0.676232,0.003479,3,0.546722,0.585774,0.596369,0.565642,0.581006,0.575103,0.017297,18
3,0.005481,0.000175,0.003253,1.3e-05,0.01,,l2,liblinear,"{'logisticregression__C': 0.01, 'logisticregre...",-0.682877,-0.675304,-0.675674,-0.674005,-0.673641,-0.6763,0.003376,4,0.555091,0.584379,0.587989,0.575419,0.579609,0.576497,0.011518,16
10,0.007337,0.000301,0.003419,3.3e-05,0.1,,l2,lbfgs,"{'logisticregression__C': 0.1, 'logisticregres...",-0.684044,-0.676038,-0.675903,-0.673108,-0.673425,-0.676503,0.003961,5,0.549512,0.585774,0.592179,0.578212,0.581006,0.577337,0.014696,1
11,0.011399,0.000528,0.003445,1.6e-05,0.1,,l2,newton-cg,"{'logisticregression__C': 0.1, 'logisticregres...",-0.684044,-0.676037,-0.675902,-0.673108,-0.673426,-0.676503,0.003961,6,0.549512,0.585774,0.592179,0.578212,0.581006,0.577337,0.014696,1
9,0.005766,4.8e-05,0.003394,1.6e-05,0.1,,l2,liblinear,"{'logisticregression__C': 0.1, 'logisticregres...",-0.684022,-0.676034,-0.675904,-0.673109,-0.673453,-0.676504,0.003948,7,0.546722,0.585774,0.589385,0.579609,0.581006,0.576499,0.015289,5
12,0.006644,0.00013,0.003406,2.9e-05,10.0,,l1,liblinear,"{'logisticregression__C': 10, 'logisticregress...",-0.684239,-0.676227,-0.675958,-0.672984,-0.673468,-0.676575,0.004045,8,0.549512,0.585774,0.589385,0.579609,0.581006,0.577057,0.014204,3
18,0.006527,0.000311,0.003445,1.9e-05,20.0,,l1,liblinear,"{'logisticregression__C': 20, 'logisticregress...",-0.684245,-0.676233,-0.675967,-0.672985,-0.673465,-0.676579,0.004046,9,0.548117,0.585774,0.590782,0.578212,0.581006,0.576778,0.014956,4
16,0.007279,0.000221,0.003613,0.000365,10.0,,l2,lbfgs,"{'logisticregression__C': 10, 'logisticregress...",-0.684249,-0.676236,-0.675975,-0.672987,-0.67346,-0.676582,0.004048,10,0.548117,0.584379,0.590782,0.578212,0.581006,0.576499,0.014798,5


### Ada Boost

In [80]:
X_train = df[df['Season'] != '2020-2021'].dropna().loc[:,rfecv_columns]
y_train = df[df['Season'] != '2020-2021'].dropna()['Home_Team_Won']
X_test = df[df['Season'] == '2020-2021'].loc[:,rfecv_columns]
y_test = df[df['Season'] == '2020-2021']['Home_Team_Won']

In [81]:
ada_rfecv_pipeline = Pipeline(steps=[('ss', StandardScaler()),
                      ('ada', AdaBoostClassifier())])

ada_params = {'ada__n_estimators': [25],
         'ada__learning_rate': [ .1, 10],
         'ada__base_estimator': [svm.SVC(probability=True , kernel='linear'), LogisticRegression(max_iter =10000, C=.01, penalty = 'l1', solver = 'liblinear')],}

ada_cv_all = GridSearchCV(ada_rfecv_pipeline, param_grid=ada_params, cv=5, scoring=scoring, refit='neg_log_loss', verbose=1)

In [82]:
ada_cv_all.fit(X_train, y_train)

Fitting 5 folds for each of 4 candidates, totalling 20 fits


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('ss', StandardScaler()),
                                       ('ada', AdaBoostClassifier())]),
             param_grid={'ada__base_estimator': [SVC(kernel='linear',
                                                     probability=True),
                                                 LogisticRegression(C=0.01,
                                                                    max_iter=10000,
                                                                    penalty='l1',
                                                                    solver='liblinear')],
                         'ada__learning_rate': [0.1, 10],
                         'ada__n_estimators': [25]},
             refit='neg_log_loss', scoring=['neg_log_loss', 'accuracy'],
             verbose=1)

In [83]:
ada_all_results = pd.DataFrame(ada_cv_all.cv_results_).sort_values('mean_test_neg_log_loss', ascending=False)
ada_all_results.head(10)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_ada__base_estimator,param_ada__learning_rate,param_ada__n_estimators,params,split0_test_neg_log_loss,split1_test_neg_log_loss,split2_test_neg_log_loss,split3_test_neg_log_loss,split4_test_neg_log_loss,mean_test_neg_log_loss,std_test_neg_log_loss,rank_test_neg_log_loss,split0_test_accuracy,split1_test_accuracy,split2_test_accuracy,split3_test_accuracy,split4_test_accuracy,mean_test_accuracy,std_test_accuracy,rank_test_accuracy
1,32.3474,0.550717,2.137133,0.103863,"SVC(kernel='linear', probability=True)",10.0,25,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.683982,-0.678475,-0.679754,-0.682702,-0.681416,-0.681266,0.001976734,1,0.556485,0.591353,0.574022,0.561453,0.555866,0.567836,0.013448,1
0,32.946891,0.287765,2.11329,0.028486,"SVC(kernel='linear', probability=True)",0.1,25,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.684418,-0.682521,-0.68266,-0.681958,-0.682048,-0.682721,0.0008898324,2,0.553696,0.564854,0.571229,0.553073,0.560056,0.560581,0.006866,2
2,0.055528,0.000309,0.010574,2.9e-05,"LogisticRegression(C=0.01, max_iter=10000, pen...",0.1,25,{'ada__base_estimator': LogisticRegression(C=0...,-0.693147,-0.693147,-0.693147,-0.693147,-0.693147,-0.693147,8.599751000000001e-17,3,0.456067,0.456067,0.456704,0.456704,0.455307,0.45617,0.000517,3
3,0.055622,0.000853,0.010454,1.9e-05,"LogisticRegression(C=0.01, max_iter=10000, pen...",10.0,25,{'ada__base_estimator': LogisticRegression(C=0...,-0.693147,-0.693147,-0.693147,-0.693147,-0.693147,-0.693147,8.599751000000001e-17,3,0.456067,0.456067,0.456704,0.456704,0.455307,0.45617,0.000517,3


## Apply Best Model To Test

I will evaluate the best model iterations on the held out 2021 season data

In [358]:
results_dict = {'cv accuracy': {}, 'cv log loss': {}, 'test accuracy': {}, 'test log_loss':{}}
accuracy_list = []
log_loss_list = []

In [359]:
X_train = df[df['Season'] != '2020-2021'].dropna().loc[:,r_5_40]
y_train = df[df['Season'] != '2020-2021'].dropna()['Home_Team_Won']
X_test = df[df['Season'] == '2020-2021'].dropna().loc[:,r_5_40]
y_test = df[df['Season'] == '2020-2021'].dropna()['Home_Team_Won']



test_preds_5_40 = log_cv.predict(X_test)

test_probs_5_40 = log_cv.predict_proba(X_test)


accuracy_list.append(accuracy_score(y_test, test_preds_5_40))
log_loss_list.append(log_loss(y_test, test_probs_5_40))


In [360]:
X_train = df[df['Season'] != '2020-2021'].dropna().loc[:,r40]
y_train = df[df['Season'] != '2020-2021'].dropna()['Home_Team_Won']
X_test = df[df['Season'] == '2020-2021'].dropna().loc[:,r40]
y_test = df[df['Season'] == '2020-2021'].dropna()['Home_Team_Won']



test_preds_40 = log_cv_40.predict(X_test)

test_probs_40 = log_cv_40.predict_proba(X_test)

accuracy_list.append(accuracy_score(y_test, test_preds_40))
log_loss_list.append(log_loss(y_test, test_probs_40))

In [361]:
X_train = df[df['Season'] != '2020-2021'].dropna().loc[:,rfecv_columns]
y_train = df[df['Season'] != '2020-2021'].dropna()['Home_Team_Won']
X_test = df[df['Season'] == '2020-2021'].dropna().loc[:,rfecv_columns]
y_test = df[df['Season'] == '2020-2021'].dropna()['Home_Team_Won']

test_preds_rfecv = log_cv_all.predict(X_test)

test_probs_rfecv = log_cv_all.predict_proba(X_test)


accuracy_list.append(accuracy_score(y_test, test_preds_rfecv))
log_loss_list.append(log_loss(y_test, test_probs_rfecv))



In [364]:
X_train = df[df['Season'] != '2020-2021'].dropna().loc[:,r_5_40]
y_train = df[df['Season'] != '2020-2021'].dropna()['Home_Team_Won']
X_test = df[df['Season'] == '2020-2021'].dropna().loc[:,r_5_40]
y_test = df[df['Season'] == '2020-2021'].dropna()['Home_Team_Won']



accuracy_list.append(accuracy_score(y_test, ada_cv.predict(X_test)))
log_loss_list.append(log_loss(y_test,ada_cv.predict_proba(X_test)))

In [365]:
X_train = df[df['Season'] != '2020-2021'].dropna().loc[:,r40]
y_train = df[df['Season'] != '2020-2021'].dropna()['Home_Team_Won']
X_test = df[df['Season'] == '2020-2021'].dropna().loc[:,r40]
y_test = df[df['Season'] == '2020-2021'].dropna()['Home_Team_Won']



accuracy_list.append(accuracy_score(y_test, ada_cv_40.predict(X_test)))
log_loss_list.append(log_loss(y_test, ada_cv_40.predict_proba(X_test)))

In [403]:
results_dict['test accuracy'] = accuracy_list
results_dict['test log_loss'] = log_loss_list
models = ['5 and 40 log', '40 log', 'rfecv log', '5 and 40 ada', '40 ada']
results_dict['cv accuracy'] = [log_results['mean_test_accuracy'][0], log_40_results['mean_test_accuracy'][0], log_all_results['mean_test_accuracy'][0], ada_results['mean_test_accuracy'][0], ada_40_results['mean_test_accuracy'][0]]
results_dict['cv log loss'] = [log_cv.best_score_*-1, log_cv_40.best_score_*-1, log_cv_all.best_score_*-1, ada_cv.best_score_*-1, ada_cv_40.best_score_*-1]

In [404]:
results_df = pd.DataFrame(results_dict, index = models)

## Conclusion

Best model was logistic regression with the rolling 5 and 40 features on the test data. Interestingly, this was the 4th best model on the CV training data set though it did have the best CV accuracy.

In [405]:
results_df.sort_values('test log_loss')

Unnamed: 0,cv accuracy,cv log loss,test accuracy,test log_loss
5 and 40 log,0.581067,0.677735,0.597205,0.657201
40 log,0.580267,0.674081,0.593393,0.65724
40 ada,0.579467,0.675472,0.606099,0.660695
rfecv log,0.571733,0.675783,0.590851,0.665077
5 and 40 ada,0.562667,0.681288,0.560356,0.678121


## Next Steps
To further improve the models I would like to take the following next steps

- Train a neural network model
- Categorize B2B better
- Include team ELO feature
- Try linear weightings in rolling features
- Increase goalie games
- Add prior year goalie GAR feature
- Add Team HDSC % feature
- Add more seasons to training set
- Compare against historical implied odds from a bookmaker
- Adjust ineperienced goalie imputed stats and exclude 2021 season to avoid data leakage on test set