# NHL Game Prediction Modeling
by Gary Schwaeber

## Overview

With sport betting becoming increasingly popular and mainstream I believe that data science can be used to make superior decisions over gut intuitions. Unlike in Football or Basketball where the betting against the spread is the most popular type of betting, the moneyline is king in the NHL due to lower scoring games. When betting the moneyline, the way to gain an edge is if you know the truer probability of the game outcome then the implied odds from the moneyline. Over the course of the season, if your internally derived game probabilities are superior to the book's, you will be profitable. 

In this notebook I will attempt to train logistic regression, ada boost, gradient boosting, and neural network models in an attempt to make the best possible game prediction model. I will train my models and tune model hyperparemetres using game results from seasons '2017-2018', '2018-2019', '2019-2020'. Then I will predict on held out games from the current 2021 season and evaluate my model. 

Log loss is the score which I will use to optimize and judge the models. Log-loss is indicative of how close the prediction probability is to the corresponding actual/true value (0 or 1 in case of binary classification). The more the predicted probability diverges from the actual value, the higher is the log-loss value, [Source](https://towardsdatascience.com/intuition-behind-log-loss-score-4e0c9979680a). There are currently a handful of public models whose log loss on the current season's games is being [tracked](https://hockey-statistics.com/2021/05/03/game-projections-january-13th-2021/) on which I can compare the quality of my model to.   I will also review accuracy scores due to their interpretability.

In [471]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import numpy as np
import statsmodels.api as sm
import hockey_scraper
import pickle
import time
import random
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import sklearn
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.preprocessing import normalize, FunctionTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, accuracy_score
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics import roc_auc_score, roc_curve, auc

from sklearn.metrics import confusion_matrix, plot_confusion_matrix,\
    precision_score, recall_score, accuracy_score, f1_score, log_loss,\
    roc_curve, roc_auc_score, classification_report
from sklearn.ensemble import VotingClassifier, BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, AdaBoostRegressor, GradientBoostingClassifier
from collections import Counter
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.compose import make_column_selector as selector
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.feature_selection import RFECV

#for the Neural Network
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.regularizers import l2
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.wrappers import scikit_learn
from tensorflow.keras.callbacks import EarlyStopping
from keras.constraints import maxnorm

import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)

In [2]:
df = pd.read_csv('data/all_games_multirolling_SVA_2.csv')

In [4]:
df.shape

(4447, 155)

In [347]:
# define feature columns for different rolling intervals

common = ['home_Goalie_FenwickSV%',
 'home_Goalie_GSAx/60',
 'home_Goalie_HDCSV%',
 'away_Goalie_FenwickSV%',
 'away_Goalie_GSAx/60',
 'away_Goalie_HDCSV%', 
 'home_Rating.A.Pre',
 'away_Rating.A.Pre',
 'B2B_Status']

r3 = ['home_last_3_FF%_5v5',
 'home_last_3_GF%_5v5',
 'home_last_3_xGF%_5v5',
 'home_last_3_SH%',
 'home_last3_xGF_per_min_pp',
 'home_last3_GF_per_min_pp',
 'home_last3_xGA_per_min_pk',
 'home_last3_GA_per_min_pk',
 'away_last_3_FF%_5v5',
 'away_last_3_GF%_5v5',
 'away_last_3_xGF%_5v5',
 'away_last_3_SH%',
 'away_last3_xGF_per_min_pp',
 'away_last3_GF_per_min_pp',
 'away_last3_xGA_per_min_pk',
 'away_last3_GA_per_min_pk'] + common

r5 =['home_last_5_FF%_5v5',
 'home_last_5_GF%_5v5',
 'home_last_5_xGF%_5v5',
 'home_last_5_SH%',

 'home_last5_xGF_per_min_pp',
 'home_last5_GF_per_min_pp',

 'home_last5_xGA_per_min_pk',
 'home_last5_GA_per_min_pk',
 'away_last_5_FF%_5v5',
 'away_last_5_GF%_5v5',
 'away_last_5_xGF%_5v5',
 'away_last_5_SH%',
 'away_last5_xGF_per_min_pp',
 'away_last5_GF_per_min_pp',
 'away_last5_xGA_per_min_pk',
 'away_last5_GA_per_min_pk'] + common

r10 =['home_last_10_FF%_5v5',
 'home_last_10_GF%_5v5',
 'home_last_10_xGF%_5v5',
 'home_last_10_SH%',
 'home_last10_xGF_per_min_pp',
 'home_last10_GF_per_min_pp',
 'home_last10_xGA_per_min_pk',
 'home_last10_GA_per_min_pk',
  'away_last_10_FF%_5v5',
 'away_last_10_GF%_5v5',
 'away_last_10_xGF%_5v5',
 'away_last_10_SH%',
 'away_last10_xGF_per_min_pp',
 'away_last10_GF_per_min_pp',
 'away_last10_xGA_per_min_pk',
 'away_last10_GA_per_min_pk',]


r20 = ['home_last_20_FF%_5v5',
 'home_last_20_GF%_5v5',
 'home_last_20_xGF%_5v5',
 'home_last_20_SH%',

 'home_last20_xGF_per_min_pp',
 'home_last20_GF_per_min_pp',

 'home_last20_xGA_per_min_pk',
 'home_last20_GA_per_min_pk',
 'away_last_20_FF%_5v5',
 'away_last_20_GF%_5v5',
 'away_last_20_xGF%_5v5',
 'away_last_20_SH%',

 'away_last20_xGF_per_min_pp',
 'away_last20_GF_per_min_pp',

 'away_last20_xGA_per_min_pk',
 'away_last20_GA_per_min_pk']

r30 = ['home_last_30_FF%_5v5',
 'home_last_30_GF%_5v5',
 'home_last_30_xGF%_5v5',
 'home_last_30_SH%',
 'home_last30_xGF_per_min_pp',
 'home_last30_GF_per_min_pp',
 'home_last30_xGA_per_min_pk',
 'home_last30_GA_per_min_pk',
 'away_last_30_FF%_5v5',
 'away_last_30_GF%_5v5',
 'away_last_30_xGF%_5v5',
 'away_last_30_SH%',
 'away_last30_xGF_per_min_pp',
 'away_last30_GF_per_min_pp',
 'away_last30_xGA_per_min_pk',
 'away_last30_GA_per_min_pk'] + common


r40 = ['home_last_40_FF%_5v5',
 'home_last_40_GF%_5v5',
 'home_last_40_xGF%_5v5',
 'home_last_40_SH%',
 'home_last40_xGF_per_min_pp',
 'home_last40_GF_per_min_pp',
 'home_last40_xGA_per_min_pk',
 'home_last40_GA_per_min_pk',
 'away_last_40_FF%_5v5',
 'away_last_40_GF%_5v5',
 'away_last_40_xGF%_5v5',
 'away_last_40_SH%',
 'away_last40_xGF_per_min_pp',
 'away_last40_GF_per_min_pp',
 'away_last40_xGA_per_min_pk',
 'away_last40_GA_per_min_pk'] + common


all_r = list(set(r3+r5+r10+r20+r30+r40))

r3_30 =list(set(r3+r30))
r5_30 = list(set(r5+r30))
r10_30 = list(set(r10+r30))
r_3_5_30 = list(set(r3+r5+r30))
r_5_20 = list(set(r5+r20))
r_5_40 = list(set(r5+r40))

## Baseline Model

The baseline model will predict that every home team wins their game and that the probability of that is the ratio of games the home team has won.

In [6]:
df['Home_Team_Won'].value_counts(normalize=True)

1    0.541714
0    0.458286
Name: Home_Team_Won, dtype: float64

In [7]:
baseline_preds = np.ones(df.shape[0])
accuracy_score(df['Home_Team_Won'],baseline_preds)

0.5417135147290308

In [8]:
baseline_probs = np.repeat(df['Home_Team_Won'].value_counts(normalize=True)[1], df.shape[0])

log_loss(df['Home_Team_Won'], baseline_probs)

0.6896630977766495

The models will need to beat an accuracy score of 54.17% and a log loss of .6897, otherwise they are no better than just predicting the home team will win. 

## Rolling 5 and 40 game features

For my first set of models I will attempt using 5 and 40 game rolling features. These seemed like a good set based on the feature selection notebook. 40 games is currently the longest rolling runway I have currently for the team statistics. The 40 games stats intuitively provide the most smoothing of team data over the course of the season, while the 5 game stats may provide some insight on any streakiness or may cover recent developments that would affect short term team performances such as player injuries, trades coaching changes etc.

In [351]:
X_train = df[df['Season'] != '2020-2021'].dropna().loc[:,r_5_40]
y_train = df[df['Season'] != '2020-2021'].dropna()['Home_Team_Won']
X_test = df[df['Season'] == '2020-2021'].loc[:,r_5_40]
y_test = df[df['Season'] == '2020-2021']['Home_Team_Won']

In [352]:
X_train.columns

Index(['away_last40_GF_per_min_pp', 'away_last_40_FF%_5v5',
       'away_last5_xGA_per_min_pk', 'home_Goalie_FenwickSV%',
       'home_last_5_SH%', 'home_last5_GA_per_min_pk',
       'home_last5_xGA_per_min_pk', 'away_last_40_GF%_5v5',
       'away_last5_xGF_per_min_pp', 'home_last_40_GF%_5v5',
       'away_last_5_FF%_5v5', 'home_Goalie_HDCSV%', 'home_last_40_FF%_5v5',
       'home_last40_xGA_per_min_pk', 'B2B_Status',
       'away_last40_xGA_per_min_pk', 'home_last40_GA_per_min_pk',
       'home_last_5_xGF%_5v5', 'home_last5_GF_per_min_pp', 'home_Rating.A.Pre',
       'away_last_5_SH%', 'away_last5_GA_per_min_pk', 'home_last_5_GF%_5v5',
       'away_Goalie_GSAx/60', 'home_last_40_SH%', 'away_last_40_xGF%_5v5',
       'home_last_40_xGF%_5v5', 'home_last40_xGF_per_min_pp',
       'away_last40_GA_per_min_pk', 'away_last_5_xGF%_5v5',
       'home_last40_GF_per_min_pp', 'away_last_40_SH%',
       'away_last40_xGF_per_min_pp', 'home_last_5_FF%_5v5',
       'away_last5_GF_per_min_pp', 'away_

In [353]:
X_train.shape

(3582, 41)

In [354]:
numeric_features = ['home_last40_xGF_per_min_pp', 'away_last_5_xGF%_5v5',
       'home_last_40_GF%_5v5',
       'home_last40_xGA_per_min_pk', 'home_last5_xGA_per_min_pk',
       'home_last_40_SH%', 
       'home_Goalie_GSAx/60',
        'away_Goalie_GSAx/60',
       'away_last_5_GF%_5v5', 
       'home_last_40_xGF%_5v5', 
     'home_last5_GF_per_min_pp',
       'home_last_5_GF%_5v5', 'home_last_5_FF%_5v5',
       'away_last5_xGF_per_min_pp', 'away_last40_xGF_per_min_pp',
       'home_last40_GA_per_min_pk', 'home_Goalie_HDCSV%',
       'away_last5_GA_per_min_pk', 'away_last40_GF_per_min_pp',
       'away_Rating.A.Pre', 'home_last_5_xGF%_5v5', 'away_last_5_SH%',
       'home_Rating.A.Pre', 'home_last5_xGF_per_min_pp',
       'away_last_40_xGF%_5v5', 'home_last5_GA_per_min_pk',
     'away_last5_GF_per_min_pp',
       'away_last_40_GF%_5v5', 'away_last_40_SH%', 'away_last_5_FF%_5v5',
       'home_Goalie_FenwickSV%', 'away_Goalie_HDCSV%',
       'away_last40_xGA_per_min_pk', 'home_last_5_SH%',
       'away_last5_xGA_per_min_pk', 'home_last_40_FF%_5v5',
       'away_Goalie_FenwickSV%', 'away_last_40_FF%_5v5',
       'home_last40_GF_per_min_pp', 'away_last40_GA_per_min_pk']

In [355]:
X_train[numeric_features].head()

Unnamed: 0,home_last40_xGF_per_min_pp,away_last_5_xGF%_5v5,home_last_40_GF%_5v5,home_last40_xGA_per_min_pk,home_last5_xGA_per_min_pk,home_last_40_SH%,home_Goalie_GSAx/60,away_Goalie_GSAx/60,away_last_5_GF%_5v5,home_last_40_xGF%_5v5,home_last5_GF_per_min_pp,home_last_5_GF%_5v5,home_last_5_FF%_5v5,away_last5_xGF_per_min_pp,away_last40_xGF_per_min_pp,home_last40_GA_per_min_pk,home_Goalie_HDCSV%,away_last5_GA_per_min_pk,away_last40_GF_per_min_pp,away_Rating.A.Pre,home_last_5_xGF%_5v5,away_last_5_SH%,home_Rating.A.Pre,home_last5_xGF_per_min_pp,away_last_40_xGF%_5v5,home_last5_GA_per_min_pk,away_last5_GF_per_min_pp,away_last_40_GF%_5v5,away_last_40_SH%,away_last_5_FF%_5v5,home_Goalie_FenwickSV%,away_Goalie_HDCSV%,away_last40_xGA_per_min_pk,home_last_5_SH%,away_last5_xGA_per_min_pk,home_last_40_FF%_5v5,away_Goalie_FenwickSV%,away_last_40_FF%_5v5,home_last40_GF_per_min_pp,away_last40_GA_per_min_pk
0,0.112699,48.770492,50.127801,0.104858,0.098556,9.025236,-0.202922,0.082345,45.9375,48.992719,0.095465,57.080799,52.399869,0.06991,0.1224,0.137102,0.858462,0.19544,0.139885,1500.66,51.663405,6.967375,1495.03,0.079714,49.339386,0.054152,0.10181,51.399425,8.124451,52.562502,0.937294,0.873171,0.133976,9.426112,0.074267,48.803377,0.942516,49.991679,0.117297,0.121145
1,0.124909,51.204482,56.868932,0.129028,0.153383,9.060588,0.169541,-0.239655,49.927641,51.954595,0.2997,59.064609,42.564205,0.096,0.102018,0.10473,0.877358,0.040268,0.115864,1535.17,46.860987,11.358025,1577.1,0.143856,52.486645,0.225564,0.1,58.184556,8.420932,46.882217,0.941904,0.864516,0.097844,12.093988,0.109128,50.828439,0.941294,50.633643,0.138139,0.086229
2,0.132248,40.305523,56.575634,0.116445,0.131278,9.02546,0.302087,-0.097423,45.427286,49.851785,0.190981,58.385392,60.511924,0.153218,0.120843,0.112194,0.897778,0.068337,0.11683,1496.85,60.180542,9.286882,1522.11,0.113316,49.136336,0.132159,0.16609,50.499508,7.879167,43.520998,0.942492,0.878613,0.107127,8.478124,0.112415,50.407241,0.938246,50.595552,0.149493,0.106067
3,0.105738,49.941995,53.260259,0.120913,0.137299,7.970138,-0.164139,-0.080476,56.272661,52.809227,0.04329,57.771883,54.316401,0.137242,0.143998,0.125595,0.869266,0.100615,0.103208,1496.86,52.571429,6.524847,1525.37,0.118615,50.855171,0.125962,0.115979,45.246898,5.932286,51.909534,0.934447,0.848,0.093779,9.804628,0.086864,52.890654,0.938305,51.197815,0.099407,0.131951
4,0.129293,43.6373,48.882718,0.084868,0.067197,7.303942,-0.310233,-0.346771,52.130045,54.871795,0.297398,48.959081,52.400715,0.142088,0.087855,0.101091,0.830721,0.0,0.121801,1545.81,50.929752,7.311321,1521.29,0.098885,50.381002,0.03672,0.065934,52.122642,7.885816,47.102597,0.933383,0.839117,0.102718,5.518246,0.107438,55.762037,0.939698,51.309591,0.189644,0.128468


In [356]:
scoring = ['neg_log_loss', 'accuracy']

In [357]:
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

categorical_features = ['B2B_Status']

categorical_transformer = Pipeline(steps=[
    ('ohe', OneHotEncoder())])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

### Logistic Regression

In [358]:
log_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                      ('logisticregression', LogisticRegression(max_iter=10000))])

log_params = {'logisticregression__solver' : ['liblinear', 'lbfgs', 'newton-cg'],
                'logisticregression__penalty': ['l1', 'l2'],
                'logisticregression__C': [.00001, .0001, .001, .01, .05, 0.1],
                'logisticregression__class_weight': [None] }

log_cv = GridSearchCV(log_pipeline, param_grid=log_params, cv=5, scoring=scoring, refit = 'neg_log_loss',  verbose=1)

In [359]:
log_cv.fit(X_train, y_train)

Fitting 5 folds for each of 36 candidates, totalling 180 fits


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('num',
                                                                         Pipeline(steps=[('scaler',
                                                                                          StandardScaler())]),
                                                                         ['home_last40_xGF_per_min_pp',
                                                                          'away_last_5_xGF%_5v5',
                                                                          'home_last_40_GF%_5v5',
                                                                          'home_last40_xGA_per_min_pk',
                                                                          'home_last5_xGA_per_min_pk',
                                                                          'home_last_40_SH%',
                                 

In [360]:
log_cv.best_score_

-0.6754370089204439

In [361]:
log_results = pd.DataFrame(log_cv.cv_results_).sort_values('mean_test_neg_log_loss', ascending=False)
log_results.head(10)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_logisticregression__C,param_logisticregression__class_weight,param_logisticregression__penalty,param_logisticregression__solver,params,split0_test_neg_log_loss,split1_test_neg_log_loss,split2_test_neg_log_loss,split3_test_neg_log_loss,split4_test_neg_log_loss,mean_test_neg_log_loss,std_test_neg_log_loss,rank_test_neg_log_loss,split0_test_accuracy,split1_test_accuracy,split2_test_accuracy,split3_test_accuracy,split4_test_accuracy,mean_test_accuracy,std_test_accuracy,rank_test_accuracy
16,0.01423,0.000881,0.008953,0.000246,0.001,,l2,lbfgs,"{'logisticregression__C': 0.001, 'logisticregr...",-0.678374,-0.671673,-0.677392,-0.675825,-0.673921,-0.675437,0.002411,1,0.566248,0.592748,0.594972,0.571229,0.578212,0.580682,0.011433,2
17,0.022448,0.000838,0.009748,0.000274,0.001,,l2,newton-cg,"{'logisticregression__C': 0.001, 'logisticregr...",-0.678374,-0.671674,-0.677392,-0.675825,-0.673923,-0.675437,0.00241,2,0.566248,0.592748,0.594972,0.571229,0.578212,0.580682,0.011433,2
15,0.01436,0.000209,0.008002,0.000113,0.001,,l2,liblinear,"{'logisticregression__C': 0.001, 'logisticregr...",-0.678653,-0.672743,-0.678873,-0.676277,-0.675154,-0.67634,0.002285,3,0.567643,0.619247,0.594972,0.585196,0.569832,0.587378,0.018843,1
21,0.019311,0.001301,0.008871,0.000578,0.01,,l2,liblinear,"{'logisticregression__C': 0.01, 'logisticregre...",-0.677641,-0.668417,-0.680201,-0.678592,-0.676997,-0.676369,0.00412,4,0.584379,0.598326,0.587989,0.565642,0.565642,0.580396,0.012887,4
23,0.025948,0.001959,0.008886,0.000504,0.01,,l2,newton-cg,"{'logisticregression__C': 0.01, 'logisticregre...",-0.677863,-0.668443,-0.679819,-0.679093,-0.676975,-0.676439,0.004116,5,0.585774,0.594142,0.587989,0.557263,0.569832,0.579,0.01351,6
22,0.020201,0.001397,0.009166,0.000786,0.01,,l2,lbfgs,"{'logisticregression__C': 0.01, 'logisticregre...",-0.677864,-0.668444,-0.679819,-0.679094,-0.676973,-0.676439,0.004116,6,0.585774,0.594142,0.587989,0.557263,0.569832,0.579,0.01351,6
24,0.018809,0.001569,0.008565,0.000657,0.05,,l1,liblinear,"{'logisticregression__C': 0.05, 'logisticregre...",-0.678657,-0.671507,-0.677803,-0.679101,-0.675922,-0.676598,0.002769,7,0.567643,0.588563,0.596369,0.561453,0.575419,0.577889,0.012936,8
30,0.022879,0.00344,0.008905,0.000758,0.1,,l1,liblinear,"{'logisticregression__C': 0.1, 'logisticregres...",-0.677719,-0.669127,-0.678905,-0.679826,-0.677734,-0.676662,0.003849,8,0.570432,0.596932,0.585196,0.561453,0.553073,0.573417,0.01586,13
27,0.024693,0.000996,0.009416,0.000581,0.05,,l2,liblinear,"{'logisticregression__C': 0.05, 'logisticregre...",-0.677728,-0.669121,-0.681685,-0.681348,-0.679935,-0.677963,0.004635,9,0.58159,0.599721,0.579609,0.561453,0.555866,0.575648,0.015642,10
28,0.023362,0.000928,0.009284,0.000269,0.05,,l2,lbfgs,"{'logisticregression__C': 0.05, 'logisticregre...",-0.677767,-0.669153,-0.681584,-0.681528,-0.679915,-0.67799,0.004632,10,0.577406,0.598326,0.581006,0.562849,0.555866,0.575091,0.01483,11


### Ada Boost

In [363]:
ada_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                      ('ada', AdaBoostClassifier())])

ada_params = {'ada__n_estimators': [25, 50],
         'ada__learning_rate': [.1, 1, 10, 20],
         'ada__base_estimator': [svm.SVC(probability=True , kernel='linear'), LogisticRegression()],}

ada_cv = GridSearchCV(ada_pipeline, param_grid=ada_params, cv=5, scoring=scoring, refit='neg_log_loss', verbose=1)

In [364]:
ada_cv.fit(X_train, y_train)

Fitting 5 folds for each of 16 candidates, totalling 80 fits


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('num',
                                                                         Pipeline(steps=[('scaler',
                                                                                          StandardScaler())]),
                                                                         ['home_last40_xGF_per_min_pp',
                                                                          'away_last_5_xGF%_5v5',
                                                                          'home_last_40_GF%_5v5',
                                                                          'home_last40_xGA_per_min_pk',
                                                                          'home_last5_xGA_per_min_pk',
                                                                          'home_last_40_SH%',
                                 

In [365]:
ada_cv.best_score_

-0.6799356834363997

In [366]:
ada_results = pd.DataFrame(ada_cv.cv_results_).sort_values('mean_test_neg_log_loss', ascending=False)
ada_results.head(10)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_ada__base_estimator,param_ada__learning_rate,param_ada__n_estimators,params,split0_test_neg_log_loss,split1_test_neg_log_loss,split2_test_neg_log_loss,split3_test_neg_log_loss,split4_test_neg_log_loss,mean_test_neg_log_loss,std_test_neg_log_loss,rank_test_neg_log_loss,split0_test_accuracy,split1_test_accuracy,split2_test_accuracy,split3_test_accuracy,split4_test_accuracy,mean_test_accuracy,std_test_accuracy,rank_test_accuracy
4,42.276222,0.342722,2.560906,0.079277,"SVC(kernel='linear', probability=True)",10.0,25,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.683174,-0.675307,-0.681271,-0.68128,-0.678648,-0.679936,0.002726,1,0.564854,0.595537,0.582402,0.560056,0.569832,0.574536,0.012872,3
0,44.279606,0.927575,2.506164,0.064214,"SVC(kernel='linear', probability=True)",0.1,25,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.682536,-0.679752,-0.682087,-0.681354,-0.68149,-0.681444,0.000946,2,0.55788,0.571827,0.562849,0.551676,0.555866,0.56002,0.006912,8
6,44.684842,0.986084,2.675247,0.181274,"SVC(kernel='linear', probability=True)",20.0,25,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.684677,-0.678135,-0.684233,-0.682375,-0.680141,-0.681912,0.002476,3,0.552301,0.592748,0.551676,0.564246,0.571229,0.56644,0.015085,7
8,0.126325,0.005333,0.016162,0.000574,LogisticRegression(),0.1,25,"{'ada__base_estimator': LogisticRegression(), ...",-0.684109,-0.681523,-0.684027,-0.68269,-0.682635,-0.682997,0.000969,4,0.564854,0.594142,0.597765,0.569832,0.581006,0.58152,0.012945,2
5,83.157681,11.129491,4.957148,0.868651,"SVC(kernel='linear', probability=True)",10.0,50,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.684681,-0.689051,-0.682996,-0.683049,-0.680393,-0.684034,0.00286,5,0.569038,0.543933,0.586592,0.579609,0.565642,0.568963,0.014572,6
1,83.538823,1.743496,4.931413,0.17363,"SVC(kernel='linear', probability=True)",0.1,50,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.685212,-0.683904,-0.684925,-0.684599,-0.684942,-0.684716,0.00045,6,0.543933,0.543933,0.539106,0.540503,0.544693,0.542434,0.002209,12
7,79.672341,4.427197,4.250342,0.344274,"SVC(kernel='linear', probability=True)",20.0,50,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.686891,-0.682469,-0.685481,-0.686313,-0.684238,-0.685078,0.00158,7,0.543933,0.562064,0.540503,0.543296,0.547486,0.547456,0.007635,9
9,0.228802,0.003782,0.023327,0.000595,LogisticRegression(),0.1,50,"{'ada__base_estimator': LogisticRegression(), ...",-0.686993,-0.685253,-0.687152,-0.686154,-0.686353,-0.686381,0.000677,8,0.569038,0.60251,0.597765,0.572626,0.571229,0.582634,0.014416,1
2,36.399376,0.31774,2.152855,0.039304,"SVC(kernel='linear', probability=True)",1.0,25,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.688148,-0.688114,-0.687842,-0.688028,-0.687648,-0.687956,0.000187,9,0.543933,0.543933,0.543296,0.543296,0.544693,0.54383,0.000517,10
3,68.962821,1.963881,4.055349,0.116078,"SVC(kernel='linear', probability=True)",1.0,50,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.68886,-0.688524,-0.688807,-0.688775,-0.688475,-0.688688,0.000157,10,0.543933,0.543933,0.543296,0.543296,0.544693,0.54383,0.000517,10


### Gradient Boosting

In [367]:
gb_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                      ('gb', GradientBoostingClassifier())])

gb_params = {'gb__n_estimators': [200, 400],
         'gb__learning_rate': [.001,.01],
         'gb__max_depth' : [3,5]}

gb_cv = GridSearchCV(gb_pipeline, param_grid=gb_params, cv=5, scoring=scoring, refit='neg_log_loss', verbose=1)

In [368]:
gb_cv.fit(X_train, y_train)

Fitting 5 folds for each of 8 candidates, totalling 40 fits


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('num',
                                                                         Pipeline(steps=[('scaler',
                                                                                          StandardScaler())]),
                                                                         ['home_last40_xGF_per_min_pp',
                                                                          'away_last_5_xGF%_5v5',
                                                                          'home_last_40_GF%_5v5',
                                                                          'home_last40_xGA_per_min_pk',
                                                                          'home_last5_xGA_per_min_pk',
                                                                          'home_last_40_SH%',
                                 

In [369]:
gb_cv.best_score_

-0.6813251696169567

In [370]:
gb_results = pd.DataFrame(gb_cv.cv_results_).sort_values('mean_test_neg_log_loss', ascending=False)
gb_results.head(10)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_gb__learning_rate,param_gb__max_depth,param_gb__n_estimators,params,split0_test_neg_log_loss,split1_test_neg_log_loss,split2_test_neg_log_loss,split3_test_neg_log_loss,split4_test_neg_log_loss,mean_test_neg_log_loss,std_test_neg_log_loss,rank_test_neg_log_loss,split0_test_accuracy,split1_test_accuracy,split2_test_accuracy,split3_test_accuracy,split4_test_accuracy,mean_test_accuracy,std_test_accuracy,rank_test_accuracy
4,5.265635,0.096079,0.013914,0.001005,0.01,3,200,"{'gb__learning_rate': 0.01, 'gb__max_depth': 3...",-0.682622,-0.679526,-0.684279,-0.680547,-0.679651,-0.681325,0.001847,1,0.559275,0.570432,0.590782,0.567039,0.582402,0.573986,0.011227,1
5,10.673233,0.07639,0.017061,0.001314,0.01,3,400,"{'gb__learning_rate': 0.01, 'gb__max_depth': 3...",-0.682203,-0.681423,-0.686615,-0.682465,-0.68198,-0.682937,0.001871,2,0.549512,0.559275,0.581006,0.572626,0.574022,0.567288,0.011333,2
6,8.427701,0.163069,0.017259,0.000989,0.01,5,200,"{'gb__learning_rate': 0.01, 'gb__max_depth': 5...",-0.684066,-0.680755,-0.688405,-0.682039,-0.684558,-0.683965,0.002611,3,0.555091,0.567643,0.572626,0.579609,0.561453,0.567284,0.008524,3
1,10.391733,0.082113,0.02072,0.000926,0.001,3,400,"{'gb__learning_rate': 0.001, 'gb__max_depth': ...",-0.685737,-0.683565,-0.686076,-0.685384,-0.684671,-0.685087,0.000892,4,0.538354,0.548117,0.539106,0.551676,0.541899,0.543831,0.005215,5
3,16.657891,0.049212,0.026395,0.001463,0.001,5,400,"{'gb__learning_rate': 0.001, 'gb__max_depth': ...",-0.685452,-0.682728,-0.689618,-0.685672,-0.684913,-0.685676,0.00223,5,0.535565,0.560669,0.526536,0.546089,0.547486,0.543269,0.011557,6
2,8.27432,0.161389,0.017821,0.001041,0.001,5,200,"{'gb__learning_rate': 0.001, 'gb__max_depth': ...",-0.686921,-0.685192,-0.68802,-0.688213,-0.685295,-0.686728,0.00129,6,0.53417,0.546722,0.536313,0.540503,0.550279,0.541598,0.006098,8
0,5.221529,0.092727,0.014165,0.000452,0.001,3,200,"{'gb__learning_rate': 0.001, 'gb__max_depth': ...",-0.687017,-0.68619,-0.687083,-0.687529,-0.686338,-0.686831,0.000498,7,0.543933,0.543933,0.536313,0.543296,0.541899,0.541875,0.002878,7
7,16.611218,0.167446,0.022752,0.000479,0.01,5,400,"{'gb__learning_rate': 0.01, 'gb__max_depth': 5...",-0.683452,-0.688025,-0.694803,-0.691262,-0.694313,-0.690371,0.004227,8,0.569038,0.564854,0.551676,0.575419,0.565642,0.565326,0.007775,4


It does not seem that gradient boosting is producing good results for this dataset

### Neural Network

In [280]:
# define the grid search parameters
# batch_size = [10, 20, 40, 60, 80, 100]

param_grid = {'nn__epochs': [8,10, 12, 15, 18],
             'nn__optimizer' : ['RMSprop', 'Adam'], 
             'nn__activation' : ['sigmoid', 'hard_sigmoid', 'linear'],
            'nn__neurons' : [12, 18, 24, 30, 36, 40],
             'nn__weight_constraint': [1, 3, 5],
             'nn__dropout_rate' : [0.0,  0.3, 0.6, 0.9]}
keras_model = scikit_learn.KerasClassifier(build_fn=build_model, verbose=0)

numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

categorical_features = ['B2B_Status']

categorical_transformer = Pipeline(steps=[
    ('ohe', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

nn_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                      ('nn', keras_model)])





nn_cv = GridSearchCV(estimator=nn_pipeline, param_grid=param_grid, cv=3, scoring=scoring, refit='neg_log_loss', verbose=1)

In [None]:
nn_cv.fit(X_train, y_train)

In [284]:
nn_results = pd.DataFrame(nn_cv.cv_results_).sort_values('mean_test_neg_log_loss', ascending=False)
nn_results.head(5)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_nn__activation,param_nn__dropout_rate,param_nn__epochs,param_nn__neurons,param_nn__optimizer,param_nn__weight_constraint,params,split0_test_neg_log_loss,split1_test_neg_log_loss,split2_test_neg_log_loss,mean_test_neg_log_loss,std_test_neg_log_loss,rank_test_neg_log_loss,split0_test_accuracy,split1_test_accuracy,split2_test_accuracy,mean_test_accuracy,std_test_accuracy,rank_test_accuracy
1648,0.683245,0.006266,0.104092,0.001678,linear,0.3,8,36,Adam,3,"{'nn__activation': 'linear', 'nn__dropout_rate...",-0.66799,-0.671304,-0.681146,-0.67348,0.005587,1,0.592965,0.582077,0.568677,0.58124,0.009933,572
1732,0.92138,0.005455,0.104325,0.000706,linear,0.3,15,12,Adam,3,"{'nn__activation': 'linear', 'nn__dropout_rate...",-0.671125,-0.67153,-0.678915,-0.673857,0.00358,2,0.58794,0.598827,0.570352,0.585706,0.011732,48
1828,0.678168,0.001133,0.103484,0.001109,linear,0.6,8,36,Adam,3,"{'nn__activation': 'linear', 'nn__dropout_rate...",-0.669233,-0.672437,-0.680127,-0.673932,0.004571,3,0.59799,0.593802,0.556114,0.582635,0.018831,333
1683,0.749562,0.003505,0.103456,0.00172,linear,0.3,10,36,Adam,1,"{'nn__activation': 'linear', 'nn__dropout_rate...",-0.669233,-0.672677,-0.679914,-0.673941,0.004451,4,0.59129,0.589615,0.548576,0.576494,0.019752,1258
1147,0.874522,0.008508,0.112667,0.000833,hard_sigmoid,0.6,10,40,RMSprop,3,"{'nn__activation': 'hard_sigmoid', 'nn__dropou...",-0.6702,-0.672712,-0.678995,-0.673969,0.003699,5,0.588777,0.592965,0.572027,0.58459,0.009046,112
1967,1.246649,0.0699,0.114459,0.003921,linear,0.6,18,30,Adam,5,"{'nn__activation': 'linear', 'nn__dropout_rate...",-0.670564,-0.672294,-0.679264,-0.674041,0.00376,6,0.578727,0.586265,0.560302,0.575098,0.010906,1375
1249,1.408697,0.335738,0.11308,0.001698,hard_sigmoid,0.6,18,36,RMSprop,3,"{'nn__activation': 'hard_sigmoid', 'nn__dropou...",-0.671396,-0.671077,-0.679764,-0.674079,0.004022,7,0.583752,0.60134,0.558626,0.58124,0.017528,549
1757,0.939249,0.004742,0.103511,0.000758,linear,0.3,15,36,Adam,5,"{'nn__activation': 'linear', 'nn__dropout_rate...",-0.669046,-0.672297,-0.680904,-0.674082,0.005003,8,0.596315,0.597152,0.555276,0.582915,0.019546,292
1916,0.99661,0.017258,0.106048,0.002964,linear,0.6,15,18,RMSprop,5,"{'nn__activation': 'linear', 'nn__dropout_rate...",-0.67096,-0.671626,-0.680003,-0.674196,0.004115,9,0.598827,0.598827,0.558626,0.585427,0.018951,66
1873,0.871623,0.010311,0.104613,0.000707,linear,0.6,12,12,RMSprop,3,"{'nn__activation': 'linear', 'nn__dropout_rate...",-0.671955,-0.671238,-0.679398,-0.674197,0.003689,10,0.582915,0.599665,0.557789,0.580123,0.017209,738


## 40 Game Rolling

I will run some models using only the rolling 40 game team stats

In [456]:
X_train = df[df['Season'] != '2020-2021'].dropna().loc[:,r40]
y_train = df[df['Season'] != '2020-2021'].dropna()['Home_Team_Won']
X_test = df[df['Season'] == '2020-2021'].loc[:,r40]
y_test = df[df['Season'] == '2020-2021']['Home_Team_Won']

In [306]:
X_train.columns

Index(['home_last_40_FF%_5v5', 'home_last_40_GF%_5v5', 'home_last_40_xGF%_5v5',
       'home_last_40_SH%', 'home_last40_xGF_per_min_pp',
       'home_last40_GF_per_min_pp', 'home_last40_xGA_per_min_pk',
       'home_last40_GA_per_min_pk', 'away_last_40_FF%_5v5',
       'away_last_40_GF%_5v5', 'away_last_40_xGF%_5v5', 'away_last_40_SH%',
       'away_last40_xGF_per_min_pp', 'away_last40_GF_per_min_pp',
       'away_last40_xGA_per_min_pk', 'away_last40_GA_per_min_pk',
       'home_Goalie_FenwickSV%', 'home_Goalie_GSAx/60', 'home_Goalie_HDCSV%',
       'away_Goalie_FenwickSV%', 'away_Goalie_GSAx/60', 'away_Goalie_HDCSV%',
       'home_Rating.A.Pre', 'away_Rating.A.Pre', 'B2B_Status'],
      dtype='object')

In [457]:
numeric_features =['home_last_40_FF%_5v5', 'home_last_40_GF%_5v5', 'home_last_40_xGF%_5v5',
       'home_last_40_SH%', 'home_last40_xGF_per_min_pp',
       'home_last40_GF_per_min_pp', 'home_last40_xGA_per_min_pk',
       'home_last40_GA_per_min_pk', 'away_last_40_FF%_5v5',
       'away_last_40_GF%_5v5', 'away_last_40_xGF%_5v5', 'away_last_40_SH%',
       'away_last40_xGF_per_min_pp', 'away_last40_GF_per_min_pp',
       'away_last40_xGA_per_min_pk', 'away_last40_GA_per_min_pk',
       'home_Goalie_FenwickSV%', 'home_Goalie_GSAx/60', 'home_Goalie_HDCSV%',
       'away_Goalie_FenwickSV%', 'away_Goalie_GSAx/60', 'away_Goalie_HDCSV%',
       'home_Rating.A.Pre', 'away_Rating.A.Pre']

### Logistic Regression

In [63]:
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

categorical_features = ['B2B_Status']

categorical_transformer = Pipeline(steps=[
    ('ohe', OneHotEncoder())])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

log_40_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                      ('logisticregression', LogisticRegression(max_iter=10000))])

In [64]:
log_params = {'logisticregression__solver' : ['liblinear', 'lbfgs', 'newton-cg'],
                'logisticregression__penalty': ['l1', 'l2'],
                'logisticregression__C': [.01, 0.1, 1, 10],
                'logisticregression__class_weight': [None] }

log_cv_40 = GridSearchCV(log_40_pipeline, param_grid=log_params, cv=5, scoring=scoring, refit = 'neg_log_loss',  verbose=1)

In [65]:
log_cv_40.fit(X_train, y_train)

Fitting 5 folds for each of 24 candidates, totalling 120 fits


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('num',
                                                                         Pipeline(steps=[('scaler',
                                                                                          StandardScaler())]),
                                                                         ['home_last_40_FF%_5v5',
                                                                          'home_last_40_GF%_5v5',
                                                                          'home_last_40_xGF%_5v5',
                                                                          'home_last_40_SH%',
                                                                          'home_last40_xGF_per_min_pp',
                                                                          'home_last40_GF_per_min_pp',
                                      

In [384]:
log_40_results = pd.DataFrame(log_cv_40.cv_results_).sort_values('mean_test_neg_log_loss', ascending=False)
log_40_results.head(10)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_logisticregression__C,param_logisticregression__class_weight,param_logisticregression__penalty,param_logisticregression__solver,params,split0_test_neg_log_loss,split1_test_neg_log_loss,split2_test_neg_log_loss,split3_test_neg_log_loss,split4_test_neg_log_loss,mean_test_neg_log_loss,std_test_neg_log_loss,rank_test_neg_log_loss,split0_test_accuracy,split1_test_accuracy,split2_test_accuracy,split3_test_accuracy,split4_test_accuracy,mean_test_accuracy,std_test_accuracy,rank_test_accuracy
3,0.013546,0.000548,0.008087,0.000554,0.01,,l2,liblinear,"{'logisticregression__C': 0.01, 'logisticregre...",-0.677897,-0.667774,-0.678553,-0.67785,-0.669169,-0.674249,0.004744,1,0.562064,0.595537,0.593575,0.568436,0.574022,0.578727,0.013481,4
5,0.020442,0.000386,0.007603,0.000155,0.01,,l2,newton-cg,"{'logisticregression__C': 0.01, 'logisticregre...",-0.678123,-0.667782,-0.678179,-0.678294,-0.669174,-0.674311,0.004783,2,0.564854,0.589958,0.594972,0.571229,0.574022,0.579007,0.011493,1
4,0.014846,0.000243,0.00865,0.002275,0.01,,l2,lbfgs,"{'logisticregression__C': 0.01, 'logisticregre...",-0.678123,-0.667785,-0.678179,-0.678293,-0.669174,-0.674311,0.004782,3,0.564854,0.589958,0.594972,0.571229,0.574022,0.579007,0.011493,1
6,0.017185,0.001476,0.007465,2.9e-05,0.1,,l1,liblinear,"{'logisticregression__C': 0.1, 'logisticregres...",-0.677613,-0.668763,-0.677168,-0.679307,-0.671782,-0.674927,0.003983,4,0.567643,0.587169,0.608939,0.554469,0.576816,0.579007,0.018431,1
9,0.014779,0.000584,0.007444,6.2e-05,0.1,,l2,liblinear,"{'logisticregression__C': 0.1, 'logisticregres...",-0.677746,-0.668233,-0.680027,-0.6799,-0.670958,-0.675373,0.004863,5,0.573222,0.587169,0.585196,0.551676,0.565642,0.572581,0.013096,5
10,0.019531,0.000411,0.007489,6.3e-05,0.1,,l2,lbfgs,"{'logisticregression__C': 0.1, 'logisticregres...",-0.677759,-0.668236,-0.679981,-0.67999,-0.670951,-0.675383,0.004873,6,0.573222,0.587169,0.585196,0.551676,0.564246,0.572302,0.013255,6
11,0.025096,0.000453,0.007748,0.000268,0.1,,l2,newton-cg,"{'logisticregression__C': 0.1, 'logisticregres...",-0.677763,-0.668245,-0.679982,-0.679991,-0.670953,-0.675387,0.004871,7,0.571827,0.587169,0.585196,0.551676,0.564246,0.572023,0.013247,7
12,0.029974,0.005265,0.007823,0.000163,1.0,,l1,liblinear,"{'logisticregression__C': 1, 'logisticregressi...",-0.677904,-0.668691,-0.68021,-0.680296,-0.671408,-0.675702,0.004772,8,0.571827,0.587169,0.579609,0.546089,0.564246,0.569788,0.014107,8
15,0.016982,0.000966,0.008524,0.000234,1.0,,l2,liblinear,"{'logisticregression__C': 1, 'logisticregressi...",-0.677739,-0.668911,-0.680817,-0.680721,-0.671601,-0.675958,0.00486,9,0.570432,0.585774,0.579609,0.548883,0.562849,0.569509,0.01294,10
17,0.026711,0.001629,0.00794,0.000586,1.0,,l2,newton-cg,"{'logisticregression__C': 1, 'logisticregressi...",-0.67774,-0.668912,-0.680813,-0.680733,-0.6716,-0.67596,0.004862,10,0.570432,0.585774,0.579609,0.548883,0.562849,0.569509,0.01294,10


### Ada Boost

In [66]:
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

categorical_features = ['B2B_Status']

categorical_transformer = Pipeline(steps=[
    ('ohe', OneHotEncoder())])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])


ada_40_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                      ('ada', AdaBoostClassifier())])

ada_params = {'ada__n_estimators': [25],
         'ada__learning_rate': [.01, .1, 1, 10],
         'ada__base_estimator': [svm.SVC(probability=True , kernel='linear'), LogisticRegression()],}

ada_cv_40 = GridSearchCV(ada_40_pipeline, param_grid=ada_params, cv=5, scoring=scoring, refit='neg_log_loss', verbose=1)

In [67]:
ada_cv_40.fit(X_train, y_train)

Fitting 5 folds for each of 8 candidates, totalling 40 fits


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('num',
                                                                         Pipeline(steps=[('scaler',
                                                                                          StandardScaler())]),
                                                                         ['home_last_40_FF%_5v5',
                                                                          'home_last_40_GF%_5v5',
                                                                          'home_last_40_xGF%_5v5',
                                                                          'home_last_40_SH%',
                                                                          'home_last40_xGF_per_min_pp',
                                                                          'home_last40_GF_per_min_pp',
                                      

In [68]:
ada_40_results = pd.DataFrame(ada_cv_40.cv_results_).sort_values('mean_test_neg_log_loss', ascending=False)
ada_40_results.head(10)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_ada__base_estimator,param_ada__learning_rate,param_ada__n_estimators,params,split0_test_neg_log_loss,split1_test_neg_log_loss,split2_test_neg_log_loss,split3_test_neg_log_loss,split4_test_neg_log_loss,mean_test_neg_log_loss,std_test_neg_log_loss,rank_test_neg_log_loss,split0_test_accuracy,split1_test_accuracy,split2_test_accuracy,split3_test_accuracy,split4_test_accuracy,mean_test_accuracy,std_test_accuracy,rank_test_accuracy
0,37.925709,0.638357,2.14507,0.033783,"SVC(kernel='linear', probability=True)",0.01,25,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.678507,-0.672283,-0.677155,-0.678249,-0.672888,-0.675816,0.002684,1,0.550907,0.577406,0.599162,0.561453,0.579609,0.573707,0.016532,5
4,0.11237,0.003209,0.015125,0.000329,LogisticRegression(),0.01,25,"{'ada__base_estimator': LogisticRegression(), ...",-0.680541,-0.676259,-0.67967,-0.678631,-0.675985,-0.678217,0.001817,2,0.569038,0.588563,0.569832,0.567039,0.583799,0.575654,0.008774,3
3,37.55927,0.155653,2.273125,0.004889,"SVC(kernel='linear', probability=True)",10.0,25,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.681224,-0.674573,-0.677922,-0.680711,-0.676703,-0.678227,0.002487,3,0.567643,0.595537,0.608939,0.571229,0.572626,0.583195,0.016198,1
1,38.204704,0.113358,2.208953,0.011606,"SVC(kernel='linear', probability=True)",0.1,25,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.68244,-0.679579,-0.682155,-0.681328,-0.679784,-0.681057,0.001183,4,0.564854,0.574616,0.564246,0.568436,0.565642,0.567559,0.003809,6
5,0.114122,0.000884,0.014839,7e-05,LogisticRegression(),0.1,25,"{'ada__base_estimator': LogisticRegression(), ...",-0.684177,-0.681519,-0.684073,-0.683125,-0.681574,-0.682894,0.00116,5,0.560669,0.588563,0.597765,0.569832,0.582402,0.579847,0.013203,2
2,31.212874,0.111837,1.813163,0.012767,"SVC(kernel='linear', probability=True)",1.0,25,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.688093,-0.687867,-0.688317,-0.688659,-0.686559,-0.687899,0.000719,6,0.543933,0.543933,0.543296,0.543296,0.544693,0.54383,0.000517,7
6,0.122421,0.057093,0.015021,0.00021,LogisticRegression(),1.0,25,"{'ada__base_estimator': LogisticRegression(), ...",-0.691571,-0.691189,-0.691649,-0.691591,-0.691297,-0.691459,0.000182,7,0.559275,0.594142,0.589385,0.564246,0.569832,0.575376,0.013873,4
7,0.232039,0.006414,0.015102,0.000146,LogisticRegression(),10.0,25,"{'ada__base_estimator': LogisticRegression(), ...",-0.691897,-0.69111,-0.702624,-0.690811,-0.691471,-0.693583,0.004535,8,0.543933,0.543933,0.543296,0.543296,0.544693,0.54383,0.000517,7


### Neural Network

In [310]:
def build_model():
    model = Sequential()
    model.add(Dense(neurons, activation=activation, input_dim=28, kernel_constraint=maxnorm(weight_constraint)))
    model.add(Dropout(dropout_rate))
    model.add(Dense(4, activation=activation))
    model.add(Dense(1, activation = 'sigmoid'))

    model.compile(optimizer=optimizer,
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model

param_grid = {'nn__epochs': [8,10, 15, 18],
             'nn__optimizer' : ['RMSprop', 'Adam'], 
             'nn__activation' : ['hard_sigmoid', 'linear'],
            'nn__neurons' : [12, 24, 36, 40],
             'nn__weight_constraint': [1, 3],
             'nn__dropout_rate' : [0.3, 0.6]}

keras_model = scikit_learn.KerasClassifier(build_fn=build_model, verbose=0)

numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

categorical_features = ['B2B_Status']

categorical_transformer = Pipeline(steps=[
    ('ohe', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

nn_40_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                      ('nn', keras_model)])





nn_40_cv = GridSearchCV(estimator=nn_40_pipeline, param_grid=param_grid, cv=3, scoring=scoring, refit='neg_log_loss', verbose=1)

In [311]:
nn_40_cv.fit(X_train, y_train)

Fitting 3 folds for each of 256 candidates, totalling 768 fits


GridSearchCV(cv=3,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('num',
                                                                         Pipeline(steps=[('scaler',
                                                                                          StandardScaler())]),
                                                                         ['home_last_40_FF%_5v5',
                                                                          'home_last_40_GF%_5v5',
                                                                          'home_last_40_xGF%_5v5',
                                                                          'home_last_40_SH%',
                                                                          'home_last40_xGF_per_min_pp',
                                                                          'home_last40_GF_per_min_pp',
                                      

In [506]:
nn_40_results = pd.DataFrame(nn_40_cv.cv_results_).sort_values('mean_test_neg_log_loss', ascending=False)
nn_40_results.head(10)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_nn__activation,param_nn__dropout_rate,param_nn__epochs,param_nn__neurons,param_nn__optimizer,param_nn__weight_constraint,params,split0_test_neg_log_loss,split1_test_neg_log_loss,split2_test_neg_log_loss,mean_test_neg_log_loss,std_test_neg_log_loss,rank_test_neg_log_loss,split0_test_accuracy,split1_test_accuracy,split2_test_accuracy,mean_test_accuracy,std_test_accuracy,rank_test_accuracy
186,1.026956,0.006196,0.313454,0.299331,linear,0.3,18,36,Adam,1,"{'nn__activation': 'linear', 'nn__dropout_rate': 0.3, 'nn__epochs': 18, 'nn__neurons': 36, 'nn__optimizer': 'Adam', 'nn__weight_constraint': 1}",-0.670244,-0.670506,-0.677651,-0.6728,0.003431,1,0.585427,0.59464,0.564489,0.581519,0.012615,66
201,0.714656,0.003949,0.099995,0.000245,linear,0.6,8,36,RMSprop,3,"{'nn__activation': 'linear', 'nn__dropout_rate': 0.6, 'nn__epochs': 8, 'nn__neurons': 36, 'nn__optimizer': 'RMSprop', 'nn__weight_constraint': 3}",-0.669582,-0.671819,-0.677286,-0.672896,0.003236,2,0.582915,0.602178,0.559464,0.581519,0.017466,66
45,1.036484,0.007792,0.110218,0.000721,hard_sigmoid,0.3,15,40,RMSprop,3,"{'nn__activation': 'hard_sigmoid', 'nn__dropout_rate': 0.3, 'nn__epochs': 15, 'nn__neurons': 40, 'nn__optimizer': 'RMSprop', 'nn__weight_constraint': 3}",-0.669712,-0.670576,-0.678738,-0.673009,0.004066,3,0.574539,0.595477,0.573702,0.58124,0.010073,79
182,1.004689,0.007679,0.100871,0.001236,linear,0.3,18,24,Adam,1,"{'nn__activation': 'linear', 'nn__dropout_rate': 0.3, 'nn__epochs': 18, 'nn__neurons': 24, 'nn__optimizer': 'Adam', 'nn__weight_constraint': 1}",-0.669191,-0.671065,-0.678912,-0.673056,0.004211,4,0.585427,0.592127,0.564489,0.580681,0.011772,91
130,0.841163,0.297668,0.101427,0.001652,linear,0.3,8,12,Adam,1,"{'nn__activation': 'linear', 'nn__dropout_rate': 0.3, 'nn__epochs': 8, 'nn__neurons': 12, 'nn__optimizer': 'Adam', 'nn__weight_constraint': 1}",-0.67178,-0.669664,-0.677814,-0.673086,0.003453,5,0.576214,0.583752,0.563652,0.574539,0.008291,220
228,1.156157,0.303291,0.103092,0.001277,linear,0.6,15,24,RMSprop,1,"{'nn__activation': 'linear', 'nn__dropout_rate': 0.6, 'nn__epochs': 15, 'nn__neurons': 24, 'nn__optimizer': 'RMSprop', 'nn__weight_constraint': 1}",-0.669714,-0.670462,-0.679106,-0.673094,0.004262,6,0.580402,0.603015,0.559464,0.58096,0.017784,84
178,0.954173,0.001785,0.100213,0.00111,linear,0.3,18,12,Adam,1,"{'nn__activation': 'linear', 'nn__dropout_rate': 0.3, 'nn__epochs': 18, 'nn__neurons': 12, 'nn__optimizer': 'Adam', 'nn__weight_constraint': 1}",-0.670004,-0.671762,-0.677552,-0.673106,0.003225,7,0.582077,0.595477,0.567839,0.581798,0.011285,61
123,1.107433,0.005035,0.109486,0.000659,hard_sigmoid,0.6,18,36,Adam,3,"{'nn__activation': 'hard_sigmoid', 'nn__dropout_rate': 0.6, 'nn__epochs': 18, 'nn__neurons': 36, 'nn__optimizer': 'Adam', 'nn__weight_constraint': 3}",-0.670246,-0.670674,-0.678426,-0.673115,0.003759,8,0.572864,0.595477,0.566164,0.578169,0.012541,142
155,0.733315,0.002264,0.100377,0.000298,linear,0.3,10,36,Adam,3,"{'nn__activation': 'linear', 'nn__dropout_rate': 0.3, 'nn__epochs': 10, 'nn__neurons': 36, 'nn__optimizer': 'Adam', 'nn__weight_constraint': 3}",-0.669381,-0.670897,-0.679089,-0.673122,0.004264,9,0.592127,0.592127,0.564489,0.582915,0.013029,37
144,0.770164,0.000856,0.313379,0.301552,linear,0.3,10,12,RMSprop,1,"{'nn__activation': 'linear', 'nn__dropout_rate': 0.3, 'nn__epochs': 10, 'nn__neurons': 12, 'nn__optimizer': 'RMSprop', 'nn__weight_constraint': 1}",-0.670948,-0.670709,-0.677717,-0.673125,0.003249,10,0.586265,0.597152,0.554439,0.579285,0.018123,120


## All Rolling Game Features With Recursive Feature Elimination

In [69]:
X_train = df[df['Season'] != '2020-2021'].dropna().loc[:,all_r]
y_train = df[df['Season'] != '2020-2021'].dropna()['Home_Team_Won']
X_test = df[df['Season'] == '2020-2021'].loc[:,all_r]
y_test = df[df['Season'] == '2020-2021']['Home_Team_Won']

In [383]:
X_train.shape

(3750, 104)

### Recursive Feature Elimination

In [70]:
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

categorical_features = ['B2B_Status']

categorical_transformer = Pipeline(steps=[
    ('ohe', OneHotEncoder())])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

rfecv = RFECV(estimator= LogisticRegression(max_iter =10000, penalty = 'l2', solver='liblinear', C=.1), step=1, scoring='accuracy')
rfecv_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                      ('rfecv', rfecv)])

In [71]:
rfecv_pipeline.fit(X_train, y_train)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('scaler',
                                                                   StandardScaler())]),
                                                  ['home_last_40_FF%_5v5',
                                                   'home_last_40_GF%_5v5',
                                                   'home_last_40_xGF%_5v5',
                                                   'home_last_40_SH%',
                                                   'home_last40_xGF_per_min_pp',
                                                   'home_last40_GF_per_min_pp',
                                                   'home_last40_xGA_per_min_pk',
                                                   'home_last40_GA_per_min_pk',
                                                   'away_last_40_FF%_5v5',
                                                   

In [72]:
rfecv_pipeline[1].n_features_

9

In [73]:
rfecv_pipeline[1].ranking_

array([ 1,  1, 13,  8, 18, 14,  9,  6,  2, 20, 17,  7, 10, 12, 16, 15,  1,
        1,  5, 19, 11,  4,  1,  1,  1,  1,  1,  3])

In [74]:
rfecv_results = pd.DataFrame(list(zip(X_train.columns, rfecv_pipeline[1].ranking_)), columns = ['Feature', 'Ranking']).sort_values('Ranking')
rfecv_results.head(rfecv_pipeline[1].n_features_)

Unnamed: 0,Feature,Ranking
0,home_last3_xGF_per_min_pp,1
1,home_Goalie_FenwickSV%,1
25,home_last_20_GF%_5v5,1
24,away_last_30_GF%_5v5,1
23,away_Rating.A.Pre,1
22,home_last_10_FF%_5v5,1
17,away_last10_xGF_per_min_pp,1
16,home_last_40_xGF%_5v5,1
26,away_last40_GF_per_min_pp,1


In [75]:
rfecv_columns = list(rfecv_results.iloc[:rfecv_pipeline[1].n_features_,0])
rfecv_columns 

['home_last3_xGF_per_min_pp',
 'home_Goalie_FenwickSV%',
 'home_last_20_GF%_5v5',
 'away_last_30_GF%_5v5',
 'away_Rating.A.Pre',
 'home_last_10_FF%_5v5',
 'away_last10_xGF_per_min_pp',
 'home_last_40_xGF%_5v5',
 'away_last40_GF_per_min_pp']

### Logistic Regression

In [76]:
X_train = df[df['Season'] != '2020-2021'].dropna().loc[:,rfecv_columns]
y_train = df[df['Season'] != '2020-2021'].dropna()['Home_Team_Won']
X_test = df[df['Season'] == '2020-2021'].loc[:,rfecv_columns]
y_test = df[df['Season'] == '2020-2021']['Home_Team_Won']

In [77]:
log_rfecv_pipeline = Pipeline(steps=[('ss', StandardScaler()),
                      ('logisticregression', LogisticRegression(max_iter=10000))])

log_params = {'logisticregression__solver' : ['liblinear', 'lbfgs', 'newton-cg'],
                'logisticregression__penalty': ['l1', 'l2'],
                'logisticregression__C': [.01, 0.1, 10, 20, 100],
                'logisticregression__class_weight': [None]}

log_cv_all = GridSearchCV(log_rfecv_pipeline, param_grid=log_params, cv=5, scoring=scoring, refit = 'neg_log_loss',  verbose=1)

In [78]:
log_cv_all.fit(X_train[rfecv_columns], y_train)

Fitting 5 folds for each of 30 candidates, totalling 150 fits


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('ss', StandardScaler()),
                                       ('logisticregression',
                                        LogisticRegression(max_iter=10000))]),
             param_grid={'logisticregression__C': [0.01, 0.1, 10, 20, 100],
                         'logisticregression__class_weight': [None],
                         'logisticregression__penalty': ['l1', 'l2'],
                         'logisticregression__solver': ['liblinear', 'lbfgs',
                                                        'newton-cg']},
             refit='neg_log_loss', scoring=['neg_log_loss', 'accuracy'],
             verbose=1)

In [79]:
log_all_results = pd.DataFrame(log_cv_all.cv_results_).sort_values('mean_test_neg_log_loss', ascending=False)
log_all_results.head(10)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_logisticregression__C,param_logisticregression__class_weight,param_logisticregression__penalty,param_logisticregression__solver,params,split0_test_neg_log_loss,split1_test_neg_log_loss,split2_test_neg_log_loss,split3_test_neg_log_loss,split4_test_neg_log_loss,mean_test_neg_log_loss,std_test_neg_log_loss,rank_test_neg_log_loss,split0_test_accuracy,split1_test_accuracy,split2_test_accuracy,split3_test_accuracy,split4_test_accuracy,mean_test_accuracy,std_test_accuracy,rank_test_accuracy
6,0.00622,0.00013,0.003461,7.5e-05,0.1,,l1,liblinear,"{'logisticregression__C': 0.1, 'logisticregres...",-0.683493,-0.676186,-0.67455,-0.672781,-0.673968,-0.676196,0.00381,1,0.550907,0.58159,0.594972,0.574022,0.576816,0.575661,0.014317,17
4,0.007108,7.6e-05,0.003425,1.4e-05,0.01,,l2,lbfgs,"{'logisticregression__C': 0.01, 'logisticregre...",-0.682996,-0.675241,-0.675597,-0.673949,-0.673375,-0.676232,0.003479,2,0.546722,0.585774,0.596369,0.565642,0.581006,0.575103,0.017297,18
5,0.013248,0.001218,0.004097,0.000478,0.01,,l2,newton-cg,"{'logisticregression__C': 0.01, 'logisticregre...",-0.682997,-0.675241,-0.675597,-0.673949,-0.673375,-0.676232,0.003479,3,0.546722,0.585774,0.596369,0.565642,0.581006,0.575103,0.017297,18
3,0.005481,0.000175,0.003253,1.3e-05,0.01,,l2,liblinear,"{'logisticregression__C': 0.01, 'logisticregre...",-0.682877,-0.675304,-0.675674,-0.674005,-0.673641,-0.6763,0.003376,4,0.555091,0.584379,0.587989,0.575419,0.579609,0.576497,0.011518,16
10,0.007337,0.000301,0.003419,3.3e-05,0.1,,l2,lbfgs,"{'logisticregression__C': 0.1, 'logisticregres...",-0.684044,-0.676038,-0.675903,-0.673108,-0.673425,-0.676503,0.003961,5,0.549512,0.585774,0.592179,0.578212,0.581006,0.577337,0.014696,1
11,0.011399,0.000528,0.003445,1.6e-05,0.1,,l2,newton-cg,"{'logisticregression__C': 0.1, 'logisticregres...",-0.684044,-0.676037,-0.675902,-0.673108,-0.673426,-0.676503,0.003961,6,0.549512,0.585774,0.592179,0.578212,0.581006,0.577337,0.014696,1
9,0.005766,4.8e-05,0.003394,1.6e-05,0.1,,l2,liblinear,"{'logisticregression__C': 0.1, 'logisticregres...",-0.684022,-0.676034,-0.675904,-0.673109,-0.673453,-0.676504,0.003948,7,0.546722,0.585774,0.589385,0.579609,0.581006,0.576499,0.015289,5
12,0.006644,0.00013,0.003406,2.9e-05,10.0,,l1,liblinear,"{'logisticregression__C': 10, 'logisticregress...",-0.684239,-0.676227,-0.675958,-0.672984,-0.673468,-0.676575,0.004045,8,0.549512,0.585774,0.589385,0.579609,0.581006,0.577057,0.014204,3
18,0.006527,0.000311,0.003445,1.9e-05,20.0,,l1,liblinear,"{'logisticregression__C': 20, 'logisticregress...",-0.684245,-0.676233,-0.675967,-0.672985,-0.673465,-0.676579,0.004046,9,0.548117,0.585774,0.590782,0.578212,0.581006,0.576778,0.014956,4
16,0.007279,0.000221,0.003613,0.000365,10.0,,l2,lbfgs,"{'logisticregression__C': 10, 'logisticregress...",-0.684249,-0.676236,-0.675975,-0.672987,-0.67346,-0.676582,0.004048,10,0.548117,0.584379,0.590782,0.578212,0.581006,0.576499,0.014798,5


### Ada Boost

In [80]:
X_train = df[df['Season'] != '2020-2021'].dropna().loc[:,rfecv_columns]
y_train = df[df['Season'] != '2020-2021'].dropna()['Home_Team_Won']
X_test = df[df['Season'] == '2020-2021'].loc[:,rfecv_columns]
y_test = df[df['Season'] == '2020-2021']['Home_Team_Won']

In [81]:
ada_rfecv_pipeline = Pipeline(steps=[('ss', StandardScaler()),
                      ('ada', AdaBoostClassifier())])

ada_params = {'ada__n_estimators': [25],
         'ada__learning_rate': [ .1, 10],
         'ada__base_estimator': [svm.SVC(probability=True , kernel='linear'), LogisticRegression(max_iter =10000, C=.01, penalty = 'l1', solver = 'liblinear')],}

ada_cv_all = GridSearchCV(ada_rfecv_pipeline, param_grid=ada_params, cv=5, scoring=scoring, refit='neg_log_loss', verbose=1)

In [82]:
ada_cv_all.fit(X_train, y_train)

Fitting 5 folds for each of 4 candidates, totalling 20 fits


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('ss', StandardScaler()),
                                       ('ada', AdaBoostClassifier())]),
             param_grid={'ada__base_estimator': [SVC(kernel='linear',
                                                     probability=True),
                                                 LogisticRegression(C=0.01,
                                                                    max_iter=10000,
                                                                    penalty='l1',
                                                                    solver='liblinear')],
                         'ada__learning_rate': [0.1, 10],
                         'ada__n_estimators': [25]},
             refit='neg_log_loss', scoring=['neg_log_loss', 'accuracy'],
             verbose=1)

In [83]:
ada_all_results = pd.DataFrame(ada_cv_all.cv_results_).sort_values('mean_test_neg_log_loss', ascending=False)
ada_all_results.head(10)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_ada__base_estimator,param_ada__learning_rate,param_ada__n_estimators,params,split0_test_neg_log_loss,split1_test_neg_log_loss,split2_test_neg_log_loss,split3_test_neg_log_loss,split4_test_neg_log_loss,mean_test_neg_log_loss,std_test_neg_log_loss,rank_test_neg_log_loss,split0_test_accuracy,split1_test_accuracy,split2_test_accuracy,split3_test_accuracy,split4_test_accuracy,mean_test_accuracy,std_test_accuracy,rank_test_accuracy
1,32.3474,0.550717,2.137133,0.103863,"SVC(kernel='linear', probability=True)",10.0,25,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.683982,-0.678475,-0.679754,-0.682702,-0.681416,-0.681266,0.001976734,1,0.556485,0.591353,0.574022,0.561453,0.555866,0.567836,0.013448,1
0,32.946891,0.287765,2.11329,0.028486,"SVC(kernel='linear', probability=True)",0.1,25,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.684418,-0.682521,-0.68266,-0.681958,-0.682048,-0.682721,0.0008898324,2,0.553696,0.564854,0.571229,0.553073,0.560056,0.560581,0.006866,2
2,0.055528,0.000309,0.010574,2.9e-05,"LogisticRegression(C=0.01, max_iter=10000, pen...",0.1,25,{'ada__base_estimator': LogisticRegression(C=0...,-0.693147,-0.693147,-0.693147,-0.693147,-0.693147,-0.693147,8.599751000000001e-17,3,0.456067,0.456067,0.456704,0.456704,0.455307,0.45617,0.000517,3
3,0.055622,0.000853,0.010454,1.9e-05,"LogisticRegression(C=0.01, max_iter=10000, pen...",10.0,25,{'ada__base_estimator': LogisticRegression(C=0...,-0.693147,-0.693147,-0.693147,-0.693147,-0.693147,-0.693147,8.599751000000001e-17,3,0.456067,0.456067,0.456704,0.456704,0.455307,0.45617,0.000517,3


### Neural Network

In [422]:
X_train = df[df['Season'] != '2020-2021'].dropna().loc[:,rfecv_columns]
y_train = df[df['Season'] != '2020-2021'].dropna()['Home_Team_Won']
X_test = df[df['Season'] == '2020-2021'].loc[:,rfecv_columns]
y_test = df[df['Season'] == '2020-2021']['Home_Team_Won']

In [423]:
X_train.shape

(3582, 9)

In [426]:
def build_model(optimizer='adam', activation='relu', neurons = 1, dropout_rate=0.0, weight_constraint=0):
    model = Sequential()
    model.add(Dense(neurons, activation=activation, input_dim=9, kernel_constraint=maxnorm(weight_constraint)))
    model.add(Dropout(dropout_rate))
    model.add(Dense(4, activation=activation))
    model.add(Dense(1, activation = 'sigmoid'))

    model.compile(optimizer=optimizer,
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model


param_grid = {'nn__epochs': [8,10, 15, 18],
             'nn__optimizer' : ['Adam'], 
             'nn__activation' : ['hard_sigmoid', 'linear'],
            'nn__neurons' : [12, 24, 36, 40],
             'nn__weight_constraint': [1, 3],
             'nn__dropout_rate' : [0.3, 0.6]}
keras_model = scikit_learn.KerasClassifier(build_fn=build_model, verbose=0)

numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

categorical_features = ['B2B_Status']

categorical_transformer = Pipeline(steps=[
    ('ohe', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

nn_all_pipeline = Pipeline(steps=[('scaler', StandardScaler()),
                      ('nn', keras_model)])





nn_all_cv = GridSearchCV(estimator=nn_all_pipeline, param_grid=param_grid, cv=3, scoring=scoring, refit='neg_log_loss', verbose=1)

In [427]:
nn_all_cv.fit(X_train, y_train)

Fitting 3 folds for each of 128 candidates, totalling 384 fits


GridSearchCV(cv=3,
             estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                       ('nn',
                                        <tensorflow.python.keras.wrappers.scikit_learn.KerasClassifier object at 0x7fd1f87d3340>)]),
             param_grid={'nn__activation': ['hard_sigmoid', 'linear'],
                         'nn__dropout_rate': [0.3, 0.6],
                         'nn__epochs': [8, 10, 15, 18],
                         'nn__neurons': [12, 24, 36, 40],
                         'nn__optimizer': ['Adam'],
                         'nn__weight_constraint': [1, 3]},
             refit='neg_log_loss', scoring=['neg_log_loss', 'accuracy'],
             verbose=1)

In [428]:
nn_all_results = pd.DataFrame(nn_all_cv.cv_results_).sort_values('mean_test_neg_log_loss', ascending=False)
nn_all_results.head(5)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_nn__activation,param_nn__dropout_rate,param_nn__epochs,param_nn__neurons,param_nn__optimizer,param_nn__weight_constraint,params,split0_test_neg_log_loss,split1_test_neg_log_loss,split2_test_neg_log_loss,mean_test_neg_log_loss,std_test_neg_log_loss,rank_test_neg_log_loss,split0_test_accuracy,split1_test_accuracy,split2_test_accuracy,mean_test_accuracy,std_test_accuracy,rank_test_accuracy
108,0.728519,0.003397,0.097156,0.003177,linear,0.6,10,36,Adam,1,"{'nn__activation': 'linear', 'nn__dropout_rate...",-0.677029,-0.67405,-0.678228,-0.676436,0.001756,1,0.572027,0.578727,0.573702,0.574819,0.002847,37
125,1.018057,0.008765,0.096354,0.000784,linear,0.6,18,36,Adam,3,"{'nn__activation': 'linear', 'nn__dropout_rate...",-0.677144,-0.67326,-0.679239,-0.676547,0.002477,2,0.574539,0.582077,0.568677,0.575098,0.005485,34
124,1.015489,0.003671,0.095677,0.000542,linear,0.6,18,36,Adam,1,"{'nn__activation': 'linear', 'nn__dropout_rate...",-0.677608,-0.673404,-0.678655,-0.676556,0.002269,3,0.572864,0.576214,0.569514,0.572864,0.002735,64
102,0.630642,0.001822,0.096596,0.001702,linear,0.6,8,40,Adam,1,"{'nn__activation': 'linear', 'nn__dropout_rate...",-0.677097,-0.67475,-0.677836,-0.676561,0.001316,4,0.564489,0.58124,0.574539,0.573423,0.006884,58
92,1.017234,0.003919,0.09559,0.000813,linear,0.3,18,36,Adam,1,"{'nn__activation': 'linear', 'nn__dropout_rate...",-0.67879,-0.673118,-0.677963,-0.676624,0.002502,5,0.572864,0.589615,0.575377,0.579285,0.007376,6


## Apply Best Models To Test Data

I will evaluate the best model iterations on the held out 2021 season data

In [493]:
results_dict = {'Training Cross Validation Accuracy': {}, 'Training Cross Validation Log Loss': {}, 'Test Accuracy': {}, 'Test Log Loss':{}, 'Paramters':{}}
accuracy_list = []
log_loss_list = []

In [494]:
X_train = df[df['Season'] != '2020-2021'].dropna().loc[:,r_5_40]
y_train = df[df['Season'] != '2020-2021'].dropna()['Home_Team_Won']
X_test = df[df['Season'] == '2020-2021'].dropna().loc[:,r_5_40]
y_test = df[df['Season'] == '2020-2021'].dropna()['Home_Team_Won']





accuracy_list.append(accuracy_score(y_test, log_cv.predict(X_test)))
log_loss_list.append(log_loss(y_test, log_cv.predict_proba(X_test)))


In [495]:
X_train = df[df['Season'] != '2020-2021'].dropna().loc[:,r40]
y_train = df[df['Season'] != '2020-2021'].dropna()['Home_Team_Won']
X_test = df[df['Season'] == '2020-2021'].dropna().loc[:,r40]
y_test = df[df['Season'] == '2020-2021'].dropna()['Home_Team_Won']



test_preds_40 = log_cv_40.predict(X_test)

test_probs_40 = log_cv_40.predict_proba(X_test)

accuracy_list.append(accuracy_score(y_test, test_preds_40))
log_loss_list.append(log_loss(y_test, test_probs_40))

In [496]:
X_train = df[df['Season'] != '2020-2021'].dropna().loc[:,rfecv_columns]
y_train = df[df['Season'] != '2020-2021'].dropna()['Home_Team_Won']
X_test = df[df['Season'] == '2020-2021'].dropna().loc[:,rfecv_columns]
y_test = df[df['Season'] == '2020-2021'].dropna()['Home_Team_Won']

test_preds_rfecv = log_cv_all.predict(X_test)

test_probs_rfecv = log_cv_all.predict_proba(X_test)


accuracy_list.append(accuracy_score(y_test, test_preds_rfecv))
log_loss_list.append(log_loss(y_test, test_probs_rfecv))



In [497]:
X_train = df[df['Season'] != '2020-2021'].dropna().loc[:,r_5_40]
y_train = df[df['Season'] != '2020-2021'].dropna()['Home_Team_Won']
X_test = df[df['Season'] == '2020-2021'].dropna().loc[:,r_5_40]
y_test = df[df['Season'] == '2020-2021'].dropna()['Home_Team_Won']



accuracy_list.append(accuracy_score(y_test, ada_cv.predict(X_test)))
log_loss_list.append(log_loss(y_test,ada_cv.predict_proba(X_test)))

In [498]:
X_train = df[df['Season'] != '2020-2021'].dropna().loc[:,r40]
y_train = df[df['Season'] != '2020-2021'].dropna()['Home_Team_Won']
X_test = df[df['Season'] == '2020-2021'].dropna().loc[:,r40]
y_test = df[df['Season'] == '2020-2021'].dropna()['Home_Team_Won']



accuracy_list.append(accuracy_score(y_test, ada_cv_40.predict(X_test)))
log_loss_list.append(log_loss(y_test, ada_cv_40.predict_proba(X_test)))

In [499]:
X_train = df[df['Season'] != '2020-2021'].dropna().loc[:,r_5_40]
y_train = df[df['Season'] != '2020-2021'].dropna()['Home_Team_Won']
X_test = df[df['Season'] == '2020-2021'].dropna().loc[:,r_5_40]
y_test = df[df['Season'] == '2020-2021'].dropna()['Home_Team_Won']



accuracy_list.append(accuracy_score(y_test, nn_cv.predict(X_test)))
log_loss_list.append(log_loss(y_test, nn_cv.predict_proba(X_test)))

In [500]:
X_train = df[df['Season'] != '2020-2021'].dropna().loc[:,r40]
y_train = df[df['Season'] != '2020-2021'].dropna()['Home_Team_Won']
X_test = df[df['Season'] == '2020-2021'].dropna().loc[:,r40]
y_test = df[df['Season'] == '2020-2021'].dropna()['Home_Team_Won']



accuracy_list.append(accuracy_score(y_test, nn_40_cv.predict(X_test)))
log_loss_list.append(log_loss(y_test, nn_40_cv.predict_proba(X_test)))

In [501]:
X_train = df[df['Season'] != '2020-2021'].dropna().loc[:,rfecv_columns]
y_train = df[df['Season'] != '2020-2021'].dropna()['Home_Team_Won']
X_test = df[df['Season'] == '2020-2021'].dropna().loc[:,rfecv_columns]
y_test = df[df['Season'] == '2020-2021'].dropna()['Home_Team_Won']



accuracy_list.append(accuracy_score(y_test, nn_all_cv.predict(X_test)))
log_loss_list.append(log_loss(y_test, nn_all_cv.predict_proba(X_test)))

In [512]:
results_dict['Test Accuracy'] = accuracy_list
results_dict['Test Log Loss'] = log_loss_list
models = ['5 and 40 Logistic Regression', 
          '40 Logistic Regression', 
          'rfecv Logistic Regression', 
          '5 and 40 AdaBoost', 
          '40 AdaBoost', 
          '5 and 40 Neural Network', 
          '40 Neural Network', 
          'rfecv Neural Network']
results_dict['Training Cross Validation Accuracy'] = [log_results['mean_test_accuracy'][0], 
                               log_40_results.loc[:,'mean_test_accuracy'].iloc[0], 
                               log_all_results.loc[:,'mean_test_accuracy'].iloc[0], 
                               ada_results.loc[:,'mean_test_accuracy'].iloc[0], 
                               ada_40_results.loc[:,'mean_test_accuracy'].iloc[0], 
                               nn_results.loc[:,'mean_test_accuracy'].iloc[0],
                              nn_40_results.loc[:,'mean_test_accuracy'].iloc[0],
                              nn_all_results.loc[:,'mean_test_accuracy'].iloc[0]]
results_dict['Training Cross Validation Log Loss'] = [log_cv.best_score_*-1, 
                               log_cv_40.best_score_*-1, 
                               log_cv_all.best_score_*-1, 
                               ada_cv.best_score_*-1, 
                               ada_cv_40.best_score_*-1, 
                               nn_cv.best_score_*-1, 
                               nn_40_cv.best_score_*-1, 
                               nn_all_cv.best_score_*-1]

results_dict['Paramters'] = [log_results.loc[:,'params'].iloc[0], 
                               log_40_results.loc[:,'params'].iloc[0], 
                               log_all_results.loc[:,'params'].iloc[0], 
                               ada_results.loc[:,'params'].iloc[0], 
                               ada_40_results.loc[:,'params'].iloc[0], 
                               nn_results.loc[:,'params'].iloc[0],
                              nn_40_results.loc[:,'params'].iloc[0],
                              nn_all_results.loc[:,'params'].iloc[0]]

In [513]:
results_df = pd.DataFrame(results_dict, index = models)

## Conclusion

The best model which had the best training cross validation log loss and test log loss was the Neural Network on the rolling 40 game features. I will save the predictions 

In [514]:
pd.set_option('display.max_colwidth', None)
results_df.sort_values('Test Log Loss')

Unnamed: 0,Training Cross Validation Accuracy,Training Cross Validation Log Loss,Test Accuracy,Test Log Loss,Paramters
40 Neural Network,0.581519,0.6728,0.59878,0.655967,"{'nn__activation': 'linear', 'nn__dropout_rate': 0.3, 'nn__epochs': 18, 'nn__neurons': 36, 'nn__optimizer': 'Adam', 'nn__weight_constraint': 1}"
40 Logistic Regression,0.578727,0.674249,0.602439,0.656803,"{'logisticregression__C': 0.01, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l2', 'logisticregression__solver': 'liblinear'}"
5 and 40 Neural Network,0.58124,0.67348,0.591463,0.660135,"{'nn__activation': 'linear', 'nn__dropout_rate': 0.3, 'nn__epochs': 8, 'nn__neurons': 36, 'nn__optimizer': 'Adam', 'nn__weight_constraint': 3}"
rfecv Neural Network,0.574819,0.676436,0.609756,0.661108,"{'nn__activation': 'linear', 'nn__dropout_rate': 0.6, 'nn__epochs': 10, 'nn__neurons': 36, 'nn__optimizer': 'Adam', 'nn__weight_constraint': 1}"
rfecv Logistic Regression,0.575661,0.676196,0.613415,0.661188,"{'logisticregression__C': 0.1, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l1', 'logisticregression__solver': 'liblinear'}"
40 AdaBoost,0.573707,0.675816,0.620732,0.661381,"{'ada__base_estimator': SVC(kernel='linear', probability=True), 'ada__learning_rate': 0.01, 'ada__n_estimators': 25}"
5 and 40 Logistic Regression,0.45617,0.675437,0.59878,0.661808,"{'logisticregression__C': 0.001, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l2', 'logisticregression__solver': 'lbfgs'}"
5 and 40 AdaBoost,0.574536,0.679936,0.606098,0.670984,"{'ada__base_estimator': SVC(kernel='linear', probability=True), 'ada__learning_rate': 10, 'ada__n_estimators': 25}"


In [552]:
# Save predictions and probabilites for from best model
X_test = df[df['Season'] == '2020-2021'].dropna().loc[:,r40]
y_test = df[df['Season'] == '2020-2021'].dropna()['Home_Team_Won']

pred_df = df[df['Season'] == '2020-2021'].dropna().loc[:,['game_id',
 'date',
 'venue',
 'home_team',
 'away_team',
 'start_time',
 'home_score',
 'away_score',
 'status',
 'Home_Team_Won',
 'Home_Team_Key',
 'Away_Team_Key', 'home_Game_Number','away_Game_Number','home_goalie',
 'home_Goalie_FenwickSV%',
 'home_Goalie_GSAx/60',
 'home_Goalie_HDCSV%',
 'away_goalie',
 'away_Goalie_FenwickSV%',
 'away_Goalie_GSAx/60',
 'away_Goalie_HDCSV%','home_last_40_FF%_5v5',
 'home_last_40_GF%_5v5',
 'home_last_40_xGF%_5v5',
 'home_last_40_SH%',
 'home_last40_pp_TOI_per_game',
 'home_last40_xGF_per_min_pp',
 'home_last40_GF_per_min_pp',
 'home_last40_pk_TOI_per_game',
 'home_last40_xGA_per_min_pk',
 'home_last40_GA_per_min_pk','away_last_40_FF%_5v5',
 'away_last_40_GF%_5v5',
 'away_last_40_xGF%_5v5',
 'away_last_40_SH%',
 'away_last40_pp_TOI_per_game',
 'away_last40_xGF_per_min_pp',
 'away_last40_GF_per_min_pp',
 'away_last40_pk_TOI_per_game',
 'away_last40_xGA_per_min_pk',
 'away_last40_GA_per_min_pk',
 'home_Rating.A.Pre',
 'away_Rating.A.Pre',
 'B2B_Status']]

preds = nn_40_cv.predict(X_test)
probs = nn_40_cv.predict_proba(X_test)

Predictions_2021 = pd.concat([pred_df, 
                             pd.DataFrame(preds, columns = ['Prediction'], index = y_test.index ),
                             pd.DataFrame(probs, columns = ['Away Win Probability', 'Home Win Probability'], index = y_test.index)], 
                             axis =1)

In [555]:
Predictions_2021.tail()

Unnamed: 0,game_id,date,venue,home_team,away_team,start_time,home_score,away_score,status,Home_Team_Won,Home_Team_Key,Away_Team_Key,home_Game_Number,away_Game_Number,home_goalie,home_Goalie_FenwickSV%,home_Goalie_GSAx/60,home_Goalie_HDCSV%,away_goalie,away_Goalie_FenwickSV%,away_Goalie_GSAx/60,away_Goalie_HDCSV%,home_last_40_FF%_5v5,home_last_40_GF%_5v5,home_last_40_xGF%_5v5,home_last_40_SH%,home_last40_pp_TOI_per_game,home_last40_xGF_per_min_pp,home_last40_GF_per_min_pp,home_last40_pk_TOI_per_game,home_last40_xGA_per_min_pk,home_last40_GA_per_min_pk,away_last_40_FF%_5v5,away_last_40_GF%_5v5,away_last_40_xGF%_5v5,away_last_40_SH%,away_last40_pp_TOI_per_game,away_last40_xGF_per_min_pp,away_last40_GF_per_min_pp,away_last40_pk_TOI_per_game,away_last40_xGA_per_min_pk,away_last40_GA_per_min_pk,home_Rating.A.Pre,away_Rating.A.Pre,B2B_Status,Prediction,Away Win Probability,Home Win Probability
4442,2020020838,2021-05-06,TD Garden,BOS,NYR,2021-05-06 23:00:00,4,0,Final,1,BOS_2021-05-06,NYR_2021-05-06,40.0,38.0,Jeremy Swayman,0.935086,-0.255694,0.86206,Igor Shesterkin,0.943293,0.221547,0.893805,55.281007,54.929673,53.113745,7.448563,5.1825,0.087844,0.096479,5.401667,0.084094,0.087936,48.264073,54.430227,48.777665,10.05015,5.033333,0.135397,0.153974,4.960833,0.111876,0.10079,1569.72,1512.11,Away_only,1,0.389045,0.610955
4443,2020020839,2021-05-06,Nassau Veterans Memorial Coliseum,NYI,N.J,2021-05-06 23:00:00,1,2,Final,0,NYI_2021-05-06,N.J_2021-05-06,43.0,46.0,Semyon Varlamov,0.945489,0.090302,0.88102,Mackenzie Blackwood,0.929299,-0.399936,0.837209,50.241772,57.867228,53.050836,8.805727,4.270833,0.112976,0.093659,4.164167,0.102181,0.090054,48.503229,41.919777,48.218609,7.979786,5.086667,0.092202,0.078637,4.4425,0.115419,0.135059,1549.32,1439.38,Neither,1,0.305557,0.694443
4444,2020020842,2021-05-06,PPG Paints Arena,PIT,BUF,2021-05-06 23:00:00,8,4,Final,1,PIT_2021-05-06,BUF_2021-05-06,48.0,49.0,Tristan Jarry,0.929605,-0.42756,0.843672,Michael Houser,0.935086,-0.255694,0.86206,50.36059,58.253252,49.798658,9.041652,4.2225,0.12386,0.171699,4.520833,0.095945,0.121659,43.7066,39.713487,43.700006,7.311708,4.217917,0.076993,0.082979,4.502917,0.123698,0.127695,1556.67,1416.17,Neither,1,0.29849,0.70151
4445,2020020847,2021-05-06,Scotiabank Arena,TOR,MTL,2021-05-06 23:00:00,5,2,Final,1,TOR_2021-05-06,MTL_2021-05-06,42.0,44.0,Jack Campbell,0.938931,-0.117228,0.845,Jake Allen,0.937289,-0.098128,0.878049,52.425741,57.93858,57.199725,9.228362,4.385833,0.125119,0.085503,3.714583,0.102703,0.134605,53.658068,46.852953,51.668374,6.754951,4.255417,0.085714,0.111622,4.325417,0.118775,0.138715,1550.15,1485.59,Away_only,1,0.361329,0.638671
4446,2020020593,2021-05-06,Rogers Place,EDM,VAN,2021-05-07 01:00:00,3,6,Final,0,EDM_2021-05-06,VAN_2021-05-06,47.0,47.0,Mike Smith,0.943015,0.055221,0.874687,Thatcher Demko,0.933794,-0.096288,0.854626,49.044109,53.663901,49.880668,9.351147,4.523333,0.128611,0.171334,4.334167,0.116055,0.09229,46.485886,44.299738,45.089832,7.305675,4.430417,0.105464,0.112856,5.131667,0.111075,0.116921,1536.06,1462.38,Neither,1,0.328182,0.671818


In [561]:
Predictions_2021.to_csv('data/Predictions_2021')

## Next Steps
To further improve the models I would like to take the following next steps

- Implement voting classifier
- Try linear weightings in rolling features
- Build bottom up model using player statistics