# NHL Game Prediction Modeling
by Gary Schwaeber

## Overview

With sports betting becoming increasingly popular and mainstream I believe that data science can be used to make superior decisions over gut intuitions. Unlike in Football or Basketball where the betting against the spread is the most popular type of betting, the moneyline is king in the NHL due to lower scoring games. When betting the moneyline, the way to gain an edge is if you know the truer probability of the game outcome then the implied odds from the moneyline. Over the course of the season, if your internally derived game probabilities are superior to the book's, you can be profitable. 

In this notebook I will attempt to train logistic regression, ada boost, gradient boosting, and neural network models in an attempt to make the best possible game prediction model. I will train my models and tune model hyperparemetres using game results from seasons '2017-2018', '2018-2019', '2019-2020'. Then I will predict on held out games from the current 2021 season and evaluate my model. 

Log loss is the score which I will use to optimize and judge the models. Log-loss is indicative of how close the prediction probability is to the corresponding actual/true value (0 or 1 in case of binary classification). The more the predicted probability diverges from the actual value, the higher is the log-loss value, [Source](https://towardsdatascience.com/intuition-behind-log-loss-score-4e0c9979680a). There are currently a handful of public models whose log loss on the current season's games is being [tracked](https://hockey-statistics.com/2021/05/03/game-projections-january-13th-2021/) on which I can compare the quality of my model to.   I will also review accuracy scores due to their interpretability.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import numpy as np
import statsmodels.api as sm
import hockey_scraper
import pickle
import time
import random
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import sklearn
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.preprocessing import normalize, FunctionTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, accuracy_score
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics import roc_auc_score, roc_curve, auc

from sklearn.metrics import confusion_matrix, plot_confusion_matrix,\
    precision_score, recall_score, accuracy_score, f1_score, log_loss,\
    roc_curve, roc_auc_score, classification_report
from sklearn.ensemble import VotingClassifier, BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, AdaBoostRegressor, GradientBoostingClassifier
from collections import Counter
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.compose import make_column_selector as selector
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.feature_selection import RFECV

#for the Neural Network
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.regularizers import l2
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.wrappers import scikit_learn
from tensorflow.keras.callbacks import EarlyStopping
from keras.constraints import maxnorm

import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)

In [2]:
df = pd.read_csv('data/all_games_multirolling_SVA_2.csv')

In [3]:
df.shape

(4447, 155)

In [4]:
# define feature columns for different rolling intervals

common = ['home_Goalie_FenwickSV%',
 'home_Goalie_GSAx/60',
 'home_Goalie_HDCSV%',
 'away_Goalie_FenwickSV%',
 'away_Goalie_GSAx/60',
 'away_Goalie_HDCSV%', 
 'home_Rating.A.Pre',
 'away_Rating.A.Pre',
 'B2B_Status']

r3 = ['home_last_3_FF%_5v5',
 'home_last_3_GF%_5v5',
 'home_last_3_xGF%_5v5',
 'home_last_3_SH%',
 'home_last3_xGF_per_min_pp',
 'home_last3_GF_per_min_pp',
 'home_last3_xGA_per_min_pk',
 'home_last3_GA_per_min_pk',
 'away_last_3_FF%_5v5',
 'away_last_3_GF%_5v5',
 'away_last_3_xGF%_5v5',
 'away_last_3_SH%',
 'away_last3_xGF_per_min_pp',
 'away_last3_GF_per_min_pp',
 'away_last3_xGA_per_min_pk',
 'away_last3_GA_per_min_pk'] + common

r5 =['home_last_5_FF%_5v5',
 'home_last_5_GF%_5v5',
 'home_last_5_xGF%_5v5',
 'home_last_5_SH%',
 'home_last5_xGF_per_min_pp',
 'home_last5_GF_per_min_pp',
 'home_last5_xGA_per_min_pk',
 'home_last5_GA_per_min_pk',
 'away_last_5_FF%_5v5',
 'away_last_5_GF%_5v5',
 'away_last_5_xGF%_5v5',
 'away_last_5_SH%',
 'away_last5_xGF_per_min_pp',
 'away_last5_GF_per_min_pp',
 'away_last5_xGA_per_min_pk',
 'away_last5_GA_per_min_pk'] + common

r10 =['home_last_10_FF%_5v5',
 'home_last_10_GF%_5v5',
 'home_last_10_xGF%_5v5',
 'home_last_10_SH%',
 'home_last10_xGF_per_min_pp',
 'home_last10_GF_per_min_pp',
 'home_last10_xGA_per_min_pk',
 'home_last10_GA_per_min_pk',
  'away_last_10_FF%_5v5',
 'away_last_10_GF%_5v5',
 'away_last_10_xGF%_5v5',
 'away_last_10_SH%',
 'away_last10_xGF_per_min_pp',
 'away_last10_GF_per_min_pp',
 'away_last10_xGA_per_min_pk',
 'away_last10_GA_per_min_pk'] + common


r20 = ['home_last_20_FF%_5v5',
 'home_last_20_GF%_5v5',
 'home_last_20_xGF%_5v5',
 'home_last_20_SH%',
 'home_last20_xGF_per_min_pp',
 'home_last20_GF_per_min_pp',
 'home_last20_xGA_per_min_pk',
 'home_last20_GA_per_min_pk',
 'away_last_20_FF%_5v5',
 'away_last_20_GF%_5v5',
 'away_last_20_xGF%_5v5',
 'away_last_20_SH%',
 'away_last20_xGF_per_min_pp',
 'away_last20_GF_per_min_pp',
 'away_last20_xGA_per_min_pk',
 'away_last20_GA_per_min_pk'] +common

r30 = ['home_last_30_FF%_5v5',
 'home_last_30_GF%_5v5',
 'home_last_30_xGF%_5v5',
 'home_last_30_SH%',
 'home_last30_xGF_per_min_pp',
 'home_last30_GF_per_min_pp',
 'home_last30_xGA_per_min_pk',
 'home_last30_GA_per_min_pk',
 'away_last_30_FF%_5v5',
 'away_last_30_GF%_5v5',
 'away_last_30_xGF%_5v5',
 'away_last_30_SH%',
 'away_last30_xGF_per_min_pp',
 'away_last30_GF_per_min_pp',
 'away_last30_xGA_per_min_pk',
 'away_last30_GA_per_min_pk'] + common


r40 = ['home_last_40_FF%_5v5',
 'home_last_40_GF%_5v5',
 'home_last_40_xGF%_5v5',
 'home_last_40_SH%',
 'home_last40_xGF_per_min_pp',
 'home_last40_GF_per_min_pp',
 'home_last40_xGA_per_min_pk',
 'home_last40_GA_per_min_pk',
 'away_last_40_FF%_5v5',
 'away_last_40_GF%_5v5',
 'away_last_40_xGF%_5v5',
 'away_last_40_SH%',
 'away_last40_xGF_per_min_pp',
 'away_last40_GF_per_min_pp',
 'away_last40_xGA_per_min_pk',
 'away_last40_GA_per_min_pk'] + common


all_r = list(set(r3+r5+r10+r20+r30+r40)) 

r3_30 =list(set(r3+r30))
r_5_40 = list(set(r5+r40))

## Baseline Model

The baseline model will predict that every home team wins their game and that the probability of that is the ratio of games the home team has won.

In [5]:
df['Home_Team_Won'].value_counts(normalize=True)

1    0.541714
0    0.458286
Name: Home_Team_Won, dtype: float64

In [6]:
baseline_preds = np.ones(df.shape[0])
accuracy_score(df['Home_Team_Won'],baseline_preds)

0.5417135147290308

In [7]:
baseline_probs = np.repeat(df['Home_Team_Won'].value_counts(normalize=True)[1], df.shape[0])

log_loss(df['Home_Team_Won'], baseline_probs)

0.6896630977766495

The models will need to beat an accuracy score of 54.17% and a log loss of .6897, otherwise they are no better than just predicting the home team will win. 

## Rolling 5 and 40 game features

For my first set of models I will attempt using 5 and 40 game rolling features. These seemed like a good set based on the feature selection notebook. 40 games is the longest rolling runway I have for the team statistics. The 40 games stats intuitively provide the most smoothing of team data over the course of the season, while the 5 game stats may provide some insight on any streakiness or may cover recent developments that would affect short term team performances such as player injuries, trades coaching changes etc.

In [8]:
X_train = df[df['Season'] != '2020-2021'].dropna().loc[:,r_5_40]
y_train = df[df['Season'] != '2020-2021'].dropna()['Home_Team_Won']
X_test = df[df['Season'] == '2020-2021'].loc[:,r_5_40]
y_test = df[df['Season'] == '2020-2021']['Home_Team_Won']

In [9]:
X_train.columns

Index(['home_Goalie_GSAx/60', 'home_last5_GF_per_min_pp',
       'away_last_5_GF%_5v5', 'away_last5_GF_per_min_pp',
       'away_last5_xGF_per_min_pp', 'away_last_40_xGF%_5v5', 'away_last_5_SH%',
       'home_last5_xGF_per_min_pp', 'away_last5_xGA_per_min_pk',
       'away_last_40_SH%', 'away_Rating.A.Pre', 'home_last_5_xGF%_5v5',
       'home_last40_xGF_per_min_pp', 'home_Goalie_HDCSV%',
       'away_last40_GF_per_min_pp', 'home_last40_GF_per_min_pp',
       'home_last40_GA_per_min_pk', 'away_last_40_GF%_5v5',
       'home_last_40_xGF%_5v5', 'away_last_5_FF%_5v5', 'home_last_5_SH%',
       'away_last40_xGA_per_min_pk', 'away_last_5_xGF%_5v5',
       'home_last40_xGA_per_min_pk', 'home_last_40_FF%_5v5',
       'away_Goalie_GSAx/60', 'away_last40_GA_per_min_pk',
       'away_last5_GA_per_min_pk', 'home_Goalie_FenwickSV%',
       'away_last40_xGF_per_min_pp', 'home_last_5_GF%_5v5',
       'away_Goalie_HDCSV%', 'home_last5_GA_per_min_pk',
       'home_last_40_GF%_5v5', 'home_Rating.A.Pre'

In [10]:
X_train.shape

(3582, 41)

In [11]:
numeric_features = ['home_last40_xGF_per_min_pp', 'away_last_5_xGF%_5v5',
       'home_last_40_GF%_5v5',
       'home_last40_xGA_per_min_pk', 'home_last5_xGA_per_min_pk',
       'home_last_40_SH%', 
       'home_Goalie_GSAx/60',
        'away_Goalie_GSAx/60',
       'away_last_5_GF%_5v5', 
       'home_last_40_xGF%_5v5', 
     'home_last5_GF_per_min_pp',
       'home_last_5_GF%_5v5', 'home_last_5_FF%_5v5',
       'away_last5_xGF_per_min_pp', 'away_last40_xGF_per_min_pp',
       'home_last40_GA_per_min_pk', 'home_Goalie_HDCSV%',
       'away_last5_GA_per_min_pk', 'away_last40_GF_per_min_pp',
       'away_Rating.A.Pre', 'home_last_5_xGF%_5v5', 'away_last_5_SH%',
       'home_Rating.A.Pre', 'home_last5_xGF_per_min_pp',
       'away_last_40_xGF%_5v5', 'home_last5_GA_per_min_pk',
     'away_last5_GF_per_min_pp',
       'away_last_40_GF%_5v5', 'away_last_40_SH%', 'away_last_5_FF%_5v5',
       'home_Goalie_FenwickSV%', 'away_Goalie_HDCSV%',
       'away_last40_xGA_per_min_pk', 'home_last_5_SH%',
       'away_last5_xGA_per_min_pk', 'home_last_40_FF%_5v5',
       'away_Goalie_FenwickSV%', 'away_last_40_FF%_5v5',
       'home_last40_GF_per_min_pp', 'away_last40_GA_per_min_pk']

In [12]:
X_train[numeric_features].head()

Unnamed: 0,home_last40_xGF_per_min_pp,away_last_5_xGF%_5v5,home_last_40_GF%_5v5,home_last40_xGA_per_min_pk,home_last5_xGA_per_min_pk,home_last_40_SH%,home_Goalie_GSAx/60,away_Goalie_GSAx/60,away_last_5_GF%_5v5,home_last_40_xGF%_5v5,home_last5_GF_per_min_pp,home_last_5_GF%_5v5,home_last_5_FF%_5v5,away_last5_xGF_per_min_pp,away_last40_xGF_per_min_pp,home_last40_GA_per_min_pk,home_Goalie_HDCSV%,away_last5_GA_per_min_pk,away_last40_GF_per_min_pp,away_Rating.A.Pre,home_last_5_xGF%_5v5,away_last_5_SH%,home_Rating.A.Pre,home_last5_xGF_per_min_pp,away_last_40_xGF%_5v5,home_last5_GA_per_min_pk,away_last5_GF_per_min_pp,away_last_40_GF%_5v5,away_last_40_SH%,away_last_5_FF%_5v5,home_Goalie_FenwickSV%,away_Goalie_HDCSV%,away_last40_xGA_per_min_pk,home_last_5_SH%,away_last5_xGA_per_min_pk,home_last_40_FF%_5v5,away_Goalie_FenwickSV%,away_last_40_FF%_5v5,home_last40_GF_per_min_pp,away_last40_GA_per_min_pk
0,0.112699,48.770492,50.127801,0.104858,0.098556,9.025236,-0.202922,0.082345,45.9375,48.992719,0.095465,57.080799,52.399869,0.06991,0.1224,0.137102,0.858462,0.19544,0.139885,1500.66,51.663405,6.967375,1495.03,0.079714,49.339386,0.054152,0.10181,51.399425,8.124451,52.562502,0.937294,0.873171,0.133976,9.426112,0.074267,48.803377,0.942516,49.991679,0.117297,0.121145
1,0.124909,51.204482,56.868932,0.129028,0.153383,9.060588,0.169541,-0.239655,49.927641,51.954595,0.2997,59.064609,42.564205,0.096,0.102018,0.10473,0.877358,0.040268,0.115864,1535.17,46.860987,11.358025,1577.1,0.143856,52.486645,0.225564,0.1,58.184556,8.420932,46.882217,0.941904,0.864516,0.097844,12.093988,0.109128,50.828439,0.941294,50.633643,0.138139,0.086229
2,0.132248,40.305523,56.575634,0.116445,0.131278,9.02546,0.302087,-0.097423,45.427286,49.851785,0.190981,58.385392,60.511924,0.153218,0.120843,0.112194,0.897778,0.068337,0.11683,1496.85,60.180542,9.286882,1522.11,0.113316,49.136336,0.132159,0.16609,50.499508,7.879167,43.520998,0.942492,0.878613,0.107127,8.478124,0.112415,50.407241,0.938246,50.595552,0.149493,0.106067
3,0.105738,49.941995,53.260259,0.120913,0.137299,7.970138,-0.164139,-0.080476,56.272661,52.809227,0.04329,57.771883,54.316401,0.137242,0.143998,0.125595,0.869266,0.100615,0.103208,1496.86,52.571429,6.524847,1525.37,0.118615,50.855171,0.125962,0.115979,45.246898,5.932286,51.909534,0.934447,0.848,0.093779,9.804628,0.086864,52.890654,0.938305,51.197815,0.099407,0.131951
4,0.129293,43.6373,48.882718,0.084868,0.067197,7.303942,-0.310233,-0.346771,52.130045,54.871795,0.297398,48.959081,52.400715,0.142088,0.087855,0.101091,0.830721,0.0,0.121801,1545.81,50.929752,7.311321,1521.29,0.098885,50.381002,0.03672,0.065934,52.122642,7.885816,47.102597,0.933383,0.839117,0.102718,5.518246,0.107438,55.762037,0.939698,51.309591,0.189644,0.128468


In [13]:
# this scoring variable will be used on all models.
#The grid search results will output both scores but ultimately will use the best log loss
scoring = ['neg_log_loss', 'accuracy']

### Logistic Regression

In [14]:
#establish transformer objects for scaling the numerical features and one hot encoding the categorical feature
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

categorical_features = ['B2B_Status']

categorical_transformer = Pipeline(steps=[
    ('ohe', OneHotEncoder())])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

log_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                      ('logisticregression', LogisticRegression(max_iter=10000))])

#paramters to test with the grid search
log_params = {'logisticregression__solver' : ['liblinear', 'lbfgs', 'newton-cg'],
                'logisticregression__penalty': ['l1', 'l2'],
                'logisticregression__C': [.00001, .0001, .001, .01, .05, 0.1],
                'logisticregression__class_weight': [None] }

log_cv = GridSearchCV(log_pipeline, param_grid=log_params, cv=5, scoring=scoring, refit = 'neg_log_loss',  verbose=1)

In [15]:
log_cv.fit(X_train, y_train)

Fitting 5 folds for each of 36 candidates, totalling 180 fits


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('num',
                                                                         Pipeline(steps=[('scaler',
                                                                                          StandardScaler())]),
                                                                         ['home_last40_xGF_per_min_pp',
                                                                          'away_last_5_xGF%_5v5',
                                                                          'home_last_40_GF%_5v5',
                                                                          'home_last40_xGA_per_min_pk',
                                                                          'home_last5_xGA_per_min_pk',
                                                                          'home_last_40_SH%',
                                 

In [16]:
log_cv.best_score_

-0.6754370089204439

In [17]:
log_results = pd.DataFrame(log_cv.cv_results_).sort_values('mean_test_neg_log_loss', ascending=False)
log_results.head(10)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_logisticregression__C,param_logisticregression__class_weight,param_logisticregression__penalty,param_logisticregression__solver,params,split0_test_neg_log_loss,split1_test_neg_log_loss,split2_test_neg_log_loss,split3_test_neg_log_loss,split4_test_neg_log_loss,mean_test_neg_log_loss,std_test_neg_log_loss,rank_test_neg_log_loss,split0_test_accuracy,split1_test_accuracy,split2_test_accuracy,split3_test_accuracy,split4_test_accuracy,mean_test_accuracy,std_test_accuracy,rank_test_accuracy
16,0.012243,0.000233,0.007874,0.000153,0.001,,l2,lbfgs,"{'logisticregression__C': 0.001, 'logisticregr...",-0.678374,-0.671673,-0.677392,-0.675825,-0.673921,-0.675437,0.002411,1,0.566248,0.592748,0.594972,0.571229,0.578212,0.580682,0.011433,2
17,0.021892,0.001702,0.008688,0.000791,0.001,,l2,newton-cg,"{'logisticregression__C': 0.001, 'logisticregr...",-0.678374,-0.671674,-0.677392,-0.675825,-0.673923,-0.675437,0.00241,2,0.566248,0.592748,0.594972,0.571229,0.578212,0.580682,0.011433,2
15,0.01415,0.00019,0.007921,0.000177,0.001,,l2,liblinear,"{'logisticregression__C': 0.001, 'logisticregr...",-0.678653,-0.672743,-0.678873,-0.676277,-0.675154,-0.67634,0.002285,3,0.567643,0.619247,0.594972,0.585196,0.569832,0.587378,0.018843,1
21,0.01759,0.000723,0.007908,8.9e-05,0.01,,l2,liblinear,"{'logisticregression__C': 0.01, 'logisticregre...",-0.677641,-0.668417,-0.680201,-0.678592,-0.676997,-0.676369,0.00412,4,0.584379,0.598326,0.587989,0.565642,0.565642,0.580396,0.012887,4
23,0.025329,0.003175,0.008586,0.001297,0.01,,l2,newton-cg,"{'logisticregression__C': 0.01, 'logisticregre...",-0.677863,-0.668443,-0.679819,-0.679093,-0.676975,-0.676439,0.004116,5,0.585774,0.594142,0.587989,0.557263,0.569832,0.579,0.01351,6
22,0.027196,0.008879,0.011265,0.00249,0.01,,l2,lbfgs,"{'logisticregression__C': 0.01, 'logisticregre...",-0.677864,-0.668444,-0.679819,-0.679094,-0.676973,-0.676439,0.004116,6,0.585774,0.594142,0.587989,0.557263,0.569832,0.579,0.01351,6
24,0.016978,0.000739,0.007928,0.000102,0.05,,l1,liblinear,"{'logisticregression__C': 0.05, 'logisticregre...",-0.678657,-0.671509,-0.677805,-0.679101,-0.67592,-0.676598,0.002769,7,0.567643,0.588563,0.596369,0.561453,0.575419,0.577889,0.012936,8
30,0.023284,0.002756,0.007912,9.4e-05,0.1,,l1,liblinear,"{'logisticregression__C': 0.1, 'logisticregres...",-0.677721,-0.669128,-0.678904,-0.679827,-0.677732,-0.676662,0.003849,8,0.570432,0.596932,0.585196,0.561453,0.553073,0.573417,0.01586,13
27,0.023851,0.001142,0.00831,0.000487,0.05,,l2,liblinear,"{'logisticregression__C': 0.05, 'logisticregre...",-0.677728,-0.669121,-0.681685,-0.681348,-0.679935,-0.677963,0.004635,9,0.58159,0.599721,0.579609,0.561453,0.555866,0.575648,0.015642,10
28,0.025571,0.001457,0.00966,0.000528,0.05,,l2,lbfgs,"{'logisticregression__C': 0.05, 'logisticregre...",-0.677767,-0.669153,-0.681584,-0.681528,-0.679915,-0.67799,0.004632,10,0.577406,0.598326,0.581006,0.562849,0.555866,0.575091,0.01483,11


### Ada Boost

In [18]:
ada_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                      ('ada', AdaBoostClassifier())])

ada_params = {'ada__n_estimators': [25, 50],
         'ada__learning_rate': [.1, 1, 10, 20],
         'ada__base_estimator': [svm.SVC(probability=True , kernel='linear'), LogisticRegression()],}

ada_cv = GridSearchCV(ada_pipeline, param_grid=ada_params, cv=5, scoring=scoring, refit='neg_log_loss', verbose=1)

In [19]:
ada_cv.fit(X_train, y_train)

Fitting 5 folds for each of 16 candidates, totalling 80 fits


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('num',
                                                                         Pipeline(steps=[('scaler',
                                                                                          StandardScaler())]),
                                                                         ['home_last40_xGF_per_min_pp',
                                                                          'away_last_5_xGF%_5v5',
                                                                          'home_last_40_GF%_5v5',
                                                                          'home_last40_xGA_per_min_pk',
                                                                          'home_last5_xGA_per_min_pk',
                                                                          'home_last_40_SH%',
                                 

In [20]:
ada_cv.best_score_

-0.6798578897026044

Earlier iterations of the Ada Boost grid search tried using decision trees as a base estimator, however those models performed very poorly. Support Vector machines as the base estimator perform very well. Although the Logistic Regression models have generally been performing best so far, the Ada Boost models with Support Vector Machines as a base estimator are still outputting competitive results. 

In [21]:
ada_results = pd.DataFrame(ada_cv.cv_results_).sort_values('mean_test_neg_log_loss', ascending=False)
ada_results.head(10)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_ada__base_estimator,param_ada__learning_rate,param_ada__n_estimators,params,split0_test_neg_log_loss,split1_test_neg_log_loss,split2_test_neg_log_loss,split3_test_neg_log_loss,split4_test_neg_log_loss,mean_test_neg_log_loss,std_test_neg_log_loss,rank_test_neg_log_loss,split0_test_accuracy,split1_test_accuracy,split2_test_accuracy,split3_test_accuracy,split4_test_accuracy,mean_test_accuracy,std_test_accuracy,rank_test_accuracy
4,42.614577,0.222443,2.52127,0.012169,"SVC(kernel='linear', probability=True)",10.0,25,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.683102,-0.674955,-0.681145,-0.681154,-0.678933,-0.679858,0.002784,1,0.563459,0.598326,0.586592,0.565642,0.574022,0.577608,0.013162,3
6,40.737666,0.997575,2.405483,0.16574,"SVC(kernel='linear', probability=True)",20.0,25,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.684937,-0.676536,-0.682238,-0.683349,-0.68036,-0.681484,0.002889,2,0.545328,0.594142,0.585196,0.564246,0.567039,0.57119,0.017072,6
0,45.630049,1.223703,2.60134,0.084959,"SVC(kernel='linear', probability=True)",0.1,25,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.682767,-0.679747,-0.682376,-0.681655,-0.681343,-0.681578,0.001045,3,0.556485,0.573222,0.562849,0.555866,0.555866,0.560858,0.00672,8
5,82.776386,1.669668,4.909572,0.272084,"SVC(kernel='linear', probability=True)",10.0,50,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.684551,-0.676798,-0.682961,-0.684091,-0.680342,-0.681749,0.002874,4,0.560669,0.591353,0.579609,0.554469,0.568436,0.570907,0.013228,7
8,0.143607,0.019959,0.01797,0.00343,LogisticRegression(),0.1,25,"{'ada__base_estimator': LogisticRegression(), ...",-0.684109,-0.681523,-0.684027,-0.68269,-0.682635,-0.682997,0.000969,5,0.564854,0.594142,0.597765,0.569832,0.581006,0.58152,0.012945,2
1,82.608645,1.396454,4.729361,0.080904,"SVC(kernel='linear', probability=True)",0.1,50,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.685449,-0.683975,-0.684934,-0.684928,-0.684876,-0.684832,0.000477,6,0.543933,0.543933,0.541899,0.541899,0.546089,0.543551,0.001561,12
7,73.45379,3.892666,3.899523,0.248024,"SVC(kernel='linear', probability=True)",20.0,50,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.686929,-0.684355,-0.686312,-0.685958,-0.684266,-0.685564,0.00107,7,0.543933,0.542538,0.543296,0.543296,0.548883,0.544389,0.00229,9
9,0.244384,0.01548,0.025143,0.002342,LogisticRegression(),0.1,50,"{'ada__base_estimator': LogisticRegression(), ...",-0.686993,-0.685253,-0.687152,-0.686154,-0.686353,-0.686381,0.000677,8,0.569038,0.60251,0.597765,0.572626,0.571229,0.582634,0.014416,1
2,36.225959,0.817533,2.105014,0.071245,"SVC(kernel='linear', probability=True)",1.0,25,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.688253,-0.688205,-0.688302,-0.688179,-0.687902,-0.688168,0.000139,9,0.543933,0.543933,0.543296,0.543296,0.544693,0.54383,0.000517,10
3,68.708771,0.226974,4.001605,0.033344,"SVC(kernel='linear', probability=True)",1.0,50,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.689027,-0.688601,-0.688762,-0.688496,-0.688619,-0.688701,0.000184,10,0.543933,0.543933,0.543296,0.543296,0.544693,0.54383,0.000517,10


### Gradient Boosting

In [22]:
gb_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                      ('gb', GradientBoostingClassifier())])

gb_params = {'gb__n_estimators': [200, 400],
         'gb__learning_rate': [.001,.01],
         'gb__max_depth' : [3,5]}

gb_cv = GridSearchCV(gb_pipeline, param_grid=gb_params, cv=5, scoring=scoring, refit='neg_log_loss', verbose=1)

In [23]:
gb_cv.fit(X_train, y_train)

Fitting 5 folds for each of 8 candidates, totalling 40 fits


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('num',
                                                                         Pipeline(steps=[('scaler',
                                                                                          StandardScaler())]),
                                                                         ['home_last40_xGF_per_min_pp',
                                                                          'away_last_5_xGF%_5v5',
                                                                          'home_last_40_GF%_5v5',
                                                                          'home_last40_xGA_per_min_pk',
                                                                          'home_last5_xGA_per_min_pk',
                                                                          'home_last_40_SH%',
                                 

In [24]:
gb_cv.best_score_

-0.6813146498639139

The gradient boosting results are generally very poor. It seems like models decision trees as a base estimator do not work well for this dataset

In [25]:
gb_results = pd.DataFrame(gb_cv.cv_results_).sort_values('mean_test_neg_log_loss', ascending=False)
gb_results.head(10)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_gb__learning_rate,param_gb__max_depth,param_gb__n_estimators,params,split0_test_neg_log_loss,split1_test_neg_log_loss,split2_test_neg_log_loss,split3_test_neg_log_loss,split4_test_neg_log_loss,mean_test_neg_log_loss,std_test_neg_log_loss,rank_test_neg_log_loss,split0_test_accuracy,split1_test_accuracy,split2_test_accuracy,split3_test_accuracy,split4_test_accuracy,mean_test_accuracy,std_test_accuracy,rank_test_accuracy
4,5.039086,0.00785,0.013047,0.000277,0.01,3,200,"{'gb__learning_rate': 0.01, 'gb__max_depth': 3...",-0.682641,-0.6794,-0.684327,-0.680537,-0.679668,-0.681315,0.001888,1,0.559275,0.570432,0.590782,0.568436,0.582402,0.574265,0.011067,1
5,10.178907,0.01201,0.016022,0.000314,0.01,3,400,"{'gb__learning_rate': 0.01, 'gb__max_depth': 3...",-0.682245,-0.681441,-0.686649,-0.68267,-0.682056,-0.683012,0.001861,2,0.549512,0.559275,0.581006,0.572626,0.574022,0.567288,0.011333,3
6,8.131786,0.028058,0.016273,0.000238,0.01,5,200,"{'gb__learning_rate': 0.01, 'gb__max_depth': 5...",-0.684019,-0.680469,-0.688348,-0.68215,-0.684828,-0.683963,0.002664,3,0.55788,0.567643,0.568436,0.581006,0.562849,0.567563,0.007713,2
1,10.385874,0.181404,0.020259,0.001276,0.001,3,400,"{'gb__learning_rate': 0.001, 'gb__max_depth': ...",-0.685737,-0.683565,-0.686076,-0.685384,-0.684671,-0.685087,0.000892,4,0.538354,0.548117,0.539106,0.551676,0.541899,0.543831,0.005215,5
3,16.069403,0.104621,0.025809,0.001443,0.001,5,400,"{'gb__learning_rate': 0.001, 'gb__max_depth': ...",-0.685429,-0.682733,-0.689635,-0.685681,-0.684922,-0.68568,0.002234,5,0.535565,0.559275,0.526536,0.546089,0.547486,0.54299,0.011143,6
2,8.247091,0.097229,0.016837,0.00076,0.001,5,200,"{'gb__learning_rate': 0.001, 'gb__max_depth': ...",-0.686931,-0.685201,-0.688013,-0.688221,-0.685264,-0.686726,0.001296,6,0.53417,0.546722,0.536313,0.540503,0.550279,0.541598,0.006098,8
0,5.242151,0.042042,0.014054,0.000876,0.001,3,200,"{'gb__learning_rate': 0.001, 'gb__max_depth': ...",-0.687017,-0.68619,-0.687083,-0.687529,-0.686338,-0.686831,0.000498,7,0.543933,0.543933,0.536313,0.543296,0.541899,0.541875,0.002878,7
7,16.548913,0.027711,0.022408,0.000389,0.01,5,400,"{'gb__learning_rate': 0.01, 'gb__max_depth': 5...",-0.68331,-0.68812,-0.69506,-0.690875,-0.694274,-0.690328,0.004298,8,0.569038,0.560669,0.548883,0.575419,0.571229,0.565048,0.009404,4


### Neural Network

In [26]:
def build_model(optimizer='adam', activation='linear', neurons = 36, dropout_rate=0.3, weight_constraint=3):
    model = Sequential()
    model.add(Dense(neurons, activation=activation, input_dim=44, kernel_constraint=maxnorm(weight_constraint)))
    model.add(Dropout(dropout_rate))
    model.add(Dense(4, activation=activation))
    model.add(Dense(1, activation = 'sigmoid'))

    model.compile(optimizer=optimizer,
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model

# define the grid search parameters

param_grid = {'nn__epochs': [8,10, 12, 15, 18],
             'nn__optimizer' : ['RMSprop', 'Adam'], 
             'nn__activation' : ['sigmoid', 'hard_sigmoid', 'linear'],
            'nn__neurons' : [12, 18, 24, 30, 36, 40],
             'nn__weight_constraint': [1, 3, 5],
             'nn__dropout_rate' : [0.0,  0.3, 0.6, 0.9]}

#wrap neural network in scikit learn wrapper
keras_model = scikit_learn.KerasClassifier(build_fn=build_model, verbose=0)

numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

categorical_features = ['B2B_Status']

categorical_transformer = Pipeline(steps=[
    ('ohe', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

nn_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                      ('nn', keras_model)])





nn_cv = GridSearchCV(estimator=nn_pipeline, param_grid=param_grid, cv=3, scoring=scoring, refit='neg_log_loss', verbose=1)

In [27]:
nn_cv.fit(X_train, y_train)

Fitting 3 folds for each of 2160 candidates, totalling 6480 fits


GridSearchCV(cv=3,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('num',
                                                                         Pipeline(steps=[('scaler',
                                                                                          StandardScaler())]),
                                                                         ['home_last40_xGF_per_min_pp',
                                                                          'away_last_5_xGF%_5v5',
                                                                          'home_last_40_GF%_5v5',
                                                                          'home_last40_xGA_per_min_pk',
                                                                          'home_last5_xGA_per_min_pk',
                                                                          'home_last_40_SH%',
                                 

The optimized Neural Network model is showing the most promising results of the 4 models. with log loss of 0.673480 compared to 0.675437 for the Logistic Regression, 0.679936 for Ada Boost , and 0.681325 for Gradient Boosting.

In [28]:
nn_results = pd.DataFrame(nn_cv.cv_results_).sort_values('mean_test_neg_log_loss', ascending=False)
nn_results.head(5)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_nn__activation,param_nn__dropout_rate,param_nn__epochs,param_nn__neurons,param_nn__optimizer,param_nn__weight_constraint,params,split0_test_neg_log_loss,split1_test_neg_log_loss,split2_test_neg_log_loss,mean_test_neg_log_loss,std_test_neg_log_loss,rank_test_neg_log_loss,split0_test_accuracy,split1_test_accuracy,split2_test_accuracy,mean_test_accuracy,std_test_accuracy,rank_test_accuracy
1956,1.083388,0.003374,0.106423,0.000459,linear,0.6,18,24,RMSprop,1,"{'nn__activation': 'linear', 'nn__dropout_rate...",-0.670084,-0.672037,-0.67866,-0.673594,0.00367,1,0.58794,0.603015,0.561977,0.58431,0.016949,181
1766,1.173387,0.142763,0.105134,0.001061,linear,0.3,18,12,RMSprop,5,"{'nn__activation': 'linear', 'nn__dropout_rate...",-0.669859,-0.671422,-0.680451,-0.673911,0.004669,2,0.588777,0.599665,0.551926,0.580123,0.020427,744
1922,0.984559,0.00375,0.107558,0.00077,linear,0.6,15,24,RMSprop,5,"{'nn__activation': 'linear', 'nn__dropout_rate...",-0.671357,-0.670923,-0.679546,-0.673942,0.003967,3,0.592127,0.597152,0.556114,0.581798,0.018277,467
1656,0.763118,0.000977,0.099721,0.000794,linear,0.3,10,12,RMSprop,1,"{'nn__activation': 'linear', 'nn__dropout_rate...",-0.670359,-0.672176,-0.679355,-0.673963,0.003884,4,0.588777,0.59799,0.556951,0.58124,0.017581,581
1935,0.954143,0.001307,0.104931,0.000327,linear,0.6,15,36,Adam,1,"{'nn__activation': 'linear', 'nn__dropout_rate...",-0.672181,-0.670628,-0.679445,-0.674084,0.003843,5,0.587102,0.598827,0.565327,0.583752,0.01388,254


## 40 Game Rolling

I will run some models using only the rolling 40 game team stats. In the feautre selection notebook, using only 40 game rolling team stats had the best scoring using a basic Logisitic Regression model. Although the idea that including a shorter rolling period may provide the streakiness factor, it is possible that only modeling with longer term smoothing of team performance may be the most predictive.

In [29]:
X_train = df[df['Season'] != '2020-2021'].dropna().loc[:,r40]
y_train = df[df['Season'] != '2020-2021'].dropna()['Home_Team_Won']
X_test = df[df['Season'] == '2020-2021'].loc[:,r40]
y_test = df[df['Season'] == '2020-2021']['Home_Team_Won']

In [30]:
X_train.columns

Index(['home_last_40_FF%_5v5', 'home_last_40_GF%_5v5', 'home_last_40_xGF%_5v5',
       'home_last_40_SH%', 'home_last40_xGF_per_min_pp',
       'home_last40_GF_per_min_pp', 'home_last40_xGA_per_min_pk',
       'home_last40_GA_per_min_pk', 'away_last_40_FF%_5v5',
       'away_last_40_GF%_5v5', 'away_last_40_xGF%_5v5', 'away_last_40_SH%',
       'away_last40_xGF_per_min_pp', 'away_last40_GF_per_min_pp',
       'away_last40_xGA_per_min_pk', 'away_last40_GA_per_min_pk',
       'home_Goalie_FenwickSV%', 'home_Goalie_GSAx/60', 'home_Goalie_HDCSV%',
       'away_Goalie_FenwickSV%', 'away_Goalie_GSAx/60', 'away_Goalie_HDCSV%',
       'home_Rating.A.Pre', 'away_Rating.A.Pre', 'B2B_Status'],
      dtype='object')

In [31]:
numeric_features =['home_last_40_FF%_5v5', 'home_last_40_GF%_5v5', 'home_last_40_xGF%_5v5',
       'home_last_40_SH%', 'home_last40_xGF_per_min_pp',
       'home_last40_GF_per_min_pp', 'home_last40_xGA_per_min_pk',
       'home_last40_GA_per_min_pk', 'away_last_40_FF%_5v5',
       'away_last_40_GF%_5v5', 'away_last_40_xGF%_5v5', 'away_last_40_SH%',
       'away_last40_xGF_per_min_pp', 'away_last40_GF_per_min_pp',
       'away_last40_xGA_per_min_pk', 'away_last40_GA_per_min_pk',
       'home_Goalie_FenwickSV%', 'home_Goalie_GSAx/60', 'home_Goalie_HDCSV%',
       'away_Goalie_FenwickSV%', 'away_Goalie_GSAx/60', 'away_Goalie_HDCSV%',
       'home_Rating.A.Pre', 'away_Rating.A.Pre']

### Logistic Regression

In [32]:
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

categorical_features = ['B2B_Status']

categorical_transformer = Pipeline(steps=[
    ('ohe', OneHotEncoder())])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

log_40_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                      ('logisticregression', LogisticRegression(max_iter=10000))])

In [33]:
log_params = {'logisticregression__solver' : ['liblinear', 'lbfgs', 'newton-cg'],
                'logisticregression__penalty': ['l1', 'l2'],
                'logisticregression__C': [.01, 0.1, 1, 10],
                'logisticregression__class_weight': [None] }

log_cv_40 = GridSearchCV(log_40_pipeline, param_grid=log_params, cv=5, scoring=scoring, refit = 'neg_log_loss',  verbose=1)

In [34]:
log_cv_40.fit(X_train, y_train)

Fitting 5 folds for each of 24 candidates, totalling 120 fits


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('num',
                                                                         Pipeline(steps=[('scaler',
                                                                                          StandardScaler())]),
                                                                         ['home_last_40_FF%_5v5',
                                                                          'home_last_40_GF%_5v5',
                                                                          'home_last_40_xGF%_5v5',
                                                                          'home_last_40_SH%',
                                                                          'home_last40_xGF_per_min_pp',
                                                                          'home_last40_GF_per_min_pp',
                                      

In [35]:
log_40_results = pd.DataFrame(log_cv_40.cv_results_).sort_values('mean_test_neg_log_loss', ascending=False)
log_40_results.head(10)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_logisticregression__C,param_logisticregression__class_weight,param_logisticregression__penalty,param_logisticregression__solver,params,split0_test_neg_log_loss,split1_test_neg_log_loss,split2_test_neg_log_loss,split3_test_neg_log_loss,split4_test_neg_log_loss,mean_test_neg_log_loss,std_test_neg_log_loss,rank_test_neg_log_loss,split0_test_accuracy,split1_test_accuracy,split2_test_accuracy,split3_test_accuracy,split4_test_accuracy,mean_test_accuracy,std_test_accuracy,rank_test_accuracy
3,0.015797,0.000464,0.009378,0.000589,0.01,,l2,liblinear,"{'logisticregression__C': 0.01, 'logisticregre...",-0.677897,-0.667774,-0.678553,-0.67785,-0.669169,-0.674249,0.004744,1,0.562064,0.595537,0.593575,0.568436,0.574022,0.578727,0.013481,4
5,0.021412,0.000807,0.007937,0.000286,0.01,,l2,newton-cg,"{'logisticregression__C': 0.01, 'logisticregre...",-0.678123,-0.667782,-0.678179,-0.678294,-0.669174,-0.674311,0.004783,2,0.564854,0.589958,0.594972,0.571229,0.574022,0.579007,0.011493,1
4,0.017624,0.001365,0.009272,0.001054,0.01,,l2,lbfgs,"{'logisticregression__C': 0.01, 'logisticregre...",-0.678123,-0.667785,-0.678179,-0.678293,-0.669174,-0.674311,0.004782,3,0.564854,0.589958,0.594972,0.571229,0.574022,0.579007,0.011493,1
6,0.019279,0.002509,0.008176,0.000784,0.1,,l1,liblinear,"{'logisticregression__C': 0.1, 'logisticregres...",-0.677617,-0.668766,-0.677167,-0.679306,-0.671783,-0.674928,0.003982,4,0.567643,0.587169,0.608939,0.554469,0.576816,0.579007,0.018431,1
9,0.015628,0.000473,0.008149,0.000324,0.1,,l2,liblinear,"{'logisticregression__C': 0.1, 'logisticregres...",-0.677746,-0.668233,-0.680027,-0.6799,-0.670958,-0.675373,0.004863,5,0.573222,0.587169,0.585196,0.551676,0.565642,0.572581,0.013096,5
10,0.020239,0.000497,0.007967,0.000537,0.1,,l2,lbfgs,"{'logisticregression__C': 0.1, 'logisticregres...",-0.677759,-0.668236,-0.679981,-0.67999,-0.670951,-0.675383,0.004873,6,0.573222,0.587169,0.585196,0.551676,0.564246,0.572302,0.013255,6
11,0.025984,0.001622,0.007845,0.000375,0.1,,l2,newton-cg,"{'logisticregression__C': 0.1, 'logisticregres...",-0.677763,-0.668245,-0.679982,-0.679991,-0.670953,-0.675387,0.004871,7,0.571827,0.587169,0.585196,0.551676,0.564246,0.572023,0.013247,7
12,0.027255,0.004471,0.009229,0.001175,1.0,,l1,liblinear,"{'logisticregression__C': 1, 'logisticregressi...",-0.67789,-0.668705,-0.680209,-0.680259,-0.671406,-0.675694,0.00476,8,0.573222,0.587169,0.579609,0.546089,0.564246,0.570067,0.014158,8
15,0.016365,0.001007,0.007827,0.000251,1.0,,l2,liblinear,"{'logisticregression__C': 1, 'logisticregressi...",-0.677739,-0.668911,-0.680817,-0.680721,-0.671601,-0.675958,0.00486,9,0.570432,0.585774,0.579609,0.548883,0.562849,0.569509,0.01294,9
17,0.027231,0.001503,0.007969,0.000218,1.0,,l2,newton-cg,"{'logisticregression__C': 1, 'logisticregressi...",-0.67774,-0.668912,-0.680813,-0.680733,-0.6716,-0.67596,0.004862,10,0.570432,0.585774,0.579609,0.548883,0.562849,0.569509,0.01294,9


### Ada Boost

In [36]:
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

categorical_features = ['B2B_Status']

categorical_transformer = Pipeline(steps=[
    ('ohe', OneHotEncoder())])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])


ada_40_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                      ('ada', AdaBoostClassifier())])

ada_params = {'ada__n_estimators': [25],
         'ada__learning_rate': [.01, .1, 1, 10],
         'ada__base_estimator': [svm.SVC(probability=True , kernel='linear'), LogisticRegression()],}

ada_cv_40 = GridSearchCV(ada_40_pipeline, param_grid=ada_params, cv=5, scoring=scoring, refit='neg_log_loss', verbose=1)

In [37]:
ada_cv_40.fit(X_train, y_train)

Fitting 5 folds for each of 8 candidates, totalling 40 fits


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('num',
                                                                         Pipeline(steps=[('scaler',
                                                                                          StandardScaler())]),
                                                                         ['home_last_40_FF%_5v5',
                                                                          'home_last_40_GF%_5v5',
                                                                          'home_last_40_xGF%_5v5',
                                                                          'home_last_40_SH%',
                                                                          'home_last40_xGF_per_min_pp',
                                                                          'home_last40_GF_per_min_pp',
                                      

In [38]:
ada_40_results = pd.DataFrame(ada_cv_40.cv_results_).sort_values('mean_test_neg_log_loss', ascending=False)
ada_40_results.head(10)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_ada__base_estimator,param_ada__learning_rate,param_ada__n_estimators,params,split0_test_neg_log_loss,split1_test_neg_log_loss,split2_test_neg_log_loss,split3_test_neg_log_loss,split4_test_neg_log_loss,mean_test_neg_log_loss,std_test_neg_log_loss,rank_test_neg_log_loss,split0_test_accuracy,split1_test_accuracy,split2_test_accuracy,split3_test_accuracy,split4_test_accuracy,mean_test_accuracy,std_test_accuracy,rank_test_accuracy
0,39.230205,0.071301,2.247906,0.005976,"SVC(kernel='linear', probability=True)",0.01,25,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.678544,-0.672591,-0.677178,-0.677964,-0.672798,-0.675815,0.002585,1,0.548117,0.580195,0.594972,0.561453,0.576816,0.572311,0.01612,5
4,0.119884,0.003862,0.016288,0.00097,LogisticRegression(),0.01,25,"{'ada__base_estimator': LogisticRegression(), ...",-0.680541,-0.676259,-0.67967,-0.678631,-0.675985,-0.678217,0.001817,2,0.569038,0.588563,0.569832,0.567039,0.583799,0.575654,0.008774,3
3,39.694361,0.119665,2.402361,0.015123,"SVC(kernel='linear', probability=True)",10.0,25,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.681482,-0.677015,-0.677408,-0.680627,-0.676307,-0.678568,0.002078,3,0.566248,0.585774,0.596369,0.576816,0.572626,0.579566,0.010526,2
1,40.214163,0.039122,2.327299,0.006262,"SVC(kernel='linear', probability=True)",0.1,25,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.682364,-0.679555,-0.681843,-0.681452,-0.679591,-0.680961,0.00117,4,0.564854,0.570432,0.569832,0.560056,0.564246,0.565884,0.003847,6
5,0.123238,0.004518,0.016054,0.000404,LogisticRegression(),0.1,25,"{'ada__base_estimator': LogisticRegression(), ...",-0.684177,-0.681519,-0.684073,-0.683125,-0.681574,-0.682894,0.00116,5,0.560669,0.588563,0.597765,0.569832,0.582402,0.579847,0.013203,1
2,32.865977,0.141314,1.90889,0.004267,"SVC(kernel='linear', probability=True)",1.0,25,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.688404,-0.687679,-0.687998,-0.688574,-0.688222,-0.688176,0.000313,6,0.543933,0.543933,0.543296,0.543296,0.544693,0.54383,0.000517,7
6,0.098325,0.005258,0.016114,0.000747,LogisticRegression(),1.0,25,"{'ada__base_estimator': LogisticRegression(), ...",-0.691571,-0.691189,-0.691649,-0.691591,-0.691297,-0.691459,0.000182,7,0.559275,0.594142,0.589385,0.564246,0.569832,0.575376,0.013873,4
7,0.24769,0.005661,0.016178,0.000848,LogisticRegression(),10.0,25,"{'ada__base_estimator': LogisticRegression(), ...",-0.691897,-0.69111,-0.702624,-0.690811,-0.691471,-0.693583,0.004535,8,0.543933,0.543933,0.543296,0.543296,0.544693,0.54383,0.000517,7


### Neural Network

In [39]:
def build_model(optimizer='adam', activation='relu', neurons = 1, dropout_rate=0.0, weight_constraint=0):
    model = Sequential()
    model.add(Dense(neurons, activation=activation, input_dim=28, kernel_constraint=maxnorm(weight_constraint)))
    model.add(Dropout(dropout_rate))
    model.add(Dense(4, activation=activation))
    model.add(Dense(1, activation = 'sigmoid'))

    model.compile(optimizer=optimizer,
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model

param_grid = {'nn__epochs': [8,10, 15, 18],
             'nn__optimizer' : ['RMSprop', 'Adam'], 
             'nn__activation' : ['hard_sigmoid', 'linear'],
            'nn__neurons' : [12, 24, 36, 40],
             'nn__weight_constraint': [1, 3],
             'nn__dropout_rate' : [0.3, 0.6]}

keras_model = scikit_learn.KerasClassifier(build_fn=build_model, verbose=0)

numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

categorical_features = ['B2B_Status']

categorical_transformer = Pipeline(steps=[
    ('ohe', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

nn_40_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                      ('nn', keras_model)])





nn_40_cv = GridSearchCV(estimator=nn_40_pipeline, param_grid=param_grid, cv=3, scoring=scoring, refit='neg_log_loss', verbose=1)

In [40]:
nn_40_cv.fit(X_train, y_train)

Fitting 3 folds for each of 256 candidates, totalling 768 fits


GridSearchCV(cv=3,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('num',
                                                                         Pipeline(steps=[('scaler',
                                                                                          StandardScaler())]),
                                                                         ['home_last_40_FF%_5v5',
                                                                          'home_last_40_GF%_5v5',
                                                                          'home_last_40_xGF%_5v5',
                                                                          'home_last_40_SH%',
                                                                          'home_last40_xGF_per_min_pp',
                                                                          'home_last40_GF_per_min_pp',
                                      

In [41]:
nn_40_results = pd.DataFrame(nn_40_cv.cv_results_).sort_values('mean_test_neg_log_loss', ascending=False)
nn_40_results.head(10)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_nn__activation,param_nn__dropout_rate,param_nn__epochs,param_nn__neurons,param_nn__optimizer,param_nn__weight_constraint,params,split0_test_neg_log_loss,split1_test_neg_log_loss,split2_test_neg_log_loss,mean_test_neg_log_loss,std_test_neg_log_loss,rank_test_neg_log_loss,split0_test_accuracy,split1_test_accuracy,split2_test_accuracy,mean_test_accuracy,std_test_accuracy,rank_test_accuracy
153,1.025458,0.301492,0.112389,0.010358,linear,0.3,10,36,RMSprop,3,"{'nn__activation': 'linear', 'nn__dropout_rate...",-0.669314,-0.671602,-0.676991,-0.672635,0.003218,1,0.583752,0.597152,0.551089,0.577331,0.019346,173
252,1.092319,0.007639,0.105952,0.001877,linear,0.6,18,40,RMSprop,1,"{'nn__activation': 'linear', 'nn__dropout_rate...",-0.669903,-0.67114,-0.677077,-0.672707,0.003132,2,0.585427,0.592127,0.561139,0.579564,0.013313,111
174,0.955958,0.003744,0.104754,0.001062,linear,0.3,15,40,Adam,1,"{'nn__activation': 'linear', 'nn__dropout_rate...",-0.670733,-0.669889,-0.678207,-0.672943,0.003738,3,0.583752,0.59129,0.557789,0.57761,0.01435,167
143,0.691614,0.006337,0.105344,0.000454,linear,0.3,8,40,Adam,3,"{'nn__activation': 'linear', 'nn__dropout_rate...",-0.670418,-0.671872,-0.676542,-0.672944,0.002613,4,0.580402,0.587102,0.559464,0.575656,0.011772,200
155,0.785075,0.007942,0.107056,0.001952,linear,0.3,10,36,Adam,3,"{'nn__activation': 'linear', 'nn__dropout_rate...",-0.669908,-0.671856,-0.677075,-0.672946,0.003026,5,0.58459,0.586265,0.561139,0.577331,0.01147,173
154,0.881473,0.148928,0.108441,0.001316,linear,0.3,10,36,Adam,1,"{'nn__activation': 'linear', 'nn__dropout_rate...",-0.669646,-0.670802,-0.678553,-0.673,0.003955,6,0.585427,0.596315,0.571189,0.58431,0.010288,12
248,1.090879,0.001805,0.104656,0.000764,linear,0.6,18,36,RMSprop,1,"{'nn__activation': 'linear', 'nn__dropout_rate...",-0.670306,-0.671497,-0.677247,-0.673017,0.003031,7,0.582077,0.599665,0.564489,0.582077,0.01436,48
185,1.093323,0.010162,0.103517,0.001971,linear,0.3,18,36,RMSprop,3,"{'nn__activation': 'linear', 'nn__dropout_rate...",-0.670147,-0.670897,-0.678094,-0.673046,0.003583,8,0.580402,0.59464,0.559464,0.578169,0.014447,147
133,0.749521,0.003655,0.10604,0.000566,linear,0.3,8,24,RMSprop,3,"{'nn__activation': 'linear', 'nn__dropout_rate...",-0.668565,-0.671806,-0.678798,-0.673056,0.00427,9,0.583752,0.599665,0.561977,0.581798,0.015448,57
170,1.050796,0.136331,0.106393,0.002591,linear,0.3,15,36,Adam,1,"{'nn__activation': 'linear', 'nn__dropout_rate...",-0.670279,-0.67192,-0.677174,-0.673124,0.002941,10,0.583752,0.595477,0.561139,0.580123,0.014252,93


So far, the 40 game rolling features set has generally produced superior scoring in the models, with the Neural Network model on the 40 game rolling features scoring best on cross validation with a log loss of 0.672800

## All Rolling Game Features With Recursive Feature Elimination

I will now attempt using Recursive Feature Elimination on all different rolling game options (3,5,10,20,30,40) to see if there is some mix of features that the RFECV can find patterns for. I will then use those features for each of the models see if they produce more fruitful results. 

In [42]:
X_train = df[df['Season'] != '2020-2021'].dropna().loc[:,all_r]
y_train = df[df['Season'] != '2020-2021'].dropna()['Home_Team_Won']
X_test = df[df['Season'] == '2020-2021'].loc[:,all_r]
y_test = df[df['Season'] == '2020-2021']['Home_Team_Won']

In [43]:
X_train.shape

(3582, 105)

In [44]:
numeric_features = all_r
numeric_features.remove('B2B_Status')

### Recursive Feature Elimination

In [45]:
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

categorical_features = ['B2B_Status']

categorical_transformer = Pipeline(steps=[
    ('ohe', OneHotEncoder())])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

rfecv = RFECV(estimator= LogisticRegression(max_iter =10000, penalty = 'l2', solver='liblinear', C=.1), step=1, scoring='accuracy')
rfecv_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                      ('rfecv', rfecv)])

In [46]:
rfecv_pipeline.fit(X_train, y_train)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('scaler',
                                                                   StandardScaler())]),
                                                  ['home_last30_GF_per_min_pp',
                                                   'home_last_30_FF%_5v5',
                                                   'home_last30_xGF_per_min_pp',
                                                   'away_last_20_xGF%_5v5',
                                                   'home_last_3_GF%_5v5',
                                                   'home_last3_GA_per_min_pk',
                                                   'away_last_40_xGF%_5v5',
                                                   'away_last_5_SH%',
                                                   'home_last_20_SH%',
                                                   'away_last_3

In [47]:
rfecv_pipeline[1].n_features_

36

In [48]:
rfecv_pipeline[1].ranking_

array([ 1,  1,  2,  1, 40, 16,  3, 10,  1, 60,  1,  1, 32,  5, 56, 12, 70,
       59,  1, 38, 34,  1, 39,  1,  1, 47, 24,  1, 52,  4, 51, 46, 30, 50,
       49,  1, 66,  1, 71,  1, 26, 53, 73, 23, 15, 20,  1, 64, 48, 61, 37,
       67, 17,  8, 68,  1,  1,  1, 42, 19, 58, 14,  1,  6, 55, 36,  1,  9,
       31, 54, 41,  1,  1, 25, 27, 22,  1,  1, 57,  1, 28,  1, 13, 45, 11,
       21,  1, 43, 69,  1, 63, 44, 33,  1,  1, 29, 18, 35,  1, 62,  7,  1,
       72, 65,  1,  1,  1,  1])

In [49]:
rfecv_results = pd.DataFrame(list(zip(X_train.columns, rfecv_pipeline[1].ranking_)), columns = ['Feature', 'Ranking']).sort_values('Ranking')
rfecv_results.head(rfecv_pipeline[1].n_features_)

Unnamed: 0,Feature,Ranking
0,home_last30_GF_per_min_pp,1
35,home_last_10_xGF%_5v5,1
37,home_Goalie_GSAx/60,1
39,away_last_20_FF%_5v5,1
46,home_last_3_FF%_5v5,1
55,home_last_40_FF%_5v5,1
56,away_Goalie_GSAx/60,1
57,home_Goalie_FenwickSV%,1
62,home_last_40_SH%,1
66,away_last_30_SH%,1


In [50]:
rfecv_columns = list(rfecv_results.iloc[:rfecv_pipeline[1].n_features_,0])
rfecv_columns 

['home_last30_GF_per_min_pp',
 'home_last_10_xGF%_5v5',
 'home_Goalie_GSAx/60',
 'away_last_20_FF%_5v5',
 'home_last_3_FF%_5v5',
 'home_last_40_FF%_5v5',
 'away_Goalie_GSAx/60',
 'home_Goalie_FenwickSV%',
 'home_last_40_SH%',
 'away_last_30_SH%',
 'away_last10_GA_per_min_pk',
 'home_last20_xGF_per_min_pp',
 'away_last_30_xGF%_5v5',
 'home_last30_xGA_per_min_pk',
 'home_last_5_FF%_5v5',
 'home_last_3_xGF%_5v5',
 'away_last_20_SH%',
 'home_last_5_xGF%_5v5',
 'away_last30_xGF_per_min_pp',
 'away_last20_GA_per_min_pk',
 'away_last40_xGF_per_min_pp',
 'home_Rating.A.Pre',
 'home_last_40_GF%_5v5',
 'away_last40_GA_per_min_pk',
 'home_last_10_SH%',
 'home_last_20_SH%',
 'away_Rating.A.Pre',
 'home_last_30_FF%_5v5',
 'home_last40_xGA_per_min_pk',
 'home_last20_xGA_per_min_pk',
 'home_last20_GF_per_min_pp',
 'away_last_40_SH%',
 'away_last_20_xGF%_5v5',
 'home_last30_xGF_per_min_pp',
 'away_last_40_xGF%_5v5',
 'away_last_40_FF%_5v5']

### Logistic Regression

In [51]:
X_train = df[df['Season'] != '2020-2021'].dropna().loc[:,rfecv_columns]
y_train = df[df['Season'] != '2020-2021'].dropna()['Home_Team_Won']
X_test = df[df['Season'] == '2020-2021'].loc[:,rfecv_columns]
y_test = df[df['Season'] == '2020-2021']['Home_Team_Won']

In [52]:
log_rfecv_pipeline = Pipeline(steps=[('ss', StandardScaler()),
                      ('logisticregression', LogisticRegression(max_iter=10000))])

log_params = {'logisticregression__solver' : ['liblinear', 'lbfgs', 'newton-cg'],
                'logisticregression__penalty': ['l1', 'l2'],
                'logisticregression__C': [.01, 0.1, 10, 20, 100],
                'logisticregression__class_weight': [None]}

log_cv_all = GridSearchCV(log_rfecv_pipeline, param_grid=log_params, cv=5, scoring=scoring, refit = 'neg_log_loss',  verbose=1)

In [53]:
log_cv_all.fit(X_train[rfecv_columns], y_train)

Fitting 5 folds for each of 30 candidates, totalling 150 fits


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('ss', StandardScaler()),
                                       ('logisticregression',
                                        LogisticRegression(max_iter=10000))]),
             param_grid={'logisticregression__C': [0.01, 0.1, 10, 20, 100],
                         'logisticregression__class_weight': [None],
                         'logisticregression__penalty': ['l1', 'l2'],
                         'logisticregression__solver': ['liblinear', 'lbfgs',
                                                        'newton-cg']},
             refit='neg_log_loss', scoring=['neg_log_loss', 'accuracy'],
             verbose=1)

In [54]:
log_all_results = pd.DataFrame(log_cv_all.cv_results_).sort_values('mean_test_neg_log_loss', ascending=False)
log_all_results.head(5)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_logisticregression__C,param_logisticregression__class_weight,param_logisticregression__penalty,param_logisticregression__solver,params,split0_test_neg_log_loss,split1_test_neg_log_loss,split2_test_neg_log_loss,split3_test_neg_log_loss,split4_test_neg_log_loss,mean_test_neg_log_loss,std_test_neg_log_loss,rank_test_neg_log_loss,split0_test_accuracy,split1_test_accuracy,split2_test_accuracy,split3_test_accuracy,split4_test_accuracy,mean_test_accuracy,std_test_accuracy,rank_test_accuracy
10,0.019798,0.000133,0.004045,8.2e-05,0.1,,l2,lbfgs,"{'logisticregression__C': 0.1, 'logisticregres...",-0.673249,-0.667799,-0.674011,-0.676105,-0.674852,-0.673203,0.002863,1,0.576011,0.594142,0.589385,0.583799,0.574022,0.583472,0.007667,16
11,0.023398,0.00062,0.004155,0.000174,0.1,,l2,newton-cg,"{'logisticregression__C': 0.1, 'logisticregres...",-0.67325,-0.6678,-0.674017,-0.676107,-0.674854,-0.673206,0.002864,2,0.576011,0.594142,0.590782,0.583799,0.574022,0.583751,0.007899,15
9,0.017963,0.000521,0.004251,0.000253,0.1,,l2,liblinear,"{'logisticregression__C': 0.1, 'logisticregres...",-0.673227,-0.667819,-0.673998,-0.676117,-0.674871,-0.673206,0.00286,3,0.574616,0.596932,0.593575,0.583799,0.576816,0.585148,0.008855,7
4,0.016625,0.001407,0.005046,0.000709,0.01,,l2,lbfgs,"{'logisticregression__C': 0.01, 'logisticregre...",-0.676699,-0.667724,-0.67361,-0.674932,-0.673247,-0.673242,0.003013,4,0.564854,0.589958,0.608939,0.579609,0.572626,0.583197,0.015293,17
5,0.027423,0.009254,0.00471,0.000983,0.01,,l2,newton-cg,"{'logisticregression__C': 0.01, 'logisticregre...",-0.676699,-0.667725,-0.673613,-0.67493,-0.673248,-0.673243,0.003012,5,0.564854,0.589958,0.608939,0.579609,0.572626,0.583197,0.015293,17


### Ada Boost

In [55]:
X_train = df[df['Season'] != '2020-2021'].dropna().loc[:,rfecv_columns]
y_train = df[df['Season'] != '2020-2021'].dropna()['Home_Team_Won']
X_test = df[df['Season'] == '2020-2021'].loc[:,rfecv_columns]
y_test = df[df['Season'] == '2020-2021']['Home_Team_Won']

In [56]:
ada_rfecv_pipeline = Pipeline(steps=[('ss', StandardScaler()),
                      ('ada', AdaBoostClassifier())])

ada_params = {'ada__n_estimators': [25],
         'ada__learning_rate': [ .1, 10],
         'ada__base_estimator': [svm.SVC(probability=True , kernel='linear'), LogisticRegression(max_iter =10000, C=.01, penalty = 'l1', solver = 'liblinear')],}

ada_cv_all = GridSearchCV(ada_rfecv_pipeline, param_grid=ada_params, cv=5, scoring=scoring, refit='neg_log_loss', verbose=1)

In [57]:
ada_cv_all.fit(X_train, y_train)

Fitting 5 folds for each of 4 candidates, totalling 20 fits


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('ss', StandardScaler()),
                                       ('ada', AdaBoostClassifier())]),
             param_grid={'ada__base_estimator': [SVC(kernel='linear',
                                                     probability=True),
                                                 LogisticRegression(C=0.01,
                                                                    max_iter=10000,
                                                                    penalty='l1',
                                                                    solver='liblinear')],
                         'ada__learning_rate': [0.1, 10],
                         'ada__n_estimators': [25]},
             refit='neg_log_loss', scoring=['neg_log_loss', 'accuracy'],
             verbose=1)

In [58]:
ada_all_results = pd.DataFrame(ada_cv_all.cv_results_).sort_values('mean_test_neg_log_loss', ascending=False)
ada_all_results.head(10)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_ada__base_estimator,param_ada__learning_rate,param_ada__n_estimators,params,split0_test_neg_log_loss,split1_test_neg_log_loss,split2_test_neg_log_loss,split3_test_neg_log_loss,split4_test_neg_log_loss,mean_test_neg_log_loss,std_test_neg_log_loss,rank_test_neg_log_loss,split0_test_accuracy,split1_test_accuracy,split2_test_accuracy,split3_test_accuracy,split4_test_accuracy,mean_test_accuracy,std_test_accuracy,rank_test_accuracy
0,43.460288,0.056277,2.471431,0.005118,"SVC(kernel='linear', probability=True)",0.1,25,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.682305,-0.678201,-0.680686,-0.682025,-0.680141,-0.680672,0.001475632,1,0.571827,0.58159,0.554469,0.565642,0.557263,0.566158,0.009862,2
1,42.537684,0.256272,2.540797,0.029652,"SVC(kernel='linear', probability=True)",10.0,25,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.682582,-0.680568,-0.678514,-0.684473,-0.679927,-0.681213,0.002090701,2,0.564854,0.599721,0.604749,0.561453,0.569832,0.580122,0.01832,1
2,0.085728,0.000993,0.012222,0.000502,"LogisticRegression(C=0.01, max_iter=10000, pen...",0.1,25,{'ada__base_estimator': LogisticRegression(C=0...,-0.693147,-0.693147,-0.693147,-0.693147,-0.693147,-0.693147,8.599751000000001e-17,3,0.456067,0.456067,0.456704,0.456704,0.455307,0.45617,0.000517,3
3,0.084748,0.000796,0.012553,0.000402,"LogisticRegression(C=0.01, max_iter=10000, pen...",10.0,25,{'ada__base_estimator': LogisticRegression(C=0...,-0.693147,-0.693147,-0.693147,-0.693147,-0.693147,-0.693147,8.599751000000001e-17,3,0.456067,0.456067,0.456704,0.456704,0.455307,0.45617,0.000517,3


### Neural Network

In [59]:
X_train = df[df['Season'] != '2020-2021'].dropna().loc[:,rfecv_columns]
y_train = df[df['Season'] != '2020-2021'].dropna()['Home_Team_Won']
X_test = df[df['Season'] == '2020-2021'].loc[:,rfecv_columns]
y_test = df[df['Season'] == '2020-2021']['Home_Team_Won']

In [60]:
X_train.shape

(3582, 36)

In [61]:
def build_model(optimizer='adam', activation='linear', neurons = 1, dropout_rate=0.0, weight_constraint=0):
    model = Sequential()
    model.add(Dense(neurons, activation=activation, input_dim=rfecv_pipeline[1].n_features_, kernel_constraint=maxnorm(weight_constraint)))
    model.add(Dropout(dropout_rate))
    model.add(Dense(4, activation=activation))
    model.add(Dense(1, activation = 'sigmoid'))

    model.compile(optimizer=optimizer,
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model


param_grid = {'nn__epochs': [8,10, 15, 18],
             'nn__optimizer' : ['Adam', 'RMSprop'], 
             'nn__activation' : ['hard_sigmoid', 'linear'],
            'nn__neurons' : [12, 24, 36, 40],
             'nn__weight_constraint': [1, 3],
             'nn__dropout_rate' : [0.3, 0.6]}
keras_model = scikit_learn.KerasClassifier(build_fn=build_model, verbose=0)

numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

categorical_features = ['B2B_Status']

categorical_transformer = Pipeline(steps=[
    ('ohe', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

nn_all_pipeline = Pipeline(steps=[('scaler', StandardScaler()),
                      ('nn', keras_model)])





nn_all_cv = GridSearchCV(estimator=nn_all_pipeline, param_grid=param_grid, cv=3, scoring=scoring, refit='neg_log_loss', verbose=1)

In [62]:
nn_all_cv.fit(X_train, y_train)

Fitting 3 folds for each of 128 candidates, totalling 384 fits


GridSearchCV(cv=3,
             estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                       ('nn',
                                        <tensorflow.python.keras.wrappers.scikit_learn.KerasClassifier object at 0x7f8617bf57c0>)]),
             param_grid={'nn__activation': ['hard_sigmoid', 'linear'],
                         'nn__dropout_rate': [0.3, 0.6],
                         'nn__epochs': [8, 10, 15, 18],
                         'nn__neurons': [12, 24, 36, 40],
                         'nn__optimizer': ['Adam'],
                         'nn__weight_constraint': [1, 3]},
             refit='neg_log_loss', scoring=['neg_log_loss', 'accuracy'],
             verbose=1)

In [63]:
nn_all_results = pd.DataFrame(nn_all_cv.cv_results_).sort_values('mean_test_neg_log_loss', ascending=False)
nn_all_results.head(5)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_nn__activation,param_nn__dropout_rate,param_nn__epochs,param_nn__neurons,param_nn__optimizer,param_nn__weight_constraint,params,split0_test_neg_log_loss,split1_test_neg_log_loss,split2_test_neg_log_loss,mean_test_neg_log_loss,std_test_neg_log_loss,rank_test_neg_log_loss,split0_test_accuracy,split1_test_accuracy,split2_test_accuracy,mean_test_accuracy,std_test_accuracy,rank_test_accuracy
82,0.905101,0.012086,0.096793,0.001195,linear,0.3,15,24,Adam,1,"{'nn__activation': 'linear', 'nn__dropout_rate...",-0.673046,-0.668641,-0.679657,-0.673781,0.004527,1,0.576214,0.605528,0.557789,0.579844,0.019657,54
88,0.981413,0.006571,0.097363,0.001749,linear,0.3,18,12,Adam,1,"{'nn__activation': 'linear', 'nn__dropout_rate...",-0.671383,-0.669498,-0.680475,-0.673785,0.004793,2,0.59129,0.606365,0.556114,0.58459,0.021055,7
114,0.897445,0.002981,0.0957,0.000832,linear,0.6,15,24,Adam,1,"{'nn__activation': 'linear', 'nn__dropout_rate...",-0.671991,-0.668439,-0.681377,-0.673936,0.005458,3,0.586265,0.588777,0.547739,0.57426,0.018782,114
76,0.724736,0.002487,0.095831,0.000297,linear,0.3,10,36,Adam,1,"{'nn__activation': 'linear', 'nn__dropout_rate...",-0.673235,-0.669248,-0.679684,-0.674056,0.0043,4,0.582077,0.600503,0.567839,0.583473,0.013371,16
116,0.907361,0.002384,0.095542,0.000227,linear,0.6,15,36,Adam,1,"{'nn__activation': 'linear', 'nn__dropout_rate...",-0.673419,-0.669362,-0.680445,-0.674408,0.004578,5,0.577889,0.597152,0.553601,0.576214,0.017819,102


The scores for the RFECV models did ok but were still worse than the 5 and 40 and 40 only feature models. Ada Boost in particular did not do well with the RFECV features. 

## Apply Best Models To Test Data

I will evaluate the best model iterations on the held out 2021 season data

In [64]:
results_dict = {'Training Cross Validation Accuracy': {}, 'Training Cross Validation Log Loss': {}, 'Test Accuracy': {}, 'Test Log Loss':{}, 'Paramters':{}}
accuracy_list = []
log_loss_list = []

In [65]:
X_train = df[df['Season'] != '2020-2021'].dropna().loc[:,r_5_40]
y_train = df[df['Season'] != '2020-2021'].dropna()['Home_Team_Won']
X_test = df[df['Season'] == '2020-2021'].dropna().loc[:,r_5_40]
y_test = df[df['Season'] == '2020-2021'].dropna()['Home_Team_Won']





accuracy_list.append(accuracy_score(y_test, log_cv.predict(X_test)))
log_loss_list.append(log_loss(y_test, log_cv.predict_proba(X_test)))


In [66]:
X_train = df[df['Season'] != '2020-2021'].dropna().loc[:,r40]
y_train = df[df['Season'] != '2020-2021'].dropna()['Home_Team_Won']
X_test = df[df['Season'] == '2020-2021'].dropna().loc[:,r40]
y_test = df[df['Season'] == '2020-2021'].dropna()['Home_Team_Won']



test_preds_40 = log_cv_40.predict(X_test)

test_probs_40 = log_cv_40.predict_proba(X_test)

accuracy_list.append(accuracy_score(y_test, test_preds_40))
log_loss_list.append(log_loss(y_test, test_probs_40))

In [67]:
X_train = df[df['Season'] != '2020-2021'].dropna().loc[:,rfecv_columns]
y_train = df[df['Season'] != '2020-2021'].dropna()['Home_Team_Won']
X_test = df[df['Season'] == '2020-2021'].dropna().loc[:,rfecv_columns]
y_test = df[df['Season'] == '2020-2021'].dropna()['Home_Team_Won']

test_preds_rfecv = log_cv_all.predict(X_test)

test_probs_rfecv = log_cv_all.predict_proba(X_test)


accuracy_list.append(accuracy_score(y_test, test_preds_rfecv))
log_loss_list.append(log_loss(y_test, test_probs_rfecv))



In [68]:
X_train = df[df['Season'] != '2020-2021'].dropna().loc[:,r_5_40]
y_train = df[df['Season'] != '2020-2021'].dropna()['Home_Team_Won']
X_test = df[df['Season'] == '2020-2021'].dropna().loc[:,r_5_40]
y_test = df[df['Season'] == '2020-2021'].dropna()['Home_Team_Won']



accuracy_list.append(accuracy_score(y_test, ada_cv.predict(X_test)))
log_loss_list.append(log_loss(y_test,ada_cv.predict_proba(X_test)))

In [69]:
X_train = df[df['Season'] != '2020-2021'].dropna().loc[:,r40]
y_train = df[df['Season'] != '2020-2021'].dropna()['Home_Team_Won']
X_test = df[df['Season'] == '2020-2021'].dropna().loc[:,r40]
y_test = df[df['Season'] == '2020-2021'].dropna()['Home_Team_Won']



accuracy_list.append(accuracy_score(y_test, ada_cv_40.predict(X_test)))
log_loss_list.append(log_loss(y_test, ada_cv_40.predict_proba(X_test)))

In [70]:
X_train = df[df['Season'] != '2020-2021'].dropna().loc[:,r_5_40]
y_train = df[df['Season'] != '2020-2021'].dropna()['Home_Team_Won']
X_test = df[df['Season'] == '2020-2021'].dropna().loc[:,r_5_40]
y_test = df[df['Season'] == '2020-2021'].dropna()['Home_Team_Won']



accuracy_list.append(accuracy_score(y_test, nn_cv.predict(X_test)))
log_loss_list.append(log_loss(y_test, nn_cv.predict_proba(X_test)))

In [71]:
X_train = df[df['Season'] != '2020-2021'].dropna().loc[:,r40]
y_train = df[df['Season'] != '2020-2021'].dropna()['Home_Team_Won']
X_test = df[df['Season'] == '2020-2021'].dropna().loc[:,r40]
y_test = df[df['Season'] == '2020-2021'].dropna()['Home_Team_Won']



accuracy_list.append(accuracy_score(y_test, nn_40_cv.predict(X_test)))
log_loss_list.append(log_loss(y_test, nn_40_cv.predict_proba(X_test)))

In [72]:
X_train = df[df['Season'] != '2020-2021'].dropna().loc[:,rfecv_columns]
y_train = df[df['Season'] != '2020-2021'].dropna()['Home_Team_Won']
X_test = df[df['Season'] == '2020-2021'].dropna().loc[:,rfecv_columns]
y_test = df[df['Season'] == '2020-2021'].dropna()['Home_Team_Won']



accuracy_list.append(accuracy_score(y_test, nn_all_cv.predict(X_test)))
log_loss_list.append(log_loss(y_test, nn_all_cv.predict_proba(X_test)))

In [73]:
results_dict['Test Accuracy'] = accuracy_list
results_dict['Test Log Loss'] = log_loss_list
models = ['5 and 40 Logistic Regression', 
          '40 Logistic Regression', 
          'rfecv Logistic Regression', 
          '5 and 40 AdaBoost', 
          '40 AdaBoost', 
          '5 and 40 Neural Network', 
          '40 Neural Network', 
          'rfecv Neural Network']
results_dict['Training Cross Validation Accuracy'] = [log_results['mean_test_accuracy'][0], 
                               log_40_results.loc[:,'mean_test_accuracy'].iloc[0], 
                               log_all_results.loc[:,'mean_test_accuracy'].iloc[0], 
                               ada_results.loc[:,'mean_test_accuracy'].iloc[0], 
                               ada_40_results.loc[:,'mean_test_accuracy'].iloc[0], 
                               nn_results.loc[:,'mean_test_accuracy'].iloc[0],
                              nn_40_results.loc[:,'mean_test_accuracy'].iloc[0],
                              nn_all_results.loc[:,'mean_test_accuracy'].iloc[0]]
results_dict['Training Cross Validation Log Loss'] = [log_cv.best_score_*-1, 
                               log_cv_40.best_score_*-1, 
                               log_cv_all.best_score_*-1, 
                               ada_cv.best_score_*-1, 
                               ada_cv_40.best_score_*-1, 
                               nn_cv.best_score_*-1, 
                               nn_40_cv.best_score_*-1, 
                               nn_all_cv.best_score_*-1]

results_dict['Paramters'] = [log_results.loc[:,'params'].iloc[0], 
                               log_40_results.loc[:,'params'].iloc[0], 
                               log_all_results.loc[:,'params'].iloc[0], 
                               ada_results.loc[:,'params'].iloc[0], 
                               ada_40_results.loc[:,'params'].iloc[0], 
                               nn_results.loc[:,'params'].iloc[0],
                              nn_40_results.loc[:,'params'].iloc[0],
                              nn_all_results.loc[:,'params'].iloc[0]]

In [74]:
results_df = pd.DataFrame(results_dict, index = models)

## Conclusion

The best model which had the best training cross validation log loss and test log loss was the Neural Network on the rolling 40 game features. 

In [75]:
pd.set_option('display.max_colwidth', None)
results_df.sort_values('Test Log Loss')

Unnamed: 0,Training Cross Validation Accuracy,Training Cross Validation Log Loss,Test Accuracy,Test Log Loss,Paramters
40 Neural Network,0.577331,0.672635,0.602439,0.655534,"{'nn__activation': 'linear', 'nn__dropout_rate': 0.3, 'nn__epochs': 10, 'nn__neurons': 36, 'nn__optimizer': 'RMSprop', 'nn__weight_constraint': 3}"
40 Logistic Regression,0.578727,0.674249,0.602439,0.656803,"{'logisticregression__C': 0.01, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l2', 'logisticregression__solver': 'liblinear'}"
5 and 40 Neural Network,0.58431,0.673594,0.603659,0.657981,"{'nn__activation': 'linear', 'nn__dropout_rate': 0.6, 'nn__epochs': 18, 'nn__neurons': 24, 'nn__optimizer': 'RMSprop', 'nn__weight_constraint': 1}"
40 AdaBoost,0.572311,0.675815,0.614634,0.660548,"{'ada__base_estimator': SVC(kernel='linear', probability=True), 'ada__learning_rate': 0.01, 'ada__n_estimators': 25}"
5 and 40 Logistic Regression,0.45617,0.675437,0.59878,0.661808,"{'logisticregression__C': 0.001, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l2', 'logisticregression__solver': 'lbfgs'}"
rfecv Neural Network,0.579844,0.673781,0.579268,0.664362,"{'nn__activation': 'linear', 'nn__dropout_rate': 0.3, 'nn__epochs': 15, 'nn__neurons': 24, 'nn__optimizer': 'Adam', 'nn__weight_constraint': 1}"
rfecv Logistic Regression,0.583472,0.673203,0.580488,0.670223,"{'logisticregression__C': 0.1, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l2', 'logisticregression__solver': 'lbfgs'}"
5 and 40 AdaBoost,0.577608,0.679858,0.59878,0.671058,"{'ada__base_estimator': SVC(kernel='linear', probability=True), 'ada__learning_rate': 10, 'ada__n_estimators': 25}"


I can see how my test results copare to some publicly published models per [hockey-statisics.com](https://hockey-statistics.com/2021/05/03/game-projections-january-13th-2021/). Log loss figures on the below chart is only for the 2021 season and up to 5/2/2021 which aligns fairly closely with my test dataset results, though my predictions go through 5/6/2021. My best model would come in 4th between BayesBet and Hockey-Statistics. This is encouraging and shows my model is competitive with other models out there. Still not as good as the Implied Odds which came from Bet365 and ComeOn per the webpage creator Lars Skytte.
<img src="images/model_comparison.png" alt="model_comparison" width="300"/>

I will save the predictions and evaluate them against historical odds to see if my model could be profitable.

In [76]:
# Save predictions and probabilites for from best model
X_test = df[df['Season'] == '2020-2021'].dropna().loc[:,r40]
y_test = df[df['Season'] == '2020-2021'].dropna()['Home_Team_Won']

pred_df = df[df['Season'] == '2020-2021'].dropna().loc[:,['game_id',
 'date',
 'venue',
 'home_team',
 'away_team',
 'start_time',
 'home_score',
 'away_score',
 'status',
 'Home_Team_Won',
 'Home_Team_Key',
 'Away_Team_Key', 'home_Game_Number','away_Game_Number','home_goalie',
 'home_Goalie_FenwickSV%',
 'home_Goalie_GSAx/60',
 'home_Goalie_HDCSV%',
 'away_goalie',
 'away_Goalie_FenwickSV%',
 'away_Goalie_GSAx/60',
 'away_Goalie_HDCSV%','home_last_40_FF%_5v5',
 'home_last_40_GF%_5v5',
 'home_last_40_xGF%_5v5',
 'home_last_40_SH%',
 'home_last40_pp_TOI_per_game',
 'home_last40_xGF_per_min_pp',
 'home_last40_GF_per_min_pp',
 'home_last40_pk_TOI_per_game',
 'home_last40_xGA_per_min_pk',
 'home_last40_GA_per_min_pk','away_last_40_FF%_5v5',
 'away_last_40_GF%_5v5',
 'away_last_40_xGF%_5v5',
 'away_last_40_SH%',
 'away_last40_pp_TOI_per_game',
 'away_last40_xGF_per_min_pp',
 'away_last40_GF_per_min_pp',
 'away_last40_pk_TOI_per_game',
 'away_last40_xGA_per_min_pk',
 'away_last40_GA_per_min_pk',
 'home_Rating.A.Pre',
 'away_Rating.A.Pre',
 'B2B_Status']]

preds = nn_40_cv.predict(X_test)
probs = nn_40_cv.predict_proba(X_test)

Predictions_2021 = pd.concat([pred_df, 
                             pd.DataFrame(preds, columns = ['Prediction'], index = y_test.index ),
                             pd.DataFrame(probs, columns = ['Away Win Probability', 'Home Win Probability'], index = y_test.index)], 
                             axis =1)

In [77]:
Predictions_2021.tail()

Unnamed: 0,game_id,date,venue,home_team,away_team,start_time,home_score,away_score,status,Home_Team_Won,Home_Team_Key,Away_Team_Key,home_Game_Number,away_Game_Number,home_goalie,home_Goalie_FenwickSV%,home_Goalie_GSAx/60,home_Goalie_HDCSV%,away_goalie,away_Goalie_FenwickSV%,away_Goalie_GSAx/60,away_Goalie_HDCSV%,home_last_40_FF%_5v5,home_last_40_GF%_5v5,home_last_40_xGF%_5v5,home_last_40_SH%,home_last40_pp_TOI_per_game,home_last40_xGF_per_min_pp,home_last40_GF_per_min_pp,home_last40_pk_TOI_per_game,home_last40_xGA_per_min_pk,home_last40_GA_per_min_pk,away_last_40_FF%_5v5,away_last_40_GF%_5v5,away_last_40_xGF%_5v5,away_last_40_SH%,away_last40_pp_TOI_per_game,away_last40_xGF_per_min_pp,away_last40_GF_per_min_pp,away_last40_pk_TOI_per_game,away_last40_xGA_per_min_pk,away_last40_GA_per_min_pk,home_Rating.A.Pre,away_Rating.A.Pre,B2B_Status,Prediction,Away Win Probability,Home Win Probability
4442,2020020838,2021-05-06,TD Garden,BOS,NYR,2021-05-06 23:00:00,4,0,Final,1,BOS_2021-05-06,NYR_2021-05-06,40.0,38.0,Jeremy Swayman,0.935086,-0.255694,0.86206,Igor Shesterkin,0.943293,0.221547,0.893805,55.281007,54.929673,53.113745,7.448563,5.1825,0.087844,0.096479,5.401667,0.084094,0.087936,48.264073,54.430227,48.777665,10.05015,5.033333,0.135397,0.153974,4.960833,0.111876,0.10079,1569.72,1512.11,Away_only,1,0.39688,0.60312
4443,2020020839,2021-05-06,Nassau Veterans Memorial Coliseum,NYI,N.J,2021-05-06 23:00:00,1,2,Final,0,NYI_2021-05-06,N.J_2021-05-06,43.0,46.0,Semyon Varlamov,0.945489,0.090302,0.88102,Mackenzie Blackwood,0.929299,-0.399936,0.837209,50.241772,57.867228,53.050836,8.805727,4.270833,0.112976,0.093659,4.164167,0.102181,0.090054,48.503229,41.919777,48.218609,7.979786,5.086667,0.092202,0.078637,4.4425,0.115419,0.135059,1549.32,1439.38,Neither,1,0.301967,0.698033
4444,2020020842,2021-05-06,PPG Paints Arena,PIT,BUF,2021-05-06 23:00:00,8,4,Final,1,PIT_2021-05-06,BUF_2021-05-06,48.0,49.0,Tristan Jarry,0.929605,-0.42756,0.843672,Michael Houser,0.935086,-0.255694,0.86206,50.36059,58.253252,49.798658,9.041652,4.2225,0.12386,0.171699,4.520833,0.095945,0.121659,43.7066,39.713487,43.700006,7.311708,4.217917,0.076993,0.082979,4.502917,0.123698,0.127695,1556.67,1416.17,Neither,1,0.28014,0.71986
4445,2020020847,2021-05-06,Scotiabank Arena,TOR,MTL,2021-05-06 23:00:00,5,2,Final,1,TOR_2021-05-06,MTL_2021-05-06,42.0,44.0,Jack Campbell,0.938931,-0.117228,0.845,Jake Allen,0.937289,-0.098128,0.878049,52.425741,57.93858,57.199725,9.228362,4.385833,0.125119,0.085503,3.714583,0.102703,0.134605,53.658068,46.852953,51.668374,6.754951,4.255417,0.085714,0.111622,4.325417,0.118775,0.138715,1550.15,1485.59,Away_only,1,0.344289,0.655711
4446,2020020593,2021-05-06,Rogers Place,EDM,VAN,2021-05-07 01:00:00,3,6,Final,0,EDM_2021-05-06,VAN_2021-05-06,47.0,47.0,Mike Smith,0.943015,0.055221,0.874687,Thatcher Demko,0.933794,-0.096288,0.854626,49.044109,53.663901,49.880668,9.351147,4.523333,0.128611,0.171334,4.334167,0.116055,0.09229,46.485886,44.299738,45.089832,7.305675,4.430417,0.105464,0.112856,5.131667,0.111075,0.116921,1536.06,1462.38,Neither,1,0.307734,0.692266


In [79]:
Predictions_2021.to_csv('data/Predictions_2021b')

## Next Steps
To further improve the models I would like to take the following next steps

- Implement voting classifier
- Try linear weightings in rolling features
- Build bottom up model using player statistics