## Introduction

The purpose of this project is to build a machine learning model which takes as input some player statistics available from https://moneypuck.com, and as output gives a prediction for the number of goals the player scored for that season.

It is important to note that this model is *not* predicting how many goals a player will score next season. Instead, the model learns how many goals a player *should have scored this season* based on their statistics, and we compare this prediction with the number of goals the player *actually scored*. An "underperforming" player is one who scored fewer goals than the model predicted, and an "overperforming" player is one who scored more goals than the model predicted.
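As a minimal sketch of that comparison, using made-up numbers (the player labels and goal totals below are hypothetical, not MoneyPuck data), the sign of actual minus predicted goals is what labels a player:

```python
import pandas as pd

# Hypothetical actual and model-predicted goal totals for three players
season = pd.DataFrame(
    {"actual_goals": [12, 30, 21], "predicted_goals": [18.5, 22.0, 21.4]},
    index=["player_a", "player_b", "player_c"],
)

# Negative residual: scored fewer goals than predicted (underperformer)
# Positive residual: scored more goals than predicted (overperformer)
season["residual"] = season["actual_goals"] - season["predicted_goals"]
season["label"] = season["residual"].apply(
    lambda r: "underperformer" if r < 0 else "overperformer"
)
print(season)
```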

From this, we will see that most players the model flags as underperformers do indeed have bounce-back seasons (purely in terms of goals scored) the following year, while most overperformers tend to score fewer goals the following year.

This type of model might be a useful tool when evaluating, say, the value of a player in a trade (e.g. selling high and buying low), or when building lists for the NHL entry draft.

## Imports

In [1]:
import os
import csv
import sys

%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import xgboost as xgb
import eli5
import shap
from xgboost import XGBClassifier
from sklearn.feature_selection import RFE, RFECV
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.ensemble import VotingClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression, Ridge, Lasso
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (
    accuracy_score,
    ConfusionMatrixDisplay,
    classification_report,
    confusion_matrix,
    f1_score,
    make_scorer,
    plot_confusion_matrix,
    recall_score,
    mean_absolute_error
)
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_val_score,
    cross_validate,
    train_test_split,
)
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.svm import SVC



In [164]:
df_17_18 = pd.read_csv("Players_17_18.csv", index_col="playerId") # The 2017-2018 season
df_18_19 = pd.read_csv("Players_18_19.csv", index_col="playerId") # The 2018-2019 season

## Exploratory Data Analysis

In [3]:
# Import the data dictionary which describes the data
with open("MoneyPuckDataDictionaryForPlayers.csv", newline='') as csvfile:
    r = csv.reader(csvfile, delimiter=',', quotechar='|')
    for row in r:
        print(''.join(row))

*****MoneyPuck.com Player and Team Data******

Please reach out through MoneyPuck.com if you have any feedback
No guarantees are made to the quality of the data. NHL data is known to have issues and biases.
You are welcome to use this data in any work. Just please cite MoneyPuck.com

Below are a description of general terms used in the data as well as a data dictionary below it:



General TermsDescription
Expected Goals"The sum of the probabilities of unblocked shot attempts being goals. For example a rebound shot in the slot may be worth 0.5 expected goals while a shot from the blueline while short handed may be worth 0.01 expected goals. The expected value of each shot attempt is calculated by the MoneyPuck Expected Goals model. Expected goals is commonly abbreviated as ""xGoals"". Blocked shot attempts are valued at 0 xGoals. See more here: http://moneypuck.com/about.htm#shotModel"
Score AdjustedAdjusts metrics to gives more credit to away teams and teams with large leads.
Flurry A

In [4]:
# Descriptions of the features

pd.set_option('display.max_colwidth', None, "display.max_rows", None)
explanations = pd.read_csv("MoneyPuckDataDictionaryForPlayers.csv").drop(labels=range(25),axis=0)
renamed = {"*****MoneyPuck.com Player and Team Data******":"Column Name",
          "Unnamed: 1":"Description"}
explanations = explanations.rename(mapper=renamed, axis=1)
explanations

Unnamed: 0,Column Name,Description
25,playerId,Unique ID for each player assigned by the NHL
26,season,Starting year of the season. For example 2018 for the 2018-2019 season
27,situation,"5on5 for normal play, 5on4 for a normal powerplay, 4on5 for a normal PK. 'Other' includes everything else: two man advantage, empty net, 4on3, etc. 'all' includes all situations"
28,games_played,Number of games played.
29,icetime,Ice time in seconds
30,shifts,Number of shifts a player had
31,gameScore,Game Score rating as designed by @domluszczyszyn
32,onIce_xGoalsPercentage,On Ice xGoals For / (On Ice xGoals For + On Ice xGoals Against)
33,offIce_xGoalsPercentage,Off Ice xGoals For / (Off Ice xGoals For + Off Ice xGoals Against)
34,onIce_corsiPercentage,On Ice Shot Attempts For / (On Ice Shot Attempts For + On Ice Shot Attempts Against)


In [5]:
df_17_18.head(5)

playerId,season,name,team,position,situation,games_played,icetime,shifts,gameScore,onIce_xGoalsPercentage,...,OffIce_F_xGoals,OffIce_A_xGoals,OffIce_F_shotAttempts,OffIce_A_shotAttempts,xGoalsForAfterShifts,xGoalsAgainstAfterShifts,corsiForAfterShifts,corsiAgainstAfterShifts,fenwickForAfterShifts,fenwickAgainstAfterShifts
8479595,2017,Blake Hillman,CHI,D,other,4,70.0,2.0,-0.13,0.0,...,0.18,0.55,4.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0
8479595,2017,Blake Hillman,CHI,D,all,4,4327.0,99.0,0.12,0.32,...,5.12,8.86,153.0,185.0,0.0,0.0,0.0,0.0,0.0,0.0
8479595,2017,Blake Hillman,CHI,D,5on5,4,3860.0,84.0,0.12,0.37,...,4.37,6.99,138.0,157.0,0.09,0.14,4.0,4.0,2.0,3.0
8479595,2017,Blake Hillman,CHI,D,4on5,4,392.0,12.0,0.12,0.02,...,0.01,1.06,1.0,17.0,0.0,0.0,0.0,0.0,0.0,0.0
8479595,2017,Blake Hillman,CHI,D,5on4,4,5.0,1.0,0.03,0.0,...,0.01,0.16,1.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0


In [249]:
df_17_18.columns.tolist()

##### *Note:*
There are a lot of features here, and some of them need care.

For example, the saved shots on goal feature must be dropped. With it included, the model scored extremely well at predicting goals, and on closer inspection the largest positive coefficient identified by Ridge was shots on goal, while the largest negative coefficient was saved shots on goal. The model had essentially learned goals = shots on goal - saved shots on goal, which means we were giving it the answers.
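To illustrate the kind of leakage described in this note, here is a small sketch on synthetic data. The arrays and the closed-form ridge solver are stand-ins chosen for illustration, not the notebook's pipeline. When goals are exactly shots minus saved shots, a ridge fit recovers coefficients close to +1 and -1 with an error close to zero, which is the warning sign to look for:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for shots on goal and saved shots on goal
shots = rng.integers(20, 300, size=500).astype(float)
saved = np.floor(shots * rng.uniform(0.85, 0.95, size=500))
goals = shots - saved  # the target is an exact linear function of the features

X = np.column_stack([shots, saved])

# Closed-form ridge solution without an intercept:
# w = (X^T X + alpha * I)^-1 X^T y
alpha = 1e-3
w = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ goals)

mae = np.mean(np.abs(X @ w - goals))
print("coefficients:", w)
print("mean absolute error:", mae)
```

A near-zero error paired with large coefficients of opposite sign on closely related counts is a reasonable prompt to check whether the features add up to the target.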

In [7]:
df_17_18["position"].value_counts()

D    1530
C    1365
L     860
R     695
Name: position, dtype: int64

In [169]:
# Create a dataframe for all-situations scoring (as opposed to just 5 on 5, or 5 on 4, etc)
df_17_18_all = df_17_18.loc[df_17_18["situation"] == "all"].drop(columns="situation")
df_18_19_all = df_18_19.loc[df_18_19["situation"] == "all"].drop(columns="situation")

In [170]:
#set back to default
pd.set_option('display.max_colwidth', 50, "display.max_rows", 60)

## Predicting Goals per Game

### Split the Data

In [171]:
# 2017-2018
y_17_18 = df_17_18_all[["I_F_goals"]]
X_17_18 = df_17_18_all.drop(columns="I_F_goals")

X_17_18_train, X_17_18_test, y_17_18_train, y_17_18_test = train_test_split(X_17_18, y_17_18, test_size=0.2)

# 2018-2019
y_18_19 = df_18_19_all[["I_F_goals"]]
X_18_19 = df_18_19_all.drop(columns="I_F_goals")

X_18_19_train, X_18_19_test, y_18_19_train, y_18_19_test = train_test_split(X_18_19, y_18_19, test_size=0.2)

### Create a preprocessor

In [12]:
X_17_18.columns.tolist()

In [251]:
drop_features = ["season", "name", "I_F_primaryAssists", "I_F_secondaryAssists",
               'I_F_points', 'I_F_reboundGoals', 
               'I_F_lowDangerGoals', 'I_F_mediumDangerGoals', 'I_F_highDangerGoals', 
               'OnIce_F_goals', 'OnIce_F_reboundGoals',
               'OnIce_F_lowDangerGoals', 'OnIce_F_mediumDangerGoals', 'OnIce_F_highDangerGoals', 
                "I_F_savedShotsOnGoal", "I_F_savedUnblockedShotAttempts"]


# get rid of all "on ice against" statistics
for feat in X_17_18_train.columns.tolist():
    if feat.startswith("OnIce_A"):
        drop_features.append(feat) 
    

categorical_features = ["team", "position"]

pass_features = X_17_18_train.columns.tolist()
for feat in drop_features + categorical_features: #everything else is passed through
    pass_features.remove(feat)
        
# Create a column transformer
preprocessor = make_column_transformer((OneHotEncoder(handle_unknown = "ignore"), categorical_features),
                                       ("passthrough", pass_features),
                                       ("drop", drop_features))

# Note that we will apply standard scaler later in a pipeline

# Keep track of new feature names
# (OneHotEncoder orders the categories of each column alphabetically, so sort to match)
new_categorical_features = sorted(X_17_18_train["team"].unique()) + sorted(X_17_18_train["position"].unique())

new_feature_names = new_categorical_features + pass_features

#### Using Only Features I Understand

In [259]:
num_features = ['games_played', 'icetime', 'shifts', 'gameScore', 
                'onIce_xGoalsPercentage', 'offIce_xGoalsPercentage', 'onIce_corsiPercentage', 
                'offIce_corsiPercentage', 'onIce_fenwickPercentage', 'offIce_fenwickPercentage', 
                'iceTimeRank', 'I_F_xOnGoal', 'I_F_xGoals', 
                'I_F_xRebounds', 'I_F_xFreeze', 'I_F_xPlayStopped', 
                'I_F_xPlayContinuedInZone', 'I_F_xPlayContinuedOutsideZone', 
                'I_F_flurryAdjustedxGoals', 'I_F_scoreVenueAdjustedxGoals', 'I_F_flurryScoreVenueAdjustedxGoals', 
                'I_F_primaryAssists', 'I_F_secondaryAssists', 'I_F_shotsOnGoal', 
                'OnIce_F_lowDangerxGoals', 'OnIce_F_mediumDangerxGoals', 'OnIce_F_highDangerxGoals']

#Notes
# I_F_faceOffsWon = faceoffsWon
# I_F_shifts = shifts

num_features = ['games_played', 'icetime', 
                'gameScore', 'onIce_xGoalsPercentage', 
                'onIce_corsiPercentage', 'onIce_fenwickPercentage', 
                'iceTimeRank', 'I_F_xOnGoal', 'I_F_xGoals', 
                'I_F_xRebounds', 'I_F_xFreeze', 'I_F_xPlayStopped', 'I_F_xPlayContinuedInZone', 
                'I_F_xPlayContinuedOutsideZone', 'I_F_flurryAdjustedxGoals', 'I_F_scoreVenueAdjustedxGoals', #
                'I_F_flurryScoreVenueAdjustedxGoals', 'I_F_primaryAssists', 'I_F_secondaryAssists', #
                'I_F_shotsOnGoal', 'I_F_missedShots', 'I_F_blockedShotAttempts', 'I_F_shotAttempts', 
                'I_F_rebounds', 'I_F_freeze', 'I_F_playStopped', 
                'I_F_playContinuedInZone', 'I_F_playContinuedOutsideZone',  
                'I_F_hits', 'I_F_takeaways', 'I_F_giveaways', 'I_F_lowDangerShots', 'I_F_mediumDangerShots', 
                'I_F_highDangerShots', 'I_F_lowDangerxGoals', 'I_F_mediumDangerxGoals', 
                'I_F_highDangerxGoals', 'I_F_scoreAdjustedShotsAttempts', 'I_F_unblockedShotAttempts', #
                'I_F_scoreAdjustedUnblockedShotAttempts', 'I_F_dZoneGiveaways', #
                'I_F_xGoalsFromxReboundsOfShots', 'I_F_xGoalsFromActualReboundsOfShots', 
                'I_F_reboundxGoals', 'I_F_xGoals_with_earned_rebounds', 
                'I_F_xGoals_with_earned_rebounds_scoreAdjusted', 
                'I_F_xGoals_with_earned_rebounds_scoreFlurryAdjusted', 'I_F_shifts', 
                'I_F_oZoneShiftStarts', 'I_F_dZoneShiftStarts', 'I_F_neutralZoneShiftStarts', 
                'I_F_flyShiftStarts', 'I_F_oZoneShiftEnds', 'I_F_dZoneShiftEnds', 'I_F_neutralZoneShiftEnds', 
                'I_F_flyShiftEnds', 'faceoffsWon', 'faceoffsLost', 'timeOnBench', 
                'penalityMinutesDrawn', 'penaltiesDrawn', 'OnIce_F_xOnGoal', 
                'OnIce_F_xGoals', 'OnIce_F_flurryAdjustedxGoals', 'OnIce_F_scoreVenueAdjustedxGoals', #
                'OnIce_F_flurryScoreVenueAdjustedxGoals', 'OnIce_F_shotsOnGoal', 'OnIce_F_missedShots', #
                'OnIce_F_blockedShotAttempts', 'OnIce_F_shotAttempts', 'OnIce_F_rebounds', 
                'OnIce_F_lowDangerShots', 'OnIce_F_mediumDangerShots', 
                'OnIce_F_highDangerShots', 'OnIce_F_lowDangerxGoals', 'OnIce_F_mediumDangerxGoals', 
                'OnIce_F_highDangerxGoals', 'OnIce_F_scoreAdjustedShotsAttempts', #
                'OnIce_F_unblockedShotAttempts', 'OnIce_F_scoreAdjustedUnblockedShotAttempts', #
                'OnIce_F_xGoalsFromxReboundsOfShots', 'OnIce_F_xGoalsFromActualReboundsOfShots', 
                'OnIce_F_reboundxGoals', 'OnIce_F_xGoals_with_earned_rebounds', 
                'OnIce_F_xGoals_with_earned_rebounds_scoreAdjusted', #
                'OnIce_F_xGoals_with_earned_rebounds_scoreFlurryAdjusted', 
                'xGoalsForAfterShifts', 'xGoalsAgainstAfterShifts', 'corsiForAfterShifts', 
                'corsiAgainstAfterShifts', 'fenwickForAfterShifts', 'fenwickAgainstAfterShifts']

# Final feature list (this assignment replaces the two lists above)
num_features = ['games_played', 'icetime', 
                'gameScore', 
                'onIce_corsiPercentage', 'onIce_fenwickPercentage', 
                'I_F_primaryAssists', 'I_F_secondaryAssists', #
                'I_F_shotsOnGoal', 'I_F_missedShots', 'I_F_blockedShotAttempts', 'I_F_shotAttempts', 
                'I_F_rebounds', 'I_F_freeze', 'I_F_playStopped', 
                'I_F_playContinuedInZone', 'I_F_playContinuedOutsideZone',  
                'I_F_hits', 'I_F_takeaways', 'I_F_giveaways', 'I_F_lowDangerShots', 'I_F_mediumDangerShots', 
                'I_F_highDangerShots', 'I_F_scoreAdjustedShotsAttempts', 'I_F_unblockedShotAttempts', #
                'I_F_shifts', 
                'I_F_oZoneShiftStarts', 'I_F_dZoneShiftStarts', 'I_F_neutralZoneShiftStarts', 
                'I_F_flyShiftStarts', 'I_F_oZoneShiftEnds', 'I_F_dZoneShiftEnds', 'I_F_neutralZoneShiftEnds', 
                'I_F_flyShiftEnds', 'faceoffsWon', 'faceoffsLost', 
                'penalityMinutesDrawn', 'penaltiesDrawn', 'OnIce_F_shotsOnGoal', 'OnIce_F_missedShots', #
                'OnIce_F_blockedShotAttempts', 'OnIce_F_shotAttempts', 'OnIce_F_rebounds', 
                'OnIce_F_lowDangerShots', 'OnIce_F_mediumDangerShots', 
                'OnIce_F_highDangerShots', 
                'OnIce_F_unblockedShotAttempts', 'corsiForAfterShifts', 
                'corsiAgainstAfterShifts', 'fenwickForAfterShifts', 'fenwickAgainstAfterShifts']

categorical_features = ['team', 'position']
drop_features = []
for feat in X_17_18.columns:
    if feat not in num_features+categorical_features:
        drop_features.append(feat)
        
preprocessor = make_column_transformer((OneHotEncoder(handle_unknown="ignore", sparse=False), categorical_features),
                                       (StandardScaler(), num_features),
                                       ("drop", drop_features))

In [260]:
X_17_18_train_transformed = preprocessor.fit_transform(X_17_18_train)

ohe_column_names = preprocessor.named_transformers_["onehotencoder"].get_feature_names_out().tolist()

new_feature_names = ohe_column_names + num_features

pd.DataFrame(X_17_18_train_transformed, columns = new_feature_names).head()

Unnamed: 0,team_ANA,team_ARI,team_BOS,team_BUF,team_CAR,team_CBJ,team_CGY,team_CHI,team_COL,team_DAL,...,OnIce_F_shotAttempts,OnIce_F_rebounds,OnIce_F_lowDangerShots,OnIce_F_mediumDangerShots,OnIce_F_highDangerShots,OnIce_F_unblockedShotAttempts,corsiForAfterShifts,corsiAgainstAfterShifts,fenwickForAfterShifts,fenwickAgainstAfterShifts
0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,1.409338,1.252625,1.240412,1.921286,1.149882,1.378985,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,1.333898,0.792266,1.396608,0.969257,1.635786,1.337526,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.991539,2.288432,2.133384,1.2866,1.830148,1.959415,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.874698,0.715539,0.813082,0.5206,0.761159,0.757097,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-1.162182,-1.16426,-1.15853,-1.098942,-1.11767,-1.15221,0.0,0.0,0.0,0.0


### Create a Pipeline for Multiple Models

In [261]:
# Ridge regression
pipe_17_18_Ridge = make_pipeline(preprocessor, StandardScaler(), Ridge())

# Lasso Regression
pipe_17_18_Lasso = make_pipeline(preprocessor, StandardScaler(), Lasso(max_iter=1000))

# Random forest regression
pipe_17_18_rfr = make_pipeline(preprocessor, RandomForestRegressor())

### Hyperparameter Tuning - Ridge

In [262]:
# Ridge 
parameters_ridge = {"ridge__alpha": np.linspace(0.001, 5, 100)} # Some values for the regularization strength
gs_ridge = GridSearchCV(
                            pipe_17_18_Ridge, 
                            parameters_ridge,
                            scoring="neg_mean_absolute_error"
)
# I have an older laptop that doesn't take well to parallel processing so n_jobs is not -1 here for grid search! 

gs_ridge.fit(X_17_18_train, y_17_18_train);

In [263]:
alpha_best_ridge = gs_ridge.best_params_["ridge__alpha"]
alpha_best_ridge

0.001

In [264]:
pd.DataFrame(gs_ridge.cv_results_).head(5)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_ridge__alpha,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.037114,0.016102,0.017598,0.007831,0.001,{'ridge__alpha': 0.001},-0.013634,-0.012851,-0.015667,-0.012534,-0.010359,-0.013009,0.001716,1
1,0.032645,0.009042,0.01802,0.006497,0.0514949,{'ridge__alpha': 0.05149494949494949},-0.186663,-0.18349,-0.179273,-0.169048,-0.146349,-0.172965,0.014574,2
2,0.023921,0.001695,0.013488,0.002647,0.10199,{'ridge__alpha': 0.10198989898989898},-0.317619,-0.321533,-0.315055,-0.291932,-0.255205,-0.300269,0.024787,3
3,0.024288,0.000711,0.011787,0.000348,0.152485,{'ridge__alpha': 0.15248484848484847},-0.425442,-0.439217,-0.429316,-0.394923,-0.349025,-0.407585,0.032806,4
4,0.024238,0.000818,0.012642,0.001097,0.20298,{'ridge__alpha': 0.20297979797979795},-0.517141,-0.540545,-0.527542,-0.483347,-0.430943,-0.499904,0.039351,5


In [265]:
pipe_17_18_Ridge = make_pipeline(preprocessor, StandardScaler(), Ridge(alpha=alpha_best_ridge))
pd.DataFrame(cross_validate(pipe_17_18_Ridge, X_17_18_train, y_17_18_train, scoring="neg_mean_absolute_error"))

Unnamed: 0,fit_time,score_time,test_score
0,0.075798,0.016632,-0.013634
1,0.032821,0.015022,-0.012851
2,0.027483,0.015834,-0.015667
3,0.022839,0.012559,-0.012534
4,0.023329,0.013767,-0.010359


In [318]:
pipe_17_18_Ridge.fit(X_17_18_train, y_17_18_train)
new_feature_names[np.argmax(pipe_17_18_Ridge.named_steps["ridge"].coef_)]

'I_F_shotsOnGoal'

In [319]:
new_feature_names[np.argmin(pipe_17_18_Ridge.named_steps["ridge"].coef_)]

'I_F_playContinuedOutsideZone'

### Hyperparameter Tuning - Lasso

In [266]:
# Lasso 
parameters_lasso = {"lasso__alpha": np.linspace(0.5, 5, 100)} # Some values for the regularization strength
gs_lasso = GridSearchCV(
                            pipe_17_18_Lasso, 
                            parameters_lasso,
                            scoring="neg_mean_absolute_error"
)
# I have an older laptop that doesn't take well to parallel processing so n_jobs is not -1 here for grid search! 

gs_lasso.fit(X_17_18_train, y_17_18_train);

In [267]:
alpha_best_lasso = gs_lasso.best_params_["lasso__alpha"]
alpha_best_lasso

0.5

In [268]:
pd.DataFrame(gs_lasso.cv_results_).head(5)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_lasso__alpha,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.042063,0.026429,0.015581,0.006614,0.5,{'lasso__alpha': 0.5},-1.995679,-2.084039,-2.089846,-1.99625,-1.758574,-1.984878,0.120259,1
1,0.031291,0.01355,0.012988,0.001514,0.545455,{'lasso__alpha': 0.5454545454545454},-1.991384,-2.088807,-2.102659,-2.009258,-1.772028,-1.992827,0.118574,2
2,0.029573,0.002904,0.014336,0.002008,0.590909,{'lasso__alpha': 0.5909090909090909},-1.988966,-2.097255,-2.116492,-2.022703,-1.78556,-2.002195,0.118006,3
3,0.024442,0.001818,0.011623,0.000325,0.636364,{'lasso__alpha': 0.6363636363636364},-1.988884,-2.108774,-2.130553,-2.036149,-1.799256,-2.012723,0.118148,4
4,0.024987,0.001234,0.012958,0.00129,0.681818,{'lasso__alpha': 0.6818181818181819},-1.990273,-2.121455,-2.144598,-2.049595,-1.813461,-2.023876,0.118457,5


In [332]:
pipe_17_18_Lasso = make_pipeline(preprocessor, StandardScaler(), Lasso(alpha = alpha_best_lasso))
pd.DataFrame(cross_validate(pipe_17_18_Lasso, X_17_18_train, y_17_18_train, scoring="neg_mean_absolute_error"))

Unnamed: 0,fit_time,score_time,test_score
0,0.05852,0.029304,-1.995679
1,0.058459,0.019898,-2.084039
2,0.028139,0.021279,-2.089846
3,0.043714,0.01911,-1.99625
4,0.040598,0.012531,-1.758574


### Hyperparameter Tuning - Random Forest

In [238]:
# Random Forest 
d = preprocessor.fit_transform(X_17_18_train).shape[1]
parameters_rfr = {
                    "randomforestregressor__max_depth": np.arange(np.floor(np.sqrt(d)/2), np.floor(np.sqrt(d)*2)),
                    "randomforestregressor__n_estimators": np.arange(20,100)
                 }
                  # Candidate tree depths and numbers of trees
rs_rfr = RandomizedSearchCV(
                            pipe_17_18_rfr, 
                            parameters_rfr,
                            scoring = "neg_mean_absolute_error"
)
# I have an older laptop that doesn't take well to parallel processing so n_jobs is not -1 here for the randomized search! 

rs_rfr.fit(X_17_18_train, np.ravel(y_17_18_train));

In [239]:
max_depth_best = rs_rfr.best_params_["randomforestregressor__max_depth"]
n_estimators_best = rs_rfr.best_params_["randomforestregressor__n_estimators"]

print(max_depth_best)
print(n_estimators_best)

5.0
70


In [243]:
pd.DataFrame(rs_rfr.cv_results_).head(5)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_randomforestregressor__n_estimators,param_randomforestregressor__max_depth,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,1.092714,0.015975,0.026735,0.003613,70,5,"{'randomforestregressor__n_estimators': 70, 'randomforestregressor__max_depth': 5.0}",-2.044487,-2.200367,-2.1261,-1.915323,-1.648707,-1.986997,0.193783,1
1,0.987772,0.02521,0.019591,0.002321,40,18,"{'randomforestregressor__n_estimators': 40, 'randomforestregressor__max_depth': 18.0}",-2.08007,-2.219056,-2.074816,-1.984859,-1.65669,-2.003098,0.188699,3
2,1.187794,0.02429,0.025982,0.006433,53,8,"{'randomforestregressor__n_estimators': 53, 'randomforestregressor__max_depth': 8.0}",-2.048307,-2.194693,-2.129529,-1.942795,-1.706474,-2.004359,0.171086,5
3,1.697238,0.051391,0.024921,0.003676,71,11,"{'randomforestregressor__n_estimators': 71, 'randomforestregressor__max_depth': 11.0}",-2.113089,-2.23641,-2.105714,-1.923862,-1.639182,-2.003652,0.207738,4
4,1.007163,0.015895,0.02264,0.003698,55,6,"{'randomforestregressor__n_estimators': 55, 'randomforestregressor__max_depth': 6.0}",-2.105381,-2.217325,-2.131191,-1.941037,-1.739408,-2.026869,0.169305,8


In [245]:
# max_depth_best comes out of np.arange as a float, so cast it to int
pipe_17_18_rfr = make_pipeline(preprocessor, 
                                 StandardScaler(), 
                                 RandomForestRegressor(max_depth = int(max_depth_best), n_estimators = n_estimators_best)
                                )
pd.DataFrame(cross_validate(pipe_17_18_rfr, 
                            X_17_18_train, 
                            np.ravel(y_17_18_train), 
                            scoring="neg_mean_absolute_error"))

Unnamed: 0,fit_time,score_time,test_score
0,1.197282,0.040178,-2.091225
1,1.167992,0.028356,-2.212462
2,1.108607,0.028576,-2.14312
3,1.084351,0.028071,-1.966927
4,1.108033,0.020013,-1.685812


In [316]:
all_feats = X_17_18_train.columns.tolist()
for col in drop_features:
    all_feats.remove(col)
    
all_feats

In [320]:
sorted_coefs = np.sort(np.array(pipe_17_18_Ridge.named_steps["ridge"].coef_[0]))
locs = np.argsort(np.array(pipe_17_18_Ridge.named_steps["ridge"].coef_[0]))
for i, coef in enumerate(sorted_coefs):
    print(i)
    print("Feature:", new_feature_names[locs[i]])
    print("Coefficient:", coef)
    print("\n")

0
Feature: I_F_playContinuedOutsideZone
Coefficient: -36.90179981596396


1
Feature: I_F_playContinuedInZone
Coefficient: -33.25695285958529


2
Feature: I_F_freeze
Coefficient: -16.43547602607052


3
Feature: I_F_rebounds
Coefficient: -6.229562677276686


4
Feature: I_F_blockedShotAttempts
Coefficient: -4.221464169600693


5
Feature: I_F_playStopped
Coefficient: -3.5020169655543305


6
Feature: OnIce_F_rebounds
Coefficient: -0.022167850143998906


7
Feature: icetime
Coefficient: -0.020049373752445164


8
Feature: games_played
Coefficient: -0.013060814957872948


9
Feature: penaltiesDrawn
Coefficient: -0.012404418212347008


10
Feature: OnIce_F_shotsOnGoal
Coefficient: -0.01007112467581452


11
Feature: gameScore
Coefficient: -0.009929240230197853


12
Feature: I_F_takeaways
Coefficient: -0.004669023551114529


13
Feature: I_F_oZoneShiftStarts
Coefficient: -0.0042905296326723715


14
Feature: I_F_giveaways
Coefficient: -0.0041445738265054955


15
Feature: I_F_dZoneShiftStarts
Coefficie

## Comparing Predicted Goals and Actual Goals

In [359]:
def UnderPerformer(differences_last_year): 
    
    underperformer_locs = np.argsort(differences_last_year)
    
    return underperformer_locs

def OverPerformer(differences_last_year): 
    
    overperformer_locs = np.argsort(differences_last_year)[::-1] # Reversed order
    
    return overperformer_locs

# Requires a fitted model.
# Also, be careful that X_last_year is a subset of X_this_year,
# For example use X_last_year test and X_this_year entire dataset.
def Find(PerformerType, X_last_year, X_this_year, y_last_year, y_this_year, model, top_n = 10): # PerformerType is the function UnderPerformer or OverPerformer

    print("\n")
    if PerformerType == UnderPerformer:
        print("Top", top_n, "Underperformers in 2017-2018")
    elif PerformerType == OverPerformer:
        print("Top", top_n, "Overperformers in 2017-2018")
    else:
        raise ValueError("PerformerType must be UnderPerformer or OverPerformer.")
    print("------------------------------")
    
    # Differences in actual goals minus predicted goals for last year
    y_pred_last_year = model.predict(X_last_year)
    differences_last_year = np.ravel(y_last_year) - np.ravel(y_pred_last_year)  
    
    # Sort the differences to identify under/overperformers. 
    # Earlier in the list means more under/overperformance last season.
    performer_locs = PerformerType(differences_last_year)
        
    for i in range(top_n):


        print("------------------------------")
        print("\n")
        

        # Get some info about the under/overperformer
        j = performer_locs[i]
        player_id = y_last_year.index[j]
        name = X_last_year.loc[player_id, "name"]
        
        # Find the games played, actual goals, and predicted goals from last year
        games_played_last_year = X_last_year.loc[player_id, "games_played"]
        goals_last_year = y_last_year.loc[player_id].iloc[0]
        pred_goals_last_year = np.ravel(y_pred_last_year)[j]

        # Make sure that player actually played next year
        if player_id in X_this_year.index:

            # Find the games played, actual goals, and predicted goals from this year
            games_played_this_year = X_this_year.loc[player_id, "games_played"]
            goals_this_year = y_this_year.loc[player_id].iloc[0]
            pred_goals_this_year = np.ravel(model.predict(X_this_year.loc[[player_id]]))[0]

            # Let's see if they actually bounced back next season!
            print(i+1)
            print("Player:", name)
            print("Player ID:", player_id)
            print("\n")

            print("2017-2018 Season")
            print("----------------------")
            print("Goal pace over 82 games: {:.2f}".format(goals_last_year / games_played_last_year * 82))
            print("Predicted goal pace over 82 games: {:.2f}".format(pred_goals_last_year / games_played_last_year * 82))
            print("Games played:", games_played_last_year)
            print("\n")

            print("2018-2019 Season")
            print("----------------------")
            print("Goal pace over 82 games: {:.2f}".format(goals_this_year / games_played_this_year * 82))
            print("Predicted goal pace over 82 games: {:.2f}".format(pred_goals_this_year / games_played_this_year * 82))
            print("Games played:", games_played_this_year)
            print("\n")

        else:

            print("Player", name, "(player ID", player_id, ") did not play in the 2018-2019 season.\n\n")
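The two pieces of arithmetic inside `Find` can be checked on a toy example (the residuals and the goal total below are invented for illustration): `np.argsort` orders players from most negative to most positive residual, and the pace lines scale a goal total to an 82-game season.

```python
import numpy as np

# Invented residuals (actual goals minus predicted goals) for five players
differences = np.array([3.0, -7.5, 0.5, 12.0, -2.0])

under_order = np.argsort(differences)        # most negative residual first
over_order = np.argsort(differences)[::-1]   # most positive residual first
print(under_order)  # positions of the biggest underperformers first
print(over_order)   # positions of the biggest overperformers first


def pace_82(goals, games_played):
    """Scale a goal total to an 82-game season."""
    return goals / games_played * 82


# 29 goals in 62 games is roughly a 38-goal pace over 82 games
print("{:.2f}".format(pace_82(29, 62)))
```

Because the ranking uses raw goal differences while the printout uses 82-game pace, a player with few games played can rank lower than their pace gap alone would suggest.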

In [360]:
# Fit on the 2018-2019 training set and predict on its test set
pipe_18_19_Ridge = make_pipeline(preprocessor, StandardScaler(), Ridge(alpha=alpha_best_ridge))
pipe_18_19_Ridge.fit(X_18_19_train, y_18_19_train)
y_pred_18_19 = pipe_18_19_Ridge.predict(X_18_19_test)

# Fit on the 2017-2018 training set and predict on its test set
pipe_17_18_Ridge.fit(X_17_18_train, y_17_18_train)
y_pred_17_18 = pipe_17_18_Ridge.predict(X_17_18_test)

# See the biggest outliers
differences_17_18 = np.ravel(y_17_18_test) - np.ravel(y_pred_17_18)

pipe_17_18_Lasso.fit(X_17_18_train, y_17_18_train);

In [364]:
Find(OverPerformer, X_17_18_test, X_18_19, y_17_18_test, y_18_19, pipe_17_18_Lasso)



Top 10 Overperformers in 2017-2018
------------------------------
------------------------------


1
Player: William Karlsson
Player ID: 8476448


2017-2018 Season
----------------------
Goal pace over 82 games: 43.00
Predicted goal pace over 82 games: 27.33
Games played: 82


2018-2019 Season
----------------------
Goal pace over 82 games: 24.00
Predicted goal pace over 82 games: 22.79
Games played: 82


------------------------------


2
Player: Brock Boeser
Player ID: 8478444


2017-2018 Season
----------------------
Goal pace over 82 games: 38.35
Predicted goal pace over 82 games: 21.03
Games played: 62


2018-2019 Season
----------------------
Goal pace over 82 games: 30.90
Predicted goal pace over 82 games: 22.65
Games played: 69


------------------------------


3
Player: Anze Kopitar
Player ID: 8471685


2017-2018 Season
----------------------
Goal pace over 82 games: 35.00
Predicted goal pace over 82 games: 27.63
Games played: 82


2018-2019 Season
----------------------
Go

In [365]:
Find(UnderPerformer, X_17_18_test, X_18_19, y_17_18_test, y_18_19, pipe_17_18_Lasso)



Top 10 Underperformers in 2017-2018
------------------------------
------------------------------


1
Player: John Klingberg
Player ID: 8475906


2017-2018 Season
----------------------
Goal pace over 82 games: 8.10
Predicted goal pace over 82 games: 15.33
Games played: 81


2018-2019 Season
----------------------
Goal pace over 82 games: 14.09
Predicted goal pace over 82 games: 13.05
Games played: 64


------------------------------


2
Player: Bryan Rust
Player ID: 8475810


2017-2018 Season
----------------------
Goal pace over 82 games: 15.45
Predicted goal pace over 82 games: 22.43
Games played: 69


2018-2019 Season
----------------------
Goal pace over 82 games: 20.50
Predicted goal pace over 82 games: 19.79
Games played: 72


------------------------------


3
Player: Jakub Voracek
Player ID: 8474161


2017-2018 Season
----------------------
Goal pace over 82 games: 20.00
Predicted goal pace over 82 games: 25.67
Games played: 82


2018-2019 Season
----------------------
Goal 

## Compare Models

We have several models (Ridge, Lasso, Random Forest Regression) which are trained to predict the number of goals a player scored.

Now, let's use each of these models to make a new Ridge ensemble model which essentially takes a weighted average "vote" between the existing models to predict goals. The new ensemble will train by rewarding the models that best predict over/underperformers. 

This means we will need to quantify mathematically how well a model predicts under/overperformers. Define $y^0$ to be the goals scored last year, and $y^1$ to be the goals scored this year. Let $p = y^0 - y^0_\text{pred}$ be the vector of differences between the actual goals scored last year and the predicted goals scored last year. Let's define a model's **outlier predictive ability score** $\mu$ to be

\begin{equation}
\mu = \frac{1}{1+e^{-\frac{N}{|J|}}} \sum_{j \in J} D_j,
\end{equation}

where 
- $D = y^0 - y^1$ is the vector difference between goals scored last year $y^0$ and goals scored this year $y^1$
- $J = \{j : p_j \text{ is at least 2 std dev from the mean of } p\}$
- $N$ is the number of correct predictions out of the $|J|$ predictions made by the model, where a "correct" prediction $j$ is defined as correctly predicting whether the player $j$ bounces back or regresses next season by at least half of the gap $p_j$.

The idea is that the score $\mu$ is higher when the differences $D_j$ in goals from one year to the next are large, but is slightly decreased by the sigmoid $(1+e^{-N/|J|})^{-1}$ when the model makes incorrect predictions.

Now we will create the Ridge ensemble model, and use GridSearchCV to find the hyperparameter value that maximizes the score $\mu$.

In [366]:
def Mu(N, J, D):
    # N: number of correct predictions, J: number of outliers |J|, D: goal differences D_j for j in J
    return 1 / (1 + np.exp(-N/J)) * np.sum(D)
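The definitions above can be turned into a small helper that builds $J$, $N$ and $D$ from the goal vectors before calling `Mu`. This is only a sketch under stated assumptions: the name `outlier_score` is hypothetical, the standard deviation is the population one (NumPy's default), and a prediction is counted as "correct" when the year-over-year change has the same sign as the gap $p_j$ and covers at least half of it, which is one reading of the definition. `Mu` is repeated so the block runs on its own.

```python
import numpy as np


def Mu(N, J, D):
    return 1 / (1 + np.exp(-N / J)) * np.sum(D)


def outlier_score(y0, y0_pred, y1, n_std=2.0):
    """Compute mu from last year's goals, last year's predictions, and this year's goals."""
    p = y0 - y0_pred   # gap between actual and predicted goals last year
    D = y0 - y1        # change in goals from last year to this year
    # J: players whose gap is at least n_std standard deviations from the mean gap
    in_J = np.abs(p - p.mean()) >= n_std * p.std()
    if not in_J.any():
        return 0.0     # assumption: with no outliers the score is taken to be zero
    # Assumed reading of "correct": the change has the sign of the gap
    # and is at least half its size
    correct = D[in_J] * np.sign(p[in_J]) >= np.abs(p[in_J]) / 2
    return Mu(correct.sum(), in_J.sum(), D[in_J])


# Invented data: nine players on target and one who beat the prediction by 10 goals
y0_pred = np.full(10, 10.0)
y0 = np.array([10.0] * 9 + [20.0])
y1 = np.array([10.0] * 9 + [14.0])   # the outlier drops by 6 goals the next year

score = outlier_score(y0, y0_pred, y1)
print(score)
```

In this toy case the single outlier regresses by more than half of the 10-goal gap, so $N = |J| = 1$ and the score is the sigmoid of 1 times the 6-goal drop.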