# Feature Engineering - Skater + Goalie Data

In [1]:
import pandas as pd

## Read in data from Hockey Reference

In [2]:
# Skater data
skaters_2021 = pd.read_csv("../data/skaters_2021.csv")
skaters_2022 = pd.read_csv("../data/skaters_2022.csv")

# Goalie data
goalies_2021 = pd.read_csv("../data/goalies_2021.csv")
goalies_2022 = pd.read_csv("../data/goalies_2022.csv")

In [54]:
skaters_2021.isnull().sum(axis = 0)
#skaters_2022.isnull().sum(axis = 0)
#goalies_2021.isnull().sum(axis = 0)
#goalies_2022.isnull().sum(axis = 0)

player_id               0
player_name             0
age                     0
season                  0
game_num                0
date                    0
team                    0
opponent                0
home_away_status        0
result                  0
G                       0
A                       0
P                       0
rating                  0
PIM                     0
EVG                     0
PPG                     0
SHG                     0
GWG                     0
EVA                     0
PPA                     0
SHA                     0
S                       0
S_perc               7330
shifts                  0
TOI                     0
HIT                     0
BLK                     0
FOW                     0
FOL                     0
FOW_perc            19451
dtype: int64
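
The only columns with missing values are the two percentage columns. One plausible explanation (an assumption, not something verified against the real data here) is that a percentage is undefined whenever its denominator is zero: no shots for `S_perc`, no faceoffs taken for `FOW_perc`. A minimal sketch of how that could be checked, using a made-up toy frame that borrows the skater column names:

```python
import numpy as np
import pandas as pd

# Toy stand-in for skaters_2021; the values are invented for illustration.
toy = pd.DataFrame({
    "S":        [3, 0, 2, 0],
    "S_perc":   [33.3, np.nan, 0.0, np.nan],
    "FOW":      [5, 0, 0, 0],
    "FOL":      [5, 0, 3, 0],
    "FOW_perc": [50.0, np.nan, 0.0, np.nan],
})

# If the assumption holds, a missing percentage coincides exactly
# with a zero denominator on the same row.
s_perc_explained = bool((toy["S_perc"].isnull() == (toy["S"] == 0)).all())
faceoffs_taken = toy["FOW"] + toy["FOL"]
fow_perc_explained = bool((toy["FOW_perc"].isnull() == (faceoffs_taken == 0)).all())

print(s_perc_explained, fow_perc_explained)
```

Running the same two comparisons on `skaters_2021` would confirm or reject the assumption for the real data.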

## Function for creating the final skater and goalie data
Should create all the cumulative stats for the skaters and, with a different player type, for the goalies.

### Setting up a place to store cumulative computed features
Ideally, this will be the data frame containing the rows/observations to train a model on. It keeps the same identifying information (player ID, game number, team, etc.), but for features like shots or TOI we compute the cumulative sums/averages up to the start of each game and store them here. This keeps the raw game-by-game data separate from the cumulative data.
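
To make "up to the start of each game" concrete, here is a minimal sketch on a made-up toy game log (only the column names are borrowed from the skater data). An expanding mean taken directly on a column includes the current game; shifting each player's series by one game first restricts the average to earlier games. The names `avg_TOI_incl` and `avg_TOI_prior` are hypothetical and used only for this illustration.

```python
import pandas as pd

# Toy game log with invented values.
games = pd.DataFrame({
    "player_id": ["a", "a", "a", "b", "b"],
    "game_num":  [1, 2, 3, 1, 2],
    "TOI":       [10.0, 20.0, 30.0, 15.0, 5.0],
}).sort_values(["player_id", "game_num"])

# Average over games 1..n, i.e. including the current game.
games["avg_TOI_incl"] = (
    games.groupby("player_id")["TOI"]
    .transform(lambda toi: toi.expanding().mean())
)

# Average over games 1..n-1 only: shift by one game before averaging.
# Each player's first game has no history, so it comes out as NaN.
games["avg_TOI_prior"] = (
    games.groupby("player_id")["TOI"]
    .transform(lambda toi: toi.shift(1).expanding().mean())
)

print(games)
```

Which variant is appropriate depends on whether the current game's stats would be known at prediction time; the shifted version avoids using them.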

In [11]:
def create_cumulative_statistics(indiv_games, player_type):
    # Make sure data is in correct order before grouping by player ID
    indiv_games = indiv_games.sort_values("game_num")
    
    if player_type == "skater":
        # Create new data frames containing only certain columns from the old data frames
        cumulative_games = indiv_games.loc[:, ["player_id", "player_name", "age", "season", "game_num", "date", 
                                                        "team", "opponent", "home_away_status", "result", "G"]]

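        # Compute each player's running average TOI through the current game (inclusive).
        # transform keeps the original row index, so the assignment aligns row by row
        # even if the input was not already ordered by player ID.
        cumulative_games["avg_TOI"] = indiv_games.groupby("player_id")["TOI"].transform(lambda toi: toi.expanding().mean())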
    
        # Compute shots per 60 minutes
        cumulative_games[["total_S", "total_TOI"]] = indiv_games.groupby("player_id")[["S", "TOI"]].cumsum()
        cumulative_games["S_60"] = 60.0 * cumulative_games.total_S / cumulative_games.total_TOI
        cumulative_games = cumulative_games.drop(columns = ["total_S", "total_TOI"])
        
        
    elif player_type == "goalie":
        # Create new data frames containing only certain columns from the old data frames
        cumulative_games = indiv_games.loc[:, ["player_id", "player_name", "age", "season", "game_num", "date", 
                                                    "team", "opponent", "home_away_status", "result", "decision"]]
    
        # Compute Save Percentage = sum(Saves) / sum(Shots Against)
        cumulative_games[["SA", "SV"]] = indiv_games.groupby("player_id")[["SA", "SV"]].cumsum()
        cumulative_games["SV_perc"] = cumulative_games.SV / cumulative_games.SA
        cumulative_games = cumulative_games.drop(columns = ["SV", "SA"])
        
        # Compute goals against average?
        
    else:
        # Reject unknown player types instead of silently returning None
        raise ValueError(f'Unknown player_type {player_type!r}; expected "skater" or "goalie".')
    
    return cumulative_games
    
    

In [20]:
# Use function to create final data frames for skaters and goalies
skaters_final_2021 = create_cumulative_statistics(skaters_2021, player_type = "skater")
skaters_final_2022 = create_cumulative_statistics(skaters_2022, player_type = "skater")
goalies_final_2021 = create_cumulative_statistics(goalies_2021, player_type = "goalie")
goalies_final_2022 = create_cumulative_statistics(goalies_2022, player_type = "goalie")

In [55]:
skaters_final_2021.head()

|  | player_id | player_name | age | season | game_num | date | team | opponent | home_away_status | result | G | avg_TOI | S_60 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | /a/abramvi01 | Vitaly Abramov | 22 | 2021 | 1 | 2021-05-05 | OTT | MTL | 1 | W | 0 | 9.633333 | 0.0 |
| 12833 | /h/hugheja03 | Jack Hughes | 19 | 2021 | 1 | 2021-01-14 | NJD | BOS | 1 | L-SO | 0 | 21.833333 | 8.244275 |
| 22924 | /p/pintosh01 | Shane Pinto | 20 | 2021 | 1 | 2021-04-17 | OTT | MTL | 0 | W | 0 | 9.4 | 0.0 |
| 694 | /a/athanan01 | Andreas Athanasiou | 26 | 2021 | 1 | 2021-01-14 | LAK | MIN | 1 | L-OT | 1 | 13.083333 | 4.585987 |
| 4228 | /c/chiarbe01 | Ben Chiarot | 29 | 2021 | 1 | 2021-01-13 | MTL | TOR | 0 | L-OT | 0 | 21.166667 | 2.834646 |


In [56]:
goalies_final_2021.head()

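|  | player_id | player_name | age | season | game_num | date | team | opponent | home_away_status | result | decision | SV_perc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | /a/allenja01 | Jake Allen | 30 | 2021 | 1 | 2021-01-18 | MTL | EDM | 0 | W | W | 0.961538 |
| 610 | /h/helleco01 | Connor Hellebuyck | 27 | 2021 | 1 | 2021-01-14 | WPG | CGY | 1 | W | W | 0.884615 |
| 655 | /h/hillad01 | Adin Hill | 24 | 2021 | 1 | 2021-02-24 | ARI | ANA | 1 | W | W | 1.0 |
| 674 | /h/hogbema01 | Marcus Hogberg | 26 | 2021 | 1 | 2021-01-21 | OTT | WPG | 1 | L |  | 1.0 |
| 688 | /h/holtbbr01 | Braden Holtby | 31 | 2021 | 1 | 2021-01-13 | VAN | EDM | 0 | W | W | 0.903226 |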
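
The goalie frame currently carries only the cumulative save percentage. A cumulative goals-against average could be built the same way. The sketch below uses the standard definition, GAA = 60 × goals against / minutes played, and assumes `GA` and `TOI` (in minutes) columns like those in the raw goalie data. The toy values and the `GAA` column name are made up for illustration.

```python
import pandas as pd

# Toy goalie game log with invented values.
goalie_games = pd.DataFrame({
    "player_id": ["g1", "g1", "g2"],
    "game_num":  [1, 2, 1],
    "GA":        [2, 4, 1],
    "TOI":       [60.0, 60.0, 30.0],
}).sort_values(["player_id", "game_num"])

# Running totals per goalie, through and including the current game.
totals = goalie_games.groupby("player_id")[["GA", "TOI"]].cumsum()

# Goals against per 60 minutes of ice time.
goalie_games["GAA"] = 60.0 * totals["GA"] / totals["TOI"]

print(goalie_games)
```

As with save percentage, dividing the running totals weights every minute equally, which is not the same as averaging per-game GAA values.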


## Issue: In some games, a goalie played badly and was pulled
This matters because once the second goalie comes in, two goalies have played in the same game. When we left join the goalie information to the skater information, every skater row for that game matches two goalie rows (same date and team), so a lot of rows in the skater data frame get duplicated.

Beyond the duplication, we do not know which goalie a skater scored on in such a game. This is a problem because we want to include the strength of the opposing team's goalie in the training data. If we don't know which goalie was in net while a given player was on the ice, that feature is inaccurate for these games.
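
A minimal sketch of the duplication on made-up toy frames. The join key used here (skater `opponent` matched to goalie `team`, plus `date`) is an assumption about how the join would be set up, and the `goalie_name` column is hypothetical.

```python
import pandas as pd

# Two skaters who faced PIT on the same date (toy data).
skaters = pd.DataFrame({
    "player_name": ["Skater A", "Skater B"],
    "date": ["2021-01-15", "2021-01-15"],
    "opponent": ["PIT", "PIT"],
})

# Two goalies who both appeared for PIT in that game (toy data).
goalies = pd.DataFrame({
    "goalie_name": ["Goalie X", "Goalie Y"],
    "date": ["2021-01-15", "2021-01-15"],
    "team": ["PIT", "PIT"],
})

joined = skaters.merge(
    goalies, how="left",
    left_on=["date", "opponent"], right_on=["date", "team"],
)
print(len(skaters), len(joined))  # 2 skater rows become 4

# validate="many_to_one" makes pandas raise instead of silently duplicating.
try:
    skaters.merge(
        goalies, how="left",
        left_on=["date", "opponent"], right_on=["date", "team"],
        validate="many_to_one",
    )
    duplicate_keys_detected = False
except pd.errors.MergeError:
    duplicate_keys_detected = True

print(duplicate_keys_detected)
```

Passing `validate` on the real join is a cheap guard: the merge fails loudly as soon as any (date, team) pair has more than one goalie row.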

There are 116 cases in the 2021 season where a team used two or more goalies in the same game.

Here is an example.

In [45]:
#goalies_final_2021.groupby(["date", "team"])["player_id"].count()
pulled_goalies = goalies_final_2021.groupby(["date", "team"])["player_id"].nunique()
pulled_goalies[pulled_goalies >= 2]

date        team
2021-01-15  PIT     2
            STL     2
2021-01-16  SJS     2
2021-01-18  PHI     2
2021-01-19  BUF     2
                   ..
2021-05-06  EDM     2
            MTL     2
2021-05-08  NYR     2
2021-05-09  OTT     2
2021-05-10  NYI     2
Name: player_id, Length: 116, dtype: int64

In [46]:
goalies_2021.loc[(goalies_2021.team == "PIT") & (goalies_2021.date == "2021-01-15"), :]

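|  | player_id | player_name | age | season | game_num | date | team | opponent | home_away_status | result | decision | GA | SA | SV | SV_perc | shutout | PIM | TOI |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 283 | /d/desmica01 | Casey DeSmith | 29 | 2021 | 1 | 2021-01-15 | PIT | PHI | 0 | L |  | 1 | 13 | 12 | 0.923 | 0 | 0 | 47.75 |
| 752 | /j/jarrytr01 | Tristan Jarry | 25 | 2021 | 2 | 2021-01-15 | PIT | PHI | 0 | L | L | 3 | 6 | 3 | 0.5 | 0 | 0 | 11.5 |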


## Left join goalie to skater?
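
One possible way to make the left join safe, sketched on made-up toy frames: reduce the goalie data to one row per (date, team) before joining, here by keeping the goalie with the most ice time in that game. This is only one option and an assumption on our part, not a settled decision. The `opp_goalie_id` column name and the choice of joining skater `opponent` to goalie `team` are hypothetical.

```python
import pandas as pd

# Toy goalie rows: PIT used two goalies on this date, PHI used one.
goalies = pd.DataFrame({
    "player_id": ["g1", "g2", "g3"],
    "date": ["2021-01-15", "2021-01-15", "2021-01-15"],
    "team": ["PIT", "PIT", "PHI"],
    "TOI": [47.75, 11.5, 60.0],
})

# Keep, for each (date, team), the row of the goalie with the most minutes.
most_minutes = goalies.groupby(["date", "team"])["TOI"].idxmax()
primary_goalies = goalies.loc[most_minutes].reset_index(drop=True)

# Toy skater rows: both skaters faced PIT.
skaters = pd.DataFrame({
    "player_name": ["Skater A", "Skater B"],
    "date": ["2021-01-15", "2021-01-15"],
    "opponent": ["PIT", "PIT"],
})

# Rename so the goalie's team lines up with the skater's opponent.
goalie_features = primary_goalies.rename(
    columns={"player_id": "opp_goalie_id", "team": "opponent"}
)[["date", "opponent", "opp_goalie_id"]]

joined = skaters.merge(
    goalie_features, on=["date", "opponent"], how="left",
    validate="many_to_one",  # fails loudly if duplicates remain
)
print(joined)
```

Keeping only the goalie with the most minutes discards the other goalie's contribution. A TOI-weighted average of the goalies' stats per (date, team) would be an alternative that uses both rows.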