# Natural Stat Trick Player Data
I collected the last ten regular seasons of player statistical, biographical, and biometric data from the website Natural Stat Trick. The site provides easy access to data in CSV form by season, but statistical and biographical data are in separate tables. I combined all of the data into a single table and removed any unnecessary fields.

### 1. Imports and Functions
* **var_to_pickle**: Writes the given variable to a pickle file
* **nst_files_to_df**: Loads the Natural Stat Trick CSV files that share a given prefix from the data directory and combines them into a single DataFrame

In [1]:
import pandas as pd
import numpy as np

from nhl_injuries_code import var_to_pickle, nst_files_to_df
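
The source of nst_files_to_df is not shown in this notebook. The following is only a minimal sketch of what such a helper could look like. The function name nst_files_to_df_sketch, the data_dir default, and the rule for reading the season out of the file name are all assumptions made for illustration. The real helper may work differently.

```python
import glob
import os

import pandas as pd


def nst_files_to_df_sketch(prefix, old_cols, new_cols, data_dir='../data'):
    """Hypothetical stand-in for nst_files_to_df; the real helper may differ."""
    frames = []
    pattern = os.path.join(data_dir, prefix + '*.csv')
    for path in sorted(glob.glob(pattern)):
        frame = pd.read_csv(path)
        # Assumption: if the CSV has no Season column, the season is the last
        # underscore-separated token of the file name (skaters_stats_2018.csv)
        if 'Season' not in frame.columns:
            stem = os.path.splitext(os.path.basename(path))[0]
            frame['Season'] = int(stem.rsplit('_', 1)[-1])
        # Keep only the requested columns, in the requested order
        frames.append(frame[old_cols])
    combined = pd.concat(frames, ignore_index=True)
    # Pair old and new names positionally
    return combined.rename(columns=dict(zip(old_cols, new_cols)))
```

Because the old and new names are paired by position, the two lists must have the same length and order.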

### 2. Create DataFrames from CSVs
Each prefix identifies one set of per-season CSV files, old_cols lists the columns I want to keep from the CSV data, and new_cols gives the new names for those columns in the same order.

In [2]:
stats_prefix = 'skaters_stats_'
stats_old_cols = ['Player', 'Team', 'Position', 'GP', 'TOI', 'Total Points', 'Shots', 'PIM',
                  'Major', 'Penalties Drawn', 'Hits', 'Hits Taken', 'Shots Blocked', 'Season']
stats_new_cols = ['Name', 'Team', 'Position', 'Games_Played', 'Time_On_Ice', 'Points',
                  'Shots', 'Penalty_Minutes', 'Major_Penalties', 'Penalties_Drawn', 'Hits',
                  'Hits_Taken', 'Shots_Blocked', 'Season']

bios_prefix = 'skaters_bios_'
bios_old_cols = ['Player', 'Team', 'Position', 'Date of Birth', 'Nationality', 'Height (in)',
                 'Weight (lbs)', 'Season']
bios_new_cols = ['Name', 'Team', 'Position', 'Birth_Date', 'Nationality', 'Height', 'Weight',
                 'Season']

stats_df = nst_files_to_df(stats_prefix, stats_old_cols, stats_new_cols)
bios_df = nst_files_to_df(bios_prefix, bios_old_cols, bios_new_cols)
df = pd.merge(stats_df, bios_df, how='left', on=['Name', 'Team', 'Position', 'Season'])
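
A left merge keeps every stats row even when no bio row matches on all four keys, so unmatched rows end up with missing bio fields. As a hedged illustration on toy data (the players and values below are made up), the indicator option of pd.merge can show which rows found no match:

```python
import pandas as pd

# Toy stand-ins for stats_df and bios_df; all values are invented
stats_toy = pd.DataFrame({
    'Name': ['Player A', 'Player B'],
    'Team': ['BOS', 'NYR'],
    'Position': ['C', 'D'],
    'Season': [2018, 2018],
    'Points': [50, 20],
})
bios_toy = pd.DataFrame({
    'Name': ['Player A'],
    'Team': ['BOS'],
    'Position': ['C'],
    'Season': [2018],
    'Height': [72],
})

keys = ['Name', 'Team', 'Position', 'Season']
merged = pd.merge(stats_toy, bios_toy, how='left', on=keys, indicator=True)

# Rows tagged left_only had no matching bio row, so their bio fields are NaN
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched[keys])
```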

### 3. Fill in Missing Seasons with Blank Seasons
I'm considering a missing season to be any season that lies between a player's min and max seasons but is not present in the DataFrame. There could be any number of reasons for a player not recording any NHL stats, such as long-term injury, retirement and subsequent return, or season-long demotion to a lower league. In all of those cases a player would have registered zeros for all counting stats, which is what I use for a blank season.

In [3]:
# Finds which seasons are missing from a player's continuous range of seasons
seasons_df = (df.groupby(['Name', 'Birth_Date'])['Season'].unique().reset_index())
seasons_df['Season'] = seasons_df['Season'].apply(lambda x: set(x.tolist()))
seasons_df['Min'] = seasons_df['Season'].map(min)
seasons_df['Max'] = seasons_df['Season'].map(max)
seasons_df['All_Seasons'] =\
    [set(range(first, last + 1)) for first, last in seasons_df[['Min', 'Max']].values]
seasons_df['Missing_Seasons'] = seasons_df['All_Seasons'] - seasons_df['Season']
seasons_df = seasons_df[seasons_df['Missing_Seasons'].map(bool)]

# I couldn't think of a more elegant solution, so this code brute forces a DataFrame of
# fill-ins for the missing seasons
fill_ins = []
for index, row in seasons_df[['Name', 'Birth_Date', 'Missing_Seasons']].iterrows():
    for year in sorted(list(row['Missing_Seasons'])):
        mask = ((df['Name'] == row['Name']) &
                (df['Birth_Date'] == row['Birth_Date']) &
                (df['Season'] < year))
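        # Take the most recent earlier season; copy so that assigning Season
        # below does not trigger a SettingWithCopyWarning
        fill_in = df[mask].sort_values('Season').tail(1).copy()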
        fill_in['Season'] = year
        fill_ins.append(fill_in)
fill_in_df = pd.concat(fill_ins)

# Resets all counting stats for fill-in rows to zero
# (columns 3-12, Games_Played through Shots_Blocked)
fill_in_df.iloc[:, 3:13] = 0
df = pd.concat([df, fill_in_df])
df.sort_values(by=['Name', 'Team', 'Position', 'Season'], inplace=True)
df.reset_index(drop=True, inplace=True)
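
As a possible alternative to the loop, the same idea can be sketched with reindex and ffill on toy data. This is not the method used above. It assumes each player has exactly one row per season, and the helper name fill_missing_seasons and the toy values are made up for illustration.

```python
import pandas as pd

# Toy data: Player A is missing 2014 and 2015, Player B has no gaps
toy = pd.DataFrame({
    'Name': ['Player A'] * 3 + ['Player B'] * 2,
    'Birth_Date': ['1990-01-01'] * 3 + ['1992-05-05'] * 2,
    'Team': ['BOS', 'BOS', 'NYR', 'TOR', 'TOR'],
    'Points': [10, 20, 30, 5, 6],
    'Season': [2012, 2013, 2016, 2014, 2015],
})


def fill_missing_seasons(group, stat_cols):
    """Hypothetical helper: one row per season between a player's min and max."""
    group = group.sort_values('Season').set_index('Season')
    first, last = int(group.index.min()), int(group.index.max())
    full_range = pd.Index(range(first, last + 1), name='Season')
    filled = group.reindex(full_range)
    # Rows created by reindex have no Name yet, which marks them as blank
    is_blank = filled['Name'].isna()
    # Carry identifying fields forward from the most recent real season
    filled = filled.ffill()
    # Blank seasons get zeros for the counting stats
    filled.loc[is_blank, stat_cols] = 0
    return filled.reset_index()


filled = pd.concat(
    [fill_missing_seasons(group, ['Points'])
     for _, group in toy.groupby(['Name', 'Birth_Date'])],
    ignore_index=True,
)
print(filled)
```

The stat columns come back as floats because reindex introduces NaN before the zeros are written, so an astype(int) may be wanted afterwards.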

### 4. Fix and Format Column Values
Birth_Date should be in datetime format, and some numeric fields are not fully converted to numbers when loaded, so I fix both issues here. Any row that contains the placeholder '-' or any other missing value is dropped.

In [4]:
df.replace('-', np.nan, inplace=True)
df.dropna(how='any', inplace=True)
df.reset_index(drop=True, inplace=True)
df['Birth_Date'] = pd.to_datetime(df['Birth_Date'], format='%Y-%m-%d')
df['Height'] = df['Height'].astype(int)
df['Weight'] = df['Weight'].astype(int)
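
To make the effect of these steps concrete, here is a small sketch on toy data with invented values. It shows that a row containing the '-' placeholder is dropped entirely and that the remaining columns take the expected types.

```python
import numpy as np
import pandas as pd

# Toy rows: Player B carries the '-' placeholder in two fields
toy = pd.DataFrame({
    'Name': ['Player A', 'Player B'],
    'Birth_Date': ['1990-01-15', '-'],
    'Height': ['73', '-'],
    'Weight': ['201', '195'],
})

# Treat '-' as missing, then drop every row with any missing value
cleaned = toy.replace('-', np.nan).dropna(how='any').reset_index(drop=True)
cleaned['Birth_Date'] = pd.to_datetime(cleaned['Birth_Date'], format='%Y-%m-%d')
cleaned['Height'] = cleaned['Height'].astype(int)
cleaned['Weight'] = cleaned['Weight'].astype(int)

print(cleaned.dtypes)
```

Note that dropna with how='any' removes a whole player-season when a single field is missing, which is the stricter of the available choices.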

### 5. Add Columns Identifying Europeans and Russians
European players once had a reputation for being soft, though less so today, and Alex Ovechkin once said 'Russian machine never breaks.' The European and Russian indicator columns should help test both ideas.

In [5]:
europeans = ['SWE', 'FIN', 'CZE', 'DEU', 'FRA', 'UKR', 'NOR', 'AUT', 'BLR', 'SVK', 'LTU', 'SVN',
             'LVA', 'HRV', 'GBR', 'CHE', 'NLD', 'DNK']
df['European'] = df['Nationality'].isin(europeans).astype(int)
df['Russian'] = (df['Nationality'] == 'RUS').astype(int)
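
A quick sketch on toy data shows how the two indicators behave. The country list is shortened here for illustration. Players who are neither European nor Russian get zeros in both columns.

```python
import pandas as pd

toy = pd.DataFrame({'Nationality': ['SWE', 'RUS', 'CAN', 'USA', 'FIN']})

# Shortened list for illustration only
europeans_toy = ['SWE', 'FIN']
toy['European'] = toy['Nationality'].isin(europeans_toy).astype(int)
toy['Russian'] = (toy['Nationality'] == 'RUS').astype(int)

print(toy)
```

Because RUS is not in the European list, the two indicators never overlap.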

### 6. Simplify Position Values
By default there are too many different values for the Position predictor, so I narrowed them down to Center (C), Winger (W), and Defense (D). Wingers tend to engage in more physical board battles than centers, so it may be worth keeping both classes.

In [6]:
# If multiple positions are listed, defense takes priority, then left wing, then right wing
positions = ['D', 'L', 'R']
for position in positions:
    mask = df['Position'].str.contains('%s,|, %s' % (position, position))
    df.loc[mask, 'Position'] = position
# L/R wingers are just wingers
df.loc[df['Position'].isin(['L', 'R']), 'Position'] = 'W'
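
The same logic can be traced on a handful of toy values. The comma-separated strings below assume the raw data lists multiple positions as 'C, L', which is what the pattern in the cell above implies.

```python
import pandas as pd

toy = pd.DataFrame(
    {'Position': ['C', 'L', 'R', 'D', 'C, L', 'L, R', 'C, D', 'D, R']})

# The first matching position wins, because a row that has been collapsed
# to a single letter no longer contains a comma
for position in ['D', 'L', 'R']:
    mask = toy['Position'].str.contains('%s,|, %s' % (position, position))
    toy.loc[mask, 'Position'] = position

# Left and right wingers collapse into a single winger class
toy.loc[toy['Position'].isin(['L', 'R']), 'Position'] = 'W'

print(toy['Position'].tolist())
```

A player listed at both center and wing is classified as a winger, and anyone listed at defense is classified as defense.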

### 7. Save Pickle

In [7]:
df_pickle = '../data/stats_df.pk'
var_to_pickle(df, df_pickle)
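
The body of var_to_pickle is not shown in this notebook. A minimal sketch of such a helper, together with a round trip to check that a DataFrame survives saving and loading, might look like the following. The name var_to_pickle_sketch and the use of the standard pickle module are assumptions, not the actual implementation.

```python
import os
import pickle
import tempfile

import pandas as pd


def var_to_pickle_sketch(var, filename):
    """Hypothetical stand-in for var_to_pickle; the real helper may differ."""
    with open(filename, 'wb') as f:
        pickle.dump(var, f)


toy = pd.DataFrame({'Name': ['Player A'], 'Season': [2018]})

# Write to a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'stats_df.pk')
    var_to_pickle_sketch(toy, path)
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored.equals(toy))
```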