# Data Merge
Combining my two data sources into a single, useful DataFrame required a fair amount of work. This notebook merges them and converts values for analysis. It depends on pickle files saved by the notebooks **player_injury_data.ipynb** and **player_nst_data.ipynb**.

### 1. Imports and Functions
* **var_to_pickle**: Writes the given variable to a pickle file
* **read_pickle**: Loads a variable from the given pickle file

In [1]:
import datetime
import numpy as np
import pandas as pd
from collections import defaultdict

from luther_code import var_to_pickle, read_pickle

%matplotlib inline

### 2. Define List of Season Parameters
Sets the range of seasons I analyzed as well as the number of games per season.

In [2]:
season_years = list(range(2010,2020))
season_month = 9
season_day = 1

# Every season in data range is 82 games except for lockout-shortened 2013 with 48
season_length = defaultdict(lambda:82)
season_length[2013] = 48
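
As a quick, self-contained check of how this lookup behaves (a sketch, not part of the original notebook): any season without an explicit entry falls back to 82, and looking up a missing key also inserts it into the dictionary.

```python
from collections import defaultdict

season_length = defaultdict(lambda: 82)
season_length[2013] = 48

print(season_length[2013])    # explicit entry: 48
print(season_length[2016])    # no entry, so the default is used: 82
print(sorted(season_length))  # the lookup above added 2016 as a key
```

That insertion side effect matters later, because any code that iterates over `season_length.items()` will also see the keys created by earlier lookups.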

### 3. Add Season Column to Injuries DataFrame

In [3]:
injuries_df = read_pickle('../data/injuries_df.pickle')
injuries_df['Season'] = 0
min_year = injuries_df['Injury_Date'].min().year
max_year = injuries_df['Injury_Date'].max().year + 1
season = {min_year - 1:datetime.datetime(min_year - 1, season_month, season_day)}
for year in range(min_year, max_year):
    season[year] = datetime.datetime(year, season_month, season_day)
    mask = ((injuries_df['Injury_Date'] > season[year-1])
            & (injuries_df['Injury_Date'] <= season[year]))
    injuries_df.loc[mask, 'Season'] = year
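
The loop above labels each injury with the year of the September 1 cutoff that ends its window. A minimal vectorized sketch of the same rule follows, using made-up dates; the helper name `season_label` is hypothetical and not from the original notebook.

```python
import pandas as pd

def season_label(dates, month=9, day=1):
    """Return the season year per date: dates after the cutoff roll into the next year."""
    # Build the cutoff date (September 1 by default) for each date's own calendar year
    cutoff = pd.to_datetime(
        pd.DataFrame({'year': dates.dt.year, 'month': month, 'day': day})
    )
    return dates.dt.year + (dates > cutoff).astype(int)

# Made-up example dates
dates = pd.Series(pd.to_datetime(['2012-10-15', '2013-02-01', '2013-09-01', '2013-09-02']))
labels = season_label(dates)
print(labels.tolist())  # [2013, 2013, 2013, 2014]
```

An injury on the cutoff date itself stays in the earlier season, matching the `<=` comparison in the loop.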

### 4. Make DataFrame of Games Missed due to Injury by Player and Season

In [4]:
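# Sum only numeric columns; recent pandas versions raise on datetime columns otherwise
missed_df = injuries_df.groupby(['Name', 'Birth_Date', 'Season'], as_index=False).sum(numeric_only=True)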

# Cap games missed by injury to season length (seasons may be shortened by lockouts)
missed_df.loc[missed_df['Games_Missed'] > season_length[0], 'Games_Missed'] = season_length[0]
for key,val in season_length.items():
    mask = (missed_df['Season'] == key) & (missed_df['Games_Missed'] > val)
    missed_df.loc[mask, 'Games_Missed'] = val

### 5. Change Games Missed Names to Match Stats Names
The inconsistencies are due to the two data sets coming from different sources.

In [5]:
name_changes = [
    ('Alex Burmistrov', 'Alexander Burmistrov'),
    ('Alexander Petrovic', 'Alex Petrovic'),
    ('Alexei Marchenko', 'Alexey Marchenko'),
    ('Matt Benning', 'Matthew Benning'),
    ('Michael Cammalleri', 'Mike Cammalleri'),
    ('Mike Sauer', 'Michael Sauer'),
    ('Mike Zigomanis', 'Michael Zigomanis'),
    ('P.A. Parenteau', 'PA Parenteau'),
    ('T.J. Brodie', 'TJ Brodie'),
    ('T.J. Galiardi', 'TJ Galiardi')
]
for old_name,new_name in name_changes:
    missed_df.loc[missed_df['Name'] == old_name, 'Name'] = new_name
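
The same renaming can also be expressed as a dictionary lookup with `Series.replace`, which swaps exact matches and leaves every other name untouched. This is a sketch on a toy Series: two pairs are taken from the list above, and `'Some Player'` is a made-up placeholder.

```python
import pandas as pd

# Two of the pairs from name_changes, written as a mapping
name_map = {'T.J. Brodie': 'TJ Brodie', 'Matt Benning': 'Matthew Benning'}

names = pd.Series(['T.J. Brodie', 'Matt Benning', 'Some Player'])
renamed = names.replace(name_map)
print(renamed.tolist())  # ['TJ Brodie', 'Matthew Benning', 'Some Player']
```

With the list of tuples already defined, `dict(name_changes)` would produce an equivalent mapping.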

### 6. Merge Stats and Games Missed DataFrames
Because some birth dates differ between the two sources, I used birth date as a merge key only for players who share a name with another player. Everyone else is merged on name and season alone.

In [6]:
stats_df = read_pickle('../data/stats_df.pickle')

# Not all player birthdays are consistent between DataFrames, so only use Birth_Date as
# a merge condition if multiple players share the same name
bdays_per_name = stats_df.groupby('Name')['Birth_Date'].nunique()
multiple_names = bdays_per_name[(bdays_per_name > 1)].index
multiples_df = stats_df[stats_df['Name'].isin(multiple_names)]
multiples_df = multiples_df.merge(missed_df, how='left', on=['Name', 'Birth_Date', 'Season'])

# Recombine multiple name and single name DataFrames
df = stats_df[~stats_df['Name'].isin(multiple_names)]
df = df.merge(missed_df[['Name', 'Season', 'Games_Missed']], how='left', on=['Name', 'Season'])
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df = pd.concat([df, multiples_df])
df = df.sort_values(by=['Name', 'Birth_Date', 'Season']).reset_index(drop=True)
df['Games_Missed'] = df['Games_Missed'].fillna(0).astype(int)
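
To see why the two-path merge matters, here is a toy sketch with made-up players: `'A Player'` has a birth date that is off by one day between the sources, and two different players are both called `'Same Name'`. All names and numbers are hypothetical.

```python
import pandas as pd

stats = pd.DataFrame({
    'Name': ['A Player', 'Same Name', 'Same Name'],
    'Birth_Date': pd.to_datetime(['1990-01-01', '1988-05-05', '1992-07-07']),
    'Season': [2015, 2015, 2015],
    'Games_Played': [70, 82, 60],
})
missed = pd.DataFrame({
    'Name': ['A Player', 'Same Name'],
    'Birth_Date': pd.to_datetime(['1990-01-02', '1992-07-07']),  # first date is inconsistent
    'Season': [2015, 2015],
    'Games_Missed': [5, 12],
})

# Names that map to more than one birth date need the stricter merge key
bdays_per_name = stats.groupby('Name')['Birth_Date'].nunique()
is_shared = stats['Name'].isin(bdays_per_name[bdays_per_name > 1].index)

strict = stats[is_shared].merge(missed, how='left', on=['Name', 'Birth_Date', 'Season'])
loose = stats[~is_shared].merge(missed[['Name', 'Season', 'Games_Missed']],
                                how='left', on=['Name', 'Season'])

merged = pd.concat([loose, strict], ignore_index=True)
merged['Games_Missed'] = merged['Games_Missed'].fillna(0).astype(int)
print(merged[['Name', 'Games_Missed']])
```

In this toy data, `'A Player'` keeps the 5 missed games despite the mismatched birth date, and the 12 missed games go only to the `'Same Name'` player born in 1992.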

### 7. Adjust Games Missed if Total Games Exceeds Games in a Season
Total Games equals Games Missed plus Games Played. Some players do play more than the regulation number of games after being traded to a team that has games in hand on their former team, but in most situations Total Games should be less than or equal to the number of games in a season.

In [7]:
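# Use each row's own season length so the 48-game 2013 season is capped correctly
games_in_season = df['Season'].map(lambda s: season_length[s])
max_missed = (games_in_season - df['Games_Played']).clip(lower=0)
df['Games_Missed'] = df['Games_Missed'].clip(0, max_missed)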
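
A worked example of the cap with made-up numbers may help. This sketch assumes the limit should follow each row's own season length (48 games for 2013, 82 otherwise); the `toy` frame is hypothetical.

```python
import pandas as pd

# Made-up rows: an 82-game season, the 48-game 2013 season, and a traded player with 83 games
toy = pd.DataFrame({
    'Season':       [2012, 2013, 2014],
    'Games_Played': [80,   40,   83],
    'Games_Missed': [10,   20,   0],
})
season_games = toy['Season'].map({2013: 48}).fillna(82).astype(int)

# Games missed can be at most the games left over after games played, never negative
max_missed = (season_games - toy['Games_Played']).clip(lower=0)
toy['Games_Missed'] = toy['Games_Missed'].clip(upper=max_missed)
print(toy['Games_Missed'].tolist())  # [2, 8, 0]
```

The third row shows the traded-player case: 83 games played leaves a cap of 0 rather than -1.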

### 8. Convert Counting Stats to Per Game Stats
I normalized some stat values to account for differences in games played. Converted values are Time on Ice, Points, Shots, Penalty Minutes, Major Penalties, Penalties Drawn, Hits, Hits Taken, and Shots Blocked.

In [8]:
stats = ['Time_On_Ice', 'Points', 'Shots', 'Penalty_Minutes', 'Major_Penalties',
         'Penalties_Drawn', 'Hits', 'Hits_Taken', 'Shots_Blocked']
for stat in stats:
    df[stat] = df[stat] / df['Games_Played']
    df[stat] = df[stat].fillna(0)
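
One edge case worth knowing about: dividing by zero games played gives `NaN` only when the stat is also zero. A nonzero stat over zero games gives `inf`, which `fillna(0)` does not touch. Whether such rows exist in the real data cannot be told from this notebook; the sketch below uses made-up values to show the behavior and one possible guard.

```python
import numpy as np
import pandas as pd

# Made-up rows: a normal player, 0 hits in 0 games, and 3 hits recorded with 0 games
toy = pd.DataFrame({'Games_Played': [10, 0, 0], 'Hits': [25, 0, 3]})

per_game = toy['Hits'] / toy['Games_Played']  # 2.5, NaN (0/0), inf (3/0)

# Treat infinities like missing values before filling with 0
safe = per_game.replace([np.inf, -np.inf], np.nan).fillna(0)
print(safe.tolist())  # [2.5, 0.0, 0.0]
```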

### 9. Add Time-Related Features
I added columns for player age by season, games missed last season, and average games missed per season for all previous seasons.

In [9]:
season_starts = df['Season'].apply(lambda x: season[x])
df['Age'] = (season_starts - df['Birth_Date']).dt.days // 365

df['Last_Games_Missed'] = df.groupby(['Name', 'Birth_Date'])['Games_Missed'].shift(1)
df['Last_Games_Missed'] = df['Last_Games_Missed'].fillna(0).astype(int)

df_temp1 = df[['Name', 'Birth_Date', 'Season']]
df_temp2 = df[['Name', 'Birth_Date', 'Season', 'Games_Missed']]
df_temp1 = pd.merge(df_temp1, df_temp2, on=['Name', 'Birth_Date'])
df_temp1 = df_temp1[df_temp1['Season_x'] > df_temp1['Season_y']]
df_temp1 = df_temp1.groupby(['Name', 'Birth_Date', 'Season_x'])[['Games_Missed']].mean()
df_temp1.reset_index(inplace=True)
df_temp1.rename(columns={'Season_x':'Season', 'Games_Missed':'Avg_Games_Missed'},
                inplace=True)
df = df.merge(df_temp1, how='left', on=['Name', 'Birth_Date', 'Season'])
df['Avg_Games_Missed'] = df['Avg_Games_Missed'].fillna(0)
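
The self-merge above averages games missed over all earlier seasons for each player. An alternative sketch of the same idea shifts each player's history by one season and takes an expanding mean. It assumes one row per player and season, sorted by season, and groups by `Name` only to keep the made-up toy data small.

```python
import pandas as pd

toy = pd.DataFrame({
    'Name': ['A', 'A', 'A', 'B', 'B'],
    'Season': [2014, 2015, 2016, 2015, 2016],
    'Games_Missed': [10, 0, 20, 4, 6],
}).sort_values(['Name', 'Season'])

grouped = toy.groupby('Name')['Games_Missed']

# Previous season's games missed; a player's first season has no history, so use 0
toy['Last_Games_Missed'] = grouped.shift(1).fillna(0).astype(int)

# Mean of all earlier seasons: shift first so the current season is excluded
toy['Avg_Games_Missed'] = grouped.transform(
    lambda s: s.shift(1).expanding().mean()
).fillna(0)
print(toy)
```

For player A in 2016 the average is (10 + 0) / 2 = 5, the same value the self-merge would produce for that row.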

### 10. Save Pickle

In [10]:
df_pickle = '../data/merged_data_df.pickle'
var_to_pickle(df, df_pickle)