# Group Name: The Big One

## Group Members: Nicholas Parker, Matthew King, Sean Sturtevant

### Dataset: Predict NHL Player Salaries

1) Ask
----

Can we accurately predict an NHL player's salary for the 2016-2017 season?

2) Acquire
----

The data and its data dictionary are available on Kaggle [here](https://www.kaggle.com/camnugent/predict-nhl-player-salaries#train.csv).

3) Process
----

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

### Direct Feature Engineering

In [2]:
hockey_train = pd.read_csv('./data/clean/train.csv', encoding="ISO-8859-1")
hockey_test = pd.read_csv('./data/clean/test.csv', encoding="ISO-8859-1")
hockey_test_y = pd.read_csv('./data/clean/test_salaries.csv', encoding="ISO-8859-1")

In [3]:
# The Kaggle dataset comes already split into train and test sets.
# We recombine them so that we can make our own train/test split.
def combine_train_and_test(train_df, test_df, test_response):
    test_df = pd.concat([test_df, test_response], axis = 1)
    return pd.concat([train_df, test_df],ignore_index = True, sort = False)

hockey = combine_train_and_test(hockey_train, hockey_test, hockey_test_y)
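Once the data is recombined, a fresh split can be drawn with the `train_test_split` imported earlier. The sketch below is only an illustration on a toy frame: the column names `Salary` and `GP`, the made-up values, the 80/20 ratio, and the `random_state` are all assumptions, not the split this project actually uses.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the combined `hockey` frame; all values are made up.
toy = pd.DataFrame({
    "Salary": [925000, 2250000, 8000000, 3500000, 1750000,
               1500000, 5500000, 650000, 715000, 808000],
    "GP": [10, 82, 80, 75, 60, 41, 79, 5, 12, 30],
})

# Separate the response (assumed to be called 'Salary') from the features.
X = toy.drop(columns="Salary")
y = toy["Salary"]

# Hold out 20% of the rows; fixing random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

Fixing `random_state` keeps the held-out rows the same between runs, which makes model comparisons easier to reproduce.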

In [4]:
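# Derive a numeric 'Age' (as of 2017) from the date variable 'Born'.
# 'Born' starts with a two-digit year YY, so 117 - YY is the age for 19YY births.
hockey['Age'] = 117 - pd.to_numeric(hockey['Born'].str[0:2])
# Players born in 20YY come out 100 years too old; correct them.
hockey.loc[(hockey['Age'] > 99), 'Age'] = hockey['Age'] - 100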

def nationality_group(df, nationalityCol):
    # Feature-engineers the nationality column.
    # Reduces it from 16 unique values to 5 to help prevent overfitting.
    # Note: modifies df in place and also returns it.
    scandinavianNations = ['SWE','NOR','FIN']
    otherNations = ['CHE','CZE','FRA','DEU','SVK','AUT','DNK','LVA','HRV','GBR','SVN']
    df.loc[(df[nationalityCol].isin(scandinavianNations)), nationalityCol] = 'Scandanavian'
    df.loc[(df[nationalityCol].isin(otherNations)), nationalityCol] = 'Other'
    return df

hockey = nationality_group(hockey, 'Nat')
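The `Age` calculation above relies on the hard-coded constant 117 (that is, 2017 minus 1900). A minimal sketch of the same arithmetic with the reference year made explicit is shown below. The helper name `age_from_two_digit_year`, the `reference_year` parameter, and the sample dates are hypothetical, and the sketch assumes `Born` strings begin with a two-digit year.

```python
import pandas as pd

REFERENCE_YEAR = 2017  # assumption: matches the constant 117 used in the notebook


def age_from_two_digit_year(born, reference_year=REFERENCE_YEAR):
    """Approximate age from strings whose first two characters are the birth year."""
    yy = pd.to_numeric(born.str[0:2])
    age = (reference_year - 1900) - yy
    # A result above 99 means the player was born in 20YY, not 19YY.
    return age.where(age <= 99, age - 100)


# Made-up birth dates in the assumed YY-MM-DD layout.
born = pd.Series(["97-01-30", "72-02-15", "00-05-01"])
ages = age_from_two_digit_year(born)
print(ages.tolist())  # [20, 45, 17]
```

Because only the year is used, the result can be off by one for players whose birthday falls later in the reference year.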

In [5]:
# Group provinces and states that appear fewer than 10 times into 'Other'.
# This helps prevent overfitting on rare categories.
prs = hockey['Pr/St'].value_counts()  # missing values are not counted
extraneousStates = list(prs[prs < 10].index)
hockey.loc[(hockey['Pr/St'].isin(extraneousStates)),'Pr/St'] = 'Other'
# Give missing provinces/states their own explicit level.
hockey.loc[(hockey['Pr/St'].isna()),'Pr/St'] = 'UnSpecified'
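The rare-level grouping above could also be written as a small reusable helper, which makes the threshold easy to vary. This is only a sketch: the function name `collapse_rare_levels`, its parameters, and the toy values are hypothetical and not part of the project code.

```python
import pandas as pd


def collapse_rare_levels(values, min_count, other_label="Other",
                         missing_label="UnSpecified"):
    """Replace levels seen fewer than min_count times, then label missing values."""
    counts = values.value_counts()  # missing values are excluded from the counts
    rare = counts[counts < min_count].index
    collapsed = values.where(~values.isin(rare), other_label)
    return collapsed.fillna(missing_label)


# Made-up example: 'MN' appears once, so it is folded into 'Other'.
toy = pd.Series(["ON", "ON", "ON", "QC", "QC", "MN", None])
grouped = collapse_rare_levels(toy, min_count=2)
print(grouped.tolist())
# ['ON', 'ON', 'ON', 'QC', 'QC', 'Other', 'UnSpecified']
```

Unlike the cell above, this version returns a new Series rather than modifying the frame in place.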

In [6]:
# Adding isNa Cols
# These columns are useful to account for missing data
def addIsNACol(df, col_name):
    na_col_name = col_name + '_is_na'
    df[na_col_name] = 0
    df.loc[(df[col_name].isna()), na_col_name] = 1
    return df

hockey = addIsNACol(hockey, 'DftYr')
hockey = addIsNACol(hockey, 'iCF')

### Save Processed Data to be used by Model Pipeline
Further column handling (excluding variables and imputing missing values) is done in the model pipeline.
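As a rough illustration of what excluding variables and imputing inside a pipeline can look like, the sketch below uses scikit-learn's `ColumnTransformer`. It is not the project's actual pipeline: the column choices, the median strategy, the one-hot encoding, and the toy values are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the processed frame; all values are made up.
toy = pd.DataFrame({
    "DftYr": [2010.0, np.nan, 2014.0, 2008.0],
    "iCF": [150.0, 220.0, np.nan, 90.0],
    "Nat": ["CAN", "USA", "Other", "CAN"],
    "Born": ["92-03-01", "95-07-12", "96-11-30", "90-01-05"],
})

numeric_cols = ["DftYr", "iCF"]   # assumed numeric features to impute
categorical_cols = ["Nat"]        # assumed categorical feature to encode

preprocess = ColumnTransformer(
    transformers=[
        # Fill missing numeric values with each column's median.
        ("num", SimpleImputer(strategy="median"), numeric_cols),
        # One-hot encode categories; unseen levels at predict time become all zeros.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    remainder="drop",      # columns not listed (here 'Born') are excluded
    sparse_threshold=0,    # always return a dense array
)

features = preprocess.fit_transform(toy)
print(features.shape)  # (4, 5): two imputed numeric columns + three one-hot columns
```

Keeping imputation inside the transformer means the medians are learned from the training rows only when the pipeline is fitted on a training split.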

In [7]:
# hockey.to_csv('./data/processed/hockey.csv', index= False)

4) Models
----

### Baseline Model

### Model Selection Process

### Final Model Choice

5) Deliver
----

### Evaluation Metrics

### Summary and Takeaways