# Capstone Project 1: Data Wrangling 


This is the first of four notebooks for my capstone: the first covers cleaning the data, the second exploration and quick visualizations, the third the actual model building, and the fourth the application. 

All data was collected from [Natural Stat Trick](https://www.naturalstattrick.com/), an invaluable resource curated by Micah McCurdy, who compiles all kinds of data from the National Hockey League. 

The only preprocessing I did was to edit some of the header rows in the csv files and add a year column, so that the season would be preserved.

I created the following naming conventions for the downloaded files: 

Year: the year the season was completed
PK: penalty kill
PP: power play
S: 5-on-5, or standard, play
Counts: tallies of individual occurrences
Rates: values per 60 minutes of time, the length of a standard game

How to use this information: 'Player Season Totals - Natural Stat Trick-2015SOIRates.csv' holds the Rates recorded during standard play in the 2014-2015 NHL regular season, and 'Player Season Totals - Natural Stat Trick-2017PKindCounts.csv' holds the Counts recorded on the Penalty Kill for each individual player in the season that ended in 2017.
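Here's a small sketch of how a file name following this convention could be split back into its parts. The helper `parse_stats_filename` and its pattern are hypothetical (they aren't used anywhere else in this notebook), and the pattern is an assumption based only on the two example names above. I've labeled the `OI`/`Ind` token as `scope`, which is my own placeholder name.

```python
import re

# Hypothetical pattern, inferred from the two example file names only:
# four-digit year, situation (PK / PP / S), an OI or Ind token, then Rates or Counts.
NAME_PATTERN = re.compile(
    r"(?P<year>\d{4})(?P<situation>PK|PP|S)(?P<scope>OI|Ind)(?P<kind>Rates|Counts)",
    re.IGNORECASE,  # the downloads mix 'Ind' and 'ind'
)

def parse_stats_filename(filename):
    """Return the convention's parts as a dict, or None if the name doesn't fit."""
    match = NAME_PATTERN.search(filename)
    if match is None:
        return None
    parts = match.groupdict()
    parts["year"] = int(parts["year"])
    return parts

example = parse_stats_filename(
    "Player Season Totals - Natural Stat Trick-2017PKindCounts.csv"
)
print(example)
```

Files that don't follow the convention, such as the player bios, simply come back as `None`.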

I'll describe the remainder of my work as I go.

In [2]:
## import working modules
import pandas as pd
import numpy as np
import glob
import os
import unicodedata
from datetime import datetime
from functools import reduce

### Structuring the Import

I ran into a few issues starting out here. One was that the data came split across a number of files because of how I downloaded it; another was that parts of each file name matter for telling the tables apart. I solved this by iterating through all the csvs in the folder, saving the cleaned names into a separate list, and then zipping the names and tables back together. 

In [3]:
## loop through all the csvs, import them into a dictionary with edited names
## first read all the csvs as raw files
## important note here, the output files save na values as "-", so those have to be dealt with. 
## you could do it later, but here I'm handling them on import
## glob once so the tables and their names are guaranteed to line up
files = glob.glob('*.csv')
df = [pd.read_csv(f, na_values='-') for f in files]

## pull in the names and clean them
names = [f.replace("-","").replace("Natural Stat Trick","").replace(".csv","").replace(" ","") for f in files]

## zip together for clean mapping and create dictionary
player_data = list(zip(names, df))
player_data_dict = dict(player_data)
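To make the two moving parts above concrete, here's a minimal sketch on made-up data: the file name is cleaned with the same chain of replacements, and `na_values` turns the "-" placeholder into a real NaN on import. The in-memory csv, the player names, and the helper `clean_name` are all stand-ins I've invented for illustration.

```python
import io
import pandas as pd

def clean_name(filename):
    # Same chain of replacements as the import cell above
    return (filename.replace("-", "")
                    .replace("Natural Stat Trick", "")
                    .replace(".csv", "")
                    .replace(" ", ""))

# Toy stand-in for one downloaded file; the rows are made up
raw_csv = io.StringIO(
    "Player,Year,Team,IPP\n"
    "Player A,2015,AAA,50.0\n"
    "Player B,2015,BBB,-\n"
)
toy = pd.read_csv(raw_csv, na_values="-")

key = clean_name("Player Season Totals - Natural Stat Trick-2015SOIRates.csv")
toy_dict = {key: toy}

print(key)                      # PlayerSeasonTotals2015SOIRates
print(toy["IPP"].isna().sum())  # 1
```

Because the "-" is read as missing rather than as text, the IPP column comes in as a float column instead of a string column.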


### The Sebastian Aho Problem

I set the key as Player-Year-Team to solve what I call the 'Sebastian Aho' problem: there are two Sebastian Ahos playing in the league. Unreliable names are a pretty standard problem in most large data sets, but here we lack another field to generate a unique index or primary key, so this is what we're using. Also, it gives me the excuse to post [the Sebastian Aho trailer](https://www.youtube.com/watch?v=5dZo-866LP4) (which is very clever, go watch it, it's only a minute long). 

In [42]:
## set index for joining
for key, value in player_data_dict.items():
    value.set_index(['Player','Year','Team'], inplace=True)
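A quick sketch of why the name alone isn't enough, using a toy frame. The games-played numbers are illustrative, not pulled from the real files.

```python
import pandas as pd

# Toy rows: two different players share a name in the same season
toy = pd.DataFrame({
    "Player": ["Sebastian Aho", "Sebastian Aho", "Zdeno Chara"],
    "Year": [2018, 2018, 2018],
    "Team": ["CAR", "NYI", "BOS"],
    "GP": [78, 22, 73],  # illustrative values only
})

# Indexing on the name alone leaves duplicate keys...
name_only_unique = toy.set_index("Player").index.is_unique

# ...while the composite key tells the two players apart
composite_unique = toy.set_index(["Player", "Year", "Team"]).index.is_unique

print(name_only_unique, composite_unique)  # False True
```

A duplicated key on either side of a join is exactly what makes merged tables grow, so checking `index.is_unique` before merging is a cheap safeguard.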
    

In [8]:
## cleaning up column headers
## for any headers not handled here, I manually replaced " " with "_" and "%" with "Per" prior to saving
## if not for my meddling with the flat files you wouldn't have to do this, and arguably it's not 
## necessary here, but I prefer it to be more standardized
## regex=False makes both replacements plain literal substitutions
keys = player_data_dict.keys()
for i in keys:
    player_data_dict[i].columns = player_data_dict[i].columns.str.replace(" ", "_", regex=False)
    player_data_dict[i].columns = player_data_dict[i].columns.str.replace("%", "Per", regex=False)

### Joining The Dictionaries

To give each player one record for every season they played, I needed to merge the various tables for each year and then concatenate the years together, which is what the next few cells do. 

Not shown here, I did a couple of checks to ensure that the tables didn't inflate, such as comparing len(player_data_2014) with len(player_data_dict['PlayerSeasonTotals2014SIndCounts']). I found exactly one that did grow, due to a name capitalization issue that I fixed manually. In fact this is the most common problem I've found in hockey data: with players coming from many nationalities there are a lot of interpretations of how to spell a name (Ales Hemsky is frequently Alex Hemski, for example), so joining across platforms gets very tricky. 

Finally, after some deliberation, I decided not to add suffixes to the columns of the first table, the Standard Individual Counts, because otherwise the power play table would stand as the record for things like Position, Games Played, etc. The main problem is that Natural Stat Trick records 0 games played if a player never had a power play shift or the like, which created a number of NA values that shouldn't have been there.

This does cause an accounting issue that I'll get into later but for now it's fine. 

In [9]:

player_data_2014 = pd.merge(player_data_dict['PlayerSeasonTotals2014SIndCounts'],player_data_dict['PlayerSeasonTotals2014PPindCounts'], left_index=True, right_index=True, suffixes=("",'_PPIndC'), how='outer').merge(player_data_dict['PlayerSeasonTotals2014PKIndCounts'],left_index=True, right_index=True, suffixes=("",'_PkIndC'), how='outer').merge(player_data_dict['PlayerSeasonTotals2014PKOIRates'],left_index=True, right_index=True, suffixes=("",'_PkIndR'), how='outer').merge(player_data_dict['PlayerSeasonTotals2014PPOIRates'],left_index=True, right_index=True, suffixes=("",'_PPIndR'), how='outer').merge(player_data_dict['PlayerSeasonTotals2014SOIRates'],left_index=True, right_index=True, suffixes=("",'_SIndR'), how='outer').merge(player_data_dict['Playerbios2014'],left_index=True, right_index=True, suffixes=("",'_Bios'), how='outer')
player_data_2015 = pd.merge(player_data_dict['PlayerSeasonTotals2015SIndCounts'],player_data_dict['PlayerSeasonTotals2015PPIndCounts'], left_index=True, right_index=True, suffixes=("",'_PPIndC'), how='outer').merge(player_data_dict['PlayerSeasonTotals2015PKIndCounts'],left_index=True, right_index=True, suffixes=("",'_PkIndC'), how='outer').merge(player_data_dict['PlayerSeasonTotals2015PKOIRates'],left_index=True, right_index=True, suffixes=("",'_PkIndR'), how='outer').merge(player_data_dict['PlayerSeasonTotals2015PPOIRates'],left_index=True, right_index=True, suffixes=("",'_PPIndR'), how='outer').merge(player_data_dict['PlayerSeasonTotals2015SOIRates'],left_index=True, right_index=True, suffixes=("",'_SIndR'), how='outer').merge(player_data_dict['Playerbios2015'],left_index=True, right_index=True, suffixes=("",'_Bios'), how='outer')
player_data_2016 = pd.merge(player_data_dict['PlayerSeasonTotals2016SIndCounts'],player_data_dict['PlayerSeasonTotals2016PPindCounts'], left_index=True, right_index=True, suffixes=("",'_PPIndC'), how='outer').merge(player_data_dict['PlayerSeasonTotals2016PKindCounts'],left_index=True, right_index=True, suffixes=("",'_PkIndC'), how='outer').merge(player_data_dict['PlayerSeasonTotals2016PKOIRates'],left_index=True, right_index=True, suffixes=("",'_PkIndR'), how='outer').merge(player_data_dict['PlayerSeasonTotals2016PPOIRates'],left_index=True, right_index=True, suffixes=("",'_PPIndR'), how='outer').merge(player_data_dict['PlayerSeasonTotals2016SOIRates'],left_index=True, right_index=True, suffixes=("",'_SIndR'), how='outer').merge(player_data_dict['Playerbios2016'],left_index=True, right_index=True, suffixes=("",'_Bios'), how='outer')
player_data_2017 = pd.merge(player_data_dict['PlayerSeasonTotals2017SIndCounts'],player_data_dict['PlayerSeasonTotals2017PPindCounts'], left_index=True, right_index=True, suffixes=("",'_PPIndC'), how='outer').merge(player_data_dict['PlayerSeasonTotals2017PKindCounts'],left_index=True, right_index=True, suffixes=("",'_PkIndC'), how='outer').merge(player_data_dict['PlayerSeasonTotals2017PKOIRates'],left_index=True, right_index=True, suffixes=("",'_PkIndR'), how='outer').merge(player_data_dict['PlayerSeasonTotals2017PPOIRates'],left_index=True, right_index=True, suffixes=("",'_PPIndR'), how='outer').merge(player_data_dict['PlayerSeasonTotals2017SOIRates'],left_index=True, right_index=True, suffixes=("",'_SIndR'), how='outer').merge(player_data_dict['Playerbios2017'],left_index=True, right_index=True, suffixes=("",'_Bios'), how='outer')
player_data_2018 = pd.merge(player_data_dict['PlayerSeasonTotals2018SIndCounts'],player_data_dict['PlayerSeasonTotals2018PPindCounts'], left_index=True, right_index=True, suffixes=("",'_PPIndC'), how='outer').merge(player_data_dict['PlayerSeasonTotals2018PKindCounts'],left_index=True, right_index=True, suffixes=("",'_PkIndC'), how='outer').merge(player_data_dict['PlayerSeasonTotals2018PKOIRates'],left_index=True, right_index=True, suffixes=("",'_PkIndR'), how='outer').merge(player_data_dict['PlayerSeasonTotals2018PPOIRates'],left_index=True, right_index=True, suffixes=("",'_PPIndR'), how='outer').merge(player_data_dict['PlayerSeasonTotals2018SOIRates'],left_index=True, right_index=True, suffixes=("",'_SIndR'), how='outer').merge(player_data_dict['Playerbios2018'],left_index=True, right_index=True, suffixes=("",'_Bios'), how='outer')

comb_player_data = pd.concat([player_data_2014, player_data_2015, player_data_2016, player_data_2017, player_data_2018])
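For anyone adapting this, the chained merges above could also be written as a fold over a list of (table, suffix) pairs with `functools.reduce`. This is only a sketch on toy tables, not the code I ran: the frames, values, and the helper `merge_one` are made up. The `validate="one_to_one"` argument is an addition that makes pandas raise an error if duplicate keys would inflate the result, which automates the row-count checks described earlier.

```python
from functools import reduce
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [("Player A", 2014, "AAA"), ("Player B", 2014, "BBB")],
    names=["Player", "Year", "Team"],
)
# Toy stand-ins: the base 5-on-5 table plus partial power play / penalty kill tables
base = pd.DataFrame({"GP": [20, 41], "Goals": [1, 2]}, index=idx)
pp = pd.DataFrame({"GP": [15.0], "Goals": [1.0]}, index=idx[:1])
pk = pd.DataFrame({"GP": [30.0], "Goals": [0.0]}, index=idx[1:])

# The base table goes in first and keeps its unsuffixed column names
tables = [(pp, "_PPIndC"), (pk, "_PkIndC")]

def merge_one(left, pair):
    right, suffix = pair
    return left.merge(
        right, left_index=True, right_index=True,
        how="outer", suffixes=("", suffix),
        validate="one_to_one",  # raises if either index has duplicate keys
    )

season = reduce(merge_one, tables, base)
print(season)
```

Player B has no power play row, so the outer join leaves NaN in the `_PPIndC` columns for that player, which is the kind of gap the NA handling has to deal with.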


In [15]:
## visual check here, just looking at the first five rows and the last few columns
print(comb_player_data.iloc[0:5, 251:])

                          Draft_Round  Round_Pick  Overall_Draft_Position
Player         Year Team                                                 
Aaron Ness     2014 NYI           2.0        10.0                    40.0
Aaron Palushaj 2014 CAR           2.0        14.0                    44.0
Aaron Rome     2014 DAL           4.0         8.0                   104.0
Aaron Volpatti 2014 WSH           NaN         NaN                     NaN
Adam Almquist  2014 DET           7.0        29.0                   210.0


### Dealing with NAs

Given the specialty nature of this data set there are few NA values. The notable exception is draft information, as a number of players went undrafted. After consideration, I decided to fill overall draft position with a number higher than any real pick, 1000 (and draft round and pick within the round with 100), which makes undrafted players stand out and carry weight in a linear regression, whereas a fill of, say, 0 would sit right next to 1, which is the ideal draft ranking.

Draft Year I made into 1900, and team is now 'Undrafted'. 

As for the other missing values, instead of writing out the reasoning for each fill here, I'll explain it this way: if the value is a count or a percentage of a count, I entered it as zero. If it's a percentage of a rate, I filled it with the mean. An example of each:

1) IPP is Individual Point Percentage, the percentage of goals scored by the player's team while that player is on the ice on which the player earned a point. For this I filled the NA with 0, because logically if you have no points in this category your count would be zero.

2) OffZoneStartPer is Offensive Zone Start Percentage. This is a rate, and thus simply hasn't been calculated, since the player hasn't been given the opportunity to have such a stat created. In the code below the power play and penalty kill versions (PPOffZoneStartPer and PKOffZoneStartPer) get the mean fill, while the 5-on-5 OffZoneStartPer sits in the zero list.

In essence the logic here is: "if you have an NA due to not getting any points, you are marked zero; if you have an NA due to not being on the ice for that metric, you are returned to the mean."

I also found some pointless columns (power play position, etc.), which I dropped wholesale.

Finally, I did a lot of manual iterations of printing out the list of over 181 columns with missing values and changing each. I've cut the boring iteration of making a list of the columns with NAs and printing it, and kept only the fill code below. 

In [16]:
## cleaning up the draft data
## undrafted players get sentinel values that no real pick could have
draft_fills = {'Draft_Round': 100, 'Draft_Year': 1900, 'Round_Pick': 100,
               'Overall_Draft_Position': 1000, 'Draft_Team': 'Undrafted'}
comb_player_data.fillna(value=draft_fills, inplace=True)

In [17]:
## this is the iteration to list each column that still has NAs
has_na = comb_player_data.isnull().sum() > 0
comb_player_data_cols = comb_player_data.columns[has_na]
comb_player_data_cols_2 = list(comb_player_data_cols)


In [18]:
## these are the count values that I'm filling as zero
fill_zero = ['SHPer_PPIndC','IPP','SHPer','GP_PPIndC','TOI_PPIndC','Goals_PPIndC','Total_Assists_PPIndC','IPP_PPIndC','Shots_PPIndC','GP_PPIndC','TOI_PPIndC','IPP_PPIndC','IPP_PPIndC', 'Shots_PPIndC', 'SHPer_PPIndC', 'iSCF_PPIndC', 'iHDCF_PPIndC', 'Rush_Attempts_PPIndC', 'Rebounds_Created_PPIndC', 'PIM_PPIndC','Total_Penalties_PPIndC', 'Minor_PPIndC', 'Major_PPIndC', 'Misconduct_PPIndC', 'Penalties_Drawn_PPIndC', 'Giveaways_PPIndC', 'Takeaways_PPIndC', 'Hits_PPIndC', 'Hits_Taken_PPIndC', 'Shots_Blocked_PPIndC', 'Faceoffs_Won_PPIndC', 'Faceoffs_Lost_PPIndC','Faceoffs_Per_PPIndC','GP_PkIndC', 'TOI_PkIndC', 'Goals_PkIndC', 'Total_Assists_PkIndC', 'IPP_PkIndC', 'Shots_PkIndC', 'SHPer_PkIndC',    'iSCF_PkIndC', 'iHDCF_PkIndC', 'Rush_Attempts_PkIndC', 'Rebounds_Created_PkIndC', 'PIM_PkIndC', 'Total_Penalties_PkIndC', 'Minor_PkIndC', 'Major_PkIndC', 'Misconduct_PkIndC', 'Penalties_Drawn_PkIndC', 'Giveaways_PkIndC', 'Takeaways_PkIndC', 'Hits_PkIndC', 'Hits_Taken_PkIndC', 'Shots_Blocked_PkIndC', 'Faceoffs_Won_PkIndC', 'Faceoffs_Lost_PkIndC', 'PKTOI', 'PKTOI_GP', 'PKCF_60', 'PKCA_60', 'PKTOI', 'PKTOI_GP', 'PKCF_60', 'PKCA_60','PKFF_60', 'PKFA_60','PKSF_60', 'PKSA_60', 'PKGF_60', 'PKGA_60','PKSCF_60', 'PKSCA_60', 'PKHDCF_60', 'PKHDCA_60',  'PKHDGF_60', 'PKHDGA_60','PKLDCF_60', 'PKLDCA_60',  'PKMDCF_60', 'PKMDCA_60', 'PKMDGF_60', 'PKMDGA_60', 'PKLDGF_60', 'PKLDGA_60',  'PKPDO', 'PKOff.\xa0Zone_Faceoffs_60', 'PKNeu.\xa0Zone_Faceoffs_60', 'PKDef.\xa0Zone_Faceoffs_60', 'PKOnTFStarts_60',     'PKOffZoneFaceoffs_60', 'PKNeuZoneFaceoffs_60', 'PKDefZoneFaceoffs_60',   'PPTOI', 'PPTOI_GP', 'PPCF_60', 'PPCA_60', 'PPFF_60', 'PPFA_60',  'PPSF_60', 'PPSA_60',  'PPGF_60', 'PPGA_60', 'PPSCF_60', 'PPSCA_60','PPHDCF_60', 'PPHDCA_60',  'PPHDGF_60', 'PPHDGA_60',  'PPMDCF_60', 'PPMDCA_60','PPMDGF_60', 'PPMDGA_60',  'PPLDCF_60', 'PPLDCA_60', 'PPLDGF_60', 'PPLDGA_60', 'PPPDO', 'PPOff.\xa0Zone_Faceoffs_60', 'PPNeu.\xa0Zone_Faceoffs_60', 'PPDef.\xa0Zone_Faceoffs_60', 'PPOnTFStarts_60',  
'PPOffZoneFaceoffs_60', 'PPNeuZoneFaceoffs_60', 'PPDefZoneFaceoffs_60',  'CFPer', 'FFPer', 'SFPer', 'GFPer', 'SCFPer', 'HDCFPer', 'HDGFPer', 'MDCFPer', 'MDGFPer', 'LDCFPer', 'LDGFPer', 'On-Ice_SHPer', 'On-Ice_SVPer', 'PDO', 'OffZoneStartPer', 'Off.\xa0Zone_Faceoff_Per']

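for i in fill_zero:
    ## assign back rather than filling a column selection in place, which newer pandas warns about
    comb_player_data[i] = comb_player_data[i].fillna(0)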

In [19]:
## mean values
fill_mean = ['PPLDGFPer', 'PPOn-Ice_SHPer', 'PPOn-Ice_SVPer',  'PPLDCFPer', 'PPMDGFPer','PPMDCFPer', 'PPHDGFPer','PPHDCFPer', 'PPSCFPer','PPGFPer','PPSFPer','PPFFPer','PPCFPer','PKOff.\xa0Zone_Faceoff_Per', 'PKOffZoneStartPer','PKLDGFPer', 'PKOn-Ice_SHPer', 'PKOn-Ice_SVPer','PPOffZoneStartPer','Faceoffs_Per','Faceoffs_Per_PPIndC','Faceoffs_Per_PPIndC','iCF_PPIndC',  'PKMDCFPer', 'PKMDGFPer',  'PKLDCFPer','iFF_PPIndC','iCF_PkIndC', 'iFF_PkIndC','Faceoffs_Per_PkIndC','PKCFPer','PKFFPer','PKSFPer','PKGFPer','PKSCFPer','PKHDCFPer', 'PKHDGFPer','PPOff.\xa0Zone_Faceoff_Per' ]

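for i in fill_mean:
    ## assign back rather than filling a column selection in place, which newer pandas warns about
    comb_player_data[i] = comb_player_data[i].fillna(comb_player_data[i].mean())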
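The two fill rules are easier to see side by side on a tiny made-up frame. This sketch fills whole column lists at once instead of looping. The column names are borrowed from the lists above but the values are invented.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    # a count: no power play time means zero power play goals
    "Goals_PPIndC": [3.0, np.nan, 1.0],
    # a percentage of on-ice events: undefined without ice time
    "PPCFPer": [55.0, np.nan, 45.0],
})
fill_zero = ["Goals_PPIndC"]
fill_mean = ["PPCFPer"]

# Counts become zero
toy[fill_zero] = toy[fill_zero].fillna(0)

# On-ice percentages fall back to the column mean;
# .mean() returns one value per column and fillna matches them up by column name
toy[fill_mean] = toy[fill_mean].fillna(toy[fill_mean].mean())

print(toy)
```

The missing count becomes 0, while the missing percentage becomes 50.0, the mean of the two observed values.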

A note on some of the things I chose to drop here that one might quibble with: 

1) I dropped first and second assists, keeping only the total. Some could argue that there's value in knowing which players create more secondary assists, but I think they're too highly correlated with total assists, as well as with final points, so I took them out. 

In [20]:
## the drop values
comb_player_data.drop(['Position_PPIndC', 'First_Assists_PPIndC', 'Second_Assists_PPIndC', 'Total_Points_PPIndC', 'Position_PkIndC', 'First_Assists_PkIndC', 'Second_Assists_PkIndC', 'Total_Points_PkIndC', 'Position_PkIndR', 'GP_PkIndR', 'Position_PPIndR', 'GP_PPIndR', 'Birth_State/Province','Position_PPIndR', 'GP_PPIndR','Position_PkIndR', 'GP_PkIndR',  'First_Assists_PkIndC', 'Second_Assists_PkIndC', 'Total_Points_PkIndC', 'Position_PkIndC','First_Assists','Second_Assists','Total_Points','Position_PPIndC','First_Assists_PPIndC', 'Second_Assists_PPIndC', 'Total_Points_PPIndC','Position_PPIndC','Birth_State/Province'], axis=1, inplace=True)
## dropping a few more excess columns I found, just to reduce junk variables
comb_player_data.drop(['GP_PPIndC','GP_PkIndC','Position_SIndR', 'GP_SIndR','Position_Bios'], axis=1, inplace=True)


In [21]:
## checking that all na are gone
has_na = comb_player_data.isnull().sum() > 0
comb_player_data_cols = comb_player_data.columns[has_na]
comb_player_data_cols_3 = [col for col in comb_player_data.columns if col in comb_player_data_cols]
print(comb_player_data_cols_3)

[]


As we can see here, there are no more NaN values, so we can move on to creating new features.

### Creating age and fantasy score variables

One of the general theses here is that players have a growth curve as well as a performance drop-off with age. There isn't a reliable age variable in the data, so here we will create one.

Second, we need to create a value for fantasy performance, as that is what we are trying to predict. Note that the scoring is specific to a certain league and isn't static, but other users can easily modify it to match their league's scoring. 

In [23]:
## reset the index so we can use the year column
comb_player_data.reset_index(inplace=True)

## year imported as an int so need to convert that from 2018 into '1-1-2018'
## going to make a mini function here and then iterate row wise
def age_formula(x,y):
    return round((datetime.strptime(str(x), '%Y')-datetime.strptime(str(y), '%m/%d/%y')).days/365)

comb_player_data['Age'] = comb_player_data.apply(lambda row: age_formula(row['Year'],row['Date_of_Birth']), axis=1)
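The same calculation can also be done on whole columns at once rather than row by row. This is a sketch under the assumption that `Date_of_Birth` is stored as month/day/two-digit-year, which is what the format string above implies. The toy frame and the name `jan_first` are mine.

```python
import pandas as pd

toy = pd.DataFrame({
    "Player": ["Zdeno Chara", "Zdeno Chara"],
    "Year": [2014, 2018],
    "Date_of_Birth": ["3/18/77", "3/18/77"],  # assumed input format
})

# '%Y' alone parses to January 1 of that year, same as the row-wise version
jan_first = pd.to_datetime(toy["Year"].astype(str), format="%Y")
birth_date = pd.to_datetime(toy["Date_of_Birth"], format="%m/%d/%y")

# Age in years, rounded to the nearest whole year
toy["Age"] = ((jan_first - birth_date).dt.days / 365).round().astype(int)

print(toy["Age"].tolist())  # [37, 41]
```

One thing to keep in mind with two-digit years: Python's `%y` reads 69 through 99 as 1969 through 1999 and 00 through 68 as 2000 through 2068, so a birth year before 1969 would need a four-digit source column.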

In [24]:
## checking that it worked
comb_player_data['Age'].loc[comb_player_data['Player'] == 'Zdeno Chara']

883     37
1766    38
2664    39
3552    40
4442    41
Name: Age, dtype: int64

Below I'm going to create the fantasy points score in line with the target league, as well as the fantasy points per game. This is slightly annoying in that the power play and penalty kill metrics live in their own columns, so it's a long formula, but it works. 

In [25]:

comb_player_data['Fantasy Points'] = comb_player_data['Goals']*3+comb_player_data['Total_Assists']*2+comb_player_data['Shots']*0.1+comb_player_data['Hits']*0.25+comb_player_data['Shots_Blocked']*0.5+comb_player_data['Faceoffs_Won']*0.05+comb_player_data['Goals_PPIndC']*3.5+comb_player_data['Total_Assists_PPIndC']*2.25+comb_player_data['Shots_PPIndC']*0.1+comb_player_data['Hits_PPIndC']*0.25+comb_player_data['Shots_Blocked_PPIndC']*0.5+comb_player_data['Faceoffs_Won_PPIndC']*0.05+comb_player_data['Goals_PkIndC']*5+comb_player_data['Total_Assists_PkIndC']*2.25+comb_player_data['Shots_PkIndC']*0.1+comb_player_data['Hits_PkIndC']*0.25+comb_player_data['Shots_Blocked_PkIndC']*0.5+comb_player_data['Faceoffs_Won_PkIndC']*0.05

## and average ppg, rounded to two decimals
comb_player_data['Fantasy Points Per Game'] = round(comb_player_data['Fantasy Points'] / comb_player_data['GP'], 2)
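Since the scoring is meant to be swapped out for other leagues, one option is to keep the weights in a table instead of a long formula. This is a sketch, not the code I ran: the weights are copied from the formula above, while the layout (`shared`, `by_suffix`, `fantasy_points`) and the toy row are my own.

```python
import pandas as pd

# Weights copied from the long formula; edit these to match another league
shared = {"Shots": 0.1, "Hits": 0.25, "Shots_Blocked": 0.5, "Faceoffs_Won": 0.05}
by_suffix = {
    "": {"Goals": 3, "Total_Assists": 2, **shared},                # 5-on-5
    "_PPIndC": {"Goals": 3.5, "Total_Assists": 2.25, **shared},    # power play
    "_PkIndC": {"Goals": 5, "Total_Assists": 2.25, **shared},      # penalty kill
}
weights = pd.Series({
    stat + suffix: weight
    for suffix, stats in by_suffix.items()
    for stat, weight in stats.items()
})

def fantasy_points(df):
    # Multiply each stat column by its weight, then add across columns
    return df[weights.index].mul(weights, axis=1).sum(axis=1)

# Toy player: everything zero except a handful of stats
toy = pd.DataFrame(0, index=[0], columns=weights.index)
toy.loc[0, ["Goals", "Total_Assists", "Shots", "Goals_PPIndC", "Goals_PkIndC"]] = [1, 2, 23, 1, 1]

print(round(fantasy_points(toy).iloc[0], 2))  # 17.8
```

The toy total works out as 3 + 4 + 2.3 + 3.5 + 5 = 17.8, which is an easy way to sanity check any change to the weights.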



Here I'm creating a drafted versus undrafted flag, total time on ice, fantasy points per 60 minutes, and a one-hot encoding of position for later use.

In [28]:
##drafted vs undrafted numbers

comb_player_data['drafted']=np.where(comb_player_data['Overall_Draft_Position']==1000, 0, 1)

## total time on ice

comb_player_data['Total Minutes Played'] = comb_player_data['TOI'] + comb_player_data['PPTOI'] + comb_player_data['PKTOI']

comb_player_data['FantasyPoints_60'] = comb_player_data['Fantasy Points']/(comb_player_data['Total Minutes Played']/60)

## create position as just one value instead of multiple, accepting that first listed value is primary
comb_player_data['Cleaned_Position'] = comb_player_data['Position'].str[0]

## let's one hot these for better feature usage since most every model will want a numerical value
one_hot_p = pd.get_dummies(comb_player_data.Cleaned_Position)
new_player_data = pd.concat([comb_player_data, one_hot_p], axis=1)
new_player_data[0:5]

,Player,Year,Team,Position,GP,TOI,Goals,Total_Assists,IPP,Shots,...,Fantasy Points,Fantasy Points Per Game,drafted,Total Minutes Played,FantasyPoints_60,Cleaned_Position,C,D,L,R
0,Aaron Ness,2014,NYI,D,20,275.25,1,2,100.0,23,...,25.8,1.29,1,295.883333,5.231792,D,0,1,0,0
1,Aaron Palushaj,2014,CAR,R,2,17.516667,0,0,0.0,2,...,0.95,0.475,1,18.65,3.0563,R,0,0,0,1
2,Aaron Rome,2014,DAL,D,25,301.2,0,1,10.0,15,...,27.5,1.1,1,327.2,5.042787,D,0,1,0,0
3,Aaron Volpatti,2014,WSH,L,41,300.183333,2,0,50.0,18,...,32.25,0.786585,0,301.866667,6.410115,L,0,0,1,0
4,Adam Almquist,2014,DET,D,2,31.516667,1,0,50.0,2,...,3.45,1.725,1,34.433333,6.011617,D,0,1,0,0


In [34]:
## fancy bit of reordering to put Fantasy Points first in the order

value = new_player_data['Fantasy Points']
new_player_data.drop(labels=['Fantasy Points'], axis=1,inplace = True)
new_player_data.insert(0, 'Fantasy Points', value)
new_player_data[0:5]

Player,Team,Year,Position,Fantasy Points,GP,TOI,IPP,SHPer,iCF,iFF,iSCF,iHDCF,Rush_Attempts,...,Weight_(lbs),Draft_Round,Round_Pick,Overall_Draft_Position,drafted,Total Minutes Played,C,D,L,R
Aaron Ness,NYI,2014,D,25.8,20,275.25,100.0,4.35,48,32,15,0,0,...,184,2.0,10.0,40.0,1,295.883333,0,1,0,0
Aaron Palushaj,CAR,2014,R,0.95,2,17.516667,0.0,0.0,4,4,1,1,0,...,187,2.0,14.0,44.0,1,18.65,0,0,0,1
Aaron Rome,DAL,2014,D,27.5,25,301.2,10.0,0.0,36,22,11,2,0,...,220,4.0,8.0,104.0,1,327.2,0,1,0,0
Aaron Volpatti,WSH,2014,L,32.25,41,300.183333,50.0,11.11,39,27,21,9,1,...,215,100.0,100.0,1000.0,0,301.866667,0,0,1,0
Adam Almquist,DET,2014,D,3.45,2,31.516667,50.0,50.0,3,2,2,1,0,...,174,7.0,29.0,210.0,1,34.433333,0,1,0,0


Finally we'll round off the long decimals and remove the infinite values that throw errors in model building. For the following section significant credit is due to the tutorial from [Stack Abuse, posted here](https://stackabuse.com/random-forest-algorithm-with-python-and-scikit-learn/). Thanks to Usman Malik for his detailed post. 

In [35]:
for i in new_player_data.columns[:]:
    if new_player_data[i].dtype == float:
        new_player_data[i] = round(new_player_data[i],2)
        


I also know from earlier iterations that there may be some remaining NAs here, so I'm going to hunt those down, and also deal with the infinite cells.

In [36]:

new_player_data.notna().all(axis=None)

True

In [39]:
infs = np.where(np.isinf(new_player_data))
infs

(array([ 261, 1162, 1190, 1339, 1365, 1451, 1656, 1736, 1878, 1954, 2189,
        2299, 2741, 3137, 3162, 3799, 3821, 3881, 4305]),
 array([ 3, 43, 43,  3, 43, 43, 43, 43, 43, 43, 23, 43,  3, 43, 43, 43,  3,
        23, 43]))

All of these cells are in the IPP (Individual Point Percentage) columns, which is sort of a record-keeping fluke. I'm going to clean those here and fill them with the mean value. 

In [40]:
## the use_inf_as_na option is deprecated in recent pandas, so convert inf to NaN explicitly
ipp_cols = ['IPP_PkIndC', 'IPP', 'IPP_PPIndC']
new_player_data[ipp_cols] = new_player_data[ipp_cols].replace([np.inf, -np.inf], np.nan)
for col in ipp_cols:
    new_player_data[col] = new_player_data[col].fillna(new_player_data[col].mean())
infs = np.where(np.isinf(new_player_data))
infs

(array([], dtype=int64), array([], dtype=int64))
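If you run a check like this on a frame that still has text columns (player names, teams, and so on), `np.isinf` will generally refuse the non-numeric data, so it helps to restrict the check to numeric columns first. Here is a minimal sketch on made-up values. It reports column names rather than positions, which I find easier to read than the index arrays above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],  # text column, skipped by the check
    "IPP": [50.0, np.inf, 100.0],
    "GP": [20, 2, 41],
})

# Only numeric columns can hold inf, so look there and nowhere else
numeric = toy.select_dtypes(include=[np.number])
inf_counts = np.isinf(numeric).sum()
inf_cols = inf_counts[inf_counts > 0].index.tolist()

# Turn inf into NaN, then fill with the mean of the remaining values
toy[inf_cols] = toy[inf_cols].replace([np.inf, -np.inf], np.nan)
toy[inf_cols] = toy[inf_cols].fillna(toy[inf_cols].mean())

print(inf_cols)             # ['IPP']
print(toy["IPP"].tolist())  # [50.0, 75.0, 100.0]
```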

Now we'll save one file for plotting, then drop the values used in calculating the fantasy points value for model building, and save it again.

In [41]:
new_player_data.to_csv("Combined_Player_Data2.csv")

new_player_data.drop(['Goals','Total_Assists','Shots','Hits',
 'Shots_Blocked',   'Faceoffs_Won', 'Goals_PPIndC',
 'Total_Assists_PPIndC', 'Shots_PPIndC', 'Hits_PPIndC',  'Shots_Blocked_PPIndC',
 'Faceoffs_Won_PPIndC', 'Goals_PkIndC','Total_Assists_PkIndC', 'Shots_PkIndC',
 'Hits_PkIndC', 'Shots_Blocked_PkIndC','Faceoffs_Won_PkIndC', 'Fantasy Points Per Game',
'FantasyPoints_60','Date_of_Birth','Birth_City','Birth_Country','Nationality','Cleaned_Position',
 'Draft_Year','Draft_Team','FantasyPoints_60'], axis=1, inplace=True)     
new_player_data.set_index(['Player','Team','Year', 'Position'],inplace=True)




new_player_data.to_csv("Model_Ready_Data2.csv")

## Done!

And that's a wrap, folks: we have a good file saved to a new csv, and we can begin the next leg, actually building some models.

Thanks for reading! 