# Project: Capstone Project 1: Data Wrangling 

What follows is the code, with descriptions, for the process I used to collect, unify, and clean my data for Springboard Capstone 1. 

All data was collected from [Natural Stat Trick](https://www.naturalstattrick.com/), an invaluable resource curated by Micah McCurdy, who compiles all kinds of data from the National Hockey League. 

The only preprocessing I did was to edit some of the header rows in the csv files and add a Year column so that the season would be preserved.

Naming conventions are as follows: 

Year: refers to the year the season completed
PK: means a penalty kill unit 
PP: means powerplay
S: means 5-on-5 or standard play
Counts: are individual occurrences
Rates: are counts per 60 minutes of play, the length of a standard game

How to read a file name: 'Player Season Totals - Natural Stat Trick-2015SOIRates.csv' holds the Rates recorded during standard (5-on-5) play in the 2014-2015 NHL regular season. 'Player Season Totals - Natural Stat Trick-2017PKindCounts.csv' holds the Counts recorded on the Penalty Kill for each individual player in the season that ended in 2017.
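As a rough illustration of the naming convention, the parts of a file name can be pulled apart with a regular expression. The pattern and the helper name `parse_file_name` are assumptions made for this sketch, not part of the original workflow, and they only cover the 'Player Season Totals' files.

```python
import re

# Hypothetical pattern: <year><PK|PP|S><table><Counts|Rates>.csv at the end of the name.
FILE_PATTERN = re.compile(
    r"(?P<year>\d{4})(?P<situation>PK|PP|S)(?P<table>[A-Za-z]*?)(?P<kind>Counts|Rates)\.csv$"
)

def parse_file_name(file_name):
    """Return the year, situation, table, and kind encoded in a file name, or None."""
    match = FILE_PATTERN.search(file_name)
    if match is None:
        return None
    parts = match.groupdict()
    parts["year"] = int(parts["year"])
    return parts

rates = parse_file_name("Player Season Totals - Natural Stat Trick-2015SOIRates.csv")
counts = parse_file_name("Player Season Totals - Natural Stat Trick-2017PKindCounts.csv")
print(rates)
print(counts)
```

The two example names are the ones quoted in the paragraph above; any other file layout would need its own pattern.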

I'll describe the remainder of my work as I go.

In [1]:
## import working modules
import pandas as pd
import numpy as np
import glob
import os
import unicodedata
from datetime import datetime
from functools import reduce

### Structuring the Import

I ran into a few issues starting out here. First, given how I downloaded them, the data was split across a number of files. Second, the file names carry values that matter (the year, the situation, and the table type), so they had to be kept. I solved this by iterating through all the csvs in the folder, saving the cleaned names into a separate list, and then zipping the names and data frames back together. 

In [2]:
## loop through all the csvs, import them into a dictionary with edited names
## list the files once so the names and the data frames stay in the same order
files = glob.glob('*.csv')

## first read all the csvs as raw files
## important note here, the output files save na values as "-", so those have to be dealt with. 
## you could do it later, but here I'm handling them on import
df = [pd.read_csv(f, na_values='-') for f in files]

## pull in the names and clean them
names = [f.replace("-","").replace("Natural Stat Trick","").replace(".csv","").replace(" ","") for f in files]

## zip together for clean mapping and create dictionary
player_data = list(zip(names, df))
player_data_dict = dict(player_data)
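To make the name cleaning concrete, here is a small sketch that applies the same chain of `replace` calls to one of the file names mentioned earlier. The helper name `clean_name` is an assumption for illustration only.

```python
# Hypothetical helper wrapping the same chain of str.replace calls used in the cell above.
def clean_name(file_name):
    return (file_name.replace("-", "")
                     .replace("Natural Stat Trick", "")
                     .replace(".csv", "")
                     .replace(" ", ""))

example = clean_name("Player Season Totals - Natural Stat Trick-2015SOIRates.csv")
print(example)
```

The result is the kind of key that shows up in the dictionary, such as 'PlayerSeasonTotals2015SOIRates'.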


### The Sebastian Aho Problem

I set the key as Player-Year-Team to solve what I call the 'Sebastian Aho' problem: there are two Sebastian Ahos playing in the league. Unreliable names are a pretty standard problem in most large data sets, but here we lack another item to generate a unique index or primary key, so this is what we're using. Also, it gives me the excuse to post [the Sebastian Aho trailer](https://www.youtube.com/watch?v=5dZo-866LP4) (which is very clever, go watch it, it's only a minute long). 

In [3]:
## set index for joining
for key, value in player_data_dict.items():
    value.set_index(['Player','Year','Team'], inplace=True)
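A minimal sketch of why the composite key helps, using a toy table rather than the real data. The games-played numbers and the third player are made up for illustration.

```python
import pandas as pd

# Toy rows (made-up values) with two players who share a name.
toy = pd.DataFrame({
    "Player": ["Sebastian Aho", "Sebastian Aho", "Example Player"],
    "Year": [2018, 2018, 2018],
    "Team": ["CAR", "NYI", "PHI"],
    "GP": [78, 22, 80],
})

by_name = toy.set_index("Player")
by_key = toy.set_index(["Player", "Year", "Team"])

# The name alone is not a usable key, but the three-part index is.
print(by_name.index.is_unique)
print(by_key.index.is_unique)
```

Checking `index.is_unique` on each frame after `set_index` is a cheap way to confirm that the key really does identify one row per player-season.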
    

In [4]:
## this is generally just a gut check to see what's made it in clean and reference to find 
## the various frames later
player_data_dict.keys()

dict_keys(['Combined_Player_Data', 'PlayerSeasonTotals2014PKIndCounts', 'PlayerSeasonTotals2014PKOIRates', 'PlayerSeasonTotals2014PPindCounts', 'PlayerSeasonTotals2014PPOIRates', 'PlayerSeasonTotals2014SIndCounts', 'PlayerSeasonTotals2014SOIRates', 'PlayerSeasonTotals2015PKIndCounts', 'PlayerSeasonTotals2015PKOIRates', 'PlayerSeasonTotals2015PPIndCounts', 'PlayerSeasonTotals2015PPOIRates', 'PlayerSeasonTotals2015SIndCounts', 'PlayerSeasonTotals2015SOIRates', 'PlayerSeasonTotals2016PKindCounts', 'PlayerSeasonTotals2016PKOIRates', 'PlayerSeasonTotals2016PPindCounts', 'PlayerSeasonTotals2016PPOIRates', 'PlayerSeasonTotals2016SIndCounts', 'PlayerSeasonTotals2016SOIRates', 'PlayerSeasonTotals2017PKindCounts', 'PlayerSeasonTotals2017PKOIRates', 'PlayerSeasonTotals2017PPindCounts', 'PlayerSeasonTotals2017PPOIRates', 'PlayerSeasonTotals2017SIndCounts', 'PlayerSeasonTotals2017SOIRates', 'PlayerSeasonTotals2018PKindCounts', 'PlayerSeasonTotals2018PKOIRates', 'PlayerSeasonTotals2018PPindCounts', 

In [5]:
## cleaning up column headers
## all those not recorded here I manually replaced " " with "_" and "%" with "Per" prior to saving
## if not for my meddling with the flat files you wouldn't have to do this, and arguably it's not 
## necessary here, but I prefer it to be more standardized
for frame in player_data_dict.values():
    ## regex=False treats " " and "%" as literal characters
    frame.columns = frame.columns.str.replace(" ", "_", regex=False)
    frame.columns = frame.columns.str.replace("%", "Per", regex=False)

### Joining The Dictionaries

In order to give each player one record for every season they played, I needed to merge the various data frames for each year, then concatenate the yearly tables together, which is what the next few cells do. 

Not shown here, I did a couple of checks to ensure that the tables didn't inflate, such as comparing len(player_data_2014) against len(player_data_dict['PlayerSeasonTotals2014SIndCounts']). I found exactly one table that did grow, due to a name capitalization issue that I fixed manually. In fact this is the most common problem I've found in hockey data: with players coming from many nationalities, there are a lot of interpretations of how to spell a name (Ales Hemsky is frequently Alex Hemski, for example), so joining across platforms gets very tricky. 
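The length comparison can be sketched on toy data as follows. This is not the check I actually ran: the player names are placeholders, the index is simplified to the name alone, and the `indicator=True` option of `merge` is added here only to show which rows failed to match.

```python
import pandas as pd

left = pd.DataFrame(
    {"Goals": [1, 2, 3]},
    index=pd.Index(["Player A", "Player B", "Player C"], name="Player"),
)
# "player c" differs only by capitalization, so it will not match "Player C".
right = pd.DataFrame(
    {"PP_Goals": [0, 1, 1]},
    index=pd.Index(["Player A", "Player B", "player c"], name="Player"),
)

merged = left.merge(right, left_index=True, right_index=True, how="outer", indicator=True)

# An outer merge that is longer than the base table means some keys did not line up.
inflated = len(merged) > len(left)
unmatched = merged[merged["_merge"] != "both"]
print(inflated)
print(unmatched.index.tolist())
```

Rows flagged as `left_only` or `right_only` are the candidates for a manual spelling or capitalization fix.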

Finally, after some deliberation, I decided not to add suffixes to the columns of the first table, the standard individual counts (SIndCounts). Otherwise the power play table would stand as the table of record for things like Position, Games Played, etc. The main problem is that Natural Stat Trick records 0 games played if a player had not had a power play shift or the like, which created a number of NA values that shouldn't have been there.

This does cause an accounting issue that I'll get into later but for now it's fine. 
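Here is a small sketch of how the empty left-hand suffix behaves, on toy data with placeholder names. In this toy case the second player is simply absent from the power play table, which is an assumption made to keep the example short.

```python
import pandas as pd

standard = pd.DataFrame(
    {"GP": [80, 20]},
    index=pd.Index(["Player A", "Player B"], name="Player"),
)
# Player B has no power play row in this toy table.
power_play = pd.DataFrame(
    {"GP": [75]},
    index=pd.Index(["Player A"], name="Player"),
)

merged = standard.merge(power_play, left_index=True, right_index=True,
                        suffixes=("", "_PPIndC"), how="outer")

# The left table keeps the plain column name; only the right table gets the suffix.
print(merged.columns.tolist())
print(merged)
```

Because the standard table keeps the unsuffixed `GP`, it stays the column of record, and the gap for Player B lands in `GP_PPIndC` instead.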

In [6]:

## the file names mix 'IndCounts' and 'indCounts', so look the keys up case-insensitively
keys_by_lower = {k.lower(): k for k in player_data_dict}

def get_frame(name):
    return player_data_dict[keys_by_lower[name.lower()]]

## (table, suffix for overlapping columns), joined in this order onto the SIndCounts table
join_order = [('PPIndCounts', '_PPIndC'), ('PKIndCounts', '_PkIndC'), ('PKOIRates', '_PkIndR'),
              ('PPOIRates', '_PPIndR'), ('SOIRates', '_SIndR')]

def merge_season(year):
    ## the standard individual counts come first and keep their column names
    merged = get_frame('PlayerSeasonTotals{}SIndCounts'.format(year))
    for table, suffix in join_order:
        merged = merged.merge(get_frame('PlayerSeasonTotals{}{}'.format(year, table)),
                              left_index=True, right_index=True, suffixes=("", suffix), how='outer')
    ## the bios table is joined last
    return merged.merge(get_frame('Playerbios{}'.format(year)),
                        left_index=True, right_index=True, suffixes=("", '_Bios'), how='outer')

player_data_2014 = merge_season(2014)
player_data_2015 = merge_season(2015)
player_data_2016 = merge_season(2016)
player_data_2017 = merge_season(2017)
player_data_2018 = merge_season(2018)

comb_player_data = pd.concat([player_data_2014, player_data_2015, player_data_2016, player_data_2017, player_data_2018])


In [7]:
## visual check here 
print(comb_player_data)

                                    Position  GP          TOI  Goals  \
Player                Year Team                                        
Aaron Ness            2014 NYI             D  20   275.250000      1   
Aaron Palushaj        2014 CAR             R   2    17.516667      0   
Aaron Rome            2014 DAL             D  25   301.200000      0   
Aaron Volpatti        2014 WSH             L  41   300.183333      2   
Adam Almquist         2014 DET             D   2    31.516667      1   
Adam Burish           2014 S.J             R  15   125.383333      0   
Adam Cracknell        2014 STL             R  19   151.050000      0   
Adam Hall             2014 PHI             R  80   544.033333      2   
Adam Henrique         2014 N.J             C  77  1056.600000     15   
Adam Larsson          2014 N.J             D  26   426.583333      1   
Adam McQuaid          2014 BOS             D  30   425.933333      1   
Adam Pardy            2014 WPG             D  60   802.633333   

### Dealing with NAs

Given the specialty nature of this data set, there are few NA values. The notable exception is draft information, as a number of players were undrafted. After consideration, I decided to fill the overall draft position with a number higher than any possible pick, 1000, which makes undrafted players stand out and carry weight in a linear regression. Had I filled with, say, 0, undrafted players would sit right next to 1, which is the ideal draft ranking.

Draft round and pick within the round are filled with 100, Draft Year with 1900, and Draft Team is now 'Undrafted'. 

As for the other missing values, instead of writing out the reasoning for each fill here, I'll explain the rule: if the value is a count or a percentage of a count, I entered it as zero. If it's a percentage of a rate, I filled it with the mean. An example of each:

1) IPP is Individual Point Percentage, the percentage of the goals scored by a player's team while that player is on the ice on which the player earned a point. For this I filled the NA with 0, because logically if you have no points in this category your count would be zero.

2) PPOffZoneStartPer is the power play Offensive Zone Start Percentage. This is a rate, and it simply hasn't been calculated because the player hasn't been given the opportunity to generate such a stat. For this I filled the NA with the column mean.

In essence the logic here is: "if you have an NA due to not getting any points, you are marked zero; if you have an NA due to not being on the ice for that metric, you are returned to the mean."
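The two fill rules can be sketched on a toy frame like this. The values are made up; the column names are taken from the fill lists used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy values (made up) for one zero-filled column and one mean-filled column.
toy = pd.DataFrame({
    "IPP": [50.0, np.nan, 100.0],
    "PPOffZoneStartPer": [60.0, np.nan, 40.0],
})

# Count-based percentage: no points means zero.
toy["IPP"] = toy["IPP"].fillna(0)

# Rate-based percentage: no ice time for the metric means the column mean.
toy["PPOffZoneStartPer"] = toy["PPOffZoneStartPer"].fillna(toy["PPOffZoneStartPer"].mean())

print(toy)
```

Note that `mean()` skips missing values by default, so the fill value is the average of the players who do have the stat.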

The other option was to drop: I found some pointless columns (power play position, etc.), which I dropped wholesale.

Finally, I did a lot of manual iterations of printing the list of 181-plus columns with missing values and changing each. While I've retained the fill code below, I'm deleting the boring iterations of building and printing that list. Instead I've saved the fill segment. 

In [8]:
## quick glance at what has missing values
comb_player_data.isnull().sum()

Position                     0
GP                           0
TOI                          0
Goals                        0
Total_Assists                0
First_Assists                0
Second_Assists               0
Total_Points                 0
IPP                        332
Shots                        0
SHPer                      126
iCF                          0
iFF                          0
iSCF                         0
iHDCF                        0
Rush_Attempts                0
Rebounds_Created             0
PIM                          0
Total_Penalties              0
Minor                        0
Major                        0
Misconduct                   0
Penalties_Drawn              0
Giveaways                    0
Takeaways                    0
Hits                         0
Hits_Taken                   0
Shots_Blocked                0
Faceoffs_Won                 0
Faceoffs_Lost                0
                          ... 
LDCFPer                      1
LDGF_60 

In [9]:
## cleaning up the draft data
## assign the result back rather than filling each column in place, 
## since newer versions of pandas may not propagate a chained in-place fill
comb_player_data = comb_player_data.fillna({
    'Draft_Round': 100,
    'Draft_Year': 1900,
    'Round_Pick': 100,
    'Overall_Draft_Position': 1000,
    'Draft_Team': 'Undrafted',
})

In [10]:
## list every column that still has missing values
has_na = comb_player_data.isnull().sum() > 0
comb_player_data_cols = comb_player_data.columns[has_na]
comb_player_data_cols_2 = list(comb_player_data_cols)
print(comb_player_data_cols_2)

['IPP', 'SHPer', 'Faceoffs_Per', 'Position_PPIndC', 'GP_PPIndC', 'TOI_PPIndC', 'Goals_PPIndC', 'Total_Assists_PPIndC', 'First_Assists_PPIndC', 'Second_Assists_PPIndC', 'Total_Points_PPIndC', 'IPP_PPIndC', 'Shots_PPIndC', 'SHPer_PPIndC', 'iCF_PPIndC', 'iFF_PPIndC', 'iSCF_PPIndC', 'iHDCF_PPIndC', 'Rush_Attempts_PPIndC', 'Rebounds_Created_PPIndC', 'PIM_PPIndC', 'Total_Penalties_PPIndC', 'Minor_PPIndC', 'Major_PPIndC', 'Misconduct_PPIndC', 'Penalties_Drawn_PPIndC', 'Giveaways_PPIndC', 'Takeaways_PPIndC', 'Hits_PPIndC', 'Hits_Taken_PPIndC', 'Shots_Blocked_PPIndC', 'Faceoffs_Won_PPIndC', 'Faceoffs_Lost_PPIndC', 'Faceoffs_Per_PPIndC', 'Position_PkIndC', 'GP_PkIndC', 'TOI_PkIndC', 'Goals_PkIndC', 'Total_Assists_PkIndC', 'First_Assists_PkIndC', 'Second_Assists_PkIndC', 'Total_Points_PkIndC', 'IPP_PkIndC', 'Shots_PkIndC', 'SHPer_PkIndC', 'iCF_PkIndC', 'iFF_PkIndC', 'iSCF_PkIndC', 'iHDCF_PkIndC', 'Rush_Attempts_PkIndC', 'Rebounds_Created_PkIndC', 'PIM_PkIndC', 'Total_Penalties_PkIndC', 'Minor_PkI

In [11]:
## these are the count values that I'm filling as zero
fill_zero = ['SHPer_PPIndC','IPP','SHPer','GP_PPIndC','TOI_PPIndC','Goals_PPIndC','Total_Assists_PPIndC','IPP_PPIndC','Shots_PPIndC','GP_PPIndC','TOI_PPIndC','IPP_PPIndC','IPP_PPIndC', 'Shots_PPIndC', 'SHPer_PPIndC', 'iSCF_PPIndC', 'iHDCF_PPIndC', 'Rush_Attempts_PPIndC', 'Rebounds_Created_PPIndC', 'PIM_PPIndC','Total_Penalties_PPIndC', 'Minor_PPIndC', 'Major_PPIndC', 'Misconduct_PPIndC', 'Penalties_Drawn_PPIndC', 'Giveaways_PPIndC', 'Takeaways_PPIndC', 'Hits_PPIndC', 'Hits_Taken_PPIndC', 'Shots_Blocked_PPIndC', 'Faceoffs_Won_PPIndC', 'Faceoffs_Lost_PPIndC','Faceoffs_Per_PPIndC','GP_PkIndC', 'TOI_PkIndC', 'Goals_PkIndC', 'Total_Assists_PkIndC', 'IPP_PkIndC', 'Shots_PkIndC', 'SHPer_PkIndC',    'iSCF_PkIndC', 'iHDCF_PkIndC', 'Rush_Attempts_PkIndC', 'Rebounds_Created_PkIndC', 'PIM_PkIndC', 'Total_Penalties_PkIndC', 'Minor_PkIndC', 'Major_PkIndC', 'Misconduct_PkIndC', 'Penalties_Drawn_PkIndC', 'Giveaways_PkIndC', 'Takeaways_PkIndC', 'Hits_PkIndC', 'Hits_Taken_PkIndC', 'Shots_Blocked_PkIndC', 'Faceoffs_Won_PkIndC', 'Faceoffs_Lost_PkIndC', 'PKTOI', 'PKTOI_GP', 'PKCF_60', 'PKCA_60', 'PKTOI', 'PKTOI_GP', 'PKCF_60', 'PKCA_60','PKFF_60', 'PKFA_60','PKSF_60', 'PKSA_60', 'PKGF_60', 'PKGA_60','PKSCF_60', 'PKSCA_60', 'PKHDCF_60', 'PKHDCA_60',  'PKHDGF_60', 'PKHDGA_60','PKLDCF_60', 'PKLDCA_60',  'PKMDCF_60', 'PKMDCA_60', 'PKMDGF_60', 'PKMDGA_60', 'PKLDGF_60', 'PKLDGA_60',  'PKPDO', 'PKOff.\xa0Zone_Faceoffs_60', 'PKNeu.\xa0Zone_Faceoffs_60', 'PKDef.\xa0Zone_Faceoffs_60', 'PKOnTFStarts_60',     'PKOffZoneFaceoffs_60', 'PKNeuZoneFaceoffs_60', 'PKDefZoneFaceoffs_60',   'PPTOI', 'PPTOI_GP', 'PPCF_60', 'PPCA_60', 'PPFF_60', 'PPFA_60',  'PPSF_60', 'PPSA_60',  'PPGF_60', 'PPGA_60', 'PPSCF_60', 'PPSCA_60','PPHDCF_60', 'PPHDCA_60',  'PPHDGF_60', 'PPHDGA_60',  'PPMDCF_60', 'PPMDCA_60','PPMDGF_60', 'PPMDGA_60',  'PPLDCF_60', 'PPLDCA_60', 'PPLDGF_60', 'PPLDGA_60', 'PPPDO', 'PPOff.\xa0Zone_Faceoffs_60', 'PPNeu.\xa0Zone_Faceoffs_60', 'PPDef.\xa0Zone_Faceoffs_60', 'PPOnTFStarts_60',  
'PPOffZoneFaceoffs_60', 'PPNeuZoneFaceoffs_60', 'PPDefZoneFaceoffs_60',  'CFPer', 'FFPer', 'SFPer', 'GFPer', 'SCFPer', 'HDCFPer', 'HDGFPer', 'MDCFPer', 'MDGFPer', 'LDCFPer', 'LDGFPer', 'On-Ice_SHPer', 'On-Ice_SVPer', 'PDO', 'OffZoneStartPer', 'Off.\xa0Zone_Faceoff_Per']

## assign back instead of a chained in-place fill
for i in fill_zero:
    comb_player_data[i] = comb_player_data[i].fillna(0)

In [12]:
## mean values
fill_mean = ['PPLDGFPer', 'PPOn-Ice_SHPer', 'PPOn-Ice_SVPer',  'PPLDCFPer', 'PPMDGFPer','PPMDCFPer', 'PPHDGFPer','PPHDCFPer', 'PPSCFPer','PPGFPer','PPSFPer','PPFFPer','PPCFPer','PKOff.\xa0Zone_Faceoff_Per', 'PKOffZoneStartPer','PKLDGFPer', 'PKOn-Ice_SHPer', 'PKOn-Ice_SVPer','PPOffZoneStartPer','Faceoffs_Per','Faceoffs_Per_PPIndC','Faceoffs_Per_PPIndC','iCF_PPIndC',  'PKMDCFPer', 'PKMDGFPer',  'PKLDCFPer','iFF_PPIndC','iCF_PkIndC', 'iFF_PkIndC','Faceoffs_Per_PkIndC','PKCFPer','PKFFPer','PKSFPer','PKGFPer','PKSCFPer','PKHDCFPer', 'PKHDGFPer','PPOff.\xa0Zone_Faceoff_Per' ]

## assign back instead of a chained in-place fill
for i in fill_mean:
    comb_player_data[i] = comb_player_data[i].fillna(comb_player_data[i].mean())

A note on some of the things I chose to drop here that one might quibble with: 

1) I dropped first and second assists, keeping only the total. Some could argue that there's value in knowing which players are creating more secondary assists, but I think they're too highly correlated with total assists as well as with final points, so I took them out. 

In [13]:
## the drop values (repeated names removed from the list)
comb_player_data.drop(['Position_PPIndC', 'First_Assists_PPIndC', 'Second_Assists_PPIndC', 'Total_Points_PPIndC',
                       'Position_PkIndC', 'First_Assists_PkIndC', 'Second_Assists_PkIndC', 'Total_Points_PkIndC',
                       'Position_PkIndR', 'GP_PkIndR', 'Position_PPIndR', 'GP_PPIndR',
                       'First_Assists', 'Second_Assists', 'Total_Points',
                       'Birth_State/Province'], axis=1, inplace=True)
## dropping a few more excess columns I found, just to reduce junk variables
comb_player_data.drop(['GP_PPIndC','GP_PkIndC','Position_SIndR', 'GP_SIndR','Position_Bios'], axis=1, inplace=True)


In [14]:
## checking that all na are gone
has_na = comb_player_data.isnull().sum() > 0
comb_player_data_cols = comb_player_data.columns[has_na]
comb_player_data_cols_3 = list(comb_player_data_cols)
print(comb_player_data_cols_3)

[]


### Creating age and fantasy score variables

One of the working hypotheses here is that players follow a growth curve, with a performance drop-off as they age. There isn't a reliable age variable in the data, so here we will create one.

Second, we need to create a value for fantasy performance, as that is what we are trying to predict. Note that the scoring is specific to a certain league and isn't universal, but it can be easily modified by other users to match their league's scoring. 

In [15]:
## reset the index so we can use the year column
comb_player_data.reset_index(inplace=True)

## Year imported as an int, so parsing it with '%Y' turns 2018 into January 1, 2018
## going to make a mini function here and then apply it row-wise
## note: dividing by 365 and rounding gives the nearest whole year, not an exact age
def age_formula(x,y):
    return round((datetime.strptime(str(x), '%Y')-datetime.strptime(str(y), '%m/%d/%y')).days/365)

comb_player_data['Age'] = comb_player_data.apply(lambda row: age_formula(row['Year'],row['Date_of_Birth']), axis=1)
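As a quick standalone check of the helper, the sketch below runs it on one made-up input and also shows how Python resolves two-digit years with `%y`. The argument names and the sample date are assumptions for illustration.

```python
from datetime import datetime

def age_formula(season_year, date_of_birth):
    season_start = datetime.strptime(str(season_year), "%Y")   # January 1 of that year
    born = datetime.strptime(str(date_of_birth), "%m/%d/%y")   # two-digit birth year
    return round((season_start - born).days / 365)

example_age = age_formula(2014, "3/18/77")
print(example_age)

# With %y, Python maps 69-99 to 1969-1999 and 00-68 to 2000-2068.
print(datetime.strptime("1/1/69", "%m/%d/%y").year)
print(datetime.strptime("1/1/68", "%m/%d/%y").year)
```

The two-digit-year rule only matters for birth years of 1968 or earlier, which would be read as 2068 or earlier and produce a negative age.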

In [16]:
## checking that it worked
comb_player_data['Age'].loc[comb_player_data['Player'] == 'Zdeno Chara']

883     37
1766    38
2664    39
3552    40
4442    41
Name: Age, dtype: int64

In [None]:
## creating fantasy value, which is essentially a weighted sum across the essential columns
## the annoying accounting item here is that power play and penalty kill stats aren't counted 
## in the standard table, so each needs its own set of multipliers
fantasy_weights = {
    'Goals': 3, 'Total_Assists': 2, 'Shots': 0.1,
    'Hits': 0.25, 'Shots_Blocked': 0.5, 'Faceoffs_Won': 0.05,
    'Goals_PPIndC': 3.5, 'Total_Assists_PPIndC': 2.25, 'Shots_PPIndC': 0.1,
    'Hits_PPIndC': 0.25, 'Shots_Blocked_PPIndC': 0.5, 'Faceoffs_Won_PPIndC': 0.05,
    'Goals_PkIndC': 5, 'Total_Assists_PkIndC': 2.25, 'Shots_PkIndC': 0.1,
    'Hits_PkIndC': 0.25, 'Shots_Blocked_PkIndC': 0.5, 'Faceoffs_Won_PkIndC': 0.05,
}

comb_player_data['Fantasy Points'] = sum(comb_player_data[col] * weight for col, weight in fantasy_weights.items())

## and average fantasy points per game

comb_player_data['Fantasy Points Per Game'] = comb_player_data['Fantasy Points'] / comb_player_data['GP']


In [None]:
comb_player_data['Fantasy Points Per Game']

## Done!

And that's a wrap, folks. We have a good file, which we'll save here to a new csv before beginning the next leg: actually building some models.

In [None]:
comb_player_data.to_csv("Combined_Player_Data.csv")

Thanks for reading! 