Let's build a prospect model.

Before we start, here's some related research:
<ul><li><a href="http://canucksarmy.com/2015/5/26/draft-analytics-unveiling-the-prospect-cohort-success-model">Introducing PCS</a></li>
<li><a href="http://prospect-stats.com/blog/Introducing_DEV.html">Draft expected value</a></li></ul>

In [125]:
import GetPbP
EP_FOLDER = GetPbP.get_additional_data_folder() + 'EP/' #data from eliteprospects api, which I got permission to use
from os import listdir
import pandas as pd
import json
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
player_ref_fname = GetPbP.get_additional_data_folder() + 'EP player reference.txt'
player_stats_fname = GetPbP.get_additional_data_folder() + 'CHLF stats.txt'
player_inter_data = GetPbP.get_additional_data_folder() + 'mod1 data.csv'

"Paralysis by analysis" is an easy trap to fall into here. 
<p>First, we need to decide on the basic approach. We can use a logistic regression to estimate the probability of "making" the NHL, or another type of regression to predict "performance." We could also use, as in PCS, something more like a k-nearest-neighbors approach: find the most similar players historically and see how they fared.
<p>Next, we have to decide on a dependent variable. Games played? Points? Points per game? Point shares from Hockey Reference? GVT? Or everything at once, after developing a similarity score?
<p>On a more basic level, should it be continuous or binary? (Recall that Josh W and Moneypuck's PCS uses a 200-GP threshold. Binary dependent variables lend themselves easily to classifiers and logit.) Should it differ by position? Should we include performance in lower-level leagues? (For example, an AHL all-star is better than an AHL grinder, and we may not want to simply equate them in the model.)
<p>Next, we have to decide which variables to include, and in particular whether we want to adjust for eras.
<p>We also need to consider change over time in the different leagues. For example, how do we handle NHL expansion? The changing strength of different leagues around the world? Do we set time limits? Do we model league strength over time and use this to adjust each league's numbers? What about player tendencies? (For example, a Soviet player of relative talent level X might be less likely to defect to the NHL than a modern Russian player of relative talent level X is to come over. Or maybe it's the other way around. The point is that there's a cost associated with moving overseas, and the opportunity cost of passing on the NHL can change with time.)
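To make the k-nearest-neighbors idea a little more concrete, here is a minimal sketch of a cohort-style estimate. Everything in it is an assumption for illustration: the function name `cohort_success_rate`, the three features, and the made-up rows are mine, and this is not how PCS itself is implemented. It only borrows the 200-GP threshold mentioned above.

```python
import numpy as np

def cohort_success_rate(target, history, nhl_gp, k=3, threshold=200):
    """Share of the k most similar historical players who reached `threshold` NHL GP."""
    history = np.asarray(history, dtype=float)
    nhl_gp = np.asarray(nhl_gp)
    # Standardize each feature so no single one dominates the distance
    mu = history.mean(axis=0)
    sd = history.std(axis=0)
    sd[sd == 0] = 1.0
    z_hist = (history - mu) / sd
    z_target = (np.asarray(target, dtype=float) - mu) / sd
    # Euclidean distance from the target to every historical player
    dist = np.sqrt(((z_hist - z_target) ** 2).sum(axis=1))
    nearest = np.argsort(dist)[:k]
    return float((nhl_gp[nearest] >= threshold).mean())

# Made-up rows for illustration: (age, points per game, height in cm)
history = [[17.5, 1.40, 185], [17.6, 1.35, 183], [17.4, 0.40, 180],
           [18.5, 0.50, 190], [17.5, 1.30, 186], [18.4, 0.45, 178]]
nhl_gp = [650, 310, 0, 12, 150, 0]
rate = cohort_success_rate([17.5, 1.38, 184], history, nhl_gp, k=3)
print(rate)
```

The appeal of this kind of approach is that it sidesteps choosing a functional form; the cost is that the answer depends heavily on which features go into the distance and how they are scaled.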

The best strategy might be to start really simply and go from there. Let's start with forwards coming out of Canadian major junior and try to predict their NHL GP.

### Read through all player files, generate file with their name/dob/player page/leagues
This will take a looooong time to run.

In [38]:
#generate list of all files
layer1 = set(EP_FOLDER + x + '/' for x in listdir(EP_FOLDER) if not x == '.DS_Store')
layer2 = set()
layer3 = set()
for parent in layer1:
    for child in set(listdir(parent)):
        if not child == '.DS_Store':
            layer2.add(parent + child + '/')
for parent in layer2:
    for child in set(listdir(parent)):
        if len(child) >= 5 and child[-5:] == '.json':
            layer3.add(parent + child)
len(layer3)

457006
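As an aside, the same file list could probably be built with `os.walk`, which handles the nesting for us. This is just a sketch with a hypothetical helper name (`find_json_files`); note that, unlike the cell above, it picks up `.json` files at any depth rather than exactly two folders down.

```python
import os

def find_json_files(root):
    """Return the set of paths to every .json file anywhere under `root`."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for fname in filenames:
            # Non-JSON files such as .DS_Store are skipped by the suffix check
            if fname.endswith('.json'):
                found.add(os.path.join(dirpath, fname))
    return found

# Possible usage with the folder defined earlier:
# layer3 = find_json_files(EP_FOLDER)
```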

In [69]:
limit = len(layer3) #set this to a small number for a quick test run
firstnames = []
lastnames = []
dobs = []
filenames = []
leagues = []
positions = []
for i, file in enumerate(layer3):
    if i >= limit:
        break
    firstnames.append('')
    lastnames.append('')
    dobs.append('')
    filenames.append(file)
    leagues.append(set())
    positions.append(set())
    with open(file, 'r') as r:
        data = json.load(r)
    if not 'data' in data:
        pass
    else:
        data = data['data']
        if len(data) > 0:
            try:
                playerdata = data[0]['player']
                firstnames[-1] = playerdata['firstName']
                lastnames[-1] = playerdata['lastName']
                dobs[-1] = playerdata['dateOfBirth']
                for dct in data:
                    if 'playerPosition' in dct['player']:
                        positions[-1].add(dct['player']['playerPosition'])
                    if 'fullName' in dct['league']:
                        leagues[-1].add(dct['league']['fullName'])
                    else:
                        leagues[-1].add(dct['league']['name'])
            except KeyError:
                pass #I need complete data for this to work!
    if i % 10000 == 0:
        print('Done through {0:d} ({1:.0f}%)'.format(i, i/len(layer3)*100))
df = pd.DataFrame.from_dict({'first': firstnames, 'last': lastnames, 'dob': dobs, 
                            'league': leagues, 'pos': positions, 'file': filenames})
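#use bracket indexing: df.first and df.last can resolve to DataFrame methods instead of our columns
#also, league holds sets, so check for emptiness by length rather than comparing to the string '{}'
df = df[~((df['dob'] == '') | (df['first'] == '') | (df['last'] == '') | (df['league'].apply(len) == 0))]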
df.reset_index(inplace=True)
#df.head()

Done through 0 (0%)
Done through 10000 (2%)
Done through 20000 (4%)
Done through 30000 (7%)
Done through 40000 (9%)
Done through 50000 (11%)
Done through 60000 (13%)
Done through 70000 (15%)
Done through 80000 (18%)
Done through 90000 (20%)
Done through 100000 (22%)
Done through 110000 (24%)
Done through 120000 (26%)
Done through 130000 (28%)
Done through 140000 (31%)
Done through 150000 (33%)
Done through 160000 (35%)
Done through 170000 (37%)
Done through 180000 (39%)
Done through 190000 (42%)
Done through 200000 (44%)
Done through 210000 (46%)
Done through 220000 (48%)
Done through 230000 (50%)
Done through 240000 (53%)
Done through 250000 (55%)
Done through 260000 (57%)
Done through 270000 (59%)
Done through 280000 (61%)
Done through 290000 (63%)
Done through 300000 (66%)
Done through 310000 (68%)
Done through 320000 (70%)
Done through 330000 (72%)
Done through 340000 (74%)
Done through 350000 (77%)
Done through 360000 (79%)
Done through 370000 (81%)
Done through 380000 (83%)
Done 

In [70]:
#df.to_csv(player_ref_fname, sep='\t')
print('Written to file')

Written to file


### Read in player reference data from file

In [2]:
df = pd.read_csv(player_ref_fname, sep='\t')
df['league'] = df['league'].apply(lambda x: [lg.strip()[1:-1] for lg in x[1:-1].split(',')])
df['pos'] = df['pos'].apply(lambda x: [pos.strip()[1:-1] for pos in x[1:-1].split(',')])
df.drop(['Unnamed: 0', 'index'], axis=1, inplace=True)
df.head()

Unnamed: 0,dob,file,first,last,league,pos
0,1993-05-13,/Users/muneebalam/Desktop/NHLPlaybyPlay/Parsed...,Mitch,Van Teeling,"[RBC Cup, World Under-17 Hockey Challenge, Bri...",[LEFT_WING]
1,1994-03-14,/Users/muneebalam/Desktop/NHLPlaybyPlay/Parsed...,Michael,DeBuccia,"[Empire Junior Hockey League, American Collegi...",[RIGHT_WING]
2,1995-06-28,/Users/muneebalam/Desktop/NHLPlaybyPlay/Parsed...,Spencer,Naas,"[United States High School, NCAA D1]",[FORWARD]
3,1999-08-29,/Users/muneebalam/Desktop/NHLPlaybyPlay/Parsed...,Jannis,Alke,"[Germany U16 4, Germany U19 4]",[FORWARD]
4,1977-05-17,/Users/muneebalam/Desktop/NHLPlaybyPlay/Parsed...,Luc,Dostaler,"[QuÃ©bec Major Junior Hockey League, Ligue de ...",[CENTRE]
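The string-splitting above works, but it would break on any league name that contains a comma. A possibly sturdier option is `ast.literal_eval`, sketched below. I'm assuming here that the column was written out as the `str()` of a Python set (e.g. `{'RBC Cup', 'NCAA D1'}`); `parse_set_column` is a hypothetical helper name.

```python
import ast

def parse_set_column(text):
    """Parse a stringified Python set such as "{'NCAA D1', 'RBC Cup'}" into a sorted list."""
    text = text.strip()
    if text == 'set()':  # str() of an empty set is not a set literal
        return []
    return sorted(ast.literal_eval(text))

print(parse_set_column("{'United States High School', 'NCAA D1'}"))
# Possible usage: df['league'] = df['league'].apply(parse_set_column)
```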


### Filter forwards who played in major junior

In [4]:
maj_jun_lgs = ['Western Hockey League', 'Ontario Hockey League', 
               'QuÃ©bec Major Junior Hockey League']

df['WHL'] = df['league'].apply(lambda x: maj_jun_lgs[0] in x)
df['OHL'] = df['league'].apply(lambda x: maj_jun_lgs[1] in x)
df['QMJHL'] = df['league'].apply(lambda x: maj_jun_lgs[2] in x)
df['LW'] = df['pos'].apply(lambda x: 'LEFT_WING' in x)
df['RW'] = df['pos'].apply(lambda x: 'RIGHT_WING' in x)
df['C'] = df['pos'].apply(lambda x: 'CENTRE' in x)
df['F'] = df['LW'] | df['RW'] | df['C']
df['CHL'] = df['WHL'] | df['OHL'] | df['QMJHL']
df['CHL_F'] = df['F'] & df['CHL']
#df[df['CHL_F']].head()

### Read in stats for these players

In [5]:
allleagues = {x for x in maj_jun_lgs}
allleagues.add('National Hockey League')
allleagues.add('American Hockey League')

to_read = [x for x in df[df['CHL_F']]['file']]
firstnames = []
lastnames = []
dobs = []
leagues = []
leagueyears = []
goals = []
assists = []
heights = []
weights = []
gp = []
teams = []
gtypes = []
positions = []
shoots = []
for i, file in enumerate(to_read):
    with open(file, 'r') as r:
        data = json.load(r)
    if 'data' not in data:
        pass
    else:
        for dct in data['data']:
            if 'league' in dct and 'fullName' in dct['league'] and \
                dct['league']['fullName'] in allleagues:
                #missing keys fall back to '' (text fields) or 0 (numeric fields)
                player = dct.get('player', {})
                firstnames.append(player.get('firstName', ''))
                lastnames.append(player.get('lastName', ''))
                dobs.append(player.get('dateOfBirth', ''))
                leagues.append(dct['league']['fullName']) #guaranteed to exist by the check above
                leagueyears.append(dct.get('season', {}).get('name', ''))
                goals.append(dct.get('G', 0))
                assists.append(dct.get('A', 0))
                heights.append(player.get('height', 0))
                weights.append(player.get('weight', 0))
                gp.append(dct.get('GP', 0))
                teams.append(dct.get('team', {}).get('name', ''))
                gtypes.append(dct.get('gameType', ''))
                positions.append(player.get('playerPosition', ''))
                shoots.append(player.get('shoots', ''))
    if i % 1000 == 0:
        print('Done through {0:d}/{1:d}'.format(i, len(to_read)))

df2 = pd.DataFrame.from_dict({'first': firstnames, 'last': lastnames, 'dob': dobs,
                             'league': leagues, 'G': goals, 'A': assists, 
                              'season': leagueyears, 'height': heights, 'weight': weights,
                             'GP': gp, 'team': teams, 'pos': positions, 'shoots': shoots,
                             'type': gtypes})
#df2.head()

Done through 0/9683
Done through 1000/9683
Done through 2000/9683
Done through 3000/9683
Done through 4000/9683
Done through 5000/9683
Done through 6000/9683
Done through 7000/9683
Done through 8000/9683
Done through 9000/9683


### Write to file

In [6]:
df2.to_csv(player_stats_fname, sep='\t')

### Read in player season stats from file

In [127]:
df2 = pd.read_csv(player_stats_fname, sep='\t')
df2.drop('Unnamed: 0', axis=1, inplace=True)
colorder = ['first', 'last', 'pos', 'shoots', 'season', 'type', 'team',
           'GP', 'G', 'A', 'dob', 'height', 'weight', 'league']
df2 = df2[colorder]
df2.head()

Unnamed: 0,first,last,pos,shoots,season,type,team,GP,G,A,dob,height,weight,league
0,Luc,Dostaler,CENTRE,LEFT,1994-1995,REGULAR_SEASON,Saint-Hyacinthe Laser,40,3,9,1977-05-17,183,84,QuÃ©bec Major Junior Hockey League
1,Luc,Dostaler,CENTRE,LEFT,1995-1996,REGULAR_SEASON,Laval Titan,32,1,2,1977-05-17,183,84,QuÃ©bec Major Junior Hockey League
2,Dawson,Leedahl,LEFT_WING,LEFT,2012-2013,PLAYOFFS,Everett Silvertips,6,2,1,1996-03-16,185,91,Western Hockey League
3,Dawson,Leedahl,LEFT_WING,LEFT,2012-2013,REGULAR_SEASON,Everett Silvertips,56,3,6,1996-03-16,185,91,Western Hockey League
4,Dawson,Leedahl,LEFT_WING,LEFT,2013-2014,REGULAR_SEASON,Everett Silvertips,70,8,24,1996-03-16,185,91,Western Hockey League


### Group by name, season, league

In [128]:
grouping = df2.groupby(['first', 'last', 'season', 'league', 'pos', 'dob'], as_index=False)
grouped = grouping.agg({'GP': np.sum, 'G': np.sum, 'A': np.sum,
                       'height': np.mean, 'weight': np.mean})
#grouped.head()

### Add in player age on Jan 1 of season year

In [129]:
grouped['dob2'] = pd.to_datetime(grouped['dob'], infer_datetime_format=True)
grouped['season_start'] = grouped['season'].apply(lambda x: pd.Timestamp(x[:4] + '-01-01'))
grouped['age_days'] = grouped['season_start'] - grouped['dob2']

grouped['age_y'] = grouped['age_days'].dt.days / 365 #vectorized: days to (approximate) years
grouped = grouped[~(grouped['league'] == 'American Hockey League')]
#grouped.head()

### Let's combine first and last name, drop the position and dob and age_days
Also filter out really old players. Let's say anyone born before 1970, arbitrarily.

In [130]:
start_date = '1970-01-01'
grouped['Name'] = grouped['first'] + ' ' + grouped['last']
grouped.drop(['pos', 'dob', 'age_days', 'first', 'last'], axis=1, inplace=True)
grouped = grouped[grouped['dob2'] >= start_date]
grouped.head()

Unnamed: 0,season,league,weight,GP,A,height,G,dob2,season_start,age_y,Name
0,2015-2016,QuÃ©bec Major Junior Hockey League,93,53,21,191,28,1996-12-14,2015-01-01,18.060274,A.J. Greer
1,2006-2007,Ontario Hockey League,93,88,15,188,9,1990-06-27,2006-01-01,15.526027,A.J. Jenks
2,2007-2008,Ontario Hockey League,93,72,29,188,27,1990-06-27,2007-01-01,16.526027,A.J. Jenks
3,2008-2009,Ontario Hockey League,93,72,33,188,22,1990-06-27,2008-01-01,17.526027,A.J. Jenks
4,2009-2010,Ontario Hockey League,93,61,48,188,27,1990-06-27,2009-01-01,18.528767,A.J. Jenks


### Now add a column for NHL GP
First, create a new df with this data, and sum over seasons, then join with the old df.

In [131]:
df_nhl = grouped[grouped['league'] == 'National Hockey League']
df_nhl = df_nhl.groupby(['Name', 'dob2'], as_index=False)[['GP', 'G', 'A']].sum() #sum only the counting stats
df_nhl = df_nhl[['Name', 'dob2', 'GP', 'G', 'A']]
df_nhl.rename(columns={'GP': 'NHLGP', 'G': 'NHLG', 'A': 'NHLA'}, inplace=True)
df_nhl.head()

Unnamed: 0,Name,dob2,NHLGP,NHLG,NHLA
0,Aaron Downey,1974-09-27,248,8,10
1,Aaron Gavey,1974-02-22,379,42,52
2,Adam Berti,1986-07-01,2,0,0
3,Adam Deadmarsh,1975-05-10,672,210,229
4,Adam Henrique,1990-02-06,358,96,112


In [132]:
grouped2 = grouped.merge(df_nhl, on=['Name', 'dob2'], how='left')
grouped2.fillna(0, inplace=True)
grouped2.head()

Unnamed: 0,season,league,weight,GP,A,height,G,dob2,season_start,age_y,Name,NHLGP,NHLG,NHLA
0,2015-2016,QuÃ©bec Major Junior Hockey League,93,53,21,191,28,1996-12-14,2015-01-01,18.060274,A.J. Greer,0,0,0
1,2006-2007,Ontario Hockey League,93,88,15,188,9,1990-06-27,2006-01-01,15.526027,A.J. Jenks,0,0,0
2,2007-2008,Ontario Hockey League,93,72,29,188,27,1990-06-27,2007-01-01,16.526027,A.J. Jenks,0,0,0
3,2008-2009,Ontario Hockey League,93,72,33,188,22,1990-06-27,2008-01-01,17.526027,A.J. Jenks,0,0,0
4,2009-2010,Ontario Hockey League,93,61,48,188,27,1990-06-27,2009-01-01,18.528767,A.J. Jenks,0,0,0
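One thing to watch with the cell above: `fillna(0)` on the whole frame also turns any missing height or weight into 0. A small sketch of restricting the fill to the NHL columns follows; the two toy frames and player names are made up for illustration.

```python
import pandas as pd

seasons = pd.DataFrame({
    'Name': ['Player A', 'Player B'],
    'height': [188.0, None],  # Player B's height is genuinely unknown
    'GP': [61, 53],
})
nhl_totals = pd.DataFrame({'Name': ['Player A'], 'NHLGP': [358]})

merged = seasons.merge(nhl_totals, on='Name', how='left')
# Only players with no NHL record should get zeros; other gaps stay as NaN
nhl_cols = ['NHLGP']
merged[nhl_cols] = merged[nhl_cols].fillna(0)
print(merged)
```

In the real data the list would presumably be `['NHLGP', 'NHLG', 'NHLA']`.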


In [133]:
import statsmodels.formula.api as smf

In [134]:
mod = smf.ols(formula='NHLGP ~ age_y:GP + age_y:G + age_y:A + height + weight', data=grouped2)
ols = mod.fit().get_robustcov_results()
print(ols.summary())

                            OLS Regression Results                            
Dep. Variable:                  NHLGP   R-squared:                       0.218
Model:                            OLS   Adj. R-squared:                  0.218
Method:                 Least Squares   F-statistic:                     750.6
Date:                Sat, 17 Sep 2016   Prob (F-statistic):               0.00
Time:                        12:44:18   Log-Likelihood:            -1.6894e+05
No. Observations:               24109   AIC:                         3.379e+05
Df Residuals:                   24103   BIC:                         3.379e+05
Df Model:                           5                                         
Covariance Type:                  HC1                                         
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept   -129.0840     12.457    -10.363      0.0

This isn't a very good model (R-squared is only 0.218). One thing we can try is limiting age to a smaller range; right now, very old players are going to have a disproportionate impact on our results. We also need to consider that worse players tend to have more CHL seasons, so they might have four data points in the data set while better players have only one or two, which skews the results downward. We can try to resolve this by aggregating by player, with different columns for different ages. This binning approach is imperfect and might be slightly biased, but oh well. We will also need to fill in missing seasons for players, which we can try to do by applying an aging curve.

<p>To keep things simple let's start with 15yo, 16yo, 17yo, 18yo, and 19yo bins (~2/3 of the data).
<p>Let's also normalize everything.
<p>Later on, we can also aggregate player scoring by season and league to calculate goals per game, and use that to make an era adjustment.
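As a rough preview of the one-row-per-player idea, here is a sketch of the binning and normalizing on toy data. The column naming (`G_16`, `G_17`, ...), the floor-based binning, and the z-score normalization are my assumptions for illustration, not necessarily what part 2 will do.

```python
import numpy as np
import pandas as pd

# Toy player-seasons, made up for illustration
seasons = pd.DataFrame({
    'Name':  ['Player A', 'Player A', 'Player A', 'Player B', 'Player B'],
    'age_y': [16.5, 17.5, 18.5, 17.2, 18.2],
    'G':     [9, 27, 22, 30, 41],
})

# Bin by whole years of age and keep only the 15-19 bins
seasons['age_bin'] = np.floor(seasons['age_y']).astype(int)
seasons = seasons[seasons['age_bin'].between(15, 19)]

# One row per player, one column per age bin; NaN marks a missing season
wide = seasons.pivot_table(index='Name', columns='age_bin', values='G', aggfunc='sum')
wide.columns = ['G_{0:d}'.format(a) for a in wide.columns]

# z-score each column so different ages are on a comparable scale
normalized = (wide - wide.mean()) / wide.std()
print(wide)
```

The NaNs left in `wide` are exactly the missing seasons that an aging curve would need to fill in.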

The way we can apply an aging curve is by filtering for players who played in each year of a certain age range, then calculating how they did in Y2 relative to Y1 in percentage terms. That is, we assume a constant percentage increase or drop in GP, G, and A. We should use a chained approach, considering only two years at a time, to use as much data as possible and to avoid excluding certain segments of players.
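Here is a minimal sketch of what the chained approach might look like, on made-up data. The helper name `chained_aging_factors` is hypothetical, and using the ratio of summed totals (rather than, say, the average of per-player ratios) is one assumption among several reasonable choices.

```python
import pandas as pd

# Toy data: player A has three seasons, B and C have two each
seasons = pd.DataFrame({
    'Name': ['A', 'A', 'A', 'B', 'B', 'C', 'C'],
    'age':  [16, 17, 18, 17, 18, 16, 17],
    'G':    [10, 20, 30, 10, 20, 30, 40],
})

def chained_aging_factors(df, stat, ages):
    """Year-over-year multipliers, each using every player with both adjacent ages."""
    wide = df.pivot_table(index='Name', columns='age', values=stat, aggfunc='sum')
    factors = {}
    for younger, older in zip(ages[:-1], ages[1:]):
        # Only two years at a time, so B still counts for 17->18 and C for 16->17
        both = wide[[younger, older]].dropna()
        factors[(younger, older)] = both[older].sum() / both[younger].sum()
    return factors

factors = chained_aging_factors(seasons, 'G', [16, 17, 18])
# Chain the links to project a 10-goal 16-year-old season forward to age 18
projected_18 = 10 * factors[(16, 17)] * factors[(17, 18)]
print(factors, projected_18)
```

Requiring all three seasons would have left only player A; chaining pairs keeps B and C in the sample, which is the point of doing it two years at a time.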

This will all be addressed in part 2.

In [135]:
grouped2.to_csv(player_inter_data, sep='\t')