# Predicting Whether Drafted Players Will Play in the NHL Using Decision Trees
In this script we'll try to predict whether a drafted player will make it to the NHL, using decision tree and random forest classifiers.
### Cleaning Data
The first stretch of the script cleans the data: columns the model should not see are dropped, and the remaining string columns are converted into integers so the model can work with them.
### Running Decision Tree and Random Forest
A Decision Tree model is then fit and tested, reaching 63.05% accuracy, and a Random Forest classifier reaches 70.61% accuracy.
### Rerunning Decision Tree and Random Forest with trimmed data set
Following that, the two least important features are removed and each model is rerun, with modest increases in prediction accuracy to 64.62% and 70.76% respectively.

# Overview
While 70.76% prediction accuracy is respectable, other models will be tested to achieve a higher success rate. The biggest shortcoming of this model, in my opinion, is that it says nothing about how successful a given player will be, only whether or not they have played a single game at the highest level.



In [1]:
# Dependencies
from sklearn import tree
import pandas as pd
import os 
from sklearn.model_selection import train_test_split

In [18]:
# Load in data
df = pd.read_csv('data_sets/drafted_data.csv')
df.head()

Unnamed: 0,LEAGUE,PLAYER_ID,POSITION,SEASON,AGE_SEPT_15,GP,G,A,TP,PPG,...,SEASON_>20GP,CUM_GP,ONLY_1_SEASON,MIN_3_SEASONS,DY_STATUS,SHOOTS,DRAFTED,NHL_PPG,NHL_GP,NHL_DV
0,WHL,100577/richard-nejezchleb,F,2015,21.369444,49,20,31,51,1.040816,...,3,150,0,1,DY+3,L,Yes,0.0,0,0
1,QMJHL,100867/jan-kostalek,D,2015,20.577778,57,7,36,43,0.754386,...,3,160,0,1,DY+2,R,Yes,0.0,0,0
2,OHL,101221/dominik-kubalik,F,2014,19.066667,59,18,11,29,0.491525,...,2,126,0,0,DY+1,L,Yes,0.0,0,0
3,WHL,10123/denis-rehak,D,2004,19.336111,25,0,3,3,0.12,...,1,25,1,0,DY+1,L,Yes,0.0,0,0
4,USHL,101430/nathan-walker,F,2013,19.605556,29,7,20,27,0.931034,...,1,29,1,0,DY+1,L,Yes,0.166667,12,1


In [19]:
# Dropping columns that won't impact the prediction model (player ID, season) and those which would give it the answer
# (NHL games played and NHL points per game). DRAFTED and LEAGUE are dropped as well.
df = df.drop(['PLAYER_ID', 'SEASON', 'NHL_PPG','NHL_GP','DRAFTED','LEAGUE'], axis=1)
df.head()

Unnamed: 0,POSITION,AGE_SEPT_15,GP,G,A,TP,PPG,SEASON_NO,SEASON_>20GP,CUM_GP,ONLY_1_SEASON,MIN_3_SEASONS,DY_STATUS,SHOOTS,NHL_DV
0,F,21.369444,49,20,31,51,1.040816,3,3,150,0,1,DY+3,L,0
1,D,20.577778,57,7,36,43,0.754386,3,3,160,0,1,DY+2,R,0
2,F,19.066667,59,18,11,29,0.491525,2,2,126,0,0,DY+1,L,0
3,D,19.336111,25,0,3,3,0.12,1,1,25,1,0,DY+1,L,0
4,F,19.605556,29,7,20,27,0.931034,1,1,29,1,0,DY+1,L,1


In [20]:
# Convert Position to int (Forward = 0, Defense = 1, Goalie = 2).
# Assigning the whole column avoids chained indexing; unmapped values are left unchanged.
position_codes = {'F': 0, 'D': 1, 'G': 2}
df['POSITION'] = df['POSITION'].map(lambda v: position_codes.get(v, v))

# Convert shooting side to int (left = 0, right = 1)
shoots_codes = {'L': 0, 'R': 1}
df['SHOOTS'] = df['SHOOTS'].map(lambda v: shoots_codes.get(v, v))
df.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  

Unnamed: 0,POSITION,AGE_SEPT_15,GP,G,A,TP,PPG,SEASON_NO,SEASON_>20GP,CUM_GP,ONLY_1_SEASON,MIN_3_SEASONS,DY_STATUS,SHOOTS,NHL_DV
0,0,21.369444,49,20,31,51,1.040816,3,3,150,0,1,DY+3,0,0
1,1,20.577778,57,7,36,43,0.754386,3,3,160,0,1,DY+2,1,0
2,0,19.066667,59,18,11,29,0.491525,2,2,126,0,0,DY+1,0,0
3,1,19.336111,25,0,3,3,0.12,1,1,25,1,0,DY+1,0,0
4,0,19.605556,29,7,20,27,0.931034,1,1,29,1,0,DY+1,0,1


In [21]:
# Convert draft-year status to int (DY-2 = 0, DY-1 = 1, DY = 2, DY+1 = 3, DY+2 = 4, DY+3 = 5).
# Unmapped values are left unchanged.
dy_codes = {'DY-2': 0, 'DY-1': 1, 'DY': 2, 'DY+1': 3, 'DY+2': 4, 'DY+3': 5}
df['DY_STATUS'] = df['DY_STATUS'].map(lambda v: dy_codes.get(v, v))
df.head(10)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  

Unnamed: 0,POSITION,AGE_SEPT_15,GP,G,A,TP,PPG,SEASON_NO,SEASON_>20GP,CUM_GP,ONLY_1_SEASON,MIN_3_SEASONS,DY_STATUS,SHOOTS,NHL_DV
0,0,21.369444,49,20,31,51,1.040816,3,3,150,0,1,5,0,0
1,1,20.577778,57,7,36,43,0.754386,3,3,160,0,1,4,1,0
2,0,19.066667,59,18,11,29,0.491525,2,2,126,0,0,3,0,0
3,1,19.336111,25,0,3,3,0.12,1,1,25,1,0,3,0,0
4,0,19.605556,29,7,20,27,0.931034,1,1,29,1,0,3,0,1
5,1,19.955556,68,14,54,68,1.0,1,1,68,1,0,3,0,1
6,0,18.013889,68,28,40,68,1.0,2,2,134,0,0,2,0,1
7,0,20.066667,59,12,37,49,0.830508,4,4,249,0,1,4,1,1
8,1,20.116667,43,13,35,48,1.116279,4,4,253,0,1,4,0,1
9,0,21.619444,56,20,33,53,0.946429,5,5,312,0,1,5,1,0


In [22]:
# Fitting the model fails because a '-' placeholder can't be converted to a number,
# so drop every row that contains one
df = df[~df.eq('-').any(axis=1)]
df

# Result is 7 dropped rows, still a sizeable data set



Unnamed: 0,POSITION,AGE_SEPT_15,GP,G,A,TP,PPG,SEASON_NO,SEASON_>20GP,CUM_GP,ONLY_1_SEASON,MIN_3_SEASONS,DY_STATUS,SHOOTS,NHL_DV
0,0,21.369444,49,20,31,51,1.040816,3,3,150,0,1,5,0.0,0
1,1,20.577778,57,7,36,43,0.754386,3,3,160,0,1,4,1.0,0
2,0,19.066667,59,18,11,29,0.491525,2,2,126,0,0,3,0.0,0
3,1,19.336111,25,0,3,3,0.120000,1,1,25,1,0,3,0.0,0
4,0,19.605556,29,7,20,27,0.931034,1,1,29,1,0,3,0.0,1
5,1,19.955556,68,14,54,68,1.000000,1,1,68,1,0,3,0.0,1
6,0,18.013889,68,28,40,68,1.000000,2,2,134,0,0,2,0.0,1
7,0,20.066667,59,12,37,49,0.830508,4,4,249,0,1,4,1.0,1
8,1,20.116667,43,13,35,48,1.116279,4,4,253,0,1,4,0.0,1
9,0,21.619444,56,20,33,53,0.946429,5,5,312,0,1,5,1.0,0
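Another way to handle non-numeric placeholders such as '-' is to coerce them to NaN with `pd.to_numeric` and then drop the affected rows. The sketch below is only an illustration on a made-up frame: `toy` and its values are hypothetical and are not rows from the data set.

```python
import pandas as pd

# Hypothetical toy frame: 'SHOOTS' contains a '-' placeholder in one row.
toy = pd.DataFrame({
    "GP": [49, 57, 59],
    "SHOOTS": [0, "-", 1],
    "NHL_DV": [0, 0, 1],
})

# Coerce every column to numeric; values that cannot be parsed become NaN.
numeric = toy.apply(pd.to_numeric, errors="coerce")

# Drop the rows that contained a non-numeric placeholder.
cleaned = numeric.dropna()

print(len(toy) - len(cleaned), "row(s) dropped")
```

Compared with filtering on `'-'` directly, this approach also leaves every column with a numeric dtype, at the cost of silently dropping any other unparseable value.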


In [24]:
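# Split data into inputs and outputs as NumPy arrays, by column position
# (the next cell makes the same split by column name, and that version is used for the models)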
data = df.values
inputs = data[:, 0:14]
outputs = data[:, 14]

In [25]:
# split data into inputs and outputs
outcomes = df['NHL_DV']
outcomes.head()

input_factors = df.drop(['NHL_DV'], axis = 1)
input_factors.head()
feature_names = input_factors.columns

In [28]:
# Create Train/test groups
input_train, input_test, output_train, output_test = train_test_split(input_factors, outcomes, random_state=42)

In [29]:
# Fit and test decision tree model
clf = tree.DecisionTreeClassifier()
clf = clf.fit(input_train, output_train)
clf.score(input_test, output_test)

0.630527817403709

The model was able to predict whether or not a drafted player will make it to the NHL with 63% accuracy. This is OK but certainly not good. Next we'll try a Random Forest classifier, which will hopefully increase our successful prediction rate.
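To put an accuracy figure like this in context, it can help to compare it against a trivial baseline that always predicts the most common class. The following is a minimal sketch on synthetic data generated with `make_classification`; the data, the class weights and the variable names are assumptions for illustration and do not reproduce the notebook's numbers.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the drafted-player data (not the real data set).
X, y = make_classification(
    n_samples=1000, n_features=14, weights=[0.6, 0.4], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Baseline: always predict the most common class seen in the training data.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
tree_clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

baseline_acc = baseline.score(X_test, y_test)
tree_acc = tree_clf.score(X_test, y_test)
print(f"baseline: {baseline_acc:.3f}, decision tree: {tree_acc:.3f}")
```

If the real data were run through the same comparison, the gap between the two scores would show how much the tree has actually learned beyond the class balance.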

In [30]:
# Fit and test Random Forest classifier
from sklearn.ensemble import RandomForestClassifier
randomforest = RandomForestClassifier(n_estimators=200)
randomforest = randomforest.fit(input_train, output_train)
randomforest.score(input_test, output_test)

0.7061340941512125

In [31]:
# Feature importance test
sorted(zip(randomforest.feature_importances_, feature_names), reverse=True)

[(0.167935802836781, 'PPG'),
 (0.1599543587216858, 'AGE_SEPT_15'),
 (0.11350094689635587, 'CUM_GP'),
 (0.1074602652401599, 'A'),
 (0.10687072662592495, 'TP'),
 (0.09831898155270447, 'GP'),
 (0.09613629861028981, 'G'),
 (0.04409135713191317, 'DY_STATUS'),
 (0.02666331807756335, 'SEASON_NO'),
 (0.02665779756402066, 'SEASON_>20GP'),
 (0.0202790925803737, 'SHOOTS'),
 (0.019416454612630692, 'POSITION'),
 (0.00797454622167694, 'MIN_3_SEASONS'),
 (0.00474005332791969, 'ONLY_1_SEASON')]

The RF classifier increased our successful prediction rate to about 70.6%, which is much better. Predictably, points per game was the most important feature in predicting whether or not a drafted player will make it into the NHL. More surprising is that the player's age on a fixed date (Sept. 15) in their last season in the juniors is also an important feature.

I was surprised by how little the last two features factored in, and will now run the model without them to see whether prediction performance improves or declines.
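Picking out the least important features can also be done programmatically rather than by reading the sorted list. The sketch below assumes synthetic data and hypothetical column names (`f0` to `f5`); it pairs each importance with its column name in a `pd.Series` and drops the two smallest.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in; the column names f0..f5 are hypothetical.
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=0
)
features = pd.DataFrame(X, columns=[f"f{i}" for i in range(6)])

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(features, y)

# Pair each importance with its column name and pick the two smallest.
importances = pd.Series(forest.feature_importances_, index=features.columns)
least_important = importances.nsmallest(2).index.tolist()

# Drop those columns to obtain the trimmed feature set.
trimmed = features.drop(columns=least_important)
print(least_important, trimmed.shape)
```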

In [32]:
# Removing 'min 3 seasons' and 'only one season'
df = df.drop(['MIN_3_SEASONS', 'ONLY_1_SEASON'], axis=1)
df.head()

Unnamed: 0,POSITION,AGE_SEPT_15,GP,G,A,TP,PPG,SEASON_NO,SEASON_>20GP,CUM_GP,DY_STATUS,SHOOTS,NHL_DV
0,0,21.369444,49,20,31,51,1.040816,3,3,150,5,0,0
1,1,20.577778,57,7,36,43,0.754386,3,3,160,4,1,0
2,0,19.066667,59,18,11,29,0.491525,2,2,126,3,0,0
3,1,19.336111,25,0,3,3,0.12,1,1,25,3,0,0
4,0,19.605556,29,7,20,27,0.931034,1,1,29,3,0,1


In [34]:
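# Split the trimmed data into inputs and outputs as NumPy arrays, by column position
# (the next cell makes the same split by column name, and that version is used for the models)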
data = df.values
inputs = data[:, 0:12]
outputs = data[:, 12]

In [35]:
# split data into inputs and outputs
outcomes = df['NHL_DV']
outcomes.head()

input_factors = df.drop(['NHL_DV'], axis = 1)
input_factors.head()
feature_names = input_factors.columns

In [40]:
# Create Train/test groups
input_train, input_test, output_train, output_test = train_test_split(input_factors, outcomes, random_state=42)

In [41]:
# Fit and test new decision tree model
clf = tree.DecisionTreeClassifier()
clf = clf.fit(input_train, output_train)
clf.score(input_test, output_test)

0.6462196861626248

In [42]:
# Fit and test new Random Forest classifier
from sklearn.ensemble import RandomForestClassifier
randomforest = RandomForestClassifier(n_estimators=200)
randomforest = randomforest.fit(input_train, output_train)
randomforest.score(input_test, output_test)

0.7075606276747504

In [43]:
# Feature importance test
sorted(zip(randomforest.feature_importances_, feature_names), reverse=True)

[(0.16629509355096375, 'PPG'),
 (0.15687451804241478, 'AGE_SEPT_15'),
 (0.11704646337196788, 'CUM_GP'),
 (0.11140640096086558, 'A'),
 (0.10682158111237339, 'TP'),
 (0.10015884778482897, 'GP'),
 (0.09750484655190661, 'G'),
 (0.0460060410934102, 'DY_STATUS'),
 (0.02891659112208282, 'SEASON_NO'),
 (0.028283212035458746, 'SEASON_>20GP'),
 (0.020419640843573575, 'POSITION'),
 (0.02026676353015395, 'SHOOTS')]

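Results were slightly better without the binary variables for one season (ONLY_1_SEASON) or at least three seasons (MIN_3_SEASONS), but not considerably: the Decision Tree went from 63.05% to 64.62% and the Random Forest from 70.61% to 70.76%. Because neither classifier was given a fixed `random_state`, differences this small may simply reflect run-to-run variation.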
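To gauge whether a small change in accuracy like this is meaningful, one option is to refit the same Random Forest with several seeds and look at the spread of test scores. This is a rough sketch on synthetic data; the data set, the number of seeds and the forest size are assumptions chosen to keep the example quick.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the player data (not the real data set).
X, y = make_classification(n_samples=600, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Refit the same forest with different seeds; the train/test split stays fixed,
# so only the forest's internal randomness changes between runs.
scores = []
for seed in range(5):
    forest = RandomForestClassifier(n_estimators=50, random_state=seed)
    forest.fit(X_train, y_train)
    scores.append(forest.score(X_test, y_test))

spread = max(scores) - min(scores)
print(f"mean accuracy: {np.mean(scores):.4f}, spread: {spread:.4f}")
```

If the spread across seeds is as large as the difference between two feature sets, that difference is hard to attribute to the features themselves.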