## Analysis Revisited

#### Can we build a model to predict whether LeBron James makes or misses a shot?

In this notebook, I revisit the original analysis armed with more than just decision trees to see if I can improve the model.

#### What's different this time around:
I one-hot encode the categorical features, run feature selection, and tune some hyperparameters with cross-validation, all while working with separate training and validation sets.

I also try out XGBoost, Logistic Regression and Random Forest classifiers to see if they can beat out the baseline Decision Tree model we used back in the first analysis.

I also test out a Python package that I co-created (SklearncomPYre) to optimize the training and fitting process.

#### Accuracy to beat: 59%


In [41]:
#loading packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import normalize, StandardScaler
from sklearn.metrics import log_loss, accuracy_score
from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.feature_selection import RFE

import xgboost as xgb

import time

# Importing SklearncomPYre
from SklearncomPYre.train_test_acc_time import train_test_acc_time
from SklearncomPYre.comparison_viz import comparison_viz
from SklearncomPYre.split import split

import warnings
warnings.filterwarnings(action='ignore')

In [42]:
data_og = pd.read_csv("../data/shot_logs_raw.csv")
data = data_og.query('player_name == "lebron james"')
data = data.drop(columns=['player_name', 'player_id', 'FGM', 'CLOSEST_DEFENDER', 'PTS',
                  'MATCHUP', 'GAME_ID','GAME_CLOCK'])

data.head()

Unnamed: 0,LOCATION,W,FINAL_MARGIN,SHOT_NUMBER,PERIOD,SHOT_CLOCK,DRIBBLES,TOUCH_TIME,SHOT_DIST,PTS_TYPE,SHOT_RESULT,CLOSEST_DEFENDER_PLAYER_ID,CLOSE_DEF_DIST
45834,A,W,8,1,1,13.7,9,9.5,7.0,2,missed,201949,0.8
45835,A,W,8,2,1,15.2,8,7.9,5.4,2,missed,202685,2.5
45836,A,W,8,3,1,12.3,0,5.6,23.2,3,made,201949,3.5
45837,A,W,8,4,1,,0,2.0,27.1,3,missed,203082,3.9
45838,A,W,8,5,2,20.8,2,2.7,3.1,2,made,201949,3.5


In [43]:
#setting up our X and Y sets
Y = pd.DataFrame(data['SHOT_RESULT'])
X = data.drop(columns=['SHOT_RESULT'])

In [44]:
#getting to know our training data
X.describe()

Unnamed: 0,FINAL_MARGIN,SHOT_NUMBER,PERIOD,SHOT_CLOCK,DRIBBLES,TOUCH_TIME,SHOT_DIST,PTS_TYPE,CLOSEST_DEFENDER_PLAYER_ID,CLOSE_DEF_DIST
count,978.0,978.0,978.0,947.0,978.0,978.0,978.0,978.0,978.0,978.0
mean,7.208589,10.644172,2.40184,11.867899,4.665644,5.718507,14.01002,2.257669,159483.315951,4.192843
std,14.181866,6.537207,1.125255,6.313463,5.378572,4.758516,9.397451,0.437574,79346.795817,3.341898
min,-29.0,1.0,1.0,0.2,0.0,-4.3,0.1,2.0,708.0,0.0
25%,-2.0,5.0,1.0,6.45,1.0,2.3,5.025,2.0,200757.0,2.525
50%,8.0,10.0,2.0,11.7,3.0,4.5,12.65,2.0,202066.0,3.8
75%,14.0,15.0,3.0,17.25,7.0,7.8,23.475,3.0,203082.0,5.0
max,39.0,35.0,5.0,24.0,26.0,23.3,44.9,3.0,204060.0,53.2


In [45]:
# From the above, it looks like there are some NAs in SHOT_CLOCK column.

# Let's inspect:
pd.DataFrame(X.isnull().sum()).T

Unnamed: 0,LOCATION,W,FINAL_MARGIN,SHOT_NUMBER,PERIOD,SHOT_CLOCK,DRIBBLES,TOUCH_TIME,SHOT_DIST,PTS_TYPE,CLOSEST_DEFENDER_PLAYER_ID,CLOSE_DEF_DIST
0,0,0,0,0,0,31,0,0,0,0,0,0


In [46]:
# What are these shot clock NAs?

X[X['SHOT_CLOCK'].isnull()].head()

Unnamed: 0,LOCATION,W,FINAL_MARGIN,SHOT_NUMBER,PERIOD,SHOT_CLOCK,DRIBBLES,TOUCH_TIME,SHOT_DIST,PTS_TYPE,CLOSEST_DEFENDER_PLAYER_ID,CLOSE_DEF_DIST
45837,A,W,8,4,1,,0,2.0,27.1,3,203082,3.9
45844,A,W,8,11,3,,18,17.3,4.3,2,203082,2.2
45865,H,W,31,16,2,,0,1.0,41.0,3,203109,15.8
45887,A,L,-2,15,2,,6,11.8,25.9,3,201147,4.0
45952,A,W,18,6,1,,11,10.7,25.5,3,203921,4.7


In [47]:
#Let's look at the original data with the game_clock column included, ...
# in case these are possible buzzer shots.

In [48]:
data_og.loc[[45837]]

Unnamed: 0,GAME_ID,MATCHUP,LOCATION,W,FINAL_MARGIN,SHOT_NUMBER,PERIOD,GAME_CLOCK,SHOT_CLOCK,DRIBBLES,...,SHOT_DIST,PTS_TYPE,SHOT_RESULT,CLOSEST_DEFENDER,CLOSEST_DEFENDER_PLAYER_ID,CLOSE_DEF_DIST,FGM,PTS,player_name,player_id
45837,21400900,"MAR 04, 2015 - CLE @ TOR",A,W,8,4,1,0:02,,0,...,27.1,3,missed,"Ross, Terrence",203082,3.9,0,0,lebron james,2544


In [49]:
data_og.loc[[45844]]

Unnamed: 0,GAME_ID,MATCHUP,LOCATION,W,FINAL_MARGIN,SHOT_NUMBER,PERIOD,GAME_CLOCK,SHOT_CLOCK,DRIBBLES,...,SHOT_DIST,PTS_TYPE,SHOT_RESULT,CLOSEST_DEFENDER,CLOSEST_DEFENDER_PLAYER_ID,CLOSE_DEF_DIST,FGM,PTS,player_name,player_id
45844,21400900,"MAR 04, 2015 - CLE @ TOR",A,W,8,11,3,0:00,,18,...,4.3,2,made,"Ross, Terrence",203082,2.2,1,2,lebron james,2544


In [50]:
data_og.loc[[45952]]

Unnamed: 0,GAME_ID,MATCHUP,LOCATION,W,FINAL_MARGIN,SHOT_NUMBER,PERIOD,GAME_CLOCK,SHOT_CLOCK,DRIBBLES,...,SHOT_DIST,PTS_TYPE,SHOT_RESULT,CLOSEST_DEFENDER,CLOSEST_DEFENDER_PLAYER_ID,CLOSE_DEF_DIST,FGM,PTS,player_name,player_id
45952,21400821,"FEB 22, 2015 - CLE @ NYK",A,W,18,6,1,0:03,,11,...,25.5,3,made,"Early, Cleanthony",203921,4.7,1,3,lebron james,2544


In [51]:
# Looks like a NaN SHOT_CLOCK is indeed a buzzer shot: in the rows above,
# GAME_CLOCK reads between 0:00 and 0:03.

# I think these clutch shots are valuable for our analysis.
# Let's keep these rows and fill the missing shot clock with 0.

In [52]:
X.SHOT_CLOCK = X.SHOT_CLOCK.fillna(0)
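
Filling with 0 works as a sentinel here because the smallest observed SHOT_CLOCK in the `describe()` output is 0.2. One optional variation, sketched below on a handful of values copied from the rows above, is to flag the missing entries before filling them so a model can tell "shot clock off" apart from "almost no time left". The column name `SHOT_CLOCK_MISSING` is hypothetical and is not used anywhere else in this notebook.

```python
import numpy as np
import pandas as pd

# A few SHOT_CLOCK values copied from the rows shown above (NaN = shot clock off).
X_demo = pd.DataFrame(
    {"SHOT_CLOCK": [13.7, 15.2, 12.3, np.nan, 20.8]},
    index=[45834, 45835, 45836, 45837, 45838],
)

# Hypothetical extra column: flag the missing values *before* filling them.
X_demo["SHOT_CLOCK_MISSING"] = X_demo["SHOT_CLOCK"].isna().astype(int)

# Same fill as in the cell above.
X_demo["SHOT_CLOCK"] = X_demo["SHOT_CLOCK"].fillna(0)

print(X_demo)
```

Whether the extra flag helps is an empirical question; tree-based models can usually isolate the 0 value on their own, while a linear model may benefit more.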

In [53]:
#some one-hot encoding before splitting

Y_copy = Y.copy()

Y['SHOT_RESULT'] = pd.Categorical(Y['SHOT_RESULT'])
Y = pd.get_dummies(Y, prefix = 'category')
Y = Y.drop(columns=['category_missed'])
Y.head()

Unnamed: 0,category_made
45834,0
45835,0
45836,1
45837,0
45838,1
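
For a two-class target, the same 0/1 column can probably be built with a single comparison instead of `get_dummies` plus a `drop`. This is only a sketch on the five labels shown above; the name `made` is my own choice.

```python
import pandas as pd

# The first five SHOT_RESULT labels from the data preview above.
shot_result = pd.Series(
    ["missed", "missed", "made", "missed", "made"],
    index=[45834, 45835, 45836, 45837, 45838],
    name="SHOT_RESULT",
)

# 1 where the shot was made, 0 where it was missed.
made = (shot_result == "made").astype(int).rename("made")

print(made)
```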


In [54]:
X_copy = X.copy()

X['LOCATION'] = pd.Categorical(X['LOCATION'])
X = pd.get_dummies(X, prefix = 'category')

X=X.drop(columns=['category_A', 'category_L'])

X.head()

Unnamed: 0,FINAL_MARGIN,SHOT_NUMBER,PERIOD,SHOT_CLOCK,DRIBBLES,TOUCH_TIME,SHOT_DIST,PTS_TYPE,CLOSEST_DEFENDER_PLAYER_ID,CLOSE_DEF_DIST,category_H,category_W
45834,8,1,1,13.7,9,9.5,7.0,2,201949,0.8,0,1
45835,8,2,1,15.2,8,7.9,5.4,2,202685,2.5,0,1
45836,8,3,1,12.3,0,5.6,23.2,3,201949,3.5,0,1
45837,8,4,1,0.0,0,2.0,27.1,3,203082,3.9,0,1
45838,8,5,2,20.8,2,2.7,3.1,2,201949,3.5,0,1
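
A shared `prefix='category'` makes it hard to tell which original column each dummy came from. As a hedged alternative, `pd.get_dummies` accepts `columns=` and `drop_first=True`, which keeps the source column name in each dummy and drops one level per column in a single call. The toy frame below is made up for illustration and is not the real shot log.

```python
import pandas as pd

# Toy rows (not the real data) with the two categorical columns from the notebook.
X_demo = pd.DataFrame(
    {
        "LOCATION": ["A", "H", "A", "H"],
        "W": ["W", "W", "L", "L"],
        "SHOT_DIST": [7.0, 5.4, 23.2, 27.1],
    }
)

# drop_first=True drops the first level of each column (A and L here),
# which mirrors dropping category_A and category_L by hand.
X_encoded = pd.get_dummies(
    X_demo, columns=["LOCATION", "W"], drop_first=True, dtype=int
)

print(X_encoded)
```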


In [55]:
#splitting our data into training, validation, testing sets
X_train, y_train, X_val, y_val, X_train_val, y_train_val, X_test, y_test = split(X,Y,0.55,0.15,0.3)
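
The internals of `split` are not shown in this notebook. As a rough sketch of what a three-way split can look like with plain scikit-learn, the hypothetical helper `three_way_split` below calls `train_test_split` twice and returns the pieces in the same order as above. It is an assumption about the general idea, not the SklearncomPYre implementation.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def three_way_split(X, y, val_frac=0.15, test_frac=0.30, seed=0):
    """Hypothetical helper: carve out a test set, then a validation set."""
    # First cut: hold out the test set.
    X_train_val, X_test, y_train_val, y_test = train_test_split(
        X, y, test_size=test_frac, random_state=seed
    )
    # Second cut: validation fraction relative to what is left.
    rel_val = val_frac / (1.0 - test_frac)
    X_train, X_val, y_train, y_val = train_test_split(
        X_train_val, y_train_val, test_size=rel_val, random_state=seed
    )
    return (X_train, y_train, X_val, y_val,
            X_train_val, y_train_val, X_test, y_test)


# Toy data: 200 rows with unique values so overlap is easy to check.
X_demo = np.arange(200).reshape(-1, 1)
y_demo = np.arange(200) % 2

parts = three_way_split(X_demo, y_demo)
X_tr, y_tr, X_va, y_va, X_tv, y_tv, X_te, y_te = parts
print(len(X_tr), len(X_va), len(X_te))
```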


In [56]:
#feature selection

rfe = RFE(estimator = RandomForestClassifier(n_estimators=100), n_features_to_select = 5)

rfe.fit(X_train, y_train)
best = rfe.support_

In [57]:
X_train.columns[best]

Index(['SHOT_CLOCK', 'TOUCH_TIME', 'SHOT_DIST', 'CLOSEST_DEFENDER_PLAYER_ID',
       'CLOSE_DEF_DIST'],
      dtype='object')
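
Besides the boolean mask in `support_`, a fitted `RFE` also exposes `ranking_`, which shows the order in which the dropped features were eliminated. Here is a small sketch on synthetic data (the `f0`..`f7` column names are placeholders, not shot-log columns). Fixing `random_state` matters because a random forest can otherwise pick different features from run to run.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in for X_train / y_train.
X_arr, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

rfe_demo = RFE(
    estimator=RandomForestClassifier(n_estimators=25, random_state=0),
    n_features_to_select=5,
)
rfe_demo.fit(X_demo, y_demo)

# Rank 1 = selected; higher ranks were eliminated earlier.
ranking = pd.Series(rfe_demo.ranking_, index=X_demo.columns).sort_values()
print(ranking)
```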

In [58]:
rfe = RFE(estimator = RandomForestClassifier(n_estimators=100), n_features_to_select = 9)

rfe.fit(X_train, y_train)
best = rfe.support_
X_train.columns[best]

Index(['FINAL_MARGIN', 'SHOT_NUMBER', 'PERIOD', 'SHOT_CLOCK', 'DRIBBLES',
       'TOUCH_TIME', 'SHOT_DIST', 'CLOSEST_DEFENDER_PLAYER_ID',
       'CLOSE_DEF_DIST'],
      dtype='object')
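
Rather than trying 5 and then 9 features by hand, scikit-learn's `RFECV` can pick the number of features by cross-validation. The sketch below runs on synthetic data, so the number it reports says nothing about the shot log; it only shows the shape of the call.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Synthetic stand-in for X_train / y_train.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=25, random_state=0),
    step=1,              # drop one feature per round
    cv=4,                # same fold count as the grid searches in this notebook
    scoring="accuracy",
)
selector.fit(X_demo, y_demo)

print("features kept:", selector.n_features_)
print("mask:", selector.support_)
```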

In [59]:
#hyperparameter tuning with GridSearchCV

#random forest
# note: np.arange(50, 150, 100) yields only [50], so n_estimators is not actually searched
n_list = np.arange(50, 150, 100)
ft_list = np.arange(2, 10, 1)
d_list = np.arange(5, 25, 5)

parameters = {'n_estimators':n_list, 
              'max_features':ft_list, 
              'max_depth':d_list}

rf = RandomForestClassifier(n_jobs=-1)

model = GridSearchCV(rf, parameters, cv=4, return_train_score=False)
t = time.time()
model.fit(X_train, y_train)
training_time = time.time() - t

In [60]:
print(model.best_params_)
print("best score:", model.best_score_)
print("best error:", 1 - model.best_score_)
print("time (s):", training_time)

{'max_depth': 5, 'max_features': 6, 'n_estimators': 50}
best score: 0.6370370370370371
best error: 0.36296296296296293
time (s): 37.16169810295105
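
A quick check of what the random forest grid above actually contains. `np.arange` excludes its stop value, so a step of 100 starting at 50 never reaches a second value. The `wider_n_list` line is only a suggestion for a wider search and was not run in this notebook.

```python
import numpy as np

# The three ranges from the random forest grid above.
n_list = np.arange(50, 150, 100)
ft_list = np.arange(2, 10, 1)
d_list = np.arange(5, 25, 5)

print("n_estimators:", n_list.tolist())
print("max_features:", ft_list.tolist())
print("max_depth:", d_list.tolist())

# Number of parameter combinations GridSearchCV will try (each fitted cv times).
n_combinations = len(n_list) * len(ft_list) * len(d_list)
print("combinations:", n_combinations)

# Hypothetical wider range that would really vary n_estimators.
wider_n_list = np.arange(50, 151, 50)
print("wider n_estimators:", wider_n_list.tolist())
```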


In [61]:
#xgboost
d_list = np.arange(2, 10, 2)
booster_list = ['gbtree', 'gblinear', 'dart']
lambda_list = [0, 0.01, 0.1, 1,10]

parameters = {'booster':booster_list,
              'max_depth':d_list,
              'reg_lambda': lambda_list}

xgbc = xgb.XGBClassifier(n_jobs=-1)

model = GridSearchCV(xgbc, parameters, cv=4,  return_train_score=False)
t = time.time()
model.fit(X_train, y_train)
training_time = time.time() - t
print(model.best_params_)
print("best score:", model.best_score_)
print("best error:", 1 - model.best_score_)
print("time (s):",training_time)

{'booster': 'gbtree', 'max_depth': 2, 'reg_lambda': 10}
best score: 0.6111111111111112
best error: 0.38888888888888884
time (s): 20.440399885177612


In [62]:
#logit

solver_list = ['lbfgs', 'sag']
parameters = {'solver':solver_list}

logit = LogisticRegression(penalty= 'l2', n_jobs=-1)

model = GridSearchCV(logit, parameters, cv=5, return_train_score=False)

t = time.time()
model.fit(X_train, y_train)
training_time = time.time() - t

print(model.best_params_)
print("best score:", model.best_score_)
print("best error:", 1 - model.best_score_)
print("time (s):",training_time)

{'solver': 'lbfgs'}
best score: 0.5666666666666667
best error: 0.43333333333333335
time (s): 0.36352109909057617
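
`StandardScaler` is imported at the top but not used in the cells shown here, and the features sit on very different scales (CLOSEST_DEFENDER_PLAYER_ID is around 200,000 while DRIBBLES is in single digits). Logistic regression, and the `sag` solver in particular, tends to behave better on scaled inputs. A minimal sketch of scaling inside the cross-validation loop, on synthetic data with deliberately mismatched scales, could look like this. The `C` grid is an illustrative assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train / y_train, stretched to mismatched scales.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_demo = X_demo * [1, 10, 100, 1000, 10000, 100000]

# Putting the scaler in a pipeline means it is refit on each training fold,
# so no information leaks from the held-out fold.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# make_pipeline names each step after its lowercased class name.
param_grid = {"logisticregression__C": [0.01, 0.1, 1, 10]}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)
print("best score:", search.best_score_)
```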


In [63]:
#seeing how our model research does on the validation set:
# note: these hyperparameters differ from the best_params_ printed by the grid searches above

classifiers = {"Logit": LogisticRegression(penalty= 'l2', n_jobs=-1),
               "XGBoost":xgb.XGBClassifier(booster='gblinear', 
                                           max_depth=2, 
                                           reg_lambda=1,
                                           n_jobs=-1),
               "Random Forest":RandomForestClassifier(max_depth=10, 
                                                      max_features=3, 
                                                      n_estimators=50,
                                                      n_jobs=-1)}



In [64]:
result = train_test_acc_time(classifiers,
                             X_train,
                             y_train,
                             X_val,
                             y_val)
result

Unnamed: 0,Model,Train Accuracy,Test Accuracy,Variance,Fit Time,Predict Time,Total Time
0,Logit,0.624074,0.680556,-0.056481,0.016371,0.002132,0.018503
1,XGBoost,0.625926,0.666667,-0.040741,0.016242,0.001362,0.017604
2,Random Forest,0.992593,0.597222,0.39537,0.158933,0.108043,0.266976
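
Judging by the numbers, the `Variance` column looks like train accuracy minus test accuracy (for Logit, 0.624074 - 0.680556 is about -0.0565). A large positive gap, as for the random forest, points to overfitting. The hypothetical helper `accuracy_gap` below sketches that calculation on synthetic data; it is my reading of the table, not the SklearncomPYre source.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def accuracy_gap(model, X_tr, y_tr, X_te, y_te):
    """Hypothetical helper: fit, then return (train acc, test acc, gap)."""
    model.fit(X_tr, y_tr)
    train_acc = accuracy_score(y_tr, model.predict(X_tr))
    test_acc = accuracy_score(y_te, model.predict(X_te))
    return train_acc, test_acc, train_acc - test_acc


# Synthetic stand-in for the training and validation sets.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

rf_demo = RandomForestClassifier(n_estimators=50, max_depth=10, random_state=0)
train_acc, test_acc, gap = accuracy_gap(rf_demo, X_tr, y_tr, X_te, y_te)
print(f"train={train_acc:.3f} test={test_acc:.3f} gap={gap:.3f}")
```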


In [65]:
#Does cutting down on our features (as per our feature selection) make any difference?
best_features = ['SHOT_CLOCK', 'TOUCH_TIME', 'SHOT_DIST', 'CLOSEST_DEFENDER_PLAYER_ID',
       'CLOSE_DEF_DIST']

In [66]:
X_train = X_train.filter(best_features)

In [67]:
X_val = X_val.filter(best_features)

In [68]:
X_train.shape

(540, 5)

In [69]:
X_val.shape

(144, 5)

In [70]:
result = train_test_acc_time(classifiers,
                             X_train,
                             y_train,
                             X_val,
                             y_val)
result

Unnamed: 0,Model,Train Accuracy,Test Accuracy,Variance,Fit Time,Predict Time,Total Time
1,XGBoost,0.627778,0.659722,-0.031944,0.043637,0.001439,0.045076
0,Logit,0.614815,0.652778,-0.037963,0.005883,0.003105,0.008988
2,Random Forest,0.972222,0.597222,0.375,0.169056,0.108016,0.277072


In [71]:
#Nothing to really see here. Feature selection didn't help.

In [72]:
#What about important features? Do our models choose the same features?

In [73]:
model = RandomForestClassifier(max_depth=10, 
                       max_features=3, 
                       n_estimators=50, 
                       n_jobs=-1)

model.fit(X_train, y_train)
model.score(X_val, y_val)
model.feature_importances_

array([0.21134161, 0.18301665, 0.2691291 , 0.16594898, 0.17056366])

In [74]:
X_train.columns

Index(['SHOT_CLOCK', 'TOUCH_TIME', 'SHOT_DIST', 'CLOSEST_DEFENDER_PLAYER_ID',
       'CLOSE_DEF_DIST'],
      dtype='object')

In [75]:
rf_ft_importances = dict(zip(X_train.columns,model.feature_importances_))

In [76]:
rf_ft_importances

{'CLOSEST_DEFENDER_PLAYER_ID': 0.16594898183556694,
 'CLOSE_DEF_DIST': 0.17056365662633877,
 'SHOT_CLOCK': 0.2113416071674868,
 'SHOT_DIST': 0.26912910261237943,
 'TOUCH_TIME': 0.1830166517582281}
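
The dictionary is easier to read once it is sorted. A small sketch using the exact values printed above:

```python
import pandas as pd

# Importances copied from the output above.
rf_ft_importances = {
    "CLOSEST_DEFENDER_PLAYER_ID": 0.16594898183556694,
    "CLOSE_DEF_DIST": 0.17056365662633877,
    "SHOT_CLOCK": 0.2113416071674868,
    "SHOT_DIST": 0.26912910261237943,
    "TOUCH_TIME": 0.1830166517582281,
}

# Largest importance first.
ranked = pd.Series(rf_ft_importances).sort_values(ascending=False)
print(ranked)
```

SHOT_DIST comes out on top, but the five values are fairly close to one another, so I would not read much into the exact ordering.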

In [77]:
best_features

['SHOT_CLOCK',
 'TOUCH_TIME',
 'SHOT_DIST',
 'CLOSEST_DEFENDER_PLAYER_ID',
 'CLOSE_DEF_DIST']

In [78]:
#Same features! (No surprise: X_train was already filtered down to best_features above.)

In [79]:
#Ok, a final run through with the test set:

In [80]:
result = train_test_acc_time(classifiers,
                             X_train_val,
                             y_train_val,
                             X_test,
                             y_test)
result

Unnamed: 0,Model,Train Accuracy,Test Accuracy,Variance,Fit Time,Predict Time,Total Time
1,XGBoost,0.643275,0.632653,0.010622,0.033688,0.00129,0.034978
0,Logit,0.647661,0.612245,0.035416,0.005043,0.000998,0.006041
2,Random Forest,0.972222,0.585034,0.387188,0.1568,0.108478,0.265278


Based on the above, there's no *real* improvement here!

While XGBoost added a few percentage points of accuracy, I think this is a marginal improvement that could be washed away by variance/randomness in another run.
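
To put a rough number on that randomness: the test set has 294 rows (978 total minus 540 training and 144 validation), so a single accuracy estimate carries a fair amount of sampling noise. The sketch below uses the normal approximation to a binomial proportion, which is an assumption on my part, and treats the 59% baseline as a fixed number even though it came from a different analysis.

```python
import math

n_test = 294          # 978 rows - 540 train - 144 validation
accuracy = 0.632653   # XGBoost test accuracy from the table above
baseline = 0.59       # accuracy to beat from the original analysis

# Standard error of a proportion under the normal approximation.
std_err = math.sqrt(accuracy * (1 - accuracy) / n_test)

# Approximate 95% interval.
half_width = 1.96 * std_err
lower, upper = accuracy - half_width, accuracy + half_width

print(f"standard error: {std_err:.4f}")
print(f"approx. 95% interval: {lower:.3f} to {upper:.3f}")
print("baseline inside interval:", lower < baseline < upper)
```

The baseline falls inside the interval, which is consistent with calling the gain marginal.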

It turns out that the decision tree model from the original analysis was pretty good at capturing the information in the data set after all.

Have we hit irreducible error?

<br>

![welp](https://media.giphy.com/media/uTM2OGX2DAEPMd2lFJ/giphy-downsized.gif)

The next thing to try would be to test whether we can predict makes and misses for players who aren't as good as LeBron. It might also be interesting to try this model on players like Steph Curry.