In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# The Dataset

I will use the same NBA box score data as in my previous capstone (https://github.com/johnboy81918/Capstone-1-Analytic-Report-and-Research-Proposal/blob/master/Capstone%201%20Analytic%20Report%20and%20Research%20Proposal.ipynb), and refer to some of its analysis to guide this proposal.

In [2]:
# Load the dataset

df = pd.read_csv('2012-18_teamBoxScore.csv')

Whereas the previous analysis was based entirely on team-aggregated statistics, this report will focus on individual game analysis. The primary goal is to determine the point at which each box score statistic becomes statistically significant in contributing to winning. First, let's examine the overall composition of the dataframe and see what statistics are available to work with.

In [3]:
# Preview the first five rows to see the column headers and some of our data

df.head()

Unnamed: 0,gmDate,gmTime,seasTyp,offLNm1,offFNm1,offLNm2,offFNm2,offLNm3,offFNm3,teamAbbr,...,opptFIC40,opptOrtg,opptDrtg,opptEDiff,opptPlay%,opptAR,opptAST/TO,opptSTL/TO,poss,pace
0,2012-10-30,19:00,Regular,Brothers,Tony,Smith,Michael,Workman,Haywoode,WAS,...,61.6667,105.6882,94.4447,11.2435,0.439,16.7072,1.0476,33.3333,88.9409,88.9409
1,2012-10-30,19:00,Regular,Brothers,Tony,Smith,Michael,Workman,Haywoode,CLE,...,56.0417,94.4447,105.6882,-11.2435,0.3765,18.8679,2.0,84.6154,88.9409,88.9409
2,2012-10-30,20:00,Regular,McCutchen,Monty,Wright,Sean,Fitzgerald,Kane,BOS,...,80.8333,126.3381,112.6515,13.6866,0.5244,19.8287,3.125,100.0,94.9832,94.9832
3,2012-10-30,20:00,Regular,McCutchen,Monty,Wright,Sean,Fitzgerald,Kane,MIA,...,62.7083,112.6515,126.3381,-13.6866,0.4643,18.8501,1.5,25.0,94.9832,94.9832
4,2012-10-30,22:30,Regular,Foster,Scott,Zielinski,Gary,Dalen,Eric,DAL,...,58.6458,99.3678,108.1034,-8.7356,0.5,18.6567,1.7143,42.8571,91.579,91.579


Although the dataframe contains *many* statistics, the vast majority of them are derived (calculated from other numbers within the dataframe). For purposes of this study, I'll only be examining the actual counting statistics, rather than including all of the statistics derived from them. These include: assists, turnovers, steals, blocks, personal fouls, and rebounds.

In [4]:
# Start with assists: find the net assist margin for each team in each game, and the number of cases where that margin results in a win or a loss.

df['net_assists'] = df['teamAST'] - df['opptAST']
assists_wins = df[df['teamRslt'].str.contains("Win")].groupby('net_assists').size()
assists_losses = df[df['teamRslt'].str.contains("Loss")].groupby('net_assists').size()
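
The same win and loss counts can be gathered into a single table, which also makes it easy to repeat the step for the other counting statistics. The following is a minimal sketch, not part of the original analysis: `margin_table` is a hypothetical helper, and the toy rows are invented for illustration rather than taken from the dataset. Only `teamAST`, `opptAST`, and `teamRslt` are column names confirmed above; the column names for the other statistics would need to be checked against the dataframe.

```python
import pandas as pd

def margin_table(df, team_col, oppt_col, result_col="teamRslt"):
    """Count wins and losses at each net margin (team value minus opponent value)."""
    margin = (df[team_col] - df[oppt_col]).rename("margin")
    won = df[result_col].str.contains("Win")
    # Cross-tabulate margin against the win flag; keep both columns even if one is empty.
    table = pd.crosstab(margin, won).reindex(columns=[False, True], fill_value=0)
    table.columns = ["losses", "wins"]
    table["win_rate"] = table["wins"] / (table["wins"] + table["losses"])
    return table

# Toy rows for illustration only (not real box scores).
toy = pd.DataFrame({
    "teamAST": [25, 20, 30, 18, 22, 22],
    "opptAST": [20, 25, 18, 30, 22, 22],
    "teamRslt": ["Win", "Loss", "Win", "Loss", "Win", "Loss"],
})
assist_table = margin_table(toy, "teamAST", "opptAST")
print(assist_table)
```

On the real data, the call would presumably be `margin_table(df, 'teamAST', 'opptAST')`, with the column pair swapped for each statistic of interest.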

In [5]:
print(assists_wins)
print(assists_losses)

net_assists
-18      1
-17      2
-16      1
-15      6
-14      2
-13     13
-12     16
-11     24
-10     51
-9      48
-8     104
-7     115
-6     155
-5     170
-4     246
-3     268
-2     333
-1     378
 0     448
 1     502
 2     471
 3     535
 4     510
 5     455
 6     413
 7     372
 8     358
 9     288
 10    265
 11    219
 12    173
 13    117
 14     88
 15     70
 16     40
 17     35
 18     26
 19     15
 20     12
 21     12
 22     12
 23      2
 24      5
 25      2
 26      1
dtype: int64
net_assists
-26      1
-25      2
-24      5
-23      2
-22     12
-21     12
-20     12
-19     15
-18     26
-17     35
-16     40
-15     70
-14     88
-13    117
-12    173
-11    219
-10    265
-9     288
-8     358
-7     372
-6     413
-5     455
-4     510
-3     535
-2     471
-1     502
 0     448
 1     378
 2     333
 3     268
 4     246
 5     170
 6     155
 7     115
 8     104
 9      48
 10     51
 11     24
 12     16
 13     13
 14      2
 15      6
 16   

The next step is to determine the statistical significance of this information. Essentially, the question is: at what net assist margin does a team's chance of winning a particular game change meaningfully? In other words, is the win rate for a margin of >= x significantly different from the win rate for a margin of <= (x - 1)? The objective is to find that x value for each statistic, using p < 0.05 as the threshold for statistical significance.

In [6]:
# theoretical code if we were to run this experiment

# for any given x, determine the number of cases in which this margin or greater occurred, and the win rate for those cases.

# for example: x = 19 yields 61 cases where the assist margin was 19 or greater, and a win rate of 100%

# then, calculate the same information for x-1 (18 in this example)

# finally, determine if these two sets are significantly different

# calculate the p value for all x, and find the smallest margin at which the chances of winning are significantly increased!
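
One way the steps above could be turned into runnable code is sketched below. It rests on several assumptions that the proposal does not fix: the comparison is read as "margin >= x" versus "margin <= x - 1", the significance test is a pooled two-proportion z-test (a Fisher exact or chi-square test would be reasonable alternatives), and the function names `two_proportion_p_value` and `scan_thresholds` are hypothetical. The counts in the example are made up; on the real data, `assists_wins.to_dict()` and `assists_losses.to_dict()` could be passed in instead.

```python
import math

def two_proportion_p_value(wins_a, n_a, wins_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test (normal approximation)."""
    if n_a == 0 or n_b == 0:
        return float("nan")  # one group is empty, so no comparison is possible
    pooled = (wins_a + wins_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0  # both groups won (or lost) every game: no evidence of a difference
    z = (wins_a / n_a - wins_b / n_b) / std_err
    return math.erfc(abs(z) / math.sqrt(2))

def scan_thresholds(wins_by_margin, losses_by_margin, alpha=0.05):
    """For each x, compare games with margin >= x against games with margin <= x - 1."""
    margins = sorted(set(wins_by_margin) | set(losses_by_margin))
    rows = []
    for x in margins[1:]:  # the lowest margin has no games below it
        wins_hi = sum(w for m, w in wins_by_margin.items() if m >= x)
        loss_hi = sum(l for m, l in losses_by_margin.items() if m >= x)
        wins_lo = sum(w for m, w in wins_by_margin.items() if m < x)
        loss_lo = sum(l for m, l in losses_by_margin.items() if m < x)
        n_hi, n_lo = wins_hi + loss_hi, wins_lo + loss_lo
        p = two_proportion_p_value(wins_hi, n_hi, wins_lo, n_lo)
        win_rate = wins_hi / n_hi if n_hi else float("nan")
        rows.append((x, n_hi, win_rate, p, p < alpha))
    return rows

# Made-up counts for illustration only (not the real assist data).
toy_wins = {-2: 1, -1: 3, 0: 5, 1: 7, 2: 9}
toy_losses = {-2: 9, -1: 7, 0: 5, 1: 3, 2: 1}
rows = scan_thresholds(toy_wins, toy_losses)
for x, cases, win_rate, p, significant in rows:
    print(f"x={x:>3}  cases={cases:>3}  win_rate={win_rate:.2f}  p={p:.4f}  significant={significant}")
```

Each row reports the threshold x, the number of cases at or above it, the win rate for those cases, the p-value, and whether it falls below 0.05. Because many thresholds are tested in sequence, a real run might also want a correction for multiple comparisons.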

# Evaluation

Success in this experiment will mean defining the "point of significance" for each statistic, if one exists. This information could easily be used in-game by a losing team to determine which aspect of its current performance is most critical to address in order to come back in the game at hand.