## 'Hot Hand' Exploration
By Joshua Ssemwanga

This week I explored the concept of the ‘Hot Hand’ in sports. There is a long-running debate about whether a player who scores many points in a short span is genuinely "hot" or whether the streak is pure coincidence. For example, if a basketball player makes 6 consecutive shots, are they more likely than usual to make the 7th? That is the question I set out to examine.


## Part 1 - Data Preparation and Exploration 

In [1]:
%%capture
# Due to the configuration of the base Jupyter image, the following package versions are required for the regressions in the assignment to report the correct metrics

import sys 
!{sys.executable} -m pip uninstall statsmodels --yes 
!{sys.executable} -m pip uninstall numpy --yes
!{sys.executable} -m pip uninstall pandas --yes 
!{sys.executable} -m pip uninstall patsy --yes 
!{sys.executable} -m pip install numpy==1.17
!{sys.executable} -m pip install pandas==1.0
!{sys.executable} -m pip install patsy==0.5.2
!{sys.executable} -m pip install statsmodels==0.11.1

In [2]:
#Import Libraries

import pandas as pd
import datetime as dt
import scipy.stats as sp
import numpy as np
import statsmodels.formula.api as sm 

In [3]:
# Import the Shotlog_14_15 and Player_Stats datasets
# Each row of the shot log is one attempted shot
Shotlog_1415=pd.read_csv("Assignment Data/Week 6/Shotlog_14_15.csv")
# Each row of the player stats is one player
Player_Stats=pd.read_csv("Assignment Data/Week 6/Player_Stats_14_15.csv")
pd.set_option('display.max_rows', 250)
pd.set_option('display.max_columns', None)
display(Shotlog_1415)

Unnamed: 0,game_id,date,match,home_team,away_team,home_away,result,final_margin,shot_number,quarter,game_clock,shot_clock,dribbles,touch_time,shot_dist,points,current_shot_outcome,closest_defender,closest_defender_id,closest_def_dist,current_shot_hit,points_earned,shoot_player,player_id,average_hit,shot_count,shot_per_game
0,21400280,5-Dec-14,ATL @ BKN,BKN,ATL,A,W,23,1,1,11:23,5.6,0,1.2,19.6,2,made,"Lopez, Brook",201572,6.6,1,2,al horford,201143,0.541259,715,10
1,21400280,5-Dec-14,ATL @ BKN,BKN,ATL,A,W,23,2,1,10:48,5.6,0,1.1,14.5,2,missed,"Lopez, Brook",201572,5.6,0,0,al horford,201143,0.541259,715,10
2,21400280,5-Dec-14,ATL @ BKN,BKN,ATL,A,W,23,3,1,9:40,11.0,0,0.4,5.8,2,missed,"Lopez, Brook",201572,4.7,0,0,al horford,201143,0.541259,715,10
3,21400280,5-Dec-14,ATL @ BKN,BKN,ATL,A,W,23,4,1,7:09,13.5,0,0.8,20.8,2,missed,"Lopez, Brook",201572,5.8,0,0,al horford,201143,0.541259,715,10
4,21400280,5-Dec-14,ATL @ BKN,BKN,ATL,A,W,23,5,2,5:49,12.9,0,0.9,20.1,2,missed,"Lopez, Brook",201572,6.4,0,0,al horford,201143,0.541259,715,10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
128064,21400350,14-Dec-14,WAS vs. UTA,WAS,UTA,H,W,9,11,3,2:59,21.1,3,4.0,1.7,2,made,"Burke, Trey",203504,4.7,1,2,john wall,202322,0.448513,874,15
128065,21400350,14-Dec-14,WAS vs. UTA,WAS,UTA,H,W,9,12,3,0:43,6.1,2,2.2,17.8,2,made,"Exum, Dante",203957,3.4,1,2,john wall,202322,0.448513,874,15
128066,21400350,14-Dec-14,WAS vs. UTA,WAS,UTA,H,W,9,13,4,2:18,4.3,0,0.0,3.2,2,missed,"Kanter, Enes",202683,1.2,0,0,john wall,202322,0.448513,874,15
128067,21400350,14-Dec-14,WAS vs. UTA,WAS,UTA,H,W,9,14,4,2:18,4.3,0,0.6,3.3,2,made,"Kanter, Enes",202683,1.4,1,2,john wall,202322,0.448513,874,15


In [4]:
#converting the date variable
Shotlog_1415["date"] = pd.to_datetime(Shotlog_1415["date"])
Shotlog_1415.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 128069 entries, 0 to 128068
Data columns (total 27 columns):
 #   Column                Non-Null Count   Dtype         
---  ------                --------------   -----         
 0   game_id               128069 non-null  int64         
 1   date                  128069 non-null  datetime64[ns]
 2   match                 128069 non-null  object        
 3   home_team             128069 non-null  object        
 4   away_team             128069 non-null  object        
 5   home_away             128069 non-null  object        
 6   result                128069 non-null  object        
 7   final_margin          128069 non-null  int64         
 8   shot_number           128069 non-null  int64         
 9   quarter               128069 non-null  int64         
 10  game_clock            128069 non-null  object        
 11  shot_clock            122502 non-null  float64       
 12  dribbles              128069 non-null  int64         
 13 

In [5]:
#summary statistics for shot clock 
print(Shotlog_1415['shot_clock'].describe())

count    122502.000000
mean         12.453344
std           5.763265
min           0.000000
25%           8.200000
50%          12.300000
75%          16.675000
max          24.000000
Name: shot_clock, dtype: float64


In [6]:
#converting the MM:SS game clock to a timedelta
Shotlog_1415['game_clock'] = pd.to_timedelta('00:'+ Shotlog_1415['game_clock'])
Shotlog_1415['game_clock'].describe()

count                    128069
mean     0 days 00:05:51.393811
std      0 days 00:03:27.590603
min             0 days 00:00:00
25%             0 days 00:02:52
50%             0 days 00:05:52
75%             0 days 00:08:51
max             0 days 00:12:00
Name: game_clock, dtype: object

In [7]:
#creating lagged shot hit variable
Shotlog_1415['lag_shot_hit']=Shotlog_1415.sort_values(by=['quarter','game_clock'], ascending=[True, True]).groupby(['shoot_player','date'])['current_shot_hit'].shift(1)
display(Shotlog_1415)

Unnamed: 0,game_id,date,match,home_team,away_team,home_away,result,final_margin,shot_number,quarter,game_clock,shot_clock,dribbles,touch_time,shot_dist,points,current_shot_outcome,closest_defender,closest_defender_id,closest_def_dist,current_shot_hit,points_earned,shoot_player,player_id,average_hit,shot_count,shot_per_game,lag_shot_hit
0,21400280,2014-12-05,ATL @ BKN,BKN,ATL,A,W,23,1,1,00:11:23,5.6,0,1.2,19.6,2,made,"Lopez, Brook",201572,6.6,1,2,al horford,201143,0.541259,715,10,0.0
1,21400280,2014-12-05,ATL @ BKN,BKN,ATL,A,W,23,2,1,00:10:48,5.6,0,1.1,14.5,2,missed,"Lopez, Brook",201572,5.6,0,0,al horford,201143,0.541259,715,10,0.0
2,21400280,2014-12-05,ATL @ BKN,BKN,ATL,A,W,23,3,1,00:09:40,11.0,0,0.4,5.8,2,missed,"Lopez, Brook",201572,4.7,0,0,al horford,201143,0.541259,715,10,0.0
3,21400280,2014-12-05,ATL @ BKN,BKN,ATL,A,W,23,4,1,00:07:09,13.5,0,0.8,20.8,2,missed,"Lopez, Brook",201572,5.8,0,0,al horford,201143,0.541259,715,10,
4,21400280,2014-12-05,ATL @ BKN,BKN,ATL,A,W,23,5,2,00:05:49,12.9,0,0.9,20.1,2,missed,"Lopez, Brook",201572,6.4,0,0,al horford,201143,0.541259,715,10,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
128064,21400350,2014-12-14,WAS vs. UTA,WAS,UTA,H,W,9,11,3,00:02:59,21.1,3,4.0,1.7,2,made,"Burke, Trey",203504,4.7,1,2,john wall,202322,0.448513,874,15,1.0
128065,21400350,2014-12-14,WAS vs. UTA,WAS,UTA,H,W,9,12,3,00:00:43,6.1,2,2.2,17.8,2,made,"Exum, Dante",203957,3.4,1,2,john wall,202322,0.448513,874,15,1.0
128066,21400350,2014-12-14,WAS vs. UTA,WAS,UTA,H,W,9,13,4,00:02:18,4.3,0,0.0,3.2,2,missed,"Kanter, Enes",202683,1.2,0,0,john wall,202322,0.448513,874,15,0.0
128067,21400350,2014-12-14,WAS vs. UTA,WAS,UTA,H,W,9,14,4,00:02:18,4.3,0,0.6,3.3,2,made,"Kanter, Enes",202683,1.4,1,2,john wall,202322,0.448513,874,15,0.0
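
Because `game_clock` counts down within a quarter, an ascending sort on it may place later shots ahead of earlier ones inside each quarter. As a minimal sketch of an alternative, the toy example below lags on `shot_number` instead, assuming that column numbers a player's attempts chronologically within a game. The toy frame and its values are made up for illustration.

```python
import pandas as pd

# Toy shot log: two players in one game, rows deliberately shuffled.
shots = pd.DataFrame({
    "shoot_player": ["a", "b", "a", "a", "b"],
    "game_id": [1, 1, 1, 1, 1],
    "shot_number": [2, 1, 1, 3, 2],
    "current_shot_hit": [0, 1, 1, 1, 0],
})

# Order each player's shots within a game, then shift by one so every row
# carries the outcome of that player's previous attempt in the same game.
ordered = shots.sort_values(["shoot_player", "game_id", "shot_number"])
shots["lag_shot_hit"] = (
    ordered.groupby(["shoot_player", "game_id"])["current_shot_hit"].shift(1)
)

print(shots.sort_values(["shoot_player", "shot_number"]))
```

With this ordering, the first shot of each player in each game has no predecessor, so its lag is missing rather than 0 or 1.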


In [8]:
#error variables
Shotlog_1415['error']=Shotlog_1415['current_shot_hit']-Shotlog_1415['average_hit']
Shotlog_1415['lagerror']=Shotlog_1415['lag_shot_hit']-Shotlog_1415['average_hit']

In [9]:
display(Shotlog_1415)

Unnamed: 0,game_id,date,match,home_team,away_team,home_away,result,final_margin,shot_number,quarter,game_clock,shot_clock,dribbles,touch_time,shot_dist,points,current_shot_outcome,closest_defender,closest_defender_id,closest_def_dist,current_shot_hit,points_earned,shoot_player,player_id,average_hit,shot_count,shot_per_game,lag_shot_hit,error,lagerror
0,21400280,2014-12-05,ATL @ BKN,BKN,ATL,A,W,23,1,1,00:11:23,5.6,0,1.2,19.6,2,made,"Lopez, Brook",201572,6.6,1,2,al horford,201143,0.541259,715,10,0.0,0.458741,-0.541259
1,21400280,2014-12-05,ATL @ BKN,BKN,ATL,A,W,23,2,1,00:10:48,5.6,0,1.1,14.5,2,missed,"Lopez, Brook",201572,5.6,0,0,al horford,201143,0.541259,715,10,0.0,-0.541259,-0.541259
2,21400280,2014-12-05,ATL @ BKN,BKN,ATL,A,W,23,3,1,00:09:40,11.0,0,0.4,5.8,2,missed,"Lopez, Brook",201572,4.7,0,0,al horford,201143,0.541259,715,10,0.0,-0.541259,-0.541259
3,21400280,2014-12-05,ATL @ BKN,BKN,ATL,A,W,23,4,1,00:07:09,13.5,0,0.8,20.8,2,missed,"Lopez, Brook",201572,5.8,0,0,al horford,201143,0.541259,715,10,,-0.541259,
4,21400280,2014-12-05,ATL @ BKN,BKN,ATL,A,W,23,5,2,00:05:49,12.9,0,0.9,20.1,2,missed,"Lopez, Brook",201572,6.4,0,0,al horford,201143,0.541259,715,10,1.0,-0.541259,0.458741
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
128064,21400350,2014-12-14,WAS vs. UTA,WAS,UTA,H,W,9,11,3,00:02:59,21.1,3,4.0,1.7,2,made,"Burke, Trey",203504,4.7,1,2,john wall,202322,0.448513,874,15,1.0,0.551487,0.551487
128065,21400350,2014-12-14,WAS vs. UTA,WAS,UTA,H,W,9,12,3,00:00:43,6.1,2,2.2,17.8,2,made,"Exum, Dante",203957,3.4,1,2,john wall,202322,0.448513,874,15,1.0,0.551487,0.551487
128066,21400350,2014-12-14,WAS vs. UTA,WAS,UTA,H,W,9,13,4,00:02:18,4.3,0,0.0,3.2,2,missed,"Kanter, Enes",202683,1.2,0,0,john wall,202322,0.448513,874,15,0.0,-0.448513,-0.448513
128067,21400350,2014-12-14,WAS vs. UTA,WAS,UTA,H,W,9,14,4,00:02:18,4.3,0,0.6,3.3,2,made,"Kanter, Enes",202683,1.4,1,2,john wall,202322,0.448513,874,15,0.0,0.551487,-0.448513


In [10]:
print(Shotlog_1415['error'].describe())

count    1.280690e+05
mean    -5.770049e-18
std      4.949640e-01
min     -7.124682e-01
25%     -4.491979e-01
50%     -3.850837e-01
75%      5.395973e-01
max      6.914894e-01
Name: error, dtype: float64


In [11]:
print(Shotlog_1415['lagerror'].describe())

count    113726.000000
mean          0.002883
std           0.495546
min          -0.712468
25%          -0.449198
50%          -0.385084
75%           0.541071
max           0.691489
Name: lagerror, dtype: float64


## Part 2 - Conditional Probability and Autocorrelation

#### Conditional Probability
We can first calculate the conditional probability of making a shot in the current period conditional on making the previous shot. 
$$Conditional \ Probability=\frac{Probability \ of \ Making \ Consecutive \ Shots}{Probability \ of \ Making \ Previous \ Shot}$$
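
To make the formula concrete, here is a small worked example on a made-up sequence of shots for a single player. It only illustrates the arithmetic and is not drawn from the shot log.

```python
# One player's shots in order: 1 = made, 0 = missed (made-up sequence).
shots = [1, 1, 0, 1, 1, 1, 0, 0, 1, 1]

# Pair every shot with the one before it; the first shot has no predecessor.
pairs = list(zip(shots[:-1], shots[1:]))

# Probability that the previous shot was made.
p_prev_made = sum(prev for prev, _ in pairs) / len(pairs)
# Probability that both the previous and the current shot were made.
p_both_made = sum(prev * cur for prev, cur in pairs) / len(pairs)

conditional = p_both_made / p_prev_made
print(round(conditional, 3))
```

Here 6 of the 9 pairs start with a made shot and 4 of those are followed by another make, so the ratio reduces to 4/6.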

We will need to create a variable that indicates a player made consecutive shots.

In [12]:
Shotlog_1415['conse_shot_hit'] = np.where((Shotlog_1415['current_shot_hit']==1)&(Shotlog_1415['lag_shot_hit']==1), 1, 0) 
Shotlog_1415.head()

Unnamed: 0,game_id,date,match,home_team,away_team,home_away,result,final_margin,shot_number,quarter,game_clock,shot_clock,dribbles,touch_time,shot_dist,points,current_shot_outcome,closest_defender,closest_defender_id,closest_def_dist,current_shot_hit,points_earned,shoot_player,player_id,average_hit,shot_count,shot_per_game,lag_shot_hit,error,lagerror,conse_shot_hit
0,21400280,2014-12-05,ATL @ BKN,BKN,ATL,A,W,23,1,1,00:11:23,5.6,0,1.2,19.6,2,made,"Lopez, Brook",201572,6.6,1,2,al horford,201143,0.541259,715,10,0.0,0.458741,-0.541259,0
1,21400280,2014-12-05,ATL @ BKN,BKN,ATL,A,W,23,2,1,00:10:48,5.6,0,1.1,14.5,2,missed,"Lopez, Brook",201572,5.6,0,0,al horford,201143,0.541259,715,10,0.0,-0.541259,-0.541259,0
2,21400280,2014-12-05,ATL @ BKN,BKN,ATL,A,W,23,3,1,00:09:40,11.0,0,0.4,5.8,2,missed,"Lopez, Brook",201572,4.7,0,0,al horford,201143,0.541259,715,10,0.0,-0.541259,-0.541259,0
3,21400280,2014-12-05,ATL @ BKN,BKN,ATL,A,W,23,4,1,00:07:09,13.5,0,0.8,20.8,2,missed,"Lopez, Brook",201572,5.8,0,0,al horford,201143,0.541259,715,10,,-0.541259,,0
4,21400280,2014-12-05,ATL @ BKN,BKN,ATL,A,W,23,5,2,00:05:49,12.9,0,0.9,20.1,2,missed,"Lopez, Brook",201572,6.4,0,0,al horford,201143,0.541259,715,10,1.0,-0.541259,0.458741,0


We can create a player level dataframe. The average of the variable "conse_shot_hit" would be the joint probability of making current and previous shots. We will also calculate the average of "lag_shot_hit" to indicate the probability of making the previous shot.

In [13]:
Player_Prob=Shotlog_1415.groupby(['shoot_player'])[['conse_shot_hit','lag_shot_hit']].mean()
Player_Prob=Player_Prob.reset_index()
Player_Prob.rename(columns={'lag_shot_hit':'average_lag_hit'}, inplace=True)
Player_Prob.head()

Unnamed: 0,shoot_player,conse_shot_hit,average_lag_hit
0,aaron brooks,0.151515,0.426
1,aaron gordon,0.201923,0.532468
2,al farouq aminu,0.155039,0.446078
3,al horford,0.271329,0.536474
4,al jefferson,0.215,0.478667


In [14]:
Player_Prob['conditional_prob']=Player_Prob['conse_shot_hit']/Player_Prob['average_lag_hit']
Player_Prob.head()

Unnamed: 0,shoot_player,conse_shot_hit,average_lag_hit,conditional_prob
0,aaron brooks,0.151515,0.426,0.355669
1,aaron gordon,0.201923,0.532468,0.379221
2,al farouq aminu,0.155039,0.446078,0.347559
3,al horford,0.271329,0.536474,0.505763
4,al jefferson,0.215,0.478667,0.449164
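
One detail worth checking: `np.where` sets `conse_shot_hit` to 0 on rows where `lag_shot_hit` is missing, while `mean()` skips missing values in `lag_shot_hit`. The numerator and denominator are then averaged over different sets of rows, which may pull the ratio down. The sketch below, on a made-up frame, restricts both averages to shots that have a previous shot. It is an illustration under that assumption, not a replacement for the assignment's method.

```python
import numpy as np
import pandas as pd

# Made-up log for one player across two games (NaN marks a game's first shot).
log = pd.DataFrame({
    "shoot_player": ["a"] * 6,
    "current_shot_hit": [1, 1, 0, 1, 1, 0],
    "lag_shot_hit": [np.nan, 1, 1, np.nan, 1, 1],
})

# Keep only shots that have a previous shot in the same game.
with_prev = log.dropna(subset=["lag_shot_hit"]).copy()
with_prev["conse_shot_hit"] = (
    (with_prev["current_shot_hit"] == 1) & (with_prev["lag_shot_hit"] == 1)
).astype(int)

# Joint and marginal probabilities are now averaged over the same rows.
prob = with_prev.groupby("shoot_player")[["conse_shot_hit", "lag_shot_hit"]].mean()
prob["conditional_prob"] = prob["conse_shot_hit"] / prob["lag_shot_hit"]
print(prob)
```

In this toy case the player makes 2 of the 4 shots that follow a made shot, so the conditional probability is 0.5. Averaging `conse_shot_hit` over all 6 rows instead would give 2/6.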


In [15]:
Player_Stats=pd.merge(Player_Prob, Player_Stats, on=['shoot_player'])
Player_Stats.head(10)

Unnamed: 0,shoot_player,conse_shot_hit,average_lag_hit,conditional_prob,average_hit
0,aaron brooks,0.151515,0.426,0.355669,0.41533
1,aaron gordon,0.201923,0.532468,0.379221,0.528846
2,al farouq aminu,0.155039,0.446078,0.347559,0.430233
3,al horford,0.271329,0.536474,0.505763,0.541259
4,al jefferson,0.215,0.478667,0.449164,0.4775
5,alan anderson,0.163205,0.448029,0.364273,0.433234
6,alan crabbe,0.117021,0.460317,0.254219,0.425532
7,alex len,0.247492,0.539419,0.458811,0.528428
8,alexis ajinca,0.274882,0.586826,0.468421,0.597156
9,alonzo gee,0.137681,0.441176,0.312077,0.478261


In [16]:
#summary statistics 
Player_Stats[["conse_shot_hit", "average_lag_hit","conditional_prob"]].describe()

Unnamed: 0,conse_shot_hit,average_lag_hit,conditional_prob
count,281.0,281.0,281.0
mean,0.175405,0.455115,0.379687
std,0.049693,0.056983,0.065492
min,0.071429,0.329167,0.188571
25%,0.145098,0.418511,0.3402
50%,0.172043,0.450495,0.384804
75%,0.200837,0.482517,0.418332
max,0.442748,0.700906,0.631679


In [17]:
#t-test
sp.ttest_ind(Player_Stats['conditional_prob'], Player_Stats['average_hit'])

Ttest_indResult(statistic=-13.624632774711136, pvalue=1.0310757346280856e-36)

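Given the p-value (about 1e-36), the difference between the two means is highly statistically significant. The t-statistic is negative, so the conditional probability of making a shot after a made shot (mean of about 0.38) is on average lower than players' overall hit rate, which does not support a hot hand.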
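
Both columns hold one value per player for the same 281 players, so a paired comparison is a natural alternative to the independent-samples test. The sketch below computes a paired t-statistic by hand on made-up values. `scipy.stats.ttest_rel` performs the same paired test and also returns a p-value.

```python
import math
import statistics

# Made-up per-player values; the real inputs would be the
# conditional_prob and average_hit columns of Player_Stats.
conditional_prob = [0.36, 0.38, 0.35, 0.51, 0.45]
average_hit = [0.42, 0.53, 0.43, 0.54, 0.48]

# Paired test: work with the within-player differences.
diffs = [c - a for c, a in zip(conditional_prob, average_hit)]
n = len(diffs)
mean_diff = statistics.mean(diffs)
std_err = statistics.stdev(diffs) / math.sqrt(n)

t_stat = mean_diff / std_err
print(t_stat, "with", n - 1, "degrees of freedom")
```

A negative statistic here means the conditional probability sits below the average hit rate for the typical player in the toy data.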

In [18]:
#autocorrelation coefficient
Shotlog_1415['current_shot_hit'].corr(Shotlog_1415['lag_shot_hit'])

-0.011959455787768518
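
For a sense of scale, the sketch below simulates shots that are independent by construction and computes the same lag-1 autocorrelation. The 45% make probability and the sample size are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Independent shots with a fixed 45% make probability (no hot hand by design).
shots = (rng.random(100_000) < 0.45).astype(float)

# Lag-1 autocorrelation: correlate each shot with the one before it.
r = np.corrcoef(shots[1:], shots[:-1])[0, 1]
print(r)
```

A coefficient close to zero is what independent shots produce. A hot hand would show up as a clearly positive value.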

## Part 3 - Regression Analyses

We can regress the error of the current shot on the error of the previous shot to test whether the previous outcome helps predict the current one.

In [19]:
reg1 = sm.ols(formula = 'error ~ lagerror', data= Shotlog_1415).fit()
print(reg1.summary())

                            OLS Regression Results                            
Dep. Variable:                  error   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                  0.000
Method:                 Least Squares   F-statistic:                     56.70
Date:                Tue, 09 May 2023   Prob (F-statistic):           5.12e-14
Time:                        15:33:43   Log-Likelihood:                -81459.
No. Observations:              113726   AIC:                         1.629e+05
Df Residuals:                  113724   BIC:                         1.629e+05
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.0036      0.001      2.468      0.0
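
As a rough illustration of what a regression of `error` on `lagerror` estimates, the sketch below fits the same one-regressor model with plain NumPy on simulated independent shots. With a single regressor, the slope equals the covariance of current and lagged error divided by the variance of lagged error. The data and the 45% make probability are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated independent shots for a single player.
hit = (rng.random(50_000) < 0.45).astype(float)
error = hit - hit.mean()            # deviation from the player's average
cur, lag = error[1:], error[:-1]    # current error and previous-shot error

# Least-squares fit of cur = b0 + b1 * lag.
X = np.column_stack([np.ones_like(lag), lag])
(b0, b1), *_ = np.linalg.lstsq(X, cur, rcond=None)

# With one regressor, the slope is cov(cur, lag) / var(lag).
slope_check = np.cov(cur, lag)[0, 1] / np.var(lag, ddof=1)
print(b1, slope_check)
```

For independent shots the slope should be near zero. A positive and sizeable slope on real data would be evidence in favour of a hot hand.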

Let's add control variables to the regression.

In [20]:
reg2 = sm.ols(formula = 'error ~ lagerror+shot_dist+dribbles+touch_time+C(points)+C(quarter)+home_away+closest_def_dist', data= Shotlog_1415).fit()
print(reg2.summary())

                            OLS Regression Results                            
Dep. Variable:                  error   R-squared:                       0.040
Model:                            OLS   Adj. R-squared:                  0.040
Method:                 Least Squares   F-statistic:                     365.6
Date:                Tue, 09 May 2023   Prob (F-statistic):               0.00
Time:                        15:33:52   Log-Likelihood:                -79159.
No. Observations:              113726   AIC:                         1.583e+05
Df Residuals:                  113712   BIC:                         1.585e+05
Df Model:                          13                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept            0.1176      0.005  

Next, weight each observation by the inverse of the player's shots per game (`1/shot_per_game`).

In [21]:
reg3 = sm.ols(formula = 'error ~ lagerror+shot_dist+dribbles+touch_time+C(points)+C(quarter)+home_away+closest_def_dist', weights=1/Shotlog_1415['shot_per_game'], data= Shotlog_1415).fit()
print(reg3.summary())

                            OLS Regression Results                            
Dep. Variable:                  error   R-squared:                      -9.895
Model:                            OLS   Adj. R-squared:                 -9.896
Method:                 Least Squares   F-statistic:                    -7944.
Date:                Tue, 09 May 2023   Prob (F-statistic):               1.00
Time:                        15:34:01   Log-Likelihood:                -79159.
No. Observations:              113726   AIC:                         1.583e+05
Df Residuals:                  113712   BIC:                         1.585e+05
Df Model:                          13                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept            0.1176      0.005  
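
The summary above is still titled OLS and reports the same intercept and log-likelihood as the unweighted model, along with a negative R-squared, so the weights may not have entered the estimation as intended. As a minimal sketch of what a weighted fit does, the example below runs weighted least squares with plain NumPy on simulated data. The variables and the weight range are made up. In statsmodels, the formula API also exposes a `wls` function that accepts `weights`.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2_000

# Simulated data: y depends linearly on x plus noise.
x = rng.normal(size=n)
y = 0.5 + 2.0 * x + rng.normal(size=n)

# Stand-in for 1 / shot_per_game (hypothetical values between 1/19 and 1/5).
w = 1.0 / rng.integers(5, 20, size=n)

X = np.column_stack([np.ones(n), x])

# WLS is OLS after scaling every row by the square root of its weight.
sw = np.sqrt(w)
beta_wls, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)

# Same estimate from the weighted normal equations (X'WX) b = X'Wy.
beta_check = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * y))
print(beta_wls)
```

Down-weighting high-volume shooters in this way keeps players with many attempts per game from dominating the estimate.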