### John Lee
### BrainStation, Data Science

# Notebook Part 1

The NBA has been ramping up its use of data over the last decade. More teams than ever rely on data to find the best available shot or to gain an advantage on the defensive end. Following this trend, this notebook attempts to model a niche segment of the game: predicting the effect of a timeout on whether the next field goal attempt is made. A made field goal is any 2-point or 3-point basket.

With the results of the model, the following will be investigated:
    
   1. The features of the model that can be utilized in actual gameplay to give a strategic advantage
   2. Predicting the outcome of a shot to create an optimal strategy for an offense and defense when calling a timeout
   3. Scenarios in a game where the model would be useful
    
This project uses NBA play-by-play data scraped from the official [Basketball Reference website](https://www.basketball-reference.com/). Each row represents a single play of a game in chronological order, and the columns hold the details of that play (e.g. rebounder, play made by home team, home score at the time). The data was scraped and uploaded by an unknown Kaggle user and can be found [here](https://www.kaggle.com/datasets/schmadam97/nba-playbyplay-data-20182019?select=NBA_PBP_2020-21.csv). Four seasons of data are used (2017-18 through 2020-21).



### Feature Engineering and Data Transformation

--- 

Before going into EDA, we will first transform the data by feature engineering the dependent variable. The dependent variable will be the result (make/miss) of the first shot attempt after a timeout.

In [3]:
import numpy as np
import pandas as pd
from scipy import stats

In [2]:
df1 = pd.read_csv('NBA_PBP_2017-18.csv')
df2 = pd.read_csv('NBA_PBP_2018-19.csv')
df3 = pd.read_csv('NBA_PBP_2019-20.csv')
df4 = pd.read_csv('NBA_PBP_2020-21.csv')

df = pd.concat([df1, df2, df3, df4], ignore_index=True)
df.head(45)

Unnamed: 0,URL,GameType,Location,Date,Time,WinningTeam,Quarter,SecLeft,AwayTeam,AwayPlay,...,EnterGame,LeaveGame,TurnoverPlayer,TurnoverType,TurnoverCause,TurnoverCauser,JumpballAwayPlayer,JumpballHomePlayer,JumpballPoss,Unnamed: 40
0,/boxscores/201710170CLE.html,regular,Quicken Loans Arena Cleveland Ohio,October 17 2017,8:01 PM,CLE,1,720,BOS,Jump ball: K. Love vs. A. Horford (K. Irving g...,...,,,,,,,K. Love - loveke01,A. Horford - horfoal01,K. Irving - irvinky01,
1,/boxscores/201710170CLE.html,regular,Quicken Loans Arena Cleveland Ohio,October 17 2017,8:01 PM,CLE,1,704,BOS,K. Irving makes 2-pt jump shot from 10 ft (ass...,...,,,,,,,,,,
2,/boxscores/201710170CLE.html,regular,Quicken Loans Arena Cleveland Ohio,October 17 2017,8:01 PM,CLE,1,687,BOS,,...,,,,,,,,,,
3,/boxscores/201710170CLE.html,regular,Quicken Loans Arena Cleveland Ohio,October 17 2017,8:01 PM,CLE,1,683,BOS,Defensive rebound by A. Horford,...,,,,,,,,,,
4,/boxscores/201710170CLE.html,regular,Quicken Loans Arena Cleveland Ohio,October 17 2017,8:01 PM,CLE,1,681,BOS,G. Hayward misses 3-pt jump shot from 25 ft,...,,,,,,,,,,
5,/boxscores/201710170CLE.html,regular,Quicken Loans Arena Cleveland Ohio,October 17 2017,8:01 PM,CLE,1,678,BOS,,...,,,,,,,,,,
6,/boxscores/201710170CLE.html,regular,Quicken Loans Arena Cleveland Ohio,October 17 2017,8:01 PM,CLE,1,662,BOS,,...,,,,,,,,,,
7,/boxscores/201710170CLE.html,regular,Quicken Loans Arena Cleveland Ohio,October 17 2017,8:01 PM,CLE,1,659,BOS,Defensive rebound by J. Brown,...,,,,,,,,,,
8,/boxscores/201710170CLE.html,regular,Quicken Loans Arena Cleveland Ohio,October 17 2017,8:01 PM,CLE,1,651,BOS,J. Tatum misses 2-pt layup from 2 ft (block by...,...,,,,,,,,,,
9,/boxscores/201710170CLE.html,regular,Quicken Loans Arena Cleveland Ohio,October 17 2017,8:01 PM,CLE,1,651,BOS,Offensive rebound by Team,...,,,,,,,,,,


To simplify the data, we will drop the columns that are not relevant to a timeout or to the make or miss of a shot, as well as features that might cause multicollinearity (e.g. Fouler and Fouled).

In [3]:
%%time
#dropping columns that are not relevant
df = df.drop(['Blocker', 'Assister', 'FoulType', 'Fouled', 'ViolationType', 
              'FreeThrowShooter', 'FreeThrowOutcome', 'FreeThrowNum','EnterGame',
              'LeaveGame','TurnoverType','TurnoverCause','TurnoverCauser',
              'JumpballAwayPlayer','JumpballHomePlayer','JumpballPoss', 'URL','Shooter', 'ViolationPlayer', 'Game_duration_min'], axis=1)
df.drop(df.columns[df.columns.str.contains('unnamed',case = False)],axis = 1, inplace = True)
df


Wall time: 1.74 s


Unnamed: 0,URL,GameType,Location,Date,Time,WinningTeam,Quarter,SecLeft,AwayTeam,AwayPlay,...,Shooter,ShotType,ShotOutcome,ShotDist,Fouler,Rebounder,ReboundType,ViolationPlayer,TimeoutTeam,TurnoverPlayer
0,/boxscores/201710170CLE.html,regular,Quicken Loans Arena Cleveland Ohio,October 17 2017,8:01 PM,CLE,1,720,BOS,Jump ball: K. Love vs. A. Horford (K. Irving g...,...,,,,,,,,,,
1,/boxscores/201710170CLE.html,regular,Quicken Loans Arena Cleveland Ohio,October 17 2017,8:01 PM,CLE,1,704,BOS,K. Irving makes 2-pt jump shot from 10 ft (ass...,...,K. Irving - irvinky01,2-pt jump shot,make,10.0,,,,,,
2,/boxscores/201710170CLE.html,regular,Quicken Loans Arena Cleveland Ohio,October 17 2017,8:01 PM,CLE,1,687,BOS,,...,D. Rose - rosede01,2-pt layup,miss,1.0,,,,,,
3,/boxscores/201710170CLE.html,regular,Quicken Loans Arena Cleveland Ohio,October 17 2017,8:01 PM,CLE,1,683,BOS,Defensive rebound by A. Horford,...,,,,,,A. Horford - horfoal01,defensive,,,
4,/boxscores/201710170CLE.html,regular,Quicken Loans Arena Cleveland Ohio,October 17 2017,8:01 PM,CLE,1,681,BOS,G. Hayward misses 3-pt jump shot from 25 ft,...,G. Hayward - haywago01,3-pt jump shot,miss,25.0,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1842317,/boxscores/202101150UTA.html,regular,Vivint Smart Home Arena Salt Lake City Utah,January 15 2021,9:00 PM,UTA,4,15,ATL,,...,S. Harrison - harrish01,2-pt layup,miss,3.0,,,,,,
1842318,/boxscores/202101150UTA.html,regular,Vivint Smart Home Arena Salt Lake City Utah,January 15 2021,9:00 PM,UTA,4,13,ATL,,...,,,,,,U. Azubuike - azubuud01,offensive,,,
1842319,/boxscores/202101150UTA.html,regular,Vivint Smart Home Arena Salt Lake City Utah,January 15 2021,9:00 PM,UTA,4,10,ATL,,...,,,,,,,,,,U. Azubuike - azubuud01
1842320,/boxscores/202101150UTA.html,regular,Vivint Smart Home Arena Salt Lake City Utah,January 15 2021,9:00 PM,UTA,4,0,ATL,End of 4th quarter,...,,,,,,,,,,
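
The column pruning above can be sketched on a toy frame. This is a minimal illustration rather than the notebook's own code: the helper name `prune_columns` and the toy data are assumptions. Passing `errors='ignore'` lets the drop tolerate a column name that is missing from the frame.

```python
import pandas as pd

def prune_columns(frame: pd.DataFrame, to_drop: list[str]) -> pd.DataFrame:
    """Return a copy without `to_drop` and without any 'Unnamed: N' columns."""
    # errors='ignore' skips labels that are not present instead of raising KeyError
    pruned = frame.drop(columns=to_drop, errors='ignore')
    # pandas names a column 'Unnamed: N' when its CSV header cell is empty
    keep = ~pruned.columns.str.contains('unnamed', case=False)
    return pruned.loc[:, keep]

# Hypothetical toy data with a small subset of the real column names
toy = pd.DataFrame({
    'Unnamed: 0': [0, 1],
    'ShotType': ['2-pt layup', None],
    'Blocker': [None, None],
    'TimeoutTeam': [None, 'BOS'],
})
# 'Assister' is absent from the toy frame on purpose
pruned_toy = prune_columns(toy, ['Blocker', 'Assister'])
print(list(pruned_toy.columns))
```

Returning a new frame instead of using `inplace=True` keeps the original available for comparison while experimenting.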


In the data, we want to distinguish each play by the game it belongs to. To do so, we will create a GameID column that labels each play with its game, using the 'End of Game' rows in AwayPlay as the boundaries between games.

In [4]:
%%time
# Create the GameID column: mark the last row of each game, leave all other rows as None
df['GameID'] = np.where(df['AwayPlay'] == 'End of Game', 'EndOfGame', None)

Wall time: 125 ms


In [5]:
end_of_game_index = df.dropna(axis=0, subset=['GameID']).index

In [6]:
%%time
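# Populate GameID: rows after the previous 'End of Game' marker, up to and
# including the current marker, belong to the same game.
# Column -1 is GameID because it was the last column added.
GameID = 1
PreviousGameIndex = -1
for i in end_of_game_index:
    df.iloc[(PreviousGameIndex+1):(i+1),-1] = GameID
    GameID += 1
    PreviousGameIndex = i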

Wall time: 1min 59s
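
The same labelling can also be expressed without a Python loop. The sketch below is an alternative to the cell above, not the code that was run; `assign_game_ids` and the toy Series are hypothetical names. It counts how many 'End of Game' markers occur strictly before each row and adds 1.

```python
import pandas as pd

def assign_game_ids(away_play: pd.Series) -> pd.Series:
    """Give every row a 1-based game number; a marker row closes its own game."""
    # 1 on 'End of Game' rows, 0 elsewhere (missing plays compare as False)
    is_end = away_play.eq('End of Game').astype(int)
    # cumsum counts markers up to and including the row; subtracting is_end
    # leaves the number of markers strictly before it
    return is_end.cumsum() - is_end + 1

# Toy column with two games of three and two rows
toy_plays = pd.Series(['Jump ball', None, 'End of Game', 'Jump ball', 'End of Game'])
toy_ids = assign_game_ids(toy_plays)
print(toy_ids.tolist())
```

One difference from the loop: any rows after the final 'End of Game' marker would receive the next ID here, whereas the loop leaves them as None.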


In [7]:
df.head(471)

Unnamed: 0,URL,GameType,Location,Date,Time,WinningTeam,Quarter,SecLeft,AwayTeam,AwayPlay,...,ShotType,ShotOutcome,ShotDist,Fouler,Rebounder,ReboundType,ViolationPlayer,TimeoutTeam,TurnoverPlayer,GameID
0,/boxscores/201710170CLE.html,regular,Quicken Loans Arena Cleveland Ohio,October 17 2017,8:01 PM,CLE,1,720,BOS,Jump ball: K. Love vs. A. Horford (K. Irving g...,...,,,,,,,,,,1
1,/boxscores/201710170CLE.html,regular,Quicken Loans Arena Cleveland Ohio,October 17 2017,8:01 PM,CLE,1,704,BOS,K. Irving makes 2-pt jump shot from 10 ft (ass...,...,2-pt jump shot,make,10.0,,,,,,,1
2,/boxscores/201710170CLE.html,regular,Quicken Loans Arena Cleveland Ohio,October 17 2017,8:01 PM,CLE,1,687,BOS,,...,2-pt layup,miss,1.0,,,,,,,1
3,/boxscores/201710170CLE.html,regular,Quicken Loans Arena Cleveland Ohio,October 17 2017,8:01 PM,CLE,1,683,BOS,Defensive rebound by A. Horford,...,,,,,A. Horford - horfoal01,defensive,,,,1
4,/boxscores/201710170CLE.html,regular,Quicken Loans Arena Cleveland Ohio,October 17 2017,8:01 PM,CLE,1,681,BOS,G. Hayward misses 3-pt jump shot from 25 ft,...,3-pt jump shot,miss,25.0,,,,,,,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
466,/boxscores/201710170CLE.html,regular,Quicken Loans Arena Cleveland Ohio,October 17 2017,8:01 PM,CLE,4,2,BOS,Offensive rebound by K. Irving,...,,,,,K. Irving - irvinky01,offensive,,,,1
467,/boxscores/201710170CLE.html,regular,Quicken Loans Arena Cleveland Ohio,October 17 2017,8:01 PM,CLE,4,0,BOS,K. Irving misses 3-pt jump shot from 26 ft,...,3-pt jump shot,miss,26.0,,,,,,,1
468,/boxscores/201710170CLE.html,regular,Quicken Loans Arena Cleveland Ohio,October 17 2017,8:01 PM,CLE,4,0,BOS,Offensive rebound by Team,...,,,,,Team,offensive,,,,1
469,/boxscores/201710170CLE.html,regular,Quicken Loans Arena Cleveland Ohio,October 17 2017,8:01 PM,CLE,3,0,BOS,End of Game,...,,,,,,,,,,1


We can confirm that each row has been grouped into their corresponding games.

To engineer the dependent variable, we need a for loop that crawls through the rows and finds the first shot after each timeout. The loop compares HomePlay and AwayPlay against 0 to decide which team took the shot, so the NaN values in those two columns must be replaced with 0 before it can run.

In [11]:
df['AwayPlay']

0          Jump ball: K. Love vs. A. Horford (K. Irving g...
1          K. Irving makes 2-pt jump shot from 10 ft (ass...
2                                                        NaN
3                            Defensive rebound by A. Horford
4                G. Hayward misses 3-pt jump shot from 25 ft
                                 ...                        
1842317                                                  NaN
1842318                                                  NaN
1842319                                                  NaN
1842320                                   End of 4th quarter
1842321                                          End of Game
Name: AwayPlay, Length: 1842322, dtype: object

In [12]:
df['AwayPlay'] = df['AwayPlay'].replace(np.nan, 0)

In [13]:
df['HomePlay'] = df['HomePlay'].replace(np.nan, 0)

In [14]:
df[['HomePlay','AwayPlay']]

Unnamed: 0,HomePlay,AwayPlay
0,0,Jump ball: K. Love vs. A. Horford (K. Irving g...
1,0,K. Irving makes 2-pt jump shot from 10 ft (ass...
2,D. Rose misses 2-pt layup from 1 ft (block by ...,0
3,0,Defensive rebound by A. Horford
4,0,G. Hayward misses 3-pt jump shot from 25 ft
...,...,...
1842317,S. Harrison misses 2-pt layup from 3 ft,0
1842318,Offensive rebound by U. Azubuike,0
1842319,Turnover by U. Azubuike (bad pass; steal by O....,0
1842320,0,End of 4th quarter
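
Why the replacement matters can be shown on two toy rows. This is an illustrative sketch with made-up variable names: NaN compares unequal to every value, so without the replacement a missing play would still pass a `!= 0` test.

```python
import numpy as np
import pandas as pd

# Two toy rows: an away play followed by a home play
plays = pd.DataFrame({
    'HomePlay': [np.nan, 'D. Rose misses 2-pt layup from 1 ft'],
    'AwayPlay': ['K. Irving makes 2-pt jump shot from 10 ft', np.nan],
})

# Before the replacement: NaN != 0 evaluates to True, so both rows look like home plays
raw_home_flags = [value != 0 for value in plays['HomePlay']]

# After the replacement: only the real home play passes the test
filled_home_flags = [value != 0 for value in plays['HomePlay'].replace(np.nan, 0)]

# An equivalent test that needs no placeholder value
notna_home_flags = plays['HomePlay'].notna().tolist()

print(raw_home_flags, filled_home_flags, notna_home_flags)
```

Testing with `notna()` would avoid mixing the integer 0 into a text column, at the cost of changing the conditions used in the loop.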


In [17]:
%%time
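# Make a column that records, on each timeout row, the result of the first shot after it
df['FirstBasketAfterTO'] = 'n/a'
# Row indices of all timeouts and of all shot attempts
timeoutindexseries = df.dropna(axis=0, subset=['TimeoutTeam']).index
shotindexseries = df.dropna(axis=0, subset=['ShotType']).index
for i in timeoutindexseries:
    # The smallest positive offset is the distance to the first shot after timeout i
    differenceseries = shotindexseries - i
    timeoutshotindex = i+min(filter(lambda a: a > 0, differenceseries))
    # Only label the timeout when that shot belongs to the same game
    if df['GameID'][timeoutshotindex] == df['GameID'][i]: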
        if '3-pt' in df['ShotType'][timeoutshotindex] and df['ShotOutcome'][timeoutshotindex] == 'make' and df['HomePlay'][timeoutshotindex] != 0:
            typeofshot = '3Home'
        elif '3-pt' in df['ShotType'][timeoutshotindex] and df['ShotOutcome'][timeoutshotindex] == 'make' and df['AwayPlay'][timeoutshotindex] != 0:
            typeofshot = '3Away'
        elif '2-pt' in df['ShotType'][timeoutshotindex] and df['ShotOutcome'][timeoutshotindex] == 'make' and df['HomePlay'][timeoutshotindex] != 0:
            typeofshot = '2Home'
        elif '2-pt' in df['ShotType'][timeoutshotindex] and df['ShotOutcome'][timeoutshotindex] == 'make' and df['AwayPlay'][timeoutshotindex] != 0:
            typeofshot = '2Away'
        elif '3-pt' in df['ShotType'][timeoutshotindex] and df['ShotOutcome'][timeoutshotindex] == 'miss' and df['HomePlay'][timeoutshotindex] != 0:
            typeofshot = '3HMiss'
        elif '3-pt' in df['ShotType'][timeoutshotindex] and df['ShotOutcome'][timeoutshotindex] == 'miss' and df['AwayPlay'][timeoutshotindex] != 0:
            typeofshot = '3AMiss'
        elif '2-pt' in df['ShotType'][timeoutshotindex] and df['ShotOutcome'][timeoutshotindex] == 'miss' and df['HomePlay'][timeoutshotindex] != 0:
            typeofshot = '2HMiss'
        elif '2-pt' in df['ShotType'][timeoutshotindex] and df['ShotOutcome'][timeoutshotindex] == 'miss' and df['AwayPlay'][timeoutshotindex] != 0:
            typeofshot = '2AMiss'
        # .loc writes into df directly instead of assigning through a chained lookup
        df.loc[i, 'FirstBasketAfterTO'] = typeofshot

Wall time: 9h 25min 59s


In [20]:
%%time
df.to_csv('df1.csv', index=False)

Wall time: 15.5 s


The feature-engineered dependent variable has been created successfully. Because the loop took about 9.5 hours to run, the result was exported to df1.csv so that it can be read directly instead of being recomputed.
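
The export-and-reload pattern can be sketched as follows. The helper `load_or_build`, the stand-in `slow_build` function and the temporary file name are assumptions made for illustration. The sketch also highlights one point to watch when reading the file back: `pd.read_csv` treats the string 'n/a' as a missing value by default, so the placeholder comes back as NaN.

```python
import tempfile
from pathlib import Path

import pandas as pd

def load_or_build(cache_path: Path, build) -> pd.DataFrame:
    """Read the cached CSV if it exists; otherwise build the frame and cache it."""
    if cache_path.exists():
        return pd.read_csv(cache_path)
    frame = build()
    frame.to_csv(cache_path, index=False)
    return frame

calls = []

def slow_build() -> pd.DataFrame:
    # Stand-in for the long-running feature engineering loop
    calls.append(1)
    return pd.DataFrame({'GameID': [1, 1],
                         'FirstBasketAfterTO': ['3Home', 'n/a']})

with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / 'df1_toy.csv'
    first = load_or_build(cache, slow_build)   # builds and writes the file
    second = load_or_build(cache, slow_build)  # reads the file, no rebuild

# The 'n/a' placeholder is parsed as NaN on reload
print(len(calls), second['FirstBasketAfterTO'].isna().tolist())
```

Rows without a label can therefore be selected with `isna()` after reloading, or the default parsing can be adjusted through the `na_values` and `keep_default_na` parameters of `pd.read_csv`.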

This was the most complicated part of the project: for every timeout, we had to find the first shot that comes after the timeout's index and belongs to the same game, then record the result at the index of the timeout.
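
A faster way to perform the same lookup is a binary search over the sorted shot positions. The sketch below is an assumption-laden alternative, not the code that produced df1.csv: the function name `label_first_shot_after_timeout` and the toy frame are hypothetical, and it assumes the rows are in chronological order with HomePlay using 0 as the placeholder for a missing play.

```python
import numpy as np
import pandas as pd

def label_first_shot_after_timeout(df: pd.DataFrame) -> pd.Series:
    """Return one label per row: the shot result on timeout rows, 'n/a' elsewhere."""
    labels = pd.Series('n/a', index=df.index, dtype=object)

    # Integer positions of timeout rows and shot rows (both already sorted)
    timeout_pos = np.flatnonzero(df['TimeoutTeam'].notna().to_numpy())
    shot_pos = np.flatnonzero(df['ShotType'].notna().to_numpy())

    # For each timeout, binary-search the first shot positioned after it
    nxt = np.searchsorted(shot_pos, timeout_pos, side='right')
    found = nxt < len(shot_pos)  # timeouts with no later shot are skipped
    timeout_pos, shot_rows = timeout_pos[found], shot_pos[nxt[found]]

    # Keep only pairs where the shot is in the same game as the timeout
    game = df['GameID'].to_numpy()
    same_game = game[shot_rows] == game[timeout_pos]
    timeout_pos, shot_rows = timeout_pos[same_game], shot_rows[same_game]

    shots = df.iloc[shot_rows]
    is_three = shots['ShotType'].str.contains('3-pt', regex=False).to_numpy(dtype=bool)
    points = np.where(is_three, '3', '2')
    made = shots['ShotOutcome'].eq('make').to_numpy()
    # Home took the shot when HomePlay is neither missing nor the 0 placeholder
    home = (shots['HomePlay'].notna() & shots['HomePlay'].ne(0)).to_numpy()
    suffix = np.where(made, np.where(home, 'Home', 'Away'),
                      np.where(home, 'HMiss', 'AMiss'))
    labels.iloc[timeout_pos] = [f'{p}{s}' for p, s in zip(points, suffix)]
    return labels

# Toy frame: three timeouts in game 1; the last one has no later shot in that game
toy = pd.DataFrame({
    'GameID':      [1, 1, 1, 1, 1, 1, 2],
    'TimeoutTeam': ['BOS', None, None, 'CLE', None, 'BOS', None],
    'ShotType':    [None, None, '3-pt jump shot', None, '2-pt layup', None, '2-pt dunk'],
    'ShotOutcome': [None, None, 'make', None, 'miss', None, 'make'],
    'HomePlay':    [0, 0, 'makes 3-pt jump shot', 0, 0, 0, 'makes 2-pt dunk'],
})
toy_labels = label_first_shot_after_timeout(toy)
print(toy_labels.tolist())
```

Each lookup here is a binary search rather than a scan of every shot offset, which is where the loop spends most of its time. The labels follow the same naming scheme as the loop ('3Home', '2AMiss', and so on).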

We will move on to the second notebook (Capstone Project part 2) to further analyze this data.