# Data Analysis of NBA Shot Clock Data (Part II)

Now that we've done some basic aggregations, analysis, and visualization, we're going to take a more statistical and rigorous approach. More specifically, we'll:

* Analyze associations between variables and see why the data behaves the way it does in certain situations

* Analyze the distributions of some numeric features and assess which summary metrics (mean, median, etc.) support better conclusions from the data

* Build more advanced visualizations with matplotlib to answer specific questions about the data
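
To illustrate why the choice of summary metric in the second point matters, here is a minimal sketch on a made-up sample. The values in `touch_times` are hypothetical and not taken from the shot log; they only show how a single long touch time pulls the mean well above the median.

```python
import numpy as np

# Hypothetical touch times in seconds: mostly quick shots plus one long possession.
touch_times = np.array([0.8, 1.1, 1.9, 2.0, 2.7, 3.0, 18.5])

mean_touch = np.mean(touch_times)
median_touch = np.median(touch_times)

print(f"mean:   {mean_touch:.2f}")    # mean:   4.29
print(f"median: {median_touch:.2f}")  # median: 2.00
```

When a feature is skewed like this, the median is usually the more representative "typical" value, which is why we look at the distributions before picking a metric.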

As always, let's load the file again, though this time we're going to load it a little differently.

In [9]:
import pandas as pd
import numpy as np
import os
import datetime as dt

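# Show up to 30 columns so that every column is visible in previews
pd.set_option('display.max_columns', 30)

# Step two directory levels up from the current working directory
parent_path = os.path.dirname(os.path.dirname(os.getcwd()))

# Normalize Windows backslashes to forward slashes
replace_double_slash = parent_path.replace('\\', '/')

data_path = replace_double_slash + '/data/shot_logs_clean_final.csv'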

nba_shot_data = pd.read_csv(data_path)

nba_shot_data.info()

nba_shot_data.head(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 126056 entries, 0 to 126055
Data columns (total 23 columns):
Game Id                       126056 non-null int64
Date                          126056 non-null object
Home Team                     126056 non-null object
Away Team                     126056 non-null object
Winning Team                  126056 non-null object
Losing Team                   126056 non-null object
Final Margin                  126056 non-null int64
Shot Number                   126056 non-null int64
Period                        126056 non-null int64
Game Clock                    126056 non-null object
Shot Clock                    126056 non-null float64
Dribbles                      126056 non-null int64
Touch Time                    126056 non-null float64
Shot Dist                     126056 non-null float64
Pts Type                      126056 non-null int64
Shot Result                   126056 non-null object
Closest Defender              126056 non-nul

Unnamed: 0,Game Id,Date,Home Team,Away Team,Winning Team,Losing Team,Final Margin,Shot Number,Period,Game Clock,Shot Clock,Dribbles,Touch Time,Shot Dist,Pts Type,Shot Result,Closest Defender,Closest Defender Player Id,Close Def Dist,Fgm,Pts,Player Name,Player Id
0,21400899,2015-03-04,BKN,CHA,CHA,BKN,24,1,1,00:01:09,10.8,2,1.9,7.7,2,made,"Anderson, Alan",101187,1.3,1,2,"Roberts, Brian",203148
1,21400899,2015-03-04,BKN,CHA,CHA,BKN,24,2,1,00:00:14,3.4,0,0.8,28.2,3,missed,"Bogdanovic, Bojan",202711,6.1,0,0,"Roberts, Brian",203148
2,21400899,2015-03-04,BKN,CHA,CHA,BKN,24,3,1,00:00:00,0.0,3,2.7,10.1,2,missed,"Bogdanovic, Bojan",202711,0.9,0,0,"Roberts, Brian",203148
3,21400899,2015-03-04,BKN,CHA,CHA,BKN,24,4,2,00:11:47,10.3,2,1.9,17.2,2,missed,"Brown, Markel",203900,3.4,0,0,"Roberts, Brian",203148
4,21400899,2015-03-04,BKN,CHA,CHA,BKN,24,5,2,00:10:34,10.9,2,2.7,3.7,2,missed,"Young, Thaddeus",201152,1.1,0,0,"Roberts, Brian",203148


Now we need to convert the Date column to a datetime type and the Game Clock column to a timedelta type before we can do further analysis.

In [10]:
nba_shot_data['Date'] = pd.to_datetime(nba_shot_data['Date'], format='%Y-%m-%d')

nba_shot_data['Game Clock'] = pd.to_timedelta(nba_shot_data['Game Clock'])

nba_shot_data.info()

nba_shot_data.head(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 126056 entries, 0 to 126055
Data columns (total 23 columns):
Game Id                       126056 non-null int64
Date                          126056 non-null datetime64[ns]
Home Team                     126056 non-null object
Away Team                     126056 non-null object
Winning Team                  126056 non-null object
Losing Team                   126056 non-null object
Final Margin                  126056 non-null int64
Shot Number                   126056 non-null int64
Period                        126056 non-null int64
Game Clock                    126056 non-null timedelta64[ns]
Shot Clock                    126056 non-null float64
Dribbles                      126056 non-null int64
Touch Time                    126056 non-null float64
Shot Dist                     126056 non-null float64
Pts Type                      126056 non-null int64
Shot Result                   126056 non-null object
Closest Defender           

Unnamed: 0,Game Id,Date,Home Team,Away Team,Winning Team,Losing Team,Final Margin,Shot Number,Period,Game Clock,Shot Clock,Dribbles,Touch Time,Shot Dist,Pts Type,Shot Result,Closest Defender,Closest Defender Player Id,Close Def Dist,Fgm,Pts,Player Name,Player Id
0,21400899,2015-03-04,BKN,CHA,CHA,BKN,24,1,1,00:01:09,10.8,2,1.9,7.7,2,made,"Anderson, Alan",101187,1.3,1,2,"Roberts, Brian",203148
1,21400899,2015-03-04,BKN,CHA,CHA,BKN,24,2,1,00:00:14,3.4,0,0.8,28.2,3,missed,"Bogdanovic, Bojan",202711,6.1,0,0,"Roberts, Brian",203148
2,21400899,2015-03-04,BKN,CHA,CHA,BKN,24,3,1,00:00:00,0.0,3,2.7,10.1,2,missed,"Bogdanovic, Bojan",202711,0.9,0,0,"Roberts, Brian",203148
3,21400899,2015-03-04,BKN,CHA,CHA,BKN,24,4,2,00:11:47,10.3,2,1.9,17.2,2,missed,"Brown, Markel",203900,3.4,0,0,"Roberts, Brian",203148
4,21400899,2015-03-04,BKN,CHA,CHA,BKN,24,5,2,00:10:34,10.9,2,2.7,3.7,2,missed,"Young, Thaddeus",201152,1.1,0,0,"Roberts, Brian",203148
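
One thing the timedelta type makes easy is turning the clock into plain seconds for numeric work. The sketch below is an illustration only: it uses a small toy frame that reuses three Game Clock values from the preview, and `Game Clock Seconds` is a hypothetical column name that does not exist in the dataset.

```python
import pandas as pd

# Toy frame reusing three Game Clock strings from the preview (not the full dataset).
sample = pd.DataFrame({'Game Clock': ['00:01:09', '00:00:14', '00:11:47']})

# Same conversion as in the cell above: 'HH:MM:SS' strings become timedeltas.
sample['Game Clock'] = pd.to_timedelta(sample['Game Clock'])

# 'Game Clock Seconds' is a hypothetical helper column: total seconds as floats.
sample['Game Clock Seconds'] = sample['Game Clock'].dt.total_seconds()

print(sample['Game Clock Seconds'].tolist())  # [69.0, 14.0, 707.0]
```

The `.dt` accessor is only available once the column holds timedelta (or datetime) values, which is why the conversion has to come first.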


Now we're going to plot the distributions of some of the features:

* The shot distance values
* The touch time values
* The closest defender distance values

In [11]:
%matplotlib inline
from matplotlib import pyplot as plt

In [14]:
# Shot Distance: mean and standard deviation of the column

mean = np.mean(nba_shot_data['Shot Dist'])
std_dev = np.std(nba_shot_data['Shot Dist'])
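
As a rough sketch of how a mean and standard deviation like these can be used in a distribution plot, the example below draws a histogram and marks the mean and one standard deviation on either side. It is not the notebook's own plotting code: `shot_dist` is synthetic data standing in for `nba_shot_data['Shot Dist']`, and the bin count, colors, and feet as the unit are assumptions made for illustration.

```python
import numpy as np
from matplotlib import pyplot as plt

# Synthetic stand-in for nba_shot_data['Shot Dist'] (assumed to be in feet).
# Two clusters roughly imitate close-range shots and three-point attempts.
rng = np.random.default_rng(0)
shot_dist = np.concatenate([rng.uniform(0, 10, 500), rng.normal(24, 2, 500)])

mean = np.mean(shot_dist)
std_dev = np.std(shot_dist)  # population standard deviation (ddof=0)

fig, ax = plt.subplots(figsize=(8, 4))
counts, bin_edges, _ = ax.hist(shot_dist, bins=30, edgecolor='black')

# Mark the mean and one standard deviation on either side of it.
ax.axvline(mean, color='red', linestyle='--', label='mean')
ax.axvline(mean - std_dev, color='gray', linestyle=':', label='mean ± 1 std dev')
ax.axvline(mean + std_dev, color='gray', linestyle=':')

ax.set_xlabel('Shot distance (ft, assumed)')
ax.set_ylabel('Number of shots')
ax.set_title('Distribution of shot distance (synthetic data)')
ax.legend()
```

With `%matplotlib inline` active, the figure renders when the cell finishes. To use the real data, replace `shot_dist` with the `Shot Dist` column.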

