# NBA Data-Viz 
<br>
The aim of this Jupyter notebook is to get some practice at wrangling and visualising data. The end goal is to have a fully interactive dashboard that allows someone to explore NBA data. 


<br>  
## Getting the data:
<br> The dataset that I will be working with was obtained from Kaggle. It is available here: __[NBA Players Stats Since 1950](https://www.kaggle.com/drgilermo/nba-players-stats/version/2#)__. Three CSV files are available. For now I will only be working with Seasons_Stats.csv, where each row contains advanced statistics for a player in a given season. 

In [43]:
#Import statements
import numpy as np
import pandas as pd
from bokeh.plotting import figure
from bokeh.io import show, output_notebook
from bokeh.models import ColumnDataSource, HoverTool, CategoricalColorMapper
from bokeh.palettes import Category10_5, Category20_16

In [44]:
Season_Stats = pd.read_csv('/Users/Akshi/Desktop/Projects/Sport Analytics/NBA/NBA_Analytics/Data/nba-players-stats/Seasons_Stats.csv')
Season_Stats.head()

Unnamed: 0.1,Unnamed: 0,Year,Player,Pos,Age,Tm,G,GS,MP,PER,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,0,1950.0,Curly Armstrong,G-F,31.0,FTW,63.0,,,,...,0.705,,,,176.0,,,,217.0,458.0
1,1,1950.0,Cliff Barker,SG,29.0,INO,49.0,,,,...,0.708,,,,109.0,,,,99.0,279.0
2,2,1950.0,Leo Barnhorst,SF,25.0,CHS,67.0,,,,...,0.698,,,,140.0,,,,192.0,438.0
3,3,1950.0,Ed Bartels,F,24.0,TOT,15.0,,,,...,0.559,,,,20.0,,,,29.0,63.0
4,4,1950.0,Ed Bartels,F,24.0,DNN,13.0,,,,...,0.548,,,,20.0,,,,27.0,59.0


In [45]:
print(f'The dataframe is: {Season_Stats.shape[0]} rows by {Season_Stats.shape[1]} columns')

The dataframe is: 24691 rows by 53 columns


This dataset goes back all the way to 1950. However, the NBA did not start tracking certain important statistics, such as 3-Point%, until the 1979-1980 season (__[Source](https://stats.nba.com/help/faq/)__), so I will only focus on player data from 1980 onwards.

In [46]:
#define new dataframe
DF = Season_Stats.loc[(Season_Stats['Year'] >= 1980)] 
#List all the column headers for future reference
columns = list(DF.columns.values)
print(columns)

['Unnamed: 0', 'Year', 'Player', 'Pos', 'Age', 'Tm', 'G', 'GS', 'MP', 'PER', 'TS%', '3PAr', 'FTr', 'ORB%', 'DRB%', 'TRB%', 'AST%', 'STL%', 'BLK%', 'TOV%', 'USG%', 'blanl', 'OWS', 'DWS', 'WS', 'WS/48', 'blank2', 'OBPM', 'DBPM', 'BPM', 'VORP', 'FG', 'FGA', 'FG%', '3P', '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS']


All column names and their interpretations can be found on __[Basketball Reference](https://www.basketball-reference.com/about/glossary.html)__.
<br><br>
## Exploring the data:
<br> 
In this section I will do some basic exploratory data analysis to produce statistical summaries and visualisations.
Let's start by looking at the distribution of the number of 3-pointers made by a player in a season. I'm choosing to look at this distribution rather than 3P%, since some players might have a really high 3P% while attempting hardly any shots.  
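
To make the small-sample concern concrete, here is a minimal sketch on a made-up table. The players and numbers are hypothetical and are not taken from Seasons_Stats.csv; only the column names `3P`, `3PA` and `3P%` mirror the dataset. It shows how a player with a single attempt can top a 3P% ranking.

```python
import pandas as pd

# Hypothetical toy data: three made-up player-seasons, NOT from the Kaggle file
toy = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "3P": [1, 150, 40],     # 3-pointers made
    "3PA": [1, 375, 125],   # 3-pointers attempted
})

# 3P% is makes divided by attempts
toy["3P%"] = toy["3P"] / toy["3PA"]

# Ranking by percentage alone puts the one-attempt player first
ranked = toy.sort_values("3P%", ascending=False)
print(ranked)
```

Looking at the raw count of makes (or filtering on attempts first) avoids rewarding these tiny samples.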


In [47]:
#Summary statistics about the number of 3P made in a season by a player 
DF["3P"].describe()

count    18927.000000
mean        22.215037
std         38.543366
min          0.000000
25%          0.000000
50%          2.000000
75%         27.000000
max        402.000000
Name: 3P, dtype: float64

The above gives some interesting information about the distribution of 3P made in a season. The median is 2, the 75th percentile is 27 and the max is 402. That's a massive range, so a histogram of the raw data is going to look weird. The issue here might be that we have not had a chance to filter our data. Since so many players have played in the NBA, let's only focus on those who attempted more than 20 3-point shots in a season. 

In [48]:
df_3PA = DF.loc[(DF["3PA"]>20)]
arr_hist, edges = np.histogram(df_3PA["3P"], bins =81, range=[0,405] )

#Add the above to a Dataframe
splash = pd.DataFrame({'3P': arr_hist,
                      'left': edges[:-1],
                      'right': edges[1:]})
#Create a blank figure (Bokeh 3.x removed plot_height/plot_width in favour of height/width)
plot = figure(height=400, width=400, 
              title="Histogram of 3-pointers made in the NBA (1980-2017)",
              x_axis_label = "3-Pointers made in a season",
              y_axis_label = "Count")
plot.quad(bottom=0, top=splash['3P'], left=splash['left'], right=splash['right'],
         fill_color=(250, 54, 53), line_color='black')
plot.background_fill_color = 'whitesmoke'
output_notebook()
show(plot)

The above shows the distribution of the number of 3-pointers made by a player in a season. I want to say it's positively skewed, but it almost feels wrong to. 
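
One way to put a number on the skew is the sample skewness, which pandas exposes as `Series.skew()`; a positive value indicates a longer right tail. The sketch below uses a small made-up sample (hypothetical values, not the NBA data) just to show the call. In the notebook, the same method could be applied to `df_3PA["3P"]`.

```python
import pandas as pd

# Hypothetical sample: mostly small counts plus a few large ones (long right tail)
sample = pd.Series([2, 5, 7, 8, 10, 12, 15, 20, 60, 150])

skewness = sample.skew()  # sample skewness; a value > 0 suggests right (positive) skew

# A mean well above the median is another quick hint of a right tail
print(f"mean={sample.mean():.1f}, median={sample.median():.1f}, skew={skewness:.2f}")
```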


## Adding a HoverTool to the plot
<br>The next few sections will make the above distribution more interactive and hopefully make it look less strange. Interactivity will be added through Bokeh by transforming the dataframe from before, ```splash```, into a ColumnDataSource object. 

In [53]:
#Same as before. Filter the rows we want, make bins and create a dataframe
df_3PA = DF.loc[(DF["3PA"]>20)]
arr_hist, edges = np.histogram(df_3PA["3P"], bins =81, range=[0,405] )

splash = pd.DataFrame({'count': arr_hist,
                      'left': edges[:-1],
                      'right': edges[1:]})
#Add a new column to the dataframe with a text label for each interval
splash['interval'] = ['%d to %d shots' %(left,right)
                      for left,right in zip(splash['left'], splash['right'])]
splash.head()

Unnamed: 0,count,left,right,interval
0,227,0.0,5.0,0 to 5 shots
1,930,5.0,10.0,5 to 10 shots
2,835,10.0,15.0,10 to 15 shots
3,673,15.0,20.0,15 to 20 shots
4,514,20.0,25.0,20 to 25 shots
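
A note on reading these labels: with `bins=81` and `range=[0, 405]`, `np.histogram` produces bins of width 5, and each bin includes its left edge but not its right edge (only the last bin is closed on both sides). So a season with exactly 5 made 3-pointers is counted in the "5 to 10 shots" bin, not the "0 to 5 shots" bin. A minimal sketch with made-up values (not the NBA data) illustrates this:

```python
import numpy as np

# Hypothetical values chosen to sit on and around the bin edges
made = np.array([0, 4, 5, 9, 10])

# Same binning as in the notebook: 81 bins spanning 0 to 405
counts, edges = np.histogram(made, bins=81, range=[0, 405])

print(edges[:4])    # first few bin edges
print(counts[:3])   # counts in the first three bins
```

Here 0 and 4 land in the first bin, 5 and 9 in the second, and 10 in the third.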


In [54]:
#Add the above dataframe to a ColumnDataSource to allow for interactivity later on
src = ColumnDataSource(splash)
src.data.keys()

dict_keys(['count', 'left', 'right', 'interval', 'index'])

In [52]:
#Create a blank plot as before (Bokeh 3.x removed plot_height/plot_width in favour of height/width)
plot_2 = figure(height=400, width=400, 
                title="Histogram of 3-pointers made in the NBA (1980-2017)",
                x_axis_label = "3-Pointers made in a season",
                y_axis_label = "Count")
#Add a quad to the plot with src this time
plot_2.quad(source=src,bottom=0,top='count', left='left', right='right', fill_color=(250, 54, 53), 
         line_color='black', hover_fill_alpha=1.0, hover_fill_color='blue')
#Add a hover tool. Note the @interval, which refers to the interval column from src
hover = HoverTool(tooltips= [('3-Pointers made', '@interval'), 
                           ('Count', '@count')])
plot_2.add_tools(hover)
show(plot_2)