# **NBA Data-Viz** 

The aim of this Jupyter notebook is to get some practice at wrangling and visualising data. The end goal is a fully interactive dashboard that allows someone to explore the NBA dataset found on Kaggle. 

## **Table of Contents**:
> 1. [Getting the data](#1)
> 2. [Exploring the data](#2)
> 3. [Hover tool](#3)
> 4. [Active interactions](#4)
> 5. [Widgets](#5)
> 6. [The App](#6)

<a id="1"></a> 
## *Getting the data*
 The dataset that I will be working with was obtained from Kaggle. It is available here: __[NBA Players Stats Since 1950](https://www.kaggle.com/drgilermo/nba-players-stats/version/2#)__. There are three CSV files available. For now I will only be working with Seasons_Stats.csv, where each row contains advanced statistics for a given player in a given season. 

In [1]:
#Import statements
import numpy as np
import pandas as pd

from bokeh.plotting import figure
from bokeh.io import show, output_notebook, output_file, save

from bokeh.models import ColumnDataSource, HoverTool, CheckboxGroup, Panel
from bokeh.models.widgets import RangeSlider, Slider, Tabs

from bokeh.palettes import Category10_5, Category20_16
from bokeh.layouts import column, row, WidgetBox

from bokeh.application.handlers import FunctionHandler
from bokeh.application import Application

output_notebook()

In [2]:
#Load data
Season_Stats = pd.read_csv('/Users/Akshi/Desktop/Projects/Sport Analytics/NBA/NBA_Analytics/Data/nba-players-stats/Seasons_Stats.csv')
Season_Stats.head()

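| Unnamed: 0.1 | Unnamed: 0 | Year | Player | Pos | Age | Tm | G | GS | MP | PER | ... | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1950.0 | Curly Armstrong | G-F | 31.0 | FTW | 63.0 | | | | ... | 0.705 | | | | 176.0 | | | | 217.0 | 458.0 |
| 1 | 1 | 1950.0 | Cliff Barker | SG | 29.0 | INO | 49.0 | | | | ... | 0.708 | | | | 109.0 | | | | 99.0 | 279.0 |
| 2 | 2 | 1950.0 | Leo Barnhorst | SF | 25.0 | CHS | 67.0 | | | | ... | 0.698 | | | | 140.0 | | | | 192.0 | 438.0 |
| 3 | 3 | 1950.0 | Ed Bartels | F | 24.0 | TOT | 15.0 | | | | ... | 0.559 | | | | 20.0 | | | | 29.0 | 63.0 |
| 4 | 4 | 1950.0 | Ed Bartels | F | 24.0 | DNN | 13.0 | | | | ... | 0.548 | | | | 20.0 | | | | 27.0 | 59.0 |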


In [None]:
print('The dataframe is: ' + str(Season_Stats.shape[0]) + ' rows by '+  str(Season_Stats.shape[1]) + ' columns') 

<br>This dataset goes back all the way to 1950; however, the NBA did not start tracking certain important statistics, such as 3-Point%, until the 1979-1980 season __[Source](https://stats.nba.com/help/faq/)__, so I will only focus on player data from 1980 onwards.
<br>

In [3]:
#define new dataframe
DF = Season_Stats.loc[(Season_Stats['Year'] >= 1980)] 
#List all the column headers for future reference
columns = list(DF.columns.values)
print(columns)

['Unnamed: 0', 'Year', 'Player', 'Pos', 'Age', 'Tm', 'G', 'GS', 'MP', 'PER', 'TS%', '3PAr', 'FTr', 'ORB%', 'DRB%', 'TRB%', 'AST%', 'STL%', 'BLK%', 'TOV%', 'USG%', 'blanl', 'OWS', 'DWS', 'WS', 'WS/48', 'blank2', 'OBPM', 'DBPM', 'BPM', 'VORP', 'FG', 'FGA', 'FG%', '3P', '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS']


All column names and their interpretations can be found on __[Basketball Reference](https://www.basketball-reference.com/about/glossary.html)__.
<br><br>
<a id="2"></a> 
## *Exploring the data*
In this section I will do some basic exploratory data analysis to produce statistical summaries and visualisations.
Let's start by looking at the distribution of the number of 3-pointers made by a player. I'm choosing to look at this distribution, rather than 3P%, since some players might have a really high 3P% but attempt hardly any shots.  


In [None]:
#Summary statistics about the number of 3P made in a season by a player 
DF["3P"].describe()

<br> The above gives some interesting information about the distribution of 3P made in a season. The median is 2, the 75th percentile is 27 and the max is 402. That's a massive range, so a histogram of the raw data is going to look weird. The issue here might be that we have not yet filtered the data. Since so many players have played in the NBA, let's only focus on those who have attempted more than 20 3-point shots in a season. 

In [None]:
df_3PA = DF.loc[(DF["3PA"]>20)]
arr_hist, edges = np.histogram(df_3PA["3P"], bins =81, range=[0,405] )

#Add the above to a Dataframe
splash = pd.DataFrame({'3P': arr_hist,
                      'left': edges[:-1],
                      'right': edges[1:]})
#Create a blank figure
plot = figure(plot_height=400, plot_width=400, 
             title="Histogram of 3-pointers made in the NBA (1980-2017)",
             x_axis_label = "3-Pointers made in a season",
             y_axis_label = "Count")
plot.quad(bottom=0, top=splash['3P'], left=splash['left'], right=splash['right'],
         fill_color=(250, 54, 53), line_color='black')
plot.background_fill_color = 'whitesmoke'
output_notebook()
show(plot)

The above shows the number of 3-pointers made by a player in a season. I want to say it's positively skewed, but it almost feels wrong to. 
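
One way to check the "positively skewed" impression numerically is to compare the mean with the median and to compute a moment-based skewness. The sketch below is only an illustration: `sample_skewness` is a hypothetical helper and the numbers in `made_threes` are made up, not taken from Seasons_Stats.csv.

```python
import numpy as np

def sample_skewness(values):
    """Moment-based skewness: the mean of the cubed z-scores (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    centred = values - values.mean()
    std = values.std()  # population standard deviation (ddof=0)
    return float(np.mean(centred ** 3) / std ** 3)

# Made-up season totals (an assumption, NOT real data):
# many low-volume shooters and a few very high-volume ones.
made_threes = [1, 2, 3, 5, 8, 12, 20, 35, 60, 110, 250]

print("mean:", np.mean(made_threes))
print("median:", np.median(made_threes))
print("skewness:", sample_skewness(made_threes))
```

A mean well above the median together with a skewness greater than zero points to a long right tail. In the notebook itself, `df_3PA["3P"].skew()` would give a comparable (bias-corrected) figure for the real data.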


<a id="3"></a> 
## *Adding a HoverTool to the plot*
The next few sections will make the above distribution more interactive and hopefully make it look less strange. Interactivity will be added through Bokeh, by transforming the dataframe from before, ```splash```, into a ColumnDataSource object. 

In [None]:
def style(plot):
    '''A function to style a plot'''
    plot.title.align='center'
    plot.title.text_font_size='10pt'
    plot.xaxis.axis_label_text_font_size='12pt'
    plot.xaxis.major_label_text_font_size = '12pt'
    plot.yaxis.axis_label_text_font_size = '12pt'
    plot.yaxis.major_label_text_font_size = '12pt'
    
    return plot


In [None]:
#Same as before. Select columns we want, make bins and create a dataframe
df_3PA = DF.loc[(DF["3PA"]>20)]
arr_hist, edges = np.histogram(df_3PA["3P"], bins =81, range=[0,405] )

splash = pd.DataFrame({'count': arr_hist,
                      'left': edges[:-1],
                      'right': edges[1:]})
#Add a new column to the dataframe with a text label for each interval
splash['interval'] = ['%d to %d shots' %(left,right)
                      for left,right in zip(splash['left'], splash['right'])]
splash.head()
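
The interval labels read as "0 to 5 shots", "5 to 10 shots" and so on, which leaves open which bin a player with exactly 5 made threes falls into. The sketch below uses the same `np.histogram` settings on a handful of made-up values (an assumption for illustration) to show that bins are closed on the left and open on the right, except for the last bin, which also includes its right edge.

```python
import numpy as np

# Made-up 3P totals chosen to sit on or next to bin edges (NOT real data).
made = np.array([0, 4, 5, 9, 10, 405])

counts, edges = np.histogram(made, bins=81, range=[0, 405])

# Same label format as the 'interval' column above.
labels = ['%d to %d shots' % (left, right)
          for left, right in zip(edges[:-1], edges[1:])]

# Print the first three bins and the last one.
for label, count in zip(labels[:3], counts[:3]):
    print(label, count)
print(labels[-1], counts[-1])
```

So a label such as "5 to 10 shots" covers 5 up to, but not including, 10.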

In [None]:
#Add the above dataframe to a ColumnDataSource to allow for interactivity later on
src = ColumnDataSource(splash)
src.data.keys()

In [None]:
#Create a blank plot as before
plot_2 = figure(plot_height=400, plot_width=400, 
             title="Histogram of 3-pointers made in the NBA (1980-2017)",
             x_axis_label = "3-Pointers made in a season",
             y_axis_label = "Count")

#Add a quad to the plot with src this time
plot_2.quad(source=src,bottom=0,top='count', left='left', right='right', fill_color=(250, 54, 53), 
          fill_alpha=0.75, line_color='black', hover_fill_alpha=1.0, hover_fill_color='blue')

#Add a hover tool. Note the @interval, which refers to the interval column from src
hover = HoverTool(tooltips= [('3-Pointers made', '@interval'), 
                           ('Count', '@count')])
plot_2 = style(plot_2)
plot_2.add_tools(hover)
show(plot_2)

<a id="4"></a> 
## *Adding active interactions*
The tooltip is a passive interaction. Time to add something more involved. This section will add the following things to the plot:

- Allow users to look at the distribution by season, e.g. 2003 vs 2017. (It should hopefully show that a greater number of 3-pointers are being made in the more recent seasons.)
- Let users choose their own 3PA cut-off criterion, i.e. look at the distribution for those players who have 3PA > x, for some value x. Previously it was statically set to 20.


In [None]:
DF[["3PA", "3P", "Year"]][:10]

In [None]:
def subset_data(year_list, attempts, start=0, end=405, bin_width=5):
    '''Function to create a subset of data by year and 3PA.
       @year_list is a list of years that the user would like to subset by
       @attempts is an int. Subsets DF so that rows with 3PA > attempts are selected '''
    
    #TODO: Check for edge cases
    #Check that attempts is < 402
    #Check that no element in year_list is > 2017 or < 1980
    #Check that start < end
    assert start < end, 'Error: Start should be less than end!'

    
    subset_by_year = pd.DataFrame(columns=['proportion', 'left', 'right', 'year_interval',
                                          'year_proportion','name','color'])
    shot_range = end-start
    
    for counter, year in enumerate(year_list): # Enumerate loops over year_list with automatic counter


        #Subset by year
        subset = DF.loc[(DF['Year'] == year) & (DF["3PA"]>attempts)]
        
        #Histogram with specified bins and range
        hist, edges = np.histogram(subset["3P"], bins=int(shot_range/bin_width), 
                                  range=[start,end])
        #Get proportions, divide count by total 
        arr_df = pd.DataFrame({'proportion': hist/np.sum(hist),
                              'left': edges[:-1],
                              'right': edges[1:]} )
        #Round proportion
        arr_df['year_proportion'] = ['%0.5f' % proportion for proportion in arr_df['proportion']]
        
        #Get interval
        arr_df['year_interval'] = ['%d to %d shots' %(left,right) for left,right 
                              in zip(arr_df['left'], arr_df['right'])]
        #Assign year for labels
        arr_df['year'] = year
        
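        #Get a colour for each year; wrap around the palette if more than 16 years are selected
        arr_df['color'] = Category20_16[counter % len(Category20_16)]
        
        #Add to subset_by_year dataframe (pd.concat, since DataFrame.append was removed in pandas 2.0)
        subset_by_year = pd.concat([subset_by_year, arr_df])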
    
    subset_by_year = subset_by_year.sort_values(['year', 'left'])
    
    #return a ColumnDataSource to use when adding quads
    return ColumnDataSource(subset_by_year)
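
One edge case from the TODO list is a selection that matches no rows, for example a very high 3PA cutoff. In that case `hist/np.sum(hist)` divides zero by zero and the proportions come out as NaN. A minimal sketch of one possible guard follows; `safe_proportions` is a hypothetical helper, not part of the notebook, and the input values are made up.

```python
import numpy as np

def safe_proportions(counts):
    """Return counts / total, or all zeros when the total is 0 (hypothetical helper)."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    if total == 0:
        # Nothing to normalise: avoid 0/0 and return zeros instead of NaN.
        return np.zeros_like(counts)
    return counts / total

# Made-up case 1: no player clears the cutoff, so the histogram is empty.
empty_hist, _ = np.histogram([], bins=81, range=[0, 405])
print(safe_proportions(empty_hist).sum())

# Made-up case 2: a few players remain, so the proportions add up to one.
hist, _ = np.histogram([3, 7, 7, 120], bins=81, range=[0, 405])
print(safe_proportions(hist).sum())
```

Returning zeros keeps the plot empty rather than feeding NaN values to the quads. Whether an empty plot or an explicit message is the better behaviour is a design choice for the app.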
    

In [None]:
def plotter(src):
    '''Function takes a ColumnDataSource, src, and returns a plot object'''
    #Create a blank plot
    plot = figure(plot_width=500, plot_height=500,
                 title = "Histogram of 3-pointers made in the NBA by Year",
                  x_axis_label = '3-point shots made in a season',
                  y_axis_label = 'Proportion')
    
    # Add a quad to the plot 
    plot.quad(source=src, bottom=0, top='proportion', left='left', right='right',
              color='color', hover_fill_color='color', legend='year',
              hover_fill_alpha=0.7, line_color='black')
    #Add a hovertool
    hover = HoverTool(tooltips=[('Year', '@year'),
                               ('Interval', '@year_interval'),
                               ('Proportion', '@year_proportion')])
    plot.add_tools(hover)
    
    #Add styling
    plot = style(plot)
    
    return plot

In [None]:
src = subset_data([2017, 1982, 2000], 50)
plot = plotter(src)
output_notebook()
show(plot)

<a id="5"></a> 
## *Adding widgets*
Now that the subsetting and plotting functions have been defined, time to add widgets to get user selections.

In [None]:
#Convert years -> int -> str
possible_years = list(DF["Year"].unique())
possible_years = list(map(int,possible_years))
possible_years = list(map(str,possible_years))
#Create a selection checkbox for users to select which seasons they want to view
year_selection= CheckboxGroup(labels=possible_years, active=[0,1], inline=True)
show(year_selection)

In [None]:
#Get the years that the user has selected from the CheckboxGroup
test_years = [year_selection.labels[i] for i in year_selection.active]
test_years = list(map(int,test_years))
#Create a slider to select 3PA cutoff
min_attempts = Slider(start=0,end=402,step=3, value=30, title="Minimum number of 3-pointers attempted")

In [None]:
def update(attr,old,new):
    '''Update function for the plot'''
    
    #Get active years from selection
    active_years = [year_selection.labels[i] for i in year_selection.active]
    #Change from string to ints
    active_years = list(map(int,active_years))
    #3PA cutoff
    attempts = min_attempts.value
    #Dataset based on selection
    active_src = subset_data(active_years,attempts)
    #update src used for the quads
    src.data.update(active_src.data)

#Register the callbacks now that update has been defined
year_selection.on_change('active',update)
min_attempts.on_change('value', update)


In [None]:
controls = WidgetBox(year_selection, min_attempts)
show(controls)

<a id="6"></a> 
## *Complete plot application*

In [14]:
def modify_doc(doc):
    '''Main function for plotting and modifying 3P histogram'''
    
    def subset_data(year_list, attempts, start=0, end=405, bin_width=5):
        '''Function to create a subset of data by year and 3PA.
           @year_list is a list of years that the user would like to subset by
           @attempts is an int. Subsets DF so that rows with 3PA > attempts are selected '''

        #TODO: Check for edge cases
        #Check that attempts is < 402
        #Check that no element in year_list is > 2017 or < 1980
        #Check that start < end
        assert start < end, 'Error: Start should be less than end!'


        subset_by_year = pd.DataFrame(columns=['proportion', 'left', 'right', 'year_interval',
                                              'year_proportion','name','color'])
        shot_range = end-start

        for counter, year in enumerate(year_list): # Enumerate loops over year_list with automatic counter


            #Subset by year
            subset = DF.loc[(DF['Year'] == year) & (DF["3PA"]>attempts)]

            #Histogram with specified bins and range
            hist, edges = np.histogram(subset["3P"], bins=int(shot_range/bin_width), 
                                      range=[start,end])
            #Get proportions, divide count by total 
            arr_df = pd.DataFrame({'proportion': hist/np.sum(hist),
                                  'left': edges[:-1],
                                  'right': edges[1:]} )
            #Round proportion
            arr_df['year_proportion'] = ['%0.5f' % proportion for proportion in arr_df['proportion']]

            #Get interval
            arr_df['year_interval'] = ['%d to %d shots' %(left,right) for left,right 
                                  in zip(arr_df['left'], arr_df['right'])]
            #Assign year for labels
            arr_df['year'] = year

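            #Get a colour for each year; wrap around the palette if more than 16 years are selected
            arr_df['color'] = Category20_16[counter % len(Category20_16)]

            #Add to subset_by_year dataframe (pd.concat, since DataFrame.append was removed in pandas 2.0)
            subset_by_year = pd.concat([subset_by_year, arr_df])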

        subset_by_year = subset_by_year.sort_values(['year', 'left'])

        #return a ColumnDataSource to use when adding quads
        return ColumnDataSource(subset_by_year)
    
    def style(plot):
        '''A function to style a plot'''
        plot.title.align='center'
        plot.title.text_font_size='10pt'
        plot.xaxis.axis_label_text_font_size='12pt'
        plot.xaxis.major_label_text_font_size = '12pt'
        plot.yaxis.axis_label_text_font_size = '12pt'
        plot.yaxis.major_label_text_font_size = '12pt'

        return plot

    def plotter(src):
        '''Function takes a ColumnDataSource, src, and returns a plot object'''
        #Create a blank plot
        plot = figure(plot_width=500, plot_height=500,
                     title = "Histogram of 3-pointers made in the NBA by Year",
                      x_axis_label = '3-point shots made in a season',
                      y_axis_label = 'Proportion')

        # Add a quad to the plot 
        plot.quad(source=src, bottom=0, top='proportion', left='left', right='right',
                  color='color', hover_fill_color='color', legend='year',
                  hover_fill_alpha=0.7, line_color='black')
        #Add a hovertool
        hover = HoverTool(tooltips=[('Year', '@year'),
                                   ('Interval', '@year_interval'),
                                   ('Proportion', '@year_proportion')])
        plot.add_tools(hover)

        #Add styling
        plot = style(plot)

        return plot

    def update(attr,old,new):
        '''Update function for the plot'''

        #Get active years from selection
        active_years = [year_selection.labels[i] for i in year_selection.active]
        #Change from string to ints
        active_years = list(map(int,active_years))
        #3PA cutoff
        attempts = min_attempts.value
        
        #Dataset based on selection
        active_src = subset_data(active_years,attempts)
        #update src used for the quads
        src.data.update(active_src.data)
        
    
    
    
    #Convert years -> int -> str
    possible_years = list(DF["Year"].unique())
    possible_years = list(map(int,possible_years))
    possible_years = list(map(str,possible_years))
    
    #Create a selection checkbox for users to select which seasons they want to view
    year_selection= CheckboxGroup(labels=possible_years, active=[0,1], inline=True)
    #Re-run update whenever the selected years change
    year_selection.on_change('active',update)
    
    #Create a slider to select 3PA cutoff
    min_attempts = Slider(start=0,end=402,step=3, value=30, title="Minimum number of 3-pointers attempted")
    min_attempts.on_change('value', update)
    
    starting_years = [year_selection.labels[i] for i in year_selection.active]
    starting_years = list(map(int,starting_years))
    
    src = subset_data(starting_years,min_attempts.value)
    
    p = plotter(src)
    #Add controls to a widgetbox
    controls = WidgetBox(year_selection, min_attempts)
    
    #Format the layout
    layout = row(controls,p)
    
    #Create a tab with the layout
    tab = Panel(child=layout, title='3P Histogram')
    tabs = Tabs(tabs=[tab])
    
    doc.add_root(tabs)


In [17]:
show(modify_doc)