# Basketball Statistical Analysis: NCAA vs NBA
This project lets users explore trends that are not easy to spot by comparing NBA players' professional statistics with their NCAA statistics. The goal is to generate interactive bar charts for exploring those trends. I used the pandas package to clean, filter, and select the data for each chart, and the matplotlib package to create the bar charts with a consistent theme across all plots. I used the interact functionality in ipywidgets to generate the interactive plots. Finally, using nbinteract, I generated an interactive HTML document, creating a usable interface from the Jupyter notebook.

## Reading and Cleaning the Data
I found a dataset on data.world.com that shows over 4,500 NBA players’ NBA and NCAA statistics. The variables in this project include each player's name, years active, birth date, college, position, and NBA and NCAA statistics. I used the pandas package to read in and clean the data. The first 5 rows of the data frame show the user what to expect when interacting with the data. A critical step in this section is renaming columns so that the NBA and NCAA statistics share the same variable names for each player.

In [11]:
#https://www.youtube.com/watch?v=HW29067qVWk.
#I watched this video that showed me a tutorial and 
#walkthrough of the basic jupyter notebook features. 
import matplotlib.pyplot as plt
import matplotlib as mpl
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets
import pandas as pd
import numpy as np
stats = pd.read_csv("players.csv")
#https://pandas.pydata.org/docs/. I used the pandas documentation
#that showed me how to use pandas to work with the data. 
stats["years_active"] = stats["active_to"] - stats["active_from"]
stats = stats.rename(columns = {"NCAA_games": "NCAA_g_played",
                                "NCAA_fgpct": "NCAA_fg%",
                                "NCAA_fgpg": "NCAA_fg_per_game"})
stats.head(5)

Unnamed: 0,active_from,active_to,birth_date,college,name,position,position.1,NBA__3ptapg,NBA__3ptpct,NBA__3ptpg,...,NCAA_fgapg,NCAA_fg%,NCAA_fg_per_game,NCAA_ft,NCAA_ftapg,NCAA_ftpg,NCAA_g_played,NCAA_ppg,NCAA_ppg.1,years_active
0,1991,1995,24-Jun-68,Duke University,Alaa Abdelnaby,F-C,F-C,0.0,0.0,0.0,...,5.6,0.599,3.3,0.728,2.5,1.8,134.0,8.5,8.5,4
1,1969,1978,7-Apr-46,Iowa State University,Zaid Abdul-Aziz,C-F,C-F,,,,...,,,,,,,,,,9
2,1970,1989,16-Apr-47,"University of California, Los Angeles",Kareem Abdul-Jabbar,C,C,0.0,0.056,0.0,...,16.8,0.639,10.7,0.628,7.9,5.0,88.0,26.4,26.4,19
3,1991,2001,9-Mar-69,Louisiana State University,Mahmoud Abdul-Rauf,G,G,2.3,0.354,0.8,...,21.9,0.474,10.4,0.863,6.4,5.5,64.0,29.0,29.0,10
4,1998,2003,3-Nov-74,"University of Michigan, San Jose State University",Tariq Abdul-Wahad,F,F,0.3,0.237,0.1,...,,,,,,,,,,5
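
The renaming step matters because the later charts build column names by prefixing a statistic with "NBA_" or "NCAA_". As a minimal sketch, assuming a toy data frame with made-up values and a hypothetical helper named `shared_stats`, the following checks which statistics exist under both prefixes before and after a rename:

```python
import pandas as pd

# Toy frame that reuses a few column names from the data set; values are made up.
toy = pd.DataFrame({
    "name": ["Player A", "Player B"],
    "NBA_ppg": [10.2, 4.1],
    "NBA_fg%": [0.45, 0.39],
    "NBA_g_played": [300, 82],
    "NCAA_ppg": [18.0, 12.5],
    "NCAA_fgpct": [0.51, 0.47],
    "NCAA_games": [120.0, 98.0],
})

def shared_stats(df):
    """Return the stat suffixes that exist under both the NBA_ and NCAA_ prefixes."""
    nba = {c[len("NBA_"):] for c in df.columns if c.startswith("NBA_")}
    ncaa = {c[len("NCAA_"):] for c in df.columns if c.startswith("NCAA_")}
    return sorted(nba & ncaa)

before = shared_stats(toy)
renamed = toy.rename(columns={"NCAA_games": "NCAA_g_played", "NCAA_fgpct": "NCAA_fg%"})
after = shared_stats(renamed)

print(before)  # only the statistics that already matched
print(after)   # more statistics line up after the rename
```

Only statistics returned by a check like this can safely be plotted side by side for both leagues.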


## Information
Calling .info() returns information about the data being worked with, showing the statistics being observed and the size of the data. In this section, we can observe the variables included in this data set and that there are 4,576 entries. The entries begin in the late 1940s and end in 2017. There are some NA values in the data that are removed in some sections to ensure that only clean data is used. In other sections, however, it made sense to replace NA values with 0 so that those players stay in the data observed. For example, the data for a player such as LeBron James, who did not attend college, is still included in some sections for analysis.

In [12]:
stats.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4576 entries, 0 to 4575
Data columns (total 33 columns):
active_from         4576 non-null int64
active_to           4576 non-null int64
birth_date          4547 non-null object
college             4274 non-null object
name                4576 non-null object
position            4575 non-null object
position.1          4575 non-null object
NBA__3ptapg         3448 non-null float64
NBA__3ptpct         2953 non-null float64
NBA__3ptpg          3448 non-null float64
NBA_efgpct          3426 non-null float64
NBA_fg%             4548 non-null float64
NBA_fg_per_game     4576 non-null float64
NBA_fga_per_game    4576 non-null float64
NBA_ft%             4378 non-null float64
NBA_ft_per_g        4576 non-null float64
NBA_fta_p_g         4576 non-null float64
NBA_g_played        4576 non-null int64
NBA_ppg             4576 non-null float64
NCAA__3ptapg        1868 non-null float64
NCAA__3ptpct        1726 non-null float64
NCAA__3ptpg         18
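
The two ways of handling NA values described in this section can be sketched on a toy data frame. The player names and values below are made up for illustration; only the column names come from the data set.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "name": ["College player", "No-college player"],
    "NBA_ppg": [12.0, 27.0],
    "NCAA_ppg": [15.0, np.nan],
})

# Option 1: drop rows that lack the NCAA statistic.
dropped = toy.dropna(subset=["NCAA_ppg"])

# Option 2: keep every player and treat the missing NCAA statistic as 0.
filled = toy.fillna({"NCAA_ppg": 0})

print(len(dropped), len(filled))
```

Dropping keeps averages honest, while filling with 0 keeps players without college records visible in NBA-only views.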

## Base Data Visualization
The base data visualization lets the user select a statistic in order to observe its distribution among the players. The first plot shows the distribution of whichever statistic the user selects. The intent is that outliers or high percentages are highlighted. Additionally, observing the distribution of each statistic provides initial insight that may be useful in understanding high-performing players.

In [13]:
%matplotlib inline
#https://matplotlib.org/3.1.1/contents.html
#I used the matplotlib documentation to create
#the plots below

#https://www.youtube.com/watch?v=rkBPgTL-D3Y&t=407s
#I watched this youtube video and it included multiple examples
#on how to display interactive plots. 
#hist1, when called with interactive, displays the interactive chart explained above
def hist1(x):
    #apply the theme first so it takes effect on the figure drawn below
    plt.style.use("ggplot")
    mpl.rcParams['grid.color'] = 'k'
    mpl.rcParams['grid.linestyle'] = ':'
    mpl.rcParams['grid.linewidth'] = 0.5
    mpl.rcParams['figure.facecolor'] = '1'
    mpl.rcParams['figure.figsize'] = [10.0, 7.50]
    plt.hist(stats[x].dropna(), color = "C1", edgecolor = "black",
             linewidth = 3, bins = 20)
    plt.title("Count of Stats")

#builds the stat list from the data frame's statistic columns,
#skipping the first 7 descriptive columns
stat = list(stats.columns[7:])
    
interactive(hist1, x = stat)


interactive(children=(Dropdown(description='x', options=('NBA__3ptapg', 'NBA__3ptpct', 'NBA__3ptpg', 'NBA_efgp…
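
The histogram leaves it to the reader to spot outliers by eye. As a complementary sketch, the helper below lists them numerically using the common 1.5 × IQR rule. That rule is an assumption chosen for illustration and is not part of the notebook; `iqr_outliers` and the sample values are hypothetical.

```python
import pandas as pd

def iqr_outliers(values: pd.Series) -> pd.Series:
    """Return the values lying more than 1.5 IQRs outside the quartiles."""
    clean = values.dropna()
    q1, q3 = clean.quantile(0.25), clean.quantile(0.75)
    iqr = q3 - q1
    mask = (clean < q1 - 1.5 * iqr) | (clean > q3 + 1.5 * iqr)
    return clean[mask]

# Made-up points-per-game values with one obvious high scorer and one missing entry.
ppg = pd.Series([2.0, 3.0, 4.0, 5.0, 6.0, 30.0, None])
outliers = iqr_outliers(ppg)
print(outliers.tolist())
```

Passing a column such as `stats["NBA_ppg"]` instead of the toy series would give the flagged values for the real data.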

The next bar chart shows the players whose NBA stats improved the most over their college stats. The intent is that the user can look deeper into these players to find what caused that improvement. The difference is computed by subtracting each player's college points per game from their NBA points per game. Players with no college statistics have no difference to compute; the fillna(0) call in the code sets their difference to 0, so they rank below every player whose scoring actually improved.

In [14]:
stats["ppg_diff"] = stats["NBA_ppg"] - stats["NCAA_ppg"]
players_with_p = stats.fillna(0)
#sort_values returns a new data frame, so the result is assigned back
players_with_p = players_with_p.sort_values(by = "ppg_diff", ascending = False)
#hist2 when called with interact displays an interactive plot
def hist2(size):   
    top_n = players_with_p.nlargest(size, 'ppg_diff')
    names = top_n["name"]
    c = top_n["position"]
    y_pos = np.arange(len(names))
    plt.bar(y_pos, top_n["ppg_diff"],edgecolor ="black",
                linewidth = 3, color = "C1")
    plt.xticks(y_pos, names, rotation = 90)
    plt.title("Top Players: Improved")
    mpl.rcParams['grid.color'] = 'k'
    mpl.rcParams['grid.linestyle'] = ':'
    mpl.rcParams['grid.linewidth'] = 0.5
    mpl.rcParams['figure.facecolor'] = '1'
    mpl.rcParams['figure.figsize'] = [10.0, 7.50]
    
interact(hist2, size = [5,8,10,12,15])

interactive(children=(Dropdown(description='size', options=(5, 8, 10, 12, 15), value=5), Output()), _dom_class…

<function __main__.hist2(size)>
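
The points-per-game difference behind this chart can be sketched on a toy data frame. This version assumes that players without college statistics should be excluded outright with dropna, which is one alternative to filling their values with 0. The names and values are made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "NBA_ppg": [20.0, 8.0, 25.0],
    "NCAA_ppg": [15.0, 12.0, np.nan],  # C has no college statistics
})

# Subtracting a missing value gives NaN, so C has no difference.
toy["ppg_diff"] = toy["NBA_ppg"] - toy["NCAA_ppg"]

# Excluding players without a difference keeps them out of the ranking entirely.
ranked = toy.dropna(subset=["ppg_diff"]).nlargest(2, "ppg_diff")
print(ranked[["name", "ppg_diff"]])
```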

The next chart also highlights differences between a player's college and NBA statistics. However, this chart lets the user choose which statistic to compare and how many players to show. The chart ranks players by the chosen NBA statistic and plots each player's NCAA value beside it, so choosing a smaller size lets the user look more closely at the top players.

In [15]:

data = ["ppg", "g_played", "fg_per_game", "fg%"]
players_with_p = stats
#hist3 when called with interact displays an interactive chart 
def hist3(size, data):
    top_n = players_with_p.nlargest(size, "NBA_"+data)
    names = top_n["name"]
    c = top_n["position"]
    y_pos = np.arange(len(names))
    width = len(names)/(2*len(y_pos))
    plt.bar(y_pos, top_n["NBA_"+data], color = "C1",width = width, edgecolor = "black", linewidth = 
                3)
    plt.bar(y_pos + width, top_n["NCAA_"+data], color = "C4",width = width, edgecolor ="black",
                linewidth = 3)
    plt.xticks(y_pos + (width/2), names, rotation = 90)
    plt.title("Top " +str(size) + " Players: Statistic Differences in " + data)
    mpl.rcParams['grid.color'] = 'k'
    mpl.rcParams['grid.linestyle'] = ':'
    mpl.rcParams['grid.linewidth'] = 0.5
    mpl.rcParams['figure.facecolor'] = '1'
    mpl.rcParams['figure.figsize'] = [10.0, 7.50]
interact(hist3, data = data, size = [5,8,10,12,15])

interactive(children=(Dropdown(description='size', options=(5, 8, 10, 12, 15), value=5), Dropdown(description=…

<function __main__.hist3(size, data)>
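
The side-by-side bars in this chart depend on a small amount of position arithmetic: each series is shifted by one bar width, and the tick label sits midway between the bars of a group. A minimal sketch of that arithmetic, using a hypothetical helper named `grouped_bar_positions`:

```python
import numpy as np

def grouped_bar_positions(n_groups, n_series=2, bar_width=0.5):
    """Return bar centers for each series and the tick position for each group."""
    base = np.arange(n_groups)
    # Series i is shifted right by i bar widths.
    centers = [base + i * bar_width for i in range(n_series)]
    # Ticks sit at the midpoint of the first and last bar center in a group.
    ticks = base + bar_width * (n_series - 1) / 2
    return centers, ticks

centers, ticks = grouped_bar_positions(3)
print(centers[0], centers[1], ticks)
```

With two series and a bar width of 0.5, these are the same positions the chart passes to plt.bar and plt.xticks.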

The next goal is to allow the user to observe trends in both college and NBA stats throughout history. The user can choose a year from 1948 to 2017, whether to observe NCAA stats, NBA stats, or both, the size of the chart, and the statistic. For each year, a player is included in the analysis if they were active during that year. The analysis is somewhat limited in the early years because there is not much data from those years.

In [16]:
match = stats

def historical(year, data, size, college):
    #inclusive bounds keep players whose first or last season is the chosen year
    active_players = match.loc[(match["active_from"] <= year) & (match["active_to"] >= year)]
    if college == "Both":
        top_n = active_players.nlargest(size, "NBA_"+data)
        names = top_n["name"]
        y_pos = np.arange(len(names))
        width = len(names)/(2*len(y_pos))
        plt.bar(y_pos, top_n["NBA_"+data], color = "C1",width = width, edgecolor = "black", linewidth = 
                3)
        plt.bar(y_pos + width, top_n["NCAA_"+data], color = "C4",width = width, edgecolor ="black",
                linewidth = 3)
        plt.xticks(y_pos + (width/2), names, rotation = 90)
        plt.title("College and NBA Stats: Top " + str(size)+ " Players Who Played in " + str(year) + " by " + data)
    if college == "NBA":
        top_n = active_players.nlargest(size, "NBA_"+data)
        names = top_n["name"]
        y_pos = np.arange(len(names))
        plt.bar(y_pos, top_n["NBA_"+data],color = "C1", edgecolor = "black", linewidth = 
                3)
        plt.xticks(y_pos, names, rotation = 90)
        plt.title("NBA Stats: Top " + str(size)+ " Players Who Played in " + str(year) + " by " + data)
    if college == "NCAA":
        top_n = active_players.nlargest(size, "NCAA_"+data)
        names = top_n["name"]
        y_pos = np.arange(len(names))
        plt.bar(y_pos, top_n["NCAA_"+data],color = "C4", edgecolor = "black", linewidth = 
                3)
        plt.xticks(y_pos, names, rotation = 90)
        plt.title("College Stats: Top " + str(size)+ " Players Who Played in " + str(year) + " by " + data)
    mpl.rcParams['grid.color'] = 'k'
    mpl.rcParams['grid.linestyle'] = ':'
    mpl.rcParams['grid.linewidth'] = 0.5
    mpl.rcParams['figure.facecolor'] = '1'
    mpl.rcParams['figure.figsize'] = [10.0, 7.50]
    
interact(historical, data = data, year = list(range(1948,2018)), size =[5,8,10,12,15, 20, 25], college = ["Both", "NCAA", "NBA"] )


interactive(children=(Dropdown(description='year', options=(1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 19…

<function __main__.historical(year, data, size, college)>
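
The year filter is the core of this chart. The sketch below isolates it on a toy data frame and assumes that a player's first and last seasons should count as active years, so it uses inclusive comparisons. The helper name `active_in` is hypothetical, and the career spans are taken from the first rows shown earlier.

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "active_from": [1991, 1969, 1998],
    "active_to": [1995, 1978, 2003],
})

def active_in(df, year):
    """Rows whose career span covers `year`, counting the first and last seasons."""
    return df[(df["active_from"] <= year) & (df["active_to"] >= year)]

print(active_in(toy, 1991)["name"].tolist())
print(active_in(toy, 1980)["name"].tolist())
```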

## College Analysis
The next step is to analyze players' college statistics to give NBA scouts a better idea of player trends. The first plot displays the top colleges, ranked by the total number of players from each college who made it to the NBA.

In [17]:
colleges = pd.DataFrame({"count" :stats.groupby("college").size()}).reset_index()
#https://stackoverflow.com/questions/10373660/converting-a-pandas-groupby-output-from-series-to-dataframe
#I use this website in creating the line of code above that returns a grouped by data frame. 
def hist4(size):   
    top_n = colleges.nlargest(size, 'count')
    names = top_n["college"]
    y_pos = np.arange(len(names))
    plt.bar(y_pos, top_n["count"], color = "C1", edgecolor = "black", linewidth = 3)
    plt.xticks(y_pos, names, rotation = 90)
    plt.title("Top " + str(size) + " Colleges With Players in NBA")
    mpl.rcParams['grid.color'] = 'k'
    mpl.rcParams['grid.linestyle'] = ':'
    mpl.rcParams['grid.linewidth'] = 0.5
    mpl.rcParams['figure.facecolor'] = '1'
    mpl.rcParams['figure.figsize'] = [10.0, 7.50]
    
interact(hist4, size = (3,18,3))  

interactive(children=(IntSlider(value=9, description='size', max=18, min=3, step=3), Output()), _dom_classes=(…

<function __main__.hist4(size)>
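
The per-college count can also be sketched with value_counts, which gives the same totals as grouping by college and taking the group sizes. The toy data below is made up; note that both approaches leave out players with no college listed.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"college": ["Duke University", "Duke University",
                                "Iowa State University", np.nan]})

# Group sizes, as in the notebook, and the value_counts shortcut.
by_groupby = toy.groupby("college").size()
by_value_counts = toy["college"].value_counts()  # sorted from most to fewest players

print(by_value_counts)
```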

Along with counting the total players coming from a college, it is useful to see the differences between the NBA and NCAA statistics of players from those colleges when judging the quality of those players. We can assume that players from colleges with a higher difference between NBA and NCAA stats are more likely to be successful in the NBA. The difference option lets the user observe these trends. With the difference option turned off, the user can select a statistic and see the average player value in that category for both NBA and NCAA.

In [18]:
colleges = colleges.nlargest(25, "count")
stats

new_col = pd.merge(colleges, match, on = "college")

def avg_stats(size, stat, diff):
    ncaa = new_col.groupby("college", as_index = False)["NCAA_"+stat].mean()
    nba = new_col.groupby("college", as_index = False)["NBA_"+stat].mean()
    merg = pd.merge(ncaa, nba, on = "college")
    if diff == True:
        merg[stat+"diff"] = merg["NBA_"+stat] - merg["NCAA_"+stat]
        top_n = merg.nlargest(size, stat+"diff")
        y_pos = np.arange(len(top_n))
        #the difference is plotted negated, so each bar shows NCAA minus NBA
        plt.bar(y_pos, -top_n[stat+"diff"], color = "C1", width = 1, edgecolor = "black",
                linewidth = 3)
        #label the bars with the selected colleges, not the full college list
        plt.xticks(y_pos, top_n["college"], rotation = 90)
        plt.title("Top Players: Statistic Differences")
    else:
        top_n = merg.nlargest(size, "NCAA_"+stat)
        y_pos = np.arange(len(top_n))
        width = 0.5
        plt.bar(y_pos, top_n["NCAA_"+stat], color = "C1", width = width, edgecolor = "black",
                linewidth = 3)
        plt.bar(y_pos + width, top_n["NBA_"+stat], color = "C4", width = width, edgecolor = "black",
                linewidth = 3)
        #label the bars with the selected colleges, not the full college list
        plt.xticks(y_pos + width/2, top_n["college"], rotation = 90)
        plt.title("Top " + str(size)+ " Colleges: Statistic Differences in " + stat)
    mpl.rcParams['grid.color'] = 'k'
    mpl.rcParams['grid.linestyle'] = ':'
    mpl.rcParams['grid.linewidth'] = 0.5
    mpl.rcParams['figure.facecolor'] = '1'
    mpl.rcParams['figure.figsize'] = [10.0, 7.50]
interact(avg_stats, stat = data, size = [5,8,10,12,15], diff = False)

interactive(children=(Dropdown(description='size', options=(5, 8, 10, 12, 15), value=5), Dropdown(description=…

<function __main__.avg_stats(size, stat, diff)>
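
The per-college averages and their difference can be sketched in a single groupby call rather than two groupbys followed by a merge. This is an alternative formulation for illustration, with made-up colleges and values, not the code used in the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "college": ["X", "X", "Y"],
    "NBA_ppg": [10.0, 6.0, 12.0],
    "NCAA_ppg": [16.0, 12.0, 15.0],
})

stat = "ppg"
# Average both leagues' columns per college in one pass.
means = toy.groupby("college", as_index=False)[["NBA_" + stat, "NCAA_" + stat]].mean()
means[stat + "_diff"] = means["NBA_" + stat] - means["NCAA_" + stat]

# The college whose players lose the least scoring going from NCAA to NBA.
top = means.nlargest(1, stat + "_diff")
print(means)
```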

Much like the player historical data section, this section explores historical data and allows the user to select the time period they are interested in. This section highlights the colleges that were successful in those years in having a high output of NBA players. 

In [19]:
def historical2(year, size):
    #inclusive bounds keep players whose first or last season is the chosen year
    active_players = match.loc[(match["active_from"] <= year) & (match["active_to"] >= year)]
    colleges = pd.DataFrame({"count" :active_players.groupby("college").size()}).reset_index()
    top_n = colleges.nlargest(size, "count")
    names = top_n["college"]
    y_pos = np.arange(len(names))
    plt.bar(y_pos, top_n["count"], color = "C1", edgecolor = "black", linewidth = 3)
    plt.xticks(y_pos, names, rotation = 90)
    plt.title("Top " + str(size) + " Colleges With Players in NBA in " + str(year))
    mpl.rcParams['grid.color'] = 'k'
    mpl.rcParams['grid.linestyle'] = ':'
    mpl.rcParams['grid.linewidth'] = 0.5
    mpl.rcParams['figure.facecolor'] = '1'
    mpl.rcParams['figure.figsize'] = [10.0, 7.50]

interact(historical2, year = list(range(1950, 2018)), size = [5,8,10,12,15, 20, 25])
    

interactive(children=(Dropdown(description='year', options=(1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 19…

<function __main__.historical2(year, size)>

Finally, the last bar chart allows the user to select from the top 25 colleges by NBA player output. This bar chart shows that college's output of NBA players by decade, counted by each player's first active year, and highlights when the college was most successful. We can assume that the more NBA players a college produces during a decade, the better its team was. It is interesting to note when each college was most successful in producing NBA players and how that trend changes across the decades.

In [20]:
def historical_colleges(college):
    years = ["50s", "60s", "70s", "80s", "90s", "2000s", "2010s"]
    players = stats.loc[stats["college"] == college]
    one = len(players.loc[(players["active_from"] < 1960) & (players["active_from"] > 1940)].index)
    two = len(players.loc[(players["active_from"] < 1970) & (players["active_from"] > 1959)].index)
    three = len(players.loc[(players["active_from"] < 1980) & (players["active_from"] > 1969)].index)
    four = len(players.loc[(players["active_from"] < 1990) & (players["active_from"] > 1979)].index)
    five = len(players.loc[(players["active_from"] < 2000) & (players["active_from"] > 1989)].index)
    six = len(players.loc[(players["active_from"] < 2010) & (players["active_from"] > 1999)].index)
    seven = len(players.loc[(players["active_from"] < 2020) & (players["active_from"] > 2009)].index)
    count = [one, two, three, four, five, six, seven]
    y_pos = np.arange(len(years))
    plt.bar(y_pos, count, color = "C1", edgecolor = "black", linewidth = 3)
    plt.xticks(y_pos, years, rotation = 90)
    plt.title("Players to NBA from " + college)
    mpl.rcParams['grid.color'] = 'k'
    mpl.rcParams['grid.linestyle'] = ':'
    mpl.rcParams['grid.linewidth'] = 0.5
    mpl.rcParams['figure.facecolor'] = '1'
    mpl.rcParams['figure.figsize'] = [10.0, 7.50]
 
college_names = colleges["college"]
interact(historical_colleges, college = college_names)





interactive(children=(Dropdown(description='college', options=('University of Kentucky', 'University of Califo…

<function __main__.historical_colleges(college)>
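
The decade counts can also be sketched with integer division instead of one comparison per decade. Unlike the chart above, which places the 1940s debuts in the "50s" bar, this version keeps the 1940s as their own bucket. The debut years are made up.

```python
import pandas as pd

# Hypothetical first active years for one college's players.
debut_years = pd.Series([1948, 1955, 1969, 1970, 2011, 2017])

# Integer division by 10 drops the last digit, giving the decade's starting year.
decades = (debut_years // 10) * 10
per_decade = decades.value_counts().sort_index()
print(per_decade)
```

Applied to `players["active_from"]`, the same two lines would give the counts for any selected college.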