<img src="https://theundefeated.com/wp-content/uploads/2017/05/nba-logo.png?w=50" style="float: left; margin: 20px; height: 55px">

# NBA Player Prediction: 1.1 Data Collection

_Authors: Patrick Wales-Dinan_

---

This project is an attempt to glean information about NBA players. When teams invest in a player, they are taking a risk: they pay millions of dollars for the promise of a return on the basketball court and/or through marketing and ticket sales. These two are obviously related, but the relationship is not one to one. Here I wanted to focus on whether you could predict the future performance of an NBA player based on his statistics from his first two years. I had a number of different theories about what might be interesting and useful. Ultimately I decided on the following question: **`Can I use a player's stats from their first two seasons in the NBA to predict whether they will achieve a benchmark?`** What benchmark? Well, I chose a few, and I will discuss them further as we continue.

---

## Notebook Contents:
- [Part 1](#Part-1)
- [Part 2](#Part-2)
- [Storing Dataframes](#Storing-the-Dataframes)

## Other Notebooks:

- [1.2_Preprocessing](/Users/pwalesdi/Desktop/GA/NBA_Player_Prediction/Notebooks/1.2_NBA_Preprocessing.ipynb)
- [1.3_Pre_Processing](/Users/pwalesdi/Desktop/GA/GA_Project_2/Project_2_Graphs_&_Relationships.ipynb)

In [1]:
import numpy as np
import pandas as pd
import os
pd.set_option('display.max_rows', 2000)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

## Part 1
This data was downloaded from the advanced metrics tables on Basketball Reference (https://www.basketball-reference.com/) for each NBA season going back to 2006. I also took data from the NBA draft tables going back to 2006. I combine all of this into a single dataframe that I will use for modeling. I chose 2006 as the starting point because that was the first year the NBA required draft picks to be one year removed from high school. I decided not to go back any further because I suspected there could be significant differences between the draft classes that included high schoolers and the classes since then.

Part one builds the basic NBA dataframe from which all later subsets are taken. Part two of this notebook creates some specific dataframes that I can use for EDA and visualizations.
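The per-season loading and tagging described here could also be written as a loop. This is only a minimal sketch: the helper name `load_season_tables` and its parameters are hypothetical, and the demo files are made up so that the block runs without the real downloads. Unlike the cell that follows, it resets the row index when concatenating.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_season_tables(data_dir, years, pattern, year_column, offset=0):
    """Read one CSV per year and tag every row with `year + offset`."""
    frames = []
    for year in years:
        frame = pd.read_csv(Path(data_dir) / pattern.format(year=year))
        frame[year_column] = year + offset
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)


# Tiny demo with made-up files so the sketch runs without the real downloads
with tempfile.TemporaryDirectory() as demo_dir:
    for year in (2006, 2007):
        Path(demo_dir, f"{year}_advanced.txt").write_text("Player,PER\nPlayer A,15.0\n")
    demo = load_season_tables(demo_dir, [2006, 2007], "{year}_advanced.txt", "SEASON")

print(demo)
```

Assuming the file layout used in this notebook, the real calls would look something like `load_season_tables(path, range(2006, 2020), "{year}_advanced.txt", "SEASON")` for the advanced tables and `load_season_tables(path, range(2006, 2020), "draft/{year}_draft.txt", "DRAFT_YEAR+1", offset=1)` for the draft tables.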

In [5]:
# Read in all of the .txt files that I have downloaded from basketball reference
path = "/Users/pwalesdinan/Desktop/GA/NBA_Player_Prediction/csv_files/"
nba_2019 = pd.read_csv(path + '/2019_advanced.txt')
nba_2018 = pd.read_csv(path + '/2018_advanced.txt')
nba_2017 = pd.read_csv(path + '/2017_advanced.txt')
nba_2016 = pd.read_csv(path + '/2016_advanced.txt')
nba_2015 = pd.read_csv(path + '/2015_advanced.txt')
nba_2014 = pd.read_csv(path + '/2014_advanced.txt')
nba_2013 = pd.read_csv(path + '/2013_advanced.txt')
nba_2012 = pd.read_csv(path + '/2012_advanced.txt')
nba_2011 = pd.read_csv(path + '/2011_advanced.txt')
nba_2010 = pd.read_csv(path + '/2010_advanced.txt')
nba_2009 = pd.read_csv(path + '/2009_advanced.txt')
nba_2008 = pd.read_csv(path + '/2008_advanced.txt')
nba_2007 = pd.read_csv(path + '/2007_advanced.txt')
nba_2006 = pd.read_csv(path + '/2006_advanced.txt')
draft_2019 = pd.read_csv(path + '/draft/2019_draft.txt')
draft_2018 = pd.read_csv(path + '/draft/2018_draft.txt')
draft_2017 = pd.read_csv(path + '/draft/2017_draft.txt')
draft_2016 = pd.read_csv(path + '/draft/2016_draft.txt')
draft_2015 = pd.read_csv(path + '/draft/2015_draft.txt')
draft_2014 = pd.read_csv(path + '/draft/2014_draft.txt')
draft_2013 = pd.read_csv(path + '/draft/2013_draft.txt')
draft_2012 = pd.read_csv(path + '/draft/2012_draft.txt')
draft_2011 = pd.read_csv(path + '/draft/2011_draft.txt')
draft_2010 = pd.read_csv(path + '/draft/2010_draft.txt')
draft_2009 = pd.read_csv(path + '/draft/2009_draft.txt')
draft_2008 = pd.read_csv(path + '/draft/2008_draft.txt')
draft_2007 = pd.read_csv(path + '/draft/2007_draft.txt')
draft_2006 = pd.read_csv(path + '/draft/2006_draft.txt')

# Adding a season column to each of the Dataframes that I have loaded in
nba_2019['SEASON'] = 2019
nba_2018['SEASON'] = 2018
nba_2017['SEASON'] = 2017
nba_2016['SEASON'] = 2016
nba_2015['SEASON'] = 2015
nba_2014['SEASON'] = 2014
nba_2013['SEASON'] = 2013
nba_2012['SEASON'] = 2012
nba_2011['SEASON'] = 2011
nba_2010['SEASON'] = 2010
nba_2009['SEASON'] = 2009
nba_2008['SEASON'] = 2008
nba_2007['SEASON'] = 2007
nba_2006['SEASON'] = 2006

# Adding a "DRAFT_YEAR+1" column to each of the draft DataFrames. The reason that I am doing this is that
# I want each player's first season to be matched to their draft, i.e. a player's first season is
# taken to be the season after their draft year.
draft_2019['DRAFT_YEAR+1'] = 2020
draft_2018['DRAFT_YEAR+1'] = 2019
draft_2017['DRAFT_YEAR+1'] = 2018
draft_2016['DRAFT_YEAR+1'] = 2017
draft_2015['DRAFT_YEAR+1'] = 2016
draft_2014['DRAFT_YEAR+1'] = 2015
draft_2013['DRAFT_YEAR+1'] = 2014
draft_2012['DRAFT_YEAR+1'] = 2013
draft_2011['DRAFT_YEAR+1'] = 2012
draft_2010['DRAFT_YEAR+1'] = 2011
draft_2009['DRAFT_YEAR+1'] = 2010
draft_2008['DRAFT_YEAR+1'] = 2009
draft_2007['DRAFT_YEAR+1'] = 2008
draft_2006['DRAFT_YEAR+1'] = 2007
#Creating a list of every advanced metrics DF
advanced_list = [nba_2019, nba_2018, nba_2017, nba_2016, nba_2015, nba_2014, nba_2013, 
           nba_2012, nba_2011, nba_2010, nba_2009, nba_2008, nba_2007, nba_2006]
#Creating a list of every draft DF
draft_list = [draft_2019, draft_2018, draft_2017, draft_2016, draft_2015, draft_2014, draft_2013, 
           draft_2012, draft_2011, draft_2010, draft_2009, draft_2008, draft_2007, draft_2006]

#Creating a master advanced metrics DF and a master draft DF from the lists above
advanced = pd.concat(advanced_list)
draft = pd.concat(draft_list)

#Splitting the player name and unique player id column and labeling them, 
# then dropping the Player column and the unnamed columns
advanced[['player_name','player_id']] = advanced.Player.str.split("\\", expand=True)
advanced.drop(columns=['Player', 'Unnamed: 19', 'Unnamed: 24', 'Rk'], inplace=True)
advanced.head(3)

#Splitting the player name and unique player id column and labeling them, 
# then dropping various columns that are redundant between DFs
draft[['player_name','player_id']] = draft.Player.str.split("\\", expand=True)
draft.drop(columns=['Player', 'MP', 'MP.1', 'WS', 'WS/48', 'VORP', 'BPM', 'G', 'Rk'], inplace=True)
draft.head(3)

# Merge the DFs and then rename some of the columns and reorder all the columns
nba = pd.merge(advanced, draft, how='left', on='player_id')
nba.drop(columns=['player_name_y',], inplace=True)
nba.rename({'player_name_x':'Player_name', 'Tm_y':'Draft_team', 'PTS.1':'PPG', 'TRB.1':'RPG', 'AST.1':'APG'}, axis=1, inplace=True)
nba = nba[['Player_name', 'player_id','SEASON', 'Tm_x','DRAFT_YEAR+1','Draft_team','Pk','Pos','Age','G', 'MP', 'PER', 'TS%', '3PAr',
           'FTr', 'ORB%', 'DRB%', 'TRB%', 'AST%', 'STL%', 'BLK%', 'TOV%', 'USG%', 'OWS', 'DWS', 'WS', 'WS/48', 'OBPM', 'DBPM',
           'BPM', 'VORP', 'College', 'Yrs', 'PTS', 'TRB', 'AST', 'FG%', '3P%', 'FT%', 'PPG', 'RPG', 'APG',]]
# Some NBA teams have changed city locations during this time, so I have renamed their teams to be the most current 
# iteration of the team
nba.loc[(nba['Tm_x'] == 'NOH'), 'Tm_x'] = 'NOP'
nba.loc[(nba['Tm_x'] == 'SEA'), 'Tm_x'] = 'OKC'
nba.loc[(nba['Tm_x'] == 'NJN'), 'Tm_x'] = 'BRK'
nba.loc[(nba['Tm_x'] == 'CHA'), 'Tm_x'] = 'CHO'
nba.loc[(nba['Tm_x'] == 'NOK'), 'Tm_x'] = 'NOP'
nba = nba[(nba.Tm_x != 'TOT')]
# Here I am creating a column for the draft round and then populating it with the appropriate round that the player 
# was drafted in
nba["draft_round"] = np.nan
nba.loc[nba['Pk'] > 30 , 'draft_round'] = 2
nba.loc[nba['Pk'] < 31 , 'draft_round'] = 1
# Storing the data for use in other notebooks
%store nba

Stored 'nba' (DataFrame)
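To make the split, merge, and draft-round steps in the cell above easier to follow, here is a small sketch on toy data. The player names, IDs, and pick numbers are made up, and the `validate="many_to_one"` check is my addition rather than something the cell above does: it makes the merge fail loudly if a `player_id` appears more than once in the draft table.

```python
import numpy as np
import pandas as pd

# Toy rows in the raw "Name\id" format (all values are made up for illustration)
advanced_toy = pd.DataFrame({
    "Player": ["Player A\\playera01", "Player B\\playerb01"],
    "SEASON": [2010, 2010],
})
draft_toy = pd.DataFrame({
    "Player": ["Player A\\playera01"],
    "Pk": [35],
    "DRAFT_YEAR+1": [2008],
})

# Split "Name\id" into two labeled columns, as in the cell above
for frame in (advanced_toy, draft_toy):
    frame[["player_name", "player_id"]] = frame["Player"].str.split("\\", expand=True)

# Left merge keeps undrafted players; they get NaN in the draft columns
merged = advanced_toy.drop(columns="Player").merge(
    draft_toy.drop(columns=["Player", "player_name"]),
    how="left",
    on="player_id",
    validate="many_to_one",
)

# Picks 1-30 are round 1, later picks are round 2, undrafted players stay NaN
merged["draft_round"] = np.where(
    merged["Pk"].isna(), np.nan, np.where(merged["Pk"] <= 30, 1.0, 2.0)
)
print(merged)
```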


## Part 2

Here I am looking specifically at players who improved their [`VORP`](https://www.basketball-reference.com/about/bpm.html) (value over replacement player) by more than 2 units relative to their season two years earlier. The resulting dataframe therefore contains individual player seasons in which `VORP` was more than two units higher than it was two years before. For example, [Stephen Curry](https://www.basketball-reference.com/players/c/curryst01.html#all_advanced) had a `VORP` of 2.4 in 2011, and in 2013 his `VORP` jumped to 5.6, so Steph's 2013 season would be classified as an `improvement season`. He then tallied a `VORP` of 7.9 in 2015, which would also be coded as an `improvement season`. Because I didn't want duplicates, the dataframe below contains only the first season in which a player achieved this. The second dataframe created below applies the same idea to [`PER`](https://www.basketball-reference.com/about/per.html) (Player Efficiency Rating), using an improvement of more than 4 points over the season two years earlier.

In [6]:
# Collect every season in which a player's VORP is more than 2 above their VORP two seasons earlier
improved_seasons = []
for player in nba['Player_name'].unique():
    player_df = nba.loc[nba['Player_name'] == player]
    max_year = max(player_df.SEASON)
    min_year = min(player_df.SEASON)
    for year in range(min_year, max_year + 1):
        stats1 = player_df.loc[player_df['SEASON'] == year]
        vorp1 = stats1['VORP']
        stats2 = player_df.loc[player_df['SEASON'] == (year + 2)]
        vorp2 = stats2['VORP']
        # .sum() adds up multi-team seasons; a season with no rows sums to 0
        if vorp2.sum() > (vorp1.sum() + 2):
            improved_seasons.append(stats2)
# Concatenate once at the end (DataFrame.append is not available in pandas 2.0+)
years = pd.concat(improved_seasons)

# Keep only the first improvement season for each player
first_seasons = []
for player in years['Player_name'].unique():
    first_season = min(years.loc[years['Player_name'] == player]["SEASON"])
    first_season_stats = years.loc[(years['SEASON'] == first_season) & (years["Player_name"] == player)]
    first_seasons.append(first_season_stats)
improvement = pd.concat(first_seasons)

In [7]:
# Same procedure for PER: collect seasons where PER is more than 4 above the PER two seasons earlier
improved_seasons = []
for player in nba['Player_name'].unique():
    player_df = nba.loc[nba['Player_name'] == player]
    max_year = max(player_df.SEASON)
    min_year = min(player_df.SEASON)
    for year in range(min_year, max_year + 1):
        stats1 = player_df.loc[player_df['SEASON'] == year]
        per1 = stats1['PER']
        stats2 = player_df.loc[player_df['SEASON'] == (year + 2)]
        per2 = stats2['PER']
        # .sum() adds up multi-team seasons; a season with no rows sums to 0
        if per2.sum() > (per1.sum() + 4):
            improved_seasons.append(stats2)
# Concatenate once at the end (DataFrame.append is not available in pandas 2.0+)
years = pd.concat(improved_seasons)

# Keep only the first improvement season for each player
first_seasons = []
for player in years['Player_name'].unique():
    first_season = min(years.loc[years['Player_name'] == player]["SEASON"])
    first_season_stats = years.loc[(years['SEASON'] == first_season) & (years["Player_name"] == player)]
    first_seasons.append(first_season_stats)
per_improvement = pd.concat(first_seasons)
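The "first improvement season" rule can also be sketched without explicit loops by joining each season to the same player's season two years earlier. This is a rough alternative, not a drop-in replacement for the cells above: the function name `first_improvement_season` is hypothetical, it assumes one row per player per season, it keys on `player_id` rather than the player's name, and it only compares seasons that both exist (the loops above treat a missing season as 0 because of `.sum()`). The Curry values are the ones quoted in the text; `playerx01` is made up.

```python
import pandas as pd


def first_improvement_season(df, stat, threshold, lag=2):
    """First season per player where `stat` beats its value `lag` seasons earlier by more than `threshold`."""
    current = df[["player_id", "SEASON", stat]]
    earlier = current.rename(columns={stat: f"{stat}_earlier"}).copy()
    earlier["SEASON"] = earlier["SEASON"] + lag  # line up season t with season t + lag
    paired = current.merge(earlier, on=["player_id", "SEASON"], how="inner")
    improved = paired[paired[stat] > paired[f"{stat}_earlier"] + threshold]
    first = improved.sort_values("SEASON").drop_duplicates("player_id", keep="first")
    return first.reset_index(drop=True)


toy = pd.DataFrame({
    "player_id": ["curryst01"] * 3 + ["playerx01"] * 2,
    "SEASON": [2011, 2013, 2015, 2011, 2013],
    "VORP": [2.4, 5.6, 7.9, 1.0, 1.5],
})

# Curry qualifies in both 2013 and 2015, but only 2013 (the first) is kept
result = first_improvement_season(toy, "VORP", threshold=2)
print(result)
```

Calling the same function with `stat="PER"` and `threshold=4` would correspond to the second rule.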

In [8]:
per_improvement = per_improvement.loc[(per_improvement["MP"] > 900) | (per_improvement["G"] > 55)]
per_improvement = per_improvement[per_improvement.PER > 11]

In [9]:
per_improvement = per_improvement.loc[per_improvement["DRAFT_YEAR+1"].notnull()]
print(per_improvement.shape)
per_improvement.sort_values(by='SEASON', ascending=True)

(166, 43)


Unnamed: 0,Player_name,player_id,SEASON,Tm_x,DRAFT_YEAR+1,Draft_team,Pk,Pos,Age,G,MP,PER,TS%,3PAr,FTr,ORB%,DRB%,TRB%,AST%,STL%,BLK%,TOV%,USG%,OWS,DWS,WS,WS/48,OBPM,DBPM,BPM,VORP,College,Yrs,PTS,TRB,AST,FG%,3P%,FT%,PPG,RPG,APG,draft_round
5484,Brandon Roy,roybr01,2009,POR,2007.0,MIN,6.0,SG,24,78,2903,24.0,0.573,0.167,0.383,4.4,11.6,7.9,25.4,1.7,0.6,9.0,27.4,10.9,2.6,13.5,0.223,5.9,-0.2,5.8,5.7,University of Washington,6.0,6136.0,1388.0,1517.0,0.459,0.348,0.8,18.8,4.3,4.7,1.0
5419,Steve Novak,novakst01,2009,LAC,2007.0,HOU,32.0,PF,25,71,1161,13.5,0.606,0.722,0.058,1.8,10.9,6.2,5.8,0.9,0.3,5.4,16.8,2.0,0.2,2.2,0.092,1.6,-3.5,-1.9,0.0,Marquette University,11.0,2177.0,591.0,132.0,0.437,0.43,0.877,4.7,1.3,0.3,2.0
5478,Rajon Rondo,rondora01,2009,BOS,2007.0,PHO,21.0,PG,22,80,2642,18.8,0.543,0.063,0.353,4.8,13.9,9.6,39.7,3.0,0.3,19.2,19.2,4.8,5.1,9.9,0.179,2.0,2.5,4.5,4.3,University of Kentucky,13.0,8567.0,3989.0,6975.0,0.46,0.315,0.605,10.4,4.8,8.5,1.0
5346,Kyle Lowry,lowryky01,2009,MEM,2007.0,MEM,24.0,PG,22,49,1071,14.1,0.54,0.254,0.592,1.5,11.4,6.4,26.9,2.4,0.6,18.7,18.4,0.9,0.9,1.8,0.081,-0.7,-0.8,-1.5,0.1,Villanova University,13.0,12355.0,3654.0,5224.0,0.424,0.367,0.805,14.4,4.3,6.1,1.0
4611,Andrea Bargnani,bargnan01,2010,TOR,2007.0,TOR,1.0,PF,24,80,2799,15.5,0.552,0.284,0.205,4.6,15.9,10.4,5.4,0.5,3.0,8.8,22.3,3.3,0.9,4.2,0.072,0.7,-1.5,-0.8,0.8,,10.0,7873.0,2541.0,653.0,0.439,0.354,0.824,14.3,4.6,1.2,1.0
4621,Marco Belinelli,belinma01,2010,TOR,2008.0,GSW,18.0,SG,23,66,1121,12.6,0.543,0.417,0.319,1.5,8.3,5.0,11.8,1.9,0.3,12.2,20.1,0.9,0.2,1.2,0.049,0.4,-2.6,-2.2,-0.1,,12.0,8009.0,1670.0,1360.0,0.425,0.376,0.847,10.0,2.1,1.7,1.0
4781,Al Horford,horfoal01,2010,ATL,2008.0,ATL,3.0,C,23,81,2845,19.4,0.594,0.001,0.319,9.6,23.3,16.4,10.4,1.1,2.4,11.2,17.6,6.9,3.9,10.9,0.183,1.5,1.9,3.3,3.8,University of Florida,12.0,11092.0,6597.0,2548.0,0.525,0.368,0.754,14.1,8.4,3.2,1.0
4707,Kevin Durant,duranke01,2010,OKC,2008.0,SEA,2.0,SF,21,82,3239,26.2,0.607,0.21,0.504,3.8,17.9,11.0,13.5,1.8,1.9,11.7,32.0,11.1,5.0,16.1,0.238,4.9,0.2,5.1,5.8,University of Texas at Austin,12.0,22940.0,5992.0,3486.0,0.493,0.381,0.883,27.0,7.1,4.1,1.0
4396,Kevin Love,loveke01,2011,MIN,2009.0,MEM,5.0,PF,22,73,2611,24.3,0.593,0.206,0.486,13.7,34.2,23.6,11.8,0.9,0.8,11.1,22.9,8.9,2.5,11.4,0.21,3.9,-0.2,3.7,3.8,University of California Los Angeles,11.0,12006.0,7397.0,1519.0,0.442,0.37,0.827,18.3,11.3,2.3,1.0
4572,Shawne Williams,willish03,2011,NYK,2007.0,IND,17.0,PF,24,64,1323,12.2,0.558,0.551,0.127,5.1,15.6,10.3,5.3,1.5,2.8,10.2,15.2,1.4,1.1,2.5,0.089,0.0,-0.4,-0.5,0.5,University of Memphis,7.0,1769.0,945.0,220.0,0.403,0.339,0.755,5.6,3.0,0.7,1.0


### Storing the Dataframes

In [10]:
%store per_improvement
%store improvement

Stored 'per_improvement' (DataFrame)
Stored 'improvement' (DataFrame)
