<img src="https://theundefeated.com/wp-content/uploads/2017/05/nba-logo.png?w=50" style="float: left; margin: 20px; height: 55px">

# NBA Player Prediction

_Authors: Patrick Wales-Dinan_

---

This project is an attempt to glean information about NBA players. When teams invest in a player, they are taking a risk: they pay him millions of dollars for the promise of a return on the basketball court and/or through marketing and ticket sales. These two returns are obviously related, but the relationship is not one-to-one. Here I wanted to focus on whether you can predict the future performance of an NBA player based on his statistics in his first two years.




## Contents:
- [Data Import](#Data-Import)
- [Feature Creation](#Feature-Creation)
- [Choosing the Features](#Feature-Choice)
- [Log Scaling](#Log-Scaling-Independent-Variables)
- [Cleaning the Data and Modifying the Data](#Cleaning-&-Creating-the-Data-Set)
- [Modeling the Data](#Modeling-the-Data)
- [Model Analysis](#Analyzing-the-model)

Please visit the Graphs & Relationships notebook for additional visuals: [Here](/Users/pwalesdi/Desktop/GA/GA_Project_2/Project_2_Graphs_&_Relationships.ipynb)




In [3]:
import numpy as np
import pandas as pd
import os
pd.set_option('display.max_rows', 2000)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

## Part 1
This data was downloaded from the advanced metrics tables on Basketball Reference (https://www.basketball-reference.com/) for each NBA season going back to 2006. I also took data from the NBA draft tables going back to 2006. I am using all of this information to create a single dataframe that I will use to model the data. I chose to go back to 2006 because that was the first year the NBA required draft picks to be one year removed from high school. I decided not to go back any further because I suspected there could be significant differences between the draft classes that included high schoolers and the classes since then that have not.

Part one builds the base NBA dataframe from which all later subsets are taken. Part two of this notebook creates some specific dataframes that I can use for EDA and visualizations.

In [6]:
# Read in all of the .txt files that I have downloaded from Basketball Reference.
# Seasons run from 2019 down to 2006, matching the order of the original concat.
seasons = range(2019, 2005, -1)

# Load each advanced metrics table and add a season column to it
advanced_list = []
for season in seasons:
    season_df = pd.read_csv(f'./csv_files/{season}_advanced.txt')
    season_df['SEASON'] = season
    advanced_list.append(season_df)

# Load each draft table and add a "draft year + 1" column to it. A player's first
# season is always the season after his draft year, so this column lets each
# player's first season be matched to his draft.
draft_list = []
for draft_year in seasons:
    draft_df = pd.read_csv(f'./csv_files/draft/{draft_year}_draft.txt')
    draft_df['DRAFT_YEAR+1'] = draft_year + 1
    draft_list.append(draft_df)

# Creating a master advanced metrics DF and a master draft DF
advanced = pd.concat(advanced_list)
draft = pd.concat(draft_list)

#Splitting the player name and unique player id column and labeling them, 
# then dropping the Player column and the unnamed columns
advanced[['player_name','player_id']] = advanced.Player.str.split("\\", expand=True)
advanced.drop(columns=['Player', 'Unnamed: 19', 'Unnamed: 24', 'Rk'], inplace=True)
advanced.head(3)

# Splitting the player name and unique player id column and labeling them,
# then dropping various columns that are redundant between DFs
draft[['player_name','player_id']] = draft.Player.str.split("\\", expand=True)
draft.drop(columns=['Player', 'MP', 'MP.1', 'WS', 'WS/48', 'VORP', 'BPM', 'G', 'Rk'], inplace=True)
draft.head(3)

# Merge the DFs and then rename some of the columns and reorder all the columns
nba = pd.merge(advanced, draft, how='left', on='player_id')
nba.drop(columns=['player_name_y',], inplace=True)
nba.rename({'player_name_x':'Player_name', 'Tm_y':'Draft_team', 'PTS.1':'PPG', 'TRB.1':'RPG', 'AST.1':'APG'}, axis=1, inplace=True)
nba = nba[['Player_name', 'player_id','SEASON', 'Tm_x','DRAFT_YEAR+1','Draft_team','Pk','Pos','Age','G', 'MP', 'PER', 'TS%', '3PAr',
           'FTr', 'ORB%', 'DRB%', 'TRB%', 'AST%', 'STL%', 'BLK%', 'TOV%', 'USG%', 'OWS', 'DWS', 'WS', 'WS/48', 'OBPM', 'DBPM',
           'BPM', 'VORP', 'College', 'Yrs', 'PTS', 'TRB', 'AST', 'FG%', '3P%', 'FT%', 'PPG', 'RPG', 'APG',]]
# Some NBA teams have changed city locations during this time, so I have renamed their teams to be the most current 
# iteration of the team
nba.loc[(nba['Tm_x'] == 'NOH'), 'Tm_x'] = 'NOP'
nba.loc[(nba['Tm_x'] == 'SEA'), 'Tm_x'] = 'OKC'
nba.loc[(nba['Tm_x'] == 'NJN'), 'Tm_x'] = 'BRK'
nba.loc[(nba['Tm_x'] == 'CHA'), 'Tm_x'] = 'CHO'
nba.loc[(nba['Tm_x'] == 'NOK'), 'Tm_x'] = 'NOP'
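# Drop the 'TOT' rows, which are combined totals for players who appeared for more than one team in a season
nba = nba[(nba.Tm_x != 'TOT')]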
# Here I am creating a column for the draft round and then populating it with the appropriate round that the player 
# was drafted in
nba["draft_round"] = np.nan
nba.loc[nba['Pk'] > 30 , 'draft_round'] = 2
nba.loc[nba['Pk'] < 31 , 'draft_round'] = 1
# Storing the data for use in other notebooks
%store nba

Stored 'nba' (DataFrame)
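
To make the merge and draft-round steps above concrete, here is a minimal sketch on made-up rows. The player IDs, pick numbers, and the `advanced_toy`/`draft_toy` names are hypothetical, and `np.select` is swapped in for the two `.loc` assignments used in the notebook; it only illustrates the logic and is not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for the advanced metrics and draft tables
advanced_toy = pd.DataFrame({
    'player_id': ['aaa01', 'bbb01', 'ccc01'],
    'SEASON': [2010, 2010, 2010],
    'VORP': [1.2, 0.3, -0.1],
})
draft_toy = pd.DataFrame({
    'player_id': ['aaa01', 'bbb01'],  # 'ccc01' has no draft row
    'Pk': [3, 45],
    'DRAFT_YEAR+1': [2010, 2009],
})

# A left merge keeps every season row; draft columns are NaN when no draft row matches
toy = pd.merge(advanced_toy, draft_toy, how='left', on='player_id')

# Picks 1-30 are round 1, picks 31 and up are round 2, missing picks stay NaN
toy['draft_round'] = np.select(
    [toy['Pk'] < 31, toy['Pk'] > 30],
    [1.0, 2.0],
    default=np.nan,
)
print(toy[['player_id', 'Pk', 'draft_round']])
```

Because comparisons against `NaN` are always false, a player with no matching draft row falls through to the default and keeps a missing `draft_round`, which is the same outcome the `.loc` assignments produce.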


## Part 2

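Here I look specifically at players whose `VORP` (value over replacement player) improved by more than 2 units between one season and the season two years later. For each such player I keep the earliest season in which that improvement shows up.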

In [None]:
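# Collect every season in which a player's VORP is more than 2 units above his VORP two seasons earlier
improved_seasons = []
for player in nba['Player_name'].unique():
    player_df = nba.loc[nba['Player_name'] == player]
    max_year = player_df['SEASON'].max()
    min_year = player_df['SEASON'].min()
    for year in range(min_year, max_year + 1):
        stats1 = player_df.loc[player_df['SEASON'] == year]
        stats2 = player_df.loc[player_df['SEASON'] == (year + 2)]
        # .sum() combines multiple rows in one season (e.g. a mid-season trade); a missing season sums to 0
        if stats2['VORP'].sum() > (stats1['VORP'].sum() + 2):
            improved_seasons.append(stats2)
# DataFrame.append was removed in pandas 2.0, so the rows are combined with pd.concat
years = pd.concat(improved_seasons) if improved_seasons else pd.DataFrame(columns=nba.columns)

# Keep only each player's earliest qualifying season
first_seasons = []
for player in years['Player_name'].unique():
    player_years = years.loc[years['Player_name'] == player]
    first_seasons.append(player_years.loc[player_years['SEASON'] == player_years['SEASON'].min()])
improvement = pd.concat(first_seasons) if first_seasons else pd.DataFrame(columns=nba.columns)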
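
The two-season comparison in the loop above can also be written as a self-merge. The following is a minimal sketch on made-up data, not the notebook's method: the `toy` frame and the `VORP_2yrs_earlier` column name are hypothetical, and the inner merge is an assumption that changes one detail, noted below.

```python
import pandas as pd

# Hypothetical toy data: one row per player per season
toy = pd.DataFrame({
    'Player_name': ['A', 'A', 'A', 'B', 'B', 'B'],
    'SEASON':      [2010, 2011, 2012, 2010, 2011, 2012],
    'VORP':        [0.5, 1.0, 3.0, 1.0, 1.5, 2.0],
})

# Shift each season forward by two so it lines up with the season two years later
earlier = toy[['Player_name', 'SEASON', 'VORP']].copy()
earlier['SEASON'] = earlier['SEASON'] + 2
earlier = earlier.rename(columns={'VORP': 'VORP_2yrs_earlier'})

# Pair every season with the same player's season two years before it
paired = toy.merge(earlier, on=['Player_name', 'SEASON'], how='inner')

# Keep seasons where VORP rose by more than 2 units over the two-season gap
improved = paired[paired['VORP'] > paired['VORP_2yrs_earlier'] + 2]
print(improved)
```

One difference from the loop: the inner merge only compares seasons that both exist, whereas the loop treats a missing season as a `VORP` of 0. Which behavior is preferable depends on how players with gaps in their careers should be handled.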

In [None]:
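# Same approach as above, but using PER: collect every season in which a player's PER
# is more than 4 units above his PER two seasons earlier
improved_seasons = []
for player in nba['Player_name'].unique():
    player_df = nba.loc[nba['Player_name'] == player]
    max_year = player_df['SEASON'].max()
    min_year = player_df['SEASON'].min()
    for year in range(min_year, max_year + 1):
        stats1 = player_df.loc[player_df['SEASON'] == year]
        stats2 = player_df.loc[player_df['SEASON'] == (year + 2)]
        # .sum() combines multiple rows in one season (e.g. a mid-season trade); a missing season sums to 0
        if stats2['PER'].sum() > (stats1['PER'].sum() + 4):
            improved_seasons.append(stats2)
# DataFrame.append was removed in pandas 2.0, so the rows are combined with pd.concat
years = pd.concat(improved_seasons) if improved_seasons else pd.DataFrame(columns=nba.columns)

# Keep only each player's earliest qualifying season
first_seasons = []
for player in years['Player_name'].unique():
    player_years = years.loc[years['Player_name'] == player]
    first_seasons.append(player_years.loc[player_years['SEASON'] == player_years['SEASON'].min()])
per_improvement = pd.concat(first_seasons) if first_seasons else pd.DataFrame(columns=nba.columns)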

In [None]:
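# Keep seasons with meaningful playing time (more than 900 minutes or more than 55 games)
per_improvement = per_improvement.loc[(per_improvement["MP"] > 900) | (per_improvement["G"] > 55)]
# Keep only seasons where the player's PER is above 11
per_improvement = per_improvement[per_improvement.PER > 11]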

In [None]:
# Keep only players with a matching row in the 2006-2019 draft tables
per_improvement = per_improvement.loc[per_improvement["DRAFT_YEAR+1"].notnull()]
print(per_improvement.shape)
per_improvement.sort_values(by='SEASON', ascending=True)

In [None]:
%store per_improvement
%store improvement

In [None]:
improvement.sort_values(by='DRAFT_YEAR+1', ascending=True)