# Classifiers for NBA Players

For a full description of this project, please refer to the [GitHub repository](https://github.com/jacquelinekclee/naivebayes_nba_players).

## Table of Contents

- [Process the Training Data](#training)
- [Process the Test Data](#test)
- [Process the 2018-19 Data](#2019)
- [Process the 2020-21 Data](#2021)
- [Process the 2021-22 Data](#2022)
- [Positions Classifier](#positions)
    - [Final Positions Classifier](#positionsbest)
- [All Star Classifier](#allstars)
    - [Summary](#allstarsummary)

## Imports

In [1]:
import pandas as pd
import numpy as np
%load_ext autoreload
%autoreload 2
# source files can be found in the GitHub repository
from nba_players_classification import *

In [2]:
# pip install xgboost

In [4]:
import sklearn
import xgboost as xgb
from sklearn import metrics
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.metrics import log_loss
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

## Process the Training Data<a class="anchor" id="training"></a>

Get the DataFrame with each player's statistics for each season from 1950-2017.
Since many relevant statistics weren't collected until 1980, I will only keep the season statistics for 1980-2017.

In [5]:
stats = pd.read_csv('Seasons_Stats.csv')
stats_cols = list(stats.columns)
stats.columns

Index(['Unnamed: 0', 'Year', 'Player', 'Pos', 'Age', 'Tm', 'G', 'GS', 'MP',
       'PER', 'TS%', '3PAr', 'FTr', 'ORB%', 'DRB%', 'TRB%', 'AST%', 'STL%',
       'BLK%', 'TOV%', 'USG%', 'blanl', 'OWS', 'DWS', 'WS', 'WS/48', 'blank2',
       'OBPM', 'DBPM', 'BPM', 'VORP', 'FG', 'FGA', 'FG%', '3P', '3PA', '3P%',
       '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB', 'DRB', 'TRB',
       'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS'],
      dtype='object')

In [6]:
max(stats['Year'])

2017.0

In [7]:
stats = stats.loc[stats['Year'] >= 1980].reset_index(drop=True)
stats.head()

Unnamed: 0.1,Unnamed: 0,Year,Player,Pos,Age,Tm,G,GS,MP,PER,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,5727,1980.0,Kareem Abdul-Jabbar*,C,32.0,LAL,82.0,,3143.0,25.3,...,0.765,190.0,696.0,886.0,371.0,81.0,280.0,297.0,216.0,2034.0
1,5728,1980.0,Tom Abernethy,PF,25.0,GSW,67.0,,1222.0,11.0,...,0.683,62.0,129.0,191.0,87.0,35.0,12.0,39.0,118.0,362.0
2,5729,1980.0,Alvan Adams,C,25.0,PHO,75.0,,2168.0,19.2,...,0.797,158.0,451.0,609.0,322.0,108.0,55.0,218.0,237.0,1118.0
3,5730,1980.0,Tiny Archibald*,PG,31.0,BOS,80.0,80.0,2864.0,15.3,...,0.83,59.0,138.0,197.0,671.0,106.0,10.0,242.0,218.0,1131.0
4,5731,1980.0,Dennis Awtrey,C,31.0,CHI,26.0,,560.0,7.4,...,0.64,29.0,86.0,115.0,40.0,12.0,15.0,27.0,66.0,86.0


We only want to keep the following features. These features will be fundamental for our classifier:

* Year
* Player
* Position
* Games played
* True shooting percentage
* Assists
* Points
* Total Rebounds
* Total Steals
* Total Blocks

In [9]:
stats = stats[stats_cols[1:4] + ['G','TS%','TRB','AST','PTS', 'STL', 'BLK']]
stats.head()

Unnamed: 0,Year,Player,Pos,G,TS%,TRB,AST,PTS,STL,BLK
0,1980.0,Kareem Abdul-Jabbar*,C,82.0,0.639,886.0,371.0,2034.0,81.0,280.0
1,1980.0,Tom Abernethy,PF,67.0,0.511,191.0,87.0,362.0,35.0,12.0
2,1980.0,Alvan Adams,C,75.0,0.571,609.0,322.0,1118.0,108.0,55.0
3,1980.0,Tiny Archibald*,PG,80.0,0.574,197.0,671.0,1131.0,106.0,10.0
4,1980.0,Dennis Awtrey,C,26.0,0.524,115.0,40.0,86.0,12.0,15.0


Check for missing values

In [10]:
detect_missing_values(stats)

Year :  False
Player: False
Pos: False
G :  False
TS% :  True
TRB :  False
AST :  False
PTS :  False
STL :  False
BLK :  False


Some other observed irregularities are that the years are floats and some Player names have extra characters (like asterisks). I will thus change these. 

In [11]:
stats['Year'] = stats['Year'].apply(lambda s: int(s))
stats['Player'] = stats['Player'].apply(lambda name: name[:-1] if '*' in name else name)
(all([isinstance(year, int) for year in stats['Year']]), 
 all(['*' not in name for name in stats['Player']]))

(True, True)

Now I will be converting the TRB (total rebounds), AST (assists), and PTS (points) to their per game equivalents by dividing the gross number of rebounds, assists, or points by number of games played. This will enable us to compare all players with each other; a given players statistics won't appear inflated because they played more games than another player.

In [12]:
per_game_stats(stats, 'G')

In [13]:
stats = stats.drop(columns=['TRB', 'AST', 'PTS', 'BLK', 'STL', 'G'])

In [14]:
stats_descr = stats.describe()
stats_descr

Unnamed: 0,Year,TS%,RPG,APG,PPG,BPG,SPG
count,18927.0,18851.0,18927.0,18927.0,18927.0,18927.0,18927.0
mean,2000.272415,0.503862,3.468066,1.848202,8.047679,0.406166,0.659704
std,10.691977,0.094507,2.53764,1.848489,5.958002,0.509952,0.479654
min,1980.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1992.0,0.473,1.609566,0.571429,3.383974,0.090909,0.3125
50%,2001.0,0.516,2.814815,1.25,6.492537,0.238095,0.567568
75%,2010.0,0.551,4.68058,2.503165,11.5,0.511628,0.911392
max,2017.0,1.136,18.658537,14.538462,37.085366,6.0,3.670732


PG=Point Guard 

G=Point Guard and Shooting Guard 

SG=Shooting Guard 

GF= Shooting Guard and Small Forward 

SF=Small Forward 

F= Small Forward and Power Forward 

PF= Power Forward 

FC= Power Forward and Center 

C= Center

In [15]:
list(set(stats['Pos']))

['SG-PG',
 'C',
 'PG-SG',
 'SF',
 'C-PF',
 'SG-PF',
 'C-SF',
 'PF',
 'SG-SF',
 'SF-SG',
 'PG-SF',
 'PG',
 'PF-SF',
 'PF-C',
 'SG',
 'SF-PF']

The stats 'Pos' column will be refined to match these positions.

* 'PG', 'SG', 'SG-PG', and 'PG-SG' will become 'G'
* 'PF-C', 'C-PF', and 'C-SF' will become 'FC'
* 'SF', 'SF-PF', 'PF', and 'PF-SF' will become 'F'
* 'SG-SF', 'SG-PF', 'PG-SF', and 'SF-SG' will become 'GF'
* 'C' will remain 'C'

In [16]:
stats['Pos_og'] = stats['Pos'].copy()

In [17]:
reset_position(stats)
list(set(stats['Pos']))

['FC', 'C', 'GF', 'G', 'F']

### Add a Column Indicating if a Player earned All Star honors that Season<a class="anchor" id="allstar"></a>

I will now add a column decoding if a player was an all-star in that season

In [18]:
all_stars_df = pd.read_csv('all_stars_upd2.csv')
# all_stars_df = all_stars_df.loc[all_stars_df['Year'] >= 1980].reset_index().rename(columns = {'Name':'Player'})
all_stars_df.head()

Unnamed: 0,Player,Year
0,Kareem Abdul-Jabbar,1970
1,Kareem Abdul-Jabbar,1971
2,Kareem Abdul-Jabbar,1972
3,Kareem Abdul-Jabbar,1973
4,Kareem Abdul-Jabbar,1974


In [19]:
# create tuples with the format (Player, Year)
all_star_tups = list(all_stars_df.apply(lambda row: (row['Player'], row['Year']), axis=1))

In [20]:
players_tups = list(stats.apply(lambda row: (row['Player'], row['Year']), axis=1))

In [21]:
# denote whether or not a player was an All Star that season
all_stars = pd.Series(players_tups).apply(lambda tup: 'Y' if tup in all_star_tups else 'N')

In [22]:
all_stars.value_counts(normalize=True)

N    0.949966
Y    0.050034
dtype: float64

In [23]:
stats['All Star'] = all_stars

In [24]:
stats.head()

Unnamed: 0,Year,Player,Pos,TS%,RPG,APG,PPG,BPG,SPG,Pos_og,All Star
0,1980,Kareem Abdul-Jabbar,C,0.639,10.804878,4.52439,24.804878,3.414634,0.987805,C,Y
1,1980,Tom Abernethy,F,0.511,2.850746,1.298507,5.402985,0.179104,0.522388,PF,N
2,1980,Alvan Adams,C,0.571,8.12,4.293333,14.906667,0.733333,1.44,C,N
3,1980,Tiny Archibald,G,0.574,2.4625,8.3875,14.1375,0.125,1.325,PG,N
4,1980,Dennis Awtrey,C,0.524,4.423077,1.538462,3.307692,0.576923,0.461538,C,N


In [25]:
# classify the stats
stats_all_stars_cont = stats[stats['All Star'] == 'Y']
stats_not_all_stars_cont = stats[stats['All Star'] == 'N']

### Add a Column Indicating if a Player earned MVP that Season<a class="anchor" id="mvp"></a>

In [26]:
mvps = pd.read_csv('mvps.csv')
mvps.head()

Unnamed: 0,Season,Lg,Player,Voting,Age,Tm,G,MP,PTS,TRB,AST,STL,BLK,FG%,3P%,FT%,WS,WS/48
0,2019-20,NBA,Giannis Antetokounmpo\antetgi01,(V),25,MIL,63,30.4,29.5,13.6,5.6,1.0,1.0,0.553,0.304,0.633,11.1,0.279
1,2018-19,NBA,Giannis Antetokounmpo\antetgi01,(V),24,MIL,72,32.8,27.7,12.5,5.9,1.3,1.5,0.578,0.256,0.729,14.4,0.292
2,2017-18,NBA,James Harden\hardeja01,(V),28,HOU,72,35.4,30.4,5.4,8.8,1.8,0.7,0.449,0.367,0.858,15.4,0.289
3,2016-17,NBA,Russell Westbrook\westbru01,(V),28,OKC,81,34.6,31.6,10.7,10.4,1.6,0.4,0.425,0.343,0.845,13.1,0.224
4,2015-16,NBA,Stephen Curry\curryst01,(V),27,GSW,79,34.2,30.1,5.4,6.7,2.1,0.2,0.504,0.454,0.908,17.9,0.318


In [27]:
mvps['Player'] = mvps['Player'].apply(lambda name: name[:name.index('\\')])
mvps.head()

Unnamed: 0,Season,Lg,Player,Voting,Age,Tm,G,MP,PTS,TRB,AST,STL,BLK,FG%,3P%,FT%,WS,WS/48
0,2019-20,NBA,Giannis Antetokounmpo,(V),25,MIL,63,30.4,29.5,13.6,5.6,1.0,1.0,0.553,0.304,0.633,11.1,0.279
1,2018-19,NBA,Giannis Antetokounmpo,(V),24,MIL,72,32.8,27.7,12.5,5.9,1.3,1.5,0.578,0.256,0.729,14.4,0.292
2,2017-18,NBA,James Harden,(V),28,HOU,72,35.4,30.4,5.4,8.8,1.8,0.7,0.449,0.367,0.858,15.4,0.289
3,2016-17,NBA,Russell Westbrook,(V),28,OKC,81,34.6,31.6,10.7,10.4,1.6,0.4,0.425,0.343,0.845,13.1,0.224
4,2015-16,NBA,Stephen Curry,(V),27,GSW,79,34.2,30.1,5.4,6.7,2.1,0.2,0.504,0.454,0.908,17.9,0.318


In [28]:
mvps['Year'] = mvps['Season'].apply(lambda season: 2000 if season == '1999-2000'
             else int(season[:2] + season[-2:len(season)]))
mvps.head()

Unnamed: 0,Season,Lg,Player,Voting,Age,Tm,G,MP,PTS,TRB,AST,STL,BLK,FG%,3P%,FT%,WS,WS/48,Year
0,2019-20,NBA,Giannis Antetokounmpo,(V),25,MIL,63,30.4,29.5,13.6,5.6,1.0,1.0,0.553,0.304,0.633,11.1,0.279,2020
1,2018-19,NBA,Giannis Antetokounmpo,(V),24,MIL,72,32.8,27.7,12.5,5.9,1.3,1.5,0.578,0.256,0.729,14.4,0.292,2019
2,2017-18,NBA,James Harden,(V),28,HOU,72,35.4,30.4,5.4,8.8,1.8,0.7,0.449,0.367,0.858,15.4,0.289,2018
3,2016-17,NBA,Russell Westbrook,(V),28,OKC,81,34.6,31.6,10.7,10.4,1.6,0.4,0.425,0.343,0.845,13.1,0.224,2017
4,2015-16,NBA,Stephen Curry,(V),27,GSW,79,34.2,30.1,5.4,6.7,2.1,0.2,0.504,0.454,0.908,17.9,0.318,2016


In [29]:
mvp_tups = list(mvps.apply(lambda row: (row['Player'], row['Year']), axis=1))

In [30]:
mvp_lst = pd.Series(mvp_tups).apply(lambda tup: 'Y' if tup in mvp_tups else 'N')
stats['MVP'] = mvp_lst
stats.head()

Unnamed: 0,Year,Player,Pos,TS%,RPG,APG,PPG,BPG,SPG,Pos_og,All Star,MVP
0,1980,Kareem Abdul-Jabbar,C,0.639,10.804878,4.52439,24.804878,3.414634,0.987805,C,Y,Y
1,1980,Tom Abernethy,F,0.511,2.850746,1.298507,5.402985,0.179104,0.522388,PF,N,Y
2,1980,Alvan Adams,C,0.571,8.12,4.293333,14.906667,0.733333,1.44,C,N,Y
3,1980,Tiny Archibald,G,0.574,2.4625,8.3875,14.1375,0.125,1.325,PG,N,Y
4,1980,Dennis Awtrey,C,0.524,4.423077,1.538462,3.307692,0.576923,0.461538,C,N,Y


In [31]:
stats['All Star'] = stats['All Star'] == 'Y'
stats['MVP'] = stats['MVP'] == 'Y'

### Handle Missing Data

In [32]:
for col in stats.columns:
    print(col, stats[col].isna().sum())

Year 0
Player 0
Pos 0
TS% 76
RPG 0
APG 0
PPG 0
BPG 0
SPG 0
Pos_og 0
All Star 0
MVP 0


In [33]:
stats.loc[stats['TS%'].isna()].describe()

Unnamed: 0,Year,TS%,RPG,APG,PPG,BPG,SPG
count,76.0,0.0,76.0,76.0,76.0,76.0,76.0
mean,2003.815789,,0.387061,0.144737,0.0,0.019737,0.027412
std,8.668272,,0.670641,0.347842,0.0,0.127561,0.134982
min,1984.0,,0.0,0.0,0.0,0.0,0.0
25%,1997.0,,0.0,0.0,0.0,0.0,0.0
50%,2004.5,,0.0,0.0,0.0,0.0,0.0
75%,2011.0,,0.541667,0.0,0.0,0.0,0.0
max,2017.0,,3.0,2.0,0.0,1.0,1.0


76 rows have missing values for TS% because they all scored 0 points. Thus, the missing values will be imputed with 0.

In [34]:
stats['TS%'].fillna(0, inplace=True)

In [35]:
for col in stats.columns:
    print(col, stats[col].isna().sum())

Year 0
Player 0
Pos 0
TS% 0
RPG 0
APG 0
PPG 0
BPG 0
SPG 0
Pos_og 0
All Star 0
MVP 0


In [36]:
stats.to_csv('players_1980_2017.csv', index = False)

# Process the Test Data (data to be classified)<a class="anchor" id="test"></a>

## Player Data from the 2018-19 Season<a class="anchor" id="2019"></a>

In [37]:
test_players = pd.read_csv('players_1819.csv')
test_players.head()

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Player-additional
0,1,Álex Abrines,SG,25,OKC,31,2,19.0,1.8,5.1,...,0.2,1.4,1.5,0.6,0.5,0.2,0.5,1.7,5.3,abrinal01
1,2,Quincy Acy,PF,28,PHO,10,0,12.3,0.4,1.8,...,0.3,2.2,2.5,0.8,0.1,0.4,0.4,2.4,1.7,acyqu01
2,3,Jaylen Adams,PG,22,ATL,34,1,12.6,1.1,3.2,...,0.3,1.4,1.8,1.9,0.4,0.1,0.8,1.3,3.2,adamsja01
3,4,Steven Adams,C,25,OKC,80,80,33.4,6.0,10.1,...,4.9,4.6,9.5,1.6,1.5,1.0,1.7,2.6,13.9,adamsst01
4,5,Bam Adebayo,C,21,MIA,82,28,23.3,3.4,5.9,...,2.0,5.3,7.3,2.2,0.9,0.8,1.5,2.5,8.9,adebaba01


In [38]:
test_players_adv = pd.read_csv('players_1819_adv.csv')
test_players_adv.head()

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,MP,PER,TS%,3PAr,...,OWS,DWS,WS,WS/48,Unnamed: 24,OBPM,DBPM,BPM,VORP,Player-additional
0,1,Álex Abrines,SG,25,OKC,31,588,6.3,0.507,0.809,...,0.1,0.6,0.6,0.053,,-3.7,0.4,-3.3,-0.2,abrinal01
1,2,Quincy Acy,PF,28,PHO,10,123,2.9,0.379,0.833,...,-0.1,0.0,-0.1,-0.022,,-7.6,-0.5,-8.1,-0.2,acyqu01
2,3,Jaylen Adams,PG,22,ATL,34,428,7.6,0.474,0.673,...,-0.1,0.2,0.1,0.011,,-3.8,-0.5,-4.3,-0.2,adamsja01
3,4,Steven Adams,C,25,OKC,80,2669,18.5,0.591,0.002,...,5.1,4.0,9.1,0.163,,0.7,0.4,1.1,2.1,adamsst01
4,5,Bam Adebayo,C,21,MIA,82,1913,17.9,0.623,0.031,...,3.4,3.4,6.8,0.171,,-0.4,2.2,1.8,1.8,adebaba01


Merge the two dataframes in order to calculate the per game statistics

In [39]:
test_players = pd.merge(test_players, test_players_adv, on=['Player-additional', 'Age'])
test_players.head()

Unnamed: 0,Rk_x,Player_x,Pos_x,Age,Tm_x,G_x,GS,MP_x,FG,FGA,...,Unnamed: 19,OWS,DWS,WS,WS/48,Unnamed: 24,OBPM,DBPM,BPM,VORP
0,1,Álex Abrines,SG,25,OKC,31,2,19.0,1.8,5.1,...,,0.1,0.6,0.6,0.053,,-3.7,0.4,-3.3,-0.2
1,2,Quincy Acy,PF,28,PHO,10,0,12.3,0.4,1.8,...,,-0.1,0.0,-0.1,-0.022,,-7.6,-0.5,-8.1,-0.2
2,3,Jaylen Adams,PG,22,ATL,34,1,12.6,1.1,3.2,...,,-0.1,0.2,0.1,0.011,,-3.8,-0.5,-4.3,-0.2
3,4,Steven Adams,C,25,OKC,80,80,33.4,6.0,10.1,...,,5.1,4.0,9.1,0.163,,0.7,0.4,1.1,2.1
4,5,Bam Adebayo,C,21,MIA,82,28,23.3,3.4,5.9,...,,3.4,3.4,6.8,0.171,,-0.4,2.2,1.8,1.8


In [40]:
test_players.columns

Index(['Rk_x', 'Player_x', 'Pos_x', 'Age', 'Tm_x', 'G_x', 'GS', 'MP_x', 'FG',
       'FGA', 'FG%', '3P', '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT',
       'FTA', 'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF',
       'PTS', 'Player-additional', 'Rk_y', 'Player_y', 'Pos_y', 'Tm_y', 'G_y',
       'MP_y', 'PER', 'TS%', '3PAr', 'FTr', 'ORB%', 'DRB%', 'TRB%', 'AST%',
       'STL%', 'BLK%', 'TOV%', 'USG%', 'Unnamed: 19', 'OWS', 'DWS', 'WS',
       'WS/48', 'Unnamed: 24', 'OBPM', 'DBPM', 'BPM', 'VORP'],
      dtype='object')

In [41]:
test_players = test_players.rename(columns={'Player_x':'Player', 'AST':'APG', 'STL':'SPG', 'BLK':'BPG', 'TRB':'RPG', 'PTS':'PPG'})

In [42]:
test_players = test_players.rename(columns={'Pos_x':'Pos'})

Only keep the columns that correspond with the training data (stats)

In [43]:
test_players_cols = list(filter(lambda col: col in stats.columns, test_players.columns)) + ['Age', 'Player-additional']
test_players = test_players[test_players_cols]

In [44]:
test_players = test_players.dropna()

In [45]:
test_players.head()

Unnamed: 0,Player,Pos,RPG,APG,SPG,BPG,PPG,TS%,Age,Player-additional
0,Álex Abrines,SG,1.5,0.6,0.5,0.2,5.3,0.507,25,abrinal01
1,Quincy Acy,PF,2.5,0.8,0.1,0.4,1.7,0.379,28,acyqu01
2,Jaylen Adams,PG,1.8,1.9,0.4,0.1,3.2,0.474,22,adamsja01
3,Steven Adams,C,9.5,1.6,1.5,1.0,13.9,0.591,25,adamsst01
4,Bam Adebayo,C,7.3,2.2,0.9,0.8,8.9,0.623,21,adebaba01


Reset the positions so that they coincide with the training data (stats)

In [46]:
test_players['Pos_og'] = test_players.Pos.copy()

In [47]:
reset_position(test_players)

In [48]:
set(test_players['Pos'])

{'C', 'F', 'FC', 'G', 'GF'}

In [49]:
test_descr = test_players.describe()
test_descr

Unnamed: 0,RPG,APG,SPG,BPG,PPG,TS%,Age
count,1250.0,1250.0,1250.0,1250.0,1250.0,1250.0,1250.0
mean,3.49672,1.73712,0.60264,0.34408,8.14272,0.528119,26.5024
std,2.395956,1.430218,0.420303,0.354815,5.337567,0.112984,4.048593
min,0.0,0.0,0.0,0.0,0.0,0.0,19.0
25%,1.825,0.8,0.3,0.1,4.2,0.497,23.0
50%,2.9,1.3,0.5,0.3,7.0,0.538,26.0
75%,4.6,2.3,0.8,0.5,11.1,0.578,29.0
max,15.6,10.7,2.4,2.7,36.1,1.5,42.0


In [50]:
test_players['Year'] = 2019
test_tups = list(test_players.apply(lambda row: (row['Player'], row['Year']), axis=1))

### Handle Any Duplicates

In [51]:
test_players.drop_duplicates(['Player-additional', 'Age'], keep = 'last', inplace=True)

In [52]:
test_players.head()

Unnamed: 0,Player,Pos,RPG,APG,SPG,BPG,PPG,TS%,Age,Player-additional,Pos_og,Year
0,Álex Abrines,G,1.5,0.6,0.5,0.2,5.3,0.507,25,abrinal01,SG,2019
1,Quincy Acy,F,2.5,0.8,0.1,0.4,1.7,0.379,28,acyqu01,PF,2019
2,Jaylen Adams,G,1.8,1.9,0.4,0.1,3.2,0.474,22,adamsja01,PG,2019
3,Steven Adams,C,9.5,1.6,1.5,1.0,13.9,0.591,25,adamsst01,C,2019
4,Bam Adebayo,C,7.3,2.2,0.9,0.8,8.9,0.623,21,adebaba01,C,2019


### Add the All Star and MVP Data<a class="anchor" id="allstarmvp"></a> 

In [53]:
def check_tups(row, tups):
    tup = (row['Player'], row['Year'])
    if tup in tups:
        return 'Y'
    else:
        return 'N'

In [54]:
test_players['All Star'] = test_players.apply(lambda row: check_tups(row, all_star_tups), axis = 1)

In [55]:
test_players['MVP'] = test_players.apply(lambda row: check_tups(row, mvp_tups), axis = 1)

In [56]:
test_players.head()

Unnamed: 0,Player,Pos,RPG,APG,SPG,BPG,PPG,TS%,Age,Player-additional,Pos_og,Year,All Star,MVP
0,Álex Abrines,G,1.5,0.6,0.5,0.2,5.3,0.507,25,abrinal01,SG,2019,N,N
1,Quincy Acy,F,2.5,0.8,0.1,0.4,1.7,0.379,28,acyqu01,PF,2019,N,N
2,Jaylen Adams,G,1.8,1.9,0.4,0.1,3.2,0.474,22,adamsja01,PG,2019,N,N
3,Steven Adams,C,9.5,1.6,1.5,1.0,13.9,0.591,25,adamsst01,C,2019,N,N
4,Bam Adebayo,C,7.3,2.2,0.9,0.8,8.9,0.623,21,adebaba01,C,2019,N,N


In [57]:
test_players['All Star'] = test_players['All Star'] == 'Y'
test_players['MVP'] = test_players['MVP'] == 'Y'

### Handle Missing Data 

In [58]:
for col in test_players.columns:
    print(col, test_players[col].isna().sum())

Player 0
Pos 0
RPG 0
APG 0
SPG 0
BPG 0
PPG 0
TS% 0
Age 0
Player-additional 0
Pos_og 0
Year 0
All Star 0
MVP 0


In [59]:
test_players = test_players[['Player', 'Pos', 'RPG', 'APG', 'SPG', 'BPG', 'PPG', 'TS%', 'Pos_og',
       'Year', 'All Star', 'MVP']]

In [60]:
test_players.to_csv('players_1819_cleaned.csv', index = False)

## Process the 2020-21 Players Statistics<a class="anchor" id="2021"></a>

In [173]:
test_2021 = pd.read_csv('players_2021.csv')

In [174]:
test_2021.head()

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,1,Precious Achiuwa\achiupr01,PF,21,MIA,32,2,14.0,2.3,4.1,...,0.528,1.2,2.7,3.9,0.6,0.4,0.5,1.0,1.8,5.9
1,2,Jaylen Adams\adamsja01,PG,24,MIL,7,0,2.6,0.1,1.1,...,,0.0,0.4,0.4,0.3,0.0,0.0,0.0,0.1,0.3
2,3,Steven Adams\adamsst01,C,27,NOP,30,30,28.2,3.6,5.8,...,0.456,4.2,4.9,9.1,2.3,0.9,0.6,1.6,1.9,8.2
3,4,Bam Adebayo\adebaba01,C,23,MIA,31,31,33.9,7.3,12.9,...,0.848,2.2,7.5,9.6,5.5,0.9,1.0,3.0,2.5,19.6
4,5,LaMarcus Aldridge\aldrila01,C,35,SAS,20,18,26.5,5.8,12.3,...,0.829,0.8,3.7,4.5,1.8,0.4,0.9,0.9,1.7,14.3


In [175]:
test_2021['Player-additional'] = test_2021.Player.str.split('\\').str[-1]

In [176]:
test_2021.columns

Index(['Rk', 'Player', 'Pos', 'Age', 'Tm', 'G', 'GS', 'MP', 'FG', 'FGA', 'FG%',
       '3P', '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%',
       'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS',
       'Player-additional'],
      dtype='object')

Use test_2021_adv, a dataframe which holds advanced statistics, to get the appropriate aggregate statistics

In [177]:
test_2021_adv = pd.read_csv('players_adv_2021.csv')
test_2021_adv.columns

Index(['Rk', 'Player', 'Pos', 'Age', 'Tm', 'G', 'MP', 'PER', 'TS%', '3PAr',
       'FTr', 'ORB%', 'DRB%', 'TRB%', 'AST%', 'STL%', 'BLK%', 'TOV%', 'USG%',
       'Unnamed: 19', 'OWS', 'DWS', 'WS', 'WS/48', 'Unnamed: 24', 'OBPM',
       'DBPM', 'BPM', 'VORP'],
      dtype='object')

In [178]:
test_2021_adv['Player-additional'] = test_2021_adv.Player.str.split('\\').str[-1]

In [179]:
all(test_2021_adv['Player-additional'] == test_2021['Player-additional'])

True

In [180]:
test_2021['TS%'] = test_2021_adv['TS%']

In [181]:
test_2021 = test_2021.rename(columns={'AST':'APG', 'STL':'SPG', 'BLK':'BPG', 'TRB':'RPG', 'PTS':'PPG'})

### Handle Any Duplicates 

In [182]:
test_2021['Player-additional'].nunique(), test_2021.shape[0]

(491, 509)

In [183]:
test_2021.drop_duplicates(['Player-additional', 'Age'], keep = 'last', inplace=True)

In [184]:
test_2021['Player-additional'].nunique(), test_2021.shape[0]

(491, 491)

Only keep the columns in the training data (stats)

In [185]:
cols_to_drop = list(filter(lambda col: col not in list(stats.columns), test_2021.columns))
test_2021 = test_2021.drop(columns=cols_to_drop)

In [186]:
test_2021.head()

Unnamed: 0,Player,Pos,RPG,APG,SPG,BPG,PPG,TS%
0,Precious Achiuwa\achiupr01,PF,3.9,0.6,0.4,0.5,5.9,0.578
1,Jaylen Adams\adamsja01,PG,0.4,0.3,0.0,0.0,0.3,0.125
2,Steven Adams\adamsst01,C,9.1,2.3,0.9,0.6,8.2,0.604
3,Bam Adebayo\adebaba01,C,9.6,5.5,0.9,1.0,19.6,0.636
4,LaMarcus Aldridge\aldrila01,C,4.5,1.8,0.4,0.9,14.3,0.549


### Handle Missing Data

In [187]:
detect_missing_values(test_2021)

Player: False
Pos: False
RPG :  False
APG :  False
SPG :  False
BPG :  False
PPG :  False
TS% :  True


In [188]:
test_2021 = test_2021.fillna(0)

In [190]:
detect_missing_values(test_2021)

Player: False
Pos: False
RPG :  False
APG :  False
SPG :  False
BPG :  False
PPG :  False
TS% :  False


In [191]:
test_2021_descr = test_2021.describe()
test_2021_descr

Unnamed: 0,RPG,APG,SPG,BPG,PPG,TS%
count,491.0,491.0,491.0,491.0,491.0,491.0
mean,3.612424,1.995519,0.607943,0.412424,8.861711,0.535741
std,2.502021,1.917178,0.419959,0.434422,6.826147,0.12545
min,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.8,0.7,0.3,0.1,3.8,0.4995
50%,3.2,1.4,0.6,0.3,7.3,0.556
75%,4.9,2.5,0.9,0.6,12.5,0.604
max,14.1,11.1,1.9,3.4,32.8,1.0


In [192]:
test_2021['Player'] = test_2021['Player'].str.split('\\').str[0]
test_2021.head()

Unnamed: 0,Player,Pos,RPG,APG,SPG,BPG,PPG,TS%
0,Precious Achiuwa,PF,3.9,0.6,0.4,0.5,5.9,0.578
1,Jaylen Adams,PG,0.4,0.3,0.0,0.0,0.3,0.125
2,Steven Adams,C,9.1,2.3,0.9,0.6,8.2,0.604
3,Bam Adebayo,C,9.6,5.5,0.9,1.0,19.6,0.636
4,LaMarcus Aldridge,C,4.5,1.8,0.4,0.9,14.3,0.549


Reset the positions so that they correspond with the training data (stats)

In [193]:
test_2021['Pos_og'] = test_2021.Pos.copy()

In [194]:
reset_position(test_2021)
list(set(test_2021['Pos']))

['C', 'G', 'F']

### Add the All Star Data<a class="anchor" id="allstar2021"></a>

In [195]:
test_2021['Year'] = 2021

In [196]:
test_2021['All Star'] = test_2021.apply(lambda row: check_tups(row, all_star_tups), axis = 1)
test_2021['All Star'] = test_2021['All Star'] == 'Y'

### Add the MVP Data

In [197]:
mvp_tups.append(('Nikola Jokić', 2021))

In [198]:
test_2021['MVP'] = test_2021.apply(lambda row: check_tups(row, mvp_tups), axis = 1)

In [199]:
test_2021.MVP.value_counts()

N    490
Y      1
Name: MVP, dtype: int64

In [201]:
test_2021.to_csv('players_2021_cleaned.csv', index = False)

## Process the 2021-22 Players Statistics<a class="anchor" id="2022"></a>

In [202]:
test_2022 = pd.read_csv('players_2022.csv')

In [203]:
test_2022.head()

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Player-additional
0,1,Precious Achiuwa,C,22,TOR,73,28,23.6,3.6,8.3,...,2.0,4.5,6.5,1.1,0.5,0.6,1.2,2.1,9.1,achiupr01
1,2,Steven Adams,C,28,MEM,76,75,26.3,2.8,5.1,...,4.6,5.4,10.0,3.4,0.9,0.8,1.5,2.0,6.9,adamsst01
2,3,Bam Adebayo,C,24,MIA,56,56,32.6,7.3,13.0,...,2.4,7.6,10.1,3.4,1.4,0.8,2.6,3.1,19.1,adebaba01
3,4,Santi Aldama,PF,21,MEM,32,0,11.3,1.7,4.1,...,1.0,1.7,2.7,0.7,0.2,0.3,0.5,1.1,4.1,aldamsa01
4,5,LaMarcus Aldridge,C,36,BRK,47,12,22.3,5.4,9.7,...,1.6,3.9,5.5,0.9,0.3,1.0,0.9,1.7,12.9,aldrila01


In [204]:
test_2022.columns

Index(['Rk', 'Player', 'Pos', 'Age', 'Tm', 'G', 'GS', 'MP', 'FG', 'FGA', 'FG%',
       '3P', '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%',
       'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS',
       'Player-additional'],
      dtype='object')

In [205]:
test_2022['Player-additional'].nunique(), test_2022.shape[0]

(605, 812)

In [206]:
test_2022_adv = pd.read_csv('players_adv_2022.csv')
test_2022['TS%'] = test_2022_adv['TS%']
test_2022 = test_2022.rename(columns={'AST':'APG', 'STL':'SPG', 'BLK':'BPG', 'TRB':'RPG', 'PTS':'PPG'})

In [207]:
cols_to_keep = list(stats.columns) + ['Player-additional', 'Age']
cols_to_drop = list(filter(lambda col: col not in cols_to_keep, test_2022.columns))
test_2022 = test_2022.drop(columns=cols_to_drop)

In [208]:
test_2022.head()

Unnamed: 0,Player,Pos,Age,RPG,APG,SPG,BPG,PPG,Player-additional,TS%
0,Precious Achiuwa,C,22,6.5,1.1,0.5,0.6,9.1,achiupr01,0.503
1,Steven Adams,C,28,10.0,3.4,0.9,0.8,6.9,adamsst01,0.56
2,Bam Adebayo,C,24,10.1,3.4,1.4,0.8,19.1,adebaba01,0.608
3,Santi Aldama,PF,21,2.7,0.7,0.2,0.3,4.1,aldamsa01,0.452
4,LaMarcus Aldridge,C,36,5.5,0.9,0.3,1.0,12.9,aldrila01,0.604


In [209]:
detect_missing_values(test_2022)

Player: False
Pos: False
Age :  False
RPG :  False
APG :  False
SPG :  False
BPG :  False
PPG :  False
Player-additional: False
TS% :  True


In [210]:
test_2022.fillna(0,inplace=True)

In [211]:
test_2022['Pos_og'] = test_2022.Pos.copy()
reset_position(test_2022)
list(set(test_2022['Pos']))

['FC', 'C', 'GF', 'G', 'F', 'SG-PG-SF']

In [212]:
# handle special position case
test_2022['Pos'] = test_2022['Pos'].replace(['SG-PG-SF'], 'G')

In [213]:
list(set(test_2022['Pos']))

['FC', 'C', 'GF', 'G', 'F']

In [214]:
test_2022 = test_2022.groupby(['Player-additional', 'Age']).agg({'TS%':'mean', 'RPG':'mean', 'APG': 'mean', 
                                 'PPG':'mean' ,'BPG':'mean', 'SPG':'mean', 'Player':'unique','Pos':'unique',
                                 'Pos_og':'unique'}).reset_index()
test_2022['Pos_og'] = test_2022.Pos_og.apply(lambda lst: lst[0])
test_2022['Player'] = test_2022.Player.apply(lambda lst: lst[0])
test_2022['Pos'] = test_2022.Pos.apply(lambda lst: lst[0])

In [215]:
test_2022.Player.nunique(), test_2022.shape[0]

(605, 605)

### Handle Any Duplicates

In [216]:
test_2022.drop_duplicates(['Player-additional', 'Age'], keep = 'last', inplace=True)

In [217]:
test_2022['Player-additional'].nunique(), test_2022.shape[0]

(605, 605)

### Add All the All Star Data

In [218]:
test_2022['Year'] = 2022
test_2022['All Star'] = test_2022.apply(lambda row: check_tups(row, all_star_tups), axis = 1)
test_2022['All Star'] = test_2022['All Star'] == 'Y'

### Add the MVP Data

In [219]:
mvp_tups.append(('Nikola Jokić', 2022))
test_2022['MVP'] = test_2022.apply(lambda row: check_tups(row, mvp_tups), axis = 1)

In [220]:
test_2022.to_csv('players_2022_cleaned.csv', index = False)

# Positions Classifier<a class="anchor" id="positions"></a>

## Train the Random Forest Classifier for Position on the Data From 1950-2017

Assuming play style and statistics by position hasn't changed much over the years, this classifier should perform well without using year as a feature. A classifier with Year and one without Year as a feature will be compared

In [427]:
stats.Year.unique()

array([1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990,
       1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001,
       2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012,
       2013, 2014, 2015, 2016, 2017])

In [428]:
stats.columns

Index(['Year', 'Player', 'Pos', 'TS%', 'RPG', 'APG', 'PPG', 'BPG', 'SPG',
       'All Star', 'MVP'],
      dtype='object')

In [429]:
stats.Pos.value_counts(normalize=True)

F     0.399377
G     0.396101
C     0.198922
GF    0.002959
FC    0.002642
Name: Pos, dtype: float64

### With Year

In [632]:
training_data_stats = stats[['Year','Pos', 'TS%', 'RPG', 'APG', 'PPG', 'BPG', 'SPG',
       'All Star', 'MVP']]

In [432]:
X_train_pos = training_data_stats.drop(columns = 'Pos')
y_train_pos = LabelEncoder().fit_transform(training_data_stats['Pos'].values)

In [459]:
pos_labels = {0:'C', 1:'F', 2:'FC', 3:'G', 4:'GF'}

In [220]:
grid = { 
    'n_estimators': [300,500,700],
    'max_features': ['sqrt', 'log2'],
    'max_depth' : [5,10,15,20,25,None],
    'criterion' :['gini', 'entropy'],
    'random_state' : [18]
}

pos_rf_cv = GridSearchCV(estimator=RandomForestClassifier(), param_grid=grid, cv = 5)

GridSearchCV(cv=5, estimator=RandomForestClassifier(),
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [5, 10, 15, 20, 25, None],
                         'max_features': ['sqrt', 'log2'],
                         'n_estimators': [300, 500, 700],
                         'random_state': [18]})

In [523]:
pos_rf_cv.fit(X_train_pos, y_train_pos)

GridSearchCV(cv=5, estimator=RandomForestClassifier(),
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [5, 10, 15, 20, 25, None],
                         'max_features': ['sqrt', 'log2'],
                         'n_estimators': [300, 500, 700],
                         'random_state': [18]})

In [525]:
pos_rf_year = pos_rf_cv.best_estimator_

In [588]:
feature_names = [f"{X_train_pos.columns[i]}" for i in range(X_train_pos.shape[1])]
pos_rf_year_imp = pd.Series(pos_rf_year.feature_importances_, index=feature_names).sort_values(ascending=False)

In [589]:
pos_rf_year_imp

RPG         0.256018
APG         0.251740
BPG         0.231935
SPG         0.118068
PPG         0.074044
TS%         0.036067
Year        0.032127
All Star    0.000000
MVP         0.000000
dtype: float64

In [529]:
X_test_pos = test_players[['Year','TS%', 'RPG', 'APG', 'PPG', 'BPG', 'SPG', 'All Star', 'MVP']]
y_test_pos = LabelEncoder().fit_transform(test_players['Pos'].values)

In [531]:
pos_rf_year.score(X_test_pos, y_test_pos)

0.7360594795539034

In [532]:
test_results_rf_df = test_players.copy()
test_results_rf_df['pos_pred'] = pos_rf_year.predict(X_test_pos)
test_results_rf_df['pos_pred'].replace(pos_labels, inplace=True)

In [533]:
test_results_rf_df['correct_pos'] = test_results_rf_df['pos_pred'] == test_results_rf_df['Pos']

In [534]:
test_results_rf_df.loc[~test_results_rf_df.correct_pos].Pos.value_counts(normalize=True)

G     0.373239
C     0.309859
F     0.274648
GF    0.028169
FC    0.014085
Name: Pos, dtype: float64

In [902]:
test_results_rf_df.loc[test_results_rf_df.correct_pos].Pos.value_counts() / test_results_rf_df.Pos.value_counts()

C     0.541667
F     0.798969
FC         NaN
G     0.780992
GF         NaN
Name: Pos, dtype: float64

In [536]:
def prop_incorrect(df, position):
    incorrect = df.loc[~(df.correct_pos) & (df.Pos == position)].shape[0] 
    total = df.loc[(df.Pos == position)].shape[0]
    return round(incorrect / total, 2)

### Without Year 

In [540]:
X_train_pos_no_year = training_data_stats.drop(columns = ['Year', 'Pos'])

In [541]:
X_test_pos_no_year = X_test_pos.drop(columns = ['Year'])

In [501]:
pos_rf_cv.fit(X_train_pos_no_year, y_train_pos)

GridSearchCV(cv=5, estimator=RandomForestClassifier(),
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [5, 10, 15, 20, 25, None],
                         'max_features': ['sqrt', 'log2'],
                         'n_estimators': [300, 500, 700],
                         'random_state': [18]})

In [543]:
pos_rf_cv.fit(X_train_pos_no_year, y_train_pos)
pos_rf_no_year = pos_rf_cv.best_estimator_

In [872]:
pos_rf_no_year.score(X_test_pos_no_year, y_test_pos)

0.724907063197026

In [545]:
0.7360594795539034 - 0.724907063197026

0.01115241635687736

In [584]:
feature_names = [f"{X_train_pos_no_year.columns[i]}" for i in range(X_train_pos_no_year.shape[1])]

In [586]:
pos_rf_no_year_imp = pd.Series(pos_rf_no_year.feature_importances_, index=feature_names).sort_values(ascending=False)

In [895]:
test_results_rf_no_year_df = test_players.copy()
test_results_rf_no_year_df['pos_pred'] = pos_rf_no_year.predict(X_test_pos_no_year)
test_results_rf_no_year_df['pos_pred'].replace(pos_labels, inplace=True)
test_results_rf_no_year_df['correct_pos'] = test_results_rf_no_year_df['pos_pred'] == test_results_rf_no_year_df['Pos']

In [896]:
test_results_rf_no_year_df.loc[~test_results_rf_no_year_df.correct_pos].Pos.value_counts(normalize=True)

G     0.425676
C     0.304054
F     0.229730
GF    0.027027
FC    0.013514
Name: Pos, dtype: float64

In [898]:
test_results_rf_no_year_df.loc[test_results_rf_no_year_df.correct_pos].Pos.value_counts(normalize=True)

G    0.458974
F    0.410256
C    0.130769
Name: Pos, dtype: float64

In [903]:
test_results_rf_no_year_df.loc[test_results_rf_no_year_df.correct_pos].Pos.value_counts() / test_results_rf_no_year_df.Pos.value_counts()

C     0.531250
F     0.824742
FC         NaN
G     0.739669
GF         NaN
Name: Pos, dtype: float64

In [904]:
test_results_rf_df.loc[test_results_rf_df.correct_pos].Pos.value_counts() / test_results_rf_df.Pos.value_counts()

C     0.541667
F     0.798969
FC         NaN
G     0.780992
GF         NaN
Name: Pos, dtype: float64

### Comparison
Accuracy decreases slightly by about 0.01 when removing year as a parameter. Year was the 3rd least important feature in the original Random Forest classifier, and the All Star and MVP features weren't important at all. This indicates that a players play style and production on the court, expressed in their rebounds, assists, blocks, steals, points, and shooting percentage, are much more indicative of position than anything else. Removing year decreased the proportion of each position that had a correct classification for Centers (-1%) and Guards (-4%), but increased for Forwards (+2%). Thus, it seems that the year a player played is more important for guards especially and also for centers. One interpretation of these results is that the per game statisics of a player benefit from the contextualization of year when it comes to classifying especially guards. For example, averaging 8 assists in the 1980's vs. averaging 8 assists in the 2010's mean different things for guards whereas the statistics of forwards are not as variant over the years. 

## Train a XGBoost Classifier for Position on the Data From 1950-2017

Since there is a bit of a class imbalance (about 40% each of Forwards and Guards, but only about 20% Centers), XGBoost might work better than a random forest. 

In [476]:
training_data_stats.Pos.value_counts(normalize=True)

F     0.399377
G     0.396101
C     0.198922
GF    0.002959
FC    0.002642
Name: Pos, dtype: float64

### With Year

In [780]:
params = { 'max_depth': [3,6,10],
           'learning_rate': [0.01, 0.05, 0.1],
           'n_estimators': [100, 500, 1000],
           'colsample_bytree': [0.3, 0.7]}
xgbr = xgb.XGBClassifier(seed = 20, objective='multi:softmax', num_class = 5)
pos_xgb_clf = GridSearchCV(estimator=xgbr, 
                   param_grid=params,
                   scoring='accuracy')

In [781]:
pos_xgb_clf.fit(X_train_pos, y_train_pos)

GridSearchCV(estimator=XGBClassifier(base_score=None, booster=None,
                                     callbacks=None, colsample_bylevel=None,
                                     colsample_bynode=None,
                                     colsample_bytree=None,
                                     early_stopping_rounds=None,
                                     enable_categorical=False, eval_metric=None,
                                     gamma=None, gpu_id=None, grow_policy=None,
                                     importance_type=None,
                                     interaction_constraints=None,
                                     learning_rate=None, max_bin=None,
                                     max_cat_to_...
                                     max_delta_step=None, max_depth=None,
                                     max_leaves=None, min_child_weight=None,
                                     missing=nan, monotone_constraints=None,
                                

In [784]:
pos_xgb_year = pos_xgb_clf.best_estimator_

In [785]:
test_pos_xgb_pred = pos_xgb_year.predict(X_test_pos)

In [786]:
accuracy_score(y_test_pos, test_pos_xgb_pred)

0.741635687732342

In [787]:
pos_labels = {0:'C', 1:'F', 2:'FC', 3:'G', 4:'GF'}

In [838]:
test_results_df = test_players.copy()

In [839]:
test_results_df['pos_pred'] = test_pos_xgb_pred
test_results_df['pos_pred'].replace(pos_labels, inplace=True)

In [840]:
test_results_df['correct_pos'] = test_results_df['pos_pred'] == test_results_df['Pos']

In [841]:
test_results_df.loc[~test_results_df.correct_pos].Pos.value_counts(normalize=True)

G     0.388489
C     0.366906
F     0.201439
GF    0.028777
FC    0.014388
Name: Pos, dtype: float64

In [842]:
test_results_df.loc[test_results_df.correct_pos].Pos.value_counts(normalize=True)

G    0.471178
F    0.416040
C    0.112782
Name: Pos, dtype: float64

In [843]:
prop_incorrect(test_results_df, 'C')

0.53

In [844]:
prop_incorrect(test_results_df, 'F')

0.14

In [845]:
prop_incorrect(test_results_df, 'G')

0.22

In [846]:
feature_names = [f"{X_train_pos.columns[i]}" for i in range(X_train_pos.shape[1])]
xgb_pos_year_imp = pd.Series(pos_xgb_year.feature_importances_, index=feature_names).sort_values(ascending=False)

In [847]:
xgb_pos_year_imp

BPG         0.315920
APG         0.272613
RPG         0.182221
SPG         0.124990
PPG         0.067944
TS%         0.019275
Year        0.017038
All Star    0.000000
MVP         0.000000
dtype: float32

### Without Year

In [848]:
pos_xgb_clf.fit(X_train_pos_no_year, y_train_pos)

GridSearchCV(estimator=XGBClassifier(base_score=None, booster=None,
                                     callbacks=None, colsample_bylevel=None,
                                     colsample_bynode=None,
                                     colsample_bytree=None,
                                     early_stopping_rounds=None,
                                     enable_categorical=False, eval_metric=None,
                                     gamma=None, gpu_id=None, grow_policy=None,
                                     importance_type=None,
                                     interaction_constraints=None,
                                     learning_rate=None, max_bin=None,
                                     max_cat_to_...
                                     max_delta_step=None, max_depth=None,
                                     max_leaves=None, min_child_weight=None,
                                     missing=nan, monotone_constraints=None,
                                

In [874]:
pos_xgb_no_year = pos_xgb_clf.best_estimator_

In [875]:
test_pos_no_year_xgb_pred = pos_xgb_no_year.predict(X_test_pos_no_year)

In [876]:
accuracy_score(y_test_pos, test_pos_no_year_xgb_pred)

0.724907063197026

In [854]:
0.741635687732342 - 0.724907063197026

0.016728624535315983

In [855]:
test_results_no_year_df = test_players.copy().drop(columns = 'Year')
test_results_no_year_df['pos_pred'] = test_pos_no_year_xgb_pred
test_results_no_year_df['pos_pred'].replace(pos_labels, inplace=True)
test_results_no_year_df['correct_pos'] = test_results_no_year_df['pos_pred'] == test_results_no_year_df['Pos']

In [856]:
test_results_no_year_df.loc[~test_results_no_year_df.correct_pos].Pos.value_counts(normalize=True)

G     0.391892
C     0.331081
F     0.236486
GF    0.027027
FC    0.013514
Name: Pos, dtype: float64

In [857]:
test_results_no_year_df.loc[test_results_no_year_df.correct_pos].Pos.value_counts(normalize=True)

G    0.471795
F    0.407692
C    0.120513
Name: Pos, dtype: float64

In [858]:
prop_incorrect(test_results_no_year_df, 'C')

0.51

In [859]:
prop_incorrect(test_results_no_year_df, 'F')

0.18

In [860]:
prop_incorrect(test_results_no_year_df, 'G')

0.24

In [865]:
feature_names = [f"{X_train_pos_no_year.columns[i]}" for i in range(X_train_pos_no_year.shape[1])]
xgb_pos_no_year_imp = pd.Series(pos_xgb_no_year.feature_importances_, index=feature_names).sort_values(ascending=False)

In [866]:
pos_rf_no_year_imp['Year'] = 0

In [867]:
xgb_pos_no_year_imp['Year'] = 0

In [868]:
feature_importance_summary = pd.concat([pos_rf_no_year_imp.to_frame(name = 'feature_importance').reset_index().assign(model='random forest without year'), 
          pos_rf_year_imp.to_frame(name = 'feature_importance').reset_index().assign(model='random forest with year'),
          xgb_pos_year_imp.to_frame(name = 'feature_importance').reset_index().assign(model='XGBoost with year'),
          xgb_pos_no_year_imp.to_frame(name = 'feature_importance').reset_index().assign(model='XGBoost without year')]).pivot(index='model', columns='index')

In [980]:
results = [['random forest without year', pos_rf_no_year.score(X_test_pos_no_year, y_test_pos), prop_incorrect(test_results_rf_no_year_df, 'C'), prop_incorrect(test_results_rf_no_year_df, 'F'), prop_incorrect(test_results_rf_no_year_df, 'G')],
             ['random forest with year', pos_rf_year.score(X_test_pos, y_test_pos), prop_incorrect(test_results_rf_df, 'C'), prop_incorrect(test_results_rf_df, 'F'), prop_incorrect(test_results_rf_df, 'G')],
             ['XGBoost without year', accuracy_score(y_test_pos, test_pos_no_year_xgb_pred), prop_incorrect(test_results_no_year_df, 'C'), prop_incorrect(test_results_no_year_df, 'F'), prop_incorrect(test_results_no_year_df, 'G')],
             ['XGBoost with year',accuracy_score(y_test_pos, test_pos_xgb_pred), prop_incorrect(test_results_df, 'C'), prop_incorrect(test_results_df, 'F'), prop_incorrect(test_results_df, 'G')]]
results_df = pd.DataFrame(data = results, columns = ['model', 'test accuracy', 'prop wrong for centers', 'prop wrong for forwards', 'prop wrong for guards'])

### Comparison
Accuracy decreases slightly by about 0.017 when removing year as a parameter. Similar to the Random Forest, year was the 3rd least important feature in the original XGBoost classifier, and the All Star and MVP features weren't important at all. Again, it seems that players' play style and production on the court, expressed in their rebounds, assists, blocks, steals, points, and shooting percentage, are much more indicative of position than anything else.

In [869]:
feature_importance_summary

Unnamed: 0_level_0,feature_importance,feature_importance,feature_importance,feature_importance,feature_importance,feature_importance,feature_importance,feature_importance,feature_importance
index,APG,All Star,BPG,MVP,PPG,RPG,SPG,TS%,Year
model,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2
XGBoost with year,0.272613,0.0,0.31592,0.0,0.067944,0.182221,0.12499,0.019275,0.017038
XGBoost without year,0.262064,0.0,0.288802,0.0,0.066676,0.204111,0.156717,0.021629,0.0
random forest with year,0.25174,0.0,0.231935,0.0,0.074044,0.256018,0.118068,0.036067,0.032127
random forest without year,0.256782,0.0,0.237337,0.0,0.082846,0.259655,0.120203,0.043176,0.0


In [907]:
results_df

Unnamed: 0,accuracy,test,prop wrong for centers,prop wrong for forwards,prop wrong for guards
0,random forest without year,0.724907,0.47,0.18,0.26
1,random forest with year,0.736059,0.46,0.2,0.22
2,XGBoost without year,0.724907,0.51,0.18,0.24
3,XGBoost with year,0.741636,0.53,0.14,0.22


## First Position Classifiers Summary

The XGBoost classifier did yield a slighly higher accuracy (0.742 compared to 0.725) when year was included, but the accuracies were the same for Random Forest and XGBoost when the year feature was dropped. 

The 2 types of classifiers found different features to be more important. The Random Forest classifiers thought RBG (rebounds per game) were more important than BPG (blocks per game), while the XGBoost classifiers didn't. Both classifiers had similar levels of feature importance for APG (assists per game) and PPG (points per game). 


Some pitfalls of both classifiers include: 
- Neither classifier was able to classify the hybrid positions, GF and FC, correctly. This is likely because only about 0.006 of the training data have these hybrid positions. 
- Both the All Star and MVP features had 0 importance for all 4 models tested. Including irrelevant features could make cost (e.g., runtime) unnecessarily high. 
- Although XGBoost was used to try and combat the class imbalance (around 2x guards and forwards than centers), XGBoost did *worse* and classifying centers than Random Forest did. 


In effort to create a better performing classifier, a new position column will be created. Hopefully making this problem only 3 classes instead of 5 will yield a better classifier. Also, MVP and All Star will be removed from the feature list. It seems that the year feature is particularly useful for classifying guards. Lastly, although XGBoost yielded a slightly higher accuracy, it classified centers much worse than the random forest (which was unexpected). Since the XGBoost didn't provide the expected benefits and its training time is much slower, Random Forest will be used going forward. 

## Train a 2nd Random Forest Classifier for Simplified Positions

The updated Random Forest classifier performed just as well as the XGBoost classifier (with year) on the 2018-2019 data. It performed much better on the centers such that a majority of centers were properly classified. This difference in accuracy for centers is offset by the 2nd Random Forest's worse error rate for forwards (20% incorrect vs. 14% incorrect for the XGBoost) and for the guards (only 1% point difference). Personally, this tradeoff is worth it so that the classifier does not perform exceptionally bad for 1 group and still performs relatively well overall. 

In [909]:
def reset_position_new(df):
    """
    Replace the positions in the given DataFrame so that each position is in the set
    {G, F, C}. For hybrid positions (guard and forward or forward and center), only keep
    the position listed first (e.g., C-SF (center and small forward) will become C for center).
    """
    df['Pos_new'] = df['Pos_og'].copy()
    df['Pos_new'] = df['Pos_new'].replace(['PG-SF', 'SG-SF', 'SG-PF', 'PG', 'SG', 'SG-PG', 'PG-SG'], 'G')
    df['Pos_new'] = df['Pos_new'].replace(['C-PF', 'C-SF'], 'C')
    df['Pos_new'] = df['Pos_new'].replace(['PF', 'SF', 'SF-PF', 'PF-SF', 'PF-C', 'SF-SG'], 'F')

In [910]:
reset_position_new(stats)

In [911]:
stats.Pos_new.value_counts(normalize=True)

F    0.402018
G    0.397633
C    0.200349
Name: Pos_new, dtype: float64

In [912]:
stats.Pos.value_counts(normalize=True)

F     0.399377
G     0.396101
C     0.198922
GF    0.002959
FC    0.002642
Name: Pos, dtype: float64

In [913]:
reset_position_new(test_players)

In [914]:
test_players.Pos_new.value_counts(normalize=True)

G    0.453532
F    0.366171
C    0.180297
Name: Pos_new, dtype: float64

In [915]:
test_players.Pos.value_counts(normalize=True)

G     0.449814
F     0.360595
C     0.178439
GF    0.007435
FC    0.003717
Name: Pos, dtype: float64

In [929]:
X_train_new_pos = stats[['TS%', 'RPG', 'APG', 'PPG', 'BPG', 'SPG', 'Year']]
y_train_new_pos = LabelEncoder().fit_transform(stats['Pos_new'].values)
X_test_new_pos = test_players[['TS%', 'RPG', 'APG', 'PPG', 'BPG', 'SPG', 'Year']]
y_test_new_pos = LabelEncoder().fit_transform(test_players['Pos_new'].values)

In [924]:
print(', '.join([key + '=' + str(val) for key, val in pos_rf_cv.best_params_.items()]))

criterion=gini, max_depth=10, max_features=sqrt, n_estimators=700, random_state=18


In [930]:
pos_rf_clf2 = RandomForestClassifier(criterion='gini', max_depth=10, max_features='sqrt', n_estimators=700, random_state=18)

In [931]:
pos_rf_clf2.fit(X_train_new_pos, y_train_new_pos)

RandomForestClassifier(max_depth=10, max_features='sqrt', n_estimators=700,
                       random_state=18)

In [988]:
feature_names = [f"{X_train_new_pos.columns[i]}" for i in range(X_train_new_pos.shape[1])]
rf2_pos_imp = pd.Series(pos_rf_clf2.feature_importances_, index=feature_names).sort_values(ascending=False)

In [991]:
print(rf2_pos_imp.to_markdown())

|      |         0 |
|:-----|----------:|
| RPG  | 0.253727  |
| APG  | 0.251458  |
| BPG  | 0.232891  |
| SPG  | 0.116283  |
| PPG  | 0.0769225 |
| TS%  | 0.0373015 |
| Year | 0.0314163 |


In [932]:
pos_rf_clf2.score(X_test_new_pos, y_test_new_pos)

0.741635687732342

In [934]:
pd.Series(y_test_new_pos).value_counts()

2    244
1    197
0     97
dtype: int64

In [937]:
test_players['Pos_new'].value_counts()

G    244
F    197
C     97
Name: Pos_new, dtype: int64

In [938]:
pos_labels2 = {0:'C', 1:'F', 2:'G'}

In [939]:
test_results_df2 = test_players.copy()
test_results_df2['pos_pred'] = pos_rf_clf2.predict(X_test_new_pos)
test_results_df2['pos_pred'].replace(pos_labels2, inplace=True)
test_results_df2['correct_pos'] = test_results_df2['pos_pred'] == test_results_df2['Pos']

In [942]:
prop_incorrect(test_results_df2, 'G')

0.23

In [943]:
prop_incorrect(test_results_df2, 'F')

0.2

In [981]:
rf2_row = pd.Series(['random forest without MVP, All Star',pos_rf_clf2.score(X_test_new_pos, y_test_new_pos), prop_incorrect(test_results_df2, 'C'), prop_incorrect(test_results_df2, 'F'), prop_incorrect(test_results_df2, 'G')])
rf2_row.index = results_df.columns

In [982]:
results_df.append(rf2_row, ignore_index = True)

Unnamed: 0,model,test accuracy,prop wrong for centers,prop wrong for forwards,prop wrong for guards
0,random forest without year,0.724907,0.47,0.18,0.26
1,random forest with year,0.736059,0.46,0.2,0.22
2,XGBoost without year,0.724907,0.51,0.18,0.24
3,XGBoost with year,0.741636,0.53,0.14,0.22
4,"random forest without MVP, All Star",0.741636,0.43,0.2,0.23


### Test on 2020-21 Players

The 2nd Random Forest classifier (include year, drop All Star and MVP, performed **much** better than the original Random Forest and XGBoost classifiers. This may be because the 2020-21 data only had the 3 main positions, G, C and F, while the first 2 classifiers tried predicting 5 classes (albeit unsuccessfuly for the 2 minority classes). 

In [955]:
reset_position_new(test_2021)

In [958]:
test_2021['Year'] = 2021

In [959]:
X_test_new_pos_2021 = test_2021[['TS%', 'RPG', 'APG', 'PPG', 'BPG', 'SPG', 'Year']]
y_test_new_pos_2021 = LabelEncoder().fit_transform(test_2021['Pos_new'].values)

In [984]:
pos_rf_clf2.score(X_test_new_pos_2021, y_test_new_pos_2021)

0.7213438735177866

In [976]:
X_test_pos_2021 = test_2021[['Year','TS%', 'RPG', 'APG', 'PPG', 'BPG', 'SPG', 'All Star', 'MVP']]
y_test_pos_2021 = LabelEncoder().fit_transform(test_2021['Pos'].values)

In [977]:
pos_rf_year.score(X_test_pos_2021, y_test_pos_2021)

0.39723320158102765

In [983]:
test_pos_xgb_pred_2021 = pos_xgb_year.predict(X_test_pos_2021)
accuracy_score(y_test_pos_2021, test_pos_xgb_pred_2021)

0.39723320158102765

In [1441]:
reset_position_new(test_2022)

In [1442]:
X_test_new_pos_2022 = test_2022[['TS%', 'RPG', 'APG', 'PPG', 'BPG', 'SPG', 'Year']]
y_test_new_pos_2022 = LabelEncoder().fit_transform(test_2022['Pos_new'].values)

In [1443]:
pos_rf_clf2.score(X_test_new_pos_2022, y_test_new_pos_2022)

0.6958677685950413

## Best Positions Classifier<a class="anchor" id="positionsbest"></a>

The best classifier for predicting player position ended up being a Random Forest classifier using the following features:

|   Feature   |          Importance |
|:-----|----------:|
| RPG  | 0.254  |
| APG  | 0.251  |
| BPG  | 0.233  |
| SPG  | 0.116  |
| PPG  | 0.077 |
| TS%  | 0.037 |
| Year | 0.031 |

# All Star Classifier<a class="anchor" id="allstars"></a>

## Train K-Nearest Neigbors Classifier on 1950-2017 Data

In [1658]:
stats.columns

Index(['Year', 'Player', 'Pos', 'TS%', 'RPG', 'APG', 'PPG', 'BPG', 'SPG',
       'Pos_og', 'All Star', 'MVP'],
      dtype='object')

In [1659]:
X_train_as = stats[['Year', 'Pos', 'TS%', 'RPG', 'APG', 'PPG', 'BPG', 'SPG']]
y_train_as = stats['All Star'].astype(int)

In [1660]:
X_train_as.columns

Index(['Year', 'Pos', 'TS%', 'RPG', 'APG', 'PPG', 'BPG', 'SPG'], dtype='object')

In [1300]:
scalar = MinMaxScaler()
knn = KNeighborsClassifier()

num_feat = ['Year','TS%', 'RPG', 'APG', 'PPG', 'BPG', 'SPG']
pl1 = Pipeline([
    ('min_max', scalar)
])

pl2 = Pipeline([
    ('pos', OneHotEncoder())
])
# preprocessing pipeline (put them together)
preproc = ColumnTransformer(
    transformers=[
        ('scaling', pl1, num_feat),
        ('step_name', pl2, ['Pos'])
    ])


pipeline = Pipeline([('preprocessor', preproc), ('clf', knn)])

knn_grid_params = {'clf__n_neighbors' : [25,50,75],
                   'clf__weights' : ['uniform','distance'],
                   'clf__metric' : ['minkowski','euclidean','manhattan']}

knn_gs = GridSearchCV(pipeline, knn_grid_params, verbose = 1, cv=3, n_jobs = -1, scoring = 'recall')

In [1301]:
knn_gs.fit(X_train_as, y_train_as)

Fitting 3 folds for each of 18 candidates, totalling 54 fits


GridSearchCV(cv=3,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('scaling',
                                                                         Pipeline(steps=[('min_max',
                                                                                          MinMaxScaler())]),
                                                                         ['Year',
                                                                          'TS%',
                                                                          'RPG',
                                                                          'APG',
                                                                          'PPG',
                                                                          'BPG',
                                                                          'SPG']),
                                                             

In [1302]:
knn_gs.best_params_

{'clf__metric': 'minkowski',
 'clf__n_neighbors': 25,
 'clf__weights': 'distance'}

In [1303]:
knn_gs.best_score_

0.4030975822115062

In [1304]:
X_train_as.columns

Index(['Year', 'Pos', 'TS%', 'RPG', 'APG', 'PPG', 'BPG', 'SPG'], dtype='object')

## Test on 2018-2019 Data

In [1696]:
X_test_as = test_players[X_train_as.columns]
y_test_as = test_players['All Star'].astype(int)

In [1697]:
test_as_pred = knn_gs.predict(X_test_as)

In [1698]:
metrics.recall_score(y_test_as, test_as_pred)

0.5769230769230769

In [1699]:
knn_as_results = X_test_as.copy()
knn_as_results['prediction'] = test_as_pred
knn_as_results['All Star'] = y_test_as

In [1700]:
num_all_stars_2019 = test_players.loc[test_players['All Star']].shape[0]

### Correctly Predicted All Stars 

In [1701]:
correct_as_2019 = test_players.loc[knn_as_results.loc[(knn_as_results.prediction==1) & (knn_as_results['All Star'])].index].Player
correct_as_2019

25      Giannis Antetokounmpo
71               Bradley Beal
273             Stephen Curry
275             Anthony Davis
323              Kevin Durant
344               Joel Embiid
429               Paul George
456             Blake Griffin
468              James Harden
558              Kyrie Irving
575              LeBron James
697             Kawhi Leonard
701            Damian Lillard
1194             Kemba Walker
1203        Russell Westbrook
Name: Player, dtype: object

### All Stars that Were Predicted Non-All Stars 

In [1702]:
as_pred_not_2019 = test_players.loc[knn_as_results.loc[(knn_as_results.prediction==0) & (knn_as_results['All Star'])].index].Player

In [1703]:
as_pred_not_2019

7        LaMarcus Aldridge
626           Nikola Jokić
717             Kyle Lowry
805        Khris Middleton
900          Dirk Nowitzki
919         Victor Oladipo
1067           Ben Simmons
1166         Klay Thompson
1170    Karl-Anthony Towns
1190        Nikola Vučević
1191           Dwyane Wade
Name: Player, dtype: object

In [1704]:
as_pred_not_2019.shape[0] / num_all_stars_2019

0.4230769230769231

### Players That Were Predicted to be All Stars But Weren't

In [1705]:
not_as_pred_as_2019 = test_players.loc[knn_as_results.loc[(knn_as_results.prediction==1) & ~(knn_as_results['All Star'])].index].Player
not_as_pred_as_2019

92       Devin Booker
987     Julius Randle
1196        John Wall
Name: Player, dtype: object

In [1706]:
not_as_pred_as_2019.shape[0] / num_all_stars_2019

0.11538461538461539

## Test on 2020-2021

In [1313]:
X_test_as_2021 = test_2021[X_train_as.columns]
y_test_as_2021 = test_2021['All Star'].astype(int)

In [1314]:
knn_as_results_2021 = X_test_as_2021.copy()
knn_as_results_2021['prediction'] = knn_gs.predict(X_test_as_2021)
knn_as_results_2021['All Star'] = y_test_as_2021

In [1329]:
metrics.recall_score(y_test_as_2021, knn_as_results_2021['prediction'])

0.7586206896551724

In [1456]:
num_all_stars_2021 = test_2021.loc[test_2021['All Star']].shape[0]

### Correctly Predicted All Stars 

In [1479]:
correct_as_2021 = test_2021.loc[knn_as_results_2021.loc[(knn_as_results_2021.prediction==1) & (knn_as_results_2021['All Star'])].index].Player
correct_as_2021

13     Giannis Antetokounmpo
38              Bradley Beal
110            Stephen Curry
111            Anthony Davis
119              Luka Dončić
128             Kevin Durant
133              Joel Embiid
158              Paul George
183             James Harden
184             James Harden
185             James Harden
224             Kyrie Irving
231             LeBron James
240             Nikola Jokić
263              Zach LaVine
271            Kawhi Leonard
275           Damian Lillard
323         Donovan Mitchell
397            Julius Randle
419         Domantas Sabonis
448             Jayson Tatum
496          Zion Williamson
Name: Player, dtype: object

### All Stars That Were Predicted Non-All Stars

In [1480]:
as_pred_not_2021 = test_2021.loc[knn_as_results_2021.loc[(knn_as_results_2021.prediction==0) & (knn_as_results_2021['All Star'])].index].Player
as_pred_not_2021

58       Devin Booker
70       Jaylen Brown
100       Mike Conley
163       Rudy Gobert
371        Chris Paul
431       Ben Simmons
472    Nikola Vučević
Name: Player, dtype: object

In [1481]:
as_pred_not_2021.shape[0] / num_all_stars_2021

0.2413793103448276

### Players That Were Predicted to be All Stars But Weren't

In [1508]:
not_as_pred_as_2021 = test_2021.loc[knn_as_results_2021.loc[(knn_as_results_2021.prediction==1) & ~(knn_as_results_2021['All Star'])].index].Player
not_as_pred_as_2021

79           Jimmy Butler
223        Brandon Ingram
303           CJ McCollum
461    Karl-Anthony Towns
506            Trae Young
Name: Player, dtype: object

In [1509]:
not_as_pred_as_2021.shape[0] / num_all_stars_2021

0.1724137931034483

## Test on 2021-2022

In [1707]:
X_test_as_2022 = test_2022[X_train_as.columns]
y_test_as_2022 = test_2022['All Star'].astype(int)
knn_as_results_2022 = X_test_as_2022.copy()
knn_as_results_2022['prediction'] = knn_gs.predict(X_test_as_2022)
knn_as_results_2022['All Star'] = y_test_as_2022

In [1708]:
metrics.recall_score(y_test_as_2022, knn_as_results_2022['prediction'])

0.5185185185185185

In [1709]:
num_all_stars_2022 = test_2022.loc[test_2022['All Star']].shape[0]

### Correctly Predicted All Stars

In [1467]:
correct_as_2022 = test_2022.loc[knn_as_results_2022.loc[(knn_as_results_2022.prediction==1) & (knn_as_results_2022['All Star'])].index].Player
correct_as_2022

11     Giannis Antetokounmpo
58              Devin Booker
86              Jimmy Butler
125            Stephen Curry
133            DeMar DeRozan
140              Luka Dončić
153             Kevin Durant
161              Joel Embiid
216             James Harden
272             LeBron James
288             Nikola Jokić
389                Ja Morant
525             Jayson Tatum
601               Trae Young
Name: Player, dtype: object

In [1468]:
correct_as_2022.shape[0] / num_all_stars_2022

0.5185185185185185

### All Stars That Were Predicted Non-All Stars

In [1469]:
as_pred_not_2022 = test_2022.loc[knn_as_results_2022.loc[(knn_as_results_2022.prediction==0) & (knn_as_results_2022['All Star'])].index].Player
as_pred_not_2022

7           Jarrett Allen
24            LaMelo Ball
182        Darius Garland
193           Rudy Gobert
202        Draymond Green
323           Zach LaVine
376       Khris Middleton
382      Donovan Mitchell
399       Dejounte Murray
437            Chris Paul
545    Karl-Anthony Towns
553         Fred VanVleet
581        Andrew Wiggins
Name: Player, dtype: object

In [1470]:
as_pred_not_2022.shape[0] / num_all_stars_2022

0.48148148148148145

### Players That Were Predicted to be All Stars But Weren't

In [1471]:
not_as_pred_as_2022 = test_2022.loc[knn_as_results_2022.loc[(knn_as_results_2022.prediction==1) & ~(knn_as_results_2022['All Star'])].index].Player
not_as_pred_as_2022

126              Anthony Davis
187                Paul George
190    Shai Gilgeous-Alexander
264               Kyrie Irving
Name: Player, dtype: object

## All Stars Classifier Summary<a class="anchor" id="allstarsummary"></a>

In [1710]:
as_2022 = test_2022.loc[test_2022['All Star']][['Player', 'Pos', 'Age', 'TS%', 'RPG', 'APG', 'PPG', 'BPG', 'SPG','All Star','Year']]

In [1711]:
def rating_col(player):
    if player in correct_as_2022.values:
        return 'properly rated'
    elif player in as_pred_not_2022.values:
        return 'overrated'

In [1712]:
as_2022['Rating'] = as_2022.Player.apply(rating_col)

In [1713]:
as_summary_2022 = as_2022[['Player', 'Rating', 'Year']]

In [1714]:
as_other_2022 = not_as_pred_as_2022.to_frame(name = 'Player')
as_other_2022['Rating'] = 'underrated'
as_other_2022['Year'] = 2022

In [1715]:
as_summary_2022 = pd.concat([as_other_2022, as_summary_2022]).sort_values('Rating')

In [1716]:
as_2021 = test_2021.loc[test_2021['All Star']][['Player', 'Pos', 'TS%', 'RPG', 'APG', 'PPG', 'BPG', 'SPG','All Star','Year']]
def rating_col(player):
    if player in correct_as_2021.values:
        return 'properly rated'
    elif player in as_pred_not_2021.values:
        return 'overrated'
as_2021['Rating'] = as_2021.Player.apply(rating_col)
as_summary_2021 = as_2021[['Player', 'Rating', 'Year']]
as_other_2021 = not_as_pred_as_2021.to_frame(name = 'Player')
as_other_2021['Rating'] = 'underrated'
as_other_2021['Year'] = 2021
as_summary_2021 = pd.concat([as_other_2021, as_summary_2021]).sort_values('Rating')

In [1717]:
as_2019 = test_players.loc[test_players['All Star']][['Player', 'Pos', 'TS%', 'RPG', 'APG', 'PPG', 'BPG', 'SPG','All Star','Year']]
def rating_col(player):
    if player in correct_as_2019.values:
        return 'properly rated'
    elif player in as_pred_not_2019.values:
        return 'overrated'
as_2019['Rating'] = as_2019.Player.apply(rating_col)
as_summary_2019 = as_2019[['Player', 'Rating', 'Year']]
as_other_2019 = not_as_pred_as_2019.to_frame(name = 'Player')
as_other_2019['Rating'] = 'underrated'
as_other_2019['Year'] = 2019
as_summary_2019 = pd.concat([as_other_2019, as_summary_2019]).sort_values('Rating')

In [1718]:
as_summaries = pd.concat([as_summary_2019, as_summary_2021, as_summary_2022])

In [1719]:
as_summaries['Year'] = as_summaries['Year'].astype(str)

In [1720]:
as_summaries_grouped = as_summaries.groupby('Player').agg({'Rating':'unique', 'Year':'unique'}).reset_index()

In [1721]:
as_summaries_grouped['Rating'] = as_summaries_grouped.Rating.str.join(', ')

In [1722]:
as_summaries_grouped['Year'] = as_summaries_grouped.Year.str.join(', ')

In [1732]:
as_summaries_grouped.loc[as_summaries_grouped.Rating == 'overrated']

Unnamed: 0,Player,Rating,Year
0,Andrew Wiggins,overrated,2022
2,Ben Simmons,overrated,"2019, 2021"
7,Chris Paul,overrated,"2021, 2022"
9,Darius Garland,overrated,2022
11,Dejounte Murray,overrated,2022
13,Dirk Nowitzki,overrated,2019
16,Draymond Green,overrated,2022
17,Dwyane Wade,overrated,2019
18,Fred VanVleet,overrated,2022
22,Jarrett Allen,overrated,2022


In [1733]:
as_summaries_grouped.loc[as_summaries_grouped.Rating == 'underrated']

Unnamed: 0,Player,Rating,Year
5,Brandon Ingram,underrated,2021
6,CJ McCollum,underrated,2021
27,John Wall,underrated,2019
47,Shai Gilgeous-Alexander,underrated,2022


In [1730]:
as_summaries_grouped.to_csv('all_star_classifier_summary.csv', index = False)