# Naive Bayes Classifier for NBA Players

For a full description of this project, please refer to the [GitHub repository](https://github.com/jacquelinekclee/naivebayes_nba_players).

## Table of Contents

- [Process the Training Data](#training)
    - [Discretize the Training Data](#discretize)
    - [Add All Star Data](#allstar)
    - [Add MVP Data](#mvp)
- [Process the Test Data](#test)
    - [Discretize the Test Data](#discretizetest)
    - [Add All Star and MVP Data](#allstarmvp)
- [Perform Naive Bayes Classification on the 2018-19 Data](#naivebayes)
    - [Predict if a Player is an All Star](#naiveallstar)
    - [Predict a Player's Position](#positionnaive)
    - [Predict a Player's Decade](#naivedecade)
- [Process the 2020-21 Data](#2021)
    - [Add All Star Data](#allstar2021)
    - [Discretize the 2020-21 Data](#discretize2021)
- [Perform Naive Bayes Classification on the 2020-21 Data](#naive2021)
    - [Predict if a Player is an All Star](#naiveallstar2021)
    - [Predict a Player's Position](#pos2021)
- [Predict the 2020-21 MVP](#predictmvp)

## Imports

In [1]:
import pandas as pd
import numpy as np

from probabilities import *
from naivebayes_nba_players import *

## Process the Training Data<a class="anchor" id="training"></a>

Get the DataFrame with each player's statistics for each season from 1950-2017.
Since many relevant statistics weren't collected until 1980, I will only keep the season statistics for 1980-2017.

In [2]:
stats = pd.read_csv('Seasons_Stats.csv')
stats_cols = list(stats.columns)
stats.columns

Index(['Unnamed: 0', 'Year', 'Player', 'Pos', 'Age', 'Tm', 'G', 'GS', 'MP',
       'PER', 'TS%', '3PAr', 'FTr', 'ORB%', 'DRB%', 'TRB%', 'AST%', 'STL%',
       'BLK%', 'TOV%', 'USG%', 'blanl', 'OWS', 'DWS', 'WS', 'WS/48', 'blank2',
       'OBPM', 'DBPM', 'BPM', 'VORP', 'FG', 'FGA', 'FG%', '3P', '3PA', '3P%',
       '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB', 'DRB', 'TRB',
       'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS'],
      dtype='object')

In [3]:
max(stats['Year'])

2017.0

In [4]:
stats = stats.loc[stats['Year'] >= 1980].reset_index(drop=True)
stats.head()

Unnamed: 0.1,Unnamed: 0,Year,Player,Pos,Age,Tm,G,GS,MP,PER,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,5727,1980.0,Kareem Abdul-Jabbar*,C,32.0,LAL,82.0,,3143.0,25.3,...,0.765,190.0,696.0,886.0,371.0,81.0,280.0,297.0,216.0,2034.0
1,5728,1980.0,Tom Abernethy,PF,25.0,GSW,67.0,,1222.0,11.0,...,0.683,62.0,129.0,191.0,87.0,35.0,12.0,39.0,118.0,362.0
2,5729,1980.0,Alvan Adams,C,25.0,PHO,75.0,,2168.0,19.2,...,0.797,158.0,451.0,609.0,322.0,108.0,55.0,218.0,237.0,1118.0
3,5730,1980.0,Tiny Archibald*,PG,31.0,BOS,80.0,80.0,2864.0,15.3,...,0.83,59.0,138.0,197.0,671.0,106.0,10.0,242.0,218.0,1131.0
4,5731,1980.0,Dennis Awtrey,C,31.0,CHI,26.0,,560.0,7.4,...,0.64,29.0,86.0,115.0,40.0,12.0,15.0,27.0,66.0,86.0


We only want to keep the following features. These features will be fundamental for our classifier:

* Year
* Player
* Position
* Games played
* True shooting percentage
* Assists
* Points
* Total Rebounds
* Total Steals
* Total Blocks

In [5]:
stats = stats[stats_cols[1:4] + ['G','TS%','TRB','AST','PTS', 'STL', 'BLK']]
stats

Unnamed: 0,Year,Player,Pos,G,TS%,TRB,AST,PTS,STL,BLK
0,1980.0,Kareem Abdul-Jabbar*,C,82.0,0.639,886.0,371.0,2034.0,81.0,280.0
1,1980.0,Tom Abernethy,PF,67.0,0.511,191.0,87.0,362.0,35.0,12.0
2,1980.0,Alvan Adams,C,75.0,0.571,609.0,322.0,1118.0,108.0,55.0
3,1980.0,Tiny Archibald*,PG,80.0,0.574,197.0,671.0,1131.0,106.0,10.0
4,1980.0,Dennis Awtrey,C,26.0,0.524,115.0,40.0,86.0,12.0,15.0
...,...,...,...,...,...,...,...,...,...,...
18922,2017.0,Cody Zeller,PF,62.0,0.604,405.0,99.0,639.0,62.0,58.0
18923,2017.0,Tyler Zeller,C,51.0,0.508,124.0,42.0,178.0,7.0,21.0
18924,2017.0,Stephen Zimmerman,C,19.0,0.346,35.0,4.0,23.0,2.0,5.0
18925,2017.0,Paul Zipser,SF,44.0,0.503,125.0,36.0,240.0,15.0,16.0


Check for missing values

In [6]:
detect_missing_values(stats)

Year :  False
Player: False
Pos: False
G :  False
TS% :  True
TRB :  False
AST :  False
PTS :  False
STL :  False
BLK :  False


Two other irregularities stand out: the years are stored as floats, and some player names carry a trailing asterisk. I will convert the years to integers and strip the asterisks.

In [7]:
stats['Year'] = [int(year) for year in stats['Year']]
stats['Player'] = [name[:-1] if '*' in name else name for name in stats['Player']]
(all([isinstance(year, int) for year in stats['Year']]), 
 all(['*' not in name for name in stats['Player']]))

(True, True)

Now I will convert TRB (total rebounds), AST (assists), PTS (points), BLK (blocks), and STL (steals) to their per-game equivalents by dividing each season total by the number of games played. This makes players comparable with one another; a given player's statistics won't appear inflated because they played more games than another player.

In [8]:
per_game_stats(stats, 'G')
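
As a rough illustration of the per-game conversion, here is a minimal sketch. The helper `per_game_sketch` and the toy rows are hypothetical; the project's own `per_game_stats` lives in `naivebayes_nba_players` and may differ (for example, it appears to modify the DataFrame in place, while this sketch returns a copy).

```python
import pandas as pd

def per_game_sketch(df, games_col):
    """Return a copy of df with per-game columns added (hypothetical helper)."""
    # Season-total column -> per-game column name, matching the names used in this notebook.
    totals_to_rates = {'TRB': 'RPG', 'AST': 'APG', 'PTS': 'PPG', 'BLK': 'BPG', 'STL': 'SPG'}
    out = df.copy()
    for total, rate in totals_to_rates.items():
        out[rate] = out[total] / out[games_col]
    return out

# Toy rows using the 1980 totals for the first two players shown above.
toy = pd.DataFrame({
    'G':   [82, 67],
    'TRB': [886, 191],
    'AST': [371, 87],
    'PTS': [2034, 362],
    'BLK': [280, 12],
    'STL': [81, 35],
})
toy_pg = per_game_sketch(toy, 'G')
print(toy_pg[['RPG', 'APG', 'PPG', 'BPG', 'SPG']])
```

A player with zero games would produce a division by zero (`inf` or `NaN`), so a real implementation may want to filter those rows first.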

In [9]:
stats = stats.drop(columns=['TRB', 'AST', 'PTS', 'BLK', 'STL', 'G'])

In [10]:
stats_descr = stats.describe()
stats_descr

Unnamed: 0,Year,TS%,RPG,APG,PPG,BPG,SPG
count,18927.0,18851.0,18927.0,18927.0,18927.0,18927.0,18927.0
mean,2000.272415,0.503862,3.468066,1.848202,8.047679,0.406166,0.659704
std,10.691977,0.094507,2.53764,1.848489,5.958002,0.509952,0.479654
min,1980.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1992.0,0.473,1.609566,0.571429,3.383974,0.090909,0.3125
50%,2001.0,0.516,2.814815,1.25,6.492537,0.238095,0.567568
75%,2010.0,0.551,4.68058,2.503165,11.5,0.511628,0.911392
max,2017.0,1.136,18.658537,14.538462,37.085366,6.0,3.670732


Position abbreviations:

* PG = Point Guard
* G = Point Guard and Shooting Guard
* SG = Shooting Guard
* GF = Shooting Guard and Small Forward
* SF = Small Forward
* F = Small Forward and Power Forward
* PF = Power Forward
* FC = Power Forward and Center
* C = Center

In [11]:
list(set(stats['Pos']))

['SG-PG',
 'SG-SF',
 'SF',
 'SF-PF',
 'PG-SG',
 'SG',
 'C',
 'SF-SG',
 'C-PF',
 'PF-C',
 'PG-SF',
 'C-SF',
 'SG-PF',
 'PF',
 'PF-SF',
 'PG']

The stats 'Pos' column will be refined to match these positions.

* 'PG', 'SG', 'SG-PG', and 'PG-SG' will become 'G'
* 'PF-C', 'C-PF', and 'C-SF' will become 'FC'
* 'SF', 'SF-PF', 'PF', and 'PF-SF' will become 'F'
* 'SG-SF', 'SG-PF', 'PG-SF', and 'SF-SG' will become 'GF'
* 'C' will remain 'C'
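
The mapping in the list above can be written as a plain dictionary lookup. This is a sketch only: `POSITION_MAP` and `map_positions` are hypothetical names, and the project's `reset_position` may be implemented differently.

```python
import pandas as pd

# Mapping taken directly from the list above.
POSITION_MAP = {
    'PG': 'G', 'SG': 'G', 'SG-PG': 'G', 'PG-SG': 'G',
    'PF-C': 'FC', 'C-PF': 'FC', 'C-SF': 'FC',
    'SF': 'F', 'SF-PF': 'F', 'PF': 'F', 'PF-SF': 'F',
    'SG-SF': 'GF', 'SG-PF': 'GF', 'PG-SF': 'GF', 'SF-SG': 'GF',
    'C': 'C',
}

def map_positions(pos):
    """Map raw position labels to the five refined labels (hypothetical helper)."""
    unknown = set(pos) - set(POSITION_MAP)
    if unknown:
        # Fail loudly rather than silently producing NaN for an unmapped label.
        raise ValueError(f"Unmapped positions: {sorted(unknown)}")
    return pos.map(POSITION_MAP)

refined = map_positions(pd.Series(['PG', 'C-PF', 'SF-SG', 'C', 'PF']))
print(list(refined))
```

Raising on unknown labels is a design choice for this sketch; it makes a new label in a later season's data visible immediately.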

In [12]:
reset_position(stats)
list(set(stats['Pos']))

['GF', 'FC', 'C', 'G', 'F']

In [13]:
stats

Unnamed: 0,Year,Player,Pos,TS%,RPG,APG,PPG,BPG,SPG
0,1980,Kareem Abdul-Jabbar,C,0.639,10.804878,4.524390,24.804878,3.414634,0.987805
1,1980,Tom Abernethy,F,0.511,2.850746,1.298507,5.402985,0.179104,0.522388
2,1980,Alvan Adams,C,0.571,8.120000,4.293333,14.906667,0.733333,1.440000
3,1980,Tiny Archibald,G,0.574,2.462500,8.387500,14.137500,0.125000,1.325000
4,1980,Dennis Awtrey,C,0.524,4.423077,1.538462,3.307692,0.576923,0.461538
...,...,...,...,...,...,...,...,...,...
18922,2017,Cody Zeller,F,0.604,6.532258,1.596774,10.306452,0.935484,1.000000
18923,2017,Tyler Zeller,C,0.508,2.431373,0.823529,3.490196,0.411765,0.137255
18924,2017,Stephen Zimmerman,C,0.346,1.842105,0.210526,1.210526,0.263158,0.105263
18925,2017,Paul Zipser,F,0.503,2.840909,0.818182,5.454545,0.363636,0.340909


#### Discretize the Training Data<a class="anchor" id="discretize"></a> 

To perform Naive Bayes with this continuous data, I will discretize the data. Each continuous value is replaced with the label of the quantile bin it falls into; in the outputs below, the labels are 1st, 2nd, and 3rd.
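
To make the binning concrete, here is a minimal sketch of quantile-based discretization. The helpers `quantile_cutoffs` and `to_bin_label` are hypothetical stand-ins for the project's `quantile_dict` and `discretize`. The 0.25 and 0.75 cutoffs match the `q_vals` used later for the 2020-21 data; whether the training data uses the same cutoffs, and how values exactly on a cutoff are binned, are assumptions.

```python
import numpy as np
import pandas as pd

def quantile_cutoffs(df, cols, q_vals=(0.25, 0.75)):
    """Compute the cutoffs for each column, ignoring missing values (hypothetical helper)."""
    return {col: [np.nanquantile(df[col], q) for q in q_vals] for col in cols}

def to_bin_label(val, cutoffs):
    """Label a value by the bin it falls into (assumes inclusive upper bounds)."""
    low, high = cutoffs
    if val <= low:
        return '1st'
    if val <= high:
        return '2nd'
    return '3rd'

toy = pd.DataFrame({'PPG': [2.0, 5.0, 8.0, 11.0, 30.0]})
cutoffs = quantile_cutoffs(toy, ['PPG'])
labels = [to_bin_label(v, cutoffs['PPG']) for v in toy['PPG']]
print(cutoffs['PPG'], labels)
```

With three bins, every feature becomes a categorical variable with at most three values, which keeps the conditional frequency tables small.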

In [14]:
players_stats_cols = list(stats.columns)[list(stats.columns).index('TS%'):]
players_quantiles = quantile_dict(stats)
discretized_columns = {col:[discretize(val, col, players_quantiles) for val in stats[col]] for col in players_stats_cols}

Using this discretized data, I will make a new dataframe where all the data is categorical.

In [15]:
stats_categorical = categorical_df(stats, discretized_columns)

In [16]:
stats_categorical.head()

Unnamed: 0,Year,Player,Pos,TS%,RPG,APG,PPG,BPG,SPG
0,1980,Kareem Abdul-Jabbar,C,3rd,3rd,3rd,3rd,3rd,3rd
1,1980,Tom Abernethy,F,3rd,2nd,2nd,2nd,2nd,2nd
2,1980,Alvan Adams,C,3rd,3rd,3rd,3rd,3rd,3rd
3,1980,Tiny Archibald,G,3rd,2nd,3rd,3rd,2nd,3rd
4,1980,Dennis Awtrey,C,3rd,2nd,2nd,1st,3rd,2nd
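
In the output above, TS% is labeled `3rd` for every row shown, including Tom Abernethy's 0.511, which sits between the 25th percentile (0.473) and 75th percentile (0.551) reported by `describe()`. One possible cause, offered as an assumption because `quantile_dict` is not shown here, is that the TS% column still contains missing values: `np.quantile` returns `nan` when the input contains `NaN`, and every comparison against `nan` is false. The sketch below shows that behavior and the `np.nanquantile` alternative.

```python
import numpy as np

# Toy TS% values with one missing entry.
ts = np.array([0.639, 0.511, np.nan, 0.574])

plain = np.quantile(ts, 0.25)        # nan, because the input contains NaN
ignoring = np.nanquantile(ts, 0.25)  # computed from the three non-missing values

# Any comparison against nan is False, so a "val <= cutoff" test never passes.
below = 0.511 <= plain

print(plain, ignoring, below)
```

If this is what is happening, dropping or imputing the missing TS% rows before computing the cutoffs would give TS% the same three-way split as the other features.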


In [17]:
stats.head()

Unnamed: 0,Year,Player,Pos,TS%,RPG,APG,PPG,BPG,SPG
0,1980,Kareem Abdul-Jabbar,C,0.639,10.804878,4.52439,24.804878,3.414634,0.987805
1,1980,Tom Abernethy,F,0.511,2.850746,1.298507,5.402985,0.179104,0.522388
2,1980,Alvan Adams,C,0.571,8.12,4.293333,14.906667,0.733333,1.44
3,1980,Tiny Archibald,G,0.574,2.4625,8.3875,14.1375,0.125,1.325
4,1980,Dennis Awtrey,C,0.524,4.423077,1.538462,3.307692,0.576923,0.461538


#### Add a Column Indicating if a Player earned All Star honors that Season<a class="anchor" id="allstar"></a>

I will now add a column indicating whether a player was an All Star in that season.

In [18]:
all_stars_df = pd.read_csv('all_stars.csv')
all_stars_df = all_stars_df.rename(columns = {'Name':'Player'})
all_stars_df.head()

Unnamed: 0.1,Unnamed: 0,Player,Year
0,0,Kareem Abdul-Jabbar,1970
1,1,Kareem Abdul-Jabbar,1971
2,2,Kareem Abdul-Jabbar,1972
3,3,Kareem Abdul-Jabbar,1973
4,4,Kareem Abdul-Jabbar,1974


In [19]:
all_star_tups = build_tups(all_stars_df)

In [20]:
players_tups = build_tups(stats)

In [21]:
all_stars = create_yn_cols(players_tups, all_star_tups)

In [22]:
stats['All Star'] = all_stars
stats_categorical['All Star'] = all_stars
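
The All Star flag boils down to checking whether each (player, year) pair appears in the All Star list. The sketch below shows that idea; `player_year_pairs` and `yes_no_flags` are hypothetical stand-ins for `build_tups` and `create_yn_cols`, whose actual implementations may differ.

```python
import pandas as pd

def player_year_pairs(df):
    """Build (Player, Year) tuples from a DataFrame (hypothetical helper)."""
    return list(zip(df['Player'], df['Year']))

def yes_no_flags(pairs, honored_pairs):
    """Return 'Y' for pairs found among the honored pairs, otherwise 'N'."""
    honored = set(honored_pairs)  # set membership keeps each lookup O(1)
    return ['Y' if pair in honored else 'N' for pair in pairs]

players = pd.DataFrame({
    'Player': ['Kareem Abdul-Jabbar', 'Tom Abernethy'],
    'Year': [1980, 1980],
})
honors = pd.DataFrame({
    'Player': ['Kareem Abdul-Jabbar', 'Kareem Abdul-Jabbar'],
    'Year': [1979, 1980],
})
flags = yes_no_flags(player_year_pairs(players), player_year_pairs(honors))
print(flags)
```

Because the match is exact, any difference in spelling between the two files (accents, suffixes, trailing characters) would silently produce an `N`, which is one reason the names were cleaned earlier.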

In [23]:
stats.head()

Unnamed: 0,Year,Player,Pos,TS%,RPG,APG,PPG,BPG,SPG,All Star
0,1980,Kareem Abdul-Jabbar,C,0.639,10.804878,4.52439,24.804878,3.414634,0.987805,Y
1,1980,Tom Abernethy,F,0.511,2.850746,1.298507,5.402985,0.179104,0.522388,N
2,1980,Alvan Adams,C,0.571,8.12,4.293333,14.906667,0.733333,1.44,N
3,1980,Tiny Archibald,G,0.574,2.4625,8.3875,14.1375,0.125,1.325,N
4,1980,Dennis Awtrey,C,0.524,4.423077,1.538462,3.307692,0.576923,0.461538,N


In [24]:
stats_all_stars = stats_categorical.loc[stats_categorical['All Star'] == 'Y']

In [25]:
stats_not_all_stars = stats_categorical.loc[stats_categorical['All Star'] == 'N']

In [26]:
stats_all_stars_cont = stats[stats['All Star'] == 'Y']
stats_not_all_stars_cont = stats[stats['All Star'] == 'N']

#### Add a Column Indicating if a Player earned MVP honors that Season<a class="anchor" id="mvp"></a>

In [27]:
mvps = pd.read_csv('mvps.csv')
mvps.head()

Unnamed: 0,Season,Lg,Player,Voting,Age,Tm,G,MP,PTS,TRB,AST,STL,BLK,FG%,3P%,FT%,WS,WS/48
0,2019-20,NBA,Giannis Antetokounmpo\antetgi01,(V),25,MIL,63,30.4,29.5,13.6,5.6,1.0,1.0,0.553,0.304,0.633,11.1,0.279
1,2018-19,NBA,Giannis Antetokounmpo\antetgi01,(V),24,MIL,72,32.8,27.7,12.5,5.9,1.3,1.5,0.578,0.256,0.729,14.4,0.292
2,2017-18,NBA,James Harden\hardeja01,(V),28,HOU,72,35.4,30.4,5.4,8.8,1.8,0.7,0.449,0.367,0.858,15.4,0.289
3,2016-17,NBA,Russell Westbrook\westbru01,(V),28,OKC,81,34.6,31.6,10.7,10.4,1.6,0.4,0.425,0.343,0.845,13.1,0.224
4,2015-16,NBA,Stephen Curry\curryst01,(V),27,GSW,79,34.2,30.1,5.4,6.7,2.1,0.2,0.504,0.454,0.908,17.9,0.318


In [28]:
mvps['Player'] = [name[:name.index('\\')] for name in mvps['Player']]
mvps.head()

Unnamed: 0,Season,Lg,Player,Voting,Age,Tm,G,MP,PTS,TRB,AST,STL,BLK,FG%,3P%,FT%,WS,WS/48
0,2019-20,NBA,Giannis Antetokounmpo,(V),25,MIL,63,30.4,29.5,13.6,5.6,1.0,1.0,0.553,0.304,0.633,11.1,0.279
1,2018-19,NBA,Giannis Antetokounmpo,(V),24,MIL,72,32.8,27.7,12.5,5.9,1.3,1.5,0.578,0.256,0.729,14.4,0.292
2,2017-18,NBA,James Harden,(V),28,HOU,72,35.4,30.4,5.4,8.8,1.8,0.7,0.449,0.367,0.858,15.4,0.289
3,2016-17,NBA,Russell Westbrook,(V),28,OKC,81,34.6,31.6,10.7,10.4,1.6,0.4,0.425,0.343,0.845,13.1,0.224
4,2015-16,NBA,Stephen Curry,(V),27,GSW,79,34.2,30.1,5.4,6.7,2.1,0.2,0.504,0.454,0.908,17.9,0.318


In [29]:
mvp_years = [2000 if season == '1999-2000'
             else int(season[:2] + season[-2:len(season)])
            for season in mvps['Season']]
mvps['Year'] = mvp_years
mvps.head()

Unnamed: 0,Season,Lg,Player,Voting,Age,Tm,G,MP,PTS,TRB,AST,STL,BLK,FG%,3P%,FT%,WS,WS/48,Year
0,2019-20,NBA,Giannis Antetokounmpo,(V),25,MIL,63,30.4,29.5,13.6,5.6,1.0,1.0,0.553,0.304,0.633,11.1,0.279,2020
1,2018-19,NBA,Giannis Antetokounmpo,(V),24,MIL,72,32.8,27.7,12.5,5.9,1.3,1.5,0.578,0.256,0.729,14.4,0.292,2019
2,2017-18,NBA,James Harden,(V),28,HOU,72,35.4,30.4,5.4,8.8,1.8,0.7,0.449,0.367,0.858,15.4,0.289,2018
3,2016-17,NBA,Russell Westbrook,(V),28,OKC,81,34.6,31.6,10.7,10.4,1.6,0.4,0.425,0.343,0.845,13.1,0.224,2017
4,2015-16,NBA,Stephen Curry,(V),27,GSW,79,34.2,30.1,5.4,6.7,2.1,0.2,0.504,0.454,0.908,17.9,0.318,2016


In [30]:
mvp_tups = build_tups(mvps)

In [31]:
mvp_lst = create_yn_cols(players_tups, mvp_tups)
stats['MVP'] = mvp_lst
stats.head()

Unnamed: 0,Year,Player,Pos,TS%,RPG,APG,PPG,BPG,SPG,All Star,MVP
0,1980,Kareem Abdul-Jabbar,C,0.639,10.804878,4.52439,24.804878,3.414634,0.987805,Y,Y
1,1980,Tom Abernethy,F,0.511,2.850746,1.298507,5.402985,0.179104,0.522388,N,N
2,1980,Alvan Adams,C,0.571,8.12,4.293333,14.906667,0.733333,1.44,N,N
3,1980,Tiny Archibald,G,0.574,2.4625,8.3875,14.1375,0.125,1.325,N,N
4,1980,Dennis Awtrey,C,0.524,4.423077,1.538462,3.307692,0.576923,0.461538,N,N


## Process the Test Data (data to be classified): Player Data from the 2018-19 Season<a class="anchor" id="test"></a>

In [32]:
test_players = pd.read_csv('players_1819.csv')
test_players.head()

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,MP,PER,TS%,3PAr,...,USG%,OWS,DWS,WS,WS.1,WS/48,OBPM,DBPM,BPM,VORP
0,1,Álex Abrines\abrinal01,SG,25,OKC,31,588,6.3,0.507,0.809,...,12.2,0.1,0.6,0.6,0.6,0.053,-3.7,0.4,-3.3,-0.2
1,2,Quincy Acy\acyqu01,PF,28,PHO,10,123,2.9,0.379,0.833,...,9.2,-0.1,0.0,-0.1,-0.1,-0.022,-7.6,-0.5,-8.1,-0.2
2,3,Jaylen Adams\adamsja01,PG,22,ATL,34,428,7.6,0.474,0.673,...,13.5,-0.1,0.2,0.1,0.1,0.011,-3.8,-0.5,-4.3,-0.2
3,4,Steven Adams\adamsst01,C,25,OKC,80,2669,18.5,0.591,0.002,...,16.4,5.1,4.0,9.1,9.1,0.163,0.7,0.4,1.1,2.1
4,5,Bam Adebayo\adebaba01,C,21,MIA,82,1913,17.9,0.623,0.031,...,15.8,3.4,3.4,6.8,6.8,0.171,-0.4,2.2,1.8,1.8


The data below holds each player's season totals.

In [33]:
test_players_totals = pd.read_csv('players_1819_totals.csv')
test_players_totals.head()

Unnamed: 0.1,Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,1,Álex Abrines\abrinal01,SG,25,OKC,31,2,588,56,157,...,0.923,5,43,48,20,17,6,14,53,165
1,2,Quincy Acy\acyqu01,PF,28,PHO,10,0,123,4,18,...,0.7,3,22,25,8,1,4,4,24,17
2,3,Jaylen Adams\adamsja01,PG,22,ATL,34,1,428,38,110,...,0.778,11,49,60,65,14,5,28,45,108
3,4,Steven Adams\adamsst01,C,25,OKC,80,80,2669,481,809,...,0.5,391,369,760,124,117,76,135,204,1108
4,5,Bam Adebayo\adebaba01,C,21,MIA,82,28,1913,280,486,...,0.735,165,432,597,184,71,65,121,203,729


Merge the two dataframes in order to calculate the per game statistics

In [34]:
test_players = pd.merge(test_players, test_players_totals, on=['Player'])
test_players.head()

Unnamed: 0,Rk,Player,Pos_x,Age_x,Tm_x,G_x,MP_x,PER,TS%,3PAr,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,1,Álex Abrines\abrinal01,SG,25,OKC,31,588,6.3,0.507,0.809,...,0.923,5,43,48,20,17,6,14,53,165
1,2,Quincy Acy\acyqu01,PF,28,PHO,10,123,2.9,0.379,0.833,...,0.7,3,22,25,8,1,4,4,24,17
2,3,Jaylen Adams\adamsja01,PG,22,ATL,34,428,7.6,0.474,0.673,...,0.778,11,49,60,65,14,5,28,45,108
3,4,Steven Adams\adamsst01,C,25,OKC,80,2669,18.5,0.591,0.002,...,0.5,391,369,760,124,117,76,135,204,1108
4,5,Bam Adebayo\adebaba01,C,21,MIA,82,1913,17.9,0.623,0.031,...,0.735,165,432,597,184,71,65,121,203,729


In [35]:
test_players.columns

Index(['Rk', 'Player', 'Pos_x', 'Age_x', 'Tm_x', 'G_x', 'MP_x', 'PER', 'TS%',
       '3PAr', 'FTr', 'ORB%', 'DRB%', 'TRB%', 'AST%', 'STL%', 'BLK%', 'TOV%',
       'USG%', 'OWS', 'DWS', 'WS', 'WS.1', 'WS/48', 'OBPM', 'DBPM', 'BPM',
       'VORP', 'Unnamed: 0', 'Pos_y', 'Age_y', 'Tm_y', 'G_y', 'GS', 'MP_y',
       'FG', 'FGA', 'FG%', '3P', '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%',
       'FT', 'FTA', 'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV',
       'PF', 'PTS'],
      dtype='object')

In [36]:
per_game_stats(test_players, 'G_x')
test_players.head()

Unnamed: 0,Rk,Player,Pos_x,Age_x,Tm_x,G_x,MP_x,PER,TS%,3PAr,...,STL,BLK,TOV,PF,PTS,RPG,APG,PPG,BPG,SPG
0,1,Álex Abrines\abrinal01,SG,25,OKC,31,588,6.3,0.507,0.809,...,17,6,14,53,165,1.548387,0.645161,5.322581,0.193548,0.548387
1,2,Quincy Acy\acyqu01,PF,28,PHO,10,123,2.9,0.379,0.833,...,1,4,4,24,17,2.5,0.8,1.7,0.4,0.1
2,3,Jaylen Adams\adamsja01,PG,22,ATL,34,428,7.6,0.474,0.673,...,14,5,28,45,108,1.764706,1.911765,3.176471,0.147059,0.411765
3,4,Steven Adams\adamsst01,C,25,OKC,80,2669,18.5,0.591,0.002,...,117,76,135,204,1108,9.5,1.55,13.85,0.95,1.4625
4,5,Bam Adebayo\adebaba01,C,21,MIA,82,1913,17.9,0.623,0.031,...,71,65,121,203,729,7.280488,2.243902,8.890244,0.792683,0.865854


In [37]:
test_players = test_players.rename(columns={'Pos_x':'Pos'})

Only keep the columns that correspond with the training data (stats)

In [38]:
test_players_cols = list(filter(lambda col: col in stats.columns, test_players.columns))
test_players = test_players[test_players_cols]

In [39]:
test_players = test_players.dropna()

In [40]:
test_players.head()

Unnamed: 0,Player,Pos,TS%,RPG,APG,PPG,BPG,SPG
0,Álex Abrines\abrinal01,SG,0.507,1.548387,0.645161,5.322581,0.193548,0.548387
1,Quincy Acy\acyqu01,PF,0.379,2.5,0.8,1.7,0.4,0.1
2,Jaylen Adams\adamsja01,PG,0.474,1.764706,1.911765,3.176471,0.147059,0.411765
3,Steven Adams\adamsst01,C,0.591,9.5,1.55,13.85,0.95,1.4625
4,Bam Adebayo\adebaba01,C,0.623,7.280488,2.243902,8.890244,0.792683,0.865854


Reset the positions so that they coincide with the training data (stats)

In [41]:
reset_position(test_players)

In [42]:
set(test_players['Pos'])

{'C', 'F', 'FC', 'G', 'GF'}

In [43]:
test_players['Player'] = [name[:name.index('\\')] for name in test_players['Player']]
test_players.head()

Unnamed: 0,Player,Pos,TS%,RPG,APG,PPG,BPG,SPG
0,Álex Abrines,G,0.507,1.548387,0.645161,5.322581,0.193548,0.548387
1,Quincy Acy,F,0.379,2.5,0.8,1.7,0.4,0.1
2,Jaylen Adams,G,0.474,1.764706,1.911765,3.176471,0.147059,0.411765
3,Steven Adams,C,0.591,9.5,1.55,13.85,0.95,1.4625
4,Bam Adebayo,C,0.623,7.280488,2.243902,8.890244,0.792683,0.865854


In [44]:
test_descr = test_players.describe()
test_descr

Unnamed: 0,TS%,RPG,APG,PPG,BPG,SPG
count,1250.0,1250.0,1250.0,1250.0,1250.0,1250.0
mean,0.528119,4.734154,2.382562,11.004665,0.433511,0.827367
std,0.112984,6.839645,4.177607,18.514852,0.623698,1.363554
min,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.497,1.622466,0.666667,3.675,0.088235,0.270885
50%,0.538,3.046537,1.313393,7.007812,0.25,0.522727
75%,0.578,5.334416,2.666667,13.210311,0.539221,0.947193
max,1.5,86.0,80.0,420.0,8.5,27.0
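
The summary above reports 1,250 rows and maxima such as 86.0 RPG and 420.0 PPG, which are not plausible per-game values. One possible explanation, stated as an assumption because the CSV contents are not fully shown, is that players who changed teams mid-season appear on several rows in both files, so merging on `Player` alone pairs every row for a player with every other row for that player. The toy example below reproduces the effect and shows one hedged alternative: merging on both `Player` and `Tm`.

```python
import pandas as pd

# Player A was traded: a combined 'TOT' row plus one row per team (toy data).
adv = pd.DataFrame({
    'Player': ['A', 'A', 'A', 'B'],
    'Tm':     ['TOT', 'X', 'Y', 'Z'],
    'G':      [60, 40, 20, 70],
})
tot = pd.DataFrame({
    'Player': ['A', 'A', 'A', 'B'],
    'Tm':     ['TOT', 'X', 'Y', 'Z'],
    'PTS':    [600, 400, 200, 700],
})

# Merging on Player alone: 3 x 3 rows for A plus 1 row for B.
by_player = pd.merge(adv, tot, on=['Player'])
by_player['PPG'] = by_player['PTS'] / by_player['G']

# Merging on Player and team keeps one row per original row.
by_player_team = pd.merge(adv, tot, on=['Player', 'Tm'])
by_player_team['PPG'] = by_player_team['PTS'] / by_player_team['G']

print(len(by_player), by_player['PPG'].max())
print(len(by_player_team), by_player_team['PPG'].max())
```

In the toy data every true per-game value is 10 points, yet the player-only merge reports a maximum of 30 because season totals get divided by a single team's game count. Keeping only the combined row for traded players would be another option.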


### Discretize the Test Data<a class="anchor" id="discretizetest"></a>

In [45]:
test_stats_cols = list(test_players.columns)[list(test_players.columns).index('TS%'):]

test_quantiles = quantile_dict(test_players)

test_discretized_columns = {col:[discretize(val, col, test_quantiles) for val in test_players[col]] for col in test_stats_cols}

# create the dataframe where all data is categorical
test_players_categorical = categorical_df(test_players, test_discretized_columns)

In [46]:
test_players.head()

Unnamed: 0,Player,Pos,TS%,RPG,APG,PPG,BPG,SPG
0,Álex Abrines,G,0.507,1.548387,0.645161,5.322581,0.193548,0.548387
1,Quincy Acy,F,0.379,2.5,0.8,1.7,0.4,0.1
2,Jaylen Adams,G,0.474,1.764706,1.911765,3.176471,0.147059,0.411765
3,Steven Adams,C,0.591,9.5,1.55,13.85,0.95,1.4625
4,Bam Adebayo,C,0.623,7.280488,2.243902,8.890244,0.792683,0.865854


In [47]:
test_players_categorical.head()

Unnamed: 0,Player,Pos,TS%,RPG,APG,PPG,BPG,SPG
0,Álex Abrines,G,2nd,1st,1st,2nd,2nd,2nd
1,Quincy Acy,F,1st,2nd,2nd,1st,2nd,1st
2,Jaylen Adams,G,1st,2nd,2nd,1st,2nd,2nd
3,Steven Adams,C,3rd,3rd,2nd,3rd,3rd,3rd
4,Bam Adebayo,C,3rd,3rd,2nd,2nd,3rd,2nd


In [48]:
test_players['Year'] = 2019
test_tups = build_tups(test_players)

#### Add the All Star and MVP Data<a class="anchor" id="allstarmvp"></a> 

In [49]:
test_all_stars = create_yn_cols(test_tups, all_star_tups)

In [50]:
test_players['All Star'] = test_all_stars
test_players_categorical['All Star'] = test_all_stars

In [51]:
test_mvp_lst = create_yn_cols(test_tups, mvp_tups)
test_players['MVP'] = test_mvp_lst
test_players.head()

Unnamed: 0,Player,Pos,TS%,RPG,APG,PPG,BPG,SPG,Year,All Star,MVP
0,Álex Abrines,G,0.507,1.548387,0.645161,5.322581,0.193548,0.548387,2019,N,N
1,Quincy Acy,F,0.379,2.5,0.8,1.7,0.4,0.1,2019,N,N
2,Jaylen Adams,G,0.474,1.764706,1.911765,3.176471,0.147059,0.411765,2019,N,N
3,Steven Adams,C,0.591,9.5,1.55,13.85,0.95,1.4625,2019,N,N
4,Bam Adebayo,C,0.623,7.280488,2.243902,8.890244,0.792683,0.865854,2019,N,N


In [52]:
test_players.loc[test_players['MVP'] == 'Y']

Unnamed: 0,Player,Pos,TS%,RPG,APG,PPG,BPG,SPG,Year,All Star,MVP
25,Giannis Antetokounmpo,F,0.644,12.472222,5.888889,27.694444,1.527778,1.277778,2019,Y,Y


In [53]:
test_players.head()

Unnamed: 0,Player,Pos,TS%,RPG,APG,PPG,BPG,SPG,Year,All Star,MVP
0,Álex Abrines,G,0.507,1.548387,0.645161,5.322581,0.193548,0.548387,2019,N,N
1,Quincy Acy,F,0.379,2.5,0.8,1.7,0.4,0.1,2019,N,N
2,Jaylen Adams,G,0.474,1.764706,1.911765,3.176471,0.147059,0.411765,2019,N,N
3,Steven Adams,C,0.591,9.5,1.55,13.85,0.95,1.4625,2019,N,N
4,Bam Adebayo,C,0.623,7.280488,2.243902,8.890244,0.792683,0.865854,2019,N,N


In [54]:
test_players_categorical.head()

Unnamed: 0,Player,Pos,TS%,RPG,APG,PPG,BPG,SPG,All Star
0,Álex Abrines,G,2nd,1st,1st,2nd,2nd,2nd,N
1,Quincy Acy,F,1st,2nd,2nd,1st,2nd,1st,N
2,Jaylen Adams,G,1st,2nd,2nd,1st,2nd,2nd,N
3,Steven Adams,C,3rd,3rd,2nd,3rd,3rd,3rd,N
4,Bam Adebayo,C,3rd,3rd,2nd,2nd,3rd,2nd,N


## Naive Bayes Probabilities: 2018-19 Players<a class="anchor" id="naivebayes"></a>

### Predict if a Player is an All Star or not<a class="anchor" id="naiveallstar"></a> 

In [55]:
stats.head()

Unnamed: 0,Year,Player,Pos,TS%,RPG,APG,PPG,BPG,SPG,All Star,MVP
0,1980,Kareem Abdul-Jabbar,C,0.639,10.804878,4.52439,24.804878,3.414634,0.987805,Y,Y
1,1980,Tom Abernethy,F,0.511,2.850746,1.298507,5.402985,0.179104,0.522388,N,N
2,1980,Alvan Adams,C,0.571,8.12,4.293333,14.906667,0.733333,1.44,N,N
3,1980,Tiny Archibald,G,0.574,2.4625,8.3875,14.1375,0.125,1.325,N,N
4,1980,Dennis Awtrey,C,0.524,4.423077,1.538462,3.307692,0.576923,0.461538,N,N


In [56]:
figures = list(test_players.columns)[1:-3][:]
figures

['Pos', 'TS%', 'RPG', 'APG', 'PPG', 'BPG', 'SPG']

In [57]:
all_star_probabilities = []
not_all_star_probabilities = []
for i in range(test_players.shape[0]):
    all_star_probabilities.append(
        naive_bayes_yn(i, figures, test_players_categorical, stats_all_stars))
    not_all_star_probabilities.append(
        naive_bayes_yn(i, figures, test_players_categorical, stats_not_all_stars))
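
For readers unfamiliar with the calculation, here is a minimal sketch of a categorical Naive Bayes score: the class prior multiplied by the product of per-feature conditional frequencies. The function `likelihood` and the toy data are hypothetical; `naive_bayes_yn` is defined in the project module, and whether it includes the prior or any smoothing is not shown here.

```python
import pandas as pd

def likelihood(row, features, class_df):
    """P(features | class) under the naive independence assumption (hypothetical helper)."""
    p = 1.0
    for f in features:
        # Fraction of training rows in this class that share the player's value for f.
        p *= (class_df[f] == row[f]).mean()
    return p

train = pd.DataFrame({
    'PPG':      ['3rd', '3rd', '2nd', '1st', '1st', '2nd'],
    'APG':      ['3rd', '2nd', '2nd', '1st', '2nd', '1st'],
    'All Star': ['Y',   'Y',   'N',   'N',   'N',   'N'],
})
yes_df = train[train['All Star'] == 'Y']
no_df = train[train['All Star'] == 'N']

player = pd.Series({'PPG': '3rd', 'APG': '2nd'})
features = ['PPG', 'APG']

# Score = prior * likelihood for each class.
score_yes = (len(yes_df) / len(train)) * likelihood(player, features, yes_df)
score_no = (len(no_df) / len(train)) * likelihood(player, features, no_df)
prediction = 'Y' if score_yes >= score_no else 'N'
print(score_yes, score_no, prediction)
```

Note that a single unseen value drives a class score to zero in this unsmoothed form; Laplace smoothing is a common remedy.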

In [58]:
prediction = []
for i in range(len(all_star_probabilities)):
    yes = all_star_probabilities[i]
    no = not_all_star_probabilities[i]
    if yes >= no:
        prediction.append('Y')
    else:
        prediction.append('N')

In [59]:
test_players_categorical['All Star Prediction'] = prediction

In [60]:
test_players_categorical.head()

Unnamed: 0,Player,Pos,TS%,RPG,APG,PPG,BPG,SPG,All Star,All Star Prediction
0,Álex Abrines,G,2nd,1st,1st,2nd,2nd,2nd,N,N
1,Quincy Acy,F,1st,2nd,2nd,1st,2nd,1st,N,N
2,Jaylen Adams,G,1st,2nd,2nd,1st,2nd,2nd,N,N
3,Steven Adams,C,3rd,3rd,2nd,3rd,3rd,3rd,N,Y
4,Bam Adebayo,C,3rd,3rd,2nd,2nd,3rd,2nd,N,N


In [61]:
stats_all_stars_cont = stats.loc[stats['All Star'] == 'Y']
stats_not_all_stars_cont = stats.loc[stats['All Star'] == 'N']

In [62]:
all_star_cont_probabilities = []
not_all_star_cont_probabilities = []

for i in range(test_players.shape[0]):
    all_star_cont_probabilities.append(
        normpdf(i, figures, test_players, stats_all_stars_cont))
    not_all_star_cont_probabilities.append(
        normpdf(i, figures, test_players, stats_not_all_stars_cont))
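
The continuous variant replaces frequency counts with a normal density fitted to each feature within a class. The sketch below illustrates that idea for numeric features only; `gaussian_pdf` and `gaussian_likelihood` are hypothetical names, and the project's `normpdf` may handle the categorical `Pos` feature, priors, or the standard deviation differently.

```python
import math
import pandas as pd

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def gaussian_likelihood(row, features, class_df):
    """Product of per-feature normal densities (hypothetical helper, numeric features only)."""
    p = 1.0
    for f in features:
        # pandas uses the sample standard deviation (ddof=1) by default.
        p *= gaussian_pdf(row[f], class_df[f].mean(), class_df[f].std())
    return p

stars = pd.DataFrame({'PPG': [20.0, 24.0, 28.0], 'APG': [4.0, 6.0, 8.0]})
others = pd.DataFrame({'PPG': [4.0, 6.0, 8.0], 'APG': [1.0, 2.0, 3.0]})
player = pd.Series({'PPG': 24.0, 'APG': 6.0})

like_star = gaussian_likelihood(player, ['PPG', 'APG'], stars)
like_other = gaussian_likelihood(player, ['PPG', 'APG'], others)
print(like_star, like_other)
```

These values are densities rather than probabilities, so they can exceed 1 and are only meaningful when compared across classes for the same player.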

In [63]:
test_players['p(All Star)'] = all_star_cont_probabilities

Among the 26 players with the highest All Star probabilities (26 being the number of All Stars that year), how many were actual All Stars?

In [64]:
predicted_all_stars = set(test_players.sort_values('p(All Star)', ascending=False).head(26)['Player'])
actual_all_stars = set(all_stars_df.loc[all_stars_df['Year'] == 2018]['Player'])
len(list(filter(lambda name: name in actual_all_stars, predicted_all_stars)))

6

In [65]:
prediction = []
for i in range(len(all_star_cont_probabilities)):
    yes = all_star_cont_probabilities[i]
    no = not_all_star_cont_probabilities[i]
    if yes >= no:
        prediction.append('Y')
    else:
        prediction.append('N')

In [66]:
test_players['All Star Prediction'] = prediction

Calculate the correctness rate, or the proportion of players we predicted correctly

In [67]:
all_star_correct_categ = correct('All Star', test_players_categorical)
all_star_correct_categ

0.7216

In [68]:
all_star_correct_cont = correct('All Star', test_players)
all_star_correct_cont

0.9792
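
As a sketch of what the correctness rate measures, the snippet below computes the share of rows where the prediction equals the label. `correctness_rate` is a hypothetical helper; the project's `correct` function may be implemented differently. The toy data also shows why a high rate should be read with care when one class is rare.

```python
import pandas as pd

def correctness_rate(df, label_col, pred_col):
    """Proportion of rows where the prediction matches the label (hypothetical helper)."""
    return (df[label_col] == df[pred_col]).mean()

# Toy data: 1 All Star among 10 players.
toy = pd.DataFrame({
    'All Star':            ['Y'] + ['N'] * 9,
    'All Star Prediction': ['N', 'Y'] + ['N'] * 8,
})
toy['Always N'] = 'N'

rate = correctness_rate(toy, 'All Star', 'All Star Prediction')
baseline = correctness_rate(toy, 'All Star', 'Always N')
print(rate, baseline)
```

Here the predictions miss the only All Star yet still score 0.8, and predicting "N" for everyone scores 0.9, so comparing against that baseline gives useful context.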

### Predict a Player's Position<a class="anchor" id="positionnaive"></a>

In [69]:
fig_nopos = figures[1:]
fig_nopos

['TS%', 'RPG', 'APG', 'PPG', 'BPG', 'SPG']

In [70]:
positions = list(set(stats['Pos']))
position_categdfs = [stats_categorical.loc[stats_categorical['Pos'] == p] for p in positions]
position_contdfs = [stats.loc[stats['Pos'] == p] for p in positions]

In [71]:
prediction_pos = []
for i in range(test_players_categorical.shape[0]):
    prediction_pos.append(
        naive_classes(i, fig_nopos, test_players_categorical, positions, position_categdfs))

In [72]:
test_players_categorical['Pos Prediction'] = prediction_pos

In [73]:
prediction_pos_cont = []
for i in range(test_players.shape[0]):
    prediction_pos_cont.append(
        normpdf_multiple_classes(i, fig_nopos, test_players, positions, position_contdfs))
test_players['Pos Prediction'] = prediction_pos_cont

Calculate the correctness rate

In [74]:
pos_correct_categ = correct('Pos', test_players_categorical)
pos_correct_categ

0.5824

In [75]:
pos_correct_cont = correct('Pos', test_players)
pos_correct_cont

0.6432

Since these correctness rates are relatively low, I will calculate a modified correctness rate to see if I predicted a player's overall position (forward, center, or guard).

In [76]:
almost_correct_categ_pos = 0
almost_correct_cont_pos = 0
for i in range(test_players_categorical.shape[0]):
    row_categ = test_players_categorical.iloc[i]
    pred_categ = row_categ['Pos Prediction']
    correct_categ = row_categ['Pos']
    if ('F' in pred_categ and 'F' in correct_categ)\
    or ('G' in pred_categ and 'G' in correct_categ)\
    or ('C' in pred_categ and 'C' in correct_categ):
        almost_correct_categ_pos += 1
    row_cont = test_players.iloc[i]
    pred_cont = row_cont['Pos Prediction']
    correct_cont = row_cont['Pos']
    if ('F' in pred_cont and 'F' in correct_cont)\
    or ('G' in pred_cont and 'G' in correct_cont)\
    or ('C' in pred_cont and 'C' in correct_cont):
        almost_correct_cont_pos += 1

In [77]:
pos_almost_categ = almost_correct_categ_pos/test_players_categorical.shape[0]
pos_almost_categ

0.868

In [78]:
pos_almost_cont = almost_correct_cont_pos/test_players.shape[0]
pos_almost_cont

0.7992

### Predict a Player's Decade<a class="anchor" id="naivedecade"></a>

In [79]:
assign_decades(stats)
assign_decades(stats_categorical)
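
A decade label can be derived from the year with integer division. The sketch below is hypothetical; `assign_decades` is defined in the project module, and the only detail taken from this notebook is the label format (for example `'2010s'`).

```python
import pandas as pd

def decade_label(year):
    """Turn a year such as 1987 into a label such as '1980s' (hypothetical helper)."""
    return f"{(int(year) // 10) * 10}s"

toy = pd.DataFrame({'Year': [1980, 1987, 1999, 2000, 2017]})
toy['Decade'] = [decade_label(y) for y in toy['Year']]
print(list(toy['Decade']))
```

Since the training data spans 1980 to 2017, this produces four classes, and the 2010s class covers fewer seasons than the others.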

In [80]:
decades = list(set(stats['Decade']))
decade_categdfs = [stats_categorical.loc[stats_categorical['Decade'] == d] for d in decades]
decade_contdfs = [stats.loc[stats['Decade'] == d] for d in decades]

In [81]:
prediction_decade = []
for i in range(test_players_categorical.shape[0]):
    prediction_decade.append(
        naive_classes(i,figures, test_players_categorical, decades, decade_categdfs))
test_players_categorical['Decade Prediction'] = prediction_decade

In [82]:
prediction_decade_cont = []
for i in range(test_players.shape[0]):
    prediction_decade_cont.append(
        normpdf_multiple_classes(i,figures, test_players, decades, decade_contdfs))
test_players['Decade Prediction'] = prediction_decade_cont

Calculate the correctness rate

In [83]:
correct_categ_decade = 0
correct_cont_decade = 0
for i in range(test_players_categorical.shape[0]):
    row_categ = test_players_categorical.iloc[i]
    pred_categ = row_categ['Decade Prediction']
    correct_categ = '2010s'
    if correct_categ == pred_categ:
        correct_categ_decade += 1
    row_cont = test_players.iloc[i]
    pred_cont = row_cont['Decade Prediction']
    correct_cont = '2010s'
    if correct_cont == pred_cont:
        correct_cont_decade += 1

In [84]:
decade_correct_categ = correct_categ_decade / test_players_categorical.shape[0]
decade_correct_categ

0.2768

In [85]:
decade_correct_cont = correct_cont_decade / test_players.shape[0]
decade_correct_cont

0.0

## Process the 2020-21 Players Statistics<a class="anchor" id="2021"></a>

In [86]:
test_2021 = pd.read_csv('players_2021.csv')

In [87]:
test_2021.head()

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,1,Precious Achiuwa\achiupr01,PF,21,MIA,32,2,14.0,2.3,4.1,...,0.528,1.2,2.7,3.9,0.6,0.4,0.5,1.0,1.8,5.9
1,2,Jaylen Adams\adamsja01,PG,24,MIL,7,0,2.6,0.1,1.1,...,,0.0,0.4,0.4,0.3,0.0,0.0,0.0,0.1,0.3
2,3,Steven Adams\adamsst01,C,27,NOP,30,30,28.2,3.6,5.8,...,0.456,4.2,4.9,9.1,2.3,0.9,0.6,1.6,1.9,8.2
3,4,Bam Adebayo\adebaba01,C,23,MIA,31,31,33.9,7.3,12.9,...,0.848,2.2,7.5,9.6,5.5,0.9,1.0,3.0,2.5,19.6
4,5,LaMarcus Aldridge\aldrila01,C,35,SAS,20,18,26.5,5.8,12.3,...,0.829,0.8,3.7,4.5,1.8,0.4,0.9,0.9,1.7,14.3


In [88]:
test_2021.columns

Index(['Rk', 'Player', 'Pos', 'Age', 'Tm', 'G', 'GS', 'MP', 'FG', 'FGA', 'FG%',
       '3P', '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%',
       'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS'],
      dtype='object')

Use test_2021_adv, a dataframe that holds advanced statistics, to get the true shooting percentage (TS%), which test_2021 lacks.

In [89]:
test_2021_adv = pd.read_csv('players_adv_2021.csv')
test_2021_adv.columns

Index(['Rk', 'Player', 'Pos', 'Age', 'Tm', 'G', 'MP', 'PER', 'TS%', '3PAr',
       'FTr', 'ORB%', 'DRB%', 'TRB%', 'AST%', 'STL%', 'BLK%', 'TOV%', 'USG%',
       'Unnamed: 19', 'OWS', 'DWS', 'WS', 'WS/48', 'Unnamed: 24', 'OBPM',
       'DBPM', 'BPM', 'VORP'],
      dtype='object')

In [90]:
all(test_2021_adv['Player'] == test_2021['Player'])

True

In [91]:
stats.columns

Index(['Year', 'Player', 'Pos', 'TS%', 'RPG', 'APG', 'PPG', 'BPG', 'SPG',
       'All Star', 'MVP', 'Decade'],
      dtype='object')

In [92]:
test_2021['TS%'] = test_2021_adv['TS%']

In [93]:
test_2021 = test_2021.rename(columns={'AST':'APG', 'STL':'SPG', 'BLK':'BPG', 'TRB':'RPG', 'PTS':'PPG'})

Only keep the columns in the training data (stats)

In [94]:
cols_to_drop = list(filter(lambda col: col not in list(stats.columns), test_2021.columns))
test_2021 = test_2021.drop(columns=cols_to_drop)

In [95]:
test_2021.head()

Unnamed: 0,Player,Pos,RPG,APG,SPG,BPG,PPG,TS%
0,Precious Achiuwa\achiupr01,PF,3.9,0.6,0.4,0.5,5.9,0.578
1,Jaylen Adams\adamsja01,PG,0.4,0.3,0.0,0.0,0.3,0.125
2,Steven Adams\adamsst01,C,9.1,2.3,0.9,0.6,8.2,0.604
3,Bam Adebayo\adebaba01,C,9.6,5.5,0.9,1.0,19.6,0.636
4,LaMarcus Aldridge\aldrila01,C,4.5,1.8,0.4,0.9,14.3,0.549


Detect and handle missing values

In [96]:
detect_missing_values(test_2021)

Player: False
Pos: False
RPG :  False
APG :  False
SPG :  False
BPG :  False
PPG :  False
TS% :  True


In [97]:
test_2021 = test_2021.dropna()

In [98]:
detect_missing_values(test_2021)

Player: False
Pos: False
RPG :  False
APG :  False
SPG :  False
BPG :  False
PPG :  False
TS% :  False


In [99]:
test_2021_descr = test_2021.describe()
test_2021_descr

Unnamed: 0,RPG,APG,SPG,BPG,PPG,TS%
count,506.0,506.0,506.0,506.0,506.0,506.0
mean,3.635968,2.033597,0.61502,0.424308,8.957905,0.539061
std,2.51096,1.982242,0.422219,0.439331,6.855209,0.118127
min,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.825,0.7,0.3,0.1,3.8,0.5
50%,3.2,1.4,0.6,0.3,7.3,0.5555
75%,5.0,2.675,0.9,0.6,12.7,0.605
max,14.1,11.1,1.9,3.4,32.8,1.0


In [100]:
test_2021['Player'] = [name[:name.index('\\')] for name in test_2021['Player']]
test_2021.head()

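| | Player | Pos | RPG | APG | SPG | BPG | PPG | TS% |
|---|---|---|---|---|---|---|---|---|
| 0 | Precious Achiuwa | PF | 3.9 | 0.6 | 0.4 | 0.5 | 5.9 | 0.578 |
| 1 | Jaylen Adams | PG | 0.4 | 0.3 | 0.0 | 0.0 | 0.3 | 0.125 |
| 2 | Steven Adams | C | 9.1 | 2.3 | 0.9 | 0.6 | 8.2 | 0.604 |
| 3 | Bam Adebayo | C | 9.6 | 5.5 | 0.9 | 1.0 | 19.6 | 0.636 |
| 4 | LaMarcus Aldridge | C | 4.5 | 1.8 | 0.4 | 0.9 | 14.3 | 0.549 |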


Reset the positions so that they match the position labels used in the training data (`stats`).

In [101]:
reset_position(test_2021)
list(set(test_2021['Pos']))

['C', 'F', 'G']
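
The body of `reset_position` is not shown here, so the following is only a sketch of one mapping that would produce the three labels above. The rule itself (guards to `G`, forwards to `F`, everything else to `C`, hybrids resolved by the first listed position) and the helper name `simplify_position` are assumptions.

```python
def simplify_position(pos):
    """Map a detailed position label to 'G', 'F', or 'C' (assumed rule)."""
    primary = pos.split('-')[0]  # assumption: a hybrid like 'SG-SF' uses its first position
    if primary in ('PG', 'SG', 'G'):
        return 'G'
    if primary in ('SF', 'PF', 'F'):
        return 'F'
    return 'C'  # fallback: anything else is treated as a center

raw_positions = ['PF', 'PG', 'C', 'C', 'SG-SF']  # 'SG-SF' is a hypothetical hybrid label
simplified = [simplify_position(p) for p in raw_positions]
```

The first four inputs are the `Pos` values of the five players previewed earlier (two of them are `C`), and the results agree with the `F`, `G`, `C` labels those players carry after the reset.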

#### Add the All Star Data<a class="anchor" id="allstar2021"></a>

In [102]:
# Pair each player with the 2021 season so the pairs can be matched against all_star_tups
test2021_tups = [(player, 2021) for player in test_2021['Player']]
test2021_all_stars = create_yn_cols(test2021_tups, all_star_tups)
test_2021['All Star'] = test2021_all_stars
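
The labelling step can be sketched as a membership test on `(player, year)` pairs. `yes_no_flags` below is a hypothetical stand-in for `create_yn_cols`, whose actual implementation is not shown in this section.

```python
def yes_no_flags(player_years, award_pairs):
    """Return 'Y' for each (player, year) pair found in award_pairs, else 'N'."""
    winners = set(award_pairs)  # set membership keeps each lookup cheap
    return ['Y' if pair in winners else 'N' for pair in player_years]

players = [('Bam Adebayo', 2021), ('Steven Adams', 2021), ('LeBron James', 2021)]
# Small illustrative subset of (player, year) All Star pairs
all_star_pairs = [('LeBron James', 2021), ('LeBron James', 2020)]
flags = yes_no_flags(players, all_star_pairs)
```

Matching on the pair rather than on the name alone keeps a player's selection in one season from leaking into another season.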

#### Discretize the data<a class="anchor" id="discretize2021"></a>

In [103]:
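# Numeric columns only: from 'RPG' up to, but not including, the last column ('All Star')
cols = list(test_2021.columns)[list(test_2021.columns).index('RPG'):-1]
# The 25th and 75th percentiles of each column are the cut points
q_vals = [0.25, 0.75]
test_2021_quantiles = {col:[np.quantile(test_2021[col], q) for q in q_vals] for col in cols}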

In [105]:
test_2021_discretized_columns = {col:
[discretize(val, col, test_2021_quantiles) for val in test_2021[col]] for col in cols}
test_2021_categorical = categorical_df(test_2021, test_2021_discretized_columns)

In [106]:
test_2021_categorical.head()

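| | Player | Pos | RPG | APG | SPG | BPG | PPG | TS% | All Star |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Precious Achiuwa | F | 2nd | 1st | 2nd | 2nd | 2nd | 2nd | N |
| 1 | Jaylen Adams | G | 1st | 1st | 1st | 1st | 1st | 1st | N |
| 2 | Steven Adams | C | 3rd | 2nd | 2nd | 2nd | 2nd | 2nd | N |
| 3 | Bam Adebayo | C | 3rd | 3rd | 2nd | 3rd | 3rd | 3rd | N |
| 4 | LaMarcus Aldridge | C | 2nd | 2nd | 2nd | 3rd | 3rd | 2nd | N |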
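
The three bins can be reproduced with a simple threshold rule. This is a sketch rather than the notebook's `discretize` function: the helper name and the handling of values that fall exactly on a cut point are assumptions. The RPG cut points are the 25% and 75% values from the `describe()` output earlier.

```python
def discretize_sketch(value, low, high):
    """Bin a value as '1st' (below low), '3rd' (above high), or '2nd' (in between)."""
    if value < low:
        return '1st'
    if value > high:
        return '3rd'
    return '2nd'  # assumption: values equal to a cut point land in the middle bin

rpg_low, rpg_high = 1.825, 5.0          # 25% and 75% RPG values from describe()
rpg_values = [3.9, 0.4, 9.1, 9.6, 4.5]  # RPG of the five players previewed above
rpg_labels = [discretize_sketch(v, rpg_low, rpg_high) for v in rpg_values]
```

The resulting labels agree with the RPG column of the preview above.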


## Naive Bayes Classification on the 2020-21 Data<a class="anchor" id="naivebayes2021"></a>

### Predict if a Player is an All Star or Not<a class="anchor" id="naiveallstar2021"></a>

In [107]:
all_star_probabilities_2021 = []
not_all_star_probabilities_2021 = []
for i in range(test_2021_categorical.shape[0]):
    all_star_probabilities_2021.append(
        naive_bayes_yn(i, figures, test_2021_categorical, stats_all_stars))
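    # NOTE: the `_cont` suffix suggests this is the continuous (non-discretized) frame,
    # while the call above uses stats_all_stars; check that this pairing is intended.
    not_all_star_probabilities_2021.append(
        naive_bayes_yn(i, figures, test_2021_categorical, stats_not_all_stars_cont))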

all_star_prediction_2021 = []
for i in range(len(all_star_probabilities_2021)):
    yes = all_star_probabilities_2021[i]
    no = not_all_star_probabilities_2021[i]
    if yes >= no:
        all_star_prediction_2021.append('Y')
    else:
        all_star_prediction_2021.append('N')

test_2021_categorical['All Star Prediction'] = all_star_prediction_2021
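
For readers unfamiliar with the categorical comparison, here is a minimal sketch of a two-class naive Bayes score on toy data. The internals of `naive_bayes_yn` are not shown in this section, so the add-one smoothing, the helper name `categorical_score`, and all of the rows below are assumptions made for illustration.

```python
from collections import Counter

def categorical_score(player, class_rows, n_total, features, n_bins=3):
    """Class prior times the smoothed likelihood of each feature value."""
    score = len(class_rows) / n_total  # class prior
    for f in features:
        counts = Counter(row[f] for row in class_rows)
        # add-one smoothing over the bins '1st', '2nd', '3rd'
        score *= (counts[player[f]] + 1) / (len(class_rows) + n_bins)
    return score

features = ['PPG', 'RPG']
# Toy training rows, not real player data
all_stars = [{'PPG': '3rd', 'RPG': '3rd'}, {'PPG': '3rd', 'RPG': '2nd'}]
others = [{'PPG': '1st', 'RPG': '1st'}, {'PPG': '2nd', 'RPG': '2nd'},
          {'PPG': '2nd', 'RPG': '1st'}, {'PPG': '1st', 'RPG': '2nd'}]
n_total = len(all_stars) + len(others)

player = {'PPG': '3rd', 'RPG': '3rd'}
yes = categorical_score(player, all_stars, n_total, features)
no = categorical_score(player, others, n_total, features)
prediction = 'Y' if yes >= no else 'N'
```

Smoothing keeps a bin that never occurs in a class from forcing the whole product to zero. Both classes must be scored against discretized training rows for the comparison to be meaningful.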

In [108]:
all_star_cont_probabilities_2021 = []
not_all_star_cont_probabilities_2021 = []

for i in range(test_2021.shape[0]):
    all_star_cont_probabilities_2021.append(normpdf(i, figures, test_2021, stats_all_stars_cont))
    not_all_star_cont_probabilities_2021.append(normpdf(i, figures, test_2021, stats_not_all_stars_cont))

all_star_prediction_2021_cont = []
for i in range(len(all_star_cont_probabilities_2021)):
    yes = all_star_cont_probabilities_2021[i]
    no = not_all_star_cont_probabilities_2021[i]
    if yes >= no:
        all_star_prediction_2021_cont.append('Y')
    else:
        all_star_prediction_2021_cont.append('N')
        
test_2021['All Star Prediction'] = all_star_prediction_2021_cont
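
The continuous variant replaces bin counts with a normal density per statistic. Below is a one-feature sketch on toy numbers. `gaussian_score` is a hypothetical helper and may not match how the notebook's `normpdf` handles priors or multiple features.

```python
from statistics import NormalDist, mean, pstdev

def gaussian_score(value, class_values, prior):
    """Class prior times the normal density of value under the class mean and std."""
    # pstdev is the population standard deviation, the same default as np.std
    dist = NormalDist(mean(class_values), pstdev(class_values))
    return prior * dist.pdf(value)

def predict(ppg, yes_values, no_values):
    """Return 'Y' if the All Star score is at least the non-All Star score."""
    n = len(yes_values) + len(no_values)
    yes = gaussian_score(ppg, yes_values, len(yes_values) / n)
    no = gaussian_score(ppg, no_values, len(no_values) / n)
    return 'Y' if yes >= no else 'N'

all_star_ppg = [25.0, 27.0, 29.0]        # toy values, not real data
other_ppg = [4.0, 8.0, 12.0, 6.0, 10.0]  # toy values, not real data

high_scorer = predict(26.0, all_star_ppg, other_ppg)
low_scorer = predict(7.0, all_star_ppg, other_ppg)
```

With several statistics, the per-statistic densities would be multiplied together, which is where the "naive" independence assumption enters.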

Calculate the correctness rate of the categorical and continuous All Star predictions.

In [109]:
all_star_correct_categ_2021 = correct('All Star', test_2021_categorical)
all_star_correct_categ_2021

0.04743083003952569

In [110]:
all_star_correct_cont_2021 = correct('All Star', test_2021)
all_star_correct_cont_2021

0.9525691699604744
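
Converting the two rates back into player counts makes them easier to interpret. The sketch below only does arithmetic on the numbers printed above, with the row count of 506 taken from the `describe()` output.

```python
n_players = 506                    # row count from describe()
categ_rate = 0.04743083003952569   # categorical correctness rate above
cont_rate = 0.9525691699604744     # continuous correctness rate above

categ_correct = round(categ_rate * n_players)
cont_correct = round(cont_rate * n_players)
total = categ_correct + cont_correct
```

The counts come out to 24 and 482, which add up to exactly 506. That pattern would arise if, for example, the two models gave opposite labels to every player. Inspecting the value counts of the two `All Star Prediction` columns would confirm or rule this out. Because All Stars are a small minority, a high correctness rate on its own says little about how well the All Stars themselves are identified.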

Of the 22 players with the highest All Star probabilities (22 being the number of All Stars), how many are actual All Stars?

In [111]:
test_2021['p(All Star)'] = all_star_cont_probabilities_2021
predicted_2021_all_stars = list(test_2021.sort_values('p(All Star)', ascending=False).head(22)['Player'])

In [112]:
actual_2021_all_stars = list((all_stars_df.loc[all_stars_df['Year'] == 2021]['Player']))

In [113]:
len(list(filter(lambda name: name in actual_2021_all_stars, predicted_2021_all_stars)))

4
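
The same top-k comparison can be written as a small reusable helper. This is a sketch with hypothetical player labels. Only the final line uses a number from this notebook (4 hits out of 22).

```python
def top_k_hits(probabilities, actual, k):
    """Names among the k highest-probability players that are also in actual."""
    ranked = sorted(probabilities, key=probabilities.get, reverse=True)[:k]
    return [name for name in ranked if name in actual]

probs = {'Player A': 0.9, 'Player B': 0.7, 'Player C': 0.4, 'Player D': 0.1}  # hypothetical
actual = {'Player A', 'Player C', 'Player D'}                                  # hypothetical
hits = top_k_hits(probs, actual, k=2)

hit_rate_2021 = 4 / 22  # 4 of the 22 top-ranked players were actual All Stars
```

A hit rate of about 18% suggests that the continuous probabilities do not rank All Stars near the top very reliably.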

### Predict a Player's Position<a class="anchor" id="pos2021"></a>

In [114]:
pos_pred_categ2021 = []
for i in range(test_2021_categorical.shape[0]):
    pos_pred_categ2021.append(
        naive_classes(i, fig_nopos, test_2021_categorical, positions, position_categdfs))
test_2021_categorical['Pos Prediction'] = pos_pred_categ2021

pos_pred_cont2021 = []
for i in range(test_2021.shape[0]):
    pos_pred_cont2021.append(
        normpdf_multiple_classes(i, fig_nopos, test_2021, positions, position_contdfs))
test_2021['Pos Prediction'] = pos_pred_cont2021

pos_correct_categ2021 = correct('Pos', test_2021_categorical)
pos_correct_cont2021 = correct('Pos', test_2021)

Display the correctness rates of the categorical and continuous position predictions.

In [115]:
pos_correct_categ2021, pos_correct_cont2021

(0.5869565217391305, 0.6482213438735178)

In [116]:
almost_correct_categ_pos2021 = 0
almost_correct_cont_pos2021 = 0
for i in range(test_2021_categorical.shape[0]):
    row_categ = test_2021_categorical.iloc[i]
    pred_categ = row_categ['Pos Prediction']
    correct_categ = row_categ['Pos']
    if ('F' in pred_categ and 'F' in correct_categ)\
    or ('G' in pred_categ and 'G' in correct_categ)\
    or ('C' in pred_categ and 'C' in correct_categ):
        almost_correct_categ_pos2021 += 1
    row_cont = test_2021.iloc[i]
    pred_cont = row_cont['Pos Prediction']
    correct_cont = row_cont['Pos']
    if ('F' in pred_cont and 'F' in correct_cont)\
    or ('G' in pred_cont and 'G' in correct_cont)\
    or ('C' in pred_cont and 'C' in correct_cont):
        almost_correct_cont_pos2021 += 1

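Calculate the almost-correct rate: the fraction of predictions that share at least one position letter (F, G, or C) with the actual position.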

In [117]:
almost_pos_correct_categ2021 = almost_correct_categ_pos2021 / test_2021_categorical.shape[0]
almost_pos_correct_cont2021 = almost_correct_cont_pos2021 / test_2021.shape[0]

In [118]:
almost_pos_correct_categ2021, almost_pos_correct_cont2021

(0.8596837944664032, 0.8142292490118577)
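
The "almost correct" test reduces to a set intersection on position letters. The helper below is a sketch that mirrors the three `or` conditions used above. The hybrid labels in the example are hypothetical.

```python
def shares_position(pred, actual):
    """True if both labels contain at least one common letter among F, G, C."""
    return bool(set(pred) & set(actual) & set('FGC'))

# (predicted, actual) pairs with hypothetical labels
pairs = [('G-F', 'F'), ('F-C', 'G'), ('C', 'C'), ('G', 'F')]
almost_correct = sum(shares_position(p, a) for p, a in pairs)
almost_rate = almost_correct / len(pairs)
```

Intersecting with `set('FGC')` ignores the hyphen in hybrid labels, so only genuine position letters can produce a match.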

### Predict the MVP Winner for the 2020-21 Season<a class="anchor" id="predictmvp"></a>

In [119]:
stats_mvps_cont = stats.loc[stats['MVP'] == 'Y']

In [120]:
import math

def normpdf_mvp(stats_index, df):
    """Score one row of df against the past MVP seasons in stats_mvps_cont."""
    row = df.iloc[stats_index]
    test_stats = [row[f] for f in figures]
    # figures[0] is treated as categorical: smoothed share of MVPs with the same value
    mvp_positions = list(stats_mvps_cont[figures[0]])
    probability = (mvp_positions.count(test_stats[0]) + 1) / len(mvp_positions)
    # Remaining figures: multiply by the normal density under the MVP mean and variance
    for i in range(1, len(test_stats)):
        stat = test_stats[i]
        mvp_stats = list(stats_mvps_cont[figures[i]])
        mean = np.mean(mvp_stats)
        var = float(np.std(mvp_stats)) ** 2
        denom = (2 * math.pi * var) ** 0.5
        num = math.exp(-(float(stat) - float(mean)) ** 2 / (2 * var))
        probability *= num / denom
    return probability
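
A product of several normal densities can become very small, so one common alternative is to sum log-densities instead. The sketch below uses toy `(value, mean, std)` triples rather than real MVP statistics, and the helper names are hypothetical. It checks that the log-space score matches the log of the product and ranks two players the same way.

```python
import math

def gaussian(x, mean, std):
    """Normal density, written the same way as in normpdf_mvp."""
    var = std ** 2
    return math.exp(-(x - mean) ** 2 / (2 * var)) / (2 * math.pi * var) ** 0.5

def log_gaussian(x, mean, std):
    """Natural log of the normal density, computed without exponentiating."""
    var = std ** 2
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

# Toy (value, class mean, class std) triples for two hypothetical players
player_a = [(27.0, 26.0, 3.0), (8.0, 9.0, 3.0)]
player_b = [(15.0, 26.0, 3.0), (4.0, 9.0, 3.0)]

prod_a = math.prod(gaussian(*s) for s in player_a)
prod_b = math.prod(gaussian(*s) for s in player_b)
log_a = sum(log_gaussian(*s) for s in player_a)
log_b = sum(log_gaussian(*s) for s in player_b)
```

Because the logarithm is monotonic, sorting by the log score gives the same order as sorting by the product.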

In [121]:
mvp_cont_probs = []
for i in range(test_2021.shape[0]):
    mvp_cont_probs.append(normpdf_mvp(i, test_2021))

test_2021['MVP Prediction'] = mvp_cont_probs

Who are the 5 players that are most likely to be the 2020-21 MVP?

In [122]:
mvp_contenders = list(test_2021.sort_values('MVP Prediction', ascending=False)['Player'][:5])
mvp_contenders

['LeBron James',
 'Kawhi Leonard',
 'Giannis Antetokounmpo',
 'Luka Dončić',
 'Jaylen Brown']

Apply the same MVP scoring to the 2018-19 test data (`test_players`).

In [123]:
mvp_cont_probs1819 = []
for i in range(test_players.shape[0]):
    mvp_cont_probs1819.append(normpdf_mvp(i, test_players))

test_players['MVP Prediction'] = mvp_cont_probs1819

In [124]:
mvp_contenders = list(test_players.sort_values('MVP Prediction', ascending=False)['Player'][:5])
mvp_contenders

['LeBron James',
 'Avery Bradley',
 'Bradley Beal',
 'Kawhi Leonard',
 'Kyrie Irving']

Score every player-season in the training data (`stats`) and list the distinct players among the 20 highest scores.

In [126]:
mvp_cont_probs_stats = []
for i in range(stats.shape[0]):
    mvp_cont_probs_stats.append(normpdf_mvp(i, stats))

stats['MVP Prediction'] = mvp_cont_probs_stats

In [128]:
mvp_contenders = set(list(stats.sort_values('MVP Prediction', ascending=False)['Player'][:20]))
mvp_contenders

{'Chris Webber',
 'Dwyane Wade',
 'James Harden',
 'Karl Malone',
 'Kobe Bryant',
 'Larry Bird',
 'LeBron James',
 'Paul Pierce'}
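
Converting the top 20 to a `set` removes repeated names but also discards the ranking. If the order matters, `dict.fromkeys` is one way to remove duplicates while keeping first appearances in place. The sketch below uses hypothetical placeholder names.

```python
# Hypothetical ranking in which some players appear for more than one season
ranked = ['Player A', 'Player B', 'Player A', 'Player C', 'Player B']

# dict keys keep insertion order, so the first occurrence of each name is preserved
unique_in_rank_order = list(dict.fromkeys(ranked))
```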