# Predicting NBA All-Stars - Jan. Update

## Motivation

In December, I collected pre-All-Star statistics from the last 25 NBA seasons to develop a prediction model for this year's All-Stars. The notebooks can be found on my GitHub:

https://github.com/ibraeksi/nba-analytics/
* nba_allstar_prediction.ipynb
* nba_stats_scraping.ipynb

There, I used binary classification to achieve 78% accuracy. As an improvement, I suggested using multi-class classification due to the nature of the actual selection process. The classes would be:

    0: Not an All-Star
    1: Eastern Conference All-Star Frontcourt
    2: Eastern Conference All-Star Backcourt
    3: Western Conference All-Star Frontcourt
    4: Western Conference All-Star Backcourt

Now that we have a larger sample size from the current season, it is time to update the predictions using multi-class classification ahead of the All-Star Game in February. The first set of All-Stars will be announced tomorrow, Jan. 27, and the rest of the roster will follow next week on Feb. 3.

## Feature Engineering

I have exported the modified version of the dataset from the previous study:

In [1]:
import pandas as pd
import numpy as np

nba = pd.read_csv('nba_allstar_modified.csv')
nba.head()

Unnamed: 0,PLAYER_NAME,TEAM,AGE,GP,W,L,W_PCT,MIN,FGM,FGA,...,PICK,DEBUT,YEAR,TEAM_GP,TEAM_W,TEAM_L,TEAM_W_PCT,ALLSTAR,CONF,GP_PCT
0,Aaron Gordon,DEN,25.0,19,8,11,0.421,29.1,4.8,11.2,...,4.0,2014,2021,36.0,21.0,15.0,0.583,0.0,W,0.527778
1,Al Horford,BOS,35.0,24,10,14,0.417,28.2,5.8,13.0,...,3.0,2007,2021,36.0,19.0,17.0,0.528,0.0,E,0.666667
2,Alec Burks,NYK,29.0,25,13,12,0.52,24.7,3.7,9.1,...,12.0,2011,2021,37.0,19.0,18.0,0.514,0.0,E,0.675676
3,Andre Drummond,LAL,27.0,25,9,16,0.36,28.9,7.2,15.2,...,9.0,2012,2021,37.0,24.0,13.0,0.649,0.0,W,0.675676
4,Andrew Wiggins,GSW,26.0,37,19,18,0.514,32.2,6.5,14.1,...,1.0,2014,2021,37.0,19.0,18.0,0.514,0.0,W,1.0


As a reminder, the selection process since 2013 is:

    East --> 4 Guards, 6 Frontcourt (F or C) and 2 Wildcards (G, F or C)
    West --> 4 Guards, 6 Frontcourt (F or C) and 2 Wildcards (G, F or C)

Usually, the 12-player roster ends up split evenly between 6 frontcourt and 6 backcourt players. The data currently has the following position types:

In [2]:
nba['POS'].value_counts()

G      1411
F      1105
C       356
G-F     345
C-F     245
F-C     223
F-G     158
Name: POS, dtype: int64

These positions will be divided into 2 groups:

    Frontcourt = F, C, C-F, F-C
    Backcourt  = G, G-F, F-G

In [3]:
frontcourt = ['F', 'C', 'C-F', 'F-C']
backcourt = ['G', 'G-F', 'F-G']

# Map each position label to its group: 'F' (frontcourt) or 'B' (backcourt)
pos_to_type = {pos: 'F' for pos in frontcourt}
pos_to_type.update({pos: 'B' for pos in backcourt})
nba['TYPE'] = nba['POS'].map(pos_to_type)

nba['TYPE'].value_counts()

F    1929
B    1914
Name: TYPE, dtype: int64

As seen, we have a balanced dataset in terms of player positions. So we can use percentile ranks among these groups to find the best-performing players for each category:

In [4]:
pos_type = ['B', 'F']
conf_type = ['E', 'W']

for year in range(1997,2022):
    for conf in conf_type:
        for pos in pos_type:
            nba_year = nba[nba['YEAR'] == year]
            nba_group = nba_year[(nba_year['TYPE'] == pos) & (nba_year['CONF'] == conf)]
            nba.loc[nba['YEAR'] == year, 'PTS_' + conf + pos] = nba_group['PTS'].rank(pct = True)
            nba.loc[nba['YEAR'] == year, 'FP_' + conf + pos] = nba_group['NBA_FANTASY_PTS'].rank(pct = True)
            nba.loc[nba['YEAR'] == year, 'PM_' + conf + pos] = nba_group['PLUS_MINUS'].rank(pct = True)
            nba.loc[nba['YEAR'] == year, 'W_PCT_' + conf + pos] = nba_group['W_PCT'].rank(pct = True)

nba.describe()

Unnamed: 0,AGE,GP,W,L,W_PCT,MIN,FGM,FGA,FG_PCT,FG3M,...,PM_EF,W_PCT_EF,PTS_WB,FP_WB,PM_WB,W_PCT_WB,PTS_WF,FP_WF,PM_WF,W_PCT_WF
count,3843.0,3843.0,3843.0,3843.0,3843.0,3843.0,3843.0,3843.0,3843.0,3843.0,...,988.0,988.0,952.0,952.0,952.0,952.0,941.0,941.0,941.0,941.0
mean,27.430653,45.452771,22.933906,22.518865,0.502566,31.365756,5.252615,11.5121,0.456916,1.015691,...,0.51164,0.51164,0.51208,0.51208,0.51208,0.51208,0.512221,0.512221,0.512221,0.512221
std,4.03912,8.556883,8.528899,8.124093,0.156047,4.506355,1.867728,3.93101,0.05163,0.847181,...,0.288659,0.288626,0.288685,0.288709,0.288648,0.288633,0.288686,0.288707,0.288666,0.288609
min,19.0,13.0,2.0,2.0,0.068,24.0,1.3,2.9,0.312,0.0,...,0.020408,0.020408,0.019231,0.019231,0.019231,0.019231,0.021277,0.021277,0.021277,0.021277
25%,24.0,41.0,17.0,16.0,0.389,27.5,3.9,8.5,0.423,0.1,...,0.261646,0.264456,0.261905,0.261905,0.262845,0.261905,0.256757,0.263158,0.264706,0.263158
50%,27.0,48.0,23.0,22.0,0.51,31.2,5.0,10.9,0.449,1.0,...,0.511905,0.511905,0.5125,0.511111,0.513158,0.512821,0.512195,0.512821,0.512195,0.511905
75%,30.0,52.0,29.0,28.0,0.615,34.9,6.5,14.0,0.484,1.6,...,0.760417,0.760417,0.762218,0.763158,0.762218,0.763158,0.761905,0.763158,0.763158,0.766667
max,41.0,61.0,48.0,48.0,0.944,44.0,11.9,27.5,0.725,5.1,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
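
The nested loops above compute, for every season, a percentile rank within each conference and position group. As a rough illustration only, the same idea can be sketched with `groupby(...).rank(pct=True)` on a toy frame. The values are made up, and the per-group columns are filled with 0 outside the group, which the notebook does in a later step with `fillna(0)`.

```python
import pandas as pd

# Toy data (made-up values) with the same grouping keys as the notebook
toy = pd.DataFrame({
    'YEAR': [2021, 2021, 2021, 2021, 2021, 2021],
    'CONF': ['E', 'E', 'E', 'W', 'W', 'W'],
    'TYPE': ['B', 'B', 'B', 'F', 'F', 'F'],
    'PTS':  [10.0, 20.0, 30.0, 12.0, 25.0, 18.0],
})

# Percentile rank of PTS within each (season, conference, position group)
toy['PTS_RANK'] = toy.groupby(['YEAR', 'CONF', 'TYPE'])['PTS'].rank(pct=True)

# Spread the single rank column into one column per group,
# leaving 0 where a player does not belong to that group
for (conf, pos), idx in toy.groupby(['CONF', 'TYPE']).groups.items():
    toy['PTS_' + conf + pos] = 0.0
    toy.loc[idx, 'PTS_' + conf + pos] = toy.loc[idx, 'PTS_RANK']

print(toy)
```

The highest scorer in each group gets a rank of 1.0, which matches the `max` row in the summary above.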


We can delete the original numerical columns and player bio data:

In [5]:
# Create a "clean" dataframe
nbamod = nba.copy().reset_index(drop=True)
# Original player stats and bio
nbamod.drop(nbamod.columns[1:41], axis=1, inplace=True)
# Original team stats
nbamod.drop(nbamod.columns[nbamod.columns.str.contains('TEAM')], axis=1, inplace=True)

The resulting missing values from percentile ranks can be filled with 0 to indicate that they have no weight for the given player:

In [6]:
nbamod = nbamod.fillna(0)
nbamod.head()

Unnamed: 0,PLAYER_NAME,YEAR,ALLSTAR,CONF,GP_PCT,TYPE,PTS_EB,FP_EB,PM_EB,W_PCT_EB,...,PM_EF,W_PCT_EF,PTS_WB,FP_WB,PM_WB,W_PCT_WB,PTS_WF,FP_WF,PM_WF,W_PCT_WF
0,Aaron Gordon,2021,0.0,W,0.527778,F,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.512195,0.609756,0.182927,0.170732
1,Al Horford,2021,0.0,E,0.666667,F,0.0,0.0,0.0,0.0,...,0.25641,0.25641,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Alec Burks,2021,0.0,E,0.675676,B,0.229167,0.145833,0.78125,0.666667,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Andre Drummond,2021,0.0,W,0.675676,F,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.756098,0.878049,0.04878,0.073171
4,Andrew Wiggins,2021,0.0,W,1.0,F,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.731707,0.536585,0.243902,0.365854


Finally, we can create a new target variable based on the previously mentioned selection criteria:

In [7]:
nbamod['SELECT'] = 0
# Use nbamod consistently; it still contains the ALLSTAR column
nbamod.loc[(nbamod['ALLSTAR'] == 1) & (nbamod['CONF'] == 'E') & (nbamod['TYPE'] == 'F'), 'SELECT'] = 1
nbamod.loc[(nbamod['ALLSTAR'] == 1) & (nbamod['CONF'] == 'E') & (nbamod['TYPE'] == 'B'), 'SELECT'] = 2
nbamod.loc[(nbamod['ALLSTAR'] == 1) & (nbamod['CONF'] == 'W') & (nbamod['TYPE'] == 'F'), 'SELECT'] = 3
nbamod.loc[(nbamod['ALLSTAR'] == 1) & (nbamod['CONF'] == 'W') & (nbamod['TYPE'] == 'B'), 'SELECT'] = 4
nbamod['SELECT'].value_counts()

0    3251
3     169
1     159
2     137
4     127
Name: SELECT, dtype: int64
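
The four assignments above follow one rule: a non-All-Star is class 0, and an All-Star's class is determined by conference and position group. A minimal sketch of the same encoding with a lookup table, on made-up rows; `CLASS_MAP` and `encode_select` are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Class labels from the Motivation section, keyed by (conference, position group)
CLASS_MAP = {('E', 'F'): 1, ('E', 'B'): 2, ('W', 'F'): 3, ('W', 'B'): 4}

def encode_select(df):
    """Return the multi-class target: 0 for non-All-Stars, 1-4 otherwise."""
    labels = pd.Series(
        [CLASS_MAP[(conf, pos)] for conf, pos in zip(df['CONF'], df['TYPE'])],
        index=df.index,
    )
    # Players who were not selected get class 0 regardless of group
    return labels.where(df['ALLSTAR'] == 1, 0)

# Made-up rows for illustration
toy = pd.DataFrame({
    'ALLSTAR': [1.0, 0.0, 1.0, 1.0, 1.0],
    'CONF':    ['E', 'E', 'W', 'W', 'E'],
    'TYPE':    ['F', 'B', 'B', 'F', 'B'],
})
toy['SELECT'] = encode_select(toy)
print(toy)
```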

In [8]:
nbamod = nbamod.rename(columns={"PLAYER_NAME": "PLAYER"})
nbamod.dtypes.value_counts()

float64    18
object      3
int64       2
dtype: int64

## Random Forest Hyperparameter Tuning

As shown in the previous project, random forest performs best on this dataset. Therefore, we can train a random forest with cross-validation and grid search to find the optimal parameters:

In [9]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

cols = nbamod.columns
train_cols = cols.drop(['PLAYER', 'YEAR', 'ALLSTAR', 'CONF', 'TYPE', 'SELECT'])
features = nbamod[train_cols]
target = nbamod['SELECT']

hyperparameters = {
    'criterion': ['entropy', 'gini'],
    'max_depth': [5,10],
    # Note: 'auto' (equivalent to 'sqrt' for classifiers) was removed in scikit-learn 1.3
    'max_features': ['auto', 'log2', 'sqrt'],
    'min_samples_leaf': [1,5],
    'min_samples_split': [3,5],
    'n_estimators': [6,9],
    'class_weight': [None, 'balanced']
}

rf = RandomForestClassifier(random_state=1)
   
grid = GridSearchCV(rf, param_grid=hyperparameters, cv=10)

grid.fit(features, target)

print("Best Score: {}".format(grid.best_score_))
print("Best Parameters: {}".format(grid.best_params_))

best_rf = grid.best_estimator_

predictions = cross_val_predict(best_rf, features, target, cv=10)
cm = confusion_matrix(target, predictions)

print("Best Predictions:\n{}\n".format(cm))

report = classification_report(target, predictions)

print("Classification Report:\n{}\n".format(report))

Best Score: 0.9320813041125542
Best Parameters: {'class_weight': None, 'criterion': 'entropy', 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 5, 'min_samples_split': 3, 'n_estimators': 9}
Best Predictions:
[[3162   28   30   15   16]
 [  49  110    0    0    0]
 [  48    0   89    0    0]
 [  42    0    0  127    0]
 [  33    0    0    0   94]]

Classification Report:
              precision    recall  f1-score   support

           0       0.95      0.97      0.96      3251
           1       0.80      0.69      0.74       159
           2       0.75      0.65      0.70       137
           3       0.89      0.75      0.82       169
           4       0.85      0.74      0.79       127

    accuracy                           0.93      3843
   macro avg       0.85      0.76      0.80      3843
weighted avg       0.93      0.93      0.93      3843




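As seen, the best-performing model reaches an overall accuracy of 93% and a macro-averaged F1 score of 0.80. The macro F1 is the fairer summary here because class 0 dominates the accuracy; it is slightly above the 78% obtained with binary classification, although the two metrics are not strictly comparable. Looking at the individual F1 scores, the model is good at identifying the players who are not performing at All-Star level (class 0, F1 of 0.96). Among the All-Star classes, the worst score is 0.70 for the East backcourt and the best is 0.82 for the West frontcourt.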
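
The accuracy and the per-class F1 scores in the report can be recomputed directly from the confusion matrix printed above. The sketch below uses only NumPy and the matrix values from that output; rows are true classes and columns are predicted classes, which is scikit-learn's convention.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted)
cm = np.array([
    [3162, 28, 30, 15, 16],
    [49, 110, 0, 0, 0],
    [48, 0, 89, 0, 0],
    [42, 0, 0, 127, 0],
    [33, 0, 0, 0, 94],
])

true_pos = np.diag(cm)
precision = true_pos / cm.sum(axis=0)  # correct / all predicted as the class
recall = true_pos / cm.sum(axis=1)     # correct / all true members of the class
f1 = 2 * precision * recall / (precision + recall)

accuracy = true_pos.sum() / cm.sum()
macro_f1 = f1.mean()  # unweighted mean, so each class counts equally

print("accuracy: {:.3f}".format(accuracy))
print("macro F1: {:.3f}".format(macro_f1))
print("per-class F1:", np.round(f1, 2))
```

Class 0 accounts for 3251 of the 3843 rows, so it drives the accuracy, while the macro average weights all five classes equally.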

## Predicting 2022 All-Stars

We can now move on to predicting the 2022 All-Stars based on the updated data from the current season. As of this writing in late January, most teams have played at least 41 games, half of the season. The final rosters will be announced on Feb. 3, 2022.

In [10]:
bio = pd.read_csv('player_bio_historical.csv')
pl_current = pd.read_csv('pre_allstar_player_stats_2022.csv')
tm_current = pd.read_csv('pre_allstar_team_stats_2022.csv')

The dataset has to go through the same data transformation steps as the training dataset:

In [11]:
# Add player positions
pl_current['POS'] = pl_current['PLAYER_ID'].map(bio.set_index('PERSON_ID')['POSITION'])

In [12]:
# Add year column: '2021-22' -> 2022
# regex=True is needed so that the pattern is treated as a regular expression
pl_current['YEAR'] = pl_current['SEASON'].str.replace(r"(-\d*)", "", regex=True).astype('int64') + 1
tm_current['YEAR'] = tm_current['SEASON'].str.replace(r"(-\d*)", "", regex=True).astype('int64') + 1
pl_current.head()

Unnamed: 0,PLAYER_ID,PLAYER_NAME,NICKNAME,TEAM_ID,TEAM_ABBREVIATION,AGE,GP,W,L,W_PCT,...,PTS_RANK,PLUS_MINUS_RANK,NBA_FANTASY_PTS_RANK,DD2_RANK,TD3_RANK,CFID,CFPARAMS,SEASON,POS,YEAR
0,203932,Aaron Gordon,Aaron,1610612743,DEN,26.0,43,23,20,0.535,...,92,45,104,89,28,5,2039321610612743,2021-22,F,2022
1,1630565,Aaron Henry,Aaron,1610612755,PHI,22.0,6,6,0,1.0,...,557,488,575,200,28,5,16305651610612755,2021-22,F,2022
2,1628988,Aaron Holiday,Aaron,1610612764,WAS,25.0,36,17,19,0.472,...,313,377,338,200,28,5,16289881610612764,2021-22,G,2022
3,1630174,Aaron Nesmith,Aaron,1610612738,BOS,22.0,33,18,15,0.545,...,432,280,464,200,28,5,16301741610612738,2021-22,G-F,2022
4,1630598,Aaron Wiggins,Aaron,1610612760,OKC,23.0,29,6,23,0.207,...,249,405,272,200,28,5,16305981610612760,2021-22,G,2022


In [13]:
nbanow = pl_current.copy()

# Map the team stats based on TEAM_ID
is_2022 = nbanow['YEAR'] == 2022
team_stats = tm_current[tm_current['YEAR'] == 2022].set_index('TEAM_ID')
# Add TEAM_GP, TEAM_W, TEAM_L and TEAM_W_PCT from the team table
for stat in ['GP', 'W', 'L', 'W_PCT']:
    nbanow.loc[is_2022, 'TEAM_' + stat] = nbanow.loc[is_2022, 'TEAM_ID'].map(team_stats[stat])

nbanow.head()

Unnamed: 0,PLAYER_ID,PLAYER_NAME,NICKNAME,TEAM_ID,TEAM_ABBREVIATION,AGE,GP,W,L,W_PCT,...,TD3_RANK,CFID,CFPARAMS,SEASON,POS,YEAR,TEAM_GP,TEAM_W,TEAM_L,TEAM_W_PCT
0,203932,Aaron Gordon,Aaron,1610612743,DEN,26.0,43,23,20,0.535,...,28,5,2039321610612743,2021-22,F,2022,46,25,21,0.543
1,1630565,Aaron Henry,Aaron,1610612755,PHI,22.0,6,6,0,1.0,...,28,5,16305651610612755,2021-22,F,2022,47,28,19,0.596
2,1628988,Aaron Holiday,Aaron,1610612764,WAS,25.0,36,17,19,0.472,...,28,5,16289881610612764,2021-22,G,2022,48,23,25,0.479
3,1630174,Aaron Nesmith,Aaron,1610612738,BOS,22.0,33,18,15,0.545,...,28,5,16301741610612738,2021-22,G-F,2022,49,25,24,0.51
4,1630598,Aaron Wiggins,Aaron,1610612760,OKC,23.0,29,6,23,0.207,...,28,5,16305981610612760,2021-22,G,2022,47,14,33,0.298


In [14]:
# We can use the team abbreviations to keep the input short
nbanow = nbanow.rename(columns={"TEAM_ABBREVIATION": "TEAM"})

# All of the eastern conference team abbreviations over the past 25 years
east = ['IND', 'BOS', 'CHI', 'NYK', 'WAS', 'MIA', 'BKN', 'TOR', 'PHI', 'CHA', 'MIL', 'ATL',
        'CLE', 'ORL', 'DET', 'NJN', 'CHH']

nbanow['CONF'] = np.where(nbanow['TEAM'].isin(east), 'E', 'W')

nbanow.head()

Unnamed: 0,PLAYER_ID,PLAYER_NAME,NICKNAME,TEAM_ID,TEAM,AGE,GP,W,L,W_PCT,...,CFID,CFPARAMS,SEASON,POS,YEAR,TEAM_GP,TEAM_W,TEAM_L,TEAM_W_PCT,CONF
0,203932,Aaron Gordon,Aaron,1610612743,DEN,26.0,43,23,20,0.535,...,5,2039321610612743,2021-22,F,2022,46,25,21,0.543,W
1,1630565,Aaron Henry,Aaron,1610612755,PHI,22.0,6,6,0,1.0,...,5,16305651610612755,2021-22,F,2022,47,28,19,0.596,E
2,1628988,Aaron Holiday,Aaron,1610612764,WAS,25.0,36,17,19,0.472,...,5,16289881610612764,2021-22,G,2022,48,23,25,0.479,E
3,1630174,Aaron Nesmith,Aaron,1610612738,BOS,22.0,33,18,15,0.545,...,5,16301741610612738,2021-22,G-F,2022,49,25,24,0.51,E
4,1630598,Aaron Wiggins,Aaron,1610612760,OKC,23.0,29,6,23,0.207,...,5,16305981610612760,2021-22,G,2022,47,14,33,0.298,W


In [15]:
# Find missing values
print(nbanow.shape)
missing_cols = nbanow.columns[nbanow.isna().any()].tolist()
missing_vals = nbanow[missing_cols].isna().sum()
# Iterate over (column, count) pairs instead of indexing the Series by position
for col, n_missing in missing_vals.items():
    print('{}: {} missing'.format(col, n_missing))

(590, 74)
POS: 35 missing


Due to the COVID-19 protocols, this season has seen a record number of players appear in a game. As seen, 35 players are not in the league's database, which results in missing position data. However, these players are not in All-Star contention and can therefore be removed:

In [16]:
nbanow.dropna(axis=0, subset=['POS'], inplace=True)
print(nbanow.shape)
nbanow.columns[nbanow.isna().any()].tolist()

(555, 74)


[]
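
The missing positions come from the lookup in cell 11: `Series.map` returns NaN for any `PLAYER_ID` that has no entry in the bio table. A small sketch of that behavior with made-up IDs and names (the frames below only stand in for the real CSV files):

```python
import pandas as pd

# Made-up lookup table standing in for player_bio_historical.csv
bio = pd.DataFrame({'PERSON_ID': [1, 2, 3], 'POSITION': ['G', 'F', 'C-F']})

# Made-up current-season rows; player 4 has no bio entry
players = pd.DataFrame({'PLAYER_ID': [1, 2, 3, 4],
                        'PLAYER_NAME': ['A', 'B', 'C', 'D']})

# Same pattern as the notebook: IDs without a match map to NaN
players['POS'] = players['PLAYER_ID'].map(bio.set_index('PERSON_ID')['POSITION'])

unmatched = players.loc[players['POS'].isna(), 'PLAYER_NAME'].tolist()
print(unmatched)

# Dropping them keeps only players with a known position
players = players.dropna(subset=['POS'])
print(players.shape)
```

Listing the unmatched names before dropping them is a cheap way to confirm that no All-Star candidate is lost at this step.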

In [17]:
# We can remove the following columns that do not contain any information on player performance
nbanow = nbanow.drop(['PLAYER_ID', 'TEAM_ID', 'NICKNAME', 'CFID', 'CFPARAMS', 'SEASON'], axis=1)

# We can also remove the columns titled rank since those ranks are not relevant to our study
nbanow = nbanow.drop(nbanow.columns[nbanow.columns.str.contains('RANK')], axis=1)

# Calculate games played with respect to total team games
nbanow['GP_PCT'] = nbanow['GP'] / nbanow['TEAM_GP']
nbanow['GP_PCT'].describe()

count    555.000000
mean       0.574831
std        0.305187
min        0.020408
25%        0.304348
50%        0.659574
75%        0.833333
max        1.000000
Name: GP_PCT, dtype: float64

In [18]:
# Filtering the data based on games and minutes played
nbanow.drop(nbanow.loc[nbanow['GP_PCT'] < 0.4].index, inplace=True)
nbanow.drop(nbanow.loc[nbanow['MIN'] < 24].index, inplace=True)
print(nbanow.shape)

(181, 40)


In [19]:
# Create a "clean" dataframe
nbanowmod = nbanow.copy().reset_index(drop=True)

# Identify position types
frontcourt = ['F', 'C', 'C-F', 'F-C']
backcourt = ['G', 'G-F', 'F-G']

# Map each position label to its group: 'F' (frontcourt) or 'B' (backcourt)
pos_to_type = {pos: 'F' for pos in frontcourt}
pos_to_type.update({pos: 'B' for pos in backcourt})
nbanowmod['TYPE'] = nbanowmod['POS'].map(pos_to_type)

nbanowmod['TYPE'].value_counts()

B    105
F     76
Name: TYPE, dtype: int64

In [20]:
pos_type = ['B', 'F']
conf_type = ['E', 'W']

for conf in conf_type:
    for pos in pos_type:
        nbanowmod_2022 = nbanowmod[nbanowmod['YEAR'] == 2022]
        nbanowmod_group = nbanowmod_2022[(nbanowmod_2022['TYPE'] == pos) & (nbanowmod_2022['CONF'] == conf)]
        nbanowmod.loc[nbanowmod['YEAR'] == 2022, 'PTS_' + conf + pos] = nbanowmod_group['PTS'].rank(pct = True)
        nbanowmod.loc[nbanowmod['YEAR'] == 2022, 'FP_' + conf + pos] = nbanowmod_group['NBA_FANTASY_PTS'].rank(pct = True)
        nbanowmod.loc[nbanowmod['YEAR'] == 2022, 'PM_' + conf + pos] = nbanowmod_group['PLUS_MINUS'].rank(pct = True)
        nbanowmod.loc[nbanowmod['YEAR'] == 2022, 'W_PCT_' + conf + pos] = nbanowmod_group['W_PCT'].rank(pct = True)
        
nbanowmod.describe()

Unnamed: 0,AGE,GP,W,L,W_PCT,MIN,FGM,FGA,FG_PCT,FG3M,...,PM_EF,W_PCT_EF,PTS_WB,FP_WB,PM_WB,W_PCT_WB,PTS_WF,FP_WF,PM_WF,W_PCT_WF
count,181.0,181.0,181.0,181.0,181.0,181.0,181.0,181.0,181.0,181.0,...,38.0,38.0,51.0,51.0,51.0,51.0,38.0,38.0,38.0,38.0
mean,26.519337,39.110497,20.027624,19.082873,0.511061,30.413812,5.556354,12.171271,0.461851,1.69558,...,0.513158,0.513158,0.509804,0.509804,0.509804,0.509804,0.513158,0.513158,0.513158,0.513158
std,4.242105,6.053369,6.563731,6.346371,0.145393,3.602619,1.917674,4.152543,0.070634,0.914077,...,0.292321,0.292417,0.291432,0.291459,0.291393,0.291419,0.292417,0.292417,0.292257,0.292417
min,19.0,21.0,4.0,6.0,0.15,24.0,2.3,4.5,0.345,0.0,...,0.026316,0.026316,0.019608,0.019608,0.019608,0.019608,0.026316,0.026316,0.026316,0.026316
25%,23.0,36.0,15.0,15.0,0.419,27.5,4.0,9.3,0.417,1.1,...,0.269737,0.269737,0.264706,0.264706,0.264706,0.27451,0.269737,0.269737,0.276316,0.269737
50%,26.0,40.0,21.0,18.0,0.523,30.3,5.2,11.8,0.446,1.7,...,0.513158,0.513158,0.509804,0.509804,0.519608,0.509804,0.513158,0.513158,0.519737,0.513158
75%,29.0,43.0,24.0,23.0,0.619,33.3,6.8,15.2,0.487,2.4,...,0.756579,0.756579,0.754902,0.754902,0.75,0.754902,0.766447,0.756579,0.763158,0.756579
max,37.0,49.0,37.0,39.0,0.824,38.2,10.9,21.4,0.779,4.8,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [21]:
# Rename player name column
nbanowmod = nbanowmod.rename(columns={"PLAYER_NAME": "PLAYER"})
# Fill empty rank values with 0
nbanowmod = nbanowmod.fillna(0)
# Remove original player stats
nbanowmod.drop(nbanowmod.columns[1:32], axis=1, inplace=True)
# Remove original team stats
nbanowmod.drop(nbanowmod.columns[nbanowmod.columns.str.contains('TEAM')], axis=1, inplace=True)

print(nbanowmod.shape)
print(nbanowmod.head())

(181, 22)
           PLAYER  POS  YEAR CONF    GP_PCT TYPE    PTS_EB     FP_EB  \
0    Aaron Gordon    F  2022    W  0.934783    F  0.000000  0.000000   
1      Al Horford  C-F  2022    E  0.816327    F  0.000000  0.000000   
2      Alec Burks    G  2022    E  0.979167    B  0.222222  0.388889   
3     Alex Caruso    G  2022    E  0.608696    B  0.083333  0.462963   
4  Andrew Wiggins    F  2022    W  0.916667    F  0.000000  0.000000   

      PM_EB  W_PCT_EB  ...     PM_EF  W_PCT_EF  PTS_WB  FP_WB  PM_WB  \
0  0.000000  0.000000  ...  0.000000  0.000000     0.0    0.0    0.0   
1  0.000000  0.000000  ...  0.513158  0.473684     0.0    0.0    0.0   
2  0.509259  0.361111  ...  0.000000  0.000000     0.0    0.0    0.0   
3  0.916667  0.962963  ...  0.000000  0.000000     0.0    0.0    0.0   
4  0.000000  0.000000  ...  0.000000  0.000000     0.0    0.0    0.0   

   W_PCT_WB    PTS_WF     FP_WF     PM_WF  W_PCT_WF  
0       0.0  0.578947  0.552632  0.842105  0.552632  
1       0.0  0.0

Now we have this season's data with the same features as the training data. We can use the best-performing random forest classifier from the previous step to make predictions. Here are its parameters as a reminder:

In [22]:
best_rf

RandomForestClassifier(criterion='entropy', max_depth=10, min_samples_leaf=5,
                       min_samples_split=3, n_estimators=9, random_state=1)
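
scikit-learn models treat features positionally (and recent versions also validate DataFrame column names), so the prediction frame needs the same feature columns, in the same order, as the training frame. A minimal check could look like this; `check_feature_alignment` is a hypothetical helper, and the shortened column lists are stand-ins for the real training and prediction features.

```python
import pandas as pd

def check_feature_alignment(train_cols, new_cols):
    """Raise ValueError if the two column lists differ in content or order."""
    train_cols, new_cols = list(train_cols), list(new_cols)
    missing = [c for c in train_cols if c not in new_cols]
    extra = [c for c in new_cols if c not in train_cols]
    if missing or extra:
        raise ValueError("missing: {}, unexpected: {}".format(missing, extra))
    if train_cols != new_cols:
        raise ValueError("same columns, different order")
    return True

# Shortened, illustrative feature lists
train = pd.DataFrame(columns=['GP_PCT', 'PTS_EB', 'FP_EB'])
new = pd.DataFrame(columns=['GP_PCT', 'PTS_EB', 'FP_EB'])
print(check_feature_alignment(train.columns, new.columns))

# An out-of-order frame can be reindexed to the training layout
shuffled = pd.DataFrame(columns=['FP_EB', 'GP_PCT', 'PTS_EB'])
shuffled = shuffled[list(train.columns)]
print(check_feature_alignment(train.columns, shuffled.columns))
```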

In [23]:
cols = nbanowmod.columns
test_cols = cols.drop(['PLAYER', 'YEAR', 'POS', 'CONF', 'TYPE'])

new_predictions = best_rf.predict(nbanowmod[test_cols])
nbanowmod['PREDICT'] = new_predictions

nbanowmod['PREDICT'].value_counts()

0    156
2      8
4      7
1      5
3      5
Name: PREDICT, dtype: int64

As seen, the model predicts 13 East (classes 1 and 2) and 12 West (classes 3 and 4) All-Stars. Each roster consists of 12 players, but injuries usually add about 1 replacement player per conference, so these totals are plausible. We can now check the predicted All-Stars:

In [25]:
print('Eastern Conference')
print('-'*18)
print('Frontcourt  :  {}'.format(nbanowmod[(nbanowmod['PREDICT'] == 1)]['PLAYER'].to_list()))
print('Backcourt :  {}\n'.format(nbanowmod[(nbanowmod['PREDICT'] == 2)]['PLAYER'].to_list()))
print('Western Conference')
print('-'*18)
print('Frontcourt  :  {}'.format(nbanowmod[(nbanowmod['PREDICT'] == 3)]['PLAYER'].to_list()))
print('Backcourt :  {}'.format(nbanowmod[(nbanowmod['PREDICT'] == 4)]['PLAYER'].to_list()))

Eastern Conference
------------------
Frontcourt  :  ['Giannis Antetokounmpo', 'Jimmy Butler', 'Joel Embiid', 'Kevin Durant', 'Nikola Vucevic']
Backcourt :  ['Darius Garland', 'DeMar DeRozan', 'Fred VanVleet', 'James Harden', 'Jayson Tatum', 'Jrue Holiday', 'Trae Young', 'Zach LaVine']

Western Conference
------------------
Frontcourt  :  ['Karl-Anthony Towns', 'LeBron James', 'Nikola Jokic', 'Paul George', 'Rudy Gobert']
Backcourt :  ['Chris Paul', 'Dejounte Murray', 'Devin Booker', 'Donovan Mitchell', 'Ja Morant', 'Luka Doncic', 'Stephen Curry']
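
The class predictions are not constrained to 12 players per conference. One possible extension, which is not part of this notebook, is to rank players by a predicted All-Star probability and keep the top N per conference. The sketch below assumes such a probability is already available in a hypothetical `P_ALLSTAR` column (for example, 1 minus the class 0 probability from `best_rf.predict_proba`); all values are made up, and the roster size is shortened to 2 to keep the example small.

```python
import pandas as pd

def top_n_per_conference(df, n, prob_col='P_ALLSTAR'):
    """Keep the n most likely All-Stars in each conference."""
    ranked = df.sort_values(prob_col, ascending=False)
    picked = ranked.groupby('CONF').head(n)
    # Order the result by conference, then by descending probability
    return picked.sort_values(['CONF', prob_col], ascending=[True, False])

# Made-up players and probabilities; P_ALLSTAR is an assumed column
toy = pd.DataFrame({
    'PLAYER':    ['A', 'B', 'C', 'D', 'E', 'F'],
    'CONF':      ['E', 'E', 'E', 'W', 'W', 'W'],
    'P_ALLSTAR': [0.9, 0.4, 0.7, 0.2, 0.8, 0.6],
})

roster = top_n_per_conference(toy, n=2)
print(roster[['CONF', 'PLAYER', 'P_ALLSTAR']])
```

This simple version ignores the guard and frontcourt quotas described earlier, so it should be read as a starting point rather than a full roster rule.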


## Conclusions

Both rosters seem to represent the current state of the league really well. Only 2 players on this list are likely to miss out. The first is Paul George, who at this point has missed about half of his team's games. Because I set the lower limit for games played at 40%, he is still eligible to be picked by my model, but since he will not be back for the All-Star Game, the coaches will probably not choose him. The second is Nikola Vucevic, who benefits from playing for the 2nd best team in the conference with worse frontcourt players.

We can compare the model's predictions with Sports Illustrated's predictions. Over the past 2 days, they have released their predictions for both conferences in the following articles:

https://www.si.com/nba/2022/01/26/nba-all-star-game-2022-roster-predictions-kevin-durant-giannis-antetokounmpo-joel-embiid

https://www.si.com/nba/2022/01/25/nba-all-star-game-2022-roster-predictions-lebron-james-stephen-curry-nikola-jokic

As expected, the only differences are:

    Draymond Green instead of Paul George
    Jarrett Allen instead of Nikola Vucevic
    
In addition, they list Darius Garland as an injury replacement. I think Garland will make the team and Jarrett Allen will be the replacement.

This study has shown that the all-stars can be predicted with relatively high accuracy based on a limited number of traditional player and team statistics. 