# Predicting NBA All-Stars

## Motivation

I collected pre-All-Star statistics from the NBA, starting with the 1996-97 season, to develop a prediction model for this year's All-Stars. The notebooks can be found on my GitHub:

https://github.com/ibraeksi/nba-analytics/
* nba_allstar_stats_scraping.ipynb
* historical_allstar_data_preprocessing.ipynb
* allstar_prediction_binary_classification.ipynb

There, I used binary classification to achieve 76% accuracy. As an improvement, I suggested using multi-class classification, because the actual selection process picks players by conference and by position group. The classes would be:

    0: Not an All-star
    1: Eastern Conference All-Star Frontcourt
    2: Eastern Conference All-Star Backcourt
    3: Western Conference All-Star Frontcourt
    4: Western Conference All-Star Backcourt

In [1]:
# Importing libraries
import os, sys
import pandas as pd
import numpy as np
import warnings
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import classification_report

# Configure libraries
pd.set_option('display.max_columns', 500)
pd.options.display.float_format = '{:.4f}'.format
warnings.filterwarnings('ignore')

# Add base to path
sys.path.append('..')
sys.path.append('../../')

# Initialize dirs
data_dir = 'data'
input_data_dir = os.path.join(data_dir, 'input')
output_data_dir = os.path.join(data_dir, 'output')

In [2]:
nba = pd.read_csv(os.path.join(output_data_dir, 'nba_allstar_processed_data.csv'))

## Feature Engineering

As a reminder, the selection process since 2013:

    East --> 4 Guards, 6 Frontcourt (F or C) and 2 Wildcards (G, F or C)
    West --> 4 Guards, 6 Frontcourt (F or C) and 2 Wildcards (G, F or C)

Usually, the 12-player roster is divided between 6 frontcourt and 6 backcourt players. The data currently has the following position types:

In [3]:
nba['POS'].value_counts()

POS
G      1653
F      1237
C       396
G-F     391
C-F     267
F-C     262
F-G     191
Name: count, dtype: int64

These positions will be divided into 2 groups:

    Frontcourt = F, C, C-F, F-C
    Backcourt  = G, G-F, F-G

In [4]:
frontcourt = ['F', 'C', 'C-F', 'F-C']
backcourt = ['G', 'G-F', 'F-G']

i = -1
for pos in nba['POS']:
    i += 1
    if pos in frontcourt:
        nba.loc[i, 'TYPE'] = 'F'
    elif pos in backcourt:
        nba.loc[i, 'TYPE'] = 'B'

nba['TYPE'].value_counts()

TYPE
B    2235
F    2162
Name: count, dtype: int64
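
As a side note, the same grouping can be written without an explicit loop. The following is only a minimal sketch on a hypothetical mini-sample (the `sample` dataframe and the `pos_to_type` dictionary are not part of the original notebook); it relies on `Series.map`, which leaves any unknown position label as NaN:

```python
import pandas as pd

# Hypothetical mini-sample; the notebook itself works on the full `nba` dataframe.
sample = pd.DataFrame({'POS': ['G', 'F', 'C', 'G-F', 'C-F', 'F-C', 'F-G']})

frontcourt = ['F', 'C', 'C-F', 'F-C']
backcourt = ['G', 'G-F', 'F-G']

# One lookup table from position label to position type.
pos_to_type = {**{p: 'F' for p in frontcourt}, **{p: 'B' for p in backcourt}}

# Series.map applies the lookup row by row; unmapped labels become NaN.
sample['TYPE'] = sample['POS'].map(pos_to_type)
print(sample['TYPE'].value_counts())
```

A vectorized lookup also avoids depending on the dataframe having a default 0..n-1 index, which the counter-based loop implicitly assumes.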

As seen, we have a balanced dataset in terms of player positions. So we can use percentile ranks among these groups to find the best-performing players for each category:

In [5]:
pos_type = ['B', 'F']
conf_type = ['E', 'W']

for year in range(1997,2025):
    for conf in conf_type:
        for pos in pos_type:
            nba_year = nba[nba['YEAR'] == year]
            nba_group = nba_year[(nba_year['TYPE'] == pos) & (nba_year['CONF'] == conf)]
            nba.loc[nba['YEAR'] == year, 'PTS_' + conf + pos] = nba_group['PTS'].rank(pct = True)
            nba.loc[nba['YEAR'] == year, 'FP_' + conf + pos] = nba_group['NBA_FANTASY_PTS'].rank(pct = True)
            nba.loc[nba['YEAR'] == year, 'PM_' + conf + pos] = nba_group['PLUS_MINUS'].rank(pct = True)
            nba.loc[nba['YEAR'] == year, 'W_PCT_' + conf + pos] = nba_group['W_PCT'].rank(pct = True)

nba.describe()

Unnamed: 0,AGE,GP,W,L,W_PCT,MIN,FGM,FGA,FG_PCT,FG3M,FG3A,FG3_PCT,FTM,FTA,FT_PCT,OREB,DREB,REB,AST,TOV,STL,BLK,BLKA,PF,PFD,PTS,PLUS_MINUS,NBA_FANTASY_PTS,DD2,TD3,WNBA_FANTASY_PTS,WT,DRAFT,ROUND,PICK,DEBUT,YEAR,TEAM_GP,TEAM_W,TEAM_L,TEAM_W_PCT,ALLSTAR,GP_PCT,PTS_EB,FP_EB,PM_EB,W_PCT_EB,PTS_EF,FP_EF,PM_EF,W_PCT_EF,PTS_WB,FP_WB,PM_WB,W_PCT_WB,PTS_WF,FP_WF,PM_WF,W_PCT_WF
count,4397.0,4397.0,4397.0,4397.0,4397.0,4397.0,4397.0,4397.0,4397.0,4397.0,4397.0,4397.0,4397.0,4397.0,4397.0,4397.0,4397.0,4397.0,4397.0,4397.0,4397.0,4397.0,4397.0,4397.0,4397.0,4397.0,4397.0,4397.0,4397.0,4397.0,4397.0,4397.0,4397.0,4397.0,4397.0,4397.0,4397.0,4397.0,4397.0,4397.0,4397.0,4397.0,4397.0,1133.0,1133.0,1133.0,1133.0,1104.0,1104.0,1104.0,1104.0,1102.0,1102.0,1102.0,1102.0,1058.0,1058.0,1058.0,1058.0
mean,27.3759,45.7375,23.0953,22.6423,0.5031,31.2466,5.3148,11.596,0.4592,1.101,3.0378,0.2977,2.6627,3.4384,0.7699,1.3401,4.1462,5.4861,3.2215,1.9087,1.0058,0.6137,0.664,2.4263,2.1634,14.3903,0.5393,28.7488,7.0064,0.2329,27.4323,221.8531,1875.287,1.0969,15.161,2004.9993,2011.7159,51.7209,26.0957,25.6252,0.5046,0.1494,0.885,0.5115,0.5115,0.5115,0.5115,0.5118,0.5118,0.5118,0.5118,0.5118,0.5118,0.5118,0.5118,0.5123,0.5123,0.5123,0.5123
std,4.0729,8.5086,8.5031,8.1225,0.1552,4.4233,1.9066,3.9888,0.0539,0.8869,2.2949,0.1415,1.605,1.9924,0.0935,0.9691,1.844,2.6285,2.1041,0.7741,0.4194,0.589,0.3249,0.6031,1.7851,5.393,3.671,8.579,8.9132,1.2158,8.1449,26.9005,492.8981,0.4852,14.0778,8.4792,7.6474,5.9959,8.435,8.459,0.1519,0.3565,0.135,0.2887,0.2887,0.2886,0.2886,0.2887,0.2887,0.2886,0.2886,0.2887,0.2887,0.2886,0.2886,0.2887,0.2887,0.2886,0.2886
min,19.0,13.0,2.0,2.0,0.068,24.0,1.2,2.9,0.312,0.0,0.0,0.0,0.2,0.2,0.289,0.0,1.0,1.3,0.2,0.4,0.1,0.0,0.0,0.6,0.0,3.1,-11.4,9.9,0.0,0.0,9.9,133.0,0.0,0.0,0.0,1982.0,1997.0,30.0,4.0,4.0,0.077,0.0,0.4,0.0167,0.0167,0.0167,0.0167,0.0204,0.0204,0.0204,0.0204,0.0189,0.0189,0.0189,0.0189,0.0213,0.0213,0.0213,0.0213
25%,24.0,41.0,17.0,17.0,0.393,27.5,3.9,8.5,0.424,0.2,0.8,0.276,1.5,2.0,0.724,0.6,2.8,3.4,1.6,1.3,0.7,0.2,0.4,2.0,0.5,10.4,-1.9,22.2,1.0,0.0,21.3,200.0,1997.0,1.0,4.0,1998.0,2005.0,50.0,20.0,19.0,0.4,0.0,0.8261,0.2632,0.2619,0.2619,0.2619,0.2624,0.2625,0.2619,0.2632,0.2619,0.2619,0.2622,0.2619,0.258,0.2632,0.2641,0.264
50%,27.0,48.0,23.0,22.0,0.51,31.1,5.0,11.0,0.451,1.1,3.1,0.344,2.3,3.0,0.785,1.0,3.7,4.8,2.6,1.8,0.9,0.4,0.6,2.4,2.0,13.3,0.4,27.3,3.0,0.0,25.9,220.0,2004.0,1.0,11.0,2005.0,2012.0,53.0,26.0,25.0,0.51,0.0,0.9375,0.5116,0.5122,0.5119,0.5122,0.5119,0.5119,0.5119,0.5119,0.5111,0.5109,0.513,0.5128,0.5114,0.5128,0.5128,0.5124
75%,30.0,52.0,29.0,28.0,0.615,34.6,6.5,14.2,0.486,1.7,4.6,0.383,3.5,4.5,0.836,1.9,5.2,7.0,4.4,2.4,1.2,0.8,0.9,2.8,3.2,17.7,3.0,34.1,10.0,0.0,32.4,240.0,2011.0,1.0,23.0,2012.0,2018.0,55.0,32.0,32.0,0.615,0.0,1.0,0.7619,0.7632,0.7632,0.7619,0.7642,0.7619,0.7609,0.7605,0.7628,0.7628,0.7619,0.7632,0.7619,0.7632,0.7643,0.7632
max,41.0,61.0,48.0,48.0,0.944,44.0,11.9,27.5,0.762,5.1,13.6,1.0,10.5,13.0,1.0,6.6,11.7,16.5,12.6,5.8,3.0,4.3,2.1,4.5,9.6,36.6,15.3,62.4,51.0,27.0,61.9,325.0,2023.0,7.0,160.0,2023.0,2024.0,61.0,48.0,48.0,0.923,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
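
The percentile ranks above are computed separately for every year, conference and position type. A possible alternative formulation, shown here only as a sketch on hypothetical toy data (the `toy` dataframe and the helper column `PTS_RANK_PCT` are assumptions for illustration), uses `groupby(...).rank(pct=True)` and then spreads the result into one column per group:

```python
import pandas as pd

# Hypothetical toy data using the same column names as the notebook.
toy = pd.DataFrame({
    'YEAR': [2024] * 6,
    'CONF': ['E', 'E', 'E', 'W', 'W', 'W'],
    'TYPE': ['B', 'B', 'F', 'B', 'B', 'B'],
    'PTS':  [10.0, 20.0, 15.0, 5.0, 25.0, 15.0],
})

# Percentile rank of PTS within each (YEAR, CONF, TYPE) group.
toy['PTS_RANK_PCT'] = toy.groupby(['YEAR', 'CONF', 'TYPE'])['PTS'].rank(pct=True)

# Spread the single rank column into one column per group (e.g. PTS_EB),
# leaving NaN for players who do not belong to that group.
for (conf, pos), idx in toy.groupby(['CONF', 'TYPE']).groups.items():
    toy.loc[idx, 'PTS_' + conf + pos] = toy.loc[idx, 'PTS_RANK_PCT']

print(toy)
```

The NaN entries produced this way play the same role as the missing values that are later filled with 0.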


We can delete the original numerical columns and player bio data:

In [6]:
# Create a "clean" dataframe
nbamod = nba.copy().reset_index(drop=True)
# Original player stats and bio
nbamod.drop(nbamod.columns[1:42], axis=1, inplace=True)
# Original team stats
nbamod.drop(nbamod.columns[nbamod.columns.str.contains('TEAM')], axis=1, inplace=True)

The resulting missing values from percentile ranks can be filled with 0 to indicate that they have no weight for the given player:

In [7]:
nbamod = nbamod.fillna(0)
nbamod.head()

Unnamed: 0,PLAYER_NAME,YEAR,ALLSTAR,CONF,GP_PCT,TYPE,PTS_EB,FP_EB,PM_EB,W_PCT_EB,PTS_EF,FP_EF,PM_EF,W_PCT_EF,PTS_WB,FP_WB,PM_WB,W_PCT_WB,PTS_WF,FP_WF,PM_WF,W_PCT_WF
0,Allan Houston,1997,0.0,E,1.0,B,0.5263,0.1184,0.5263,0.7632,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Allen Iverson,1997,0.0,E,0.8913,B,0.9474,0.9737,0.0526,0.1053,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Alonzo Mourning,1997,1.0,E,0.9583,F,0.0,0.0,0.0,0.0,0.8913,0.913,0.913,0.913,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Andrew Lang,1997,0.0,E,0.8511,F,0.0,0.0,0.0,0.0,0.0217,0.0435,0.4565,0.4348,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Anfernee Hardaway,1997,1.0,E,0.4773,B,0.8289,0.8421,0.8684,0.9474,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Finally, we can create a new target variable based on the previously mentioned selection criteria:

In [8]:
nbamod['SELECT'] = 0
# Use the ALLSTAR column of nbamod itself so every mask comes from the same dataframe
nbamod.loc[(nbamod['ALLSTAR'] == 1) & (nbamod['CONF'] == 'E') & (nbamod['TYPE'] == 'F'), 'SELECT'] = 1
nbamod.loc[(nbamod['ALLSTAR'] == 1) & (nbamod['CONF'] == 'E') & (nbamod['TYPE'] == 'B'), 'SELECT'] = 2
nbamod.loc[(nbamod['ALLSTAR'] == 1) & (nbamod['CONF'] == 'W') & (nbamod['TYPE'] == 'F'), 'SELECT'] = 3
nbamod.loc[(nbamod['ALLSTAR'] == 1) & (nbamod['CONF'] == 'W') & (nbamod['TYPE'] == 'B'), 'SELECT'] = 4
nbamod['SELECT'].value_counts()

SELECT
0    3740
3     183
1     172
2     160
4     142
Name: count, dtype: int64
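
The class encoding can also be expressed as a single lookup on the conference and position type. This is a minimal sketch on hypothetical toy rows (the `toy` dataframe and `class_codes` dictionary are illustrative names, not taken from the notebook):

```python
import pandas as pd

# Hypothetical toy rows with the same column names as nbamod.
toy = pd.DataFrame({
    'ALLSTAR': [1, 1, 0, 1, 1, 0],
    'CONF':    ['E', 'E', 'E', 'W', 'W', 'W'],
    'TYPE':    ['F', 'B', 'F', 'F', 'B', 'B'],
})

# Class codes as listed in the motivation section.
class_codes = {'EF': 1, 'EB': 2, 'WF': 3, 'WB': 4}

# Concatenate conference and type into a key such as 'EF', look up the code,
# and keep it only for All-Stars; everyone else gets class 0.
group = toy['CONF'] + toy['TYPE']
toy['SELECT'] = group.map(class_codes).where(toy['ALLSTAR'] == 1, 0).astype(int)
print(toy)
```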

In [9]:
nbamod = nbamod.rename(columns={"PLAYER_NAME": "PLAYER"})
nbamod.dtypes.value_counts()

float64    18
object      3
int64       2
Name: count, dtype: int64

## Random Forest Hyperparameter Tuning

As shown in the previous project, random forest performs the best on this dataset. Therefore, we can train a random forest classifier and use grid search with 10-fold cross-validation to find the optimal hyperparameters:

In [10]:
cols = nbamod.columns
train_cols = cols.drop(['PLAYER', 'YEAR', 'ALLSTAR', 'CONF', 'TYPE', 'SELECT'])
features = nbamod[train_cols]
target = nbamod['SELECT']

hyperparameters = {
    'criterion': ['entropy', 'gini'],
    'max_depth': [5,10],
    # 'auto' was an alias for 'sqrt' in classifiers and was removed in scikit-learn 1.3
    'max_features': ['log2', 'sqrt'],
    'min_samples_leaf': [1,5],
    'min_samples_split': [3,5],
    'n_estimators': [6,9],
    'class_weight': [None, 'balanced']
}

rf = RandomForestClassifier(random_state=1)
   
grid = GridSearchCV(rf, param_grid=hyperparameters, cv=10)

grid.fit(features, target)

print("Best Score: {}".format(grid.best_score_))
print("Best Parameters: {}".format(grid.best_params_))

best_rf = grid.best_estimator_

predictions = cross_val_predict(best_rf, features, target, cv=10)
cm = confusion_matrix(target, predictions)

print("Best Predictions:\n{}\n".format(cm))

report = classification_report(target, predictions)

print("Classification Report:\n{}\n".format(report))

Best Score: 0.9297287223027542
Best Parameters: {'class_weight': None, 'criterion': 'entropy', 'max_depth': 10, 'max_features': 'log2', 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 9}
Best Predictions:
[[3623   30   36   25   26]
 [  60  112    0    0    0]
 [  51    0  109    0    0]
 [  45    0    0  138    0]
 [  36    0    0    0  106]]

Classification Report:
              precision    recall  f1-score   support

           0       0.95      0.97      0.96      3740
           1       0.79      0.65      0.71       172
           2       0.75      0.68      0.71       160
           3       0.85      0.75      0.80       183
           4       0.80      0.75      0.77       142

    accuracy                           0.93      4397
   macro avg       0.83      0.76      0.79      4397
weighted avg       0.93      0.93      0.93      4397




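As seen, the best-performing model reaches an overall accuracy of 0.93 and a macro-averaged F1 score of 0.79, compared with the 76% reported for binary classification (keeping in mind that accuracy and macro F1 are different metrics). The accuracy is dominated by the large class 0, so the macro F1 score is the more informative number here. If we look at the individual F1 scores, we can see that the model is best at identifying the players who are not performing at All-Star level (class 0, F1 of 0.96). Among the All-Star classes, the lowest score is 0.71, shared by East frontcourt and East backcourt, while the highest is 0.80 with West frontcourt.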
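
To make the difference between the reported numbers explicit, the summary metrics can be recomputed by hand from the confusion matrix printed above. This is a small sketch using only NumPy; the matrix values are copied from the output, everything else is standard precision/recall/F1 arithmetic:

```python
import numpy as np

# Confusion matrix copied from the output above (rows = true class, columns = predicted class).
cm = np.array([
    [3623,  30,  36,  25,  26],
    [  60, 112,   0,   0,   0],
    [  51,   0, 109,   0,   0],
    [  45,   0,   0, 138,   0],
    [  36,   0,   0,   0, 106],
])

tp = np.diag(cm)                   # correctly classified samples per class
precision = tp / cm.sum(axis=0)    # column sums = number of predictions per class
recall = tp / cm.sum(axis=1)       # row sums = number of true samples per class
f1 = 2 * precision * recall / (precision + recall)

accuracy = tp.sum() / cm.sum()     # share of all samples classified correctly
macro_f1 = f1.mean()               # unweighted mean of the per-class F1 scores

print('Accuracy : {:.4f}'.format(accuracy))
print('Macro F1 : {:.4f}'.format(macro_f1))
print('F1 per class:', np.round(f1, 2))
```

Because 3740 of the 4397 samples belong to class 0, the accuracy stays high even when a fair share of All-Stars is missed, whereas the macro average gives each of the five classes the same weight.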

## Predicting 2025 All-Stars

We can now move on to predicting the 2025 all-stars based on the updated data from the current season:

In [11]:
bio = pd.read_csv(os.path.join(input_data_dir, 'player_bio_historical.csv'))
pl_current = pd.read_csv(os.path.join(input_data_dir, 'pre_allstar_player_stats_2025.csv'))
tm_current = pd.read_csv(os.path.join(input_data_dir, 'pre_allstar_team_stats_2025.csv'))

The dataset has to go through the same data transformation steps as the training dataset:

In [12]:
# Add player positions
pl_current['POS'] = pl_current['PLAYER_ID'].map(bio.set_index('PERSON_ID')['POSITION'])

In [13]:
# Add year column
pl_current['YEAR'] = pl_current['SEASON'].str.split('-').str[0].astype('int64') + 1
tm_current['YEAR'] = tm_current['SEASON'].str.split('-').str[0].astype('int64') + 1

pl_current.head()

Unnamed: 0,PLAYER_ID,PLAYER_NAME,NICKNAME,TEAM_ID,TEAM_ABBREVIATION,AGE,GP,W,L,W_PCT,MIN,FGM,FGA,FG_PCT,FG3M,FG3A,FG3_PCT,FTM,FTA,FT_PCT,OREB,DREB,REB,AST,TOV,STL,BLK,BLKA,PF,PFD,PTS,PLUS_MINUS,NBA_FANTASY_PTS,DD2,TD3,WNBA_FANTASY_PTS,GP_RANK,W_RANK,L_RANK,W_PCT_RANK,MIN_RANK,FGM_RANK,FGA_RANK,FG_PCT_RANK,FG3M_RANK,FG3A_RANK,FG3_PCT_RANK,FTM_RANK,FTA_RANK,FT_PCT_RANK,OREB_RANK,DREB_RANK,REB_RANK,AST_RANK,TOV_RANK,STL_RANK,BLK_RANK,BLKA_RANK,PF_RANK,PFD_RANK,PTS_RANK,PLUS_MINUS_RANK,NBA_FANTASY_PTS_RANK,DD2_RANK,TD3_RANK,WNBA_FANTASY_PTS_RANK,SEASON,POS,YEAR
0,1630639,A.J. Lawson,A.J.,1610612761,TOR,24.0,4,1,3,0.25,3.8,1.0,2.0,0.5,0.5,1.3,0.4,0.3,0.8,0.333,0.3,0.5,0.8,0.0,0.0,0.0,0.0,0.3,0.0,0.5,2.8,-2.5,3.7,0,0,4.0,489,487,457,444,488,413,429,121,317,354,93,427,350,480,432,477,478,504,502,479,458,298,511,395,415,408,470,197,28,457,2024-25,G,2025
1,1631260,AJ Green,AJ,1610612749,MIL,25.0,38,23,15,0.605,21.8,2.7,5.9,0.447,2.3,5.2,0.442,0.3,0.4,0.857,0.3,2.1,2.3,1.3,0.6,0.5,0.1,0.0,2.2,0.6,7.9,3.6,13.9,0,0,15.1,179,92,247,145,216,255,256,261,55,91,35,413,434,109,425,291,332,298,360,311,413,460,126,378,243,73,301,197,28,272,2024-25,G,2025
2,1642358,AJ Johnson,AJ,1610612749,MIL,20.0,6,4,2,0.667,2.4,0.5,1.2,0.429,0.2,0.2,1.0,0.0,0.0,0.0,0.0,0.5,0.5,0.5,0.3,0.0,0.0,0.2,0.0,0.0,1.2,2.2,2.2,0,0,2.3,470,434,475,95,517,468,483,324,408,468,1,489,490,489,494,477,495,418,428,479,458,369,511,503,481,119,496,197,28,492,2024-25,G,2025
3,203932,Aaron Gordon,Aaron,1610612743,DEN,29.0,26,16,10,0.615,26.3,4.5,8.6,0.527,1.2,2.8,0.425,2.6,3.3,0.791,1.6,3.4,5.0,2.6,1.4,0.5,0.2,0.6,1.5,2.6,12.9,4.9,23.5,2,0,23.0,323,220,341,134,149,130,153,97,199,235,52,80,79,215,77,136,115,135,148,311,329,134,292,94,117,44,153,108,28,151,2024-25,F,2025
4,1628988,Aaron Holiday,Aaron,1610612745,HOU,28.0,33,22,11,0.667,11.1,1.4,3.2,0.421,0.8,2.1,0.391,0.5,0.6,0.895,0.2,0.8,1.0,1.2,0.5,0.4,0.1,0.1,0.9,0.7,4.1,2.0,8.1,0,0,8.1,246,113,327,95,381,380,367,348,262,286,115,345,378,58,453,446,459,312,385,361,388,451,393,331,365,129,392,197,28,388,2024-25,G,2025


In [14]:
nbanow = pl_current.copy()

# Map the team stats based on TEAM_ID
nbanow.loc[nbanow['YEAR'] == 2025, 'TEAM_GP'] = nbanow.loc[nbanow['YEAR'] == 2025, 'TEAM_ID'].map(
    tm_current[tm_current['YEAR'] == 2025].set_index('TEAM_ID')['GP'])
nbanow.loc[nbanow['YEAR'] == 2025, 'TEAM_W'] = nbanow.loc[nbanow['YEAR'] == 2025, 'TEAM_ID'].map(
    tm_current[tm_current['YEAR'] == 2025].set_index('TEAM_ID')['W'])
nbanow.loc[nbanow['YEAR'] == 2025, 'TEAM_L'] = nbanow.loc[nbanow['YEAR'] == 2025, 'TEAM_ID'].map(
    tm_current[tm_current['YEAR'] == 2025].set_index('TEAM_ID')['L'])
nbanow.loc[nbanow['YEAR'] == 2025, 'TEAM_W_PCT'] = nbanow.loc[nbanow['YEAR'] == 2025, 'TEAM_ID'].map(
    tm_current[tm_current['YEAR'] == 2025].set_index('TEAM_ID')['W_PCT'])

nbanow.head()

Unnamed: 0,PLAYER_ID,PLAYER_NAME,NICKNAME,TEAM_ID,TEAM_ABBREVIATION,AGE,GP,W,L,W_PCT,MIN,FGM,FGA,FG_PCT,FG3M,FG3A,FG3_PCT,FTM,FTA,FT_PCT,OREB,DREB,REB,AST,TOV,STL,BLK,BLKA,PF,PFD,PTS,PLUS_MINUS,NBA_FANTASY_PTS,DD2,TD3,WNBA_FANTASY_PTS,GP_RANK,W_RANK,L_RANK,W_PCT_RANK,MIN_RANK,FGM_RANK,FGA_RANK,FG_PCT_RANK,FG3M_RANK,FG3A_RANK,FG3_PCT_RANK,FTM_RANK,FTA_RANK,FT_PCT_RANK,OREB_RANK,DREB_RANK,REB_RANK,AST_RANK,TOV_RANK,STL_RANK,BLK_RANK,BLKA_RANK,PF_RANK,PFD_RANK,PTS_RANK,PLUS_MINUS_RANK,NBA_FANTASY_PTS_RANK,DD2_RANK,TD3_RANK,WNBA_FANTASY_PTS_RANK,SEASON,POS,YEAR,TEAM_GP,TEAM_W,TEAM_L,TEAM_W_PCT
0,1630639,A.J. Lawson,A.J.,1610612761,TOR,24.0,4,1,3,0.25,3.8,1.0,2.0,0.5,0.5,1.3,0.4,0.3,0.8,0.333,0.3,0.5,0.8,0.0,0.0,0.0,0.0,0.3,0.0,0.5,2.8,-2.5,3.7,0,0,4.0,489,487,457,444,488,413,429,121,317,354,93,427,350,480,432,477,478,504,502,479,458,298,511,395,415,408,470,197,28,457,2024-25,G,2025,47.0,15.0,32.0,0.319
1,1631260,AJ Green,AJ,1610612749,MIL,25.0,38,23,15,0.605,21.8,2.7,5.9,0.447,2.3,5.2,0.442,0.3,0.4,0.857,0.3,2.1,2.3,1.3,0.6,0.5,0.1,0.0,2.2,0.6,7.9,3.6,13.9,0,0,15.1,179,92,247,145,216,255,256,261,55,91,35,413,434,109,425,291,332,298,360,311,413,460,126,378,243,73,301,197,28,272,2024-25,G,2025,45.0,26.0,19.0,0.578
2,1642358,AJ Johnson,AJ,1610612749,MIL,20.0,6,4,2,0.667,2.4,0.5,1.2,0.429,0.2,0.2,1.0,0.0,0.0,0.0,0.0,0.5,0.5,0.5,0.3,0.0,0.0,0.2,0.0,0.0,1.2,2.2,2.2,0,0,2.3,470,434,475,95,517,468,483,324,408,468,1,489,490,489,494,477,495,418,428,479,458,369,511,503,481,119,496,197,28,492,2024-25,G,2025,45.0,26.0,19.0,0.578
3,203932,Aaron Gordon,Aaron,1610612743,DEN,29.0,26,16,10,0.615,26.3,4.5,8.6,0.527,1.2,2.8,0.425,2.6,3.3,0.791,1.6,3.4,5.0,2.6,1.4,0.5,0.2,0.6,1.5,2.6,12.9,4.9,23.5,2,0,23.0,323,220,341,134,149,130,153,97,199,235,52,80,79,215,77,136,115,135,148,311,329,134,292,94,117,44,153,108,28,151,2024-25,F,2025,47.0,28.0,19.0,0.596
4,1628988,Aaron Holiday,Aaron,1610612745,HOU,28.0,33,22,11,0.667,11.1,1.4,3.2,0.421,0.8,2.1,0.391,0.5,0.6,0.895,0.2,0.8,1.0,1.2,0.5,0.4,0.1,0.1,0.9,0.7,4.1,2.0,8.1,0,0,8.1,246,113,327,95,381,380,367,348,262,286,115,345,378,58,453,446,459,312,385,361,388,451,393,331,365,129,392,197,28,388,2024-25,G,2025,46.0,32.0,14.0,0.696
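
The four `map` calls above attach team-level columns to each player. An equivalent approach, sketched here on hypothetical toy tables (the `players` and `teams` dataframes and their values are made up for illustration), is a single left merge on `TEAM_ID`:

```python
import pandas as pd

# Hypothetical toy tables with the same key column as the notebook data.
players = pd.DataFrame({
    'PLAYER_NAME': ['Player A', 'Player B', 'Player C'],
    'TEAM_ID': [1, 2, 1],
})
teams = pd.DataFrame({
    'TEAM_ID': [1, 2],
    'GP': [47, 45], 'W': [15, 26], 'L': [32, 19], 'W_PCT': [0.319, 0.578],
})

# Prefix the team columns so they do not collide with the player columns.
team_cols = teams[['TEAM_ID', 'GP', 'W', 'L', 'W_PCT']].rename(
    columns={'GP': 'TEAM_GP', 'W': 'TEAM_W', 'L': 'TEAM_L', 'W_PCT': 'TEAM_W_PCT'})

# validate='many_to_one' raises an error if a TEAM_ID appears twice in the team table.
merged = players.merge(team_cols, on='TEAM_ID', how='left', validate='many_to_one')
print(merged)
```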


In [15]:
# We can use the team abbreviations to keep the input short
nbanow = nbanow.rename(columns={"TEAM_ABBREVIATION": "TEAM"})

# All of the eastern conference team abbreviations over the past 25 years
east = ['IND', 'BOS', 'CHI', 'NYK', 'WAS', 'MIA', 'BKN', 'TOR', 'PHI', 'CHA', 'MIL', 'ATL',
        'CLE', 'ORL', 'DET', 'NJN', 'CHH']

nbanow['CONF'] = np.where(nbanow['TEAM'].isin(east), 'E', 'W')

nbanow.head()

Unnamed: 0,PLAYER_ID,PLAYER_NAME,NICKNAME,TEAM_ID,TEAM,AGE,GP,W,L,W_PCT,MIN,FGM,FGA,FG_PCT,FG3M,FG3A,FG3_PCT,FTM,FTA,FT_PCT,OREB,DREB,REB,AST,TOV,STL,BLK,BLKA,PF,PFD,PTS,PLUS_MINUS,NBA_FANTASY_PTS,DD2,TD3,WNBA_FANTASY_PTS,GP_RANK,W_RANK,L_RANK,W_PCT_RANK,MIN_RANK,FGM_RANK,FGA_RANK,FG_PCT_RANK,FG3M_RANK,FG3A_RANK,FG3_PCT_RANK,FTM_RANK,FTA_RANK,FT_PCT_RANK,OREB_RANK,DREB_RANK,REB_RANK,AST_RANK,TOV_RANK,STL_RANK,BLK_RANK,BLKA_RANK,PF_RANK,PFD_RANK,PTS_RANK,PLUS_MINUS_RANK,NBA_FANTASY_PTS_RANK,DD2_RANK,TD3_RANK,WNBA_FANTASY_PTS_RANK,SEASON,POS,YEAR,TEAM_GP,TEAM_W,TEAM_L,TEAM_W_PCT,CONF
0,1630639,A.J. Lawson,A.J.,1610612761,TOR,24.0,4,1,3,0.25,3.8,1.0,2.0,0.5,0.5,1.3,0.4,0.3,0.8,0.333,0.3,0.5,0.8,0.0,0.0,0.0,0.0,0.3,0.0,0.5,2.8,-2.5,3.7,0,0,4.0,489,487,457,444,488,413,429,121,317,354,93,427,350,480,432,477,478,504,502,479,458,298,511,395,415,408,470,197,28,457,2024-25,G,2025,47.0,15.0,32.0,0.319,E
1,1631260,AJ Green,AJ,1610612749,MIL,25.0,38,23,15,0.605,21.8,2.7,5.9,0.447,2.3,5.2,0.442,0.3,0.4,0.857,0.3,2.1,2.3,1.3,0.6,0.5,0.1,0.0,2.2,0.6,7.9,3.6,13.9,0,0,15.1,179,92,247,145,216,255,256,261,55,91,35,413,434,109,425,291,332,298,360,311,413,460,126,378,243,73,301,197,28,272,2024-25,G,2025,45.0,26.0,19.0,0.578,E
2,1642358,AJ Johnson,AJ,1610612749,MIL,20.0,6,4,2,0.667,2.4,0.5,1.2,0.429,0.2,0.2,1.0,0.0,0.0,0.0,0.0,0.5,0.5,0.5,0.3,0.0,0.0,0.2,0.0,0.0,1.2,2.2,2.2,0,0,2.3,470,434,475,95,517,468,483,324,408,468,1,489,490,489,494,477,495,418,428,479,458,369,511,503,481,119,496,197,28,492,2024-25,G,2025,45.0,26.0,19.0,0.578,E
3,203932,Aaron Gordon,Aaron,1610612743,DEN,29.0,26,16,10,0.615,26.3,4.5,8.6,0.527,1.2,2.8,0.425,2.6,3.3,0.791,1.6,3.4,5.0,2.6,1.4,0.5,0.2,0.6,1.5,2.6,12.9,4.9,23.5,2,0,23.0,323,220,341,134,149,130,153,97,199,235,52,80,79,215,77,136,115,135,148,311,329,134,292,94,117,44,153,108,28,151,2024-25,F,2025,47.0,28.0,19.0,0.596,W
4,1628988,Aaron Holiday,Aaron,1610612745,HOU,28.0,33,22,11,0.667,11.1,1.4,3.2,0.421,0.8,2.1,0.391,0.5,0.6,0.895,0.2,0.8,1.0,1.2,0.5,0.4,0.1,0.1,0.9,0.7,4.1,2.0,8.1,0,0,8.1,246,113,327,95,381,380,367,348,262,286,115,345,378,58,453,446,459,312,385,361,388,451,393,331,365,129,392,197,28,388,2024-25,G,2025,46.0,32.0,14.0,0.696,W


In [16]:
# Find missing values
print(nbanow.shape)
missing_cols = nbanow.columns[nbanow.isna().any()].tolist()
missing_vals = nbanow[missing_cols].isna().sum()
# Iterate over (column name, count) pairs instead of indexing the Series by position
for col, n_missing in missing_vals.items():
    print('{}: {} missing'.format(col, n_missing))

(524, 74)
POS: 1 missing


The one player with a missing position has played very limited minutes and can therefore be removed from the dataset:

In [17]:
nbanow.dropna(axis=0, subset=['POS'], inplace=True)
print(nbanow.shape)
nbanow.columns[nbanow.isna().any()].tolist()

(523, 74)


[]

In [18]:
# We can remove the following columns that do not contain any information on player performance
nbanow = nbanow.drop(['PLAYER_ID', 'TEAM_ID', 'NICKNAME', 'SEASON', 'WNBA_FANTASY_PTS'], axis=1)

# We can also remove the columns titled rank since those ranks are not relevant to our study
nbanow = nbanow.drop(nbanow.columns[nbanow.columns.str.contains('RANK')], axis=1)

# Calculate games played with respect to total team games
nbanow['GP_PCT'] = nbanow['GP'] / nbanow['TEAM_GP']
nbanow['GP_PCT'].describe()

count   523.0000
mean      0.6179
std       0.3016
min       0.0208
25%       0.3723
50%       0.6889
75%       0.8913
max       1.0000
Name: GP_PCT, dtype: float64

In [19]:
# Filtering the data based on games and minutes played
nbanow.drop(nbanow.loc[nbanow['GP_PCT'] < 0.4].index, inplace=True)
nbanow.drop(nbanow.loc[nbanow['MIN'] < 24].index, inplace=True)
print(nbanow.shape)

(181, 40)


In [20]:
# Create a "clean" dataframe
nbanowmod = nbanow.copy().reset_index(drop=True)

# Identify position types
frontcourt = ['F', 'C', 'C-F', 'F-C']
backcourt = ['G', 'G-F', 'F-G']

for i in range(len(nbanowmod)):
    pos = nbanowmod.loc[i, 'POS']

    if nbanowmod.loc[i, 'PLAYER_NAME'] in ['Jayson Tatum', 'Jaylen Brown', 'Jalen Williams']:
        nbanowmod.loc[i, 'TYPE'] = 'F'
    else:
        if pos in frontcourt:
            nbanowmod.loc[i, 'TYPE'] = 'F'
        elif pos in backcourt:
            nbanowmod.loc[i, 'TYPE'] = 'B'

nbanowmod['TYPE'].value_counts()

TYPE
B    94
F    87
Name: count, dtype: int64

In [21]:
pos_type = ['B', 'F']
conf_type = ['E', 'W']

for conf in conf_type:
    for pos in pos_type:
        nbanowmod_2025 = nbanowmod[nbanowmod['YEAR'] == 2025]
        nbanowmod_group = nbanowmod_2025[(nbanowmod_2025['TYPE'] == pos) & (nbanowmod_2025['CONF'] == conf)]
        nbanowmod.loc[nbanowmod['YEAR'] == 2025, 'PTS_' + conf + pos] = nbanowmod_group['PTS'].rank(pct = True)
        nbanowmod.loc[nbanowmod['YEAR'] == 2025, 'FP_' + conf + pos] = nbanowmod_group['NBA_FANTASY_PTS'].rank(pct = True)
        nbanowmod.loc[nbanowmod['YEAR'] == 2025, 'PM_' + conf + pos] = nbanowmod_group['PLUS_MINUS'].rank(pct = True)
        nbanowmod.loc[nbanowmod['YEAR'] == 2025, 'W_PCT_' + conf + pos] = nbanowmod_group['W_PCT'].rank(pct = True)
        
nbanowmod.describe()

Unnamed: 0,AGE,GP,W,L,W_PCT,MIN,FGM,FGA,FG_PCT,FG3M,FG3A,FG3_PCT,FTM,FTA,FT_PCT,OREB,DREB,REB,AST,TOV,STL,BLK,BLKA,PF,PFD,PTS,PLUS_MINUS,NBA_FANTASY_PTS,DD2,TD3,YEAR,TEAM_GP,TEAM_W,TEAM_L,TEAM_W_PCT,GP_PCT,PTS_EB,FP_EB,PM_EB,W_PCT_EB,PTS_EF,FP_EF,PM_EF,W_PCT_EF,PTS_WB,FP_WB,PM_WB,W_PCT_WB,PTS_WF,FP_WF,PM_WF,W_PCT_WF
count,181.0,181.0,181.0,181.0,181.0,181.0,181.0,181.0,181.0,181.0,181.0,181.0,181.0,181.0,181.0,181.0,181.0,181.0,181.0,181.0,181.0,181.0,181.0,181.0,181.0,181.0,181.0,181.0,181.0,181.0,181.0,181.0,181.0,181.0,181.0,181.0,48.0,48.0,48.0,48.0,43.0,43.0,43.0,43.0,46.0,46.0,46.0,46.0,44.0,44.0,44.0,44.0
mean,26.7403,39.4199,20.0387,19.3812,0.5019,30.2337,5.6641,12.0862,0.4703,1.8028,4.9536,0.3356,2.4569,3.0873,0.7909,1.2481,4.2006,5.4448,3.6696,1.8657,1.0287,0.6017,0.6586,2.1464,2.6459,15.5818,0.4072,30.6409,6.1878,0.4696,2025.0,46.5856,23.1713,23.4144,0.4968,0.8462,0.5104,0.5104,0.5104,0.5104,0.5116,0.5116,0.5116,0.5116,0.5109,0.5109,0.5109,0.5109,0.5114,0.5114,0.5114,0.5114
std,4.3962,6.7273,7.8545,6.5024,0.1635,3.8893,2.0995,4.2589,0.0653,0.9674,2.496,0.1,1.4633,1.79,0.0857,0.8789,1.9317,2.6113,2.0562,0.8561,0.3762,0.5034,0.2979,0.548,1.3693,5.8884,4.3675,9.4502,8.6293,1.9876,0.0,1.1593,7.5497,7.3522,0.1602,0.1433,0.2916,0.2916,0.2916,0.2916,0.292,0.292,0.292,0.2918,0.2918,0.2917,0.2918,0.2917,0.2919,0.2919,0.2918,0.2918
min,19.0,19.0,2.0,3.0,0.069,24.0,2.4,5.2,0.347,0.0,0.0,0.0,0.2,0.3,0.535,0.2,1.6,1.9,0.8,0.6,0.4,0.0,0.1,0.9,0.5,6.2,-12.9,16.3,0.0,0.0,2025.0,44.0,6.0,9.0,0.13,0.4043,0.0208,0.0208,0.0208,0.0208,0.0233,0.0233,0.0233,0.0233,0.0217,0.0326,0.0217,0.0217,0.0227,0.0227,0.0227,0.0227
25%,23.0,35.0,14.0,16.0,0.406,26.7,4.0,8.4,0.427,1.2,3.6,0.318,1.3,1.8,0.743,0.6,2.8,3.6,2.0,1.2,0.8,0.3,0.4,1.8,1.5,10.8,-2.4,23.2,0.0,0.0,2025.0,46.0,19.0,19.0,0.413,0.766,0.2656,0.2656,0.2604,0.2656,0.2674,0.2674,0.2674,0.2674,0.2663,0.2717,0.2717,0.2663,0.267,0.2642,0.2841,0.267
50%,26.0,41.0,21.0,19.0,0.515,30.3,5.4,11.7,0.457,1.8,4.9,0.355,2.1,2.8,0.81,0.9,3.5,4.7,3.3,1.7,0.9,0.5,0.6,2.1,2.5,14.5,0.5,29.0,3.0,0.0,2025.0,47.0,24.0,23.0,0.511,0.8913,0.5104,0.5104,0.5104,0.5104,0.5116,0.5116,0.5116,0.5116,0.5109,0.5217,0.5054,0.5109,0.5114,0.517,0.5114,0.5227
75%,29.0,45.0,25.0,23.0,0.596,33.2,7.0,15.1,0.494,2.3,6.3,0.393,3.3,4.1,0.86,1.6,5.2,6.6,4.9,2.4,1.2,0.8,0.8,2.5,3.4,19.4,3.5,36.6,8.0,0.0,2025.0,47.0,27.0,28.0,0.578,0.9574,0.7552,0.7552,0.7552,0.7578,0.7558,0.7558,0.7558,0.7558,0.7554,0.75,0.7554,0.7636,0.7614,0.7557,0.7472,0.7557
max,40.0,48.0,38.0,40.0,0.886,38.4,12.7,23.3,0.733,4.3,12.3,0.5,7.7,10.8,0.97,4.5,10.7,14.5,11.4,4.7,3.1,3.9,1.5,3.5,8.0,32.5,12.2,64.6,42.0,21.0,2025.0,48.0,38.0,40.0,0.809,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [22]:
# Rename player name column
nbanowmod = nbanowmod.rename(columns={"PLAYER_NAME": "PLAYER"})
# Fill empty rank values with 0
nbanowmod = nbanowmod.fillna(0)
# Remove original player stats
nbanowmod.drop(nbanowmod.columns[1:32], axis=1, inplace=True)
# Remove original team stats
nbanowmod.drop(nbanowmod.columns[nbanowmod.columns.str.contains('TEAM')], axis=1, inplace=True)

print(nbanowmod.shape)
print(nbanowmod.head())

(181, 22)
           PLAYER  POS  YEAR CONF  GP_PCT TYPE  PTS_EB  FP_EB  PM_EB  \
0    Aaron Gordon    F  2025    W  0.5532    F  0.0000 0.0000 0.0000   
1      Al Horford  C-F  2025    E  0.7500    F  0.0000 0.0000 0.0000   
2       Alex Sarr    C  2025    E  0.8913    F  0.0000 0.0000 0.0000   
3  Alperen Sengun    C  2025    W  1.0000    F  0.0000 0.0000 0.0000   
4   Amen Thompson  G-F  2025    W  0.9348    B  0.0000 0.0000 0.0000   

   W_PCT_EB  PTS_EF  FP_EF  PM_EF  W_PCT_EF  PTS_WB  FP_WB  PM_WB  W_PCT_WB  \
0    0.0000  0.0000 0.0000 0.0000    0.0000  0.0000 0.0000 0.0000    0.0000   
1    0.0000  0.1163 0.2326 0.8837    0.8837  0.0000 0.0000 0.0000    0.0000   
2    0.0000  0.4186 0.4535 0.0465    0.0698  0.0000 0.0000 0.0000    0.0000   
3    0.0000  0.0000 0.0000 0.0000    0.0000  0.0000 0.0000 0.0000    0.0000   
4    0.0000  0.0000 0.0000 0.0000    0.0000  0.3696 0.6087 0.6957    0.9130   

   PTS_WF  FP_WF  PM_WF  W_PCT_WF  
0  0.5000 0.1818 0.8182    0.7955  
1  0.0000 

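Now we have this season's data with the same features as the training data. We can use the best-performing random forest classifier from the previous step to make predictions. As a reminder, the grid search selected the entropy criterion, a maximum depth of 10, log2 max features, a minimum of 1 sample per leaf, a minimum of 5 samples per split, 9 estimators and no class weighting: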

In [23]:
cols = nbanowmod.columns
test_cols = cols.drop(['PLAYER', 'YEAR', 'POS', 'CONF', 'TYPE'])

new_predictions = best_rf.predict(nbanowmod[test_cols])
proba_predict = best_rf.predict_proba(nbanowmod[test_cols])

nbanowmod['PREDICT'] = new_predictions
nbanowmod[['PREDICT_NOT', 'PREDICT_EF', 'PREDICT_EB', 'PREDICT_WF', 'PREDICT_WB']] = pd.DataFrame(proba_predict)

nbanowmod['PREDICT'].value_counts()

PREDICT
0    153
3      9
2      7
4      6
1      6
Name: count, dtype: int64
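
The probability columns above are named under the assumption that `predict_proba` returns its columns in the order 0, 1, 2, 3, 4. scikit-learn orders these columns according to the fitted estimator's `classes_` attribute, so the names can be derived from it directly. The sketch below uses random synthetic data and a hypothetical `labels` dictionary purely for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 100 samples, 3 features, classes 0..4 all present.
rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = np.arange(100) % 5

clf = RandomForestClassifier(n_estimators=9, random_state=1).fit(X, y)

# Hypothetical mapping from class code to column name, mirroring the notebook's names.
labels = {0: 'PREDICT_NOT', 1: 'PREDICT_EF', 2: 'PREDICT_EB', 3: 'PREDICT_WF', 4: 'PREDICT_WB'}

# Columns of predict_proba follow clf.classes_, so name them from that attribute.
proba = pd.DataFrame(clf.predict_proba(X), columns=[labels[int(c)] for c in clf.classes_])
print(proba.head())
```

Naming the columns this way keeps the labels correct even if a class were missing from the training data.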

As seen, the model predicts 13 East (classes 1 and 2) and 15 West (classes 3 and 4) All-Star players, which is more than the 12 roster spots available per conference. We can now check the predicted All-Stars:

In [24]:
print('Eastern Conference')
print('-'*18)
print('Frontcourt  :  {}'.format(nbanowmod[(nbanowmod['PREDICT'] == 1)]['PLAYER'].to_list()))
print('Backcourt :  {}\n'.format(nbanowmod[(nbanowmod['PREDICT'] == 2)]['PLAYER'].to_list()))
print('Western Conference')
print('-'*18)
print('Frontcourt  :  {}'.format(nbanowmod[(nbanowmod['PREDICT'] == 3)]['PLAYER'].to_list()))
print('Backcourt :  {}'.format(nbanowmod[(nbanowmod['PREDICT'] == 4)]['PLAYER'].to_list()))

Eastern Conference
------------------
Frontcourt  :  ['Evan Mobley', 'Franz Wagner', 'Giannis Antetokounmpo', 'Jaylen Brown', 'Jayson Tatum', 'Karl-Anthony Towns']
Backcourt :  ['Cade Cunningham', 'Damian Lillard', 'Donovan Mitchell', 'Jalen Brunson', 'LaMelo Ball', 'Trae Young', 'Tyrese Maxey']

Western Conference
------------------
Frontcourt  :  ['Alperen Sengun', 'Anthony Davis', 'Domantas Sabonis', 'Jalen Williams', 'Jaren Jackson Jr.', 'Kevin Durant', 'LeBron James', 'Nikola Jokić', 'Victor Wembanyama']
Backcourt :  ['Anthony Edwards', "De'Aaron Fox", 'James Harden', 'Kyrie Irving', 'Luka Dončić', 'Shai Gilgeous-Alexander']


### Final Rosters based on Positions

In [42]:
# Top 2 backcourt and top 3 frontcourt as starters, next 2 and next 3 as reserves
eastallstars = (
    nbanowmod.sort_values('PREDICT_EB', ascending=False).head(2)['PLAYER'].to_list()
    + nbanowmod.sort_values('PREDICT_EF', ascending=False).head(3)['PLAYER'].to_list()
    + nbanowmod.sort_values('PREDICT_EB', ascending=False).iloc[2:4]['PLAYER'].to_list()
    + nbanowmod.sort_values('PREDICT_EF', ascending=False).iloc[3:6]['PLAYER'].to_list()
)

print('Eastern Conference')
print('Starters')
print('Backcourt  :  {}'.format(nbanowmod.sort_values('PREDICT_EB', ascending=False).head(2)['PLAYER'].to_list()))
print('Frontcourt  :  {}'.format(nbanowmod.sort_values('PREDICT_EF', ascending=False).head(3)['PLAYER'].to_list()))
print('Reserves')
print('Backcourt  :  {}'.format(nbanowmod.sort_values('PREDICT_EB', ascending=False).iloc[2:4]['PLAYER'].to_list()))
print('Frontcourt  :  {}'.format(nbanowmod.sort_values('PREDICT_EF', ascending=False).iloc[3:6]['PLAYER'].to_list()))

# Wildcards: East players among the 12 lowest PREDICT_NOT values who are not already selected
proballstar = nbanowmod[nbanowmod['CONF'] == 'E'].sort_values('PREDICT_NOT').reset_index(drop=True)
eastwildcard = []
for i in range(12):
    if proballstar.loc[i, 'PLAYER'] not in eastallstars:
        eastwildcard.append(proballstar.loc[i, 'PLAYER'])
print('Wildcard  :  {}'.format(eastwildcard))

Eastern Conference
Starters
Backcourt  :  ['Jalen Brunson', 'Cade Cunningham']
Frontcourt  :  ['Giannis Antetokounmpo', 'Jayson Tatum', 'Karl-Anthony Towns']
Reserves
Backcourt  :  ['Damian Lillard', 'Trae Young']
Frontcourt  :  ['Franz Wagner', 'Jaylen Brown', 'Evan Mobley']
Wildcard  :  ['Donovan Mitchell', 'LaMelo Ball']


In [43]:
# Top 2 backcourt and top 3 frontcourt as starters, next 2 and next 3 as reserves
westallstars = (
    nbanowmod.sort_values('PREDICT_WB', ascending=False).head(2)['PLAYER'].to_list()
    + nbanowmod.sort_values('PREDICT_WF', ascending=False).head(3)['PLAYER'].to_list()
    + nbanowmod.sort_values('PREDICT_WB', ascending=False).iloc[2:4]['PLAYER'].to_list()
    + nbanowmod.sort_values('PREDICT_WF', ascending=False).iloc[3:6]['PLAYER'].to_list()
)

print('Western Conference')
print('Starters')
print('Backcourt  :  {}'.format(nbanowmod.sort_values('PREDICT_WB', ascending=False).head(2)['PLAYER'].to_list()))
print('Frontcourt  :  {}'.format(nbanowmod.sort_values('PREDICT_WF', ascending=False).head(3)['PLAYER'].to_list()))
print('Reserves')
print('Backcourt  :  {}'.format(nbanowmod.sort_values('PREDICT_WB', ascending=False).iloc[2:4]['PLAYER'].to_list()))
print('Frontcourt  :  {}'.format(nbanowmod.sort_values('PREDICT_WF', ascending=False).iloc[3:6]['PLAYER'].to_list()))

# Wildcards: West players among the 12 lowest PREDICT_NOT values who are not already selected
proballstar = nbanowmod[nbanowmod['CONF'] == 'W'].sort_values('PREDICT_NOT').reset_index(drop=True)
westwildcard = []
for i in range(12):
    if proballstar.loc[i, 'PLAYER'] not in westallstars:
        westwildcard.append(proballstar.loc[i, 'PLAYER'])
print('Wildcard  :  {}'.format(westwildcard))

Western Conference
Starters
Backcourt  :  ['Shai Gilgeous-Alexander', 'Luka Dončić']
Frontcourt  :  ['Anthony Davis', 'Victor Wembanyama', 'Nikola Jokić']
Reserves
Backcourt  :  ["De'Aaron Fox", 'Anthony Edwards']
Frontcourt  :  ['Jalen Williams', 'Domantas Sabonis', 'LeBron James']
Wildcard  :  ['Alperen Sengun', 'Kyrie Irving']
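
Since the East and West cells repeat the same logic, the roster assembly could be wrapped in one function. The following is a sketch of a possible variant, not the code used to produce the rosters above: `build_roster` is a hypothetical helper, it filters by conference before ranking, and it always fills the roster up to 12 players, which differs slightly from the top-12 check in the cells above. It assumes that at least four backcourt and six frontcourt players have a non-zero class probability. The toy data is made up:

```python
import pandas as pd

def build_roster(df, conf, back_col, front_col, roster_size=12):
    """Pick 2+3 starters, 2+3 reserves and fill the rest with wildcards."""
    pool = df[df['CONF'] == conf]
    back = pool.sort_values(back_col, ascending=False)['PLAYER'].tolist()
    front = pool.sort_values(front_col, ascending=False)['PLAYER'].tolist()
    roster = {
        'starters_back': back[:2],
        'starters_front': front[:3],
        'reserves_back': back[2:4],
        'reserves_front': front[3:6],
    }
    chosen = set(sum(roster.values(), []))
    # Wildcards: most likely All-Stars (lowest PREDICT_NOT) not chosen yet.
    ranked = pool.sort_values('PREDICT_NOT')['PLAYER'].tolist()
    n_wild = roster_size - len(chosen)
    roster['wildcards'] = [p for p in ranked if p not in chosen][:n_wild]
    return roster

# Hypothetical toy conference: 6 backcourt players (P0-P5), 8 frontcourt players (P6-P13).
toy = pd.DataFrame({
    'PLAYER': ['P{}'.format(i) for i in range(14)],
    'CONF': ['E'] * 14,
    'PREDICT_EB': [0.9, 0.8, 0.7, 0.6, 0.5, 0.4] + [0.0] * 8,
    'PREDICT_EF': [0.0] * 6 + [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2],
})
toy['PREDICT_NOT'] = 1.0 - toy['PREDICT_EB'] - toy['PREDICT_EF']

roster = build_roster(toy, 'E', 'PREDICT_EB', 'PREDICT_EF')
for slot, names in roster.items():
    print('{:15s}: {}'.format(slot, names))
```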


## Conclusions

Adding more classes based on conference and position gave a macro-averaged F1 score of 0.79 (with an overall accuracy of 0.93), compared with the 76% reported for the binary classification model. Comparing the predictions with the actual All-Star selections, we see that the model correctly predicts 10 out of 12 in the East and 9 out of 12 in the West. It still tends to value individual stats over team wins: the model picks 2 players from the Sacramento Kings who did not make the actual roster. This could potentially be improved by training on team ranks instead of team wins, since in reality the top 4 teams in each conference are almost always awarded an All-Star selection regardless of their number of wins.
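
As a starting point for that idea, a team's rank within its conference can be derived from the win percentages. This is only a sketch with made-up teams; the `CONF_RANK` and `TOP4` columns are hypothetical feature names that do not exist in the current dataset:

```python
import pandas as pd

# Made-up team records; the real input would be the pre-All-Star team stats.
teams = pd.DataFrame({
    'TEAM': ['AAA', 'BBB', 'CCC', 'DDD'],
    'CONF': ['E', 'E', 'W', 'W'],
    'W_PCT': [0.70, 0.40, 0.55, 0.80],
})

# Rank teams inside each conference, best win percentage first (1 = top seed).
teams['CONF_RANK'] = (
    teams.groupby('CONF')['W_PCT'].rank(ascending=False, method='min').astype(int)
)

# Flag the teams that sit in the top 4 of their conference.
teams['TOP4'] = teams['CONF_RANK'] <= 4
print(teams)
```

These columns could then be mapped onto the player rows by team, in the same way the team win percentage is attached today.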