# From Rookies to the All-NBA team
# Step 2: Modeling

## Importing Libraries, Packages and Modules


In [1]:
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import OrdinalEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV

## Loading Data

In [2]:
rookies = pd.read_csv("rookie.csv")
rookies

| Unnamed: 0.1 | Unnamed: 0 | seas_id | season | player_id | player | pos | age | experience | lg | tm | ... | usg_percent | ows | dws | ws | ws_48 | obpm | dbpm | bpm | vorp | all_nba |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 31143 | 2024 | 5109 | Adam Flagler | SG | 24.0 | 1 | NBA | OKC | ... | 21.7 | 0.0 | 0.0 | 0.0 | -0.159 | -6.7 | -3.9 | -10.6 | 0.0 | False |
| 1 | 1 | 31144 | 2024 | 5110 | Adama Sanogo | PF | 21.0 | 1 | NBA | CHI | ... | 24.8 | 0.1 | 0.1 | 0.2 | 0.133 | -0.4 | -6.3 | -6.7 | -0.1 | False |
| 2 | 2 | 31154 | 2024 | 5111 | Alex Fudge | SF | 20.0 | 1 | NBA | TOT | ... | 19.2 | 0.0 | 0.0 | 0.0 | -0.007 | -3.7 | -1.5 | -5.2 | 0.0 | False |
| 3 | 3 | 31160 | 2024 | 5112 | Amari Bailey | PG | 19.0 | 1 | NBA | CHO | ... | 21.1 | -0.1 | 0.0 | -0.1 | -0.044 | -5.2 | -3.1 | -8.3 | -0.1 | False |
| 4 | 4 | 31161 | 2024 | 5113 | Amen Thompson | SF | 21.0 | 1 | NBA | HOU | ... | 18.5 | 1.9 | 2.4 | 4.3 | 0.149 | -0.1 | 1.9 | 1.8 | 1.3 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2803 | 2803 | 11784 | 1989 | 2474 | Vernon Maxwell | SG | 23.0 | 1 | NBA | SAS | ... | 20.9 | 0.1 | 1.4 | 1.5 | 0.035 | -1.3 | -1.2 | -2.5 | -0.3 | False |
| 2804 | 2804 | 11786 | 1989 | 2475 | Vinny Del Negro | SG | 22.0 | 1 | NBA | SAC | ... | 16.4 | 1.5 | 1.0 | 2.5 | 0.077 | -1.7 | 0.5 | -1.2 | 0.3 | False |
| 2805 | 2805 | 11795 | 1989 | 2476 | Wayne Engelstad | PF | 23.0 | 1 | NBA | DEN | ... | 27.9 | 0.0 | 0.1 | 0.0 | 0.031 | -5.0 | -1.7 | -6.7 | -0.1 | False |
| 2806 | 2806 | 11796 | 1989 | 2477 | Will Perdue | C | 23.0 | 1 | NBA | CHI | ... | 21.0 | -0.3 | 0.2 | -0.1 | -0.021 | -6.7 | -1.6 | -8.3 | -0.3 | False |


## Preparing for Modeling

scikit-learn's Random Forest models require numeric inputs, so before projecting which rookies will reach an All-NBA team we need to encode the categorical variables as numbers. We do this using an ordinal encoder.

The three categorical variables to encode are 'nba_tm', 'defense_tm', and 'rookie_tm'. Let's first print the counts of each category in these variables for rookie seasons before 2021. This will allow us to check that the encoding is correct afterward.

In [3]:
print(rookies[rookies['season'] < 2021]['nba_tm'].value_counts())
print(rookies[rookies['season'] < 2021]['defense_tm'].value_counts())
print(rookies[rookies['season'] < 2021]['rookie_tm'].value_counts())

nba_tm
Not Selected    2401
1st                1
3rd                1
Name: count, dtype: int64
defense_tm
Not Selected    2401
2nd                2
Name: count, dtype: int64
rookie_tm
Not Selected    2076
2nd              164
1st              163
Name: count, dtype: int64


In [4]:
enc = OrdinalEncoder()
rookies[['rookie_tm', 'defense_tm', 'nba_tm']] = enc.fit_transform(rookies[['rookie_tm', 'defense_tm', 'nba_tm']])
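
By default, OrdinalEncoder assigns codes in sorted order of the category labels, so 'Not Selected' sorts after '1st', '2nd', and '3rd' and receives the largest code. The sketch below uses a small made-up column (not the rookies data) to show how the fitted `categories_` attribute reveals the mapping, and how an explicit `categories` argument could put 'Not Selected' first instead. The toy values and the alternative ordering are assumptions for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy column standing in for a column such as 'rookie_tm'
toy = pd.DataFrame({"rookie_tm": ["Not Selected", "1st", "2nd", "Not Selected"]})

# Default behaviour: labels are sorted, so '1st' -> 0, '2nd' -> 1, 'Not Selected' -> 2
default_enc = OrdinalEncoder()
default_codes = default_enc.fit_transform(toy[["rookie_tm"]])
print(default_enc.categories_[0])
print(default_codes.ravel())

# Alternative (an assumption, not the notebook's choice): state the order explicitly
ordered_enc = OrdinalEncoder(categories=[["Not Selected", "2nd", "1st"]])
ordered_codes = ordered_enc.fit_transform(toy[["rookie_tm"]])
print(ordered_codes.ravel())
```

For a tree-based model the particular numbering matters less than for a linear model, but knowing the mapping makes the later value-count check easier to read.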

Here we define a function prep_rookies_data(). It removes all players whose rookie seasons were 2020-21 or later (that is, it keeps only rows with season < 2021). This ensures that the players in the sample have had at least five seasons to reach an All-NBA team.

The function also drops unnecessary columns and returns X, a dataframe with the features to be used in the models, and y, a series with the labels.

In [5]:
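def prep_rookies_data(data):
    # Work on a copy so the caller's dataframe is not modified
    df = data.copy()

    # Keep only rookie seasons before 2021 (i.e. 2019-20 or earlier)
    df = df[df['season'] < 2021]

    # Drop identifier and bookkeeping columns that should not be features.
    # Note: 'Unnamed: 0.1' is not in this list, so it remains in X.
    df = df.drop(['Unnamed: 0', 'seas_id', 'season', 'player_id', 'player',
                  'pos', 'experience', 'lg', 'tm'], axis=1)

    # Separate the features (X) from the label (y)
    X = df.drop(['all_nba'], axis=1)
    y = df['all_nba']

    return (X, y)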

Now, we must split the sample into training and testing sets. The test set will have 20 percent of the observations while the training set will contain the remaining 80 percent.

Subsequently, prep_rookies_data() is used to split the training and testing sets into their features and labels.

In [6]:
train, test = train_test_split(rookies, test_size = 0.2, random_state = 1)

X_train, y_train = prep_rookies_data(train)
X_test, y_test = prep_rookies_data(test)
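
Because All-NBA players are rare, a purely random split can leave the test set with a different share of positives than the training set. A minimal sketch of a stratified alternative follows, using the `stratify` argument of `train_test_split` on a synthetic frame. The toy data and its 5 percent positive rate are assumptions, and this is not the split used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the rookies table: 1,000 rows, 5% positives (assumed)
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "ws": rng.normal(size=1000),
    "all_nba": np.arange(1000) < 50,   # the first 50 rows are True
})

# stratify keeps the share of True labels (nearly) equal in both parts
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=1, stratify=toy["all_nba"]
)

print(toy_train["all_nba"].mean(), toy_test["all_nba"].mean())
```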

To check that the categorical variables were encoded correctly, we can print the value counts for each of them in the testing and training feature sets. As we can see, the counts add up correctly: for example, rookie_tm code 0.0 appears 35 + 128 = 163 times, matching the 163 '1st' selections counted earlier.

In [7]:
print(X_test['nba_tm'].value_counts())
print(X_train['nba_tm'].value_counts())
print(X_test['defense_tm'].value_counts())
print(X_train['defense_tm'].value_counts())
print(X_test['rookie_tm'].value_counts())
print(X_train['rookie_tm'].value_counts())

nba_tm
2.0    483
Name: count, dtype: int64
nba_tm
2.0    1918
0.0       1
1.0       1
Name: count, dtype: int64
defense_tm
2.0    483
Name: count, dtype: int64
defense_tm
2.0    1918
1.0       2
Name: count, dtype: int64
rookie_tm
2.0    408
1.0     40
0.0     35
Name: count, dtype: int64
rookie_tm
2.0    1668
0.0     128
1.0     124
Name: count, dtype: int64


To prepare further for modeling, we convert the datasets to numpy arrays. We also print their shapes to ensure the process has produced the correct results.

In [9]:
X_train = np.array(X_train)
y_train = np.array(y_train)
X_test = np.array(X_test)
y_test = np.array(y_test)

In [10]:
print('X_train Shape:', X_train.shape)
print('y_train Shape:', y_train.shape)
print('X_test Shape:', X_test.shape)
print('y_test Shape:', y_test.shape)

X_train Shape: (1920, 50)
y_train Shape: (1920,)
X_test Shape: (483, 50)
y_test Shape: (483,)


## First Model: Random Forest Classifier

To produce a random forest classifier that projects which rookies will become All-NBA players, we first determine how many estimators to include. For this, we loop through values from 1 to 100, using 5-fold cross-validation to ascertain which value yields the highest mean accuracy.

In [11]:
N = 100
scores = np.zeros(N)
best_score = -np.inf

for n in range(1, N + 1):
    m = RandomForestClassifier(n_estimators = n)
    scores[n - 1] = cross_val_score(m, X_train, y_train, cv = 5).mean()
    if scores[n - 1] > best_score:
        best_score = scores[n - 1]
        best_n = n
        
best_n, best_score

(29, 0.9619791666666666)

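In this run, the best cross-validated accuracy (about 0.962) came with 29 estimators. Because the forests inside the loop are not seeded, this number can change from run to run. Let's construct a random forest classifier model with this parameter.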

In [12]:
m = RandomForestClassifier(n_estimators = best_n, random_state = 1)
m.fit(X_train, y_train)
m.score(X_test, y_test)

0.94824016563147

This model is 94.8% accurate on the test data. While this sounds quite good, very few players ever reach an All-NBA team, so accuracy alone can be misleading. Let's see how strong this result actually is.

In [13]:
sum(~y_test)

458

In [14]:
cnf_matrix = metrics.confusion_matrix(y_test, m.predict(X_test))
cnf_matrix

array([[451,   7],
       [ 18,   7]], dtype=int64)

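The confusion matrix illustrates the shortcomings of this model. Of the 14 players in the test data projected to be All-NBA players, only half actually reached that level. Moreover, of the 25 players in the test set who actually did become All-NBA players, 72% (18) were incorrectly projected to fall short. In fact, the model classifies 451 + 7 = 458 of the 483 players correctly, exactly the number of non-All-NBA players counted above, so its accuracy is no better than always predicting that a player will not make an All-NBA team.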

While it is obviously difficult to project a career outcome from just one season of performance, these results can likely be improved upon with a different model.
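
The figures read off the confusion matrix above can be reproduced with a few lines of arithmetic. The sketch below hard-codes the reported matrix rather than recomputing it from the model, and the variable names are illustrative.

```python
import numpy as np

# Confusion matrix reported above: rows are actual classes, columns are predicted
cnf = np.array([[451, 7],
                [18, 7]])
tn, fp, fn, tp = cnf.ravel()

accuracy = (tp + tn) / cnf.sum()           # share of all players classified correctly
precision = tp / (tp + fp)                 # projected All-NBA players who made it
recall = tp / (tp + fn)                    # actual All-NBA players the model found
majority_baseline = (tn + fp) / cnf.sum()  # accuracy of always predicting "not All-NBA"

print(round(accuracy, 3), round(precision, 3), round(recall, 3),
      round(majority_baseline, 3))
```

Precision comes out at 0.5 and recall at 0.28, while accuracy and the majority-class baseline are both 458/483, which is why accuracy alone says little here.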

## Second Model: Random Forest Regressor

Let's try projecting the probability that players will reach an All-NBA team rather than simply classifying them.

First, consider the baseline mean absolute error of a model that assigns every player the same probability: the average rate at which players make an All-NBA team.

Information on how to construct a Random Forest Regressor Model was found in a Towards Data Science post by Will Koehrsen, which can be found here: https://towardsdatascience.com/random-forest-in-python-24d0893d51c0

In [36]:
baseline = 1 - sum(~y_test)/y_test.shape[0]
print('Average All-NBA probability: ', round(baseline, 3))
baseline_error = abs(np.full(y_test.shape[0], baseline) - y_test)
print('Average baseline error: ', round(np.mean(baseline_error), 3))

Average All-NBA probability:  0.052
Average baseline error:  0.098


We see that 5.2 percent of the players in the test dataset actually made an All-NBA team. Using this value as every player's projected probability would result in a mean absolute error of 0.098.
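
The baseline error has a simple closed form. If a share p of players are positives and every player is assigned probability p, the positives each contribute an error of 1 - p and the negatives each contribute p, so the mean absolute error is p(1 - p) + (1 - p)p = 2p(1 - p). The sketch below checks this against the direct calculation, rebuilding a label vector from the counts reported above (483 test players, 25 of them All-NBA); the variable names are illustrative.

```python
import numpy as np

n_players = 483      # test-set size reported above
n_all_nba = 25       # 483 - 458 players who reached an All-NBA team

# Rebuild a label vector with the same composition as y_test
labels = np.zeros(n_players, dtype=bool)
labels[:n_all_nba] = True

p = labels.mean()                         # constant prediction = positive rate
mae_direct = np.mean(np.abs(p - labels))  # same calculation as the cell above
mae_formula = 2 * p * (1 - p)             # closed form for a constant-rate guess

print(round(p, 3), round(mae_direct, 3), round(mae_formula, 3))
```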

Now let's train a simple random forest regressor model on the training data to see how it performs on the test set.

In [16]:
m2 = RandomForestRegressor(n_estimators = 1000, random_state = 1)
m2.fit(X_train, y_train)

In [17]:
predictions = m2.predict(X_test)
errors = abs(predictions - y_test)
print('Average error:', round(np.mean(errors), 3))

Average error: 0.079


We see that this model yields an average error of 0.079. This is somewhat better than the baseline error of 0.098.

In [18]:
calibration_error = round(100 * (np.mean(predictions) / (1 - sum(~rookies['all_nba'])/rookies.shape[0]) - 1), 1)
descriptor = "many" if calibration_error >= 0 else "few"
print("The model predicts ", abs(calibration_error), "% too ", descriptor, " All-NBA players")

The model predicts  44.3 % too  many  All-NBA players


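Unfortunately, the model appears to be poorly calibrated. Its mean predicted probability on the test set is 44.3% higher than the All-NBA rate in the full rookies data. Note that this benchmark rate is computed over all of rookies, including the post-2020 rookies excluded from modeling because they have not yet had five seasons, which may inflate the apparent overestimate.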
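
One alternative way to frame the calibration check is to compare the mean prediction with the observed All-NBA rate among the same players being predicted. The helper below is a minimal sketch of that idea; `relative_calibration_error` is a hypothetical name, and the four-player example is made up.

```python
import numpy as np

def relative_calibration_error(predictions, labels):
    """Percent by which the mean prediction exceeds the observed positive rate."""
    predicted_rate = np.mean(predictions)
    observed_rate = np.mean(labels)
    return 100 * (predicted_rate / observed_rate - 1)

# Made-up example: 4 players, 1 of whom reached an All-NBA team
toy_labels = np.array([True, False, False, False])
toy_predictions = np.array([0.6, 0.2, 0.1, 0.3])   # mean 0.3 vs. observed 0.25

print(round(relative_calibration_error(toy_predictions, toy_labels), 1))
```

In this notebook it could be called as `relative_calibration_error(predictions, y_test)`, which would benchmark the model against the test-set rate rather than the rate in the full table. That result is not shown here because it has not been run.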

## Third Model: Improved Random Forest Regressor

Let's attempt to improve this model by tuning its hyperparameters. We can do this with a randomized search: define a grid of candidate values, sample combinations from it, and construct a model with the combination that scores best under cross-validation.

Information on how to tune these hyperparameters was found in a subsequent post by Will Koehrsen on Towards Data Science. The post can be found here: https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74

In [19]:
n_estimators = [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]
max_features = [1.0, 'sqrt']
max_depth = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None]
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]
random_grid = {'n_estimators': n_estimators, 'max_features': max_features, 'max_depth': max_depth,
               'min_samples_split': min_samples_split, 'min_samples_leaf': min_samples_leaf, 'bootstrap': bootstrap}

In [20]:
m3 = RandomForestRegressor()
m3_rs = RandomizedSearchCV(estimator = m3, param_distributions = random_grid, n_iter = 100, cv = 3, random_state = 1, n_jobs = -1)
m3_rs.fit(X_train, y_train)

In [21]:
m3_rs.best_params_

{'n_estimators': 600,
 'min_samples_split': 2,
 'min_samples_leaf': 4,
 'max_features': 'sqrt',
 'max_depth': 70,
 'bootstrap': True}

After sampling 100 parameter combinations with 3-fold cross-validation, the search reports the best-performing set. Let's see how a model refit with these values fares on the test set.

In [22]:
predictions = m3_rs.predict(X_test)
errors = abs(predictions - y_test)
print('Average error:', round(np.mean(errors), 3))

Average error: 0.077


The average error is 0.077, which is only marginally better than the previous model's performance.
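
One thing worth noting is that the search above uses the default scoring, which for a regressor is R², while the models here are judged by mean absolute error. The sketch below shows how a search could select on MAE directly through the `scoring` argument. It runs on synthetic data with a deliberately tiny grid, both of which are assumptions for illustration, and it is not the search run in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumed): 200 rows, 5 features, a rare positive label
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(200, 5))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=200) > 1.5).astype(float)

# Deliberately tiny grid so the sketch runs quickly
toy_grid = {
    "n_estimators": [20, 50],
    "min_samples_leaf": [1, 4],
    "max_features": [1.0, "sqrt"],
}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=1),
    param_distributions=toy_grid,
    n_iter=4,
    cv=3,
    scoring="neg_mean_absolute_error",  # select on MAE instead of the default R^2
    random_state=1,
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print("Cross-validated MAE:", round(-search.best_score_, 3))
```

scikit-learn reports the score as negative MAE so that larger is always better, hence the sign flip when printing.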

In [23]:
calibration_error = round(100 * (np.mean(predictions) / (1 - sum(~rookies['all_nba'])/rookies.shape[0]) - 1), 1)
descriptor = "many" if calibration_error >= 0 else "few"
print("The model predicts ", abs(calibration_error), "% too ", descriptor, " All-NBA players")

The model predicts  30.6 % too  many  All-NBA players


A larger improvement is evident regarding the calibration. The overestimation of the number of All-NBA players drops from 44.3 percent to 30.6 percent.

## Fourth Model: Further Improved Random Forest Regressor

Let's try to further improve this model by running an exhaustive grid search over a narrower range of hyperparameter values centered on the best parameters found above.

In [28]:
param_grid = {
    'bootstrap': [True],
    'max_depth': [60, 70, 80],
    'max_features': ['sqrt'],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [2, 4, 6],
    'n_estimators': [500, 600, 700],
}
# Create a base model
m4 = RandomForestRegressor()
# Instantiate the grid search model
m4_gs = GridSearchCV(estimator = m4, param_grid = param_grid, cv = 3, n_jobs = -1)

In [29]:
m4_gs.fit(X_train, y_train)
m4_gs.best_params_

{'bootstrap': True,
 'max_depth': 70,
 'max_features': 'sqrt',
 'min_samples_leaf': 5,
 'min_samples_split': 2,
 'n_estimators': 500}

In [30]:
m4_best = m4_gs.best_estimator_
predictions = m4_best.predict(X_test)
errors = abs(predictions - y_test)
print('Average error:', round(np.mean(errors), 3))

Average error: 0.077


We see that with the new slate of hyperparameters, the average error remains at 0.077.

In [31]:
calibration_error = round(100 * (np.mean(predictions) / (1 - sum(~rookies['all_nba'])/rookies.shape[0]) - 1), 1)
descriptor = "many" if calibration_error >= 0 else "few"
print("The model predicts ", abs(calibration_error), "% too ", descriptor, " All-NBA players")

The model predicts  29.7 % too  many  All-NBA players


The calibration error also remains quite similar, at 29.7 percent versus 30.6 percent. This suggests that this model's performance is about as strong as can be attained with the data on hand and this modeling method.