In this notebook we fit an XGBoost (extreme gradient boosting) model to classify All-NBA status.

### Model

XGBoost is a more recent classification algorithm that also uses an ensemble of learners. As with Random Forests, we build decision trees, but rather than each tree being independent, each successive tree tries to improve on its predecessors. It does this by fitting to the residuals (more precisely, the gradient of the loss) left by the trees built so far, picking up any patterns that remain. Boosting stops after a fixed number of trees (`n_estimators`); ideally, by then the residuals look random and no further patterns can be found. As with Random Forests, there are quite a few hyperparameters to tune, so a randomized cross-validated search is used.
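
The residual-fitting idea can be sketched with a toy example. The following is a minimal illustration of gradient boosting for squared error using one-dimensional decision stumps. The data and the helper names (`fit_stump`, `boost`) are made up for illustration and are not part of this notebook's pipeline; XGBoost itself adds regularization, second-order gradient information, and a log-loss objective for classification.

```python
# Toy gradient boosting for squared error with 1-D decision stumps.
# Data and helper names are illustrative assumptions, not the notebook's code.

def fit_stump(xs, residuals):
    """Return the single-threshold split that best fits the residuals."""
    best = None
    # Skip the largest x so the right-hand side of a split is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Fit stumps one after another, each to the current residuals."""
    base = sum(ys) / len(ys)          # start from the mean prediction
    preds = [base] * len(ys)
    stumps, mse_history = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Take a shrunken step toward the residuals.
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
        mse_history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, stumps, mse_history

xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0.1, 0.0, 0.2, 0.1, 0.9, 1.0, 0.8, 1.1]
base, stumps, mse_history = boost(xs, ys)
print(mse_history[0], mse_history[-1])  # training error shrinks as trees are added
```

Each round only sees what the previous rounds failed to explain, which is why the training error cannot increase from one round to the next in this squared-error setting.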

In [1]:
import pandas as pd

nba_train = pd.read_csv('Data_Scripting_Cleaning/Full_data/Training_Sets/nba_train.csv')
nba_test = pd.read_csv('Data_Scripting_Cleaning/Full_data/Test_Sets/nba_test.csv')

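# Use the lowest minutes and games played by any All-NBA selection in the training set as cutoffs
min_minutes = nba_train[(nba_train['all_nba_c_year']==1)].MP.min()
min_G = nba_train[(nba_train['all_nba_c_year']==1)].G.min()
# Apply the same training-derived cutoffs to both the training and test sets
nba_filt_train = nba_train[(nba_train['MP']>=min_minutes) & (nba_train['G']>=min_G)]
nba_filt_test = nba_test[(nba_test['MP']>=min_minutes) & (nba_test['G']>=min_G)]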

y_train = nba_filt_train['all_nba_c_year']

y_test = nba_filt_test['all_nba_c_year']

In [2]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Preprocess the data for the XGBoost model: numeric features are passed through unchanged (no scaling) and the team column is one-hot encoded.

num_features = ['Age','G', 'GS', 'MP', 'FG', 'FGA', 'FG%', '3P', '3PA', '3P%',
       '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB', 'DRB', 'TRB',
       'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS', 'PER', 'TS%', '3PAr', 'FTr',
       'ORB%', 'DRB%', 'TRB%', 'AST%', 'STL%', 'BLK%', 'TOV%', 'USG%', 'OWS',
       'DWS', 'WS', 'WS/48', 'OBPM', 'DBPM', 'BPM', 'VORP', 'W',
       'num_all_nba']
cat_features = ['Tm']


preprocessor = ColumnTransformer(
    [("select", "passthrough", num_features),
     ("ohe", OneHotEncoder(handle_unknown="ignore"), cat_features)],
     remainder="drop"
)

preprocessor.fit(nba_filt_train)

X_train = pd.DataFrame(preprocessor.transform(nba_filt_train), columns = preprocessor.get_feature_names_out())
X_test = pd.DataFrame(preprocessor.transform(nba_filt_test), columns = preprocessor.get_feature_names_out())

In [3]:
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import classification_report


# Weight the positive class by the ratio of negatives to positives to offset class imbalance
scale_pos = (len(y_train) - sum(y_train)) / sum(y_train)
xgb_model = xgb.XGBClassifier(random_state=0, scale_pos_weight=scale_pos)
param_grid = {
    'n_estimators': [100, 200, 300, 400, 500],
    'max_depth': [5, 10, 15, 20, 25],
    'learning_rate': [0.01, 0.05, 0.1, 0.15, 0.2, 0.3],
    'min_child_weight': [1, 2, 5, 10, 15],
    'gamma': [0, 0.1, 0.2, 0.3, 0.4],
    'colsample_bytree': [0.3, 0.4, 0.5, 0.6, 0.7],
    'subsample': [0.3, 0.4, 0.5, 0.6, 0.7, 1]
}
# 500 random draws from the grid, each scored with 5-fold CV (default scoring: accuracy)
model = RandomizedSearchCV(estimator=xgb_model, param_distributions=param_grid, cv=5, n_iter=500,
                           random_state=0
                           )


In [4]:
model.fit(X_train, y_train)

In [10]:
model.best_params_

{'subsample': 1,
 'n_estimators': 200,
 'min_child_weight': 2,
 'max_depth': 20,
 'learning_rate': 0.1,
 'gamma': 0.3,
 'colsample_bytree': 0.6}
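
To put the randomized search in perspective, the short calculation below counts how many combinations the grid contains and how much of it 500 draws cover. The grid is copied from the search cell so the snippet runs on its own; the variable names `n_combinations`, `n_iter`, and `cv` are just for this illustration.

```python
from math import prod

# Same grid as in the search cell, repeated so this snippet is self-contained.
param_grid = {
    'n_estimators': [100, 200, 300, 400, 500],
    'max_depth': [5, 10, 15, 20, 25],
    'learning_rate': [0.01, 0.05, 0.1, 0.15, 0.2, 0.3],
    'min_child_weight': [1, 2, 5, 10, 15],
    'gamma': [0, 0.1, 0.2, 0.3, 0.4],
    'colsample_bytree': [0.3, 0.4, 0.5, 0.6, 0.7],
    'subsample': [0.3, 0.4, 0.5, 0.6, 0.7, 1],
}

# Size of the full grid: product of the number of options per hyperparameter.
n_combinations = prod(len(values) for values in param_grid.values())
n_iter, cv = 500, 5

print(n_combinations)                     # 112500
print(f"{n_iter / n_combinations:.2%}")   # 0.44%
print(n_iter * cv)                        # 2500 cross-validation fits
```

So the search evaluates well under 1% of the grid, which is the trade-off of a randomized search: an exhaustive grid search over the same space would need 112,500 candidates times 5 folds.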

### Results

This algorithm gives the highest recall of the algorithms tested, at the cost of slightly lower precision. Even so, it achieves the highest F1 score so far on the All-NBA class: 0.81.

In [5]:
print(classification_report(y_test, model.predict(X_test), target_names=['Not All-NBA', 'All-NBA']))

              precision    recall  f1-score   support

 Not All-NBA       0.98      0.96      0.97       819
     All-NBA       0.77      0.85      0.81       123

    accuracy                           0.95       942
   macro avg       0.87      0.90      0.89       942
weighted avg       0.95      0.95      0.95       942



In [6]:
pd.crosstab(y_test, model.predict(X_test), rownames=['Actual'], colnames=['Predicted'])

Predicted    0    1
Actual
0          788   31
1           19  104
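
As a sanity check, the All-NBA row of the classification report can be recomputed by hand from the confusion matrix above. This is plain arithmetic on the four counts; the variable names are chosen here for illustration.

```python
# Counts taken from the confusion matrix above (rows: actual, columns: predicted).
tn, fp = 788, 31    # actual Not All-NBA: predicted 0, predicted 1
fn, tp = 19, 104    # actual All-NBA:     predicted 0, predicted 1

precision = tp / (tp + fp)                          # share of predicted All-NBA that were correct
recall = tp / (tp + fn)                             # share of actual All-NBA that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
# precision=0.77 recall=0.85 f1=0.81 accuracy=0.95
```

These match the reported values, and they show where the errors sit: 31 false positives pull precision down, while only 19 All-NBA players are missed.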


For our feature importance, we plot the nonzero importances from the best estimator found by the search.

In [7]:
import altair as alt

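# Feature importances from the best estimator found by the search
coef_df = pd.DataFrame({'coefs': model.best_estimator_.feature_importances_,
                        'features': X_train.columns})
# Keep only features with nonzero importance
coef_df_nz = coef_df[coef_df['coefs'] != 0]

# Bar chart of importances, sorted from most to least important
alt.Chart(coef_df_nz).mark_bar().encode(
    y='coefs',
    x=alt.X('features', sort='-y'))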




In [8]:
# False positives: players the model predicted as All-NBA who were not selected
nba_filt_test[(model.predict(X_test)==1) & (y_test!=1)][['Player',"year"]]

      Player            year
463   Ja Morant         2023
577   James Harden      2023
676   John Wall         2015
851   James Harden      2021
960   Devin Booker      2020
1152  Chauncey Billups  2008
1187  Kevin Johnson     1997
1250  John Stockton     2000
1552  Paul Pierce       2011
1642  Kiki Vandeweghe   1983


In [9]:
# False negatives: All-NBA selections the model failed to predict
nba_filt_test[(model.predict(X_test)!=1) & (y_test==1)][['Player',"year"]]

      Player            year
23    Gary Payton       1994
284   Joe Dumars        1990
314   Mitch Richmond    1996
717   Isiah Thomas      1987
1010  Chauncey Billups  2009
1055  Tim Hardaway      1993
1093  Stephon Marbury   2003
1446  Ben Simmons       2020
1462  Mark Price        1989
1463  Eddie Jones       2000
