In [168]:
import pandas as pd
import numpy as np
import altair as alt

# Data

In [169]:
nba_train = pd.read_csv('Data_Scripting_Cleaning/Full_data/Training_Sets/nba_szn_train.csv')
nba_test = pd.read_csv('Data_Scripting_Cleaning/Full_data/Test_Sets/nba_szn_test.csv')

First we can examine the proportions of All-NBA in our dataset. We created the test and training set so that 9 random seasons were in the test set, and the rest were in training. This means our proportions should be similar for the classes in each set.

For the training set we have:

In [None]:
#Proportions of all_nba_c_year
nba_train['all_nba_c_year'].value_counts(normalize=True)

0    0.967329
1    0.032671
Name: all_nba_c_year, dtype: float64

For the test set we have:

In [None]:
nba_test['all_nba_c_year'].value_counts(normalize=True)

0    0.96696
1    0.03304
Name: all_nba_c_year, dtype: float64

Both sets have similar proportions, as expected, but the dataset is clearly very unbalanced: only about 3% of player-seasons are All-NBA. To account for this, we may tune the `class_weight` parameter, which places more weight on the All-NBA class.
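As a rough reference point for the size of such weights, a common heuristic (the one scikit-learn documents for `class_weight='balanced'`) sets each class weight to `n_samples / (n_classes * class_count)`. The sketch below applies that formula to the training proportions printed above. The helper name `balanced_weights` is ours and purely illustrative; it is not part of the analysis pipeline.

```python
def balanced_weights(proportions):
    """Weight each class by 1 / (n_classes * class_proportion).

    This equals n_samples / (n_classes * class_count) when each
    proportion is class_count / n_samples.
    """
    n_classes = len(proportions)
    return {label: 1.0 / (n_classes * p) for label, p in proportions.items()}


# Training-set proportions reported above.
train_props = {0: 0.967329, 1: 0.032671}
weights = balanced_weights(train_props)

# Relative up-weighting of the All-NBA class versus the majority class.
ratio = weights[1] / weights[0]
print(weights, round(ratio, 1))
```

Under this heuristic the All-NBA class would be weighted roughly 30 times more heavily than the majority class, which gives a sense of scale when choosing a range of weights to tune over.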

### Data Filtering

We will filter our data on minutes played (`MP`) and games played (`G`), since these awards go to the best players in the NBA, and the best players tend to play a lot. The new 2023 CBA (collective bargaining agreement) sets a minimum game requirement (65 games) that must be met in order to win All-NBA. However, since this rule was not in place for the prior awards that this data comes from, we instead keep only players whose minutes and games played are at least as large as the lowest values recorded by any All-NBA winner in the training set.

In [None]:
min_minutes = nba_train[(nba_train['all_nba_c_year']==1)].MP.min()
min_G = nba_train[(nba_train['all_nba_c_year']==1)].G.min()
nba_filt_train = nba_train[(nba_train['MP']>=min_minutes) & (nba_train['G']>=min_G)]
nba_filt_test = nba_test[(nba_test['MP']>=min_minutes) & (nba_test['G']>=min_G)]

y_train = nba_filt_train['all_nba_c_year']

y_test = nba_filt_test['all_nba_c_year']

First we fit a simpler model: given their current-season stats, which players will make All-NBA and which will not? In this case we ignore the specific team (1st, 2nd, or 3rd) and focus only on the binary indicator. Predicted teams could later be extracted by ordering the probabilities and constructing the teams from that ranking.

# Logistic Model

First we can describe the model of interest under a statistical framework. Denote the following quantities:
1) y: an n x 1 vector containing the binary variable of interest
2) X: an n x (p+1) matrix consisting of the feature variables and an intercept column
3) $\beta$: a (p+1) x 1 vector of coefficients

We will also use the following functions:

1) $logit(p) = log(\frac{p}{1-p})$; this is often denoted as the log-odds
2) $expit(x) = \frac{1}{1+exp(-x)}$; this is the inverse function of logit, i.e. $expit(x) = logit^{-1}(x)$
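The two functions can be written directly with the standard library. This is a minimal sketch for intuition only (SciPy ships equivalents in `scipy.special`); it checks numerically that `expit` undoes `logit`.

```python
import math


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def expit(x):
    """Inverse of logit: maps any real x to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


for p in (0.03, 0.5, 0.9):
    # Round trip: expit(logit(p)) should recover p up to floating-point error.
    assert abs(expit(logit(p)) - p) < 1e-12

print(logit(0.5), expit(0.0))
```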

The model we will be utilizing is:
$$
y_i \sim Bernoulli(p_i = expit(x_i^{T}\beta))
$$

Where $x_i$ is the i'th row of the feature matrix X.

If we assume independence (which does not strictly hold here, since a player's performance is correlated across seasons, but we will disregard this for now), then we have a likelihood function of the form:

$$
L(\beta) = \prod_{i=1}^n (expit(x_i^{T}\beta))^{y_i}\times (1-expit(x_i^{T}\beta))^{1-y_i}
$$

This leads to a log-likelihood function (the negative of our unregularized objective function) of:

$$
\begin{aligned}
\ell(\beta) &= \sum_{i=1}^n \left[ y_i\,log(expit(x_i^{T}\beta)) + (1-y_i)\,log(1-expit(x_i^{T}\beta)) \right] \\
&= \sum_{i=1}^n \left[ y_i(x_i^T\beta)-log(1+exp(x_i^T\beta)) \right]
\end{aligned}
$$
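The second line of the log-likelihood follows from $log(expit(t)) = t - log(1+exp(t))$ and $log(1-expit(t)) = -log(1+exp(t))$. As a sanity check, the sketch below evaluates both forms on a small made-up dataset (the values of `X`, `y`, and `beta` are arbitrary illustrations, not NBA data) and confirms they agree.

```python
import math


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def loglik_bernoulli(beta, X, y):
    """First form: sum of y*log(p) + (1-y)*log(1-p), with p = expit(x^T beta)."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        p = expit(sum(b * v for b, v in zip(beta, x_i)))
        total += y_i * math.log(p) + (1 - y_i) * math.log(1.0 - p)
    return total


def loglik_simplified(beta, X, y):
    """Second form: sum of y*(x^T beta) - log(1 + exp(x^T beta))."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        eta = sum(b * v for b, v in zip(beta, x_i))
        total += y_i * eta - math.log(1.0 + math.exp(eta))
    return total


# Made-up toy data: the first column is the intercept.
X = [[1.0, 0.2], [1.0, -1.5], [1.0, 2.3], [1.0, 0.7]]
y = [0, 0, 1, 1]
beta = [-0.5, 1.2]

form_1 = loglik_bernoulli(beta, X, y)
form_2 = loglik_simplified(beta, X, y)
print(form_1, form_2)
```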



Thus we will be finding:

$$
\underset{\beta}{min}\sum_{i=1}^n -y_i(x_i^T\beta)+log(1+exp(x_i^T\beta)) + r(\beta)
$$

where $r(\beta)$ is a regularization term.

For the L1 regularizer, scikit-learn specifically minimizes:
$$
\underset{\beta}{min} \ C\sum_{i=1}^n -y_i(x_i^T\beta)+log(1+exp(x_i^T\beta)) + \sum_{i=0}^p|\beta_i|
$$

where $C>0$ is the inverse regularization strength: it scales the data-fit term, so smaller values of C mean stronger regularization. This is a hyperparameter we must tune.
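To make the role of C concrete, the sketch below evaluates the L1-penalized objective as written above on a small made-up dataset. The function name `l1_objective` and the toy values are illustrative assumptions; this is not how scikit-learn computes or optimizes the objective internally.

```python
import math


def l1_objective(beta, X, y, C):
    """C * (negative log-likelihood) + sum of |beta_j|."""
    nll = 0.0
    for x_i, y_i in zip(X, y):
        eta = sum(b * v for b, v in zip(beta, x_i))
        nll += -y_i * eta + math.log(1.0 + math.exp(eta))
    penalty = sum(abs(b) for b in beta)
    return C * nll + penalty


# Made-up toy data: the first column is the intercept.
X = [[1.0, 0.2], [1.0, -1.5], [1.0, 2.3], [1.0, 0.7]]
y = [0, 0, 1, 1]
beta = [-0.5, 1.2]   # a coefficient vector that fits the toy data reasonably
zero = [0.0, 0.0]    # the all-zero vector, which pays no penalty

for C in (0.01, 1.0, 100.0):
    print(C, l1_objective(beta, X, y, C), l1_objective(zero, X, y, C))
```

Because C multiplies the data-fit term, a small C lets the penalty dominate, so the all-zero vector scores better than the fitted one; with a large C the fit dominates and the nonzero coefficients win.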

One issue with this model, however, is that it does not know that only 15 players (10 for year $\leq$ 1988) are selected for All-NBA each season. So for our predictions, we will take, within each season, the top 6 players by predicted probability in the G category, the top 6 in F, and the top 3 in C (4, 4, 2 for year $\leq$ 1988). For cross-validation (for tuning C), we will use the F1 score of these modified predictions.

### Fitting Logistic Model to NBA Data

Since we have already standardized our variables, we only need to one-hot encode the categorical variable, `Tm`. We will use a pipeline for ease of use.

We will tune C and the class weights over the parameter grid defined below, selecting the best combination by k-fold cross-validation with k=5.

In [173]:
#Create custom scoring function for CV

def predicted_all_nba(nba_test_df, model):
    test_df = nba_test_df.copy()
    test_df['prob_all_nba'] = model.predict_proba(test_df)[:,1]
    test_df['pred_all_nba'] = 0
    for year in test_df['year'].unique():
        #10 selections (4 G, 4 F, 2 C) for year <= 1988, 15 (6 G, 6 F, 3 C) afterwards
        slots = {'G': 4, 'F': 4, 'C': 2} if year <= 1988 else {'G': 6, 'F': 6, 'C': 3}
        in_year = test_df['year']==year

        #Take the most probable players at each position for this season
        all_nba_players = []
        for position, n_slots in slots.items():
            ranked = test_df[in_year & (test_df['Position']==position)].sort_values(by='prob_all_nba', ascending=False)
            all_nba_players += ranked.head(n_slots)['Player'].tolist()

        test_df.loc[in_year & test_df['Player'].isin(all_nba_players), 'pred_all_nba'] = 1

    return test_df[['Player', 'year', 'prob_all_nba','pred_all_nba','all_nba_c_year']]

from sklearn.metrics import f1_score

def all_nba_f1(model, X, y):
    y_pred = predicted_all_nba(X, model)['pred_all_nba']
    return f1_score(y, y_pred)

In [174]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

num_features = ['Age','G', 'GS', 'MP', 'FG', 'FGA', 'FG%', '3P', '3PA', '3P%',
       '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB', 'DRB', 'TRB',
       'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS', 'PER', 'TS%', '3PAr', 'FTr',
       'ORB%', 'DRB%', 'TRB%', 'AST%', 'STL%', 'BLK%', 'TOV%', 'USG%', 'OWS',
       'DWS', 'WS', 'WS/48', 'OBPM', 'DBPM', 'BPM', 'VORP', 'W',
       'num_all_nba']

cat_features = ['Tm']

#Now I will create a ColumnTransformer that passes num_features through and applies OHE to cat_features
ct = ColumnTransformer(
    [("select", "passthrough", num_features),
     ("ohe", OneHotEncoder(handle_unknown="ignore"), cat_features)],
     remainder="drop"
)



clf = Pipeline([
    ("col_transform", ct),
    ("classifier", LogisticRegression(penalty = 'l1', solver = 'liblinear', 
                                      max_iter = 10000, random_state=0
                                      ))
])

#will create parameter grid for gridsearch for C and class_weight
#Setting the range for class weights
weights = np.linspace(0.1,0.5,10)

param_grid = {
    'classifier__C': np.logspace(-4, 4, 15),
    'classifier__class_weight': [{0: x, 1: 1.0-x} for x in weights]
}

#do gridsearch using all_nba_f1 as scoring metric
model = GridSearchCV(clf, param_grid, cv=5,              
                           scoring=all_nba_f1)
model.fit(nba_filt_train, y_train)

In [175]:
model.best_params_

{'classifier__C': 1.0, 'classifier__class_weight': {0: 0.1, 1: 0.9}}
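To put the selected values in context, the sketch below rebuilds the two grids in plain Python (mirroring `np.logspace(-4, 4, 15)` and `np.linspace(0.1, 0.5, 10)` from the grid search cell), counts the fits the search performs, and checks where the reported best values sit within each grid. The variable names are ours.

```python
# Plain-Python equivalents of the two grids used in the search.
C_grid = [10 ** (-4 + 8 * i / 14) for i in range(15)]
w_grid = [0.1 + 0.4 * i / 9 for i in range(10)]

n_candidates = len(C_grid) * len(w_grid)   # hyperparameter combinations
n_fits = n_candidates * 5                  # one fit per candidate per CV fold

best_C, best_w0 = 1.0, 0.1                 # values reported by best_params_

print(n_candidates, n_fits)
print("C at grid edge:", best_C in (C_grid[0], C_grid[-1]))
print("class-0 weight at grid edge:", abs(best_w0 - w_grid[0]) < 1e-12)
print("All-NBA weight relative to class 0:", round((1 - best_w0) / best_w0))
```

The selected C lies in the interior of its grid, but the selected class weight sits on the lower boundary of the weight grid, so it may be worth extending the range below 0.1 in a follow-up search. Note that `GridSearchCV` also refits the best candidate on the full training set by default, on top of the cross-validation fits counted here.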

### Logistic Results

We now have the following metrics for our model.

In [176]:
y_preds = predicted_all_nba(nba_filt_test, model)['pred_all_nba']

In [177]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           0       0.97      0.97      0.97       809
           1       0.82      0.82      0.82       120

    accuracy                           0.95       929
   macro avg       0.89      0.89      0.89       929
weighted avg       0.95      0.95      0.95       929



We see that our precision is .82, indicating that 82% of the model's predicted positives were correct. The recall, also .82, indicates that 82% of the All-NBA players in the test set were classified as All-NBA.

Looking at the confusion matrix for these results, we see that we had 22 false positives and 22 false negatives. These are not ideal, but considering the size of our dataset, the model may still work for our purposes.

In [178]:
pd.crosstab(y_test,  y_preds, rownames=["Actual"], colnames=["Predicted"])

Predicted    0   1
Actual
0          787  22
1           22  98
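The class-1 metrics in the classification report can be recovered by hand from the crosstab counts. A minimal worked example, with the four counts copied from the table above:

```python
# Counts read from the crosstab (rows = actual, columns = predicted).
tn, fp = 787, 22
fn, tp = 22, 98

precision = tp / (tp + fp)   # share of predicted All-NBA that were correct
recall = tp / (tp + fn)      # share of actual All-NBA that were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 2))
```

False positives and false negatives are equal here, which is what we would expect when the selection step fixes the number of predicted All-NBA players per season to match the number of actual winners; precision and recall therefore coincide.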


We also see that our L1 penalty reduced the number of covariates with nonzero coefficients from 79 to 32.

In [179]:
#Now we will extract the feature names from the pipeline
feature_names = model.best_estimator_.named_steps['col_transform'].get_feature_names_out()
coef_df = pd.DataFrame({'coef':model.best_estimator_['classifier'].coef_[0]
                        ,'var':feature_names})
coef_df.shape

(79, 2)

In [180]:
coef_df_nz = coef_df[coef_df['coef']!=0]
coef_df_nz.shape

(32, 2)

We see from our coefficient plot that wins and win shares seem to have the largest positive impact on the log-odds of winning All-NBA, while FT% has the largest negative impact. This bears further study, but for the purposes of this project (prediction), we do not necessarily care about how the model weighs each of the features.

In [181]:
#Now we will make a bar chart of these coefficients
alt.Chart(coef_df_nz).mark_bar().encode(
    y=alt.Y('coef',title='Coefficient'),
    x=alt.X('var',title='Variable', sort = '-y'))

For the test set, the following players were predicted All-NBA but did not win it (false positives):

In [182]:
nba_filt_test[(y_preds==1) & (y_test!=1)][['Player',"year"]].sort_values(by='year', ascending=False)

Unnamed: 0,Player,year
86,Kevin Durant,2023
329,Anthony Davis,2023
511,James Harden,2023
3493,Pau Gasol,2012
3404,Derrick Rose,2012
3336,Tim Duncan,2012
3615,Josh Smith,2012
2719,Chauncey Billups,2008
2933,Carmelo Anthony,2008
2940,Allen Iverson,2008


For our False Negatives we have:

In [183]:
nba_filt_test[(y_preds!=1) & (y_test==1)][['Player',"year"]].sort_values(by='year', ascending=False)

Unnamed: 0,Player,year
361,De'Aaron Fox,2023
373,Domantas Sabonis,2023
494,Jaylen Brown,2023
3365,Tyson Chandler,2012
3362,Carmelo Anthony,2012
3263,Rajon Rondo,2012
3432,Dirk Nowitzki,2012
2869,Carlos Boozer,2008
2889,Tracy McGrady,2008
3059,Manu Ginóbili,2008


We can also examine what the 2023 predicted All-NBA team looks like versus the actual team.

In [184]:
true_2023_all_nba = nba_filt_test[(nba_filt_test['year']==2023) & (nba_filt_test['all_nba_c_year']==1)][['Player','Position','all_nba_tm']]

pred_2023_all_nba = nba_filt_test[(nba_filt_test['year']==2023) & (y_preds==1)][['Player','Position','all_nba_tm']]

In [185]:
true_2023_all_nba

Unnamed: 0,Player,Position,all_nba_tm
35,Jimmy Butler,F,2nd
153,Giannis Antetokounmpo,F,1st
175,Donovan Mitchell,G,2nd
224,Julius Randle,F,3rd
236,Damian Lillard,G,3rd
295,Stephen Curry,G,2nd
331,LeBron James,F,3rd
361,De'Aaron Fox,G,3rd
373,Domantas Sabonis,C,3rd
392,Shai Gilgeous-Alexander,G,1st


In [186]:
pred_2023_all_nba

Unnamed: 0,Player,Position,all_nba_tm
35,Jimmy Butler,F,2nd
86,Kevin Durant,F,
153,Giannis Antetokounmpo,F,1st
175,Donovan Mitchell,G,2nd
224,Julius Randle,F,3rd
236,Damian Lillard,G,3rd
295,Stephen Curry,G,2nd
329,Anthony Davis,C,
331,LeBron James,F,3rd
392,Shai Gilgeous-Alexander,G,1st
