In [None]:
import pandas as pd
import numpy as np
import sklearn as sk
import altair as alt

In [None]:
nba_train = pd.read_csv('Data_Scripting_Cleaning/Full_data/Training_Sets/nba_train.csv')
nba_test = pd.read_csv('Data_Scripting_Cleaning/Full_data/Test_Sets/nba_test.csv')

In this notebook we fit a logistic regression model with an L1 penalty to encourage coefficient sparsity, first on the original (unbalanced) data and then on balanced datasets.

### Data Filtering

In [None]:
min_minutes = nba_train[(nba_train['all_nba_c_year']==1)].MP.min()
min_G = nba_train[(nba_train['all_nba_c_year']==1)].G.min()
nba_filt_train = nba_train[(nba_train['MP']>=min_minutes) & (nba_train['G']>=min_G)]
nba_filt_test = nba_test[(nba_test['MP']>=min_minutes) & (nba_test['G']>=min_G)]




y_train = nba_filt_train['all_nba_c_year']

y_test = nba_filt_test['all_nba_c_year']
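The filter above keeps only players who meet the minimum minutes (`MP`) and games (`G`) observed among All-NBA players in the training set, and applies the same thresholds to the test set. A minimal sketch of that logic on made-up data (the values and the `toy_` names are hypothetical; only the column names come from this notebook):

```python
import pandas as pd

# Toy data (hypothetical values); column names mirror the notebook.
toy_train = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "MP": [2800, 1500, 900, 2400],
    "G": [78, 60, 30, 70],
    "all_nba_c_year": [1, 0, 0, 1],
})

# Thresholds are computed from the training set's All-NBA players only.
winners = toy_train[toy_train["all_nba_c_year"] == 1]
toy_min_mp = winners["MP"].min()
toy_min_g = winners["G"].min()

# Keep rows at or above both thresholds.
toy_filt = toy_train[(toy_train["MP"] >= toy_min_mp) & (toy_train["G"] >= toy_min_g)]
print(toy_filt["Player"].tolist())
```

Because the thresholds come from the training set alone, an All-NBA player in the test set who falls below them would be dropped before prediction.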

First we fit a simpler model: which players will make All-NBA, and which will not? Here we ignore which All-NBA team a player makes and focus only on the binary indicator. We will also see how the classifier does when we don't filter by position.

We will construct this model using a player's current-year stats and predict whether they make All-NBA in that same year.

### Fitting the Model

Since we have already standardized our variables, we only need to one-hot encode the categorical variable, `Tm`. We will use a pipeline for ease of use.

We will consider the following parameter grid for C and the class weights. We will select the best combination based on k-fold cross-validation, with k=5.

In [35]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

num_features = ['Age','G', 'GS', 'MP', 'FG', 'FGA', 'FG%', '3P', '3PA', '3P%',
       '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB', 'DRB', 'TRB',
       'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS', 'PER', 'TS%', '3PAr', 'FTr',
       'ORB%', 'DRB%', 'TRB%', 'AST%', 'STL%', 'BLK%', 'TOV%', 'USG%', 'OWS',
       'DWS', 'WS', 'WS/48', 'OBPM', 'DBPM', 'BPM', 'VORP', 'W',
       'num_all_nba']

cat_features = ['Tm']

#Now I will create a pipeline where I pass through num_features unchanged and apply OHE to cat_features
ct = ColumnTransformer(
    [("select", "passthrough", num_features),
     ("ohe", OneHotEncoder(handle_unknown="ignore"), cat_features)],
     remainder="drop"
)



clf = Pipeline([
    ("col_transform", ct),
    ("classifier", LogisticRegression(penalty = 'l1', solver = 'liblinear', 
                                      max_iter = 10000, random_state=0,
                                      ))
])

#will create parameter grid for gridsearch for C and class_weight
#Setting the range for class weights
weights = np.linspace(0.0,0.5,25)

param_grid = {
    'classifier__C': np.logspace(-4, 4, 20),
    'classifier__class_weight': [{0:x, 1:1.0-x} for x in weights]
}

model = GridSearchCV(clf, param_grid)

model.fit(nba_filt_train, y_train)


In [None]:
model.best_params_
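When `cv` and `scoring` are not passed, `GridSearchCV` uses 5-fold cross-validation (stratified for classifiers) and the estimator's default score, which for `LogisticRegression` is accuracy. With a rare positive class, accuracy can be dominated by the majority class. The sketch below makes both choices explicit and scores by F1 instead; the synthetic data, the small grid, and the `toy_` names are assumptions for illustration, not this notebook's setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced data standing in for the training set (about 10% positives).
X_toy, y_toy = make_classification(n_samples=400, n_features=10, n_informative=4,
                                   weights=[0.9, 0.1], random_state=0)

# A deliberately small grid so the search finishes quickly.
toy_grid = {
    "C": np.logspace(-2, 2, 5),
    "class_weight": [None, "balanced"],
}

search = GridSearchCV(
    LogisticRegression(penalty="l1", solver="liblinear", random_state=0),
    toy_grid,
    scoring="f1",  # score on the positive class rather than overall accuracy
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_toy, y_toy)
print(search.best_params_, round(search.best_score_, 3))
```

A grid of 20 values of C by 25 class weights means 500 candidates and 2,500 fits at k=5, so a coarser grid or `n_jobs=-1` may be worth considering if the search is slow.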

### Results

We now have the following metrics for our model.

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, model.predict(nba_filt_test)))

In [None]:
from sklearn.metrics import f1_score
f1_score(y_test, model.predict(nba_filt_test))

We see that our precision is .82, indicating that 82% of the model's predicted positives were correct. The recall of .76 indicates that only 76% of the All-NBA players in the dataset were classified as All-NBA.

Looking at the confusion matrix for these results, we see that we had 20 false positives and 30 false negatives. These are not ideal, but considering the size of our dataset, the model may still work for our purposes.

In [None]:
pd.crosstab(y_test, model.predict(nba_filt_test), rownames=["Actual"], colnames=["Predicted"])
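Precision and recall follow directly from the cells of a confusion matrix like the one above: precision is TP / (TP + FP) and recall is TP / (TP + FN). A small worked example on hypothetical labels (not this notebook's predictions):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels (hypothetical), where 1 = All-NBA.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found

print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
```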

We also see that our L1 penalty reduced the number of covariates from 79 to 34.

In [None]:
#Now we will extract the feature names from the pipeline
feature_names = model.best_estimator_.named_steps['col_transform'].get_feature_names_out()
coef_df = pd.DataFrame({'coef':model.best_estimator_['classifier'].coef_[0]
                        ,'var':feature_names})
coef_df.shape

In [None]:
coef_df_nz = coef_df[coef_df['coef']!=0]
coef_df_nz.shape
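The drop in non-zero coefficients comes from the L1 penalty, whose strength is controlled by C (smaller C means a stronger penalty). A minimal sketch on synthetic data, assuming standardized features as in this notebook; the data and the chosen values of C are illustrative only:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 30 features, of which only 5 carry signal.
X_toy, y_toy = make_classification(n_samples=500, n_features=30, n_informative=5,
                                   n_redundant=0, random_state=0)
X_toy = StandardScaler().fit_transform(X_toy)

# Count the surviving (non-zero) coefficients at several penalty strengths.
nonzero_counts = {}
for C in [0.01, 0.1, 1.0, 10.0]:
    lr = LogisticRegression(penalty="l1", solver="liblinear", C=C, random_state=0)
    lr.fit(X_toy, y_toy)
    nonzero_counts[C] = int(np.count_nonzero(lr.coef_[0]))

print(nonzero_counts)
```

The number of features kept therefore depends on which C the grid search selects, so the count can differ between the position-specific models.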

We see from our coefficient plot that wins and win shares seem to have the highest impact on the log-odds of making All-NBA, while FT% has the biggest negative impact. This bears further study, but for the purposes of this project (prediction), we do not necessarily care about how the model weighs each of the features.

In [None]:
#Now we will make a bar chart of these coefficients
alt.Chart(coef_df_nz).mark_bar().encode(
    y=alt.Y('coef',title='Coefficient'),
    x=alt.X('var',title='Variable', sort = '-y'))

For this dataset, the following players were predicted All-NBA but did not win it (false positives):

In [None]:
nba_filt_test[(model.predict(nba_filt_test)==1) & (y_test!=1)][['Player',"year"]].sort_values(by='year', ascending=False)

For our False Negatives we have:

In [None]:
nba_filt_test[(model.predict(nba_filt_test)!=1) & (y_test==1)][['Player',"year"]].sort_values(by='year', ascending=False)

Now we may consider fitting position specific models. We will use the same `MP` and `G` filters we used prior.

In [None]:
nba_g_train = pd.read_csv('Data_Scripting_Cleaning/Full_data/Training_Sets/nba_g_train.csv')
nba_g_test = pd.read_csv('Data_Scripting_Cleaning/Full_data/Test_Sets/nba_g_test.csv')

In [None]:
nba_g_filt_train = nba_g_train[(nba_g_train['MP']>=min_minutes) & (nba_g_train['G']>=min_G)]
nba_g_filt_test = nba_g_test[(nba_g_test['MP']>=min_minutes) & (nba_g_test['G']>=min_G)]
y_g_train = nba_g_filt_train['all_nba_c_year']
y_g_test = nba_g_filt_test['all_nba_c_year']


In [None]:
#Now I will create my pipeline as before, reusing the same column transformer and parameter grid
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder


clf_g = Pipeline([
    ("col_transform", ct),
    ("classifier", LogisticRegression(penalty = 'l1', solver = 'liblinear', 
                                      max_iter = 10000, random_state=0))
])

model_g = GridSearchCV(clf_g, param_grid)
model_g.fit(nba_g_filt_train, y_g_train)

Looking only at guards, we have similar accuracy but worse recall for the positive class.

In [None]:
print(classification_report(y_g_test, model_g.predict(nba_g_filt_test)))

In [None]:
pd.crosstab(y_g_test, model_g.predict(nba_g_filt_test), rownames=["Actual"], colnames=["Predicted"])

We have our false positives as:

In [None]:
nba_g_filt_test[(model_g.predict(nba_g_filt_test)==1) & (y_g_test!=1)][['Player','year']]

We have our false negatives as:

In [None]:
nba_g_filt_test[(model_g.predict(nba_g_filt_test)!=1) & (y_g_test==1)][['Player','year']]

Examining our coefficients we see the L1 penalty reduced our number of features by almost 60.

In [None]:
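#Use the guard model's own fitted transformer for the feature names, since the one-hot columns depend on the training data
feature_names_g = model_g.best_estimator_.named_steps['col_transform'].get_feature_names_out()
coef_g_df = pd.DataFrame({'coef':model_g.best_estimator_['classifier'].coef_[0],
                          'var':feature_names_g})
coef_g_df.shape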

In [None]:
coef_g_nz = coef_g_df[coef_g_df['coef']!=0]
coef_g_nz.shape

Looking at their values we have:

In [None]:
alt.Chart(coef_g_nz).mark_bar().encode(
    y=alt.Y('coef',title='Coefficient'),
    x=alt.X('var',title='Variable', sort = '-y'))

We see here that VORP has one of the largest positive coefficients, while the number of personal fouls has one of the largest-magnitude negative coefficients.

Repeating this analysis for the forwards and centers, we get:

Forwards:

In [None]:
nba_F_train = pd.read_csv('Data_Scripting_Cleaning/Full_data/Training_Sets/nba_F_train.csv')
nba_F_test = pd.read_csv('Data_Scripting_Cleaning/Full_data/Test_Sets/nba_F_test.csv')

nba_F_filt_train = nba_F_train[(nba_F_train['MP']>=min_minutes) & (nba_F_train['G']>=min_G)]
nba_F_filt_test = nba_F_test[(nba_F_test['MP']>=min_minutes) & (nba_F_test['G']>=min_G)]


y_F_train = nba_F_filt_train['all_nba_c_year']
y_F_test = nba_F_filt_test['all_nba_c_year']

In [None]:
clf_F = Pipeline([
    ("col_transform", ct),
    ("classifier", LogisticRegression(penalty = 'l1', solver = 'liblinear', 
                                      max_iter = 10000, random_state=0))
])
model_F = GridSearchCV(clf_F, param_grid)
model_F.fit(nba_F_filt_train, y_F_train)

In [None]:
print(classification_report(y_F_test, model_F.predict(nba_F_filt_test)))

In [None]:
pd.crosstab(y_F_test, model_F.predict(nba_F_filt_test), rownames=["Actual"], colnames=["Predicted"])

In [None]:
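#Use the forward model's own fitted transformer for the feature names
feature_names_F = model_F.best_estimator_.named_steps['col_transform'].get_feature_names_out()
coef_F_df = pd.DataFrame({'coef':model_F.best_estimator_['classifier'].coef_[0],
                          'var':feature_names_F})
coef_F_df.shape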

In [None]:
coef_F_nz = coef_F_df[coef_F_df['coef']!=0]
coef_F_nz.shape

In [None]:
alt.Chart(coef_F_nz).mark_bar().encode(
    y=alt.Y('coef',title='Coefficient'),
    x=alt.X('var',title='Variable', sort = '-y'))

Centers:

In [None]:
nba_C_train = pd.read_csv('Data_Scripting_Cleaning/Full_data/Training_Sets/nba_C_train.csv')
nba_C_test = pd.read_csv('Data_Scripting_Cleaning/Full_data/Test_Sets/nba_C_test.csv')

nba_C_filt_train = nba_C_train[(nba_C_train['MP']>=min_minutes) & (nba_C_train['G']>=min_G)]
nba_C_filt_test = nba_C_test[(nba_C_test['MP']>=min_minutes) & (nba_C_test['G']>=min_G)]


y_C_train = nba_C_filt_train['all_nba_c_year']
y_C_test = nba_C_filt_test['all_nba_c_year']

clf_C = Pipeline([
    ("col_transform", ct),
    ("classifier", LogisticRegression(penalty = 'l1', solver = 'liblinear', 
                                      max_iter = 10000, random_state=0))
])
model_C = GridSearchCV(clf_C, param_grid)
model_C.fit(nba_C_filt_train, y_C_train)

We see the centers had quite poor recall for the positive class in this dataset. This makes sense: this dataset is the most unbalanced of the three, since only one center is picked for each All-NBA team.

In [None]:
print(classification_report(y_C_test, model_C.predict(nba_C_filt_test)))

In [None]:
pd.crosstab(y_C_test, model_C.predict(nba_C_filt_test), rownames=["Actual"], colnames=["Predicted"])

In [None]:
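#Use the center model's own fitted transformer for the feature names
feature_names_C = model_C.best_estimator_.named_steps['col_transform'].get_feature_names_out()
coef_C = pd.DataFrame({'coef':model_C.best_estimator_['classifier'].coef_[0],
                       'var':feature_names_C})
coef_C.shape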

In [None]:
coef_C_nz = coef_C[coef_C['coef']!=0]
coef_C_nz.shape

In [None]:
alt.Chart(coef_C_nz).mark_bar().encode(
    y=alt.Y('coef',title='Coefficient'),
    x=alt.X('var',title='Variable', sort = '-y'))

For this model, wins and the advanced statistic win shares had two of the largest positive coefficients, while FT% had one of the largest negative coefficients.

Now we may try to balance our datasets. First we will try this on the larger dataset, using SMOTENC sampling (Synthetic Minority Over-sampling Technique for Nominal and Continuous features). This method uses a k-nearest neighbors approach (default k=5) to generate synthetic minority-class rows.

In [None]:
#balance all_nba_c_year in the nba_train set
from imblearn.over_sampling import SMOTENC

smote = SMOTENC(random_state=0,categorical_features=[nba_filt_train[num_features+cat_features].shape[1]-1])
X_train_resampled, y_train_resampled = smote.fit_resample(nba_filt_train[num_features+cat_features], nba_filt_train['all_nba_c_year'])
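For continuous features, the core SMOTE step is interpolation: a synthetic row is placed at a random point on the segment between a minority-class row and one of its k nearest minority-class neighbors. SMOTENC additionally assigns each categorical feature the most frequent category among those neighbors. The sketch below covers only the continuous part on toy data; it is an illustration of the idea, not imblearn's implementation, and `make_synthetic` is a hypothetical helper.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Toy minority-class points in two standardized features (hypothetical values).
X_min = rng.normal(size=(20, 2))

k = 5
# Ask for k + 1 neighbors because each point is its own nearest neighbor.
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
_, idx = nn.kneighbors(X_min)

def make_synthetic(i):
    """Interpolate between point i and one randomly chosen neighbor."""
    j = rng.choice(idx[i, 1:])   # skip column 0, which is the point itself
    gap = rng.uniform(0.0, 1.0)  # position along the segment from i to j
    return X_min[i] + gap * (X_min[j] - X_min[i])

# Generate 10 synthetic rows from randomly chosen seed points.
seeds = rng.integers(0, len(X_min), size=10)
synthetic = np.array([make_synthetic(i) for i in seeds])
print(synthetic.shape)
```

Since every synthetic row lies between two real minority rows, the method adds no information from outside the region the minority class already occupies.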

In [None]:
clf_bal = Pipeline([
    ("col_transform", ct),
    ("classifier", LogisticRegression(penalty = 'l1', solver = 'liblinear', 
                                      max_iter = 10000, random_state=0))
])
model_smote = GridSearchCV(clf_bal, param_grid)
model_smote.fit(X_train_resampled, y_train_resampled)

In [None]:
print(classification_report(nba_filt_test['all_nba_c_year'], model_smote.predict(nba_filt_test[num_features+cat_features])))

In [None]:
pd.crosstab(nba_filt_test['all_nba_c_year'], model_smote.predict(nba_filt_test[num_features+cat_features]), rownames=["Actual"], colnames=["Predicted"])

We see using this balanced dataset greatly increases our recall, but at the cost of our precision. 

We may also try under-sampling. This will essentially remove rows of the majority class (those who did not make All-NBA).

In [None]:
#Now we may use undersampling to balance the classes
from imblearn.under_sampling import RandomUnderSampler
ran_uns = RandomUnderSampler(random_state=0)
X_train_resampled, y_train_resampled = ran_uns.fit_resample(nba_filt_train[num_features+cat_features], nba_filt_train['all_nba_c_year'])
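Random under-sampling amounts to drawing, without replacement, as many majority-class rows as there are minority-class rows. A rough pandas equivalent on toy data (hypothetical values; this mirrors the idea rather than imblearn's internals):

```python
import pandas as pd

# Toy imbalanced data (hypothetical): 2 positives and 8 negatives.
toy = pd.DataFrame({"x": range(10), "all_nba_c_year": [1, 1] + [0] * 8})

# Size of the smaller class.
n_minority = toy["all_nba_c_year"].value_counts().min()

# Sample each class down to the minority count, without replacement.
balanced = toy.groupby("all_nba_c_year").sample(n=n_minority, random_state=0)
print(balanced["all_nba_c_year"].value_counts().to_dict())
```

The cost is that most majority-class rows are discarded, which can matter when the training set is already small.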


In [None]:
clf_us_bal = Pipeline([
    ("col_transform", ct),
    ("classifier", LogisticRegression(penalty = 'l1', solver = 'liblinear', 
                                      max_iter = 10000, random_state=0))
])
model_us = GridSearchCV(clf_us_bal, param_grid)
model_us.fit(X_train_resampled, y_train_resampled)

In [None]:
print(classification_report(nba_filt_test['all_nba_c_year'], model_us.predict(nba_filt_test[num_features+cat_features])))

In [None]:
pd.crosstab(nba_filt_test['all_nba_c_year'], model_us.predict(nba_filt_test[num_features+cat_features]), rownames=["Actual"], colnames=["Predicted"])

Finally we can try random over-sampling, which duplicates rows of the minority class (those who made All-NBA).

In [None]:
#Now we may use oversampling to balance the classes
from imblearn.over_sampling import RandomOverSampler
ran_os = RandomOverSampler(random_state=0)
X_train_resampled, y_train_resampled = ran_os.fit_resample(nba_filt_train[num_features+cat_features], nba_filt_train['all_nba_c_year'])
clf_os_bal = Pipeline([
    ("col_transform", ct),
    ("classifier", LogisticRegression(penalty = 'l1', solver = 'liblinear', 
                                      max_iter = 10000, random_state=0))
])
model_os = GridSearchCV(clf_os_bal, param_grid)
model_os.fit(X_train_resampled, y_train_resampled)

In [None]:
print(classification_report(nba_filt_test['all_nba_c_year'], model_os.predict(nba_filt_test[num_features+cat_features])))

In [None]:
pd.crosstab(nba_filt_test['all_nba_c_year'], model_os.predict(nba_filt_test[num_features+cat_features]), rownames=["Actual"], colnames=["Predicted"])

We see that none of these three methods is significantly better than the baseline model fit on the unbalanced classes, and some are considerably worse. Specifically, we have quite poor precision for our positive classifications.