**Applying ML to the NBA Draft**

On Thursday, June 22, 2023, the San Antonio Spurs selected Victor Wembanyama with the first overall pick. With other prospective stars such as Scoot Henderson and the Thompson twins also in this year’s draft, it looked like one of the most exciting draft classes in recent memory. As an NBA fan and a data analyst, I became extremely curious about the analytics behind each pick. While high picks often turn out to be the best players, there have been many cases over the years in which teams made the wrong choice with the first overall selection. At the same time, lower picks have proven valuable when a team is able to correctly predict future stardom. Thus, I developed a machine learning model that predicts where each player should be drafted in his class, using college statistics and NBA Draft Combine data as inputs.

**Variables Used**

For the inputs to my model, I used Kaggle datasets containing each player’s college statistics and draft combine data. The college stats include metrics like field goal percentage and points per game, as well as more intricate measures like player efficiency rating, win shares, and defensive and offensive BPM. NBA Draft Combine data gives an overview of a player's physical profile and athletic ability, with indicators such as height, weight, wingspan, and results from physical tests like the max vertical leap and sprint time.

The greatest challenge was choosing an appropriate response variable. I wanted a variable that accurately measures success in the NBA, the kind of outcome that scouts and analysts hope to anticipate on draft night. I chose RAPTOR, a stat developed by the writers of FiveThirtyEight, who describe it as follows:

“RAPTOR consists of two major components that are blended together to rate players: a “box” (as in “box score”) component, which uses individual statistics (including statistics derived from player tracking and play-by-play data), and an “on-off” component, which evaluates a team’s performance when the player and various combinations of his teammates are on or off the floor. NBA teams highly value floor spacing, defense and shot creation, and they place relatively little value on traditional big-man skills. RAPTOR likewise values these things — not because we made any deliberate attempt to design the system that way but because the importance of those skills emerges naturally from the data.

RAPTOR thinks ball-dominant players such as James Harden and Steph Curry are phenomenally good. It highly values two-way wings such as Kawhi Leonard and Paul George. It can have a love-hate relationship with centers, who are sometimes overvalued in other statistical systems. But it appreciates modern centers such as Nikola Jokić and Joel Embiid, as well as defensive stalwarts like Rudy Gobert.”

I used their RAPTOR WAR (wins above replacement) variant, which factors in minutes and games played, since teams are looking for players who can be effective across an 82-game season with substantial playing time.

**Draft Prediction Code**

Abstract: This code utilizes machine learning models to predict the career performance of NBA players. The code incorporates data preprocessing, imputation, model training, and prediction steps. It calculates the average of the seven best seasons to balance the importance of peak performance and longevity. The trained models are evaluated, and the Ordinary Least Squares (OLS) model is chosen as the best performer. The code applies the OLS model to a test set of NBA players and compares the predicted career performance with the actual performance. 

In [None]:
# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm

**WAR Total Calculation**

While WAR is already an established metric, a simple career average would undervalue players who keep playing well past their primes. A player like Vince Carter, a perennial All-Star during his time with the Raptors, ends up with a mediocre average WAR because of his seasons as a 40-year-old veteran. A career sum has the opposite problem and overvalues these players: most franchises would rather have one All-Star for one year than a weaker player for three, even if their cumulative WAR is the same. Thus, I chose the average of each player's 7 best WAR seasons, balancing the importance of peak and longevity.

In [70]:
# Read the CSV file into a pandas DataFrame
df = pd.read_csv('war.csv')

# Calculate the average 'war_total' for each player
average_war = df.groupby('player_name')['war_total'].mean().reset_index()

# Sort the DataFrame by the average 'war_total' in descending order
average_war = average_war.sort_values('war_total', ascending=False)

# Select the top 7 average 'war_total' values for each player
adjusted_war = average_war.groupby('player_name').head(7)

# Reset the index of the DataFrame
adjusted_war = adjusted_war.reset_index(drop=True)

# Display the resulting DataFrame
print(adjusted_war)

             player_name  war_total
0           James Harden  17.006302
1          Stephen Curry  16.410551
2           LeBron James  14.788844
3             Chris Paul  13.701146
4           Nikola Jokic  13.064720
...                  ...        ...
1187       James Wiseman  -2.470387
1188       Collin Sexton  -2.568559
1189  Aleksej Pokusevski  -3.608008
1190        Theo Maledon  -4.747338
1191          Kevin Knox  -5.964009

[1192 rows x 2 columns]
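
As written, the cell above takes each player's mean over every season in `war.csv` before `head(7)` runs, so by that point each player has a single row and `head(7)` has nothing left to trim. A minimal sketch of an average over the seven best seasons, assuming the same `player_name` and `war_total` columns, might look like the following. The helper name `top_n_season_average` and the toy rows are illustrative and not part of the original notebook.

```python
import pandas as pd

def top_n_season_average(seasons, n=7):
    """Average each player's n best seasons by 'war_total' (fewer if they played fewer)."""
    # Sort all seasons from best to worst, then keep the first n rows per player
    best = (
        seasons.sort_values('war_total', ascending=False)
               .groupby('player_name')
               .head(n)
    )
    # Average the retained seasons and order players from highest to lowest
    return (
        best.groupby('player_name', as_index=False)['war_total']
            .mean()
            .sort_values('war_total', ascending=False)
            .reset_index(drop=True)
    )

# Toy data (hypothetical): one 8-season veteran and one 2-season player
toy = pd.DataFrame({
    'player_name': ['Veteran'] * 8 + ['Rookie'] * 2,
    'war_total': [1, 2, 3, 4, 5, 6, 7, 8, -1, 3],
})
toy_result = top_n_season_average(toy)
print(toy_result)
```

Swapping this in for the original cell would change the `war_total` values used downstream, so the outputs shown in this post correspond to the career-mean version.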


**Merging Data Frames**

Here, I merged the draft combine data with the college data on player_name, then merged in the WAR target. These data points act as the features for my machine learning model. I also converted categorical variables, such as the player's position and college year (freshman, senior, etc.), into dummy variables.

In [71]:
# Read csv files into pandas dataframes
df_combine = pd.read_csv('combine.csv')
df_college = pd.read_csv('college.csv')

# Merge dataframes on player_name
df = pd.merge(df_combine, df_college, on='player_name')
df = pd.merge(df, adjusted_war, on='player_name')

# Remove duplicate rows based on a specific column
df = df[~df.duplicated(subset='player_name')]

# Create dummy variables from the 'position' and 'yr' columns
df = pd.get_dummies(df, columns=['position', 'yr'])

# We create a separate DataFrame that keeps 'player_name' for the test set
df_with_player_name = df.copy()

# Drop the 'player_name' variable from df
df.drop(['player_name'], axis=1, inplace=True)

# Save column names
column_names = df.columns
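
Because `pd.merge` defaults to an inner join, any player whose name is spelled differently across files, or who is missing from one of them, silently drops out of the dataset. A small sketch of how one might check for this with `indicator=True` is shown below. The frames are made up and `merge_report` is a hypothetical helper.

```python
import pandas as pd

def merge_report(left, right, key='player_name'):
    """Count rows found in both frames, only in the left, or only in the right."""
    merged = pd.merge(left, right, on=key, how='outer', indicator=True)
    return merged['_merge'].value_counts().to_dict()

# Made-up stand-ins for the combine and college tables
combine_toy = pd.DataFrame({'player_name': ['A', 'B', 'C'], 'wingspan': [80, 82, 85]})
college_toy = pd.DataFrame({'player_name': ['A', 'B', 'D'], 'pts': [15.1, 9.8, 20.3]})

report = merge_report(combine_toy, college_toy)
print(report)
```

Large `left_only` or `right_only` counts would be a hint that name cleaning is needed before merging.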

**Imputing Data**

Because the draft combine has recorded different measurements over the years, many values are missing, so I imputed them using a KNN imputer.
The KNN imputer fills each missing value with an estimate based on the 3 nearest neighbors, that is, the three most similar players on the features that are present.

In [72]:
# Initialize KNNImputer
imputer = KNNImputer(n_neighbors=3)

# Fit and transform the dataset
df = imputer.fit_transform(df)

# Convert the result back to a pandas DataFrame
df = pd.DataFrame(df, columns=column_names)
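
To make the imputation step concrete, here is a toy sketch (all numbers invented) in which a missing wingspan is filled with the mean of the three most similar players. It also fits the imputer on training rows only and then applies it to a new row, which is one way to keep test-set information out of the imputation. The cell above fits on the full DataFrame instead.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Columns: [height_in, wingspan_in]; values are made up for illustration
train_rows = np.array([
    [76.0, 80.0],
    [77.0, 82.0],
    [78.0, 84.0],
    [90.0, 99.0],   # far away in height, so not among the 3 nearest neighbors
])
new_row = np.array([[77.0, np.nan]])  # wingspan is missing

# Fit on the training rows only, then fill the gap in the unseen row
imputer = KNNImputer(n_neighbors=3)
imputer.fit(train_rows)
filled = imputer.transform(new_row)
print(filled)
```

The filled wingspan is the plain average of the three closest players' wingspans, because `KNNImputer` weights neighbors uniformly by default.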

**Developing ML Model**

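Here, I compared several ML methods by their R^2 scores: a LASSO model, a Ridge model, a neural network, OLS, and a random forest. The LASSO, Ridge, neural network, and random forest scores below are computed on a held-out 20% test set, while the OLS score is the `rsquared` that statsmodels reports on the training data, so it is not directly comparable to the others. OLS had the highest reported R^2 at 0.38, and I chose to use it for the rest of my analysis.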

In [115]:
# Define the response variable
response_variable = 'war_total'

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df.drop(response_variable, axis=1), df[response_variable], test_size=0.2, random_state=42)

# We also split df_with_player_name to keep 'player_name' in the test set
X_train_with_player_name, X_test_with_player_name, _, _ = train_test_split(df_with_player_name.drop(response_variable, axis=1), df_with_player_name[response_variable], test_size=0.2, random_state=42)

# Create and train the Lasso model
lasso = Lasso()
lasso.fit(X_train, y_train)
lasso_score = lasso.score(X_test, y_test)
print("Lasso model score:", lasso_score)

# Create and train the Ridge model
ridge = Ridge()
ridge.fit(X_train, y_train)
ridge_score = ridge.score(X_test, y_test)
print("Ridge model score:", ridge_score)

# Scale the input features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


# Create and train the Neural Network model
nn = MLPRegressor(random_state=42)
nn.fit(X_train_scaled, y_train)
nn_score = nn.score(X_test_scaled, y_test)
print("Neural Network model score:", nn_score)

# Add a constant term to the input features
X_train_ols = sm.add_constant(X_train)
X_test_ols = sm.add_constant(X_test)

# Create and train the OLS model
ols = sm.OLS(y_train, X_train_ols)
ols_results = ols.fit()
ols_score = ols_results.rsquared
print("OLS model score:", ols_score)

# Create and train the Random Forest model
rf = RandomForestRegressor(random_state=42)
rf.fit(X_train, y_train)
rf_score = rf.score(X_test, y_test)
print("Random Forest model score:", rf_score)




Lasso model score: 0.13393428343206049
Ridge model score: 0.1790617207499492
Neural Network model score: 0.12985744215605766
OLS model score: 0.3796447682388234
Random Forest model score: 0.1526453016650422
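
A hedged sketch of scoring every model out-of-sample on the same folds, so that OLS is measured the same way as the others (statsmodels' `rsquared` is an in-sample figure). It runs on synthetic data, uses scikit-learn's `LinearRegression` as a stand-in for `sm.OLS` with a constant (both fit ordinary least squares), leaves out the neural network for brevity, and the helper name `cross_validated_r2` is an assumption of this sketch.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score

def cross_validated_r2(X, y, n_splits=5, seed=42):
    """Return the mean out-of-sample R^2 per model across the same k folds."""
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    models = {
        'OLS': LinearRegression(),
        'Ridge': Ridge(),
        'Lasso': Lasso(),
        'Random Forest': RandomForestRegressor(n_estimators=50, random_state=seed),
    }
    return {
        name: cross_val_score(model, X, y, cv=folds, scoring='r2').mean()
        for name, model in models.items()
    }

# Synthetic stand-in for the player features and the WAR target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=200)

cv_scores = cross_validated_r2(X_demo, y_demo)
for name, score in cv_scores.items():
    print(f"{name}: {score:.3f}")
```

On the real data, the same function could be called with the imputed feature matrix and `war_total` column in place of `X_demo` and `y_demo`.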


**OLS Summary Table**

In [79]:
# Print the summary statistics of the OLS model
print(ols_results.summary())

                            OLS Regression Results                            
Dep. Variable:              war_total   R-squared:                       0.380
Model:                            OLS   Adj. R-squared:                  0.217
Method:                 Least Squares   F-statistic:                     2.340
Date:                Wed, 19 Jul 2023   Prob (F-statistic):           3.06e-07
Time:                        15:14:52   Log-Likelihood:                -686.06
No. Observations:                 358   AIC:                             1522.
Df Residuals:                     283   BIC:                             1813.
Df Model:                          74                                         
Covariance Type:            nonrobust                                         
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
const                    321
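
The gap between R-squared (0.380) and adjusted R-squared (0.217) in the summary reflects how many predictors the model uses relative to its sample size: 74 predictors for 358 observations. As a quick check, the standard adjustment formula reproduces the reported figure from the numbers printed above (the function name is illustrative):

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# R^2 from the printed OLS score; n and p from 'No. Observations' and 'Df Model'
adj = adjusted_r2(0.3796447682388234, n_obs=358, n_predictors=74)
print(round(adj, 3))
```

The penalty grows as the number of predictors approaches the number of observations, which is one reason to read the raw 0.38 with some caution.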

**Test Set Predictions**

Finally, I applied my model to the test set and predicted the WAR of 90 randomly selected NBA players. Looking at the top 10 by war_total_pred, it is clear that the model is far from perfect. The top 3 players, Marcus Smart, Donte DiVincenzo, and Andre Roberson, all fit a similar mold: versatile defenders who can provide valuable spacing for their teammates. However, they are a clear tier below the Hall of Fame-caliber players in the test set, James Harden and Kawhi Leonard, whose predicted 7-year WAR is only about a fifth of their actual WAR. One potential reason is that both players developed into something completely different from what they were in college. Kawhi Leonard was touted as a 3-and-D wing coming out of San Diego State but developed unique skills that led him to stardom. Similarly, James Harden was expected to become a sparkplug sixth man coming out of Arizona State before eventually changing the game of basketball with his impressive scoring.

Most of the players with a high predicted WAR had long, illustrious careers in the NBA. However, some players within the top 10 predicted WAR did not turn into quality NBA players. Lester Hudson is a prime example. The model recognized his outstanding college stats (Hudson is the only player to record a quadruple-double in NCAA Division I play, with 25 points, 12 rebounds, 10 assists, and 10 steals vs. Central Baptist College) but could not gauge differences in division or level of competition. Thus, despite a high predicted WAR, Hudson was selected with the 58th pick and played a total of only 57 NBA games.


In [95]:
# Predict 'war_total' using the OLS model
y_pred_ols = ols_results.predict(X_test_ols)

# Create a DataFrame from the predictions
df_predictions = pd.DataFrame(y_pred_ols, columns=['war_total_pred'])

# Add 'player_name' from X_test_with_player_name to df_predictions
df_predictions['player_name'] = X_test_with_player_name['player_name'].values
df_predictions = df_predictions[['player_name', 'war_total_pred']]

# Sort df_predictions by war_total_pred in descending order
df_predictions = df_predictions.sort_values('war_total_pred', ascending=False)

# Merge df_predictions with adjusted_war based on player_name
df_predictions = df_predictions.merge(adjusted_war[['player_name', 'war_total']], on='player_name', how='left')

# Rename the war_total column to war_actual
df_predictions = df_predictions.rename(columns={'war_total': 'war_actual'})

# Print the updated DataFrame
df_predictions.head(10)




        player_name  war_total_pred  war_actual
0      Marcus Smart         4.89086    5.062695
1  Donte DiVincenzo         4.30845    3.135029
2    Andre Roberson         4.28094    2.459231
3        Trae Young         3.55765    4.990915
4     Lester Hudson        3.462179    -0.03158
5      James Harden        3.175913   17.006302
6       Eric Maynor        3.095262   -1.576366
7    Damian Lillard        2.901052   11.007493
8         John Wall        2.869443    5.329992
9     Kawhi Leonard        2.565787   12.910969
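
Eyeballing the top 10 only goes so far. One way to quantify how well the predicted ordering matches the actual one is a Spearman rank correlation, alongside the mean absolute error. The sketch below uses only the ten rows displayed above, so it describes this slice rather than the full 90-player test set. `rank_agreement` is a hypothetical helper.

```python
import pandas as pd

def rank_agreement(frame, pred_col='war_total_pred', actual_col='war_actual'):
    """Return (Spearman correlation, mean absolute error) for two columns."""
    # Spearman correlation is the Pearson correlation of the ranks
    spearman = frame[pred_col].rank().corr(frame[actual_col].rank())
    mae = (frame[pred_col] - frame[actual_col]).abs().mean()
    return spearman, mae

# The ten rows shown in the table above
top10 = pd.DataFrame({
    'player_name': ['Marcus Smart', 'Donte DiVincenzo', 'Andre Roberson', 'Trae Young',
                    'Lester Hudson', 'James Harden', 'Eric Maynor', 'Damian Lillard',
                    'John Wall', 'Kawhi Leonard'],
    'war_total_pred': [4.89086, 4.30845, 4.28094, 3.55765, 3.462179,
                       3.175913, 3.095262, 2.901052, 2.869443, 2.565787],
    'war_actual': [5.062695, 3.135029, 2.459231, 4.990915, -0.03158,
                   17.006302, -1.576366, 11.007493, 5.329992, 12.910969],
})

spearman, mae = rank_agreement(top10)
print(f"Spearman: {spearman:.2f}, MAE: {mae:.2f}")
```

Running the same helper on the full `df_predictions` frame would give a fairer picture than the top 10 alone.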


**Developing Draft Prediction Function**

The last thing I wanted to examine in this project was how well my ML model would perform on previous draft classes. One caveat is that the model was trained on these same draft classes, so it has already seen most of these players and learned patterns from their outcomes. Its predictions for them are therefore likely to be more accurate than they would be for future draft classes it has never seen.

In [116]:
# Add a constant term to the input features of the entire dataset
X_ols = sm.add_constant(df.drop(response_variable, axis=1))

# Predict 'war_total' for the entire dataset using the OLS model
y_pred_ols_all = ols_results.predict(X_ols)

# Create a DataFrame from the predictions
df_predictions_all = pd.DataFrame(y_pred_ols_all, columns=['war_total_pred'])

# Add 'player_name' from df_with_player_name to df_predictions_all
df_predictions_all['player_name'] = df_with_player_name['player_name'].values
df_predictions_all = df_predictions_all[['player_name', 'war_total_pred']]

# Sort df_predictions_all by 'war_total_pred' in descending order
df_predictions_all = df_predictions_all.sort_values(by='war_total_pred', ascending=False)

# Read draft.csv into a DataFrame
df_draft = pd.read_csv('draft.csv')

# Merge df_predictions_all and df_draft on 'player_name'
df_predictions_all = pd.merge(df_predictions_all, df_draft, on='player_name', how='left')

# Select the columns to keep
df_predictions_all = df_predictions_all[['player_name', 'season', 'overall_pick', 'war_total_pred']]

# Sort df_predictions_all by 'war_total_pred' in descending order
df_predictions_all = df_predictions_all.sort_values(by='war_total_pred', ascending=False)


def get_players_by_season(season):
    # Filter df_predictions_all for the given season
    df_season = df_predictions_all[df_predictions_all['season'] == season]
    
    # Sort df_season by 'war_total_pred' in descending order
    df_season = df_season.sort_values(by='war_total_pred', ascending=False)
    
    return df_season
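
A natural extension of `get_players_by_season` is to rank each class by predicted WAR and compare that rank with the actual pick. The sketch below is self-contained: it builds a tiny stand-in for `df_predictions_all` from made-up rows, and the `pred_rank` and `pick_delta` columns are illustrative additions rather than part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for df_predictions_all
df_predictions_demo = pd.DataFrame({
    'player_name': ['Player A', 'Player B', 'Player C', 'Player D'],
    'season': [2012, 2012, 2012, 2009],
    'overall_pick': [1, 35, 3, 7],
    'war_total_pred': [6.0, 5.0, 3.0, 8.0],
})

def rank_draft_class(predictions, season):
    """Rank one draft class by predicted WAR and compare with the actual pick."""
    df_season = predictions[predictions['season'] == season].copy()
    df_season = df_season.sort_values('war_total_pred', ascending=False)
    # Rank among the players present in the data, not among all draftees
    df_season['pred_rank'] = range(1, len(df_season) + 1)
    # Positive delta: the model ranks the player higher than he was picked
    df_season['pick_delta'] = df_season['overall_pick'] - df_season['pred_rank']
    return df_season.reset_index(drop=True)

class_2012 = rank_draft_class(df_predictions_demo, 2012)
print(class_2012)
```

Sorting by `pick_delta` would surface the players the model likes far more than the league did on draft night.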


**Draft Class Application**

**2012**

The first draft class I wanted to observe was 2012, which featured multiple All-Stars picked in the second round. My predictive model puts Draymond Green as the number two prospect, right below Anthony Davis. While it is debatable whether he has had a better career than Damian Lillard, he is a clear tier above the other players he is ranked ahead of. I found it interesting that my model was able to anticipate Draymond's success based solely on his college and draft combine stats, even though he was only the 35th overall pick. Furthermore, it placed Khris Middleton, another All-Star picked in the second round, within the top 10 of the class, which suggests it can be a useful tool for discovering underrated talent.

In [117]:
season = 2012  # replace with the desired season
df_season = get_players_by_season(season)
df_season.head(10)

               player_name  season  overall_pick  war_total_pred
 1           Anthony Davis  2012.0           1.0        6.052855
 2          Draymond Green  2012.0          35.0        5.152269
16            Bradley Beal  2012.0           3.0         3.32258
31          Damian Lillard  2012.0           6.0        2.901052
44             Jeremy Lamb  2012.0          12.0        2.643825
56  Michael Kidd-Gilchrist  2012.0           2.0        2.385151
57             John Henson  2012.0          14.0        2.367483
67          Terrence Jones  2012.0          18.0        2.192918
70        Andrew Nicholson  2012.0          19.0        2.164353
71         Khris Middleton  2012.0          39.0        2.157398


**2009**

I also looked at the 2009 draft class, touted as one of the strongest draft classes of its era. Interestingly, the model puts Stephen Curry first overall in a draft where he famously dropped to 7th. One feature of this model that cuts both ways is that it does not consider injury history. Thus, the model overlooks Curry's injuries in college and correctly places him first overall. This may be a disadvantage with other players who sustained injuries in college and continued to sustain them throughout their NBA careers. The model also correctly anticipates the impact of Jrue Holiday, who was the 17th overall pick but has provided All-Star-level play and Hall of Fame-level defense for teams around the league. He has not been a better performer than James Harden, so his high placement is likely due to the value RAPTOR places on defense.

In [118]:
season = 2009  # replace with the desired season
df_season = get_players_by_season(season)
df_season.head(10)

        player_name  season  overall_pick  war_total_pred
 0    Stephen Curry  2009.0           7.0        8.306802
 5     Jrue Holiday  2009.0          17.0         4.78811
14    Lester Hudson  2009.0          58.0        3.462179
15     Tyreke Evans  2009.0           4.0        3.358822
18  DeMarre Carroll  2009.0          27.0        3.177605
19     James Harden  2009.0           3.0        3.175913
22      Eric Maynor  2009.0          20.0        3.095262
29    James Johnson  2009.0          16.0        2.947545
30      Jeff Teague  2009.0          19.0        2.934953
41        Ty Lawson  2009.0          18.0        2.694614
