# 2024 Predictions

This notebook predicts MVP vote share for the 2024 NBA season using XGBoost and random forest models.

In [1]:
#Import libraries
import pandas as pd
import warnings
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
warnings.filterwarnings("ignore", category=FutureWarning)

In [2]:
#Import data
train = pd.read_csv('training2024.csv')
#Keep only seasons from 2010 onward for training
train = train[train['Year'] >= 2010]
test = pd.read_csv('testing2024.csv')

In 2024, players must play at least 65 games to be eligible for MVP voting. For this reason, I filter the testing data to only the players who meet this criterion. The rule was not in place in prior years, so I leave the training data as is: a player who played fewer than 65 games in those seasons could still receive MVP votes.

In [3]:
#Filter data
test = test[test['G'] >= 65]

In [4]:
#Delete the leftover index column written by the CSV export
del train['Unnamed: 0']
del test['Unnamed: 0']

In [5]:
#Define features
features = ['FTA', 'W%', 'BLK', 'MP', 'PTS_C', 'DRB_C', 'FGA', 'STL', 'FG%', 'TRB_C', 'TOV', 'PG', 'FT%',
           'G', 'C', 'ORB', '2PA', '2P', '2P%', 'Age', 'PF', 'eFG%', '3P%', 'AST_C']

In [6]:
#Format data
X_train = train[features]
y_train = train['Share']
X_test = test[features]

## Boosting

In [7]:
#Initialize model
boost_mod = XGBRegressor(learning_rate = 0.01, max_depth = 4, n_estimators = 500)

#Fit model
boost_mod.fit(X_train, y_train)

#Predict
pred_share = boost_mod.predict(X_test)
pred_boost_df = pd.DataFrame(pred_share, columns=['Predicted Share'], index=X_test.index)

pred_2024 = pd.concat([test[['Player']], pred_boost_df], axis=1)
pred_2024 = pred_2024.sort_values('Predicted Share', ascending = False)
pred_2024.head(5)

| | Player | Predicted Share |
|---|---|---|
| 314 | Giannis Antetokounmpo | 0.332592 |
| 398 | Shai Gilgeous-Alexander | 0.247208 |
| 143 | Nikola Jokić | 0.227964 |
| 479 | De'Aaron Fox | 0.044936 |
| 326 | Anthony Edwards | 0.031534 |


## Random Forest

In [8]:
#Initialize model (max_features = 24 considers all 24 features at each split)
rf_mod = RandomForestRegressor(max_features = 24, n_estimators = 500)

#Fit model
rf_mod.fit(X_train, y_train)

#Predict
pred_share = rf_mod.predict(X_test)
pred_rf_df = pd.DataFrame(pred_share, columns=['Predicted Share'], index=X_test.index)

pred_2024 = pd.concat([test[['Player']], pred_rf_df], axis=1)
pred_2024 = pred_2024.sort_values('Predicted Share', ascending = False)
pred_2024.head(5)

| | Player | Predicted Share |
|---|---|---|
| 314 | Giannis Antetokounmpo | 0.292686 |
| 143 | Nikola Jokić | 0.09295 |
| 398 | Shai Gilgeous-Alexander | 0.06382 |
| 242 | Anthony Davis | 0.021216 |
| 361 | Zion Williamson | 0.011178 |
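
To compare how the two models rank the same players, the two top-five outputs can be lined up side by side. This is only a sketch: it hard-codes the predicted shares printed above rather than reusing the notebook's DataFrames, and the column names (`XGBoost share`, `Random forest rank`, and so on) are my own labels. Players who appear in only one model's top five show up as `NaN` for the other model.

```python
import pandas as pd

# Top-five predicted shares copied from the two model outputs above.
boost_top5 = pd.Series({
    "Giannis Antetokounmpo": 0.332592,
    "Shai Gilgeous-Alexander": 0.247208,
    "Nikola Jokić": 0.227964,
    "De'Aaron Fox": 0.044936,
    "Anthony Edwards": 0.031534,
}, name="XGBoost share")

rf_top5 = pd.Series({
    "Giannis Antetokounmpo": 0.292686,
    "Nikola Jokić": 0.09295,
    "Shai Gilgeous-Alexander": 0.06382,
    "Anthony Davis": 0.021216,
    "Zion Williamson": 0.011178,
}, name="Random forest share")

# Align on player name (outer join); missing players get NaN.
comparison = pd.concat([boost_top5, rf_top5], axis=1)

# Rank within each model, where 1 is the highest predicted share.
comparison["XGBoost rank"] = comparison["XGBoost share"].rank(ascending=False)
comparison["Random forest rank"] = comparison["Random forest share"].rank(ascending=False)

print(comparison)
```

Inside the notebook itself, the same comparison could probably be built from `pred_boost_df` and `pred_rf_df` directly, since both are indexed by `X_test.index`.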


Both the random forest and XGBoost models predicted that Giannis would win the MVP. However, Nikola Jokić won the award; the random forest ranked him 2nd and XGBoost ranked him 3rd. Both models correctly predicted that Shai and Jokić would finish in the top three in voting.

So why did both models pick Giannis when he actually finished 4th and received only 1 first-place vote this year? I believe it is because the models only see raw production and team success. Giannis averaged 30.4 points per game on a team with a high win percentage, and these were the two most important variables, as seen in the variable_importance notebook. What the models cannot see is that the Bucks (Giannis's team) underperformed expectations despite that win percentage. So, going forward, I want to find a way to measure a team's performance relative to its expectations for the season instead of looking only at its overall success.
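
One possible way to express "performance relative to expectations" as a feature is the gap between a team's actual win percentage and an expected one. The sketch below is only an illustration under assumptions: `Exp_W%` and `W%_vs_exp` are hypothetical columns that do not exist in `training2024.csv` or `testing2024.csv`, the expected values would have to come from some outside source (for example, preseason projected win totals), and the rows are made-up toy numbers, not real players or teams.

```python
import pandas as pd

# Toy rows with made-up numbers, used only to show the calculation.
# 'Exp_W%' is a hypothetical column holding a preseason expected win percentage.
toy = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "W%": [0.60, 0.70, 0.55],
    "Exp_W%": [0.68, 0.55, 0.55],
})

# Positive values mean the team beat expectations; negative values mean it fell short.
toy["W%_vs_exp"] = toy["W%"] - toy["Exp_W%"]

print(toy)
```

If a column like this were available for every season in the training data, it could be appended to `features` so the models see both overall success (`W%`) and success relative to expectations.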