Now that we have gathered data from Basketball Reference using web scraping, cleaned our data, and conducted some exploratory data analysis, it's time to create our machine learning model to predict who will win the NBA MVP!

In [110]:
#Import Libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
#Machine Learning 
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, KFold

#Ignore Warnings
import warnings
warnings.filterwarnings('ignore')

In [2]:
#Read in data
df = pd.read_csv('all_stats.csv')

In [3]:
df.head(10)

Unnamed: 0.1,Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,PF,PTS,Year,Pts Won,Pts Max,Share,Team,Wins,Losses,W%
0,0,A.C. Green,PF,23,LAL,79,72,28.4,4.0,7.4,...,2.2,10.8,1987,0.0,780.0,0.0,Los Angeles Lakers,65,17,0.79268
1,1,Adrian Branch,SF,23,LAL,32,0,6.8,1.5,3.0,...,1.2,4.3,1987,0.0,780.0,0.0,Los Angeles Lakers,65,17,0.79268
2,2,Billy Thompson,SF,23,LAL,59,0,12.9,2.4,4.4,...,2.5,5.6,1987,0.0,780.0,0.0,Los Angeles Lakers,65,17,0.79268
3,3,Byron Scott,SG,25,LAL,82,82,33.3,6.8,13.8,...,2.0,17.0,1987,0.0,780.0,0.0,Los Angeles Lakers,65,17,0.79268
4,4,James Worthy,SF,25,LAL,82,82,34.4,7.9,14.7,...,2.5,19.4,1987,0.0,780.0,0.0,Los Angeles Lakers,65,17,0.79268
5,5,Kareem Abdul-Jabbar,C,39,LAL,78,78,31.3,7.2,12.7,...,3.1,17.5,1987,0.0,780.0,0.0,Los Angeles Lakers,65,17,0.79268
6,6,Kurt Rambis,PF,28,LAL,78,10,19.4,2.1,4.0,...,2.6,5.7,1987,0.0,780.0,0.0,Los Angeles Lakers,65,17,0.79268
7,7,Magic Johnson,PG,27,LAL,80,80,36.3,8.5,16.4,...,2.1,23.9,1987,733.0,780.0,0.94,Los Angeles Lakers,65,17,0.79268
8,8,Michael Cooper,SG,30,LAL,82,2,27.5,3.9,9.0,...,2.4,10.5,1987,0.0,780.0,0.0,Los Angeles Lakers,65,17,0.79268
9,9,Mike Smrek,C,24,LAL,35,3,6.7,0.9,1.7,...,2.0,2.2,1987,0.0,780.0,0.0,Los Angeles Lakers,65,17,0.79268


In [4]:
#Delete Unwanted Column
del df['Unnamed: 0']

# Missing Data

In [5]:
df.isna().sum().sort_values(ascending=False)

3P%        2333
FT%         574
2P%         110
eFG%         64
FG%          64
Player        0
PF            0
AST           0
STL           0
BLK           0
TOV           0
PTS           0
DRB           0
Year          0
Pts Won       0
Pts Max       0
Share         0
Team          0
Wins          0
Losses        0
TRB           0
FTA           0
ORB           0
Pos           0
FT            0
2PA           0
2P            0
3PA           0
3P            0
FGA           0
FG            0
MP            0
GS            0
G             0
Tm            0
Age           0
W%            0
dtype: int64

We do have some missing data. My assumption is that the missing data belongs to players who did not attempt a shot in the given category. For example, if a player never shot a 3, then their 3P% would be missing. Let's check this.

In [6]:
#Define dataframe with null 3P%
null_3P = df[df['3P%'].isnull()]
#Show 3PA Data
null_3P['3PA'].value_counts()

0.0    2333
Name: 3PA, dtype: int64

Everyone with a missing 3P% also has 0 3PA, so it does appear that those with null values never attempted a 3. Now we can explore the other categories with missing data. It is likely that the same reasoning applies.
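
As a small illustration of why zero attempts produce a missing percentage, here is a sketch on made-up data (the `toy` frame and its values are assumptions, not rows from `all_stats.csv`). A percentage is makes divided by attempts, so 0 / 0 comes out as NaN, and the whole check can be collapsed into a single boolean.

```python
import pandas as pd

# Toy data (made up for illustration): made and attempted three-pointers per game
toy = pd.DataFrame({
    'Player': ['A', 'B', 'C'],
    '3P': [1.2, 0.0, 0.0],
    '3PA': [3.0, 0.0, 0.5],
})

# A percentage is makes / attempts, so 0 / 0 comes out as NaN
toy['3P%'] = toy['3P'] / toy['3PA']

# True only if every row with a missing percentage has zero attempts
all_zero_attempts = bool((toy.loc[toy['3P%'].isna(), '3PA'] == 0).all())
print(all_zero_attempts)
```

Player C attempted shots and missed them all, so their percentage is 0.0 rather than missing; only player B ends up with a null value.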

In [7]:
#Dictionary with the null values and attempts
null_values = {
    'FT%': 'FTA',
    '2P%': '2PA',
    'eFG%': 'FGA',
    'FG%': 'FGA',
}

for null_key, attempt_key in null_values.items():
    null_rows = df[df[null_key].isnull()]
    value_counts = null_rows[attempt_key].value_counts()
    print(f'Null Value: {null_key}')
    print(value_counts)
    print()

Null Value: FT%
0.0    574
Name: FTA, dtype: int64

Null Value: 2P%
0.0    110
Name: 2PA, dtype: int64

Null Value: eFG%
0.0    64
Name: FGA, dtype: int64

Null Value: FG%
0.0    64
Name: FGA, dtype: int64



For all of the columns with missing data, the value is missing because the player never took that type of shot. To fill the missing values I will use the median for that position. We could also drop the rows with missing values; however, there are rare cases where players who do not attempt 3s are in the MVP running. For example, Shaquille O'Neal has had multiple seasons where he attempted zero 3-point shots but won the MVP in the 1999-2000 season.

In [8]:
# 'Pos' is the column that specifies the player's position
columns_to_fill = ['3P%', 'FT%', '2P%', 'eFG%', 'FG%']
for column in columns_to_fill:
    # Median of the column within each position, aligned to every row
    median_by_position = df.groupby('Pos')[column].transform('median')
    # Assign back instead of using inplace=True on a column selection
    df[column] = df[column].fillna(median_by_position)
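
To see what a position-median fill does on a small scale, here is a sketch on a toy frame (the values are invented for illustration). `groupby(...).transform('median')` returns one value per row, so it lines up with the original column and can be passed straight to `fillna`.

```python
import numpy as np
import pandas as pd

# Toy data (made up): two positions, each with one missing FT%
toy = pd.DataFrame({
    'Pos': ['C', 'C', 'C', 'PG', 'PG'],
    'FT%': [0.50, 0.70, np.nan, 0.85, np.nan],
})

# Median of each position, broadcast back to every row of that position
median_by_position = toy.groupby('Pos')['FT%'].transform('median')

# Missing values take the median of their own position
toy['FT%'] = toy['FT%'].fillna(median_by_position)
print(toy)
```

The missing center gets 0.60 (the median of 0.50 and 0.70) and the missing point guard gets 0.85.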

In [9]:
df.isna().sum().sort_values(ascending=False)

Player     0
FT%        0
DRB        0
TRB        0
AST        0
STL        0
BLK        0
TOV        0
PF         0
PTS        0
Year       0
Pts Won    0
Pts Max    0
Share      0
Team       0
Wins       0
Losses     0
ORB        0
FTA        0
Pos        0
FT         0
Age        0
Tm         0
G          0
GS         0
MP         0
FG         0
FGA        0
FG%        0
3P         0
3PA        0
3P%        0
2P         0
2PA        0
2P%        0
eFG%       0
W%         0
dtype: int64

No more missing data, nice! Now we need to create dummy variables for the Pos and Team columns.

# Feature Engineering

In [10]:
# Create dummy variables for the Pos and Team columns
position_dummies = pd.get_dummies(df['Pos'])
team_dummies = pd.get_dummies(df['Team'])
df_with_dummies = pd.concat([df, position_dummies, team_dummies], axis=1)
# Drop the original categorical columns now that they are encoded
df_with_dummies.drop(['Pos', 'Team'], axis=1, inplace=True)
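
One thing to watch with unprefixed dummies: the position dummy for power forwards is named 'PF', which is also the name of the personal fouls column, so 'PF' shows up twice in the column listing in the next cell. Below is a sketch of one way to keep the names distinct, assuming hypothetical `Pos_` and `Team_` prefixes that are not used anywhere else in this notebook (the feature list would need the same names).

```python
import pandas as pd

# Toy data (made up): 'PF' here is personal fouls per game
toy = pd.DataFrame({
    'Pos': ['PF', 'C', 'PG'],
    'PF': [2.2, 3.1, 2.1],
    'Team': ['Los Angeles Lakers', 'Los Angeles Lakers', 'Utah Jazz'],
})

# Prefixes keep the dummy names distinct from existing stat columns
position_dummies = pd.get_dummies(toy['Pos'], prefix='Pos')
team_dummies = pd.get_dummies(toy['Team'], prefix='Team')

toy_with_dummies = pd.concat(
    [toy.drop(columns=['Pos', 'Team']), position_dummies, team_dummies],
    axis=1,
)
print(list(toy_with_dummies.columns))
```

With the prefixes, the power forward dummy becomes 'Pos_PF' and every column name is unique.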

We will now add features that measure how a player's points, rebounds, assists, blocks, and steals compare to the average for that year. A value above 1 means the player was above average in that category.

In [11]:
statistics_columns = ['PTS', 'ORB', 'DRB', 'TRB', 'STL', 'AST', 'BLK']
# Average of each statistic within a year, aligned to every player row
average_stats_for_year = df_with_dummies.groupby('Year')[statistics_columns].transform('mean')
# Calculate the new columns by dividing player's statistics by average statistics for their year
for col in statistics_columns:
    df_with_dummies[f'{col}_C'] = df_with_dummies[col] / average_stats_for_year[col]
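
Here is the same ratio worked on a tiny made-up example, to show what the `_C` columns contain. The numbers are invented; only the `groupby('Year')` / `transform('mean')` pattern matches the cell above.

```python
import pandas as pd

# Toy data (made up): two players in each of two seasons
toy = pd.DataFrame({
    'Year': [2000, 2000, 2001, 2001],
    'PTS': [10.0, 30.0, 20.0, 20.0],
})

# Season average, repeated on every row of that season
year_mean = toy.groupby('Year')['PTS'].transform('mean')

# Ratio of the player's scoring to the season average
toy['PTS_C'] = toy['PTS'] / year_mean
print(toy)
```

In 2000 the average is 20, so the two players get 0.5 and 1.5. In 2001 both players sit exactly on the average and get 1.0.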

In [12]:
df_with_dummies.columns

Index(['Player', 'Age', 'Tm', 'G', 'GS', 'MP', 'FG', 'FGA', 'FG%', '3P', '3PA',
       '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB', 'DRB',
       'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS', 'Year', 'Pts Won',
       'Pts Max', 'Share', 'Wins', 'Losses', 'W%', 'C', 'C-PF', 'PF', 'PF-C',
       'PF-SF', 'PG', 'PG-SF', 'PG-SG', 'SF', 'SF-C', 'SF-PF', 'SF-SG', 'SG',
       'SG-PF', 'SG-PG', 'SG-PG-SF', 'SG-SF', 'Atlanta Hawks',
       'Boston Celtics', 'Brooklyn Nets', 'Charlotte Bobcats',
       'Charlotte Hornets', 'Chicago Bulls', 'Cleveland Cavaliers',
       'Dallas Mavericks', 'Denver Nuggets', 'Detroit Pistons',
       'Golden State Warriors', 'Houston Rockets', 'Indiana Pacers',
       'Los Angeles Clippers', 'Los Angeles Lakers', 'Memphis Grizzlies',
       'Miami Heat', 'Milwaukee Bucks', 'Minnesota Timberwolves',
       'New Jersey Nets', 'New Orleans Hornets', 'New Orleans Pelicans',
       'New Orleans/Oklahoma City Hornets', 'New York Knicks',
       'Okl

In [13]:
features = ['Age', 'G', 'GS', 'MP', 'FG', 'FGA', 'FG%', '3P',
       '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB',
       'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS', 'Year',
       'Wins', 'Losses', 'W%', 'C', 'C-PF', 'PF', 'PF-C',
       'PF-SF', 'PG', 'PG-SF', 'PG-SG', 'SF', 'SF-C', 'SF-PF', 'SF-SG', 'SG',
       'SG-PF', 'SG-PG', 'SG-PG-SF', 'SG-SF', 'Atlanta Hawks',
       'Boston Celtics', 'Brooklyn Nets', 'Charlotte Bobcats',
       'Charlotte Hornets', 'Chicago Bulls', 'Cleveland Cavaliers',
       'Dallas Mavericks', 'Denver Nuggets', 'Detroit Pistons',
       'Golden State Warriors', 'Houston Rockets', 'Indiana Pacers',
       'Los Angeles Clippers', 'Los Angeles Lakers', 'Memphis Grizzlies',
       'Miami Heat', 'Milwaukee Bucks', 'Minnesota Timberwolves',
       'New Jersey Nets', 'New Orleans Hornets', 'New Orleans Pelicans',
       'New Orleans/Oklahoma City Hornets', 'New York Knicks',
       'Oklahoma City Thunder', 'Orlando Magic', 'Philadelphia 76ers',
       'Phoenix Suns', 'Portland Trail Blazers', 'Sacramento Kings',
       'San Antonio Spurs', 'Seattle SuperSonics', 'Toronto Raptors',
       'Utah Jazz', 'Vancouver Grizzlies', 'Washington Bullets',
       'Washington Wizards','PTS_C', 'ORB_C', 'DRB_C', 'TRB_C', 'STL_C',
       'AST_C', 'BLK_C']

Now we need to define our train and test data. We will use the data from 1987-2022 as the training data and then test on the 2023 season.

In [14]:
#Training vs Testing
train = df_with_dummies[df_with_dummies['Year'] < 2023]
test = df_with_dummies[df_with_dummies['Year'] == 2023]

In [15]:
X_train = train[features]
y_train = train['Share']
X_test = test[features]
y_test = test['Share']

# Model

Let's start by trying a simple ridge regression model before finding a model that works best.

In [16]:
#Ridge Regression
Ridge = Ridge(alpha=.1)
Ridge.fit(X_train, y_train)

In [17]:
pred = Ridge.predict(X_test)

In [18]:
pred = pd.DataFrame(pred, columns=['Predicted Share'], index=test.index)

In [19]:
pred.head(10)

Unnamed: 0,Predicted Share
264,0.007772
265,0.024093
266,0.011023
267,0.1973
268,-0.002102
269,-0.014238
270,-0.003205
271,-0.018103
272,-0.007461
273,0.047002


Now we need to compare this to our actual share values and then create a way to measure error so we can examine multiple models.

In [20]:
results = pd.concat([test[['Player', 'Share']], pred], axis=1)

In [21]:
results.sort_values('Share', ascending=False).head(10)

Unnamed: 0,Player,Share,Predicted Share
16218,Joel Embiid,0.915,0.180187
789,Nikola Jokić,0.674,0.166851
267,Giannis Antetokounmpo,0.606,0.1973
3067,Jayson Tatum,0.28,0.12112
1375,Shai Gilgeous-Alexander,0.046,0.15019
14973,Donovan Mitchell,0.03,0.082808
4561,Domantas Sabonis,0.027,0.091319
359,Luka Dončić,0.01,0.177
7160,Stephen Curry,0.005,0.101524
11593,Jimmy Butler,0.003,0.109498


We need a way to measure whether this model did a good job so that we can compare the performance of multiple models. Because we care most about predicting the Share of the best players, the error metric should focus on them. To do this we need to rank the players based on Share and Predicted Share.

In [22]:
results = results.sort_values('Share', ascending=False)
results['Rank'] = results['Share'].rank(ascending=False).astype(int)
results.head(6)

Unnamed: 0,Player,Share,Predicted Share,Rank
16218,Joel Embiid,0.915,0.180187,1
789,Nikola Jokić,0.674,0.166851,2
267,Giannis Antetokounmpo,0.606,0.1973,3
3067,Jayson Tatum,0.28,0.12112,4
1375,Shai Gilgeous-Alexander,0.046,0.15019,5
14973,Donovan Mitchell,0.03,0.082808,6


In [23]:
results = results.sort_values('Predicted Share', ascending=False)
results['Predicted Rank'] = results['Predicted Share'].rank(ascending=False).astype(int)
results.head(6)

Unnamed: 0,Player,Share,Predicted Share,Rank,Predicted Rank
267,Giannis Antetokounmpo,0.606,0.1973,3,1
16218,Joel Embiid,0.915,0.180187,1,2
359,Luka Dončić,0.01,0.177,8,3
789,Nikola Jokić,0.674,0.166851,2,4
1375,Shai Gilgeous-Alexander,0.046,0.15019,5,5
9153,LeBron James,0.0,0.134393,276,6


In [24]:
results['Difference'] = results['Rank'] - results['Predicted Rank']

In [25]:
results[results['Rank'] <= 6].sort_values('Difference', ascending=True)

Unnamed: 0,Player,Share,Predicted Share,Rank,Predicted Rank,Difference
14973,Donovan Mitchell,0.03,0.082808,6,21,-15
3067,Jayson Tatum,0.28,0.12112,4,9,-5
789,Nikola Jokić,0.674,0.166851,2,4,-2
16218,Joel Embiid,0.915,0.180187,1,2,-1
1375,Shai Gilgeous-Alexander,0.046,0.15019,5,5,0
267,Giannis Antetokounmpo,0.606,0.1973,3,1,2


In [46]:
def ranks(df):
    df = df.sort_values('Share', ascending=False)
    df['Rank'] = df['Share'].rank(ascending=False).astype(int)
    df = df.sort_values('Predicted Share', ascending=False)  
    df['Predicted Rank'] = df['Predicted Share'].rank(ascending=False).astype(int)
    df['Difference'] = df['Rank'] - df['Predicted Rank']
    
    return df
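
A note on ties: most players have a Share of 0.0, and `rank` uses `method='average'` by default, so tied players all receive the average of the ranks they span. That is likely why LeBron James shows a Rank of 276 in the table above. The sketch below uses made-up shares to compare the default with `method='first'` (the option used later in `backtesting`) and `method='min'`.

```python
import pandas as pd

# Toy shares (made up): two vote-getters and three players tied at zero
share = pd.Series([0.9, 0.5, 0.0, 0.0, 0.0], index=['A', 'B', 'C', 'D', 'E'])

# Default: tied values share the average of ranks 3, 4 and 5
average_rank = share.rank(ascending=False)

# 'first': ties are broken by order of appearance
first_rank = share.rank(ascending=False, method='first')

# 'min': ties all take the lowest rank in the tied group
min_rank = share.rank(ascending=False, method='min')

print(pd.DataFrame({'average': average_rank, 'first': first_rank, 'min': min_rank}))
```

The choice does not change the ranks of the top vote-getters here, but it does change the ranks reported for everyone tied at zero.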

To create an error metric we will base it on the top 6 ranks. For example, Shai Gilgeous-Alexander finished in the top 6 and our model also ranked him in the top 6, so we are not penalized. However, Donovan Mitchell finished in the top 6 but our model ranked him 21st, so we are penalized.

In [26]:
#Error metric function
def error_metric(results):
    scores=[]
    results = results.sort_values('Share', ascending=False)
    for index, row in results.iterrows():
        if row['Rank'] <= 6 and row['Predicted Rank'] <= 6:
            scores.append(1)
        elif row['Predicted Rank'] > 6 and row['Rank'] <= 6:
            scores.append(row['Rank'] / row['Predicted Rank'])
        
    return sum(scores) / 6
        

In [27]:
error_metric(results)

0.7883597883597884
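
To make the number less of a black box, here is the same calculation done by hand on the six (Rank, Predicted Rank) pairs from the top 6 table above. The list `top6` is just those pairs typed out; the logic mirrors `error_metric`.

```python
# (Rank, Predicted Rank) for the six actual top finishers in 2023,
# copied from the results table: Embiid, Jokić, Giannis, Tatum, SGA, Mitchell
top6 = [(1, 2), (2, 4), (3, 1), (4, 9), (5, 5), (6, 21)]

scores = []
for rank, predicted_rank in top6:
    if predicted_rank <= 6:
        # Predicted inside the top 6: full credit
        scores.append(1)
    else:
        # Predicted outside the top 6: partial credit of rank / predicted rank
        scores.append(rank / predicted_rank)

score = sum(scores) / 6
print(round(score, 4))  # 0.7884
```

Four players earn full credit, Tatum earns 4/9 and Mitchell earns 6/21, which gives (4 + 0.444 + 0.286) / 6 ≈ 0.788.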

Note that we want this number to be close to one: a value of one means all of the actual top 6 players were also predicted in the top 6. We also want to test the results over multiple years, because 2023 may have been unusual and a model judged on it alone may not generalize. Let's look at results for multiple years. Once we have done this, we can look into different models.

In [28]:
def backtesting(model):
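    years = list(range(1987, 2024))
    scores = []
    
    # years[23:] starts at 2010, so each season from 2010 to 2023 is tested in turn,
    # training only on the seasons before it
    for year in years[23:]: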
        train = df_with_dummies[df_with_dummies['Year'] < year]
        test = df_with_dummies[df_with_dummies['Year'] == year]
        
        X_train = train[features]
        y_train = train['Share']
        X_test = test[features]
        
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        results = test[['Player', 'Share']].copy()
        results['Predicted Share'] = y_pred
        results['Rank'] = results['Share'].rank(ascending=False, method='first').astype(int)
        results['Predicted Rank'] = results['Predicted Share'].rank(ascending=False, method='first').astype(int)
        
        scores.append(error_metric(results))
        
    return sum(scores) / len(scores)
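
To make the train and test windows explicit, this short sketch prints which seasons are used for a few of the backtesting steps. It only reproduces the year arithmetic from `backtesting`; no model is fit.

```python
years = list(range(1987, 2024))

# Same slice as in backtesting: the first test season is years[23]
test_years = years[23:]

# Show the first two windows and the last one
for year in test_years[:2] + test_years[-1:]:
    train_years = [y for y in years if y < year]
    print(f'test {year}: train on {train_years[0]}-{train_years[-1]}')

print(len(test_years), 'test seasons')
```

Each step trains on every season before the test season, so the training set grows by one season at a time.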


In [29]:
Ridge = backtesting(Ridge)
print(Ridge)

0.7907361597837789


In [30]:
#Define Results DF
results_df = pd.DataFrame(columns=['Model', 'Results'])
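# DataFrame.append was removed in pandas 2.0, so use pd.concat to add the row
pd.concat([results_df, pd.DataFrame([{'Model': 'Ridge', 'Results': Ridge}])], ignore_index=True)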

Unnamed: 0,Model,Results
0,Ridge,0.790736


Backtesting over 2010-2023 gives 0.791, slightly better than the 0.788 from 2023 alone. The small gap is a good sign that 2023 was not an outlier. Now let's look at the performance of multiple models.

In [31]:
#Models
models = {'LinReg': LinearRegression(),
          'Lasso': Lasso(alpha=0.1),
          'Random Forest': RandomForestRegressor(random_state=123, n_jobs=-1),
          'KNN': KNeighborsRegressor(),
          'Gradient Boost': GradientBoostingRegressor(random_state=123),
         }

#Generate results
for model_name, model_instance in models.items():
    # Average top 6 score across the backtested seasons
    score = backtesting(model_instance)
    results_df = pd.concat([results_df, pd.DataFrame([{'Model': model_name, 'Results': score}])], ignore_index=True)

In [32]:
Ridge_score = {"Model": "Ridge", "Results": Ridge}
results_df = pd.concat([results_df, pd.DataFrame([Ridge_score])], ignore_index=True)

In [33]:
results_df.sort_values('Results', ascending=False)

Unnamed: 0,Model,Results
2,Random Forest,0.817981
3,KNN,0.806624
4,Gradient Boost,0.799862
0,LinReg,0.794931
5,Ridge,0.790736
1,Lasso,0.132344


Random Forest scored the best of the six models that we tested. Going forward I want to do hyperparameter tuning on the random forest model to improve its score and prepare the model to make predictions for the 2023-2024 NBA season, which starts in October 2023.
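
As a rough idea of what that tuning step could look like, here is a minimal sketch using `GridSearchCV` on synthetic data. Everything specific in it is an assumption: the random data, the small parameter grid, and the use of mean squared error for scoring. It does not use the top 6 metric from this notebook, and shuffled K-fold ignores the season order that `backtesting` respects, so treat it as a starting point rather than the final approach.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in data (not the NBA data): 120 rows, 5 features
rng = np.random.default_rng(123)
X = rng.normal(size=(120, 5))
y = X[:, 0] * 0.5 + rng.normal(scale=0.1, size=120)

# Hypothetical grid; the values are placeholders, not tuned recommendations
param_grid = {'n_estimators': [20, 50], 'max_depth': [3, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=123),
    param_grid,
    cv=KFold(n_splits=3, shuffle=True, random_state=123),
    scoring='neg_mean_squared_error',
)
search.fit(X, y)
print(search.best_params_)
```

An alternative that stays closer to this notebook would be to loop over candidate settings and score each one with `backtesting`, since that keeps both the season ordering and the top 6 metric.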