Predicting the NBA MVP: Machine Learning Project [Part 3 of 3]

In this notebook, we used numerical predictors to train a machine learning model which will use statistics such as points, assists, steals, rebounds, blocks, etc. to predict what rank a player would finish in the MVP race if it was based solely on stats. We find Ridge regression to be very accurate and the Standard Scaler method to be most accurate in predicting the MVP of each year. We identified the average precision error metric to more fine tune the prediction model, however, the model did not improve due to this. We as well ran diagnotics on our model, determining how much the machine learning model's predicted rank varied from the actual MVP rank of the player. In our tests, it showed that our average precision was about 0.85, which is pretty accurate. With that average precision, we created a backtesting model that tests across several years to tune the model to become more accurate, and eventually be able to predict future MVPs more effectively in the upcoming years.

https://www.youtube.com/watch?v=3cn1nHlbFVw

In [1]:
import pandas as pd

In [2]:
stats = pd.read_csv("player_mvp_stats.csv")

In [3]:
stats

Unnamed: 0.1,Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,Pts Max,Share,Team,W,L,W/L%,GB,PS/G,PA/G,SRS
0,0,A.C. Green,PF,30,PHO,82,55,34.5,5.7,11.3,...,0.0,0.000,Phoenix Suns,56,26,0.683,7.0,108.2,103.4,4.68
1,1,Cedric Ceballos,SF,24,PHO,53,43,30.2,8.0,15.0,...,0.0,0.000,Phoenix Suns,56,26,0.683,7.0,108.2,103.4,4.68
2,2,Charles Barkley,PF,30,PHO,65,65,35.4,8.0,16.1,...,1010.0,0.005,Phoenix Suns,56,26,0.683,7.0,108.2,103.4,4.68
3,3,Dan Majerle,SG,28,PHO,80,76,40.1,6.0,14.2,...,0.0,0.000,Phoenix Suns,56,26,0.683,7.0,108.2,103.4,4.68
4,4,Danny Ainge,SG,34,PHO,68,1,22.9,3.3,7.9,...,0.0,0.000,Phoenix Suns,56,26,0.683,7.0,108.2,103.4,4.68
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13531,13531,Spencer Hawes,PF,28,MIL,54,1,14.8,2.5,5.1,...,0.0,0.000,Milwaukee Bucks,42,40,0.512,9.0,103.6,103.8,-0.45
13532,13532,Steve Novak,PF,33,MIL,8,0,2.8,0.3,0.9,...,0.0,0.000,Milwaukee Bucks,42,40,0.512,9.0,103.6,103.8,-0.45
13533,13533,Terrence Jones,PF,25,MIL,54,12,23.5,4.3,9.1,...,0.0,0.000,Milwaukee Bucks,42,40,0.512,9.0,103.6,103.8,-0.45
13534,13534,Thon Maker,C,19,MIL,57,34,9.9,1.5,3.2,...,0.0,0.000,Milwaukee Bucks,42,40,0.512,9.0,103.6,103.8,-0.45


In [4]:
del stats["Unnamed: 0"]

In [5]:
stats

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,...,Pts Max,Share,Team,W,L,W/L%,GB,PS/G,PA/G,SRS
0,A.C. Green,PF,30,PHO,82,55,34.5,5.7,11.3,0.502,...,0.0,0.000,Phoenix Suns,56,26,0.683,7.0,108.2,103.4,4.68
1,Cedric Ceballos,SF,24,PHO,53,43,30.2,8.0,15.0,0.535,...,0.0,0.000,Phoenix Suns,56,26,0.683,7.0,108.2,103.4,4.68
2,Charles Barkley,PF,30,PHO,65,65,35.4,8.0,16.1,0.495,...,1010.0,0.005,Phoenix Suns,56,26,0.683,7.0,108.2,103.4,4.68
3,Dan Majerle,SG,28,PHO,80,76,40.1,6.0,14.2,0.418,...,0.0,0.000,Phoenix Suns,56,26,0.683,7.0,108.2,103.4,4.68
4,Danny Ainge,SG,34,PHO,68,1,22.9,3.3,7.9,0.417,...,0.0,0.000,Phoenix Suns,56,26,0.683,7.0,108.2,103.4,4.68
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13531,Spencer Hawes,PF,28,MIL,54,1,14.8,2.5,5.1,0.484,...,0.0,0.000,Milwaukee Bucks,42,40,0.512,9.0,103.6,103.8,-0.45
13532,Steve Novak,PF,33,MIL,8,0,2.8,0.3,0.9,0.286,...,0.0,0.000,Milwaukee Bucks,42,40,0.512,9.0,103.6,103.8,-0.45
13533,Terrence Jones,PF,25,MIL,54,12,23.5,4.3,9.1,0.470,...,0.0,0.000,Milwaukee Bucks,42,40,0.512,9.0,103.6,103.8,-0.45
13534,Thon Maker,C,19,MIL,57,34,9.9,1.5,3.2,0.459,...,0.0,0.000,Milwaukee Bucks,42,40,0.512,9.0,103.6,103.8,-0.45


In [6]:
pd.isnull(stats).sum()

Player        0
Pos           0
Age           0
Tm            0
G             0
GS            0
MP            0
FG            0
FGA           0
FG%          53
3P            0
3PA           0
3P%        1909
2P            0
2PA           0
2P%          94
eFG%         53
FT            0
FTA           0
FT%         499
ORB           0
DRB           0
TRB           0
AST           0
STL           0
BLK           0
TOV           0
PF            0
PTS           0
Year          0
Pts Won       0
Pts Max       0
Share         0
Team          0
W             0
L             0
W/L%          0
GB            0
PS/G          0
PA/G          0
SRS           0
dtype: int64

In [7]:
stats[pd.isnull(stats["3P%"])][["Player", "3PA"]]

Unnamed: 0,Player,3PA
8,Jerrod Mustaf,0.0
12,Mark West,0.0
16,Aaron Swinson,0.0
17,Antonio Lang,0.0
28,Wayman Tisdale,0.0
...,...,...
13505,Evan Eschmeyer,0.0
13506,Gheorghe Mureșan,0.0
13508,Jim McIlvaine,0.0
13514,Mark Hendrickson,0.0


In [8]:
stats[pd.isnull(stats["FT%"])][["Player", "FTA"]]

Unnamed: 0,Player,FTA
37,John Coker,0.0
52,Jason Sasser,0.0
63,Adrian Caldwell,0.0
79,Bruno Šundov,0.0
118,Jamal Robinson,0.0
...,...,...
13326,Jason Hart,0.0
13370,George King,0.0
13450,Luke Zeller,0.0
13498,Malcolm Lee,0.0


In [9]:
# We are replacing all the values that are null with 0 to make it easier to go through the data
# All the values that are turned to zero are not necessarily true
stats = stats.fillna(0)

In [10]:
# Training a Machine Learning Model
stats.columns

Index(['Player', 'Pos', 'Age', 'Tm', 'G', 'GS', 'MP', 'FG', 'FGA', 'FG%', '3P',
       '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB',
       'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS', 'Year',
       'Pts Won', 'Pts Max', 'Share', 'Team', 'W', 'L', 'W/L%', 'GB', 'PS/G',
       'PA/G', 'SRS'],
      dtype='object')

In [11]:
predictors = ['Age', 'G', 'GS', 'MP', 'FG', 'FGA', 'FG%', '3P',
       '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB',
       'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS', 'Year',
       'W', 'L', 'W/L%', 'GB', 'PS/G', 'PA/G', 'SRS']

In [12]:
train = stats[stats["Year"] < 2022]

In [13]:
test = stats[stats["Year"] == 2022]

In [14]:
!pip install scikit-learn



In [15]:
from sklearn.linear_model import Ridge

reg = Ridge(alpha = .1)

In [16]:
reg.fit(train[predictors], train["Share"])

Ridge(alpha=0.1)

In [17]:
predictions = reg.predict(test[predictors])

In [18]:
predictions = pd.DataFrame(predictions, columns = ["Predictions"], index = test.index)

In [19]:
predictions

Unnamed: 0,Predictions
563,0.013738
564,-0.026914
565,-0.005550
566,0.015746
567,-0.004158
...,...
11610,-0.019983
11611,-0.010407
11612,0.003836
11613,0.001286


In [20]:
combination = pd.concat([test[["Player", "Share"]], predictions], axis = 1)

In [21]:
combination

Unnamed: 0,Player,Share,Predictions
563,Aaron Gordon,0.0,0.013738
564,Austin Rivers,0.0,-0.026914
565,Bol Bol,0.0,-0.005550
566,Bones Hyland,0.0,0.015746
567,Bryn Forbes,0.0,-0.004158
...,...,...,...
11610,Micah Potter,0.0,-0.019983
11611,Rodney McGruder,0.0,-0.010407
11612,Saben Lee,0.0,0.003836
11613,Saddiq Bey,0.0,0.001286


In [22]:
combination.sort_values("Share", ascending = False).head(10)

Unnamed: 0,Player,Share,Predictions
578,Nikola Jokić,0.875,0.184572
752,Joel Embiid,0.706,0.186741
10812,Giannis Antetokounmpo,0.595,0.216179
822,Devin Booker,0.216,0.089173
10629,Luka Dončić,0.146,0.155767
1094,Jayson Tatum,0.043,0.093299
11328,Ja Morant,0.01,0.119862
6046,Stephen Curry,0.004,0.091791
820,Chris Paul,0.002,0.077226
7697,LeBron James,0.001,0.154754


In [23]:
# Identifying an Error Metric
from sklearn.metrics import mean_squared_error

mean_squared_error(combination["Share"], combination["Predictions"])

0.0022362850667278925

In [24]:
combination["Share"].value_counts()

0.000    593
0.001      3
0.875      1
0.706      1
0.002      1
0.216      1
0.043      1
0.004      1
0.146      1
0.595      1
0.010      1
Name: Share, dtype: int64

In [25]:
combination = combination.sort_values("Share", ascending = False)
combination["Rk"] = list(range(1, combination.shape[0] + 1))

In [26]:
combination.head(10)

Unnamed: 0,Player,Share,Predictions,Rk
578,Nikola Jokić,0.875,0.184572,1
752,Joel Embiid,0.706,0.186741,2
10812,Giannis Antetokounmpo,0.595,0.216179,3
822,Devin Booker,0.216,0.089173,4
10629,Luka Dončić,0.146,0.155767,5
1094,Jayson Tatum,0.043,0.093299,6
11328,Ja Morant,0.01,0.119862,7
6046,Stephen Curry,0.004,0.091791,8
820,Chris Paul,0.002,0.077226,9
7697,LeBron James,0.001,0.154754,10


In [27]:
combination = combination.sort_values("Predictions", ascending = False)
combination["Predicted RK"] = list(range(1, combination.shape[0] + 1))

In [28]:
combination.head(35)

Unnamed: 0,Player,Share,Predictions,Rk,Predicted RK
10812,Giannis Antetokounmpo,0.595,0.216179,3,1
752,Joel Embiid,0.706,0.186741,2,2
578,Nikola Jokić,0.875,0.184572,1,3
10629,Luka Dončić,0.146,0.155767,5,4
7697,LeBron James,0.001,0.154754,10,5
5833,Kevin Durant,0.001,0.138624,12,6
11328,Ja Morant,0.01,0.119862,7,7
10954,Trae Young,0.0,0.111152,289,8
7687,Anthony Davis,0.0,0.104127,112,9
751,James Harden,0.0,0.10402,393,10


In [29]:
combination.sort_values("Share", ascending = False).head(10)

Unnamed: 0,Player,Share,Predictions,Rk,Predicted RK
578,Nikola Jokić,0.875,0.184572,1,3
752,Joel Embiid,0.706,0.186741,2,2
10812,Giannis Antetokounmpo,0.595,0.216179,3,1
822,Devin Booker,0.216,0.089173,4,17
10629,Luka Dončić,0.146,0.155767,5,4
1094,Jayson Tatum,0.043,0.093299,6,13
11328,Ja Morant,0.01,0.119862,7,7
6046,Stephen Curry,0.004,0.091791,8,15
820,Chris Paul,0.002,0.077226,9,21
3710,DeMar DeRozan,0.001,0.097854,11,11


In [30]:
# If you are a top-5 vote getter, are you in the top-5 of our prediction model
# Creating an error metric to determine this: Average Precision
def find_ap(combination):
    actual = combination.sort_values("Share", ascending=False).head(5)
    predicted = combination.sort_values("Predictions", ascending=False)
    ps = []
    found = 0
    seen = 1
    for index,row in predicted.iterrows():
        if row["Player"] in actual["Player"].values:
            found += 1
            ps.append(found / seen)
        seen += 1

    return sum(ps) / len(ps)

In [31]:
ap = find_ap(combination)

In [32]:
# This error metric will help us predict the top 5; i.e. it will tell us how accurate our machine learning model is
ap

0.8588235294117647

In [33]:
years = list(range(1994, 2023))

In [34]:
aps = []
all_predictions = []
for year in years[5:]:
    train = stats[stats["Year"] < year]
    tet = stats[stats["Year"] == year]
    reg.fit(train[predictors], train["Share"])
    predictions = reg.predict(test[predictors])
    predictions = pd.DataFrame(predictions, columns = ["Predictions"], index = test.index)
    combination = pd.concat([test[["Player", "Share"]], predictions], axis = 1)
    all_predictions.append(combination)
    aps.append(find_ap(combination))

In [35]:
sum(aps) / len(aps)

0.8203655821302881

In [36]:
# Adding ranks to the machine learning model to perform diagnostics
def add_ranks(predictions):
    predictions = predictions.sort_values("Predictions", ascending = False)
    predictions["Predicted_Rk"] = list(range(1, predictions.shape[0] + 1))
    predictions = predictions.sort_values("Share", ascending = False)
    predictions["Rk"] = list(range(1, predictions.shape[0] + 1))
    predictions["Diff"] = (predictions["Rk"] - predictions["Predicted_Rk"])
    return predictions

In [37]:
ranking = add_ranks(all_predictions[1])
ranking[ranking["Rk"] < 6].sort_values("Diff", ascending = False)

Unnamed: 0,Player,Share,Predictions,Predicted_Rk,Rk,Diff
10812,Giannis Antetokounmpo,0.595,0.23352,1,3,2
752,Joel Embiid,0.706,0.213438,2,2,0
578,Nikola Jokić,0.875,0.184082,3,1,-2
10629,Luka Dončić,0.146,0.128075,8,5,-3
822,Devin Booker,0.216,0.096236,13,4,-9


In [58]:
# Building an Average Precision testing function
def backtest(stats, model, year, predictors):
    aps = []
    all_predictions = []
    for year in years[5:]:
        train = stats[stats["Year"] < year]
        test = stats[stats["Year"] == year]
        reg.fit(train[predictors], train["Share"])
        predictions = reg.predict(test[predictors])
        predictions = pd.DataFrame(predictions, columns = ["Predictions"], index = test.index)
        combination = pd.concat([test[["Player", "Share"]], predictions], axis = 1)
        combination = add_ranks(combination)
        all_predictions.append(combination)
        aps.append(find_ap(combination))
    return sum(aps) / len(aps), aps, pd.concat(all_predictions)

In [39]:
mean_ap, aps, all_predictions = backtest(stats, r
                                         eg, years[5:], predictors)

In [59]:
mean_ap

0.8203655821302881

In [60]:
# Diagnosing Model Performance
all_predictions[all_predictions["Rk"] < 5].sort_values("Diff").head(10)

Unnamed: 0,Player,Share,Predictions,Predicted_Rk,Rk,Diff
822,Devin Booker,0.216,0.089173,17,4,-13
822,Devin Booker,0.216,0.09976,17,4,-13
822,Devin Booker,0.216,0.101049,17,4,-13
822,Devin Booker,0.216,0.096454,17,4,-13
822,Devin Booker,0.216,0.100112,16,4,-12
822,Devin Booker,0.216,0.10283,16,4,-12
822,Devin Booker,0.216,0.103475,16,4,-12
822,Devin Booker,0.216,0.09718,16,4,-12
822,Devin Booker,0.216,0.102917,16,4,-12
822,Devin Booker,0.216,0.099195,16,4,-12


In [61]:
pd.concat([pd.Series(reg.coef_), pd.Series(predictors)], axis = 1).sort_values(0, ascending = False)

Unnamed: 0,0,1
22,0.455871,BLK
7,0.092781,3P
10,0.0869,2P
13,0.077087,eFG%
36,0.061988,STL_R
5,0.047691,FGA
14,0.039088,FT
34,0.031116,PTS_R
35,0.030628,AST_R
18,0.025082,DRB


In [62]:
# Applying more predictors
stats_ratios = stats[["PTS", "AST", "STL", "BLK", "3P", "Year"]].groupby("Year").apply(lambda x: x / x.mean())

In [63]:
stats_ratios

Unnamed: 0,PTS,AST,STL,BLK,3P,Year
0,1.770767,0.851797,1.234513,1.131387,0.376284,1.0
1,2.300792,0.851797,1.508850,0.905109,0.000000,1.0
2,2.601943,2.304861,2.194690,1.357664,2.633987,1.0
3,1.987595,1.703593,2.194690,1.131387,9.030812,1.0
4,1.072097,1.302748,1.097345,0.226277,4.515406,1.0
...,...,...,...,...,...,...
13531,0.735752,0.819562,0.479763,1.528302,0.650951,1.0
13532,0.071202,0.000000,0.000000,0.000000,0.130190,1.0
13533,1.281633,0.601012,1.119447,2.547170,0.520761,1.0
13534,0.474679,0.218550,0.319842,1.273585,0.650951,1.0


In [64]:
stats[["PTS_R", "AST_R", "STL_R", "BLK_R", "3P_R"]] = stats_ratios[["PTS", "AST", "STL", "BLK", "3P"]]

In [65]:
stats.head()

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,...,W/L%,GB,PS/G,PA/G,SRS,PTS_R,AST_R,STL_R,BLK_R,3P_R
0,A.C. Green,PF,30,PHO,82,55,34.5,5.7,11.3,0.502,...,0.683,7.0,108.2,103.4,4.68,1.770767,0.851797,1.234513,1.131387,0.376284
1,Cedric Ceballos,SF,24,PHO,53,43,30.2,8.0,15.0,0.535,...,0.683,7.0,108.2,103.4,4.68,2.300792,0.851797,1.50885,0.905109,0.0
2,Charles Barkley,PF,30,PHO,65,65,35.4,8.0,16.1,0.495,...,0.683,7.0,108.2,103.4,4.68,2.601943,2.304861,2.19469,1.357664,2.633987
3,Dan Majerle,SG,28,PHO,80,76,40.1,6.0,14.2,0.418,...,0.683,7.0,108.2,103.4,4.68,1.987595,1.703593,2.19469,1.131387,9.030812
4,Danny Ainge,SG,34,PHO,68,1,22.9,3.3,7.9,0.417,...,0.683,7.0,108.2,103.4,4.68,1.072097,1.302748,1.097345,0.226277,4.515406


In [66]:
predictors += ["PTS_R", "AST_R", "STL_R", "BLK_R", "3P_R"]

In [67]:
predictors

['Age',
 'G',
 'GS',
 'MP',
 'FG',
 'FGA',
 'FG%',
 '3P',
 '3PA',
 '3P%',
 '2P',
 '2PA',
 '2P%',
 'eFG%',
 'FT',
 'FTA',
 'FT%',
 'ORB',
 'DRB',
 'TRB',
 'AST',
 'STL',
 'BLK',
 'TOV',
 'PF',
 'PTS',
 'Year',
 'W',
 'L',
 'W/L%',
 'GB',
 'PS/G',
 'PA/G',
 'SRS',
 'PTS_R',
 'AST_R',
 'STL_R',
 'BLK_R',
 '3P_R',
 'PTS_R',
 'AST_R',
 'STL_R',
 'BLK_R',
 '3P_R']

In [68]:
mean_ap, aps, all_predictions = backtest(stats, reg, years[5:], predictors)

In [69]:
# Our prediction model does not benefit from the application of more predicotrs, it worsens
mean_ap

0.7249501701545077

In [70]:
stats["NPos"] = stats["Pos"].astype("category").cat.codes

In [73]:
stats["NTm"] = stats["Tm"].astype("category").cat.codes

In [72]:
stats["Pos"].unique()

array(['PF', 'SF', 'SG', 'PG', 'C', 'PF-SF', 'PG-SG', 'SG-PG', 'PF-C',
       'SG-SF', 'SF-PF', 'SF-SG', 'C-PF', 'SG-PF', 'PG-SF', 'SG-PG-SF',
       'SF-C'], dtype=object)

In [74]:
stats["NTm"].value_counts()

7     480
27    471
17    470
14    467
12    465
8     465
0     463
1     456
15    456
9     455
11    454
13    454
26    454
31    453
5     452
18    451
24    448
29    445
28    445
34    444
10    444
19    441
30    441
33    441
36    410
16    353
20    300
25    240
32    223
2     182
23    163
3     157
21    143
4     135
6     130
35     88
37     65
22     32
Name: NTm, dtype: int64

In [93]:
# Using a Random Forest Model: Combines the output of multiple decision trees to reach a single result
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators = 50, random_state = 1, min_samples_split = 5)
mean_ap, aps, all_predictions = backtest(stats, rf, years[24:], predictors + ["NPos", "NTm"])

In [94]:
mean_ap

0.7237240105978343

In [95]:
mean_ap, aps, all_predictions = backtest(stats, reg, years[24:], predictors)

In [96]:
mean_ap

0.7249501701545077

In [97]:
# StandardScaler removes the mean and scales each feature/variable to unit variance. 
# This operation is performed feature-wise in an independent way. StandardScaler can be 
# influenced by outliers (if they exist in the dataset) since it involves the estimation of 
# the empirical mean and standard deviation of each feature.
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

In [100]:
def backtest(stats, model, years, predictors):
    aps = []
    all_predictions = []
    for year in years:
        train = stats[stats["Year"] < year].copy()
        test = stats[stats["Year"] == year].copy()
        sc.fit(train[predictors])
        train[predictors] = sc.transform(train[predictors])
        test[predictors] = sc.transform(test[predictors])
        model.fit(train[predictors],train["Share"])
        predictions = model.predict(test[predictors])
        predictions = pd.DataFrame(predictions, columns = ["Predictions"], index = test.index)
        combination = pd.concat([test[["Player", "Share"]], predictions], axis = 1)
        combination = add_ranks(combination)
        all_predictions.append(combination)
        aps.append(find_ap(combination))
    return sum(aps) / len(aps), aps, pd.concat(all_predictions)

In [101]:
mean_ap, aps, all_predictions = backtest(stats, reg, years[28:], predictors)

In [102]:
mean_ap

0.8833333333333334

In [104]:
sc.transform(stats[predictors])

array([[ 0.7599737 ,  1.19316527,  1.02766237, ...,  0.34024963,
         0.11317719, -0.5089315 ],
       [-0.63805645,  0.03573835,  0.61154736, ...,  0.73827751,
        -0.08173908, -0.81596653],
       [ 0.7599737 ,  0.51467363,  1.37442488, ...,  1.7333472 ,
         0.30809347,  1.33327864],
       ...,
       [-0.40505142,  0.07564963, -0.46341641, ...,  0.17330303,
         1.33273856, -0.39104316],
       [-1.80308157,  0.19538344,  0.29946111, ..., -0.98682471,
         0.23566718, -0.28481232],
       [-0.40505142,  1.11334273,  1.89456863, ...,  0.17330303,
        -0.42257564,  1.09618862]])