# Predicting The NBA MVP

The file I'm working with contains statistics for every NBA player from 1991 to 2022.

I will use machine learning to predict each player's share value: the player's share of the MVP vote (points won divided by the maximum points available).

The player with the largest share wins the MVP award.

In [1]:
import pandas as pd

# Ridge regression is linear regression with an L2 penalty on the coefficients, which helps reduce overfitting
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor

In [2]:
stats = pd.read_csv("stats.csv")
stats

Unnamed: 0.1,Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,Share,Team,W,L,W/L%,GB,PS/G,PA/G,SRS,Conference
0,0,A.C. Green,PF,27,LAL,82,21,26.4,3.1,6.6,...,0.0,Los Angeles Lakers,58,24,0.707,5.0,106.3,99.6,6.73,West
1,1,Byron Scott,SG,29,LAL,82,82,32.1,6.1,12.8,...,0.0,Los Angeles Lakers,58,24,0.707,5.0,106.3,99.6,6.73,West
2,2,Elden Campbell,PF,22,LAL,52,0,7.3,1.1,2.4,...,0.0,Los Angeles Lakers,58,24,0.707,5.0,106.3,99.6,6.73,West
3,3,Irving Thomas,PF,25,LAL,26,0,4.2,0.7,1.9,...,0.0,Los Angeles Lakers,58,24,0.707,5.0,106.3,99.6,6.73,West
4,4,James Worthy,SF,29,LAL,78,74,38.6,9.2,18.7,...,0.0,Los Angeles Lakers,58,24,0.707,5.0,106.3,99.6,6.73,West
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14692,14692,Spencer Hawes,PF,28,MIL,54,1,14.8,2.5,5.1,...,0.0,Milwaukee Bucks,42,40,0.512,9.0,103.6,103.8,-0.45,East
14693,14693,Steve Novak,PF,33,MIL,8,0,2.8,0.3,0.9,...,0.0,Milwaukee Bucks,42,40,0.512,9.0,103.6,103.8,-0.45,East
14694,14694,Terrence Jones,PF,25,MIL,54,12,23.5,4.3,9.1,...,0.0,Milwaukee Bucks,42,40,0.512,9.0,103.6,103.8,-0.45,East
14695,14695,Thon Maker,C,19,MIL,57,34,9.9,1.5,3.2,...,0.0,Milwaukee Bucks,42,40,0.512,9.0,103.6,103.8,-0.45,East


### A Bit of Cleaning

In [3]:
stats = stats.drop(columns=['Unnamed: 0'])

Checking for null values

In [4]:
pd.isnull(stats).sum()

Player           0
Pos              0
Age              0
Tm               0
G                0
GS               0
MP               0
FG               0
FGA              0
FG%             59
3P               0
3PA              0
3P%           2086
2P               0
2PA              0
2P%            100
eFG%            59
FT               0
FTA              0
FT%            521
ORB              0
DRB              0
TRB              0
AST              0
STL              0
BLK              0
TOV              0
PF               0
PTS              0
Year             0
Pts Won          0
Pts Max          0
Share            0
Team             0
W                0
L                0
W/L%             0
GB               0
PS/G             0
PA/G             0
SRS              0
Conference       0
dtype: int64

The missing values are all in percentage columns (such as FG% and 3P%), which are calculated as shots made / shots attempted.

My assumption is that a player with a null value in one of these columns had 0 attempts, so the ratio is undefined. Let's check whether this is true.

In [5]:
stats[pd.isnull(stats['3P%'])][['Player','3P','3PA','3P%']]

Unnamed: 0,Player,3P,3PA,3P%
2,Elden Campbell,0.0,0.0,
3,Irving Thomas,0.0,0.0,
18,Jack Haley,0.0,0.0,
20,Keith Owens,0.0,0.0,
30,Benoit Benjamin,0.0,0.0,
...,...,...,...,...
14666,Evan Eschmeyer,0.0,0.0,
14667,Gheorghe Mureșan,0.0,0.0,
14669,Jim McIlvaine,0.0,0.0,
14675,Mark Hendrickson,0.0,0.0,


In [6]:
stats[pd.isnull(stats['FG%'])][['Player','FG','FGA','FG%']].head(10)

Unnamed: 0,Player,FG,FGA,FG%
103,Adrian Caldwell,0.0,0.0,
250,Guy Rucker,0.0,0.0,
428,Gani Lawal,0.0,0.0,
1172,C.J. Miles,0.0,0.0,
1850,Ade Murkey,0.0,0.0,
2112,Ronny Turiaf,0.0,0.0,
2358,DeJon Jarreau,0.0,0.0,
2411,Lari Ketner,0.0,0.0,
2932,Ben Moore,0.0,0.0,
2947,Trey McKinney-Jones,0.0,0.0,


My assumption was correct.
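
The tables above only show a handful of rows, so the same check can also be done programmatically. The following is a minimal sketch on a toy DataFrame (the rows and the helper name `nulls_match_zero_attempts` are illustrative assumptions, not part of the original notebook):

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the real table (values are illustrative only).
toy = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "3P": [0.0, 1.2, 0.0],
    "3PA": [0.0, 3.0, 0.0],
    "3P%": [np.nan, 0.4, np.nan],
})

def nulls_match_zero_attempts(df, pct_col, attempts_col):
    """Return True if every row with a missing percentage has zero attempts."""
    missing = df[pct_col].isna()
    return bool((df.loc[missing, attempts_col] == 0).all())

result = nulls_match_zero_attempts(toy, "3P%", "3PA")
print(result)
```

On the real data the same helper could be called once per percentage column, for example with `('FG%', 'FGA')` or `('FT%', 'FTA')`.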

I'm going to change these values to 0, as these players are probably not competing for MVP.

In [7]:
stats = stats.fillna(0)

### Predicting Using Ridge Regression

I'm going to use all the numeric columns as predictors except 'Pts Won', 'Pts Max' and 'Share': 'Share' is the target, and the other two determine it directly, so including them would leak the answer.

In [8]:
stats.columns

Index(['Player', 'Pos', 'Age', 'Tm', 'G', 'GS', 'MP', 'FG', 'FGA', 'FG%', '3P',
       '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB',
       'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS', 'Year',
       'Pts Won', 'Pts Max', 'Share', 'Team', 'W', 'L', 'W/L%', 'GB', 'PS/G',
       'PA/G', 'SRS', 'Conference'],
      dtype='object')

In [9]:
predictors = ['Age', 'G', 'GS', 'MP', 'FG', 'FGA', 'FG%', '3P',
       '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB',
       'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS', 'Year',
       'W', 'L', 'W/L%', 'GB', 'PS/G', 'PA/G', 'SRS']

Creating train and test dataframes.

In [10]:
train = stats[stats['Year'] < 2022]
test = stats[stats['Year'] == 2022]

Creating the Ridge regression model.

In [11]:
# alpha controls how strongly the coefficients are shrunk toward zero to reduce overfitting
reg = Ridge(alpha=0.1)
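
To illustrate what the shrinkage does, here is a small NumPy sketch of the ridge closed-form solution on synthetic data. It is a simplified version with no intercept (scikit-learn's `Ridge` fits one by default), and the data and the function name `ridge_closed_form` are assumptions made for the example:

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Solve min ||y - Xw||^2 + alpha * ||w||^2 (no intercept, for brevity)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Synthetic data: three features with known weights plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

# The norm of the coefficient vector shrinks as alpha grows.
norms = [np.linalg.norm(ridge_closed_form(X, y, a)) for a in (0.1, 10.0, 1000.0)]
print(norms)
```

A small alpha such as 0.1 stays close to ordinary least squares, while a very large alpha pushes all coefficients toward zero.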

Training the model.

In [12]:
reg.fit(train[predictors], train['Share'])

Predicting the share values for 2022.

In [13]:
predictions = reg.predict(test[predictors])
predictions = pd.DataFrame(predictions, columns =['Prediction'], index = test.index)

In [14]:
predictions.head()

Unnamed: 0,Prediction
648,0.012934
649,-0.028142
650,-0.006163
651,0.016564
652,-0.00482


Combining the test dataframe with the predictions to compare the values.

In [15]:
combined = pd.concat([test[['Player','Share']], predictions], axis=1) # axis=1 means concat columns, not rows
combined.sort_values('Share', ascending=False).head(15)

Unnamed: 0,Player,Share,Prediction
663,Nikola Jokić,0.875,0.190365
837,Joel Embiid,0.706,0.190462
11678,Giannis Antetokounmpo,0.595,0.21941
907,Devin Booker,0.216,0.091309
11469,Luka Dončić,0.146,0.157395
1179,Jayson Tatum,0.043,0.095902
12226,Ja Morant,0.01,0.120508
6398,Stephen Curry,0.004,0.093138
905,Chris Paul,0.002,0.078329
8241,LeBron James,0.001,0.157828


We can see that the model did not give Nikola Jokić the highest predicted share; Giannis Antetokounmpo's prediction is higher.

Let's try to use MSE (mean squared error) as an error metric for the regression model.

In [16]:
mean_squared_error(combined['Share'],combined['Prediction'])

0.002240241602564993

In [17]:
combined['Share'].value_counts()

0.000    593
0.001      3
0.875      1
0.706      1
0.002      1
0.216      1
0.043      1
0.004      1
0.146      1
0.595      1
0.010      1
Name: Share, dtype: int64

This is not very helpful: most NBA players receive no MVP votes, so their share is 0 (593 of the 605 players in 2022), and a model can reach a low MSE simply by predicting values near 0 for everyone.

We need a different error metric.
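
To make this concrete, the sketch below computes the MSE of a trivial baseline that predicts a share of 0 for every player, using the 2022 share counts from the `value_counts()` output above. The baseline itself is an illustration added here, not something the notebook trains:

```python
# 2022 share values and how many players have each (from value_counts() above).
share_counts = {
    0.000: 593, 0.001: 3, 0.875: 1, 0.706: 1, 0.002: 1, 0.216: 1,
    0.043: 1, 0.004: 1, 0.146: 1, 0.595: 1, 0.010: 1,
}

n_players = sum(share_counts.values())

# Predicting 0 for everyone makes each squared error equal to share ** 2.
baseline_mse = sum(count * share ** 2 for share, count in share_counts.items()) / n_players
print(n_players, baseline_mse)
```

The baseline scores roughly 0.0028, not far from the model's 0.0022, even though it says nothing about who the MVP is.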

### Using Average Precision as an Error Metric

Adding rank by share value & prediction value.

In [18]:
# Rank by share
combined = combined.sort_values('Share', ascending=False)
combined['Share_Rank'] = range(1, combined.shape[0]+1)

# Rank by prediction
combined = combined.sort_values('Prediction', ascending=False)
combined['Prediction_Rank'] = range(1, combined.shape[0]+1)

# Difference between share rank and predicted rank
combined['Diff'] = combined['Share_Rank'] - combined['Prediction_Rank']

combined.head(10)

Unnamed: 0,Player,Share,Prediction,Share_Rank,Prediction_Rank,Diff
11678,Giannis Antetokounmpo,0.595,0.21941,3,1,2
837,Joel Embiid,0.706,0.190462,2,2,0
663,Nikola Jokić,0.875,0.190365,1,3,-2
8241,LeBron James,0.001,0.157828,10,4,6
11469,Luka Dončić,0.146,0.157395,5,5,0
6185,Kevin Durant,0.001,0.140627,12,6,6
12226,Ja Morant,0.01,0.120508,7,7,0
11820,Trae Young,0.0,0.109246,289,8,281
8231,Anthony Davis,0.0,0.107306,112,9,103
836,James Harden,0.0,0.103584,393,10,383


Because my goal is to predict the MVP, I only care about the top players (let's say the top 5 by share value).

In [19]:
combined.sort_values('Share',ascending=False).head()

Unnamed: 0,Player,Share,Prediction,Share_Rank,Prediction_Rank,Diff
663,Nikola Jokić,0.875,0.190365,1,3,-2
837,Joel Embiid,0.706,0.190462,2,2,0
11678,Giannis Antetokounmpo,0.595,0.21941,3,1,2
907,Devin Booker,0.216,0.091309,4,17,-13
11469,Luka Dončić,0.146,0.157395,5,5,0


In [20]:
combined.sort_values('Prediction',ascending=False).head(10)

Unnamed: 0,Player,Share,Prediction,Share_Rank,Prediction_Rank,Diff
11678,Giannis Antetokounmpo,0.595,0.21941,3,1,2
837,Joel Embiid,0.706,0.190462,2,2,0
663,Nikola Jokić,0.875,0.190365,1,3,-2
8241,LeBron James,0.001,0.157828,10,4,6
11469,Luka Dončić,0.146,0.157395,5,5,0
6185,Kevin Durant,0.001,0.140627,12,6,6
12226,Ja Morant,0.01,0.120508,7,7,0
11820,Trae Young,0.0,0.109246,289,8,281
8231,Anthony Davis,0.0,0.107306,112,9,103
836,James Harden,0.0,0.103584,393,10,383


In [21]:
combined[combined['Share_Rank']==4]

Unnamed: 0,Player,Share,Prediction,Share_Rank,Prediction_Rank,Diff
907,Devin Booker,0.216,0.091309,4,17,-13


An error metric that makes sense is average precision: for the top 5 players in the actual MVP voting, how far down the list of predictions do you have to go to find each of them?

For example, we will be penalised for Devin Booker, who ranked 4th in the voting but only 17th in the predictions.
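
As a worked example, the 2022 tables above show the five real top-5 players at prediction ranks 1, 2, 3, 5 and 17. The short sketch below computes the average precision from those ranks by hand (the variable names are chosen for this illustration):

```python
# Prediction ranks at which the actual top-5 players appear (2022 tables above).
hit_ranks = [1, 2, 3, 5, 17]

# Precision at each hit: top-5 players found so far / players seen so far.
precisions = [(found + 1) / rank for found, rank in enumerate(hit_ranks)]
avg_precision = sum(precisions) / len(precisions)
print(precisions, avg_precision)
```

The first three hits each have precision 1, Luka Dončić at rank 5 gives 4/5, and Devin Booker at rank 17 gives 5/17, for an average of about 0.8188. This is the same value the notebook reports for 2022.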

In [22]:
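def find_avg_precision(combined):
    # The five players with the highest actual share
    actual = combined.sort_values('Share', ascending=False).head(5)
    # All players ordered from highest to lowest predicted share
    predicted = combined.sort_values('Prediction', ascending=False)
    ps = []
    found = 0
    seen = 0
    for index, row in predicted.iterrows():
        seen += 1
        if row['Player'] in actual['Player'].values:
            found += 1
            # Precision at this point: top-5 players found / players seen so far
            ps.append(found / seen)
        if found == 5:
            break
    return sum(ps) / len(ps)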

In [23]:
find_avg_precision(combined)

0.8188235294117646

Let's use backtesting to get a more robust estimate of the error metric (average precision).

In [24]:
def calc_diff(combined):
    combined = combined.sort_values('Share', ascending=False)
    combined['Share_Rank'] = range(1, combined.shape[0]+1)
    combined = combined.sort_values('Prediction', ascending=False)
    combined['Prediction_Rank'] = range(1, combined.shape[0]+1)
    combined['Diff'] = combined['Share_Rank'] - combined['Prediction_Rank']
    return combined
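
One caveat with ranking by sort order is that tied players get arbitrary ranks: in the 2022 table, Trae Young and James Harden both have a share of 0.0 but Share_Rank values of 289 and 393. A possible alternative, sketched here on toy data, is `Series.rank` with `method='min'`, which gives tied rows the same rank. The function name `add_tie_aware_ranks` and the toy values are assumptions for this example:

```python
import pandas as pd

# Toy data: players C and D are tied on Share.
toy = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "Share": [0.9, 0.1, 0.0, 0.0],
    "Prediction": [0.3, 0.4, 0.05, 0.02],
})

def add_tie_aware_ranks(df):
    """Add rank columns where tied values share the lowest rank of the group."""
    out = df.copy()
    out["Share_Rank"] = out["Share"].rank(ascending=False, method="min").astype(int)
    out["Prediction_Rank"] = out["Prediction"].rank(ascending=False, method="min").astype(int)
    out["Diff"] = out["Share_Rank"] - out["Prediction_Rank"]
    return out

ranked = add_tie_aware_ranks(toy)
print(ranked)
```

This only changes the ranks of tied rows, so the top of the table, where shares are distinct, is unaffected.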

In [25]:
years = range(1991,2023)

def backtest(stats, model, years, predictors):
    avg_precisions = []
    all_predictions = []
    for year in years[5:]:
        train = stats[stats['Year'] < year]
        test = stats[stats['Year'] == year]
        model.fit(train[predictors], train['Share'])
        predictions = model.predict(test[predictors])
        predictions = pd.DataFrame(predictions, columns =['Prediction'], index = test.index)
        combined = pd.concat([test[['Player','Share']], predictions], axis=1)
        combined = calc_diff(combined)
        all_predictions.append(combined)
        avg_precisions.append(find_avg_precision(combined))
    return sum(avg_precisions)/len(avg_precisions), avg_precisions, pd.concat(all_predictions)
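
The splitting scheme inside `backtest` is an expanding window: each test year is predicted by a model trained on all earlier years, and `years[5:]` guarantees at least five years of training data. The standalone sketch below shows only the splits, without any model (the generator name `expanding_window_folds` is an assumption for this example):

```python
def expanding_window_folds(years, min_train_years=5):
    """Yield (train_years, test_year) pairs with a growing training window."""
    years = list(years)
    for i in range(min_train_years, len(years)):
        yield years[:i], years[i]

folds = list(expanding_window_folds(range(1991, 2023)))
first_train, first_test = folds[0]
last_train, last_test = folds[-1]
print(len(folds), first_train, first_test, last_test)
```

With the notebook's year range this gives 27 folds, with test years running from 1996 to 2022.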

In [26]:
mean_ap, aps, all_predictions = backtest(stats, reg, years, predictors)

The mean average precision for 1996-2022.

In [27]:
mean_ap

0.7152712173135063

#### Diving Deeper into the Algorithm's Performance - Diagnosing the Model

Which players have the biggest difference between actual and predicted rank?

In [28]:
all_predictions[all_predictions['Share_Rank'] <= 5].sort_values('Diff').head(10)

Unnamed: 0,Player,Share,Prediction,Share_Rank,Prediction_Rank,Diff
1334,Jason Kidd,0.712,0.02821,2,52,-50
8642,Glen Rice,0.117,0.03311,5,53,-48
5420,Steve Nash,0.839,0.0341,1,45,-44
8910,Peja Stojaković,0.228,0.03627,4,38,-34
13331,Joakim Noah,0.258,0.046968,4,37,-33
5438,Steve Nash,0.739,0.054129,1,34,-33
3849,Chauncey Billups,0.344,0.052696,5,35,-30
1499,Chris Paul,0.138,0.072293,5,33,-28
5453,Steve Nash,0.785,0.074421,2,21,-19
4912,Tim Hardaway,0.207,0.059984,4,20,-16


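Which variables have the most impact on the regression's predictions? The 10 largest positive coefficients are listed below (the predictors are not scaled, so the magnitudes are not directly comparable).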

In [29]:
pd.concat([pd.Series(reg.coef_,name='Coefficient'), pd.Series(predictors,name='Predictor')], axis=1).sort_values('Coefficient',ascending=False).head(10)

Unnamed: 0,Coefficient,Predictor
13,0.087852,eFG%
18,0.03386,DRB
29,0.023198,W/L%
17,0.020993,ORB
10,0.016456,2P
21,0.01207,STL
22,0.010901,BLK
15,0.010414,FTA
20,0.007113,AST
12,0.007054,2P%


To improve the performance, I'll add more predictors.

A good candidate is the ratio of a player's statistic to the average of that statistic across all players in the same year.

In [30]:
stats_ratios = stats[['Year','PTS','AST','STL','BLK','3P']].groupby('Year').transform(lambda x: x/x.mean())
stats_ratios.head()

Unnamed: 0,PTS,AST,STL,BLK,3P
0,1.013334,0.420714,0.961127,0.673469,0.508587
1,1.614653,1.028412,1.647646,0.673469,4.577279
2,0.311795,0.093492,0.274608,1.571429,0.0
3,0.20044,0.186984,0.274608,0.0,0.0
4,2.383005,1.63611,1.78495,0.897959,1.52576


Adding the new predictors to the original dataframe & the predictors list.

In [31]:
stats[['PTS_R','AST_R','STL_R','BLK_R','3P_R']] = stats_ratios[['PTS','AST','STL','BLK','3P']]
predictors += ['PTS_R','AST_R','STL_R','BLK_R','3P_R']

Backtesting again to measure the performance.

In [32]:
mean_ap, aps, all_predictions = backtest(stats, reg, years, predictors)

In [33]:
mean_ap

0.726619022474594

A bit of an improvement from the previous 0.715.

### Adding a Categorical Variable

Position is a categorical variable that cannot be used as is in a regression model.

I'll convert it to numeric codes and add it to the predictors.

In [34]:
stats['Pos_N'] = stats['Pos'].astype('category').cat.codes

In [36]:
predictors += ['Pos_N']
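
Note that `cat.codes` assigns integers in the sorted order of the category labels, so a linear model treats the positions as if they were ordered. An alternative that avoids this, shown here only as a sketch on toy data and not what the notebook uses, is one-hot encoding with `pd.get_dummies`:

```python
import pandas as pd

toy = pd.DataFrame({"Pos": ["PG", "C", "PG", "SF"]})

# Integer codes: categories are sorted alphabetically (C=0, PG=1, SF=2).
codes = toy["Pos"].astype("category").cat.codes

# One-hot encoding: one 0/1 column per position, with no implied order.
one_hot = pd.get_dummies(toy["Pos"], prefix="Pos").astype(int)
print(codes.tolist())
print(one_hot)
```

With one-hot columns, each position gets its own coefficient instead of sharing a single slope.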

### Testing Different Models

#### Ridge Regression

In [37]:
reg = Ridge(alpha=0.1)
ridge_mean_ap, aps, all_predictions = backtest(stats, reg, years, predictors)


In [38]:
ridge_mean_ap

0.7262621890226512

#### Elastic Net Regression

In [78]:
from sklearn.linear_model import ElasticNet

elasticnet = ElasticNet(alpha=0.1, l1_ratio=0.5)

en_mean_ap, aps, all_predictions = backtest(stats, elasticnet, years, predictors)

In [79]:
en_mean_ap

0.7604406358238323

#### Gradient Boosting


In [74]:
from sklearn.ensemble import GradientBoostingRegressor

gb = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=1)

gb_mean_ap, aps, all_predictions = backtest(stats, gb, years, predictors)

In [75]:
gb_mean_ap

0.7188835746997083

#### Random Forest

In [76]:
rf = RandomForestRegressor(n_estimators=100, random_state=1, min_samples_split=5)
rf_mean_ap, aps, all_predictions = backtest(stats, rf, years, predictors)

In [77]:
rf_mean_ap

0.7178318343131725
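
To compare the four models side by side, the mean average precision values reported above can be collected into one Series. This is a small convenience sketch; the numbers are copied from the outputs above:

```python
import pandas as pd

# Mean average precision per model, copied from the backtest outputs above.
scores = pd.Series({
    "Ridge": 0.7262621890226512,
    "ElasticNet": 0.7604406358238323,
    "GradientBoosting": 0.7188835746997083,
    "RandomForest": 0.7178318343131725,
}, name="mean_ap")

ranking = scores.sort_values(ascending=False)
print(ranking)
```

With these settings, Elastic Net has the highest mean average precision, followed by Ridge, Gradient Boosting and Random Forest.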