# Modeling

Now that we have gathered data from Basketball Reference using web scraping, cleaned our data, and conducted some exploratory data analysis, it's time to create our machine learning model to predict who will win the NBA MVP!

In [1]:
#Import Libraries
import pandas as pd
import numpy as np
#Machine Learning 
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, KFold, cross_val_predict, cross_val_score, ParameterGrid
from sklearn.metrics import mean_absolute_error, mean_squared_error, make_scorer
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from xgboost import XGBRegressor
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.inspection import permutation_importance
from sklearn.svm import SVR
from ipynb.fs.full.my_functions import ranks
#Ignore Warnings
import warnings
warnings.filterwarnings('ignore')

In [2]:
#Read in data
df = pd.read_csv('all_stats.csv')

In [3]:
df.head(10)

Unnamed: 0.1,Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,PF,PTS,Year,Pts Won,Pts Max,Share,Team,Wins,Losses,W%
0,0,A.C. Green,PF,23,LAL,79,72,28.4,4.0,7.4,...,2.2,10.8,1987,0.0,780.0,0.0,Los Angeles Lakers,65,17,0.79268
1,1,Adrian Branch,SF,23,LAL,32,0,6.8,1.5,3.0,...,1.2,4.3,1987,0.0,780.0,0.0,Los Angeles Lakers,65,17,0.79268
2,2,Billy Thompson,SF,23,LAL,59,0,12.9,2.4,4.4,...,2.5,5.6,1987,0.0,780.0,0.0,Los Angeles Lakers,65,17,0.79268
3,3,Byron Scott,SG,25,LAL,82,82,33.3,6.8,13.8,...,2.0,17.0,1987,0.0,780.0,0.0,Los Angeles Lakers,65,17,0.79268
4,4,James Worthy,SF,25,LAL,82,82,34.4,7.9,14.7,...,2.5,19.4,1987,0.0,780.0,0.0,Los Angeles Lakers,65,17,0.79268
5,5,Kareem Abdul-Jabbar,C,39,LAL,78,78,31.3,7.2,12.7,...,3.1,17.5,1987,0.0,780.0,0.0,Los Angeles Lakers,65,17,0.79268
6,6,Kurt Rambis,PF,28,LAL,78,10,19.4,2.1,4.0,...,2.6,5.7,1987,0.0,780.0,0.0,Los Angeles Lakers,65,17,0.79268
7,7,Magic Johnson,PG,27,LAL,80,80,36.3,8.5,16.4,...,2.1,23.9,1987,733.0,780.0,0.94,Los Angeles Lakers,65,17,0.79268
8,8,Michael Cooper,SG,30,LAL,82,2,27.5,3.9,9.0,...,2.4,10.5,1987,0.0,780.0,0.0,Los Angeles Lakers,65,17,0.79268
9,9,Mike Smrek,C,24,LAL,35,3,6.7,0.9,1.7,...,2.0,2.2,1987,0.0,780.0,0.0,Los Angeles Lakers,65,17,0.79268


In [4]:
#Delete Unwanted Column
del df['Unnamed: 0']

# Missing Data

In [5]:
df.isna().sum().sort_values(ascending=False)

3P%        2333
FT%         574
2P%         110
eFG%         64
FG%          64
Player        0
PF            0
AST           0
STL           0
BLK           0
TOV           0
PTS           0
DRB           0
Year          0
Pts Won       0
Pts Max       0
Share         0
Team          0
Wins          0
Losses        0
TRB           0
FTA           0
ORB           0
Pos           0
FT            0
2PA           0
2P            0
3PA           0
3P            0
FGA           0
FG            0
MP            0
GS            0
G             0
Tm            0
Age           0
W%            0
dtype: int64

We do have some missing data. My assumption is that the missing values belong to players who did not attempt a shot in the given category. For example, if a player never shot a 3, then their 3P% would be missing. Let's check this.

In [6]:
#Define dataframe with null 3P%
null_3P = df[df['3P%'].isnull()]
#Show 3PA Data
null_3P['3PA'].value_counts()

3PA
0.0    2333
Name: count, dtype: int64

Everyone who has missing data for 3P% also has 0 3PA, so the null values do come from players who never attempted a 3. Now we can explore the other categories with missing data. The same reasoning as for 3P% likely applies.
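A minimal sketch of why zero attempts lead to a missing percentage, assuming the percentage columns were computed as makes divided by attempts. The values in `toy` are made up for illustration and are not from the dataset:

```python
import pandas as pd

# Toy per-game makes and attempts (illustrative values, not from the dataset)
toy = pd.DataFrame({"3P": [1.2, 0.0, 0.0], "3PA": [3.0, 0.0, 0.5]})

# 0 / 0 is undefined, so pandas stores NaN for the player with no attempts
toy["3P%"] = toy["3P"] / toy["3PA"]
print(toy)
```

Only the row with zero attempts ends up null. A player who attempted shots but made none gets a valid 0.0 instead.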

In [7]:
#Dictionary with the null values and attempts
null_values = {
    'FT%': 'FTA',
    '2P%': '2PA',
    'eFG%': 'FGA',
    'FG%': 'FGA',
}

for null_key, attempt_key in null_values.items():
    null_rows = df[df[null_key].isnull()]
    value_counts = null_rows[attempt_key].value_counts()
    print(f'Null Value: {null_key}')
    print(value_counts)
    print()

Null Value: FT%
FTA
0.0    574
Name: count, dtype: int64

Null Value: 2P%
2PA
0.0    110
Name: count, dtype: int64

Null Value: eFG%
FGA
0.0    64
Name: count, dtype: int64

Null Value: FG%
FGA
0.0    64
Name: count, dtype: int64



For every column with missing data, the values are missing because the player never took that type of shot. To fill the missing values I will use the median for that position in that year. We could also drop the rows with missing values; however, there are rare cases where players who do not attempt 3s are in the MVP running. For example, Shaquille O'Neal had multiple seasons in which he attempted zero 3-point shots, yet he won the MVP in the 1999-2000 season.

In [8]:
columns_to_fill = ['3P%', 'FT%', '2P%', 'eFG%', 'FG%']
for column in columns_to_fill:
    median_by_position_year = df.groupby(['Year', 'Pos'])[column].transform('median')
    #Assign the result back instead of calling fillna(inplace=True) on a column selection
    df[column] = df[column].fillna(median_by_position_year)

In [9]:
df.isna().sum().sort_values(ascending=False)

3P%        14
FT%         1
Player      0
PF          0
DRB         0
TRB         0
AST         0
STL         0
BLK         0
TOV         0
PTS         0
Year        0
Pts Won     0
Pts Max     0
Share       0
Team        0
Wins        0
Losses      0
ORB         0
FTA         0
Pos         0
FGA         0
Age         0
Tm          0
G           0
GS          0
MP          0
FG          0
FG%         0
FT          0
3P          0
3PA         0
2P          0
2PA         0
2P%         0
eFG%        0
W%          0
dtype: int64

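There are still some missing values. These are cases where every player at that position in that year is missing the value, so the group median is itself missing. For these I will use the median for the position across all years.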
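The two-stage fill can be sketched on a toy frame. This is an illustration under assumptions rather than the notebook's exact code. The values are made up, and both medians are computed up front and chained, whereas the cells here compute the second median after the first fill:

```python
import numpy as np
import pandas as pd

# Toy data: the only center in 2020 has no 3P%, so the (Year, Pos) median is NaN
toy = pd.DataFrame({
    "Year": [2019, 2019, 2020, 2020, 2020],
    "Pos":  ["C", "C", "C", "PG", "PG"],
    "3P%":  [0.30, 0.20, np.nan, 0.40, np.nan],
})

# Stage 1: median within the same year and position
by_year_pos = toy.groupby(["Year", "Pos"])["3P%"].transform("median")
# Stage 2: median for the position across all years
by_pos = toy.groupby("Pos")["3P%"].transform("median")

# Fill from stage 1 first, then fall back to stage 2 where stage 1 is still NaN
toy["3P%"] = toy["3P%"].fillna(by_year_pos).fillna(by_pos)
print(toy)
```

The 2020 point guard is filled from the year-and-position median, while the 2020 center needs the position-wide fallback.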

In [10]:
columns_to_fill = ['3P%', 'FT%', '2P%', 'eFG%', 'FG%']
for column in columns_to_fill:
    median_by_position = df.groupby('Pos')[column].transform('median')
    #Assign the result back instead of calling fillna(inplace=True) on a column selection
    df[column] = df[column].fillna(median_by_position)

In [11]:
df.isna().sum().sort_values(ascending=False)

Player     0
FT%        0
DRB        0
TRB        0
AST        0
STL        0
BLK        0
TOV        0
PF         0
PTS        0
Year       0
Pts Won    0
Pts Max    0
Share      0
Team       0
Wins       0
Losses     0
ORB        0
FTA        0
Pos        0
FT         0
Age        0
Tm         0
G          0
GS         0
MP         0
FG         0
FGA        0
FG%        0
3P         0
3PA        0
3P%        0
2P         0
2PA        0
2P%        0
eFG%       0
W%         0
dtype: int64

No more missing data, nice! Now we need to create dummy variables for the Position and Team columns.

# Feature Engineering

In [12]:
# Create dummy variables for the Pos and Team columns
position_dummies = pd.get_dummies(df['Pos'])
team_dummies = pd.get_dummies(df['Team'])
df_with_dummies = pd.concat([df, position_dummies, team_dummies], axis=1)
df_with_dummies.drop('Pos', axis=1, inplace=True)
df_with_dummies.drop('Team', axis=1, inplace=True)

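We will now add features that measure how a player's points, rebounds, assists, blocks, and steals compare with the league average for that year. Each new feature is the player's value divided by the yearly average, so a value above 1 means above average.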

In [13]:
average_stats_for_year = df_with_dummies.groupby('Year')[['PTS', 'ORB', 'DRB', 'TRB', 'STL', 'AST', 'BLK']].transform('mean')
# Calculate the new columns by dividing player's statistics by average statistics for their year
statistics_columns = ['PTS', 'ORB', 'DRB', 'TRB', 'STL', 'AST', 'BLK']
for col in statistics_columns:
    df_with_dummies[f'{col}_C'] = df_with_dummies[col] / average_stats_for_year[col]
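To make the `_C` features easier to interpret, here is a small sketch of the same calculation on made-up numbers (the `toy` frame is an assumption for illustration only):

```python
import pandas as pd

toy = pd.DataFrame({
    "Year": [2020, 2020, 2021, 2021],
    "PTS":  [30.0, 10.0, 24.0, 8.0],
})

# League-average points for each row's season, aligned to the original rows
year_avg = toy.groupby("Year")["PTS"].transform("mean")

# Ratio > 1 means above the season average, < 1 means below it
toy["PTS_C"] = toy["PTS"] / year_avg
print(toy)
```

Because the ratio is taken within each season, a 24-point scorer in a low-scoring year can get the same relative value as a 30-point scorer in a high-scoring year.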

In [14]:
df_with_dummies.columns

Index(['Player', 'Age', 'Tm', 'G', 'GS', 'MP', 'FG', 'FGA', 'FG%', '3P', '3PA',
       '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB', 'DRB',
       'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS', 'Year', 'Pts Won',
       'Pts Max', 'Share', 'Wins', 'Losses', 'W%', 'C', 'C-PF', 'PF', 'PF-C',
       'PF-SF', 'PG', 'PG-SF', 'PG-SG', 'SF', 'SF-C', 'SF-PF', 'SF-SG', 'SG',
       'SG-PF', 'SG-PG', 'SG-PG-SF', 'SG-SF', 'Atlanta Hawks',
       'Boston Celtics', 'Brooklyn Nets', 'Charlotte Bobcats',
       'Charlotte Hornets', 'Chicago Bulls', 'Cleveland Cavaliers',
       'Dallas Mavericks', 'Denver Nuggets', 'Detroit Pistons',
       'Golden State Warriors', 'Houston Rockets', 'Indiana Pacers',
       'Los Angeles Clippers', 'Los Angeles Lakers', 'Memphis Grizzlies',
       'Miami Heat', 'Milwaukee Bucks', 'Minnesota Timberwolves',
       'New Jersey Nets', 'New Orleans Hornets', 'New Orleans Pelicans',
       'New Orleans/Oklahoma City Hornets', 'New York Knicks',
       'Okl

Note that the PF column appears twice in the dataset. The first occurrence is the average number of personal fouls per game, and the second is a dummy variable indicating whether the player is a power forward. To fix this, I will rename the personal fouls variable to FPG (fouls per game).
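As a small illustration of why the rename has to go by position rather than by label, here is a sketch on a made-up two-column frame (the values and the `toy` name are assumptions for the example):

```python
import pandas as pd

# Toy frame with a duplicated 'PF' label: fouls first, position dummy second
toy = pd.DataFrame([[2.2, 1], [3.1, 0]], columns=["PF", "PF"])

# rename maps by label, so it would change BOTH columns
renamed_by_label = list(toy.rename(columns={"PF": "FPG"}).columns)
print(renamed_by_label)

# Editing the label list by position changes only the first occurrence
labels = list(toy.columns)
labels[labels.index("PF")] = "FPG"
toy.columns = labels
print(list(toy.columns))
```

`list.index` returns the first match, which is why the fouls column is the one that gets renamed.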

In [15]:
#Column Names
column_names = list(df_with_dummies.columns)

#Get index
pf_index = column_names.index("PF")

#Rename first PF
column_names[pf_index] = "FPG"

#Rename columns
df_with_dummies.columns = column_names

#Print updated column names
df_with_dummies.columns

Index(['Player', 'Age', 'Tm', 'G', 'GS', 'MP', 'FG', 'FGA', 'FG%', '3P', '3PA',
       '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB', 'DRB',
       'TRB', 'AST', 'STL', 'BLK', 'TOV', 'FPG', 'PTS', 'Year', 'Pts Won',
       'Pts Max', 'Share', 'Wins', 'Losses', 'W%', 'C', 'C-PF', 'PF', 'PF-C',
       'PF-SF', 'PG', 'PG-SF', 'PG-SG', 'SF', 'SF-C', 'SF-PF', 'SF-SG', 'SG',
       'SG-PF', 'SG-PG', 'SG-PG-SF', 'SG-SF', 'Atlanta Hawks',
       'Boston Celtics', 'Brooklyn Nets', 'Charlotte Bobcats',
       'Charlotte Hornets', 'Chicago Bulls', 'Cleveland Cavaliers',
       'Dallas Mavericks', 'Denver Nuggets', 'Detroit Pistons',
       'Golden State Warriors', 'Houston Rockets', 'Indiana Pacers',
       'Los Angeles Clippers', 'Los Angeles Lakers', 'Memphis Grizzlies',
       'Miami Heat', 'Milwaukee Bucks', 'Minnesota Timberwolves',
       'New Jersey Nets', 'New Orleans Hornets', 'New Orleans Pelicans',
       'New Orleans/Oklahoma City Hornets', 'New York Knicks',
       'Ok

In [16]:
#Define full feature list
features_full = ['Age', 'G', 'GS', 'MP', 'FG', 'FGA', 'FG%', '3P',
       '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB',
       'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'FPG', 'PTS', 'Year',
       'Wins', 'Losses', 'W%', 'C', 'C-PF', 'PF', 'PF-C',
       'PF-SF', 'PG', 'PG-SF', 'PG-SG', 'SF', 'SF-C', 'SF-PF', 'SF-SG', 'SG',
       'SG-PF', 'SG-PG', 'SG-PG-SF', 'SG-SF', 'Atlanta Hawks',
       'Boston Celtics', 'Brooklyn Nets', 'Charlotte Bobcats',
       'Charlotte Hornets', 'Chicago Bulls', 'Cleveland Cavaliers',
       'Dallas Mavericks', 'Denver Nuggets', 'Detroit Pistons',
       'Golden State Warriors', 'Houston Rockets', 'Indiana Pacers',
       'Los Angeles Clippers', 'Los Angeles Lakers', 'Memphis Grizzlies',
       'Miami Heat', 'Milwaukee Bucks', 'Minnesota Timberwolves',
       'New Jersey Nets', 'New Orleans Hornets', 'New Orleans Pelicans',
       'New Orleans/Oklahoma City Hornets', 'New York Knicks',
       'Oklahoma City Thunder', 'Orlando Magic', 'Philadelphia 76ers',
       'Phoenix Suns', 'Portland Trail Blazers', 'Sacramento Kings',
       'San Antonio Spurs', 'Seattle SuperSonics', 'Toronto Raptors',
       'Utah Jazz', 'Vancouver Grizzlies', 'Washington Bullets',
       'Washington Wizards','PTS_C', 'ORB_C', 'DRB_C', 'TRB_C', 'STL_C',
       'AST_C', 'BLK_C']

In [17]:
#Training vs Testing
train = df_with_dummies[df_with_dummies['Year'] <= 2018]
test = df_with_dummies[(df_with_dummies['Year'] >= 2019) & (df_with_dummies['Year'] <= 2022)]

In [18]:
#Select features
VIF_df = df_with_dummies[features_full]

#Convert to integers
VIF_df = VIF_df.astype(int)

#Drop NANs
VIF_df = VIF_df.dropna()

#Calculate VIF 
vif_data = pd.DataFrame()
vif_data["Feature"] = VIF_df.columns
vif_data["VIF"] = [variance_inflation_factor(VIF_df.values, i) for i in range(VIF_df.shape[1])]

print(vif_data.to_string())

                              Feature         VIF
0                                 Age    1.153860
1                                   G    2.395401
2                                  GS    3.923618
3                                  MP   15.361448
4                                  FG   63.526685
5                                 FGA  132.055370
6                                 FG%   53.176632
7                                  3P    5.801088
8                                 3PA   21.916655
9                                 3P%    1.022360
10                                 2P   45.291877
11                                2PA   94.862645
12                                2P%    2.090589
13                               eFG%   52.113180
14                                 FT   14.312342
15                                FTA   14.278360
16                                FT%    1.104156
17                                ORB   10.006120
18                                DRB   20.172663


The high VIF values are a concern, since they indicate strong multicollinearity among the features. I will do forward stepwise selection with AIC as the decision criterion to select features. This will be done in R.
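The selection itself was run in R and is not shown here. For readers who want a feel for the procedure, below is a rough Python sketch of forward selection by AIC. It is an analogue written under assumptions, not the R code that produced the feature list. The helper names `aic_ols` and `forward_select` are hypothetical, the data is synthetic, and the AIC is the Gaussian least-squares form n·ln(RSS/n) + 2k, which drops an additive constant.

```python
import numpy as np

def aic_ols(X, y):
    """AIC of a least-squares fit with intercept, up to an additive constant."""
    n = len(y)
    design = np.column_stack([np.ones(n), X]) if X.shape[1] else np.ones((n, 1))
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ coef) ** 2))
    return n * np.log(rss / n) + 2 * design.shape[1]

def forward_select(X, y, names):
    """Greedily add the feature that lowers AIC most; stop when none does."""
    selected, remaining = [], list(range(X.shape[1]))
    best_aic = aic_ols(X[:, []], y)  # start from the intercept-only model
    while remaining:
        scores = [(aic_ols(X[:, selected + [j]], y), j) for j in remaining]
        cand_aic, cand = min(scores)
        if cand_aic >= best_aic:
            break  # no remaining feature improves the criterion
        best_aic = cand_aic
        selected.append(cand)
        remaining.remove(cand)
    return [names[j] for j in selected]

# Synthetic check: y depends on columns 'a' and 'c' only
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)
chosen = forward_select(X, y, ["a", "b", "c", "d"])
print(chosen)
```

The two informative columns are picked first. A pure-noise column can still be added afterwards by chance, because AIC only penalizes each extra parameter by 2.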

In [19]:
#Export data
data_export = df_with_dummies
data_export = data_export.drop('Player', axis=1)
data_export = data_export.drop('Tm', axis = 1)

data_export.to_csv("mvpData.csv")

The variables selected by forward stepwise selection with AIC are defined below.

In [20]:
features = ['FTA', 'W%', 'BLK', 'MP', 'PTS', 'DRB', 'FGA', 'STL', 'FG%', 'TRB', 'TOV', 'PG', 'FT%',
           'G', 'C', 'ORB', '2PA', '2P', '2P%', 'Age', 'PF', 'eFG%', '3P%', 'AST']

In [21]:
#Export features
data_to_export = df_with_dummies[features + ['Year', 'Share']]
data_to_export.to_csv('finalFeatures.csv')

# Model

Let's start by trying a simple Linear Regression model before searching for the model that works best.

In [22]:
X_train = train[features]
y_train = train['Share']
X_test = test[features]
y_test = test['Share']

In [23]:
#Linear Regression
LinReg = LinearRegression()
LinReg.fit(X_train, y_train)

In [24]:
pred = LinReg.predict(X_test)

In [25]:
pred = pd.DataFrame(pred, columns=['Predicted Share'], index=test.index)

In [26]:
pred.head(10)

Unnamed: 0,Predicted Share
705,0.029959
706,0.00453
707,0.0183
708,0.010683
709,-0.006361
710,0.005192
711,-0.001283
712,0.003432
713,0.009277
714,0.013535


In [27]:
results = pd.concat([test[['Player', 'Share', 'Year']], pred], axis=1)

In [28]:
results.sort_values('Share', ascending=False).head(10)

Unnamed: 0,Player,Share,Year,Predicted Share
751,Nikola Jokić,0.961,2021,0.182534
14564,Giannis Antetokounmpo,0.952,2020,0.259139
13469,Giannis Antetokounmpo,0.932,2019,0.240979
773,Nikola Jokić,0.875,2022,0.225479
11129,James Harden,0.768,2019,0.217086
4304,LeBron James,0.746,2020,0.176467
963,Joel Embiid,0.706,2022,0.214693
13169,Giannis Antetokounmpo,0.595,2022,0.252479
10014,Joel Embiid,0.58,2021,0.185921
4235,Stephen Curry,0.449,2021,0.164298


We need a way to measure how well this model did so that we can compare the performance of multiple models. We want a model that does a good job of predicting the Share of the best players, so we will take this into consideration when creating an error metric. To do this we need to rank the players based on Share and Predicted Share. Let's look at just our predictions for the 2019 season.

In [29]:
results_2019 = results[results['Year'] == 2019]
results_2019 = results_2019.sort_values('Share', ascending=False)
results_2019['Rank'] = results_2019['Share'].rank(ascending=False).astype(int)
results_2019.head(6)

Unnamed: 0,Player,Share,Year,Predicted Share,Rank
13469,Giannis Antetokounmpo,0.932,2019,0.240979,1
11129,James Harden,0.768,2019,0.217086,2
1631,Paul George,0.352,2019,0.134302,3
13826,Nikola Jokić,0.21,2019,0.130311,4
4862,Stephen Curry,0.173,2019,0.123514,5
3441,Damian Lillard,0.068,2019,0.115895,6


In [30]:
results_2019 = results_2019.sort_values('Predicted Share', ascending=False)
results_2019['Predicted Rank'] = results_2019['Predicted Share'].rank(ascending=False).astype(int)
results_2019.sort_values('Share', ascending=False).head(6)

Unnamed: 0,Player,Share,Year,Predicted Share,Rank,Predicted Rank
13469,Giannis Antetokounmpo,0.932,2019,0.240979,1,1
11129,James Harden,0.768,2019,0.217086,2,2
1631,Paul George,0.352,2019,0.134302,3,9
13826,Nikola Jokić,0.21,2019,0.130311,4,10
4862,Stephen Curry,0.173,2019,0.123514,5,12
3441,Damian Lillard,0.068,2019,0.115895,6,13


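It looks like the model did a decent job at predicting the MVP: it ranked Giannis Antetokounmpo and James Harden first and second, matching the actual vote. It did less well on the rest of the true top 6, placing them between 9th and 13th. We will need to look further into Anthony Davis, as he did not receive any MVP votes but our model predicted he finished fourth in voting.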

In [31]:
results_2019['Difference'] = results_2019['Rank'] - results_2019['Predicted Rank']

In [32]:
results_2019[results_2019['Rank'] <= 6].sort_values('Difference', ascending=True)

Unnamed: 0,Player,Share,Year,Predicted Share,Rank,Predicted Rank,Difference
4862,Stephen Curry,0.173,2019,0.123514,5,12,-7
3441,Damian Lillard,0.068,2019,0.115895,6,13,-7
1631,Paul George,0.352,2019,0.134302,3,9,-6
13826,Nikola Jokić,0.21,2019,0.130311,4,10,-6
13469,Giannis Antetokounmpo,0.932,2019,0.240979,1,1,0
11129,James Harden,0.768,2019,0.217086,2,2,0


Now I will use cross-validation to determine the optimal model. To do this, I will fit the model on all years except one and then predict the MVP in the held-out year. To assess model performance I will use the MSE calculated on the rank values of the top 5 players by voting share for the given year. This emphasizes accurately predicting only true MVP candidates.

Although data was collected back to the 1980s, the NBA has changed a lot since then and scoring is at an all-time high. Furthermore, the 3-point shot has become one of the most important aspects of the modern NBA, which was not always the case. For this reason, I am only going to train the model on data dating back to the year 2010. The goal is to train the model on data that is more reflective of the current state of the game and of what makes an MVP in the modern NBA era.
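The rank-based metric can be sketched as follows. The notebook's `ranks` helper is imported from `my_functions` and is not shown, so `add_ranks` below is a hypothetical stand-in. It assumes the helper adds `Rank` and `Predicted Rank` columns with rank 1 as the highest share. The tie-breaking rule and the toy values are also assumptions.

```python
import pandas as pd

def add_ranks(df):
    """Hypothetical stand-in for the notebook's `ranks` helper."""
    out = df.copy()
    # Rank 1 = highest share; method='first' breaks ties by row order
    out["Rank"] = out["Share"].rank(ascending=False, method="first").astype(int)
    out["Predicted Rank"] = (
        out["Predicted Share"].rank(ascending=False, method="first").astype(int)
    )
    return out

def top5_rank_mse(df):
    """MSE between actual and predicted rank over the actual top 5."""
    ranked = add_ranks(df)
    top5 = ranked[ranked["Rank"] <= 5]
    return float(((top5["Rank"] - top5["Predicted Rank"]) ** 2).mean())

# Toy season: the model swaps the 2nd and 3rd place finishers
toy = pd.DataFrame({
    "Share":           [0.90, 0.70, 0.40, 0.20, 0.10, 0.05],
    "Predicted Share": [0.30, 0.15, 0.20, 0.10, 0.08, 0.01],
})
print(top5_rank_mse(toy))
```

Because the metric works on ranks, it is not affected by the predicted shares being on a much smaller scale than the actual shares, as they were for the linear regression above.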

# Cross-Validation

In [33]:
#Subset to Modern NBA
df_with_dummies = df_with_dummies[df_with_dummies['Year'] >= 2010]

#Reset the index
df_with_dummies = df_with_dummies.reset_index(drop=True)

#Sort data by year
df_sorted = df_with_dummies.sort_values(by='Year')

#Store years
years = np.unique(df_sorted['Year'].values)

#Format data
X = df_sorted[features]
y = df_sorted['Share']

#Create scaled data for models
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns, index = X.index)
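The cell above fits the scaler once on all rows. One possible variation, shown here only as a sketch and not as the approach used in this notebook, is to fit the scaling on the training years of each fold so that the held-out year does not influence it. The helper names `leave_one_year_out` and `standardize_from_train` are hypothetical, and the data is made up.

```python
import numpy as np

def leave_one_year_out(years):
    """Yield (held_out_year, train_idx, test_idx) for each distinct year."""
    years = np.asarray(years)
    for year in np.unique(years):
        yield year, np.where(years != year)[0], np.where(years == year)[0]

def standardize_from_train(X_train, X_test):
    """Scale both sets using the mean and std of the training rows only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (X_train - mean) / std, (X_test - mean) / std

# Toy data: three seasons, two rows each, one feature
years = [2010, 2010, 2011, 2011, 2012, 2012]
X = np.array([[1.0], [3.0], [5.0], [7.0], [9.0], [11.0]])

for year, train_idx, test_idx in leave_one_year_out(years):
    X_tr, X_te = standardize_from_train(X[train_idx], X[test_idx])
    print(year, train_idx, test_idx)
```

Each season is held out exactly once, which mirrors the year-by-year loop used in the cross-validation cell that follows.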

In [35]:
#Define models to test (List of all models tested)
models = [
    {
        #Linear regression
        'name': 'Linear Regression',
        'model': LinearRegression(),
        'parameters': {}
    },
    {
        #Ridge regression
        'name': 'Ridge Regression',
        'model': Ridge(),
        'parameters': {'alpha': np.logspace(-4, 2, num=100)}
    },
    {
        #Lasso regression
        'name': 'Lasso Regression',
        'model': Lasso(),
        'parameters': {'alpha': np.logspace(-4, 2, num=100)}
    },
    {
        #KNN
        'name': 'KNN',
        'model': KNeighborsRegressor(),
        'parameters': {'n_neighbors': range(1,51)}
    },
    {
        #Random Forest
        'name': 'Random Forest',
        'model': RandomForestRegressor(random_state=123),
        'parameters': {
            'n_estimators': [100, 250, 500],
            'max_features': ['sqrt', 'log2', 24, 3]
        }
    },
    {
        #XGBoosting
        'name': 'XGBoost',
        'model': XGBRegressor(),
        'parameters': {
            'learning_rate': [0.01, 0.1, 0.2, 0.5], 
            'n_estimators': [100, 200, 300, 500],
            'max_depth': [2, 4, 6, 8, 10]
        }
    }
]

In [36]:
#Perform cross-validation

#List to store results
results_list = []

#Try each model
for model_data in models:
    #Save info
    model_name = model_data['name']
    model = model_data['model']
    params = model_data['parameters']
    
    #Print status
    print(f"{model_name} starting")
    
    #Check if model requires scaling
    if model_name in ['KNN', 'Ridge Regression', 'Lasso Regression']:
        X_data = X_scaled
        print('Using scaled data')
    else:
        X_data = X
        print('Using unscaled data')
    
    #Check if model has parameters to tune
    if params:
        #Set initial MSE
        best_mse = float('inf')
        #Set initial best parameters
        best_params = None
        
        #Test different parameters
        for param in ParameterGrid(params):
            #Progress tracking
            print(param)
            
            #MSE for parameters
            param_mse = []
            
            #Use CV to find optimal parameters
            for year in years:
                #Get indices for training set
                train_ind = np.where(df_sorted['Year'].values != year)[0]
        
                #Get indices for test set
                test_ind = np.where(df_sorted['Year'].values == year)[0]
                
                #Make a copy of the test set
                df_test = df_sorted.iloc[test_ind].reset_index(drop = True)
        
                #Split data
                X_train, X_test = X_data.iloc[train_ind], X_data.iloc[test_ind]
                y_train, y_test = y.iloc[train_ind], y.iloc[test_ind]
                
                #Set parameters
                model.set_params(**param)
                
                #Fit model
                model.fit(X_train, y_train)
                
                #Make predictions
                y_pred = model.predict(X_test)
                
                #Save predictions
                df_test['Predicted Share'] = y_pred
                
                #Subset to only players who received votes
                df_test = df_test[df_test['Share'] != 0]
                
                #Set ranks
                df_test = ranks(df_test)
                
                #Order data by rank
                df_test_sorted = df_test.sort_values(by = 'Rank', ascending = True)
                
                #Subset top 5 players
                df_subset = df_test_sorted[df_test_sorted['Rank'].isin(range(1, 6))]
                
                #Calculate and save MSE on ranks
                mse_year = mean_squared_error(df_subset['Rank'], df_subset['Predicted Rank'])
                param_mse.append(mse_year)
                
            #Calculate MSE
            mse = np.mean(param_mse)
            
            #Update best model if needed
            if mse < best_mse:
                best_mse = mse
                best_params = param
                
        #Save results
        results_data = {'Model': model_name, 'MSE': best_mse, 'Best_Params': best_params}
        results_list.append(results_data)
            
    #Fit models that don't require parameter tuning
    else:
        #List to store MSEs
        mse_list = []
        
        #Loop over each year
        for year in years:
            #Get indices for training set
            train_ind = np.where(df_sorted['Year'].values != year)[0]
        
            #Get indices for test set
            test_ind = np.where(df_sorted['Year'].values == year)[0]
            
            #Make a copy of the test set
            df_test = df_sorted.iloc[test_ind].reset_index(drop = True)
        
            #Split data
            X_train, X_test = X_data.iloc[train_ind], X_data.iloc[test_ind]
            y_train, y_test = y.iloc[train_ind], y.iloc[test_ind]
            
            #Fit model
            model.fit(X_train, y_train)
            
            #Make predictions
            y_pred = model.predict(X_test) 
                
            #Save predictions
            df_test['Predicted Share'] = y_pred
                
            #Subset to only players who received votes
            df_test = df_test[df_test['Share'] != 0]
                
            #Set ranks
            df_test = ranks(df_test)
                
            #Order data by rank
            df_test_sorted = df_test.sort_values(by = 'Rank', ascending = True)
                
            #Subset top 5 players
            df_subset = df_test_sorted[df_test_sorted['Rank'].isin(range(1, 6))]
                
            #Calculate and save MSE on ranks
            mse_year = mean_squared_error(df_subset['Rank'], df_subset['Predicted Rank'])
            mse_list.append(mse_year)
            
        #Calculate MSE
        best_mse = np.mean(mse_list)
        
        #Save model results
        results_data = {'Model': model_name, 'MSE': best_mse}
        results_list.append(results_data)
    
    #Print status
    print(f"{model_name} done")

Linear Regression starting
Using unscaled data
Linear Regression done
Ridge Regression starting
Using scaled data
{'alpha': 0.0001}
{'alpha': 0.00011497569953977356}
{'alpha': 0.00013219411484660288}
{'alpha': 0.0001519911082952933}
{'alpha': 0.0001747528400007683}
{'alpha': 0.00020092330025650479}
{'alpha': 0.00023101297000831605}
{'alpha': 0.00026560877829466864}
{'alpha': 0.0003053855508833416}
{'alpha': 0.0003511191734215131}
{'alpha': 0.0004037017258596554}
{'alpha': 0.0004641588833612782}
{'alpha': 0.0005336699231206312}
{'alpha': 0.0006135907273413176}
{'alpha': 0.0007054802310718645}
{'alpha': 0.0008111308307896872}
{'alpha': 0.0009326033468832199}
{'alpha': 0.0010722672220103231}
{'alpha': 0.0012328467394420659}
{'alpha': 0.0014174741629268048}
{'alpha': 0.0016297508346206436}
{'alpha': 0.001873817422860383}
{'alpha': 0.0021544346900318843}
{'alpha': 0.0024770763559917113}
{'alpha': 0.002848035868435802}
{'alpha': 0.0032745491628777285}
{'alpha': 0.0037649358067924675}
{'alpha

{'learning_rate': 0.01, 'max_depth': 4, 'n_estimators': 100}
{'learning_rate': 0.01, 'max_depth': 4, 'n_estimators': 200}
{'learning_rate': 0.01, 'max_depth': 4, 'n_estimators': 300}
{'learning_rate': 0.01, 'max_depth': 4, 'n_estimators': 500}
{'learning_rate': 0.01, 'max_depth': 6, 'n_estimators': 100}
{'learning_rate': 0.01, 'max_depth': 6, 'n_estimators': 200}
{'learning_rate': 0.01, 'max_depth': 6, 'n_estimators': 300}
{'learning_rate': 0.01, 'max_depth': 6, 'n_estimators': 500}
{'learning_rate': 0.01, 'max_depth': 8, 'n_estimators': 100}
{'learning_rate': 0.01, 'max_depth': 8, 'n_estimators': 200}
{'learning_rate': 0.01, 'max_depth': 8, 'n_estimators': 300}
{'learning_rate': 0.01, 'max_depth': 8, 'n_estimators': 500}
{'learning_rate': 0.01, 'max_depth': 10, 'n_estimators': 100}
{'learning_rate': 0.01, 'max_depth': 10, 'n_estimators': 200}
{'learning_rate': 0.01, 'max_depth': 10, 'n_estimators': 300}
{'learning_rate': 0.01, 'max_depth': 10, 'n_estimators': 500}
{'learning_rate': 0.

In [42]:
#Add RMSE and print results
results_df = pd.DataFrame(results_list)
results_df['RMSE'] = results_df['MSE'] ** 0.5
results_df.sort_values(by='MSE', ascending = True)

Unnamed: 0,Model,MSE,Best_Params,RMSE
5,XGBoost,7.228571,"{'learning_rate': 0.01, 'max_depth': 4, 'n_est...",2.6886
4,Random Forest,7.342857,"{'max_features': 24, 'n_estimators': 500}",2.709771
1,Ridge Regression,9.814286,{'alpha': 12.32846739442066},3.132776
2,Lasso Regression,9.985714,{'alpha': 0.0001},3.160018
3,KNN,10.471429,{'n_neighbors': 2},3.235959
0,Linear Regression,10.671429,,3.266715


In [41]:
#Print XGBoost full parameters
results_df[results_df['Model'] == 'XGBoost']['Best_Params'].values[0]

{'learning_rate': 0.01, 'max_depth': 4, 'n_estimators': 500}

After assessing and tuning all of the models, XGBoost did the best (rank MSE of 7.23), with the random forest performing similarly (7.34). I will now use these two models to predict the 2024 NBA MVP and compare their results.

I am also going to look into which variables are the most important. This will be done in the variable_importance notebook.