Now that we have gathered data from Basketball Reference using web scraping, cleaned our data, and conducted some exploratory data analysis, it's time to create our machine learning model to predict who will win the NBA MVP!

In [84]:
#Import Libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
#Machine Learning 
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, KFold, cross_val_predict
from sklearn.metrics import mean_absolute_error, mean_squared_error, make_scorer
#Ignore Warnings
import warnings
warnings.filterwarnings('ignore')

In [2]:
#Read in data
df = pd.read_csv('all_stats.csv')

In [3]:
df.head(10)

Unnamed: 0.1,Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,PF,PTS,Year,Pts Won,Pts Max,Share,Team,Wins,Losses,W%
0,0,A.C. Green,PF,23,LAL,79,72,28.4,4.0,7.4,...,2.2,10.8,1987,0.0,780.0,0.0,Los Angeles Lakers,65,17,0.79268
1,1,Adrian Branch,SF,23,LAL,32,0,6.8,1.5,3.0,...,1.2,4.3,1987,0.0,780.0,0.0,Los Angeles Lakers,65,17,0.79268
2,2,Billy Thompson,SF,23,LAL,59,0,12.9,2.4,4.4,...,2.5,5.6,1987,0.0,780.0,0.0,Los Angeles Lakers,65,17,0.79268
3,3,Byron Scott,SG,25,LAL,82,82,33.3,6.8,13.8,...,2.0,17.0,1987,0.0,780.0,0.0,Los Angeles Lakers,65,17,0.79268
4,4,James Worthy,SF,25,LAL,82,82,34.4,7.9,14.7,...,2.5,19.4,1987,0.0,780.0,0.0,Los Angeles Lakers,65,17,0.79268
5,5,Kareem Abdul-Jabbar,C,39,LAL,78,78,31.3,7.2,12.7,...,3.1,17.5,1987,0.0,780.0,0.0,Los Angeles Lakers,65,17,0.79268
6,6,Kurt Rambis,PF,28,LAL,78,10,19.4,2.1,4.0,...,2.6,5.7,1987,0.0,780.0,0.0,Los Angeles Lakers,65,17,0.79268
7,7,Magic Johnson,PG,27,LAL,80,80,36.3,8.5,16.4,...,2.1,23.9,1987,733.0,780.0,0.94,Los Angeles Lakers,65,17,0.79268
8,8,Michael Cooper,SG,30,LAL,82,2,27.5,3.9,9.0,...,2.4,10.5,1987,0.0,780.0,0.0,Los Angeles Lakers,65,17,0.79268
9,9,Mike Smrek,C,24,LAL,35,3,6.7,0.9,1.7,...,2.0,2.2,1987,0.0,780.0,0.0,Los Angeles Lakers,65,17,0.79268


In [4]:
#Delete Unwanted Column
del df['Unnamed: 0']

# Missing Data

In [5]:
df.isna().sum().sort_values(ascending=False)

3P%        2333
FT%         574
2P%         110
eFG%         64
FG%          64
Player        0
PF            0
AST           0
STL           0
BLK           0
TOV           0
PTS           0
DRB           0
Year          0
Pts Won       0
Pts Max       0
Share         0
Team          0
Wins          0
Losses        0
TRB           0
FTA           0
ORB           0
Pos           0
FT            0
2PA           0
2P            0
3PA           0
3P            0
FGA           0
FG            0
MP            0
GS            0
G             0
Tm            0
Age           0
W%            0
dtype: int64

We do have some missing data. My assumption is that the missing values belong to players who did not attempt a shot in the given category. For example, if a player never shot a 3, then their 3P% would be missing. Let's check this.

In [6]:
#Define dataframe with null 3P%
null_3P = df[df['3P%'].isnull()]
#Show 3PA Data
null_3P['3PA'].value_counts()

0.0    2333
Name: 3PA, dtype: int64

Everyone with a missing 3P% also has 0 3PA, so the null values do come from players who never attempted a 3. Now we can explore the other categories with missing data, where the same reasoning likely applies.

In [7]:
#Dictionary with the null values and attempts
null_values = {
    'FT%': 'FTA',
    '2P%': '2PA',
    'eFG%': 'FGA',
    'FG%': 'FGA',
}

for null_key, attempt_key in null_values.items():
    null_rows = df[df[null_key].isnull()]
    value_counts = null_rows[attempt_key].value_counts()
    print(f'Null Value: {null_key}')
    print(value_counts)
    print()

Null Value: FT%
0.0    574
Name: FTA, dtype: int64

Null Value: 2P%
0.0    110
Name: 2PA, dtype: int64

Null Value: eFG%
0.0    64
Name: FGA, dtype: int64

Null Value: FG%
0.0    64
Name: FGA, dtype: int64



For every column with missing data, the value is missing because the player never attempted that type of shot that season. To fill the missing values I will use the median for the player's position. We could also drop the rows with missing values; however, there are rare cases where players who do not attempt 3s are in the MVP running. For example, Shaquille O'Neal had multiple seasons in which he attempted zero 3-point shots, and he won the MVP in the 1999-2000 season.

In [8]:
# 'Pos' is the column that specifies the player's position
columns_to_fill = ['3P%', 'FT%', '2P%', 'eFG%', 'FG%']
for column in columns_to_fill:
    median_by_position = df.groupby('Pos')[column].transform('median')
    # Assign back instead of calling fillna with inplace=True on a column selection
    df[column] = df[column].fillna(median_by_position)

In [9]:
df.isna().sum().sort_values(ascending=False)

Player     0
FT%        0
DRB        0
TRB        0
AST        0
STL        0
BLK        0
TOV        0
PF         0
PTS        0
Year       0
Pts Won    0
Pts Max    0
Share      0
Team       0
Wins       0
Losses     0
ORB        0
FTA        0
Pos        0
FT         0
Age        0
Tm         0
G          0
GS         0
MP         0
FG         0
FGA        0
FG%        0
3P         0
3PA        0
3P%        0
2P         0
2PA        0
2P%        0
eFG%       0
W%         0
dtype: int64

No more missing data, nice! Now we need to create dummy variables for the Pos and Team columns.

# Feature Engineering

In [10]:
# Create dummy variables for the Pos and Team columns
# Note: the 'PF' position dummy shares its name with the personal fouls column 'PF'
position_dummies = pd.get_dummies(df['Pos'])
team_dummies = pd.get_dummies(df['Team'])
df_with_dummies = pd.concat([df, position_dummies, team_dummies], axis=1)
df_with_dummies = df_with_dummies.drop(['Pos', 'Team'], axis=1)
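One thing worth checking at this step is column-name collisions: the position label 'PF' (power forward) is also the name of the personal fouls column. The sketch below uses a small made-up frame (the players and values are hypothetical) to show how unprefixed dummies produce two columns named 'PF', and how the `prefix` argument of `pd.get_dummies` is one possible way to avoid that. It is an illustration only, not a change to the pipeline used in the rest of this notebook.

```python
import pandas as pd

# Toy frame (hypothetical values): 'PF' holds personal fouls, 'Pos' holds the position label
toy = pd.DataFrame({
    'Player': ['A', 'B', 'C'],
    'Pos': ['PF', 'C', 'PG'],
    'PF': [2.2, 3.1, 1.2],
})

# Unprefixed dummies reuse the position label as the column name,
# so the 'PF' dummy collides with the personal fouls column
collided = pd.concat([toy, pd.get_dummies(toy['Pos'])], axis=1)
n_pf_columns = collided.columns.tolist().count('PF')

# A prefix keeps every dummy name distinct from the stat columns
safe = pd.concat([toy, pd.get_dummies(toy['Pos'], prefix='Pos')], axis=1)
safe = safe.drop(columns='Pos')

print(n_pf_columns)
print(safe.columns.tolist())
```

With duplicate names, selecting `'PF'` returns both columns at once, which is easy to miss when building a feature list by name.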

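We will now add features that measure how a player's points, rebounds, assists, blocks, and steals compare with the league average for that year. Each new column is the player's stat divided by the yearly mean, so a value above 1 means the player was above average in that category.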

In [11]:
average_stats_for_year = df_with_dummies.groupby('Year')[['PTS', 'ORB', 'DRB', 'TRB', 'STL', 'AST', 'BLK']].transform('mean')
# Calculate the new columns by dividing player's statistics by average statistics for their year
statistics_columns = ['PTS', 'ORB', 'DRB', 'TRB', 'STL', 'AST', 'BLK']
for col in statistics_columns:
    df_with_dummies[f'{col}_C'] = df_with_dummies[col] / average_stats_for_year[col]

In [12]:
df_with_dummies.columns

Index(['Player', 'Age', 'Tm', 'G', 'GS', 'MP', 'FG', 'FGA', 'FG%', '3P', '3PA',
       '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB', 'DRB',
       'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS', 'Year', 'Pts Won',
       'Pts Max', 'Share', 'Wins', 'Losses', 'W%', 'C', 'C-PF', 'PF', 'PF-C',
       'PF-SF', 'PG', 'PG-SF', 'PG-SG', 'SF', 'SF-C', 'SF-PF', 'SF-SG', 'SG',
       'SG-PF', 'SG-PG', 'SG-PG-SF', 'SG-SF', 'Atlanta Hawks',
       'Boston Celtics', 'Brooklyn Nets', 'Charlotte Bobcats',
       'Charlotte Hornets', 'Chicago Bulls', 'Cleveland Cavaliers',
       'Dallas Mavericks', 'Denver Nuggets', 'Detroit Pistons',
       'Golden State Warriors', 'Houston Rockets', 'Indiana Pacers',
       'Los Angeles Clippers', 'Los Angeles Lakers', 'Memphis Grizzlies',
       'Miami Heat', 'Milwaukee Bucks', 'Minnesota Timberwolves',
       'New Jersey Nets', 'New Orleans Hornets', 'New Orleans Pelicans',
       'New Orleans/Oklahoma City Hornets', 'New York Knicks',
       'Okl

In [13]:
features = ['Age', 'G', 'GS', 'MP', 'FG', 'FGA', 'FG%', '3P',
       '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB',
       'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS', 'Year',
       'Wins', 'Losses', 'W%', 'C', 'C-PF', 'PF', 'PF-C',
       'PF-SF', 'PG', 'PG-SF', 'PG-SG', 'SF', 'SF-C', 'SF-PF', 'SF-SG', 'SG',
       'SG-PF', 'SG-PG', 'SG-PG-SF', 'SG-SF', 'Atlanta Hawks',
       'Boston Celtics', 'Brooklyn Nets', 'Charlotte Bobcats',
       'Charlotte Hornets', 'Chicago Bulls', 'Cleveland Cavaliers',
       'Dallas Mavericks', 'Denver Nuggets', 'Detroit Pistons',
       'Golden State Warriors', 'Houston Rockets', 'Indiana Pacers',
       'Los Angeles Clippers', 'Los Angeles Lakers', 'Memphis Grizzlies',
       'Miami Heat', 'Milwaukee Bucks', 'Minnesota Timberwolves',
       'New Jersey Nets', 'New Orleans Hornets', 'New Orleans Pelicans',
       'New Orleans/Oklahoma City Hornets', 'New York Knicks',
       'Oklahoma City Thunder', 'Orlando Magic', 'Philadelphia 76ers',
       'Phoenix Suns', 'Portland Trail Blazers', 'Sacramento Kings',
       'San Antonio Spurs', 'Seattle SuperSonics', 'Toronto Raptors',
       'Utah Jazz', 'Vancouver Grizzlies', 'Washington Bullets',
       'Washington Wizards','PTS_C', 'ORB_C', 'DRB_C', 'TRB_C', 'STL_C',
       'AST_C', 'BLK_C']

Now we need to define our train and test data. We will use the data from 1987-2018 as the training data and then test on the 2019 through 2022 seasons. We will leave the 2023 season unseen until the end so we can see how well our model does on data it has never seen.

In [14]:
#Training vs Testing
train = df_with_dummies[df_with_dummies['Year'] <= 2018]
test = df_with_dummies[(df_with_dummies['Year'] >= 2019) & (df_with_dummies['Year'] <= 2022)]

In [15]:
X_train = train[features]
y_train = train['Share']
X_test = test[features]
y_test = test['Share']

# Model

Let's start by trying a simple Linear Regression model before finding a model that works best.

In [16]:
#Linear Regression
LinReg = LinearRegression()
LinReg.fit(X_train, y_train)

In [17]:
pred = LinReg.predict(X_test)

In [18]:
pred = pd.DataFrame(pred, columns=['Predicted Share'], index=test.index)

In [19]:
pred.head(10)

Unnamed: 0,Predicted Share
705,0.011845
706,0.008741
707,0.018473
708,0.005495
709,-0.009567
710,0.005395
711,0.002789
712,-0.016878
713,0.008461
714,0.021728


Now we need to compare this to our actual share values and then create a way to measure error so we can examine multiple models.

In [20]:
results = pd.concat([test[['Player', 'Share', 'Year']], pred], axis=1)

In [21]:
results.sort_values('Share', ascending=False).head(10)

Unnamed: 0,Player,Share,Year,Predicted Share
751,Nikola Jokić,0.961,2021,0.150032
14564,Giannis Antetokounmpo,0.952,2020,0.215372
13469,Giannis Antetokounmpo,0.932,2019,0.199536
773,Nikola Jokić,0.875,2022,0.207829
11129,James Harden,0.768,2019,0.19234
4304,LeBron James,0.746,2020,0.162463
963,Joel Embiid,0.706,2022,0.19113
13169,Giannis Antetokounmpo,0.595,2022,0.227301
10014,Joel Embiid,0.58,2021,0.146613
4235,Stephen Curry,0.449,2021,0.138918


We need a way to measure whether this model did a good job so we can compare the performance of multiple models. We care most about predicting the Share of the best players, so we will take this into consideration when creating an error metric. To do this we need to rank the players based on Share and Predicted Share. Let's start with our predictions for the 2019 season.

In [22]:
results_2019 = results[results['Year'] == 2019]
results_2019 = results_2019.sort_values('Share', ascending=False)
results_2019['Rank'] = results_2019['Share'].rank(ascending=False).astype(int)
results_2019.head(6)

Unnamed: 0,Player,Share,Year,Predicted Share,Rank
13469,Giannis Antetokounmpo,0.932,2019,0.199536,1
11129,James Harden,0.768,2019,0.19234,2
1631,Paul George,0.352,2019,0.132975,3
13826,Nikola Jokić,0.21,2019,0.100187,4
4862,Stephen Curry,0.173,2019,0.105894,5
3441,Damian Lillard,0.068,2019,0.098622,6


In [23]:
results_2019 = results_2019.sort_values('Predicted Share', ascending=False)
results_2019['Predicted Rank'] = results_2019['Predicted Share'].rank(ascending=False).astype(int)
results_2019.sort_values('Share', ascending=False).head(6)

Unnamed: 0,Player,Share,Year,Predicted Share,Rank,Predicted Rank
13469,Giannis Antetokounmpo,0.932,2019,0.199536,1,1
11129,James Harden,0.768,2019,0.19234,2,2
1631,Paul George,0.352,2019,0.132975,3,8
13826,Nikola Jokić,0.21,2019,0.100187,4,11
4862,Stephen Curry,0.173,2019,0.105894,5,10
3441,Damian Lillard,0.068,2019,0.098622,6,12


The model correctly ranked the top two finishers, Giannis Antetokounmpo and James Harden, but it placed the other four top-six finishers between 8th and 12th. We will also need to look further into Anthony Davis, who did not receive any MVP votes but whom our model ranked fourth.

In [24]:
results_2019['Difference'] = results_2019['Rank'] - results_2019['Predicted Rank']

In [25]:
results_2019[results_2019['Rank'] <= 6].sort_values('Difference', ascending=True)

Unnamed: 0,Player,Share,Year,Predicted Share,Rank,Predicted Rank,Difference
13826,Nikola Jokić,0.21,2019,0.100187,4,11,-7
3441,Damian Lillard,0.068,2019,0.098622,6,12,-6
1631,Paul George,0.352,2019,0.132975,3,8,-5
4862,Stephen Curry,0.173,2019,0.105894,5,10,-5
13469,Giannis Antetokounmpo,0.932,2019,0.199536,1,1,0
11129,James Harden,0.768,2019,0.19234,2,2,0


In [26]:
def ranks(df):
    df = df.sort_values('Share', ascending=False)
    df['Rank'] = df['Share'].rank(ascending=False).astype(int)
    df = df.sort_values('Predicted Share', ascending=False)  
    df['Predicted Rank'] = df['Predicted Share'].rank(ascending=False).astype(int)
    df['Difference'] = df['Rank'] - df['Predicted Rank']
    
    return df

To evaluate the performance of the model I will look at three common evaluation metrics as well as a custom one. The three standard error metrics are mean absolute error (MAE), mean squared error (MSE), and root mean squared error (RMSE), each computed on the actual and predicted ranks of the players who finished in the top six of the voting. The custom metric gives full credit (a score of 1) for a top-six finisher whom we also predicted in the top six, and partial credit equal to Rank / Predicted Rank for a top-six finisher whom we predicted outside the top six; the six scores are then averaged. For example, James Harden finished in the top six and we predicted he would, so we are not penalized; Nikola Jokić finished fourth but we predicted he would finish 11th, so he contributes only 4/11.
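As a quick check on these definitions, the four metrics can be worked by hand for 2019. The sketch below copies the (Rank, Predicted Rank) pairs of the six top finishers from the table above; the variable names are only for illustration.

```python
import numpy as np

# (actual rank, predicted rank) for the six top vote-getters in 2019,
# copied from the results table above
top6 = [(1, 1), (2, 2), (3, 8), (4, 11), (5, 10), (6, 12)]

# Custom metric: 1 if predicted in the top six, otherwise Rank / Predicted Rank
scores = [1.0 if predicted <= 6 else actual / predicted for actual, predicted in top6]
custom = sum(scores) / len(top6)

# Standard metrics on the rank differences
errors = np.array([actual - predicted for actual, predicted in top6])
mae = np.abs(errors).mean()
mse = (errors ** 2).mean()
rmse = np.sqrt(mse)

print(round(custom, 4), round(mae, 4), mse, round(rmse, 4))
# 0.6231 3.8333 22.5 4.7434
```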

In [27]:
results_2019

Unnamed: 0,Player,Share,Year,Predicted Share,Rank,Predicted Rank,Difference
13469,Giannis Antetokounmpo,0.932,2019,0.199536,1,1,0
11129,James Harden,0.768,2019,0.192340,2,2,0
4282,LeBron James,0.001,2019,0.179065,11,3,8
8113,Anthony Davis,0.000,2019,0.142139,271,4,267
6468,Joel Embiid,0.049,2019,0.137721,7,5,2
...,...,...,...,...,...,...,...
3896,Treveon Graham,0.000,2019,-0.035309,271,526,-255
5294,Frank Ntilikina,0.000,2019,-0.040963,271,527,-256
5299,Kevin Knox,0.000,2019,-0.041393,271,528,-257
11239,Avery Bradley,0.000,2019,-0.050883,271,529,-258
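The Rank of 271 shared by every player with a Share of 0 comes from how pandas handles ties: by default `Series.rank` gives tied values the average of the positions they occupy, and `astype(int)` then truncates that average. That would be consistent with the tied players occupying positions 13 through 529, whose average is 271. The toy example below (made-up shares) shows the default behavior next to `method='min'`, which is one alternative if tied players should instead share the best available position.

```python
import pandas as pd

# Toy shares: two players with votes, three tied at zero
share = pd.Series([0.9, 0.5, 0.0, 0.0, 0.0], index=['A', 'B', 'C', 'D', 'E'])

# Default method='average': the three tied players occupy positions 3, 4, 5 -> 4.0
avg_rank = share.rank(ascending=False).astype(int)

# method='min': tied players all receive the lowest position in the tie -> 3
min_rank = share.rank(ascending=False, method='min').astype(int)

print(avg_rank.tolist())  # [1, 2, 4, 4, 4]
print(min_rank.tolist())  # [1, 2, 3, 3, 3]
```

Because the metrics in this notebook only look at players ranked in the top six, the choice of tie method does not affect them as long as the top six shares are distinct.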


In [28]:
def error_metrics(df):
    #Players who actually finished in the top six of the voting
    top6_df = df[df["Rank"] <= 6].sort_values('Share', ascending=False)

    #Standard metrics compare actual and predicted ranks for those players
    mae = mean_absolute_error(top6_df["Rank"], top6_df["Predicted Rank"])
    mse = mean_squared_error(top6_df["Rank"], top6_df["Predicted Rank"])
    rmse = np.sqrt(mse)

    #Custom Metric: 1 for a top-six finisher predicted in the top six,
    #otherwise Rank / Predicted Rank as partial credit
    scores = []
    for index, row in top6_df.iterrows():
        if row['Predicted Rank'] <= 6:
            scores.append(1)
        else:
            scores.append(row['Rank'] / row['Predicted Rank'])

    custom = sum(scores) / 6

    metrics = {
        'MAE': mae,
        'MSE': mse,
        'RMSE': rmse,
        'Custom Metric': custom
    }

    return metrics

In [29]:
error_metrics(results_2019)

{'MAE': 3.8333333333333335,
 'MSE': 22.5,
 'RMSE': 4.743416490252569,
 'Custom Metric': 0.6231060606060607}

Note that we want the custom metric to be close to one; a value of one means every top-six finisher was also predicted in the top six. We also want to test the results over multiple years, because 2019 may have been unusual and the model may not generalize to other seasons. Let's look at results for 2019-2022. Once we have done this, we can look into different models.

In [30]:
def test_all_years(results, start_year, end_year):
    scores =[]
    years = list(range(start_year,end_year + 1))
    for year in years:
        df = results[results['Year'] == year]
        df = ranks(df)
        metrics = error_metrics(df)
        scores.append((year, metrics))
    return scores

In [31]:
linear_scores = test_all_years(results, 2019, 2022)
linear_scores

[(2019,
  {'MAE': 3.8333333333333335,
   'MSE': 22.5,
   'RMSE': 4.743416490252569,
   'Custom Metric': 0.6231060606060607}),
 (2020,
  {'MAE': 1.0,
   'MSE': 1.6666666666666667,
   'RMSE': 1.2909944487358056,
   'Custom Metric': 0.9523809523809524}),
 (2021,
  {'MAE': 4.833333333333333,
   'MSE': 83.16666666666667,
   'RMSE': 9.119576013536301,
   'Custom Metric': 0.8641975308641975}),
 (2022,
  {'MAE': 4.5,
   'MSE': 47.166666666666664,
   'RMSE': 6.867799259345505,
   'Custom Metric': 0.7703703703703705})]

It looks like our model performed very well for the 2020 season. Now let's average these results over the four years so we can test multiple models and compare their performance.

In [32]:
def average_results(scores):
    sum_metrics = {'MAE': 0, 'MSE': 0, 'RMSE': 0, 'Custom Metric': 0}
    count_metrics = {'MAE': 0, 'MSE': 0, 'RMSE': 0, 'Custom Metric': 0}

    for year, metrics in scores:
        for metric, value in metrics.items():
            sum_metrics[metric] += value
            count_metrics[metric] += 1

    average_metrics = {metric: sum_metrics[metric] / count_metrics[metric] for metric in sum_metrics}

    average_df = pd.DataFrame([average_metrics])

    return average_df

In [33]:
average_results(linear_scores)

Unnamed: 0,MAE,MSE,RMSE,Custom Metric
0,3.541667,38.625,5.505447,0.802514


Now let's put it all together and create a function that takes a model instance and returns its average metrics.

In [34]:
def multi_year_test(model):
    train = df_with_dummies[df_with_dummies['Year'] <= 2018]
    test = df_with_dummies[(df_with_dummies['Year'] >= 2019) & (df_with_dummies['Year'] <= 2022)]
    
    X_train = train[features]
    y_train = train['Share']
    X_test = test[features]
    
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    #Use this model's own predictions, not the global pred from the linear regression cells
    y_pred_df = pd.DataFrame(y_pred, columns=['Predicted Share'], index=test.index)
    results = pd.concat([test[['Player', 'Share', 'Year']], y_pred_df], axis=1)
    #Ranks are computed per year inside test_all_years
    model_scores = test_all_years(results, 2019, 2022)
    average_df = average_results(model_scores)
    return average_df

Let's test it on linear regression:

In [35]:
multi_year_test(LinReg)

Unnamed: 0,MAE,MSE,RMSE,Custom Metric
0,3.541667,38.625,5.505447,0.802514


Looks good! We now have a way to pass in a model and get all of our error metrics back in a dataframe. Now let's test multiple models and see which one performs the best!

In [41]:
#Models
models = {'LinReg': LinearRegression(),
          'Lasso': Lasso(alpha=0.1, random_state=123),
          'Ridge': Ridge(alpha=0.1, random_state=123),
          'Random Forest': RandomForestRegressor(random_state=123, n_jobs=-1),
          'KNN': KNeighborsRegressor(),
          'Gradient Boost': GradientBoostingRegressor(random_state=123),
         }

#Collect one row of average metrics per model
all_results = []

#Generate results
for model_name, model_instance in models.items():
    model_results = multi_year_test(model_instance)
    model_results.insert(0, 'Model', model_name)
    all_results.append(model_results)

#DataFrame.append was removed in pandas 2.0, so combine the rows with pd.concat
results_df = pd.concat(all_results)

In [42]:
results_df

Unnamed: 0,Model,MAE,MSE,RMSE,Custom Metric
0,LinReg,3.541667,38.625,5.505447,0.802514
0,Lasso,39.541667,2046.375,45.150208,0.147076
0,Ridge,3.5,37.5,5.42149,0.803058
0,Random Forest,3.708333,48.708333,5.694381,0.791835
0,KNN,14.75,3261.416667,31.198241,0.799542
0,Gradient Boost,5.208333,80.291667,8.053101,0.768331


In [43]:
results_df.sort_values('RMSE')

Unnamed: 0,Model,MAE,MSE,RMSE,Custom Metric
0,Ridge,3.5,37.5,5.42149,0.803058
0,LinReg,3.541667,38.625,5.505447,0.802514
0,Random Forest,3.708333,48.708333,5.694381,0.791835
0,Gradient Boost,5.208333,80.291667,8.053101,0.768331
0,KNN,14.75,3261.416667,31.198241,0.799542
0,Lasso,39.541667,2046.375,45.150208,0.147076
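One caveat when comparing these models: the Ridge and Lasso penalties and the KNN distance calculation all depend on the scale of each feature, and `StandardScaler` is imported at the top of this notebook but is not applied in the cells shown. A minimal sketch of one way to add scaling, using scikit-learn's `make_pipeline` on synthetic data (the data and coefficients below are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(123)

# Synthetic stand-in: two features on very different scales
X = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)])
y = 0.5 * X[:, 0] + 0.0005 * X[:, 1] + rng.normal(0, 0.1, 200)

# The scaler is fit on the training rows only, then applied to the test rows
scaled_ridge = make_pipeline(StandardScaler(), Ridge(alpha=0.1))
scaled_ridge.fit(X[:150], y[:150])
preds = scaled_ridge.predict(X[150:])

# After scaling, both features carry coefficients of similar size
coefs = scaled_ridge.named_steps['ridge'].coef_
print(preds.shape, coefs)
```

Since a pipeline exposes `fit` and `predict`, an object like `scaled_ridge` could be passed to `multi_year_test` in the same way as the bare models above. Whether scaling changes the ranking of the models here is untested.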


Ridge Regression did the best of the six models we tested on all four error metrics, although Linear Regression was a close second. Let's do some hyperparameter tuning to select the ideal alpha value.

In [86]:
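#Ridge model
Ridge_Tuning = Ridge(random_state=123)
alphas = np.logspace(-1, 3, 100)

#Cross-validation splitter. Assumption: the cell that defined kf is not shown in this notebook,
#so these settings are illustrative and may not reproduce the alpha printed below
kf = KFold(n_splits=5, shuffle=True, random_state=123)

# Use GridSearchCV scored on negative MSE (the alpha that minimizes MSE also minimizes RMSE)
grid = GridSearchCV(Ridge_Tuning, param_grid={'alpha': alphas}, cv=kf, scoring='neg_mean_squared_error')
grid.fit(X_train, y_train)
best_alpha = grid.best_params_['alpha']
print(best_alpha)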

2.848035868435802
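For readers who want to run the tuning step in isolation, here is a self-contained sketch of the same grid search on synthetic data. The `KFold` settings and the data are assumptions made for illustration; only the alpha grid and the scoring string match the cell above. It also shows how to turn the best negative-MSE score back into an RMSE-scale number.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

rng = np.random.default_rng(123)

# Synthetic regression problem standing in for X_train / y_train
X = rng.normal(size=(300, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=300)

# Assumed splitter: 5 shuffled folds with a fixed seed
kf = KFold(n_splits=5, shuffle=True, random_state=123)
alphas = np.logspace(-1, 3, 100)

grid = GridSearchCV(Ridge(), param_grid={'alpha': alphas}, cv=kf,
                    scoring='neg_mean_squared_error')
grid.fit(X, y)

best_alpha = grid.best_params_['alpha']
# best_score_ is the mean negative MSE across folds, so negate it before the square root
best_rmse = np.sqrt(-grid.best_score_)
print(best_alpha, best_rmse)
```

Note that shuffled folds mix seasons together; for data ordered by year, a split that respects time may give a more honest estimate.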


The best alpha for our Ridge model is approximately 2.848. Let's see how this model does on the 2023 data, which it has not yet seen.

In [139]:
#Instantiate Model
Ridge_2023 = Ridge(alpha=2.848035868435802, random_state=123)

#Test
test_2023 = df_with_dummies[(df_with_dummies['Year'] == 2023)]
X_train = train[features]
y_train = train['Share']
X_test_2023 = test_2023[features]
y_test_2023 = test_2023['Share'] 

#Fit
Ridge_2023.fit(X_train, y_train)

#Predict
pred_2023 = Ridge_2023.predict(X_test_2023)
pred_2023_df = pd.DataFrame(pred_2023, columns=['Predicted Share'], index=test_2023.index)

#Results
results_2023 = pd.concat([test_2023[['Player', 'Share']], pred_2023_df], axis=1)
results_2023 = ranks(results_2023)

In [140]:
results_2023.sort_values('Share', ascending=False).head(10)

Unnamed: 0,Player,Share,Predicted Share,Rank,Predicted Rank,Difference
16218,Joel Embiid,0.915,0.200768,1,3,-2
789,Nikola Jokić,0.674,0.184499,2,4,-2
267,Giannis Antetokounmpo,0.606,0.209066,3,1,2
3067,Jayson Tatum,0.28,0.141174,4,9,-5
1375,Shai Gilgeous-Alexander,0.046,0.172629,5,5,0
14973,Donovan Mitchell,0.03,0.10325,6,19,-13
4561,Domantas Sabonis,0.027,0.102403,7,21,-14
359,Luka Dončić,0.01,0.201027,8,2,6
7160,Stephen Curry,0.005,0.12951,9,11,-2
11593,Jimmy Butler,0.003,0.122394,10,14,-4


The tuned ridge model did a reasonable job. It picked Giannis Antetokounmpo as the MVP rather than the actual winner, Joel Embiid, but it placed 2 of the actual top 3 in its predicted top 3 and 4 of the actual top 5 in its predicted top 5, and it got Shai Gilgeous-Alexander's 5th-place finish exactly right. Going forward I want to do additional tweaking and feature engineering to increase accuracy and prepare the model to make predictions for the 2023-2024 NBA season, which starts in October 2023. I would also like to use backtesting to test model accuracy.
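As a starting point for the backtesting idea, here is a minimal sketch of an expanding-window backtest: for each test season, the model is trained on all earlier seasons and then checked on whether its top predicted player matches the actual top vote-getter. Everything here is hypothetical, including the `backtest` helper, the synthetic data, and the choice of "top pick correct" as the score; it is not the evaluation used earlier in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

rng = np.random.default_rng(123)

# Synthetic stand-in: 10 seasons x 30 players with two made-up features
years = np.repeat(np.arange(2010, 2020), 30)
data = pd.DataFrame({
    'Year': years,
    'PTS': rng.normal(15, 6, years.size),
    'W%': rng.uniform(0.2, 0.8, years.size),
})
noise = rng.normal(0, 0.05, years.size)
data['Share'] = (0.02 * data['PTS'] + 0.3 * data['W%'] + noise).clip(0, 1)

def backtest(data, features, model, first_test_year):
    """Train on all seasons before each test year, then score that year."""
    rows = []
    test_years = sorted(data.loc[data['Year'] >= first_test_year, 'Year'].unique())
    for year in test_years:
        train = data[data['Year'] < year]
        test = data[data['Year'] == year]
        model.fit(train[features], train['Share'])
        pred = pd.Series(model.predict(test[features]), index=test.index)
        # Did the highest predicted share belong to the actual top vote-getter?
        top_pick_correct = bool(pred.idxmax() == test['Share'].idxmax())
        rows.append({'Year': int(year), 'TopPickCorrect': top_pick_correct})
    return pd.DataFrame(rows)

report = backtest(data, ['PTS', 'W%'], Ridge(alpha=1.0), first_test_year=2015)
print(report)
```

The same loop could call `ranks` and `error_metrics` on each season instead of the single top-pick check, which would give a per-year history of the custom metric.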