In [1]:
import pandas as pd

In [2]:
stats = pd.read_csv("PlayerMVPStats.csv")

In [3]:
stats


Unnamed: 0.1,Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,Pts Max,Share,Team,W,L,W/L%,GB,PS/G,PA/G,SRS
0,0,A.C. Green,PF,27,LAL,82,21,26.4,3.1,6.6,...,0.0,0.0,Los Angeles Lakers,58.0,24.0,0.707,5.0,106.3,99.6,6.73
1,1,Byron Scott,SG,29,LAL,82,82,32.1,6.1,12.8,...,0.0,0.0,Los Angeles Lakers,58.0,24.0,0.707,5.0,106.3,99.6,6.73
2,2,Elden Campbell,PF,22,LAL,52,0,7.3,1.1,2.4,...,0.0,0.0,Los Angeles Lakers,58.0,24.0,0.707,5.0,106.3,99.6,6.73
3,3,Irving Thomas,PF,25,LAL,26,0,4.2,0.7,1.9,...,0.0,0.0,Los Angeles Lakers,58.0,24.0,0.707,5.0,106.3,99.6,6.73
4,4,James Worthy,SF,29,LAL,78,74,38.6,9.2,18.7,...,0.0,0.0,Los Angeles Lakers,58.0,24.0,0.707,5.0,106.3,99.6,6.73
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14692,14692,Spencer Hawes,PF,28,MIL,54,1,14.8,2.5,5.1,...,0.0,0.0,Milwaukee Bucks,42.0,40.0,0.512,9.0,103.6,103.8,-0.45
14693,14693,Steve Novak,PF,33,MIL,8,0,2.8,0.3,0.9,...,0.0,0.0,Milwaukee Bucks,42.0,40.0,0.512,9.0,103.6,103.8,-0.45
14694,14694,Terrence Jones,PF,25,MIL,54,12,23.5,4.3,9.1,...,0.0,0.0,Milwaukee Bucks,42.0,40.0,0.512,9.0,103.6,103.8,-0.45
14695,14695,Thon Maker,C,19,MIL,57,34,9.9,1.5,3.2,...,0.0,0.0,Milwaukee Bucks,42.0,40.0,0.512,9.0,103.6,103.8,-0.45


In order to work with this data set, I cleaned up the csv file and deleted any unnecessary columns. I also replaced all the NA values with 0s.

In [4]:
del stats["Unnamed: 0"]

In [5]:
pd.isnull(stats).sum()

Player        0
Pos           0
Age           0
Tm            0
G             0
GS            0
MP            0
FG            0
FGA           0
FG%          59
3P            0
3PA           0
3P%        2086
2P            0
2PA           0
2P%         100
eFG%         59
FT            0
FTA           0
FT%         521
ORB           0
DRB           0
TRB           0
AST           0
STL           0
BLK           0
TOV           0
PF            0
PTS           0
Year          0
Pts Won       0
Pts Max       0
Share         0
Team          0
W           288
L           288
W/L%        288
GB          288
PS/G        288
PA/G        288
SRS         288
dtype: int64

In [6]:
stats = stats.fillna(0)

Now, by looking through all possible columns, I wanted to see which columns I could use as my predictors in order to predict the "Share" column.

In [7]:
stats.columns

Index(['Player', 'Pos', 'Age', 'Tm', 'G', 'GS', 'MP', 'FG', 'FGA', 'FG%', '3P',
       '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB',
       'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS', 'Year',
       'Pts Won', 'Pts Max', 'Share', 'Team', 'W', 'L', 'W/L%', 'GB', 'PS/G',
       'PA/G', 'SRS'],
      dtype='object')

In [8]:
predictors = [ 'Age', 'G', 'GS', 'MP', 'FG', 'FGA', 'FG%', '3P',
       '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB',
       'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS', 'Year', 
        'W', 'L', 'W/L%', 'GB', 'PS/G','PA/G', 'SRS']

I chose all columns that had numerical values and weren't directly correlated to the "Share" column. The "Pts Won" and "Pts Max" are columns directly correlated and would only make the model less accurate.

Now I decided to play around with the model to see if my 2022 MVP prediction would be accurate.

In [9]:
train = stats[stats["Year"] < 2022]

In [10]:
test = stats[stats["Year"] == 2022]

In [11]:
from sklearn.linear_model import Ridge

reg = Ridge(alpha = .1)

In [12]:
reg.fit(train[predictors], train["Share"])

In [13]:
predictions = reg.predict(test[predictors])

In [14]:
predictions = pd.DataFrame(predictions, columns = ["predictions"], index = test.index)

In [15]:
predictions

Unnamed: 0,predictions
648,0.012934
649,-0.028142
650,-0.006163
651,0.016564
652,-0.004820
...,...
12508,0.060355
12509,0.069539
12510,0.083546
12511,0.080897


In [16]:
combination = pd.concat([test[["Player","Share"]],predictions],axis=1)

I was able to see how my regressive predictor ranked individuals in the 2022 season, by how many votes they would recieve. The highest value indicates that they would have won MVP.

In [17]:
combination

Unnamed: 0,Player,Share,predictions
648,Aaron Gordon,0.0,0.012934
649,Austin Rivers,0.0,-0.028142
650,Bol Bol,0.0,-0.006163
651,Bones Hyland,0.0,0.016564
652,Bryn Forbes,0.0,-0.004820
...,...,...,...
12508,Micah Potter,0.0,0.060355
12509,Rodney McGruder,0.0,0.069539
12510,Saben Lee,0.0,0.083546
12511,Saddiq Bey,0.0,0.080897


In [18]:
combination.sort_values("Share", ascending=False).head(10)

Unnamed: 0,Player,Share,predictions
663,Nikola Jokić,0.875,0.190365
837,Joel Embiid,0.706,0.190462
11678,Giannis Antetokounmpo,0.595,0.21941
907,Devin Booker,0.216,0.091309
11469,Luka Dončić,0.146,0.157395
1179,Jayson Tatum,0.043,0.095902
12226,Ja Morant,0.01,0.120508
6398,Stephen Curry,0.004,0.093138
905,Chris Paul,0.002,0.078329
8241,LeBron James,0.001,0.236135


The model was not accurate in showing the 2022 MVP, when compared to the "Share" column for the season, we can see that my model predicted Giannis Antetokunmpo to be the MVP, instead Nikola Jokic was the MVP.

I then went forward in determining an error metric to see how accurate my predictor could be for all seasons in the data set.

In [19]:
from sklearn.metrics import mean_squared_error

In [20]:
mean_squared_error(combination["Share"],combination["predictions"])

0.0052267057271756096

Using the mean squared error object from sklearn, I was not able to generat an accurate error metric, since the data set has many rows of people who recieved No votes. The mean squared error from sklearn would not be useful for this model. Therefore I would have to create my own way of determining how accurate the model was. 

In [21]:
combination["Share"].value_counts()

0.000    593
0.001      3
0.875      1
0.706      1
0.002      1
0.216      1
0.043      1
0.004      1
0.146      1
0.595      1
0.010      1
Name: Share, dtype: int64

In [22]:
combination = combination.sort_values("Share", ascending= False)
combination["Rk"] = list(range(1,combination.shape[0]+1))

In [23]:
combination.head(10)

Unnamed: 0,Player,Share,predictions,Rk
663,Nikola Jokić,0.875,0.190365,1
837,Joel Embiid,0.706,0.190462,2
11678,Giannis Antetokounmpo,0.595,0.21941,3
907,Devin Booker,0.216,0.091309,4
11469,Luka Dončić,0.146,0.157395,5
1179,Jayson Tatum,0.043,0.095902,6
12226,Ja Morant,0.01,0.120508,7
6398,Stephen Curry,0.004,0.093138,8
905,Chris Paul,0.002,0.078329,9
8241,LeBron James,0.001,0.236135,10


In [24]:
combination = combination.sort_values("predictions", ascending =False)
combination["Prediction_Rk"] = list(range(1,combination.shape[0]+1))

In [25]:
combination.head(10)

Unnamed: 0,Player,Share,predictions,Rk,Prediction_Rk
8241,LeBron James,0.001,0.236135,10,1
11678,Giannis Antetokounmpo,0.595,0.21941,3,2
837,Joel Embiid,0.706,0.190462,2,3
663,Nikola Jokić,0.875,0.190365,1,4
8231,Anthony Davis,0.0,0.185613,112,5
8582,Dejounte Murray,0.0,0.173725,84,6
1211,Shai Gilgeous-Alexander,0.0,0.166799,313,7
5031,Christian Wood,0.0,0.158033,479,8
11469,Luka Dončić,0.146,0.157395,5,9
1856,Domantas Sabonis,0.0,0.154263,344,10


In [26]:
combination.sort_values("Share", ascending = False).head(10)

Unnamed: 0,Player,Share,predictions,Rk,Prediction_Rk
663,Nikola Jokić,0.875,0.190365,1,4
837,Joel Embiid,0.706,0.190462,2,3
11678,Giannis Antetokounmpo,0.595,0.21941,3,2
907,Devin Booker,0.216,0.091309,4,79
11469,Luka Dončić,0.146,0.157395,5,9
1179,Jayson Tatum,0.043,0.095902,6,64
12226,Ja Morant,0.01,0.120508,7,28
6398,Stephen Curry,0.004,0.093138,8,73
905,Chris Paul,0.002,0.078329,9,142
3938,DeMar DeRozan,0.001,0.099241,11,54


The first step in creating an error metric was to first rank each player, first by their rank based on the "Share" column, and then another rank by the "predictions" column. When I compared these columns together, I was able to find a metric more suitable for my model.

In [27]:
def find_ap(combination):
    actual = combination.sort_values("Share", ascending = False).head(5)
    predicted = combination.sort_values("predictions", ascending =False)
    ps =[]
    found = 0
    seen = 1
    for index, row in predicted.iterrows():
        if row["Player"] in actual["Player"].values:
            found +=1
            ps.append(found/seen)
        seen += 1
    return sum(ps)/len(ps)

I created this function to find the average precision (ap), by using the top 5 players in the actual rank column and comparing them to the top 5 players int he predicted rank column. I only used the top 5 players since these would typically be the players with the most votes anyway, disregarding all players with 0 votes completely. 

In [28]:
find_ap(combination)

0.4848804500703234

when we run the function, we can see that my modelw as only 48% accurate as of now.

Next I decided to run the model using the first 5 years in the data set (1991-1995) to train the model and the rest of the years to be tested on. 

In [29]:
years = list(range(1991,2023))

In [30]:
aps = []
all_predictions = []
for year in years[5:]:
    train = stats[stats["Year"] < year]
    test = stats[stats["Year"] == year]
    reg.fit(train[predictors],train["Share"])
    predictions = reg.predict(test[predictors])
    predictions = pd.DataFrame(predictions, columns = ["predictions"], index = test.index)
    combination = pd.concat([test[["Player","Share"]],predictions],axis=1)
    all_predictions.append(combination)
    aps.append(find_ap(combination))

In [31]:
sum(aps)/ len(aps)

0.7029029551156751

Here, on average, the model was 70% in predicting the ranks of the top 5 players for the MVP vote after 1995.

I decided to create funtions to streamline adding ranks and running the find_ap function for every year, rather than continuosely using the for loop from above.

The add_ranks function adds the ranks per year, and the the backtest function runs the for loop for every year, while adding the ranks and finding the average precision per iteration.

In [32]:
def add_ranks(combination):
    combination = combination.sort_values("Share", ascending= False)
    combination["Rk"] = list(range(1,combination.shape[0]+1))
    combination = combination.sort_values("predictions", ascending =False)
    combination["Prediction_Rk"] = list(range(1,combination.shape[0]+1))
    combination["Diff"] = combination["Rk"] - combination["Prediction_Rk"]
    return combination

In [33]:
ranking = add_ranks(all_predictions[1])
ranking[ranking["Rk"] < 6].sort_values("Diff",ascending = False)

Unnamed: 0,Player,Share,predictions,Rk,Prediction_Rk,Diff
1710,Karl Malone,0.857,0.192318,1,2,-1
10976,Michael Jordan,0.832,0.167629,2,3,-1
970,Grant Hill,0.327,0.128646,3,6,-3
4912,Tim Hardaway,0.207,0.059984,4,20,-16
8642,Glen Rice,0.117,0.03311,5,53,-48


In [34]:
def backtest(stats, model, year, predictors):
    aps = []
    all_predictions = []
    for year in years[5:]:
        train = stats[stats["Year"] < year]
        test = stats[stats["Year"] == year]
        model.fit(train[predictors],train["Share"])
        predictions = reg.predict(test[predictors])
        predictions = pd.DataFrame(predictions, columns = ["predictions"], index = test.index)
        combination = pd.concat([test[["Player","Share"]],predictions],axis=1)
        combination = add_ranks(combination)
        combination["Year"] = stats["Year"]
        combination["PTS"] = stats["PTS"]
        combination["AST"] = stats["AST"]
        combination["TRB"] = stats["TRB"]
        all_predictions.append(combination)
        aps.append(find_ap(combination))
    return sum(aps)/len(aps), aps, pd.concat(all_predictions)

In [35]:
mean_ap, aps, all_predictions = backtest(stats,reg,years[5:],predictors)

In [36]:
mean_ap

0.7029029551156751

Here, the backtest function showed the same ap as the for loop above. This means that we can start adding predictors and playing around with the models to gain a higher mean_ap value, without having to hard code our for loop every time. 

In [37]:
all_predictions[all_predictions["Rk"] < 6].sort_values("Diff").head(10)

Unnamed: 0,Player,Share,predictions,Rk,Prediction_Rk,Diff,Year,PTS,AST,TRB
907,Devin Booker,0.216,0.091309,4,79,-75,2022,26.8,4.8,5.0
1334,Jason Kidd,0.712,0.02821,2,52,-50,2002,14.7,9.9,7.3
8642,Glen Rice,0.117,0.03311,5,53,-48,1997,26.8,2.0,4.0
5420,Steve Nash,0.839,0.0341,1,45,-44,2005,15.5,11.5,3.3
8910,Peja Stojaković,0.228,0.03627,4,38,-34,2004,24.2,2.1,6.3
13331,Joakim Noah,0.258,0.046968,4,37,-33,2014,12.6,5.4,11.3
5438,Steve Nash,0.739,0.054129,1,34,-33,2006,18.8,10.5,4.2
3849,Chauncey Billups,0.344,0.052696,5,35,-30,2006,18.5,8.6,3.1
1499,Chris Paul,0.138,0.072293,5,33,-28,2021,16.4,8.9,4.5
5453,Steve Nash,0.785,0.074421,2,21,-19,2007,18.6,11.6,3.5


In [38]:
pd.concat([pd.Series(reg.coef_), pd.Series(predictors)],axis =1).sort_values(0, ascending=False)

Unnamed: 0,0,1
13,0.087852,eFG%
18,0.03386,DRB
29,0.023198,W/L%
17,0.020993,ORB
10,0.016456,2P
21,0.01207,STL
22,0.010901,BLK
15,0.010414,FTA
20,0.007113,AST
12,0.007054,2P%


By grabing the coefficients, we can see which columns have the most emphasis in the model. I decided to compare the PTS, AST, STL, BLK, 3P, Year, eFG%, W/L%, and TRB columns and create a new column that shows how each player performed in these categories as a ratio of the average in that stat line. 

In [39]:
stats_ratios = stats[["PTS","AST","STL","BLK","3P","Year","eFG%","W/L%","TRB"]].groupby("Year").apply(lambda x: (x)/x.mean())

In [40]:
stats_ratios

Unnamed: 0,PTS,AST,STL,BLK,3P,Year,eFG%,W/L%,TRB
0,1.013334,0.420714,0.961127,0.673469,0.508587,1.0,1.041653,1.438663,1.706296
1,1.614653,1.028412,1.647646,0.673469,4.577279,1.0,1.093092,1.438663,0.812522
2,0.311795,0.093492,0.274608,1.571429,0.000000,1.0,0.975210,1.438663,0.487513
3,0.200440,0.186984,0.274608,0.000000,0.000000,1.0,0.728728,1.438663,0.325009
4,2.383005,1.636110,1.784950,0.897959,1.525760,1.0,1.073803,1.438663,1.245867
...,...,...,...,...,...,...,...,...,...
14692,0.735752,0.819562,0.479763,1.528302,0.650951,1.0,1.074210,1.029491,0.981705
14693,0.071202,0.000000,0.000000,0.000000,0.130190,1.0,0.724940,1.029491,0.112195
14694,1.281633,0.601012,1.119447,2.547170,0.520761,1.0,0.992985,1.029491,1.598776
14695,0.474679,0.218550,0.319842,1.273585,0.650951,1.0,1.088425,1.029491,0.560974


In [41]:
stats[["PTS_T","AST_R","STL_R","BLK_R","3P_R","eFG%_R","W/L%_R","TRB_R"]] = stats_ratios[["PTS","AST","STL","BLK","3P","eFG%","W/L%","TRB"]]

In [42]:
stats.head()

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,...,PA/G,SRS,PTS_T,AST_R,STL_R,BLK_R,3P_R,eFG%_R,W/L%_R,TRB_R
0,A.C. Green,PF,27,LAL,82,21,26.4,3.1,6.6,0.476,...,99.6,6.73,1.013334,0.420714,0.961127,0.673469,0.508587,1.041653,1.438663,1.706296
1,Byron Scott,SG,29,LAL,82,82,32.1,6.1,12.8,0.477,...,99.6,6.73,1.614653,1.028412,1.647646,0.673469,4.577279,1.093092,1.438663,0.812522
2,Elden Campbell,PF,22,LAL,52,0,7.3,1.1,2.4,0.455,...,99.6,6.73,0.311795,0.093492,0.274608,1.571429,0.0,0.97521,1.438663,0.487513
3,Irving Thomas,PF,25,LAL,26,0,4.2,0.7,1.9,0.34,...,99.6,6.73,0.20044,0.186984,0.274608,0.0,0.0,0.728728,1.438663,0.325009
4,James Worthy,SF,29,LAL,78,74,38.6,9.2,18.7,0.492,...,99.6,6.73,2.383005,1.63611,1.78495,0.897959,1.52576,1.073803,1.438663,1.245867


In [43]:
predictors += ["PTS_T","AST_R","STL_R","BLK_R","3P_R","eFG%_R","W/L%_R","TRB_R"]

In [44]:
mean_ap, aps, all_predictions = backtest(stats,reg,years[5:],predictors)

In [45]:
mean_ap

0.7219067662808563

By doing this, our model increased by 2% precision compared to origninally, which is alot in a model that has so many depended varaibles.  

In [46]:
stats["NPos"] = stats["Pos"].astype("category").cat.codes

In [47]:
stats["Ntm"] = stats["Tm"].astype("category").cat.codes

In [48]:
from sklearn.ensemble import RandomForestRegressor

In [49]:
rf = RandomForestRegressor(n_estimators = 100, random_state = 1, min_samples_split =5)

In [50]:
mean_ap, aps, all_predictions = backtest(stats,rf,years[28:],predictors)

In [51]:
mean_ap

0.7430315431614715

By using a Random Forest Regressor, our model increased by 4%.

In [None]:
all_predictions[all_predictions["Rk"] < 6].sort_values("Diff").head(10)

In [53]:
all_predictions[all_predictions["Rk"] < 3].groupby("Rk").head(30)

Unnamed: 0,Player,Share,predictions,Rk,Prediction_Rk,Diff,Year,PTS,AST,TRB
10261,David Robinson,0.508,0.159476,2,2,0,1996,25.0,3.0,12.2
10962,Michael Jordan,0.986,0.140083,1,4,-3,1996,30.4,4.3,6.6
1710,Karl Malone,0.857,0.171357,1,2,-1,1997,27.4,4.5,9.9
10976,Michael Jordan,0.832,0.140057,2,3,-1,1997,29.6,4.3,5.9
1723,Karl Malone,0.726,0.166212,2,2,0,1998,27.0,3.9,10.3
10990,Michael Jordan,0.934,0.122499,1,4,-3,1998,28.7,3.5,5.8
1736,Karl Malone,0.701,0.138245,1,2,-1,1999,23.8,4.1,9.4
4931,Alonzo Mourning,0.655,0.088866,2,7,-5,1999,20.1,1.6,11.0
143,Shaquille O'Neal,0.998,0.223928,1,1,0,2000,29.7,3.8,13.6
5885,Kevin Garnett,0.337,0.097435,2,9,-7,2000,22.9,5.0,11.8


Here, I decided to play around with certain concepts in the MVP discussion. While this is not particularly true all the time, it is often specualted that the player with the best stats would receive the highest number of votes, as long as they have a winning team score. Therefore, I added more emphasis on PTS, AST, TRB, and eFG% columns by turning them into quadratic values to give these variables more weight in the model. 

In [54]:
stats["PTS_2"] = stats["PTS"] * stats["PTS"]

In [55]:
stats

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,...,AST_R,STL_R,BLK_R,3P_R,eFG%_R,W/L%_R,TRB_R,NPos,Ntm,PTS_2
0,A.C. Green,PF,27,LAL,82,21,26.4,3.1,6.6,0.476,...,0.420714,0.961127,0.673469,0.508587,1.041653,1.438663,1.706296,2,15,82.81
1,Byron Scott,SG,29,LAL,82,82,32.1,6.1,12.8,0.477,...,1.028412,1.647646,0.673469,4.577279,1.093092,1.438663,0.812522,12,15,210.25
2,Elden Campbell,PF,22,LAL,52,0,7.3,1.1,2.4,0.455,...,0.093492,0.274608,1.571429,0.000000,0.975210,1.438663,0.487513,2,15,7.84
3,Irving Thomas,PF,25,LAL,26,0,4.2,0.7,1.9,0.340,...,0.186984,0.274608,0.000000,0.000000,0.728728,1.438663,0.325009,2,15,3.24
4,James Worthy,SF,29,LAL,78,74,38.6,9.2,18.7,0.492,...,1.636110,1.784950,0.897959,1.525760,1.073803,1.438663,1.245867,8,15,457.96
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14692,Spencer Hawes,PF,28,MIL,54,1,14.8,2.5,5.1,0.484,...,0.819562,0.479763,1.528302,0.650951,1.074210,1.029491,0.981705,2,18,38.44
14693,Steve Novak,PF,33,MIL,8,0,2.8,0.3,0.9,0.286,...,0.000000,0.000000,0.000000,0.130190,0.724940,1.029491,0.112195,2,18,0.36
14694,Terrence Jones,PF,25,MIL,54,12,23.5,4.3,9.1,0.470,...,0.601012,1.119447,2.547170,0.520761,0.992985,1.029491,1.598776,2,18,116.64
14695,Thon Maker,C,19,MIL,57,34,9.9,1.5,3.2,0.459,...,0.218550,0.319842,1.273585,0.650951,1.088425,1.029491,0.560974,0,18,16.00


In [56]:
predictors += ["PTS_2"]

In [57]:
stats["AST_2"] = stats["AST"] * stats["AST"]
predictors += ["AST_2"]

In [58]:
stats["TRB_2"] = stats["TRB"] * stats["TRB"]
predictors += ["TRB_2"]

In [59]:
stats["eFG%_2"] = stats["eFG%"] * stats["eFG%"]
predictors += ["eFG%_2"]

In [60]:
mean_ap, aps, all_predictions = backtest(stats,reg,years[5:],predictors)

In [61]:
mean_ap

0.725977351268948

In [62]:
all_predictions[all_predictions["Rk"] < 6].groupby("Rk").head(20)

Unnamed: 0,Player,Share,predictions,Rk,Prediction_Rk,Diff,Year,PTS,AST,TRB
10962,Michael Jordan,0.986,0.290991,1,1,0,1996,30.4,4.3,6.6
5197,Hakeem Olajuwon,0.211,0.264942,4,3,1,1996,26.9,3.6,10.9
10261,David Robinson,0.508,0.231424,2,4,-2,1996,25.0,3.0,12.2
7645,Anfernee Hardaway,0.319,0.097303,3,11,-8,1996,21.7,7.1,4.3
10965,Scottie Pippen,0.200,0.064252,5,18,-13,1996,19.4,5.9,6.4
...,...,...,...,...,...,...,...,...,...,...
6618,Russell Westbrook,0.271,0.194198,4,1,3,2015,28.1,8.6,7.3
4014,LeBron James,0.425,0.178484,3,2,1,2015,25.3,7.4,6.0
13984,James Harden,0.720,0.171407,2,3,-1,2015,27.4,7.0,5.7
4275,Anthony Davis,0.156,0.159507,5,4,1,2015,24.4,2.2,10.2


In [63]:
mean_ap, aps, all_predictions = backtest(stats,rf,years[28:],predictors)

In [64]:
mean_ap

0.7381475257349244

In [65]:
all_predictions[all_predictions["Rk"] < 2].groupby("Rk").head(30)

Unnamed: 0,Player,Share,predictions,Rk,Prediction_Rk,Diff,Year,PTS,AST,TRB
10962,Michael Jordan,0.986,0.246582,1,1,0,1996,30.4,4.3,6.6
1710,Karl Malone,0.857,0.225376,1,3,-2,1997,27.4,4.5,9.9
10990,Michael Jordan,0.934,0.209684,1,3,-2,1998,28.7,3.5,5.8
1736,Karl Malone,0.701,0.148281,1,3,-2,1999,23.8,4.1,9.4
143,Shaquille O'Neal,0.998,0.332874,1,1,0,2000,29.7,3.8,13.6
1030,Allen Iverson,0.904,0.238466,1,2,-1,2001,31.1,4.6,3.8
5396,Tim Duncan,0.757,0.179194,1,3,-2,2002,25.5,3.7,12.7
12939,Tim Duncan,0.808,0.159596,1,6,-5,2003,23.3,3.9,12.9
14388,Kevin Garnett,0.991,0.178313,1,1,0,2004,24.2,5.0,13.9
5420,Steve Nash,0.839,0.081535,1,17,-16,2005,15.5,11.5,3.3


While are Ridge regressor model improved, our Random Forest Model actually worsened. Therefore, the model with the highest precision is the model with the average ratios instead of the quadratic values. 

There are a lot more factors in determining the MVP vote of a player in a given season that can't be determined through stats, which is why models like these are very hard to predict, but getting a 74% accuracy, is still exceptional in determining something from nothing. Ideas like Voter Fatigue, where MVP voters will often get tired of voting for the same person every year even if they have the best stat line. If this weren't the case, then Lebron James, Stephen Curry, and Giannis Attentokunmpo would have a lot more MVPs under their belt. Another Idea is that MVP voters often like a story line when voting as well, players who little to no help from teamates in creating a winning team record while having the best stat line, or teams who improved their teams winning record by just joining the team earlier in the season often get brownie points that an algorithm can't necessarily predict. 

As I learn more about regression and machine learning, I will continue to build a model more and more accurate, so that one day I can have an algorithm for sports betting.