In this project, I found that just predicting the mean number of fantasy points ended up being more effective than using XGBoost or RNNs. Out of curiosity, I want to use this naive prediction on the dataset used in the write-up that inspired this project.

In that article, Christopher Zita used multivariable RNNs to predict the 2018-2019 seasons of Tom Brady and Todd Gurley more accurately than most 'expert' fantasy predictions. As training data, he used the 7 previous seasons (if they existed).

I will simply take each player's mean fantasy points per game over the training seasons and use this as my prediction for every game of the test season. Let's see if this beats Zita.
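
For reference, here is the baseline and the two error metrics written out, following what the code in this notebook computes. The notation is my own and is only a sketch: $y_1, \dots, y_n$ are a player's fantasy points in the $n$ training games (before the split date) and $y_{n+1}, \dots, y_{n+m}$ are the points in the $m$ test games.

$$
\hat{y} = \frac{1}{n} \sum_{i=1}^{n} y_i, \qquad
\mathrm{MAE} = \frac{1}{m} \sum_{j=1}^{m} \left| y_{n+j} - \hat{y} \right|, \qquad
\mathrm{RMSE} = \sqrt{\frac{1}{m} \sum_{j=1}^{m} \left( y_{n+j} - \hat{y} \right)^2}
$$

Every test game gets the same prediction $\hat{y}$, so both metrics only measure how far the test games scatter around the training mean.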

In [16]:
import datetime as dt
import os
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error, mean_absolute_error

In [17]:
# Read in data
players = {}
path = '../data/aside_check_zita_naive'
for filename in os.listdir(path):
    # Only pick up files whose name actually ends in .csv
    if filename.endswith('.csv'):
        players[filename.split('.')[0]] = pd.read_csv(os.path.join(path,filename))

for key, value in players.items():
    print('{}:'.format(key))
    print(value.head())

gurley:
   Rk  G#      Date   Tm Unnamed: 4  Opp   Result Pos  Att   Yds  ...  Tgt.1  \
0   1   3   9/27/15  STL        NaN  PIT   L 6-12  RB  NaN   NaN  ...    NaN   
1   2   4   10/4/15  STL          @  ARI  W 24-22  RB  3.0  -5.0  ...    0.0   
2   3   5  10/11/15  STL          @  GNB  L 10-24  RB  2.0   7.0  ...    0.0   
3   4   6  10/25/15  STL        NaN  CLE   W 24-6  RB  2.0  17.0  ...    0.0   
4   5   7   11/1/15  STL        NaN  SFO   W 27-6  RB  0.0   0.0  ...    1.0   

   Rec.1  Yds.3  TD.3  Num     Pct  Num.1  FantPt  DKPt  FDPt  
0    NaN    NaN   NaN   14  28.00%    0.0     1.4   2.4   1.9  
1      0    0.0     0   36  67.90%    0.0    16.1  21.1  17.1  
2      0    0.0     0   45  64.30%    0.0    15.9  18.9  15.9  
3      0    0.0     0   36  67.90%    0.0    28.3  35.3  30.3  
4      1   -2.0     0   36  52.20%    0.0    20.6  26.6  22.1  

[5 rows x 28 columns]
brady:
   Rk  G#     Date   Tm Unnamed: 4  Opp   Result Pos  Cmp   Att  ...  Cmp.1  \
0   1   1  9/12/11

In [18]:
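def train_test_split(player_df, split_date):
    # Split one player's game log into train/test 'FantPt' frames around split_date
    player_df_cp = player_df.copy()
    player_df_cp['Date'] = pd.to_datetime(player_df_cp['Date'])
    train = player_df_cp.loc[player_df_cp['Date']<split_date, ['FantPt']]
    # >= so that a game played exactly on split_date is not silently dropped
    test = player_df_cp.loc[player_df_cp['Date']>=split_date, ['FantPt']]

    return train, test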

In [19]:
split_date = dt.datetime(2018,5,1)

for key, value in players.items():
    y_train, y_test = train_test_split(value, split_date)
    # Constant prediction: the training-set mean, repeated for every test game
    preds = [y_train['FantPt'].mean()] * len(y_test)
    rmse = np.sqrt(mean_squared_error(y_test, preds))
    mae = mean_absolute_error(y_test, preds)
    print('{}:'.format(key))
    print("RMSE: {:.2f}\n MAE: {:.2f}".format(rmse, mae))

gurley:
RMSE: 13.10
 MAE: 12.04
brady:
RMSE: 6.78
 MAE: 5.54


For Brady, Zita's RNN achieved an MAE of 5.3, while the best expert predictions got 4.9, so the naive mean's 5.54 is only slightly behind. For Gurley, Zita's RNN was the strongest at 5.68, less than half the naive mean's MAE of 12.04.

It seems that the machine learning approach did not make a noticeable difference for a consistent player like Brady. However, considerable gains in performance can be had for less consistent players like Gurley.
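
One way to see why consistency matters so much here: the error of a constant mean prediction is just the spread of the test games around the training mean. The sketch below illustrates this with made-up numbers. The function `naive_mean_baseline` and both toy "players" are hypothetical and are not taken from the Brady or Gurley data.

```python
import math

def naive_mean_baseline(train_points, test_points):
    """Score a constant prediction (the training mean) against the test games."""
    prediction = sum(train_points) / len(train_points)
    errors = [actual - prediction for actual in test_points]
    rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))
    mae = sum(abs(e) for e in errors) / len(errors)
    return prediction, rmse, mae

# Made-up per-game fantasy points, for illustration only (not real player data).
toy_players = {
    'steady': ([18, 20, 22, 19, 21], [19, 21, 20, 22]),
    'volatile': ([5, 35, 10, 30, 20], [2, 38, 12, 28]),
}

for name, (train, test) in toy_players.items():
    prediction, rmse, mae = naive_mean_baseline(train, test)
    print('{}: prediction={:.1f}, RMSE={:.2f}, MAE={:.2f}'.format(name, prediction, rmse, mae))
```

Both toy players have the same training mean of 20.0, yet the steady one ends up with an MAE of 1.00 and the volatile one with an MAE of 13.00. The naive baseline has no way to track game-to-game swings, which is where a model with more inputs has room to improve.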