<p style="font-size:24px"><b>Predicting Fantasy Football Stars</b></p>
Itay Akad

I love fantasy football. I love sports in general, but there’s something about sitting down for an entire Sunday and watching football with your league mates that cannot be beat. To me, football is easily the best fantasy sport. Basketball and baseball have so many games each week that you are constantly updating your lineup, and soccer has too few measurable stats to base fantasy points on. Football, however, has the perfect mix of frequency and statistical measurement. Games are neither too frequent nor too scarce, and points come from yards and touchdowns, so one play can literally make or break your week. Overall, fantasy football is exciting and fun, and I’m hoping to predict who I should draft next year.

<i>Cleaning Data</i>

First and foremost, I need a data set to work with. Thankfully, Funk Monarch on Kaggle posted a huge data set containing every important offensive metric for every player from every week since 2012. That was a lot of data to sort through, but with the help of SQL queries and ChatGPT, I isolated what I determined to be the important metrics for flex positions (wide receivers, running backs, and tight ends) in one spreadsheet and for quarterbacks in another. Now I can begin building a predictive model.

<i>Building The Model (Flex Players)</i>

To build the actual model, I moved from MySQL to VS Code so I could use Python, which has libraries that make building predictive models much easier. To start, I ran a correlation test to see which of the numeric columns in my dataset correlated most strongly with ppr_ppg (PPR points per game). Everything here is geared toward PPR scoring, so I apologize in advance to everyone in standard and half-PPR leagues.
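For readers new to PPR, here is a minimal sketch of how a flex player's points are typically computed. It assumes the common default scoring (1 point per reception, 1 point per 10 rushing or receiving yards, 6 points per rushing or receiving touchdown); leagues vary, and the data set's fantasy_points_ppr column may also account for items such as fumbles or two-point conversions. The function names `ppr_points` and `points_per_game` are hypothetical and not part of the data set.

```python
def ppr_points(receptions, yards, touchdowns):
    """Approximate PPR points for a flex stat line (assumed default scoring)."""
    # 1 point per reception, 1 point per 10 yards, 6 points per touchdown
    return receptions * 1.0 + yards / 10 + touchdowns * 6.0


def points_per_game(total_points, games):
    """Season total divided by games played; 0.0 if the player never played."""
    return total_points / games if games else 0.0


# One game: 5 catches, 100 total yards, 1 touchdown
week = ppr_points(receptions=5, yards=100, touchdowns=1)
print(week)

# A hypothetical two-game stretch averaged into a per-game figure
season_total = week + ppr_points(receptions=3, yards=40, touchdowns=0)
print(points_per_game(season_total, games=2))
```

Under these assumed settings a single catch is worth as much as 10 yards, which is why receptions and targets rank so highly in a PPR correlation test.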

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In [4]:
flex_data_path = '/Users/itayakad/Desktop/Github Projects/FantasyFootballPredictor/oy_flex.csv'
flex_data = pd.read_csv(flex_data_path)
numeric_columns = flex_data.select_dtypes(include=['number'])
correlation = numeric_columns.corr()
ppr_ppg_correlation = correlation['ppr_ppg'].sort_values(ascending=False).head(10)
print(ppr_ppg_correlation)

ppr_ppg                        1.000000
ypg                            0.952677
fantasy_points_ppr             0.908767
total_yards                    0.889715
total_tds                      0.834190
receiving_yards_after_catch    0.818325
receptions                     0.805404
rec_ypg                        0.784015
targets                        0.772262
target_share                   0.770296
Name: ppr_ppg, dtype: float64


Based on the test, I decided to train my model on YPG (yards per game), total yards, total touchdowns, receiving yards after catch, receptions, and games played.

In [5]:
top_features = ['ypg','total_yards','total_tds','receiving_yards_after_catch','receptions','games']
X = numeric_columns[top_features]
y = numeric_columns['ppr_ppg']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")

Mean Squared Error: 0.8516498979088538
R-squared: 0.9668970681095035


<i>Testing the Algorithm</i>

Now that we have our model (which has pretty good MSE and R^2 values), I want to test how strong it is. To do this, I used the model to predict the top 20 players in PPR PPG for each year and compared that list with the actual top 20 players in total PPR points for that year. I then gave it an accuracy score: 2 points if the model put the right player in the exact position he finished, 1 point if it predicted a player to be in the top 20 but in the wrong position, and 0 if it missed completely. I then compared this with a control that predicts the top 20 from whoever the top 20 in PPR PPG were the year before. Essentially, this lets me see how much more accurate my model is than blindly copying and pasting the previous year's top 20 in PPR PPG.

In [6]:
def predict_next_year(train_year):
    # Split out the training season and the season being predicted.
    data_train = flex_data[flex_data['season'] == train_year]
    data_actual = flex_data[flex_data['season'] == (train_year + 1)]
    numeric_columns_train = data_train.select_dtypes(include=['number'])

    X_train = numeric_columns_train[top_features]
    y_train = numeric_columns_train['ppr_ppg']

    # Fit on the training season only.
    model = LinearRegression()
    model.fit(X_train, y_train)

    # The projection for next year is the fitted PPR PPG from each player's training-season stats.
    data_predict = data_train.copy()
    data_predict['predicted_ppr_ppg'] = model.predict(X_train)

    # Top 20 by projected PPR PPG.
    top_20_predicted = data_predict[['name', 'predicted_ppr_ppg']].sort_values(by='predicted_ppr_ppg', ascending=False).head(20).reset_index(drop=True)
    top_20_predicted['predicted_rank'] = range(1, 21)

    # Top 20 by actual total PPR points in the following season.
    top_20_actual = data_actual[['name', 'fantasy_points_ppr']].sort_values(by='fantasy_points_ppr', ascending=False).head(20).reset_index(drop=True)
    top_20_actual['actual_rank'] = range(1, 21)

    score = accuracy_test(top_20_actual, top_20_predicted)

    combined_results = pd.DataFrame({
        'Rank': range(1, 21),
        'Projected Leaders': top_20_predicted['name'],
        'Projected PPR PPG': top_20_predicted['predicted_ppr_ppg'],
        '': range(1,21),
        'Actual Leaders': top_20_actual['name'],
        'Actual PPR Total': top_20_actual['fantasy_points_ppr']
    })
    
    print(f"Projected vs Actual Leaders for the {train_year + 1} Season based on {train_year} Stats:")
    print(combined_results)
    print(f"Accuracy Score: {score}")

In [7]:
def control(train_year):
    data_train = flex_data[flex_data['season'] == train_year]
    data_actual = flex_data[flex_data['season'] == (train_year + 1)]
    
    top_20_actual = data_actual[['name', 'fantasy_points_ppr']].sort_values(by='fantasy_points_ppr', ascending=False).head(20).reset_index(drop=True)
    top_20_actual['actual_rank'] = range(1, 21)

    control_predictions = data_train[['name', 'ppr_ppg']].sort_values(by='ppr_ppg', ascending=False).head(20).reset_index(drop=True)
    control_predictions['predicted_rank'] = range(1, 21)
    
    score = accuracy_test(top_20_actual, control_predictions)
    print(f"Accuracy Score (Control): {score}")

In [8]:
def accuracy_test(actual, predicted):
    # Score the predicted top 20 against the actual top 20:
    # 2 points for an exact rank match, 1 point for a player who made both lists.
    score = 0
    predicted_ranks = dict(zip(predicted['name'], predicted['predicted_rank']))
    actual_ranks = dict(zip(actual['name'], actual['actual_rank']))
    for player in actual_ranks:
        if player in predicted_ranks:
            if predicted_ranks[player] == actual_ranks[player]:
                score += 2  # Exact rank match
            else:
                score += 1  # Player is in the top 20 but not the exact rank

    return score
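
To make the scoring rule concrete, here is a small worked example on a made-up four-player list. It is a simplified, list-based restatement of the same rule rather than the notebook's DataFrame version; `score_lists` and the player names are hypothetical.

```python
def score_lists(actual_names, predicted_names):
    """2 points for an exact position match, 1 for appearing in both lists, else 0."""
    score = 0
    for rank, player in enumerate(actual_names):
        if player in predicted_names:
            # Same index in both lists means the exact finishing position was predicted
            score += 2 if predicted_names.index(player) == rank else 1
    return score


actual = ["Player A", "Player B", "Player C", "Player D"]
predicted = ["Player A", "Player C", "Player B", "Player E"]

# Player A: exact match (2); Players B and C: on the list, wrong spot (1 each); Player D: missed (0)
toy_score = score_lists(actual, predicted)
print(toy_score)  # 4
```

With a real top 20, a perfect prediction would score 40 for a single season.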

In [9]:
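def test_data():
    # Train on each season from 2013 through 2023 and score the following season.
    for train_year in range(2013, 2024):
        predict_next_year(train_year)
        control(train_year)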

test_data()

Projected vs Actual Leaders for the 2014 Season based on 2013 Stats:
    Rank Projected Leaders  Projected PPR PPG        Actual Leaders  \
0      1    Jamaal Charles          25.529133   1     Antonio Brown   
1      2    Calvin Johnson          21.738849   2      Le'Veon Bell   
2      3       Julio Jones          21.634875   3    DeMarco Murray   
3      4        Matt Forte          20.733416   4        Matt Forte   
4      5  Demaryius Thomas          20.547921   5  Demaryius Thomas   
5      6      Jimmy Graham          19.991285   6      Jordy Nelson   
6      7      LeSean McCoy          19.682903   7        Dez Bryant   
7      8        A.J. Green          19.523381   8    Marshawn Lynch   
8      9  Brandon Marshall          19.501303   9  Emmanuel Sanders   
9     10     Antonio Brown          19.437357  10       Julio Jones   
10    11        Dez Bryant          19.169592  11      Randall Cobb   
11    12   Justin Blackmon          18.896266  12     Jeremy Maclin   
12    13

<i>Improving the Model</i>

Based on the results, my model had a combined accuracy score of 88 versus the control's 84. While this is only about a 4.8% improvement, it suggests that my model does better than simply reusing last year's rankings. To try to improve the model, I found a data set on Kaggle from Nick Cantalupa that has every important team metric from the last 20 years. After cleaning it and fixing some inconsistent variables, I joined my 2 data sets together to get one big flex player data set.
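
The size of that improvement is simple arithmetic on the two combined scores reported above, measured relative to the control. A quick check:

```python
model_score = 88    # combined accuracy score of the regression model
control_score = 84  # combined accuracy score of the "copy last year" control

# Improvement relative to the control's score
relative_gain = (model_score - control_score) / control_score
print(f"Relative improvement over control: {relative_gain:.2%}")
```

The result is 4/84, which rounds to about 4.8%.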

I decided to add how many points the player's team scored that season ('points') and how many offensive plays that team ran ('plays_offense') as 2 new predictors, because an NFL player cannot be good in fantasy if his team is not putting up points. That's not to say that a bad team cannot have a great fantasy player, though. A bad team can have a top 5 fantasy player and put up 20-30 points each week, but if it is allowing 30-40 points each week, it is still a bad football team. Fortunately, I do not care whether the team is good or bad; I just care whether it puts up points.
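
The join itself is not shown in this notebook. As a rough sketch of how such a join could look in pandas, assume both tables share `team` and `season` columns (hypothetical key names; the actual columns in the two Kaggle data sets may differ). The rows below are made up purely for illustration.

```python
import pandas as pd

# Toy player-season rows (made-up values)
players = pd.DataFrame({
    "name": ["Player A", "Player B", "Player C"],
    "team": ["AAA", "BBB", "AAA"],
    "season": [2013, 2013, 2014],
    "ppr_ppg": [18.2, 12.5, 15.1],
})

# Toy team-season rows carrying the two new predictors (made-up values)
teams = pd.DataFrame({
    "team": ["AAA", "BBB", "AAA"],
    "season": [2013, 2013, 2014],
    "points": [410, 320, 385],
    "plays_offense": [1050, 990, 1020],
})

# Left join keeps every player row; validate raises if a team-season appears twice in `teams`
merged = players.merge(teams, on=["team", "season"], how="left", validate="many_to_one")
print(merged)
```

A left join with `validate="many_to_one"` is one way to make sure no player rows are silently dropped or duplicated when team names are inconsistent between the two sources.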

In [10]:
flex_data_path = '/Users/itayakad/Desktop/Github Projects/FantasyFootballPredictor/flexplayer_team.csv'
flex_data = pd.read_csv(flex_data_path)
numeric_columns = flex_data.select_dtypes(include=['number'])

top_features = ['ypg','total_yards','total_tds','receiving_yards_after_catch','receptions','games','points','plays_offense']
X = numeric_columns[top_features]
y = numeric_columns['ppr_ppg']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")

Mean Squared Error: 0.7969601142958663
R-squared: 0.9674979735471614


Adding 'points' and 'plays_offense' to my model lowered the MSE by about 0.05 (from roughly 0.852 to 0.797), and given the size of the data set, this is very good. Now, I will run it through the same season-by-season test to see how it performs against the control.
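
As a quick check on the size of that change, the snippet below compares the two MSE values printed above. It also converts each to RMSE, which is not computed in the notebook but puts the error back in PPR points per game.

```python
import math

mse_before = 0.8516498979088538  # original six-feature model
mse_after = 0.7969601142958663   # model with 'points' and 'plays_offense' added

drop = mse_before - mse_after
print(f"Absolute MSE drop: {drop:.4f}")
print(f"Relative MSE drop: {drop / mse_before:.1%}")

# RMSE is the square root of MSE, in the same units as ppr_ppg
rmse_before = math.sqrt(mse_before)
rmse_after = math.sqrt(mse_after)
print(f"RMSE before: {rmse_before:.3f}, after: {rmse_after:.3f}")
```

In RMSE terms, the typical miss shrinks from about 0.92 to about 0.89 points per game.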

In [11]:
test_data()

Projected vs Actual Leaders for the 2014 Season based on 2013 Stats:
    Rank Projected Leaders  Projected PPR PPG        Actual Leaders  \
0      1    Jamaal Charles          25.459590   1     Antonio Brown   
1      2    Calvin Johnson          21.795934   2      Le'Veon Bell   
2      3       Julio Jones          21.637542   3    DeMarco Murray   
3      4        Matt Forte          20.688224   4        Matt Forte   
4      5  Demaryius Thomas          20.646115   5  Demaryius Thomas   
5      6      Jimmy Graham          20.004117   6      Jordy Nelson   
6      7      LeSean McCoy          19.678399   7        Dez Bryant   
7      8        A.J. Green          19.589185   8    Marshawn Lynch   
8      9  Brandon Marshall          19.477485   9  Emmanuel Sanders   
9     10     Antonio Brown          19.392189  10       Julio Jones   
10    11        Dez Bryant          19.033968  11      Randall Cobb   
11    12   Justin Blackmon          18.878069  12     Jeremy Maclin   
12    13

This upgraded model performed worse than the previous one (an accuracy score of 87 versus 88), despite having a lower MSE.
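
A lower MSE and a worse ranking score are not contradictory, because MSE measures how close each predicted value is while the accuracy score only looks at who lands in the top of the list. The toy example below uses made-up numbers and hypothetical helper names to show one model with a much smaller MSE that still picks the wrong top 2.

```python
def mse(actual, predicted):
    """Mean squared error over players present in `actual`."""
    return sum((actual[p] - predicted[p]) ** 2 for p in actual) / len(actual)


def top_k(values, k):
    """Names of the k players with the highest values."""
    return set(sorted(values, key=values.get, reverse=True)[:k])


actual = {"A": 10.0, "B": 9.9, "C": 9.8, "D": 2.0}
close_but_misordered = {"A": 9.7, "B": 9.8, "C": 9.9, "D": 2.0}
far_but_well_ordered = {"A": 12.0, "B": 11.9, "C": 9.8, "D": 4.0}

for label, preds in [("close", close_but_misordered), ("far", far_but_well_ordered)]:
    overlap = len(top_k(actual, 2) & top_k(preds, 2))
    print(f"{label}: MSE={mse(actual, preds):.4f}, top-2 overlap={overlap}")
```

When several players are separated by fractions of a point, small changes in the fitted values can reshuffle the top of the list even as the overall error improves.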