<p style="font-size:24px"><b>Predicting Fantasy Football Stars</b></p>

I love fantasy football. In general, I love sports, but there’s something about sitting down for an entire Sunday and watching football with your league mates that cannot be beat. Compared to other sports, football is easily the best for fantasy. Basketball and baseball have so many games in a week that you are constantly updating and changing your lineup, and soccer has too few measurable stats to base fantasy points on. Football, however, has the perfect mix of frequency and statistical measurement. Games are neither too frequent nor too scarce. Points are judged by yards and touchdowns, so one play can literally make or break your week. Overall, fantasy football is very exciting and fun, and I’m hoping that I can somehow predict who I should draft next year.

<i>Cleaning Data</i>

First and foremost, I need a data set to work with. Thankfully, Funk Monarch on Kaggle posted a huge data set containing every important offensive metric for every player from every week since 2012. This was a lot of data to sort through, but ultimately, through the help of SQL queries and ChatGPT, I managed to isolate what I determined to be the important metrics for flex positions (wide receivers, running backs, and tight ends) in one spreadsheet, and quarterbacks in another spreadsheet. Now, I can begin building a predictive model.

<i>Building The Model (Flex Players)</i>

To build the actual model, I transitioned from MySQL to VS Code to use Python. Python has various libraries that make building predictive models much, much easier. To start, I ran a correlation test to see which of the numeric columns in my dataset had the highest correlation to ppr_ppg (PPR Points Per Game). Everything in the data will be catered towards PPR scoring, so I apologize in advance to all of those in standard and half-PPR scoring leagues.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
flex_data_path = '/Users/itayakad/Desktop/Github Projects/FantasyFootballPredictor/Data/oy_flex.csv'
flex_data = pd.read_csv(flex_data_path)
numeric_columns = flex_data.select_dtypes(include=['number'])
correlation = numeric_columns.corr()
ppr_ppg_correlation = correlation['ppr_ppg'].sort_values(ascending=False).head(10)
print(ppr_ppg_correlation)

Based on the test, I decided to train my model based on the values YPG (yards per game), total yards, total touchdowns, receiving yards after catch, receptions, and games played.

In [None]:
top_features = ['ypg','total_yards','total_tds','receiving_yards_after_catch','receptions','games']
X = numeric_columns[top_features]
y = numeric_columns['ppr_ppg']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")

<i>Testing the Algorithm</i>

Now that we have our model (which has pretty good MSE and R^2 values), I want to test how strong it is. To do this, I used the model to predict who the top 20 players in PPR PPG will be each year and compared that list with the actual top 20 players in total PPR points for that year. I then gave it an accuracy score, which awards the model 2 points if it places a player in the exact position he finished, 1 point if it predicts a player to be in the top 20 but in the wrong position, and 0 if it misses completely. I then compared this with a control model that predicts the top 20 based on who the top 20 in PPR PPG were the year before. Essentially, this allows me to see how much more accurate my model is than blindly copying and pasting the top 20 players in PPR PPG from the year prior.
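
To make the scoring rule concrete, here is a minimal sketch on a made-up top 4 instead of a top 20. The helper name `score_rankings` and the player labels are placeholders for illustration only; the notebook's own scoring function works on DataFrames rather than plain lists.

```python
def score_rankings(actual, predicted):
    """Score two ordered lists of names.

    2 points for a player predicted at the exact rank he finished,
    1 point for a player who appears in both lists at different ranks,
    0 points for a player the prediction missed entirely.
    """
    predicted_rank = {name: rank for rank, name in enumerate(predicted, start=1)}
    score = 0
    for rank, name in enumerate(actual, start=1):
        if name in predicted_rank:
            score += 2 if predicted_rank[name] == rank else 1
    return score


# Made-up labels, not real players
actual = ["A", "B", "C", "D"]
predicted = ["A", "C", "E", "B"]

# A is an exact hit (2), B and C are in the list at the wrong rank (1 each), D is missed (0)
print(score_rankings(actual, predicted))
```

With a top 20, a perfect prediction would score 40 for a single season, so the per-season scores are sums of these small 0/1/2 contributions.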

In [None]:
def predict_next_year(train_year):
    data_train = flex_data[flex_data['season'] == train_year]
    data_actual = flex_data[flex_data['season'] == (train_year + 1)]
    numeric_columns_train = data_train.select_dtypes(include=['number'])

    X_train = numeric_columns_train[top_features]
    y_train = numeric_columns_train['ppr_ppg']

    model = LinearRegression()
    model.fit(X_train, y_train)

    data_predict = data_train.copy()
    data_predict['predicted_ppr_ppg'] = model.predict(X_train)

    top_20_predicted = data_predict[['name', 'predicted_ppr_ppg']].sort_values(by='predicted_ppr_ppg', ascending=False).head(20).reset_index(drop=True)
    top_20_predicted['predicted_rank'] = range(1, 21)

    top_20_actual = data_actual[['name', 'fantasy_points_ppr']].sort_values(by='fantasy_points_ppr', ascending=False).head(20).reset_index(drop=True)
    top_20_actual['actual_rank'] = range(1, 21)

    score = accuracy_test(top_20_actual, top_20_predicted)

    combined_results = pd.DataFrame({
        'Rank': range(1, 21),
        'Projected Leaders': top_20_predicted['name'],
        'Projected PPR PPG': top_20_predicted['predicted_ppr_ppg'],
        '': range(1,21),
        'Actual Leaders': top_20_actual['name'],
        'Actual PPR Total': top_20_actual['fantasy_points_ppr']
    })
    
    print(f"Projected vs Actual Leaders for the {train_year + 1} Season based on {train_year} Stats:")
    print(combined_results)
    print(f"Accuracy Score: {score}")

In [None]:
def control(train_year):
    data_train = flex_data[flex_data['season'] == train_year]
    data_actual = flex_data[flex_data['season'] == (train_year + 1)]
    
    top_20_actual = data_actual[['name', 'fantasy_points_ppr']].sort_values(by='fantasy_points_ppr', ascending=False).head(20).reset_index(drop=True)
    top_20_actual['actual_rank'] = range(1, 21)

    control_predictions = data_train[['name', 'ppr_ppg']].sort_values(by='ppr_ppg', ascending=False).head(20).reset_index(drop=True)
    control_predictions['predicted_rank'] = range(1, 21)
    
    score = accuracy_test(top_20_actual, control_predictions)
    print(f"Accuracy Score (Control): {score}")

In [None]:
def accuracy_test(actual, predicted):
    score = 0
    predicted_ranks = {row['name']: row['predicted_rank'] for index, row in predicted.iterrows()}
    actual_ranks = {row['name']: row['actual_rank'] for index, row in actual.iterrows()}
    for player in actual_ranks:
        if player in predicted_ranks:
            if predicted_ranks[player] == actual_ranks[player]:
                score += 2  # Exact rank match
            else:
                score += 1  # Player is in the top 20 but not the exact rank

    return score

In [None]:
def test_data():
    for i in range(3,14):
        predict_next_year(2010+i)
        control(2010+i)

test_data()

<i>Improving the Model</i>

Based on the results, my model had a combined accuracy score of 88 vs the control's score of 84. While this is only an improvement of about 4.8%, it shows that using my model gives better results than simply basing the picks off of last year's rankings. To try and improve this model, I found a data set on Kaggle from Nick Cantalupa that has every important team metric from the last 20 years. After cleaning and fixing some inconsistent variables, I joined my two data sets together to get one big flex player data set.
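
As a rough sketch of what that join can look like in pandas, the example below merges a toy player table with a toy team table. The join keys `team` and `season`, the team codes, and all of the numbers are assumptions made up for illustration; the real files may use different key columns.

```python
import pandas as pd

# Toy stand-ins for the player and team tables (all values are made up)
players = pd.DataFrame({
    "name": ["Player A", "Player B", "Player C"],
    "team": ["AAA", "BBB", "CCC"],
    "season": [2022, 2022, 2022],
    "ppr_ppg": [18.4, 15.1, 9.7],
})
teams = pd.DataFrame({
    "team": ["AAA", "BBB"],
    "season": [2022, 2022],
    "points": [496, 450],
    "plays_offense": [1094, 1040],
})

# A left join keeps every player row; indicator=True adds a '_merge' column
# that shows which rows found a matching team row
merged = players.merge(teams, on=["team", "season"], how="left", indicator=True)

# Rows that did not match usually point at inconsistent team codes between the two sources
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["name", "team", "season"]])
```

Checking the unmatched rows is a cheap way to catch inconsistent team abbreviations before they silently turn into missing values in the model's features.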

I decided to add how many points the player's team scored that season ('points') and how many offensive plays that team had ('plays_offense') as 2 new predictors because an NFL player cannot be good in fantasy if his team is not putting up points. That's not to say that a bad team cannot have a great fantasy player, though. A team can have a top 5 fantasy player and put up 20-30 points each week, but if it is allowing 30-40 points each week, then it is still a bad football team. Fortunately, I do not care about whether the team is good or bad; I just care if they put up points.

In [None]:
flex_data_path = '/Users/itayakad/Desktop/Github Projects/FantasyFootballPredictor/Data/flex_team.csv'
flex_data = pd.read_csv(flex_data_path)
numeric_columns = flex_data.select_dtypes(include=['number'])

top_features = ['ypg','total_yards','total_tds','receiving_yards_after_catch','receptions','games','points','plays_offense']
X = numeric_columns[top_features]
y = numeric_columns['ppr_ppg']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")

Adding the team features ('points' and 'plays_offense') to my model actually lowered the MSE by 0.06, which I consider a good result given the size of the data set. Now, I will check it against the existing data to see how it performs against the control.
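
A single train/test split can make a small MSE difference look more certain than it is. One way to sanity-check it, sketched below on synthetic data (the arrays, coefficients, and the `cv_mse` helper are all invented for illustration and are not the notebook's data), is to compare the two feature sets across several folds with scikit-learn's `cross_val_score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data: three baseline features plus one extra feature that carries real signal
rng = np.random.default_rng(0)
n = 500
X_base = rng.normal(size=(n, 3))
extra = rng.normal(size=(n, 1))
y = X_base @ np.array([2.0, 1.0, 0.5]) + 0.8 * extra[:, 0] + rng.normal(scale=1.0, size=n)
X_full = np.hstack([X_base, extra])

# A fixed random_state means both feature sets are scored on the same folds
cv = KFold(n_splits=5, shuffle=True, random_state=42)


def cv_mse(X, y):
    """Return the per-fold mean squared error of a linear regression."""
    scores = cross_val_score(LinearRegression(), X, y, cv=cv,
                             scoring="neg_mean_squared_error")
    return -scores  # scikit-learn reports negated MSE, so flip the sign


mse_base = cv_mse(X_base, y)
mse_full = cv_mse(X_full, y)
print(f"Baseline CV MSE:      {mse_base.mean():.3f} +/- {mse_base.std():.3f}")
print(f"With extra feature:   {mse_full.mean():.3f} +/- {mse_full.std():.3f}")
```

If the gap between the two means is small compared to the fold-to-fold spread, the improvement may just be noise from the particular split.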

In [None]:
test_data()

This upgraded model performed worse than the previous one (accuracy score of 87), despite having a lower MSE. What this is telling me is that we are using the right statistical measures to make our predictions, but are getting "unlucky", and the source of our unluckiness is season-altering injuries. For example, the model predicted Nick Chubb to be a top player in 2023, and he looked like he was going to be after Weeks 1 and 2. Then, boom. He destroyed his knee in Week 3 and was out for the rest of the year.

To confirm my suspicions, I adjusted the predict_next_year() function to compare against PPR PPG leaders instead of Total PPR Points leaders. If the accuracy score is higher, then this tells me that the model is accurately predicting the best players in PPR formats, without accounting for the chance that they miss an extended period of time due to injuries.

In [None]:
def predict_next_year(train_year):
    data_train = flex_data[flex_data['season'] == train_year]
    data_actual = flex_data[flex_data['season'] == (train_year + 1)]
    numeric_columns_train = data_train.select_dtypes(include=['number'])

    X_train = numeric_columns_train[top_features]
    y_train = numeric_columns_train['ppr_ppg']

    model = LinearRegression()
    model.fit(X_train, y_train)

    data_predict = data_train.copy()
    data_predict['predicted_ppr_ppg'] = model.predict(X_train)

    top_20_predicted = data_predict[['name', 'predicted_ppr_ppg']].sort_values(by='predicted_ppr_ppg', ascending=False).head(20).reset_index(drop=True)
    top_20_predicted['predicted_rank'] = range(1, 21)

    top_20_actual = data_actual[['name', 'ppr_ppg']].sort_values(by='ppr_ppg', ascending=False).head(20).reset_index(drop=True)
    top_20_actual['actual_rank'] = range(1, 21)

    score = accuracy_test(top_20_actual, top_20_predicted)

    combined_results = pd.DataFrame({
        'Rank': range(1, 21),
        'Projected Leaders': top_20_predicted['name'],
        'Projected PPR PPG': top_20_predicted['predicted_ppr_ppg'],
        '': range(1,21),
        'Actual Leaders': top_20_actual['name'],
        'Actual PPR PPG': top_20_actual['ppr_ppg']
    })
    
    print(f"Projected vs Actual Leaders for the {train_year + 1} Season based on {train_year} Stats:")
    print(combined_results)
    print(f"Accuracy Score: {score}")

test_data()

When comparing the model to PPR PPG instead of Total PPR Points, the model gains an additional 3 points of accuracy (90 total). This confirms my suspicions that injuries are indeed impacting the accuracy of this model. But how do we account for injuries, or more specifically, the <i>potential</i> for injuries?

First, I want to see if there are any statistical measures that may correlate with a player missing significant time.

In [None]:
# Load the flex data
flex_data = pd.read_csv('/Users/itayakad/Desktop/Github Projects/FantasyFootballPredictor/Data/flex_team.csv')

# Create a target variable for playing very few games (e.g., less than 10 games)
flex_data['games_next_season'] = flex_data.groupby('name')['games'].shift(-1)
flex_data['few_games_next_season'] = flex_data['games_next_season'].apply(lambda x: 1 if x < 10 else 0)

# Drop rows where next season data is not available
flex_data = flex_data.dropna(subset=['games_next_season'])

# Select only numeric columns for features
numeric_columns = flex_data.select_dtypes(include=['number']).drop(columns=['games_next_season'])

# Calculate correlation
correlation = numeric_columns.corrwith(flex_data['few_games_next_season'])
print(correlation.sort_values(ascending=False).head(15))
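
The "games next season" target depends on how `groupby(...).shift(-1)` lines rows up, so here is a small sketch on made-up rows. It assumes one row per player per season and sorts by `name` and `season` first, since `shift(-1)` simply takes the next row within each player's group. Unlike the cell above, it also keeps the label as NaN when there is no following season, which is a choice made for this illustration only.

```python
import numpy as np
import pandas as pd

# Made-up rows, deliberately out of order
seasons = pd.DataFrame({
    "name":   ["Player A", "Player B", "Player A", "Player B", "Player A"],
    "season": [2022, 2021, 2021, 2022, 2023],
    "games":  [16, 17, 15, 4, 17],
})

# Sort so that the "next row" within each player is the following season
seasons = seasons.sort_values(["name", "season"]).reset_index(drop=True)
seasons["games_next_season"] = seasons.groupby("name")["games"].shift(-1)

# 1.0 if the player appeared in fewer than 10 games the next season, 0.0 otherwise,
# and NaN when no following season exists for that player
seasons["few_games_next_season"] = np.where(
    seasons["games_next_season"].isna(),
    np.nan,
    (seasons["games_next_season"] < 10).astype(float),
)
print(seasons)
```

Here Player B's 2021 row is flagged because he played only 4 games in 2022, while the last row for each player has no label at all.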


This correlation test tells me that, out of the statistical measures available, poor play like turning the ball over, throwing interceptions, and overall losing tends to result in a reduced number of games played the following season. Using this information, I will create a model that attempts to rank how likely a player is to be injured.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Load the flex data
flex_data = pd.read_csv('/Users/itayakad/Desktop/Github Projects/FantasyFootballPredictor/Data/flex_team.csv')

# Create a target variable for playing very few games (e.g., less than 10 games)
flex_data['games_next_season'] = flex_data.groupby('name')['games'].shift(-1)
# Note: rows with no following season have NaN here, and NaN < 10 is False, so they are labeled 0
flex_data['few_games_next_season'] = flex_data['games_next_season'].apply(lambda x: 1 if x < 10 else 0)

# Use the most correlated features from the correlation test above
top_features = ['losses', 'pass_int', 'turnovers', 'turnover_pct', 'fumbles_lost', 'years_played']
X = flex_data[top_features]
y = flex_data['few_games_next_season']

# Handle missing data: Fill NaN values with the average of the column
X = X.fillna(X.mean())
y = y.fillna(0)  # Filling target variable NaNs with 0, indicating not missing games

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a logistic regression model (as we are predicting a binary outcome)
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict probabilities
flex_data['injury_prob'] = model.predict_proba(X)[:, 1] * 100

# Handle missing data: Assign the average probability to players without prior year data
average_prob = flex_data['injury_prob'].mean()
flex_data['injury_prob'] = flex_data['injury_prob'].fillna(average_prob)

# Create the 'safe_prob' column as the inverse of 'injury_prob'
flex_data['safe_prob'] = 100 - flex_data['injury_prob']

# Print the updated dataframe with the new columns for the 2023 season
flex_data_2023 = flex_data[flex_data['season'] == 2023]
sorted_flex_data = flex_data_2023.sort_values(by='injury_prob', ascending=False)
print(sorted_flex_data[['name', 'season', 'injury_prob', 'safe_prob']].head())

# Save the updated dataframe to a new CSV file
flex_data.to_csv('/Users/itayakad/Desktop/Github Projects/FantasyFootballPredictor/Data/flex_team_injury.csv', index=False)
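
The cell above fits the classifier on a training split but never scores it on the held-out rows. A minimal sketch of such a check is below. It uses synthetic data from `make_classification` rather than the real player table, and ROC AUC is just one reasonable metric, chosen here as an assumption because "few games" seasons are probably the minority class.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 6 features, roughly 20% positive ("few games") labels
X, y = make_classification(n_samples=1000, n_features=6, n_informative=3,
                           n_clusters_per_class=1, weights=[0.8, 0.2],
                           random_state=42)

# stratify keeps the share of positives similar in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Probability of the positive class for rows the model has not seen
test_prob = clf.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, test_prob)

print(f"Share of positives in training data: {y_train.mean():.2f}")
print(f"Held-out ROC AUC: {auc:.3f}")
```

An AUC near 0.5 would mean the probabilities rank players no better than chance, which is worth knowing before feeding them into another model as a feature.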


Now that we have the 'safe_prob' column (the inverse of 'injury_prob'), we can add it to the model and see whether accounting for who is likely to miss time improves the predictions.

In [None]:
flex_data_path = '/Users/itayakad/Desktop/Github Projects/FantasyFootballPredictor/Data/flex_team_injury.csv'
flex_data = pd.read_csv(flex_data_path)
numeric_columns = flex_data.select_dtypes(include=['number'])

top_features = ['ypg','total_yards','total_tds','receiving_yards_after_catch','receptions','games','points','plays_offense','safe_prob']
X = numeric_columns[top_features]
y = numeric_columns['ppr_ppg']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")

In [None]:
test_data()

This new model, which includes injury probability, improves the accuracy score by 1 point, moving it to a total of 91. While this is a very minor improvement, it still shows how the model continues to improve with every new addition.