<p style="font-size:24px"><b>Predicting Fantasy Football Stars</b></p>

I love fantasy football. In general, I love sports, but there’s something about sitting down for an entire Sunday and watching football with your league mates that can't be beat. Of all the fantasy sports, football is easily the best. Basketball and baseball have too many games in a week to keep updating your lineup, and soccer has too few concrete metrics to base fantasy points on. Football has the perfect mix of frequency and statistical measurement. Games aren’t too frequent, but not too scarce either. Points are based on yards and touchdowns, so one play can literally make or break your week. Overall, fantasy football is very exciting and fun, and I’m hoping that I can somehow predict who I should draft next year.

<i>Cleaning Data</i>

First and foremost, I need a data set to work with. Thankfully, Funk Monarch on Kaggle posted a huge data set containing every important offensive metric for every player from every week since 2012. This was a lot of data to sort through, but ultimately, through the help of SQL queries and ChatGPT, I managed to isolate what I determined to be the important metrics for flex positions (wide receivers, running backs, and tight ends) in one spreadsheet, and quarterbacks in another spreadsheet. Now, I can begin building a predictive model.

<i>Building The Model (Flex Players)</i>

To build the actual model, I transitioned from MySQL to VS Code to use Python. Python has various libraries that make building predictive models much, much easier. Also, everything in the data will be catered towards PPR scoring, so I apologize in advance to all of those in standard and half-PPR scoring leagues.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

First, I want to construct a very basic baseline model to compare our future models against. This model will rely only on the previous year's total PPR points rankings to predict the next year's leaders. Yes, this is copy and paste, but it's what the 'experts' at ESPN do anyway.

In [None]:
def control(train_year):
    """Baseline: predict next year's top 20 by copying this year's top 20."""
    flex_data_path = '/Users/itayakad/Desktop/Github Projects/FantasyFootballPredictor/Data/oy_flex.csv'
    flex_data = pd.read_csv(flex_data_path)

    # This year's rows are the "prediction"; next year's rows are the truth
    data_train = flex_data[flex_data['season'] == train_year]
    data_actual = flex_data[flex_data['season'] == (train_year + 1)]

    # Actual top 20 by total PPR points in the following season
    top_20_actual = data_actual[['name', 'fantasy_points_ppr']].sort_values(by='fantasy_points_ppr', ascending=False).head(20).reset_index(drop=True)
    top_20_actual['actual_rank'] = range(1, 21)

    # "Predicted" top 20 is simply the current season's top 20
    control_predictions = data_train[['name', 'fantasy_points_ppr']].sort_values(by='fantasy_points_ppr', ascending=False).head(20).reset_index(drop=True)
    control_predictions['predicted_rank'] = range(1, 21)

    score = accuracy_test(top_20_actual, control_predictions)
    print(f"Accuracy Score (Control): {score}")

    return score

Now that that is done, we can begin to construct our own model. Obviously, for our model, using the fantasy point totals from the year before is cheating, and we will not be doing that. Instead, we will be looking to use other metrics. From the dataset provided, I ran a correlation test to see which statistical measures would be most applicable.

In [None]:
flex_data_path = '/Users/itayakad/Desktop/Github Projects/FantasyFootballPredictor/Data/oy_flex.csv'
flex_data = pd.read_csv(flex_data_path)
numeric_columns = flex_data.select_dtypes(include=['number'])
correlation = numeric_columns.corr()
ppr_correlation = correlation['fantasy_points_ppr'].sort_values(ascending=False).head(10)
print(ppr_correlation)

Based on the test, I decided to train my model based on total yards, receptions, receiving yards after catch, total touchdowns, and targets.

In [None]:
top_features = ['total_yards','receptions','receiving_yards_after_catch','total_tds','targets']
X = numeric_columns[top_features]
y = numeric_columns['fantasy_points_ppr']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")
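
As a side note on reading the two numbers printed above: MSE is measured in squared fantasy points, so it grows with the scale of the target, while R-squared is unitless. The sketch below uses made-up point totals (not values from the data set) and an assumed 17-game season to show that rescaling the target changes MSE but not R-squared. Taking the square root of MSE (RMSE) gives an error back in plain fantasy points, which is often easier to judge.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up season totals, purely for illustration (not from the Kaggle data)
y_true = np.array([310.0, 255.0, 180.0, 120.0, 60.0])
y_pred = np.array([295.0, 270.0, 170.0, 135.0, 50.0])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)  # same units as fantasy points
r2 = r2_score(y_true, y_pred)

# Assumed season length, used only to rescale totals to per-game values
games = 17
mse_pg = mean_squared_error(y_true / games, y_pred / games)
r2_pg = r2_score(y_true / games, y_pred / games)

print(f"Season totals: MSE={mse:.2f}, RMSE={rmse:.2f}, R^2={r2:.4f}")
print(f"Per game:      MSE={mse_pg:.4f}, R^2={r2_pg:.4f}")
```

So a large MSE next to a high R-squared is not a contradiction when the target is a season-long point total.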

<i>Testing the Algorithm</i>

Now that we have our model (which has a very high MSE but an almost perfect R^2), I want to test how strong it is. To do this, I took the model's projected top 20 and compared it with the actual top 20 players in total PPR points for that year. I then gave it an accuracy score: 2 points for each player it placed at the exact rank they finished, 1 point for each player it put in the top 20 but at the wrong rank, and 0 points for a complete miss. I then compared this with the control model's accuracy score.

In [None]:
def predict_next_year(train_year):

    data_train = flex_data[flex_data['season'] == train_year]
    data_actual = flex_data[flex_data['season'] == (train_year + 1)]
    numeric_columns_train = data_train.select_dtypes(include=['number'])

    X_train = numeric_columns_train[top_features]
    y_train = numeric_columns_train['fantasy_points_ppr']

    model = LinearRegression()
    model.fit(X_train, y_train)

    data_predict = data_train.copy()
    data_predict['predicted_fantasy_points_ppr'] = model.predict(X_train)

    top_20_predicted = []
    top_20_actual = []
    for _, row in data_predict[['name', 'predicted_fantasy_points_ppr']].sort_values(by='predicted_fantasy_points_ppr', ascending=False).iterrows():
        if row['name'] not in [player['name'] for player in top_20_predicted]:
            top_20_predicted.append({'name': row['name'], 'predicted_fantasy_points_ppr': row['predicted_fantasy_points_ppr']})
        if len(top_20_predicted) == 20:
            break

    if not data_actual.empty and 'name' in data_actual.columns:
        for _, row in data_actual[['name', 'fantasy_points_ppr']].sort_values(by='fantasy_points_ppr', ascending=False).iterrows():
            if row['name'] not in [player['name'] for player in top_20_actual]:
                top_20_actual.append({'name': row['name'], 'fantasy_points_ppr': row['fantasy_points_ppr']})
            if len(top_20_actual) == 20:
                break
    else:
        for i in range(20):
            top_20_actual.append({'name': 'NA', 'fantasy_points_ppr': 'NA'})

    top_20_predicted_df = pd.DataFrame(top_20_predicted)
    top_20_predicted_df['predicted_rank'] = range(1, 21)

    top_20_actual_df = pd.DataFrame(top_20_actual)
    top_20_actual_df['actual_rank'] = range(1, 21)

    if 'name' in top_20_predicted_df.columns and 'name' in top_20_actual_df.columns:
        if 'NA' not in top_20_actual_df['name'].values:
            score = accuracy_test(top_20_actual_df, top_20_predicted_df)
        else:
            score = 'NA'
    else:
        score = 'NA'

    combined_results = pd.DataFrame({
        'Rank': range(1, 21),
        'Projected Leaders': top_20_predicted_df['name'],
        'Projected PPR PPG': top_20_predicted_df['predicted_fantasy_points_ppr'],
        '': range(1, 21),
        'Actual Leaders': top_20_actual_df['name'],
        'Actual PPR Total': top_20_actual_df['fantasy_points_ppr']
    })

    print(f"Projected vs Actual Leaders for the {train_year + 1} Season based on {train_year} Stats:")
    print(combined_results)
    print(f"Accuracy Score: {score}")
    return score

In [None]:
def accuracy_test(actual, predicted):
    """Score a predicted top 20 against the actual top 20 (maximum of 40 points)."""
    score = 0
    # Map each player name to their rank in each list
    predicted_ranks = {row['name']: row['predicted_rank'] for index, row in predicted.iterrows()}
    actual_ranks = {row['name']: row['actual_rank'] for index, row in actual.iterrows()}
    for player in actual_ranks:
        if player in predicted_ranks:
            if predicted_ranks[player] == actual_ranks[player]:
                score += 2  # Exact rank match
            else:
                score += 1  # Player is in the top 20 but not the exact rank

    return score
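
To make the scoring rule concrete, here is a small sketch on toy data. It is not the function used above but an equivalent way to get the same score with a pandas merge; `accuracy_by_merge` and the player names are hypothetical, and it assumes each name appears only once per list.

```python
import pandas as pd

def accuracy_by_merge(actual, predicted):
    """2 points per exact rank match, 1 point per name found in both lists at different ranks."""
    # Inner join keeps only players who appear in both lists
    both = actual.merge(predicted, on="name", how="inner")
    exact = (both["actual_rank"] == both["predicted_rank"]).sum()
    # Every shared player earns 1 point; exact matches earn 1 extra
    return int(len(both) + exact)

# Toy three-player "leaderboards" with placeholder names
actual = pd.DataFrame({"name": ["Player A", "Player B", "Player C"],
                       "actual_rank": [1, 2, 3]})
predicted = pd.DataFrame({"name": ["Player A", "Player C", "Player D"],
                          "predicted_rank": [1, 2, 3]})

toy_score = accuracy_by_merge(actual, predicted)
print(toy_score)
```

Here Player A is an exact match (2 points), Player C is in both lists at different ranks (1 point), and Player B is missed entirely (0 points).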

In [None]:
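def test_data():
    total_score = 0
    total_score_c = 0
    # Train on each season from 2013 through 2022 and score against the following season
    for train_year in range(2013, 2023):
        score = predict_next_year(train_year)
        # predict_next_year returns 'NA' when the following season has no data,
        # so skip that year instead of adding a string to the running total
        if score == 'NA':
            continue
        total_score += score
        total_score_c += control(train_year)
    print(f"Total Score: {total_score}")
    print(f"Total Score (Control): {total_score_c}")
test_data()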

<i>Improving the Model</i>

Based on the results, my model had a combined accuracy score of 80 vs the control's score of 77. While this is only about a 4% improvement, it shows that my model does better than simply copying last year's rankings. To try and improve this model, I found a data set on Kaggle from Nick Cantalupa that has every important team metric from the last 20 years. After cleaning and fixing some inconsistent variables, I joined my two data sets together to get one big flex player data set. I then ran a correlation test on that to find out which team stats are most applicable.
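
For anyone who would rather do the join step in pandas than in SQL, a minimal sketch is below. The `team` key column and all of the values are assumptions for illustration; only `season`, `points`, `plays_offense`, and `fantasy_points_ppr` are column names taken from this project.

```python
import pandas as pd

# Toy player rows (placeholder names and numbers)
players = pd.DataFrame({
    "name": ["Player A", "Player B", "Player C"],
    "team": ["AAA", "BBB", "AAA"],  # hypothetical join key
    "season": [2021, 2021, 2022],
    "fantasy_points_ppr": [250.0, 180.0, 210.0],
})

# Toy team rows: one row per team per season
teams = pd.DataFrame({
    "team": ["AAA", "BBB", "AAA"],
    "season": [2021, 2021, 2022],
    "points": [450, 320, 410],
    "plays_offense": [1050, 980, 1020],
})

# Left join keeps every player row; validate raises an error if a
# team-season pair shows up more than once on the team side
merged = players.merge(teams, on=["team", "season"], how="left",
                       validate="many_to_one")
print(merged)
```

Checking `merged['points'].isna().sum()` afterwards is a quick way to spot team names or abbreviations that did not line up between the two sources.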

In [None]:
flex_data_path = '/Users/itayakad/Desktop/Github Projects/FantasyFootballPredictor/Data/flex_onlyteamstats.csv'
flex_data = pd.read_csv(flex_data_path)
numeric_columns = flex_data.select_dtypes(include=['number'])
correlation = numeric_columns.corr()
fantasy_points_ppr_correlation = correlation['fantasy_points_ppr'].sort_values(ascending=False).head(10)
print(fantasy_points_ppr_correlation)

While the correlation is minimal (topping out at only 0.13), I still want to include this data in my model because, from a logical standpoint, it makes sense to factor in how the player's team is doing to determine how well they will do.

Since the correlation is so small, I didn't stick entirely to the top correlated attributes and experimented with others. Ultimately, I decided to add how many points that player's team scored that year ('points'), how many offensive plays they had ('plays_offense'), and how many games the player played (this wasn't part of the new team data, but rather was something that came to mind while I was working on this). I decided on these because I believe that a fantasy football player is only as good as his team. If they aren't putting up yards on offense or getting near the end zone, then how can I expect the player to produce?

In [None]:
flex_data_path = '/Users/itayakad/Desktop/Github Projects/FantasyFootballPredictor/Data/flex_team.csv'
flex_data = pd.read_csv(flex_data_path)
numeric_columns = flex_data.select_dtypes(include=['number'])

top_features = ['ypg','total_yards','total_tds','receiving_yards_after_catch','receptions','games','points','plays_offense']
X = numeric_columns[top_features]
y = numeric_columns['fantasy_points_ppr']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")

Adding these metrics to my model actually lowered the MSE by 2.5, and given the size of the data set, this is very, very good. Now, I will check it against the existing data to see how it performs against the control.

In [None]:
test_data()

This upgraded model performed even better than the previous one (Accuracy score of 83, 3 more than the previous 80). So far, our model has continued to grow and improve compared to the base copy-and-paste ESPN model, which is great. However, one of the biggest things holding our model back is