<p style="font-size:24px"><b>Predicting Fantasy Football Stars (Pt. 2)</b></p>

This is a follow-up to the flex player prediction. Now I will attempt to do the same for quarterbacks.

<i>Building The Model (Quarterbacks)</i>

To build the actual model, I will once again use various Python libraries.

In [44]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In [45]:
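# Note: the flex_data names are carried over from the flex-player notebook; this file holds quarterback data.
flex_data_path = '/Users/itayakad/Desktop/Github Projects/FantasyFootballPredictor/qbplayer_team.csv'
flex_data = pd.read_csv(flex_data_path)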
numeric_columns = flex_data.select_dtypes(include=['number'])
correlation = numeric_columns.corr()
ppr_ppg_correlation = correlation['ppr_ppg'].sort_values(ascending=False).head(10)
print(ppr_ppg_correlation)

ppr_ppg                1.000000
ypg                    0.941769
pass_ypg               0.916007
fantasy_points_ppr     0.868723
offense_pct            0.858218
total_tds              0.856935
total_yards            0.849204
passing_tds            0.838506
passing_yards          0.836214
passing_first_downs    0.832679
Name: ppr_ppg, dtype: float64


Based on these correlations, I decided to train my model on yards per game (ypg), passing yards per game (pass_ypg), offense percentage (offense_pct), total touchdowns (total_tds), and total yards (total_yards), along with two features that are not in the top-10 list above: how many points the QB's team scored that year (points) and years played in the NFL (years_played).

In [46]:
top_features = ['ypg','pass_ypg','offense_pct','total_tds','total_yards','points','years_played']
X = numeric_columns[top_features]
y = numeric_columns['ppr_ppg']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")

Mean Squared Error: 1.7968326908304206
R-squared: 0.9656608646770318


<i>Testing the Algorithm</i>

Now that we have our model, I want to test how strong it is. To do this, I used the model to predict the top 20 players in PPR PPG for each year and compared that list with the actual top 20 players in total PPR points for that year. I then gave it an accuracy score, which awards the model 2 points if it places a player at the exact position where they finished, 1 point if it predicts a player to be in the top 20 but at the wrong position, and 0 if it misses the player completely. I then compared this with a control model that predicts the top 20 based on who the top 20 in PPR PPG were the year before. Essentially, this lets me see how much more accurate my model is than blindly copying and pasting the top 20 players in PPR PPG from the prior year.

I would also like to note the high MSE of this model. Despite using the same method to choose its parameters, this model has a much higher MSE than the flex player model. This could be due to various things, but the answer that makes the most sense to me is that quarterback scoring is much more diverse than that of a flex player, especially in the modern NFL, where having a mobile QB (i.e., one who picks up a lot of their points by running) is essential. This can skew the data, as half of the top QBs in the league get their points by throwing for 4,000-yard seasons, whereas the others run in 10+ touchdowns. It seems difficult for the model to cover both styles without expanding its range even further, which would result in an even higher MSE.

In [47]:
def predict_next_year(train_year):
    data_train = flex_data[flex_data['season'] == train_year]
    data_actual = flex_data[flex_data['season'] == (train_year + 1)]
    numeric_columns_train = data_train.select_dtypes(include=['number'])

    X_train = numeric_columns_train[top_features]
    y_train = numeric_columns_train['ppr_ppg']

    model = LinearRegression()
    model.fit(X_train, y_train)

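    # Note: the model is fit and then asked to predict on the same season's rows,
    # so these in-sample predictions serve as the projection for the following season.
    data_predict = data_train.copy()
    data_predict['predicted_ppr_ppg'] = model.predict(X_train)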

    top_20_predicted = data_predict[['name', 'predicted_ppr_ppg']].sort_values(by='predicted_ppr_ppg', ascending=False).head(20).reset_index(drop=True)
    top_20_predicted['predicted_rank'] = range(1, 21)

    top_20_actual = data_actual[['name', 'fantasy_points_ppr']].sort_values(by='fantasy_points_ppr', ascending=False).head(20).reset_index(drop=True)
    top_20_actual['actual_rank'] = range(1, 21)

    score = accuracy_test(top_20_actual, top_20_predicted)

    combined_results = pd.DataFrame({
        'Rank': range(1, 21),
        'Projected Leaders': top_20_predicted['name'],
        'Projected PPR PPG': top_20_predicted['predicted_ppr_ppg'],
        '': range(1,21),
        'Actual Leaders': top_20_actual['name'],
        'Actual PPR Total': top_20_actual['fantasy_points_ppr']
    })
    
    print(f"Projected vs Actual Leaders for the {train_year + 1} Season based on {train_year} Stats:")
    print(combined_results)
    print(f"Accuracy Score: {score}")

In [48]:
def control(train_year):
    data_train = flex_data[flex_data['season'] == train_year]
    data_actual = flex_data[flex_data['season'] == (train_year + 1)]
    
    top_20_actual = data_actual[['name', 'fantasy_points_ppr']].sort_values(by='fantasy_points_ppr', ascending=False).head(20).reset_index(drop=True)
    top_20_actual['actual_rank'] = range(1, 21)

    control_predictions = data_train[['name', 'ppr_ppg']].sort_values(by='ppr_ppg', ascending=False).head(20).reset_index(drop=True)
    control_predictions['predicted_rank'] = range(1, 21)
    
    score = accuracy_test(top_20_actual, control_predictions)
    print(f"Accuracy Score (Control): {score}")

In [49]:
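def accuracy_test(actual, predicted):
    """Score a predicted top-20 list against the actual top-20 list.

    2 points for a player at the exact rank, 1 point for a player who is
    in both lists at different ranks, 0 for a player who was missed.
    """
    score = 0
    # Map each player's name to their rank in the respective list
    predicted_ranks = {row['name']: row['predicted_rank'] for _, row in predicted.iterrows()}
    actual_ranks = {row['name']: row['actual_rank'] for _, row in actual.iterrows()}
    for player in actual_ranks:
        if player in predicted_ranks:
            if predicted_ranks[player] == actual_ranks[player]:
                score += 2  # Exact rank match
            else:
                score += 1  # Player is in the top 20 but not the exact rank

    return score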
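To make the scoring rule concrete, here is a small worked example on a top 3 instead of a top 20. It is a minimal sketch: `rank_overlap_score` is a hypothetical plain-Python stand-in for `accuracy_test` that takes ordered lists of names rather than DataFrames, and the player names are placeholders, not real data.

```python
def rank_overlap_score(actual_names, predicted_names):
    """Hypothetical list-based version of the 2/1/0 scoring rule."""
    # Rank of each predicted player, starting at 1
    predicted_rank = {name: i for i, name in enumerate(predicted_names, start=1)}
    score = 0
    for actual_rank, name in enumerate(actual_names, start=1):
        if name in predicted_rank:
            # 2 points for the exact rank, 1 point for being in the list at another rank
            score += 2 if predicted_rank[name] == actual_rank else 1
    return score

# Placeholder names, ordered from rank 1 downward
actual = ['QB A', 'QB B', 'QB C']
predicted = ['QB A', 'QB C', 'QB D']

toy_score = rank_overlap_score(actual, predicted)
print(toy_score)  # 3
```

Here QB A earns 2 points (exact rank), QB C earns 1 point (in the list, wrong rank), and QB B earns 0 (missed), for a total of 3 out of a possible 6.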

In [50]:
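def test_data():
    # Train on each season from 2013 through 2023 and score against the following season
    for train_year in range(2013, 2024):
        predict_next_year(train_year)
        control(train_year)

test_data()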

Projected vs Actual Leaders for the 2014 Season based on 2013 Stats:
    Rank   Projected Leaders  Projected PPR PPG          Actual Leaders  \
0      1      Peyton Manning          27.564020   1       Aaron Rodgers   
1      2          Drew Brees          22.346841   2         Andrew Luck   
2      3          Cam Newton          19.340338   3      Russell Wilson   
3      4         Andy Dalton          19.035668   4      Peyton Manning   
4      5      Russell Wilson          18.136966   5  Ben Roethlisberger   
5      6          Nick Foles          17.904410   6          Drew Brees   
6      7          Alex Smith          17.651316   7           Matt Ryan   
7      8    Matthew Stafford          17.425610   8      Ryan Tannehill   
8      9         Andrew Luck          17.335047   9           Tom Brady   
9     10    Colin Kaepernick          17.072619  10         Eli Manning   
10    11           Tony Romo          16.986211  11           Tony Romo   
11    12       Aaron Rodgers   

The model had an accuracy score of 153 vs. the control model's score of 149. For reference, the flex player model also outperformed its control by about 3 to 4 points, but the difference here is that those models had total scores in the mid-to-high 80s, while these have scores in the low 150s. This tells us a few things about QBs in the NFL, specifically how consistently, and for how long, the best of the best stay at the top compared to flex players.
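For a sense of scale, the totals can be compared with the best possible score. The sketch below assumes that 153 and 149 are sums over all the seasons that `test_data()` loops through (training years 2013 through 2023); the text does not say so explicitly, but a single season is capped at 40 points.

```python
# Assumption: the reported totals are summed over every season in test_data()
seasons = len(range(2013, 2024))   # 11 training seasons
max_per_season = 20 * 2            # 20 players, 2 points each for an exact rank
max_total = seasons * max_per_season

model_total = 153    # reported above
control_total = 149  # reported above

model_share = model_total / max_total
control_share = control_total / max_total

print(f"Maximum possible score: {max_total}")
print(f"Model:   {model_share:.1%} of the maximum")
print(f"Control: {control_share:.1%} of the maximum")
```

Under that assumption, both the model and the control earn a little over a third of the available points, and the gap between them is about one percentage point.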