<p style="font-size:24px"><b>Predicting Fantasy Football Stars (Pt. 2)</b></p>

This is a follow-up to the flex player prediction. I will now attempt the same for quarterbacks.

<i>Building The Model (Quarterbacks)</i>

To build the actual model, I will once again use various Python libraries.

In [15]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In [16]:
flex_data_path = '.../Data/Created Data/qb_team.csv'
flex_data = pd.read_csv(flex_data_path)
numeric_columns = flex_data.select_dtypes(include=['number'])
correlation = numeric_columns.corr()
fantasy_points_ppr_correlation = correlation['fantasy_points_ppr'].sort_values(ascending=False).head(10)
print(fantasy_points_ppr_correlation)

fantasy_points_ppr           1.000000
total_tds                    0.985159
total_yards                  0.981375
passing_yards                0.968168
touches                      0.966384
passing_first_downs          0.965919
passing_tds                  0.965791
completions                  0.955741
offense_snaps                0.953146
passing_yards_after_catch    0.952845
Name: fantasy_points_ppr, dtype: float64


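Based on this correlation ranking, I decided to train my model on eight of the most highly correlated features: total touchdowns ('total_tds'), total yards, passing yards, touches, passing first downs, passing touchdowns, completions, and passing yards after catch.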

In [17]:
top_features = ['total_tds','total_yards','passing_yards','touches','passing_first_downs','passing_tds','completions','passing_yards_after_catch']
X = numeric_columns[top_features]
y = numeric_columns['fantasy_points_ppr']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")

Mean Squared Error: 28.738176536443863
R-squared: 0.9981161769185948


<i>Testing the Algorithm</i>

Now that we have our model, I want to test how strong it is. To do this, I used the model to predict the top 20 players in PPR PPG for each year and compared that list with the actual top 20 players in total PPR points for that year. I then gave it an accuracy score: 2 points if the model placed a player in the exact position where they finished, 1 point if it predicted a player to be in the top 20 but in the wrong position, and 0 if it missed completely. I then compared this with a control model that predicts the top 20 from whoever the top 20 in PPR PPG were the year before. Essentially, this lets me see how much more accurate my model is than blindly copying and pasting the prior year's top 20 players in PPR PPG.
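To make the scoring rule concrete, here is a minimal sketch on a made-up three-player ranking. The function name `score_rankings` and the placeholder player names are assumptions for illustration only; the notebook's own scorer is `accuracy_test`, defined further down.

```python
def score_rankings(actual, predicted):
    """Score two ordered lists of names.

    2 points for a player in the exact position, 1 point for a player
    who appears in both lists at different positions, 0 otherwise.
    """
    predicted_pos = {name: i for i, name in enumerate(predicted)}
    score = 0
    for i, name in enumerate(actual):
        if name in predicted_pos:
            score += 2 if predicted_pos[name] == i else 1
    return score

# Hypothetical players, not real data
actual = ["QB A", "QB B", "QB C"]
predicted = ["QB A", "QB C", "QB D"]

# QB A is an exact match (2), QB C is present but misplaced (1), QB B is missed (0)
toy_score = score_rankings(actual, predicted)
print(toy_score)
```

With 20 players per list, a perfect season under this rule would score 40.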

I would also like to note the high MSE of this model. Despite using the same method to choose its parameters, this model has a much higher MSE than the flex player model. This could be due to various things, but the explanation that makes the most sense to me is that quarterback scoring is much more diverse than that of a flex player, especially in the modern NFL, where having a mobile QB (i.e. one who picks up a lot of points by running) is essential. This can skew the data, as half of the top QBs in the league get their points by throwing for 4,000-yard seasons, whereas others run in 10+ touchdowns. It seems difficult for the model to cover both styles without widening its range even further, which would result in an even higher MSE.
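One way to put the MSE in context is to take its square root, which gives the root mean squared error (RMSE) in the same units as the target, fantasy points. A quick sketch, using only the MSE value printed by the model above:

```python
import math

mse = 28.738176536443863  # MSE reported by the model above
rmse = math.sqrt(mse)     # same units as fantasy_points_ppr

print(f"RMSE: {rmse:.2f} points")
```

This works out to a typical error of a little over 5 PPR points per player-season, which may help when comparing this model against the flex player model on the same scale.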

In [18]:
def predict_next_year(train_year):

    data_train = flex_data[flex_data['season'] == train_year]
    data_actual = flex_data[flex_data['season'] == (train_year + 1)]
    numeric_columns_train = data_train.select_dtypes(include=['number'])

    X_train = numeric_columns_train[top_features]
    y_train = numeric_columns_train['fantasy_points_ppr']

    model = LinearRegression()
    model.fit(X_train, y_train)

    data_predict = data_train.copy()
    data_predict['predicted_fantasy_points_ppr'] = model.predict(X_train)

    top_20_predicted = []
    top_20_actual = []
    for _, row in data_predict[['name', 'predicted_fantasy_points_ppr']].sort_values(by='predicted_fantasy_points_ppr', ascending=False).iterrows():
        if row['name'] not in [player['name'] for player in top_20_predicted]:
            top_20_predicted.append({'name': row['name'], 'predicted_fantasy_points_ppr': row['predicted_fantasy_points_ppr']})
        if len(top_20_predicted) == 20:
            break

    if not data_actual.empty and 'name' in data_actual.columns:
        for _, row in data_actual[['name', 'fantasy_points_ppr']].sort_values(by='fantasy_points_ppr', ascending=False).iterrows():
            if row['name'] not in [player['name'] for player in top_20_actual]:
                top_20_actual.append({'name': row['name'], 'fantasy_points_ppr': row['fantasy_points_ppr']})
            if len(top_20_actual) == 20:
                break
    else:
        for i in range(20):
            top_20_actual.append({'name': 'NA', 'fantasy_points_ppr': 'NA'})

    top_20_predicted_df = pd.DataFrame(top_20_predicted)
    top_20_predicted_df['predicted_rank'] = range(1, 21)

    top_20_actual_df = pd.DataFrame(top_20_actual)
    top_20_actual_df['actual_rank'] = range(1, 21)

    if 'name' in top_20_predicted_df.columns and 'name' in top_20_actual_df.columns:
        if 'NA' not in top_20_actual_df['name'].values:
            score = accuracy_test(top_20_actual_df, top_20_predicted_df)
        else:
            score = 'NA'
    else:
        score = 'NA'

    combined_results = pd.DataFrame({
        'Rank': range(1, 21),
        'Projected Leaders': top_20_predicted_df['name'],
        'Projected PPR PPG': top_20_predicted_df['predicted_fantasy_points_ppr'],
        '': range(1, 21),
        'Actual Leaders': top_20_actual_df['name'],
        'Actual PPR Total': top_20_actual_df['fantasy_points_ppr']
    })

    print(f"Projected vs Actual Leaders for the {train_year + 1} Season based on {train_year} Stats:")
    print(combined_results)
    print(f"Accuracy Score: {score}")
    return score

In [19]:
control_qb_path = '.../Data/Created Data/control_qb.csv'
control_qb = pd.read_csv(control_qb_path)
print(control_qb.head(10))

   rank    player_name  year position  points  preseason_rank  \
0   397    AJ McCarron  2017       QB    2.64              39   
1   574    AJ McCarron  2019       QB   16.90              44   
2   543    AJ McCarron  2020       QB    0.80              57   
3   579    AJ McCarron  2023       QB    0.76              62   
4    13  Aaron Rodgers  2013       QB  169.44               2   
5    23  Aaron Rodgers  2014       QB  354.14               3   
6    21  Aaron Rodgers  2015       QB  301.24               2   
7    38  Aaron Rodgers  2016       QB  380.02               2   
8    22  Aaron Rodgers  2017       QB  129.60               1   
9    28  Aaron Rodgers  2018       QB  312.58               1   

   postseason_rank  accuracy_score  yearly_ac  total_accuracy_score  
0               57               0         16                   170  
1               49               0         18                   170  
2               75               0         18                   170  
3   

In [20]:
def accuracy_test(actual, predicted):
    score = 0
    predicted_ranks = {row['name']: row['predicted_rank'] for index, row in predicted.iterrows()}
    actual_ranks = {row['name']: row['actual_rank'] for index, row in actual.iterrows()}
    for player in actual_ranks:
        if player in predicted_ranks:
            if predicted_ranks[player] == actual_ranks[player]:
                score += 2  # Exact rank match
            else:
                score += 1  # Player is in the top 20 but not the exact rank

    return score

In [21]:
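def test_data():
    total_score = 0
    total_score_c = 0
    for i in range(3, 13):  # train on 2013 through 2022, compare with the following season
        score = predict_next_year(2010 + i)
        if score != 'NA':  # predict_next_year returns 'NA' when the next season has no data
            total_score += score
        # 'total_accuracy_score' appears to hold the control's precomputed total,
        # so it is read from the file rather than summed from 'yearly_ac' here
        total_score_c = control_qb[control_qb['year'] == 2010 + i]['total_accuracy_score'].unique()
    print(f"Total Score: {total_score}")
    print(f"Total Score (Control): {total_score_c}")
test_data()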

Projected vs Actual Leaders for the 2014 Season based on 2013 Stats:
    Rank   Projected Leaders  Projected PPR PPG          Actual Leaders  \
0      1      Peyton Manning         418.248682   1       Aaron Rodgers   
1      2          Drew Brees         351.147189   2         Andrew Luck   
2      3         Andy Dalton         302.861985   3      Russell Wilson   
3      4          Cam Newton         301.018330   4      Peyton Manning   
4      5    Matthew Stafford         280.718570   5  Ben Roethlisberger   
5      6         Andrew Luck         275.687789   6          Drew Brees   
6      7      Russell Wilson         275.591775   7           Matt Ryan   
7      8  Ben Roethlisberger         265.977400   8      Ryan Tannehill   
8      9    Colin Kaepernick         258.247915   9           Tom Brady   
9     10          Alex Smith         253.556449  10         Eli Manning   
10    11           Matt Ryan         252.686223  11          Jay Cutler   
11    12          Nick Foles   

The model had an accuracy score of 149, which is about 12% less than the control's score of 170. For reference, the flex player models had total scores in the high 70s to low 80s, while these scores are in the mid 100s. This tells us something about QBs in the NFL: the best of the best stay at the top more consistently, and for longer, than the best flex players do.

This also tells me that the current model is very poor, since it fails to beat even the control. This makes sense given how much larger its MSE was than the flex player model's. Because of this, I want to experiment with different predictors that are not necessarily at the top of the correlation test but seem relevant in general.

At first, my own judgment proved worse than the correlation test, as my MSE just kept going up. I then decided to create a function that would test all combinations of the top 25 features from the correlation test. This ran for about 4 minutes before I realized I had created a monster function that had to be put down. After some rework, I changed my approach. The initial approach evaluated every possible combination of the top 25 features, which is computationally expensive. The updated approach uses Recursive Feature Elimination (RFE) to rank the features by importance and then evaluates combinations of only the top-ranked ones, which significantly reduces computation time. This makes the updated approach more efficient while still considering feature interactions and combined predictive power.
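To see why the first attempt could not finish, it helps to count the subsets involved. The sketch below counts every non-empty subset of 25 features and of 15 features (15 being the `max_features` value passed in the following cell). The helper name `count_nonempty_subsets` is made up for this illustration.

```python
def count_nonempty_subsets(n):
    """Number of non-empty subsets of n features: 2**n - 1."""
    return 2 ** n - 1

all_25 = count_nonempty_subsets(25)  # brute force over the top 25 features
top_15 = count_nonempty_subsets(15)  # only the 15 best-ranked features

print(f"Top 25: {all_25:,} combinations")
print(f"Top 15: {top_15:,} combinations")
print(f"Ratio:  {all_25 / top_15:,.0f}x")
```

Each combination costs one train/test split and one regression fit, so cutting the pool from 25 to 15 features reduces the work from tens of millions of fits to tens of thousands.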

In [22]:
from itertools import combinations
from sklearn.feature_selection import RFE

flex_data_path = '.../Data/Created Data/qb_team.csv'
flex_data = pd.read_csv(flex_data_path)
numeric_columns = flex_data.select_dtypes(include=['number'])

correlation = numeric_columns.corr()
fantasy_points_ppr_correlation = correlation['fantasy_points_ppr'].sort_values(ascending=False)

top_25_features = fantasy_points_ppr_correlation.index[1:26].tolist()  # index[0] is 'fantasy_points_ppr'

X = numeric_columns[top_25_features]
y = numeric_columns['fantasy_points_ppr']

model = LinearRegression()
rfe = RFE(model, n_features_to_select=1)
rfe.fit(X, y)

ranking = rfe.ranking_
ranked_features = pd.DataFrame({'Feature': top_25_features, 'Rank': ranking}).sort_values(by='Rank')

def evaluate_top_feature_combinations(X, y, max_features=5):
    results = []
    feature_list = ranked_features['Feature'].tolist()

    for r in range(1, max_features + 1):
        for combo in combinations(feature_list[:max_features], r):
            combo = list(combo)
            
            X_subset = X[combo]
            X_train, X_test, y_train, y_test = train_test_split(X_subset, y, test_size=0.2, random_state=42)

            model = LinearRegression()
            model.fit(X_train, y_train)
            
            y_pred = model.predict(X_test)
            mse = mean_squared_error(y_test, y_pred)
            
            results.append((combo, mse))
    
    results.sort(key=lambda x: x[1])
    
    return results

results = evaluate_top_feature_combinations(X, y, max_features=15)

print("Top 10 feature combinations based on MSE:")
for combo, mse in results[:10]:
    print(f"Features: {combo}, MSE: {mse}")


Top 10 feature combinations based on MSE:
Features: ['total_tds', 'interceptions', 'ppr_ppg', 'sack_fumbles', 'passing_tds', 'rushing_fumbles', 'total_yards', 'passing_yards', 'ypg'], MSE: 5.687469724459937
Features: ['total_tds', 'interceptions', 'offense_pct', 'ppr_ppg', 'sack_fumbles', 'passing_tds', 'rushing_fumbles', 'total_yards', 'passing_yards', 'ypg'], MSE: 5.6890489257900825
Features: ['total_tds', 'interceptions', 'offense_pct', 'ppr_ppg', 'sack_fumbles', 'passing_tds', 'rushing_fumbles', 'total_yards', 'passing_yards'], MSE: 5.788718298561504
Features: ['total_tds', 'interceptions', 'attempts', 'ppr_ppg', 'sack_fumbles', 'passing_tds', 'rushing_fumbles', 'total_yards', 'passing_yards', 'ypg'], MSE: 5.83034218681096
Features: ['total_tds', 'interceptions', 'ppr_ppg', 'sack_fumbles', 'passing_tds', 'rushing_fumbles', 'total_yards', 'passing_yards'], MSE: 5.855898498509869
Features: ['total_tds', 'interceptions', 'attempts', 'offense_pct', 'ppr_ppg', 'sack_fumbles', 'passing_t

This approach produced a combination with an MSE of about 5.7, which is an 80% (!!!) decrease. That is absolutely huge for the model. Now we just have to see whether it results in a better accuracy score.
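The size of the improvement can be checked directly from the two MSE values printed earlier. This sketch also expresses the change as RMSE, since a drop in squared error overstates the drop in error measured in fantasy points.

```python
import math

mse_before = 28.738176536443863  # first model (top correlated features)
mse_after = 5.687469724459937    # best combination found after RFE

mse_drop = (mse_before - mse_after) / mse_before
rmse_drop = 1 - math.sqrt(mse_after / mse_before)

print(f"MSE decrease:  {mse_drop:.1%}")
print(f"RMSE decrease: {rmse_drop:.1%}")
```

The MSE falls by roughly 80%, while the typical error in points falls by a bit more than half.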

In [23]:
top_features = ['total_tds', 'interceptions', 'ppr_ppg', 'sack_fumbles', 'passing_tds', 'rushing_fumbles', 'total_yards', 'passing_yards', 'ypg']
test_data()

Projected vs Actual Leaders for the 2014 Season based on 2013 Stats:
    Rank   Projected Leaders  Projected PPR PPG          Actual Leaders  \
0      1      Peyton Manning         410.544413   1       Aaron Rodgers   
1      2          Drew Brees         357.567810   2         Andrew Luck   
2      3          Cam Newton         296.518804   3      Russell Wilson   
3      4         Andy Dalton         288.265392   4      Peyton Manning   
4      5         Andrew Luck         286.334860   5  Ben Roethlisberger   
5      6    Matthew Stafford         273.871535   6          Drew Brees   
6      7      Russell Wilson         272.207778   7           Matt Ryan   
7      8    Colin Kaepernick         268.457814   8      Ryan Tannehill   
8      9  Ben Roethlisberger         263.995018   9           Tom Brady   
9     10          Nick Foles         256.764718  10         Eli Manning   
10    11          Alex Smith         254.527796  11          Jay Cutler   
11    12           Matt Ryan   

This new approach lowered the model's accuracy score to 147, a 2-point decrease. Just as we saw with the flex player predictor, a lower MSE did not necessarily translate into a higher accuracy score, which is odd.
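One possible explanation is that MSE and the accuracy score measure different things: MSE rewards predictions that are close in value, while the accuracy score rewards getting the order right. The MSE here is also measured on a random hold-out from the same seasons, whereas the accuracy score compares against the following season. The toy example below uses made-up numbers, not the notebook's data, to show a set of predictions with a lower MSE but a worse ranking.

```python
def mse(y_true, y_pred):
    """Mean squared error of two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def ranking(values):
    """Player indices ordered from highest to lowest value."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)

# Hypothetical season totals for three players
actual = [300.0, 290.0, 280.0]

pred_a = [310.0, 300.0, 290.0]  # every prediction 10 points high, order preserved
pred_b = [289.0, 291.0, 280.0]  # closer in value, but the top two players swap

mse_a, mse_b = mse(actual, pred_a), mse(actual, pred_b)

print(f"Model A: MSE={mse_a:.1f}, ranking={ranking(pred_a)}")
print(f"Model B: MSE={mse_b:.1f}, ranking={ranking(pred_b)}")
print(f"Actual ranking: {ranking(actual)}")
```

Model B has less than half the MSE of Model A, yet only Model A reproduces the actual order, so it would earn the higher accuracy score.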