<p style="font-size:24px"><b>Predicting Fantasy Football Stars (Pt. 2)</b></p>

This is a follow-up to the flex player prediction. Now I will attempt to do the same for quarterbacks.

<i>Building The Model (Quarterbacks)</i>

To build the actual model, I will once again use various Python libraries.

In [12]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In [13]:
flex_data_path = '.../FantasyFootballPredictor/Data/qb_team.csv'
flex_data = pd.read_csv(flex_data_path)
numeric_columns = flex_data.select_dtypes(include=['number'])
correlation = numeric_columns.corr()
fantasy_points_ppr_correlation = correlation['fantasy_points_ppr'].sort_values(ascending=False).head(10)
print(fantasy_points_ppr_correlation)

fantasy_points_ppr           1.000000
total_tds                    0.985430
total_yards                  0.981781
passing_yards                0.968837
touches                      0.966782
passing_first_downs          0.966562
passing_tds                  0.965720
offense_snaps                0.963398
completions                  0.956602
passing_yards_after_catch    0.953485
Name: fantasy_points_ppr, dtype: float64


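Based on the correlation test, I decided to train my model on eight of the most highly correlated stats: total touchdowns ('total_tds'), total yards, passing yards, touches, passing first downs, passing touchdowns, completions, and passing yards after catch. Note that 'offense_snaps' also appears in the top 10 but is not used.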

In [14]:
top_features = ['total_tds','total_yards','passing_yards','touches','passing_first_downs','passing_tds','completions','passing_yards_after_catch']
X = numeric_columns[top_features]
y = numeric_columns['fantasy_points_ppr']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")

Mean Squared Error: 33.75002876458052
R-squared: 0.9976446415303521


<i>Testing the Algorithm</i>

Now that we have our model, I want to test how strong it is. To do this, I used the model to predict who the top 20 players in PPR PPG will be each year and compared that list with the actual top 20 players in total PPR points for that year. I then gave it an accuracy score, which awards the model 2 points if it placed the right player in the exact position where they finished, 1 point if it predicted a player to be in the top 20 but in the wrong position, and 0 if it missed completely. I then compared this with a control model that predicts the top 20 based on who the top 20 in PPR PPG were the year before. Essentially, this lets me see how much more accurate my model is than blindly copying and pasting the top 20 players in PPR PPG from the prior year.

Also, I would like to note the high MSE of this model. Despite using the same method to choose its parameters, this model has a much higher MSE than the flex player model. This could be due to various things, but the answer that makes the most sense to me is that quarterback scoring is much more diverse than that of a flex player, especially in the modern NFL, where having a mobile QB (i.e. one who picks up a lot of points by running) is essential. This can skew the data, as half of the top QBs in the league get their points by throwing for 4,000-yard seasons, whereas others run in 10+ touchdowns. It seems difficult for the model to cover both styles without widening its range even further, which would result in an even higher MSE.
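
To put that MSE in the same units as the fantasy points themselves, it can help to take its square root (the RMSE). The snippet below is only a sketch: the MSE value is the one printed by the model-fitting cell above, and the variable names are my own.

```python
import math

# MSE printed by the model-fitting cell above
mse = 33.75002876458052

# RMSE is expressed in the same units as the target (PPR fantasy points)
rmse = math.sqrt(mse)
print(f"RMSE: {rmse:.2f} PPR points")
```

This works out to a typical error of roughly 5.8 PPR points per prediction on the held-out rows.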

In [15]:
def predict_next_year(train_year):

    data_train = flex_data[flex_data['season'] == train_year]
    data_actual = flex_data[flex_data['season'] == (train_year + 1)]
    numeric_columns_train = data_train.select_dtypes(include=['number'])

    X_train = numeric_columns_train[top_features]
    y_train = numeric_columns_train['fantasy_points_ppr']

    model = LinearRegression()
    model.fit(X_train, y_train)

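    # Predict on the same train_year rows the model was fit on; these fitted
    # values serve as the projection for the following season.
    data_predict = data_train.copy()
    data_predict['predicted_fantasy_points_ppr'] = model.predict(X_train)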

    top_20_predicted = []
    top_20_actual = []
    for _, row in data_predict[['name', 'predicted_fantasy_points_ppr']].sort_values(by='predicted_fantasy_points_ppr', ascending=False).iterrows():
        if row['name'] not in [player['name'] for player in top_20_predicted]:
            top_20_predicted.append({'name': row['name'], 'predicted_fantasy_points_ppr': row['predicted_fantasy_points_ppr']})
        if len(top_20_predicted) == 20:
            break

    if not data_actual.empty and 'name' in data_actual.columns:
        for _, row in data_actual[['name', 'fantasy_points_ppr']].sort_values(by='fantasy_points_ppr', ascending=False).iterrows():
            if row['name'] not in [player['name'] for player in top_20_actual]:
                top_20_actual.append({'name': row['name'], 'fantasy_points_ppr': row['fantasy_points_ppr']})
            if len(top_20_actual) == 20:
                break
    else:
        for i in range(20):
            top_20_actual.append({'name': 'NA', 'fantasy_points_ppr': 'NA'})

    top_20_predicted_df = pd.DataFrame(top_20_predicted)
    top_20_predicted_df['predicted_rank'] = range(1, 21)

    top_20_actual_df = pd.DataFrame(top_20_actual)
    top_20_actual_df['actual_rank'] = range(1, 21)

    if 'name' in top_20_predicted_df.columns and 'name' in top_20_actual_df.columns:
        if 'NA' not in top_20_actual_df['name'].values:
            score = accuracy_test(top_20_actual_df, top_20_predicted_df)
        else:
            score = 'NA'
    else:
        score = 'NA'

    combined_results = pd.DataFrame({
        'Rank': range(1, 21),
        'Projected Leaders': top_20_predicted_df['name'],
        'Projected PPR PPG': top_20_predicted_df['predicted_fantasy_points_ppr'],
        '': range(1, 21),
        'Actual Leaders': top_20_actual_df['name'],
        'Actual PPR Total': top_20_actual_df['fantasy_points_ppr']
    })

    print(f"Projected vs Actual Leaders for the {train_year + 1} Season based on {train_year} Stats:")
    print(combined_results)
    print(f"Accuracy Score: {score}")
    return score

In [16]:
def control(train_year):
    flex_data_path_control = '.../FantasyFootballPredictor/Data/oy_qb.csv'
    flex_data = pd.read_csv(flex_data_path_control)

    data_train = flex_data[flex_data['season'] == train_year]
    data_actual = flex_data[flex_data['season'] == (train_year + 1)]
    
    top_20_actual = data_actual[['name', 'fantasy_points_ppr']].sort_values(by='fantasy_points_ppr', ascending=False).head(20).reset_index(drop=True)
    top_20_actual['actual_rank'] = range(1, 21)

    control_predictions = data_train[['name', 'fantasy_points_ppr']].sort_values(by='fantasy_points_ppr', ascending=False).head(20).reset_index(drop=True)
    control_predictions['predicted_rank'] = range(1, 21)
    
    score = accuracy_test(top_20_actual, control_predictions)
    print(f"Accuracy Score (Control): {score}")
    
    return(score)

In [17]:
def accuracy_test(actual, predicted):
    score = 0
    predicted_ranks = {row['name']: row['predicted_rank'] for index, row in predicted.iterrows()}
    actual_ranks = {row['name']: row['actual_rank'] for index, row in actual.iterrows()}
    for player in actual_ranks:
        if player in predicted_ranks:
            if predicted_ranks[player] == actual_ranks[player]:
                score += 2  # Exact rank match
            else:
                score += 1  # Player is in the top 20 but not the exact rank

    return(score)
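
As a quick sanity check of the scoring rule, here is a toy example with three made-up quarterbacks. It is a sketch only: `score_rankings` is a hypothetical stand-in that applies the same 2/1/0 rule to plain dictionaries rather than DataFrames, and the player names are placeholders.

```python
def score_rankings(actual_ranks, predicted_ranks):
    """Apply the 2/1/0 rule to two {name: rank} dictionaries."""
    score = 0
    for player, rank in actual_ranks.items():
        if player in predicted_ranks:
            # 2 points for the exact rank, 1 point for being on the list at all
            score += 2 if predicted_ranks[player] == rank else 1
    return score

# Hypothetical top-3 lists
actual_ranks = {'QB A': 1, 'QB B': 2, 'QB C': 3}
predicted_ranks = {'QB A': 1, 'QB C': 2, 'QB D': 3}

toy_score = score_rankings(actual_ranks, predicted_ranks)
print(toy_score)  # QB A exact (2) + QB B missed (0) + QB C wrong spot (1) = 3

# With a top-20 list, a perfect season would score 2 points per slot
max_score_top_20 = 2 * 20
print(max_score_top_20)  # 40
```

Since a perfect top 20 is worth 40 points, a single season's score is easiest to read as a share of 40.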

In [18]:
def test_data():
    total_score = 0
    total_score_c = 0
    for i in range(3,13):  # train_year runs from 2013 through 2022
        score = predict_next_year(2010+i)
        total_score += score
        score_c = control(2010+i)
        total_score_c += score_c
    print(f"Total Score: {total_score}")
    print(f"Total Score (Control): {total_score_c}")
test_data()

Projected vs Actual Leaders for the 2014 Season based on 2013 Stats:
    Rank   Projected Leaders  Projected PPR PPG          Actual Leaders  \
0      1      Peyton Manning         419.427070   1       Aaron Rodgers   
1      2          Drew Brees         351.295000   2         Andrew Luck   
2      3         Andy Dalton         303.408608   3      Russell Wilson   
3      4          Cam Newton         301.170436   4      Peyton Manning   
4      5    Matthew Stafford         280.780944   5  Ben Roethlisberger   
5      6      Russell Wilson         276.013430   6          Drew Brees   
6      7         Andrew Luck         275.886518   7           Matt Ryan   
7      8  Ben Roethlisberger         266.343589   8      Ryan Tannehill   
8      9    Colin Kaepernick         258.144944   9           Tom Brady   
9     10           Tony Romo         255.034416  10         Eli Manning   
10    11          Alex Smith         254.474334  11           Tony Romo   
11    12           Matt Ryan   

The model had a total accuracy score of 149, which is only slightly higher than the control's 145. For reference, the flex player models had total scores in the high 70s to low 80s, while these are in the high 140s. This tells us a few things about QBs in the NFL, specifically how much more consistently, and for how much longer, the best of the best perform compared to flex players.

This also tells me that the current model is very, very poor. It outperformed the control by just under 3% (149 vs. 145), a negligible margin. This makes sense too, because the MSE was so large. Because of this, I want to experiment with different predictors that are not necessarily at the top of the correlation test, but that I judge to be relevant in general.
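
The arithmetic behind that comparison can be sketched as follows. The two totals come from the results above; the ceiling assumes all ten seasons in the `test_data()` loop were scored and that a perfect season is worth 2 points for each of 20 slots.

```python
model_total = 149
control_total = 145

# test_data() loops over range(3, 13), i.e. ten training seasons
seasons = len(range(3, 13))
max_total = seasons * 2 * 20  # assumed ceiling: 2 points x 20 slots per season

relative_gain = (model_total - control_total) / control_total
model_share = model_total / max_total
control_share = control_total / max_total

print(f"Relative gain over control: {relative_gain:.2%}")
print(f"Model share of maximum:     {model_share:.2%}")
print(f"Control share of maximum:   {control_share:.2%}")
```

Under those assumptions the gap between model and control is about one percentage point of the maximum possible score.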

At first, my own judgement proved to be worse than the correlation test, as my MSE just kept going up. I then decided to create a function that would test all combinations of the top 25 features from the correlation test. This ran for about 4 minutes before I realized I had created a monster function that had to be put down. After some rework, I changed my approach. The initial approach evaluated every possible combination of the top 25 features, which is computationally expensive. The updated approach uses Recursive Feature Elimination (RFE) to rank the features, then evaluates combinations of only the top-ranked ones, which significantly reduces computation time. This makes the updated approach more efficient while still considering feature interactions and combined predictive power.
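
To see why the first attempt was hopeless, it helps to count the model fits involved. The sketch below counts the non-empty subsets of 25 features and of 15 features, the latter being the `max_features=15` value passed in the cell that follows; `count_nonempty_subsets` is a helper name I made up for illustration.

```python
from math import comb

def count_nonempty_subsets(n_features):
    """Number of ways to choose at least one of n_features features."""
    return sum(comb(n_features, r) for r in range(1, n_features + 1))

all_25 = count_nonempty_subsets(25)
top_15 = count_nonempty_subsets(15)

print(f"Subsets of 25 features: {all_25:,}")  # 33,554,431
print(f"Subsets of 15 features: {top_15:,}")  # 32,767
```

Each subset means one train/test split and one regression fit, so restricting the search to the 15 top-ranked features cuts the work by a factor of roughly 1,000.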

In [19]:
from itertools import combinations
from sklearn.feature_selection import RFE

flex_data_path = '.../FantasyFootballPredictor/Data/qb_team.csv'
flex_data = pd.read_csv(flex_data_path)
numeric_columns = flex_data.select_dtypes(include=['number'])

correlation = numeric_columns.corr()
fantasy_points_ppr_correlation = correlation['fantasy_points_ppr'].sort_values(ascending=False)

top_25_features = fantasy_points_ppr_correlation.index[1:26].tolist()  # index[0] is 'fantasy_points_ppr'

X = numeric_columns[top_25_features]
y = numeric_columns['fantasy_points_ppr']

model = LinearRegression()
rfe = RFE(model, n_features_to_select=1)
rfe.fit(X, y)

ranking = rfe.ranking_
ranked_features = pd.DataFrame({'Feature': top_25_features, 'Rank': ranking}).sort_values(by='Rank')

def evaluate_top_feature_combinations(X, y, max_features=5):
    results = []
    feature_list = ranked_features['Feature'].tolist()

    for r in range(1, max_features + 1):
        for combo in combinations(feature_list[:max_features], r):
            combo = list(combo)
            
            X_subset = X[combo]
            X_train, X_test, y_train, y_test = train_test_split(X_subset, y, test_size=0.2, random_state=42)

            model = LinearRegression()
            model.fit(X_train, y_train)
            
            y_pred = model.predict(X_test)
            mse = mean_squared_error(y_test, y_pred)
            
            results.append((combo, mse))
    
    results.sort(key=lambda x: x[1])
    
    return results

results = evaluate_top_feature_combinations(X, y, max_features=15)

print("Top 10 feature combinations based on MSE:")
for combo, mse in results[:10]:
    print(f"Features: {combo}, MSE: {mse}")


Top 10 feature combinations based on MSE:
Features: ['total_tds', 'interceptions', 'ppr_ppg', 'passing_tds', 'sack_fumbles', 'rushing_fumbles', 'total_yards', 'passing_yards', 'ypg'], MSE: 4.770049067802421
Features: ['total_tds', 'interceptions', 'ppr_ppg', 'passing_tds', 'sack_fumbles', 'rushing_fumbles', 'total_yards', 'passing_yards', 'passing_first_downs', 'ypg'], MSE: 4.782078036987525
Features: ['offense_pct', 'total_tds', 'interceptions', 'ppr_ppg', 'passing_tds', 'sack_fumbles', 'rushing_fumbles', 'total_yards', 'passing_yards', 'ypg'], MSE: 4.8009416723512235
Features: ['offense_pct', 'total_tds', 'interceptions', 'ppr_ppg', 'passing_tds', 'sack_fumbles', 'rushing_fumbles', 'total_yards', 'passing_yards', 'passing_first_downs', 'ypg'], MSE: 4.802665380115127
Features: ['offense_pct', 'total_tds', 'interceptions', 'ppr_ppg', 'passing_tds', 'sack_fumbles', 'rushing_fumbles', 'total_yards', 'passing_yards'], MSE: 4.807129136376698
Features: ['offense_pct', 'total_tds', 'intercep

This approach produced a combination with an MSE of about 4.77, which is an 86%(!!!) decrease. That is absolutely huge for the model. Now we just have to see whether that results in a better accuracy score.
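
For reference, the size of that drop can be checked directly from the two printed MSE values. This is a small sketch with variable names of my own; it also converts both values to RMSE so the change can be read in PPR points.

```python
import math

mse_before = 33.75002876458052  # original eight-feature model
mse_after = 4.770049067802421   # best combination from the search

# Relative decrease in MSE
mse_drop = (mse_before - mse_after) / mse_before

# The same comparison in the units of the target (PPR points)
rmse_before = math.sqrt(mse_before)
rmse_after = math.sqrt(mse_after)

print(f"MSE decrease: {mse_drop:.1%}")
print(f"RMSE: {rmse_before:.2f} -> {rmse_after:.2f} PPR points")
```

In RMSE terms the typical error falls from about 5.8 to about 2.2 PPR points.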

In [20]:
top_features = ['total_tds', 'interceptions', 'ppr_ppg', 'passing_tds', 'sack_fumbles', 'rushing_fumbles', 'total_yards', 'passing_yards', 'ypg']
test_data()

Projected vs Actual Leaders for the 2014 Season based on 2013 Stats:
    Rank   Projected Leaders  Projected PPR PPG          Actual Leaders  \
0      1      Peyton Manning         411.677902   1       Aaron Rodgers   
1      2          Drew Brees         358.769212   2         Andrew Luck   
2      3          Cam Newton         296.777000   3      Russell Wilson   
3      4         Andy Dalton         289.190164   4      Peyton Manning   
4      5         Andrew Luck         286.262931   5  Ben Roethlisberger   
5      6    Matthew Stafford         273.596099   6          Drew Brees   
6      7      Russell Wilson         271.038569   7           Matt Ryan   
7      8    Colin Kaepernick         267.957672   8      Ryan Tannehill   
8      9  Ben Roethlisberger         263.961157   9           Tom Brady   
9     10           Tony Romo         257.456965  10         Eli Manning   
10    11          Nick Foles         257.228018  11           Tony Romo   
11    12          Alex Smith   

This new approach improved the model's total score to 151, a 2-point improvement over the original 149.