## Abstract

Our group set out to create a model that takes in a football player's statistics for one season and outputs a projected fantasy score for the following year. We wanted a pipeline that could be trained on several years' worth of data and would choose the model that performed best on that dataset. We split each major fantasy football position into its own model in case one type of model was better for one position than for another. We scraped our data online, so it will be easy to add future seasons and incorporate them into our training data. We were ultimately successful in creating this setup and obtaining results that seem relatively accurate to us. At some point, we would like to compare our results against other algorithms for predicting fantasy football performance.

<a href="https://github.com/johnny-kantaros/fantasy-football">Click to see source code!</a>

## Introduction

Our project focused on developing a machine learning pipeline for fantasy football applications. All three of us have played fantasy football for a number of years, so we were motivated to see if modeling historic data could lead to accurate predictions for future seasons. For those who are unfamiliar, or need a refresher on what fantasy football is, here is a quick recap:  

<b><u>Fantasy Football</u></b>  

The primary goal of fantasy football is to select a fantasy team composed of current NFL players. The standard roster includes one quarterback, two running backs, two wide receivers, one tight end, one kicker, one defensive/special teams unit, and one "flex," which can be an additional running back, wide receiver, or tight end. There is also space for ~5-7 bench players whose points will not count if they remain on the bench. Here is an example roster: 

<img src = "./images/roster.png">

Typically, a fantasy football league will consist of 8-12 teams, and participants will battle head to head against their friends to see whose collective team performs better that week. The league will have playoffs toward the end of the season and eventually a championship.   

As you will see in the "Week 1" columns on the right, there is one feature named "Proj," which stands for projections. These metrics are very popular in fantasy football, and team managers often use them to compare players and set their lineups each week. Like many, we have always been curious how these projections are generated. Several individuals and groups have also tried to accomplish this task.

For example, Chelsea Robinson at Louisiana Tech wrote a case study in 2020 with her findings from advanced statistical modeling using historical fantasy data. Similar to us, she relied on regression modeling to output a ranking list for the following season. Although mathematically strong, her model uses less data and fewer features than ours, which might not produce as accurate a result.

Another interesting case study comes from Roman Lutz at UMass Amherst, who employed a solution similar to ours. More specifically, he pulled data from over 5 seasons and used SVM regression along with neural networks for optimization. Similar to the first case study, his data was fairly basic and lacked the advanced features found in ours. Consequently, his MSE was around 6, while ours was closer to 2. This is a significant difference in prediction error, so we are proud of our result.

The last case study worth mentioning comes from Benjamin Hendricks at Harvard. In his approach, Hendricks uses an <i>ensemble</i> method to reach predictions. He leverages data from existing models, applies natural language processing techniques to perform sentiment analysis on player performance, and combines these metrics with standard data from NFL.com and FantasyData.io. Hendricks's use of sentiment analysis and crowdsourcing is a unique approach: he relies on the crowd's opinion of players and teams instead of just the "masters." He also includes advanced, real-time information such as injuries and weather analysis. This is an impressive, detailed approach with great performance (30% better than most sports sites).

## Values Statement


<u>Potential Users</u>  

The potential users of our project are fantasy football team owners. Our data, modeling, and output are all fairly specific, so there will not be many applications outside this domain. It is worth noting that our current output is specific to fantasy football <i>drafts</i>, which take place at the beginning of the year and allow users to pick their team for the year. If we had more time, we would have liked to model for weekly predictions.  

<u>Who benefits?</u>  

Hopefully, the owners of fantasy football teams who leverage our product will gain increased insight and an edge over their opponents. These users can run our model for that given year and shape their draft off the results.

<u>Who is harmed?</u>  

While no one will be truly harmed, this algorithm could provide an unfair advantage for certain members of a league. The algorithm should not be used if any sort of wagering is involved in the league, as this could cause unfair and biased outcomes.  

<u>What is your personal reason for working on this problem?</u>  

As mentioned above, we have all played fantasy football for a number of years and have been interested in how the projections are produced by major sites like ESPN and Yahoo. We wanted to see if we could replicate and expand on these predictions using the machine learning techniques we have explored this semester.  

<u>Societal Impact</u>  

There will be very little societal impact of our product. As we mentioned, it is a very specific application of machine learning, and it will primarily be used for fun instead of addressing any societal problems.  

## Materials and Methods

### <u> Our data</u>

##### <u>Normal data</u>
We scraped most of our data online from websites that provide NFL player statistics. We tested several sites, but the one with the most data that was easy to scrape was <a href="https://www.fantasypros.com/nfl/stats/">FantasyPros</a>. This website has cleanly formatted NFL data for every player from each year. It also conveniently splits the players into positional groups, which made our job easier. Furthermore, the URL for each position and year was structured in such a way that we could write the following function to web-scrape our basic data:


```python
from io import StringIO

import pandas as pd
import requests


def read_new(year, pos):
    """Scrape one season of FantasyPros stats for a position and attach next season's FPTS/G as y."""

    # Link takes lowercase positions
    pos = pos.lower()

    url = f"https://www.fantasypros.com/nfl/stats/{pos}.php?year={year}"
    response = requests.get(url)

    # Make df (header=1 uses the second header row as the column names)
    df = pd.read_html(StringIO(response.text), header=1)[0]

    # Clean name and team data: "Josh Allen (BUF)" -> Player "Josh Allen", Tm "BUF"
    df.insert(1, 'Tm', df['Player'].str.rsplit(n=1).str[-1].str.slice(1, -1))
    df['Player'] = df['Player'].str.rsplit(n=1).str[0]

    # Get y (following year ppg)
    next_year = str(int(year) + 1)
    url = f"https://www.fantasypros.com/nfl/stats/{pos}.php?year={next_year}"
    response = requests.get(url)

    # Make df for the following season
    y = pd.read_html(StringIO(response.text), header=1)[0]

    # Assigned by row index, i.e. by position in the following season's table
    df['y'] = y['FPTS/G']

    return df
```

This is what an example basic dataset looked like:

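```python
df = read_new(2021, "QB")
df.head(3)
```

| | Rank | Tm | Player | CMP | ATT | PCT | YDS | Y/A | TD | INT | SACKS | ATT.1 | YDS.1 | TD.1 | FL | G | FPTS | FPTS/G | ROST | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | BUF | Josh Allen | 409 | 646 | 63.3 | 4407 | 6.8 | 36 | 15 | 26 | 122 | 763 | 6 | 3 | 17 | 417.7 | 24.6 | 99.9% | 25.2 |
| 1 | 2 | LAC | Justin Herbert | 443 | 672 | 65.9 | 5014 | 7.5 | 38 | 15 | 31 | 63 | 302 | 3 | 1 | 17 | 395.6 | 23.3 | 96.6% | 24.3 |
| 2 | 3 | FA | Tom Brady | 485 | 719 | 67.5 | 5316 | 7.4 | 43 | 12 | 22 | 28 | 81 | 2 | 3 | 17 | 386.7 | 22.7 | 1.8% | 25.6 |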
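
One detail worth checking in `read_new` is how the `y` column lines up with each row. Because `df['y'] = y['FPTS/G']` assigns by row index, row 0 of one season is paired with row 0 of the following season's table, whoever that player happens to be. The sketch below shows one possible way to pair each player with his own following-season value by merging on the name instead. The helper name `attach_next_season_target` and the toy numbers are made up for illustration, and the sketch assumes both frames already have cleaned `Player` names.

```python
import pandas as pd


def attach_next_season_target(current, following, key="Player", target="FPTS/G"):
    """Add a 'y' column holding each player's `target` value from the following season.

    Hypothetical helper: assumes both frames share a cleaned `key` column.
    """
    # One row per player: next season's value, renamed to 'y'
    next_vals = (
        following[[key, target]]
        .drop_duplicates(subset=key)
        .rename(columns={target: "y"})
    )
    # Drop any existing 'y' so the merge does not create y_x / y_y columns
    base = current.drop(columns="y", errors="ignore")
    # Left merge keeps every current-season row; players absent next season get NaN
    return base.merge(next_vals, on=key, how="left")


# Toy data (made-up numbers), deliberately listed in a different order each season
season_a = pd.DataFrame({"Player": ["A", "B", "C"], "FPTS/G": [20.0, 15.0, 10.0]})
season_b = pd.DataFrame({"Player": ["B", "A"], "FPTS/G": [18.0, 12.0]})

with_target = attach_next_season_target(season_a, season_b)
print(with_target)
```

Players who do not appear in the following season end up with a missing `y`; such rows could be removed with `dropna(subset=["y"])` before training.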


As you can see, each row represents a single NFL player. In this case, we pulled QB data from 2021, so each row represents a quarterback and his stats from that season. Many of the features describe player performance throughout the season. Some example stats include ATT (pass attempts), YDS (passing yards), TD (touchdowns), and CMP (completions). Our target variable, which we are trying to predict for future years, is FPTS/G (fantasy points per game). This is what it looks like:

```python
df['FPTS/G']
```

```
0     24.6
1     23.3
2     22.7
3     22.0
4     20.4
      ... 
78    -0.3
79    -0.1
80    -0.2
81    -0.5
82    -0.4
Name: FPTS/G, Length: 83, dtype: float64
```

We decided on fantasy points per game instead of total fantasy points to account for injuries and other potential limitations of an aggregate value. For example, in our first modeling approach, when we used total fantasy points, some of the top players received extremely low predictions for the following season. One example was Saquon Barkley, who is a top running back in the league. One year, he played in only two games due to a season-ending injury. However, in those two games, he averaged ~15 points per game. So although he recorded one of the lowest point totals for that year, he was one of the best players on a per-game basis.  
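
To make the difference concrete, here is a small sketch of how the two targets can rank players differently. The players and numbers are made up for illustration (loosely mirroring a two-game season at ~15 points per game); only the column names `G`, `FPTS`, and `FPTS/G` come from the scraped data.

```python
import pandas as pd

# Made-up numbers: one player hurt after two games, one healthy all season
seasons = pd.DataFrame(
    {
        "Player": ["Injured starter", "Healthy backup"],
        "G": [2, 16],
        "FPTS": [30.0, 96.0],
    }
)

# Fantasy points per game = total fantasy points / games played
seasons["FPTS/G"] = seasons["FPTS"] / seasons["G"]

# The same two players rank differently depending on the target
best_by_total = seasons.sort_values("FPTS", ascending=False).iloc[0]["Player"]
best_by_ppg = seasons.sort_values("FPTS/G", ascending=False).iloc[0]["Player"]
print(best_by_total, "|", best_by_ppg)
```

With totals, the healthy backup looks better; with the per-game value, the injured starter does, which is the behavior we wanted from the target.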

##### <u>Advanced Data</u>

We also pulled <a href="https://www.fantasypros.com/nfl/advanced-stats-qb.php?year=2021">advanced player data</a> from the same website, which brings more advanced calculations into our dataset. While many of these metrics are important, they are often skipped by the mainstream media due to their complicated nature or low appeal to a general audience. Because the two datasets came from the same website, we could use a similar approach for our web-scraping, and the merge was made easier by matching names. One area that required a little massaging was ensuring we did not have duplicate variables. As you will see in our basic data, there are multiple TD, YDS, and ATT columns. These represent passing vs. rushing statistics. As each position had slightly different data, it was important to invest time in cleaning and de-duplicating these features. Additionally, many columns were repeated in the merging process with the advanced dataset. To clean this data in an efficient and organized way, we wrote a set of helper functions in our main class file.
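
As a rough illustration of this cleaning step (not our actual class code), the sketch below renames the auto-suffixed duplicate columns and merges in only the advanced columns that are not already present. The function names, the `RUSH_` prefix, and the `AIR` column are hypothetical; the right prefix depends on the position, since the second set of stats is rushing for quarterbacks but may be something else for other positions.

```python
import pandas as pd


def label_duplicate_stats(df, second_prefix="RUSH_"):
    """Rename auto-suffixed duplicates such as 'YDS.1' to explicit names such as 'RUSH_YDS'."""
    renamed = {c: second_prefix + c[:-2] for c in df.columns if c.endswith(".1")}
    return df.rename(columns=renamed)


def merge_basic_advanced(basic, advanced, keys=("Player",)):
    """Inner-join advanced stats onto basic stats, skipping columns that basic already has."""
    keys = list(keys)
    # Keep the join keys plus any advanced column not already present in the basic data
    new_cols = [c for c in advanced.columns if c in keys or c not in basic.columns]
    return basic.merge(advanced[new_cols], on=keys, how="inner")


# Toy frames with made-up values; 'AIR' stands in for some advanced metric
basic = pd.DataFrame(
    {"Player": ["A", "B"], "YDS": [4000, 3500], "YDS.1": [300, 100], "G": [17, 16]}
)
advanced = pd.DataFrame({"Player": ["B", "A"], "G": [16, 17], "AIR": [1800, 2100]})

labeled = label_duplicate_stats(basic)
merged = merge_basic_advanced(labeled, advanced)
print(merged)
```

Joining on the name alone assumes names are unique within a positional group; adding the team to `keys` would be one way to tighten that.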

We trained our models on every year except the most recent, which let us test our results against the most recent year's worth of data. We evaluated our models based on mean squared error (MSE). If a model achieved a better MSE than the one we had previously saved as the best, we replaced the saved model with the new one. Our biggest hurdle was acquiring enough data to train an effective model: there are only 32 teams, and at some positions only one player per team scores points. We had to combine several years' worth of data to overcome this challenge.
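
The selection loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions rather than our actual modeling function: the `Year` column, the function name `pick_best_model`, and the three candidate models are all placeholders chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error


def pick_best_model(data, feature_cols, target="y", year_col="Year", candidates=None):
    """Train on all but the latest year, test on the latest year, keep the lowest-MSE model."""
    if candidates is None:
        # Placeholder candidates; any scikit-learn regressors could be listed here
        candidates = {
            "linear": LinearRegression(),
            "ridge": Ridge(alpha=1.0),
            "forest": RandomForestRegressor(n_estimators=50, random_state=0),
        }

    # Hold out the most recent season as the test set
    latest = data[year_col].max()
    train = data[data[year_col] < latest]
    test = data[data[year_col] == latest]

    best_name, best_model, best_mse = None, None, float("inf")
    for name, model in candidates.items():
        model.fit(train[feature_cols], train[target])
        mse = mean_squared_error(test[target], model.predict(test[feature_cols]))
        # Replace the saved model only when the new one has a strictly lower MSE
        if mse < best_mse:
            best_name, best_model, best_mse = name, model, mse
    return best_name, best_model, best_mse


# Synthetic data with an exactly linear target, just to exercise the function
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {
        "Year": np.repeat([2019, 2020, 2021], 40),
        "x1": rng.normal(size=120),
        "x2": rng.normal(size=120),
    }
)
toy["y"] = 2.0 * toy["x1"] + 3.0 * toy["x2"]

name, model, mse = pick_best_model(toy, ["x1", "x2"])
print(name, mse)
```

Running this once per positional group would give each position its own winning model.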

### <u>Our approach</u>


##### Data collection

A big problem we faced was a lack of data. We initially started with just one season of data to make our predictions. This quickly caused problems, as some positional groups were left with only ~30 players as observations after cleaning and preparing our data. Therefore, we switched our data source and layered ~10 seasons' worth of data onto each positional group. We also removed player names from the feature set, since names repeat across years and could otherwise have acted as a feature. This left us with hundreds of observations to work with.
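
Layering seasons can be sketched as a simple concatenation. In the hedged example below, `stack_seasons`, the `Year` column, and `fake_loader` are hypothetical names; `fake_loader` stands in for a scraper such as `read_new` so the example runs without network access, and the year range is only an example of roughly ten seasons.

```python
import pandas as pd


def stack_seasons(years, pos, loader):
    """Load one frame per season with `loader(year, pos)` and stack them into one dataset."""
    frames = []
    for year in years:
        season = loader(year, pos).copy()
        season["Year"] = year  # remember which season each row came from
        frames.append(season)
    combined = pd.concat(frames, ignore_index=True)
    # Drop the name so a model cannot key on player identity across seasons
    return combined.drop(columns=["Player"], errors="ignore")


def fake_loader(year, pos):
    """Stand-in for a scraper such as read_new; returns made-up rows."""
    return pd.DataFrame(
        {"Player": ["A", "B"], "FPTS/G": [10.0 + year % 10, 5.0], "y": [11.0, 6.0]}
    )


stacked = stack_seasons(range(2012, 2022), "QB", fake_loader)
print(stacked.shape)
```

Keeping a season column around also makes it straightforward to hold out the most recent year for testing.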


##### Preprocessing

Before we employed our models, we performed feature selection and normalization. First, because of our merged dataset, we had a large number of features to choose from. We relied on sklearn's SelectKBest algorithm for most of the heavy lifting. Before this process, however, we standardized our data so that features with naturally larger values would not be favored. Here is our feature selection function:

```python
from sklearn.feature_selection import SelectKBest, f_regression


def getBestFeatures(X, y, numFeatures=5):
    # Score each feature against y with a univariate F-test and keep the top k
    selector = SelectKBest(score_func=f_regression, k=numFeatures)
    selector.fit(X, y)

    # Names of the columns that survived selection
    selected_features = X.columns[selector.get_support()]

    X_selected = X[selected_features]

    return X_selected, selected_features
```

For each positional group, the 5 selected features were different and unique to that position. For example, 20+ yard receptions are much more important in predicting wide receiver performance than they are for quarterbacks, who pass the ball.  
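
For reference, the standardize-then-select sequence can be sketched end to end as below. This is an illustrative version, not our exact code: `standardize_and_select` is a hypothetical wrapper, `StandardScaler` is assumed as the standardization step, and the data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.preprocessing import StandardScaler


def standardize_and_select(X, y, num_features=5):
    """Standardize every column, then keep the `num_features` best by univariate F-score."""
    scaler = StandardScaler()
    # Rebuild a DataFrame so column names survive the scaling step
    X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns, index=X.index)

    selector = SelectKBest(score_func=f_regression, k=num_features)
    selector.fit(X_scaled, y)
    selected = X.columns[selector.get_support()]
    return X_scaled[selected], selected, scaler


# Synthetic features on very different scales; only 'a' and 'b' drive the target
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(
    {
        "a": rng.normal(0, 1, 200),
        "b": rng.normal(0, 1000, 200),
        "c": rng.normal(0, 1, 200),
        "d": rng.normal(0, 50, 200),
    }
)
y_toy = 3.0 * X_toy["a"] - 0.002 * X_toy["b"] + rng.normal(0, 0.1, 200)

X_sel, chosen, fitted_scaler = standardize_and_select(X_toy, y_toy, num_features=2)
print(list(chosen))
```

One caveat: `f_regression` scores are derived from each feature's correlation with the target, which does not change when a feature is rescaled, so the standardization step mostly matters for scale-sensitive models fitted afterwards. Returning the fitted scaler makes it possible to apply the same transformation to the held-out season.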

##### Modeling

* TODO - talk about modeling function, what it returns, which models we used

Next, we performed our modeling.


##### Performance evaluation

* TODO - talk about metrics




## Results

* TODO



## Concluding Discussion


* TODO


## Group Contributions Statement


<u>Ethan Coomber</u>:   

I spent a lot of time working on cleaning data and developing the model. We had to make sure we had sufficient data and I tried to ensure we had clean, usable data. Once I was able to ensure that, I spent my time working on developing a way to choose the best model. This took time as we had to research various models and determine what kind of model would be most effective in helping us predict performance. We then implemented the models we thought had potential and had to have a way to select the best one.


<u>Johnny Kantaros</u>:  

I initially worked on data collection (including the web-scraping), and then spent a lot of time on data cleaning and preprocessing. A large portion of this project was data collection, manipulation, and wrangling, and I definitely learned a lot about the various functionalities of Pandas and other frameworks. Finally, I helped Ethan add some models to our modeling function. Our team did a great job working collaboratively, so everyone learned about every part of the pipeline. In terms of this blog post, I wrote the introduction, the values statement, and part of the Materials and Methods section.

## Personal Reflection


### Sources

Robinson, Chelsea. "The Prediction of Fantasy Football." Louisiana Tech University, 14 May 2020.

Lutz, Roman. "Fantasy Football Prediction." University of Massachusetts, Amherst, 26 May 2015.

Hendricks, Benjamin. "Sports Analytics with Natural Language Processing: Using Crowd Sentiment to Help Pick Winners in Fantasy Football." Harvard University, Massachusetts, 2022.