# A Note on Saving Data
Web scrapers face a unique set of challenges, the most interesting of which may be how adversarial the work is. In general, webmasters do not want their data scraped, and they have a handful of tools for inconveniencing scrapers. The site we'll be scraping, Pro Football Reference, deters scrapers by temporarily blocking traffic from users that make too many requests in too short a time. To stay under that limit, we'll use Python's `time.sleep()` function, and, where possible, we'll also save webpages locally. This gives us ad hoc access to the data without waiting through numerous calls to `time.sleep()`.
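
The save-locally idea can be sketched as a small helper that checks the disk before touching the network. This is a minimal illustration, not the code used later in this notebook: `load_or_fetch` and its `fetch` argument are hypothetical names, and `fetch` stands in for whatever function actually downloads the page.

```python
from pathlib import Path


def load_or_fetch(path, fetch):
    """Return the page text saved at `path`, calling `fetch()` only if it is missing.

    `fetch` is a hypothetical zero-argument callable that downloads the page
    and returns its HTML as a string.
    """
    path = Path(path)
    if path.exists():
        # Saved copy found: read it and skip the network entirely.
        return path.read_text(encoding="utf-8")
    # No saved copy: download once, then store the result for later runs.
    text = fetch()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return text
```

Assuming `requests` is installed, a call might look like `load_or_fetch("../data/passing/2012.html", lambda: requests.get(url).text)`. Only the first call for a given path would reach the site, so only that call needs a pause afterwards.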

# Data Collection

Naturally, we must begin by collecting the data. This will involve scraping Pro Football Reference.

Let's define our imports and some helper functions to keep our code organized.

In [None]:
import time
import tqdm
import pandas as pd
import html5lib


def get_passing_url(year):
    return f"https://www.pro-football-reference.com/years/{year}/passing.htm"


def get_rushing_url(year):
    return f"https://www.pro-football-reference.com/years/{year}/rushing.htm"


def get_receiving_url(year):
    return f"https://www.pro-football-reference.com/years/{year}/receiving.htm"


def get_soup(url):
    import requests
    from bs4 import BeautifulSoup
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    return soup


def get_valid_subdirectories(soup):
    html_subdirectories = soup.find_all("a")
    valid_subdirectories = [link for link in html_subdirectories if '/players/' in str(link)]
    subdirectories = [link.get('href') for link in valid_subdirectories]
    return subdirectories


def make_directory(name):
    import os
    if not os.path.exists(name):
        os.makedirs(name)

Great. Now our first step will involve iterating through each season of interest and getting the webpages for that season's passing, rushing, and receiving stats. 

From each of those webpages, we will extract every link that leads to a player's individual stats page. We will store all of these links in a list called `player_links`, which we will then make unique.

#### A Note on Seasons

How many seasons should we be interested in? That's a difficult question to answer. The naive perspective suggests that we want *all* of the available data, and, while there would be some merit to that approach, anyone familiar with the NFL knows that the game has evolved over time; the way that the game was played in its earliest form has little in common with the game as it exists today, and historical data would feature trends that are not mirrored today. For that reason, we want to focus on "modern" data.

Still, defining "modern" is not easy. When discussing the NFL, "modern" generally refers to any season after the NFL-AFL merger in 1970. In most contexts, this is a good definition, but, for the purpose of this project, I don't feel that it is sufficient. The way that the game is played and the ways in which players are utilized continued to evolve after the merger, and the data reflects that. For that reason, I had to settle on a different definition of "modern."

For the purposes of this project, the modern NFL began in 2012. In 2011, the union representing the players (the NFL Players Association) negotiated a new collective bargaining agreement (CBA) with the team owners. The CBA had *massive* ramifications, many of which have surely gone unnoticed, but one of the more visible effects has been reduced interest in signing veteran players as minor contributors. To oversimplify a nuanced and fascinating topic: the CBA that took effect prior to the 2012 season limited earnings for rookie players, increasing their value relative to their veteran counterparts. The demand for low-level veterans plummeted because players on rookie contracts offer 90% of the performance for 50% of the price (numbers that I've fabricated to demonstrate the point). This shift meant that many positions (especially running back) skewed much younger after the CBA took effect. For this project, we want data that reflects current positional values, so we are using only data from after the 2011 CBA.

In [None]:
player_links = []
for year in tqdm.tqdm(range(2012, 2023)):
    # Passing
    make_directory("../data/passing")
    passing_url = get_passing_url(year)
    passing_soup = get_soup(passing_url)

    # Save a local copy so the page can be re-read without another request
    with open(f"../data/passing/{year}.html", "w", encoding="utf-8") as file:
        file.write(str(passing_soup))

    passing_table = passing_soup.find(id="passing")
    passing_subdirectories = get_valid_subdirectories(passing_table)
    player_links.extend(passing_subdirectories)
    # Rushing
    make_directory("../data/rushing")
    rushing_url = get_rushing_url(year)
    rushing_soup = get_soup(rushing_url)

    with open(f"../data/rushing/{year}.html", "w", encoding="utf-8") as file:
        file.write(str(rushing_soup))

    rushing_table = rushing_soup.find(id="rushing")
    rushing_subdirectories = get_valid_subdirectories(rushing_table)
    player_links.extend(rushing_subdirectories)
    # Receiving
    make_directory("../data/receiving")  # without this, open() below fails on a fresh checkout
    receiving_url = get_receiving_url(year)
    receiving_soup = get_soup(receiving_url)

    with open(f"../data/receiving/{year}.html", "w", encoding="utf-8") as file:
        file.write(str(receiving_soup))

    receiving_table = receiving_soup.find(id="receiving")
    receiving_subdirectories = get_valid_subdirectories(receiving_table)
    player_links.extend(receiving_subdirectories)
    # Wait 30 seconds to avoid getting blocked
    time.sleep(30)

# Make unique
player_links = list(set(player_links))

Great, we now have a list of subdirectories for every player who recorded passing, rushing, or receiving stats in any season from 2012 through 2022. We can append these subdirectories to a base URL to get links to each player's page.
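
Building a full link from a subdirectory can be sketched with the standard library's `urljoin`. The base URL is the one used throughout this notebook; `to_player_url` is a hypothetical helper name, and the example subdirectory is a made-up placeholder rather than a real player's page.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.pro-football-reference.com"


def to_player_url(subdirectory):
    """Join a site-relative player subdirectory onto the base URL."""
    return urljoin(BASE_URL, subdirectory)


# Placeholder path in the same shape as the scraped links (not a real player).
example_link = "/players/X/XxxxXx00.htm"
print(to_player_url(example_link))
```

Plain string concatenation, as in `f"{BASE_URL}{link}"`, gives the same result when every link starts with a slash. `urljoin` is just a little more forgiving about stray or missing slashes.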

Next we'll have to parse the HTML of each player's page and extract the data. Because doing so would involve several layers of data structure nesting, I opted instead to create a `Player` class to hold each player's data. Because "Tell, Don't Ask" programming is a good practice, and because we are good little object-oriented programmers, we'll instantiate the class with just a single subdirectory and have it handle the HTML parsing. Unfortunately, the format of each player's webpage varies depending on the player's position, so we need position-specific parsing functionality. To that end, `Player` will be an abstract class, and we'll create a subclass for each kind of position.

In [None]:
from abc import ABC, abstractmethod
from bs4 import BeautifulSoup
import requests

class Player(ABC):
    def __init__(self, link):
        self.url = f"https://www.pro-football-reference.com{link}"
        self.soup = self.get_soup()

    def get_soup(self):
        page = requests.get(self.url)
        soup = BeautifulSoup(page.content, 'html.parser')
        return soup

    @abstractmethod
    def some_player_method(self):
        pass
    
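    @classmethod
    def create(cls, link):
        # Factory: choose the subclass from the position listed on the player's page.
        # Note: this downloads the page once to read the position, and the
        # subclass constructor downloads it again in __init__.
        position = cls._get_position_static(link)
        if position == 'QB':
            return Quarterback(link)
        else:
            return SkillPosition(link)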

    @staticmethod
    def _get_position_static(link):
        url = f"https://www.pro-football-reference.com{link}"
        page = requests.get(url)
        soup = BeautifulSoup(page.content, 'html.parser')
        position_strong_tag = soup.find('strong', string='Position')
        if position_strong_tag and position_strong_tag.next_sibling:
            position = position_strong_tag.next_sibling.strip(': ').strip()
            return position
        return None


class Quarterback(Player):
    def __init__(self, link):
        super().__init__(link)
        self._set_throws()
        
    def some_player_method(self):
        print('Quarterback method')

    def _set_throws(self):
        self.throws = self.get_throws()

    def get_throws(self):
        throws_strong_tag = self.soup.find('strong', string='Throws:')
        if throws_strong_tag and throws_strong_tag.next_sibling:
            throws = throws_strong_tag.next_sibling.strip(': ').strip()
            return throws
        return None
    
class SkillPosition(Player):
    def some_player_method(self):
        # Implementation specific to SkillPosition
        print('SkillPosition method')

In [None]:
# Thwarted by pfr's rate limiting again...
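
One way to make the pauses between requests harder to forget is to route every download through a small throttle. The sketch below is an illustration under assumptions, not a tested fix for the block above: `Throttle` is a hypothetical helper, and the right interval for Pro Football Reference is not known here (this notebook has only used a 30-second pause per season).

```python
import time


class Throttle:
    """Enforce a minimum number of seconds between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        # clock and sleep are injectable so the class can be tested without waiting.
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if there was one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Too soon since the last call: sleep off the difference.
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

With this in place, calling `throttle.wait()` immediately before each `requests.get(...)` would space the requests at least `min_interval` seconds apart. Time spent parsing a page counts toward the interval, so the loop sleeps only as long as it needs to.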