### Which running backs have had 50+ carries in each of the last 5 years?

This notebook scrapes data from [Pro Football Reference](https://www.pro-football-reference.com/)'s Rushing and Receiving data table and places it into a data frame ([sample table](https://www.pro-football-reference.com/years/2017/rushing.htm)). The notebook uses the `Requests` and `Beautiful Soup 4` modules to gather the web page data. A `Player` class is used to create objects representing a single player. The program loops through the 5 most recent NFL seasons and gathers data for each season. A column for the player's fantasy points for the season is added to each data frame. The points total is based on a standard Yahoo! league (0 PPR). The scoring can be changed in the `FANTASY_SETTINGS_DICT` dictionary. A data frame for each season is placed in a list, and this list is concatenated into one big data frame. Then, various manipulations are made to the data frame to find all running backs who have had 50 or more rushing attempts in each of the last 5 years. The final data frame is saved as a `.csv` file.

**Note:** This dataset does not include 2-point conversions, which affects the fantasy point total for some players. Fortunately, running backs rarely score them (in 2017, 9 RBs had 1 conversion each and the rest had 0), so the effect on the running backs' point totals is small.
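As a rough illustration of how a season total can be derived from a settings dictionary, here is a minimal sketch. The dictionary keys and the `fantasy_points` helper are hypothetical; the real `FANTASY_SETTINGS_DICT` may use different keys and more scoring categories. The weights (0.1 points per scrimmage yard, 6 per touchdown) are assumptions that happen to reproduce the `fantasy_points` values shown in the data set at the end of this notebook.

```python
# Hypothetical scoring settings; the real FANTASY_SETTINGS_DICT may differ.
EXAMPLE_SETTINGS = {
    'scrimmage_yards': 0.1,  # points per rushing or receiving yard
    'rush_rec_td': 6,        # points per rushing or receiving touchdown
}


def fantasy_points(stats, settings):
    """Return the season point total, rounded to one decimal place."""
    total = sum(stats[key] * weight for key, weight in settings.items())
    return round(total, 1)


# Alfred Morris, 2013: 1353 scrimmage yards and 7 touchdowns.
morris_2013 = {'scrimmage_yards': 1353, 'rush_rec_td': 7}
print(fantasy_points(morris_2013, EXAMPLE_SETTINGS))  # 177.3
```

Changing a weight in the dictionary changes every player's total, which is the same idea as editing the scoring in `FANTASY_SETTINGS_DICT`.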

Import the needed modules:

In [1]:
import pandas as pd
import datetime
import rush_rec_scraper as rrs

from constants import TOTAL_RUSH_REC_HEADER, FANTASY_SETTINGS_DICT

In [2]:
def modify_data_frame(data_frame_list, num_years, player_url=False):
    """
    Take a list of data frames, one per season, and concatenate them into one large
    data frame. Then filter the data down to the running backs with 50 or more carries
    in each of the past num_years seasons. The modified data frame is returned.
    """
    # Concatenate the data frames to create one large data frame that has data for each season.
    big_df = pd.concat(data_frame_list)

    # Drop the url column unless player_url=True.
    if not player_url:
        big_df.drop('url', axis=1, inplace=True)

    # The concatenation creates duplicate indexes, so we will reset the index
    big_df.reset_index(inplace=True, drop=True)

    # Eliminate players with fewer than 50 rush attempts. The copy() avoids a
    # SettingWithCopyWarning when the position column is modified below.
    big_df = big_df[big_df['rush_attempts'] >= 50].copy()

    # Some players have a NaN value for their position. Any player with 50 or more
    # rush attempts but no position is likely a running back.
    big_df['position'] = big_df['position'].fillna('RB')

    # Some running backs have their position description in lowercase letters.
    # Use a lambda function to fix this inconsistency.
    big_df['position'] = big_df['position'].apply(lambda x: 'RB' if 'rb' in x else x)

    # Only interested in running backs
    big_df = big_df[big_df['position'] == 'RB']

    # Set the player's name and the season's year as the indexes.
    big_df = big_df.set_index(['name', 'year'])

    # Sort the data frame by the player's name
    big_df.sort_index(level='name', inplace=True)

    # Get each player's name
    names = big_df.index.get_level_values('name').unique()

    # Loop through each player. If they don't have num_years season's worth of data, then we
    # drop them from the data set.
    for name in names:
        if len(big_df.loc[name]) != num_years:
            big_df.drop(name, inplace=True)

    return big_df
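
The final loop in `modify_data_frame` drops players one at a time. As an alternative sketch (not the approach used above), the same "qualified in every season" filter can be written with `groupby().filter()`. The toy data and the `keep_full_history` helper are made up for illustration.

```python
import pandas as pd

# Toy data: player A has 3 qualifying seasons, player B only 2.
toy = pd.DataFrame({
    'name': ['A', 'A', 'A', 'B', 'B', 'B'],
    'year': [2015, 2016, 2017, 2015, 2016, 2017],
    'rush_attempts': [120, 80, 55, 200, 49, 150],
})


def keep_full_history(df, num_years, min_attempts=50):
    """Keep players with at least min_attempts carries in each of num_years seasons."""
    qualifying = df[df['rush_attempts'] >= min_attempts]
    # groupby().filter() keeps all rows of each group for which the function is True.
    full = qualifying.groupby('name').filter(lambda g: len(g) == num_years)
    return full.set_index(['name', 'year']).sort_index()


result = keep_full_history(toy, num_years=3)
print(result.index.get_level_values('name').unique().tolist())  # ['A']
```

Player B is removed entirely because the 49-carry season leaves only two qualifying rows.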


We can now execute the functions and create the final data frame:

In [3]:
# Get the current date to help figure out which year to start gathering data from.
now = datetime.datetime.now()

# Number of years of data.
num_years = 5

# Start with last year, since it is the most recent full season of data.
# The regular season ends in late December or early January. If this notebook is
# run in late December after the regular season has already ended, that
# just-completed season is not included, because the start year is always the
# previous calendar year.
# The season starts the week after Labor Day weekend. Labor Day is always the
# first Monday of September, and the NFL regular season is 17 weeks long
# (16 games plus a bye week per team). By my calculation, if Labor Day falls on
# September 1, the regular season ends on December 28.
start_year = now.year - 1

# The range() below counts down and stops before end_year, so the oldest season
# gathered is end_year + 1.
end_year = start_year - num_years

# First, we need to scrape the data from Pro-Football Reference.

# Holds each data frame scraped from Pro-Football Reference.
# Each data frame has data for a single season.
data_frame_list = []

# Iterate through each year of data and create a data frame for each one.
for year in range(start_year, end_year, -1):
    # Create url for given season.
    url = 'https://www.pro-football-reference.com/years/' + str(year) + '/rushing.htm'

    # Identify the table ID so we scrape from the correct table.
    table_id = 'rushing_and_receiving'

    # Scrape the data to get each player's web page elements.
    player_list = rrs.scrape_data(url, table_id)

    # Use the elements to create Player objects.
    list_of_player_dicts = rrs.create_player_objects(player_list, TOTAL_RUSH_REC_HEADER)

    # Create a data frame for the season
    df = rrs.make_data_frame(list_of_player_dicts, year, TOTAL_RUSH_REC_HEADER, FANTASY_SETTINGS_DICT)

    data_frame_list.append(df)

# Concatenate the data frames and clean the data.
big_df = modify_data_frame(data_frame_list, num_years, False)

# Uncomment the next line to write the data frame to a csv file in the current
# working directory.
# big_df.to_csv('5_seasons_50_carries.csv')
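
The year arithmetic in the cell above can be checked in isolation. The sketch below uses two hypothetical helpers: `season_years` mirrors the `start_year`/`end_year` logic, and `final_regular_season_sunday` tests the Labor Day estimate from the comments, assuming Week 1's Sunday is the one right after Labor Day and a 17-week season.

```python
import datetime


def season_years(current_year, num_years):
    """Seasons to gather, newest first, starting with the previous calendar year."""
    start_year = current_year - 1
    end_year = start_year - num_years  # excluded by range()
    return list(range(start_year, end_year, -1))


def final_regular_season_sunday(year, num_weeks=17):
    """Estimate the last regular-season Sunday under the assumptions stated above."""
    sept_first = datetime.date(year, 9, 1)
    # weekday() is 0 for Monday; step forward to the first Monday of September.
    labor_day = sept_first + datetime.timedelta(days=(7 - sept_first.weekday()) % 7)
    first_sunday = labor_day + datetime.timedelta(days=6)
    return first_sunday + datetime.timedelta(weeks=num_weeks - 1)


print(season_years(2018, 5))              # [2017, 2016, 2015, 2014, 2013]
print(final_regular_season_sunday(2014))  # 2014-12-28
```

In 2014, Labor Day fell on September 1, and the estimate lands on December 28, which agrees with the calculation in the comments.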

The dimensions of the data frame (number of rows, number of columns):

In [4]:
big_df.shape

(65, 25)

Number of running backs with 50 or more carries in each of the past 5 seasons (2013-2017):

In [5]:
len(big_df.index.get_level_values('name').unique())

13

The name of each player with 50 or more rush attempts in each of the last 5 seasons:

In [6]:
for name in big_df.index.get_level_values('name').unique():
    print(name)

Alfred Morris
Chris Ivory
DeMarco Murray
Doug Martin
Eddie Lacy
Frank Gore
Giovani Bernard
Lamar Miller
Le'Veon Bell
LeGarrette Blount
LeSean McCoy
Mark Ingram
Matt Forte


No running back has a season with fewer than 50 rush attempts:

In [7]:
big_df['rush_attempts'].min()

69
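
The checks in the last few cells can also be bundled into assertions. This is a minimal sketch on a toy frame indexed by `(name, year)` like `big_df`; the data and the value of `num_years` are made up.

```python
import pandas as pd

# Toy frame indexed the same way as big_df: (name, year).
toy = pd.DataFrame(
    {'rush_attempts': [120, 80, 200, 69]},
    index=pd.MultiIndex.from_tuples(
        [('A', 2016), ('A', 2017), ('B', 2016), ('B', 2017)],
        names=['name', 'year'],
    ),
)

num_years = 2
seasons_per_player = toy.groupby(level='name').size()

# Every remaining player should have exactly num_years rows...
assert (seasons_per_player == num_years).all()
# ...and no remaining season should fall under the 50-carry cutoff.
assert toy['rush_attempts'].min() >= 50

print(toy.index.get_level_values('name').nunique())  # 2
```

`nunique()` on the index level gives the player count directly, without building the list of unique names first.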

The data set:

In [8]:
big_df

name,year,team,age,position,games_played,games_started,rush_attempts,rush_yards,rush_td,longest_run,yards_per_rush,...,yards_per_rec,rec_td,longest_rec,rec_per_game,rec_yards_per_game,catch_percentage,scrimmage_yards,rush_rec_td,fumbles,fantasy_points
Alfred Morris,2013,WAS,25,RB,16,16,276,1275,7,45,4.6,...,8.7,0.0,17.0,0.6,4.9,75.0,1353,7,5,177.3
Alfred Morris,2014,WAS,26,RB,16,16,265,1074,8,30,4.1,...,9.1,0.0,26.0,1.1,9.7,65.4,1229,8,2,170.9
Alfred Morris,2015,WAS,27,RB,16,16,202,751,1,48,3.7,...,5.5,0.0,12.0,0.6,3.4,76.9,806,1,0,86.6
Alfred Morris,2016,DAL,28,RB,14,0,69,243,2,17,3.5,...,3.7,0.0,6.0,0.2,0.8,50.0,254,2,0,37.4
Alfred Morris,2017,DAL,29,RB,14,5,115,547,1,70,4.8,...,6.4,0.0,13.0,0.5,3.2,77.8,592,1,0,65.2
Chris Ivory,2013,NYJ,25,RB,15,6,182,833,3,69,4.6,...,5.0,0.0,12.0,0.1,0.7,28.6,843,3,2,102.3
Chris Ivory,2014,NYJ,26,RB,16,10,198,821,6,71,4.1,...,6.8,1.0,23.0,1.1,7.7,66.7,944,7,2,136.4
Chris Ivory,2015,NYJ,27,RB,15,14,247,1070,7,58,4.3,...,7.2,1.0,36.0,2.0,14.5,81.1,1287,8,4,176.7
Chris Ivory,2016,JAX,28,RB,11,1,117,439,3,42,3.8,...,9.3,0.0,37.0,1.8,16.9,71.4,625,3,5,80.5
Chris Ivory,2017,JAX,29,RB,14,3,112,382,1,34,3.4,...,8.3,1.0,29.0,1.5,12.5,75.0,557,2,2,67.7
