# Week 3 -- Data Collection

This notebook contains the functions that scrape season-to-date offensive player stats, weekly and total fantasy points for players and team defenses, and each team's defensive stats. It also initializes the dataframes that the observations are added to. The fantasy points urls are built from the week number, which must be updated manually each week (see below).

I first import all necessary libraries and set my dataframe formats.

In [1]:
import pandas as pd
import numpy as np
import requests
import urllib
from bs4 import BeautifulSoup
import time
import warnings
warnings.simplefilter('ignore')

In [2]:
%%capture

# tqdm_notebook and tnrange are deprecated aliases; tqdm.notebook provides the same progress bars
from tqdm.notebook import tqdm, trange
tqdm.pandas()

In [20]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.float_format', lambda x: '%.1f' % x)

### Update Week Number

week_no must be set to the week of the football season that just ended. It feeds the fantasy points urls and names the week column in the week-over-week fantasy points dataframe.

In [18]:
week_no = 3
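
How week_no flows into the fantasy points urls can be sketched as follows. The helper name fantasy_urls and its year parameter are hypothetical, introduced only for illustration; the address and the year, start and end query parameters are the ones used in the collection cells of this notebook.

```python
from urllib.parse import urlencode

BASE = 'https://www.fantasypros.com/nfl/reports/leaders/'

def fantasy_urls(week_no, year=2020):
    """Hypothetical helper: return (season-to-date url, single-week url) for a week."""
    # Season total runs from week 1 through week_no
    total = BASE + '?' + urlencode({'year': year, 'start': 1, 'end': week_no})
    # Single week starts and ends on week_no
    week = BASE + '?' + urlencode({'year': year, 'start': week_no, 'end': week_no})
    return total, week

total_url, single_week_url = fantasy_urls(3)
print(total_url)
print(single_week_url)
```

With a helper like this, changing week_no in one place is enough to move both urls to the new week.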

### DataFrames

#### Offense

This dataframe will be used for the offensive player stats on the season.

In [19]:
df = pd.DataFrame(columns = ['Player', 'Team', 'Position', 'Age', 'Games', 'GamesStarted', 'CompletedPasses', 
                             'PassesAttempted', 'PassingYds', 'PassingTDs', 'Interceptions', 'RushingAttempts', 
                             'RushingYds', 'RushingYdspAtt', 'RushingTDs', 'Targeted', 'Receptions', 
                             'ReceivingYds', 'YdspReception', 'ReceivingTDs', 'Fumbles', 'LostFumbles', 'TtlTDs', 
                             'TwoPTConversions', 'TwoPTConversionPasses', 'FDFantasyPts', 'PositionRank', 
                             'OverallRank'])

#### Fantasy Points

These dataframes hold the scraped player/team fantasy points: one for the season total and one for the week. The name of the week column changes when the week_no variable is updated above.

In [17]:
week_fantasy = pd.DataFrame(columns = ['Player', 'Team', 'Position', 'Week_' + str(week_no)])

In [16]:
weekTTL_fantasy = pd.DataFrame(columns = ['Player', 'Team', 'Position', 'TTL'])

#### Defense

This dataframe will be used to collect a team's defensive stats as a total.

In [10]:
defense_df = pd.DataFrame(columns = ['Team', 'Ttl_Pts_Allowed', 'Ttl_Offense_Plays_Allowed', 'Yds_p_Play', 'Ttl_Yds', 
                                    'Rushing_Att', 'Rushing_Yds', 'Rushing_Yds_p_Att', 'Rushing_TDs', 'Passing_Att',
                                    'Passing_Yds_p_Att', 'Completions', 'Yds_p_Completion', 'Passing_Yds', 
                                     'Passing_TDs', 'RZ_Att', 'RZ_TD', 'RZ_Percent', 'Ttl_Turnovers', 'Interceptions',
                                    'Fumbles', 'Sacks'])

### Functions

In [21]:
def get_offense(url):
    """
    This function takes a url and passes it through requests.get and then uses BeautifulSoup to parse the html for 
    stats by player.
    
    Parameters:
        url: address of the page to scrape.
        
    Returns:
        None. One row per player, with season-to-date offensive stats, is added to the global dataframe df.
    
    """
    
    # Dataframe column -> data-stat attribute of the matching <td> cell
    stat_map = {'Player': 'player', 'Team': 'team', 'Position': 'fantasy_pos', 'Age': 'age', 'Games': 'g',
                'GamesStarted': 'gs', 'CompletedPasses': 'pass_cmp', 'PassesAttempted': 'pass_att',
                'PassingYds': 'pass_yds', 'PassingTDs': 'pass_td', 'Interceptions': 'pass_int',
                'RushingAttempts': 'rush_att', 'RushingYds': 'rush_yds', 'RushingYdspAtt': 'rush_yds_per_att',
                'RushingTDs': 'rush_td', 'Targeted': 'targets', 'Receptions': 'rec', 'ReceivingYds': 'rec_yds',
                'YdspReception': 'rec_yds_per_rec', 'ReceivingTDs': 'rec_td', 'Fumbles': 'fumbles',
                'LostFumbles': 'fumbles_lost', 'TtlTDs': 'all_td', 'TwoPTConversions': 'two_pt_md',
                'TwoPTConversionPasses': 'two_pt_pass', 'FDFantasyPts': 'fanduel_points',
                'PositionRank': 'fantasy_rank_pos', 'OverallRank': 'fantasy_rank_overall'}
    
    html = requests.get(url)
    soup = BeautifulSoup(html.content, 'html.parser')
    container = soup.find('tbody')
    
    # Search the table once per column instead of once per player per column
    cells = {col: container.find_all('td', {'data-stat': stat}) for col, stat in stat_map.items()}
    
    rows = []
    for i in tqdm(range(len(cells['Player']))):
        row = {col: tds[i].get_text() for col, tds in cells.items()}
        row['Player'] = row['Player'].rstrip(' ')
        rows.append(row)
    
    global df
    
    # DataFrame.append was removed in pandas 2.0, so build the new rows and concatenate once
    df = pd.concat([df, pd.DataFrame(rows, columns = df.columns)], ignore_index = True)
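
The data-stat lookup that get_offense relies on can be tried offline on a small fragment. The html below is made up for illustration (it is not copied from the site, and the names and numbers are placeholders). This sketch reads one table row at a time and keys each cell by its data-stat attribute, which is an alternative arrangement rather than the notebook's own method.

```python
from bs4 import BeautifulSoup

# Made-up fragment shaped like a stats table; not real site markup or real stats
sample_html = """
<table><tbody>
  <tr><td data-stat="player">Player A </td><td data-stat="team">AAA</td><td data-stat="rush_yds">120</td></tr>
  <tr><td data-stat="player">Player B</td><td data-stat="team">BBB</td><td data-stat="rush_yds">85</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
container = soup.find('tbody')

rows = []
for tr in container.find_all('tr'):
    # Key every cell in the row by its data-stat attribute
    cells = {td['data-stat']: td.get_text().rstrip(' ') for td in tr.find_all('td')}
    rows.append(cells)

print(rows)
```

Reading row by row keeps a player's values together even if one row is missing a cell, whereas indexing separate per-column lists would shift every later player by one.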

In [23]:
### week fantasy points

def get_lw_fantasy(url):
    """
    This function takes a url and passes it through requests.get and then uses BeautifulSoup to parse the html for 
    fantasy points by player and team when it comes to the defense.
    
    Parameters:
        url: website looking to scrape for information.
        
    Returns:
        Lists for names, teams, positions and week fantasy points that can be set equal to the columns for a 
        dataframe.
    
    """
    
    html = requests.get(url)
    soup = BeautifulSoup(html.content, 'html.parser')
    container = soup.find('tbody')
    
    names = [td.get_text() for td in container.find_all('td', class_ = 'player-label')]
    
    # The 'center' cells repeat every six cells; team, position and fantasy points
    # sit at offsets 1, 2 and 3 of each group
    centers = container.find_all('td', class_ = 'center')
    teams = [td.get_text() for td in centers[1::6]]
    positions = [td.get_text() for td in centers[2::6]]
    points = [td.get_text() for td in centers[3::6]]
    
    return names, teams, positions, points
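
get_lw_fantasy picks team, position and points out of one flat list of 'center' cells, at offsets 1, 2 and 3, repeating every six cells. A minimal sketch of that selection on placeholder values (the cell texts are made up, not scraped data):

```python
# Placeholder cell texts for two players, six 'center' cells each (values are made up)
centers = ['1', 'AAA', 'QB', '30.5', 'x', 'y',
           '2', 'BBB', 'RB', '22.1', 'x', 'y']

teams = centers[1::6]      # offset 1, then every sixth cell
positions = centers[2::6]  # offset 2, then every sixth cell
points = centers[3::6]     # offset 3, then every sixth cell

print(teams, positions, points)
```

If the site ever changes the number of 'center' cells per row, the stride of six has to change with it, so it is worth checking that the three lists have the same length as the list of names.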

In [42]:
def get_defense_stats(url):
    """
    This function takes a url and passes it through requests.get and then uses BeautifulSoup to parse the html for 
    stats for a team's defense.
    
    Parameters:
        url: address of the page to scrape.
        
    Returns:
        None. One row per team, with season-to-date defensive totals, is added to the global dataframe defense_df.
    
    """
    html = requests.get(url)
    soup = BeautifulSoup(html.content, 'html.parser')
    container = soup.find('tbody')
    
    def column(title):
        # Text of every cell with this data-title, with newlines and spaces removed
        return [td.get_text().replace('\n', '').replace(' ', '') 
                for td in container.find_all('td', {'data-title': title})]
    
    names = [span.get_text() for span in container.find_all('span', class_ = 'hidden-xs-down')]
    
    # Titles that appear once per team row: dataframe column -> data-title
    once = {'Ttl_Pts_Allowed': 'PTS', 'Ttl_Offense_Plays_Allowed': 'PLAYS', 'Yds_p_Play': 'YDS/PLAY',
            'Completions': 'COMP', 'Yds_p_Completion': 'YDS/COMP', 'RZ_Att': 'RZ ATT', 'RZ_TD': 'RZ TD',
            'RZ_Percent': 'RZ %', 'Ttl_Turnovers': 'TOV', 'Interceptions': 'INT', 'Fumbles': 'FUML',
            'Sacks': 'SACKS'}
    data = {col: column(title) for col, title in once.items()}
    data['Team'] = names
    
    # YDS appears three times per row (total, rushing, passing);
    # ATT, YDS/ATT and TD appear twice (rushing, passing)
    yds, att, ypa, tds = column('YDS'), column('ATT'), column('YDS/ATT'), column('TD')
    data['Ttl_Yds'], data['Rushing_Yds'], data['Passing_Yds'] = yds[0::3], yds[1::3], yds[2::3]
    data['Rushing_Att'], data['Passing_Att'] = att[0::2], att[1::2]
    data['Rushing_Yds_p_Att'], data['Passing_Yds_p_Att'] = ypa[0::2], ypa[1::2]
    data['Rushing_TDs'], data['Passing_TDs'] = tds[0::2], tds[1::2]
    
    global defense_df
    
    # DataFrame.append was removed in pandas 2.0, so build the new rows and concatenate once
    new_rows = pd.DataFrame({col: values[:len(names)] for col, values in data.items()}, 
                            columns = defense_df.columns)
    defense_df = pd.concat([defense_df, new_rows], ignore_index = True)

## Collecting Data

### Offense Player Stats

In [25]:
offense_url = 'https://www.pro-football-reference.com/years/2020/fantasy.htm'
get_offense(offense_url)

### Fantasy Pts by Week

In [26]:
fantTTL_url = 'https://www.fantasypros.com/nfl/reports/leaders/?year=2020&start=1&end=' + str(week_no)
weekTTL_fantasy['Player'], weekTTL_fantasy['Team'], weekTTL_fantasy['Position'], weekTTL_fantasy['TTL'] = get_lw_fantasy(fantTTL_url)

In [29]:
week_url = 'https://www.fantasypros.com/nfl/reports/leaders/?year=2020&start=' + str(week_no) + '&end=' + str(week_no)
week_fantasy['Player'], week_fantasy['Team'], week_fantasy['Position'], week_fantasy['Week_' + str(week_no)] = get_lw_fantasy(week_url)

#### Merging Ttl and Week Fantasy Pts DataFrames

In [31]:
fantPts = pd.merge(weekTTL_fantasy, week_fantasy, 'left', on = 'Player')
fantPts.drop(columns = ['Team_y', 'Position_y'], inplace = True)
fantPts.rename(columns = {'Team_x': 'Team', 'Position_x': 'Position'}, inplace = True)
fantPts.head()

| | Player | Team | Position | TTL | Week_3 |
|---|---|---|---|---|---|
| 0 | Russell Wilson | SEA | QB | 103.0 | 36.8 |
| 1 | Josh Allen | BUF | QB | 94.9 | 32.2 |
| 2 | Patrick Mahomes II | KC | QB | 87.9 | 40.0 |
| 3 | Dak Prescott | DAL | QB | 86.9 | 29.5 |
| 4 | Kyler Murray | ARI | QB | 85.1 | 24.7 |
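
Because the merge joins on Player alone, two players who share a name would each match both rows and the result would contain duplicates. One way to catch that, sketched here on made-up rows, is the validate argument of pd.merge, which raises a MergeError when the join keys are not unique. Adding Team to the key is shown as one possible fix; it assumes a player's team is the same in both tables.

```python
import pandas as pd

# Made-up rows: two different players share the name 'Player A'
total = pd.DataFrame({'Player': ['Player A', 'Player A', 'Player B'],
                      'Team': ['AAA', 'CCC', 'BBB'],
                      'TTL': ['50.0', '12.0', '40.0']})
week = pd.DataFrame({'Player': ['Player A', 'Player A', 'Player B'],
                     'Team': ['AAA', 'CCC', 'BBB'],
                     'Week_3': ['20.0', '4.0', '15.0']})

# Joining on the name alone pairs every 'Player A' with every other 'Player A'
by_name = pd.merge(total, week, how = 'left', on = 'Player')
print(len(by_name))

# validate='one_to_one' turns the silent duplication into an error
try:
    pd.merge(total, week, how = 'left', on = 'Player', validate = 'one_to_one')
    duplicated = False
except pd.errors.MergeError:
    duplicated = True
print(duplicated)

# Adding Team to the key keeps the two players apart
by_name_team = pd.merge(total, week, how = 'left', on = ['Player', 'Team'], validate = 'one_to_one')
print(len(by_name_team))
```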


### Defense Stats by Team

In [40]:
defense_url = 'https://www.lineups.com/nfl/team-stats/defense'
get_defense_stats(defense_url)

In [46]:
defense_url = 'https://www.lineups.com/nfl/team-stats/defense'
html = requests.get(defense_url)

In [47]:
soup = BeautifulSoup(html.content, 'html.parser')

In [50]:
container = soup.find('tbody')

container.find('span', class_ = 'hidden-xs-down')

<span _ngcontent-sc335="" class="hidden-xs-down">Indianapolis Colts</span>

In [51]:
defense_df.head()

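| | Team | Ttl_Pts_Allowed | Ttl_Offense_Plays_Allowed | Yds_p_Play | Ttl_Yds | Rushing_Att | Rushing_Yds | Rushing_Yds_p_Att | Rushing_TDs | Passing_Att | Passing_Yds_p_Att | Completions | Yds_p_Completion | Passing_Yds | Passing_TDs | RZ_Att | RZ_TD | RZ_Percent | Ttl_Turnovers | Interceptions | Fumbles | Sacks |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Baltimore Ravens | 22 | 125 | 4.9 | 610 | 44 | 189 | 4.3 | 0 | 75 | 6.2 | 46 | 10.1 | 464 | 2 | 3 | 2 | 66.7% | 5 | 2 | 3 | 6 |
| 1 | Kansas City Chiefs | 40 | 137 | 6.1 | 839 | 66 | 301 | 4.6 | 3 | 65 | 8.7 | 42 | 13.4 | 564 | 2 | 6 | 5 | 83.3% | 2 | 2 | 0 | 6 |
| 2 | Indianapolis Colts | 45 | 154 | 4.4 | 676 | 70 | 280 | 4.0 | 1 | 75 | 6.1 | 47 | 9.7 | 454 | 4 | 6 | 4 | 66.7% | 6 | 6 | 0 | 9 |
| 3 | San Francisco 49ers | 46 | 189 | 4.8 | 912 | 80 | 350 | 4.4 | 2 | 104 | 5.7 | 64 | 9.2 | 588 | 2 | 4 | 2 | 50% | 4 | 2 | 2 | 5 |
| 4 | Los Angeles Chargers | 57 | 188 | 5.4 | 1011 | 71 | 328 | 4.6 | 1 | 111 | 6.6 | 72 | 10.1 | 730 | 3 | 9 | 2 | 22.2% | 2 | 1 | 1 | 6 |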
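
Every value collected with get_text is a string, including percentages such as 66.7%, so the columns need converting before any arithmetic. A minimal sketch of one way to do that; the helper name to_number is hypothetical, and returning None for blank cells is an assumption about how empty values should be treated.

```python
def to_number(text):
    """Hypothetical helper: convert a scraped cell such as '125', '4.9' or '66.7%' to a float."""
    # Drop surrounding whitespace, a trailing percent sign and thousands separators
    cleaned = text.strip().rstrip('%').replace(',', '')
    if cleaned == '':
        return None  # assumption: blank cells become missing values
    return float(cleaned)

values = [to_number(v) for v in ['22', '4.9', '66.7%', '50%', '']]
print(values)
```

A helper like this could be applied column by column, for example with Series.map, once the scraped dataframes are complete.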


### Pickle DataFrames

In [35]:
df.to_pickle('player_stats')
fantPts.to_pickle('fantasy_weeks')
defense_df.to_pickle('defense_data')