# Predicting Draft Round

For this project, I will attempt to classify draft picks based on a prospect's combine performance and final-year college statistics. I gathered my data from two sources:

        https://www.pro-football-reference.com/draft/
        https://www.sports-reference.com/cfb/players/
        
In this notebook you'll find the steps I took to gather my data by scraping both sites.

The first step is to load the necessary libraries.

In [1]:
import requests
import re
from bs4 import BeautifulSoup
import bs4
import pandas as pd
import numpy as np
import json
import urllib
import time

import warnings
warnings.simplefilter('ignore')

In [2]:
%%capture

from tqdm import tqdm_notebook as tqdm
from tqdm import tnrange
tqdm().pandas()

## Combine Results Functions

To get the combine performance data, I built the following functions. Each one pulls a single column out of the scraped table, so that later I can call them one by one to fill a list per column and then build a dictionary and a dataframe.

In [4]:
def names(data):
    # one player name per row-header cell
    return [cell.get_text() for cell in data]

In [5]:
def get_position(data):
    # position is cell 0 of the 12 <td> cells in each row
    return [cell.get_text() for cell in data[0::12]]

In [6]:
def get_school(data):
    # school is cell 1 of the 12 <td> cells in each row
    return [cell.get_text() for cell in data[1::12]]

In [7]:
# this column will be used to then search for the college stats

def get_stats_links(data):
    stats_links = []
    # data holds only the right-aligned cells: 9 per row, link cell first
    for cell in data[0::9]:
        link = str(cell).split('<')[2]
        link = link.lstrip('a href =')
        link = link.rstrip('> College Stats')
        link = link.replace('"', '')
        stats_links.append(link)
    return stats_links
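
One caveat about the cell above: `str.lstrip` and `str.rstrip` remove any characters from the given set, not a literal prefix or suffix, so the link survives only because the quotation marks around it stop the stripping. A minimal sketch of a more direct route is to pull the `href` value out with a regular expression. The `extract_href` helper and the two sample cells are hypothetical, made up for illustration rather than copied from the site.

```python
import re

# Matches the value of the first href="..." attribute in a snippet of HTML.
HREF_PATTERN = re.compile(r'href="([^"]*)"')

def extract_href(cell_html, missing=None):
    """Return the first href in cell_html, or `missing` if there is none."""
    match = HREF_PATTERN.search(cell_html)
    return match.group(1) if match else missing

# Hypothetical cells: one with a link, one without.
with_link = '<td class="right"><a href="https://example.com/players/a-1.html">College Stats</a></td>'
without_link = '<td class="right"></td>'

print(extract_href(with_link))
print(extract_href(without_link))
```

With the parsed tags already in hand, `cell.find('a')` followed by `.get('href')` would be the BeautifulSoup equivalent.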

In [8]:
def get_height(data):
    # height is cell 3 of the 12 <td> cells in each row
    return [cell.get_text() for cell in data[3::12]]

In [9]:
def get_weight(data):
    # weight is cell 4 of the 12 <td> cells in each row
    return [cell.get_text() for cell in data[4::12]]

In [10]:
def get_40yd(data):
    # 40-yard dash is cell 5 of the 12 <td> cells in each row
    return [cell.get_text() for cell in data[5::12]]

In [11]:
def get_vertical(data):
    # vertical jump is cell 6 of the 12 <td> cells in each row
    return [cell.get_text() for cell in data[6::12]]

In [12]:
def get_bench(data):
    # bench press is cell 7 of the 12 <td> cells in each row
    return [cell.get_text() for cell in data[7::12]]

In [13]:
def get_broadjump(data):
    # broad jump is cell 8 of the 12 <td> cells in each row
    return [cell.get_text() for cell in data[8::12]]

In [14]:
def get_3cone(data):
    # 3-cone drill is cell 9 of the 12 <td> cells in each row
    return [cell.get_text() for cell in data[9::12]]

In [15]:
def get_shuttle(data):
    # shuttle is cell 10 of the 12 <td> cells in each row
    return [cell.get_text() for cell in data[10::12]]

In [16]:
def get_draft(data):
    # draft info is cell 11 of the 12 <td> cells in each row
    return [cell.get_text() for cell in data[11::12]]
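
Every getter from `get_position` through `get_draft` works on the same flat list of cells and keeps every twelfth one from a different starting offset. As a rough sketch of that idea, a single parameterized helper can cover the whole family. The `every_nth` name and the toy `cells` list are hypothetical, and plain strings stand in for the BeautifulSoup tags.

```python
def every_nth(cells, offset, row_width=12):
    """Return cells[offset], cells[offset + row_width], ... as a list."""
    return list(cells[offset::row_width])

# Toy data: two "rows" of 12 cells each, flattened into one list.
cells = [f"r{row}c{col}" for row in range(2) for col in range(12)]

positions = every_nth(cells, 0)   # first cell of each row
schools = every_nth(cells, 1)     # second cell of each row
drafted = every_nth(cells, 11)    # last cell of each row

print(positions, schools, drafted)
```

In the notebook's terms, `get_height(player_deets)` would correspond to `[cell.get_text() for cell in every_nth(player_deets, 3)]`.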

In [395]:
# scraping pro-football-reference for data

def get_combine_history(links):
    SearchLinks = []
    PlayerNames = []
    Positions = []
    Schools = []
    StatsLinks = []
    Heights = []
    Weights = []
    _40yds = []
    Verticals = []
    Benched = []
    BroadJumps = []
    _3cones = []
    Shuttles = []
    Drafts = []
    
    for link in tqdm(links):
        url = link
        html_page = requests.get(url) 
        soup = BeautifulSoup(html_page.content, 'html.parser')

        player_container = soup.find('tbody')
        player_name = player_container.findAll('th', scope='row')
        player_deets = player_container.findAll('td')
        stats_finder = player_container.findAll('td', {'class':'right'})
           
        for name in names(player_name):
            PlayerNames.append(name)
            SearchLinks.append(str(link))

        for position in get_position(player_deets):
            Positions.append(position)
     
        for school in get_school(player_deets):
            Schools.append(school)
        
        for statlink in get_stats_links(stats_finder):
            StatsLinks.append(statlink)

        for height in get_height(player_deets):
            Heights.append(height)

        for weight in get_weight(player_deets):
            Weights.append(weight)       

        for _40yd in get_40yd(player_deets):
            _40yds.append(_40yd)
            
        for vertical in get_vertical(player_deets):
            Verticals.append(vertical)

        for bench in get_bench(player_deets):
            Benched.append(bench)

        for jump in get_broadjump(player_deets):
            BroadJumps.append(jump)

        for cone in get_3cone(player_deets):
            _3cones.append(cone)      

        for shuttle in get_shuttle(player_deets):
            Shuttles.append(shuttle)

        for draft in get_draft(player_deets):
            Drafts.append(draft)

        time.sleep(.5)

    combine_dictionary = {'Link': SearchLinks, 'PlayerName': PlayerNames, 'Position': Positions, 'School': Schools, 
                          'CollegeStats': StatsLinks, 'Height': Heights, 'Weight': Weights, '40yd': _40yds, 
                          'Vertical': Verticals, 'Bench': Benched, 'BroadJump': BroadJumps, '3Cone': _3cones, 
                          'Shuttle': Shuttles, 'Drafted': Drafts}
    return combine_dictionary

## Get Combine Results

For this project, I'm interested in all combines from 2000 to 2020. The URL follows the same pattern for each year, so I can build a list of URLs, one per year, for my scraping function to work through.

In [397]:
combine_years = range(2000, 2021)
links = []

for year in combine_years:
    link = 'https://www.pro-football-reference.com/draft/' + str(year) + '-combine.htm'
    links.append(link)

Now that the function is ready to go, I can pass through my links and set a new dataframe equal to the results. The function returns a dictionary that I'll then convert into a dataframe so it's easier to work with.

In [398]:
nfl_combine_df = pd.DataFrame(get_combine_history(links))
nfl_combine_df.head()


Unnamed: 0,Link,PlayerName,Position,School,CollegeStats,Height,Weight,40yd,Vertical,Bench,BroadJump,3Cone,Shuttle,Drafted
0,https://www.pro-football-reference.com/draft/2...,John Abraham,OLB,South Carolina,/td,6-4,252,4.55,,,,,,New York Jets / 1st / 13th pick / 2000
1,https://www.pro-football-reference.com/draft/2...,Shaun Alexander,RB,Alabama,https://www.sports-reference.com/cfb/players/s...,6-0,218,4.58,,,,,,Seattle Seahawks / 1st / 19th pick / 2000
2,https://www.pro-football-reference.com/draft/2...,Darnell Alford,OT,Boston Col.,/td,6-4,334,5.56,25.0,23.0,94.0,8.48,4.98,Kansas City Chiefs / 6th / 188th pick / 2000
3,https://www.pro-football-reference.com/draft/2...,Kyle Allamon,TE,Texas Tech,/td,6-2,253,4.97,29.0,,104.0,7.29,4.49,
4,https://www.pro-football-reference.com/draft/2...,Rashard Anderson,CB,Jackson State,/td,6-2,206,4.55,34.0,,123.0,7.18,4.15,Carolina Panthers / 1st / 23rd pick / 2000


In [399]:
nfl_combine_df.shape

(6893, 14)

Before I move on to getting the college stats, I have some columns I know I need to clean up. The first column is the 'Link' column. I added this to the function because I wanted to keep the year as a reference. I'll split out the year and drop the 'Link' column when the time comes.

In [400]:
# the last path segment looks like '2000-combine.htm'; keep the part before the first hyphen
# (str.rstrip would strip a set of characters, not the literal suffix)
nfl_combine_df['CombineYear'] = nfl_combine_df['Link'].map(lambda x: x.split('/')[-1].split('-')[0])
nfl_combine_df['CombineYear'].head()

0    2000
1    2000
2    2000
3    2000
4    2000
Name: CombineYear, dtype: object

In [23]:
nfl_combine_df['Drafted'].head()

0          New York Jets / 1st / 13th pick / 2000
1       Seattle Seahawks / 1st / 19th pick / 2000
2    Kansas City Chiefs / 6th / 188th pick / 2000
3                                                
4      Carolina Panthers / 1st / 23rd pick / 2000
Name: Drafted, dtype: object

The information on the draft is all stored in the 'Drafted' column. It includes the team that drafted the player, the round the prospect was picked in, the pick number itself and the year of the draft in question. I will split each piece out into a new column. If a player was not drafted, I will set the round, pick and draft year to 'Not Drafted'. Once I start my EDA I'll decide if the 'Not Drafted' players are important to keep. The original 'Drafted' column will be dropped later.

In [401]:
nfl_combine_df['Draft Team'] = nfl_combine_df['Drafted']
nfl_combine_df['Draft Team'] = nfl_combine_df['Draft Team'].map(lambda x: x.split('/'))
nfl_combine_df['Draft Team'] = nfl_combine_df['Draft Team'].map(lambda x: x[0])
nfl_combine_df['Draft Team'].head()

0         New York Jets 
1      Seattle Seahawks 
2    Kansas City Chiefs 
3                       
4     Carolina Panthers 
Name: Draft Team, dtype: object

In [402]:
nfl_combine_df['Round'] = nfl_combine_df['Drafted']
nfl_combine_df['Round'] = nfl_combine_df['Round'].map(lambda x: x.split('/'))
nfl_combine_df['Round'] = nfl_combine_df['Round'].map(lambda x: x[1] if len(x) > 1 else "Not Drafted")
nfl_combine_df['Round'].head()

0           1st 
1           1st 
2           6th 
3    Not Drafted
4           1st 
Name: Round, dtype: object

In [404]:
nfl_combine_df['Pick'] = nfl_combine_df['Drafted']
nfl_combine_df['Pick'] = nfl_combine_df['Pick'].map(lambda x: x.split('/'))
nfl_combine_df['Pick'] = nfl_combine_df['Pick'].map(lambda x: x[2] if len(x) > 2 else "Not Drafted")
nfl_combine_df['Pick'].head()

0      13th pick 
1      19th pick 
2     188th pick 
3     Not Drafted
4      23rd pick 
Name: Pick, dtype: object

In [406]:
nfl_combine_df['Draft Year'] = nfl_combine_df['Drafted']
nfl_combine_df['Draft Year'] = nfl_combine_df['Draft Year'].map(lambda x: x.split('/'))
nfl_combine_df['Draft Year'] = nfl_combine_df['Draft Year'].map(lambda x: x[3] if len(x) > 3 else "Not Drafted")
nfl_combine_df['Draft Year'].head()

0           2000
1           2000
2           2000
3    Not Drafted
4           2000
Name: Draft Year, dtype: object
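
The four cells above repeat the same split on `/` and differ only in which piece they keep. A minimal sketch of doing the split once per value follows. The `parse_drafted` name and the returned keys are assumptions chosen to mirror the columns created above, and unlike those cells it also strips the padding spaces around each piece.

```python
def parse_drafted(drafted, missing="Not Drafted"):
    """Split 'Team / Round / Pick / Year' into a dict of four fields."""
    parts = [part.strip() for part in drafted.split("/")]
    if len(parts) != 4:
        # Undrafted players have an empty string in this column.
        return {"Draft Team": "", "Round": missing, "Pick": missing, "Draft Year": missing}
    keys = ["Draft Team", "Round", "Pick", "Draft Year"]
    return dict(zip(keys, parts))

print(parse_drafted("New York Jets / 1st / 13th pick / 2000"))
print(parse_drafted(""))
```

Applied to the dataframe, something like `pd.DataFrame(nfl_combine_df['Drafted'].map(parse_drafted).tolist())` would yield the four columns in one pass.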

In [407]:
nfl_combine_df.head()

Unnamed: 0,Link,PlayerName,Position,School,CollegeStats,Height,Weight,40yd,Vertical,Bench,BroadJump,3Cone,Shuttle,Drafted,CombineYear,Draft Team,Round,Pick,Draft Year
0,https://www.pro-football-reference.com/draft/2...,John Abraham,OLB,South Carolina,/td,6-4,252,4.55,,,,,,New York Jets / 1st / 13th pick / 2000,2000,New York Jets,1st,13th pick,2000
1,https://www.pro-football-reference.com/draft/2...,Shaun Alexander,RB,Alabama,https://www.sports-reference.com/cfb/players/s...,6-0,218,4.58,,,,,,Seattle Seahawks / 1st / 19th pick / 2000,2000,Seattle Seahawks,1st,19th pick,2000
2,https://www.pro-football-reference.com/draft/2...,Darnell Alford,OT,Boston Col.,/td,6-4,334,5.56,25.0,23.0,94.0,8.48,4.98,Kansas City Chiefs / 6th / 188th pick / 2000,2000,Kansas City Chiefs,6th,188th pick,2000
3,https://www.pro-football-reference.com/draft/2...,Kyle Allamon,TE,Texas Tech,/td,6-2,253,4.97,29.0,,104.0,7.29,4.49,,2000,,Not Drafted,Not Drafted,Not Drafted
4,https://www.pro-football-reference.com/draft/2...,Rashard Anderson,CB,Jackson State,/td,6-2,206,4.55,34.0,,123.0,7.18,4.15,Carolina Panthers / 1st / 23rd pick / 2000,2000,Carolina Panthers,1st,23rd pick,2000


In [408]:
nfl_combine_df.shape

(6893, 19)

In [32]:
nfl_combine_df.columns

Index(['Link', 'PlayerName', 'Position', 'School', 'CollegeStats', 'Height',
       'Weight', '40yd', 'Vertical', 'Bench', 'BroadJump', '3Cone', 'Shuttle',
       'Drafted', 'CombineYear', 'Draft Team', 'Round', 'Pick', 'Draft Year'],
      dtype='object')

The final shape for this dataframe is 6,893 rows and 19 columns. I'll merge this dataframe with the data I collect from sports-reference. Before moving on to that step, I'll save my dataframe as a CSV to reference later.

In [409]:
nfl_combine_df.to_csv('2000-2020_nfl_combine_data.csv')

## College Stats

I was able to isolate the college stats links from my combine observations, and I'm going to use these as the links I search through for the college stats. Not all of my observations have a link to their college stats. I filter those out of my link list and will handle the missing values when I finally combine the two dataframes. Of my original 6,893 observations, 5,486 have a link to the college stats, so my data set will still be fairly robust.

In [410]:
college_stats = nfl_combine_df[nfl_combine_df['CollegeStats'] != '/td']
college_stats = list(college_stats['CollegeStats'])
print(len(college_stats))
college_stats

5486


['https://www.sports-reference.com/cfb/players/shaun-alexander-1.html',
 'https://www.sports-reference.com/cfb/players/lavar-arrington-1.html',
 'https://www.sports-reference.com/cfb/players/john-baker-3.html',
 'https://www.sports-reference.com/cfb/players/anthony-becht-1.html',
 'https://www.sports-reference.com/cfb/players/tom-brady-1.html',
 'https://www.sports-reference.com/cfb/players/ralph-brown-2.html',
 'https://www.sports-reference.com/cfb/players/courtney-brown-1.html',
 'https://www.sports-reference.com/cfb/players/marc-bulger-1.html',
 'https://www.sports-reference.com/cfb/players/plaxico-burress-1.html',
 'https://www.sports-reference.com/cfb/players/trung-canidate-1.html',
 'https://www.sports-reference.com/cfb/players/tyrone-carter-1.html',
 'https://www.sports-reference.com/cfb/players/kwame-cavil-1.html',
 'https://www.sports-reference.com/cfb/players/doug-chapman-1.html',
 'https://www.sports-reference.com/cfb/players/ike-charlton-1.html',
 'https://www.sports-refere

I'm going to go about appending values that I scrape a little differently this time around. I initialize a dataframe before I start my scrape. This will have all the columns I'll need to gather for later use. The website I'm grabbing from has tables for each of the following categories:
        
        * Defense & Fumbles
        * Passing
        * Rushing & Receptions
        * Kicking
        * Punt Returns
        * Scoring
        
I was able to get the stats from a player's final college year which I'll use in my classification model.
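
The scraper further down reads each stat with a `[-2]` index. Assuming each table ends with a career-total row, that index lands on the final season. A small sketch of that lookup with a fallback for short or empty columns follows. The helper name `final_season_value` and the sample values are hypothetical.

```python
def final_season_value(column_cells, default=0):
    """Return the second-to-last entry, or `default` if there are fewer than two."""
    if len(column_cells) < 2:
        return default
    return column_cells[-2]

# Hypothetical 'games' column: three seasons followed by a career total.
games = ["11", "12", "13", "36"]

print(final_season_value(games))
print(final_season_value([]))
```

The length check matters because indexing an empty list raises `IndexError`, which an `except AttributeError` clause would not catch.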

In [209]:
college_stats_df = pd.DataFrame(columns=['Link', 'Defense_Games', 'Solo_Tackles', 'Assisted_Tackles', 'Ttl_Tackles', 
                                         'Loss', 'Sacks', 'Defensive_Interceptions', 'Def_Int_Yds', 'Yds_per_Int', 
                                         'Pick_6', 'Defended_Passes', 'Recovered_Fumbles', 'Rec_Fumbles_Yds', 
                                         'Fumbles_Returned_TD', 'Forced_Fumbles', 'Passing_Games', 'Completions', 
                                         'Pass_Attempts', 'Completion_Percent', 'Pass_Yards', 'Pass_Yds_per_Attempt', 
                                         'Adj_Pass_Yds_per_Attempt', 'Pass_TDs', 'Pass_Interceptions', 
                                         'Passer_Rating', 'Rushing/Receiving_Games', 'Rush_Attempts', 'Rush_Yds', 
                                         'Rush_Yds_per_Attempt', 'Rush_TDs', 'Receptions', 'Rec_Yds', 
                                         'Rec_Yds_per_Reception', 'Rec_TDs', 'Plays_from_Scrimmage', 'Scrimmage_Yds', 
                                         'Scrimmage_Yds_per_Attempt', 'Scrimmage_TDs', 'Kicking_Games', 
                                         'XP_Made', 'XP_Attempts', 'XP_Percent', 'FG_Made', 'FG_Attempts', 
                                         'FG_Percent', 'TTL_Kicking_Points', '#Punts', 'Punt_Yds', 'Yds_per_Punt', 
                                         'PuntRet_Games', 'Punt_Returns', 'Punt_Return_Yds', 'Yrds_per_Return', 
                                         'Punt_Returned_for_TD', 'Kickoff_Returns', 'KO_Return_Yds', 
                                         'Yds_per_KO_Return', 'KO_Returned_for_TD', 'Scoring_Games', 'TD_Other', 
                                         'Ttl_TDs', '2PT_Converstions', 'Safety', 'TTL_Points'])

## College Stats Functions

Like gathering data for the combine results, I will use functions to get the data from the college stats pages. The first set creates a container for each of the tables on a given page. sports-reference gets a little funky here: after the first table, each table is actually wrapped in an HTML comment. Each function below looks through the children of the division under the designated header for a comment. If it finds one, the comment text is treated as HTML in its own right and passed through a new soup. If not, the function just creates a normal container from the division.

A huge thanks to [Mark von Oven](https://stackoverflow.com/questions/53543064/using-beautifulsoup-to-parse-a-big-comment) on StackOverflow for answering his own question and saving me all sorts of grief trying to access this data. I updated his code for my needs in this case.

In [412]:
def defense_container(soup):
        
    # defense container
    try:
        all_defense = soup.find('div', id='all_defense')
        for item in all_defense.children:
            if isinstance(item, bs4.element.Comment):
                html = item
        defense = BeautifulSoup(html, features="html.parser")
    except (AttributeError, NameError):
        # the div is missing, or it holds no comment: use the plain lookup instead
        defense = soup.find('div', id='all_defense')
        
    return defense

In [413]:
def passing_container(soup):

    # passing container
    try:
        all_passing = soup.find('div', id='all_passing')
        for item in all_passing.children:
            if isinstance(item, bs4.element.Comment):
                html = item
        passing = BeautifulSoup(html, features="html.parser")
    except (AttributeError, NameError):
        # the div is missing, or it holds no comment: use the plain lookup instead
        passing = soup.find('div', id='all_passing')
        
    return passing

In [414]:
def rushing_container(soup):

    # rushing container
    try:
        all_rushing = soup.find('div', id='all_rushing')
        for item in all_rushing.children:
            if isinstance(item, bs4.element.Comment):
                html = item
        rushing = BeautifulSoup(html, features="html.parser")
    except (AttributeError, NameError):
        # the div is missing, or it holds no comment: use the plain lookup instead
        rushing = soup.find('div', id='all_rushing')
        
    return rushing

In [415]:
def scoring_container(soup):
   
    # scoring container
    try:
        all_scoring = soup.find('div', id='all_scoring')
        for item in all_scoring.children:
            if isinstance(item, bs4.element.Comment):
                html = item
        scoring = BeautifulSoup(html, features="html.parser")
    except (AttributeError, NameError):
        # the div is missing, or it holds no comment: use the plain lookup instead
        scoring = soup.find('div', id='all_scoring')
        
    return scoring

In [416]:
def kicking_container(soup):

    # kicking container
    try:
        all_kicking = soup.find('div', id='all_kicking')
        for item in all_kicking.children:
            if isinstance(item, bs4.element.Comment):
                html = item
        kicking = BeautifulSoup(html, features="html.parser")
    except (AttributeError, NameError):
        # the div is missing, or it holds no comment: use the plain lookup instead
        kicking = soup.find('div', id='all_kicking')
        
    return kicking

In [417]:
def punt_ret_container(soup):
    
    # punt ret container
    try:
        all_punt_ret = soup.find('div', id='all_punt_ret')
        for item in all_punt_ret.children:
            if isinstance(item, bs4.element.Comment):
                html = item
        punt_ret = BeautifulSoup(html, features="html.parser")
    except (AttributeError, NameError):
        # the div is missing, or it holds no comment: use the plain lookup instead
        punt_ret = soup.find('div', id='all_punt_ret')
        
    return punt_ret
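
The six container functions differ only in the `id` they look up. To show the underlying idea without any third-party dependency, here is a sketch that uses the standard library's `html.parser` in place of BeautifulSoup. The `CommentCollector` class and the sample markup are hypothetical. It gathers the text of every comment found inside a `div` with a given `id`, which is where the hidden tables live.

```python
from html.parser import HTMLParser

class CommentCollector(HTMLParser):
    """Collect comment bodies that appear inside <div id=target_id>."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0          # greater than 0 while inside the target div
        self.comments = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            self.depth += 1     # nested div inside the target
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1      # entering the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_comment(self, data):
        if self.depth > 0:
            self.comments.append(data.strip())

# Hypothetical page: the passing table is wrapped in a comment.
page = """
<div id="all_defense"><table id="defense"></table></div>
<div id="all_passing"><!-- <table id="passing"><tr><td>42</td></tr></table> --></div>
"""

collector = CommentCollector("all_passing")
collector.feed(page)
print(collector.comments)
```

The recovered string can then be parsed again as its own document, which is what the second `BeautifulSoup(html, ...)` call does in the functions above.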

In [210]:
# web scraper

def get_college_stats(link_list):
    for link in tqdm(link_list):
        url = link
        html_page = requests.get(url) 
        soup = BeautifulSoup(html_page.content, 'html.parser')
        
        # containers
        defense = defense_container(soup)
        passing = passing_container(soup)
        rushing = rushing_container(soup)
        scoring = scoring_container(soup)
        kicking = kicking_container(soup)
        punt_ret = punt_ret_container(soup)
        
        # defense 
        # defense games
        try:
            def_games = defense.findAll('td', {'data-stat': 'g'})
            defense_games = def_games[-2].get_text()
                  
        except AttributeError:
            defense_games = 0
        
        # defense solo tackles
        try:
            solotackles = defense.findAll('td', {'data-stat': 'tackles_solo'})
            solotackles = solotackles[-2].get_text()

        except AttributeError:
            solotackles = 0
        
        # defense assisted tackles
        try:
            assistedtackles = defense.findAll('td', {'data-stat': 'tackles_assists'})
            assistedtackles = assistedtackles[-2].get_text()
        
        except AttributeError:
            assistedtackles = 0
            
        # defense tackles_total
        try:
            ttl_tackles = defense.findAll('td', {'data-stat': 'tackles_total'})
            ttl_tackles = ttl_tackles[-2].get_text()
        
        except AttributeError:
            ttl_tackles = 0
            
        # tackles for a loss
        try:
            tackle_loss = defense.findAll('td', {'data-stat': 'tackles_loss'})
            tackle_loss = tackle_loss[-2].get_text()
            
        except AttributeError:
            tackle_loss = 0
            
        # sacks
        try:
            sacks = defense.findAll('td', {'data-stat': 'sacks'})
            sacks = sacks[-2].get_text()
            
        except AttributeError:
            sacks = 0
            
        # defense interception
        try:
            defense_int = defense.findAll('td', {'data-stat': 'def_int'})
            defense_int = defense_int[-2].get_text()
            
        except AttributeError:
            defense_int = 0
            
        # defense interception yards
        try:
            int_yards = defense.findAll('td', {'data-stat': 'def_int_yds'})
            int_yards = int_yards[-2].get_text()
            
        except AttributeError:
            int_yards = 0
            
        # yards per interception
        try:
            yards_per_int = defense.findAll('td', {'data-stat': 'def_int_yds_per_int'})
            yards_per_int = yards_per_int[-2].get_text()
            
        except AttributeError:
            yards_per_int = 0
            
        # pick6
        try:
            pick6 = defense.findAll('td', {'data-stat': 'def_int_td'})
            pick6 = pick6[-2].get_text()
            
        except AttributeError:
            pick6 = 0
            
        # defended passes
        try:
            defended_pass = defense.findAll('td', {'data-stat': 'pass_defended'})
            defended_pass = defended_pass[-2].get_text()
            
        except AttributeError:
            defended_pass = 0
            
        # recovered fumbles
        try:
            recovered_fumbles = defense.findAll('td', {'data-stat': 'fumbles_rec'})
            recovered_fumbles = recovered_fumbles[-2].get_text()
            
        except AttributeError:
            recovered_fumbles = 0
            
        # fumble recovery yards
        try:
            fumble_recover_yrds = defense.findAll('td', {'data-stat': 'fumbles_rec_yds'})
            fumble_recover_yrds = fumble_recover_yrds[-2].get_text()
            
        except AttributeError:
            fumble_recover_yrds = 0

        # fumble recovery for a td
        try:
            recovered_fumbles_td = defense.findAll('td', {'data-stat': 'fumbles_rec_td'})
            recovered_fumbles_td = recovered_fumbles_td[-2].get_text()
            
        except AttributeError:
            recovered_fumbles_td = 0
            
        # forced fumbles
        try:
            forced_fumbles = defense.findAll('td', {'data-stat': 'fumbles_forced'})
            forced_fumbles = forced_fumbles[-2].get_text()
            
        except AttributeError:
            forced_fumbles = 0  
            
        # passing 
        # passing games
        try:
            passing_games = passing.findAll('td', {'data-stat': 'g'})
            passing_games = passing_games[-2].get_text()
            
        except AttributeError:
            passing_games = 0
           
        # completions
        try:
            completions = passing.findAll('td', {'data-stat': 'pass_cmp'})
            completions = completions[-2].get_text()
            
        except AttributeError:
            completions = 0    
            
        # attempts
        try:
            pass_attempts = passing.findAll('td', {'data-stat': 'pass_att'})
            pass_attempts = pass_attempts[-2].get_text()
            
        except AttributeError:
            pass_attempts = 0 
            
        # completion percentage
        try:
            complete_rate = passing.findAll('td', {'data-stat': 'pass_cmp_pct'})
            complete_rate = complete_rate[-2].get_text()
            
        except AttributeError:
            complete_rate = 0 

        # passing yards
        try:
            passing_yards = passing.findAll('td', {'data-stat': 'pass_yds'})
            passing_yards = passing_yards[-2].get_text()
            
        except AttributeError:
            passing_yards = 0 
            
        # yards per attempt
        try:
            yds_per_attempt = passing.findAll('td', {'data-stat': 'pass_yds_per_att'})
            yds_per_attempt = yds_per_attempt[-2].get_text()
            
        except AttributeError:
            yds_per_attempt = 0 
                            
        # adj yards per attempt
        try:
            adj_yds_per_attempt = passing.findAll('td', {'data-stat': 'adj_pass_yds_per_att'})
            adj_yds_per_attempt = adj_yds_per_attempt[-2].get_text()
            
        except AttributeError:
            adj_yds_per_attempt = 0 
        
        # pass tds
        try:
            pass_tds = passing.findAll('td', {'data-stat': 'pass_td'})
            pass_tds = pass_tds[-2].get_text()
            
        except AttributeError:
            pass_tds = 0 
                       
        # pass interceptions
        try:
            pass_ints = passing.findAll('td', {'data-stat': 'pass_int'})
            pass_ints = pass_ints[-2].get_text()
            
        except AttributeError:
            pass_ints = 0  
            
        # passer rating
        try:
            passer_rating = passing.findAll('td', {'data-stat': 'pass_rating'})
            passer_rating = passer_rating[-2].get_text()
            
        except AttributeError:
            passer_rating = 0    

        # Rushing_Receiving games
        try:
            rush_rec_games = rushing.findAll('td', {'data-stat': 'g'})
            rush_rec_games = rush_rec_games[-2].get_text()
            
        except AttributeError:
            rush_rec_games = 0
            
        # rushing attempts
        try:
            rushing_att = rushing.findAll('td', {'data-stat': 'rush_att'})
            rushing_att = rushing_att[-2].get_text()
            
        except AttributeError:
            rushing_att = 0
            
        # rushing yds
        try:
            rushing_yds = rushing.findAll('td', {'data-stat': 'rush_yds'})
            rushing_yds = rushing_yds[-2].get_text()
            
        except AttributeError:
            rushing_yds = 0
        
        # rushing yds per attempt
        try:
            rushing_yds_per_att = rushing.findAll('td', {'data-stat': 'rush_yds_per_att'})
            rushing_yds_per_att = rushing_yds_per_att[-2].get_text()
            
        except AttributeError:
            rushing_yds_per_att = 0
        
        # rushing tds
        try:
            rushing_tds = rushing.findAll('td', {'data-stat': 'rush_td'})
            rushing_tds = rushing_tds[-2].get_text()
            
        except AttributeError:
            rushing_tds = 0
        
        # receptions
        try:
            reception = rushing.findAll('td', {'data-stat': 'rec'})
            reception = reception[-2].get_text()
            
        except AttributeError:
            reception = 0
        
        # reception yds
        try:
            rec_yards = rushing.findAll('td', {'data-stat': 'rec_yds'})
            rec_yards = rec_yards[-2].get_text()
            
        except AttributeError:
            rec_yards = 0
        
        # yds per reception
        try:
            yds_per_reception = rushing.findAll('td', {'data-stat': 'rec_yds_per_rec'})
            yds_per_reception = yds_per_reception[-2].get_text()
            
        except AttributeError:
            yds_per_reception = 0
        
        # td receptions
        try:
            td_receptions = rushing.findAll('td', {'data-stat': 'rec_td'})
            td_receptions = td_receptions[-2].get_text()
            
        except AttributeError:
            td_receptions = 0
        
        # scrimmages
        try:
            scrimmages = rushing.findAll('td', {'data-stat': 'scrim_att'})
            scrimmages = scrimmages[-2].get_text()
            
        except AttributeError:
            scrimmages = 0
        
        # scrimmage yds
        try:
            scrimmage_yds = rushing.findAll('td', {'data-stat': 'scrim_yds'})
            scrimmage_yds = scrimmage_yds[-2].get_text()
            
        except AttributeError:
            scrimmage_yds = 0
        
        # scrimmage yds per attempt
        try:
            scrimmage_yds_per_att = rushing.findAll('td', {'data-stat': 'scrim_yds_per_att'})
            scrimmage_yds_per_att = scrimmage_yds_per_att[-2].get_text()
            
        except AttributeError:
            scrimmage_yds_per_att = 0
        
        # scrimmage tds
        try:
            scrimmage_tds = rushing.findAll('td', {'data-stat': 'scrim_td'})
            scrimmage_tds = scrimmage_tds[-2].get_text()
            
        except AttributeError:
            scrimmage_tds = 0
                
        # kicking games 
        try:
            kick_games = kicking.findAll('td', {'data-stat': 'g'})
            kick_games = kick_games[-2].get_text()
            
        except AttributeError:
            kick_games = 0
            
        # extra points 
        try:
            xp_made = kicking.findAll('td', {'data-stat': 'xpm'})
            xp_made = xp_made[-2].get_text()
            
        except AttributeError:
            xp_made = 0    
            
        # extra point attempts
        try:
            xp_attempt = kicking.findAll('td', {'data-stat': 'xpa'})
            xp_attempt = xp_attempt[-2].get_text()
            
        except AttributeError:
            xp_attempt = 0     
            
        # extra point percentage
        try:
            xp_perc = kicking.findAll('td', {'data-stat': 'xp_pct'})
            xp_perc = xp_perc[-2].get_text()
            
        except AttributeError:
            xp_perc = 0    
        
        # field goals
        try:
            field_goals = kicking.findAll('td', {'data-stat': 'fgm'})
            field_goals = field_goals[-2].get_text()
            
        except AttributeError:
            field_goals = 0 
        
        # field goal attempts
        try:
            field_goals_att = kicking.findAll('td', {'data-stat': 'fga'})
            field_goals_att = field_goals_att[-2].get_text()
            
        except AttributeError:
            field_goals_att = 0 
        
        # field goal percent
        try:
            fg_percent = kicking.findAll('td', {'data-stat': 'fg_pct'})
            fg_percent = fg_percent[-2].get_text()
            
        except AttributeError:
            fg_percent = 0
        
        # ttl kicking points
        try:
            ttl_kick_pts = kicking.findAll('td', {'data-stat': 'kick_points'})
            ttl_kick_pts = ttl_kick_pts[-2].get_text()
            
        except AttributeError:
            ttl_kick_pts = 0
        
        # num punts
        try:
            punts = kicking.findAll('td', {'data-stat': 'punt'})
            punts = punts[-2].get_text()
            
        except AttributeError:
            punts = 0
        
        # punt yards
        try:
            punt_yds = kicking.findAll('td', {'data-stat': 'punt_yds'})
            punt_yds = punt_yds[-2].get_text()
            
        except AttributeError:
            punt_yds = 0
        
        # yards per punt
        try:
            yds_per_punt = kicking.findAll('td', {'data-stat': 'punt_yds_per_punt'})
            yds_per_punt = yds_per_punt[-2].get_text()
            
        except AttributeError:
            yds_per_punt = 0
        
        # punt return games 
        try:
            puntret_games = punt_ret.findAll('td', {'data-stat': 'g'})
            puntret_games = puntret_games[-2].get_text()
        
        except AttributeError:
            puntret_games = 0
        
        # punts returned 
        try:
            ret_punts = punt_ret.findAll('td', {'data-stat': 'punt_ret'})
            ret_punts = ret_punts[-2].get_text()
        
        except AttributeError:
            ret_punts = 0 
        
        # returned yds
        try:
            ret_punts_yds = punt_ret.findAll('td', {'data-stat': 'punt_ret_yds'})
            ret_punts_yds = ret_punts_yds[-2].get_text()
        
        except AttributeError:
            ret_punts_yds = 0
        
        # yds per punt return
        try:
            yds_per_punt_return = punt_ret.findAll('td', {'data-stat': 'punt_ret_yds_per_ret'})
            yds_per_punt_return = yds_per_punt_return[-2].get_text()
        
        except AttributeError:
            yds_per_punt_return = 0
        
        # punt returned for a TD
        try:
            punt_ret_TD = punt_ret.findAll('td', {'data-stat': 'punt_ret_td'})
            punt_ret_TD = punt_ret_TD[-2].get_text()
        
        except AttributeError:
            punt_ret_TD = 0
        
        # kickoff returns
        try:
            kick_returns = punt_ret.findAll('td', {'data-stat': 'kick_ret'})
            kick_returns = kick_returns[-2].get_text()
        
        except AttributeError:
            kick_returns = 0
        
        # kickoff return yds
        try:
            kick_return_yds = punt_ret.findAll('td', {'data-stat': 'kick_ret_yds'})
            kick_return_yds = kick_return_yds[-2].get_text()
        
        except AttributeError:
            kick_return_yds = 0
        
        # yds per kickoff return
        try:
            yds_per_KO_ret = punt_ret.findAll('td', {'data-stat': 'kick_ret_yds_per_ret'})
            yds_per_KO_ret = yds_per_KO_ret[-2].get_text()
        
        except AttributeError:
            yds_per_KO_ret = 0
        
        # kickoff returned for a TD
        try:
            KO_ret_TD = punt_ret.findAll('td', {'data-stat': 'kick_ret_td'})
            KO_ret_TD = KO_ret_TD[-2].get_text()
        
        except AttributeError:
            KO_ret_TD = 0
        
        # scoring games 
        try:
            scoring_games = scoring.findAll('td', {'data-stat': 'g'})
            scoring_games = scoring_games[-2].get_text()
        
        except AttributeError:
            scoring_games = 0
        
        # other TDs scored 
        try:
            other_tds = scoring.findAll('td', {'data-stat': 'td_other'})
            other_tds = other_tds[-2].get_text()
        
        except AttributeError:
            other_tds = 0
        
        # ttl TDs scored 
        try:
            total_tds = scoring.findAll('td', {'data-stat': 'td_total'})
            total_tds = total_tds[-2].get_text()
        
        except AttributeError:
            total_tds = 0
       
        # 2 pt conversions 
        try:
            two_pt_conv = scoring.findAll('td', {'data-stat': 'two_pt_md'})
            two_pt_conv = two_pt_conv[-2].get_text()
        
        except AttributeError:
            two_pt_conv = 0
        
        # safety
        try:
            safety = scoring.findAll('td', {'data-stat': 'safety_md'})
            safety = safety[-2].get_text()
        
        except AttributeError:
            safety = 0
        
        # total points scored
        try:
            ttl_pts = scoring.findAll('td', {'data-stat': 'points'})
            ttl_pts = ttl_pts[-2].get_text()
        
        except AttributeError:
            ttl_pts = 0
        
        time.sleep(.5)
        
        global college_stats_df
        college_stats_df = college_stats_df.append({'Link': str(url),
                                                   'Defense_Games': defense_games, 
                                                   'Solo_Tackles': solotackles, 
                                                   'Assisted_Tackles': assistedtackles, 
                                                   'Ttl_Tackles': ttl_tackles, 
                                                   'Loss': tackle_loss, 
                                                   'Sacks': sacks, 
                                                   'Defensive_Interceptions': defense_int, 
                                                   'Def_Int_Yds': int_yards, 
                                                   'Yds_per_Int': yards_per_int, 
                                                   'Pick_6': pick6, 
                                                   'Defended_Passes': defended_pass, 
                                                   'Recovered_Fumbles': recovered_fumbles,
                                                   'Rec_Fumbles_Yds': fumble_recover_yrds, 
                                                   'Fumbles_Returned_TD': recovered_fumbles_td, 
                                                   'Forced_Fumbles': forced_fumbles,
                                                   'Passing_Games': passing_games, 
                                                   'Completions': completions,
                                                   'Pass_Attempts': pass_attempts,
                                                   'Completion_Percent': complete_rate,
                                                   'Pass_Yards': passing_yards,
                                                   'Pass_Yds_per_Attempt': yds_per_attempt,
                                                   'Adj_Pass_Yds_per_Attempt': adj_yds_per_attempt,
                                                   'Pass_TDs': pass_tds,
                                                   'Pass_Interceptions': pass_ints,
                                                   'Passer_Rating': passer_rating,
                                                   'Rushing/Receiving_Games': rush_rec_games,
                                                   'Rush_Attempts': rushing_att,
                                                   'Rush_Yds': rushing_yds,
                                                   'Rush_Yds_per_Attempt': rushing_yds_per_att, 
                                                   'Rush_TDs': rushing_tds,
                                                   'Receptions': reception,
                                                   'Rec_Yds': rec_yards,
                                                   'Rec_Yds_per_Reception': yds_per_reception,
                                                   'Rec_TDs': td_receptions,
                                                   'Plays_from_Scrimmage': scrimmages,
                                                   'Scrimmage_Yds': scrimmage_yds,
                                                   'Scrimmage_Yds_per_Attempt': scrimmage_yds_per_att,
                                                   'Scrimmage_TDs': scrimmage_tds,
                                                   'Kicking_Games': kick_games,
                                                   'XP_Made': xp_made,
                                                   'XP_Attempts': xp_attempt,
                                                   'XP_Percent': xp_perc,
                                                   'FG_Made': field_goals,
                                                   'FG_Attempts': field_goals_att,
                                                   'FG_Percent': fg_percent,
                                                   'TTL_Kicking_Points': ttl_kick_pts,
                                                   '#Punts': punts,
                                                   'Punt_Yds': punt_yds,
                                                   'Yds_per_Punt': yds_per_punt,
                                                   'PuntRet_Games': puntret_games,
                                                   'Punt_Returns': ret_punts,
                                                   'Punt_Return_Yds': ret_punts_yds,
                                                   'Yrds_per_Return': yds_per_punt_return,
                                                   'Punt_Returned_for_TD': punt_ret_TD,
                                                   'Kickoff_Returns': kick_returns,
                                                   'KO_Return_Yds': kick_return_yds,
                                                   'Yds_per_KO_Return': yds_per_KO_ret,
                                                   'KO_Returned_for_TD': KO_ret_TD,
                                                   'Scoring_Games': scoring_games,
                                                   'TD_Other': other_tds,
                                                   'Ttl_TDs': total_tds,
                                                   '2PT_Converstions': two_pt_conv,
                                                   'Safety': safety,
                                                   'TTL_Points': ttl_pts  
                                                   },
                                                  ignore_index=True)
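
The try/except blocks in this function all follow one pattern, so a small helper could express it once. The sketch below is an assumption on my part, not the code that produced this dataset: `stat_text`, `SCORING_STATS` and `scoring_row` are hypothetical names, `find_all` is the current BeautifulSoup spelling of `findAll`, and the helper also catches `IndexError`, which `[-2]` raises when fewer than two matching cells exist.

```python
def stat_text(table, stat, default=0):
    """Return the text of the second-to-last <td> whose data-stat is `stat`.

    `table` is a BeautifulSoup Tag, or None when the page has no such table.
    `default` is returned when the table or the cell is missing.
    """
    try:
        cells = table.find_all('td', {'data-stat': stat})
        return cells[-2].get_text()
    except (AttributeError, IndexError):
        # AttributeError: table is None
        # IndexError: fewer than two matching cells in the table
        return default


# Hypothetical mapping from dataframe column to data-stat attribute,
# shown for the scoring table only; the column names match the ones above.
SCORING_STATS = {
    'Scoring_Games': 'g',
    'TD_Other': 'td_other',
    'Ttl_TDs': 'td_total',
    '2PT_Converstions': 'two_pt_md',
    'Safety': 'safety_md',
    'TTL_Points': 'points',
}


def scoring_row(scoring):
    """Build the scoring columns of one dataframe row."""
    return {column: stat_text(scoring, stat)
            for column, stat in SCORING_STATS.items()}
```

One mapping per table would cover the remaining columns. Note also that `DataFrame.append` was removed in pandas 2.0, so on a current install the row dictionaries would need to be collected in a list and passed to `pd.DataFrame` once at the end.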


## Get College Stats Data

Now that these functions are complete, I can scrape all of the roughly 5.5k player pages for college statistics. I made some enemies at sports-reference and wasn't able to pass the full list of links at once, so I split the scraping into batches and ran the function on 50-100 links at a time by slicing the list. Because every call appends to the same global dataframe, the rows from each batch accumulate in one place.

In [337]:
get_college_stats(college_stats[5401:])
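
The manual slicing could also be wrapped in a loop. This is only a sketch under assumptions: `batches` and `scrape_in_batches` are hypothetical helpers, the 60-second pause is a guess rather than a limit published by the site, and `scrape` stands for a function like `get_college_stats` that takes a list of links.

```python
import time


def batches(links, size=50):
    """Yield (start_index, chunk) pairs that cover `links` in order."""
    for start in range(0, len(links), size):
        yield start, links[start:start + size]


def scrape_in_batches(links, scrape, start=0, size=50, pause=60):
    """Call `scrape(chunk)` for each batch, beginning at index `start`.

    Returns the index to resume from: len(links) when every batch ran,
    otherwise the first index of the batch that raised an exception.
    """
    for offset, chunk in batches(links[start:], size):
        try:
            scrape(chunk)
        except Exception as err:
            print(f'stopped at index {start + offset}: {err}')
            return start + offset
        time.sleep(pause)  # assumed pause between batches
    return len(links)
```

A call such as `scrape_in_batches(college_stats, get_college_stats, start=5401)` would mirror the cell above. If a batch fails partway through, some of its links are already in the dataframe, so the returned index should be compared against the last scraped link before resuming.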


The cells below are the checks I ran while scraping. Every now and then I would get stopped, so I compared the last links in my dataframe against my position in the list of links. I also compared the length of the dataframe to the length of the list to make sure I wasn't missing anything. Before I push to a final csv, I will check that all links are accounted for; these were just quick checks in the moment.

In [290]:
list(college_stats_df['Link'].tail())

['https://www.sports-reference.com/cfb/players/jonathan-dwyer-1.html',
 'https://www.sports-reference.com/cfb/players/marcus-easley-1.html',
 'https://www.sports-reference.com/cfb/players/aj-edds-1.html',
 'https://www.sports-reference.com/cfb/players/brody-eldridge-1.html',
 'https://www.sports-reference.com/cfb/players/dedrick-epps-1.html']

In [418]:
len(college_stats_df['Link'])

5486

In [419]:
len(college_stats)

5486

In [292]:
print(college_stats[2357])
print(college_stats[2358])

https://www.sports-reference.com/cfb/players/dedrick-epps-1.html
https://www.sports-reference.com/cfb/players/jacoby-ford-1.html
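
The comparison above can be turned into a small lookup that returns the next index to scrape. This is a minimal sketch, assuming every link appears only once in the list; `resume_index` and `is_prefix` are hypothetical names.

```python
def resume_index(scraped_links, all_links):
    """Return the index in `all_links` where scraping should continue."""
    if not scraped_links:
        return 0
    # list.index returns the first match, so this assumes unique links;
    # it raises ValueError if the last scraped link is not in the list.
    return all_links.index(scraped_links[-1]) + 1


def is_prefix(scraped_links, all_links):
    """True when the scraped links equal the start of `all_links`, in order."""
    return list(all_links[:len(scraped_links)]) == list(scraped_links)
```

With the values printed above, `resume_index(list(college_stats_df['Link']), college_stats)` would return 2358, the position of the first link not yet in the dataframe.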


I wanted to make sure I hadn't missed any of the links. I made a simple dataframe with a 'Link' column taken from my resulting dataframe, a 'List' column taken from the list of links I was looping through, and a 'Check' column holding a boolean that is True when the two values in a row match. Thankfully no False values came back, so I know I'm ready to proceed. I'm pretty sure I wouldn't be allowed to scrape one more thing even if I needed to, so it's time to move on. I save one last csv of my college stats, which I'll merge with the 'Combine' data in another notebook.

In [420]:
link_check = pd.DataFrame(columns = ['Link', 'List', 'Check'])
link_check['Link'] = college_stats_df['Link']
link_check['List'] = list(college_stats)
link_check['Check'] = link_check['Link'] == link_check['List']
link_check.head()

| | Link | List | Check |
|---|---|---|---|
| 0 | https://www.sports-reference.com/cfb/players/s... | https://www.sports-reference.com/cfb/players/s... | True |
| 1 | https://www.sports-reference.com/cfb/players/l... | https://www.sports-reference.com/cfb/players/l... | True |
| 2 | https://www.sports-reference.com/cfb/players/j... | https://www.sports-reference.com/cfb/players/j... | True |
| 3 | https://www.sports-reference.com/cfb/players/a... | https://www.sports-reference.com/cfb/players/a... | True |
| 4 | https://www.sports-reference.com/cfb/players/t... | https://www.sports-reference.com/cfb/players/t... | True |


In [421]:
link_check['Check'].value_counts()

True    5486
Name: Check, dtype: int64
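
The row-by-row check only says whether each position matches. Had any False values appeared, a count-based comparison could name the links involved. The sketch below is one way to do that with the standard library; `link_report` is a hypothetical helper, not part of the scraping code.

```python
from collections import Counter


def link_report(scraped_links, all_links):
    """Return (missing, extra) links, counting repeated links separately.

    missing: links expected from the list but not scraped often enough
    extra:   links scraped more often than they appear in the list
    """
    scraped = Counter(scraped_links)
    expected = Counter(all_links)
    # Counter subtraction keeps only positive counts.
    missing = list((expected - scraped).elements())
    extra = list((scraped - expected).elements())
    return missing, extra
```

Calling `link_report(list(college_stats_df['Link']), college_stats)` would return two empty lists when every link was scraped exactly as many times as it appears in the list.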

In [422]:
college_stats_df.to_csv('college_stats_final.csv')