### Nick Babcock - NFL Data Web Scraping

<u>**In this portion of the project, I use NFL.com to build a dataset of 2021-season NFL team statistics that are relevant to the sports gambling industry.**</u>

**Some issues I ran into while creating this dataset:**
1. pandas did not sort the numbers in numerical order because they were stored as strings. For example, a team with 1000 yards was sorted below a team with 900 yards. After discovering this, I converted the numeric entries to integers where necessary.
2. When the dataset was first created, the index defaulted to starting at 0. That did not make sense for a ranking, so I reset the index to start at 1 and dropped the originally generated index.
3. The tables included some categories (columns) that may not be useful to the user, so I deleted the ones I felt were cluttering the dataset.
4. The code used to generate this dataset is very repetitive and could be cleaned up for a less congested presentation.
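
The first three issues can be reproduced on a tiny frame. This is only a sketch: the teams `A`, `B`, `C` and their values are made up for illustration and are not scraped data.

```python
import pandas as pd

# Hypothetical sample rows; the values are strings, as scraped text is.
raw = pd.DataFrame({
    "Team": ["A", "B", "C"],
    "Pass Yds": ["900", "1000", "850"],
    "Lng": ["40", "55", "38"],
})

# Sorting strings compares character by character, so "1000" lands below "900".
as_text = raw.sort_values(by="Pass Yds", ascending=False)

# Convert to integers first, then sort, renumber from 1, and drop an unused column.
fixed = raw.copy()
fixed["Pass Yds"] = fixed["Pass Yds"].astype(int)
fixed = fixed.sort_values(by="Pass Yds", ascending=False).reset_index(drop=True)
fixed.index += 1
fixed = fixed.drop(columns=["Lng"])

print(list(as_text["Team"]))  # ['A', 'C', 'B']
print(list(fixed["Team"]))    # ['B', 'A', 'C']
```
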
<br>
<br>
This dataset is intended to help a user who is interested in sports gambling decide which matchups to bet on. In general, if team A runs the ball well and is playing team B, which struggles to defend the run, the user might want to bet on the total number of yards team A will rush for that day. Likewise, if team A throws the ball well and is playing team B, which struggles to defend the pass, the user might want to bet on the total number of yards team A will throw for that day.
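
As a rough illustration of that reasoning, the sketch below pairs a rushing offense with a rushing defense. The yardage figures are copied from the rushing tables further down in this notebook; the function name `favorable_rushing_matchup` and the top-five cutoff are my own assumptions, not an established betting rule.

```python
# Top five of each rushing table in this notebook (yards gained / yards allowed).
rush_offense = {"Browns": 1725, "Eagles": 1687, "Colts": 1627, "Ravens": 1510, "Titans": 1419}
rush_allowed = {"Lions": 1473, "Chargers": 1451, "Raiders": 1385, "Texans": 1335, "Jets": 1320}


def favorable_rushing_matchup(offense, defense):
    """Return True when a top rushing offense faces a defense that allows many rushing yards.

    Appearing in the dictionaries above is the (hypothetical) threshold for
    "great success" and "little success".
    """
    return offense in rush_offense and defense in rush_allowed


print(favorable_rushing_matchup("Browns", "Lions"))    # True
print(favorable_rushing_matchup("Browns", "Cowboys"))  # False
```

The same check could be written for the passing tables by swapping in passing yards gained and allowed.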

In [179]:
import requests
import urllib.request
import bs4
from bs4 import BeautifulSoup
import pandas as pd

<u>**Teams' offenses ranked by the most passing yards this season**</u>
<br>NOTE: Use this table to find out which teams have the most success throwing the ball.

In [178]:
html_text = urllib.request.urlopen("https://www.nfl.com/stats/team-stats/offense/passing/2021/reg/all").read()
soup = BeautifulSoup(html_text, 'html.parser')


table = soup.find('body').find('table')
header = [name.text for name in table.find('thead').find_all('th')]
data = {name: [] for name in header}
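for row in table.find('tbody').find_all("tr"):
    # Strip newlines and spaces from every cell in the row.
    cols = [col.text.replace('\n', '').replace(' ', '') for col in row.find_all('td')]
    # If the team cell holds the same name twice back to back, keep a single copy.
    half = len(cols[0]) // 2
    if cols[0][:half] == cols[0][half:]:
        cols[0] = cols[0][:half]
    for i, col in enumerate(cols):
        data[header[i]].append(col)

# Convert the sort column to integers so it sorts numerically, not as text.
data['Pass Yds'] = [int(item) for item in data['Pass Yds']]

df = pd.DataFrame.from_dict(data)
df = df.sort_values(by = ['Pass Yds'], ascending = False)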
df = df.reset_index(drop = True)
df.index += 1
df = df.drop(['Yds/Att', 'Rate', '1st', '1st%', '20+', '40+', 'Lng', 'Sck', 'SckY'], axis = 1)
df




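| Rank | Team | Att | Cmp | Cmp % | Pass Yds | TD | INT |
|---|---|---|---|---|---|---|---|
| 1 | Raiders | 413 | 278 | 67.3 | 3414 | 17 | 9 |
| 2 | Cowboys | 422 | 291 | 69.0 | 3335 | 24 | 8 |
| 3 | Chiefs | 466 | 306 | 65.7 | 3298 | 25 | 11 |
| 4 | Buccaneers | 432 | 291 | 67.4 | 3244 | 29 | 8 |
| 5 | Bills | 418 | 279 | 66.7 | 3099 | 25 | 11 |
| 6 | Cardinals | 356 | 261 | 73.3 | 3029 | 20 | 8 |
| 7 | Rams | 367 | 247 | 67.3 | 3021 | 24 | 9 |
| 8 | Chargers | 395 | 261 | 66.1 | 2927 | 22 | 8 |
| 9 | Packers | 369 | 243 | 65.9 | 2829 | 22 | 5 |
| 10 | Jets | 400 | 248 | 62.0 | 2793 | 15 | 18 |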
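
The scraping loop in the cell above checks whether the team cell is the same text repeated twice and, if so, keeps one copy. That step can be isolated as a small pure-Python helper. The name `collapse_doubled_name` is hypothetical, and the sample strings are illustrative inputs rather than captured page text.

```python
def collapse_doubled_name(text):
    """Return the first half of `text` if it is one string repeated twice, else `text` unchanged."""
    half, remainder = divmod(len(text), 2)
    # Only an even-length string can be two equal halves.
    if remainder == 0 and half > 0 and text[:half] == text[half:]:
        return text[:half]
    return text


print(collapse_doubled_name("RaidersRaiders"))  # Raiders
print(collapse_doubled_name("Jets"))            # Jets
```

Checking only the midpoint is enough, because two halves can match only when they have the same length.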


<u>**Teams' offenses ranked by the most rushing yards this season**</u>
<br>NOTE: Use this table to find out which teams have the most success running the ball.

In [171]:
html_text2 = urllib.request.urlopen("https://www.nfl.com/stats/team-stats/offense/rushing/2021/reg/all").read()
soup2 = BeautifulSoup(html_text2, 'html.parser')


table2 = soup2.find('body').find('table')
header2 = [name.text for name in table2.find('thead').find_all('th')]
data2 = {name: [] for name in header2}
for row in table2.find('tbody').find_all("tr"):
    # Strip newlines and spaces from every cell in the row.
    cols = [col.text.replace('\n', '').replace(' ', '') for col in row.find_all('td')]
    # If the team cell holds the same name twice back to back, keep a single copy.
    half = len(cols[0]) // 2
    if cols[0][:half] == cols[0][half:]:
        cols[0] = cols[0][:half]
    for i, col in enumerate(cols):
        data2[header2[i]].append(col)

data2['Rush Yds'] = [int(item) for item in data2['Rush Yds']]
       
df2 = pd.DataFrame.from_dict(data2)
df2 = df2.sort_values(by = ['Rush Yds'], ascending = False)
df2 = df2.reset_index(drop = True)
df2.index += 1
df2 = df2.drop(['Lng', 'Rush 1st', 'Rush 1st%', 'Rush FUM'], axis = 1)
df2

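| Rank | Team | Att | Rush Yds | YPC | TD | 20+ | 40+ |
|---|---|---|---|---|---|---|---|
| 1 | Browns | 329 | 1725 | 5.2 | 17 | 13 | 2 |
| 2 | Eagles | 338 | 1687 | 5.0 | 17 | 13 | 0 |
| 3 | Colts | 315 | 1627 | 5.2 | 16 | 13 | 3 |
| 4 | Ravens | 316 | 1510 | 4.8 | 12 | 8 | 0 |
| 5 | Titans | 340 | 1419 | 4.2 | 16 | 4 | 2 |
| 6 | Cowboys | 307 | 1402 | 4.6 | 10 | 6 | 1 |
| 7 | Bears | 315 | 1389 | 4.4 | 9 | 8 | 1 |
| 8 | Cardinals | 337 | 1353 | 4.0 | 17 | 5 | 2 |
| 9 | Bills | 285 | 1301 | 4.6 | 12 | 8 | 1 |
| 10 | Patriots | 310 | 1279 | 4.1 | 13 | 8 | 0 |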


<u>**Teams' offenses ranked by the most touchdowns scored this season**</u>
<br>NOTE: Use this table to find out which teams have the most success scoring touchdowns.

In [173]:
html_text3 = urllib.request.urlopen("https://www.nfl.com/stats/team-stats/offense/scoring/2021/reg/all").read()
soup3 = BeautifulSoup(html_text3, 'html.parser')


table3 = soup3.find('body').find('table')
header3 = [name.text for name in table3.find('thead').find_all('th')]
data3 = {name: [] for name in header3}
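for row in table3.find('tbody').find_all("tr"):
    # Strip newlines and spaces from every cell in the row.
    cols = [col.text.replace('\n', '').replace(' ', '') for col in row.find_all('td')]
    # If the team cell holds the same name twice back to back, keep a single copy.
    half = len(cols[0]) // 2
    if cols[0][:half] == cols[0][half:]:
        cols[0] = cols[0][:half]
    for i, col in enumerate(cols):
        data3[header3[i]].append(col)

# Convert the sort column to integers so it sorts numerically, not as text.
data3['Tot TD'] = [int(item) for item in data3['Tot TD']]

df3 = pd.DataFrame.from_dict(data3)
df3 = df3.sort_values(by = ['Tot TD'], ascending = False)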
df3 = df3.reset_index(drop = True)
df3.index += 1
df3 = df3.drop(['2-PT'], axis = 1)
df3



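| Rank | Team | Rsh TD | Rec TD | Tot TD |
|---|---|---|---|---|
| 1 | Cowboys | 10 | 24 | 39 |
| 2 | Buccaneers | 8 | 29 | 39 |
| 3 | Bills | 12 | 25 | 38 |
| 4 | Cardinals | 17 | 20 | 38 |
| 5 | Colts | 16 | 18 | 36 |
| 6 | Chiefs | 8 | 25 | 34 |
| 7 | Eagles | 17 | 13 | 34 |
| 8 | Saints | 9 | 23 | 34 |
| 9 | Titans | 16 | 14 | 34 |
| 10 | Patriots | 13 | 16 | 32 |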


<u>**Teams' defenses ranked by the most passing yards allowed this season**</u>
<br>NOTE: Use this table to find out which teams have the least success defending their opponents' passing plays.

In [174]:
html_text4 = urllib.request.urlopen("https://www.nfl.com/stats/team-stats/defense/passing/2021/reg/all").read()
soup4 = BeautifulSoup(html_text4, 'html.parser')


table4 = soup4.find('body').find('table')
header4 = [name.text for name in table4.find('thead').find_all('th')]
data4 = {name: [] for name in header4}
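for row in table4.find('tbody').find_all("tr"):
    # Strip newlines and spaces from every cell in the row.
    cols = [col.text.replace('\n', '').replace(' ', '') for col in row.find_all('td')]
    # If the team cell holds the same name twice back to back, keep a single copy.
    half = len(cols[0]) // 2
    if cols[0][:half] == cols[0][half:]:
        cols[0] = cols[0][:half]
    for i, col in enumerate(cols):
        data4[header4[i]].append(col)

# Convert the sort column to integers so it sorts numerically, not as text.
data4['Yds'] = [int(item) for item in data4['Yds']]

df4 = pd.DataFrame.from_dict(data4)
df4 = df4.sort_values(by = ['Yds'], ascending = False)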
df4 = df4.reset_index(drop = True)
df4.index += 1
df4 = df4.drop(['Att', 'Cmp', '1st', '1st%', 'Lng'], axis = 1)
df4

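| Rank | Team | Cmp % | Yds/Att | Yds | TD | INT | Rate | 20+ | 40+ | Sck |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Dolphins | 63.9 | 7.3 | 3016 | 20 | 7 | 94.1 | 48 | 5 | 23 |
| 2 | Cowboys | 62.9 | 7.7 | 2885 | 15 | 15 | 83.6 | 45 | 9 | 24 |
| 3 | Jets | 70.3 | 8.4 | 2822 | 17 | 3 | 108.2 | 46 | 8 | 20 |
| 4 | Ravens | 60.3 | 8.1 | 2811 | 16 | 5 | 95.1 | 47 | 11 | 22 |
| 5 | Seahawks | 66.3 | 7.2 | 2796 | 14 | 4 | 94.9 | 31 | 5 | 17 |
| 6 | Titans | 62.5 | 7.0 | 2786 | 18 | 9 | 88.7 | 38 | 7 | 27 |
| 7 | Chiefs | 66.9 | 7.7 | 2780 | 18 | 10 | 94.8 | 39 | 11 | 19 |
| 8 | Saints | 64.2 | 7.8 | 2770 | 17 | 13 | 88.6 | 41 | 9 | 25 |
| 9 | Colts | 66.1 | 7.5 | 2719 | 25 | 13 | 95.8 | 31 | 6 | 23 |
| 10 | FootballTeam | 68.1 | 7.7 | 2700 | 24 | 6 | 106.0 | 31 | 6 | 20 |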


<u>**Teams' defenses ranked by the most rushing yards allowed this season**</u>
<br>NOTE: Use this table to find out which teams have the least success defending their opponents' rushing plays.

In [176]:
html_text5 = urllib.request.urlopen("https://www.nfl.com/stats/team-stats/defense/rushing/2021/reg/all").read()
soup5 = BeautifulSoup(html_text5, 'html.parser')


table5 = soup5.find('body').find('table')
header5 = [name.text for name in table5.find('thead').find_all('th')]
data5 = {name: [] for name in header5}
for row in table5.find('tbody').find_all("tr"):
    # Strip newlines and spaces from every cell in the row.
    cols = [col.text.replace('\n', '').replace(' ', '') for col in row.find_all('td')]
    # If the team cell holds the same name twice back to back, keep a single copy.
    half = len(cols[0]) // 2
    if cols[0][:half] == cols[0][half:]:
        cols[0] = cols[0][:half]
    for i, col in enumerate(cols):
        data5[header5[i]].append(col)
        
data5['Rush Yds'] = [int(item) for item in data5['Rush Yds']]
       
df5 = pd.DataFrame.from_dict(data5)
df5 = df5.sort_values(by = ['Rush Yds'], ascending = False)
df5 = df5.reset_index(drop = True)
df5.index += 1
df5 = df5.drop(['Lng', 'Rush 1st', 'Rush 1st%'], axis = 1)
df5

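| Rank | Team | Att | Rush Yds | YPC | TD | 20+ | 40+ | Rush FUM |
|---|---|---|---|---|---|---|---|---|
| 1 | Lions | 345 | 1473 | 4.3 | 11 | 8 | 1 | 2 |
| 2 | Chargers | 308 | 1451 | 4.7 | 13 | 8 | 1 | 3 |
| 3 | Raiders | 315 | 1385 | 4.4 | 11 | 9 | 0 | 6 |
| 4 | Texans | 297 | 1335 | 4.5 | 15 | 4 | 1 | 6 |
| 5 | Jets | 289 | 1320 | 4.6 | 20 | 9 | 1 | 5 |
| 6 | Bears | 305 | 1304 | 4.3 | 10 | 6 | 0 | 3 |
| 7 | Eagles | 312 | 1282 | 4.1 | 12 | 6 | 0 | 5 |
| 8 | Vikings | 267 | 1270 | 4.8 | 8 | 3 | 0 | 0 |
| 9 | Steelers | 266 | 1266 | 4.8 | 8 | 12 | 2 | 3 |
| 10 | Panthers | 293 | 1260 | 4.3 | 8 | 5 | 1 | 8 |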
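
The five cells above repeat the same parse, sort, and trim steps, which is the repetition noted in issue 4. One possible cleanup is a single helper that takes the page HTML, the column to rank by, and the columns to drop. This is only a sketch: `stats_table` and `sample_html` are hypothetical names, the sample rows are made up so the example runs offline, and it uses `pd.to_numeric` where the cells above use `int`.

```python
import pandas as pd
from bs4 import BeautifulSoup


def stats_table(html, sort_col, drop_cols=()):
    """Parse the first <table> in `html` into a DataFrame ranked by `sort_col` (hypothetical helper)."""
    table = BeautifulSoup(html, "html.parser").find("table")
    header = [th.text.strip() for th in table.find("thead").find_all("th")]
    rows = []
    for tr in table.find("tbody").find_all("tr"):
        # Strip newlines and spaces from every cell, as the cells above do.
        cells = [td.text.replace("\n", "").replace(" ", "") for td in tr.find_all("td")]
        # Keep one copy when the team cell is the same name twice.
        half = len(cells[0]) // 2
        if half and cells[0][:half] == cells[0][half:]:
            cells[0] = cells[0][:half]
        rows.append(cells)
    df = pd.DataFrame(rows, columns=header)
    # Sort numerically rather than as text, then rank from 1.
    df[sort_col] = pd.to_numeric(df[sort_col])
    df = df.sort_values(by=sort_col, ascending=False).reset_index(drop=True)
    df.index += 1
    return df.drop(columns=list(drop_cols))


# Made-up HTML standing in for a downloaded stats page.
sample_html = """
<table>
  <thead><tr><th>Team</th><th>Rush Yds</th><th>Lng</th></tr></thead>
  <tbody>
    <tr><td>Alpha Alpha</td><td>900</td><td>40</td></tr>
    <tr><td>Beta Beta</td><td>1000</td><td>55</td></tr>
  </tbody>
</table>
"""

ranked = stats_table(sample_html, "Rush Yds", drop_cols=["Lng"])
print(ranked)
```

With a helper like this, each table would need only its URL, its sort column, and its list of dropped columns, for example by passing the result of `urllib.request.urlopen(url).read()` as `html`.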
