# Goal: Extract the statistics from the NFL website and place them into a CSV file.

Background: The National Football League (NFL) is the top professional league for American football. It consists of 32 teams from cities across the United States.

I want to get each NFL team's statistics from the NFL website so I can see which factors influence an NFL team's performance the most. To do this, I decided to check out data from the official NFL website: https://www.nfl.com/stats/team-stats/offense/passing/2020/reg/all

The webpage does not currently offer this table as a CSV download, so I'll need to scrape the webpage's HTML to extract the data I want.

I'll start by importing BeautifulSoup and requests, two modules that will help with web scraping, along with pandas and numpy, which will come in handy when I turn the extracted data into a dataframe.

Then I'll use requests to download the HTML of the official NFL stats webpage and BeautifulSoup to parse it:

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

html_file = requests.get('https://www.nfl.com/stats/team-stats/offense/passing/2020/reg/all').text
soup = BeautifulSoup(html_file, 'lxml')
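
The request above assumes the download succeeds. A slightly more defensive sketch, assuming a 10-second timeout is acceptable, checks the HTTP status before parsing. The helper name `fetch_soup` and its `timeout` default are my own choices for illustration, and `'html.parser'` is Python's built-in parser, swapped in here for `'lxml'` so that no extra install is needed:

```python
import requests
from bs4 import BeautifulSoup

STATS_URL = 'https://www.nfl.com/stats/team-stats/offense/passing/2020/reg/all'


def fetch_soup(url, timeout=10):
    """Download a page and parse it (hypothetical helper, not part of the notebook)."""
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return BeautifulSoup(response.text, 'html.parser')


# Usage (needs network access):
# soup = fetch_soup(STATS_URL)
```

With this version, a blocked or failed request stops the notebook with a clear error instead of silently producing an empty table later on.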

<img src="webpage_screenshots/webpage.PNG" class="img-responsive" alt="Screenshot of the NFL stats webpage" width="900" height="300">

Screenshot of the NFL stats webpage

Each NFL team's name is listed in the HTML file inside a 'div class="d3-o-club-fullname"' tag. I can use BeautifulSoup's find() function to search the HTML file for the first instance of this tag. Here I printed the result of that search:

In [2]:
team = soup.find('div', class_ = "d3-o-club-fullname").text
team.strip()

'Football Team'

I used the find() function above, which only returns the first search result. But using the find_all() function returns all of the search results, which means I can get the names of all of the NFL teams in the order they appear on the webpage:

In [3]:
team_names = soup.find_all('div', class_ = 'd3-o-club-fullname')
for i in range(len(team_names)):
    team_names[i] = team_names[i].text.strip()

team_names

['Football Team',
 'Buccaneers',
 'Seahawks',
 '49ers',
 'Chargers',
 'Steelers',
 'Cardinals',
 'Eagles',
 'Jets',
 'Giants',
 'Saints',
 'Patriots',
 'Vikings',
 'Dolphins',
 'Raiders',
 'Rams',
 'Chiefs',
 'Jaguars',
 'Colts',
 'Texans',
 'Titans',
 'Packers',
 'Lions',
 'Broncos',
 'Cowboys',
 'Browns',
 'Bengals',
 'Bears',
 'Panthers',
 'Bills',
 'Ravens',
 'Falcons']
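
The difference between find() and find_all(), along with a list comprehension that does the same job as the loop above, can be sketched on a small made-up snippet of HTML. The three 'div' tags below are a hand-written stand-in for the real page, not scraped output, and `'html.parser'` is Python's built-in parser, used here so the sketch runs without lxml:

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the real page; only the class name comes from the NFL site
sample_html = """
<div class="d3-o-club-fullname">
    Football Team
</div>
<div class="d3-o-club-fullname"> Buccaneers </div>
<div class="d3-o-club-fullname"> Seahawks </div>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# find() returns only the first matching tag (or None if nothing matches)
first = sample_soup.find('div', class_='d3-o-club-fullname')
print(first.text.strip())  # Football Team

# find_all() returns every match, in document order
names = [tag.text.strip() for tag in sample_soup.find_all('div', class_='d3-o-club-fullname')]
print(names)  # ['Football Team', 'Buccaneers', 'Seahawks']

# A class that is not on the page gives None from find() and an empty result from find_all()
print(sample_soup.find('div', class_='no-such-class'))      # None
print(sample_soup.find_all('div', class_='no-such-class'))  # []
```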

When I make my dataframe of all the data from the NFL webpage, I want the dataframe's columns to match the columns on the webpage. So I'll extract the names of the columns on the webpage and store them in a list called col_names:

In [4]:
col_names = soup.find_all('th', scope = 'col')
#col_names = soup.find_all('thead')
for i in range(len(col_names)):
    col_names[i] = col_names[i].text.strip()

col_names

['Att',
 'Cmp',
 'Cmp %',
 'Yds/Att',
 'Pass Yds',
 'TD',
 'INT',
 'Rate',
 '1st',
 '1st%',
 '20+',
 '40+',
 'Lng',
 'Sck',
 'SckY']

<img src="webpage_screenshots/Columns.PNG" class="img-responsive" alt="Screenshot of the NFL stats webpage. The image contains a circle around the columns in the webpage's stats table." width="900" height="300">

Circled in the image above is the part of the NFL webpage where I draw the names of my columns from.

Great! Each team's data sits inside its own 'tr' (table row) tag in the HTML. So I'll look inside one of these 'tr' tags to inspect its contents.

In [5]:
soup.find_all('tr')[1]

<tr>
<td scope="row" tabindex="0">
<div class="d3-o-club-info">
<div class="d3-o-club-logo">
<picture><!--[if IE 9]><video style="display: none; "><![endif]--><source media="(min-width:1024px)" srcset="https://static.www.nfl.com/t_q-best/league/api/clubs/logos/WAS"/><source media="(min-width:768px)" srcset="https://static.www.nfl.com/t_q-best/league/api/clubs/logos/WAS"/><source srcset="https://static.www.nfl.com/t_q-best/league/api/clubs/logos/WAS"/><!--[if IE 9]></video><![endif]--><img class="img-responsive" src="https://static.www.nfl.com/t_q-best/league/api/clubs/logos/WAS"/></picture>
</div>
<div class="d3-o-club-fullname">
            Football Team
            
          </div>
<div class="d3-o-club-shortname">
            Football Team
            
          </div>
</div>
</td>
<td>
                    601
                  </td>
<td>
                    389
                  </td>
<td>
                    64.7
                  </td>
<td>
                    6.3
              

<img src="webpage_screenshots/tr_tags.PNG" class="img-responsive" alt="Screenshot of the NFL stats webpage. The image contains circles around the rows for each team, which represent the tr tags in the HTML file" width="900" height="300">

Circled in the image above are the rows for the NFL teams in the webpage's stats table. These are represented by 'tr' tags in the HTML file.

From the HTML code above, I can now confirm that the team name is indeed listed under the 'div class="d3-o-club-fullname"' tag. In this case, we're working with Washington's NFL team, which played the 2020 season as the Washington Football Team and appears on the page as "Football Team".

(Note: the webpage may be updated by the time you read this, and the table may be reordered in the future. In this case, you may see a different team name in the 'div' tag. The format, however, will most likely remain the same.)

Inside the 'td' tags are the stats for the specific NFL team I'm working with, one stat per tag. Conveniently, the order of the stats matches the order of elements in col_names. So 601 represents 'Att', the number of pass attempts, 389 represents 'Cmp', the number of pass completions, etc.
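
Because the two orderings line up, pairing a column name with its value is a purely positional match. A minimal sketch using the first four column names and the first four values from the row shown above (the `_sample` variable names are mine, for illustration only):

```python
# First four column names and the first four values from the row shown above
col_names_sample = ['Att', 'Cmp', 'Cmp %', 'Yds/Att']
row_values_sample = [601.0, 389.0, 64.7, 6.3]

# zip() pairs items by position, so each stat is labeled by its column
labeled = dict(zip(col_names_sample, row_values_sample))
print(labeled)
# {'Att': 601.0, 'Cmp': 389.0, 'Cmp %': 64.7, 'Yds/Att': 6.3}
```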

The goal of my next block of code below is to get all of the stats for each NFL team and store them in a list.

I'll start by creating a list called teams. Since the webpage uses the 'tr' tag to denote each team, I will find_all() instances of the 'tr' tag. Each element of teams will contain the contents of one of these 'tr' tags. I'll also create a list called data, which I'll leave empty for now.

In [6]:
teams = soup.find_all('tr')
data = []

Great! Each team's element of teams will look a bit like the snippet of HTML for the Washington Football Team that we saw above. The only differences between elements will be the team name and the actual numbers that make up that team's stats. The HTML formatting remains identical.

I'll loop through each team in teams. Each time I do so, I'll create a team_stats list that contains all of the stats for that team. The team_stats list starts out empty. For each team, I'll loop through each 'td' tag, convert its contents into a float, then append it to the team_stats list.

At the end, the team_stats list will be full of stats for the particular team I'm working with. I'll take the team_stats list and append it to the 'data' list I created earlier.

To clarify:
The team_stats list is a list of floating-point numbers that represents the stats of an __individual__ team.
The data list is a list of team_stats lists, so it contains the stats of __every__ NFL team.

The loop then continues onto the next team and repeats the process above.

In [7]:

for i in range(1, len(teams)):
    # i starts from 1 because teams[0] doesn't actually contain the stats of any particular team

    team_stats = []

    # Go into the <td> cells to find the stats. scope=None skips the
    # team-name cell, which is marked scope="row"
    stats = teams[i].find_all('td', scope=None)
    for stat in stats:
        # Parse each stat into a format the dataframe can read (namely floats)
        stat = stat.text.strip()

        # Some values contain a 'T', which float() can't parse, so remove it
        if 'T' in stat:
            stat = stat.replace('T', '')

        team_stats.append(float(stat))

    data.append(team_stats)
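
The same parsing logic can be sketched as two small functions and tried on a hand-written row, which makes it easy to check without hitting the website. The helper names `parse_stat` and `parse_row` and the sample HTML are assumptions for illustration (the '75T' value is made up to exercise the 'T' handling). Instead of `scope=None`, this sketch spells the filter out with `has_attr('scope')`, which is intended to select the same cells:

```python
from bs4 import BeautifulSoup


def parse_stat(text):
    """Convert one cell's text to a float, dropping any 'T' marker first."""
    return float(text.strip().replace('T', ''))


def parse_row(row):
    """Return the stats in one <tr> as floats, skipping the team-name cell."""
    # Keep only the <td> cells without a scope attribute
    # (the team-name cell is marked scope="row")
    return [parse_stat(td.text) for td in row.find_all('td') if not td.has_attr('scope')]


# Hand-written row in the same shape as the page's HTML; the values are made up
sample_row_html = """
<table><tr>
  <td scope="row"><div class="d3-o-club-fullname"> Example Team </div></td>
  <td> 601 </td>
  <td> 64.7 </td>
  <td> 75T </td>
</tr></table>
"""
sample_row = BeautifulSoup(sample_row_html, 'html.parser').find('tr')
print(parse_row(sample_row))  # [601.0, 64.7, 75.0]
```

With helpers like these, the whole loop could be written as `data = [parse_row(row) for row in teams[1:]]`.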

Here's what the team_stats list looks like for the final team we analyzed:

In [8]:
team_stats

[628.0,
 408.0,
 65.0,
 7.4,
 4620.0,
 27.0,
 11.0,
 93.9,
 243.0,
 38.7,
 59.0,
 8.0,
 63.0,
 41.0,
 257.0]

And here's what the first five elements of the data list look like:

In [9]:
data[:5]

[[601.0,
  389.0,
  64.7,
  6.3,
  3796.0,
  16.0,
  16.0,
  80.1,
  184.0,
  30.6,
  40.0,
  6.0,
  68.0,
  50.0,
  331.0],
 [626.0,
  410.0,
  65.5,
  7.6,
  4776.0,
  42.0,
  12.0,
  102.8,
  238.0,
  38.0,
  67.0,
  12.0,
  50.0,
  22.0,
  150.0],
 [563.0,
  388.0,
  68.9,
  7.5,
  4245.0,
  40.0,
  13.0,
  105.0,
  216.0,
  38.4,
  45.0,
  11.0,
  62.0,
  48.0,
  304.0],
 [570.0,
  371.0,
  65.1,
  7.6,
  4320.0,
  25.0,
  17.0,
  90.1,
  217.0,
  38.1,
  55.0,
  10.0,
  76.0,
  39.0,
  287.0],
 [627.0,
  413.0,
  65.9,
  7.3,
  4548.0,
  31.0,
  10.0,
  97.0,
  226.0,
  36.0,
  54.0,
  10.0,
  72.0,
  34.0,
  219.0]]

The 'data' list is rather hard to read right now. Let's turn it into a form that's easier to read by converting it into a dataframe. The new dataframe containing the data list will be called 'nfl'.

In [10]:
data = np.array(data)

In [11]:
nfl = pd.DataFrame(data = data)
nfl

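|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 601.0 | 389.0 | 64.7 | 6.3 | 3796.0 | 16.0 | 16.0 | 80.1 | 184.0 | 30.6 | 40.0 | 6.0 | 68.0 | 50.0 | 331.0 |
| 1 | 626.0 | 410.0 | 65.5 | 7.6 | 4776.0 | 42.0 | 12.0 | 102.8 | 238.0 | 38.0 | 67.0 | 12.0 | 50.0 | 22.0 | 150.0 |
| 2 | 563.0 | 388.0 | 68.9 | 7.5 | 4245.0 | 40.0 | 13.0 | 105.0 | 216.0 | 38.4 | 45.0 | 11.0 | 62.0 | 48.0 | 304.0 |
| 3 | 570.0 | 371.0 | 65.1 | 7.6 | 4320.0 | 25.0 | 17.0 | 90.1 | 217.0 | 38.1 | 55.0 | 10.0 | 76.0 | 39.0 | 287.0 |
| 4 | 627.0 | 413.0 | 65.9 | 7.3 | 4548.0 | 31.0 | 10.0 | 97.0 | 226.0 | 36.0 | 54.0 | 10.0 | 72.0 | 34.0 | 219.0 |
| 5 | 656.0 | 428.0 | 65.2 | 6.3 | 4129.0 | 35.0 | 11.0 | 93.5 | 206.0 | 31.4 | 48.0 | 7.0 | 84.0 | 14.0 | 126.0 |
| 6 | 575.0 | 387.0 | 67.3 | 7.1 | 4102.0 | 27.0 | 13.0 | 94.1 | 211.0 | 36.7 | 45.0 | 14.0 | 80.0 | 29.0 | 186.0 |
| 7 | 598.0 | 334.0 | 55.9 | 6.2 | 3728.0 | 22.0 | 20.0 | 72.9 | 177.0 | 29.6 | 43.0 | 9.0 | 81.0 | 65.0 | 401.0 |
| 8 | 499.0 | 292.0 | 58.5 | 6.2 | 3115.0 | 16.0 | 14.0 | 75.9 | 146.0 | 29.3 | 39.0 | 6.0 | 69.0 | 43.0 | 319.0 |
| 9 | 517.0 | 321.0 | 62.1 | 6.5 | 3336.0 | 12.0 | 11.0 | 79.6 | 178.0 | 34.4 | 36.0 | 5.0 | 53.0 | 50.0 | 310.0 |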


I can make this dataframe more readable by specifying the names of each row and column. Fortunately, I already have the column names saved in the col_names list. Each row represents an NFL team, and the ordering of teams in the team_names list aligns exactly with the ordering of each team's stats, because both were read from the page in the same order. So I can use the team_names list to name my rows:

In [12]:
nfl.index = team_names
nfl.columns = col_names
nfl

|  | Att | Cmp | Cmp % | Yds/Att | Pass Yds | TD | INT | Rate | 1st | 1st% | 20+ | 40+ | Lng | Sck | SckY |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Football Team | 601.0 | 389.0 | 64.7 | 6.3 | 3796.0 | 16.0 | 16.0 | 80.1 | 184.0 | 30.6 | 40.0 | 6.0 | 68.0 | 50.0 | 331.0 |
| Buccaneers | 626.0 | 410.0 | 65.5 | 7.6 | 4776.0 | 42.0 | 12.0 | 102.8 | 238.0 | 38.0 | 67.0 | 12.0 | 50.0 | 22.0 | 150.0 |
| Seahawks | 563.0 | 388.0 | 68.9 | 7.5 | 4245.0 | 40.0 | 13.0 | 105.0 | 216.0 | 38.4 | 45.0 | 11.0 | 62.0 | 48.0 | 304.0 |
| 49ers | 570.0 | 371.0 | 65.1 | 7.6 | 4320.0 | 25.0 | 17.0 | 90.1 | 217.0 | 38.1 | 55.0 | 10.0 | 76.0 | 39.0 | 287.0 |
| Chargers | 627.0 | 413.0 | 65.9 | 7.3 | 4548.0 | 31.0 | 10.0 | 97.0 | 226.0 | 36.0 | 54.0 | 10.0 | 72.0 | 34.0 | 219.0 |
| Steelers | 656.0 | 428.0 | 65.2 | 6.3 | 4129.0 | 35.0 | 11.0 | 93.5 | 206.0 | 31.4 | 48.0 | 7.0 | 84.0 | 14.0 | 126.0 |
| Cardinals | 575.0 | 387.0 | 67.3 | 7.1 | 4102.0 | 27.0 | 13.0 | 94.1 | 211.0 | 36.7 | 45.0 | 14.0 | 80.0 | 29.0 | 186.0 |
| Eagles | 598.0 | 334.0 | 55.9 | 6.2 | 3728.0 | 22.0 | 20.0 | 72.9 | 177.0 | 29.6 | 43.0 | 9.0 | 81.0 | 65.0 | 401.0 |
| Jets | 499.0 | 292.0 | 58.5 | 6.2 | 3115.0 | 16.0 | 14.0 | 75.9 | 146.0 | 29.3 | 39.0 | 6.0 | 69.0 | 43.0 | 319.0 |
| Giants | 517.0 | 321.0 | 62.1 | 6.5 | 3336.0 | 12.0 | 11.0 | 79.6 | 178.0 | 34.4 | 36.0 | 5.0 | 53.0 | 50.0 | 310.0 |
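
As an aside, pandas can take the row and column labels at construction time, and it accepts a plain list of lists, so the NumPy conversion is optional. A minimal sketch using the first two teams and first four columns from the table above (the `_sample` names are mine), with length checks that would catch a mismatch if the page layout ever changed:

```python
import pandas as pd

# First two teams and first four columns, copied from the table above
team_names_sample = ['Football Team', 'Buccaneers']
col_names_sample = ['Att', 'Cmp', 'Cmp %', 'Yds/Att']
data_sample = [
    [601.0, 389.0, 64.7, 6.3],
    [626.0, 410.0, 65.5, 7.6],
]

# Guard against rows and labels drifting out of step
assert len(data_sample) == len(team_names_sample)
assert all(len(row) == len(col_names_sample) for row in data_sample)

# Labels are passed directly, so no separate .index/.columns assignment is needed
nfl_sample = pd.DataFrame(data_sample, index=team_names_sample, columns=col_names_sample)
print(nfl_sample.loc['Buccaneers', 'Cmp'])  # 410.0
```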


Ahhh, there we go! Finally, a dataframe that is clear and simple to read!

My final step is to save the dataframe I just made into a csv file. I chose to call my csv file 'nflstats.csv'.
After running the following line of code, a file called 'nflstats.csv' containing the stats for each NFL team will show up in the same folder as this notebook file.

In [13]:
nfl.to_csv(r'nflstats.csv')
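
To read the file back later with the team names restored as the row labels, `index_col=0` tells pandas that the first CSV column holds the index. A minimal round-trip sketch on a two-row sample, written to a temporary folder so that it does not touch the real 'nflstats.csv' (the sample frame and file name are assumptions for illustration):

```python
import os
import tempfile

import pandas as pd

# Two-row sample standing in for the full nfl dataframe
nfl_sample = pd.DataFrame(
    [[601.0, 389.0], [626.0, 410.0]],
    index=['Football Team', 'Buccaneers'],
    columns=['Att', 'Cmp'],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'nflstats_sample.csv')
    nfl_sample.to_csv(path)
    # index_col=0 turns the first CSV column back into the row labels
    restored = pd.read_csv(path, index_col=0)

print(restored.loc['Buccaneers', 'Cmp'])  # 410.0
```

Without `index_col=0`, the team names would come back as an ordinary column and the rows would be numbered from 0 again.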

Mission complete! The stats are now prepared, cleaned, and ready for data analysis. Thank you for reading until the end of this notebook. As always, if you have any suggestions, feel free to let me know!