# Web Scraping Using Requests & BeautifulSoup - Hari Patchigolla

# Installation

In [None]:
! pip install requests

In [None]:
! pip install beautifulsoup4

In [None]:
! pip install pandas

# Importing Packages

In [2]:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

# Making an NFL Quarterback Dataset

There are 4 main steps:
1) Find a data source (a webpage)
2) Get the HTML of the webpage
3) Parse through the HTML and locate where specific data is
      - CSS Selectors
4) Store your data
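
As a rough preview of steps 2 to 4, here is a minimal sketch that runs without a network connection. The HTML snippet, the player names, and the `records` list are made up for illustration; the real page is fetched and parsed in the sections that follow.

```python
from bs4 import BeautifulSoup

# Step 2 stand-in: a hard-coded HTML snippet instead of a real HTTP response.
html = """
<table>
  <tbody>
    <tr><td><a href="/players/player-a/">Player A</a></td><td>100</td></tr>
    <tr><td><a href="/players/player-b/">Player B</a></td><td>200</td></tr>
  </tbody>
</table>
"""

# Step 3: parse the HTML and locate the rows with a CSS selector.
soup = BeautifulSoup(html, "html.parser")
rows = soup.select("tbody > tr")

# Step 4: store the extracted values in a plain Python structure.
records = []
for row in rows:
    cells = row.select("td")
    records.append({"name": cells[0].a.string, "yards": cells[1].string})

print(records)
```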


## 1) Find a data source (a webpage)

For this workshop we will be scraping our data from: https://www.nfl.com/stats/player-stats/category/passing/2021/reg/all/passingyards/DESC

Let's explore the webpage a little bit:
- Notice all the available data
- Notice the change in the url when clicking on a specific player

## 2) Get the HTML of the webpage

[`requests.get()`](https://www.w3schools.com/python/ref_requests_get.asp)

This method sends an HTTP GET request to the specified URL and returns a response object.

In [3]:
base_url = "https://www.nfl.com/"
add_on = "stats/player-stats/category/passing/2021/reg/all/passingyards/DESC"

response = requests.get(base_url + add_on)

In [4]:
type(response)

requests.models.Response

A `200` status code means a successful response.

In [5]:
response.status_code

200
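
Instead of comparing the status code by hand, `requests` can raise an exception for error codes. The sketch below shows one way to wrap that check. The `describe_status` helper is hypothetical, and the two `Response` objects are built by hand only so the example runs offline; in the notebook you would pass the `response` returned by `requests.get()`.

```python
import requests

def describe_status(response):
    """Return a short label for a response, based on its status code."""
    try:
        # raise_for_status() raises requests.HTTPError for 4xx and 5xx codes.
        response.raise_for_status()
    except requests.HTTPError:
        return "failed"
    return "ok"

# Offline stand-ins for real responses, built by hand for illustration only.
ok_response = requests.Response()
ok_response.status_code = 200

missing_response = requests.Response()
missing_response.status_code = 404

print(describe_status(ok_response), describe_status(missing_response))
```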

`response.content` returns the HTML content of the page (as bytes).

In [7]:
# response.content

## 3) Parse through the HTML and locate where specific data is with the help of CSS Selectors

The `BeautifulSoup` object lets us do this

In [8]:
soup_player_stats = bs(response.content, 'html.parser') 

In [9]:
type(soup_player_stats)

bs4.BeautifulSoup

`.select()`

One of the methods provided by a `BeautifulSoup` object is `.select()`, which takes a CSS selector and returns a list of all matching elements.

In [10]:
tbody_lst = soup_player_stats.select('#main-content > section.d3-l-grid--outer.d3-l-section-row > div > div > div > div > table > tbody')
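
The long selector above combines a few basic building blocks. Here is a small sketch of those blocks on a made-up HTML snippet (the snippet and the `stats` and `note` class names are assumptions for illustration, not taken from nfl.com):

```python
from bs4 import BeautifulSoup

html = """
<div id="main-content">
  <section class="stats">
    <table><tbody><tr><td>5316</td></tr></tbody></table>
  </section>
  <p class="note">not part of the table</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_id = soup.select("#main-content")                      # "#" selects by id
by_class = soup.select(".note")                           # "." selects by class
by_child = soup.select("section.stats > table > tbody")   # ">" means direct child
first_cell = soup.select_one("td")                        # first match, or None

print(len(by_id), len(by_class), len(by_child), first_cell.string)
```

`.select()` always returns a list, even for a single match, which is why the notebook indexes the result with `[0]`.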

In [12]:
# tbody_lst

In [13]:
len(tbody_lst)

1

The length of the list is only `1` because we selected only the `tbody` HTML tag. Let's fix this by also selecting the `tr` tags inside it.

In [14]:
tr_lst = tbody_lst[0].select('tr')
len(tr_lst)

25
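
Selecting `tbody` first and then `tr` is one option; extending the selector with `> tr` should give the same rows in one step. A minimal sketch on a made-up three-row table:

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tbody>
    <tr><td>row 1</td></tr>
    <tr><td>row 2</td></tr>
    <tr><td>row 3</td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Two steps, as in the notebook: find the tbody, then the rows inside it.
tbody_matches = soup.select("table > tbody")
rows_two_steps = tbody_matches[0].select("tr")

# One step: let the selector descend all the way to the rows.
rows_one_step = soup.select("table > tbody > tr")

print(len(tbody_matches), len(rows_two_steps), len(rows_one_step))
```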

In [16]:
# tr_lst

Now let's focus on the first element of the dataset, which is the row for Tom Brady.

Notice how there are a bunch of `td` tags

In [18]:
td_lst = tr_lst[0].select('td')
td_lst

[<td scope="row" tabindex="0"><div class="d3-o-media-object d3-o-media-object--vertical-center"><figure class="d3-o-media-object__figure d3-o-player-headshot"><picture is-lazy="/t_lazy"><!--[if IE 9]><video style="display:none"><![endif]--><source media="(min-width:1024px)" srcset="https://static.www.nfl.com/image/private/t_headshot_desktop/t_lazy/f_auto/league/q7dpdlxyu5rs05rgh1le 1x, https://static.www.nfl.com/image/private/t_headshot_desktop_2x/t_lazy/f_auto/league/q7dpdlxyu5rs05rgh1le 2x, https://static.www.nfl.com/image/private/t_headshot_desktop_3x/t_lazy/f_auto/league/q7dpdlxyu5rs05rgh1le"/><source media="(min-width:768px)" srcset="https://static.www.nfl.com/image/private/t_headshot_tablet/t_lazy/f_auto/league/q7dpdlxyu5rs05rgh1le 1x, https://static.www.nfl.com/image/private/t_headshot_tablet_2x/t_lazy/f_auto/league/q7dpdlxyu5rs05rgh1le 2x, https://static.www.nfl.com/image/private/t_headshot_tablet_3x/t_lazy/f_auto/league/q7dpdlxyu5rs05rgh1le"/><source srcset="https://static.www

`.string`

Notice how there is an `a` tag in the first element, which contains the name `Tom Brady`. We can use the `.string` attribute to extract it.

In [22]:
td_lst[0].a.string

' Tom Brady '
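
Two details of `.string` are worth a quick sketch: it keeps surrounding whitespace, and it returns `None` when a tag has more than one child. The latter is one possible reason for `None` values in scraped output. The HTML below is made up for illustration.

```python
from bs4 import BeautifulSoup

html = '<td><a href="/players/x/"> Player X </a></td><td>Eugene, <b>OR</b></td>'
soup = BeautifulSoup(html, "html.parser")
name_cell, town_cell = soup.select("td")

raw_name = name_cell.a.string      # keeps the surrounding spaces
clean_name = raw_name.strip()      # plain str method removes them

mixed = town_cell.string           # None: this cell has two children
joined = town_cell.get_text(" ", strip=True)  # joins all text pieces instead

print(repr(raw_name), repr(clean_name), mixed, repr(joined))
```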

As for the rest of the players in `tr_lst`, let's map over all the elements and use `.string` to get their data.

This is more about Python than about scraping, so I won't go into much detail here.

In [26]:
td_lst[1:]

[<td class="selected">5316</td>,
 <td>7.4</td>,
 <td>719</td>,
 <td>485</td>,
 <td>67.4</td>,
 <td>43</td>,
 <td>12</td>,
 <td>102.1</td>,
 <td>269</td>,
 <td>37.4</td>,
 <td>75</td>,
 <td>10</td>,
 <td>62</td>,
 <td>22</td>,
 <td>144</td>]

In [28]:
for players in tr_lst:
    print(players.select('td')[0].a.string, list(map(lambda x: x.string, players.select('td')[1:])))

 Tom Brady  ['5316', '7.4', '719', '485', '67.4', '43', '12', '102.1', '269', '37.4', '75', '10', '62', '22', '144']
 Justin Herbert  ['5014', '7.5', '672', '443', '65.9', '38', '15', '97.7', '256', '38.1', '53', '15', '72', '31', '214']
 Matthew Stafford  ['4886', '8.1', '601', '404', '67.2', '41', '17', '102.9', '233', '38.8', '65', '18', '79', '30', '243']
 Patrick Mahomes  ['4839', '7.4', '658', '436', '66.3', '37', '13', '98.4', '260', '39.5', '58', '11', '75', '28', '146']
 Derek Carr  ['4804', '7.7', '626', '428', '68.4', '23', '14', '94', '217', '34.7', '67', '10', '61', '40', '241']
 Joe Burrow  ['4611', '8.9', '520', '366', '70.4', '34', '14', '108.3', '202', '38.8', '60', '15', '82', '51', '370']
 Dak Prescott  ['4449', '7.5', '596', '410', '68.8', '37', '10', '104.2', '227', '38.1', '55', '7', '51', '30', '144']
 Josh Allen  ['4407', '6.8', '646', '409', '63.3', '36', '15', '92.2', '234', '36.2', '51', '8', '61', '26', '164']
 Kirk Cousins  ['4221', '7.5', '561', '372', '66
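
For readers less familiar with `map` and `lambda`, a list comprehension does the same job. The sketch below uses a made-up one-row snippet and also shows that the extracted values are strings until converted; the `as_numbers` step is an optional extra, not something the notebook does.

```python
from bs4 import BeautifulSoup

html = "<tr><td><a> Player X </a></td><td>5316</td><td>7.4</td><td>719</td></tr>"
row = BeautifulSoup(html, "html.parser").tr
cells = row.select("td")

# The notebook's approach: map a lambda over every cell after the first.
with_map = list(map(lambda x: x.string, cells[1:]))

# Equivalent list comprehension.
with_comprehension = [cell.string for cell in cells[1:]]

# Optional: turn the text into numbers for later calculations.
as_numbers = [float(value) for value in with_comprehension]

print(with_map, as_numbers)
```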

Now, we will "click" on each of the players and extract data from their specific page on nfl.com

Notice how the `href` attribute of the `<a>` tag is the relative URL of Tom Brady's page.

In [29]:
td_lst[0].a

<a aria-label="Tom Brady profile page" class="d3-o-player-fullname nfl-o-cta--link" href="/players/tom-brady/"> Tom Brady </a>

We can extract this relative url using the following syntax

In [30]:
td_lst[0].a['href']

'/players/tom-brady/'
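
The relative URL starts with a slash, so gluing it onto a base URL that ends with a slash produces a double slash. This notebook sidesteps that by redefining `base_url` without the trailing slash; the standard library's `urljoin` is an alternative that handles both cases. A small sketch:

```python
from urllib.parse import urljoin

base_url = "https://www.nfl.com/"
href = "/players/tom-brady/"

naive = base_url + href            # plain concatenation keeps both slashes
joined = urljoin(base_url, href)   # urljoin resolves the path against the base

print(naive)
print(joined)
```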

Because we have this `href` relative url we are able to "click" on this link and extract data from this webpage as follows.

Note: The CSS selector `.nfl-c-player-info__value` selects all the data shown in the `PLAYER INFO` table.

In [32]:
base_url = "https://www.nfl.com"
add_on = td_lst[0].a['href']
temp_player_stats = requests.get(base_url + add_on)
temp_soup_player_stats = bs(temp_player_stats.text, 'html.parser')
print(td_lst[0].a.string, list(map(lambda x: x.string, temp_soup_player_stats.select('.nfl-c-player-info__value'))))

 Tom Brady  ['6-4', '225', None, None, '23', 'Michigan', '45', None]


Now let's do the same thing as above, but this time we will "click" through all of the players.

In [33]:
for players in tr_lst: #.nfl-c-player-info__value
    base_url = "https://www.nfl.com"
    add_on = players.select('td')[0].a['href']
    temp_player_stats = requests.get(base_url + add_on)
    temp_soup_player_stats = bs(temp_player_stats.text, 'html.parser')
    print(players.select('td')[0].a.string, list(map(lambda x: x.string, temp_soup_player_stats.select('.nfl-c-player-info__value'))))

 Tom Brady  ['6-4', '225', None, None, '23', 'Michigan', '45', None]
 Justin Herbert  ['6-6', '236', '33', '10', '3', 'Oregon', '24', 'Eugene, OR']
 Matthew Stafford  ['6-3', '220', '33 1/4', '10', '14', 'Georgia', '34', None]
 Patrick Mahomes  ['6-2', '225', '33 1/4', '9 1/4', '6', 'Texas Tech', '27', None]
 Derek Carr  ['6-3', '210', '31 1/2', '9 1/4', '9', 'Fresno State', '31', None]
 Joe Burrow  ['6-4', '215', '31', '9', '3', 'LSU', '25', 'Athens, OH']
 Dak Prescott  ['6-2', '238', '32 1/4', '10', '7', 'Mississippi State', '29', None]
 Josh Allen  ['6-5', '237', '33 1/4', '10 1/4', '5', 'Wyoming', '26', 'Firebaugh, CA']
 Kirk Cousins  ['6-3', '205', '31 3/4', '10', '11', 'Michigan State', '34', None]
 Aaron Rodgers  ['6-2', '225', None, None, '18', 'California', '38', None]
 Matt Ryan  ['6-4', '220', '32 3/4', '9 1/2', '15', 'Boston College', '37', 'Exton, PA']
 Jimmy Garoppolo  ['6-2', '225', '31', '9 1/4', '9', 'Eastern Illinois', '30', None]
 Mac Jones  ['6-3', '217', '32 3/4', 
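
The loop above sends one request per player in quick succession. As a hedged sketch, the fetching can be pulled into a helper that pauses between requests. The names `fetch_pages`, `fake_fetch`, and `delay_seconds` are hypothetical, and the fake fetcher exists only so the example runs offline.

```python
import time
from urllib.parse import urljoin

def fetch_pages(hrefs, fetch, base_url="https://www.nfl.com", delay_seconds=1.0):
    """Fetch each relative link in turn, pausing between requests."""
    pages = {}
    for i, href in enumerate(hrefs):
        if i > 0:
            time.sleep(delay_seconds)  # wait before every request but the first
        pages[href] = fetch(urljoin(base_url, href))
    return pages

# Offline stand-in for a real fetcher: echoes the URL it was asked for.
def fake_fetch(url):
    return f"<html>{url}</html>"

pages = fetch_pages(["/players/a/", "/players/b/"], fake_fetch, delay_seconds=0)
print(pages)
```

With real requests, `fetch` could be something like `lambda url: requests.get(url, timeout=10).text`.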

## 4) Store your data

There is nothing new here except that I am appending each value to the proper list.

In [36]:
from collections import defaultdict

In [46]:
data = defaultdict(list)
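
A `defaultdict(list)` creates an empty list the first time a key is used, so there is no need to initialise every column up front. A short sketch comparing it with a plain `dict` (the values are placeholders):

```python
from collections import defaultdict

data = defaultdict(list)
data["player_name"].append("Player A")   # key is created on first use
data["player_name"].append("Player B")
data["TD"].append("43")

# A plain dict raises KeyError for the same operation.
plain = {}
plain_failed = False
try:
    plain["player_name"].append("Player A")
except KeyError:
    plain_failed = True

print(dict(data), plain_failed)
```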

In [47]:
for players in tr_lst:
    p_name = players.select('td')[0].a.string
    print(p_name)
    data["player_name"].append(p_name)
    stats_1st = list(map(lambda x: x.string, players.select('td')[1:]))
    data["Pass_Yds"].append(stats_1st[0]) 
    data["Yds_Att"].append(stats_1st[1]) 
    data["Att"].append(stats_1st[2])
    data["Cmp"].append(stats_1st[3]) 
    data["Cmp_per"].append(stats_1st[4])
    data["TD"].append(stats_1st[5]) 
    data["INT"].append(stats_1st[6]) 
    data["Rate"].append(stats_1st[7]) 
    data["_1st"].append(stats_1st[8]) 
    data["_1st_per"].append(stats_1st[9]) 
    data["_20_plus"].append(stats_1st[10])
    data["_40_plus"].append(stats_1st[11]) 
    data["Lng"].append(stats_1st[12]) 
    data["Sck"].append(stats_1st[13])
    data["SckY"].append(stats_1st[14])
    
    base_url = "https://www.nfl.com"
    add_on = players.select('td')[0].a['href']
    temp_player_stats = requests.get(base_url + add_on)
    temp_soup_player_stats = bs(temp_player_stats.text, 'html.parser')
    
   
    player_info = list(map(lambda x: x.string, temp_soup_player_stats.select('.nfl-c-player-info__value'))) 
    data["player_height"].append(player_info[0]) 
    data["player_weight"].append(player_info[1]) 
    data["player_arms"].append(player_info[2]) 
    data["player_hands"].append(player_info[3]) 
    data["player_exp"].append(player_info[4]) 
    data["player_college"].append(player_info[5]) 
    data["player_age"].append(player_info[6]) 
    # Special case from the original run; presumably this player's page has no hometown entry.
    if p_name.strip() == "Ben Roethlisberger":
        data["player_hometown"].append(None) 
    else:
        data["player_hometown"].append(player_info[7]) 

 Tom Brady 
 Justin Herbert 
 Matthew Stafford 
 Patrick Mahomes 
 Derek Carr 
 Joe Burrow 
 Dak Prescott 
 Josh Allen 
 Kirk Cousins 
 Aaron Rodgers 
 Matt Ryan 
 Jimmy Garoppolo 
 Mac Jones 
 Kyler Murray 
 Ben Roethlisberger 
 Ryan Tannehill 
 Trevor Lawrence 
 Carson Wentz 
 Taylor Heinicke 
 Jared Goff 
 Jalen Hurts 
 Russell Wilson 
 Teddy Bridgewater 
 Baker Mayfield 
 Lamar Jackson 
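
The repeated `append` calls and the name-based special case could also be written as a loop over column names that pads short rows with `None`. This is only a sketch: `add_row` and `INFO_COLUMNS` are hypothetical names, the second row is invented, and padding at the end assumes that any missing value is the last one.

```python
from collections import defaultdict

INFO_COLUMNS = ["player_height", "player_weight", "player_arms", "player_hands",
                "player_exp", "player_college", "player_age", "player_hometown"]

def add_row(data, columns, values):
    """Append one value per column, using None where values runs short."""
    padded = list(values) + [None] * (len(columns) - len(values))
    for column, value in zip(columns, padded):
        data[column].append(value)

data = defaultdict(list)
# Eight values, as scraped for Tom Brady above.
add_row(data, INFO_COLUMNS, ["6-4", "225", None, None, "23", "Michigan", "45", None])
# Hypothetical row with only seven values (no hometown entry).
add_row(data, INFO_COLUMNS, ["6-5", "240", None, None, "18", "Some College", "40"])

print(data["player_age"], data["player_hometown"])
```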


In [49]:
df = pd.DataFrame(data = data)

In [50]:
df

Unnamed: 0,player_name,Pass_Yds,Yds_Att,Att,Cmp,Cmp_per,TD,INT,Rate,_1st,...,Sck,SckY,player_height,player_weight,player_arms,player_hands,player_exp,player_college,player_age,player_hometown
0,Tom Brady,5316,7.4,719,485,67.4,43,12,102.1,269,...,22,144,6-4,225,,,23,Michigan,45.0,
1,Justin Herbert,5014,7.5,672,443,65.9,38,15,97.7,256,...,31,214,6-6,236,33,10,3,Oregon,24.0,"Eugene, OR"
2,Matthew Stafford,4886,8.1,601,404,67.2,41,17,102.9,233,...,30,243,6-3,220,33 1/4,10,14,Georgia,34.0,
3,Patrick Mahomes,4839,7.4,658,436,66.3,37,13,98.4,260,...,28,146,6-2,225,33 1/4,9 1/4,6,Texas Tech,27.0,
4,Derek Carr,4804,7.7,626,428,68.4,23,14,94.0,217,...,40,241,6-3,210,31 1/2,9 1/4,9,Fresno State,31.0,
5,Joe Burrow,4611,8.9,520,366,70.4,34,14,108.3,202,...,51,370,6-4,215,31,9,3,LSU,25.0,"Athens, OH"
6,Dak Prescott,4449,7.5,596,410,68.8,37,10,104.2,227,...,30,144,6-2,238,32 1/4,10,7,Mississippi State,29.0,
7,Josh Allen,4407,6.8,646,409,63.3,36,15,92.2,234,...,26,164,6-5,237,33 1/4,10 1/4,5,Wyoming,26.0,"Firebaugh, CA"
8,Kirk Cousins,4221,7.5,561,372,66.3,33,7,103.1,192,...,28,197,6-3,205,31 3/4,10,11,Michigan State,34.0,
9,Aaron Rodgers,4115,7.8,531,366,68.9,37,4,111.9,213,...,30,188,6-2,225,,,18,California,38.0,
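
Everything extracted with `.string` is text, so numeric columns start out as strings. A minimal sketch of cleaning them up with pandas, using the first two rows from above (only three columns are shown to keep it short):

```python
import pandas as pd

df = pd.DataFrame({
    "player_name": [" Tom Brady ", " Justin Herbert "],
    "Pass_Yds": ["5316", "5014"],
    "Yds_Att": ["7.4", "7.5"],
})

# Remove the padding spaces around the names.
df["player_name"] = df["player_name"].str.strip()

# Convert the text columns to numbers so they can be summed or averaged.
for column in ["Pass_Yds", "Yds_Att"]:
    df[column] = pd.to_numeric(df[column])

print(df.dtypes)
```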


In [51]:
df.to_csv("top_25_nfl_QBs.csv")
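
By default `to_csv` also writes the row index, which typically comes back as an `Unnamed: 0` column when the file is read again. Passing `index=False` avoids that. The sketch below writes to an in-memory buffer instead of a file so it leaves nothing on disk; the two-row frame is a cut-down stand-in for the real one.

```python
import io
import pandas as pd

df = pd.DataFrame({"player_name": ["Tom Brady", "Justin Herbert"], "TD": [43, 38]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
columns_with_index = list(pd.read_csv(with_index).columns)

# With index=False only the real columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
columns_without_index = list(pd.read_csv(without_index).columns)

print(columns_with_index)
print(columns_without_index)
```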