# Web Scraping Example

This notebook walks through a real-life web scraping example, collecting data from [MLB's official website](https://www.mlb.com/). Our goal is to collect hitting statistics for players throughout MLB history.

We will cover:
- Examining the HTML of a web page
- Using requests to collect raw HTML
- Extracting data using BeautifulSoup
- Formatting data in a pandas DataFrame
- Automating the Web Scraping process with Python

## Examine HTML

The specific data we want to scrape can be found here: https://www.mlb.com/stats/

A few things you will notice:
- Data is spread across multiple pages
- We need to use a drop-down menu to switch years

How does the url of the page change when you switch to a different page or year?
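
Judging by the URLs requested later in this notebook, the year appears as a path segment and the page number as a `page` query parameter. A minimal sketch of building such URLs, where `build_stats_url` is a hypothetical helper name and the pattern is an assumption taken from those later cells:

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.mlb.com/stats/'

def build_stats_url(year=None, page=1):
    """Build a stats URL; year=None leaves the year out so the site picks its default."""
    path = BASE_URL if year is None else '{}{}'.format(BASE_URL, year)
    return '{}?{}'.format(path, urlencode({'page': page}))

print(build_stats_url(page=2))        # https://www.mlb.com/stats/?page=2
print(build_stats_url(2021, page=3))  # https://www.mlb.com/stats/2021?page=3
```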

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
# Collect raw HTML from web page
resp = requests.get('https://www.mlb.com/stats/')

In [3]:
resp.status_code

200

A 200 status code means the request was successful.

In [4]:
soup = BeautifulSoup(resp.content, 'html.parser')  # Parse the raw HTML; naming the parser avoids a bs4 warning

In [1]:
# print(soup.prettify())
# Left commented out so the output does not fill the entire notebook

Select the table that holds all of the data from `soup`.

It is stored in a `div` tag with the classes `stats-body-table player`.

In [7]:
table = soup.find('div', class_='stats-body-table player')
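
When `class_` is given a string containing a space, Beautiful Soup compares it against the exact value of the `class` attribute, so the order of the class names matters. The sketch below uses made-up HTML (not the real MLB markup) with the two classes in the opposite order, and shows a CSS selector as an order-independent alternative:

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the real page, with the classes reversed
html = '''
<div class="player stats-body-table"><table><tbody><tr><td>LAA</td></tr></tbody></table></div>
'''
demo_soup = BeautifulSoup(html, 'html.parser')

# Exact-string match on the class attribute: finds nothing because the order differs
exact = demo_soup.find('div', class_='stats-body-table player')

# CSS selector: matches a div carrying both classes, in any order
by_selector = demo_soup.select_one('div.stats-body-table.player')

print(exact is None)            # True
print(by_selector is not None)  # True
```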

Within the table itself, we can select the table body (`tbody` tag) which is where all of the data is stored

In [8]:
table_body = table.find('tbody')

Every row in the table body is indicated with a `tr` tag. We can `find_all()` of these `tr` tags to create a list of every row in the table

In [9]:
rows = table_body.find_all('tr')

In [10]:
len(rows)

25

In [18]:
# Find a tag with Player name within first row
rows[0].find('a', class_='bui-link').attrs['aria-label']

'Mike Trout'

Within each row, there is an `a` tag with class `bui-link` which has the name of the player. Loop over each row, select this `a` tag, and extract player name from `aria-label` attribute

In [20]:
player_names = [row.find('a', class_='bui-link').attrs['aria-label'] for row in rows]

The position is stored in a `div` tag with class `position-28TbwVOg`. Extract the position from the first row.

In [22]:
rows[0].find('div', class_='position-28TbwVOg').text

'CF'

Iterate over every row and extract player position using the div tag described above.

In [24]:
positions = [row.find('div', class_='position-28TbwVOg').text for row in rows]
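
The suffix in `position-28TbwVOg` looks machine-generated, so it may change when the site is rebuilt (an assumption; nothing on the page confirms it). A hedged sketch of a more tolerant lookup, using made-up rows, that matches any class starting with `position-` and guards against rows where nothing is found:

```python
import re
from bs4 import BeautifulSoup

# Made-up rows; the second one has no position element
html = '''
<table><tbody>
<tr><td><div class="position-28TbwVOg">CF</div></td></tr>
<tr><td>no position here</td></tr>
</tbody></table>
'''
demo_rows = BeautifulSoup(html, 'html.parser').find_all('tr')

position_class = re.compile(r'^position-')  # any class starting with "position-"

demo_positions = []
for row in demo_rows:
    tag = row.find('div', class_=position_class)
    # find() returns None when nothing matches, so check before reading .text
    demo_positions.append(tag.text if tag is not None else None)

print(demo_positions)  # ['CF', None]
```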

Now that we have player names and positions, the next step is to collect the bulk of the data from the table.

Each row has multiple `td` tags containing the data for that row. Find all `td` tags for a given row and extract text from each

In [27]:
rows[0].find_all('td')[0].text

'LAA'

Iterate over every row in rows, select all td tags and extract text from each

In [28]:
player_data = []
for row in rows:
    td_tags = row.find_all('td')
    td_text = [td.text for td in td_tags]
    player_data.append(td_text)

In [29]:
player_data[0]

['LAA',
 '27',
 '95',
 '25',
 '32',
 '8',
 '1',
 '9',
 '19',
 '19',
 '27',
 '0',
 '0',
 '.337',
 '.457',
 '.726',
 '1.183']

In [31]:
len(player_data)

25

In [34]:
df = pd.DataFrame(player_data)
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
0,LAA,27,95,25,32,8,1,9,19,19,27,0,0,0.337,0.457,0.726,1.183
1,SD,31,116,27,44,8,0,7,22,16,22,6,0,0.379,0.455,0.629,1.084
2,CLE,30,111,18,33,8,2,7,30,19,10,3,1,0.297,0.402,0.595,0.997
3,STL,28,107,14,34,9,0,7,24,12,18,0,0,0.318,0.387,0.598,0.985
4,COL,30,114,15,35,7,1,9,24,7,30,0,0,0.307,0.355,0.623,0.978


Now that we have all of the data from our web page, I want to collect the table's column names to use in our DataFrame.

All column names are stored within table in a `thead` tag. Each column is stored in a `th` tag within `thead`

In [35]:
headers = table.find('thead').find_all('th')

In [39]:
headers[0].find('button').text

'PLAYER'

Each header lists its column name twice, so we select the `button` tag to get the first instance of the column name.

In [41]:
columns_raw = [header.find('button').text for header in headers]

In [51]:
columns_raw

['PLAYER',
 'TEAM',
 'G',
 'AB',
 'R',
 'H',
 '2B',
 '3B',
 'HR',
 'RBI',
 'BB',
 'SO',
 'SB',
 'CS',
 'AVG',
 'OBP',
 'SLG',
 'caret-upcaret-downOPS']

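Create a list `columns` with the column names for the base DataFrame. The last raw header comes back as `caret-upcaret-downOPS`, so the names are written out by hand here. `PLAYER` is left out because player name and position will be added after the fact.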

In [43]:
columns = ['TEAM',
 'G',
 'AB',
 'R',
 'H',
 '2B',
 '3B',
 'HR',
 'RBI',
 'BB',
 'SO',
 'SB',
 'CS',
 'AVG',
 'OBP',
 'SLG',
 'OPS']
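
As an alternative to typing the list by hand, the same names can be derived from the raw headers. This is a sketch only: it assumes the stray `caret-upcaret-down` text (presumably from the sort-arrow icons) is the only debris, and it works on a copy of the output shown earlier rather than on the live page.

```python
# Copy of the `columns_raw` output shown earlier
columns_raw_demo = ['PLAYER', 'TEAM', 'G', 'AB', 'R', 'H', '2B', '3B', 'HR', 'RBI',
                    'BB', 'SO', 'SB', 'CS', 'AVG', 'OBP', 'SLG', 'caret-upcaret-downOPS']

ICON_TEXT = 'caret-upcaret-down'  # stray text attached to one header

cleaned = [name.replace(ICON_TEXT, '') for name in columns_raw_demo]
columns_demo = cleaned[1:]  # drop PLAYER; the td cells in each row start at TEAM

print(columns_demo[-1])   # OPS
print(len(columns_demo))  # 17
```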

Create list `columns_final` which will format DataFrame once all data has been added (after we include `PLAYER` and `POSITION` column)

In [55]:
columns_final = ['PLAYER', 'POSITION','TEAM',
 'G',
 'AB',
 'R',
 'H',
 '2B',
 '3B',
 'HR',
 'RBI',
 'BB',
 'SO',
 'SB',
 'CS',
 'AVG',
 'OBP',
 'SLG',
 'OPS']

In [44]:
df.columns = columns
df.head()

Unnamed: 0,TEAM,G,AB,R,H,2B,3B,HR,RBI,BB,SO,SB,CS,AVG,OBP,SLG,OPS
0,LAA,27,95,25,32,8,1,9,19,19,27,0,0,0.337,0.457,0.726,1.183
1,SD,31,116,27,44,8,0,7,22,16,22,6,0,0.379,0.455,0.629,1.084
2,CLE,30,111,18,33,8,2,7,30,19,10,3,1,0.297,0.402,0.595,0.997
3,STL,28,107,14,34,9,0,7,24,12,18,0,0,0.318,0.387,0.598,0.985
4,COL,30,114,15,35,7,1,9,24,7,30,0,0,0.307,0.355,0.623,0.978


DataFrame is formatted with column names. Add `player_names` and `positions` as new columns

In [45]:
df['PLAYER'] = player_names
df['POSITION'] = positions

In [46]:
df.head()

Unnamed: 0,TEAM,G,AB,R,H,2B,3B,HR,RBI,BB,SO,SB,CS,AVG,OBP,SLG,OPS,PLAYER,POSITION
0,LAA,27,95,25,32,8,1,9,19,19,27,0,0,0.337,0.457,0.726,1.183,Mike Trout,CF
1,SD,31,116,27,44,8,0,7,22,16,22,6,0,0.379,0.455,0.629,1.084,Manny Machado,3B
2,CLE,30,111,18,33,8,2,7,30,19,10,3,1,0.297,0.402,0.595,0.997,Jose Ramirez,3B
3,STL,28,107,14,34,9,0,7,24,12,18,0,0,0.318,0.387,0.598,0.985,Nolan Arenado,3B
4,COL,30,114,15,35,7,1,9,24,7,30,0,0,0.307,0.355,0.623,0.978,C.J. Cron,1B


We can index the DataFrame with `columns_final` to reorder its columns. `columns_final` lists `PLAYER` and `POSITION` first, so the resulting DataFrame will have those columns at the beginning.

In [60]:
df[columns_final]

Unnamed: 0,PLAYER,POSITION,TEAM,G,AB,R,H,2B,3B,HR,RBI,BB,SO,SB,CS,AVG,OBP,SLG,OPS
0,Mike Trout,CF,LAA,27,95,25,32,8,1,9,19,19,27,0,0,0.337,0.457,0.726,1.183
1,Manny Machado,3B,SD,31,116,27,44,8,0,7,22,16,22,6,0,0.379,0.455,0.629,1.084
2,Jose Ramirez,3B,CLE,30,111,18,33,8,2,7,30,19,10,3,1,0.297,0.402,0.595,0.997
3,Nolan Arenado,3B,STL,28,107,14,34,9,0,7,24,12,18,0,0,0.318,0.387,0.598,0.985
4,C.J. Cron,1B,COL,30,114,15,35,7,1,9,24,7,30,0,0,0.307,0.355,0.623,0.978
5,Josh Bell,1B,WSH,30,106,21,37,7,0,4,21,17,13,0,1,0.349,0.444,0.528,0.972
6,Owen Miller,1B,CLE,24,81,20,27,10,0,3,13,11,19,0,0,0.333,0.4,0.568,0.968
7,Aaron Judge,RF,NYY,29,111,21,32,6,0,10,22,11,34,2,0,0.288,0.352,0.613,0.965
8,Eric Hosmer,1B,SD,29,104,12,37,8,0,3,19,13,16,0,0,0.356,0.427,0.519,0.946
9,Jazz Chisholm Jr.,2B,MIA,25,90,16,27,6,3,5,21,7,23,6,1,0.3,0.343,0.6,0.943


We have all data from first page of 2022 collected and stored in a DataFrame.

Condense our code into a function:
- Accepts a single URL
- Extracts all relevant data from the web page
- Formats it into a pandas DataFrame

In [72]:
def collect_page_data(url, cols_base, cols_final):
    
    # STEP 1
    resp = requests.get(url) # Collect raw HTML from web page
    
    # Convert it to a BeautifulSoup object
    soup = BeautifulSoup(resp.content, 'html.parser')
    
    table = soup.find('div', class_='stats-body-table player') # Select table from HTML
    table_body = table.find('tbody') 
    rows = table_body.find_all('tr') # Select every row from table body
    
    # STEP 2 - check if `rows` variable has any data
    if rows:
        # Extract Player names from each row
        player_names = [row.find('a', class_='bui-link').attrs['aria-label'] for row in rows]

        # Extract player position from each row
        positions = [row.find('div', class_='position-28TbwVOg').text for row in rows]

        # Extract player data
        player_data = []
        for row in rows:
            td_tags = row.find_all('td')
            td_text = [td.text for td in td_tags]
            player_data.append(td_text)

        # Create DataFrame from player data
        df = pd.DataFrame(player_data)

        # set columns to 'cols_base' argument
        df.columns = cols_base

        # Add player name and position as new columns
        df['PLAYER'] = player_names
        df['POSITION'] = positions

        # return completed DataFrame
        return df[cols_final]
    
    # STEP 3
    else: # if no data found in rows variable, return None
        return None

Function Summary:
- the function accepts three parameters
    - `url`: the url of the page we want to scrape
    - `cols_base`: base columns in data table (everything except `PLAYER` and `POSITION`)
    - `cols_final`: list of columns for our final DataFrame
- Step 1:
    - Function makes a request to `url` and stores raw HTML in BeautifulSoup object
    - We select the `table` with all of the data we want
    - Select the body of that table (`tbody`)
    - Then select every row from the table body (`table_body.find_all('tr')`)
- Step 2:
    - We don't know if a page has data or not until we inspect the `rows` variable
    - If `rows` has data in it, we go through the process of extracting that data and storing it in a DataFrame
- Step 3:
    - If `rows` does not have any data in it (ie. the variable is an empty list), there is nothing to extract, so the function returns `None`
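
One case the summary does not cover is a page where the table `div` itself is missing, in which case `table.find('tbody')` would raise an `AttributeError`. A minimal sketch of a guard, using made-up HTML strings so it runs offline; `extract_rows` is a hypothetical helper, not part of the function above:

```python
from bs4 import BeautifulSoup

def extract_rows(html):
    """Return the <tr> tags of the stats table, or [] if the table is missing."""
    page = BeautifulSoup(html, 'html.parser')
    table = page.find('div', class_='stats-body-table player')
    if table is None:  # e.g. the layout changed or the request was blocked
        return []
    body = table.find('tbody')
    return body.find_all('tr') if body is not None else []

with_data = ('<div class="stats-body-table player"><table><tbody>'
             '<tr><td>LAA</td></tr></tbody></table></div>')
empty_page = '<div class="stats-body-table player"><table><tbody></tbody></table></div>'
no_table = '<p>Access denied</p>'

print(len(extract_rows(with_data)))  # 1
print(extract_rows(empty_page))      # []
print(extract_rows(no_table))        # []
```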

In [58]:
collect_page_data('https://www.mlb.com/stats/?page=2', columns, columns_final)

Unnamed: 0,PLAYER,POSITION,TEAM,G,AB,R,H,2B,3B,HR,RBI,BB,SO,SB,CS,AVG,OBP,SLG,OPS
0,Pete Alonso,1B,NYM,31,118,19,33,5,0,7,26,10,30,0,0,0.28,0.351,0.5,0.851
1,Juan Soto,RF,WSH,31,113,21,29,5,0,6,8,24,23,3,0,0.257,0.391,0.46,0.851
2,Connor Joe,DH,COL,29,115,17,34,5,2,4,11,13,20,2,1,0.296,0.372,0.478,0.85
3,Wander Franco,SS,TB,29,120,20,38,9,1,4,15,5,12,3,0,0.317,0.341,0.508,0.849
4,Seiya Suzuki,RF,CHC,29,94,13,24,7,1,4,16,16,32,1,2,0.255,0.366,0.479,0.845
5,Matt Olson,1B,ATL,31,115,15,31,13,0,3,13,21,27,0,0,0.27,0.38,0.461,0.841
6,Bryce Harper,DH,PHI,31,117,23,31,10,1,6,19,8,32,5,1,0.265,0.318,0.521,0.839
7,Tommy Edman,2B,STL,28,100,18,29,3,2,3,14,14,18,7,1,0.29,0.388,0.45,0.838
8,Christian Yelich,LF,MIL,31,112,22,29,8,1,5,20,16,29,3,0,0.259,0.356,0.482,0.838
9,Rowdy Tellez,1B,MIL,31,106,16,26,9,0,7,27,7,26,0,0,0.245,0.304,0.528,0.832


How the while loop we are about to write will know when to stop:
- We can't know in advance how many pages of data are available for a given year
- We want the loop to keep collecting pages until it reaches one with no data (ie. there are no pages left for that year)
- Inside `collect_page_data`, an if statement checks whether `rows` is an empty list
- If `rows` has data (STEP 2), the function extracts it, stores it in a DataFrame, and returns the DataFrame
- If `rows` is an empty list (STEP 3), the function returns `None`, which tells the while loop to stop

In [61]:
resp = requests.get('https://www.mlb.com/stats/?page=8')

In [62]:
resp.status_code

200

In [63]:
soup = BeautifulSoup(resp.content, 'html.parser')

In [67]:
soup.find('div', class_='stats-body-table player').find('tbody').find_all('tr')

[]

In [91]:
bool([])  # an empty list is falsy, so `if rows:` takes the else branch

False
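
The `if rows:` check in `collect_page_data` relies on Python treating an empty list as falsy. `bool()` shows what the `if` statement sees; comparing a list to `True` with `==` is a different operation and is `False` even for a non-empty list. A short illustration with made-up values:

```python
empty_rows = []
some_rows = ['<tr>...</tr>']

print(bool(empty_rows))   # False
print(bool(some_rows))    # True
print(some_rows == True)  # False
```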

We can collect data for a single page. Now we want to collect all data for a given year (across all pages).

When we call the `collect_page_data` function it will return a DataFrame with the data from a given page or `None` if no data was found on that page. In the while loop below, we check if `page_df` is `None` or not. If it is not `None`, we add the resulting DataFrame to the list `all_dfs`. If `page_df` is `None`, it means there was no data on the page and we set `more_pages` equal to `False` indicating we have collected data from all available pages and our while loop should stop executing.

In [93]:
page_num = 1
more_pages = True
all_dfs = []
while more_pages:
    
    base_url = 'https://www.mlb.com/stats/?page={}'.format(page_num)
    
    page_df = collect_page_data(base_url, columns, columns_final)
    
    if page_df is not None: # if page_df is a DataFrame, add to all_dfs list
        all_dfs.append(page_df)
        page_num += 1
    
    else: # if page_df is None, set more_pages to False (ie. exit While loop)
        more_pages = False
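
The same stop-on-`None` pattern can be written as a small reusable helper with a page cap, so a page that never comes back empty cannot loop forever, and with an optional pause between requests. This is a sketch under assumptions: `collect_all_pages`, `max_pages`, and `delay_seconds` are hypothetical names, and `fake_fetch` stands in for `collect_page_data` so the example runs offline.

```python
import time

def collect_all_pages(fetch_page, max_pages=100, delay_seconds=0.0):
    """Call fetch_page(1), fetch_page(2), ... until it returns None or max_pages is hit."""
    results = []
    for page_num in range(1, max_pages + 1):
        page = fetch_page(page_num)
        if page is None:  # no data on this page: stop
            break
        results.append(page)
        time.sleep(delay_seconds)  # optional pause between requests
    return results

# Stand-in for collect_page_data: pretends only pages 1-3 have data
def fake_fetch(page_num):
    return 'page {}'.format(page_num) if page_num <= 3 else None

print(collect_all_pages(fake_fetch))  # ['page 1', 'page 2', 'page 3']
```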

Use `pd.concat()` to combine list of DataFrames into one large DataFrame

In [77]:
full_df = pd.concat(all_dfs)

In [78]:
full_df.head()

Unnamed: 0,PLAYER,POSITION,TEAM,G,AB,R,H,2B,3B,HR,RBI,BB,SO,SB,CS,AVG,OBP,SLG,OPS
0,Mike Trout,CF,LAA,27,95,25,32,8,1,9,19,19,27,0,0,0.337,0.457,0.726,1.183
1,Manny Machado,3B,SD,31,116,27,44,8,0,7,22,16,22,6,0,0.379,0.455,0.629,1.084
2,Jose Ramirez,3B,CLE,30,111,18,33,8,2,7,30,19,10,3,1,0.297,0.402,0.595,0.997
3,Nolan Arenado,3B,STL,28,107,14,34,9,0,7,24,12,18,0,0,0.318,0.387,0.598,0.985
4,Josh Bell,1B,WSH,30,106,21,37,7,0,4,21,17,13,0,1,0.349,0.444,0.528,0.972


In [79]:
full_df.shape

(174, 19)

In [80]:
full_df.tail()

Unnamed: 0,PLAYER,POSITION,TEAM,G,AB,R,H,2B,3B,HR,RBI,BB,SO,SB,CS,AVG,OBP,SLG,OPS
19,Avisail Garcia,RF,MIA,27,100,6,20,3,0,1,4,2,31,2,0,0.2,0.231,0.26,0.491
20,Marcus Semien,2B,TEX,28,111,11,19,6,0,0,8,9,24,2,1,0.171,0.236,0.225,0.461
21,Cristian Pache,CF,OAK,29,96,11,16,1,1,2,7,2,26,0,1,0.167,0.184,0.26,0.444
22,Jonathan Schoop,2B,DET,29,108,7,17,3,0,2,6,5,23,0,0,0.157,0.202,0.241,0.443
23,Whit Merrifield,2B,KC,27,108,4,15,3,0,0,7,6,16,3,0,0.139,0.179,0.167,0.346
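
The index labels 19 to 23 in the output above, in a frame with 174 rows, show that `pd.concat` kept each page's own row labels. If a single running index is preferred, `ignore_index=True` renumbers the rows. A small sketch with made-up frames:

```python
import pandas as pd

page_1 = pd.DataFrame({'PLAYER': ['A', 'B'], 'HR': ['9', '7']})
page_2 = pd.DataFrame({'PLAYER': ['C', 'D'], 'HR': ['7', '4']})

stacked = pd.concat([page_1, page_2])                         # keeps per-page labels
renumbered = pd.concat([page_1, page_2], ignore_index=True)   # fresh 0..n-1 labels

print(list(stacked.index))     # [0, 1, 0, 1]
print(list(renumbered.index))  # [0, 1, 2, 3]
```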


In [81]:
def collect_year_data(year, cols_base, cols_final):
    
    page_num = 1
    more_pages = True
    all_dfs = []
    while more_pages:

        base_url = 'https://www.mlb.com/stats/{}?page={}'.format(year, page_num)

        page_df = collect_page_data(base_url, cols_base, cols_final)  # use the function's own arguments, not globals

        if page_df is not None: # if page_df is a DataFrame, add to all_dfs list
            all_dfs.append(page_df)
            page_num += 1

        else: # if page_df is None, set more_pages to False (ie. exit While loop)
            more_pages = False
        
    full_df = pd.concat(all_dfs) # Combine list of DataFrames into single DataFrame
    
    # return final df
    return full_df

In [82]:
df_2022 = collect_year_data(2022, columns, columns_final)

In [83]:
df_2022.shape

(174, 19)

In [84]:
df_2022.head()

Unnamed: 0,PLAYER,POSITION,TEAM,G,AB,R,H,2B,3B,HR,RBI,BB,SO,SB,CS,AVG,OBP,SLG,OPS
0,Mike Trout,CF,LAA,27,95,25,32,8,1,9,19,19,27,0,0,0.337,0.457,0.726,1.183
1,Manny Machado,3B,SD,31,116,27,44,8,0,7,22,16,22,6,0,0.379,0.455,0.629,1.084
2,Jose Ramirez,3B,CLE,30,111,18,33,8,2,7,30,19,10,3,1,0.297,0.402,0.595,0.997
3,Nolan Arenado,3B,STL,28,107,14,34,9,0,7,24,12,18,0,0,0.318,0.387,0.598,0.985
4,Josh Bell,1B,WSH,30,106,21,37,7,0,4,21,17,13,0,1,0.349,0.444,0.528,0.972


In [85]:
df_2022.tail()

Unnamed: 0,PLAYER,POSITION,TEAM,G,AB,R,H,2B,3B,HR,RBI,BB,SO,SB,CS,AVG,OBP,SLG,OPS
19,Avisail Garcia,RF,MIA,27,100,6,20,3,0,1,4,2,31,2,0,0.2,0.231,0.26,0.491
20,Marcus Semien,2B,TEX,28,111,11,19,6,0,0,8,9,24,2,1,0.171,0.236,0.225,0.461
21,Cristian Pache,CF,OAK,29,96,11,16,1,1,2,7,2,26,0,1,0.167,0.184,0.26,0.444
22,Jonathan Schoop,2B,DET,29,108,7,17,3,0,2,6,5,23,0,0,0.157,0.202,0.241,0.443
23,Whit Merrifield,2B,KC,27,108,4,15,3,0,0,7,6,16,3,0,0.139,0.179,0.167,0.346


In [86]:
df_2021 = collect_year_data(2021, columns, columns_final)

In [87]:
df_2021.shape

(132, 19)

In [88]:
df_2021.head()

Unnamed: 0,PLAYER,POSITION,TEAM,G,AB,R,H,2B,3B,HR,RBI,BB,SO,SB,CS,AVG,OBP,SLG,OPS
0,Bryce Harper,RF,PHI,141,488,101,151,42,1,35,84,100,134,13,3,0.309,0.429,0.615,1.044
1,Vladimir Guerrero,1B,TOR,161,604,123,188,29,1,48,111,86,110,4,1,0.311,0.401,0.601,1.002
2,Juan Soto,RF,WSH,151,502,111,157,20,2,29,95,145,93,9,7,0.313,0.465,0.534,0.999
3,Fernando Tatis Jr.,SS,SD,130,478,99,135,31,0,42,97,62,153,25,4,0.282,0.364,0.611,0.975
4,Shohei Ohtani,DH,LAA,158,537,103,138,26,8,46,100,96,189,26,10,0.257,0.372,0.592,0.964


In [89]:
df_2021.tail()

Unnamed: 0,PLAYER,POSITION,TEAM,G,AB,R,H,2B,3B,HR,RBI,BB,SO,SB,CS,AVG,OBP,SLG,OPS
2,Carlos Santana,1B,KC,158,565,66,121,15,0,19,69,86,102,2,0,0.214,0.319,0.342,0.661
3,Michael A. Taylor,CF,KC,142,483,58,118,16,1,12,54,33,144,14,7,0.244,0.297,0.356,0.653
4,David Fletcher,2B,LAA,157,626,74,164,27,3,2,47,31,60,15,3,0.262,0.297,0.324,0.621
5,Elvis Andrus,SS,OAK,146,497,60,121,25,2,3,37,31,81,12,2,0.243,0.294,0.32,0.614
6,Kevin Newman,SS,PIT,148,517,50,117,22,3,5,39,27,41,6,1,0.226,0.265,0.309,0.574


Collect data across multiple years (2022 back to 2016)

In [94]:
year_dfs = []
for year in range(2022, 2015, -1):
    df = collect_year_data(year, columns, columns_final)
    year_dfs.append(df)
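
Once the yearly frames are concatenated, nothing in a row says which season it came from. One option is to tag each frame with its year before appending it. The sketch below uses a hypothetical `YEAR` column and a stand-in `fake_collect_year` function with made-up data so it runs without network access:

```python
import pandas as pd

# Stand-in for collect_year_data, returning tiny made-up frames
def fake_collect_year(year):
    return pd.DataFrame({'PLAYER': ['Player A', 'Player B'], 'HR': ['1', '2']})

tagged_dfs = []
for year in range(2022, 2019, -1):
    year_df = fake_collect_year(year)
    tagged_dfs.append(year_df.assign(YEAR=year))  # record the season on every row

tagged = pd.concat(tagged_dfs, ignore_index=True)
print(tagged['YEAR'].tolist())  # [2022, 2022, 2021, 2021, 2020, 2020]
```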

In [96]:
final_df = pd.concat(year_dfs)

In [97]:
final_df.shape

(1014, 19)

In [98]:
final_df.head()

Unnamed: 0,PLAYER,POSITION,TEAM,G,AB,R,H,2B,3B,HR,RBI,BB,SO,SB,CS,AVG,OBP,SLG,OPS
0,Mike Trout,CF,LAA,27,95,25,32,8,1,9,19,19,27,0,0,0.337,0.457,0.726,1.183
1,Manny Machado,3B,SD,32,116,27,44,8,0,7,22,17,22,6,0,0.379,0.459,0.629,1.088
2,Jose Ramirez,3B,CLE,30,111,18,33,8,2,7,30,19,10,3,1,0.297,0.402,0.595,0.997
3,Nolan Arenado,3B,STL,28,107,14,34,9,0,7,24,12,18,0,0,0.318,0.387,0.598,0.985
4,Josh Bell,1B,WSH,30,106,21,37,7,0,4,21,17,13,0,1,0.349,0.444,0.528,0.972


In [100]:
final_df.dtypes

PLAYER      object
POSITION    object
TEAM        object
G           object
AB          object
R           object
H           object
2B          object
3B          object
HR          object
RBI         object
BB          object
SO          object
SB          object
CS          object
AVG         object
OBP         object
SLG         object
OPS         object
dtype: object
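
Every column is `object` because the scraped cell text is stored as strings. A minimal sketch of converting the statistic columns to numbers with `pd.to_numeric`, shown on a small made-up frame; for `final_df` the list of numeric columns would presumably be everything except `PLAYER`, `POSITION`, and `TEAM`:

```python
import pandas as pd

demo = pd.DataFrame({
    'PLAYER': ['Player A', 'Player B'],
    'G': ['27', '31'],
    'AVG': ['.337', '.379'],
})

numeric_cols = ['G', 'AVG']
# errors='coerce' turns any text that is not a number into NaN instead of raising
demo[numeric_cols] = demo[numeric_cols].apply(pd.to_numeric, errors='coerce')

print(demo.dtypes)
```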