# PETI8123 Lab 7: Web Scraping

<div align="center">
    <img src="https://www.parsehub.com/blog/content/images/2021/06/what-is-web-sraping-parsehub.jpeg" width="500">
</div>

Web scraping means downloading the HTML of a web page and extracting the details we want from it. The most complicated part is inspecting the page's source code to decide what to take and what to ignore.

In this demo, we create a dataset from the weekend performance page of Box Office Mojo, a well-known source of box office performance data.


In [25]:
# Import the necessary libraries
import numpy as np
import pandas as pd

# Import BeautifulSoup for web scraping
from bs4 import BeautifulSoup

# Import requests for making HTTP requests
import requests


## 1. Determine the URL

The first step is to identify the web address (URL) to start crawling. You can open this URL in a browser window to inspect its content.

In [26]:
url = "https://www.boxofficemojo.com/weekend/by-year/2021/"
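The by-year page keeps the year as the last path segment of the URL, so a small helper can build the address for any year. This is only a sketch: `build_weekend_url` is a hypothetical name, and it assumes that other years follow the same URL pattern as the 2021 page.

```python
# Base address taken from the 2021 URL above; the pattern for other years is an assumption
BASE_URL = "https://www.boxofficemojo.com/weekend/by-year/"

def build_weekend_url(year):
    """Return the by-year weekend URL for the given year (hypothetical helper)."""
    return f"{BASE_URL}{int(year)}/"

print(build_weekend_url(2021))
```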

## 2. Make Request

This step connects to the website and requests the content at the URL.

In [27]:
# Send an HTTP GET request to a specified URL and store the response in the 'req' variable
req = requests.get(url)

# Extract the text content from the response and store it in the 'content' variable
content = req.text
print(content[:200])

<!doctype html><html class="a-no-js" data-19ax5a9jf="dingo"><head><script>var aPageStart = (new Date()).getTime();</script><meta charset="utf-8"/>
<script type='text/javascript'>var ue_t0=ue_t0||+new 
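A request can fail or hang, so it can help to set a timeout and check the response status before parsing. The following is a minimal sketch, assuming a hypothetical helper name `fetch_html` and an arbitrary 10-second timeout; `raise_for_status()` is a standard `requests` method that raises an `HTTPError` for 4xx and 5xx responses.

```python
import requests

def fetch_html(url, timeout=10):
    """Download a page and return its HTML text (hypothetical helper).

    The 10-second default timeout is an arbitrary choice for this sketch.
    """
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError if the server answered with a 4xx or 5xx status
    response.raise_for_status()
    return response.text

# Usage (needs network access):
# content = fetch_html("https://www.boxofficemojo.com/weekend/by-year/2021/")
```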


## 3. Use "Beautiful Soup" to Inspect Page Content

In [None]:
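# Create a BeautifulSoup object 'soup' to parse the HTML content obtained from the web page;
# naming the parser explicitly avoids a warning and keeps results consistent across machines
soup = BeautifulSoup(content, "html.parser")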

## 4. Inspect Page Content to Set Up Scraping

Generally, tabular data that is visible on the page is put into ``tr`` (table row) tags. With some of the code below, I am also exploring edge cases in the output of the page. When there is a special occasion for the weekend, in this case Thanksgiving, the first cell of the row has a different format and structure.

In [None]:
# Use BeautifulSoup to find all HTML <tr> (table row) elements in the parsed HTML
# content and store them in the 'rows' variable
rows = soup.findAll('tr')

print(len(rows))

63


In [None]:
# Find all HTML <td> (table cell) elements within the 9th table row (index 8)
# and store them in the 'data' variable
data = rows[8].findAll('td')

print(data)

[<td class="a-text-left mojo-header-column mojo-truncate mojo-field-type-date_interval mojo-sort-column"><a class="a-link-normal" href="/weekend/2021W48/occasion/us_thanksgiving_5/?ref_=bo_wey_table_8">Nov 24-28</a><div class="a-section a-spacing-none"><span class="a-size-small a-color-secondary">Thanksgiving 5-Day</span></div></td>, <td class="a-text-right mojo-field-type-money">$137,256,716</td>, <td class="a-text-right mojo-field-type-percent_delta">-</td>, <td class="a-text-right mojo-field-type-money mojo-estimatable">$142,082,464</td>, <td class="a-text-right mojo-field-type-percent_delta mojo-estimatable">-</td>, <td class="a-text-right mojo-field-type-positive_integer">42</td>, <td class="a-text-left mojo-field-type-release mojo-cell-wide"><a class="a-link-normal" href="/release/rl1887208961/?ref_=bo_wey_table_8">Encanto</a></td>, <td class="a-text-left mojo-field-type-genre hidden">-</td>, <td class="a-text-right mojo-field-type-money hidden">-</td>, <td class="a-text-right mo

In [None]:
# Date when there is a special occasion listed
data[0].findAll('a')[0].text

'Nov 24-28'

In [None]:
# Special occasion that is listed
data[0].findAll('span')[0].text

'Thanksgiving 5-Day'

In [None]:
# Find all HTML <span> elements within the first <td> element of the 'data' list
data[0].findAll('span')

[<span class="a-size-small a-color-secondary">Thanksgiving 5-Day</span>]
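Cells can also be picked by CSS class instead of by position. The sketch below runs on a small hand-written snippet modeled on the cell output above (not on the live page) and uses the `class_` filter of `find_all`; it assumes the class names stay as they appear in that output.

```python
from bs4 import BeautifulSoup

# Hand-written sample row modeled on the cell output above (not fetched from the site)
sample_html = """
<table><tr>
  <td class="a-text-left mojo-field-type-date_interval"><a href="#">Nov 24-28</a>
    <div><span class="a-size-small a-color-secondary">Thanksgiving 5-Day</span></div></td>
  <td class="a-text-right mojo-field-type-money">$137,256,716</td>
  <td class="a-text-right mojo-field-type-percent_delta">-</td>
  <td class="a-text-right mojo-field-type-money mojo-estimatable">$142,082,464</td>
</tr></table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# class_ matches a tag if any one of its classes equals the given name
money_cells = sample_soup.find_all("td", class_="mojo-field-type-money")
money_texts = [cell.text for cell in money_cells]
print(money_texts)
```

Here `find_all` is the current spelling of the `findAll` method used elsewhere in this lab.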

## 5. Test Data Construction

In [None]:
# Naive first attempt: assumes every row has <td> cells and an occasion <span>
for row in rows:
  data = row.findAll('td')
  print(data[0].findAll('span')[0].text)

IndexError: ignored
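The loop above stops with an `IndexError` because some rows lack the expected pieces: a row without `td` cells (most likely the header row) makes `data[0]` fail, and a regular weekend without a `span` makes `findAll('span')[0]` fail. A guarded version, shown here on a hand-written snippet rather than the live page, could look like the sketch below; `find` returns `None` when nothing matches.

```python
from bs4 import BeautifulSoup

# Hand-written rows: a header row, a regular weekend, and a special-occasion weekend
sample_html = """
<table>
  <tr><th>Dates</th></tr>
  <tr><td><a href="#">Dec 10-12</a></td></tr>
  <tr><td><a href="#">Nov 24-28</a><div><span>Thanksgiving 5-Day</span></div></td></tr>
</table>
"""

sample_rows = BeautifulSoup(sample_html, "html.parser").find_all("tr")

occasions = []
for row in sample_rows:
    cells = row.find_all("td")
    if not cells:
        # Rows without <td> cells (such as the header) are skipped
        continue
    span = cells[0].find("span")  # None when the weekend has no occasion
    occasions.append(span.text if span is not None else "")

print(occasions)
```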

In [None]:
# Initialize an empty list to store the data for each row
appended_data = []

# Iterate through each 'row' in the 'rows' list
for row in rows:
  # Create a dictionary 'data_row' to store data for the current row
  data_row = {}

  # Find all HTML <td> elements within the current 'row'
  data = row.findAll('td')

  # Check if the 'data' list is empty and continue to the next iteration if it is
  if len(data) == 0:
    continue

  # Check if the first <td> element contains <span> elements (special weekend)
  if len(data[0].findAll('span')) > 0:
    # Extract the 'occasion' from the first <span> element and 'date' from the first <a> element
    data_row['occasion'] = data[0].findAll('span')[0].text
    data_row['date'] = data[0].findAll('a')[0].text
  else:
    # Set 'occasion' as an empty string and extract 'date' from the first <td> element
    data_row['occasion'] = ""
    data_row['date'] = data[0].text

  # Extract data for other columns within the same row
  top10_gross = data[1].text.replace('$', '').replace(',', '')
  data_row['top10_gross'] = int(top10_gross)

  data_row['top10_wow_change'] = data[2].text

  overall_gross = data[3].text.replace('$', '').replace(',', '')
  data_row['overall_gross'] = int(overall_gross)
  data_row['overall_wow_change'] = data[4].text
  data_row['num_releases'] = data[5].text
  data_row['top_release'] = data[6].text
  data_row['week_no'] = data[10].text

  # Append the 'data_row' dictionary to the 'appended_data' list
  appended_data.append(data_row)

# Create a DataFrame 'weekend_data' using the 'appended_data' list and specify column names
weekend_data = pd.DataFrame(appended_data, columns=['date', 'occasion', 'top10_gross', 'top10_wow_change', 'overall_gross', 'overall_wow_change', 'num_releases', 'top_release', 'week_no'])
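Some money cells on the page hold a lone `-` instead of an amount (see the hidden money cells in the cell output earlier), and `int('-')` would raise a `ValueError`. One possible way to make the conversion tolerant is a small helper; `parse_money` is a hypothetical name, and returning `None` for missing values is a design choice of this sketch.

```python
def parse_money(text):
    """Convert a string such as '$137,256,716' to an int, or None for '-' or empty."""
    cleaned = text.strip().replace("$", "").replace(",", "")
    if cleaned in ("", "-"):
        return None
    return int(cleaned)

print(parse_money("$137,256,716"))
print(parse_money("-"))
```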


In [None]:
# Display the first few rows of the 'weekend_data' DataFrame
weekend_data.head()

Unnamed: 0,date,occasion,top10_gross,top10_wow_change,overall_gross,overall_wow_change,num_releases,top_release,week_no
0,"Dec 31-Jan 2, 2022",,"$95,723,075",-31.6%,"$98,910,707",-31.2%,35,Spider-Man: No Way Home,53
1,Dec 24-26,,"$139,868,872",-50.4%,"$143,835,740",-49.2%,40,Spider-Man: No Way Home,52
2,Dec 17-19,,"$281,737,588",+591.1%,"$282,972,675",+544%,43,Spider-Man: No Way Home,51
3,Dec 10-12,,"$40,765,448",-14.2%,"$43,940,100",-16.6%,45,West Side Story,50
4,Dec 3-5,Post-Thanksgiving,"$47,539,355",-48.7%,"$52,704,939",-45.4%,47,Encanto,49
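The week-over-week columns are kept as text such as `-31.6%` or `+591.1%`, and the page shows a lone `-` where it gives no value. If numeric values are needed, a helper along these lines could convert them; `parse_percent` is a hypothetical name, and mapping `-` to `None` is an assumption of this sketch.

```python
def parse_percent(text):
    """Convert '+591.1%' or '-31.6%' to a float; return None for a lone '-'."""
    cleaned = text.strip().rstrip("%").replace(",", "")
    if cleaned in ("", "-"):
        return None
    return float(cleaned)  # float() accepts a leading '+' or '-'

print(parse_percent("+591.1%"))
print(parse_percent("-"))
```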


## 6. Save Data

We need to save the data collected from a web page to a file for later use. The code below stores it in an Excel file named ``box_office.xlsx``. You should be able to see this file in the same folder as this lab file after running the code.

In [None]:
weekend_data.to_excel("box_office.xlsx", index=False)  # index=False leaves the row index out of the file

## ⚠️ Exercises

**1.** Open ``box_office.xlsx`` and you will see that the gross columns (``top10_gross`` and ``overall_gross``) are stored as text (not numbers). Change the code above so that it writes the gross amounts as numbers.

Hint: Use ``int()`` and ``replace()`` functions.

In [None]:
# Code for Q1
df = pd.read_excel('box_office.xlsx')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62 entries, 0 to 61
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   date                62 non-null     object
 1   occasion            35 non-null     object
 2   top10_gross         62 non-null     object
 3   top10_wow_change    62 non-null     object
 4   overall_gross       62 non-null     object
 5   overall_wow_change  62 non-null     object
 6   num_releases        62 non-null     int64 
 7   top_release         62 non-null     object
 8   week_no             62 non-null     int64 
dtypes: int64(2), object(7)
memory usage: 4.5+ KB


**2.** Rewrite the code above to scrape the box office data for the year 2022.

In [None]:
# Code for Q2
# change the url to 2022