To serve this slide deck, run the following line in the terminal or PowerShell:

In [3]:
jupyter nbconvert 'slides/advanced-web-scraping.ipynb' --to slides --output='../advanced-web-scraping'


TODO: Fix the reference numbers on the footnotes.

# Advanced Web Scraping with Python

<br>

### Lorae Stojanovic

June 20, 2024


# Agenda
1. [How does a website work?](#/2)
    - [Key terminology](#/2/1)
    - [Accessing a website](#/2/3)
2. [Web Scraping Basics](#/3)
    - [Web requests](#web-requests)
3. Selenium
4. API Scraping
5. Automating a web scrape with GitHub Actions
6. Demo
7. [Additional Resources](#/3)

Libraries we WILL use in this presentation:

- requests
- BeautifulSoup
- JSON
- XML
- selenium

These topics will NOT be covered in this presentation:

- pyppeteer, playwright (more advanced browser automation tools; both can drive Chromium through the Chrome DevTools Protocol)
- scrapy
- MechanicalSoup

# How does a website work?

A lot goes on behind the scenes when you view a website like https://www.brookings.edu/.

Understanding how your computer interacts with remote resources will help you become a more capable data collector.

## **Key terminology**
### **Client**
"Clients are the typical **web user's internet-connected devices** (for example, your computer connected to your Wi-Fi, or your phone connected to your mobile network) and **web-accessing software available on those devices** (usually a web browser like Firefox or Chrome)."<sup id="ref01" class="reference"><a href="https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/How_the_Web_works" title="Mozilla (2023). How the web works. Retrieved from https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/How_the_Web_works">[1]</a></sup> 

### **Server**
"Servers are **computers that store webpages, sites, or apps**. When a client device wants to access a webpage, a copy of the webpage is downloaded from the server onto the client machine to be displayed in the user's web browser."<sup id="ref02" class="reference"><a href="https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/How_the_Web_works" title="Mozilla (2023). How the web works. Retrieved from https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/How_the_Web_works">[2]</a></sup> 

</p>
<p style="font-size:10px">
[1] Mozilla (2023). How the web works. Retrieved from https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/How_the_Web_works <br>
[2] Ibid.
</p>

Clients send **requests** for information from servers. The servers send back **responses**.
![A schematic illustrating the relationship between clients and servers](slides/client-server-request-response.png)

### **HTTP web requests**

There are many types of internet *protocols* that you can use to make requests,<sup id="ref03" class="reference"><a href="https://www.geeksforgeeks.org/types-of-internet-protocols/" title="GeeksforGeeks (2023). Types of Internet Protocols. Retrieved from https://www.geeksforgeeks.org/types-of-internet-protocols/">[3]</a></sup> but when we web scrape, we will concern ourselves primarily with **HTTP (Hypertext Transfer Protocol)**. HTTP defines nine request methods:<sup id="ref04" class="reference"><a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods" title="Mozilla (2023). HTTP request methods. Retrieved from https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods">[4]</a></sup> 

- GET
- HEAD
- POST
- PUT
- DELETE
- CONNECT
- OPTIONS
- TRACE
- PATCH


</p>
<p style="font-size:10px">
[3] GeeksforGeeks (2023). Types of Internet Protocols. Retrieved from https://www.geeksforgeeks.org/types-of-internet-protocols/ <br>
[4] Mozilla (2023). HTTP request methods. Retrieved from https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods <br>
</p>

Luckily for us web scrapers, we do not need to memorize all nine methods. Ninety-nine percent of the time, we will only concern ourselves with **GET** and **POST** requests.

- "The **GET** method requests a representation of the specified resource. Requests using GET should only retrieve data."<sup id="ref05" class="reference"><a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods" title="Mozilla (2023). HTTP request methods. Retrieved from https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods">[5]</a></sup> 
- "The **POST** method submits an entity to the specified resource, often causing a change in state or side effects on the server."<sup id="ref06" class="reference"><a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods" title="Mozilla (2023). HTTP request methods. Retrieved from https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods">[6]</a></sup> 

Don't worry if you're confused. It's easy to tell which type you need to use. And today, we'll do a demo using both.

</p>
<p style="font-size:10px">
[5] Mozilla (2023). HTTP request methods. Retrieved from https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods <br>
[6] Ibid. <br>
</p>
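To make the GET/POST distinction concrete, here is a minimal sketch that builds one request of each kind without sending anything over the network. It uses only Python's standard library so it runs without extra installs; the URL `https://example.com/search` and the `q` parameter are made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and parameter, used only for illustration.
# Nothing is sent: we only build the request objects and inspect them.
url = "https://example.com/search"
params = urlencode({"q": "web scraping"})

# GET: parameters travel in the URL's query string.
get_request = Request(url + "?" + params, method="GET")

# POST: the payload travels in the request body, as bytes.
post_request = Request(url, data=params.encode("utf-8"), method="POST")

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.data)
```

A rough rule of thumb: if the information you send ends up visible in the URL, you are probably looking at a GET request; if it travels in the request body, it is probably a POST.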


## **Accessing a website**

When you enter a URL (e.g., www.brookings.edu) into your browser, several things happen:

1. DNS lookup
2. Initial HTTP request
3. Server response
4. Parsing HTML + additional requests
5. Assembling the page

### **Step 1: DNS Lookup**
Your browser first translates the human-readable domain name in the URL (e.g. brookings.edu in http://brookings.edu/) into an IP address (e.g. 137.135.107.235) using a Domain Name System (DNS) lookup.<sup id="ref07" class="footnote"><a href="" title="Fun fact: we talked about internet protocols in the previous section, but only described HTTP, which is most relevant to this application. DNS is another type of internet protocol.">[7]</a></sup> 

Think of the DNS search like the phone book that links a name of a store to a street address. The street address, in this case, is an IP address which points to the server where the website is hosted. <sup id="ref08" class="reference"><a href="https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/How_the_Web_works" title="Mozilla (2023). How the web works. Retrieved from https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/How_the_Web_works">[8]</a></sup> <sup id="ref09" class="reference"><a href="https://www.cloudflare.com/learning/dns/what-is-dns/" title="Cloudflare. What is DNS? Retrieved from https://www.cloudflare.com/learning/dns/what-is-dns/">[9]</a></sup> 
</p>
<p style="font-size:10px">
[7] Fun fact: we talked about internet protocols in the previous section, but only described HTTP, which is most relevant to this application. DNS is another type of internet protocol. <br>
[8] Mozilla (2023). How the web works. Retrieved from https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/How_the_Web_works <br>
[9] Cloudflare. What is DNS? Retrieved from https://www.cloudflare.com/learning/dns/what-is-dns/ <br>
</p>
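If you'd like to try a DNS lookup yourself, the sketch below shows one way to do it with Python's standard library. The helper names `hostname_from_url` and `resolve_ipv4` are my own, chosen for illustration; the lookup itself is delegated to your operating system's resolver through `socket.getaddrinfo`.

```python
import socket
from urllib.parse import urlparse


def hostname_from_url(url):
    """Return the domain name portion of a URL, which is what DNS resolves."""
    return urlparse(url).hostname


def resolve_ipv4(hostname):
    """Return the sorted, de-duplicated IPv4 addresses reported for a hostname."""
    results = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    # Each result is (family, type, proto, canonname, sockaddr);
    # for IPv4, sockaddr is an (address, port) pair.
    return sorted({sockaddr[0] for *_, sockaddr in results})


print(hostname_from_url("http://brookings.edu/"))
```

Calling `resolve_ipv4(hostname_from_url("http://brookings.edu/"))` performs a real lookup, so it needs an internet connection. The address you get back may differ from the example above, because DNS records change over time.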

### **Step 2: HTTP Request**
Now that your browser has the IP address of the website, it sends an HTTP request to the server at this IP address. This request asks for the main HTML file of the website.

```bash
curl -X GET 'https://www.brookings.edu/' \
  -H 'accept: text/html,application/xhtml+xml,application/xml;...' \
  -H 'accept-language: en-US,en;q=0.9' \
  -H 'cookie: _fbp=REDACTED; hubspotutk=REDACTED; ...' \
  -H 'priority: u=0, i' \
  -H 'referer: https://www.google.com/' \
  -H 'sec-ch-ua: "Google Chrome";v="REDACTED", "Chromium";v="REDACTED", "Not.A/Brand";v="REDACTED"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'sec-ch-ua-platform: "Windows"' \
  -H 'sec-fetch-dest: document' \
  -H 'sec-fetch-mode: navigate' \
  -H 'sec-fetch-site: cross-site' \
  -H 'sec-fetch-user: ?1' \
  -H 'upgrade-insecure-requests: 1' \
  -H 'user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36'
```

Can you identify what type of request this is?

### **Step 3: Server Response**
The server processes this request and sends back the requested HTML file. This file contains the basic structure of the webpage.

In [None]:

<!DOCTYPE html>
<html lang="en-US">

<head>
	<meta charset="UTF-8">
	<meta name="viewport" content="width=device-width, initial-scale=1.0, viewport-fit=cover">
	<meta name='robots' content='index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1' />
<link rel="alternate" href="https://www.brookings.edu/" hreflang="en" />
<link rel="alternate" href="https://www.brookings.edu/es/" hreflang="es" />
<link rel="alternate" href="https://www.brookings.edu/ar/" hreflang="ar" />
<link rel="alternate" href="https://www.brookings.edu/zh/" hreflang="zh" />
<link rel="alternate" href="https://www.brookings.edu/fr/" hreflang="fr" />
<link rel="alternate" href="https://www.brookings.edu/ko/" hreflang="ko" />
<link rel="alternate" href="https://www.brookings.edu/ru/" hreflang="ru" />

	<!-- This site is optimized with the Yoast SEO plugin v22.0 - https://yoast.com/wordpress/plugins/seo/ -->
	<title>Brookings - Quality. Independence. Impact.</title>
	<meta name="description" content="The Brookings Institution is a nonprofit public policy organization based in Washington, DC. Our mission is to conduct in-depth research that leads to new ideas for solving problems facing society at the local, national and global level." />
	<link rel="canonical" href="https://www.brookings.edu/" />
	<meta property="og:locale" content="en_US" />

### **Step 4: Parsing HTML + additional requests**
Your browser starts **parsing** the HTML file: reading its instructions to turn it into a user-friendly webpage. 

Oftentimes, this code contains references to additional external resources it needs to display the webpage, such as:

- **CSS (Cascading Style Sheets)**: To set default aesthetics like fonts, colors, and line spacing
- **JavaScript code**: To add interactivity and dynamic content
- **Images and videos**: To incorporate multimedia content
- **API (Application Programming Interface) responses**: To obtain data from servers, often in JSON format, to display on the webpage

As your browser encounters **additional references to files** in the HTML code, it **makes HTTP requests** to the server to retrieve them.
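To see what those additional references look like in practice, here is a small sketch that lists the external files an HTML document points to. It uses the standard library's `html.parser` to stay dependency-free; the `ResourceCollector` class is a hypothetical helper, and it only looks at three tag types, which is a simplification of what a real browser does.

```python
from html.parser import HTMLParser


class ResourceCollector(HTMLParser):
    """Collect the URLs of external resources referenced by an HTML document."""

    # Tag -> attribute that points at the resource (simplified on purpose).
    RESOURCE_ATTRS = {"link": "href", "script": "src", "img": "src"}

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        wanted = self.RESOURCE_ATTRS.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                self.resources.append((tag, value))


# A tiny HTML sample for illustration.
sample_html = """
<html><head>
  <link rel="stylesheet" href="css/styles.css">
  <script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
</head><body><h1>Scrape this website!</h1></body></html>
"""

collector = ResourceCollector()
collector.feed(sample_html)
for tag, url in collector.resources:
    print(tag, url)
```

Each URL printed here would trigger its own HTTP request when a browser loads the page.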

### **Step 5: Assembling the page**

After downloading all the external resources needed to build the webpage, your browser will compile and execute any JavaScript code that it received.

With all the downloaded elements in place, the browser processes the HTML and the CSS style sheets and combines them with other resources (such as downloaded fonts, photos, videos, and data downloaded from APIs) to **paint** the webpage to your screen.<sup id="ref05" class="reference"><a href="https://developer.mozilla.org/en-US/docs/Web/Performance/How_browsers_work" title="Mozilla (2023). Populating the page: how browsers work. Retrieved from https://developer.mozilla.org/en-US/docs/Web/Performance/How_browsers_work">[5]</a></sup> 
</p>
<p style="font-size:10px">
[5] Mozilla (2023). Populating the page: how browsers work. Retrieved from https://developer.mozilla.org/en-US/docs/Web/Performance/How_browsers_work <br>
</p>

![An image of the brookings.edu homepage](slides/brookings-edu-screenshot.png)

# Web Scraping Basics

There are several ways to collect data online. The strategy you choose must be tailored to the website in question.

## **Types of web scraping**

I like to categorize web data collection techniques by the point in the client-server interaction at which they operate.

1. **DNS lookup**

2. **Initial HTTP request**

3. **Server response**
   - **HTML scraping** parses the main HTML file for the webpage to extract data.

4. **Parsing HTML + additional requests**
   - Technically not classified as "web scraping," **APIs** are a neat way to access data. They are usually requested at this point in the client-server interaction.

5. **Assembling the page**
   - **Selenium** behaves like a browser to view content that is otherwise not available in HTML files, because it is dynamically rendered using JavaScript code.
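Part of what makes APIs neat is that the data arrives already structured. As a rough illustration, the sketch below parses a JSON payload with Python's built-in `json` module. The payload and its field names (`books`, `title`, `author`) are invented for this example; a real API defines its own structure.

```python
import json

# Hypothetical API response body; real field names depend on the API.
payload = """
{"books": [
    {"title": "1984", "author": "George Orwell"},
    {"title": "The Hobbit", "author": "J.R.R. Tolkien"}
]}
"""

# json.loads turns the text into ordinary Python dictionaries and lists.
data = json.loads(payload)
titles = [book["title"] for book in data["books"]]
print(titles)
```

Compare this with HTML scraping, where you first have to work out which tags and classes hold the values you want.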

# Sample code: HTML parsing

First, we'll walk through making HTTP requests and parsing HTML.


If you don't already have these packages installed, start by installing them.

Note: It's a best practice to use environments to control package dependencies.<sup id="ref10" class="footnote"><a href="https://python-poetry.org/" title="This project uses Poetry: https://python-poetry.org/">[10]</a></sup> 

</p>
<p style="font-size:10px">
[10] This project uses Poetry: https://python-poetry.org/ <br>
</p>

NOTE: If you're following along with this Jupyter notebook in your IDE (such as VSCode or JupyterLab), please skip the following terminal command. Instead, type `poetry shell` into the terminal.

In [None]:
pip install requests beautifulsoup4

Now we import the needed libraries:

In [17]:
import requests
from bs4 import BeautifulSoup

Time to get scraping. Remember the GET request we saw earlier? It carried a lot of information in the form of **headers**.

```bash
curl -X GET 'https://www.brookings.edu/' \
  -H 'accept: text/html,application/xhtml+xml,application/xml;...' \
  -H 'accept-language: en-US,en;q=0.9' \
  -H 'cookie: _fbp=REDACTED; hubspotutk=REDACTED; ...' \
  -H 'priority: u=0, i' \
  -H 'referer: https://www.google.com/' \
  -H 'sec-ch-ua: "Google Chrome";v="REDACTED", "Chromium";v="REDACTED", "Not.A/Brand";v="REDACTED"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'sec-ch-ua-platform: "Windows"' \
  -H 'sec-fetch-dest: document' \
  -H 'sec-fetch-mode: navigate' \
  -H 'sec-fetch-site: cross-site' \
  -H 'sec-fetch-user: ?1' \
  -H 'upgrade-insecure-requests: 1' \
  -H 'user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36'
```

As you can see, headers contain quite a bit of information. Unless otherwise required, I tend to keep my headers on the lighter side.

These are the headers I usually provide in my HTTP requests. Sometimes I get blocked, in which case I change the user-agent string slightly.

Note that headers must be formatted as a dictionary.

In [2]:
my_headers = {
    'User-Agent': (
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
        'AppleWebKit/537.36 (KHTML, like Gecko) '
        'Chrome/122.0.6261.112 Safari/537.36'
    ),
    'Accept': (
        'text/html,application/xhtml+xml,application/xml;q=0.9,'
        'image/avif,image/webp,image/apng,*/*;q=0.8,'
        'application/signed-exchange;v=b3;q=0.7'
    )
}

Today, we'll be scraping a website that I created to practice on! Our URL is:
https://lorae.github.io/web-scraping-tutorial/

In [28]:
my_url = "https://lorae.github.io/web-scraping-tutorial/"

[insert screenshot here]

We're interested specifically in scraping the titles and authors of the books that appear on the table in the webpage. 

In order to do this, we have to first get the HTML file for the website.

In these next steps, we bundle our arguments together into a `Request` object from the `requests` library. We then initialize the session, prepare the request for sending, and save the response.

In [29]:
session_arguments = requests.Request(method='GET', 
                                     url=my_url, 
                                     headers=my_headers)
session = requests.Session()
prepared_request = session.prepare_request(session_arguments)
response: requests.Response = session.send(prepared_request)

Let's see what the response was. A response code of 200 indicates a successful response, with the server returning the requested resource.

In [30]:
print(response.status_code)

200
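You won't always get a 200. As a rough guide, the first digit of a status code tells you the category of the response. The sketch below turns a code into a readable description using the standard library's `http.HTTPStatus`; the `describe_status` helper is my own name, not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short human-readable description of an HTTP status code."""
    categories = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    category = categories.get(code // 100, "unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # The code is not one that the standard library knows about.
        phrase = "Unrecognized status"
    return f"{code} {phrase} ({category})"


for code in (200, 404, 503):
    print(describe_status(code))
```

In a scraper, you could pass `response.status_code` to this helper. The `requests` library also provides `response.raise_for_status()`, which raises an exception for 4xx and 5xx responses.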


More interestingly, let's take a peek at the text.

In [31]:
print(response.text)

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Web Scraping Resources</title>
    <link rel="stylesheet" href="css/styles.css">
    <link href="https://fonts.googleapis.com/css2?family=Roboto:wght@400;700&display=swap" rel="stylesheet">
    <script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/PapaParse/5.3.0/papaparse.min.js"></script>
</head>
<body>
    <h1>Scrape this website!</h1>
    <p>Welcome! On this website, you'll find a variety of opportunities to practice your skills web scraping with Python. If you'd like to learn more about how to use this page, please visit my repository, 
        <a href="https://github.com/lorae/web-scraping-tutorial" target="_blank">lorae/web-scraping-tutorial</a>
        .
    </p>
    <h2>Reading list</h2>
    <table>
        <tr class="header-row">
            <th class="header-cel

Magnificent! We now have the HTML file of the webpage. Next, we want to extract only the data of interest. How do we do this?

**Inspect element**.


![Inspect element screenshot](slides/inspect-element-static-screenshot.png)

The data we want is contained in a `<tr>` element with class `data-row`. And after further inspection, we find that the title is in a `<td>` child element with classes `data-cell` and `title`. The author is a sibling `<td>` element with classes `data-cell` and `author`.

![Inspect sub-element screenshot](slides/inspect-sub-element-static-screenshot.png)

Let's parse the HTML of the response from the web server.

In [32]:
soup = BeautifulSoup(response.content, 'html.parser')
print(soup)

<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<title>Web Scraping Resources</title>
<link href="css/styles.css" rel="stylesheet"/>
<link href="https://fonts.googleapis.com/css2?family=Roboto:wght@400;700&amp;display=swap" rel="stylesheet"/>
<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/PapaParse/5.3.0/papaparse.min.js"></script>
</head>
<body>
<h1>Scrape this website!</h1>
<p>Welcome! On this website, you'll find a variety of opportunities to practice your skills web scraping with Python. If you'd like to learn more about how to use this page, please visit my repository, 
        <a href="https://github.com/lorae/web-scraping-tutorial" target="_blank">lorae/web-scraping-tutorial</a>
        .
    </p>
<h2>Reading list</h2>
<table>
<tr class="header-row">
<th class="header-cell">Title</th>
<th class="header-cell">Author</th>
<th c

Once it's a BeautifulSoup object, it's pretty easy to get the data you want. The key is selecting the right elements using the correct tags. 

We'll use `for` loops for this.

In [54]:
# Select elements corresponding to table rows
elements = soup.select('tr.data-row')

# Initialize lists for data output
Titles = []
Authors = []

for el in elements:
    title = el.find('td', class_='title').text
    author = el.find('td', class_='author').text
    
    # Add data to lists
    Titles.append(title)
    Authors.append(author)

print(Titles)
print(Authors)

['The Adventures of Tom Sawyer', 'Pride and Prejudice', '1984', 'The Great Gatsby', 'To Kill a Mockingbird', 'Moby-Dick', 'War and Peace', 'Crime and Punishment', 'The Catcher in the Rye', "Harry Potter and the Sorcerer's Stone", 'The Hobbit', 'The Lord of the Rings', 'The Chronicles of Narnia', 'Anne of Green Gables']
['Mark Twain', 'Jane Austen', 'George Orwell', 'F. Scott Fitzgerald', 'Harper Lee', 'Herman Melville', 'Leo Tolstoy', 'Fyodor Dostoevsky', 'J.D. Salinger', 'J.K. Rowling', 'J.R.R. Tolkien', 'J.R.R. Tolkien', 'C.S. Lewis', 'Lucy Maud Montgomery']
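Two parallel lists are fine for a demo, but you will usually want to save the results. Here is one possible sketch that pairs titles with authors and writes them out as CSV text using the standard library. The `to_csv` helper is a hypothetical name, and the sample lists are just the first three entries from the output above.

```python
import csv
import io


def to_csv(titles, authors):
    """Pair each title with its author and return the result as CSV text."""
    if len(titles) != len(authors):
        raise ValueError("titles and authors must have the same length")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["title", "author"])
    writer.writerows(zip(titles, authors))
    return buffer.getvalue()


# The first three scraped entries, copied here so the example runs on its own.
sample_titles = ['The Adventures of Tom Sawyer', 'Pride and Prejudice', '1984']
sample_authors = ['Mark Twain', 'Jane Austen', 'George Orwell']

print(to_csv(sample_titles, sample_authors))
```

The length check is there because a row with a missing cell would otherwise silently shift every later author onto the wrong title.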


Wow, we're pros! Should we try another one? Let's get titles and links of learning resources listed on the website. First we **inspect element** to find the element and class:

![Inspect element screenshot](slides/inspect-element-javascript-screenshot.png)

Then we determine that our data in question is in a `<div>` element with class `digest-card`. The title is in a child `<div>` element with class `digest-card__title`.

![Inspect sub-element screenshot](slides/inspect-sub-element-javascript-screenshot.png)

In [55]:
# Scrape titles and links
elements = soup.select('div.digest-card__title a')
# Initialize lists
Titles = []
Links = []
for el in elements:
    print(el)
    # Obtain the link to the resource
    link = el['href']  # 'href' is the HTML attribute that holds a link's URL.
    # Obtain the title of the resource
    title = el.text

    # Append the entries to each list
    Titles.append(title)
    Links.append(link)

# Print the results
print(Titles)
print(Links)

[]
[]


Why didn't this work?

This part of the webpage is rendered using JavaScript! This is becoming an increasingly common occurrence on today's web, which is why pure HTML scraping is becoming less and less feasible.

How do you know JavaScript is the culprit?

- your code has no syntax errors yet doesn't pick up elements from the HTML code
- **View page source** shows no hard-coded elements
- **Network requests** reveal the JavaScript files used to populate the page
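The "View page source" check can also be done in code. This rough sketch looks for the class names you found with inspect element in the raw HTML returned by the server. The `find_missing_markers` helper is a hypothetical name, and the HTML string is a shortened stand-in for what you would get from `response.text`.

```python
def find_missing_markers(html, markers):
    """Return the markers (e.g. class names) that never appear in the raw HTML."""
    return [marker for marker in markers if marker not in html]


# Shortened stand-in for the server's raw HTML (normally response.text).
static_html = (
    '<table><tr class="data-row">'
    '<td class="data-cell title">1984</td>'
    '<td class="data-cell author">George Orwell</td>'
    '</tr></table>'
)

missing = find_missing_markers(static_html, ["data-row", "digest-card"])
print(missing)
```

A class that shows up in inspect element but is missing from the raw HTML is a strong hint that JavaScript added it after the page loaded. It is only a hint, though: a plain substring search can also match text inside inline scripts or comments.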

# Sample code: Selenium

Selenium is a tool in Python (and many other programming languages) that allows users to access dynamic web content by automating web browser interactions. It simulates a real user browsing the web, which enables it to handle JavaScript-rendered content and other dynamic elements that static network requests cannot easily capture.

## But first, a note...

Selenium is just a tool to *get* an HTML file. Once you obtain the file, you parse it exactly the same way as we did in the previous section: using a tool of choice, like Beautiful Soup.

You've already learned the gritty basics of HTML parsing. The hard part in this section is setting up the web driver.
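Here is a minimal sketch of what that setup could look like. It is not the code from the live demo. It assumes Selenium 4 and Google Chrome are installed, and the helper names `make_headless_chrome` and `fetch_rendered_html` are my own. The Selenium imports sit inside the functions so that the file can be loaded even on a machine without Selenium.

```python
def make_headless_chrome():
    """Create a Chrome driver with no visible window (assumes Selenium 4 + Chrome)."""
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")  # run Chrome without opening a window
    return webdriver.Chrome(options=options)


def fetch_rendered_html(driver, url, wait_for_css=None, timeout=10):
    """Load a page in the browser and return its HTML after rendering."""
    driver.get(url)
    if wait_for_css is not None:
        from selenium.webdriver.common.by import By
        from selenium.webdriver.support import expected_conditions as EC
        from selenium.webdriver.support.ui import WebDriverWait

        # Wait until JavaScript has added at least one matching element.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_for_css))
        )
    return driver.page_source
```

One possible way to use it on the practice site:

```python
driver = make_headless_chrome()
try:
    html = fetch_rendered_html(
        driver,
        "https://lorae.github.io/web-scraping-tutorial/",
        wait_for_css="div.digest-card",
    )
finally:
    driver.quit()  # always close the browser, even if something fails
```

From there, `BeautifulSoup(html, 'html.parser')` gives you a soup object to parse just like before.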

## Live demo

Follow along with this code: [LINK HERE!!!]

# Additional Resources
Links to resources (both internal Brookings resources and publicly available ones) that will provide additional context/training.

Note to self: Make the slides some sort of table that easily indicate which resources are internal and which aren't

## If you want a light introduction to web scraping
[DistrictDataLabs/brookings](https://github.com/DistrictDataLabs/brookings/) **Publicly available** GitHub repository presented to Brookings Data Network on March 31, 2017 which has slides and sample code. The resource is older, but the concepts are still relevant. [NOTE: I haven't tested the code, so I'm not sure if it still works]

Keywords: Publicly available datasets, common data formats, RESTful APIs, HTTP requests, web scraping vs web crawling

## If you want a more involved introduction to web scraping
[trainingNotebook.ipynb](https://brookingsinstitution.sharepoint.com/:f:/r/sites/BrookingsDataNetwork/Shared%20Documents/Python/2017-05%20-%20Getting%20Started%20with%20Web%20Scraping?csf=1&web=1&e=d2sfux) Curtlyn Kramer's **Brookings internal** Jupyter notebook, presented to the Brookings Data Network on May 26, 2017. This is only useful if the code works, so I will have to test if the code works. If it does, then this is a helpful Jupyter notebook that walks you through the steps.

Keywords: inspect element, BeautifulSoup4

## Extra material goes down here


### What is a web browser?
A web browser (like Chrome, Safari, Edge, or Firefox) is specialized software designed to fetch and display web resources. Web resources are usually in the form of HTML documents, but can also be PDFs, images, or other documents. 


**Web scraping**: The act of systematically extracting data from an online resource, such as a website.
**Web crawling**: (use the explanation given by District Data Lab on /brookings repo)