To build this slide deck, run the following command in a terminal or PowerShell:

In [None]:
jupyter nbconvert 'slides_content/advanced-web-scraping.ipynb' --to slides --output='../advanced-web-scraping'

# Web Scraping with Python

<br>

### Lorae Stojanovic

June 20, 2024


# Agenda
1. [How does a website work?](#/2)
    - [Key terminology](#/2/1)
    - [Accessing a website](#/2/5)
2. [Web scraping basics](#/3)
    - [Types of web scraping](#/3/1)
3. [Sample code: HTTP requests & HTML parsing](#/4)
    - [HTTP requests with `requests`](#/4/2)
    - [HTML parsing with `beautifulsoup4`](#/4/9)
4. [Sample code: Selenium](#/5)
5. [Sample code: APIs](#/6)
    - [Viewing network requests](#/6/1)
    - [Monitoring network requests](#/6/6)
    - [Using `requests` to access APIs](#/6/10)

**By the end of this presentation, you will:**

- Understand how your browser interacts with the internet
- Be able to gather data from the internet using 3 methods:
    - HTTP requests + HTML parsing
    - Selenium + HTML parsing
    - API requests
- Understand the advantages and shortfalls of each method

# How does a website work?

A lot goes on behind the scenes when you view a website like https://www.brookings.edu/.

Understanding how your computer interacts with remote resources will help you become a more capable data collector.

## **Key terminology**
### **Client**
"Clients are the typical **web user's internet-connected devices** (for example, your computer connected to your Wi-Fi, or your phone connected to your mobile network) and **web-accessing software available on those devices** (usually a web browser like Firefox or Chrome)."<sup id="ref01" class="reference"><a href="https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/How_the_Web_works" title="Mozilla (2023). How the web works. Retrieved from https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/How_the_Web_works">[1]</a></sup> 

### **Server**
"Servers are **computers that store webpages, sites, or apps**. When a client device wants to access a webpage, a copy of the webpage is downloaded from the server onto the client machine to be displayed in the user's web browser."<sup id="ref02" class="reference"><a href="https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/How_the_Web_works" title="Mozilla (2023). How the web works. Retrieved from https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/How_the_Web_works">[2]</a></sup> 

</p>
<p style="font-size:10px">
[1] Mozilla (2023). How the web works. Retrieved from https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/How_the_Web_works <br>
[2] Ibid.
</p>

Clients send **requests** for information from servers. The servers send back **responses**.
![A schematic illustrating the relationship between clients and servers](slides_content/images/client-server-request-response.png)

### **HTTP web requests**

There are many types of internet *protocols* that you can use to make requests,<sup id="ref03" class="reference"><a href="https://www.geeksforgeeks.org/types-of-internet-protocols/" title="GeeksforGeeks (2023). Types of Internet Protocols. Retrieved from https://www.geeksforgeeks.org/types-of-internet-protocols/">[3]</a></sup> but when we web scrape, we will concern ourselves primarily with **HTTP (Hypertext Transfer Protocol)**. HTTP defines 9 request methods:<sup id="ref04" class="reference"><a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods" title="Mozilla (2023). HTTP request methods. Retrieved from https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods">[4]</a></sup> 

- GET
- HEAD
- POST
- PUT
- DELETE
- CONNECT
- OPTIONS
- TRACE
- PATCH


</p>
<p style="font-size:10px">
[3] GeeksforGeeks (2023). Types of Internet Protocols. Retrieved from https://www.geeksforgeeks.org/types-of-internet-protocols/ <br>
[4] Mozilla (2023). HTTP request methods. Retrieved from https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods <br>
</p>

Luckily for us web scrapers, we do not need to memorize all nine. Ninety-nine percent of the time, we will only concern ourselves with **GET** and **POST** requests.

- "The **GET** method requests a representation of the specified resource. Requests using GET should only retrieve data."<sup id="ref05" class="reference"><a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods" title="Mozilla (2023). HTTP request methods. Retrieved from https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods">[5]</a></sup> 
- "The **POST** method submits an entity to the specified resource, often causing a change in state or side effects on the server."<sup id="ref06" class="reference"><a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods" title="Mozilla (2023). HTTP request methods. Retrieved from https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods">[6]</a></sup> 

Don't worry if you're confused. It's easy to tell which type you need to use. And today, we'll do a demo using both.

</p>
<p style="font-size:10px">
[5] Mozilla (2023). HTTP request methods. Retrieved from https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods <br>
[6] Ibid. <br>
</p>
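To make the GET/POST distinction concrete, here is a minimal sketch that builds one request of each kind without sending anything. It uses Python's standard-library `urllib` only so that it runs without installing packages; the endpoint `https://example.com/search` and the parameters `q` and `page` are hypothetical placeholders, not part of the demo site.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint, used only for illustration. Nothing is sent over the network.
BASE_URL = "https://example.com/search"
parameters = {"q": "web scraping", "page": 1}

# GET: the parameters travel in the URL's query string.
get_request = Request(f"{BASE_URL}?{urlencode(parameters)}", method="GET")

# POST: the parameters travel in the request body instead.
post_request = Request(BASE_URL, data=urlencode(parameters).encode("utf-8"), method="POST")

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url, post_request.data)
```

In practice, the network requests pane in your browser tells you which of the two a website uses, so you rarely have to guess.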


## **Accessing a website**

When you enter a URL (e.g., www.brookings.edu) into your browser, several things happen:

1. DNS lookup
2. Initial HTTP request
3. Server response
4. Parsing HTML + additional requests
5. Assembling the page

### **Step 1: DNS Lookup**
Your browser first translates the domain name in the human-readable URL (e.g. brookings.edu in http://brookings.edu/) into an IP address (e.g. 137.135.107.235) using a Domain Name System (DNS) lookup.<sup id="ref07" class="footnote"><a href="" title="Fun fact: we talked about internet protocols in the previous section, but only described HTTP, which is most relevant to this application. DNS is another type of internet protocol.">[7]</a></sup> 

Think of DNS as a phone book that links the name of a store to its street address. The street address, in this case, is an IP address, which points to the server where the website is hosted. <sup id="ref08" class="reference"><a href="https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/How_the_Web_works" title="Mozilla (2023). How the web works. Retrieved from https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/How_the_Web_works">[8]</a></sup> <sup id="ref09" class="reference"><a href="https://www.cloudflare.com/learning/dns/what-is-dns/" title="Cloudflare. What is DNS? Retrieved from https://www.cloudflare.com/learning/dns/what-is-dns/">[9]</a></sup> 
</p>
<p style="font-size:10px">
[7] Fun fact: we talked about internet protocols in the previous section, but only described HTTP, which is most relevant to this application. DNS is another type of internet protocol. <br>
[8] Mozilla (2023). How the web works. Retrieved from https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/How_the_Web_works
<br>
[9] Cloudflare. What is DNS? Retrieved from https://www.cloudflare.com/learning/dns/what-is-dns/<br>
</p>
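The lookup step can be sketched with Python's standard library. The two helper functions below, `hostname_from_url` and `resolve`, are hypothetical names written for this illustration. The first isolates the part of a URL that DNS actually resolves; the second asks your operating system's resolver for an address and therefore needs network access, so it is defined here but not called.

```python
import socket
from urllib.parse import urlparse


def hostname_from_url(url: str) -> str:
    """Return the part of a URL that DNS resolves (the domain name)."""
    hostname = urlparse(url).hostname
    if hostname is None:
        raise ValueError(f"No hostname found in {url!r}")
    return hostname


def resolve(hostname: str) -> str:
    """Ask the operating system's resolver for an IPv4 address (needs network access)."""
    return socket.gethostbyname(hostname)


host = hostname_from_url("http://brookings.edu/")
print(host)
```

Calling `resolve(host)` performs the real lookup. The address it returns may differ from the example above, since websites can change hosts and are often served from several IP addresses.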

### **Step 2: HTTP Request**
Now that your browser has the IP address of the website, it sends an HTTP request to the server at this IP address. This request asks for the main HTML file of the website.

```bash
curl -X GET 'https://www.brookings.edu/' \
  -H 'accept: text/html,application/xhtml+xml,application/xml;...' \
  -H 'accept-language: en-US,en;q=0.9' \
  -H 'cookie: _fbp=REDACTED; hubspotutk=REDACTED; ...' \
  -H 'priority: u=0, i' \
  -H 'referer: https://www.google.com/' \
  -H 'sec-ch-ua: "Google Chrome";v="REDACTED", "Chromium";v="REDACTED", "Not.A/Brand";v="REDACTED"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'sec-ch-ua-platform: "Windows"' \
  -H 'sec-fetch-dest: document' \
  -H 'sec-fetch-mode: navigate' \
  -H 'sec-fetch-site: cross-site' \
  -H 'sec-fetch-user: ?1' \
  -H 'upgrade-insecure-requests: 1' \
  -H 'user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36'
```

Can you identify what type of request this is?

### **Step 3: Server Response**
The server processes this request and sends back the requested HTML file. This file contains the basic structure of the webpage.

In [None]:

<!DOCTYPE html>
<html lang="en-US">

<head>
	<meta charset="UTF-8">
	<meta name="viewport" content="width=device-width, initial-scale=1.0, viewport-fit=cover">
	<meta name='robots' content='index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1' />
<link rel="alternate" href="https://www.brookings.edu/" hreflang="en" />
<link rel="alternate" href="https://www.brookings.edu/es/" hreflang="es" />
<link rel="alternate" href="https://www.brookings.edu/ar/" hreflang="ar" />
<link rel="alternate" href="https://www.brookings.edu/zh/" hreflang="zh" />
<link rel="alternate" href="https://www.brookings.edu/fr/" hreflang="fr" />
<link rel="alternate" href="https://www.brookings.edu/ko/" hreflang="ko" />
<link rel="alternate" href="https://www.brookings.edu/ru/" hreflang="ru" />

	<!-- This site is optimized with the Yoast SEO plugin v22.0 - https://yoast.com/wordpress/plugins/seo/ -->
	<title>Brookings - Quality. Independence. Impact.</title>
	<meta name="description" content="The Brookings Institution is a nonprofit public policy organization based in Washington, DC. Our mission is to conduct in-depth research that leads to new ideas for solving problems facing society at the local, national and global level." />
	<link rel="canonical" href="https://www.brookings.edu/" />
	<meta property="og:locale" content="en_US" />

### **Step 4: Parsing HTML + additional requests**
Your browser starts **parsing** the HTML file: reading its instructions to turn it into a user-friendly webpage. 

Oftentimes, this code contains references to additional external resources it needs to display the webpage, such as:

- **CSS (Cascading Style Sheets)**: To set default aesthetics like fonts, colors, and line spacing
- **JavaScript code**: To add interactivity and dynamic content
- **Images and videos**: To incorporate multimedia content
- **API (Application Programming Interface) responses**: To obtain data from servers, often in JSON format, to display on the webpage

As your browser encounters **additional references to files** in the HTML code, it **makes HTTP requests** to the server to retrieve them.
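As a rough approximation of this step, the sketch below reads a snippet of HTML and lists the external files it references. It uses the standard-library `html.parser` module so that it runs without extra packages; `ResourceFinder` is a hypothetical class written for this illustration, and a real browser applies many more rules than these two.

```python
from html.parser import HTMLParser


class ResourceFinder(HTMLParser):
    """Collect URLs that a browser would likely request after reading the HTML."""

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # Stylesheets are referenced through <link href="...">
        if tag == "link" and "href" in attributes:
            self.resources.append(attributes["href"])
        # Scripts and images are referenced through a src attribute
        elif tag in ("script", "img") and "src" in attributes:
            self.resources.append(attributes["src"])


# A shortened <head> section in the style of the pages shown in this deck
sample_html = """
<head>
    <link rel="stylesheet" href="web_content/css/styles.css">
    <script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
</head>
"""

finder = ResourceFinder()
finder.feed(sample_html)
print(finder.resources)
```

Each URL in the resulting list corresponds to one additional HTTP request.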

### **Step 5: Assembling the page**

After downloading all the external resources needed to build the webpage, your browser will compile and execute any JavaScript code that it received.

With all the downloaded elements in place, the browser processes the HTML, the CSS style sheets, and combines it with other resources (such as downloaded fonts, photos, videos, and data downloaded from APIs) to **paint** the webpage to your screen.<sup id="ref10" class="reference"><a href="https://developer.mozilla.org/en-US/docs/Web/Performance/How_browsers_work" title="Mozilla (2023). Populating the page: how browsers work. Retrieved from https://developer.mozilla.org/en-US/docs/Web/Performance/How_browsers_work">[10]</a></sup> 
</p>
<p style="font-size:10px">
[10] Mozilla (2023). Populating the page: how browsers work. Retrieved from https://developer.mozilla.org/en-US/docs/Web/Performance/How_browsers_work <br>
</p>

![An image of the brookings.edu homepage](slides_content/images/brookings-edu-screenshot.png)

# Web scraping basics

There are several ways to collect data online. The strategy you choose must be tailored to the website in question.

## **Types of web scraping**

I like to categorize web data collection techniques by the point in the client-server interaction at which they operate.

1. **DNS lookup**

2. **Initial HTTP request**

3. **Server response**
   - **HTML scraping** parses the main HTML file for the webpage to extract data.

4. **Parsing HTML + additional requests**
   - Technically not classified as "web scraping," **APIs** are a neat way to access data. They are usually requested at this point in the client-server interaction.

5. **Assembling the page**
   - **Selenium** behaves like a browser to view content that is otherwise not available in HTML files, because it is dynamically rendered using JavaScript code.

# Sample code: Web requests & HTML parsing

Making HTTP requests with the Python `requests` library and parsing the HTML responses are two foundational skills in our web scraping toolkit; they will help us tackle more complicated tasks later.

## Live demo

Follow along with the [code](https://raw.githubusercontent.com/lorae/web-scraping-tutorial/main/sample_code/requests_bs4_scraping.py)

File path: `web-scraping-tutorial/sample_code/requests_bs4_scraping.py`


## HTTP requests with `requests`

If you don't already have these packages installed, start by installing them.<sup id="ref11" class="footnote"><a href="https://python-poetry.org/" title="It's a best practice to use environments to control package dependencies. This project uses Poetry: https://python-poetry.org/">[11]</a></sup> 

To install the packages, type the following command into your terminal:

NOTE: If you're following along with this Jupyter notebook in your IDE (such as VSCode or JupyterLab), please skip the following terminal command. Instead, type `poetry shell` into the terminal.

In [None]:
pip install requests beautifulsoup4

Now that we've confirmed installation, we import the needed libraries.

In [1]:
import requests
from bs4 import BeautifulSoup

</p>
<p style="font-size:10px">
[11] It's a best practice to use environments to control package dependencies. This project uses Poetry: https://python-poetry.org/ <br>
</p>

Time to get scraping. Recall the earlier GET request we saw? It had a lot of information in it, in the form of **headers**.

```bash
curl -X GET 'https://www.brookings.edu/' \
  -H 'accept: text/html,application/xhtml+xml,application/xml;...' \
  -H 'accept-language: en-US,en;q=0.9' \
  -H 'cookie: _fbp=REDACTED; hubspotutk=REDACTED; ...' \
  -H 'priority: u=0, i' \
  -H 'referer: https://www.google.com/' \
  -H 'sec-ch-ua: "Google Chrome";v="REDACTED", "Chromium";v="REDACTED", "Not.A/Brand";v="REDACTED"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'sec-ch-ua-platform: "Windows"' \
  -H 'sec-fetch-dest: document' \
  -H 'sec-fetch-mode: navigate' \
  -H 'sec-fetch-site: cross-site' \
  -H 'sec-fetch-user: ?1' \
  -H 'upgrade-insecure-requests: 1' \
  -H 'user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36'
```

As you can see, headers contain quite a bit of information. Unless otherwise required, I try to keep my headers on the lighter side.

These are the headers I usually provide in my HTTP requests. Sometimes, I get blocked - in which case, I change the user-agent string slightly.

Note that headers must be formatted as a dictionary.

In [2]:
my_headers = {
    'User-Agent': (
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
        'AppleWebKit/537.36 (KHTML, like Gecko) '
        'Chrome/122.0.6261.112 Safari/537.36'
    ),
    'Accept': (
        'text/html,application/xhtml+xml,application/xml;q=0.9,'
        'image/avif,image/webp,image/apng,*/*;q=0.8,'
        'application/signed-exchange;v=b3;q=0.7'
    )
}
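If you copy headers out of your browser as `name: value` lines, a small helper can convert them into dictionary form. This is a minimal sketch: `headers_from_lines` is a hypothetical helper written for this illustration, not a function provided by `requests`.

```python
def headers_from_lines(raw: str) -> dict[str, str]:
    """Turn 'name: value' lines into a headers dictionary."""
    headers = {}
    for line in raw.strip().splitlines():
        # Split on the first colon only, because values (such as URLs) can contain colons
        name, _, value = line.partition(":")
        if value:
            headers[name.strip()] = value.strip()
    return headers


copied_text = """
accept-language: en-US,en;q=0.9
referer: https://www.google.com/
upgrade-insecure-requests: 1
"""

copied_headers = headers_from_lines(copied_text)
print(copied_headers)
```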

Today, we'll be scraping a website that I created to practice on! Our URL is:
https://lorae.github.io/web-scraping-tutorial/

In [3]:
my_url = "https://lorae.github.io/web-scraping-tutorial/"

![A screenshot of the web scraping tutorial website.](slides_content/images/website-screenshot.png)


We're interested specifically in scraping the company names and profit per employee of the entries in the table in the webpage. 

In order to do this, we have to first get the HTML file for the website.

In these next steps, we bundle our arguments together into an instance of the `Request` object from the `requests` library. We then initialize the session, prepare the request for sending, and save the response.

In [4]:
session_arguments = requests.Request(method='GET', 
                                     url=my_url, 
                                     headers=my_headers)
session = requests.Session()
prepared_request = session.prepare_request(session_arguments)
response: requests.Response = session.send(prepared_request)

Let's see what the response was. A response code of 200 indicates a successful response, with the server returning the requested resource.

In [5]:
print(response.status_code)

200


Success!
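Not every response will be a 200. As a hedged illustration, the sketch below uses the standard-library `http.HTTPStatus` enum to describe a few codes you are likely to meet while scraping; a blocked request, for instance, often comes back as 403 or 429. `describe_status` is a hypothetical helper written for this example.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short, human-readable summary of an HTTP status code."""
    phrase = HTTPStatus(code).phrase
    if 200 <= code < 300:
        category = "success"
    elif 300 <= code < 400:
        category = "redirect"
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "informational"
    return f"{code} {phrase} ({category})"


for code in (200, 403, 404, 429, 503):
    print(describe_status(code))
```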

More interestingly, let's look at the response content.

In [6]:
print(response.text)

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Web Scraping Resources</title>
    <link rel="stylesheet" href="web_content/css/styles.css">
    <link href="https://fonts.googleapis.com/css2?family=Roboto:wght@400;700&display=swap" rel="stylesheet">
    <script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/PapaParse/5.3.0/papaparse.min.js"></script>
</head>
<body>
    <div class="container">
        <h1>Scrape this website!</h1>
        <p>Welcome! Are you interested in learning how to gather data from the internet? This website was designed as a trial ground to practice skills covered in Lorae Stojanovic's presentation to the Brookings Data Network on June 20, 2024, <a href="https://lorae.github.io/web-scraping-tutorial/advanced-web-scraping.slides.html">"Web Scraping with Python"</a>. The presentation includes <a href

## HTML parsing with `beautifulsoup4`

The response from the website may look like a mess, but don't worry. **There's a package that makes picking data from the HTML code easy: It's called `beautifulsoup4`.** We'll use it in conjunction with a helpful browser tool called "**Inspect element**".

(And, later in this presentation, we'll use "**View page source**" and "**Network requests**": Two other old favorites.) 

**Simply right click anywhere on the website with your browser open, and select the "Inspect element" option.**

![Inspect element screenshot](slides_content/images/inspect-element-static-screenshot.png)

The data we want is contained in a `<tr>` element with class `data-row`. And after further inspection, we find that the company name is in a `<td>` child element with classes `data-cell` and `company`. The profit per employee is a sibling `<td>` element with classes `data-cell` and `profit-per-employee`.

![Inspect sub-element screenshot](slides_content/images/inspect-sub-element-static-screenshot.png)

Let's parse the HTML of the response from the web server.

In [7]:
soup = BeautifulSoup(response.content, 'html.parser')
print(soup)

<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<title>Web Scraping Resources</title>
<link href="web_content/css/styles.css" rel="stylesheet"/>
<link href="https://fonts.googleapis.com/css2?family=Roboto:wght@400;700&amp;display=swap" rel="stylesheet"/>
<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/PapaParse/5.3.0/papaparse.min.js"></script>
</head>
<body>
<div class="container">
<h1>Scrape this website!</h1>
<p>Welcome! Are you interested in learning how to gather data from the internet? This website was designed as a trial ground to practice skills covered in Lorae Stojanovic's presentation to the Brookings Data Network on June 20, 2024, <a href="https://lorae.github.io/web-scraping-tutorial/advanced-web-scraping.slides.html">"Web Scraping with Python"</a>. The presentation includes <a href="https://lorae.github.io/web-scraping-

Once it's a BeautifulSoup object, it's pretty easy to get the data you want. The key is selecting the right elements using the correct tags. 

We'll use `for` loops for this.

In [8]:
# Select elements corresponding to table rows
elements = soup.select('tr.data-row')

# Initialize lists for data output
Companies = []
PPEs = []

for el in elements:
    company = el.find('td', class_='company').text
    ppe = el.find('td', class_='profit-per-employee').text
    
    # Add data to lists
    Companies.append(company)
    PPEs.append(ppe)

print(Companies)
print(PPEs)

['ConocoPhillips', 'Fannie Mae', 'Freddie Mac', 'Valero', 'Occidental Petroleum', 'Cheniere Energy', 'ExxonMobil', 'Phillips 66', 'Marathon Petroleum', 'Chevron', 'PBF Energy', 'Enterprise Products', 'Apple', 'Broadcom', 'HF Sinclair', 'D. R. Horton', 'AIG', 'Lennar', 'Energy Transfer', 'Pfizer', 'Netflix', 'Microsoft', 'Alphabet', 'Meta', 'Qualcomm']
['$1,970,000', '$1,510,000', '$1,190,000', '$1,180,000', '$1,110,000', '$921,000', '$899,000', '$848,000', '$815,000', '$809,000', '$798,000', '$752,000', '$609,000', '$575,000', '$560,000', '$433,000', '$392,000', '$384,000', '$379,000', '$378,000', '$351,000', '$329,000', '$315,000', '$268,000', '$254,000']
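The scraped values are still strings. A natural next step, sketched below under the assumption that every value follows the `$1,234,000` pattern seen above, is to convert them to integers and pair each with its company. `parse_dollars` is a hypothetical helper, and only the first three rows are repeated here to keep the example short.

```python
# The first three entries from the output above
companies = ['ConocoPhillips', 'Fannie Mae', 'Freddie Mac']
ppes = ['$1,970,000', '$1,510,000', '$1,190,000']


def parse_dollars(text: str) -> int:
    """Convert a string such as '$1,970,000' to the integer 1970000."""
    return int(text.strip().replace('$', '').replace(',', ''))


# Pair each company with its numeric profit per employee
profit_by_company = {
    company: parse_dollars(ppe) for company, ppe in zip(companies, ppes)
}
print(profit_by_company)
```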


Wow, we're pros! Should we try another one? Let's get titles and links of learning resources listed on the website. First we **inspect element** to find the element and class:

![Inspect element screenshot](slides_content/images/inspect-element-javascript-screenshot.png)

Data on web scraping learning resources are contained in a `<div>` element with class `digest-card`. The title of the resource is in a child `<div>` element with class `digest-card__title`: more specifically, the text is stored in an `<a>` element, with the hyperlink stored in the `href` attribute. 

![Inspect sub-element screenshot](slides_content/images/inspect-sub-element-javascript-screenshot.png)

In [55]:
# Scrape titles and links
elements = soup.select('div.digest-card__title a')
# Initialize lists
Titles = []
Links = []
for el in elements:
    print(el)
    # Obtain the link to the resource
    link = el['href']  # The 'href' attribute stores the hyperlink's URL.
    # Obtain the title of the resource
    title = el.text

    # Append the entries to each list
    Titles.append(title)
    Links.append(link)

# Print the results
print(Titles)
print(Links)

[]
[]


Why didn't this work?

This part of the webpage is rendered using JavaScript! This is becoming an increasingly common occurrence on today's web, which is why pure HTML scraping is becoming less and less feasible.

How do you know JavaScript is the culprit?

- Your code has no syntax errors yet doesn't pick up elements from the HTML code
- "**View page source**" shows no hard-coded elements
- "**Network requests**" reveal the JavaScript files used to populate the page
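The second check can also be done in code: search the raw HTML that the server sent for the class name you are trying to select. The sketch below is a minimal illustration. `appears_in_source` is a hypothetical helper, and `raw_html` is an invented, heavily shortened stand-in for `response.text`, not the actual page source.

```python
def appears_in_source(raw_html: str, marker: str) -> bool:
    """Return True if the marker text is present in the HTML the server sent."""
    return marker in raw_html


# Invented stand-in for `response.text`: the table row is hard-coded,
# while the resource cards would be filled in later by a script.
raw_html = (
    '<table><tr class="data-row"><td class="data-cell company">Apple</td></tr></table>'
    '<div id="resources"></div>'
    '<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>'
)

print(appears_in_source(raw_html, "data-row"))     # hard-coded in the HTML
print(appears_in_source(raw_html, "digest-card"))  # added later by JavaScript
```

If the class name is missing from the raw HTML but visible in "Inspect element", the content was most likely added by JavaScript after the page loaded.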

# Sample code: `selenium`

`selenium` is a tool in Python (and many other programming languages) that allows users to **access dynamic web content by automating web browser interactions**.<sup id="ref11" class="footnote"><a href="https://www.selenium.dev/documentation/webdriver/getting_started/" title="Selenium documentation can be found here: https://www.selenium.dev/documentation/webdriver/getting_started/">[12]</a></sup> 

It simulates a real user browsing the web, which enables it to **capture JavaScript-rendered content and other dynamic elements** that one-off HTTP requests cannot access.

</p>
<p style="font-size:10px">
[12] Selenium documentation can be found here: https://www.selenium.dev/documentation/webdriver/getting_started/ <br>
</p> 

## Live demo

Follow along with the [code](https://raw.githubusercontent.com/lorae/web-scraping-tutorial/main/sample_code/selenium_scraping.py)

File path: `web-scraping-tutorial/sample_code/selenium_scraping.py`

## But first, a note...

Selenium is just a tool to *get* an HTML file. Once you obtain the file, you parse it exactly the same way as we did in the previous section: using a tool of choice, like Beautiful Soup.

Let's start by importing the needed modules.

In [9]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup

`selenium` can run on many browsers, like Chrome and Firefox. For simplicity, we will use Chrome today.

In order for the code to work, you must have Chrome installed on your computer.

In [10]:
# Set up Chrome options
chrome_options = Options()
chrome_options.add_argument("--headless")

If you don't turn on "headless" mode, your browser will pop up on your screen when you run the code.

In [11]:
# Set up Chrome driver
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service, options=chrome_options)

# Open the website
url = "https://lorae.github.io/web-scraping-tutorial/"
driver.get(url)

# Get the HTML content of the page
html_content = driver.page_source
print(html_content)

<html lang="en"><head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Web Scraping Resources</title>
    <link rel="stylesheet" href="web_content/css/styles.css">
    <link href="https://fonts.googleapis.com/css2?family=Roboto:wght@400;700&amp;display=swap" rel="stylesheet">
    <script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/PapaParse/5.3.0/papaparse.min.js"></script>
</head>
<body>
    <div class="container">
        <h1>Scrape this website!</h1>
        <p>Welcome! Are you interested in learning how to gather data from the internet? This website was designed as a trial ground to practice skills covered in Lorae Stojanovic's presentation to the Brookings Data Network on June 20, 2024, <a href="https://lorae.github.io/web-scraping-tutorial/advanced-web-scraping.slides.html">"Web Scraping with Python"</a>. The presentation includes <a href="https://lor

We did it! Now we have HTML content, just as before. But there's a crucial difference: our dynamic content is pre-loaded into the HTML code.

Let's try re-running the code from the previous section on this new HTML output. Hopefully, we can access the dynamic elements of the webpage.

In [12]:
# Use BeautifulSoup to parse the html_content
soup = BeautifulSoup(html_content, 'html.parser')

# Scrape titles and links
elements = soup.select('div.digest-card__title a')
# Initialize lists
Titles = []
Links = []
for el in elements:
    # Obtain the link to the resource
    link = el['href']  # The 'href' attribute stores the hyperlink's URL.
    # Obtain the title of the resource
    title = el.text

    # Append the entries to each list
    Titles.append(title)
    Links.append(link)

In [13]:
# Print the results
print(Titles)
print(Links)

['Introduction to Data Science Using Python', 'Webscraping in R (and a little Python and Excel too)', 'Populating the page: how browsers work', 'What is DNS? | How DNS works', 'How the web works', 'A Practical Introduction to Web Scraping in Python', 'Automate the Boring Stuff: Web Scraping (Chapter 12)', 'Getting Started (with Selenium)', 'Getting started with the web', 'roundup', 'Text Mining with R: A Tidy Approach', 'Text as Data', 'Text as Data (Spring 2021, NYU)', 'Web Scraping with R', 'D-Lab Python Web Scraping Workshop', 'D-Lab Python Geospatial Fundamentals Workshop', 'D-Lab Python Machine Learning Workshop', 'D-Lab Python Text Analysis Workshop', 'D-Lab Python Deep Learning Workshop', 'D-Lab Python Data Visualization Workshop', 'D-Lab Python Fundamentals Workshop', 'D-Lab Python Intermediate Workshop', 'D-Lab Python Data Wrangling Workshop', 'Legality and Ethics of Web Scraping']
['https://github.com/DistrictDataLabs/Brookings_Python_DS', 'https://example.com/secret', 'https

We did it!
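Once the titles and links are in lists, saving them is straightforward. Here is a minimal sketch using the standard-library `csv` module. It writes to an in-memory buffer so that it runs anywhere; the second row is a made-up placeholder, and the file name in the comment is an assumption.

```python
import csv
import io

# First entry taken from the output above; the second is a made-up placeholder
titles = ["Introduction to Data Science Using Python", "Example resource"]
links = [
    "https://github.com/DistrictDataLabs/Brookings_Python_DS",
    "https://example.com/resource",
]

# To write a real file, replace the buffer with:
# open("resources.csv", "w", newline="", encoding="utf-8")
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["title", "link"])     # header row
writer.writerows(zip(titles, links))   # one row per resource

print(buffer.getvalue())
```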

# Sample code: APIs

APIs - when available - are **my favorite way to collect data** from the internet. They're also a bit of a *secret* method - there aren't many tutorials on using hidden APIs for web scraping! 

You're already familiar with the main tool you'll need for this: the `requests` package.

## Live demo

Follow along with the [code](https://raw.githubusercontent.com/lorae/web-scraping-tutorial/main/sample_code/api_requests.py)

File path: `web-scraping-tutorial/sample_code/api_requests.py`

## Viewing network requests

Remember the dynamically rendered content we saw in the last section, and how we had to use `selenium` to access it? 

**Sometimes, there's a simpler way**: we can import the data directly, without opening a browser or dealing with any HTML at all.


**API access isn't always possible**. We'll have to inspect the HTTP requests that our browser sends to the server and the responses sent from the server to our browser to determine whether there's any information that we can intercept.

The easiest way to view this activity is to access the **network requests pane**.


With your browser open on the webpage in question, 

1. right click,
2. select "Inspect element",
3. then select the "Network" tab.

![The network requests pane is open but empty.](slides_content/images/network-requests-before-loading.png)

Here, the network requests pane is open but empty. Because the page finished loading before the pane was opened, its requests were not recorded.

To view the requests as they happen, simply refresh the webpage with the network requests tab open.

![The network requests pane is open and monitoring activity.](slides_content/images/network-requests-after-loading.png)

Recall the procedure that the browser uses to access data:

1. DNS lookup
2. Initial HTTP request
3. Server response
4. Parsing HTML + additional requests
5. Assembling the page

The first request we see is the **initial HTTP request** (**step 2**). The response from the server is the initial HTML file (**step 3**), which your browser begins to parse.

![The initial HTTP request](slides_content/images/first-http-request-screenshot.png)

As your browser parses this HTML file, it encounters additional references that it needs, causing it to make more requests for external resources (**step 4**).

![The remaining requests.](slides_content/images/more-http-requests-screenshot.png)

## Monitoring network requests

Websites make many requests, and API requests are often subtle. It can be tricky to tell which requests, if any, exchange the information you seek, since their titles are often uninformative.

**I typically start by sorting requests by `Type`**. API requests tend to be of type `xhr` or `fetch`, but this is not a hard and fast rule.

You can also look at the **title of the request** and its **headers**. If the request URL contains the string "api", that can be a good tell.


When in doubt, inspect the responses. Do they contain the data you seek?

For especially tough cases, it can help to poke around where you think APIs might be used on the frontend: **navigating forward on a menu, displaying more years on an interactive graph, or searching for things in the website's search bar**.

If those actions trigger an API request, the network requests pane will show new entries at the bottom.
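The rules of thumb above can be written down as a small filter. This is only a sketch of one possible heuristic: `looks_like_api` is a hypothetical helper, the request types passed to it are assumed values, and the file-extension rule is an extra assumption added for this example. Note that it matches "api" as a whole URL segment, because a plain substring search would also match the word "scraping".

```python
from urllib.parse import urlparse

API_TYPES = ("xhr", "fetch")
DATA_EXTENSIONS = (".json", ".csv")  # assumption: data files often signal an API-style request


def looks_like_api(url: str, request_type: str) -> bool:
    """Guess whether a network request is worth inspecting as a possible API call."""
    parsed = urlparse(url.lower())
    # Whole path segments and hostname labels, e.g. ['', 'v1', 'items', 'api', 'example', 'com']
    segments = parsed.path.split("/") + (parsed.hostname or "").split(".")
    return (
        request_type.lower() in API_TYPES
        or "api" in segments
        or parsed.path.endswith(DATA_EXTENSIONS)
    )


site = "https://lorae.github.io/web-scraping-tutorial/"
print(looks_like_api(site + "web_content/data/web-scraping-resources.json", "fetch"))
print(looks_like_api(site + "web_content/css/styles.css", "stylesheet"))
print(looks_like_api("https://api.example.com/v1/items", "document"))  # hypothetical URL
```

A filter like this only narrows the list. Inspecting the response is still the way to confirm that a request carries the data you want.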

## Live demo: Find the API requests

Link: https://lorae.github.io/web-scraping-tutorial/

This website uses 2 API requests, which each load a file:

- web-scraping-resources.json
- gdp-data.csv

Let's learn how to access these files using the `requests` package.

## Using `requests` to access APIs

**Fun fact**: If the request is a GET request, you can oftentimes request the API quickly and cheaply by copying and pasting the request URL into your browser's navigation bar.

(In my experience, this works about 80% of the time.)

Translating this process to Python code is very easy.


Let's start by importing the `requests` library and assigning some headers.

In [14]:
import requests

my_headers = {
    'User-Agent': (
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
        'AppleWebKit/537.36 (KHTML, like Gecko) '
        'Chrome/122.0.6261.112 Safari/537.36'
    ),
    'Accept': (
        'text/html,application/xhtml+xml,application/xml;q=0.9,'
        'image/avif,image/webp,image/apng,*/*;q=0.8,'
        'application/signed-exchange;v=b3;q=0.7'
    )
}

Next, we copy the request URL.

In [20]:
my_request_url = "https://lorae.github.io/web-scraping-tutorial/web_content/data/web-scraping-resources.json"

We then proceed exactly as we did earlier in this presentation:

In [21]:
session_arguments = requests.Request(method='GET', 
                                     url=my_request_url, 
                                     headers=my_headers)
session = requests.Session()
prepared_request = session.prepare_request(session_arguments)
response: requests.Response = session.send(prepared_request)

But this time, the response is not HTML code that we have to clean. Instead, it's data in JSON format.<sup id="ref12" class="footnote"><a href="https://www.w3schools.com/whatis/whatis_json.asp" title="JSON, or JavaScript Object Notation, is a file format commonly used for transferring data over the internet. If you web scrape, you will encounter it frequently. For more information, visit: https://www.w3schools.com/whatis/whatis_json.asp">[12]</a></sup> 


Accessing the data in this format is very straightforward. We simply retrieve it using the `json()` method of the response object.

</p>
<p style="font-size:10px">
[12] JSON, or JavaScript Object Notation, is a file format commonly used for transferring data over the internet. If you web scrape, you will encounter it frequently. For more information, visit: https://www.w3schools.com/whatis/whatis_json.asp <br>
</p>

In [23]:
json_data = response.json()
json_data

[{'title': 'Introduction to Data Science Using Python',
  'date': 'June 5, 2020',
  'link': 'https://github.com/DistrictDataLabs/Brookings_Python_DS',
  'authors': [{'name': 'District Data Labs',
    'link': 'https://districtdatalabs.silvrback.com/'}],
  'description': 'A GitHub repository containing slides and code introducing an audience with no Python experience to data science tools in Python.',
  'keywords': ['Python',
   'data science',
   'data structures',
   'loops',
   'list comprehension',
   'conditional evaluation',
   'functions',
   'Pandas',
   'DataFrames',
   'data visualization',
   'plotly',
   'hypothesis tests',
   'regression analysis',
   'machine learning'],
  'source': 'GitHub',
  'image': 'web_content/images/upward-connected-scatter.jpg'},
 {'title': 'Webscraping in R (and a little Python and Excel too)',
  'date': 'February 22, 2023',
  'link': 'https://example.com/secret',
  'authors': [{'name': 'Valerie Wirtschafter', 'link': '#'},
   {'name': 'Mimi Majumd

We can easily access the entries we need as follows.

In [24]:
# Now json_data is a list of dictionaries, each representing an article/resource
for article in json_data:
    title = article['title']

    print(title)

Introduction to Data Science Using Python
Webscraping in R (and a little Python and Excel too)
Populating the page: how browsers work
What is DNS? | How DNS works
How the web works
A Practical Introduction to Web Scraping in Python
Automate the Boring Stuff: Web Scraping (Chapter 12)
Getting Started (with Selenium)
Getting started with the web
roundup
Text Mining with R: A Tidy Approach
Text as Data
Text as Data (Spring 2021, NYU)
Web Scraping with R
D-Lab Python Web Scraping Workshop
D-Lab Python Geospatial Fundamentals Workshop
D-Lab Python Machine Learning Workshop
D-Lab Python Text Analysis Workshop
D-Lab Python Deep Learning Workshop
D-Lab Python Data Visualization Workshop
D-Lab Python Fundamentals Workshop
D-Lab Python Intermediate Workshop
D-Lab Python Data Wrangling Workshop
Legality and Ethics of Web Scraping
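
The second file, `gdp-data.csv`, is plain text rather than JSON, so `json()` does not apply. A minimal sketch of one way to handle it with Python's built-in `csv` module follows. The column names and values in `sample_text` are hypothetical stand-ins, since the real file's contents are not shown here, and `parse_csv_text` is an illustrative helper name.

```python
import csv
import io


def parse_csv_text(csv_text: str) -> list[dict]:
    """Turn CSV text into a list of dictionaries, one per row."""
    # DictReader uses the first line as the column names.
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row) for row in reader]


# Hypothetical stand-in for the text of a CSV response.
sample_text = "year,gdp\n2020,100.0\n2021,105.5\n"

rows = parse_csv_text(sample_text)
print(rows)
```

With a real response obtained the same way as above, you would pass `response.text` to the function. Note that every value comes back as a string, so numeric columns need converting before analysis.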


If the request is a POST request, or if it requires more complex headers, this strategy will not work.

In that case, I recommend using a free tool like [Postman](https://www.postman.com/). The process is simple:

1. Right-click on the request in the network requests pane.
2. Select "Copy as cURL".
3. Paste the request into Postman.
4. Auto-generate Python code by clicking the `</>` button.

![copy as cURL](slides_content/images/copy-as-cURL.png)
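
The generated code will vary from request to request, but a POST request written in the same style as the GET request earlier might look roughly like the sketch below. The endpoint URL, headers, and payload are hypothetical placeholders, not taken from any real site. The request is only prepared here, not sent, so you can inspect it first.

```python
import json

import requests

# Hypothetical endpoint and payload, for illustration only.
post_url = "https://example.com/api/search"
payload = {"query": "web scraping", "page": 1}

post_headers = {
    "User-Agent": "Mozilla/5.0",
    "Accept": "application/json",
}

# Passing json=payload serializes the dictionary into the request body
# and sets the Content-Type header to application/json.
post_arguments = requests.Request(method="POST",
                                  url=post_url,
                                  headers=post_headers,
                                  json=payload)
session = requests.Session()
prepared_post = session.prepare_request(post_arguments)

# Inspect the prepared request before sending anything.
print(prepared_post.method)
print(prepared_post.headers["Content-Type"])
print(json.loads(prepared_post.body))

# To actually send it, against a real endpoint:
# response = session.send(prepared_post)
```

The main difference from a GET request is the body: the data you would otherwise see under the request's payload in the network requests pane is passed along explicitly.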