To render this slide deck as HTML, run the following line in a terminal or PowerShell:

jupyter nbconvert 'slides/advanced-web-scraping.ipynb' --to slides --output='../advanced-web-scraping'

# Advanced Web Scraping with Python

<br>

### Lorae Stojanovic

June 20, 2024


# Agenda
1. [How does a website work?](#/2)
2. [Web Scraping Basics](#/3)
3. Selenium
4. API Scraping
5. Automating a web scrape with GitHub Actions
6. Demo
7. [Additional Resources](#/3)

# How does a website work?

A lot goes on behind the scenes when you view a website like https://www.brookings.edu/.

Understanding how your computer interacts with remote resources will help you become a more capable data collector.

## Key terminology
### **Client**
"Clients are the typical **web user's internet-connected devices** (for example, your computer connected to your Wi-Fi, or your phone connected to your mobile network) and **web-accessing software available on those devices** (usually a web browser like Firefox or Chrome)."<sup id="ref01" class="reference"><a href="https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/How_the_Web_works" title="Mozilla (2023). How the web works. Retrieved from https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/How_the_Web_works">[1]</a></sup> 

### **Server**
"Servers are **computers that store webpages, sites, or apps**. When a client device wants to access a webpage, a copy of the webpage is downloaded from the server onto the client machine to be displayed in the user's web browser."<sup id="ref02" class="reference"><a href="https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/How_the_Web_works" title="Mozilla (2023). How the web works. Retrieved from https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/How_the_Web_works">[2]</a></sup> 

<p style="font-size:10px">
[1] Mozilla (2023). How the web works. Retrieved from https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/How_the_Web_works <br>
[2] Ibid.
</p>

Clients send **requests** for information from servers. The servers send back **responses**.
![A schematic illustrating the relationship between clients and servers](slides/client-server-request-response.png)

## Accessing a website

When you enter a URL (e.g., www.brookings.edu) into your browser, several things happen:

1. DNS lookup
2. Initial HTTP request
3. Server response
4. Parsing HTML + additional requests
5. Assembling the page

### **Step 1: DNS Lookup**
Your browser first translates the human-readable domain name in the URL (e.g. brookings.edu in http://brookings.edu/) into an IP address (e.g. 137.135.107.235) using a Domain Name System (DNS) lookup. 

Think of DNS as a phone book that links the name of a store to its street address. The street address, in this case, is an IP address, which points to the server where the website is hosted. <sup id="ref02" class="reference"><a href="https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/How_the_Web_works" title="Mozilla (2023). How the web works. Retrieved from https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/How_the_Web_works">[3]</a></sup> <sup id="ref03" class="reference"><a href="https://www.cloudflare.com/learning/dns/what-is-dns/" title="Cloudflare. What is DNS? Retrieved from https://www.cloudflare.com/learning/dns/what-is-dns/">[4]</a></sup> 
<p style="font-size:10px">
[3] Mozilla (2023). How the web works. Retrieved from https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/How_the_Web_works
<br>
[4] Cloudflare. What is DNS? Retrieved from https://www.cloudflare.com/learning/dns/what-is-dns/<br>
</p>
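
To see a DNS lookup from Python, here is a minimal sketch using the standard library's `socket.getaddrinfo`. The helper name `resolve_ipv4` is my own and not part of any library, and the addresses returned for a real site may differ from the example above, since sites can move hosts or sit behind several servers.

```python
import socket


def resolve_ipv4(hostname):
    """Return the sorted, de-duplicated IPv4 addresses found for a hostname."""
    # getaddrinfo hands the name to the operating system's resolver,
    # which performs the DNS lookup when the name is not already numeric.
    records = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each record is (family, type, proto, canonname, sockaddr),
    # and for IPv4 the sockaddr is an (ip_address, port) pair.
    return sorted({record[4][0] for record in records})
```

Calling `resolve_ipv4("www.brookings.edu")` needs an internet connection. Passing an address that is already numeric, such as `"127.0.0.1"`, simply returns it without contacting a DNS server.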

### **Step 2: HTTP Request**
Now that your browser has the IP address of the website, it sends an HTTP (HyperText Transfer Protocol) request to the server at this IP address. This request asks for the main HTML file of the website.

```bash
curl 'https://www.brookings.edu/' \
  -H 'accept: text/html,application/xhtml+xml,application/xml;...' \
  -H 'accept-language: en-US,en;q=0.9' \
  -H 'cookie: _fbp=REDACTED; hubspotutk=REDACTED; ...' \
  -H 'priority: u=0, i' \
  -H 'referer: https://www.google.com/' \
  -H 'sec-ch-ua: "Google Chrome";v="REDACTED", "Chromium";v="REDACTED", "Not.A/Brand";v="REDACTED"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'sec-ch-ua-platform: "Windows"' \
  -H 'sec-fetch-dest: document' \
  -H 'sec-fetch-mode: navigate' \
  -H 'sec-fetch-site: cross-site' \
  -H 'sec-fetch-user: ?1' \
  -H 'upgrade-insecure-requests: 1' \
  -H 'user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36'
```
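
The same kind of request can be built in Python. The sketch below uses the standard library's `urllib.request` with a small subset of the headers from the `curl` command above. `build_request` and `fetch_html` are hypothetical helper names, and the short `User-Agent` string is a placeholder I made up, not what a real browser sends.

```python
import urllib.request

# Placeholder identifier, assumed for illustration only.
DEMO_USER_AGENT = "advanced-web-scraping-demo/0.1"


def build_request(url, user_agent=DEMO_USER_AGENT):
    """Prepare (but do not send) a GET request with browser-like headers."""
    headers = {
        "Accept": "text/html,application/xhtml+xml",
        "Accept-Language": "en-US,en;q=0.9",
        "User-Agent": user_agent,
    }
    return urllib.request.Request(url, headers=headers, method="GET")


def fetch_html(url):
    """Send the request and return the response body as text (needs network)."""
    with urllib.request.urlopen(build_request(url), timeout=10) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Nothing is sent until `fetch_html` is called, so `build_request("https://www.brookings.edu/")` is safe to inspect offline. Some sites respond differently to scripted requests than to browsers, so the HTML you get back may not match what you see on screen.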

### **Step 3: Server Response**
The server processes this request and sends back the requested HTML file. This file contains the basic structure of the webpage.


<!DOCTYPE html>
<html lang="en-US">

<head>
	<meta charset="UTF-8">
	<meta name="viewport" content="width=device-width, initial-scale=1.0, viewport-fit=cover">
	<meta name='robots' content='index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1' />
<link rel="alternate" href="https://www.brookings.edu/" hreflang="en" />
<link rel="alternate" href="https://www.brookings.edu/es/" hreflang="es" />
<link rel="alternate" href="https://www.brookings.edu/ar/" hreflang="ar" />
<link rel="alternate" href="https://www.brookings.edu/zh/" hreflang="zh" />
<link rel="alternate" href="https://www.brookings.edu/fr/" hreflang="fr" />
<link rel="alternate" href="https://www.brookings.edu/ko/" hreflang="ko" />
<link rel="alternate" href="https://www.brookings.edu/ru/" hreflang="ru" />

	<!-- This site is optimized with the Yoast SEO plugin v22.0 - https://yoast.com/wordpress/plugins/seo/ -->
	<title>Brookings - Quality. Independence. Impact.</title>
	<meta name="description" content="The Brookings Institution is a nonprofit public policy organization based in Washington, DC. Our mission is to conduct in-depth research that leads to new ideas for solving problems facing society at the local, national and global level." />
	<link rel="canonical" href="https://www.brookings.edu/" />
	<meta property="og:locale" content="en_US" />
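
A response is more than the HTML body: it also carries a status code (such as 200 for success) and headers that describe the content. The sketch below is one way to watch a full request-response round trip without touching the internet. It starts a throwaway server on your own machine and queries it. The page content, `DemoHandler`, and `fetch_from_demo_server` are all invented for this illustration.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# A tiny stand-in page, made up for this demo.
PAGE = b"<!DOCTYPE html><html><head><title>Hello</title></head><body>Hi</body></html>"


class DemoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Status line, then headers, then the body: the three parts of a response.
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the demo quiet instead of logging each request


def fetch_from_demo_server():
    """Start a local server, request "/", and return (status, content type, body)."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: let the OS pick
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        connection = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
        connection.request("GET", "/")
        response = connection.getresponse()
        result = (
            response.status,
            response.getheader("Content-Type"),
            response.read().decode("utf-8"),
        )
        connection.close()
        return result
    finally:
        server.shutdown()
        server.server_close()
```

Here the Python process plays both roles: the `HTTPServer` is the server and `http.client` is the client.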

### **Step 4: Parsing HTML + additional requests**
Your browser starts **parsing** the HTML file: reading its instructions to turn it into a user-friendly webpage. 

Often, the HTML contains references to additional external resources that the browser needs to display the webpage, such as:

- **CSS (Cascading Style Sheets)**: To set default aesthetics like fonts, colors, and line spacing
- **JavaScript code**: To add interactivity and dynamic content
- **Images and videos**: To incorporate multimedia content
- **API (Application Programming Interface) responses**: To obtain data from servers, often in JSON format, to display on the webpage

As your browser encounters **additional references to files** in the HTML code, it **makes HTTP requests** to the server to retrieve them.
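
To get a feel for what the browser finds while parsing, the sketch below lists the files an HTML document points to, using the standard library's `html.parser`. It is a simplification: `ResourceCollector` and `find_resources` are hypothetical names, and it only looks at three tag/attribute pairs, while a real browser follows many more kinds of references.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceCollector(HTMLParser):
    # (tag, attribute) pairs that usually point at another file.
    # Note that <link> covers stylesheets as well as other link types.
    SOURCES = {("link", "href"), ("script", "src"), ("img", "src")}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if (tag, name) in self.SOURCES and value:
                # Resolve relative paths against the page's own URL.
                self.resources.append(urljoin(self.base_url, value))


def find_resources(html, base_url):
    """Return the absolute URLs referenced by the given HTML, in document order."""
    collector = ResourceCollector(base_url)
    collector.feed(html)
    return collector.resources
```

Each URL in the returned list corresponds to one of the follow-up HTTP requests described above.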

### **Step 5: Assembling the page**

After downloading all the external resources needed to build the webpage, your browser will compile and execute any JavaScript code that it received.

With all the downloaded elements in place, the browser processes the HTML and the CSS style sheets and combines them with other resources (such as downloaded fonts, photos, videos, and data downloaded from APIs) to **paint** the webpage to your screen.<sup id="ref05" class="reference"><a href="https://developer.mozilla.org/en-US/docs/Web/Performance/How_browsers_work" title="Mozilla (2023). Populating the page: how browsers work. Retrieved from https://developer.mozilla.org/en-US/docs/Web/Performance/How_browsers_work">[5]</a></sup> 
<p style="font-size:10px">
[5] Mozilla (2023). Populating the page: how browsers work. Retrieved from https://developer.mozilla.org/en-US/docs/Web/Performance/How_browsers_work <br>
</p>

![An image of the brookings.edu homepage](slides/brookings-edu-screenshot.png)

# Web Scraping Basics

There are several ways to collect data online. The strategy you choose must be tailored to the website in question.

I like to categorize web scraping techniques by the point in the client-server interaction at which they collect data.

1. **DNS lookup**

2. **Initial HTTP request**

3. **Server response**
   - **HTML scraping** parses the main HTML file for the webpage to extract data.    

4. **Parsing HTML + additional requests**
   - Although calling an API is technically not classified as "web scraping," **APIs** can often be accessed at this point in the client-server interaction.

5. **Assembling the page**
   - **Selenium** behaves like a browser to view content that is otherwise not available in HTML files, because it is dynamically rendered using JavaScript code.
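
For the API route, the data usually arrives as JSON rather than HTML, so collecting it is mostly a matter of decoding the response. The sketch below assumes a made-up payload: the `results`, `title`, and `year` fields and the `extract_titles` helper are hypothetical, and every real API defines its own structure.

```python
import json

# Hypothetical API response, invented for illustration.
SAMPLE_RESPONSE = """
{
  "results": [
    {"title": "Report A", "year": 2023},
    {"title": "Report B", "year": 2024}
  ]
}
"""


def extract_titles(raw_json):
    """Decode a JSON string and return the title of each item under "results"."""
    payload = json.loads(raw_json)
    # .get() avoids a KeyError when the response has no "results" field.
    return [item["title"] for item in payload.get("results", [])]
```

Compared with HTML scraping, there are no tags to search through: once the JSON is decoded, the data is already an ordinary Python dictionary.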

# Reviewing the basics: HTML parsing
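
As a refresher, here is a minimal sketch of HTML parsing that uses only the standard library's `html.parser`. It pulls the page title and the alternate-language links out of a shortened copy of the brookings.edu `<head>` shown earlier. `HeadScraper` and `scrape_head` are names I chose for this example, and libraries such as BeautifulSoup4 offer a friendlier interface for the same job.

```python
from html.parser import HTMLParser

# Shortened excerpt of the HTML shown in Step 3.
SAMPLE_HEAD = """
<head>
<link rel="alternate" href="https://www.brookings.edu/" hreflang="en" />
<link rel="alternate" href="https://www.brookings.edu/es/" hreflang="es" />
<title>Brookings - Quality. Independence. Impact.</title>
</head>
"""


class HeadScraper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.title = ""
        self.alternates = {}  # maps language code to URL
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "link" and attributes.get("rel") == "alternate" and "hreflang" in attributes:
            self.alternates[attributes["hreflang"]] = attributes.get("href")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text may arrive in several pieces, so accumulate it.
        if self._in_title:
            self.title += data


def scrape_head(html):
    """Return (title, {language: url}) extracted from the given HTML."""
    scraper = HeadScraper()
    scraper.feed(html)
    return scraper.title.strip(), scraper.alternates
```

Calling `scrape_head(SAMPLE_HEAD)` returns the title string together with a dictionary of the `en` and `es` links.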



# Additional Resources
These links point to resources (both internal Brookings resources and publicly available ones) that provide additional context and training.

Note to self: Turn these slides into some sort of table that clearly indicates which resources are internal and which aren't.

## If you want a light introduction to web scraping
[DistrictDataLabs/brookings](https://github.com/DistrictDataLabs/brookings/): A **publicly available** GitHub repository with slides and sample code, presented to the Brookings Data Network on March 31, 2017. The resource is older, but the concepts are still relevant. [NOTE: I haven't tested the code, so I'm not sure if it still works.]

Keywords: Publicly available datasets, common data formats, RESTful APIs, HTTP requests, web scraping vs web crawling

## If you want a more involved introduction to web scraping
[trainingNotebook.ipynb](https://brookingsinstitution.sharepoint.com/:f:/r/sites/BrookingsDataNetwork/Shared%20Documents/Python/2017-05%20-%20Getting%20Started%20with%20Web%20Scraping?csf=1&web=1&e=d2sfux): Curtlyn Kramer's **Brookings internal** Jupyter notebook, presented to the Brookings Data Network on May 26, 2017. It is only useful if the code still works, which I have yet to test. If it does, it is a helpful notebook that walks you through the steps.

Keywords: inspect element, BeautifulSoup4

## Extra material goes down here


### What is a web browser?
A web browser, like Chrome, Safari, Edge, or Firefox, is specialized software designed to fetch and display web resources. Web resources are usually in the form of HTML documents, but can also be PDFs, images, or other documents. 


**Web scraping**: The act of systematically extracting data from an online resource, such as a website.

**Web crawling**: (use the explanation given by District Data Labs on /brookings repo)