To build this slide deck as an HTML file, run the following line in the terminal or PowerShell:

jupyter nbconvert 'slides/advanced-web-scraping.ipynb' --to slides --output='../advanced-web-scraping'

# Advanced Web Scraping with Python

<br>

### Lorae Stojanovic

<br>

Welcome! We will start shortly.

# Agenda
1. [Internet Basics](#/2)
2. [Review: Web Scraping Basics](#/3)
2. Selenium
3. API Scraping
4. Automating a web scrape with GitHub Actions
4. Demo
5. [Additional Resources](#/3)

**Web scraping**: The act of systematically extracting data from an online resource, such as a website.

**Web crawling**: (use the explanation given by District Data Lab on /brookings repo)
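
To make the web scraping definition concrete, here is a minimal sketch that pulls every link out of a page. The `SAMPLE_HTML` string, the `LinkExtractor` class, and the `extract_links` helper are hypothetical names made up for this illustration, and the example uses only Python's standard-library `html.parser` so that it runs without installing anything. A real scrape would download the HTML from a website first.

```python
from html.parser import HTMLParser

# Hypothetical HTML, standing in for a page downloaded from a website.
SAMPLE_HTML = """
<html>
  <body>
    <h1>Reports</h1>
    <ul>
      <li><a href="/reports/2023">Annual report 2023</a></li>
      <li><a href="/reports/2024">Annual report 2024</a></li>
    </ul>
  </body>
</html>
"""


class LinkExtractor(HTMLParser):
    """Collect (link text, href) pairs from every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current_href = None
        self._current_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs
            self._current_href = dict(attrs).get("href")
            self._current_text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a> tag
        if self._current_href is not None:
            self._current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            text = "".join(self._current_text).strip()
            self.links.append((text, self._current_href))
            self._current_href = None


def extract_links(html):
    """Return a list of (link text, href) pairs found in the HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


links = extract_links(SAMPLE_HTML)
for text, href in links:
    print(f"{text}: {href}")
```

The same idea scales up: once the data you want sits in a predictable place in the HTML, a short program can collect it from many pages.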

# Internet Basics

## How does a website work?
### What is a web browser?
A web browser, like Chrome, Safari, Edge, or Firefox, is specialized software designed to fetch and display web resources. Web resources are usually HTML documents, but can also be PDFs, images, or other files.


## Key terminology
**Server vs client**
"Clients are the typical web user's internet-connected devices (for example, your computer connected to your Wi-Fi, or your phone connected to your mobile network) and web-accessing software available on those devices (usually a web browser like Firefox or Chrome)."

"Servers are computers that store webpages, sites, or apps. When a client device wants to access a webpage, a copy of the webpage is downloaded from the server onto the client machine to be displayed in the user's web browser."

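The client and server roles can be sketched on a single machine. This is an illustration only: the `PAGE` content, the `PageHandler` class, and the `fetch_from_local_server` helper are hypothetical names, and the "server" here is a tiny standard-library program running on your own computer rather than a real web host.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# The "webpage" that the server stores (hypothetical content).
PAGE = b"<html><body><h1>Hello from the server</h1></body></html>"


class PageHandler(BaseHTTPRequestHandler):
    """Server side: answer every GET request with a copy of the stored page."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def fetch_from_local_server():
    """Start a server, act as a client once, and return (status, body)."""
    # Port 0 asks the operating system for any free port.
    server = HTTPServer(("127.0.0.1", 0), PageHandler)
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        # Client side: request the page and download a copy of it.
        connection = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
        connection.request("GET", "/")
        response = connection.getresponse()
        body = response.read().decode("utf-8")
        connection.close()
        return response.status, body
    finally:
        server.shutdown()
        server.server_close()


status, body = fetch_from_local_server()
print(status)
print(body)
```

A web browser plays the client role in the same way, except that the server is usually a different computer somewhere else on the internet.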

### Network Requests
When you enter a URL (e.g., www.brookings.edu) into your browser, several things happen:

1. **DNS Lookup**:
   - Your browser first translates the human-readable domain name in the URL (e.g. brookings.edu in http://brookings.edu/) into an IP address (e.g. 137.135.107.235) using a Domain Name System (DNS) lookup. Think of DNS as a phone book that links a store's name to its street address.
   - This IP address points to the server where the website is hosted. Your computer now has the address to which it can send its web request.

2. **HTTP Request**:
   - The browser sends an HTTP (HyperText Transfer Protocol) request to the server at this IP address. This request asks for the main HTML file of the website.

3. **Server Response**:
   - The server processes this request and sends back the requested HTML file. This file contains the basic structure of the webpage. (We'll look at an example of one of these.)

4. **Rendering HTML**:
   - Your browser starts rendering the HTML file, which often includes references to other resources like CSS files, JavaScript files, images, fonts, and videos.

5. **Additional Requests**:
   - For each of these additional resources, the browser makes more HTTP requests to the server. For example:
     - **CSS Files**: To style the webpage.
     - **JavaScript Files**: To add interactivity and dynamic content.
     - **Images and Videos**: To include multimedia content.
     - **APIs**: Sometimes, the HTML or JavaScript code includes API requests. APIs (Application Programming Interfaces) allow the browser to request and receive data from servers, often in JSON format, which can then be dynamically displayed on the webpage.

6. **Assembling the Page**:
   - The browser assembles all these elements and displays the fully rendered webpage to you.
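
The steps above can be sketched in Python. This is a rough illustration under stated assumptions, not how a browser is actually implemented: `resolve_host`, `fetch_html`, `ResourceFinder`, `find_resources`, and `SAMPLE_HTML` are hypothetical names, and the code that runs here works on a made-up HTML snippet so that it needs no network connection.

```python
import socket
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


def resolve_host(url):
    """Step 1 (DNS lookup): return the IP addresses behind the URL's hostname."""
    hostname = urlsplit(url).hostname  # the URL needs a scheme, e.g. "http://"
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def fetch_html(url, timeout=10):
    """Steps 2-3 (HTTP request, server response): return the page's HTML as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class ResourceFinder(HTMLParser):
    """Steps 4-5: note the extra resources that the HTML refers to."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").lower().split()
        if tag == "link" and "stylesheet" in rel and attrs.get("href"):
            self._add("css", attrs["href"])
        elif tag == "script" and attrs.get("src"):
            self._add("javascript", attrs["src"])
        elif tag == "img" and attrs.get("src"):
            self._add("image", attrs["src"])
        elif tag == "video" and attrs.get("src"):
            self._add("video", attrs["src"])

    def _add(self, kind, reference):
        # Relative references are resolved against the page's own URL.
        self.resources.append((kind, urljoin(self.base_url, reference)))


def find_resources(html, base_url):
    """Return (kind, absolute URL) pairs for resources referenced in the HTML."""
    finder = ResourceFinder(base_url)
    finder.feed(html)
    return finder.resources


# Offline demonstration on a hypothetical page (no network needed).
SAMPLE_HTML = """
<html>
  <head>
    <link rel="stylesheet" href="/css/site.css">
    <script src="app.js"></script>
  </head>
  <body><img src="https://cdn.example.com/logo.png"></body>
</html>
"""

resources = find_resources(SAMPLE_HTML, "http://example.com/news/")
for kind, url in resources:
    print(kind, url)
```

With an internet connection, calling `resolve_host("http://brookings.edu/")` and then `fetch_html("http://brookings.edu/")` would perform steps 1 through 3 against the live site. The results depend on the site at the time you run it, so no output is shown here. A browser would then request each resource that `find_resources` lists before assembling the page.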

TODO: explain what a server is


Sources used in this slide:

- Tali Garsiel, "Behind the scenes of modern web browsers" (2011) https://taligarsiel.com/Projects/howbrowserswork1.htm
  - A bit outdated, but still a useful resource.
- MDN Web Docs, "How the web works" (2023) https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/How_the_Web_works
  - A Mozilla-run documentation site that provides an introduction to how the web works. It's aimed in particular at developer audiences.
- Pavel Panchekha and Chris Harrelson, "Web Browser Engineering" (2023) https://browser.engineering/
  - A textbook that guides software developers through the process of building a web browser.

# 2. Review: Web Scraping Basics

# Selenium



# 5. Additional Resources
Links to resources (both internal Brookings resources and publicly available ones) that will provide additional context/training.

Note to self: Turn these slides into some sort of table that clearly indicates which resources are internal and which aren't.

## If you want a light introduction to web scraping
[DistrictDataLabs/brookings](https://github.com/DistrictDataLabs/brookings/) **Publicly available** GitHub repository with slides and sample code, presented to the Brookings Data Network on March 31, 2017. The resource is older, but the concepts are still relevant. [NOTE: I haven't tested the code, so I'm not sure if it still works]

Keywords: Publicly available datasets, common data formats, RESTful APIs, HTTP requests, web scraping vs web crawling

## If you want a more involved introduction to web scraping
[trainingNotebook.ipynb](https://brookingsinstitution.sharepoint.com/:f:/r/sites/BrookingsDataNetwork/Shared%20Documents/Python/2017-05%20-%20Getting%20Started%20with%20Web%20Scraping?csf=1&web=1&e=d2sfux) Curtlyn Kramer's **Brookings internal** Jupyter notebook, presented to the Brookings Data Network on May 26, 2017. I still need to test whether the code works. If it does, this is a helpful notebook that walks you through the steps.

Keywords: inspect element, BeautifulSoup4