# Web Scraping

**Web Scraping** is the process of programmatically extracting data from a web interface designed for human interaction. Unfortunately, web scraping is hard to generalize and usually requires a bespoke scraper program for each site, which means digging into a site's source to see how its data are organized. Python has a number of popular libraries<a name="cite_ref-1"></a>[<sup>[1]</sup>](#cite_note-1) that are useful for web scraping; the ones we are going to use for this class are `requests` for making HTTP requests and `BeautifulSoup` for parsing HTML.

<a name="cite_note-1"></a>1. [^](#cite_ref-1) Others include Scrapy and Selenium (a website testing library that can drive an actual browser)

In [1]:
import requests
from bs4 import BeautifulSoup as Soup

## HTML

**HTML** (HyperText Markup Language) is the primary **markup language** used for websites. Markup languages add formatting and/or structure to (generally) textual data for the purposes of display or processing. You only need a basic familiarity with HTML to get started with web scraping. The basic mechanism HTML uses for adding structure and formatting to documents is the "tag". A tag consists of a keyword and optional attributes surrounded by angle brackets, i.e. `<` and `>`. Many tags have "open" and "close" variants, with the close variant starting with `</`. An opening tag, its contents, and its closing tag are together commonly referred to as an "element".

- HTML documents have a head section and a body section, denoted by tags of the same name. The head section contains a title for the document, various metadata, scripts, stylesheets, etc., which pertain to the document as a whole. The body contains the actual content of the document.
- Markdown was conceived as a more human-friendly way to format plain text documents so that they could easily be translated into HTML for web publishing. As a result, there is a close relationship between Markdown formatting and HTML elements. If you look at markdownguide.org, you will find the equivalent HTML used to render most Markdown elements.
- We will mostly be focused on links and tables. We've already looked at tables. Links use the `<a>` tag (for anchor), which has an `href` attribute holding the actual HTTP reference, or URL.
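
To make the tag, attribute, and element vocabulary concrete, here is a minimal sketch that parses a small HTML document with `BeautifulSoup` and pulls out the title and the links. The HTML string and its URLs are invented for illustration, and `html.parser` (the parser bundled with Python) is just one of several parsers BeautifulSoup can use.

```python
from bs4 import BeautifulSoup as Soup

# A small, made-up HTML document (not taken from any real site).
SAMPLE_HTML = """
<html>
  <head>
    <title>Example Page</title>
  </head>
  <body>
    <p>Some <em>formatted</em> text.</p>
    <a href="https://example.com/data.csv">Download data</a>
    <a href="/about">About</a>
  </body>
</html>
"""

# "html.parser" is the parser that ships with Python's standard library.
soup = Soup(SAMPLE_HTML, "html.parser")

# The title element lives in the head section.
page_title = soup.title.string

# Collect (link text, href attribute) pairs for every <a> element.
links = [(a.get_text(), a["href"]) for a in soup.find_all("a")]

print(page_title)
for text, href in links:
    print(f"{text}: {href}")
```

Note that the second link is a relative URL (`/about`); on a real site you would usually need to combine it with the page's address before requesting it.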

## HTTP

The **Hypertext Transfer Protocol** (HTTP) is the network protocol used to retrieve HTML documents and associated files from a server. We won't get into the details of the protocol, but you need to be aware of two concepts:
1. Request methods. The most common request method is "GET", which asks the server for a resource, but there are several others. "POST" is used to send information to a server (for example, form data).
2. Response status codes. If the request succeeds, you will typically get a "200" status code. Status codes in the 300 range indicate some sort of redirection. Status codes in the 400 range indicate an apparent client error (including "404 Not Found"). Status codes in the 500 range indicate a server error.

The Wikipedia article on HTTP can be a helpful reference: https://en.wikipedia.org/wiki/HTTP
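
As a rough illustration of the status code ranges described above, the sketch below maps a numeric code to its range and its standard reason phrase. The `describe_status` helper is a hypothetical name introduced here, not part of `requests`; it uses only `http.HTTPStatus` from the standard library.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short description of an HTTP status code (hypothetical helper)."""
    if 200 <= code < 300:
        category = "success"
    elif 300 <= code < 400:
        category = "redirection"
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        # Anything outside the ranges discussed in the text.
        category = "other"
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Not a status code the standard library knows about.
        phrase = "unknown"
    return f"{code} {phrase} ({category})"


for code in (200, 301, 404, 500):
    print(describe_status(code))
```

When using `requests`, the numeric code of a response is available as `response.status_code`, and `response.raise_for_status()` raises an exception for codes in the 400 and 500 ranges.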

## Avoiding Getting Blocked

### Rate Limiting

Many websites have some level of protection against automated access. At the very least, this is because automated access may tie up most of the resources available to the website's server, effectively denying access to other users. It is always a good idea to build in some way to rate-limit your own requests; otherwise the website you are scraping may do it for you, or even block the IP address you are using outright.

In [7]:
import time
import random

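DEFAULT_SLEEP = 2.0  # mean delay between requests, in seconds
SIGMA = 0.5          # standard deviation of the delay, in seconds

# Extra HTTP headers sent with every request
HEADERS = {}

def get(url: str) -> requests.Response:
    # random.gauss can occasionally return a negative value, which
    # time.sleep rejects with a ValueError, so clamp the delay at zero.
    time.sleep(max(0.0, random.gauss(DEFAULT_SLEEP, SIGMA)))
    # HEADERS is looked up at call time, so reassigning it later takes effect.
    return requests.get(url, headers=HEADERS)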


### User Agent

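The **User Agent** HTTP header (`User-Agent`) identifies the software being used to make the HTTP request. Some sites treat the default value sent by scripting libraries differently from that of a browser, so it can be a good idea to set this header to a value used by an actual browser.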

In [9]:
# User Agent from Chrome Browser on Win 10/11
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36'}
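
To see what difference the header makes, the following sketch compares the default User Agent that `requests` would send with the one carried by a request built from the browser-style headers. The request is only prepared, never sent, and `https://example.com/` is a placeholder URL. `BROWSER_HEADERS` is a name introduced here so the cell runs on its own; it repeats the value from the cell above.

```python
import requests

# Same header value as the HEADERS cell above, repeated so this cell is self-contained.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36"
    )
}

# What requests sends when no User-Agent is set, e.g. "python-requests/<version>".
default_agent = requests.utils.default_user_agent()

# Build the request without sending it, then inspect its headers.
prepared = requests.Request(
    "GET", "https://example.com/", headers=BROWSER_HEADERS
).prepare()
sent_agent = prepared.headers["User-Agent"]

print("default:", default_agent)
print("sent:   ", sent_agent)
```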

### Proxy Servers

It can also be useful to make use of a proxy server service. A proxy server allows you to direct web traffic from your computer through another computer on the Internet. Proxy server services exist that allow you to use multiple IP addresses for making your web scraping requests. This is a good way to avoid having your one and only IP address blocked, but it does cost money.
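
With `requests`, a proxy can be supplied through the `proxies` argument, which maps a URL scheme to the proxy address to use for it. The sketch below is only an outline under assumed details: the proxy host, port, and credentials are placeholders, `build_proxies` and `get_via_proxy` are hypothetical helper names, and a real provider's documentation will specify the exact address format.

```python
import requests


def build_proxies(proxy_url: str) -> dict[str, str]:
    """Map both URL schemes to the same proxy, in the format requests expects."""
    return {"http": proxy_url, "https": proxy_url}


def get_via_proxy(url: str, proxy_url: str) -> requests.Response:
    """Fetch url through the given proxy (hypothetical helper, not called here)."""
    return requests.get(url, proxies=build_proxies(proxy_url), timeout=10)


# Placeholder host and credentials; substitute the details from your provider.
PROXY_URL = "http://user:password@proxy.example.com:8080"

proxies = build_proxies(PROXY_URL)
print(proxies)
```

A service that offers several addresses could be used by choosing a different `proxy_url` for each call to `get_via_proxy`.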

## Keeping Track of Our Progress

