# Agenda

1. Introduction -- what is web scraping?
2. Getting started
    - Libraries
    - Web technology background
3. HTML and CSS
4. BeautifulSoup
    - Retrieving documents and parsing them
    - Using CSS to retrieve pieces of documents
5. Scrapy
    - Building a simple spider
    - Building a more complex spider project
6. Scrapy settings and debugging

# Introduction

The original idea of the Web was that there would be documents, and they would be marked up with HTML (a tagging system). Over time, several things happened:

- HTML pages became dynamically produced. They no longer had to represent an actual document on a disk somewhere.
- CSS (Cascading Style Sheets) arrived as a separate technology and language in the browser, alongside the HTML, that describes how things should look (and, to some degree, how they should behave).
- JavaScript became part of most Web pages. It provides computation that runs inside of your browser, interacting with the HTML and CSS, and also with the user's mouse clicks, keyboard entry, etc.

As more and more information was put onto the Web, we wanted to be able to find and extract it using software. The idea of "crawling the Web" or "scraping the Web" became a big thing.

If you want to scrape a Web page, it doesn't sound like it should be so hard. And there are libraries that you can use to parse the HTML. But those are kind of brittle and annoying, plus you want something at a higher level -- either to deal with HTML pages at a higher level, or even the whole process of searching + scraping at a higher level.

Before you scrape a Web site, you should be sure that you have permission to store + use the content you get from there.

Another issue: Web scraping can really affect the performance of a Web server. There are standards describing how much you can retrieve from a site, and what you're allowed to view. These rules are typically published in a file called `robots.txt`. That file indicates what can and cannot be retrieved automatically.
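
As a rough sketch of how a scraper might respect these rules, Python's standard library includes `urllib.robotparser`. The `robots.txt` content, the `my-scraper` user-agent name, and the `example.com` URLs below are all made up for illustration, and the file is parsed from a string so that nothing is fetched from the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text that we already have in memory."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "my-scraper" is a made-up user-agent name for this sketch.
print(parser.can_fetch("my-scraper", "https://example.com/index.html"))
print(parser.can_fetch("my-scraper", "https://example.com/private/report.html"))

# How many seconds the site asks us to wait between requests (None if unspecified).
print(parser.crawl_delay("my-scraper"))
```

Against a real site you would more likely call `set_url()` and `read()` on the parser so that it downloads the site's actual `robots.txt`.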

Your browser is an HTTP client; it sends a request to the HTTP server. That request basically says, "Give me document xyz." The simplest possible request is what we call `GET`. Along with that request, we'll send a bunch of HTTP request headers, basically a dict indicating what sort of response we want, plus metadata that the server might want to use.

The server then returns a *response* to us. The response will have a numeric status code (200 == OK, 404 == not found, etc.). The response will also have content. That content can be in HTML.
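
The numeric codes are grouped into families by their first digit. As a small illustration, the standard library's `http.HTTPStatus` enum can turn a code into its standard phrase. The `describe_status` helper is a made-up name for this sketch, not part of any library.

```python
from http import HTTPStatus

# The first digit of a status code tells us its general family.
CATEGORIES = {
    1: "informational",
    2: "success",
    3: "redirect",
    4: "client error",
    5: "server error",
}


def describe_status(code: int) -> str:
    """Return a short, human-readable description of an HTTP status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # HTTPStatus only knows about registered codes.
        return f"{code} (not a standard status code)"
    return f"{code} {status.phrase} ({CATEGORIES[code // 100]})"


for code in (200, 301, 404, 500):
    print(describe_status(code))
```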

When we make that request to the server, we send (among other things) a User-Agent header, indicating what kind of browser we're using.
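
When we scrape from Python, we get to choose what that header says. The sketch below uses a `requests` session to set it, and prepares a request without sending it so that we can inspect the outgoing headers. The agent string and the `example.com` URLs are placeholders, not recommendations.

```python
import requests

# A session carries default headers that are sent with every request it makes.
session = requests.Session()

# By default, requests identifies itself as "python-requests/<version>".
default_agent = session.headers["User-Agent"]

# Hypothetical identifier for our scraper; a contact URL lets site owners reach us.
session.headers["User-Agent"] = "my-scraper/0.1 (+https://example.com/contact)"

# Prepare (but do not send) a request, so we can inspect what would go out.
request = requests.Request("GET", "https://example.com/")
prepared = session.prepare_request(request)

print(default_agent)
print(prepared.headers["User-Agent"])
```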

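It's very common for programmers to think that scraping HTML is a problem that we can solve with regular expressions. In practice, real-world HTML is too irregular for that, which is why we'll use a proper parser instead.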
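
To see where regular expressions tend to fall short, here is a small sketch comparing a plausible-looking pattern with the standard library's `html.parser`. The HTML snippet and the `LinkCollector` class are invented for this example; the three links are written in three different, equally valid ways.

```python
import re
from html.parser import HTMLParser

# Hypothetical snippet: three links, each written a little differently.
HTML = """
<a href="/one">One</a>
<a class="nav" href='/two'>Two</a>
<A
   HREF="/three">Three</A>
"""

# A reasonable-looking regular expression, which assumes one exact layout.
naive_links = re.findall(r'<a href="(.*?)">', HTML)


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser hands us lowercased tag and attribute names.
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")


collector = LinkCollector()
collector.feed(HTML)

print(naive_links)       # only the link that matches the exact layout
print(collector.links)   # all three links
```

You could keep patching the pattern to handle attribute order, quoting style, capitalization, and line breaks, but at that point you are writing a (bad) HTML parser.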

# Why do we scrape the Web?

- Data inside of HTML pages
- Text inside of HTML pages
- Cataloging of content
- Monitoring and/or retrieving data from our competitors

# What are we going to use?

- `requests` -- an HTTP client library in Python
- `BeautifulSoup` -- a parser for HTML pages that works on data we've already downloaded
- `Scrapy` -- all-in-one toolkit for creating spiders that retrieve from multiple sites/pages, and then let us extract and process that data in a number of different ways

# Let's talk about HTTP

When we make a request to a server, we're most commonly using a `GET` request.

    GET /myfile.txt HTTP/1.0

There are other verbs as well. `POST` is by far the most common of them.

Why do we have these verbs?

Conventionally, `GET` is used when we want to retrieve a file/resource, and maybe we want to pass a few name-value pairs along with the request, but not too much. Those can go in the URL.

    https://mysite.com?x=10&y=20
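
As a minimal sketch, the standard library's `urllib.parse` module can build such a query string from a dict, and pull the name-value pairs back out of a URL. The `q` parameter in the last line is just an illustrative name.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Build the query string from name-value pairs.
params = {"x": 10, "y": 20}
query = urlencode(params)
url = f"https://mysite.com/?{query}"
print(url)

# Going the other way: extract the pairs from a URL.
# Values come back as lists of strings, because a name may repeat.
parsed = parse_qs(urlsplit(url).query)
print(parsed)

# Spaces and special characters are escaped so that they are safe in a URL.
escaped = urlencode({"q": "web scraping & python"})
print(escaped)
```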

`POST` is meant, at least in theory, for when we're submitting data. If you fill out a form, then it's typically submitted using `POST`. The data sent with a `POST` request can be much larger and more structured than what fits in a `GET` request. There are some other verbs as well, and some sites implement them and do things with them, but not that many.
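
One way to see the difference is to build both kinds of request with `requests` and inspect them without sending anything. This is only a sketch: the `/search` path is made up, and `prepare()` simply shows what would be sent.

```python
import requests

payload = {"x": 10, "y": 20}

# Build both requests without sending them, so we can see where the data goes.
get_req = requests.Request("GET", "https://mysite.com/search", params=payload).prepare()
post_req = requests.Request("POST", "https://mysite.com/search", data=payload).prepare()

# GET: the pairs are appended to the URL, and there is no body.
print(get_req.url)
print(get_req.body)

# POST: the URL is untouched, and the pairs travel in the request body.
print(post_req.url)
print(post_req.body)
print(post_req.headers["Content-Type"])
```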

When we send our request, we'll include a bunch of request headers.

When we get our response, we'll get a status code (number) plus a bunch of response headers plus the content (we can hope).

# `requests`

The `requests` library makes it easy to do this sort of thing.

In [1]:

    import requests

    r = requests.get('https://python.org')

In [2]:

    type(r)

    requests.models.Response

In [3]:

    r.status_code

    200

In [9]:

    for key, value in r.headers.items():
        print(f'{key:.<30}: {value}')

    Connection....................: keep-alive
    Content-Length................: 50629
    Content-Type..................: text/html; charset=utf-8
    X-Frame-Options...............: SAMEORIGIN
    Via...........................: 1.1 varnish, 1.1 varnish
    Accept-Ranges.................: bytes
    Date..........................: Mon, 26 Aug 2024 15:31:20 GMT
    Age...........................: 1485
    X-Served-By...................: cache-iad-kiad7000025-IAD, cache-fra-etou8220046-FRA
    X-Cache.......................: HIT, HIT
    X-Cache-Hits..................: 7, 4
    X-Timer.......................: S1724686280.469753,VS0,VE0
    Vary..........................: Cookie
    Strict-Transport-Security.....: max-age=63072000; includeSubDomains; preload
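
The headers object that `requests` gives us behaves like a dict with case-insensitive keys. The sketch below rebuilds a few of the headers shown above by hand, so that it runs without a network connection; with a real response you would use `r.headers` directly.

```python
from requests.structures import CaseInsensitiveDict
from requests.utils import get_encoding_from_headers

# A few of the headers from the response above, copied by hand for this sketch.
headers = CaseInsensitiveDict({
    "Content-Type": "text/html; charset=utf-8",
    "Content-Length": "50629",
    "X-Cache": "HIT, HIT",
})

# Header names are case-insensitive, so any capitalization finds the value.
print(headers["content-type"])
print(headers.get("CONTENT-LENGTH"))

# Header values are always strings; convert when you need a number.
size_in_bytes = int(headers["Content-Length"])

# The charset in Content-Type tells us how to decode the body into text.
encoding = get_encoding_from_headers(headers)
print(size_in_bytes, encoding)
```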
