# Agenda

1. Introduction -- what is web scraping?
2. Getting started
    - Libraries
    - Web technology background
3. HTML and CSS
4. BeautifulSoup
    - Retrieving documents and parsing them
    - Using CSS to retrieve pieces of documents
5. Scrapy
    - Building a simple spider
    - Building a more complex spider project
6. Scrapy settings and debugging

# Introduction

The original idea of the Web was that there would be documents, and they would be marked up with HTML (a tagging system). Over time, several things happened:

- HTML pages became dynamically produced. They no longer had to represent an actual document on a disk somewhere.
- CSS (Cascading Style Sheets) arrived as a separate technology and language in the browser, alongside the HTML, that describes how things should look (and to some degree, how they should behave).
- JavaScript became part of most Web pages, providing computation that runs inside of your browser, interacting with the HTML and CSS, and also with the user's mouse clicks, keyboard entry, etc.

As more and more information was put on to the Web, we wanted to be able to find and extract it using software. The idea of "crawling the Web" or "scraping the Web" became a big thing. 

If you want to scrape a Web page, it doesn't sound like it should be so hard. And there are libraries that you can use to parse the HTML. But those are kind of brittle and annoying, plus you want something at a higher level -- either to deal with HTML pages at a higher level, or even the whole process of searching + scraping at a higher level.

Before you scrape a Web site, you should be sure that you have permission to store + use the content you get from there.

Another issue: Web scraping can really affect the performance of a Web server. There are standards describing how much you can retrieve from a site, and what you're allowed to view. These rules normally live in a file called `robots.txt`, which indicates what can and cannot be retrieved automatically.
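
Python's standard library can read `robots.txt` rules for us. Here's a minimal sketch, assuming a made-up `robots.txt` and a made-up scraper name (`my-scraper`); it parses the rules from a string, so nothing is downloaded:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one lives at https://<site>/robots.txt
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())   # parse from text, no network access needed

# 'my-scraper' and the example.com URLs are placeholders
allowed_public = parser.can_fetch('my-scraper', 'https://example.com/index.html')
allowed_private = parser.can_fetch('my-scraper', 'https://example.com/private/data.html')
delay = parser.crawl_delay('my-scraper')

print(allowed_public)    # True
print(allowed_private)   # False
print(delay)             # 5
```

Against a real site, you would call `parser.set_url(...)` with the site's `robots.txt` URL and then `parser.read()` instead of `parse`.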

Your browser is an HTTP client; it sends a request to the HTTP server. That request basically says, "Give me document xyz." The simplest possible request is what we call `GET`. Along with that request, we'll send a bunch of HTTP request headers, basically a dict indicating what sort of response we want, plus metadata the server might want to use.

The server then returns a *response* to us. The response will have a numeric status code (200 == OK, 404 == no such file, etc.). The response will also have content. That content can be in HTML.
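
If you want to see what a status code means, the standard library's `http.HTTPStatus` knows the official names. A small sketch (the four codes here are just common examples):

```python
from http import HTTPStatus

# Look up the standard phrase for a few common status codes
for code in (200, 301, 404, 500):
    print(code, HTTPStatus(code).phrase)

ok_phrase = HTTPStatus(200).phrase        # 'OK'
missing_phrase = HTTPStatus(404).phrase   # 'Not Found'
```

With `requests`, the same number shows up as `r.status_code` on the response object.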

When we make that request to the server, we send (among other things) a User-Agent header, indicating what kind of browser we're using.
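
Here's a sketch of what the User-Agent header looks like when we use the `requests` library (introduced below). It only *prepares* requests, without sending them, so it runs offline. The URL and the custom agent string are placeholders, not anything a real site expects:

```python
import requests

url = 'https://example.com/'   # placeholder URL; nothing is actually sent

session = requests.Session()

# prepare_request() builds the request and merges in the session's default headers
default_req = session.prepare_request(requests.Request('GET', url))
default_agent = default_req.headers['User-Agent']

# Passing our own header overrides the default (hypothetical scraper name)
custom_req = session.prepare_request(
    requests.Request('GET', url,
                     headers={'User-Agent': 'my-scraper/0.1 (contact@example.com)'})
)
custom_agent = custom_req.headers['User-Agent']

print(default_agent)   # starts with 'python-requests/'
print(custom_agent)
```

In other words, unless we say otherwise, the server can tell that the request came from a Python program rather than a browser.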

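It's very common for programmers to think that this (scraping HTML) is a problem that we can solve with regular expressions. It usually isn't: HTML nests tags inside of other tags, and regular expressions don't keep track of nesting, so a real parser is the more reliable tool.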
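
Here's a small sketch of where a regular expression goes wrong on nested tags, compared with the standard library's `html.parser`. The HTML snippet and the `DivText` class are made up for illustration:

```python
import re
from html.parser import HTMLParser

html = '<div>outer <div>inner</div> tail</div>'

# The non-greedy regex stops at the FIRST closing tag, cutting the outer div short
regex_result = re.search(r'<div>(.*?)</div>', html).group(1)


class DivText(HTMLParser):
    """Collect all text found inside div tags, tracking how deeply we're nested."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'div':
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


parser = DivText()
parser.feed(html)
parser.close()
parser_result = ''.join(parser.chunks)

print(repr(regex_result))    # 'outer <div>inner'
print(repr(parser_result))   # 'outer inner tail'
```

The regex returns a fragment with a dangling tag in it; the parser sees the whole structure.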

# Why do we scrape the Web?

- Data inside of HTML pages
- Text inside of HTML pages
- Cataloging of content
- Monitoring and/or retrieving data from our competitors

# What are we going to use?

- `requests` -- an HTTP client library in Python
- `BeautifulSoup` -- a parser for HTML pages that works on data we've already downloaded
- `Scrapy` -- all-in-one toolkit for creating spiders that retrieve from multiple sites/pages, and then let us extract and process that data in a number of different ways

# Let's talk about HTTP

When we make a request to a server, we're most commonly using a `GET` request.

    GET /myfile.txt HTTP/1.0

There are other verbs, as well:

`POST` is by far the most common of them.

Why do we have these verbs?

Conventionally, `GET` is used when we want to retrieve a file/resource, and maybe we want to pass a few name-value pairs along with the request, but not too much. Those can go in the URL.

    https://mysite.com?x=10&y=20
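
The name-value pairs in a URL like that one can be pulled apart (and put back together) with the standard library. A minimal sketch, using the same example URL:

```python
from urllib.parse import urlsplit, parse_qs, urlencode

url = 'https://mysite.com?x=10&y=20'

parts = urlsplit(url)
params = parse_qs(parts.query)            # values come back as lists of strings

rebuilt = urlencode({'x': 10, 'y': 20})   # dict -> query string

print(parts.netloc)   # mysite.com
print(params)         # {'x': ['10'], 'y': ['20']}
print(rebuilt)        # x=10&y=20
```

The values are lists because the same name can legally appear more than once in a query string.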

`POST` is meant, at least in theory, for when we're submitting data. If you fill out a form, then it's typically submitted using `POST`. The data that can be sent is much larger and more structured than what can be done with a `GET` request.  There are some other verbs as well, and some sites implement them and do things with them, but not that many.
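
To see the difference between the two verbs, here's a sketch that prepares (but doesn't send) a `GET` and a `POST` with the same name-value pairs, using `requests`. The `payload` dict is just an example:

```python
import requests

payload = {'x': 10, 'y': 20}

# .prepare() builds the request without sending it, so this runs offline
get_req = requests.Request('GET', 'https://mysite.com', params=payload).prepare()
post_req = requests.Request('POST', 'https://mysite.com', data=payload).prepare()

print(get_req.url)     # the pairs are part of the URL
print(get_req.body)    # None -- a GET has no body here
print(post_req.url)    # no query string
print(post_req.body)   # x=10&y=20 -- the pairs travel in the request body
print(post_req.headers['Content-Type'])
```

With `GET`, the data is visible in the URL; with `POST`, it goes in the body, along with a `Content-Type` header describing how it was encoded.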

When we send our request, we'll include a bunch of request headers.

When we get our response, we'll get a status code (number) plus a bunch of response headers plus the content (we can hope).

# `requests`

The `requests` library makes it easy to do this sort of thing.

In [1]:
import requests

r = requests.get('https://python.org')

In [2]:
type(r)

requests.models.Response

In [3]:
r.status_code

200

In [10]:
for key, value in sorted(r.headers.items()):
    print(f'{key:.<30}: {value}')

Accept-Ranges.................: bytes
Age...........................: 1485
Connection....................: keep-alive
Content-Length................: 50629
Content-Type..................: text/html; charset=utf-8
Date..........................: Mon, 26 Aug 2024 15:31:20 GMT
Strict-Transport-Security.....: max-age=63072000; includeSubDomains; preload
Vary..........................: Cookie
Via...........................: 1.1 varnish, 1.1 varnish
X-Cache.......................: HIT, HIT
X-Cache-Hits..................: 7, 4
X-Frame-Options...............: SAMEORIGIN
X-Served-By...................: cache-iad-kiad7000025-IAD, cache-fra-etou8220046-FRA
X-Timer.......................: S1724686280.469753,VS0,VE0


In [13]:
print(r.content.decode())

<!doctype html>
<!--[if lt IE 7]>   <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9">   <![endif]-->
<!--[if IE 7]>      <html class="no-js ie7 lt-ie8 lt-ie9">          <![endif]-->
<!--[if IE 8]>      <html class="no-js ie8 lt-ie9">                 <![endif]-->
<!--[if gt IE 8]><!--><html class="no-js" lang="en" dir="ltr">  <!--<![endif]-->

<head>
    <!-- Google tag (gtag.js) -->
    <script async src="https://www.googletagmanager.com/gtag/js?id=G-TF35YF9CVH"></script>
    <script>
      window.dataLayer = window.dataLayer || [];
      function gtag(){dataLayer.push(arguments);}
      gtag('js', new Date());
      gtag('config', 'G-TF35YF9CVH');
    </script>
    <!-- Plausible.io analytics -->
    <script defer data-domain="python.org" src="https://plausible.io/js/script.js"></script>

    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">

    <link rel="prefetch" href="//ajax.googleapis.com/ajax/libs/jquery/1.8.2/jquery.min.js">
    <link rel="prefetch" 

# HTML intro (quickly)

Every HTML document is built out of *tags*. A tag typically looks like this:

    <tag>content</tag>

Above, our tag is called `tag`, and we have both an opening tag (`<tag>`) and a closing tag (`</tag>`).  In between we have content.  HTML defines a bunch of tags that we can use:

- `head`
- `body`
- `p` (paragraph of text)
- `h1` (highest level headline)
- `ol` (ordered list, meaning numbered)
- `ul` (unordered list, meaning bullets)
- `li` (an item in either an `ol` or a `ul`)

Every tag should be closed with a `</tag>` (where the tag name should match the opening). A few tags, such as `img`, `br`, and `meta`, have no content and thus no closing tag.

You can nest tags, so you can have one inside of another inside of another.

Inside of the opening tag, you can have "attributes." Those are name-value pairs that we see as `name="value"` inside of the opening tag.  Each tag has its own definitions for what it'll allow. For example, the `a` tag (for an "anchor," but mostly a hyperlink) has an `href` attribute whose value indicates where a link should go.

        <a href="https://www.python.org/psf/" title="The Python Software Foundation" >PSF</a>

The above is the `a` tag, with an `href` of the PSF's home page, another attribute `title` set to be some text, and then the content (between `<a>` and `</a>`), is `PSF`.

There are two special attributes that we can set on a tag:

- `id` gives it a unique name in the document
- `class` associates it with a category.

A tag can contain more than one attribute. A `class` attribute can contain more than one value, separated by spaces. And a class can be applied to more than one tag in the document.
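
Here's a sketch of how attributes look to a program, using the standard library's `html.parser`. The `p` tag, its `id`, and its class names are made up; the `a` tag is the one from above. `AttributePrinter` is a hypothetical helper class:

```python
from html.parser import HTMLParser


class AttributePrinter(HTMLParser):
    """Remember (tag name, attribute dict) for every opening tag."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples
        self.seen.append((tag, dict(attrs)))


html = '''
<p id="intro" class="lead highlight">Welcome to the
<a href="https://www.python.org/psf/" title="The Python Software Foundation">PSF</a></p>
'''

parser = AttributePrinter()
parser.feed(html)

for tag, attrs in parser.seen:
    print(tag, attrs)

p_classes = parser.seen[0][1]['class'].split()   # one attribute, two classes
link_target = parser.seen[1][1]['href']
```

Notice that `class` is a single attribute whose value we split on whitespace to get the individual class names.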

# Exercise: Retrieve content

1. Import `requests`
2. Make some `requests` requests. For each of the URLs you request:
    - What headers do you see in the response?
    - What HTML do you see?
    - Find 3 tags with IDs and another 3 with classes.

In [14]:
url = 'https://news.ycombinator.com'

r = requests.get(url)

In [16]:
for key, value in sorted(r.headers.items()):
    print(f'{key:.<30}: {value}')

Cache-Control.................: private; max-age=0
Connection....................: keep-alive
Content-Encoding..............: gzip
Content-Security-Policy.......: default-src 'self'; script-src 'self' 'unsafe-inline' https://www.google.com/recaptcha/ https://www.gstatic.com/recaptcha/ https://cdnjs.cloudflare.com/; frame-src 'self' https://www.google.com/recaptcha/; style-src 'self' 'unsafe-inline'; img-src 'self' https://account.ycombinator.com; frame-ancestors 'self'
Content-Type..................: text/html; charset=utf-8
Date..........................: Mon, 26 Aug 2024 15:58:34 GMT
Referrer-Policy...............: origin
Server........................: nginx
Strict-Transport-Security.....: max-age=31556900
Transfer-Encoding.............: chunked
Vary..........................: Accept-Encoding
X-Content-Type-Options........: nosniff
X-Frame-Options...............: DENY
X-XSS-Protection..............: 1; mode=block


In [19]:
print(r.content.decode())

<html lang="en" op="news"><head><meta name="referrer" content="origin"><meta name="viewport" content="width=device-width, initial-scale=1.0"><link rel="stylesheet" type="text/css" href="news.css?BdN4h6p4x0LjfslpeNJQ">
        <link rel="icon" href="y18.svg">
                  <link rel="alternate" type="application/rss+xml" title="RSS" href="rss">
        <title>Hacker News</title></head><body><center><table id="hnmain" border="0" cellpadding="0" cellspacing="0" width="85%" bgcolor="#f6f6ef">
        <tr><td bgcolor="#ff6600"><table border="0" cellpadding="0" cellspacing="0" width="100%" style="padding:2px"><tr><td style="width:18px;padding-right:4px"><a href="https://news.ycombinator.com"><img src="y18.svg" width="18" height="18" style="border:1px white solid; display:block"></a></td>
                  <td style="line-height:12pt; height:10px;"><span class="pagetop"><b class="hnname"><a href="news">Hacker News</a></b>
                            <a href="newest">new</a> | <a href="fro

# CSS selectors

CSS has a bunch of ways to find/retrieve the tags of a certain type.

- `p` -- this means just the tag `p`
- `p#my-id` -- this means the tag `p` whose ID is `my-id`
    - `tr#pagespace`
    - `table#hnmain`
- `p.class-name` -- this means the tag `p` that has the class `class-name`. A tag can have more than one class, and a class can be on more than one tag
    - `tr.spacer`
    - `tr.athing`
- `div p` -- this means a `p` tag inside of a `div`
    - `td.votelinks a`  -- find a link inside of `td.votelinks`
- `div > p` -- this means that `p` needs to be *directly* inside of a `div`
    - `td.votelinks > a`  -- find a link immediately inside of `td.votelinks`, not several layers down
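
To try these selectors out, here's a sketch using BeautifulSoup's `select` method (covered in the next section) on a tiny, made-up HTML snippet. It counts how many tags each selector matches:

```python
from bs4 import BeautifulSoup

# Made-up HTML: one p directly in the div, one nested deeper, one outside
html = '''
<div>
  <p id="my-id" class="class-name">directly inside the div</p>
  <section><p class="class-name other">nested deeper</p></section>
</div>
<p>outside any div</p>
'''

soup = BeautifulSoup(html, 'html.parser')

counts = {
    selector: len(soup.select(selector))
    for selector in ['p', 'p#my-id', 'p.class-name', 'div p', 'div > p']
}

for selector, n in counts.items():
    print(f'{selector:.<15}: {n}')
```

The interesting pair is `div p` (2 matches, at any depth) versus `div > p` (1 match, direct children only).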


# Scraping with BeautifulSoup

BS is a library that assumes you have already downloaded HTML content, probably with `requests`. It's meant to let you parse through the HTML, finding text or values or anything you want inside of the HTML.  If you want to search for a string in HTML, you can just use `in`. If you want to search for a string inside of a specific tag, having a specific ID or class, then BS is going to make it much easier.

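You can install BeautifulSoup with

    pip install beautifulsoup4

(`pip install bs4` also works; that package is a thin wrapper that pulls in `beautifulsoup4`.)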



In [20]:
from bs4 import BeautifulSoup

In [30]:
# now we create an instance of BeautifulSoup
# we pass it two arguments:
# - the HTML content, as a string or as bytes (r.content, used below, is bytes)
# - optionally, the HTML parser we want to use -- we can just use the string `html.parser`, but
#   if you want, you can try others, such as `lxml` (installable with pip) that is supposed to be much faster

soup = BeautifulSoup(r.content, 'html.parser')

In [22]:
type(soup)

bs4.BeautifulSoup

In [24]:
# what do we do with this soup object, which knows how to navigate inside of our content?

print(soup.get_text())  # this returns the text, stripped of HTML tags




Hacker News

Hacker News
new | past | comments | ask | show | jobs | submit 
login




1. Dokku: My favorite personal serverless platform (hamel.dev)
46 points by tosh 35 minutes ago  | hide | 18 comments 



2. Linux: We need tiling desktop environments (linuxblog.io)
37 points by ashitlerferad 51 minutes ago  | hide | 32 comments 



3. Launch HN: Parity (YC S24) – AI for on-call engineers working with Kubernetes
15 points by wilson090 1 hour ago  | hide | 4 comments 



4. Fixing a bug in Google Chrome as a first-time contributor (cprimozic.net)
298 points by Ameo 6 hours ago  | hide | 67 comments 



5. DOJ Files Antitrust Suit Against RealPage, Maker of Rent-Setting Algorithm (propublica.org)
112 points by keiran_cull 1 hour ago  | hide | 70 comments 



6. NSA releases 1982 Grace Hopper lecture (nsa.gov)
47 points by gaws 3 hours ago  | hide | 4 comments 



7. The Mystics of Progress (2023) (isaacyoung.substack.com)
9 points by KqAmJQ7 1 hour ago  | hide | 2 comments 



8. R

In [26]:
# we can use the "soup.find" method to retrieve the first tag with a given name
# (for CSS selectors, use soup.select or soup.select_one instead)

soup.find('tr')

<tr><td bgcolor="#ff6600"><table border="0" cellpadding="0" cellspacing="0" style="padding:2px" width="100%"><tr><td style="width:18px;padding-right:4px"><a href="https://news.ycombinator.com"><img height="18" src="y18.svg" style="border:1px white solid; display:block" width="18"/></a></td>
<td style="line-height:12pt; height:10px;"><span class="pagetop"><b class="hnname"><a href="news">Hacker News</a></b>
<a href="newest">new</a> | <a href="front">past</a> | <a href="newcomments">comments</a> | <a href="ask">ask</a> | <a href="show">show</a> | <a href="jobs">jobs</a> | <a href="submit" rel="nofollow">submit</a> </span></td><td style="text-align:right;padding-right:4px;"><span class="pagetop">
<a href="login?goto=news">login</a>
</span></td>
</tr></table></td></tr>

In [34]:
type(soup.find_all('tr')[0])

bs4.element.Tag

In [35]:
type(soup.find_all('tr')[1])

bs4.element.Tag

In [36]:
soup.find_all('tr')[1]

<tr><td style="width:18px;padding-right:4px"><a href="https://news.ycombinator.com"><img height="18" src="y18.svg" style="border:1px white solid; display:block" width="18"/></a></td>
<td style="line-height:12pt; height:10px;"><span class="pagetop"><b class="hnname"><a href="news">Hacker News</a></b>
<a href="newest">new</a> | <a href="front">past</a> | <a href="newcomments">comments</a> | <a href="ask">ask</a> | <a href="show">show</a> | <a href="jobs">jobs</a> | <a href="submit" rel="nofollow">submit</a> </span></td><td style="text-align:right;padding-right:4px;"><span class="pagetop">
<a href="login?goto=news">login</a>
</span></td>
</tr>

In [37]:
soup.find_all('tr')[1].text

'\nHacker News\nnew | past | comments | ask | show | jobs | submit \nlogin\n\n'

In [39]:
soup.find_all('tr')[31].text

'\n10. Coolify’s rise to fame, and why it could be a big deal (api-fiddle.com)'

# Searching in BS4

1. You can use `soup.find`, giving it:
    - First argument is a string, the tag
    - Keyword argument `attrs` is a dict, with the names and values of attributes in the tag (including ID and class)
2. This way, you can express the same thing a CSS selector would, but with Python arguments.
3. You can also use `soup.select` and then use the CSS selector directly.
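
One thing worth knowing before we use these: they behave differently when nothing matches. A small offline sketch, with made-up HTML loosely modeled on the Hacker News table rows:

```python
from bs4 import BeautifulSoup

html = '<table><tr id="pagespace"></tr><tr class="athing"></tr><tr class="athing"></tr></table>'
soup = BeautifulSoup(html, 'html.parser')

# find returns a single Tag, or None if there is no match
missing = soup.find('tr', attrs={'id': 'no-such-id'})

# find_all and select return list-like results, empty if there is no match
missing_all = soup.find_all('tr', attrs={'id': 'no-such-id'})

# select_one is the CSS-selector counterpart of find: first match or None
first_athing = soup.select_one('tr.athing')
all_athing = soup.select('tr.athing')

print(missing)           # None
print(missing_all)       # []
print(first_athing)
print(len(all_athing))   # 2
```

So if you call `.text` on the result of `find` or `select_one`, check for `None` first; otherwise a page that changes its layout will give you an `AttributeError`.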

In [40]:
# Find: tr#pagespace

soup.find('tr', attrs={'id':'pagespace'})

<tr id="pagespace" style="height:10px" title=""></tr>

In [41]:
# What if we look for multiples with find_all?

soup.find_all('tr', attrs={'id':'pagespace'})

[<tr id="pagespace" style="height:10px" title=""></tr>]

In [42]:
# I can instead use a CSS selector

soup.select('tr#pagespace')

[<tr id="pagespace" style="height:10px" title=""></tr>]

In [54]:
# given a Tag object, we can iterate over its children
# with the .children attribute

for one_child in soup.select('td.votelinks a')[0].children:
    print(one_child)

<div class="votearrow" title="upvote"></div>


# Exercise: Parsing with BS4

1. Retrieve the content from http://quotes.toscrape.com/
2. Retrieve the first quote's text.
3. Retrieve the second quote's author.
4. Retrieve all of the quotes, iterate over them, and print them.

In [55]:
import requests

url = 'https://quotes.toscrape.com'

r = requests.get(url)

In [56]:
r.headers

{'Date': 'Mon, 26 Aug 2024 16:35:26 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Strict-Transport-Security': 'max-age=0; includeSubDomains; preload', 'Content-Encoding': 'br'}

In [58]:
print(r.content.decode())

<!DOCTYPE html>
<html lang="en">
<head>
	<meta charset="UTF-8">
	<title>Quotes to Scrape</title>
    <link rel="stylesheet" href="/static/bootstrap.min.css">
    <link rel="stylesheet" href="/static/main.css">
</head>
<body>
    <div class="container">
        <div class="row header-box">
            <div class="col-md-8">
                <h1>
                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>
                </h1>
            </div>
            <div class="col-md-4">
                <p>
                
                    <a href="/login">Login</a>
                
                </p>
            </div>
        </div>
    

<div class="row">
    <div class="col-md-8">

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
        <span>by <small class="author" itempr

In [59]:
soup = BeautifulSoup(r.content, 'html.parser')

In [66]:
soup.find('div', attrs={'class':'quote'})

<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
<a class="tag" href="/tag/change/page/1/">change</a>
<a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
<a class="tag" href="/tag/thinking/page/1/">thinking</a>
<a class="tag" href="/tag/world/page/1/">world</a>
</div>
</div>

In [70]:
# first quote's text
soup.select('div.quote span.text')[0].text

'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'

In [72]:
# second quote's author
soup.select('div.quote small.author')[1].text

'J.K. Rowling'

In [93]:
for one_quote_div in soup.select('div.quote'):
    for one_child in one_quote_div.children:
        if one_child.name == 'span':
            print(one_child.text.replace('(about)', '').strip())

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
by Albert Einstein
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
by J.K. Rowling
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
by Albert Einstein
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
by Jane Austen
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
by Marilyn Monroe
“Try not to become a man of success. Rather become a man of value.”
by Albert Einstein
“It is better to be hated for what you are than to be loved for what you are not.”
by André Gide
“I have not failed. I've just found 10,000 ways that won't work.”
by Thomas A. Edison
“A woman is like a tea bag; you never know how strong it is until it's in hot water.”
by Ele
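
An alternative sketch for the last step: rather than walking over `.children` and cleaning up the `(about)` text, we can run `select_one` on each quote's `div` to grab the text and author separately. The HTML below is a small stand-in that follows the structure of the real page, with invented quotes and authors:

```python
from bs4 import BeautifulSoup

# Stand-in for the downloaded page; same tag/class structure, invented content
html = '''
<div class="quote">
  <span class="text">First quote.</span>
  <span>by <small class="author">Author One</small>
  <a href="/author/Author-One">(about)</a></span>
</div>
<div class="quote">
  <span class="text">Second quote.</span>
  <span>by <small class="author">Author Two</small>
  <a href="/author/Author-Two">(about)</a></span>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

quotes = []
for one_quote_div in soup.select('div.quote'):
    # selectors here are relative to this one div, not the whole document
    text = one_quote_div.select_one('span.text').text
    author = one_quote_div.select_one('small.author').text
    quotes.append((text, author))

for text, author in quotes:
    print(f'{text} -- {author}')
```

This keeps each quote paired with its author, which is handy if you later want to store them in a CSV file or a data frame.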

# What is Scrapy?

It's a full toolkit for creating Web spiders -- the programs that go through links and extract values from Web pages.  It can:

- Retrieve data from a set of Web pages
- Select what data is interesting/useful to use
- Store it in a variety of ways
- Preprocess it in a variety of ways
- Handle `robots.txt` automatically
- Change the user agent (and, with proxies or add-on middleware, the IP address) to avoid being blocked
- Provide many utilities and conveniences for working with Web scraping/spiders
- Handle sessions and cookies
- Work concurrently
- Work with external packages that handle JavaScript
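
Several of these features are controlled through a Scrapy project's `settings.py` file. Here's a sketch of what such a fragment might look like. The setting names are Scrapy's, but every value (including the user agent string and the output filename) is just an illustration, not a recommendation:

```python
# settings.py (fragment) -- illustrative values only

ROBOTSTXT_OBEY = True        # respect each site's robots.txt rules

# Hypothetical identity for our spider; sites see this in the User-Agent header
USER_AGENT = 'my-scraper/0.1 (+https://example.com/contact)'

COOKIES_ENABLED = True       # keep cookies (and thus sessions) between requests

CONCURRENT_REQUESTS = 8      # how many requests Scrapy may have in flight at once
DOWNLOAD_DELAY = 1.0         # seconds to wait between requests to the same site

# Where and how to store scraped items (hypothetical filename)
FEEDS = {
    'quotes.json': {'format': 'json'},
}
```

Lowering `CONCURRENT_REQUESTS` and raising `DOWNLOAD_DELAY` are the usual ways to be gentler on the server you're scraping.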