# Scraping and crawling the web

This second day's workshop gives you practice scraping and crawling with modern Python tools. 

To review from part 1, web-scraping means “programmatically going over a collection of web pages and extracting data”. Scraping is a powerful tool for working with data on the web, but it depends on knowing where the information you want is located. If you don't have specific target URLs, you can often find them programmatically, either by following links around the internet to locate content (often called *web-crawling*) or by scraping automated search results (as we practiced in part 1). Once you have the pages you want, you extract and parse information from them (often called *web-scraping*). These two steps often happen together and recursively: you crawl some sites, but upon scraping them you realize you got the wrong websites, so you go back to crawling, which changes your scraping approach, and so on.

We will start with an essential step in any web-scraping or web-crawling pipeline: parsing HTML (with `BeautifulSoup`). Then we will dig into `Scrapy`, a flexible and powerful tool for crawling and scraping heterogeneous websites (what I call _broad crawling_). Scrapy also supports _narrow crawling_ (focused, precise scraping of a few websites).

### Standing on the shoulders of... spiders?

You can build a scraper from scratch using low-level modules or libraries, but then you have to deal with some potential headaches as your scraper grows more complex. For example, you'll need to handle concurrency so you can crawl more than one page at a time. You'll probably want to figure out how to transform your scraped data into different formats like CSV, XML, or JSON. And you'll sometimes have to deal with sites that require specific settings and access patterns.

You'll have better luck if you build your scraper on top of an existing library that handles those issues for you. For this tutorial, we will build some intuition for web-scraping by working with low-level approaches, using the `Requests` and `BeautifulSoup` libraries to make requests and parse the result. Then we will build a scraper with *Scrapy*, which is one of the most popular, flexible, and powerful Python scraping libraries. Scrapy takes a "batteries included" approach to scraping, meaning that it handles a lot of the common functionality that all scrapers need. This prevents you from reinventing the wheel--or worse, the flat tire!

To learn about crawling with Scrapy, we will explore [quotes.toscrape.com](http://quotes.toscrape.com), a scraping-friendly website that lists quotes from famous authors. By the end of today, you’ll have a fully functional web scraper that walks through a series of pages and extracts data from each page. The scraper will be easily expandable so you can tinker around with it and use it as a foundation for your own projects scraping data from the web.

You can also read more about the [basics of scrapy](https://docs.scrapy.org/en/latest/intro/overview.html), its [architecture](https://docs.scrapy.org/en/latest/topics/architecture.html), or see [the FAQ](https://docs.scrapy.org/en/latest/faq.html). And if you need a refresher on scraping with Beautiful Soup, here's a [good tutorial](https://www.digitalocean.com/community/tutorials/how-to-scrape-web-pages-with-beautiful-soup-and-python-3).

### What kind of crawling?

A flexible tool like Scrapy can be used in many different ways depending on the task at hand.

What I call _narrow crawling_ means focusing on a limited set of pre-defined domains--that is, studying their HTML and CSS structures and exploiting these to extract specific information repeatedly. This maximizes precision in scraping while sacrificing _extensibility_: the ability to incorporate new domains or be resilient to changes in website structure. This is what people usually mean when they say "web-scraping". It may or may not expand beyond the initial set of websites, but it may crawl more pages within this set (_vertical crawling_).

What I call _broad crawling_ makes the opposite tradeoff, collecting information on a range of websites and promoting flexibility in its scraping algorithm (way of extracting website information) at the expense of generally less clean output. It may identify websites to scrape by Google search (what do people click on most?), network analysis (what websites tend to link to one another?), or _link extraction_: finding all within-domain links on a given webpage, then all within-domain links on the pages those links point to, and so on to a specified depth. The messier output from broad crawling can present challenges for data cleaning and analysis (remember "garbage in, garbage out"?), but this depends on the application.
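
To make the idea of link extraction to a specified depth concrete, here is a minimal sketch that uses only the standard library. The `FAKE_SITE` dictionary and the `extract_links` function are hypothetical names invented for this illustration: the dictionary stands in for real pages, which a real crawler would download and parse.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

# Hypothetical in-memory "website": each URL maps to the hrefs found on that page.
# A real crawler would download each page and read its <a> tags instead.
FAKE_SITE = {
    "https://example.org/": ["/about/", "/news/", "https://twitter.com/share"],
    "https://example.org/about/": ["/about/staff/", "/"],
    "https://example.org/news/": ["/news/2021/", "#"],
    "https://example.org/about/staff/": ["/about/staff/jane/"],
}

def extract_links(start_url, max_depth):
    """Breadth-first, within-domain link extraction down to max_depth levels."""
    domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not look for links below the requested depth
        for href in FAKE_SITE.get(url, []):
            # Resolve relative links against the current page and drop fragments
            absolute = urljoin(url, href).split("#")[0]
            if urlparse(absolute).netloc == domain and absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))
    return seen

found = extract_links("https://example.org/", max_depth=2)
print(sorted(found))
```

With `max_depth=2`, the off-domain Twitter link is ignored and the page three clicks away (`/about/staff/jane/`) is never reached.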


## Outline

* [Parsing HTML](#parsing)
    * [Pretty parsing with `BeautifulSoup`](#BS)
    * [Getting human-readable text](#readable)
* [Crawling broadly with `Scrapy`](#scrapy)
    * [A simple (narrow) spider](#simple)
    * [Link extraction in a (broad) spider](#linkextraction)


## Vocabulary

* *relative link*: 
    * A link that builds on a given domain name. For instance, `/pennsylvania/` as a link from `https://www.politifact.com` points to `https://www.politifact.com/pennsylvania/`.
* *absolute link*: 
    * A link that includes the complete domain name and can be accessed from anywhere, e.g. `https://www.politifact.com/pennsylvania/`.
* *narrow crawling (less extensible)*: 
    * Scraping a limited set of pre-defined domains: studying their HTML and CSS structures and exploiting these to extract specific information repeatedly. This maximizes precision in scraping while sacrificing extensibility (ability to incorporate new domains or changes in website structure). What people usually mean when they say "web-scraping". 
* *broad crawling (more extensible)*: 
    * Collecting information on a range of websites and promoting flexibility in its scraping algorithm (way of extracting website information) at the expense of generally less clean output. It may identify websites to scrape by Google search (what do people click on most?), network analysis (what websites tend to link to one another?), or link extraction.
* *extensibility*:
    * Ability for a scraping approach to incorporate new domains or be resilient to changes in website structure. Generally higher for broad crawls than narrow crawls, at the expense of precision. 
* *link extraction*:
    * Finding all within-domain links on a given webpage, then all within-domain links on its children links, and so on to a specified depth. 
* *horizontal crawling*: 
    * Crawling on the same hierarchical level as the input domain, such as going from the first to the second page of Google results.
* *vertical crawling*:
    * Crawling at a higher or lower level from the input domain, such as navigating to the "About Us" page directly linked from a home page. 
    
### Credits
Part of this notebook was borrowed from [the official Scrapy tutorial](https://docs.scrapy.org/en/latest/intro/tutorial.html), part came from [this great book on Scrapy](https://learning.oreilly.com/library/view/learning-scrapy/9781784399788/) (you may have University access [here](https://www.safaribooksonline.com/library/view/temporary-access/)), part was inspired by [Geoff Bacon's web-scraping workshop](https://github.com/TextXD/introduction-to-web-scraping), and part was written by me ([Jaren Haber](https://www.jarenhaber.com/)). The first three are great resources for further exploration and learning!

**__________________________________**

# Parsing HTML <a id='parsing'></a>

The second step in web scraping is parsing HTML. This is where things can get a little tricky.

Let's start by looking more closely at HTML. Use your browser developer tools (e.g., in Chrome, right click > `Inspect`) to inspect the HTML of [the page listing all Fall 2021 Sociology courses at Georgetown University](https://myaccess.georgetown.edu/pls/bninbp/bwckgens.p_proc_term_date?p_term=202130&p_calling_proc=bwckschd.p_disp_dyn_sched#_ga=2.223705375.587656937.1619556624-282868439.1588700423) in your browser (select "Sociology" from the list then click "Get Courses"), and find the part of the HTML where the course headings are listed. There's a lot of other stuff in the file that we don't care too much about. You could try `Ctrl-F`ing for the name of a course you see on the webpage.

You should see something like this:

```
<tbody>
<tr><th class="ddtitle" scope="colgroup"><a href="/pls/bninbp/bwckschd.p_disp_detail_sched?term_in=202130&amp;crn_in=38298">Introduction to Sociology - 38298 - SOCI 001 - 01</a></th></tr>
<tr><th class="ddtitle" scope="colgroup"><a href="/pls/bninbp/bwckschd.p_disp_detail_sched?term_in=202130&amp;crn_in=38299">Introduction to Sociology - 38299 - SOCI 001 - 02</a></th></tr>
<tr><th class="ddtitle" scope="colgroup"><a href="/pls/bninbp/bwckschd.p_disp_detail_sched?term_in=202130&amp;crn_in=38300">Introduction to Sociology - 38300 - SOCI 001 - 03</a></th></tr>
<tr><th class="ddtitle" scope="colgroup"><a href="/pls/bninbp/bwckschd.p_disp_detail_sched?term_in=202130&amp;crn_in=38301">Introduction to Sociology - 38301 - SOCI 001 - 04</a></th></tr>
<tr><th class="ddtitle" scope="colgroup"><a href="/pls/bninbp/bwckschd.p_disp_detail_sched?term_in=202130&amp;crn_in=38302">Introduction to Sociology - 38302 - SOCI 001 - 05</a></th></tr>
<tr><th class="ddtitle" scope="colgroup"><a href="/pls/bninbp/bwckschd.p_disp_detail_sched?term_in=202130&amp;crn_in=40419">Sociology of Health/Illness - 40419 - SOCI 109 - 01</a></th></tr>
<tr><th class="ddtitle" scope="colgroup"><a href="/pls/bninbp/bwckschd.p_disp_detail_sched?term_in=202130&amp;crn_in=36639">Race, Society &amp; Cinema - 36639 - SOCI 133 - 01</a></th></tr>
```

This is HTML. HTML uses "tags": code that surrounds the raw text and indicates the structure of the content. The tags are enclosed in `<` and `>` symbols. For example, `<li>` says "this is a new thing in a list" and `</li>` says "that's the end of that new thing in the list". Similarly, the `<a ...>` and the `</a>` say, "everything between us is a hyperlink". And likewise, the `<tr>` and the `</tr>` enclose a table row, while the `<th>` and the `</th>` enclose a header cell within that row.

In this HTML file, each course title is listed with `<th>...</th>` and is also linked to its own page using `<a>...</a>`. In our browser, if we click on the name of a course, it takes us to detailed information for that class, including Registration Availability. You'll see inside the `<a>` bit, there's a `href=...`. That tells us the (relative) location of the page it's linked to.
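
To see how the tags and the `href` attribute carry this structure, here is a small sketch that reads two of the rows above with Python's built-in `html.parser` module (a low-level stand-in for `BeautifulSoup`, which the next section introduces). The `CourseParser` class is a hypothetical name made up for this illustration.

```python
from html.parser import HTMLParser

# Two rows copied from the course listing shown above
rows = '''
<tr><th class="ddtitle" scope="colgroup"><a href="/pls/bninbp/bwckschd.p_disp_detail_sched?term_in=202130&amp;crn_in=38298">Introduction to Sociology - 38298 - SOCI 001 - 01</a></th></tr>
<tr><th class="ddtitle" scope="colgroup"><a href="/pls/bninbp/bwckschd.p_disp_detail_sched?term_in=202130&amp;crn_in=36639">Race, Society &amp; Cinema - 36639 - SOCI 133 - 01</a></th></tr>
'''

class CourseParser(HTMLParser):
    """Collect the link target and the text of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.in_link = False
        self.courses = []  # list of [href, title] pairs

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link = True
            self.courses.append([dict(attrs).get("href"), ""])

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:  # only keep text that sits between <a> and </a>
            self.courses[-1][1] += data

parser = CourseParser()
parser.feed(rows)
parser.close()

for href, title in parser.courses:
    print(title, "->", href)
```

Notice that the parser turns the HTML entity `&amp;` back into a plain `&`, both in the course title and in the link.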

## Pretty parsing with `BeautifulSoup` <a id='BS'></a>

Armed with this knowledge of HTML, let's try getting the HTML and parsing the fact checking page we saw earlier. We will use `requests` to get the HTML and its text, then `BeautifulSoup` to parse the result. (Check out [the `BeautifulSoup` docs](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) for lots of tips and tricks!)

In [None]:
# Import requests for downloading pages and BeautifulSoup for parsing them
import requests
from bs4 import BeautifulSoup

# Define URL to scrape
url = 'https://www.politifact.com/factchecks/2021/apr/02/citizens-united/citizens-united-calls-bidens-infrastructure-plan-g/'

# Download the page (the HTML is in the .text attribute of the response)
html = requests.get(url)

# Convert HTML into soup object
soup = BeautifulSoup(html.text, 'html.parser') # built-in parser ('lxml' is faster, if installed)

# See pretty formatting in soup object
print(soup.prettify()[:1200])

Pretty! Much more so than the plain ol' `requests.get().text` block we saw earlier. But this is just the beginning of what `BeautifulSoup` can do. It can also find specific tags, like paragraphs (via `<p>`), headers (via `<h1>`, `<h2>`, etc.), and hyperlinks (via `<a>` and their `href` attributes).

In most cases, the `<p>` tag is the most useful for extracting readable text from a webpage. Let's get the first 10 paragraph tags from this claim review page.

In [None]:
for paragraph in soup.find_all('p')[:10]: # first 10 paragraphs via <p> tag
    print(paragraph)

### Challenge

Find all the links in the above claim review page using the `<a>` tags and their `href` attributes. Print every 10th link. What do you notice about where these links point?

In [None]:
url = 'https://www.politifact.com/factchecks/2021/apr/02/citizens-united/citizens-united-calls-bidens-infrastructure-plan-g/'
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')

# Your solution here

We see lots of relative links (e.g., `/pennsylvania/`), places where the `href` seems to point nowhere (e.g., `#`), and communication shortcuts (e.g., `https://twitter.com/share?text=PolitiFact - Citizens United calls...`). This could be cleaned up by appending relative links to the domain name (`https://www.politifact.com/`) and keeping only URLs (and nothing after).
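
The cleanup described above can be sketched with the standard library's `urllib.parse`. The `hrefs` list below is a hand-made sample resembling the links on the page (the Twitter link is shortened), not the actual scraped output.

```python
from urllib.parse import urljoin, urlparse

base = "https://www.politifact.com/"

# Sample href values like the ones described above
hrefs = ["/pennsylvania/", "#", "https://twitter.com/share?text=PolitiFact", "/membership/"]

cleaned = []
for href in hrefs:
    if href.startswith("#"):
        continue  # skip placeholders that point nowhere
    absolute = urljoin(base, href)  # relative links get the domain prepended
    parts = urlparse(absolute)
    # Keep only scheme, domain, and path (drops the query string after "?")
    cleaned.append(f"{parts.scheme}://{parts.netloc}{parts.path}")

print(cleaned)
```

`urljoin()` leaves links that are already absolute untouched, so the same call works for both kinds of link.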

Sometimes these tags aren't very useful--in fact, they can get in the way of extracting only visible or human-readable text from the HTML. This too can be accomplished with `BeautifulSoup`!

## Getting human-readable text <a id='readable'></a>

Occasionally we want to learn about websites via their tags: What the headers say, which paragraph comes first, where the links or images are, etc. Other times tags (such as scripts or styles) only introduce extraneous characters and nonsense words, and we want to ignore the tags themselves or even the text they enclose. 

The simplest way to do this is with the `get_text()` method in `BeautifulSoup`, which returns all the text in a document or beneath a tag, as a single Unicode string. You might have noticed that the `<p>` tags got in the way in our above example. Let's try that again and this time, we will remove the tags.

In [None]:
for paragraph in soup.find_all('p')[:10]: # first 10 paragraphs via <p> tag
    print(paragraph.get_text().strip()) # extract text and strip leading/trailing whitespace

It's also easy to call the first element of the soup object matching a given tag, like so:

In [None]:
soup.p.get_text() # Get text of first paragraph

Another useful method is `extract()`, which can be used to surgically remove a tag or string from the soup tree, storing it for safe keeping. Let's extract the first 5 links:

In [None]:
extracted = [] # initialize list of extracted links

for link in soup.find_all('a')[:5]: # get first five <a> tags
    extracted.append(link.extract()) # extract the link
    
print('Extracted links:', extracted)
print()

# What are the first five links now that the previous five were removed? 
for link in soup.find_all('a')[:5]: 
    print(link)

What if we don't want to keep the tag at all? In this case, we would use `decompose()`, which obliterates a useless tag (and frees up memory). Unlike with `extract()`, with `decompose()` you don't need to assign the junk tag to anything to clear it--the method does this automatically. 
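
The difference between the two methods is easiest to see on a toy example. The HTML string and variable names below are made up for illustration and do not touch the `soup` object from the PolitiFact page.

```python
from bs4 import BeautifulSoup

toy_html = '<p>Read <a href="/one/">one</a>, <a href="/two/">two</a>, and <a href="/three/">three</a>.</p>'
toy_soup = BeautifulSoup(toy_html, 'html.parser')

first = toy_soup.a            # first <a> tag in the tree
kept = first.extract()        # removed from the tree, but handed back to us
print(kept)                   # the detached tag is still usable

second = toy_soup.a           # first <a> tag that remains
result = second.decompose()   # removed from the tree and destroyed
print(result)                 # decompose() gives nothing back

print(toy_soup.get_text())    # only the third link's text is left
```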

Let's rerun the link-removal code from the `extract()` example, this time with `decompose()` and `get_text()` to clean up the display.

In [None]:
url = 'https://www.politifact.com/factchecks/2021/apr/02/citizens-united/citizens-united-calls-bidens-infrastructure-plan-g/'
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')

for link in soup.find_all('a')[:5]: # get first five <a> tags
    link.decompose() # obliterate this link
    
# What are the first five links now that the previous five were removed? 
for link in soup.find_all('a')[:5]: 
    print(link.get_text().strip()) # get text and clean spacing

Not all websites use the `<p>` tag to indicate the important, human-readable text. Sometimes we need to approach HTML parsing from the other end: By finding and removing all non-informative tags. Let's use `BeautifulSoup` to build such a method. 

### Challenge

Use `decompose()` to remove from the soup all tags showing anything other than human-readable text. Below is a list of such junk tags to use as a blacklist. 

```
"b", "big", "i", "small", "tt", "abbr", "acronym", "cite", "dfn", "kbd", 
"samp", "var", "bdo", "map", "object", "q", "span", "sub", "sup", "head", 
"title", "[document]", "script", "style", "meta", "noscript"
```

In [None]:
# Your solution here

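You might have noticed that word boundaries can get clobbered when you call `get_text()`. By default, the method simply concatenates every bit of text in the tree with no separator, so text from neighboring tags runs together unless the HTML itself has whitespace between them. Passing `strip=True` makes this worse: it tells `BeautifulSoup` to strip whitespace (of any kind) from the beginning and end of each bit of text before joining. The default, `strip=False`, keeps the original whitespace, which often means lots of extra whitespace--usually newlines--that requires some regular expressions to clean up.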
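
A toy comparison shows how `get_text()` treats whitespace under different settings. The HTML string is made up for illustration; `separator` (the first argument) and `strip` are both parameters of `get_text()`.

```python
from bs4 import BeautifulSoup

toy_soup = BeautifulSoup('<div>\n<p> Hello </p>\n<p>world</p>\n</div>', 'html.parser')

as_is = toy_soup.get_text()                  # whitespace kept exactly as in the HTML
glued = toy_soup.get_text(strip=True)        # pieces stripped, then glued together
spaced = toy_soup.get_text(' ', strip=True)  # pieces stripped, then joined by a space

print(repr(as_is))
print(repr(glued))
print(repr(spaced))
```

Joining stripped pieces with a separator is one way to keep word boundaries; the challenge below asks you to get there with regular expressions instead.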

### Challenge

Using the above tags blacklist and `decompose()` as before, this time use the `strip=False` parameter when calling `get_text()` to avoid combining words across whitespace boundaries. Instead, use regular expressions to clean up extra whitespaces.

In [None]:
# Your solution here


### Challenge

You might have noticed that when we scraped HTML above from [this claim review by PolitiFact](https://www.politifact.com/factchecks/2021/apr/02/citizens-united/citizens-united-calls-bidens-infrastructure-plan-g/), we got headers and tags like this:
```html
<p>Misinformation isn't going away just because it's a new year. Support trusted, factual information with a tax deductible contribution to PolitiFact.</p>
<p>
<a class="m-disruptor-content__link" href="/membership/">More Info</a>
</p>
<p class="c-image__caption-inner copy-xs">
The White House infrastructure plan has $111 billion to improve water and sewer systems. (Shutterstock)
</p>
```
Use what you now know about identifying HTML, removing tags, and cleaning spacing to scrape a clean explanation from the body of this article. 

_Hint:_ Use your browser to inspect this website's HTML and identify any unique types and/or classes that enclose the explanation (and nothing else).

In [None]:
# Your solution here


Compare the output from this focused, site-specific scraping approach with that from the blacklist method above. <br/>
**Which method gives the cleaner output? Which method is more extensible?**

# Crawling broadly with `Scrapy` <a id='scrapy'> </a>

As a _Twisted_ application, Scrapy is event-driven, asynchronous, and is virtually multi-threaded (while using only one thread). While other programs cause _blocks_ when they access files or the web, spawn new processes, or do system operations, Scrapy instead waits until a resource is available, solves the immediate problem, and then calls another task. In short, Scrapy is fast, flexible, and scalable. It offers one of the most user-friendly ways to write crawling programs that can move across heterogeneous swaths of the internet, download stuff, and not break. 

To grasp the intuition behind Scrapy, imagine a bank where tellers (threads) are available to see customers (processes), who need to fill out forms before they're done. Such a situation could be configured in these ways:

- _Blocking_ operation with a _single_ thread: Here there is 1 teller trying to help 5 customers. When customer 1 needs time to fill out a form, then teller 1 is occupied waiting for customer 1--and all the other customers are stuck in line.
- _Blocking_ operation with _multiple_ threads: Now there are still 5 customers, but there are 3 tellers. When customer 1 needs time to fill out a form, then teller 1 is occupied. Customer 2 may have access to teller 2 and customer 3 to teller 3, but then all the tellers are monopolized while people fill out forms, which means customers 4 and 5 are still stuck waiting in line. 
- _Non-blocking_ operation with a _single_ thread: Here again we have 1 teller and 5 customers. When customer 1 needs time to fill out a form, they stand aside so the single teller can help customer 2. When customer 1 is finished, they wait until customer 2 is done or has something to do, then customer 1 is called back to continue being helped. If customers 1 and 2 both have forms to complete, they can do that on the side and the single teller can see customer 3, and so on. 

You can see this last situation is way more efficient than the previous two. This is the Scrapy default; when _multiple_ threads are available for a _non-blocking_ operation (like when multiple spiders work together), this is even better. 
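
The bank analogy can be simulated in a few lines. This sketch uses Python's built-in `asyncio` as a stand-in for Twisted (Scrapy's actual engine), so it illustrates the non-blocking idea rather than Scrapy's internals. The function names and the `FORM_TIME` value are invented for this example.

```python
import asyncio
import time

FORM_TIME = 0.05  # seconds each customer needs to fill out a form (arbitrary)

def serve_blocking(n_customers):
    """One teller who waits idly while each customer fills out a form."""
    for _ in range(n_customers):
        time.sleep(FORM_TIME)

async def serve_customer(customer):
    """The customer steps aside to fill out the form, freeing the teller."""
    await asyncio.sleep(FORM_TIME)
    return customer

async def serve_non_blocking(n_customers):
    """One teller who moves on to the next customer while forms are filled out."""
    return await asyncio.gather(*(serve_customer(c) for c in range(n_customers)))

start = time.perf_counter()
serve_blocking(5)
blocking_time = time.perf_counter() - start

start = time.perf_counter()
served = asyncio.run(serve_non_blocking(5))
non_blocking_time = time.perf_counter() - start

print(f"blocking: {blocking_time:.2f}s, non-blocking: {non_blocking_time:.2f}s")
```

Both versions use a single thread, yet the non-blocking one finishes in roughly the time of one form rather than five. Inside a Jupyter notebook, which already runs an event loop, replace `asyncio.run(serve_non_blocking(5))` with `await serve_non_blocking(5)`.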

## Create a project

Before you start scraping, you will have to set up a new Scrapy project. Enter a directory where you’d like to store your code and run:

```shell
$ scrapy startproject schools
```

This will create a `schools` directory with the following contents:

```shell
schools/
    scrapy.cfg            # deploy configuration file
    schools/              # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
```

## A simple (narrow) spider <a id='simple'> </a>

Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites). They must subclass `scrapy.Spider` and define the initial requests to make, optionally how to follow links in the pages, and how to parse the downloaded page content to extract data.

This is the code for our first Spider. Save it in a file named `simple_spider.py` under the `schools/spiders` directory in your project:

```python
import scrapy

class SimpleSpider(scrapy.Spider):
    name = "simple"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]  # e.g. '1' from '.../page/1/'
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)  # save the raw HTML bytes
        self.log('Saved file %s' % filename)
```

As you can see, our Spider subclasses `scrapy.Spider` and defines some attributes and methods:

- `name`: identifies the Spider. It must be unique within a project, that is, you can’t set the same name for different Spiders.

- `start_urls`: a list of URLs to provide the initial requests for the crawler. Armed with this list alone, the spider will download HTML from the webpages specified, much as a web browser does. But it won't extract anything from the pages--that's why we need to define the `parse()` method.

- `parse()`: a method that will be called to handle the response downloaded for each of the requests made. The response parameter is an instance of `TextResponse` that holds the page content and has further helpful methods to handle it.

The `parse()` method usually parses the response, extracting the scraped data as dicts and also finding new URLs to follow and creating new requests (`Request`) from them.

### How to run the spider

To put our spider to work, go to the project’s top level directory and run:

```shell
$ scrapy crawl simple
```

This command runs the spider named `simple` that we’ve just added, which will send some requests to the `quotes.toscrape.com` domain. You will get an output similar to this:

```shell
...
2019-03-19 15:58:49 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-03-19 15:58:49 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-03-19 15:58:49 [scrapy.core.engine] INFO: Spider opened
2019-03-19 15:58:49 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-03-19 15:58:49 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-03-19 15:58:49 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2019-03-19 15:58:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
2019-03-19 15:58:49 [simple] DEBUG: Saved file quotes-1.html
2019-03-19 15:58:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
2019-03-19 15:58:50 [simple] DEBUG: Saved file quotes-2.html
2019-03-19 15:58:50 [scrapy.core.engine] INFO: Closing spider (finished)
2019-03-19 15:58:50 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 678,
 'downloader/request_count': 3,
 ...}
2019-03-19 15:58:50 [scrapy.core.engine] INFO: Spider closed (finished)
```

Now, check the files in the current directory. You should notice that two new files have been created: `quotes-1.html` and `quotes-2.html`, with the content for the respective URLs, as our `parse` method instructs.
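
If the file names look mysterious, this short sketch replays the string handling from `parse()` on one of the start URLs, outside of Scrapy:

```python
url = 'http://quotes.toscrape.com/page/1/'

parts = url.split("/")
print(parts)  # the trailing slash leaves an empty string as the last piece

page = parts[-2]  # second-to-last piece is the page number
filename = 'quotes-%s.html' % page
print(filename)
```

Keep this in mind when you point the spider at URLs with a different shape: the second-to-last piece is not always a page number.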

How did this work? Scrapy schedules the `scrapy.Request` objects returned by the `start_requests` method of the Spider; we did not write this method ourselves, because the default implementation creates one request for each URL in `start_urls`. Upon receiving a response for each one, it instantiates `Response` objects and calls the callback method associated with the request (in this case, the `parse` method), passing the response as an argument.

### Challenge
Modify and run the spider script above to scrape this short list of `start_urls`: 
```python
['http://www.baylessk12.org/', 'https://crcc.doniphanr1.k12.mo.us/', 'https://www.hazelwoodschools.org/southeastmiddle']
 ```

## Extracting data
The best way to learn how to extract data with Scrapy is to try out selectors in the [Scrapy shell](https://docs.scrapy.org/en/latest/topics/shell.html#topics-shell). Remember to always enclose URLs in quotes (double quotes on Windows) when running Scrapy shell from the command line; otherwise, URLs containing arguments (i.e., the `&` character) will not work. Run:

```shell
$ scrapy shell 'http://quotes.toscrape.com'
```
You will see something like:

```shell
2019-03-19 20:00:05 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: tutorial)
2019-03-19 20:00:05 [scrapy.utils.log] INFO: Versions: lxml 4.3.1.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 17.9.0, Python 3.6.7 (default, Oct 22 2018, 11:32:17) - [GCC 8.2.0], pyOpenSSL 17.5.0 (OpenSSL 1.1.0g  2 Nov 2017), cryptography 2.1.4, Platform Linux-4.15.0-45-generic-x86_64-with-Ubuntu-18.04-bionic
2019-03-19 20:00:05 [scrapy.crawler] INFO: Overridden settings: {...}
2019-03-19 20:00:05 [scrapy.extensions.telnet] INFO: Telnet Password: 030319d194e7f6b0
2019-03-19 20:00:05 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage']
2019-03-19 20:00:05 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
...]
2019-03-19 20:00:05 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
...]
2019-03-19 20:00:05 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-03-19 20:00:05 [scrapy.core.engine] INFO: Spider opened
2019-03-19 20:00:05 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2019-03-19 20:00:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f07993c02b0>
... 
```

Using the shell, you can try selecting elements using CSS with the response object:
```shell
>>> response.css('title')
```
The result of running `response.css('title')` is a list-like object called SelectorList, which represents a list of Selector objects that wrap around XML/HTML elements and allow you to run further queries to fine-grain the selection or extract the data.

To extract the text from the title above, you can do:
```shell
>>> response.css('title::text').getall()
```

There are two things to note here: one is that we’ve added `::text` to the CSS query, to mean we want to select only the text elements directly inside `<title>` element. If we don’t specify `::text`, we’d get the full title element, including its tags:
```shell
>>> response.css('title').getall()
```
The other thing is that the result of calling `.getall()` is a list: it is possible that a selector returns more than one result, so we extract them all. When you know you just want the first result, as in this case, you can do:
```shell
>>> response.css('title::text').get()
```

Besides the `getall()` and `get()` methods, you can also use the `re()` method to extract using regular expressions:
```shell
>>> response.css('title::text').re(r'Quotes.*')
>>> response.css('title::text').re(r'Q\w+')
>>> response.css('title::text').re(r'(\w+) to (\w+)')
```
In order to find the proper CSS selectors to use, you can use your browser developer tools (e.g., in Chrome, right click > `Inspect`) to inspect the HTML and come up with a selector (for more info, see [Using your browser’s Developer Tools for scraping](https://docs.scrapy.org/en/latest/topics/developer-tools.html)). You can also try opening the response page from the shell in your web browser using `view(response)`.

### Challenge
Inspect [quotes.toscrape.com](http://quotes.toscrape.com) for the selectors associated with quotes. Use this information to display the text of one of the quotes in the scrapy shell. <br>
**Hint 1:** If you need help getting a better sense of website structure, use the HTML tree below (under "Extracting quotes and authors") as a visual guide.<br>
**Hint 2:** You can subset within selectors by using periods and spaces. For instance, the following produces a SelectorList for the class2 of each type2 within the class1 of each type1:
```shell
response.css('type1.class1 type2.class2')
```

Your solution here: 

```shell

```

### Extracting quotes and authors
Now that you know a bit about selection and extraction, let’s complete our spider by writing the code to extract the quotes from the web page.

Each quote in http://quotes.toscrape.com is represented by HTML elements that look like this:

```html
<div class="quote">
    <span class="text">“The world as we have created it is a process of our
    thinking. It cannot be changed without changing our thinking.”</span>
    <span>
        by <small class="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>
```

How do we extract the data we want? To start, we get a list of selectors for the quote HTML elements with:

```shell
>>> response.css("div.quote")
```

Each of the selectors returned by the query above allows us to run further queries over their sub-elements. Let’s assign the first selector to a variable, so that we can run our CSS selectors directly on a particular quote:
```shell
>>> quote = response.css("div.quote")[0]
```

Now, let’s extract the quote text (stored here in a variable named `title`), the author, and the tags from that quote using the `quote` object we just created:

```shell
>>> title = quote.css("span.text::text").get()
>>> title
>>> author = quote.css("small.author::text").get()
>>> author
```

Given that the tags are a list of strings, we can use the `.getall()` method to get all of them:
```shell
>>> tags = quote.css("div.tags a.tag::text").getall()
>>> tags
```

Having figured out how to extract each bit, we can now iterate over all the quotes elements and put them together into a Python dictionary. Copy and paste each of these subsequent lines into scrapy shell (or type them in via split-screen):
```shell
>>> for quote in response.css("div.quote"):
...     text = quote.css("span.text::text").get()
...     author = quote.css("small.author::text").get()
...     tags = quote.css("div.tags a.tag::text").getall()
...     print(dict(text=text, author=author, tags=tags))
>>>
```

### Extracting data in our spider
Let’s get back to our spider. So far it doesn’t extract any data; it just saves the whole HTML page to a local file. Let’s integrate the extraction logic above into our spider.

A Scrapy spider typically generates many dictionaries containing the data extracted from the page. To do that, we use the `yield` Python keyword in the callback, as you can see below:

```python
import scrapy


class SimpleSpider(scrapy.Spider):
    name = "simple"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
```

If you run this spider, it will print the extracted data along with the log messages:
```shell
2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'tags': ['life', 'love'], 'author': 'André Gide', 'text': '“It is better to be hated for what you are than to be loved for what you are not.”'}
2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'tags': ['edison', 'failure', 'inspirational', 'paraphrased'], 'author': 'Thomas A. Edison', 'text': "“I have not failed. I've just found 10,000 ways that won't work.”"}
```

### Storing the scraped data
The simplest way to store the scraped data is with Feed exports. Pass the spider's name (`simple`, from the `name` attribute above) and an output file:
```shell
$ scrapy crawl simple -o quotes.json
```

That will generate a `quotes.json` file containing all scraped items, serialized in JSON.

For historic reasons, the `-o` option appends to an existing file instead of overwriting its contents. If you run this command twice without removing the file in between, you’ll end up with a broken JSON file that cannot be parsed. (Recent versions of Scrapy also accept `-O`, which overwrites the file instead.)

You can also use other formats, like JSON Lines:
```shell
$ scrapy crawl simple -o quotes.jl
```

The JSON Lines format is useful because it’s stream-like: you can easily append new records to it, so it doesn’t break the way a JSON file does when you run the command twice. Also, since each record is on its own line, you can process big files without having to fit everything in memory.
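To see the difference concretely, here is a minimal standard-library sketch that imitates two runs appending to the same files. The record and file locations are made up for illustration, and the JSON written here only approximates what Scrapy's exporters produce.

```python
import json
import os
import tempfile

records = [{"author": "André Gide", "tags": ["life", "love"]}]

workdir = tempfile.mkdtemp()
json_path = os.path.join(workdir, "quotes.json")
jl_path = os.path.join(workdir, "quotes.jl")

# Simulate two runs that both append to the same output files.
for _ in range(2):
    with open(json_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(records))            # one JSON array per run
    with open(jl_path, "a", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")  # one JSON object per line

# Two arrays written back to back are not a valid JSON document.
try:
    with open(json_path, encoding="utf-8") as f:
        json.load(f)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False

# JSON Lines can be read one record at a time, however often it was appended to.
with open(jl_path, encoding="utf-8") as f:
    jl_records = [json.loads(line) for line in f if line.strip()]

print(json_ok, len(jl_records))  # False 2
```

Reading line by line is also what keeps memory use low on large files: only one record needs to be parsed at a time.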

In small projects (like the one in this tutorial), that should be enough. However, if you want to do more complex things with the scraped items, you can write an Item Pipeline. A placeholder file for Item Pipelines was set up for you when the project was created, in `schools/pipelines.py`. You don’t need to implement any item pipelines if you just want to store the scraped items.
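For a sense of what a pipeline looks like, here is a minimal sketch. An item pipeline is a plain Python class with a `process_item(self, item, spider)` method; the class name `CleanQuotePipeline` and the cleaning steps are hypothetical, chosen only to illustrate the idea.

```python
class CleanQuotePipeline:
    """Hypothetical pipeline: tidy up each scraped quote before it is stored."""

    def process_item(self, item, spider):
        # Scrapy calls this method once for every item a spider yields.
        text = item.get("text") or ""
        item["text"] = text.strip().strip("“”")             # drop whitespace and curly quotes
        item["tags"] = sorted(set(item.get("tags") or []))  # de-duplicate and sort tags
        return item  # pass the item on to the next pipeline or the feed export


# Outside Scrapy, the pipeline can be tried on a plain dict:
pipeline = CleanQuotePipeline()
cleaned = pipeline.process_item(
    {
        "text": " “I have not failed.” ",
        "author": "Thomas A. Edison",
        "tags": ["failure", "edison", "failure"],
    },
    spider=None,
)
print(cleaned)
```

To have Scrapy run a class like this, you would place it in `schools/pipelines.py` and enable it in `settings.py` through the `ITEM_PIPELINES` setting, for example `ITEM_PIPELINES = {"schools.pipelines.CleanQuotePipeline": 300}`; the number sets the order in which pipelines run.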

## Link extraction in a (broad) spider <a id='linkextraction'> </a>

Let’s say, instead of just scraping the stuff from the first two pages from http://quotes.toscrape.com, you want quotes from all the pages in the website. We could do this by adding more URLs to the `start_urls` field, but this sounds like work. We could also read them in via a text file, but that doesn't sound any easier.

The best way would be to find the URLs of within-domain links from the website itself, extract data from these pages, go one level still further down, and so on. We could do this by identifying and extracting links using their CSS selectors, and this would let us scrape all the quotes across all the pages across this website (there are steps for this in [the `Scrapy` tutorial](https://docs.scrapy.org/en/latest/intro/tutorial.html)). But CSS selectors are site-specific, so what about other websites? Could we use `Scrapy` to crawl all the links reliably across different kinds of websites?

Yes! In fact, this kind of _broad crawling_ is (arguably) where `Scrapy` shines the most. Crawlers typically mix _horizontal crawling_, moving between pages at the same hierarchical level (for example, from `school.com/page1` to `school.com/page2`), with _vertical crawling_, moving from a higher hierarchical level (for example, `school.com/main`) to a lower one (for example, `school.com/main/about_us`). `Scrapy` makes both easy by providing the `CrawlSpider` class, and the `genspider` command can generate a starter spider from its `crawl` template:

```shell
$ scrapy genspider -t crawl broad http://quotes.toscrape.com  # borrows from `crawl` spider template
Created spider 'broad' using template 'crawl' in module:
  schools.spiders.broad
```
Go ahead and execute this command in your terminal, and check out the resulting file in `schools/spiders/broad.py`. It should look like this:

```python
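from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BroadSpider(CrawlSpider):
    name = 'broad'
    # Domain names only (no scheme); if your generated file has a full URL here, trim it
    allowed_domains = ['quotes.toscrape.com']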
    start_urls = ['http://quotes.toscrape.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        ...
        return item
```
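It's worth explaining this a bit. `CrawlSpider` provides its own implementation of the `parse()` method, which uses the `rules` attribute to make two-direction (horizontal and vertical) crawling easy. This is also why the callback is named `parse_item` rather than `parse`: overriding `parse()` in a `CrawlSpider` would break the rule handling. In addition, `Scrapy` filters out duplicate requests by default, so a page that is linked from many places is only fetched once.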

By default, a `Rule` with no callback follows the links it extracts: it scans each target page for further links and follows those too. Once you set a callback, `follow` defaults to `False`, so the rule stops following links from the target pages. To keep following links, either return/yield the new requests from your callback method or set the `follow` argument of `Rule()` to `True` (which is what we will do).

How does this spider find links to follow? As their name implies, `LinkExtractor`s specialize in extracting links: by default, they look for the `href` attribute of `a` (and `area`) HTML tags. Pending requests are processed "last in, first out" by default, which makes the crawl roughly depth-first: the spider scrapes a page and its sublinks before visiting the next page at the same level.
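The three ideas above (link extraction, duplicate filtering, and last-in-first-out ordering) can be sketched with the standard library alone. This is a toy model over a made-up in-memory site, not Scrapy's actual implementation; `PAGES`, `HrefParser`, `extract_links`, and `crawl` are all hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Made-up in-memory "website": URL -> HTML (no network access involved).
PAGES = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/a1">A1</a> <a href="/">home</a>',
    "http://example.test/a1": "",
    "http://example.test/b": '<map><area href="/a1"></map>',
}


class HrefParser(HTMLParser):
    """Collect href values from <a> and <area> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag in ("a", "area"):
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_links(base_url, html):
    """Return absolute URLs for every link found in the page."""
    parser = HrefParser()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.hrefs]


def crawl(start_url):
    pending = [start_url]  # used as a stack: last in, first out
    seen = {start_url}     # duplicate filter: each URL is queued at most once
    visited = []
    while pending:
        url = pending.pop()
        visited.append(url)
        for link in extract_links(url, PAGES.get(url, "")):
            if link not in seen:
                seen.add(link)
                pending.append(link)
    return visited


order = crawl("http://example.test/")
print(order)
```

In this toy run, `/b` is queued last and therefore visited first, and its sublink `/a1` is visited before the crawler returns to `/a`. In a real Scrapy crawl, concurrency and request priorities make the order less tidy than this.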

In a moment, we will adapt this template to scrape from all the pages on our sample website. But first, we need to update our `items.py` file.

### Defining items

Within the project directory, there’s an `items.py` file. Items add structure to our scraping results and are used by spiders.

Here you can add class fields such as url, images, or locations. These fields can be filled by pipelines (a more advanced topic).

Add to this file the fields for text, author, and tags. We will use these for our quotes spider.

```python
from scrapy.item import Item, Field

class SchoolsItem(Item):
    text = Field()
    author = Field()
    tags = Field()
```

Item properties can then be set with the response we get from parsing. 

```python
from schools.items import SchoolsItem
...
def parse_item(self, response):
    item = SchoolsItem()      
    item['text'] = ...
    item['author'] = ...
    item['tags'] = ...
    
    yield item
```

### Challenge

Adapt the `CrawlSpider` in `broad.py` to scrape the text, author, and tags for each quote across all the pages on `http://quotes.toscrape.com`. Assign the `text`, `author`, and `tags` fields to Items, then yield the Items. Edit the spider script first, then run it via your terminal, then check the output to make sure it contains quotes from every page.

```python
# Your solution here
```


### Challenge

Use what you learned above about removing tags with a blacklist to rewrite the `parse_item()` method of the `BroadSpider` class so that it doesn't depend on website structure (HTML, CSS, XPath, etc.). In other words, write a truly broad crawler that only returns text.

Make sure to clean up the spacing: convert multiple newlines into a single one or a space, depending on the output format you want. 

_Hint:_ Check your output--is it missing anything important? Consider removing specific tags from the blacklist. 


```python
# Your solution here
```