# Scraping and crawling the web

This second day's workshop gives you practice scraping and crawling with modern Python tools.

To review from yesterday, web-scraping means “programmatically going over a collection of web pages and extracting data” and is a powerful tool for working with data on the web. Scraping has two core steps. First, you find web pages and download them via web requests (often called *web-crawling*). Then you extract and parse information from these pages (often called *web-scraping*). These two steps often happen together and recursively: you crawl some stuff, but upon scraping it you realize you got the wrong websites, so you go back to crawling, which changes your scraping approach, and so on.

### Scrapy has "batteries included"

You can build a scraper from scratch using low-level modules or libraries, but then you have to deal with some potential headaches as your scraper grows more complex. For example, you'll need to handle concurrency so you can crawl more than one page at a time. You'll probably want to figure out how to transform your scraped data into different formats like CSV, XML, or JSON. And you'll sometimes have to deal with sites that require specific settings and access patterns.

You'll have better luck if you build your scraper on top of an existing library that handles those issues for you. For this tutorial, we will build some intuition for web-scraping by working with low-level approaches, using the `Requests` and `BeautifulSoup` libraries to make requests and parse the result. Then we will build a scraper with *Scrapy*, which is one of the most popular, flexible, and powerful Python scraping libraries. Scrapy takes a "batteries included" approach to scraping, meaning that it handles a lot of the common functionality that all scrapers need. This prevents you from reinventing the wheel--or worse, the flat tire!

The focus here is on applying Scrapy. You can also read more about the [basics of scrapy](https://docs.scrapy.org/en/latest/intro/overview.html), its [architecture](https://docs.scrapy.org/en/latest/topics/architecture.html), or see [the FAQ](https://docs.scrapy.org/en/latest/faq.html). And if you need a refresher on scraping with Beautiful Soup, here's a [good tutorial](https://www.digitalocean.com/community/tutorials/how-to-scrape-web-pages-with-beautiful-soup-and-python-3).

### What kind of crawling?

A flexible tool like Scrapy can be used in many different ways depending on the task at hand.

What I call _narrow crawling_ means focusing on a limited set of pre-defined domains--that is, studying their HTML and CSS structures and exploiting these to extract specific information repeatedly. This maximizes precision in scraping while sacrificing _extensibility_: the ability to incorporate new domains or be resilient to changes in website structure. This is what people usually mean when they say "web-scraping". It may or may not expand beyond the initial set of websites, but it may crawl more websites within this set (_vertical crawling_).

What I call _broad crawling_ makes the opposite tradeoff, collecting information on a range of websites and promoting flexibility in its scraping algorithm (way of extracting website information) at the expense of generally less clean output. It may identify websites to scrape by google search (what do people click on most?), network analysis (what websites tend to link to one another?), or _link extraction_: finding all within-domain links on a given webpage, then all within-domain links on its children links, and so on to a specified depth. The messier output from broad crawling can present challenges for data cleaning and analysis (remember "garbage in, garbage out"?), but this depends on the application.
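
Link extraction, as described above, can be sketched with the standard library alone. This is only an illustration under stated assumptions: the three-page `example.com` site is made up and kept in a dictionary so the code runs offline, and the names `PAGES`, `LinkParser`, and `extract_links` are ours, not part of any library.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical three-page site held in memory so the sketch runs offline.
PAGES = {
    "http://example.com/": '<a href="/about">About</a> <a href="http://other.org/">Other</a>',
    "http://example.com/about": '<a href="/team">Team</a> <a href="/">Home</a>',
    "http://example.com/team": '<a href="/about">About</a>',
}


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(value for name, value in attrs if name == "href" and value)


def extract_links(start_url, max_depth, fetch=PAGES.get):
    """Breadth-first walk over within-domain links, up to max_depth hops."""
    domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        # Stop expanding at the depth limit or when a page is missing.
        if html is None or depth == max_depth:
            continue
        parser = LinkParser()
        parser.feed(html)
        for href in parser.hrefs:
            link = urljoin(url, href)  # resolve relative links against the current page
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return sorted(seen)


print(extract_links("http://example.com/", max_depth=1))
```

Raising `max_depth` widens the crawl one hop at a time, while the domain check keeps it from wandering onto `other.org`.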


## Outline

* [Narrow crawling with `Requests` and `BeautifulSoup`](#narrow)
    - [Making requests](#request)
    - [Parsing HTML](#parsing)
       - [Getting human-readable text](#readable)
* [Broad crawling with `Scrapy`](#scrapy)
    - [A simple Scrapy spider](#simple)
    - [Link extraction](#linkextraction)
* [Scrapy template: A Recursive Text Spider](#recursive)


## Vocabulary

* *narrow crawling (less extensible)*: 
    * Scraping a limited set of pre-defined domains: studying their HTML and CSS structures and exploiting these to extract specific information repeatedly. This maximizes precision in scraping while sacrificing extensibility (ability to incorporate new domains or changes in website structure). What people usually mean when they say "web-scraping". 
* *broad crawling (more extensible)*: 
    * Collecting information on a range of websites and promoting flexibility in its scraping algorithm (way of extracting website information) at the expense of generally less clean output. It may identify websites to scrape by google search (what do people click on most?), network analysis (what websites tend to link to one another?), or link extraction.
* *extensibility*:
    * Ability for a scraping approach to incorporate new domains or be resilient to changes in website structure. Generally higher for broad crawls than narrow crawls, at the expense of precision. 
* *link extraction*:
    * Finding all within-domain links on a given webpage, then all within-domain links on its children links, and so on to a specified depth. 
* *horizontal crawling*: 
    * Crawling on the same hierarchical level as the input domain, such as going from the first to the second page of google results.
* *vertical crawling*:
    * Crawling at a higher or lower level from the input domain, such as navigating to the "About Us" page directly linked from a home page. 

**__________________________________**


# Narrow crawling with `Requests` and `BeautifulSoup` <a id='narrow'> </a>

## Making requests<a id='request'></a>

The first step in web-scraping is getting the HTML of the website we want to scrape. The [requests](http://docs.python-requests.org/en/master/) library is the easiest way to do this in Python.

In [57]:
import requests

url = 'https://en.wikipedia.org/wiki/Canberra'

response = requests.get(url)

Great, it looks like everything worked! Let's see our beautiful HTML:

In [58]:
response

<Response [200]>

Huh, that's weird. Doesn't look like HTML to me.

What the `requests.get` function returned (and the thing in our `response` variable) was a Response object. It itself isn't the HTML that we wanted, but rather a collection of metadata about the request/response interaction between your computer and the Wikipedia server.

For example, it knows whether the response was successful or not (`response.ok`), how long the whole interaction took (`response.elapsed`), what time the request took place (`response.headers['Date']`) and a whole bunch of other metadata.

In [59]:
response.ok

True

In [60]:
response.headers['Date']

'Tue, 27 Apr 2021 13:33:49 GMT'

Of course, what we really care about is the HTML content. We can get that from the `Response` object with `response.text`. What we get back is a string of HTML, exactly the contents of the HTML file at the URL that we requested.

In [61]:
html = response.text
print(html[:1000])

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Canberra - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"6f2886de-da31-4c7e-bdcb-8de02a761b4f","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Canberra","wgTitle":"Canberra","wgCurRevisionId":1017188895,"wgRevisionId":1017188895,"wgArticleId":51983,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1: Julian–Gregorian uncertainty","Australian Statistical Geography Standard 2016 ID different from Wikidata","Australian Statistical Geography Standard 2016 ID same as Wikidata","Aus

### Challenge

Get the HTML for [the Wikipedia page about HTML](https://en.wikipedia.org/wiki/HTML). 
Print out the first 1000 characters and compare it to the HTML you see when you view the source HTML in your browser.

In [None]:
# your solution here

In [63]:
# solution
url = 'https://en.wikipedia.org/wiki/HTML'
response = requests.get(url)
html = response.text

html[:1000]

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>HTML - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"4f84a59d-603f-42f7-bf6e-c3e276d6cb7e","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"HTML","wgTitle":"HTML","wgCurRevisionId":1016919858,"wgRevisionId":1016919858,"wgArticleId":13191,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 errors: missing periodical","Webarchive template wayback links","Wikipedia pages semi-protected against vandalism","Articles with short description","Short description is different from 

# Parsing HTML<a id='parsing'></a>

The second step in web scraping is parsing HTML. This is where things can get a little tricky.

### Challenge

View the source HTML of [the page listing all departments](http://guide.berkeley.edu/courses/), and see if you can find the part of the HTML where the departments are listed. There's a lot of other stuff in the file that we don't care too much about. You could try `Ctrl-F`ing for the name of a department you can see on the webpage.







**Solution**

You should see something like this:


```
<div id="atozindex">
<h2 class="letternav-head" id='A'><a name='A'>A</a></h2>
<ul>
<li><a href="/courses/aerospc/">Aerospace Studies (AEROSPC)</a></li>
<li><a href="/courses/africam/">African American Studies (AFRICAM)</a></li>
<li><a href="/courses/a,resec/">Agricultural and Resource Economics (A,RESEC)</a></li>
<li><a href="/courses/amerstd/">American Studies (AMERSTD)</a></li>
<li><a href="/courses/ahma/">Ancient History and Mediterranean Archaeology (AHMA)</a></li>
<li><a href="/courses/anthro/">Anthropology (ANTHRO)</a></li>
<li><a href="/courses/ast/">Applied Science and Technology (AST)</a></li>
<li><a href="/courses/arabic/">Arabic (ARABIC)</a></li>
<li><a href="/courses/arch/">Architecture (ARCH)</a></li>
```

This is HTML. HTML uses "tags", code that surrounds the raw text and indicates the structure of the content. The tags are enclosed in `<` and `>` symbols. The `<li>` says "this is a new thing in a list" and `</li>` says "that's the end of that new thing in the list". Similarly, the `<a ...>` and the `</a>` say, "everything between us is a hyperlink". In this HTML file, each department is listed in a list with `<li>...</li>` and is also linked to its own page using `<a>...</a>`. In our browser, if we click on the name of the department, it takes us to that department's own page. The browser knows where to go because the `<a>...</a>` tag tells it what page to go to. You'll see inside the `<a>` bit, there's a `href=...`. That tells us the (relative) location of the page it's linked to.
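
Because those `href` values are relative, a scraper has to combine them with the address of the page they came from before it can request them. A minimal sketch using the standard library's `urljoin`, with two of the `href` values from the listing above (the variable names are ours):

```python
from urllib.parse import urljoin

page_url = "http://guide.berkeley.edu/courses/"          # the page we downloaded
hrefs = ["/courses/aerospc/", "/courses/africam/"]       # relative links found on it

# Resolve each relative link against the page URL, as a browser would.
absolute_urls = [urljoin(page_url, href) for href in hrefs]
print(absolute_urls)
```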

### Challenge

Look at the HTML source of [the page for the Aerospace Studies department](http://guide.berkeley.edu/courses/aerospc/), and try to find the part of the file where the information on each course is. Again, try searching for it using `Ctrl-F`.

**Solution**


```
<div class="courseblock">

<button class="btn_toggleCoursebody" aria-expanded="false" aria-controls="cb_aerospc1a" data-toggle="#cb_aerospc1a">

<a name="spanaerospc1aspanspanfoundationsoftheu.s.airforcespanspan1unitspan"></a>
<h3 class="courseblocktitle">
<span class="code">AEROSPC 1A</span> 
<span class="title">Foundations of the U.S. Air Force</span> 
<span class="hours">1 Unit</span>
</h3>
```

The content that we care about is enclosed within HTML tags. It looks like the course code is enclosed in a `span` tag, which has a `class` attribute with the value `"code"`. What we'll have to do is extract out the information we care about by specifying what tag it's enclosed in.

But first, we're going to need to get the HTML of the first page.
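
Since we will be downloading several pages, it can help to wrap the request-then-read pattern from the previous section in one function. A minimal sketch, assuming the name `get_html` and a 10-second timeout (both are our choices for illustration):

```python
import requests


def get_html(url, timeout=10):
    """Download a page and return its HTML as a string."""
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of silently returning an error page.
    response.raise_for_status()
    return response.text
```

Calling `get_html('https://en.wikipedia.org/wiki/HTML')` would then return the same string as `response.text` did earlier.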

### Challenge

Get the HTML content of `http://guide.berkeley.edu/courses/` and store it in a variable called `academic_guide_html`. You can fetch it with `requests.get(...).text` as before, or wrap that call in a small `get_html(url)` helper function.

Print the first 500 characters to see what we got back.

In [None]:
# your solution here

In [None]:
# solution
academic_guide_url = 'http://guide.berkeley.edu/courses/'
academic_guide_html = requests.get(academic_guide_url).text
print(academic_guide_html[:500])

Great, we've got the HTML contents of the Academic Guide site we want to scrape. Now we can parse it. ["Parsing"](https://en.wikipedia.org/wiki/Parsing) means to turn a string of data into a structured representation. When we're parsing HTML, we're taking the Python string and turning it into a tree. The Python package `BeautifulSoup` does all our HTML parsing for us. We give it our HTML as a string and it returns a parsed HTML tree. Here, we're also telling BeautifulSoup to use the `lxml` parser behind the scenes.

In [None]:
from bs4 import BeautifulSoup

academic_guide_soup = BeautifulSoup(academic_guide_html, 'lxml')

We said before that all the departments were listed on the Academic Guide page with links to their departmental page, where the actual courses are listed. So we can find all the departments by looking in our parsed HTML for all the links. Remember that the links are represented in the HTML with the `<a>...</a>` tag, so we ask our `academic_guide_soup` to find us all the tags called `a`. What we get back is a list of all the `a` elements in the HTML page.

In [None]:
links = academic_guide_soup.find_all('a')
# look at one arbitrary link element
links[48]

So now we have a list of `a` elements, each of which represents a link on the Academic Guide page. But there are other links on this page in addition to the ones we care about: for example, a link back to the UC Berkeley home page. How can we filter out all the links we don't care about?
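
One possible answer is to search only inside the element that wraps the department list, which in the source we viewed earlier is `<div id="atozindex">`. The sketch below runs on a shortened copy of that HTML so it works offline; the home-page link at the top is a made-up stand-in for the page's other links, and the variable names are ours.

```python
from bs4 import BeautifulSoup

# Shortened sample of the page: one unrelated link plus two department links.
sample_html = """
<a href="https://www.berkeley.edu/">UC Berkeley</a>
<div id="atozindex">
<ul>
<li><a href="/courses/aerospc/">Aerospace Studies (AEROSPC)</a></li>
<li><a href="/courses/africam/">African American Studies (AFRICAM)</a></li>
</ul>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# First narrow the search to the A-Z index, then look for links only there.
index = sample_soup.find("div", id="atozindex")
department_links = index.find_all("a")

departments = [(a.get_text(), a["href"]) for a in department_links]
print(departments)
```

On the real page the same two calls would be made on `academic_guide_soup`, assuming the `atozindex` container is still present in the current version of the site.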

### [Working with BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/)

Here's a basic introduction to BeautifulSoup, showing how to get text from a website.

In [None]:
# use the requests module to request data from the website
r = requests.get('http://www.k12northstar.org/chinook')
data = r.text  # the raw page source as one string: HTML plus any inline CSS and JavaScript

In [None]:
# build a soup object: a parsed, navigable version of the page's HTML
soup = BeautifulSoup(data, 'html.parser')
print(soup.prettify())  # the output is long; comment this line out to skip it

In [None]:
# example: find_all('a') finds every <a> tag in the HTML
# then we print the URL each link points to
for link in soup.find_all('a'):
    print(link.get('href'))

In [None]:
paragraph = soup.find_all('p')
# <p> tags hold the page's paragraphs of text,
# so we extract them from the soup object

In [None]:
paragraph  # check what is being extracted

In [None]:
# get the plain text inside each <p> tag
for p in paragraph:
    print(p.text)

### Challenge

Use BeautifulSoup to get all the paragraphs from reddit.com. 

In [None]:
# solution
r  = requests.get('http://www.reddit.com')
data = r.text
soup = BeautifulSoup(data, 'html.parser')
paragraph = soup.find_all('p')
for p in paragraph:
    print(p.text)

## Get human-readable text from HTML<a id='readable'></a>

Tags offer good information for NLP...sometimes. 
Other times, they offer extraneous characters, words, and in some cases entire scripts. 

Cases for including tags:
- Identifying sections via headers
- Identifying URL links (a href...)
- Identifying images (img tags)

Cases where tags overcomplicate things:
- Paragraphs (floating p tags)
- Scripts (nonsense javascript characters)
- Styles (inline style properties)

Reference: https://stackoverflow.com/questions/12959308/remove-all-inline-styles-using-beautifulsoup

#### Methods to remove tags

*Simplest way*: `get_text()`

- returns all the text in a document or beneath a tag, as a single Unicode string

In [38]:
markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, 'html.parser')
soup.get_text()

'\nI linked to example.com\n'

You can also get the text beneath one specific tag:

In [39]:
soup.i.get_text()

'example.com'

You can tell Beautiful Soup to strip whitespace from the beginning and end of each bit of text:

In [26]:
soup.get_text(strip=True)

'I linked toexample.com'
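
Notice that stripping glued "to" and "example.com" together. `get_text()` also accepts a separator string as its first argument, which is placed between the bits of text. A short sketch on the same markup:

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, 'html.parser')

# Strip each bit of text, then join the non-empty bits with a single space.
text = soup.get_text(" ", strip=True)
print(text)
```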

*Remove and return tags from soup:* `extract()`

- Useful if we want to keep the contents of a tag and analyze them further, such as links or images

In [52]:
# Example:
markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, 'html.parser')
i_tag = soup.i.extract()

print(f"Extracted tag: \n{i_tag}")
print()
print(f'Cleaned soup: \n{soup}')
print()
print(f'Text from cleaned soup:{soup.get_text()}')

Extracted tag: 
<i>example.com</i>

Cleaned soup: 
<a href="http://example.com/">
I linked to 
</a>

Text from cleaned soup:
I linked to 



*Obliterate tags:* `decompose()`

- Useful for tags we don't want, such as scripts
- Better memory management by destroying the tag, as opposed to simply extracting to nothing

In [42]:
# Example:
markup = BeautifulSoup('<p>some <i>italicized</i> text</p>', 'html.parser')
markup.i.decompose()
markup

<p>some  text</p>

Sample usage of `decompose()` to remove script and style tags:

In [None]:
for s in soup(["script", "style"]):
    s.decompose()

#### [Removing attributes](https://stackoverflow.com/questions/12959308/remove-all-inline-styles-using-beautifulsoup)

Sample code:

In [None]:
whitelist = ["href"]  # keep only these attributes on every tag
for tag in soup.find_all(True):
    for attribute in [attribute for attribute in tag.attrs if attribute not in whitelist]:
        del tag[attribute]

#### Challenge

Use BeautifulSoup to clean the following HTML and return its text. 

Keep the `href` and `src` attributes and strip every other attribute. Completely remove the `script`, `style`, `meta`, and `noscript` tags along with their contents. Leave all other tags and their text in place.

([source](https://stackoverflow.com/questions/30565404/remove-all-style-scripts-and-html-tags-from-an-html-page))

In [53]:
markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
attribute_whitelist = ["href", "src"]
tag_remove_list = ["script", "style", "meta", "noscript"]

In [55]:
# Solution
soup = BeautifulSoup(markup, "html.parser")

for s in soup(tag_remove_list):
    s.decompose()

for tag in soup.find_all(True):
    for attribute in [attribute for attribute in tag.attrs if attribute not in attribute_whitelist]:
        del tag[attribute]

print(f'Cleaned soup: \n{soup}')
print()
print(f'Text from cleaned soup: {soup.get_text()}')

Cleaned soup: 
<a href="http://example.com/">
I linked to <i>example.com</i>
</a>

Text from cleaned soup: 
I linked to example.com
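
The challenge markup happens to contain none of the tags or attributes being removed, so the cleaning steps leave it unchanged. To see them take effect, here is the same approach applied to a richer snippet. The markup is made up for illustration, and the variable names are ours.

```python
from bs4 import BeautifulSoup

# Hypothetical page with a style block, a script, and extra attributes.
messy_markup = (
    '<html><head><style>p {color: red;}</style>'
    '<script>alert("hi");</script></head>'
    '<body><p class="intro" style="color: red;">Read the '
    '<a href="http://example.com/" target="_blank">docs</a>.</p>'
    '<img src="logo.png" width="40"/></body></html>'
)
keep_attributes = ["href", "src"]
drop_tags = ["script", "style", "meta", "noscript"]

messy_soup = BeautifulSoup(messy_markup, "html.parser")

# Remove unwanted tags together with everything inside them.
for tag in messy_soup(drop_tags):
    tag.decompose()

# On every remaining tag, keep only the whitelisted attributes.
for tag in messy_soup.find_all(True):
    tag.attrs = {name: value for name, value in tag.attrs.items()
                 if name in keep_attributes}

print(messy_soup)
print(messy_soup.get_text())
```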



# Broad crawling with `Scrapy` <a id='scrapy'> </a>

To learn about crawling with Scrapy, we will explore [quotes.toscrape.com](http://quotes.toscrape.com), a scraping-friendly website that lists quotes from famous authors. By the end of this section, you’ll have a fully functional Python web scraper that walks through a series of pages and extracts data from each page. The scraper will be easily expandable so you can tinker around with it and use it as a foundation for your own projects scraping data from the web.

As a _Twisted_ application, Scrapy is event-driven and asynchronous: it handles many requests concurrently while using only one thread. While other programs _block_ when they access files or the web, spawn new processes, or do system operations, Scrapy sets a waiting task aside, works on whatever is ready, and returns to the task once its resource is available. In short, Scrapy is fast, flexible, and scalable. It offers one of the most user-friendly ways to write crawling programs that can move across heterogeneous swaths of the internet, download stuff, and not break.

To grasp the intuition behind Scrapy, imagine a bank where tellers (threads) are available to see customers (processes), who need to fill out forms before they're done. Such a situation could be configured in these ways:

- _Blocking_ operation with a _single_ thread: Here there is 1 teller trying to help 5 customers. When customer 1 needs time to fill out a form, then teller 1 is occupied waiting for customer 1--and all the other customers are stuck in line.
- _Blocking_ operation with _multiple_ threads: Now there are still 5 customers, but there are 3 tellers. When customer 1 needs time to fill out a form, then teller 1 is occupied. Customer 2 may have access to teller 2 and customer 3 to teller 3, but then all the tellers are monopolized while people fill out forms, which means customers 4 and 5 are still stuck waiting in line. 
- _Non-blocking_ operation with a _single_ thread: Here again we have 1 teller and 5 customers. When customer 1 needs time to fill out a form, they stand aside so the single teller can help customer 2. When customer 1 is finished, they wait until customer 2 is done or has something to do, then customer 1 is called back to continue being helped. If customers 1 and 2 both have forms to complete, they can do that on the side and the single teller can see customer 3, and so on. 

You can see this last situation is way more efficient than the previous two. This is the Scrapy default; when _multiple_ threads are available for a _non-blocking_ operation (like when multiple spiders work together), this is even better. 
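
The difference between the first and third configurations can be felt in a few lines of code. Note the swap: Scrapy is built on Twisted, but this sketch uses Python's standard `asyncio` as a stand-in because it needs no installation. Each "customer" spends 0.05 seconds on a form (a placeholder for waiting on a web response), and all function names are ours.

```python
import asyncio
import time


async def serve_customer(name, form_seconds):
    """One customer: waiting on the form does not occupy the teller."""
    await asyncio.sleep(form_seconds)  # stands in for waiting on a web response
    return name


async def one_at_a_time(customers, form_seconds):
    # Blocking style: finish each customer before starting the next.
    return [await serve_customer(c, form_seconds) for c in customers]


async def interleaved(customers, form_seconds):
    # Non-blocking style: start everyone, then collect results as forms come back.
    return await asyncio.gather(*(serve_customer(c, form_seconds) for c in customers))


def timed(coroutine):
    """Run a coroutine to completion and report how long it took."""
    start = time.perf_counter()
    result = asyncio.run(coroutine)
    return result, time.perf_counter() - start


customers = [f"customer {i}" for i in range(1, 6)]
blocking_result, blocking_time = timed(one_at_a_time(customers, 0.05))
nonblocking_result, nonblocking_time = timed(interleaved(customers, 0.05))

print(f"one at a time: {blocking_time:.2f}s, interleaved: {nonblocking_time:.2f}s")
```

Both versions run on a single thread and serve the same five customers; only the interleaved one overlaps their waiting time.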

## Create a project

Before you start scraping, you will have to set up a new Scrapy project. Enter a directory where you’d like to store your code and run:

```shell
$ scrapy startproject tutorial
```

This will create a tutorial directory with the following contents:

```
tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
```

## A simple Scrapy spider <a id='simple'> </a>

Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites). They must subclass `scrapy.Spider` and define the initial requests to make, optionally how to follow links in the pages, and how to parse the downloaded page content to extract data.

This is the code for our first Spider. Save it in a file named `quotes_spider.py` under the `tutorial/spiders` directory in your project:

In [None]:
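import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        # The URL ends in ".../page/<n>/", so the second-to-last piece is the page number
        page = response.url.split("/")[-2]
        filename = f'quotes-{page}.html'
        # response.body is bytes, so open the file in binary mode
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')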

As you can see, our Spider subclasses `scrapy.Spider` and defines some attributes and methods:

- `name`: identifies the Spider. It must be unique within a project; that is, you can’t set the same name for different Spiders.

- `start_urls`: a list of URLs to provide the initial requests for the crawler. Armed with this list alone, the spider will download HTML from the webpages specified, much as a web browser does. But it won't extract anything from the pages--that's why we need to define the `parse()` method.

- `parse()`: a method that will be called to handle the response downloaded for each of the requests made. The response parameter is an instance of `TextResponse` that holds the page content and has further helpful methods to handle it.

The `parse()` method usually parses the response, extracting the scraped data as dicts and also finding new URLs to follow and creating new requests (`Request`) from them.

### How to run the spider

To put our spider to work, go to the project’s top level directory and run:

```shell
$ scrapy crawl quotes
```

This command runs the spider named `quotes` that we’ve just added, which sends some requests to the `quotes.toscrape.com` domain. You will get output similar to this:

```shell
...
2019-03-19 15:58:49 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-03-19 15:58:49 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-03-19 15:58:49 [scrapy.core.engine] INFO: Spider opened
2019-03-19 15:58:49 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-03-19 15:58:49 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-03-19 15:58:49 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2019-03-19 15:58:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
2019-03-19 15:58:49 [quotes] DEBUG: Saved file quotes-1.html
2019-03-19 15:58:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
2019-03-19 15:58:50 [quotes] DEBUG: Saved file quotes-2.html
2019-03-19 15:58:50 [scrapy.core.engine] INFO: Closing spider (finished)
2019-03-19 15:58:50 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 678,
 'downloader/request_count': 3,
 ...}
2019-03-19 15:58:50 [scrapy.core.engine] INFO: Spider closed (finished)
```

Now, check the files in the current directory. You should notice that two new files have been created: `quotes-1.html` and `quotes-2.html`, with the content for the respective URLs, as our `parse` method instructs.

How did this work? Scrapy creates a `scrapy.Request` object for each URL in `start_urls` (this is what the Spider's default `start_requests` method does) and schedules them. Upon receiving a response for each one, it instantiates a `Response` object and calls the callback method associated with the request (in this case, the `parse` method), passing the response as an argument.

### Challenge
Modify and run the spider script above to scrape this short list of `start_urls`: 
```python
['http://brickset.com/sets/year-2016']
 ```

## Extracting data
The best way to learn how to extract data with Scrapy is to try out selectors in the [Scrapy shell](https://docs.scrapy.org/en/latest/topics/shell.html#topics-shell). Remember to always enclose URLs in quotes when running the Scrapy shell from the command line (use double quotes on Windows); otherwise URLs containing arguments (e.g., the `&` character) will not work. Run:

```shell
$ scrapy shell 'http://quotes.toscrape.com'
```
You will see something like:

```shell
2019-03-19 20:00:05 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: tutorial)
2019-03-19 20:00:05 [scrapy.utils.log] INFO: Versions: lxml 4.3.1.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 17.9.0, Python 3.6.7 (default, Oct 22 2018, 11:32:17) - [GCC 8.2.0], pyOpenSSL 17.5.0 (OpenSSL 1.1.0g  2 Nov 2017), cryptography 2.1.4, Platform Linux-4.15.0-45-generic-x86_64-with-Ubuntu-18.04-bionic
2019-03-19 20:00:05 [scrapy.crawler] INFO: Overridden settings: {...}
2019-03-19 20:00:05 [scrapy.extensions.telnet] INFO: Telnet Password: 030319d194e7f6b0
2019-03-19 20:00:05 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage']
2019-03-19 20:00:05 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
...]
2019-03-19 20:00:05 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
...]
2019-03-19 20:00:05 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-03-19 20:00:05 [scrapy.core.engine] INFO: Spider opened
2019-03-19 20:00:05 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2019-03-19 20:00:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f07993c02b0>
... 
```

Using the shell, you can try selecting elements using CSS with the response object:
```shell
>>> response.css('title')
```
The result of running `response.css('title')` is a list-like object called SelectorList, which represents a list of Selector objects that wrap around XML/HTML elements and allow you to run further queries to fine-grain the selection or extract the data.

To extract the text from the title above, you can do:
```shell
>>> response.css('title::text').getall()
```

There are two things to note here: one is that we’ve added `::text` to the CSS query, to mean we want to select only the text elements directly inside `<title>` element. If we don’t specify `::text`, we’d get the full title element, including its tags:
```shell
>>> response.css('title').getall()
```
The other thing is that the result of calling `.getall()` is a list: it is possible that a selector returns more than one result, so we extract them all. When you know you just want the first result, as in this case, you can do:
```shell
>>> response.css('title::text').get()
```

Besides the `getall()` and `get()` methods, you can also use the `re()` method to extract using regular expressions:
```shell
>>> response.css('title::text').re(r'Quotes.*')
>>> response.css('title::text').re(r'Q\w+')
>>> response.css('title::text').re(r'(\w+) to (\w+)')
```
In order to find the proper CSS selectors to use, you can use your browser developer tools (e.g., in Chrome, right click > `Inspect`) to inspect the HTML and come up with a selector (for more info, see [Using your browser’s Developer Tools for scraping](https://docs.scrapy.org/en/latest/topics/developer-tools.html)). You can also try opening the response page from the shell in your web browser using `view(response)`.

### Challenge
Inspect [quotes.toscrape.com](http://quotes.toscrape.com) for the selectors associated with quotes. Use this information to display the text of one of the quotes in the Scrapy shell. <br>
**Hint 1:** If you need help getting a better sense of website structure, use the HTML tree below as a visual guide.<br>
**Hint 2:** You can subset within selectors by using periods and spaces. For instance, the following produces a SelectorList for the class2 of each type2 within the class1 of each type1:
```shell
response.css('type1.class1 type2.class2')
```

### Defining items

Within the project directory just created, there’s an `items.py` file. Items add structure to our scraping results and are used by spiders.

Here you can add class fields such as URLs, images, or locations. These fields can also be filled by pipelines (a more advanced topic).

In [None]:
from scrapy.item import Item, Field

class PropertiesItem(Item):
    # Primary fields
    title = Field()
    price = Field()
    image_urls = Field()  # set by the spider's parse() method

### Writing spiders

Spiders implement the scraping process defined in the previous sub-chapter, “The fundamental scraping process”.

You can create a spider from a template using:

```shell
$ scrapy genspider SPIDER_NAME web
```

Spiders inherit from `scrapy.Spider`. Note the special method `parse(self, response)`.
The `response` object is the same object we found in the Scrapy shell.

Within the parse method, we can define items.

Item properties can then be set with the response we get from parsing. 

In [None]:
def parse(self, response):
    item = PropertiesItem()
    # Collect the src attribute of the first element marked itemprop="image"
    item['image_urls'] = response.xpath(
         '//*[@itemprop="image"][1]/@src').getall()
    return item

Finally, to parse the pages and create these items, run the spider with `scrapy crawl basic` in the project directory.

### Extracting quotes and authors
Now that you know a bit about selection and extraction, let’s complete our spider by writing the code to extract the quotes from the web page.

Each quote in http://quotes.toscrape.com is represented by HTML elements that look like this:

```html
<div class="quote">
    <span class="text">“The world as we have created it is a process of our
    thinking. It cannot be changed without changing our thinking.”</span>
    <span>
        by <small class="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>
```

How do we extract the data we want? To start, we get a list of selectors for the quote HTML elements with:

```shell
>>> response.css("div.quote")
```

Each of the selectors returned by the query above allows us to run further queries over their sub-elements. Let’s assign the first selector to a variable, so that we can run our CSS selectors directly on a particular quote:
```shell
>>> quote = response.css("div.quote")[0]
```

Now, let’s extract the quote text (stored here as `title`) and the author from that quote using the `quote` object we just created:

```shell
>>> title = quote.css("span.text::text").get()
>>> title
>>> author = quote.css("small.author::text").get()
>>> author
```

Given that the tags are a list of strings, we can use the .getall() method to get all of them:
```shell
>>> tags = quote.css("div.tags a.tag::text").getall()
>>> tags
```

Having figured out how to extract each bit, we can now iterate over all the quote elements and put them together into a Python dictionary. Copy and paste each of these subsequent lines into the Scrapy shell:
```shell
>>> for quote in response.css("div.quote"):
...     text = quote.css("span.text::text").get()
...     author = quote.css("small.author::text").get()
...     tags = quote.css("div.tags a.tag::text").getall()
...     print(dict(text=text, author=author, tags=tags))
>>>
```

### Extracting data in our spider
Let’s get back to our spider. So far, it doesn’t extract any data in particular; it just saves the whole HTML page to a local file. Let’s integrate the extraction logic above into our spider.

A Scrapy spider typically generates many dictionaries containing the data extracted from the page. To do that, we use the `yield` Python keyword in the callback, as you can see below:

In [None]:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

If you run this spider, it will output the extracted data with the log:
```shell
2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'tags': ['life', 'love'], 'author': 'André Gide', 'text': '“It is better to be hated for what you are than to be loved for what you are not.”'}
2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'tags': ['edison', 'failure', 'inspirational', 'paraphrased'], 'author': 'Thomas A. Edison', 'text': "“I have not failed. I've just found 10,000 ways that won't work.”"}
```

## Storing the scraped data
The simplest way to store the scraped data is by using Feed exports, with the following command:
```shell
$ scrapy crawl quotes -o quotes.json
```

That will generate a `quotes.json` file containing all scraped items, serialized in JSON.

For historic reasons, `-o` appends to a given file instead of overwriting its contents. If you run this command twice without removing the file before the second run, you’ll end up with a broken JSON file that cannot be read. (Scrapy 2.4 and later also accept `-O`, with a capital O, to overwrite the file instead.)

You can also use other formats, like JSON Lines:
```shell
$ scrapy crawl quotes -o quotes.jl
```

The JSON Lines format is useful because it’s stream-like: you can easily append new records to it. It doesn’t have the same problem as JSON when you run the command twice. Also, as each record is a separate line, you can process big files without having to fit everything in memory.
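The difference between the two formats can be checked with the standard `json` module alone. The sketch below does not run Scrapy; it simply builds the text that two appended runs would leave behind, using two made-up items, and tries to read it back.

```python
import json

# Hypothetical scraped items, standing in for real spider output
items = [
    {"author": "Author A", "text": "First quote"},
    {"author": "Author B", "text": "Second quote"},
]

# JSON: each run writes one array, so two appended runs give "[...][...]"
json_twice = json.dumps(items) + json.dumps(items)
try:
    json.loads(json_twice)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False  # two arrays back to back are not valid JSON

# JSON Lines: one object per line, so appended runs stay readable
jl_once = "".join(json.dumps(item) + "\n" for item in items)
jl_twice = jl_once + jl_once
records = [json.loads(line) for line in jl_twice.splitlines()]

print("appended JSON parses:", json_ok)
print("records read from appended JSON Lines:", len(records))
```

Reading line by line like this is also what lets you process a large `.jl` file without loading it all at once.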

In small projects (like the one in this tutorial), that should be enough. However, if you want to perform more complex things with the scraped items, you can write an Item Pipeline. A placeholder file for Item Pipelines was set up for you when the project was created, in `tutorial/pipelines.py`. You don’t need to implement any item pipelines if you just want to store the scraped items.

## Following links
Let’s say, instead of just scraping the stuff from the first two pages from http://quotes.toscrape.com, you want quotes from all the pages in the website.

Now that you know how to extract data from pages, let’s see how to follow links from them.

First thing is to extract the link to the page we want to follow. Examining our page, we can see there is a link to the next page with the following markup:

```html
<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
    </li>
</ul>
```

We can try extracting it in the shell:
```shell
>>> response.css('li.next a').get()
```

This gets the anchor element, but we want the attribute href. For that, Scrapy supports a CSS extension that lets you select the attribute contents, like this:
```shell
>>> response.css('li.next a::attr(href)').get()
```
There is also an attrib property available (see [Selecting element attributes](https://docs.scrapy.org/en/latest/topics/selectors.html#selecting-attributes) for more):
```shell
>>> response.css('li.next a').attrib['href']
```
Let’s modify our spider to recursively follow the link to the next page and extract data from it:

In [None]:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
            
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

Now, after extracting the data, the `parse()` method looks for the link to the next page and yields a new request to the next page, registering itself as callback to handle the data extraction for the next page and to keep the crawling going through all the pages.

What you see here is Scrapy’s mechanism of following links: when you yield a `Request` in a callback method (as `response.follow` does), Scrapy will schedule that request to be sent and register a callback method to be executed when that request finishes.

Using this, you can build complex crawlers that follow links according to rules you define, and extract different kinds of data depending on the page it’s visiting.

In our example, it creates a sort of loop, following all the links to the next page until it doesn’t find one – handy for crawling blogs, forums and other sites with pagination. Note that even if pages refer to one another, we don’t need to worry about visiting a given page multiple times. By default, Scrapy filters out duplicated requests to URLs already visited, avoiding the problem of hitting servers too much because of a programming mistake. In other words, you don't need to hard-code duplicate page handling yourself--this is one example of the built-in functionality of scrapy that saves you a lot of work.
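The loop-with-deduplication idea can be sketched without Scrapy. The code below is only an illustration of the logic, not Scrapy's scheduler: `NEXT_LINKS` is a made-up three-page site whose last page links back to the first, and `urljoin` plays the role of resolving a relative `href` against the current page URL.

```python
from urllib.parse import urljoin

# Hypothetical site: each page URL maps to the href of its "next" link
NEXT_LINKS = {
    "http://quotes.toscrape.com/page/1/": "/page/2/",
    "http://quotes.toscrape.com/page/2/": "/page/3/",
    "http://quotes.toscrape.com/page/3/": "/page/1/",  # points back to page 1
}


def crawl(start_url):
    """Follow "next" links, skipping any URL that was already requested."""
    seen = set()
    pending = [start_url]
    visited = []
    while pending:
        url = pending.pop()
        if url in seen:
            continue  # duplicate request: drop it
        seen.add(url)
        visited.append(url)
        href = NEXT_LINKS.get(url)
        if href is not None:
            # Resolve the relative link against the current page
            pending.append(urljoin(url, href))
    return visited


pages = crawl("http://quotes.toscrape.com/page/1/")
print(pages)
```

Even though page 3 links back to page 1, each page is visited once and the loop ends.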

## Using spider arguments
Rather than hard-coding file handling into your spider (as did the simple spider above), you can give your scrapy spiders a filename following the `-o` command line argument. You can also provide your spiders with other arguments by using the `-a` option when running them, such as indicating which tag you want to scrape:
```shell
$ scrapy crawl quotes -o quotes-love.json -a tag=love
```
These arguments are passed to the Spider’s `__init__` method and become spider attributes by default. The spider does nothing with such an attribute on its own: you need to write the code that uses it.

In this example, the value provided for the `tag` argument will be available via `self.tag`. You can use this to make your spider fetch only quotes with a specific tag, building the URL based on the argument:

In [None]:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

If you pass the `tag=love` argument to this spider, you’ll notice that it will only visit URLs from the love tag, such as http://quotes.toscrape.com/tag/love.

You can learn more about handling spider arguments [here](https://docs.scrapy.org/en/latest/topics/spiders.html#spiderargs).
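To see the attribute lookup in isolation, here is a small sketch that runs without Scrapy. `FakeSpider` is a hypothetical stand-in class, not part of Scrapy; it only mimics the way keyword arguments become instance attributes and reuses the URL-building logic from the spider above.

```python
class FakeSpider:
    """Hypothetical stand-in for a spider; not a Scrapy class."""

    def __init__(self, **kwargs):
        # Mimic how `-a name=value` arguments end up as attributes
        self.__dict__.update(kwargs)

    def start_url(self):
        url = 'http://quotes.toscrape.com/'
        # getattr with a default avoids an AttributeError when no tag is given
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        return url


print(FakeSpider().start_url())
print(FakeSpider(tag='love').start_url())
```

Using `getattr(self, 'tag', None)` is what lets the same spider run both with and without the `-a tag=...` argument.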

### Challenges
- **Easy:** Use the spider you just created to scrape quotes with another tag, such as 'inspirational' or 'books'. Examine the output.
- **NOT so easy:** Complete the following `author` spider so it extracts the name, birthdate, and description for each author and stores the resulting dict in a CSV file. Remember to save the script to the `spiders` folder so Scrapy knows where to look when you call it on the command-line. (Note that Scrapy will by default only scrape each author's page once.) <br> **Hints:** How does the `author` spider know what to extract from each page? Notice what's missing from the `parse_author()` method that the previous spiders have. Use the `extract_with_css()` method in your answer.

In [None]:
import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'

    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # follow links to author pages
        for href in response.css('.author + a::attr(href)'):
            yield response.follow(href, self.parse_author)

        # follow pagination links
        for href in response.css('li.next a::attr(href)'):
            yield response.follow(href, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).get(default='').strip()


## Link Extraction <a id='linkextraction'> </a>

You can add more URL strings to your spider’s `start_urls` attribute, or you can read a text file as a URL source.

We could also use URLs defined within the pages we scrape.

We can extract the next links using XPath or other selectors.

### Two-direction crawling with a spider

We can crawl both horizontally and vertically in the same parse method, using `Request`.

This can be done manually with two calls to `Request`. However, since this is done commonly, there is a crawling template we can use.

```shell
$ scrapy genspider -t crawl SPIDER_NAME web # Creates the “crawl” template.
```

In [None]:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class EasySpider(CrawlSpider):
    name = 'NAME OF SPIDER'
    allowed_domains = ['web']
    start_urls = ['http://www.web/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        #...
        pass

We can replace the default rule with two rules, one for horizontal and one for vertical crawling.

In [None]:
rules = (
    Rule(LinkExtractor(restrict_xpaths='//*[contains(@class,"next")]')),
    Rule(LinkExtractor(restrict_xpaths='//*[@itemprop="url"]'),
         callback='parse_item')
        )

`LinkExtractor()` by default looks at the `href` attribute of `a` and `area` tags.

What does the callback argument do in a rule? If no callback is set, the rule simply follows the links it extracts (`follow` defaults to True). Once you set a callback, `follow` defaults to False, so you have to pass `follow=True` (or yield requests from your callback) if you still want links on those pages to be followed.

Behind the scenes Scrapy uses last in, first out for processing requests (depth first crawl). This means we visit a page and all of its sub-links, before visiting the next page. Furthermore, Scrapy avoids duplicate requests. These default behaviors can be changed.
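The effect of the queue discipline on crawl order can be shown with a toy example. This is a sketch of the general idea using `collections.deque`, not Scrapy's scheduler; `LINKS` is a made-up link graph with hypothetical page names.

```python
from collections import deque

# Hypothetical site structure: page -> links found on that page
LINKS = {
    "home": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1"],
    "a1": [], "a2": [], "b1": [],
}


def crawl_order(start, lifo=True):
    """Return the visiting order for a LIFO (stack) or FIFO (queue) of requests."""
    pending = deque([start])
    seen = {start}  # avoid duplicate requests
    order = []
    while pending:
        # LIFO takes the newest request, FIFO takes the oldest
        page = pending.pop() if lifo else pending.popleft()
        order.append(page)
        for link in LINKS[page]:
            if link not in seen:
                seen.add(link)
                pending.append(link)
    return order


depth_first = crawl_order("home", lifo=True)
breadth_first = crawl_order("home", lifo=False)
print(depth_first)
print(breadth_first)
```

With last in, first out, the crawl finishes one branch (`b`, then `b1`) before touching the other; with first in, first out, it visits every page at one level before going deeper.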

### Configuring LinkExtractor

In [None]:
# CrawlSpider, LinkExtractor, Rule
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

- CrawlSpider, LinkExtractor, and Rule are classes in Scrapy.
- CrawlSpider is a derived class of Spider, with more methods and functions.
   - CrawlSpider follows links depending on given Rules.
   - A plain Spider only crawls the URLs in start_urls (plus any requests you yield yourself), while CrawlSpider lets you define rules: a convenient mechanism for following the links found on crawled pages and continuing the crawl.
- LinkExtractor is used to extract links.
   - Links extracted by a LinkExtractor are represented as scrapy.link.Link objects
      - the url attribute is the most useful (maybe also text)
- Rule represents the rules for crawling.
   - Rules are set by creating a list of rules as a static class property.
   - The list contains one or more Rule objects. Each Rule defines a specific behavior for crawling the site.
   - If multiple Rules match the same link, the first one will be used, according to the order in which they are defined in this list.
- CrawlSpider also provides a callback function: parse_start_url(response)
   - This method is called for the responses to the start_urls requests. It parses these initial responses and must return an Item object, a Request object, or an iterable containing any of them. It’s useful because the CrawlSpider does not pass the start URL responses to your rule callbacks by default.
   - Since CrawlSpider uses the "parse" method to implement its logic, if you override the "parse" method, CrawlSpider will fail.
- The extract_links(response) method of a LinkExtractor returns a list of links
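The "first matching rule wins" behavior can be pictured with plain regular expressions. The sketch below is an illustration only and does not use Scrapy's `Rule` class; the patterns and the callback names `parse_author` and `parse_page` are hypothetical.

```python
import re

# (pattern, callback name) pairs, in the order they would appear in `rules`
rules = [
    (re.compile(r'/author/'), 'parse_author'),
    (re.compile(r'/'), 'parse_page'),  # catch-all: matches every URL
]


def callback_for(url):
    """Return the callback of the first rule whose pattern matches the URL."""
    for pattern, callback in rules:
        if pattern.search(url):
            return callback
    return None


author_cb = callback_for('http://quotes.toscrape.com/author/Albert-Einstein')
page_cb = callback_for('http://quotes.toscrape.com/page/2/')
print(author_cb, page_cb)
```

The author URL also matches the catch-all pattern, but the more specific rule is listed first, so it is the one that applies. This is why rule order matters.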

### [LinkExtractor](https://docs.scrapy.org/en/latest/topics/link-extractors.html)

```python
class scrapy.linkextractors.LinkExtractor
```

LinkExtractor is an object that extracts links that will be followed from a crawled webpage (scrapy.http.Response).

```python
class scrapy.linkextractors.LinkExtractor(
    allow = (),
    deny = (),
    allow_domains = (),
    deny_domains = (),
    deny_extensions = None,
    restrict_xpaths = (),
    tags = ('a','area'),
    attrs = ('href',),
    canonicalize = False,
    unique = True,
    process_value = None
)
```

The main LinkExtractor parameters:
- allow: a regular expression (or list of regular expressions) that a URL must match to be extracted. If empty, every link matches.
- deny: a regular expression (or list of regular expressions) that excludes any URL matching it. It takes precedence over allow.
- allow_domains: the domains whose links will be extracted.
- deny_domains: the domains whose links will not be extracted.
- restrict_xpaths: XPath expressions that define the regions of the page where links are extracted from; used together with allow.
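How `allow` and `deny` interact can be sketched with the standard `re` module. This is a simplified illustration of the filtering logic, not `LinkExtractor`'s actual code; the function `keep_link` and the `example.org` URLs are hypothetical.

```python
import re


def keep_link(url, allow=(), deny=()):
    """Keep a URL if it matches some allow pattern (or allow is empty)
    and matches no deny pattern."""
    if allow and not any(re.search(pattern, url) for pattern in allow):
        return False
    if any(re.search(pattern, url) for pattern in deny):
        return False  # deny takes precedence over allow
    return True


allow = [r'^http://www\.example\.org/']  # stay under the root URL
deny = [r'calendar']                     # skip calendar pages

links = [
    'http://www.example.org/about',
    'http://www.example.org/calendar/2020',
    'http://other.example.com/about',
]
kept = [url for url in links if keep_link(url, allow, deny)]
print(kept)
```

Only the first link survives: the second matches `deny`, and the third does not match `allow`.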

How to set these parameters for our generic web-crawler:
- allow: possible alternative to allowed_domains; use a regex so that only links starting with the root URL are followed
- deny: use this to exclude anything with “calendar” in the URL
- canonicalize = False: setting this to True helps with duplicate checking, but may make link following less robust. Let’s use False for now. Later on we can check the number of pages successfully scraped, and compare the effectiveness of setting this to True vs. False
- unique = True: filter duplicates to avoid scraping something already scraped
- follow = True: this is an argument of Rule rather than LinkExtractor. It defaults to True only when no callback is set, so let’s be explicit that we want to follow each link
- callback: also an argument of Rule. Recursive scraping (gathering links from child URLs, their children, etc.) comes from follow = True, not from the callback. Don’t use self.parse here, because CrawlSpider relies on parse internally; point to a separate method instead

### [Rule - Crawling rules](https://docs.scrapy.org/en/latest/topics/spiders.html#crawling-rules)

Several parameters are used to create Rules.
```python
class scrapy.spiders.Rule(
    link_extractor,
    callback=None,
    cb_kwargs=None,
    follow=None,
    process_links=None,
    process_request=None,
    errback=None
)
```

- link_extractor: a Link Extractor object. It defines how to extract links from crawled web pages.
- callback: a callable or a string (in which case the spider method with that name is used). It is called for each response to a link extracted by link_extractor. The callback receives the response as its first argument and must return an Item, a Request, or an iterable of them.
- cb_kwargs: a dictionary containing the keyword arguments passed to the callback function.
- follow: a boolean value that specifies whether links should be followed from each response extracted with this rule. If callback is None, follow defaults to True; otherwise it defaults to False.
- process_links: a callable or a string (in which case the spider method with that name is used). It is called with each list of links obtained from link_extractor. This is mainly used for filtering.
- process_request: a callable or a string (in which case the spider method with that name is used). It is called for every request extracted by this rule and must return a request or None. Used to filter requests.
- errback: a callable or a string that is called if any exception is raised while processing a request generated by the rule.

```python
from scrapy.spiders.crawl import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor

class DoubanSpider(CrawlSpider):
    name = "test"
    allowed_domains = ["example.com"]
    start_urls = ['https://example.com/uselinks']

    rules = (
        # Raw string so that \d reaches the regex engine unchanged
        Rule(LinkExtractor(allow=(r'subject/\d+/$',)),
             callback='parse_items'),
    )

    def parse_items(self, response):
        # Yield items or return items
        pass
```

- Scrapy requests the start_urls and gets the responses
- The LinkExtractor matches its allow patterns against the links in each response to get the URLs to follow
- Scrapy requests each of these URLs and hands the response to the method named by callback
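To check which links the `allow` pattern in this spider would pick up, you can test the regular expression on its own. The sketch below uses only the standard `re` module; the candidate URLs are made up for illustration.

```python
import re

# Same pattern as in the Rule above
allow = re.compile(r'subject/\d+/$')

# Hypothetical links that might appear on the start page
candidate_links = [
    'https://example.com/subject/12345/',
    'https://example.com/subject/12345/reviews',
    'https://example.com/about/',
]

matched = [url for url in candidate_links if allow.search(url)]
print(matched)
```

Because of the trailing `$`, only URLs that end right after the numeric id and slash are matched, so the `reviews` sub-page is left out.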

## Scrapy template: A Recursive Text Spider <a id='recursive'> </a>

In [6]:
# Import libraries
import tldextract
import csv
from bs4 import BeautifulSoup # BS reads and parses even poorly/unreliably coded HTML 
from bs4.element import Comment # helps with detecting inline/junk tags when parsing with BS
import lxml # fast bs4 parser
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider
from scrapy.exceptions import NotSupported

# The following are required for parsing File text
import os
from tempfile import NamedTemporaryFile
import textract
from itertools import chain
import re
import regex # third-party module; get_text() uses its recursive patterns
from urllib.parse import urlparse
import requests

In [None]:
# Define inline tags for cleaning out HTML
inline_tags = ["b", "big", "i", "small", "tt", "abbr", "acronym", "cite", "dfn", "kbd", 
               "samp", "var", "bdo", "map", "object", "q", "span", "sub", "sup", "head", 
               "title", "[document]", "script", "style", "meta", "noscript"]

In [None]:
class RecursiveTextSpider(CrawlSpider):
    name = 'textspider'
    rules = [
        Rule(
            LinkExtractor(
                canonicalize=False,
                unique=True
            ),
            follow=True,
            callback="parse_items"
        )
    ]
    
    
    def __init__(self, domain_list=None, *args, **kwargs):
        """
        Overrides default constructor to set custom
        instance attributes.
        
        Parameters:
        - domain_list: csv or tsv format
            List of entities containing string domains and unique identifiers.
            
        Attributes:
        
        - start_urls:
            Used by scrapy.spiders.Spider. A list of URLs where the
            spider will begin to crawl.

        - allowed_domains:
            Used by scrapy.spiders.Spider. An optional list of
            strings containing domains that this spider is allowed
            to crawl.

        - domain_to_id:
            A custom attribute used to map a string domain to
            a number representing the unique id defined by
            csv_input.
        """
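        # Fill in start_urls, allowed_domains, and domain_to_id before calling the
        # parent constructor: CrawlSpider compiles self.rules inside __init__,
        # so rules assigned afterwards would be ignored.
        self.start_urls = []
        self.allowed_domains = []
        self.domain_to_id = {}
        self.init_from_domain_list(domain_list)
        # Restrict link extraction to the domains read from the domain list
        self.rules = (Rule(LinkExtractor(allow_domains = self.allowed_domains), follow=True, callback="parse_items"),)
        super(RecursiveTextSpider, self).__init__(*args, **kwargs)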
    
    
    # note: make sure we ignore robots.txt
    # Method for parsing items
    def parse_items(self, response):
        
        item = CharterItem()
        item['url'] = response.url
        item['text'] = self.get_text(response)
        domain = self.get_domain(response.url)    

        item['unique_id'] = self.domain_to_id[domain]
        item['depth'] = response.request.meta['depth'] # uses DepthMiddleware
        print("Depth: ", item['depth'])

        yield item    
        
        
    def init_from_domain_list(self, domain_list):
        """
        Generates this spider's instance attributes
        from the input domain list, formatted as a CSV or TSV.
        
        Domain List's format:
        1. The first row is meta data that is ignored.
        2. Every other row has two elements: a unique id and a URL.
        ex: row == ['3.70014E+11', 'http://www.charlottesecondary.org/'].
        
        Note: start_requests() isn't used since it doesn't work
        well with CrawlSpider Rules.
        
        Args:
            domain_list: Is the path string to this file.
        Returns:
            Nothing is returned. However, start_urls,
            allowed_domains, and domain_to_id are initialized.
        """
        if not domain_list:
            return
        with open(domain_list, 'r') as f:
            delim = "," if "csv" in domain_list else "\t"
            reader = csv.reader(f, delimiter=delim, quoting=csv.QUOTE_NONE)
            first_row = True
            for raw_row in reader:
                if first_row:
                    first_row = False
                    continue
                
                unique_id, url = raw_row

                domain = self.get_domain(url, True)
                # set instance attributes
                self.start_urls.append(url)
                self.allowed_domains.append(domain)
                # note: float('3.70014E+11') == 370014000000.0
                self.domain_to_id[domain] = float(unique_id)

    
    def get_domain(self, url, init = False):
        """
        Given the url, gets the top level domain using the
        tldextract library.
        
        Args:
            init (Boolean): True if this function is called while initializing the Spider, else False
        Ex:
        >>> get_domain('http://www.charlottesecondary.org/')
        charlottesecondary.org
        >>> get_domain('https://www.socratesacademy.us/our-school')
        socratesacademy.us
        """
        extracted = tldextract.extract(url)
        permissive_domain = f'{extracted.domain}.{extracted.suffix}' # gets top level domain: very permissive crawling
        #specific_domain = re.sub(r'https?\:\/\/', '', url) # full URL without http
        specific_domain = re.sub(r'https?\:\/\/w{0,3}\.?', '', url) # full URL without http and www. to compare w/ permissive
        print("Permissive:", permissive_domain)
        print("Specific:", specific_domain)
        top_level = len(specific_domain.replace("/", "")) == len(permissive_domain) # compare specific and permissive domain
        
        if init: # Check if this is the initialization period for the Spider.
            if top_level:
                return permissive_domain
            else:
                return specific_domain
        
        # secondary round
        if permissive_domain in self.allowed_domains:
            return permissive_domain
        
        #implement dictionary for if specific domain is used in original allowed_domains; key is specific_domain?
        
    
    def get_text(self, response):
        """
        Gets the readable text from a website's body and filters it.
        Ex:
        if response.body == "\u00a0OUR \tSCHOOL\t\t\tPARENTSACADEMICSSUPPORT \u200b\u200bOur Mission"
        >>> get_text(response)
        'OUR SCHOOL PARENTSACADEMICSSUPPORT Our Mission'
        
        For another example, see filter_text_ex.txt
        
        More options for cleaning HTML: 
        https://stackoverflow.com/questions/699468/remove-html-tags-not-on-an-allowed-list-from-a-python-string/812785#812785
        Especially consider: `from lxml.html.clean import clean_html`
        """
        # Load HTML into BeautifulSoup, extract text
        soup = BeautifulSoup(response.body, 'html5lib') # slower but more accurate parser for messy HTML # lxml faster
        # Remove the inline and non-visible tags listed in inline_tags from the soup
        for tag in soup(inline_tags):
            tag.decompose()
        # Extract the remaining text
        visible_text = soup.get_text(strip = False) # get text from each chunk, leave unicode spacing (e.g., `\xa0`) for now to avoid globbing words
        
        # Drop non-ASCII characters (such as "\u00a0")
        filtered_text = visible_text.encode('ascii', 'ignore').decode('ascii')
        
        # Remove ad junk
        filtered_text = re.sub(r'\b\S*pic\.twitter\.com\/\S*', '', filtered_text) 
        filtered_text = re.sub(r'\b\S*cnxps\.cmd\.push\(.+\)\;', '', filtered_text) 
        # Replace all consecutive spaces (including in unicode), tabs, or "|"s with a single space
        filtered_text = regex.sub(r"[ \t\h\|]+", " ", filtered_text)
        # Replace any consecutive linebreaks with a single newline
        filtered_text = regex.sub(r"[\n\r\f\v]+", "\n", filtered_text)
        # Remove json strings: https://stackoverflow.com/questions/21994677/find-json-strings-in-a-string
        filtered_text = regex.sub(r"{(?:[^{}]*|(?R))*}", " ", filtered_text)

        # Remove white spaces at beginning and end of string; return
        return filtered_text.strip()