# Introduction to Web Scraping using Scrapy

## Author

* Jonathan Graves (<jonathan.graves@ubc.ca>)

## Prerequisites

* A PC or tablet 
* Firefox



# What is Web Scraping?

When we talk about web scraping, we usually refer to the use of automated tools and software that collect information from websites.  This occurs in four main ways:

1) Directly from the web-pages themselves, by accessing the pages and then parsing them for information
2) Using a provided **API** or Application Programming Interface, which allows for direct access to information from a webpage
3) Manually, by copying and pasting material from web-pages
4) Via screen capture, which is then usually parsed by a program for requested information

In this workshop, we will primarily be discussing the first type of web-scraping, which is the most general.

## Legal and Ethical Issues

It is **very important** to mention that web scraping can raise both legal and ethical problems.  Legally, the current case law is unsettled, but web-scraping can be illegal in situations including, but not limited to, cases where:

1) It accesses confidential, private, or protected information - even if that information is public online.
2) The access to a website or process is prohibited by the website's owner or the terms and conditions
3) The scraping poses an "undue burden" on the website, or hinders the use by other individuals
4) It harms the owner of a website or the users of that website in a specific manner

It's also important to mention that even in situations where web-scraping is not _illegal_ it may still be unethical; for example, if someone asks you not to, but you do anyways.  

### Best Practices

In general, we recommend that you should always follow these best practices when scraping websites:

1)  Check the terms and conditions to ensure that they permit scraping or spiders
2)  Run a scraper in a "friendly" manner, limiting the speed and scope as much as feasible
3)  **Always** obey a `robots.txt` file, which is a file websites use to tell scrapers which pages they may not access
4)  **Never** scrape or collect information which is private or not intended to be made public
5)  If an API is available for a website, use the API - don't scrape.
6)  If you have any doubt, don't do it - consult a lawyer!

I am not a lawyer, this is not legal advice, and I make no warranties for any liability or damages you may incur by using web-scraping.
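As a minimal sketch of the `robots.txt` check, Python's standard library includes a parser for these files. The rules below are a made-up example parsed from a string (so nothing is downloaded), and the `my-scraper` user agent and `example.com` URLs are placeholders:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from memory so no network access is needed
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a reusable rule checker."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask whether our (placeholder) user agent may fetch each URL
print(parser.can_fetch("my-scraper", "https://example.com/page/1/"))
print(parser.can_fetch("my-scraper", "https://example.com/private/x"))

# The requested pause between requests, in seconds (None if not specified)
print(parser.crawl_delay("my-scraper"))
```

For a live site you would point the parser at the site's real `robots.txt` with `set_url(...)` and `read()` instead of parsing a string.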

# Markup

Modern websites are written in a variety of languages, but fundamentally rely on _markup_  to display and render content.  Some examples are **HTML** and **XML**.

* Markup languages organize content so that information about the content is stored separately from the content itself.  For example, the properties of a piece of text (e.g. font, color, bolding) are stored separately from the text itself
* This is usually done using **tags** which define the type of content, which have **attributes** that describe them

The entire structure of a document can be organized using these tags, providing an easy way for browsers to understand how to display the content.
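To make tags and attributes concrete, here is a small sketch using Python's built-in `html.parser`. The HTML snippet and the `TagLister` class are invented for illustration; the parser reports each tag with its attributes separately from the text it wraps:

```python
from html.parser import HTMLParser

# Hypothetical snippet: a paragraph tag with two attributes and a nested bold tag
SAMPLE_HTML = '<p class="body-text" id="intro">Hello, <b>world</b>!</p>'


class TagLister(HTMLParser):
    """Collect (tag, attributes) pairs and text content separately."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.tags.append((tag, dict(attrs)))

    def handle_data(self, data):
        self.text.append(data)


lister = TagLister()
lister.feed(SAMPLE_HTML)
lister.close()

print(lister.tags)           # the markup: tags and their attributes
print("".join(lister.text))  # the content: just the text
```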

# Scraping: DOMs

Modern web-scraping tools, like the `scrapy` package we will discuss today, use the **DOM** or _document object model_ of a website to target their data collection.

* Essentially, when you access a website, it will return an HTML document which contains the content with markup
* In order to render that content and allow for scripts (like JavaScript), your browser builds an internal model of the document: this is the DOM
* The DOM organizes the content and controls how it is displayed, and how events like user interactions (such as clicking) change that display
* The structure of a DOM has been standardized, so that people can write code and websites that will work across many applications

You don't need to know a ton about this - but basically modern web scrapers work by identifying what parts of the DOM contain their desired content, then collecting them.
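The tree shape of a DOM can be sketched with the standard library's `xml.etree.ElementTree`. This is only an approximation of what a browser builds, and it assumes a small, well-formed page (the `PAGE` string and the `outline` helper are made up for this example; real-world HTML is often messier and needs an HTML-aware parser):

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page used only for illustration
PAGE = """
<html>
  <body>
    <h1>Quotes</h1>
    <div class="quote">
      <span class="text">A sample quote.</span>
    </div>
  </body>
</html>
"""


def outline(element, depth=0):
    """Return one indented line per element, mirroring the tree structure."""
    label = element.tag
    if "class" in element.attrib:
        label += "." + element.attrib["class"]
    lines = ["  " * depth + label]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(PAGE.strip())
print("\n".join(outline(root)))
```

Each level of indentation in the printed outline corresponds to one level of nesting in the document, which is the structure a scraper navigates.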

##  CSS and XPath

Web scrapers identify these elements in two ways: **CSS** (_cascading style sheets_) and **XPath** (_XML Path Language_).

* With CSS selectors, you match elements of the DOM (like a paragraph) by their tag and their _attributes_, such as a class or an id
  * For example, you may want to collect all paragraph elements with the attribute `class="body-text"`
* In XPath, you specify the elements of the DOM desired along an _axis_, then filter them using _predicates_ to choose the key elements
  * For example, you may want to collect all `paragraphs` which are children of the `body` with `class = body-text`
  
In general, XPath is preferred due to the flexibility of the tool, but there are cases where CSS is simpler or better.  Happily, you can mix-and-match the two methods by layering them - so, whatever works!
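As a rough illustration of axes and predicates, the standard library's `ElementTree` understands a limited subset of XPath (Scrapy's own `response.xpath` supports much more). The page below is invented, and the CSS selectors in the comments are the approximate equivalents:

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page used only for illustration
PAGE = """<html><body>
<p class="body-text">First paragraph.</p>
<p class="footer">Not wanted.</p>
<div><p class="body-text">Nested paragraph.</p></div>
</body></html>"""

root = ET.fromstring(PAGE)

# Child axis plus predicate: <p> elements directly under <body> with the class
# (roughly the CSS selector "body > p.body-text")
direct = root.findall("./body/p[@class='body-text']")

# Descendant axis plus the same predicate: matching <p> elements at any depth
# (roughly the CSS selector "p.body-text")
anywhere = root.findall(".//p[@class='body-text']")

print([p.text for p in direct])
print([p.text for p in anywhere])
```

One difference worth knowing: the XPath predicate `@class='body-text'` compares the whole attribute value, while the CSS selector `.body-text` also matches elements that carry additional classes.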

# Scraping Workflow

In general, one of the hardest parts of web-scraping is not writing the scraper at all: it's identifying what to scrape and how to scrape it.

1)  Start by identifying the web pages you want to scrape - a good scraper is _well targeted_
2)  Look at the structure of the web pages you want to scrape - how are they related to one another?
    * You will particularly need to identify how the scraper will proceed from one page to the next; is there a clear path?
3)  Identify what content you want to collect on the website in question and how it exists in the DOM
4)  Determine the most efficient way to collect and parse that content with your scraper
    * Can you divide the program into "sub-problems" which are easier to solve, or find a general solution?
5)  Write the code to implement your scraping approach
6)  Test and validate the results

Notice that most of the steps involve _no coding_ but a deep consideration of the website in question.

# Step 2: Inspecting a Website

In order to identify the elements necessary, we need to inspect our website.  We're going to use the example from:

<https://quotes.toscrape.com/>

* _toscrape.com_ is an excellent resource to learn and test scrapers, since it offers a wide range of formats and common "challenges" you can encounter

When we click on the website, we will see the page.  But how do we find the elements of the DOM needed?  

* Right-click the page, then select "Inspect" which will open the inspector window.  
* This looks very complex, but it's actually not that bad - it's showing you the DOM
* If you click on the "inspect" icon, you can observe how the visual layout relates to the DOM

## CSS and XPath Selectors

You can use this view to inspect and test different selections.

* If you click on "highlight" it will show you how specific selectors will behave on the page
* If you right-click an element, you can copy XPath and CSS selectors

This allows you to build a "plan of attack" for the different elements we will need.

# Workshop Example: Pulling Quotes

Let's suppose we want to extract all of the quote text from this page.  We need to identify an appropriate selector that will collect those elements.

* As we can see, the CSS selector `.quote span.text` will collect these elements from the page

Note that you often want to select the enclosing element first (here, each `.quote` block) and then extract fields from inside it, so that related pieces such as a quote's text and its author stay together.
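Here is a small sketch of why selection order matters for preserving relationships. It uses `ElementTree` on an invented page in which one quote has no author (on the real site every quote has one; the gap is an assumption to make the problem visible):

```python
import xml.etree.ElementTree as ET

# Hypothetical page: the second quote is missing its author
PAGE = """<body>
<div class="quote"><span class="text">Quote one.</span><small class="author">Author A</small></div>
<div class="quote"><span class="text">Quote two.</span></div>
<div class="quote"><span class="text">Quote three.</span><small class="author">Author C</small></div>
</body>"""
root = ET.fromstring(PAGE)

# Flat selection: two independent lists that no longer line up (3 texts, 2 authors)
texts = [e.text for e in root.findall(".//span[@class='text']")]
authors = [e.text for e in root.findall(".//small[@class='author']")]

# Scoped selection: pick each quote block first, then look inside it
records = []
for quote in root.findall(".//div[@class='quote']"):
    records.append({
        "text": quote.findtext("span[@class='text']"),
        "author": quote.findtext("small[@class='author']"),  # None when missing
    })

print(len(texts), len(authors))
print(records[1])
```

With the flat lists, pairing texts and authors by position would attach "Author C" to the second quote; the scoped version keeps each record intact.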


# Getting Started with Scrapy

Our next task is to get started with `scrapy`, which is an excellent, easy-to-use scraping tool: <https://docs.scrapy.org>

* Scrapy works using _projects_ which are collections of scraping code.  A Scrapy project consists of:
    1) A collection of _spiders_ which crawl the web and collect elements
    2) A set of _items_ which define the objects captured by the spider
    3) A collection of _pipelines_ which handle and parse the items
    4) _middleware_, _configs_, and _settings_ which describe how the spider and pipelines behave

* For most projects, the defaults or pre-defined options for _items_, _middleware_, and _pipelines_ are sufficient

The main things you need to do are create the _spider_ files and tweak the settings.
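As an example of the kind of settings you might tweak, the fragment below shows a few Scrapy settings related to "friendly" scraping. The setting names are Scrapy's; the values, the bot name, and the contact address are illustrative assumptions rather than recommendations for every site:

```python
# settings.py (fragment); values here are illustrative assumptions

# Respect each site's robots.txt rules
ROBOTSTXT_OBEY = True

# Wait between consecutive requests to the same site (seconds)
DOWNLOAD_DELAY = 1.0

# Limit how many requests run in parallel against one domain
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Let Scrapy adjust the delay based on how quickly the server responds
AUTOTHROTTLE_ENABLED = True

# Identify your scraper; the name and contact address are placeholders
USER_AGENT = "quotes-workshop-bot (+mailto:you@example.com)"
```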

## Installing Scrapy

You can install Scrapy by opening a terminal and running:

```pip install scrapy```

If you're on `conda` you can use:

```conda install -c conda-forge scrapy```

This will install Scrapy on your system.

# Setting Up A Scrapy Project

You want to make sure Scrapy sets up a project properly.  Fortunately, it has a tool for this: `startproject`

1)  Create a new directory where you want your project to live
2)  In the terminal, run `scrapy startproject myproject [project_dir]`
    * `myproject` is the name of your project
    * `project_dir` is optional, but is the name of the folder you want it to be stored in
    
That's it!  You will see a bunch of default files are created.  Ignore them for now.

## Generating a New Spider

You can create a spider from scratch, but it's easier to use a template, since spiders inherit a lot of code from Scrapy's spider class.  To do this, open a terminal in the project directory and run:

```scrapy genspider <name> <domain or URL>```

For example, in our case we would use:

```scrapy genspider quotespider https://quotes.toscrape.com/page/1/```

This will create a blank spider with many options pre-set for us.  We can now edit the spider and set it up!



# Testing Selectors Using the Shell

Before we set up a spider, we need to check that it will collect items properly.  We can do this using the **scrapy shell** - which is basically a spider's `parse` method running in interactive mode.  To do this, start a terminal then run:

```scrapy shell <website>```

This will start an interactive session which is basically the DOM accessor of the spider.   We can then test selectors.  Try the following:

* `response.css(".quote")`
* `response.css(".quote").getall()`
* `response.css("div.quote span.text").getall()`
* `response.css("div.quote span.text::text").getall()`

Notice that many of these are _lists_ which we can loop over.

## XPath and CSS

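You can also chain selectors.  Note the leading `.` in the XPath, which keeps the search inside each selected `div.quote`; a bare `//span` would search the whole page again:

```response.css("div.quote").xpath('.//span/text()').getall()```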

You can choose whatever makes things easy.

# Adding a Selector to a Spider

We can then basically perform a loop to collect the required elements under the `parse` method:

```python
def parse(self, response):
    # Each div.quote holds one quote; yield one item (a dict) per quote
    for quote in response.css("div.quote"):
        yield {
            "text": quote.css("span.text::text").get(),
        }
```

# Running Your Spider

At this point, we can now run our spider.  Everything that is `yield`-ed from the parse method is sent to the console.  We would prefer that it was stored.

* The best format for generic scraped data is JSON, which is natively supported in Scrapy output

```scrapy crawl <spider> -O quotes.json```

This will generate (overwriting any existing file) a JSON file containing all of our items.  Let's try it from the console.
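Once the crawl has finished, the output can be read back with Python's `json` module. The sketch below uses a made-up string shaped like the file our spider would produce (a JSON array with one object per item), so the quote texts are placeholders:

```python
import json

# Hypothetical contents of quotes.json, shaped like the items our spider yields
SAMPLE_OUTPUT = """[
{"text": "First sample quote."},
{"text": "Second sample quote."}
]"""

quotes = json.loads(SAMPLE_OUTPUT)
for item in quotes:
    print(item["text"])

# With a real file on disk you would instead use:
# with open("quotes.json", encoding="utf-8") as f:
#     quotes = json.load(f)
```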

# Be More Spider-y: Following Links

You can have your spider crawl the web by following links.  In order to do this, it must be capable of:

1) Identifying which links to follow
2) Understanding how to parse the linked pages once it has followed them

In our example, all of the Quotes pages are identical, meaning we can use the same parser for each of them - but you might need to write a custom one for different types of pages.

Let's start by finding the element that has the next page URL using the inspector

## Next Page

In our case, it's `li.next a`.  We can get the HREF from this using the attribute selector:

```response.css("li.next a").attrib['href']```

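Try it interactively in the console to check.  We can then pass it to `response.follow`, which resolves the relative link against the current page's URL (the same job `urljoin` does) and requests the next page: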

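```python
# Inside parse(): look up the "Next" link; .get() returns None on the last page
next_page = response.css("li.next a::attr(href)").get()

if next_page is not None:
    # response.follow accepts relative URLs and reuses parse() for the new page
    yield response.follow(next_page, callback=self.parse)
```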
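To see what resolving a relative link involves, here is a small standalone sketch using the standard library's `urljoin`. The `next_url` helper is hypothetical (it is not part of Scrapy), and the `href` value is assumed to be a site-relative path like the one on the quotes pages:

```python
from urllib.parse import urljoin


def next_url(current_url, href):
    """Return the absolute URL to follow, or None when there is no next link."""
    if href is None:
        return None
    return urljoin(current_url, href)


current = "https://quotes.toscrape.com/page/1/"

# A relative href is combined with the page it was found on
print(next_url(current, "/page/2/"))  # https://quotes.toscrape.com/page/2/

# On the last page there is no "Next" link, so there is nothing to follow
print(next_url(current, None))  # None
```

The `None` check mirrors the `if next_page is not None` test in the spider: it is what stops the crawl on the final page.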

# Comments

There are many other examples and patterns on the Scrapy website, which are pretty easy to learn and play around with.

* Try to add code which extracts the author and the tags for a quote!
* Try to modify the _follow_ code to parse odd-numbered pages differently from even-numbered ones

You can get much more sophisticated, but these are the basics.
