In [5]:
# !pip install --user scrapy

In [1]:
from notebook_code.scraping_support import *

from scrapy.selector import Selector
from scrapy.http import HtmlResponse

from IPython.display import HTML

## A Crash Course in HTML

To properly scrape a page, you first need to know a little about HTML (also known as: hypertext markup language). It's what almost every website you'll find out there is built on.

The main thing that makes HTML different from regular text is its use of **tags**. A `<` marks the beginning of a tag and a `>` marks the end of a tag. If you're just using a single tag on its own, with nothing for it to wrap, it can end with a `/>`. For example, a line break is `<br/>` (HTML5 also accepts plain `<br>`). Your browser doesn't need to know anything else about it to know it should start a new line between that tag and whatever comes next.

However, most tags are a little more sophisticated than our rather boring `<br/>`. Most tags do *something* to what comes after them. For example `<b>` tells you to start **making everything bold. But eventually bold words just get distracting, so you'll want to stop making them all bold. You do that with `</b>`**.

Notice that the only difference between the closing tag and the opening tag is that the closing tag begins with `</` instead of `<`. 

When your browser is figuring out how to display a page, it expects that you provide the closing tag whenever you use an opening tag. You can put whatever you want in between the opening and closing tag as long as it's valid HTML.

For example, `<i>foo</i>` will italicize *foo*. If you want to also make **<i>foo</i>** bold, you could put the valid HTML `<i>foo</i>` inside an opening and closing `b` tag: 

```html
<b><i>foo</i></b>
```

In case you're wondering, yes, you could also get the same result with:

```html
<i><b>foo</b></i>
```

That may seem like a lot of work to do something you could do in a couple of mouse clicks or key presses in your favourite word processor. The beauty of HTML is that all web browsers are expected to do the *same* thing when they run into any of the standard HTML tags. That means that the tags you wrote to make your text bold and italicized will result in **<i>bold, italicized</i>** text no matter what browser someone uses to read it.

### Valid HTML Documents

We've learned what basic valid HTML looks like. What about a whole HTML document? Here's the minimum an HTML document should have in order to be valid HTML5 (the latest version of HTML):

```html
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>Your page title goes here.</title>
  </head>
  <body>
    Your main page content goes here.
  </body>
</html>
```

Web browsers can be pretty forgiving, and your page probably won't break if you miss some of these. Don't worry about trying to memorize this either. You can find it in more than enough places online and it's perfectly okay to cut and paste to start out. Even the pros do that.

We'll just be focusing on what happens within the `<body></body>` tags, so for the following examples, just remember that to make them into a valid HTML document, you'd have to put them inside the `<body></body>` tags above.

## A Simple Scraper

There's a lot more we could say about HTML and its standard tags, and if you're interested, you can read about those tags [here](https://www.w3schools.com/tags/default.asp). Combined with CSS (cascading style sheets), another standard that every browser is expected to support, HTML lets you make text and images look exactly the way you want in any web browser. And with a third standard language, JavaScript, you can even create sophisticated user interfaces that people can interact with. Bet your word processor can't do that!

However, this notebook's purpose is to get you ready to scrape other people's nice looking web pages! And we now know enough about HTML to start doing that. So let's get into it!

### Your First Mission

Let's say we have the following within our document `<body></body>`:

```html
The answer to the question is: <b>42</b>.
```

Here's what that looks like:

In [2]:
html_text = 'The answer to the question is: <b>42</b>.'
HTML(html_text)

How would you get *just* the answer `42`?

Well, you *could* just look for some part of the text before and after it. For example, you could look for `'is: '` and grab everything until the first `.`. That's after you remove that bold tag and any other HTML tags within the text:

In [3]:
raw_text = remove_tags(html_text)
before_text = 'is: '
answer_start = raw_text.find(before_text) + len(before_text)
answer_end = raw_text.find('.', answer_start)
raw_text[answer_start:answer_end]

'42'

What are some problems with finding the answer this way? Well, first, it's a lot of work! But on top of that, if we change any of the text around the answer, we risk breaking the code we're using to grab it. What if, instead of stripping out the `<b>` tag, we used it as a way to locate the answer? That's where scraping libraries like *scrapy* come in. They can use the structure of the HTML itself to help drill down to the parts of it we're interested in.
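To make that fragility concrete, here is a minimal sketch of the same text-search approach wrapped in a function. The `strip_tags` helper is a hypothetical, regex-based stand-in for `remove_tags` (it is not the notebook's own helper), and the reworded sentences are made up for illustration:

```python
import re


def strip_tags(html):
    """Hypothetical stand-in for remove_tags: drop anything that looks like a tag."""
    return re.sub(r'<[^>]+>', '', html)


def find_answer(html, before_text='is: '):
    """Grab the text between `before_text` and the next full stop."""
    raw_text = strip_tags(html)
    start = raw_text.find(before_text)
    if start == -1:
        return None  # the marker text is gone, so there is nothing to anchor on
    start += len(before_text)
    end = raw_text.find('.', start)
    return raw_text[start:end]


# Works on the original sentence.
original = find_answer('The answer to the question is: <b>42</b>.')

# Breaks when the wording around the answer changes.
reworded = find_answer('The answer to the question: <b>42</b>!')

# Quietly returns the wrong thing when the answer itself contains a '.'.
decimal = find_answer('The answer to the question is: <b>4.2</b>.')

print(original, reworded, decimal)
```

In both failing cases the `<b>` tag still wraps the answer exactly, which is the information the text search throws away.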

Here's how we can find our answer using *scrapy*:

In [4]:
Selector(text=html_text).xpath('//b/text()').extract_first()

'42'

### XPaths

What's going on here? Let's break up that first line above into 3 pieces:

1. `Selector(text=html_text)`
2. `xpath('//b/text()')`
3. `extract_first()`

The first piece is just a way of initializing **scrapy** with the HTML that we want it to dig into. We're creating the `Selector` object with our `html_text` string, and that object will do the hard work of figuring out the structure of that text string based on all the HTML tags, etc. in it. The last piece tells **scrapy** that we want to return the first match to our query. We could have used `extract` instead of `extract_first`, and received a list with a single element (`'42'`) that we would then have to pull out of the list.

As important as the first and third pieces are, the second piece is where we'll be spending most of our time. This is where you tell **scrapy** what to look for and where to start looking for it.

Let's modify our example just a bit:

In [5]:
html_text = '''
Answer to Question 1: <b>42</b><br/>
Answer to Question 2: <b>2</b><br/>
Answer to Question 3: <b>foo</b>
'''
HTML(html_text)

How would you get all 3 answers? Note that we've also changed the text that comes before and after the answers. This would completely break our very first code for extracting the answer. How much would you have to change the **scrapy** version?

The great thing is you don't have to change either the first or second part at all! The answers are all still within those `<b></b>` tags. But now, instead of just a single match, we want *all* of them.

In [6]:
Selector(text=html_text).xpath('//b/text()').extract()

['42', '2', 'foo']

What if we put those answers inside another HTML element? We're going to use a `<div></div>` element that's kind of like an all-purpose container for things. Think of it as a box that you can put stuff in. That box can go into bigger boxes, and those boxes can be moved around without repositioning the things inside them. Here's what our answers look like inside a `div`:

In [7]:
html_text = '''
<div>
    Answer to Question 1: <b>42</b><br/>
    Answer to Question 2: <b>2</b><br/>
    Answer to Question 3: <b>foo</b>
</div>
'''
HTML(html_text)

Nothing *visually* has changed about the text. Right now, our `div` is an invisible box. So, does that change what we need to do to find our answers? Nope!

In [8]:
Selector(text=html_text).xpath('//b/text()').extract()

['42', '2', 'foo']

It seems like there might be just a bit of magic going on with that small bit of text within our `xpath` function call. Let's take a closer look at it. What does `'//b/text()'` mean to **scrapy's** `xpath` function? Just like we did above, let's pull out the pieces:

1. `//` and `/`
2. `b`
3. `text()`

Let's talk about parts 2 and 3 first. Those are pretty straightforward. `b` is just the name of the tag, without needing to put in any of the `<>`. `text()` is a special XPath test that selects the text sitting directly inside the `<b></b>`, leaving the tags themselves behind. Note that it only looks at text that is a direct child of the tag: for `<b><i>42</i></b>`, `//b/text()` returns nothing, because the `42` belongs to the `<i>` tag rather than the `<b>` tag. To collect text at any depth inside the `<b></b>`, use `//b//text()` instead.

`//` and `/` both mean to look within the current tag's children. If nothing is on the left, the "current tag" is the root of the document. The difference is that while `/` expects the tag on the right to be a direct child of the tag on the left, `//` will keep checking the children of children, retrieving anything that matches. Neither is better than the other. In fact, you'll likely need to play around a bit with them to find the right combination that is *specific enough* to not return a bunch of results you aren't interested in, but *general enough* that if someone made a small change to the website you're scraping, your scraper wouldn't break.

Without any `//` you'd have to write out the exact path to get the same answer as above. Try it!:

In [27]:
Selector(text=html_text).xpath('/html/body/div/b/text()').extract()

['42', '2', 'foo']

Now try taking out one of those tags (we'll take out the `div` tag):

In [29]:
Selector(text=html_text).xpath('/html/body/b/text()').extract()

[]

Where could you add a single `//` to the above xpath string to get those results back?

In [31]:
Selector(text=html_text).xpath('/html/body//b/text()').extract()

['42', '2', 'foo']

This means your `<b></b>` tags around the answers can be anywhere inside `<html><body></body></html>`, but not outside of it. If the `<div>` tag was replaced with `<p>` (for 'paragraph') or any other tag, your scraper would still work. If the answers were put another level deeper, the scraper would still find them.

## Using HTML attributes

In addition to children, any HTML tag can have a number of attributes. Here's what they look like:

```html
<some_tag attribute_1='value_1' attribute_2='value_2'>
  ...
</some_tag>
```

One attribute that has special meaning and is used frequently is `class`. `class` is used to tell the browser to use certain CSS style information when drawing the tag and its children. A tag can have many classes, each separated by a space. For example:

```html
<some_tag class='class_1 class_2 class_3'>
  ...
</some_tag>
```

We're not going to get into how CSS works here, though. What's important for our scraper is that these classes are used a lot and often have names that are related to the kind of information they contain. This isn't a rule. They can really be anything you want. But the reason people will often use the type of information as a class name is that they don't want to have to think about the structure of the page and where that tag is just to change some small part of its appearance (its color, for example). And because of this, they can help us make our scrapers more specific and less fragile.

Let's go back to our sample HTML and change it a bit:

In [57]:
html_text = '''
<div>
    Answer to Question 1: <b class='answer'>42</b><br/>
    Answer to Question 2: <b class='answer'>2</b><br/>
    Answer to Question 3: <b class='answer'>foo</b><br/>
    This isn't an answer to anything: <b>Just a little bold text!</b>
</div>
'''
HTML(html_text)

What happens if we run the same scraping code we did above on this?

In [36]:
Selector(text=html_text).xpath('//b/text()').extract()

['42', '2', 'foo', 'Just a little bold text!']

Aha! We still get our answers, but we also get something we don't want, just because it happens to be within a `<b></b>` tag. Let's add knowledge of the 'answer' class to our scraper so it just picks up the right pieces:

In [59]:
xpath_string = '//b[contains(concat(" ", normalize-space(@class), " ")," answer ")]/text()'
Selector(text=html_text).xpath(xpath_string).extract()

['42', '2', 'foo']

That's a little complicated. It's handling the possibility that one or more of the `<b>` tags may have *multiple* classes, and it's making sure to not match class names that just contain 'answer' (such as 'not_an_answer'). This is where scrapy's `css` function and the ability to chain your scraping calls come in handy.

Thinking about it another way, we could ask to find all the `<b></b>` tags, then just find the tags within that set that have the class `answer`, then find the text underneath:

In [60]:
Selector(text=html_text).xpath('//b').css('.answer').xpath('text()').extract()

['42', '2', 'foo']