# Week 8 Discussion

## Infographic

* [The Pudding](https://pudding.cool/)
* [The State of The Pudding, 2018](https://medium.com/@matthew_daniels/the-state-of-the-pudding-2018-9661ab4d299c)

## Links

* [MDN HTML Reference](https://developer.mozilla.org/en-US/docs/Web/HTML/Element)
* [CSS Diner](https://flukeout.github.io/) -- an interactive CSS Selector tutorial
* [XPath Diner](http://www.topswagcode.com/xpath/) -- an interactive XPath tutorial
* [Wikipedia's XPath page][xpath]
* [Python's Regular Expression HOWTO](https://docs.python.org/3/howto/regex.html#regex-howto)

__Next week only:__ my office hours will be at a different time (TBA soon on Slack).

[xpath]: https://en.wikipedia.org/wiki/XPath#Syntax_and_semantics_(XPath_1.0)

## Web Scraping

Basic web scraping workflow:

1. Download pages
2. _Parse_ pages to extract text
3. Clean up extracted text
4. Store cleaned results
5. Analyze results

This workflow is the same regardless of the packages and language you're using.
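
As a rough illustration, the five steps can be sketched end to end with only Python's standard library. This is a minimal sketch, not the approach used in the rest of these notes: the page content is hard-coded (hypothetical) so the example runs offline, and `ParagraphParser` is a name made up for this example.

```python
import csv
import io
from html.parser import HTMLParser

# Step 1: download. A hard-coded page stands in for the HTTP request
# so that the sketch runs offline (hypothetical content).
page = "<html><body><p> First  item </p><p>Second item</p></body></html>"


# Step 2: parse. Collect the text inside every <p> tag.
class ParagraphParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.paragraphs[-1] += data


parser = ParagraphParser()
parser.feed(page)
parser.close()

# Step 3: clean. Collapse runs of whitespace.
cleaned = [" ".join(p.split()) for p in parser.paragraphs]

# Step 4: store. Write CSV to an in-memory buffer (a file works the same way).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["text"])
writer.writerows([p] for p in cleaned)

# Step 5: analyze. Count the words in each paragraph.
word_counts = [len(p.split()) for p in cleaned]
print(cleaned, word_counts)
```

In a real scrape, each step is usually handled by a dedicated package rather than written by hand.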

### In Python

We've already seen how to use [requests][] (with [requests_cache][]) to download pages (use a `GET` request).

The packages [lxml][] and [Beautiful Soup][bs4] can parse web pages. Choose one and use it for the entire scrape, since the objects they return are not interchangeable.

You can clean up extracted text with [string methods][str], [re][], and [pandas][]. Sometimes you'll also need to use natural language processing packages.

The [scrapy][] and [pattern][] packages try to automate steps 1-4. They can save you time _after_ you understand the basics of scraping.

[requests]: http://docs.python-requests.org/en/master/
[requests_cache]: https://requests-cache.readthedocs.io/en/latest/

[lxml]: http://lxml.de/lxmlhtml.html
[bs4]: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

[str]: https://docs.python.org/3/library/stdtypes.html#string-methods
[re]: https://docs.python.org/3/library/re.html
[pandas]: http://pandas.pydata.org/pandas-docs/stable/

[scrapy]: https://scrapy.org/
[pattern]: https://www.clips.uantwerpen.be/pages/pattern

### What makes a web page?

Web pages are written in hypertext markup language (HTML).

In HTML, text is surrounded by _tags_ to mark formatting. Tags are written inside angle brackets `< >`. For example, to mark some text as bold:
```html
This is a <b>cat</b>!
```

Tags usually come in pairs; the second tag (with `/`) is called a _closing tag_ and marks the end of the formatting.

Tags often have additional information as _attributes_. For example, links have an `href` attribute to specify the URL to link to:
```html
This <a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ">page</a> is famous.
```
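
To see how a program reads tags and attributes, here is a minimal sketch that pulls the `href` attribute out of the link above. It assumes Python's standard-library `html.parser` module rather than a scraping package, and `LinkParser` is a name made up for this example.

```python
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, one per attribute
        if tag == "a":
            self.hrefs.extend(value for name, value in attrs if name == "href")


snippet = 'This <a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ">page</a> is famous.'
parser = LinkParser()
parser.feed(snippet)
print(parser.hrefs)
```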

A longer example of a web page is:

```python
doc = """
<html>
<head>
<title>This is the Title!</title>
</head>

<body>
<p>This is a paragraph!</p>
<p id="best-paragraph">This is another paragraph!</p>
<p>Visit <a href="http://www.google.com">Google</a>.</p>
<span>This is a span. ❤️ </span>
</body>
</html>
"""
```

Firefox and Chrome come with (web) developer tools. You can use these to inspect a web page.

Open the tools with `ctrl`-`shift`-`i` (or `cmd`-`option`-`i` on macOS).

### Finding Tags

Tags are nested, so an HTML document is like a tree:
```
html
├── head
│   └── title
└── body
    ├── p
    ├── p
    ├── p
    │   └── a
    └── span
```
This is similar to the file system on your computer. The key difference is that tags at the same level can have the same name.
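
The tree structure can be made concrete by walking a parsed document and printing the path to every tag. This is a minimal sketch that assumes the standard-library `xml.etree.ElementTree` module, which only accepts well-formed markup, so it uses a trimmed copy of the example page. The helper `tag_paths` is a name made up for this example.

```python
import xml.etree.ElementTree as ET

# A trimmed, well-formed copy of the example page
page = """<html>
<head><title>This is the Title!</title></head>
<body>
<p>This is a paragraph!</p>
<p id="best-paragraph">This is another paragraph!</p>
<p>Visit <a href="http://www.google.com">Google</a>.</p>
<span>This is a span.</span>
</body>
</html>"""


def tag_paths(element, prefix=""):
    """Return the path of tag names leading to each element, depth first."""
    path = prefix + "/" + element.tag
    paths = [path]
    for child in element:
        paths.extend(tag_paths(child, path))
    return paths


root = ET.fromstring(page)
for path in tag_paths(root):
    print(path)
```

The path `/html/body/p` is printed three times, once per paragraph, which a file system would not allow.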

XPath lets us write a path to a tag the same way we would to a file:

```python
import lxml.html as lx

html = lx.fromstring(doc)

html.xpath("/html/head")
```
```
[<Element head at 0x7fe6084bb7c8>]
```

Since tags can have the same name, we get a list rather than a single tag.

Absolute paths are inconvenient for scraping and not robust: they break as soon as the page's structure changes.

In XPath, `//` means "anywhere below":

```python
html.xpath("/html//p")
```
```
[<Element p at 0x7fe5f946d368>,
 <Element p at 0x7fe5f946d958>,
 <Element p at 0x7fe5f946d9a8>]
```

Use `[ ]` to put a condition on a tag:

```python
html.xpath("//p[@id = 'best-paragraph']")
```
```
[<Element p at 0x7fe5f946d958>]
```

XPath is not Python-specific!
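
It is not lxml-specific either. As a minimal sketch, the standard-library `xml.etree.ElementTree` module understands a small subset of XPath, including `//` and `[ ]` conditions. This example assumes well-formed markup (a trimmed copy of the example page) and relative paths that start with `.`, both of which are restrictions of that module rather than of XPath.

```python
import xml.etree.ElementTree as ET

page = """<html><body>
<p>This is a paragraph!</p>
<p id="best-paragraph">This is another paragraph!</p>
<p>Visit <a href="http://www.google.com">Google</a>.</p>
</body></html>"""

root = ET.fromstring(page)

# ".//" means "anywhere below the current element"
paragraphs = root.findall(".//p")

# "[ ]" puts a condition on the tag, as in full XPath
best = root.findall(".//p[@id='best-paragraph']")

# Once you have an element, read its text and attributes directly
link = root.find(".//a")
print(len(paragraphs), best[0].text, link.text, link.get("href"))
```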

CSS Selectors are an alternative to XPath. They're less powerful but more concise:

```python
html.cssselect("p#best-paragraph")
```
```
[<Element p at 0x7fe5f946d958>]
```

### Example: Scraping Hacker News

Let's scrape <https://news.ycombinator.com/>

```python
import lxml.html as lx
import pandas as pd
import requests
import requests_cache

requests_cache.install_cache("cache")

# Step 1: download
url = "https://news.ycombinator.com/"
response = requests.get(url)
response.raise_for_status()
doc = response.text

# Step 2: parse
html = lx.fromstring(doc, base_url = url)
html.make_links_absolute()

links = html.xpath("//a[@class = 'storylink']")
titles = [x.text_content() for x in links]

sublinks = html.xpath("//td[@class = 'subtext']/a[last()]")
comments = [x.text_content() for x in sublinks]

# Step 3: clean
result = pd.DataFrame({"titles": titles, "comments": comments})

result.head()
```
```
   comments      titles
0  45 comments   Spotify IPO filing: Form F-1
1  60 comments   Triplebyte Raises $10M from Initialized Capita...
2  43 comments   How Inmates Play Tabletop RPGs in Prisons Wher...
3  17 comments   Clock error lead to death of 28 Soldiers. Soft...
4  263 comments  The Makefile I use with JavaScript projects
```


## Regular Expressions

Python's built-in string methods are very powerful and very fast:

```python
x = "This is my text"
x.split()
```
```
['This', 'is', 'my', 'text']
```

Most of them are also available for pandas columns through the `.str` accessor:

```python
result["comment_count"] = result.comments.str.split().str.get(0)
result.head()
```
```
   comments      titles                                             comment_count
0  45 comments   Spotify IPO filing: Form F-1                       45
1  60 comments   Triplebyte Raises $10M from Initialized Capita...  60
2  43 comments   How Inmates Play Tabletop RPGs in Prisons Wher...  43
3  17 comments   Clock error lead to death of 28 Soldiers. Soft...  17
4  263 comments  The Makefile I use with JavaScript projects        263
```


```python
result["comment_count"] = pd.to_numeric(result.comment_count, errors = "coerce")
result.head()
```
```
   comments      titles                                             comment_count
0  45 comments   Spotify IPO filing: Form F-1                       45.0
1  60 comments   Triplebyte Raises $10M from Initialized Capita...  60.0
2  43 comments   How Inmates Play Tabletop RPGs in Prisons Wher...  43.0
3  17 comments   Clock error lead to death of 28 Soldiers. Soft...  17.0
4  263 comments  The Makefile I use with JavaScript projects        263.0
```
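
With `errors = "coerce"`, pandas replaces any value it cannot convert with a missing value instead of raising an error. The same idea can be sketched in plain Python. The helper `comment_count` and the sample values below are hypothetical; `"discuss"` stands in for a row whose text does not start with a number.

```python
def comment_count(text):
    """Return the leading number in strings like '45 comments', else None."""
    words = text.split()
    if words and words[0].isdigit():
        return int(words[0])
    return None


# Hypothetical values, including two that cannot be converted
comments = ["45 comments", "1 comment", "discuss", ""]
counts = [comment_count(c) for c in comments]
print(counts)
```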


Sometimes we need to extract more complicated patterns.

In that case, we can use _regular expressions_. Regular expressions (or regex) are a language for describing patterns. They are not Python-specific.

You can use regular expressions with the built-in `re` module or with many of the pandas string methods, such as `str.contains`, `str.extract`, and `str.replace`.

### Regex Syntax

In regular expressions, letters, numbers, and spaces are matched literally.

Other characters have special meanings:

Character | Description
--------- | -----------
`.`       | any 1 character (except a newline)
`[ ]`     | any 1 character listed
`?`       | repeat previous 0 or 1 times
`*`       | repeat previous 0 or more times
`+`       | repeat previous 1 or more times
`^`       | start of string
`$`       | end of string

```python
import re

re.findall("a", "aaaba")
```
```
['a', 'a', 'a', 'a']
```

```python
re.findall("a+", "aaaba")
```
```
['aaa', 'a']
```
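
The other special characters in the table work in a similar way. Here is a short sketch with one example per character. The test strings are made up for illustration.

```python
import re

# "." must match exactly one character, so "ct" is not a match
any_char = re.findall("c.t", "cat cot ct")

# "[ ]" lists the allowed characters; here, one or more digits
digits = re.findall("[0-9]+", "45 comments, 263 points")

# "?" makes the previous character optional
optional = re.findall("colou?r", "color colour")

# "*" repeats the previous character zero or more times
star = re.findall("ab*", "a ab abbb")

# "^" and "$" anchor the match to the start or end of the string
at_start = re.findall("^a", "aaaba")
at_end = re.findall("a$", "aaaba")

print(any_char, digits, optional, star, at_start, at_end)
```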

You can match a special character literally by putting a backslash `\` in front of it.

The backslash `\` also has a special meaning to Python, so in an ordinary string we'd have to use two backslashes:
```
"\\."
```
This is so annoying that Python also supports _raw strings_, where backslash has no special meaning (to Python). These have an `r` before the quote character.

Use raw strings for your regular expressions. Then you can just write:
```
r"\."
```

```python
re.findall(r"\*[a-z]", "x*y")
```
```
['*y']
```
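
To check that the two spellings really describe the same pattern, compare them directly. This short sketch also shows the difference between an escaped and an unescaped `.`:

```python
import re

escaped = "\\."   # ordinary string: two backslashes in the source code
raw = r"\."       # raw string: one backslash in the source code

# Both are the same two characters: a backslash followed by a dot
same = escaped == raw
length = len(raw)

# An unescaped "." matches any character; the escaped one matches only dots
unescaped_matches = re.findall(".", "a.b")
escaped_matches = re.findall(r"\.", "a.b")

print(same, length, unescaped_matches, escaped_matches)
```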

Next week's discussion will have more examples of regular expressions, and also some examples of natural language processing.

## Scrape Wisely

Before you scrape, is there an easier way to get the data? A direct data download? A web API?

__Q:__ How do I scrape the "next page" (for example, in a news listing)?

__A:__ The link to the next page is on the first page, so have your scraper get that link. Then use a loop.
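
A minimal sketch of that loop follows. It assumes a hypothetical site held in the `PAGES` dictionary so that it runs offline, and the `class="next"` attribute on the link is made up as well. A real scraper would download each URL and use whatever marks the next-page link on the actual site.

```python
from html.parser import HTMLParser

# Hypothetical site: URL -> page source. A real scraper would download these.
PAGES = {
    "/news?p=1": '<p>Story A</p><a class="next" href="/news?p=2">More</a>',
    "/news?p=2": '<p>Story B</p><a class="next" href="/news?p=3">More</a>',
    "/news?p=3": "<p>Story C</p>",
}


class NextLinkParser(HTMLParser):
    """Remember the href of the first <a class="next"> tag on a page."""

    def __init__(self):
        super().__init__()
        self.next_url = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "next" and self.next_url is None:
            self.next_url = attrs.get("href")


def scrape_all(start_url, max_pages=10):
    """Follow "next" links from start_url and return each page's source."""
    visited = []
    sources = []
    url = start_url
    # Stop when there is no next link, a link repeats, or the cap is reached
    while url is not None and url not in visited and len(visited) < max_pages:
        visited.append(url)
        source = PAGES[url]      # stand-in for the download step
        sources.append(source)
        parser = NextLinkParser()
        parser.feed(source)
        url = parser.next_url    # None when the page has no next link
    return visited, sources


visited, sources = scrape_all("/news?p=1")
print(visited)
```

The `max_pages` cap and the check for repeated URLs guard against looping forever if a site's last page links back to an earlier one.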

__Q:__ I tried scraping a web page but what I see in my browser doesn't match what I see in the request response. What's going on? ([example](https://ucannualwage.ucop.edu/wage/))

__A:__ Several things could cause this.

Most likely, the web page uses JavaScript code to display the data. Your web browser runs the code, but a simple HTTP request does not. If the JavaScript gets the data from a web API, try to reverse-engineer the API (you can find the requests it makes in the "Network" tab of the browser dev tools).

If the JavaScript doesn't use a web API, you can use tools like [selenium][] or [pyppeteer][] to control the browser from Python. The disadvantage is that this can be very slow if you need to scrape a lot of pages.

Some web pages are deliberately hard to scrape because the owners don't want you to have their data.

[selenium]: http://selenium-python.readthedocs.io/
[pyppeteer]: https://github.com/miyakogi/pyppeteer

__Q:__ How do I scrape a page where I have to log in?

__A:__ Use selenium or pyppeteer, as mentioned above.