# STA 141B Lecture 9

The class website is <https://github.com/2019-winter-ucdavis-sta141b/notes>

### Announcements

### Topics

* Undocumented APIs
* XML and HTML
* Web Scraping

### Datasets

* [Yolo County Health Inspections](https://yoloeco.envisionconnect.com/)
* [Wikipedia's List of Largest Cities](https://en.wikipedia.org/wiki/List_of_largest_cities)
* [CUESA's Vegetable Seasons Chart](https://cuesa.org/eat-seasonally/charts/vegetables)

### References

* [__requests__ documentation](http://docs.python-requests.org/en/master/)
* [__requests-html__ documentation](https://html.python-requests.org/)
* [MDN HTML Reference](https://developer.mozilla.org/en-US/docs/Web/HTML/Element)
* [XPath Diner](http://www.topswagcode.com/xpath/) -- an interactive XPath tutorial
* [CSS Diner](https://flukeout.github.io/) -- an interactive CSS Selector tutorial
* Python for Data Analysis, Ch. 6
* Python for Data Analysis, Ch. 7.3 (to review string processing)

[PDSH]: https://jakevdp.github.io/PythonDataScienceHandbook/
[ProGit]: https://git-scm.com/book/

## Getting Data from the Web

Here is a revised list of ways to get data from the web, ordered from most to least convenient:

1. Direct download or "data dump"
2. Python or R package (there are packages for many popular web APIs)
3. Documented web API
4. Undocumented web API
5. Scraping

## Undocumented Web APIs

Many websites use undocumented web APIs to get data. For example:

* [University of California Compensation](https://ucannualwage.ucop.edu/wage/)
* [Yolo County Health Inspections](https://yoloeco.envisionconnect.com/)

You can identify these websites by looking at requests in your browser's developer tools. In Firefox or Chrome, you can open the developer tools with `ctrl-shift-i`.

Requests to web APIs almost always return JSON or XML data. By examining the browser requests, you can work out the endpoints and parameters, allowing you to use the API.
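
As a rough sketch of what "working out the endpoints and parameters" gives you, the snippet below assembles the pieces of a request using only the standard library. The endpoint URL, the `key` query parameter, and the `name` field are hypothetical stand-ins for whatever the developer tools show, and nothing is sent over the network.

```python
import json
from urllib.parse import urlencode

# Hypothetical values, as if copied from a request seen in the developer tools.
endpoint = "https://example.com/api/search"   # assumed endpoint
query = {"key": "abc123"}                      # assumed query-string parameter
payload = {"name": "taqueria"}                 # assumed JSON body

# The query string is appended to the endpoint after a "?".
url = endpoint + "?" + urlencode(query)

# The body of a POST request is the payload serialized as JSON text.
body = json.dumps(payload)

print(url)
print(body)
```

With __requests__, the same three pieces would be passed as `requests.post(endpoint, params=query, json=payload)`.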

**CAUTION:** Web APIs that are undocumented are often undocumented for a reason. Using an undocumented API may make someone angry or get you into legal trouble! Government and quasi-government websites (like the examples above) are probably okay, as long as you cache and rate-limit your requests. For everything else, look for an alternative or get permission first.
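
To make "cache and rate-limit" concrete, here is a minimal sketch that never downloads the same URL twice and pauses after each download. It is only an illustration, not the approach used in these notes (which rely on __requests_cache__): `polite_fetch`, `fetch`, and `delay` are hypothetical names, and `fetch` can be any function that downloads a URL.

```python
import time

def polite_fetch(urls, fetch, delay=1.0, cache=None):
    """Fetch each URL at most once, pausing `delay` seconds after each download."""
    cache = {} if cache is None else cache
    results = []
    for url in urls:
        if url not in cache:
            cache[url] = fetch(url)   # only contact the server for new URLs
            time.sleep(delay)         # rate limit: wait before the next download
        results.append(cache[url])
    return results

# A stand-in for a real download function, so the sketch runs offline.
calls = []
def fake_fetch(url):
    calls.append(url)
    return "data for " + url

pages = polite_fetch(["a", "b", "a"], fake_fetch, delay=0.01)
print(pages)
print(calls)  # "a" was only downloaded once
```

In practice `fetch` might be something like `lambda url: requests.get(url).text`.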

Let's reverse engineer the Yolo County Health Inspections web API so that we can get data about local restaurants.

In [1]:
import numpy as np
import pandas as pd
import requests
import requests_cache

requests_cache.install_cache("mycache")

In [3]:
def get_health_info(q):
    # A POST request pushes data along with the request, like an email attachment.
    response = requests.post(
        "https://yoloeco.envisionconnect.com/api/pressAgentClient/searchFacilities",
        params={"PressAgentOld": "c08cb189-894c-4c8c-b595-a5ef010226b4"},
        json={"FacilityName": q},
    )
    response.raise_for_status()
    return response

We could reverse engineer other parts of the API to get detailed data about health violations.

## Web Scraping

### What makes a web page?

Web pages are written in _hypertext markup language_ (HTML). HTML files (`.htm` or `.html`) are plain text, just like JSON, Python scripts, and R scripts.

In HTML, we use _tags_ to create _elements_ of a web page. Elements add formatting and structure to the page.

* Tags usually come in pairs: an opening tag and a closing tag.
* Tags are written `<NAME>` for opening tags, `</NAME>` for closing tags, and `<NAME />` for singleton tags.
* Opening and singleton tags can have _attributes_ that contain additional information. Attributes are written `ATTRIBUTE=VALUE` after the tag name. 

See [here](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/HTML_basics) for a more detailed explanation, and [here](https://developer.mozilla.org/en-US/docs/Web/HTML/Element) for a list of valid HTML elements.

#### Examples

As an example:

```html
<p>This <a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ">page</a> is famous and this <strong>word</strong> is emphasized.</p>
```
The `p` tag marks a paragraph, the `a` tag marks a link (an _anchor_), and the `strong` tag marks text of strong importance (usually shown in bold).
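
To see the tags and attributes in that snippet programmatically, here is a small sketch using the standard library's `html.parser` module (not one of the packages used later in these notes). `TagLister` is a hypothetical helper name.

```python
from html.parser import HTMLParser

class TagLister(HTMLParser):
    """Record every opening tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.found.append((tag, dict(attrs)))

snippet = (
    '<p>This <a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ">page</a> '
    'is famous and this <strong>word</strong> is emphasized.</p>'
)

lister = TagLister()
lister.feed(snippet)
for tag, attrs in lister.found:
    print(tag, attrs)
```

Only the `a` tag carries an attribute (`href`); the `p` and `strong` tags have none.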

Here's a string that contains HTML for a simple, complete website:

In [5]:
page = """
<html>
<head>
    <title>This is the Title!</title>
</head>

<body>
    <p>This is a paragraph!</p>
    <p id="best-paragraph">This is another paragraph! 🌮</p>
    <p>Visit <a href="https://pudding.cool/">The Pudding</a>.</p>
    <span>This is a span. It comes with an avocado. 🥑</span>
</body>
</html>
"""

_Extensible markup language_ (XML) also uses tags to create elements, and different tags can represent different kinds of data. We say XML is _extensible_ because you can create your own XML elements (unlike HTML). People typically use XML to describe the structure and meaning of data, rather than for formatting.

In short, HTML is about formatting and XML is about data storage.

We'll use the same process to extract data from both HTML and XML. Usually you will want JSON when it is available, but sometimes you may want to use XML.
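
As a small illustration of XML as a data format, the sketch below parses a made-up document with the standard library's `xml.etree.ElementTree` (chosen so it runs without extra packages). The element names `inventory`, `taco`, `filling`, and `price` are invented for this example.

```python
import xml.etree.ElementTree as ET

# A made-up XML document; the element names are invented for this example.
doc = """
<inventory>
    <taco id="1"><filling>carnitas</filling><price>3.50</price></taco>
    <taco id="2"><filling>avocado</filling><price>4.00</price></taco>
</inventory>
"""

root = ET.fromstring(doc)

# Each <taco> element becomes a dictionary of plain Python values.
tacos = [
    {
        "id": int(taco.attrib["id"]),
        "filling": taco.find("filling").text,
        "price": float(taco.find("price").text),
    }
    for taco in root.findall("taco")
]
print(tacos)
```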

### Helper Packages

A _parser_ converts formatted data into familiar data structures. We've used __requests__' built-in JSON parser, but the package doesn't have a built-in HTML/XML parser. Fortunately, there are many other Python packages for parsing HTML/XML and web scraping.

HTML/XML parsers:
* [lxml](https://lxml.de/): very fast, but copes less well with broken pages
* [html5lib](https://github.com/html5lib/html5lib-python)
* [beautifulsoup](https://www.crummy.com/software/BeautifulSoup/): can handle broken pages, but slower
* [requests-html](https://html.python-requests.org/): can handle broken pages, but slower

Scraper frameworks (_convenient after learning the basics with parsers_) automate many of the routine steps:
* [scrapy](https://scrapy.org/)
* [newspaper3k](https://github.com/codelucas/newspaper)

Even more [here](https://github.com/lorien/awesome-web-scraping/blob/master/python.md#web-scraping-frameworks).

We'll use __lxml__ here, but you're welcome to use other packages on assignments and the project. To install __lxml__ for Anaconda, run `conda install -c anaconda lxml` in a shell.

In [6]:
import lxml.html as lx

html = lx.fromstring(page)  # parse the string into an HTML element tree
html

<Element html at 0x1f5461155e8>

### Finding Elements

Elements are nested, so an HTML document is like a tree:
```
html
├── head
│   └── title
└── body
    ├── p
    ├── p
    ├── p
    │   └── a
    └── span
```
This is similar to the file system on your computer. The key difference is that elements at the same level can have the same tag name.
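
The tree structure can be checked directly. This sketch parses a trimmed copy of the `page` string and prints one indented line per element; `outline` is a hypothetical helper name, and it relies on the fact that iterating over an __lxml__ element yields its children.

```python
import lxml.html as lx

# A trimmed copy of the example page.
page = """<html>
<head><title>This is the Title!</title></head>
<body>
    <p>This is a paragraph!</p>
    <p id="best-paragraph">This is another paragraph!</p>
    <p>Visit <a href="https://pudding.cool/">The Pudding</a>.</p>
    <span>This is a span.</span>
</body>
</html>"""

def outline(element, depth=0):
    """Return one indented line per element, children below their parent."""
    lines = ["  " * depth + element.tag]
    for child in element:  # iterating an element yields its children
        lines.extend(outline(child, depth + 1))
    return lines

html = lx.fromstring(page)
print("\n".join(outline(html)))
```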

#### XPath

The _XML Path Language_ (XPath) lets us write paths to elements. XPath paths look a lot like file paths. XPath is not Python-specific!

The `.xpath()` method gets all elements at an XPath path:

In [7]:
html.xpath("/html/head/title")

[<Element title at 0x1f546115ea8>]

Since there may be more than one element, the method always returns a list.

Absolute paths are not robust for scraping. An update to a web page that adds a single tag can break a scraper that uses absolute paths. In XPath, `//` means "anywhere below". We'll use `//` often because it's more robust:

In [8]:
html.xpath("/html/body/p")

[<Element p at 0x1f54615b188>,
 <Element p at 0x1f54615b1d8>,
 <Element p at 0x1f54615b228>]

In [10]:
html.xpath("/html/body//a")  # all a tags anywhere inside body

[<Element a at 0x1f54615b048>]
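
To see the robustness difference, the sketch below runs an absolute path and a `//` path against two versions of a toy page. The second version imagines a redesign that wraps the paragraph in a new `div`; both page strings are invented for this example.

```python
import lxml.html as lx

before = "<html><body><p>Hello</p></body></html>"
# The same page after a redesign that wraps the content in a new <div>.
after = "<html><body><div><p>Hello</p></div></body></html>"

counts = {}
for name, source in [("before", before), ("after", after)]:
    html = lx.fromstring(source)
    absolute = html.xpath("/html/body/p")  # requires this exact chain of parents
    anywhere = html.xpath("//p")           # any <p>, however deeply nested
    counts[name] = (len(absolute), len(anywhere))

print(counts)
```

The absolute path finds nothing after the redesign, while `//p` still finds the paragraph.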

What if we just want elements that satisfy a certain condition? In XPath, `[ ]` filters out elements that don't match a condition. For example:

In [11]:
html.xpath("//p[@id='best-paragraph']")

[<Element p at 0x1f54615b1d8>]

[XPath Diner](http://www.topswagcode.com/xpath/) is an interactive tutorial that teaches most of the XPath syntax. It takes about 20-60 minutes. Work through it to become an XPath ninja!

#### CSS Selectors

_Cascading Style Sheets_ (CSS) is another language for formatting elements in an HTML document. CSS provides another way to select elements, called _CSS selectors_.

CSS selectors are more concise but less flexible than XPath paths. The `.cssselect()` method gets all elements that match a CSS selector:

In [12]:
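html.cssselect("p")  # requires the separate cssselect package to be installed

ImportError: cssselect does not seem to be installed. See http://packages.python.org/cssselect/

The error means the __cssselect__ package is missing. Install it (for example with `pip install cssselect`) and rerun the cell; there is nothing extra to import.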

### Extracting Text and Attributes

There are two ways to get text from an element:

* `.text` gives text inside the element, but not its children
* `.text_content()` gives text inside the element and its children, with all tags removed

In [15]:
p = html.xpath("//p[@id='best-paragraph']")[0]  # the id is hyphenated
p.text

'This is another paragraph! 🌮'
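
The difference between the two shows up when an element has children. A minimal sketch, using the link paragraph from the example page:

```python
import lxml.html as lx

p = lx.fromstring('<p>Visit <a href="https://pudding.cool/">The Pudding</a>.</p>')

print(repr(p.text))            # 'Visit ' (text before the first child only)
print(repr(p.text_content()))  # 'Visit The Pudding.' (all text, tags removed)
```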

We can get values from attributes on an element with `.attrib`, which behaves like a dictionary that maps attribute names to values.
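
For example, here is a short sketch with a link element. The `id="pudding"` attribute is added just for illustration.

```python
import lxml.html as lx

a = lx.fromstring('<a href="https://pudding.cool/" id="pudding">The Pudding</a>')

print(a.attrib["href"])  # look up one attribute by name
print(dict(a.attrib))    # copy all attributes into a regular dict
print(a.get("title"))    # .get() returns None for a missing attribute
```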

### Example: Scraping Tables

For data in a `table` element, we can use __Pandas__ instead of writing a scraper.

Wikipedia provides lots of useful information in tables. Let's get the Wikipedia list of [the world's largest cities][wiki].

[wiki]: https://en.wikipedia.org/wiki/List_of_largest_cities

In [18]:
result = pd.read_html("https://en.wikipedia.org/wiki/List_of_largest_cities")
result[1]

|   | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | City | Nation | Image | Population |  |  |
| 1 |  |  | City proper | Metropolitan area | Urban area[7] |  |
| 2 | Chongqing | China |  | 30,751,600[8] | 17,000,000[9] | 8,165,500[a] |
| 3 | Shanghai | China |  | 24,256,800[11] | 24,750,000[12] | 23,416,000[b] |
| 4 | Beijing | China |  | 21,516,000[13] | 24,900,000[14] | 21009000 |
| 5 | Lagos | Nigeria |  | 16,060,303[c] | 21,000,000[17] | 13123000 |
| 6 | Dhaka | Bangladesh |  | 14,399,000[18] | 20,000,000[19] | 19,580,000[20][21] |
| 7 | Mumbai | India |  | 12,478,447[22] | 12771200 | 20,748,395[23] |
| 8 | Chengdu | China |  | 16,044,700[24] | 10,376,000[citation needed] | 6,316,922[25] |
| 9 | Karachi | Pakistan |  | 14,910,352[26][d] |  |  |
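
The population cells still carry thousands separators and footnote markers, so they need cleaning before they can be used as numbers. A minimal sketch with a regular expression, assuming each cell is a string (`clean_population` is a hypothetical helper name):

```python
import re

def clean_population(cell):
    """Turn a cell like '30,751,600[8]' into the integer 30751600."""
    without_notes = re.sub(r"\[[^\]]*\]", "", cell)  # drop [8], [a], [citation needed]
    digits = without_notes.replace(",", "").strip()
    return int(digits) if digits else None

cells = ["30,751,600[8]", "14,910,352[26][d]", "10,376,000[citation needed]", "21009000", ""]
cleaned = [clean_population(c) for c in cells]
print(cleaned)
```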


### Writing Scrapers

When the data we want isn't in a `table` element, we have to write our own scraper.

The workflow for writing a scraper is the same regardless of the language you use:

1. Download pages with an HTTP request (usually `GET`)
2. Parse pages to extract text
3. Clean up extracted text with string methods or regex
4. Save cleaned results
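
The four steps can be sketched end to end on a toy page. Everything here is invented for illustration: the page, the `menu` list, and the "name - $price" format. Step 1 is replaced by a hard-coded string so the sketch runs offline.

```python
import csv
import io
import lxml.html as lx

# Step 1 (download) is replaced by a hard-coded page so the sketch runs offline.
# With requests it would be something like: text = requests.get(url).text
text = """
<html><body>
  <ul id="menu">
    <li>  Taco - $3.50 </li>
    <li>Burrito - $7.00</li>
  </ul>
</body></html>
"""

# Step 2: parse the page and extract the raw text of each item.
html = lx.fromstring(text)
raw = [li.text_content() for li in html.xpath("//ul[@id='menu']/li")]

# Step 3: clean the text with string methods.
rows = []
for item in raw:
    name, price = item.strip().split(" - $")
    rows.append({"name": name, "price": float(price)})

# Step 4: save the cleaned results as CSV (to a string here; a file works the same way).
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["name", "price"])
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```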

### Example: CUESA Vegetable Seasons



CUESA (Center for Urban Education about Sustainable Agriculture) provides [a chart](https://cuesa.org/eat-seasonally/charts/vegetables) that shows when vegetables are in season. Let's scrape the chart.

In [19]:
result = pd.read_html("https://cuesa.org/eat-seasonally/charts/vegetables")

In [20]:
response = requests.get("https://cuesa.org/eat-seasonally/charts/vegetables")
response.raise_for_status()

html = lx.fromstring(response.text)

# A table is made of a header (thead) and a body (tbody), each containing rows (tr)
tab = html.xpath("//table")[0]

rows = tab.xpath(".//tr")
header = rows[0]  # the first row holds the column labels
body = rows[1:]   # the remaining rows hold the data

[x.text for x in header.xpath(".//li")]

Can we generalize our scraper to the [chart](https://cuesa.org/eat-seasonally/charts/fruit) for fruit and nuts?
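
One way to start is to move the steps into functions that take the page as an argument. This is only a sketch: it assumes the fruit page uses the same layout as the vegetable page (labels in `li` elements in the first row of the first table), which has not been checked here. `header_labels` and `scrape_chart` are hypothetical names, and the toy page at the end is invented so the parsing step can be tried offline.

```python
import lxml.html as lx
import requests

def header_labels(page_text):
    """Return the text of each <li> in the first row of the first table."""
    html = lx.fromstring(page_text)
    table = html.xpath("//table")[0]
    header = table.xpath(".//tr")[0]
    return [li.text for li in header.xpath(".//li")]

def scrape_chart(url):
    """Download a chart page and extract its header labels."""
    response = requests.get(url)
    response.raise_for_status()
    return header_labels(response.text)

# Offline check on a toy page; the real pages may be structured differently.
toy = (
    "<html><body><table><tr><th><ul>"
    "<li>Jan</li><li>Feb</li>"
    "</ul></th></tr></table></body></html>"
)
labels = header_labels(toy)
print(labels)
```

The same `scrape_chart` function could then be called once with the vegetable URL and once with the fruit URL.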