# STA 141B Data & Web Technologies for Data Analysis

### Lecture 9, 10/31/23, Scraping


### Announcements

 - Proposal is due next week. 
 - HW 2 and Midterm grades

### Last week's topics

 - APIs 
     - iTunes
     - Guardian
     - Yolo County Health inspection
 - `requests` package
 - endpoints, params, headers and data
 - JSON format
 - API keys
 - undocumented APIs
     - developer tools in browser

### Today's topics

 - Scraping Tables with `pandas`
 - HTML
 - XML
 - Parser
 - Extracting Elements

### Resources

* [`requests` documentation](http://docs.python-requests.org/en/master/)
* [`requests-html` documentation](https://html.python-requests.org/)
* [W3 Schools](https://www.w3schools.com/html/default.asp)
* [MDN HTML Reference](https://developer.mozilla.org/en-US/docs/Web/HTML/Element)
* [XPath Diner](http://www.topswagcode.com/xpath/) - an interactive XPath tutorial
* [CSS Diner](https://flukeout.github.io/) - an interactive CSS Selector tutorial

### Scraping Tables with `pandas`

For data in a `table` element, we can use __Pandas__ instead of writing a scraper. 

Wikipedia provides lots of useful information in tables. Let's get the Wikipedia list of [US cities by area][wiki].

[wiki]: https://en.wikipedia.org/wiki/List_of_United_States_cities_by_area

In [85]:
import pandas as pd

In [86]:
tabs = pd.read_html("http://en.wikipedia.org/wiki/List_of_United_States_cities_by_area")

In [87]:
type(tabs)

list

In [88]:
len(tabs)

2
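
When a page holds several tables, the `match` argument of `pd.read_html` can narrow the result to tables that contain a given text. The sketch below runs on a small made-up HTML string (not the Wikipedia page), so it works offline; the variable names are our own. Wrapping the string in `StringIO` avoids passing literal HTML directly, which newer pandas versions discourage.

```python
from io import StringIO

import pandas as pd

# Made-up page with two tables; only the first one mentions "Sitka".
sample_html = """
<table>
  <thead><tr><th>Rank</th><th>City</th></tr></thead>
  <tbody><tr><td>1</td><td>Sitka</td></tr></tbody>
</table>
<table>
  <thead><tr><th>Note</th></tr></thead>
  <tbody><tr><td>unrelated</td></tr></tbody>
</table>
"""

# Without match, every table on the page is returned.
all_tables = pd.read_html(StringIO(sample_html))

# With match, only tables whose text matches the string (or regex) are kept.
city_tables = pd.read_html(StringIO(sample_html), match="Sitka")

print(len(all_tables), len(city_tables))
print(city_tables[0])
```

The result is still a list, even when only one table matches.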

In [89]:
tabs[1]

Unnamed: 0_level_0,Rank,City,State,Land area,Land area,Water area,Water area,Total area,Total area,Population (2020)[2]
Unnamed: 0_level_1,Rank,City,State,(sq mi),(km2),(sq mi),(km2),(sq mi),(km2),Population (2020)[2]
0,1,Sitka,Alaska,2870.1,7434,1945.1,5038.0,4815.1,12471,8458
1,2,Juneau,Alaska,2704.2,7004,550.7,1426.0,3254.9,8430,32255
2,3,Wrangell,Alaska,2556.1,6620,920.6,2384.0,3476.7,9005,2127
3,4,Anchorage,Alaska,1707.0,4421,239.7,621.0,1946.7,5042,291247
4,5,Tribune[note 1]*,Kansas,778.2,2016,0.0,0.0,778.2,2016,1182
...,...,...,...,...,...,...,...,...,...,...
145,146,Toledo,Ohio,80.5,208,3.3,8.5,83.8,217,270871
146,147,Jonesboro,Arkansas,80.2,208,0.6,1.6,80.7,209,78576
147,148,El Reno,Oklahoma,79.6,206,0.6,1.6,80.2,208,16989
148,149,Caribou,Maine,79.3,205,0.8,2.1,80.1,207,7396


In [6]:
tbl = tabs[1]
tbl.head()

Unnamed: 0_level_0,Rank,City,State,Land area,Land area,Water area,Water area,Total area,Total area,Population (2020)[2]
Unnamed: 0_level_1,Rank,City,State,(sq mi),(km2),(sq mi),(km2),(sq mi),(km2),Population (2020)[2]
0,1,Sitka,Alaska,2870.1,7434,1945.1,5038.0,4815.1,12471,8458
1,2,Juneau,Alaska,2704.2,7004,550.7,1426.0,3254.9,8430,32255
2,3,Wrangell,Alaska,2556.1,6620,920.6,2384.0,3476.7,9005,2127
3,4,Anchorage,Alaska,1707.0,4421,239.7,621.0,1946.7,5042,291247
4,5,Tribune[note 1]*,Kansas,778.2,2016,0.0,0.0,778.2,2016,1182


Before we can work with this table, we have to strip footnote markers such as `[note 1]` and `*` from the cells. We are going to do that with a regular expression (`regex`). We will learn more about `regex` later on. 

In [7]:
from re import sub 
def remove(string):
    '''
    Removes everything inside [], a whitespace before that and *'s.
    '''
    if isinstance(string, str):
        string = sub(r'\s*\[.*\]\**', '', string)
    return string

In [8]:
tbl['City'].iloc[4][0]

'Tribune[note 1]*'

In [9]:
remove(tbl['City'].iloc[4][0])

'Tribune'

In [10]:
remove(1706.8)

1706.8
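
To see what the pattern in `remove` does, it can help to try it on a string with more than one bracketed marker. The following is a small sketch; the second test string and the lazy variant of the pattern are our own additions, not part of the lecture code.

```python
import re

pattern_greedy = r'\s*\[.*\]\**'  # the pattern used in remove()
pattern_lazy = r'\s*\[.*?\]\**'   # '?' makes '.*' stop at the first ']'

one_note = 'Tribune[note 1]*'
two_notes = 'Anaheim[a] and Irvine[b]'  # made-up string with two markers

# With a single marker, the greedy pattern does what we want.
cleaned_one = re.sub(pattern_greedy, '', one_note)

# With two markers, greedy '.*' runs from the first '[' to the last ']'.
cleaned_greedy = re.sub(pattern_greedy, '', two_notes)

# The lazy pattern removes each marker separately.
cleaned_lazy = re.sub(pattern_lazy, '', two_notes)

print(cleaned_one)     # Tribune
print(cleaned_greedy)  # Anaheim
print(cleaned_lazy)    # Anaheim and Irvine
```

For this table the greedy pattern is sufficient, since no cell shown above carries two separate markers.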

In [11]:
tbl.columns = [remove(i) for i in tbl.columns] # meant to clean the column labels; they are tuples here, so remove() leaves them unchanged

In [12]:
tbl = tbl.applymap(remove) #remove from all rows

In [13]:
tbl.head()

Unnamed: 0,"(Rank, Rank)","(City, City)","(State, State)","(Land area, (sq mi))","(Land area, (km2))","(Water area, (sq mi))","(Water area, (km2))","(Total area, (sq mi))","(Total area, (km2))","(Population (2020)[2], Population (2020)[2])"
0,1,Sitka,Alaska,2870.1,7434,1945.1,5038.0,4815.1,12471,8458
1,2,Juneau,Alaska,2704.2,7004,550.7,1426.0,3254.9,8430,32255
2,3,Wrangell,Alaska,2556.1,6620,920.6,2384.0,3476.7,9005,2127
3,4,Anchorage,Alaska,1707.0,4421,239.7,621.0,1946.7,5042,291247
4,5,Tribune,Kansas,778.2,2016,0.0,0.0,778.2,2016,1182


In [14]:
tbl.dtypes

(Rank, Rank)                                      int64
(City, City)                                     object
(State, State)                                   object
(Land area, (sq mi))                            float64
(Land area, (km2))                                int64
(Water area, (sq mi))                           float64
(Water area, (km2))                             float64
(Total area, (sq mi))                           float64
(Total area, (km2))                               int64
(Population (2020)[2], Population (2020)[2])      int64
dtype: object
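
The column labels above are still tuples and still carry the `[2]` marker, because `remove` only changes strings. One possible way to clean them is to flatten each tuple into a single string. This is a sketch on a hypothetical miniature of the table; the helper `flatten` is our own name and not a pandas function.

```python
import re

import pandas as pd


def remove(string):
    """Same idea as the lecture's remove(): drop [..] markers and trailing *'s."""
    if isinstance(string, str):
        string = re.sub(r'\s*\[.*\]\**', '', string)
    return string


# Hypothetical miniature of the two-level header that read_html produced.
columns = pd.MultiIndex.from_tuples([
    ("City", "City"),
    ("Land area", "(sq mi)"),
    ("Population (2020)[2]", "Population (2020)[2]"),
])
small = pd.DataFrame([["Sitka", 2870.1, 8458]], columns=columns)


def flatten(label):
    """Clean both header levels and join them unless they are identical."""
    top, bottom = (remove(part) for part in label)
    return top if top == bottom else f"{top} {bottom}"


small.columns = [flatten(label) for label in small.columns]
print(list(small.columns))
# ['City', 'Land area (sq mi)', 'Population (2020)']
```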

### HTML

Web pages are written in _hypertext markup language_ (HTML). HTML files (`.htm` or `.html`) are plain text, just like JSON, Python scripts, and R scripts.

In HTML, we use _tags_ to create _elements_ of a web page. Elements add formatting and structure to the page.

* Tags usually come in pairs: an opening tag and a closing tag.
* Tags are written `<NAME>` for opening tags, `</NAME>` for closing tags, and `<NAME />` for singleton tags.
* Opening and singleton tags can have _attributes_ that contain additional information. Attributes are written `ATTRIBUTE=VALUE` after the tag name. 

See [here](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/HTML_basics) for a more detailed explanation, and [here](https://developer.mozilla.org/en-US/docs/Web/HTML/Element) for a list of valid HTML elements.
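
The three tag forms and the attribute syntax can be made visible with Python's standard-library `html.parser` module. This is only an illustrative sketch (the class name `TagLogger` and the sample line are made up); later in the lecture we use __lxml__ for actual parsing.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Records every opening and closing tag the parser encounters."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.events.append(("open", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("close", tag, {}))


logger = TagLogger()
logger.feed('<p>This <a href="https://pudding.cool">page</a> has a break.<br /></p>')
logger.close()

for event in logger.events:
    print(event)
```

The singleton tag `<br />` is reported as an opening tag immediately followed by a closing tag, and the `href` attribute shows up as a dictionary entry on the `a` tag.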

#### Example


From now on, we will use an artificial example:

```html
<p>This page is famous and this <b>word</b> is emphasized.</p>
```

```html
<p>This <a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ">page</a> is famous and this <strong>word</strong> is emphasized.</p>
```

```html
<li>1. Something</li>
```

<p>This page is famous and this <b>word</b> is emphasized.</p>
<p>This <a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ">page</a> is famous and this <strong>word</strong> is emphasized.</p>
<li>1. Something</li>

The `p` tag marks a paragraph, the `a` tag marks a link (an _anchor_), the `strong` tag marks emphasized text,
and the `li` tag marks a list item.

Here's a string that contains HTML for a simple, complete website:

In [38]:
page = """
<html>
sadfasdf
<head>
    <title>This is the Title!</title>
</head>

<body>
    <p>This is a paragraph! <a href="https://pudding.cool">The Pudding</a> </p>
    <p id="best-paragraph">This is another paragraph! &#127790;</p>
    <p>Visit <a href="https://pudding.cool">The Pudding</a>.</p>
    <span>This is a span, it comes with an taco &#127790;</span>
</body>
<a href="https://pudding.cool">The Pudding</a>
</html>
""" 

In [39]:
page

'\n<html>\nsadfasdf\n<head>\n    <title>This is the Title!</title>\n</head>\n\n<body>\n    <p>This is a paragraph! <a href="https://pudding.cool">The Pudding</a> </p>\n    <p id="best-paragraph">This is another paragraph! &#127790;</p>\n    <p>Visit <a href="https://pudding.cool">The Pudding</a>.</p>\n    <span>This is a span, it comes with an taco &#127790;</span>\n</body>\n<a href="https://pudding.cool">The Pudding</a>\n</html>\n'

<html> 
<head>
    <title>This is the Title!</title>
</head>

<body>
    <p>This is a paragraph!</p>
    <p id="best-paragraph">This is another paragraph! &#127790;</p>
    <p>Visit <a href="https://pudding.cool">The Pudding</a>.</p>
    <span>This is a span, it comes with an taco &#127790;</span>
</body>

<body>
    <p>This is a new paragraph!</p>
</body>

</html>

The `<span>` tag is an inline container used to mark up a part of a text, or a part of a document.
    
For example, you can write the code
```
<p>My hat is <span style="color:blue">blue</span>.</p>    
```  
    
<p>My hat is <span style="color:blue">blue</span>.</p>     

### XML

_Extensible markup language_ (XML) also uses tags to create elements. We say XML is _extensible_ because you can create your own XML elements (unlike HTML). People typically use XML to describe structure and meaning of data, rather than for formatting.

We'll use the same process to extract data from both HTML and XML.
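
As a small illustration of XML with self-defined elements, the sketch below parses a made-up document with `lxml.etree`. The element and attribute names (`inspections`, `facility`, `score`) are invented for this example, and the `.xpath()` calls use syntax that is introduced further below.

```python
from lxml import etree

# Made-up XML document; the tag and attribute names are our own invention.
xml_text = """
<inspections>
    <facility name="Taco Stand" city="Davis">
        <score>95</score>
    </facility>
    <facility name="Pudding Shop" city="Woodland">
        <score>88</score>
    </facility>
</inspections>
"""

root = etree.fromstring(xml_text)

# Attribute values of every facility element
names = root.xpath("//facility/@name")

# Text of every score element, converted to integers
scores = [int(score.text) for score in root.xpath("//facility/score")]

# Only the facilities located in Davis
davis = root.xpath("//facility[@city = 'Davis']/@name")

print(names, scores, davis)
```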

### Parser

A _parser_ converts formatted data into familiar data structures. We've used __requests__' built-in JSON parser, but the package doesn't have a built-in HTML/XML parser. Fortunately, there are many other Python packages for parsing HTML/XML and web scraping.

HTML/XML Parsers:
* [lxml](https://lxml.de/)
* [html5lib](https://github.com/html5lib/html5lib-python)
* [beautifulsoup](https://www.crummy.com/software/BeautifulSoup/)
* [requests-html](https://docs.python-requests.org/projects/requests-html/en/latest/)

Scraper Frameworks (_convenient after learning the basics with parsers_):
* [scrapy](https://scrapy.org/)
* [newspaper3k](https://github.com/codelucas/newspaper)

Even more [here](https://github.com/lorien/awesome-web-scraping/blob/master/python.md#web-scraping-frameworks).

We'll use __lxml__ here (check the [doc](https://lxml.de/apidoc/index.html)), but you're welcome to use other packages on assignments and the project. 

In [40]:
import lxml.html as lx

html = lx.fromstring(page)
html

<Element html at 0x7fbab8b4b860>

In [18]:
"""
<html>
<head>
    <title>This is the Title!</title>
</head>

<body>
    <p>This is a paragraph!</p>
    <p id="best-paragraph">This is another paragraph! &#127790;</p>
    <p>Visit <a href="https://pudding.cool">The Pudding</a>.</p>
    <span>This is a span, it comes with an taco &#127790;</span>
</body>
</html>
"""

'\n<html>\n<head>\n    <title>This is the Title!</title>\n</head>\n\n<body>\n    <p>This is a paragraph!</p>\n    <p id="best-paragraph">This is another paragraph! &#127790;</p>\n    <p>Visit <a href="https://pudding.cool">The Pudding</a>.</p>\n    <span>This is a span, it comes with an taco &#127790;</span>\n</body>\n</html>\n'

#### Finding Elements

Elements are nested, so an HTML document is like a tree:
```
html
├── head
│   └── title
└── body
    ├── p
    ├── p
    ├── p
    │   └── a
    └── span
```
This is similar to the file system on your computer. The key difference is that elements at the same level can have the same tag name.
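
The tree can also be printed from Python by visiting each element and its children recursively. The following is a sketch on a cleaned-up copy of the example page; the function name `outline` is our own.

```python
import lxml.html as lx

doc = lx.fromstring("""
<html>
<head>
    <title>This is the Title!</title>
</head>
<body>
    <p>This is a paragraph!</p>
    <p id="best-paragraph">This is another paragraph!</p>
    <p>Visit <a href="https://pudding.cool">The Pudding</a>.</p>
    <span>This is a span</span>
</body>
</html>
""")


def outline(element, depth=0):
    """Return one indented line per element, in document order."""
    lines = ["  " * depth + element.tag]
    for child in element:  # iterating over an element yields its children
        lines.extend(outline(child, depth + 1))
    return lines


print("\n".join(outline(doc)))
```

Note that the three `p` elements appear as siblings with the same tag name, which is what distinguishes this tree from a file system.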

#### XPath

The _XML Path Language_ (XPath) lets us write paths to elements. XPath paths look a lot like file paths. XPath is not Python-specific!

The `.xpath()` method gets all elements at an XPath path:

In [19]:
html.xpath("/html/head/title")

[<Element title at 0x7fbab8af0360>]

In [26]:
html.xpath("/html/body/p/a")

[<Element a at 0x7fbab8b2a310>, <Element a at 0x7fbab8b2a810>]

Since there may be more than one element, the method always returns a list.

Absolute paths are not robust for scraping. An update to a web page that adds a single tag can break a scraper that uses absolute paths. In XPath, `//` means "anywhere below". We'll use `//` often because it's more robust:

In [27]:
html.xpath("//a")

[<Element a at 0x7fbab8b2a310>,
 <Element a at 0x7fbab8b2a810>,
 <Element a at 0x7fbab8b30310>]

What if we want only the elements that satisfy a certain condition? In XPath, `[ ]` filters out elements that don't match a condition. For example:

In [28]:
html.xpath("//p[@id = 'best-paragraph']")

[<Element p at 0x7fbab88daf90>]
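
Conditions in `[ ]` are not limited to attribute comparisons. A few more variants, sketched on a cleaned-up copy of the example page (the variable names are our own):

```python
import lxml.html as lx

doc = lx.fromstring("""
<html>
<head><title>This is the Title!</title></head>
<body>
    <p>This is a paragraph!</p>
    <p id="best-paragraph">This is another paragraph!</p>
    <p>Visit <a href="https://pudding.cool">The Pudding</a>.</p>
</body>
</html>
""")

# p elements that have an a element as a child
with_link = doc.xpath("//p[a]")

# A number selects by position (counting from 1); @id extracts the attribute
second_id = doc.xpath("//body/p[2]/@id")

# p elements whose text contains a given substring
by_text = doc.xpath("//p[contains(text(), 'another')]")

# a elements whose href starts with a given prefix
hrefs = doc.xpath("//a[starts-with(@href, 'https')]/@href")

print(with_link[0].text_content())  # Visit The Pudding.
print(second_id)                    # ['best-paragraph']
print(by_text[0].get("id"))         # best-paragraph
print(hrefs)                        # ['https://pudding.cool']
```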

[XPath Diner](http://www.topswagcode.com/xpath/) is an interactive tutorial that teaches most of the XPath syntax. It takes about 20-60 minutes. Work through it to become an XPath ninja! 

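You can copy an XPath for an element from your browser's developer tools. The copied path is often anchored at an `id` rather than at the document root, for example: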

In [None]:
'//*[@id="content"]/article/section[5]/div/div/pre/code/span[2]'

#### CSS Selectors

_Cascading Style Sheets_ (CSS) is a language for styling the elements of an HTML document. To decide which elements a style applies to, CSS uses _CSS selectors_, and we can use the same selectors to pick out elements when scraping.

CSS selectors are more concise but less flexible than XPath paths. The `.cssselect()` method gets all elements that match a CSS selector:

In [29]:
html.cssselect("a")

[<Element a at 0x7fbab8b2a310>,
 <Element a at 0x7fbab8b2a810>,
 <Element a at 0x7fbab8b30310>]

Check out the [CSS Diner](https://flukeout.github.io/)!

### Extracting Text and Attributes

There are two ways to get text from an element:

* `.text` gives text inside the element, but not its children
* `.text_content()` gives text inside the element and its children, with all tags removed
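
The difference only shows up on an element that has children. Here is a minimal sketch on a single paragraph from the example page; it also shows `.tail`, the attribute where __lxml__ stores the text that follows an element's closing tag.

```python
import lxml.html as lx

# A single-element fragment: fromstring returns the p element itself.
p = lx.fromstring('<p>Visit <a href="https://pudding.cool">The Pudding</a>.</p>')
a = p[0]  # first child of p

print(repr(p.text))            # 'Visit '  (text before the first child)
print(repr(a.text))            # 'The Pudding'
print(repr(a.tail))            # '.'       (text after </a>, still inside p)
print(repr(p.text_content()))  # 'Visit The Pudding.'
```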

In [30]:
page

'\n<html>\n<head>\n    <title>This is the Title!</title>\n</head>\n\n<body>\n    <p>This is a paragraph! <a href="https://pudding.cool">The Pudding</a> </p>\n    <p id="best-paragraph">This is another paragraph! &#127790;</p>\n    <p>Visit <a href="https://pudding.cool">The Pudding</a>.</p>\n    <span>This is a span, it comes with an taco &#127790;</span>\n</body>\n<a href="https://pudding.cool">The Pudding</a>\nsadfasdf\n</html>\n'

In [31]:
html.text_content()

'\n\n    This is the Title!\n\n\n\n    This is a paragraph! The Pudding \n    This is another paragraph! 🌮\n    Visit The Pudding.\n    This is a span, it comes with an taco 🌮\n\nThe Pudding\nsadfasdf\n'

In [32]:
a = html.xpath("//a")[0]

In [33]:
a.text_content()

'The Pudding'

In [34]:
a.text

'The Pudding'

In [43]:
html.text_content()

'\nsadfasdf\n\n    This is the Title!\n\n\n\n    This is a paragraph! The Pudding \n    This is another paragraph! 🌮\n    Visit The Pudding.\n    This is a span, it comes with an taco 🌮\n\nThe Pudding\n'

In [41]:
html.text

We can get values from attributes on an element with `.attrib`, which is a dictionary:

In [45]:
a.attrib

{'href': 'https://pudding.cool'}

In [46]:
[x.attrib["href"] for x in html.xpath("//a")]

['https://pudding.cool', 'https://pudding.cool', 'https://pudding.cool']
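
Indexing `.attrib` raises a `KeyError` when an element lacks the attribute. A sketch of two ways around that, on a made-up fragment in which one `a` element has no `href`:

```python
import lxml.html as lx

# Made-up fragment: the second a element has no href attribute.
snippet = lx.fromstring("""
<div>
    <a href="https://pudding.cool">The Pudding</a>
    <a name="top">no link target here</a>
</div>
""")

links = snippet.xpath("//a")

# .get() returns None (or a default of your choice) for a missing attribute
hrefs = [a.get("href") for a in links]
hrefs_default = [a.get("href", "") for a in links]

# Selecting the attribute in XPath skips elements that lack it
hrefs_present = snippet.xpath("//a/@href")

print(hrefs)          # ['https://pudding.cool', None]
print(hrefs_default)  # ['https://pudding.cool', '']
print(hrefs_present)  # ['https://pudding.cool']
```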

### Writing Scrapers

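Let's scrape the wiki table ourselves. Attention: we are using `requests`, so pay attention to the document that is actually returned, which can differ from what the element inspector displays. In the developer tools, look at the `<thead>` element of the table in the inspector, then compare it with the response shown in the network tab. 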

In [69]:
import requests

result = requests.get(url = 'https://en.wikipedia.org/wiki/List_of_United_States_cities_by_area')

In [70]:
result.text

'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled vector-feature-custom-font-size-clientpref-disabled vector-feature-client-preferences-disabled vector-feature-typography-survey-disabled vector-toc-available" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8">\n<title>List of United States cities by area - Wikipedia</title>\n<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled 

In [71]:
html = lx.fromstring(result.text)

In [None]:
tables = html.xpath('//table')

In [53]:
tables[0]

<Element table at 0x7fbaa81c5360>

In [54]:
table = tables[1]

In [58]:
table.text

'\n'

In [72]:
html.xpath('//*[@id="mw-content-text"]/div[1]/table[2]/thead')

[]

In [73]:
html.xpath('//table[2]/tbody')

[<Element tbody at 0x7fbaa81c5860>]
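
One detail about `//table[2]` is worth knowing: the position is counted among the siblings under each parent, not across the whole document. To pick the second table of the document, wrap the path in parentheses. A sketch on a made-up page with three tables:

```python
import lxml.html as lx

# Made-up page: one table in the first div, two tables in the second div.
doc = lx.fromstring("""
<html><body>
<div><table id="a"><tr><td>1</td></tr></table></div>
<div>
    <table id="b"><tr><td>2</td></tr></table>
    <table id="c"><tr><td>3</td></tr></table>
</div>
</body></html>
""")

# Tables that are the second table child of their parent
per_parent = [t.get("id") for t in doc.xpath("//table[2]")]

# The second table in document order
in_document = [t.get("id") for t in doc.xpath("(//table)[2]")]

print(per_parent)   # ['c']
print(in_document)  # ['b']
```

On the Wikipedia page both readings may select the same table, but on other pages they can differ.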

In [74]:
def retrieve_rows(html): 
    rows = html.xpath('//table[2]/tbody/tr')
    cells = []
    for row in rows: 
        # './td|./th' starts at the current row (instead of searching the whole document again)
        # and selects its td OR th children
        # text_content() rather than .text, because some cell text is nested inside <b>
        cells.append([cell.text_content() for cell in row.xpath('./td|./th')])
    return cells
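
The leading `.` matters when `.xpath()` is called on an element rather than on the whole document. The sketch below, on a made-up page with two tables, contrasts a path starting with `//` and one starting with `.//`:

```python
import lxml.html as lx

# Made-up page with two small tables
doc = lx.fromstring("""
<html><body>
<table id="first"><tr><td>a</td></tr></table>
<table id="second"><tr><th>h</th></tr><tr><td>b</td></tr></table>
</body></html>
""")

second = doc.xpath("//table[@id = 'second']")[0]

# '//' always starts at the document root, even when called on an element
whole_document = second.xpath("//tr")

# './/' starts at the element itself and only looks below it
own_rows = second.xpath(".//tr")

cells = [[cell.text_content() for cell in row.xpath("./td|./th")]
         for row in own_rows]

print(len(whole_document), len(own_rows))  # 3 2
print(cells)                               # [['h'], ['b']]
```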

In [75]:
retrieve_rows(html)

[['Rank',
  'City',
  'State',
  'Land area',
  'Water area',
  'Total area',
  'Population(2020)[2]\n'],
 ['(sq\xa0mi)', '(km2)', '(sq\xa0mi)', '(km2)', '(sq\xa0mi)', '(km2)\n'],
 ['1',
  'Sitka',
  'Alaska',
  '2,870.1\n',
  '7,434',
  '1,945.1\n',
  '5,038',
  '4,815.1\n',
  '12,471',
  '8,458\n'],
 ['2',
  'Juneau',
  'Alaska',
  '2,704.2\n',
  '7,004',
  '550.7\n',
  '1,426',
  '3,254.9\n',
  '8,430',
  '32,255\n'],
 ['3',
  'Wrangell',
  'Alaska',
  '2,556.1\n',
  '6,620',
  '920.6\n',
  '2,384',
  '3,476.7\n',
  '9,005',
  '2,127\n'],
 ['4',
  'Anchorage',
  'Alaska',
  '1,707.0\n',
  '4,421',
  '239.7\n',
  '621',
  '1,946.7\n',
  '5,042',
  '291,247\n'],
 ['5',
  'Tribune[note 1]*',
  'Kansas',
  '778.2\n',
  '2,016',
  '0\n',
  '0',
  '778.2\n',
  '2,016',
  '1,182\n'],
 ['6',
  'Jacksonville',
  'Florida',
  '747.3\n',
  '1,935',
  '127.2\n',
  '329',
  '874.5\n',
  '2,265',
  '949,611\n'],
 ['7',
  'Anaconda',
  'Montana',
  '736.7\n',
  '1,908',
  '4.7\n',
  '12',
  '741.4

In [76]:
df = pd.DataFrame(retrieve_rows(html))
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,Rank,City,State,Land area,Water area,Total area,Population(2020)[2]\n,,,
1,(sq mi),(km2),(sq mi),(km2),(sq mi),(km2)\n,,,,
2,1,Sitka,Alaska,"2,870.1\n",7434,"1,945.1\n",5038,"4,815.1\n",12471.0,"8,458\n"
3,2,Juneau,Alaska,"2,704.2\n",7004,550.7\n,1426,"3,254.9\n",8430.0,"32,255\n"
4,3,Wrangell,Alaska,"2,556.1\n",6620,920.6\n,2384,"3,476.7\n",9005.0,"2,127\n"


In [77]:
df.columns = df.iloc[0]
df = df.drop(index = range(2))
df.head()

Unnamed: 0,Rank,City,State,Land area,Water area,Total area,Population(2020)[2]\n,None,None.1,None.2
2,1,Sitka,Alaska,"2,870.1\n",7434,"1,945.1\n",5038,"4,815.1\n",12471,"8,458\n"
3,2,Juneau,Alaska,"2,704.2\n",7004,550.7\n,1426,"3,254.9\n",8430,"32,255\n"
4,3,Wrangell,Alaska,"2,556.1\n",6620,920.6\n,2384,"3,476.7\n",9005,"2,127\n"
5,4,Anchorage,Alaska,"1,707.0\n",4421,239.7\n,621,"1,946.7\n",5042,"291,247\n"
6,5,Tribune[note 1]*,Kansas,778.2\n,2016,0\n,0,778.2\n,2016,"1,182\n"


In [78]:
df.dtypes

0
Rank                     object
City                     object
State                    object
Land area                object
Water area               object
Total area               object
Population(2020)[2]\n    object
None                     object
None                     object
None                     object
dtype: object

In [79]:
from re import sub 
def remove(string):
    '''
    Removes everything inside [], a whitespace before that and *'s,
    as well as newlines and commas (thousands separators).
    '''
    if isinstance(string, str):
        string = sub(r'\s*\[.*\]\**|\n|,', '', string)
    return string

In [80]:
df.columns = [remove(i) for i in df.columns] # remove from table columns
df = df.applymap(remove) #remove from all rows
df.head()

Unnamed: 0,Rank,City,State,Land area,Water area,Total area,Population(2020),None,None.1,None.2
2,1,Sitka,Alaska,2870.1,7434,1945.1,5038,4815.1,12471,8458
3,2,Juneau,Alaska,2704.2,7004,550.7,1426,3254.9,8430,32255
4,3,Wrangell,Alaska,2556.1,6620,920.6,2384,3476.7,9005,2127
5,4,Anchorage,Alaska,1707.0,4421,239.7,621,1946.7,5042,291247
6,5,Tribune,Kansas,778.2,2016,0.0,0,778.2,2016,1182


In [81]:
df.dtypes

Rank                object
City                object
State               object
Land area           object
Water area          object
Total area          object
Population(2020)    object
None                object
None                object
None                object
dtype: object

In [82]:
for col in df.columns[3:]: # convert every column after State to float
    df[col] = df[col].astype(float)

In [83]:
df.head()

Unnamed: 0,Rank,City,State,Land area,Water area,Total area,Population(2020),None,None.1,None.2
2,1,Sitka,Alaska,2870.1,7434.0,1945.1,5038.0,4815.1,12471.0,8458.0
3,2,Juneau,Alaska,2704.2,7004.0,550.7,1426.0,3254.9,8430.0,32255.0
4,3,Wrangell,Alaska,2556.1,6620.0,920.6,2384.0,3476.7,9005.0,2127.0
5,4,Anchorage,Alaska,1707.0,4421.0,239.7,621.0,1946.7,5042.0,291247.0
6,5,Tribune,Kansas,778.2,2016.0,0.0,0.0,778.2,2016.0,1182.0


In [84]:
df.dtypes

Rank                 object
City                 object
State                object
Land area           float64
Water area          float64
Total area          float64
Population(2020)    float64
None                float64
None                float64
None                float64
dtype: object
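
An alternative to cleaning every cell with `applymap` is to clean column by column with the vectorized string methods and let `pd.to_numeric` choose the numeric type. This is a sketch on a hypothetical two-row excerpt of the scraped data, not the lecture's method; the column names are simplified.

```python
import pandas as pd

# Hypothetical excerpt of the scraped strings
raw = pd.DataFrame({
    "City": ["Sitka", "Tribune[note 1]*"],
    "Land area": ["2,870.1\n", "778.2\n"],
    "Population": ["8,458\n", "1,182\n"],
})

clean = raw.copy()

# Text column: strip the bracketed markers and trailing asterisks
clean["City"] = clean["City"].str.replace(r"\s*\[.*?\]\**", "", regex=True)

# Numeric columns: drop thousands separators and whitespace, then convert
for col in ["Land area", "Population"]:
    stripped = clean[col].str.replace(",", "", regex=False).str.strip()
    clean[col] = pd.to_numeric(stripped)

print(clean)
print(clean.dtypes)
```

With this approach whole-number columns such as the population can stay integers instead of being cast to float.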

### Summary 

- HTML pages are structured like a tree, similar to a file system
- use `lxml` to parse them in Python
- navigate through HTML with XPath or CSS selectors