# STA 220 Data & Web Technologies for Data Analysis

### Lecture 7, 01/29/26, Scraping


### Today's topics

 - Scraping Tables with `pandas`
 - HTML
 - XML
 - Parser
 - Extracting Elements

### Resources

* [`requests` documentation](http://docs.python-requests.org/en/master/)
* [`requests-html` documentation](https://html.python-requests.org/)
* [W3 Schools](https://www.w3schools.com/html/default.asp)
* [MDN HTML Reference](https://developer.mozilla.org/en-US/docs/Web/HTML/Element)
* [XPath Diner](http://www.topswagcode.com/xpath/) - an interactive XPath tutorial
* [CSS Diner](https://flukeout.github.io/) - an interactive CSS Selector tutorial

### Scraping Tables with `pandas`

For data in a `table` element, we can use __Pandas__ instead of writing a scraper. 
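As a minimal offline sketch of what `pd.read_html` does, the snippet below parses a small hand-written table. The HTML string is made up for illustration (the values are borrowed from the Wikipedia list used below), and the names `snippet`, `demo_tabs`, and `demo_tbl` are hypothetical. The string is wrapped in `StringIO` because recent pandas versions deprecate passing literal HTML strings directly.

```python
from io import StringIO

import pandas as pd

# Hypothetical mini page: one table with a header row and two data rows.
snippet = """
<table>
  <thead>
    <tr><th>City</th><th>ST</th><th>Land area (mi2)</th></tr>
  </thead>
  <tbody>
    <tr><td>Sitka</td><td>AK</td><td>2,870.2</td></tr>
    <tr><td>Juneau</td><td>AK</td><td>2,702.9</td></tr>
  </tbody>
</table>
"""

# read_html returns a list with one DataFrame per <table> element it finds.
demo_tabs = pd.read_html(StringIO(snippet))
demo_tbl = demo_tabs[0]

print(len(demo_tabs))
print(demo_tbl)
```

Because the result is always a list, we index into it to get the table we want, even when the page contains a single table.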

Wikipedia provides lots of useful information in tables. Let's get the Wikipedia list of [US cities by area][wiki].

[wiki]: https://en.wikipedia.org/wiki/List_of_United_States_cities_by_area

In [1]:
import pandas as pd
import requests

In [2]:
# not working, since no header is provided
tabs = pd.read_html("https://en.wikipedia.org/wiki/List_of_United_States_cities_by_area")

HTTPError: HTTP Error 403: Forbidden

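The request is rejected because it does not carry a browser-like `User-Agent` header. You can look up your own browser's user agent string at https://www.whatismybrowser.com/detect/what-is-my-user-agent/.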

In [3]:
# Working version: send a User-Agent header along with the request
import pandas as pd
import requests

url = "https://en.wikipedia.org/wiki/List_of_United_States_cities_by_area"

# Define the User-Agent header
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/144.0.0.0 Safari/537.36"
}
tabs = pd.read_html(url, storage_options=headers)
tabs

[                    Population tables of U.S. cities
 0  The skyline of New York City, the most populou...
 1                                             Cities
 2  PopulationAreaDensityEthnic identityForeign-bo...
 3                                        Urban areas
 4             Populous cities and metropolitan areas
 5                                 Metropolitan areas
 6  184 combined statistical areas935 core-based s...
 7                                        Megaregions
 8  Related population listsNorth American metro a...
 9                                                vte,
             City  ST Land area       Water area         Total area         \
             City  ST     (mi2) (km2)      (mi2)   (km2)      (mi2)  (km2)   
 0          Sitka  AK    2870.2  7434     1904.3  4932.0     4774.5  12366   
 1         Juneau  AK    2702.9  7000      555.1  1438.0     3258.0   8438   
 2       Wrangell  AK    2556.1  6620      915.0  2370.0     3471.1   8990   
 3      Anchora

In [4]:
!pip install lxml



In [5]:
type(tabs)

list

In [6]:
len(tabs)

2

In [7]:
tabs[0]  # overview box shown on the right-hand side of the page

Unnamed: 0,Population tables of U.S. cities
0,"The skyline of New York City, the most populou..."
1,Cities
2,PopulationAreaDensityEthnic identityForeign-bo...
3,Urban areas
4,Populous cities and metropolitan areas
5,Metropolitan areas
6,184 combined statistical areas935 core-based s...
7,Megaregions
8,Related population listsNorth American metro a...
9,vte


In [8]:
tabs[1]

Unnamed: 0_level_0,City,ST,Land area,Land area,Water area,Water area,Total area,Total area,Population (2020)
Unnamed: 0_level_1,City,ST,(mi2),(km2),(mi2),(km2),(mi2),(km2),Population (2020)
0,Sitka,AK,2870.2,7434,1904.3,4932.0,4774.5,12366,8458
1,Juneau,AK,2702.9,7000,555.1,1438.0,3258.0,8438,32255
2,Wrangell,AK,2556.1,6620,915.0,2370.0,3471.1,8990,2127
3,Anchorage,AK,1706.8,4421,237.7,616.0,1944.5,5036,291247
4,Tribune[a]*,KS,778.2,2016,0.0,0.0,778.2,2016,1182
...,...,...,...,...,...,...,...,...,...
145,Toledo,OH,80.5,208,3.3,8.5,83.8,217,270871
146,Jonesboro,AR,80.2,208,0.6,1.6,80.7,209,78576
147,El Reno,OK,79.6,206,0.6,1.6,80.2,208,16989
148,Ellsworth,ME,79.3,205,14.6,38.0,93.9,243,8399


In [9]:
tbl = tabs[1]
tbl.head()

Unnamed: 0_level_0,City,ST,Land area,Land area,Water area,Water area,Total area,Total area,Population (2020)
Unnamed: 0_level_1,City,ST,(mi2),(km2),(mi2),(km2),(mi2),(km2),Population (2020)
0,Sitka,AK,2870.2,7434,1904.3,4932.0,4774.5,12366,8458
1,Juneau,AK,2702.9,7000,555.1,1438.0,3258.0,8438,32255
2,Wrangell,AK,2556.1,6620,915.0,2370.0,3471.1,8990,2127
3,Anchorage,AK,1706.8,4421,237.7,616.0,1944.5,5036,291247
4,Tribune[a]*,KS,778.2,2016,0.0,0.0,778.2,2016,1182


To process this information, unusable items have to be removed. We are going to do that with `regex` (recall the discussion section)!

In [10]:
from re import sub

def remove(string):
    '''
    Remove a bracketed note such as [a], any whitespace before it,
    and any asterisks directly after it. Non-strings pass through unchanged.
    '''
    if isinstance(string, str):
        # \s*     zero or more whitespace characters (spaces, newlines, ...)
        # \[.*\]  an opening bracket, any characters, a closing bracket
        # \**     zero or more literal asterisks
        # Together this strips the "[a]*" after "Tribune".
        string = sub(r'\s*\[.*\]\**', '', string)
    return string

In [11]:
tbl.iloc[4,0]

'Tribune[a]*'

In [12]:
remove(tbl.iloc[4,0])

'Tribune'

In [13]:
remove(1706.8)

1706.8

In [14]:
remove('First text [some random text]*')

'First text'

In [15]:
remove('First text[some random text]*')

'First text'

In [16]:
remove('First text*')

'First text*'

In [17]:
remove('First text[]*')

'First text'

In [18]:
remove('First text[some random text]')

'First text'

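Only the square brackets are mandatory: the leading whitespace and the trailing asterisks are optional, so a string without brackets, such as `'First text*'`, is returned unchanged.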
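One thing to keep in mind with this pattern is that `.*` is greedy. The sketch below uses a made-up string with two bracketed notes to show the difference between the greedy pattern from `remove` and a non-greedy variant (`.*?`), which is offered here only as a possible alternative, not as part of the lecture's function.

```python
from re import sub

# Hypothetical cell text with two bracketed notes.
text = "Tribune[a] and Anaconda[b]*"

# Greedy: .* runs to the LAST closing bracket, swallowing the text in between.
greedy = sub(r"\s*\[.*\]\**", "", text)

# Non-greedy: .*? stops at the FIRST closing bracket, so each note is removed separately.
lazy = sub(r"\s*\[.*?\]\**", "", text)

print(repr(greedy))
print(repr(lazy))
```

For the Wikipedia table this makes no difference, since each cell has at most one note, but the non-greedy form is the safer choice when a cell may contain several.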

In [19]:
tbl.columns

MultiIndex([(             'City',              'City'),
            (               'ST',                'ST'),
            (        'Land area',             '(mi2)'),
            (        'Land area',             '(km2)'),
            (       'Water area',             '(mi2)'),
            (       'Water area',             '(km2)'),
            (       'Total area',             '(mi2)'),
            (       'Total area',             '(km2)'),
            ('Population (2020)', 'Population (2020)')],
           )

In [20]:
tbl.columns = [remove(i) for i in tbl.columns]  # the labels are tuples, not strings, so this only flattens the MultiIndex

In [21]:
tbl.columns

Index([                          ('City', 'City'),
                                     ('ST', 'ST'),
                           ('Land area', '(mi2)'),
                           ('Land area', '(km2)'),
                          ('Water area', '(mi2)'),
                          ('Water area', '(km2)'),
                          ('Total area', '(mi2)'),
                          ('Total area', '(km2)'),
       ('Population (2020)', 'Population (2020)')],
      dtype='object')
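Note that the column labels are now tuples: `remove` only touches strings, so each tuple passes through unchanged and the list comprehension merely turns the `MultiIndex` into a flat index of tuples. If single string labels are preferred, the two header levels can be joined. The following is a minimal sketch on a small stand-in frame; the name `demo` and the joining rule are assumptions made for illustration.

```python
import pandas as pd

# Small stand-in for the scraped table, with a two-level header.
columns = pd.MultiIndex.from_tuples([
    ("City", "City"),
    ("Land area", "(mi2)"),
    ("Land area", "(km2)"),
])
demo = pd.DataFrame([["Sitka", 2870.2, 7434]], columns=columns)

# Join the two header levels; keep a single label when both levels repeat it.
demo.columns = [
    top if top == bottom else f"{top} {bottom}"
    for top, bottom in demo.columns
]

print(list(demo.columns))
```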

In [22]:
tbl = tbl.map(remove)  # apply remove to every cell of the table

In [23]:
tbl.head()

Unnamed: 0,"(City, City)","(ST, ST)","(Land area, (mi2))","(Land area, (km2))","(Water area, (mi2))","(Water area, (km2))","(Total area, (mi2))","(Total area, (km2))","(Population (2020), Population (2020))"
0,Sitka,AK,2870.2,7434,1904.3,4932.0,4774.5,12366,8458
1,Juneau,AK,2702.9,7000,555.1,1438.0,3258.0,8438,32255
2,Wrangell,AK,2556.1,6620,915.0,2370.0,3471.1,8990,2127
3,Anchorage,AK,1706.8,4421,237.7,616.0,1944.5,5036,291247
4,Tribune,KS,778.2,2016,0.0,0.0,778.2,2016,1182


In [24]:
tbl.dtypes

(City, City)                                  str
(ST, ST)                                      str
(Land area, (mi2))                        float64
(Land area, (km2))                          int64
(Water area, (mi2))                       float64
(Water area, (km2))                       float64
(Total area, (mi2))                       float64
(Total area, (km2))                         int64
(Population (2020), Population (2020))      int64
dtype: object

### HTML

Web pages are written in _hypertext markup language_ (HTML). HTML files (`.htm` or `.html`) are plain text, just like JSON, Python scripts, and R scripts.

In HTML, we use _tags_ to create _elements_ of a web page. Elements add formatting and structure to the page.

* Tags usually come in pairs: an opening tag and a closing tag.
* Tags are written `<NAME>` for opening tags, `</NAME>` for closing tags, and `<NAME />` for singleton tags.
* Opening and singleton tags can have _attributes_ that contain additional information. Attributes are written `ATTRIBUTE=VALUE` after the tag name. 

See [here](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/HTML_basics) for a more detailed explanation, and [here](https://developer.mozilla.org/en-US/docs/Web/HTML/Element) for a list of valid HTML elements.

#### Example

From now on, we will use an artificial example:

```html
<p>This page is famous and this <strong>word</strong> is emphasized.</p>
```

```html
<p>This <a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ">page</a> is famous and this <strong>word</strong> is emphasized.</p>
```

```html
<li>1. Something</li>
```

<p>This page is famous and this <strong>word</strong> is emphasized.</p>
<p>This <a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ">page</a> is famous and this <strong>word</strong> is emphasized.</p>
<li>1. Something</li>

The `p` tag marks a paragraph, the `a` tag marks a link (an _anchor_), the `strong` tag marks emphasized text,
and the `li` tag marks a list item.

Here's a string that contains HTML for a simple, complete web page:

In [25]:
page = """
<html> 
<head>
    <title>This is the Title!</title>
</head>

<body>
    <p>This is a paragraph!</p>
    <p id="best-paragraph">This is another paragraph! &#127790;</p>
    <p>Visit <a href="https://pudding.cool">The Pudding</a>.</p>
    <span>This is a span, it comes with an taco &#127790;</span>

    <p>This is a new paragraph!</p>
    <p><a href="https://pudding.cool">The Pudding</a></p>
</body>

</html>
""" 

In [26]:
page

'\n<html> \n<head>\n    <title>This is the Title!</title>\n</head>\n\n<body>\n    <p>This is a paragraph!</p>\n    <p id="best-paragraph">This is another paragraph! &#127790;</p>\n    <p>Visit <a href="https://pudding.cool">The Pudding</a>.</p>\n    <span>This is a span, it comes with an taco &#127790;</span>\n\n    <p>This is a new paragraph!</p>\n    <p><a href="https://pudding.cool">The Pudding</a></p>\n</body>\n\n</html>\n'

<html> 
<head>
    <title>This is the Title!</title>
</head>

<body>
    <p>This is a paragraph!</p>
    <p id="best-paragraph">This is another paragraph! &#127790;</p>
    <p>Visit <a href="https://pudding.cool">The Pudding</a>.</p>
    <span>This is a span, it comes with an taco &#127790;</span>

    <p>This is a new paragraph!</p>
    <p><a href="https://pudding.cool">The Pudding</a></p>
</body>

</html>

The `<span>` tag is an inline container used to mark up a part of a text, or a part of a document.
    
For example, the following code colors a single word:

```html
<p>My hat is <span style="color:blue">blue</span>.</p>
```
    
<p>My hat is <span style="color:blue">blue</span>.</p>     

### XML

_Extensible markup language_ (XML) also uses tags to create elements. We say XML is _extensible_ because you can create your own XML elements (unlike HTML). People typically use XML to describe structure and meaning of data, rather than for formatting.

We'll use the same process to extract data from both HTML and XML.
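To make the idea of self-defined elements concrete, here is a minimal sketch that parses a small XML document with `lxml.etree`. The tag and attribute names (`cities`, `city`, `land_area`, `state`, `unit`) are invented for this example; the values are taken from the Wikipedia table above.

```python
import lxml.etree as et

# Hypothetical XML: the tag and attribute names are invented for this example.
xml_doc = """
<cities>
    <city state="AK">
        <name>Sitka</name>
        <land_area unit="mi2">2870.2</land_area>
    </city>
    <city state="AK">
        <name>Juneau</name>
        <land_area unit="mi2">2702.9</land_area>
    </city>
</cities>
"""

root = et.fromstring(xml_doc)

# Each <city> element becomes one record; the tags describe what the data mean.
records = [
    {
        "name": city.findtext("name"),
        "state": city.get("state"),
        "land_area": float(city.findtext("land_area")),
    }
    for city in root.findall("city")
]

print(records)
```

Here the tags carry meaning (a city, its name, its land area) rather than formatting, which is the typical use of XML.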

### Parser

A _parser_ converts formatted data into familiar data structures. We've used __requests__' built-in JSON parser, but the package doesn't have a built-in HTML/XML parser. Fortunately, there are many other Python packages for parsing HTML/XML and web scraping.

HTML/XML Parsers:
* [lxml](https://lxml.de/)
* [html5lib](https://github.com/html5lib/html5lib-python)
* [beautifulsoup](https://www.crummy.com/software/BeautifulSoup/)
* [requests-html](https://docs.python-requests.org/projects/requests-html/en/latest/)

Scraper Frameworks (_convenient after learning the basics with parsers_):
* [scrapy](https://scrapy.org/)
* [newspaper3k](https://github.com/codelucas/newspaper)

Even more [here](https://github.com/lorien/awesome-web-scraping/blob/master/python.md#web-scraping-frameworks).

We'll use __lxml__ here (check the [doc](https://lxml.de/apidoc/index.html)), but you're welcome to use other packages on assignments and the project. 

In [27]:
import lxml.html as lx

html = lx.fromstring(page)
html

<Element html at 0x17cb83410>

In [28]:
page

'\n<html> \n<head>\n    <title>This is the Title!</title>\n</head>\n\n<body>\n    <p>This is a paragraph!</p>\n    <p id="best-paragraph">This is another paragraph! &#127790;</p>\n    <p>Visit <a href="https://pudding.cool">The Pudding</a>.</p>\n    <span>This is a span, it comes with an taco &#127790;</span>\n\n    <p>This is a new paragraph!</p>\n    <p><a href="https://pudding.cool">The Pudding</a></p>\n</body>\n\n</html>\n'

#### Finding Elements

Elements are nested, so an HTML document is like a tree:
```
html
├── head
│   └── title
└── body
    ├── p
    ├── p
    ├── p
    │   └── a
    ├── span
    ├── p
    └── p
        └── a
```
This is similar to the file system on your computer. The key difference is that elements at the same level can have the same tag name.
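The tree can also be printed from a parsed document. The sketch below uses a shortened copy of the example page; `demo_page`, `demo_root`, and the helper `show_tree` are hypothetical names introduced for illustration. It relies on the fact that iterating over an lxml element yields its direct children.

```python
import lxml.html as lx

# Shortened copy of the example page (assumption: no comments in the markup).
demo_page = """<html>
<head><title>This is the Title!</title></head>
<body>
    <p>This is a paragraph!</p>
    <p id="best-paragraph">This is another paragraph!</p>
    <p>Visit <a href="https://pudding.cool">The Pudding</a>.</p>
    <span>This is a span</span>
</body>
</html>"""

demo_root = lx.fromstring(demo_page)

def show_tree(element, depth=0):
    """Return one indented line per element, children listed below their parent."""
    lines = ["  " * depth + element.tag]
    for child in element:  # iterating an element yields its direct children
        lines.extend(show_tree(child, depth + 1))
    return lines

print("\n".join(show_tree(demo_root)))
```

The three `p` lines at the same depth illustrate the point above: siblings may share a tag name, which is why a path can match several elements.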

#### XPath

The _XML Path Language_ (XPath) lets us write paths to elements. XPath paths look a lot like file paths. XPath is not Python-specific!

The `.xpath()` method gets all elements at an XPath path:

In [29]:
html.xpath("/html/head/title")

[<Element title at 0x17cb83350>]

In [30]:
html.xpath("/html/body/p/a")

[<Element a at 0x17cb82750>, <Element a at 0x17cb83470>]

Since there may be more than one element, the method always returns a list.

Absolute paths are not robust for scraping. An update to a web page that adds a single tag can break a scraper that uses absolute paths. In XPath, `//` means "anywhere below". We'll use `//` often because it's more robust:

In [31]:
html.xpath("//p/a")

[<Element a at 0x17cb82750>, <Element a at 0x17cb83470>]
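To see why `//` is more robust, the following sketch compares an absolute path and a `//` path on two versions of a tiny page. Both page strings are made up: `new_page` simulates a hypothetical site update that wraps the paragraph in a `div`.

```python
import lxml.html as lx

old_page = '<html><body><p><a href="https://pudding.cool">The Pudding</a></p></body></html>'
# Hypothetical update: the same content, now wrapped in a <div>.
new_page = '<html><body><div><p><a href="https://pudding.cool">The Pudding</a></p></div></body></html>'

old_doc = lx.fromstring(old_page)
new_doc = lx.fromstring(new_page)

# Number of links found in each version of the page.
absolute = [len(doc.xpath("/html/body/p/a")) for doc in (old_doc, new_doc)]
anywhere = [len(doc.xpath("//p/a")) for doc in (old_doc, new_doc)]

print(absolute)
print(anywhere)
```

The absolute path finds the link only in the old version, while `//p/a` finds it in both.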

What if we just want elements that satisfy a certain condition? In XPath, `[ ]` filters out elements that don't match a condition. For example:

In [32]:
html.xpath("//p[@id = 'best-paragraph']")

[<Element p at 0x17cb83530>]
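A few more conditions, sketched on a shortened copy of the example page (`demo_page` and `doc` are hypothetical names). Besides attribute tests, a predicate can check for a child element, a position (counted from 1), or a piece of text, and `@href` at the end of a path returns attribute values instead of elements.

```python
import lxml.html as lx

demo_page = """
<html>
<body>
    <p>This is a paragraph!</p>
    <p id="best-paragraph">This is another paragraph!</p>
    <p>Visit <a href="https://pudding.cool">The Pudding</a>.</p>
    <span>This is a span</span>
    <p>This is a new paragraph!</p>
    <p><a href="https://pudding.cool">The Pudding</a></p>
</body>
</html>
"""
doc = lx.fromstring(demo_page)

with_link = doc.xpath("//p[a]")                       # paragraphs that have an <a> child
second = doc.xpath("//body/p[2]")                     # second <p> child of body (1-based)
new_par = doc.xpath("//p[contains(text(), 'new')]")   # paragraphs whose text contains 'new'
hrefs = doc.xpath("//a/@href")                        # attribute values, returned as strings

print(len(with_link), second[0].get("id"), [p.text for p in new_par], hrefs)
```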

[XPath Diner](http://www.topswagcode.com/xpath/) is an interactive tutorial that teaches most of the XPath syntax. It takes about 20-60 minutes. Work through it to become an XPath ninja! 

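You can copy the XPath of an element from the browser's developer tools (in Chrome: right-click the element in the Elements panel, then Copy > Copy XPath). The result looks like this: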

In [33]:
'//*[@id="mw-content-text"]/div[1]/table[2]/tbody/tr[7]/td[3]'

'//*[@id="mw-content-text"]/div[1]/table[2]/tbody/tr[7]/td[3]'

#### CSS Selectors

_Cascading Style Sheets_ (CSS) is a language for styling elements in an HTML document. CSS provides another way to select elements, called _CSS selectors_.

CSS selectors are more concise but less flexible than XPath paths. The `.cssselect()` method gets all elements at a CSS selector:

In [34]:
html.cssselect("a")

ImportError: cssselect does not seem to be installed. See https://pypi.org/project/cssselect/

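The `ImportError` above shows that `.cssselect()` relies on the separate [cssselect](https://pypi.org/project/cssselect/) package, which can be installed with `pip install cssselect`.

Check out the [CSS Diner](https://flukeout.github.io/)!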

### Extracting Text and Attributes

There are two ways to get text from an element:

* `.text` gives text inside the element, but not its children
* `.text_content()` gives text inside the element and its children, with all tags removed
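The difference shows up as soon as an element mixes text and child elements. The sketch below parses one paragraph from the example page on its own (`par` and `link` are hypothetical names). It also uses `.tail`, an lxml attribute that holds the text directly after an element's closing tag.

```python
import lxml.html as lx

par = lx.fromstring('<p>Visit <a href="https://pudding.cool">The Pudding</a>.</p>')
link = par.xpath("./a")[0]

print(repr(par.text))            # text before the first child only
print(repr(par.text_content()))  # text of the element and all its children
print(repr(link.text))           # text inside the link
print(repr(link.tail))           # text that follows the link's closing tag
```

So `.text` on the paragraph stops at the link, whereas `.text_content()` returns the full sentence.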

In [35]:
page

'\n<html> \n<head>\n    <title>This is the Title!</title>\n</head>\n\n<body>\n    <p>This is a paragraph!</p>\n    <p id="best-paragraph">This is another paragraph! &#127790;</p>\n    <p>Visit <a href="https://pudding.cool">The Pudding</a>.</p>\n    <span>This is a span, it comes with an taco &#127790;</span>\n\n    <p>This is a new paragraph!</p>\n    <p><a href="https://pudding.cool">The Pudding</a></p>\n</body>\n\n</html>\n'

In [36]:
html.text_content()

' \n\n    This is the Title!\n\n\n\n    This is a paragraph!\n    This is another paragraph! 🌮\n    Visit The Pudding.\n    This is a span, it comes with an taco 🌮\n\n    This is a new paragraph!\n    The Pudding\n\n\n'

In [37]:
a = html.xpath("//a")[0]

In [38]:
a.text_content()

'The Pudding'

In [39]:
a.text

'The Pudding'

In [40]:
html.text_content()

' \n\n    This is the Title!\n\n\n\n    This is a paragraph!\n    This is another paragraph! 🌮\n    Visit The Pudding.\n    This is a span, it comes with an taco 🌮\n\n    This is a new paragraph!\n    The Pudding\n\n\n'

In [41]:
html.text

' \n'

We can get values from attributes on an element with `.attrib`, which is a dictionary:

In [42]:
a.attrib["href"]

'https://pudding.cool'

In [43]:
[x.attrib["href"] for x in html.xpath("//a")]

['https://pudding.cool', 'https://pudding.cool']
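Indexing `.attrib` raises a `KeyError` when the attribute is missing, which can stop a scraper halfway through a page. As a small sketch, the element's `.get()` method returns `None` (or a default of your choice) instead; `link` and the `title` attribute are just illustrative names here.

```python
import lxml.html as lx

link = lx.fromstring('<a href="https://pudding.cool">The Pudding</a>')

href = link.get("href")                  # same value as link.attrib["href"]
missing = link.get("title")              # attribute not present: returns None
fallback = link.get("title", "no title") # or a default value of our choice

try:
    link.attrib["title"]
    raised = False
except KeyError:
    raised = True

print(href, missing, fallback, raised)
```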

### Writing Scrapers

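Let's scrape the Wikipedia table ourselves. Note that we are using __requests__ here, so what we parse is the raw HTML file the server returns, which may differ from the page the browser displays. In the developer tools, compare the `<thead>` element shown in the element inspector with the response shown in the Network tab.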

In [44]:
import requests

# Define the User-Agent header
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
}

result = requests.get(url = 'https://en.wikipedia.org/wiki/List_of_United_States_cities_by_area', headers = headers)
html = lx.fromstring(result.text)

In [45]:
tables = html.xpath('//table')

In [46]:
tables

[<Element table at 0x17cfd12b0>, <Element table at 0x17cfd0cb0>]

In [47]:
table = tables[1]

In [48]:
table.text_content()

'\n\n\nCity\n\nST\n\nLand area\n\nWater area\n\nTotal area\n\nPopulation(2020)\n\n\n(mi2)\n(km2)\n(mi2)\n(km2)\n(mi2)\n(km2)\n\n\nSitka\nAK\n2,870.2\n\n7,434\n1,904.3\n\n4,932\n4,774.5\n\n12,366\n8,458\n\n\nJuneau\nAK\n2,702.9\n\n7,000\n555.1\n\n1,438\n3,258.0\n\n8,438\n32,255\n\n\nWrangell\nAK\n2,556.1\n\n6,620\n915.0\n\n2,370\n3,471.1\n\n8,990\n2,127\n\n\nAnchorage\nAK\n1,706.8\n\n4,421\n237.7\n\n616\n1,944.5\n\n5,036\n291,247\n\n\nTribune[a]*\nKS\n778.2\n\n2,016\n0\n\n0\n778.2\n\n2,016\n1,182\n\n\nJacksonville\nFL\n747.3\n\n1,935\n127.1\n\n329\n874.5\n\n2,265\n949,611\n\n\nAnaconda *\n\nMT\n736.7\n\n1,908\n4.7\n\n12\n741.4\n\n1,920\n9,421\n\n\nButte *\nMT\n715.8\n\n1,854\n0.6\n\n1.6\n716.3\n\n1,855\n34,494\n\n\nHouston\nTX\n640.8\n\n1,660\n31.2\n\n81\n672.0\n\n1,740\n2,304,580\n\n\nOklahoma City\nOK\n607.0\n\n1,572\n14.3\n\n37\n621.3\n\n1,609\n681,054\n\n\nPhoenix\nAZ\n518.4\n\n1,343\n1.0\n\n2.6\n519.4\n\n1,345\n1,608,139\n\n\nSan Antonio\nTX\n499.0\n\n1,292\n5.7\n\n15\n504.7\n\n1,3

In [49]:
html.xpath('//table[2]/thead')

[]

In [50]:
html.xpath('//table[2]/tbody')

[<Element tbody at 0x17cfd1430>]
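A caveat about paths like `//table[2]`: the index applies per parent, so the expression selects every table that is the second `table` child of its parent, not the second table in the document. Wrapping the path in parentheses, `(//table)[2]`, indexes the full result list instead. The sketch below uses a made-up page in which the two tables sit in different `div` elements.

```python
import lxml.html as lx

# Hypothetical page: each table is the FIRST table inside its own <div>.
doc = lx.fromstring(
    "<html><body>"
    "<div><table id='a'></table></div>"
    "<div><table id='b'></table></div>"
    "</body></html>"
)

per_parent = doc.xpath("//table[2]")    # second table child of some parent
overall = doc.xpath("(//table)[2]")     # second table in document order

print(per_parent)
print([t.get("id") for t in overall])
```

On the Wikipedia page both forms return the same table, but only the parenthesized form matches `tables[1]` in general.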

In [51]:
def retrieve_rows(html):
    # (//table)[2] picks the second table in the whole document
    rows = html.xpath('(//table)[2]/tbody/tr')
    cells = []
    for row in rows:
        # Start at the current row (./) and take its td or th children.
        # Use text_content() rather than .text, as some cell text sits inside child tags such as <b>.
        cells.append([cell.text_content() for cell in row.xpath('./td | ./th')])
    return cells

In [52]:
retrieve_rows(html)

[['City\n',
  'ST\n',
  'Land area\n',
  'Water area\n',
  'Total area\n',
  'Population(2020)\n'],
 ['(mi2)', '(km2)', '(mi2)', '(km2)', '(mi2)', '(km2)\n'],
 ['Sitka',
  'AK',
  '2,870.2\n',
  '7,434',
  '1,904.3\n',
  '4,932',
  '4,774.5\n',
  '12,366',
  '8,458\n'],
 ['Juneau',
  'AK',
  '2,702.9\n',
  '7,000',
  '555.1\n',
  '1,438',
  '3,258.0\n',
  '8,438',
  '32,255\n'],
 ['Wrangell',
  'AK',
  '2,556.1\n',
  '6,620',
  '915.0\n',
  '2,370',
  '3,471.1\n',
  '8,990',
  '2,127\n'],
 ['Anchorage',
  'AK',
  '1,706.8\n',
  '4,421',
  '237.7\n',
  '616',
  '1,944.5\n',
  '5,036',
  '291,247\n'],
 ['Tribune[a]*',
  'KS',
  '778.2\n',
  '2,016',
  '0\n',
  '0',
  '778.2\n',
  '2,016',
  '1,182\n'],
 ['Jacksonville',
  'FL',
  '747.3\n',
  '1,935',
  '127.1\n',
  '329',
  '874.5\n',
  '2,265',
  '949,611\n'],
 ['Anaconda *\n',
  'MT',
  '736.7\n',
  '1,908',
  '4.7\n',
  '12',
  '741.4\n',
  '1,920',
  '9,421\n'],
 ['Butte *',
  'MT',
  '715.8\n',
  '1,854',
  '0.6\n',
  '1.6',
  '716

In [53]:
df = pd.DataFrame(retrieve_rows(html))
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,City\n,ST\n,Land area\n,Water area\n,Total area\n,Population(2020)\n,,,
1,(mi2),(km2),(mi2),(km2),(mi2),(km2)\n,,,
2,Sitka,AK,"2,870.2\n",7434,"1,904.3\n",4932,"4,774.5\n",12366.0,"8,458\n"
3,Juneau,AK,"2,702.9\n",7000,555.1\n,1438,"3,258.0\n",8438.0,"32,255\n"
4,Wrangell,AK,"2,556.1\n",6620,915.0\n,2370,"3,471.1\n",8990.0,"2,127\n"
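The first two rows hold fewer cells than the data rows, so the header labels do not line up with the nine data columns. A likely reason, assumed here rather than checked against the page source, is that the area headers span two columns each via the `colspan` attribute. The sketch below shows one way to repeat a spanning cell so that every row has the same width. The mini table and the helper `expand` are hypothetical, and `rowspan` is not handled.

```python
import lxml.html as lx

# Hypothetical mini table that mimics a two-row header with a spanning cell.
snippet = """
<table>
  <tr><th>City</th><th colspan="2">Land area</th></tr>
  <tr><th></th><th>(mi2)</th><th>(km2)</th></tr>
  <tr><td>Sitka</td><td>2,870.2</td><td>7,434</td></tr>
</table>
"""
table = lx.fromstring(snippet)

def expand(row):
    """Return the row's cell texts, repeating a cell once per column it spans."""
    cells = []
    for cell in row.xpath("./td | ./th"):
        span = int(cell.get("colspan", 1))
        cells.extend([cell.text_content().strip()] * span)
    return cells

rows = [expand(tr) for tr in table.xpath(".//tr")]

# Combine the two header rows into one label per column.
labels = [f"{top} {bottom}".strip() for top, bottom in zip(rows[0], rows[1])]

print(labels)
print(rows[2])
```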


In [None]:
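# Row 0 holds only 6 labels for 9 columns (each area header covers two columns), so name the columns explicitly.
df.columns = ['City', 'ST', 'Land area (mi2)', 'Land area (km2)', 'Water area (mi2)',
              'Water area (km2)', 'Total area (mi2)', 'Total area (km2)', 'Population (2020)']
df = df.drop(index=range(2))  # drop the two header rows
df.head()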

In [None]:
# Keep City, ST, the three (mi2) area columns, and the population; drop the (km2) columns
df = df.iloc[:, [True, True, True, False, True, False, True, False, True]]

In [None]:
df

In [None]:
df.dtypes

In [None]:
from re import sub

def remove(string):
    '''
    Remove bracketed notes such as [a] (with any whitespace before and
    asterisks after them), plus newlines, commas, and stray asterisks.
    '''
    if isinstance(string, str):
        # Alternatives separated by | are tried in turn:
        #   \s*\[.*\]\**  whitespace, a bracketed note, trailing asterisks
        #   \n  ,  \*     a newline, a comma, or a literal asterisk
        # This strips the "[a]*" after "Tribune" and the "\n" in the header cells.
        string = sub(r'\s*\[.*\]\**|\n|,|\*', '', string)
    return string
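A quick check of the extended pattern on a few cell values from the scraped rows. Note that a space before a stray asterisk (as in `'Anaconda *\n'`) is not covered by any alternative, so it survives. Calling `.strip()` afterwards is one possible remedy, suggested here as an addition rather than part of the function above.

```python
from re import sub

pattern = r'\s*\[.*\]\**|\n|,|\*'

# Cell values as they appear in the scraped rows.
samples = ['Tribune[a]*', 'Anaconda *\n', '2,870.2\n', 'Population(2020)\n']

cleaned = [sub(pattern, '', s) for s in samples]
stripped = [s.strip() for s in cleaned]  # also drop leftover leading/trailing spaces

print(cleaned)
print(stripped)
```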

In [None]:
df.columns = [remove(i) for i in df.columns] # remove from table columns
df = df.map(remove) #remove from all rows
df.head()

In [None]:
df

In [None]:
for col in df.columns[2:]:  # every column after City and ST holds numbers
    df[col] = df[col].astype(float)

In [None]:
df.head()

In [None]:
df.dtypes

### Summary 

- HTML pages are structured as a tree of nested elements, similar to a file system
- Use `lxml` to parse them in Python
- Navigate through the tree with XPath or CSS selectors