# STA 141B Data & Web Technologies for Data Analysis

### Lecture 9, 2/8/24, Scraping


### Announcements

 - Homework 2 is due tomorrow. Late submissions will not be accepted.

### Today's topics

 - Scraping Tables with `pandas`
 - HTML
 - XML
 - Parser
 - Extracting Elements

### Resources

* [`requests` documentation](http://docs.python-requests.org/en/master/)
* [`requests-html` documentation](https://html.python-requests.org/)
* [W3 Schools](https://www.w3schools.com/html/default.asp)
* [MDN HTML Reference](https://developer.mozilla.org/en-US/docs/Web/HTML/Element)
* [XPath Diner](http://www.topswagcode.com/xpath/) - an interactive XPath tutorial
* [CSS Diner](https://flukeout.github.io/) - an interactive CSS Selector tutorial

#### Safeway

Check the [docs](https://requests.readthedocs.io/en/latest/api/?requests.get)!

In [1]:
import requests

In [4]:
url = 'https://www.safeway.com/abs/pub/xapi/pgmsearch/v1/search/products'
params = {
        "request-id":9541707334938977135,
    "url":"https://www.safeway.com",
    "pageurl":"https://www.safeway.com",
    "pagename":"search",
    "rows":30,
    "start":0,
    "search-type":"keyword",
    "storeid":3132,
    "featured":"true",
    "search-uid":"",
    "q":"eggs",
    "sort":"",
    "featuredsessionid":"",
    "screenwidth":394,
    "dvid":"web-4.1search",
    "channel":"instore",
    "wineshopstoreid":5799,
    "wineshopwidgetid":"nlvkox9e",
    "timezone":"America/Los_Angeles",
    "zipcode":94611,
    "visitorId":"d42daa35-78ae-48dc-9df7-3c163dd79bc2",
    "pgm":"intg-search,wineshop",
    "banner":"safeway",
    "variant":"ACIP134410_a"
}
header = {
    'Ocp-Apim-Subscription-Key': '5e790236c84e46338f4290aa1050cdd4', 
}

In [5]:
results = requests.get(url, params=params)
results.raise_for_status()

HTTPError: 401 Client Error: Unauthorized for url: https://www.safeway.com/abs/pub/xapi/pgmsearch/v1/search/products?request-id=9541707334938977135&url=https%3A%2F%2Fwww.safeway.com&pageurl=https%3A%2F%2Fwww.safeway.com&pagename=search&rows=30&start=0&search-type=keyword&storeid=3132&featured=true&search-uid=&q=eggs&sort=&featuredsessionid=&screenwidth=394&dvid=web-4.1search&channel=instore&wineshopstoreid=5799&wineshopwidgetid=nlvkox9e&timezone=America%2FLos_Angeles&zipcode=94611&visitorId=d42daa35-78ae-48dc-9df7-3c163dd79bc2&pgm=intg-search%2Cwineshop&banner=safeway&variant=ACIP134410_a

In [6]:
results = requests.get(url, params=params, headers=header)
results.json()

{'appMsg': '[PS: Success.]',
 'pgmList': [{'response': {'numFound': 0,
    'start': 0,
    'miscInfo': {'attributionToken': 'rQHwrAoMCJi8j64GEJmY8pcCEAEaJDY1ZTQ0MDU4LTAwMDAtMjVkZi04OTdiLWM4MmFkZDZjMmQxYyokZDQyZGFhMzUtNzhhZS00OGRjLTlkZjctM2MxNjNkZDc5YmMyMizHy_MX9pmEIqOAlyK4maEiwvCeFaaL7xeOvp0V1LKdFYz3pyKpwP0sifenIjobd2ViLXdpbmVzaG9wLXNlcnZpbmctY29uZmlnSAFYAWgB',
     'query': 'eggs',
     'filter': '(inventory(5799, price) >0) AND (inventory(5799,attributes.ship_to_store) = 1) AND (inventory(5799, attributes.shipping): ANY("1", "2") AND  ((attributes.wineEligible: ANY("WE","ALL")))  AND attributes.is_market_place_product : ANY("false"))'},
    'shippingInfo': {'arriveByTS': '2024-02-09T20:00:00Z',
     'displayArriveByTS': '2024-02-09T12:00:00',
     'shippingType': 'REGULAR',
     'minArriveByTS': '2024-02-08T20:00:00Z',
     'displayMinArriveByTS': '2024-02-08T12:00:00'},
    'isExactMatch': False},
   'appCode': 'GR204 PS: 200',
   'appMsg': 'GR : Google Retail returned empty respons

In [None]:
results.raise_for_status()  # call the method; without parentheses nothing is checked
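
The two requests above differ only in the `headers` argument. As a minimal sketch of how `params` and `headers` end up in a request, the cell below builds a request without sending it, so it needs no network access. The endpoint, parameter values, and key are hypothetical placeholders, not the Safeway API.

```python
import requests

# Hypothetical endpoint, parameters, and key, used only to illustrate the mechanics.
demo_url = "https://example.com/search"
demo_params = {"q": "eggs", "rows": 30}
demo_headers = {"X-Api-Key": "not-a-real-key"}

# Build the request without sending it.
prepared = requests.Request(
    "GET", demo_url, params=demo_params, headers=demo_headers
).prepare()

print(prepared.url)                   # the query string is built from demo_params
print(prepared.headers["X-Api-Key"])  # the header travels separately from the URL
```

Inspecting a prepared request like this can help when comparing what your code sends with what the browser sends in the Network tab.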

### Scraping Tables with `pandas`

For data in a `table` element, we can use __Pandas__ instead of writing a scraper. 

Wikipedia provides lots of useful information in tables. Let's get the Wikipedia list of [US cities by area][wiki].

[wiki]: https://en.wikipedia.org/wiki/List_of_United_States_cities_by_area

In [None]:
import pandas as pd

In [None]:
tabs = pd.read_html("https://en.wikipedia.org/wiki/List_of_United_States_cities_by_area")

In [None]:
type(tabs)

In [None]:
len(tabs)

In [None]:
tabs[1]

In [None]:
tbl = tabs[1]
tbl.head()
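
To see what `pd.read_html` returns without depending on Wikipedia, here is a small sketch on an inline HTML string. The two tables and their values are made up for illustration. It assumes an HTML parser that pandas can use (such as `lxml`) is installed.

```python
from io import StringIO

import pandas as pd

# Made-up HTML with two <table> elements.
snippet = """
<table>
  <tr><th>City</th><th>Area</th></tr>
  <tr><td>Alpha[1]</td><td>10.5</td></tr>
  <tr><td>Beta</td><td>7.25</td></tr>
</table>
<table>
  <tr><th>Code</th></tr>
  <tr><td>AB</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per table it finds.
demo_tabs = pd.read_html(StringIO(snippet))
print(len(demo_tabs))
print(demo_tabs[0])
```

Note that the footnote marker `[1]` survives in the `City` column, which is why a cleaning step is still needed.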

To process this information, we first have to remove unusable items such as bracketed footnote markers. We are going to do that with a regular expression (`regex`). We will learn more about `regex` later on.

In [None]:
from re import sub 
def remove(string):
    '''
    Removes everything inside [], a whitespace before that and *'s.
    '''
    if isinstance(string, str):
        string = sub(r'\s*\[.*\]\**', '', string)
    return string

In [None]:
tbl['City'].iloc[4][0]

In [None]:
remove(tbl['City'].iloc[4][0])

In [None]:
remove(1706.8)
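
The pattern used in `remove` can be tried on a few sample strings. This is only a sketch: the strings are invented, and the non-greedy variant at the end is an alternative shown for comparison, not part of the lecture code.

```python
import re

# Same pattern as in `remove`: optional whitespace, a bracketed note, trailing asterisks.
pattern = r'\s*\[.*\]\**'

samples = ["Springfield[1]", "Riverton [a]**", "Area (sq mi)", "A[1] and B[2]"]
cleaned = [re.sub(pattern, '', s) for s in samples]
print(cleaned)

# `.*` is greedy: in the last sample it runs from the first '[' to the last ']'.
# A non-greedy `.*?` stops at the first ']' instead.
non_greedy = re.sub(r'\s*\[.*?\]\**', '', "A[1] and B[2]")
print(non_greedy)
```

For table cells that contain a single footnote marker the two patterns behave the same, so the greedy version is fine for this table.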

In [None]:
tbl.columns = [remove(i) for i in tbl.columns] # remove from table columns 

In [None]:
tbl = tbl.applymap(remove) # remove from all cells (pandas >= 2.1 prefers the name DataFrame.map)

In [None]:
tbl.head()

In [None]:
tbl.dtypes

### HTML

Web pages are written in _hypertext markup language_ (HTML). HTML files (`.htm` or `.html`) are plain text, just like JSON, Python scripts, and R scripts.

In HTML, we use _tags_ to create _elements_ of a web page. Elements add formatting and structure to the page.

* Tags usually come in pairs: an opening tag and a closing tag.
* Tags are written `<NAME>` for opening tags, `</NAME>` for closing tags, and `<NAME />` for singleton tags.
* Opening and singleton tags can have _attributes_ that contain additional information. Attributes are written `ATTRIBUTE=VALUE` after the tag name. 
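
As a rough illustration of these three tag forms, the sketch below uses Python's standard-library `html.parser` to log each tag it encounters. The class name `TagLogger` and the HTML snippet are made up for this example. Later in these notes we use __lxml__ instead.

```python
from html.parser import HTMLParser

class TagLogger(HTMLParser):
    """Records every tag the parser sees, together with its attributes."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("open", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("close", tag, {}))

    def handle_startendtag(self, tag, attrs):
        # Called for singleton tags written as <NAME />
        self.events.append(("singleton", tag, dict(attrs)))

logger = TagLogger()
logger.feed('<p>Visit <a href="https://pudding.cool">The Pudding</a>.<br /></p>')
for kind, tag, attrs in logger.events:
    print(kind, tag, attrs)
```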

See [here](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/HTML_basics) for a more detailed explanation, and [here](https://developer.mozilla.org/en-US/docs/Web/HTML/Element) for a list of valid HTML elements.

#### Example


From now on, we will use an artificial example:

```html
<p>This page is famous and this <b>word</b> is emphasized.</p>
```

```html
<p>This <a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ">page</a> is famous and this <strong>word</strong> is emphasized.</p>
```

```html
<li>1. Something</li>
```

<p>This page is famous and this <b>word</b> is emphasized.</p>
<p>This <a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ">page</a> is famous and this <strong>word</strong> is emphasized.</p>
<li>1. Something</li>

The `p` tag marks a paragraph, the `a` tag marks a link (an _anchor_), the `b` tag marks bold text, the `strong` tag marks strongly emphasized text,
and the `li` tag marks a list item.

Here's a string that contains HTML for a simple, complete website:

In [None]:
page = """
<html>
<head>
    <title>This is the Title!</title>
</head>

<body>
    <p>This is a paragraph!</p>
    <p id="best-paragraph">This is another paragraph! &#127790;</p>
    <p>Visit <a href="https://pudding.cool">The Pudding</a>.</p>
    <span>This is a span, it comes with an taco &#127790;</span>
</body>
<a href="https://pudding.cool">The Pudding</a>
sadfasdf
</html>
""" 

In [None]:
page

<html> 
<head>
    <title>This is the Title!</title>
</head>

<body>
    <p>This is a paragraph!</p>
    <p id="best-paragraph">This is another paragraph! &#127790;</p>
    <p>Visit <a href="https://pudding.cool">The Pudding</a>.</p>
    <span>This is a span, it comes with an taco &#127790;</span>
</body>

<body>
    <p>This is a new paragraph!</p>
</body>

</html>

The `<span>` tag is an inline container used to mark up a part of a text, or a part of a document.
    
For example, you can write the code
```html
<p>My hat is <span style="color:blue">blue</span>.</p>
```
    
<p>My hat is <span style="color:blue">blue</span>.</p>     

### XML

_Extensible markup language_ (XML) also uses tags to create elements. We say XML is _extensible_ because you can create your own XML elements (unlike HTML). People typically use XML to describe structure and meaning of data, rather than for formatting.

We'll use the same process to extract data from both HTML and XML.
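
A minimal sketch of what this looks like for XML, assuming `lxml` is installed. The element names (`cities`, `city`, `name`, `area`), the `region` attribute, and all values are invented for illustration, which is exactly the freedom that "extensible" refers to.

```python
from lxml import etree

# Made-up XML document with our own element names.
xml_text = """
<cities>
  <city region="north"><name>Alpha</name><area unit="sq mi">10.5</area></city>
  <city region="south"><name>Beta</name><area unit="sq mi">7.25</area></city>
</cities>
"""

root = etree.fromstring(xml_text)

# text() returns the text of the matched elements as strings.
names = root.xpath("//city/name/text()")
north_areas = [float(a) for a in root.xpath("//city[@region='north']/area/text()")]

print(names)
print(north_areas)
```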

### Parser

A _parser_ converts formatted data into familiar data structures. We've used __requests__' built-in JSON parser, but the package doesn't have a built-in HTML/XML parser. Fortunately, there are many other Python packages for parsing HTML/XML and web scraping.

HTML/XML Parsers:
* [lxml](https://lxml.de/)
* [html5lib](https://github.com/html5lib/html5lib-python)
* [beautifulsoup](https://www.crummy.com/software/BeautifulSoup/)
* [requests-html](https://docs.python-requests.org/projects/requests-html/en/latest/)

Scraper Frameworks (_convenient after learning the basics with parsers_):
* [scrapy](https://scrapy.org/)
* [newspaper3k](https://github.com/codelucas/newspaper)

Even more [here](https://github.com/lorien/awesome-web-scraping/blob/master/python.md#web-scraping-frameworks).

We'll use __lxml__ here (check the [doc](https://lxml.de/apidoc/index.html)), but you're welcome to use other packages on assignments and the project. 

In [None]:
import lxml.html as lx

html = lx.fromstring(page)
html

<html>
<head>
    <title>This is the Title!</title>
</head>

<body>
    <p>This is a paragraph!</p>
    <p id="best-paragraph">This is another paragraph! &#127790;</p>
    <p>Visit <a href="https://pudding.cool">The Pudding</a>.</p>
    <span>This is a span, it comes with an taco &#127790;</span>
</body>
</html>

#### Finding Elements

Elements are nested, so an HTML document is like a tree:
```
html
├── head
│   └── title
└── body
    ├── p
    ├── p
    ├── p
    │   └── a
    └── span
```
This is similar to the file system on your computer. The key difference is that elements at the same level can have the same tag name.

#### XPath

The _XML Path Language_ (XPath) lets us write paths to elements. XPath paths look a lot like file paths. XPath is not Python-specific!

The `.xpath()` method gets all elements at an XPath path:

In [None]:
html.xpath("/html/head/title")

In [None]:
html.xpath("/html/body/p/a")

Since there may be more than one element, the method always returns a list.

Absolute paths are not robust for scraping. An update to a web page that adds a single tag can break a scraper that uses absolute paths. In XPath, `//` means "anywhere below". We'll use `//` often because it's more robust:

In [None]:
html.xpath("//a")

What if we just want elements that satisfy a certain condition? In XPath, `[ ]` filters out elements that don't match a condition. For example:

In [None]:
html.xpath("//p[@id = 'best-paragraph']")
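
A few more conditions, as a sketch on a shortened copy of the page (`doc` is a stand-in name so the `html` variable above is left alone). These are standard XPath 1.0 features, but this is only a small selection.

```python
import lxml.html as lx

doc = lx.fromstring("""
<html><body>
    <p>This is a paragraph!</p>
    <p id="best-paragraph">This is another paragraph!</p>
    <p>Visit <a href="https://pudding.cool">The Pudding</a>.</p>
</body></html>
""")

hrefs = doc.xpath("//a/@href")               # select attribute values, not elements
second = doc.xpath("//p[2]")[0].get("id")    # positions are 1-based
with_link = doc.xpath("//p[a]")              # p elements that have an <a> child
visit = doc.xpath("//p[contains(text(), 'Visit')]")  # condition on the text

print(hrefs, second, len(with_link), len(visit))
```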

[XPath Diner](http://www.topswagcode.com/xpath/) is an interactive tutorial that teaches most of the XPath syntax. It takes about 20-60 minutes. Work through it to become an XPath ninja! 

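You can copy the XPath of an element from the browser's developer tools. The result is typically a long, position-based path, so it is just as fragile as an absolute path: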

In [None]:
'//*[@id="mw-content-text"]/div[1]/table[2]/tbody/tr[7]/td[3]'

#### CSS Selectors

_Cascading Style Sheets_ (CSS) is a language for styling elements in an HTML document. CSS provides another way to select elements, called _CSS selectors_.

CSS selectors are more concise but less flexible than XPath paths. The `.cssselect()` method gets all elements that match a CSS selector:

In [None]:
html.cssselect("a")

Check out the [CSS Diner](https://flukeout.github.io/)!

### Extracting Text and Attributes

There are two ways to get text from an element:

* `.text` gives text inside the element, but not its children
* `.text_content()` gives text inside the element and its children, with all tags removed
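
The difference is easiest to see on an element with mixed content. In this sketch the names `para` and `link` are stand-ins. The `.tail` attribute is an extra lxml detail not covered above: it holds the text that follows an element's closing tag.

```python
import lxml.html as lx

para = lx.fromstring('<p>Visit <a href="https://pudding.cool">The Pudding</a>.</p>')
link = para.xpath("./a")[0]

print(repr(para.text))            # only the text before the first child
print(repr(para.text_content()))  # all text, including the child's
print(repr(link.text))            # text inside the <a> element
print(repr(link.tail))            # text right after </a>, still inside <p>
```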

In [None]:
page

In [None]:
html.text_content()

In [None]:
a = html.xpath("//a")[0]

In [None]:
a.text_content()

In [None]:
a.text

In [None]:
html.text_content()

In [None]:
html.text

We can get values from attributes on an element with `.attrib`, which is a dictionary:

In [None]:
a.attrib["href"]

In [None]:
[x.attrib["href"] for x in html.xpath("//a")]
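
On real pages some elements lack the attribute you are looking for. As a sketch of the difference, `.attrib[...]` raises a `KeyError` for a missing attribute, while the element's `.get()` method returns `None` or a default you supply. The `title` attribute here is just an example of something missing.

```python
import lxml.html as lx

link = lx.fromstring('<a href="https://pudding.cool">The Pudding</a>')

href = link.get("href")
missing = link.get("title")               # attribute not present, gives None
fallback = link.get("title", "no title")  # or a default of our choosing

try:
    link.attrib["title"]
    raised = False
except KeyError:
    raised = True

print(href, missing, fallback, raised)
```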

### Writing Scrapers

Let's scrape the wiki table ourselves. Note: we are using `requests`, so pay attention to the HTML that is actually returned, which can differ from what the browser displays. In devtools, inspect the table's `<thead>` element and compare it with the response shown in the Network tab.

In [None]:
import requests

result = requests.get(url = 'https://en.wikipedia.org/wiki/List_of_United_States_cities_by_area')
html = lx.fromstring(result.text)

In [None]:
tables = html.xpath('//table')
table = tables[1]

In [None]:
table.text_content()

In [None]:
html.xpath('//table[2]/thead')

In [None]:
html.xpath('//table[2]/tbody')
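
One caveat worth checking on any page: `tables[1]` (the second table in the document) and `//table[2]` are not the same thing in general. In XPath, `//table[2]` selects every table that is the second `table` child of its parent, whereas `(//table)[2]` selects the second table in document order. The sketch below shows the difference on a made-up page with invented ids.

```python
import lxml.html as lx

doc = lx.fromstring("""
<html><body>
<div><table id="t1"><tr><td>a</td></tr></table></div>
<div>
    <table id="t2"><tr><td>b</td></tr></table>
    <table id="t3"><tr><td>c</td></tr></table>
</div>
</body></html>
""")

per_parent = [t.get("id") for t in doc.xpath("//table[2]")]     # second table within its parent
in_document = [t.get("id") for t in doc.xpath("(//table)[2]")]  # second table overall
by_index = doc.xpath("//table")[1].get("id")                    # Python indexing, 0-based

print(per_parent, in_document, by_index)
```

If the two disagree on a page you are scraping, the parenthesized form or Python indexing matches the "second table on the page" reading.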

In [None]:
def retrieve_rows(html): 
    rows = html.xpath('//table[2]/tbody/tr')
    cells = []
    for row in rows: 
        # ./td|th means we start at the node (not searching the whole doc again), and choose td OR th children
        cells.append([cell.text_content() for cell in row.xpath('./td|th')]) # no text, as some cells are in <b>
    return cells

In [None]:
retrieve_rows(html)

In [None]:
df = pd.DataFrame(retrieve_rows(html))
df.head()
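
The same row-by-row approach can be tried without a network connection. This is a sketch on a made-up table, not the Wikipedia page, and it spells the union as `./th|./td` to make both relative paths explicit.

```python
import lxml.html as lx
import pandas as pd

doc = lx.fromstring("""
<html><body><table>
  <tr><th>City</th><th>Area</th></tr>
  <tr><td><b>Alpha</b>[1]</td><td>10.5</td></tr>
  <tr><td>Beta</td><td>7.25</td></tr>
</table></body></html>
""")

rows = []
for tr in doc.xpath("//table//tr"):
    # text_content() also picks up text nested in <b>, which .text would miss
    rows.append([cell.text_content().strip() for cell in tr.xpath("./th|./td")])

# First row holds the header cells, the rest are data.
demo_df = pd.DataFrame(rows[1:], columns=rows[0])
print(demo_df)
```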

In [None]:
df.columns = df.iloc[0]
df = df.drop(index = range(2))
df.head()

In [None]:
df.dtypes

In [None]:
from re import sub 
def remove(string):
    '''
    Removes everything inside [], a whitespace before that, *'s, newlines, and commas.
    '''
    if isinstance(string, str):
        string = sub(r'\s*\[.*\]\**|\n|,', '', string)
    return string

In [None]:
df.columns = [remove(i) for i in df.columns] # remove from table columns
df = df.applymap(remove) # remove from all cells (pandas >= 2.1 prefers the name DataFrame.map)
df.head()

In [None]:
df.dtypes

In [None]:
for col in df.columns[3:]: #only those cols with vals
    df[col] = df[col].astype(float)
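
`astype(float)` raises a `ValueError` as soon as one cell cannot be parsed. As a hedged alternative, `pd.to_numeric` with `errors="coerce"` turns unparseable cells into `NaN` instead. The small frame below is made up to show the behavior.

```python
import pandas as pd

# Made-up data: one value with a thousands separator, one that is not a number.
demo = pd.DataFrame({"City": ["Alpha", "Beta"], "Area": ["1,234.5", "n/a"]})

# Strip the commas, then convert; cells that still fail become NaN.
no_commas = demo["Area"].str.replace(",", "", regex=False)
demo["Area"] = pd.to_numeric(no_commas, errors="coerce")

print(demo)
print(demo.dtypes)
```

Whether silent `NaN` values or a loud error is preferable depends on the data; for a first look at a scraped table, coercing and then counting the `NaN` values is often convenient.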

In [None]:
df.head()

In [None]:
df.dtypes

### Summary 

- HTML pages are structured as a tree of nested elements, similar to a file system
- use `lxml` to parse them in Python
- navigate through HTML via XPath or CSS selectors