# 🌐 Scraping, Part 1.2: HTML → Python

*Fetching and parsing HTML*

## How can we *get* HTML, so that we can parse it?

- You can always just save HTML from your __browser__: `File->Save Page As`

- You can also use the __command line__ to fetch HTML: `curl https://example.com > ~/Downloads/example-source.html`

- In this class, though, we'll focus on using __Python__ to fetch HTML

## ⚠️ One note of caution

- All of these methods will get you the page's *original* HTML
- ... but webpages are *dynamic*, and JavaScript running on the page can add/remove/alter elements on it.
- Sometimes the data you want will be the result of that JavaScript code, in which case: 🤷‍♂️
- Just kidding. Leon will be teaching you how to deal with those situations.
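
One rough way to tell whether you've hit this situation is to search the fetched HTML for a value you can see in your browser. The sketch below uses a made-up page source (`raw_html`, its `/api/prices` URL, and the helper `appears_in_source` are all hypothetical) whose price is filled in by JavaScript after the page loads, so the price never appears in the original HTML.

```python
# Hypothetical page source: the <div> starts empty and a script fills it in later.
raw_html = """
<html><body>
  <h1>Widget prices</h1>
  <div id="prices"></div>
  <script>
    fetch("/api/prices").then(r => r.text()).then(t => {
      document.getElementById("prices").textContent = t;
    });
  </script>
</body></html>
"""

def appears_in_source(html, needle):
    """Return True if `needle` occurs anywhere in the raw HTML text (case-insensitive)."""
    return needle.lower() in html.lower()

print(appears_in_source(raw_html, "Widget prices"))  # static text, so it is found
print(appears_in_source(raw_html, "$19.99"))         # added by JavaScript, so it is not found
```

If a value you can see in the browser is missing from the fetched text, that's a hint the page builds it with JavaScript.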

---

## Let's fetch some HTML with Python!

If you haven't already, install the `requests` library (in your virtual environment):

```sh
pip install requests
```

Or, from within Jupyter:

```sh
!pip install requests
```

In [1]:
import requests

In [2]:
example_http_response = requests.get("https://example.com")
example_http_response

<Response [200]>

In [3]:
example_html = example_http_response.text
print(example_html)

<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>    
</head>

<body>
<div>
    <h1>Example Domain</h1>
    <p>This domai

## Exercise: Try this with a few of your favorite websites

A quick one-liner:

```python
print(requests.get("https://myURL.com").text)
```

What do you see?
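
As you try other sites, some requests will fail (for example with a 404 or 500 status) or hang. Here is a minimal sketch of a more defensive fetch. The helper name `fetch_html` and the 10-second timeout are illustrative choices, not part of the class materials.

```python
import requests

def fetch_html(url, timeout=10):
    """Fetch a URL and return its HTML text, raising an error on failure."""
    # Without a timeout, requests can wait indefinitely on a slow server.
    response = requests.get(url, timeout=timeout)
    # Raises requests.exceptions.HTTPError for 4xx/5xx status codes.
    response.raise_for_status()
    return response.text
```

Calling `fetch_html("https://example.com")` should give you the same text as `example_html` above.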

## HTML → DOM

In order to work with HTML inside of Python, we'll need to convert that raw text into a Python representation of the tree of elements — the "DOM" (Document Object Model) — that the HTML represents.

In Python, there are two popular libraries for doing this: `BeautifulSoup` and `lxml`. We'll practice with both, so you can understand their similarities and differences.
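
To make the "tree of elements" idea concrete, here is a small sketch that uses only Python's built-in `html.parser` module (not one of the two libraries we'll practice with) to print each tag indented by its depth in the tree. The `SAMPLE_HTML` string and the `TreePrinter` class are made up for illustration.

```python
from html.parser import HTMLParser

# A tiny, made-up document with the same overall shape as example.com.
SAMPLE_HTML = (
    "<html><head><title>Hi</title></head>"
    "<body><div><p>One</p><p>Two</p></div></body></html>"
)

class TreePrinter(HTMLParser):
    """Record each opening tag, indented by how deeply it is nested."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        self.depth += 1  # everything until the closing tag is a descendant

    def handle_endtag(self, tag):
        self.depth -= 1

printer = TreePrinter()
printer.feed(SAMPLE_HTML)
print("\n".join(printer.lines))
```

The indentation shows that both `p` elements sit inside `div`, which sits inside `body`. That nesting is what the parsing libraries let us navigate.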

## Let's start with `BeautifulSoup`

If you haven't already installed it:

```sh
pip install beautifulsoup4
```

Import the library:

In [4]:
from bs4 import BeautifulSoup

Convert the HTML text to a `BeautifulSoup` object:

In [5]:
soup = BeautifulSoup(example_html, "html.parser")  # name the parser explicitly to avoid a warning
type(soup)

bs4.BeautifulSoup

Select the `<title>` tag:

In [6]:
soup.title

<title>Example Domain</title>

We can also do it like this:

In [7]:
soup.head.title

<title>Example Domain</title>

Get the text of that tag:

In [8]:
soup.head.title.text

'Example Domain'

## Q: How would you get the `<p>` tags?

In [9]:
soup.body.p

<p>This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.</p>

## Q: What's wrong with that?

## A: There are two `<p>` tags in the HTML, but this only returns the first

`.find_all(tagname)` to the rescue:

In [10]:
all_p_tags = soup.body.find_all("p")
all_p_tags

[<p>This domain is for use in illustrative examples in documents. You may use this
     domain in literature without prior coordination or asking for permission.</p>,
 <p><a href="https://www.iana.org/domains/example">More information...</a></p>]

In [11]:
for i, p in enumerate(all_p_tags):
    print(f"Para. {i+1}: {p.text}")
    print("---")

Para. 1: This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.
---
Para. 2: More information...
---


## Q: What do you notice?

That's right: `.text` gets all text content (and *only* text content) inside an element, even if it's nested inside one of that element's children.
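
A small sketch of that behavior on a made-up snippet (the `snippet` string and its URL are illustrative, not from example.com): the paragraph's `.text` includes the link's text, but none of the markup or attributes.

```python
from bs4 import BeautifulSoup

# A made-up paragraph with a link nested inside it.
snippet = '<p>Read the <a href="https://example.org/docs">docs</a> first.</p>'
p_tag = BeautifulSoup(snippet, "html.parser").p

print(p_tag.text)    # Read the docs first.
print(p_tag.a.text)  # docs
```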

## Extracting attributes

How do we get attributes, like the `<a>` tag's `href`? 

Like this:

```python
element["attribute_name"]
```

In [13]:
a_tag = soup.body.a  # Just getting the first <a> tag
a_tag["href"]

'https://www.iana.org/domains/example'
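
Square brackets raise a `KeyError` if the attribute is missing. A short sketch of the alternatives, using a made-up `<a>` tag (the `snippet` string and the variable name `link` are illustrative): `.get(...)` returns `None` instead of raising, and `.attrs` holds all of a tag's attributes as a dictionary.

```python
from bs4 import BeautifulSoup

# A made-up link with an href but no title attribute.
snippet = '<a href="https://example.org/docs">docs</a>'
link = BeautifulSoup(snippet, "html.parser").a

print(link["href"])       # https://example.org/docs
print(link.get("title"))  # None, because this tag has no title attribute
print(link.attrs)         # every attribute on the tag, as a dict

try:
    link["title"]
except KeyError:
    print("No title attribute on this tag")
```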

## Now let's do the same with `lxml`

```sh
pip install lxml
```

Import the library:

In [14]:
import lxml.html

Convert the HTML text to an `lxml.html.HtmlElement` object:

In [15]:
dom = lxml.html.fromstring(example_html)
type(dom)

lxml.html.HtmlElement

Select the `<title>` tag:

In [16]:
# This is equivalent to soup.head.title

dom.head.find("title")

<Element title at 0x1084675b0>

To get the text inside an element, call `.text_content()` (the equivalent of `.text` in `BeautifulSoup`):

In [17]:
dom.head.find("title").text_content()

'Example Domain'

## Q: How would you get the `<p>` tags?

In [18]:
# Hmmmmmmm!
dom.body.find("p")

## Q: Why do you think we might be getting empty output?

(Take a look at the `example.com` HTML.)

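## A: The `<p>` tags are not the immediate children of the `<body>`.

Given a bare tag name, `.find(...)` only looks at an element's direct children, and it returns `None` (which Jupyter doesn't display) when nothing matches. Here, the `<p>` tags are children of the `<div>` tag, which is itself a child of the `<body>` tag. Here's how we tell `.find(...)` where to look: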

In [19]:
dom.body.find("div/p")

<Element p at 0x108490a40>

Again, though, we're only getting one result. Let's get a list of all the `<p>` tags.

`.findall(tagname)` to the rescue:

In [20]:
all_p_tags = dom.body.findall("div/p")
all_p_tags

[<Element p at 0x108490a40>, <Element p at 0x1084910d0>]
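
Spelling out the full path works, but it breaks if the page's nesting changes. As a sketch of an alternative, the path syntax that `.find(...)` and `.findall(...)` accept also supports `.//tagname`, which searches descendants at any depth. The `html` string and the variable name `doc` below are made up for illustration.

```python
import lxml.html

# A made-up document with the same nesting as example.com: body > div > p.
html = (
    "<html><head><title>T</title></head>"
    "<body><div><p>One</p><p>Two</p></div></body></html>"
)
doc = lxml.html.fromstring(html)

print(doc.body.find("p"))              # None: no <p> directly under <body>
print(len(doc.body.findall("div/p")))  # 2: explicit path through the <div>
print(len(doc.body.findall(".//p")))   # 2: any depth below <body>
```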

Let's extract the text of each paragraph:

In [21]:
for i, p in enumerate(all_p_tags):
    print(f"Para. {i+1}: {p.text_content()}")
    print("---")

Para. 1: This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.
---
Para. 2: More information...
---


To get element __attributes__ in `lxml`, we use this syntax:

```python
element.attrib["attribute_name"]
```

In [22]:
a_tag = dom.body.find("div/p/a")
a_tag.attrib["href"]

'https://www.iana.org/domains/example'
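
To see the two libraries side by side, here is a short sketch that runs the same steps (find all paragraphs, get their text, read a link's `href`) with each one on a single made-up document. The `html` string, its URL, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup
import lxml.html

# A made-up document: two paragraphs, the second containing a link.
html = """<html><body><div>
<p>First paragraph.</p>
<p><a href="https://example.org/more">More</a></p>
</div></body></html>"""

# BeautifulSoup: find_all searches at any depth; attributes use tag[...]
bs_doc = BeautifulSoup(html, "html.parser")
bs_texts = [p.text for p in bs_doc.find_all("p")]
bs_href = bs_doc.a["href"]

# lxml: findall needs a path (".//p" means any depth); attributes use .attrib[...]
lx_doc = lxml.html.fromstring(html)
lx_texts = [p.text_content() for p in lx_doc.findall(".//p")]
lx_href = lx_doc.find(".//a").attrib["href"]

print(bs_texts, bs_href)
print(lx_texts, lx_href)
```

Both libraries give the same results here. The main differences are in how you say where to look and how you read attributes.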

---
